# Missing Data Handling
In this document, we explore different ways of handling missing data during the modelling phase.

All missing data in the summary dataset is systematic and expected. Specifically, players were excluded for only two reasons, both related to the number of games they had played:

1. The player had batted in fewer than 10 One-Day International matches.
2. The player had batted in fewer than 10 Domestic matches.

As a result, a player can be recorded in the summary dataset without having played all domestic formats. In these cases, the player's summary for the format(s) they had not participated in was intentionally left blank (NaN).
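A quick way to see this pattern is to count the missing values in each column. The sketch below uses a small made-up frame. Its column names and values are illustrative assumptions and are not taken from the summary dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the summary dataset (names and values are assumptions).
toy = pd.DataFrame({
    "First_Class_Average": [35.2, 41.0, np.nan],
    "List_A_Average": [30.1, np.nan, 28.4],
    "T20_Average": [22.5, 25.0, 27.3],
})

# Count the missing values in each column.
missing_counts = toy.isna().sum()

# List only the columns that contain at least one missing value.
columns_with_missing = toy.columns[toy.isna().any(axis=0)].tolist()

print(missing_counts)
print(columns_with_missing)
```

Running the same two expressions on `summary` would show which formats are affected and how many players are missing each one.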

While blank fields make intuitive sense where data has not been recorded, very few models can reasonably deal with missing data. Here, we will attempt to determine the best method for managing missing data in our model.

To begin, we will import necessary libraries, and load the summary dataset:

In [9]:
# Import necessary libraries and information.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from lib.constants import DATA_PATH

# Load batter summary data.
summary = pd.read_csv(DATA_PATH + "/Batter_Summary.txt", delimiter="\t")

# Replace batter hand with a numeric representation (Right Hand = 0, Left Hand = 1).
summary["Hand"] = np.where(summary["Hand"] == "Right", 0, 1)
summary = summary.drop(columns=["Name", "Batter_ID"])


## 1. Method of Determining Success.
Before we can determine the "best" method for managing missing data, we must devise a metric for measuring how successful a method is. 

Considering the scope and goal of this project, we will quantify the success of a missing data management method by how well a Random Forest Regression model can predict the One-Day International batting average of the 12 most recent players, when trained on all prior players.

Functions are defined below to help with this objective.

In [11]:
# Function to create a test and training split.
def test_train_split(X, y):
  X_train = X[:-12]
  X_test = X[-12:]
  y_train = y[:-12]
  y_test = y[-12:]

  # Return the training and test split data.
  return (X_train, X_test, y_train, y_test)


# Function for testing the accuracy of a missing data method.
def test_model_accuracy(X_train, X_test, y_train, y_test):
  num_tests = 500
  accuracy = 0

  for _ in range(num_tests):
    # Train a Random Forest Regressor.
    rfr = RandomForestRegressor()
    rfr.fit(X_train, y_train)

    # Accumulate the model's score on the test set. For a regressor,
    # `score` returns the coefficient of determination (R^2).
    accuracy += rfr.score(X_test, y_test)

  # Return the average score across all runs.
  return accuracy/num_tests
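The value averaged above is what a scikit-learn regressor's `score` method returns, which is the coefficient of determination (R^2) rather than a classification-style accuracy. As a minimal sketch of what that number means, the hypothetical helper `r_squared` below computes it by hand on made-up values:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Squared error of the predictions.
    ss_res = np.sum((y_true - y_pred) ** 2)

    # Squared error of always predicting the mean of y_true.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)

    return 1.0 - ss_res / ss_tot


# Made-up batting averages, for illustration only.
y_true = [30.0, 40.0, 50.0]

perfect = r_squared(y_true, y_true)               # predictions match exactly
baseline = r_squared(y_true, [40.0, 40.0, 40.0])  # always predict the mean

print(perfect, baseline)
```

A score of 1 means perfect predictions and a score of 0 means the model does no better than always predicting the mean of the test targets. The score can also be negative when the model does worse than that baseline.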


## 2. Handle Missing Data.
Here, we test the performance of different methods for handling missing data. Specifically, these methods are:

1. Removing columns containing missing data.
2. Removing rows containing missing data.
3. Filling missing data with the column's mean.
4. Filling missing data with the column's median.

Ideally, a fifth method would be used whereby missing data is approximated according to the player's other attributes. However, due to time restrictions with this project, this method will have to be ignored for now.
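For reference, one possible form of such an attribute-based approach is nearest-neighbour imputation, which scikit-learn provides as `KNNImputer`. This is only a sketch of one option and is not a method used or evaluated in this project. The frame, its column names, and the choice of `n_neighbors=1` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: four players, one of whom has no record in format B.
toy = pd.DataFrame({
    "Format_A_Average": [30.0, 32.0, 45.0, 47.0],
    "Format_B_Average": [28.0, np.nan, 44.0, 46.0],
})

# Fill each missing value from the most similar player, where similarity
# is measured on the attributes that both players have recorded.
imputer = KNNImputer(n_neighbors=1)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

In this toy case the second player's missing value is taken from the first player, who is closest on `Format_A_Average`. With more neighbours, the fill value would be an average over several similar players.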

### 2.1 Remove Columns Containing Missing Data.
This method will remove all columns from the summary data that have any missing values.

In [12]:
# Remove NaN columns.
data = summary.dropna(axis=1)

# Split the data into features and labels.
X = data.drop(columns=["International_One_Day_Batting_Average"])
y = data["International_One_Day_Batting_Average"]

# Split the data into a test and training set.
X_train, X_test, y_train, y_test = test_train_split(X, y)

# Train a Random Forest Regressor model.
accuracy = test_model_accuracy(X_train, X_test, y_train, y_test)

# Check model accuracy.
print("Accuracy: ", accuracy)


Accuracy:  0.6930346236784929


### 2.2 Remove Rows Containing Missing Data.
This method will remove all rows from the summary data that have any missing values.

In [13]:
# Remove NaN rows.
data = summary.dropna(axis=0)

# Split the data into features and labels.
X = data.drop(columns=["International_One_Day_Batting_Average"])
y = data["International_One_Day_Batting_Average"]

# Split the data into a test and training set.
X_train, X_test, y_train, y_test = test_train_split(X, y)

# Train a Random Forest Regressor model.
accuracy = test_model_accuracy(X_train, X_test, y_train, y_test)

# Check model accuracy.
print("Accuracy: ", accuracy)


Accuracy:  0.6511740013171261
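The two removal strategies discard data along different axes. The toy frame below, whose values and column names are made up for illustration, shows how much each one keeps. It also shows a side effect worth keeping in mind: once rows are dropped, the last rows of the remaining frame are not necessarily the same rows as before, so the 12 test players in this notebook may differ between methods.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "a" is complete, "b" and "c" each have one gap.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Dropping columns keeps every row but loses the features with gaps.
without_columns = toy.dropna(axis=1)

# Dropping rows keeps every feature but loses the rows with gaps.
without_rows = toy.dropna(axis=0)

print(without_columns.shape, without_rows.shape)

# The final row of each result is a different original row.
print(without_columns.index[-1], without_rows.index[-1])
```

Here the column-wise result still ends at original row 3, while the row-wise result ends at original row 2, because row 3 was removed.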


### 2.3 Fill Missing Data with Column Mean.
This method fills any missing data with the mean of the column.

In [14]:
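# Fill missing data with the column's mean.
data = summary.copy()
for column in data.columns[data.isnull().any(axis=0)]:
    # Assign the result back rather than using inplace=True on a column
    # selection, which is not reliable in recent versions of pandas.
    data[column] = data[column].fillna(data[column].mean())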

# Split the data into features and labels.
X = data.drop(columns=["International_One_Day_Batting_Average"])
y = data["International_One_Day_Batting_Average"]

# Split the data into a test and training set.
X_train, X_test, y_train, y_test = test_train_split(X, y)

# Train a Random Forest Regressor model.
accuracy = test_model_accuracy(X_train, X_test, y_train, y_test)

# Check model accuracy.
print("Accuracy: ", accuracy)


Accuracy:  0.6763713610294009


### 2.4 Fill Missing Data with Column Median.
This method fills any missing data with the median of the column.

In [15]:
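# Fill missing data with the column's median.
data = summary.copy()
for column in data.columns[data.isnull().any(axis=0)]:
    # Assign the result back rather than using inplace=True on a column
    # selection, which is not reliable in recent versions of pandas.
    data[column] = data[column].fillna(data[column].median())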

# Split the data into features and labels.
X = data.drop(columns=["International_One_Day_Batting_Average"])
y = data["International_One_Day_Batting_Average"]

# Split the data into a test and training set.
X_train, X_test, y_train, y_test = test_train_split(X, y)

# Train a Random Forest Regressor model.
accuracy = test_model_accuracy(X_train, X_test, y_train, y_test)

# Check model accuracy.
print("Accuracy: ", accuracy)


Accuracy:  0.6832819701424099
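The fill cells in sections 2.3 and 2.4 compute each fill value over every row, including the 12 test players. Whether this matters for the scores here has not been checked. As a sketch of an alternative, scikit-learn's `SimpleImputer` can learn the fill values from the training rows only and then apply them to the test rows. The frame and its column names below are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: the last row plays the role of the held-out test set.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [10.0, 20.0, np.nan, 40.0],
})
train, test = toy.iloc[:-1], toy.iloc[-1:]

# Learn the column medians from the training rows only.
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=toy.columns, index=train.index
)

# Reuse the training medians to fill the test rows.
test_filled = pd.DataFrame(
    imputer.transform(test), columns=toy.columns, index=test.index
)

print(train_filled)
print(test_filled)
```

Passing `strategy="mean"` instead gives the mean-fill variant with the same structure.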


## Conclusion
Having tested each missing data management method, we can rank them by average score, from highest to lowest:

1. Remove columns containing missing data.
2. Fill missing data with the column's median.
3. Fill missing data with the column's mean.
4. Remove rows containing missing data.

From this ranking, we can choose the method to use for the remainder of the project. We chose to fill missing data with the column median. This may seem counterintuitive at first, since removing the columns with missing data scored higher. However, we believe that the success of removing those columns is largely an effect of feature selection: it removes features that are detrimental to the performance of the model. As feature selection is explored later in this project, we will assess this belief then.