
**Three ways to deal with missing values:**


1.   Drop every column that contains a missing value
2.   Imputation
3.   Imputation, plus a new column marking which values were imputed





We first import the data, which can be downloaded from here:
https://www.kaggle.com/dansbecker/melbourne-housing-snapshot/home

Then we define a function that measures the performance of each method (in this case the mean absolute error, MAE, where lower is better) and use it to compare them.







In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('melb_data.csv')

#select the target variable
y = data['Price']
predictors = data.drop(['Price'], axis=1)

#select the predictors (numeric columns only)
X = predictors.select_dtypes(exclude=['object'])

#divide data into training and validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

In [7]:
#define a function to measure mean absolute error(MAE)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(X_train,X_valid,y_train,y_valid):
  model = RandomForestRegressor(n_estimators= 10, random_state= 0)
  model.fit(X_train,y_train)
  predicts = model.predict(X_valid)
  return mean_absolute_error(y_valid,predicts)
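
As a quick check on what the metric means, MAE is the average of the absolute differences between the true and predicted values. The sketch below computes it by hand and compares it with `mean_absolute_error`; the prices are made-up numbers for illustration and are not taken from `melb_data.csv`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, for illustration only (not from melb_data.csv)
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

# MAE is the mean of the absolute differences between truth and prediction
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

Because the errors are not squared, MAE stays in the same units as the target (here, price), which makes the numbers reported for each approach directly comparable.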

**Approach 1:**
Drop columns with missing values

In [14]:
#we first need to identify columns with missing values

missing_values_cols = [col for col in X.columns if X[col].isnull().any()]
print(f"Columns with missing values are: {missing_values_cols}") 


shrunk_X_train = X_train.drop(missing_values_cols, axis=1)
shrunk_X_valid = X_valid.drop(missing_values_cols, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(get_mae(shrunk_X_train,shrunk_X_valid,y_train,y_valid))

Columns with missing values are: ['Car', 'BuildingArea', 'YearBuilt']
MAE from Approach 1 (Drop columns with missing values):
183550.22137772635
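
Before dropping a column, it can help to see how much of it is actually missing. The following is a minimal sketch on a toy frame; the values are invented, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names borrowed from the notebook
toy = pd.DataFrame({
    'Rooms': [2, 3, 4, 3],
    'Car': [1.0, np.nan, 2.0, 1.0],
    'BuildingArea': [np.nan, np.nan, 150.0, 120.0],
})

# Number and share of missing entries per column
missing_count = toy.isnull().sum()
missing_share = toy.isnull().mean()

# Show only the columns that have at least one missing entry
print(missing_count[missing_count > 0])
print(missing_share[missing_share > 0])
```

A column with only a few missing entries still carries a lot of information, which is one reason dropping it outright may cost accuracy.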


**Approach 2:**
Imputation: several techniques can be used to fill in missing values. Here, we replace each missing value with the mean of its column.
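
To illustrate how the choice of technique changes the filled-in value, here is a small sketch comparing the `mean` and `median` strategies of `SimpleImputer` on a toy column. The numbers are made up for illustration; the notebook itself uses the default strategy, which is the mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with one missing entry and one large value (made-up numbers)
toy = pd.DataFrame({'BuildingArea': [100.0, 120.0, np.nan, 500.0]})

# Fill the gap with the column mean and with the column median
mean_filled = SimpleImputer(strategy='mean').fit_transform(toy)
median_filled = SimpleImputer(strategy='median').fit_transform(toy)

# The missing entry sits in row 2; compare what each strategy put there
print(mean_filled[2, 0], median_filled[2, 0])
```

The mean is pulled toward the large value, while the median is not, so the two strategies can differ noticeably on skewed columns.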

In [17]:
#choose imputer

from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()

#imputation for X_train and X_valid

imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

#notice that we call .fit_transform on X_train so that the imputer learns the column means from the training data, but only .transform on
#X_valid, so that no information from the validation set leaks into the imputation.

#imputation removes the columns' names; hence, we need to rename them again

imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation using the mean value):")
print(get_mae(imputed_X_train,imputed_X_valid,y_train,y_valid))

MAE from Approach 2 (Imputation using the mean value):
178166.46269899711
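
The difference between fitting and transforming can be seen on a toy example. In this sketch (made-up values, hypothetical variable names `train` and `valid`), the imputer learns its fill value from the training frame only and reuses it for the validation frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data: the validation column has a very different scale
train = pd.DataFrame({'Car': [1.0, 2.0, 3.0, np.nan]})
valid = pd.DataFrame({'Car': [10.0, np.nan]})

imputer = SimpleImputer()
imputer.fit(train)            # learns the column mean from train only
print(imputer.statistics_)    # fill value learned for each column

# transform reuses the training mean; it does not look at valid's own values
valid_filled = imputer.transform(valid)
print(valid_filled)
```

The missing validation entry is filled with the training mean rather than with anything computed from the validation data.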


**Approach 3:**
An extension to imputation: for each column with missing values, we add another column that marks the location of the imputed values.

In [26]:
#we get a copy of the data first, so that we can build our modified dataset for this approach

X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

#making a new column that shows the location of imputed values

for col in missing_values_cols:
  X_train_plus[col + 'was_missing?'] = X_train_plus[col].isnull()
  X_valid_plus[col + 'was_missing?'] = X_valid_plus[col].isnull()
#ignore the resulting error for now

#Imputation
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

#Imputation removes the column names, so we need to restore them
imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(get_mae(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))




MAE from Approach 3 (An Extension to Imputation):
178927.503183954
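
As an aside, `SimpleImputer` can also append indicator columns itself through its `add_indicator` parameter. This is not what the cell above does; it is an alternative sketch on a toy frame with made-up values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy training frame (made-up values); only 'Car' has a missing entry
train = pd.DataFrame({
    'Rooms': [2.0, 3.0, 4.0],
    'Car': [1.0, np.nan, 3.0],
})

# add_indicator=True appends one indicator column for each feature
# that had missing values when the imputer was fitted
imputer = SimpleImputer(add_indicator=True)
filled = imputer.fit_transform(train)

# Columns: Rooms, Car (mean-imputed), indicator for Car
print(filled)
```

The result is a NumPy array with one extra column per feature that had missing values at fit time, so column names still need to be restored by hand if a DataFrame is wanted.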


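### As you can see, imputation (MAE ≈ 178,166) outperforms dropping the columns entirely (MAE ≈ 183,550); adding the indicator columns (MAE ≈ 178,928) did not improve on plain imputation here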
---
