<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Boston-house-prices-data" data-toc-modified-id="Boston-house-prices-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Boston house prices data</a></span></li><li><span><a href="#Splitting-the-data-into-train-and-test-set" data-toc-modified-id="Splitting-the-data-into-train-and-test-set-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Splitting the data into train and test set</a></span></li><li><span><a href="#Examine-AGE-column" data-toc-modified-id="Examine-AGE-column-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Examine <code>AGE</code> column</a></span></li><li><span><a href="#Examine-the-model-performance-with-different-data" data-toc-modified-id="Examine-the-model-performance-with-different-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Examine the model performance with different data</a></span><ul class="toc-item"><li><span><a href="#Performance-of-Completed-Data" data-toc-modified-id="Performance-of-Completed-Data-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Performance of Completed Data</a></span></li><li><span><a href="#Performance-of-Dropped-Data" data-toc-modified-id="Performance-of-Dropped-Data-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Performance of Dropped Data</a></span></li><li><span><a href="#Performance-of-Imputed-Data" data-toc-modified-id="Performance-of-Imputed-Data-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Performance of Imputed Data</a></span></li></ul></li></ul></div>

# Missing Value Imputation with Linear Regression

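This notebook compares three ways of handling missing values in the `AGE` column of the Boston house prices data: using the complete data, dropping the rows with missing values, and imputing the missing values with a linear regression model.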

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2; this import needs an older version
%matplotlib inline

## Boston house prices data

In [None]:
boston = load_boston() # Dataset from the sklearn library

In [None]:
print(boston.DESCR)

In [None]:
boston_df = pd.read_pickle('data/boston_df.p')

In [None]:
boston_df.head()

In [None]:
boston_df.shape

## Splitting the data into train and test set

In [None]:
from sklearn.model_selection import train_test_split

# Features are all columns except the last; the last column is the target
X_train, X_test, y_train, y_test = train_test_split(boston_df.iloc[:, :-1], boston_df.iloc[:, -1:], random_state=1)

In [None]:
# pd.np is no longer available in recent pandas, and map() is lazy in Python 3
[part.shape for part in (X_train, X_test, y_train, y_test)]

In [None]:
y_train

These training and test sets come from the full dataset. We have also prepared a version of the data with missing values.
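The notebook does not say how the missing values were introduced. As a purely illustrative sketch, assuming values are hidden completely at random in a single column, a frame with gaps could be produced as follows. The toy frame, the number of hidden rows, and the seed are all assumptions, not the actual preparation of `boston_dropna_df.p`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'AGE': rng.uniform(0, 100, size=50),
    'RM': rng.normal(6, 1, size=50),
    'y': rng.normal(22, 5, size=50),
})

# Assumption: hide 10 of the 50 AGE values, chosen completely at random.
hidden_rows = rng.choice(toy_df.index.to_numpy(), size=10, replace=False)
toy_missing_df = toy_df.copy()
toy_missing_df.loc[hidden_rows, 'AGE'] = np.nan

print(toy_missing_df['AGE'].isnull().sum())
```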

## Examine `AGE` column

In [None]:
# read the data with missing values; a second copy for imputation is made in the next cell
boston_dropna_df = pd.read_pickle('data/boston_dropna_df.p')

In [None]:
boston_impute_df = boston_dropna_df.copy()

In [None]:
boston_dropna_df['AGE'].isnull().sum()

In [None]:
boston_impute_df['AGE'].isnull().sum()

`boston_dropna_df` will show what happens if we simply drop the rows with missing values, while `boston_impute_df` keeps those rows so they can be imputed later.

In [None]:
boston_dropna_df.dropna(subset=['AGE'],axis=0,inplace=True)

In [None]:
boston_dropna_df['AGE'].isnull().sum()

In [None]:
boston_impute_df['AGE'].isnull().sum()

In [None]:
boston_dropna_df.shape

In [None]:
boston_dropna_df.head()
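The cells above depend on `.copy()` producing an independent frame, so that the in-place `dropna` on one frame leaves the other untouched. A minimal sketch of that behaviour on a hypothetical four-row frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing AGE values.
toy_df = pd.DataFrame({
    'AGE': [65.2, np.nan, 45.8, np.nan],
    'y': [24.0, 21.6, 34.7, 33.4],
})

toy_copy = toy_df.copy()                              # independent copy, keeps its NaNs
toy_df.dropna(subset=['AGE'], axis=0, inplace=True)   # modifies toy_df only

print(toy_df.shape)                     # rows with a known AGE
print(toy_copy['AGE'].isnull().sum())   # the copy still has its missing values
```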

## Examine the model performance with different data

In [None]:
lm_fitting_df = boston_dropna_df.drop('y',axis=1)
lm_fitting_df

Our target is now the `AGE` column. A linear regression cannot be trained on rows where the target is missing, so we fit on `boston_dropna_df`, which contains only the rows with a known `AGE`.

In [None]:
lm_for_impute = LinearRegression() #instantiate

In [None]:
lm_for_impute.fit(lm_fitting_df[[x for x in lm_fitting_df.columns if x != 'AGE']],lm_fitting_df['AGE']) #fit

In [None]:
boston_impute_df[boston_impute_df['AGE'].isnull()].head()

In [None]:
lm_for_impute.predict(boston_impute_df.drop(['AGE','y'],axis=1)) 
#this uses the other features to predict 'AGE' with the model

In [None]:
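missing_age = boston_impute_df['AGE'].isnull()  # rows that need an imputed AGE; .loc below avoids chained assignment
boston_impute_df.loc[missing_age, 'AGE'] = lm_for_impute.predict(boston_impute_df.loc[missing_age].drop(['AGE','y'],axis=1))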

In [None]:
boxplot = pd.DataFrame({'imputed': boston_impute_df['AGE'],'full': boston_df['AGE'],'dropped': boston_dropna_df['AGE']})
boxplot.plot(kind='box')

The box plot compares the distribution of `AGE` in the imputed, full, and dropped datasets. The imputed values are the predictions of the linear model trained on the rows that have a value for `AGE`.
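The imputation steps above can be condensed into one self-contained sketch. It runs on synthetic data rather than the Boston frames: the column names `x1` and `x2`, the coefficients, and the number of hidden values are assumptions chosen only to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: 'AGE' is a noisy linear function of two other features.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
true_age = 50 + 10 * x1 - 5 * x2 + rng.normal(scale=2, size=n)
toy_df = pd.DataFrame({'x1': x1, 'x2': x2, 'AGE': true_age})

# Assumption: the first 40 AGE values are missing.
toy_df.loc[toy_df.index[:40], 'AGE'] = np.nan

missing = toy_df['AGE'].isnull()
features = ['x1', 'x2']

# Fit only on rows where AGE is known.
model = LinearRegression()
model.fit(toy_df.loc[~missing, features], toy_df.loc[~missing, 'AGE'])

# Fill the gaps with the model's predictions for the rows that lack AGE.
toy_df.loc[missing, 'AGE'] = model.predict(toy_df.loc[missing, features])

print(toy_df['AGE'].isnull().sum())
```

Predicting only for the rows that lack a value keeps the known values untouched and makes the assignment lengths match by construction.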

Next, we predict the price from each dataset, starting with the full data, using a linear model of the form

\begin{equation}
y_i = \beta_0 + \beta_1 X_i + \epsilon_i
\end{equation}
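
The equation above shows a single predictor. Since the models below are fitted on all feature columns, the same model with $p$ predictors can be written as follows, where $p$ and the index $j$ are notation introduced here for illustration:

\begin{equation}
y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \epsilon_i
\end{equation}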


### Performance of Completed Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_df.iloc[:,:-1],
                                                    boston_df.iloc[:,-1:],
                                                    random_state=111)

In [None]:
# pd.np is no longer available in recent pandas, and map() is lazy in Python 3
[part.shape for part in (X_train, X_test, y_train, y_test)]

In [None]:
lm_full = LinearRegression()
lm_full.fit(X_train,y_train)

In [None]:
print('r-squared for completed model = ', lm_full.score(X_test, y_test))
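
For reference, `LinearRegression.score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, computed on the data passed to it. The sketch below checks this by hand on synthetic data; the feature matrix, coefficients, and split are assumptions made only for the demonstration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (illustrative values only).
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Simple positional split: first 75 rows to train, last 25 to test.
model = LinearRegression().fit(X[:75], y[:75])
X_test, y_test = X[75:], y[75:]

# R^2 by hand: one minus residual sum of squares over total sum of squares.
residual_ss = np.sum((y_test - model.predict(X_test)) ** 2)
total_ss = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - residual_ss / total_ss

print(manual_r2, model.score(X_test, y_test))
```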

### Performance of Dropped Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_dropna_df.iloc[:,:-1],
                                                    boston_dropna_df.iloc[:,-1:],
                                                    random_state=111)

In [None]:
lm_dropna = LinearRegression()
lm_dropna.fit(X_train,y_train)

print('r-squared for dropped-data model = ', lm_dropna.score(X_test, y_test))

### Performance of Imputed Data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(boston_impute_df.iloc[:,:-1],
                                                    boston_impute_df.iloc[:,-1:],
                                                    random_state=111)

In [None]:
lm_impute = LinearRegression()
lm_impute.fit(X_train,y_train)

print('r-squared for imputed-data model = ', lm_impute.score(X_test, y_test))
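
The three evaluations above repeat the same split, fit, and score steps. One possible way to factor them out is sketched below. `holdout_r2` is a hypothetical helper, not part of the original notebook, and it assumes the target is the last column of the frame, as in the cells above. The demo frame is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def holdout_r2(df, random_state=111):
    """Hypothetical helper: split df, fit a linear model, return test R^2.

    Assumes the last column of df is the target and the rest are features.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        df.iloc[:, :-1], df.iloc[:, -1], random_state=random_state)
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)

# Synthetic demo frame (illustrative values only).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(120, 3)), columns=['a', 'b', 'c'])
demo_df['y'] = 2 * demo_df['a'] - demo_df['b'] + rng.normal(scale=0.1, size=120)

demo_score = holdout_r2(demo_df)
print(demo_score)
```

With the notebook's frames this could be called as `holdout_r2(boston_df)`, `holdout_r2(boston_dropna_df)`, and `holdout_r2(boston_impute_df)`. Note that each frame is split separately, so the three scores are measured on different test rows.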