# Data Imputation

Date: 13/05/2018

Version: 3.0

Environment: Python 3.6.0 and Anaconda 4.3.0 (64-bit)

Libraries used:
* pandas
* LinearRegression from sklearn.linear_model
* train_test_split from sklearn.model_selection


## 1. Introduction
This project applies different imputation methods to different variables in order to fill in the missing data appropriately.

Tasks:
1. Importing libraries
2. Reading data in
3. Exploring and checking the imported data
4. Imputation
5. Writing data back
6. Summary

More details for each task will be given in the following sections.

## 2.  Libraries

In [1]:
#importing libraries; more libraries will be added later if needed
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split



## 3. Importing the data in

In [2]:
df = pd.read_csv('dataset3_with_missing.csv')

ATTRIBUTE|	DESCRIPTION|
---|:---
Id	|Sale Id
date|	Date of the property sold, e.g., 20140502T000000
price|	Property sold price
bedrooms|	Number of bedrooms
bathrooms|	Number of bathrooms; the value is either a whole number or a fraction ending in .25, .5, or .75. For example, 0.5 accounts for a room with a toilet but no shower
sqft_living|	Square footage of the property's interior living space
sqft_lot|	Square footage of the land space
floors|	Number of floors
waterfront|	Whether the property was overlooking the waterfront or not
view|	An index from 0 to 4 of how good the view of the property was
condition|	An index from 1 to 5 on the condition of the property.
sqft_above|	The square footage of the interior living space that is above ground level
sqft_basement|	The square footage of the interior living space that is below ground level
yr_built|	The year the property was initially built
yr_renovated|	The year of the property's last renovation
zipcode|	The zip code area where the property is, which contains state and zip code, separated by a space. 
lat|	Latitude of the property
long|	Longitude of the property

In [3]:
df.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long
0,1999700045,20140502T000000,313000,3,1.5,1340.0,7912,1.5,0,0,3,7,1340.0,0.0,1955,0,98133,47.7658,-122.339
1,1860600135,20140502T000000,2384000,5,2.5,3650.0,9050,2.0,0,4,5,10,3370.0,280.0,1921,0,98119,47.6345,-122.367
2,5467900070,20140502T000000,342000,3,2.0,1930.0,11947,1.0,0,0,4,8,1930.0,0.0,1966,0,98042,47.3672,-122.151
3,4040800810,20140502T000000,420000,3,2.25,2000.0,8030,1.0,0,0,4,8,1000.0,1000.0,1963,0,98008,47.6188,-122.114
4,7197300105,20140502T000000,550000,4,2.5,1940.0,10500,1.0,0,0,4,7,1140.0,800.0,1976,0,98052,47.683,-122.114


## 4. Data Checking

Before imputing anything, the first step is to understand what the dataset looks like: the data type of each variable and the total number of missing values.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9967 entries, 0 to 9966
Data columns (total 19 columns):
id               9967 non-null int64
date             9967 non-null object
price            9967 non-null int64
bedrooms         9967 non-null int64
bathrooms        9567 non-null float64
sqft_living      9901 non-null float64
sqft_lot         9967 non-null int64
floors           9967 non-null float64
waterfront       9967 non-null int64
view             9967 non-null int64
condition        9967 non-null int64
grade            9967 non-null int64
sqft_above       9900 non-null float64
sqft_basement    9900 non-null float64
yr_built         9967 non-null int64
yr_renovated     9967 non-null int64
zipcode          9967 non-null int64
lat              9967 non-null float64
long             9967 non-null float64
dtypes: float64(7), int64(11), object(1)
memory usage: 1.4+ MB


In [5]:
# total null values in each column
df.isnull().sum()

id                 0
date               0
price              0
bedrooms           0
bathrooms        400
sqft_living       66
sqft_lot           0
floors             0
waterfront         0
view               0
condition          0
grade              0
sqft_above        67
sqft_basement     67
yr_built           0
yr_renovated       0
zipcode            0
lat                0
long               0
dtype: int64

We can see that bathrooms has quite a few missing values, and three other variables have some as well. The counts of missing values are:

* Missing values
    * bathrooms - 400
    * sqft_living - 66
    * sqft_above - 67
    * sqft_basement - 67
    
There are multiple ways to impute, such as using the mean, mode, or median, as well as somewhat more involved methods such as linear regression imputation. Since this is housing data, the variables are likely to be related to one another, so imputing with linear regression may give better values than the mean, mode, or median alone.
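
For comparison, the simple baselines mentioned above can be sketched in a few lines of pandas. This is only an illustration on a hypothetical toy column (the values are made up, not taken from the dataset); each method fills every gap with the same single constant.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (not real data).
toy = pd.DataFrame({'bathrooms': [1.0, 2.0, np.nan, 2.0, 5.0]})

# Each baseline replaces the gap with one summary statistic of the observed values.
mean_filled = toy.bathrooms.fillna(toy.bathrooms.mean())
median_filled = toy.bathrooms.fillna(toy.bathrooms.median())
mode_filled = toy.bathrooms.fillna(toy.bathrooms.mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[2])
```

Because the filled value ignores everything else known about the row, a large house and a small house would receive the same number of bathrooms, which is the motivation for trying a regression instead.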

## 5. Imputation

Let's work through the list above from the bottom, starting with sqft_basement.

Also, for imputation all variables will be used except for the following:
 
* id: because it is only an identifier of the sale and does not describe any attribute of the house.
* date: because it is stored as a string (e.g., 20140502T000000) rather than a number, so it is not used when building the model.

### Starting imputation with creating linear model for Basement

For imputation with linear regression, a sensible approach is to split the data into training and testing sets. This lets us check the score, or R-squared value, of the fitted model on data it has not seen.

For training the model a clean dataset will be used, meaning, a dataset with no missing values.

In [6]:
#creating prediction and clean datasets
df_clean = df.dropna(how='any')

Using the clean dataset above, the training and testing datasets are created.

In [7]:
#selecting columns for extracting the training and testing datasets
X = df_clean.drop(['id','date','sqft_basement'], axis=1) #independent
y = df_clean.sqft_basement #target

#get training and testing data
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=111)

The linear regression model is created next, after which it is trained and then tested in terms of its R-squared.

In [8]:
#initiating linear regression
lm = LinearRegression()
lm

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [9]:
lm.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [10]:
#checking the model
print('Estimated intercept coefficient:', lm.intercept_)

Estimated intercept coefficient: -9.981704351957887e-11


In [11]:
#checking the r squared value
lm.score(X_test, y_test)

1.0

The R-squared for the model is a perfect 1. Although the score looks great, it suggests that sqft_basement can be fully explained by other variables in the dataset. To investigate further, the coefficients of the model are also inspected.

In [12]:
#checking the coefficients
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns=['features', 'estimatedCoefficients'])

Unnamed: 0,features,estimatedCoefficients
0,price,6.128828e-19
1,bedrooms,-1.599408e-12
2,bathrooms,2.321476e-13
3,sqft_living,1.0
4,sqft_lot,-2.220446e-16
5,floors,-2.152722e-13
6,waterfront,-4.74714e-13
7,view,2.296123e-14
8,condition,1.984307e-15
9,grade,-1.256605e-13


To understand the data better, the dataset is viewed again.

In [13]:
df_clean.head(10)

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long
0,1999700045,20140502T000000,313000,3,1.5,1340.0,7912,1.5,0,0,3,7,1340.0,0.0,1955,0,98133,47.7658,-122.339
1,1860600135,20140502T000000,2384000,5,2.5,3650.0,9050,2.0,0,4,5,10,3370.0,280.0,1921,0,98119,47.6345,-122.367
2,5467900070,20140502T000000,342000,3,2.0,1930.0,11947,1.0,0,0,4,8,1930.0,0.0,1966,0,98042,47.3672,-122.151
3,4040800810,20140502T000000,420000,3,2.25,2000.0,8030,1.0,0,0,4,8,1000.0,1000.0,1963,0,98008,47.6188,-122.114
4,7197300105,20140502T000000,550000,4,2.5,1940.0,10500,1.0,0,0,4,7,1140.0,800.0,1976,0,98052,47.683,-122.114
5,5100401414,20140502T000000,490000,2,1.0,880.0,6380,1.0,0,0,3,7,880.0,0.0,1938,1994,98115,47.6924,-122.322
6,7525100520,20140502T000000,335000,2,2.0,1350.0,2560,1.0,0,0,3,8,1350.0,0.0,1976,0,98052,47.6344,-122.107
7,2591720070,20140502T000000,482000,4,2.5,2710.0,35868,2.0,0,0,3,9,2710.0,0.0,1989,0,98038,47.375,-122.022
8,1323089184,20140502T000000,452500,3,2.5,2430.0,88426,1.0,0,0,4,7,1570.0,860.0,1985,0,98045,47.4828,-121.718
9,6127600110,20140502T000000,640000,4,2.0,1520.0,6200,1.5,0,0,3,7,1520.0,0.0,1945,0,98115,47.678,-122.269


Looking at the data, the total living area appears to equal the sum of the basement area and the above-ground area. This could be a coincidence in the first few rows, so to verify it let's apply the formula to all of the clean data.

In [14]:
#check whether the formula holds for every row
set(df_clean.sqft_living == df_clean.sqft_above + df_clean.sqft_basement)

{True}
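
An exact identity like this would also account for the perfect score seen earlier. As a sketch on synthetic data (the numbers below are randomly generated, not drawn from the dataset), a regression whose target is an exact linear combination of its features recovers coefficients of 1 and -1 and an R-squared of 1, up to floating-point error.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic areas (assumed ranges, for illustration only).
rng = np.random.RandomState(0)
above = rng.randint(500, 3000, size=200).astype(float)
basement = rng.randint(0, 1500, size=200).astype(float)

# Build features so that sqft_living = sqft_above + sqft_basement holds exactly.
toy = pd.DataFrame({'sqft_living': above + basement, 'sqft_above': above})

# The target (basement) is then exactly sqft_living - sqft_above.
model = LinearRegression().fit(toy, basement)
r2 = model.score(toy, basement)
coefs = dict(zip(toy.columns, model.coef_))

print(round(r2, 6), {k: round(v, 6) for k, v in coefs.items()})
```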

### Formula identified: Imputing sqft_living, sqft_basement, and sqft_above with formula

So the formula holds for every complete row. Hence, to impute the values of sqft_living, sqft_basement, and sqft_above, we can simply apply the formula.

However, this only works if every row with a missing value is missing just one of the three attributes in the equation. Hence we check for overlap among the nulls by comparing the sale ids of the rows with missing values for each attribute.

In [15]:
#storing sets of ids for missing values of each attribute
li_na_set = set(df[df.sqft_living.isna()].id)
ab_na_set = set(df[df.sqft_above.isna()].id)
ba_na_set = set(df[df.sqft_basement.isna()].id)

#checking if any null values overlap
print('overlap li & ab:', bool(li_na_set & ab_na_set))
print('overlap li & ba:', bool(li_na_set & ba_na_set))
print('overlap ab & ba:', bool(ab_na_set & ba_na_set))

overlap li & ab: False
overlap li & ba: False
overlap ab & ba: False
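
The same question can also be asked row by row, which avoids relying on the sale ids at all (useful if an id could ever appear more than once; that is an assumption, not something checked here). The sketch below uses a small hand-made frame with gaps introduced for illustration and counts how many of the three columns are missing in each row.

```python
import numpy as np
import pandas as pd

# Hand-made example; the NaN positions are invented for illustration.
toy = pd.DataFrame({
    'sqft_living':   [1340.0, np.nan, 1930.0, 2000.0],
    'sqft_above':    [1340.0, 3370.0, np.nan, 1000.0],
    'sqft_basement': [0.0,    280.0,  0.0,    np.nan],
})

cols = ['sqft_living', 'sqft_above', 'sqft_basement']

# Number of the three attributes missing in each row.
missing_per_row = toy[cols].isna().sum(axis=1)

# The formula can recover a row only if at most one of the three is missing.
recoverable = bool((missing_per_row <= 1).all())

print(missing_per_row.tolist(), recoverable)
```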


Since no row is missing more than one attribute of the equation, we can impute the three variables by applying the formula to the dataframe.

In [16]:
#storing bools
living_na = df.sqft_living.isna()
basement_na = df.sqft_basement.isna()
above_na = df.sqft_above.isna()

#imputing using the formula
df.loc[living_na, 'sqft_living'] = df[living_na].sqft_basement + df[living_na].sqft_above
df.loc[above_na, 'sqft_above'] = df[above_na].sqft_living - df[above_na].sqft_basement
df.loc[basement_na, 'sqft_basement'] = df[basement_na].sqft_living - df[basement_na].sqft_above
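
As a worked example of the formula imputation, here is an equivalent sketch using `fillna` on a small hand-made frame (the gaps are invented for illustration). It assumes, as verified for the real data, that no row is missing more than one of the three values.

```python
import numpy as np
import pandas as pd

# Hand-made example; the NaN positions are invented for illustration.
toy = pd.DataFrame({
    'sqft_living':   [1340.0, np.nan, 1930.0, 2000.0],
    'sqft_above':    [1340.0, 3370.0, np.nan, 1000.0],
    'sqft_basement': [0.0,    280.0,  0.0,    np.nan],
})

filled = toy.copy()
# living = above + basement
filled['sqft_living'] = filled.sqft_living.fillna(filled.sqft_above + filled.sqft_basement)
# above = living - basement
filled['sqft_above'] = filled.sqft_above.fillna(filled.sqft_living - filled.sqft_basement)
# basement = living - above
filled['sqft_basement'] = filled.sqft_basement.fillna(filled.sqft_living - filled.sqft_above)

print(filled)
```

Only the missing cells are touched, because `fillna` leaves existing values as they are.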

After imputation, let's check the total number of missing values in the dataframe. Only bathrooms should have missing values now.

In [17]:
#checking the number of nas after imputing 
df.isna().sum()

id                 0
date               0
price              0
bedrooms           0
bathrooms        400
sqft_living        0
sqft_lot           0
floors             0
waterfront         0
view               0
condition          0
grade              0
sqft_above         0
sqft_basement      0
yr_built           0
yr_renovated       0
zipcode            0
lat                0
long               0
dtype: int64

As a final test, the formula is applied to the dataframe again to check that it still holds after the imputation and that no errors were introduced in the process.

In [18]:
#check that the formula holds for the full dataframe after imputation
set(df.sqft_living == df.sqft_above + df.sqft_basement)

{True}

Since the only attribute with missing values left is bathrooms, we can go ahead and impute it with a linear regression.

### Imputing Bathrooms with Linear Regression

Getting the clean data as before.

In [19]:
df_clean = df.dropna(how='any')

Selecting the columns that will be used for training. Again, id and date are not used for the linear model.

In [20]:
#bathroom
X = df_clean.drop(['id','date','bathrooms'], axis=1) #independent
y = df_clean.bathrooms #target

#get training and testing data
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=111)

Initiating the linear model.

In [21]:
#initiating linear regression
lm = LinearRegression()
lm

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Fitting the model on the training dataset.

In [22]:
lm.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

Once the model is fitted, let's inspect it.

In [23]:
#checking the model
print('Estimated intercept coefficient:', lm.intercept_)

Estimated intercept coefficient: 15.384137328208961


In [24]:
print('Number of coefficients:', len(lm.coef_))

Number of coefficients: 16


In [25]:
pd.DataFrame(list(zip(X.columns, lm.coef_)), columns=['features', 'estimatedCoefficients'])

Unnamed: 0,features,estimatedCoefficients
0,price,5.221929e-08
1,bedrooms,0.0738782
2,sqft_living,0.0003138064
3,sqft_lot,-3.736596e-07
4,floors,0.2743651
5,waterfront,-0.002950644
6,view,-0.01673563
7,condition,0.04519535
8,grade,0.03470617
9,sqft_above,3.524649e-05


In [26]:
lm.score(X_test, y_test)

0.7399552224392887
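
For reference, the score reported by a regression model is the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand; `r_squared` is a hypothetical helper written for this illustration and the numbers are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after prediction
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 2.5, 3.0]
mean_baseline = [np.mean(y_true)] * len(y_true)

print(r_squared(y_true, mean_baseline))  # predicting the mean everywhere scores 0
print(r_squared(y_true, y_true))         # a perfect prediction scores 1
```

Read this way, a score of 0 corresponds to doing no better than mean imputation, and higher values indicate how much of the remaining variation the model accounts for.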

We can see that although the R-squared is not perfect, it is reasonably high at about 0.74. This means the model explains roughly three quarters of the variation in bathrooms on the held-out data, so it can be used to impute the missing values.

In [27]:
len(lm.predict(df[df.bathrooms.isna()].drop(['bathrooms', 'id', 'date'], axis=1)))

400

In [28]:
df.loc[df.bathrooms.isna(), 'bathrooms'] = lm.predict(df[df.bathrooms.isna()].drop(['bathrooms', 'id', 'date'], axis=1))

Once the data is imputed, let's check the total nulls in the final dataset; it should come out to zero.

In [29]:
df.isna().sum()

id               0
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
dtype: int64

Finally, we know that bathroom counts are either whole numbers or fractions ending in .25, .5, or .75, so we round all values to the nearest quarter.

In [30]:
df.loc[:, 'bathrooms'] = round(df.bathrooms*4)/4

In [31]:
df.bathrooms.unique()

array([1.5 , 2.5 , 2.  , 2.25, 1.  , 1.75, 2.75, 3.  , 1.25, 3.25, 3.5 ,
       4.  , 4.5 , 3.75, 4.25, 0.75, 0.5 , 0.  ])
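
The rounding step works by scaling to quarter units, rounding to the nearest whole number, and scaling back. A small worked example on made-up predictions (not values from the model) shows the effect:

```python
import pandas as pd

# Hypothetical raw predictions, for illustration only.
predicted = pd.Series([1.62, 2.13, 2.49, 0.9])

# Multiply by 4 so each quarter becomes a whole unit, round, then divide back.
rounded = (predicted * 4).round() / 4

print(rounded.tolist())
```

Here 1.62 becomes 6.48 quarter units, which rounds to 6 and so maps back to 1.5; the other values follow the same pattern.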

The format of the imputed dataset now seems to be correct. Finally, we check the dataset for nulls one last time and, if it is correct, write the imputed dataset back.

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9967 entries, 0 to 9966
Data columns (total 19 columns):
id               9967 non-null int64
date             9967 non-null object
price            9967 non-null int64
bedrooms         9967 non-null int64
bathrooms        9967 non-null float64
sqft_living      9967 non-null float64
sqft_lot         9967 non-null int64
floors           9967 non-null float64
waterfront       9967 non-null int64
view             9967 non-null int64
condition        9967 non-null int64
grade            9967 non-null int64
sqft_above       9967 non-null float64
sqft_basement    9967 non-null float64
yr_built         9967 non-null int64
yr_renovated     9967 non-null int64
zipcode          9967 non-null int64
lat              9967 non-null float64
long             9967 non-null float64
dtypes: float64(7), int64(11), object(1)
memory usage: 1.4+ MB


## 6. Writing the correct dataset

In [33]:
df.to_csv('dataset3_solution.csv')
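
Note that `to_csv` with default arguments also writes the row index as an unnamed first column. Whether that is wanted depends on the expected output format, which is not specified here. The sketch below uses a tiny made-up frame and in-memory buffers to show the difference that `index=False` makes.

```python
import io
import pandas as pd

# Tiny made-up frame, for illustration only.
toy = pd.DataFrame({'id': [1, 2], 'bathrooms': [1.5, 2.25]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# With index=False only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]

print(repr(header_with))
print(repr(header_without))
```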

## 7. Summary

Imputation of bathrooms was done with linear regression rather than with the mean, mode, or median. The variable depends on other variables in the dataset, so a prediction gives an imputed value that is consistent with the rest of the row, and the imputed values vary from row to row instead of collapsing to a single constant.

sqft_living, sqft_basement, and sqft_above could be imputed directly with a formula, as there was an exact linear relationship between the variables and no row was missing more than one of them. If more than one had been missing in particular rows, linear regression would have been used to predict one variable and then the next.