
<img src="../images/house.jpeg" style="float: center; margin: 20px; height: 55px">

## Ames Housing Data and Kaggle Challenge

_Author: Afolabi Cardoso_

---

## Data Cleaning

---
#### Contents:
[Overview](#Overview) | [Imports](#Imports) | [Data Cleaning Training Set](#Data-Cleaning-Training-Set) | [Data Cleaning Test Set](#Data-Cleaning-Test-Set) | [Exports](#Exports)

---
## Overview

This notebook cleans the Ames Housing dataset using pandas. The cleaning steps are:
- Drop columns with information not relevant to the housing
- Drop features that have too many missing values
- Replace null values in numerical features with the feature's mean
- Replace null values in categorical features with the feature's mode

---
## Imports

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

#### Import Training Set

In [34]:
df = pd.read_csv('../datasets/train.csv')

In [43]:
len(df.dtypes[df.dtypes == object])

42

#### Import Test Set

In [4]:
df_test = pd.read_csv('../datasets/test.csv')

---
## Data Cleaning Training Set

#### Drop PID column

In [5]:
df.drop(columns=['PID'], inplace=True)

I dropped <b>'PID'</b> because it does not contain any information useful for building the model.

#### Drop features with too many missing values

Get features with more than 10% missing values and drop them

In [6]:
to_many_na = df.isna().sum()[df.isna().sum()>0.1*len(df)].index

In [7]:
df.drop(columns=to_many_na, inplace = True)

I dropped <b>'Lot Frontage', 'Alley', 'Fireplace Qu', 'Pool QC', 'Fence', 'Misc Feature'</b> because more than 10% of the values in each of these features are missing.
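As a small illustration of the threshold logic, the sketch below applies the same idea to a toy dataframe. The column names, values, and the threshold are made up for the example and are not taken from the Ames data. It uses `isna().mean()`, which gives the fraction of missing values per column directly.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'a' is complete, 'b' is 50% missing, 'c' is 25% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": ["x", None, "y", "z"],
})

# Fraction of missing values in each column
na_fraction = toy.isna().mean()

# Columns whose missing fraction exceeds the (arbitrary) threshold
threshold = 0.4
cols_to_drop = na_fraction[na_fraction > threshold].index.tolist()

toy_reduced = toy.drop(columns=cols_to_drop)
```

Comparing a fraction against the threshold avoids multiplying by `len(df)` and makes the cutoff easy to read and change in one place.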

#### Fill null values in float features with the mean

I created a function that replaces the null values in each float feature with the mean of that feature. Features of any other type are returned unchanged.

In [8]:
def fill_na_mean(column):
    # Only float columns are imputed; other dtypes are returned unchanged
    if column.dtype == float:
        # Series.mean() skips null values by default
        return column.fillna(column.mean())
    return column

Call the function on the dataframe

In [9]:
df = df.apply(fill_na_mean)
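To make the effect of the column-wise `apply` concrete, here is a minimal sketch on a toy dataframe. The column names and values are hypothetical, and the function is a stand-in that mirrors the idea of the one above: only the float column is imputed, while the integer and text columns pass through unchanged.

```python
import numpy as np
import pandas as pd

def fill_na_mean_sketch(column):
    # Stand-in for the notebook's function: impute float columns only
    if column.dtype == float:
        return column.fillna(column.mean())
    return column

toy = pd.DataFrame({
    "area": [100.0, np.nan, 300.0],   # float column with one null
    "rooms": [3, 4, 5],               # integer column, left untouched
    "zone": ["RL", None, "RM"],       # text column, left untouched
})

# DataFrame.apply calls the function once per column
filled = toy.apply(fill_na_mean_sketch)
```

The null in `area` becomes the mean of the two observed values, and the null in `zone` is still present afterwards, which is why a separate step is needed for the non-numeric features.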

#### Fill null values in object features with the mode

Some non-numeric variables still have null values. I created a function to replace the null values in each object feature with the mode of that feature.

In [10]:
def fill_na_mode(column):
    if column.dtype == object:
        return column.fillna(column.mode()[0])
    return column

Call fill_na_mode function on the dataframe

In [11]:
df = df.apply(fill_na_mode)
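The behaviour of `mode()[0]` is worth a quick look, since `mode()` returns a Series rather than a single value. The sketch below uses hypothetical values, not the Ames data, to show that nulls are ignored when finding the mode and that ties are resolved by taking the first value in sorted order.

```python
import pandas as pd

# Hypothetical categorical column with two nulls; "RL" is the most frequent value
zone = pd.Series(["RL", "RM", None, "RL", None], dtype=object)

most_frequent = zone.mode()[0]        # mode() skips nulls by default
zone_filled = zone.fillna(most_frequent)

# When several values tie, mode() returns all of them sorted,
# so [0] picks the first one in sort order
tied = pd.Series(["b", "a", None], dtype=object)
tie_choice = tied.mode()[0]
```

One consequence is that a column containing only nulls has an empty mode, so `mode()[0]` would raise an error for it; that case does not arise here because such columns would already have been dropped by the missing-value threshold.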

Check to make sure there are no null values left

In [12]:
df.isna().sum().sum()

0

---
## Data Cleaning Test Set

#### Drop PID column

In [13]:
df_test.drop(columns=['PID'], inplace=True)

Next, I drop the same features that were dropped from the training set (<b>'Lot Frontage', 'Alley', 'Fireplace Qu', 'Pool QC', 'Fence', 'Misc Feature'</b>) so that the test set keeps the same features as the training set.

In [14]:
df_test.drop(columns=to_many_na, inplace = True)
df_test.shape 

(878, 73)
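Because the same column list is dropped from both dataframes, a quick comparison of the remaining columns can confirm that the two sets still line up. The sketch below uses miniature, made-up frames and assumes that the only column the training set has and the test set lacks is the target, called `SalePrice` here.

```python
import pandas as pd

# Hypothetical miniature train/test frames; only the target column differs
train = pd.DataFrame({
    "Id": [1, 2],
    "Lot Area": [8000, 9500],
    "SalePrice": [150000, 200000],
})
test = pd.DataFrame({"Id": [3], "Lot Area": [7000]})

# Columns present in one frame but not in the other
only_in_train = set(train.columns) - set(test.columns)
only_in_test = set(test.columns) - set(train.columns)
```

If `only_in_test` is non-empty, or `only_in_train` holds anything besides the target, the two cleaning passes have diverged.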

Using the fill_na_mean and fill_na_mode functions, replace the null values in the test set with the mean or mode of each feature, just as for the training set

In [15]:
df_test = df_test.apply(fill_na_mean)
df_test = df_test.apply(fill_na_mode)

---
## Exports

#### Export train dataset

In [83]:
df.to_csv('../datasets/train_clean.csv')

#### Export test dataset

In [84]:
df_test.to_csv('../datasets/test_clean.csv')
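One detail to keep in mind when reading these files back: by default `to_csv` also writes the row index as an extra first column. The sketch below uses a toy dataframe and in-memory buffers rather than the real files to show the difference between the default and `index=False`.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
toy = pd.DataFrame({"Id": [10, 11], "Lot Area": [8000, 9500]})

# Default: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False writes only the dataframe's own columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

# Read both back to compare the resulting column names
cols_with_index = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
```

With the default settings the extra column comes back as `Unnamed: 0`, so a later notebook would need to either drop it or read the file with `index_col=0`.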