<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 100px">

# Project 2: Ames Housing Data and Kaggle Challenge
# Part 1 - train.csv - Problem Statement, Background, Cleaning & Dummifying

### Contents:
- [Problem Statement](#Problem-Statement)
- [Background](#Background)
- [Data Import & Cleaning](#Data-Import-and-Cleaning-for-1st-Dataset---train.csv)
- [Dummifying Columns](#Dummify-Columns)

## Problem Statement

This project uses a data science approach to identify the areas that contribute to high transacted prices and where the highest transaction volume occurs, so as to help the realtors of Skywalker Property Advisors gain a competitive advantage in the Ames housing market.

## Background

Ames is a city in the state of Iowa, US.

The housing market in Ames was stable: the number of houses sold per year from 2006 to 2010 stayed at roughly 400 or more, even though the US was going through the subprime financial crisis during that period.

There are plenty of Property Advisors in Ames, and Skywalker Property Advisors is one of them. 

From the position of a data scientist advising the realtors of Skywalker Property Advisors in 2010, data on houses sold in Ames from 2006 to July 2010 were analysed extensively and various observations were obtained. Based on these observations, recommendations were made to the realtors to help them improve their sales and gain a competitive advantage in the Ames housing market.

## Data Import and Cleaning for 1st Dataset - train.csv

### 1. Import all necessary libraries

In [1]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn import metrics
import scipy.stats as stats
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import r2_score

%matplotlib inline

### 2. Read and understand the dataset

In [2]:
# Read from train csv and save it in "train" as dataframe
train = pd.read_csv("../data/train.csv")

In [3]:
# Print 1st 5 rows of dataframe
train.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice
0,109,533352170,60,RL,,13517,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,130500
1,544,531379050,60,RL,43.0,11492,Pave,,IR1,Lvl,...,0,0,,,,0,4,2009,WD,220000
2,153,535304180,20,RL,68.0,7922,Pave,,Reg,Lvl,...,0,0,,,,0,1,2010,WD,109000
3,318,916386060,60,RL,73.0,9802,Pave,,Reg,Lvl,...,0,0,,,,0,4,2010,WD,174000
4,255,906425045,50,RL,82.0,14235,Pave,,IR1,Lvl,...,0,0,,,,0,3,2010,WD,138500


In [4]:
#Check number of rows and columns in dataset
train.shape

# The train dataset has 2051 rows and 81 columns

(2051, 81)

In [5]:
# Check the data types of the columns and cross-check them against the data dictionary given
train.info()

# All the data types match the data dictionary given

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2051 non-null   int64  
 1   PID              2051 non-null   int64  
 2   MS SubClass      2051 non-null   int64  
 3   MS Zoning        2051 non-null   object 
 4   Lot Frontage     1721 non-null   float64
 5   Lot Area         2051 non-null   int64  
 6   Street           2051 non-null   object 
 7   Alley            140 non-null    object 
 8   Lot Shape        2051 non-null   object 
 9   Land Contour     2051 non-null   object 
 10  Utilities        2051 non-null   object 
 11  Lot Config       2051 non-null   object 
 12  Land Slope       2051 non-null   object 
 13  Neighborhood     2051 non-null   object 
 14  Condition 1      2051 non-null   object 
 15  Condition 2      2051 non-null   object 
 16  Bldg Type        2051 non-null   object 
 17  House Style   

### 3. Check for Missing Values and deal with Missing Values

In [6]:
# Check for missing values in all columns of the data
for i in range (0,len(train.columns)):                                          # for i in range of 0 to number of columns,
    if train[train.columns[i]].isnull().sum() != 0:                             # if total amount of null values cells is not 0,
        print(str(train.columns[i]), train[train.columns[i]].isnull().sum())    # print the column name, the amount of null values
    else:
        pass                                                                    # else pass

Lot Frontage 330
Alley 1911
Mas Vnr Type 22
Mas Vnr Area 22
Bsmt Qual 55
Bsmt Cond 55
Bsmt Exposure 58
BsmtFin Type 1 55
BsmtFin SF 1 1
BsmtFin Type 2 56
BsmtFin SF 2 1
Bsmt Unf SF 1
Total Bsmt SF 1
Bsmt Full Bath 2
Bsmt Half Bath 2
Fireplace Qu 1000
Garage Type 113
Garage Yr Blt 114
Garage Finish 114
Garage Cars 1
Garage Area 1
Garage Qual 114
Garage Cond 114
Pool QC 2042
Fence 1651
Misc Feature 1986
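
The same per-column null counts can be obtained without an explicit loop. The following is a minimal sketch on a toy frame; `demo` and its values are hypothetical stand-ins for `train`, not the Ames data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (hypothetical values, not the Ames data)
demo = pd.DataFrame({
    "Lot Frontage": [43.0, np.nan, 68.0],
    "Alley": [np.nan, np.nan, "Pave"],
    "Lot Area": [13517, 11492, 7922],
})

# Count nulls per column, then keep only the columns with at least one null
null_counts = demo.isnull().sum()
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

Filtering the resulting Series keeps the output limited to the affected columns, which mirrors what the loop prints.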


#### 1. Replace empty cells in the 13 columns below from NA to "No". In these columns NA means the feature does not exist, so these NAs are not missing values per se

In [7]:
# Imputing missing values with No
train['Alley'].replace(np.nan,"No",inplace = True)
train['Bsmt Qual'].replace(np.nan,"No",inplace = True)
train['Bsmt Cond'].replace(np.nan,"No",inplace = True)
train['BsmtFin Type 1'].replace(np.nan,"No",inplace = True)
train['BsmtFin Type 2'].replace(np.nan,"No",inplace = True)
train['Fireplace Qu'].replace(np.nan,"No",inplace = True)
train['Garage Type'].replace(np.nan,"No",inplace = True)
train['Garage Finish'].replace(np.nan,"No",inplace = True)
train['Garage Qual'].replace(np.nan,"No",inplace = True)
train['Garage Cond'].replace(np.nan,"No",inplace = True)
train['Pool QC'].replace(np.nan,"No",inplace = True)
train['Fence'].replace(np.nan,"No",inplace = True)
train['Misc Feature'].replace(np.nan,"No",inplace = True)
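
Depending on the pandas version, calling `replace(..., inplace=True)` on a column selection may raise a chained-assignment warning or may not update the parent frame. A version-tolerant alternative is to assign the filled columns back explicitly. This is a minimal sketch on a hypothetical toy frame (`demo` and the list name `absent_means_no` are illustrative, not from the notebook).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` (hypothetical values)
demo = pd.DataFrame({
    "Alley": [np.nan, "Grvl", np.nan],
    "Fence": ["MnPrv", np.nan, np.nan],
    "Lot Area": [13517, 11492, 7922],
})

# Columns where NA means "feature not present" rather than "value unknown"
absent_means_no = ["Alley", "Fence"]

# Fill all of them in one step and assign the result back to the frame
demo[absent_means_no] = demo[absent_means_no].fillna("No")
print(demo)
```

Keeping the column names in one list also documents which columns were treated this way.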

In [8]:
# Recheck for missing values in all columns of the data, and see whether the columns above still appear
for i in range (0,len(train.columns)):                                          # for i in range of 0 to number of columns,
    if train[train.columns[i]].isnull().sum() != 0:                             # if total amount of null values cells is not 0,
        print(str(train.columns[i]), train[train.columns[i]].isnull().sum())    # print the column name, the amount of null values
    else:
        pass
    
# All the 13 columns above have no null values anymore

Lot Frontage 330
Mas Vnr Type 22
Mas Vnr Area 22
Bsmt Exposure 58
BsmtFin SF 1 1
BsmtFin SF 2 1
Bsmt Unf SF 1
Total Bsmt SF 1
Bsmt Full Bath 2
Bsmt Half Bath 2
Garage Yr Blt 114
Garage Cars 1
Garage Area 1


#### 2. Replace blanks in the "Lot Frontage" column with 0, as those blank cells mean there is no lot frontage rather than missing data

In [9]:
# Imputing missing values with 0
train['Lot Frontage'].replace(np.nan,int(0), inplace = True)

#### 3. Replace blanks in the following columns with 0, using deductive imputation: the blanks are assumed to mean the item does not exist, so no value was entered for the cell

In [10]:
# Imputing missing values with 0
train['Mas Vnr Area'].replace(np.nan,int(0), inplace = True)
train['BsmtFin SF 1'].replace(np.nan,int(0), inplace = True)
train['BsmtFin SF 2'].replace(np.nan,int(0), inplace = True)
train['Bsmt Unf SF'].replace(np.nan,int(0), inplace = True)
train['Total Bsmt SF'].replace(np.nan,int(0), inplace = True)
train['Bsmt Full Bath'].replace(np.nan,int(0), inplace = True)
train['Bsmt Half Bath'].replace(np.nan,int(0), inplace = True)
train['Garage Cars'].replace(np.nan,int(0), inplace = True)
train['Garage Area'].replace(np.nan,int(0), inplace = True)

# Replace blanks in the following column with "No", using deductive imputation: the blank is assumed to mean the item does not exist, so no value was entered for the cell
train['Mas Vnr Type'].replace(np.nan,"No", inplace = True)
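
The deductive-imputation assumption can be sanity-checked before filling, for example by confirming that the veneer type and veneer area are missing on the same rows. This is a minimal sketch on hypothetical toy values, to be run before the imputation; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` before imputation (hypothetical values)
demo = pd.DataFrame({
    "Mas Vnr Type": ["BrkFace", np.nan, "None", np.nan],
    "Mas Vnr Area": [289.0, np.nan, 0.0, np.nan],
})

area_missing = demo["Mas Vnr Area"].isnull()
type_missing = demo["Mas Vnr Type"].isnull()

# Rows where both are blank support the "item does not exist" reading
both_missing = int((area_missing & type_missing).sum())
# Rows where only one is blank would need a closer look
only_one_missing = int((area_missing ^ type_missing).sum())
print(both_missing, only_one_missing)
```

A non-zero `only_one_missing` would suggest that some blanks are genuinely unknown values rather than absent features.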

In [11]:
# Recheck for missing values in all columns of the data
for i in range (0,len(train.columns)):                                          # for i in range of 0 to number of columns,
    if train[train.columns[i]].isnull().sum() != 0:                             # if total amount of null values cells is not 0,
        print(str(train.columns[i]), train[train.columns[i]].isnull().sum())    # print the column name, the amount of null values
    else:
        pass

Bsmt Exposure 58
Garage Yr Blt 114


#### 4. Remove the 'Garage Yr Blt' column, as it is unnecessary. Most values of 'Garage Yr Blt' are the same as, or close to, the values in "Year Built" or "Year Remod/Add"

In [12]:
# Removing 'Garage Yr Blt' column
train.drop('Garage Yr Blt', axis=1, inplace = True)
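
The overlap between 'Garage Yr Blt' and the other year columns can be quantified with a check like the one below, which would need to run before the column is dropped. It is a minimal sketch on hypothetical toy values and counts exact matches only, which is one possible reading of "similar".

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train` before the drop (hypothetical values)
demo = pd.DataFrame({
    "Year Built": [1976, 1996, 1953, 2006],
    "Year Remod/Add": [2005, 1997, 2007, 2007],
    "Garage Yr Blt": [1976.0, 1997.0, 1960.0, np.nan],
})

# Only houses that actually have a garage year are compared
has_garage = demo["Garage Yr Blt"].notnull()

# True where the garage year equals the build year or the remodel year
matches = (
    (demo["Garage Yr Blt"] == demo["Year Built"])
    | (demo["Garage Yr Blt"] == demo["Year Remod/Add"])
)

share_matching = matches[has_garage].mean()
print(share_matching)
```

A share close to 1 would support treating the column as redundant.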

#### 5. Replace NaN values in the 'Bsmt Exposure' column with "No_Basement", and "None" with "No_Exposure"

In [13]:
# Replace NaN with "No_Basement" and "None" with "No_Exposure"
train['Bsmt Exposure'].replace(np.nan,"No_Basement", inplace = True)
train['Bsmt Exposure'].replace("None","No_Exposure", inplace = True)

In [14]:
# Recheck for missing values in all columns of the data
for i in range (0,len(train.columns)):                                          # for i in range of 0 to number of columns,
    if train[train.columns[i]].isnull().sum() != 0:                             # if total amount of null values cells is not 0,
        print(str(train.columns[i]), train[train.columns[i]].isnull().sum())    # print the column name, the amount of null values
    else:
        pass
    
# No columns now have null values.

### 4. Check contents of each column to see if any abnormalities exist

In [15]:
# Check the contents of the column to see if any abnormalities exist
np.unique(train['MS Zoning'])

# Some values contain spaces; they are replaced below to remove the spacing

array(['A (agr)', 'C (all)', 'FV', 'I (all)', 'RH', 'RL', 'RM'],
      dtype=object)

In [16]:
# Replace cells in 'MS Zoning' from "A (agr)" to "A", from "C (all)" to "C", and from "I (all)" to "I"
train['MS Zoning'].replace("A (agr)","A", inplace = True)
train['MS Zoning'].replace("C (all)","C", inplace = True)
train['MS Zoning'].replace("I (all)","I", inplace = True)

In [17]:
# Rechecking contents of column 'MS Zoning'
np.unique(train['MS Zoning'])

# The values no longer contain spaces

array(['A', 'C', 'FV', 'I', 'RH', 'RL', 'RM'], dtype=object)

In [18]:
# Check the contents of the column to see if any abnormalities exist
np.unique(train['Exterior 1st'])

# Some values contain spaces; they are replaced below to remove the spacing

array(['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd',
       'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco',
       'VinylSd', 'Wd Sdng', 'WdShing'], dtype=object)

In [19]:
# Replace cells in 'Exterior 1st' from "Wd Sdng" to "WdSdng" to remove spacing
train['Exterior 1st'].replace("Wd Sdng","WdSdng", inplace = True)

In [20]:
# Rechecking contents of column 'Exterior 1st'
np.unique(train['Exterior 1st'])

# The values no longer contain spaces

array(['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CemntBd',
       'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco',
       'VinylSd', 'WdSdng', 'WdShing'], dtype=object)

In [21]:
# Check the contents of the column to see if any abnormalities exist
np.unique(train['Exterior 2nd'])

# Some values contain spaces; they are replaced below to remove the spacing

array(['AsbShng', 'AsphShn', 'Brk Cmn', 'BrkFace', 'CBlock', 'CmentBd',
       'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco',
       'VinylSd', 'Wd Sdng', 'Wd Shng'], dtype=object)

In [22]:
# Replace cells in 'Exterior 2nd' from "Brk Cmn" to "BrkComm", "Wd Sdng" to "WdSdng", "Wd Shng" to "WdShing" to remove spacing
train['Exterior 2nd'].replace("Brk Cmn","BrkComm", inplace = True)
train['Exterior 2nd'].replace("Wd Sdng","WdSdng", inplace = True)
train['Exterior 2nd'].replace("Wd Shng","WdShing", inplace = True)

In [23]:
# Rechecking contents of column 'Exterior 2nd'
np.unique(train['Exterior 2nd'])

# The values no longer contain spaces

array(['AsbShng', 'AsphShn', 'BrkComm', 'BrkFace', 'CBlock', 'CmentBd',
       'HdBoard', 'ImStucc', 'MetalSd', 'Plywood', 'Stone', 'Stucco',
       'VinylSd', 'WdSdng', 'WdShing'], dtype=object)
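
The label fixes above can also be collected in a single nested dictionary and applied in one `DataFrame.replace` call. This is a minimal sketch on a hypothetical toy frame; the dictionary name `label_fixes` is illustrative.

```python
import pandas as pd

# Toy frame standing in for `train` (hypothetical values)
demo = pd.DataFrame({
    "MS Zoning": ["A (agr)", "RL", "C (all)"],
    "Exterior 2nd": ["Wd Sdng", "Brk Cmn", "VinylSd"],
})

# Outer keys are column names, inner dicts map old label -> new label
label_fixes = {
    "MS Zoning": {"A (agr)": "A", "C (all)": "C", "I (all)": "I"},
    "Exterior 2nd": {"Brk Cmn": "BrkComm", "Wd Sdng": "WdSdng", "Wd Shng": "WdShing"},
}

# Labels not listed in the mapping are left unchanged
demo = demo.replace(label_fixes)
print(demo)
```

Keeping the corrections in one mapping makes it straightforward to apply the same fixes to another dataset with the same columns.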

### 5. Changing Column Names

In [24]:
# Convert all letters in column names to lowercase
train.columns = train.columns.str.lower()

In [25]:
# Rename all columns by replacing " " with "_"
train.columns = train.columns.str.replace(" ", "_", regex=False)


In [26]:
# Checking the name of columns again
train.columns

# All column names are successfully changed.

Index(['id', 'pid', 'ms_subclass', 'ms_zoning', 'lot_frontage', 'lot_area',
       'street', 'alley', 'lot_shape', 'land_contour', 'utilities',
       'lot_config', 'land_slope', 'neighborhood', 'condition_1',
       'condition_2', 'bldg_type', 'house_style', 'overall_qual',
       'overall_cond', 'year_built', 'year_remod/add', 'roof_style',
       'roof_matl', 'exterior_1st', 'exterior_2nd', 'mas_vnr_type',
       'mas_vnr_area', 'exter_qual', 'exter_cond', 'foundation', 'bsmt_qual',
       'bsmt_cond', 'bsmt_exposure', 'bsmtfin_type_1', 'bsmtfin_sf_1',
       'bsmtfin_type_2', 'bsmtfin_sf_2', 'bsmt_unf_sf', 'total_bsmt_sf',
       'heating', 'heating_qc', 'central_air', 'electrical', '1st_flr_sf',
       '2nd_flr_sf', 'low_qual_fin_sf', 'gr_liv_area', 'bsmt_full_bath',
       'bsmt_half_bath', 'full_bath', 'half_bath', 'bedroom_abvgr',
       'kitchen_abvgr', 'kitchen_qual', 'totrms_abvgrd', 'functional',
       'fireplaces', 'fireplace_qu', 'garage_type', 'garage_finish',
       'g

### 6. Drop Unnecessary column

In [27]:
# The column "gr_liv_area" is the sum of 3 columns: "1st_flr_sf", "2nd_flr_sf", and "low_qual_fin_sf".
# Hence, "gr_liv_area" is dropped to resolve the collinearity among these 4 columns
train.drop('gr_liv_area', axis=1, inplace = True)

# Drop the "pid" column as well. The "id" column already identifies each house, so "pid" is redundant.
train.drop('pid', axis=1, inplace = True)
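
The additive relationship stated in the comments above could be confirmed with a check along these lines, run before the drop. This is a minimal sketch on hypothetical toy values chosen to satisfy the relationship; it does not show the result on the real data.

```python
import pandas as pd

# Toy frame standing in for `train` before the drop (hypothetical values)
demo = pd.DataFrame({
    "1st_flr_sf": [725, 913, 1057],
    "2nd_flr_sf": [754, 1209, 0],
    "low_qual_fin_sf": [0, 0, 0],
    "gr_liv_area": [1479, 2122, 1057],
})

# Row-wise sum of the three component columns
parts = demo[["1st_flr_sf", "2nd_flr_sf", "low_qual_fin_sf"]].sum(axis=1)

# True only if every row's living area equals the sum of its parts
is_exact_sum = bool((parts == demo["gr_liv_area"]).all())
print(is_exact_sum)
```

If the check returned False on the real data, the mismatching rows would be worth inspecting before removing the column.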

In [28]:
# Checking the name of columns again
train.columns

# "gr_liv_area" and "pid" have been removed.

Index(['id', 'ms_subclass', 'ms_zoning', 'lot_frontage', 'lot_area', 'street',
       'alley', 'lot_shape', 'land_contour', 'utilities', 'lot_config',
       'land_slope', 'neighborhood', 'condition_1', 'condition_2', 'bldg_type',
       'house_style', 'overall_qual', 'overall_cond', 'year_built',
       'year_remod/add', 'roof_style', 'roof_matl', 'exterior_1st',
       'exterior_2nd', 'mas_vnr_type', 'mas_vnr_area', 'exter_qual',
       'exter_cond', 'foundation', 'bsmt_qual', 'bsmt_cond', 'bsmt_exposure',
       'bsmtfin_type_1', 'bsmtfin_sf_1', 'bsmtfin_type_2', 'bsmtfin_sf_2',
       'bsmt_unf_sf', 'total_bsmt_sf', 'heating', 'heating_qc', 'central_air',
       'electrical', '1st_flr_sf', '2nd_flr_sf', 'low_qual_fin_sf',
       'bsmt_full_bath', 'bsmt_half_bath', 'full_bath', 'half_bath',
       'bedroom_abvgr', 'kitchen_abvgr', 'kitchen_qual', 'totrms_abvgrd',
       'functional', 'fireplaces', 'fireplace_qu', 'garage_type',
       'garage_finish', 'garage_cars', 'garage_area', '

### 7. Changing Contents of Columns

In [29]:
# 1. Change the contents of column "central_air" from "Y"/"N" to 1/0
train['central_air'].replace({"Y":1,"N":0}, inplace = True)

In [30]:
# 2. Change the "None" value in the 'mas_vnr_type' column to "No"
train['mas_vnr_type'].replace("None","No", inplace = True)

In [31]:
# Changing the ordinal columns from categories to numerical values
train["exter_qual"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1}, inplace = True)
train["exter_cond"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1}, inplace = True)
train["bsmt_qual"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1, "No":0}, inplace = True)
train["bsmt_cond"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1, "No":0}, inplace = True)
train["heating_qc"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1}, inplace = True)
train["kitchen_qual"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1}, inplace = True)
train["fireplace_qu"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1, "No":0}, inplace = True)
train["garage_qual"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1, "No":0}, inplace = True)
train["garage_cond"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "Po":1, "No":0}, inplace = True)
train["pool_qc"].replace({"Ex": 5, "Gd":4, "TA":3, "Fa":2, "No":0}, inplace = True)
train['bsmt_exposure'].replace({"Gd": 3, "Av":2, "Mn":1, "No_Basement":0, "No":0}, inplace = True)

In [32]:
# Changing the ordinal columns from categories to numerical values
train["lot_shape"].replace({"IR3": 1, "IR2":2, "IR1":3, "Reg":4}, inplace = True)
train["utilities"].replace({"AllPub": 4, "NoSewr":3, "NoSeWa":2, "ELO":1}, inplace = True)
train["bsmtfin_type_1"].replace({"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec":3, "LwQ":2, "Unf":1, "No":0}, inplace = True)
train["bsmtfin_type_2"].replace({"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec":3, "LwQ":2, "Unf":1, "No":0}, inplace = True)
train["functional"].replace({"Typ": 8, "Min1": 7, "Min2": 6, "Mod":5, "Maj1":4, "Maj2":3, "Sev": 2, "Sal":1}, inplace = True)
train["garage_finish"].replace({"Fin":3, "RFn":2, "Unf":1, "No":0}, inplace = True)
train["paved_drive"].replace({"Y":2, "P":1, "N":0}, inplace = True)
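
Because several quality columns share the same scale, the encoding can be written once and reused. The sketch below uses `Series.map` on a hypothetical toy frame; unlike `replace`, `map` turns any label missing from the dictionary into NaN, which makes unexpected categories easy to spot. The names `quality_scale` and `unmapped` are illustrative.

```python
import pandas as pd

# Shared ordinal scale for the quality/condition columns
quality_scale = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1, "No": 0}

# Toy frame standing in for `train` (hypothetical values)
demo = pd.DataFrame({
    "exter_qual": ["Gd", "TA", "Ex"],
    "fireplace_qu": ["No", "TA", "Gd"],
})

# Apply the same mapping to every column that uses this scale
for col in ["exter_qual", "fireplace_qu"]:
    demo[col] = demo[col].map(quality_scale)

# Any label not covered by the scale would show up here as a null
unmapped = int(demo.isnull().sum().sum())
print(demo, unmapped)
```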

### 8. Save Dataset

In [33]:
train.shape

(2051, 78)

In [34]:
# Save dataset to csv as a modified dataset
train.to_csv('../data/train_modified.csv', index = False)

## Dummify Columns

In [35]:
# Dummify categorical columns for regressions to take place later
train_modified = pd.get_dummies(columns=['ms_subclass','ms_zoning','street','alley','land_contour','lot_config','land_slope', 
                                'neighborhood','condition_1','condition_2','bldg_type','house_style','roof_style', 'roof_matl',
                                'exterior_1st','exterior_2nd','mas_vnr_type','foundation','heating','electrical','garage_type',
                                'fence','misc_feature','sale_type'], drop_first=True, data = train)

In [36]:
# Checking the number of columns after dummifying
train_modified.shape

# There are 217 columns

(2051, 217)
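
To illustrate what `drop_first=True` does, here is a minimal sketch on a hypothetical two-level column. The first category in sorted order ("Grvl") gets no column of its own and becomes the baseline, which avoids perfectly collinear dummy columns in a linear regression.

```python
import pandas as pd

# Toy frame standing in for `train` (hypothetical values)
demo = pd.DataFrame({
    "street": ["Pave", "Grvl", "Pave"],
    "lot_area": [13517, 11492, 7922],
})

# One dummy column per category, minus the first category of each variable
dummies = pd.get_dummies(data=demo, columns=["street"], drop_first=True)
print(dummies.columns.tolist())
```

The untouched columns stay in place, and each dummified column contributes one fewer column than it has categories.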

### 10. Save Dataset Again

In [37]:
#Save dataset to csv as a modified dataset
train_modified.to_csv('../data/train_modified_dummified.csv', index = False)