# Ames Housing Sale Price

## Problem Statement

Build a regression model that predicts the sale price of a house.

## Executive Summary

### Contents:
- [6. Pre-Processing](#6.-Pre-Processing)


Links:
[Kaggle challenge link](https://www.kaggle.com/c/dsi-us-6-project-2-regression-challenge/data)

## 6. Pre-Processing

In [1]:
#Imports:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
plt.style.use('ggplot')

In [2]:
# Load the cleaned dataset for pre-processing, plus the raw train/test files (their Ids are used to split it below)
df = pd.read_csv("../datasets/AHD_EDA.csv", na_filter=False)
df_train = pd.read_csv('../datasets/train.csv')
df_test = pd.read_csv('../datasets/test.csv')

df.shape, df_train.shape, df_test.shape

((2712, 31), (2051, 81), (879, 80))

In [3]:
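# Ids that belong to the raw train and test files
df_train_rows = df_train['Id'].tolist()
df_test_rows = df_test['Id'].tolist()

# Split the cleaned frame back into train and test subsets by Id
df_train = df.loc[df['Id'].isin(df_train_rows)]
df_test = df.loc[df['Id'].isin(df_test_rows)]

df_train.shape, df_test.shape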

((1833, 31), (879, 31))
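
The split above keeps only the rows of the cleaned frame whose `Id` appears in the raw files, so any row removed during cleaning drops out of the training subset (1833 rows here, versus 2051 in the raw file). A minimal sketch on toy data shows how the same `isin` filter behaves and how the missing Ids could be listed. The values and the names `cleaned`, `raw_train_ids`, `raw_test_ids`, and `dropped_ids` are hypothetical, not taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the cleaned frame and the raw train/test Ids (hypothetical values).
cleaned = pd.DataFrame({"Id": [1, 2, 3, 5], "Overall Qual": [6, 7, 5, 8]})
raw_train_ids = [1, 2, 4]  # Id 4 is absent from the cleaned frame
raw_test_ids = [3, 5]

# Same pattern as the notebook: keep cleaned rows whose Id is in each raw file.
train_part = cleaned.loc[cleaned["Id"].isin(raw_train_ids)]
test_part = cleaned.loc[cleaned["Id"].isin(raw_test_ids)]

# Ids that were in the raw training file but did not survive cleaning.
dropped_ids = sorted(set(raw_train_ids) - set(cleaned["Id"]))

print(train_part.shape, test_part.shape, dropped_ids)
```

Listing the dropped Ids this way is one option for confirming that a shrinking row count comes from the cleaning step rather than from a mismatch in the `Id` column.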

In [4]:
df.head()

Unnamed: 0,Id,Neighborhood,Condition 1,Condition 2,Overall Qual,Year Built,Year Remod/Add,Roof Matl,Mas Vnr Area,Exter Qual,...,Kitchen Qual,TotRms AbvGrd,Fireplaces,Fireplace Qu,Garage Finish,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,SalePrice
0,109,Sawyer,RRAe,Norm,6,1976,2005,CompShg,289.0,4,...,4,6,0,0,2,2.0,475.0,0,44,130500
1,544,SawyerW,Norm,Norm,7,1996,1997,CompShg,132.0,4,...,4,8,1,3,2,2.0,559.0,0,74,220000
2,153,NAmes,Norm,Norm,5,1953,2007,CompShg,0.0,3,...,4,5,0,0,1,1.0,246.0,0,52,109000
3,318,Timber,Norm,Norm,5,2006,2007,CompShg,0.0,3,...,3,7,0,0,3,2.0,400.0,100,0,174000
4,255,SawyerW,Norm,Norm,6,1900,1993,CompShg,0.0,3,...,3,6,0,0,1,2.0,484.0,0,59,138500


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2712 entries, 0 to 2711
Data columns (total 31 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Id              2712 non-null   int64  
 1   Neighborhood    2712 non-null   object 
 2   Condition 1     2712 non-null   object 
 3   Condition 2     2712 non-null   object 
 4   Overall Qual    2712 non-null   int64  
 5   Year Built      2712 non-null   int64  
 6   Year Remod/Add  2712 non-null   int64  
 7   Roof Matl       2712 non-null   object 
 8   Mas Vnr Area    2712 non-null   float64
 9   Exter Qual      2712 non-null   int64  
 10  Foundation      2712 non-null   int64  
 11  Bsmt Qual       2712 non-null   int64  
 12  Bsmt Exposure   2712 non-null   int64  
 13  BsmtFin Type 1  2712 non-null   int64  
 14  BsmtFin SF 1    2712 non-null   float64
 15  Total Bsmt SF   2712 non-null   float64
 16  Heating         2712 non-null   object 
 17  Heating QC      2712 non-null   i

### 6.1 Scaling of Data

In [6]:
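# Select numeric columns, excluding the target and the identifier
num_data = df.select_dtypes(['int64', 'float64']).columns
num_data = [x for x in num_data if x not in ('SalePrice', 'Id')]

# Standardize each column to zero mean and unit variance
# (the scaler is fit on all rows of df, train and test together)
nums = df[num_data]
ss = StandardScaler()
ss.fit(nums)
nums_scaled = ss.transform(nums)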

nums_scaled.shape

(2712, 24)
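
The cell above fits the scaler on the combined frame. A common alternative, shown here only as a sketch and not as the notebook's method, is to learn the mean and standard deviation from the training rows and reuse them for the test rows, so that no test-set statistics reach the model. The toy values and the names `train_nums`, `test_nums`, and `scaler` are assumptions for illustration; the column names are borrowed from the notebook for readability.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the numeric train/test columns.
train_nums = pd.DataFrame({"Gr Liv Area": [1000.0, 1500.0, 2000.0],
                           "Garage Area": [200.0, 400.0, 600.0]})
test_nums = pd.DataFrame({"Gr Liv Area": [1750.0], "Garage Area": [500.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_nums)  # learn mean/std from training rows only
test_scaled = scaler.transform(test_nums)        # reuse the training statistics

# StandardScaler computes (x - mean) / std using the population std (ddof=0).
manual = (train_nums - train_nums.mean()) / train_nums.std(ddof=0)
assert np.allclose(train_scaled, manual.to_numpy())
```

Whether this matters depends on how the held-out data is used later; for a Kaggle-style test set without targets, fitting on everything is a judgment call rather than an error.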

In [7]:
nums_scaled_pd = pd.DataFrame(nums_scaled, columns=num_data)  # wrap in a DataFrame so it can be concatenated later
nums_scaled_pd.head()

Unnamed: 0,Overall Qual,Year Built,Year Remod/Add,Mas Vnr Area,Exter Qual,Foundation,Bsmt Qual,Bsmt Exposure,BsmtFin Type 1,BsmtFin SF 1,...,Full Bath,Kitchen Qual,TotRms AbvGrd,Fireplaces,Fireplace Qu,Garage Finish,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF
0,-0.050561,0.14928,0.994848,1.31085,1.104871,0.522811,-0.683119,-0.604267,1.170945,0.246993,...,0.83645,0.775516,-0.236309,-0.919587,-0.96125,0.328165,0.349953,0.054276,-0.783038,0.002612
1,0.702849,0.818009,0.610917,0.278886,1.104871,-0.915184,0.630795,-0.604267,1.170945,0.493413,...,0.83645,0.775516,1.133073,0.678055,0.71486,0.328165,0.349953,0.465604,-0.783038,0.506197
2,-0.803971,-0.619757,1.090831,-0.588753,-0.688313,0.522811,-0.683119,-0.604267,1.170945,0.716138,...,-1.01686,0.775516,-0.921001,-0.919587,-0.96125,-0.796972,-1.005865,-1.067083,-0.783038,0.136901
3,-0.803971,1.152373,1.090831,-0.588753,-0.688313,-0.915184,0.630795,-0.604267,-1.237004,-1.015906,...,0.83645,-0.777807,0.448382,-0.919587,-0.96125,1.453302,0.349953,-0.312981,0.101021,-0.73598
4,-0.050561,-2.391888,0.418952,-0.588753,-0.688313,-0.915184,-1.997034,-0.604267,-1.237004,-1.015906,...,0.83645,-0.777807,-0.236309,-0.919587,-0.96125,-0.796972,0.349953,0.098347,-0.783038,0.254405


### 6.2 One-hot encode categorical variables

- Create dummy (0/1) columns for each categorical feature in the dataframe

In [8]:
# Select object-dtype columns to one-hot encode
obj_data = df.select_dtypes(['object']).keys()
print(obj_data)
len(obj_data)

Index(['Neighborhood', 'Condition 1', 'Condition 2', 'Roof Matl', 'Heating'], dtype='object')


5

In [9]:
obj_processed = pd.get_dummies(df[obj_data], columns = obj_data)
obj_processed

Unnamed: 0,Neighborhood_Blmngtn,Neighborhood_Blueste,Neighborhood_BrDale,Neighborhood_BrkSide,Neighborhood_ClearCr,Neighborhood_CollgCr,Neighborhood_Crawfor,Neighborhood_Edwards,Neighborhood_Gilbert,Neighborhood_Greens,...,Roof Matl_Metal,Roof Matl_Roll,Roof Matl_Tar&Grv,Roof Matl_WdShake,Roof Matl_WdShngl,Heating_Floor,Heating_GasA,Heating_GasW,Heating_Grav,Heating_OthW
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2707,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2708,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2709,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2710,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
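
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a single toy column. Two options are shown that the notebook does not use: `dtype=int`, which keeps the 0/1 output seen above on pandas versions that default to boolean dummies, and `drop_first=True`, which drops one level per feature and is sometimes preferred for unregularized linear regression to avoid perfectly collinear columns. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy column; the category labels mirror the notebook's 'Heating' feature.
toy = pd.DataFrame({"Heating": ["GasA", "GasW", "GasA", "Grav"]})

# One column per category, encoded as integers 0/1.
full = pd.get_dummies(toy, columns=["Heating"], dtype=int)

# Same encoding, but the first category (alphabetically) is dropped as the baseline.
reduced = pd.get_dummies(toy, columns=["Heating"], dtype=int, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row of all zeros stands for the dropped baseline category.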


In [10]:
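# Look for dummy columns created from 'NA' category values.
# The only match, 'Neighborhood_NAmes', is a real neighborhood (North Ames), so nothing is removed.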
na_col = obj_processed.filter(regex = 'NA')
na_col_keys = na_col.keys()
na_col_keys

Index(['Neighborhood_NAmes'], dtype='object')
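
`DataFrame.filter(regex=...)` keeps every column whose name contains a match anywhere, which is why the substring `NA` picks up `Neighborhood_NAmes`. The sketch below contrasts that loose pattern with one anchored to the end of the column name. The column `Fireplace Qu_NA` is a hypothetical example of a dummy built from an `NA` category and does not exist in the notebook's data.

```python
import pandas as pd

# Empty frame with illustrative column names; 'Fireplace Qu_NA' is hypothetical.
dummies = pd.DataFrame(
    columns=["Neighborhood_NAmes", "Neighborhood_NWAmes", "Fireplace Qu_NA"]
)

# Loose pattern: matches 'NA' anywhere in the name, including inside 'NAmes'.
loose = dummies.filter(regex="NA").columns.tolist()

# Anchored pattern: matches only names that end in the literal suffix '_NA'.
strict = dummies.filter(regex=r"_NA$").columns.tolist()

print(loose)
print(strict)
```

The anchored pattern assumes the dummy columns follow the `<feature>_<category>` naming that `pd.get_dummies` uses by default.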

In [11]:
# Combine Id, scaled numeric features, dummy columns, and the target column-wise
df_new = pd.concat(
    [df['Id'], nums_scaled_pd, obj_processed, df['SalePrice']],
    ignore_index=False, sort=False, axis=1,
)
df_new.shape

(2712, 82)
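
`pd.concat(..., axis=1)` lines rows up by index label, not by position. It works here because `df`, `nums_scaled_pd`, and `obj_processed` all carry the same default `RangeIndex`. The sketch below, using hypothetical toy frames `left` and `right`, shows what would happen if one piece came from a filtered frame with a non-contiguous index, and one way to realign it first.

```python
import pandas as pd

# 'left' mimics a filtered frame whose index has gaps; 'right' has a fresh RangeIndex.
left = pd.DataFrame({"Id": [109, 544, 153]}, index=[0, 2, 5])
right = pd.DataFrame({"scaled": [0.1, 0.2, 0.3]})

# Label-based alignment: the union of both indexes is used and gaps become NaN.
misaligned = pd.concat([left, right], axis=1)

# Give 'right' the same index as 'left' so rows pair up positionally.
right_aligned = right.copy()
right_aligned.index = left.index
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

Comparing the row count and the NaN count before and after concatenation is a cheap check that the pieces lined up as intended.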

In [12]:
df_new.columns

Index(['Id', 'Overall Qual', 'Year Built', 'Year Remod/Add', 'Mas Vnr Area',
       'Exter Qual', 'Foundation', 'Bsmt Qual', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin SF 1', 'Total Bsmt SF', 'Heating QC',
       '1st Flr SF', 'Gr Liv Area', 'Full Bath', 'Kitchen Qual',
       'TotRms AbvGrd', 'Fireplaces', 'Fireplace Qu', 'Garage Finish',
       'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF',
       'Neighborhood_Blmngtn', 'Neighborhood_Blueste', 'Neighborhood_BrDale',
       'Neighborhood_BrkSide', 'Neighborhood_ClearCr', 'Neighborhood_CollgCr',
       'Neighborhood_Crawfor', 'Neighborhood_Edwards', 'Neighborhood_Gilbert',
       'Neighborhood_Greens', 'Neighborhood_GrnHill', 'Neighborhood_IDOTRR',
       'Neighborhood_Landmrk', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NAmes', 'Neighborhood_NPkVill', 'Neighborhood_NWAmes',
       'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_OldTown',
       'Neighborhood_SWISU', 'Nei

In [13]:
df_new.to_csv("../datasets/data_PP_FE.csv", index=False)