# Ames Housing Sale Price

## Problem Statement

Build a regression model that predicts the sale price of a house.

## Executive Summary

### Contents:
- [5. Exploratory Data Analysis(EDA)](#5.-Exploratory-Data-Analysis(EDA))


Links:
[Kaggle challenge link](https://www.kaggle.com/c/dsi-us-6-project-2-regression-challenge/data)

## 5. Exploratory Data Analysis(EDA)

In [1]:
#Imports:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
plt.style.use('ggplot')

In [2]:
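# Load the cleaned datasets for EDA.
# Note the naming: df_train_eda holds the cleaned test set (no SalePrice column),
# while df_train_s_eda holds the cleaned training set (includes SalePrice).
# na_filter=False keeps empty fields as empty strings instead of converting them to NaN.
df_train_eda = pd.read_csv("../datasets/test_clean.csv", na_filter=False)
df_train_s_eda = pd.read_csv("../datasets/train_clean.csv", na_filter=False)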

In [3]:
df_train_eda.shape

(879, 80)

In [4]:
df_train_eda.head()

Unnamed: 0,Id,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,3Ssn Porch,Screen Porch,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type
0,2658,902301120,190,RM,69.0,9142,Pave,Grvl,Reg,Lvl,...,0,0,0,,,,0,4,2006,WD
1,2718,905108090,90,RL,0.0,9662,Pave,,IR1,Lvl,...,0,0,0,,,,0,8,2006,WD
2,2414,528218130,60,RL,58.0,17104,Pave,,IR1,Lvl,...,0,0,0,,,,0,9,2006,New
3,1989,902207150,30,RM,60.0,8520,Pave,,Reg,Lvl,...,0,0,0,,,,0,7,2007,WD
4,625,535105100,20,RL,0.0,9500,Pave,,IR1,Lvl,...,0,185,0,,,,0,7,2009,WD


In [5]:
df_train_eda.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 879 entries, 0 to 878
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               879 non-null    int64  
 1   PID              879 non-null    int64  
 2   MS SubClass      879 non-null    int64  
 3   MS Zoning        879 non-null    object 
 4   Lot Frontage     879 non-null    float64
 5   Lot Area         879 non-null    int64  
 6   Street           879 non-null    object 
 7   Alley            879 non-null    object 
 8   Lot Shape        879 non-null    object 
 9   Land Contour     879 non-null    object 
 10  Utilities        879 non-null    object 
 11  Lot Config       879 non-null    object 
 12  Land Slope       879 non-null    object 
 13  Neighborhood     879 non-null    object 
 14  Condition 1      879 non-null    object 
 15  Condition 2      879 non-null    object 
 16  Bldg Type        879 non-null    object 
 17  House Style     

In [6]:
# Check for null values (with na_filter=False, empty strings are not counted as null)
df_train_eda.isna().sum().sum()

0
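
Because the CSV files are read with `na_filter=False`, pandas leaves empty fields as empty strings, so `isna()` can report 0 even where values are absent. The sketch below illustrates this on a small inline CSV; the rows are hypothetical and only the column names are borrowed from the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the empty Alley fields stand in for absent values.
csv_text = "Id,Alley,Lot Area\n1,Grvl,9142\n2,,9662\n3,,17104\n"

# With na_filter=False, empty fields stay as empty strings, so isna() finds nothing.
df_raw = pd.read_csv(io.StringIO(csv_text), na_filter=False)
nan_count = int(df_raw.isna().sum().sum())

# Counting empty strings per column reveals the absent values.
empty_counts = (df_raw == "").sum()

# With the default na_filter=True, the same fields are parsed as NaN.
df_default = pd.read_csv(io.StringIO(csv_text))
default_nan_count = int(df_default.isna().sum().sum())

print(nan_count, default_nan_count)
print(empty_counts)
```

Running the same empty-string count on `df_train_eda` would show which object columns actually contain blank entries.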

## 5.1 Removing columns with zero or negative correlation to sale price
- Remove the numeric columns whose correlation with SalePrice is zero or negative.
- These columns are assumed not to help the predictive model.

In [7]:
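# Correlation of each numeric column (skipping the first column) with SalePrice
df_corr = df_train_s_eda.iloc[:, 1:].select_dtypes('number').corr()['SalePrice']

# Columns with zero or negative correlation, from weakest to strongest
df_corr[df_corr <= 0].sort_values(ascending=False)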

Misc Val          -0.007375
Yr Sold           -0.015203
Low Qual Fin SF   -0.041594
Bsmt Half Bath    -0.045290
MS SubClass       -0.087335
Overall Cond      -0.097019
Kitchen AbvGr     -0.125444
Enclosed Porch    -0.135656
PID               -0.255052
Name: SalePrice, dtype: float64

In [8]:
# List the columns with zero or negative correlation to SalePrice
df_corr[df_corr <= 0].keys()

Index(['PID', 'MS SubClass', 'Overall Cond', 'Low Qual Fin SF',
       'Bsmt Half Bath', 'Kitchen AbvGr', 'Enclosed Porch', 'Misc Val',
       'Yr Sold'],
      dtype='object')

In [9]:
no_corr = df_corr[df_corr <= 0].keys()  # 9 columns
df_train_eda.drop(columns=no_corr, inplace=True)  # drops 9 columns
df_train_eda.shape

(879, 71)
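
The sign-based filter above can be sketched on a tiny hand-made frame. The data and the `area` and `age` column names are hypothetical. The second filter, which drops columns by weak absolute correlation, is an alternative shown for comparison and not the method used in this notebook; it is included because a strongly negative correlation can still carry predictive signal.

```python
import pandas as pd

# Hypothetical data: price rises with area and falls with age.
toy = pd.DataFrame({
    "area": [50, 80, 120, 150, 200, 240],
    "age": [60, 45, 50, 20, 10, 5],
    "SalePrice": [100, 150, 190, 260, 330, 400],
})

# Pearson correlation of each feature with the target.
corr = toy.corr()["SalePrice"].drop("SalePrice")

# The filter used in this section: flag zero or negative correlations.
dropped_by_sign = corr[corr <= 0].index.tolist()

# Alternative (an assumption, not the notebook's method):
# flag only columns whose absolute correlation is weak.
threshold = 0.1  # illustrative cut-off
dropped_by_strength = corr[corr.abs() < threshold].index.tolist()

print(corr)
print(dropped_by_sign, dropped_by_strength)
```

Here `age` is flagged by the sign-based filter but kept by the strength-based one, since its relationship with price is strong, only inverse.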

## 5.2 Plotting features against sale price
- Check how each feature relates to SalePrice.
- Remove features that have no significant influence on SalePrice.

## 5.2.1 Sorting according to datatypes

In [10]:
num_data = df_train_eda.select_dtypes(['int64', 'float64']).keys()
num_data

Index(['Id', 'Lot Frontage', 'Lot Area', 'Overall Qual', 'Year Built',
       'Year Remod/Add', 'Mas Vnr Area', 'BsmtFin SF 1', 'BsmtFin SF 2',
       'Bsmt Unf SF', 'Total Bsmt SF', '1st Flr SF', '2nd Flr SF',
       'Gr Liv Area', 'Bsmt Full Bath', 'Full Bath', 'Half Bath',
       'Bedroom AbvGr', 'TotRms AbvGrd', 'Fireplaces', 'Garage Yr Blt',
       'Garage Cars', 'Garage Area', 'Wood Deck SF', 'Open Porch SF',
       '3Ssn Porch', 'Screen Porch', 'Pool Area', 'Mo Sold'],
      dtype='object')

In [11]:
df_train_eda_num = df_train_eda.loc[:,num_data]
print(df_train_eda_num.shape)
df_train_eda_num.head()

(879, 29)


Unnamed: 0,Id,Lot Frontage,Lot Area,Overall Qual,Year Built,Year Remod/Add,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,...,Fireplaces,Garage Yr Blt,Garage Cars,Garage Area,Wood Deck SF,Open Porch SF,3Ssn Porch,Screen Porch,Pool Area,Mo Sold
0,2658,69.0,9142,6,1910,1950,0.0,0,0,1020,...,0,1910.0,1,440,0,60,0,0,0,4
1,2718,0.0,9662,5,1977,1977,0.0,0,0,1967,...,0,1977.0,2,580,170,0,0,0,0,8
2,2414,58.0,17104,7,2006,2006,0.0,554,0,100,...,1,2006.0,2,426,100,24,0,0,0,9
3,1989,60.0,8520,5,1923,2006,0.0,0,0,968,...,0,1935.0,2,480,0,0,0,0,0,7
4,625,0.0,9500,6,1963,1963,247.0,609,0,785,...,2,1963.0,2,514,0,76,0,185,0,7


### 5.2.1 Comments on plots:
To be removed:
- Id: an identifier for the house, with no relevance to sale price
- Lot Frontage: no clear correlation with SalePrice
- Lot Area: no clear correlation with SalePrice
- BsmtFin SF 2: nonzero values cluster around similar prices; the rest are recorded as having no basement
- Bsmt Full Bath: no clear correlation with SalePrice
- Half Bath: no clear correlation with SalePrice
- Garage Yr Blt: too many garages were built in the 2000s, so the data clusters in those years and is not helpful for the model
- 3Ssn Porch: too many zero (empty) values
- Screen Porch: data appears randomly spread, with no correlation to SalePrice
- Pool Area: too few houses have a pool
- Mo Sold: no correlation with SalePrice
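
Several of the reasons above (3Ssn Porch, Pool Area) come down to a column being almost entirely zero. As a numeric complement to the plots, the share of zero values per column can be computed directly. The frame below is hypothetical and the 0.75 cut-off is an illustrative assumption, not a value taken from this analysis.

```python
import pandas as pd

# Hypothetical toy frame reusing three column names from the list above.
toy = pd.DataFrame({
    "Pool Area": [0, 0, 0, 0, 519],
    "3Ssn Porch": [0, 0, 153, 0, 0],
    "Gr Liv Area": [908, 1204, 1576, 2100, 1344],
})

# Fraction of rows equal to zero in each column.
zero_share = (toy == 0).mean().sort_values(ascending=False)

# Flag columns where at least 75% of rows are zero (illustrative cut-off).
mostly_zero = zero_share[zero_share >= 0.75].index.tolist()

print(zero_share)
print(mostly_zero)
```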

In [12]:
# Drop the numeric columns listed above
scatter_to_drop = ['Id', 'Lot Frontage', 'Lot Area',
                   'BsmtFin SF 2', 'Bsmt Full Bath', 'Half Bath',
                   'Garage Yr Blt', '3Ssn Porch', 'Screen Porch',
                   'Pool Area', 'Mo Sold']  # 11 columns
df_train_eda.drop(columns=scatter_to_drop, inplace=True)  # drops 11 columns
df_train_eda.shape

(879, 60)

## 5.2.2 Sorting Object Dtypes

In [13]:
obj_data = df_train_eda.select_dtypes(['object']).keys()
obj_data

Index(['MS Zoning', 'Street', 'Alley', 'Lot Shape', 'Land Contour',
       'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1',
       'Condition 2', 'Bldg Type', 'House Style', 'Roof Style', 'Roof Matl',
       'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Exter Qual',
       'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure',
       'BsmtFin Type 1', 'BsmtFin Type 2', 'Heating', 'Heating QC',
       'Central Air', 'Electrical', 'Kitchen Qual', 'Functional',
       'Fireplace Qu', 'Garage Type', 'Garage Finish', 'Garage Qual',
       'Garage Cond', 'Paved Drive', 'Pool QC', 'Fence', 'Misc Feature',
       'Sale Type'],
      dtype='object')

In [14]:
len(obj_data)

42

### 5.2.2 Comments:
To be removed:
- Utilities 
- Lot Shape 
- Land Contour 
- Lot Config
- Land Slope 
- Condition 2
- Roof Matl
- Mas Vnr Type
- BsmtFin Type 2
- Pool QC
- Fence 
- Misc Feature
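
The notebook does not state why these columns were chosen. One check that could support such a decision (an assumption, not necessarily the criterion used here) is the share of rows taken by the most frequent category: a column dominated by a single value varies too little to separate prices. The data and the 0.85 cut-off below are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Utilities": ["AllPub"] * 9 + ["NoSewr"],
    "Neighborhood": ["NAmes", "Sawyer", "Edwards", "NAmes", "OldTown",
                     "Gilbert", "Sawyer", "NridgHt", "Edwards", "OldTown"],
})

# Share of rows taken by the most frequent category in each column.
dominant_share = toy.apply(lambda col: col.value_counts(normalize=True).iloc[0])

# Flag columns dominated by a single category (illustrative 0.85 cut-off).
near_constant = dominant_share[dominant_share >= 0.85].index.tolist()

print(dominant_share)
print(near_constant)
```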

In [15]:
# Drop the categorical columns listed above
box_to_drop = ['Utilities', 'Lot Shape', 'Land Contour',
               'Lot Config', 'Land Slope', 'Condition 2',
               'Roof Matl', 'Mas Vnr Type', 'BsmtFin Type 2',
               'Pool QC', 'Fence', 'Misc Feature']  # 12 columns
df_train_eda.drop(columns=box_to_drop, inplace=True)  # drops 12 columns
df_train_eda.shape

(879, 48)

In [16]:
df_train_eda.columns

Index(['MS Zoning', 'Street', 'Alley', 'Neighborhood', 'Condition 1',
       'Bldg Type', 'House Style', 'Overall Qual', 'Year Built',
       'Year Remod/Add', 'Roof Style', 'Exterior 1st', 'Exterior 2nd',
       'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual',
       'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1',
       'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air',
       'Electrical', '1st Flr SF', '2nd Flr SF', 'Gr Liv Area', 'Full Bath',
       'Bedroom AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional',
       'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Finish',
       'Garage Cars', 'Garage Area', 'Garage Qual', 'Garage Cond',
       'Paved Drive', 'Wood Deck SF', 'Open Porch SF', 'Sale Type'],
      dtype='object')

In [17]:
# 32 columns were removed during EDA (9 + 11 + 12), leaving 48

# Export the dataset for preprocessing

df_train_eda.to_csv("../datasets/test_EDA_sorted.csv", index=False)
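
Only the frame loaded from `test_clean.csv` is exported here. For modelling, the training frame would need the same feature columns plus the target. A minimal sketch of that alignment follows, using hypothetical stand-in frames and a shortened drop list; the names `train`, `test`, `to_drop`, `test_sorted` and `train_sorted` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the cleaned training and test frames.
train = pd.DataFrame({
    "Id": [1, 2],
    "Pool Area": [0, 0],
    "Gr Liv Area": [900, 1500],
    "SalePrice": [110000, 190000],
})
test = pd.DataFrame({
    "Id": [3, 4],
    "Pool Area": [0, 512],
    "Gr Liv Area": [1200, 2100],
})

# Columns removed during EDA (a short illustrative list).
to_drop = ["Id", "Pool Area"]
test_sorted = test.drop(columns=to_drop)

# Keep the same feature columns in the training frame, plus the target.
train_sorted = train[test_sorted.columns.tolist() + ["SalePrice"]]

# Feature columns should match exactly, in the same order.
features_match = list(train_sorted.columns[:-1]) == list(test_sorted.columns)
print(features_match)
```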

# To be continued