# Ames Housing Saleprice

## Problem Statement

Build a regression model that predicts the sale price of a house.

## Executive Summary

### Contents:
- [5. Exploratory Data Analysis(EDA)](#5.-Exploratory-Data-Analysis(EDA))


Links:
[Kaggle challenge link](https://www.kaggle.com/c/dsi-us-6-project-2-regression-challenge/data)

## 5. Exploratory Data Analysis(EDA)

In [1]:
#Imports:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
plt.style.use('ggplot')

In [8]:
# Importing cleaned datasets for EDA
# Note: df_train_eda holds the test set; df_train_s_eda holds the train set
df_train_eda = pd.read_csv("../datasets/test_clean.csv", na_filter=False)
df_train_s_eda = pd.read_csv("../datasets/train_clean.csv", na_filter=False)

In [3]:
df_train_eda.shape

(879, 80)

In [4]:
df_train_eda.head()

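| | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | 3Ssn Porch | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2658 | 902301120 | 190 | RM | 69.0 | 9142 | Pave | Grvl | Reg | Lvl | ... | 0 | 0 | 0 | | | | 0 | 4 | 2006 | WD |
| 1 | 2718 | 905108090 | 90 | RL | 0.0 | 9662 | Pave | | IR1 | Lvl | ... | 0 | 0 | 0 | | | | 0 | 8 | 2006 | WD |
| 2 | 2414 | 528218130 | 60 | RL | 58.0 | 17104 | Pave | | IR1 | Lvl | ... | 0 | 0 | 0 | | | | 0 | 9 | 2006 | New |
| 3 | 1989 | 902207150 | 30 | RM | 60.0 | 8520 | Pave | | Reg | Lvl | ... | 0 | 0 | 0 | | | | 0 | 7 | 2007 | WD |
| 4 | 625 | 535105100 | 20 | RL | 0.0 | 9500 | Pave | | IR1 | Lvl | ... | 0 | 185 | 0 | | | | 0 | 7 | 2009 | WD |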


In [5]:
df_train_eda.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 879 entries, 0 to 878
Data columns (total 80 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               879 non-null    int64  
 1   PID              879 non-null    int64  
 2   MS SubClass      879 non-null    int64  
 3   MS Zoning        879 non-null    object 
 4   Lot Frontage     879 non-null    float64
 5   Lot Area         879 non-null    int64  
 6   Street           879 non-null    object 
 7   Alley            879 non-null    object 
 8   Lot Shape        879 non-null    object 
 9   Land Contour     879 non-null    object 
 10  Utilities        879 non-null    object 
 11  Lot Config       879 non-null    object 
 12  Land Slope       879 non-null    object 
 13  Neighborhood     879 non-null    object 
 14  Condition 1      879 non-null    object 
 15  Condition 2      879 non-null    object 
 16  Bldg Type        879 non-null    object 
 17  House Style     

In [6]:
# checking for null values
df_train_eda.isna().sum().sum()

0

## 5.1 Removing data with zero or negative correlation to SalePrice
- remove columns whose correlation with SalePrice is zero or negative
- these columns are treated as unhelpful for the predictive model

In [7]:
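# correlation of each numeric column with sale price (first column skipped)
df_corr = df_train_s_eda[df_train_s_eda.columns[1:]].select_dtypes('number').corr()['SalePrice']

# columns with zero or negative correlation, largest first
df_corr[df_corr <= 0].sort_values(ascending = False)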

KeyError: 'SalePrice'

In [None]:
# names of the columns with zero or negative correlation with sale price
df_corr[df_corr <= 0].keys()

In [None]:
no_corr = df_corr[df_corr <= 0].keys()
df_train_eda.drop(columns=no_corr, inplace = True) # 9 columns removed
df_train_eda.shape
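
The correlation filter in this section can be tried on a small frame. The following is a minimal sketch, not the notebook's actual data: the `toy` frame and its values are made up for illustration. It also shows, as an assumed alternative rather than the approach used here, how ranking by absolute correlation makes strongly negative relationships visible.

```python
import pandas as pd

# Toy stand-in for the cleaned training data (values are made up for illustration).
toy = pd.DataFrame({
    "Gr Liv Area": [1000, 1500, 2000, 2500],
    "Overall Cond": [8, 6, 5, 3],
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "SalePrice": [100000, 150000, 200000, 250000],
})

# Correlate numeric columns only, then keep the SalePrice column of the matrix.
corr_with_price = toy.select_dtypes("number").corr()["SalePrice"].drop("SalePrice")

# Columns with zero or negative correlation, as selected in this section.
non_positive = corr_with_price[corr_with_price <= 0].index.tolist()

# Assumed alternative: rank by absolute value to see the strength of each relationship.
by_strength = corr_with_price.abs().sort_values(ascending=False)

print(non_positive)
print(by_strength)
```

In this toy frame `Overall Cond` falls as the price rises, so it is the only column picked up by the `<= 0` filter, even though its relationship with the price is strong.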

## 5.2 Plot data against SalePrice
- check how each feature relates to SalePrice
- remove features that have no significant influence on SalePrice

## 5.2.1 Sorting according to datatypes

In [None]:
num_data = df_train_eda.select_dtypes(['int64', 'float64']).keys()
num_data

In [None]:
df_train_eda_num = df_train_eda.loc[:,num_data]
print(df_train_eda_num.shape)
df_train_eda_num.head()
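
Splitting a frame by dtype can be sketched on a tiny example. The `toy` frame below is hypothetical. The sketch uses `exclude="number"` for the non-numeric side, which is an assumption made for portability; the notebook itself selects by `'int64'`, `'float64'` and `'object'`.

```python
import pandas as pd

# Hypothetical miniature frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "Lot Area": [9142, 9662],
    "Lot Frontage": [69.0, 0.0],
    "MS Zoning": ["RM", "RL"],
})

# Numeric columns (integers and floats alike).
num_cols = toy.select_dtypes(include="number").columns

# Everything that is not numeric, here the text column.
cat_cols = toy.select_dtypes(exclude="number").columns

print(list(num_cols))
print(list(cat_cols))
```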

In [None]:
def subplot_scatter(xcolumns, xlabels, dataframe = df_train_eda_num):
    nrows = int(np.ceil(len(xcolumns)/3)) # enough rows for three plots per row
    fig, ax = plt.subplots(nrows=nrows, ncols=3, figsize=(15, 4*nrows))
    plt.subplots_adjust(hspace = 0.4, wspace = 0.4)
    ax = ax.ravel() # flatten the grid of axes so it can be indexed with a single counter
    for i, column in enumerate(xcolumns):
        # scatter plot of each feature against SalePrice (x and y passed by keyword)
        sns.scatterplot(x = dataframe[column], y = dataframe['SalePrice'], ax = ax[i])
        ax[i].set_xlabel(xlabels[i]) # set x label for each plot

In [None]:
subplot_scatter(xcolumns = num_data, xlabels = num_data)

### 5.2.1 comments on plots:
To be removed:
- Id: an identifier for the house, with no relevance to SalePrice
- Lot Frontage: no clear correlation with SalePrice
- Lot Area: no clear correlation with SalePrice
- BsmtFin SF 2: prices are about the same across the recorded values; otherwise the house is recorded as not having a basement
- Bsmt Full Bath: no clear correlation with SalePrice
- Half Bath: no clear correlation with SalePrice
- Garage Yr Blt: too many garages were built in the 2000s, so the data bunches up in those years and is not helpful for the model
- 3Ssn Porch: too many null values
- Screen Porch: data appears randomly spread, no correlation with SalePrice
- Pool Area: too few houses have a pool
- Mo Sold: no correlation with SalePrice
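
Judgements such as "too few houses have a pool" can be backed with a quick number. The following is a minimal sketch on made-up values. The helper `dominant_share` and the 0.85 cut-off are hypothetical choices for illustration, not something taken from this notebook.

```python
import pandas as pd

# Made-up values: two columns dominated by zeros and one that varies freely.
toy = pd.DataFrame({
    "Pool Area": [0, 0, 0, 0, 0, 0, 0, 0, 0, 519],
    "3Ssn Porch": [0, 0, 0, 0, 0, 0, 0, 0, 153, 0],
    "Gr Liv Area": [1000, 1200, 1500, 900, 2000, 1750, 1100, 1300, 1600, 1400],
})

def dominant_share(series):
    """Fraction of rows taken up by the single most common value."""
    return series.value_counts(normalize=True).iloc[0]

# One share per column; values close to 1 mean the column is nearly constant.
shares = toy.apply(dominant_share)

# Hypothetical cut-off: flag columns where one value covers at least 85% of rows.
nearly_constant = shares[shares >= 0.85].index.tolist()

print(shares)
print(nearly_constant)
```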

In [None]:
# dropping redundant columns
scatter_to_drop = ['Id' , 'Lot Frontage', 'Lot Area',
           'BsmtFin SF 2', 'Bsmt Full Bath', 'Half Bath', 
           'Garage Yr Blt', '3Ssn Porch', 'Screen Porch', 
           'Pool Area', 'Mo Sold']
df_train_eda.drop(columns=scatter_to_drop, inplace = True) # 11 columns removed
df_train_eda.shape

## 5.2.2 Sorting Object Dtypes

In [None]:
obj_data = df_train_eda.select_dtypes(['object']).keys()
obj_data

In [None]:
len(obj_data)

In [None]:
df_train_eda_obj = df_train_eda.loc[:, obj_data].copy() # copy so the new column is not written to a slice
df_train_eda_obj['SalePrice'] = df_train_eda['SalePrice'] # adding the SalePrice column to the object dataframe
print(df_train_eda_obj.shape)
df_train_eda_obj.head()

In [None]:
def subplot_boxplot(xcolumns, xlabels, dataframe = df_train_eda_obj):
    nrows = int(np.ceil(len(xcolumns)/2)) # enough rows for two plots per row
    fig, ax = plt.subplots(nrows=nrows, ncols=2, figsize=(14, 5*nrows))
    plt.subplots_adjust(hspace = 0.4, wspace = 0.4)
    ax = ax.ravel() # flatten the grid of axes so it can be indexed with a single counter
    for i, column in enumerate(xcolumns):
        # median and number of observations per category, both indexed by category
        grouped = dataframe.groupby(column)['SalePrice']
        medians = grouped.median()
        counts = grouped.size()
        order = medians.index.tolist() # one fixed category order for boxes and labels

        # box plot of SalePrice for each category, drawn in that order
        sns.boxplot(x = column, y = 'SalePrice', order = order, ax = ax[i], data = dataframe)

        # write "n: <count>" just above the median line of each box
        for pos, category in enumerate(order):
            ax[i].text(pos, medians.loc[category] + 0.03, "n: " + str(counts.loc[category]),
                    horizontalalignment='center', size='x-small', color='r', weight='semibold')

        ax[i].set_xlabel(xlabels[i]) # set x label for each plot
        ax[i].tick_params(axis='x', labelrotation=45)

In [None]:
subplot_boxplot(xcolumns = obj_data, xlabels = obj_data)
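
When annotating box plots with group sizes, the counts and medians need to share one category order. The following is a minimal sketch on a made-up frame: `groupby(...).agg` returns both statistics indexed by category, whereas `value_counts` sorts by frequency and can therefore come out in a different order.

```python
import pandas as pd

# Made-up data: three paved streets and two gravel streets.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave", "Grvl"],
    "SalePrice": [200000, 90000, 180000, 220000, 110000],
})

# Median and count per category, in one table sorted by category name.
summary = toy.groupby("Street")["SalePrice"].agg(["median", "count"])

# Label text per category, built from the same table as the medians.
labels = {cat: f"n: {n}" for cat, n in summary["count"].items()}

# value_counts orders by frequency, which differs from the order above.
freq_order = toy["Street"].value_counts().index.tolist()

print(summary)
print(labels)
print(freq_order)
```

Here the grouped table lists `Grvl` before `Pave`, while `value_counts` lists `Pave` first, so pairing the two by position would attach the wrong count to each box.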

### 5.2.2 comments on plots:
To be removed:
- Utilities 
- Lot Shape 
- Land Contour 
- Lot Config
- Land Slope 
- Condition 2
- Roof Matl
- Mas Vnr Type
- BsmtFin Type 2
- Pool QC
- Fence 
- Misc Feature

In [None]:
# dropping redundant columns
box_to_drop = ['Utilities', 'Lot Shape', 'Land Contour', 
               'Lot Config', 'Land Slope', 'Condition 2',
               'Roof Matl', 'Mas Vnr Type', 'BsmtFin Type 2',
               'Pool QC', 'Fence', 'Misc Feature']
df_train_eda.drop(columns=box_to_drop, inplace = True) # 12 columns removed
df_train_eda.shape

In [None]:
df_train_eda.columns

In [None]:
# 32 columns were removed after EDA (9 + 11 + 12)

# export dataset for pre-processing

df_train_eda.to_csv("../datasets/test_EDA_sorted.csv", index=False)
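
Because the same columns are meant to be dropped from both the train and test sets, a quick consistency check can be useful before exporting. The following is a minimal sketch, assuming hypothetical miniature frames and a hypothetical drop list; none of the names or values come from the actual datasets.

```python
import pandas as pd

# Hypothetical miniature train and test frames (values are made up).
train = pd.DataFrame({"Id": [1, 2], "Gr Liv Area": [1500, 1200],
                      "Pool Area": [0, 0], "SalePrice": [200000, 150000]})
test = pd.DataFrame({"Id": [3], "Gr Liv Area": [1800], "Pool Area": [0]})

to_drop = ["Id", "Pool Area"]  # hypothetical drop list

# errors="ignore" skips names that are missing from a frame instead of raising.
train_trimmed = train.drop(columns=to_drop, errors="ignore")
test_trimmed = test.drop(columns=to_drop, errors="ignore")

# The only column the test set should lack is the target.
extra_in_train = set(train_trimmed.columns) - set(test_trimmed.columns)

print(extra_in_train)
```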

# To be continued