# Project 2 - Ames Housing Data and Kaggle Challenge
![](./assets/images/Suburb_header_image.png)
[Image Source](https://homeownershipmatters.realtor/issues/these-3-suburbs-are-leading-the-way-in-the-u-s-markets-suburban-boom/)

## Table of Contents

1. [Background](#Background)
2. [Dataset](#Dataset)
3. [Python Libraries Used](#Libraries)
4. [Train Dataset](#Train_Dataset)
5. [Data Cleaning/Prep](#Data_Cleaning)

## Background


The purpose of this project is to build a regression model on the Ames Housing Dataset that predicts the sale price of a house.

The Ames Housing Dataset is an exceptionally detailed and robust dataset with over 70 columns of different features relating to houses.

To simulate the modeling process, I'll use a train-test split, cross-validation, and a separate test file whose target values are unknown.

The model's predictions on the test data will be submitted to [Kaggle](https://www.kaggle.com/c/dsi-us-11-project-2-regression-challenge).
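
As a rough illustration of how a hold-out split and k-fold cross-validation fit together, here is a minimal sketch on synthetic data. The data, the helper names (`fit_linear`, `predict`, `r2`), and the plain least-squares model are all assumptions for illustration, not the model built in this project; in practice scikit-learn's `train_test_split` and `cross_val_score` cover the same steps.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data (an assumption, not the Ames data):
# 200 "houses" with 3 numeric features and a noisy linear price.
X = rng.normal(size=(200, 3))
true_coef = np.array([3.0, -2.0, 0.5])
y = X @ true_coef + 10.0 + rng.normal(scale=0.1, size=200)


def fit_linear(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to new rows."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# 1) Hold-out split: shuffle the row indices, keep 25% for testing.
idx = rng.permutation(len(X))
n_test = len(X) // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

# 2) 5-fold cross-validation on the training portion only.
folds = np.array_split(train_idx, 5)
cv_scores = []
for k in range(5):
    val_idx = folds[k]
    fit_idx = np.concatenate([folds[j] for j in range(5) if j != k])
    coef = fit_linear(X[fit_idx], y[fit_idx])
    cv_scores.append(r2(y[val_idx], predict(coef, X[val_idx])))

# 3) Refit on all training rows and score once on the held-out rows.
final_coef = fit_linear(X[train_idx], y[train_idx])
test_score = r2(y[test_idx], predict(final_coef, X[test_idx]))

print("CV R^2 per fold:", np.round(cv_scores, 3))
print("Hold-out R^2:", round(float(test_score), 3))
```

The held-out rows are touched only once, at the end, so the cross-validation scores can guide modeling choices without leaking information from the final check.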

## Dataset

Two datasets are used: one to train the model and one to test it.

They can be found [here](https://www.kaggle.com/c/dsi-us-11-project-2-regression-challenge/data).

The description of the dataset can be found [here](http://jse.amstat.org/v19n3/decock/DataDocumentation.txt).

## Libraries

In [44]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Train_Dataset

In [100]:
train_dataset = './assets/datasets/train.csv'
df = pd.read_csv(train_dataset)
df.shape

(2051, 81)

In [101]:
df.head()

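| | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 109 | 533352170 | 60 | RL | | 13517 | Pave | | IR1 | Lvl | ... | 0 | 0 | | | | 0 | 3 | 2010 | WD | 130500 |
| 1 | 544 | 531379050 | 60 | RL | 43.0 | 11492 | Pave | | IR1 | Lvl | ... | 0 | 0 | | | | 0 | 4 | 2009 | WD | 220000 |
| 2 | 153 | 535304180 | 20 | RL | 68.0 | 7922 | Pave | | Reg | Lvl | ... | 0 | 0 | | | | 0 | 1 | 2010 | WD | 109000 |
| 3 | 318 | 916386060 | 60 | RL | 73.0 | 9802 | Pave | | Reg | Lvl | ... | 0 | 0 | | | | 0 | 4 | 2010 | WD | 174000 |
| 4 | 255 | 906425045 | 50 | RL | 82.0 | 14235 | Pave | | IR1 | Lvl | ... | 0 | 0 | | | | 0 | 3 | 2010 | WD | 138500 |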


In [102]:
df.dtypes

Id                int64
PID               int64
MS SubClass       int64
MS Zoning        object
Lot Frontage    float64
                 ...   
Misc Val          int64
Mo Sold           int64
Yr Sold           int64
Sale Type        object
SalePrice         int64
Length: 81, dtype: object

In [103]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2051 entries, 0 to 2050
Data columns (total 81 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Id               2051 non-null   int64  
 1   PID              2051 non-null   int64  
 2   MS SubClass      2051 non-null   int64  
 3   MS Zoning        2051 non-null   object 
 4   Lot Frontage     1721 non-null   float64
 5   Lot Area         2051 non-null   int64  
 6   Street           2051 non-null   object 
 7   Alley            140 non-null    object 
 8   Lot Shape        2051 non-null   object 
 9   Land Contour     2051 non-null   object 
 10  Utilities        2051 non-null   object 
 11  Lot Config       2051 non-null   object 
 12  Land Slope       2051 non-null   object 
 13  Neighborhood     2051 non-null   object 
 14  Condition 1      2051 non-null   object 
 15  Condition 2      2051 non-null   object 
 16  Bldg Type        2051 non-null   object 
 17  House Style   

## Data_Cleaning

### Alley

#### Let's look at the Alley attribute, since only 140 of its 2051 values are filled.
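
The non-null counts from `df.info()` are what flagged Alley. A minimal sketch of listing the columns with missing values directly, worst first, is shown here on a small made-up frame; `toy` and its values are assumptions for illustration, and the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Alley": [np.nan, "Pave", np.nan, np.nan, "Grvl"],
    "Lot Frontage": [np.nan, 43.0, 68.0, 73.0, 82.0],
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Pave"],
})

# Count missing values per column.
missing = toy.isnull().sum()

# Keep only columns with at least one missing value, worst first.
missing = missing[missing > 0].sort_values(ascending=False)

summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})
print(summary)
```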

In [104]:
df['Alley'].unique()

array([nan, 'Pave', 'Grvl'], dtype=object)

#### From the dataset description: NA means no alley access
Do houses with a Pave or Grvl alley sell for a higher price?

In [106]:
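# Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
df['Alley'] = df['Alley'].fillna('None')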

In [107]:
df['SalePrice'].groupby(df['Alley']).mean()

Alley
Grvl    120835.635294
None    184366.258503
Pave    174534.709091
Name: SalePrice, dtype: float64

#### On average, houses without alley access sell for more
This may be driven by outliers in the dataset, so looking at the minimum and maximum may help.

In [113]:
print('min sales price:',df['SalePrice'].groupby(df['Alley']).min())
print('\nmax sales price:',df['SalePrice'].groupby(df['Alley']).max())

min sales price: Alley
Grvl    35000
None    12789
Pave    40000
Name: SalePrice, dtype: int64

max sales price: Alley
Grvl    256000
None    611657
Pave    345000
Name: SalePrice, dtype: int64


#### Alley access does appear to raise the minimum sale price, but other factors can push the price of a house much higher even without alley access.
Given that, I'll keep the column as it is after filling the NaN values with 'None'.
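
Because only 140 of the 2051 houses have alley access, the group sizes are very unequal, and the mean alone can mislead. A minimal sketch of pulling the count, mean, median, min, and max in one call is shown here on made-up prices; `toy` and its values are assumptions for illustration, not figures from the dataset.

```python
import pandas as pd

# Made-up prices; one expensive house pulls the "None" mean upward.
toy = pd.DataFrame({
    "Alley": ["None", "None", "None", "Grvl", "Grvl", "Pave"],
    "SalePrice": [100000, 200000, 600000, 90000, 110000, 150000],
})

# One row per Alley category, one column per statistic.
alley_summary = toy.groupby("Alley")["SalePrice"].agg(
    ["count", "mean", "median", "min", "max"]
)
print(alley_summary)
```

The median is less sensitive to a handful of very expensive houses than the mean, so comparing the two columns gives a quick read on how much outliers are driving the group averages.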

### Lot Frontage

In [123]:
n_missing = df['Lot Frontage'].isnull().sum()
print(f'{n_missing} out of {df.shape[0]} is null')

330 out of 2051 is null


#### Are there any similarities for these properties?

The data description defines Lot Frontage as: "Lot Frontage (Continuous): Linear feet of street connected to property"

In [130]:
df['Street'].loc[df['Lot Frontage'].isnull()].unique()

array(['Pave', 'Grvl'], dtype=object)

In [131]:
df['MS Zoning'].loc[df['Lot Frontage'].isnull()].unique()

array(['RL', 'FV', 'RM', 'RH'], dtype=object)

In [132]:
df['MS Zoning'].loc[df['Lot Frontage'].isnull()].value_counts()

RL    289
RM     25
FV     13
RH      3
Name: MS Zoning, dtype: int64

In [133]:
df['Street'].value_counts()

Pave    2044
Grvl       7
Name: Street, dtype: int64

In [134]:
df['Street'].loc[df['Lot Frontage'].isnull()].value_counts()

Pave    329
Grvl      1
Name: Street, dtype: int64
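
The counts above put the missing rate at roughly 16% for Pave (329 of 2044) and 14% for Grvl (1 of 7), so street type alone does not seem to explain the gaps. A minimal sketch of getting both the counts and the per-group rates in one table with `pd.crosstab` is shown here on a made-up frame; `toy` and the `missing`/`present` labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the training data.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Pave", "Pave", "Grvl", "Grvl"],
    "Lot Frontage": [np.nan, 43.0, 68.0, np.nan, np.nan, 66.0],
})

# Label each row by whether Lot Frontage is missing.
status = toy["Lot Frontage"].isnull().map({True: "missing", False: "present"})

# Raw counts per street type.
counts = pd.crosstab(toy["Street"], status)

# Share of missing/present within each street type (rows sum to 1).
rates = pd.crosstab(toy["Street"], status, normalize="index")

print(counts)
print(rates)
```

Swapping `toy["Street"]` for another categorical column, such as `MS Zoning`, gives the same comparison for that feature.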

In [154]:
df[(df['Street'] == 'Grvl') & (df['Lot Frontage'].isnull())]

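| | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 1360 | 903452025 | 30 | RM | | 6291 | Grvl | | IR1 | Lvl | ... | 0 | 0 | | | | 0 | 7 | 2008 | WD | 93850 |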


In [156]:
df[(df['Street'] == 'Grvl')]

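| | Id | PID | MS SubClass | MS Zoning | Lot Frontage | Lot Area | Street | Alley | Lot Shape | Land Contour | ... | Screen Porch | Pool Area | Pool QC | Fence | Misc Feature | Misc Val | Mo Sold | Yr Sold | Sale Type | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 75 | 1360 | 903452025 | 30 | RM | | 6291 | Grvl | | IR1 | Lvl | ... | 0 | 0 | | | | 0 | 7 | 2008 | WD | 93850 |
| 410 | 308 | 911204100 | 30 | C (all) | 66.0 | 8712 | Grvl | | Reg | Lvl | ... | 0 | 0 | | | | 0 | 6 | 2010 | WD | 50138 |
| 581 | 946 | 912251110 | 30 | I (all) | 109.0 | 21780 | Grvl | | Reg | Lvl | ... | 0 | 0 | | | | 0 | 3 | 2009 | ConLD | 57625 |
| 636 | 2174 | 908127100 | 90 | RL | 81.0 | 11841 | Grvl | | Reg | Lvl | ... | 0 | 0 | | | | 0 | 5 | 2007 | WD | 118500 |
| 692 | 2883 | 911225110 | 50 | C (all) | 60.0 | 8520 | Grvl | | Reg | Bnk | ... | 0 | 0 | | | | 0 | 4 | 2006 | WD | 78000 |
| 1192 | 307 | 911204090 | 20 | C (all) | 66.0 | 8712 | Grvl | | Reg | Bnk | ... | 0 | 0 | | | Shed | 54 | 6 | 2010 | WD | 55993 |
| 1224 | 1631 | 527175130 | 20 | RL | 160.0 | 18160 | Grvl | | Reg | Lvl | ... | 0 | 0 | | MnPrv | | 0 | 3 | 2007 | WD | 154204 |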


In [159]:
df['Lot Frontage'][(df['Street'] == 'Grvl')].mean()

90.33333333333333
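
The notebook as shown does not state how the missing Lot Frontage values are finally filled. One option that follows from the group mean computed above is to fill each gap with a statistic from the row's own group. The sketch below uses the per-street median on made-up data; the choice of median, the grouping column, and the `Lot Frontage filled` column name are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the training data.
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Pave", "Grvl", "Grvl", "Grvl"],
    "Lot Frontage": [60.0, 80.0, np.nan, 66.0, 109.0, np.nan],
})

# Median Lot Frontage of each row's Street group, aligned to the rows.
# NaN values are skipped when the median is computed.
group_median = toy.groupby("Street")["Lot Frontage"].transform("median")

# Fill only the missing entries; known values are left untouched.
toy["Lot Frontage filled"] = toy["Lot Frontage"].fillna(group_median)
print(toy)
```

With only seven Grvl rows in the training data, a per-street statistic rests on very few observations, so a grouping with more rows per category may be a safer basis.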