# EDA REPORT: MACHINE LEARNING PROJECT <br>(Housing Data set for Ames, IA)<br><br> - by Sabbir Mohammed

- - -

**INITIALIZING NOTEBOOK:**

In [9]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import HTML

# HTML('''<script>
# code_show=true; 
# function code_toggle() {
#  if (code_show){
#  $('div.input').hide();
#  } else {
#  $('div.input').show();
#  }
#  code_show = !code_show
# } 
# $( document ).ready(code_toggle);
# </script>
# <form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

**LOADING TRAINING DATASET:**

In [2]:
full_df = pd.read_csv('./house-prices-advanced-regression-techniques/train.csv')

In [3]:
print('Shape of Data set: ', full_df.shape)

Shape of Data set:  (1460, 81)


In [4]:
full_df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


 - - -

**ALL FEATURE VARIABLES:**

In [5]:
# Skip the Id column, then print the remaining names four per row
colnames = list(full_df.columns)[1:]
for a,b,c,d in zip(colnames[::4],colnames[1::4],colnames[2::4],colnames[3::4]):
    print('{:15}{:15}{:15}{}'.format(a,b,c,d))

MSSubClass     MSZoning       LotFrontage    LotArea
Street         Alley          LotShape       LandContour
Utilities      LotConfig      LandSlope      Neighborhood
Condition1     Condition2     BldgType       HouseStyle
OverallQual    OverallCond    YearBuilt      YearRemodAdd
RoofStyle      RoofMatl       Exterior1st    Exterior2nd
MasVnrType     MasVnrArea     ExterQual      ExterCond
Foundation     BsmtQual       BsmtCond       BsmtExposure
BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2
BsmtUnfSF      TotalBsmtSF    Heating        HeatingQC
CentralAir     Electrical     1stFlrSF       2ndFlrSF
LowQualFinSF   GrLivArea      BsmtFullBath   BsmtHalfBath
FullBath       HalfBath       BedroomAbvGr   KitchenAbvGr
KitchenQual    TotRmsAbvGrd   Functional     Fireplaces
FireplaceQu    GarageType     GarageYrBlt    GarageFinish
GarageCars     GarageArea     GarageQual     GarageCond
PavedDrive     WoodDeckSF     OpenPorchSF    EnclosedPorch
3SsnPorch      ScreenPorch    PoolArea       PoolQC
Fence          MiscFeature    MiscVal        MoSold
YrSold         SaleType       SaleCondition  SalePrice

- - - 

## **SUBSET of feature variables (1-27)**

- - -

### **VARIABLE TYPES:**<br>
1. MSSubClass: Identifies the type of dwelling involved in the sale.
    - *Ordinal*
2. MSZoning: Identifies the general zoning classification of the sale.
    - *Nominal*
3. LotFrontage: Linear feet of street connected to property
    - *Quantitative*
4. LotArea: Lot size in square feet
    - *Quantitative*
5. Street: Type of road access to property
    - *Nominal (binary)*
6. Alley: Type of alley access to property
    - *Nominal*
7. LotShape: General shape of property
    - *Nominal*
8. LandContour: Flatness of the property
    - *Ordinal (?)*
9. Utilities: Type of utilities available
    - *Nominal*
10. LotConfig: Lot configuration
    - *Nominal*
11. LandSlope: Slope of property
    - *Ordinal*
12. Neighborhood: Physical locations within Ames city limits
    - *Nominal*
13. Condition1: Proximity to various conditions
    - *Nominal*
14. Condition2: Proximity to various conditions (if more than one is present)
    - *Nominal*
15. BldgType: Type of dwelling
    - *Nominal*
16. HouseStyle: Style of dwelling
    - *Nominal*
17. OverallQual: Rates the overall material and finish of the house
    - *Ordinal*
18. OverallCond: Rates the overall condition of the house
    - *Ordinal*
19. YearBuilt: Original construction date
    - *Ordinal*
20. YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
    - *Ordinal*
21. RoofStyle: Type of roof
    - *Nominal*
22. RoofMatl: Roof material
    - *Nominal*
23. Exterior1st: Exterior covering on house
    - *Nominal*
24. Exterior2nd: Exterior covering on house (if more than one material)
    - *Nominal*
25. MasVnrType: Masonry veneer type
    - *Nominal*
26. MasVnrArea: Masonry veneer area in square feet
    - *Quantitative*
27. ExterQual: Evaluates the quality of the material on the exterior
    - *Ordinal*
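
For the features marked *Ordinal*, the category order can be made explicit before modelling. The following is a minimal sketch, assuming the ExterQual levels run from Po (worst) to Ex (best) as in the data description; `encode_ordinal` and the toy values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Quality codes ordered from worst to best, as listed in the data description.
EXTER_QUAL_ORDER = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

def encode_ordinal(series, ordered_levels):
    """Return integer codes (0 = lowest level); missing or unknown values become -1."""
    cat = pd.Categorical(series, categories=ordered_levels, ordered=True)
    return pd.Series(cat.codes, index=series.index, name=series.name)

# Toy values for illustration only (not taken from train.csv).
toy = pd.Series(['TA', 'Gd', 'Ex', 'Fa'], name='ExterQual')
codes = encode_ordinal(toy, EXTER_QUAL_ORDER)
print(codes.tolist())
```

With the real data, the same helper could be applied as `encode_ordinal(full_df['ExterQual'], EXTER_QUAL_ORDER)`.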

- - -

### **MISSINGNESS:**

In [6]:
dfPart1 = full_df.iloc[:,0:28]
missingRows = dfPart1.isnull().any(axis=1)
missingCols = dfPart1.isnull().any(axis=0)

In [7]:
missingValues1  = dfPart1.isnull().sum().reset_index()
tempMask1       = missingValues1[0]>0
missingValues1  = missingValues1[tempMask1]
missingValues1.columns = ['Feature', 'NumberMissingValues']
missingValues1

Unnamed: 0,Feature,NumberMissingValues
3,LotFrontage,259
6,Alley,1369
25,MasVnrType,8
26,MasVnrArea,8


**Missingness Analysis:**
- **LotFrontage:** A quantitative feature giving the length of street that borders the property ("Linear feet of street connected to property"). The related "Street" feature is categorical and records the _type_ of road access: 6 properties have _GRAVEL_ access and 1,454 have _PAVED_ access. Street has no missing values, so it cannot explain the 259 missing LotFrontage values. Missingness _NOT EXPLAINED_.<br>
<br>    
- **Alley:** Missing values indicate NO alley access (the "NA" category in the data description). _NO ERROR_. Can be imputed with '0' if dummified.<br>
<br>
- **MasVnrType:** CBlock (Cinder Block) is listed as a category in the data description but does not appear in the value counts. "None" is already its own category, so the 8 missing values do not simply mean "no veneer". Missingness _NOT EXPLAINED_.<br>
<br>
- **MasVnrArea:** Matches the missingness in MasVnrType (8 missing values each). _NOT RANDOM_.
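
The two checks described above can be sketched in code. This is only an illustration under stated assumptions: `missing_overlap` is a hypothetical helper, and the toy frame stands in for `full_df` so the block runs on its own. The second part shows that `pd.get_dummies` leaves rows with a missing Alley at 0 in every dummy column, which is one way to read "imputed with '0' if dummified".

```python
import numpy as np
import pandas as pd

def missing_overlap(df, col_a, col_b):
    """Count rows where both columns, only col_a, or only col_b are missing."""
    a = df[col_a].isnull()
    b = df[col_b].isnull()
    return {'both': int((a & b).sum()),
            'only_' + col_a: int((a & ~b).sum()),
            'only_' + col_b: int((~a & b).sum())}

# Toy frame for illustration only; with the real data, pass full_df instead.
toy = pd.DataFrame({
    'MasVnrType': ['BrkFace', np.nan, 'None', np.nan],
    'MasVnrArea': [196.0, np.nan, 0.0, np.nan],
    'Alley':      [np.nan, 'Grvl', np.nan, 'Pave'],
})

# If the two veneer features are missing together, both "only_" counts are zero.
overlap = missing_overlap(toy, 'MasVnrType', 'MasVnrArea')
print(overlap)

# Dummify Alley: rows with no alley access get 0 in every dummy column.
alley_dummies = pd.get_dummies(toy['Alley'], prefix='Alley').astype(int)
print(alley_dummies)
```

On the real data, `missing_overlap(full_df, 'MasVnrType', 'MasVnrArea')` would confirm or refute the claim that the two features are missing on the same rows.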

- - -

### **FEATURE DISTRIBUTION:**

**Quantitative Features:**<br>LotFrontage, LotArea, MasVnrArea

In [8]:
# Define the quantitative subset before using it
quant_features = full_df[['LotFrontage','LotArea','MasVnrArea']]
quant_features.corr(method='pearson')

In [None]:
quant_features.hist(bins=50, figsize=[20,15])
plt.show()

**CATEGORICAL FEATURES:**

In [None]:
# Count properties per neighborhood; the Series index holds the category labels
hood = full_df['Neighborhood'].value_counts()
plt.figure(figsize=[10,5])
plt.bar(x = hood.index, height = hood.values)
plt.xticks(rotation=90)
plt.show()

In [None]:
# Count properties per construction year; the Series index holds the years
year = full_df['YearBuilt'].value_counts()
plt.figure(figsize=[10,5])
plt.bar(x = year.index, height = year.values)
plt.xticks(rotation=90)
plt.show()

**VALUE COUNTS OF CATEGORICAL FEATURES:**

In [None]:
cat_columns = full_df.columns[1:28]
ex_list     = ['LotFrontage', 'LotArea', 'MasVnrArea', 'YearBuilt', 'YearRemodAdd', 'Neighborhood']
cat_columns = [i for i in cat_columns if i not in ex_list]
for feature in cat_columns:
    print(full_df[feature].value_counts())
    print('')
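
Reading through many value-count printouts by eye is error-prone, so a small helper can flag features dominated by a single category. This is a hedged sketch: `dominant_share`, the `THRESHOLD` cutoff, and the toy frame are all assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def dominant_share(series):
    """Fraction of non-missing rows taken by the most frequent category."""
    return float(series.value_counts(normalize=True).iloc[0])

# Toy frame for illustration only; with the real data, use full_df[cat_columns].
toy = pd.DataFrame({
    'Utilities': ['AllPub'] * 9 + ['NoSeWa'],
    'LotShape':  ['Reg'] * 6 + ['IR1'] * 4,
})

THRESHOLD = 0.85  # hypothetical cutoff for "near-constant"
shares = {col: dominant_share(toy[col]) for col in toy.columns}
near_constant = [col for col, share in shares.items() if share >= THRESHOLD]
print(near_constant)
```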

- - -

## **FLAGGED FEATURES** <br>
- **MSZoning**<br> 
(can transform to binary if viewed as Residential and Not Residential)<br><br>
- **Street**<br>
(can transform to binary with only two categories: Gravel and Paved)<br><br>
- **Utilities**<br>
(near-constant categorical feature: every property except _one_ lists all public utilities)<br><br>
- **Condition2**<br>
(secondary feature following a primary)<br><br>
- **Exterior2nd**<br> 
(secondary feature following a primary)<br><br>
- **MSSubClass**<br> 
(might _be replaced by_ HouseStyle)<br><br>
- **HouseStyle**<br> 
(might replace MSSubClass)<br><br>
- **LotFrontage**<br>
(missingness)<br><br>
- **MasVnrType**<br>
(missingness)<br><br>
- **MasVnrArea**<br>
(might be made redundant by size of house... Type might be a better indicator of Price)<br><br>
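
The binary transforms suggested for MSZoning and Street could look like the sketch below. It assumes the residential zoning codes are FV, RH, RL, RP and RM, read off the data description; the column names `IsResidential` and `IsPaved`, the helper `add_binary_flags`, and the toy frame are hypothetical.

```python
import pandas as pd

# Zoning codes described as residential in the data description (an assumption).
RESIDENTIAL_ZONES = {'FV', 'RH', 'RL', 'RP', 'RM'}

def add_binary_flags(df):
    """Return a copy of df with 0/1 flags derived from MSZoning and Street."""
    out = df.copy()
    # Any code outside the residential set (including unexpected ones) maps to 0.
    out['IsResidential'] = out['MSZoning'].isin(RESIDENTIAL_ZONES).astype(int)
    out['IsPaved'] = (out['Street'] == 'Pave').astype(int)
    return out

# Toy frame for illustration only; with the real data, pass full_df instead.
toy = pd.DataFrame({
    'MSZoning': ['RL', 'C', 'FV', 'RM'],
    'Street':   ['Pave', 'Grvl', 'Pave', 'Pave'],
})
flags = add_binary_flags(toy)
print(flags[['IsResidential', 'IsPaved']])
```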

- - - 

## **FEATURE DETAILS:**

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM	Residential Medium Density
	
LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

       Grvl	Gravel	
       Pave	Paved
       	
Alley: Type of alley access to property

       Grvl	Gravel
       Pave	Paved
       NA 	No alley access
		
LotShape: General shape of property

       Reg	Regular	
       IR1	Slightly irregular
       IR2	Moderately Irregular
       IR3	Irregular
       
LandContour: Flatness of the property

       Lvl	Near Flat/Level	
       Bnk	Banked - Quick and significant rise from street grade to building
       HLS	Hillside - Significant slope from side to side
       Low	Depression
		
Utilities: Type of utilities available
		
       AllPub	All public Utilities (E,G,W,& S)	
       NoSewr	Electricity, Gas, and Water (Septic Tank)
       NoSeWa	Electricity and Gas Only
       ELO	Electricity only	
	
LotConfig: Lot configuration

       Inside	Inside lot
       Corner	Corner lot
       CulDSac	Cul-de-sac
       FR2	Frontage on 2 sides of property
       FR3	Frontage on 3 sides of property
	
LandSlope: Slope of property
		
       Gtl	Gentle slope
       Mod	Moderate Slope	
       Sev	Severe Slope
	
Neighborhood: Physical locations within Ames city limits

       Blmngtn	Bloomington Heights
       Blueste	Bluestem
       BrDale	Briardale
       BrkSide	Brookside
       ClearCr	Clear Creek
       CollgCr	College Creek
       Crawfor	Crawford
       Edwards	Edwards
       Gilbert	Gilbert
       IDOTRR	Iowa DOT and Rail Road
       MeadowV	Meadow Village
       Mitchel	Mitchell
       NAmes	North Ames
       NoRidge	Northridge
       NPkVill	Northpark Villa
       NridgHt	Northridge Heights
       NWAmes	Northwest Ames
       OldTown	Old Town
       SWISU	South & West of Iowa State University
       Sawyer	Sawyer
       SawyerW	Sawyer West
       Somerst	Somerset
       StoneBr	Stone Brook
       Timber	Timberland
       Veenker	Veenker
			
Condition1: Proximity to various conditions
	
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad
	
Condition2: Proximity to various conditions (if more than one is present)
		
       Artery	Adjacent to arterial street
       Feedr	Adjacent to feeder street	
       Norm	Normal	
       RRNn	Within 200' of North-South Railroad
       RRAn	Adjacent to North-South Railroad
       PosN	Near positive off-site feature--park, greenbelt, etc.
       PosA	Adjacent to positive off-site feature
       RRNe	Within 200' of East-West Railroad
       RRAe	Adjacent to East-West Railroad
	
BldgType: Type of dwelling
		
       1Fam	Single-family Detached	
       2FmCon	Two-family Conversion; originally built as one-family dwelling
       Duplx	Duplex
       TwnhsE	Townhouse End Unit
       TwnhsI	Townhouse Inside Unit
	
HouseStyle: Style of dwelling
	
       1Story	One story
       1.5Fin	One and one-half story: 2nd level finished
       1.5Unf	One and one-half story: 2nd level unfinished
       2Story	Two story
       2.5Fin	Two and one-half story: 2nd level finished
       2.5Unf	Two and one-half story: 2nd level unfinished
       SFoyer	Split Foyer
       SLvl	Split Level
	
OverallQual: Rates the overall material and finish of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average
       5	Average
       4	Below Average
       3	Fair
       2	Poor
       1	Very Poor
	
OverallCond: Rates the overall condition of the house

       10	Very Excellent
       9	Excellent
       8	Very Good
       7	Good
       6	Above Average	
       5	Average
       4	Below Average	
       3	Fair
       2	Poor
       1	Very Poor
		
YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

       Flat	Flat
       Gable	Gable
       Gambrel	Gambrel (Barn)
       Hip	Hip
       Mansard	Mansard
       Shed	Shed
		
RoofMatl: Roof material

       ClyTile	Clay or Tile
       CompShg	Standard (Composite) Shingle
       Membran	Membrane
       Metal	Metal
       Roll	Roll
       Tar&Grv	Gravel & Tar
       WdShake	Wood Shakes
       WdShngl	Wood Shingles
		
Exterior1st: Exterior covering on house

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast	
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
	
Exterior2nd: Exterior covering on house (if more than one material)

       AsbShng	Asbestos Shingles
       AsphShn	Asphalt Shingles
       BrkComm	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       CemntBd	Cement Board
       HdBoard	Hard Board
       ImStucc	Imitation Stucco
       MetalSd	Metal Siding
       Other	Other
       Plywood	Plywood
       PreCast	PreCast
       Stone	Stone
       Stucco	Stucco
       VinylSd	Vinyl Siding
       Wd Sdng	Wood Siding
       WdShing	Wood Shingles
	
MasVnrType: Masonry veneer type

       BrkCmn	Brick Common
       BrkFace	Brick Face
       CBlock	Cinder Block
       None	None
       Stone	Stone
	
MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior 
		
       Ex	Excellent
       Gd	Good
       TA	Average/Typical
       Fa	Fair
       Po	Poor
		

In [None]:
featureString = 'MSSubClass,Ordinal,MSZoning,Nominal,LotFrontage,Quantitative,LotArea,Quantitative,Street,Nominal,Alley,Nominal,LotShape,Nominal,LandContour,Nominal,Utilities,Nominal,LotConfig,Nominal,LandSlope,Ordinal,Neighborhood,Nominal,Condition1,Nominal,Condition2,Nominal,BldgType,Nominal,HouseStyle,Nominal,OverallQual,Ordinal,OverallCond,Ordinal,YearBuilt,Ordinal,YearRemodAdd,Ordinal,RoofStyle,Nominal,RoofMatl,Nominal,Exterior1st,Nominal,Exterior2nd,Nominal,MasVnrType,Nominal,MasVnrArea,Quantitative,ExterQual,Ordinal'

In [None]:
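# Missing-value count for each of the 27 features (column 0 is Id, so start at 1)
missingValues    = full_df.iloc[:,1:28].isnull().sum()

# featureString alternates feature name, variable type
featureList      = featureString.split(sep=',')
feature27        = featureList[0::2]
type27           = featureList[1::2]

# One row per feature: its type, aligned with its missing count on the feature name
featType         = pd.DataFrame(type27, index=feature27)
featType         = pd.concat([featType, missingValues], axis=1)
featType         = featType.reset_index()
featType.columns = ['feature', 'type', 'missing']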

featType