<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">
# House Prices in Ames (Final Project EDA)
_Osama Alfaify_

---

### EDA Content
- [Data Source](#data-source)
	- [What Are the Features/Covariates/Predictors?](#what-are-the-featurescovariatespredictors)
	- [What Is the Outcome/Response?](#what-is-the-outcomeresponse)
	- [What Do You Think Each Row in the Data Set Represents?](#what-do-you-think-each-row-in-the-dataset-represents)
- [Math Review](#math-review)
	- [Covariance](#covariance)
	- [Correlation](#correlation)
	- [The Variance-Covariance Matrix](#the-variance-covariance-matrix)
- [Causation and Correlation](#causation-and-correlation)
	- [Structure of Causal Claims](#structure-of-causal-claims)
	- [Why Do We Care?](#why-do-we-care)
	- [How Do We Determine if Something is Causal?](#how-do-we-determine-if-something-is-causal)
- [The Pearlean Causal DAG Model](#pearlean-causal-dag-model)
	- [What Is a DAG?](#what-is-a-dag)
	- [X Causes Y](#its-possible-that-x-causes-y)
	- [Y Causes X](#y-causes-x)
	- [The Correlation Between X and Y Is Not Statistically Significant](#the-correlation-between-x-and-y-is-not-statistically-significant)
	- [X or Y May Cause One or the Other Indirectly Through Another Variable](#x-or-y-may-cause-one-or-the-other-indirectly-through-another-variable)
	- [There is a Third Common Factor That Causes Both X and Y](#there-is-a-third-common-factor-that-causes-both-x-and-y)
	- [X and Y Cause a Third Factor, But Our Data Collect the Third Factor Unevenly](#both-x-and-y-cause-a-third-variable-and-the-dataset-does-not-represent-that-third-variable-evenly)
	- [Controlled Experiments](#controlled-experiments)
	- [When Is it OK to Rely on Association?](#when-is-it-ok-to-rely-on-association)
	- [How Does Association Relate to Causation?](#how-does-association-relate-to-causation)
- [Sampling Bias](#sampling-bias)
	- [Forms of Sampling Bias](#forms-of-sampling-bias)
	- [Problems From Sampling Bias](#problems-from-sampling-bias)
	- [Recovering From Sampling Bias](#recovering-from-sampling-bias)
	- [Stratified Random Sampling](#stratified-random-sampling)
- [Missing Data](#missing-data)
	- [Types of Missing Data](#types-of-missing-data)
	- [De Minimis](#de-minimis)
	- [Class Imbalance](#class-imbalance)
	- [Relation to Machine Learning](#relation-to-machine-learning)
- [Introduction to Hypothesis Testing](#introduction-to-hypothesis-testing)
	- [Validate Your Findings](#validate-your-findings)
	- [Confidence Intervals](#confidence-intervals)
	- [Error Types](#error-types)
- [Scenario](#scenario)
	- [Exercises](#exercises)
	- [Statistical Tests](#statistical-tests)
	- [Interpret Your Results](#interpret-your-results)

### Load packages

In [76]:
import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

### Load the data

In [77]:
df_file = './all/train.csv'

In [78]:
df = pd.read_csv(df_file)

In [79]:
df.head()

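| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |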


In [80]:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()

In [82]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [83]:
y = df[['SalePrice']]
x = df[['Alley']]
df.Alley.unique()

array([nan, 'Grvl', 'Pave'], dtype=object)

In [91]:
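# Names of the string-valued (object dtype) columns; computed once, outside the loops.
lst_obj = [i for i in df.columns if df[i].dtype == 'object']

# For each Alley group, count the non-null entries in every object column.
for name, group in df.groupby('Alley'):
    for i in lst_obj:
        print(name, "Alley has this many type of object for", i, group[i].count())
    print('*'*30,'GROUP BREAK','*'*30)

# The original per-column fill reduced to this: the first column checked is
# 'MSZoning', not 'Fence', so every missing Alley value was filled with "Pave".
df['Alley'] = df['Alley'].fillna("Pave")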

Grvl Alley has this many type of object for MSZoning 50
Grvl Alley has this many type of object for Street 50
Grvl Alley has this many type of object for Alley 50
Grvl Alley has this many type of object for LotShape 50
Grvl Alley has this many type of object for LandContour 50
Grvl Alley has this many type of object for Utilities 50
Grvl Alley has this many type of object for LotConfig 50
Grvl Alley has this many type of object for LandSlope 50
Grvl Alley has this many type of object for Neighborhood 50
Grvl Alley has this many type of object for Condition1 50
Grvl Alley has this many type of object for Condition2 50
Grvl Alley has this many type of object for BldgType 50
Grvl Alley has this many type of object for HouseStyle 50
Grvl Alley has this many type of object for RoofStyle 50
Grvl Alley has this many type of object for RoofMatl 50
Grvl Alley has this many type of object for Exterior1st 50
Grvl Alley has this many type of object for Exterior2nd 50
Grvl Alley has this many type 

In [92]:
df.Alley.value_counts(dropna=False)

Pave    1410
Grvl      50
Name: Alley, dtype: int64
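
The counts above show that every previously missing `Alley` entry now reads `Pave`, so houses that never had an alley value can no longer be told apart from those recorded as paved. As a hedged alternative, the sketch below gives missing values their own label instead. It runs on a small toy frame with made-up rows, and the label `"NoAlley"` is an illustrative name chosen here, not something taken from the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Alley column (hypothetical rows, not the real data).
toy = pd.DataFrame({"Alley": [np.nan, "Grvl", "Pave", np.nan, np.nan]})

# Give missing values an explicit label so they stay distinguishable
# from the observed categories. "NoAlley" is an illustrative name.
toy["Alley_filled"] = toy["Alley"].fillna("NoAlley")

# Count each category, including the new explicit label.
counts = toy["Alley_filled"].value_counts()
print(counts)
```

Writing to a new column keeps the original values available for comparison; whether a separate category is appropriate depends on what a missing value means for the feature in question.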

In [41]:
pd.get_dummies(df.Alley).sum()

Grvl    50
Pave    41
dtype: int64

In [39]:
lr.fit(pd.get_dummies(df.Alley), y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
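
A regression on dummy columns alone can only predict one value per category, and with least squares that value is the category's mean target. The sketch below illustrates this on toy data with made-up prices. It uses NumPy's least-squares solver without an intercept rather than `LinearRegression`, so that each coefficient can be read directly as a group mean; that substitution is a choice made for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical prices) standing in for df.Alley and SalePrice.
toy = pd.DataFrame({
    "Alley": ["Grvl", "Grvl", "Pave", "Pave", "Pave"],
    "SalePrice": [100.0, 120.0, 150.0, 170.0, 190.0],
})

# One indicator column per category, cast to float for the solver.
X = pd.get_dummies(toy["Alley"]).astype(float)

# Least squares with no intercept: each coefficient is that group's mean price.
coefs, *_ = np.linalg.lstsq(X.to_numpy(), toy["SalePrice"].to_numpy(), rcond=None)
fitted = pd.Series(coefs, index=X.columns)

# Per-group means computed directly, for comparison.
group_means = toy.groupby("Alley")["SalePrice"].mean()
print(fitted)
print(group_means)
```

With an intercept and a full set of dummy columns the individual coefficients are no longer uniquely determined, although the fitted predictions are still the group means.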

In [93]:
df.isnull().sum()

Id                  0
MSSubClass          0
MSZoning            0
LotFrontage       259
LotArea             0
Street              0
Alley               0
LotShape            0
LandContour         0
Utilities           0
LotConfig           0
LandSlope           0
Neighborhood        0
Condition1          0
Condition2          0
BldgType            0
HouseStyle          0
OverallQual         0
OverallCond         0
YearBuilt           0
YearRemodAdd        0
RoofStyle           0
RoofMatl            0
Exterior1st         0
Exterior2nd         0
MasVnrType          8
MasVnrArea          8
ExterQual           0
ExterCond           0
Foundation          0
                 ... 
BedroomAbvGr        0
KitchenAbvGr        0
KitchenQual         0
TotRmsAbvGrd        0
Functional          0
Fireplaces          0
FireplaceQu       690
GarageType         81
GarageYrBlt        81
GarageFinish       81
GarageCars          0
GarageArea          0
GarageQual         81
GarageCond         81
PavedDrive

In [21]:
pd.get_dummies(df.Street, prefix='Str')

| | Str_Grvl | Str_Pave |
| --- | --- | --- |
| 0 | 0 | 1 |
| 1 | 0 | 1 |
| 2 | 0 | 1 |
| 3 | 0 | 1 |
| 4 | 0 | 1 |
| 5 | 0 | 1 |
| 6 | 0 | 1 |
| 7 | 0 | 1 |
| 8 | 0 | 1 |
| 9 | 0 | 1 |
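
For a two-level feature such as `Street`, the two indicator columns always sum to 1, so one of them carries no extra information. The sketch below, on a toy series with made-up values, shows how `drop_first=True` keeps a single indicator.

```python
import pandas as pd

# Toy stand-in for df.Street (hypothetical rows).
street = pd.Series(["Pave", "Pave", "Grvl", "Pave"], name="Street")

# Full encoding: one column per category.
full = pd.get_dummies(street, prefix="Str")

# Compact encoding: the first category (alphabetically) is dropped
# and becomes the implicit baseline.
compact = pd.get_dummies(street, prefix="Str", drop_first=True)

print(full.columns.tolist())
print(compact.columns.tolist())
```

Dropping a level mainly matters for linear models fitted with an intercept, where the full set of indicators is perfectly collinear with the constant term.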


In [8]:
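# Number of missing values per column, largest first.
total = df.isnull().sum().sort_values(ascending=False)
# Share of missing values per column, in percent of all rows.
percent_1 = df.isnull().sum()/df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
# Combine both into one table; concat aligns the two Series on column name.
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(81)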

| | Total | % |
| --- | --- | --- |
| PoolQC | 1453 | 99.5 |
| MiscFeature | 1406 | 96.3 |
| Alley | 1369 | 93.8 |
| Fence | 1179 | 80.8 |
| FireplaceQu | 690 | 47.3 |
| LotFrontage | 259 | 17.7 |
| GarageCond | 81 | 5.5 |
| GarageType | 81 | 5.5 |
| GarageYrBlt | 81 | 5.5 |
| GarageFinish | 81 | 5.5 |
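
The same summary can be wrapped in a small reusable helper. The sketch below is one possible version, run on a toy frame with made-up values; the function name `missing_summary` is hypothetical. It relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (hypothetical values).
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})

def missing_summary(frame):
    """Return missing-value counts and percentages per column, largest first."""
    mask = frame.isnull()
    total = mask.sum()
    # The mean of a boolean column is the fraction of True values.
    percent = (mask.mean() * 100).round(1)
    summary = pd.DataFrame({"Total": total, "%": percent})
    return summary.sort_values("Total", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Sorting once, after both columns are assembled, avoids keeping two separately sorted Series in step.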


In [9]:
[df.dtypes.unique()]

[array([dtype('int64'), dtype('O'), dtype('float64')], dtype=object)]

In [119]:
df.shape

(1460, 81)

In [11]:
df.describe()

| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |
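
By default `describe()` on a mixed frame reports only the numeric columns, so string-valued features such as `MSZoning` and `Street` do not appear in the table above. A minimal sketch of summarising the non-numeric columns separately, on a toy frame with made-up rows:

```python
import pandas as pd

# Toy frame mixing string and numeric columns (hypothetical rows).
toy = pd.DataFrame({
    "Street": ["Pave", "Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
    "MSZoning": ["RL", "RL", "RM", "RL"],
})

# Keep only the non-numeric columns, then describe them.
# For such columns describe() reports count, unique, top and freq.
cat_summary = toy.select_dtypes(exclude="number").describe()
print(cat_summary)
```

Here `top` is the most frequent value and `freq` is how often it occurs, which gives a quick view of how dominant a single category is.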
