In [1]:
import pandas as pd

pd.set_option('max_colwidth', 200)

pd.set_option('display.float_format', lambda x: '%.3f' % x)

from scipy import stats  # provides stats.ttest_1samp, stats.f_oneway, stats.chi2_contingency
from statsmodels.stats.weightstats import ttest_ind, ztest

city_hall_dataset = pd.read_csv("./train.csv")

In [3]:
city_hall_dataset.shape

(1460, 81)

>*'%.3f' % x*
The .3f specifies that the value is formatted as a float with 3 decimal places.
The % operator substitutes the value of x into the placeholder in the format string.
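
As a small illustration of the formatting note above, the sketch below applies the same format string to a made-up value and shows the equivalent f-string. The variable names are chosen here for illustration only.

```python
x = 3.14159  # made-up value for illustration

old_style = '%.3f' % x   # printf-style formatting, as in the display.float_format lambda
f_string = f'{x:.3f}'    # equivalent f-string formatting

print(old_style, f_string)
```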



The basic process of a statistical test is the following:

-   State a Null Hypothesis (most often: "the two values are not different").
-   State an Alternative Hypothesis (most often: "the two values are different").
-   Define a significance level alpha (most often 0.05, which corresponds to a 95% confidence level). The lower alpha is, the harder it will be to validate the Alternative Hypothesis, but the more confident we will be if we do validate it.
-   Depending on the data at our disposal, choose the relevant test (Z-test, T-test, etc.; more on that later).
-   The test computes a score (the test statistic), which is converted into a p-value.
-   If the p-value is below alpha (0.05), we reject the Null Hypothesis in favor of the Alternative Hypothesis. If it is above, we have to stick with the Null Hypothesis (we "fail to reject the Null Hypothesis").
-   There's a built-in function for most statistical tests out there.
-   Let's also build our own function to summarize all the information.
-   All tests we will conduct from now on use alpha = 0.05 (a 95% confidence level).

In [6]:
def results(p):
    # Reject the null hypothesis when the p-value is below 0.05
    if p['p_value'] < 0.05:
        p['hypothesis_accepted'] = 'alternative'
    else:
        p['hypothesis_accepted'] = 'null'

    # Build a one-row summary table from the dictionary
    df = pd.DataFrame(p, index=[''])
    cols = ['value1', 'value2', 'score', 'p_value', 'hypothesis_accepted']
    return df[cols]

# Two-tailed and One-tailed
- Two-tailed tests are used to show that two values are simply "different".
- One-tailed tests are used to show that one value is either "larger" or "smaller" than another one.

This has an influence on the p-value: for a one-tailed test, the two-tailed p-value has to be divided by 2 (provided the observed difference goes in the direction stated by the Alternative Hypothesis).

Most of the functions we'll use (those from the statsmodels weightstats module) do that by themselves if we pass the right information in the parameters.
We'll have to do it on our own with functions from the scipy module.
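
The relation between one-tailed and two-tailed p-values can be sketched with the standard library alone. The helper `z_p_value` and the score 1.8 are hypothetical, chosen for illustration; the helper assumes the score follows a standard normal distribution, as in a Z-test.

```python
from statistics import NormalDist

def z_p_value(z, alternative="two-sided"):
    """Hypothetical helper: p-value of a z-score under a standard normal."""
    cdf = NormalDist().cdf
    if alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))  # both tails
    if alternative == "larger":
        return 1 - cdf(z)             # upper tail only
    if alternative == "smaller":
        return cdf(z)                 # lower tail only
    raise ValueError(f"unknown alternative: {alternative}")

z = 1.8  # made-up score for illustration
p_two = z_p_value(z)
p_larger = z_p_value(z, "larger")
p_smaller = z_p_value(z, "smaller")
print(p_two, p_larger, p_smaller)
```

With a positive score, the "larger" p-value is exactly half the two-tailed one, while the "smaller" p-value is close to 1. This is why halving is only meaningful when the score points in the direction of the Alternative Hypothesis.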

# Types of tests
There are different types of tests. Here are the ones we will cover:

- T-tests. Used for small sample sizes (n<30), and when the population's standard deviation is unknown.
- Z-tests. Used for large sample sizes (n>=30), and when the population's standard deviation is known.
- F-tests. Used for comparing the means of more than two groups.
- Chi-square. Used for comparing categorical data.

# Normal distribution
Also, most tests (the parametric ones) require a population that is normally distributed.
This is not the case for SalePrice, which we'll use for most tests, but we can fix this by log-transforming the variable.
Note that to go back to our original scale and compare values with our $120 000, we'll need to exponentiate the values back.

In [7]:
import numpy as np
city_hall_dataset['SalePrice'] = np.log1p(city_hall_dataset['SalePrice'])
logged_budget = np.log1p(120000) #logged $120 000 is 11.695
logged_budget

11.695255355062795

np.log1p(120000) = ln(1 + 120000), i.e. the exponent to which e must be raised to obtain 1 + x.
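
To go back to the original scale, the inverse of log1p is expm1. The sketch below uses the standard library `math` module so it runs on its own; `np.expm1` is the NumPy counterpart for arrays and Series.

```python
import math

budget = 120_000
logged = math.log1p(budget)     # ln(1 + 120000)
restored = math.expm1(logged)   # exp(logged) - 1, back on the dollar scale

print(round(logged, 3), round(restored))
```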

Practice
So let's say we are ready to dive into the data, but not ready to pay the small fee for the large sample size.
We'll be starting with the free samples of 25 observations.

In [8]:
sample = city_hall_dataset.sample(n=25)
p = {} #dictionary we'll use to store information and results

In [9]:
sample.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
576,577,50,RL,52.0,6292,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2009,WD,Normal,11.884
474,475,120,RL,41.0,5330,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,12.433
373,374,20,RL,79.0,10634,Pave,,Reg,Lvl,AllPub,...,0,,GdWo,,0,11,2009,WD,Normal,11.72
522,523,50,RM,50.0,5000,Pave,,Reg,Lvl,AllPub,...,0,,,,0,10,2006,WD,Normal,11.977
79,80,50,RM,60.0,10440,Pave,Grvl,Reg,Lvl,AllPub,...,0,,MnPrv,,0,5,2009,WD,Normal,11.608


One sample T-test | Two-tailed | Means
So the first question we want to ask is: how does our $120 000 compare with the average Ames house SalePrice?
In other words, is 120 000 (11.7 logged) any different from the mean SalePrice of the population?
To find out from a 25-observation sample, we need to use a One Sample T-Test.

Null Hypothesis : Mean SalePrice = 11.695
Alternative Hypothesis : Mean SalePrice ≠ 11.695
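
As a rough sketch of what the score of a one-sample T-test measures, the statistic can be computed by hand as the distance between the sample mean and the hypothesized mean, expressed in standard errors. The five logged prices below are made up for illustration. Turning the score into a p-value requires the Student t distribution, which this standard-library sketch does not cover.

```python
from math import sqrt
from statistics import mean, stdev

sample = [11.9, 12.1, 12.0, 11.8, 12.2]  # made-up logged prices, not from the dataset
mu0 = 11.695                             # hypothesized mean (the logged budget)

n = len(sample)
standard_error = stdev(sample) / sqrt(n)         # stdev uses the n - 1 denominator
t_score = (mean(sample) - mu0) / standard_error  # degrees of freedom: n - 1

print(round(t_score, 3))
```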

In [10]:
p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,12.006,11.695,3.609,0.001,alternative

One sample T-test | One-tailed | Means
The sample mean is above our budget, so let's test whether the mean SalePrice is significantly larger than 11.695.

Null Hypothesis : Mean SalePrice <= 11.695
Alternative Hypothesis : Mean SalePrice > 11.695


In [11]:
p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)
p['p_value'] = p['p_value']/2 #one-tailed test: the scipy function returns a two-tailed p-value, so we halve it ourselves (valid because the score is positive, as the alternative states)
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,12.006,11.695,3.609,0.001,alternative

Two sample T-test | Two-tailed | Means
We now split the houses into two halves by living area (GrLivArea) and draw a sample of 25 observations from each half.


In [12]:
smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=25) 
larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=25)


The : in [:730] is Python's slicing operator, used to select part of a list, an array, or a pandas DataFrame.

Slicing syntax
The general form of a slice is start:stop:step. Writing [:730] uses the slicing operator without specifying start or step:

start (optional): the index of the first row to select. If omitted, slicing starts at the beginning (index 0).
stop: the index of the first row that is not selected. So [:730] selects the rows at positions 0 through 729.
step (optional): the interval between selected rows. If omitted, the default of 1 is used.
Now we first want to know if the two samples, extracted from two different populations, have significant differences in their average SalePrice.

Null Hypothesis : SalePrice of smaller houses = SalePrice of larger houses
Alternative Hypothesis : SalePrice of smaller houses ≠ SalePrice of larger houses

In [13]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'])
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,11.776,12.276,-5.601,0.0,alternative



Two sample T-test | One-tailed | Means
Unsurprisingly, larger houses have a higher SalePrice.
Let's confirm this with a one-tailed test.

Null Hypothesis : SalePrice of smaller houses >= SalePrice of larger houses
Alternative Hypothesis : SalePrice of smaller houses < SalePrice of larger houses

In [14]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,11.776,12.276,-5.601,0.0,alternative

Two sample Z-test | One-tailed | Means
With samples of 100 observations from each half, we can run the same one-tailed comparison as a Z-test.


In [15]:
smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=100, random_state=1)
larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=100, random_state=1) # random_state fixes the seed, so the same sample is drawn on every run

In [16]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,11.786,12.249,-10.772,0.0,alternative


Two sample Z-test | One-tailed | Proportions
Instead of means, we can also run tests on proportions.
Is the proportion of houses over $120 000 higher in the larger houses population than in the smaller houses population?

Null Hypothesis : Proportion of smaller houses with SalePrice over 11.695 >= Proportion of larger houses with SalePrice over 11.695
Alternative Hypothesis : Proportion of smaller houses with SalePrice over 11.695 < Proportion of larger houses with SalePrice over 11.695

In [17]:
from statsmodels.stats.proportion import proportions_ztest
A1 = len(smaller_houses[smaller_houses.SalePrice>logged_budget])
B1 = len(smaller_houses)
A2 = len(larger_houses[larger_houses.SalePrice>logged_budget])
B2 = len(larger_houses)
p['value1'], p['value2'] = A1/B1, A2/B2
p['score'], p['p_value'] = proportions_ztest([A1, A2], [B1, B2], alternative='smaller')
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,0.67,0.95,-5.047,0.0,alternative
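
As a cross-check, the two-proportion score can be reproduced by hand. The sketch below assumes the usual pooled standard error for a two-sample proportions Z-test, and the counts (67 and 95 out of 100) are inferred from the proportions reported above.

```python
from math import sqrt
from statistics import NormalDist

count_small, n_small = 67, 100   # smaller houses above the logged budget
count_large, n_large = 95, 100   # larger houses above the logged budget

p_small = count_small / n_small
p_large = count_large / n_large
pooled = (count_small + count_large) / (n_small + n_large)

# Standard error under the null hypothesis of equal proportions
se = sqrt(pooled * (1 - pooled) * (1 / n_small + 1 / n_large))
z = (p_small - p_large) / se
p_value = NormalDist().cdf(z)    # lower tail, matching alternative='smaller'

print(round(z, 3), p_value)
```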


One sample Z-test | One-tailed | Means
So now let's see how our $120 000 (11.7 logged) compares with smaller houses only, based on the 100-observation sample.

Null Hypothesis : Mean SalePrice of smaller houses <= 11.695
Alternative Hypothesis : Mean SalePrice of smaller houses > 11.695

In [18]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], value=logged_budget, alternative='larger')
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,11.786,11.695,3.593,0.0,alternative




alternative='larger': the alternative hypothesis is that the mean SalePrice is larger than logged_budget \( \mu_{\text{SalePrice}} > \text{logged\_budget} \).
alternative='smaller': the alternative hypothesis is that the mean SalePrice is smaller than logged_budget \( \mu_{\text{SalePrice}} < \text{logged\_budget} \).
In both cases, the z-test determines whether the sample mean differs significantly from logged_budget in the stated direction.

One sample Z-test | One-tailed | Proportions
Our $120 000 does not seem too far from the average SalePrice of smaller houses though.
Let's see if more than 25% of smaller houses have a SalePrice within our budget.

Null Hypothesis : Proportion of smaller houses with SalePrice under 11.695 <= 25%
Alternative Hypothesis : Proportion of smaller houses with SalePrice under 11.695 > 25%

In [31]:

A = len(smaller_houses[smaller_houses.SalePrice<logged_budget])
B = len(smaller_houses)
p['value1'], p['value2'] = A/B, 0.25
p['score'], p['p_value'] = proportions_ztest(A, B, alternative='larger', value=0.25)
results(p)

Unnamed: 0,value1,value2,score,p_value,hypothesis_accepted
,0.32,0.25,1.501,0.067,null
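
The score and p-value reported above can be approximated by hand. This sketch assumes the standard error is computed from the sample proportion, which is understood to be the default behaviour of proportions_ztest. The count of 32 out of 100 is inferred from the reported proportion.

```python
from math import sqrt
from statistics import NormalDist

count, nobs = 32, 100   # smaller houses under the logged budget
p0 = 0.25               # proportion stated in the null hypothesis

p_hat = count / nobs
se = sqrt(p_hat * (1 - p_hat) / nobs)  # assumption: variance from the sample proportion
z = (p_hat - p0) / se
p_value = 1 - NormalDist().cdf(z)      # upper tail, matching alternative='larger'

print(round(z, 3), round(p_value, 3))
```

Since the p-value is above 0.05, the null hypothesis is not rejected. Using the null proportion in the standard error instead would give a somewhat different score.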


F-test (ANOVA)
The House Price Dataset has an MSZoning variable, which identifies the general zoning classification of the house.
For instance, it lets you know if the house is situated in a residential or a commercial zone.

We'll therefore try to find out if there is a significant difference in SalePrice based on the zoning,
and then where we are more likely to be able to live with our budget.
Based on the 100-observation sample of smaller houses, let's first get an overview of the mean SalePrice by zone.

In [27]:
replacement = {'FV': "Floating Village Residential", 'C (all)': "Commercial", 'RH': "Residential High Density",
              'RL': "Residential Low Density", 'RM': "Residential Medium Density"}
smaller_houses['MSZoning_FullName'] = smaller_houses['MSZoning'].replace(replacement)
mean_price_by_zone = smaller_houses.groupby('MSZoning_FullName')['SalePrice'].mean().to_frame() # to_frame converts the Series into a DataFrame
mean_price_by_zone

Unnamed: 0_level_0,SalePrice
MSZoning_FullName,Unnamed: 1_level_1
Commercial,11.59
Floating Village Residential,12.03
Residential High Density,11.705
Residential Low Density,11.828
Residential Medium Density,11.617


To know if there is a significant difference between these values, we run an ANOVA test (because there are more than 2 values to compare).
The test won't be able to tell us which groups differ from the others, but at least we'll know whether there is a difference or not.

Null Hypothesis : No difference between SalePrice means
Alternative Hypothesis : Difference between SalePrice means
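
To make the ANOVA score concrete, the F statistic is the ratio of the variance between group means to the variance within groups. The sketch below computes it by hand on three small made-up groups. The helper name `f_statistic` and the data are hypothetical, and the p-value (which requires the F distribution) is left out.

```python
from statistics import mean

def f_statistic(groups):
    """Hypothetical helper: one-way ANOVA F statistic for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data
f_score = f_statistic(groups)
print(f_score)
```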

In [28]:
sh = smaller_houses.copy()
p['score'], p['p_value'] = stats.f_oneway(sh.loc[sh.MSZoning=='FV', 'SalePrice'], 
               sh.loc[sh.MSZoning=='C (all)', 'SalePrice'],
               sh.loc[sh.MSZoning=='RH', 'SalePrice'],
               sh.loc[sh.MSZoning=='RL', 'SalePrice'],
               sh.loc[sh.MSZoning=='RM', 'SalePrice'],)
results(p)[['score', 'p_value', 'hypothesis_accepted']]

Unnamed: 0,score,p_value,hypothesis_accepted
,4.146,0.004,alternative


There is a difference between SalePrices based on where the house is located.
Looking at the average SalePrice by zone, Commercial and Residential Medium Density zones seem to be the most affordable for our budget.

Chi-square test
One last question we'll address: can we get a garage? If yes, what type of garage?
If not, then we won't bother saving up for a car, and we'll try to get a house next to public transportation.
The dataset contains a categorical variable, GarageType, that will help us answer the question.


In [30]:
smaller_houses.fillna({'GarageType': 'No Garage'}, inplace=True)  # a missing GarageType means the house has no garage
smaller_houses['GarageType'].value_counts().to_frame()

Unnamed: 0_level_0,count
GarageType,Unnamed: 1_level_1
Attchd,46
Detchd,41
No Garage,10
CarPort,2
Basment,1


Suppose we can get a house among at least the bottom 25% of smaller houses (the proportion test above fell just short of significance, with p = 0.067).
We would ideally like to know if the distribution of garage types among these 25% is different from that in the three other quarters.
We are now friends with the City Hall, so we can ask them one last favor:
split the smaller houses population in 4 based on surface, and give us a sample of each quarter.
Because we are working here with categorical data, we'll run a Chi-Square test.

In [32]:
city_hall_dataset.fillna({'GarageType': 'No Garage'}, inplace=True)  # a missing GarageType means the house has no garage
by_area = city_hall_dataset.sort_values('GrLivArea')  # sort once, then slice each quarter of the smaller half
sample1 = by_area[:183].sample(n=100)
sample2 = by_area[183:366].sample(n=100)
sample3 = by_area[366:549].sample(n=100)
sample4 = by_area[549:730].sample(n=100)
dff = pd.concat([
    sample1['GarageType'].value_counts().to_frame(),
    sample2['GarageType'].value_counts().to_frame(), 
    sample3['GarageType'].value_counts().to_frame(), 
    sample4['GarageType'].value_counts().to_frame()], 
    axis=1, sort=False)
dff.columns = ['Sample1 (smallest houses)', 'Sample2', 'Sample3', 'Sample4 (largest houses)']
dff

Unnamed: 0_level_0,Sample1 (smallest houses),Sample2,Sample3,Sample4 (largest houses)
GarageType,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Detchd,52.0,40.0,31.0,28.0
Attchd,28.0,46.0,60.0,56.0
No Garage,18.0,8.0,6.0,7.0
CarPort,1.0,,,2.0
Basment,1.0,3.0,,1.0
BuiltIn,,2.0,3.0,6.0
2Types,,1.0,,


In [33]:
dff = dff[:3] #the chi-square test needs a count in every cell (no missing values) and is unreliable for very sparse cells, so we keep only the three most frequent categories
dff 

Unnamed: 0_level_0,Sample1 (smallest houses),Sample2,Sample3,Sample4 (largest houses)
GarageType,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Detchd,52.0,40.0,31.0,28.0
Attchd,28.0,46.0,60.0,56.0
No Garage,18.0,8.0,6.0,7.0


Null Hypothesis : No difference between GarageType distribution
Alternative Hypothesis : Difference between GarageType distribution
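
To show what the Chi-Square score measures, the sketch below recomputes it by hand for the 3x4 table of counts shown above. Expected counts are derived from the row and column totals under the assumption that garage type and sample are independent. The p-value, which requires the chi-square distribution, is left out of this standard-library sketch.

```python
# Observed counts, one row per garage category, one column per sample
observed = [
    [52, 40, 31, 28],  # Detchd
    [28, 46, 60, 56],  # Attchd
    [18, 8, 6, 7],     # houses with no garage type recorded
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count of a cell if rows and columns were independent
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(col_totals) - 1)  # (rows - 1) * (columns - 1)

print(round(chi2, 3), dof)
```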

In [34]:
p['score'], p['p_value'], p['ddf'], p['contingency'] = stats.chi2_contingency(dff)  # score, p-value, degrees of freedom, expected counts
p

{'value1': 0.32,
 'value2': 0.25,
 'score': 30.76589721728685,
 'p_value': 2.8095678196183777e-05,
 'hypothesis_accepted': 'null',
 'df': 48.0,
 'ddf': 6,
 'contingency': array([[38.94210526, 37.35263158, 38.54473684, 36.16052632],
        [49.        , 47.        , 48.5       , 45.5       ],
        [10.05789474,  9.64736842,  9.95526316,  9.33947368]])}

In [35]:
p.pop('contingency')  # drop the expected-counts array so results() can build its one-row table
p

{'value1': 0.32,
 'value2': 0.25,
 'score': 30.76589721728685,
 'p_value': 2.8095678196183777e-05,
 'hypothesis_accepted': 'null',
 'df': 48.0,
 'ddf': 6}

In [36]:
results(p)[['score', 'p_value', 'hypothesis_accepted']]

Unnamed: 0,score,p_value,hypothesis_accepted
,30.766,0.0,alternative
