<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Theory" data-toc-modified-id="Theory-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Theory</a></span><ul class="toc-item"><li><span><a href="#The-process" data-toc-modified-id="The-process-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>The process</a></span></li><li><span><a href="#Two-tailed-and-One-tailed" data-toc-modified-id="Two-tailed-and-One-tailed-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Two-tailed and One-tailed</a></span></li><li><span><a href="#Types-of-tests" data-toc-modified-id="Types-of-tests-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Types of tests</a></span></li><li><span><a href="#Normal-distribution" data-toc-modified-id="Normal-distribution-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Normal distribution</a></span></li></ul></li><li><span><a href="#Practice" data-toc-modified-id="Practice-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Practice</a></span><ul class="toc-item"><li><span><a href="#One-sample-T-test-|-Two-tailed-|-Means" data-toc-modified-id="One-sample-T-test-|-Two-tailed-|-Means-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>One sample T-test | Two-tailed | Means</a></span></li><li><span><a href="#One-sample-T-test-|-One-tailed-|-Means" data-toc-modified-id="One-sample-T-test-|-One-tailed-|-Means-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>One sample T-test | One-tailed | Means</a></span></li><li><span><a href="#Two-sample-T-test-|-Two-tailed-|-Means" data-toc-modified-id="Two-sample-T-test-|-Two-tailed-|-Means-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Two sample T-test | Two-tailed | Means</a></span></li><li><span><a href="#Two-sample-T-test-|-One-tailed-|-Means" data-toc-modified-id="Two-sample-T-test-|-One-tailed-|-Means-3.4"><span 
class="toc-item-num">3.4&nbsp;&nbsp;</span>Two sample T-test | One-tailed | Means</a></span></li><li><span><a href="#Two-sample-Z-test-|-One-tailed-|-Means" data-toc-modified-id="Two-sample-Z-test-|-One-tailed-|-Means-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>Two sample Z-test | One-tailed | Means</a></span></li><li><span><a href="#Two-sample-Z-test-|-One-tailed-|-Proportions" data-toc-modified-id="Two-sample-Z-test-|-One-tailed-|-Proportions-3.6"><span class="toc-item-num">3.6&nbsp;&nbsp;</span>Two sample Z-test | One-tailed | Proportions</a></span></li><li><span><a href="#One-sample-Z-test-|-One-tailed-|-Means" data-toc-modified-id="One-sample-Z-test-|-One-tailed-|-Means-3.7"><span class="toc-item-num">3.7&nbsp;&nbsp;</span>One sample Z-test | One-tailed | Means</a></span></li><li><span><a href="#One-sample-Z-test-|-One-tailed-|-Proportions" data-toc-modified-id="One-sample-Z-test-|-One-tailed-|-Proportions-3.8"><span class="toc-item-num">3.8&nbsp;&nbsp;</span>One sample Z-test | One-tailed | Proportions</a></span></li><li><span><a href="#F-test-(ANOVA)" data-toc-modified-id="F-test-(ANOVA)-3.9"><span class="toc-item-num">3.9&nbsp;&nbsp;</span>F-test (ANOVA)</a></span></li><li><span><a href="#Chi-square-test" data-toc-modified-id="Chi-square-test-3.10"><span class="toc-item-num">3.10&nbsp;&nbsp;</span>Chi-square test</a></span></li></ul></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

Let's say we are planning to move to Ames, Iowa, with a \$120 000 budget to buy a house. We have no idea about the real estate market in the city. However, the City Hall owns a precious piece of information : the House Price Dataset. It contains about 1500 rows of data about houses in the city, with attributes like Sale Price, Living Area, Garage Type, etc. The bad news is that we cannot access the entire database : it is too expensive. The good news is that the City Hall offers samples : free for up to 25 observations, and for a small fee for up to 100 observations. So we'll make use of this great offer to learn a bit more about the real estate market, and understand what we can get for our money. And guess what? This will be a good way to go through statistical tests. Note that we won't go too deep into the theory here. This notebook's main goal is to give an overview of which statistical test to use depending on the situation faced, and how to use it.

In [1]:
import pandas as pd
from scipy import stats  # provides stats.ttest_1samp, stats.f_oneway, stats.chi2_contingency
from statsmodels.stats.weightstats import ttest_ind, ztest

pd.set_option('display.max_colwidth', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
#this is the entire dataset, but we'll only use it to extract samples.
FILE_PATH = 'data/house-prices-advanced-regression-techniques/train.csv'
city_hall_dataset = pd.read_csv(FILE_PATH)

# Introduction

# Theory

What we will be trying to do in this tutorial is draw conclusions about the whole population of houses based only on the samples at our disposal.<br>
This is what statistical tests do, but one must know a few principles before using them.

## The process

The basic process of statistical tests is the following : 
- Stating a Null Hypothesis (most often : "the two values are not different")
- Stating an Alternative Hypothesis (most often : "the two values are different")
- Choosing a significance level alpha (most often : 0.05, which corresponds to a 95% confidence level). The lower alpha is, the harder it will be to validate the Alternative Hypothesis, but the more confident we will be if we do validate it.
- Depending on the data at our disposal, choosing the relevant test (Z-test, T-test, etc. More on that later)
- Running the test, which computes a score and the corresponding p-value : the probability of getting a score at least as extreme as ours if the Null Hypothesis were true.
- If the p-value is below alpha (0.05), we reject the Null Hypothesis in favor of the Alternative Hypothesis. If it is equal to or above alpha, we stick with the Null Hypothesis (or rather, we "fail to reject the Null Hypothesis").


There's a built-in function for most statistical tests out there.<br>
Let's also build our own function to summarize all the information.<br>
All tests we will conduct from now on use alpha = 0.05 (a 95% confidence level).

In [3]:
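def results(p, alpha=0.05):
    """Summarize a test in a one-row DataFrame and say which hypothesis we keep."""
    # below the significance level we reject the null hypothesis, otherwise we keep it
    p['hypothesis_accepted'] = 'alternative' if p['p_value'] < alpha else 'null'

    df = pd.DataFrame(p, index=[''])
    cols = ['value1', 'value2', 'score', 'p_value', 'hypothesis_accepted']
    return df[cols]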

## Two-tailed and One-tailed

Two-tailed tests are used to show that two values are just "different".<br>
One-tailed tests are used to show that one value is either "larger" or "lower" than another one.<br><br>
This has an influence on the p-value : for a one-tailed test, the two-tailed p-value has to be divided by 2 (provided the score points in the direction of the Alternative Hypothesis).<br>
<br>
Most of the functions we'll use (those from the statsmodels `weightstats` module) do that by themselves if we set the `alternative` parameter.<br>
We'll have to do it on our own with the functions from the scipy module.
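To see where the "divide by 2" rule comes from, here is a minimal sketch using only Python's standard library. It assumes a score that follows a standard normal distribution (as in a Z-test), and the function name `p_values_from_z` is made up for this illustration : it is not part of scipy or statsmodels.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return the (two-tailed, 'larger', 'smaller') p-values of a z score."""
    std_normal = NormalDist()            # mean 0, standard deviation 1
    upper_tail = 1 - std_normal.cdf(z)   # P(Z >= z), used when H1 is "larger"
    lower_tail = std_normal.cdf(z)       # P(Z <= z), used when H1 is "smaller"
    two_tailed = 2 * min(upper_tail, lower_tail)
    return two_tailed, upper_tail, lower_tail

two_tailed, larger, smaller = p_values_from_z(2.0)
print(round(two_tailed, 4), round(larger, 4), round(smaller, 4))
```

Halving the two-tailed p-value only gives the one-tailed p-value for the tail in which the score actually falls. For the opposite tail, the p-value is 1 minus that half, which is why the sign of the score should be checked before dividing by 2.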

## Types of tests

There are different types of tests, here are the ones we will cover : 
- T-tests. Used for small sample sizes (n<30) when the population's standard deviation is unknown.
- Z-tests. Used for large sample sizes (n>=30), or when the population's standard deviation is known.
- F-tests (ANOVA). Used for comparing the means of more than two groups.
- Chi-square tests. Used for comparing the distributions of categorical data.
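The list above can be read as a small decision rule. The sketch below is only a rule of thumb written for this tutorial : `suggest_test` and its parameters are hypothetical names, and real-life choices also depend on the assumptions of each test.

```python
def suggest_test(data_kind, n_groups, sample_size):
    """Suggest a test family from the rules of thumb listed in this section."""
    if data_kind == "categorical":
        return "Chi-square test"
    if n_groups > 2:
        return "F-test (ANOVA)"
    if sample_size >= 30:
        return "Z-test"
    return "T-test"

print(suggest_test("numeric", n_groups=1, sample_size=25))
print(suggest_test("numeric", n_groups=2, sample_size=100))
print(suggest_test("numeric", n_groups=5, sample_size=100))
print(suggest_test("categorical", n_groups=4, sample_size=100))
```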

## Normal distribution

Also, most of these tests - parametric tests - assume that the variable is normally distributed in the population.<br>
It is not the case for SalePrice - which we'll use for most tests - but log-transforming the variable brings it much closer to a normal distribution.<br>
Note that to go back to our original scale and compare values with our \$120 000, we'll have to exponentiate the values back.

In [4]:
import numpy as np
city_hall_dataset['SalePrice'] = np.log1p(city_hall_dataset['SalePrice'])
logged_budget = np.log1p(120000) #logged $120 000 is 11.695
logged_budget

11.695255355062795
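Here is a minimal sketch of the round trip between dollars and logged values, using `math.log1p` and `math.expm1` from the standard library (they apply the same transform as `np.log1p` and `np.expm1` to a single number). The logged mean of 12.0 is a hypothetical value chosen for the illustration.

```python
import math

budget = 120000
logged_budget = math.log1p(budget)            # log(1 + x), the transform applied to SalePrice
restored_budget = math.expm1(logged_budget)   # exp(x) - 1 undoes log1p

hypothetical_logged_mean = 12.0               # made-up value, not taken from the dataset
mean_in_dollars = math.expm1(hypothetical_logged_mean)

print(round(logged_budget, 3), round(restored_budget), round(mean_in_dollars))
```

Keep in mind that exponentiating a mean of logged prices gives the geometric mean of the prices, which is never higher than their usual (arithmetic) mean.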

# Practice

So let's say we are ready to dive into the data, but not ready to pay the small fee for the large sample size.<br>
We'll be starting with the free samples of 25 observations.

In [5]:
sample = city_hall_dataset.sample(n=25)
p = {} #dictionary we'll use to store information and results

## One sample T-test | Two-tailed | Means

So the first question we want to ask is : how does our \$120 000 compare to the average Ames house SalePrice? <br>
In other words, is \$120 000 (11.695 logged) any different from the mean SalePrice of the population?<br>
To know that from a 25-observation sample, we need to use a One Sample T-Test.

<b>Null Hypothesis</b> :  Mean SalePrice = 11.695 <br>
<b>Alternative Hypothesis</b> :  Mean SalePrice ≠ 11.695 <br>

In [6]:
p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 11.984 | 11.695 | 3.548 | 0.002 | alternative |


So we know our initial budget is significantly different from the mean SalePrice.<br>
From the table above, it unfortunately seems lower.<br>
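For the curious, the score of a one-sample T-test is simply the distance between the sample mean and the tested value, expressed in standard errors. Here is a minimal sketch with the standard library; `one_sample_t` is a hypothetical helper and the five logged prices are made up for the illustration.

```python
import math
import statistics

def one_sample_t(sample, popmean):
    """Return the t statistic and the degrees of freedom of a one-sample t-test."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)  # stdev divides by n - 1
    t_stat = (statistics.mean(sample) - popmean) / standard_error
    return t_stat, n - 1

toy_sample = [11.8, 12.1, 11.9, 12.3, 12.0]   # made-up logged prices
t_stat, dof = one_sample_t(toy_sample, popmean=math.log1p(120000))

critical_value = 2.776  # two-tailed 5% critical value of Student's t with 4 degrees of freedom
print(round(t_stat, 3), dof, abs(t_stat) > critical_value)
```

Comparing the absolute score with the critical value is equivalent to comparing the two-tailed p-value with 0.05.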

## One sample T-test | One-tailed | Means

Let's make sure our budget is lower by running a one-tailed test.<br>
Question now is : is 120 000 (11.695 logged) lower than the mean SalePrice of the population?<br>

<b>Null Hypothesis</b> :  Mean SalePrice <= 11.695 <br>
<b>Alternative Hypothesis</b> :  Mean SalePrice > 11.695 <br>

In [7]:
p['value1'], p['value2'] = sample['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = stats.ttest_1samp(sample['SalePrice'], popmean=logged_budget)
p['p_value'] = p['p_value']/2 #one-tailed test with a scipy function : we halve the p-value ourselves (valid here because the score is positive, as the Alternative Hypothesis expects)
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 11.984 | 11.695 | 3.548 | 0.001 | alternative |


Unfortunately it is!<br>
At the 5% significance level, we can conclude that the mean SalePrice is above our starting budget : it won't let us buy a house at the average Ames price.
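As a side note, recent versions of SciPy (1.6 and later) accept an `alternative` argument in `ttest_1samp`, which avoids halving the p-value by hand. The sketch below assumes such a version is installed and uses a made-up sample of logged prices.

```python
from scipy import stats

toy_sample = [11.8, 12.1, 11.9, 12.3, 12.0]   # made-up logged prices
budget_logged = 11.695

# two-tailed test, then one-tailed test of "mean > budget_logged"
t_two, p_two = stats.ttest_1samp(toy_sample, popmean=budget_logged)
t_one, p_one = stats.ttest_1samp(toy_sample, popmean=budget_logged, alternative='greater')

print(p_two / 2, p_one)
```

Both numbers agree here because the score is positive. With a negative score, `alternative='greater'` would return a p-value close to 1, whereas blindly halving would wrongly return a small one.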

## Two sample T-test | Two-tailed | Means

Now that our expectations are lowered, we realize something important :<br>
The entire dataset probably contains some big houses fitted for entire families as well as small houses for fewer inhabitants.<br>
Prices probably differ a lot between the two types.<br>
And we are moving in alone, so we probably don't need that big of a house.<br><br>
What if we could ask the City Hall to give us a sample for big houses, and a sample for smaller houses?<br>
We first could see if there is a significant difference in prices.<br>
And then see how our \$120 000 are doing against the small houses average SalePrice.<br><br>
We do ask the City Hall, and because they understand it is also for the sake of this tutorial, they accept.<br>
They say they'll split the dataset in two, based on the surface area of the houses.<br>
They will give us a sample from the top 50\% houses in terms of surface, and another sample from the bottom 50\%.

In [8]:
smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=25)
larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=25)

Now we first want to know if the two samples, extracted from two different populations, have significant differences in their average SalePrice.

<b>Null Hypothesis</b> : SalePrice of smaller houses = SalePrice of larger houses <br>
<b>Alternative Hypothesis</b> :  SalePrice of smaller houses ≠ SalePrice of larger houses <br>

In [9]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'])
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 11.883 | 12.198 | -4.855 | 0.0 | alternative |


As expected, the two samples show some significant differences in SalePrice.
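By default, the statsmodels `ttest_ind` function pools the variances of the two samples (`usevar='pooled'`); passing `usevar='unequal'` switches to Welch's version. The pooled score can be sketched by hand as follows. `pooled_two_sample_t` is a hypothetical helper and both samples are made up for the illustration.

```python
import math
import statistics

def pooled_two_sample_t(sample_a, sample_b):
    """Return the t statistic and degrees of freedom, assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(sample_a)
                  + (n_b - 1) * statistics.variance(sample_b)) / dof
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t_stat, dof

toy_small = [11.7, 11.9, 11.8, 12.0]   # made-up logged prices of smaller houses
toy_large = [12.1, 12.3, 12.2, 12.4]   # made-up logged prices of larger houses
t_stat, dof = pooled_two_sample_t(toy_small, toy_large)
print(round(t_stat, 3), dof)
```

The score is negative because the first sample has the lower mean, just like in the table above.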

## Two sample T-test | One-tailed | Means

Obviously, larger houses have a higher SalePrice.<br>
Let's confirm it with a one-tailed test.

<b>Null Hypothesis</b> : SalePrice of smaller houses >= SalePrice of larger houses <br>
<b>Alternative Hypothesis</b> :  SalePrice of smaller houses < SalePrice of larger houses <br>

In [10]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'], p['df'] = ttest_ind(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 11.883 | 12.198 | -4.855 | 0.0 | alternative |


Still as expected, SalePrice is significantly higher for larger houses.

## Two sample Z-test | One-tailed | Means

Now that the City Hall has already split the population in two, why not ask them for larger samples?<br>
We'll pay a fee but that's all right, this is fake money.

In [11]:
smaller_houses = city_hall_dataset.sort_values('GrLivArea')[:730].sample(n=100, random_state=1)
larger_houses = city_hall_dataset.sort_values('GrLivArea')[730:].sample(n=100, random_state=1)

<b>Null Hypothesis</b> : SalePrice of smaller houses >= SalePrice of larger houses <br>
<b>Alternative Hypothesis</b> :  SalePrice of smaller houses < SalePrice of larger houses <br>

In [12]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), larger_houses['SalePrice'].mean()
p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], larger_houses['SalePrice'], alternative='smaller')
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 11.786 | 12.249 | -10.772 | 0.0 | alternative |


Larger sample sizes show the same results : SalePrice is significantly higher for larger houses.

## Two sample Z-test | One-tailed | Proportions

Instead of means, we can also run tests on proportions.<br>
Is the proportion of houses over \$120 000 higher in the larger houses population than in the smaller houses population?

<b>Null Hypothesis</b> : Proportion of smaller houses with SalePrice over 11.695 >= Proportion of larger houses with SalePrice over 11.695 <br>
<b>Alternative Hypothesis</b> :  Proportion of smaller houses with SalePrice over 11.695 < Proportion of larger houses with SalePrice over 11.695 <br>

In [13]:
from statsmodels.stats.proportion import proportions_ztest
A1 = len(smaller_houses[smaller_houses.SalePrice>logged_budget])
B1 = len(smaller_houses)
A2 = len(larger_houses[larger_houses.SalePrice>logged_budget])
B2 = len(larger_houses)
p['value1'], p['value2'] = A1/B1, A2/B2
p['score'], p['p_value'] = proportions_ztest([A1, A2], [B1, B2], alternative='smaller')
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 0.67 | 0.95 | -5.047 | 0.0 | alternative |


Logically, the test shows that the larger houses population has a higher proportion of houses sold over \$120 000 than the smaller houses population.
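The score in the table can be recomputed by hand. With 100 observations per sample, the proportions 0.67 and 0.95 correspond to 67 and 95 houses over budget. The sketch below uses a pooled proportion for the variance, and `two_proportion_ztest_smaller` is a hypothetical helper written for this illustration.

```python
import math
from statistics import NormalDist

def two_proportion_ztest_smaller(count_a, n_a, count_b, n_b):
    """One-tailed z-test of 'proportion A < proportion B' with a pooled variance."""
    prop_a, prop_b = count_a / n_a, count_b / n_b
    pooled = (count_a + count_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (prop_a - prop_b) / standard_error
    p_value = NormalDist().cdf(z)   # lower tail, because the alternative is "smaller"
    return z, p_value

z, p_value = two_proportion_ztest_smaller(67, 100, 95, 100)
print(round(z, 3), p_value)
```

The score rounds to -5.047, the value shown in the table, and the p-value is far below 0.05.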

## One sample Z-test | One-tailed | Means

So now let's see how our \$120 000  (11.7 logged) are doing against smaller houses only, based on the 100 observations sample.

<b>Null Hypothesis</b> :  Mean SalePrice of smaller houses <= 11.695 <br>
<b>Alternative Hypothesis</b> :  Mean SalePrice of smaller houses > 11.695 <br>

In [14]:
p['value1'], p['value2'] = smaller_houses['SalePrice'].mean(), logged_budget
p['score'], p['p_value'] = ztest(smaller_houses['SalePrice'], value=logged_budget, alternative='larger')
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 11.786 | 11.695 | 3.593 | 0.0 | alternative |


That's quite depressing : our \$120 000 does not even beat the average price of smaller houses.

## One sample Z-test | One-tailed | Proportions

Our \$120 000 do not seem too far from the average SalePrice of small houses though.<br>
Let's see if at least 25\% of houses have a SalePrice in our budget.

<b>Null Hypothesis</b> : Proportion of smaller houses with SalePrice under 11.695 <= 25% <br>
<b>Alternative Hypothesis</b> :  Proportion of smaller houses with SalePrice under 11.695 > 25% <br>

In [15]:
from statsmodels.stats.proportion import proportions_ztest
A = len(smaller_houses[smaller_houses.SalePrice<logged_budget])
B = len(smaller_houses)
p['value1'], p['value2'] = A/B, 0.25
p['score'], p['p_value'] = proportions_ztest(A, B, alternative='larger', value=0.25)
results(p)

| value1 | value2 | score | p_value | hypothesis_accepted |
|---|---|---|---|---|
| 0.32 | 0.25 | 1.501 | 0.067 | null |


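Unfortunately not : with a p-value of 0.067, above 0.05, we fail to reject the Null Hypothesis.<br>
32% of the houses in our sample are within our budget, but this is not enough evidence to claim that more than 25% of all smaller houses are.<br>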
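The numbers in the table can be checked by hand : 32 houses out of 100 are under budget, and we compare this proportion with 25%. The sketch below estimates the variance from the sample proportion, and `one_proportion_ztest_larger` is a hypothetical helper written for this illustration.

```python
import math
from statistics import NormalDist

def one_proportion_ztest_larger(count, n, null_prop):
    """One-tailed z-test of 'proportion > null_prop'."""
    prop = count / n
    standard_error = math.sqrt(prop * (1 - prop) / n)  # variance from the sample proportion
    z = (prop - null_prop) / standard_error
    p_value = 1 - NormalDist().cdf(z)   # upper tail, because the alternative is "larger"
    return z, p_value

z, p_value = one_proportion_ztest_larger(32, 100, 0.25)
print(round(z, 3), round(p_value, 3), p_value < 0.05)
```

The score rounds to 1.501 and the p-value to 0.067 : the p-value stays above 0.05, so the Null Hypothesis cannot be rejected.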

## F-test (ANOVA)

The House Price Dataset has a MSZoning variable, which identifies the general zoning classification of the house.<br>
For instance, it lets you know if the house is situated in a residential or a commercial zone.<br><br>
We'll therefore try to find out if there is a significant difference in SalePrice based on the zoning.<br>
And then know where we will be more likely to live with our budget.<br>
Based on the 100 observations samples of smaller houses, let's first have an overview of mean SalePrice by zone.

In [16]:
replacement = {'FV': "Floating Village Residential", 'C (all)': "Commercial", 'RH': "Residential High Density",
              'RL': "Residential Low Density", 'RM': "Residential Medium Density"}
smaller_houses['MSZoning_FullName'] = smaller_houses['MSZoning'].replace(replacement)
mean_price_by_zone = smaller_houses.groupby('MSZoning_FullName')['SalePrice'].mean().to_frame()
mean_price_by_zone

| MSZoning_FullName | SalePrice |
|---|---|
| Commercial | 11.59 |
| Floating Village Residential | 12.03 |
| Residential High Density | 11.705 |
| Residential Low Density | 11.828 |
| Residential Medium Density | 11.617 |


To know if there is a significant difference between these values, we run an ANOVA test (because there are more than 2 groups to compare).<br>
The test won't be able to tell us which groups differ from the others, but at least we'll know if there is a difference or not.

<b>Null Hypothesis</b> : No difference between SalePrice means <br>
<b>Alternative Hypothesis</b> : Difference between SalePrice means <br>

In [17]:
sh = smaller_houses.copy()
p['score'], p['p_value'] = stats.f_oneway(sh.loc[sh.MSZoning=='FV', 'SalePrice'], 
               sh.loc[sh.MSZoning=='C (all)', 'SalePrice'],
               sh.loc[sh.MSZoning=='RH', 'SalePrice'],
               sh.loc[sh.MSZoning=='RL', 'SalePrice'],
               sh.loc[sh.MSZoning=='RM', 'SalePrice'],)
results(p)[['score', 'p_value', 'hypothesis_accepted']]

| score | p_value | hypothesis_accepted |
|---|---|---|
| 4.146 | 0.004 | alternative |


There is a significant difference in mean SalePrice depending on the zone where the house is located.<br>
Looking at the average SalePrice by zone, Commercial and Residential Medium Density zones seem to be the most affordable for our budget.
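Behind the scenes, the ANOVA score compares the variability between the group means with the variability inside the groups. Here is a minimal sketch with the standard library; `one_way_f` is a hypothetical helper and the three groups are toy numbers, not house prices.

```python
import statistics

def one_way_f(groups):
    """Return the F statistic of a one-way ANOVA and its two degrees of freedom."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)
    k, n_total = len(groups), len(all_values)
    # variability of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # variability of the observations around their own group mean
    ss_within = sum((value - statistics.mean(g)) ** 2 for g in groups for value in g)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
    return f_stat, k - 1, n_total - k

toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]   # toy data
f_stat, df_between, df_within = one_way_f(toy_groups)
print(round(f_stat, 3), df_between, df_within)
```

A large score means the group means are far apart compared with the spread inside each group. It still does not say which groups differ : that would require pairwise comparisons.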

## Chi-square test

One last question we'll address : can we get a garage? If yes, what type of garage?<br>
If not, then we won't bother saving up for a car, and we'll try to get a house next to Public Transportation.<br>
The dataset contains a categorical variable, GarageType, that will help us answer the question.<br>
<br>

In [18]:
smaller_houses['GarageType'] = smaller_houses['GarageType'].fillna('No Garage') #assign the result back instead of using inplace=True on a column
smaller_houses['GarageType'].value_counts().to_frame()

| | GarageType |
|---|---|
| Attchd | 46 |
| Detchd | 41 |
| No Garage | 10 |
| CarPort | 2 |
| Basment | 1 |


Our budget should still give us access to the cheapest smaller houses (32% of our sample, even though the test above could not confirm the 25% threshold).<br>
We would ideally like to know if the distribution of Garage Types among the smallest quarter of these houses is different from the three other quarters.<br>
We are now friends with the City Hall, so we can ask them one last favor : <br>
Split the smaller houses population in 4 based on surface, and give us a sample of each quarter.<br>
Because we are working here with categorical data, we'll run a Chi-Square test.

In [19]:
city_hall_dataset['GarageType'] = city_hall_dataset['GarageType'].fillna('No Garage')
by_area = city_hall_dataset.sort_values('GrLivArea') #smallest houses first, sorted only once
sample1 = by_area[:183].sample(n=100)
sample2 = by_area[183:366].sample(n=100)
sample3 = by_area[366:549].sample(n=100)
sample4 = by_area[549:730].sample(n=100)
dff = pd.concat([
    sample1['GarageType'].value_counts().to_frame(),
    sample2['GarageType'].value_counts().to_frame(), 
    sample3['GarageType'].value_counts().to_frame(), 
    sample4['GarageType'].value_counts().to_frame()], 
    axis=1, sort=False)
dff.columns = ['Sample1 (smallest houses)', 'Sample2', 'Sample3', 'Sample4 (largest houses)']
dff = dff[:3] #chi-square tests do not work when some expected counts are 0, so we keep only the first three rows (the most frequent garage types)
dff 

| | Sample1 (smallest houses) | Sample2 | Sample3 | Sample4 (largest houses) |
|---|---|---|---|---|
| Detchd | 53.0 | 37.0 | 38.0 | 27.0 |
| Attchd | 30.0 | 54.0 | 55.0 | 59.0 |
| No Garage | 16.0 | 4.0 | 6.0 | 6.0 |


<b>Null Hypothesis</b> : No difference between GarageType distribution <br>
<b>Alternative Hypothesis</b> : Difference between GarageType distribution <br>

In [20]:
#chi2_contingency also returns the table of expected counts, which we do not need here
p['score'], p['p_value'], p['ddf'], _ = stats.chi2_contingency(dff)
results(p)[['score', 'p_value', 'hypothesis_accepted']]

| score | p_value | hypothesis_accepted |
|---|---|---|
| 29.749 | 0.0 | alternative |


Clearly there's a difference in GarageType distribution according to size of houses.<br>
The sample that concerns us, Sample1, has the highest proportion of "No Garage" and "Detached Garage".<br>
We'll probably have to stick with Public Transportation.
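For reference, the Chi-Square score can be recomputed by hand from the table of counts above : for each cell, we compare the observed count with the count expected if GarageType were independent of the sample. This is a minimal sketch with plain Python; the 12.592 threshold is the 5% critical value of a chi-square distribution with 6 degrees of freedom.

```python
observed = {
    "Detchd":    [53, 37, 38, 27],
    "Attchd":    [30, 54, 55, 59],
    "No Garage": [16, 4, 6, 6],
}
rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
grand_total = sum(row_totals)

chi2 = 0.0
for row, row_total in zip(rows, row_totals):
    for obs, col_total in zip(row, col_totals):
        expected = row_total * col_total / grand_total  # count expected under independence
        chi2 += (obs - expected) ** 2 / expected

dof = (len(rows) - 1) * (len(col_totals) - 1)
critical_value = 12.592  # 5% critical value, chi-square with 6 degrees of freedom
print(round(chi2, 3), dof, chi2 > critical_value)
```

The score rounds to 29.749, the value returned by `chi2_contingency`, and it is well above the critical value.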

# Conclusion

We probably won't have a great house, but at least, we learned about statistical tests.