In [1]:
import scipy
scipy.__version__

'1.11.4'

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import scipy.stats as stats

In [3]:
houseprice = pd.read_csv("Florida.csv")
houseprice.head()

Unnamed: 0,Metropolitan Area,Jan_2003,Jan_2002
0,Daytona Beach,117,96
1,Fort Lauderdale,207,169
2,Fort Myers,143,129
3,Fort Walton Beach,139,134
4,Gainesville,131,119


### Let's write the null and alternative hypotheses

Let $\mu_1$ and $\mu_2$ be the mean prices of single-family homes in metropolitan areas of Florida in 2002 and 2003, respectively.

We want to test whether there is an increase in the house price from 2002 to 2003. Because the same metropolitan areas are observed in both years, the two samples are paired.

We will test the null hypothesis

>$H_0:\mu_1=\mu_2$

against the alternate hypothesis

>$H_a:\mu_1<\mu_2$

In [5]:
diff = np.mean(houseprice['Jan_2003'] - houseprice['Jan_2002'])
print('The mean difference in house prices (Jan 2003 minus Jan 2002) is', diff)

The mean difference in house prices (Jan 2003 minus Jan 2002) is 15.0


In [13]:
from scipy.stats import ttest_rel

# paired t-test; alternative='less' tests whether mean(Jan_2002 - Jan_2003) < 0, i.e. prices rose
test_stat, p_value = ttest_rel(houseprice['Jan_2002'], houseprice['Jan_2003'], alternative = 'less')
print('The p-value is {}'.format(round(p_value,4)))

The p-value is 0.0001


### Insight
As the p-value is much smaller than any conventional level of significance (for example, 0.05), we reject the null hypothesis. Thus, there is enough statistical evidence to conclude that the mean house price increased from 2002 to 2003.
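
A paired t-test is equivalent to a one-sample t-test on the within-pair differences. The sketch below illustrates this using only the five rows displayed by `houseprice.head()` above, so its numbers will differ from the result on the full dataset. The variable names (`jan_2002`, `jan_2003`, `d`, `t_manual`) are chosen here for illustration.

```python
import numpy as np
from scipy.stats import ttest_rel, ttest_1samp

# Only the five rows shown by houseprice.head(); not the full dataset
jan_2002 = np.array([96, 169, 129, 134, 119])
jan_2003 = np.array([117, 207, 143, 139, 131])

# Within-pair differences, ordered to match ttest_rel(jan_2002, jan_2003)
d = jan_2002 - jan_2003

# Paired t statistic computed by hand: mean(d) / (sd(d) / sqrt(n))
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# The same test via SciPy, in two equivalent forms
paired = ttest_rel(jan_2002, jan_2003, alternative='less')
one_sample = ttest_1samp(d, popmean=0, alternative='less')

print(t_manual, paired.statistic, one_sample.statistic)
print(paired.pvalue, one_sample.pvalue)
```

All three statistics should agree, which shows that the pairing enters the test only through the differences.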

# <a name='link12'>**Chi-Square Test for Variance**</a>



### Let's revisit an example
It is conjectured that the standard deviation for the annual return of mid cap mutual funds is 22.4%, when all such funds are considered and over a long period of time. The sample standard deviation of a certain mid cap mutual fund based on a random sample of size 32 is observed to be 26.4%. 

Do we have enough evidence to claim that the standard deviation of the chosen mutual fund is greater than the conjectured standard deviation for mid cap mutual funds at 0.05 level of significance?



### Let's write the null and alternative hypotheses
Let $\sigma$ be the population standard deviation of the annual return (in percent) of the chosen mutual fund.

We will test the null hypothesis

>$H_0:\sigma^2 = 22.4^2$

against the alternate hypothesis

>$H_a:\sigma^2 > 22.4^2$

### Let's test whether the assumptions are satisfied or not

* Continuous data - Yes
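* Normally distributed population - We assume that the annual returns of the fund are normally distributed. A sample size greater than 30 does not settle this: the Central Limit Theorem concerns the distribution of sample means, whereas the chi-square test for variance relies on normality of the population itself.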
* Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.   


### P-value

In [18]:
from scipy.stats import chi2

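def chi_var(pop_var, sample_var, n):
    # under H0, (n-1)*s^2 / sigma_0^2 follows a chi-square distribution with n-1 degrees of freedom
    test_stat = (n-1)*sample_var / pop_var
    # upper-tail p-value, since the alternative is sigma^2 > sigma_0^2
    p_value = chi2.sf(test_stat, n-1)
    return (test_stat, p_value)

# n is the sample size
n = 32
# hypothesized population variance and observed sample variance
pop_var, sample_var = 22.4**2, 26.4**2

test_stat, p_value = chi_var(pop_var, sample_var, n)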

print('The p-value is: {}'.format(round(p_value,4)))


The p-value is: 0.0734


### Insight
As the p-value (0.0734) is greater than the significance level (0.05), we cannot reject the null hypothesis. Hence, we do not have enough statistical evidence to conclude that the standard deviation of the chosen mutual fund is greater than the conjectured standard deviation for mid cap mutual funds at the 0.05 level of significance.
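
The same decision can be reached by comparing the test statistic with the upper-tail critical value of the chi-square distribution. The following is a minimal sketch using the figures from the example; the names `alpha` and `critical_value` are introduced here for illustration.

```python
from scipy.stats import chi2

n = 32                 # sample size from the example
alpha = 0.05           # significance level from the example
pop_var = 22.4**2      # hypothesized variance under H0
sample_var = 26.4**2   # observed sample variance

# Test statistic: (n-1) * s^2 / sigma_0^2
test_stat = (n - 1) * sample_var / pop_var

# Upper-tail critical value: reject H0 only if test_stat exceeds it
critical_value = chi2.ppf(1 - alpha, df=n - 1)

print('Test statistic: {}'.format(round(test_stat, 2)))
print('Critical value: {}'.format(round(critical_value, 2)))
print('Reject H0:', bool(test_stat > critical_value))
```

The test statistic falls below the critical value, which agrees with the p-value being larger than 0.05.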

# <a name='link13'>**F-test for Equality of Variances**</a>



### Let's revisit the example

The variance of a process is an important quality of the process. A large variance implies that the process needs better control and there is opportunity to improve. 


The data (Bags.csv) includes weights for two different sets of bags manufactured by two different machines. It is assumed that the weights for the two sets of bags follow a normal distribution.

Do we have enough statistical evidence at the 5% significance level to conclude that there is a significant difference between the variances of the bag weights for the two machines?



### Let's write the null and alternative hypotheses
Let $\sigma_1^2, \sigma_2^2$ be the variances of weights of the bags produced by two different machines.

We will test the null hypothesis

>$H_0:\sigma_1^2 = \sigma_2^2$

against the alternate hypothesis

>$H_a:\sigma_1^2 \neq \sigma_2^2$

### Let's test whether the assumptions are satisfied or not

* Continuous data - Yes, the weight is measured on a continuous scale.
* Normally distributed populations - Yes, it is assumed that the populations are normally distributed.
* Independent populations - As the two sets of bags are manufactured from two different machines, the populations are independent.
* Random sampling from the population - Yes, we are informed that the collected sample is a simple random sample.


In [19]:
bagweight = pd.read_csv("Bags.csv")
bagweight.head()

Unnamed: 0,Machine 1,Machine 2
0,2.95,3.22
1,3.45,3.3
2,3.5,3.34
3,3.75,3.28
4,3.48,3.29


### P-value

In [22]:
from scipy.stats import f

def f_test(x,y):
    x=np.array(x)
    y=np.array(y)
    test_stat = np.var(x, ddof=1)/np.var(y, ddof=1) # F statistic: ratio of the sample variances
    dfn = x.size-1 # numerator degrees of freedom
    dfd = y.size-1 # denominator degrees of freedom
    p = f.sf(test_stat, dfn, dfd) # upper-tail (one-sided) p-value
    p_two = 2*min(p, 1-p) # two-sided p-value, matching H_a: the variances differ
    print("The p_value is {}".format(round(p_two,5)))

f_test(bagweight.dropna()["Machine 1"], bagweight.dropna()["Machine 2"])

The p_value is 0.0


### Insight
As the p-value is much smaller than the level of significance, the null hypothesis can be rejected. Hence, we have enough statistical evidence to conclude that there is a difference between the variances of the bag weights for the two machines at the 0.05 significance level.
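
Since the alternative hypothesis is two-sided, the p-value should not depend on which machine's variance is placed in the numerator. The sketch below checks this on synthetic weights; the data, the helper name `f_test_two_sided`, and the chosen means and spreads are all hypothetical and are not taken from Bags.csv.

```python
import numpy as np
from scipy.stats import f

def f_test_two_sided(x, y):
    """Return the F statistic and two-sided p-value for equality of variances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    test_stat = np.var(x, ddof=1) / np.var(y, ddof=1)
    dfn, dfd = x.size - 1, y.size - 1
    # Double the smaller tail so the result is symmetric in x and y
    lower = f.cdf(test_stat, dfn, dfd)
    upper = f.sf(test_stat, dfn, dfd)
    return test_stat, min(1.0, 2 * min(lower, upper))

# Synthetic weights (hypothetical, not the Bags.csv data)
rng = np.random.default_rng(0)
machine_a = rng.normal(loc=3.3, scale=0.30, size=40)
machine_b = rng.normal(loc=3.3, scale=0.05, size=50)

stat_ab, p_ab = f_test_two_sided(machine_a, machine_b)
stat_ba, p_ba = f_test_two_sided(machine_b, machine_a)
print(p_ab, p_ba)
```

Swapping the two samples inverts the F statistic and exchanges the degrees of freedom, so both calls should report the same two-sided p-value.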