# T-statistic 

In this exercise we illustrate a two-sample t-test, which is used to determine whether two population means are equal. We use the data from https://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm, a data set that contains miles per gallon for U.S. cars (sample 1) and for Japanese cars (sample 2).


In [105]:
import numpy as np
import pandas as pd
from IPython.display import HTML, Image

In [85]:
## The first column is miles per gallon for U.S. cars and the second column is miles per gallon for Japanese cars

data = pd.read_csv("cars_data.csv",header = 0)
data

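| Unnamed: 0 | US_cars | Jp_cars |
|---|---|---|
| 0 | 18 | 24.0 |
| 1 | 15 | 27.0 |
| 2 | 18 | 27.0 |
| 3 | 16 | 25.0 |
| 4 | 17 | 31.0 |
| ... | ... | ... |
| 244 | 27 | NaN |
| 245 | 27 | NaN |
| 246 | 32 | NaN |
| 247 | 28 | NaN |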


In [96]:
## Let's compute the mean, the sample variance and the sample size of each group

us_cars_mean = data["US_cars"].mean()

us_cars_sv = data["US_cars"].var()

us_n = data["US_cars"].count()

jp_cars_mean = data["Jp_cars"].mean()

jp_cars_sv = data["Jp_cars"].var()

jp_n = data["Jp_cars"].count()

 

In [97]:
## Print the data

countries = ["US","Jp"]

for i in countries:
    print(i + " sample mean = ", data[i + "_cars"].mean())

for i in countries:
    print(i + " sample variance", data[i + "_cars"].var())

for i in countries:
    print(i + " n =", data[i + "_cars"].count())

US sample mean =  20.14457831325301
Jp sample mean =  30.481012658227847
US sample variance 41.14836766420521
Jp sample variance 37.30412203829926
US n = 249
Jp n = 79
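
A note on the pandas calls used above: the two columns have different lengths, so the shorter one is padded with missing values, and `mean()`, `var()` and `count()` skip those by default. In addition, `var()` returns the sample variance (denominator n - 1), which is the quantity the t-test formula expects. The following minimal sketch illustrates both points on a made-up toy frame; the `toy` data and its column names are hypothetical and not part of the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is shorter, so it is padded with NaN
toy = pd.DataFrame({
    "a": [18.0, 15.0, 18.0, 16.0],
    "b": [24.0, 27.0, np.nan, np.nan],
})

n_b = toy["b"].count()            # counts non-missing values only
mean_b = toy["b"].mean()          # NaN entries are skipped
var_b = toy["b"].var()            # sample variance, denominator n - 1 (ddof=1)
var_b_np = np.var([24.0, 27.0])   # NumPy default is the population variance (ddof=0)

print(n_b, mean_b, var_b, var_b_np)
```

If the variance were computed with NumPy instead, passing `ddof=1` would be needed to match the pandas result.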


Now we can use the Two-Sample t-Test for Equal Means formula:

$$ t = \frac{ \bar{X_1} - \bar{X_2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}  }} $$

In our case, we want to test whether the average fuel economy (in miles per gallon) differs between U.S. and Japanese cars, that is, whether the two population means are different at the 0.05 significance level.
Thus, our hypothesis test can be formally stated as:
$$H_0 : \mu_1 = \mu_2$$
$$H_1: \mu_1 \neq \mu_2$$
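
As a rough worked example, taking U.S. cars as sample 1 and Japanese cars as sample 2 and plugging the summary statistics printed above (rounded to four decimals) into the formula gives approximately:

$$ t \approx \frac{20.1446 - 30.4810}{\sqrt{\frac{41.1484}{249} + \frac{37.3041}{79}}} \approx \frac{-10.3364}{\sqrt{0.1653 + 0.4722}} \approx \frac{-10.3364}{0.7984} \approx -12.95 $$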



In [98]:
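# Function to compute the t statistic.
# Note: s2_one and s2_two are sample variances, not standard deviations.

def t_statistic(X_mean_one, X_mean_two, s2_one, s2_two, n_1, n_2):
    # Standard error of the difference between the two sample means
    standard_error = np.sqrt((s2_one / n_1) + (s2_two / n_2))

    t = (X_mean_one - X_mean_two) / standard_error

    return t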

In [111]:
## Compute the t_stat

t_stat =  t_statistic(us_cars_mean,jp_cars_mean,us_cars_sv,jp_cars_sv,us_n,jp_n)

print("We have a t statistic of:",t_stat)

We have a t statistic of: -12.946273274932004
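
As a cross-check, and assuming SciPy is installed (it is not imported elsewhere in this notebook), `scipy.stats.ttest_ind_from_stats` can compute the same statistic directly from summary statistics. With `equal_var=False` it uses the unpooled standard error that matches the formula above. Note that it expects standard deviations, so the variances are square-rooted first. This is only a sketch: the numbers are copied from the printed output rather than read from `cars_data.csv`.

```python
import math
from scipy import stats

# Summary statistics copied from the printed output above
us_mean, us_var, us_n = 20.14457831325301, 41.14836766420521, 249
jp_mean, jp_var, jp_n = 30.481012658227847, 37.30412203829926, 79

# equal_var=False selects the unpooled (Welch) standard error
result = stats.ttest_ind_from_stats(
    mean1=us_mean, std1=math.sqrt(us_var), nobs1=us_n,
    mean2=jp_mean, std2=math.sqrt(jp_var), nobs2=jp_n,
    equal_var=False,
)

print("t statistic:", result.statistic)
print("two-sided p-value:", result.pvalue)
```

The returned p-value offers an alternative to the table lookup: the null hypothesis is rejected when it falls below the chosen significance level.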


In [110]:
## Let's find our threshold for a .05 significance level.
## In this example we want to know if the means are different, so the test is two-sided.
## We have that .05 / 2 = .025 in each tail and (249 - 1 + 79 - 1) = 326 degrees of freedom.

## There is no table entry for 326 degrees of freedom, so we average the entries for
## 100 (1.984) and 1000 (1.962) degrees of freedom: (1.984 + 1.962) / 2 = 1.973 approximately.

Image(url= "http://i.stack.imgur.com/PiSUh.png")
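
The table lookup can also be cross-checked numerically. The sketch below assumes SciPy is available and uses `scipy.stats.t.ppf` to get the two-sided critical value for 326 degrees of freedom. Because the t statistic above uses separate variances for each sample, the sketch also computes the Welch-Satterthwaite degrees of freedom as an alternative; that second calculation is an addition for comparison, not part of the original exercise. The inputs are copied from the printed summary statistics.

```python
from scipy import stats

alpha = 0.05
us_var, us_n = 41.14836766420521, 249
jp_var, jp_n = 37.30412203829926, 79

# Degrees of freedom as used in the text: n1 - 1 + n2 - 1
df_pooled = us_n + jp_n - 2
t_crit_pooled = stats.t.ppf(1 - alpha / 2, df_pooled)

# Welch-Satterthwaite degrees of freedom for the unpooled standard error
a, b = us_var / us_n, jp_var / jp_n
df_welch = (a + b) ** 2 / (a ** 2 / (us_n - 1) + b ** 2 / (jp_n - 1))
t_crit_welch = stats.t.ppf(1 - alpha / 2, df_welch)

print("pooled df:", df_pooled, "critical value:", t_crit_pooled)
print("Welch df:", df_welch, "critical value:", t_crit_welch)
```

Under either choice of degrees of freedom the critical value stays just below 2, so the comparison with a t statistic of about -12.95 is not sensitive to it.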

Finally, we can conclude that: $$|-12.946| > 1.973$$ Hence, we reject the null hypothesis and conclude that the two population means are different at the 0.05 significance level.