## Lab | Inferential statistics - T-test & P-value

**Instructions**
### Exercise 1

This is another simple example of a `two sample t test` (pooled, because the variances are assumed to be equal). This time, however, the test is one-sided.

In a packing plant, a machine packs cartons with jars. It is supposed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file `files_for_lab/machine.txt`. Assuming the conditions for conducting the t-test are met, do the data provide sufficient evidence that the new machine packs faster than the old one?

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as st

In [2]:
data = pd.read_csv('files_for_lab/machine.txt')
data

Unnamed: 0,New machine\t Old machine
0,42.1\t 42.7
1,41\t 43.6
2,41.3\t 43.8
3,41.8\t 43.3
4,42.4\t 42.5
5,42.8\t 43.5
6,43.2\t 43.1
7,42.3\t 41.7
8,41.8\t 44
9,42.7\t 44.1


In [3]:
data = pd.read_csv('files_for_lab/machine.txt', sep='\t')
data

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [4]:
data.columns

Index(['New machine', '    Old machine'], dtype='object')

In [5]:
data.columns = data.columns.str.replace(' ', '').str.lower()

In [6]:
data.columns

Index(['newmachine', 'oldmachine'], dtype='object')
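A slightly more general way to normalize headers like these is to strip the surrounding whitespace and replace inner spaces with underscores. The sketch below is only an illustration: the inline text is a hypothetical stand-in for `files_for_lab/machine.txt`, and the snake_case names are a stylistic assumption, not the names used in the rest of this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the tab-separated file, with stray spaces after the tabs
raw = "New machine\t    Old machine\n42.1\t 42.7\n41\t 43.6\n"

df = pd.read_csv(io.StringIO(raw), sep="\t")

# Strip leading/trailing whitespace, lowercase, and turn inner spaces into underscores
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

print(list(df.columns))
```

Stripping first means only the spaces inside a name are converted, so padded headers do not end up with leading underscores.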

The `null hypothesis` $(H_0)$ assumes that there is no difference in the average packing times between the two machines, while the `alternative hypothesis` $(H_1)$ assumes that the new machine packs faster on average.

$ H_0: \mu_{newMachine} = \mu_{oldMachine} $

$ H_1: \mu_{newMachine} < \mu_{oldMachine} $

Confidence level = 95 %  
$\alpha$ = 5 %
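With $\alpha$ fixed, the one-sided decision rule can also be expressed as a critical value of the t-distribution. The following is a minimal sketch, assuming a pooled test with ten observations per machine (so $10 + 10 - 2 = 18$ degrees of freedom); the variable names are illustrative.

```python
import scipy.stats as st

alpha = 0.05
n_new, n_old = 10, 10        # ten cartons timed per machine
dof = n_new + n_old - 2      # pooled degrees of freedom

# Lower-tail critical value: reject H0 when the t statistic falls below it
t_crit = st.t.ppf(alpha, dof)
print(round(t_crit, 3))
```

A t statistic below this critical value leads to the same decision as a p-value below $\alpha$.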

In [7]:
st.ttest_ind(a=data.newmachine, b=data.oldmachine, alternative = 'less', equal_var=True)

Ttest_indResult(statistic=-3.3972307061176026, pvalue=0.0016055712503872579)

We use `ttest_ind` with the `alternative='less'` parameter because we are interested in testing whether the mean packing time of the new machine is less than the mean packing time of the old machine. That is, we are looking for evidence to support the hypothesis that the new machine packs faster than the old one.
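To see what `alternative='less'` changes, it can help to compare it with the default two-sided test on the same data. This is a small sketch using the packing times from the table above; it assumes a SciPy version that supports the `alternative` argument.

```python
import scipy.stats as st

# Packing times in seconds, copied from the table above
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

two_sided = st.ttest_ind(new, old, equal_var=True)
one_sided = st.ttest_ind(new, old, equal_var=True, alternative="less")

# The statistic is identical; because it is negative, the lower-tail
# p-value is half of the two-sided p-value
print(one_sided.statistic, two_sided.statistic)
print(one_sided.pvalue, two_sided.pvalue / 2)
```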

On the other hand, the `equal_var=True` parameter indicates that we assume equal variances for both samples (pooled t-test).
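The exercise states the equal-variance assumption rather than testing it. One possible way to sanity-check it, sketched here as an optional extra and not as part of the original lab, is to compare the sample variances and run Levene's test (`scipy.stats.levene`).

```python
import statistics
import scipy.stats as st

# Packing times in seconds, copied from the table above
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]

# Sample variances (n - 1 denominator)
var_new = statistics.variance(new)
var_old = statistics.variance(old)
print(round(var_new, 4), round(var_old, 4))

# Levene's test: H0 is that both groups have equal variances
levene = st.levene(new, old)
print(levene.pvalue)
```

A large Levene p-value would be consistent with the equal-variance assumption, while a small one would suggest using `equal_var=False` instead.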

The `pvalue` $(0.0016)$ is **less** than the `significance level` $(0.05)$, so there is sufficient evidence to reject the null hypothesis.  
The data support the alternative hypothesis:

***The new machine packs faster than the old machine.***
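The result reported by `ttest_ind` can be reproduced by hand from the pooled-variance formula. The sketch below uses the packing times from the table above; the variable names are illustrative.

```python
import math
import statistics
import scipy.stats as st

# Packing times in seconds, copied from the table above
new = [42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7]
old = [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1]
n1, n2 = len(new), len(old)

# Pooled variance: weighted average of the two sample variances
sp2 = ((n1 - 1) * statistics.variance(new)
       + (n2 - 1) * statistics.variance(old)) / (n1 + n2 - 2)

# t statistic for the difference in means
t_stat = (statistics.mean(new) - statistics.mean(old)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Lower-tail p-value, matching alternative='less'
p_value = st.t.cdf(t_stat, df=n1 + n2 - 2)
print(t_stat, p_value)
```

Both values agree with the `Ttest_indResult` shown above (statistic of about -3.397, p-value of about 0.0016).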

### Exercise 2

An additional problem (not mandatory):  

In this case we cannot assume that the population variances are equal, so we cannot pool the variances.   
Independent random samples of $17$ sophomores and $13$ juniors attending a large university yield the following data on grade point averages. The data are provided in the file `files_for_lab/student_gpa.txt`. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

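For an unpooled t-test, the test statistic can be calculated as: 

$$t' = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$

We use the notation $t'$ to indicate that this statistic follows a t-distribution only approximately, not exactly.

The degrees of freedom are approximated by the Welch-Satterthwaite formula (the pooled value $n_1+n_2-2$ does not apply when the variances are not pooled):  
$$\nu \approx \frac{\left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1-1}+\frac{\left(s_2^2/n_2\right)^2}{n_2-1}}$$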

Where:  

$n_1$, $\bar{x}_1$, $s_1$ and $n_2$, $\bar{x}_2$, $s_2$ are the sample sizes, sample means, and sample standard deviations of the first and second groups, respectively.
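The unpooled statistic can be computed directly from these quantities. The following is a minimal sketch: `welch_t` is a hypothetical helper, the two samples are made up for illustration, and the degrees of freedom use the Welch-Satterthwaite approximation, which is what `ttest_ind(..., equal_var=False)` applies.

```python
import numpy as np
import scipy.stats as st

def welch_t(x1, x2):
    """Return (t', degrees of freedom, two-sided p-value) for an unpooled t-test."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2   # s_i^2 / n_i
    t = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    p = 2 * st.t.sf(abs(t), dof)
    return t, dof, p

# Made-up samples, for illustration only
a = [3.1, 2.4, 2.9, 3.6, 2.2, 3.0]
b = [2.8, 3.3, 3.1, 2.7]

t, dof, p = welch_t(a, b)
ref = st.ttest_ind(a, b, equal_var=False)
print(t, dof, p)
print(ref.statistic, ref.pvalue)
```

The hand-computed statistic and p-value should match SciPy's result for the same inputs.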


In [8]:
gpa = pd.read_csv('files_for_lab/student_gpa.txt', sep='\t')
gpa

Unnamed: 0,Sophomores,Juniors
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98
5,2.6,3.47
6,2.92,3.26
7,3.6,3.2
8,2.28,3.19
9,2.82,2.65


In [9]:
gpa.columns = gpa.columns.str.lower()
gpa.columns

Index(['sophomores', 'juniors'], dtype='object')

In [24]:
# Remove NaN values from the juniors column (it has fewer entries than sophomores)
gpa_juniors = [row for row in gpa.juniors if not np.isnan(row)]
gpa_juniors

[2.56, 2.77, 2.7, 3.0, 2.98, 3.47, 3.26, 3.2, 3.19, 2.65, 3.0, 3.39, 2.58]
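The same cleanup can be done with pandas' `Series.dropna`. The sketch below uses a hypothetical miniature table rather than the real file, and illustrates why the NaN values are dropped per column instead of per row.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the GPA table: the shorter column is padded with NaN
toy = pd.DataFrame({
    "sophomores": [3.04, 1.71, 3.30, 2.88],
    "juniors": [2.56, 2.77, np.nan, np.nan],
})

# Drop NaN per column, so that no valid sophomore value is lost
juniors = toy["juniors"].dropna()
sophomores = toy["sophomores"].dropna()
print(len(sophomores), len(juniors))

# Dropping whole rows would also discard valid sophomore values
print(len(toy.dropna()))
```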

The `null hypothesis` $(H_0)$ assumes that there is ***no difference*** in the grade point averages between the sophomores and the juniors.  
The `alternative hypothesis` $(H_1)$ assumes that the GPA differs between one group and the other.

$ H_0: \mu_{sophomores} = \mu_{juniors} $

$ H_1: \mu_{sophomores} \neq \mu_{juniors} $

Confidence level = 95 %  
$\alpha$ = 5 %

In [11]:
# Perform unpooled two-sample t-test
st.ttest_ind(a=gpa.sophomores, b=gpa_juniors, alternative = 'two-sided', equal_var=False)

Ttest_indResult(statistic=nan, pvalue=nan)

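The `nan` output above is what `ttest_ind` returns when an input still contains NaN values, so that cell appears to have been run before the NaN values were removed. With the cleaned samples, the reported pvalue (0.36) is greater than the significance level (0.05), so there is not enough evidence to reject the null hypothesis.  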

***We cannot conclude that the mean GPAs of sophomores and juniors differ significantly.***
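As a side note on missing values, `ttest_ind` propagates NaN by default and only ignores it when `nan_policy='omit'` is passed. The sketch below shows the difference on made-up samples, not on the GPA data.

```python
import math
import scipy.stats as st

# Made-up samples; the second one carries a NaN, as an unequal-length column would
a = [3.1, 2.4, 2.9, 3.6, 2.2, 3.0]
b = [2.8, 3.3, 3.1, float("nan")]

# Default nan_policy='propagate': any NaN in the input yields a NaN result
propagated = st.ttest_ind(a, b, equal_var=False)

# nan_policy='omit': NaN values are ignored before the test is computed
omitted = st.ttest_ind(a, b, equal_var=False, nan_policy="omit")

print(propagated.pvalue)
print(omitted.pvalue)
```

Removing the NaN values beforehand, as done for the juniors column, and passing `nan_policy='omit'` should lead to the same result.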