# Lab | Inferential statistics - T-test & P-value

### Instructions

1. Here is another simple example of a two-sample t-test (pooled, for when the variances are equal). This time it is a one-sided t-test.

In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster on average than the machine currently in use. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file `files_for_lab/machine.txt`.
Assuming the conditions for a t-test are met, do the data provide sufficient evidence that one machine is better than the other?

In [3]:
# Libraries
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [15]:
df = pd.read_csv("files_for_lab/machine.txt", encoding="utf-16",sep='\t')


In [16]:
df.head()

|   | New machine | Old machine |
|---|---|---|
| 0 | 42.1 | 42.7 |
| 1 | 41.0 | 43.6 |
| 2 | 41.3 | 43.8 |
| 3 | 41.8 | 43.3 |
| 4 | 42.4 | 42.5 |


In [21]:
df.columns.tolist()

['New machine', '    Old machine']

In [22]:
df=df.rename(columns={"    Old machine":"Old machine"})

In [23]:
df.columns.tolist()

['New machine', 'Old machine']

In [28]:
df.columns = df.columns.str.lower().str.replace(" ", "_")


In [25]:
null_hypothesis = "The mean packing times of the two machines are equal."
alternative_hypothesis = "The mean packing times of the two machines differ."

In [31]:
# equal_var=False runs Welch's t-test (more robust; does not assume equal variances)
# For a one-sided alternative ("less" or "greater"), pass the additional `alternative` parameter
t_stat, p_value = st.ttest_ind(df.new_machine, df.old_machine, equal_var=False)
print(f"Test Statistic (t): {t_stat:.2f}")
print(f"P-Value: {p_value:.4f}")
print()

Test Statistic (t): -3.40
P-Value: 0.0032



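#### The p-value (0.0032) is lower than 0.05, so we reject the null hypothesis at the 5% significance level: the mean packing times of the two machines differ. The negative t statistic indicates that the new machine has the lower mean packing time, and for the one-sided alternative (new machine faster) the p-value is half the two-sided value, about 0.0016.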
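The exercise asks for a pooled, one-sided test, whereas the cell above runs a two-sided Welch test. A minimal sketch of the pooled one-sided variant follows. It assumes a SciPy version that supports the `alternative` parameter of `ttest_ind`, and it uses only the five rows shown by `df.head()` as illustrative data, so its numbers will not match the output for the full ten-row file.

```python
import numpy as np
import scipy.stats as st

# Illustrative data: only the first five rows shown by df.head(),
# not the full ten-row file.
new_machine = np.array([42.1, 41.0, 41.3, 41.8, 42.4])
old_machine = np.array([42.7, 43.6, 43.8, 43.3, 42.5])

# H0: mean(new) >= mean(old)
# H1: mean(new) <  mean(old), i.e. the new machine packs faster.
# equal_var=True pools the variances; alternative="less" makes the test one-sided.
t_one, p_one = st.ttest_ind(
    new_machine, old_machine, equal_var=True, alternative="less"
)

# Two-sided version of the same pooled test, for comparison.
t_two, p_two = st.ttest_ind(new_machine, old_machine, equal_var=True)

print(f"One-sided: t = {t_one:.2f}, p = {p_one:.4f}")
print(f"Two-sided: t = {t_two:.2f}, p = {p_two:.4f}")
```

When the t statistic falls on the side named by the alternative, the one-sided p-value is exactly half of the two-sided one, which is why the two calls report the same statistic but different p-values.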



2. An additional problem (not mandatory): In this case we can't assume that the population variances are equal, so we cannot pool the variances.
   Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file `files_for_lab/student_gpa.txt`.
   At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

   The test statistic can be calculated as: [link to the image - Test statistics calculation for Unpooled Variance Case](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.04/7.04-unpooled_variances.png)

   The degrees of freedom are `(n1-1)+(n2-1)`. Note that `scipy.stats.ttest_ind` with `equal_var=False` uses the Welch-Satterthwaite approximation for the degrees of freedom instead, so its p-value can differ slightly from a hand calculation with this value.
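The unpooled calculation can be sketched by hand and compared with SciPy. This is a hedged illustration, not the lab's reference solution: it uses only the five rows per group shown by `df.head()` further down, the variable names are made up for the example, and the degrees of freedom follow the Welch-Satterthwaite approximation that SciPy applies when `equal_var=False`, not the `(n1-1)+(n2-1)` value quoted above.

```python
import numpy as np
import scipy.stats as st

# Illustrative data: only the rows shown by df.head(), not the full samples.
sophomores = np.array([3.04, 1.71, 3.30, 2.88, 2.11])
juniors = np.array([2.56, 2.77, 2.70, 3.00, 2.98])

n1, n2 = len(sophomores), len(juniors)
# Sample variances (ddof=1 gives the unbiased estimator).
v1, v2 = sophomores.var(ddof=1), juniors.var(ddof=1)

# Unpooled standard error and test statistic.
se = np.sqrt(v1 / n1 + v2 / n2)
t_manual = (sophomores.mean() - juniors.mean()) / se

# Welch-Satterthwaite degrees of freedom.
df_welch = (v1 / n1 + v2 / n2) ** 2 / (
    (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * st.t.sf(abs(t_manual), df_welch)

# SciPy's Welch test on the same data.
t_scipy, p_scipy = st.ttest_ind(sophomores, juniors, equal_var=False)

print(f"manual: t = {t_manual:.4f}, df = {df_welch:.2f}, p = {p_manual:.4f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```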

In [48]:
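# A multi-character separator is treated as a regex, which needs the Python parsing engine
df = pd.read_csv("files_for_lab/student_gpa.txt", sep="\t ", engine="python")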



In [49]:
df.head()

|   | Sophomores | Juniors |
|---|---|---|
| 0 | 3.04 | 2.56 |
| 1 | 1.71 | 2.77 |
| 2 | 3.3 | 2.7 |
| 3 | 2.88 | 3.0 |
| 4 | 2.11 | 2.98 |


In [50]:
df.columns.tolist()

['Sophomores', ' Juniors']

In [51]:
df=df.rename(columns={" Juniors":"Juniors"})

In [52]:
df.columns=df.columns.str.lower().str.replace(" ","_")

In [53]:
df.head()

|   | sophomores | juniors |
|---|---|---|
| 0 | 3.04 | 2.56 |
| 1 | 1.71 | 2.77 |
| 2 | 3.3 | 2.7 |
| 3 | 2.88 | 3.0 |
| 4 | 2.11 | 2.98 |


In [60]:
df.juniors

0     2.56
1     2.77
2     2.70
3     3.00
4     2.98
5     3.47
6     3.26
7     3.20
8     3.19
9     2.65
10    3.00
11    3.39
12    2.58
13     NaN
14     NaN
15     NaN
16     NaN
Name: juniors, dtype: float64
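The trailing `NaN` values appear because the juniors sample has 13 observations while the sophomores sample has 17, and both share one DataFrame. A minimal sketch, using hypothetical values chosen only for illustration, of two equivalent ways to handle them: letting SciPy skip missing values with `nan_policy="omit"`, or dropping them before the test.

```python
import numpy as np
import scipy.stats as st

# Hypothetical values for illustration; the shorter sample is padded with NaN,
# as happens when two samples of different sizes share one DataFrame.
a = np.array([3.0, 2.5, 3.2, 2.8, 3.1, 2.9])
b = np.array([2.7, 3.3, 3.0, 2.6, np.nan, np.nan])

# Option 1: let SciPy skip the missing values.
t_omit, p_omit = st.ttest_ind(a, b, equal_var=False, nan_policy="omit")

# Option 2: drop the missing values explicitly before testing.
b_clean = b[~np.isnan(b)]
t_drop, p_drop = st.ttest_ind(a, b_clean, equal_var=False)

print(f"nan_policy='omit': t = {float(t_omit):.4f}, p = {float(p_omit):.4f}")
print(f"NaN dropped first: t = {float(t_drop):.4f}, p = {float(p_drop):.4f}")
```

Without either step, the default `nan_policy="propagate"` would return `nan` for both the statistic and the p-value.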

In [61]:
# equal_var=False runs Welch's t-test (more robust; does not assume equal variances)
# nan_policy="omit" skips the NaN padding in the shorter juniors column
# For a one-sided alternative ("less" or "greater"), pass the additional `alternative` parameter
t_stat, p_value = st.ttest_ind(df.sophomores, df.juniors, equal_var=False, nan_policy="omit")
print(f"Test Statistic (t): {t_stat:.2f}")
print(f"P-Value: {p_value:.4f}")
print()

Test Statistic (t): -0.92
P-Value: 0.3642



The p-value (0.3642) is greater than alpha (0.05), so we cannot reject the null hypothesis: the data do not provide sufficient evidence that the mean GPA differs between sophomores and juniors.