In [65]:
import pandas as pd
import numpy as np

# Lab Inferential statistics - T-test & P-value

## Instructions

- We will work through another simple example of a two-sample t-test (pooled, for when the variances are equal). This time it is a one-sided t-test.

In a packing plant, a machine packs cartons with jars. It is believed that a new machine will pack faster on average than the machine currently used. To test that hypothesis, the times it takes each machine to pack ten cartons are recorded. The results, in seconds, are shown in the tables in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, do the data provide sufficient evidence that one machine is better than the other?

https://www.jmp.com/en_ch/statistics-knowledge-portal/t-test/two-sample-t-test.html

https://www.statisticshowto.com/probability-and-statistics/hypothesis-testing/t-score-vs-z-score/
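
For reference, the hypotheses and the pooled test statistic for this problem can be written as below. The notation ($\mu$ for population means, $\bar{x}$ for sample means, $s$ for sample standard deviations, $n$ for sample sizes, with subscripts "old" and "new") is introduced here for illustration and does not come from the lab files. "Faster" means a smaller mean packing time, so the alternative is one-sided.

```latex
H_0: \mu_{\text{old}} = \mu_{\text{new}}, \qquad H_1: \mu_{\text{old}} > \mu_{\text{new}}

s_p^2 = \frac{(n_{\text{old}}-1)\,s_{\text{old}}^2 + (n_{\text{new}}-1)\,s_{\text{new}}^2}{n_{\text{old}}+n_{\text{new}}-2},
\qquad
t = \frac{\bar{x}_{\text{old}} - \bar{x}_{\text{new}}}{s_p\sqrt{\dfrac{1}{n_{\text{old}}}+\dfrac{1}{n_{\text{new}}}}}
```

Under $H_0$, $t$ follows a Student's t distribution with $n_{\text{old}}+n_{\text{new}}-2$ degrees of freedom.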

In [66]:
data = {'New machine': [42.1,41,41.3,41.8,42.4,42.8,43.2,42.3,41.8,42.7],
        'Old machine': [42.7,43.6,43.8,43.3,42.5,43.5,43.1,41.7,44,44.1]}
machine = pd.DataFrame(data)
machine

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
5,42.8,43.5
6,43.2,43.1
7,42.3,41.7
8,41.8,44.0
9,42.7,44.1


In [78]:
machine.describe()

Unnamed: 0,New machine,Old machine
count,10.0,10.0
mean,42.14,43.23
std,0.683455,0.749889
min,41.0,41.7
25%,41.8,42.8
50%,42.2,43.4
75%,42.625,43.75
max,43.2,44.1


In [79]:
# We start by computing the test statistic. The first step is to find
# the difference between the two sample means:
dif_average= 43.230000 - 42.140000
dif_average

1.0899999999999963
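
The two means above are typed in by hand from the `describe()` output. As a minimal sketch, the same difference can be computed directly from the data, which avoids copy errors. The DataFrame is rebuilt here only so that the snippet runs on its own; the name `means` is introduced for illustration.

```python
import pandas as pd

# Same data as the `machine` DataFrame defined earlier in the notebook
machine = pd.DataFrame({
    'New machine': [42.1, 41, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7],
    'Old machine': [42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44, 44.1],
})

means = machine.mean()  # column-wise sample means
dif_average = means['Old machine'] - means['New machine']
print(round(dif_average, 2))
```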

https://www.probabilidadyestadistica.net/desviacion-estandar-o-desviacion-tipica/#calculadora-de-la-desviacion-estandar-o-desviacion-tipica

In [80]:
# Next, we compute the pooled standard deviation, a combined estimate of the
# common standard deviation that accounts for the two group sizes.
# First, we compute the pooled variance from the two sample standard deviations:
dev_new = 0.683455
dev_old = 0.749889
n_new, n_old = 10, 10

var_agrupada = ((n_new - 1)*dev_new**2 + (n_old - 1)*dev_old**2) / (n_new + n_old - 2)
round(var_agrupada, 4)

0.5147

In [81]:
# Then we take the square root of the pooled variance to get
# the pooled standard deviation:

dev_estandar = np.sqrt(var_agrupada)
round(dev_estandar, 4)

0.7174

In [82]:
# Now we have all the pieces for the test statistic: the difference of the
# means, the pooled standard deviation and the sample sizes.
# We compute the test statistic as follows:

# t = difference of group means / standard error of the difference

t = dif_average / (dev_estandar*np.sqrt(1/n_new + 1/n_old))
round(t, 4)

3.3972

In [None]:
# With 10 + 10 - 2 = 18 degrees of freedom, the one-sided P-Value from the
# t distribution is approximately 0.0016.

# The result is significant at p < .05: the data support the claim that
# the new machine packs faster on average.

https://www.socscistatistics.com/pvalues/normaldistribution.aspx
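
As a cross-check of the manual calculation, here is a sketch using `scipy.stats.ttest_ind`, assuming SciPy is installed in a version that supports the `alternative` argument. The array names `new` and `old` are introduced here for illustration. `equal_var=True` requests the pooled-variance test, and `alternative='less'` tests whether the mean of the first sample is smaller than the mean of the second, so the p-value comes from the t distribution with 18 degrees of freedom rather than from the normal distribution.

```python
import numpy as np
from scipy import stats

# Packing times in seconds, copied from the `machine` DataFrame
new = np.array([42.1, 41, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44, 44.1])

# Pooled (equal-variance) two-sample t-test, one-sided:
# H0: mean(new) = mean(old)   vs   H1: mean(new) < mean(old)
result = stats.ttest_ind(new, old, equal_var=True, alternative='less')
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.4f}, one-sided p = {p_value:.4f}")
```

The sign of `t_stat` is negative because the new machine is passed first; its magnitude should match the statistic computed by hand as old minus new.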

- An additional problem (not mandatory): In this case we can't assume that the population variances are equal, so we cannot pool the variances. Independent random samples of 17 sophomores and 13 juniors attending a large university yield the following data on grade point averages. The data are provided in the file files_for_lab/student_gpa.txt. At the 5% significance level, do the data provide sufficient evidence to conclude that the mean GPAs of sophomores and juniors at the university differ?

In [71]:
# Only 13 juniors were sampled. Building each column as a Series leaves the missing
# rows as NaN instead of padding them with 0, which would distort the mean and std.
data = {'Sophomores': pd.Series([3.04,1.71,3.3,2.88,2.11,2.6,2.92,3.6,2.28,2.82,3.03,3.13,2.86,3.49,3.11,2.13,3.27]),
        'Juniors': pd.Series([2.56,2.77,2.7,3,2.98,3.47,3.26,3.2,3.19,2.65,3,3.39,2.58])}
student = pd.DataFrame(data)
student

Unnamed: 0,Sophomores,Juniors
0,3.04,2.56
1,1.71,2.77
2,3.3,2.7
3,2.88,3.0
4,2.11,2.98
5,2.6,3.47
6,2.92,3.26
7,3.6,3.2
8,2.28,3.19
9,2.82,2.65
10,3.03,3.0
11,3.13,3.39
12,2.86,2.58
13,3.49,NaN
14,3.11,NaN
15,2.13,NaN
16,3.27,NaN


In [86]:
student.describe()

Unnamed: 0,Sophomores,Juniors
count,17.0,13.0
mean,2.84,2.980769
std,0.519832,0.309259
min,1.71,2.56
25%,2.6,2.7
50%,2.92,3.0
75%,3.13,3.2
max,3.6,3.47


In [87]:
# We start by computing the test statistic. The first step is to find
# the difference between the two sample means:
dif_average = student['Sophomores'].mean() - student['Juniors'].mean()
round(dif_average, 4)

-0.1408

In [None]:
# Here the population variances cannot be assumed equal, so we do not pool them.
# Instead, each sample variance is divided by its own sample size:
n_soph = student['Sophomores'].count()   # 17 (NaN rows are not counted)
n_jun = student['Juniors'].count()       # 13
var_soph = student['Sophomores'].var()
var_jun = student['Juniors'].var()

In [88]:
# Then we take the square root of the sum of those two terms to get
# the standard error of the difference:

std_error = np.sqrt(var_soph/n_soph + var_jun/n_jun)
round(std_error, 4)

0.1525

In [89]:
# Now we have all the pieces for the test statistic: the difference of the
# means, the sample variances and the sample sizes.
# We compute the test statistic as follows:

# t = difference of group means / standard error of the difference

t = dif_average / std_error
round(t, 3)

-0.923

In [None]:
# The test is two-sided. |t| is well below the two-tailed 5% critical value
# of about 2.05, and the two-sided P-Value is approximately 0.36.

# The result is not significant at p < .05.

- The test statistic for the unpooled variance case can be calculated as shown in the linked image:

https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/7.04/7.04-unpooled_variances.png
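The lab takes the degrees of freedom as (n1-1)+(n2-1). The Welch-Satterthwaite approximation, which is commonly used when the variances are not pooled, gives a value no larger than this.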
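
The unpooled calculation can be sketched with the standard library alone. The helper name `welch_t` is hypothetical and introduced here for illustration; it returns the unpooled t statistic together with the Welch-Satterthwaite degrees of freedom, which can be compared with the simpler (n1-1)+(n2-1) = 28. The lists contain the 17 sophomore and 13 junior observations given in the problem.

```python
import math
import statistics

sophomores = [3.04, 1.71, 3.3, 2.88, 2.11, 2.6, 2.92, 3.6, 2.28,
              2.82, 3.03, 3.13, 2.86, 3.49, 3.11, 2.13, 3.27]
juniors = [2.56, 2.77, 2.7, 3, 2.98, 3.47, 3.26, 3.2, 3.19,
           2.65, 3, 3.39, 2.58]

def welch_t(a, b):
    """Return (t, df) for the two-sample t-test with unpooled variances."""
    va = statistics.variance(a) / len(a)  # s_a^2 / n_a (sample variance, n-1)
    vb = statistics.variance(b) / len(b)  # s_b^2 / n_b
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(sophomores, juniors)
print(round(t_stat, 3), round(df, 1))
```

With either choice of degrees of freedom, the statistic is far from the two-tailed 5% critical value, so the conclusion does not depend on which one is used.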