**Lab | Inferential statistics - T-test & P-value**

**Instructions**

1. One-tailed t-test / files_for_lab/machine.txt

In a packing plant, a machine packs cartons with jars. A new machine is expected to pack faster on average than the machine currently in use. To test that hypothesis, the time each machine takes to pack ten cartons is recorded. The results, in seconds, are in the file files_for_lab/machine.txt. Assuming the conditions for a t-test are met, does the data provide sufficient evidence that one machine is better than the other?

2. Matched Pairs Test / files_for_lab/pokemon.csv

In this challenge we will compare dependent samples of data describing our Pokemon (file files_for_lab/pokemon.csv). Our goal is to see whether there is a significant difference between each Pokemon's defense and attack scores. Our hypothesis is that the defense and attack scores are equal. Compare the two columns to see if there is a statistically significant difference between them and comment on your result.

**Inferential statistics - ANOVA**

Part 1

In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing Power (in Watts) of the plasma beam. Data was collected and provided to you to conduct statistical analysis and check if changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

Part 2

* In this section, use Python to conduct ANOVA.
* What conclusions can you draw from the experiment and why?

**1. One-tailed t-test**

In [181]:
#In a packing plant, a machine packs cartons with jars. 
#It is supposed that a **new machine** will pack **faster on average** than the machine currently used. 
#To **test that hypothesis**, the times each machine takes to pack ten cartons are recorded. 
#The results are in **seconds** in the tables in the file files_for_lab/machine.txt. 
#Assume that there is sufficient evidence to conduct the **t-test**, 
#does the data provide sufficient evidence to show if one machine is better than the other?

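H0 = The new machine is not faster on average than the old machine (its mean packing time is greater than or equal to the old machine's)
H1 = The new machine is faster on average than the old machine (its mean packing time is lower)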

How do I test this hypothesis?
Is there sufficient evidence to conduct the t-test?
Does the data provide sufficient evidence to show if one machine is better than the other?
How many cartons (ten cartons)

In [182]:
# Loading libraries

import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
# import warnings
# warnings.filterwarnings('ignore')

In [None]:
# Reading data

#import chardet

#with open('machine.txt', 'rb') as file:
    #raw_data = file.read()
    #result = chardet.detect(raw_data)
    #encoding = result['encoding']

#with open('machine.txt', 'r', encoding=encoding) as file:
    #data = file.read()


#data = pd.read_csv('pokemon.csv')
#data = pd.read_csv('../../Afternoon/lab-t-tests-p-values/files_for_lab/pokemon.csv')
#data = pd.read_csv('../../Afternoon/lab-t-tests-p-values/files_for_lab/machine.txt')
#data = pd.read_csv('machine.txt', sep = '\t')
#data = pd.read_csv('../../Afternoon/lab-t-tests-p-values/files_for_lab/machine.csv')
#data = pd.read_csv('your_text_file.txt', delimiter='\t')
#data = pd.read_csv('machine.txt', delimiter='\t')
#data = pd.read_table('../../Afternoon/lab-t-tests-p-values/files_for_lab/machine.txt')
#data = pd.read_table('machine.txt')
#data = pd.read_csv('machine.txt', sep='\t', header=None)
#data


In [183]:
# Reading data

with open('machine.txt', 'r', encoding='utf-16', errors='replace') as file:
    data = file.read()
data

'New machine\t    Old machine\n42.1\t        42.7\n41\t            43.6\n41.3\t        43.8\n41.8\t        43.3\n42.4\t        42.5\n42.8\t        43.5\n43.2\t        43.1\n42.3\t        41.7\n41.8\t        44\n42.7\t        44.1\n'

In [184]:
# Reading data

import chardet
# Detect the encoding of the file
file_path = 'machine.txt'
with open(file_path, 'rb') as f:
    result = chardet.detect(f.read())
encoding = result['encoding']
print(f"Detected encoding: {encoding}")
# Read the file with the detected encoding
data = pd.read_csv(file_path, sep='\t', encoding=encoding)
data.head()

Detected encoding: UTF-16


Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5
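The raw string printed earlier shows that the header row is padded with spaces after the tab, so the second column is really named `'    Old machine'` even though the table displays it as `Old machine`. A minimal sketch of one way to normalise the header, assuming the file content matches that printed string (the `raw` literal is copied from it, `io.StringIO` stands in for the real UTF-16 file, and the name `machines` is illustrative):

```python
import io

import pandas as pd

# Content copied from the raw string printed above; StringIO stands in for the file.
raw = (
    "New machine\t    Old machine\n"
    "42.1\t        42.7\n41\t            43.6\n41.3\t        43.8\n"
    "41.8\t        43.3\n42.4\t        42.5\n42.8\t        43.5\n"
    "43.2\t        43.1\n42.3\t        41.7\n41.8\t        44\n"
    "42.7\t        44.1\n"
)

machines = pd.read_csv(io.StringIO(raw), sep="\t")
print(list(machines.columns))  # the second name still carries leading spaces

# Strip whitespace from the column names so they can be indexed as displayed.
machines.columns = machines.columns.str.strip()
print(list(machines.columns))
```

After this step `machines['Old machine']` can be indexed by the name shown in the table.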


In [None]:
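# Our hypothesis is that the new machine packs faster than the current machine, so the test is one-tailed
# (a faster machine has a lower mean packing time):
H0: μ_new ≥ μ_old → μ_new − μ_old ≥ 0
H1: μ_new < μ_old → μ_new − μ_old < 0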


In [None]:
#It is supposed that a new machine will pack faster on average than the machine currently used. 
#Assume that there is sufficient evidence to conduct the t-test

In [155]:
data.head()

Unnamed: 0,New machine,Old machine
0,42.1,42.7
1,41.0,43.6
2,41.3,43.8
3,41.8,43.3
4,42.4,42.5


In [156]:
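# The header in machine.txt is padded with spaces, so the second column is really
# '    Old machine'; strip the names before indexing to avoid the KeyError.
data.columns = data.columns.str.strip()
new, old = data['New machine'], data['Old machine']
len(new), len(old)

In [149]:
# The two machines are independent samples (not pairs), so summarise each one separately.
new_mean, new_std = new.mean(), new.std(ddof=1)
old_mean, old_std = old.mean(), old.std(ddof=1)
new_mean, new_std, old_mean, old_std

In [157]:
# Two-sample t statistic with pooled variance (equal-variance assumption).
n_new, n_old = len(new), len(old)
pooled_var = ((n_new - 1) * new_std**2 + (n_old - 1) * old_std**2) / (n_new + n_old - 2)
t = (new_mean - old_mean) / np.sqrt(pooled_var * (1/n_new + 1/n_old))
print("The mean packing time of the new machine is: {:.2f}".format(new_mean))
print("The mean packing time of the old machine is: {:.2f}".format(old_mean))
print("Our t statistic is: {:.2f}".format(t))

In [158]:
# One-tailed critical value at the 5% level: reject H0 if t < tc.
tc = st.t.ppf(0.05, df=n_new + n_old - 2)
tc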
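Because the two machines are timed independently, one way to test whether the new machine is faster is a one-sided, two-sample t-test. The sketch below assumes equal variances and SciPy 1.6 or later (for the `alternative` argument). The values are the ten timings per machine shown in the raw file output, and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as st

# Packing times in seconds, copied from the raw contents of machine.txt shown earlier.
new = np.array([42.1, 41.0, 41.3, 41.8, 42.4, 42.8, 43.2, 42.3, 41.8, 42.7])
old = np.array([42.7, 43.6, 43.8, 43.3, 42.5, 43.5, 43.1, 41.7, 44.0, 44.1])

# H0: mean(new) >= mean(old)   H1: mean(new) < mean(old)  (new machine is faster)
t_stat, p_value = st.ttest_ind(new, old, equal_var=True, alternative="less")

# The same statistic by hand, using the pooled variance.
n1, n2 = len(new), len(old)
pooled_var = ((n1 - 1) * new.var(ddof=1) + (n2 - 1) * old.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (new.mean() - old.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

alpha = 0.05
t_crit = st.t.ppf(alpha, df=n1 + n2 - 2)  # lower-tail critical value
reject_h0 = p_value < alpha

print("t = {:.3f}, critical t = {:.3f}, p = {:.4f}".format(t_stat, t_crit, p_value))
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

If the equal-variance assumption is doubtful, passing `equal_var=False` switches to Welch's t-test.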

**2. Matched Pairs Test**

In [None]:
#In this challenge we will **compare dependent samples of data** describing our Pokemon (file files_for_lab/pokemon.csv). 
#Our goal is to see whether there is a **significant difference** between each Pokemon's defense and attack scores. 
#Our **hypothesis** is that the **defense and attack scores are equal**. 
#Compare the two columns to see if there is a statistically significant difference between them and **comment on your result**.

H0: defense and attack scores are equal.
H1: defense and attack scores are not equal
    
How do I compare dependent samples?
How do I see a statistically significant difference?
How do I comment on my result?

In [95]:
# Reading data

#data = pd.read_csv('pokemon.csv')
data = pd.read_csv('../../Afternoon/lab-t-tests-p-values/files_for_lab/pokemon.csv')
data

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [96]:
len(data)

800

In [98]:
data.shape

(800, 13)

In [99]:
data.isna().sum()

#               0
Name            0
Type 1          0
Type 2        386
Total           0
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64

In [None]:
# Get columns Attack and Defense

In [175]:
#Sample 
# data must be the Pokemon DataFrame here; select both columns with a list of names,
# then draw 30 Pokemon at random.
ad_sample = data[['Attack', 'Defense']].sample(30, random_state=1)
ad_sample.head()

In [171]:
#Compute the statistic and p-value

# Paired (dependent) test: work with each Pokemon's Defense - Attack difference.
diff = ad_sample['Defense'] - ad_sample['Attack']
diff_mean = diff.mean()
diff_std = diff.std(ddof=1)
sem = diff_std/np.sqrt(len(diff))
print("Our sample mean difference is: {:.2f}".format(diff_mean))
print("Our sample standard deviation of the differences is: {:.2f}".format(diff_std))
print("The sem is: {:.2f}".format(sem))
print()
t = diff_mean / sem  # under H0 the mean difference is 0
p_value = 2 * st.t.sf(abs(t), df=len(diff)-1)  # two-sided p-value
print("Our statistic is: {:.2f}".format(t))
print("The p_value corresponding to our statistic is: {:.4f}".format(p_value))
print("The significance level is set to 0.05")
print("We reject H0" if p_value < 0.05 else "We fail to reject H0")

In [100]:
nulls_percent_df = pd.DataFrame(data.isna().sum()/len(data)).reset_index()
nulls_percent_df.columns = ['column_name', 'nulls_percentage']
nulls_percent_df

Unnamed: 0,column_name,nulls_percentage
0,#,0.0
1,Name,0.0
2,Type 1,0.0
3,Type 2,0.4825
4,Total,0.0
5,HP,0.0
6,Attack,0.0
7,Defense,0.0
8,Sp. Atk,0.0
9,Sp. Def,0.0


In [None]:
# Check for null values in the numerical columns

In [116]:
# Get numerical columns
numerical = data.select_dtypes(np.number)
numerical.head()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
0,1,318,45,49,49,65,65,45,1
1,2,405,60,62,63,80,80,60,1
2,3,525,80,82,83,100,100,80,1
3,3,625,80,100,123,122,120,80,1
4,4,309,39,52,43,60,50,65,1


In [117]:
numerical.shape

(800, 9)

In [118]:
# Check for null values in the numerical columns
numerical.isna().sum()

#             0
Total         0
HP            0
Attack        0
Defense       0
Sp. Atk       0
Sp. Def       0
Speed         0
Generation    0
dtype: int64

In [119]:
# Check for null values in the numerical columns.
df = pd.DataFrame(numerical.isna().sum()).reset_index()
df.columns = ['column_name', 'nulls']
df[df['nulls']>0]

Unnamed: 0,column_name,nulls


In [120]:
#Determine which columns have a percentage of NAs above 25%
nulls_percent_df4 = pd.DataFrame(data.isna().sum()/len(data)).reset_index() # get dataframe and store in variable
nulls_percent_df4.columns = ['column_name', 'nulls_percentage'] #rename column names
nulls_percent_df4

nulls_percent_df4[nulls_percent_df4['nulls_percentage']>0.25]

Unnamed: 0,column_name,nulls_percentage
3,Type 2,0.4825


In [None]:
# Use appropriate methods to clean the columns Attack, Defense

In [121]:
# Check for null values in the numerical columns.
numerical.isna().sum()/len(numerical)

#             0.0
Total         0.0
HP            0.0
Attack        0.0
Defense       0.0
Sp. Atk       0.0
Sp. Def       0.0
Speed         0.0
Generation    0.0
dtype: float64

In [122]:
numerical['Attack'].isna().sum()/len(numerical)

0.0

In [123]:
numerical['Defense'].isna().sum()/len(numerical)

0.0

In [124]:
data['Attack'].value_counts()

100    40
65     39
50     37
80     37
85     33
       ..
46      1
190     1
106     1
132     1
33      1
Name: Attack, Length: 111, dtype: int64

In [125]:
data['Defense'].value_counts()

70     54
50     49
60     46
80     39
40     36
       ..
168     1
10      1
51      1
61      1
121     1
Name: Defense, Length: 103, dtype: int64

In [126]:
[col for col in data.columns if "Attack" in col]

['Attack']

In [138]:
# to use VarianceThreshold we need all the variables to be on the same scale
# we will use MinMaxScaler for this
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(numerical)
numerical_scaled = scaler.transform(numerical)

In [139]:
numerical_scaled_df = pd.DataFrame(numerical_scaled, columns = numerical.columns)
numerical_scaled_df.head()

Unnamed: 0,#,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation
0,0.0,0.23,0.173228,0.237838,0.195556,0.298913,0.214286,0.228571,0.0
1,0.001389,0.375,0.232283,0.308108,0.257778,0.380435,0.285714,0.314286,0.0
2,0.002778,0.575,0.311024,0.416216,0.346667,0.48913,0.380952,0.428571,0.0
3,0.002778,0.741667,0.311024,0.513514,0.524444,0.608696,0.47619,0.428571,0.0
4,0.004167,0.215,0.149606,0.254054,0.168889,0.271739,0.142857,0.342857,0.0


In [179]:
#Sample 
# data must be the Pokemon DataFrame here; select both columns with a list of names.
ad_sample = data[['Attack', 'Defense']].sample(30, random_state=1)
ad_sample.head()

In [39]:
# Advanced automated filtering methods: Variance Threshold

In [140]:
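from sklearn.feature_selection import VarianceThreshold
# sel was never created; fit a selector on the scaled columns first (default threshold of 0.0 as a placeholder)
sel = VarianceThreshold().fit(numerical_scaled)
removed_columns = pd.DataFrame(data=(numerical.columns,sel.variances_,sel.get_support()), index=('column_name','variance','selected')).T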

In [None]:
# Significant difference between each Pokemon's defense and attack scores?

In [None]:
# Our hypothesis is that the defense and attack scores are equal:
    
H0: μ_defense = μ_attack → μ_defense − μ_attack = 0
 
H1: μ_defense ≠ μ_attack → μ_defense − μ_attack ≠ 0
    
This is a two-sided test.

In [None]:
Take a random sample using .sample(30,random_state=1) on the filtered dataset
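The matched pairs test reduces to a one-sample t-test on the per-Pokemon differences. The sketch below shows the mechanics only: it uses the nine Attack/Defense pairs visible in the DataFrame preview above, which are not a random sample, so the resulting numbers say nothing about the full dataset. The array names are illustrative.

```python
import numpy as np
import scipy.stats as st

# Attack/Defense pairs copied from the nine rows visible in the preview (first five, last four).
attack = np.array([49, 62, 82, 100, 52, 100, 160, 110, 160])
defense = np.array([49, 63, 83, 123, 43, 150, 110, 60, 60])

# Paired test by hand: t = mean(d) / (std(d) / sqrt(n)), with d = defense - attack.
diff = defense - attack
n = len(diff)
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * st.t.sf(abs(t_manual), df=n - 1)  # two-sided p-value

# The same test in one call.
t_stat, p_value = st.ttest_rel(defense, attack)

print("manual: t = {:.3f}, p = {:.4f}".format(t_manual, p_manual))
print("scipy:  t = {:.3f}, p = {:.4f}".format(t_stat, p_value))
```

On the notebook's data the same call would take the `Defense` and `Attack` columns of the 30-row sample in place of the two arrays.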

**Inferential statistics - ANOVA**

Part 1

* In this activity, we will look at another example. 
* Your task is to **understand the problem** and **write down all the steps to set up ANOVA**. 
* After the next lesson, we will ask you to solve this problem using Python. 
* Here are the steps that you would need to work on: 

- Null hypothesis 
- Alternate hypothesis 
- Level of significance 
- Test statistic 
- P-value 
- F table

In [None]:
Context

Suppose you are working as an analyst in a microprocessor chip manufacturing plant. 
You have been given the task of **analyzing a plasma etching process** with respect to changing Power (in Watts) of the plasma beam. 
Data was collected and provided to you to conduct statistical analysis 
and check if changing the power of the plasma beam has any effect on the etching rate of the machine. 
You will conduct ANOVA and check if there is any difference in the mean etching rate for different levels of power. 
You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

H0: changing the power of the plasma beam has no effect on the etching rate (the mean etching rate is the same at every power level)
H1: changing the power of the plasma beam has an effect on the etching rate (at least one power level has a different mean etching rate)

How do I conduct statistical analysis on data?
How do I get power of the plasma beam and effect on the etching rate by the machine?
How do I measure the difference?
How do I measure the difference in the mean etching rate?
How do I measure the difference in the mean etching rate for different levels of power?
How do I compare different levels of power?
    

In [166]:
# Reading data

#data = pd.read_excel('anova_lab_data.xlsx')
data = pd.read_excel('../../Afternoon/lab-t-tests-p-values/files_for_lab/anova_lab_data.xlsx')
data

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01
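The Part 1 steps (hypotheses, level of significance, test statistic, p-value, F table) can be sketched by hand before reaching for a library call. The example below uses only the ten rows displayed above, so it is an illustration of the computation rather than the result for the full file. The 5% significance level and all variable names are assumptions.

```python
import numpy as np
import scipy.stats as st

# Etching rates grouped by power level, copied from the ten rows displayed above.
groups = {
    "160 W": np.array([5.43, 5.71, 6.22, 6.01]),
    "180 W": np.array([6.24, 6.71, 5.98]),
    "200 W": np.array([8.79, 9.20, 7.90]),
}
samples = list(groups.values())

k = len(samples)                   # number of groups
n = sum(len(s) for s in samples)   # total number of observations
grand_mean = np.concatenate(samples).mean()

# Between-group and within-group sums of squares.
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

df_between, df_within = k - 1, n - k
f_stat = (ss_between / df_between) / (ss_within / df_within)

alpha = 0.05                                          # level of significance (assumed)
p_value = st.f.sf(f_stat, df_between, df_within)      # P-value
f_crit = st.f.ppf(1 - alpha, df_between, df_within)   # value an F table would give

print("F = {:.2f}, critical F = {:.2f}, p = {:.5f}".format(f_stat, f_crit, p_value))
print("Reject H0" if f_stat > f_crit else "Fail to reject H0")
```

Comparing `f_stat` with `f_crit` and comparing `p_value` with `alpha` are two views of the same decision rule.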


Part 2

* In this section, **use Python to conduct ANOVA.**
* What **conclusions** can you draw from the experiment **and why**?

In [None]:
# Import the required libraries

import scipy.stats as stats
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
# Ensure that your data is properly cleaned and prepared before performing the ANOVA analysis:

In [None]:
# Prepare your data:
data = pd.read_excel('../../Afternoon/lab-t-tests-p-values/files_for_lab/anova_lab_data.xlsx')  # same file as loaded in Part 1
data.columns = data.columns.str.strip()  # guard against padded column names


In [None]:
# Perform the ANOVA:
# Using scipy.stats

groups = data['Power']  # power level of the plasma beam (group column)
values = data['Etching Rate']  # measured etching rate (data column)

# Pass one array of etching rates per power level
f_statistic, p_value = stats.f_oneway(*[values[groups == group] for group in groups.unique()])

# Print the ANOVA results
print("F-statistic:", f_statistic)
print("P-value:", p_value)

In [None]:
# Interpret the results:
If the p-value is less than your chosen significance level (e.g., 0.05), 
you can reject the null hypothesis, 
indicating that there are significant differences between at least two of the groups.

In [None]:
# What conclusions can you draw from the experiment and why?