# Hypothesis Testing

In statistics, a hypothesis is a claim or statement about a property of a population. A hypothesis test is a method of statistical inference that uses sample data to decide whether there is enough evidence to reject that hypothesis.

1. Overview
2. Basics of hypothesis testing
3. Testing a claim about a proportion
4. Testing a claim about a mean: σ known (population standard deviation known)
5. Testing a claim about a mean: σ unknown (population standard deviation unknown)
6. Testing a claim about a standard deviation

# Key Concepts of Hypothesis Testing
### 1. Null Hypothesis (H0):
A statement that there is no effect or no difference, and it serves as a starting point for statistical testing.
### 2. Alternative Hypothesis (H1): 
A statement that indicates the presence of an effect or a difference, opposing the null hypothesis.
### 3. Significance Level (α): 
The probability of rejecting the null hypothesis when it is actually true (a Type I error) that we are willing to accept. It is the threshold the p-value is compared against, commonly set at 0.05 or 0.01.
### 4. p-value: 
The probability of observing the data, or something more extreme, if the null hypothesis is true. If the p-value is less than the significance level, we reject the null hypothesis in favor of the alternative hypothesis.
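
The decision rule above can be sketched with a one-sample t-test. This is only an illustration: the sample is randomly generated, and the hypothesized mean `mu0` and the seed are assumptions made up for the example, not values from the datasets used later.

```python
import numpy as np
from scipy import stats

# Made-up sample for illustration only (fixed seed for reproducibility)
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.3, scale=1.0, size=40)

alpha = 0.05  # significance level chosen before looking at the data
mu0 = 5.0     # population mean claimed by H0 (assumed value)

# Two-sided one-sample t-test of H0: mean == mu0
result = stats.ttest_1samp(sample, popmean=mu0)
p_value = result.pvalue

# Reject H0 only when the p-value falls below the significance level
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {p_value:.4f} -> {decision}")
```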
### 5. Test Statistic:
A standardized value that is calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis.
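
As one concrete case, the one-sample t statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors. The sketch below computes it by hand and compares it with SciPy's result; the numbers in `sample` and the value of `mu0` are invented for the example.

```python
import math
import numpy as np
from scipy import stats

# Invented measurements, used only to illustrate the formula
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.4, 5.7, 5.2])
mu0 = 5.0  # mean claimed by H0 (assumed value)

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - hypothesized mean) / standard error of the mean
t_manual = (x_bar - mu0) / (s / math.sqrt(n))

# SciPy computes the same statistic
t_scipy = stats.ttest_1samp(sample, popmean=mu0).statistic
print(t_manual, t_scipy)
```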
### 6. Confidence Interval: 
A range of values computed from sample data that, at a stated confidence level (for example 95%), is expected to contain the population parameter. The narrower the interval, the more precise the estimate.
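
A minimal sketch of a 95% confidence interval for a mean, assuming the data are roughly normal so that the t distribution applies. The values in `sample` are invented for the example.

```python
import numpy as np
from scipy import stats

# Invented measurements, used only to illustrate the calculation
sample = np.array([5.1, 4.9, 5.6, 5.8, 5.0, 5.4, 5.7, 5.2])

n = sample.size
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# 95% t-based interval centred on the sample mean
low, high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```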
### 7. Type I Error (α):
The error made when rejecting the null hypothesis when it is actually true, also known as a false positive.
### 8. Type II Error (β):
The error made when failing to reject the null hypothesis when it is actually false, also known as a false negative.
### 9. Power of a Test:
The probability of correctly rejecting the null hypothesis when it is false, equal to 1 − β.
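
Type I error rate and power can both be estimated by simulation: draw many samples, test each one, and count how often H0 is rejected. The sketch below assumes normally distributed data; the sample size, number of simulations, and true effect of 0.5 standard deviations are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n, n_sim = 30, 2000  # sample size and number of simulated experiments (assumed)

def rejection_rate(true_mean):
    """Fraction of simulated samples in which H0: mean == 0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_sim, n))
    p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue
    return np.mean(p_values < alpha)

# H0 is true: every rejection is a Type I error, so the rate should be near alpha
type_i_rate = rejection_rate(0.0)

# H0 is false: every rejection is correct, so the rate estimates the power (1 - beta)
power = rejection_rate(0.5)

print(f"Type I error rate ~ {type_i_rate:.3f}, power ~ {power:.3f}")
```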
### 10. Effect Size:
The magnitude of a difference or relationship in a statistical test, indicating the practical significance of the results.
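
One common effect size for a difference between two group means is Cohen's d, the mean difference divided by the pooled standard deviation. The helper name `cohens_d` and the two small groups are hypothetical, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    na, nb = a.size, b.size
    # Pooled variance weights each group's sample variance by its degrees of freedom
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Means are 4 and 3, both sample variances are 4, so d = (4 - 3) / 2 = 0.5
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)
```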
### 11. Sample Size:
The number of observations in a sample, which affects the test's power and the precision of the estimates.
### 12. Statistical Significance:
A determination that the results of a statistical test are unlikely to have occurred under the null hypothesis, often assessed using p-values and confidence intervals.
### 13. Assumptions of the Test:
The conditions that must be met for a statistical test to be valid. Which ones apply depends on the test; common examples are normality, independence, and homoscedasticity (equal variances across groups).
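
Two of these assumptions can be checked with standard tests: Shapiro-Wilk for normality and Levene for equal variances. The sketch below uses randomly generated groups as a stand-in for real data; independence usually has to be judged from how the data were collected rather than tested.

```python
import numpy as np
from scipy import stats

# Randomly generated groups, used only to show the calls
rng = np.random.default_rng(1)
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

alpha = 0.05

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution
shapiro_a = stats.shapiro(group_a)
shapiro_b = stats.shapiro(group_b)

# Levene: H0 is that the groups have equal variances
levene = stats.levene(group_a, group_b)

print(f"Shapiro p-values: {shapiro_a.pvalue:.3f}, {shapiro_b.pvalue:.3f}")
print(f"Levene p-value:   {levene.pvalue:.3f}")
# A p-value below alpha is evidence against the corresponding assumption
```

A large p-value here does not prove the assumption holds; it only means the test found no strong evidence against it.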

## Making 5 hypotheses about the data

1. From IRIS dataset
2. From Titanic dataset
3. From Tips dataset
4. From Exercise dataset
5. From Attention dataset

In [41]:
import seaborn as sns

print(sns.get_dataset_names())

['anagrams', 'anscombe', 'attention', 'brain_networks', 'car_crashes', 'diamonds', 'dots', 'dowjones', 'exercise', 'flights', 'fmri', 'geyser', 'glue', 'healthexp', 'iris', 'mpg', 'penguins', 'planets', 'seaice', 'taxis', 'tips', 'titanic']


In [42]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [43]:
df = sns.load_dataset("iris")
df.head(150)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [45]:
df1 = sns.load_dataset("titanic")
df1.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [46]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


In [47]:
df2 = sns.load_dataset("tips")
df2.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [48]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [49]:
# Inspect the unique categories in the 'time' column of the tips data
unique_categories = df2["time"].unique()
print(unique_categories)

['Dinner', 'Lunch']
Categories (2, object): ['Lunch', 'Dinner']


In [50]:
df3 = sns.load_dataset("exercise")
df3.head(90)

Unnamed: 0.1,Unnamed: 0,id,diet,pulse,time,kind
0,0,1,low fat,85,1 min,rest
1,1,1,low fat,85,15 min,rest
2,2,1,low fat,88,30 min,rest
3,3,2,low fat,90,1 min,rest
4,4,2,low fat,92,15 min,rest
...,...,...,...,...,...,...
85,85,29,no fat,135,15 min,running
86,86,29,no fat,130,30 min,running
87,87,30,no fat,99,1 min,running
88,88,30,no fat,111,15 min,running


In [51]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Unnamed: 0  90 non-null     int64   
 1   id          90 non-null     int64   
 2   diet        90 non-null     category
 3   pulse       90 non-null     int64   
 4   time        90 non-null     category
 5   kind        90 non-null     category
dtypes: category(3), int64(3)
memory usage: 2.9 KB


In [52]:
# Rename the leftover index column 'Unnamed: 0' (the first column)
df3 = df3.rename(columns={"Unnamed: 0": "names"})

# Verify the change
print(df3.columns)

Index(['names', 'id', 'diet', 'pulse', 'time', 'kind'], dtype='object')


In [53]:
# Get the number of unique values in each column
unique_counts = df3.nunique()
print(unique_counts)

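# Print the unique values of object-dtype columns only
# (diet, time and kind are 'category' dtype in df3, so this loop prints nothing here)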
for column in df3.columns:
    if df3[column].dtype == "object":
        print(f"Unique values in column '{column}':")
        print(df3[column].unique())
        print("\n")

names    90
id       30
diet      2
pulse    39
time      3
kind      3
dtype: int64


In [54]:
df4 = sns.load_dataset("attention")
df4.head()

Unnamed: 0.1,Unnamed: 0,subject,attention,solutions,score
0,0,1,divided,1,2.0
1,1,2,divided,1,3.0
2,2,3,divided,1,3.0
3,3,4,divided,1,5.0
4,4,5,divided,1,4.0


In [55]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  60 non-null     int64  
 1   subject     60 non-null     int64  
 2   attention   60 non-null     object 
 3   solutions   60 non-null     int64  
 4   score       60 non-null     float64
dtypes: float64(1), int64(3), object(1)
memory usage: 2.5+ KB


# H1: Setosa petals are longer than versicolor and virginica petals.
# H2: There is an association between survival and sex.
# H3: The tips received at dinner time are significantly higher than the tips received at lunch time.
# H4: Individuals who ran will exhibit a higher heart rate than those who rested or walked.
# H5: Individuals who were focused will have lower scores on the test than those who were distracted.
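
One possible way to test hypotheses of this shape is sketched below. The helper names `group_vs_rest` and `association` are hypothetical. The choice of a one-sided Welch t-test (one group against all other rows) and a chi-square test of independence is an assumption about which tests fit, not the only option. The `demo` frame is made-up data that only shows the calls run.

```python
import pandas as pd
from scipy import stats

def group_vs_rest(df, value_col, group_col, group, alternative="greater"):
    """One-sided Welch t-test of one group against all other rows (hypothetical helper)."""
    in_group = df[group_col] == group
    a = df.loc[in_group, value_col].dropna()
    b = df.loc[~in_group, value_col].dropna()
    # equal_var=False gives Welch's test, which does not assume equal variances
    return stats.ttest_ind(a, b, equal_var=False, alternative=alternative)

def association(df, col_a, col_b):
    """Chi-square test of independence between two categorical columns (hypothetical helper)."""
    table = pd.crosstab(df[col_a], df[col_b])
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return chi2, p

# Made-up data, not taken from any of the seaborn datasets
demo = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 6,
    "value": [3.5, 4.0, 3.8, 4.2, 3.9, 4.1, 2.0, 2.5, 2.2, 2.8, 2.4, 2.6],
    "category": ["x", "y"] * 6,
    "outcome": [0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1],
})

t_res = group_vs_rest(demo, "value", "group", "A", alternative="greater")
chi2, p_assoc = association(demo, "category", "outcome")
print(f"Welch t = {t_res.statistic:.2f}, one-sided p = {t_res.pvalue:.4g}")
print(f"chi2 = {chi2:.2f}, p = {p_assoc:.4g}")
```

With the frames loaded earlier, H3 would correspond to a call like `group_vs_rest(df2, "tip", "time", "Dinner")`, H2 to `association(df1, "sex", "survived")`, and H5 to `group_vs_rest(df4, "score", "attention", "focused", alternative="less")`. The exercise data has 90 rows for 30 ids, so each individual is measured three times; the independence assumption of a plain t-test is questionable for H4.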