# Conditions when ttest is applied

#### 1. Data is normally distributed (normality).
#### 2. Data is continuous.
#### 3. Population variance is unknown.
#### NOTE: When the population variance is known, a Z test is the recommended approach.
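
The normality condition can be checked before running a ttest. The following is a minimal sketch, assuming a Shapiro-Wilk test is an acceptable check. The `sample` array and its parameters are illustrative and do not come from the datasets used in this notebook.

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative sample (assumption: not part of the notebook's data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=34, scale=7, size=200)

# Shapiro-Wilk test: H0 is that the sample comes from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(sample)

alpha = 0.05
# True means the test found no evidence against normality at this alpha
looks_normal = bool(shapiro_p > alpha)
print(shapiro_stat, shapiro_p, looks_normal)
```

With large samples the ttest is fairly robust to mild departures from normality, so a check like this matters most for small samples.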

#### Let's take the significance level of testing, alpha, as 5%, i.e. 0.05

# One Sample ttest

## Null Hypothesis, H0 : Population mean = Specified value

In [54]:
# Import libraries

import numpy as np
import pandas as pd
from scipy import stats  # provides stats.ttest_1samp, stats.ttest_ind, stats.ttest_rel and stats.t

In [1]:
# We are taking the Salary dataset
salary_data = pd.read_csv('Salary_Data.csv')
salary_data.head()

Unnamed: 0,Age,Gender,Education Level,Job Title,Years of Experience,Salary
0,32.0,Male,Bachelor's,Software Engineer,5.0,90000.0
1,28.0,Female,Master's,Data Analyst,3.0,65000.0
2,45.0,Male,PhD,Senior Manager,15.0,150000.0
3,36.0,Female,Bachelor's,Sales Associate,7.0,60000.0
4,52.0,Male,Master's,Director,20.0,200000.0


In [2]:
salary_data.shape

(6704, 6)

#### Perform ttest on Age column of the Salary dataset.

In [3]:
# Let's check whether the Age column has any null values by listing its unique values.
salary_data['Age'].unique()

array([32., 28., 45., 36., 52., 29., 42., 31., 26., 38., 48., 35., 40.,
       27., 44., 33., 39., 25., 51., 34., 47., 30., 41., 37., 24., 43.,
       50., 46., 49., 23., 53., nan, 61., 57., 62., 55., 56., 54., 60.,
       58., 22., 21.])

#### The Age column does have null values (nan). Let's replace them with the median.

In [8]:
salary_data['Age'] = salary_data['Age'].replace(np.nan, salary_data['Age'].median())

#### Now apply ttest to these observations.


In [9]:
# Now let's fetch the age for all employees.
age = salary_data['Age']
age.dtype

dtype('float64')
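
Median imputation keeps the sample size but adds values that were never observed. As a hedged alternative, the missing ages could simply be dropped. The sketch below compares the two options on a small illustrative Series; `age_demo` is a hypothetical stand-in for `salary_data['Age']`.

```python
import numpy as np
import pandas as pd

# Illustrative Series with missing values (assumption: stands in for salary_data['Age'])
age_demo = pd.Series([32.0, 28.0, np.nan, 45.0, 36.0, np.nan, 52.0])

# Option 1: impute missing values with the median, as done in this notebook
age_imputed = age_demo.fillna(age_demo.median())

# Option 2: drop missing values so the test uses only observed ages
age_observed = age_demo.dropna()

print(age_demo.isna().sum(), len(age_imputed), len(age_observed))
```

`stats.ttest_1samp` also accepts `nan_policy='omit'`, which ignores missing values without modifying the column.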

### Performing 1 sample ttest

In [56]:
ttest_1_sample = stats.ttest_1samp(age, 34)
ttest_1_sample 

TtestResult(statistic=-4.082570504931489, pvalue=4.505820030225165e-05, df=6703)
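
To see where the statistic comes from, it can be recomputed by hand as the difference between the sample mean and the hypothesised mean, divided by the standard error. This is a minimal sketch on synthetic data; the `ages` array is an assumption and stands in for the Age column.

```python
import numpy as np
from scipy import stats

# Illustrative sample (assumption: stands in for the Age column)
rng = np.random.default_rng(1)
ages = rng.normal(loc=33.5, scale=7.6, size=500)
mu_0 = 34  # value specified by the null hypothesis

n = ages.size
sample_mean = ages.mean()
sample_sd = ages.std(ddof=1)  # sample standard deviation (n - 1 in the denominator)

# t = (sample mean - hypothesised mean) / standard error
t_manual = (sample_mean - mu_0) / (sample_sd / np.sqrt(n))

# Same statistic from scipy, for comparison
t_scipy = stats.ttest_1samp(ages, mu_0).statistic
print(t_manual, t_scipy)
```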

## Conclusions:
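### We can observe that the p_value (about 4.5e-05) is less than the significance level (0.05). Hence:
#### H0 (null hypothesis) is rejected.
#### H1 (alternate hypothesis): Population Age mean != 34. The test is two-sided, and the negative statistic indicates that the sample mean is below 34.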

### Let's find the value of the cumulative distribution function (CDF) of the t distribution at the test statistic.

In [31]:
cumulative_distribution = stats.t.cdf(-4.082570504931489,6703)

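### P-value using the CDF. Note that 1 - CDF is the one-sided (upper-tail) p-value, for H1: Population Age mean > 34. The two-sided p-value reported by ttest_1samp equals 2 * CDF here, because the statistic is negative.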

In [32]:
p_value = 1-cumulative_distribution
p_value

0.9999774708998489
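
The two-sided p-value reported by `ttest_1samp` can be recovered from the same CDF. The sketch below reuses the statistic and degrees of freedom printed earlier; the variable names are illustrative.

```python
from scipy import stats

t_statistic = -4.082570504931489  # statistic reported by ttest_1samp above
degrees_of_freedom = 6703         # n - 1

# Two-sided p-value: probability of a statistic at least this extreme in either tail
p_two_sided = 2 * stats.t.cdf(-abs(t_statistic), degrees_of_freedom)

# One-sided p-values for comparison
p_lower_tail = stats.t.cdf(t_statistic, degrees_of_freedom)      # H1: mean < 34
p_upper_tail = 1 - stats.t.cdf(t_statistic, degrees_of_freedom)  # H1: mean > 34
print(p_two_sided, p_lower_tail, p_upper_tail)
```

The two-sided value agrees with the p-value from `ttest_1samp` (about 4.5e-05), while the upper-tail value is close to 1 because the sample mean lies below 34.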

# Two Sample ttest : testing for difference across populations

## Let's take 2 Independent samples for this.

In [52]:
age_sample_a = np.random.normal(2.5,0.9,1000)
age_sample_b = np.random.normal(1.8,0.7,1000)

### ttest

In [58]:
ttest_2_sample = stats.ttest_ind(age_sample_a, age_sample_b)
ttest_2_sample

Ttest_indResult(statistic=18.32270574393021, pvalue=1.9285978576583044e-69)
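
By default `stats.ttest_ind` pools the two variances. Since the two samples here are drawn with different standard deviations (0.9 and 0.7), Welch's variant may be the safer choice. The following is a sketch with a seeded generator; the seed and the names `sample_a` and `sample_b` are assumptions made for reproducibility.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible
sample_a = rng.normal(2.5, 0.9, 1000)
sample_b = rng.normal(1.8, 0.7, 1000)

# Student's t-test: pools the two variances (scipy's default, equal_var=True)
student = stats.ttest_ind(sample_a, sample_b)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(student.pvalue, welch.pvalue)
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ, so the choice matters more when the group sizes are unequal.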

# Two Sample ttest on the CO2 dataset : chilled vs non-chilled plants

## Null Hypothesis, H0 : mean uptake chilled == mean uptake non-chilled

## Alternate Hypothesis, H1 : mean uptake chilled != mean uptake non-chilled

In [69]:
# We are taking the CO2 plant uptake dataset.
co2_data = pd.read_csv("CO2.csv")
co2_data

Unnamed: 0,Plant,Type,Treatment,conc,uptake
0,Qn1,Quebec,nonchilled,95,16.0
1,Qn1,Quebec,nonchilled,175,30.4
2,Qn1,Quebec,nonchilled,250,34.8
3,Qn1,Quebec,nonchilled,350,37.2
4,Qn1,Quebec,nonchilled,500,35.3
...,...,...,...,...,...
79,Mc3,Mississippi,chilled,250,17.9
80,Mc3,Mississippi,chilled,350,17.9
81,Mc3,Mississippi,chilled,500,17.9
82,Mc3,Mississippi,chilled,675,18.9


## Split into 2 dataframes on the basis of Plant treatment Chill vs Non-chill

In [65]:
co2_chill = co2_data[co2_data['Treatment']=='chilled']
co2_non_chill = co2_data[co2_data['Treatment']=='nonchilled']

## Now extracting the uptake value for Chill and Non-chill treatment for plants.

In [71]:
uptake_chill = co2_chill['uptake']
uptake_non_chill = co2_non_chill['uptake']

### Performing 2 sample ttest

#### Assuming variance is same for both samples

In [83]:
ttest_2_sample = stats.ttest_ind(uptake_chill,uptake_non_chill)
ttest_2_sample

Ttest_indResult(statistic=-3.0484611149819503, pvalue=0.0030957332525416484)

## Conclusions:

### We can observe now the p_value is less than the significance level (0.05). Hence:

#### H0 (null hypothesis) is rejected.

#### The mean uptake values differ between the two treatments at the 5% level of significance.
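
The test above assumes equal variances in both groups. One way to check that assumption is Levene's test. This is a minimal sketch on synthetic values; `uptake_a` and `uptake_b` are hypothetical and are not read from CO2.csv.

```python
import numpy as np
from scipy import stats

# Illustrative uptake values (assumption: synthetic, not read from CO2.csv)
rng = np.random.default_rng(7)
uptake_a = rng.normal(24.0, 10.0, 42)
uptake_b = rng.normal(30.0, 10.0, 42)

# Levene's test: H0 is that both groups have equal variances
levene_stat, levene_p = stats.levene(uptake_a, uptake_b)

# Use the pooled-variance test only if equal variances are plausible
equal_var = bool(levene_p > 0.05)
result = stats.ttest_ind(uptake_a, uptake_b, equal_var=equal_var)
print(levene_p, equal_var, result.pvalue)
```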

### Performing 2 sample Paired ttest : repeated measurements on the same individuals


### Let's generate random integers from 80 (inclusive) to 100 (exclusive) and perform a paired ttest on them.

In [90]:
pre_weight= np.random.randint(80, 100, 5)
post_weight= np.random.randint(80, 100, 5)

### In a paired ttest, assume xi and yi are the paired observations and n is the sample size:
### 1. di = xi - yi is the corresponding difference in the paired observations.
### 2. Let sd be the standard deviation of the differences, d the mean of the differences in the sample, and D the population mean difference. The test statistic is t = d / (sd / sqrt(n)), with n - 1 degrees of freedom.
### 3. Two-tailed hypothesis:
## H0: D = 0 and H1: D != 0

In [91]:
ttest_2_sample_paired = stats.ttest_rel(pre_weight,post_weight)
ttest_2_sample_paired

TtestResult(statistic=-1.8926798044321347, pvalue=0.13134358766976953, df=4)
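
The paired statistic can be reproduced by hand from the differences di = xi - yi. The sketch below uses fixed illustrative values rather than the random draws above, so `pre` and `post` are assumptions.

```python
import numpy as np
from scipy import stats

# Illustrative paired measurements (assumption: fixed values, not the random draws above)
pre = np.array([85, 92, 88, 97, 81])
post = np.array([89, 90, 95, 99, 86])

d = pre - post        # di = xi - yi
n = d.size
d_mean = d.mean()     # mean of the differences
d_sd = d.std(ddof=1)  # sample standard deviation of the differences

# t = mean difference / standard error of the differences
t_manual = d_mean / (d_sd / np.sqrt(n))
p_manual = 2 * stats.t.cdf(-abs(t_manual), n - 1)  # two-tailed, n - 1 degrees of freedom

paired = stats.ttest_rel(pre, post)
print(t_manual, p_manual, paired.statistic, paired.pvalue)
```

This also shows that a paired ttest is equivalent to a one-sample ttest of the differences against 0.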

## Conclusions:

### We can observe now the p_value is greater than the significance level (0.05). Hence:

#### We fail to reject H0 (null hypothesis): there is no significant evidence of a difference between pre_weight and post_weight.
