## Statistical Techniques:
Statistical analysis involves applying appropriate techniques to test hypotheses, perform inference, and draw conclusions. Common techniques include t-tests, analysis of variance (ANOVA), regression analysis, chi-square tests, and correlation analysis. The choice of technique depends on the type of data, the research question, and the assumptions the data can reasonably be expected to satisfy.

## 1. Chi-Square Test

In [2]:
# import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [3]:
# load the dataset
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


**Null Hypothesis (H0):** There is no significant association between gender ('sex') and survival ('survived') on the Titanic.

**Alternative Hypothesis (H1):** There is a significant association between gender ('sex') and survival ('survived') on the Titanic.

In [4]:
# create the contingency table
contigency_table = pd.crosstab(df['sex'], df['survived'])
contigency_table

survived,0,1
sex,,
female,81,233
male,468,109
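Before running the test, it can help to see the direction of the association by converting the counts to row proportions. The sketch below is only illustrative: it re-enters the counts from the table above by hand as a NumPy array (an assumption made so the snippet is self-contained), and the names `observed` and `row_proportions` are not part of the original notebook.

```python
import numpy as np

# Counts copied by hand from the contingency table above.
# Rows: female, male; columns: survived = 0, survived = 1.
observed = np.array([[81, 233],
                     [468, 109]])

# Divide each row by its row total to get the share of each outcome per sex.
row_totals = observed.sum(axis=1, keepdims=True)
row_proportions = observed / row_totals

female_survival_rate = row_proportions[0, 1]
male_survival_rate = row_proportions[1, 1]

print(f"Female survival rate: {female_survival_rate:.3f}")
print(f"Male survival rate: {male_survival_rate:.3f}")
```

The chi-square test then asks whether a gap of this size between the two rows is plausible under independence.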


In [5]:
# apply chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contigency_table)

In [8]:
# print the results
print(f"Chi-Square Statistics: {chi2}")
print(f"P-value: {p}")
print(f"Degree of Freedom: {dof}")
print(f"Expected: \n {expected}")

Chi-Square Statistics: 260.71702016732104
P-value: 1.1973570627755645e-58
Degree of Freedom: 1
Expected: 
 [[193.47474747 120.52525253]
 [355.52525253 221.47474747]]
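The expected counts and the statistic above can be reproduced by hand, which makes clear what the test compares. The following is a minimal sketch, assuming the observed counts are re-entered manually from the contingency table. Each expected count is (row total × column total) / grand total. For a 2×2 table (one degree of freedom), `stats.chi2_contingency` applies Yates' continuity correction by default (`correction=True`), so the manual statistic below applies the same correction.

```python
import numpy as np
from scipy import stats

# Observed counts copied by hand from the contingency table.
# Rows: female, male; columns: survived = 0, survived = 1.
observed = np.array([[81, 233],
                     [468, 109]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
grand_total = observed.sum()

# Expected counts under independence of sex and survival.
expected_manual = row_totals * col_totals / grand_total

# Chi-square statistic with Yates' continuity correction:
# shrink each |observed - expected| by 0.5 before squaring.
chi2_manual = ((np.abs(observed - expected_manual) - 0.5) ** 2
               / expected_manual).sum()

# Compare with SciPy, which applies the same correction when dof == 1.
chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)

print("Expected (manual):\n", expected_manual)
print("Chi-square (manual):", chi2_manual)
print("Chi-square (SciPy): ", chi2_scipy)
```

Passing `correction=False` to `stats.chi2_contingency` would give the uncorrected Pearson statistic, which is somewhat larger for this table.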


In [9]:
# print the results using if else
if p < 0.05:
    print(f'p-value: {p}, we reject the null hypothesis')
else:
    print(f'p-value: {p}, we fail to reject the null hypothesis')

p-value: 1.1973570627755645e-58, we reject the null hypothesis


## 2. Z-Test and T-Test
Choosing between a Z-test and a T-test for hypothesis testing depends primarily on two factors: the sample size and whether the population standard deviation is known.

**Z-test:**

When to Use:

- The population standard deviation is known.
- The sample size is large (commonly, n ≥ 30). With large samples, the sample standard deviation approximates the population standard deviation.
- For proportions (e.g., testing the proportion of success in a sample against a known population proportion).
  
**Characteristics:**

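- Based on the Z-distribution, i.e. the standard normal distribution. By the central limit theorem, the sampling distribution of the sample mean approaches a normal distribution as n becomes large.
- Commonly used in quality control and standardization processes, where the process standard deviation is known from historical data.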
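Since the worked examples in this notebook use only the t-test, here is a minimal sketch of a one-sample two-sided Z-test using only the Python standard library. The sample values and the "known" population standard deviation `sigma = 1.5` are hypothetical, chosen purely for illustration.

```python
import math
from statistics import NormalDist, mean

# Hypothetical data and parameters (not from the notebook).
sample = [1, 2, 3, 4, 5]
mu_0 = 4        # hypothesized population mean
sigma = 1.5     # population standard deviation, assumed to be known

n = len(sample)

# Z statistic: distance of the sample mean from mu_0 in standard-error units.
z_stat = (mean(sample) - mu_0) / (sigma / math.sqrt(n))

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"z-statistic: {z_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```

Note that with only five observations this is valid only if the population itself is normal and `sigma` is truly known; otherwise a t-test is the safer choice.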

**T-test:**

When to Use:

- The population standard deviation is unknown.
- The sample size is small (typically, n < 30).
- Suitable for cases where the data is approximately normally distributed, especially in small samples.

**Characteristics:**

- Based on the T-distribution, which accounts for the additional uncertainty due to the estimation of the population standard deviation from the sample.
- The T-distribution approaches the normal distribution as the sample size (and with it the degrees of freedom) increases.

**General Guidelines:**

- Large Samples: With large sample sizes, the T-test and Z-test will give similar results. This is because the T-distribution approaches the normal distribution as the sample size increases.
- Small Samples: When the sample size is small and the population standard deviation is unknown, the T-test is generally the appropriate choice due to its ability to account for the uncertainty in the standard deviation estimate.
- Unknown Population Standard Deviation: Even with large samples, if the population standard deviation is unknown and must be estimated from the sample, a T-test is usually preferred. It remains valid at any sample size and converges to the Z-test as n grows.
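The convergence described in these guidelines can be checked numerically by comparing two-sided critical values. The sketch below uses `stats.norm.ppf` and `stats.t.ppf`; the significance level and the sample sizes (5, 30, 1000) are arbitrary choices for illustration.

```python
from scipy import stats

alpha = 0.05  # two-sided significance level (illustrative choice)

# Critical value of the standard normal distribution.
z_critical = stats.norm.ppf(1 - alpha / 2)

# Critical values of the t-distribution with n - 1 degrees of freedom.
sample_sizes = [5, 30, 1000]
t_criticals = [stats.t.ppf(1 - alpha / 2, df=n - 1) for n in sample_sizes]

print(f"z critical value: {z_critical:.4f}")
for n, t_crit in zip(sample_sizes, t_criticals):
    print(f"n = {n:>4}: t critical value = {t_crit:.4f}")
```

The t critical value is always larger than the z critical value, which reflects the extra uncertainty from estimating the standard deviation, and the gap shrinks as n increases.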

**Conclusion:**

- Use the Z-test when the population standard deviation is known, or as an approximation when the sample size is large.
- Use the T-test when the population standard deviation is unknown, especially when the sample size is small.

In practice, the T-test is used more often in research, because the population standard deviation is rarely known and samples are often small.

The examples below focus on the t-test.

### 2.1 One-Sample t-test

In [17]:
# create some sample data
x = [1, 2, 3, 4, 5]

In [18]:
# hypothesized population mean
mu = 4

**H0:** Sample mean is equal to population mean

**H1:** Sample mean is not equal to population mean

In [19]:
# perform one-sample t-test
statistic, p = stats.ttest_1samp(x, mu)

In [20]:
# print results 
print(f"Statistics: {statistic}")
print(f"p-value: {p}")

Statistics: -1.414213562373095
p-value: 0.23019964108049873


In [21]:
# print the result using if else condition
if p < 0.05:
  print(f'p-value: {p}, Sample mean is not equal to population mean (reject H0)')
else:
  print(f'p-value: {p}, Sample mean is equal to population mean (fail to reject H0)')

p-value: 0.23019964108049873, Sample mean is equal to population mean (fail to reject H0)
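The statistic reported by `stats.ttest_1samp` can be reproduced by hand, which shows what the test measures: the distance between the sample mean and `mu`, in units of the estimated standard error. The sketch below uses only the standard library; the names `standard_error` and `t_manual` are illustrative and not part of the notebook.

```python
import math
from statistics import mean, stdev

x = [1, 2, 3, 4, 5]
mu = 4

n = len(x)

# stdev() uses the n - 1 denominator (sample standard deviation).
standard_error = stdev(x) / math.sqrt(n)

# t statistic: (sample mean - hypothesized mean) / standard error.
t_manual = (mean(x) - mu) / standard_error

# The reference t-distribution has n - 1 degrees of freedom.
dof = n - 1

print(f"t-statistic (manual): {t_manual}")
print(f"degrees of freedom: {dof}")
```

With only five observations the standard error is large, so a difference of one unit between the sample mean (3) and `mu` (4) is not enough to reject H0.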


### 2.2 Two-Sample t-test (Independent)

In [22]:
# sample data
group1 = [2.3, 3.4, 4.5, 2.3, 3.4]
group2 = [1.2, 2.2, 3.2, 2.2, 2.3]

**H0:** group1 mean is equal to group2 mean

**H1:** group1 mean is not equal to group2 mean

In [23]:
# perform independent two-sample t-test
t_stat, p = stats.ttest_ind(group1, group2, equal_var=True)

In [24]:
# print the results
print("t-statistic:", t_stat)
print("p-value:", p)

t-statistic: 1.8482055087756457
p-value: 0.10175647371829195


In [25]:
# print the results based on if else condition
if p > 0.05:
    print(f'p-value: {p}, group1 mean is equal to group2 mean (fail to reject H0)')
else:
    print(f'p-value: {p}, group1 mean is not equal to group2 mean (reject H0)')

p-value: 0.10175647371829195, group1 mean is equal to group2 mean (fail to reject H0)
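With `equal_var=True`, the independent t-test pools the two sample variances into a single estimate. A minimal sketch of that calculation using only the standard library is shown below; the variable names are illustrative.

```python
import math
from statistics import mean, variance

group1 = [2.3, 3.4, 4.5, 2.3, 3.4]
group2 = [1.2, 2.2, 3.2, 2.2, 2.3]

n1, n2 = len(group1), len(group2)

# Pooled variance: average of the sample variances, weighted by degrees of freedom.
pooled_variance = (
    (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
) / (n1 + n2 - 2)

# Standard error of the difference between the two sample means.
standard_error = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))

t_manual = (mean(group1) - mean(group2)) / standard_error
dof = n1 + n2 - 2

print(f"t-statistic (manual): {t_manual}")
print(f"degrees of freedom: {dof}")
```

If the two groups cannot be assumed to have equal variances, passing `equal_var=False` to `stats.ttest_ind` performs Welch's t-test instead.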


### 2.3 Two-Sample T-test (Paired)

In [26]:
# sample data
before = [2, 3, 4, 5, 6]
after = [3, 4, 5, 6, 7]

**H0:** before mean is equal to after mean

**H1:** before mean is not equal to after mean

In [29]:
# perform paired sample t-test
t_stat, p = stats.ttest_rel(before, after)

In [31]:
# print results 
print("t-statistic:", t_stat)
print("p-value:", p)

t-statistic: -inf
p-value: 0.0


In [32]:
# print the results using if else conditions
if p > 0.05:
    print(f'p-value: {p}, before mean is equal to after mean (fail to reject H0)')
else:
    print(f'p-value: {p}, before mean is not equal to after mean (reject H0)')

p-value: 0.0, before mean is not equal to after mean (reject H0)
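The `-inf` statistic and the p-value of exactly 0.0 are an artifact of this particular sample: every `after` value is exactly one more than its `before` value, so the paired differences have zero variance and the t statistic involves a division by zero. A paired t-test is equivalent to a one-sample t-test on the differences, which the sketch below illustrates with the standard library. The `after_varied` values are hypothetical, chosen so that the differences actually vary.

```python
import math
from statistics import mean, stdev

before = [2, 3, 4, 5, 6]
after = [3, 4, 5, 6, 7]

# Every difference is -1, so their standard deviation is zero
# and the t statistic is degenerate.
differences = [b - a for b, a in zip(before, after)]
constant_sd = stdev(differences)
print("Differences:", differences, "standard deviation:", constant_sd)

# Hypothetical 'after' measurements whose differences vary.
after_varied = [3, 5, 4, 7, 8]
varied_differences = [b - a for b, a in zip(before, after_varied)]

n = len(varied_differences)
standard_error = stdev(varied_differences) / math.sqrt(n)

# Paired t statistic: mean difference divided by its standard error.
t_manual = mean(varied_differences) / standard_error

print("Varied differences:", varied_differences)
print(f"t-statistic (manual): {t_manual}")
```

For real data, where the differences are not constant, the statistic is finite and is compared against a t-distribution with n - 1 degrees of freedom.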
