# CLT & Hypothesis Testing

In [None]:
#Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st  # aliased as `st`, which is the name used throughout this notebook

**Central Limit Theorem**

The Central Limit Theorem (CLT) states that, regardless of the shape of the population distribution (provided it has finite variance), the distribution of the means of sufficiently large samples is approximately normal and centered on the population mean.

In other words, let's imagine that we collect many sufficiently large samples (a common rule of thumb is $n \ge 30$) and compute the mean of each sample:

$$sample_{1} =\{x_{11}, x_{12},...,x_{1n}\}\\
sample_{2} =\{x_{21}, x_{22},...,x_{2n}\}\\
sample_{3} =\{x_{31}, x_{32},...,x_{3n}\}\\
...$$

$$\bar{x}_{1} = mean(sample_{1})\\
\bar{x}_{2} = mean(sample_{2})\\
\bar{x}_{3} = mean(sample_{3})\\
...$$

What the Central Limit Theorem tells us is that those means $\bar{x}_{i}$ are approximately normally distributed around the population mean, so if we compute the mean of those means, it will be very close to the population mean:

$$\bar{x} = mean(\bar{x}_{1},\bar{x}_{2},\bar{x}_{3},...,\bar{x}_{m}) \approx \text{population mean}=\mu$$

In order to demonstrate the workings of the Central Limit Theorem, let's first create a population dataset. This simulated population will serve as a basis for later comparison with sampling distributions, showcasing how CLT facilitates accurate inferences even when only samples, rather than the entire population, are analyzed.

In [None]:
#Let's create a POPULATION of heights

np.random.seed(17)
# We will draw values from a uniform distribution, i.e. all the values have the same likelihood of being selected.
heights = np.random.uniform(low=155, high=200, size=100000)

heights

In [None]:
plt.hist(heights)
plt.show()

In [None]:
print(f"The population mean height is {heights.mean(): .2f}")

Now let's take some samples from the population.

In [None]:
# Collect 100 samples of 100 observations each and compute the mean of each sample
number_of_samples = 100
sample_size = 100
heights_samples = [np.mean(np.random.choice(heights, sample_size)) for _ in range(number_of_samples)]

Let's now plot the sampling distribution

In [None]:
plt.hist(heights_samples)
plt.show()
print(f"The mean of the sampling distribution is {np.mean(heights_samples): .2f}")

As shown in this example, regardless of the shape of the population distribution, **the mean of the sampling distribution is almost equal to the population mean**, and the sampling distribution itself is approximately normal.
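The claim of approximate normality can be checked numerically as well as visually. The following is a minimal sketch, assuming a freshly simulated uniform population that mirrors the heights example; the number of samples (2,000) and the variable names are illustrative choices, not part of the original notebook. It relies on the fact that a normal distribution has skewness 0 and excess kurtosis 0.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(17)

# Hypothetical uniform population, mirroring the heights example
population = rng.uniform(low=155, high=200, size=100_000)

# Draw 2,000 samples of 100 observations each (with replacement)
# and keep the mean of every sample
sample_means = rng.choice(population, size=(2_000, 100)).mean(axis=1)

# A uniform distribution is flat, so its excess kurtosis is clearly negative
print(f"Population excess kurtosis:  {st.kurtosis(population): .2f}")

# For the sample means, both values should be close to 0 (normal-like shape)
print(f"Sample-mean skewness:        {st.skew(sample_means): .2f}")
print(f"Sample-mean excess kurtosis: {st.kurtosis(sample_means): .2f}")
```

The population itself is far from normal, while the distribution of its sample means is close to it.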

The standard deviation of the sampling distribution (also called the standard error of the mean) can be reduced by increasing the sample size. We calculate it with the following expression:

$$\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}$$

This formula implies that the larger the sample size, the smaller the standard deviation of the sampling distribution. Let's observe it with some examples:

In [None]:
# Generate small samples
heights_samples_10 = [np.mean(np.random.choice(heights, 10)) for _ in range(100)]

# Generate bigger samples
heights_samples_100 = [np.mean(np.random.choice(heights, 100)) for _ in range(100)]

_, charts = plt.subplots(nrows = 1, ncols=2, sharex=True, sharey=True, figsize=(10,5))
charts[0].set_title("Sampling Distribution with n=10", fontsize=10)
charts[0].hist(heights_samples_10)

charts[1].set_title("Sampling Distribution with n=100", fontsize=10)
charts[1].hist(heights_samples_100)
plt.show()

As the sampling distribution plots above show, larger sample sizes help us control **variation due to sampling**.
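The shrinking spread can also be compared directly against the formula $\sigma/\sqrt{n}$. This is a small sketch under assumed settings: the population is re-simulated here so the block is self-contained, and the sample sizes (10, 100, 1000) and the 5,000 repetitions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical uniform population, mirroring the heights example
population = rng.uniform(low=155, high=200, size=100_000)
sigma = population.std()  # population standard deviation (ddof=0)

results = {}
for n in (10, 100, 1000):
    # 5,000 samples of size n, drawn with replacement; one mean per sample
    sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    empirical = sample_means.std(ddof=1)   # spread observed in the simulation
    theoretical = sigma / np.sqrt(n)       # spread predicted by the formula
    results[n] = (empirical, theoretical)
    print(f"n={n:>4}: empirical={empirical: .3f}  theoretical={theoretical: .3f}")
```

Each tenfold increase in sample size should cut the standard deviation of the sample means by a factor of about $\sqrt{10} \approx 3.16$.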

# Hypothesis Testing

### One Sample T-test

In [None]:
#Load the data

df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv")
df.head()

In [None]:
df.shape

**Let's check if 1st class ticket prices, traditionally thought to be $65, actually align with historical data through statistical testing.**

First, let's break it down step by step.

##### Set Hypothesis

In [None]:
#Set the hypothesis

#H0: mu 1st class_prices = 65
#H1: mu 1st class_prices != 65

$$H_{0}: \mu_{firstclass} = 65\\
H_{1/a}: \mu_{firstclass} \ne 65$$

##### Choose significance level

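The significance level (alpha) is the **threshold** against which we compare our p_value. If our p_value < alpha, then we reject H0; otherwise, we fail to reject it (which is not the same as accepting it).

$$p\text{-value} = P(\text{data at least as extreme as observed} \mid H_{0} \text{ is true})$$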
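One way to see what the threshold means is to simulate many experiments in which the null hypothesis is true by construction: a test at level alpha should then reject in roughly a fraction alpha of them. The sketch below assumes a normal population with mean 65 and standard deviation 10, samples of 30 observations, and 2,000 repetitions; all of these values are hypothetical and chosen only for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

alpha = 0.05
mu_0 = 65              # hypothesized mean, true by construction in this simulation
n_experiments = 2_000  # illustrative number of repeated experiments
sample_size = 30       # illustrative sample size

# Each row is one simulated experiment drawn from a population where H0 holds
samples = rng.normal(loc=mu_0, scale=10, size=(n_experiments, sample_size))

# One two-sided, one-sample t-test per row
result = st.ttest_1samp(samples, mu_0, axis=1)

# Share of experiments in which H0 would be (wrongly) rejected
false_positive_rate = np.mean(result.pvalue < alpha)
print(f"Rejection rate when H0 is true: {false_positive_rate: .3f}")
```

The rejection rate should land near alpha, which is why alpha is also described as the tolerated probability of a false positive.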

In [None]:
alpha = 0.05

##### Collect data

In [None]:
first_class = df[df["Pclass"]==1]["Fare"]

##### Compute Test Statistic

For this kind of test, the statistic is given by the following formula:

$$t = \frac{(mean - \mu)}{\frac{\sigma_{x}}{\sqrt{n}}}$$

$$\text{sample std} = \sigma_{x} = \sqrt{\sum_{i}\frac{(x_{i}-\bar{x})^{2}}{(n-1)}}$$

$$\text{population std} = \sigma = \sqrt{\sum_{i}\frac{(x_{i}-\mu)^{2}}{n}}$$

$$\bar{x} \ne \mu; \bar{x}\approx \mu$$
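The difference between the two standard deviation formulas corresponds to the `ddof` argument in NumPy and pandas. The sketch below uses a small made-up list of values to show that `np.std` defaults to the population formula (`ddof=0`), while `Series.std` defaults to the sample formula (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Made-up values, used only to illustrate the two formulas
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(values)

population_std = np.std(values.to_numpy())      # NumPy default: ddof=0, divides by n
sample_std = np.std(values.to_numpy(), ddof=1)  # divides by n - 1
pandas_std = values.std()                       # pandas default: ddof=1

print(f"Population formula (ddof=0): {population_std: .4f}")
print(f"Sample formula (ddof=1):     {sample_std: .4f}")
print(f"pandas Series.std():         {pandas_std: .4f}")
```

The two results differ by a factor of $\sqrt{n/(n-1)}$, which becomes negligible for large samples.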

In [None]:
#In order to calculate test statistic we need

#sample mean
mean = first_class.mean()

#standard deviation of sample
s = first_class.std(ddof=1) # We need to use ddof=1 because we're working with a sample and not the whole population

#sample size
n = len(first_class)

#hypothesized population mean
mu = 65

t_statistic = (mean - mu)/(s/np.sqrt(n))

print(f"The value of the t_statistic is {t_statistic: .2f}")

##### Determine p_value

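- In a two-tailed test we can obtain the p_value using --> st.t.sf(abs(stat), n-1)*2

- In a one-tailed test we can obtain the p_value using --> st.t.sf(stat, n-1) for a right-tailed alternative (H1: mu > value), or st.t.cdf(stat, n-1) for a left-tailed one (H1: mu < value)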
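For a one-tailed test, which tail to use depends on the direction of the alternative hypothesis. The sketch below computes all three p-values for a hypothetical test statistic of 2.1 with 24 degrees of freedom (both numbers are made up for illustration).

```python
import scipy.stats as st

t_stat = 2.1  # hypothetical test statistic
dof = 24      # hypothetical degrees of freedom (sample size of 25)

# Two-tailed: probability of a statistic at least this extreme in either direction
p_two_sided = 2 * st.t.sf(abs(t_stat), dof)

# Right-tailed (H1: mu > mu_0): area to the right of the statistic
p_greater = st.t.sf(t_stat, dof)

# Left-tailed (H1: mu < mu_0): area to the left of the statistic
p_less = st.t.cdf(t_stat, dof)

print(f"two-sided: {p_two_sided: .4f}")
print(f"greater:   {p_greater: .4f}")
print(f"less:      {p_less: .4f}")
```

With a positive statistic, the right-tailed p-value is half the two-tailed one, while the left-tailed p-value is close to 1.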

In [None]:
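# Two-tailed p_value via the CDF; abs() keeps the result valid when the statistic is negative
2 * (1 - st.t.cdf(abs(t_statistic), n-1))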

In [None]:
# number_of_degrees_of_freedom = sample_size -1
p_value = st.t.sf(abs(t_statistic), n-1)*2
p_value

##### Decision-making

In [None]:
if p_value > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

In this case, because the p_value is lower than our significance level, we can indeed reject the null hypothesis that 1st class tickets cost $65 on average.

**Python way**

In [None]:
st.ttest_1samp(first_class, 65)
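As a sanity check, the manual steps (test statistic, then p-value) can be compared with `ttest_1samp` on data that needs no download. The sketch below uses a synthetic, right-skewed array as a stand-in for fares; the gamma parameters and the sample size are arbitrary assumptions, not properties of the Titanic data.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(42)

# Synthetic stand-in for a fare column (hypothetical values, not the real data)
fares = rng.gamma(shape=2.0, scale=40.0, size=200)
mu = 65  # hypothesized population mean

# Manual computation, following the same steps as above
n = len(fares)
t_manual = (fares.mean() - mu) / (fares.std(ddof=1) / np.sqrt(n))
p_manual = 2 * st.t.sf(abs(t_manual), n - 1)

# Library computation
result = st.ttest_1samp(fares, mu)

print(f"manual: t={t_manual: .4f}, p={p_manual: .4f}")
print(f"scipy:  t={result.statistic: .4f}, p={result.pvalue: .4f}")
```

Both routes should give the same statistic and p-value up to floating-point error.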

**What if we believe that prices are higher than $65?**

In [None]:
#Set hypothesis

#H0: mu 1st class_prices <= 65
#H1: mu 1st class_prices > 65

#with alpha = 0.05

$$H_{0}: mean \le 65\\
H_{1/a}: mean \gt 65$$

In [None]:
st.ttest_1samp(first_class, 65, alternative = "greater")

We also reject the null hypothesis that 1st class prices are, on average, equal to or lower than 65.

In other words, we obtained enough evidence to reject the null hypothesis.

### Two Sample T-test

In [None]:
#Load the data - we are going to use titanic dataset

df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv")
df.head()

We want to test if the average ticket price is the same for males and females

In [None]:
df_female = df[df["Sex"]=="female"]["Fare"]
df_male = df[df["Sex"]=="male"]["Fare"]

In [None]:
#Set the hypothesis

#H0: mu_price male = mu_price female
#H1: mu_price male != mu_price female

#significance level = 0.05

$$H_{0}: mean_{males} = mean_{females}\\
H_{1/a}: mean_{males} \ne mean_{females}$$

$$H_{0}: mean_{males} - mean_{females} = 0\\
H_{1/a}: mean_{males} - mean_{females} \ne 0$$

In [None]:
# equal_var=False runs Welch's t-test, which does not assume equal variances in both groups
st.ttest_ind(df_male, df_female, equal_var=False)

Because the p_value is lower than the significance level, we reject the null hypothesis. This means that the average prices paid by males and females are indeed different.
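With `equal_var=False`, `ttest_ind` performs Welch's t-test, which uses a separate variance estimate for each group and an adjusted number of degrees of freedom. The following sketch reproduces that calculation by hand on two synthetic groups; the group means, spreads, and sizes are made-up assumptions, not the Titanic fares.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)

# Two synthetic groups with different sizes and variances (hypothetical data)
group_a = rng.normal(loc=30, scale=10, size=300)
group_b = rng.normal(loc=35, scale=30, size=150)

# Squared standard error of each group mean
se2_a = group_a.var(ddof=1) / len(group_a)
se2_b = group_b.var(ddof=1) / len(group_b)

# Welch's t statistic
t_welch = (group_a.mean() - group_b.mean()) / np.sqrt(se2_a + se2_b)

# Welch-Satterthwaite approximation of the degrees of freedom
dof_welch = (se2_a + se2_b) ** 2 / (
    se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
)

p_welch = 2 * st.t.sf(abs(t_welch), dof_welch)

result = st.ttest_ind(group_a, group_b, equal_var=False)
print(f"manual: t={t_welch: .4f}, p={p_welch: .4f}")
print(f"scipy:  t={result.statistic: .4f}, p={result.pvalue: .4f}")
```

Welch's version is a reasonable default when there is no strong reason to believe the two groups share the same variance.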

### Paired Sample T-test

We aim to assess the effectiveness of a medical drug in controlling blood pressure.

We have obtained readings of individuals' blood pressure both before and after taking the drug.

In [None]:
#Load data

df = pd.read_csv(r"https://raw.githubusercontent.com/data-bootcamp-v4/data/main/blood_pressure.csv")
df

In [None]:
#Set hypothesis

#H0: mu before = mu after
#H1: mu before != mu after

#Significance level -> 0.05
alpha = 0.05

$$H_{0}: mean_{before} = mean_{after}\\
H_{1/a}: mean_{before} \ne mean_{after}$$

$$H_{0}: mean_{before} - mean_{after} = 0\\
H_{1/a}: mean_{before} - mean_{after} \ne 0$$

In [None]:
t_statistic, p_value = st.ttest_rel(df["before"], df["after"])

In [None]:
t_statistic

In [None]:
p_value

In [None]:
if p_value > alpha:
    print("We are not able to reject the null hypothesis")
else:
    print("We reject the null hypothesis")

We reject the null hypothesis; therefore, we can conclude that the average blood pressure before and after taking the drug is not equal.
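A paired t-test is equivalent to a one-sample t-test on the per-individual differences, tested against a mean of 0. The sketch below illustrates this with synthetic before/after readings; the values and the size of the simulated effect are hypothetical, not taken from the blood pressure dataset.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(7)

# Synthetic readings for 60 hypothetical individuals
before = rng.normal(loc=150, scale=12, size=60)
after = before - rng.normal(loc=5, scale=6, size=60)  # simulated average drop of about 5

# Paired test on the two columns
paired = st.ttest_rel(before, after)

# One-sample test on the differences, against a mean difference of 0
one_sample = st.ttest_1samp(before - after, 0)

print(f"paired:     t={paired.statistic: .4f}, p={paired.pvalue: .6f}")
print(f"one-sample: t={one_sample.statistic: .4f}, p={one_sample.pvalue: .6f}")
```

This is why the paired test requires both columns to be aligned row by row: each difference must come from the same individual.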

### ANOVA

In [None]:
#Load the data

df = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/data_science_salaries.csv")
df.head()

Curious about the salaries of data scientists, we want to check whether company size has an impact on them.

In order to proceed with the test, we must acknowledge that we have 3 different groups:

In [None]:
df["company_size"].unique()

In [None]:
df_small = df[(df["company_size"]=="Small") & (df["job_title"]=="Data Scientist")]["salary_in_usd"]
df_medium = df[(df["company_size"]=="Medium") & (df["job_title"]=="Data Scientist")]["salary_in_usd"]
df_large = df[(df["company_size"]=="Large") & (df["job_title"]=="Data Scientist")]["salary_in_usd"]

$$H_{0}: mean_{small} = mean_{medium} = mean_{large}\\
H_{1/a}: \text{at least one group mean differs from the others}$$

In [None]:
#Set the hypothesis

#H0: mu df_small = mu df_medium = mu df_large
#H1: at least one of the group means is different

#Let's choose a significance level of 5%
alpha = 0.05

st.f_oneway(df_small, df_medium, df_large)

With such a small p_value, we can once again reject the null hypothesis: average data scientist salaries are not the same across all company sizes, which suggests that company size does have an impact on salary.
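The F statistic returned by `f_oneway` is the ratio of the variability between group means to the variability within groups. The sketch below computes it by hand on three synthetic groups; the salary-like means, spreads, and group sizes are made-up assumptions, not the values in the salaries dataset.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)

# Three synthetic groups (hypothetical salary-like values)
groups = [
    rng.normal(loc=90_000, scale=20_000, size=40),
    rng.normal(loc=110_000, scale=25_000, size=120),
    rng.normal(loc=105_000, scale=30_000, size=80),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = st.f.sf(f_manual, k - 1, n_total - k)

result = st.f_oneway(*groups)
print(f"manual: F={f_manual: .4f}, p={p_manual: .6f}")
print(f"scipy:  F={result.statistic: .4f}, p={result.pvalue: .6f}")
```

A significant result only says that at least one group mean differs; it does not say which groups differ from each other.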