## 2.1 Describe statistical concepts that underpin hypothesis testing and experimentation

## 2.1 * Define different statistical distributions (e.g. binomial, normal, Poisson, t-distribution, chi-square, F-distribution, etc.).

## Statistical Distributions

Here are some of the most common statistical distributions:

* **Binomial distribution:**
    The binomial distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials. The number of trials is denoted by n, and the probability of success on each trial, which is the same for every trial, is called the success probability. The binomial distribution is often used to model the number of heads in a series of coin flips, or the number of customers out of a fixed group who buy a product from a store.
    - The probability of success on each trial is denoted by p, and the probability of failure is denoted by q = 1 - p. The binomial distribution is defined by the following formula:
        - P(x successes in n trials) = nCx * p^x * q^(n - x)
    * Discrete
    * Roughly bell-shaped when n is large and p is near 0.5; skewed otherwise
    * Number of successes in a fixed number of trials
* **Normal distribution:**
    The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is bell-shaped. The normal distribution is one of the most important distributions in statistics, and is used in a wide variety of applications, including hypothesis testing, confidence intervals, and regression analysis. The normal distribution is often used to model the heights of people, the weights of animals, or the scores on standardized tests.
    - The normal distribution is defined by the following formula:
        - f(x) = 1 / (σ * √(2π)) * exp(-(x - μ)^2 / (2 * σ^2)) : where μ is the mean, σ is the standard deviation, and f(x) is the probability density function.
    * Continuous
    * Bell-shaped
    * Hypothesis testing, confidence intervals, regression analysis
* **Poisson distribution:**
    The Poisson distribution is a discrete probability distribution that describes the number of events that occur in a fixed interval of time or space. The Poisson distribution is often used to model the number of phone calls that come into a call center in a given hour, or the number of defects that occur in a manufactured product. The Poisson distribution is based on the assumption that the events occur independently of each other and at a constant rate.
    -   The Poisson distribution is defined by the following formula:
        - P(x events in a time interval) = λ^x * exp(-λ) / x! : where λ is the average number of events per time interval.
    * Discrete
    * Right-skewed for small λ; approximately symmetric for large λ
    * Number of events that occur in a fixed interval of time or space
* **t-distribution:**
    The t-distribution is a continuous probability distribution that is similar to the normal distribution. It is used in hypothesis testing and confidence intervals when the population standard deviation is unknown and must be estimated from the sample, which matters most when the sample size is small. The t-distribution is bell-shaped, but it has heavier tails than the normal distribution, because it takes into account the uncertainty of the sample standard deviation. As the degrees of freedom increase, it approaches the normal distribution.
    -   The t-distribution is defined by the following formula:
        - f(t) = Γ((df + 1) / 2) / (√(df * π) * Γ(df / 2)) * (1 + t^2 / df)^(-(df + 1) / 2) : where df is the degrees of freedom and Γ is the gamma function.
    * Continuous
    * Bell-shaped
    * Hypothesis testing, confidence intervals
* **Chi-square distribution:**
    The chi-square distribution is a continuous probability distribution of the sum of the squares of k independent standard normal variables. It is used in goodness-of-fit tests and tests of independence, where it provides the p-value for the chi-square test statistic. It takes only non-negative values and is right-skewed, becoming more symmetric as the degrees of freedom increase.
    -   The chi-square distribution is defined by the following formula:
        - f(x) = x^(k / 2 - 1) * exp(-x / 2) / (2^(k / 2) * Γ(k / 2)) for x > 0 : where k is the degrees of freedom and Γ is the gamma function.
    * Continuous
    * Right-skewed
    * Goodness of fit, tests of independence
* **F-distribution:**
    The F-distribution is a continuous probability distribution of the ratio of two independent chi-square variables, each divided by its degrees of freedom. It is used to compare the variances of two populations and, in ANOVA tests, to compare the means of two or more groups through a ratio of variances. It takes only non-negative values and is right-skewed.
    -   The F-distribution is defined by the following formula:
        - f(x) = Γ((df1 + df2) / 2) / (Γ(df1 / 2) * Γ(df2 / 2)) * (df1 / df2)^(df1 / 2) * x^(df1 / 2 - 1) * (1 + df1 * x / df2)^(-(df1 + df2) / 2) for x > 0 : where df1 and df2 are the numerator and denominator degrees of freedom.
    * Continuous
    * Right-skewed
    * Comparing variances of two or more populations

| Distribution | Type | Shape | Applications |
|---|---|---|---|
| Binomial | Discrete | Roughly bell-shaped for large n and p near 0.5; skewed otherwise | Number of successes in a fixed number of trials |
| Normal | Continuous | Bell-shaped | Hypothesis testing, confidence intervals, regression analysis |
| Poisson | Discrete | Right-skewed for small λ; approximately symmetric for large λ | Number of events that occur in a fixed interval of time or space |
| t-distribution | Continuous | Bell-shaped with heavier tails than the normal | Hypothesis testing, confidence intervals |
| Chi-square | Continuous | Right-skewed | Goodness of fit, p-value calculation |
| F-distribution | Continuous | Right-skewed | Comparing variances of two or more populations |
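
As a quick check of the binomial, Poisson, and normal formulas listed above, the sketch below evaluates each one by hand and compares it with `scipy.stats`. The parameter values (n = 10, p = 0.5, λ = 5, x0 = 0.5) are illustrative assumptions, not values taken from the notes.

```python
import math
from scipy import stats

n, p, x = 10, 0.5, 5   # illustrative binomial parameters (assumed)
lam = 5.0              # illustrative Poisson rate (assumed)

# Binomial pmf by hand: nCx * p^x * q^(n - x), with q = 1 - p
binom_manual = math.comb(n, x) * p**x * (1 - p) ** (n - x)
binom_scipy = stats.binom.pmf(x, n, p)

# Poisson pmf by hand: lambda^x * exp(-lambda) / x!
poisson_manual = lam**x * math.exp(-lam) / math.factorial(x)
poisson_scipy = stats.poisson.pmf(x, lam)

# Normal pdf by hand at x0, with mean mu and standard deviation sigma
mu, sigma, x0 = 0.0, 1.0, 0.5
normal_manual = math.exp(-((x0 - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
normal_scipy = stats.norm.pdf(x0, loc=mu, scale=sigma)

print("binomial:", binom_manual, binom_scipy)
print("poisson: ", poisson_manual, poisson_scipy)
print("normal:  ", normal_manual, normal_scipy)
```

Each pair of printed values should agree up to floating-point rounding.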


## 2.1 * Explain the statistical concepts in hypothesis testing (e.g. null hypothesis, alternative hypothesis, one-tailed and two-tailed hypothesis tests, etc.)

## Null hypothesis
* The null hypothesis is a statement about the population parameter that is being tested. It is usually the statement of no difference or no effect. For example, the null hypothesis for a study on the effectiveness of a new drug might be that the drug has **no effect** on the patient's condition.

## Alternative hypothesis
* The alternative hypothesis is the statement that contradicts the null hypothesis. It is the statement that the researcher is trying to find evidence for. For example, the alternative hypothesis for the study on the new drug might be that the drug **does have an effect** on the patient's condition.

## One-tailed and two-tailed hypothesis tests
* A one-tailed hypothesis test is a test in which the alternative hypothesis specifies the direction of the difference between the population parameter and the value stated in the null hypothesis. For example, the alternative hypothesis for the study on the new drug might be that the drug **has a positive effect** on the patient's condition.
* A two-tailed hypothesis test is a test in which the alternative hypothesis does not specify the direction of that difference. For example, the alternative hypothesis for the study on the new drug might be that the drug **has any effect** on the patient's condition, whether positive or negative.

## P-value
* The p-value is a probability that is used to decide whether to reject the null hypothesis. It is calculated from the sample data and is the probability of obtaining a result at least as extreme as the one observed, given that the null hypothesis is true. A low p-value means that such a result would be unlikely if the null hypothesis were true, so the data provide evidence against the null hypothesis. The p-value is not the probability that the null hypothesis is true.


| Hypothesis | Example | Description |
|---|---|---|
| Null hypothesis | The drug has no effect on the patient's condition. | The null hypothesis is a statement of no difference or no effect. It is the statement that the researcher is trying to disprove. |
| Alternative hypothesis | The drug has a positive effect on the patient's condition. | The alternative hypothesis is the opposite of the null hypothesis. It is the statement that the researcher is trying to prove. |
| One-tailed hypothesis test | The drug has a positive effect on the patient's condition. | A one-tailed hypothesis test is a test in which the alternative hypothesis specifies the direction of the difference between the null hypothesis and the population parameter. In this case, the researcher is only interested in whether the drug has a positive effect on the patient's condition. |
| Two-tailed hypothesis test | The drug has any effect on the patient's condition. | A two-tailed hypothesis test is a test in which the alternative hypothesis does not specify the direction of the difference between the null hypothesis and the population parameter. In this case, the researcher is interested in whether the drug has any effect on the patient's condition, whether it is positive or negative. |


Hypothesis testing is a statistical procedure that is used to determine whether there is enough evidence to reject the null hypothesis. If the p-value is less than a pre-specified level of significance (commonly 0.05), then the null hypothesis is rejected in favor of the alternative hypothesis. Otherwise, the null hypothesis is not rejected, which does not prove that it is true.
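
To make the difference between one-tailed and two-tailed tests concrete, here is a minimal sketch of a one-sample t-test computed by hand. The data are simulated, and the names `sample` and `mu0` and the chosen effect size are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.3, scale=1.0, size=40)  # simulated measurements (assumed)
mu0 = 0.0                                         # mean stated by the null hypothesis

n = sample.size
# t-statistic: distance of the sample mean from mu0 in standard errors
t_stat = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))
df = n - 1

# Two-tailed test, H1: mean != mu0 (both tails count as "extreme")
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
# One-tailed test, H1: mean > mu0 (only the right tail counts)
p_right_tailed = stats.t.sf(t_stat, df)

alpha = 0.05
print("t-statistic:", t_stat)
print("two-tailed p-value:", p_two_tailed, "reject H0:", p_two_tailed < alpha)
print("right-tailed p-value:", p_right_tailed, "reject H0:", p_right_tailed < alpha)
```

When the observed difference lies in the hypothesized direction, the one-tailed p-value is half the two-tailed one, which is why the direction must be chosen before looking at the data.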


## 2.1 * Explain the statistical concepts in the experimental design (e.g. control group, randomization, confounding variables, etc.).

### Statistical Concepts in Experimental Design

* **Control group:**
    * A control group is a group of subjects in an experiment that is not exposed to the experimental treatment.
    * The control group is used to compare the results of the experimental treatment group to see if there is a significant difference.
    * The control group is essential for experimental design because it provides a baseline against which to compare the results of the experimental treatment group.
    * If there is no significant difference between the control group and the experimental treatment group, then the experiment provides no evidence that the experimental treatment had an effect.
* **Randomization:**
    * Randomization is the process of assigning subjects to groups in an experiment in a random manner.
    * Because assignment does not depend on any characteristic of the subjects, the groups tend to be similar on both known and unknown factors, which minimizes the chance of bias.
    * Without randomization, the results of the experiment may be due to pre-existing differences between the groups rather than the experimental treatment.
* **Confounding variables:**
    * Confounding variables are variables, other than the treatment, that are related to both the group assignment and the outcome of an experiment.
    * They can be controlled for by randomization or by matching the groups on the confounding variable.
    * Confounding variables are a major threat to the validity of experimental results.
    * If a confounding variable is not controlled for, then it is possible that the results of the experiment are due to the confounding variable rather than the experimental treatment.
* **Blinding:**
    * Blinding is the process of keeping the subjects and/or the researchers in an experiment unaware of the treatment group assignment.
    * This helps to prevent bias in the results of the experiment.
    * If the subjects or the researchers know the group assignment, their expectations may influence how subjects behave or how the results are measured and interpreted.
* **Replication:**
    * Replication is the process of repeating an experiment multiple times.
    * This helps to increase the reliability of the results of the experiment.
    * If the results are reproduced consistently across repetitions, it is more likely that they reflect a real effect rather than chance.
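
Randomization, as described above, can be sketched in a few lines. This is a minimal illustration assuming 20 hypothetical subjects identified by integer IDs and two equally sized groups; the seed is fixed only so the assignment can be reproduced.

```python
import numpy as np

rng = np.random.default_rng(42)    # fixed seed for a reproducible assignment
subject_ids = np.arange(20)        # 20 hypothetical subjects (assumed)

# Shuffle the subjects, then split the shuffled order in half
shuffled = rng.permutation(subject_ids)
control = np.sort(shuffled[:10])
treatment = np.sort(shuffled[10:])

print("control:  ", control)
print("treatment:", treatment)
```

Because the split depends only on the random shuffle, no characteristic of a subject can influence which group it lands in.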

* **Internal validity:**
    * Internal validity is the extent to which the results of an experiment can be attributed to the experimental treatment.
    * A high level of internal validity means that the results of the experiment are unlikely to be due to other factors, such as confounding variables.
* **External validity:**
    * External validity is the extent to which the results of an experiment can be generalized to other populations or settings.
    * A high level of external validity means that the results of the experiment are likely to be applicable to other people or situations.
* **Power:**
    * Power is the probability of rejecting the null hypothesis when it is false. It equals 1 - β, where β is the probability of a Type II error.
    * A high power means that the experiment is likely to detect a real effect. Power increases with larger sample sizes, larger effect sizes, and higher significance levels.
* **Type I error:**
    * A Type I error is the error of rejecting the null hypothesis when it is true. Its probability is the significance level α.
    * This is also known as a false positive.
* **Type II error:**
    * A Type II error is the error of failing to reject the null hypothesis when it is false. Its probability is denoted by β.
    * This is also known as a false negative.
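
The Type I error rate and power can be estimated by simulation. The sketch below repeatedly runs a two-sample t-test on simulated control and treatment groups; the group size, the number of simulations, and the true effect of 0.8 standard deviations are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, n_sims = 0.05, 30, 2000   # significance level, group size, repetitions (assumed)

def rejection_rate(true_diff):
    """Fraction of simulated two-sample t-tests that reject H0 at level alpha."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(true_diff, 1.0, n)
        _, p_value = stats.ttest_ind(control, treatment)
        rejections += int(p_value < alpha)
    return rejections / n_sims

# H0 is true (no difference): every rejection is a false positive (Type I error)
type_i_rate = rejection_rate(0.0)
# H0 is false (difference of 0.8): every rejection is a correct detection (power)
power = rejection_rate(0.8)

print("estimated Type I error rate:", type_i_rate)
print("estimated power:", power)
print("estimated Type II error rate:", 1 - power)
```

The estimated Type I error rate should land near α, and rerunning with a larger `n` should raise the estimated power.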



## 2.1 * Explain parameter estimation and confidence intervals.

## Parameter Estimation and Confidence Intervals

- Parameter estimation is the process of estimating the value of a population parameter from a sample of data. 
- A confidence interval is a range of values that is likely to contain the true value of the population parameter.

For example, suppose we want to estimate the average height of all adults in the United States. We could take a sample of 100 adults and measure their heights. The average height of the sample would be an estimate of the average height of the population.

We could also calculate a confidence interval for the average height of the population. This would be a range of values that is likely to contain the true average height of the population. For example, the confidence interval might be from 5 feet 10 inches to 6 feet 1 inch.

The width of the confidence interval depends on the sample size, the variability of the data, and the confidence level. A larger sample size makes the interval narrower, while a higher confidence level makes it wider. For example, a 95% confidence interval will be wider than a 90% confidence interval computed from the same data.

* **Parameter estimation:** A point estimate is a single value, such as the sample mean, used to estimate a population parameter. A common general-purpose method of parameter estimation is **maximum likelihood estimation**.
* **Confidence interval:** A confidence interval is calculated for a chosen **confidence level**, such as 95%. The confidence level is the long-run proportion of intervals, computed from repeated samples, that would contain the true value of the population parameter. It is not the probability that one particular computed interval contains the true value.
* **Width of the confidence interval:** Larger samples give narrower intervals; higher confidence levels and more variable data give wider intervals.
* **Statistical inference:** Statistical inference is the process of making inferences about the population based on a sample of data. Parameter estimation and confidence intervals are two important tools for statistical inference.
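
The height example can be sketched in code. The heights below are simulated in centimetres (an assumption, since the notes give no data), and `mean_confidence_interval` is a hypothetical helper that builds a t-based interval for the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=10.0, size=100)  # simulated heights in cm (assumed)

def mean_confidence_interval(data, confidence):
    """t-based confidence interval for the population mean."""
    n = len(data)
    mean = np.mean(data)                      # point estimate of the mean
    sem = np.std(data, ddof=1) / np.sqrt(n)   # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean - t_crit * sem, mean + t_crit * sem

ci_90 = mean_confidence_interval(heights, 0.90)
ci_95 = mean_confidence_interval(heights, 0.95)

print("point estimate:", heights.mean())
print("90% CI:", ci_90)
print("95% CI:", ci_95)
```

The 95% interval comes out wider than the 90% interval because a larger critical value is needed to cover more of the t-distribution.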


## 2.2 Apply sampling methods to data

In [192]:
# load data for study
from sklearn.datasets import load_iris
import pandas as pd

data = load_iris(as_frame=True)
iris = pd.concat([data.data, data.target], axis=1)

map_names = {
    0 : data.target_names[0],
    1 : data.target_names[1],
    2 : data.target_names[2]
}

iris['target'] = iris['target'].replace(map_names)

iris.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


## 2.2 * Distinguish between different types of random sampling techniques and apply the methods using Python

* Simple random sampling: This is the simplest type of random sampling. In simple random sampling, each member of the population has an equal chance of being selected, and every possible sample of a given size is equally likely.

In [193]:
import numpy as np

In [194]:
# using pandas
display(iris['target'].sample(n=5))

# using random
import random

population = list(iris.target.values)

sample = random.sample(population, 5)

print(sample)


92    versicolor
94    versicolor
90    versicolor
48        setosa
44        setosa
Name: target, dtype: object

['virginica', 'virginica', 'versicolor', 'setosa', 'versicolor']
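
Random samples change on every run, so results like the ones above cannot be reproduced exactly. A minimal sketch of reproducible simple random sampling follows, assuming a toy population of 100 rows built only for this illustration; `random_state` fixes the seed used by `DataFrame.sample`.

```python
import pandas as pd

# Toy population of 100 members (assumed, not the iris data)
population = pd.DataFrame({"id": range(100), "value": [i * 0.5 for i in range(100)]})

# Sampling is without replacement by default; the same seed gives the same rows
sample_a = population.sample(n=5, random_state=0)
sample_b = population.sample(n=5, random_state=0)

print(sample_a)
print("same rows on both draws:", sample_a.equals(sample_b))
```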


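*  Systematic random sampling: In systematic random sampling, you select every _k_th member of the population, usually starting from a randomly chosen member among the first _k_. The code in this section starts at index 0 for simplicity.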


In [195]:
# using pandas

# Calculate the step size
step_size = len(iris) // 5

# Generate the sample indices
sample_indices = list(range(0, len(iris), step_size))

# Print the sample
print(iris['target'].iloc[sample_indices])

# Using a plain Python loop

k = step_size

sample = []

# Take every k-th row by position, starting at position 0
for i in range(0, len(iris), k):
  sample.append(iris['target'].iloc[i])

print(sample)

0          setosa
30         setosa
60     versicolor
90     versicolor
120     virginica
Name: target, dtype: object
['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica']
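
Systematic sampling is usually started from a random position so that the first member is not always selected. A minimal sketch under that assumption is shown below; `systematic_sample` is a hypothetical helper and the 150-row population is a stand-in for any ordered dataset.

```python
import numpy as np
import pandas as pd

def systematic_sample(df, n, rng):
    """Select n rows spaced k apart, starting at a random offset in [0, k)."""
    k = len(df) // n                    # sampling interval
    start = rng.integers(0, k)          # random start within the first interval
    positions = start + k * np.arange(n)
    return df.iloc[positions]

population = pd.DataFrame({"value": np.arange(150)})  # toy population (assumed)
rng = np.random.default_rng(0)
sample = systematic_sample(population, 5, rng)

print(sample)
```

Note that if the data are sorted in a pattern that repeats every `k` rows, a systematic sample can be biased, which is one reason to shuffle or check the ordering first.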


* Stratified random sampling: In stratified random sampling, you divide the population into strata and then randomly sample from each stratum. 

In [197]:
#Using random

strata = list(iris.target.unique())

sample = []

for stratum in strata:
  # Rows belonging to the current stratum (species)
  population = iris.loc[iris['target'] == stratum]
  # Draw 5 rows at random, without replacement, from this stratum
  samples = random.sample(list(population.values), 5)
  sample.append(samples)

print(np.array(sample))

# Using pandas 
iris.groupby('target').sample(n=5)

# Using sklearn

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=5)

for i, (train_index, test_index) in enumerate(sss.split(iris.iloc[:, :-1], iris.iloc[:, -1])):
    print(f"Fold {i}:")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")

[[[4.8 3.0 1.4 0.3 'setosa']
  [5.7 4.4 1.5 0.4 'setosa']
  [5.0 3.4 1.5 0.2 'setosa']
  [5.1 3.8 1.9 0.4 'setosa']
  [5.1 3.5 1.4 0.2 'setosa']]

 [[6.0 2.2 4.0 1.0 'versicolor']
  [6.7 3.0 5.0 1.7 'versicolor']
  [7.0 3.2 4.7 1.4 'versicolor']
  [6.1 2.9 4.7 1.4 'versicolor']
  [6.0 2.9 4.5 1.5 'versicolor']]

 [[6.2 3.4 5.4 2.3 'virginica']
  [7.2 3.6 6.1 2.5 'virginica']
  [7.7 2.8 6.7 2.0 'virginica']
  [6.5 3.2 5.1 2.0 'virginica']
  [7.2 3.0 5.8 1.6 'virginica']]]
Fold 0:
  Train: index=[ 59  47  36 147  76  98  12 144  21  96  46 103  13  51  57   5 129  20
 137  80 111  89 135  40 142  60  27   8  69  30   0 118 149  88  35  83
  39 113 126 107 139  74  78 143 132  33 138 119  50 121  68 109  56 116
  85 117   1  31 127   6  14  17  32  49  44   3   2 134  53 133  99 115
 141 106  64  63 145  52  61 146 108  73  26  38  71  87  54 114 105 140
 128  67 122  91   9  11 100  45  41  42  18  19 131  72  92 104 148 130
  86 110  97  29  81  77  84  65  15  43  58  62   4  22 123 12
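
The stratified examples above draw the same number of rows from every stratum, which suits the balanced iris data. When strata differ in size, sampling the same fraction from each keeps the sample proportional to the population. The sketch below assumes a toy population with strata of 60, 30, and 10 rows.

```python
import pandas as pd

# Toy population with unequal strata (assumed)
population = pd.DataFrame({
    "stratum": ["a"] * 60 + ["b"] * 30 + ["c"] * 10,
    "value": range(100),
})

# Sample 20% of each stratum, so larger strata contribute more rows
sample = population.groupby("stratum").sample(frac=0.2, random_state=0)
counts = sample["stratum"].value_counts().to_dict()

print(counts)
```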

* Cluster random sampling: In cluster random sampling, you randomly select clusters from the population and then sample all members of each cluster. 

In [223]:
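def cluster_random_sample(df, k, n_samples):
  """
  Draw k groups of randomly chosen rows from a DataFrame.

  Note: this draws individual rows at random (with replacement) and groups
  them. It does not select pre-existing clusters and keep all their members,
  so it is only a simplified stand-in for cluster random sampling.

  Args:
    df: A DataFrame of data points.
    k: The number of groups to draw.
    n_samples: The number of rows in each group.

  Returns:
    A list of k groups, each a list of n_samples row arrays.
  """

  clusters = []
  for _ in range(k):
    cluster = []
    while len(cluster) < n_samples:
      index = random.randint(0, len(df) - 1)
      cluster.append(df.iloc[index].values)
    clusters.append(cluster)
  return clusters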

In [226]:
# Using pandas

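# Treat each block of n_samples consecutive rows as one cluster
n_samples = 5

# Pick 5 distinct starting positions at random (the resulting blocks may overlap)
cluster_indices = np.random.choice(len(iris) - n_samples, size=n_samples, replace=False)

samples = []

# Keep every row of each selected block
for cluster_index in cluster_indices:
    sample = iris.iloc[cluster_index : cluster_index + n_samples]
    samples.append(sample.values)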

display(samples[0])

# using a custom function

cluster_random_sample(iris, 5, 1)
    

array([[5.5, 2.3, 4.0, 1.3, 'versicolor'],
       [6.5, 2.8, 4.6, 1.5, 'versicolor'],
       [5.7, 2.8, 4.5, 1.3, 'versicolor'],
       [6.3, 3.3, 4.7, 1.6, 'versicolor'],
       [4.9, 2.4, 3.3, 1.0, 'versicolor']], dtype=object)

[[array([6.9, 3.1, 5.4, 2.1, 'virginica'], dtype=object)],
 [array([6.3, 3.3, 4.7, 1.6, 'versicolor'], dtype=object)],
 [array([4.8, 3.1, 1.6, 0.2, 'setosa'], dtype=object)],
 [array([6.6, 3.0, 4.4, 1.4, 'versicolor'], dtype=object)],
 [array([5.0, 3.5, 1.6, 0.6, 'setosa'], dtype=object)]]
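
For comparison, cluster sampling in the sense defined in this section (pick whole clusters at random, then keep every member of each) might look like the sketch below. The `cluster` column and the 10 clusters of 6 rows are assumptions; in practice clusters are natural groupings such as schools or city blocks.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy population: 10 pre-existing clusters with 6 members each (assumed)
population = pd.DataFrame({
    "cluster": np.repeat(np.arange(10), 6),
    "value": rng.normal(size=60),
})

# Step 1: randomly select 3 distinct clusters
cluster_ids = population["cluster"].unique()
chosen = rng.choice(cluster_ids, size=3, replace=False)

# Step 2: keep all members of the selected clusters
sample = population[population["cluster"].isin(chosen)]

print("chosen clusters:", sorted(chosen))
print("sample size:", len(sample))
```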

## 2.2 * Sample data from a statistical distribution (e.g. normal, binomial, Poisson, exponential, etc.) using Python

In [232]:
import numpy as np
import scipy.stats as stats

# Generate 5 samples from a normal distribution with mean 0 and standard deviation 1
samples = stats.norm(0, 1).rvs(5)
display(samples)

# Generate 5 samples from a binomial distribution with n=10 and p=0.5
samples = stats.binom(10, 0.5).rvs(5)
display(samples)

# Generate 5 samples from a Poisson distribution with lambda=5
samples = stats.poisson(5).rvs(5)
display(samples)

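# Generate 5 samples from an exponential distribution with scale 1, shifted by loc=1
# Caution: the first positional argument of stats.expon is loc (a shift), not the mean,
# so every sample is at least 1; use stats.expon(scale=1) for an unshifted mean of 1
samples = stats.expon(1).rvs(5)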
display(samples)

array([ 1.04212624,  0.18185751,  0.14583106,  0.31338098, -1.34649079])

array([5, 4, 4, 5, 4])

array([8, 8, 6, 5, 7])

array([1.82355793, 1.16453715, 1.28156788, 1.14111947, 1.80062599])

In [251]:
# using only numpy

# Generate 5 samples from a normal distribution with mean 0 and standard deviation 1
normal = np.random.normal(0, 1, 5)
display(normal)

# Generate 5 samples from a binomial distribution with n=10 and p=0.5
binomial = np.random.binomial(10, 0.5, 5)
display(binomial)

# Generate 5 samples from a Poisson distribution with lambda=5
poisson = np.random.poisson(5, 5)
display(poisson)

# Generate 5 samples from an exponential distribution with mean 1
exponential = np.random.exponential(1, 5)
display(exponential)

array([0.82560445, 0.24134217, 1.08769223, 0.92162577, 0.39867176])

array([5, 7, 5, 4, 5])

array([5, 0, 5, 4, 5])

array([0.56762036, 0.97899688, 1.2752171 , 0.14785828, 1.02736449])
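
NumPy also offers the `Generator` interface, created with `np.random.default_rng`, which NumPy recommends for new code over the legacy `np.random.*` functions. The sketch below repeats the four draws with a fixed seed; the seed value is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(seed=123)   # seeded generator for reproducible draws

normal = rng.normal(loc=0, scale=1, size=5)       # mean 0, standard deviation 1
binomial = rng.binomial(n=10, p=0.5, size=5)      # 10 trials, success probability 0.5
poisson = rng.poisson(lam=5, size=5)              # rate lambda = 5
exponential = rng.exponential(scale=1, size=5)    # scale is the mean (1 / rate)

print(normal)
print(binomial)
print(poisson)
print(exponential)
```

Passing keyword arguments such as `scale=` makes it explicit which parameter is being set, which avoids mixing up location, scale, and rate.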

## 2.2 * Calculate a probability from a statistical distribution (e.g. normal, binomial, Poisson, exponential, etc.) using Python

In [255]:
# Calculate the probability of getting exactly 5 heads in 10 coin flips
binom = stats.binom(10, 0.5).pmf(5)
display(binom)

# Evaluate the standard normal density at x = 0.5
# (for a continuous distribution, pdf returns a density, not a probability)
norm = stats.norm(0, 1).pdf(0.5)
display(norm)

# Calculate the probability of getting a value of at most 5 from a Poisson distribution with lambda=5
poisson = stats.poisson(5).cdf(5)
display(poisson)

# Calculate the probability of getting a value less than 1 from an exponential distribution with mean 1
# (scale=1 sets the mean; a bare stats.expon(1) would set loc=1 and give 0.0 here)
expon = stats.expon(scale=1).cdf(1)
display(expon)

0.24609375000000003

0.3520653267642995

0.615960654833063

0.6321205588285577
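
For a continuous distribution, the probability of an interval is a difference of cumulative distribution function (CDF) values, and a strict inequality on a discrete distribution shifts the CDF argument by one. The following sketch illustrates both with `scipy.stats`; the cut-off values are chosen only for illustration.

```python
from scipy import stats

std_normal = stats.norm(loc=0, scale=1)

# P(0 <= X <= 1): difference of two CDF values
p_between = std_normal.cdf(1) - std_normal.cdf(0)
# P(X > 1.96): the survival function sf(x) equals 1 - cdf(x)
p_above = std_normal.sf(1.96)

# Discrete case: P(X < 5) = P(X <= 4) for a Poisson distribution with lambda = 5
p_less_than_5 = stats.poisson(5).cdf(4)

# Exponential distribution with mean 1 (scale=1): P(X < 1) = 1 - exp(-1)
p_expon = stats.expon(scale=1).cdf(1)

print("P(0 <= Z <= 1):", p_between)
print("P(Z > 1.96):", p_above)
print("P(Poisson(5) < 5):", p_less_than_5)
print("P(Exp(mean 1) < 1):", p_expon)
```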

## 2.3 Implement methods for performing statistical tests

## 2.3 * Run statistical tests (e.g. t-test, ANOVA test, chi-square test) using Python.

In [260]:
import scipy.stats as stats

# Create two groups of data
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(1, 1, 100)

# Run the t-test
t_statistic, p_value = stats.ttest_ind(group1, group2)

# Print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

t-statistic: -7.244424595510364
p-value: 9.454388464164657e-12
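
By default `ttest_ind` assumes that both groups have the same variance. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on simulated groups whose sizes and spreads differ; these settings are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1 = rng.normal(0, 1, 100)   # mean 0, standard deviation 1
group2 = rng.normal(1, 3, 40)    # mean 1, larger spread, smaller sample (assumed)

# Student's t-test: pools the variances of both groups
student = stats.ttest_ind(group1, group2)
# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(group1, group2, equal_var=False)

print("Student t:", student.statistic, "p-value:", student.pvalue)
print("Welch t:  ", welch.statistic, "p-value:", welch.pvalue)
```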


In [261]:
import scipy.stats as stats

# Create three groups of data
group1 = np.random.normal(0, 1, 100)
group2 = np.random.normal(1, 1, 100)
group3 = np.random.normal(2, 1, 100)

# Run the ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

# Print the results
print("f-statistic:", f_statistic)
print("p-value:", p_value)

f-statistic: 84.36763762231348
p-value: 9.67545646594095e-30


In [262]:
import scipy.stats as stats

# Create a contingency table
contingency_table = np.array([[20, 10], [10, 20]])

# Run the chi-square test of independence
# (for a 2x2 table, Yates' continuity correction is applied by default)
chi_statistic, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# Print the results
print("chi-statistic:", chi_statistic)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected values:", expected)

chi-statistic: 5.4
p-value: 0.02013675155034633
degrees of freedom: 1
expected values: [[15. 15.]
 [15. 15.]]
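
For 2x2 tables, `chi2_contingency` applies Yates' continuity correction unless `correction=False` is passed, which is why the statistic above is 5.4. A short sketch comparing both settings on the same table:

```python
import numpy as np
from scipy import stats

table = np.array([[20, 10], [10, 20]])

# Default behaviour: Yates' continuity correction (applied when dof == 1)
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(table)
# Plain Pearson chi-square statistic without the correction
chi2_plain, p_plain, _, _ = stats.chi2_contingency(table, correction=False)

print("with correction:   ", chi2_yates, p_yates)
print("without correction:", chi2_plain, p_plain)
```

The uncorrected statistic is larger, so its p-value is smaller; the correction makes the test more conservative for small tables.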


## 2.3 * Analyze the results of statistical tests from Python.

### Binomial Distribution

* The binomial distribution describes the probability of obtaining a certain number of successes in a fixed number of independent Bernoulli trials.
* In hypothesis testing, if you have conducted a binomial test and obtained a p-value, you can interpret it as follows:
    * If the p-value is less than the chosen significance level (e.g., 0.05), you have evidence to reject the null hypothesis, suggesting that the observed results are statistically significant.
    * If the p-value is greater than the significance level, you do not have sufficient evidence to reject the null hypothesis, implying that the observed results are not statistically significant.

### Normal Distribution

* The normal distribution (also called Gaussian distribution) is a continuous probability distribution commonly used to model real-world data that exhibit a symmetric bell-shaped curve.
* In hypothesis testing, when comparing means or conducting tests such as the t-test or z-test:
    * The test statistic (e.g., t-statistic or z-score) measures how many standard errors the sample mean is away from the mean stated in the null hypothesis.
    * The sign of the test statistic indicates the direction of the difference between the sample mean and the hypothesized mean.
    * The p-value represents the probability of observing the data or more extreme results under the null hypothesis.
    * If the p-value is less than the chosen significance level, you can reject the null hypothesis in favor of the alternative hypothesis.

### Poisson Distribution

* The Poisson distribution models the probability of a given number of events occurring within a fixed interval of time or space.
* In hypothesis testing, when analyzing count data or event occurrences:
    * The test statistic depends on the specific test being conducted (e.g., chi-square test for count data).
    * The p-value indicates the probability of observing the data or more extreme results under the null hypothesis.
    * Similar to other tests, if the p-value is less than the chosen significance level, you may reject the null hypothesis.

### t-Distribution

* The t-distribution is used when sample sizes are small or when the population standard deviation is unknown. It closely resembles the normal distribution but has fatter tails.
* In hypothesis testing, particularly in situations with small sample sizes:
    * The t-statistic measures how much the sample mean differs from the null hypothesis mean in terms of standard error.
    * The p-value represents the probability of observing the data or more extreme results under the null hypothesis.
    * If the p-value is less than the chosen significance level, you can reject the null hypothesis.

### Chi-Square Distribution

* The chi-square distribution is commonly used in tests of independence, goodness-of-fit, and homogeneity.
* In hypothesis testing using chi-square tests:
    * The chi-square test statistic measures the discrepancy between the observed and expected frequencies.
    * The p-value indicates the probability of observing the data or more extreme results under the null hypothesis.
    * If the p-value is less than the chosen significance level, you can reject the null hypothesis.

### F-Distribution

* The F-distribution is used in analysis of variance (ANOVA) and other tests involving variance ratios.
* In hypothesis testing using ANOVA or other F-tests:
    * The F-statistic measures the ratio of variability between groups to variability within groups.
    * The p-value represents the probability of observing the data or more extreme results under the null hypothesis.
    * If the p-value is less than the chosen significance level, you can reject the null hypothesis.
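
The decision rule that recurs throughout this section can be wrapped in a small helper. The function `interpret_ttest` below is hypothetical, and reporting Cohen's d alongside the p-value is an added suggestion rather than part of the notes: a p-value says whether a difference is detectable, not how large it is.

```python
import numpy as np
from scipy import stats

def interpret_ttest(a, b, alpha=0.05):
    """Run a two-sample t-test and summarize the decision and effect size."""
    t_stat, p_value = stats.ttest_ind(a, b)
    # Pooled standard deviation, used to express the difference as Cohen's d
    pooled_var = ((len(a) - 1) * np.var(a, ddof=1) + (len(b) - 1) * np.var(b, ddof=1)) / (len(a) + len(b) - 2)
    cohens_d = (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return {"t": t_stat, "p_value": p_value, "cohens_d": cohens_d, "decision": decision}

rng = np.random.default_rng(0)
result = interpret_ttest(rng.normal(0, 1, 100), rng.normal(1, 1, 100))  # simulated groups (assumed)

for key, value in result.items():
    print(key, ":", value)
```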
