### In this notebook we completed four tasks related to applied statistics using Python

In [None]:
#Import Libraries necessary for tasks  

# Mathematical functions from the standard library.
# https://docs.python.org/3/library/math.html
import math

# Permutations and combinations.
# https://docs.python.org/3/library/itertools.html
import itertools

# Random selections.
# https://docs.python.org/3/library/random.html
import random

# Numerical structures and operations.
# https://numpy.org/doc/stable/reference/index.html#reference
import numpy as np

# Plotting.
# https://matplotlib.org/stable/contents.html
import matplotlib.pyplot as plt

#Statistical tests
# https://docs.scipy.org/doc/scipy/reference/stats.html

import scipy.stats as stats

# Statistical data visualization.
#https://seaborn.pydata.org/#seaborn-statistical-data-visualization
import seaborn as sns

#### Task 1: Permutations and Combinations ####

 **Suppose we alter the Lady Tasting Tea experiment to involve twelve cups of tea. Six have the milk in first and the other six have the tea in first.   
 A person claims they have the special power of being able to tell whether the tea or the milk went into a cup first upon tasting it.   
 You agree to accept their claim if they can tell which of the six cups in your experiment had the milk in first.  
 Calculate, using Python, the probability that they select the correct six cups.   
 Here you should assume that they have no special powers in figuring it out, that they are just guessing.**

In [None]:
# Number of cups of tea in total.
no_cups = 12
# Number of cups of tea with milk in first.
no_cups_milk_first = 6
# Number of ways of selecting 6 cups from 12.
ways = math.comb(no_cups, no_cups_milk_first)
# Show.
ways

In [None]:
# Of the 924 possible ways of selecting 6 cups out of 12, only one contains all 6 "milk first" cups.
# The probability that they (randomly) select the 6 correct cups.

1 / ways

*As we can see, the probability of randomly selecting the 6 correct cups is very low.*

**Suppose, now, you are willing to accept one error. Once they select the six cups they think had the milk in first, you will give them the benefit of the doubt should they have selected at least five of the correct cups. Calculate the probability, assuming they have no special powers, that the person makes at most one error.**


In [None]:
# Number of cups of tea in total.
n = 12
# Number of cups of tea with milk in first.
k = 6
# 12 factorial.
math.factorial(n)
# 6 factorial.
math.factorial(n - k)
# No of ways of selecting k objects from n without replacement and without order.
math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

In [None]:
# The cup labels.
labels = list(range(no_cups))

# Show.
labels

In [None]:
# Show the different ways of selecting no_cups_milk_first out of no_cups cups of tea.
combs = list(itertools.combinations(labels, no_cups_milk_first))

# Show.
#combs

# Number of combinations.
len(combs)

# Select six cups at random to put milk in first.
# https://docs.python.org/3/library/random.html#random.sample
labels_milk = random.sample(labels, no_cups_milk_first)

# Sort, inplace.
labels_milk.sort()

# Show.
labels_milk



In [None]:
# Turn labels_milk into a set.
# Uses: https://docs.python.org/3/tutorial/datastructures.html#sets
set(labels_milk)

In [None]:
# Calculate the overlap between each element of combs and labels_milk.

no_overlaps = []

for comb in combs:
  # Turn comb into a set.
  s1 = set(comb)
  # Turn labels_milk into a set.
  s2 = set(labels_milk)
  # Figure out where they overlap.
  overlap = s1.intersection(s2)
  # Show the combination and the overlap.
  #print(comb, overlap, len(overlap))
  # Append overlap to no_overlaps.
  no_overlaps.append(len(overlap))

In [None]:
# Count the number of times each overlap occurs.
counts = np.unique(no_overlaps, return_counts=True)

# Show.
counts

In [None]:
# Create a figure.
fig, ax = plt.subplots(figsize=(6, 4))

# Bar chart.
ax.bar(counts[0], counts[1]);


In [None]:
# The probability that they (randomly) select at least 5 correct cups:
# 36 selections have exactly 5 correct cups and 1 selection has all 6.
(36 + 1) / 924

*Comparing the probability of selecting all 6 cups correctly (0.0010822510822510823) with the probability of making at most one error (0.04004329004329004), we can see that the latter is noticeably higher, but still quite low.*
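
The counts 36 and 1 used above can also be derived directly rather than read off the enumeration. A minimal sketch, assuming the standard counting argument that a selection with exactly k correct cups picks k of the 6 milk-first cups and 6 - k of the 6 tea-first cups; the names `n_cups`, `n_milk_first` and `ways_exactly_correct` are illustrative and not part of the original cells:

```python
import math

n_cups = 12        # total cups, from the task statement
n_milk_first = 6   # cups with milk poured first (also the number selected)

def ways_exactly_correct(k):
    """Number of 6-cup selections containing exactly k milk-first cups."""
    n_tea_first = n_cups - n_milk_first
    return math.comb(n_milk_first, k) * math.comb(n_tea_first, n_milk_first - k)

total = math.comb(n_cups, n_milk_first)
counts_by_correct = {k: ways_exactly_correct(k) for k in range(n_milk_first + 1)}
print(counts_by_correct)

# At most one error means 5 or 6 correct cups.
p_at_most_one_error = (counts_by_correct[5] + counts_by_correct[6]) / total
print(p_at_most_one_error)
```

The dictionary should match the counts returned by `np.unique` in the enumeration above, whichever six cups happened to be drawn as the milk-first ones.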


**Would you accept two errors? Explain.**

In [None]:
# The probability that they (randomly) select at least 4 correct cups:
# 225 selections have exactly 4 correct cups, 36 have exactly 5 and 1 has all 6.
(225 + 36 + 1) / 924


- Probability of selecting all 6 correct cups: 0.0010822510822510823 (≈ 0.11%).
- Probability of making at most one error (5 or 6 correct): 0.04004329004329004 (≈ 4.00%).
- Probability of making at most two errors (4, 5 or 6 correct): 0.28354978354978355 (≈ 28.35%).

##### Conclusion:  

If we accept at most one error (5 or 6 correct cups), the person has about a 4% chance of succeeding by random guessing. If we accept two errors (4 or more correct cups), the probability jumps to about 28%. This makes it much more likely that the person could succeed by random guessing. We will not accept two errors if we want to establish the ability to tell the difference between "milk first" and "tea first" cups.
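
The same tail probabilities can be cross-checked with the hypergeometric distribution, which models the number of correct cups in a random selection. This is a sketch only: the 0.05 threshold is an assumed conventional significance level that the task itself does not specify, and the names `correct`, `alpha` and `tail_probs` are illustrative.

```python
from scipy import stats

# Hypergeometric model: 12 cups in total, 6 are milk-first, 6 are selected.
total_cups, milk_first, selected = 12, 6, 6
correct = stats.hypergeom(total_cups, milk_first, selected)

alpha = 0.05  # assumed significance level, not given in the task
tail_probs = {}
for errors_allowed in range(3):
    min_correct = selected - errors_allowed
    # sf(k) is P(X > k), so P(X >= min_correct) is sf(min_correct - 1).
    tail_probs[errors_allowed] = float(correct.sf(min_correct - 1))
    print(errors_allowed, round(tail_probs[errors_allowed], 4),
          tail_probs[errors_allowed] < alpha)
```

Under that assumed threshold, zero or one allowed error keeps the chance of a lucky guess below 5%, while two allowed errors does not.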

#### Task 2: numpy's Normal Distribution ####
**Assess whether numpy.random.normal() properly generates normal values. To begin, generate a sample of one hundred thousand values using the function with mean 10.0 and standard deviation 3.0.**

In [None]:
# Generate a sample of one hundred thousand values
# using the function with mean 10.0 and standard deviation 3.0.

# Parameter names in the numpy documentation:
# mean - loc,
# standard deviation - scale.
mean = 10.0
std_dev = 3.0
sample_size = 100000
# Generate the sample.

sample = np.random.normal(mean, std_dev, sample_size)

#print first 10 values
print(sample[:10])


**Use the scipy.stats.shapiro() function to test whether your sample came from a normal distribution.**

In [None]:
#Use the scipy.stats.shapiro() function 
# to test whether your sample came from a normal distribution. 
stats.shapiro(sample)


**Explain the results and output.**  

We received two outputs:

- statistic: a measure of how closely the sample resembles a normal distribution.  
The closer this statistic is to 1, the more normal the data.
- p-value: indicates whether the sample's deviation from normality is statistically significant.  
  - If p-value > 0.05: fail to reject the null hypothesis. There is no strong evidence against normality, so the data is consistent with a normal distribution.
  - If p-value ≤ 0.05: reject the null hypothesis. There is evidence that the sample is not normally distributed.

Source: https://en.wikipedia.org/wiki/Shapiro-Wilk_test
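
One caveat applies to a sample of this size: the scipy documentation notes that for more than 5000 values the test statistic is accurate but the p-value may not be, and `scipy.stats.shapiro()` warns about this. A minimal sketch of one possible workaround, assuming a random subsample of 5000 values is representative of the full sample; the seed and the names `full_sample`, `subsample` and `result` are illustrative and not part of the original analysis:

```python
import numpy as np
import scipy.stats as stats

np.random.seed(42)  # illustrative seed, only for reproducibility
full_sample = np.random.normal(10.0, 3.0, 100000)

# Draw 5000 values without replacement so the p-value stays in the
# range where scipy documents it as reliable.
subsample = np.random.choice(full_sample, size=5000, replace=False)

result = stats.shapiro(subsample)
print(f"W = {result.statistic:.4f}, p-value = {result.pvalue:.4f}")
```

The interpretation of the two outputs is the same as described above; only the number of values passed to the test changes.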


**Plot a histogram of your values and plot the corresponding normal distribution probability density function on top of it.**

In [None]:
# Plot a histogram of the values.
fig, ax = plt.subplots()
ax.hist(sample, bins=51, edgecolor='black', density=True)

# Plot the corresponding normal distribution probability density function on top of it.
x = np.linspace(mean - 4 * std_dev, mean + 4 * std_dev, 1000)
pdf = stats.norm.pdf(x, mean, std_dev)
ax.plot(x, pdf, 'r', linewidth=2, label="Normal Distribution PDF")
# Add titles and labels.
ax.set_title("Histogram of Sample Data with Normal Distribution PDF")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()

*We visualized the data using a histogram overlaid with the corresponding probability density function (PDF) of a normal distribution with the same mean and standard deviation. The histogram closely matched the theoretical PDF, further supporting the correctness of numpy.random.normal() in generating normal values.*

##### Conclusion:  
Both statistical testing and visual inspection confirm that numpy.random.normal() produces data that aligns well with the properties of a normal distribution under the specified parameters.
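
As a further numerical check alongside the test and the plot, the sample's summary statistics can be compared with what a normal distribution with mean 10.0 and standard deviation 3.0 predicts. This is a sketch, not part of the original analysis: the seed and the variable names are illustrative, and the comparison values in the comments are the theoretical ones for a normal distribution, not measured outputs.

```python
import numpy as np
import scipy.stats as stats

np.random.seed(0)  # illustrative seed, only for reproducibility
values = np.random.normal(10.0, 3.0, 100000)

sample_mean = values.mean()            # theory: 10.0
sample_std = values.std(ddof=1)        # theory: 3.0
sample_skew = stats.skew(values)       # theory: 0 (symmetric)
sample_kurt = stats.kurtosis(values)   # excess kurtosis, theory: 0

# Share of values within one standard deviation of the mean (theory: about 68.3%).
within_one_sd = np.mean(np.abs(values - 10.0) <= 3.0)

print(sample_mean, sample_std, sample_skew, sample_kurt, within_one_sd)
```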

#### Task 3: t-Test Calculation ####
**Consider the following dataset containing resting heart rates for patients before and after embarking on a two-week exercise program.**

| Patient ID | 0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  |
|------------|----|----|----|----|----|----|----|----|----|----|
| Before     | 63 | 68 | 70 | 64 | 74 | 67 | 70 | 57 | 66 | 65 |
| After      | 64 | 64 | 68 | 64 | 73 | 70 | 72 | 54 | 61 | 63 |

**Calculate the t-statistic based on this data set, using Python. Compare it to the value given by scipy.stats. Explain your work and list any sources used.**

In [None]:
#Set up the data
before = np.array([63, 68, 70, 64, 74, 67, 70, 57, 66, 65])
after = np.array([64, 64, 68, 64, 73, 70, 72, 54, 61, 63])

##### Visualizing data
To gain some insight into the data, we visualized it using several techniques.  
A histogram is plotted to compare the distributions of the "before" and "after" heart rates, revealing their spread and central tendency.  
A strip plot shows the individual "before" and "after" heart rates, giving a visual impression of any shift between the two groups.  
A box plot is used to identify potential outliers and compare the central tendency and spread before and after the exercise program.  
These visualizations help in understanding the data characteristics and patterns beyond numerical statistics.

In [None]:
# Create a figure and a set of axes.
fig, ax = plt.subplots()
# Create histogram.
ax.hist(before, bins = 10, color='blue', alpha=0.5, label='Before')
ax.hist(after, bins = 10, color='green', alpha=0.5, label='After')
ax.set_title("Histogram of Heart Rates")
ax.set_xlabel("Heart Rate (BPM)")
ax.set_ylabel("Frequency")
ax.legend()

#Fix x-axis ticks by setting a range of values
min_tick = min(before.min(), after.min()) - 1
max_tick = max(before.max(), after.max()) + 1
ax.set_xticks(range(min_tick, max_tick + 1, 1))


fig, ax = plt.subplots(1,2, figsize=(12, 6))
# Create a strip plot.
sns.stripplot(data=[before, after], ax=ax[0], palette=["blue", "green"])
ax[0].set_xticks([0,1])
ax[0].set_xticklabels(["Before", "After"])
ax[0].set_title("Strip Plot of Heart Rates")

# Create a box plot for "before" and "after" data
sns.boxplot(data=[before, after], ax=ax[1], palette=["blue", "green"])
ax[1].set_xticks([0, 1])
ax[1].set_xticklabels(["Before", "After"])
ax[1].set_title("Box Plot of Heart Rates")



*If we visually compare the "before" and "after" datasets, heart rates appear slightly lower after the two-week exercise program.*

**Calculate the t-statistic based on this data set, using Python.**
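
The manual calculation in the next cell follows the standard paired-samples t-statistic. In the notation assumed here, $d_i$ is the difference `after - before` for patient $i$, $\bar{d}$ is the mean of the differences, $s_d$ is their sample standard deviation, and $n$ is the number of patients:

$$
t = \frac{\bar{d}}{s_d / \sqrt{n}}, \qquad
s_d = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(d_i - \bar{d}\right)^2}
$$

For this dataset $\bar{d} = -1.1$, $s_d \approx 2.601$ and $n = 10$, so $t \approx -1.1 / 0.823 \approx -1.337$.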

In [None]:
# Step 2: Calculate the differences
differences = after - before

# Step 3: Calculate the mean and standard deviation of the differences
mean_diff = np.mean(differences)
std_diff = np.std(differences, ddof=1)  # ddof=1 for sample standard deviation
n = len(differences)

# Step 4: Calculate the t-statistic
t_statistic_python = mean_diff / (std_diff / np.sqrt(n))

# Output results
t_statistic_python

**Calculate the t-statistic by scipy.stats.**

In [None]:
# Step 5: Verify with scipy.stats
t_statistic_scipy, p_value_scipy = stats.ttest_rel(after, before)

# Output results
t_statistic_scipy, p_value_scipy

##### Conclusion:  

- Manual calculation: we calculated the mean and standard deviation of the differences in resting heart rates before and after the program, and used these values to find a t-statistic of −1.337.
- Scipy calculation: scipy.stats.ttest_rel gave the same t-statistic of −1.337 and a p-value of 0.214, indicating no statistically significant change in heart rate at a typical 0.05 significance level.

This verification method shows consistency between manual calculations and scipy’s implementation.  
If we visually compare the "before" and "after" datasets, there is a slight indication that heart rates appear lower after the two-week exercise program. While this trend is observable, it is not strong enough to be statistically significant based on the t-test results.
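
The p-value can be reproduced by hand as well, which makes the link between the t-statistic and the conclusion explicit. A minimal sketch, assuming a two-sided test with n - 1 degrees of freedom; the 95% confidence interval is an addition for illustration and was not part of the original task:

```python
import numpy as np
import scipy.stats as stats

before = np.array([63, 68, 70, 64, 74, 67, 70, 57, 66, 65])
after = np.array([64, 64, 68, 64, 73, 70, 72, 54, 61, 63])

differences = after - before
n = len(differences)
std_error = differences.std(ddof=1) / np.sqrt(n)
t_stat = differences.mean() / std_error

# Two-sided p-value: probability of a |t| at least this large under the
# null hypothesis, using a t-distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# 95% confidence interval for the mean difference (illustrative addition).
t_crit = stats.t.ppf(0.975, df=n - 1)
ci = (differences.mean() - t_crit * std_error,
      differences.mean() + t_crit * std_error)

print(t_stat, p_value, ci)
```

An interval that contains 0 tells the same story as a p-value above 0.05: the data do not rule out a mean change of zero.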

#### Task 4: ANOVA ####
**In this test we will estimate the probability of committing a type II error in specific circumstances. To begin, create a variable called no_type_ii and set it to 0.  
Now use a loop to perform the following test 10,000 times.  
Use numpy.random.normal to generate three samples with 100 values each. Give each a standard deviation of 0.1. Give the first sample a mean of 4.9, the second a mean of 5.0, and the third a mean of 5.1.  
Perform one-way anova on the three samples and add 1 to no_type_ii whenever a type II error occurs.  
Summarize and explain your results.**

In [None]:
# Create a variable called no_type_ii and set it to 0.
no_type_ii = 0

In [None]:
# Define parameters for the test
num_tests = 10000  # Number of iterations
num_values = 100    # Number of values in each sample
std_dev = 0.1       # Standard deviation for each sample
means = [4.9, 5.0, 5.1]
# Perform the test 10,000 times.

for _ in range(num_tests):
    # Generate three samples with the specified means and standard deviation
    sample1 = np.random.normal(means[0], std_dev, num_values)
    sample2 = np.random.normal(means[1], std_dev, num_values)
    sample3 = np.random.normal(means[2], std_dev, num_values)
    
    # Perform one-way ANOVA
    f_stat, p_value = stats.f_oneway(sample1, sample2, sample3)
    
    # The three means really differ, so the null hypothesis is false; failing to reject it (p-value > 0.05) is a Type II error.
    if p_value > 0.05:
        no_type_ii += 1

# Display the number of Type II errors
print(f"Number of Type II errors: {no_type_ii}")


##### Conclusion:  
- The absence of type II errors in this simulation suggests that the ANOVA test had enough statistical power to detect the differences between the sample means in every iteration.
- The differences are large relative to the noise: adjacent means are 0.1 apart, which is one full standard deviation, and with 100 values per sample the standard error of each sample mean is only 0.1 / √100 = 0.01.
- Under the given conditions, the estimated probability of committing a type II error was effectively 0. With a larger standard deviation, smaller samples, or means that are closer together, type II errors would be expected to occur.
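
To see how strongly this result depends on the noise level, the simulation can be wrapped in a function and rerun with a larger standard deviation. This is a sketch under stated assumptions: `type_ii_rate` is a hypothetical helper, the seed and the reduced number of repetitions are chosen only to keep the run short, and the standard deviation of 1.0 is an illustrative value not taken from the task. The estimate is only a type II error rate when the supplied means really differ.

```python
import numpy as np
import scipy.stats as stats

def type_ii_rate(means, std_dev, n_values=100, n_tests=1000, alpha=0.05, seed=0):
    """Estimate the share of one-way ANOVA tests that fail to reject H0."""
    rng = np.random.default_rng(seed)
    misses = 0
    for _ in range(n_tests):
        # One sample per group mean, all with the same standard deviation.
        groups = [rng.normal(m, std_dev, n_values) for m in means]
        _, p_value = stats.f_oneway(*groups)
        if p_value > alpha:
            misses += 1
    return misses / n_tests

means = [4.9, 5.0, 5.1]
rate_original = type_ii_rate(means, std_dev=0.1)  # settings from the task
rate_noisier = type_ii_rate(means, std_dev=1.0)   # illustrative noisier setting
print(f"std 0.1: {rate_original:.3f}, std 1.0: {rate_noisier:.3f}")
```

With ten times the standard deviation, the same differences between means become hard to detect, so most tests should fail to reject the null hypothesis.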