In [2]:
#Loading the dataset
import pandas as pd

athletes = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-07-27/athlete_events.csv")

__SECTION ONE__

1. **In inferential statistics, we attempt to draw a conclusion about a population from sample data. Define the two terms:**  
- __Population__ - _The entire group of individuals, items, or data that you are interested in studying or drawing conclusions about._

- __Sample__ - _A subset of the population that is selected for analysis._

2. **When we estimate an unknown population value using a confidence interval, what does the confidence interval tell us?**  
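_A confidence interval gives a range of plausible values for an unknown population parameter, together with a confidence level that describes how reliable the estimation procedure is. For example, if the 95% confidence interval for the average height of adult men is (168 cm, 174 cm), we are 95% confident that the true average height of all adult men in the population lies between 168 cm and 174 cm. Strictly speaking, the 95% describes the method rather than one particular interval: if we repeated the sampling many times, about 95% of the intervals built this way would contain the true value._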
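
As a rough illustration of the answer above, the following sketch computes a 95% confidence interval for a mean using the t distribution. The sample is simulated (the seed, mean, spread, and sample size are assumptions chosen for illustration, not real measurements).

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 simulated heights in cm (not real data)
rng = np.random.default_rng(0)
sample = rng.normal(loc=171, scale=7, size=50)

mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean (uses ddof=1)

# 95% confidence interval for the mean, based on the t distribution
ci_low, ci_high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.1f} cm, 95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```

A larger sample or a lower confidence level would give a narrower interval.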

3. **What is a hypothesis?**  
_A hypothesis is a statement or a claim about a population that you aim to test using data from a sample. It is the starting point for hypothesis testing, which determines whether there is enough evidence in a sample to reject the assumption about the population. For example, "The average height of adult men in Kenya is 170 cm" is a hypothesis that can be tested once the study takes place._

4. **Describe the two types of hypothesis. Give an example for each.**  

- _Null hypothesis (H₀) - This is the default assumption that there is no effect, no difference, or no relationship in the population. It is the statement being tested, and the goal of hypothesis testing is often to reject the null hypothesis in favor of the alternative hypothesis. Example: "The average height of adult men in Kenya is 170 cm."_
- _Alternative hypothesis (H₁) - This hypothesis contradicts the null hypothesis and proposes that there is an effect, change, or difference in the population. Example: "The average height of adult men in Kenya is not 170 cm."_

5. **What is hypothesis testing? Which of the two types of hypothesis do we test in hypothesis testing?**  
_Hypothesis testing is a statistical method used to make decisions or inferences about a population based on sample data. In hypothesis testing, we test the null hypothesis (H₀). We assume H₀ is true at the start. The goal is to determine whether the sample data provide strong enough evidence to reject H₀ in favor of the alternative hypothesis (H₁)._

6. **Describe the step-by-step process for hypothesis testing.**

- _**Formulate Hypotheses** - The first step is to state the null (H₀) and alternative (H₁) hypotheses._
- _**Choose the Significance Level (α)** - Secondly, we choose the probability of rejecting the null hypothesis when it is true. This is the threshold for how much error we're willing to accept. Common choices are 0.05 (5%), 0.01 (1%), and 0.10 (10%)._
- _**Conduct the statistical test** - This process involves choosing a statistical test based on the type of data and the hypothesis, gathering the data that will be analyzed in the test, and calculating a test statistic that reflects how much the observed data deviates from the null hypothesis._
- _**Determine the p-value** - Here, we calculate the probability of observing test results at least as extreme as the results observed, assuming the null hypothesis is correct. It helps determine the strength of the evidence against the null hypothesis._
- _**Make a Decision** - Finally, we compare the p-value to the chosen significance level to make the final decision. If the p-value is less than or equal to the chosen significance level, we reject the null hypothesis. Otherwise, we fail to reject the null hypothesis, meaning there is not enough evidence to support the alternative._
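
The steps above can be sketched with a one-sample t-test on the Kenya height example from question 3. The data are simulated (seed, mean, spread, and sample size are assumptions for illustration), so the numbers say nothing about real heights.

```python
import numpy as np
from scipy import stats

# Step 1: hypotheses. H0: mean height = 170 cm; H1: mean height != 170 cm
hypothesized_mean = 170

# Step 2: significance level
alpha = 0.05

# Step 3: gather data (simulated here, purely illustrative) and run the test
rng = np.random.default_rng(42)
heights = rng.normal(loc=172, scale=6, size=40)
t_stat, p_value = stats.ttest_1samp(heights, popmean=hypothesized_mean)

# Steps 4 and 5: compare the p-value to alpha and decide
decision = "reject H0" if p_value <= alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.4f} -> {decision}")
```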


7. **What is alpha in hypothesis testing?**  
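_Alpha (α) is the significance level: the probability of rejecting the null hypothesis (H₀) when it is actually true (a Type I error) that we are willing to accept. It is the threshold that helps us decide whether to reject H₀. Common choices for α are 0.05 (5%), 0.01 (1%), and 0.10 (10%)._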


8. **How does the p-value obtained from hypothesis testing relate to alpha?**  
_The p-value and alpha (α) work together to help us decide whether to reject or fail to reject the null hypothesis (H₀): α is the threshold chosen before the test, and the p-value is the result compared against it. If the p-value is less than or equal to α, the observed data would be unlikely if H₀ were true, so we reject H₀. If the p-value is greater than α, the data are not unusual enough, so we fail to reject H₀._

9. **When selecting which statistical test to use in hypothesis testing, what assumptions do we make about the data for a parametric test?**  
_When choosing a parametric statistical test, we assume that:_
- _The data are **normally** distributed. This means the data should be roughly bell-shaped, with a symmetrical distribution around the mean. If the data are clearly not normally distributed, a non-parametric test is usually the safer choice, since non-parametric tests do not assume normality._
- _The data points within the sample should be **independent** of each other. This means that the value of one observation should not influence the value of any other observation._
- _The variance within each group being compared should be roughly **equal**. If one group has much more variation than the others, the test's results become less reliable._

10. **Name one test used to determine whether a dataset follows a normal distribution.**  
_The Kolmogorov-Smirnov test_
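
A minimal sketch of how the Kolmogorov-Smirnov test might be applied with `scipy.stats.kstest`, using two simulated datasets (both are assumptions for illustration). The helper `ks_normality` is hypothetical; it standardizes the data before comparing it with a standard normal. Because the mean and standard deviation are estimated from the same data, the resulting p-value is only approximate.

```python
import numpy as np
from scipy import stats

# Simulated data: one roughly normal sample, one clearly skewed sample
rng = np.random.default_rng(1)
normal_data = rng.normal(loc=0, scale=1, size=500)
skewed_data = rng.exponential(scale=1.0, size=500)

def ks_normality(data):
    """Standardize the data, then compare it with a standard normal (KS test)."""
    z = (data - data.mean()) / data.std(ddof=1)
    return stats.kstest(z, "norm")

for name, data in [("normal", normal_data), ("skewed", skewed_data)]:
    result = ks_normality(data)
    print(f"{name}: statistic = {result.statistic:.3f}, p-value = {result.pvalue:.4g}")
```

A small p-value suggests the data depart from a normal distribution.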

__SECTION TWO__

Research question:  

Are male and female Olympic athletes the same height on average?

a. **State the hypotheses. Define the null and alternative hypotheses.**  
- _Null hypothesis (H₀) - The average height of male and female athletes is the same._
- _Alternative hypothesis (H₁) - The average height of male and female athletes is different._

b. **We have already chosen the significance level of (α = 0.05). So the next step is to choose the appropriate statistical test: consider whether the data is numerical or categorical, whether the groups are independent, and the number of groups. Also test for normality of the data as well as equality of variance.**  
- _The Height column is numerical. We will compare it across the Sex column, which is categorical with two values, M or F. The two groups are **independent**, since each athlete belongs to only one of them. Let's test whether the remaining assumptions are met, namely **normality** and **equality** of variance between the groups._

In [3]:
# Using the Anderson-Darling test to test the normality
from scipy.stats import anderson
import matplotlib.pyplot as plt

# Drop missing height values
athletes = athletes.dropna(subset=["Height"])

# Separating height data by sex
male_heights = athletes[athletes['Sex'] == 'M']['Height'].dropna()
female_heights = athletes[athletes['Sex'] == 'F']['Height'].dropna()

# Anderson-Darling Test for Normality
anderson_m = anderson(male_heights)
anderson_f = anderson(female_heights)

print("Anderson-Darling Test (Males):")
print("Statistic:", anderson_m.statistic)
print("Critical Values:", anderson_m.critical_values)
print("Significance Levels:", anderson_m.significance_level)

print("Anderson-Darling Test (Females):")
print("Statistic:", anderson_f.statistic)
print("Critical Values:", anderson_f.critical_values)
print("Significance Levels:", anderson_f.significance_level)


Anderson-Darling Test (Males):
Statistic: 123.37905829734518
Critical Values: [0.576 0.656 0.787 0.918 1.092]
Significance Levels: [15.  10.   5.   2.5  1. ]
Anderson-Darling Test (Females):
Statistic: 89.8122314439388
Critical Values: [0.576 0.656 0.787 0.918 1.092]
Significance Levels: [15.  10.   5.   2.5  1. ]


In [4]:
# Using Levene’s Test to test the equality of Variances
from scipy.stats import levene

levene_test = levene(male_heights, female_heights)
print("Levene's test:", levene_test)


Levene's test: LeveneResult(statistic=np.float64(494.37845930272937), pvalue=np.float64(2.1243629849629428e-109))


c. **State the justifications for your choice for a statistical test.**  
- _The height variable is **numerical** (continuous)._
- _We are comparing two **independent** groups, male and female athletes._
- _For male heights, the test statistic (123.38) is far greater than the critical value at the 5% significance level (0.787). Therefore, we reject normality: the male group is **not normally** distributed._
- _For female heights, the test statistic (89.81) is also far greater than the critical value at the 5% significance level (0.787). Therefore, the female group is also **not normally** distributed._
- _The p-value in Levene's test is less than 0.05. Therefore, the two groups have significantly different variances._
- _Since neither the normality assumption nor the equal-variance assumption is met, and the groups are independent, we will use a non-parametric test for unpaired data, in this case the Mann–Whitney U test._
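
The comparison of the Anderson-Darling statistic with the 5% critical value can also be done in code rather than by eye. The sketch below uses a hypothetical helper, `rejects_normality`, and simulated skewed data (an assumption for illustration) in place of the height columns.

```python
import numpy as np
from scipy.stats import anderson

def rejects_normality(values, level=5.0):
    """Return True if the Anderson-Darling statistic exceeds the critical
    value at the given significance level (in percent)."""
    result = anderson(values, dist="norm")
    # significance_level lists the levels in percent, e.g. 15, 10, 5, 2.5, 1
    idx = list(result.significance_level).index(level)
    return bool(result.statistic > result.critical_values[idx])

# Simulated, clearly non-normal data (not the athlete heights)
rng = np.random.default_rng(3)
skewed = rng.exponential(scale=1.0, size=1000)
print("Reject normality at 5%:", rejects_normality(skewed))
```

With the notebook's data, the same helper could be called as `rejects_normality(male_heights)` and `rejects_normality(female_heights)`.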

d. **Prepare the data. Handle missing values and filter for male and female athletes with valid height entries.**

In [5]:
# Ensuring heights are clean
male_heights = male_heights.dropna()
female_heights = female_heights.dropna()

e & f. **Conduct the statistical test. & Compute the test statistic and p-value.**

In [6]:
# Importing the library
from scipy.stats import mannwhitneyu

u_stat, p_value = mannwhitneyu(male_heights, female_heights, alternative='two-sided')

print("Mann–Whitney U Test statistic:", u_stat)
print("p-value:", p_value)


Mann–Whitney U Test statistic: 7814318265.5
p-value: 0.0


g. **Make a decision: interpret the p-value and state whether you reject or fail to reject the null hypothesis.**

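_From the Mann–Whitney U test above, the p-value is far below 0.05. It is printed as 0.0 because it is too small to be represented as a floating-point number, not because it is exactly zero. Therefore, we have very strong evidence against the null hypothesis (H₀). Consequently, we reject the null hypothesis and conclude that there is a statistically significant difference between the heights of male and female Olympic athletes. Note that the Mann–Whitney U test compares the two height distributions as a whole rather than their means directly._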
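
The decision above says that the groups differ, but not by how much. One possible follow-up, sketched here on simulated data (the group means, spreads, and sizes are assumptions, not the athlete data), is to turn the U statistic into the share of pairs in which a value from the first group exceeds a value from the second. This assumes a SciPy version in which `mannwhitneyu` returns the U statistic of the first sample.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Simulated heights in cm for two hypothetical groups
rng = np.random.default_rng(7)
group_a = rng.normal(loc=179, scale=9, size=300)
group_b = rng.normal(loc=168, scale=8, size=300)

u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Share of (a, b) pairs in which the value from group_a is larger (ties count half)
prob_a_greater = u_stat / (len(group_a) * len(group_b))
print(f"U = {u_stat:.1f}, p = {p_value:.3g}, P(a > b) = {prob_a_greater:.2f}")
```

A value near 0.5 would indicate heavily overlapping groups, while a value near 0 or 1 would indicate strong separation.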