<img src='https://sharif.edu/~izadi/images/logo_sharif.png' alt="SUT logo" width=260 height=300 align=left class="saturate">

<br><br>
<font face="Times New Roman">
    <div dir=ltr align=center>
        <font color=0F5298 size=7>
            Probability and Statistics
        </font>
        <br><br>
        <font color=2565AE size=5>
            Computer Engineering Department<br>Lecturer: Dr. Ali Sharifi Zarchi<br>Spring 2025
        </font>
        <br><br>
        <font color=3C99D size=5>
            Homework 6 (Practical): Statistical Tests in Probability and Statistics
        </font>
        <br><br>
        <font color=6EACDA size=4>
            Authors: Arshia Dadras, Aida Jalali, Alireza Malekhosseini, Leili Motahari, Radin Jarireh
        </font>
    </div>
    <br><br>
</font>

____

#### Student Information  
- **First Name**: YOUR FIRST NAME
- **Last Name**: YOUR LAST NAME
- **Student ID**: YOUR STUDENT ID

## Introduction to Statistical Tests

Hypothesis testing is a fundamental method in statistics that allows us to make decisions about population parameters based on sample data. It involves two competing hypotheses:

- **Null Hypothesis (H₀)**: The default assumption, often stating no effect or no difference.
- **Alternative Hypothesis (H₁)**: The hypothesis we aim to support, stating that an effect or difference exists.

We use a test statistic to evaluate these hypotheses, comparing it to a critical value or calculating a p-value to decide whether to reject H₀. Statistical tests can be categorized as:

- **Parametric Tests**: Assume the data come from a distribution family described by a few parameters (e.g., normal) and test hypotheses about those parameters. Some, such as the Z test, also assume a known population variance.
- **Non-Parametric Tests**: Make fewer assumptions about the data distribution, suitable for smaller samples or non-normal data.
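
The decision rule described above can be sketched in a few lines. This is only an illustration: the observed statistic `z_obs = 2.1`, the significance level `alpha = 0.05`, and the assumption that the statistic is standard normal under H₀ are all made up for the example.

```python
from scipy.stats import norm

alpha = 0.05   # significance level (assumed for illustration)
z_obs = 2.1    # hypothetical observed test statistic, standard normal under H0

# Critical-value approach: reject H0 if |z_obs| exceeds the two-sided critical value
z_crit = norm.ppf(1 - alpha / 2)
reject_by_critical_value = abs(z_obs) > z_crit

# P-value approach: probability of a statistic at least this extreme under H0
p_val_sketch = 2 * norm.sf(abs(z_obs))
reject_by_p_value = p_val_sketch < alpha

print(f"Critical value: {z_crit:.3f}, p-value: {p_val_sketch:.4f}")
print(f"Reject H0? {reject_by_critical_value} (critical value), {reject_by_p_value} (p-value)")
```

For a two-sided test the two approaches always lead to the same decision, because the p-value falls below `alpha` exactly when the statistic exceeds the critical value.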

In this notebook, you will explore four key statistical tests:

1. **Permutation Test**: A non-parametric test to assess if two datasets come from the same distribution.
2. **Chi-Square Test**: A test for categorical data to evaluate goodness of fit or independence.
3. **Z Test**: A parametric test for hypotheses about means or proportions with known variance or large samples.
4. **Fisher's Exact Test**: A test for associations between categorical variables, ideal for small samples.

Each section includes an introduction, a practical example with Python code, and TODOs for you to complete. These exercises will help you apply the concepts and deepen your understanding.

## Instructions for Students

1. Complete all TODO sections by writing the necessary code or answers.
2. Ensure you have the required datasets (`Occupancy_Estimation.csv`, `StudentsPerformance.csv`, `heart_cleveland_upload.csv`) and adjust file paths as needed.
3. Run the code cells to verify your solutions and interpretations.
4. Use the theoretical questions to deepen your understanding of each test's purpose and application.

## Section 1: Permutation Test

### Overview

The permutation test is a non-parametric method used to determine if two datasets originate from the same distribution. It works by reassigning the pooled observations to the two groups, either by repeated random shuffling or, for small samples, by enumerating every possible split, and recalculating a test statistic (e.g., difference in means) each time to build its distribution under the null hypothesis. The observed statistic is then compared to this distribution to compute a p-value.
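
As a rough sketch of the idea, the block below runs a Monte Carlo permutation test on hypothetical toy samples `x` and `y` (not the homework data), using the difference in medians as the statistic. The helper name `diff_in_medians`, the number of resamples, and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical toy samples (not the homework data)
x = np.array([12.1, 9.8, 11.4, 10.9])
y = np.array([13.5, 12.8, 14.1, 12.2, 13.0])

def diff_in_medians(a, b):
    """Test statistic for this sketch: difference of the sample medians."""
    return np.median(a) - np.median(b)

observed = diff_in_medians(x, y)

pooled = np.concatenate([x, y])
n_resamples = 5000
null_stats = np.empty(n_resamples)
for i in range(n_resamples):
    shuffled = rng.permutation(pooled)  # random relabeling of the pooled data under H0
    null_stats[i] = diff_in_medians(shuffled[:len(x)], shuffled[len(x):])

# Two-sided Monte Carlo p-value; the +1 terms count the observed labeling itself
p_mc = (np.sum(np.abs(null_stats) >= abs(observed)) + 1) / (n_resamples + 1)
print(f"Observed statistic: {observed:.2f}, Monte Carlo p-value: {p_mc:.4f}")
```

Random shuffling only approximates the null distribution. With samples this small, enumerating every split gives the exact p-value instead.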

### Use Case

You have sleep hours data for two groups of students:

- **Group A**: Students with exams (`group_A = [5.5, 5, 6.3, 8, 3]`).
- **Group B**: Students without exams (`group_B = [7.5, 7.5, 7.5, 6, 8]`).

Your task is to test whether exams significantly affect sleep hours.

### Hypotheses

- **H₀**: Exams have no effect on sleep hours; both groups come from the same distribution.
- **H₁**: Exams affect sleep hours; the two groups differ.

In [None]:
import numpy as np
from itertools import combinations

# Sleep hours data
group_A = [5.5, 5, 6.3, 8, 3]  # With exams
group_B = [7.5, 7.5, 7.5, 6, 8]  # Without exams

# TODO: Calculate the average sleep hours for group_A and group_B, then compute their difference
# Explanation: This difference in means will be our test statistic for the permutation test.
mean_A = ... # Your code here
mean_B = ... # Your code here
observed_diff = ... # Your code here
print(f"Observed difference in means: {observed_diff}")

# Combine the data
combined = np.array(group_A + group_B)
n = len(combined)
k = len(group_A)

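# Generate every way to split the n pooled values into a group of size k and the remainder
# (each tuple in `divisions` holds the indices of `combined` assigned to the first group)
divisions = list(combinations(range(n), k))

# TODO: Calculate the test statistic (difference in means) for each division
# Explanation: Under the null hypothesis every division is equally likely, so together they give the exact null distribution of the test statistic.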
test_stats = []
for div in divisions:
    diff_perm = ... # Your code here
    test_stats.append(diff_perm)

# TODO: Calculate the p-value
# Explanation: The p-value is the proportion of permutations where the test statistic is as extreme as or more extreme than the observed difference.
p_value = ... # Your code here
print(f"P-value: {p_value}")

### Alternative Test Statistic

Now, consider a different test statistic: the difference between the maximum of Group A and the minimum of Group B.

In [None]:
# TODO: Calculate the observed test statistic: max(group_A) - min(group_B)
# Explanation: This exercise explores how the choice of test statistic affects the outcome of the test.
observed_stat = ... # Your code here
print(f"Observed statistic (max(A) - min(B)): {observed_stat}")

# TODO: Calculate the test statistic for each permutation
test_stats_alt = []
for div in divisions:
    stat_perm = ... # Your code here
    test_stats_alt.append(stat_perm)

# TODO: Calculate the p-value for this test statistic
p_value_alt = ... # Your code here
print(f"Alternative p-value: {p_value_alt}")

### Theoretical Question

- **Task**: Discuss the role and importance of the test statistic in a permutation test. What makes a test statistic a poor choice, and when might it lead to misleading results? Write your answer in a markdown cell below.

**Answer Here**:

## Section 2: Chi-Square Test

### Overview

The Chi-Square test is used to determine whether observed frequencies in categorical data differ significantly from expected frequencies. It can be applied in two main scenarios:

- **Goodness of Fit**: To test if a dataset follows a specific distribution.
- **Test of Independence**: To check if two categorical variables are related.
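
Both scenarios can be sketched with SciPy on made-up counts. The die-roll frequencies and the 2×2 table below are hypothetical and unrelated to the homework dataset.

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Goodness of fit: hypothetical counts from 120 rolls of a die (made-up numbers)
observed_counts = np.array([18, 22, 16, 25, 19, 20])
# With no f_exp argument, chisquare compares against equal expected frequencies
gof_stat, gof_p = chisquare(observed_counts)
print(f"Goodness of fit: chi2 = {gof_stat:.3f}, p = {gof_p:.4f}")

# Independence: hypothetical 2x2 table of counts (rows: group, columns: outcome)
table = np.array([[30, 10],
                  [20, 40]])
# Returns the statistic, p-value, degrees of freedom, and expected counts;
# for 2x2 tables SciPy applies Yates' continuity correction by default
ind_stat, ind_p, dof, expected = chi2_contingency(table)
print(f"Independence: chi2 = {ind_stat:.3f}, p = {ind_p:.4f}, dof = {dof}")
print("Expected counts under independence:")
print(expected)
```

The expected counts are worth inspecting before trusting the result, since the chi-square approximation is usually considered unreliable when they are very small.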

### Use Case

You will work with the `Occupancy_Estimation` dataset, which includes sensor data (temperature, light, sound, CO₂, PIR) and room occupancy counts. Your tasks include testing if room occupancy is uniformly distributed and exploring relationships between variables.

### Hypotheses (Goodness of Fit Example)

- **H₀**: The room occupancy count follows a uniform distribution.
- **H₁**: The room occupancy count does not follow a uniform distribution.

### Implementation

*Note*: Ensure you have the `Occupancy_Estimation.csv` dataset available and adjust the file path as necessary.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency, chisquare

occupancy_df = pd.read_csv("Occupancy_Estimation.csv")  # Adjust path as needed
print(occupancy_df.head())

# TODO: Plot histograms with density curves for all numerical features
# Explanation: Visualizing the distributions helps in understanding the data's characteristics.
numerical_cols = ... # Your code here
for col in numerical_cols:
    sns.histplot(occupancy_df[col], kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()

### Theoretical Question

- **Task**: What type of distribution do the sound features follow?

**Answer Here**:

In [None]:
# TODO: Analyze the relationship between PIR sensor activity and room occupancy
# Explanation: Use a contingency table to check if PIR activity reliably indicates the presence of people.
... # Your code here
print(f"Chi-Square Statistic: {chi2}, P-value: {p}")

### Theoretical Question

- **Task**: Does PIR activity align with room occupancy?

**Answer Here**:

In [None]:
# TODO: Test if Room_Occupancy_Count follows a uniform distribution
# Explanation: Compare the observed frequencies of occupancy counts to expected frequencies under a uniform distribution.
... # Your code here
print(f"Chi-Square Statistic: {chi2_stat}, P-value: {p_val}")

### Theoretical Question

- **Task**: Is the room occupancy uniformly distributed?

**Answer Here**:

In [None]:
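# numeric_only=True ignores any non-numeric columns, which would otherwise raise an error in pandas >= 2.0
correlation_matrix = occupancy_df.corr(numeric_only=True)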
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

### Theoretical Question

- **Task**: Which feature has the strongest correlation with Room_Occupancy_Count?

**Answer Here**:

- **Task**: Based on the correlation, can we infer if S1_Sound and S2_Sound were installed in similar parts of the room?

**Answer Here**:

## Section 3: Z Test

### Overview

The Z test is a parametric test used when the population variance is known or when the sample size is large (a common rule of thumb is n ≥ 30). In the large-sample case, the Central Limit Theorem justifies treating the test statistic as approximately standard normal, which makes the test suitable for hypotheses about means or proportions.
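
A minimal sketch of a one-sample Z test from summary statistics follows. The numbers (sample mean 52, standard deviation 10, n = 100, hypothesized mean 50) are invented for illustration and do not come from the homework datasets.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical summary statistics (made-up numbers)
x_bar = 52.0     # sample mean
s = 10.0         # sample standard deviation
n_obs = 100      # sample size
mu_null = 50.0   # hypothesized mean under H0

std_err = s / np.sqrt(n_obs)           # standard error of the mean
z = (x_bar - mu_null) / std_err        # how many standard errors x_bar lies from mu_null
p_two_sided = 2 * norm.sf(abs(z))      # probability in both tails of the standard normal

print(f"z = {z:.3f}, two-sided p-value = {p_two_sided:.4f}")
```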

### Use Case

You will use two datasets:

- **StudentsPerformance.csv**: For one-sample (math scores) and two-sample (reading scores by gender) Z tests.
- **heart_cleveland_upload.csv**: For a proportion Z test (heart disease prevalence).

### Hypotheses (Examples)

- **One-Sample Z Test**: H₀: The mean math score is 70; H₁: The mean math score is not 70.
- **Two-Sample Z Test**: H₀: Male and female students have the same mean reading score; H₁: The mean reading scores differ.
- **Proportion Z Test**: H₀: The proportion of heart disease is 0.5; H₁: The proportion is not 0.5.
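
The two-sample and proportion variants follow the same pattern and differ only in how the standard error is formed. The sketch below uses hypothetical summary numbers, and all variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Two-sample Z test from hypothetical summary statistics (made-up numbers)
mean_1, sd_1, n_1 = 75.0, 8.0, 64
mean_2, sd_2, n_2 = 72.0, 6.0, 36
se_two = np.sqrt(sd_1**2 / n_1 + sd_2**2 / n_2)  # standard error of the difference in means
z_two = (mean_1 - mean_2) / se_two
p_two = 2 * norm.sf(abs(z_two))
print(f"Two-sample: z = {z_two:.3f}, p = {p_two:.4f}")

# One-proportion Z test: hypothetical 72 successes in 150 trials, H0 proportion 0.4
successes, trials, prop_null = 72, 150, 0.4
prop_hat = successes / trials
se_prop = np.sqrt(prop_null * (1 - prop_null) / trials)  # standard error under H0
z_prop = (prop_hat - prop_null) / se_prop
p_prop = 2 * norm.sf(abs(z_prop))
print(f"Proportion: z = {z_prop:.3f}, p = {p_prop:.4f}")
```

Note that the proportion test computes its standard error from the hypothesized proportion rather than the observed one, because the p-value is calculated under the assumption that H₀ holds.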

### Implementation

*Note*: Ensure you have the `StudentsPerformance.csv` and `heart_cleveland_upload.csv` datasets available and adjust the file paths as necessary.

In [None]:
import pandas as pd
import numpy as np
from scipy.stats import norm

# Load datasets
students_df = pd.read_csv("StudentsPerformance.csv")
heart_df = pd.read_csv("heart_cleveland_upload.csv")

# One-Sample Z Test (Math Scores)
mu0 = 70  # Hypothesized mean
sample_mean = students_df['math score'].mean()
sample_std = students_df['math score'].std()
n = len(students_df)

# TODO: Calculate the Z-score and p-value for the one-sample Z test
# Explanation: This test will determine if the mean math score is significantly different from 70.
se = ... # Your code here
z_score = ... # Your code here
p_value = ... # Your code here
print(f"One-Sample Z Test - Z-score: {z_score}, P-value: {p_value}")

### Theoretical Question

- **Task**: Based on the p-value, should we reject the null hypothesis?

**Answer Here**:

In [None]:
# Two-Sample Z Test (Reading Scores by Gender)
male_scores = students_df[students_df['gender'] == 'male']['reading score']
female_scores = students_df[students_df['gender'] == 'female']['reading score']

# TODO: Calculate the means, standard deviations, and Z-score for the two-sample Z test
# Explanation: This test compares the reading scores between male and female students.
mean_male = ... # Your code here
mean_female = ... # Your code here
std_male = ... # Your code here
std_female = ... # Your code here
n_male = ... # Your code here
n_female = ... # Your code here
se_diff = ... # Your code here
z_score = ... # Your code here
p_value = ... # Your code here
print(f"Two-Sample Z Test - Z-score: {z_score}, P-value: {p_value}")

### Theoretical Question

- **Task**: Is there a significant difference in reading scores between genders?

**Answer Here**:

In [None]:
# Proportion Z Test (Heart Disease)
p0 = 0.5  # Hypothesized proportion
n = len(heart_df)
p_hat = heart_df['condition'].mean()

# TODO: Calculate the Z-score and p-value for the proportion Z test
# Explanation: This test checks if the proportion of heart disease in the sample differs from 50%.
se = ... # Your code here
z_score = ... # Your code here
p_value = ... # Your code here
print(f"Proportion Z Test - Z-score: {z_score}, P-value: {p_value}")

### Theoretical Question

- **Task**: Is the proportion of heart disease significantly different from 50%?

**Answer Here**:

## Section 4: Fisher's Exact Test

### Overview

Fisher's Exact Test is used to determine if there is a significant association between two categorical variables in a 2×2 contingency table. It is particularly useful for small sample sizes, as it calculates exact p-values using the hypergeometric distribution.
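
To make the link to the hypergeometric distribution concrete, the sketch below computes a two-sided p-value by hand for a hypothetical 2×2 table (made-up counts, not the Titanic data) and compares it with `scipy.stats.fisher_exact`. The small tolerance factor in the manual sum is an assumption chosen to guard against floating-point ties.

```python
import numpy as np
from scipy.stats import fisher_exact, hypergeom

# Hypothetical 2x2 table (made-up counts)
table = np.array([[3, 1],
                  [1, 3]])

odds, p_scipy = fisher_exact(table, alternative="two-sided")

# With all margins fixed, the top-left cell follows a hypergeometric distribution
total = table.sum()
row1 = table[0].sum()
col1 = table[:, 0].sum()
k_min, k_max = max(0, row1 + col1 - total), min(row1, col1)
support = np.arange(k_min, k_max + 1)            # possible values of the top-left cell
pmf = hypergeom.pmf(support, total, col1, row1)  # probability of each possible table
p_obs = hypergeom.pmf(table[0, 0], total, col1, row1)

# Two-sided p-value: total probability of tables no more likely than the observed one
p_manual = pmf[pmf <= p_obs * (1 + 1e-7)].sum()

print(f"Odds ratio: {odds}, SciPy p-value: {p_scipy:.4f}, manual p-value: {p_manual:.4f}")
```

Because the p-value is a sum of exact probabilities rather than a large-sample approximation, it remains valid even when cell counts are very small.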

### Use Case

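You will use the Titanic dataset, loaded with seaborn's `load_dataset` (which fetches it online, so an internet connection may be needed), to test if survival is associated with gender or passenger class.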

### Hypotheses

- **H₀**: Survival is independent of gender/passenger class.
- **H₁**: Survival is associated with gender/passenger class.

In [None]:
import pandas as pd
import seaborn as sns
from scipy.stats import fisher_exact

# Load Titanic dataset
titanic_df = sns.load_dataset('titanic')

contingency_table = pd.crosstab(titanic_df['survived'], titanic_df['sex'])
print(contingency_table)

# TODO: Perform Fisher's Exact Test on the contingency table
# Explanation: This test will provide an exact p-value to assess the independence of survival and gender.
odds_ratio, p_value = ... # Your code here
print(f"Odds Ratio: {odds_ratio}, P-value: {p_value}")

### Theoretical Question

- **Task**: Is there a significant association between survival and gender?

**Answer Here**:

In [None]:
# TODO: Create a contingency table for survival vs. passenger class
# Note: Since passenger class has more than two categories, you may need to adjust the test or subset the data.
contingency_table_class = ... # Your code here
# The 2x2 form of Fisher's Exact Test used above does not apply directly to a 2x3 table. Consider using Chi-Square or subsetting to two classes.
# Optional: Subset to two classes, e.g., 1st and 3rd class
odds_ratio, p_value = ... # Your code here
print(f"Odds Ratio: {odds_ratio}, P-value: {p_value}")

### Theoretical Question

- **Task**: Is there a significant association between survival and passenger class?

**Answer Here**: