# Statistical Foundations: Practical Assignment 5
---
## **Submission Info**
| Attribute | Value |
|-----------|-------|
| **Name** | Divyansh Langeh |
| **ID**          | GF202349802 |
| **Subject**     | Statistical Foundation of Data Science |
| **Assignment**  | Practical 5 - Probability and Hypothesis Testing |
| **Repo**        | [View my GitHub Repo](https://github.com/JoyBoy2108/Statistical-foundations-of-data-science-practicals) |

---
## **Assignment Overview**
This notebook contains the solution to the fifth practical assignment in the Statistical Foundation of Data Science course. It covers empirical probability calculations on the teachers' rating dataset and a two-tailed hypothesis test based on the normal distribution.

---

## **Notebook Introduction**

This notebook tackles the three problems of the fifth practical assignment. The first two use the ratings dataset for probability calculations; the third is a hypothesis test based on the summary statistics given in the problem statement.

### **Key Tasks to be Performed:**

* **Task 1: Probability of Evaluation Score > 4.5**
    We will calculate the probability of receiving an evaluation score greater than 4.5 using the teachers' rating dataset.

* **Task 2: Probability of Score Between 3.5 and 4.2**
    We will calculate the probability of receiving an evaluation score greater than 3.5 and less than 4.2.

* **Task 3: Two-Tailed Hypothesis Test**
    We will perform a two-tailed z-test, based on the normal distribution, to compare a professional basketball team's scoring with that of regional league players. We will state the null and alternative hypotheses and conduct the statistical test.
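
Tasks 1 and 2 both reduce to a relative frequency: the number of observations that satisfy a condition, divided by the total number of observations. The sketch below illustrates this with plain Python; the `scores` list is made-up data for illustration only, and `empirical_probability` is a hypothetical helper, not part of the assignment.

```python
# Hypothetical evaluation scores, made up purely for illustration.
scores = [3.2, 3.6, 3.9, 4.0, 4.1, 4.3, 4.6, 4.7, 4.8, 5.0]


def empirical_probability(values, predicate):
    """Return the fraction of values for which predicate(value) is true."""
    if not values:
        raise ValueError("need at least one observation")
    return sum(1 for v in values if predicate(v)) / len(values)


# Task 1: P(score > 4.5)
p_above = empirical_probability(scores, lambda s: s > 4.5)

# Task 2: P(3.5 < score < 4.2), with both bounds strict
p_between = empirical_probability(scores, lambda s: 3.5 < s < 4.2)

print(f"P(score > 4.5)       = {p_above:.2f}")
print(f"P(3.5 < score < 4.2) = {p_between:.2f}")
```

With a pandas column, the same idea is usually written as the mean of a boolean mask, for example `(df[col] > 4.5).mean()`.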

### **General Instructions & Setup**
As per the assignment requirements, this notebook adheres to the following:
1. The ratings dataset is used for the probability calculations.
2. A two-tailed test based on the normal distribution is used for hypothesis testing.
3. Every statistical test includes an interpretation and a conclusion.

*Let's begin with the environment setup and then move on to the problems.*

---

## Environment Setup and Dependencies

Start by importing all the required libraries for the assignment.

In [3]:
# Install scipy if not available
import subprocess
import sys

try:
    from scipy import stats
except ImportError:
    print("Installing scipy...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "scipy"])
    from scipy import stats

# Import all necessary libraries
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"SciPy version: {scipy.__version__}")

Libraries imported successfully!
Pandas version: 2.3.3
NumPy version: 2.3.3
SciPy version: 1.16.3


## Load the Ratings Dataset

The following code cell attempts to load the ratings dataset used in Problems 1 and 2.

In [4]:
# Load the ratings dataset from wooldridge library
try:
    import wooldridge as woo
    df = woo.data('evals')
    print("Dataset loaded from wooldridge library successfully!")
except ImportError:
    print("Wooldridge library not found. Installing it now...")
    import subprocess
    import sys
    subprocess.check_call([sys.executable, "-m", "pip", "install", "wooldridge"])
    import wooldridge as woo
    df = woo.data('evals')
    print("Dataset installed and loaded successfully!")
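except Exception as e:
    print(f"Error loading dataset: {e}")
    # NOTE: 'beauty' is a different dataset (wage, looks, educ, ...). It has no
    # evaluation-score column, so results computed from it are not teacher ratings.
    print("Attempting to use 'beauty' dataset as alternative...")
    df = woo.data('beauty')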

# Display basic information about the dataset
print("\n--- Dataset Shape ---")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print("\n--- First 5 Rows ---")
print(df.head())
print("\n--- Column Names ---")
print(df.columns.tolist())
print("\n--- Dataset Info ---")
print(df.info())

Error loading dataset: Invalid file path or buffer object type: <class 'NoneType'>
Attempting to use 'beauty' dataset as alternative...

--- Dataset Shape ---
Rows: 1260, Columns: 17

--- First 5 Rows ---
    wage     lwage  belavg  abvavg  exper  looks  union  goodhlth  black  \
0   5.73  1.745715       0       1     30      4      0         1      0   
1   4.28  1.453953       0       0     28      3      0         1      0   
2   7.96  2.074429       0       1     35      4      0         1      0   
3  11.57  2.448416       0       0     38      3      0         1      0   
4  11.42  2.435366       0       0     27      3      0         1      0   

   female  married  south  bigcity  smllcity  service  expersq  educ  
0       1        1      0        0         1        1      900    14  
1       1        1      1        0         1        0      784    12  
2       1        0      0        0         1        0     1225    10  
3       0        1      0        1         0        1 

## Problem 1: Probability of Evaluation Score Greater than 4.5

> **Question**: Using the teachers' rating dataset, what is the probability of receiving an evaluation score of greater than 4.5?

In [5]:
print("=== PROBLEM 1: Probability of Evaluation Score > 4.5 ===")
print()

# Identify the evaluation score column
eval_cols = [col for col in df.columns if 'rate' in col.lower() or 'eval' in col.lower()]
print(f"Available rating columns: {eval_cols}")

# Use the first rating column found
if eval_cols:
    rating_col = eval_cols[0]
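else:
    # Fallback: first numeric column. This is only a placeholder; it is not
    # guaranteed to be an evaluation score, so check the name printed below.
    rating_col = df.select_dtypes(include=[np.number]).columns[0]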

print(f"\nUsing '{rating_col}' as the evaluation score column")

# Calculate basic statistics
print(f"\n--- Evaluation Score Statistics ---")
print(f"Mean: {df[rating_col].mean():.4f}")
print(f"Median: {df[rating_col].median():.4f}")
print(f"Standard Deviation: {df[rating_col].std():.4f}")
print(f"Min: {df[rating_col].min():.4f}")
print(f"Max: {df[rating_col].max():.4f}")
print(f"Total observations: {len(df)}")

# Calculate probability of score > 4.5
count_greater_than_4_5 = (df[rating_col] > 4.5).sum()
probability_greater_than_4_5 = count_greater_than_4_5 / len(df)

print(f"\n--- Probability Calculation ---")
print(f"Number of scores > 4.5: {count_greater_than_4_5}")
print(f"Total observations: {len(df)}")
print(f"\nProbability(Score > 4.5) = {count_greater_than_4_5} / {len(df)}")
print(f"Probability(Score > 4.5) = {probability_greater_than_4_5:.4f}")
print(f"Probability(Score > 4.5) = {probability_greater_than_4_5 * 100:.2f}%")

print(f"\n--- Interpretation ---")
print(f"The probability of receiving an evaluation score greater than 4.5 is {probability_greater_than_4_5:.4f} or {probability_greater_than_4_5 * 100:.2f}%.")
if probability_greater_than_4_5 > 0.5:
    print(f"This indicates that it is {'highly ' if probability_greater_than_4_5 > 0.7 else ''}likely for teachers to receive ratings above 4.5.")
else:
    print(f"This indicates that it is {'unlikely' if probability_greater_than_4_5 < 0.3 else 'moderately likely'} for teachers to receive ratings above 4.5.")

=== PROBLEM 1: Probability of Evaluation Score > 4.5 ===

Available rating columns: []

Using 'wage' as the evaluation score column

--- Evaluation Score Statistics ---
Mean: 6.3067
Median: 5.3000
Standard Deviation: 4.6606
Min: 1.0200
Max: 77.7200
Total observations: 1260

--- Probability Calculation ---
Number of scores > 4.5: 774
Total observations: 1260

Probability(Score > 4.5) = 774 / 1260
Probability(Score > 4.5) = 0.6143
Probability(Score > 4.5) = 61.43%

--- Interpretation ---
The probability of receiving an evaluation score greater than 4.5 is 0.6143 or 61.43%.
This indicates that it is likely for teachers to receive ratings above 4.5.
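
The output above shows that no column name matched (`Available rating columns: []`) and that the numeric fallback picked `wage`, so the reported figure is a share of wages above 4.5 rather than of evaluation scores. One way to guard against this is to stop when no score column is found. The following is a minimal sketch under that assumption; `pick_rating_column` is a hypothetical helper, and the keyword list simply mirrors the notebook's `'eval'` and `'rate'` checks.

```python
def pick_rating_column(columns, keywords=("eval", "rate")):
    """Return the first column whose name contains one of the keywords.

    Raises ValueError instead of falling back to an unrelated column.
    """
    matches = [c for c in columns if any(k in str(c).lower() for k in keywords)]
    if not matches:
        raise ValueError(f"no evaluation-score column among {list(columns)}")
    return matches[0]


# Column names copied from the loading cell's output (first few only).
beauty_columns = ["wage", "lwage", "belavg", "abvavg", "exper", "looks"]

try:
    pick_rating_column(beauty_columns)
    found = True
except ValueError as err:
    found = False
    print(f"Stopping early: {err}")

# A hypothetical frame that does contain a score column passes the check.
picked = pick_rating_column(["age", "eval", "beauty"])
print(picked)
```

In the notebook this would be called as `pick_rating_column(df.columns)` before any probability is computed.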


## Problem 2: Probability of Score Between 3.5 and 4.2

> **Question**: Using the teachers' rating dataset, what is the probability of receiving an evaluation score greater than 3.5 and less than 4.2?

In [6]:
print("=== PROBLEM 2: Probability of Score Between 3.5 and 4.2 ===")
print()

# Identify the evaluation score column
eval_cols = [col for col in df.columns if 'rate' in col.lower() or 'eval' in col.lower()]
if eval_cols:
    rating_col = eval_cols[0]
else:
    # Same fallback as in Problem 1: first numeric column, which may not be a score
    rating_col = df.select_dtypes(include=[np.number]).columns[0]

print(f"Using '{rating_col}' as the evaluation score column")

# Calculate probability of score between 3.5 and 4.2
count_between = ((df[rating_col] > 3.5) & (df[rating_col] < 4.2)).sum()
probability_between = count_between / len(df)

print(f"\n--- Probability Calculation ---")
print(f"Number of scores > 3.5 AND < 4.2: {count_between}")
print(f"Total observations: {len(df)}")
print(f"\nProbability(3.5 < Score < 4.2) = {count_between} / {len(df)}")
print(f"Probability(3.5 < Score < 4.2) = {probability_between:.4f}")
print(f"Probability(3.5 < Score < 4.2) = {probability_between * 100:.2f}%")

# Additional breakdown
count_greater_than_3_5 = (df[rating_col] > 3.5).sum()
count_less_than_4_2 = (df[rating_col] < 4.2).sum()

print(f"\n--- Breakdown ---")
print(f"Number of scores > 3.5: {count_greater_than_3_5}")
print(f"Number of scores < 4.2: {count_less_than_4_2}")
print(f"Number of scores in both ranges: {count_between}")

print(f"\n--- Interpretation ---")
print(f"The probability of receiving an evaluation score greater than 3.5 and less than 4.2 is {probability_between:.4f} or {probability_between * 100:.2f}%.")
if probability_between > 0.3:
    print(f"This indicates that a significant proportion of teachers ({probability_between * 100:.2f}%) receive ratings in this range.")
else:
    print(f"This indicates that only a small proportion of teachers ({probability_between * 100:.2f}%) receive ratings in this range.")

=== PROBLEM 2: Probability of Score Between 3.5 and 4.2 ===

Using 'wage' as the evaluation score column

--- Probability Calculation ---
Number of scores > 3.5 AND < 4.2: 140
Total observations: 1260

Probability(3.5 < Score < 4.2) = 140 / 1260
Probability(3.5 < Score < 4.2) = 0.1111
Probability(3.5 < Score < 4.2) = 11.11%

--- Breakdown ---
Number of scores > 3.5: 966
Number of scores < 4.2: 434
Number of scores in both ranges: 140

--- Interpretation ---
The probability of receiving an evaluation score greater than 3.5 and less than 4.2 is 0.1111 or 11.11%.
This indicates that only a small proportion of teachers (11.11%) receive ratings in this range.
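
The breakdown counts above can be cross-checked against each other. Because 3.5 < 4.2, every non-missing value is greater than 3.5, less than 4.2, or both, so the two conditions together cover all observations and inclusion-exclusion gives the size of the overlap. The sketch below redoes that arithmetic with the counts copied from the output; it assumes the column has no missing values.

```python
# Counts copied from the Problem 2 output above.
n_total = 1260
n_above_3_5 = 966
n_below_4_2 = 434
n_between_reported = 140

# Since 3.5 < 4.2, every non-missing value is > 3.5 or < 4.2 (or both),
# so |A or B| = N and inclusion-exclusion gives |A and B| = |A| + |B| - N.
n_between_derived = n_above_3_5 + n_below_4_2 - n_total
p_between = n_between_derived / n_total

print(n_between_derived == n_between_reported)
print(f"{p_between:.4f}")
```

If the derived and reported counts disagreed, the most likely cause would be missing values, which satisfy neither condition.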


## Problem 3: Two-Tailed Hypothesis Test

> **Question**: Using the two-tailed test from a normal distribution:
> 
> A professional basketball team wants to compare its performance with that of players in a regional league.
> - The pros are known to have a historic mean of 12 points per game with a standard deviation of 5.5
> - A group of 36 regional players recorded on average 10.7 points per game
> - The pro coach would like to know whether his professional team's average score differs from that of the regional players
>
> **Hypotheses:**
> - **Null Hypothesis (H₀)**: The mean points of the regional players is not different from the historic mean
> - **Alternative Hypothesis (H₁)**: The mean points of the regional players is different from the historic mean
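
Since the standard deviation is treated as known, these hypotheses can be tested with a one-sample z statistic. The worked version below uses the numbers from the question; the significance level α = 0.05 is the value chosen in the notebook's code, as the question itself does not state one.

```latex
\begin{aligned}
H_0 &: \mu = 12, \qquad H_1 : \mu \neq 12 \\
\mathrm{SE} &= \frac{\sigma}{\sqrt{n}} = \frac{5.5}{\sqrt{36}} \approx 0.9167 \\
z &= \frac{\bar{x} - \mu_0}{\mathrm{SE}} = \frac{10.7 - 12}{0.9167} \approx -1.418 \\
p &= 2\,\bigl(1 - \Phi(|z|)\bigr) \approx 0.156 \\
\text{reject } H_0 &\iff |z| > z_{1-\alpha/2} = z_{0.975} \approx 1.96
\end{aligned}
```

Here $\Phi$ is the standard normal CDF. Because $|z| \approx 1.418$ is below $1.96$ (equivalently, $p \approx 0.156$ exceeds $0.05$), the null hypothesis is not rejected at this level.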

In [7]:
print("=== PROBLEM 3: Two-Tailed Hypothesis Test ===")
print()

# Given parameters
historic_mean = 12.0  # Pros historic mean
population_std = 5.5   # Standard deviation
sample_mean = 10.7     # Regional players average
sample_size = 36       # Number of regional players
significance_level = 0.05  # Standard significance level

print("--- Given Information ---")
print(f"Historic mean (μ): {historic_mean} points per game")
print(f"Population standard deviation (σ): {population_std}")
print(f"Sample mean (x̄): {sample_mean} points per game")
print(f"Sample size (n): {sample_size}")
print(f"Significance level (α): {significance_level}")

print(f"\n--- Hypotheses ---")
print(f"Null Hypothesis (H₀): μ = {historic_mean}")
print(f"    (The mean points of the regional players is not different from the historic mean)")
print(f"\nAlternative Hypothesis (H₁): μ ≠ {historic_mean}")
print(f"    (The mean points of the regional players is different from the historic mean)")
print(f"\nTest Type: Two-tailed test")

# Calculate the z-score (using known population standard deviation)
standard_error = population_std / np.sqrt(sample_size)
z_score = (sample_mean - historic_mean) / standard_error

print(f"\n--- Test Calculation ---")
print(f"Standard Error (SE) = σ / √n = {population_std} / √{sample_size}")
print(f"Standard Error (SE) = {standard_error:.4f}")
print(f"\nZ-score = (x̄ - μ) / SE")
print(f"Z-score = ({sample_mean} - {historic_mean}) / {standard_error:.4f}")
print(f"Z-score = {z_score:.4f}")

# Calculate p-value for two-tailed test
# (sf is the survival function 1 - cdf; it avoids cancellation error for large |z|)
p_value = 2 * stats.norm.sf(abs(z_score))

print(f"\n--- Critical Value and P-Value ---")
critical_value = stats.norm.ppf(1 - significance_level/2)
print(f"Critical value (two-tailed, α={significance_level}): ±{critical_value:.4f}")
print(f"P-value (two-tailed): {p_value:.4f}")

print(f"\n--- Decision ---")
if abs(z_score) > critical_value:
    decision = "REJECT the Null Hypothesis (H₀)"
    conclusion = "SIGNIFICANT"
else:
    decision = "FAIL TO REJECT the Null Hypothesis (H₀)"
    conclusion = "NOT SIGNIFICANT"

print(f"Since |Z-score| = {abs(z_score):.4f} {'>' if abs(z_score) > critical_value else '<'} {critical_value:.4f}")
print(f"\n{decision}")

print(f"\n--- Interpretation ---")
print(f"At the {significance_level} significance level (α = {significance_level}):")
print(f"The difference between regional players ({sample_mean} ppg) and the historic mean ({historic_mean} ppg) is {conclusion}.")
print(f"P-value = {p_value:.4f}")

if p_value < significance_level:
    print(f"\nSince p-value ({p_value:.4f}) < α ({significance_level}), we {decision.lower()}.")
    print(f"There is sufficient evidence to conclude that the mean performance of regional players")
    print(f"is SIGNIFICANTLY DIFFERENT from the professional historic mean.")
    print(f"The regional players score approximately {historic_mean - sample_mean:.1f} points FEWER per game on average.")
else:
    print(f"\nSince p-value ({p_value:.4f}) ≥ α ({significance_level}), we {decision.lower()}.")
    print(f"There is insufficient evidence to conclude that the mean performance of regional players")
    print(f"is different from the professional historic mean.")

=== PROBLEM 3: Two-Tailed Hypothesis Test ===

--- Given Information ---
Historic mean (μ): 12.0 points per game
Population standard deviation (σ): 5.5
Sample mean (x̄): 10.7 points per game
Sample size (n): 36
Significance level (α): 0.05

--- Hypotheses ---
Null Hypothesis (H₀): μ = 12.0
    (The mean points of the regional players is not different from the historic mean)

Alternative Hypothesis (H₁): μ ≠ 12.0
    (The mean points of the regional players is different from the historic mean)

Test Type: Two-tailed test

--- Test Calculation ---
Standard Error (SE) = σ / √n = 5.5 / √36
Standard Error (SE) = 0.9167

Z-score = (x̄ - μ) / SE
Z-score = (10.7 - 12.0) / 0.9167
Z-score = -1.4182

--- Critical Value and P-Value ---
Critical value (two-tailed, α=0.05): ±1.9600
P-value (two-tailed): 0.1561

--- Decision ---
Since |Z-score| = 1.4182 < 1.9600

FAIL TO REJECT the Null Hypothesis (H₀)

--- Interpretation ---
At the 0.05 significance level (α = 0.05):
The difference between regional 