# Practical 4
AIM: Hypothesis Testing
*   Formulate null and alternative hypotheses for a given problem.
*   Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
*   Interpret the results and draw conclusions based on the test outcomes.
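
Every test in this practical ends with the same decision rule: compare the p-value with a chosen significance level. A minimal sketch of that rule follows. The helper name `decide` is hypothetical and does not appear in the notebook cells.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the conventional decision for a hypothesis test.

    `decide` is a hypothetical helper used for illustration only.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # A small p-value means the data would be unlikely if H0 were true.
    if p_value < alpha:
        return "Reject H0"
    # A large p-value is not evidence that H0 is true; it only means
    # the data do not contradict H0 at this significance level.
    return "Fail to reject H0"


print(decide(0.003))
print(decide(0.656))
```

The wording "fail to reject" is deliberate: a non-significant result does not show that the null hypothesis is true.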


**1. One-sample t-test: does the mean score differ from a hypothesized value?**

In [2]:
import pandas as pd
import numpy as np
import scipy.stats as stats

# Load the dataset
df = pd.read_csv('WORLD UNIVERSITY RANKINGS.csv')

# --- Data Preparation ---
# Use the 'Score' column as the single sample data
data = df['Score']

# Define the hypothesized population mean (popmean)
hypothesized_mean = 70.0

# Define the null hypothesis
H0 = f"The average university score is {hypothesized_mean}."

# Define the alternative hypothesis (two-tailed test: not equal to)
H1 = f"The average university score is not {hypothesized_mean}."

# Print the data summary for reference
print(f"Dataset Size (n): {len(data)}")
print(f"Sample Mean Score: {data.mean():.3f}")

# --- Calculate the Test Statistic ---
# stats.ttest_1samp performs a two-sided test by default.
t_stat, p_value = stats.ttest_1samp(data, popmean=hypothesized_mean)

# Print the results
print("\n--- One-Sample t-Test Results ---")
print("Null Hypothesis:", H0)
print("Alternative Hypothesis:", H1)
print(f"Test statistic (t): {t_stat:.3f}")
print(f"p-value: {p_value:.5f}")

# --- Conclusion ---
alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion (at alpha={alpha}): Reject the null hypothesis.")
    print(f"The average university score is statistically different from {hypothesized_mean}.")
else:
    print(f"\nConclusion (at alpha={alpha}): Fail to reject the null hypothesis.")
    print(f"The average university score is not statistically different from {hypothesized_mean}.")

Dataset Size (n): 2000
Sample Mean Score: 71.586

--- One-Sample t-Test Results ---
Null Hypothesis: The average university score is 70.0.
Alternative Hypothesis: The average university score is not 70.0.
Test statistic (t): 13.967
p-value: 0.00000

Conclusion (at alpha=0.05): Reject the null hypothesis.
The average university score is statistically different from 70.0.
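
A p-value printed as `0.00000` is only rounded to zero, and with n = 2000 even a modest difference becomes highly significant. The sketch below recomputes the p-value in scientific notation and derives an effect size from the reported statistics. It assumes only the values printed above (n, t, sample mean); choosing Cohen's d as the effect-size measure is an addition, not part of the original cell.

```python
import math
from scipy import stats

# Values copied from the printed output above.
n = 2000
t_stat = 13.967
sample_mean = 71.586
hypothesized_mean = 70.0

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
print(f"p-value: {p_value:.3e}")

# For a one-sample test, Cohen's d = (mean - mu0) / s = t / sqrt(n).
cohens_d = t_stat / math.sqrt(n)
print(f"Cohen's d: {cohens_d:.3f}")

# Sample standard deviation implied by the reported statistics.
implied_sd = (sample_mean - hypothesized_mean) / cohens_d
print(f"Implied sample std dev: {implied_sd:.3f}")
```

Reporting the effect size alongside the p-value separates "statistically detectable" from "practically large".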


**2. Two-sample t-test**

In [3]:
import pandas as pd
from scipy.stats import ttest_ind
import numpy as np

# Load the dataset
df = pd.read_csv("WORLD UNIVERSITY RANKINGS.csv")

# --- Data Preparation for Two Independent Samples ---
# We will compare the 'Score' of universities in the USA against those in the United Kingdom.

# Filter data for Group 1: USA
group1_name = "USA"
group1_scores = df[df['Location'] == group1_name]['Score'].values

# Filter data for Group 2: United Kingdom
group2_name = "United Kingdom"
group2_scores = df[df['Location'] == group2_name]['Score'].values

# --- Display Data and Statistics ---
print(f"Data for Group 1 ({group1_name}, n={len(group1_scores)}):")
print(group1_scores[:5], "...")
print(f"\nData for Group 2 ({group2_name}, n={len(group2_scores)}):")
print(group2_scores[:5], "...")

group1_mean = np.mean(group1_scores)
group2_mean = np.mean(group2_scores)
print("\nMean Score for {}: {:.3f}".format(group1_name, group1_mean))
print("Mean Score for {}: {:.3f}".format(group2_name, group2_mean))

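# Note: np.std defaults to ddof=0 (population formula); pass ddof=1 for the sample standard deviation.
group1_std = np.std(group1_scores)
group2_std = np.std(group2_scores)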
print("Std Dev Score for {}: {:.3f}".format(group1_name, group1_std))
print("Std Dev Score for {}: {:.3f}".format(group2_name, group2_std))

# --- Perform Independent Samples t-Test ---
# Note: ttest_ind assumes equal population variances by default (equal_var=True).
# Null Hypothesis (H0): The mean scores for USA and UK universities are equal.
# Alternative Hypothesis (H1): The mean scores are not equal.
ttest, pval = ttest_ind(group1_scores, group2_scores)

print("\n--- Independent Samples t-Test Results ---")
print("t-statistic:", ttest)
print("p-value:", pval)

# --- Conclusion ---
alpha = 0.05
if pval < alpha:
    print(f"\nWe reject the null hypothesis (since p < {alpha}).")
    print("Conclusion: There is a statistically significant difference in mean scores between US and UK universities.")
else:
    print(f"\nWe fail to reject the null hypothesis (since p >= {alpha}).")
    print("Conclusion: There is no statistically significant difference in mean scores between US and UK universities.")

Data for Group 1 (USA, n=335):
[100.   96.7  95.1  92.6  92. ] ...

Data for Group 2 (United Kingdom, n=94):
[94.1 93.3 88.  86.6 85.5] ...

Mean Score for USA: 74.019
Mean Score for United Kingdom: 73.678
Std Dev Score for USA: 6.651
Std Dev Score for United Kingdom: 6.098

--- Independent Samples t-Test Results ---
t-statistic: 0.4462791940671699
p-value: 0.6556218076683412

We fail to reject the null hypothesis (since p >= 0.05).
Conclusion: There is no statistically significant difference in mean scores between US and UK universities.
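
The two groups differ considerably in size (335 vs. 94), so it is worth checking whether the conclusion depends on the equal-variance assumption. The sketch below reruns the comparison from the printed summary statistics alone, once with pooled variance and once as Welch's test. It assumes the rounded means and standard deviations shown above, so the results can differ slightly from the full-data test.

```python
import math
from scipy.stats import ttest_ind_from_stats

# Summary statistics copied from the printed output above.
n_usa, mean_usa, std0_usa = 335, 74.019, 6.651
n_uk, mean_uk, std0_uk = 94, 73.678, 6.098

# The output used np.std with ddof=0; rescale to the sample (ddof=1) value.
std_usa = std0_usa * math.sqrt(n_usa / (n_usa - 1))
std_uk = std0_uk * math.sqrt(n_uk / (n_uk - 1))

# Pooled-variance test: should approximately reproduce the reported t of 0.446.
pooled = ttest_ind_from_stats(mean_usa, std_usa, n_usa,
                              mean_uk, std_uk, n_uk, equal_var=True)

# Welch's test: does not assume equal variances.
welch = ttest_ind_from_stats(mean_usa, std_usa, n_usa,
                             mean_uk, std_uk, n_uk, equal_var=False)

print(f"Pooled: t={pooled.statistic:.3f}, p={pooled.pvalue:.3f}")
print(f"Welch:  t={welch.statistic:.3f}, p={welch.pvalue:.3f}")
```

If both variants lead to the same decision, the equal-variance assumption is not driving the result.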


**3. Paired-sample t-test**

In [4]:
import pandas as pd
from scipy import stats
import numpy as np

# Load the dataset
df = pd.read_csv("WORLD UNIVERSITY RANKINGS.csv")

# --- Data Preparation for Paired t-Test ---

# 1. Identify the two paired columns: Education Rank vs. Employability Rank
col1_name = 'Education Rank'
col2_name = 'Employability Rank'

# 2. Clean the columns: replace '-' with NaN and convert to numeric.
# errors='coerce' turns any remaining non-numeric entries into NaN.
df[col1_name] = pd.to_numeric(df[col1_name].replace('-', np.nan), errors='coerce')
df[col2_name] = pd.to_numeric(df[col2_name].replace('-', np.nan), errors='coerce')

# 3. Drop rows with missing values to ensure the samples are perfectly paired.
df_paired = df.dropna(subset=[col1_name, col2_name]).copy()

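# Select the paired data (the before/after names are generic: these are two rankings of the same university, not repeated measurements)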
rank_before = df_paired[col1_name]
rank_after = df_paired[col2_name]

# --- Descriptive Statistics ---
print("--- Paired Samples Descriptive Statistics ---")
print(df_paired[[col1_name, col2_name]].describe())
print(f"\nNumber of paired observations used (n): {len(df_paired)}")

# --- Paired Samples t-Test ---
# Null Hypothesis (H0): The mean Education Rank is equal to the mean Employability Rank (Mean difference = 0).
# Alternative Hypothesis (H1): The mean ranks are not equal (Mean difference != 0).
ttest, pval = stats.ttest_rel(rank_before, rank_after)

print("\n--- Paired Samples t-Test Results ---")
print(f"p-value: {pval:.5f}")

# --- Conclusion ---
alpha = 0.05
if pval < alpha:
    print(f"\nConclusion (at alpha={alpha}): Reject the null hypothesis.")
    print("There is a statistically significant difference between the mean Education Rank and the mean Employability Rank.")
else:
    print(f"\nConclusion (at alpha={alpha}): Fail to reject the null hypothesis.")
    print("There is no statistically significant difference between the mean Education Rank and the mean Employability Rank.")

--- Paired Samples Descriptive Statistics ---
       Education Rank  Employability Rank
count      355.000000          355.000000
mean       268.814085          530.935211
std        161.449540          424.664838
min          1.000000            1.000000
25%        120.500000          176.000000
50%        281.000000          440.000000
75%        412.500000          783.500000
max        536.000000         1653.000000

Number of paired observations used (n): 355

--- Paired Samples t-Test Results ---
p-value: 0.00000

Conclusion (at alpha=0.05): Reject the null hypothesis.
There is a statistically significant difference between the mean Education Rank and the mean Employability Rank.
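
A paired t-test is equivalent to a one-sample t-test on the per-pair differences against zero. The sketch below illustrates that equivalence on synthetic data (not the rankings dataset). It also runs a Wilcoxon signed-rank test, included here as an assumption about what may suit rank-valued columns better, since ranks are ordinal and their differences need not be normally distributed.

```python
import numpy as np
from scipy import stats

# Synthetic paired data, for illustration only (not the rankings dataset).
rng = np.random.default_rng(0)
first = rng.normal(loc=50.0, scale=10.0, size=40)
second = first + rng.normal(loc=2.0, scale=5.0, size=40)

# Paired t-test on the two columns.
paired = stats.ttest_rel(first, second)

# One-sample t-test on the per-pair differences against zero.
one_sample = stats.ttest_1samp(first - second, popmean=0.0)

print(f"ttest_rel:   t={paired.statistic:.4f}, p={paired.pvalue:.4f}")
print(f"ttest_1samp: t={one_sample.statistic:.4f}, p={one_sample.pvalue:.4f}")

# A rank-based alternative that does not assume normally distributed differences.
wilcoxon = stats.wilcoxon(first, second)
print(f"Wilcoxon signed-rank p-value: {wilcoxon.pvalue:.4f}")
```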


**4. One-sample Z-test**

In [5]:
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests

# Load the dataset
df = pd.read_csv('WORLD UNIVERSITY RANKINGS.csv')

# --- Data Preparation ---
# Use the 'Score' column as the sample data
data = df['Score']

# Define the hypothesized mean for the test
hypothesized_mean = 70.0

# --- One-Sample Z-Test ---
# x2=None specifies this is a One-Sample test.
# value=70.0 is the hypothesized population mean (popmean).
ztest, pval = stests.ztest(data, x2=None, value=hypothesized_mean)

# Print the results
print("--- One-Sample Z-Test Results ---")
print(f"Sample Mean Score: {data.mean():.3f}")
print(f"Z-statistic: {ztest:.3f}")
print(f"p-value: {float(pval):.5f}")

# --- Conclusion ---
alpha = 0.05
if pval < alpha:
    print(f"\nConclusion (at alpha={alpha}): reject null hypothesis.")
    print(f"The average university score is statistically different from {hypothesized_mean}.")
else:
    print(f"\nConclusion (at alpha={alpha}): fail to reject the null hypothesis.")
    print(f"The average university score is not statistically different from {hypothesized_mean}.")

--- One-Sample Z-Test Results ---
Sample Mean Score: 71.586
Z-statistic: 13.967
p-value: 0.00000

Conclusion (at alpha=0.05): reject null hypothesis.
The average university score is statistically different from 70.0.
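
A confidence interval conveys the same decision as the test while also showing how far the mean plausibly lies from 70. The sketch below recovers the standard error from the reported Z-statistic and builds a 95% interval using the normal critical value. It uses only the rounded values printed above, so the interval is approximate.

```python
from scipy.stats import norm

# Values copied from the printed output above.
sample_mean = 71.586
hypothesized_mean = 70.0
z_stat = 13.967

# The statistic is (mean - mu0) / SE, so the standard error can be recovered.
standard_error = (sample_mean - hypothesized_mean) / z_stat

# 95% interval: mean +/- z_crit * SE, with z_crit roughly 1.96.
z_crit = norm.ppf(0.975)
lower = sample_mean - z_crit * standard_error
upper = sample_mean + z_crit * standard_error

print(f"Standard error: {standard_error:.4f}")
print(f"95% CI for the mean score: ({lower:.3f}, {upper:.3f})")
print(f"Interval contains {hypothesized_mean}: {lower <= hypothesized_mean <= upper}")
```

An interval that excludes the hypothesized mean corresponds to rejecting the null hypothesis at the matching significance level.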


**5. Two-sample Z test**

In [6]:
import pandas as pd
from statsmodels.stats import weightstats as stests

# Load the dataset
df = pd.read_csv('WORLD UNIVERSITY RANKINGS.csv')

# --- Data Preparation for Two Independent Samples ---
# Sample 1: USA Scores
usa_scores = df[df['Location'] == 'USA']['Score']

# Sample 2: United Kingdom Scores
uk_scores = df[df['Location'] == 'United Kingdom']['Score']

# --- Two-Sample Z-Test ---
# Null Hypothesis (H0): Mean Score(USA) - Mean Score(UK) = 0
ztest, pval1 = stests.ztest(usa_scores, x2=uk_scores,
                             value=0, alternative='two-sided')

# Print the results
print("--- Two-Sample Z-Test Results (USA Scores vs. UK Scores) ---")
print(f"Sample Mean (USA): {usa_scores.mean():.3f} (n={len(usa_scores)})")
print(f"Sample Mean (UK): {uk_scores.mean():.3f} (n={len(uk_scores)})")
print(f"Z-statistic: {ztest:.3f}")
print(f"p-value: {float(pval1):.5f}")

# --- Conclusion ---
alpha = 0.05
if pval1 < alpha:
    print(f"\nConclusion (at alpha={alpha}): reject null hypothesis.")
else:
    print(f"\nConclusion (at alpha={alpha}): fail to reject the null hypothesis.")

--- Two-Sample Z-Test Results (USA Scores vs. UK Scores) ---
Sample Mean (USA): 74.019 (n=335)
Sample Mean (UK): 73.678 (n=94)
Z-statistic: 0.446
p-value: 0.65540

Conclusion (at alpha=0.05): fail to reject the null hypothesis.
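
The two-sample t-test in section 2 and this Z-test report the same statistic (0.446) but slightly different p-values (0.65562 vs. 0.65540). That pattern is what one would expect if both use the same pooled statistic and differ only in the reference distribution, which is an assumption about the library defaults rather than something the outputs state. The sketch below checks it using the statistic and sample sizes printed above.

```python
from scipy.stats import norm, t

# Statistic and sample sizes copied from the printed outputs above.
statistic = 0.4462791940671699
n_usa, n_uk = 335, 94
dof = n_usa + n_uk - 2  # degrees of freedom for the pooled t-test

# Same statistic, two reference distributions.
p_from_t = 2 * t.sf(abs(statistic), df=dof)
p_from_normal = 2 * norm.sf(abs(statistic))

print(f"p-value from t({dof}): {p_from_t:.5f}")
print(f"p-value from N(0, 1): {p_from_normal:.5f}")
```

With samples this large the t distribution is nearly normal, so the choice between the two tests barely changes the p-value.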


**6. Chi-Square Test**

In [7]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency

# Load the dataset
df = pd.read_csv('WORLD UNIVERSITY RANKINGS.csv')

# --- Data Preparation for Chi-Square Test ---
# 1. Define Factor 1: Location (Filter to Top 5 countries)
top_5_locations = df['Location'].value_counts().head(5).index.tolist()
df_filtered = df[df['Location'].isin(top_5_locations)].copy()
factor1 = df_filtered['Location']

# 2. Define Factor 2: Binned World Rank (Categorical)
bins = [0, 100, 500, np.inf]
labels = ['Top 100', '101-500', '501+']
factor2 = pd.cut(
    df_filtered['World Rank'],
    bins=bins,
    labels=labels,
    right=True,
    include_lowest=True
)

# --- Chi-Square Test of Independence ---
# Create a contingency table (Observed Frequencies)
contingency_table = pd.crosstab(factor1, factor2)

print("--- Observed Frequencies (Contingency Table) ---")
print(contingency_table)

# Perform the chi-square test
chi2_statistic, p_value, dof, expected_frequencies = chi2_contingency(contingency_table)

# Print the results
print('\n--- Chi-Square Test Results ---')
print(f'Chi-square statistic: {chi2_statistic:.3f}')
print(f'P-value: {p_value:.5f}')
print(f'Degrees of freedom: {dof}')

# Print expected frequencies for validation
print('\nExpected frequencies:')
print(pd.DataFrame(expected_frequencies, index=contingency_table.index, columns=contingency_table.columns).round(1))

# --- Conclusion ---
alpha = 0.05
if p_value < alpha:
    print(f"\nConclusion (at alpha={alpha}): Reject the null hypothesis.")
    print("There is a statistically significant relationship between the university's location (top 5) and its World Rank Group.")
else:
    print(f"\nConclusion (at alpha={alpha}): Fail to reject the null hypothesis.")
    print("There is no statistically significant relationship between the university's location (top 5) and its World Rank Group.")

--- Observed Frequencies (Contingency Table) ---
World Rank      Top 100  101-500  501+
Location                              
China                 5       56   241
France                5       18    54
Japan                 4       10   104
USA                  50       86   199
United Kingdom        9       27    58

--- Chi-Square Test Results ---
Chi-square statistic: 70.991
P-value: 0.00000
Degrees of freedom: 8

Expected frequencies:
World Rank      Top 100  101-500   501+
Location                               
China              23.8     64.2  213.9
France              6.1     16.4   54.5
Japan               9.3     25.1   83.6
USA                26.4     71.3  237.3
United Kingdom      7.4     20.0   66.6

Conclusion (at alpha=0.05): Reject the null hypothesis.
There is a statistically significant relationship between the university's location (top 5) and its World Rank Group.
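
A significant chi-square result says that an association exists, not how strong it is. The sketch below rebuilds the test from the observed counts printed above, checks the common rule of thumb that every expected count is at least 5, and computes Cramér's V as an effect size. Using Cramér's V is an addition to the original analysis.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts copied from the contingency table printed above.
# Rows: China, France, Japan, USA, United Kingdom.
# Columns: Top 100, 101-500, 501+.
observed = np.array([
    [5, 56, 241],
    [5, 18, 54],
    [4, 10, 104],
    [50, 86, 199],
    [9, 27, 58],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

# Rule-of-thumb validity check: expected counts of at least 5 in every cell.
min_expected = expected.min()

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 to 1.
n = observed.sum()
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))

print(f"Chi-square: {chi2:.3f}, dof: {dof}, p-value: {p_value:.3e}")
print(f"Smallest expected count: {min_expected:.1f}")
print(f"Cramér's V: {cramers_v:.3f}")
```

Printing the p-value in scientific notation avoids the misleading `0.00000` shown in the rounded output.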
