## **SHETH L.U.J. & SIR M.V. COLLEGE**

#### Aayush D. Yadav | T123

### **Practical No. 04**

**Aim:** Hypothesis Testing
* Formulate null and alternative hypotheses for a given problem.
* Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
* Interpret the results and draw conclusions based on the test outcomes.

### **One-sample t-test to evaluate a hypothesis about the mean**

In [1]:
import pandas as pd
import scipy.stats as stats

# Load the dataset
df = pd.read_csv("Churn_Modelling.csv")

# Sample data: We will use the 'CreditScore' column
data = df['CreditScore']

print("First 10 Credit Scores : ", data.head(10).values)

# Define the null hypothesis
H0 = "The average Credit Score of customers is 650."

# Define the alternative hypothesis
H1 = "The average Credit Score of customers is not 650."

# Calculate the test statistic
# We perform a 1-sample t-test against the population mean of 650
t_stat, p_value = stats.ttest_1samp(data, 650)

# Print the results
print("Test statistic:", t_stat)
print("p-value:", p_value)

# Conclusion
if p_value < 0.05:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")

First 10 Credit Scores :  [619 608 502 699 850 645 822 376 501 684]
Test statistic: 0.5471101420384049
p-value: 0.5843152748160038
Fail to reject the null hypothesis.
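
The statistic returned by `ttest_1samp` can be reproduced by hand as t = (sample mean - hypothesised mean) / (s / sqrt(n)), with a two-sided p-value taken from the t distribution with n - 1 degrees of freedom. The sketch below is a minimal illustration on synthetic scores (an assumption for demonstration, not the `Churn_Modelling.csv` data); the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for a credit-score column (illustrative values only)
rng = np.random.default_rng(0)
scores = rng.normal(loc=650, scale=95, size=500)

mu0 = 650                          # hypothesised population mean
n = scores.size
sample_mean = scores.mean()
sample_std = scores.std(ddof=1)    # sample standard deviation (n - 1 denominator)

# t = (sample mean - mu0) / standard error of the mean
t_manual = (sample_mean - mu0) / (sample_std / np.sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# Compare with SciPy's built-in one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(scores, mu0)
print("manual:", t_manual, p_manual)
print("scipy :", t_scipy, p_scipy)
```

Both lines should print the same statistic and p-value up to floating-point rounding.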


### **Two-sample t-test**

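The independent-samples t-test compares the means of two independent groups. Here, we compare the **Credit Score** of **Male** and **Female** customers to see whether the difference between the two means is statistically significant.

**H0**: Mean Credit Score of Male and Female customers is equal.
**H1**: Mean Credit Score of Male and Female customers is not equal.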

In [2]:
from scipy.stats import ttest_ind
import numpy as np

# Select two independent groups
male_scores = df[df['Gender'] == 'Male']['CreditScore']
female_scores = df[df['Gender'] == 'Female']['CreditScore']

print("Male Credit Scores (first 5):", male_scores.head().values)
print("Female Credit Scores (first 5):", female_scores.head().values)

male_mean = np.mean(male_scores)
female_mean = np.mean(female_scores)

print("Male mean credit score:", male_mean)
print("Female mean credit score:", female_mean)

male_std = np.std(male_scores)
female_std = np.std(female_scores)

print("Male std deviation:", male_std)
print("Female std deviation:", female_std)

# Perform Independent t-test
ttest, pval = ttest_ind(male_scores, female_scores)
print("p-value", pval)

if pval < 0.05:
    print("We reject the null hypothesis")
else:
    # A large p-value is not proof that H0 is true; we only fail to reject it
    print("We fail to reject the null hypothesis")

Male Credit Scores (first 5): [645 822 501 684 528]
Female Credit Scores (first 5): [619 608 502 699 850]
Male mean credit score: 650.2768920652373
Female mean credit score: 650.831388950033
Male std deviation: 96.5408594899164
Female std deviation: 96.77669657167198
p-value 0.7751639097068665
We fail to reject the null hypothesis
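
By default `ttest_ind` pools the two variances, which assumes both groups share the same spread. When that assumption is doubtful, passing `equal_var=False` runs Welch's t-test instead. The sketch below is a hedged illustration on synthetic groups (not the Male/Female split from the dataset); the group names and parameters are made up for the example.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two synthetic groups with deliberately different spreads (illustrative only)
rng = np.random.default_rng(1)
group_a = rng.normal(650, 95, size=400)
group_b = rng.normal(650, 120, size=300)

# Student's t-test: pools the variances (equal_var=True is the default)
t_pooled, p_pooled = ttest_ind(group_a, group_b)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

print("pooled:", t_pooled, p_pooled)
print("welch :", t_welch, p_welch)
```

For groups with nearly identical standard deviations, as in the output above, the two variants should give very similar results.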


### **Paired-sample t-test**

The paired sample t-test checks for differences between two related variables. Since this dataset is a snapshot (not time-series), we will create a synthetic "Previous Year Credit Score" for demonstration purposes to simulate a paired scenario (e.g., comparing Current Credit Score vs Previous Year Score for the same customer).

**H0**: Mean difference between current and previous scores is 0.
**H1**: Mean difference is not 0.

In [3]:
import pandas as pd
from scipy import stats
import numpy as np

# Simulating a "Previous Credit Score" by adding random noise to the current score
# This is for demonstration as the dataset doesn't have paired time-series data
np.random.seed(42)
df['CreditScore_Prev'] = df['CreditScore'] + np.random.randint(-10, 10, size=len(df))

# Selecting a subset for clearer demonstration
subset_df = df.head(100)

print(subset_df[['CreditScore_Prev', 'CreditScore']].describe())

# Perform Paired t-test
ttest, pval = stats.ttest_rel(subset_df['CreditScore_Prev'], subset_df['CreditScore'])
print("p-value:", pval)

if pval < 0.05:
    print("Reject null hypothesis")
else:
    # A large p-value is not proof that H0 is true; we only fail to reject it
    print("Fail to reject null hypothesis")

       CreditScore_Prev  CreditScore
count        100.000000   100.000000
mean         637.860000   638.850000
std          113.680842   114.114188
min          376.000000   376.000000
25%          550.000000   549.750000
50%          642.000000   645.500000
75%          728.000000   730.500000
max          854.000000   850.000000
p-value: 0.07673522092703224
Fail to reject null hypothesis
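
A paired t-test is equivalent to a one-sample t-test on the per-customer differences against a mean of 0. The following sketch checks this on synthetic paired scores (an assumption for illustration; it does not reproduce the numbers above).

```python
import numpy as np
from scipy import stats

# Synthetic paired data: a "current" score and a noisy "previous" score
rng = np.random.default_rng(42)
current = rng.integers(350, 851, size=100)
# Upper bound is exclusive, so the noise lies in [-10, 10] and is centred on 0
previous = current + rng.integers(-10, 11, size=100)

# Paired t-test on the two related samples
t_rel, p_rel = stats.ttest_rel(previous, current)

# Same test expressed as a one-sample t-test on the differences
diff = previous - current
t_one, p_one = stats.ttest_1samp(diff, 0)

print("paired    :", t_rel, p_rel)
print("one-sample:", t_one, p_one)
```

Note that `np.random.randint(-10, 10)` in the cell above excludes the upper bound, so its noise ranges over -10 to 9 and has a mean of -0.5; the synthetic previous score is therefore slightly lower on average than the current one.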


### **When you can run a Z-test**

We use a Z-test because our sample size is large (N > 30). We will test whether the mean **Estimated Salary** is significantly different from 100,000.

**One-sample Z test**

In [4]:
import pandas as pd
from scipy import stats
from statsmodels.stats import weightstats as stests

# H0: The mean Estimated Salary is 100,000
ztest, pval = stests.ztest(df['EstimatedSalary'], x2=None, value=100000)

print(float(pval))

if pval < 0.05:
    print("Reject null hypothesis")
else:
    # A large p-value is not proof that H0 is true; we only fail to reject it
    print("Fail to reject null hypothesis")

0.8753155501546055
Fail to reject null hypothesis
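
The one-sample z statistic has the same form as the t statistic, but the p-value is taken from the standard normal distribution. Below is a minimal sketch using only NumPy and SciPy on synthetic salaries (illustrative values, not the dataset). It uses the sample standard deviation (`ddof=1`), which, to the best of my knowledge, matches the default of `statsmodels`' `ztest`; treat that as an assumption.

```python
import numpy as np
from scipy.stats import norm

# Synthetic salaries, roughly uniform between 0 and 200,000 (illustrative only)
rng = np.random.default_rng(7)
salaries = rng.uniform(0, 200_000, size=1_000)

mu0 = 100_000                                   # hypothesised mean
std_err = salaries.std(ddof=1) / np.sqrt(salaries.size)

# z = (sample mean - mu0) / standard error
z = (salaries.mean() - mu0) / std_err

# Two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))

print("z statistic:", z)
print("p-value:", p_value)
```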


### **Two-sample Z test**

Here we check two independent data groups. We will compare the **Balance** of customers in **France** vs **Germany**.

**H0**: Mean Balance of France and Germany groups is equal.
**H1**: Mean Balance is not equal.

In [5]:
# Selecting two independent groups
france_balance = df[df['Geography'] == 'France']['Balance']
germany_balance = df[df['Geography'] == 'Germany']['Balance']

# Perform Two-sample Z-test
ztest, pval1 = stests.ztest(france_balance, x2=germany_balance,
                            value=0, alternative='two-sided')

print(float(pval1))

if pval1 < 0.05:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

0.0
Reject null hypothesis
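
A printed p-value of exactly `0.0` most likely reflects floating-point underflow rather than a true probability of zero: for a very large |z|, the normal tail is smaller than the smallest representable double. The sketch below uses a hypothetical z value (not the statistic from the cell above) to show the effect and how the logarithm of the p-value stays finite.

```python
import numpy as np
from scipy.stats import norm

z = 40.0  # hypothetical, very large test statistic

# The two-sided p-value underflows to 0.0 in double precision
p = 2 * norm.sf(z)

# Working on the log scale keeps the value finite
log_p = np.log(2) + norm.logsf(z)

print("p-value:", p)
print("log(p-value):", log_p)
```

In such cases it is safer to report the result as "p < 0.001" (or similar) than as "p = 0".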


### **Chi-Square Test**

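The chi-square test of independence is applied to two categorical variables. We will determine whether there is a significant association between **Geography** and **Exited** (churn status).

**H0**: Geography and Exited are independent.
**H1**: Geography and Exited are associated.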

In [6]:
import pandas as pd
from scipy.stats import chi2_contingency

# Create a contingency table
contingency_table = pd.crosstab(df['Geography'], df['Exited'])
print("Contingency Table:\n", contingency_table)

# Perform the chi-square test
chi2_statistic, p_value, dof, expected_frequencies = chi2_contingency(contingency_table)

# Print the results
print('Chi-square statistic:', chi2_statistic)
print('P-value:', p_value)
print('Degrees of freedom:', dof)
print('Expected frequencies:')
print(expected_frequencies)

if p_value < 0.05:
    print("Reject null hypothesis (Variables are related)")
else:
    print("Fail to reject null hypothesis (no evidence that the variables are related)")

Contingency Table:
 Exited        0    1
Geography           
France     4204  810
Germany    1695  814
Spain      2064  413
Chi-square statistic: 301.25533682434536
P-value: 3.8303176053541544e-66
Degrees of freedom: 2
Expected frequencies:
[[3992.6482 1021.3518]
 [1997.9167  511.0833]
 [1972.4351  504.5649]]
Reject null hypothesis (Variables are related)
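
The expected frequencies and the chi-square statistic can be derived by hand from the contingency table printed above: each expected count is (row total × column total) / grand total, and the statistic sums (observed - expected)² / expected over all cells. The sketch below hard-codes the observed counts from that table; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts copied from the contingency table above
# Rows: France, Germany, Spain; columns: Exited = 0, Exited = 1
observed = np.array([[4204, 810],
                     [1695, 814],
                     [2064, 413]])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = row_totals @ col_totals / total

# Pearson chi-square statistic
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail p-value from the chi-square distribution
p_manual = chi2.sf(chi2_manual, dof)

print("Expected frequencies:\n", expected)
print("Chi-square statistic:", chi2_manual)
print("Degrees of freedom:", dof)
print("P-value:", p_manual)
```

These values should agree with the output of `chi2_contingency` above; no continuity correction is involved because the table has more than one degree of freedom.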
