# **SHETH L.U.J. & SIR M.V. COLLEGE**
**Shreeraj Desai | T075**
## **Practical No. 4**

**Aim** :- Hypothesis Testing
*  Formulate null and alternative hypotheses for a given problem.
*  Conduct a hypothesis test using appropriate statistical tests (e.g., t-test, chi-square test).
*  Interpret the results and draw conclusions based on the test outcomes.


# **Statistical Tests on the City Lifestyle Dataset**

## 1. Data Loading and Setup

In [9]:
import pandas as pd
import numpy as np
import scipy.stats as stats
from statsmodels.stats import weightstats as stests
import statsmodels.api as sm
from statsmodels.formula.api import ols

try:
    df = pd.read_csv('Datasets/city_lifestyle_dataset.csv')
    print("Dataset loaded successfully.")
    # Create a binary category for happiness score for Chi-Square test
    df['happiness_category'] = pd.cut(df['happiness_score'], bins=[0, 7.5, 10], labels=['Not Very Happy', 'Very Happy'])
except FileNotFoundError:
    print("Error: 'Datasets/city_lifestyle_dataset.csv' not found.")
    df = pd.DataFrame() # Create an empty DataFrame to avoid errors

df.head()

Dataset loaded successfully.


| | city_name | country | population_density | avg_income | internet_penetration | avg_rent | air_quality_index | public_transport_score | happiness_score | green_space_ratio | happiness_category |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Old Vista | Europe | 2775 | 3850 | 86.4 | 1310 | 43 | 52.0 | 8.5 | 23.8 | Very Happy |
| 1 | Beachport | Europe | 3861 | 3700 | 78.1 | 1330 | 42 | 62.8 | 8.1 | 33.1 | Very Happy |
| 2 | Valleyborough | Europe | 2562 | 4310 | 80.1 | 1330 | 39 | 73.2 | 8.5 | 40.2 | Very Happy |
| 3 | City | Europe | 3192 | 3970 | 81.2 | 1480 | 60 | 49.2 | 8.5 | 43.6 | Very Happy |
| 4 | Falls | Europe | 3496 | 4320 | 100.0 | 1510 | 64 | 93.7 | 8.5 | 42.5 | Very Happy |
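The `happiness_category` column depends on how `pd.cut` treats the bin edges. The sketch below uses a handful of hypothetical scores (not taken from the dataset) to show that, with the default `right=True`, the bins are `(0, 7.5]` and `(7.5, 10]`: a score of exactly 7.5 is labelled 'Not Very Happy', and a score of exactly 0 would fall outside both bins.

```python
import pandas as pd

# Hypothetical scores chosen to sit on and around the bin edges.
scores = pd.Series([0.0, 5.0, 7.5, 7.6, 10.0])

# Same bins and labels as in the loading cell above.
categories = pd.cut(
    scores,
    bins=[0, 7.5, 10],
    labels=['Not Very Happy', 'Very Happy'],
)

# Intervals are right-closed: (0, 7.5] and (7.5, 10].
# 7.5 lands in the first bin; 0.0 lies outside both and becomes NaN.
print(pd.DataFrame({'score': scores, 'category': categories}))
```

If a score of 0 should count as 'Not Very Happy', passing `include_lowest=True` to `pd.cut` would close the first interval on the left as well.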


## 2. One-Sample t-test

**Hypothesis:** Is the average happiness score of cities significantly different from a hypothesized value of 7.5?

In [10]:
if not df.empty:
    # Define the null and alternative hypotheses
    H0 = "The average happiness score is 7.5."
    H1 = "The average happiness score is not 7.5."

    # Extract the data for the test
    happiness_data = df['happiness_score']

    # Calculate the test statistic
    t_stat, p_value = stats.ttest_1samp(happiness_data, 7.5)

    # Print the results
    print("Hypothesis: Is the average happiness score different from 7.5?")
    print("Test statistic:", t_stat)
    print("p-value:", p_value)

    # Conclusion
    if p_value < 0.05:
        print("Conclusion: Reject the null hypothesis. The average happiness score is significantly different from 7.5.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

Hypothesis: Is the average happiness score different from 7.5?
Test statistic: -8.794513648649467
p-value: 1.1611502350289266e-16
Conclusion: Reject the null hypothesis. The average happiness score is significantly different from 7.5.
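The negative test statistic indicates that the sample mean lies below the hypothesized value of 7.5. To make clear what `stats.ttest_1samp` computes, the following sketch reproduces the statistic and the two-sided p-value by hand. The data are synthetic (drawn from an assumed normal distribution), so the numbers will not match the output above.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the happiness scores (an assumption, not the real data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=7.0, scale=1.0, size=50)
mu0 = 7.5  # hypothesized mean under H0

# t = (sample mean - mu0) / (sample std / sqrt(n)), with n - 1 degrees of freedom.
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - mu0) / standard_error

# Two-sided p-value: probability of a statistic at least this extreme under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The library call should agree with the manual calculation.
t_scipy, p_scipy = stats.ttest_1samp(sample, mu0)
print("manual:", t_manual, p_manual)
print("scipy: ", t_scipy, p_scipy)
```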


## 3. Two-Sample (Independent) t-test

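**Hypothesis:** Is there a significant difference in the average income between cities in 'Europe' and 'Asia'?

*   H0: The mean `avg_income` is the same for cities in Europe and Asia.
*   H1: The mean `avg_income` differs between cities in Europe and Asia.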

In [11]:
if not df.empty and 'Europe' in df['country'].unique() and 'Asia' in df['country'].unique():
    # Separate the data into two groups
    europe_income = df[df['country'] == 'Europe']['avg_income']
    asia_income = df[df['country'] == 'Asia']['avg_income']

    print("Europe avg_income mean:", europe_income.mean())
    print("Asia avg_income mean:", asia_income.mean())

    # Perform the independent t-test
    ttest, pval = stats.ttest_ind(europe_income, asia_income, nan_policy='omit')

    print("Hypothesis: Is there a difference in avg_income between Europe and Asia?")
    print("p-value:", pval)

    if pval < 0.05:
        print("Conclusion: We reject the null hypothesis. There is a significant difference.")
    else:
        print("Conclusion: We fail to reject the null hypothesis.")

Europe avg_income mean: 3436.8333333333335
Asia avg_income mean: 2478.625
Hypothesis: Is there a difference in avg_income between Europe and Asia?
p-value: 5.303039929971217e-15
Conclusion: We reject the null hypothesis. There is a significant difference.
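By default, `stats.ttest_ind` assumes the two groups share a common variance. When that assumption is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic income-like data; the group sizes, means, and spreads are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic groups with deliberately different spreads (hypothetical values).
rng = np.random.default_rng(1)
group_a = rng.normal(loc=3400, scale=600, size=60)
group_b = rng.normal(loc=2500, scale=300, size=80)

# Student's t-test (pooled variance) versus Welch's t-test (separate variances).
t_student, p_student = stats.ttest_ind(group_a, group_b)
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch's statistic by hand: difference in means over the unpooled standard error.
standard_error = np.sqrt(
    group_a.var(ddof=1) / group_a.size + group_b.var(ddof=1) / group_b.size
)
t_welch_manual = (group_a.mean() - group_b.mean()) / standard_error

print("Student:", t_student, p_student)
print("Welch:  ", t_welch, p_welch)
print("Welch (manual statistic):", t_welch_manual)
```

If both versions lead to the same decision, the equal-variance assumption is not driving the conclusion.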


## 4. Paired Sample t-test

**Hypothesis:** Is there a significant difference between the public transport score and the air quality index within the same cities? (Note: This example is illustrative only. The two variables are measured on different scales and are not a true 'before/after' pair, so the test demonstrates the mechanics of a paired comparison rather than a meaningful effect.)

*   H0: The mean difference between the two scores is 0.
*   H1: The mean difference between the two scores is not 0.

In [12]:
if not df.empty:
    print(df[['public_transport_score', 'air_quality_index']].describe())

    # Perform the paired t-test
    ttest, pval = stats.ttest_rel(df['public_transport_score'], df['air_quality_index'])

    print("Hypothesis: Is there a difference between public transport and air quality scores?")
    print("p-value:", pval)

    if pval < 0.05:
        print("Conclusion: Reject the null hypothesis; there is a significant difference.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

       public_transport_score  air_quality_index
count              300.000000         300.000000
mean                55.717333          71.246667
std                 14.712549          25.344961
min                 15.000000          22.000000
25%                 46.075000          54.000000
50%                 54.700000          67.500000
75%                 64.200000          86.000000
max                 95.000000         146.000000
Hypothesis: Is there a difference between public transport and air quality scores?
p-value: 5.161579130122333e-17
Conclusion: Reject the null hypothesis; there is a significant difference.
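A paired t-test is equivalent to a one-sample t-test on the per-row differences, tested against a mean of 0. The sketch below checks this on synthetic paired measurements (assumed values, not the dataset columns).

```python
import numpy as np
from scipy import stats

# Synthetic paired measurements: each index is one hypothetical city.
rng = np.random.default_rng(2)
first = rng.normal(loc=55, scale=15, size=40)
second = first + rng.normal(loc=15, scale=20, size=40)

# Paired t-test on the two columns.
t_paired, p_paired = stats.ttest_rel(first, second)

# One-sample t-test on the differences against a mean of 0.
t_diff, p_diff = stats.ttest_1samp(first - second, 0.0)

print("paired:      ", t_paired, p_paired)
print("differences: ", t_diff, p_diff)
```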


## 5. One-Sample Z-test

**Hypothesis:** Is the average population density of cities significantly different from a hypothesized value of 2500?

*   H0: The mean population density is 2500.
*   H1: The mean population density is not 2500.

In [13]:
if not df.empty and len(df) > 30:
    # Perform one-sample Z-test
    ztest, pval = stests.ztest(df['population_density'], x2=None, value=2500)

    print("Hypothesis: Is population density different from 2500?")
    print("p-value:", float(pval))

    if pval < 0.05:
        print("Conclusion: Reject the null hypothesis.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")
else:
    print("Dataset is empty or too small (n <= 30) for a Z-test. A t-test would be more appropriate.")

Hypothesis: Is population density different from 2500?
p-value: 4.838110571991089e-17
Conclusion: Reject the null hypothesis.
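The one-sample Z statistic is the distance between the sample mean and the hypothesized value, measured in standard errors, with the p-value taken from the standard normal distribution. A minimal standard-library sketch follows; `one_sample_z` is a hypothetical helper and the densities are made-up values. It uses the sample standard deviation (n - 1 in the denominator), which to my understanding matches the `ddof=1` default of statsmodels' `ztest`.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_z(data, value):
    """Return (z, two-sided p-value) for H0: population mean == value."""
    n = len(data)
    standard_error = stdev(data) / sqrt(n)  # stdev uses n - 1 in the denominator
    z = (mean(data) - value) / standard_error
    # Two-sided p-value from the lower tail of the standard normal.
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical population densities with a sample mean of exactly 2500.
densities = [2400, 2600, 2500, 2700, 2300, 2550, 2450]

z_same, p_same = one_sample_z(densities, 2500)
z_shift, p_shift = one_sample_z(densities, 2400)
print("value=2500:", z_same, p_same)
print("value=2400:", z_shift, p_shift)
```

With only seven values this sample is far too small for a Z-test in practice; it is kept short so the arithmetic can be followed by hand.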


## 6. Two-Sample Z-test

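**Hypothesis:** Is there a significant difference between the mean public transport score and the mean air quality index? (Note: the two-sample Z-test treats the two columns as independent samples, even though both are measured on the same cities.)

*   H0: The difference between the two group means is 0.
*   H1: The difference between the two group means is not 0.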

In [14]:
if not df.empty and len(df) > 30:
    ztest, pval1 = stests.ztest(df['public_transport_score'], x2=df['air_quality_index'], 
                                value=0, alternative='two-sided')
    
    print("Hypothesis: Is the mean of transport score different from air quality score?")
    print("p-value:", float(pval1))

    if pval1 < 0.05:
        print("Conclusion: Reject the null hypothesis.")
    else:
        print("Conclusion: Fail to reject the null hypothesis.")

Hypothesis: Is the mean of transport score different from air quality score?
p-value: 4.380739904162979e-20
Conclusion: Reject the null hypothesis.
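For two samples, the Z statistic divides the difference in means by a standard error built from both groups. The sketch below uses a pooled variance, which to my understanding is what statsmodels' `ztest` does by default (`usevar='pooled'`); treat that as an assumption rather than a statement about its internals. `two_sample_z` is a hypothetical helper and the data are made up.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_z(x1, x2, value=0.0):
    """Return (z, two-sided p-value) for H0: mean(x1) - mean(x2) == value."""
    n1, n2 = len(x1), len(x2)
    # Pooled sample variance across both groups.
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean(x1) - mean(x2) - value) / standard_error
    p = 2 * NormalDist().cdf(-abs(z))
    return z, p


# Hypothetical scores; group_b is group_a shifted upward by 15 points.
group_a = [50.0, 55.0, 60.0, 45.0, 65.0, 52.0, 58.0]
group_b = [score + 15.0 for score in group_a]

z_equal, p_equal = two_sample_z(group_a, group_a)
z_diff, p_diff = two_sample_z(group_a, group_b)
print("identical groups:", z_equal, p_equal)
print("shifted groups:  ", z_diff, p_diff)
```

A negative statistic means the first group's mean is below the second's, which is the direction seen in the `describe()` output of Section 4 (55.7 versus 71.2).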


## 7. Chi-Square Test

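**Hypothesis:** Is there a significant association between a city's region (stored in the `country` column) and its happiness category ('Very Happy' vs. 'Not Very Happy')?

*   H0: Region and happiness category are independent.
*   H1: Region and happiness category are associated.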

In [15]:
if not df.empty:
    # Create a contingency table
    contingency_table = pd.crosstab(df['country'], df['happiness_category'])

    print("Contingency Table:")
    print(contingency_table)
    print("\n")

    # Perform the chi-square test
    chi2_stat, p_value, dof, expected_freq = stats.chi2_contingency(contingency_table)

    # Print the results
    print('Chi-square statistic:', chi2_stat)
    print('P-value:', p_value)
    print('Degrees of freedom:', dof)
    print('Expected frequencies:\n', expected_freq)

Contingency Table:
happiness_category  Not Very Happy  Very Happy
country                                       
Africa                          35           0
Asia                            79           1
Europe                          18          42
North America                    9          41
Oceania                          1          34
South America                   40           0


Chi-square statistic: 208.05999388021178
P-value: 5.354204110315832e-43
Degrees of freedom: 5
Expected frequencies:
 [[21.23333333 13.76666667]
 [48.53333333 31.46666667]
 [36.4        23.6       ]
 [30.33333333 19.66666667]
 [21.23333333 13.76666667]
 [24.26666667 15.73333333]]
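The reported p-value (about 5.4e-43) is far below 0.05, so the null hypothesis of independence is rejected: happiness category is associated with region. As a cross-check, the sketch below recomputes the statistic from the contingency table printed above using only the standard library, confirms that every expected count exceeds the usual rule-of-thumb minimum of 5, and adds Cramér's V as an effect-size measure (an addition not present in the original analysis).

```python
from math import sqrt

# Observed counts copied from the contingency table above:
# (Not Very Happy, Very Happy) per region.
observed = {
    'Africa': (35, 0),
    'Asia': (79, 1),
    'Europe': (18, 42),
    'North America': (9, 41),
    'Oceania': (1, 34),
    'South America': (40, 0),
}

rows = list(observed.values())
row_totals = [sum(row) for row in rows]
col_totals = [sum(col) for col in zip(*rows)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi2 = 0.0
min_expected = float('inf')
for row, row_total in zip(rows, row_totals):
    for count, col_total in zip(row, col_totals):
        expected = row_total * col_total / n
        chi2 += (count - expected) ** 2 / expected
        min_expected = min(min_expected, expected)

dof = (len(rows) - 1) * (len(col_totals) - 1)

# Cramér's V rescales chi-square to the range [0, 1].
cramers_v = sqrt(chi2 / (n * min(len(rows) - 1, len(col_totals) - 1)))

print("Chi-square statistic:", chi2)
print("Degrees of freedom:", dof)
print("Smallest expected count:", min_expected)
print("Cramér's V:", cramers_v)
```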
