# California Housing Dataset Analysis
In this notebook, we will explore the California Housing dataset and perform hypothesis testing based on various assumptions about the data.
## Problem Statement
The main goal is to predict the median house value in California districts from features such as median income, median house age, average rooms and bedrooms per household, population, average occupancy, latitude, and longitude. Understanding how these features relate to house prices can inform real estate investment, urban planning, and policy-making.

The goal of the analysis here is to **understand key factors influencing house values in California** and **validate assumptions** about the relationships between different features and the target variable, which is `Median House Value` (target). We will also investigate whether certain population or geographic features significantly affect housing prices.
### Framework Overview
We'll follow this four-step framework:
1. **Understand the Problem**: Brief overview and key factors.
2. **Generate Hypotheses**: Establish assumptions before looking at the data.
3. **Test Hypotheses**: Apply various statistical tests on hypotheses.
4. **Summarize Results**: Draw conclusions and insights.

### Hypotheses
We will create **11 hypotheses** related to different features in the dataset. For each hypothesis, we will perform different statistical tests to validate or reject the hypothesis.
1. **H1**: Median house values are higher in areas with higher average income.
2. **H2**: Proximity to the ocean (roughly approximated by `Latitude` and `Longitude`) affects house prices.
3. **H3**: The house age is related to population size.
4. **H4**: Housing prices in inland areas (approximated here by low latitude) are lower than in coastal regions (approximated here by high latitude).
5. **H5**: Areas with more rooms per household (`AveRooms`) have higher median house values.
6. **H6**: The number of bedrooms per house is significantly different across geographical areas.
7. **H7**: Older houses (median house age) have lower prices than newer houses.
8. **H8**: High-income areas have a significantly lower population density.
9. **H9**: The average number of rooms per household is similar to the average number of bedrooms per household.
10. **H10**: There is no significant difference in housing prices between areas with high and low population densities.
11. **H11**: The distribution of house values is not normal (test for normality).

In [15]:
# Import all required libraries for this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_california_housing
from scipy import stats
from statsmodels.stats.weightstats import ztest
from statsmodels.formula.api import ols
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [16]:
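# Load the dataset and create a DataFrame
california_housing = fetch_california_housing(as_frame=True)
# Copy the feature frame so that adding a column does not modify the loaded dataset object
df = california_housing['data'].copy()
# Target is the median house value, expressed in units of $100,000
df['target'] = california_housing['target']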
# Display first few rows of the dataset
df.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


### Data Exploration
Now that we've seen the structure of the dataset, let's explore the data. This includes checking for missing values, understanding the data types, and visualizing key variables to get a sense of their distributions and relationships.

In [17]:
# Check for missing values (print the result; a notebook only displays the last expression of a cell)
print(df.isnull().sum())
# Display basic statistics of the dataset
df.describe()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
count,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0,20640.0
mean,3.870671,28.639486,5.429,1.096675,1425.476744,3.070655,35.631861,-119.569704,2.068558
std,1.899822,12.585558,2.474173,0.473911,1132.462122,10.38605,2.135952,2.003532,1.153956
min,0.4999,1.0,0.846154,0.333333,3.0,0.692308,32.54,-124.35,0.14999
25%,2.5634,18.0,4.440716,1.006079,787.0,2.429741,33.93,-121.8,1.196
50%,3.5348,29.0,5.229129,1.04878,1166.0,2.818116,34.26,-118.49,1.797
75%,4.74325,37.0,6.052381,1.099526,1725.0,3.282261,37.71,-118.01,2.64725
max,15.0001,52.0,141.909091,34.066667,35682.0,1243.333333,41.95,-114.31,5.00001
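
The missing-value and data-type checks described above can be sketched as follows. This is a minimal illustration on a small synthetic frame; `demo` and its values are assumptions made for the example, not rows from the housing data.

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for the housing DataFrame (hypothetical values).
demo = pd.DataFrame({
    "MedInc": [8.3, 7.2, np.nan, 3.8],
    "HouseAge": [41.0, 21.0, 52.0, 52.0],
    "target": [4.5, 3.6, 3.5, np.nan],
})

# Count missing values per column. Using print() makes the result visible
# even when this is not the last expression in a notebook cell.
missing_counts = demo.isnull().sum()
print(missing_counts)

# Inspect the data type of each column.
print(demo.dtypes)
```

On the real data the same two calls would be applied to `df` instead of `demo`.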


### Hypothesis Testing

In [18]:
# 1. Hypothesis 1: Income and House Value (Correlation Test)
# H1: Median house values are higher in areas with higher average income
# Pearson correlation test
correlation, p_value = stats.pearsonr(df['MedInc'], df['target'])
print(f"H1: Correlation between Median Income and House Value: {correlation}, p-value: {p_value}")

H1: Correlation between Median Income and House Value: 0.6880752079585478, p-value: 0.0
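
A p-value printed as `0.0` is floating-point underflow, not an exact zero. H1 is also a directional claim, so a one-sided p-value is arguably the better match. The sketch below shows one way to derive it from the two-sided result, alongside a rank-based Spearman check. It runs on synthetic data; `income` and `value` are illustrative stand-ins, not the notebook's columns.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for MedInc and target (illustrative values only).
rng = np.random.default_rng(0)
income = rng.normal(4.0, 2.0, size=500)
value = 0.4 * income + rng.normal(0.0, 1.0, size=500)

# Two-sided Pearson test, as in the cell above.
r, p_two_sided = stats.pearsonr(income, value)

# H1 is directional (r > 0), so convert to a one-sided p-value.
p_one_sided = p_two_sided / 2 if r > 0 else 1 - p_two_sided / 2

# Rank-based check that does not assume a linear relationship.
rho, p_spearman = stats.spearmanr(income, value)

print(f"Pearson r = {r:.3f}, one-sided p = {p_one_sided:.3g}")
print(f"Spearman rho = {rho:.3f}, p = {p_spearman:.3g}")
```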


In [19]:
# 2. Hypothesis 2: Latitude and House Prices (one-way ANOVA)
# H2: Proximity to the ocean affects house prices
# Note: latitude is only a rough proxy for ocean proximity; the two groups are a median split on Latitude
high_latitude = df[df['Latitude'] > df['Latitude'].median()]
low_latitude = df[df['Latitude'] <= df['Latitude'].median()]
f_value, p_value = stats.f_oneway(high_latitude['target'], low_latitude['target'])
print(f"H2: ANOVA test for house prices based on Latitude: F-value = {f_value}, p-value = {p_value}")

H2: ANOVA test for house prices based on Latitude: F-value = 379.3344781929213, p-value = 9.809919809332362e-84


In [20]:
# 3. Hypothesis 3: HouseAge and Population (Correlation Test)
# H3: House age is related to the population size
correlation, p_value = stats.pearsonr(df['HouseAge'], df['Population'])
print(f"H3: Correlation between HouseAge and Population: {correlation}, p-value: {p_value}")

H3: Correlation between HouseAge and Population: -0.29624423977353576, p-value: 0.0


In [21]:
# 4. Hypothesis 4: Latitude and House Prices (T-Test)
# H4: Inland areas (low latitude) have lower housing prices compared to coastal areas
# Note: a negative t-statistic means the first group (high latitude) has the lower mean
t_stat, p_value = stats.ttest_ind(high_latitude['target'], low_latitude['target'])
print(f"H4: T-Test for house prices between inland and coastal regions: t-stat = {t_stat}, p-value = {p_value}")

H4: T-Test for house prices between inland and coastal regions: t-stat = -19.476510934788138, p-value = 9.809919809418352e-84
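
The H2 and H4 cells test the same two groups, which is why their p-values agree: with exactly two groups, one-way ANOVA is equivalent to the pooled-variance t-test, and F equals t squared (here 19.48² ≈ 379.3). The sketch below checks this identity on synthetic data; `group_a` and `group_b` are hypothetical samples.

```python
import numpy as np
from scipy import stats

# Two synthetic groups with slightly different means (illustrative only).
rng = np.random.default_rng(1)
group_a = rng.normal(2.0, 1.0, size=300)
group_b = rng.normal(2.3, 1.0, size=320)

# Student's t-test (pooled variance, SciPy's default) and one-way ANOVA.
t_stat, p_t = stats.ttest_ind(group_a, group_b)
f_stat, p_f = stats.f_oneway(group_a, group_b)

print(f"t^2 = {t_stat**2:.6f}, F = {f_stat:.6f}")
print(f"p (t-test) = {p_t:.6g}, p (ANOVA) = {p_f:.6g}")

# The sign of t gives the direction, which ANOVA does not report.
direction = "lower" if t_stat < 0 else "higher"
print(f"The first group has the {direction} mean.")
```

Because the F statistic carries no sign, the direction of the difference has to be read from the t-statistic or from the group means.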


In [22]:
# 5. Hypothesis 5: Rooms and House Value (Correlation Test)
# H5: Areas with more rooms have higher house values (uses AveRooms, the average rooms per household)
correlation, p_value = stats.pearsonr(df['AveRooms'], df['target'])
print(f"H5: Correlation between Total Rooms and House Value: {correlation}, p-value = {p_value}")

H5: Correlation between Total Rooms and House Value: 0.1519482897414578, p-value = 7.569242134484702e-107


In [23]:
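# 6. Hypothesis 6: Bedrooms and Geographical Area (Chi-Square Test)
# H6: The number of bedrooms per house is significantly different across geographical areas
# Note: the contingency table uses every distinct Latitude value as its own row, so many cells are sparse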
df['bedroom_categories'] = pd.qcut(df['AveBedrms'], 4, labels=False)
chi2, p_value, _, _ = stats.chi2_contingency(pd.crosstab(df['Latitude'], df['bedroom_categories']))
print(f"H6: Chi-Square test for bedrooms across geographic regions: chi2 = {chi2}, p-value = {p_value}")

H6: Chi-Square test for bedrooms across geographic regions: chi2 = 3569.115655786648, p-value = 5.028885532397541e-35
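
The chi-square approximation is more reliable when expected cell counts are not tiny, so one option is to bin latitude into a few bands before cross-tabulating. The sketch below does this on synthetic data and also reports Cramér's V as an effect size. The band labels, the choice of four bands, and the `demo` frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-ins for Latitude and AveBedrms (illustrative values only).
rng = np.random.default_rng(2)
n = 2000
demo = pd.DataFrame({
    "Latitude": rng.uniform(32.5, 42.0, size=n),
    "AveBedrms": rng.lognormal(mean=0.05, sigma=0.2, size=n),
})

# Bin latitude into four equal-width bands instead of using raw values.
demo["lat_band"] = pd.cut(
    demo["Latitude"], bins=4,
    labels=["south", "mid-south", "mid-north", "north"],
)
# Quartiles of average bedrooms, as in the cell above.
demo["bedroom_quartile"] = pd.qcut(demo["AveBedrms"], 4, labels=False)

table = pd.crosstab(demo["lat_band"], demo["bedroom_quartile"])
chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Cramér's V: chi-square rescaled to the range [0, 1].
n_obs = table.to_numpy().sum()
cramers_v = np.sqrt(chi2 / (n_obs * (min(table.shape) - 1)))

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3g}, V = {cramers_v:.3f}")
print(f"smallest expected count = {expected.min():.1f}")
```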


In [24]:
# 7. Hypothesis 7: Age of House and Price (T-Test)
# H7: Older houses have lower prices than newer houses
older_houses = df[df['HouseAge'] > df['HouseAge'].median()]
newer_houses = df[df['HouseAge'] <= df['HouseAge'].median()]
t_stat, p_value = stats.ttest_ind(older_houses['target'], newer_houses['target'])
print(f"H7: T-Test for house prices between old and new houses: t-stat = {t_stat}, p-value = {p_value}")

H7: T-Test for house prices between old and new houses: t-stat = 9.86161086213435, p-value = 6.860977593834088e-23
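
With roughly 20,000 rows, even small mean differences produce very small p-values, so it helps to report an effect size next to the test. The sketch below computes Cohen's d with a pooled standard deviation. The helper `cohens_d` and the synthetic samples are hypothetical and not part of the original notebook.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference (a minus b) using the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Synthetic stand-ins for the older and newer house groups (illustrative only).
rng = np.random.default_rng(3)
older = rng.normal(2.15, 1.15, size=5000)
newer = rng.normal(2.00, 1.15, size=5000)

d = cohens_d(older, newer)
print(f"Cohen's d = {d:.3f}")
```

A positive d means the first group has the higher mean, matching the sign convention of the t-statistic.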


In [25]:
# 8. Hypothesis 8: Income and Population Density (Z-Test)
# H8: High-income areas have a lower population density
# Note: the dataset has no area column, so Population (a head count) stands in for density here
high_income = df[df['MedInc'] > df['MedInc'].median()]
low_income = df[df['MedInc'] <= df['MedInc'].median()]
z_stat, p_value = ztest(high_income['Population'], low_income['Population'])
print(f"H8: Z-Test for population density between high-income and low-income areas: z-stat = {z_stat}, p-value = {p_value}")

H8: Z-Test for population density between high-income and low-income areas: z-stat = 2.899178831828282, p-value = 0.003741414309536318


In [26]:
# 9. Hypothesis 9: Total Rooms and Bedrooms (Paired T-Test)
# H9: Total rooms and total bedrooms per household are similar
t_stat, p_value = stats.ttest_rel(df['AveRooms'], df['AveBedrms'])
print(f"H9: Paired T-Test for rooms and bedrooms: t-stat = {t_stat}, p-value = {p_value}")

H9: Paired T-Test for rooms and bedrooms: t-stat = 298.13492772054644, p-value = 0.0


In [27]:
# 10. Hypothesis 10: Population Density and House Prices (T-Test)
# H10: No significant difference in house prices between high and low population density areas
high_density = df[df['Population'] > df['Population'].median()]
low_density = df[df['Population'] <= df['Population'].median()]
t_stat, p_value = stats.ttest_ind(high_density['target'], low_density['target'])
print(f"H10: T-Test for house prices between high and low population density: t-stat = {t_stat}, p-value = {p_value}")

H10: T-Test for house prices between high and low population density: t-stat = -6.5800732407178115, p-value = 4.815009555150795e-11
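
Because H11 asks whether house values are normally distributed, a rank-based test is a reasonable cross-check on the t-test. The sketch below uses the Mann-Whitney U test on synthetic right-skewed samples; the sample names mirror the cell above, but the values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Right-skewed synthetic stand-ins for house values in the two groups.
rng = np.random.default_rng(4)
high_density = rng.lognormal(mean=0.5, sigma=0.5, size=400)
low_density = rng.lognormal(mean=0.7, sigma=0.5, size=400)

# Two-sided Mann-Whitney U test: compares the groups using ranks,
# so it does not rely on normally distributed values.
u_stat, p_value = stats.mannwhitneyu(high_density, low_density, alternative="two-sided")

print(f"U = {u_stat:.1f}, p = {p_value:.3g}")
```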


In [28]:
# 11. Hypothesis 11: Normality of House Value Distribution (Shapiro-Wilk Test)
# H11: Test if the house value distribution is normal
stat, p_value = stats.shapiro(df['target'])
print(f"H11: Shapiro-Wilk test for normality: stat = {stat}, p-value = {p_value}")

H11: Shapiro-Wilk test for normality: stat = 0.9122908296581661, p-value = 1.3673019915893023e-74
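
SciPy documents that Shapiro-Wilk p-values may be inaccurate for samples larger than 5,000, which applies to the 20,640 rows here. Two common workarounds are sketched below: the D'Agostino-Pearson test (`stats.normaltest`), and Shapiro-Wilk on a random subsample. The synthetic `values` array and the subsample size of 2,000 are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Right-skewed synthetic stand-in for the house value column (illustrative only).
rng = np.random.default_rng(5)
values = rng.lognormal(mean=0.5, sigma=0.6, size=20000)

# D'Agostino-Pearson test, based on skewness and kurtosis; suited to large samples.
k2_stat, p_normaltest = stats.normaltest(values)

# Shapiro-Wilk on a random subsample that stays below the 5,000-row limit.
subsample = rng.choice(values, size=2000, replace=False)
w_stat, p_shapiro = stats.shapiro(subsample)

print(f"normaltest: K2 = {k2_stat:.1f}, p = {p_normaltest:.3g}")
print(f"Shapiro-Wilk (n = {len(subsample)}): W = {w_stat:.4f}, p = {p_shapiro:.3g}")
```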


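
Running eleven tests on the same data raises the chance of at least one false positive, so a multiple-comparison correction is worth considering. The sketch below applies a Bonferroni adjustment to the p-values printed in the cells above (`0.0` denotes underflow). The significance level of 0.05 is an assumption consistent with the thresholds used in this notebook.

```python
# p-values copied from the test outputs above (0.0 is floating-point underflow).
p_values = {
    "H1": 0.0,
    "H2": 9.809919809332362e-84,
    "H3": 0.0,
    "H4": 9.809919809418352e-84,
    "H5": 7.569242134484702e-107,
    "H6": 5.028885532397541e-35,
    "H7": 6.860977593834088e-23,
    "H8": 0.003741414309536318,
    "H9": 0.0,
    "H10": 4.815009555150795e-11,
    "H11": 1.3673019915893023e-74,
}

alpha = 0.05
m = len(p_values)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
bonferroni = {name: min(p * m, 1.0) for name, p in p_values.items()}
significant = {name: adj < alpha for name, adj in bonferroni.items()}

for name, adj in bonferroni.items():
    print(f"{name}: adjusted p = {adj:.3g}, significant = {significant[name]}")
```

All eleven results stay below 0.05 after the adjustment; H8 is the closest to the threshold, at about 0.041.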


"""
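### Summary of Hypothesis Tests

1. **H1 (Income vs House Value)**: Supported. Median income and house value are positively correlated (r ≈ 0.69, p < 0.05; the printed p-value of 0.0 is numerical underflow).
2. **H2 (Latitude and House Prices)**: Mean house values differ significantly between the high- and low-latitude halves of the data (F ≈ 379.3, p < 0.05). Latitude is only a rough proxy for ocean proximity, so this shows a latitude effect rather than a coastal effect directly.
3. **H3 (HouseAge vs Population)**: Supported. There is a moderate negative correlation (r ≈ -0.30, p < 0.05): areas with older housing tend to have smaller populations.
4. **H4 (Inland vs Coastal House Prices)**: Not supported as stated. The difference is significant (p < 0.05), but the negative t-statistic (t ≈ -19.48) means the high-latitude group, labelled coastal in the hypothesis, has the lower mean house value.
5. **H5 (Rooms vs House Value)**: Supported, but the relationship is weak. Average rooms per household is positively correlated with house value (r ≈ 0.15, p < 0.05).
6. **H6 (Bedrooms across Geographic Areas)**: Supported. The Chi-Square test shows a significant association between latitude and bedroom category (chi2 ≈ 3569, p ≈ 5.0e-35), so the bedroom distribution does vary geographically. The table uses raw latitude values and is sparse, so the result should be read with caution.
7. **H7 (Age vs House Price)**: Not supported. The difference is significant (p < 0.05), but the positive t-statistic (t ≈ 9.86) means areas with older houses have higher mean values than areas with newer houses.
8. **H8 (Income vs Population Density)**: Not supported. The Z-test is significant (z ≈ 2.90, p ≈ 0.004), but the positive statistic means high-income areas have a higher mean `Population`. `Population` is a head count rather than a density, which further limits this conclusion.
9. **H9 (Rooms vs Bedrooms)**: Rejected. The paired t-test shows that average rooms per household differ significantly from average bedrooms per household (t ≈ 298.1, p < 0.05).
10. **H10 (Population Density vs House Prices)**: Rejected. House prices differ significantly between high- and low-population areas (t ≈ -6.58, p ≈ 4.8e-11), with the high-population group having the lower mean value.
11. **H11 (Normality Test)**: Supported. The Shapiro-Wilk test rejects normality (W ≈ 0.912, p ≈ 1.4e-74), indicating the distribution of house values is not normal. With more than 5,000 observations the Shapiro-Wilk p-value may be inaccurate, although the conclusion is unlikely to change at this magnitude.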
"""