In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
calihouse_data = pd.read_csv('california_housing.csv')

# Exploratory Data Analysis
First, let's check the shape and size of the data set!

In [None]:
calihouse_data.shape

In [None]:
calihouse_data.size

Wow! This is one extensive data set. Let's do a deeper dive and get some information about the data types.

In [None]:
calihouse_data.info()

It seems like there are no null values, but we should check for duplicate rows!

In [None]:
calihouse_data.duplicated().sum()

YIPPEE! We do not have any duplicated rows, but let's double check whether we need to do any data type conversions.

In [None]:
calihouse_data.head()

In [None]:
calihouse_data.tail()

In [None]:
print("Median Income Column:", calihouse_data['MedInc'].unique())
print("House Age Column:", calihouse_data['HouseAge'].unique())
print("Average Rooms Column:", calihouse_data['AveRooms'].unique())
print("Average Bedrooms Column:", calihouse_data['AveBedrms'].unique())
print("Population Column:", calihouse_data['Population'].unique())
print("Average Occupancy Column:", calihouse_data['AveOccup'].unique())
print("Latitude Column:", calihouse_data['Latitude'].unique())
print("Longitude Column:", calihouse_data['Longitude'].unique())
print("Price Above Median Column:", calihouse_data['price_above_median'].unique())

In [None]:
print("Median Income Column:", calihouse_data['MedInc'].nunique())
print("House Age Column:", calihouse_data['HouseAge'].nunique())
print("Average Rooms Column:", calihouse_data['AveRooms'].nunique())
print("Average Bedrooms Column:", calihouse_data['AveBedrms'].nunique())
print("Population Column:", calihouse_data['Population'].nunique())
print("Average Occupancy Column:", calihouse_data['AveOccup'].nunique())
print("Latitude Column:", calihouse_data['Latitude'].nunique())
print("Longitude Column:", calihouse_data['Longitude'].nunique())
print("Price Above Median Column:", calihouse_data['price_above_median'].nunique())

It looks like there is no need for any data type conversions! Every column is float64 except `price_above_median`, which is int64, and the values reflect that! Let's try to find any anomalies in the data by running through some summary statistics!

In [None]:
print("Median Income Column Median:", calihouse_data['MedInc'].median())
print("House Age Column Median:", calihouse_data['HouseAge'].median())
print("Average Rooms Column Median:", calihouse_data['AveRooms'].median())
print("Average Bedrooms Column Median:", calihouse_data['AveBedrms'].median())
print("Population Column Median:", calihouse_data['Population'].median())
print("Average Occupancy Column Median:", calihouse_data['AveOccup'].median())
print("Longitude Column Median:", calihouse_data['Longitude'].median())
print("Latitude Column Median:", calihouse_data['Latitude'].median())

In [None]:
print("Median Income Column mean:", calihouse_data['MedInc'].mean())
print("House Age Column mean:", calihouse_data['HouseAge'].mean())
print("Average Rooms Column mean:", calihouse_data['AveRooms'].mean())
print("Average Bedrooms Column mean:", calihouse_data['AveBedrms'].mean())
print("Population Column mean:", calihouse_data['Population'].mean())
print("Average Occupancy Column mean:", calihouse_data['AveOccup'].mean())
print("Longitude Column mean:", calihouse_data['Longitude'].mean())
print("Latitude Column mean:", calihouse_data['Latitude'].mean())

In [None]:
print("Median Income Column min:", calihouse_data['MedInc'].min())
print("House Age Column min:", calihouse_data['HouseAge'].min())
print("Average Rooms Column min:", calihouse_data['AveRooms'].min())
print("Average Bedrooms Column min:", calihouse_data['AveBedrms'].min())
print("Population Column min:", calihouse_data['Population'].min())
print("Average Occupancy Column min:", calihouse_data['AveOccup'].min())
print("Longitude Column min:", calihouse_data['Longitude'].min())
print("Latitude Column min:", calihouse_data['Latitude'].min())

In [None]:
print("Median Income Column max:", calihouse_data['MedInc'].max())
print("House Age Column max:", calihouse_data['HouseAge'].max())
print("Average Rooms Column max:", calihouse_data['AveRooms'].max())
print("Average Bedrooms Column max:", calihouse_data['AveBedrms'].max())
print("Population Column max:", calihouse_data['Population'].max())
print("Average Occupancy Column max:", calihouse_data['AveOccup'].max())
print("Longitude Column max:", calihouse_data['Longitude'].max())
print("Latitude Column max:", calihouse_data['Latitude'].max())

In [None]:
print("Median Income Column standard deviation:", calihouse_data['MedInc'].std())
print("House Age Column standard deviation:", calihouse_data['HouseAge'].std())
print("Average Rooms Column standard deviation:", calihouse_data['AveRooms'].std())
print("Average Bedrooms Column standard deviation:", calihouse_data['AveBedrms'].std())
print("Population Column standard deviation:", calihouse_data['Population'].std())
print("Average Occupancy Column standard deviation:", calihouse_data['AveOccup'].std())
print("Longitude Column standard deviation:", calihouse_data['Longitude'].std())
print("Latitude Column standard deviation:", calihouse_data['Latitude'].std())

Phew! Okay, we ran a lot of summary statistics on our data set, so let's analyze what each column is telling us!

1) Median Income (MedInc): The median and mean are fairly close, so the bulk of the distribution is not heavily distorted. However, the max value of 15.00 is far above the mean of 3.87, which points to a long right tail.
2) House Age (HouseAge): The oldest houses are 52 years old, and the youngest is 1 year old. The distribution seems reasonable.
3) Average Rooms (AveRooms): The max value of 141.91 is an anomaly—this suggests an outlier where a block has an abnormally high average number of rooms.
4) Average Bedrooms (AveBedrms): The max value of 34.07 bedrooms per unit is extremely high and likely an anomaly.
5) Population: The max value 35,682 suggests a highly populated block, which is significantly above the mean (1,425).
6) Average Occupancy (AveOccup): The max of 1,243.33 seems like an outlier since the mean is only 3.07. This suggests that some data points may have incorrect values.
7) Longitude & Latitude: These values align with California’s geographical bounds.
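The per-column prints above can also be collapsed into one table, which makes the "max far above the typical value" pattern easier to spot. Here is a minimal sketch, assuming a tiny made-up frame called `toy` that stands in for `calihouse_data` (the numbers are illustrative only, not taken from the real file):

```python
import pandas as pd

# Hypothetical stand-in for calihouse_data; values are made up for illustration.
toy = pd.DataFrame({
    "MedInc": [2.5, 3.1, 3.9, 4.4, 15.0],
    "AveOccup": [2.8, 3.0, 3.1, 3.3, 1243.3],
})

# One table holding the same statistics the cells above print one by one.
summary = toy.agg(["median", "mean", "min", "max", "std"]).T

# A max that sits far above the median is a cheap first flag for outliers.
summary["max_over_median"] = summary["max"] / summary["median"]
print(summary)
```

On the real data the same two lines would run on `calihouse_data.drop(columns=['price_above_median'])`.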

One thing that we haven't analyzed yet is our `price_above_median` column, because it is the only dependent variable in the whole dataset. So, let's do some analysis on this variable to better understand what is happening with this column!

In [None]:
print(calihouse_data['price_above_median'].value_counts()) #this is to check if the data is balanced between houses priced above and below the median
print(calihouse_data['price_above_median'].value_counts(normalize=True)) #this gives us the percentage of houses priced above and below the median
print(calihouse_data.groupby('price_above_median').mean()) #this will tell me the average median income, house age, number of rooms etc, for each group 
print(calihouse_data.corr()['price_above_median'].sort_values(ascending=False)) #this will tell which factors are most strongly correlated with the price category (correlation, not proof of influence)

Okay, so what does this tell us?

1) The dataset is perfectly balanced! 50% of the houses are above the median price and 50% of the houses are below the median price. This means we can take a breath of relief because we do not have to worry about class imbalance. This also means that this dataset is great for predictive modeling!

2) Some key differences between the below and above median price houses:
    - Median Income (MedInc): Income has a strong positive relationship with house price. Houses in higher-income areas are much more likely to be priced above the median.
    - House Age (HouseAge): Slightly older homes tend to be priced higher, but the difference isn’t large.
    - Average Rooms (AveRooms): Homes with more rooms tend to be above the median price. This makes sense because larger houses are usually more expensive.
    - Average Bedrooms (AveBedrms): Slightly fewer bedrooms in more expensive homes. This might indicate that larger houses with fewer, more spacious rooms are more valuable.
    - Population (Population): Population size does not seem to be a strong factor in determining house price.
    - Average Occupancy (AveOccup): More expensive homes tend to have fewer occupants per household, possibly indicating larger houses with more space per person.
    - Latitude (Latitude): Higher-priced houses tend to be slightly further south, suggesting that more expensive homes may be located in urban or coastal regions.
    - Longitude (Longitude): Expensive homes are slightly further west, potentially closer to coastal areas.
  
3) Some potential anomalies and key insights are:
    - Average bedrooms being lower for higher-priced houses is unexpected but might make sense if expensive homes have more open space per room.
    - Population is nearly identical for both price categories, meaning it likely has little impact on house prices.
    - Latitude and Longitude differences suggest a geographic influence on pricing, possibly due to proximity to the coast or urban centers.
  
4) Let's interpret the correlation values:
    - Strongest Positive Correlation:
        - MedInc (High correlation): Higher median income strongly predicts a house being above the median price.
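        - Latitude: Check the sign here. A positive coefficient would mean homes further north (around the Bay Area) tend to be more expensive, while a negative one would match the group means above, where higher-priced houses sit slightly further south.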
    - Strongest Negative Correlation:
        - AveOccup (Negative correlation): Indicates higher occupancy per home is associated with lower-priced houses, possibly reflecting more crowded living conditions in lower-income areas.
        - Longitude (Negative correlation): Suggests homes further west (closer to Los Angeles and San Francisco) are generally higher-priced. 
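One sanity check for the interpretation above: the Pearson correlation between a feature and a 0/1 target always has the same sign as the difference between the two group means, so the `groupby` output and the `corr` output cannot disagree about direction. A small sketch on made-up numbers (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: the above-median rows sit at lower latitudes.
toy = pd.DataFrame({
    "price_above_median": [0, 0, 0, 1, 1, 1],
    "Latitude": [37.8, 38.5, 36.9, 34.1, 33.9, 37.0],
})

# Pearson correlation against a binary column (the point-biserial correlation).
corr = toy["Latitude"].corr(toy["price_above_median"])

# Difference in group means: class 1 minus class 0.
means = toy.groupby("price_above_median")["Latitude"].mean()
diff = means.loc[1] - means.loc[0]

print(f"correlation: {corr:.3f}, mean difference: {diff:.3f}")
```

If the two numbers ever show opposite signs on the real data, one of the readings was copied down wrong.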

# Univariate Analysis
Now that we have introduced ourselves to the dataset a bit, let's visualize the dataset to get a better understanding of what we are dealing with here.

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(12, 10))
plt.subplot(4, 2, 1)
sns.histplot(calihouse_data['MedInc'], bins=30, kde=True, color="red")
plt.title('Distribution of MedInc')

plt.subplot(4, 2, 2)
sns.histplot(calihouse_data['HouseAge'], bins=30, kde=True, color="orange")
plt.title('Distribution of HouseAge')

plt.subplot(4, 2, 3)
sns.histplot(calihouse_data['AveRooms'], bins=30, kde=True, color="gold")
plt.title('Distribution of AveRooms')

plt.subplot(4, 2, 4)
sns.histplot(calihouse_data['AveBedrms'], bins=30, kde=True, color="yellowgreen")
plt.title('Distribution of AveBedrms')

plt.subplot(4, 2, 5)
sns.histplot(calihouse_data['Population'], bins=30, kde=True, color="blue")
plt.title('Distribution of Population')

plt.subplot(4, 2, 6)
sns.histplot(calihouse_data['AveOccup'], bins=30, kde=True, color="indigo")
plt.title('Distribution of AveOccup')

plt.subplot(4, 2, 7)
sns.histplot(calihouse_data['Latitude'], bins=30, kde=True, color="orchid")
plt.title('Distribution of Latitude')

plt.subplot(4, 2, 8)
sns.histplot(calihouse_data['Longitude'], bins=30, kde=True, color="hotpink")
plt.title('Distribution of Longitude')

plt.tight_layout()
plt.show()

This is quite interesting!

The distributions of `MedInc`, `AveRooms`, `AveBedrms`, `Population`, and `AveOccup` are all right skewed. Let's look into each of these graphs:
1) MedInc (Median Income): Right-skewed distribution, meaning most households have lower incomes, with a few having significantly higher values.
2) AveRooms & AveBedrms: Right-skewed, with a majority of houses having fewer rooms and bedrooms.
3) Population: Strong right-skew, indicating that most block groups have lower populations, with a few significantly larger ones.
4) AveOccup (Average Occupancy): Highly skewed, suggesting that most households have lower occupancy rates, with extreme values present.
   
The rest of the graphs look fairly uniform! But, let's look into them:

5) HouseAge: Fairly uniform distribution, but some peaks around newer housing developments.
6) Latitude & Longitude: These show a relatively uniform spread, reflecting the geographical distribution of houses across California.
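Skew can also be put into numbers rather than read off the histograms. The sketch below uses a synthetic log-normal column as a stand-in for something like `Population` (the data and the name `Population_like` are assumptions, not values from the dataset) and shows how a log transform pulls in a long right tail:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed column, standing in for a feature such as Population.
skewed = pd.Series(rng.lognormal(mean=7.0, sigma=0.8, size=5000), name="Population_like")

raw_skew = skewed.skew()            # clearly positive for a right-skewed column
log_skew = np.log1p(skewed).skew()  # much closer to 0 after the log transform

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

Running `calihouse_data.skew()` would give the same kind of number for every real column at once.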

In [None]:
sns.set_style("whitegrid")

plt.figure(figsize=(12, 10))
plt.subplot(4, 2, 1)
sns.boxplot(y=calihouse_data['MedInc'], color="red")
plt.title('Box Plot of MedInc')

plt.subplot(4, 2, 2)
sns.boxplot(y=calihouse_data['HouseAge'], color="orange")
plt.title('Box Plot of HouseAge')

plt.subplot(4, 2, 3)
sns.boxplot(y=calihouse_data['AveRooms'], color="gold")
plt.title('Box Plot of AveRooms')

plt.subplot(4, 2, 4)
sns.boxplot(y=calihouse_data['AveBedrms'], color="yellowgreen")
plt.title('Box Plot of AveBedrms')

plt.subplot(4, 2, 5)
sns.boxplot(y=calihouse_data['Population'], color="blue")
plt.title('Box Plot of Population')

plt.subplot(4, 2, 6)
sns.boxplot(y=calihouse_data['AveOccup'], color="indigo")
plt.title('Box Plot of AveOccup')

plt.subplot(4, 2, 7)
sns.boxplot(y=calihouse_data['Latitude'], color="orchid")
plt.title('Box Plot of Latitude')

plt.subplot(4, 2, 8)
sns.boxplot(y=calihouse_data['Longitude'], color="hotpink")
plt.title('Box Plot of Longitude')

plt.tight_layout()
plt.show()

Here are some interesting insights and correlations I found with the box plots:
1) MedInc & HouseAge: Outliers present in high-income households and very old houses.
2) Population & AveOccup: These show extreme outliers, indicating some block groups have exceptionally high population density.
3) AveRooms & AveBedrms: Outliers suggest that some houses have an unusually high number of rooms and bedrooms.
4) Latitude & Longitude: Most houses are concentrated in Central and Southern California.
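The points drawn beyond the whiskers in a box plot follow the usual 1.5 × IQR rule, so the outliers can be counted instead of eyeballed. A minimal sketch, assuming a helper named `iqr_outlier_mask` (hypothetical, not part of pandas or seaborn) and a made-up series:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical values; the last one mimics an extreme average-rooms entry.
toy = pd.Series([2.0, 2.5, 3.0, 3.2, 3.5, 4.0, 141.9], name="AveRooms_like")

mask = iqr_outlier_mask(toy)
print(toy[mask])
```

Applying the helper column by column to the real frame would give an outlier count for each box plot above.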

In [None]:
sns.boxplot(x='price_above_median', y='MedInc', data=calihouse_data,  color="lavender")
plt.title("Median Income vs. Price Category")
plt.show()

We can see from the plot above: 
1) Houses priced above the median (price_above_median = 1) tend to be located in areas with higher median incomes, as the distribution of median income is shifted upward for higher-priced homes. This suggests a strong positive correlation between income levels and home prices—wealthier areas generally have more expensive homes.
2) On the other hand, the interquartile range (IQR) for lower-priced houses (price_above_median = 0) is wider, indicating greater variation in household income within these neighborhoods. This could suggest that lower-priced housing is spread across a broader spectrum of income groups, encompassing both low-income and some middle-income areas.
3) There may be outliers in the higher price category, which could represent exceptionally wealthy neighborhoods where the median income is significantly above the typical range.
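The quartiles behind a grouped box plot can be pulled out directly, which makes claims about which group has the wider IQR easy to verify. A short sketch on made-up numbers (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data with the same two columns as the plot above.
toy = pd.DataFrame({
    "price_above_median": [0, 0, 0, 0, 1, 1, 1, 1],
    "MedInc": [1.5, 2.0, 3.0, 4.5, 4.0, 4.5, 5.0, 5.5],
})

# Quartiles of MedInc per price category, one row per category.
q = toy.groupby("price_above_median")["MedInc"].quantile([0.25, 0.5, 0.75]).unstack()
q["IQR"] = q[0.75] - q[0.25]
print(q)
```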

# Classification Techniques

In [None]:
import sklearn
from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

X = calihouse_data.drop(columns=['price_above_median'])  
y = calihouse_data['price_above_median']  

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print("Class distribution in original dataset:\n", y.value_counts(normalize=True))
print("Class distribution in training set:\n", y_train.value_counts(normalize=True))
print("Class distribution in test set:\n", y_test.value_counts(normalize=True))

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

In [None]:
#KNN
scaler = StandardScaler()
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

knn = KNeighborsClassifier()
knn_params = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}
grid_knn = GridSearchCV(knn, knn_params, cv=5, scoring='accuracy')
grid_knn.fit(X_train_scaled, y_train)

knn_best = grid_knn.best_estimator_
y_pred_knn = knn_best.predict(X_test_scaled)
print("KNN Classification Report:\n", classification_report(y_test, y_pred_knn))
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print("KNN AUC-ROC:", roc_auc_score(y_test, knn_best.predict_proba(X_test_scaled)[:, 1]))

In [None]:
#decision tree
pipeline = Pipeline([
    ('scaler', StandardScaler()),  
    ('classifier', DecisionTreeClassifier(random_state=42))
])

param_grid = {
    'classifier__max_depth': [5, 10, 15, 20, None],
    'classifier__min_samples_split': [2, 5, 10, 20],
    'classifier__min_samples_leaf': [1, 2, 5, 10],
    'classifier__max_features': ['sqrt', 'log2', None],
    'classifier__criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Hyperparameters:", grid_search.best_params_)
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]))

In [None]:
#random forest
pipeline_rf = Pipeline([
    ('scaler', StandardScaler()),  # Standardize the data
    ('classifier', RandomForestClassifier(random_state=42))
])

param_grid_rf = {
    'classifier__n_estimators': [50, 100, 200],  # Number of trees
    'classifier__max_depth': [5, 10, 20, None],  # Tree depth
    'classifier__min_samples_split': [2, 5, 10],  # Min samples required to split
    'classifier__min_samples_leaf': [1, 2, 5],  # Min samples at leaf
    'classifier__max_features': ['sqrt', 'log2', None],  # Feature selection strategy
    'classifier__criterion': ['gini', 'entropy']  # Splitting criterion
}

grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_rf.fit(X_train, y_train)
rf_best = grid_rf.best_estimator_
y_pred_rf = rf_best.predict(X_test)

print("Best Random Forest Hyperparameters:", grid_rf.best_params_)
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred_rf))
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("Random Forest AUC-ROC:", roc_auc_score(y_test, rf_best.predict_proba(X_test)[:, 1]))

In [None]:
#adaboost
pipeline_ab = Pipeline([
    ('scaler', StandardScaler()),  # Standardize the data
    ('classifier', AdaBoostClassifier(random_state=42))
])

param_grid_ab = {
    'classifier__n_estimators': [50, 100, 200, 500],  # Number of weak learners
    'classifier__learning_rate': [0.001, 0.01, 0.1, 1],  # Learning rate
    'classifier__algorithm': ['SAMME', 'SAMME.R']  # Algorithm type; 'SAMME.R' is deprecated in recent scikit-learn releases, so drop it if the search raises an error
}

grid_ab = GridSearchCV(pipeline_ab, param_grid_ab, cv=5, scoring='accuracy', n_jobs=-1)
grid_ab.fit(X_train, y_train)
ab_best = grid_ab.best_estimator_
y_pred_ab = ab_best.predict(X_test)

print("Best AdaBoost Hyperparameters:", grid_ab.best_params_)
print("AdaBoost Classification Report:\n", classification_report(y_test, y_pred_ab))
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred_ab))
print("AdaBoost AUC-ROC:", roc_auc_score(y_test, ab_best.predict_proba(X_test)[:, 1]))

In [None]:
models = {
    'KNN': (knn_best, X_test_scaled),  #scaled by hand above
    'Decision Tree': (best_model, X_test),  #best_model is the tuned decision tree pipeline; it scales internally
    'Random Forest': (rf_best, X_test),  #pipeline scales internally
    'AdaBoost': (ab_best, X_test)  #pipeline scales internally
}

for name, (model, X_data) in models.items():
    print(f"{name} Accuracy: {accuracy_score(y_test, model.predict(X_data)):.4f}")
    print(f"{name} AUC-ROC: {roc_auc_score(y_test, model.predict_proba(X_data)[:, 1]):.4f}\n")

WOW! These are some impressive results!! We know from project 1 that KNN does not perform well when the dataset is not standardized, and after scaling the data the model clearly did a lot better. I also ran the decision tree, random forest, and AdaBoost models on scaled data (the `StandardScaler` step in each pipeline) to see if there would be a major difference, and found no big change in accuracy or AUC-ROC. That is most likely because tree-based models are not scale sensitive: each split compares one feature against a threshold, so rescaling a feature moves the threshold without changing which rows fall on each side.
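The scale-sensitivity point can be checked on synthetic data. The sketch below is an assumption-laden toy (data from `make_classification`, one feature inflated by 1000, fixed hyperparameters), not a rerun of the experiment above: it fits the same model on raw and standardized copies and measures how often the two sets of predictions agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with one feature blown up to a much larger scale.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X[:, 0] *= 1000.0
X_scaled = StandardScaler().fit_transform(X)

# Same tree settings on both versions of the data.
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)
tree_agreement = np.mean(tree_raw.predict(X) == tree_scaled.predict(X_scaled))

# Same KNN settings on both versions of the data.
knn_raw = KNeighborsClassifier(n_neighbors=5).fit(X, y)
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(X_scaled, y)
knn_agreement = np.mean(knn_raw.predict(X) == knn_scaled.predict(X_scaled))

print(f"tree agreement: {tree_agreement:.3f}, KNN agreement: {knn_agreement:.3f}")
```

The tree's predictions should match on essentially every row, while KNN's agreement depends on how much the inflated feature dominates the distance.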

Just for funsies, let's see what would happen if we used stacking, a form of ensemble learning. We will train multiple base models (KNN, decision tree, random forest, and AdaBoost), then use their predictions as inputs to a logistic regression meta-model.
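One way to set this up is scikit-learn's `StackingClassifier`. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` and fixed, untuned hyperparameters rather than the tuned models and the housing data from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the housing features and the binary target.
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Base models; only KNN gets a scaler because it is the distance-based one.
base_models = [
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
    ("dt", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("ab", AdaBoostClassifier(n_estimators=50, random_state=42)),
]

# The meta-model is trained on cross-validated predictions of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
stack_acc = stack.score(X_te, y_te)
print(f"Stacking accuracy on the synthetic test split: {stack_acc:.4f}")
```

Swapping in `X_train`, `y_train`, and the tuned estimators would be the natural next step on the real data.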

In [None]:
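from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_true, y_pred, model_name):
    """Draw a labeled heatmap of the confusion matrix for one model."""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(6, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Class 0', 'Class 1'], yticklabels=['Class 0', 'Class 1'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title(f'Confusion Matrix - {model_name}')
    plt.show()

# Plot the matrix for each tuned model, using the test-set predictions from the cells above
plot_confusion_matrix(y_test, y_pred_knn, 'KNN')
plot_confusion_matrix(y_test, y_pred, 'Decision Tree')
plot_confusion_matrix(y_test, y_pred_rf, 'Random Forest')
plot_confusion_matrix(y_test, y_pred_ab, 'AdaBoost')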

Hmmm, this is quite interesting! We computed a confusion matrix for both the unscaled and scaled models, and the results show how much scaling matters for each classifier. For KNN, scaling had a significant positive effect. Without scaling, the confusion matrix showed more misclassifications, because KNN relies on distance calculations that can be dominated by features with larger scales. After standardization, the errors were spread more evenly across the two classes and more instances were classified correctly.

The decision tree classifier, however, was unaffected by scaling. Since decision trees split on one feature at a time and do not rely on distance calculations, the confusion matrices remained nearly identical before and after scaling. The random forest classifier likewise showed no sensitivity to feature scaling, and its confusion matrices were almost identical. Its strong performance, with fewer misclassifications than the single decision tree, suggests it generalizes better by reducing overfitting. The AdaBoost classifier also showed no significant change with scaling, and while it performed well, it may have trailed random forest because of its sensitivity to noisy data and its weaker base estimators.

Overall, scaling significantly impacted KNN but had little effect on the decision tree, random forest, or AdaBoost. Random forest emerged as a top-performing model regardless of scaling, making it a strong choice for this dataset. For computational efficiency, a single decision tree may be a good option, since it is simpler while still maintaining decent accuracy. When misclassification penalties are high, model selection should prioritize precision and recall over accuracy.
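To make the last point concrete, here is how precision and recall fall out of a confusion matrix. The labels below are made up for illustration and are not predictions from any of the models above:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = accuracy_score(y_true, y_hat)
precision = precision_score(y_true, y_hat)  # tp / (tp + fp)
recall = recall_score(y_true, y_hat)        # tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

Here the accuracy looks respectable even though the model misses half of the positive class, which is exactly the situation where recall deserves more weight.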