
This notebook analyzes California housing data and builds a predictive model of district-level median house values in Python.

Summary:

Data Loading and Preparation.

Feature Scaling.

Model Training and Parameter Tuning (Random Forest Regression).

Model Evaluation.

Feature Importance Analysis.

Overall, the notebook walks through loading, preprocessing, and modeling the California housing data with a random forest, then uses the fitted model's feature importances to indicate which factors most influence median house values across districts.

In [1]:
from pathlib import Path
import pandas as pd
import tarfile
import urllib.request
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler

In [2]:
def load_housing_data():
    tarball_path = Path("datasets/housing.tgz")
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))

housing = load_housing_data()

In [3]:
housing.head()

|   | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [4]:
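# Data cleaning: fill missing total_bedrooms values with the column median.
# Assigning the result back avoids the chained-assignment pitfall of inplace=True on a column.
housing['total_bedrooms'] = housing['total_bedrooms'].fillna(housing['total_bedrooms'].median())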

In [5]:
# Encoding categorical variables using one-hot encoding
housing = pd.get_dummies(housing, columns=['ocean_proximity'], drop_first=True)
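
To make the effect of `drop_first=True` concrete, here is a minimal pure-Python sketch of the same idea: categories are sorted, the first one is dropped, and it becomes the implicit baseline (a row of all `False`). The helper `one_hot_drop_first` is hypothetical and only mimics the behaviour; it is not how pandas implements `get_dummies`. The five labels are assumed to be the categories present in `ocean_proximity`.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of labels, dropping the first category in sorted order."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category becomes the implicit baseline
    columns = [f"{prefix}_{c}" for c in kept]
    # One row per input label; True only in the column matching that label
    rows = [[v == c for c in kept] for v in values]
    return columns, rows


# Assumed category labels for the ocean_proximity column
labels = ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR OCEAN", "ISLAND"]
columns, rows = one_hot_drop_first(labels, "ocean_proximity")

print(columns)
for label, row in zip(labels, rows):
    print(label, row)
```

Under these assumptions, `<1H OCEAN` sorts first and is the category left out, so a district in that category is encoded as `False` in all four remaining columns.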

In [6]:
# Feature scaling
scaling = MinMaxScaler()
columns_to_scale = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income']
housing[columns_to_scale] = scaling.fit_transform(housing[columns_to_scale])
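
For reference, min-max scaling maps each column linearly onto [0, 1] via (x - min) / (max - min). The sketch below applies that formula in plain Python to the five `median_income` values shown in the preview above. The helper `min_max_scale` is hypothetical and is not part of scikit-learn; mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: there is no spread to rescale, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# median_income for the first five rows of the dataset preview
median_income = [8.3252, 8.3014, 7.2574, 5.6431, 3.8462]
scaled = min_max_scale(median_income)

for raw, s in zip(median_income, scaled):
    print(f"{raw:.4f} -> {s:.4f}")
```

Note that the minimum and maximum here come from five rows only; in the notebook the scaler is fitted on the full column, so the scaled values will differ.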

In [7]:
# Performing Grid Search with Random Forest Regressor
parameters_grid = {
    'n_estimators': [5, 15, 25],
    'max_features': [2, 4, 8]
}

random_forest = RandomForestRegressor()

grid_search = GridSearchCV(random_forest, parameters_grid, cv=3, scoring='neg_mean_squared_error')
grid_search.fit(housing.drop('median_house_value', axis=1), housing['median_house_value'])

best_parameters = grid_search.best_params_

# Printing the best parameters
print("Best Parameters:", best_parameters)

Best Parameters: {'max_features': 8, 'n_estimators': 25}
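
Because the search uses `scoring='neg_mean_squared_error'`, the score stored in `grid_search.best_score_` is a negated mean squared error. A small sketch of turning such a score into a root mean squared error, which is in the same units as `median_house_value`, could look like this. The helper `rmse_from_neg_mse` is hypothetical, and the example score is a made-up placeholder, not a result from this notebook.

```python
import math


def rmse_from_neg_mse(neg_mse):
    """Convert a negated MSE score (as used by scikit-learn scorers) into an RMSE."""
    if neg_mse > 0:
        raise ValueError("expected a negated MSE, which is never positive")
    return math.sqrt(-neg_mse)


# Placeholder value for illustration only; not an output of the grid search above
example_score = -2.5e9
rmse = rmse_from_neg_mse(example_score)
print(f"RMSE: {rmse:,.0f}")
```

In the notebook, the same conversion would be applied as `rmse_from_neg_mse(grid_search.best_score_)` to report the cross-validated error of the best parameter combination.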


In [8]:
# Create and train a Random Forest Regressor using the best parameters found by the grid search
random_forest = RandomForestRegressor(n_estimators=25, max_features=8, random_state=42)
random_forest.fit(housing.drop('median_house_value', axis=1), housing['median_house_value'])

# Get the most important features
feature_importances = random_forest.feature_importances_

# Create a DataFrame to associate feature names with their importances.
# Take the names from the same feature matrix used for training so each importance
# lines up with the right column (the target sits in the middle of housing.columns, not at the end).
feature_names = housing.drop('median_house_value', axis=1).columns
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})

# Sorting the features by importance (descending order)
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Showing the top three most important features
top_three_features = feature_importance_df.head(3)

print(top_three_features)


                  Feature  Importance
7           median_income    0.452753
8  ocean_proximity_INLAND    0.157943
0               longitude    0.115337
