### First, Technical Terms and Definitions

* __Base class:__ A class that defines common properties and methods for other classes. In this case, DataProcessor is the base class.

* __Derived class:__ A class that inherits properties and methods from another class. HousingDataProcessor is a derived class of DataProcessor.

* __Feature engineering:__ The process of creating new features from existing data to improve model performance.

* __One-hot encoding:__ A technique to represent categorical variables as numerical features by creating binary columns for each category.

* __Random Forest:__ An ensemble machine learning algorithm that combines multiple decision trees to make predictions.

* __Standardization:__ A preprocessing technique that scales numerical features to have a mean of 0 and a standard deviation of 1.

* __Feature importance:__ A measure of how much each feature contributes to predicting the target variable.

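To make the one-hot encoding and standardization definitions concrete, here is a minimal sketch on a tiny made-up DataFrame. The column names and values are illustrative assumptions and are not taken from the housing data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data: hypothetical values, for illustration only
df = pd.DataFrame({
    "rooms": [2.0, 4.0, 6.0],
    "proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

# One-hot encoding: one binary column per category
dummies = pd.get_dummies(df["proximity"])
encoded = df.join(dummies).drop(columns=["proximity"])

# Standardization: rescale to mean 0 and (population) standard deviation 1
scaled_rooms = StandardScaler().fit_transform(df[["rooms"]]).ravel()

print(encoded)
print(scaled_rooms)
```

The encoded frame keeps the numeric column and gains one column per category, while the scaled column is centered on 0.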
### Import Libraries and Load Data
In this step, we import the necessary libraries. The data itself is loaded later, by the DataProcessor constructor.

In [8]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

### Object-Oriented Programming (OOP)
Define DataProcessor (base class).
It serves as a blueprint for creating objects that handle data processing tasks.
* __init__(self, data_path): The constructor, called when a new DataProcessor object is created. It takes the data path (location of the CSV file) as input and:

    * Reads the CSV file using pd.read_csv and stores it in the self.data attribute.

    * Initializes self.X and self.y to None as placeholders for the features and target variable, which are set later.

* clean_data(self): This method removes rows with missing values (NaN) from the data using self.data.dropna(inplace=True).

* split_features_target(self, target_column): This method separates features (independent variables) from the target variable (dependent variable). It takes the target column name as input and:

    * Drops the target column from the data using self.data.drop and stores the remaining features in self.X.

    * Extracts the target column and stores it in self.y.

* __@staticmethod__: This decorator marks log_transform as a static method, so it can be called without creating an object of the class. It takes a value as input and returns np.log(value + 1). Adding 1 keeps zero values defined, and the log compresses skewed numerical features in housing data (e.g., number of bedrooms).

In [9]:
class DataProcessor:
    def __init__(self, data_path):
        self.data = pd.read_csv(data_path)
        self.X = None
        self.y = None

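    def clean_data(self):
        # Drop every row that contains at least one missing value
        self.data.dropna(inplace=True)

    def split_features_target(self, target_column):
        # Features: all columns except the target; target: that single column
        self.X = self.data.drop([target_column], axis=1)
        self.y = self.data[target_column]

    @staticmethod
    def log_transform(value):
        # log(value + 1), so that a value of 0 maps to 0 instead of -inf
        return np.log(value + 1)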

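The base class can be exercised without a file on disk. The sketch below repeats the class so that it runs on its own and feeds it a hypothetical in-memory CSV through io.StringIO (pd.read_csv accepts file-like objects); the column names rooms and price are made up for the example.

```python
import io

import numpy as np
import pandas as pd


class DataProcessor:
    # Same logic as the cell above, repeated so this snippet is self-contained
    def __init__(self, data_path):
        self.data = pd.read_csv(data_path)
        self.X = None
        self.y = None

    def clean_data(self):
        self.data.dropna(inplace=True)

    def split_features_target(self, target_column):
        self.X = self.data.drop([target_column], axis=1)
        self.y = self.data[target_column]

    @staticmethod
    def log_transform(value):
        return np.log(value + 1)


# Static method: called on the class itself, no instance needed
logged = DataProcessor.log_transform(np.array([0, 9, 99]))
print(logged)

# Hypothetical CSV text standing in for a real file; the second row has a gap
csv_text = "rooms,price\n3,100\n,200\n5,300\n"
processor = DataProcessor(io.StringIO(csv_text))
processor.clean_data()                    # drops the row with the missing value
processor.split_features_target("price")
print(processor.X.shape, processor.y.tolist())
```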
### Object-Oriented Programming (OOP)
Define HousingDataProcessor (derived class)

This class inherits from DataProcessor and is specifically designed for housing data.

* __init__(self, data_path): Calls the constructor of the parent class DataProcessor using super().__init__(data_path). It also initializes self.train_data to None.

* feature_engineering(self): This method performs feature engineering on the training data:

    * Combines the features (self.X) and target variable (self.y) into a single dataframe self.train_data.

    * Applies the log_transform function to the specified columns (total_rooms, total_bedrooms, population, households) to handle skewed distributions.

    * Creates two new features: bedroom_ratio (total_bedrooms divided by total_rooms) and household_rooms (total_rooms divided by households). Because these are computed after the log transform, they are ratios of the log-transformed columns, not of the raw counts.

* encode_categorical(self, column): This method encodes a categorical column using one-hot encoding. It takes the column name as input and:

    * Creates dummy variables using pd.get_dummies for the specified column.

    * Joins the dummy variables with the train_data dataframe and drops the original categorical column.

* feature_importance_generator(self, model): This generator method yields feature importance values for the given model. It takes a trained model as input and:

    * Extracts the feature importance values from the model using model.feature_importances_.

    * Pairs the feature names with their importance values and sorts the pairs in descending order of importance.

    * Yields one string per feature, containing the feature name and its importance.

In [10]:
class HousingDataProcessor(DataProcessor):
    def __init__(self, data_path):
        super().__init__(data_path)
        self.train_data = None

    def feature_engineering(self):
        # Recombine features and target into one working dataframe
        self.train_data = self.X.join(self.y)
        # Log-transform the skewed count columns
        for column in ['total_rooms', 'total_bedrooms', 'population', 'households']:
            self.train_data[column] = self.log_transform(self.train_data[column])
        # Ratios are computed on the already log-transformed columns
        self.train_data['bedroom_ratio'] = self.train_data['total_bedrooms'] / self.train_data['total_rooms']
        self.train_data['household_rooms'] = self.train_data['total_rooms'] / self.train_data['households']

    def encode_categorical(self, column):
        # Append one dummy column per category, then drop the original column
        self.train_data = self.train_data.join(pd.get_dummies(self.train_data[column])).drop([column], axis=1)

    def feature_importance_generator(self, model):
        feature_importance = model.feature_importances_
        # Column order must match the order of the features the model was trained on
        feature_names = self.train_data.drop(['median_house_value'], axis=1).columns
        for name, importance in sorted(zip(feature_names, feature_importance), key=lambda x: x[1], reverse=True):
            yield f"{name}: {importance:.4f}"

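The generator pattern used by feature_importance_generator can be tried in isolation. The sketch below is a standalone function with the same sort-and-yield logic; the function name ranked_importances and the feature names and scores are hypothetical and chosen only for illustration.

```python
def ranked_importances(names, importances):
    """Yield 'name: value' strings, most important feature first."""
    pairs = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
    for name, value in pairs:
        yield f"{name}: {value:.4f}"


# Hypothetical names and scores, for illustration only
gen = ranked_importances(["a", "b", "c"], [0.2, 0.5, 0.3])

first = next(gen)   # only the first string has been formatted so far
rest = list(gen)    # consuming the generator produces the remaining strings
print(first)
print(rest)
```

Because the strings are produced lazily, a caller can stop after the top few features without formatting the rest.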
### Machine Learning Model

* train_model(X, y): This function trains a Random Forest regression model on the given features (X) and target variable (y). It:

    * Splits the data into training (80%) and testing (20%) sets using train_test_split.

    * Standardizes the features using StandardScaler, fitting the scaler on the training set only and then applying it to both sets. Tree-based models such as Random Forest are largely insensitive to feature scale, so this step is not expected to change the results much here.

    * Creates a Random Forest model with 300 trees and a random state of 42 for reproducibility.

    * Fits the model on the training data and returns the trained model, scaled testing features, and testing target values.

In [11]:
def train_model(X, y):
    # Hold out 20% of the rows for testing
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Fit the scaler on the training set only, then apply it to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    model = RandomForestRegressor(n_estimators=300, random_state=42)
    model.fit(X_train_scaled, y_train)
    return model, X_test_scaled, y_test

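The split-then-scale order in train_model matters: the scaler learns its mean and standard deviation from the training rows only, and the test rows are transformed with those same statistics. A minimal sketch on synthetic data (the random features below are an assumption for illustration, not the housing data):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = X[:, 0] + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training set
X_test_scaled = scaler.transform(X_test)        # reused, not re-estimated, on the test set

print(X_train_scaled.mean(axis=0))  # approximately 0 for every column
print(X_test_scaled.mean(axis=0))   # near 0, but generally not exactly 0
```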

### Usage

* Creates an instance of HousingDataProcessor with the CSV file path.

* Calls the necessary methods to clean the data, split features and target, perform feature engineering, and encode the categorical column.

* Extracts features and target variables for training.

* Trains the model using the train_model function.

* Prints the feature importance values using the feature_importance_generator method.

* Evaluates the model on the testing data using model.score, which for a regressor returns the coefficient of determination (R²).

In [12]:

path = "/housing.csv"
processor = HousingDataProcessor(path)
processor.clean_data()
processor.split_features_target('median_house_value')
processor.feature_engineering()
processor.encode_categorical('ocean_proximity')

X = processor.train_data.drop(['median_house_value'], axis=1)
y = processor.train_data['median_house_value']
model, X_test, y_test = train_model(X, y)

### Results & Feature Importance:

* median_income: 0.4827
This feature has the highest importance, suggesting that the median income of the neighborhood is a strong predictor of housing prices.

* INLAND: 0.1423
This indicates that the location of the house (inland) is also a significant factor.

* longitude: 0.0985
The geographic longitude contributes to the prediction, possibly indicating that the location relative to other areas influences prices.

* latitude: 0.0888
Similar to longitude, the geographic latitude also plays a role in price prediction.

* housing_median_age: 0.0477
The age of the housing stock in the neighborhood is another important factor.

* bedroom_ratio: 0.0317
The ratio of bedrooms to total rooms is considered relevant.

* population: 0.0276
The (log-transformed) population of the area seems to have some influence on prices.

* household_rooms: 0.0243
The average number of rooms per household is also a contributing factor.

* total_rooms: 0.0192
The total number of rooms in the neighborhood is less important compared to other features.

* total_bedrooms: 0.0140
The total number of bedrooms is also less influential.

* households: 0.0127
The number of households in the neighborhood has a relatively low impact.

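* NEAR OCEAN: 0.0058
Proximity to the ocean contributes, but far less than the features above.

* <1H OCEAN: 0.0033
This ocean_proximity category contributes very little.

* NEAR BAY: 0.0012
This ocean_proximity category also contributes very little.

* ISLAND: 0.0002
Being on an island has the lowest importance of all features.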

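Two quick arithmetic checks on the printed importances: the rounded values add up to 1.0000, and the five one-hot columns derived from ocean_proximity can be summed to estimate the importance of the original categorical column as a whole. The sketch below copies the values from the notebook output; the variable names are illustrative.

```python
# Importances copied from the printed output of the notebook
importances = {
    "median_income": 0.4827, "INLAND": 0.1423, "longitude": 0.0985,
    "latitude": 0.0888, "housing_median_age": 0.0477, "bedroom_ratio": 0.0317,
    "population": 0.0276, "household_rooms": 0.0243, "total_rooms": 0.0192,
    "total_bedrooms": 0.0140, "households": 0.0127, "NEAR OCEAN": 0.0058,
    "<1H OCEAN": 0.0033, "NEAR BAY": 0.0012, "ISLAND": 0.0002,
}

# Dummy columns created from the single ocean_proximity column
ocean_columns = ["INLAND", "NEAR OCEAN", "<1H OCEAN", "NEAR BAY", "ISLAND"]

total = round(sum(importances.values()), 4)
ocean_total = round(sum(importances[c] for c in ocean_columns), 4)

print(total)        # sum over all features
print(ocean_total)  # combined importance of the ocean_proximity dummies
```

Taken together, the ocean_proximity columns account for about 0.1528 of the total, almost all of it from INLAND.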
### Model Accuracy:

The model's accuracy is: 0.8220
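For a regressor such as RandomForestRegressor, model.score returns the coefficient of determination (R²), not a classification accuracy. A value of 0.8220 means the trained Random Forest explains about 82.2% of the variance in median house value on the testing dataset. It does not mean that the model predicted the median house value correctly for 82.2% of the houses.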

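For reference, the value returned by model.score on a regressor is R² = 1 - SS_res / SS_tot. The sketch below checks this on synthetic data; the data, the train/test slicing, and the small forest size are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = RandomForestRegressor(n_estimators=20, random_state=42)
model.fit(X[:150], y[:150])
X_test, y_test = X[150:], y[150:]
pred = model.predict(X_test)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

print(model.score(X_test, y_test), manual_r2, r2_score(y_test, pred))
```

All three printed numbers agree, which is why the score is best read as a share of explained variance.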
In [13]:
for importance in processor.feature_importance_generator(model):
    print('-'*10)
    print(importance)

print('-'*10)
# For a regressor, score() returns R^2 (coefficient of determination), not classification accuracy
score = model.score(X_test, y_test)
print(f"The model's accuracy is: {score:.4f}")

----------
median_income: 0.4827
----------
INLAND: 0.1423
----------
longitude: 0.0985
----------
latitude: 0.0888
----------
housing_median_age: 0.0477
----------
bedroom_ratio: 0.0317
----------
population: 0.0276
----------
household_rooms: 0.0243
----------
total_rooms: 0.0192
----------
total_bedrooms: 0.0140
----------
households: 0.0127
----------
NEAR OCEAN: 0.0058
----------
<1H OCEAN: 0.0033
----------
NEAR BAY: 0.0012
----------
ISLAND: 0.0002
----------
The model's accuracy is: 0.8220
