
## Imports

In [33]:
# Imports
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor

from sklearn.pipeline import make_pipeline

from sklearn import set_config
set_config(display='diagram')

## Import the Data

In [3]:
# Load in the data
df = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vTDduxF_M9JUiQ-fhHEnWBpiljZBT-NPHofXDFFwJl7qd4XkH7WxjkGaFYBCrLWr9IYcUO9UcwFlWKg/pub?output=csv')

In [4]:
# Display the first five rows of the dataframe
df.head()

,CRIM,NOX,RM,AGE,PTRATIO,LSTAT,PRICE
0,0.00632,0.538,6.575,65.2,15.3,4.98,24.0
1,0.02731,0.469,6.421,78.9,17.8,9.14,21.6
2,0.02729,0.469,7.185,61.1,17.8,4.03,34.7
3,0.03237,0.458,6.998,45.8,18.7,2.94,33.4
4,0.06905,0.458,7.147,54.2,18.7,5.33,36.2


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   NOX      506 non-null    float64
 2   RM       506 non-null    float64
 3   AGE      506 non-null    float64
 4   PTRATIO  506 non-null    float64
 5   LSTAT    506 non-null    float64
 6   PRICE    506 non-null    float64
dtypes: float64(7)
memory usage: 27.8 KB


## Split the Data

In [9]:
# split X and y, we are predicting price
target = 'PRICE'
X = df.drop(columns=[target]).copy()
y = df[target].copy()

# split training and test

# set random_state to 42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30 , random_state=42)

In [10]:
X_train.shape

(354, 6)

In [11]:
X_test.shape

(152, 6)
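The two shapes above follow from `test_size=.30` applied to the 506 rows. As a rough sketch of the arithmetic, assuming `train_test_split` rounds the test fraction up and gives the remainder to the training set (the variable names here are illustrative only):

```python
import math

n_samples = 506   # rows reported by df.info()
test_size = 0.30  # fraction passed to train_test_split

# Assumption: the test count is rounded up, the rest goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

This reproduces the 354 training rows and 152 test rows shown by `X_train.shape` and `X_test.shape`.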

In [12]:
X_train.dtypes

CRIM       float64
NOX        float64
RM         float64
AGE        float64
PTRATIO    float64
LSTAT      float64
dtype: object

In [13]:
# PREPROCESSING PIPELINE FOR NUMERIC DATA

# Save list of number column names
num_cols = X_train.select_dtypes("number").columns
print("Numeric Columns:", num_cols)

# Transformers
mean_imputer = SimpleImputer(strategy='mean')
scaler = StandardScaler()

# Pipeline
num_pipeline = make_pipeline(mean_imputer, scaler)
num_pipeline

Numeric Columns: Index(['CRIM', 'NOX', 'RM', 'AGE', 'PTRATIO', 'LSTAT'], dtype='object')
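The `num_pipeline` above is built but the models later in this notebook are fit directly on `X_train`. One possible way to attach the same preprocessing steps to a model is sketched below. It assumes synthetic data from `make_regression` as a stand-in for the housing data, and all `*_demo` / `*d_*` names are hypothetical. Scaling usually has little effect on tree-based models, so the imputer is the step most likely to matter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the notebook's housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
X_demo[::25, 0] = np.nan  # inject a few missing values so the imputer has work to do

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# Preprocessing and model in a single estimator: imputation and scaling
# are learned from the training rows only, then reused at predict time
model_pipe = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=42),
)
model_pipe.fit(Xd_train, yd_train)
demo_preds = model_pipe.predict(Xd_test)
```

Because the pipeline is a single estimator, it could also be passed to `GridSearchCV` so that preprocessing is refit inside each fold.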


In [16]:
# Initialize the base estimator (decision tree)
base_estimator = DecisionTreeRegressor()

In [22]:
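# Initialize the BaggingRegressor with the base estimator.
# The tree is passed positionally because the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2 (old name removed in 1.4).
bagged_model = BaggingRegressor(base_estimator, n_estimators=10, random_state=42)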

In [23]:
# Fit the model on the training data
bagged_model.fit(X_train, y_train)



In [24]:
# Predict the target variable on the test data
y_pred = bagged_model.predict(X_test)

In [25]:
# Calculate the mean squared error as a measure of performance
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Mean Squared Error: 12.226481578947368
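To make the idea behind `BaggingRegressor` concrete, here is a minimal hand-rolled sketch of bagging: draw bootstrap samples, fit one tree per sample, and average the predictions. It uses synthetic data and is not scikit-learn's internal implementation; the names `X_demo`, `trees`, and `bagged_preds` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the notebook's housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Bootstrap: sample row indices with replacement, same size as the data
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Aggregate: average the individual trees' predictions for the first five rows
all_preds = np.stack([tree.predict(X_demo[:5]) for tree in trees])
bagged_preds = all_preds.mean(axis=0)
print(bagged_preds)
```

Averaging many trees trained on different resamples tends to reduce variance compared with a single deep tree, which is the motivation for the ensemble above.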


## Use GridSearchCV to tune the Bagged Tree model to optimize performance on the test set.

Evaluate the best Bagged Tree model's performance.

In [27]:
# Define hyperparameters grid to search through
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_samples': [0.5, 0.7, 0.9],
    'max_features': [0.5, 0.7, 0.9]
}

In [None]:
# Initialize GridSearchCV
grid_search = GridSearchCV(bagged_model, param_grid, scoring='neg_mean_squared_error', cv=3)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

In [29]:
# Get the best model from the grid search
best_model = grid_search.best_estimator_


In [30]:
# Predict the target variable on the test data using the best model
y_pred = best_model.predict(X_test)

In [31]:
# Calculate the mean squared error as a measure of performance
mse = mean_squared_error(y_test, y_pred)
print("Best Model's Mean Squared Error:", mse)

Best Model's Mean Squared Error: 13.649993736842104


In [32]:
# Get the best hyperparameters from the grid search
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'max_features': 0.7, 'max_samples': 0.9, 'n_estimators': 50}
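Note that the tuned bagged model's test MSE (13.65) is higher than the untuned model's (12.23). This is possible because `GridSearchCV` ranks candidates by their mean cross-validated score on the training folds and never sees the test set. The sketch below, on synthetic data with a deliberately small hypothetical grid, shows how the cross-validated MSE (the negated `best_score_`) and the test MSE can be read side by side.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the notebook's housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

demo_search = GridSearchCV(
    BaggingRegressor(DecisionTreeRegressor(), random_state=42),
    {"n_estimators": [5, 10], "max_samples": [0.5, 0.9]},  # small illustrative grid
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_search.fit(Xd_train, yd_train)

# best_score_ is a negated MSE averaged over the validation folds
cv_mse = -demo_search.best_score_
# The test set is only used here, after the search has finished
test_mse = mean_squared_error(yd_test, demo_search.predict(Xd_test))

print("Cross-validated MSE:", cv_mse)
print("Test MSE:", test_mse)
```

The two numbers generally differ, and the candidate with the best cross-validated score is not guaranteed to have the best test score.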


## Train and evaluate a default random forest

In [34]:
# Initialize the Random Forest Regressor
random_forest_model = RandomForestRegressor(random_state=42)

# Fit the model on the training data
random_forest_model.fit(X_train, y_train)

# Predict the target variable on the test data
y_pred = random_forest_model.predict(X_test)

In [38]:
# Calculate the mean squared error as a measure of performance
mse = mean_squared_error(y_test, y_pred)
print("Random Forest Mean Squared Error:", mse)

Random Forest Mean Squared Error: 11.836238618421056


## Use GridSearchCV to tune the Random Forest model to optimize performance on the test set.

Evaluate the best Random Forest model's performance.

In [39]:
# Define hyperparameters grid to search through
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(random_forest_model, param_grid, scoring='neg_mean_squared_error', cv=3)
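It can help to estimate the cost of a grid before fitting it. The sketch below counts the parameter combinations in the Random Forest grid above and multiplies by the number of folds; `rf_grid`, `n_combinations`, and `total_fits` are illustrative names.

```python
from itertools import product

# Same values as the Random Forest grid defined above
rf_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 3

# Every combination of one value per hyperparameter
n_combinations = len(list(product(*rf_grid.values())))
# Each combination is fit once per fold
total_fits = n_combinations * cv_folds

print(n_combinations, "combinations,", total_fits, "fits")
```

That is 81 combinations and 243 fits, plus one final refit of the best combination on the full training set when `refit` is left at its default.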

In [40]:
# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_model = grid_search.best_estimator_

In [41]:
# Predict the target variable on the test data using the best model
y_pred = best_model.predict(X_test)


In [42]:
# Calculate the mean squared error as a measure of performance
mse = mean_squared_error(y_test, y_pred)
print("Best Random Forest Model's Mean Squared Error:", mse)

Best Random Forest Model's Mean Squared Error: 11.396233017907457


In [43]:
# Get the best hyperparameters from the grid search
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}
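Beyond `best_params_`, the `cv_results_` attribute records the score of every candidate, which shows how close the runners-up were. A minimal sketch on synthetic data with a small hypothetical grid follows. Note that `mean_test_score` in `cv_results_` refers to the validation folds inside cross-validation, not to the held-out test set.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the notebook's housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

demo_search = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    {"max_depth": [None, 5], "min_samples_leaf": [1, 4]},  # small illustrative grid
    scoring="neg_mean_squared_error",
    cv=3,
)
demo_search.fit(X_demo, y_demo)

results = pd.DataFrame(demo_search.cv_results_)
# Flip the sign so the column reads as an ordinary MSE
results["mean_cv_mse"] = -results["mean_test_score"]
ranked = results.sort_values("rank_test_score")[
    ["params", "mean_cv_mse", "std_test_score", "rank_test_score"]
]
print(ranked)
```

If the top few rows have nearly identical scores, the choice among them is within the noise of the cross-validation and a simpler setting may be just as good.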


## Which model and model parameters provided the best results?

In [44]:
# Access the best model
best_model = grid_search.best_estimator_

# Access the best hyperparameters
best_params = grid_search.best_params_

print("Best Model:", best_model)
print("Best Hyperparameters:", best_params)


Best Model: RandomForestRegressor(max_depth=10, n_estimators=150, random_state=42)
Best Hyperparameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 150}


The output above shows the best Random Forest model and the hyperparameters selected by GridSearchCV.

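The best Random Forest has a maximum depth of 10 and uses 150 estimators (trees); `min_samples_leaf=1` and `min_samples_split=2` were selected at their default values. GridSearchCV chose these values by 3-fold cross-validated MSE on the training data, and the test set was used only for the final evaluation. With a test MSE of about 11.40, this model outperformed the default Random Forest (11.84) and both bagged-tree models (12.23 default, 13.65 tuned).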
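One way to check which settings of the best model actually differ from the library defaults is to compare `get_params()` against a freshly constructed estimator. The sketch below rebuilds the best model from the printed output above; `tuned`, `defaults`, and `changed` are illustrative names.

```python
from sklearn.ensemble import RandomForestRegressor

# Rebuilt from the "Best Model" line printed above
tuned = RandomForestRegressor(max_depth=10, n_estimators=150, random_state=42)
defaults = RandomForestRegressor().get_params()

# Keep only the parameters whose value differs from the default
changed = {
    name: value
    for name, value in tuned.get_params().items()
    if defaults[name] != value
}
print(changed)
```

Only `max_depth`, `n_estimators`, and `random_state` should appear, which matches the model's printed representation.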

## Explain in a text cell how your model will perform if deployed by referring to the metrics. Ex. How close can your stakeholders expect its predictions to be to the true value?

The tuned Random Forest reached a test MSE of about 11.40, which corresponds to a root mean squared error (RMSE) of about 3.38 in the units of `PRICE`. If the model is deployed, stakeholders can expect a typical prediction error of roughly that size, provided new data resemble the test set. The hyperparameters were selected by 3-fold cross-validation on the training data, using negative MSE as the scoring metric.

No model is perfect, and individual predictions can miss by more than the RMSE. The improvement from tuning is also modest: the tuned Random Forest's test MSE of 11.40 compares with 11.84 for the default version. Both Random Forest models did better on this test set than the bagged-tree models (12.23 default, 13.65 tuned), so the tuned Random Forest is the strongest candidate among the models tried here for supporting decisions based on predicted housing prices.
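Because MSE is in squared units, taking its square root gives an error in the same units as `PRICE`, which is easier to explain to stakeholders. The sketch below converts the four test MSE values reported in this notebook; the dictionary labels are illustrative.

```python
import math

# Test-set MSE values printed earlier in this notebook
reported_mse = {
    "Bagged trees (default)": 12.226481578947368,
    "Bagged trees (tuned)": 13.649993736842104,
    "Random forest (default)": 11.836238618421056,
    "Random forest (tuned)": 11.396233017907457,
}

# RMSE is the square root of MSE, in the same units as the target
rmse = {name: math.sqrt(mse) for name, mse in reported_mse.items()}

# Print from lowest to highest error
for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {value:.2f}")
```

The tuned Random Forest has the lowest RMSE, at roughly 3.38.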





