#### End-to-End Machine Learning Project

A machine learning project generally follows the development cycle below:

1. Problem Definition / Statement
2. Data Collection
3. Data Cleaning, which involves:
   - Handling Missing Data
   - Handling Outliers
   - Removing Duplicates
   - Correcting Data Formats
4. Exploratory Data Analysis
5. Data Pre-processing for Machine Learning (Feature Engineering), which includes:
     - Feature Transformation
     - Feature Creation
     - Feature Encoding
     - Feature Selection
     - Feature Scaling
     - Dimensionality Reduction
     - Handling Imbalanced Data
6. Choosing an algorithm based on the type of machine learning problem
7. Training the model on the training set
8. Fine-tuning the model
9. Evaluating the model
10. Deploying the model
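
The four data cleaning tasks in step 3 can be sketched with pandas. This is a minimal illustration on a made-up toy frame (the `raw` data, the median fill, and the 1.5 × IQR clipping rule are all assumptions chosen for the example, not choices made elsewhere in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
raw = pd.DataFrame({
    "price": ["100", "120", "120", np.nan, "5000"],
    "rooms": [3, 4, 4, 2, 3],
})

clean = raw.copy()

# Correct data formats: the prices arrived as strings.
clean["price"] = pd.to_numeric(clean["price"])

# Remove duplicates: the second and third rows are identical.
clean = clean.drop_duplicates().reset_index(drop=True)

# Handle missing data: fill gaps with the column median.
clean["price"] = clean["price"].fillna(clean["price"].median())

# Handle outliers: clip values that fall outside 1.5 * IQR of the quartiles.
q1, q3 = clean["price"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["price"] = clean["price"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(clean)
```

Which fill value and which outlier rule are appropriate depends on the dataset; the ones above are only common defaults.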


----
----

In this notebook we focus on choosing a model, then training and testing it.

----
----

In [80]:
# Import packages

import os
import tarfile
import urllib.request
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt


pd.set_option('display.float_format','{:.2f}'.format)

In [53]:
## Load the data

DOWNLOAD_ROOT =  "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH  =  os.path.join("datasets", "housing")
HOUSING_URL   =  DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz into housing_path and extract it there."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The context manager closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [59]:
# Uncomment on the first run to download and extract the dataset
# fetch_housing_data()

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [60]:
# Load the data 

df = load_housing_data()
df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


**Our Problem Statement**

- Build a model that predicts `median_house_value`. Because the target is a continuous value, this is a regression problem.

**Data Cleaning**

In [62]:
df.isna().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [63]:
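# Simple mean imputation for the missing total_bedrooms values
# (assigning the result back avoids pandas' chained-assignment warning with inplace=True)

df['total_bedrooms'] = df['total_bedrooms'].fillna(df['total_bedrooms'].mean())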
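
The cell above computes the mean over the whole dataset. A minimal sketch of an alternative, assuming you want the fill value learned from the training rows only, uses scikit-learn's `SimpleImputer`. The `train` and `test` frames here are hypothetical stand-ins, not the split created later in this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for a train/test split.
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 500.0]})

imputer = SimpleImputer(strategy="mean")

# Learn the mean from the training rows only, then reuse it on the test rows.
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)

print(imputer.statistics_)  # the mean learned from the training column
print(test_filled)
```

Fitting on the training rows keeps information from the test set out of the pre-processing step.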

**Feature Engineering**


**1. Feature Creation**

In [66]:


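# Note: despite its name, rooms_per_households is computed as total_rooms / total_bedrooms
# (rooms per bedroom). Dividing by df['households'] instead would give rooms per household.
# The outputs shown in this notebook were produced with the definition below.
df['rooms_per_households']     = df['total_rooms'] / df['total_bedrooms']
df['population_per_household'] = df['population'] / df['households']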

**2. Feature Encoding**

Data encoding is the process of converting data from one form to another, often for efficient storage, transmission, or processing. It represents information using a specific format or set of rules. In machine learning it usually means turning categorical values into numbers that a model can consume.

- Common approaches to data encoding include:
    1. **One-hot encoding** converts a categorical variable into a series of binary variables (0 or 1). Each category is represented as a binary vector in which only one element is "hot" (1) and the rest are "cold" (0). It is commonly used for nominal data, where the categories have no intrinsic order.

    2. **Label encoding** assigns each category a unique integer. It is often used when the categories have an order or rank (ordinal data). Applied to nominal data it can mislead, because the model may infer an order or relationship between categories that does not exist.
    3. **Binary encoding** writes each category's integer code in binary and uses one column per bit.
    4. **Frequency encoding** replaces each category with how often it occurs in the data.
    5. **Embedding** (in deep learning) converts categories into dense vectors of continuous numbers. These vectors capture relationships between categories in a lower-dimensional space.
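
To make the difference between these approaches concrete, here is a minimal pandas sketch of one-hot, label, and frequency encoding. The four-row Series `s` is a hypothetical toy column, not the housing data:

```python
import pandas as pd

# Hypothetical toy column, invented for illustration.
s = pd.Series(["INLAND", "NEAR BAY", "INLAND", "ISLAND"], name="ocean_proximity")

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(s, prefix="ocean_proximity", dtype=int)

# Label encoding: each category mapped to an integer code
# (codes follow the sorted order of the category names).
label = s.astype("category").cat.codes

# Frequency encoding: each category replaced by its share of the rows.
freq = s.map(s.value_counts(normalize=True))

print(one_hot)
print(label.tolist())  # [0, 2, 0, 1]
print(freq.tolist())   # [0.5, 0.25, 0.5, 0.25]
```

Note how label encoding puts ISLAND "between" INLAND and NEAR BAY, an ordering that has no real meaning for this column.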


In [67]:
# Use the OneHotEncoder class from scikit-learn
from sklearn.preprocessing import OneHotEncoder

# Initialize the encoder; sparse_output=False returns a dense NumPy array
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the data
encoder_data = encoder.fit_transform(df[['ocean_proximity']])
print(encoder_data[0])

#  Convert the encoded data to a DataFrame for easier visualization
encoded_df = pd.DataFrame(encoder_data, columns=encoder.get_feature_names_out(['ocean_proximity']))

df = pd.concat([df, encoded_df], axis=1).drop(columns=['ocean_proximity'])
df.head(1)

[0. 0. 0. 1. 0.]


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | rooms_per_households | population_per_household | ocean_proximity_<1H OCEAN | ocean_proximity_INLAND | ocean_proximity_ISLAND | ocean_proximity_NEAR BAY | ocean_proximity_NEAR OCEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 6.821705 | 2.555556 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


**Create Train and Test Sets**

In [69]:
from sklearn.model_selection import train_test_split


# First, split the data into input features (X) and target (y)

X = df.drop(columns='median_house_value')
y = df['median_house_value']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Now let's train our regression models**

For this example we will work with three:

1. LinearRegression
2. RandomForestRegressor
3. DecisionTreeRegressor

---
---

Since these are regression models, we will evaluate them with the following metrics:

1. MSE - Mean Squared Error
2. RMSE - Root Mean Squared Error
3. MAPE - Mean Absolute Percentage Error
4. R² - coefficient of determination (the code below uses `r2_score`, which is the plain R², not the adjusted version)

In [76]:
# Fit each regressor with default hyperparameters to create baseline models

from sklearn.linear_model import LinearRegression
from sklearn.tree         import DecisionTreeRegressor 
from sklearn.ensemble     import RandomForestRegressor
from sklearn.metrics      import mean_absolute_percentage_error , mean_squared_error ,r2_score

from math import sqrt


# create an instance of the model
models = [LinearRegression(),
          DecisionTreeRegressor(),
          RandomForestRegressor()
          ]


results ={}

for model in models:
    model_name = model.__class__.__name__
    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # Evaluate on the held-out test set
    mse_score  = mean_squared_error(y_test, y_preds)
    rmse_score = sqrt(mse_score)
    mape_score = mean_absolute_percentage_error(y_test, y_preds)  # a fraction, not a percent
    r_score    = r2_score(y_test, y_preds)

    # Store the results under the model's class name
    results[model_name] = {
        'MSE' : mse_score,
        'RMSE': rmse_score,
        'MAPE': mape_score,
        'R^2' : r_score
        }

results

{'LinearRegression': {'MSE': 5008649147.18831,
  'RMSE': 70771.8103992565,
  'MAPE': 0.295955219645154,
  'R^2': 0.6177796985249637},
 'DecisionTreeRegressor': {'MSE': 5255240751.692587,
  'RMSE': 72493.03933270136,
  'MAPE': 0.24656393348429215,
  'R^2': 0.5989617868196193},
 'RandomForestRegressor': {'MSE': 2544644770.5614924,
  'RMSE': 50444.47215068756,
  'MAPE': 0.18311246884144228,
  'R^2': 0.8058129322360537}}

In [81]:
# Remove the 'model_name' key if present
results.pop('model_name', None)

# Convert the cleaned results dictionary to a DataFrame
results_df = pd.DataFrame.from_dict(results, orient='index')
results_df

| | MSE | RMSE | MAPE | R^2 |
|---|---|---|---|---|
| LinearRegression | 5008649147.19 | 70771.81 | 0.3 | 0.62 |
| DecisionTreeRegressor | 5255240751.69 | 72493.04 | 0.25 | 0.6 |
| RandomForestRegressor | 2544644770.56 | 50444.47 | 0.18 | 0.81 |


Metrics Explained:


**Mean Squared Error (MSE)**
- Definition: The average squared difference between the predicted and actual values. Because errors are squared, large errors are penalized heavily and the result is in squared units of the target.
- Interpretation: A lower MSE means the model's predictions are closer to the actual values.

**Root Mean Squared Error (RMSE)**
- Definition: The square root of the MSE, which puts the error back in the same units as the target variable.
- Interpretation: RMSE gives a sense of how far off predictions typically are from the actual values. Lower is better.

**Mean Absolute Percentage Error (MAPE)**
- Definition: The average absolute error expressed relative to the actual value. scikit-learn's `mean_absolute_percentage_error` returns it as a fraction, so 0.30 means 30%.
- Interpretation: Lower values mean the predictions are closer to the actual values in relative terms.

**R-squared (R²)**
- Definition: The proportion of variance in the dependent variable that is predictable from the independent variables. It usually falls between 0 and 1, where 1 indicates perfect prediction, but it can be negative on held-out data when the model does worse than always predicting the mean.
- Interpretation: Higher R² values mean the model explains a larger proportion of the variance in the target variable.
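
The definitions above can be written out directly in NumPy. This is a minimal sketch for illustration: `regression_metrics` is a hypothetical helper, not a scikit-learn function, and the optional adjusted R² uses the standard formula with `n_features` as the number of predictors:

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features=None):
    """Hypothetical helper: compute MSE, RMSE, MAPE and R² from scratch."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    # Returned as a fraction; assumes y_true contains no zeros.
    mape = np.mean(np.abs(errors / y_true))

    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    out = {"MSE": mse, "RMSE": rmse, "MAPE": mape, "R^2": r2}
    if n_features is not None:
        n = len(y_true)
        # Adjusted R² penalizes extra predictors.
        out["adj_R^2"] = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return out

# Toy values, invented for illustration.
scores = regression_metrics([100.0, 200.0, 300.0, 400.0],
                            [110.0, 190.0, 310.0, 390.0])
print(scores)
```

Every prediction in the toy example is off by exactly 10, so the RMSE is 10 while the MAPE varies per row because the same absolute error is a larger share of a smaller actual value.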

Model Performance Interpretation:

1. LinearRegression
```
MSE: 5,008,649,147.19
RMSE: 70,771.81
MAPE: 0.30 (30%)
R²: 0.62
```
Interpretation:

- The Linear Regression model has a relatively high MSE and RMSE. An RMSE of about 70,772 means its predictions are typically off by roughly that amount, in the same units as `median_house_value`.
- A MAPE of 30% indicates that the predictions are off by 30% of the actual value on average, which is a substantial error.
- An R² of 0.62 means that approximately 62% of the variance in the target variable is explained by the model. This is decent, but there is room for improvement.

2. DecisionTreeRegressor
```
MSE: 5,255,240,751.69
RMSE: 72,493.04
MAPE: 0.25 (25%)
R²: 0.60
```
Interpretation:

- The Decision Tree Regressor has a slightly higher MSE and RMSE than Linear Regression, indicating somewhat larger prediction errors in absolute terms.
- A MAPE of 25% means the predictions are off by 25% of the actual value on average, an improvement over Linear Regression's 30%.
- An R² of 0.60 means the Decision Tree Regressor explains about 60% of the variance in the target variable, slightly less than Linear Regression.

3. RandomForestRegressor
```
MSE: 2,544,644,770.56
RMSE: 50,444.47
MAPE: 0.18 (18%)
R²: 0.81
```
Interpretation:

- The Random Forest Regressor has the lowest MSE and RMSE among the three models, indicating that its predictions are closest to the actual values on average.
- A MAPE of 18% is the lowest among the models, suggesting it is the most accurate in terms of percentage error.
- The R² of 0.81 means that 81% of the variance in the target variable is explained by the model, which is the highest among the models. This indicates that the Random Forest model performs the best in explaining the variability of the target variable.


Summary Conclusion:
- Best Performance: RandomForestRegressor stands out as the best-performing model based on all metrics: lowest MSE and RMSE, lowest MAPE, and highest R².
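
The comparison above rests on a single train/test split. A minimal sketch of a cross-validated comparison follows, assuming synthetic data from `make_regression` rather than the housing set (the `X_demo`, `y_demo`, and `cv_rmse` names are hypothetical, and the two models are only examples):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data, generated only for this illustration.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

cv_rmse = {}
for model in [LinearRegression(), DecisionTreeRegressor(random_state=0)]:
    # scikit-learn reports errors as negative scores so that higher is better.
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring="neg_root_mean_squared_error")
    cv_rmse[type(model).__name__] = -scores.mean()

print(cv_rmse)
```

Averaging over five folds gives a ranking that depends less on which rows happened to land in the test set.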


**Fine-Tune Our RandomForestRegressor**

In [83]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor


# Define Hyperparameter Grid: 
param_grid = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth':     [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [1.0, 'sqrt', 'log2'],  # 1.0 (all features) replaces 'auto', which scikit-learn 1.3 removed
    'bootstrap':  [True, False]
}

# Create the Model: Instantiate the RandomForestRegressor
rf = RandomForestRegressor(random_state=42)


# Set Up Grid Search or Random Search
# Grid Search: Tries all combinations in the parameter grid.
# Random Search: Randomly samples combinations and is often faster.

random_search = RandomizedSearchCV(estimator=rf, 
                                    param_distributions=param_grid, 
                                    n_iter=50,
                                    scoring='neg_mean_squared_error', 
                                    cv=5,
                                    n_jobs=-1,
                                    verbose=2,
                                    random_state=42)
random_search.fit(X_train, y_train)


print("Best Parameters:", random_search.best_params_)
print("Best Score:", -random_search.best_score_)  # Negative MSE score converted to positive


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'sqrt', 'max_depth': 30, 'bootstrap': False}
Best Score: 2390459714.2556105
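
To see why random search is the cheaper option here, the arithmetic can be written out. This minimal sketch only counts fits; the `grid_sizes` dictionary is a hypothetical restatement of how many options each hyperparameter has in the grid above:

```python
from math import prod

# Number of options per hyperparameter in the search space above.
grid_sizes = {
    "n_estimators": 4,
    "max_depth": 4,
    "min_samples_split": 3,
    "min_samples_leaf": 3,
    "max_features": 3,
    "bootstrap": 2,
}
cv_folds = 5  # cv=5
n_iter = 50   # n_iter=50

total_combinations = prod(grid_sizes.values())
grid_search_fits = total_combinations * cv_folds
random_search_fits = n_iter * cv_folds

print(total_combinations, grid_search_fits, random_search_fits)  # 864 4320 250
```

An exhaustive grid search would need 4,320 fits, while the random search samples 50 of the 864 combinations and needs 250, which matches the log line above.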


In [85]:
best_rf = random_search.best_estimator_
y_preds = best_rf.predict(X_test)

mse  = mean_squared_error(y_test, y_preds)
rmse = sqrt(mse)
r2   = r2_score(y_test, y_preds)

print("MSE:", mse)
print("RMSE:", rmse)
print("R^2:", r2)


MSE: 2376775118.54469
RMSE: 48752.18065425064
R^2: 0.8186234100948174


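Tuning gives a slight improvement on the test set: RMSE drops from about 50,444 to 48,752 (roughly 3%), and R² rises from 0.806 to 0.819.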

In [86]:
import joblib

# Save the best model
joblib.dump(best_rf, 'best_random_forest_model.pkl')


['best_random_forest_model.pkl']

In [87]:
# Load the model
loaded_model = joblib.load('best_random_forest_model.pkl')
loaded_model 
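
A quick way to check that a saved model survives the round trip is to compare predictions before and after loading. This is a minimal sketch on made-up data (`X_demo`, `y_demo`, and the `demo_model.pkl` file name are hypothetical), written to a temporary directory so it leaves no files behind:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data, generated only for this illustration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, path)      # save
    restored = joblib.load(path)  # load

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Pickle-based files should only be loaded from sources you trust, and ideally with the same scikit-learn version that saved them.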