<a href="https://colab.research.google.com/github/DButmeh/-Sales-Prediction-Project/blob/main/Project_1_Part_4_(Core).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  **"Machine Learning - Training the Models"**

### **Dina Al Botmeh**

--------
-----


##The goal of this step is to help the retailer by using machine learning to make predictions about future sales based on the data provided.

# ⭐️ **Separate your data...**
>[Click here](#new3) to jump to assignment's solution.



# ⭐️ **CRISP-DM Phase 4-Modeling**
>[Click here](#new5) to jump to assignment's solution.

In [None]:
#Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import packages
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.max_columns',100)
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import RandomForestRegressor # NEW
# Set pandas as the default output for sklearn
from sklearn import set_config
set_config(transform_output='pandas')

#Define Custom Functions

In [None]:
def regression_metrics(y_true, y_pred, label='', verbose = True, output_dict=False):
  # Get metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  # RMSE is the square root of MSE (the squared=False argument is deprecated in newer scikit-learn releases)
  rmse = np.sqrt(mse)
  r_squared = r2_score(y_true, y_pred)
  if verbose == True:
    # Print Result with Label and Header
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict == True:
      metrics = {'Label':label, 'MAE':mae,
                 'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
      return metrics

def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  # Get predictions for training data
  y_train_pred = reg.predict(X_train)

  # Call the helper function to obtain regression metrics for training data
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = reg.predict(X_test)
  # Call the helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                  output_dict=output_frame,
                                    label='Test Data' )

  # Store results in a dataframe if output_frame is True
  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    # Set the label as the index
    results_df = results_df.set_index('Label')
    # Set index.name to none to get a cleaner looking result
    results_df.index.name=None
    # Return the dataframe
    return results_df.round(2)
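
As a quick check on what the helper functions report, the four metrics can be computed by hand on a tiny made-up example. The arrays below are illustrative values, not rows from the sales data, and the `demo_` names are hypothetical.

```python
import numpy as np

# Hypothetical true and predicted values (not taken from the dataset)
demo_true = np.array([100.0, 200.0, 300.0, 400.0])
demo_pred = np.array([110.0, 190.0, 320.0, 380.0])

demo_errors = demo_true - demo_pred
demo_mae = np.mean(np.abs(demo_errors))   # mean absolute error
demo_mse = np.mean(demo_errors ** 2)      # mean squared error
demo_rmse = np.sqrt(demo_mse)             # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(demo_errors ** 2)
ss_tot = np.sum((demo_true - demo_true.mean()) ** 2)
demo_r2 = 1 - ss_res / ss_tot

print(f"MAE={demo_mae:.3f}  MSE={demo_mse:.3f}  RMSE={demo_rmse:.3f}  R^2={demo_r2:.3f}")
```

MAE and RMSE are in the same units as the target, which is why they are compared against the target's summary statistics later in the notebook.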


##Load data

In [None]:
# Read in the data with Pandas
fpath = "/content/drive/MyDrive/CodingDojo/ project sales predictions/sales_predictions_2023.csv"
df=pd.read_csv(fpath)
# Display the first 5 rows
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [None]:
#Check the data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


###No dtypes to convert
  - 4 float
  - 7 object
  - 1 int
------------------------

In [None]:
#Count duplicate rows
df.duplicated().sum()

0

In [None]:
#Check for null value
null_sums=df.isna().sum()
null_sums

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
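
To put the null counts in context, a small sketch can express them as a share of the 8,523 rows. The counts are copied from the output above; the variable names are hypothetical.

```python
# Counts copied from the df.info() and isna().sum() outputs above
n_rows = 8523
missing_counts = {"Item_Weight": 1463, "Outlet_Size": 2410}

# Percentage of rows with a missing value in each column
missing_pct = {col: round(100 * n / n_rows, 1) for col, n in missing_counts.items()}
print(missing_pct)
```

Both columns are missing too many values to drop the affected rows, which is why the pipelines later in the notebook impute them instead.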

### Find and fix any inconsistent categories

In [None]:
#select object columns
col_ob=df.select_dtypes("object").columns
col_ob

Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [None]:
#for loop to Check each string column
for col in col_ob:
  print(f"value counts for {col}")
  print(df[col].value_counts())
  print("\n")

value counts for Item_Identifier
Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: count, Length: 1559, dtype: int64


value counts for Item_Fat_Content
Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: count, dtype: int64


value counts for Item_Type
Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: count, dtype: int64


value counts for Outlet_Identifier
Outlet_Identifier
OUT027    935
OUT013

In [None]:
# Standardize the values in "Item_Fat_Content" column
df["Item_Fat_Content"]=df["Item_Fat_Content"].replace({"Low Fat":"LF",
                                                       "Regular":"Reg",
                                                       "low fat":"LF",
                                                       "reg":"Reg"})
df["Item_Fat_Content"].value_counts()

Item_Fat_Content
LF     5517
Reg    3006
Name: count, dtype: int64
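
An alternative to listing every spelling is to normalize the case first and then map the remaining variants. This is only a sketch on a made-up Series, not the approach used above, and `demo_fat` and `fat_map` are hypothetical names.

```python
import pandas as pd

# Made-up sample containing every spelling seen in the value counts
demo_fat = pd.Series(["Low Fat", "LF", "low fat", "Regular", "reg"])

# Lower-case first, so each category needs fewer dictionary entries
fat_map = {"low fat": "LF", "lf": "LF", "regular": "Reg", "reg": "Reg"}
demo_fat_clean = demo_fat.str.lower().map(fat_map)

print(demo_fat_clean.tolist())
```

Note that `map` returns NaN for any value missing from the dictionary, so an unexpected spelling would show up as a new null rather than pass through silently.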

In [None]:
#Check the numeric columns for implausible values
df.describe().round(2)

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,7060.0,8523.0,8523.0,8523.0,8523.0
mean,12.86,0.07,140.99,1997.83,2181.29
std,4.64,0.05,62.28,8.37,1706.5
min,4.56,0.0,31.29,1985.0,33.29
25%,8.77,0.03,93.83,1987.0,834.25
50%,12.6,0.05,143.01,1999.0,1794.33
75%,16.85,0.09,185.64,2004.0,3101.3
max,21.35,0.33,266.89,2009.0,13086.96


In [None]:
target_describe=df["Item_Outlet_Sales"].describe().round(2)
target_describe

count     8523.00
mean      2181.29
std       1706.50
min         33.29
25%        834.25
50%       1794.33
75%       3101.30
max      13086.96
Name: Item_Outlet_Sales, dtype: float64

 <a name='new3'></a>
# ⭐️**Separate your data...**

#Separate your data into the feature matrix (X) and the target vector (y)

In [None]:
# drop the "Item_Identifier" feature because it has very high cardinality.
df=df.drop(columns="Item_Identifier")

In [None]:
#Define the feature matrix (X) and the target vector (y)
X=df.drop(columns="Item_Outlet_Sales")
y=df['Item_Outlet_Sales']

In [None]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=42)
X_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
4776,16.35,LF,0.029565,Household,256.4646,OUT018,2009,Medium,Tier 3,Supermarket Type2
7510,15.25,Reg,0.0,Snack Foods,179.766,OUT018,2009,Medium,Tier 3,Supermarket Type2
5828,12.35,Reg,0.158716,Meat,157.2946,OUT049,1999,Medium,Tier 1,Supermarket Type1
5327,7.975,LF,0.014628,Baking Goods,82.325,OUT035,2004,Small,Tier 2,Supermarket Type1
4810,19.35,LF,0.016645,Frozen Foods,120.9098,OUT045,2002,,Tier 2,Supermarket Type1
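
Because `train_test_split` is called without `test_size`, scikit-learn falls back to its default of 25% for the test set. The sketch below reproduces the resulting sizes on placeholder arrays with the same number of rows as the dataset; the `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the sales dataset
demo_X = np.arange(8523).reshape(-1, 1)
demo_y = np.arange(8523)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=42
)
n_train, n_test = len(demo_X_train), len(demo_X_test)
print(f"train rows: {n_train}, test rows: {n_test}")
```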


#For categorical (nominal) pipeline:

In [None]:
# Defining list of nominal features
ohe_cols = X_train.select_dtypes('object').columns.drop("Outlet_Size")
# Instantiate the imputer with the desired strategy
impute_na = SimpleImputer(strategy='constant', fill_value='Missing')
# Instantiate one hot encoder
ohe_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
# Instantiate the pipeline
ohe_pipe = make_pipeline(impute_na, ohe_encoder)
ohe_pipe

##**For the ordinal pipeline:**

In [None]:
# Defining lists of ordinal features
ord_cols = ['Outlet_Size']

In [None]:
# Instantiate the imputer object from the SimpleImputer class with strategy 'most_frequent'
impute_most_freq= SimpleImputer(strategy='most_frequent')
## Making the OrdinalEncoder
# Define the ordered list of categories, from smallest to largest
OutLet_order =["Small", "Medium" ,"High" ]
# Instantiate an OrdinalEncoder
ord_encoder=OrdinalEncoder(categories=[OutLet_order])
# Making a final scaler to scale category #'s
scaler_ord = StandardScaler()
## Making an ord_pipe
ord_pipe = make_pipeline(impute_most_freq, ord_encoder, scaler_ord)
ord_pipe
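
The explicit category list controls which integer each size receives. A small sketch on made-up rows shows the mapping (Small = 0, Medium = 1, High = 2) before any scaling is applied; the `demo_` names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows in arbitrary order
demo_sizes = pd.DataFrame({"Outlet_Size": ["High", "Small", "Medium"]})

# Codes follow the position of each label in the categories list
demo_ord = OrdinalEncoder(categories=[["Small", "Medium", "High"]])
demo_codes = demo_ord.fit_transform(demo_sizes).ravel().tolist()
print(demo_codes)
```

Without the explicit list, the encoder would sort the labels alphabetically (High, Medium, Small), which would not reflect the real size order.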

##For the numeric features pipeline:

In [None]:
# Defining lists of types of features
num_cols = X_train.select_dtypes("number").columns
# Instantiate the imputer object from the SimpleImputer class with strategy 'median'
impute_median = SimpleImputer(strategy='median')
scaler = StandardScaler()
num_pipe = make_pipeline(impute_median, scaler)
num_pipe

##Create a tuple for each transformer

In [None]:
# Making a ohe_tuple for ColumnTransformer
ohe_tuple = ('categorical', ohe_pipe, ohe_cols)
ohe_tuple

('categorical',
 Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='Missing', strategy='constant')),
                 ('onehotencoder',
                  OneHotEncoder(handle_unknown='ignore', sparse_output=False))]),
 Index(['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
        'Outlet_Location_Type', 'Outlet_Type'],
       dtype='object'))

In [None]:
# Making an ordinal_tuple for ColumnTransformer
ord_tuple = ('ordinal', ord_pipe, ord_cols)
ord_tuple

('ordinal',
 Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='most_frequent')),
                 ('ordinalencoder',
                  OrdinalEncoder(categories=[['Small', 'Medium', 'High']])),
                 ('standardscaler', StandardScaler())]),
 ['Outlet_Size'])

In [None]:
# Making a numeric tuple for ColumnTransformer
num_tuple = ('numeric', num_pipe, num_cols)
num_tuple

('numeric',
 Pipeline(steps=[('simpleimputer', SimpleImputer(strategy='median')),
                 ('standardscaler', StandardScaler())]),
 Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
        'Outlet_Establishment_Year'],
       dtype='object'))

------------------------------------
------------------------------------------
#The goal is to help the retailer understand the properties of products and outlets that play crucial roles in predicting sales.

-----------------------------------------
------------------------------------------
 <a name='new5'></a>
# ⭐️**CRISP-DM Phase 4-Modeling**

##Create one column transformer object that includes the 3 transformer tuples

In [None]:
# Define a column transformer
preprocessor = ColumnTransformer([num_tuple, ord_tuple, ohe_tuple],verbose_feature_names_out=False)

In [None]:
# Instantiate a linear regression model
reg=LinearRegression()
# Combine the preprocessing ColumnTransformer and the linear regression model in a Pipeline
linreg_pipe = make_pipeline(preprocessor, reg)
# Fit the model on the training data
linreg_pipe.fit(X_train,y_train)
evaluate_regression(linreg_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 847.131
- MSE = 1,297,556.865
- RMSE = 1,139.104
- R^2 = 0.562

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 804.089
- MSE = 1,194,326.602
- RMSE = 1,092.853
- R^2 = 0.567


In [None]:
#Display the target summary for comparison
target_describe

count     8523.00
mean      2181.29
std       1706.50
min         33.29
25%        834.25
50%       1794.33
75%       3101.30
max      13086.96
Name: Item_Outlet_Sales, dtype: float64

#Linear Regression Model Observations
- The model performs moderately and almost identically on both sets: training R^2 = 0.56 and test R^2 = 0.57, so it explains about 57% of the variance in the test target and shows no sign of overfitting.
- The test MAE is about 804 (in the units of `Item_Outlet_Sales`), which is close to the 25th percentile of the target (834.25), so the typical error is large relative to low-selling items.
- We will explore other models to see how they perform.
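
The comparison between the test MAE and the target distribution can be made explicit with a little arithmetic. The numbers are copied from the outputs above; the variable names are hypothetical.

```python
# Values copied from the linear regression test metrics and target_describe
linreg_test_mae = 804.089
target_mean = 2181.29
target_q25 = 834.25

# Express the typical error as a share of each reference value
mae_vs_mean = linreg_test_mae / target_mean
mae_vs_q25 = linreg_test_mae / target_q25

print(f"MAE as a share of the mean: {mae_vs_mean:.1%}")
print(f"MAE as a share of the 25th percentile: {mae_vs_q25:.1%}")
```

An average error of roughly a third of the mean sale, and almost the whole of a low-quartile sale, supports reading this model as only moderately useful.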

# Random Forest model to predict sales

##Default Random Forest model.

In [None]:
# Instantiate default random forest model
rf = RandomForestRegressor(random_state = 42)
# Model pipeline with the preprocessor and the random forest
rf_pipe = make_pipeline(preprocessor, rf)
# Fit the model pipeline on the training data only
rf_pipe.fit(X_train, y_train)
# Use custom function to evaluate default model
evaluate_regression(rf_pipe, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 296.124
- MSE = 182,241.944
- RMSE = 426.898
- R^2 = 0.938

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 765.671
- MSE = 1,213,934.180
- RMSE = 1,101.787
- R^2 = 0.560


#Random Forest Model Observations
- The model fits the training data far better (R^2 = 0.94) than the linear regression model, but its test performance is still moderate (R^2 = 0.56, about 56% of the variance explained). The large gap between training and test scores indicates overfitting.
- The test MAE of about 766 is a minor improvement over the linear regression test MAE of 804, but it is still not acceptable given that it is close to the 25th percentile of `Item_Outlet_Sales` (834.25).
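
One simple way to quantify overfitting is the difference between training and test R^2. The sketch below uses the scores reported above; the dictionary and variable names are hypothetical.

```python
# R^2 values copied from the evaluation outputs above
r2_scores = {
    "Linear Regression": {"train": 0.562, "test": 0.567},
    "Default Random Forest": {"train": 0.938, "test": 0.560},
}

# A large positive gap means the model fits the training data
# much better than unseen data
r2_gap = {name: round(s["train"] - s["test"], 3) for name, s in r2_scores.items()}
for name, gap in r2_gap.items():
    print(f"{name}: train - test R^2 = {gap:+.3f}")
```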


#Use GridSearchCV to tune the Random Forest model

In [None]:
# Parameters for tuning
rf_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('numeric',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(strategy='median')),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
          'Outlet_Establishment_Year'],
         dtype='object')),
                                   ('ordinal',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(strategy='most_frequent')),
                                                    ('ordinalencoder',
                                                     Ord...
                                                     StandardScaler())]),
                           

---------------
##Attempt 1 (discarded)
-----------------------

In [None]:
# Define param grid with options to try
# Note: a float min_samples_leaf (.5) is read as a fraction of the training samples, not a count
params2 = {'randomforestregressor__max_depth': [None,10,15,20],
          'randomforestregressor__n_estimators':[10,50,100,150,200],
          'randomforestregressor__min_samples_leaf':[.5,1,2,3,4]}

In [None]:
# Instantiate the gridsearch
gridsearch2 = GridSearchCV(rf_pipe, params2, n_jobs=-1, verbose=1)
# Fit the gridsearch on training data
gridsearch2.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [None]:
gridsearch2.best_params_

{'randomforestregressor__max_depth': 10,
 'randomforestregressor__min_samples_leaf': 1,
 'randomforestregressor__n_estimators': 200}

In [None]:
# Retrieve the best model (GridSearchCV has already refit it on the full training data)
best_rf = gridsearch2.best_estimator_
evaluate_regression(best_rf, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 642.566
- MSE = 822,456.745
- RMSE = 906.894
- R^2 = 0.722

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 739.702
- MSE = 1,132,628.235
- RMSE = 1,064.250
- R^2 = 0.589


--------------------
# Attempt 2 (accepted)
------------------------

In [None]:
# Define param grid with options to try
# Note: oob_score only toggles an extra out-of-bag diagnostic; it does not change the fitted trees
params = {'randomforestregressor__max_depth': [None,10,15,20],
          'randomforestregressor__n_estimators':[10,100,150,200],
          'randomforestregressor__min_samples_leaf':[2,3,4],
          'randomforestregressor__max_features':['sqrt','log2',None],
          'randomforestregressor__oob_score':[True,False],
          }

In [None]:
# Instantiate the gridsearch
gridsearch2 = GridSearchCV(rf_pipe, params, n_jobs=-1, verbose=1)
# Fit the gridsearch on training data
gridsearch2.fit(X_train, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits
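
The candidate counts reported by both searches follow directly from the grid sizes. As a sketch, scikit-learn's `ParameterGrid` can count the combinations before a search is launched; the names `first_grid`, `second_grid`, and `candidate_counts` are hypothetical, and the values are copied from the two grids in this notebook.

```python
from sklearn.model_selection import ParameterGrid

first_grid = {
    "randomforestregressor__max_depth": [None, 10, 15, 20],
    "randomforestregressor__n_estimators": [10, 50, 100, 150, 200],
    "randomforestregressor__min_samples_leaf": [0.5, 1, 2, 3, 4],
}
second_grid = {
    "randomforestregressor__max_depth": [None, 10, 15, 20],
    "randomforestregressor__n_estimators": [10, 100, 150, 200],
    "randomforestregressor__min_samples_leaf": [2, 3, 4],
    "randomforestregressor__max_features": ["sqrt", "log2", None],
    "randomforestregressor__oob_score": [True, False],
}

cv_folds = 5  # GridSearchCV's default when cv is not specified
candidate_counts = {}
for name, grid in [("first", first_grid), ("second", second_grid)]:
    n_candidates = len(ParameterGrid(grid))
    candidate_counts[name] = n_candidates
    print(f"{name} grid: {n_candidates} candidates, {n_candidates * cv_folds} fits")
```

Counting first is a cheap way to estimate run time, since every extra option multiplies the total number of fits.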


In [None]:
# Retrieve the best model (GridSearchCV has already refit it on the full training data)
best_rf = gridsearch2.best_estimator_
evaluate_regression(best_rf, X_train, y_train, X_test, y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 645.997
- MSE = 841,513.967
- RMSE = 917.341
- R^2 = 0.716

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 738.982
- MSE = 1,130,203.386
- RMSE = 1,063.110
- R^2 = 0.590


# Tuned Random Forest Model Observations
##This model performs fairly well on the training set, and the test R^2 improved from 0.56 (default) to 0.59 (tuned).
- The training R^2 dropped from 0.94 (default random forest) to 0.72, which indicates that the tuned model overfits less.
- Tuning the random forest improved the test R^2 to 0.59, which is still a poor level of prediction.
- The test MAE is about 739, a minor improvement over the default model (766).

-------------------------------------------------------
-------------------------------------------------------------
----------------------------------------------------

#CRISP-DM Phase 5 - Evaluation

#Machine Learning Using the Following Models:
- Linear Regression Model
- Random Forest Regressor Model
- Tuned Random Forest Regressor Model

#**Models Evaluated & Results**
------------------------------------------------------------
#**Linear Regression Model :**
------------------------------------------------------------
###**Regression Metrics: Training Data**
- MAE = 847.131
- MSE = 1,297,556.865
- RMSE = 1,139.104
- R^2 = 0.562

###**Regression Metrics: Test Data**

- MAE = 804.089
- MSE = 1,194,326.602
- RMSE = 1,092.853
- R^2 = 0.567

----------------------------------------
#**Default Random Forest  Model:**
------------------------------------------
### **Regression Metrics: Training Data**
- MAE = 296.124
- MSE = 182,241.944
- RMSE = 426.898
- R^2 = 0.938

### **Regression Metrics: Test Data**

- MAE = 765.671
- MSE = 1,213,934.180
- RMSE = 1,101.787
- R^2 = 0.560
----------------------------------------
#**Tuned Random Forest  Model:**
------------------------------------------
###**Regression Metrics: Training Data**
- MAE = 645.997
- MSE = 841,513.967
- RMSE = 917.341
- R^2 = 0.716

###**Regression Metrics: Test Data**
- MAE = 738.982
- MSE = 1,130,203.386
- RMSE = 1,063.110
- R^2 = 0.590
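
The per-model results listed above can also be gathered into one table for side-by-side comparison. This is a sketch that copies the reported numbers into a DataFrame; `summary`, `best_by_mae`, and `best_by_r2` are hypothetical names.

```python
import pandas as pd

# Metrics copied from the evaluation outputs of the three models
summary = pd.DataFrame(
    {
        "Train R^2": [0.562, 0.938, 0.716],
        "Test R^2": [0.567, 0.560, 0.590],
        "Test MAE": [804.089, 765.671, 738.982],
        "Test RMSE": [1092.853, 1101.787, 1063.110],
    },
    index=["Linear Regression", "Default Random Forest", "Tuned Random Forest"],
)

# Lower MAE is better; higher R^2 is better
best_by_mae = summary["Test MAE"].idxmin()
best_by_r2 = summary["Test R^2"].idxmax()

print(summary)
print(f"Lowest test MAE: {best_by_mae}")
print(f"Highest test R^2: {best_by_r2}")
```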

----------------------------------------------------------
#**Item Outlet Sales Describe(target describe)**
-------------------------------------------
- count= 8523.00   
- mean  =    2181.29
- std   =    1706.50
- min    =     33.29
- 25%   =     834.25
- 50%     =  1794.33
- 75%    =   3101.30
- max   =   13086.96
--------------------------------------

- On the test set, the tuned model explains 59% of the variance in y (R^2 = 0.590).

- The Mean Absolute Error shows that predictions are off by about 739 on average.

- The Mean Squared Error was 1,130,203.

- The Root Mean Squared Error was about 1,063.
----------------------------------------------------
--------------------------------------
### - The tuned Random Forest model performs better on the test data than the default Random Forest and Linear Regression models.
### - The test MAE of 739 is still close to the 25th percentile of item outlet sales (834.25), which is a large error, especially given that the minimum value is only 33.29.
------------------------------------------------

#**Using This Model** to predict item outlet sales, or to choose which outlets would earn the highest sales, would not be very reliable. Based on the regression metrics above, the default random forest shows a large disparity between training and test performance, while the other models perform only moderately on both sets.
-------------------------------------------------------
---------------------------------------------

