
# Prediction of Product Sales
- Author: Anas Abu Alhaija


[Original dataset here](https://drive.google.com/file/d/1syH81TVrbBsdymLT_jl2JIf6IjPXtSQw/view)

## Loading Data

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [7]:
fpath = '/content/drive/MyDrive/CodingDojo/02-IntroML/Week05/Data/sales_predictions_2023.csv'
import pandas as pd
df = pd.read_csv(fpath)
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


## Imports

In [8]:
# Import the libraries and packages that we will need
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder,StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn import set_config
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
set_config(transform_output= 'pandas')
from sklearn.compose import ColumnTransformer
pd.set_option('display.max_columns',100)

### Duplicated Data
Checking for duplicated data:

In [9]:
rows_duplicated = df.duplicated().sum()
rows_duplicated

0

### Inspecting Categorical Columns & Addressing Inconsistent Values

In [10]:
cat_cols = df.select_dtypes('object').columns
cat_cols

Index(['Item_Identifier', 'Item_Fat_Content', 'Item_Type', 'Outlet_Identifier',
       'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')

In [11]:
for col in cat_cols:
  print(f'The value counts for {col}')
  print(df[col].value_counts())
  print('\n')

The value counts for Item_Identifier
FDW13    10
FDG33    10
NCY18     9
FDD38     9
DRE49     9
         ..
FDY43     1
FDQ60     1
FDO33     1
DRF48     1
FDC23     1
Name: Item_Identifier, Length: 1559, dtype: int64


The value counts for Item_Fat_Content
Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64


The value counts for Item_Type
Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64


The value counts for Outlet_Identifier
OUT027    935
OUT013    932
OUT049    93

- After further investigation of the categorical columns, there are inconsistencies with spellings of the following categories in the `Item_Fat_Content` column:
  - `LF` should be `Low Fat`
  - `low fat` should be `Low Fat`
  - `reg` should be `Regular`



In [12]:
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

In [13]:
df['Item_Fat_Content'] = df['Item_Fat_Content'].replace({'LF':'Low Fat','low fat':'Low Fat','reg':'Regular'})
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

### Identify the features (X) and target (y) and drop the "Item_Identifier" feature because it has very high cardinality.

In [14]:
y = df['Item_Outlet_Sales']
X = df.drop(columns= ['Item_Outlet_Sales','Item_Identifier'])
X.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1
1,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2
2,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1
3,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store
4,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1
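
The cardinality claim in the heading can be checked with `nunique()`. The following is a minimal sketch on a small toy frame modeled on the first rows shown above (the frame `toy` is illustrative only; on the real data, the earlier `value_counts` output reports 1559 distinct `Item_Identifier` values):

```python
import pandas as pd

# Toy frame modeled on df.head(); it stands in for the full dataset
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01", "FDN15", "FDX07", "NCD19"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular", "Low Fat"],
})

# Number of distinct values per column, highest first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)
```

A column with one level per product would add that many one-hot columns while carrying very few rows per level, which is why it is dropped rather than encoded.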


### Perform a train-test split

In [15]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state= 42)
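
Because `train_test_split` is called without `test_size`, scikit-learn's default of 25% test data applies. A small sketch with placeholder data of the same length as `df` (8523 rows; `X_demo` and `y_demo` are dummy stand-ins, not the real features) shows the resulting sizes:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the sales dataset
X_demo = pd.DataFrame({"feature": np.arange(8523)})
y_demo = pd.Series(np.arange(8523), name="target")

# No test_size is given, so the default 75/25 split is used
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))
```

This gives 6392 training rows and 2131 testing rows.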

### Preprocessing object to prepare the dataset for machine learning

In [16]:
# Create lists for each type of column: numerical, ordinal and nominal
num_cols = ['Item_Weight','Item_Visibility','Item_MRP']
ord_cols = ['Item_Fat_Content','Outlet_Size','Outlet_Location_Type']
cat_cols = ['Item_Type','Outlet_Identifier','Outlet_Establishment_Year','Outlet_Type']

In [17]:
# Check which columns contain null values and will need imputation
X_train.isna().sum()

Item_Weight                  1107
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  1812
Outlet_Location_Type            0
Outlet_Type                     0
dtype: int64

In [18]:
# We need to prepare numeric features for transformation through imputation and scaling
impute_mean = SimpleImputer(strategy='mean')
scaler_num = StandardScaler()
num_pipe = make_pipeline(impute_mean,scaler_num)
num_tuple = ('numeric',num_pipe,num_cols)
num_tuple

('numeric',
 Pipeline(steps=[('simpleimputer', SimpleImputer()),
                 ('standardscaler', StandardScaler())]),
 ['Item_Weight', 'Item_Visibility', 'Item_MRP'])

In [19]:
# We need to prepare ordinal features for transformation through imputation, encoding and scaling
impute_na = SimpleImputer(strategy='constant', fill_value='NA')
fat_order = ['Low Fat','Regular']
size_order = ['NA','Small','Medium','High']
location_order = ['Tier 1', 'Tier 2', 'Tier 3']
ordinal_cat_order = [fat_order,size_order,location_order]
ord_encoder = OrdinalEncoder(categories = ordinal_cat_order)
scaler_ord = StandardScaler()
ord_pipe = make_pipeline(impute_na,ord_encoder,scaler_ord)
ord_tuple = ('ordinal',ord_pipe,ord_cols)
ord_tuple

('ordinal',
 Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='NA', strategy='constant')),
                 ('ordinalencoder',
                  OrdinalEncoder(categories=[['Low Fat', 'Regular'],
                                             ['NA', 'Small', 'Medium', 'High'],
                                             ['Tier 1', 'Tier 2', 'Tier 3']])),
                 ('standardscaler', StandardScaler())]),
 ['Item_Fat_Content', 'Outlet_Size', 'Outlet_Location_Type'])
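
To illustrate what the `categories` argument does, here is a minimal sketch on a toy column (the frame `toy` is made up; only the label order is taken from `size_order` above). The position of each label in the list becomes its integer code:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column using the same labels as size_order above
toy = pd.DataFrame({"Outlet_Size": ["Medium", "NA", "High", "Small"]})

# The index of each label in this list is the code assigned to it
size_order = ["NA", "Small", "Medium", "High"]
encoder = OrdinalEncoder(categories=[size_order])
codes = np.asarray(encoder.fit_transform(toy)).ravel()
print(codes)
```

Placing `'NA'` first means a missing outlet size is treated as the lowest level, which is a modeling assumption rather than something the data guarantees.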

In [20]:
# We need to prepare nominal features for transformation through encoding process
ohe_encoder = OneHotEncoder(sparse_output= False , handle_unknown='ignore')
ohe_pipe = make_pipeline(ohe_encoder)
ohe_tuple = ('categorical',ohe_pipe,cat_cols)
ohe_tuple

('categorical',
 Pipeline(steps=[('onehotencoder',
                  OneHotEncoder(handle_unknown='ignore', sparse_output=False))]),
 ['Item_Type',
  'Outlet_Identifier',
  'Outlet_Establishment_Year',
  'Outlet_Type'])

In [21]:
# Instantiate the ColumnTransformer
preprocessor = ColumnTransformer([num_tuple,ord_tuple,ohe_tuple],verbose_feature_names_out=False)
preprocessor

In [22]:
# Fitting the ColumnTransformer on the training data only
preprocessor.fit(X_train)

In [23]:
# Transform the training data
X_train_processed = preprocessor.transform(X_train)
# Transform the testing data
X_test_processed = preprocessor.transform(X_test)
# Checking training data after pre-processing
X_train_processed.head()

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Item_Fat_Content,Outlet_Size,Outlet_Location_Type,Item_Type_Baking Goods,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,Item_Type_Frozen Foods,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods,Outlet_Identifier_OUT010,Outlet_Identifier_OUT013,Outlet_Identifier_OUT017,Outlet_Identifier_OUT018,Outlet_Identifier_OUT019,Outlet_Identifier_OUT027,Outlet_Identifier_OUT035,Outlet_Identifier_OUT045,Outlet_Identifier_OUT046,Outlet_Identifier_OUT049,Outlet_Establishment_Year_1985,Outlet_Establishment_Year_1987,Outlet_Establishment_Year_1997,Outlet_Establishment_Year_1998,Outlet_Establishment_Year_1999,Outlet_Establishment_Year_2002,Outlet_Establishment_Year_2004,Outlet_Establishment_Year_2007,Outlet_Establishment_Year_2009,Outlet_Type_Grocery Store,Outlet_Type_Supermarket Type1,Outlet_Type_Supermarket Type2,Outlet_Type_Supermarket Type3
4776,0.817249,-0.712775,1.828109,-0.740321,0.748125,1.084948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
7510,0.55634,-1.291052,0.603369,1.350766,0.748125,1.084948,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
5828,-0.131512,1.813319,0.244541,1.350766,0.748125,-1.384777,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5327,-1.169219,-1.004931,-0.952591,-0.740321,-0.26437,-0.149914,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
4810,1.528819,-0.965484,-0.33646,-0.740321,-1.276865,-0.149914,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [24]:
# show summary of statistics for numerical features after pre-processing
X_train_processed[num_cols].describe().round(2)

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP
count,6392.0,6392.0,6392.0
mean,0.0,-0.0,0.0
std,1.0,1.0,1.0
min,-1.98,-1.29,-1.77
25%,-0.81,-0.76,-0.76
50%,0.0,-0.23,0.03
75%,0.76,0.56,0.72
max,2.0,5.13,1.99
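
The mean of 0 and standard deviation of 1 in this summary follow from how `StandardScaler` works: each value has the column mean subtracted and is divided by the column standard deviation. A minimal sketch with made-up numbers (the array `x` is not from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for one numeric column
x = np.array([[5.0], [10.0], [15.0], [20.0]])

scaled = np.asarray(StandardScaler().fit_transform(x))

# Same result by hand: subtract the mean, divide by the population std (ddof=0)
manual = (x - x.mean()) / x.std()
print(np.allclose(scaled, manual))
```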


In [25]:
# show summary of statistics for ordinal features after pre-processing
X_train_processed[ord_cols].describe().round(2)

Unnamed: 0,Item_Fat_Content,Outlet_Size,Outlet_Location_Type
count,6392.0,6392.0,6392.0
mean,0.0,0.0,0.0
std,1.0,1.0,1.0
min,-0.74,-1.28,-1.38
25%,-0.74,-1.28,-1.38
50%,-0.74,-0.26,-0.15
75%,1.35,0.75,1.08
max,1.35,1.76,1.08


### Imputation of missing values occurs after the train-test split, using SimpleImputer

In [26]:
# Before using SimpleImputer
X_test.isna().sum().sum()

954

In [27]:
# After using SimpleImputer
X_test_processed.isna().sum().sum()

0
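
Imputing after the split matters because the fill value is then learned from the training rows only, so no information from the test set leaks into preprocessing. A minimal sketch with made-up numbers (the arrays `train` and `test` are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up weights; np.nan marks a missing value
train = np.array([[10.0], [20.0], [np.nan]])
test = np.array([[np.nan], [100.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(train)  # learns the mean from the training rows only

# The missing test value is filled with the training mean, not the test mean
test_filled = np.asarray(imputer.transform(test)).ravel()

print(imputer.statistics_)
print(test_filled)
```

Here the training mean is 15, so the missing test value becomes 15 even though the only observed test value is 100.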

## Modeling

In [28]:
linreg = LinearRegression()
linreg_pipe = make_pipeline(preprocessor,linreg)
linreg_pipe

In [29]:
linreg_pipe.fit(X_train,y_train)

In [30]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
def regression_metrics(y_true, y_pred, label='', verbose = True, output_dict=False):
  # Get metrics
  mae = mean_absolute_error(y_true, y_pred)
  mse = mean_squared_error(y_true, y_pred)
  rmse = mse ** 0.5  # square root of MSE; the `squared` argument is deprecated in recent scikit-learn releases
  r_squared = r2_score(y_true, y_pred)
  if verbose:
    # Print Result with Label and Header
    header = "-"*60
    print(header, f"Regression Metrics: {label}", header, sep='\n')
    print(f"- MAE = {mae:,.3f}")
    print(f"- MSE = {mse:,.3f}")
    print(f"- RMSE = {rmse:,.3f}")
    print(f"- R^2 = {r_squared:,.3f}")
  if output_dict:
      metrics = {'Label':label, 'MAE':mae,
                 'MSE':mse, 'RMSE':rmse, 'R^2':r_squared}
      return metrics

def evaluate_regression(reg, X_train, y_train, X_test, y_test, verbose = True,
                        output_frame=False):
  # Get predictions for training data
  y_train_pred = reg.predict(X_train)

  # Call the helper function to obtain regression metrics for training data
  results_train = regression_metrics(y_train, y_train_pred, verbose = verbose,
                                     output_dict=output_frame,
                                     label='Training Data')
  print()
  # Get predictions for test data
  y_test_pred = reg.predict(X_test)
  # Call the helper function to obtain regression metrics for test data
  results_test = regression_metrics(y_test, y_test_pred, verbose = verbose,
                                  output_dict=output_frame,
                                    label='Test Data' )

  # Store results in a dataframe if output_frame is True
  if output_frame:
    results_df = pd.DataFrame([results_train,results_test])
    # Set the label as the index
    results_df = results_df.set_index('Label')
    # Set index.name to none to get a cleaner looking result
    results_df.index.name=None
    # Return the dataframe
    return results_df.round(3)



In [31]:
evaluate_regression(linreg_pipe,X_train,y_train,X_test,y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 847.126
- MSE = 1,297,558.140
- RMSE = 1,139.104
- R^2 = 0.562

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 804.118
- MSE = 1,194,345.108
- RMSE = 1,092.861
- R^2 = 0.567


The linear regression model is underfit: R^2 is low and nearly identical on the training data (0.562) and the testing data (0.567).

### Random Forest Model

### Default Model

In [32]:
rf = RandomForestRegressor(random_state= 42)
rf_pipe = make_pipeline(preprocessor,rf)
rf_pipe.fit(X_train,y_train)

In [33]:
evaluate_regression(rf_pipe,X_train,y_train,X_test,y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 296.391
- MSE = 183,304.915
- RMSE = 428.141
- R^2 = 0.938

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 765.240
- MSE = 1,212,988.989
- RMSE = 1,101.358
- R^2 = 0.560



The default random forest model is overfit: R^2 is 0.938 on the training data but only 0.560 on the testing data.

On the testing data, the linear regression model performs best so far: it scores better than the default random forest on every metric except MAE.

### Improving the random forest model by tuning two hyperparameters

In [34]:
rf_pipe.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('numeric',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer()),
                                                    ('standardscaler',
                                                     StandardScaler())]),
                                    ['Item_Weight', 'Item_Visibility',
                                     'Item_MRP']),
                                   ('ordinal',
                                    Pipeline(steps=[('simpleimputer',
                                                     SimpleImputer(fill_value='NA',
                                                                   strategy='constant')),
                                                    ('ordinalencoder',
                                                     OrdinalEncoder(categories=[['Low '
                                             

In [35]:
params = {'randomforestregressor__max_depth': [None,10,15,20],
          'randomforestregressor__n_estimators':[10,100,150,200] }

In [36]:
gridsearch = GridSearchCV(rf_pipe, params, n_jobs=-1, cv = 3, verbose=1)
gridsearch.fit(X_train, y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


In [37]:
gridsearch.best_params_

{'randomforestregressor__max_depth': 10,
 'randomforestregressor__n_estimators': 100}
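
The parameter names in the grid follow the `<step name>__<parameter>` convention, where `make_pipeline` names each step after its lowercased class name. Since no `scoring` argument is passed, the search ranks candidates by the pipeline's own `score` method, which is R^2 for a regressor. The sketch below shows how the per-candidate scores can be inspected. It uses synthetic data and a smaller grid so that it runs quickly; `X_demo`, `y_demo`, `pipe` and `grid` are illustrative stand-ins, not the notebook's objects:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the sales dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=42)

pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))

# Forest parameters are addressed as 'randomforestregressor__<param>'
grid = {
    "randomforestregressor__max_depth": [None, 5],
    "randomforestregressor__n_estimators": [10, 20],
}
search = GridSearchCV(pipe, grid, cv=3)
search.fit(X_demo, y_demo)

# One row per candidate; mean_test_score is the cross-validated R^2
results = pd.DataFrame(search.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
```

Looking at the full table, rather than only `best_params_`, shows how close the runner-up settings are to the winner.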

In [38]:
best_model = gridsearch.best_estimator_
evaluate_regression(best_model,X_train,y_train,X_test,y_test)

------------------------------------------------------------
Regression Metrics: Training Data
------------------------------------------------------------
- MAE = 643.470
- MSE = 825,957.831
- RMSE = 908.822
- R^2 = 0.721

------------------------------------------------------------
Regression Metrics: Test Data
------------------------------------------------------------
- MAE = 737.283
- MSE = 1,125,348.763
- RMSE = 1,060.825
- R^2 = 0.592


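The model improved after tuning the hyperparameters, as the test metrics show: R^2 rose from 0.560 to 0.592 and MAE fell from 765.240 to 737.283.

I recommend using the tuned random forest model because it has the best metrics on the testing data.

R^2 measures how much of the variance in the target the model explains. A value of 1 means the model is perfect, and a value of 0 means it does no better than always predicting the mean; values close to zero (or negative, which can happen on test data) indicate a very poor model.

In our model, test R^2 = 0.592, so the model explains about 59% of the variance in sales. That is considered good.

MAE is the average of the absolute errors in the predictions.

In our model, test MAE = 737.283: if we predict the sales of a product, the prediction can be off by about $737 on average.

MAE is an easily interpretable metric for non-technical audiences.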
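
Both metrics can be computed by hand, which makes their meaning concrete. A minimal sketch with made-up actual and predicted values (the arrays are illustrative and unrelated to the sales data):

```python
import numpy as np

# Made-up actual and predicted sales, for illustration only
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

# MAE: mean of the absolute errors
mae = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, r2)
```

In this toy case the errors are 10, 10, 30 and 30, so MAE is 20, and the residual sum of squares (2000) is 4% of the total sum of squares (50000), so R^2 is 0.96.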


MAE training data = 643.470

MAE testing data = 737.283


R^2 training data = 0.721

R^2 testing data = 0.592

### The tuned model is still overfit: it scores noticeably better on the training data than on the testing data