# Big Sales Prediction using Random Forest ML Model

---------------------------------------------------------------------------------------------------------

## Objective

1. Introduction
2. Importing Library
3. Importing Sales Data
4. Exploratory Data Analysis
5. Data Preprocessing
6. Data Visualization
7. Defining y and X variables for RF model
8. RandomForest Modelling
9. Model Output

-------------------------------------------------------------------------------------------------------

## Introduction

In this project, I've used a large dataset of sales transactions to predict outlet sales using a Random Forest Regression model. By analyzing features such as item weight, visibility, type, maximum retail price, outlet details, and more, I aim to understand how well these factors can predict sales across different outlets.

## Importing Library

In [None]:
import pandas as pd

In [None]:
import numpy as np

## Importing Sales Dataset

In [None]:
df = pd.read_csv('https://github.com/YBIFoundation/Dataset/raw/main/Big%20Sales%20Data.csv')

## Exploratory Data Analysis

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.columns

In [None]:
df.describe()

## Data Preprocessing

In [None]:
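# Fill missing weights with the median weight of the same Item_Type.
# Assigning the result back avoids a chained inplace call, which newer pandas versions may not apply to df.
df['Item_Weight'] = df['Item_Weight'].fillna(df.groupby('Item_Type')['Item_Weight'].transform('median'))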
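
To make the group-wise imputation concrete, here is a minimal sketch on a made-up frame (the values are hypothetical and are not taken from the sales data). Each missing weight is replaced by the median weight of its own `Item_Type`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing weight per item type
toy = pd.DataFrame({
    'Item_Type': ['Dairy', 'Dairy', 'Dairy', 'Meat', 'Meat', 'Meat'],
    'Item_Weight': [10.0, np.nan, 14.0, 5.0, 7.0, np.nan],
})

# Median weight of each row's Item_Type, aligned to the original index
group_median = toy.groupby('Item_Type')['Item_Weight'].transform('median')

# Fill only the missing entries and assign the result back to the column
toy['Item_Weight'] = toy['Item_Weight'].fillna(group_median)
print(toy)
```

In this toy case the missing Dairy weight becomes 12.0 and the missing Meat weight becomes 6.0.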

In [None]:
df.info()

## Describing Data

In [None]:
df.describe()

## Removing Outliers

In [None]:
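# Remove outliers: keep rows whose sales lie within 2 standard deviations of the mean
from scipy import stats
# .copy() makes the filtered frame independent, so the later in-place replacements act on it cleanly
df = df[np.abs(stats.zscore(df['Item_Outlet_Sales'])) < 2].copy()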
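
The z-score filter can be sketched on a small made-up column (hypothetical numbers, not the real sales). I compute the z-score by hand here with `ddof=0`, which matches the default of `scipy.stats.zscore`.

```python
import pandas as pd

# Hypothetical sales values: eight ordinary rows and one extreme row
toy = pd.DataFrame({'Item_Outlet_Sales': [100.0, 110.0, 95.0, 105.0, 98.0,
                                          102.0, 99.0, 101.0, 1000.0]})

sales = toy['Item_Outlet_Sales']
# Population z-score (ddof=0): distance from the mean in standard deviations
z = (sales - sales.mean()) / sales.std(ddof=0)

# Keep rows within 2 standard deviations of the mean
filtered = toy[z.abs() < 2].copy()
print(len(toy), '->', len(filtered))
```

Only the extreme row is dropped. Note that extreme values also inflate the mean and standard deviation used for the cutoff, so the threshold of 2 is a judgment call rather than a fixed rule.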

## Data Visualization

In [None]:
import seaborn as sns
sns.pairplot(df)

## Counts of Categorical Values available in Dataset

In [None]:
df[['Item_Identifier']].value_counts()

In [None]:
df[['Item_Fat_Content']].value_counts()

In [None]:
df.replace({'Item_Fat_Content': {'LF':'Low Fat','reg':'Regular', 'low fat':'Low Fat'}}, inplace=True)

In [None]:
df[['Item_Fat_Content']].value_counts()

In [None]:
df.replace({'Item_Fat_Content': {'Low Fat': 0,'Regular' : 1}}, inplace=True)
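
The two replacement steps above (unify the spellings, then encode as integers) can be sketched on a small made-up Series. Using `map` for the second step is my own choice for this illustration, because it turns any label missing from the mapping into `NaN`, which makes a forgotten spelling easy to spot.

```python
import pandas as pd

# Hypothetical raw labels showing the inconsistent spellings
fat = pd.Series(['Low Fat', 'LF', 'reg', 'Regular', 'low fat'])

# Step 1: collapse the spelling variants into two canonical labels
canonical = fat.replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# Step 2: encode the canonical labels as integers
codes = canonical.map({'Low Fat': 0, 'Regular': 1})

# Any label missing from the mapping would show up as NaN here
print(codes.isna().sum(), codes.tolist())
```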

In [None]:
df[['Item_Type']].value_counts()

In [None]:
df.replace({'Item_Type':{'Fruits and Vegetables':0,'Snack Foods':0,'Household':1,
                         'Frozen Foods' : 0, 'Dairy' : 0, 'Baking Goods' : 0,
                         'Canned' : 0, 'Health and Hygiene' : 1,
                         'Meat' : 0, 'Soft Drinks' : 0, 'Breads' : 0, 'Hard Drinks' : 0,
                         'Others' : 2,'Starchy Foods' : 0, 'Breakfast' : 0, 'Seafood' : 0
                         }},inplace=True)

In [None]:
df[['Item_Type']].value_counts()

In [None]:
df[['Outlet_Identifier']].value_counts()

In [None]:
df.replace({'Outlet_Identifier':{'OUT027': 0,'OUT013': 1,
                         'OUT049' : 2, 'OUT046' : 3, 'OUT035' : 4,
                         'OUT045' : 5, 'OUT018' : 6,
                         'OUT017' : 7, 'OUT010' : 8, 'OUT019' : 9,
                         }},inplace=True)

In [None]:
df[['Outlet_Identifier']].value_counts()

In [None]:
df[['Outlet_Size']].value_counts()

In [None]:
df.replace({'Outlet_Size': {'Small': 0,'Medium' : 1, 'High' : 1}}, inplace=True)

In [None]:
df[['Outlet_Size']].value_counts()

In [None]:
df[['Outlet_Location_Type']].value_counts()

In [None]:
df.replace({'Outlet_Location_Type': {'Tier 1': 0,'Tier 2' : 1, 'Tier 3' : 2}}, inplace=True)

In [None]:
df[['Outlet_Location_Type']].value_counts()

In [None]:
df[['Outlet_Type']].value_counts()

In [None]:
df.replace({'Outlet_Type': {'Grocery Store': 0,'Supermarket Type1' : 1, 'Supermarket Type2' : 2, 'Supermarket Type3': 3}}, inplace=True)

In [None]:
df[['Outlet_Type']].value_counts()

In [None]:
# Correlate the numeric columns themselves, not their summary statistics
df.corr(numeric_only=True)

## Defining Target Variable (y) and Feature Variables (X)


In [None]:
y = df['Item_Outlet_Sales']

In [None]:
X = df[['Item_Weight', 'Item_Fat_Content', 'Item_Visibility',
       'Item_Type', 'Item_MRP', 'Outlet_Identifier',
       'Outlet_Establishment_Year', 'Outlet_Size', 'Outlet_Location_Type',
       'Outlet_Type']]

In [None]:
X = df.drop(['Item_Identifier', 'Item_Outlet_Sales'], axis=1)

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state=2529)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

## RandomForest Regression Model for predicting the sales!

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Assuming df is your dataframe
y = df['Item_Outlet_Sales']
X = df.drop(['Item_Identifier', 'Item_Outlet_Sales'], axis=1)

# Define numerical and categorical columns
numerical_cols = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Outlet_Establishment_Year']
categorical_cols = ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

# Preprocessing for numerical data: imputation
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
])

# Preprocessing for categorical data: imputation and one-hot encoding
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define the model
rfr = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=2529)

# Chain preprocessing and the regressor into a single pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', rfr)])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2529)

# Fit the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error and Mean Absolute Error
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print(f'Mean Squared Error: {mse}')
print(f'Mean Absolute Error: {mae}')
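
MSE and MAE are easier to judge next to a reference point. The sketch below uses made-up numbers rather than the sales data: `y_true` and `y_hat` are hypothetical stand-ins for `y_test` and `y_pred`. It adds the RMSE, the R² score, and the MAE of a naive baseline that always predicts the mean (computed from `y_true` here for brevity; in practice it would come from the training targets).

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)                      # back in the units of the target
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)             # share of variance explained

# Naive baseline: always predict the mean of the target
baseline = np.full_like(y_true, y_true.mean())
baseline_mae = mean_absolute_error(y_true, baseline)

print(f'RMSE: {rmse:.2f}  MAE: {mae:.2f}  R2: {r2:.2f}  baseline MAE: {baseline_mae:.2f}')
```

A model is only useful if its errors are clearly smaller than the baseline's.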

## Model Output

In [None]:
import plotly.graph_objs as go
from plotly.subplots import make_subplots

fig = make_subplots(rows=1, cols=1)

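# Reference trace: actual values plotted against themselves (the y = x line)
fig.add_trace(go.Scatter(x=y_test, y=y_test, mode='markers', name='Actual', marker=dict(color='blue')))

# Predictions: the closer these points lie to the reference line, the better the fit
fig.add_trace(go.Scatter(x=y_test, y=y_pred, mode='markers', name='Predicted', marker=dict(color='teal')))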

# Update layout
fig.update_layout(
    title="Actual Sales vs Predicted Sales",
    xaxis_title="Actual Sales",
    yaxis_title="Predicted Sales",
    showlegend=True,
    height=650
)

fig.show()

## Inference

Based on these results, the Random Forest Regression model appears to predict outlet sales reasonably well. The Mean Squared Error (MSE) of approximately 824,045 is expressed in squared sales units, so it is more useful for comparing models than for reading directly. The Mean Absolute Error (MAE) of about 662.338 is easier to interpret: on average, a prediction differs from the actual sales value by roughly 662.
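
Taking the square root of the MSE gives the RMSE, which is in the same units as `Item_Outlet_Sales` and can be compared with the MAE. A minimal sketch using the two figures reported above (the variable names are only illustrative):

```python
import math

reported_mse = 824045     # approximate MSE reported above
reported_mae = 662.338    # MAE reported above

# RMSE is the square root of the MSE
rmse = math.sqrt(reported_mse)

print(f'RMSE: {rmse:.2f}')
print(f'RMSE / MAE: {rmse / reported_mae:.2f}')
```

The RMSE comes out at roughly 908. It is larger than the MAE because squaring gives more weight to the larger errors.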

These metrics suggest that the model captures much of the pattern in the data, allowing it to make useful predictions about future sales. This is particularly valuable for planning and decision-making in retail operations, where accurate sales forecasts are essential for optimizing inventory, pricing strategies, and overall business performance.

In conclusion, the Random Forest Regression model shows promise in its ability to forecast outlet sales, providing valuable insights that can inform strategic decisions and improve operational efficiency.