# 🧠 Sales Forecasting Using Machine Learning

This project builds a regression model to forecast future sales using historical data. Accurate sales forecasts help businesses with inventory planning, budgeting, and strategy.



## 📂 Dataset Overview

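The dataset (`sales_prediction.csv`) contains BigMart sales records for 1559 products across 10 stores, with attributes of each product and store and the target column `Item_Outlet_Sales`. We'll clean, visualize, and model this data to predict the sales of each product at each store.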



In [None]:
# BigMart Sales Prediction

'''Objective
The data scientists at BigMart have collected sales data for 1559 products across 10 stores in different cities.
Certain attributes of each product and store have also been defined. The aim is to build a predictive model
that estimates the sales of each product at a particular store (each row of data).

The idea is to find out which properties of a product and a store affect the sales of that product.'''

import pandas as pd
df= pd.read_csv('sales_prediction.csv')
df

In [None]:
df.head()

In [None]:
df.info()

# Prepare training and test datasets

We need data to train the model and unseen data to measure its performance.
We use a 70/30 split: 70% of the data is used for training and 30% for testing the model.
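
As a minimal sketch of the 70/30 split described above, the snippet below uses a small synthetic frame (the column names `feature` and `target` are placeholders, not columns from the dataset). It passes `test_size=0.3` explicitly, because `train_test_split` otherwise falls back to a 75/25 split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; 'feature' and 'target' are placeholder names.
demo = pd.DataFrame({"feature": np.arange(100), "target": np.arange(100) * 2.0})

X_demo = demo.drop(columns=["target"])
y_demo = demo["target"]

# test_size=0.3 holds out 30% of the rows; random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=11
)

print(X_tr.shape, X_te.shape)
```

Checking the shapes right after the split is a cheap way to confirm that the ratio in the code matches the ratio stated in the text.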

In [None]:
x=df.drop(columns=['Item_Outlet_Sales'])
y=df['Item_Outlet_Sales']

In [None]:
!pip install scikit-learn
import sklearn

from sklearn.model_selection  import train_test_split

# test_size=0.3 gives the 70/30 split described above (the default would be 75/25)
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=11)

x_train.shape

In [None]:
x_test.shape

# Data wrangling, EDA, and feature engineering

In [None]:
# Work on a copy of the training data so the original stays untouched
x_train_c=x_train.copy()

In [None]:
x_train_c.info()

In [None]:
x_train_c.isnull().sum()

In [None]:
num_data=x_train_c.select_dtypes(exclude=['object'])
num_data

In [None]:
num_data.describe()

In [None]:
num_data.isnull().sum()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fig,ax=plt.subplots(1,2,figsize=(12,5))

sns.histplot(data=x_train_c,x='Item_Weight',ax=ax[0])
sns.boxplot(data=x_train_c,y='Item_Weight',ax=ax[1])

In [None]:
def visualize(data_frame,col_name):
    fig,ax=plt.subplots(1,2,figsize=(12,5))

    sns.histplot(data=data_frame,x=col_name,ax=ax[0])
    sns.boxplot(data=data_frame,y=col_name,ax=ax[1])
    

In [None]:
visualize(x_train_c,'Item_Visibility')

In [None]:
visualize(x_train_c,'Item_MRP')

In [None]:
visualize(x_train_c,'Outlet_Establishment_Year')

In [None]:
sns.countplot(data=x_train_c,x='Outlet_Establishment_Year')

In [None]:
cat_data=x_train_c.select_dtypes(include=['object'])
cat_data

In [None]:
cat_data.isnull().sum()

In [None]:
cat_data.describe()

In [None]:
cat_data['Item_Identifier'].value_counts()

In [None]:
cat_data['Item_Fat_Content'].value_counts()

In [None]:
for i in cat_data:
    print(f'Value counts for {i}:')
    print(cat_data[i].value_counts())
    

# Data Wrangling and Feature Engineering


In [None]:
# Derive a high-level item category from the first two characters of the identifier

x_train_c['Item_Identifier'].apply(lambda x:x[:2]).value_counts()

In [None]:
# Equivalent result using the vectorized .str accessor
x_train_c['Item_Identifier'].str[:2].value_counts()

In [None]:
# map item id into item types

def create_item_type(data_frame):
    data_frame['Item_Type']=data_frame['Item_Identifier'].str[:2]
    data_frame['Item_Type']=data_frame['Item_Type'].map({
                                                   'FD':'Food',
                                                   'NC': 'Non-Consumables',
                                                   'DR': 'Drinks'  })
    return data_frame
                                          
    

In [None]:
x_train_c=create_item_type(x_train_c)
x_train_c.head()

In [None]:
# filling missing values in item weight

x_train_c.isnull().sum()

In [None]:
x_train_c[['Item_Identifier','Item_Weight']].drop_duplicates().sort_values(by='Item_Identifier')

In [None]:
x_train_c[['Item_Type','Item_Weight']].drop_duplicates().sort_values(by='Item_Type')

In [None]:
# Map each item identifier to its mean weight and use that to fill missing values

item_id_weight_pivot= x_train_c.pivot_table(values='Item_Weight',index='Item_Identifier').reset_index()
item_id_weight_mapping= dict(zip(item_id_weight_pivot['Item_Identifier'],item_id_weight_pivot['Item_Weight']))
list(item_id_weight_mapping.items())[:10]

In [None]:
# Fallback for items with no known weight (e.g. a new item): median weight per item type

item_type_weight_pivot=x_train_c.pivot_table(values='Item_Weight',index='Item_Type',aggfunc='median').reset_index()
item_type_weight_mapping= dict(zip(item_type_weight_pivot['Item_Type'],item_type_weight_pivot['Item_Weight']))
item_type_weight_mapping.items()

In [None]:
def impute_item_weight(data_frame):
    data_frame.loc[:,'Item_Weight']=data_frame.loc[:,'Item_Weight'].fillna(data_frame.loc[:,'Item_Identifier'].map(item_id_weight_mapping))
    data_frame.loc[:,'Item_Weight']=data_frame.loc[:,'Item_Weight'].fillna(data_frame.loc[:,'Item_Type'].map(item_type_weight_mapping))

    return data_frame
    

In [None]:
x_train_c=impute_item_weight(x_train_c)

In [None]:
x_train_c.isnull().sum()

In [None]:
# filling missing values for outlet size

x_train_c.groupby(by=['Outlet_Type','Outlet_Size']).size()

In [None]:
def calculate_mode(series):
    # Most frequent value; mode() returns a Series, so take its first entry
    return series.mode().iloc[0]
    
# Create a pivot table
Outlet_type_size_pivot = x_train_c.pivot_table(
    values='Outlet_Size',
    index='Outlet_Type',
    aggfunc=calculate_mode  # Get the mode of the values
).reset_index()


Outlet_type_size_mapping= dict(zip(Outlet_type_size_pivot['Outlet_Type'],Outlet_type_size_pivot['Outlet_Size']))
Outlet_type_size_mapping

In [None]:
def impute_outlet_size(data_frame):
    data_frame.loc[:,'Outlet_Size']=data_frame.loc[:,'Outlet_Size'].fillna(data_frame.loc[:,'Outlet_Type'].map(Outlet_type_size_mapping))
    return data_frame





In [None]:
x_train_c=impute_outlet_size(x_train_c)

In [None]:
x_train_c

In [None]:
x_train_c.isnull().sum()

In [None]:
x_train_c['Item_Fat_Content'].value_counts()

In [None]:
def create_item_type1(data_frame):
    # Despite its name, this function normalizes the inconsistent Item_Fat_Content labels
    data_frame['Item_Fat_Content']=data_frame['Item_Fat_Content'].replace({
                                                   'Low Fat':'Low_Fat',
                                                   'reg': 'Regular',
                                                   'low fat': 'Low_Fat',
                                                   'LF': 'Low_Fat'
    })
    return data_frame
                        

In [None]:
x_train_c=create_item_type1(x_train_c)

In [None]:
x_train_c['Item_Fat_Content'].value_counts()

In [None]:
x_train_c.groupby(by=['Item_Type','Item_Fat_Content']).size()

In [None]:
x_train_c.loc[x_train_c['Item_Type']=='Non-Consumables','Item_Fat_Content']

In [None]:
def correct_item_fat_content(data_frame):
    data_frame.loc[data_frame['Item_Type']=='Non-Consumables','Item_Fat_Content']='Non_Edible'
    return data_frame

In [None]:
x_train_c=correct_item_fat_content(x_train_c)

In [None]:
x_train_c.groupby(by=['Item_Type','Item_Fat_Content']).size()

In [None]:
x_train_c.info()

In [None]:
def prepare_dataset(data_frame):
    data_frame=create_item_type(data_frame)
    data_frame=impute_item_weight(data_frame)
    data_frame=impute_outlet_size(data_frame)
    data_frame=create_item_type1(data_frame)
    data_frame=correct_item_fat_content(data_frame)
    return data_frame
    

In [None]:
x_train.isnull().sum()

In [None]:
x_train=prepare_dataset(x_train)
x_train.isnull().sum()

In [None]:
x_test.isnull().sum()

In [None]:
x_test=prepare_dataset(x_test)

In [None]:
x_test.isnull().sum()

# Handling categorical data

In [None]:
cat_feats=x_train.select_dtypes(include=['object'])
cat_feats.head()

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
ohe=OneHotEncoder(handle_unknown='ignore')

In [None]:
ohe.fit(cat_feats)

In [None]:
ohe_feature_names=ohe.get_feature_names_out(input_features=cat_feats.columns)
ohe_feature_names

In [None]:
num_feat_train=x_train.select_dtypes(exclude=['object']).reset_index(drop=True)
num_feat_train.head()

In [None]:
cat_feat_train=x_train.select_dtypes(include=['object'])
x_train_cat_ohe=pd.DataFrame(ohe.transform(cat_feat_train).toarray(),columns=ohe_feature_names)
x_train_cat_ohe.head()

In [None]:
x_train_final=pd.concat([num_feat_train,x_train_cat_ohe],axis=1)
x_train_final.head()

In [None]:
final_col=x_train_final.columns.values
final_col

In [None]:
num_feat_test=x_test.select_dtypes(exclude=['object']).reset_index(drop=True)
cat_feat_test=x_test.select_dtypes(include=['object'])
x_test_cat_ohe=pd.DataFrame(ohe.transform(cat_feat_test).toarray(),columns=ohe_feature_names)
x_test_final=pd.concat([num_feat_test,x_test_cat_ohe],axis=1)
x_test_final=x_test_final[final_col]

x_test_final.head()


# Modelling
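
The cells below import `cross_validate` but score each model on a single hold-out split. As a hedged sketch of the alternative, the snippet below runs 5-fold cross-validation on synthetic data; the data, the reduced `n_estimators`, and the variable names are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic regression data standing in for the encoded feature matrix.
rng = np.random.default_rng(11)
X_syn = rng.uniform(0, 10, size=(150, 3))
y_syn = 4.0 * X_syn[:, 0] + rng.normal(0, 1.0, size=150)

model = RandomForestRegressor(n_estimators=50, random_state=11)
scores = cross_validate(
    model, X_syn, y_syn, cv=5,
    scoring=("r2", "neg_root_mean_squared_error"),
)

# scikit-learn negates error metrics so that higher is always better; flip the sign back.
rmse_per_fold = -scores["test_neg_root_mean_squared_error"]
print("R2 per fold:", np.round(scores["test_r2"], 3))
print("Mean RMSE:", rmse_per_fold.mean())
```

Averaging over several folds tends to give a steadier estimate than one random split, at the cost of fitting the model once per fold.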

In [None]:
sns.histplot(y_train)

In [None]:
!pip install xgboost
!pip install lightgbm


from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor,HistGradientBoostingRegressor
import xgboost as xgb
from lightgbm import LGBMRegressor
from sklearn.model_selection import cross_validate
import numpy as np

In [None]:
x_test_final.shape

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

def evaluate_model_simple(model, x, y, test_size=0.2, random_state=None):
    # Hold out part of the data so the scores reflect rows the model has not seen
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=test_size, random_state=random_state)
    
    # Fit the model
    model.fit(x_train, y_train)
    
    # Predict on the held-out set
    y_pred = model.predict(x_test)
    
    # Calculate R² and RMSE
    r2 = r2_score(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    print('R² score:', r2)
    print('RMSE:', rmse)


In [None]:
rf = RandomForestRegressor(random_state=11)
gb = GradientBoostingRegressor(random_state=11)

In [None]:

from sklearn.metrics import mean_squared_error, r2_score

# Evaluate RandomForestRegressor
print("RandomForestRegressor:")
evaluate_model_simple(rf, x=x_train_final, y=y_train)

# Evaluate GradientBoostingRegressor
print("\nGradientBoostingRegressor:")
evaluate_model_simple(gb, x=x_train_final, y=y_train)

In [None]:
xgr=xgb.XGBRegressor(objective='reg:squarederror',random_state=11)
evaluate_model_simple(model=xgr,x=x_train_final,y=y_train)

In [None]:

# Rebuild the training matrix with the fitted encoder. The encoder was fitted with
# Item_Identifier included, so that column must stay in the categorical block.
num_feats_train=x_train.select_dtypes(exclude=['object']).reset_index(drop=True)
cat_feats_train=x_train.select_dtypes(include=['object'])
x_train_cat_ohe=pd.DataFrame(ohe.transform(cat_feats_train).toarray(),columns=ohe_feature_names)
x_train_final=pd.concat([num_feats_train,x_train_cat_ohe],axis=1)
x_train_final.head()

In [None]:
importances = gb.feature_importances_
feature_names = x_train_final.columns  # column names of the training matrix the model was fitted on

# Create a DataFrame to display feature importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(20, 10))
plt.barh(importance_df['Feature'], importance_df['Importance'])
plt.xlabel('Importance')
plt.title('Feature Importances from GradientBoostingRegressor')
plt.gca().invert_yaxis()  # Display the most important feature on top
plt.show()

In [None]:
feature_names.shape

## 🛠️ Data Preprocessing & Feature Engineering

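We derive a high-level `Item_Type` from the first two characters of `Item_Identifier`, impute missing `Item_Weight` and `Outlet_Size` values, normalize the inconsistent `Item_Fat_Content` labels, and one-hot encode the categorical columns for modeling.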
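
The two-step weight imputation used earlier in this notebook (fill from the per-item mean, then fall back to the per-type median) can be sketched on a toy frame. The identifiers and weights below are made up for illustration, and `groupby` is used here in place of the notebook's `pivot_table` calls.

```python
import numpy as np
import pandas as pd

# Toy frame; identifiers and weights are invented for illustration.
toy = pd.DataFrame({
    "Item_Identifier": ["FDA01", "FDA01", "FDB02", "DRC03"],
    "Item_Type": ["Food", "Food", "Food", "Drinks"],
    "Item_Weight": [9.0, np.nan, np.nan, 5.0],
})

# Level 1: mean weight per item identifier (missing values are skipped by mean()).
id_to_weight = toy.groupby("Item_Identifier")["Item_Weight"].mean()
# Level 2: median weight per item type, used when the item itself has no known weight.
type_to_weight = toy.groupby("Item_Type")["Item_Weight"].median()

filled = toy["Item_Weight"].fillna(toy["Item_Identifier"].map(id_to_weight))
filled = filled.fillna(toy["Item_Type"].map(type_to_weight))
toy["Item_Weight"] = filled
print(toy)
```

In this toy frame `FDB02` has no recorded weight at all, so only the second, type-level fallback can fill it. New items that appear only in the test set hit the same case.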


## 📊 Exploratory Data Analysis (EDA)

We inspect the distribution of each numeric feature with histograms and box plots, and review the value counts of each categorical feature.


## 🤖 Model Building

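We fit Random Forest, Gradient Boosting, and XGBoost regressors (a Linear Regression model is a natural baseline to compare against). Each model is trained on one part of the training data and evaluated on the held-out part using RMSE and R² score.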
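
As a hedged sketch of the evaluation described here, the snippet below fits a `LinearRegression` baseline on synthetic data and reports RMSE and R² on a 30% hold-out. The data, coefficients, and variable names are invented for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship standing in for real features and sales.
rng = np.random.default_rng(11)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(0, 1.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=11
)

baseline = LinearRegression().fit(X_tr, y_tr)
y_pred = baseline.predict(X_te)

rmse = np.sqrt(mean_squared_error(y_te, y_pred))  # same units as the target
r2 = r2_score(y_te, y_pred)                        # 1.0 would be a perfect fit
print(f"RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

RMSE is expressed in the units of the target, so it is easy to relate to sales figures, while R² gives a scale-free measure that can be compared across models.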


## 📈 Forecast Visualization

We'll visualize actual vs. predicted sales to evaluate the model’s performance and interpretability.
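
One common way to draw this comparison is a scatter plot of actual against predicted values with a diagonal reference line. The sketch below uses made-up numbers rather than model output, and the `Agg` backend line is only there so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Made-up actual and predicted values, for illustration only.
rng = np.random.default_rng(11)
actual = rng.uniform(500, 5000, size=100)
predicted = actual + rng.normal(0, 300, size=100)

fig, ax = plt.subplots(figsize=(6, 6))
ax.scatter(actual, predicted, alpha=0.6)

# Diagonal reference line: points on it are perfect predictions.
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, linestyle="--", color="gray")

ax.set_xlabel("Actual sales")
ax.set_ylabel("Predicted sales")
ax.set_title("Actual vs. predicted sales")
fig.tight_layout()
```

Points far from the diagonal mark the rows where the model is most wrong, which is often a useful starting point for error analysis.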


## 📌 Business Insights & Recommendations

- Sales show a seasonal trend with higher demand in certain months.
- Forecasts help with inventory planning and sales strategy.
- Recommend increasing stock and marketing in peak months identified through the model.



## 🚀 Future Improvements

- Try more advanced models like ARIMA, XGBoost, or Prophet.
- Incorporate external factors like promotions, holidays, or weather data.
