# Sales Prediction Project
## Problem Statement:
Jamie’s Supermarket struggles to predict sales accurately, which leads to inefficient stock management. This causes two major problems: stockouts, where popular products run out, and overstocking, which increases storage costs and can lead to wastage. Without data-driven tools to forecast demand, the supermarket faces both lost sales and excess inventory, which hurts its profitability.

### Approach:
This project addresses the problem by developing a machine-learning model that uses historical sales data to predict future sales. The model will help Jamie’s Supermarket in Uganda optimize stock replenishment, avoid stockouts, and reduce overstocking. The goal is to make operations more efficient, leading to higher sales and improved customer satisfaction.

The dataset contains the following columns:

**Item_Weight:** Weight of product

**Item_Fat_Content:** Whether the product is low fat or not

**Item_Type:** The category to which the product belongs

**Item_MRP:** Maximum Retail Price of the product

**Outlet_Establishment_Year:** The year in which the supermarket was established

**Outlet_Location:** The location (town or area) in which the store is located

**Outlet_Type:** The type of outlet (for example, supermarket)

**Item_Outlet_Sales:** Sales of the product in the supermarket. This is the variable to be predicted.


In [2]:
# Importing basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Modelling
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

import pickle

In [3]:
df_train = pd.read_csv(r'C:\Users\USER\Desktop\sales pred\Train.csv')
df_test = pd.read_csv(r'C:\Users\USER\Desktop\sales pred\Test.csv')

In [4]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,Dairy,915250,2017,Kireka,Supermarket,13689281
1,5.92,Regular,Soft Drinks,176906,2017,Kireka,Supermarket,1625144
2,17.5,Low Fat,Meat,518880,2017,Kireka,Supermarket,7686495
3,19.2,Regular,Fruits and Vegetables,667378,2017,Kireka,Supermarket,2684172
4,8.93,Low Fat,Household,197402,2017,Kireka,Supermarket,3645595


In [5]:
df_train.shape

(257, 8)

In [6]:
df_train.columns

Index(['Item_Weight', 'Item_Fat_Content', 'Item_Type', 'Item_MRP',
       'Outlet_Establishment_Year', 'Outlet_Location', 'Outlet_Type',
       'Item_Outlet_Sales'],
      dtype='object')

In [7]:
df_test.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location,Outlet_Type
0,20.75,Low Fat,Snack Foods,395315,2017,Kireka,Supermarket
1,8.3,reg,Dairy,320027,2017,Kireka,Supermarket
2,14.6,Low Fat,Others,886028,2017,Kireka,Supermarket
3,7.315,Low Fat,Snack Foods,568200,2017,Kireka,Supermarket
4,,Regular,Dairy,858453,2017,Kireka,Supermarket


In [8]:
df_test.shape

(329, 7)

In [9]:
df_train.describe()

Unnamed: 0,Item_Weight,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,207.0,257.0,257.0,257.0
mean,12.8243,512239.649805,2017.0,7956811.0
std,4.415548,234369.519589,0.0,6401026.0
min,4.88,114654.0,2017.0,151288.0
25%,8.8925,340313.0,2017.0,2733176.0
50%,12.85,527324.0,2017.0,6576223.0
75%,16.6,679438.0,2017.0,11488260.0
max,21.35,972041.0,2017.0,29203900.0


In [10]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257 entries, 0 to 256
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                207 non-null    float64
 1   Item_Fat_Content           257 non-null    object 
 2   Item_Type                  257 non-null    object 
 3   Item_MRP                   257 non-null    int64  
 4   Outlet_Establishment_Year  257 non-null    int64  
 5   Outlet_Location            257 non-null    object 
 6   Outlet_Type                257 non-null    object 
 7   Item_Outlet_Sales          257 non-null    int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 16.2+ KB


In [11]:
df_train.isnull().sum()

Item_Weight                  50
Item_Fat_Content              0
Item_Type                     0
Item_MRP                      0
Outlet_Establishment_Year     0
Outlet_Location               0
Outlet_Type                   0
Item_Outlet_Sales             0
dtype: int64

In [12]:
df_test.isnull().sum()

Item_Weight                  68
Item_Fat_Content              0
Item_Type                     0
Item_MRP                      0
Outlet_Establishment_Year     0
Outlet_Location               0
Outlet_Type                   0
dtype: int64

In [13]:
df_train.duplicated().sum()

0

In [14]:
df_train.head(3)

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,Dairy,915250,2017,Kireka,Supermarket,13689281
1,5.92,Regular,Soft Drinks,176906,2017,Kireka,Supermarket,1625144
2,17.5,Low Fat,Meat,518880,2017,Kireka,Supermarket,7686495


In [15]:
df_train['Outlet_Establishment_Year'].unique()

array([2017], dtype=int64)

In [16]:
df_train.isnull().sum()

Item_Weight                  50
Item_Fat_Content              0
Item_Type                     0
Item_MRP                      0
Outlet_Establishment_Year     0
Outlet_Location               0
Outlet_Type                   0
Item_Outlet_Sales             0
dtype: int64

In [17]:
# Display rows with null values
null_rows = df_train[df_train.isnull().any(axis=1)]
null_rows

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location,Outlet_Type,Item_Outlet_Sales
7,,Low Fat,Snack Foods,394948,2017,Kireka,Supermarket,14743369
18,,Low Fat,Hard Drinks,415183,2017,Kireka,Supermarket,8442942
21,,Regular,Baking Goods,529754,2017,Kireka,Supermarket,14894718
23,,Low Fat,Baking Goods,394698,2017,Kireka,Supermarket,785731
29,,Regular,Canned,159960,2017,Kireka,Supermarket,461190
36,,Regular,Fruits and Vegetables,469368,2017,Kireka,Supermarket,10253540
38,,Regular,Snack Foods,135610,2017,Kireka,Supermarket,1422611
39,,Low Fat,Snack Foods,321027,2017,Kireka,Supermarket,7991514
49,,Regular,Dairy,721562,2017,Kireka,Supermarket,2859864
59,,Low Fat,Canned,659826,2017,Kireka,Supermarket,3269710


In [18]:
# define numerical & categorical columns in train data
numeric_features = [feature for feature in df_train.columns if df_train[feature].dtype != 'O']
categorical_features = [feature for feature in df_train.columns if df_train[feature].dtype == 'O']

# print numerical & categorical columns in train data
print('We have {} numerical features in train data and they are as follows: {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features in train data and they are as follows: {}'.format(len(categorical_features), categorical_features))

We have 4 numerical features in train data and they are as follows: ['Item_Weight', 'Item_MRP', 'Outlet_Establishment_Year', 'Item_Outlet_Sales']

We have 4 categorical features in train data and they are as follows: ['Item_Fat_Content', 'Item_Type', 'Outlet_Location', 'Outlet_Type']


In [19]:
print('Unique values in categorical features in Train data')
print('Unique values in Item_Fat_Content:', df_train['Item_Fat_Content'].unique())
print('Unique values in Item_Type:', df_train['Item_Type'].unique())
print('Unique values in Outlet_Location:', df_train['Outlet_Location'].unique())
print('Unique values in Outlet_Type:', df_train['Outlet_Type'].unique())

Unique values in categorical features in Train data
Unique values in Item_Fat_Content: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']
Unique values in Item_Type: ['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']
Unique values in Outlet_Location: ['Kireka']
Unique values in Outlet_Type: ['Supermarket']
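The output above shows that Outlet_Location and Outlet_Type each hold a single value, so they cannot help a model distinguish one row from another. A minimal sketch for spotting such constant columns follows; the `toy` frame and the `constant_cols` name are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for df_train (values copied from the first rows shown above)
toy = pd.DataFrame({
    "Item_MRP": [915250, 176906, 518880],
    "Outlet_Location": ["Kireka", "Kireka", "Kireka"],
    "Outlet_Type": ["Supermarket", "Supermarket", "Supermarket"],
})

# A column with exactly one distinct value carries no information for prediction
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# Drop the constant columns and keep the rest
reduced = toy.drop(columns=constant_cols)
print(constant_cols)
```

Whether to drop these columns is a judgment call; if more outlets are added later, they would start to vary and become useful again.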


## Data Preprocessing
1. Remove outliers as discovered in the EDA file
2. Fill features with null values (Item_Weight is filled with its mean)
3. Drop redundant features
4. Feature encoding

#### 1. Remove outliers as discovered in the EDA file
##### Winsorization:
Winsorization replaces extreme values with the nearest non-outlier value: values above the upper limit are replaced with the largest non-outlier value, and values below the lower limit with the smallest. The cell below uses the 1.5 × IQR rule to count the outliers in each numeric column.

In [20]:
def find_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

outliers_weight = find_outliers_iqr(df_train, 'Item_Weight')
outliers_mrp = find_outliers_iqr(df_train, 'Item_MRP')
outliers_sales = find_outliers_iqr(df_train, 'Item_Outlet_Sales')

print("Number of outliers in Item_Weight:", len(outliers_weight))
print("Number of outliers in Item_MRP:", len(outliers_mrp))
print("Number of outliers in Item_Outlet_Sales:", len(outliers_sales))

Number of outliers in Item_Weight: 0
Number of outliers in Item_MRP: 0
Number of outliers in Item_Outlet_Sales: 6
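The cell above only counts the outliers; it does not modify them. As a rough sketch of how the capping step could look, the helper below clips values to the same 1.5 × IQR fences. Note that this caps at the fence itself, a common variant, rather than at the nearest observed non-outlier value. The function name `winsorize_iqr` and the toy numbers are assumptions for illustration.

```python
import pandas as pd

def winsorize_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data standing in for Item_Outlet_Sales (made-up values, one obvious outlier)
sales = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 100.0])
capped = winsorize_iqr(sales)
print(capped.tolist())
```

Only the extreme value is changed; every value already inside the fences is left as it was.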


In [21]:
# Creating a new column for Outlet_Age (the reference year 2023 is hard-coded)
df_train['Outlet_Age'] = 2023 - df_train['Outlet_Establishment_Year']

# Standardize values in the 'Item_Fat_Content' column
df_train['Item_Fat_Content'] = df_train['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# Drop unnecessary columns
df_train.drop([ 'Outlet_Establishment_Year'], axis=1, inplace=True)

#### 2. Fill features with null values using the mean

In [22]:
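# Assign the result back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions and may not update df_train
df_train['Item_Weight'] = df_train['Item_Weight'].fillna(df_train['Item_Weight'].mean())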
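The test set also has missing Item_Weight values (68 rows), so whichever fill value is chosen is usually computed once on the training data and then reused. The sketch below shows that pattern on toy frames; it uses the median purely as an illustration (the cell above uses the mean), and the `train`, `test`, and `fill_value` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for df_train and df_test
train = pd.DataFrame({"Item_Weight": [9.3, np.nan, 17.5, 5.92]})
test = pd.DataFrame({"Item_Weight": [np.nan, 8.3]})

# Learn the fill value from the training data only (NaNs are skipped by default)
fill_value = train["Item_Weight"].median()

# Apply the same value to both frames so they are treated consistently
train["Item_Weight"] = train["Item_Weight"].fillna(fill_value)
test["Item_Weight"] = test["Item_Weight"].fillna(fill_value)
print(fill_value)
```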


In [23]:
df_train['Outlet_Type'].value_counts()

Outlet_Type
Supermarket    257
Name: count, dtype: int64

In [24]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257 entries, 0 to 256
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Item_Weight        257 non-null    float64
 1   Item_Fat_Content   257 non-null    object 
 2   Item_Type          257 non-null    object 
 3   Item_MRP           257 non-null    int64  
 4   Outlet_Location    257 non-null    object 
 5   Outlet_Type        257 non-null    object 
 6   Item_Outlet_Sales  257 non-null    int64  
 7   Outlet_Age         257 non-null    int64  
dtypes: float64(1), int64(3), object(4)
memory usage: 16.2+ KB


In [25]:
df_train['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)

In [26]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Location,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,Low Fat,Dairy,915250,Kireka,Supermarket,13689281,6
1,5.92,Regular,Soft Drinks,176906,Kireka,Supermarket,1625144,6
2,17.5,Low Fat,Meat,518880,Kireka,Supermarket,7686495,6
3,19.2,Regular,Fruits and Vegetables,667378,Kireka,Supermarket,2684172,6
4,8.93,Low Fat,Household,197402,Kireka,Supermarket,3645595,6


In [27]:
df_train['Item_Outlet_Sales'].max()

29203898

In [28]:
df_train['Item_Outlet_Sales'].min()

151288

In [29]:
df_train.shape

(257, 8)

In [30]:
df_train['Item_Fat_Content'].value_counts()

Item_Fat_Content
Low Fat    168
Regular     89
Name: count, dtype: int64

In [31]:
df_train[df_train['Outlet_Type'] == 'Supermarket']

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Location,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3000,Low Fat,Dairy,915250,Kireka,Supermarket,13689281,6
1,5.9200,Regular,Soft Drinks,176906,Kireka,Supermarket,1625144,6
2,17.5000,Low Fat,Meat,518880,Kireka,Supermarket,7686495,6
3,19.2000,Regular,Fruits and Vegetables,667378,Kireka,Supermarket,2684172,6
4,8.9300,Low Fat,Household,197402,Kireka,Supermarket,3645595,6
...,...,...,...,...,...,...,...,...
252,7.7850,Low Fat,Fruits and Vegetables,225268,Kireka,Supermarket,2781979,6
253,11.8000,Regular,Snack Foods,461669,Kireka,Supermarket,6881085,6
254,13.1500,Regular,Fruits and Vegetables,629774,Kireka,Supermarket,13851681,6
255,12.8243,Low Fat,Frozen Foods,152408,Kireka,Supermarket,151288,6


In [32]:
# define numerical & categorical columns in train data
numeric_features = [feature for feature in df_train.columns if df_train[feature].dtype != 'O']
categorical_features = [feature for feature in df_train.columns if df_train[feature].dtype == 'O']

# print numerical & categorical columns in train data
print('There are {} numerical features in train data and they are : {}'.format(len(numeric_features), numeric_features))
print('\nThere are {} categorical features train data and they are : {}'.format(len(categorical_features), categorical_features))

There are 4 numerical features in train data and they are : ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales', 'Outlet_Age']

There are 4 categorical features train data and they are : ['Item_Fat_Content', 'Item_Type', 'Outlet_Location', 'Outlet_Type']


## Feature Encoding
1. Label Encoding
2. One-Hot-Encoding


Label-encoded variables (treated as ordinal in this notebook): Item_Fat_Content, Outlet_Location

One-hot-encoded (nominal) variables: Item_Type, Outlet_Type

In [33]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Define the lists of categorical and numerical features
numerical_features = ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales']
categorical_features = ['Item_Fat_Content', 'Item_Type', 'Outlet_Location', 'Outlet_Type']

# Columns to label-encode (each category is mapped to an integer code)
Label = ['Item_Fat_Content', 'Outlet_Location']

# Keep one fitted encoder per column so the same mapping can be reused later
label_encoders = {}
for i in Label:
    le = LabelEncoder()
    df_train[i] = le.fit_transform(df_train[i])
    label_encoders[i] = le

In [34]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Location,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,0,Dairy,915250,0,Supermarket,13689281,6
1,5.92,1,Soft Drinks,176906,0,Supermarket,1625144,6
2,17.5,0,Meat,518880,0,Supermarket,7686495,6
3,19.2,1,Fruits and Vegetables,667378,0,Supermarket,2684172,6
4,8.93,0,Household,197402,0,Supermarket,3645595,6


In [35]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Define columns for one-hot encoding
cols = ['Item_Type', 'Outlet_Type']

# Applying one-hot encoder
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')
oh_encoder_df_train = pd.DataFrame(oh_encoder.fit_transform(df_train[cols])).astype('int64')

# Get feature column names
oh_encoder_df_train.columns = oh_encoder.get_feature_names_out(cols)

# One-hot encoding removed index; put it back
oh_encoder_df_train.index = df_train.index

# Add one-hot encoded columns to the main DataFrame
df_train = pd.concat([df_train, oh_encoder_df_train], axis=1)

# Dropping the original categorical columns
df_train = df_train.drop(['Item_Type', 'Outlet_Type'], axis=1)

In [36]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_MRP,Outlet_Location,Item_Outlet_Sales,Outlet_Age,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,...,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods
0,9.3,0,915250,0,13689281,6,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,5.92,1,176906,0,1625144,6,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,17.5,0,518880,0,7686495,6,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,19.2,1,667378,0,2684172,6,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,8.93,0,197402,0,3645595,6,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [37]:
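from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Note: the target Item_Outlet_Sales is standardized too, so the error metrics
# reported later are in standardized units rather than original sales units.
# Outlet_Age is constant (6), so it becomes 0 for every row after scaling.
col_to_scale = ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales', 'Outlet_Age']

for col in col_to_scale:
    # Reshaping column to a 2D array with a single column
    col_data = df_train[col].values.reshape(-1, 1)

    # The scaler is refit on every column, so after the loop it only holds
    # the statistics of the last column
    df_train[col] = scaler.fit_transform(col_data)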
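The scaling cell above is fitted on the full DataFrame before the train/test split, which lets information from the held-out rows influence the preprocessing. One common alternative, sketched here on synthetic data, is to wrap the preprocessing in a scikit-learn `Pipeline` so the scaler and encoder are fitted on the training rows only and the target stays in its original units. The toy columns and the synthetic link between Item_MRP and sales are assumptions made for illustration, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the sales data
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "Item_Weight": rng.uniform(5, 20, n),
    "Item_MRP": rng.uniform(100_000, 900_000, n),
    "Item_Type": rng.choice(["Dairy", "Meat", "Canned"], n),
})
toy["Item_Outlet_Sales"] = 10 * toy["Item_MRP"] + rng.normal(0, 1e5, n)

X = toy.drop(columns="Item_Outlet_Sales")
y = toy["Item_Outlet_Sales"]  # target is left unscaled
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Item_Weight", "Item_MRP"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Item_Type"]),
])

# fit() learns the preprocessing statistics from X_train only
pipe = Pipeline([("preprocess", preprocess), ("model", LinearRegression())])
pipe.fit(X_train, y_train)
test_r2 = pipe.score(X_test, y_test)
print(f"Test R2 on synthetic data: {test_r2:.4f}")
```

The same pipeline object can then be passed to `GridSearchCV`, which refits the preprocessing inside each fold.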


In [38]:
X = df_train.drop(['Item_Outlet_Sales'], axis = 1)
y = df_train['Item_Outlet_Sales']

In [39]:
# Splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

In [40]:
len(X_train)


205

In [41]:
len(X_test)

52

In [42]:
def evaluate_model(true, predicted):
    """Return MAE, RMSE and R2 for a set of predictions."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
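To make the three metrics returned by `evaluate_model` concrete, here is a small worked example that computes them by hand with NumPy. The four true and predicted values are made up for illustration.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                      # [1, 0, -1, -2]

# Mean absolute error: (1 + 0 + 1 + 2) / 4
mae = np.mean(np.abs(errors))

# Root mean squared error: sqrt((1 + 0 + 1 + 4) / 4)
rmse = np.sqrt(np.mean(errors ** 2))

# R2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)                          # 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)        # 9 + 1 + 1 + 9 = 20
r2 = 1 - ss_res / ss_tot

print(mae, rmse, r2)
```

An R2 of 0 means the model does no better than always predicting the mean of the target, which is what the Lasso results further down show.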

In [43]:


# Defining evaluate_model function
def evaluate_model(true_values, predicted_values):
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    rmse = np.sqrt(mean_squared_error(true_values, predicted_values))
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)
    return mae, rmse, r2

# Ensuring df_train is the preprocessed DataFrame
# Defining features (X) and target (y)
X = df_train.drop('Item_Outlet_Sales', axis=1)
y = df_train['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Models dictionary
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "XGBRegressor": XGBRegressor(random_state=42),
    "CatBoosting Regressor": CatBoostRegressor(verbose=False, random_state=42),
    "AdaBoost Regressor": AdaBoostRegressor(random_state=42)
}

model_list = []
r2_list = []

# Clear the file before writing new results (use 'w' instead of 'a')
file_path = r'C:\Users\USER\Desktop\sales pred\notebook\model_results.txt'
with open(file_path, 'w') as f:
    f.write("Model Results\n\n")

# Loop through models and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    model_list.append(name)
    r2_list.append(model_test_r2)

    # Build the report once so the console and the file get the same text
    report = (
        f"{name}\n"
        "Model performance for Training set\n"
        f"- Root Mean Squared Error: {model_train_rmse:.4f}\n"
        f"- Mean Absolute Error: {model_train_mae:.4f}\n"
        f"- R2 Score: {model_train_r2:.4f}\n"
        "----------------------------------\n"
        "Model performance for Test set\n"
        f"- Root Mean Squared Error: {model_test_rmse:.4f}\n"
        f"- Mean Absolute Error: {model_test_mae:.4f}\n"
        f"- R2 Score: {model_test_r2:.4f}\n"
        + "=" * 35 + "\n"
    )

    # Print results
    print(report + "\n")

    # Save to file (append, because the file was reset before the loop)
    with open(file_path, 'a') as f:
        f.write(report + "\n")

print(f"Results saved to {file_path}")

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.7370
- Mean Absolute Error: 0.5492
- R2 Score: 0.4338
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8274
- Mean Absolute Error: 0.6396
- R2 Score: 0.4098


Lasso
Model performance for Training set
- Root Mean Squared Error: 0.9795
- Mean Absolute Error: 0.7790
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1.0771
- Mean Absolute Error: 0.8825
- R2 Score: -0.0002


Ridge
Model performance for Training set
- Root Mean Squared Error: 0.7385
- Mean Absolute Error: 0.5485
- R2 Score: 0.4316
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8293
- Mean Absolute Error: 0.6425
- R2 Score: 0.4071


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 0.6699
- Mean Absolute Error: 0.5138
- R2 Score: 0.5323
----------------------

In [44]:
with open('C:\\Users\\USER\\Desktop\\sales pred\\notebook\\model_results.txt', 'r') as f:
    print(f.read())

Model Results

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.7370
- Mean Absolute Error: 0.5492
- R2 Score: 0.4338
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8274
- Mean Absolute Error: 0.6396
- R2 Score: 0.4098

Lasso
Model performance for Training set
- Root Mean Squared Error: 0.9795
- Mean Absolute Error: 0.7790
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1.0771
- Mean Absolute Error: 0.8825
- R2 Score: -0.0002

Ridge
Model performance for Training set
- Root Mean Squared Error: 0.7385
- Mean Absolute Error: 0.5485
- R2 Score: 0.4316
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8293
- Mean Absolute Error: 0.6425
- R2 Score: 0.4071

K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 0.6699
- Mean Absolute Error: 0.5138
- R2 Score: 0.5323
----------


### LINEAR REGRESSION MODEL

In [45]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Defining the evaluate_model function
def evaluate_model(true_values, predicted_values):
    rmse = np.sqrt(mean_squared_error(true_values, predicted_values))
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)
    return mae, rmse, r2

# Defining the features (X) and target (y)
X = df_train.drop('Item_Outlet_Sales', axis=1)
y = df_train['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model
model = LinearRegression()
model.fit(X_train, y_train)
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Evaluate
model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

# Print results
print("Linear Regression")
print('Model performance for Training set')
print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
print("- R2 Score: {:.4f}".format(model_train_r2))
print('----------------------------------')
print('Model performance for Test set')
print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
print("- R2 Score: {:.4f}".format(model_test_r2))
print('='*35)

# Save to file
file_path = r'C:\Users\USER\Desktop\sales pred\notebook\linear_regression_results.txt'
with open(file_path, 'w') as f:
    f.write("Linear Regression\n")
    f.write('Model performance for Training set\n')
    f.write(f"- Root Mean Squared Error: {model_train_rmse:.4f}\n")
    f.write(f"- Mean Absolute Error: {model_train_mae:.4f}\n")
    f.write(f"- R2 Score: {model_train_r2:.4f}\n")
    f.write('----------------------------------\n')
    f.write('Model performance for Test set\n')
    f.write(f"- Root Mean Squared Error: {model_test_rmse:.4f}\n")
    f.write(f"- Mean Absolute Error: {model_test_mae:.4f}\n")
    f.write(f"- R2 Score: {model_test_r2:.4f}\n")
    f.write('='*35 + '\n')

print(f"Results saved to {file_path}")

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.7370
- Mean Absolute Error: 0.5492
- R2 Score: 0.4338
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8274
- Mean Absolute Error: 0.6396
- R2 Score: 0.4098
Results saved to C:\Users\USER\Desktop\sales pred\notebook\linear_regression_results.txt
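Because Item_Outlet_Sales was standardized before modelling, the RMSE and MAE above are expressed in standard deviations of sales rather than in original sales units. A minimal sketch of how a scaled RMSE maps back to original units is shown below; the sales figures and prediction offsets are made up, and the standardization is done by hand with the population standard deviation, which is what `StandardScaler` uses.

```python
import numpy as np

# Made-up sales values standing in for Item_Outlet_Sales
sales = np.array([2.0e6, 4.0e6, 6.0e6, 8.0e6])

# Standardize the same way StandardScaler does (population std, ddof=0)
mu, sigma = sales.mean(), sales.std()
sales_scaled = (sales - mu) / sigma

# Hypothetical model output in scaled units
pred_scaled = sales_scaled + np.array([0.1, -0.2, 0.0, 0.3])
rmse_scaled = np.sqrt(np.mean((sales_scaled - pred_scaled) ** 2))

# Undo the scaling on the predictions and recompute the error
pred_original = pred_scaled * sigma + mu
rmse_original = np.sqrt(np.mean((sales - pred_original) ** 2))

# The two errors differ only by the factor sigma
print(rmse_scaled, rmse_original, rmse_scaled * sigma)
```

R2 is unaffected by this rescaling, so the R2 scores can be read as they are.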


### RANDOM FOREST

In [46]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Defining evaluate_model function
def evaluate_model(true_values, predicted_values):
    rmse = np.sqrt(mean_squared_error(true_values, predicted_values))
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)
    return mae, rmse, r2

# df_train is the preprocessed DataFrame
X = df_train.drop('Item_Outlet_Sales', axis=1)
y = df_train['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the base Random Forest model
rf_base = RandomForestRegressor(random_state=42)

# hyperparameter grid
param_grid = {
    'n_estimators': [100, 150, 200],          # Focus around 100-200
    'max_depth': [10, 15, 20],                # Test slightly deeper trees
    'min_samples_split': [5, 10, 15],         # Tweak splitting
    'min_samples_leaf': [1, 2],               # Keeping leaf options light
    'max_features': ['sqrt', 'log2']          
}

# Performing GridSearchCV
grid_search = GridSearchCV(estimator=rf_base, param_grid=param_grid, 
                           cv=5, n_jobs=-1, scoring='r2', verbose=1)
grid_search.fit(X_train, y_train)

# Getting the best Random Forest model
rf_tuned = grid_search.best_estimator_
print("Best Random Forest Parameters:", grid_search.best_params_)

# Fitting the tuned model and making predictions
# (GridSearchCV already refits best_estimator_ on X_train, so this fit is redundant but harmless)
rf_tuned.fit(X_train, y_train)
y_train_pred = rf_tuned.predict(X_train)
y_test_pred = rf_tuned.predict(X_test)

# Evaluation of the tuned model
model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

# Print results
print("Random Forest Regressor (Tuned)")
print('Model performance for Training set')
print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
print("- R2 Score: {:.4f}".format(model_train_r2))
print('----------------------------------')
print('Model performance for Test set')
print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
print("- R2 Score: {:.4f}".format(model_test_r2))
print('='*35)

# Save results to a file
file_path = r'C:\Users\USER\Desktop\sales pred\notebook\random_forest_tuned_results.txt'
with open(file_path, 'w') as f:
    f.write("Random Forest Regressor (Tuned)\n")
    f.write('Model performance for Training set\n')
    f.write(f"- Root Mean Squared Error: {model_train_rmse:.4f}\n")
    f.write(f"- Mean Absolute Error: {model_train_mae:.4f}\n")
    f.write(f"- R2 Score: {model_train_r2:.4f}\n")
    f.write('----------------------------------\n')
    f.write('Model performance for Test set\n')
    f.write(f"- Root Mean Squared Error: {model_test_rmse:.4f}\n")
    f.write(f"- Mean Absolute Error: {model_test_mae:.4f}\n")
    f.write(f"- R2 Score: {model_test_r2:.4f}\n")
    f.write('='*35 + '\n')
    f.write(f"Best Parameters: {grid_search.best_params_}\n")

print(f"Results saved to {file_path}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Random Forest Parameters: {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 150}
Random Forest Regressor (Tuned)
Model performance for Training set
- Root Mean Squared Error: 0.6535
- Mean Absolute Error: 0.4924
- R2 Score: 0.5548
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8710
- Mean Absolute Error: 0.6769
- R2 Score: 0.3460
Results saved to C:\Users\USER\Desktop\sales pred\notebook\random_forest_tuned_results.txt


### XGBOOST REGRESSOR

In [47]:
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Defining evaluate_model function
def evaluate_model(true_values, predicted_values):
    rmse = np.sqrt(mean_squared_error(true_values, predicted_values))
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)
    return mae, rmse, r2

# df_train is the preprocessed DataFrame
X = df_train.drop('Item_Outlet_Sales', axis=1)
y = df_train['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the base XGBoost model
xgb_base = XGBRegressor(random_state=42, objective='reg:squarederror')

# hyperparameter grid
param_grid = {
    'n_estimators': [200, 300, 500],       
    'max_depth': [3, 4, 5],                
    'learning_rate': [0.01, 0.03, 0.05],   
    'subsample': [0.8, 0.9, 1.0],          
    'colsample_bytree': [0.8, 0.9]         
}

# Perform GridSearchCV
grid_search = GridSearchCV(estimator=xgb_base, param_grid=param_grid, 
                           cv=5, n_jobs=-1, scoring='r2', verbose=1)
grid_search.fit(X_train, y_train)

# Getting the best XGBoost model
xgb_tuned = grid_search.best_estimator_
print("Best XGBoost Parameters:", grid_search.best_params_)

# Fitting the tuned model and making predictions
xgb_tuned.fit(X_train, y_train)
y_train_pred = xgb_tuned.predict(X_train)
y_test_pred = xgb_tuned.predict(X_test)

# Evaluation of the tuned model
model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

# Print results
print("XGBRegressor (Tuned v3)")
print('Model performance for Training set')
print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
print("- R2 Score: {:.4f}".format(model_train_r2))
print('----------------------------------')
print('Model performance for Test set')
print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
print("- R2 Score: {:.4f}".format(model_test_r2))
print('='*35)

# Save results to a file
file_path = r'C:\Users\USER\Desktop\sales pred\notebook\xgboost_tuned_results.txt'
with open(file_path, 'w') as f:
    f.write("XGBRegressor (Tuned v3)\n")
    f.write('Model performance for Training set\n')
    f.write(f"- Root Mean Squared Error: {model_train_rmse:.4f}\n")
    f.write(f"- Mean Absolute Error: {model_train_mae:.4f}\n")
    f.write(f"- R2 Score: {model_train_r2:.4f}\n")
    f.write('----------------------------------\n')
    f.write('Model performance for Test set\n')
    f.write(f"- Root Mean Squared Error: {model_test_rmse:.4f}\n")
    f.write(f"- Mean Absolute Error: {model_test_mae:.4f}\n")
    f.write(f"- R2 Score: {model_test_r2:.4f}\n")
    f.write('='*35 + '\n')
    f.write(f"Best Parameters: {grid_search.best_params_}\n")

print(f"Results saved to {file_path}")

Fitting 5 folds for each of 162 candidates, totalling 810 fits
Best XGBoost Parameters: {'colsample_bytree': 0.9, 'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 200, 'subsample': 1.0}
XGBRegressor (Tuned v3)
Model performance for Training set
- Root Mean Squared Error: 0.6832
- Mean Absolute Error: 0.5069
- R2 Score: 0.5135
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8493
- Mean Absolute Error: 0.6457
- R2 Score: 0.3781
Results saved to C:\Users\USER\Desktop\sales pred\notebook\xgboost_tuned_results.txt
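
Every value in the best parameter set above sits at a boundary of its search range (for example, `learning_rate=0.01`, `max_depth=3`, and `n_estimators=200` are each the smallest candidate). That can be a hint that the grid does not bracket the optimum, although it is not proof. The following is a minimal sketch of a check for this; the helper name `params_on_grid_edge` is hypothetical (it is not part of the notebook or of scikit-learn), and it assumes each grid entry is a list of sortable values.

```python
def params_on_grid_edge(best_params, param_grid):
    """Return {name: 'min' | 'max'} for each best value lying on the edge of its grid."""
    on_edge = {}
    for name, best_value in best_params.items():
        candidates = sorted(param_grid[name])
        if len(candidates) < 2:
            continue  # a single candidate has no meaningful edge
        if best_value == candidates[0]:
            on_edge[name] = "min"
        elif best_value == candidates[-1]:
            on_edge[name] = "max"
    return on_edge


# Grid and best parameters copied from the search above
param_grid = {
    'n_estimators': [200, 300, 500],
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.03, 0.05],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9],
}
best_params = {
    'colsample_bytree': 0.9,
    'learning_rate': 0.01,
    'max_depth': 3,
    'n_estimators': 200,
    'subsample': 1.0,
}

edge_report = params_on_grid_edge(best_params, param_grid)
for name, side in sorted(edge_report.items()):
    print(f"{name}: best value {best_params[name]} is the grid {side}")
```

In the notebook itself, the same call could take `grid_search.best_params_` and `param_grid` directly. Any parameter it reports is a candidate for widening in a follow-up search.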


In [48]:
import pandas as pd

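def compare_models(models, X_train, y_train, X_test, y_test):
    """Evaluate each model in `models` (name -> estimator) and return one row per model.

    Relies on the five-argument evaluate_model defined in the next cell.
    """
    results = []
    for model_name, model in models.items():
        # evaluate_model fits the estimator and returns a dict of metrics
        model_results = evaluate_model(model, X_train, y_train, X_test, y_test)
        model_results['Model'] = model_name
        results.append(model_results)

    return pd.DataFrame(results)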


In [49]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

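def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return train, test, and 5-fold cross-validation metrics as a dict.

    Note: this redefines the two-argument evaluate_model helper called by the
    XGBoost tuning cell, so that cell will fail if it is re-run after this one.
    """
    # Training the model on the training data
    model.fit(X_train, y_train)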

    # Making predictions on the training and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculating evaluation metrics for training data
    mae_train = mean_absolute_error(y_train, y_train_pred)
    mse_train = mean_squared_error(y_train, y_train_pred)
    rmse_train = np.sqrt(mse_train)
    r2_train = r2_score(y_train, y_train_pred)

    # Calculating evaluation metrics for test data
    mae_test = mean_absolute_error(y_test, y_test_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    rmse_test = np.sqrt(mse_test)
    r2_test = r2_score(y_test, y_test_pred)

    # Calculating cross-validation RMSE per fold (5 folds of the training data);
    # the mean across folds is reported below
    cross_val_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cross_val_rmse = np.sqrt(-cross_val_scores)

    return {
        'MAE_train': mae_train,
        'MSE_train': mse_train,
        'RMSE_train': rmse_train,
        'R^2_train': r2_train,
        'MAE_test': mae_test,
        'MSE_test': mse_test,
        'RMSE_test': rmse_test,
        'R^2_test': r2_test,
        'Cross_Val_RMSE': cross_val_rmse.mean()
    }
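
Because the function above reuses the name `evaluate_model`, which the XGBoost tuning cell calls with two arguments and unpacks into three values, one option is to keep the prediction-only metrics under a separate name. The sketch below is an assumption about what such a helper could look like, not the notebook's original definition. The name `regression_metrics` is hypothetical, and it is written in plain Python so the arithmetic behind MAE, RMSE, and R² is visible.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)

    # R^2 = 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Assumption: report NaN when the targets are constant (R^2 is undefined there)
    r2 = 1 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return mae, rmse, r2


# Tiny illustrative example (made-up numbers, not notebook data)
mae, rmse, r2 = regression_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(f"MAE={mae:.4f}  RMSE={rmse:.4f}  R2={r2:.4f}")
```

With a helper like this, the tuning cell could unpack `regression_metrics(y_train, y_train_pred)` in the same order (MAE, RMSE, R²) without depending on which `evaluate_model` definition was run last.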


In [50]:
# Using the compare_models function
results_df = compare_models(models, X_train, y_train, X_test, y_test)

# Sorting the results by test-set MAE (lowest error first)
results_df.sort_values(by='MAE_test', ascending=True, inplace=True)

# Save results to a CSV file
results_df.to_csv("model_comparison_results.csv", index=False)


In [51]:
result = pd.read_csv('model_comparison_results.csv')

In [52]:
result

Unnamed: 0,MAE_train,MSE_train,RMSE_train,R^2_train,MAE_test,MSE_test,RMSE_test,R^2_test,Cross_Val_RMSE,Model
0,0.549232,0.543211,0.737029,0.433787,0.639554,0.684583,0.827395,0.409817,0.802616,Linear Regression
1,0.548497,0.545346,0.738476,0.431561,0.642483,0.687725,0.829292,0.407108,0.79494,Ridge
2,0.242865,0.108661,0.329638,0.886738,0.653271,0.715641,0.845955,0.383042,0.916839,Random Forest Regressor
3,0.205129,0.076338,0.276293,0.92043,0.700045,0.85111,0.922556,0.266253,0.919453,CatBoosting Regressor
4,0.513752,0.448742,0.669882,0.532256,0.700944,0.857564,0.926048,0.260689,0.841966,K-Neighbors Regressor
5,0.573471,0.488557,0.698969,0.490755,0.708897,0.816725,0.903728,0.295896,0.907614,AdaBoost Regressor
6,0.015852,0.000523,0.022869,0.999455,0.79266,1.140108,1.067759,0.017106,1.012788,XGBRegressor
7,0.0,0.0,0.0,1.0,0.828418,1.196642,1.093911,-0.031633,1.203138,Decision Tree
8,0.778952,0.959376,0.979477,0.0,0.882548,1.160205,1.077128,-0.00022,0.977329,Lasso
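
The table suggests that several tree-based models fit the training data almost perfectly yet generalise poorly, while the linear models score similarly on both sets. One way to make that visible is the gap between training and test R² per model. The sketch below uses a few rows copied from the table as plain tuples; the helper name `r2_gap` and the 0.2 threshold are illustrative assumptions, not values from the notebook.

```python
# (R^2_train, R^2_test) pairs copied from the comparison table above
scores = {
    "Linear Regression": (0.433787, 0.409817),
    "Ridge": (0.431561, 0.407108),
    "Random Forest Regressor": (0.886738, 0.383042),
    "XGBRegressor": (0.999455, 0.017106),
    "Decision Tree": (1.0, -0.031633),
}


def r2_gap(train_r2, test_r2):
    """Difference between training and test R^2; large values suggest overfitting."""
    return train_r2 - test_r2


GAP_THRESHOLD = 0.2  # arbitrary cut-off chosen for illustration only

gaps = {name: r2_gap(*pair) for name, pair in scores.items()}
flagged = [name for name, gap in gaps.items() if gap > GAP_THRESHOLD]

# Print models from the smallest gap to the largest
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    marker = "  <- large train/test gap" if name in flagged else ""
    print(f"{name:<25} gap={gap:.4f}{marker}")
```

On the full DataFrame the same quantity would be `result['R^2_train'] - result['R^2_test']`; sorting by it alongside `MAE_test` separates models that are merely weak from models that are overfitting.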


In [53]:
import pandas, numpy, seaborn, matplotlib, sklearn, catboost, xgboost, flask, dill, joblib
print("All packages imported successfully!")

All packages imported successfully!
