# Sales Prediction Project
## Problem Statement:
Jamie’s Supermarket struggles to predict sales accurately, which leads to inefficient stock management. This causes two major problems: stockouts, where popular products run out, and overstocking, which increases storage costs and can lead to wastage. Without data-driven tools to forecast demand, the supermarket faces both lost sales and excess inventory, which hurts its profitability.

### Approach: 
This project addresses the problem by developing a machine-learning model that predicts sales from historical data. The model is intended to help Jamie’s Supermarket in Uganda optimize stock replenishment, avoid stockouts, and reduce overstocking. The goal is to make operations more efficient, leading to higher sales and improved customer satisfaction.

**Item_Weight:** Weight of product

**Item_Fat_Content:** Whether the product is low fat or not

**Item_Type:** The category to which the product belongs

**Item_MRP:** Maximum Retail Price (list price) of the product

**Outlet_Establishment_Year:** The year in which the supermarket was established

**Outlet_Location_Type:** The type of city in which the store is located

**Item_Outlet_Sales:** Sales of the product in the supermarket. This is the variable to be predicted.


In [1]:
# Import basic libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Modelling
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from catboost import CatBoostRegressor
from xgboost import XGBRegressor

import pickle

In [3]:
df_train = pd.read_csv(r'C:\Users\USER\Desktop\sales pred\Train.csv')
df_test = pd.read_csv(r'C:\Users\USER\Desktop\sales pred\Test.csv')

In [4]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,Dairy,249.8092,2017,Kireka,Supermarket,3735.138
1,5.92,Regular,Soft Drinks,48.2692,2017,Kireka,Supermarket,443.4228
2,17.5,Low Fat,Meat,141.618,2017,Kireka,Supermarket,2097.27
3,19.2,Regular,Fruits and Vegetables,182.095,2017,Kireka,Supermarket,732.38
4,8.93,Low Fat,Household,53.8614,2017,Kireka,Supermarket,994.7052


In [5]:
df_train.shape

(257, 8)

In [6]:
df_train.columns

Index(['Item_Weight', 'Item_Fat_Content', 'Item_Type', 'Item_MRP',
       'Outlet_Establishment_Year', 'Outlet_Location_Type', 'Outlet_Type',
       'Item_Outlet_Sales'],
      dtype='object')

In [8]:
df_test.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type
0,20.75,Low Fat,Snack Foods,107.8622,1,2017,Kireka,Supermarket
1,8.3,reg,Dairy,87.3198,1,2017,Kireka,Supermarket
2,14.6,Low Fat,Others,241.7538,1,2017,Kireka,Supermarket
3,7.315,Low Fat,Snack Foods,155.034,1,2017,Kireka,Supermarket
4,,Regular,Dairy,234.23,1,2017,Kireka,Supermarket


In [9]:
df_test.shape

(329, 8)

In [10]:
df_train.describe()

Unnamed: 0,Item_Weight,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
count,207.0,257.0,257.0,257.0
mean,12.8243,139.765798,2017.0,2170.997635
std,4.415548,63.948644,0.0,1746.535039
min,4.88,31.29,2017.0,41.2796
25%,8.8925,92.6804,2017.0,745.696
50%,12.85,143.8812,2017.0,1794.331
75%,16.6,185.4266,2017.0,3134.5864
max,21.35,265.2226,2017.0,7968.2944


In [11]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257 entries, 0 to 256
Data columns (total 8 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Weight                207 non-null    float64
 1   Item_Fat_Content           257 non-null    object 
 2   Item_Type                  257 non-null    object 
 3   Item_MRP                   257 non-null    float64
 4   Outlet_Establishment_Year  257 non-null    int64  
 5   Outlet_Location_Type       257 non-null    object 
 6   Outlet_Type                257 non-null    object 
 7   Item_Outlet_Sales          257 non-null    float64
dtypes: float64(3), int64(1), object(4)
memory usage: 16.2+ KB


In [12]:
df_train.isnull().sum()

Item_Weight                  50
Item_Fat_Content              0
Item_Type                     0
Item_MRP                      0
Outlet_Establishment_Year     0
Outlet_Location_Type          0
Outlet_Type                   0
Item_Outlet_Sales             0
dtype: int64

In [13]:
df_test.isnull().sum()

Item_Weight                  68
Item_Fat_Content              0
Item_Type                     0
Item_MRP                      0
Outlet_Identifier             0
Outlet_Establishment_Year     0
Outlet_Location_Type          0
Outlet_Type                   0
dtype: int64

In [14]:
df_train.duplicated().sum()

0

In [17]:
df_train.head(3)

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,9.3,Low Fat,Dairy,249.8092,2017,Kireka,Supermarket,3735.138
1,5.92,Regular,Soft Drinks,48.2692,2017,Kireka,Supermarket,443.4228
2,17.5,Low Fat,Meat,141.618,2017,Kireka,Supermarket,2097.27


In [18]:
df_train['Outlet_Establishment_Year'].unique()

array([2017], dtype=int64)

In [19]:
df_train.isnull().sum()

Item_Weight                  50
Item_Fat_Content              0
Item_Type                     0
Item_MRP                      0
Outlet_Establishment_Year     0
Outlet_Location_Type          0
Outlet_Type                   0
Item_Outlet_Sales             0
dtype: int64

In [20]:
# Display rows with null values
null_rows = df_train[df_train.isnull().any(axis=1)]
null_rows

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
7,,Low Fat,Snack Foods,107.7622,2017,Kireka,Supermarket,4022.7636
18,,Low Fat,Hard Drinks,113.2834,2017,Kireka,Supermarket,2303.668
21,,Regular,Baking Goods,144.5444,2017,Kireka,Supermarket,4064.0432
23,,Low Fat,Baking Goods,107.6938,2017,Kireka,Supermarket,214.3876
29,,Regular,Canned,43.6454,2017,Kireka,Supermarket,125.8362
36,,Regular,Fruits and Vegetables,128.0678,2017,Kireka,Supermarket,2797.6916
38,,Regular,Snack Foods,36.9874,2017,Kireka,Supermarket,388.1614
39,,Low Fat,Snack Foods,87.6198,2017,Kireka,Supermarket,2180.495
49,,Regular,Dairy,196.8794,2017,Kireka,Supermarket,780.3176
59,,Low Fat,Canned,180.0344,2017,Kireka,Supermarket,892.172


In [21]:
# define numerical & categorical columns in train data
numeric_features = [feature for feature in df_train.columns if df_train[feature].dtype != 'O']
categorical_features = [feature for feature in df_train.columns if df_train[feature].dtype == 'O']

# print numerical & categorical columns in train data
print('We have {} numerical features in train data and they are as follows: {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features in train data and they are as follows: {}'.format(len(categorical_features), categorical_features))

We have 4 numerical features in train data and they are as follows: ['Item_Weight', 'Item_MRP', 'Outlet_Establishment_Year', 'Item_Outlet_Sales']

We have 4 categorical features in train data and they are as follows: ['Item_Fat_Content', 'Item_Type', 'Outlet_Location_Type', 'Outlet_Type']


In [23]:
print('Unique values in categorical features in train data')
print('Unique values in Item_Fat_Content:', df_train['Item_Fat_Content'].unique())
print('Unique values in Item_Type:', df_train['Item_Type'].unique())
print('Unique values in Outlet_Location_Type:', df_train['Outlet_Location_Type'].unique())
print('Unique values in Outlet_Type:', df_train['Outlet_Type'].unique())

Unique values in categorical features in train data
Unique values in Item_Fat_Content: ['Low Fat' 'Regular' 'low fat' 'LF' 'reg']
Unique values in Item_Type: ['Dairy' 'Soft Drinks' 'Meat' 'Fruits and Vegetables' 'Household'
 'Baking Goods' 'Snack Foods' 'Frozen Foods' 'Breakfast'
 'Health and Hygiene' 'Hard Drinks' 'Canned' 'Breads' 'Starchy Foods'
 'Others' 'Seafood']
Unique values in Outlet_Location_Type: ['Kireka']
Unique values in Outlet_Type: ['Supermarket']


## Data Preprocessing 
1. Remove outliers discovered in the EDA file
2. Fill features with null values (median or mode; the code below uses the mean for Item_Weight)
3. Drop redundant features
4. Feature encoding

#### 1. Remove outliers discovered in the EDA file
##### Winsorization:
Winsorization replaces extreme values with the nearest non-outlier value: values above the upper bound are replaced with the largest non-outlier value, and values below the lower bound with the smallest non-outlier value.

In [24]:
def find_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

outliers_weight = find_outliers_iqr(df_train, 'Item_Weight')
outliers_mrp = find_outliers_iqr(df_train, 'Item_MRP')
outliers_sales = find_outliers_iqr(df_train, 'Item_Outlet_Sales')

print("Number of outliers in Item_Weight:", len(outliers_weight))
print("Number of outliers in Item_MRP:", len(outliers_mrp))
print("Number of outliers in Item_Outlet_Sales:", len(outliers_sales))

Number of outliers in Item_Weight: 0
Number of outliers in Item_MRP: 0
Number of outliers in Item_Outlet_Sales: 6
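
The cell above only counts the IQR outliers; it does not cap them. A minimal sketch of the winsorization described earlier, assuming the same 1.5 × IQR bounds, might look like the following. The helper name `winsorize_iqr` and the toy values are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

def winsorize_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace IQR outliers with the nearest non-outlier value in the series.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # Values inside the bounds supply the replacement (capping) values
    inliers = series[(series >= lower_bound) & (series <= upper_bound)]
    # clip() caps values outside [min inlier, max inlier]; NaN stays NaN
    return series.clip(lower=inliers.min(), upper=inliers.max())

# Toy data (made-up values, not taken from Train.csv)
sales = pd.Series([100.0, 120.0, 130.0, 150.0, 160.0, 5000.0])
capped = winsorize_iqr(sales)
print(capped.tolist())
```

On the real data, the same idea would be applied as `df_train['Item_Outlet_Sales'] = winsorize_iqr(df_train['Item_Outlet_Sales'])`, ideally using bounds computed from the training split only.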


In [26]:
# Creating a new column for Outlet_Age
df_train['Outlet_Age'] = df_train['Outlet_Establishment_Year'].apply(lambda year: 2023 - year)

# Standardize values in the 'Item_Fat_Content' column
df_train['Item_Fat_Content'] = df_train['Item_Fat_Content'].replace({'LF': 'Low Fat', 'low fat': 'Low Fat', 'reg': 'Regular'})

# Drop unnecessary columns
df_train.drop([ 'Outlet_Establishment_Year'], axis=1, inplace=True)

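#### 2. Fill features with null values
Item_Weight is the only column with null values; the cell below fills them with the column mean.

In [28]:
# Assign the result back instead of calling fillna(..., inplace=True) on a column selection
df_train['Item_Weight'] = df_train['Item_Weight'].fillna(df_train['Item_Weight'].mean())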
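
One common alternative to filling with the mean is to use the median for numeric columns and the mode for categorical ones, learning the fill values on the training data and reusing them for the test data (which also has missing Item_Weight values). The sketch below illustrates that idea under stated assumptions: `fit_fill_values` is a hypothetical helper and the toy frames contain made-up values.

```python
import numpy as np
import pandas as pd

def fit_fill_values(train: pd.DataFrame) -> dict:
    """Learn one fill value per column: median for numeric, mode otherwise.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    fill_values = {}
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill_values[col] = train[col].median()
        else:
            # mode() ignores missing values; take the first (most frequent) entry
            fill_values[col] = train[col].mode().iloc[0]
    return fill_values

# Toy frames (made-up values, not taken from Train.csv or Test.csv)
toy_train = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 8.9],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", None],
})
toy_test = pd.DataFrame({
    "Item_Weight": [np.nan, 20.75],
    "Item_Fat_Content": ["Regular", None],
})

# Fill values come from the training frame only and are reused for the test frame
fill_values = fit_fill_values(toy_train)
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)
print(fill_values)
```

Reusing the training statistics for the test frame keeps both datasets on the same footing and avoids computing anything from the data the model is later scored on.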


In [29]:
df_train['Outlet_Type'].value_counts()

Outlet_Type
Supermarket    257
Name: count, dtype: int64

In [30]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257 entries, 0 to 256
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Item_Weight           257 non-null    float64
 1   Item_Fat_Content      257 non-null    object 
 2   Item_Type             257 non-null    object 
 3   Item_MRP              257 non-null    float64
 4   Outlet_Location_Type  257 non-null    object 
 5   Outlet_Type           257 non-null    object 
 6   Item_Outlet_Sales     257 non-null    float64
 7   Outlet_Age            257 non-null    int64  
dtypes: float64(3), int64(1), object(4)
memory usage: 16.2+ KB


In [31]:
df_train['Item_Fat_Content'].unique()

array(['Low Fat', 'Regular'], dtype=object)

In [32]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,Low Fat,Dairy,249.8092,Kireka,Supermarket,3735.138,6
1,5.92,Regular,Soft Drinks,48.2692,Kireka,Supermarket,443.4228,6
2,17.5,Low Fat,Meat,141.618,Kireka,Supermarket,2097.27,6
3,19.2,Regular,Fruits and Vegetables,182.095,Kireka,Supermarket,732.38,6
4,8.93,Low Fat,Household,53.8614,Kireka,Supermarket,994.7052,6


In [33]:
df_train['Item_Outlet_Sales'].max()

7968.2944

In [34]:
df_train['Item_Outlet_Sales'].min()

41.2796

In [35]:
df_train.shape

(257, 8)

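In [None]:
# Note: this cell has no execution count, and the counts shown below sum to 8,523 rows,
# so they cannot have been produced by the 257-row df_train used in this notebook.
df_train['Item_Fat_Content'].value_counts()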

Item_Fat_Content
Low Fat    5517
Regular    3006
Name: count, dtype: int64

In [36]:
df_train[df_train['Outlet_Type'] == 'Supermarket']

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3000,Low Fat,Dairy,249.8092,Kireka,Supermarket,3735.1380,6
1,5.9200,Regular,Soft Drinks,48.2692,Kireka,Supermarket,443.4228,6
2,17.5000,Low Fat,Meat,141.6180,Kireka,Supermarket,2097.2700,6
3,19.2000,Regular,Fruits and Vegetables,182.0950,Kireka,Supermarket,732.3800,6
4,8.9300,Low Fat,Household,53.8614,Kireka,Supermarket,994.7052,6
...,...,...,...,...,...,...,...,...
252,7.7850,Low Fat,Fruits and Vegetables,61.4510,Kireka,Supermarket,759.0120,6
253,11.8000,Regular,Snack Foods,125.9704,Kireka,Supermarket,1877.5560,6
254,13.1500,Regular,Fruits and Vegetables,171.8764,Kireka,Supermarket,3779.0808,6
255,12.8243,Low Fat,Frozen Foods,41.5796,Kireka,Supermarket,41.2796,6


In [37]:
# define numerical & categorical columns in train data
numeric_features = [feature for feature in df_train.columns if df_train[feature].dtype != 'O']
categorical_features = [feature for feature in df_train.columns if df_train[feature].dtype == 'O']

# print numerical & categorical columns in train data
print('We have {} numerical features in train data and they are as follows: {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features in train data and they are as follows: {}'.format(len(categorical_features), categorical_features))

We have 4 numerical features in train data and they are as follows: ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales', 'Outlet_Age']

We have 4 categorical features in train data and they are as follows: ['Item_Fat_Content', 'Item_Type', 'Outlet_Location_Type', 'Outlet_Type']


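## Feature Encoding
1. Label Encoding
2. One-Hot Encoding

**Ordinal variables:** Item_Fat_Content, Outlet_Size, Outlet_Location_Type

**Nominal variables:** Item_Identifier, Item_Type, Outlet_Identifier, Outlet_Type

Of these, Outlet_Size and Item_Identifier do not appear in either data file shown above, and Outlet_Identifier appears only in the test data.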

In [38]:
from sklearn.preprocessing import LabelEncoder

# Columns treated as ordinal are label-encoded to integer codes
label_cols = ['Item_Fat_Content', 'Outlet_Location_Type']

for col in label_cols:
    # Fit a fresh encoder per column so each mapping is independent
    df_train[col] = LabelEncoder().fit_transform(df_train[col])

In [39]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_Type,Item_MRP,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales,Outlet_Age
0,9.3,0,Dairy,249.8092,0,Supermarket,3735.138,6
1,5.92,1,Soft Drinks,48.2692,0,Supermarket,443.4228,6
2,17.5,0,Meat,141.618,0,Supermarket,2097.27,6
3,19.2,1,Fruits and Vegetables,182.095,0,Supermarket,732.38,6
4,8.93,0,Household,53.8614,0,Supermarket,994.7052,6


In [41]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Define columns for one-hot encoding
cols = ['Item_Type', 'Outlet_Type']

# Apply one-hot encoder
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')
oh_encoder_df_train = pd.DataFrame(oh_encoder.fit_transform(df_train[cols])).astype('int64')

# Get feature column names
oh_encoder_df_train.columns = oh_encoder.get_feature_names_out(cols)

# One-hot encoding removed index; put it back
oh_encoder_df_train.index = df_train.index

# Add one-hot encoded columns to the main DataFrame
df_train = pd.concat([df_train, oh_encoder_df_train], axis=1)

# Drop the original categorical columns
df_train = df_train.drop(['Item_Type', 'Outlet_Type'], axis=1)

In [42]:
df_train.head()

Unnamed: 0,Item_Weight,Item_Fat_Content,Item_MRP,Outlet_Location_Type,Item_Outlet_Sales,Outlet_Age,Item_Type_Breads,Item_Type_Breakfast,Item_Type_Canned,Item_Type_Dairy,...,Item_Type_Fruits and Vegetables,Item_Type_Hard Drinks,Item_Type_Health and Hygiene,Item_Type_Household,Item_Type_Meat,Item_Type_Others,Item_Type_Seafood,Item_Type_Snack Foods,Item_Type_Soft Drinks,Item_Type_Starchy Foods
0,9.3,0,249.8092,0,3735.138,6,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,5.92,1,48.2692,0,443.4228,6,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,17.5,0,141.618,0,2097.27,6,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,19.2,1,182.095,0,732.38,6,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,8.93,0,53.8614,0,994.7052,6,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [43]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

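# Item_Outlet_Sales (the target) is scaled too, so the error metrics reported later are in standardized units
col_to_scale = ['Item_Weight', 'Item_MRP', 'Item_Outlet_Sales', 'Outlet_Age']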

for col in col_to_scale:
    # Reshape the column to a 2D array with a single column
    col_data = df_train[col].values.reshape(-1, 1)
    
    # Fit and transform the scaler on the reshaped data
    df_train[col] = scaler.fit_transform(col_data)
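
The loop above standardizes each column on the full training frame, including the target, before the train/test split. An alternative sketch, assuming scikit-learn's `Pipeline` and `ColumnTransformer`, scales only the feature columns and learns the scaling statistics from the training split alone. The synthetic data and the choice of `LinearRegression` are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed training frame (made-up values)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "Item_Weight": rng.uniform(5, 21, size=100),
    "Item_MRP": rng.uniform(30, 265, size=100),
    "Item_Fat_Content": rng.integers(0, 2, size=100),
})
toy["Item_Outlet_Sales"] = 15 * toy["Item_MRP"] + rng.normal(0, 300, size=100)

X = toy.drop(columns="Item_Outlet_Sales")
y = toy["Item_Outlet_Sales"]  # target is left in its original units
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale only the continuous feature columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["Item_Weight", "Item_MRP"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])

# fit() computes the scaling mean and standard deviation from X_train only
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"Test R2: {test_r2:.3f}")
```

Because the scaler lives inside the pipeline, `cross_val_score` and `GridSearchCV` would also refit it within each fold, and predictions come out directly in sales units.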


In [44]:
X = df_train.drop(['Item_Outlet_Sales'], axis = 1)
y = df_train['Item_Outlet_Sales']

In [45]:
# Splitting the dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2)

In [46]:
len(X_train)


205

In [47]:
len(X_test)

52

In [48]:
def evaluate_model(true, predicted):
    # Return MAE, RMSE and R2 for a set of predictions
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square

In [53]:


# Define evaluate_model function
def evaluate_model(true_values, predicted_values):
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    rmse = np.sqrt(mean_squared_error(true_values, predicted_values))
    mae = mean_absolute_error(true_values, predicted_values)
    r2 = r2_score(true_values, predicted_values)
    return mae, rmse, r2

# Ensure df_train is your preprocessed DataFrame
# Define features (X) and target (y)
X = df_train.drop('Item_Outlet_Sales', axis=1)
y = df_train['Item_Outlet_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define models dictionary
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "XGBRegressor": XGBRegressor(random_state=42),
    "CatBoosting Regressor": CatBoostRegressor(verbose=False, random_state=42),
    "AdaBoost Regressor": AdaBoostRegressor(random_state=42)
}

model_list = []
r2_list = []

# Clear the file before writing new results (use 'w' instead of 'a')
file_path = r'C:\Users\USER\Desktop\sales pred\notebook\model_results.txt'
with open(file_path, 'w') as f:
    f.write("Model Results\n\n")

# Loop through models and evaluate
for name, model in models.items():
    model.fit(X_train, y_train)
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    # Print results
    print(name)
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))
    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

    # Save to file
    with open(file_path, 'a') as f:
        f.write(f"{name}\n")
        f.write('Model performance for Training set\n')
        f.write(f"- Root Mean Squared Error: {model_train_rmse:.4f}\n")
        f.write(f"- Mean Absolute Error: {model_train_mae:.4f}\n")
        f.write(f"- R2 Score: {model_train_r2:.4f}\n")
        f.write('----------------------------------\n')
        f.write('Model performance for Test set\n')
        f.write(f"- Root Mean Squared Error: {model_test_rmse:.4f}\n")
        f.write(f"- Mean Absolute Error: {model_test_mae:.4f}\n")
        f.write(f"- R2 Score: {model_test_r2:.4f}\n")
        f.write('='*35 + '\n\n')

print(f"Results saved to {file_path}")

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.7370
- Mean Absolute Error: 0.5492
- R2 Score: 0.4338
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8273
- Mean Absolute Error: 0.6395
- R2 Score: 0.4099


Lasso
Model performance for Training set
- Root Mean Squared Error: 0.9795
- Mean Absolute Error: 0.7790
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1.0771
- Mean Absolute Error: 0.8825
- R2 Score: -0.0002


Ridge
Model performance for Training set
- Root Mean Squared Error: 0.7385
- Mean Absolute Error: 0.5485
- R2 Score: 0.4316
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8292
- Mean Absolute Error: 0.6424
- R2 Score: 0.4072


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 0.6699
- Mean Absolute Error: 0.5138
- R2 Score: 0.5323
----------------------

In [55]:
with open('C:\\Users\\USER\\Desktop\\sales pred\\notebook\\model_results.txt', 'r') as f:
    print(f.read())

Model Results

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 0.7370
- Mean Absolute Error: 0.5492
- R2 Score: 0.4338
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8273
- Mean Absolute Error: 0.6395
- R2 Score: 0.4099

Lasso
Model performance for Training set
- Root Mean Squared Error: 0.9795
- Mean Absolute Error: 0.7790
- R2 Score: 0.0000
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 1.0771
- Mean Absolute Error: 0.8825
- R2 Score: -0.0002

Ridge
Model performance for Training set
- Root Mean Squared Error: 0.7385
- Mean Absolute Error: 0.5485
- R2 Score: 0.4316
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 0.8292
- Mean Absolute Error: 0.6424
- R2 Score: 0.4072

K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 0.6699
- Mean Absolute Error: 0.5138
- R2 Score: 0.5323
----------

In [56]:
import pandas as pd

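def compare_models(models, X_train, y_train, X_test, y_test):
    # Relies on the five-argument evaluate_model defined in the next cell,
    # which replaces the earlier two-argument function of the same name
    results = []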
    for model_name, model in models.items():
        model_results = evaluate_model(model, X_train, y_train, X_test, y_test)
        model_results['Model'] = model_name
        results.append(model_results)

    return pd.DataFrame(results)


In [57]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

def evaluate_model(model, X_train, y_train, X_test, y_test):
    # Train the model on the training data
    model.fit(X_train, y_train)

    # Make predictions on the training and test data
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Calculate evaluation metrics for training data
    mae_train = mean_absolute_error(y_train, y_train_pred)
    mse_train = mean_squared_error(y_train, y_train_pred)
    rmse_train = np.sqrt(mse_train)
    r2_train = r2_score(y_train, y_train_pred)

    # Calculate evaluation metrics for test data
    mae_test = mean_absolute_error(y_test, y_test_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)
    rmse_test = np.sqrt(mse_test)
    r2_test = r2_score(y_test, y_test_pred)

    # Calculate cross-validation RMSE
    cross_val_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
    cross_val_rmse = np.sqrt(-cross_val_scores)

    return {
        'MAE_train': mae_train,
        'MSE_train': mse_train,
        'RMSE_train': rmse_train,
        'R^2_train': r2_train,
        'MAE_test': mae_test,
        'MSE_test': mse_test,
        'RMSE_test': rmse_test,
        'R^2_test': r2_test,
        'Cross_Val_RMSE': cross_val_rmse.mean()
    }


In [58]:
# Use the compare_models function
results_df = compare_models(models, X_train, y_train, X_test, y_test)

# Sort the results by MAE
results_df.sort_values(by='MAE_test', ascending=True, inplace=True)

# Save the results to a CSV file
results_df.to_csv("model_comparison_results.csv", index=False)


In [59]:
result = pd.read_csv('model_comparison_results.csv')

In [60]:
result

Unnamed: 0,MAE_train,MSE_train,RMSE_train,R^2_train,MAE_test,MSE_test,RMSE_test,R^2_test,Cross_Val_RMSE,Model
0,0.549242,0.543195,0.737018,0.43381,0.639508,0.684495,0.827342,0.409872,0.802586,Linear Regression
1,0.548515,0.545331,0.738465,0.431583,0.642439,0.687648,0.829245,0.407155,0.794914,Ridge
2,0.242438,0.108912,0.330019,0.886477,0.649148,0.711277,0.843372,0.386783,0.914769,Random Forest Regressor
3,0.536316,0.479945,0.692781,0.499737,0.66742,0.765471,0.874912,0.34006,0.884262,AdaBoost Regressor
4,0.201796,0.075924,0.275543,0.920862,0.694244,0.838397,0.91564,0.277188,0.915395,CatBoosting Regressor
5,0.513762,0.448749,0.669887,0.532254,0.700899,0.857444,0.925983,0.260767,0.841958,K-Neighbors Regressor
6,0.014833,0.000496,0.022263,0.999483,0.788348,1.136921,1.066265,0.01982,1.00998,XGBRegressor
7,0.0,0.0,0.0,1.0,0.82241,1.197605,1.094351,-0.032498,1.215282,Decision Tree
8,0.778979,0.959386,0.979482,0.0,0.88251,1.160165,1.07711,-0.000219,0.977337,Lasso
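
Because Item_Outlet_Sales was standardized before modelling, the MAE and RMSE values in this table are in units of standard deviations rather than sales. A rough back-conversion, sketched below, multiplies the scaled error by the standard deviation that `StandardScaler` used. The numbers are copied from the `describe()` output and the Linear Regression row above; the variable names are illustrative.

```python
import math

# Values copied from the describe() output and the results table above
n_rows = 257
sample_std_sales = 1746.535039   # pandas describe() reports the sample std (ddof=1)
rmse_test_scaled = 0.827342      # Linear Regression, RMSE_test

# StandardScaler divides by the population standard deviation (ddof=0),
# so convert the sample std before multiplying back
population_std_sales = sample_std_sales * math.sqrt((n_rows - 1) / n_rows)

# RMSE scales linearly with the target, so multiplying restores sales units
rmse_test_original_units = rmse_test_scaled * population_std_sales
print(f"Approximate test RMSE in sales units: {rmse_test_original_units:.1f}")
```

The same multiplication applies to the MAE columns, while the MSE columns scale with the square of the standard deviation. R² is unaffected by rescaling the target, so those columns can be read as they are.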


In [61]:
import pandas, numpy, seaborn, matplotlib, sklearn, catboost, xgboost, flask, dill, joblib
print("All packages imported successfully!")

All packages imported successfully!
