# Supermarket Sales Data Prediction (Model)
# **CRISP-DM Methodological Steps:**
1. **Business Understanding**
   - **Business Problem**:
     Inaccurate sales predictions and mispriced products lead to missed opportunities for both customers and the business. This project uses predictive analytics to identify the factors that influence sales, so that pricing strategies and decisions rest on data. The intended result is better market efficiency for all parties involved.
   
   - **Business Goal**:
     The primary goal of this project is to predict sales data accurately. By analyzing the features of each transaction, we build a model that predicts sales performance reliably (in the notebook below, the regression target is the `Unit price` column). This helps businesses optimize strategies and streamline operations.

2. **Introduction**:
   This section summarizes the sales dataset and its key features.

   **About Dataset:**
   - Invoice ID
   - Branch
   - City
   - Customer type
   - Gender
   - Product line
   - Unit price
   - Quantity
   - Tax 5%
   - Total
   - Date
   - Time
   - Payment
   - cogs
   - gross margin percentage
   - gross income
   - Rating

   **Summary of Predictive Analytics for Sales Prediction Project:**
   This project applies machine learning to historical sales data in order to predict sales figures. Two algorithms are used, **Linear Regression and RandomForestRegressor**; each learns from past transactions which features influence sales performance. These predictions help businesses make informed decisions, plan strategically, and improve financial outcomes.

#  **Importing Necessary Libraries**
   The cell below imports the libraries used for loading and preparing the data, visualization, model training, and model evaluation.


In [2]:
import pandas as pd  # For data analysis
import numpy as np  # For numerical operations
import seaborn as sns  # For data visualization and analysis
import matplotlib.pyplot as plt  # For plotting graphs
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # For converting categorical data to numerical values
from sklearn.metrics import r2_score, confusion_matrix  # For performance evaluation (r2_score for regression models, confusion_matrix for classification models)
from sklearn.linear_model import LinearRegression  # For Linear Regression model
from sklearn.ensemble import RandomForestRegressor  # For the Random Forest regression model
import joblib  # For saving and loading models

# For a regression model, evaluate with r2_score
# For a classification model, evaluate with accuracy_score
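
The rule of thumb in the comments above can be illustrated with a minimal sketch. The toy values below are made up for illustration and do not come from the supermarket dataset; `accuracy_score` is a real scikit-learn function but is not imported in the original cell.

```python
from sklearn.metrics import accuracy_score, r2_score

# Regression: targets are continuous, so R^2 compares the errors with a mean-only baseline
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.5, 5.0, 7.5, 9.0]
reg_score = r2_score(y_true_reg, y_pred_reg)

# Classification: targets are labels, so accuracy is the share of exact matches
y_true_clf = ["Member", "Normal", "Member", "Normal"]
y_pred_clf = ["Member", "Normal", "Normal", "Normal"]
clf_score = accuracy_score(y_true_clf, y_pred_clf)

print(f"R^2: {reg_score:.3f}, accuracy: {clf_score:.2f}")
```

Since this notebook predicts a continuous column, `r2_score` is the metric that applies here.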

# 2. Data Understanding
### Loading the Dataset


In [4]:
df=pd.read_csv(r"C:\Users\original\Downloads\Supermarket_Sales Data.csv")

# Display the first 5 rows of the dataset to understand its structure


In [6]:
df.head(5)

Unnamed: 0,Invoice ID,Branch,City,Customer type,Gender,Product line,Unit price,Quantity,Tax 5%,Total,Date,Time,Payment,cogs,gross margin percentage,gross income,Rating
0,750-67-8428,A,Yangon,Member,Female,Health and beauty,74.69,7,26.1415,548.9715,1/5/2019,13:08,Ewallet,522.83,4.761905,26.1415,9.1
1,226-31-3081,C,Naypyitaw,Normal,Female,Electronic accessories,15.28,5,3.82,80.22,3/8/2019,10:29,Cash,76.4,4.761905,3.82,9.6
2,631-41-3108,A,Yangon,Normal,Male,Home and lifestyle,46.33,7,16.2155,340.5255,3/3/2019,13:23,Credit card,324.31,4.761905,16.2155,7.4
3,123-19-1176,A,Yangon,Member,Male,Health and beauty,58.22,8,23.288,489.048,1/27/2019,20:33,Ewallet,465.76,4.761905,23.288,8.4
4,373-73-7910,A,Yangon,Normal,Male,Sports and travel,86.31,7,30.2085,634.3785,2/8/2019,10:37,Ewallet,604.17,4.761905,30.2085,5.3


# Get the shape of the dataset (number of rows and columns)


In [8]:
df.shape

(1000, 17)

# Displaying Information About the Dataset


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 17 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Invoice ID               1000 non-null   object 
 1   Branch                   1000 non-null   object 
 2   City                     1000 non-null   object 
 3   Customer type            1000 non-null   object 
 4   Gender                   1000 non-null   object 
 5   Product line             1000 non-null   object 
 6   Unit price               1000 non-null   float64
 7   Quantity                 1000 non-null   int64  
 8   Tax 5%                   1000 non-null   float64
 9   Total                    1000 non-null   float64
 10  Date                     1000 non-null   object 
 11  Time                     1000 non-null   object 
 12  Payment                  1000 non-null   object 
 13  cogs                     1000 non-null   float64
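 14  gross margin percentage  1000 non-null   float64
 15  gross income             1000 non-null   float64
 16  Rating                   1000 non-null   float64
dtypes: float64(7), int64(1), object(9)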

# Checking for Missing Values


In [12]:
df.isnull().sum()

Invoice ID                 0
Branch                     0
City                       0
Customer type              0
Gender                     0
Product line               0
Unit price                 0
Quantity                   0
Tax 5%                     0
Total                      0
Date                       0
Time                       0
Payment                    0
cogs                       0
gross margin percentage    0
gross income               0
Rating                     0
dtype: int64

# Checking for Duplicates


In [13]:
df.duplicated().sum()

0
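
Both checks return zero for this dataset, so no cleaning is needed. As a hedged illustration of what the same calls report when a table does have problems, here is a small sketch on a made-up frame (the `toy` values are hypothetical, not taken from the file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one duplicated row and two missing values
toy = pd.DataFrame({
    "City": ["Yangon", "Yangon", None, "Naypyitaw"],
    "Quantity": [7, 7, 5, np.nan],
})

missing_per_column = toy.isnull().sum()    # count of missing values per column
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

# One possible clean-up: drop repeated rows, then rows with any missing value
cleaned = toy.drop_duplicates().dropna()

print(missing_per_column.to_dict(), n_duplicates, len(cleaned))
```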

# 3. Data Preparation


### Dropping Unnecessary Columns

In [16]:
df.drop(['Invoice ID','Date','Time'],axis=1,inplace=True)

In [17]:
df.columns

Index(['Branch', 'City', 'Customer type', 'Gender', 'Product line',
       'Unit price', 'Quantity', 'Tax 5%', 'Total', 'Payment', 'cogs',
       'gross margin percentage', 'gross income', 'Rating'],
      dtype='object')

# Checking Data Types


In [19]:
df.dtypes

Branch                      object
City                        object
Customer type               object
Gender                      object
Product line                object
Unit price                 float64
Quantity                     int64
Tax 5%                     float64
Total                      float64
Payment                     object
cogs                       float64
gross margin percentage    float64
gross income               float64
Rating                     float64
dtype: object

# Encoding Categorical Variables


In [21]:
# Label-encode each categorical column, using a fresh encoder per column
categorical_cols=['Branch','City','Customer type','Gender','Product line','Payment']
for col in categorical_cols:
    df[col]=LabelEncoder().fit_transform(df[col])
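
The encoding step does not keep the fitted encoders, so the integer codes cannot later be mapped back to the original labels. A minimal sketch of one alternative, assuming a hypothetical `encoders` dictionary and a toy frame built from values seen in the sample rows, stores one fitted `LabelEncoder` per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame using category values that appear in the sample rows
toy = pd.DataFrame({
    "Branch": ["A", "C", "A", "A", "A"],
    "Payment": ["Ewallet", "Cash", "Credit card", "Ewallet", "Ewallet"],
})

encoders = {}  # hypothetical helper: column name -> fitted encoder
for col in ["Branch", "Payment"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])  # classes are numbered in sorted order
    encoders[col] = enc

# The stored encoder can translate codes back into labels
decoded = encoders["Payment"].inverse_transform(toy["Payment"])
print(toy["Payment"].tolist(), list(decoded))
```

Integer codes imply an ordering (for example `Cash` < `Credit card` < `Ewallet`) that a linear model will treat as meaningful. One-hot encoding, for instance with `pd.get_dummies` or the `OneHotEncoder` already imported above, avoids that at the cost of more columns.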

# Splitting Features and Target Variable


In [23]:
x=df.drop('Unit price',axis=1)
y=df['Unit price']
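
Before modeling, it is worth checking how the target relates to the remaining features. The sketch below uses only the five rows printed by `df.head(5)` and tests three arithmetic relationships. Whether they hold for all 1000 rows is an assumption that should be verified on the full file.

```python
import numpy as np
import pandas as pd

# The five rows shown in the df.head(5) output
sample = pd.DataFrame({
    "Unit price": [74.69, 15.28, 46.33, 58.22, 86.31],
    "Quantity": [7, 5, 7, 8, 7],
    "Tax 5%": [26.1415, 3.82, 16.2155, 23.288, 30.2085],
    "Total": [548.9715, 80.22, 340.5255, 489.048, 634.3785],
    "cogs": [522.83, 76.4, 324.31, 465.76, 604.17],
})

# Candidate relationships between the columns
price_matches = np.allclose(sample["cogs"] / sample["Quantity"], sample["Unit price"])
tax_matches = np.allclose(0.05 * sample["cogs"], sample["Tax 5%"])
total_matches = np.allclose(1.05 * sample["cogs"], sample["Total"])

print(price_matches, tax_matches, total_matches)
```

If these relationships hold throughout, `Tax 5%`, `Total`, `cogs`, and `gross income` carry the same information, and `Unit price` is the ratio `cogs / Quantity`. A linear model can only approximate a ratio, which is worth keeping in mind when reading the R² scores.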

# Splitting Data into Training and Test Sets


In [25]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
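
With 1000 rows and `test_size=0.2`, the split should yield 800 training rows and 200 test rows. The sketch below confirms the shapes on placeholder arrays of the same size (13 feature columns remain after dropping `Unit price`). The array contents are arbitrary stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the notebook's features and target
X = np.arange(1000 * 13).reshape(1000, 13)
y = np.arange(1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# A fixed random_state makes the split reproducible across runs
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(X_train, X_train_again)
```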

# 4. Modeling
## Training the Model

In [27]:
model=LinearRegression()
model.fit(x_train,y_train)

# 5. Evaluation
## Making Predictions


In [29]:
y_predict =model.predict(x_test)

## Evaluating the Model Performance

In [31]:
print(r2_score(y_test, y_predict))

0.7968736505327494
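
An R² of about 0.80 means the model explains roughly 80% of the variance of the target on the test set. As a sketch of what `r2_score` computes, the example below derives R² by hand on made-up numbers, together with two error metrics expressed in the target's own units (MAE and RMSE are additions here, not part of the original evaluation):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

print(f"R^2={r2_manual:.3f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```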


In [32]:
import os

# joblib.dump does not create missing folders (the original run raised FileNotFoundError), so create the folder first
model_dir=r"C:\Users\original\Desktop\Holistic Models\supermarket"
os.makedirs(model_dir,exist_ok=True)
joblib.dump(model,os.path.join(model_dir,"ali.pkl"))
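
`joblib.dump` writes to an existing directory and does not create missing parent folders. A minimal, self-contained sketch of a save-and-reload round trip follows. It uses a temporary directory and a tiny stand-in model, and the path components and file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (not the supermarket model)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# Hypothetical target path inside a temporary directory
base_dir = tempfile.mkdtemp()
model_path = os.path.join(base_dir, "models", "supermarket", "model.pkl")

# Create the parent folders first, then save
os.makedirs(os.path.dirname(model_path), exist_ok=True)
joblib.dump(model, model_path)

# Reload and confirm the restored model predicts identically
restored = joblib.load(model_path)
same = np.allclose(restored.predict(X), model.predict(X))
print(same)
```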

## Training a Random Forest Model

In [None]:
model2=RandomForestRegressor()
model2.fit(x_train,y_train)

In [None]:
y_predict =model2.predict(x_test)

In [None]:
print(r2_score(y_test, y_predict))

In [None]:
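import os
os.makedirs(r"C:\Users\original\Desktop\Holistic Models\supermarket",exist_ok=True)  # make sure the target folder exists
joblib.dump(model2,r"C:\Users\original\Desktop\Holistic Models\supermarket\ali2.pkl")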
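
The Random Forest cells above were not executed, so no score is available to compare against the linear model. One way to compare the two estimators side by side is cross-validation. The sketch below runs on synthetic stand-in data in which the target is a ratio of two features; the data, the `candidates` dictionary, and the hyperparameters are all assumptions for illustration, not results from the supermarket file.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (NOT the supermarket file): price = cogs / quantity
rng = np.random.default_rng(0)
quantity = rng.integers(1, 11, size=300)
unit_price = rng.uniform(10, 100, size=300)
cogs = unit_price * quantity
X = np.column_stack([cogs, quantity])
y = unit_price

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Mean 5-fold cross-validated R^2 for each candidate
cv_scores = {
    name: cross_val_score(est, X, y, cv=5, scoring="r2").mean()
    for name, est in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
```

Keeping the scores in a dictionary also avoids overwriting one model's predictions with the other's, as happens when `y_predict` is reused for both models.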