<a href="https://colab.research.google.com/github/SinghNavnoor/Food-Sales-Predictions/blob/main/food_sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Project 1: Part 1
##### Navnoor Singh Kahlon
---



## Loading Data

In [196]:
import pandas as pd

In [197]:
filename = '/content/sales_predictions.csv'
df = pd.read_csv(filename)

In [198]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


In [199]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


The non-null counts show that Item_Weight and Outlet_Size contain null values.

## Data Cleaning

1) Number of rows and columns

In [200]:
len(df)

8523

In [201]:
len(df.columns)

12

2) Data types of each variable 

In [202]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

3) Duplicates

In [203]:
df.duplicated().sum()

0

No duplicates found

4) Identifying Missing Values

In [204]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64
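
To put the counts in proportion, the same check can be expressed as a percentage of rows. In the output above that works out to roughly 17% of Item_Weight (1463 of 8523) and 28% of Outlet_Size (2410 of 8523). The sketch below is only an illustration on a made-up frame; `toy` and its values are assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the sales data (values are made up).
toy = pd.DataFrame({
    "Item_Weight": [9.3, None, 17.5, None],
    "Outlet_Size": ["Medium", None, "Small", "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# isna() marks missing cells; the mean of those booleans is the
# fraction of missing rows in each column.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```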

Starting with the Item_Weight column, I will fill the null values with the column mean.

In [205]:
item_weight_ave = df['Item_Weight'].mean()

item_weight_ave

12.857645184135976

In [206]:
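# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['Item_Weight'] = df['Item_Weight'].fillna(item_weight_ave)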
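
A minimal sketch of mean imputation on a made-up column (the values in `toy` are hypothetical, not taken from the dataset). It assigns the result of `fillna` back to the column, then checks that the mean is unchanged while the spread shrinks, which is a known side effect of filling with the mean.

```python
import pandas as pd

# Hypothetical weights with two missing entries.
toy = pd.DataFrame({"Item_Weight": [9.3, None, 17.5, 5.92, None]})

# mean() and std() skip NaN values by default.
mean_before = toy["Item_Weight"].mean()
std_before = toy["Item_Weight"].std()

# Fill the gaps with the mean and assign the result back to the column.
toy["Item_Weight"] = toy["Item_Weight"].fillna(mean_before)

mean_after = toy["Item_Weight"].mean()
std_after = toy["Item_Weight"].std()
print(f"nulls left: {toy['Item_Weight'].isna().sum()}")
print(f"mean before/after: {mean_before:.2f} / {mean_after:.2f}")
print(f"std before/after: {std_before:.2f} / {std_after:.2f}")
```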

Analyzing the Outlet_Size column

In [207]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

The Outlet_Size column is an object type with 2410 null values.
A model-based imputation would likely fill these more accurately.
For now I will fill them with the mode (Medium) so that no null values remain.

In [208]:
# Assign the result back instead of using inplace=True on a column selection.
df['Outlet_Size'] = df['Outlet_Size'].fillna('Medium')
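
The fill value does not have to be typed by hand. As a small sketch, assuming a made-up column (`toy` and its values are hypothetical), the mode can be looked up with `Series.mode()` and passed to `fillna`:

```python
import pandas as pd

# Hypothetical outlet sizes with two missing entries.
toy = pd.DataFrame(
    {"Outlet_Size": ["Medium", "Small", None, "Medium", "High", None]}
)

# mode() ignores missing values and returns a Series (ties are possible),
# so take the first entry.
most_common = toy["Outlet_Size"].mode()[0]

# Fill the gaps with the most common label.
toy["Outlet_Size"] = toy["Outlet_Size"].fillna(most_common)
print(most_common)
print(toy["Outlet_Size"].value_counts())
```

This keeps the fill value in sync with the data if the file changes, at the cost of hiding which label was used unless it is printed.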

Confirming that there are no missing values. 

In [209]:
df.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Looking for any inconsistencies with the columns. 

In [210]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

Checking each object-type column for inconsistent values.
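
The per-column checks could also be run in one pass. The sketch below is an assumption about how that might look, using a made-up frame (`toy` is hypothetical): it selects every non-numeric column and prints its value counts.

```python
import pandas as pd

# Hypothetical stand-in with two text columns and one numeric column.
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "LF", "Regular", "reg"],
    "Outlet_Size": ["Medium", "Small", "Medium", "High"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

# Excluding numeric dtypes leaves only the text columns.
text_columns = toy.select_dtypes(exclude="number").columns

for col in text_columns:
    print(f"Column: {col}")
    print(toy[col].value_counts())
    print()
```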

Column: Item_Fat_Content 

In [211]:
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

The fat content labels are inconsistent: Low Fat, LF, and low fat all mean the same thing, as do Regular and reg. Each group needs a single label.

Fixing the inconsistencies

In [212]:
df.replace({'Item_Fat_Content': {'LF':'Low Fat',
                                 'low fat': 'Low Fat',
                                 'reg': 'Regular'}}, inplace = True)

In [213]:
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64
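
The merged counts add up as expected: 5089 + 316 + 112 = 5517 for Low Fat and 2889 + 117 = 3006 for Regular. The same mapping idea is sketched below on a made-up Series (`fat` and `label_map` are hypothetical names used only for illustration).

```python
import pandas as pd

# Hypothetical labels covering every spelling seen in the column.
fat = pd.Series(
    ["Low Fat", "LF", "low fat", "Regular", "reg"],
    name="Item_Fat_Content",
)

# Map each variant spelling to its canonical label;
# values not listed in the dict are left unchanged.
label_map = {"LF": "Low Fat", "low fat": "Low Fat", "reg": "Regular"}
cleaned = fat.replace(label_map)
print(cleaned.value_counts())
```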

Column: Item_Type

In [214]:
df['Item_Type'].value_counts()
#no inconsistencies. 

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

Column: Outlet_Identifier

In [215]:
df['Outlet_Identifier'].value_counts()
#no inconsistencies

OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: Outlet_Identifier, dtype: int64

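Column: Outlet_Size

The value counts shown earlier for this column contain only Medium, Small, and High, so there are no inconsistencies.

Column: Outlet_Location_Type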

In [216]:
df['Outlet_Location_Type'].value_counts()
#no inconsistencies 

Tier 3    3350
Tier 2    2785
Tier 1    2388
Name: Outlet_Location_Type, dtype: int64

Column: Outlet_Type

In [217]:
df['Outlet_Type'].value_counts()
#no inconsistencies

Supermarket Type1    5577
Grocery Store        1083
Supermarket Type3     935
Supermarket Type2     928
Name: Outlet_Type, dtype: int64

Getting summary stats for each numerical column.

In [218]:
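def summary_stats(column):
    # Print the rounded mean, max, and min of a numeric column.
    # The variable names avoid shadowing the built-in max() and min().
    col_mean = column.mean().round(2)
    col_max = column.max().round(2)
    col_min = column.min().round(2)
    print(f"The {column.name}: Mean = {col_mean}, Max = {col_max}, Min = {col_min}")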
  

In [219]:
summary_stats(df['Item_Weight'])

The Item_Weight: Mean = 12.86, Max = 21.35, Min = 4.56


In [220]:
summary_stats(df['Item_Visibility'])

The Item_Visibility: Mean = 0.07, Max = 0.33, Min = 0.0


In [221]:
summary_stats(df['Item_MRP'])

The Item_MRP: Mean = 140.99, Max = 266.89, Min = 31.29


In [222]:
summary_stats(df['Item_Outlet_Sales'])

The Item_Outlet_Sales: Mean = 2181.29, Max = 13086.96, Min = 33.29
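
The same three statistics can also be collected for all numeric columns at once. This is only a sketch of one possible alternative, using `DataFrame.agg` on a made-up frame (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in with two numeric columns and one text column.
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_MRP": [249.81, 48.27, 141.62],
    "Item_Type": ["Dairy", "Soft Drinks", "Meat"],
})

# Keep only the numeric columns, then compute each statistic per column.
summary = toy.select_dtypes(include="number").agg(["mean", "max", "min"]).round(2)
print(summary)
```

The result is a small table with one row per statistic and one column per numeric variable, which is easier to compare than separate printed lines.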


In [223]:
#loop over the numeric columns and print summary stats for each one
#(iterating over df.dtypes yields dtype objects rather than column data, so the loop uses column names)
for col in df.select_dtypes(include='number').columns:
    col_mean = df[col].mean()
    col_max = df[col].max()
    col_min = df[col].min()
    print(f"For {col} the mean = {col_mean} \n max = {col_max} \n min = {col_min}")

## Exploratory Visuals

## Explanatory Visuals