<a href="https://colab.research.google.com/github/SinghNavnoor/Food-Sales-Predictions/blob/main/food_sales_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Project 1: Part 1
##### Navnoor Singh Kahlon
---



In [85]:
import pandas as pd

In [86]:
filename = '/content/sales_predictions.csv'
df = pd.read_csv(filename)

In [87]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228


In [88]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


It looks like there are null values in Item_Weight and Outlet_Size.

## Loading Data

1) Number of rows and columns

In [89]:
len(df)

8523

In [90]:
len(df.columns)

12
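
Both counts are also available in one step through the `shape` attribute, which holds a `(rows, columns)` tuple. A minimal sketch on a hypothetical two-row frame (`demo` is an illustration, not the sales data):

```python
import pandas as pd

# Hypothetical two-row frame, used only to illustrate the attribute.
demo = pd.DataFrame({
    "Item_Identifier": ["FDA15", "DRC01"],
    "Item_Weight": [9.3, 5.92],
})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)
```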

2) Data types of each variable 

In [91]:
df.dtypes

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

3) Duplicates

In [92]:
df.duplicated().sum()

0

No duplicates found

4) Identifying Missing Values

In [93]:
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

Starting with the Item_Weight column, I will fill the null values with the column mean.

In [94]:
item_weight_ave = df['Item_Weight'].mean()

item_weight_ave

12.857645184135976

In [95]:
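# Assign the result back; inplace=True on a selected column is deprecated in recent pandas versions.
df['Item_Weight'] = df['Item_Weight'].fillna(item_weight_ave)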
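
To see what mean imputation does to a column, here is a small self-contained sketch. The `weights` series is a hypothetical toy example, not the real Item_Weight data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy weights; NaN marks the missing entries.
weights = pd.Series([9.3, 5.92, np.nan, 19.2, np.nan], name="Item_Weight")

# Series.mean() skips NaN by default, so this is the mean of the observed values.
observed_mean = weights.mean()

# fillna returns a new Series with every NaN replaced by the given value.
filled = weights.fillna(observed_mean)

print(filled.isna().sum())                        # number of nulls left
print(np.isclose(filled.mean(), observed_mean))   # the column mean is unchanged
print(filled.std() < weights.std())               # the spread shrinks
```

The filled column keeps the same mean, but its standard deviation drops because every imputed row sits exactly at the mean. That trade-off is worth keeping in mind when a large share of a column is missing.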

Analyzing the Outlet_Size column

In [96]:
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

The Outlet_Size column is an object type and has 2410 null values.
Filling that many nulls with the column's mode() would not be the best option, since it would assign the most common size to every outlet whose size is unknown.
We are here to predict food sales, and the size of the outlet may not matter that much.
Other columns, such as Outlet_Identifier, can help identify the store.
Hence I will be dropping the column.

In [97]:
df.drop(columns = ['Outlet_Size'], inplace = True)
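
To put rough numbers behind the reasoning about mode(), the sketch below rebuilds a stand-in Outlet_Size column from the counts reported above (2793 Medium, 2388 Small, 932 High, 2410 missing). It then compares the share of Medium before and after a mode fill. The `outlet_size` series is a reconstruction for illustration only; row order is arbitrary.

```python
import numpy as np
import pandas as pd

# Rebuild a stand-in column from the value counts and null count shown above.
counts = {"Medium": 2793, "Small": 2388, "High": 932, np.nan: 2410}
outlet_size = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    name="Outlet_Size",
    dtype="object",
)

# Share of rows with no recorded size.
missing_share = outlet_size.isna().mean()

# Share of Medium among rows that do have a size.
medium_before = (outlet_size == "Medium").sum() / outlet_size.notna().sum()

# Share of Medium if every null were filled with the mode.
mode_filled = outlet_size.fillna(outlet_size.mode()[0])
medium_after = (mode_filled == "Medium").mean()

print(f"missing: {missing_share:.1%}")
print(f"Medium share of known values: {medium_before:.1%}")
print(f"Medium share after mode fill: {medium_after:.1%}")
```

About 28% of the rows are missing, and a mode fill would raise the Medium share from roughly 46% of known values to about 61% of all rows, which noticeably changes the distribution.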

Confirming that there are no missing values. 

In [98]:
df.isna().sum()

Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

Looking for any inconsistencies with the columns. 

In [99]:
df.head(2)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Tier 3,Supermarket Type2,443.4228


Focusing on Item_Outlet_Sales

Item_Outlet_Sales could be named more descriptively, so I will rename it to Total_Item_Outlet_Sales.
The same call also shortens Item_Identifier to It_Identifier.

In [100]:
df.rename(columns = {'Item_Outlet_Sales': 'Total_Item_Outlet_Sales',
                     'Item_Identifier': 'It_Identifier'}, inplace = True)

As a further change, Item could be abbreviated to It and Outlet to Olt in the remaining column names, which would make them look cleaner.
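
One possible way to apply those abbreviations to every column at once is a plain substring replacement on the column labels. This is only a sketch on a hypothetical empty frame (`demo`) that borrows a few column names from the dataset. It is not applied to `df` in this notebook.

```python
import pandas as pd

# Hypothetical empty frame carrying a few of the column names from this dataset.
demo = pd.DataFrame(columns=[
    "Item_Weight", "Item_MRP", "Outlet_Identifier",
    "Outlet_Type", "Total_Item_Outlet_Sales",
])

# Plain (non-regex) substring replacement on the column labels.
demo.columns = (
    demo.columns
    .str.replace("Item", "It", regex=False)
    .str.replace("Outlet", "Olt", regex=False)
)

print(list(demo.columns))
```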

In [101]:
df.head()

Unnamed: 0,It_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Location_Type,Outlet_Type,Total_Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,Tier 3,Supermarket Type1,994.7052


In [102]:
df.dtypes

It_Identifier                 object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Location_Type          object
Outlet_Type                   object
Total_Item_Outlet_Sales      float64
dtype: object

Getting summary stats of each Numerical column. 

In [103]:
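def summary_stats(category):
    # Print the mean, max, and min of a numeric column, rounded to 2 decimals.
    # Local names avoid shadowing the built-in max() and min().
    mean_val = category.mean().round(2)
    max_val = category.max().round(2)
    min_val = category.min().round(2)
    print(f"The {category.name}: Mean = {mean_val}, Max = {max_val}, Min = {min_val}")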
  

In [104]:
summary_stats(df['Item_Weight'])

The Item_Weight: Mean = 12.86, Max = 21.35, Min = 4.56


In [105]:
summary_stats(df['Item_Visibility'])

The Item_Visibility: Mean = 0.07, Max = 0.33, Min = 0.0


In [106]:
summary_stats(df['Item_MRP'])

The Item_MRP: Mean = 140.99, Max = 266.89, Min = 31.29


In [107]:
summary_stats(df['Total_Item_Outlet_Sales'])

The Total_Item_Outlet_Sales: Mean = 2181.29, Max = 13086.96, Min = 33.29


In [108]:
# The first attempt iterated over df.dtypes, which yields dtype objects rather than columns,
# and 'float64' | 'int64' is not a valid operation on strings.
# Looping over the names of the numeric columns avoids both problems.
for col in df.select_dtypes(include='number').columns:
    col_mean = df[col].mean()
    col_max = df[col].max()
    col_min = df[col].min()
    print(f"For {col} the mean = {col_mean} \n max = {col_max} \n min = {col_min}")

## Data Cleaning

## Exploratory Visuals

## Explanatory Visuals