
# Prediction of Product Sales
  - Author: Joshua Torre

## Project Overview

**The goal of this project is to help the retailer understand the properties of products and outlets that play crucial roles in increasing sales.**

## Load and Inspect Data

In [91]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [92]:
# Import the pandas library
import pandas as pd

In [93]:
# Load data
fname = '/content/drive/MyDrive/Github/sales_predictions_2023.csv'

# Create dataframe
df = pd.read_csv(fname)

In [94]:
# Inspect data with df.head()
df.head(10)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052
5,FDP36,10.395,Regular,0.0,Baking Goods,51.4008,OUT018,2009,Medium,Tier 3,Supermarket Type2,556.6088
6,FDO10,13.65,Regular,0.012741,Snack Foods,57.6588,OUT013,1987,High,Tier 3,Supermarket Type1,343.5528
7,FDP10,,Low Fat,0.12747,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
8,FDH17,16.2,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.2,Regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.535


In [95]:
# Inspect data with df.info()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Item_Identifier            8523 non-null   object 
 1   Item_Weight                7060 non-null   float64
 2   Item_Fat_Content           8523 non-null   object 
 3   Item_Visibility            8523 non-null   float64
 4   Item_Type                  8523 non-null   object 
 5   Item_MRP                   8523 non-null   float64
 6   Outlet_Identifier          8523 non-null   object 
 7   Outlet_Establishment_Year  8523 non-null   int64  
 8   Outlet_Size                6113 non-null   object 
 9   Outlet_Location_Type       8523 non-null   object 
 10  Outlet_Type                8523 non-null   object 
 11  Item_Outlet_Sales          8523 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.2+ KB


In [96]:
# Check number of rows and columns with df.shape
df.shape

(8523, 12)

## Clean Data

In [97]:
# Overview
df.isna().sum()

Item_Identifier                 0
Item_Weight                  1463
Item_Fat_Content                0
Item_Visibility                 0
Item_Type                       0
Item_MRP                        0
Outlet_Identifier               0
Outlet_Establishment_Year       0
Outlet_Size                  2410
Outlet_Location_Type            0
Outlet_Type                     0
Item_Outlet_Sales               0
dtype: int64

### Cleaning object columns

In [98]:
# Inspect only object columns
pd.set_option('display.max_rows', None)
df.select_dtypes('object').head(10)

Unnamed: 0,Item_Identifier,Item_Fat_Content,Item_Type,Outlet_Identifier,Outlet_Size,Outlet_Location_Type,Outlet_Type
0,FDA15,Low Fat,Dairy,OUT049,Medium,Tier 1,Supermarket Type1
1,DRC01,Regular,Soft Drinks,OUT018,Medium,Tier 3,Supermarket Type2
2,FDN15,Low Fat,Meat,OUT049,Medium,Tier 1,Supermarket Type1
3,FDX07,Regular,Fruits and Vegetables,OUT010,,Tier 3,Grocery Store
4,NCD19,Low Fat,Household,OUT013,High,Tier 3,Supermarket Type1
5,FDP36,Regular,Baking Goods,OUT018,Medium,Tier 3,Supermarket Type2
6,FDO10,Regular,Snack Foods,OUT013,High,Tier 3,Supermarket Type1
7,FDP10,Low Fat,Snack Foods,OUT027,Medium,Tier 3,Supermarket Type3
8,FDH17,Regular,Frozen Foods,OUT045,,Tier 2,Supermarket Type1
9,FDU28,Regular,Frozen Foods,OUT017,,Tier 2,Supermarket Type1


Each object column is checked for nulls, duplicates, and inconsistent values.

#### Item Identifier

In [99]:
# Inspect if 'Item_Identifier' has any null values
df['Item_Identifier'].isna().sum()

0

In [100]:
# Inspect if 'Item_Identifier' has duplicated values
df['Item_Identifier'].duplicated().value_counts()

True     6964
False    1559
Name: Item_Identifier, dtype: int64

Duplicated values of 'Item_Identifier' (the product ID) are acceptable, since the same product can be sold at several grocery stores or supermarkets.
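As a minimal sketch of the distinction drawn here, the check below separates repeated product IDs (expected) from repeated product and outlet pairs (which would suggest a true duplicate row). The `toy` frame and its values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy data (hypothetical values) using the notebook's column names
toy = pd.DataFrame({
    'Item_Identifier':   ['FDA15', 'FDA15', 'DRC01', 'DRC01'],
    'Outlet_Identifier': ['OUT049', 'OUT018', 'OUT018', 'OUT018'],
})

# Repeats of the product ID alone: expected when several outlets stock the item
repeated_ids = toy['Item_Identifier'].duplicated().sum()

# Repeats of the (product, outlet) pair: these would point to a duplicated row
repeated_pairs = toy.duplicated(subset=['Item_Identifier', 'Outlet_Identifier']).sum()

print(repeated_ids, repeated_pairs)
```

If the second count is zero on the real data, each product appears at most once per outlet, which supports keeping the repeated IDs.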

#### Item Fat Content

In [101]:
# Inspect if 'Item_Fat_Content' has any null values
df['Item_Fat_Content'].isna().sum()

0

In [102]:
# Inspect value counts of 'Item_Fat_Content'
df['Item_Fat_Content'].value_counts()

Low Fat    5089
Regular    2889
LF          316
reg         117
low fat     112
Name: Item_Fat_Content, dtype: int64

The categories of 'Item_Fat_Content' are inconsistent. According to the dataset's Data Dictionary, 'Item_Fat_Content' should be either 'Low Fat' or 'Regular'.

In [103]:
# Change 'LF' to 'Low Fat'
df['Item_Fat_Content'] = df['Item_Fat_Content'].str.replace('LF', 'Low Fat')

In [104]:
# Change 'reg' to 'Regular'
df['Item_Fat_Content'] = df['Item_Fat_Content'].str.replace('reg', 'Regular')

In [105]:
# Change 'low fat' to 'Low Fat'
df['Item_Fat_Content'] = df['Item_Fat_Content'].str.replace('low fat', 'Low Fat')

In [106]:
# Inspect new value counts of 'Item_Fat_Content'
df['Item_Fat_Content'].value_counts()

Low Fat    5517
Regular    3006
Name: Item_Fat_Content, dtype: int64

'Item_Fat_Content' is now consistent: every value is either 'Low Fat' or 'Regular'.
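The three substitutions can also be written as a single mapping. This is a sketch of an alternative, not the approach used in the cells above: `Series.replace` with a dictionary matches whole values only, so it cannot touch a substring inside another label. The `fat` series and `fat_map` name are made up for illustration.

```python
import pandas as pd

# Toy series (hypothetical) containing the five spellings seen in the column
fat = pd.Series(['Low Fat', 'Regular', 'LF', 'reg', 'low fat'])

# Map each variant spelling to its canonical label (whole-value matches only)
fat_map = {'LF': 'Low Fat', 'reg': 'Regular', 'low fat': 'Low Fat'}
fat_clean = fat.replace(fat_map)

print(fat_clean.value_counts())
```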

#### Item Type

In [107]:
# Inspect if 'Item_Type' has any null values
df['Item_Type'].isna().sum()

0

In [108]:
# Inspect value counts of 'Item_Type'
df['Item_Type'].value_counts()

Fruits and Vegetables    1232
Snack Foods              1200
Household                 910
Frozen Foods              856
Dairy                     682
Canned                    649
Baking Goods              648
Health and Hygiene        520
Soft Drinks               445
Meat                      425
Breads                    251
Hard Drinks               214
Others                    169
Starchy Foods             148
Breakfast                 110
Seafood                    64
Name: Item_Type, dtype: int64

The values of 'Item_Type' appear to be properly categorized.

#### Outlet Identifier

In [109]:
# Check value counts of 'Outlet_Identifier'
df['Outlet_Identifier'].value_counts()

OUT027    935
OUT013    932
OUT049    930
OUT046    930
OUT035    930
OUT045    929
OUT018    928
OUT017    926
OUT010    555
OUT019    528
Name: Outlet_Identifier, dtype: int64

In [110]:
# Check if 'Outlet_Identifier' has any null values
df['Outlet_Identifier'].isna().sum()

0

No changes are needed for the 'Outlet_Identifier' column.

#### Outlet Size

In [111]:
# Check if 'Outlet_Size' has any null values
df['Outlet_Size'].isna().sum()

2410

In [112]:
# Inspect value counts of 'Outlet_Size'
df['Outlet_Size'].value_counts()

Medium    2793
Small     2388
High       932
Name: Outlet_Size, dtype: int64

The 'Outlet_Size' column has 2,410 null values. These should be filled with one of the existing categories: 'Small', 'Medium', or 'High'.

In [113]:
# Inspect null values of 'Outlet_Size' sorted by the values of 'Outlet_Identifier'
df.loc[df['Outlet_Size'].isna()].sort_values('Outlet_Identifier')

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
2966,FDY60,10.5,Regular,0.04414,Baking Goods,143.9128,OUT010,1998,,Tier 3,Grocery Store,143.8128
2963,FDR27,15.1,Regular,0.160852,Meat,131.3942,OUT010,1998,,Tier 3,Grocery Store,397.4826
1545,FDK52,18.25,Low Fat,0.132596,Frozen Foods,226.2062,OUT010,1998,,Tier 3,Grocery Store,677.1186
1548,NCK18,9.6,Low Fat,0.011211,Household,166.9184,OUT010,1998,,Tier 3,Grocery Store,660.4736
6489,FDX32,15.1,Regular,0.0,Fruits and Vegetables,146.2786,OUT010,1998,,Tier 3,Grocery Store,433.4358
1555,FDU58,6.61,Regular,0.04856,Snack Foods,188.4898,OUT010,1998,,Tier 3,Grocery Store,187.0898
6469,NCW06,16.2,Low Fat,0.08426,Household,192.3162,OUT010,1998,,Tier 3,Grocery Store,769.6648
2944,DRB48,16.75,Regular,0.0416,Soft Drinks,40.9822,OUT010,1998,,Tier 3,Grocery Store,157.1288
2943,FDU38,10.8,Low Fat,0.138172,Dairy,191.4504,OUT010,1998,,Tier 3,Grocery Store,575.2512


Based on the result above, the null values of 'Outlet_Size' come from three outlets ('Outlet_Identifier'): 'OUT010', 'OUT017', and 'OUT045'.

Null values:
- OUT010: 555
- OUT017: 926
- OUT045: 929
- Total: 2,410
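A per-outlet tally like the one listed above can be produced with a grouped count of the null mask. The sketch below runs on a small hypothetical frame; on the real `df` the same two steps would be applied to `df['Outlet_Size']` and `df['Outlet_Identifier']`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows) using the notebook's column names
toy = pd.DataFrame({
    'Outlet_Identifier': ['OUT010', 'OUT010', 'OUT017', 'OUT049', 'OUT045'],
    'Outlet_Size':       [np.nan, np.nan, np.nan, 'Medium', np.nan],
})

# Sum the boolean null mask within each outlet to count missing sizes
null_by_outlet = (
    toy['Outlet_Size'].isna()
    .groupby(toy['Outlet_Identifier'])
    .sum()
)

# Keep only the outlets that have at least one missing value
null_by_outlet = null_by_outlet[null_by_outlet > 0]
print(null_by_outlet)
```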




Observations:
1. 'OUT010' - Built in the year 1998, in a Tier 3 location, and classified as Grocery Store
2. 'OUT017' - Built in the year 2007, in a Tier 2 location, and classified as Supermarket Type1
3. 'OUT045' - Built in the year 2002, in a Tier 2 location, and classified as Supermarket Type1

Filling in null values will be based on these observations:
1. Year built
2. Location type
3. Outlet type



My strategy is to filter on these three attributes and infer a placeholder value for the missing sizes from comparable outlets.
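Because year built, location type, and outlet type are properties of the outlet rather than of the item, one way to compare outlets is to collapse the data to one row per outlet. The following is a sketch under that assumption; the `toy` rows copy outlet attributes shown in this notebook, and the `outlet_cols` and `outlets` names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy rows that mirror outlet attributes shown in the notebook output
toy = pd.DataFrame({
    'Outlet_Identifier': ['OUT010', 'OUT010', 'OUT019', 'OUT035'],
    'Outlet_Establishment_Year': [1998, 1998, 1985, 2004],
    'Outlet_Size': [np.nan, np.nan, 'Small', 'Small'],
    'Outlet_Location_Type': ['Tier 3', 'Tier 3', 'Tier 1', 'Tier 2'],
    'Outlet_Type': ['Grocery Store', 'Grocery Store',
                    'Grocery Store', 'Supermarket Type1'],
})

outlet_cols = ['Outlet_Identifier', 'Outlet_Establishment_Year',
               'Outlet_Size', 'Outlet_Location_Type', 'Outlet_Type']

# One row per distinct combination of outlet attributes
outlets = toy[outlet_cols].drop_duplicates().set_index('Outlet_Identifier')
print(outlets)
```

Reading the outlets side by side makes it easier to see which outlets with a known size resemble the ones whose size is missing.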

In [114]:
# Inspect the value counts of 'Outlet_Establishment_Year'
df['Outlet_Establishment_Year'].value_counts()

1985    1463
1987     932
1999     930
1997     930
2004     930
2002     929
2009     928
2007     926
1998     555
Name: Outlet_Establishment_Year, dtype: int64

The filter on the year established splits outlets into two groups: those built before 2000 and those built after 2000.

In [115]:
# Inspect the observations on OUT010 based on the year built, outlet location type, and outlet type
df.loc[(df['Outlet_Establishment_Year'] < 2000) & ((df['Outlet_Location_Type'] == 'Tier 3') & (df['Outlet_Type'] == 'Grocery Store'))].head(10)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344
30,FDV38,19.25,Low Fat,0.170349,Dairy,55.7956,OUT010,1998,,Tier 3,Grocery Store,163.7868
45,FDM39,6.42,Low Fat,0.089499,Dairy,178.1002,OUT010,1998,,Tier 3,Grocery Store,358.2004
65,FDC46,17.7,Low Fat,0.195068,Snack Foods,185.4266,OUT010,1998,,Tier 3,Grocery Store,184.4266
90,FDW20,20.75,Low Fat,0.040421,Fruits and Vegetables,122.173,OUT010,1998,,Tier 3,Grocery Store,369.519
122,FDB14,20.25,Regular,0.171939,Canned,92.512,OUT010,1998,,Tier 3,Grocery Store,186.424
133,FDS52,8.89,Low Fat,0.009163,Frozen Foods,101.7016,OUT010,1998,,Tier 3,Grocery Store,101.2016
139,NCN07,18.5,Low Fat,0.056816,Others,132.1284,OUT010,1998,,Tier 3,Grocery Store,263.6568
174,FDI32,17.7,Low Fat,0.291865,Fruits and Vegetables,115.1834,OUT010,1998,,Tier 3,Grocery Store,345.5502


None of these rows has an 'Outlet_Size' value, so this filter gives no basis for choosing a placeholder value.

In [116]:
# Inspect the observations on OUT010 based on the year built and outlet type
df.loc[(df['Outlet_Establishment_Year'] < 2000) & (df['Outlet_Type'] == 'Grocery Store')].head(10)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,,Tier 3,Grocery Store,178.4344
29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
30,FDV38,19.25,Low Fat,0.170349,Dairy,55.7956,OUT010,1998,,Tier 3,Grocery Store,163.7868
45,FDM39,6.42,Low Fat,0.089499,Dairy,178.1002,OUT010,1998,,Tier 3,Grocery Store,358.2004
49,FDS02,,Regular,0.255395,Dairy,196.8794,OUT019,1985,Small,Tier 1,Grocery Store,780.3176
59,FDI26,,Low Fat,0.061082,Canned,180.0344,OUT019,1985,Small,Tier 1,Grocery Store,892.172
63,FDY40,,Regular,0.150286,Frozen Foods,51.0692,OUT019,1985,Small,Tier 1,Grocery Store,147.8076
65,FDC46,17.7,Low Fat,0.195068,Snack Foods,185.4266,OUT010,1998,,Tier 3,Grocery Store,184.4266


Based on the result above, grocery stores built before 2000 are located in 'Tier 1' or 'Tier 3' locations. Moreover, the rows that do have an 'Outlet_Size' value (those of OUT019) are all 'Small'.

In [117]:
# Filter on 'Tier 2' to check whether any Grocery Stores are in this location type
df.loc[df['Outlet_Location_Type'] == 'Tier 2'].head(10)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8,FDH17,16.2,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.2,Regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.535
19,FDU02,13.35,Low Fat,0.102492,Dairy,230.5352,OUT035,2004,Small,Tier 2,Supermarket Type1,2748.4224
22,NCB30,14.6,Low Fat,0.025698,Household,196.5084,OUT035,2004,Small,Tier 2,Supermarket Type1,1587.2672
25,NCD06,13.0,Low Fat,0.099887,Household,45.906,OUT017,2007,,Tier 2,Supermarket Type1,838.908
26,FDV10,7.645,Regular,0.066693,Snack Foods,42.3112,OUT035,2004,Small,Tier 2,Supermarket Type1,1065.28
33,FDO23,17.85,Low Fat,0.0,Breads,93.1436,OUT045,2002,,Tier 2,Supermarket Type1,2174.5028
46,NCP05,19.6,Low Fat,0.0,Health and Hygiene,153.3024,OUT045,2002,,Tier 2,Supermarket Type1,2428.8384
47,FDV49,10.0,Low Fat,0.02588,Canned,265.2226,OUT045,2002,,Tier 2,Supermarket Type1,5815.0972
53,FDA43,10.895,Low Fat,0.065042,Fruits and Vegetables,196.3794,OUT017,2007,,Tier 2,Supermarket Type1,3121.2704


In the rows shown, outlets in 'Tier 2' locations were built after 2000 and are categorized as 'Supermarket Type1', which does not fit the observations on OUT010.

In [118]:
# Double-check whether any outlet in a 'Tier 2' location is categorized as a 'Grocery Store'
df.loc[(df['Outlet_Location_Type'] == 'Tier 2') & (df['Outlet_Type'] == 'Grocery Store')]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales


There are no outlets in 'Tier 2' locations that are categorized as a 'Grocery Store'. Therefore, we can infer that outlets built before 2000, located in 'Tier 1' or 'Tier 3' locations, and categorized as 'Grocery Store' have a 'Small' outlet size.
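One compact way to check an inference like this is a cross-tabulation of outlet type and location against outlet size, with the nulls shown as their own column. This is only a sketch on hypothetical rows; the 'Missing' label and the `size_table` name are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical) using the notebook's column names
toy = pd.DataFrame({
    'Outlet_Type': ['Grocery Store', 'Grocery Store',
                    'Supermarket Type1', 'Supermarket Type1'],
    'Outlet_Location_Type': ['Tier 3', 'Tier 1', 'Tier 2', 'Tier 2'],
    'Outlet_Size': [np.nan, 'Small', 'Small', np.nan],
})

# Count rows per (type, location) and size; nulls are relabeled so they stay visible
size_table = pd.crosstab(
    [toy['Outlet_Type'], toy['Outlet_Location_Type']],
    toy['Outlet_Size'].fillna('Missing'),
)
print(size_table)
```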

In [119]:
# Fill null values of 'Outlet_Size' based on observations from Store ID: OUT010
store_1 = df['Outlet_Identifier'] == 'OUT010'
null_size = df['Outlet_Size'].isna()


In [121]:
# Assign to the 'Outlet_Size' column only, so nulls in other columns of these rows are left untouched
df.loc[store_1 & null_size, 'Outlet_Size'] = 'Small'

In [123]:
# Confirm that the number of rows and columns is unchanged
df.shape

(8523, 12)

In [124]:
# Count the null values that remain in 'Outlet_Size'
df['Outlet_Size'].isna().sum()

1855
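The remaining 1,855 nulls equal 2,410 minus the 555 rows of OUT010. When filling a single column for a subset of rows, it is generally safer to name the column in `.loc`, because a row-wide `fillna` would also overwrite any missing values in other columns of those rows. A minimal sketch on hypothetical data (the missing 'Item_Weight' in an OUT010 row is invented to show the effect):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) using the notebook's column names
toy = pd.DataFrame({
    'Outlet_Identifier': ['OUT010', 'OUT010', 'OUT017', 'OUT027'],
    'Item_Weight': [19.2, np.nan, 13.0, np.nan],
    'Outlet_Size': [np.nan, np.nan, np.nan, 'Medium'],
})

is_out010 = toy['Outlet_Identifier'] == 'OUT010'
size_missing = toy['Outlet_Size'].isna()

# Target the one column, so the missing 'Item_Weight' values are left alone
toy.loc[is_out010 & size_missing, 'Outlet_Size'] = 'Small'
print(toy)
```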

In [88]:
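# Caution: this reassigns df to only the OUT010 rows selected by the mask and discards every other row.
# The in-place .loc assignment in the earlier cell keeps all rows.
df = df.loc[store_1 & null_size].fillna('Small')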

In [55]:
df['Outlet_Size'].isna().sum()

2410

In [89]:
df

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,Small,Tier 3,Grocery Store,732.38
28,FDE51,5.925,Regular,0.161467,Dairy,45.5086,OUT010,1998,Small,Tier 3,Grocery Store,178.4344
30,FDV38,19.25,Low Fat,0.170349,Dairy,55.7956,OUT010,1998,Small,Tier 3,Grocery Store,163.7868
45,FDM39,6.42,Low Fat,0.089499,Dairy,178.1002,OUT010,1998,Small,Tier 3,Grocery Store,358.2004
65,FDC46,17.7,Low Fat,0.195068,Snack Foods,185.4266,OUT010,1998,Small,Tier 3,Grocery Store,184.4266
90,FDW20,20.75,Low Fat,0.040421,Fruits and Vegetables,122.173,OUT010,1998,Small,Tier 3,Grocery Store,369.519
122,FDB14,20.25,Regular,0.171939,Canned,92.512,OUT010,1998,Small,Tier 3,Grocery Store,186.424
133,FDS52,8.89,Low Fat,0.009163,Frozen Foods,101.7016,OUT010,1998,Small,Tier 3,Grocery Store,101.2016
139,NCN07,18.5,Low Fat,0.056816,Others,132.1284,OUT010,1998,Small,Tier 3,Grocery Store,263.6568
174,FDI32,17.7,Low Fat,0.291865,Fruits and Vegetables,115.1834,OUT010,1998,Small,Tier 3,Grocery Store,345.5502


In [90]:
df['Outlet_Size'].isna().sum()

0

In [31]:
# Inspect OUT017 and OUT045: outlets built after 2000, in 'Tier 2' locations, categorized as 'Supermarket Type1'
df.loc[(df['Outlet_Establishment_Year'] > 2000) & ((df['Outlet_Location_Type'] == 'Tier 2') & (df['Outlet_Type'] == 'Supermarket Type1'))]

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
8,FDH17,16.2,Regular,0.016687,Frozen Foods,96.9726,OUT045,2002,,Tier 2,Supermarket Type1,1076.5986
9,FDU28,19.2,Regular,0.09445,Frozen Foods,187.8214,OUT017,2007,,Tier 2,Supermarket Type1,4710.535
19,FDU02,13.35,Low Fat,0.102492,Dairy,230.5352,OUT035,2004,Small,Tier 2,Supermarket Type1,2748.4224
22,NCB30,14.6,Low Fat,0.025698,Household,196.5084,OUT035,2004,Small,Tier 2,Supermarket Type1,1587.2672
25,NCD06,13.0,Low Fat,0.099887,Household,45.906,OUT017,2007,,Tier 2,Supermarket Type1,838.908
26,FDV10,7.645,Regular,0.066693,Snack Foods,42.3112,OUT035,2004,Small,Tier 2,Supermarket Type1,1065.28
33,FDO23,17.85,Low Fat,0.0,Breads,93.1436,OUT045,2002,,Tier 2,Supermarket Type1,2174.5028
46,NCP05,19.6,Low Fat,0.0,Health and Hygiene,153.3024,OUT045,2002,,Tier 2,Supermarket Type1,2428.8384
47,FDV49,10.0,Low Fat,0.02588,Canned,265.2226,OUT045,2002,,Tier 2,Supermarket Type1,5815.0972
53,FDA43,10.895,Low Fat,0.065042,Fruits and Vegetables,196.3794,OUT017,2007,,Tier 2,Supermarket Type1,3121.2704


After

### Cleaning number columns

In [52]:
# Inspect only number columns
df.select_dtypes('number').head(10)

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Outlet_Establishment_Year,Item_Outlet_Sales
0,9.3,0.016047,249.8092,1999,3735.138
1,5.92,0.019278,48.2692,2009,443.4228
2,17.5,0.01676,141.618,1999,2097.27
3,19.2,0.0,182.095,1998,732.38
4,8.93,0.0,53.8614,1987,994.7052
5,10.395,0.0,51.4008,2009,556.6088
6,13.65,0.012741,57.6588,1987,343.5528
7,,0.12747,107.7622,1985,4022.7636
8,16.2,0.016687,96.9726,2002,1076.5986
9,19.2,0.09445,187.8214,2007,4710.535


Of the number columns, only 'Item_Weight' has null values (1,463), so it is the priority for data manipulation here. 'Outlet_Size' also has many null values, but it is an object column and is handled in the previous section.

In [47]:
# Inspect null values of 'Item_Weight'
df.loc[df['Item_Weight'].isna()].head(10)

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
7,FDP10,,Low Fat,0.12747,Snack Foods,107.7622,OUT027,1985,Medium,Tier 3,Supermarket Type3,4022.7636
18,DRI11,,Low Fat,0.034238,Hard Drinks,113.2834,OUT027,1985,Medium,Tier 3,Supermarket Type3,2303.668
21,FDW12,,Regular,0.0354,Baking Goods,144.5444,OUT027,1985,Medium,Tier 3,Supermarket Type3,4064.0432
23,FDC37,,Low Fat,0.057557,Baking Goods,107.6938,OUT019,1985,Small,Tier 1,Grocery Store,214.3876
29,FDC14,,Regular,0.072222,Canned,43.6454,OUT019,1985,Small,Tier 1,Grocery Store,125.8362
36,FDV20,,Regular,0.059512,Fruits and Vegetables,128.0678,OUT027,1985,Medium,Tier 3,Supermarket Type3,2797.6916
38,FDX10,,Regular,0.123111,Snack Foods,36.9874,OUT027,1985,Medium,Tier 3,Supermarket Type3,388.1614
39,FDB34,,Low Fat,0.026481,Snack Foods,87.6198,OUT027,1985,Medium,Tier 3,Supermarket Type3,2180.495
49,FDS02,,Regular,0.255395,Dairy,196.8794,OUT019,1985,Small,Tier 1,Grocery Store,780.3176
59,FDI26,,Low Fat,0.061082,Canned,180.0344,OUT019,1985,Small,Tier 1,Grocery Store,892.172


In [48]:
# Summary statistics of 'Item_Weight'
df['Item_Weight'].describe()

count    7060.000000
mean       12.857645
std         4.643456
min         4.555000
25%         8.773750
50%        12.600000
75%        16.850000
max        21.350000
Name: Item_Weight, dtype: float64

In [56]:
# Sort by 'Item_Weight' to compare weights with item type and price
df[['Item_Weight', 'Item_Type', 'Item_MRP']].sort_values('Item_Weight')

Unnamed: 0,Item_Weight,Item_Type,Item_MRP
7808,4.555,Frozen Foods,110.1544
4430,4.555,Frozen Foods,112.6544
3489,4.555,Frozen Foods,112.7544
4400,4.555,Frozen Foods,111.3544
3077,4.59,Soft Drinks,111.986
7984,4.59,Soft Drinks,111.186
6432,4.59,Soft Drinks,113.286
1515,4.59,Soft Drinks,114.586
1082,4.59,Soft Drinks,111.686
5493,4.61,Hard Drinks,173.8396


## Exploratory Data Analysis

## Feature Inspection