# **Retail Store Inventory Dataset Analysis Using Pandas** -- (Undergoing...)


---



Dataset's Kaggle link: https://www.kaggle.com/datasets/anirudhchauhan/retail-store-inventory-forecasting-dataset

Steps to get a dataset from Kaggle into Colab ([YouTube demo link](https://www.youtube.com/watch?v=s9O6soJES74)):

1. Log in to your Kaggle account.
2. Go to **Settings**.
3. Scroll down to the section called **'API'**.
4. Click the button **'Create New Token'**.
5. A *.json* file containing the API key is saved locally.
6. Open that file in Notepad (or any text editor) and note the values for **'username'** and **'key'**.
7. Follow the steps below.

In [25]:
!pip install opendatasets > /dev/null  # '> /dev/null' hides the long output, though the package is installed successfully

In [26]:
import opendatasets as od

In [28]:
# Prompts for your Kaggle username and API key; after a successful login, the dataset is
# downloaded and appears in the Files pane (folder icon) on the left side of Colab.
# If the files are already downloaded, the output of this cell is:
# 'Skipping, found downloaded files in "./retail-store-inventory-forecasting-dataset" (use force=True to force download)'

od.download('https://www.kaggle.com/datasets/anirudhchauhan/retail-store-inventory-forecasting-dataset')

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: eddantes
Your Kaggle Key: ··········
Dataset URL: https://www.kaggle.com/datasets/anirudhchauhan/retail-store-inventory-forecasting-dataset
Downloading retail-store-inventory-forecasting-dataset.zip to ./retail-store-inventory-forecasting-dataset


100%|██████████| 1.51M/1.51M [00:00<00:00, 119MB/s]
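
Instead of copying the values from the token file by hand (step 6), they can be read programmatically. The snippet below is a minimal sketch using only the standard library. The helper name `read_kaggle_credentials` is hypothetical, and the demo writes a throwaway file with placeholder values; it assumes only that the token file holds the `username` and `key` fields mentioned in step 6.

```python
import json
import tempfile
from pathlib import Path


def read_kaggle_credentials(path):
    """Return (username, key) from a Kaggle API token file (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        token = json.load(fh)
    return token["username"], token["key"]


# Demo with a temporary file and placeholder values (not real credentials).
with tempfile.TemporaryDirectory() as tmp:
    token_path = Path(tmp) / "kaggle.json"
    token_path.write_text(
        json.dumps({"username": "your_username", "key": "your_api_key"}),
        encoding="utf-8",
    )
    username, key = read_kaggle_credentials(token_path)

print(username)
```

In a real session the path would point at the downloaded token file; avoid printing or committing the key itself.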







In [5]:
import pandas as pd

In [155]:
df = pd.read_csv('/content/retail-store-inventory-forecasting-dataset/retail_store_inventory.csv')

In [156]:
df.head(3)

Unnamed: 0,Date,Store ID,Product ID,Category,Region,Inventory Level,Units Sold,Units Ordered,Demand Forecast,Price,Discount,Weather Condition,Holiday/Promotion,Competitor Pricing,Seasonality
0,2022-01-01,S001,P0001,Groceries,North,231,127,55,135.47,33.5,20,Rainy,0,29.69,Autumn
1,2022-01-01,S001,P0002,Toys,South,204,150,66,144.04,63.01,20,Sunny,0,66.16,Autumn
2,2022-01-01,S001,P0003,Toys,West,102,65,51,74.02,27.99,10,Sunny,1,31.32,Summer


**The output above shows that the index starts from 0; we will shift it so that it begins from 1.**

In [157]:
df.index = df.index + 1

# Following 2 lines can also reset the index range

# import numpy as np
# df.index = np.arange(1, len(df)+1)  # len(df) = 73100

In [158]:
df.head(3) # Index reset to begin from 1

Unnamed: 0,Date,Store ID,Product ID,Category,Region,Inventory Level,Units Sold,Units Ordered,Demand Forecast,Price,Discount,Weather Condition,Holiday/Promotion,Competitor Pricing,Seasonality
1,2022-01-01,S001,P0001,Groceries,North,231,127,55,135.47,33.5,20,Rainy,0,29.69,Autumn
2,2022-01-01,S001,P0002,Toys,South,204,150,66,144.04,63.01,20,Sunny,0,66.16,Autumn
3,2022-01-01,S001,P0003,Toys,West,102,65,51,74.02,27.99,10,Sunny,1,31.32,Summer


In [159]:
# df.index.dtype  # o/p : dtype('int64')
df.index.dtype.name

'int64'
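
The two ways of shifting the index can be compared on a small frame. This is an illustrative sketch; the frame `toy` and its values are made up for the example and are not part of the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"UnitsSold": [127, 150, 65]})  # default index: 0, 1, 2

# Option 1: add 1 to every existing index label.
shifted = toy.copy()
shifted.index = shifted.index + 1

# Option 2: build a fresh 1-based range of the same length.
rebuilt = toy.copy()
rebuilt.index = np.arange(1, len(rebuilt) + 1)

print(shifted.index.tolist())
print(rebuilt.index.tolist())
```

Both options yield the same labels here. The first depends on the existing index being a plain 0-based range, while the second ignores whatever labels were there before.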

**Some other methods for preliminary checks are as follows. One can run them individually.**

In [None]:
'''Methods for preliminary checks'''

# df.shape              # o/p : (73100, 15)
# df.info()             # Shows df shape, col names, their non-null val count, & dtypes
# df.dtypes             # Col names & their dtypes
# df.describe()         # Statistical data about df's numeric cols
# df.axes               # List of row axis' and col axis' labels, in that order
# df.index              # List of labels in index col
# df.columns            # List of all col labels
# df.keys()             # List of all col labels
# df.index.name         # Label of index col
# df.index.names        # Labels of multi-col index, aka multi-index
# df.ndim               # No. of dimensions in df (2 here)
# df.memory_usage()     # Memory usage of each col in bytes
# df.select_dtypes(exclude = 'object')        # include/exclude cols of specified dtypes
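
A few of these checks are shown below on a tiny made-up frame (`toy` is hypothetical), so the results are easy to verify by eye. The sketch uses `include='number'` rather than `exclude='object'`; this is a choice made for the example, since it selects numeric columns directly instead of relying on how text columns happen to be typed.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Groceries", "Toys", "Toys"],
    "UnitsSold": [127, 150, 65],
    "Price": [33.50, 63.01, 27.99],
})

shape = toy.shape   # (rows, columns)
ndim = toy.ndim     # a DataFrame is always 2-dimensional

# Split the columns by kind of dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(shape, ndim)
print(numeric_cols, text_cols)
```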

In [160]:
df.info() # Output shows no missing values in any column; 'Date' is stored as object and can be converted to datetime

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 73100 entries, 1 to 73100
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Date                73100 non-null  object 
 1   Store ID            73100 non-null  object 
 2   Product ID          73100 non-null  object 
 3   Category            73100 non-null  object 
 4   Region              73100 non-null  object 
 5   Inventory Level     73100 non-null  int64  
 6   Units Sold          73100 non-null  int64  
 7   Units Ordered       73100 non-null  int64  
 8   Demand Forecast     73100 non-null  float64
 9   Price               73100 non-null  float64
 10  Discount            73100 non-null  int64  
 11  Weather Condition   73100 non-null  object 
 12  Holiday/Promotion   73100 non-null  int64  
 13  Competitor Pricing  73100 non-null  float64
 14  Seasonality         73100 non-null  object 
dtypes: float64(3), int64(5), object(7)
memory usage: 8.4+

**Changing dtype of** Date **column.**

In [161]:
df['Date'].dtype.name # dtype of 'Date' col

'object'

In [162]:
df['Date'] = pd.to_datetime(df['Date'], format = '%Y-%m-%d')  # Changing dtype of 'Date' col

# df['Date'].dtype   # o/p: dtype('<M8[ns]')
# df['Date'].dtype.name  # o/p: datetime64[ns]
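
This dataset has no missing or malformed dates, so the conversion above runs cleanly. For data that might contain bad entries, one possible approach is `errors='coerce'`, which turns unparseable strings into `NaT` instead of raising. The sketch below uses a made-up Series (`raw`) to illustrate this.

```python
import pandas as pd

raw = pd.Series(["2022-01-01", "2022-01-02", "not a date"])

# Strings that do not match the format become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

n_bad = int(parsed.isna().sum())  # number of entries that failed to parse
print(parsed)
print(n_bad)
```

Counting the `NaT` values afterwards shows how many rows failed to parse, which is easier to act on than a silent object column.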

**Adding year and month columns:**

In [163]:
# Now that 'Date' column's dtype is of datetime type, we can use '.dt' accessor to extract year, month, day etc.

df['SalesYear'] = df['Date'].dt.year
df['SalesMonth'] = df['Date'].dt.month_name()
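
One thing to keep in mind is that `month_name()` returns plain strings, which sort alphabetically rather than by calendar position. If calendar order matters later (for example in grouped summaries), an ordered categorical is one option. The sketch below uses made-up dates and hypothetical variable names.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2022-03-15", "2022-01-01", "2023-02-10"]))

years = dates.dt.year
months = dates.dt.month_name()

# Month names in calendar order, generated by pandas itself.
month_order = pd.date_range("2022-01-01", periods=12, freq="MS").month_name().tolist()
months_ordered = pd.Categorical(months, categories=month_order, ordered=True)

alphabetical = sorted(months)                        # plain string sort
by_calendar = months_ordered.sort_values().tolist()  # calendar order

print(alphabetical)
print(by_calendar)
```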

In [164]:
# Shifting cols 'SalesYear' & 'SalesMonth' beside 'Date' col
cols = df.columns.tolist()
cols = [cols[0], *cols[-2:], *cols[1:-2]]  # creating new cols list by using cols' position
df = df[cols]

In [165]:
df.head(1) # Just checking the cols

Unnamed: 0,Date,SalesYear,SalesMonth,Store ID,Product ID,Category,Region,Inventory Level,Units Sold,Units Ordered,Demand Forecast,Price,Discount,Weather Condition,Holiday/Promotion,Competitor Pricing,Seasonality
1,2022-01-01,2022,January,S001,P0001,Groceries,North,231,127,55,135.47,33.5,20,Rainy,0,29.69,Autumn
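
An alternative way to move the two columns, sketched on a made-up one-row frame, is to `pop` each column and `insert` it at the desired position. This is not the method used above; it avoids slicing the column list by position, at the cost of modifying the frame in place.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01"]),
    "Store ID": ["S001"],
    "SalesYear": [2022],
    "SalesMonth": ["January"],
})

# pop() removes the column and returns it; insert() puts it back at a given position.
toy.insert(1, "SalesYear", toy.pop("SalesYear"))
toy.insert(2, "SalesMonth", toy.pop("SalesMonth"))

print(toy.columns.tolist())
```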


**Standardizing the column names:**

In [166]:
df.columns # original col names

Index(['Date', 'SalesYear', 'SalesMonth', 'Store ID', 'Product ID', 'Category',
       'Region', 'Inventory Level', 'Units Sold', 'Units Ordered',
       'Demand Forecast', 'Price', 'Discount', 'Weather Condition',
       'Holiday/Promotion', 'Competitor Pricing', 'Seasonality'],
      dtype='object')

In [167]:
df.columns = [col.replace(' ', '') for col in df.columns]
df.columns = [col.replace('/', '') for col in df.columns]  # 'Holiday/Promotion' to 'HolidayPromotion'

In [168]:
df.columns  # new col names

Index(['Date', 'SalesYear', 'SalesMonth', 'StoreID', 'ProductID', 'Category',
       'Region', 'InventoryLevel', 'UnitsSold', 'UnitsOrdered',
       'DemandForecast', 'Price', 'Discount', 'WeatherCondition',
       'HolidayPromotion', 'CompetitorPricing', 'Seasonality'],
      dtype='object')
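
The two list comprehensions can also be collapsed into a single call with the vectorized `.str` accessor on the column index. A minimal sketch on a made-up, empty frame:

```python
import pandas as pd

toy = pd.DataFrame(columns=["Store ID", "Holiday/Promotion", "Price"])

# Remove spaces and slashes in one pass using a regex character class.
toy.columns = toy.columns.str.replace(r"[ /]", "", regex=True)

print(toy.columns.tolist())
```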

**Replacing the values 0 and 1 in column** HolidayPromotion **with NO and YES, respectively:**

In [169]:
df['HolidayPromotion'] = df['HolidayPromotion'].replace({0: 'NO', 1: 'YES'})
df['HolidayPromotion'].head(3)

Unnamed: 0,HolidayPromotion
1,NO
2,NO
3,YES
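
`replace` and `map` both accept a dictionary, but they treat values missing from the dictionary differently. The sketch below uses a made-up Series with an extra value `2`, which does not occur in this dataset's column, to show the difference.

```python
import pandas as pd

flags = pd.Series([0, 0, 1, 2], name="HolidayPromotion")
labels = {0: "NO", 1: "YES"}

replaced = flags.replace(labels)  # values not in the dict are kept unchanged
mapped = flags.map(labels)        # values not in the dict become NaN

print(replaced.tolist())
print(mapped.isna().tolist())
```

With only 0 and 1 present, as in this column, both calls give the same labels. `map` makes unexpected values visible as missing data, whereas `replace` leaves them in place.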


# **Data Analysis**

### What are the trends in units sold over time for different categories?

In [171]:
sales_trend = df.groupby(['Date', 'Category'])['UnitsSold'].sum().unstack(level = 1)
sales_trend

Date,Clothing,Electronics,Furniture,Groceries,Toys
2022-01-01,3784,3440,1738,3112,2410
2022-01-02,2326,2555,2108,2817,3609
2022-01-03,2524,3339,1146,3509,3163
2022-01-04,2280,2798,2522,2490,3994
2022-01-05,1387,2655,2398,2099,4033
...,...,...,...,...,...
2023-12-28,2586,2610,4122,3745,3208
2023-12-29,1692,3330,2799,2546,3001
2023-12-30,1776,1656,4359,2503,2862
2023-12-31,1703,1916,2711,2540,2338
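
To make the reshaping step concrete, the sketch below repeats the `groupby` plus `unstack` pattern on a tiny made-up frame, shows the equivalent `pivot_table` call, and then aggregates the daily table to monthly totals with `resample`. The monthly aggregation is an assumption about a useful next step and is not part of the analysis above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2022-01-01", "2022-01-01", "2022-01-01", "2022-01-02", "2022-02-01"]
    ),
    "Category": ["Toys", "Toys", "Groceries", "Toys", "Groceries"],
    "UnitsSold": [150, 65, 127, 40, 30],
})

# Long to wide: one row per date, one column per category.
wide = toy.groupby(["Date", "Category"])["UnitsSold"].sum().unstack(level=1)

# The same reshaping expressed as a pivot table.
pivoted = toy.pivot_table(
    index="Date", columns="Category", values="UnitsSold", aggfunc="sum"
)

# Daily totals rolled up to calendar months ('MS' = month start).
monthly = wide.resample("MS").sum()

print(wide)
print(monthly)
```

Date and category combinations with no rows appear as `NaN` in the wide table. The monthly sum skips them, so a category with no sales in a month shows a total of 0.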


In [80]:
# df.to_csv('retail_store_inventory_cleaned.csv', index = False)  # saving a copy of df as .csv
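
CSV files store dates as text, so the datetime dtype set earlier does not survive a save and reload unless it is requested again. The sketch below round-trips a made-up frame through an in-memory buffer (standing in for a file on disk) to illustrate this.

```python
import io

import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-01", "2022-01-02"]),
    "UnitsSold": [127, 150],
})

buffer = io.StringIO()  # in-memory stand-in for a .csv file
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)                           # 'Date' comes back as text

buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Date"])  # 'Date' parsed as datetime

print(plain.dtypes)
print(restored.dtypes)
```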