# Products Data Set - Additional Cleaning

In this script, I conduct further analysis and cleaning of the products data set.

Note: missing values and duplicate rows have already been addressed.

In [2]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [3]:
# Set path

path = r"C:\Users\tiffk\Instacart Basket Analysis 22-05-2024"

In [4]:
# Import products dataframe

df_prods = pd.read_csv(os.path.join(path, "02 Data", "Prepared Data", "products_checked.csv"), index_col = False)

In [5]:
# Begin with quick view of df

df_prods

,Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,0,1,Chocolate Sandwich Cookies,61,19,5.8
1,1,2,All-Seasons Salt,104,13,9.3
2,2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...,...
49667,49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49668,49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49669,49690,49686,Artisan Baguette,112,3,7.8
49670,49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [6]:
# Remove "Unnamed: 0" column

df_prods = df_prods.drop("Unnamed: 0", axis = 1)
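A stray "Unnamed: 0" column usually appears when a dataframe was written to CSV with its index and then read back in. Whether that is what happened to this file is an assumption. The sketch below reproduces the effect on a toy dataframe (the values are illustrative only) and shows that exporting with `index=False` avoids it.

```python
import io

import pandas as pd

# Toy stand-in for the products dataframe (illustrative values only)
df_demo = pd.DataFrame({"product_id": [1, 2, 3], "prices": [5.8, 9.3, 4.5]})

# Default export: the index is written as an unnamed first column
buffer_with_index = io.StringIO()
df_demo.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = list(pd.read_csv(buffer_with_index).columns)

# Export with index=False: only the real columns are written
buffer_no_index = io.StringIO()
df_demo.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
cols_no_index = list(pd.read_csv(buffer_no_index).columns)

print(cols_with_index)
print(cols_no_index)
```

An in-memory buffer is used here instead of a file so the example does not touch the project folders.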

In [7]:
# Check

df_prods

,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49667,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49668,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49669,49686,Artisan Baguette,112,3,7.8
49670,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


The product_id column could be renumbered, since its values no longer line up with the index (for example, index 49667 holds product_id 49684), but this is not essential to the analysis. Instacart may also have its own reasons for leaving the gaps.
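If the gaps in product_id ever need to be listed rather than just noticed, one possible approach is to compare the ids that are present against the full range from the smallest to the largest id. This is a minimal sketch on a toy dataframe; `df_demo` and its values are made up for illustration, and it assumes the ids are integers.

```python
import pandas as pd

# Toy stand-in with deliberate gaps in the id sequence (illustrative values only)
df_demo = pd.DataFrame({"product_id": [1, 2, 3, 5, 6, 9]})

# Every id that would exist if the sequence had no gaps
first_id = int(df_demo["product_id"].min())
last_id = int(df_demo["product_id"].max())
expected_ids = set(range(first_id, last_id + 1))

# Ids in the expected range that do not appear in the dataframe
present_ids = set(df_demo["product_id"].tolist())
missing_ids = sorted(expected_ids - present_ids)

print(missing_ids)
```

On the real data the same lines would run against `df_prods["product_id"]`.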

The focus now turns to the min and max values.

In [8]:
df_prods.describe()

,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


The max value in prices (99999.0) is almost certainly an error: the mean is roughly 9.99 and the 75th percentile is 11.1. The most expensive products in this column should be investigated.
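As a rough cross-check, the quartiles from the describe() output can be turned into an upper fence using the common 1.5 x IQR rule of thumb. This is not part of the original workflow, only a screening aid, and the variable names are hypothetical. The quartile and max values are copied from the output above.

```python
# Quartiles and max of the prices column, copied from the describe() output
q1 = 4.1
q3 = 11.1
max_price = 99999.0

# Interquartile range and the conventional 1.5 * IQR upper fence
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# True when the maximum lies above the fence and deserves a closer look
max_is_suspicious = max_price > upper_fence

print(upper_fence, max_is_suspicious)
```

A fence like this only flags candidates for review. It can also catch prices that are high but genuine, so anything it flags should be checked by eye before it is removed.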

In [9]:
df_prods_sorted = df_prods.sort_values(by = "prices", ascending = False)

In [10]:
df_prods_sorted

,product_id,product_name,aisle_id,department_id,prices
33649,33664,2 % Reduced Fat Milk,84,16,99999.0
21538,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
19376,19391,Turkey Breast Tenderloins,49,12,25.0
25564,25579,Naturally Smoked Trout Fillet,15,12,25.0
40469,40486,Chicken Tenders,49,12,25.0
...,...,...,...,...,...
11292,11307,Goat Cheese Logs,2,16,1.0
32788,32803,Regular Deodorant,80,11,1.0
46606,46623,"Pie Pans, Large",10,17,1.0
37022,37037,Redseedless,24,4,1.0


In [11]:
# Both of the top-priced items (99999.0 and 14900.0) are clearly errors and should be removed
# Ideally, the true prices would be confirmed with someone at Instacart (for example, whoever is responsible for department 16, which both products belong to)

In [12]:
# Keep only rows priced below 14900, the smaller of the two erroneous values

df_prods_clean = df_prods[df_prods["prices"] < 14900]
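An alternative to dropping the rows, if the two products should stay in the data set until their true prices are known, is to blank out only the suspect prices. This is a sketch of that option, not the approach taken in this script. The `price_cap` cut-off is a hypothetical value chosen for illustration, and the toy rows reuse three product_id and price pairs from the sorted output above.

```python
import numpy as np
import pandas as pd

# Three rows taken from the sorted output: two suspect prices and one normal price
df_demo = pd.DataFrame({
    "product_id": [33664, 21553, 19391],
    "prices": [99999.0, 14900.0, 25.0],
})

# Hypothetical cut-off: anything above it is treated as a data-entry error
price_cap = 100

# Work on a copy and replace the suspect prices with NaN instead of dropping rows
df_masked = df_demo.copy()
df_masked.loc[df_masked["prices"] > price_cap, "prices"] = np.nan

print(df_masked)
```

pandas statistics such as `mean()` and `max()` skip NaN by default, so the masked prices would no longer distort the summary while the products themselves remain available for joins.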

In [13]:
# Check

df_prods_clean.sort_values(by = "prices", ascending = False)

,product_id,product_name,aisle_id,department_id,prices
19376,19391,Turkey Breast Tenderloins,49,12,25.0
40469,40486,Chicken Tenders,49,12,25.0
9005,9020,Boneless Skinless Chicken Thighs,35,12,25.0
25564,25579,Naturally Smoked Trout Fillet,15,12,25.0
21452,21467,Wild Caught Raw Shrimp,15,12,25.0
...,...,...,...,...,...
4860,4875,Buffalo Wing Sauce,5,13,1.0
20948,20963,1 Step-1 Minute Noodles Toasted Sesame,66,6,1.0
46570,46587,Pearl Couscous Natural,63,9,1.0
48155,48172,Culture Club Kombucha Goji Ginger,31,7,1.0


The two outliers have been removed and the data set is ready to be exported.

In [14]:
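# Export without the index so that no "Unnamed: 0" column is created on the next import

df_prods_clean.to_csv(os.path.join(path, "02 Data", "Prepared Data", "products_checked_v2.csv"), index = False)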