**Contents**

01 Importing Libraries and Datasets

02 Products data

02.1 Data profile

02.2 Removing typo values

03 Orders data

03.1 Data profile

03.2 Data consistency check

04 Data export

# 01. Importing Libraries and Datasets

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Set Path
path = r'C:\Users\Forrest\Desktop\Work\CareerFoundry\Python\2022-10 Instacart Basket Analysis'

In [3]:
# Import cleaned products df
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_cleaned.csv'), index_col = 0)

In [7]:
# Import cleaned orders df
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = 0)

# 02. Products data

## 02.1 Data profile

In [27]:
df_prods.shape

(49672, 4)

In [6]:
df_prods.head(10)

product_id,product_name,aisle_id,department_id,prices
1,Chocolate Sandwich Cookies,61,19,5.8
2,All-Seasons Salt,104,13,9.3
3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
5,Green Chile Anytime Sauce,5,13,4.3
6,Dry Nose Oil,11,11,2.6
7,Pure Coconut Water With Orange,98,7,4.4
8,Cut Russet Potatoes Steam N' Mash,116,1,1.1
9,Light Strawberry Blueberry Yogurt,120,16,7.0
10,Sparkling Orange Juice & Prickly Pear Beverage,115,7,8.4


In [5]:
df_prods.describe()

,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0
mean,67.762442,11.728942,9.993282
std,38.315784,5.850779,453.615536
min,1.0,1.0,1.0
25%,35.0,7.0,4.1
50%,69.0,13.0,7.1
75%,100.0,17.0,11.1
max,134.0,21.0,99999.0


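The maximum price is $99,999, and the standard deviation (about 453.6) dwarfs the 75th percentile (11.1), so at least one price is far out of range. This is most likely a typo or a stand-in for missing data.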
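The same reasoning can be tried on a small example. The sketch below uses a hypothetical `toy_prods` frame with made-up prices (not the real dataset) to show how a single extreme value pulls the mean away from the median, and how to list the rows sitting at the maximum.

```python
import pandas as pd

# Toy stand-in for df_prods; the values are invented for illustration only
toy_prods = pd.DataFrame(
    {"prices": [5.8, 9.3, 4.5, 10.5, 99999.0]},
    index=pd.Index([1, 2, 3, 4, 5], name="product_id"),
)

# A single extreme value drags the mean far above the median
price_mean = toy_prods["prices"].mean()
price_median = toy_prods["prices"].median()

# Rows whose price equals the column maximum
max_rows = toy_prods[toy_prods["prices"] == toy_prods["prices"].max()]

print(price_mean, price_median)
print(max_rows)
```

A large gap between mean and median is only a hint; the flagged rows still need to be inspected by hand, as the next section does.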


## 02.2 Removing typo values

In [36]:
# Identifying extreme outliers in price

df_prods[df_prods['prices'] > 25]

product_id,product_name,aisle_id,department_id,prices
21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33664,2 % Reduced Fat Milk,84,16,99999.0


In [64]:
# Setting typo values to null

df_prods.at[21553, 'prices'] = np.nan
df_prods.at[33664, 'prices'] = np.nan

In [70]:
# Checking that values have changed to null

df_prods.isnull().sum()

product_name     0
aisle_id         0
department_id    0
prices           2
dtype: int64
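Setting values by hard-coded `product_id` works for two rows. As an alternative, the sketch below nulls every price above a cutoff with a boolean mask. The frame `toy_prods` and the constant `PRICE_CEILING` are hypothetical names introduced here; the value 25 simply mirrors the filter used above and is an assumption, not a validated business rule.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_prods; only the two extreme prices come from the notebook
toy_prods = pd.DataFrame(
    {"prices": [5.8, 14900.0, 4.5, 99999.0]},
    index=pd.Index([1, 21553, 3, 33664], name="product_id"),
)

PRICE_CEILING = 25  # assumed cutoff, mirroring the filter above

# Boolean mask of rows whose price exceeds the cutoff
too_high = toy_prods["prices"] > PRICE_CEILING

# Replace the flagged prices with NaN in one step
toy_prods.loc[too_high, "prices"] = np.nan

null_count = toy_prods["prices"].isnull().sum()
print(null_count)
```

The mask-based version scales to any number of flagged rows, but it would also blank out legitimately expensive products if the cutoff were set too low.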

# 03. Orders data

## 03.1 Data profile

In [26]:
df_ords.shape

(3421083, 6)

In [8]:
df_ords.head(10)

,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
6,550135,1,7,1,9,20.0
7,3108588,1,8,1,14,14.0
8,2295261,1,9,1,16,0.0
9,2550362,1,10,4,8,30.0


## 03.2 Data consistency check

In [11]:
# Check for mixed data types
# Output is list of columns with mixed data types

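for col in df_ords.columns.tolist():
    # Compare each value's type with the type of the first value in the column
    # (Series.map is used because DataFrame.applymap is deprecated in pandas 2.1+)
    col_types = df_ords[col].map(type)
    weird = col_types != col_types.iloc[0]
    if weird.any():
        print(col)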

The loop printed nothing, so no column contains mixed data types.
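Because the check above prints nothing on clean data, it can be hard to tell whether it works at all. The sketch below runs the same idea on a hypothetical `toy` frame in which one column deliberately mixes integers and a string, using `Series.map(type)` to get each value's type.

```python
import pandas as pd

# Toy frame: "clean" holds only integers, "mixed" holds integers and a string
toy = pd.DataFrame({
    "clean": [1, 2, 3],
    "mixed": [1, "two", 3],
})

mixed_cols = []
for col in toy.columns:
    # Type of every value in the column
    col_types = toy[col].map(type)
    # True wherever a value's type differs from the first value's type
    differs = col_types != col_types.iloc[0]
    if differs.any():
        mixed_cols.append(col)

print(mixed_cols)
```

Only the `mixed` column is collected, which suggests that an empty result on `df_ords` reflects consistent types rather than a check that never fires.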

In [23]:
# Check for null (NaN) values
# Output is number of null values per column

df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

206,209 values in "days_since_prior_order" are null (NaN). These most likely belong to each customer's first order, which has no prior order to measure from (row 0 above, with order_number 1, is one example). These nulls are therefore expected and can remain in the dataset.
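The first-order explanation can be tested rather than assumed. The sketch below shows one way to do so on a hypothetical `toy_ords` frame with made-up rows: it checks that the nulls fall exactly on the rows where `order_number` is 1. The same comparison could be run on `df_ords`, though its result is not shown here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; the rows are invented for illustration only
toy_ords = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    "days_since_prior_order": [np.nan, 15.0, 21.0, np.nan, 7.0],
})

# Rows with no recorded gap since the previous order
is_null = toy_ords["days_since_prior_order"].isnull()

# Rows that are a user's first order
is_first = toy_ords["order_number"] == 1

# True only if the two sets of rows coincide exactly
nulls_match_first_orders = bool((is_null == is_first).all())

# If each user has exactly one first order, null count equals user count
n_nulls = int(is_null.sum())
n_users = toy_ords["user_id"].nunique()

print(nulls_match_first_orders, n_nulls, n_users)
```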

In [24]:
# Checking for duplicate rows
# Output is (rows, columns) of duplicates

df_ords[df_ords.duplicated()].shape

(0, 6)

No duplicate rows!
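For reference, the sketch below shows what the duplicate check returns when a duplicate does exist. It uses a hypothetical `toy_ords` frame with one repeated row. By default `duplicated()` flags every repeat after the first occurrence, and `drop_duplicates()` would remove those repeats if any were found.

```python
import pandas as pd

# Toy frame in which the row (11, 1) appears twice
toy_ords = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "user_id": [1, 1, 1, 2],
})

# Rows flagged as repeats of an earlier, identical row
dupes = toy_ords[toy_ords.duplicated()]

# Frame with the repeats removed, keeping the first occurrence
deduped = toy_ords.drop_duplicates()

print(dupes.shape, deduped.shape)
```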

# 04. Data export

In [72]:
# Export df_prods

df_prods.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_cleaned.csv'))

No data transformations were made for df_ords, so no export is needed.
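By default `to_csv` writes the index as the first column, which is why the imports at the top read the files with `index_col = 0`. The sketch below round-trips a hypothetical `toy_prods` frame through an in-memory buffer (instead of a file on disk) to show that the `product_id` index and the null price both survive the export.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for df_prods with one nulled price
toy_prods = pd.DataFrame(
    {"product_name": ["A", "B"], "prices": [5.8, np.nan]},
    index=pd.Index([1, 2], name="product_id"),
)

# Write to an in-memory text buffer rather than a real file
buffer = io.StringIO()
toy_prods.to_csv(buffer)

# Rewind and read back, restoring the first column as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

NaN is written as an empty field and read back as NaN, so the nulled prices do not need special handling at export time.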