## Contents

1. [Importing Libraries](#1.-Importing-Libraries)
	1.1. [Importing Dataframes](#1.1.-Importing-Dataframes)
2. [Mixed-Type Data](#2.-Mixed-Type-Data)
3. [Missing Values](#3.-Missing-Values)
4. [Duplicates](#4.-Duplicates)
5. [Checking the Basic Statistics of the df_ords dataframe](#5.-Checking-the-Basic-Statistics-of-the-df_ords-dataframe)
	5.1. [Checking for Mixed-Type Data](#5.1.-Checking-for-Mixed-Type-Data)
	5.2. [Checking for Missing Values](#5.2.-Checking-for-Missing-Values)
	5.3. [Checking for Duplicates](#5.3.-Checking-for-Duplicates)
6. [Exporting Dataframes](#6.-Exporting-Dataframes)

## 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

In [6]:
path = r'C:\Users\User 1\Documents\Instacart Basket Analysis 04-2023'

### 1.1. Importing Dataframes

In [7]:
# Importing products.csv dataframe 
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [8]:
# Importing orders_wrangled.csv dataframe
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

## 2. Mixed-Type Data

In [9]:
# Checking for mixed-type columns
# Creating a dataframe
df_test = pd.DataFrame()

In [10]:
# Creating a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [11]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [13]:
# Check for mixed types

for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_test[weird]) > 0:
    print (col)

mix


In [14]:
# Changing mixed data types into a singular data type
df_test['mix'] = df_test['mix'].astype('str')
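
The loop above relies on `DataFrame.applymap`, which recent pandas releases deprecate. As a minimal sketch of an alternative, the hypothetical helper `mixed_type_columns` (not part of the original notebook) counts the distinct Python types found in each column. It runs on a toy dataframe that mirrors `df_test`. Like the loop above, it would also flag a string column that contains NaN, because NaN is a float.

```python
import pandas as pd


def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type.

    Hypothetical helper for illustration only.
    """
    return [col for col in df.columns if df[col].map(type).nunique() > 1]


# Toy dataframe mirroring df_test, plus one clean numeric column
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'nums': [1, 2, 3, 4]})

mixed = mixed_type_columns(df_demo)
print(mixed)
```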

## 3. Missing Values

In [17]:
# Finding Missing Values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [18]:
# Creating a new dataframe to view the 16 missing observations
df_nan = df_prods[df_prods['product_name'].isnull()==True]

In [19]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [25]:
# product_name is a string and therefore cannot be imputed with the median/mean

In [26]:
# To compare the number of rows once missing rows have been removed
df_prods.shape

(49693, 5)

In [28]:
#Creating a new dataframe to view non-missing values
df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

In [29]:
df_prods_clean.shape

(49677, 5)

The 16 rows with a missing product_name have been removed (49693 - 16 = 49677).
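
The same filter can also be written with `dropna` and its `subset` argument. The sketch below uses a small made-up dataframe (the product names are placeholders, not rows from products.csv) and compares row counts before and after, as the shape checks above do.

```python
import numpy as np
import pandas as pd

# Toy data: one of three rows has no product_name
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Milk', np.nan, 'Bread'],
})

# Drop only the rows where product_name is missing
df_demo_clean = df_demo.dropna(subset=['product_name'])

rows_removed = df_demo.shape[0] - df_demo_clean.shape[0]
print(rows_removed)
```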

## 4. Duplicates

In [30]:
# To look for duplicates
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [31]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [32]:
# Comparing the rows before and after removing duplicates
df_prods_clean.shape

(49677, 5)

In [33]:
# Creating a new dataframe without the duplicates identified above
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [34]:
df_prods_clean_no_dups.shape

(49672, 5)

5 duplicate records have been removed.
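
By default `duplicated()` marks only the second and later copies of a row. As a sketch on made-up data, passing `keep=False` shows every copy side by side, which can help confirm that the rows really are identical before they are dropped.

```python
import pandas as pd

# Toy data: the row for product_id 2 appears twice
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['A', 'B', 'B', 'C'],
})

# Number of rows that repeat an earlier row
n_dups = int(df_demo.duplicated().sum())

# Every copy of a duplicated row, including the first occurrence
all_copies = df_demo[df_demo.duplicated(keep=False)]

# Keep the first occurrence of each row
df_no_dups = df_demo.drop_duplicates()

print(n_dups, len(all_copies), len(df_no_dups))
```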

In [38]:
#Looking at the descriptive statistics
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


In [39]:
# The max value for prices (99999.0) is implausibly high for a grocery item
df_prods_clean_no_dups.loc[df_prods_clean_no_dups['prices'] == 99999]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


In [47]:
# Looking for other irregular prices for products over $50.00
df_prods_clean_no_dups[df_prods_clean_no_dups["prices"]> 50.00]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,14900.0
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


Since these values are inflated rather than missing, the most plausible cause is a data-entry error such as a misplaced decimal point.

In [49]:
# Prices for #21553 and 33664 will be replaced with 1.49 and 9.99 respectively. 
df_prods_clean_no_dups = df_prods_clean_no_dups.replace ({"prices":{99999.0: 9.99, 14900.0:1.49 }})

In [51]:
df_prods_clean_no_dups[df_prods_clean_no_dups["product_id"]==21553]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
21554,21553,Lowfat 2% Milkfat Cottage Cheese,108,16,1.49


In [52]:
df_prods_clean_no_dups[df_prods_clean_no_dups["product_id"]==33664]

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33666,33664,2 % Reduced Fat Milk,84,16,9.99
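
The `replace` call above matches on the price value, so it would change any row that held 99999.0 or 14900.0. A more targeted sketch selects rows by `product_id` with `.loc`. The corrected prices are the ones chosen in the notebook; the third row and the `corrections` dictionary are made up for illustration.

```python
import pandas as pd

# Toy data: two inflated prices plus one ordinary row (made up)
df_demo = pd.DataFrame({
    'product_id': [21553, 33664, 1],
    'prices': [14900.0, 99999.0, 5.8],
})

# Hypothetical mapping of product_id to the corrected price
corrections = {21553: 1.49, 33664: 9.99}

for product_id, price in corrections.items():
    # Only the row(s) with this product_id are updated
    df_demo.loc[df_demo['product_id'] == product_id, 'prices'] = price

max_price = df_demo['prices'].max()
print(max_price)
```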


## 5. Checking the Basic Statistics of the df_ords dataframe

In [35]:
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [None]:
# Nothing looks out of the ordinary: order_dow spans 0 to 6, order_hour_of_day spans 0 to 23, and days_since_prior_order spans 0 to 30.

### 5.1. Checking for Mixed-Type Data

In [62]:
# Checking for mixed-type data
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_ords[weird]) > 0:
    print (col)

Step 4: The loop prints nothing, so there is no mixed-type data in df_ords.

### 5.2. Checking for Missing Values

In [63]:
# Checking for missing values
df_ords.isnull().sum()

order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

In [64]:
# Missing values in df_ords
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull() == True]

In [65]:
df_ords_nan

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
11,2168274,2,prior,1,2,11,
26,1374495,3,prior,1,1,14,
39,3343014,4,prior,1,6,11,
45,2717275,5,prior,1,3,12,
...,...,...,...,...,...,...,...
3420930,969311,206205,prior,1,4,12,
3420934,3189322,206206,prior,1,3,18,
3421002,2166133,206207,prior,1,6,19,
3421019,2227043,206208,prior,1,1,15,


Step 6: The missing values are all in days_since_prior_order, and the rows shown have order_number = 1. These rows are each customer's first order, so there is no prior order to count days from. This is consistent with the number of missing values (206209) equaling the maximum user_id (206209): one first order per customer.

Because these missing values are not errors, there is no need to remove or impute them. Instead, missing values in days_since_prior_order can be used to flag first orders.
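
The reasoning above can be checked directly rather than read off the preview. The following is a minimal sketch on made-up data; the `first_order` column name is an assumption, not something the original notebook creates.

```python
import numpy as np
import pandas as pd

# Toy data: two users with two orders each
df_demo = pd.DataFrame({
    'user_id': [1, 1, 2, 2],
    'order_number': [1, 2, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 10.0],
})

missing = df_demo['days_since_prior_order'].isnull()

# Every row with a missing value should be a first order
all_first_orders = bool((df_demo.loc[missing, 'order_number'] == 1).all())

# There should be exactly one such row per user
one_per_user = bool(missing.sum() == df_demo['user_id'].nunique())

# Hypothetical flag column marking first orders
df_demo['first_order'] = missing

print(all_first_orders, one_per_user)
```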

### 5.3. Checking for Duplicates

In [66]:
# Finding duplicates
df_ords_dups = df_ords[df_ords.duplicated()]

In [67]:
df_ords_dups

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order


Step 8: There are no duplicates.
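
`duplicated()` with no arguments compares entire rows. As an optional extra check, and assuming order_id is meant to identify each order uniquely, the `subset` argument looks for repeated keys even when other columns differ. The data below is made up.

```python
import pandas as pd

# Toy data: order_id 11 appears twice with different hours
df_demo = pd.DataFrame({
    'order_id': [10, 11, 11],
    'user_id': [1, 2, 2],
    'order_hour_of_day': [8, 9, 10],
})

# Full-row comparison finds nothing because the hours differ
full_row_dups = int(df_demo.duplicated().sum())

# Key-only comparison catches the repeated order_id
key_dups = int(df_demo.duplicated(subset=['order_id']).sum())

print(full_row_dups, key_dups)
```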

## 6. Exporting Dataframes

In [68]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_checked.csv'), index = False)

In [69]:
# Export the checked orders dataframe (df_ords), not the empty duplicates dataframe (df_ords_dups)
df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_checked.csv'), index = False)
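
A quick read-back after exporting can confirm that the file holds the expected rows and columns. The sketch below writes a toy dataframe to a temporary directory instead of the project folder; the file name `orders_checked_demo.csv` is made up.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the exported data
df_demo = pd.DataFrame({'order_id': [1, 2, 3], 'user_id': [1, 1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'orders_checked_demo.csv')
    df_demo.to_csv(out_path, index=False)
    # Read the file back while the temporary directory still exists
    df_back = pd.read_csv(out_path)

same_shape = df_back.shape == df_demo.shape
print(same_shape)
```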