# Table of Contents

### a. Notebook Prep
### b. Wrangle "products.csv"

### 1. Consistency Checks on Products Dataframe
### 2. Run describe on orders. How does the data look? 
### 3. Check for mixed-type data
### 5. Check for missing values
### 6. Address missing values
### 7. Run a check for duplicates
### 8. Address the duplicates
### 9. Export orders

### a. Notebook Prep

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
#Define path variable
path = r'C:\Users\PC Planet\Desktop\Self-Education\Data Immersion\Achievement 4\Instacart Basket Analysis'

In [3]:
#Read dataframes
df_prod = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)
df_ordw = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

### b. Wrangle "products.csv"

In [4]:
#Wrangle "products.csv" before continuing since that was not yet done
df_prod['product_id'] = df_prod['product_id'].astype('str')
df_prod['aisle_id'] = df_prod['aisle_id'].astype('str')
df_prod['department_id'] = df_prod['department_id'].astype('str')
df_prod.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_wrangled.csv'), index = False)

In [5]:
#Read wrangled products dataframe
df_prow = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_wrangled.csv'), index_col = False)
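
A note on the round trip above: a CSV file stores no dtype information, so the string conversion applied before the export is generally not preserved when the file is read back with default settings. The following is a minimal sketch on a toy frame. The values are made up, and an in-memory buffer stands in for the file on disk. It shows the effect and one possible way to keep the IDs as strings, by passing `dtype` to `read_csv`:

```python
import io
import pandas as pd

# Toy stand-in for products.csv; the values are illustrative only
df_toy = pd.DataFrame({'product_id': [1, 2, 3], 'prices': [5.8, 9.3, 4.5]})
df_toy['product_id'] = df_toy['product_id'].astype('str')

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df_toy.to_csv(buffer, index=False)

# Default read: pandas infers the type again, so the IDs come back as integers
buffer.seek(0)
df_default = pd.read_csv(buffer)
print(df_default['product_id'].dtype)

# Passing dtype tells pandas to keep the IDs as strings
buffer.seek(0)
df_typed = pd.read_csv(buffer, dtype={'product_id': str})
print(df_typed['product_id'].tolist())
```

Whether the IDs need to be strings at this stage depends on what the later steps do with them; the checks in this notebook work either way.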

### 1. Consistency Checks on Products Dataframe

In [6]:
#Identify Missing Values
df_prow.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [7]:
#Create null dataframe
df_nan = df_prow[df_prow['product_name'].isnull()]
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [8]:
#Get shape of df_prow
df_prow.shape

(49693, 5)

In [9]:
#Create clean dataframe
df_proc = df_prow[df_prow['product_name'].notnull()]
df_proc.shape

(49677, 5)

In [10]:
#Find duplicates
df_dups = df_proc[df_proc.duplicated()]
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [11]:
#Remove duplicates
df_proc = df_proc.drop_duplicates()
df_proc.shape

(49672, 5)

In [12]:
#Export products
df_proc.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'products_cleaned.csv'), index = False)
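
The `index` argument of `to_csv` controls whether the row index is written to the file. A small sketch, using two product rows from the duplicates table above and an in-memory buffer in place of a real file, shows what each setting produces when the file is read back:

```python
import io
import pandas as pd

# Two rows borrowed from the duplicates output above
df_toy = pd.DataFrame({
    'product_name': ['Ranger IPA', 'Adore Forever Body Wash'],
    'prices': [9.2, 9.9],
})

# Default (index=True): the row index is written as an extra, unnamed first column
with_index = io.StringIO()
df_toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
df_toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'product_name', 'prices']
print(cols_without_index)  # ['product_name', 'prices']
```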

### 2. Run describe on orders. How does the data look?

In [13]:
#Describe orders
df_ordw.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


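Answer: The minimum of orders_day_of_week is zero, which does not seem to make sense at first. However, since the maximum is 6, it seems likely that the company's recording system counts the seven days of the week starting at zero. The count for days_since_prior_order (3,214,874) is also lower than for the other columns (3,421,083), so that column contains missing values.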

### 3. Check for mixed-type data

In [14]:
#Print the name of every column that holds more than one data type
for col in df_ordw.columns.tolist():
    weird = (df_ordw[[col]].applymap(type) != df_ordw[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ordw[weird]) > 0:
        print(col)

No mixed-type data
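
The same idea can be sketched more compactly by counting the distinct Python types in each column. The helper name `find_mixed_type_columns` and the toy frame are assumptions for illustration, not part of the exercise; `Series.map` is used here because `DataFrame.applymap` is deprecated in recent pandas versions.

```python
import pandas as pd

def find_mixed_type_columns(df):
    """Return the names of columns whose cells hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Map every cell to its Python type and count the distinct types
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Toy frame: 'mixed' deliberately combines an int, a str and a float
df_toy = pd.DataFrame({
    'order_id': [1, 2, 3],
    'mixed': [1, 'two', 3.0],
})

mixed_columns = find_mixed_type_columns(df_toy)
print(mixed_columns)  # ['mixed']
```

One caveat with this kind of check: in a column of strings, a missing value stored as NaN is a float, so it would be reported as a second type.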

### 5. Check for missing values

In [15]:
#Null check
df_ordw.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

Missing values in "days_since_prior_order" may result from empty cells being used inconsistently to mean zero, that is, zero days since the previous order.

### 6. Address missing values

In [16]:
#View missing values in a subset
df_onan = df_ordw[df_ordw['days_since_prior_order'].isnull()]
df_onan

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


Upon further inspection, my earlier hypothesis is incorrect. The new hypothesis is that the NaN values mark each user's first order, which has no prior order to count days from. This is supported by the "user_id" values in the subset running in sequence up to a final user_id that matches the row count (206,209). However strong this idea is, it is not proof, so a check needs to be conducted.
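
One way to test a hypothesis like this directly is to compare two boolean masks: "the value is missing" and "this is order number 1". The sketch below uses a small made-up frame with the same column names as the orders data. It is only an illustration of the technique, not the check used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the orders data; the values are illustrative only
df_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

is_missing = df_toy['days_since_prior_order'].isnull()
is_first_order = df_toy['order_number'] == 1

# True only if the two masks agree on every row
hypothesis_holds = bool((is_missing == is_first_order).all())
print(hypothesis_holds)
```

This comparison is stricter than the hypothesis needs: it also requires that every first order has a missing value, not only that every missing value belongs to a first order.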

In [17]:
#Check that user_ids are all unique
if df_onan['user_id'].is_unique:
    print('All elements are unique')
else:
    print('Elements are not unique')

All elements are unique


In [18]:
#Check that all order_numbers are the user's first order
df_onan['order_number'].sum()

206209

The sum of order_number matches the number of rows. Since order_number is never less than 1 (see the minimum in the describe output), every one of these rows must have an order_number of exactly 1. Now that we know that each order with a missing value in 'days_since_prior_order' is a user's first order, and that each user appears exactly once, we can decide how to manage the data. It would be inappropriate to remove these rows: there is nothing wrong with the data itself, and the details of a customer's first order are valuable information. The best method is therefore to flag the missing values.

I tried repeatedly to add a column that would simply say 'yes' or 'no' depending on whether the 'days_since_prior_order' value is null, but kept running into errors and could not find a clear explanation after several hours of searching. The lesson mentions flagging as an option without showing how to do it, so the missing values are left unflagged in this notebook.
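
For reference, a flag column of this kind can be sketched in a couple of lines. The column names `first_order_flag` and `first_order` are hypothetical, and the toy frame stands in for `df_ordw`. The column is added to the full frame rather than to a filtered subset such as `df_onan`, since assigning to a subset can trigger pandas' SettingWithCopyWarning.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ordw; the values are illustrative only
df_toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

# Boolean flag: True where days_since_prior_order is missing
df_toy['first_order_flag'] = df_toy['days_since_prior_order'].isnull()

# Text version of the same flag, using 'yes' and 'no'
df_toy['first_order'] = np.where(df_toy['days_since_prior_order'].isnull(), 'yes', 'no')

print(df_toy[['order_number', 'first_order_flag', 'first_order']])
```

The boolean version is usually easier to filter on later, for example with `df_toy[df_toy['first_order_flag']]`, while the text version matches the 'yes'/'no' column described above.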

### 7. Run a check for duplicates

In [19]:
#Find duplicates
df_ordw_dups = df_ordw[df_ordw.duplicated()]
df_ordw_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


There are no duplicates

### 8. Address the duplicates

No duplicates were found in step 7, so there is nothing to address.

### 9. Export orders

In [20]:
#Export orders
df_ordw.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_cleaned.csv'), index = False)