# Content
### Import Libraries
### Import Data
### Data Consistency Checks on Dataframe Products
#### Mixed type data
#### Missing Values
#### Duplicates
### Export Data
### Exercise


# I. LESSON

# 01. Import Libraries

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import os

# 02. Import Data 

In [4]:
path = r'/Users/maitran/Documents/Instacart Basket Analysis'

In [13]:
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [6]:
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

# 03. Data Consistency Checks on Dataframe Products

# Mixed type data

In [7]:
# Create a dataframe

df_test = pd.DataFrame()

In [8]:
# Create a mixed type column
df_test['mix'] = ['a', 'b', 1, True]

In [9]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [10]:
# Check for mixed types
for col in df_test.columns.tolist():
  # Flag rows whose value type differs from the type of the first row's value
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_test[weird]) > 0:
    print(col)

mix


In [11]:
# Convert the mixed-type column to a single type (string)
df_test['mix'] = df_test['mix'].astype('str')
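
The check above relies on `DataFrame.applymap`, which recent pandas releases mark as deprecated. As a minimal sketch of an alternative, the helper `find_mixed_columns` below (a hypothetical name, not part of the original notebook) counts the distinct Python types in each column with `Series.map`, run here on a toy dataframe that mirrors df_test:

```python
import pandas as pd

def find_mixed_columns(df):
    """Return names of columns holding more than one Python type.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    mixed = []
    for col in df.columns:
        # Map every value to its Python type and count the distinct types
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Toy dataframe mirroring df_test from the lesson
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
mixed_before = find_mixed_columns(df_demo)

# Casting to string leaves a single type in the column
df_demo['mix'] = df_demo['mix'].astype('str')
mixed_after = find_mixed_columns(df_demo)
```

Like the original loop, this sketch would also flag a text column that contains missing values, because NaN is stored as a float.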

# Missing Values


In [14]:
# Count missing values in each column
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [18]:
# Create a subset of the rows where product_name is missing
df_nan = df_prods[df_prods['product_name'].isnull()]

In [17]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [20]:
df_prods.shape

(49693, 5)

In [38]:
# Create a new subset that keeps only the rows with a product name
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [22]:
df_prods_clean

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7
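
As a cross-check on the filtering step, here is a minimal sketch (toy data, not the real products file) showing that `dropna` restricted to one column keeps the same rows as the boolean mask, and that the number of rows removed equals the number of missing names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_prods; the values are illustrative only
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Cookies', np.nan, 'Tea', np.nan],
})

# Boolean-mask approach, equivalent to the filter used in the lesson
clean_mask = df_demo[df_demo['product_name'].notnull()]

# dropna restricted to one column keeps the same rows
clean_dropna = df_demo.dropna(subset=['product_name'])

# Rows removed should equal the number of missing names
rows_removed = len(df_demo) - len(clean_dropna)
missing_names = int(df_demo['product_name'].isnull().sum())
```

The lesson's numbers pass the same check: 49693 rows minus 16 missing names gives the 49677 rows reported for df_prods_clean.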


# Duplicates

In [42]:
# Finding full duplicates
df_prods_dups = df_prods_clean[df_prods_clean.duplicated()]

In [24]:
df_prods_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [26]:
df_prods_clean.shape

(49677, 5)

In [29]:
# Delete duplicates and create new dataframe
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [30]:
df_prods_clean_no_dups.shape

(49672, 5)
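
The drop from 49677 to 49672 rows matches the 5 duplicates listed above. A minimal sketch on toy data (illustrative values only) shows the relationship between `duplicated()` and `drop_duplicates()` that this row-count check relies on:

```python
import pandas as pd

# Toy stand-in for df_prods_clean; the values are illustrative only
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3, 3],
    'product_name': ['Salt', 'Tea', 'Tea', 'Stout', 'Stout'],
})

# duplicated() flags every repeat after the first occurrence of a row
dup_flags = df_demo.duplicated()
n_dups = int(dup_flags.sum())

# drop_duplicates() keeps the first occurrence and removes the flagged rows
df_no_dups = df_demo.drop_duplicates()

# Row counts before and after should differ by exactly n_dups
rows_dropped = len(df_demo) - len(df_no_dups)
```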

# Export Cleaned df_prods 

In [31]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))
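
By default `to_csv` also writes the dataframe index, which comes back as an `Unnamed: 0` column when the file is reloaded. The `Unnamed: 0` column in the df_ords output in the exercise is consistent with this. The sketch below uses in-memory buffers instead of real files (an assumption made to keep it self-contained) to compare the default export with `index=False`:

```python
import io
import pandas as pd

# Toy dataframe; the values are illustrative only
df_demo = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Default export: the index is written as an unnamed first column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index, index_col=False).columns.tolist()

# Export with index=False: only the real columns are written
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index, index_col=False).columns.tolist()
```

Whether to pass `index=False` in the export cells is a design choice. It avoids carrying an extra column into later notebooks.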

# II. EXERCISE

# Data Consistency Checks on Dataframe Orders

### Question 2: Interpret output of describe function on df_ords

In [33]:
df_ords.describe()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710541.0,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,0.0,1.0,1.0,1.0,0.0,0.0,0.0
25%,855270.5,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710541.0,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421082.0,3421083.0,206209.0,100.0,6.0,23.0,30.0


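Answer: Looking at the descriptive statistics of df_ords, it seems strange at first that the minimum of orders_day_of_week, order_hour_of_day, and days_since_prior_order is 0 in each case. However, 0 in orders_day_of_week could simply be the code for one day of the week (for example, Monday), and 0 in order_hour_of_day could stand for midnight (12 am) on a 24-hour clock. A 0 in days_since_prior_order is harder to interpret; it could mean that a customer placed another order on the same day as the previous one. The count for days_since_prior_order (3214874) is also lower than the count for the other columns (3421083), which points to missing values in that column.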

### Question 3: Check for mixed_type data in df_ords

In [35]:
# Check for mixed types
for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_ords[weird]) > 0:
    print (col)

Answer: There is no mixed-type data detected.

### Question 5: Check for missing values in df_ords

In [37]:
# Count missing values in each column
df_ords.isnull().sum()

Unnamed: 0                     0
order_id                       0
user_id                        0
eval_set                       0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

In [40]:
# Create a subset of the rows where days_since_prior_order is missing
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]

In [41]:
df_ords_nan

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,prior,1,2,8,
11,11,2168274,2,prior,1,2,11,
26,26,1374495,3,prior,1,1,14,
39,39,3343014,4,prior,1,6,11,
45,45,2717275,5,prior,1,3,12,
...,...,...,...,...,...,...,...,...
3420930,3420930,969311,206205,prior,1,4,12,
3420934,3420934,3189322,206206,prior,1,3,18,
3421002,3421002,2166133,206207,prior,1,6,19,
3421019,3421019,2227043,206208,prior,1,1,15,


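Answer: The column days_since_prior_order has 206209 missing values. When looking at the table of missing values, I can see that all of them have the value 1 in the order_number column. This suggests that each of these rows is a customer's first order, so there is no earlier order to count days from, which is why the values in days_since_prior_order are missing. The number of missing values also equals the maximum user_id (206209) in the describe() output, which is consistent with one first order per customer.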
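
The displayed table is truncated, so the observation about order_number can be confirmed programmatically instead of by eye. A minimal sketch on toy data (illustrative values, with column names taken from df_ords):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ords; the values are illustrative only
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0],
})

# Rows where the gap to the prior order is missing
df_demo_nan = df_demo[df_demo['days_since_prior_order'].isnull()]

# True if every row with a missing value is a first order
all_first_orders = bool((df_demo_nan['order_number'] == 1).all())

# True if there is exactly one such row per customer
one_per_user = len(df_demo_nan) == df_demo['user_id'].nunique()
```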

### Question 6: Address the missing values

Answer: I choose not to change the missing values, because it makes sense that days_since_prior_order has no value for a customer's first order. If I removed every row with a missing value, there would be no data on first orders, which would lead to inaccurate analysis. Imputing the mean or median doesn't make sense in this situation either.
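
A small sketch on toy values (not the real data) illustrates this reasoning: pandas skips NaN when computing a mean, so leaving the gaps in place does not distort the statistic, whereas median imputation would assign a gap to orders that have no prior order.

```python
import numpy as np
import pandas as pd

# Toy days_since_prior_order values; NaN marks a first order
days = pd.Series([np.nan, 15.0, 21.0, np.nan, 10.0])

# pandas skips NaN by default, so the mean covers only real gaps
mean_skipping_nan = days.mean()  # (15 + 21 + 10) / 3

# Median imputation would give first orders a gap of 15 days
imputed = days.fillna(days.median())
first_order_gap_after_imputation = imputed.iloc[0]
```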

### Question 7 & 8: Check for duplicate values in df_ords

In [44]:
# Finding full duplicates
df_ords_dups = df_ords[df_ords.duplicated()]

In [45]:
df_ords_dups

Unnamed: 0.1,Unnamed: 0,order_id,user_id,eval_set,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


Answer: No duplicates found. No action needed.

### Question 9: Export cleaned df_ords

In [47]:
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))