# 4.5 IC Data Consistency

### This script contains the following points: <br> <br> 
1. Importing Libraries <br> <br> 
2. Importing Data Sets <br> <br> 
3. Data Checks <br>
 > 3.1 Products Data Set <br>
 > 3.2 Orders Data Set <br>
4. Data Consistency Checks: Products Dataframe <br><br>
 > 4.1 Missing Values (Products Dataframe) <br>
 > 4.2 Checking for Duplicates (Products Dataframe) <br>
 > 4.3 Mixed Datatypes Check (Products Dataframe) <br>
 > 4.4 Consider outliers (Products Dataframe) <br>
 > 4.5 Exporting Products Dataframe <br>
5. Data Consistency Checks: Orders Dataframe <br><br>
 > 5.1 Missing Values (Orders Dataframe) <br>
 > 5.2 Checking for Duplicates (Orders Dataframe) <br>
 > 5.3 Mixed Datatypes Check (Orders Dataframe) <br>
 > 5.4 Exporting Orders Dataframe <br> 

## 01 Import Libraries

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import os

## 02 Import Data

In [2]:
# First create a string of the path for the main project folder
path = r'/Users/mistystone/Library/CloudStorage/OneDrive-Personal/Documents/CF_Data_Ach4_Python/2023-05_Instacart_Basket_Analysis/'

In [3]:
# Import products.csv file
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'))

In [4]:
# Import saved (already wrangled) orders_wrangled.csv as df_ords
df_ords = pd.read_csv(os.path.join(path, '02 Data','Prepared Data','orders_wrangled.csv'), index_col = False)

## 03 Data Checks

### 03.01 Products Data Set

In [5]:
# df_prods head 
df_prods.head()

,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [6]:
# df_prods shape
df_prods.shape

(49693, 5)

In [7]:
# df_prods descriptive statistics
df_prods.describe()

,product_id,aisle_id,department_id,prices
count,49693.0,49693.0,49693.0,49693.0
mean,24844.345139,67.770249,11.728433,9.994136
std,14343.717401,38.316774,5.850282,453.519686
min,1.0,1.0,1.0,1.0
25%,12423.0,35.0,7.0,4.1
50%,24845.0,69.0,13.0,7.1
75%,37265.0,100.0,17.0,11.2
max,49688.0,134.0,21.0,99999.0


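Note that the prices column appears to contain outliers: the maximum is 99,999 and the standard deviation is about 453.5, while the median price is only 7.1.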

### 03.02 Orders Data Set

In [8]:
# df_ords head
df_ords.head()

,Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [9]:
# df_ords shape
df_ords.shape

(3421083, 7)

This dataframe did not import as expected: it has 7 columns, including an extra "Unnamed: 0" column that duplicates the index.
I am going to delete the "Unnamed: 0" column.
I am going to change order_id and user_id to strings.

In [10]:
# Drop the 'Unnamed: 0' column
df_ords = df_ords.drop(columns = ['Unnamed: 0'])

In [11]:
# Check orders head
df_ords.head()

,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
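
An "Unnamed: 0" column usually appears when a dataframe was saved with its index and then read back without declaring that first column as the index. The sketch below reproduces this on a tiny made-up frame (the values and variable names are illustrative, not taken from the project files) and shows two common ways to avoid it.

```python
import io
import pandas as pd

# Small illustrative frame; not the real orders data.
df = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# By default, to_csv writes the index as an unnamed first column.
with_index = df.to_csv()
reread = pd.read_csv(io.StringIO(with_index))
print(reread.columns.tolist())

# Option 1: do not write the index when saving.
no_index = df.to_csv(index=False)
clean_a = pd.read_csv(io.StringIO(no_index))

# Option 2: tell read_csv that the first column is the index.
clean_b = pd.read_csv(io.StringIO(with_index), index_col=0)
print(clean_a.columns.tolist(), clean_b.columns.tolist())
```

Either option removes the need to drop the column after import; which one fits depends on whether the saved index carries any meaning.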


In [12]:
# Change the data type of order_id and user_id to strings
df_ords['order_id'] = df_ords['order_id'].astype('str')
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [13]:
# Check orders data types
df_ords.dtypes

order_id                   object
user_id                    object
order_number                int64
order_day_of_week           int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object
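
One likely benefit of storing the ID columns as strings is that describe() then skips them, so the summary statistics only cover columns where a mean or a maximum is meaningful. A minimal sketch on made-up rows, assuming that motivation:

```python
import pandas as pd

# Illustrative rows only.
orders = pd.DataFrame({
    'order_id': [2539329, 2398795, 473747],
    'user_id': [1, 1, 1],
    'order_number': [1, 2, 3],
})

# While the IDs are integers, describe() summarises them like quantities.
before_cols = orders.describe().columns.tolist()

# Convert each identifier column to string.
for col in ['order_id', 'user_id']:
    orders[col] = orders[col].astype(str)

# After the conversion only the genuinely numeric column remains.
summary_cols = orders.describe().columns.tolist()
print(before_cols, summary_cols)
```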

## 04 Data Consistency Checks: Products Dataframe

### 04.01 Missing Values (Products Dataframe)

In [14]:
# Check for missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

There are 16 missing values in the product_name column.

In [15]:
# Create a subset of rows with missing values.
df_nan = df_prods[df_prods['product_name'].isnull()]
df_nan

,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [16]:
# Check products size
df_prods.shape

(49693, 5)

In [17]:
# Keep only the rows that have a product name (drops the 16 rows with missing names).
df_prods_nomissing = df_prods[df_prods['product_name'].notnull()]

In [18]:
# Check products size
df_prods_nomissing.shape

(49677, 5)
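
The missing-value steps above can be sketched on a small made-up frame. The example also shows dropna with a subset argument, which is an equivalent alternative to the boolean filter and is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative data with two missing product names.
prods = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Cookies', np.nan, 'Tea', np.nan],
    'prices': [5.8, 9.3, 4.5, 10.5],
})

# Count missing values per column.
missing_per_column = prods.isnull().sum()

# Keep rows where product_name is present (boolean filter).
prods_nomissing = prods[prods['product_name'].notnull()]

# Equivalent: drop rows where product_name is missing.
prods_dropna = prods.dropna(subset=['product_name'])

print(missing_per_column['product_name'], prods_nomissing.shape)
```

Comparing the shape before and after, as the notebook does, confirms that exactly the rows with missing names were removed.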

### 04.02 Checking for Duplicates (Products Dataframe)

In [19]:
# Checking for duplicates
df_dups = df_prods_nomissing[df_prods_nomissing.duplicated()]

In [20]:
df_dups

,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


There are 5 duplicate rows in this dataframe.

In [21]:
# Check shape
df_prods_nomissing.shape

(49677, 5)

In [22]:
# Delete duplicates; .copy() gives an independent frame so later assignments do not act on a slice
df_prods_clean = df_prods_nomissing.drop_duplicates().copy()

In [23]:
# Check shape
df_prods_clean.shape

(49672, 5)
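
A short sketch of how duplicated() and drop_duplicates() relate, using made-up rows. By default duplicated() marks only the second and later copies, which is why the row count drops by exactly the number of flagged rows; keep=False is shown as a way to view every copy side by side.

```python
import pandas as pd

# Illustrative data: the first two rows are identical.
prods = pd.DataFrame({
    'product_id': [462, 462, 18458],
    'product_name': ['Fiber 4g Gummy Dietary Supplement',
                     'Fiber 4g Gummy Dietary Supplement',
                     'Ranger IPA'],
    'prices': [4.8, 4.8, 9.2],
})

# True for the second and later occurrences of a fully identical row.
dup_mask = prods.duplicated()
n_dups = int(dup_mask.sum())

# Removing duplicates keeps the first occurrence of each row.
deduped = prods.drop_duplicates()

# keep=False flags every copy, which helps when inspecting duplicates.
all_copies = prods[prods.duplicated(keep=False)]

print(n_dups, len(deduped), len(all_copies))
```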

### 04.03 Mixed Datatypes Check (Products Dataframe)

In [24]:
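# Checking for mixed data types: print any column whose values differ in type from its first row
# Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map
for col in df_prods_clean.columns.tolist():
    weird = (df_prods_clean[[col]].applymap(type) != df_prods_clean[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_prods_clean[weird]) > 0:
        print(col)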

In [25]:
# There are no mixed data type variables.
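
An alternative way to run the same kind of check, sketched here as a hypothetical helper named mixed_type_columns (not part of the notebook), is to count the distinct Python types in each column. It relies on Series.map, which is available across pandas versions. Note that a missing value in a text column is a float, so it would be reported as a mixed type too.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Illustrative frame: product_name mixes str and int on purpose.
demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['Tea', 42, 'Salt'],
})

print(mixed_type_columns(demo))
```

An empty list corresponds to the loop above printing nothing.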

### 04.04 Consider outliers (Products Dataframe)

In [26]:
# descriptive statistics for prices column
df_prods_clean['prices'].describe()

count    49672.000000
mean         9.993282
std        453.615536
min          1.000000
25%          4.100000
50%          7.100000
75%         11.100000
max      99999.000000
Name: prices, dtype: float64

In [27]:
# Find the highest price items -- everything over 100
df_high_prices = df_prods_clean.loc[df_prods_clean['prices'] > 100]

In [28]:
# Set the prices of these high-priced products (over 100) to missing
df_prods_clean.loc[df_prods_clean['prices'] > 100, 'prices'] = np.nan

In [29]:
# descriptive statistics for prices column
df_prods_clean['prices'].describe()

count    49670.000000
mean         7.680437
std          4.199381
min          1.000000
25%          4.100000
50%          7.100000
75%         11.100000
max         25.000000
Name: prices, dtype: float64

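Much better. The two prices above $100 are now missing, so the highest priced item in the data set is $25 and the standard deviation has dropped from about 453.6 to about 4.2.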
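
The outlier step can be sketched on a made-up frame. The threshold of 100 follows the notebook; the data and the explicit .copy() after filtering are assumptions for illustration. Working on an explicit copy is one common way to keep pandas from warning that values are being set on a slice.

```python
import numpy as np
import pandas as pd

# Illustrative data with one implausible price.
raw = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'prices': [5.8, 9.3, 99999.0, 25.0],
})

# Filter, then take an explicit copy before modifying values.
prods = raw[raw['product_id'] > 0].copy()

threshold = 100
n_flagged = int((prods['prices'] > threshold).sum())

# Replace implausible prices with NaN rather than deleting the rows.
prods.loc[prods['prices'] > threshold, 'prices'] = np.nan

# Summary statistics skip NaN, so count drops and max reflects real prices.
max_price = prods['prices'].max()
count = int(prods['prices'].count())
print(n_flagged, max_price, count)
```

Setting the value to NaN keeps the product rows available for joins while excluding the bad prices from statistics.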

### 04.05 Exporting Products Dataframe

In [30]:
df_prods_clean.shape

(49672, 5)

In [31]:
# Export
df_prods_clean.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_clean.csv'))

## 05 Data Consistency Checks: Orders Dataframe

In [32]:
# Orders check head
df_ords.head()

,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [33]:
# Orders data types
df_ords.dtypes

order_id                   object
user_id                    object
order_number                int64
order_day_of_week           int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

In [34]:
# Orders size check
df_ords.shape

(3421083, 6)

In [35]:
# Orders dataset descriptive statistics
df_ords.describe()

,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3214874.0
mean,17.15486,2.776219,13.45202,11.11484
std,17.73316,2.046829,4.226088,9.206737
min,1.0,0.0,0.0,0.0
25%,5.0,1.0,10.0,4.0
50%,11.0,3.0,13.0,7.0
75%,23.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


All of the descriptive statistics seem reasonable. 
The order numbers start at 1 and end at 100, with an average of 17.2.
The day of the week starts at 0 and ends at 6, with an average of 2.8.
The order hour starts at 0 and ends at 23, with an average of 13.5.
The days since prior order starts at 0 and ends at 30, with an average of 11.1. 
I'm wondering whether the dataset assigns a value of 30 to every days_since_prior_order of 30 and above, or whether it is simply truncated to eliminate any observations over 30. 

### 05.01 Missing Values (Orders Dataframe)

In [36]:
# Check for missing values
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
order_day_of_week              0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

There are 206209 missing values in the days_since_prior_order column. 
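
In the head() output above, the one missing value sits on a row with order_number 1, where no prior order exists. Whether that holds for all of the missing values is an assumption worth checking before choosing how to fill them. A minimal sketch of such a check on made-up rows:

```python
import numpy as np
import pandas as pd

# Illustrative rows for two users; not the real orders data.
orders = pd.DataFrame({
    'user_id': ['1', '1', '1', '2', '2'],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

is_missing = orders['days_since_prior_order'].isnull()
is_first = orders['order_number'] == 1

# True only if the missing values line up exactly with first orders.
all_missing_are_first = bool((is_missing == is_first).all())
print(all_missing_are_first)
```

If the check returns True on the full dataframe, the missing values are structural (there is no earlier order) rather than lost data.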

In [37]:
# Frequency table to inspect the distribution and the most frequent value (including NaN)
df_ords['days_since_prior_order'].value_counts(dropna = False)

30.0    369323
7.0     320608
6.0     240013
4.0     221696
3.0     217005
5.0     214503
NaN     206209
2.0     193206
8.0     181717
1.0     145247
9.0     118188
14.0    100230
10.0     95186
13.0     83214
11.0     80970
12.0     76146
0.0      67755
15.0     66579
16.0     46941
21.0     45470
17.0     39245
20.0     38527
18.0     35881
19.0     34384
22.0     32012
28.0     26777
23.0     23885
27.0     22013
24.0     20712
25.0     19234
29.0     19191
26.0     19016
Name: days_since_prior_order, dtype: int64

In [38]:
# Check days_since_prior_order descriptive statistics
df_ords['days_since_prior_order'].describe()

count    3.214874e+06
mean     1.111484e+01
std      9.206737e+00
min      0.000000e+00
25%      4.000000e+00
50%      7.000000e+00
75%      1.500000e+01
max      3.000000e+01
Name: days_since_prior_order, dtype: float64

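The most frequent observation is 30 (369,323 orders).
The mean is 11.11.
The median is 7. 
Because the mean is larger than the median, the data is skewed to the right, so the median is a more robust fill value than the mean. 
I'm going to replace the missing values with the median, 7.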
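
The reasoning about skew can be sketched numerically. The series below is made up to mimic the shape of the column (a cluster of short gaps plus a spike at 30); it is not the real data.

```python
import pandas as pd

# Illustrative values: mostly short gaps, with a spike at the maximum of 30.
days = pd.Series([0, 2, 3, 4, 6, 7, 7, 8, 14, 30, 30], dtype='float64')

mean_days = days.mean()
median_days = days.median()

# Positive skewness means a longer tail on the right.
skewness = days.skew()

# Share of observations sitting exactly at the maximum value.
share_at_max = (days == days.max()).mean()

print(mean_days > median_days, skewness > 0, share_at_max)
```

A mean above the median together with positive skewness supports preferring the median over the mean as a fill value. A large share of observations at the maximum is also consistent with values being capped at 30, though it does not prove it.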

In [39]:
# Replace missing values with the median (assign the result back instead of using inplace on a column selection)
df_ords['days_since_prior_order'] = df_ords['days_since_prior_order'].fillna(7)

In [40]:
# Check for missing values
df_ords.isnull().sum()

order_id                  0
user_id                   0
order_number              0
order_day_of_week         0
order_hour_of_day         0
days_since_prior_order    0
dtype: int64

There are no more missing values.
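
Once the missing values are filled, rows that were originally missing can no longer be told apart from genuine 7-day gaps. One optional safeguard, sketched here with a hypothetical column name prior_order_missing that is not part of the notebook, is to record a flag before filling.

```python
import numpy as np
import pandas as pd

# Illustrative rows only.
orders = pd.DataFrame({
    'order_number': [1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, 21.0],
})
col = 'days_since_prior_order'

# Hypothetical flag column that remembers which rows were missing.
orders['prior_order_missing'] = orders[col].isnull()

# Fill with the median of the observed values, assigning the result back.
median_value = orders[col].median()
orders[col] = orders[col].fillna(median_value)

print(orders)
```

The flag makes it possible to exclude the imputed rows later without re-importing the original file.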

### 05.02 Checking for Duplicates (Orders Dataframe)

In [41]:
# Checking for duplicates
df_dups = df_ords[df_ords.duplicated()]

In [42]:
# Print duplicates
df_dups

,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order


There are no duplicates in this dataframe.
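
duplicated() with no arguments only finds rows that match in every column. As an additional check that is not part of the notebook, the subset argument can test whether the key column alone repeats. A sketch on made-up rows in which an order_id repeats while the full rows differ:

```python
import pandas as pd

# Illustrative rows: order_id '11' appears twice with different order_number.
orders = pd.DataFrame({
    'order_id': ['10', '11', '11'],
    'user_id': ['1', '1', '1'],
    'order_number': [1, 2, 3],
})

# Full-row duplicates: none, because order_number differs.
full_row_dups = int(orders.duplicated().sum())

# Key-level duplicates: the repeated order_id is caught.
key_dups = int(orders.duplicated(subset=['order_id']).sum())

# Shortcut for the same question on a single column.
ids_unique = orders['order_id'].is_unique

print(full_row_dups, key_dups, ids_unique)
```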

### 05.03 Mixed Datatypes Check (Orders Dataframe)

In [43]:
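# Checking for mixed data types: print any column whose values differ in type from its first row
# Note: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col)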

There are no mixed data type variables.

### 05.04 Exporting Orders Dataframe

In [44]:
# Check shape
df_ords.shape

(3421083, 6)

In [45]:
# Export
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_clean.csv'))
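
CSV files do not store data types, so the string conversion applied to order_id and user_id is not preserved by the export. The sketch below uses an in-memory buffer and made-up rows to show the effect and one way to restore the types on import. Passing index=False is a suggestion here, not what the export cell above does.

```python
import io
import pandas as pd

# Illustrative frame with a string ID column.
orders = pd.DataFrame({
    'order_id': ['2539329', '2398795'],
    'order_number': [1, 2],
})

# Export without the index so no 'Unnamed: 0' column appears on re-import.
csv_text = orders.to_csv(index=False)

# A default read infers the ID column as integers again.
default_read = pd.read_csv(io.StringIO(csv_text))

# Declaring the dtype on import keeps the IDs as strings.
typed_read = pd.read_csv(io.StringIO(csv_text), dtype={'order_id': str})

print(default_read.dtypes)
print(typed_read.dtypes)
```

Any later script that reads the exported file would need the same dtype argument, or would need to repeat the astype conversion.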