# Data Consistency Checks

## Content
#### 1. Importing libraries and data
#### 2. Wrangling procedures that haven't been saved properly to orders_wrangled.csv
#### 3. Data consistency checks for df_prods
#### 3.1 Mixed type data
#### 3.2 Missing values
#### 3.3 Duplicates
#### 4. Data consistency check for df_ords
#### 4.1 Mixed type data
#### 4.2 Missing values
#### 4.3 Duplicates
#### 5. Exporting dataframes

## Comments about orders_wrangled.csv
#### In the previous notebook (Task 4.4) I slightly changed the code that exports the dataframe df_ords to orders_wrangled.csv so that the index column is not written to the file. However, a CSV file does not store data types, so the conversion of order_id and user_id to strings done in that notebook is lost on import. I need to change the data types of these two columns again.

# 1. Importing libraries and data

In [139]:
# Importing libraries
import pandas as pd
import numpy as np
import os

In [140]:
# Project folder path
path = r'C:\Users\Lara\Career Foundry Projects\21-09-2023 Instacart Basket Analysis'

In [141]:
# Importing datasets products.csv and orders_wrangled.csv
df_prods = pd.read_csv (os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)
df_ords = pd.read_csv (os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

# 2. Wrangling procedures that haven't been saved properly to orders_wrangled.csv

In [142]:
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [143]:
df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
dtypes: float64(1), int64(5)
memory usage: 156.6 MB


In [144]:
# Changing data types for columns order_id and user_id as it was done in Task 4.4
df_ords['order_id'] = df_ords['order_id'].astype('str')
df_ords['user_id'] = df_ords['user_id'].astype('str')

In [145]:
# Checking data types again
df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                object 
 1   user_id                 object 
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
dtypes: float64(1), int64(3), object(2)
memory usage: 156.6+ MB
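
As a possible alternative to converting the columns after import, the data types can be requested at read time. The block below is a minimal sketch on a hypothetical two-row sample (values taken from the head() output above) written to an in-memory buffer instead of the real file. It only illustrates that a CSV round trip turns the string ids back into integers unless `dtype` is passed to `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical miniature of df_ords with ids already stored as strings
sample = pd.DataFrame({
    'order_id': ['2539329', '2398795'],
    'user_id': ['1', '1'],
    'order_number': [1, 2],
})

# Write to an in-memory CSV (stand-in for orders_wrangled.csv)
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default import: pandas infers integers for the id columns
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Import with explicit types: the id columns stay strings
buffer.seek(0)
reloaded_as_str = pd.read_csv(buffer, dtype={'order_id': str, 'user_id': str})

print(reloaded_default.dtypes)
print(reloaded_as_str.dtypes)
```

With this approach the two astype('str') calls would not be needed after every import, at the cost of repeating the `dtype` mapping in each notebook.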


# 3. Data consistency checks for df_prods

In [146]:
# Getting number of rows and columns before any changes are made
df_prods.shape

(49693, 5)

## 3.1 Mixed type data

In [147]:
# Check for mixed type columns
for col in df_prods.columns.tolist():
  weird_prods = (df_prods[[col]].applymap(type) != df_prods[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_prods[weird_prods]) > 0:
    print (col)

product_name


#### The loop above printed 'product_name' as a column with mixed types.
#### It is possible that missing values (NaN) are stored as a numeric type instead of a string.
#### I will proceed with checking for missing values to see if this is the case.
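
To illustrate why a missing value can make a text column appear mixed, here is a small sketch on a hypothetical toy dataframe. The helper `mixed_type_columns` is my own name, not part of pandas, and it is a simplified variant of the loop above: it counts the distinct Python types per column instead of comparing every value with the first row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: a text column with one missing value
toy = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': pd.Series(['Ranger IPA', np.nan, 'Adore Forever Body Wash'], dtype=object),
})

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

print(type(np.nan))             # NaN is a float, not a string
print(mixed_type_columns(toy))  # the text column with NaN is reported
```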

## 3.2 Missing values

In [148]:
# Finding missing values in df_prods
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

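#### Indeed, column 'product_name' has 16 missing values. NaN is stored as a float, while the product names are strings, which explains the mixed types. (The 'dtype: int64' in the output above is the type of the counts, not of the missing values.)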

In [149]:
# Create subset of the 16 rows with a missing value in column product_name
df_prods_nan = df_prods[df_prods['product_name'].isnull()]

In [150]:
df_prods_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [151]:
# Create new dataframe from df_prods without missing values
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [152]:
df_prods_clean.shape

(49677, 5)
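
The filter above could also be written with `dropna`. A minimal sketch on a hypothetical three-row frame (the product names are placeholders), showing that both approaches keep the same rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing product name
toy_prods = pd.DataFrame({
    'product_id': [33, 34, 35],
    'product_name': ['Placeholder A', np.nan, 'Placeholder B'],
})

# Boolean-mask filter, as used in this notebook
by_mask = toy_prods[toy_prods['product_name'].notnull()]

# Same result with dropna restricted to one column
by_dropna = toy_prods.dropna(subset=['product_name'])

print(by_mask.equals(by_dropna))
print(by_dropna.shape)
```

The `subset` argument matters here: without it, `dropna` would also remove rows that have a missing value in any other column.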

#### The 16 rows with missing values were removed successfully. These were also the rows that made 'product_name' a column with mixed data types.
#### Checking again if everything is consistent now. If no column names are printed, there are no columns with mixed data types.

In [153]:
for col in df_prods_clean.columns.tolist():
  weird_prods_clean = (df_prods_clean[[col]].applymap(type) != df_prods_clean[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_prods_clean[weird_prods_clean]) > 0:
    print (col)

## 3.3 Duplicates

In [154]:
# Create subset with full duplicates - rows that are identical to an earlier row in every column
df_prods_dups = df_prods_clean[df_prods_clean.duplicated()]

In [155]:
df_prods_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [156]:
# Creating new dataframe without duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [157]:
df_prods_clean_no_dups.shape

(49672, 5)
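
As a quick sanity check, the row counts reported above can be reconciled, and the behaviour of `duplicated()` can be seen on a small example. This is a sketch on a hypothetical toy frame that reuses two product rows from the output above; only the row counts are taken from this notebook.

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
toy_prods = pd.DataFrame({
    'product_id': [462, 462, 18458],
    'product_name': ['Fiber 4g Gummy Dietary Supplement',
                     'Fiber 4g Gummy Dietary Supplement',
                     'Ranger IPA'],
    'prices': [4.8, 4.8, 9.2],
})

# keep='first' is the default: only the later copy is flagged
later_copies = toy_prods[toy_prods.duplicated()]

# keep=False flags every row that belongs to a duplicated group
all_copies = toy_prods[toy_prods.duplicated(keep=False)]

deduped = toy_prods.drop_duplicates()
print(len(later_copies), len(all_copies), len(deduped))

# Row counts reported in this notebook: start, missing names, duplicates
rows_start, rows_missing, rows_dups = 49693, 16, 5
print(rows_start - rows_missing - rows_dups)
```

Because `duplicated()` flags only the later copies by default, the number of rows it returns equals the number of rows that `drop_duplicates()` removes.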

# 4. Data consistency check for df_ords

In [158]:
# Getting number of rows and columns before any changes are made
df_ords.shape

(3421083, 6)

In [159]:
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


#### I notice that NaN in column 'days_since_prior_order' refers to N/A for first order from a new customer and this should be taken into consideration when dealing with such data.

In [160]:
df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                object 
 1   user_id                 object 
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
dtypes: float64(1), int64(3), object(2)
memory usage: 156.6+ MB


In [161]:
df_ords.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3214874.0
mean,17.15486,2.776219,13.45202,11.11484
std,17.73316,2.046829,4.226088,9.206737
min,1.0,0.0,0.0,0.0
25%,5.0,1.0,10.0,4.0
50%,11.0,3.0,13.0,7.0
75%,23.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


#### In all columns nothing looks out of the ordinary. For example, max for order_number is 100, but it is not unusual for someone to order 100 times from Instacart. 
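
The visual check of describe() could be backed by an explicit range check. The sketch below runs on a hypothetical three-row frame. The expected ranges are my assumptions based on the min and max values shown above (days 0 to 6, hours 0 to 23, at most 100 orders, at most 30 days between orders), not documented limits of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the numeric columns of df_ords
toy_ords = pd.DataFrame({
    'order_number': [1, 2, 100],
    'orders_day_of_week': [2, 3, 6],
    'order_hour_of_day': [8, 7, 23],
    'days_since_prior_order': [np.nan, 15.0, 30.0],
})

# Assumed valid ranges (inclusive), taken from the describe() output
expected_ranges = {
    'order_number': (1, 100),
    'orders_day_of_week': (0, 6),
    'order_hour_of_day': (0, 23),
    'days_since_prior_order': (0, 30),
}

# Count non-missing values that fall outside the assumed range
out_of_range = {}
for col, (low, high) in expected_ranges.items():
    present = toy_ords[col].notnull()
    inside = toy_ords[col].between(low, high)
    out_of_range[col] = int((present & ~inside).sum())

print(out_of_range)
```

Missing values are excluded from the count on purpose, since they are handled separately in the missing values check.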

## 4.1 Mixed type data

In [162]:
# Check for mixed type columns
for col in df_ords.columns.tolist():
  weird_ords = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_ords[weird_ords]) > 0:
    print (col)

#### There are no columns with mixed data type

## 4.2 Missing values

In [163]:
# Finding missing values in df_ords
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

#### There are 206209 missing values in column days_since_prior_order.
#### As mentioned above, if all these values belong to the first order of a new customer, they should be left as they are, because imputing them with 0 would skew the results. 0 days would mean that somebody ordered twice in one day.

In [164]:
# Create subset of all rows that have order_number = 1
df_ords_first_order = df_ords.loc[df_ords['order_number']==1]

In [165]:
df_ords_first_order.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,


In [166]:
df_ords_first_order.shape

(206209, 6)

#### The number of first orders (206209) matches the number of NaN values in column "days_since_prior_order", so the NaN values correspond to first orders from new customers.
#### I propose that all of these values be left as they are.
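
Matching counts are a strong hint, but a stricter check would compare the two conditions row by row. A minimal sketch on a hypothetical three-row frame; on the real data the same comparison would be applied to df_ords.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two customers, NaN only on their first orders
toy_ords = pd.DataFrame({
    'user_id': ['1', '1', '2'],
    'order_number': [1, 2, 1],
    'days_since_prior_order': [np.nan, 15.0, np.nan],
})

is_missing = toy_ords['days_since_prior_order'].isnull()
is_first_order = toy_ords['order_number'] == 1

# True only if every NaN sits on a first order and every first order has NaN
masks_match = bool((is_missing == is_first_order).all())
print(masks_match)
```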

## 4.3 Duplicates

In [167]:
# Create subset with full duplicates - rows that are identical to an earlier row in every column
df_ords_dups = df_ords[df_ords.duplicated()]

In [168]:
df_ords_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order


#### There are no fully duplicated rows in this dataframe.

# 5. Exporting dataframes

#### For both dataframes, I am adding index = False so that the exported files do not contain an index column.

In [169]:
# Exporting the cleaned products dataframe (missing values and duplicates removed)
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'), index = False)

In [170]:
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'), index = False)
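
To show what index = False changes in the exported file, here is a small sketch that writes a hypothetical one-row frame to in-memory buffers instead of the project folder and compares the header lines.

```python
import io
import pandas as pd

# Hypothetical one-row frame standing in for df_ords
toy_ords = pd.DataFrame({'order_id': ['2539329'], 'user_id': ['1']})

# Export with the default settings: the index is written as an unnamed first column
with_index = io.StringIO()
toy_ords.to_csv(with_index)

# Export with index=False: only the data columns are written
without_index = io.StringIO()
toy_ords.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(repr(header_with))
print(repr(header_without))

# Reading the first file back produces an extra 'Unnamed: 0' column
reloaded = pd.read_csv(io.StringIO(with_index.getvalue()))
print(list(reloaded.columns))
```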