# 3. Data consistency checks
** **
## Table of contents:

1. Importing libraries <br>
2. Importing dataframes <br>
3. Data consistency checks
    - 3.1 Mixed-type data <br>
    - 3.2 Missing values <br>
    - 3.3 Duplicates <br>
    - 3.4 Exporting the dataframe after consistency checks <br>
4. Tasks
    - 4.1 Task 2
    - 4.2 Task 3
    - 4.3 Task 5
    - 4.4 Task 6
    - 4.5 Task 7
    - 4.6 Exporting dataframes after consistency checks
** **

# 1. Importing libraries
** **

In [1]:
import pandas as pd
import numpy as np
import os

# 2. Importing dataframes
** **

In [2]:
# Creating a path variable for the folder
path = r'C:\Users\Simone\Desktop\Career Foundry\Esercizi modulo 5\Instacart basket analysis'

In [3]:
# Importing orders wrangled dataframe from Prepared Data
df_ords = pd.read_csv(os.path.join(path, '02. Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [4]:
# Printing the first 5 rows to test
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


There is a stray column, 'Unnamed: 0', that looks like an old index saved as a regular column.

In [5]:
# Removing first column
df_ords = df_ords.drop(columns = ['Unnamed: 0'])

In [6]:
# Testing
df_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
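
One likely source of a column like 'Unnamed: 0' is an earlier export that wrote the index to the file. This is an assumption, since the script that produced orders_wrangled.csv is not shown here. The sketch below uses a small hypothetical dataframe (df_demo) and an in-memory buffer instead of a real file to show the effect and how `index=False` avoids it.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the orders data
df_demo = pd.DataFrame({'order_id': [2539329, 2398795], 'user_id': [1, 1]})

# Default export: the index is written as an unnamed first column
buf_with_index = io.StringIO()
df_demo.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# Export with index=False: only the real columns are written
buf_no_index = io.StringIO()
df_demo.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_with_index)  # ['Unnamed: 0', 'order_id', 'user_id']
print(cols_no_index)    # ['order_id', 'user_id']
```

If the index carries no information, passing `index=False` at export time would make the later `drop` step unnecessary.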


In [7]:
# Importing second dataframe
df_prods = pd.read_csv(os.path.join(path, '02. Data', 'Original Data', 'products.csv'), index_col = False)

In [8]:
# Printing the first 5 rows to test
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


# 3. Data consistency checks
** **

## 3.1 Mixed-type data

This section shows how to check whether a column contains mixed-type data.

In [9]:
# Creating a new dataframe for test
df_test = pd.DataFrame()

In [10]:
# Creating a column with mixed data
df_test['mix'] = ['a', 'b', 1, True]

This column contains mixed-type data: two strings, an integer and a boolean value.

In [11]:
# Testing
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [12]:
# Checking for mixed-type columns:
# each cell's type is compared with the type of the first cell in the same column
for col in df_test.columns.tolist():
    weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)

mix


This code checks every column in the specified dataframe (df_test in this case) and prints the name of the columns containing mixed-type data (mix). <br>
If nothing is printed, it means the dataframe doesn't contain mixed-type columns.
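
As a possible alternative, the same idea can be written by counting the distinct Python types in each column. This is only a sketch: the helper name `mixed_type_columns` and the dataframe `df_demo` are hypothetical, and the approach avoids `applymap`, which recent pandas versions deprecate in favour of `DataFrame.map`.

```python
import pandas as pd

def mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

# Hypothetical frame: one mixed column and one clean column
df_demo = pd.DataFrame({
    'mix': ['a', 'b', 1, True],
    'clean': [1, 2, 3, 4],
})

print(mixed_type_columns(df_demo))  # ['mix']
```

One caveat: a missing value in a text column is stored as a float (NaN), so a column with strings and missing values is also reported as mixed by this kind of check.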

## 3.2 Missing values

This section shows how to check for missing values and how to deal with them.

In [13]:
# Checking for missing values
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

The column "product_name" contains 16 missing values.

In [14]:
# Subsetting (creating a new dataframe) to visualize the missing values
df_nan = df_prods[df_prods['product_name'].isnull() == True]

In [15]:
# Visualizing subset
df_nan.head(20)

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


Here we can see a subset containing only the 16 rows with missing values.

In [16]:
# Checking the shape of the dataframe before deleting missing values
df_prods.shape

(49693, 5)

In [17]:
# Creating a new dataframe that excludes the 16 rows with missing values
df_prods_clean = df_prods[df_prods['product_name'].isnull() == False]

In [18]:
# Checking if the new dataframe has 16 fewer rows
df_prods_clean.shape

(49677, 5)

The new dataframe df_prods_clean has 16 fewer rows. We successfully filtered out the missing values.
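
The boolean filter used above can also be expressed with `dropna` and its `subset` parameter. The sketch below, on a small hypothetical dataframe (df_demo), suggests that both routes give the same result when only one column is considered.

```python
import numpy as np
import pandas as pd

# Hypothetical products frame with two missing names
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3, 4],
    'product_name': ['Cookies', np.nan, 'Salt', None],
    'prices': [5.8, 9.3, 4.5, 10.5],
})

# Route 1: boolean filter, as in the cells above
filtered = df_demo[df_demo['product_name'].notnull()]

# Route 2: dropna restricted to the product_name column
dropped = df_demo.dropna(subset=['product_name'])

print(filtered.equals(dropped))      # True
print(len(df_demo) - len(dropped))   # 2 rows removed
```

Restricting `dropna` with `subset` matters: without it, a row would be dropped if any column were missing, not just product_name.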

## 3.3 Duplicates

In this section we see how to find duplicates and remove them.

In [19]:
# Creating a new dataframe containing only duplicates
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [20]:
# Checking the new dataframe
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [21]:
# Checking the number of rows before deleting duplicates
df_prods_clean.shape

(49677, 5)

In [22]:
# Removing duplicates
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [23]:
# Checking the number of rows after deletion
df_prods_clean_no_dups.shape

(49672, 5)

The new dataframe has 5 fewer rows. Duplicates have been removed.
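
It may help to see which rows `duplicated()` actually flags. By default only the later copies are marked, which is why the subset above shows one row per duplicated product. The sketch uses a hypothetical four-row dataframe (df_demo).

```python
import pandas as pd

# Hypothetical frame where rows 1 and 2 are identical
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['Ranger IPA', 'Body Wash', 'Body Wash', 'Salt'],
})

# Default (keep='first'): only the second copy is flagged
later_copies = df_demo[df_demo.duplicated()]

# keep=False: every copy is flagged, which is handy for inspection
all_copies = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row
deduped = df_demo.drop_duplicates()

print(later_copies.index.tolist())  # [2]
print(all_copies.index.tolist())    # [1, 2]
print(len(deduped))                 # 3
```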

## 3.4 Exporting the dataframe after consistency checks

In [24]:
# Exporting the new dataframe, which contains no missing values or duplicates
df_prods_clean_no_dups.to_csv(os.path.join(path, '02. Data', 'Prepared Data', 'products_checked.csv'))

# 4. Tasks
** **

In this section, we will perform consistency checks on the orders dataframe. <br>
Before diving into the tasks, I noticed that two variables in the orders dataframe (order_id and user_id) have the wrong datatype: they are identifiers, not quantities, so I will convert them to strings.

In [25]:
# Changing the datatypes
df_ords['order_id'] = df_ords['order_id'].astype('str')

In [26]:
df_ords['user_id'] = df_ords['user_id'].astype('str')
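
A small sketch of what this conversion changes in practice, using a hypothetical two-row dataframe (df_demo): once an identifier is stored as a string, `describe()` no longer computes statistics for it.

```python
import pandas as pd

# Hypothetical frame with an identifier and a genuine numeric column
df_demo = pd.DataFrame({'order_id': [2539329, 2398795], 'order_number': [1, 2]})

# Before the conversion, describe() treats order_id as a number
numeric_before = df_demo.describe().columns.tolist()

df_demo['order_id'] = df_demo['order_id'].astype('str')

# After the conversion, only order_number is summarised
numeric_after = df_demo.describe().columns.tolist()

print(numeric_before)  # ['order_id', 'order_number']
print(numeric_after)   # ['order_number']
```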

## 4.1 Task 2

In this section we will use the describe function to see if we find anything strange in the products dataframe.

In [27]:
# Using describe on df_prods
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


The max value for prices is 99999, which is extremely suspicious considering that the mean is about 10 and the 75th percentile is 11.1. <br>
It is probably an outlier entered by mistake (or a missing value, with 99999 acting as a placeholder). <br>
Note: this issue is addressed in a later script, 4.9.

In [28]:
# Creating a new subset to inspect the strange value
df_prices_sus = df_prods[df_prods['prices']==99999]

In [29]:
# Checking the output
df_prices_sus

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33666,33664,2 % Reduced Fat Milk,84,16,99999.0


It seems there is a single product (2 % Reduced Fat Milk) with this wrong price.
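
To illustrate how much a single placeholder-like value can distort summary statistics, here is a minimal sketch on a hypothetical price series. The cut-off `PRICE_CEILING` is an assumption made for the example only; it is not the method used in the later script.

```python
import pandas as pd

# Hypothetical prices: four plausible values and one placeholder-like value
prices = pd.Series([5.8, 9.3, 4.5, 10.5, 99999.0])

# Assumed cut-off for a plausible grocery price (illustrative only)
PRICE_CEILING = 100

# where() keeps values that satisfy the condition and turns the rest into NaN
cleaned = prices.where(prices <= PRICE_CEILING)

print(prices.mean())    # inflated by the single extreme value
print(cleaned.mean())   # computed on the four plausible prices only
print(cleaned.isnull().sum())
```

Marking the value as missing, rather than deleting the row, keeps the product in the dataframe while excluding its price from the statistics.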

## 4.2 Task 3

In this section we will check if there are mixed-type data columns in the orders dataframe.

In [30]:
# Checking for mixed-type columns in df_ords
for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)


<b> Observations: </b> <br>
No column was printed, which means there are no variables with mixed data types in df_ords.

## 4.3 Task 5

In this section we will check if there are missing values inside the columns of the orders dataframe.

In [31]:
# Checking for missing values in df_ords
df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_creation         0
days_since_prior_order    206209
dtype: int64

In [32]:
# Investigating the missing values through a new subset
df_nan = df_ords[df_ords['days_since_prior_order'].isnull() == True]

In [33]:
# Checking the output
df_nan

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,
...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,


<b> Observations: </b> <br>
days_since_prior_order has 206209 missing values. <br>
My guess is that this is neither a mistake nor a truly missing value. <br>
The column is empty every time a customer places their first order (see how the user_id changes while the order_number is always 1). <br>
Since it is the first order, it is not possible to calculate the number of days since the prior order, because there is no prior order.
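
This guess can be tested rather than only read off the printed subset. The sketch below runs the check on a small hypothetical dataframe (df_demo); the same comparisons could be applied to df_ords.

```python
import numpy as np
import pandas as pd

# Hypothetical orders: two customers, first order has no prior order
df_demo = pd.DataFrame({
    'user_id': ['1', '1', '2', '2', '2'],
    'order_number': [1, 2, 1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, np.nan, 8.0, 30.0],
})

is_missing = df_demo['days_since_prior_order'].isnull()
is_first = df_demo['order_number'] == 1

# True only if the missing rows are exactly the first-order rows
same_rows = bool((is_missing == is_first).all())

# True only if there is exactly one missing value per customer
one_per_user = bool(is_missing.sum() == df_demo['user_id'].nunique())

print(same_rows, one_per_user)  # True True
```

In the real data the count of missing values (206209) matches the highest user_id visible in the output, which is consistent with one first order per customer.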

## 4.4 Task 6

In [34]:
# Addressing the missing values using an appropriate method
df_ords_clean = df_ords[df_ords['days_since_prior_order'].isnull() == False]

I decided to create a new dataframe that filters out the missing values. <br>
It doesn't make sense to impute values in this case. <br>
These rows are first orders, so it is not possible to establish how many days have passed since the prior order, because no prior order exists. <br>
By doing so, we lose information on 206,209 records, but that is only about 6% of the whole dataframe.
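
The 6% figure follows from the row counts reported in this notebook. The sketch below reproduces the arithmetic and also shows one possible alternative to deletion: keeping the rows and adding a boolean flag. The column name `first_order` and the dataframe `df_demo` are hypothetical and are not used elsewhere in the project.

```python
import numpy as np
import pandas as pd

# Share of rows removed, using the counts reported in this notebook
missing_rows = 206209
remaining_rows = 3214874
share_removed = missing_rows / (missing_rows + remaining_rows) * 100
print(round(share_removed, 2))  # about 6.03

# Alternative sketch: keep every row and flag first orders instead
df_demo = pd.DataFrame({
    'order_number': [1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, 21.0],
})
df_demo['first_order'] = df_demo['days_since_prior_order'].isnull()
print(df_demo['first_order'].tolist())  # [True, False, False]
```

Flagging keeps the first orders available for analyses that do not depend on days_since_prior_order, at the cost of carrying missing values forward.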

In [35]:
# Testing the new dataframe, which does not contain missing values
df_ords_clean

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


The new dataframe has 206,209 fewer rows.

## 4.5 Task 7

In this section, we will check for duplicates in the orders dataframe.

In [36]:
# Checking for duplicates in the new clean dataframe
df_ords_dups = df_ords_clean[df_ords_clean.duplicated()]

In [37]:
# Checking the new dataframe
df_ords_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order


The new dataframe contains 0 rows. 0 duplicates found.

In [38]:
# As a bonus, I would like to check if there were duplicates in the dataframe prior to deleting the rows with missing values
df_ords_dups_bonus = df_ords[df_ords.duplicated()]

In [39]:
# Checking the output
df_ords_dups_bonus

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order


No duplicates were found even before deleting the rows with missing values.
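
`duplicated()` with no arguments only flags rows that are identical in every column. If order_id is meant to identify each order uniquely (an assumption, since the notebook does not state it), a key-level check can catch repeated ids that differ in other columns. A sketch on a hypothetical dataframe (df_demo):

```python
import pandas as pd

# Hypothetical orders where order_id '11' appears twice with different hours
df_demo = pd.DataFrame({
    'order_id': ['10', '11', '11'],
    'user_id': ['1', '1', '1'],
    'order_hour_of_creation': [8, 7, 9],
})

# Full-row check: no two rows are identical in every column
full_row_dups = int(df_demo.duplicated().sum())

# Key-level check: the second '11' is flagged
key_dups = int(df_demo.duplicated(subset=['order_id']).sum())

print(full_row_dups)  # 0
print(key_dups)       # 1
```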

## 4.6 Exporting dataframes after consistency checks

In this section we will export both the orders and products dataframes after consistency checks. <br>
Before exporting them, I will perform some additional checks to be sure everything is in order. <br>
Note: the orders dataframe exported will be df_ords_clean, because we found neither duplicates nor mixed-type data.

In [40]:
# Checking head and tail
df_ords_clean

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
5,3367565,1,6,2,7,19.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


<b> Observations: </b> <br>
The dataframe now has 3,214,874 rows (the 206,209 rows with missing values have been deleted); however, the index is the same as before and no longer matches the new number of rows. <br>
I will check with my tutor to see if I should reset the index. <br>
NOTE: When imported in script 4.6, the dataframe index reset on its own, which fixed the issue.
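
For reference, the index can also be reset explicitly with `reset_index(drop=True)`. The sketch below uses a hypothetical three-row dataframe (df_demo) to show the gap left by filtering and how the reset closes it.

```python
import numpy as np
import pandas as pd

# Hypothetical orders: the first row has a missing value
df_demo = pd.DataFrame({
    'order_number': [1, 2, 3],
    'days_since_prior_order': [np.nan, 15.0, 21.0],
})

# Filtering keeps the original labels, so the index now starts at 1
df_clean = df_demo[df_demo['days_since_prior_order'].notnull()]
print(df_clean.index.tolist())  # [1, 2]

# drop=True discards the old labels instead of adding them as a column
df_reset = df_clean.reset_index(drop=True)
print(df_reset.index.tolist())  # [0, 1]
```

A dataframe read back from a CSV file receives a fresh 0-based index by default, which would explain why the index appears reset after import; the old labels then show up as an extra column if they were written to the file.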

In [41]:
# Checking descriptive statistics
df_ords_clean.describe()

Unnamed: 0,order_number,orders_day_of_week,order_hour_of_creation,days_since_prior_order
count,3214874.0,3214874.0,3214874.0,3214874.0
mean,18.19107,2.777637,13.44082,11.11484
std,17.7995,2.044923,4.225992,9.206737
min,2.0,0.0,0.0,0.0
25%,6.0,1.0,10.0,4.0
50%,12.0,3.0,13.0,7.0
75%,25.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


<b> Observations: </b> <br>
Having deleted all the rows with missing values in the days_since_prior_order column (and consequently all rows with 1 as order number), the minimum value for order number is now 2.

In [42]:
# Checking dtypes
df_ords_clean.dtypes

order_id                   object
user_id                    object
order_number                int64
orders_day_of_week          int64
order_hour_of_creation      int64
days_since_prior_order    float64
dtype: object

<b> Observations: </b> <br>
Everything seems correct. I am unsure whether order_number should also be converted to a string. <br>
NOTE: the datatype was changed in a later script.

I will perform the same checks on the products dataframe as well.

In [43]:
# Checking head and tail
df_prods_clean_no_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3
...,...,...,...,...,...
49688,49684,"Vodka, Triple Distilled, Twist of Vanilla",124,5,5.3
49689,49685,En Croute Roast Hazelnut Cranberry,42,1,3.1
49690,49686,Artisan Baguette,112,3,7.8
49691,49687,Smartblend Healthy Metabolism Dry Cat Food,41,8,4.7


In [44]:
# Checking describe
df_prods_clean_no_dups.describe()

Unnamed: 0,product_id,aisle_id,department_id,prices
count,49672.0,49672.0,49672.0,49672.0
mean,24850.349775,67.762442,11.728942,9.993282
std,14340.705287,38.315784,5.850779,453.615536
min,1.0,1.0,1.0,1.0
25%,12432.75,35.0,7.0,4.1
50%,24850.5,69.0,13.0,7.1
75%,37268.25,100.0,17.0,11.1
max,49688.0,134.0,21.0,99999.0


<b> Observations: </b> <br>
Here again is the issue of the max value of prices, which will be investigated in a later script.

In [45]:
# Checking dtypes
df_prods_clean_no_dups.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

<b> Observations: </b> <br>
I don't know if product_id, aisle_id and department_id should be strings as well. I will check with my tutor. <br>
NOTE: the datatype was changed in a later script.

Both dataframes are now ready to be exported.

In [46]:
# Exporting the new orders dataframe that does not contain missing values
df_ords_clean.to_csv(os.path.join(path, '02. Data', 'Prepared Data', 'orders_checked.csv'))

In [47]:
# Exporting the new products dataframe that does not contain missing values or duplicates
df_prods_clean_no_dups.to_csv(os.path.join(path, '02. Data', 'Prepared Data', 'products_checked.csv'))