# 4.5 Task - Data Consistency Checks

### Script contents:

#### Importing libraries and data
#### Survey data with .describe() function
#### Check for mixed-type data
#### Check for missing values in data
#### Check for duplicates in data
#### Export data

## Importing libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

## Importing data

In [5]:
# Create a string with the path to the main project folder
path = '/Users/jarrettpugh/Library/CloudStorage/OneDrive-Personal/Data Analytics/Career Foundry - DA Bootcamp/A4 Python Fundamentals for Data Analysts/Instacart Basket Analysis'

In [6]:
# Import orders_wrangled.csv as dataframe df_ords
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

In [7]:
df_ords.head()

Unnamed: 0.1,Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,0,2539329,1,1,2,8,
1,1,2398795,1,2,3,7,15.0
2,2,473747,1,3,3,12,21.0
3,3,2254736,1,4,4,7,29.0
4,4,431534,1,5,4,15,28.0


In [11]:
# Drop the leftover index column 'Unnamed: 0' from df_ords
df_ords = df_ords.drop('Unnamed: 0', axis=1)

In [12]:
df_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0
...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0


# Task 4.5

In [4]:
# 1. If you haven’t performed the consistency checks covered in this Exercise on your 
# df_prods dataframe, do so now.

I have performed consistency checks on df_prods

In [13]:
# 2. Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret 
# the output of this function, share in a markdown cell whether anything about the data looks off or should be 
# investigated further.
# Tip: Keep an eye on min and max values!

df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


In [14]:
df_ords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 6 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
dtypes: float64(1), int64(5)
memory usage: 156.6 MB


Every column except days_since_prior_order has the same count (3,421,083). This makes sense because days_since_prior_order is the only column that should contain NULL values.

order_id and user_id are identifiers and could be treated as strings, so their summary statistics are not meaningful.

The min/max values look plausible overall. A minimum of 0 makes sense for orders_day_of_week, order_hour_of_day, and days_since_prior_order, and the maximums of 6 and 23 match a seven-day week and a 24-hour day. Two values could be investigated further. The maximum order_number is 100; although possible, I would want to double-check that this is correct. The maximum days_since_prior_order is 30, which is also possible, but gaps longer than 30 days are plausible, so the column may be capped at 30.
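
One way to follow up on the two maximums questioned above is to count how many rows sit exactly at each maximum. The sketch below uses a small made-up frame (`sample`, a hypothetical stand-in for df_ords), so the numbers are illustrative only. A large share of rows at exactly 30 would hint that days_since_prior_order is capped rather than naturally bounded.

```python
import pandas as pd

# Hypothetical stand-in for df_ords; the values are invented for illustration.
sample = pd.DataFrame({
    'order_number': [1, 2, 3, 100, 1, 2],
    'days_since_prior_order': [None, 15.0, 30.0, 30.0, None, 7.0],
})

# Count the rows that sit exactly at each column's maximum.
rows_at_max_order = (sample['order_number'] == sample['order_number'].max()).sum()
rows_at_max_days = (
    sample['days_since_prior_order'] == sample['days_since_prior_order'].max()
).sum()

# Share of non-null rows at the maximum; a high share hints at a cap.
share_at_max_days = rows_at_max_days / sample['days_since_prior_order'].notna().sum()

print(rows_at_max_order, rows_at_max_days, share_at_max_days)
```

On the real data, the same three expressions can be run against df_ords directly.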

In [16]:
# 3. Check for mixed-type data in your df_ords dataframe.

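for col in df_ords.columns.tolist():
    # Flag rows whose value type differs from the type of the column's first value
    # (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map replaces it there)
    weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis=1)
    if len(df_ords[weird]) > 0:
        print(col)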

There is no mixed-type data in the df_ords dataframe.
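
Since the check prints nothing on df_ords, it can help to see what a positive result looks like. The sketch below is an alternative formulation, not the check used in this notebook: it counts the distinct Python types per column on a made-up frame (`toy`), using a hypothetical helper named `mixed_type_columns`. It also shows one possible fix, casting the column to a single type.

```python
import pandas as pd

# Hypothetical frame: 'mixed' holds both int and str, 'clean' holds only ints.
toy = pd.DataFrame({'clean': [1, 2, 3], 'mixed': [1, 'two', 3]})

def mixed_type_columns(df):
    """Return the names of columns whose values have more than one Python type."""
    return [col for col in df.columns if df[col].map(type).nunique() > 1]

print(mixed_type_columns(toy))   # the 'mixed' column is reported

# One possible fix: cast the whole column to a single type (here, string).
toy['mixed'] = toy['mixed'].astype('str')

print(mixed_type_columns(toy))   # nothing is reported after the cast
```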

In [17]:
# 4. If you find mixed-type data, fix it. The column in question should contain observations of a single data type.

I did not find any columns with mixed-type data, so no fix was needed.

In [20]:
# 5. Run a check for missing values in your df_ords dataframe.
#       In a markdown cell, report your findings and propose an explanation for any missing values you find.

df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The column days_since_prior_order has 206,209 NULL values. This makes sense because there are 206,209 unique user_id values, and each user's first order (order_number == 1) has a NULL value for days_since_prior_order because there is no earlier order to measure from.

The NULL values should not be imputed with a mean or median, and these observations should not be removed from the data. Instead, a new flag variable can be created to mark each order that is a user's first order.
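
The explanation above can be tested rather than assumed. The sketch below, on a made-up two-user frame (`sample`, a hypothetical stand-in for df_ords), checks that the missing values line up exactly with first orders and that their count equals the number of distinct users.

```python
import pandas as pd

# Hypothetical stand-in for df_ords with two users.
sample = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_prior_order': [None, 15.0, 21.0, None, 7.0],
})

missing = sample['days_since_prior_order'].isnull()

# Every missing value should belong to a first order, and vice versa.
nulls_match_first_orders = missing.equals(sample['order_number'] == 1)

# The number of missing values should equal the number of distinct users.
nulls_match_user_count = missing.sum() == sample['user_id'].nunique()

print(nulls_match_first_orders, nulls_match_user_count)
```

If either check came back False on the real data, the missing values would need a different explanation.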

In [24]:
# 6. Address the missing values using an appropriate method.
#      In a markdown cell, explain why you used your method of choice.

In [21]:
# Creating a new column 'first_order' with a boolean datatype to designate whether or not this is the first order of a user

df_ords['first_order'] = pd.isna(df_ords['days_since_prior_order'])

In [22]:
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False


In [23]:
df_ords[df_ords['days_since_prior_order'].isnull()]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
11,2168274,2,1,2,11,,True
26,1374495,3,1,1,14,,True
39,3343014,4,1,6,11,,True
45,2717275,5,1,3,12,,True
...,...,...,...,...,...,...,...
3420930,969311,206205,1,4,12,,True
3420934,3189322,206206,1,3,18,,True
3421002,2166133,206207,1,6,19,,True
3421019,2227043,206208,1,1,15,,True


I created a new column, 'first_order', as a flag for each user's first order: True means the order is the user's first, and False means it is not. These rows should not be removed, filtered out, or imputed with another value. The NULL values are expected, and the 'first_order' column documents why they appear in days_since_prior_order.
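
Because the flag is a boolean column, it can be used directly to separate first orders from repeat orders. A minimal sketch on a made-up frame (`sample`, hypothetical values):

```python
import pandas as pd

# Hypothetical stand-in for df_ords; the values are invented for illustration.
sample = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'days_since_prior_order': [None, 15.0, 21.0, None, 7.0],
})

# Same flag as in the notebook: True where there is no prior order.
sample['first_order'] = sample['days_since_prior_order'].isna()

# Keep only repeat orders by negating the flag.
repeat_orders = sample[~sample['first_order']]

# Average gap between orders, computed on repeat orders only.
mean_gap = repeat_orders['days_since_prior_order'].mean()

print(sample['first_order'].sum(), len(repeat_orders), mean_gap)
```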

In [26]:
# 7. Run a check for duplicate values in your df_ords data.
#    In a markdown cell, report your findings and propose an explanation for any duplicate values you find.

df_ords_dups = df_ords[df_ords.duplicated()]

In [27]:
df_ords_dups

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order


There are no rows with duplicated values.
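
For comparison, the sketch below shows what the check returns when duplicates do exist, using a made-up frame (`toy`). It also checks for repeated order_id values on their own, which is an additional check not performed in this notebook: two rows can share an order_id without being fully identical.

```python
import pandas as pd

# Hypothetical frame: row 2 repeats row 1 exactly; row 4 reuses order_id 12.
toy = pd.DataFrame({
    'order_id': [10, 11, 11, 12, 12],
    'user_id': [1, 1, 1, 2, 3],
})

# Rows identical to an earlier row in every column.
full_dups = toy[toy.duplicated()]

# Rows whose order_id already appeared, regardless of the other columns.
id_dups = toy[toy.duplicated(subset=['order_id'])]

# Dropping full duplicates keeps the first occurrence of each row.
deduped = toy.drop_duplicates()

print(len(full_dups), len(id_dups), len(deduped))
```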

In [28]:
# 8. Address the duplicates using an appropriate method.
#     In a markdown cell, explain why you used your method of choice.

I did not find any duplicates, so I did not make any changes.

In [29]:
df_ords

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,first_order
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False
...,...,...,...,...,...,...,...
3421078,2266710,206209,10,5,18,29.0,False
3421079,1854736,206209,11,4,10,30.0,False
3421080,626363,206209,12,1,12,18.0,False
3421081,2977660,206209,13,1,12,7.0,False


# Export data

In [30]:
# 9. Export df_ords as orders_wrangled_checked.csv

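# index=False keeps the row index out of the file, so no 'Unnamed: 0' column appears on re-import
df_ords.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled_checked.csv'), index=False)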
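
The 'Unnamed: 0' column dropped near the top of this notebook is what pandas produces when a CSV that was written with its row index is read back in. The sketch below reproduces this on a made-up frame (`toy`), writing to an in-memory string instead of a file, and shows that passing index=False to to_csv avoids the extra column.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for df_ords.
toy = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 2]})

# With no path, to_csv returns the CSV text; by default it includes the index.
csv_with_index = toy.to_csv()
reread_with_index = pd.read_csv(io.StringIO(csv_with_index))

# index=False leaves the row index out of the CSV text.
csv_without_index = toy.to_csv(index=False)
reread_without_index = pd.read_csv(io.StringIO(csv_without_index))

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```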