# Data Consistency Checks

### This script contains the following points:
1. Importing libraries
2. Importing data sets
3. Checking for mixed data types, lesson follow-along
4. Checking for missing data, lesson follow-along
5. Checking for duplicate records, lesson follow-along
6. Data consistency checks task 4.5

### 1. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

### 2. Importing Data

In [2]:
path = r'C:\Users\keely\Documents\Courses\CareerFoundry\Immersion\Achievement 4 - Python\01-2023 Instacart Basket Analysis'

In [3]:
vars_list = ['order_id', 'user_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day', 'days_since_prior_order']

In [4]:
df_ords = pd.read_pickle(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.pkl'))

# df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), usecols = vars_list, index_col = False)

In [5]:
df_ords.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [6]:
df_ords.tail(5)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
3421082,272231,206209,14,6,14,30.0


In [7]:
df_ords.shape

(3421083, 6)

In [11]:
df_ords.dtypes

order_id                   object
user_id                    object
order_number                int16
orders_day_of_week           int8
order_hour_of_day            int8
days_since_prior_order    float16
dtype: object

In [8]:
# Importing products dataframe.

df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [9]:
df_prods.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
0,1,Chocolate Sandwich Cookies,61,19,5.8
1,2,All-Seasons Salt,104,13,9.3
2,3,Robust Golden Unsweetened Oolong Tea,94,7,4.5
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1,10.5
4,5,Green Chile Anytime Sauce,5,13,4.3


In [10]:
df_prods.shape

(49693, 5)

In [11]:
df_prods.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

### 3. Checking for Mixed Data Types

In [10]:
# Mixed Type Data: Making a test dataframe to clean up mixed data types.

# Create a dataframe.
df_test = pd.DataFrame()

# Create a column in that dataframe with mixed data types.
df_test['mix'] = ['a', 'b', 1, True]

In [11]:
df_test.head()

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [12]:
# A loop that checks each column of a dataframe for mixed data types.
# Note: applymap() is deprecated in newer pandas releases in favor of DataFrame.map().

for col in df_test.columns.tolist():
  weird = (df_test[[col]].applymap(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_test[weird]) > 0:
    print(col)

mix
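
The same check can be wrapped in a reusable function. The version below is a minimal sketch, not the lesson's method: it counts the distinct Python types in each column with `Series.map(type)`, which avoids `applymap()`. The name `find_mixed_columns` and the `same` control column are hypothetical. Note that a missing value in a text column counts as a float, so such a column is reported as mixed.

```python
import pandas as pd

def find_mixed_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Map every value to its type, then count the distinct types.
        if df[col].map(type).nunique() > 1:
            mixed.append(col)
    return mixed

# Small illustrative frame; 'same' is a hypothetical control column.
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True],
                        'same': ['w', 'x', 'y', 'z']})
mixed_cols = find_mixed_columns(df_demo)
print(mixed_cols)
```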


In [13]:
# Converting the mixed-type column in df_test to a string:

df_test['mix'] = df_test['mix'].astype('str')

# Sometimes you need to go in the opposite direction, from string to numeric.
# To do that, replace 'str' in astype() with 'int64' or whichever numeric data
# type you want to use.
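
Going from string to numeric only works with `astype()` when every value parses as a number. As a rough sketch on hypothetical data, `pd.to_numeric` with `errors='coerce'` is one way to handle columns that may contain unparseable entries, turning them into NaN instead of raising an error.

```python
import pandas as pd

# Every value parses cleanly, so astype() is enough.
clean = pd.Series(['1', '2']).astype('int64')

# 'three' cannot be parsed; errors='coerce' turns it into NaN instead of raising.
messy = pd.Series(['1', '2', 'three'])
nums = pd.to_numeric(messy, errors='coerce')

print(clean.sum())
print(nums.isnull().sum())
```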

### 4. Checking for Missing Data

In [18]:
# How to find missing data in the df_prods dataframe:

df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [19]:
# Viewing the 16 rows with a missing value in the product_name column:

df_nan = df_prods[df_prods['product_name'].isnull()]
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [14]:
"""
Address missing values:

Three ways of dealing with missing data:
1. Create a new variable that flags the missing value (for example, delinquent vs. not delinquent).
2. Impute the value with the mean or median of the column (if the variable is numeric).
3. Remove or filter out the missing data.

Imputing:
Use the mean (find it with describe() first): df['column with missings'] = df['column with missings'].fillna(mean value)

Use the median (find it with median() first): df['column with missings'] = df['column with missings'].fillna(median value)

"""


In [20]:
# Finding number of rows and columns in original products dataframe:

df_prods.shape

(49693, 5)

In [21]:
# Creating a new dataframe that keeps only the records with non-missing product
# names.

df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [22]:
df_prods_clean.shape

(49677, 5)
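
The three approaches to missing data listed earlier (flag, impute, filter) can be sketched side by side. The frame below is hypothetical toy data, not taken from products.csv, and the column names `price_missing` and `price_filled` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from products.csv.
df_demo = pd.DataFrame({'name': ['tea', None, 'salt', 'rice'],
                        'price': [4.5, 9.3, np.nan, 2.0]})

# 1. Flag: record which rows are missing a value.
df_demo['price_missing'] = df_demo['price'].isnull()

# 2. Impute: fill a numeric column with its median (assigned back, not inplace).
median_price = df_demo['price'].median()
df_demo['price_filled'] = df_demo['price'].fillna(median_price)

# 3. Filter: keep only the rows where the name is present.
df_named = df_demo[df_demo['name'].notnull()]

print(df_named.shape)
```

The filter step leaves `df_demo` untouched and produces a new dataframe, which mirrors how df_prods_clean was created above.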

In [23]:
# You can also drop nulls with dropna(). The first call drops every record that has a missing value
# in any column. The second drops only the records with a missing product name.

# 1. df_prods.dropna(inplace = True)
# 2. df_prods.dropna(subset = ['product_name'], inplace = True)

# If you don't want to change the original dataframe, assign the result to a new dataframe instead.
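
The difference between the two `dropna()` calls is easiest to see on a small frame. This is a sketch on hypothetical data in which one row is missing a product name and a different row is missing a price.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 lacks a name, row 2 lacks a price.
df_demo = pd.DataFrame({'product_name': ['tea', None, 'salt'],
                        'prices': [4.5, 9.3, np.nan]})

# Drops a row if ANY column is missing.
any_missing_dropped = df_demo.dropna()

# Drops a row only if product_name is missing.
name_missing_dropped = df_demo.dropna(subset=['product_name'])

print(any_missing_dropped.shape, name_missing_dropped.shape)
```

Both calls return new dataframes here because `inplace` is not set, so `df_demo` keeps all three rows.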

### 5. Checking for Duplicates

In [24]:
# Code that finds duplicates:

df_dups = df_prods_clean[df_prods_clean.duplicated()]

df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [25]:
# Addressing the duplicated rows:

# First, we will get a count of rows and columns before and after removing duplicates.

df_prods_clean.shape

(49677, 5)

In [26]:
# Now, we create another df with no duplicates, then count the rows and columns of this duplicate-free dataframe:

df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

df_prods_clean_no_dups.shape

(49672, 5)
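
The row count fell from 49677 to 49672, which matches the five duplicate rows listed above. A minimal sketch of the same before-and-after comparison on hypothetical data, which also shows how `keep=False` displays every copy of a duplicated row rather than only the later ones:

```python
import pandas as pd

# Hypothetical data with one fully duplicated row.
df_demo = pd.DataFrame({'product_id': [1, 2, 2, 3],
                        'product_name': ['tea', 'salt', 'salt', 'rice']})

# duplicated() marks only later copies by default; keep=False marks every copy.
later_copies = df_demo[df_demo.duplicated()]
all_copies = df_demo[df_demo.duplicated(keep=False)]

# Compare row counts before and after removing duplicates.
rows_before = df_demo.shape[0]
df_no_dups = df_demo.drop_duplicates()
rows_removed = rows_before - df_no_dups.shape[0]

print(len(later_copies), len(all_copies), rows_removed)
```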

In [28]:
df_prods_clean_no_dups.dtypes

product_id         int64
product_name      object
aisle_id           int64
department_id      int64
prices           float64
dtype: object

In [29]:
# Exporting the products dataframe free of missing values and duplicates. Remember, df_prods_clean has records with missing
# product names removed, while the final df_prods_clean_no_dups has both missing product names AND duplicates removed.

df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))
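
One thing to be aware of when exporting with `to_csv()`: by default the index is written as an extra, unnamed first column, which comes back as `Unnamed: 0` when the file is read again. The sketch below uses an in-memory buffer and hypothetical data to show the effect of `index=False`. Whether to keep the index is a judgment call that depends on how the file is read later.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'product_id': [1, 2], 'prices': [5.8, 9.3]})

# Default: the index is written as an unnamed first column.
buf = io.StringIO()
df_demo.to_csv(buf)
with_index = pd.read_csv(io.StringIO(buf.getvalue()))

# index=False leaves the index out of the file.
buf = io.StringIO()
df_demo.to_csv(buf, index=False)
without_index = pd.read_csv(io.StringIO(buf.getvalue()))

print(list(with_index.columns))
print(list(without_index.columns))
```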

### 6. Task 4.5 Data Consistency Checks

In [30]:
# 2) Run the df.describe() function on your df_ords dataframe. Using your new knowledge about how to interpret 
# the output of this function, share in a markdown cell whether anything about the data looks off or 
# should be investigated further.

df_ords.describe()



Unnamed: 0,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3214874.0
mean,17.15486,2.776219,13.45202,
std,17.73316,2.046829,4.226088,0.0
min,1.0,0.0,0.0,0.0
25%,5.0,1.0,10.0,4.0
50%,11.0,3.0,13.0,7.0
75%,23.0,5.0,16.0,15.0
max,100.0,6.0,23.0,30.0


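##### The max order number seems low at only 100, and the mean (about 17) and median (11) are low as well. days_since_prior_order has 3,214,874 non-null values against 3,421,083 for every other column, so 206,209 records are missing for this variable. Its mean is blank and its standard deviation is 0.0 even though its quartiles range from 4 to 15, which should be investigated and may be an artifact of the float16 data type.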
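
In the describe() output above, the mean of days_since_prior_order is blank and its standard deviation is 0.0. One possible cause, offered as an assumption rather than a confirmed diagnosis, is the float16 dtype of that column: the largest finite float16 value is 65504, far smaller than a sum over millions of rows. The sketch below uses a small hypothetical sample and shows one workaround, casting to float64 before summarizing.

```python
import numpy as np
import pandas as pd

# The largest finite float16 value; sums over millions of rows exceed it.
float16_max = float(np.finfo(np.float16).max)

# Hypothetical small sample standing in for days_since_prior_order.
days = pd.Series([np.nan, 15.0, 21.0, 29.0, 28.0], dtype='float16')

# Casting up before summarizing keeps the arithmetic in a wider dtype.
mean_days = days.astype('float64').mean()

print(float16_max, mean_days)
```

On the real data, the equivalent would be something like `df_ords['days_since_prior_order'].astype('float64').describe()`.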

In [31]:
# 3) Check for mixed-type data in your df_ords dataframe.

for col in df_ords.columns.tolist():
  weird = (df_ords[[col]].applymap(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
  if len(df_ords[weird]) > 0:
    print(col, ' mixed')
  else:
    print(col, ' consistent')

order_id  consistent
user_id  consistent
order_number  consistent
orders_day_of_week  consistent
order_hour_of_day  consistent
days_since_prior_order  consistent


#### 4) There do not appear to be any columns with mixed data types in df_ords.

In [33]:
# 5) Run a check for missing values in df_ords.

df_ords.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

In [34]:
df_ords_nan = df_ords[df_ords['days_since_prior_order'].isnull()]
df_ords_nan.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
11,2168274,2,1,2,11,
26,1374495,3,1,1,14,
39,3343014,4,1,6,11,
45,2717275,5,1,3,12,


In [36]:
df_ords_nan.tail()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
3420930,969311,206205,1,4,12,
3420934,3189322,206206,1,3,18,
3421002,2166133,206207,1,6,19,
3421019,2227043,206208,1,1,15,
3421069,3154581,206209,1,3,11,


In [37]:
# Checking that the NaN values above are limited to each customer's first order: for order number 2
# below, no NaN values are present in days_since_prior_order.

df_ords[df_ords['order_number'] == 2]

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
1,2398795,1,2,3,7,15.0
12,1501582,2,2,5,10,10.0
27,444309,3,2,3,19,9.0
40,2030307,4,2,4,11,19.0
46,1909121,5,2,0,16,11.0
...,...,...,...,...,...,...
3420931,2658896,206205,2,2,15,30.0
3420935,3351137,206206,2,6,17,3.0
3421003,1074448,206207,2,0,10,1.0
3421020,1959749,206208,2,2,14,8.0
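
Scanning the table only covers the rows that are displayed. A stricter check, sketched here on a hypothetical toy frame that reuses the df_ords column names, compares the "is first order" mask with the "is missing" mask across every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as df_ords.
df_demo = pd.DataFrame({'order_number': [1, 2, 3, 1, 2],
                        'days_since_prior_order': [np.nan, 15.0, 21.0, np.nan, 10.0]})

is_first_order = df_demo['order_number'] == 1
is_missing = df_demo['days_since_prior_order'].isnull()

# True only if every first order is missing and every later order is not.
nan_only_on_first_order = bool((is_first_order == is_missing).all())

print(nan_only_on_first_order)
```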


In [38]:
# Copy so that changes to df_ords_clean do not also modify df_ords.
df_ords_clean = df_ords.copy()
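
One detail worth knowing when creating a working dataframe: plain assignment binds a second name to the same object, while `.copy()` creates an independent dataframe. A minimal sketch with hypothetical names:

```python
import pandas as pd

df_a = pd.DataFrame({'order_id': [1, 2]})

alias = df_a               # same object under a second name
independent = df_a.copy()  # separate dataframe

alias['flag'] = True       # also shows up in df_a
independent['other'] = 0   # does not

print(list(df_a.columns))
```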

In [39]:
df_ords_clean.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
0,2539329,1,1,2,8,
1,2398795,1,2,3,7,15.0
2,473747,1,3,3,12,21.0
3,2254736,1,4,4,7,29.0
4,431534,1,5,4,15,28.0


In [40]:
df_ords_clean.tail(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order
3421073,2307371,206209,5,4,15,3.0
3421074,3186442,206209,6,0,16,3.0
3421075,550836,206209,7,2,13,9.0
3421076,2129269,206209,8,3,17,22.0
3421077,2558525,206209,9,4,15,22.0
3421078,2266710,206209,10,5,18,29.0
3421079,1854736,206209,11,4,10,30.0
3421080,626363,206209,12,1,12,18.0
3421081,2977660,206209,13,1,12,7.0
3421082,272231,206209,14,6,14,30.0


#### Explanation of Missing Values: The missing values likely arise because days_since_prior_order has nothing to record when a new customer makes their first purchase. The rows inspected above all have order_number 1.

In [41]:
# Created a new column indicating whether the customer was new or established at the time of each transaction.
df_ords_clean['new_customer'] = df_ords['days_since_prior_order'].isnull()

In [42]:
df_ords_clean.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer
0,2539329,1,1,2,8,,True
1,2398795,1,2,3,7,15.0,False
2,473747,1,3,3,12,21.0,False
3,2254736,1,4,4,7,29.0,False
4,431534,1,5,4,15,28.0,False


#### The nulls in days_since_prior_order were left as nulls. Turning them into 0, as I originally did, would make a first order indistinguishable from a second order placed on the same day (less than 24 hours after the first), so a returning customer could be mistaken for a new one. Instead, a new column was added to distinguish new customers from established customers.
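
The reasoning can be illustrated on a hypothetical three-order example in which the second order is placed on the same day as the first. Filling the nulls with 0 makes the first two rows look the same in this column, while the flag column keeps them apart.

```python
import numpy as np
import pandas as pd

# Hypothetical orders: a first order (NaN) and a same-day reorder (0.0).
df_demo = pd.DataFrame({'order_number': [1, 2, 3],
                        'days_since_prior_order': [np.nan, 0.0, 7.0]})

# A flag built from the nulls keeps the two cases apart.
df_demo['new_customer'] = df_demo['days_since_prior_order'].isnull()

# Filling with 0 would make rows 0 and 1 identical in this column.
filled = df_demo['days_since_prior_order'].fillna(0)
zero_rows = int((filled == 0).sum())

print(df_demo['new_customer'].tolist(), zero_rows)
```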

In [43]:
# 7) Run a check for duplicate values in your df_ords data. In a markdown cell, report your findings 
# and propose an explanation for any duplicate values you find.

df_ords_dup = df_ords_clean[df_ords_clean.duplicated()]

df_ords_dup

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,new_customer


In [44]:
df_ords_clean.shape

(3421083, 7)

#### 8) There were no duplicates found in df_ords_clean. If there were, we would remove them with the drop_duplicates() method and compare the number of rows before and after using the shape attribute.

In [45]:
# 9) Export final cleaned orders dataframe.

df_ords_clean.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.pkl'))

# df_ords_clean.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))