# 00. Contents

### 01. Importing libraries
### 02. Importing data frames
### 03. Data consistency checks
#### Working with mixed data types
#### Missing values
#### Duplicates
### 04. The task
#### 2. Look for inconsistencies
#### 3. Check for mixed-type data
#### 5. Missing values
#### 6. Addressing missing values
#### 7. Duplicate data

# 01. Importing libraries

In [11]:
# Import libraries
import pandas as pd
import numpy as np
import os

# 02. Importing data frames

In [15]:
path = r'/Users/agne/Documents/Studies/Data Analysis/Study Materials/Python/Instacart Basket Analysis 2020 11'

In [19]:
df_ords = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'orders.csv'), index_col = False)

In [17]:
df_prods = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'products.csv'), index_col = False)

In [21]:
df_ords_wrangled = pd.read_csv(os.path.join(path, '02 Data', 'Prepared Data', 'orders_wrangled.csv'), index_col = False)

# 03. Data consistency checks

## Working with mixed data types

In [23]:
# Create a test dataframe
df_test = pd.DataFrame()

In [25]:
# Create a mixed type column
df_test['mix'] = ['a','b', 1, True]

In [27]:
df_test

Unnamed: 0,mix
0,a
1,b
2,1
3,True


In [27]:
# Check for mixed types: flag any column where a value's type differs from the type in the first row

for col in df_test.columns.tolist():
    weird = (df_test[[col]].map(type) != df_test[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_test[weird]) > 0:
        print(col)

mix


In [30]:
# Convert the mixed-type column to strings
df_test['mix'] = df_test['mix'].astype('str')
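
The mixed-type check above can also be wrapped in a small reusable function. The following is a minimal sketch, not the notebook's own method: the helper name `find_mixed_type_columns` and the toy frame `df_demo` are hypothetical, and it counts distinct Python types per column instead of comparing against the first row.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return the names of columns that hold more than one Python type."""
    mixed = []
    for col in df.columns:
        # Count the distinct Python types among the column's values
        n_types = df[col].map(type).nunique()
        if n_types > 1:
            mixed.append(col)
    return mixed


# Hypothetical toy frame mirroring df_test, plus one single-type column
df_demo = pd.DataFrame({'mix': ['a', 'b', 1, True], 'clean': [1, 2, 3, 4]})
mixed_cols = find_mixed_type_columns(df_demo)
print(mixed_cols)
```

Note that a text column containing missing values would also be flagged, because NaN is a float.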

## Missing values

In [33]:
df_prods.isnull().sum()

product_id        0
product_name     16
aisle_id          0
department_id     0
prices            0
dtype: int64

In [39]:
df_nan = df_prods[df_prods['product_name'].isnull()]

In [41]:
df_nan

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
33,34,,121,14,12.2
68,69,,26,7,11.8
115,116,,93,3,10.8
261,262,,110,13,12.1
525,525,,109,11,1.2
1511,1511,,84,16,14.3
1780,1780,,126,11,12.3
2240,2240,,52,1,14.2
2586,2586,,104,13,12.4
3159,3159,,126,11,13.1


In [45]:
df_prods.shape

(49693, 5)

In [47]:
df_prods_clean = df_prods[df_prods['product_name'].notnull()]

In [49]:
df_prods_clean.shape

(49677, 5)
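
The two shapes differ by 49693 - 49677 = 16 rows, which matches the 16 missing product names counted earlier. As a sketch of how that sanity check could be automated, the example below uses a hypothetical toy frame (`df_demo`, with made-up product names) and `dropna(subset=...)` as an equivalent way to drop the rows.

```python
import pandas as pd

# Hypothetical toy frame; the real df_prods has 49,693 rows
df_demo = pd.DataFrame({
    'product_id': [1, 2, 3],
    'product_name': ['item a', None, 'item c'],
})

# Count the missing names before dropping anything
n_missing = int(df_demo['product_name'].isnull().sum())

# Drop only the rows where product_name is missing
df_demo_clean = df_demo.dropna(subset=['product_name'])

# The number of rows removed should equal the number of missing names
rows_dropped = len(df_demo) - len(df_demo_clean)
assert rows_dropped == n_missing
print(rows_dropped)
```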

## Duplicates

In [52]:
df_dups = df_prods_clean[df_prods_clean.duplicated()]

In [54]:
df_dups

Unnamed: 0,product_id,product_name,aisle_id,department_id,prices
462,462,Fiber 4g Gummy Dietary Supplement,70,11,4.8
18459,18458,Ranger IPA,27,5,9.2
26810,26808,Black House Coffee Roasty Stout Beer,27,5,13.4
35309,35306,Gluten Free Organic Peanut Butter & Chocolate ...,121,14,6.8
35495,35491,Adore Forever Body Wash,127,11,9.9


In [56]:
df_prods_clean.shape

(49677, 5)

In [58]:
df_prods_clean_no_dups = df_prods_clean.drop_duplicates()

In [60]:
df_prods_clean_no_dups.shape

(49672, 5)
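
Here the row count drops by 49677 - 49672 = 5, one for each row listed in `df_dups`. By default `duplicated()` marks only the second and later copies of a row. To inspect every copy side by side before dropping, `keep=False` can be passed instead. A minimal sketch on a hypothetical toy frame (`df_demo`, with made-up values):

```python
import pandas as pd

# Hypothetical toy frame with one fully duplicated row
df_demo = pd.DataFrame({
    'product_id': [1, 2, 2, 3],
    'product_name': ['item a', 'item b', 'item b', 'item c'],
})

# Default: only the repeated occurrence is flagged
n_dups = int(df_demo.duplicated().sum())

# keep=False flags every copy, including the first occurrence
all_copies = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row
df_demo_no_dups = df_demo.drop_duplicates()

# Rows removed should equal the number of flagged duplicates
assert len(df_demo) - len(df_demo_no_dups) == n_dups
print(all_copies)
```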

In [63]:
df_prods_clean_no_dups.to_csv(os.path.join(path, '02 Data','Prepared Data', 'products_checked.csv'))

# 04. The task

## 2. Look for inconsistencies

In [29]:
df_ords.describe()

Unnamed: 0,order_id,user_id,order_number,order_dow,order_hour_of_day,days_since_prior_order
count,3421083.0,3421083.0,3421083.0,3421083.0,3421083.0,3214874.0
mean,1710542.0,102978.2,17.15486,2.776219,13.45202,11.11484
std,987581.7,59533.72,17.73316,2.046829,4.226088,9.206737
min,1.0,1.0,1.0,0.0,0.0,0.0
25%,855271.5,51394.0,5.0,1.0,10.0,4.0
50%,1710542.0,102689.0,11.0,3.0,13.0,7.0
75%,2565812.0,154385.0,23.0,5.0,16.0,15.0
max,3421083.0,206209.0,100.0,6.0,23.0,30.0


All values look valid and fall within the expected ranges: order_dow runs from 0 to 6, order_hour_of_day from 0 to 23, and days_since_prior_order from 0 to 30.
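
Reading the `describe()` output by eye works for a handful of columns; the same check could also be expressed in code. The sketch below is illustrative only: the helper `out_of_range` is hypothetical, the bounds are assumptions inferred from the column names and the minimum and maximum values above, and `df_demo` holds a few sample values instead of the full `df_ords`.

```python
import pandas as pd

# Assumed valid ranges (inclusive), inferred from the describe() output
expected_bounds = {
    'order_dow': (0, 6),
    'order_hour_of_day': (0, 23),
}

# Small sample frame standing in for df_ords
df_demo = pd.DataFrame({
    'order_dow': [2, 3, 3, 4, 4],
    'order_hour_of_day': [8, 7, 12, 7, 15],
})


def out_of_range(df, bounds):
    """Return {column: count of non-null values outside [low, high]}."""
    result = {}
    for col, (low, high) in bounds.items():
        values = df[col].dropna()
        # between() is inclusive on both ends by default
        result[col] = int((~values.between(low, high)).sum())
    return result


violations = out_of_range(df_demo, expected_bounds)
print(violations)
```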

## 3. Check for mixed-type data

In [31]:
# Check for mixed types: flag any column where a value's type differs from the type in the first row

for col in df_ords.columns.tolist():
    weird = (df_ords[[col]].map(type) != df_ords[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_ords[weird]) > 0:
        print(col)

The loop prints nothing, so no mixed-type data is present.

## 5. Missing values

In [32]:
df_ords.isnull().sum()

order_id                       0
user_id                        0
eval_set                       0
order_number                   0
order_dow                      0
order_hour_of_day              0
days_since_prior_order    206209
dtype: int64

The days_since_prior_order column has 206,209 null values. This is expected: the column records how many days passed since the user's previous order, so a customer's first order has no prior order to measure from and the value is null. The null count equals the maximum user_id in the describe() output (206,209), which is consistent with one null per user. Same-day reorders do not appear to be the cause, since the column's minimum value is 0. There is therefore no reason to remove or replace these values.
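
One way to test the first-order explanation is to check whether nulls occur exactly where order_number is 1. The following is a minimal sketch under that assumption; `df_demo` is a small stand-in for `df_ords` using five sample orders from a single user.

```python
import numpy as np
import pandas as pd

# Small stand-in for df_ords: one user's first five orders
df_demo = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1],
    'order_number': [1, 2, 3, 4, 5],
    'days_since_prior_order': [np.nan, 15.0, 21.0, 29.0, 28.0],
})

# Rows where the gap is missing
is_null = df_demo['days_since_prior_order'].isnull()

# Rows that are a user's first order
is_first = df_demo['order_number'] == 1

# True only if nulls appear on first orders and nowhere else
nulls_are_first_orders = bool((is_null == is_first).all())
print(nulls_are_first_orders)
```

If the same comparison held on the full `df_ords`, it would confirm that the nulls are structural and not data-entry gaps.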

In [35]:
df_ords.shape

(3421083, 7)

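## 6. Addressing missing values

No action is taken: the null values in days_since_prior_order are left in place, for the reason given in the previous section.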

## 7. Duplicate data

In [37]:
df_ords_dups = df_ords[df_ords.duplicated()]

In [39]:
df_ords_dups

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order


No duplicates found.

In [41]:
df_ords.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [95]:
df_ords.to_csv(os.path.join(path, '02 Data','Prepared Data', 'orders_checked.csv'))