# 4.09 Intro to Data Visualization with Python PART 1

## 01 Import your analysis libraries and customer data

## 03 Complete the fundamental data quality and consistency checks
#### a) missing values
#### b) duplicates
#### c) data types, including mixed types

## 02 Wrangle the data so that it follows consistent logic
#### a) rename columns
#### b) drop columns

## 04 Combine your customer data with the rest of your prepared Instacart data

## 05 Export the merged dataframe as a pickle file
#### ____________________________________________________________________________________

## 01 Import your analysis libraries and customer data

In [160]:
#import libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

# import dataframe

pathData = r'C:\Users\Michael\Desktop\Career Foundry\02 Data Immersion Course\04 Python Fundamentals for Data Analysts\Instacart Basket Analysis 2023 11'
df_cust = pd.read_csv(os.path.join(pathData, '02 Data', 'Original Data', 'customers.csv'))

## 03 Complete the fundamental data quality and consistency checks

### Initial checks on shape and contents

In [161]:
df_cust.shape

(206209, 10)

### a) missing values

In [163]:
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 10 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       206209 non-null  int64 
 1   First Name    194950 non-null  object
 2   Surnam        206209 non-null  object
 3   Gender        206209 non-null  object
 4   STATE         206209 non-null  object
 5   Age           206209 non-null  int64 
 6   date_joined   206209 non-null  object
 7   n_dependants  206209 non-null  int64 
 8   fam_status    206209 non-null  object
 9   income        206209 non-null  int64 
dtypes: int64(4), object(6)
memory usage: 15.7+ MB


df_cust.info() shows that First Name has 11,259 null values (194,950 non-null out of 206,209 rows).

This is acceptable, because the name columns are dropped later for customer privacy.
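
As a complement to .info(), the null count per column can be read off directly. The sketch below uses a small made-up table (`toy_cust` and its values are illustrative assumptions, not the real customer data) and keeps only the columns that contain nulls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer table (values are made up for illustration)
toy_cust = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'First Name': ['Ann', np.nan, 'Chris', np.nan],
    'Surnam': ['Gilmore', 'Hart', 'Walton', 'Vang'],
})

# Count nulls per column
null_counts = toy_cust.isnull().sum()

# Keep only the columns that actually contain nulls
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```

On the real dataframe the same two lines would be applied to df_cust.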

In [164]:
# user_id is an identifier, not a quantity for calculations, so store it as a string (object dtype)
df_cust['user_id'] = df_cust['user_id'].astype('str')

### b) duplicates

In [165]:
# Check for duplicate customers on a subset of columns
# user_id and date_joined are excluded in case a user accidentally created two accounts on separate days
# (Age and fam_status are also not part of the subset)
df_cust[df_cust.duplicated(subset =
                           ['First Name',
                            'Surnam',
                            'Gender',
                            'STATE',
                            'n_dependants',
                            'income'
                           ]
                          )]

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income


No duplicated rows found.
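
The difference between a full-row check and the subset check used above can be seen on a toy example. The table below is invented for illustration: the same person appears under two user IDs with different join dates.

```python
import pandas as pd

# Made-up rows: the first two describe the same person with two accounts
toy_cust = pd.DataFrame({
    'user_id': ['1', '2', '3'],
    'First Name': ['Ann', 'Ann', 'Chris'],
    'Surnam': ['Gilmore', 'Gilmore', 'Walton'],
    'income': [40374, 40374, 61746],
    'date_joined': ['1/1/2017', '1/5/2017', '1/1/2017'],
})

# Full-row check: finds nothing, because user_id and date_joined differ
full_dups = toy_cust[toy_cust.duplicated()]

# Subset check: row 1 repeats row 0 on the person-level columns
subset_dups = toy_cust[toy_cust.duplicated(subset=['First Name', 'Surnam', 'income'])]

print(len(full_dups), subset_dups.index.tolist())
```

By default duplicated() marks only the later occurrences, so the first account is kept and the second is flagged.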

### c) data types, including mixed types

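From the .info() output above, every column has a single dtype. Note that an object column can still hold values of more than one Python type, which .info() does not reveal.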
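
One possible way to look for mixed types inside a column is to count the distinct Python types of its values. This is a minimal sketch on a made-up dataframe (`toy` and its column names are assumptions for illustration), not the check that was run in this notebook.

```python
import pandas as pd

# Made-up frame: 'mixed' deliberately holds a str, an int and a float
toy = pd.DataFrame({
    'clean': ['a', 'b', 'c'],
    'mixed': ['a', 1, 2.5],
    'numbers': [1, 2, 3],
})

# For each column, count how many distinct Python types its values have
type_counts = {col: toy[col].map(type).nunique() for col in toy.columns}

# Columns with more than one value type are mixed
mixed_cols = [col for col, n in type_counts.items() if n > 1]
print(mixed_cols)
```

A text column that contains nulls would also be flagged by this check, because NaN is a float; for First Name that is expected rather than a problem.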

## 02 Wrangle the data so that it follows consistent logic

In [166]:
# Check the headers and data content of the first rows
df_cust.head(20)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


#### a) rename columns

In [167]:
# Some column names are unclear or inconsistently cased, so rename them
df_cust.rename(columns = {'STATE':'customer_state',
                          'income':'customer_income',
                          'Age':'customer_age',
                          'n_dependants':'customer_dependants'
                         },
               inplace = True)

#### b) drop columns

In [168]:
# Drop customer name details for privacy
df_cust = df_cust.drop(columns = ['First Name', 'Surnam'])

In [169]:
# check shape after changes
df_cust.shape

(206209, 8)
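
The rename and drop steps can also be written without inplace = True, as one chained expression that leaves the source dataframe untouched. The sketch below is an alternative style on a made-up table, not a change to the cells above; `rename_map` and `name_cols` are hypothetical helper names.

```python
import pandas as pd

# Made-up rows with the same kind of headers as the customer file
toy_cust = pd.DataFrame({
    'user_id': ['1', '2'],
    'First Name': ['Ann', 'Chris'],
    'Surnam': ['Gilmore', 'Walton'],
    'STATE': ['Maryland', 'Montana'],
    'Age': [26, 20],
})

rename_map = {'STATE': 'customer_state', 'Age': 'customer_age'}
name_cols = ['First Name', 'Surnam']

# rename() and drop() each return a new dataframe, so they can be chained
cleaned = toy_cust.rename(columns=rename_map).drop(columns=name_cols)

print(cleaned.columns.tolist())
print(toy_cust.shape)  # the original frame is unchanged
```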

In [170]:
#check the dataframe with changes
df_cust.head(20)

Unnamed: 0,user_id,Gender,customer_state,customer_age,date_joined,customer_dependants,fam_status,customer_income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Male,Virginia,26,1/1/2017,2,married,32072


## 04 Combine your customer data with the rest of your prepared Instacart data

In [171]:
# import the most recent prepared dataframe
ords_prods_merged = pd.read_pickle(os.path.join(pathData, '02 Data', 'Prepared Data', 'df_ords_prods_merged_updated02.pkl'))

In [172]:
# find suitable column(s) for merging on
ords_prods_merged.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,prices,_merge,price_range_loc,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_spend,order_frequency,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,both,Mid-range product,Regularly busy days,Average orders,10,New customer,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,both,Mid-range product,Least busy days,Average orders,10,New customer,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,both,Mid-range product,Least busy days,Most orders,10,New customer,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,both,Mid-range product,Least busy days,Average orders,10,New customer,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,both,Mid-range product,Least busy days,Most orders,10,New customer,Low spender,20.5,Non-frequent customer


In [174]:
# left merge ords_prods_merged with df_cust on user_id
# to add the customer info to each product order row

ords_prods_cust_merge = ords_prods_merged.merge(df_cust, on = ['user_id'], how = 'left')

In [175]:
# check the join
ords_prods_cust_merge.head(20)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_last_order,product_id,add_to_cart_order,reordered,product_name,...,mean_spend,order_frequency,order_frequency_flag,Gender,customer_state,customer_age,date_joined,customer_dependants,fam_status,customer_income
0,2539329,1,1,2,8,,196,1,0,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
5,3367565,1,6,2,7,19.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
6,550135,1,7,1,9,20.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
7,3108588,1,8,1,14,14.0,196,2,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
8,2295261,1,9,1,16,0.0,196,4,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
9,2550362,1,10,4,8,30.0,196,1,1,Soda,...,Low spender,20.5,Non-frequent customer,Female,Alabama,31,2/17/2019,3,married,40423
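
Beyond inspecting the first rows, a left merge like this one can be checked programmatically. The sketch below uses two made-up tables (`orders`, `customers`) and a hypothetical indicator column name `cust_merge`; a custom name is used because the prepared orders data already carries a `_merge` column. It also shows why the key dtype matters: the orders table stores user_id as object, so the customer key is converted to string before merging.

```python
import pandas as pd

# Made-up order rows: user_id stored as strings (object dtype)
orders = pd.DataFrame({
    'order_id': ['10', '11', '12'],
    'user_id': ['1', '1', '2'],
})

# Made-up customer rows: user_id read as integers, as from a CSV
customers = pd.DataFrame({
    'user_id': [1, 2],
    'customer_state': ['Alabama', 'Texas'],
})

# Align the key dtype before merging
customers['user_id'] = customers['user_id'].astype('str')

merged = orders.merge(
    customers,
    on='user_id',
    how='left',
    validate='m:1',          # raises MergeError if a user_id repeats in customers
    indicator='cust_merge',  # hypothetical column name for the match flag
)

# Order rows that found no matching customer
unmatched = (merged['cust_merge'] == 'left_only').sum()
print(len(merged), unmatched)
```

A left merge validated as many-to-one keeps the row count of the left table, so comparing len(merged) with len(orders) is a quick sanity check.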


In [178]:
# drop the _merge indicator column that was already present in ords_prods_merged
ords_prods_cust_merge = ords_prods_cust_merge.drop(columns = ['_merge'])

In [180]:
ords_prods_cust_merge.isnull().sum()

order_id                       0
user_id                        0
order_number                   0
orders_day_of_week             0
order_hour_of_day              0
days_since_last_order    2076096
product_id                     0
add_to_cart_order              0
reordered                      0
product_name                   0
aisle_id                       0
department_id                  0
prices                         0
price_range_loc                0
busiest_days                   0
busiest_period_of_day          0
max_order                      0
loyalty_flag                   0
mean_spend                     0
order_frequency                5
order_frequency_flag           0
Gender                         0
customer_state                 0
customer_age                   0
date_joined                    0
customer_dependants            0
fam_status                     0
customer_income                0
dtype: int64

In [182]:
ords_prods_cust_merge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32404859 entries, 0 to 32404858
Data columns (total 28 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   order_id               object 
 1   user_id                object 
 2   order_number           int64  
 3   orders_day_of_week     int64  
 4   order_hour_of_day      int64  
 5   days_since_last_order  float64
 6   product_id             object 
 7   add_to_cart_order      int64  
 8   reordered              int64  
 9   product_name           object 
 10  aisle_id               int64  
 11  department_id          int64  
 12  prices                 float64
 13  price_range_loc        object 
 14  busiest_days           object 
 15  busiest_period_of_day  object 
 16  max_order              int64  
 17  loyalty_flag           object 
 18  mean_spend             object 
 19  order_frequency        float64
 20  order_frequency_flag   object 
 21  Gender                 object 
 22  customer_state  

Merge is complete. The null values in days_since_last_order come from each customer's first order, which has no previous order to measure from. The 5 nulls in order_frequency are not investigated in this notebook.
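
The explanation for the days_since_last_order nulls could be confirmed with a direct comparison. This is a minimal sketch on made-up rows, assuming that order_number equal to 1 marks a customer's first order.

```python
import numpy as np
import pandas as pd

# Made-up orders for two customers; first orders have no previous order
toy = pd.DataFrame({
    'user_id': ['1', '1', '1', '2', '2'],
    'order_number': [1, 2, 3, 1, 2],
    'days_since_last_order': [np.nan, 15.0, 21.0, np.nan, 7.0],
})

is_null = toy['days_since_last_order'].isnull()
is_first = toy['order_number'] == 1

# True when every null sits on a first order and every first order is null
nulls_match_first_orders = bool((is_null == is_first).all())
print(nulls_match_first_orders)
```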

## 05 Export the merged dataframe as a pickle file

In [181]:
ords_prods_cust_merge.to_pickle(os.path.join(pathData, '02 Data', 'Prepared Data', 'ords_prods_cust.pkl'))
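
A pickle export can be verified by reading the file back and comparing it with the dataframe in memory. The sketch below does this for a small made-up dataframe in a temporary folder; the file name `toy_export.pkl` is a placeholder, not a file from this project.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the merged data
toy = pd.DataFrame({'user_id': ['1', '2'], 'customer_income': [40423, 59285]})

# Write to a temporary folder, then read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'toy_export.pkl')
    toy.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

# Pickle preserves shape, dtypes and values
same_shape = restored.shape == toy.shape
same_dtypes = restored.dtypes.equals(toy.dtypes)
same_values = restored.equals(toy)
print(same_shape, same_dtypes, same_values)
```

Unlike CSV, pickle keeps the dtypes, so user_id stays a string column after reloading.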