# Intro to Data Visualization with Python - Part 1

## Content
#### 1. Import libraries and data
#### 2. Data wrangling of dataframe df_cust
#### 2.1. Rename columns
#### 2.2. Drop columns
#### 2.3. Columns data types
#### 2.4. Cleaning data
#### 3. Consistency check for dataframe df_cust
#### 3.1. Missing values
#### 3.2. Duplicates
#### 3.3. Mixed data types
#### 4. Additional data wrangling for dataframe df_ords_prods
#### 5. Combining 2 dataframes df_ords_prods and df_cust
#### 6. Export data

# 1. Import libraries and data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os

In [2]:
# Create project folder path
path = r'C:\Users\Lara\Career Foundry Projects\21-09-2023 Instacart Basket Analysis'

In [3]:
# Import datasets orders_products_with_flags_new.pkl and customers.csv
df_ords_prods = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_with_flags_new.pkl'))
df_cust = pd.read_csv (os.path.join(path, '02 Data', 'Original Data', 'customers.csv'), index_col = False)

In [4]:
df_ords_prods.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_range,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_price,spending_flag,median_days,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Mid-range product,Regularly busy,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


In [5]:
df_ords_prods.shape

(32434212, 22)

In [6]:
df_cust.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [7]:
df_cust.shape

(206209, 10)

# 2. Data wrangling of dataframe df_cust

## 2.1. Rename columns

In [8]:
# Rename columns First Name, Surnam, Gender, STATE, Age, n_dependants, fam_status
df_cust.rename(columns = {
    'First Name' : 'first_name',
    'Surnam' : 'surname',
    'Gender' : 'gender',
    'STATE' : 'state',
    'Age' : 'age',
    'n_dependants' : 'number_of_dependants',
    'fam_status' : 'marital_status'
}, inplace = True)

In [9]:
# Check output
df_cust.head()

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


## 2.2. Drop columns

In [10]:
# Show all unique values from column date_joined and their counts
df_cust['date_joined'].value_counts(dropna = False).sort_index()

date_joined
1/1/2017     159
1/1/2018     147
1/1/2019     153
1/1/2020     153
1/10/2017    192
            ... 
9/8/2018     164
9/8/2019     158
9/9/2017     186
9/9/2018     174
9/9/2019     181
Name: count, Length: 1187, dtype: int64

#### This column is irrelevant for our future analysis, so I'm dropping it to shorten the dataset before combining it with dataset orders_products_with_flags_new.

In [11]:
# Drop column date_joined and overwrite existing df_cust
df_cust = df_cust.drop (columns = ['date_joined'])

In [12]:
# Check output
df_cust.head()

Unnamed: 0,user_id,first_name,surname,gender,state,age,number_of_dependants,marital_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1,married,40374


## 2.3. Columns data types

In [13]:
# Check data types of all columns
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   user_id               206209 non-null  int64 
 1   first_name            194950 non-null  object
 2   surname               206209 non-null  object
 3   gender                206209 non-null  object
 4   state                 206209 non-null  object
 5   age                   206209 non-null  int64 
 6   number_of_dependants  206209 non-null  int64 
 7   marital_status        206209 non-null  object
 8   income                206209 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 14.2+ MB


#### All columns have the right data type. The numeric columns can be changed to smaller numeric types to use less memory and to optimize further steps.
#### It looks like only column 'first_name' has some missing values. I'll re-check this later.

In [14]:
# Check maximum value in all 4 numeric columns 'user_id', 'age', 'number_of_dependants' and 'income'
df_cust[['user_id', 'age', 'number_of_dependants', 'income']].max()

user_id                 206209
age                         81
number_of_dependants         3
income                  593901
dtype: int64

#### Based on these outputs, I will change data type for columns 'age' and 'number_of_dependants' to int8 (maximum 127) and for 'user_id' and 'income' to int32, because their maximum values exceed the int16 limit of 32767.

In [15]:
# Change data types in columns 'age', 'number_of_dependants', 'user_id' and 'income'
df_cust['age'] = df_cust['age'].astype('int8')
df_cust['number_of_dependants'] = df_cust['number_of_dependants'].astype('int8')
df_cust['user_id'] = df_cust['user_id'].astype('int32')
df_cust['income'] = df_cust['income'].astype('int32')

In [16]:
# Check all data types again and memory usage
df_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206209 entries, 0 to 206208
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype 
---  ------                --------------   ----- 
 0   user_id               206209 non-null  int32 
 1   first_name            194950 non-null  object
 2   surname               206209 non-null  object
 3   gender                206209 non-null  object
 4   state                 206209 non-null  object
 5   age                   206209 non-null  int8  
 6   number_of_dependants  206209 non-null  int8  
 7   marital_status        206209 non-null  object
 8   income                206209 non-null  int32 
dtypes: int32(2), int8(2), object(5)
memory usage: 9.8+ MB
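The choice of integer type above can also be derived from the data instead of read off by eye. The following is a minimal sketch, assuming a hypothetical helper `smallest_int_dtype` and a toy dataframe `demo`; only the maximum values come from the output above, the other values are made up for illustration.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(series):
    """Return the narrowest signed integer dtype that holds every value in series.

    Hypothetical helper written for this illustration, not part of pandas.
    """
    lo, hi = series.min(), series.max()
    for candidate in (np.int8, np.int16, np.int32, np.int64):
        info = np.iinfo(candidate)
        if info.min <= lo and hi <= info.max:
            return np.dtype(candidate)
    raise ValueError("values do not fit in a 64-bit signed integer")

# Toy frame: the maxima mirror the notebook output, the minima are made up
demo = pd.DataFrame({
    'user_id': [1, 206209],
    'age': [18, 81],
    'number_of_dependants': [0, 3],
    'income': [20000, 593901],
})

# Pick a dtype per column, then convert all columns in one call
chosen = {col: smallest_int_dtype(demo[col]) for col in demo.columns}
demo = demo.astype(chosen)
print(demo.dtypes)
```

On this toy data the helper picks the same types as the manual choice. It only looks at the values currently present, so a column that may later receive larger values should be given some headroom.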


## 2.4. Cleaning data

In [17]:
# Check if state names are correct
df_cust['state'].value_counts(dropna = False).sort_index()

state
Alabama                 4044
Alaska                  4044
Arizona                 4044
Arkansas                4044
California              4044
Colorado                4044
Connecticut             4044
Delaware                4044
District of Columbia    4044
Florida                 4044
Georgia                 4044
Hawaii                  4044
Idaho                   4044
Illinois                4044
Indiana                 4044
Iowa                    4044
Kansas                  4043
Kentucky                4043
Louisiana               4043
Maine                   4043
Maryland                4043
Massachusetts           4043
Michigan                4043
Minnesota               4043
Mississippi             4043
Missouri                4043
Montana                 4043
Nebraska                4043
Nevada                  4043
New Hampshire           4043
New Jersey              4043
New Mexico              4043
New York                4043
North Carolina          4043
North Da

#### All names are correct.

# 3. Consistency check for dataframe df_cust

## 3.1. Missing values

In [18]:
# Set Jupyter display option to show all rows
pd.options.display.max_rows = None

In [19]:
# Finding missing values in df_cust
df_cust.isnull().sum()

user_id                     0
first_name              11259
surname                     0
gender                      0
state                       0
age                         0
number_of_dependants        0
marital_status              0
income                      0
dtype: int64

#### As suspected, column 'first_name' has 11259 missing values (NaN). NaN is stored as a float, while the other values in this column are strings. For now I'll leave them as they are.
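To see what a missing value actually is inside a text column, the Python type of each element can be inspected. This is a small sketch on a made-up Series `names`, not on the real customer data.

```python
import numpy as np
import pandas as pd

# Made-up object column with one missing entry
names = pd.Series(['Deborah', np.nan, 'Kenneth'], dtype='object')

# Python type of every element: strings for names, float for NaN
value_types = names.map(type)
print(value_types.value_counts())

# isnull() counts the same entry as missing
missing_count = names.isnull().sum()
print(missing_count)
```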

## 3.2. Duplicates

In [20]:
# Create subset with full duplicates
df_cust_dups = df_cust[df_cust.duplicated()]

In [21]:
df_cust_dups

Unnamed: 0,user_id,first_name,surname,gender,state,age,number_of_dependants,marital_status,income


#### There are no duplicates.
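The check above looks for rows that are identical in every column. Because 'user_id' is later used as the merge key, it can also be worth checking for repeated keys on their own. The sketch below uses a made-up dataframe `demo` to show the difference between the two checks; the `subset` and `keep` parameters are standard arguments of `DataFrame.duplicated`.

```python
import pandas as pd

# Made-up data: user 2 is a full duplicate, user 3 repeats with a different income
demo = pd.DataFrame({
    'user_id': [1, 2, 2, 3, 3],
    'surname': ['Nguyen', 'Hart', 'Hart', 'Hicks', 'Hicks'],
    'income': [40423, 59285, 59285, 42049, 50000],
})

# Rows identical to an earlier row in every column
full_dups = demo[demo.duplicated()]

# All rows whose user_id occurs more than once, regardless of other columns
key_dups = demo[demo.duplicated(subset = ['user_id'], keep = False)]

print(len(full_dups), len(key_dups))
```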

## 3.3. Mixed data types

In [22]:
# Check for mixed type columns
for col in df_cust.columns.tolist():
  weird = (df_cust[[col]].applymap(type) != df_cust[[col]].iloc[0].apply(type)).any(axis = 1)
  if len (df_cust[weird]) > 0:
    print (col)

first_name


#### I already concluded that this column has NaN values, which are floats rather than strings. That is the reason why it was printed as a column with a "weird" type. For now I will leave them as they are.
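A shorter way to run the same kind of check is to count how many distinct Python types each column holds. This is only a sketch on a made-up dataframe `demo`; it uses `Series.map(type)` rather than the loop above and flags any column with more than one type.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'first_name' mixes strings with a NaN, the other columns are uniform
demo = pd.DataFrame({
    'first_name': ['Deborah', np.nan, 'Kenneth'],
    'surname': ['Esquivel', 'Hart', 'Farley'],
    'age': [48, 36, 35],
})

# A column is "mixed" when its elements have more than one Python type
mixed_cols = [col for col in demo.columns if demo[col].map(type).nunique() > 1]
print(mixed_cols)
```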

# 4. Additional data wrangling for dataframe df_ords_prods

#### This dataframe has over 32 million rows and 22 columns. I will check if some of the columns can be dropped or if some data types can be changed to a data type that takes less memory. Data type for column 'user_id' needs to be changed to int32 to match the data type of the column with the same name in dataframe df_cust.

In [23]:
# Check data types for all columns
df_ords_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434212 entries, 0 to 32434211
Data columns (total 22 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   order_number            int64  
 3   orders_day_of_week      int64  
 4   order_hour_of_day       int64  
 5   days_since_prior_order  float64
 6   product_id              int64  
 7   add_to_cart_order       int64  
 8   reordered               int64  
 9   product_name            object 
 10  aisle_id                int64  
 11  department_id           int64  
 12  prices                  float64
 13  price_range             object 
 14  busiest_days            object 
 15  busiest_period_of_day   object 
 16  max_order               int64  
 17  loyalty_flag            object 
 18  mean_price              float64
 19  spending_flag           object 
 20  median_days             float64
 21  order_frequency_flag    objec

In [24]:
# First change data type for column 'user_id'
df_ords_prods['user_id'] = df_ords_prods['user_id'].astype('int32')

In [25]:
# Check output
df_ords_prods['user_id'].dtype

dtype('int32')

In [26]:
df_ords_prods.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_range,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_price,spending_flag,median_days,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Mid-range product,Regularly busy,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
5,3367565,1,6,2,7,19.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
6,550135,1,7,1,9,20.0,196,1,1,Soda,...,9.0,Mid-range product,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
7,3108588,1,8,1,14,14.0,196,2,1,Soda,...,9.0,Mid-range product,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
8,2295261,1,9,1,16,0.0,196,4,1,Soda,...,9.0,Mid-range product,Busiest days,Most orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer
9,2550362,1,10,4,8,30.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.367797,Low spender,20.5,Non-frequent customer


#### Round all values in column 'mean_price' to 2 decimal places so that they resemble the format of prices in stores.

In [27]:
df_ords_prods['mean_price'] = df_ords_prods['mean_price'].round(2)

In [28]:
df_ords_prods.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,prices,price_range,busiest_days,busiest_period_of_day,max_order,loyalty_flag,mean_price,spending_flag,median_days,order_frequency_flag
0,2539329,1,1,2,8,,196,1,0,Soda,...,9.0,Mid-range product,Regularly busy,Awerage orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
5,3367565,1,6,2,7,19.0,196,1,1,Soda,...,9.0,Mid-range product,Regularly busy,Awerage orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
6,550135,1,7,1,9,20.0,196,1,1,Soda,...,9.0,Mid-range product,Busiest days,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
7,3108588,1,8,1,14,14.0,196,2,1,Soda,...,9.0,Mid-range product,Busiest days,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
8,2295261,1,9,1,16,0.0,196,4,1,Soda,...,9.0,Mid-range product,Busiest days,Most orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer
9,2550362,1,10,4,8,30.0,196,1,1,Soda,...,9.0,Mid-range product,Slowest days,Awerage orders,10,New customer,6.37,Low spender,20.5,Non-frequent customer


In [29]:
# Check maximum value for all columns with numeric data types (except 'user_id')
df_ords_prods[['order_id', 'order_number', 'orders_day_of_week', 'order_hour_of_day', 'days_since_prior_order', 'product_id', 'add_to_cart_order', 'reordered', 'aisle_id', 'department_id', 'prices', 'max_order', 'mean_price', 'median_days']].max()

order_id                  3421083.0
order_number                   99.0
orders_day_of_week              6.0
order_hour_of_day              23.0
days_since_prior_order         30.0
product_id                  49688.0
add_to_cart_order             145.0
reordered                       1.0
aisle_id                      134.0
department_id                  21.0
prices                         25.0
max_order                      99.0
mean_price                     23.2
median_days                    30.0
dtype: float64

In [30]:
# Change data type to int8
df_ords_prods['orders_day_of_week'] = df_ords_prods['orders_day_of_week'].astype('int8')
df_ords_prods['order_hour_of_day'] = df_ords_prods['order_hour_of_day'].astype('int8')
df_ords_prods['reordered'] = df_ords_prods['reordered'].astype('int8')
df_ords_prods['department_id'] = df_ords_prods['department_id'].astype('int8')

In [31]:
# Change data types to int16
df_ords_prods['order_number'] = df_ords_prods['order_number'].astype('int16')
df_ords_prods['add_to_cart_order'] = df_ords_prods['add_to_cart_order'].astype('int16')
df_ords_prods['aisle_id'] = df_ords_prods['aisle_id'].astype('int16')
df_ords_prods['max_order'] = df_ords_prods['max_order'].astype('int16')

In [32]:
# Change data types to int32
df_ords_prods['order_id'] = df_ords_prods['order_id'].astype('int32')
df_ords_prods['product_id'] = df_ords_prods['product_id'].astype('int32')

In [33]:
# Change data types to float32
df_ords_prods['days_since_prior_order'] = df_ords_prods['days_since_prior_order'].astype('float32')
df_ords_prods['prices'] = df_ords_prods['prices'].astype('float32')
df_ords_prods['mean_price'] = df_ords_prods['mean_price'].astype('float32')
df_ords_prods['median_days'] = df_ords_prods['median_days'].astype('float32')

In [34]:
# Check if all changes took place
df_ords_prods.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434212 entries, 0 to 32434211
Data columns (total 22 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int32  
 1   user_id                 int32  
 2   order_number            int16  
 3   orders_day_of_week      int8   
 4   order_hour_of_day       int8   
 5   days_since_prior_order  float32
 6   product_id              int32  
 7   add_to_cart_order       int16  
 8   reordered               int8   
 9   product_name            object 
 10  aisle_id                int16  
 11  department_id           int8   
 12  prices                  float32
 13  price_range             object 
 14  busiest_days            object 
 15  busiest_period_of_day   object 
 16  max_order               int16  
 17  loyalty_flag            object 
 18  mean_price              float32
 19  spending_flag           object 
 20  median_days             float32
 21  order_frequency_flag    objec

#### Changing all these data types lowered memory usage by about 2.4 GB, which is almost half of the previous memory usage.
#### Column 'days_since_prior_order' needs to stay a float data type because it contains NaN values.
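A saving like this can be measured directly with `DataFrame.memory_usage`. The sketch below uses a small made-up dataframe `demo`, so the byte counts are not the ones from this project. It also shows why a column containing NaN cannot be cast to an integer type.

```python
import numpy as np
import pandas as pd

# Made-up frame with 1000 rows in the default 64-bit types
demo = pd.DataFrame({
    'orders_day_of_week': np.arange(1000) % 7,
    'prices': np.linspace(1.0, 25.0, 1000),
}).astype({'orders_day_of_week': 'int64', 'prices': 'float64'})

before = demo.memory_usage(index = False, deep = True).sum()

# Same data in narrower types
slim = demo.astype({'orders_day_of_week': 'int8', 'prices': 'float32'})
after = slim.memory_usage(index = False, deep = True).sum()

print(before, after)

# A float column with NaN cannot be converted to a plain integer dtype
days = pd.Series([np.nan, 15.0, 21.0])
try:
    days.astype('int8')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
```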

# 5. Combining 2 dataframes df_ords_prods and df_cust

In [35]:
# Check shape for both dataframes
df_ords_prods.shape

(32434212, 22)

In [36]:
df_cust.shape

(206209, 9)

#### First I'll merge the 2 dataframes with an outer merge and check how many rows are flagged with merge flag 'left_only' or 'right_only'. If their sum is insignificant compared to the total number of rows, I'll merge them with an inner merge and check that all rows are flagged with merge flag 'both'.

In [37]:
# Merge df_ords_prods and df_cust with outer merge with key column 'user_id'
df_merge_out = df_ords_prods.merge(df_cust, on = 'user_id', indicator = True, how = 'outer')

In [38]:
# Check if it's a full merge
df_merge_out['_merge'].value_counts()

_merge
both          32434212
left_only            0
right_only           0
Name: count, dtype: int64

#### This was a full merge: every row was flagged 'both', so the inner merge is not needed. No further action is needed.
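To show what the merge flag looks like when the match is not complete, here is a minimal sketch on two made-up dataframes, `orders` and `customers`. The `validate = 'many_to_one'` argument is an optional extra that is not used in this notebook; it makes pandas raise an error if 'user_id' is not unique in the right dataframe.

```python
import pandas as pd

# Made-up data: user 4 has an order but no customer record,
# user 3 has a customer record but no orders
orders = pd.DataFrame({'user_id': [1, 1, 2, 4], 'order_id': [10, 11, 12, 13]})
customers = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Alabama', 'Idaho', 'Iowa']})

merged = orders.merge(customers, on = 'user_id', how = 'outer',
                      indicator = True, validate = 'many_to_one')

# Count rows per merge flag
counts = merged['_merge'].value_counts()
print(counts)

# Share of rows that did not match on both sides
unmatched_share = (merged['_merge'] != 'both').mean()
print(unmatched_share)
```

If the unmatched share is zero, as in the real data above, an outer and an inner merge give the same rows.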

In [39]:
# Check shape of merged dataframe
df_merge_out.shape

(32434212, 31)

In [40]:
# Drop merge flag. It is no longer necessary.
df_merge_out = df_merge_out.drop(columns = ['_merge'])

In [41]:
# Check shape again and head()
df_merge_out.shape

(32434212, 30)

In [42]:
df_merge_out.head()

Unnamed: 0,order_id,user_id,order_number,orders_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,median_days,order_frequency_flag,first_name,surname,gender,state,age,number_of_dependants,marital_status,income
0,2539329,1,1,2,8,,196,1,0,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,3,married,40423
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,3,married,40423
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,3,married,40423
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,3,married,40423
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,20.5,Non-frequent customer,Linda,Nguyen,Female,Alabama,31,3,married,40423


# 6. Export data

In [43]:
# Export df_merge_out in pickle format
df_merge_out.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_all.pkl'))
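One reason to export in pickle format is that it keeps the reduced data types, so they do not have to be set again after loading. The sketch below checks this round trip on a small made-up dataframe `demo`, written to a temporary folder instead of the project path.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with the same narrow dtypes used in this notebook
demo = pd.DataFrame({
    'user_id': pd.Series([1, 2], dtype = 'int32'),
    'age': pd.Series([31, 48], dtype = 'int8'),
})

# Write to a temporary folder and read the file back
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'demo.pkl')
    demo.to_pickle(out_path)
    restored = pd.read_pickle(out_path)

print(restored.dtypes)
```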