# 4.9 IC Introduce Customer dataset

### This script contains the following points: <br> <br> 
1. Importing Libraries <br> <br> 
2. Importing Data Sets <br> <br> 
3. Data Checks <br><br>
4. Wrangling<br><br>
5. Checking for Consistency<br>
 > 05.01 Check for missing values <br>
 > 05.02 Check for mixed types <br>
 > 05.03 Check for duplicates <br>
 > 05.04 Check for outliers <br>
6. Export customers_clean <br><br>
7. Merge Cust with Main Merged Dataset <br><br>
8. Export df_ords_prods_all

## 01 Importing Libraries

In [1]:
# Importing Libraries 
import pandas as pd
import numpy as np
import os

## 02 Importing Data

In [2]:
# First create a string of the path for the main project folder
path = r'/Users/mistystone/Library/CloudStorage/OneDrive-Personal/Documents/CF_Data_Ach4_Python/2023-05_Instacart_Basket_Analysis/'

In [3]:
# Import the customers.csv file
df_cust = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'customers.csv'))

## 03 Data Checks

In [4]:
# Check head
df_cust.head()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


In [5]:
# Check tail
df_cust.tail()

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
206204,168073,Lisa,Case,Female,North Carolina,44,4/1/2020,1,married,148828
206205,49635,Jeremy,Robbins,Male,Hawaii,62,4/1/2020,3,married,168639
206206,135902,Doris,Richmond,Female,Missouri,66,4/1/2020,2,married,53374
206207,81095,Rose,Rollins,Female,California,27,4/1/2020,1,married,99799
206208,80148,Cynthia,Noble,Female,New York,55,4/1/2020,1,married,57095


In [6]:
# Check shape
df_cust.shape

(206209, 10)

In [7]:
# Check data types
df_cust.dtypes

user_id          int64
First Name      object
Surnam          object
Gender          object
STATE           object
Age              int64
date_joined     object
n_dependants     int64
fam_status      object
income           int64
dtype: object

In [8]:
# Check descriptive stats
df_cust.describe()

Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


Notes about descriptive statistics:
User_id looks fine. The minimum is 1 and the maximum is 206209, as expected. The mean and median are both 103105, as expected if the IDs run from 1 to 206209.
Age looks fine. The minimum is 18 and the maximum is 81. The mean (49.5) is close to the median (49), so there is not much skew.
N_dependants looks fine. The minimum is 0 and the maximum is 3. (Perhaps that was the maximum number available in the survey. In a sample of 206,000 people, surely someone has more than 3 kids!) The mean is 1.5 and the median is 1, which hints at a slight right skew.
Income looks fine. The minimum is about 26,000 and the maximum is about 594,000. The mean is about 95,000 and the median is about 94,000, so the center of the distribution is not particularly skewed, although the maximum is far above the 75th percentile (about 124,000).
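
The mean-versus-median comparison used in these notes can be sketched on a toy Series. The values and the name `toy_income` are made up for illustration and do not come from customers.csv.

```python
import pandas as pd

# Hypothetical toy incomes (not from customers.csv); one large value creates a long right tail
toy_income = pd.Series([30000, 45000, 60000, 75000, 90000, 600000])

mean_income = toy_income.mean()
median_income = toy_income.median()

# A mean well above the median suggests a right-skewed distribution
print(mean_income, median_income)

# pandas also offers a direct skewness estimate; positive values indicate a right tail
skewness = toy_income.skew()
print(skewness)
```

On the real data, the same two calls on `df_cust['income']` would give a quick numeric check of the skew described above.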

## 04 Wrangling

In [9]:
# Delete First and Last names for privacy reasons
df_cust = df_cust.drop(columns = ['First Name','Surnam'])

In [10]:
# rename columns: Lowercase gender
df_cust.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [11]:
# rename columns: Lowercase state
df_cust.rename(columns = {'STATE' : 'state'}, inplace = True)

In [12]:
# rename columns: Lowercase age
df_cust.rename(columns = {'Age' : 'age'}, inplace = True)

In [13]:
# rename columns: number_dependents
df_cust.rename(columns = {'n_dependants' : 'number_dependents'}, inplace = True)

In [14]:
# rename columns: family_status
df_cust.rename(columns = {'fam_status' : 'family_status'}, inplace = True)
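
The five rename cells above could probably be collapsed into a single call by passing one mapping. The sketch below shows the idea on a hypothetical one-row frame (`toy`) that only mimics the original column names.

```python
import pandas as pd

# Hypothetical one-row frame that mimics the original column names
toy = pd.DataFrame({
    'user_id': [1],
    'Gender': ['Female'],
    'STATE': ['Iowa'],
    'Age': [40],
    'n_dependants': [0],
    'fam_status': ['single'],
})

# One mapping renames every column in a single call
column_map = {
    'Gender': 'gender',
    'STATE': 'state',
    'Age': 'age',
    'n_dependants': 'number_dependents',
    'fam_status': 'family_status',
}
toy = toy.rename(columns=column_map)

print(toy.columns.tolist())
```

Columns that are not listed in the mapping (here `user_id`) are left unchanged.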

In [15]:
# check top of df
df_cust.head()

Unnamed: 0,user_id,gender,state,age,date_joined,number_dependents,family_status,income
0,26711,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Female,Maryland,26,1/1/2017,1,married,40374


## 05 Checking for Consistency

### 05.01 Check for missing values

In [16]:
# Frequency of user_id
df_cust['user_id'].value_counts(dropna = False)

26711     1
67322     1
173044    1
61044     1
98344     1
         ..
146847    1
154991    1
172193    1
184326    1
80148     1
Name: user_id, Length: 206209, dtype: int64

In [17]:
# Frequency of gender
df_cust['gender'].value_counts(dropna = False)

Male      104067
Female    102142
Name: gender, dtype: int64

In [18]:
# Frequency of state
df_cust['state'].value_counts(dropna = False)

Florida                 4044
Colorado                4044
Illinois                4044
Alabama                 4044
District of Columbia    4044
Hawaii                  4044
Arizona                 4044
Connecticut             4044
California              4044
Indiana                 4044
Arkansas                4044
Alaska                  4044
Delaware                4044
Iowa                    4044
Idaho                   4044
Georgia                 4044
Wyoming                 4043
Mississippi             4043
Oklahoma                4043
Utah                    4043
New Hampshire           4043
Kentucky                4043
Maryland                4043
Rhode Island            4043
Massachusetts           4043
Michigan                4043
New Jersey              4043
Kansas                  4043
South Dakota            4043
Minnesota               4043
Tennessee               4043
New York                4043
Washington              4043
Louisiana               4043
Montana       

In [19]:
# Frequency of age
df_cust['age'].value_counts(dropna = False).sort_index()

18    3195
19    3329
20    3240
21    3176
22    3236
      ... 
77    3261
78    3247
79    3234
80    3195
81    3263
Name: age, Length: 64, dtype: int64

In [20]:
# Frequency of date_joined
df_cust['date_joined'].value_counts(dropna = False).sort_index()

1/1/2017     159
1/1/2018     147
1/1/2019     153
1/1/2020     153
1/10/2017    192
            ... 
9/8/2018     164
9/8/2019     158
9/9/2017     186
9/9/2018     174
9/9/2019     181
Name: date_joined, Length: 1187, dtype: int64
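
Because date_joined is stored as text, sort_index() orders the dates above as strings rather than chronologically (1/10/2017 is listed after 1/1/2020). If chronological order matters, one option is to parse the column first. This is a minimal sketch on made-up values, assuming the month/day/year format seen in the output.

```python
import pandas as pd

# Hypothetical sample of date strings in the same m/d/yyyy style as date_joined
toy_dates = pd.Series(['1/1/2017', '1/10/2017', '9/9/2019', '4/1/2020'])

# String sorting compares character by character, so the order is not chronological
string_sorted = sorted(toy_dates.tolist())
print(string_sorted)

# Parsing to datetime gives a true chronological order
parsed = pd.to_datetime(toy_dates, format='%m/%d/%Y')
print(parsed.sort_values())
print(parsed.min(), parsed.max())
```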

In [21]:
# Frequency of number_dependents
df_cust['number_dependents'].value_counts(dropna = False).sort_index()

0    51602
1    51531
2    51482
3    51594
Name: number_dependents, dtype: int64

In [22]:
# Frequency of family_status
df_cust['family_status'].value_counts(dropna = False).sort_index()

divorced/widowed                     17640
living with parents and siblings      9701
married                             144906
single                               33962
Name: family_status, dtype: int64

In [23]:
# Frequency of income
df_cust['income'].value_counts(dropna = False).sort_index()

25903     1
25911     1
25937     1
25941     1
25955     1
         ..
584097    1
590790    1
591089    1
592409    1
593901    1
Name: income, Length: 108012, dtype: int64

No missing data. With dropna = False, any NaN values would have appeared as their own row in the counts above.
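
A more direct way to count missing values is `isnull().sum()`, which reports one number per column. The sketch below uses a hypothetical frame (`toy`) with deliberately missing entries, since the customer data has none.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing state and one missing income
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'state': ['Iowa', None, 'Idaho'],
    'income': [42049.0, 59285.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isnull().any(axis=1)]
print(rows_with_missing)
```

Applied to the real data, `df_cust.isnull().sum()` should show 0 for every column if the conclusion above holds.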

### 05.02 Check for mixed types

In [24]:
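# Check for mixed types: print any column holding values whose type differs from that of its first value
# Note: applymap is deprecated in pandas 2.1+ in favor of DataFrame.map
for col in df_cust.columns.tolist():
    weird = (df_cust[[col]].applymap(type) != df_cust[[col]].iloc[0].apply(type)).any(axis = 1)
    if len(df_cust[weird]) > 0:
        print(col)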

No mixed types (the loop printed no column names).
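
An alternative check, offered only as a sketch and not as the method used in this notebook, counts the distinct Python types per column with `Series.map(type)`. The frame `toy` is hypothetical and contains one deliberately mixed column.

```python
import pandas as pd

# Hypothetical frame: 'state' mixes strings with an integer
toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'state': ['Iowa', 12, 'Idaho'],
})

# A column is "mixed" when its values have more than one Python type
mixed_columns = [
    col for col in toy.columns
    if toy[col].map(type).nunique() > 1
]
print(mixed_columns)
```

Note that a missing value (NaN) in a text column is a float, so this check would also flag columns with missing strings.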

### 05.03 Check for duplicates

In [25]:
# Checking for duplicates
df_dups = df_cust[df_cust.duplicated()]

In [26]:
df_dups

Unnamed: 0,user_id,gender,state,age,date_joined,number_dependents,family_status,income


No duplicates
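
For reference, this is how the same check behaves when duplicates do exist. The frame `toy` is hypothetical; the last row repeats the second one.

```python
import pandas as pd

# Hypothetical frame where the last row is a full duplicate of the second
toy = pd.DataFrame({
    'user_id': [1, 2, 2],
    'state': ['Iowa', 'Idaho', 'Idaho'],
    'income': [42049, 99568, 99568],
})

# duplicated() marks every repeat of an earlier row
full_dups = toy[toy.duplicated()]
print(full_dups)

# drop_duplicates() keeps the first occurrence of each row
toy_no_dups = toy.drop_duplicates()
print(toy_no_dups.shape)

# A key column can also be tested on its own
print(toy['user_id'].is_unique)
```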

### 05.04 Check for outliers

In [27]:
# Check descriptive statistics. Looking for very large max values or very small min values.
df_cust.describe()

Unnamed: 0,user_id,age,number_dependents,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


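No obvious outlier values in user_id, age or number_dependents. The income maximum (593,901) is well above the 75th percentile (124,244), but high incomes are plausible, so no values are removed.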
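
Reading describe() is a visual check. A common rule of thumb, shown here as a sketch and not something this notebook applies, flags values more than 1.5 times the interquartile range beyond the quartiles. The values in `toy_income` are made up.

```python
import pandas as pd

# Hypothetical incomes with one very large value
toy_income = pd.Series([40000, 55000, 60000, 75000, 90000, 95000, 120000, 590000])

# Quartiles and interquartile range
q1 = toy_income.quantile(0.25)
q3 = toy_income.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged for review, not automatically removed
flagged = toy_income[(toy_income < lower_fence) | (toy_income > upper_fence)]
print(flagged)
```

Whether a flagged value is an error or just a genuinely high income remains a judgment call.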

## 06 Export customers_clean

In [28]:
# Export data in Pickle format
df_cust.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'customers_clean.pkl'))

## 07 Merge Cust with Main Merged Dataset

In [28]:
# Check df_cust shape
df_cust.shape

(206209, 8)

In [29]:
# Import pickle files
# Import saved orders_products_merged_flags.pkl as df_ords_prods_merged_flags
df_ords_prods_merged_flags = pd.read_pickle(os.path.join(path, '02 Data','Prepared Data','orders_products_merged_flags.pkl')) 

In [30]:
# Check df_ords_prods_merged_flags shape
df_ords_prods_merged_flags.shape

(32404859, 23)

In [31]:
# Check df_ords_prods_merged_flags head
df_ords_prods_merged_flags.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,price_range_loc,busiest_day,Busiest_days,busiest_period_of_day,max_order,loyalty_flag,average_spend,spend_flag,frequent_orders,frequent_flag
0,2539329,1,1,2,8,7.0,196,1,0,Soda,...,Mid-range product,Regularly busy,Regular busy,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,Mid-range product,Regularly busy,Least busy,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Average orders,10,New customer,6.367797,Low spender,20.0,Regular customer
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,Mid-range product,Least busy,Least busy,Most orders,10,New customer,6.367797,Low spender,20.0,Regular customer


In [33]:
# Merge on user_id (default inner join), adding a _merge indicator column
df_ords_prods_all = df_ords_prods_merged_flags.merge(df_cust, on = 'user_id', indicator = True)

In [None]:
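# Check merge. All rows should be 'both', with left_only and right_only at 0.
# Note: with the default inner join this is always true, so also compare the row count with df_ords_prods_merged_flags.shape.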
df_ords_prods_all['_merge'].value_counts()
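
An inner join keeps only matching keys, so its indicator column can only ever contain 'both'. To see unmatched rows, the merge needs `how = 'outer'`. The sketch below illustrates the difference on two hypothetical frames (`orders`, `customers`) rather than the real data.

```python
import pandas as pd

# Hypothetical frames: user 4 has an order but no customer record,
# and user 3 has a customer record but no orders
orders = pd.DataFrame({'user_id': [1, 1, 2, 4], 'order_id': [10, 11, 12, 13]})
customers = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Alabama', 'Iowa', 'Idaho']})

# Inner join (the default): unmatched rows are silently dropped
inner = orders.merge(customers, on='user_id', indicator=True)
inner_all_both = (inner['_merge'] == 'both').all()
print(len(orders), len(inner), inner_all_both)

# Outer join: unmatched rows are kept and labelled
outer = orders.merge(customers, on='user_id', how='outer', indicator=True)
left_only_count = (outer['_merge'] == 'left_only').sum()
right_only_count = (outer['_merge'] == 'right_only').sum()
print(outer['_merge'].value_counts())
```

In this notebook the row count stays at 32404859 before and after the merge, which suggests that no order rows were dropped by the inner join.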

In [35]:
# Check top
df_ords_prods_all.head()

Unnamed: 0,order_id,user_id,order_number,order_day_of_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,...,frequent_orders,frequent_flag,gender,state,age,date_joined,number_dependents,family_status,income,_merge
0,2539329,1,1,2,8,7.0,196,1,0,Soda,...,20.0,Regular customer,Female,Alabama,31,2/17/2019,3,married,40423,both
1,2398795,1,2,3,7,15.0,196,1,1,Soda,...,20.0,Regular customer,Female,Alabama,31,2/17/2019,3,married,40423,both
2,473747,1,3,3,12,21.0,196,1,1,Soda,...,20.0,Regular customer,Female,Alabama,31,2/17/2019,3,married,40423,both
3,2254736,1,4,4,7,29.0,196,1,1,Soda,...,20.0,Regular customer,Female,Alabama,31,2/17/2019,3,married,40423,both
4,431534,1,5,4,15,28.0,196,1,1,Soda,...,20.0,Regular customer,Female,Alabama,31,2/17/2019,3,married,40423,both


In [36]:
# Delete _merge column
df_ords_prods_all = df_ords_prods_all.drop(columns = '_merge')

In [37]:
# Check shape (the row count should match df_ords_prods_merged_flags: 32404859)
df_ords_prods_all.shape

(32404859, 30)

## 08 Export df_ords_prods_all

In [39]:
# Export data in Pickle format
df_ords_prods_all.to_pickle(os.path.join(path, '02 Data','Prepared Data', 'orders_products_all.pkl'))