# Task Exercise 4.9 Book 1. Instacart Customers Data Cleaning

## The senior Instacart officers have given you a new data set of customer information to go along with your product and order data. First, you’ll need to incorporate this new data set into your project. In part 2, you’ll create some visualizations, conduct some exploratory analysis, and begin wrapping up everything you’ve done in this Achievement in preparation for the final task in the next Exercise, where you’ll write up a report for your client.

### Due to the size of the dataframes involved, this exercise was split across multiple notebooks to conserve memory. Book 1 focuses on importing 'customers.csv', verifying, cleaning, and modifying the dataframe, and exporting it as 'customers_checked.csv'.


## This script contains the following points:

## 1. Import and verify 'customers.csv' dataframe

## 2. Renaming columns

### 'First Name' to 'first_name'
### 'Surnam' to 'surname'
### 'Gender' to 'gender'
### 'STATE' to 'state'
### 'Age' to 'age'
### 'n_dependants' to 'num_of_dependants'
### 'fam_status' to 'family_status'

## 3. Finding duplicates

## 4. Convert 'user_id' to object to facilitate merging with the 'orders_products_merged' dataframe

## 5. Export 'customers_checked.csv'


In [1]:
# import libraries

import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
path = r'C:\Users\howl6\OneDrive\Certificates\CareerFoundry\Coursework\Data_Immersion\Chapter 4\Instacart Basket Analysis'

In [3]:
df_cust = pd.read_csv(os.path.join(path,'02_Data','Original_Data', '4.9_customers', 'customers.csv'))

In [4]:
# view top 20 rows

df_cust.head(20)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374
5,133128,Cynthia,Noble,Female,Kentucky,43,1/1/2017,2,married,49643
6,152052,Chris,Walton,Male,Montana,20,1/1/2017,0,single,61746
7,168851,Joseph,Hickman,Male,South Carolina,30,1/1/2017,0,single,63712
8,69965,Jeremy,Vang,Male,Texas,47,1/1/2017,1,married,162432
9,82820,Shawn,Chung,Male,Virginia,26,1/1/2017,2,married,32072


In [5]:
# view bottom 35 rows

df_cust.tail(35)

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
206174,134553,Ralph,Avalos,Male,Indiana,25,4/1/2020,1,married,64482
206175,167749,Deborah,Farrell,Female,Florida,28,4/1/2020,1,married,30169
206176,186595,Ruth,Cunningham,Female,Mississippi,38,4/1/2020,1,married,92727
206177,199732,Katherine,Abbott,Female,Iowa,71,4/1/2020,1,married,31019
206178,138442,Gloria,Cantrell,Female,North Carolina,26,4/1/2020,1,married,46199
206179,177599,Donna,Ruiz,Female,Kansas,71,4/1/2020,3,married,119306
206180,48091,Jeremy,Willis,Male,West Virginia,37,4/1/2020,3,married,31048
206181,192077,Susan,Nash,Female,Georgia,54,4/1/2020,3,married,160664
206182,180181,Carl,Bridges,Male,West Virginia,77,4/1/2020,2,married,167232
206183,79560,Steven,Sutton,Male,Wyoming,78,4/1/2020,3,married,101764


In [6]:
# view number of rows, columns

df_cust.shape

(206209, 10)

In [7]:
# list of columns

df_cust.columns

Index(['user_id', 'First Name', 'Surnam', 'Gender', 'STATE', 'Age',
       'date_joined', 'n_dependants', 'fam_status', 'income'],
      dtype='object')

In [8]:
# view descriptive statistics

df_cust.describe()

Unnamed: 0,user_id,Age,n_dependants,income
count,206209.0,206209.0,206209.0,206209.0
mean,103105.0,49.501646,1.499823,94632.852548
std,59527.555167,18.480962,1.118433,42473.786988
min,1.0,18.0,0.0,25903.0
25%,51553.0,33.0,0.0,59874.0
50%,103105.0,49.0,1.0,93547.0
75%,154657.0,66.0,3.0,124244.0
max,206209.0,81.0,3.0,593901.0


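### A review of the descriptive statistics does not suggest any significant outliers in age or number of dependants. The maximum income (593,901) sits well above the 75th percentile (124,244), so the upper end of the income column may merit a closer look.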
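
One way to back a reading of `describe()` with numbers is an interquartile-range (IQR) rule. The sketch below is illustrative only: the `iqr_outliers` helper is hypothetical, the 1.5 multiplier is a common convention rather than something this notebook specifies, and the data is a made-up toy column, not values from 'customers.csv'.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy data for illustration only; not taken from customers.csv
toy = pd.DataFrame({'income': [40000, 60000, 90000, 95000, 120000, 125000, 590000]})

flagged = iqr_outliers(toy['income'])
print(flagged)
```

Applied to the quartiles reported above (25% = 59,874 and 75% = 124,244), the same rule would place the upper fence at 124,244 + 1.5 × 64,370 = 220,799, which the reported maximum income of 593,901 exceeds.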

In [9]:
# view datatypes

df_cust.dtypes

user_id          int64
First Name      object
Surnam          object
Gender          object
STATE           object
Age              int64
date_joined     object
n_dependants     int64
fam_status      object
income           int64
dtype: object

### Variable types all seem appropriate.

In [10]:
# looking for missing values (NaN)

df_cust['user_id'].value_counts(dropna = False)

2049      1
167163    1
187633    1
181490    1
183539    1
         ..
150044    1
147997    1
154142    1
152095    1
2047      1
Name: user_id, Length: 206209, dtype: int64

In [11]:
df_cust['First Name'].value_counts(dropna = False)

NaN        11259
Marilyn     2213
Barbara     2154
Todd        2113
Jeremy      2104
           ...  
Merry        197
Eugene       197
Garry        191
David        186
Ned          186
Name: First Name, Length: 208, dtype: int64

In [12]:
df_cust['date_joined'].value_counts(dropna = False)

9/17/2018     213
2/10/2018     212
4/1/2019      211
9/21/2019     211
12/19/2017    210
             ... 
9/1/2018      141
1/22/2018     140
11/24/2017    139
7/18/2019     138
8/6/2018      128
Name: date_joined, Length: 1187, dtype: int64

In [13]:
df_cust['income'].value_counts(dropna = False)

95891     10
57192     10
95710     10
94809      9
97532      9
          ..
139481     1
152861     1
464181     1
228664     1
28658      1
Name: income, Length: 108012, dtype: int64

In [14]:
df_cust['Gender'].value_counts(dropna = False)

Male      104067
Female    102142
Name: Gender, dtype: int64

In [15]:
df_cust.isnull().sum()

user_id             0
First Name      11259
Surnam              0
Gender              0
STATE               0
Age                 0
date_joined         0
n_dependants        0
fam_status          0
income              0
dtype: int64

In [16]:
# subset the rows with a missing first name (isnull() already returns a boolean mask)
df_nan = df_cust[df_cust['First Name'].isnull()]

In [17]:
df_nan

Unnamed: 0,user_id,First Name,Surnam,Gender,STATE,Age,date_joined,n_dependants,fam_status,income
53,76659,,Gilbert,Male,Colorado,26,1/1/2017,2,married,41709
73,13738,,Frost,Female,Louisiana,39,1/1/2017,0,single,82518
82,89996,,Dawson,Female,Oregon,52,1/1/2017,3,married,117099
99,96166,,Oconnor,Male,Oklahoma,51,1/1/2017,1,married,155673
105,29778,,Dawson,Female,Utah,63,1/1/2017,3,married,151819
...,...,...,...,...,...,...,...,...,...,...
206038,121317,,Melton,Male,Pennsylvania,28,3/31/2020,3,married,87783
206044,200799,,Copeland,Female,Hawaii,52,4/1/2020,2,married,108488
206090,167394,,Frost,Female,Hawaii,61,4/1/2020,1,married,45275
206162,187532,,Floyd,Female,California,39,4/1/2020,0,single,56325


### A review of column values provides variable counts and identifies 11259 missing first names.
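
The per-column checks above can also be condensed into one summary of missing counts and percentages. This is a minimal sketch on a made-up toy frame (the names `toy`, `missing`, and `toy_nan` are illustrative, not part of the notebook); on the real data, 11259 missing first names out of 206209 rows is roughly 5.5% of customers.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the real data lives in df_cust
toy = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'First Name': ['Ann', np.nan, 'Chris', np.nan],
    'Surnam': ['Gilmore', 'Frost', 'Walton', 'Dawson'],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'pct_missing': toy.isnull().mean() * 100,
})
print(missing)

# Rows where the first name is missing
toy_nan = toy[toy['First Name'].isnull()]
print(len(toy_nan))
```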


## 2. Renaming columns 

### 'First Name' to 'first_name'
### 'Surnam' to 'surname'
### 'Gender' to 'gender'
### 'STATE' to 'state'
### 'Age' to 'age'
### 'n_dependants' to 'num_of_dependants'
### 'fam_status' to 'family_status'

In [18]:
df_cust.rename(columns = {'First Name' : 'first_name'}, inplace = True)

In [19]:
df_cust.rename(columns = {'Surnam' : 'surname'}, inplace = True)

In [20]:
df_cust.rename(columns = {'Gender' : 'gender'}, inplace = True)

In [21]:
df_cust.rename(columns = {'STATE' : 'state'}, inplace = True)

In [22]:
df_cust.rename(columns = {'Age' : 'age'}, inplace = True)

In [23]:
df_cust.rename(columns = {'n_dependants' : 'num_of_dependants'}, inplace = True)

In [24]:
df_cust.rename(columns = {'fam_status' : 'family_status'}, inplace = True)

In [25]:
df_cust.head()

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,num_of_dependants,family_status,income
0,26711,Deborah,Esquivel,Female,Missouri,48,1/1/2017,3,married,165665
1,33890,Patricia,Hart,Female,New Mexico,36,1/1/2017,0,single,59285
2,65803,Kenneth,Farley,Male,Idaho,35,1/1/2017,2,married,99568
3,125935,Michelle,Hicks,Female,Iowa,40,1/1/2017,0,single,42049
4,130797,Ann,Gilmore,Female,Maryland,26,1/1/2017,1,married,40374


## 3. Finding duplicates

In [26]:
df_dups = df_cust[df_cust.duplicated()]

In [27]:
df_dups

Unnamed: 0,user_id,first_name,surname,gender,state,age,date_joined,num_of_dependants,family_status,income


### There are no duplicate rows in the dataframe.
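
`duplicated()` with no arguments only flags rows that match an earlier row in every column. Since 'user_id' is the merge key, it can also be worth checking that key on its own. The earlier `value_counts()` output already suggests it is unique (206209 distinct values for 206209 rows); the sketch below shows both checks on a made-up toy frame.

```python
import pandas as pd

# Toy frame for illustration only; the third row repeats the second
toy = pd.DataFrame({
    'user_id': ['1', '2', '2', '3'],
    'surname': ['Hart', 'Farley', 'Farley', 'Hicks'],
})

# Rows identical to an earlier row across all columns
full_dups = toy[toy.duplicated()]

# Every row whose user_id appears more than once (keep=False marks all copies)
id_dups = toy[toy.duplicated(subset='user_id', keep=False)]

print(len(full_dups), len(id_dups), toy['user_id'].is_unique)
```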

## 4. Convert 'user_id' to object to facilitate merging with the 'orders_products_merged' dataframe

In [28]:
df_cust['user_id'] = df_cust['user_id'].astype('str')

In [29]:
df_cust.dtypes

user_id              object
first_name           object
surname              object
gender               object
state                object
age                   int64
date_joined          object
num_of_dependants     int64
family_status        object
income                int64
dtype: object
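
The cast matters because merge keys need matching dtypes on both sides. The following sketch assumes the orders data stores 'user_id' as strings; `toy_orders` is a hypothetical stand-in for 'orders_products_merged', and the `indicator=True` flag is added here only to confirm that every row found a match.

```python
import pandas as pd

# Toy stand-ins for illustration only
toy_cust = pd.DataFrame({'user_id': [1, 2, 3], 'state': ['Iowa', 'Idaho', 'Texas']})
toy_orders = pd.DataFrame({'user_id': ['1', '1', '3'], 'order_number': [1, 2, 1]})

# Align the key dtype on both sides before merging
toy_cust['user_id'] = toy_cust['user_id'].astype(str)
toy_orders['user_id'] = toy_orders['user_id'].astype(str)  # already strings; shown for symmetry

# Left merge keeps every order row; _merge records where each row was found
merged = toy_orders.merge(toy_cust, on='user_id', how='left', indicator=True)
print(merged['_merge'].value_counts())
```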

## 5. Export 'customers_checked.csv'

In [30]:
df_cust.to_csv(os.path.join(path, '02_Data','Prepared_Data', 'customers_checked.csv'))
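
Two things are worth keeping in mind when this file is read back in a later notebook. CSV files do not store dtypes, so a string 'user_id' will be inferred as an integer again unless the dtype is given on import. Also, `to_csv` writes the index by default, and it then reappears as an 'Unnamed: 0' column. The sketch below shows one possible round trip on toy data, using an in-memory buffer instead of a file; passing `index=False` is a suggestion, not what the cell above does.

```python
import io
import pandas as pd

# Toy frame for illustration only, with user_id already cast to string
toy = pd.DataFrame({'user_id': ['26711', '33890'], 'state': ['Missouri', 'New Mexico']})

# Write without the index so no extra unnamed column is created
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read back, declaring user_id as a string so it is not inferred as an integer
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={'user_id': str})

print(list(reloaded.columns))
print(reloaded['user_id'].tolist())
```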