**Data Cleaning for the Customer Demographic Sheet from the Raw Data**


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from datetime import datetime, date
plt.style.use('ggplot')

import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the CustomerDemographic sheet from the raw Excel file
cust_demo = pd.read_excel('Raw_data.xlsx',sheet_name="CustomerDemographic")


In [4]:
# Preview the first five records of the data set
cust_demo.head(5)

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,default,owns_car,tenure
0,1,Laraine,Medendorp,F,93,1953-10-12 00:00:00,Executive Secretary,Health,Mass Customer,N,"""'",Yes,11.0
1,2,Eli,Bockman,Male,81,1980-12-16 00:00:00,Administrative Officer,Financial Services,Mass Customer,N,<script>alert('hi')</script>,Yes,16.0
2,3,Arlin,Dearle,Male,61,1954-01-20 00:00:00,Recruiting Manager,Property,Mass Customer,N,2018-02-01 00:00:00,Yes,15.0
3,4,Talbot,,Male,33,1961-10-03 00:00:00,,IT,Mass Customer,N,() { _; } >_[$($())] { touch /tmp/blns.shellsh...,No,7.0
4,5,Sheila-kathryn,Calton,Female,56,1977-05-13 00:00:00,Senior Editor,,Affluent Customer,N,NIL,Yes,8.0


In [64]:
#Data Types for Customer Demographic data
cust_demo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3912 entries, 0 to 3999
Data columns (total 13 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   customer_id                          3912 non-null   int64         
 1   first_name                           3912 non-null   object        
 2   last_name                            3912 non-null   object        
 3   gender                               3912 non-null   object        
 4   past_3_years_bike_related_purchases  3912 non-null   int64         
 5   DOB                                  3912 non-null   datetime64[ns]
 6   job_title                            3912 non-null   object        
 7   job_industry_category                3912 non-null   object        
 8   wealth_segment                       3912 non-null   object        
 9   deceased_indicator                   3912 non-null   object        
 10  owns_car         

Most column data types are appropriate, but DOB is stored as object in the raw sheet and needs to be converted to datetime. Some columns also contain missing values, and the default column holds junk values that are irrelevant to our analysis. (The info() output above was captured on a later re-run, after the cleaning steps below, which is why it already shows DOB as datetime64[ns] and 3912 non-null rows.)

**Data Cleaning & Processing**



In [19]:
# Number of rows and columns

print("No of columns in the dataset :",(cust_demo.shape[1]))

print("No of rows in the dataset: ",(cust_demo.shape[0]))

No of columns in the dataset : 13
No of rows in the dataset:  4000


In [21]:
# Convert DOB to datetime; values that cannot be parsed become NaT
cust_demo['DOB'] = pd.to_datetime(cust_demo['DOB'], errors='coerce')
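
Because errors='coerce' silently turns unparseable values into NaT, it can help to separate values that were missing from the start from values that failed to parse. The sketch below illustrates this on a few made-up strings (hypothetical sample values, not rows from Raw_data.xlsx); the names raw_dob, parsed and failed_to_parse are introduced here only for illustration.

```python
import pandas as pd

# Hypothetical sample values, not taken from Raw_data.xlsx.
raw_dob = pd.Series(["1953-10-12", "1980-12-16", "not a date", None])

# errors="coerce" converts anything unparseable to NaT instead of raising.
parsed = pd.to_datetime(raw_dob, errors="coerce")

# Present in the raw data but lost during parsing (as opposed to originally missing).
failed_to_parse = raw_dob.notna() & parsed.isna()

print(parsed.dtype)
print(raw_dob[failed_to_parse].tolist())
```

Running a check like this before overwriting the column shows whether the later null count reflects gaps in the source or parsing failures.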


In [22]:
# select numeric columns
numeric = cust_demo.select_dtypes(include=[np.number])
numeric_cols = numeric.columns.values
print("The numeric columns are : {}".format(numeric_cols))


# select non-numeric columns
non_numeric =cust_demo.select_dtypes(exclude=[np.number])
non_numeric_cols=non_numeric.columns.values
print("The non numeric columns are : {}".format(non_numeric_cols))


The numeric columns are : ['customer_id' 'past_3_years_bike_related_purchases' 'tenure']
The non numeric columns are : ['first_name' 'last_name' 'gender' 'DOB' 'job_title'
 'job_industry_category' 'wealth_segment' 'deceased_indicator' 'default'
 'owns_car']


In [24]:
# Drop irrelevant columns
cust_demo.drop(columns=['default'], inplace=True)  # not relevant to the analysis


In [28]:
# Check for missing values

cust_demo.isnull().sum()

customer_id                              0
first_name                               0
last_name                              125
gender                                   0
past_3_years_bike_related_purchases      0
DOB                                     87
job_title                              506
job_industry_category                  656
wealth_segment                           0
deceased_indicator                       0
owns_car                                 0
tenure                                  87
dtype: int64

Missing values appear in the columns last_name, DOB, job_title, job_industry_category and tenure.
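
To decide between dropping rows and imputing, it is often easier to look at counts and percentages side by side. The following is a minimal sketch on a hypothetical toy frame (the frame toy and the summary table missing are illustrative names, and the values are made up); the same two expressions could be applied to cust_demo.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for cust_demo.
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "last_name": ["Medendorp", None, "Dearle", None],
    "tenure": [11.0, 16.0, np.nan, 7.0],
})

# One row per column: absolute count and percentage of missing values.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": (toy.isnull().mean() * 100).round(2),
})

# Keep only the columns that actually have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```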

In [29]:
# Check that rows with a missing last name still have a first name and customer_id

cust_demo[cust_demo['last_name'].isnull()][['first_name', 'customer_id']].isnull().sum()

first_name     0
customer_id    0
dtype: int64

Every customer has a unique customer_id and a first name with no missing values, so each one can still be identified. It is therefore acceptable to fill missing values in the last_name column with the placeholder "None".

In [33]:
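# Replace missing last names with the placeholder 'None'
# (assign the result back; fillna(inplace=True) on a selected column may not update cust_demo in newer pandas)
cust_demo['last_name'] = cust_demo['last_name'].fillna('None')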

print(cust_demo['last_name'].isnull().sum())

   


0


In [36]:
# Percentage of rows with a missing DOB

round(cust_demo["DOB"].isnull().mean()*100,2)


2.17

Only about 2.17% of the rows have a missing DOB, so we remove those rows. This keeps data quality high without significantly reducing the size of the dataset.

In [37]:
# Remove rows with a missing date of birth
dob_index_drop = cust_demo[cust_demo['DOB'].isnull()].index

cust_demo.drop(index=dob_index_drop, inplace=True)
    


In [38]:

cust_demo['DOB'].isnull().sum()    


0

There are no missing values left in the DOB column.
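
The tenure column had the same number of gaps as DOB (87), and the final null check reports zero missing tenure values without any separate imputation step, which suggests the same rows were affected. A hedged way to confirm such an overlap, and to drop the rows in one call, is sketched below on a hypothetical toy frame; dropna(subset=...) is used here as an equivalent alternative to the index-based drop, not as the method the notebook itself uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: DOB and tenure are missing on the same rows.
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "DOB": pd.to_datetime(["1953-10-12", None, "1954-01-20", None]),
    "tenure": [11.0, np.nan, 15.0, np.nan],
})

# True when the missing-DOB rows are exactly the missing-tenure rows.
same_rows = bool((toy["DOB"].isnull() == toy["tenure"].isnull()).all())

# dropna(subset=...) removes every row whose DOB is NaT in a single call.
cleaned = toy.dropna(subset=["DOB"])

print(same_rows, len(toy) - len(cleaned), int(cleaned["tenure"].isnull().sum()))
```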

**Creating an Age column to check for discrepancies in the data**

In [39]:
# Create an Age column

def age(born):
    """Return the customer's age in completed years as of today."""
    today = date.today()
    # Subtract one year if this year's birthday has not happened yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

cust_demo['Age'] = cust_demo['DOB'].apply(age)
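
Since age() depends on date.today(), the Age values change every time the notebook is re-run. If reproducible output matters, one option is to compute the age against a fixed reference date. The sketch below assumes a hypothetical helper age_on and an arbitrary reference_date; neither is part of the original notebook.

```python
from datetime import date

import pandas as pd


def age_on(born, as_of):
    """Completed years between `born` and `as_of`."""
    had_birthday = (as_of.month, as_of.day) >= (born.month, born.day)
    return as_of.year - born.year - (0 if had_birthday else 1)


# Hypothetical fixed reference date, chosen only for illustration.
reference_date = date(2025, 1, 1)

# Hypothetical sample of birth dates.
dob = pd.Series(pd.to_datetime(["1953-10-12", "1980-01-01", "1843-12-21"]))

# Extra keyword arguments given to Series.apply are forwarded to the function.
ages = dob.apply(age_on, as_of=reference_date)
print(ages.tolist())
```

Storing the reference date alongside the exported data makes it clear which day the ages refer to.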

In [40]:
#Statistics of age column

cust_demo['Age'].describe()

count    3913.000000
mean       47.316381
std        12.800961
min        23.000000
25%        38.000000
50%        47.000000
75%        57.000000
max       181.000000
Name: Age, dtype: float64

In [41]:
#Finding customer whose age is greater than 100

cust_demo[cust_demo['Age'] > 100]    


Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Age
33,34,Jephthah,Bachmann,U,59,1843-12-21,Legal Assistant,IT,Affluent Customer,N,No,20.0,181


Only one customer has an age above 100. An age of 181 (DOB 1843-12-21) is clearly a data-entry error, so we remove this record.

In [42]:
# Drop the records with an implausible age
age_index_drop = cust_demo[cust_demo['Age']>100].index

cust_demo.drop(index=age_index_drop, inplace=True)

In [46]:
# Fetching records where Job Title is missing.

cust_demo[cust_demo['job_title'].isnull()]
    


Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Age
3,4,Talbot,,Male,33,1961-10-03,,IT,Mass Customer,N,No,7.0,63
5,6,Curr,Duckhouse,Male,35,1966-09-16,,Retail,High Net Worth,N,Yes,13.0,58
6,7,Fina,Merali,Female,6,1976-02-23,,Financial Services,Affluent Customer,N,Yes,11.0,49
10,11,Uriah,Bisatt,Male,99,1954-04-30,,Property,Mass Customer,N,No,9.0,71
21,22,Deeanne,Durtnell,Female,79,1962-12-10,,IT,Mass Customer,N,No,11.0,62
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3967,3968,Alexandra,Kroch,Female,99,1977-12-22,,Property,High Net Worth,N,No,22.0,47
3971,3972,Maribelle,Schaffel,Female,6,1979-03-28,,Retail,Mass Customer,N,No,8.0,46
3978,3979,Kleon,Adam,Male,67,1974-07-13,,Financial Services,Mass Customer,N,Yes,18.0,50
3986,3987,Beckie,Wakeham,Female,18,1964-05-29,,Argiculture,Mass Customer,N,No,7.0,60


In [48]:
# Percentage of missing values in job_title
round(cust_demo["job_title"].isnull().mean()*100,2)

12.7

About 12.7% of the job titles are missing. That is too many rows to drop, so we replace the null values with the placeholder "Missing".

In [49]:
# Replace missing job titles with the placeholder 'Missing'

cust_demo['job_title'] = cust_demo['job_title'].fillna('Missing')
   

In [51]:
cust_demo["job_title"].isnull().sum()  

0

There are no missing values left in the job_title column.

In [53]:
# Fetching records where Job industry is missing.

cust_demo[cust_demo['job_industry_category'].isnull()]


Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Age
4,5,Sheila-kathryn,Calton,Female,56,1977-05-13,Senior Editor,,Affluent Customer,N,Yes,8.0,47
7,8,Rod,Inder,Male,31,1962-03-30,Media Manager I,,Mass Customer,N,No,7.0,63
15,16,Harlin,Parr,Male,38,1977-02-27,Media Manager IV,,Mass Customer,N,Yes,18.0,48
16,17,Heath,Faraday,Male,57,1962-03-19,Sales Associate,,Affluent Customer,N,Yes,15.0,63
17,18,Marjie,Neasham,Female,79,1967-07-06,Professor,,Affluent Customer,N,No,11.0,57
...,...,...,...,...,...,...,...,...,...,...,...,...,...
3965,3966,Astrix,Sigward,Female,53,1968-09-15,Geologist I,,Mass Customer,N,Yes,11.0,56
3973,3974,Misha,Ranklin,Female,82,1961-02-11,Technical Writer,,Affluent Customer,N,Yes,9.0,64
3975,3976,Gretel,Chrystal,Female,0,1957-11-20,Internal Auditor,,Affluent Customer,N,Yes,13.0,67
3982,3983,Jarred,Lyste,Male,19,1965-04-21,Graphic Designer,,Mass Customer,N,Yes,9.0,60


In [55]:
round(cust_demo['job_industry_category'].isnull().mean()*100,2)

16.77

About 16.8% of the job_industry_category values are missing, so we again replace the null values with "Missing" rather than drop the rows.

In [56]:
# Replace missing job industry categories with the placeholder 'Missing'
cust_demo['job_industry_category'] = cust_demo['job_industry_category'].fillna('Missing')


In [57]:
cust_demo["job_industry_category"].isnull().sum()

0
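
The two placeholder fills above follow the same pattern, so they could also be expressed in a single call. As a minimal sketch on a hypothetical toy frame: DataFrame.fillna accepts a dict that maps column names to fill values, and columns not named in the dict are left untouched. The names toy, placeholders and filled are illustrative.

```python
import pandas as pd

# Hypothetical toy frame with gaps in both job columns.
toy = pd.DataFrame({
    "job_title": ["Senior Editor", None, "Professor"],
    "job_industry_category": [None, "IT", None],
    "tenure": [8.0, 7.0, 11.0],
})

# Column name -> placeholder; tenure is not listed, so it is not modified.
placeholders = {"job_title": "Missing", "job_industry_category": "Missing"}

# fillna returns a new frame here; toy itself keeps its nulls.
filled = toy.fillna(value=placeholders)
print(filled)
```

Keeping the placeholder as an explicit category lets later analysis treat "Missing" as its own group instead of silently excluding those customers.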

In [63]:
print(cust_demo.isnull().sum())


customer_id                            0
first_name                             0
last_name                              0
gender                                 0
past_3_years_bike_related_purchases    0
DOB                                    0
job_title                              0
job_industry_category                  0
wealth_segment                         0
deceased_indicator                     0
owns_car                               0
tenure                                 0
Age                                    0
dtype: int64


The data set now contains no missing values.

**Inconsistency in data set**

We now check the categorical columns for inconsistent values and typos.
The columns to be checked are 'gender', 'wealth_segment', 'deceased_indicator' and 'owns_car'.
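
One way to make this check repeatable is to compare each column against the set of labels we expect to see. The sketch below uses a hypothetical helper unexpected_values and an assumed expected mapping on a made-up toy frame; the allowed labels are an assumption for illustration and would need to be confirmed against the real data.

```python
import pandas as pd

# Hypothetical toy frame with a few deliberately inconsistent gender labels.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "F", "Femal", "M"],
    "owns_car": ["Yes", "No", "Yes", "Yes", "No"],
})

# Assumed canonical labels per column (an illustration, not a specification).
expected = {
    "gender": {"Female", "Male"},
    "owns_car": {"Yes", "No"},
}


def unexpected_values(frame, allowed):
    """Return {column: {value: count}} for values outside the allowed set."""
    report = {}
    for column, valid in allowed.items():
        counts = frame[column].value_counts(dropna=False)
        odd = {value: int(n) for value, n in counts.items() if value not in valid}
        if odd:
            report[column] = odd
    return report


report = unexpected_values(toy, expected)
print(report)
```

An empty report means every checked column contains only the expected labels.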

In [66]:
cust_demo["gender"].value_counts()

gender
Female    2037
Male      1872
F            1
Femal        1
M            1
Name: count, dtype: int64

The gender column contains inconsistent values such as abbreviations and spelling mistakes.
We replace "F" and "Femal" with "Female", and "M" with "Male".

In [68]:
def replace_gender_names(gender):
    """Map gender abbreviations and typos to the standard labels Male and Female."""
    corrections = {'M': 'Male', 'F': 'Female', 'Femal': 'Female'}
    # Values without a correction entry are returned unchanged
    return corrections.get(gender, gender)

cust_demo['gender'] = cust_demo['gender'].apply(replace_gender_names)




In [69]:
cust_demo["gender"].value_counts()

gender
Female    2039
Male      1873
Name: count, dtype: int64

The inconsistent values, spelling mistakes and typos in the gender column have been corrected.

In [70]:
#Wealth segment

cust_demo["wealth_segment"].value_counts()

wealth_segment
Mass Customer        1954
High Net Worth        996
Affluent Customer     962
Name: count, dtype: int64

There are no inconsistent values, spelling mistakes or typos in the wealth_segment column.

In [71]:
# deceased indicator

cust_demo["deceased_indicator"].value_counts()

deceased_indicator
N    3910
Y       2
Name: count, dtype: int64

There are no inconsistent values, spelling mistakes or typos in the deceased_indicator column.

In [73]:
#owns_Car

cust_demo['owns_car'].value_counts()


owns_car
Yes    1974
No     1938
Name: count, dtype: int64

There are no inconsistent values, spelling mistakes or typos in the owns_car column.

**Duplicate Record**

We need to ensure that the dataset contains no duplicate records, because duplicates distort the analysis. If there are duplicate rows, we need to drop them.

In [77]:
#Return boolean series of duplicate values 
duplicates=cust_demo.duplicated()


#display duplicates rows
cust_demo[duplicates]

Unnamed: 0,customer_id,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Age


There are no duplicate records in the data set.
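
Note that duplicated() with no arguments only flags rows that match in every column. Two records for the same customer that differ in a single field would not be caught, so a check on the key column can be a useful complement. A minimal sketch on a hypothetical toy frame (names and values are made up):

```python
import pandas as pd

# Hypothetical toy frame: customer 2 appears twice with a different tenure.
toy = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "first_name": ["Laraine", "Eli", "Eli", "Arlin"],
    "tenure": [11.0, 16.0, 17.0, 15.0],
})

# Full-row check: finds nothing, because the two rows differ in tenure.
full_row_dupes = int(toy.duplicated().sum())

# Key check: keep=False marks every row that shares a customer_id.
key_dupes = toy[toy.duplicated(subset=["customer_id"], keep=False)]

print(full_row_dupes)
print(key_dupes)
```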

**Exporting the Cleaned Customer Demographic Data Set to CSV**

The Customer Demographic dataset is now clean, so we can export it to a CSV file and continue the analysis of customer segments by joining it to the other tables.


In [79]:
cust_demo.to_csv('CustomerDemographic_Cleaned.csv', index=False)
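
A CSV file stores no type information, so some of the cleaning can be undone when the file is read back: DOB returns as plain text unless parse_dates is given, and recent pandas versions list the string None among the default NA markers of read_csv, so the last_name placeholder may come back as NaN. This behaviour is worth verifying against the installed pandas version. The round trip below is a sketch on a hypothetical toy frame, written to an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# Hypothetical toy frame using the same placeholder as the last_name fill.
toy = pd.DataFrame({
    "customer_id": [1, 2],
    "DOB": pd.to_datetime(["1953-10-12", "1980-12-16"]),
    "last_name": ["Medendorp", "None"],
})

# Write to an in-memory buffer so the example does not touch the disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: DOB is not parsed, and NA markers are applied to strings.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Explicit read: parse DOB and keep placeholder strings exactly as written.
buffer.seek(0)
careful = pd.read_csv(buffer, parse_dates=["DOB"], keep_default_na=False)

print(naive.dtypes)
print(careful.dtypes)
```

With keep_default_na=False, genuinely empty cells are read as empty strings instead of NaN, so this option fits best when the exported file has no remaining gaps, as is the case here.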
