## Potential New Customer Data

KPMG has asked us to profile potential customers for a targeted advertising campaign. The aim is to run sales and marketing campaigns with the limited resources available in order to increase sales and revenue. The data set was provided by KPMG and contains 1000 records and 23 features. It is unclean and needs to go through some preprocessing steps before it can be fed into a machine learning model for prediction.

In [1]:
# importing the required libraries for the analysis
import numpy as np
import pandas as pd
import datetime

In [2]:
# load the data and view a sample of it
new_cust = pd.read_excel('C:/Users/Chuks/datasets/kpmg_new_customer.xlsx')
new_cust.sample(5)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Rank,Value
290,Vittoria,Whitney,Female,95,1981-06-03,Research Assistant I,,High Net Worth,N,No,...,NSW,Australia,11,1.07,1.07,1.3375,1.3375,291,291,1.035937
381,Almira,Mangion,Female,4,1996-01-24,VP Product Management,Financial Services,Affluent Customer,N,Yes,...,NSW,Australia,12,0.52,0.65,0.8125,0.8125,382,382,0.95
954,Lyndell,Jereatt,Female,14,1994-11-28,Payment Adjustment Coordinator,,High Net Worth,N,No,...,NSW,Australia,12,0.5,0.5,0.625,0.625,954,954,0.45
38,Garik,Whitwell,Male,44,1955-06-13,,Property,Mass Customer,N,Yes,...,VIC,Australia,2,0.88,1.1,1.1,0.935,38,38,1.4375
240,Farra,Matyushkin,Female,18,1974-01-24,VP Quality Control,Manufacturing,High Net Worth,N,Yes,...,VIC,Australia,9,0.79,0.9875,1.234375,1.234375,241,241,1.0875


In [3]:
new_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   first_name                           1000 non-null   object        
 1   last_name                            971 non-null    object        
 2   gender                               1000 non-null   object        
 3   past_3_years_bike_related_purchases  1000 non-null   int64         
 4   DOB                                  983 non-null    datetime64[ns]
 5   job_title                            894 non-null    object        
 6   job_industry_category                835 non-null    object        
 7   wealth_segment                       1000 non-null   object        
 8   deceased_indicator                   1000 non-null   object        
 9   owns_car                             1000 non-null   object        
 10  tenure       

There are missing values in last_name, DOB, job_title and job_industry_category. Some features hold data that carries no meaning for this analysis. The 'country' and 'deceased_indicator' columns each contain a single value and can be dropped as well. The columns that need to be removed are: 'first_name', 'last_name', 'job_title', 'DOB', 'Unnamed: 16', 'Unnamed: 17', 'Unnamed: 18', 'Rank', 'Unnamed: 19', 'Unnamed: 20', 'country', 'address', 'deceased_indicator', 'Value', 'postcode'.

In [4]:
# deceased_indicator and country each hold a single value, so they are irrelevant
print(new_cust['deceased_indicator'].unique())
print(new_cust['country'].unique())

['N']
['Australia']
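
The same check can be run over every column at once instead of one column at a time. The snippet below is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the KPMG data); it lists the columns that hold a single distinct value.

```python
import pandas as pd

# Toy frame standing in for new_cust; the values are invented for illustration.
toy = pd.DataFrame({
    "country": ["Australia", "Australia", "Australia"],
    "deceased_indicator": ["N", "N", "N"],
    "state": ["NSW", "VIC", "QLD"],
})

# nunique(dropna=False) counts NaN as its own value, so a column that is
# partly missing is not mistaken for a constant one.
n_unique = toy.nunique(dropna=False)
constant_cols = n_unique[n_unique == 1].index.tolist()
print(constant_cols)
```

On the real data, `new_cust.nunique(dropna=False)` should flag 'country' and 'deceased_indicator' in the same way.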


In [5]:
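# converting the date of birth to age in completed years
now = pd.Timestamp('now')
dob = new_cust['DOB']
# subtract one year if this year's birthday has not happened yet (month and day)
not_yet = (dob.dt.month > now.month) | ((dob.dt.month == now.month) & (dob.dt.day > now.day))
new_cust['age'] = now.year - dob.dt.year - not_yet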
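
Because the result depends on the day the cell is run, the age logic is easier to verify against a fixed reference date. The sketch below uses a hypothetical helper, `age_on`, and two made-up birthdays on either side of the reference day. It compares both month and day, so a birthday later in the current month is still counted as not yet reached.

```python
import pandas as pd

def age_on(dob: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Completed years between each date of birth and ref (NaN where DOB is missing)."""
    # True where the birthday in ref's year is still to come.
    before_birthday = (dob.dt.month > ref.month) | (
        (dob.dt.month == ref.month) & (dob.dt.day > ref.day)
    )
    return ref.year - dob.dt.year - before_birthday.astype(int)

# Two birthdays in June around the reference day, plus one missing DOB.
dob = pd.Series(pd.to_datetime(["1981-06-03", "1981-06-20", None]))
ages = age_on(dob, pd.Timestamp("2021-06-10"))
print(ages)
```

The first person has already turned 40 on the reference date, the second is still 39, and the missing DOB stays missing.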

In [6]:
#dropping some irrelevant columns
new_cust = new_cust.drop(['first_name', 'last_name', 'job_title', 'DOB', 'Unnamed: 16', 'Unnamed: 17', 
                          'Unnamed: 18', 'Rank', 'Unnamed: 19', 'Unnamed: 20', 'country', 'address', 
                          'deceased_indicator', 'Value', 'postcode'], axis=1)

In [7]:
new_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   gender                               1000 non-null   object 
 1   past_3_years_bike_related_purchases  1000 non-null   int64  
 2   job_industry_category                835 non-null    object 
 3   wealth_segment                       1000 non-null   object 
 4   owns_car                             1000 non-null   object 
 5   tenure                               1000 non-null   int64  
 6   state                                1000 non-null   object 
 7   property_valuation                   1000 non-null   int64  
 8   age                                  983 non-null    float64
dtypes: float64(1), int64(3), object(5)
memory usage: 70.4+ KB


In [8]:
# segmenting people into groups of young adult, adult and senior based on their respective age
cut_points = [17,35,60,90]
label_names = ["young adult","adult","senior"]
new_cust["age_categories"] = pd.cut(new_cust["age"],cut_points,labels=label_names)
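
It helps to see how `pd.cut` treats the bin edges. By default each bin excludes its left edge and includes its right edge, so with these cut points the groups are (17, 35], (35, 60] and (60, 90]. The sample ages below are made up to sit on the boundaries.

```python
import pandas as pd

cut_points = [17, 35, 60, 90]
label_names = ["young adult", "adult", "senior"]

# Ages chosen to sit on and around each bin edge.
ages = pd.Series([17, 18, 35, 36, 60, 61, 90, 95])
groups = pd.cut(ages, cut_points, labels=label_names)
print(pd.DataFrame({"age": ages, "group": groups}))
```

Ages of 17 or below and above 90 fall outside every bin and come back as NaN, as do missing ages.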

In [9]:
new_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   gender                               1000 non-null   object  
 1   past_3_years_bike_related_purchases  1000 non-null   int64   
 2   job_industry_category                835 non-null    object  
 3   wealth_segment                       1000 non-null   object  
 4   owns_car                             1000 non-null   object  
 5   tenure                               1000 non-null   int64   
 6   state                                1000 non-null   object  
 7   property_valuation                   1000 non-null   int64   
 8   age                                  983 non-null    float64 
 9   age_categories                       983 non-null    category
dtypes: category(1), float64(1), int64(3), object(5)
memory usage: 71.5+ KB


In [10]:
new_cust['age'].dtype

dtype('float64')

In [11]:
# converting age from float to int
# note: missing ages (NaN) have no integer equivalent; in this run they come out as -2147483648
new_cust['age'] = np.array(new_cust['age']).astype(int)

In [12]:
# converting age_categories from category to object
new_cust['age_categories'] = new_cust['age_categories'].astype(str)

In [13]:
new_cust.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   gender                               1000 non-null   object
 1   past_3_years_bike_related_purchases  1000 non-null   int64 
 2   job_industry_category                835 non-null    object
 3   wealth_segment                       1000 non-null   object
 4   owns_car                             1000 non-null   object
 5   tenure                               1000 non-null   int64 
 6   state                                1000 non-null   object
 7   property_valuation                   1000 non-null   int64 
 8   age                                  1000 non-null   int32 
 9   age_categories                       1000 non-null   object
dtypes: int32(1), int64(3), object(6)
memory usage: 74.3+ KB


In [14]:
# erroneous age data
erroneous_age = new_cust[new_cust['age'] == -2147483648]
erroneous_age

Unnamed: 0,gender,past_3_years_bike_related_purchases,job_industry_category,wealth_segment,owns_car,tenure,state,property_valuation,age,age_categories
59,U,5,IT,Mass Customer,No,4,VIC,5,-2147483648,
226,U,35,IT,Affluent Customer,Yes,11,NSW,9,-2147483648,
324,U,69,IT,Mass Customer,Yes,3,VIC,3,-2147483648,
358,U,65,Entertainment,Affluent Customer,No,5,QLD,8,-2147483648,
360,U,71,IT,Mass Customer,Yes,11,VIC,7,-2147483648,
374,U,66,IT,Mass Customer,No,15,QLD,6,-2147483648,
434,U,52,IT,Mass Customer,No,7,VIC,5,-2147483648,
439,U,93,IT,Mass Customer,Yes,14,VIC,6,-2147483648,
574,U,69,IT,Mass Customer,No,12,NSW,7,-2147483648,
598,U,15,IT,Affluent Customer,No,5,NSW,11,-2147483648,
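
The value -2147483648 is the smallest number a 32-bit integer can hold. It most likely marks the rows whose age was missing (983 of 1000 ages were non-null before the cast) rather than a genuine age. One way to avoid the sentinel, sketched below on made-up values, is pandas' nullable `Int64` dtype, which keeps missing entries as `<NA>`. This is an alternative to the cast used in this notebook, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value, standing in for new_cust['age'].
ages = pd.Series([62.0, 50.0, np.nan])

# The nullable integer dtype stores whole numbers and keeps NaN as <NA>.
ages_int = ages.astype("Int64")
print(ages_int.isna().sum())

# Rows with a known age can then be selected explicitly.
known = ages_int.dropna()
print(known.tolist())
```

With this approach the rows to drop can be found with `isna()` instead of matching a magic number.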


In [15]:
# erroneous gender data
erroneous_gender = new_cust[new_cust['gender'] == 'U']
erroneous_gender

Unnamed: 0,gender,past_3_years_bike_related_purchases,job_industry_category,wealth_segment,owns_car,tenure,state,property_valuation,age,age_categories
59,U,5,IT,Mass Customer,No,4,VIC,5,-2147483648,
226,U,35,IT,Affluent Customer,Yes,11,NSW,9,-2147483648,
324,U,69,IT,Mass Customer,Yes,3,VIC,3,-2147483648,
358,U,65,Entertainment,Affluent Customer,No,5,QLD,8,-2147483648,
360,U,71,IT,Mass Customer,Yes,11,VIC,7,-2147483648,
374,U,66,IT,Mass Customer,No,15,QLD,6,-2147483648,
434,U,52,IT,Mass Customer,No,7,VIC,5,-2147483648,
439,U,93,IT,Mass Customer,Yes,14,VIC,6,-2147483648,
574,U,69,IT,Mass Customer,No,12,NSW,7,-2147483648,
598,U,15,IT,Affluent Customer,No,5,NSW,11,-2147483648,


In [16]:
# checking the length of the erroneous data
print(len(erroneous_gender))
print(len(erroneous_age))

17
17


In [17]:
erroneous_gender == erroneous_age

Unnamed: 0,gender,past_3_years_bike_related_purchases,job_industry_category,wealth_segment,owns_car,tenure,state,property_valuation,age,age_categories
59,True,True,True,True,True,True,True,True,True,True
226,True,True,True,True,True,True,True,True,True,True
324,True,True,True,True,True,True,True,True,True,True
358,True,True,True,True,True,True,True,True,True,True
360,True,True,True,True,True,True,True,True,True,True
374,True,True,True,True,True,True,True,True,True,True
434,True,True,True,True,True,True,True,True,True,True
439,True,True,True,True,True,True,True,True,True,True
574,True,True,True,True,True,True,True,True,True,True
598,True,True,True,True,True,True,True,True,True,True
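
The element-wise comparison returns a whole table of booleans that has to be read by eye. A more compact check, sketched here on a made-up frame (`toy` is an assumption for illustration), compares the row labels and the full contents and returns a single True or False for each.

```python
import pandas as pd

# Toy frame: the rows with gender 'U' are also the rows with the sentinel age.
toy = pd.DataFrame({"gender": ["U", "Male", "U"], "age": [-1, 40, -1]})
by_gender = toy[toy["gender"] == "U"]
by_age = toy[toy["age"] == -1]

# Same row labels selected by both filters?
same_rows = by_gender.index.equals(by_age.index)
# Same labels, dtypes and values (NaN in the same position counts as equal)?
same_frames = by_gender.equals(by_age)
print(same_rows, same_frames)
```

Unlike `==`, `DataFrame.equals` does not raise an error when the two frames have different labels; it simply returns False.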


In [18]:
# dropping the erroneous data
new_cust = new_cust[new_cust['age'] != -2147483648]

In [19]:
new_cust.isnull().sum()

gender                                   0
past_3_years_bike_related_purchases      0
job_industry_category                  165
wealth_segment                           0
owns_car                                 0
tenure                                   0
state                                    0
property_valuation                       0
age                                      0
age_categories                           0
dtype: int64

In [20]:
new_cust = new_cust.dropna()
new_cust.isnull().sum()

gender                                 0
past_3_years_bike_related_purchases    0
job_industry_category                  0
wealth_segment                         0
owns_car                               0
tenure                                 0
state                                  0
property_valuation                     0
age                                    0
age_categories                         0
dtype: int64
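
Dropping every row with a missing value removes the 165 customers whose job_industry_category is unknown. The sketch below, on made-up data, compares that with one possible alternative that this notebook does not use: keeping those rows and labelling the gap as its own category. The 'Unknown' label is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing industry values; the data is invented.
toy = pd.DataFrame({
    "job_industry_category": ["IT", np.nan, "Health", np.nan],
    "tenure": [4, 11, 3, 5],
})

# Option used in the notebook: drop the incomplete rows.
dropped = toy.dropna(subset=["job_industry_category"])

# Alternative: keep every row and mark the gap explicitly.
filled = toy.fillna({"job_industry_category": "Unknown"})

print(len(toy), len(dropped), len(filled))
```

Which option is preferable depends on how much the downstream model suffers from losing rows versus carrying an extra category.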

In [21]:
new_cust

Unnamed: 0,gender,past_3_years_bike_related_purchases,job_industry_category,wealth_segment,owns_car,tenure,state,property_valuation,age,age_categories
0,Male,86,Manufacturing,Mass Customer,Yes,14,QLD,6,62,senior
1,Male,69,Property,Mass Customer,No,16,NSW,11,50,adult
2,Female,10,Financial Services,Affluent Customer,No,10,VIC,5,45,adult
3,Female,64,Manufacturing,Affluent Customer,Yes,5,QLD,1,41,adult
4,Female,34,Financial Services,Affluent Customer,No,19,NSW,9,54,adult
...,...,...,...,...,...,...,...,...,...,...
995,Male,60,Financial Services,Affluent Customer,No,9,NSW,7,60,adult
996,Male,22,Health,Mass Customer,No,6,NSW,10,18,young adult
997,Female,17,Financial Services,Affluent Customer,Yes,15,QLD,2,65,senior
998,Male,30,Financial Services,Mass Customer,Yes,19,QLD,2,67,senior


In [22]:
#converting categorical data to numeric data
new_cust = pd.get_dummies(new_cust)
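
To make clear what this step produces, here is a small sketch on a made-up frame. `pd.get_dummies` leaves numeric columns untouched and replaces each object column with one indicator column per distinct value, named `<column>_<value>`. Passing `dtype=int` is an addition here so the indicators are 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column; values are invented.
toy = pd.DataFrame({
    "owns_car": ["Yes", "No", "Yes"],
    "tenure": [14, 16, 10],
})

encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Applied to new_cust, every object column (gender, job_industry_category, wealth_segment, owns_car, state and age_categories) is expanded in this way.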

In [23]:
#saving the file to excel
new_cust.to_excel('new_cust1.xlsx')