# Sprocket Central Pty Ltd: Customer Recommendation Project - Data Modeling

<img src="sprocket_central.PNG" width="700" height="400" align="center"/>

### About the Dataset & objective of the report

**Sprocket Central Pty Ltd** is a medium-sized bikes and cycling accessories organisation. It has a large dataset relating to its customers, but its team is unsure how to analyse it effectively to help optimise its marketing strategy.

Primarily, Sprocket Central Pty Ltd needs help with its customer and transactions data. 

The client provided us with 3 datasets:

**Customer Demographic**

**Customer Addresses**

**Transactions data in the past 3 months**

The client also provided us with 1 extra dataset with 1000 records of new customers and wants recommendations on which customers to target with the marketing campaigns:

**New Customers Demographic**


I have joined the datasets together and carried out data wrangling, exploratory data analysis and RFM analysis, then segmented the customers into four segments: Platinum, Gold, Silver and Bronze.

**The notebook objective:**

The client has provided me with a list of new customers that has the same features as the old customers, but these customers haven't purchased from the company before. The client asked for recommendations on which of them to target with marketing campaigns.

To address this request, I will fit a machine learning classification model on the old customers dataset, using the segments I labelled it with, and then use it to predict the segment of each customer in the new customers dataset that the client provided.

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
import seaborn as sns
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
import calendar
pd.set_option("display.max_columns", 100000)

In [2]:
#Importing the rfm segmented dataset 

CTA_rfm=pd.read_csv('CTA_rfm_allinfo.csv')

In [3]:
CTA_rfm.head(2)

Unnamed: 0.1,Unnamed: 0,transaction_id,product_id,customer_id,transaction_date,online_order,order_status,brand,product_line,product_class,product_size,list_price,standard_cost,Transaction_year,Transaction_month,Transaction_day,day_of_the_week,first_name,last_name,gender,past_3_years_bike_related_purchases,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,Year of birth,Age,address,postcode,state,country,property_valuation,recency,frequency,monetary,R,F,M,RFMClass,RFMscore,RFM_loyalty_level
0,0,1,2,2950,2017-02-25,False,Approved,Solex,Standard,medium,medium,71.49,53.62,1970-01-01 00:00:00.000002017,Feb,1970-01-01 00:00:00.000000025,Saturday,Kristos,Anthony,Male,19,Software Engineer I,Financial Services,Mass Customer,N,Yes,10.0,1955.0,66.0,984 Hoepker Court,3064,VIC,Australia,6,76,3,1953.15,3,4,4,344,11,Bronze
1,1,11065,1,2950,2017-10-16,False,Approved,Giant Bicycles,Standard,medium,medium,1403.5,954.82,1970-01-01 00:00:00.000002017,Oct,1970-01-01 00:00:00.000000016,Monday,Kristos,Anthony,Male,19,Software Engineer I,Financial Services,Mass Customer,N,Yes,10.0,1955.0,66.0,984 Hoepker Court,3064,VIC,Australia,6,76,3,1953.15,3,4,4,344,11,Bronze


Creating a new dataframe with the columns I will use to train the classification model, which will predict RFM_loyalty_level for a fresh dataset of 1000 new customers with similar features.

In [4]:
old_customers=CTA_rfm[['gender','past_3_years_bike_related_purchases','job_industry_category','wealth_segment','owns_car','tenure','Age','property_valuation','RFM_loyalty_level']]

In [5]:
old_customers.head(1)

Unnamed: 0,gender,past_3_years_bike_related_purchases,job_industry_category,wealth_segment,owns_car,tenure,Age,property_valuation,RFM_loyalty_level
0,Male,19,Financial Services,Mass Customer,Yes,10.0,66.0,6,Bronze


In [6]:
old_customers.shape

(19327, 9)

## Old customers dataset Features Engineering

Since gender, job_industry_category and owns_car are nominal columns, I will use one-hot encoding to transform them into binary indicator columns for the ML model.

In [7]:
# changing gender data columns using one hot coding into binary

gender=old_customers[['gender']]
gender=pd.get_dummies(gender,drop_first=True)
gender.head()

Unnamed: 0,gender_Male
0,1
1,1
2,1
3,0
4,0


In [8]:
# changing job_industry_category data columns using one hot coding into binary

job_industry_category=old_customers[['job_industry_category']]
job_industry_category=pd.get_dummies(job_industry_category,drop_first=True)
job_industry_category.head()

Unnamed: 0,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications
0,0,1,0,0,0,0,0,0
1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0
4,0,0,1,0,0,0,0,0


In [9]:
# changing owns_car data columns using one hot coding into binary

owns_car=old_customers[['owns_car']]
owns_car=pd.get_dummies(owns_car,drop_first=True)
owns_car.head()

Unnamed: 0,owns_car_Yes
0,1
1,1
2,1
3,1
4,1
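
The three cells above encode each nominal column separately. As a minimal sketch, the same result can probably be obtained with a single `pd.get_dummies` call by passing the `columns` argument. The `toy` frame and its values are illustrative stand-ins, not rows from the client data.

```python
import pandas as pd

# Toy rows standing in for old_customers (values are illustrative only)
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "job_industry_category": ["Health", "IT", "Health"],
    "owns_car": ["Yes", "No", "Yes"],
    "tenure": [10.0, 4.0, 7.0],
})

nominal_cols = ["gender", "job_industry_category", "owns_car"]

# One call encodes every nominal column; drop_first=True drops the first
# (alphabetical) level of each column so no indicator is redundant
encoded = pd.get_dummies(toy, columns=nominal_cols, drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Columns that are not listed in `columns` (here `tenure`) are passed through unchanged, so a separate `pd.concat` step would not be needed in this variant.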


Converting the wealth_segment column into integer codes using a label encoder, since I treat it as an ordinal category column

In [10]:
# changing the wealth_segment column into integer codes using LabelEncoder

from sklearn.preprocessing import LabelEncoder
# work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
old_customers=old_customers.copy()
old_customers['wealth_segment']=LabelEncoder().fit_transform(old_customers['wealth_segment'])
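
One point worth checking with label encoding is the order of the codes: `LabelEncoder` numbers the classes alphabetically, which is why Mass Customer appears as 2 in the outputs of this notebook. The sketch below contrasts that with an explicit mapping. The ranking in `wealth_order` is a hypothetical assumption about how the segments should be ordered, not something stated in the client data.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

segments = pd.Series(
    ["Mass Customer", "Affluent Customer", "High Net Worth", "Mass Customer"]
)

# LabelEncoder sorts the class names alphabetically before numbering them
le = LabelEncoder().fit(segments)
alphabetical_codes = dict(zip(le.classes_, le.transform(le.classes_).tolist()))
print(alphabetical_codes)

# Hypothetical wealth ranking (an assumption made for this sketch)
wealth_order = {"Mass Customer": 0, "Affluent Customer": 1, "High Net Worth": 2}
encoded = segments.map(wealth_order)
print(encoded.tolist())
```

An explicit dictionary also guarantees that the old and new customers datasets receive identical codes, whereas two separately fitted encoders only agree when both datasets contain the same set of categories.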


I will create a new dataframe for the ML model consisting of the transformed categorical columns and the numerical columns

In [11]:
old_customers1=old_customers[['past_3_years_bike_related_purchases','tenure','Age','property_valuation','wealth_segment']]

In [12]:
#Concatenating transformed categorical columns with the old_customers dataframe

old_customers1=pd.concat([gender,job_industry_category,owns_car,old_customers1],axis=1)

In [13]:
old_customers1.shape

(19327, 15)

In [14]:
# final result

old_customers1.head(1)

Unnamed: 0,gender_Male,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications,owns_car_Yes,past_3_years_bike_related_purchases,tenure,Age,property_valuation,wealth_segment
0,1,0,1,0,0,0,0,0,0,1,19,10.0,66.0,6,2


In [15]:
old_customers1.shape

(19327, 15)

Now I will import the new customers dataset

In [16]:
# Retrieving data with 1000 records of new customers

new_customers=pd.read_excel('New customer list.xlsx')

In [17]:
new_customers.head(2)

Unnamed: 0,Note: The data and information in this document is reflective of a hypothetical situation and client. This document is to be used for KPMG Virtual Internship purposes only.,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,,,,,,Rank,Value
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,14,45 Shopko Center,4500,QLD,Australia,6,0.46,0.575,0.71875,0.610938,1.0,1,1.71875


# Data Wrangling

In [18]:
# Making first row as header

new_customers.rename(columns=new_customers.iloc[0], inplace = True)
new_customers.drop([0], inplace = True)

In [19]:
new_customers.head(1)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,NaN,NaN.1,NaN.2,NaN.3,NaN.4,Rank,Value
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,14,45 Shopko Center,4500,QLD,Australia,6,0.46,0.575,0.71875,0.610938,1.0,1,1.71875
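
The header fix above can be illustrated on a small frame. This is a sketch under assumptions: `raw` is a made-up stand-in for the sheet, with the real header in its first row and one unnamed column. Reading the file with `pd.read_excel(..., header=1)` might achieve the same thing at load time, assuming the real header sits on the second row of the sheet.

```python
import pandas as pd

# Toy frame mimicking a sheet whose real header sits in the first data row
raw = pd.DataFrame(
    [["first_name", "gender", None],
     ["Chickie", "Male", 0.46],
     ["Morly", "Male", 0.56]],
    columns=["Note", "Unnamed: 1", "Unnamed: 2"],
)

# Promote the first row to the header and keep the remaining rows
clean = raw.iloc[1:].copy()
clean.columns = raw.iloc[0]

# Keep only the columns whose promoted header is not missing
clean = clean.loc[:, clean.columns.notna()].reset_index(drop=True)
clean.columns.name = None
print(clean)
```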


In [20]:
#Dropping the (nan) header columns from the dataset as they don't exist in the original dataset

new_customers.columns = new_customers.columns.fillna('to_drop')
new_customers.drop('to_drop', axis = 1, inplace = True)

In [21]:
#changing the date of birth column into a datetime column to extract Age from it

new_customers['DOB']=pd.to_datetime(new_customers['DOB'])

In [22]:
# This function converts given date to age

def from_dob_to_age(born):
    today = dt.date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

In [23]:
#creating a new Age column in the dataset

new_customers['Age']=new_customers['DOB'].apply(lambda x: from_dob_to_age(x))
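
Because `from_dob_to_age` uses today's date, the ages change every time the notebook is rerun. A minimal sketch of a reproducible, vectorized variant follows. The helper name `age_on` and the reference date are assumptions chosen for illustration; the two dates of birth are the ones shown for the first two new customers.

```python
import pandas as pd

def age_on(dob: pd.Series, ref: pd.Timestamp) -> pd.Series:
    """Whole years between each date of birth and ref; missing dates stay missing."""
    # True where the birthday has not yet occurred in the reference year
    before_birthday = (dob.dt.month > ref.month) | (
        (dob.dt.month == ref.month) & (dob.dt.day > ref.day)
    )
    age = ref.year - dob.dt.year - before_birthday.astype(int)
    return age.where(dob.notna())

dob = pd.to_datetime(pd.Series(["1957-07-12", "1970-03-22", None]))
# Assumed reference date, fixed so that reruns give the same ages
ages = age_on(dob, pd.Timestamp("2021-12-31"))
print(ages.tolist())
```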

In [24]:
# dropping columns I will not use in the modeling

# new_customers.drop(columns=['first_name','last_name','postcode','address','deceased_indicator','country','DOB','Rank','Value'],inplace=True)

In [25]:
new_customers.head(1)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,Rank,Value,Age
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,14,45 Shopko Center,4500,QLD,Australia,6,1,1.71875,64.0


In [26]:
new_customers.shape

(1000, 19)

In [27]:
# checking for duplicates in the dataset

new_customers[new_customers.duplicated()]

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,Rank,Value,Age


No duplicates in the data

In [28]:
#checking for null values

new_customers.isnull().sum()

first_name                               0
last_name                               29
gender                                   0
past_3_years_bike_related_purchases      0
DOB                                     17
job_title                              106
job_industry_category                  165
wealth_segment                           0
deceased_indicator                       0
owns_car                                 0
tenure                                   0
address                                  0
postcode                                 0
state                                    0
country                                  0
property_valuation                       0
Rank                                     0
Value                                    0
Age                                     17
dtype: int64

In [29]:
# Percent of missing values in each column

(new_customers.isna().sum() / new_customers.shape[0]) * 100

first_name                              0.0
last_name                               2.9
gender                                  0.0
past_3_years_bike_related_purchases     0.0
DOB                                     1.7
job_title                              10.6
job_industry_category                  16.5
wealth_segment                          0.0
deceased_indicator                      0.0
owns_car                                0.0
tenure                                  0.0
address                                 0.0
postcode                                0.0
state                                   0.0
country                                 0.0
property_valuation                      0.0
Rank                                    0.0
Value                                   0.0
Age                                     1.7
dtype: float64

In [30]:
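# we will fill in the job_title column with the mode value, which is the most repeated value in the column
# assigning the result back avoids relying on an in-place update of a column selection

C=new_customers['job_title'].mode()[0]
new_customers['job_title']=new_customers['job_title'].fillna(C)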

In [31]:
# we will fill in the job_industry_category column with the mode value, which is the most repeated value in the column
# assigning the result back avoids relying on an in-place update of a column selection

J=new_customers['job_industry_category'].mode()[0]
new_customers['job_industry_category']=new_customers['job_industry_category'].fillna(J)

In [32]:
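# removing null values from Age
# Note: this call acts on the Age column only and does not remove rows from
# new_customers, which is why the null counts in the next cell are unchanged

new_customers['Age'].dropna(inplace=True,axis=0)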

In [33]:
#checking for final result

new_customers.isnull().sum()

first_name                              0
last_name                              29
gender                                  0
past_3_years_bike_related_purchases     0
DOB                                    17
job_title                               0
job_industry_category                   0
wealth_segment                          0
deceased_indicator                      0
owns_car                                0
tenure                                  0
address                                 0
postcode                                0
state                                   0
country                                 0
property_valuation                      0
Rank                                    0
Value                                   0
Age                                    17
dtype: int64
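
The unchanged null counts above come from calling `dropna` on a single column. A small sketch of the difference between dropping from a Series and dropping rows from the DataFrame, using a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"first_name": ["A", "B", "C"], "Age": [64.0, np.nan, 51.0]})

# Dropping from the column returns a shorter Series; the frame keeps all rows
ages_only = toy["Age"].dropna()

# Row-wise drop: remove every record whose Age is missing
toy_clean = toy.dropna(subset=["Age"])

print(len(toy), len(ages_only), len(toy_clean))
```

Applied to this notebook, the row-wise form would be `new_customers.dropna(subset=['Age'])`, which would also change the row counts reported in later cells.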

Checking the consistency of values in the categorical columns

In [34]:
# Collecting the categorical (object dtype) columns into a list

cat_col=new_customers.select_dtypes(include='object').columns.tolist()
cat_col

['first_name',
 'last_name',
 'gender',
 'past_3_years_bike_related_purchases',
 'job_title',
 'job_industry_category',
 'wealth_segment',
 'deceased_indicator',
 'owns_car',
 'tenure',
 'address',
 'postcode',
 'state',
 'country',
 'property_valuation',
 'Rank',
 'Value']

In [35]:
#checking for duplicated values in the categorical columns and the accuracy of the values

for col in cat_col:
    print(col)
    print(new_customers[col].value_counts())
    print()
    print('*******')
    print()

first_name
Mandie      3
Dorian      3
Rozamond    3
Marcelia    2
Wheeler     2
           ..
Alexina     1
Melany      1
Jamison     1
Gilli       1
Fabio       1
Name: first_name, Length: 940, dtype: int64

*******

last_name
Eade             2
Hallt            2
Van den Velde    2
Burgoine         2
Borsi            2
                ..
Rouchy           1
Seekings         1
Skettles         1
Stirland         1
Earley           1
Name: last_name, Length: 961, dtype: int64

*******

gender
Female    513
Male      470
U          17
Name: gender, dtype: int64

*******

past_3_years_bike_related_purchases
60    20
59    18
70    17
42    17
37    16
      ..
9      5
19     5
92     5
85     4
20     3
Name: past_3_years_bike_related_purchases, Length: 100, dtype: int64

*******

job_title
Associate Professor             121
Software Consultant              14
Environmental Tech               14
Chief Design Engineer            13
Assistant Media Planner          12
                   

We will drop the rows with the gender value (U) because it is inconsistent with the other values in the column

In [36]:
# dropping U from Gender

new_customers=new_customers[new_customers.gender!='U']

## New customers dataset Features Engineering

In [37]:
new_customers.head(1)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,Rank,Value,Age
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,14,45 Shopko Center,4500,QLD,Australia,6,1,1.71875,64.0


In [38]:
# changing categorical data columns into binary using one hot coding

gender_new=new_customers[['gender']]
gender_new=pd.get_dummies(gender_new,drop_first=True)
gender_new.head()

Unnamed: 0,gender_Male
1,1
2,1
3,0
4,0
5,0


In [39]:
# changing job_industry_category_new categorical column into binary using one hot coding

job_industry_category_new=new_customers[['job_industry_category']]
job_industry_category_new=pd.get_dummies(job_industry_category_new,drop_first=True)
job_industry_category_new.head()

Unnamed: 0,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications
1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,1,0,0
3,0,1,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0
5,0,1,0,0,0,0,0,0


In [40]:
# changing owns_car_new categorical column into binary using one hot coding

owns_car_new=new_customers[['owns_car']]
owns_car_new=pd.get_dummies(owns_car_new,drop_first=True)
owns_car_new.head()

Unnamed: 0,owns_car_Yes
1,1
2,0
3,0
4,1
5,0


Converting the wealth_segment column into integer codes using a label encoder, since I treat it as an ordinal category column

In [41]:
#Transforming using label_encoder

from sklearn.preprocessing import LabelEncoder
new_customers['wealth_segment']=LabelEncoder().fit_transform(new_customers['wealth_segment'])

In [42]:
#creating a new dataframe with numerical values only

new_customers1=new_customers[['past_3_years_bike_related_purchases','tenure','Age','property_valuation','wealth_segment']]

In [43]:
#Concatenating transformed categorical columns with the new_customer numerical dataframe

new_customers1=pd.concat([gender_new,job_industry_category_new,owns_car_new,new_customers1],axis=1)

Now checking the two transformed datasets

In [44]:
old_customers1.head(1)

Unnamed: 0,gender_Male,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications,owns_car_Yes,past_3_years_bike_related_purchases,tenure,Age,property_valuation,wealth_segment
0,1,0,1,0,0,0,0,0,0,1,19,10.0,66.0,6,2


In [45]:
old_customers1.shape

(19327, 15)

In [46]:
new_customers1.head(1)

Unnamed: 0,gender_Male,job_industry_category_Entertainment,job_industry_category_Financial Services,job_industry_category_Health,job_industry_category_IT,job_industry_category_Manufacturing,job_industry_category_Property,job_industry_category_Retail,job_industry_category_Telecommunications,owns_car_Yes,past_3_years_bike_related_purchases,tenure,Age,property_valuation,wealth_segment
1,1,0,0,0,0,1,0,0,0,1,86,14,64.0,6,2


In [47]:
new_customers1.shape

(983, 15)
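
Both frames have 15 columns here, but one-hot encoding two datasets separately can yield different columns if a category is missing from one of them. As a hedged sketch, the new data can be aligned to the training columns with `reindex`. The frames `train_X` and `new_X` are made-up examples, not the notebook's data.

```python
import pandas as pd

train_X = pd.DataFrame({
    "gender_Male": [1, 0],
    "job_industry_category_IT": [0, 1],
    "Age": [66.0, 40.0],
})

# New data lacks the IT indicator and lists its columns in a different order
new_X = pd.DataFrame({"Age": [64.0], "gender_Male": [1]})

# Reorder to the training layout and fill absent indicator columns with 0
aligned = new_X.reindex(columns=train_X.columns, fill_value=0)
print(aligned)
```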

# Model building

We will train the ML model on the old customers dataset and predict on the new customers dataset

In [48]:
# import necessary Libraries

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

In [49]:
# Split our data

train_features, test_features, train_labels, test_labels = train_test_split(old_customers1,
                                                                            old_customers['RFM_loyalty_level'],
                                                                            test_size= 0.25, random_state=10,)

In [50]:
# Train our decision tree

tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)

# Predict the labels for the test data

pred_labels_tree = tree.predict(test_features)

In [51]:
# Create the classification report for the decision tree

class_rep_tree = classification_report(test_labels, pred_labels_tree)


In [52]:
#View the performance of the model

print("Decision Tree: \n", class_rep_tree)

Decision Tree: 
               precision    recall  f1-score   support

      Bronze       0.99      0.91      0.95       466
        Gold       0.98      0.99      0.99      1116
    Platinum       0.99      1.00      0.99      2343
      Silver       0.99      0.98      0.98       907

    accuracy                           0.99      4832
   macro avg       0.99      0.97      0.98      4832
weighted avg       0.99      0.99      0.99      4832
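
These scores should be read with some care: the training frame has one row per transaction, so a customer with several transactions (such as customer 2950 in the first output) can appear in both the train and test sets with identical features and the same label. A sketch of a group-aware split that keeps each customer on one side follows. The `toy` frame is made up, and in the notebook the groups would have to come from `CTA_rfm['customer_id']`, since `old_customers` does not keep that column.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: two transaction rows per customer, identical features and label
toy = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "Age": [66, 66, 40, 40, 35, 35, 52, 52],
    "RFM_loyalty_level": ["Bronze", "Bronze", "Gold", "Gold",
                          "Silver", "Silver", "Platinum", "Platinum"],
})

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=10)
train_idx, test_idx = next(
    splitter.split(toy[["Age"]], toy["RFM_loyalty_level"], groups=toy["customer_id"])
)

train_ids = set(toy.iloc[train_idx]["customer_id"])
test_ids = set(toy.iloc[test_idx]["customer_id"])
# No customer should appear on both sides of the split
print(train_ids.isdisjoint(test_ids))
```

Evaluating on customers the model has never seen is closer to the actual task, which is scoring brand-new customers.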



In [53]:
# predict the new data

output_label = tree.predict(new_customers1)

In [54]:
#print the new predicted labels

output_label

array(['Silver', 'Silver', 'Silver', 'Platinum', 'Platinum', 'Gold',
       'Gold', 'Gold', 'Bronze', 'Gold', 'Platinum', 'Gold', 'Platinum',
       'Gold', 'Platinum', 'Gold', 'Silver', 'Silver', 'Bronze', 'Silver',
       'Platinum', 'Gold', 'Platinum', 'Bronze', 'Platinum', 'Platinum',
       'Platinum', 'Silver', 'Silver', 'Silver', 'Silver', 'Gold', 'Gold',
       'Bronze', 'Silver', 'Platinum', 'Platinum', 'Gold', 'Platinum',
       'Gold', 'Platinum', 'Gold', 'Platinum', 'Silver', 'Silver', 'Gold',
       'Bronze', 'Platinum', 'Platinum', 'Bronze', 'Platinum', 'Platinum',
       'Platinum', 'Silver', 'Platinum', 'Silver', 'Bronze', 'Platinum',
       'Platinum', 'Gold', 'Platinum', 'Bronze', 'Silver', 'Bronze',
       'Silver', 'Gold', 'Bronze', 'Platinum', 'Silver', 'Platinum',
       'Bronze', 'Platinum', 'Silver', 'Platinum', 'Gold', 'Silver',
       'Silver', 'Gold', 'Gold', 'Silver', 'Gold', 'Silver', 'Platinum',
       'Silver', 'Silver', 'Bronze', 'Gold', 'Silver', 'Silve

Now adding the predicted array to the new customers dataset as a dataframe column

In [55]:
new_customers['RFM_segments_predicted']=output_label.tolist()

In [56]:
new_customers.head(2)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,tenure,address,postcode,state,country,property_valuation,Rank,Value,Age,RFM_segments_predicted
1,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,2,N,Yes,14,45 Shopko Center,4500,QLD,Australia,6,1,1.71875,64.0,Silver
2,Morly,Genery,Male,69,1970-03-22,Structural Engineer,Property,2,N,No,16,14 Mccormick Park,2113,NSW,Australia,11,1,1.71875,51.0,Silver


In [57]:
new_customers[['first_name','last_name','gender','RFM_segments_predicted']]

Unnamed: 0,first_name,last_name,gender,RFM_segments_predicted
1,Chickie,Brister,Male,Silver
2,Morly,Genery,Male,Silver
3,Ardelis,Forrester,Female,Silver
4,Lucine,Stutt,Female,Platinum
5,Melinda,Hadlee,Female,Platinum
...,...,...,...,...
996,Ferdinand,Romanetti,Male,Platinum
997,Burk,Wortley,Male,Silver
998,Melloney,Temby,Female,Silver
999,Dickie,Cubbini,Male,Platinum
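
To turn the predicted segments into a recommendation, one option is to count customers per segment and shortlist the higher tiers. This is only a sketch: the rule in `target_segments` is a hypothetical choice, and `scored` holds just the first four predictions shown above.

```python
import pandas as pd

scored = pd.DataFrame({
    "first_name": ["Chickie", "Morly", "Lucine", "Melinda"],
    "RFM_segments_predicted": ["Silver", "Silver", "Platinum", "Platinum"],
})

# How many customers fall into each predicted segment
segment_counts = scored["RFM_segments_predicted"].value_counts()

# Hypothetical targeting rule: shortlist the top two tiers
target_segments = ["Platinum", "Gold"]
targets = scored[scored["RFM_segments_predicted"].isin(target_segments)]

print(segment_counts)
print(targets)
```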
