# Sprocket Central Pty Ltd Company Customers Recommendation Project Phase #3 Data Modeling

## KPMG Virtual Internship

<img src="sprocket_central.png">

### About the Dataset:
**Sprocket Central Pty Ltd** is a medium-sized bikes and cycling accessories organisation. It holds a large dataset on its customers, but its team is unsure how to analyse it effectively to optimise the marketing strategy.

#### Dataset:
1. **Training Dataset:** The dataset generated in Phase #2, "EDA & RFM Analysis".
2. **New Customer List:** The client-provided dataset of potential new customers, from which we recommend customers for the marketing team to target.

#### In this Phase I'll:
1. Check that the "customer_data_RFM_Analysis.csv" dataset is ready for modeling.
2. Clean and optimize the "NewCustomerList" dataset so it is ready for the model.
3. Train two machine learning models on the training dataset and measure the accuracy of each to choose one to deploy on the new customer dataset.
4. Deploy the chosen model on the new customer dataset and export the resulting recommendations to a CSV file.
5. Create a Tableau dashboard to report the results (Dashboard Link).


In [1]:
#importing the necessary data analytics libraries
import pandas as pd
import numpy as np
import datetime as dt
import calendar

# ML Modeling
from sklearn.preprocessing import LabelEncoder

# suppress warnings 
import warnings
warnings.simplefilter("ignore")


## [1] Check that the "customer_data_RFM_Analysis.csv" dataset is ready for modeling:

In [2]:
old_df = pd.read_csv("customer_data_RFM_Analysis.csv")
old_df.head(1)

Unnamed: 0.1,Unnamed: 0,transaction_id,product_id,customer_id,transaction_date,online_order,order_status,brand,product_line,product_class,...,standard_profit,recency,frequency,monetary,R,F,M,RFMClass,RFMscore,RFM_loyalty_level
0,0,1,2,2950,2017-02-25,False,Approved,Solex,Standard,medium,...,17.87,76,3,645.99,3,4,4,344,11,Bronze


In [3]:
old_df.drop(axis=1, columns='Unnamed: 0', inplace=True)
old_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19585 entries, 0 to 19584
Data columns (total 40 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   transaction_id                       19585 non-null  int64  
 1   product_id                           19585 non-null  int64  
 2   customer_id                          19585 non-null  int64  
 3   transaction_date                     19585 non-null  object 
 4   online_order                         19585 non-null  bool   
 5   order_status                         19585 non-null  object 
 6   brand                                19585 non-null  object 
 7   product_line                         19585 non-null  object 
 8   product_class                        19585 non-null  object 
 9   product_size                         19585 non-null  object 
 10  list_price                           19585 non-null  float64
 11  standard_cost               

### Creating old_df_1 dataset with only the features to be used in modeling.

In [4]:
old_df_1 = old_df[['gender', 'past_3_years_bike_related_purchases','Age', 'job_industry_category','wealth_segment',
                  'owns_car','tenure','postcode','state','property_valuation']]

In [5]:
old_df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19585 entries, 0 to 19584
Data columns (total 10 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   gender                               19585 non-null  object 
 1   past_3_years_bike_related_purchases  19585 non-null  float64
 2   Age                                  19585 non-null  int64  
 3   job_industry_category                19585 non-null  object 
 4   wealth_segment                       19585 non-null  object 
 5   owns_car                             19585 non-null  object 
 6   tenure                               19585 non-null  int64  
 7   postcode                             19585 non-null  float64
 8   state                                19585 non-null  object 
 9   property_valuation                   19585 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 1.1+ MB


### Cleaning Notes (Preparing Old Customer List for Modeling):
1. Transform (gender | job_industry_category | owns_car | state) into binary columns (nominal columns).
2. Convert (wealth_segment) into integer codes using a label encoder (ordinal category).
3. Change (postcode) into int.
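
The three notes above can be sketched in a few lines on toy data. This is only an illustration, not the notebook's exact code: the rows are made up, `get_dummies` is called once with `columns=` (so the dummy columns carry a prefix such as `gender_Male`), and the `wealth_order` mapping is an assumed ordering. `LabelEncoder`, which the cells below use, assigns its codes in alphabetical order instead.

```python
import pandas as pd

# Toy rows using the same column names as the notebook; the values are made up.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],
    "owns_car": ["Yes", "No", "Yes"],
    "state": ["VIC", "NSW", "QLD"],
    "wealth_segment": ["Mass Customer", "High Net Worth", "Affluent Customer"],
    "postcode": [3064.0, 2000.0, 4500.0],
})

# One-hot encode all nominal columns in one call; drop_first removes the
# alphabetically first level of each column to avoid redundant columns.
encoded = pd.get_dummies(toy, columns=["gender", "owns_car", "state"], drop_first=True)

# Encode wealth_segment with an explicit mapping (this order is an assumption).
wealth_order = {"Mass Customer": 0, "Affluent Customer": 1, "High Net Worth": 2}
encoded["wealth_segment"] = encoded["wealth_segment"].map(wealth_order)

# Postcodes are whole numbers, so store them as integers.
encoded["postcode"] = encoded["postcode"].astype(int)

print(encoded.columns.tolist())
```

An explicit mapping keeps the integer codes in a chosen order, which may matter if the segment is really meant to be ordinal.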

In [6]:
old_df_1 = pd.concat([pd.get_dummies(old_df_1.gender,drop_first=True),old_df_1], axis=1)
old_df_1.drop(axis=1, columns='gender', inplace=True)
old_df_1.head(1)

Unnamed: 0,Male,past_3_years_bike_related_purchases,Age,job_industry_category,wealth_segment,owns_car,tenure,postcode,state,property_valuation
0,1,19.0,67,Financial Services,Mass Customer,Yes,10,3064.0,VIC,6


In [7]:
old_df_1 = pd.concat([pd.get_dummies(old_df_1.job_industry_category,drop_first=True),old_df_1], axis=1)
old_df_1.drop(axis=1, columns='job_industry_category', inplace=True)
old_df_1.head(1)

Unnamed: 0,Entertainment,Financial Services,Health,IT,Manufacturing,Property,Retail,Telecommunications,Male,past_3_years_bike_related_purchases,Age,wealth_segment,owns_car,tenure,postcode,state,property_valuation
0,0,1,0,0,0,0,0,0,1,19.0,67,Mass Customer,Yes,10,3064.0,VIC,6


In [8]:
old_df_1 = pd.concat([pd.get_dummies(old_df_1.owns_car,drop_first=True),old_df_1], axis=1)
old_df_1.drop(axis=1, columns='owns_car', inplace=True)
old_df_1.rename(columns={'Yes':'owns_car_Yes'}, inplace=True)
old_df_1.head(1)

Unnamed: 0,owns_car_Yes,Entertainment,Financial Services,Health,IT,Manufacturing,Property,Retail,Telecommunications,Male,past_3_years_bike_related_purchases,Age,wealth_segment,tenure,postcode,state,property_valuation
0,1,0,1,0,0,0,0,0,0,1,19.0,67,Mass Customer,10,3064.0,VIC,6


In [9]:
old_df_1 = pd.concat([pd.get_dummies(old_df_1.state,drop_first=True),old_df_1], axis=1)
old_df_1.drop(axis=1, columns='state', inplace=True)
old_df_1.head(1)

Unnamed: 0,QLD,VIC,owns_car_Yes,Entertainment,Financial Services,Health,IT,Manufacturing,Property,Retail,Telecommunications,Male,past_3_years_bike_related_purchases,Age,wealth_segment,tenure,postcode,property_valuation
0,0,1,1,0,1,0,0,0,0,0,0,1,19.0,67,Mass Customer,10,3064.0,6


In [10]:
# encoding the wealth_segment column as integer codes using LabelEncoder
old_df_1['wealth_segment']=LabelEncoder().fit_transform(old_df_1['wealth_segment'])

In [11]:
old_df_1.postcode = old_df_1.postcode.astype(int)

In [12]:
old_df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19585 entries, 0 to 19584
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   QLD                                  19585 non-null  uint8  
 1   VIC                                  19585 non-null  uint8  
 2   owns_car_Yes                         19585 non-null  uint8  
 3   Entertainment                        19585 non-null  uint8  
 4   Financial Services                   19585 non-null  uint8  
 5   Health                               19585 non-null  uint8  
 6   IT                                   19585 non-null  uint8  
 7   Manufacturing                        19585 non-null  uint8  
 8   Property                             19585 non-null  uint8  
 9   Retail                               19585 non-null  uint8  
 10  Telecommunications                   19585 non-null  uint8  
 11  Male                        

## [2] New Customer List (Data Wrangling, Cleaning, and Feature Engineering):

In [13]:
# Excel file name: 
filename = "KPMG_VI_New_raw_data_update_final.xlsx"

xls = pd.ExcelFile(filename)
sheets = xls.sheet_names
sheets

['Title Sheet',
 'Transactions',
 'NewCustomerList',
 'CustomerDemographic',
 'CustomerAddress']

In [14]:
# reading the NewCustomerList sheet as the new_df DataFrame
new_df = pd.read_excel(xls, sheet_name=sheets[2], skiprows=1, na_values="n/a")
new_df.head(2)

Unnamed: 0,first_name,last_name,gender,past_3_years_bike_related_purchases,DOB,job_title,job_industry_category,wealth_segment,deceased_indicator,owns_car,...,state,country,property_valuation,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Rank,Value
0,Chickie,Brister,Male,86,1957-07-12,General Manager,Manufacturing,Mass Customer,N,Yes,...,QLD,Australia,6,0.56,0.7,0.875,0.74375,1,1,1.71875
1,Morly,Genery,Male,69,1970-03-22,Structural Engineer,Property,Mass Customer,N,No,...,NSW,Australia,11,0.89,0.89,1.1125,0.945625,1,1,1.71875


In [15]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   first_name                           1000 non-null   object        
 1   last_name                            971 non-null    object        
 2   gender                               1000 non-null   object        
 3   past_3_years_bike_related_purchases  1000 non-null   int64         
 4   DOB                                  983 non-null    datetime64[ns]
 5   job_title                            894 non-null    object        
 6   job_industry_category                835 non-null    object        
 7   wealth_segment                       1000 non-null   object        
 8   deceased_indicator                   1000 non-null   object        
 9   owns_car                             1000 non-null   object        
 10  tenure       

In [16]:
new_df.drop(axis=1, columns=new_df.columns[16:21], inplace=True)
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   first_name                           1000 non-null   object        
 1   last_name                            971 non-null    object        
 2   gender                               1000 non-null   object        
 3   past_3_years_bike_related_purchases  1000 non-null   int64         
 4   DOB                                  983 non-null    datetime64[ns]
 5   job_title                            894 non-null    object        
 6   job_industry_category                835 non-null    object        
 7   wealth_segment                       1000 non-null   object        
 8   deceased_indicator                   1000 non-null   object        
 9   owns_car                             1000 non-null   object        
 10  tenure       

### Data Cleaning Notes:
1. Fill NaN in (last_name) with 'Not Available'.
2. Drop rows with NaN in (DOB).
3. Fill NaN in (job_title | job_industry_category).
4. Create 'Age' column from 'DOB' as int.

In [17]:
new_df['last_name'] = new_df['last_name'].fillna('Not Available')
new_df.last_name.isnull().sum()

0

In [18]:
new_df.dropna(subset=['DOB'], inplace=True)
new_df[['DOB']].isna().sum()

DOB    0
dtype: int64

In [19]:
# Fill the null values in job_title by sampling, for each missing row, from the top 12.5% most frequent job titles to avoid misleadingly skewed data.
title_counts = new_df.job_title.value_counts()
n_p = int(len(title_counts)*12.5/100)
missing_title = new_df.job_title.isna()
new_df.loc[missing_title, 'job_title'] = np.random.choice(title_counts.index[:n_p+1].to_numpy(), size=missing_title.sum())

# Fill the null values in job_industry_category by sampling, for each missing row, from its unique values with probability proportional to each value's frequency.
v_list = new_df.job_industry_category.value_counts()
p_list = (v_list/v_list.sum()).to_numpy()
missing_industry = new_df.job_industry_category.isna()
new_df.loc[missing_industry, 'job_industry_category'] = np.random.choice(v_list.index.to_numpy(), size=missing_industry.sum(), p=p_list)

#check
new_df[['job_title','job_industry_category']].isna().sum()

job_title                0
job_industry_category    0
dtype: int64
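
A minimal sketch of frequency-weighted imputation on a toy column, where every missing row gets its own random draw. The category values and the seeded generator `rng` are assumptions for illustration only. Passing `size=` to the sampler is what produces one draw per missing row; without it, a single value is drawn and reused for all of them.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Toy column with missing values; the categories are illustrative.
s = pd.Series(["Health", "IT", "Health", None, "Retail", None, "Health", None])

# Observed relative frequencies become the sampling probabilities.
freq = s.value_counts(normalize=True)

# Draw one value per missing row, weighted by how often each value occurs.
missing = s.isna()
s.loc[missing] = rng.choice(freq.index.to_numpy(), size=missing.sum(), p=freq.to_numpy())

print(s.isna().sum())
```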

In [20]:
#creating the Age column from the DOB column
today = dt.date.today()
new_df['Age'] = new_df.DOB.apply(lambda d: today.year - d.year - ((today.month, today.day) < (d.month, d.day)))
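
To see how the age expression works, here is a small worked example with a fixed reference date. The helper name `age_on` and the reference date are hypothetical, chosen only to show both branches of the comparison.

```python
import datetime as dt

def age_on(reference: dt.date, dob: dt.date) -> int:
    """Completed years between dob and reference."""
    # The tuple comparison is True (counted as 1) when the birthday has not
    # yet occurred in the reference year, so one year is subtracted.
    return reference.year - dob.year - ((reference.month, reference.day) < (dob.month, dob.day))

reference = dt.date(2021, 6, 1)  # hypothetical fixed date

print(age_on(reference, dt.date(1957, 7, 12)))  # birthday still ahead -> 63
print(age_on(reference, dt.date(1970, 3, 22)))  # birthday already passed -> 51
```

Because the notebook uses `dt.date.today()`, the Age values depend on the day the cell is run.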

### Creating a new_df_1 dataset with only the features used in the modeling stage:

In [21]:
new_df_1= new_df[['gender', 'past_3_years_bike_related_purchases','Age', 'job_industry_category','wealth_segment',
                  'owns_car','tenure','postcode','state','property_valuation']]

### Additional Cleaning Notes:
1. Transform (gender | job_industry_category | owns_car | state) into binary columns (nominal columns).
2. Convert (wealth_segment) into integer codes using a label encoder (ordinal category).
3. Change (postcode) into int.

In [22]:
new_df_1 = pd.concat([pd.get_dummies(new_df_1.gender,drop_first=True),new_df_1], axis=1)
new_df_1.drop(axis=1, columns='gender', inplace=True)
new_df_1.head(1)

Unnamed: 0,Male,past_3_years_bike_related_purchases,Age,job_industry_category,wealth_segment,owns_car,tenure,postcode,state,property_valuation
0,1,86,64,Manufacturing,Mass Customer,Yes,14,4500,QLD,6


In [23]:
new_df_1 = pd.concat([pd.get_dummies(new_df_1.job_industry_category,drop_first=True),new_df_1], axis=1)
new_df_1.drop(axis=1, columns='job_industry_category', inplace=True)
new_df_1.head(1)

Unnamed: 0,Entertainment,Financial Services,Health,IT,Manufacturing,Property,Retail,Telecommunications,Male,past_3_years_bike_related_purchases,Age,wealth_segment,owns_car,tenure,postcode,state,property_valuation
0,0,0,0,0,1,0,0,0,1,86,64,Mass Customer,Yes,14,4500,QLD,6


In [24]:
new_df_1 = pd.concat([pd.get_dummies(new_df_1.owns_car,drop_first=True),new_df_1], axis=1)
new_df_1.drop(axis=1, columns='owns_car', inplace=True)
new_df_1.rename(columns={'Yes':'owns_car_Yes'}, inplace=True)
new_df_1.head(1)

Unnamed: 0,owns_car_Yes,Entertainment,Financial Services,Health,IT,Manufacturing,Property,Retail,Telecommunications,Male,past_3_years_bike_related_purchases,Age,wealth_segment,tenure,postcode,state,property_valuation
0,1,0,0,0,0,1,0,0,0,1,86,64,Mass Customer,14,4500,QLD,6


In [25]:
new_df_1 = pd.concat([pd.get_dummies(new_df_1.state,drop_first=True),new_df_1], axis=1)
new_df_1.drop(axis=1, columns='state', inplace=True)
new_df_1.head(1)

Unnamed: 0,QLD,VIC,owns_car_Yes,Entertainment,Financial Services,Health,IT,Manufacturing,Property,Retail,Telecommunications,Male,past_3_years_bike_related_purchases,Age,wealth_segment,tenure,postcode,property_valuation
0,1,0,1,0,0,0,0,1,0,0,0,1,86,64,Mass Customer,14,4500,6


In [26]:
# encoding the wealth_segment column as integer codes using LabelEncoder
new_df_1['wealth_segment']=LabelEncoder().fit_transform(new_df_1['wealth_segment'])
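
A fresh `LabelEncoder` fitted on each dataset produces matching codes only when both datasets contain the same set of categories. The sketch below, on toy values, shows the difference between reusing one fitted encoder and fitting a second one. The segment names in the lists are illustrative, and the second list deliberately misses one category.

```python
from sklearn.preprocessing import LabelEncoder

# Toy category values; the second list stands in for a new dataset that
# happens to miss one category.
train_segments = ["Mass Customer", "High Net Worth", "Affluent Customer", "Mass Customer"]
new_segments = ["Mass Customer", "High Net Worth"]

# Fit once on the training values, then reuse the same encoder.
encoder = LabelEncoder().fit(train_segments)
shared_codes = encoder.transform(new_segments)

# A separately fitted encoder assigns different codes to the same labels.
separate_codes = LabelEncoder().fit_transform(new_segments)

print(list(encoder.classes_))
print(shared_codes.tolist(), separate_codes.tolist())
```

If both datasets contain every category, the two approaches give the same codes, so this is a safeguard rather than a required change.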

In [27]:
new_df_1.postcode = new_df_1.postcode.astype(int)

In [28]:
new_df_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 983 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                               Non-Null Count  Dtype
---  ------                               --------------  -----
 0   QLD                                  983 non-null    uint8
 1   VIC                                  983 non-null    uint8
 2   owns_car_Yes                         983 non-null    uint8
 3   Entertainment                        983 non-null    uint8
 4   Financial Services                   983 non-null    uint8
 5   Health                               983 non-null    uint8
 6   IT                                   983 non-null    uint8
 7   Manufacturing                        983 non-null    uint8
 8   Property                             983 non-null    uint8
 9   Retail                               983 non-null    uint8
 10  Telecommunications                   983 non-null    uint8
 11  Male                                 983 non-null    uint8

In [29]:
# check that both datasets have the same columns in the same order
assert old_df_1.columns.equals(new_df_1.columns)
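
If the new data ever lacks a category that the training data has, the dummy columns will not line up. One possible way to handle that, sketched here on toy frames (the frames and column names are assumptions, not the notebook's data), is to reindex the new features to the training column layout.

```python
import pandas as pd

# Toy training and scoring frames; the scoring data lacks one state level
# and has its columns in a different order.
train = pd.DataFrame({"state": ["NSW", "QLD", "VIC"], "tenure": [10, 4, 7]})
score = pd.DataFrame({"tenure": [14, 2], "state": ["QLD", "NSW"]})

train_X = pd.get_dummies(train, columns=["state"], drop_first=True)
score_X = pd.get_dummies(score, columns=["state"], drop_first=True)

# Reuse the training column layout; dummy columns absent from the scoring
# data are created and filled with 0.
score_X = score_X.reindex(columns=train_X.columns, fill_value=0)

assert score_X.columns.equals(train_X.columns)
print(score_X)
```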

# Model building:
* **We will train 2 ML models on the old customers dataset,**
* **Choose the one with the more precise predictions, and**
* **Use it to Predict on the new customers dataset.**

In [30]:
# Splitting old_df_1 dataset
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(old_df_1,
                                                                            old_df['RFM_loyalty_level'],
                                                                            test_size= 0.25, random_state=10,)
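
As a quick sanity check on the split sizes, the sketch below runs the same `train_test_split` settings on placeholder arrays with the same number of rows as the training data (19585). The zero-filled arrays are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the training data.
n_rows = 19585
X = np.zeros((n_rows, 1))
y = np.zeros(n_rows)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=10)

# The test share is rounded up: ceil(0.25 * 19585) = 4897.
print(len(X_train), len(X_test))
```

The 4897 test rows match the total support shown in the classification reports.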

In [31]:
# Decision tree ML Classification Model
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=10)
tree.fit(train_features, train_labels)

# Predict the labels for the test data
pred_labels_tree = tree.predict(test_features)

In [32]:
# The classification report
from sklearn.metrics import classification_report
class_rep_tree = classification_report(test_labels, pred_labels_tree)
print("Decision Tree: \n", class_rep_tree)

Decision Tree: 
               precision    recall  f1-score   support

      Bronze       0.99      0.94      0.97       454
        Gold       0.98      0.99      0.99      1223
    Platinum       0.99      1.00      0.99      2301
      Silver       1.00      0.97      0.98       919

    accuracy                           0.99      4897
   macro avg       0.99      0.98      0.98      4897
weighted avg       0.99      0.99      0.99      4897



In [33]:
# Random Forest ML Classification Model
from sklearn.ensemble import RandomForestClassifier
rs = RandomForestClassifier()
rs.fit(train_features, train_labels)

# Predict the labels for the test data
pred_labels_rs = rs.predict(test_features)

In [34]:
# Create the classification report
class_rep_rs = classification_report(test_labels, pred_labels_rs)
print("RandomForestClassifier: \n", class_rep_rs)

RandomForestClassifier: 
               precision    recall  f1-score   support

      Bronze       1.00      0.94      0.97       454
        Gold       0.98      1.00      0.99      1223
    Platinum       0.99      1.00      0.99      2301
      Silver       0.99      0.97      0.98       919

    accuracy                           0.99      4897
   macro avg       0.99      0.98      0.98      4897
weighted avg       0.99      0.99      0.99      4897



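**Both models score almost identically on the test set (0.99 accuracy and 0.99 macro-average precision for each), so precision alone does not separate them. The Random Forest classifier is used for the final predictions.**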
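
When two classification reports look this close, comparing single summary numbers can be easier than reading the tables. A minimal sketch with toy labels and two hypothetical prediction lists (`model_a` and `model_b` are not the notebook's models):

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy labels and predictions from two hypothetical models.
y_true = ["Gold", "Gold", "Silver", "Silver", "Bronze", "Bronze"]
model_a = ["Gold", "Gold", "Silver", "Gold", "Bronze", "Bronze"]
model_b = ["Gold", "Silver", "Silver", "Gold", "Bronze", "Silver"]

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    acc = accuracy_score(y_true, pred)
    # Macro averaging gives every class the same weight, whatever its size.
    macro_p = precision_score(y_true, pred, average="macro")
    print(f"{name}: accuracy={acc:.3f}, macro precision={macro_p:.3f}")
```

The same two calls could be applied to `test_labels` with `pred_labels_tree` and `pred_labels_rs` to compare the models beyond two decimal places.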

In [35]:
# predict the new segments using the random forest model
output_label = rs.predict(new_df_1)

In [36]:
#converting an array into a dataframe column
new_df['RFM_segments_predicted']=output_label.tolist()

In [37]:
#checking final results
new_df[['first_name','last_name','gender','RFM_segments_predicted']]

Unnamed: 0,first_name,last_name,gender,RFM_segments_predicted
0,Chickie,Brister,Male,Platinum
1,Morly,Genery,Male,Gold
2,Ardelis,Forrester,Female,Bronze
3,Lucine,Stutt,Female,Platinum
4,Melinda,Hadlee,Female,Gold
...,...,...,...,...
995,Ferdinand,Romanetti,Male,Platinum
996,Burk,Wortley,Male,Silver
997,Melloney,Temby,Female,Platinum
998,Dickie,Cubbini,Male,Platinum


In [38]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 983 entries, 0 to 999
Data columns (total 20 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   first_name                           983 non-null    object        
 1   last_name                            983 non-null    object        
 2   gender                               983 non-null    object        
 3   past_3_years_bike_related_purchases  983 non-null    int64         
 4   DOB                                  983 non-null    datetime64[ns]
 5   job_title                            983 non-null    object        
 6   job_industry_category                983 non-null    object        
 7   wealth_segment                       983 non-null    object        
 8   deceased_indicator                   983 non-null    object        
 9   owns_car                             983 non-null    object        
 10  tenure        

#### Extracting the predicted segments of the new customer list into a CSV file for dashboard creation.

In [39]:
new_df.to_csv("new_customer_recommendations.csv")
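
By default `to_csv` also writes the DataFrame index as an unnamed first column, which then shows up as "Unnamed: 0" when the file is read back. A small sketch on a toy frame (the values are made up) of leaving the index out with `index=False`, in case the dashboard does not need it:

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as left behind after dropping rows.
toy = pd.DataFrame(
    {"first_name": ["A", "B"], "RFM_segments_predicted": ["Gold", "Silver"]},
    index=[0, 5],
)

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # omit the row index from the output
print(buffer.getvalue())
```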