<h1>Customer Personality Analysis - Prediction of Customer Response to Marketing</h1>
<h2>Table of Contents</h2>

* [Problem Statement](#1)
    
* [Project Objectives](#2)
    
* [Importing Libraries and Read In Dataset](#3)
    
* [Split Dataset and Balance Classes](#4)   
    
* [Create Models](#5) 
      
* [Ensemble the Models](#6)

* [Conclusions](#7)

<a id="1"></a>
<h2>Problem Statement</h2>
<p>
    Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business tailor its products to target customers in different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.
</p>
<p>
    Source: <a href="https://www.kaggle.com/imakash3011/customer-personality-analysis">Kaggle - Customer Personality Analysis</a></p>

<a id="2"></a>
<h2>Project Objectives</h2>
<ol>
    <li>Determine customer traits and behaviors</li>
    <li>Group similar customers based on traits and behaviors</li>
    <li>Create a predictive model to predict which customers will respond to marketing campaigns</li>
</ol>
<h5>In this notebook I will focus on the third objective. I will read in the cleaned dataset from a previous notebook
    (1.Feat_Eng_EDA.ipynb). Then, I will encode and standardize the data before splitting it into training, validation, and
    test sets. To balance the classes (responded and did not respond), I will use a technique called SMOTE. On this balanced training set, I will create 5 weak models and evaluate each. Using these 5 models, I will
create an ensemble model that classifies each customer by majority vote. Finally, I will present my
conclusions.</h5>

<a id="3"></a>
<h2>Importing Libraries and Read In Dataset</h2>

In [145]:
#Import Libraries
import numpy as np #linear algebra
import pandas as pd #data processing
import matplotlib.pyplot as plt #data viz
import seaborn as sns #data viz
from sklearn.preprocessing import StandardScaler, OneHotEncoder #preprocessing
from sklearn.compose import ColumnTransformer #preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV #data split, grid search
from imblearn.over_sampling import SMOTE #balance classes
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.svm import SVC #support vector machine
from sklearn.neighbors import KNeighborsClassifier #knn
from sklearn.naive_bayes import GaussianNB #bayes
from xgboost import XGBClassifier #gradient boosting tree
from sklearn.metrics import accuracy_score, recall_score #calculates accuracy, recall
from sklearn.ensemble import VotingClassifier #ensemble

In [146]:
#Read in previously cleaned dataset
pd.set_option('display.max_columns', None)
clean_data = pd.read_csv('data/mc_fe.csv', index_col='ID')
clean_data.head()

ID,Education,Income,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,Complain,Response,Age,Days_Since_Customer,Fam_Size,Num_Accepted,MntTotal
5524,Graduation,58138.0,58,635,88,546,172,88,88,3,8,10,4,7,0,1,64,3509.686794,1,0,2252
2174,Graduation,46344.0,38,11,1,6,2,1,6,2,1,1,2,5,0,0,67,2663.686794,3,0,38
4141,Graduation,71613.0,26,426,49,127,111,21,42,1,8,2,10,4,0,0,56,3010.686794,2,0,1202
6182,Graduation,26646.0,26,11,4,20,10,3,5,2,2,0,4,6,0,0,37,2603.686794,3,0,64
5324,PhD,58293.0,94,173,43,118,46,27,15,5,5,3,6,5,0,0,40,2859.686794,3,0,595


<h5>The data will need some preprocessing before the predictive model is created. 'Education' needs to be encoded and the other
    columns need to be scaled. Also, the target column 'Response' needs to be separated from the features.</h5>

In [147]:
#Remove the 'Response' column because it is the target of future predictive model
X, y = clean_data.drop('Response', axis=1).values, clean_data['Response'].values

#Creates a column transformer that sends 'Education' to be encoded and rest scaled
ct = ColumnTransformer([
    ('categoric', OneHotEncoder(), [0]),
    ('numeric', StandardScaler(), list(range(1, len(X.T))))
])

#Sends the data through the column transformer
X_transformed = ct.fit_transform(X)
print('Preprocessed Data:')
print(X_transformed[0])

Preprocessed Data:
[ 0.          0.          1.          0.          0.          0.23532677
  0.30703926  0.98378127  1.55157698  1.67970233  2.46214705  1.4765001
  0.84320691  0.34941394  1.40930394  2.51089024 -0.55078479  0.69390374
 -0.09728167  0.98534473  1.97674456 -1.75911463 -0.43903713  1.4669731 ]
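
<h5>To make the preprocessing step concrete, here is a minimal pure-Python sketch of what the two transformers compute. It is not the notebook's code: the helper names <code>one_hot</code> and <code>standardize</code> are illustrative, the category list assumes the five education levels of the Kaggle dataset in alphabetical order, and the scaling uses only the five Recency values shown by <code>clean_data.head()</code>, so the numbers differ from the real output, which is scaled using all rows.</h5>

```python
from statistics import fmean, pstdev

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]

def standardize(column):
    """Scale a numeric column to zero mean and unit (population) variance."""
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0), as StandardScaler uses
    return [(x - mean) / std for x in column]

# Assumed category order (OneHotEncoder sorts the categories it finds)
education_levels = ['2n Cycle', 'Basic', 'Graduation', 'Master', 'PhD']
encoded = one_hot('Graduation', education_levels)

# Recency values of the five rows shown by clean_data.head()
recency = [58, 38, 26, 26, 94]
scaled = standardize(recency)

print(encoded)
print([round(z, 3) for z in scaled])
```

<h5>Under this assumed ordering, 'Graduation' lands in the third position, which agrees with the first five values of the preprocessed row printed above.</h5>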


<a id="4"></a>
<h2>Split Dataset and Balance Classes</h2>

In [148]:
#Split into train (70%) and test (30%)
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=8)

#Split the test set into 2 sets; 1 for test, 1 for validation
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=8)

#Display length of each set
print('Length of Each Dataset:')
print('Training Set:', len(X_train))
print('Validation Set:', len(X_val))
print('Test Set:', len(X_test))

Length of Each Dataset:
Training Set: 1568
Validation Set: 336
Test Set: 336


In [149]:
#Balance the training data set using SMOTE
#create the SMOTE object
sm = SMOTE(random_state=8)

#create new training set with SMOTE object
X_bal, y_bal = sm.fit_resample(X_train, y_train)

#Displays percent of each class
print('Initial Training Set')
print('Percent "Responded":', y_train.sum()/len(y_train))
print('Balanced Training Set')
print('Percent "Responded":', y_bal.sum()/len(y_bal))

Initial Training Set
Percent "Responded": 0.14732142857142858
Balanced Training Set
Percent "Responded": 0.5


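<h5>After applying SMOTE, the training set is perfectly balanced. This should help the models in the next section learn to recognize responders. The validation and test sets are left untouched, so they keep their original class distribution.</h5>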
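
<h5>For readers unfamiliar with SMOTE, the core idea can be sketched in a few lines: each synthetic sample is a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbors. The function below is a simplified illustration with hypothetical names and toy data, not the imbalanced-learn implementation used above (which defaults to 5 neighbors and handles many more details).</h5>

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=8):
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random point on the line segment between base and neighbor
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic

# Toy minority class with two features (illustrative values only)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

<h5>Because every synthetic point is an interpolation, it always lies between existing minority samples rather than being an exact copy of one.</h5>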

<a id="5"></a>
<h2>Create Models</h2>

<h5>For the following models, recall will be a very important metric. This is because we would rather have more False Positives (customers who will not respond to the marketing but were targeted anyway) than False Negatives (customers who would have responded to the ad but were not targeted). Recall cannot be the only criterion, however, so a balance must be struck between the accuracy and the recall of each model.</h5>
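
<h5>The trade-off between the two metrics can be illustrated with a small worked example. The confusion-matrix counts below are hypothetical (a 336-customer set with 50 responders was chosen for illustration and is not taken from the data), and the helper functions simply restate the standard definitions.</h5>

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def recall(tp, fn):
    """Share of actual responders that the model finds."""
    return tp / (tp + fn)

# Hypothetical model A: never targets anyone
a = dict(tp=0, fn=50, tn=286, fp=0)
# Hypothetical model B: targets generously, accepting more false positives
b = dict(tp=42, fn=8, tn=226, fp=60)

for name, m in [('A', a), ('B', b)]:
    acc = accuracy(m['tp'], m['fp'], m['tn'], m['fn'])
    rec = recall(m['tp'], m['fn'])
    print(name, 'accuracy:', round(acc, 3), 'recall:', round(rec, 3))
```

<h5>In this made-up example the model that never targets anyone has the higher accuracy but finds no responders, which is why accuracy alone would be misleading on an imbalanced dataset.</h5>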

<h4>Logistic Regression</h4>

In [150]:
#Create a Logistic Regression Model
#Params to test in grid search
lr_params = {'solver': ['liblinear'], 'penalty': ['l1'], 'C': [1.0, 0.5, 0.25]}

#grid search
lr_grid = GridSearchCV(LogisticRegression(), lr_params, cv=3, scoring='recall')

#fit the grid to the training set
lr_grid.fit(X_bal, y_bal)

#ID the best model
lr = lr_grid.best_estimator_

#Display Best Parameters
print('Best Parameters:', lr_grid.best_params_)

#Display the metrics for the validation set
lr_preds = lr.predict(X_val)
lr_val_acc = accuracy_score(y_val, lr_preds)
lr_val_rec = recall_score(y_val, lr_preds)
print('Logistic Regression Model Accuracy:', lr_val_acc)
print('Logistic Regression Model Recall:', lr_val_rec)

Best Parameters: {'C': 0.25, 'penalty': 'l1', 'solver': 'liblinear'}
Logistic Regression Model Accuracy: 0.7886904761904762
Logistic Regression Model Recall: 0.8518518518518519


<h4>Support Vector Machine</h4>

In [151]:
#Create a Support Vector machine
#Params to test in grid search
svm_params = {'kernel': ['poly', 'rbf'], 'C': [1.0, 0.5, 0.25], 'gamma': ['scale', 'auto']}

#grid search
svm_grid = GridSearchCV(SVC(), svm_params, cv=3, scoring='recall')

#fit the grid to the training set
svm_grid.fit(X_bal, y_bal)

#ID the best model
svm = svm_grid.best_estimator_

#Display Best Parameters
print('Best Parameters:', svm_grid.best_params_)

#Display the metrics for the validation set
svm_preds = svm.predict(X_val)
svm_val_acc = accuracy_score(y_val, svm_preds)
svm_val_rec = recall_score(y_val, svm_preds)
print('Support Vector Machine Accuracy:', svm_val_acc)
print('Support Vector Machine Recall:', svm_val_rec)

Best Parameters: {'C': 1.0, 'gamma': 'scale', 'kernel': 'rbf'}
Support Vector Machine Accuracy: 0.8273809523809523
Support Vector Machine Recall: 0.7407407407407407


<h4>K Nearest Neighbors</h4>

In [152]:
#Create a knn model
#Params to test in grid search
knn_params = {'n_neighbors': [7, 9, 11], 'algorithm': ['ball_tree', 'kd_tree', 'brute'],
             'weights': ['uniform', 'distance']}

#grid search
knn_grid = GridSearchCV(KNeighborsClassifier(), knn_params, cv=3, scoring='recall')

#fit the grid to the training set
knn_grid.fit(X_bal, y_bal)

#ID the best model
knn = knn_grid.best_estimator_

#Display Best Parameters
print('Best Parameters:', knn_grid.best_params_)

#Display the metrics for the validation set
knn_preds = knn.predict(X_val)
knn_val_acc = accuracy_score(y_val, knn_preds)
knn_val_rec = recall_score(y_val, knn_preds)
print('K Nearest Neighbors Accuracy:', knn_val_acc)
print('K Nearest Neighbors Recall:', knn_val_rec)

Best Parameters: {'algorithm': 'ball_tree', 'n_neighbors': 7, 'weights': 'distance'}
K Nearest Neighbors Accuracy: 0.7529761904761905
K Nearest Neighbors Recall: 0.8333333333333334


<h4>Naive Bayes</h4>

In [153]:
#Create a naive bayes model
nb = GaussianNB()

#fit the model to the training set
nb.fit(X_bal, y_bal)

#Display the metrics for the validation set
nb_preds = nb.predict(X_val)
nb_val_acc = accuracy_score(y_val, nb_preds)
nb_val_rec = recall_score(y_val, nb_preds)
print('Naive Bayes Accuracy:', nb_val_acc)
print('Naive Bayes Recall:', nb_val_rec)

Naive Bayes Accuracy: 0.6934523809523809
Naive Bayes Recall: 0.6481481481481481


<h4>Gradient Boosting Tree</h4>

In [154]:
#Create a xgboost model
#Params to test in grid search
xgb_params = {'n_estimators': [240, 250, 260], 'max_depth': [15, 16, 17],
             'colsample_bytree': [0.6, 0.7, 0.8, 1.0]}

#grid search
xgb_grid = GridSearchCV(XGBClassifier(use_label_encoder=False, verbosity=0), xgb_params, cv=3, 
                        scoring='recall')

#fit the grid to the training set
xgb_grid.fit(X_bal, y_bal)

#ID the best model
xgb = xgb_grid.best_estimator_

#Display Best Parameters
print('Best Parameters:', xgb_grid.best_params_)

#Display the metrics for the validation set
xgb_preds = xgb.predict(X_val)
xgb_val_acc = accuracy_score(y_val, xgb_preds)
xgb_val_rec = recall_score(y_val, xgb_preds)
print('Gradient Boosting Tree Accuracy:', xgb_val_acc)
print('Gradient Boosting Tree Recall:', xgb_val_rec)

Best Parameters: {'colsample_bytree': 0.8, 'max_depth': 15, 'n_estimators': 250}
Gradient Boosting Tree Accuracy: 0.8988095238095238
Gradient Boosting Tree Recall: 0.6111111111111112


<a id="6"></a>
<h2>Ensemble the Models</h2>

In [155]:
#Create ensemble model of all the other models
#list of models
models = [('logistic_regression', lr), ('support vector machine', svm), 
        ('knn', knn), ('naive_bayes', nb), ('gradient_boost', xgb)]

#Combine models
ensemble_model = VotingClassifier(estimators=models)

#fit the model on the training set
ensemble_model.fit(X_bal, y_bal)

#Display the metrics for the validation set
ensemble_preds = ensemble_model.predict(X_val)
ensemble_val_acc = accuracy_score(y_val, ensemble_preds)
ensemble_val_rec = recall_score(y_val, ensemble_preds)
print('Ensemble Model Accuracy:', ensemble_val_acc)
print('Ensemble Model Recall:', ensemble_val_rec)

Ensemble Model Accuracy: 0.8571428571428571
Ensemble Model Recall: 0.8518518518518519


<h5>This model takes the predictions of the previous 5 models and outputs the class with the most votes.
    The ensemble model strikes a good balance between recall and accuracy (both around 85%). Now, let us see the results on the test set.</h5>
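
<h5>As a rough illustration of hard (majority) voting, which is the default mode of <code>VotingClassifier</code>, the sketch below combines per-model predictions by hand. The function name and the prediction lists are hypothetical and only serve to show the mechanics.</h5>

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority."""
    combined = []
    for sample_votes in zip(*predictions):
        # most_common(1) returns the label with the most votes for this sample
        combined.append(Counter(sample_votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions from five models for four customers (1 = responds)
model_preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
print(majority_vote(model_preds))  # [1, 0, 1, 0]
```

<h5>With five voters and two classes a tie is impossible, so every customer receives a clear majority label.</h5>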

In [156]:
#Display the metrics of the Ensemble model on the test set
test_preds = ensemble_model.predict(X_test)
test_acc = accuracy_score(y_test, test_preds)
test_rec = recall_score(y_test, test_preds)
print('Test Set Metrics')
print('Ensemble Model Accuracy:', test_acc)
print('Ensemble Model Recall:', test_rec)

Test Set Metrics
Ensemble Model Accuracy: 0.8511904761904762
Ensemble Model Recall: 0.7551020408163265


<a id="7"></a>
<h2>Conclusions</h2>

<ul>
    <li><strong>The final model's accuracy is 85% and its recall is about 76%.</strong> This is a reasonable balance between the two metrics and will allow the store to identify and target the majority of customers who will respond to marketing without spending excess resources on large numbers of customers who will not respond. If the store were willing to spend a bit more on marketing, the ensemble model could be modified to target a customer whenever 2 or more of the 5 weak models predict a response.</li>
    <li><strong>This dataset may not be complex enough.</strong> Customers are complex. There are a variety of reasons why a customer would respond to marketing, and the dataset used here includes only a small fraction of the variables that need to be considered. Additional features that could be useful, if provided, include: which items are marketed in each campaign, the times of year of purchases and marketing campaigns, the location of the store(s), and how each marketing campaign was presented to customers (web only? catalog and web?).</li>
</ul>
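
<h5>The idea of lowering the vote threshold from the first conclusion could look roughly like the sketch below. The function name, the <code>min_votes</code> parameter, and the prediction lists are all hypothetical. In practice the per-model predictions could be collected from the fitted estimators in <code>ensemble_model.estimators_</code>, and the threshold should be chosen on the validation set rather than the test set.</h5>

```python
def vote_with_threshold(predictions, min_votes):
    """Label a customer 1 when at least min_votes models predict 1."""
    return [int(sum(sample_votes) >= min_votes) for sample_votes in zip(*predictions)]

# Hypothetical predictions from five models for four customers (1 = responds)
model_preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]

majority = vote_with_threshold(model_preds, min_votes=3)  # plain majority of 5
lenient = vote_with_threshold(model_preds, min_votes=2)   # targets more customers
print('3 of 5 votes:', majority)
print('2 of 5 votes:', lenient)
```

<h5>In this toy example the lower threshold also targets the fourth customer, who received only 2 votes. That is the intended effect: recall can only stay the same or rise, at the cost of more false positives.</h5>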

<h5>Thank you for reading through this project and a special thanks to friends and family. :)</h5> 