In [1]:
import pandas as pd

Dataset Summary - Classification:

This dataset describes customers of a car company. The company wants to assign each customer to one of four predetermined groups (Segmentation) based on their history, for marketing purposes.

Here's a description of all variables and their meaning:

ID: ID <br>
Gender: 1 for male, 0 for Female<br>
Ever_Married: 1 for Yes, 0 for No<br>
Age: Age<br>
Graduated: 1 for Yes, 0 for No<br>
Profession: One-hot encoded into the Profession_* columns at the end of the table (1 or 0 in each)<br>
Work_Experience: Numerical<br>
Spending_Score: 2 for low, 1 for high, 0 for average<br>
Family_Size: Numerical<br>
Segmentation: Target variable, groups A through D encoded as 0 through 3<br>

In [2]:
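# Raw string so the backslash in the Windows-style path is not read as an escape sequence
scores_te = pd.read_csv(r"datasets_training\classification_final_toml.csv", delimiter=',')
scores_te = scores_te.drop("Unnamed: 0", axis=1)  # Drop the "Unnamed: 0" column (leftover row index from the saved CSV)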

scores_te.head(10)

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation,Profession_Artist,Profession_Doctor,Profession_Engineer,Profession_Entertainment,Profession_Executive,Profession_Healthcare,Profession_Homemaker,Profession_Lawyer,Profession_Marketing
0,462809,1,0,22,0,1.0,2,4.0,3,3,0,0,0,0,0,1,0,0,0
1,466315,0,1,67,1,1.0,2,1.0,5,1,0,0,1,0,0,0,0,0,0
2,461735,1,1,67,1,0.0,1,2.0,5,1,0,0,0,0,0,0,0,1,0
3,461319,1,1,56,0,0.0,0,2.0,5,2,1,0,0,0,0,0,0,0,0
4,460156,1,0,32,1,1.0,2,3.0,5,2,0,0,0,0,0,1,0,0,0
5,464347,0,0,33,1,1.0,2,3.0,5,3,0,0,0,0,0,1,0,0,0
6,465015,0,1,61,1,0.0,2,3.0,6,3,0,0,1,0,0,0,0,0,0
7,465176,0,1,55,1,1.0,0,4.0,5,2,1,0,0,0,0,0,0,0,0
8,464041,0,0,26,1,1.0,2,3.0,5,0,0,0,1,0,0,0,0,0,0
9,464942,1,0,19,0,4.0,2,4.0,3,3,0,0,0,0,0,1,0,0,0


In [3]:
# Training imports (some may not be used)

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

Splitting the dataset:<br>
We split by 80:20.
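
As a rough illustration of an 80:20 split that can be repeated exactly, here is a minimal sketch on made-up data. The arrays `X_demo` and `y_demo` are stand-ins, not the notebook's data, and the `random_state` and `stratify` arguments are assumptions about what a reproducible split might use rather than what was run here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels (assumption: 4 classes,
# like the Segmentation column, with 50 rows each).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.repeat([0, 1, 2, 3], 50)

# test_size=0.2 gives the 80:20 split. random_state makes the split repeatable,
# and stratify keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_te))  # number of test rows per class
```

Without a fixed `random_state`, every run draws a different test set, so accuracy figures will move slightly from run to run.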

In [4]:
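y = scores_te["Segmentation"]
scores_te_2 = scores_te.drop("ID", axis=1)  # NOTE: not used below, so ID is still a feature in x
x = scores_te.drop("Segmentation", axis=1)
x.head(10)

# Train/test split (80:20); no random_state is set, so the split differs between runs
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)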

In [5]:

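scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)  # NOTE: fitted on all of x; x_scaled is not used for training below

# Random forest model (logistic regression gave similar results)
model = RandomForestClassifier(random_state=42, class_weight='balanced')
model.fit(x_train, y_train)  # trained on the unscaled features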


y_pred = model.predict(x_test)


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print("Summary:\n-------------")
print(classification_report(y_test, y_pred))

# Accuracy of about 0.52 is poor; check the data set

Accuracy: 0.5176294073518379
Summary:
-------------
              precision    recall  f1-score   support

           0       0.48      0.53      0.50       329
           1       0.39      0.38      0.38       320
           2       0.55      0.52      0.54       356
           3       0.64      0.65      0.64       328

    accuracy                           0.52      1333
   macro avg       0.52      0.52      0.52      1333
weighted avg       0.52      0.52      0.52      1333



Algorithm Choices/Choice Made:<br>
For the classification set, I chose a RandomForestClassifier. It combines many decision trees, handles mixed feature types without needing feature scaling, and is less prone to overfitting than a single decision tree. It also gave the best evaluation scores of all the models I tested (there were many). Although it didn't perform as well as I would have liked, it did perform the best. More comments on that in the summary.
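
One way such a model comparison could be made on equal terms is cross-validation with a single scoring metric. The sketch below is an illustration only: it uses synthetic data from `make_classification`, and the two candidates, the 5 folds, and the `f1_weighted` scoring are assumptions, not a record of the models actually tested for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class problem standing in for the customer data (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# Candidate models; logistic regression gets scaling, the forest does not need it.
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score every candidate with the same folds and the same metric.
mean_f1 = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="f1_weighted")
    mean_f1[name] = scores.mean()
    print(name, round(mean_f1[name], 3))
```

Averaging over folds makes the comparison less sensitive to one lucky or unlucky train/test split.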

Ways to evaluate the model: <br>
TPR - True Positive Rate - the proportion of actual positive cases that the model correctly identifies.<br>
FPR - False Positive Rate - the proportion of actual negative cases that the model incorrectly labels as positive.<br>
F1 Score - the harmonic mean of precision and recall, a single summary of both kinds of error, where closer to 1 is better.
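
TPR and FPR are defined for a positive and a negative class, so with four segments one common option is to compute them one-vs-rest: each class in turn is treated as positive and the other three as negative. The sketch below shows that calculation on a tiny made-up example. The helper name `per_class_rates` and the demo labels are hypothetical.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_rates(y_true, y_pred, labels):
    """Return one-vs-rest (TPR, FPR) arrays, one entry per label."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    tp = np.diag(cm)                # class k predicted as class k
    fn = cm.sum(axis=1) - tp        # class k predicted as something else
    fp = cm.sum(axis=0) - tp        # other classes predicted as class k
    tn = cm.sum() - tp - fn - fp    # everything not involving class k
    # Assumes every label occurs at least once, so no denominator is zero.
    return tp / (tp + fn), fp / (fp + tn)

# Tiny made-up example with four classes.
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 0, 3, 3]
tpr_demo, fpr_demo = per_class_rates(y_true_demo, y_pred_demo, labels=[0, 1, 2, 3])

for label, tpr, fpr in zip([0, 1, 2, 3], tpr_demo, fpr_demo):
    print(f"class {label}: TPR={tpr:.3f} FPR={fpr:.3f}")
```

The per-class TPR computed this way is the same quantity as the recall column of `classification_report`.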

In [6]:
#Eval
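conf_matrix = confusion_matrix(y_test, y_pred)  # 4x4: rows are true classes, columns are predicted classes
# NOTE: both rates below use binary formulas on the class 0/1 corner only; classes 2 and 3 are ignored
TPR = conf_matrix[1, 1] / (conf_matrix[1, 0] + conf_matrix[1, 1])  # true class 1 predicted as 1, among those predicted 0 or 1
print("TPR:", TPR)
FPR = conf_matrix[0, 1] / (conf_matrix[0, 0] + conf_matrix[0, 1])  # true class 0 predicted as 1, among those predicted 0 or 1
print("FPR:", FPR)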
f1 = f1_score(y_test, y_pred, average='weighted')
print("F1 Score:", f1)

TPR: 0.625
FPR: 0.2967479674796748
F1 Score: 0.5170530855373333


Summary:<br>
In summary, the classification model was a random forest classifier, evaluated on TPR, FPR, and F1. The model did not perform to the standard expected, but it did perform considerably better than a random guess: with four segments of roughly equal size, random guessing would score about 0.25, against a weighted F1 of about 0.52. In the future, a better approach would be to one-hot encode more of the categorical variables, choose more carefully which variables to encode, and prepare the training data more precisely.
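
To make the one-hot encoding idea concrete, here is a minimal sketch using `pandas.get_dummies` on a few made-up rows. Treating `Spending_Score` and `Var_1` as the columns to encode is an assumption for illustration; which integer-coded columns are really categorical would need to be checked against the data.

```python
import pandas as pd

# A few made-up rows in the same shape as the customer table (assumption).
demo = pd.DataFrame({
    "Age": [22, 67, 56],
    "Spending_Score": [2, 1, 0],   # integer codes for low / high / average
    "Var_1": [3, 5, 5],            # integer-coded category
})

# Replace each listed column with one 0/1 indicator column per distinct value,
# so the model no longer reads the codes as ordered numbers.
encoded = pd.get_dummies(demo, columns=["Spending_Score", "Var_1"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Columns that are left out of `columns=` (here `Age`) pass through unchanged.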

Model shown:

In [7]:
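import json
# NOTE: x has 18 columns starting with ID, but this row starts at Gender, so every value is shifted by one column
new_data_scaled = scaler.transform([[1,0,22,0,1.0,2,4.0,3,0,0,0,0,0,1,0,0,0,0]]) # expects 18
# NOTE: the model was fitted on unscaled x_train, yet it receives scaled input here
new_predictions = model.predict(new_data_scaled)
new_predictions_json = json.dumps(new_predictions.tolist())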


print("New data provided:", new_data_scaled[0])
print("Corresponding prediction's Segmentation:", new_predictions[0])


New data provided: [-1.80621838e+02 -1.10931917e+00  4.35560652e+01 -2.63490733e+00
  7.54058603e-01 -1.84753897e-01  3.10848507e+00  1.04215382e-01
 -2.96516851e+00 -7.00036727e-01 -3.12218917e-01 -3.09316142e-01
 -3.71683881e-01  3.49256636e+00 -4.39015297e-01 -1.64208894e-01
 -2.84785886e-01 -1.90328990e-01]
Corresponding prediction's Segmentation: 3
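
A prediction like the one above depends on the new row having the same columns, in the same order and with the same preprocessing, as the training data. One way to guard against mismatches is to keep the scaler and the model in a single pipeline and to pass new rows as a DataFrame with named columns. The sketch below is an illustration on synthetic data: the three-column feature set, the random labels, and the `n_estimators` value are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training frame with a hypothetical subset of the features.
rng = np.random.default_rng(0)
feature_names = ["Gender", "Age", "Family_Size"]
X_demo = pd.DataFrame({
    "Gender": rng.integers(0, 2, size=120),
    "Age": rng.integers(18, 80, size=120),
    "Family_Size": rng.integers(1, 7, size=120),
})
y_demo = rng.integers(0, 4, size=120)  # random labels, for illustration only

# The pipeline applies the same scaling at fit time and at predict time.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=25, random_state=42),
)
pipe.fit(X_demo, y_demo)

# Build the new row by column name, then put it in the training column order.
new_row = pd.DataFrame([{"Age": 22, "Gender": 1, "Family_Size": 4}])
new_row = new_row.reindex(columns=X_demo.columns)

pred = pipe.predict(new_row)
print("Predicted segment:", pred[0])
```

Because the row is assembled by name, a missing or misspelled feature shows up as a NaN column after `reindex` instead of silently shifting the remaining values.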


