**AutoML Benchmark model to predict Customer Satisfaction Score**<br>
**Objective**: Use FLAML to select the best regression model to predict `SatisfactionScore`  
**Dataset**: Customer Demographics & Feedback (38,000+ entries)  
**Models Evaluated**: CatBoost, LightGBM 

In [None]:
#importing all necessary libraries
from flaml import AutoML
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
#reading the dataset
dataset=pd.read_csv('customer_feedback_satisfaction.csv')

In [None]:
#taking a peek into our dataset
dataset.head()

,CustomerID,Age,Gender,Country,Income,ProductQuality,ServiceQuality,PurchaseFrequency,FeedbackScore,LoyaltyLevel,SatisfactionScore
0,1,56,Male,UK,83094,5,8,5,Low,Bronze,100.0
1,2,69,Male,UK,86860,10,2,8,Medium,Gold,100.0
2,3,46,Female,USA,60173,8,10,18,Medium,Silver,100.0
3,4,32,Female,UK,73884,7,10,16,Low,Gold,100.0
4,5,60,Male,UK,97546,6,4,13,Low,Bronze,82.0


In [None]:
#dimensions of dataset
dataset.shape

(38444, 10)

In [None]:
#setting CustomerID as index
dataset.set_index('CustomerID', inplace=True)

We set **CustomerID** as the index because each of the 38,444 rows holds the ratings of one unique customer.
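
Using `CustomerID` as the index presumes that every ID occurs exactly once. The following is a minimal sketch of how that assumption could be checked before indexing. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Age": [56, 69, 46],
})

# is_unique is True only when no CustomerID value is repeated.
ids_are_unique = toy["CustomerID"].is_unique
print("CustomerID unique:", ids_are_unique)

# verify_integrity=True makes set_index raise a ValueError on duplicate IDs,
# so a silent duplicate index cannot slip through.
indexed = toy.set_index("CustomerID", verify_integrity=True)
print(indexed.index.tolist())
```

On the real data the same two calls would be applied to `dataset` in place of `toy`.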

In [None]:
#general information about dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38444 entries, 1 to 38444
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                38444 non-null  int64  
 1   Gender             38444 non-null  object 
 2   Country            38444 non-null  object 
 3   Income             38444 non-null  int64  
 4   ProductQuality     38444 non-null  int64  
 5   ServiceQuality     38444 non-null  int64  
 6   PurchaseFrequency  38444 non-null  int64  
 7   FeedbackScore      38444 non-null  object 
 8   LoyaltyLevel       38444 non-null  object 
 9   SatisfactionScore  38444 non-null  float64
dtypes: float64(1), int64(5), object(4)
memory usage: 3.2+ MB


In [None]:
#column stats
dataset.describe(include='all')

,Age,Gender,Country,Income,ProductQuality,ServiceQuality,PurchaseFrequency,FeedbackScore,LoyaltyLevel,SatisfactionScore
count,38444.0,38444,38444,38444.0,38444.0,38444.0,38444.0,38444,38444,38444.0
unique,,2,5,,,,,3,3,
top,,Female,USA,,,,,High,Gold,
freq,,19294,7762,,,,,12918,12912,
mean,43.496853,,,75076.619238,5.494746,5.492769,10.453881,,,85.276409
std,14.972748,,,25975.752966,2.873192,2.875812,5.765621,,,16.898577
min,18.0,,,30001.0,1.0,1.0,1.0,,,4.28
25%,31.0,,,52624.5,3.0,3.0,5.0,,,74.47
50%,43.0,,,75236.0,5.0,5.0,10.0,,,91.27
75%,56.0,,,97606.75,8.0,8.0,15.0,,,100.0


In [None]:
#Range of values in numerical columns
print("Range of age is : ",min(dataset['Age'])," - ",max(dataset['Age']))
print("Range of income is : $",min(dataset['Income'])," - $",max(dataset['Income']))
print("Range of ProductQuality score is : ",min(dataset['ProductQuality'])," - ",max(dataset['ProductQuality']))
print("Range of ServiceQuality score is : ",min(dataset['ServiceQuality'])," - ",max(dataset['ServiceQuality']))
print("Range of Purchase Frequency is : ",min(dataset['PurchaseFrequency'])," - ",max(dataset['PurchaseFrequency']))

Range of age is :  18  -  69
Range of income is : $ 30001  - $ 119999
Range of ProductQuality score is :  1  -  10
Range of ServiceQuality score is :  1  -  10
Range of Purchase Frequency is :  1  -  20
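
The per-column ranges can also be collected in a single table instead of one `print` per column. This is a sketch on a small made-up frame (`toy` is hypothetical), using `DataFrame.agg` with the `min` and `max` reducers.

```python
import pandas as pd

# Hypothetical toy frame with two of the numerical columns.
toy = pd.DataFrame({
    "Age": [18, 43, 69],
    "Income": [30001, 75236, 119999],
})

# One row per statistic, one column per feature.
ranges = toy[["Age", "Income"]].agg(["min", "max"])
print(ranges)
```

The result has the rows `min` and `max`, which makes it easy to scan all ranges at once.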


In [None]:
#Unique values in categorical columns
print("Unique Countries: ", dataset['Country'].unique())
print("Unique Genders: ", dataset['Gender'].unique())
print("Unique Feedback Score: ", dataset['FeedbackScore'].unique())
print("Unique Loyalty Level: ", dataset['LoyaltyLevel'].unique())

Unique Countries:  ['UK' 'USA' 'France' 'Germany' 'Canada']
Unique Genders:  ['Male' 'Female']
Unique Feedback Score:  ['Low' 'Medium' 'High']
Unique Loyalty Level:  ['Bronze' 'Gold' 'Silver']


In [None]:
#Range of satisfaction score
print("Range of Satisfaction Score is : ",min(dataset['SatisfactionScore'])," - ",max(dataset['SatisfactionScore']))

Range of Satisfaction Score is :  4.28  -  100.0


In [None]:
#one-hot encoding the categorical columns
dataset=pd.get_dummies(dataset, columns=['Country','Gender','FeedbackScore','LoyaltyLevel'])
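
`pd.get_dummies` replaces each listed categorical column with one indicator column per category. With the category counts printed earlier (5 countries, 2 genders, 3 feedback levels, 3 loyalty levels), that gives 13 indicator columns in place of the 4 originals. Here is a minimal sketch on a made-up frame (`toy` is hypothetical) showing how the new columns are named.

```python
import pandas as pd

# Hypothetical toy frame with one numerical and one categorical column.
toy = pd.DataFrame({
    "Age": [56, 69, 46],
    "LoyaltyLevel": ["Bronze", "Gold", "Silver"],
})

# Each category becomes its own column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["LoyaltyLevel"])
print(encoded.columns.tolist())
```

Columns that are not listed in `columns`, such as `Age` here, pass through unchanged.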

In [None]:
#target variable
Y=dataset['SatisfactionScore']

In [None]:
#feature matrix: every column except the target
X=dataset.drop('SatisfactionScore', axis=1)

In [31]:
#splitting into train (75%) and test (25%) sets with a fixed seed for reproducibility
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.25, random_state=42)
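
To see what `test_size=0.25` means for a dataset of this size, the sketch below runs the same split on placeholder arrays with 38,444 rows. `X_demo` and `y_demo` are stand-ins invented for this example and do not contain the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 38444  # number of rows reported by dataset.shape

# Placeholder data: only the row count matters for this illustration.
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.zeros(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Row counts of the two partitions.
print(len(X_tr), len(X_te))

# No row appears in both partitions.
overlap = set(X_tr.ravel()) & set(X_te.ravel())
print("overlapping rows:", len(overlap))
```

The split leaves 28,833 rows for training and 9,611 rows for testing.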

In [None]:
#searching over CatBoost and LightGBM for 45 seconds
automl=AutoML()
automl.fit(
      X_train=X_train,
      y_train=Y_train,
      task='regression',
      time_budget=45,
      estimator_list=['catboost','lgbm']
)

[flaml.automl.logger: 06-26 23:15:33] {1752} INFO - task = regression
[flaml.automl.logger: 06-26 23:15:33] {1763} INFO - Evaluation method: holdout
[flaml.automl.logger: 06-26 23:15:33] {1862} INFO - Minimizing error metric: 1-r2
[flaml.automl.logger: 06-26 23:15:33] {1979} INFO - List of ML learners in AutoML Run: ['catboost', 'lgbm']
[flaml.automl.logger: 06-26 23:15:33] {2282} INFO - iteration 0, current learner catboost
[flaml.automl.logger: 06-26 23:15:33] {2417} INFO - Estimated sufficient time budget=2281s. Estimated necessary time budget=2s.
[flaml.automl.logger: 06-26 23:15:33] {2466} INFO -  at 0.3s,	estimator catboost's best error=0.2082,	best estimator catboost's best error=0.2082
[flaml.automl.logger: 06-26 23:15:33] {2282} INFO - iteration 1, current learner lgbm
[flaml.automl.logger: 06-26 23:15:33] {2466} INFO -  at 0.3s,	estimator lgbm's best error=0.7083,	best estimator catboost's best error=0.2082
[flaml.automl.logger: 06-26 23:15:33] {2282} INFO - iteration 2, curr

In [33]:
#predicting on the test set and inspecting the model FLAML selected
Y_pred=automl.predict(X_test)
print("FLAML Best ML model: ",automl.best_estimator)
print("Best configuration: ",automl.best_config)
print("Best loss: ", automl.best_loss)

FLAML Best ML model:  catboost
Best configuration:  {'early_stopping_rounds': 10, 'learning_rate': 0.06233639237958607, 'n_estimators': 8192}
Best loss:  0.20758933852808992
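
The FLAML log above states that the search minimizes `1-r2` using holdout evaluation, so the reported best loss can be converted back into an R² value on FLAML's internal validation split. A short arithmetic sketch follows; the variable names are chosen for this example.

```python
# Best loss copied from the output above; FLAML's logged metric is 1 - R^2.
best_loss = 0.20758933852808992

# Invert the metric to recover the validation R^2.
validation_r2 = 1 - best_loss
print(f"Validation R^2: {validation_r2:.4f}")
```

This validation figure comes from data FLAML held out of `X_train` during the search, so it is separate from any score computed on `X_test`.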


In [36]:
#evaluating on the test set (root_mean_squared_error requires scikit-learn 1.4 or later)
from sklearn.metrics import r2_score,root_mean_squared_error
print("R^2 score(FLAML): ",r2_score(Y_test,Y_pred))
print("Root mean squared error(FLAML): ",root_mean_squared_error(Y_test,Y_pred))

R^2 score(FLAML):  0.7919576959910408
Root mean squared error(FLAML):  7.6377344472081985
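
For reference, both metrics can be computed by hand from the residuals: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up arrays (`y_true` and `y_hat` are hypothetical, not the notebook's predictions).

```python
import numpy as np

# Hypothetical true scores and predictions, for illustration only.
y_true = np.array([100.0, 82.0, 91.0, 74.0])
y_hat = np.array([95.0, 85.0, 90.0, 80.0])

residuals = y_true - y_hat

# RMSE: typical size of an error, in the same units as the target.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: share of the target's variance explained by the predictions.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse)
print("R^2:", r2)
```

Read this way, the test RMSE reported above is expressed in satisfaction-score points, while R² is unitless.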
