<a href="https://colab.research.google.com/github/ITU-Business-Analytics-Team/Business_Analytics_for_Professionals/blob/main/Part%20I%20%3A%20Methods%20%26%20Technologies%20for%20Business%20Analytics/Chapter%203%3A%20Prediction%20Modelling/3_5_Support_Vector_Machines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Prediction Modelling: Machine Learning**
## Support Vector Machines

[Dataset](https://www.kaggle.com/prachi15gupta98/airline-passenger-satisfaction): from Kaggle

This case study estimates customer satisfaction for an airline, referred to here as Airline A. The company aims to increase the satisfaction of its customers, so it collects survey data that let it monitor customer satisfaction on each flight. The survey evaluates different aspects of the airline's service and combines customer-specific information, such as age, with information provided by the airline, such as flight distance. The dataset contains the survey responses of 129,880 customers.

In [None]:
#import libraries

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
from sklearn.datasets import  make_classification
from sklearn.svm import SVC
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, recall_score,precision_score
from sklearn.metrics import classification_report
import sklearn.metrics as metrics
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate,GridSearchCV

### **Data Preparation**

In [None]:
#Dataset import
url='https://drive.google.com/file/d/10csrmhoGgaewEg88SOSVsPgvbPsZNRXy/view?usp=sharing'
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
data = pd.read_csv(path) 

In [None]:
data.head()

Unnamed: 0,satisfaction,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,...,4,2,2,0,2,4,2,5,0,0.0


In [None]:
#see data size
data.shape

(129880, 23)

In [None]:
#learn data types
data.dtypes

satisfaction                          object
Gender                                object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

In [None]:
#null data check 
data.isnull().sum()

satisfaction                           0
Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Seat comfort                           0
Departure/Arrival time convenient      0
Food and drink                         0
Gate location                          0
Inflight wifi service                  0
Inflight entertainment                 0
Online support                         0
Ease of Online booking                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Cleanliness                            0
Online boarding                        0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
dtype: int64

In [None]:
#drop unnecessary variables
data.drop(['Flight Distance', 'Gate location', 'Departure/Arrival time convenient','Arrival Delay in Minutes','Departure Delay in Minutes'], axis=1,inplace=True)
data.head()

Unnamed: 0,satisfaction,Gender,Customer Type,Age,Type of Travel,Class,Seat comfort,Food and drink,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding
0,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,0,0,2,4,2,3,3,0,3,5,3,2
1,satisfied,Male,Loyal Customer,47,Personal Travel,Business,0,0,0,2,2,3,4,4,4,2,3,2
2,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,0,0,2,0,2,2,3,3,4,4,4,2
3,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,0,0,3,4,3,1,1,0,1,4,1,3
4,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,0,0,4,3,4,2,2,0,2,4,2,5


In [None]:
#convert categorical variables to dummy
data_new = pd.get_dummies(data, drop_first=True)
data_new.head()

Unnamed: 0,Age,Seat comfort,Food and drink,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,satisfaction_satisfied,Gender_Male,Customer Type_disloyal Customer,Type of Travel_Personal Travel,Class_Eco,Class_Eco Plus
0,65,0,0,2,4,2,3,3,0,3,5,3,2,1,0,0,1,1,0
1,47,0,0,0,2,2,3,4,4,4,2,3,2,1,1,0,1,0,0
2,15,0,0,2,0,2,2,3,3,4,4,4,2,1,0,0,1,1,0
3,60,0,0,3,4,3,1,1,0,1,4,1,3,1,0,0,1,1,0
4,70,0,0,4,3,4,2,2,0,2,4,2,5,1,0,0,1,1,0
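
The effect of `drop_first=True` can be illustrated on a small, made-up frame. This is only a sketch: the `toy` DataFrame and its values are hypothetical and not taken from the airline dataset. With `drop_first=True`, the first category of each categorical column (in sorted order) is dropped and acts as the reference level, which is why `Class_Business` does not appear in the output above.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate dummy encoding
toy = pd.DataFrame({
    "Class": ["Business", "Eco", "Eco Plus", "Eco"],
    "Age": [30, 45, 22, 51],
})

# One column per category
full = pd.get_dummies(toy)

# First category ("Business") dropped and used as the reference level
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per variable avoids perfectly collinear dummy columns while keeping the same information.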


In [None]:
#dependent and independent variable distinction
X = data_new.drop("satisfaction_satisfied", axis = 1)
y = data_new["satisfaction_satisfied"]


In [None]:
#split into training and test sets, then scale with a standard scaler fitted on the training set only
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)

std_mdl = StandardScaler().fit(X_train)
X_train = std_mdl.transform(X_train)
X_test = std_mdl.transform(X_test)
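
Fitting the scaler on the training set only keeps information from the test set out of the preprocessing step. The same idea carries over to cross-validation, where one possible approach is to wrap the scaler and the classifier in a scikit-learn `Pipeline` so that the scaler is refitted inside each fold. The sketch below assumes a small synthetic dataset (`X_demo`, `y_demo`) instead of the airline data, and the names `pipe_demo` and `cv_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the airline dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# The scaler is fitted only on the training part of each fold
pipe_demo = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1))

# 5-fold cross-validated accuracy
cv_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(cv_scores)
print(cv_scores.mean())
```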

### **Building the Models**

The airline's customer satisfaction is modeled with the support vector machine (SVM) classification algorithm using four different kernels: RBF, linear, polynomial, and sigmoid. The metrics obtained with each kernel are compared in the Results section.
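
Before running the four kernels on the full dataset, the comparison can be sketched on a small synthetic problem. The block below is a minimal illustration, not the notebook's own procedure: the helper `compare_kernels` and the data `X_demo`, `y_demo` are hypothetical, and `probability=True` is left out because only accuracy is computed.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def compare_kernels(X, y, kernels=("rbf", "linear", "poly", "sigmoid"), seed=42):
    """Fit one SVC per kernel and return {kernel: (train_acc, test_acc)}."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    # Fit the scaler on the training split only
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

    scores = {}
    for kernel in kernels:
        model = SVC(kernel=kernel, C=1, random_state=seed).fit(X_tr, y_tr)
        scores[kernel] = (
            accuracy_score(y_tr, model.predict(X_tr)),
            accuracy_score(y_te, model.predict(X_te)),
        )
    return scores


# Synthetic stand-in data (assumption: not the airline dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
kernel_scores = compare_kernels(X_demo, y_demo)
for kernel, (train_acc, test_acc) in kernel_scores.items():
    print(f"{kernel:8s} train={train_acc:.3f} test={test_acc:.3f}")
```

The ranking of kernels on synthetic data says nothing about the airline data; the sketch only shows the structure of the comparison.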

**Radial Basis Function (RBF) Kernel Model**

In [None]:
#fit the model
mdl = SVC(kernel="rbf", C=1, probability=True, random_state=42)
mdl.fit(X_train,y_train)

In [None]:
#predict on the training and test data (used below to calculate accuracy)
ypred_train = mdl.predict(X_train)
ypred_test = mdl.predict(X_test)

In [None]:
rbf_training_accuracy=accuracy_score(y_train,ypred_train)
rbf_testing_accuracy=accuracy_score(y_test,ypred_test)
print(rbf_training_accuracy)
print(rbf_testing_accuracy)

In [None]:
#calculate the performance measurement
print(confusion_matrix(y_test,ypred_test))

rbf_recall=recall_score(y_test,ypred_test)
print(rbf_recall)

rbf_precision=precision_score(y_test,ypred_test)
print(rbf_precision)

rbf_f1score=f1_score(y_test,ypred_test)
print(rbf_f1score)
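
The relationship between the confusion matrix and these three scores can be checked by hand. The following sketch uses a small, made-up pair of label vectors (`y_true_demo`, `y_pred_demo`, both hypothetical) and recomputes precision, recall, and F1 from the four cells of the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels, used only to illustrate the formulas
y_true_demo = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_demo = tp / (tp + fp)  # share of predicted positives that are correct
recall_demo = tp / (tp + fn)     # share of actual positives that are found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(tn, fp, fn, tp)
print(precision_demo, recall_demo, f1_demo)
```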

In [None]:
# show the classification report
print(classification_report(y_test,ypred_test))

In [None]:
# Plot the ROC graph
probs = mdl.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
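
`roc_curve` also accepts non-thresholded decision values, so an ROC curve can be computed from `SVC.decision_function` without setting `probability=True`, which adds an internal cross-validation step during fitting and can be slow on large training sets. The sketch below shows this alternative on synthetic data; it is not the approach used in this notebook, and the names `svc_demo`, `scores_demo`, `fpr_demo`, `tpr_demo`, and `auc_demo` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the airline dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# No probability=True: the signed distance to the separating surface is used as the score
svc_demo = SVC(kernel="rbf", C=1).fit(X_tr, y_tr)
scores_demo = svc_demo.decision_function(X_te)

fpr_demo, tpr_demo, _ = roc_curve(y_te, scores_demo)
auc_demo = auc(fpr_demo, tpr_demo)
print(f"AUC from decision_function: {auc_demo:.3f}")
```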

In [None]:
# Confusion Matrix Visualization
sns.heatmap(confusion_matrix(y_test,ypred_test), cmap='Blues', annot=True,fmt='d')
plt.ylabel("True Values")
plt.xlabel("Predicted Values")
plt.title("Confusion Matrix Visualization")
plt.show()

**Linear Kernel Model**

In [None]:
#fit the model using linear kernel
mdl2 = SVC(kernel="linear", C=1, probability=True, random_state=42)
mdl2.fit(X_train,y_train)

In [None]:
#predict on the training and test data (used below to calculate accuracy)
ypred_train2 = mdl2.predict(X_train)
ypred_test2 = mdl2.predict(X_test)

In [None]:
#calculate the performance measurement
print(confusion_matrix(y_test,ypred_test2))

linear_recall=recall_score(y_test,ypred_test2)
print(linear_recall)

linear_precision=precision_score(y_test,ypred_test2)
print(linear_precision)

linear_f1score=f1_score(y_test,ypred_test2)
print(linear_f1score)

In [None]:
linear_training_accuracy=accuracy_score(y_train,ypred_train2)
linear_testing_accuracy=accuracy_score(y_test,ypred_test2)
print(linear_training_accuracy)
print(linear_testing_accuracy)

In [None]:
# show the classification report
print(classification_report(y_test,ypred_test2))

In [None]:
# Plot the ROC graph
probs = mdl2.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Confusion Matrix Visualization
sns.heatmap(confusion_matrix(y_test,ypred_test2), cmap='Blues', annot=True,fmt='d')
plt.ylabel("True Values")
plt.xlabel("Predicted Values")
plt.title("Confusion Matrix Visualization")
plt.show()

**Polynomial Kernel Model**

In [None]:
#fit the model using poly kernel
mdl3 = SVC(kernel="poly", C=1, probability=True, random_state=42)
mdl3.fit(X_train,y_train)

In [None]:
#predict on the training and test data (used below to calculate accuracy)
ypred_train3 = mdl3.predict(X_train)
ypred_test3 = mdl3.predict(X_test)

In [None]:
#calculate the performance measurement
print(confusion_matrix(y_test,ypred_test3))

poly_recall=recall_score(y_test,ypred_test3)
print(poly_recall)

poly_precision=precision_score(y_test,ypred_test3)
print(poly_precision)

poly_f1score=f1_score(y_test,ypred_test3)
print(poly_f1score)

In [None]:
poly_training_accuracy=accuracy_score(y_train,ypred_train3)
poly_testing_accuracy=accuracy_score(y_test,ypred_test3)
print(poly_training_accuracy)
print(poly_testing_accuracy)

In [None]:
# show the classification report
print(classification_report(y_test,ypred_test3))

In [None]:
# Plot the ROC graph
probs = mdl3.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Confusion Matrix Visualization
sns.heatmap(confusion_matrix(y_test,ypred_test3), cmap='Blues', annot=True,fmt='d')
plt.ylabel("True Values")
plt.xlabel("Predicted Values")
plt.title("Confusion Matrix Visualization")
plt.show()

**Sigmoid Kernel Model**

In [None]:
#fit the model using sigmoid kernel
mdl4 = SVC(kernel="sigmoid", C=1, probability=True, random_state=42)
mdl4.fit(X_train,y_train)

In [None]:
#predict on the training and test data (used below to calculate accuracy)
ypred_train4 = mdl4.predict(X_train)
ypred_test4 = mdl4.predict(X_test)

In [None]:
sigmoid_training_accuracy=accuracy_score(y_train,ypred_train4)
sigmoid_testing_accuracy=accuracy_score(y_test,ypred_test4)
print(sigmoid_training_accuracy)
print(sigmoid_testing_accuracy)

In [None]:
#calculate the performance measurement
print(confusion_matrix(y_test,ypred_test4))

sigmoid_recall=recall_score(y_test,ypred_test4)
print(sigmoid_recall)

sigmoid_precision=precision_score(y_test,ypred_test4)
print(sigmoid_precision)

sigmoid_f1score=f1_score(y_test,ypred_test4)
print(sigmoid_f1score)

In [None]:
# show the classification report
print(classification_report(y_test,ypred_test4))

In [None]:
# Plot the ROC graph
probs = mdl4.predict_proba(X_test)
preds = probs[:,1]
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

In [None]:
# Confusion Matrix Visualization
sns.heatmap(confusion_matrix(y_test,ypred_test4), cmap='Blues', annot=True,fmt='d')
plt.ylabel("True Values")
plt.xlabel("Predicted Values")
plt.title("Confusion Matrix Visualization")
plt.show()

### **Results**
The table below reports the training accuracy, testing accuracy, and test precision, recall, and F1 score of each model.

Among the four kernels, RBF best predicts the airline's customer satisfaction with support vector machines: it achieves the best values on the performance evaluation criteria. Based on these criteria, the RBF kernel is followed by the polynomial, linear, and sigmoid kernels, in that order.

In [None]:
!pip install texttable
from texttable import Texttable
t = Texttable()
t.add_rows([
    ['Model', 'Training Accuracy', 'Testing Accuracy', 'Test Precision', 'Test Recall', 'Test F1'],
    ['rbf', rbf_training_accuracy, rbf_testing_accuracy, rbf_precision, rbf_recall, rbf_f1score],
    ['linear', linear_training_accuracy, linear_testing_accuracy, linear_precision, linear_recall, linear_f1score],
    ['poly', poly_training_accuracy, poly_testing_accuracy, poly_precision, poly_recall, poly_f1score],
    ['sigmoid', sigmoid_training_accuracy, sigmoid_testing_accuracy, sigmoid_precision, sigmoid_recall, sigmoid_f1score],
])
print(t.draw())