## Classification
In this notebook, we will cover the following concepts of classification with the help of a business use case:
- Linear vs. nonlinear classifiers
- Naive Bayes classifier
- Support vector machines
- Evaluating the model using accuracy, precision, recall, and a confusion matrix

### Problem Statement
Our aim in this project is to predict whether a person will buy an iPhone based on their gender, age, and salary. We will also compare several classification algorithms on the same data.

In [1]:
#import required libraries
import pandas as pd
from pandas import Series, DataFrame
#import required libraries for visualization
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

In [2]:
# Suppress library warnings to keep the cell output readable
import warnings
warnings.filterwarnings("ignore")

In [3]:
# Step 1 - Load Data
data_set = pd.read_csv("./iphone_purchase_records.csv")
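# Features: every column except the last (Gender, Age, Salary)
X = data_set.iloc[:, :-1].values
# Target: the last column (Purchase Iphone)
y = data_set.iloc[:, -1].values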


In [4]:
X

array([['Male', 19, 19000],
       ['Male', 35, 20000],
       ['Female', 26, 43000],
       ...,
       ['Female', 50, 20000],
       ['Male', 36, 33000],
       ['Female', 49, 36000]], dtype=object)

In [5]:
y

array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1,

In [6]:
data_set.head()

   Gender  Age  Salary  Purchase Iphone
0    Male   19   19000                0
1    Male   35   20000                0
2  Female   26   43000                0
3  Female   27   57000                0
4    Male   19   76000                0


In [7]:
#Check the data type
data_set.dtypes

Gender             object
Age                 int64
Salary              int64
Purchase Iphone     int64
dtype: object

### Feature Extraction
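The code below uses scikit-learn (`sklearn`), a library with tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. Its `LabelEncoder` converts the categorical `Gender` column into integer codes so that the classifiers receive a fully numeric feature matrix.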

In [8]:
# Step 2 - Encode the categorical Gender column as integers
from sklearn.preprocessing import LabelEncoder

labelEncoder_gender = LabelEncoder()
X[:,0] = labelEncoder_gender.fit_transform(X[:,0])

# Convert the object array to float (the np.float alias was removed in NumPy 1.24)
X = X.astype(float)

In [9]:
X

array([[1.0e+00, 1.9e+01, 1.9e+04],
       [1.0e+00, 3.5e+01, 2.0e+04],
       [0.0e+00, 2.6e+01, 4.3e+04],
       ...,
       [0.0e+00, 5.0e+01, 2.0e+04],
       [1.0e+00, 3.6e+01, 3.3e+04],
       [0.0e+00, 4.9e+01, 3.6e+04]])
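
To make the encoding visible, the following minimal sketch fits a `LabelEncoder` on a small hand-made sample (the values in `gender_sample` are illustrative and are not read from the CSV). `LabelEncoder` sorts the labels before numbering them, so `Female` becomes 0 and `Male` becomes 1, which matches the first column of `X` above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative sample only; these values are not read from the CSV file.
gender_sample = np.array(["Male", "Male", "Female", "Female", "Male"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(gender_sample)

# classes_ holds the sorted labels; the position of each label is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded)

# inverse_transform recovers the original strings from the integer codes.
decoded = encoder.inverse_transform(encoded)
print(decoded)
```

Because the codes are plain integers, a model may treat them as ordered. With only two categories this is harmless, but a column with three or more categories would usually call for one-hot encoding instead.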

In [10]:
# Step 3 - Split data into training and testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
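
The call above holds out 25% of the rows for testing. The sketch below reproduces only the shapes of such a split on placeholder data; the 400-row size, the class balance in `y_demo`, and the `stratify` argument are assumptions added for illustration (the notebook itself does not stratify).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 400 rows and 3 features (assumed size, not the real CSV).
X_demo = np.arange(400 * 3, dtype=float).reshape(400, 3)
y_demo = np.array([0] * 240 + [1] * 160)  # made-up class balance

# stratify=y_demo asks for the same class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

Stratifying can be worth considering when one class is much rarer than the other, since a purely random split may leave the test set with too few positive examples.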

In [11]:
# Step 4 - Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
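
`StandardScaler` subtracts each column's mean and divides by its standard deviation. The key detail in the cell above is that the statistics are learned from the training rows only and then reused for the test rows. A minimal sketch of that arithmetic, using a handful of made-up age and salary values (not the notebook's actual split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up age and salary values, used only to illustrate the arithmetic.
train = np.array([[19.0, 19000.0],
                  [35.0, 20000.0],
                  [26.0, 43000.0],
                  [27.0, 57000.0]])
test = np.array([[50.0, 20000.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the training statistics

# The same transformation written out by hand: z = (x - mean) / std.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # close to 0 for every column
print(test_scaled)
print(manual_test_scaled)
```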

In [12]:
# Step 5 - Logistic regression classifier
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0, solver="liblinear")
classifier.fit(X_train, y_train)

In [13]:

# Step 6 - Predict on X_test with the logistic regression model
y_pred = classifier.predict(X_test)

In [14]:
# Step 7 - Confusion matrix
from sklearn import metrics
cm = metrics.confusion_matrix(y_test, y_pred)
print(cm)
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy score:",accuracy)
precision = metrics.precision_score(y_test, y_pred)
print("Precision score:",precision)
recall = metrics.recall_score(y_test, y_pred)
print("Recall score:",recall)

[[65  3]
 [ 6 26]]
Accuracy score: 0.91
Precision score: 0.896551724137931
Recall score: 0.8125


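**On the 100-sample test set, the logistic regression model reaches 91% accuracy, about 90% precision (26 of 29 predicted buyers), and 81% recall (26 of 32 actual buyers). It performs well overall but misses roughly one in five actual buyers.**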
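
The three scores can be checked by hand from the confusion matrix printed above. scikit-learn puts the true classes in rows and the predicted classes in columns, so the four cells are, in order, true negatives, false positives, false negatives, and true positives. The F1 score in this sketch is an addition for illustration; the notebook does not report it.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted.
cm = np.array([[65, 3],
               [6, 26]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted buyers who really bought
recall = tp / (tp + fn)           # share of actual buyers the model found
f1 = 2 * precision * recall / (precision + recall)  # not reported in the notebook

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```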

In [15]:
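# Step 8 - Scale the full feature matrix for cross-validation
# (earlier the scaler was fitted on X_train only; here it is refitted on all of X,
# so the held-out rows of each fold also contribute to the scaling statistics)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X = sc.fit_transform(X)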
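
Fitting the scaler on all of `X` before cross-validation lets every held-out fold influence the mean and standard deviation used for scaling. One common alternative, sketched here on synthetic data from `make_classification` (an assumption; the notebook uses the iPhone CSV), wraps the scaler and the classifier in a pipeline so that scaling is refitted inside each training fold. This is an illustrative variation, not what the notebook runs.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the purchase data (assumed: 400 rows, 3 features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# The pipeline refits StandardScaler on the training part of every fold.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(model, X_demo, y_demo, cv=kfold, scoring="accuracy")

print(f"Mean accuracy = {scores.mean():.2%}, SD = {scores.std():.2%}")
```

On a small, low-dimensional dataset the difference between the two approaches is usually minor, but the pipeline form keeps the evaluation free of information from the held-out folds.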

### Model Comparison

In [16]:
# Step 9 - Compare classification algorithms
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [17]:
classification_models = []
classification_models.append(('Logistic Regression', LogisticRegression(solver="liblinear")))
classification_models.append(('K Nearest Neighbor',
                              KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)))
classification_models.append(('Kernel SVM', SVC(kernel='rbf', gamma='scale')))
classification_models.append(('Naive Bayes', GaussianNB()))
classification_models.append(('Decision Tree', DecisionTreeClassifier(criterion="entropy")))
classification_models.append(('Random Forest',
                              RandomForestClassifier(n_estimators=100, criterion="entropy")))

In [18]:
# Use the same shuffled 10-fold split for every model
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in classification_models:
    result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    print("%s: Mean Accuracy = %.2f%% - SD Accuracy = %.2f%%"
          % (name, result.mean()*100, result.std()*100))

Logistic Regression: Mean Accuracy = 84.00% - SD Accuracy = 6.24%
K Nearest Neighbor: Mean Accuracy = 91.25% - SD Accuracy = 5.15%
Kernel SVM: Mean Accuracy = 90.75% - SD Accuracy = 4.88%
Naive Bayes: Mean Accuracy = 88.75% - SD Accuracy = 5.15%
Decision Tree: Mean Accuracy = 85.75% - SD Accuracy = 5.92%
Random Forest: Mean Accuracy = 89.00% - SD Accuracy = 4.36%


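**From the results, KNN (91.25%) and Kernel SVM (90.75%) achieve the highest mean accuracy on this dataset. With fold-to-fold standard deviations of roughly 4% to 6%, however, their lead over Random Forest (89.00%) and Naive Bayes (88.75%) is small.**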
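
The lower score of logistic regression compared with KNN and the kernel SVM may partly reflect the linear vs. nonlinear distinction listed at the start of this notebook: a linear classifier separates the classes with a straight boundary, while a nonlinear one can bend it. The sketch below illustrates the idea on scikit-learn's synthetic `make_moons` data, which is an assumption chosen for illustration and is unrelated to the iPhone records.

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two interleaving half-circles: the classes cannot be split by a straight line.
X_moons, y_moons = make_moons(n_samples=400, noise=0.2, random_state=0)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

candidates = {
    "Logistic Regression (linear)": LogisticRegression(solver="liblinear"),
    "Kernel SVM (nonlinear)": SVC(kernel="rbf", gamma="scale"),
}

mean_scores = {}
for name, estimator in candidates.items():
    # Scale inside the pipeline so each fold is scaled with its own training rows.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_moons, y_moons, cv=kfold, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy = {scores.mean():.2%}")
```

On data like this the RBF kernel can follow the curved boundary, whereas logistic regression is limited to a single straight cut through the two moons.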