# **In this notebook I compare and explain two well-known supervised classification algorithms: the decision tree classifier and the K-nearest neighbors classifier.**

*The dataset used here contains demographic data for customers of a telecommunications service provider. The objective is to use demographic data, such as region, age, and marital status, to predict customer usage patterns. This project might be useful to a telecommunications service provider such as Safaricom for customer segmentation. The company could then use these customer categories to customize offers. The dataset can be obtained from Kaggle.*

In [14]:
#import required libraries
import numpy as np
import pandas as pd
#in case we need to visualize any behavior of the data
import matplotlib.pyplot as plt
#import scikit-learn's preprocessing package to standardize the data, since algorithms such as KNN
#work better when features are on a comparable scale
from sklearn import preprocessing
#import train_test_split to split the data
from sklearn.model_selection import train_test_split

## Load Data

In [3]:
saf = pd.read_csv('teleCust1000t.csv')
saf.head()
#we want to predict the customer category (custcat)

Unnamed: 0,region,tenure,age,marital,address,income,ed,employ,retire,gender,reside,custcat
0,2,13,44,1,9,64.0,4,5,0.0,0,2,1
1,3,11,33,1,7,136.0,5,5,0.0,0,6,4
2,3,68,52,1,24,116.0,1,29,0.0,1,2,3
3,2,33,33,0,12,33.0,2,0,0.0,1,1,1
4,2,23,30,1,9,30.0,1,2,0.0,0,4,3


## Understand data

In [4]:
#which columns are present
saf.columns

Index(['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed',
       'employ', 'retire', 'gender', 'reside', 'custcat'],
      dtype='object')

In [6]:
#How many of each category do we have in our target attribute
saf[['custcat']].value_counts()

custcat
3          281
1          266
4          236
2          217
dtype: int64

Before we continue, it helps to understand what the four categories (1, 2, 3, 4) mean:<br>
1-Basic service<br>
2-E-service<br>
3-Plus service<br>
4-Total service<br>

**NB:** You will find this in the dataset metadata.
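Since the four categories are roughly balanced, it helps to have a reference point before looking at any model accuracy. The sketch below copies the class counts from the `value_counts()` output above and computes the share of each class. It also computes the accuracy you would get by always predicting the most frequent class. Note that this is computed on the full dataset, not on a test split, so treat it as a rough yardstick only.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({3: 281, 1: 266, 4: 236, 2: 217}, name="custcat")

# Share of each class among all samples
proportions = counts / counts.sum()
print(proportions)

# Accuracy of a model that always predicts the most frequent class
majority_baseline = proportions.max()
print("Majority-class baseline accuracy:", majority_baseline)
```

Any classifier worth keeping should beat this baseline by a clear margin.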

In [7]:
#How many attributes and samples do we have in our data set
saf.shape

(1000, 12)

In [8]:
#Data types present
saf.dtypes
#All columns are numeric, so we will not need sklearn's LabelEncoder() to convert them before
#fitting our models

region       int64
tenure       int64
age          int64
marital      int64
address      int64
income     float64
ed           int64
employ       int64
retire     float64
gender       int64
reside       int64
custcat      int64
dtype: object

Most algorithms require the data to be in numeric form.

## Preprocessing data

### scikit-learn works with NumPy arrays, so let's convert the data first and also separate the dependent and independent variables

In [27]:
#Isolate predictor variables and convert them to a numpy array
X = saf[['region', 'tenure', 'age', 'marital', 'address', 'income', 'ed',
       'employ', 'retire', 'gender', 'reside']].values

#Isolate the target variable and convert it to a 1-D NumPy array (not a DataFrame)
y = saf['custcat'].values

Our data is already in numeric form, as many machine learning algorithms require, but it should be standardized for distance-based algorithms such as KNN to work well. Standardizing rescales each feature to zero mean and unit variance.

In [12]:
#Let's standardize our data (only the predictors)
X = preprocessing.StandardScaler().fit_transform(X.astype('float'))
X 

array([[-0.02696767, -1.055125  ,  0.18450456, ..., -0.22207644,
        -1.03459817, -0.23065004],
       [ 1.19883553, -1.14880563, -0.69181243, ..., -0.22207644,
        -1.03459817,  2.55666158],
       [ 1.19883553,  1.52109247,  0.82182601, ..., -0.22207644,
         0.96655883, -0.23065004],
       ...,
       [ 1.19883553,  1.47425216,  1.37948227, ..., -0.22207644,
         0.96655883, -0.92747794],
       [ 1.19883553,  1.61477311,  0.58283046, ..., -0.22207644,
         0.96655883, -0.92747794],
       [ 1.19883553,  0.67796676, -0.45281689, ..., -0.22207644,
         0.96655883,  0.46617787]])
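To see what `StandardScaler` does, here is a minimal check on a small toy matrix. The values are made up for illustration and are not from the dataset. After scaling, each column should have a mean close to 0 and a standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values, not from the dataset);
# the two columns are on very different scales
toy = np.array([[1.0, 200.0],
                [2.0, 300.0],
                [3.0, 400.0],
                [4.0, 500.0]])

scaled = StandardScaler().fit_transform(toy)

# After scaling, each column should have mean ~0 and standard deviation ~1
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
print("means:", col_means)
print("stds:", col_stds)
```

This matters for KNN because it relies on distances: without scaling, a feature with large values such as income would dominate the distance over a 0/1 feature such as gender.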

## Split your data and reserve some for testing

We split the data into training and testing sets. Evaluating on data the model did not see during training gives an honest estimate of out-of-sample accuracy.

In [25]:
#split data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=3)

In [16]:
#Shape of training set
print('Shape training set X:',X_train.shape)
print('Shape training set y:',y_train.shape)

Shape training set X: (700, 11)
Shape training set y: (700,)


In [17]:
#Shape of testing set
print('Shape testing set X:',X_test.shape)
print('Shape testing set y:',y_test.shape)

Shape testing set X: (300, 11)
Shape testing set y: (300,)


Great! We have successfully split our data. We will train our two models on the training data and test them on the testing data.

## Models

### (1).K-Nearest Neighbors

#### *Train

In [43]:
#Import the classifier
from sklearn.neighbors import KNeighborsClassifier 
#Create model (I will give it 6 neighbors)
KNN = KNeighborsClassifier(n_neighbors=6)
#Fit model to the training data
KNN.fit(X_train,y_train)
KNN

KNeighborsClassifier(n_neighbors=6)
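To give an idea of how KNN works, here is a minimal from-scratch sketch: for a new point, find the k closest training points by Euclidean distance and take a majority vote over their labels. The function `knn_predict` and the toy data are made up for illustration. scikit-learn's `KNeighborsClassifier` is more elaborate (for example in how it searches for neighbors and breaks ties), so this is only the basic idea.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters labelled 1 and 2
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array([1, 1, 1, 2, 2, 2])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([5.0, 5.1]), k=3)
print(pred_a, pred_b)
```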

#### *Predict

In [51]:
#Let's make predictions and visually compare them with the real values
yhat = KNN.predict(X_test)
print('Predicted: ',yhat[0:5])
print('Real: ',y_test[0:5])

Predicted:  [1 1 1 4 2]
Real:  [3 3 1 4 1]


#### *Evaluate

In [47]:
#We will evaluate using accuracy classification score
from sklearn import metrics
print('Accuracy score of KNN:',metrics.accuracy_score(y_test,yhat))
#The accuracy is low, about 33%

Accuracy score of KNN: 0.3333333333333333
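The accuracy of KNN depends on the choice of `n_neighbors`, so one way to explore it is to try a range of values and record the accuracy for each. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption, so that the example runs without the CSV file). With the real data you would reuse `X_train`, `X_test`, `y_train` and `y_test` from the split above instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# Synthetic stand-in data (assumption): 1000 samples, 11 features, 4 classes
X_demo, y_demo = make_classification(n_samples=1000, n_features=11,
                                     n_informative=6, n_redundant=0,
                                     n_classes=4, random_state=3)
X_demo = StandardScaler().fit_transform(X_demo)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo,
                                      test_size=0.3, random_state=3)

# Fit one model per value of k and store its test accuracy
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    scores[k] = metrics.accuracy_score(yte, model.predict(Xte))

# Value of k with the highest accuracy
best_k = max(scores, key=scores.get)
print("Best k:", best_k, "accuracy:", scores[best_k])
```

Keep in mind that picking k by its test-set accuracy makes the reported score a bit optimistic. Cross-validation on the training set would be a cleaner way to choose it.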


### (2).Decision Tree

#### *Train

In [49]:
#Import the classifier
from sklearn.tree import DecisionTreeClassifier
#Create model
DT = DecisionTreeClassifier(criterion = 'entropy',max_depth=4)
#Fit model to the training data
DT.fit(X_train,y_train)
DT

DecisionTreeClassifier(criterion='entropy', max_depth=4)
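With `criterion='entropy'`, the tree picks splits that reduce the Shannon entropy of the labels in each node. As a rough illustration, the sketch below computes the entropy of the `custcat` distribution using the class counts from the full dataset shown earlier. The tree itself is fitted on the training split only, so its root node entropy will differ slightly. The helper `entropy` is written here for illustration and is not part of scikit-learn.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a label distribution given per-class counts."""
    counts = np.asarray(class_counts, dtype=float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # drop empty classes to avoid dividing by zero
    return float(np.sum(probs * np.log2(1.0 / probs)))

# Class counts of custcat in the full dataset, from value_counts() earlier
root_entropy = entropy([281, 266, 236, 217])
print("Entropy of custcat:", root_entropy)

# A pure node (only one class present) has zero entropy
pure_entropy = entropy([10, 0, 0, 0])
print("Entropy of a pure node:", pure_entropy)
```

With four classes the maximum possible entropy is 2 bits, and the value for `custcat` is close to that because the classes are nearly balanced.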

#### *Predict

In [52]:
#Let's make predictions and visually compare them with the real values
yhat = DT.predict(X_test)
print('Predicted: ',yhat[0:5])
print('Real: ',y_test[0:5])

Predicted:  [1 3 1 2 1]
Real:  [3 3 1 4 1]


#### *Evaluate

In [53]:
#We will evaluate using accuracy classification score
from sklearn import metrics
print('Accuracy score of Decision Tree:',metrics.accuracy_score(y_test,yhat))
#Its accuracy is better than KNN's on this problem, though far from perfect: about 37% out-of-sample accuracy

Accuracy score of Decision Tree: 0.37333333333333335


### Conclusion: if I had to choose between these two models for this particular problem, I would go for the decision tree. I tried tweaking KNN's n_neighbors parameter, but the best accuracy it gave was still below the decision tree's. Note that these accuracy scores are out-of-sample scores.

## More on machine learning and deep learning coming in the future...

**Regards Samuel**,HAPPY DATA SCIENCE CAREER JOURNEY!