<font color="green">*To start working on this notebook, or any other notebook that we will use in the Moringa Data Science Course, we will need to save our own copy of it. We can do this by clicking File > Save a Copy in Drive. We will then be able to make edits to our own copy of this notebook.*</font>

# Python Programming: The K-Nearest Neighbours (KNN)

## Examples

### Example 1: Classification

In [None]:
# Example 
# ---
# Question: Predict the class to which these plants belong. 
# There are three classes in the dataset: Iris-setosa, Iris-versicolor and Iris-virginica. 
# ---
# Dataset url = http://bit.ly/DatasetIris
# ---
# 


In [109]:
# Importing our libraries
# ---
# 
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
# Loading our dataset
# ---
# 

# Assign colum names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read dataset to pandas dataframe
iris = pd.read_csv("iris.csv", names = names)

In [4]:
# Previewing our datset
# ---
# 
iris.head()

Unnamed: 0,sepal-length,sepal-width,petal-length,petal-width,Class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [5]:
# Splitting our dataset into its attributes and labels
# ---
# The X variable contains the first four columns of the dataset (i.e. attributes) while y contains the labels.
# ---
# 
X = iris.iloc[:, :-1].values
y = iris.iloc[:, 4].values

In [6]:
# Train Test Split
# ---
# To avoid over-fitting, we will divide our dataset into training and test splits, 
# which gives us a better idea as to how our algorithm performed during the testing phase. 
# This way our algorithm is tested on un-seen data
# ---
# 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

In [7]:
# Feature Scaling
# ---
# Before making any actual predictions, it is always a good practice to scale the features 
# so that all of them can be uniformly evaluated.
# ---
# 
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
# Training and Predictions
# ---
# The first step is to import the KNeighborsClassifier class from the sklearn.neighbors library. 
# In the second line, this class is initialized with one parameter, i.e. n_neigbours. 
# This is basically the value for the K. There is no ideal value for K and it is selected after testing and evaluation, 
# however to start out, 5 seems to be the most commonly used value for KNN algorithm.
# ---
# 
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [9]:
# The final step is to make predictions on our test data
# ---
# 
y_pred = classifier.predict(X_test)

In [296]:
# Evaluating the Algorithm
# ---
# For evaluating an algorithm, confusion matrix, precision, recall and f1 score are the most commonly used metrics. 
# The confusion_matrix and classification_report methods of the sklearn.metrics can be used to calculate these metrics. 
# ---
# 
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      7319
           1       0.72      0.62      0.67       919

    accuracy                           0.93      8238
   macro avg       0.84      0.80      0.82      8238
weighted avg       0.93      0.93      0.93      8238

[[7100  219]
 [ 345  574]]


### Example 2: Regression

In [None]:
# Example 2
# ---
# Question: Predict the age of a voter through the use of other variables in the dataset.
# ---
# 


In [12]:
# Then loading our libraries
# 
from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

In [14]:
# Previewing our turnout dataset
# ---
# 
df = data("turnout")
df.head()

Unnamed: 0,race,age,educate,income,vote
1,white,60,14.0,3.3458,1
2,white,51,10.0,1.8561,0
3,white,24,12.0,0.6304,0
4,white,38,8.0,3.4183,1
5,white,25,12.0,2.7852,1


In [15]:
# Determining the size of the dataset
# 
df.shape

(2000, 5)

In [16]:
# Splitting our data
# ---
# 
X = df[['age','income','vote']]
y = df['educate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

In [17]:
# Training our algorithm
# ---
# 
clf = KNeighborsRegressor(11)
clf.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=11)

In [20]:
# Making our prediction
# ---
# 
y_pred = clf.predict(X_test)
# print(y_pred)
print(mean_squared_error(y_test, y_pred))

9.241401515151516


## <font color="green">Challenge 1</font>

In [232]:
# Challenge 1
# ---
# Question: Predict the income level based on the individual’s personal information in the given dataset.
# ---
# Dataset url = http://bit.ly/DatasetAdult
# ---
# 
#Reading of the dataset
adults = pd.read_csv('adult.csv')
adults.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [233]:
adults.drop('fnlwgt', axis=1 , inplace=True)

In [221]:
adults['income'].unique()

array(['<=50K', '>50K'], dtype=object)

In [234]:
adults.columns

Index(['age', 'workclass', 'education', 'educational-num', 'marital-status',
       'occupation', 'relationship', 'race', 'gender', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country', 'income'],
      dtype='object')

In [223]:
adults.dtypes

age                 int64
workclass          object
education          object
educational-num     int64
marital-status     object
occupation         object
relationship       object
race               object
gender             object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
income             object
dtype: object

In [235]:
#Preprocessing o the data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
objects = [col for col in adults.columns if adults[col].dtype == 'object'] 
objects


['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'gender',
 'native-country',
 'income']

In [236]:
#Label encoding the data
for col in objects:
  adults[col] = le.fit_transform(adults[col])

In [101]:
adults.dtypes

age                int64
workclass          int32
education          int32
educational-num    int64
marital-status     int32
occupation         int32
relationship       int32
race               int32
gender             int32
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country     int32
income             int32
dtype: object

In [54]:
adults.shape

(48842, 14)

In [218]:
adults['workclass'].unique()

array([4, 2, 0, 6, 1, 7, 5, 8, 3])

In [245]:
adults.columns

Index(['age', 'workclass', 'education', 'educational-num', 'marital-status',
       'occupation', 'relationship', 'race', 'gender', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country', 'income'],
      dtype='object')

In [252]:
#Spliiting of the dataset into x and y
X = adults[['age', 'workclass', 'education', 'educational-num', 'marital-status',
       'occupation', 'relationship', 'race', 'gender', 'capital-gain',
       'capital-loss', 'hours-per-week', 'native-country']].values
y = adults['income'].values

In [253]:
#Crafting of the dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train, y_test = train_test_split(X,y,test_size= 0.2, random_state= 0)

In [256]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [257]:
#Training of the dataset
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(5)
knn.fit(X_train,y_train)

KNeighborsClassifier()

In [258]:
#Making predictions
y_predicted = knn.predict(X_test) 

In [259]:
#Evaluation of the model
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_predicted))
print(classification_report(y_test, y_predicted ))

[[6700  720]
 [ 933 1416]]
              precision    recall  f1-score   support

           0       0.88      0.90      0.89      7420
           1       0.66      0.60      0.63      2349

    accuracy                           0.83      9769
   macro avg       0.77      0.75      0.76      9769
weighted avg       0.83      0.83      0.83      9769



## <font color="green">Challenge 2</font>

In [298]:
# Challenge 2
# ---
# Question: Using KNN, predict if the client will subscribe a term deposit (variable y).
# ---
# Dataset url = http://bit.ly/DatasetBank
# ---
# Dasest info = http://bit.ly/DatasetBankInfo
# ---
# 
#reading of the adatset
bank = pd.read_csv('bank.csv', delimiter=';')
bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no


In [299]:
bank.columns

Index(['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx',
       'cons.conf.idx', 'euribor3m', 'nr.employed', 'y'],
      dtype='object')

In [300]:
bank.dtypes

age                 int64
job                object
marital            object
education          object
default            object
housing            object
loan               object
contact            object
month              object
day_of_week        object
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome           object
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                  object
dtype: object

In [301]:
#Pre processing the data
# importing label encoding library
from sklearn.preprocessing import LabelEncoder
LabelEncoder =LabelEncoder()

#defining a function to encode data
def enc(dataframe, col):
  dataframe[col]=LabelEncoder.fit_transform(dataframe[col])


# making a list of columns we want to flip
l = ['job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week','poutcome','y']


for i in l:
  enc(bank, i)

In [286]:
bank.dtypes

age                 int64
job                 int32
marital             int32
education           int32
default             int32
housing             int32
loan                int32
contact             int32
month               int32
day_of_week         int32
duration            int64
campaign            int64
pdays               int64
previous            int64
poutcome            int32
emp.var.rate      float64
cons.price.idx    float64
cons.conf.idx     float64
euribor3m         float64
nr.employed       float64
y                   int32
dtype: object

In [302]:
#Splitting and crafting of the data

X = bank.drop('y', axis=1).values
y = bank['y'].values

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size= .2, random_state=0)

In [303]:
#Training of the dataset
from sklearn.neighbors import KNeighborsClassifier  
knc = KNeighborsClassifier(5) 
knc.fit(X_test,y_test)

KNeighborsClassifier()

In [304]:
#Making predictions using the model
y_preds = knc.predict(X_test)

In [305]:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.97      0.96      7319
           1       0.72      0.62      0.67       919

    accuracy                           0.93      8238
   macro avg       0.84      0.80      0.82      8238
weighted avg       0.93      0.93      0.93      8238

[[7100  219]
 [ 345  574]]


## <font color="green">Challenge 3</font>

In [306]:
# Challenge 3
# ---
# Question: Predict if a person will have diabetes or not using the KNN algorithm.
# ---
# Dataset url = http://bit.ly/DatasetDiabetes
# ---
# 
diabetes = pd.read_csv('diabetes.csv')
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [145]:
diabetes['Outcome'].unique()

array([1, 0], dtype=int64)

In [147]:
diabetes.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [307]:
#Splitting of the x and y variables
X = diabetes[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age']].values
y = diabetes['Outcome'].values

In [308]:
diabetes.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [309]:
#Crafting of the dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0)

In [310]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [311]:
#Training of the model
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

KNeighborsClassifier()

In [312]:
#making predictions
y_predicted = classifier.predict(X_test)

In [314]:
#Feature engineering
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test,y_predicted))
print(confusion_matrix(y_test,y_predicted))

              precision    recall  f1-score   support

           0       0.82      0.85      0.83       157
           1       0.66      0.59      0.62        74

    accuracy                           0.77       231
   macro avg       0.74      0.72      0.73       231
weighted avg       0.77      0.77      0.77       231

[[134  23]
 [ 30  44]]


## <font color="green">Challenge 4</font>

In [261]:
# Challenge 4
# ---
# Question: Predict the miles per gallon (mpg) of a car, given its displacement and horsepower.
# ---
# Dataset Train url = http://bit.ly/AutoMPGTrainDataset
# Dataset Test url = http://bit.ly/AutoMPGTestDataset 
# ---
#Loading of the dataset
test = pd.read_csv('auto_test.csv')
train = pd.read_csv('auto_train.csv')

In [262]:
#Preview the data
print(test.head())
print(train.head())

   displacement  horsepower   mpg
0            89          71  31.9
1            86          65  34.1
2            98          80  35.7
3           121          80  27.4
4           183          77  25.4
   displacement  horsepower   mpg
0         307.0         130  18.0
1         350.0         165  15.0
2         318.0         150  18.0
3         304.0         150  16.0
4         302.0         140  17.0


In [272]:
#Splitting of the dataset into x and y
X_train = train[['displacement', 'horsepower']].values
y_train = train['mpg'].values

X_test = test[['displacement', 'horsepower']].values
y_test = test['mpg'].values


In [273]:
#Training of the dataset
from sklearn.neighbors import KNeighborsRegressor
regressor = KNeighborsRegressor(5)
trained = regressor.fit(X_train, y_train)

In [274]:
#Making predictions
y_pred = regressor.predict(X_test)

In [275]:
#Evaluation of the model
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y_test, y_pred))

53.12306138613861


In [278]:
a = X_test.mean() 
b = y_test.mean()
c = (a + b) / 2
c 

68.6029702970297

In [281]:
a = X_train.mean()
b = y_train.mean()
c = ( a + b ) / 2
c

92.63530927835052

In [282]:
a = (68.61 + 92.64)/2
a

80.625

## <font color="green">Challenge 5</font>

In [None]:
# Challenge 6
# ---
# Question: Predict the target class given the following dataset.
# ---
# Dataset url = http://bit.ly/ClassifiedDataset
# ---
# 
OUR CODE GOES HERE