<hr/>
# **k-NN, Logistic Regression and k-Fold Cross Validation from Scratch**
<span id="0"></span>
[**Burhan Y. Kiyakoglu**](https://www.kaggle.com/burhanykiyakoglu)
<hr/>
<font color=green>

1. [Overview](#1)
1. [Importing Modules, Reading the Dataset](#2)
1. [k-Nearest Neighbors (k-NN)](#3)
1. [Logistic Regression](#4)
   * [Sigmoid Function](#5)
   * [Cost Function](#6)
   * [Gradient Descent Function](#7)
   * [Main Logistic Function](#8)
1. [Logistic Regression from Neural Network Perspective](#9)
   * [Propagation](#10)
   * [Optimization](#11)
   * [Predict](#12)
   * [Main Function](#13)
1. [Testing the Functions](#14)
   * [k-NN from Scratch](#15)
   * [k-NN from Scratch vs scikit-learn k-NN](#16)
   * [Logistic Regression from Scratch](#17)
   * [Logistic Regression from Scratch vs Logistic Regression from Neural Network Perspective](#18)
   * [Logistic Regression from Scratch vs scikit-learn Logistic Regression](#19)
1. [k-Fold Cross Validation from Scratch](#20)   
   * [k-Fold Cross Validation from Scratch vs scikit-learn k-Fold Cross Validation](#21) 
1. [Conclusion](#22)   

# <span id="1"></span> Overview
<hr/>
Welcome to my Kernel! In this kernel I aim to implement machine learning algorithms with my own functions. By doing this, I believe that we will better understand the mechanisms and theory behind the scenes.

If you have a question or feedback, feel free to write and if you like this kernel, please  leave an <font color="green"><b>UPVOTE</b> </font>:  **It will be very much appreciated and will motivate me to offer more content to the** <font color=#47A8E5><b>kaggle</b> </font> **community** 🙂 
<br/>
<img src="https://i.imgur.com/QPWu3Rd.png" title="source: Gradient Descent" height="400" width="800" />

# <span id="2"></span> Importing Modules, Reading the Dataset
#### [Return Contents](#0)
<hr/>

To run the analysis, we first need to set up the environment. I start by importing the modules and reading the data. The output below is the head of the data; if you want to see more details, try removing the ***#*** signs in front of ***df.describe()*** and ***df.info()***. 

In [None]:
#import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from random import randrange
from random import seed
from statistics import mean 
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

# Reading Data 
df = pd.read_csv('../input/Iris.csv')
#df.describe()
#df.info()
df['Class']=df['Species']
df['Class'] = df['Class'].map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2})
df["Class"].unique()
df.head()

# <span id="3"></span> k-Nearest Neighbors (k-NN)
#### [Return Contents](#0)
<hr/>

In k-NN, we find the k nearest neighbors of a point and count these neighbors' labels. The point then gets the label with the highest count. To keep the main function simple, I first defined the distance functions according to the definitions below.

$$\textbf{Euclidean Distance}$$
$$ $$
$$d(i,j)=\sqrt{\sum_{k=1}^n (x_{i,k}-x_{j,k})^{2}}$$
$$ $$
$$\textbf{Manhattan Distance}$$
$$ $$
$$d(i,j)=\sum_{k=1}^n |x_{i,k}-x_{j,k}|\;$$
$$ $$
$$\textbf{Minkowski Distance}$$
$$ $$
$$d(i,j)=\left(\sum_{k=1}^n |x_{i,k}-x_{j,k}|^{q}\right)^{1/q}$$

In [None]:
# Distances
def euclidian(p1, p2): 
    dist = 0
    for i in range(len(p1)):
        dist = dist + np.square(p1[i]-p2[i])
    dist = np.sqrt(dist)
    return dist;

def manhattan(p1, p2): 
    dist = 0
    for i in range(len(p1)):
        dist = dist + abs(p1[i]-p2[i])
    return dist;

def minkowski(p1, p2, q): 
    dist = 0
    for i in range(len(p1)):
        dist = dist + abs(p1[i]-p2[i])**q
    # take the q-th root of the sum (no extra square root)
    dist = dist**(1/q)
    return dist;
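
As a quick check of the definitions above, here is a minimal vectorized sketch. The helper name `minkowski_vec` and the two sample points are my own assumptions, not part of the functions above. It illustrates that the Minkowski distance reduces to the Manhattan distance for $q=1$ and to the Euclidean distance for $q=2$.

```python
import numpy as np

def minkowski_vec(p1, p2, q):
    """Minkowski distance of order q between two points (hypothetical helper)."""
    diff = np.abs(np.asarray(p1, dtype=float) - np.asarray(p2, dtype=float))
    # sum of |differences|^q, then the q-th root
    return np.sum(diff ** q) ** (1 / q)

a = [0.0, 0.0]
b = [3.0, 4.0]
manhattan_d = minkowski_vec(a, b, 1)   # q = 1 gives the Manhattan distance
euclidean_d = minkowski_vec(a, b, 2)   # q = 2 gives the Euclidean distance
print(manhattan_d, euclidean_d)
```

For these two points the Manhattan distance is $3+4$ and the Euclidean distance is $\sqrt{9+16}$.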

The code below is my main function. It calculates the distance between a test point and every point in the training set. Then it takes the k nearest points and counts their labels. Finally, it returns the label with the maximum count.

In [None]:
# kNN Function
def kNN(X_train,y_train, X_test, k, dist='euclidian',q=2):
    pred = []
    # Adjusting the data type
    if isinstance(X_test, np.ndarray):
        X_test=pd.DataFrame(X_test)
    if isinstance(X_train, np.ndarray):
        X_train=pd.DataFrame(X_train)
        
    for i in range(len(X_test)):    
        # Calculating distances for our test point
        newdist = np.zeros(len(y_train))

        if dist=='euclidian':
            for j in range(len(y_train)):
                newdist[j] = euclidian(X_train.iloc[j,:], X_test.iloc[i,:])
    
        if dist=='manhattan':
            for j in range(len(y_train)):
                newdist[j] = manhattan(X_train.iloc[j,:], X_test.iloc[i,:])
    
        if dist=='minkowski':
            for j in range(len(y_train)):
                newdist[j] = minkowski(X_train.iloc[j,:], X_test.iloc[i,:],q)

        # Merging actual labels with calculated distances
        newdist = np.array([newdist, y_train])

        ## Finding the closest k neighbors
        # Sorting index
        idx = np.argsort(newdist[0,:])

        # Sorting the all newdist
        newdist = newdist[:,idx]
        #print(newdist)

        # We should count neighbor labels and take the label which has max count
        # Define a dictionary for the counts
        c = {'0':0,'1':0,'2':0 }
        # Update counts in the dictionary 
        for j in range(k):
            c[str(int(newdist[1,j]))] = c[str(int(newdist[1,j]))] + 1

        key_max = max(c.keys(), key=(lambda k: c[k]))
        pred.append(int(key_max))
        
    return pred
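
The same idea can be sketched more compactly for a single test point. This is only an illustration under my own assumptions: the helper `knn_predict_one` and the toy data are hypothetical, the distance is fixed to Euclidean, and `collections.Counter` does the majority vote instead of a hard-coded dictionary.

```python
import numpy as np
from collections import Counter

def knn_predict_one(X_train, y_train, x, k):
    """Predict the label of a single point x (hypothetical helper)."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x to every training point at once
    dists = np.sqrt(((X_train - np.asarray(x, dtype=float)) ** 2).sum(axis=1))
    # indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# toy data: two well-separated groups
X_toy = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_toy = [0, 0, 0, 1, 1, 1]
label_a = knn_predict_one(X_toy, y_toy, [0.2, 0.1], k=3)
label_b = knn_predict_one(X_toy, y_toy, [5.5, 5.2], k=3)
print(label_a, label_b)
```

Because the vote uses whatever labels appear among the neighbors, this variant does not need to know the number of classes in advance.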

# <span id="4"></span> Logistic Regression
#### [Return Contents](#0)
<hr/>

<span id="5"></span>In order to get results between 0 and 1, a function, which is called **sigmoid**, is used to transform our hypothesis function. It is defined as
$$ $$
$$h_{\theta}(x) = g(\theta^{T} x)$$ 
$$ $$
where $h_{\theta}(x)$ is the hypothesis function, $x$ is a single record and 
$$ $$
$$g(z)=\dfrac{1}{1+e^{-z}}$$
$$ $$
By using $g(\theta^{T} x)$, we obtain the probability that $y=1$: if $h_{\theta}(x) \geq 0.5$, we predict $y=1$; if $h_{\theta}(x) < 0.5$, we predict $y=0$. Note also that $g(z) \geq 0.5$ exactly when $z \geq 0$. Thus, if $\theta^{T} x \geq 0$, then $y=1$.
 
By the definition, I defined the below ***sigmoid*** function.<span id="5"></span>

In [None]:
# Sigmoid Function 
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
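
To make the threshold rule concrete, here is a small sketch that evaluates the sigmoid at a few sample values of $z$ (the values are arbitrary and chosen only for illustration). It shows that thresholding the probability at 0.5 gives the same labels as checking the sign of $z$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probs = sigmoid(z)
# label 1 when the probability is at least 0.5
labels = (probs >= 0.5).astype(int)
# the same labels follow directly from the sign of z
labels_from_z = (z >= 0).astype(int)
print(np.round(probs, 3))
print(labels, labels_from_z)
```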

We can't use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function. That's why we need to define a different cost function for logistic regression. It is simply defined as
$$ $$
$$J(\theta) = \dfrac{1}{m} \sum^{m}_{i=1}Cost(h_{\theta}(x^{(i)}), y^{(i)})$$ 
$$ $$
where 
$$ $$
$$Cost(h_{\theta}(x^{(i)}), y^{(i)})=-y^{(i)} \; log(h_{\theta}(x^{(i)}))-(1-y^{(i)}) \; log(1-h_{\theta}(x^{(i)}))$$
$$ $$
As a sanity check, $J(\theta)$ can be plotted or printed as a function of the number of iterations to be sure that $J(\theta)$ is **decreasing on every iteration**, which shows that it is converging correctly. At this point, the choice of $\alpha$ is important. If $\alpha$ is too large, gradient descent may overshoot and fail to converge; if it is too small, convergence can be very slow.<span id="6"></span>

In [None]:
# Cost Function
def J(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
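
A short worked example may help to see how this cost behaves. The probability vectors below are made up for illustration: one set of predictions is confident and correct, the other is confident and wrong.

```python
import numpy as np

def J(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

y = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.2])   # confident and correct
bad = np.array([0.1, 0.9, 0.2, 0.8])    # confident and wrong
cost_good = J(good, y)
cost_bad = J(bad, y)
print(cost_good, cost_bad)
```

The wrong predictions are penalized much more heavily. Note that the cost is undefined when a probability is exactly 0 or 1, because $log(0)$ appears.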

In order to find the $\theta$ values that minimize the cost function, I use gradient descent, which can be summarized as

Repeat{
       1. Calculate gradient average
       2. Multiply by learning rate alpha
       3. Subtract from theta
}

It can also be expressed mathematically as
$$ $$
$$\textbf{Repeat}\{ \; \theta_{j}:= \theta_{j}-\alpha \dfrac{\partial}{\partial \theta_{j}}J(\theta) \; \} \;  where \;  j \in \{0,1,2,...,n \}$$
$$ $$
$$or$$
$$ $$
$$\textbf{Repeat}\{ \; \theta_{j}:= \theta_{j}-\dfrac{\alpha}{m} \sum^{m}_{i=1} (h_{\theta}(x^{(i)})-y^{(i)}) \; x_{j}^{(i)} \; \} \;  where \;  j \in \{0,1,2,...,n \}$$
$$ $$
The algorithm looks identical to that of linear regression, but be aware that this time $h_{\theta}(x^{(i)})$ has a **different definition**, so the two are not the same.

I would also like to explain **regularization**. Regularization is designed to address the problem of overfitting. **Overfitting** means high variance, and it is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data. Such a function fits the training data well but might give poor results on the test set. On the other hand, **underfitting** means high bias and a model that is too simple, which also gives poor results. To reduce overfitting without regularization, we need to remove features manually or use a model selection algorithm, which brings an extra workload. Conversely, when we apply regularization, all the features are kept and the model shrinks the magnitudes of $\theta_{j}$. This works especially well when we have a lot of slightly useful features.

When we add regularization, the new cost function is
$$ $$
$$J(\theta) = \dfrac{1}{m} \sum^{m}_{i=1}\left[-y^{(i)} \; log(h_{\theta}(x^{(i)}))-(1-y^{(i)}) \; log(1-h_{\theta}(x^{(i)}))\right]+\dfrac{\lambda}{2m}\sum^{n}_{j=1}\theta^{2}_{j}$$
$$ $$
Also, the new gradient descent update can be written mathematically as 
$$ $$
$\textbf{Repeat}\{$ $$  \theta_{0}:= \theta_{0}-\dfrac{\alpha}{m} \sum^{m}_{i=1} (h_{\theta}(x^{(i)})-y^{(i)}) \; x_{0}^{(i)} \\ \theta_{j}:= \theta_{j}- \alpha \left[ \left( \dfrac{1}{m} \sum^{m}_{i=1} (h_{\theta}(x^{(i)})-y^{(i)}) \; x_{j}^{(i)} \right) + \dfrac{\lambda}{m} \; \theta_{j} \right] \;where\;  j \in \{1,2,...,n \} $$ $\}$ <span id="7"></span>

In [None]:
# Gradient Descent Function
def gradientdescent(X, y, lmd, alpha, num_iter, print_cost):

    # select initial values zero
    theta = np.zeros(X.shape[1])
    
    costs = []  
    
    for i in range(num_iter):
        z = np.dot(X, theta)
        h = sigmoid(z)
        
        # adding regularization 
        reg = lmd / y.size * theta
        # first theta is intercept
        # it is not regularized
        reg[0] = 0
        cost = J(h, y)
        
        gradient = np.dot(X.T, (h - y)) / y.size + reg
        theta = theta - alpha * gradient
    
        if print_cost and i % 100 == 0: 
            print('Number of Iterations: ', i, 'Cost : ', cost, 'Theta: ', theta)
        if i % 100 == 0:
            costs.append(cost)
      
    return theta, costs
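
The sanity check described earlier (the cost should decrease on every iteration) can be tried on a tiny example. This is a minimal sketch under my own assumptions: the toy data, the learning rate of 0.1 and the 500 iterations are arbitrary, and regularization is left out.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(h, y):
    return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

# toy data: one feature, classes overlap slightly around x = 0
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
# add the intercept column of ones
X = np.column_stack([np.ones_like(x), x])

theta = np.zeros(X.shape[1])
alpha = 0.1
history = []
for i in range(500):
    h = sigmoid(X @ theta)
    history.append(cost(h, y))                # cost before this update
    gradient = X.T @ (h - y) / y.size         # average gradient
    theta = theta - alpha * gradient          # gradient descent step

print(history[0], history[-1])
```

With $\theta = 0$ every prediction is 0.5, so the first recorded cost equals $log(2)$, and the recorded values should then decrease step by step.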

In order to calculate the probability easily, I defined the below function but it is not essential.

In [None]:
# Predict Function 
def predict(X_test, theta):
    z = np.dot(X_test, theta)
    return sigmoid(z)

Lastly, I defined my main function for the logistic regression. However, there is one more point to explain. When we have more than two classes, we can't directly apply the method we use for binary classification. Here, I preferred to use the one vs all (one vs rest) method. Mathematically, it can be expressed as
$$ $$
$$h_{\theta}^{(i)}(x)=P(y=i \;  | \;  x;\theta) \;\;\;\;\; (i=1,2,...,n)$$ 
$$ $$
where $n$ is the number of classes. After calculating the above equation, we pick the class $i$ that maximizes $h_{\theta}^{(i)}(x)$ to decide the class.<span id="8"></span>

In [None]:
# Main Logistic Function
def logistic(X_train, y_train, X_test, lmd=0, alpha=0.1, num_iter=30000, print_cost = False):
    # Adding intercept
    intercept = np.ones((X_train.shape[0], 1))
    X_train = np.concatenate((intercept, X_train), axis=1)
    
    intercept = np.ones((X_test.shape[0], 1))
    X_test = np.concatenate((intercept, X_test), axis=1)

    # one vs rest
    u=set(y_train)
    t=[]
    allCosts=[]   
    for c in u:
        # set the labels to 0 and 1
        ynew = np.array(y_train == c, dtype = int)
        theta_onevsrest, costs_onevsrest = gradientdescent(X_train, ynew, lmd, alpha, num_iter, print_cost)
        t.append(theta_onevsrest)
        
        # Save costs
        allCosts.append(costs_onevsrest)
        
    # Calculate probabilities
    pred_test = np.zeros((len(u),len(X_test)))
    for i in range(len(u)):
        pred_test[i,:] = predict(X_test,t[i])
    
    # Select max probability
    prediction_test = np.argmax(pred_test, axis=0)
    
    # Calculate probabilities
    pred_train = np.zeros((len(u),len(X_train)))
    for i in range(len(u)):
        pred_train[i,:] = predict(X_train,t[i])
    
    # Select max probability
    prediction_train = np.argmax(pred_train, axis=0)
    
    d = {"costs": allCosts,
         "Y_prediction_test": prediction_test, 
         "Y_prediction_train" : prediction_train, 
         "learning_rate" : alpha,
         "num_iterations": num_iter,
         "lambda": lmd}
        
    return d
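
The two steps of one vs rest, building a binary target per class and then picking the class with the highest probability, can be sketched in isolation. The label vector and the probability table below are made up for illustration and are not outputs of the functions above.

```python
import numpy as np

y = np.array([0, 2, 1, 2, 0])
classes = np.unique(y)            # sorted class labels

# one binary target vector per class (1 for the class, 0 for the rest)
binary_targets = np.array([(y == c).astype(int) for c in classes])
print(binary_targets)

# hypothetical per-class probabilities: one row per class, one column per sample
probs = np.array([[0.8, 0.1, 0.3, 0.2, 0.6],
                  [0.1, 0.2, 0.6, 0.3, 0.3],
                  [0.2, 0.9, 0.2, 0.7, 0.1]])
# pick, for each sample, the class whose classifier is most confident
pred = classes[np.argmax(probs, axis=0)]
print(pred)
```

Indexing `classes` with the result of `np.argmax` maps row positions back to labels, which matters when the labels are not simply 0, 1, 2 in that order.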

# <span id="9"></span> Logistic Regression from Neural Network Perspective
#### [Return Contents](#0)
<hr/>

In the previous section I explained logistic regression and created my functions, but I also want to explain it with the neural network mindset. Although the functions below do the same job and are similar to the functions above, I think this section will help us to understand neural networks better for further studies. Since I already explained most of the details in the previous section, I will not go into too much detail.

I would like to start with the below computation graph which summarizes the neural network perspective
$$ $$
\begin{split}
\large x &\\
\large w & \;\; \large\rightleftarrows \; \boxed{z = wx + b} \; \rightleftarrows \; \boxed{a = \sigma(z)} \; \rightleftarrows \; \boxed{\mathcal{L}(a,y)}\\
\large b &
\end{split}
$$ $$
where $\sigma$ represents the sigmoid function, $\mathcal{L}$ is the loss, $\mathcal{L}(\hat y, y)= -y \; log(\hat y)-(1-y) \; log(1-\hat y)$, the right arrows denote forward propagation and the left arrows denote backpropagation. 

The graph above hints at the path we follow, but in logistic regression we do not have a single $x$ or $w$. Thus, the formulation below might be clearer. For one example $x^{(i)}$

$$z^{(i)} = w^T x^{(i)} + b $$
$$ $$
$$\hat{y}^{(i)} = a^{(i)} = sigmoid(z^{(i)})$$ 
$$ $$
$$ \mathcal{L}(a^{(i)}, y^{(i)}) =  - y^{(i)}  \log(a^{(i)}) - (1-y^{(i)} )  \log(1-a^{(i)})$$
$$ $$
Then, the cost is computed by averaging over all training examples
$$ $$
$$ J = \frac{1}{m} \sum_{i=1}^m \mathcal{L}(a^{(i)}, y^{(i)})$$
$$ $$
For the backpropagation we will use the derivatives below (I do not show all the derivation steps)  
$$ $$
$$ \partial w = \frac{\partial J}{\partial w} = \frac{\partial J}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial w}  = \frac{1}{m}X(A-Y)^T \\$$ 
$$ \partial b = \frac{\partial J}{\partial b} = \frac{\partial J}{\partial a} \frac{\partial a}{\partial z} \frac{\partial z}{\partial b} = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$$

In [None]:
# Sigmoid Function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Select initial values zero
def initialize_with_zeros(dim):
    return np.zeros((dim,1)), 0

## <span id="10"></span> Propagation

$$\textbf{Forward Propagation}$$
$$ $$
\begin{split}
X \;\; & \large \Rightarrow & \;\; A = \sigma(w^T X + b) = (a^{(1)}, a^{(2)}, ..., a^{(m-1)}, a^{(m)}) \;\; & \large \Rightarrow & \;\; J = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(a^{(i)})+(1-y^{(i)})\log(1-a^{(i)})\right]
\end{split}
$$ $$
$$\textbf{Backpropagation}$$
$$ $$
$$ \partial w = \frac{1}{m}X(A-Y)^T $$
$$ $$
$$ \partial b = \frac{1}{m} \sum_{i=1}^m (a^{(i)}-y^{(i)})$$

In [None]:
def propagate(w, b, X, Y):
    m = X.shape[1]
    
    # FORWARD PROPAGATION (FROM X TO COST)
    A = sigmoid(np.dot(w.T,X)+b) # compute activation
    cost = -1/m*np.sum(Y*np.log(A)+(1-Y)*np.log(1-A)) # compute cost
    
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw = 1/m*np.dot(X,(A-Y).T)
    db = 1/m*np.sum(A-Y)
    
    # keep grads in a dictionary 
    grads = {"dw": dw,
             "db": db}
    
    return grads, cost
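
One way to gain confidence in the backpropagation formulas is a numerical gradient check. The sketch below is my own addition: it re-implements the cost and the analytic gradients with the same formulas as `propagate`, and compares them with central finite differences on small random data (the shapes, the seed and `eps` are arbitrary choices).

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost_fn(w, b, X, Y):
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)
    return -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

def analytic_grads(w, b, X, Y):
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)
    return 1 / m * np.dot(X, (A - Y).T), 1 / m * np.sum(A - Y)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))            # 3 features, 5 examples (one per column)
Y = np.array([[1, 0, 1, 1, 0]])
w = rng.normal(size=(3, 1))
b = 0.5

dw, db = analytic_grads(w, b, X, Y)

# central finite differences, one parameter at a time
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j, 0] = eps
    dw_num[j, 0] = (cost_fn(w + step, b, X, Y) - cost_fn(w - step, b, X, Y)) / (2 * eps)
db_num = (cost_fn(w, b + eps, X, Y) - cost_fn(w, b - eps, X, Y)) / (2 * eps)

max_dw_error = np.max(np.abs(dw - dw_num))
db_error = abs(db - db_num)
print(max_dw_error, db_error)
```

If the formulas are right, both differences should be tiny compared with the gradient values themselves.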

## <span id="11"></span> Optimization

The goal is to learn $w$ and $b$ by minimizing the cost function $J$. Recall gradient descent: for a parameter $\theta$, the update rule is $ \theta = \theta - \alpha \text{ } d\theta$, where $\alpha$ is the learning rate.

In [None]:
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):    
    costs = []
    
    for i in range(num_iterations):
        # Cost and gradient calculation
        grads, cost = propagate(w, b, X, Y)
        
        # Retrieve derivatives from grads
        dw = grads["dw"]
        db = grads["db"]
        
        # update rule
        w = w-learning_rate*dw
        b = b-learning_rate*db 
        
        # Record the costs
        if i % 100 == 0:
            costs.append(cost)
            
        # Print the cost every 100 training iterations
        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" %(i, cost))
    
    # Save parameters and gradients
    params = {"w": w,
              "b": b}
    
    grads = {"dw": dw,
             "db": db}
    
    return params, grads, costs

## <span id="12"></span> Predict

In order to calculate the probability easily, we need the below function but it is not essential as in the previous section.

In [None]:
def predict_nn(w, b, X):    
    m = X.shape[1]
    Y_prediction = np.zeros((1,m))
    w = w.reshape(X.shape[0], 1)
    
    # Compute vector "A" predicting the probabilities
    A = sigmoid(np.dot(w.T,X)+b)
        
    return A

## <span id="13"></span> Main Function

Now, we put together all the building blocks. Also, I did not forget to make necessary adjustments to add one vs rest method.   

In [None]:
def model(X_train, Y_train, X_test, Y_test, num_iterations = 30000, learning_rate = 0.1, print_cost = False): 
    # pandas to numpy
    X_train = X_train.values
    Y_train = Y_train.values.reshape((1,Y_train.shape[0]))
    X_test = X_test.values
    Y_test = Y_test.values.reshape((1,Y_test.shape[0]))
    
    # take transpose of X
    X_train = X_train.T
    X_test = X_test.T
    
    # initialize parameters with zeros 
    w, b = initialize_with_zeros(X_train.shape[0])
    
    # one vs all
    # use the Y_train argument (not the global y_train) to get the class labels
    u = np.unique(Y_train)
    param_w = []
    param_b = []
    allCosts = []
    for c in u:
        # set the labels to 0 and 1
        ynew = np.array(Y_train == c, dtype = int)
        # Gradient descent 
        parameters, grads, costs = optimize(w, b, X_train, ynew, num_iterations, learning_rate, print_cost = print_cost)
        
        # Save costs
        allCosts.append(costs)
        
        # Retrieve parameters w and b from dictionary "parameters"
        param_w.append(parameters["w"])
        param_b.append(parameters["b"])
    
    # Calculate probabilities
    pred_test = np.zeros((len(u),X_test.shape[1]))
    for i in range(len(u)):
        pred_test[i,:] = predict_nn(param_w[i], param_b[i], X_test)
    
    # Select max probability
    Y_prediction_test = np.argmax(pred_test, axis=0)
    
    # Calculate probabilities
    pred_train = np.zeros((len(u),X_train.shape[1]))
    for i in range(len(u)):
        pred_train[i,:] = predict_nn(param_w[i], param_b[i], X_train)
    
    # Select max probability
    Y_prediction_train = np.argmax(pred_train, axis=0)
        
    d = {"costs": allCosts,
         "Y_prediction_test": Y_prediction_test, 
         "Y_prediction_train" : Y_prediction_train, 
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}
    
    return d

# <span id="14"></span> Testing the Functions
#### [Return Contents](#0)
<hr/>

To test my functions, I defined 3 new points that are very close to 3 existing points, and split the data into training and test sets. I expect the predicted labels to match those of the nearby real points, and my functions to give results similar to scikit-learn's functions.

In [None]:
# I chose data points close to the real data points X[15], X[66] and X[130]
test = np.array([[5.77,4.44,1.55,0.44],[5.66,3.01,4.55,1.55],[7.44, 2.88, 6.11, 1.99]])
print("TEST POINTS\n", test)

all_X = df[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
all_y = df['Class']

# split data as training and test
df=df[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm','Class']]
train_data,test_data = train_test_split(df,train_size = 0.8,random_state=2)
X_train = train_data[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
y_train = train_data['Class']
X_test = test_data[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]
y_test = test_data['Class']

def transform(i):
    if i == 0:
        return 'Iris-setosa'
    if i == 1:
        return 'Iris-versicolor'
    if i == 2:
        return 'Iris-virginica'

Before starting to predict test points' labels, I wanted to see the places of these points according to some features. Thus, I drew the below charts which will help us to guess new points' labels.

In [None]:
plt.figure(figsize=(10,10))
t=np.unique(all_y)

ax1=plt.subplot(2, 2, 1)
ax1.set(xlabel='Sepal Length (cm)', ylabel='Sepal Width (cm)')
plt.plot(df[df['Class']==t[0]].iloc[:,0], df[df['Class']==t[0]].iloc[:,1], 'o', color='y')
plt.plot(df[df['Class']==t[1]].iloc[:,0], df[df['Class']==t[1]].iloc[:,1], 'o', color='r')
plt.plot(df[df['Class']==t[2]].iloc[:,0], df[df['Class']==t[2]].iloc[:,1], 'o', color='b')
# test datapoints
plt.plot(test[0,0],test[0,1],'*',color="k")
plt.plot(test[1,0],test[1,1],'*',color="k")
plt.plot(test[2,0],test[2,1],'*',color="k")

ax2=plt.subplot(2, 2, 2)
ax2.set(xlabel='Petal Length (cm)', ylabel='Petal Width (cm)')
ax2.yaxis.set_label_position("right")
ax2.yaxis.tick_right()
plt.plot(df[df['Class']==t[0]].iloc[:,2], df[df['Class']==t[0]].iloc[:,3], 'o', color='y')
plt.plot(df[df['Class']==t[1]].iloc[:,2], df[df['Class']==t[1]].iloc[:,3], 'o', color='r')
plt.plot(df[df['Class']==t[2]].iloc[:,2], df[df['Class']==t[2]].iloc[:,3], 'o', color='b')
# test datapoints
plt.plot(test[0,2],test[0,3],'*',color="k")
plt.plot(test[1,2],test[1,3],'*',color="k")
plt.plot(test[2,2],test[2,3],'*',color="k")

ax3=plt.subplot(2, 2, 3)
ax3.set(xlabel='Sepal Length (cm)', ylabel='Petal Length (cm)')
plt.plot(df[df['Class']==t[0]].iloc[:,0], df[df['Class']==t[0]].iloc[:,2], 'o', color='y')
plt.plot(df[df['Class']==t[1]].iloc[:,0], df[df['Class']==t[1]].iloc[:,2], 'o', color='r')
plt.plot(df[df['Class']==t[2]].iloc[:,0], df[df['Class']==t[2]].iloc[:,2], 'o', color='b')
# test datapoints
plt.plot(test[0,0],test[0,2],'*',color="k")
plt.plot(test[1,0],test[1,2],'*',color="k")
plt.plot(test[2,0],test[2,2],'*',color="k")

ax4=plt.subplot(2, 2, 4)
ax4.set(xlabel='Sepal Width (cm)', ylabel='Petal Width (cm)')
ax4.yaxis.set_label_position("right")
ax4.yaxis.tick_right()
plt.plot(df[df['Class']==t[0]].iloc[:,1], df[df['Class']==t[0]].iloc[:,3], 'o', color='y')
plt.plot(df[df['Class']==t[1]].iloc[:,1], df[df['Class']==t[1]].iloc[:,3], 'o', color='r')
plt.plot(df[df['Class']==t[2]].iloc[:,1], df[df['Class']==t[2]].iloc[:,3], 'o', color='b')
# test datapoints
plt.plot(test[0,1],test[0,3],'*',color="k")
plt.plot(test[1,1],test[1,3],'*',color="k")
plt.plot(test[2,1],test[2,3],'*',color="k");


# <span id="15"></span> k-NN from Scratch

The results of the k-NN from scratch function for my test points are below:

In [None]:
# Predicting the classes of the test data by kNN 
# Decide k value
k = 5
# print results
print("k-NN ("+str(k)+"-nearest neighbors)\n")
c = kNN(all_X,all_y,test,k)
for i in range(len(c)):
    ct=set(map(transform,[c[i]]))
    print("Test point: "+str(test[i,:])+"  Label: "+str(c[i])+" "+str(ct))

# <span id="16"></span> k-NN from Scratch vs scikit-learn k-NN

In this section, I compare my kNN function with scikit-learn's k-NN classifier and plot the confusion matrices for both models. The results look the same, but by removing *random_state* in the *train_test_split* function and changing *train_size*, different results can be found.

In [None]:
# k-NN from scratch
c=kNN(X_train,y_train,X_test,k)
cm=confusion_matrix(y_test, c)

# k-NN - scikit learn
sck = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)
sck_cm=confusion_matrix(y_test, sck.predict(X_test))

plt.figure(figsize=(15,6))
plt.suptitle("Confusion Matrixes",fontsize=24)

plt.subplot(1,2,1)
plt.title("k-NN from Scratch")
sns.heatmap(cm, annot = True, cmap="Greens",cbar=False);

plt.subplot(1,2,2)
plt.title("k-NN - scikit learn")
sns.heatmap(sck_cm, annot = True, cmap="Greens",cbar=False);

# <span id="17"></span> Logistic Regression from Scratch

The results of the logistic regression from scratch function for my test points are below:

In [None]:
# Predicting the classes of the test data by Logistic Regression
print("Logistic Regression\n")
c=logistic(X_train,y_train,test)
# print results
for i in range(len(c['Y_prediction_test'])):
    ct=set(map(transform,[c['Y_prediction_test'][i]]))
    print("Test point: "+str(test[i,:])+"  Label: "+str(c['Y_prediction_test'][i])+" "+str(ct))

# <span id="18"></span> Logistic Regression from Scratch vs Logistic Regression from Neural Network Perspective

The confusion matrices below show that both models give the same results when the same parameters are selected.

In [None]:
# logistic regression from scratch
start=dt.datetime.now()
c=logistic(X_train,y_train,X_test)
# Print train/test accuracies (share of correctly predicted labels)
print('Elapsed time of logistic regression from scratch: ',str(dt.datetime.now()-start))
print("train accuracy: {} %".format(np.mean(c["Y_prediction_train"] == y_train.values) * 100))
print("test accuracy: {} %".format(np.mean(c["Y_prediction_test"] == y_test.values) * 100))


# Logistic Regression from Neural Network Perspective
start=dt.datetime.now()
d = model(X_train, y_train, X_test, y_test)
print('\nElapsed time of Logistic Regression from Neural Network Perspective: ',str(dt.datetime.now()-start))
print("train accuracy: {} %".format(np.mean(d["Y_prediction_train"] == y_train.values) * 100))
print("test accuracy: {} %".format(np.mean(d["Y_prediction_test"] == y_test.values) * 100))


cm=confusion_matrix(y_test, c['Y_prediction_test'])

plt.figure(figsize=(15,6))
plt.suptitle("Confusion Matrixes",fontsize=24)

plt.subplot(1,2,1)
plt.title("Logistic Regression from Scratch")
sns.heatmap(cm, annot = True, cmap="Greens",cbar=False);

cm=confusion_matrix(y_test, d['Y_prediction_test'].ravel())

plt.subplot(1,2,2)
plt.title("Logistic Regression from Neural Network Perspective")
sns.heatmap(cm, annot = True, cmap="Greens",cbar=False);

Moreover, the line charts below show the costs for different learning rates. Such plots are generally used as a sanity check. They look the same for both models, and the effect of the learning rate can be observed clearly. Since I used one vs rest, I drew 3 cost lines for each learning rate. 

In [None]:
# Learning rates
lr = [0.1, 0.01, 0.001]

for i in range(len(lr)):
    # Run the model for different learning rates
    c = logistic(X_train,y_train,X_test, alpha = lr[i])
    
    # Adjust results to plot
    dfcost = pd.DataFrame(list(c['costs'])).transpose()
    dfcost.columns = ['0 (Iris-setosa) vs rest','1 (Iris-versicolor) vs rest','2 (Iris-virginica) vs rest']
    
    # Plot the costs
    if i==0 : f, axes = plt.subplots(1, 3,figsize=(24,4))
    sns.lineplot(data = dfcost.iloc[:, :3], ax=axes[i])
    sns.despine(right=True, offset=True)
    axes[i].set(xlabel='Iterations (hundreds)', ylabel='Cost ' +'(Learning Rate: ' + str(lr[i]) + ')')
    
plt.suptitle("Logistic Regression from Scratch\n",fontsize=24);  

for i in range(len(lr)):
    # Run the model for different learning rates
    d = model(X_train, y_train, X_test, y_test, learning_rate = lr[i])
    
    # Adjust results to plot
    dfcost = pd.DataFrame(list(d['costs'])).transpose()
    dfcost.columns = ['0 (Iris-setosa) vs rest','1 (Iris-versicolor) vs rest','2 (Iris-virginica) vs rest']
    
    # Plot the costs
    if i==0 : f, axes = plt.subplots(1, 3,figsize=(30,5))
    sns.lineplot(data = dfcost.iloc[:, :3], ax=axes[i])
    sns.despine(right=True, offset=True)
    axes[i].set(xlabel='Iterations (hundreds)', ylabel='Cost ' +'(Learning Rate: ' + str(lr[i]) + ')')
    
plt.suptitle("Logistic Regression from Neural Network Perspective\n",fontsize=24);    

# <span id="19"></span> Logistic Regression from Scratch vs scikit-learn Logistic Regression

This time, I compare my logistic function with scikit-learn's logistic regression and plot the confusion matrices for both models. I also changed the regularization parameter ($\lambda$) for both models and plotted the confusion matrices for them too. As expected, the results look similar but are not identical. By removing *random_state* in the *train_test_split* function and changing *train_size*, different results can be obtained. 

In [None]:
# logistic regression from scratch
c=logistic(X_train,y_train,X_test)
cm=confusion_matrix(y_test, c['Y_prediction_test'])

# logistic regression - scikit learn
sck = LogisticRegression().fit(X_train, y_train)
sck_cm=confusion_matrix(y_test, sck.predict(X_test))

# logistic regression from scratch
c_r=logistic(X_train,y_train,X_test,lmd=0.01)
cm_r=confusion_matrix(y_test, c_r['Y_prediction_test'])

# logistic regression - scikit learn
sck_r = LogisticRegression(C=100).fit(X_train, y_train)
sck_cm_r=confusion_matrix(y_test, sck_r.predict(X_test))

plt.figure(figsize=(15,12))
plt.suptitle("Confusion Matrixes",fontsize=24)

plt.subplot(2,2,1)
plt.title("Logistic Regression from Scratch")
sns.heatmap(cm, annot = True, cmap="Greens",cbar=False);

plt.subplot(2,2,2)
plt.title("Logistic Regression - scikit learn")
sns.heatmap(sck_cm, annot = True, cmap="Greens",cbar=False);

plt.subplot(2,2,3)
plt.title(r"Logistic Regression from Scratch ( $\lambda$ = 0.01 )")
sns.heatmap(cm_r, annot = True, cmap="Greens",cbar=False);

plt.subplot(2,2,4)
plt.title(r"Logistic Regression ( $\lambda$ = 0.01 / C = 100 ) - scikit learn")
sns.heatmap(sck_cm_r, annot = True, cmap="Greens",cbar=False);
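
A short note on why $\lambda = 0.01$ is paired with *C = 100*. This is a sketch that assumes scikit-learn's L2-penalized objective has the form $C \sum Cost + \frac{1}{2}\sum \theta_{j}^{2}$, as described in its user guide; solver details such as how the intercept is treated may differ. Multiplying the regularized cost function $J(\theta)$ by the positive constant $m/\lambda$ does not change its minimizer:

$$\dfrac{m}{\lambda} J(\theta) = \dfrac{1}{\lambda} \sum^{m}_{i=1}Cost(h_{\theta}(x^{(i)}), y^{(i)})+\dfrac{1}{2}\sum^{n}_{j=1}\theta^{2}_{j}$$

Comparing the two forms term by term gives

$$C = \dfrac{1}{\lambda}$$

so $\lambda = 0.01$ corresponds to $C = 100$. Note also that *LogisticRegression()* uses $C = 1$ by default, which is a regularized model, whereas *lmd=0* in my function means no regularization. This may be one reason why the two models in the first row are not expected to match exactly.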

# <span id="20"></span> k-Fold Cross Validation from Scratch
#### [Return Contents](#0)
<hr/>

k-Fold Cross Validation is a very useful technique for checking how well a model performs on independent data. It is often used to flag problems caused by overfitting and selection bias. However, it adds data processing load and computation time. 

The figure below depicts k-fold cross validation. Briefly, we randomly divide the data into k folds, take one of the folds as the test set in each step, train on the remaining folds, and calculate the accuracy. 

<img src="https://i.imgur.com/hq45Jfq.png" title="source: imgur.com" />

I divided my k-fold cross validation into two parts. First, I defined the function below to split the data into k folds.

In [None]:
def cross_validation_split(dataset, folds):
    dataset_split = []
    # drop() below returns a new DataFrame, so the caller's data is not modified
    df_copy = dataset
    # rows left over when the data does not divide evenly are not used
    fold_size = int(df_copy.shape[0] / folds)

    # for loop to save each fold
    for i in range(folds):
        fold = []
        # while loop to add elements to the fold
        while len(fold) < fold_size:
            # select a random row position
            r = randrange(df_copy.shape[0])
            # determine the index label of this row
            index = df_copy.index[r]
            # save the randomly selected row
            fold.append(df_copy.loc[index].values.tolist())
            # delete the selected row from the dataframe
            # so that it is not selected again
            df_copy = df_copy.drop(index)
        # save the fold
        dataset_split.append(np.asarray(fold))

    return dataset_split
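
As a side note, the same idea can be sketched on row positions with NumPy. This is not the function used in this kernel; `split_indices_into_folds` is a hypothetical helper that shuffles the positions once and slices them into equal folds. It also shows that rows are left out when the data does not divide evenly by the number of folds.

```python
import numpy as np

def split_indices_into_folds(n_rows, folds, seed=None):
    """Hypothetical helper: shuffle row positions once and cut them into equal folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    fold_size = n_rows // folds
    # rows beyond folds * fold_size are left out, as in the loop-based version
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(folds)]

example_folds = split_indices_into_folds(n_rows=10, folds=3, seed=1)
used = np.concatenate(example_folds)
print([len(f) for f in example_folds])     # three folds of 3 rows each
print(10 - len(used), "row(s) left out")   # 10 rows cannot fill 3 equal folds
```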

Using the *cross_validation_split* function, I defined my main function below. This function takes each fold as the test set in turn and returns the accuracy for each fold.

In [None]:
def kfoldCV(dataset, f=5, k=5, model="logistic"):
    data=cross_validation_split(dataset,f)
    result=[]
    # determine training and test sets 
    for i in range(f):
        r = list(range(f))
        r.pop(i)
        for j in r :
            if j == r[0]:
                cv = data[j]
            else:    
                cv=np.concatenate((cv,data[j]), axis=0)
        
        # apply the selected model
        # default is logistic regression
        if model == "logistic":
            # default: alpha=0.1, num_iter=30000
            # if you change alpha or num_iter, adjust the below line         
            c = logistic(cv[:,0:4],cv[:,4],data[i][:,0:4])
            test = c['Y_prediction_test']
        elif model == "knn":
            test = kNN(cv[:,0:4],cv[:,4],data[i][:,0:4],k)
            
        # calculate accuracy    
        acc=(test == data[i][:,4]).sum()
        result.append(acc/len(test))
        
    return result
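
The train/test assembly above can also be tried in isolation. The sketch below is an assumption-laden variant, not the kernel's function: `kfold_accuracy` is a hypothetical helper that accepts any `fit_predict` callable, and `majority_class` is a stand-in model used only so the example runs without the logistic regression or k-NN functions.

```python
import numpy as np

def kfold_accuracy(folds, fit_predict):
    """Hypothetical helper: each fold is a 2-D array whose last column is the label."""
    accuracies = []
    for i, test in enumerate(folds):
        # stack every fold except the i-th into the training set
        train = np.concatenate([f for j, f in enumerate(folds) if j != i], axis=0)
        pred = fit_predict(train[:, :-1], train[:, -1], test[:, :-1])
        # share of correct predictions on the held-out fold
        accuracies.append(float(np.mean(pred == test[:, -1])))
    return accuracies

def majority_class(X_train, y_train, X_test):
    """Stand-in model: always predict the most frequent training label."""
    values, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), values[np.argmax(counts)])

# toy data: one feature column followed by the label column
toy_folds = [
    np.array([[0.1, 0.0], [0.2, 0.0]]),
    np.array([[0.3, 0.0], [0.9, 1.0]]),
    np.array([[0.8, 1.0], [0.4, 0.0]]),
]
print(kfold_accuracy(toy_folds, majority_class))  # [1.0, 0.5, 0.5]
```

Passing the model as a callable avoids the `if`/`elif` branch, at the cost of requiring every model to share the same call signature.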

We can observe the accuracies of 3-fold cross validation for my logistic regression and k-NN from scratch functions below. To get different results on each run, comment out *seed(1)*.

In [None]:
print("3-Fold Cross Validation for Logistic Regression from Scratch")
print("Fold Size:",int(df.shape[0] / 3))
seed(1)
acc=kfoldCV(df,3)
print("Accuracies:", acc)
print("Average of the Accuracy:", round(mean(acc),2))

print("\n3-Fold Cross Validation for k-NN from Scratch")
print("Fold Size:",int(df.shape[0] / 3))
seed(1)
acc=kfoldCV(df,3,model="knn")
print("Accuracies:", acc)
print("Average of the Accuracy:", round(mean(acc), 2))

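In terms of the **bias-variance trade-off**, a larger number of folds gives a larger training set in each step, which is commonly associated with lower bias but higher variance in the estimated accuracy. Therefore, depending on our choice of the number of folds, the accuracies might change. This change can be observed in the figures below for my models.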


In [None]:
seed(1)
bva_lr=[]
bva_knn=[]
for f in range(2,11):
    # k-fold cv from scratch for logistic regression
    bva_lr.append(mean(kfoldCV(df,f)))
    # k-fold cv from scratch for k-NN
    bva_knn.append(mean(kfoldCV(df,f,model="knn")))

# plot the change in the average accuracy according to k 
plt.figure(figsize=(15,4))
plt.subplot(1,2,1)
plt.title("Logistic Regression")
plt.xlabel("Number of Folds (k)")
plt.ylabel("Average Accuracy")
plt.plot(range(2,11),bva_lr);

plt.subplot(1,2,2)
plt.title("k-NN")
plt.xlabel("Number of Folds (k)")
plt.ylabel("Average Accuracy")
plt.plot(range(2,11),bva_knn);
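
To see why the number of folds matters, it helps to look at how many rows are used for training and testing in each step. The sketch below assumes a hypothetical dataset of 100 rows (not the size of the data used in this kernel) and the same integer fold size as in *cross_validation_split*.

```python
n_rows = 100  # hypothetical dataset size, not taken from the notebook's data

rows = []
for folds in range(2, 11):
    fold_size = n_rows // folds            # rows in each test fold
    train_size = fold_size * (folds - 1)   # rows used for training in each step
    rows.append((folds, fold_size, train_size))
    print(f"folds={folds:2d}  test={fold_size:3d}  train={train_size:3d}")
```

With 2 folds each model is trained on 50 rows and tested on 50, while with 10 folds it is trained on 90 rows and tested on only 10, so each single accuracy value is based on fewer test rows.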

# <span id="21"></span> k-Fold Cross Validation from Scratch vs scikit-learn k-Fold Cross Validation

I compare my k-fold cross validation function with scikit-learn's k-fold cross validation function. Here, we shouldn't forget that my function uses the k-NN and logistic regression functions from scratch, and this may increase the difference between the results. When I use *seed(1)* as in the previous sections, it gives plausible results.

In [None]:
seed(1)
lr_scratch=kfoldCV(df,3)
knn_scratch=kfoldCV(df,3,model="knn")
lr_sck=cross_val_score(LogisticRegression(), all_X, all_y, cv=3)
knn_sck=cross_val_score(KNeighborsClassifier(n_neighbors = k), all_X, all_y, cv=3)

print("RESULTS")
print("Logistic Regression & k-Fold Cross Validation from Scratch: ",lr_scratch,"\nMean: ",round(mean(lr_scratch),2))
print("\nLogistic Regression & k-Fold Cross Validation (scikit-learn): ",lr_sck,"\nMean: ",round(mean(lr_sck),2))
print("\nk-NN & k-Fold Cross Validation from Scratch: ",knn_scratch,"\nMean: ",round(mean(knn_scratch),2))
print("\nk-NN & k-Fold Cross Validation (scikit-learn): ",knn_sck,"\nMean: ",round(mean(knn_sck),2))
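
One more source of difference is the splitting itself: when *cv* is an integer and the estimator is a classifier, *cross_val_score* uses stratified folds without shuffling, whereas my function picks rows at random. A possible way to bring the two closer is to pass a shuffled `KFold` object, as sketched below. The iris data, `max_iter=1000` and `random_state=1` are assumptions for this illustration only; *all_X* and *all_y* could be passed instead.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# stand-in data so that the sketch runs on its own
X, y = load_iris(return_X_y=True)

# shuffle the rows before cutting them into 3 folds
shuffled_cv = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=shuffled_cv)
print(scores, round(scores.mean(), 2))
```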

# <span id="22"></span> Conclusion
#### [Return Contents](#0)
<hr/>

In this kernel, I used my own functions and tried to explain the theory behind them without using scikit-learn or any other built-in functions. Like me, you probably use the built-in functions in your daily tasks, and I guess they perform better. However, I think digging deeper and understanding the logic behind them will make it easier to see the whole picture.

<b><font color="green">Thank you for reading my kernel </font></b> **and if you liked this kernel, please** <b><font color="red">do not forget to </font><font color="green">UPVOTE </font></b> 🙂
    
If you would like to glance at my other notebooks, please [**CLICK HERE**](https://www.kaggle.com/burhanykiyakoglu/notebooks).