# Logistic Regression from scratch

#### Logistic regression is a supervised learning algorithm that estimates the probability that a sample belongs to a class by passing a linear score through an activation function, then assigns the sample to the more likely class.
#### - Activation function: sigmoid (output range: between 0 and 1)
#### - Classification: compare the probability against a threshold
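
The two points above can be written compactly. The symbols here are notation assumed for illustration and do not come from the notebook: $z = w^\top x + b$ is the linear score, $\tau$ is the threshold, and $\hat{y}$ is the predicted class.

$$
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} =
\begin{cases}
1 & \text{if } \sigma(z) \ge \tau \\
0 & \text{otherwise}
\end{cases}
$$

A common choice is $\tau = 0.5$, which corresponds to $z \ge 0$ because $\sigma(0) = 0.5$.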

## Import Libraries

In [82]:
import pandas as pd
import numpy as np

## Load the dataset

In [83]:
mnist_train = pd.read_csv("./MNIST_training.csv")
mnist_test = pd.read_csv("./MNIST_test.csv")
mnist_train

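| Unnamed: 0 | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 944 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 945 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 946 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 947 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |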


## Split X, y


In [84]:
x_train = mnist_train.drop(columns='label')
y_train = mnist_train['label']
x_test = mnist_test.drop(columns='label')
y_test = mnist_test['label']

## Check that the train and test sets are balanced across classes

In [85]:
print('train data label bincount:', np.bincount(y_train))
print('test data label bincount:', np.bincount(y_test))

train data label bincount: [95 95 95 95 95 95 94 95 95 95]
test data label bincount: [5 5 5 5 5 5 5 5 5 5]


For each test sample in “MNIST_test.csv”, compute its distance or similarity (Euclidean distance, Manhattan distance, or cosine similarity) to every training sample.

In [86]:
print(len(x_train))
print(x_train.shape[1])

949
784


## Functions to compute the models

### Euclidean distance

![](Euclidean_distance.png)

In [87]:
def euclidean_distance(x_test, x_train):
    """Return an (n_test, n_train) matrix of pairwise Euclidean distances."""
    euclidean_dist = np.zeros((x_test.shape[0], x_train.shape[0]))

    for i in range(x_test.shape[0]):
        for j in range(x_train.shape[0]):
            # .values converts each DataFrame row (pd.Series) to an np.ndarray
            diff = x_test.iloc[i].values - x_train.iloc[j].values
            euclidean_dist[i, j] = np.sqrt(np.sum(diff ** 2))
            # Debug output: prints one line per (test, train) pair
            print('euclidean_dist[%d][%d] = %f' % (i, j, euclidean_dist[i, j]))

    return euclidean_dist
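
The nested loops above evaluate every (test, train) pair in Python, which is slow for larger datasets. As an alternative, here is a minimal vectorized sketch based on the identity ‖a − b‖² = ‖a‖² + ‖b‖² − 2a·b. The function name `euclidean_distance_vectorized` is hypothetical and is not part of the notebook.

```python
import numpy as np

def euclidean_distance_vectorized(x_test, x_train):
    """Pairwise Euclidean distances, shape (n_test, n_train)."""
    # Accept DataFrames or arrays; float64 avoids integer overflow on pixel data.
    a = np.asarray(x_test, dtype=np.float64)
    b = np.asarray(x_train, dtype=np.float64)
    # Squared distances for all pairs at once via the expansion of ||a - b||^2.
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * (a @ b.T)
    # Rounding error can push tiny values slightly below zero; clip before sqrt.
    return np.sqrt(np.maximum(sq, 0.0))
```

The result should agree with `euclidean_distance` up to floating-point rounding, without the per-pair debug printing.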

### Manhattan distance

![](Manhattan_distance.png)

In [88]:
def manhattan_distance(x_test, x_train):
    """Return an (n_test, n_train) matrix of pairwise Manhattan (L1) distances."""
    manhattan_dist = np.zeros((x_test.shape[0], x_train.shape[0]))

    for i in range(x_test.shape[0]):
        for j in range(x_train.shape[0]):
            # .values converts each DataFrame row (pd.Series) to an np.ndarray
            diff = x_test.iloc[i].values - x_train.iloc[j].values
            manhattan_dist[i, j] = np.sum(np.abs(diff))

    return manhattan_dist
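
The inner loop of the Manhattan computation can be replaced by NumPy broadcasting, comparing one test row against all training rows at once. This is a sketch under that assumption; the name `manhattan_distance_vectorized` is hypothetical and is not part of the notebook.

```python
import numpy as np

def manhattan_distance_vectorized(x_test, x_train):
    """Pairwise Manhattan (L1) distances, shape (n_test, n_train)."""
    a = np.asarray(x_test, dtype=np.float64)
    b = np.asarray(x_train, dtype=np.float64)
    dist = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        # Broadcast a single test row against every training row,
        # then sum absolute differences across the feature axis.
        dist[i] = np.abs(b - a[i]).sum(axis=1)
    return dist
```

Keeping the outer loop limits the temporary array to one `(n_train, n_features)` block per test row instead of materializing all pairs at once.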

### Cosine similarity

![](Cosine_similarity.png)
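
The notebook gives no implementation for cosine similarity, so the following is only a sketch of how it could be computed in the same (n_test, n_train) layout as the distance functions. The function name, the `eps` parameter, and the handling of all-zero rows are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(x_test, x_train, eps=1e-12):
    """Pairwise cosine similarity, shape (n_test, n_train)."""
    a = np.asarray(x_test, dtype=np.float64)
    b = np.asarray(x_train, dtype=np.float64)
    # Row-wise L2 norms, kept as column vectors for broadcasting.
    a_norm = np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = np.linalg.norm(b, axis=1, keepdims=True)
    # Dot products divided by the product of norms; eps (an assumed guard)
    # avoids division by zero when a row is all zeros.
    return (a @ b.T) / np.maximum(a_norm * b_norm.T, eps)
```

Unlike the two distances, a larger cosine similarity means the samples are closer, so any nearest-neighbor search has to rank by descending similarity.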

### Activation function(sigmoid)

In [89]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
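
To illustrate how the sigmoid output could feed the thresholding step mentioned in the introduction, here is a minimal sketch. The `classify` helper and its default `threshold=0.5` are assumptions for illustration and do not appear in the notebook; `sigmoid` is repeated only so the snippet runs on its own.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def classify(scores, threshold=0.5):
    """Return 1 where sigmoid(score) >= threshold, else 0 (hypothetical helper)."""
    probs = sigmoid(np.asarray(scores, dtype=np.float64))
    return (probs >= threshold).astype(int)

scores = np.array([-2.0, 0.0, 3.0])
print(sigmoid(scores))   # probabilities in (0, 1)
print(classify(scores))  # binary decisions at the default threshold
```

Because `sigmoid(0)` is exactly 0.5, a score of zero lands on the positive side under the `>=` comparison used here.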

## Find the K nearest neighbors and take the majority class. The value of “K” can be chosen empirically.
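
One way this step could be implemented from a precomputed distance matrix is sketched below. The function name `knn_predict`, the default `k=5`, and the tie-breaking rule are assumptions rather than the notebook's own method; labels are assumed to be non-negative integers, as in MNIST.

```python
import numpy as np

def knn_predict(dist, y_train, k=5):
    """Predict a label per test row by majority vote over the k nearest neighbors."""
    dist = np.asarray(dist)
    labels = np.asarray(y_train)
    # Column indices of the k smallest distances in each row.
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]
    neighbor_labels = labels[nearest]
    # Majority vote per row; argmax of the counts breaks ties toward the smaller label.
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])
```

With the matrices computed in this notebook, a call might look like `knn_predict(euclidean_dist, y_train, k=3)`, where `k=3` is only an example value. For a similarity measure, passing the negated matrix would rank the most similar samples first.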

## Compare the prediction with the ground truth in the test data
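
A simple way to compare predictions against the ground truth is an element-wise equality check. The sketch below assumes the predictions are available as an array `y_pred` aligned with `y_test`; both the helper name `compare_predictions` and `y_pred` are hypothetical.

```python
import numpy as np

def compare_predictions(y_pred, y_true):
    """Return a boolean mask of correct predictions and the indices of the errors."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    correct = y_pred == y_true
    # Positions where the prediction differs from the true label.
    wrong_idx = np.flatnonzero(~correct)
    return correct, wrong_idx
```

The returned indices make it easy to inspect which test images were misclassified.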

## Evaluate model performance

![](Logistic_Regression_1.png)
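
Accuracy and a confusion matrix are two common ways to summarize performance on a multi-class task. Whether they match the metric shown in the figure above is not confirmed by the notebook, so treat this as a sketch; the function names and the default `n_classes=10` are assumptions.

```python
import numpy as np

def accuracy(y_pred, y_true):
    """Fraction of predictions that equal the true label."""
    return float(np.mean(np.asarray(y_pred) == np.asarray(y_true)))

def confusion_matrix(y_pred, y_true, n_classes=10):
    """Counts with rows indexed by true label and columns by predicted label."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly when the same (true, pred) pair repeats.
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm
```

Off-diagonal entries of the matrix show which digits are confused with each other, which a single accuracy number hides.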

In [93]:
euclidean_dist = euclidean_distance(x_test, x_train)
# print(euclidean_dist)
manhattan_dist = manhattan_distance(x_test, x_train)
print(manhattan_dist[-1][-1])

euclidean_dist[0][0] = 2221.794545
euclidean_dist[0][1] = 2787.881633
euclidean_dist[0][2] = 2837.138347
euclidean_dist[0][3] = 3066.371797
euclidean_dist[0][4] = 2110.545190
euclidean_dist[0][5] = 1885.622974
euclidean_dist[0][6] = 2874.071502
euclidean_dist[0][7] = 2801.888649
euclidean_dist[0][8] = 3034.952224
euclidean_dist[0][9] = 2953.370109
euclidean_dist[0][10] = 2686.990510
euclidean_dist[0][11] = 3245.244367
euclidean_dist[0][12] = 2463.437842
euclidean_dist[0][13] = 2939.788258
euclidean_dist[0][14] = 2309.723793
euclidean_dist[0][15] = 2779.663649
euclidean_dist[0][16] = 2407.519470
euclidean_dist[0][17] = 3032.322377
euclidean_dist[0][18] = 2214.908350
euclidean_dist[0][19] = 2269.973348
euclidean_dist[0][20] = 3030.479665
euclidean_dist[0][21] = 1800.426616
euclidean_dist[0][22] = 2387.829768
euclidean_dist[0][23] = 1857.285923
euclidean_dist[0][24] = 3116.493061
euclidean_dist[0][25] = 1810.859188
euclidean_dist[0][26] = 2725.515181
euclidean_dist[0][27] = 2046.094573
eu