# Machine Learning Exercises 2

## 1 - Naive Bayes Classification

![](Q1_1.png "Q1_1")

![](Q1_2.png "Q1_2")

Question 1 Answer:
We want P(buys_computer = yes | age = youth, income = medium, student = yes, credit_rating = fair)
    and P(buys_computer = no | age = youth, income = medium, student = yes, credit_rating = fair)

Our answer is the largest of the two (the argmax class). We will ignore the denominators as they are the same for both calculations.

P(yes | youth, medium, styes, fair) => P(yes) * P(youth | yes) * P(medium | yes) * P(styes | yes) * P(fair | yes)
 => (9/14) * (2/9) * (4/9) * (6/9) * (6/9) = 0.0282
 
P(no | youth, medium, styes, fair) => P(no) * P(youth | no) * P(medium | no) * P(styes | no) * P(fair | no)
 => (5/14) * (3/5) * (2/5) * (1/5) * (2/5) = 0.00686
 
These numbers could be normalized, but that's unnecessary here. It is clear that this customer will buy a computer.

## 2 - Classification Metrics

![](Q2.png "Q2")

Question 2 Answer:
1. There are 3 predicted positives which are actually negative. There is 1 predicted negative which is actually positive.
2. Accuracy = #correct/#total = (33+72)/(33+72+1+3) = 0.963 = 96.3%
3. Precision = #truepos/(#truepos+#falsepos) = (33)/(33+3) = 0.917 = 91.7%
   Recall = #truepos/(#truepos + #falseneg) = (33)/(33+1) = 0.971 = 97.1%
4. F1 = 2*(precision*recall)/(precision+recall) = 2*(0.917*0.971)/(0.917+0.971) = 0.943
5. We obviously have greater attention for infected people rather than uninfected people. Thus, we'd care more about the rate of false negatives, who would think they were Covid free and may go on to infect others, which means we'd prioritize a higher recall.

## 3 -  Logistic Regression and Perceptron

 In this problem you will be applying logistic regression and perceptron to the breastcancer dataset for binary classification:

 **default of credit card clients**:  This dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.



### Task
- Prepare a normalized version of data. Use min-max normalization. 
- Train two logistic regression models using gradient descent with raw as well as normalized data. 
- Train two perceptron classifiers with raw as well as normalized data.
- Compare training and test results of four models in terms of accuracy. 

Note:

The skeleton code is only a guide. You can change the method definitions where necessary with appropriate comments.

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
def load_data(data_dir):
    ''' data: input features
        labels: output features
    '''
    df = pd.read_excel(data_dir)

    datanp = df.to_numpy()
    labels = datanp[:,-1]
    data = np.delete(datanp,-1,1) #don't need the labels in the data
    data = np.delete(data,0,1) #don't need the ID in the data either
    data = data.astype(float)
    return data, labels

data, labels = load_data('Credit card Default.xlsx')
normdata = np.copy(data)

In [3]:
#Min-max Normalization
for i in range(len(normdata[0])): #for each row of data
    cmin = np.amin(normdata[:,i])
    cmax = np.amax(normdata[:,i])
    print(cmax)
    normdata[:,i] -= cmin
    normdata[:,i] = normdata[:,i]/float(cmax-cmin)
    
print(data)
print(normdata)
print(labels)

1000000.0
2.0
6.0
3.0
75.0
8.0
7.0
7.0
7.0
7.0
8.0
964511.0
983931.0
578971.0
891586.0
927171.0
961664.0
368199.0
344261.0
896040.0
497000.0
332000.0
528666.0
[[2.000e+04 2.000e+00 2.000e+00 ... 0.000e+00 0.000e+00 0.000e+00]
 [1.200e+05 2.000e+00 2.000e+00 ... 1.000e+03 0.000e+00 2.000e+03]
 [9.000e+04 2.000e+00 2.000e+00 ... 1.000e+03 1.000e+03 5.000e+03]
 ...
 [3.200e+05 1.000e+00 1.000e+00 ... 7.500e+03 7.500e+03 7.500e+03]
 [1.000e+05 2.000e+00 2.000e+00 ... 3.000e+03 0.000e+00 0.000e+00]
 [1.000e+05 2.000e+00 1.000e+00 ... 0.000e+00 3.622e+03 0.000e+00]]
[[0.01010101 1.         0.33333333 ... 0.         0.         0.        ]
 [0.11111111 1.         0.33333333 ... 0.00201207 0.         0.00378311]
 [0.08080808 1.         0.33333333 ... 0.00201207 0.00301205 0.00945777]
 ...
 [0.31313131 0.         0.16666667 ... 0.01509054 0.02259036 0.01418665]
 [0.09090909 1.         0.33333333 ... 0.00603622 0.         0.        ]
 [0.09090909 1.         0.16666667 ... 0.         0.01090964 0.

### 3.1 - Implementation of sigmoid and cost function

In [4]:
def sigmoid(z):
    ''' return sigmoid'''
    sig = 1/(1+math.e**(-1*z)) #small constant to avoid getting sig = 1 for large z's
    return sig

In [5]:
## Implement the loss function for logistic regression

def compute_cost(ip, op, params):
    """
    Cost function in linear regression where the cost is calculated
    ip: input variables
    op: output variables
    params: corresponding parameters
    Returns cost
    """
    loss = 0
    epsilon = 0.05
    #max and mins added to account for cases where colossal dot products overwhelm machine precision
    for i in range(len(ip)):
        add = -1 * op[i] * math.log(max(sigmoid(np.dot(ip[i], params)), epsilon)) - \
            (1 - op[i]) * math.log(1 - min(sigmoid(np.dot(ip[i], params)),1-epsilon))
        loss += add
    
    return loss


### 3.2  Implement logistic regression using batch gradient descent and evaluation
Algorithm can be given as follows:

```for j in 0 -> max_iteration: 
    for i in 0 -> m: 
        theta += (alpha / m) * (y[i] - h(x[i])) * x_bar
```

In [6]:
def logistic_regression_using_batch_gradient_descent(ip, op, params, alpha, max_iter, batch_size = 1):
    """
    Compute the params for logistic regression using batch gradient descent
    ip: input variables
    op: output variables
    params: corresponding parameters
    alpha: learning rate
    max_iter: maximum number of iterations
    batch_size: size of the batch, 1 for unit batches, len(ip) for one big batch
    Returns parameters, cost, params_store
    """ 
    #initialize iteration, number of samples, cost and parameter array
    iteration = 0
    num_samples = len(ip)
    costs = np.zeros(max_iter)
    param_array = np.zeros([max_iter,len(ip[0])])
    
    #batchify the data into mini-batches, I've done this but why is this comment here when the 
    #line above this cell says to use batch. It does not say to use minibatch... so I use batch below
    x_batches = []
    y_batches = []
    batch_count = math.ceil(num_samples/batch_size)
    for i in range(batch_count):
        if i == batch_count-1:
            x_batches.append(ip[i*batch_size:,:])
            y_batches.append(op[i*batch_size:])
        else:
            x_batches.append(ip[i*batch_size:(i+1)*batch_size,:])
            y_batches.append(op[i*batch_size:(i+1)*batch_size])
    
    #Compute the cost and store the params for the corresponding cost
    while iteration < max_iter:
        costs[iteration] = compute_cost(ip,op,params)
        param_array[iteration,:] = params
        
        #assuming 1 batch
        sigs = np.zeros(len(ip))
        for i in range(len(ip)):
            sigs[i] = (sigmoid(np.dot(ip[i], params)))
        
        params = params - alpha * np.dot(ip.T, (sigs-op))              
                
        iteration += 1
    
    return param_array, costs

### 3.3 - Implementation of perceptron 

In [7]:
class Perceptron:
# constructor 
    def __init__ (self):
        self.w = None
        #bias contained within w vector, at end
    
    def calc(self, x):
        yhat = np.dot(self.w,x)
        return yhat
    
    def update(self, x, y, learn_rate):
        x = np.append(x,1)
        self.w += learn_rate*y*x
        
    def predict(self, x):
        x = np.append(x,1)
        yhat = self.calc(x)
        if yhat >= 0:
            return 1
        else:
            return -1
        
    def fit(self, x, y, learn_rate):
        no_mistakes = False
        i = 0
        k = 0
        while not no_mistakes:
            for j in range(len(x)):
                if i == len(x) or k == 5000000: #stop conditions
                    no_mistakes = True #either we go through all the data once without error
                    break              #or we stop after 5,000,000, because we don't know that it will stop
                yhat = self.predict(x[j])
                y_temp = 0 #truth
                if y[j] == 1:
                    y_temp = 1
                else:
                    y_temp = -1
                if yhat*y_temp <= 0: 
                    self.update(x[j], y_temp, learn_rate)
                    i = 0 #we reset if we have had an error
                    k += 1
                else:
                    i += 1
                    k += 1
    

### 3.4 - Apply 80-20 split on data to prepare training and test sets. Report training and test results in terms of accuracy, precision and recall for both logistic regression and perceptron

In [8]:
# Sample training code cell change according to your variables and structure

# Training the model

from sklearn.model_selection import train_test_split
#reserve the test data, do not use them for cross-validation!

data, labels = load_data('Credit card Default.xlsx')
normdata = np.copy(data)
#Min-max Normalization
for i in range(len(normdata[0])): #for each row of data
    cmin = np.amin(normdata[:,i])
    cmax = np.amax(normdata[:,i])
    normdata[:,i] -= cmin
    normdata[:,i] = normdata[:,i]/float(cmax-cmin)

x_train_norm, x_test_norm, y_train_norm, y_test_norm = train_test_split(normdata, labels, test_size = 0.20)
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.20)



In [11]:
pA = Perceptron()
pA.w = np.zeros(len(data[0])+1) #+1 for the bias term at the end
pA.fit(x_train, y_train, 0.05)

conf = [0, 0, 0, 0] #true positives, false positives, true negatives, false negatives
for i in range(len(x_test)):
    pred = pA.predict(x_test[i])
    if (pred == 1 and y_test[i] == 1):
        conf[0] += 1
    elif(pred == 1 and y_test[i] == 0):
        conf[1] += 1
    elif(pred == -1 and y_test[i] == 0):
        conf[2] += 1
    else:
        conf[3] += 1

print(conf)
print('Accuracy, Precision, and Recall for Perceptron on Raw Data')
print('Accuracy: ' + str((conf[0] + conf[2])/sum(conf)))
print('Precision: '  + str((conf[0])/(conf[0]+conf[1])))
print('Recall: ' + str((conf[0])/(conf[0]+conf[3])))

[0, 0, 780, 216]
Accuracy, Precision, and Recall for Perceptron on Raw Data
Accuracy: 0.7831325301204819


ZeroDivisionError: division by zero

In [12]:
pB = Perceptron()
pB.w = np.zeros(len(data[0])+1) #+1 for the bias term at the end
pB.fit(x_train_norm, y_train_norm, 0.05)

conf = [0, 0, 0, 0] #true positives, false positives, true negatives, false negatives
for i in range(len(x_test_norm)):
    pred = pB.predict(x_test_norm[i])
    if (pred == 1 and y_test_norm[i] == 1):
        conf[0] += 1
    elif(pred == 1 and y_test_norm[i] == 0):
        conf[1] += 1
    elif(pred == -1 and y_test_norm[i] == 0):
        conf[2] += 1
    else:
        conf[3] += 1
print(conf)
print(sum(conf))
print(len(x_test_norm))
print('Accuracy, Precision, and Recall for Perceptron on Normalized Data')
print('Accuracy: ' + str((conf[0] + conf[2])/sum(conf)))
print('Precision: '  + str((conf[0])/(conf[0]+conf[1])))
print('Recall: ' + str((conf[0])/(conf[0]+conf[3])))

[178, 555, 231, 32]
996
996
Accuracy, Precision, and Recall for Perceptron on Normalized Data
Accuracy: 0.4106425702811245
Precision: 0.24283765347885403
Recall: 0.8476190476190476


In [13]:
params = np.random.rand(len(x_train[0]))
lr_params, lr_costs = logistic_regression_using_batch_gradient_descent(x_train, y_train, params, 0.001, 100, 5)

correct = 0
for i in range(len(x_test)):
   
    if pred == y_test[i]:
        correct += 1

conf = [0, 0, 0, 0] #true positives, false positives, true negatives, false negatives
for i in range(len(x_test)):
    p = sigmoid(np.dot(x_test[i],lr_params[-1]))
    if p > 0.5:
        pred = 1
    else:
        pred = 0    
    
    if (pred == 1 and y_test_norm[i] == 1):
        conf[0] += 1
    elif(pred == 1 and y_test_norm[i] == 0):
        conf[1] += 1
    elif(pred == 0 and y_test_norm[i] == 0):
        conf[2] += 1
    else:
        conf[3] += 1

print('Accuracy, Precision, and Recall for Logistic Regression on Raw Data')
print('Accuracy: ' + str((conf[0] + conf[2])/sum(conf)))
print('Precision: '  + str((conf[0])/(conf[0]+conf[1]+0.0001)))
print('Recall: ' + str((conf[0])/(conf[0]+conf[3])))

#There is some weird stuff going on with the unnormalized data and the sigmoid calculation
#I believe (via some testing) that it results in some weird flipflopping between 2 cost values determined by the epsilon


  sig = 1/(1+math.e**(-1*z)) #small constant to avoid getting sig = 1 for large z's


Accuracy, Precision, and Recall for Logistic Regression on Raw Data
Accuracy: 0.7891566265060241
Precision: 0.0
Recall: 0.0


In [14]:
params = np.random.rand(len(x_train_norm[0]))
lr_params_norm, lr_costs_norm = logistic_regression_using_batch_gradient_descent(x_train_norm, y_train_norm, params, 0.001, 100, len(x_train_norm))

conf = [0, 0, 0, 0] #true positives, false positives, true negatives, false negatives
for i in range(len(x_test_norm)):
    p = sigmoid(np.dot(x_test_norm[i],lr_params_norm[-1]))
    
    if p > 0.5:
        pred = 1
    else:
        pred = 0
        
    if (pred == 1 and y_test_norm[i] == 1):
        conf[0] += 1
    elif(pred == 1 and y_test_norm[i] == 0):
        conf[1] += 1
    elif(pred == 0 and y_test_norm[i] == 0):
        conf[2] += 1
    else:
        conf[3] += 1

print('Accuracy, Precision, and Recall for Logistic Regression on Normalized Data')
print('Accuracy: ' + str((conf[0] + conf[2])/sum(conf)))
print('Precision: '  + str((conf[0])/(conf[0]+conf[1])))
print('Recall: ' + str((conf[0])/(conf[0]+conf[3])))

Accuracy, Precision, and Recall for Logistic Regression on Normalized Data
Accuracy: 0.8002008032128514
Precision: 0.6571428571428571
Recall: 0.10952380952380952
