## ENSE 496AD - Lab 2 - Classification with Continuous Features
In this lab you will implement two different methods for classification for problems with continuous features. You will implement the Naive Bayes classifier as well as Logistic Regression, verify their operation using a tiny dataset, and then compare the two approaches in a real world dataset. Additionally, you will train the Logistic Regression iteratively using Gradient Descent.

In [1]:
# Import numpy. You can do the first two questions in the lab with only Numpy.
import numpy as np
import pandas as pd
import math

In [2]:
#your code here

In [3]:
# This toy dataset will be used to verify correct operation of your algorithms

In [4]:
# toy dataset of pets. length and height in cm, mass in kg.
               #length, height,  mass
X = np.array([[    107,     91,    52],  # bernese
              [    122,     44,    79],  # great dane
              [    107,     81,    34],  # goldie
              [     64,     56,    24], # beagle
              [     38,     25,     6],  # sphynx
              [     36,     25,     5],  # siamese
              [     44,     25,     5],  # persian
              [     41,     30,     5]]) # manx
Y = np.array([["dog"],
              ["dog"],
              ["dog"],
              ["dog"],
              ["cat"],
              ["cat"],
              ["cat"],
              ["cat"]])


### Part 1: Naive Bayes Classifier (40 marks)
In this section you will implement the functions for Naive Bayes classification.

In [5]:
# Implement a function called get_data which returns a smaller dataset of X including only the rows where Y is equal to y_val

In [419]:
def get_data (X, Y, y_val):
    '''
    Gets all the rows of matrix X where Y is equal to y_val 

    X: A matrix with rows of features
    Y: A column vector of labels
    y_val: The value of Y where you wish to get rows of X
    Returns a new matrix with selected rows of X 
    '''
    # your code here
    Z = np.where(Y[:]==y_val)[0]
    N = X[Z]
    return N

    

In [420]:
''' Test cell
Expected output:
Xd:  [[107  91  52]
 [122  44  79]
 [107  81  34]
 [ 64  56  24]]
'''
Xd = get_data (X, Y, "dog")
print("Xd: ", Xd)

Xd:  [[107  91  52]
 [122  44  79]
 [107  81  34]
 [ 64  56  24]]


In [421]:
# Implement the mean function below
# Takes a matrix of sampels, returns the mean as a row vector of means for each column

In [422]:
def mean (X):
    '''
    X: A matrix with rows of features
    Returns a row vector of means, with one mean for each column
    '''
   
    return np.mean(X,axis=0)
    
    # your code here

In [423]:
''' Test cell
Expected output:
mean d:  [[100.    68.    47.25]]
'''
print ("mean d: ", mean (Xd))


mean d:  [100.    68.    47.25]


In [424]:
# define the variance function below
# Takes a matrix of samples, returns sigma squared as a row vector of standard deviations for each column

In [425]:
def variance (X):
    '''
    X: A matrix with rows of features
    Returns a row vector of standard deviations, with one standard deviation for each column
    '''
    means = mean(X)
    minus = (X - means)**2
    sumofminus = sum(minus)
    return sumofminus/(X.shape[0]-1)
    
    # your code here

In [426]:
''' Test cell
Expected output:
standard deviations d:  [[626.         472.66666667 582.25      ]]
'''
print ("standard deviations d: ", np.array([variance (Xd)]))

standard deviations d:  [[626.         472.66666667 582.25      ]]


In [427]:
# define a function below called class_labels
# This function returns a simple list containing each of the unique values for Y, that is, the class labels for the dataset

In [428]:
def class_labels (Y):
    '''
    Y: A column vector of class labels
    Returns a list containing each of the unique values in Y
    '''
    return np.unique(Y)
    # your code here

In [429]:
''' Test cell
Expected output:
class labels:  ['cat' 'dog']
'''
print("class labels: ", class_labels(Y))


class labels:  ['cat' 'dog']


In [430]:
# create a function below called get_likelihoods, as explained in the function docstring

In [431]:
def get_likelihoods (x, X, Y):
    '''
    x: a row vector representing a sample for which you wish to compute likelihoods 
    X: the complete dataset with both classes
    Y: the column vector of labels for the dataset

    Returns a 2D array of likelihoods, with a row for each feature for the first class, 
    and a row for each feature of the second class
    eg. [[P(X_1|Y=0), ... P(X_N|Y=0)]
        [P(X_1|Y=1), ... P(X_N|Y=1)]]
    This function should use all of the functions previously created
    You may ignore P(Y=1), P(Y=2),... as they do not affect the arg_max for a balanced dataset
    '''
    dogclass = get_data(X, Y,"dog")
    catclass = get_data(X, Y,"cat")
    
    dogmean = mean(dogclass)
    catmean = mean(catclass)
    dogvariance = variance(dogclass)
    catvariance = variance(catclass)
    result =[]
    for i in range(3):
        Z =  (math.sqrt(2*math.pi*catvariance[i]))**-1
        I = (x[0][i]-catmean[i])**2/(2*catvariance[i])
        result.append((Z*np.exp(1)**-I))        
    for i in range(3):
        Z =  (math.sqrt(2*math.pi*dogvariance[i]))**-1
        I = (x[0][i]-dogmean[i])**2/(2*dogvariance[i])
        result.append((Z*np.exp(1)**-I))
    B = np.reshape(result, (-1, 3))
    return B
    
    # your code here

In [432]:
''' Test cell
Expected output:
liklihoods [[4.53829013e-02 5.18070383e-02 7.04130654e-01]
 [5.45823793e-04 3.98332943e-03 3.56964591e-03]]
'''
x = [[35, 30, 5]]
print ("liklihoods", get_likelihoods(x, X, Y))
Z =get_likelihoods(x, X, Y)
I =np.prod(Z,axis=1)

liklihoods [[4.53829013e-02 5.18070383e-02 7.04130654e-01]
 [5.45823793e-04 3.98332943e-03 3.56964591e-03]]


In [433]:
# create a function called prediction, which predicts the class for a test sample as explained in the function docstring.


In [434]:
def prediction (x, X, Y):
    '''
    x is a row vector representing a new sample for classification
    X is the complete dataset with both classes
    Y is the column vector of labels for the dataset
    Return a string value representing the label 
    '''
    likelihoods = get_likelihoods(x, X, Y)
    afterProduct = np.prod(likelihoods, axis=1)
    y = np.argmax(afterProduct)
    return class_labels(Y)[y]
    # your code here

In [435]:
''' Test cell
Expected output:
prediction cat
'''
print ("prediction", prediction(x, X, Y))

prediction cat


In [436]:
'''
Test cell
Expected output:
prediction (cat):  cat
prediction (newfoundland):  dog
prediction (yorkie):  cat
prediction (lion):  dog
'''

x = [[35, 30, 5]] # cat
print("prediction (cat): ", prediction(x, X, Y))

x = [[107, 95, 54]] #newfoundland
print("prediction (newfoundland): ", prediction(x, X, Y))

# Just for fun, some examples we expect to be classified wrong
x = [[39, 33, 7]] # yorkie
print("prediction (yorkie): ", prediction(x, X, Y))

x = [[137, 112, 190]] # lion
print("prediction (lion): ", prediction(x, X, Y))

prediction (cat):  cat
prediction (newfoundland):  dog
prediction (yorkie):  cat
prediction (lion):  dog


### Part 2: Logistic Regression (40 marks)
In this portion of the notebook you will implement logistic regression

In [437]:
#Initialize w as a colum vector with values -1, 1, 1


In [481]:
# your code here
w = np.array ([[-1],
               [1],
               [1]])

In [482]:
# Create a 1-hot encoded Y vector, where "dog"=1 and "cat"=0
#Y =(Y[:]=="dog")*1
Yd = np.array([[1],
              [1],
              [1],
              [1],
              [0],
              [0],
              [0],
              [0]])


In [483]:
# your code here

In [484]:
# create a function called dense which implements the linear part of the logisitic regression node

In [485]:
def dense (X, w):
    '''
    Performs the linear combination part of the logistic regression node
    X: a matrix of samples, with samples in rows and features in columns
    w: a column vector of weights, with 1 weight per feature
    Returns z, a column vector with a value for each sample
    '''
    z = np.array(np.dot(X,w))

    return z
    
    
    # your code here

In [486]:
''' Test cell
Expected output with initial value for w:
z [[ 36]
 [  1]
 [  8]
 [ 16]
 [ -7]
 [ -6]
 [-14]
 [ -6]]
'''

z = dense(X, w)
print ("z", z)

z [[ 36]
 [  1]
 [  8]
 [ 16]
 [ -7]
 [ -6]
 [-14]
 [ -6]]


In [487]:
# create a function called sigmoid which implements the sigmoid part of the logistic regression node

In [488]:
def sigmoid (z):
    '''
    Performs the sigmoid part of the logistic regression node
    z: the output of dense function
    Returns: a column vector the class predictions probabilities for each sample
    '''
    return (1+np.exp(1)**-z)**-1
    
    
    # your code here

In [489]:
''' Test cell
Expected output with initial value for w:
a [[1.00000000e+00]
 [7.31058579e-01]
 [9.99664650e-01]
 [9.99999887e-01]
 [9.11051194e-04]
 [2.47262316e-03]
 [8.31528028e-07]
 [2.47262316e-03]]
'''

Y_hat = sigmoid(z)
print ("Y_hat", Y_hat)

Y_hat [[1.00000000e+00]
 [7.31058579e-01]
 [9.99664650e-01]
 [9.99999887e-01]
 [9.11051194e-04]
 [2.47262316e-03]
 [8.31528028e-07]
 [2.47262316e-03]]


In [490]:
# create a function called log loss which computes the log loss for a given prediction

In [491]:
def log_loss (Y, Y_hat):
    '''
    Y: A 1-hot column vector encoded of labels
    Y_hat: A column vector of predicted probabilities
    Returns the sum of log loss for each prediction vs. the actual value
    '''
    
    lw = -Y*(np.log(Y_hat)) - (1-Y)*(np.log((1-Y_hat)))
    return np.sum(lw)
    # your code here


In [492]:
''' Test cell
Expected output with initial value for w:
ll 0.31946087468389556
'''

log_loss_val = log_loss(Yd, Y_hat)
print ("ll", log_loss_val)

ll 0.31946087468389556


In [493]:
# create a function called gradient_log_loss which computes the gradient of the log loss function with respect to the vector of weights

In [497]:
def gradient_log_loss (X, Y, w):
    '''
    X: A matrix of samples, with samples in rows and features in columns
    Y: A 1-hot column vector encoded of labels
    w: A column vector of weights, with one weight for each feature
    Returns the gradient of the log_loss function at this point
    '''
    
    z = dense(X, w)

    Y_hat = sigmoid(z)
  
    U = Y_hat-Y
    C = X.transpose()
    gll = np.dot(C,U)
    return gll
    # your code here

In [498]:
''' Test cell
Expected output with initial value for w:
gll [[-32.62169456]
 [-11.70180086]
 [-21.2275802 ]]
'''

gradient_log_loss_val = gradient_log_loss (X, Yd, w)
print ("gll", gradient_log_loss_val)

gll [[-32.62169456]
 [-11.70180086]
 [-21.2275802 ]]


In [499]:
# create a function called update which updates the weights based on the gradient log loss as a given set of weights

In [500]:
def update (X, Y, w, alpha):
    '''
    Performs a single iteration of gradient descent. Computes the gradients and updates the weights.
    X: A matrix of samples, with samples in rows and features in columns
    Y: A 1-hot column vector encoded from the labels
    w: A column vector of weights, with one weight for each feature
    alpha: The step size for gradient descent
    '''
    gradient = gradient_log_loss(X,Y,w)
    neww = w - (alpha*gradient)
    return neww
    # your code here
    

In [501]:
''' Test cell
Expected output with initial value for w:
w [[-0.96737831]
 [ 1.0117018 ]
 [ 1.02122758]]
'''

alpha = 0.001
x = update (X, Yd, w, alpha)
print ("w", x)

w [[-0.96737831]
 [ 1.0117018 ]
 [ 1.02122758]]


In [502]:
# Create a function called "train" which is implemented as per the docstring

In [503]:
def train (X, Y, w, alpha, epochs):
    '''
    Trains the weights for logistic regression by performing multiple iterations of gradient descent

    X: A Matrix of training input features
    Y: A 1-hot column vector of training output features
    w: The initialized weight vector
    alpha: The step size in gradient descent
    epochs: The number of iterations for which the model will be trained
    '''
    temp = float('inf')
    for i in range(epochs):
        z = dense(X, w)
        Y_hat = sigmoid(z)
        log_loss_val = log_loss(Yd, Y_hat)
        if log_loss_val > temp:
            return w
        else:
            w = update(X, Y, w, alpha)
            temp = log_loss_val
    return w    
    # your code here
        

In [504]:
# train your model with the following settings:
#  5000 training epochs with alpha=0.001 
''' Test cell
Expected output w after training, 5000 steps + 1 in a previous cell:
w [[-1.00540176]
 [ 0.92809876]
 [ 1.1736416 ]]
'''
alpha = 0.001
epochs = 5000
w = train (X, Yd, w, alpha, epochs)
print ("w", w)

  lw = -Y*(np.log(Y_hat)) - (1-Y)*(np.log((1-Y_hat)))
  lw = -Y*(np.log(Y_hat)) - (1-Y)*(np.log((1-Y_hat)))


w [[-1.00539839]
 [ 0.92809851]
 [ 1.17363023]]


In [505]:
# create a function called logistic_predict which returns the true_label if the prediction probability > 0.5, else return false_label

In [506]:
def logistic_predict (x, w, true_label, false_label):
    '''
    x: a test sample or matrix
    w: the trained weight vector
    true_label: the label to return if the prediction probability is greater than 0.5
    false_label: the label to return otherwise
    returns a pre
    diction or array of predictions using logistic regression
    '''
    z = dense(x, w)
    Y_hat = sigmoid(z)
    if Y_hat>0.5:
        return true_label
    else:
        return false_label
    
    
    # your code here

In [507]:
''' Test cell
Expected output:
prediction (cat):  cat
prediction (newfoundland):  dog
prediction (yorkie):  cat
prediction (lion):  dog
'''
x = [[35, 30, 5]] # cat
print("prediction (cat): ", logistic_predict(x, w, "dog", "cat"))

x = [[107, 95, 54]] #newfoundland
print("prediction (newfoundland): ", logistic_predict(x, w, "dog", "cat"))

# Just for fun, some examples we expect to be wrong

x = [[39, 33, 7]] # yorkie
print("prediction (yorkie): ", logistic_predict(x, w, "dog", "cat"))

x = [[137, 112, 190]] # lion
print("prediction (lion): ", logistic_predict(x, w, "dog", "cat"))

prediction (cat):  cat
prediction (newfoundland):  dog
prediction (yorkie):  cat
prediction (lion):  dog


### Part 3 - Testing our algorithms on a real world dataset (20 Marks)

The final section of this workbook tests out these algorithms on real world datasets. In particular, you will load up the Telco dataset which analyzes customer behaviour to assess customer retention. In business, the term "Churn" is used to refer to customers who ceased using your service.

In [508]:
# import pandas using the canonical import statement


In [509]:
# your code here
import pandas as pd

In [510]:
# load the training and test datasets, which have been balanced and shuffled for you


In [511]:
# your code here
training = pd.read_csv("telco-churn-train.csv")
test = pd.read_csv("telco-churn-test.csv")

In [512]:
# inspect the training dataset to see the names of the columns

In [513]:
# your code here
training.info()
test.info()
training.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2986 entries, 0 to 2985
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        2986 non-null   object 
 1   gender            2986 non-null   object 
 2   SeniorCitizen     2986 non-null   int64  
 3   Partner           2986 non-null   object 
 4   Dependents        2986 non-null   object 
 5   tenure            2986 non-null   int64  
 6   PhoneService      2986 non-null   object 
 7   MultipleLines     2986 non-null   object 
 8   InternetService   2986 non-null   object 
 9   OnlineSecurity    2986 non-null   object 
 10  OnlineBackup      2986 non-null   object 
 11  DeviceProtection  2986 non-null   object 
 12  TechSupport       2986 non-null   object 
 13  StreamingTV       2986 non-null   object 
 14  StreamingMovies   2986 non-null   object 
 15  Contract          2986 non-null   object 
 16  PaperlessBilling  2986 non-null   object 


Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,6987-XQSJT,Female,1,No,No,54,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,79.5,4370.25,Yes
1,4686-UXDML,Female,0,No,No,21,Yes,Yes,Fiber optic,No,...,No,No,Yes,Yes,Month-to-month,Yes,Credit card (automatic),99.85,1992.55,No
2,5066-GFJMM,Female,0,Yes,No,3,Yes,No,No,No internet service,...,No internet service,No internet service,No internet service,No internet service,Month-to-month,Yes,Mailed check,19.9,45.75,No
3,1104-TNLZA,Male,1,Yes,No,28,Yes,Yes,Fiber optic,No,...,No,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,105.8,2998.0,No
4,7931-PXHFC,Male,0,No,No,38,Yes,No,DSL,No,...,Yes,Yes,No,Yes,One year,Yes,Mailed check,62.3,2354.8,Yes


In [514]:
# Create a Y vector for each the training and test sets by selecting an appropriate output column from the train and test datasets.
# What is the name of the column? What are the possible class labels?

In [515]:
Y_train = np.array([training["SeniorCitizen"]])# your code here
Y_train = Y_train.transpose()
Y_test  = np.array([training["SeniorCitizen"]])# your code here
Y_test = Y_test.transpose()
print(Y_test)

[[1]
 [0]
 [0]
 ...
 [0]
 [1]
 [0]]


In [516]:
# Create a subset of the training data with columns for tenure, monthly charges and total charges

In [517]:
X_train = np.array([training["tenure"],training["MonthlyCharges"],training["TotalCharges"]])# your code here
X_test  = np.array([test["tenure"], test["MonthlyCharges"],test["TotalCharges"]])# your code here
X_train = X_train.transpose()
X_test = X_test.transpose()

In [518]:
# Now it's time to compare the two models. Using the training set, train a model using both Naive Bayes and Logistic Regression. Compute the training error and test error. To complete this, implement the two functions descibed below and run the function calls.

In [519]:
def test_naive_bayes (X_train, X_test, Y_train, Y_test):
    '''
    Trains the Naive Bayes classifier using the X_train and Y_train, and then tests the
    model on X_test and Y_test. Counts and displays the number of correctly classified samples, 
    the total number of samples, and the accuracy of the model.
    X_train: The training dataset
    X_test: The test dataset
    Y_train: The training labels
    Y_test: The test labels
    '''
    seniorCitzenClass = get_data(X_train, Y_train,1)
    noSeniorCitzenClass = get_data(X_train, Y_train,0)
    
    seniorCitzenmean = mean(seniorCitzenClass)
    noSeniorCitzenmean = mean(noSeniorCitzenClass)
    seniorCitzenvariance = variance(seniorCitzenClass)
    noSeniorCitzenvariance = variance(noSeniorCitzenClass)
    result = []
    def getLikelihoods(index):
        for i in range(3):
            Z =  1/(math.sqrt(2*math.pi*noSeniorCitzenvariance[i]))
            I = (X_test[index][i]-noSeniorCitzenmean[i])**2/(2*noSeniorCitzenvariance[i])
            result.append((Z*np.exp(1)**-I)) 
        for i in range(3):
            Z =  1/(math.sqrt(2*math.pi*seniorCitzenvariance[i]))
            I = (X_test[index][i]-seniorCitzenmean[i])**2/(2*seniorCitzenvariance[i])
            result.append((Z*np.exp(1)**-I))
        return result 
    z = []
    for index in range(len(X_test)): 
        B = np.reshape(getLikelihoods(index), (-1, 3)) 
        result =[]
        afterProduct = np.prod(B, axis=1)
        z.append(np.argmax(afterProduct))
    Z = np.reshape(z,(-1,1)) # tge predict result 
    counter = 0
    for i in range(len(X_test)):
        if Z[i] == Y_test[i]:
            counter +=1
    accuracy = counter/len(X_test)
    print(accuracy)
    # your code here
    

In [520]:
# Assess training accuracy of Naive Bayes 
test_naive_bayes(X_train,X_train, Y_train, Y_train)
#test_naive_bayes (np.array(X_train), np.array(X_train), np.array([Y_train]).T, np.array([Y_train]).T)

0.5452109845947756


In [521]:
# Assess test accuracy of Naive Bayes 
#test_naive_bayes (np.array(X_train), np.array(X_test), np.array([Y_train]).T, np.array([Y_test]).T)
test_naive_bayes(X_train,X_test,Y_train,Y_test )


0.4734042553191489


In [522]:
# train the logistic regression classifier on the telco dataset with the following settings:
# 5000 training epochs with alpha=0.001 
w = [1,1,-1]
alpha = 0.01
epochs = 500
# your code here - Reminder: Create a 1-hot column for Y, and then train using the supplied hyper-parameters
train(X_train, Y_train, w, alpha, epochs)
print ("w", w)

  return (1+np.exp(1)**-z)**-1
  lw = -Y*(np.log(Y_hat)) - (1-Y)*(np.log((1-Y_hat)))
  lw = -Y*(np.log(Y_hat)) - (1-Y)*(np.log((1-Y_hat)))


ValueError: operands could not be broadcast together with shapes (3,) (3,2986) 

In [523]:
# Implement and run the following function to test the logistic regression model on the real world dataet

In [350]:
def test_logistic_regression (X_test, Y_test, w):
    '''
    Tests the trained Logistic Regression classifier using a test dataset.
    Counts and displays the number of correctly classified samples, the total number of 
    samples, and the accuracy of the model.
    X_test: The test dataset
    Y_test: The test labels
    w: The trained weights
    '''
    
    
    
    
    
    # your code here

In [None]:
# Assess the training error of the logistic regression model
test_logistic_regression (np.array(X_train), np.array([Y_train]).T, w)

In [None]:
# Assess the test error of the logistic regression model
test_logistic_regression (np.array(X_test), np.array([Y_test]).T, w)

In [None]:
'''
For verification, expected training accuracies for the two models, not necessarily in order:  
0.6872069658405894
0.7073007367716008
'''