# CP 217, ML4CPS Workshop 
## Logistic Regression (from scratch) 
***

This worksheet walks you through training a logistic regression model and using it for classification, so that you can better understand how the model works.

First, we will look at the sigmoid function, the hypothesis function, the decision boundary, and the loss function, coding each one as we go. After that, we will apply the gradient descent algorithm to find the parameters (weights and bias). Finally, we will measure accuracy and plot the decision boundary for a linearly separable dataset and a non-linearly separable dataset.

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import io
import pandas as pd

Let's generate some simple 2D data to demonstrate logistic regression. We usually work with more than 2 dimensions; however, for the sake of plotting the results, we'll stick to 2D data.

In [None]:
from sklearn.datasets import make_classification
X, y = make_classification(n_features=2, n_redundant=0, 
                           n_informative=2, random_state=1,class_sep=0.5, 
                           n_clusters_per_class=1)

In [None]:
plt.plot(X[y==0,0], X[y==0,1], 'o')
plt.plot(X[y==1,0], X[y==1,1], 's')

We construct $\mathbf{X}$ in the code block below, remembering to include the $x_0 = 1$ column for the bias (intercept).

In [None]:
X = np.hstack((np.ones((X.shape[0],1)), X))
print(X)

### Logistic Regression Model

For Logistic Regression, our hypothesis is 
$$
\hat{y} = h_w(x) = \frac{1}{1+e^{-(w^{T}x)}}
$$
The output range of $\hat{y}$ is between 0 and 1.

#### Sigmoid Function

Let's first implement a Sigmoid function. 

The Sigmoid Function squishes all its inputs (values on the x-axis) between 0 and 1.
$$
\sigma(z) = \frac{1}{1+e^{-z}}
$$

In [None]:
np.exp?

In [None]:
def sigmoid(z):
    return  #....over to you

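If you want something to check your answer against, the following is one possible sketch of the sigmoid, with a few quick sanity checks. The name `sigmoid_sketch` is hypothetical and chosen so that it does not overwrite your own `sigmoid`. The checks use two properties of the function: $\sigma(0) = 0.5$ and $\sigma(z) + \sigma(-z) = 1$.

```python
import numpy as np

def sigmoid_sketch(z):
    # Element-wise logistic function; accepts scalars or NumPy arrays.
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
s = sigmoid_sketch(z)

# All outputs lie strictly between 0 and 1, and sigma(0) is exactly 0.5.
print(s)
print(s + sigmoid_sketch(-z))  # each entry should be 1
```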
As we saw in the lecture, the cost function for logistic regression for binary classification is:
$$
J(data, w) = \frac{1}{n}\sum_{i=1}^{n} L(\hat{y}^{(i)},y^{(i)}) = -\frac{1}{n}\sum_{i=1}^{n} [y^{(i)}\log(\hat{y}^{(i)}) + (1-y^{(i)})\log(1-\hat{y}^{(i)})]
$$

This loss is also called the binary cross-entropy error.

In [None]:
def loss(y, y_hat):
    loss =   #....over to you
    return -loss

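To get a feel for how the binary cross-entropy behaves, here is a small worked sketch. The function name `bce_sketch` and the clipping constant `eps` are assumptions added for illustration (the clipping only guards against `log(0)` and is not part of the formula above). Confident correct predictions give a small loss, while confident wrong predictions are penalized heavily.

```python
import numpy as np

def bce_sketch(y, y_hat, eps=1e-12):
    # Clip predictions away from 0 and 1 so that log() never sees 0.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    # Mean of the per-example losses, with the leading minus sign.
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y_true = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.9, 0.1])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

loss_right = bce_sketch(y_true, confident_right)  # equals -log(0.9)
loss_wrong = bce_sketch(y_true, confident_wrong)  # equals -log(0.1)
print(loss_right, loss_wrong)
```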
#### Gradient of the loss function

Now that we know our hypothesis function and the loss function, we can use the gradient descent algorithm to find the optimal values of our parameters. With learning rate $\eta$, the update rule for the parameters is:
$$
w_{t+1} = w_{t} - \eta \, dw
$$
where $dw$ is the partial derivative of the loss with respect to the parameters $w$:
$$
dw = \frac{1}{n} \textbf{X}^{T}(\hat{y}-y)
$$

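For reference, the gradient formula follows from the chain rule. A short derivation for a single example, with $z = w^{T}x$ and $\hat{y} = \sigma(z)$ (the superscripts $(i)$ are dropped for readability):
$$
\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})}, \qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y}), \qquad
\frac{\partial z}{\partial w} = x
$$
Multiplying the three factors and averaging over the $n$ training examples gives the matrix form:
$$
\frac{\partial L}{\partial w} = (\hat{y}-y)\,x
\quad\Longrightarrow\quad
dw = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}^{(i)}-y^{(i)})\,x^{(i)} = \frac{1}{n}\textbf{X}^{T}(\hat{y}-y)
$$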

In [None]:
def gradients(X, y, y_hat):
    # X --> Input.
    # y --> true/target value.
    # y_hat --> hypothesis/predictions.
    # n-> number of training examples.
    
    n = X.shape[0]
    
    # Gradient of loss w.r.t weights.
    dw =   #....over to you
    
    return dw

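One way to convince yourself that a gradient implementation is correct is to compare it with a finite-difference approximation of the loss. The sketch below does this on random data. All names ending in `_sketch` or `_demo` are hypothetical and independent of the functions you wrote above, and the step size `h` is an assumed value.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_sketch(X, y, w):
    # Mean binary cross-entropy for weights w.
    y_hat = sigmoid_sketch(X @ w)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def gradients_sketch(X, y, y_hat):
    # Vectorized form of (1/n) * sum_i (y_hat_i - y_i) * x_i.
    return X.T @ (y_hat - y) / X.shape[0]

rng = np.random.default_rng(0)
X_demo = np.hstack((np.ones((20, 1)), rng.normal(size=(20, 2))))
y_demo = rng.integers(0, 2, size=(20, 1)).astype(float)
w_demo = rng.normal(size=(3, 1))

analytic = gradients_sketch(X_demo, y_demo, sigmoid_sketch(X_demo @ w_demo))

# Central finite differences, perturbing one weight at a time.
h = 1e-6
numeric = np.zeros_like(w_demo)
for j in range(w_demo.shape[0]):
    step = np.zeros_like(w_demo)
    step[j] = h
    numeric[j] = (bce_sketch(X_demo, y_demo, w_demo + step)
                  - bce_sketch(X_demo, y_demo, w_demo - step)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be very close to zero
```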

We normalize the data before computing gradients, since this can speed up training. Make sure not to normalize the bias term (first column).

In [None]:
def normalize(X):
    
    # X --> Input.
    # n-> number of training examples
    # d-> number of features 
    n, d = X.shape
    
    # Normalizing all the d features of X (except the bias (first) column)
    for i in range(d-1):
        X[:,i+1] = (X[:,i+1] - X[:,i+1].mean(axis=0))/X[:,i+1].std(axis=0)
                
    return X

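Note that `normalize` above overwrites the columns of `X` in place. As a point of comparison, here is a minimal vectorized sketch that returns a standardized copy instead. The name `standardize_copy` and the toy array are assumptions for illustration only.

```python
import numpy as np

def standardize_copy(X):
    # Work on a float copy so the caller's array is left untouched.
    X_out = np.array(X, dtype=float)
    # Column 0 is the bias column of ones and is skipped.
    mean = X_out[:, 1:].mean(axis=0)
    std = X_out[:, 1:].std(axis=0)
    X_out[:, 1:] = (X_out[:, 1:] - mean) / std
    return X_out

X_raw = np.array([[1.0, 10.0, 200.0],
                  [1.0, 20.0, 400.0],
                  [1.0, 30.0, 600.0]])
X_std = standardize_copy(X_raw)

# Each non-bias column now has mean 0 and standard deviation 1.
print(X_std)
print(X_std[:, 1:].mean(axis=0), X_std[:, 1:].std(axis=0))
```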
#### Prediction

Now that we have written the functions to learn the parameters, we want to know how our hypothesis ($\hat{y}$) predicts whether $y=1$ or $y=0$. We defined the hypothesis as the probability of $y$ being 1 given $\textbf{X}$, parameterized by $w$.

So the predicted class is:
$$
\hat{y} = 1 \quad \text{if} \quad w^{T}\textbf{X} \geq 0 \quad \text{(equivalently, } \sigma(w^{T} \textbf{X}) \geq 0.5\text{)}
$$
$$
\hat{y} = 0 \quad \text{if} \quad w^{T}\textbf{X} < 0 \quad \text{(equivalently, } \sigma(w^{T} \textbf{X}) < 0.5\text{)}
$$

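The two forms of the rule give the same labels because $\sigma(0) = 0.5$ and $\sigma$ is increasing. A small sketch with hypothetical weights and inputs (none of these values come from the worksheet's data) shows both thresholds side by side:

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights (bias first) and three inputs with a leading 1.
w_demo = np.array([[-1.0], [2.0], [0.5]])
X_demo = np.array([[1.0, 1.0, 0.0],    # z =  1.0
                   [1.0, 0.0, 0.0],    # z = -1.0
                   [1.0, 0.5, 0.0]])   # z =  0.0, exactly on the boundary

z = X_demo @ w_demo
probs = sigmoid_sketch(z)

# Thresholding the probability at 0.5 ...
labels_from_probs = (probs >= 0.5).astype(int).ravel()
# ... is the same as thresholding the linear score at 0.
labels_from_z = (z >= 0).astype(int).ravel()
print(labels_from_probs, labels_from_z)
```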

In [None]:
def predict(X,w):
    
    # X --> Input.
    
    # Normalizing the inputs.
    X = normalize(X)
    
    # Calculating prediction/y_hat.
    preds = #....over to you
    
    # Empty List to store predictions.
    pred_class = []
    # if y_hat >= 0.5 --> round up to 1
    # if y_hat < 0.5 --> round down to 0
    
    pred_class =  #....over to you
    
    return np.array(pred_class)

So our decision boundary will be:
$$
\hat{y} = 0.5 \quad \text{or} \quad w^{T}\textbf{X} = 0
$$

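With two features, the boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$ is a straight line, which can be solved for $x_2$ as long as $w_2 \neq 0$. The sketch below uses hypothetical weights to compute points on that line and confirms that the model outputs 0.5 there.

```python
import numpy as np

# Hypothetical weights [w0, w1, w2], with w2 != 0.
w0, w1, w2 = 0.5, -1.0, 2.0

x1 = np.linspace(-2.0, 2.0, 5)
# Solve w0 + w1*x1 + w2*x2 = 0 for x2.
x2 = -(w0 + w1 * x1) / w2

# Every (x1, x2) pair on the line gives z = 0, i.e. sigma(z) = 0.5.
z = w0 + w1 * x1 + w2 * x2
probs = 1.0 / (1.0 + np.exp(-z))
print(x2)
print(probs)
```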
In [None]:
def plot_decision_boundary(X,w):
    # Boundary: w[0] + w[1]*x1 + w[2]*x2 = 0, solved for x2.
    # Note: the class labels come from the global variable y.
    x1 = X[:, 1]
    ydisp = -(w[0] + w[1] * x1)/w[2]
    fig = plt.figure(figsize=(10,8))
    plt.plot(X[:, 1][y==0], X[:, 2][y==0], "g^")
    plt.plot(X[:, 1][y==1], X[:, 2][y==1], "bs")
    
    plt.xlim([-2, 2.2])
    plt.ylim([-2, 2.2])
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.title('Decision Boundary')
    plt.plot(x1, ydisp)

Now that we have written all the required blocks for the logistic regression model, let's put them together to train our model on the training data.

In [None]:
def train(X, y, epochs, eta):
    
    # X --> Input.
    # y --> true/target value.
    # epochs --> Number of passes over the training data.
    # eta --> Learning rate.
        
    # n-> number of training examples
    # d-> number of features 
    
    n, d = X.shape
    
    # Initializing weights and bias to zeros.
    w = np.zeros((d,1))

    
    # Reshaping y.
    y = y.reshape(n,1)
    
    # Normalizing the inputs.
    X = normalize(X)
    
    # Empty list to store losses.
    losses = []
    
    # Training loop.
    for epoch in range(epochs):

        # Calculating hypothesis/prediction.
        y_hat = sigmoid(np.dot(X, w))

        # Getting the gradients of loss w.r.t parameters.
        dw = gradients(X, y, y_hat)

        # Updating the parameters.
        w -=  #....over to you

        # Calculating loss and appending it in the list.
        l = loss(y, sigmoid(np.dot(X, w)))
        losses.append(l)
        
    # returning weights, losses(List).
    return w, losses

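As a reference for checking your own `train`, here is a compact end-to-end sketch of full-batch gradient descent on toy data. Everything in it is an assumption made for illustration: the `_sketch` and `_demo` names, the two Gaussian blobs used as data, the clipping before the logarithm, and the values of `epochs` and `eta`. It is not the worksheet's dataset or its expected result.

```python
import numpy as np

def sigmoid_sketch(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_sketch(X, y, epochs, eta):
    n, d = X.shape
    w = np.zeros((d, 1))
    y = y.reshape(n, 1)
    losses = []
    for _ in range(epochs):
        y_hat = sigmoid_sketch(X @ w)
        dw = X.T @ (y_hat - y) / n   # gradient of the mean loss
        w -= eta * dw                # gradient descent step
        # Loss after the update; clipping avoids log(0).
        y_hat = np.clip(sigmoid_sketch(X @ w), 1e-12, 1 - 1e-12)
        losses.append(-np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
    return w, losses

# Two Gaussian blobs as toy data, with a leading column of ones for the bias.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=-1.0, size=(50, 2))
X1 = rng.normal(loc=+1.0, size=(50, 2))
X_demo = np.hstack((np.ones((100, 1)), np.vstack((X0, X1))))
y_demo = np.concatenate((np.zeros(50), np.ones(50)))

w_demo, losses_demo = train_sketch(X_demo, y_demo, epochs=200, eta=0.1)
preds = (sigmoid_sketch(X_demo @ w_demo) >= 0.5).astype(int).ravel()
accuracy = np.mean(preds == y_demo)
print(losses_demo[0], losses_demo[-1], accuracy)
```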

In [None]:
# Training 
w, l = train(X, y, epochs=100, eta= 0.01)
w

In [None]:
# Plotting Decision Boundary
plot_decision_boundary(X, w)

What do you think about the decision boundary of this model? Let's compute the accuracy.

In [None]:
#accuracy
print('The accuracy of model is',(np.sum(1*(y==predict(X,w)))/len(y))*100,'%')

Are you satisfied with the accuracy on this data? How can we check whether there is scope for improvement, or whether we did something wrong?

In [None]:
plt.plot(l)
plt.ylabel("Loss")
plt.xlabel("epochs")
plt.show()

Can we learn a better model on this data? If so, how?

In [None]:
w, l =  #....over to you   

plt.plot(l)
plt.ylabel("Loss")
plt.xlabel("epochs")
plt.show()

In [None]:
plot_decision_boundary(X, w)
print('The accuracy of model is',(np.sum(1*(y==predict(X,w)))/len(y))*100,'%')

#### Non-linearly separable data

Now, we will see the performance of logistic regression for non-linearly separable data.

In [None]:
#Classification on non-linearly separable data
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.3)

In [None]:
plt.plot(X[y==0,0], X[y==0,1], 'o')
plt.plot(X[y==1,0], X[y==1,1], 's')

In [None]:
X = np.hstack((np.ones((X.shape[0],1)), X))
print(X)

In [None]:
# Training 
w, l = train(X, y, epochs=100, eta=1)
# Plotting Decision Boundary
plot_decision_boundary(X, w)

In [None]:
#accuracy on non-linearly separable data
print('The accuracy of model is',(np.sum(1*(y==predict(X,w)))/len(y))*100,'%')

### CPS Data: Occupancy Detection based on IoT Sensors

Now we will try out logistic regression on a subset of data from the CPS domain. We will be using the Occupancy Detection Data Set from the UCI Machine Learning Repository. This is a binary classification problem: given an observation of environmental factors such as temperature and humidity, classify whether a room is occupied or unoccupied.

Data is provided with date-time information and six environmental measures taken each minute over multiple days, specifically:

- Date (timestamp)
- Temperature in Celsius.
- Relative humidity as a percentage.
- Light measured in lux.
- Carbon dioxide measured in parts per million.
- Humidity ratio, derived from temperature and relative humidity measured in kilograms of water vapor per kilogram of air.
- Occupancy as either 1 for occupied or 0 for not occupied.

We won't be using time-stamp as a feature in this problem.

In [None]:
#read csv file as a pandas dataframe
data= pd.read_csv('occupancy_detection.txt')
data

Which features do you think would be best for this classification? Let's try predicting using temperature and humidity only.

In [None]:
# split data into inputs (features) and output (label)
X= data.values[:, 3:5] 
y= data.values[:, -1].astype('int')

In [None]:
plt.plot(X[y==0,0], X[y==0,1], 'o')
plt.plot(X[y==1,0], X[y==1,1], 's')
plt.xlabel("temperature")
plt.ylabel("humidity")

In [None]:
# split the dataset into training and test
from sklearn.model_selection import train_test_split
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.4, shuffle=True, random_state=1)

In [None]:
from sklearn.linear_model import LogisticRegression

# define the model
model = LogisticRegression()

# fit the model on the training set
model.fit(trainX, trainy)

# predict the test set
yhat = model.predict(testX)

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, accuracy_score,auc, roc_auc_score

print('Precision score %s' % precision_score(testy, yhat))
print('Recall score %s' % recall_score(testy, yhat))
print('F1 score %s' % f1_score(testy, yhat))
print('Accuracy score %s' % accuracy_score(testy, yhat))

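To see what these scores measure, the sketch below computes them by hand from the four confusion counts. The labels and predictions are hypothetical values made up for illustration, not output from the occupancy data.

```python
import numpy as np

# Hypothetical labels and predictions, purely for illustration.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
print(precision, recall, f1, accuracy)
```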
Now try with light and $CO_2$ features.

### References
1. Pattern Recognition and Machine Learning, Christopher Bishop, New York, Springer,  2006. (Chapter 3)
2. Machine learning: A Probabilistic Perspective, Kevin Murphy, MIT Press, 2012. (Chapters 17, 18)
3. Statistical Machine Learning Course Workshop, 2015, University of Melbourne, Australia