<center><b><font size=6>Lab-6 A classifier from scratch</font></b></center>

### Objective: Implement, use and evaluate a classifier (without using specific libraries such as sklearn)
1. **Logistic regression** is a binary classification method that maps a linear combination of the input variables, weighted by learned parameters, to one of two possible classes. Here, you will implement logistic regression from scratch to better understand how an ML algorithm works. Useful link: <a href="https://en.wikipedia.org/wiki/Logistic_regression">Wiki</a>.
2. **Performance evaluation metrics** are needed to evaluate the predictions against the true labels. Here, you will implement the confusion matrix, accuracy, precision, recall and F-measure. Useful link: <a href="https://en.wikipedia.org/wiki/Confusion_matrix">Wiki</a>.

In [1]:
# import needed python libraries

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import random

### 1. Dataset - TCP logs
The dataset contains traffic information generated by **Tstat**, an open-source passive network monitoring tool. It automates the collection of packet statistics for traffic aggregates using real-time monitoring features. Being a passive tool, its typical usage scenario is live monitoring of Internet links, in which all transmitted packets are observed. For TCP, Tstat identifies the start of a new flow when it observes a TCP three-way handshake. Similarly, it identifies the end of a TCP flow either when it sees the TCP connection teardown or when it observes no packets for some time (idle time). A flow is defined by a unique link between the sender and the receiver, e.g., the tuple <em>(IP_Protocol_Type, IP_Source_Address, Source_Port, IP_Destination_Address, Destination_Port)</em>. For each flow, Tstat computes a number of statistics over all the packets transmitted in that flow and then generates a log entry with multiple attributes (statistics). A log file is arranged as a simple table in which each column holds a specific piece of information and each row describes one flow. The log is thus a summary of the flow properties. For instance, the TCP log contains columns such as the starting time of a TCP connection, its duration, the number of sent and received packets, and the observed round-trip time.
![](tstat.png)

In this lab, since the focus is on developing logistic regression from scratch, we only consider a portion of the dataset for simplicity. The data is in `log_tcp_part.csv`, which has multiple columns: the last one is the class label, indicating whether the flow comes from **google** or **youtube**, and the rest are features. Your job is to solve a binary classification task, classifying the domain of each flow (row) **from scratch**, which includes:
- Build a logistic regression model,
- Evaluate the performance.

1. Load the dataset.
2. Get the list of features (columns 1 to 10).
3. Add a new column and assign numerical class labels of -1 and 1 to google and youtube.
4. Answer the following questions:
    - How many features do we have?
    - How many samples do we have in total?
    - How many samples do we have for each class? Are they similar?

In [9]:
df = pd.read_csv("log_tcp_part.csv")
features = df.columns.values[:-1]
df.loc[df['class']=='google', 'label'] = -1
df.loc[df['class']=='youtube', 'label'] = 1
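
The three questions above can be answered with a few pandas calls. The sketch below runs on a small made-up frame standing in for `log_tcp_part.csv`; the column names `f1` and `f2`, the values, and the `toy_*` names are hypothetical and only illustrate the calls.

```python
import pandas as pd

# Hypothetical toy data standing in for log_tcp_part.csv (not the real columns).
toy = pd.DataFrame({
    "f1": [0.1, 0.4, 0.3, 0.9, 0.5],
    "f2": [10, 12, 7, 3, 8],
    "class": ["google", "youtube", "google", "youtube", "google"],
})

# Features are all columns except the last one (the class label).
toy_features = toy.columns.values[:-1]

# Numerical labels: -1 for google, 1 for youtube.
toy["label"] = toy["class"].map({"google": -1, "youtube": 1})

n_features = len(toy_features)               # how many features?
n_samples = len(toy)                         # how many samples in total?
class_counts = toy["class"].value_counts()   # how many samples per class?

print(n_features, n_samples)
print(class_counts)
```

Running the same three expressions on `df` and `features` gives the answers for the real dataset.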

### 2. Implement your logistic regression learning algorithm
Here you will construct a class that defines two methods besides the class initialization:
- `fit`. This method performs empirical risk minimization (ERM): it learns the parameters of the model (i.e., the hypothesis h) from the training data with gradient descent.
- `predict`. Given one sample x (or more), this method performs the inference $sign(h(x))$ to obtain class labels.

Hints:

- The linear function used in the logistic regression is the following: $h(x)=w^T x +b $, where b is a scalar bias.
- Logistic loss: $L((x,y),h)=\log(1+e^{-y h(x)})$
- ERM: $\min_{w,b} f(w,b)=\frac{1}{m}\sum_{i=1}^{m} \log(1+e^{-y^{(i)} h(x^{(i)})})$
- Gradient for weight: $\nabla_w f(w,b) = \frac{1}{m} \sum_i \frac{-y^{(i)}x^{(i)}}{(1+e^{+y^{(i)}h(x^{(i)})})}$
- Gradient for bias: $\nabla_b f(w,b)= \frac{1}{m} \sum_i \frac{-y^{(i)}}{(1+e^{+y^{(i)}h(x^{(i)})})}$
- Update the parameters: $w \leftarrow w - \alpha \nabla_w f(w,b)$, $b \leftarrow b - \alpha \nabla_b f(w,b)$, where $\alpha$ is the learning rate

Notice that the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ appears in both gradients, since $\frac{1}{1+e^{+y h(x)}} = \sigma(-y h(x))$. You can therefore also write a method for the sigmoid function to help with the computation. In terms of $\sigma$, the gradients become:

- Gradient for weight: $\nabla_w f(w,b) = \frac{1}{m} \sum_i (-y^{(i)}x^{(i)})\,\sigma(-y^{(i)} h(x^{(i)}))$
- Gradient for bias: $\nabla_b f(w,b) = \frac{1}{m} \sum_i (-y^{(i)})\,\sigma(-y^{(i)} h(x^{(i)}))$
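
One way to gain confidence in the gradient formulas above is to compare them with central finite differences of the ERM objective. The following is a minimal sketch on random data; the helper names (`objective`, `gradients`), the data, and the step `eps` are assumptions made for illustration, not part of the lab template.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, b, X, y):
    # ERM objective: mean logistic loss log(1 + exp(-y * h(x)))
    return np.mean(np.log1p(np.exp(-y * (X @ w + b))))

def gradients(w, b, X, y):
    # Analytic gradients written with the sigmoid, as in the hints
    s = sigmoid(-y * (X @ w + b))
    grad_w = X.T @ (-y * s) / len(y)
    grad_b = np.mean(-y * s)
    return grad_w, grad_b

# Small random problem (hypothetical data, for checking only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.choice([-1.0, 1.0], size=50)
w = rng.normal(size=3)
b = 0.3

grad_w, grad_b = gradients(w, b, X, y)

# Central finite differences along each coordinate of w, and along b
eps = 1e-6
num_grad_w = np.array([
    (objective(w + eps * e, b, X, y) - objective(w - eps * e, b, X, y)) / (2 * eps)
    for e in np.eye(3)
])
num_grad_b = (objective(w, b + eps, X, y) - objective(w, b - eps, X, y)) / (2 * eps)

max_err = max(np.max(np.abs(grad_w - num_grad_w)), abs(grad_b - num_grad_b))
print("largest difference between analytic and numerical gradient:", max_err)
```

If the analytic formulas are implemented correctly, the printed difference should be tiny compared with the gradient entries themselves.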

In [10]:
class LogisticRegression:
    def __init__(self, learning_rate, num_iterations):
        # initialize your learning rate and number of iterations
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations

    def sigmoid(self, z):
        # calculate the sigmoid function
        return 1 / (1 + np.exp(-z))

    def fit(self, X, y):
        # Initialize weights and bias
        # weights should have the same length as the number of features; you can use np.zeros() to initialize a 0 vector
        # bias is a scalar, you can also choose 0
        self.weights = np.zeros(X.shape[1])
        self.bias = 0
        
        m = X.shape[0]

        for i in range(self.num_iterations):
            # Compute the linear model output h(x) for every sample
            linear_model_output = np.dot(X, self.weights) + self.bias
            # Compute sigmoid(-y * h(x)) elementwise
            sigmoid_output = self.sigmoid(-linear_model_output * y)

            # Compute gradients for weights and bias
            dw = (1 / m) * np.dot(X.T, -y * sigmoid_output)
            db = (1 / m) * np.sum(-y * sigmoid_output)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    def predict(self, X):
        # get the prediction by obtaining the sign of the linear model
        predictions = np.sign(np.dot(X, self.weights) + self.bias)
        
        return predictions
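
A practical note on the `sigmoid` method above: `np.exp(-z)` overflows to `inf` for large negative `z`, which NumPy reports as a `RuntimeWarning` even though the resulting value (0) is still correct. A possible numerically stable variant is sketched below; the name `stable_sigmoid` is hypothetical and not part of the lab template.

```python
import numpy as np

def stable_sigmoid(z):
    # Evaluate 1 / (1 + exp(-z)) without ever exponentiating a large positive number.
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) <= 1, so the usual formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z)), where exp(z) < 1.
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

values = stable_sigmoid(np.array([-1000.0, 0.0, 1000.0]))
print(values)
```

The function always returns a 1-D array, so a scalar input comes back as an array of length one.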

### 3. Use the model
- Initialize your model with a learning rate of `0.1` and `100` iterations.
- Fit your model with the features and targets.
- Get the predictions for the same features.

In [11]:
log_reg = LogisticRegression(learning_rate=0.1, num_iterations=100)
log_reg.fit(df[features].values, df['label'].values)
predictions = log_reg.predict(df[features].values)

### 4. Model evaluation
With the predicted class labels and the ground truth, we now evaluate the model performance through a confusion matrix and numerical metrics. Specifically, you need to derive the following:
- Confusion matrix. Note that you should report the count for each cell of the table. Here the positive class is 1 and the negative class is -1:
\begin{array}{|c|c|c|}
\hline
 & \textbf{Predicted Positive} & \textbf{Predicted Negative} \\
\hline
\textbf{Actual Positive} & \text{True Positive (TP)} & \text{False Negative (FN)} \\
\hline
\textbf{Actual Negative} & \text{False Positive (FP)} & \text{True Negative (TN)} \\
\hline
\end{array}
- Precision of each class and the average value:
$\frac{\text{True Positive (TP)}}{\text{True Positive (TP) + False Positive (FP)}}$
- Recall of each class and the average value:
$\frac{\text{True Positive (TP)}}{\text{True Positive (TP) + False Negative (FN)}}$
- F1-score of each class and the average value:
$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
- Accuracy:
$\frac{\text{True Positive (TP) + True Negative (TN)}}{\text{True Positive (TP) + True Negative (TN) + False Positive (FP) + False Negative (FN)}}$
- Answer the following questions:
    - Do you get the same performance for both classes? If not, which one performs better?
    - Change the learning rate or the number of iterations. Do you get the same performance? Is it better or worse? Why?
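
To explore the last question, it can help to record the training loss at every iteration and compare runs. The sketch below does this on synthetic data with a standalone function rather than the `LogisticRegression` class; the function name `train_with_history`, the data, and the list of learning rates are assumptions made for illustration.

```python
import numpy as np

def train_with_history(X, y, learning_rate, num_iterations):
    # Gradient descent on the mean logistic loss, recording the loss before each update.
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    history = []
    for _ in range(num_iterations):
        margins = y * (X @ w + b)
        # log(1 + exp(-margin)), computed stably with logaddexp
        history.append(np.mean(np.logaddexp(0.0, -margins)))
        # sigmoid(-margin) = 1 / (1 + exp(margin)), computed stably
        s = np.exp(-np.logaddexp(0.0, margins))
        w = w - learning_rate * (X.T @ (-y * s)) / m
        b = b - learning_rate * np.mean(-y * s)
    return w, b, history

# Synthetic, roughly standardized data (hypothetical, not the Tstat logs)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1.0, -1.0)

histories = {}
for lr in (0.01, 0.1, 1.0):
    _, _, histories[lr] = train_with_history(X, y, lr, 100)
    print(f"learning rate {lr}: first loss {histories[lr][0]:.4f}, last loss {histories[lr][-1]:.4f}")
```

Because the weights start at zero, the first recorded loss is always $\log 2$; how far the loss falls within the fixed number of iterations depends on the learning rate and on the scale of the features.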

In [12]:
def model_evaluation(y_true, y_pred):
    # Confusion Matrix
    TP = sum((y_true[i] == 1) and (y_pred[i] == 1) for i in range(len(y_true)))
    FN = sum((y_true[i] == 1) and (y_pred[i] == -1) for i in range(len(y_true)))
    FP = sum((y_true[i] == -1) and (y_pred[i] == 1) for i in range(len(y_true)))
    TN = sum((y_true[i] == -1) and (y_pred[i] == -1) for i in range(len(y_true)))

    # Metrics for Class 1
    precision_pos = TP / (TP + FP) if TP + FP != 0 else 0
    recall_pos = TP / (TP + FN) if TP + FN != 0 else 0
    f1_pos = 2 * (precision_pos * recall_pos) / (precision_pos + recall_pos) if precision_pos + recall_pos != 0 else 0

    # Metrics for Class -1
    precision_neg = TN / (TN + FN) if TN + FN != 0 else 0
    recall_neg = TN / (TN + FP) if TN + FP != 0 else 0
    f1_neg = 2 * (precision_neg * recall_neg) / (precision_neg + recall_neg) if precision_neg + recall_neg != 0 else 0

    # Accuracy
    accuracy = (TP + TN) / (TP + TN + FP + FN)

    # Average Values
    avg_precision = (precision_pos + precision_neg) / 2
    avg_recall = (recall_pos + recall_neg) / 2
    avg_f1 = (f1_pos + f1_neg) / 2

    # Convert results to DataFrame
    df_metrics = pd.DataFrame({
        "precision": [precision_pos, precision_neg, avg_precision],
        "recall": [recall_pos, recall_neg, avg_recall],
        "f1-score": [f1_pos, f1_neg, avg_f1],
        "accuracy": [accuracy, accuracy, accuracy]
    }, index=["1", "-1", "average"])

    df_confusion = pd.DataFrame({
        "1_predicted": [TP, FP],
        "-1_predicted": [FN, TN]
    }, index=["1_actual", "-1_actual"])

    return df_confusion, df_metrics

In [13]:
df_confusion, df_metrics = model_evaluation(df['label'].values, predictions)

In [14]:
df_confusion

| | 1_predicted | -1_predicted |
|---|---|---|
| 1_actual | 6729 | 3271 |
| -1_actual | 2493 | 7507 |


In [15]:
df_metrics

| | precision | recall | f1-score | accuracy |
|---|---|---|---|---|
| 1 | 0.729668 | 0.6729 | 0.700135 | 0.7118 |
| -1 | 0.696511 | 0.7507 | 0.722591 | 0.7118 |
| average | 0.71309 | 0.7118 | 0.711363 | 0.7118 |
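
As a sanity check, the metrics can be recomputed by hand from the four counts of the confusion matrix reported above. The snippet is a small worked example using only those counts and the formulas from this section; the `metrics` dictionary is just a convenient container.

```python
# Counts taken from the confusion matrix above
TP, FN, FP, TN = 6729, 3271, 2493, 7507

# Class 1 (positive)
precision_pos = TP / (TP + FP)
recall_pos = TP / (TP + FN)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# Class -1 (negative): the roles of the two classes are swapped
precision_neg = TN / (TN + FN)
recall_neg = TN / (TN + FP)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

accuracy = (TP + TN) / (TP + TN + FP + FN)

metrics = {
    "precision": (precision_pos, precision_neg),
    "recall": (recall_pos, recall_neg),
    "f1-score": (f1_pos, f1_neg),
}
for name, (pos, neg) in metrics.items():
    print(f"{name}: class 1 = {pos:.6f}, class -1 = {neg:.6f}, average = {(pos + neg) / 2:.6f}")
print(f"accuracy: {accuracy:.4f}")
```

Each class has 10000 actual samples here (6729 + 3271 and 2493 + 7507), which is why the average recall equals the accuracy.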
