<center><b><font size=6>Lab-6 A classifier from scratch</font></b></center>

### Objective: Implement, use and evaluate a classifier (without using specific libraries such as sklearn)
1. **Logistic regression** is a binary classification method that maps a linear combination of parameters and variables into two possible classes. Here, you will implement the logistic regression from scratch to better understand how an ML algorithm works. Useful link: <a href="https://en.wikipedia.org/wiki/Logistic_regression">Wiki</a>.
2. **Performance evaluation metrics** are needed to evaluate the outcome of prediction with respect to true labels. Here, you will implement confusion matrix, accuracy, precision, recall and F-measure. Useful link: <a href="https://en.wikipedia.org/wiki/Confusion_matrix">Wiki</a>.

In [125]:
# import needed python libraries

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import random
import math

### 1. Dataset - TCP logs
The dataset contains traffic information generated by **tstat**, an open-source passive network monitoring tool. It automates the collection of packet statistics for traffic aggregates, using real-time monitoring features. Because it is a passive tool, its typical usage scenario is live monitoring of Internet links, in which all transmitted packets are observed. In the case of TCP, tstat identifies the start of a new flow when it observes a TCP three-way handshake. Similarly, it identifies the end of a TCP flow either when it sees the TCP connection teardown, or when it doesn’t observe packets for some time (idle time).

A flow is defined by a unique link between the sender and receiver, e.g., a tuple of <em>(IP_Protocol_Type, IP_Source_Address, Source_Port, IP_Destination_Address, Destination_Port)</em>. For each flow, tstat calculates a number of statistics over all the packets transmitted on that flow, and then generates a log entry for the flow with multiple attributes (statistics). A log file is arranged as a simple table in which each column holds a specific piece of information and each row describes one flow. The log information is a summary of the flow properties. For instance, the TCP log contains columns such as the starting time of a TCP connection, its duration, the number of sent and received packets, and the observed Round Trip Time.
![](tstat.png)
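To make the notion of a flow concrete, the following is a minimal sketch of grouping packets by the 5-tuple and summarizing each group into one log row. It is not how tstat is implemented: the packet records, the field layout and the `aggregate_flows` helper are hypothetical, and the sketch treats each direction as its own flow and ignores handshake, teardown and idle-timeout detection.

```python
from collections import defaultdict

# Hypothetical packet records (illustrative only):
# (timestamp_s, proto, src_ip, src_port, dst_ip, dst_port, size_bytes)
packets = [
    (0.00, "TCP", "10.0.0.1", 50000, "142.250.0.1", 443, 60),
    (0.05, "TCP", "10.0.0.1", 50000, "142.250.0.1", 443, 1500),
    (0.10, "TCP", "10.0.0.2", 50001, "142.250.0.2", 443, 60),
    (0.30, "TCP", "10.0.0.1", 50000, "142.250.0.1", 443, 400),
]

def aggregate_flows(packets):
    """Group packets by 5-tuple and compute simple per-flow statistics."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for ts, proto, src_ip, src_port, dst_ip, dst_port, size in packets:
        key = (proto, src_ip, src_port, dst_ip, dst_port)  # the flow identifier
        rec = flows[key]
        rec["packets"] += 1
        rec["bytes"] += size
        rec["first"] = ts if rec["first"] is None else min(rec["first"], ts)
        rec["last"] = ts if rec["last"] is None else max(rec["last"], ts)
    # One summary row per flow, with the duration derived from the timestamps
    return {k: {**v, "duration": v["last"] - v["first"]} for k, v in flows.items()}

flow_log = aggregate_flows(packets)
for key, stats in flow_log.items():
    print(key, stats)
```

Each entry of `flow_log` plays the role of one row of the log table: the key identifies the flow and the values are its summary statistics.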

In this lab, since the focus is on developing logistic regression from scratch, we only consider a portion of the dataset for simplicity. The data can be found in `log_tcp_part.csv`. The file has multiple columns: the last one is the class label, indicating whether the flow comes from **google** or **youtube**, and the rest are features. Your job is a binary classification task: classify the domain of each flow (row) **from scratch**, including:
- Build a logistic regression model,
- Evaluate the performance.

1. Load the dataset.
2. Get the list of features (columns 1 to 10).
3. Add a new column and assign numerical class labels of -1 and 1 to google and youtube.
4. Answer the following questions:
    - How many features do we have? 10
    - How many samples do we have in total? 20000
    - How many samples do we have for each class? Are they similar? 10000 and 10000, they are identical

In [126]:
dataframe = pd.read_csv("log_tcp_part.csv")
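# The first 10 columns are the features; "class" (the last column) is the label
features = dataframe.columns[:10]

# Numerical class labels: google -> -1, youtube -> 1
dataframe["m_classes"] = np.where(dataframe["class"] == "google", -1, 1)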
dataframe

dataframe["class"].value_counts()

google     10000
youtube    10000
Name: class, dtype: int64

### 2. Implement your logistic regression learning algorithm
Here you will need to construct a class in which you need to define two functions besides the class initialization:
- `fit`. This method performs empirical risk minimization (ERM): it learns the parameters of the model (i.e., the hypothesis $h$) from the training data with gradient descent.
- `predict`. Given one sample $x$ (or more), this method performs the inference $\mathrm{sign}(h(x))$ to obtain class labels.

Hints:

- The linear function used in the logistic regression is the following: $h(x)=w^T x +b $, where b is a scalar bias.
- Logistic loss: $L((x,y),h)=\log(1+e^{-y h(x)})$
- ERM: $\min_{w,b} f(w,b)=\frac{1}{m}\sum_{i=1}^{m} \log(1+e^{-y^{(i)} h(x^{(i)})})$
- Gradient for weight: $\nabla_w f(w,b) = \frac{1}{m} \sum_i \frac{-y^{(i)}x^{(i)}}{1+e^{y^{(i)}h(x^{(i)})}}$
- Gradient for bias: $\nabla_b f(w,b)= \frac{1}{m} \sum_i \frac{-y^{(i)}}{1+e^{y^{(i)}h(x^{(i)})}}$
- Update the parameters: $w \leftarrow w - \alpha \nabla_w f(w,b)$, $b \leftarrow b - \alpha \nabla_b f(w,b)$, where $\alpha$ is the learning rate.
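A quick way to gain confidence in the gradients is to compare them with finite differences of the empirical risk. The sketch below does this on small random data. It uses the gradient obtained by differentiating the logistic loss directly, i.e. a per-sample factor of $-y/(1+e^{y h(x)})$. The helper names (`risk`, `gradients`) and the demo data are illustrative and not part of the lab.

```python
import numpy as np

def risk(w, b, X, y):
    """Empirical logistic risk: mean of log(1 + exp(-y * h(x)))."""
    z = X @ w + b
    return np.mean(np.logaddexp(0.0, -y * z))  # stable log(1 + exp(.))

def gradients(w, b, X, y):
    """Analytic gradients of the risk for labels y in {-1, +1}."""
    z = X @ w + b
    coeff = -y / (1.0 + np.exp(y * z))  # per-sample factor
    return X.T @ coeff / len(y), np.mean(coeff)

# Small random problem (hypothetical data, only for the check)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = rng.choice([-1.0, 1.0], size=50)
w_demo = rng.normal(size=3)
b_demo = 0.2

grad_w, grad_b = gradients(w_demo, b_demo, X_demo, y_demo)

# Central finite differences along each coordinate of w, and along b
eps = 1e-6
num_grad_w = np.array([
    (risk(w_demo + eps * e, b_demo, X_demo, y_demo)
     - risk(w_demo - eps * e, b_demo, X_demo, y_demo)) / (2 * eps)
    for e in np.eye(3)
])
num_grad_b = (risk(w_demo, b_demo + eps, X_demo, y_demo)
              - risk(w_demo, b_demo - eps, X_demo, y_demo)) / (2 * eps)

print(np.max(np.abs(grad_w - num_grad_w)), abs(grad_b - num_grad_b))
```

If the analytic and numerical gradients disagree by more than a small tolerance, the gradient code (or the formula behind it) most likely has a sign or indexing error.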

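Notice that the sigmoid function $\sigma(z) = \frac{1}{1 + e^{-z}}$ appears in both gradients (it is written $\sigma$ here to avoid a clash with the objective $f(w,b)$). You can also write a method for the sigmoid function to help you in the computation. With labels $y^{(i)} \in \{-1, 1\}$, the gradients can be rewritten as:

- Gradient for weight: $\nabla_w f(w,b) = \frac{1}{m} \sum_i -y^{(i)}\,\sigma(-y^{(i)} h(x^{(i)}))\,x^{(i)}$
- Gradient for bias: $\nabla_b f(w,b) = \frac{1}{m} \sum_i -y^{(i)}\,\sigma(-y^{(i)} h(x^{(i)}))$

Equivalently, if the labels are mapped to $t^{(i)} = (y^{(i)}+1)/2 \in \{0, 1\}$, the per-sample factor becomes $\sigma(h(x^{(i)})) - t^{(i)}$. Note that the form $\sigma(h(x^{(i)})) - y^{(i)}$ is only valid for labels in $\{0, 1\}$, not for labels in $\{-1, 1\}$.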
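The label convention matters when the gradient is written with the sigmoid. The sketch below checks numerically that, for labels in {-1, 1}, the per-sample factor `-y * sigmoid(-y * z)` equals `sigmoid(z) - t` with `t = (y + 1) / 2` in {0, 1}. The standalone `sigmoid` helper is an illustrative, overflow-safe variant and not part of the lab code.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])          # z < 0 here, so exp(z) cannot overflow
    out[~pos] = ez / (1.0 + ez)
    return out

# Scores h(x) and labels in {-1, +1} (illustrative values, including extremes)
z = np.array([-800.0, -2.0, 0.0, 3.0, 800.0])
y_pm = np.array([-1.0, 1.0, 1.0, -1.0, 1.0])
t01 = (y_pm + 1) / 2              # same labels mapped to {0, 1}

coeff_pm = -y_pm * sigmoid(-y_pm * z)   # factor for labels in {-1, +1}
coeff_01 = sigmoid(z) - t01             # factor for labels in {0, 1}

print(np.allclose(coeff_pm, coeff_01))
```

Using `sigmoid(z) - y` directly with labels in {-1, 1} gives a different quantity, so the label encoding and the gradient formula have to be chosen together.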

In [127]:
X = dataframe.iloc[:,0:10]
y = dataframe["m_classes"]  # numerical labels in {-1, 1}

In [131]:
class LogisticRegression:
    
    def __init__(self, learning_rate, num_iterations):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.b = None
        self.W = None
        self.ER = None

    def sigmoid(self, z):
        return 1/(1 + np.exp(-z))
        

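    def fit(self, X, y):
        # Work on NumPy arrays so that DataFrames/Series and arrays are both accepted
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        m, n_features = X.shape

        self.b = 0.0
        self.W = np.random.uniform(-1, 1, n_features)

        for _ in range(self.num_iterations):
            # Linear scores h(x) = w^T x + b for all samples
            z = X.dot(self.W) + self.b

            # Empirical risk (mean logistic loss); logaddexp avoids overflow
            self.ER = np.mean(np.logaddexp(0, -y * z))

            # Gradients for labels in {-1, +1}: per-sample factor -y * sigmoid(-y * h(x))
            coeff = -y * self.sigmoid(-y * z)
            dw = X.T.dot(coeff) / m
            db = np.sum(coeff) / m

            # Gradient descent step, performed at every iteration
            self.W -= self.learning_rate * dw
            self.b -= self.learning_rate * db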
        
    def predict(self, X):
        # h(x) = w^T x + b for every row; the label is sign(h(x)).
        # sigmoid(h(x)) > 0.5 is equivalent to h(x) > 0.
        z = np.asarray(X, dtype=float).dot(self.W) + self.b
        return np.where(z > 0, 1, -1).tolist()

### 3. Use the model
- Initialize your model with a learning rate of `0.1` and `100` iterations.
- Fit your model with the features and targets.
- Get the predictions from the features.
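Predictions made on samples that were also used for fitting tend to give an optimistic picture of performance. One common alternative is to hold out a shuffled subset for evaluation. The following is a minimal sketch of such a split written with NumPy only. The `train_test_split_indices` helper, the 20% test fraction and the seed are assumptions for illustration, not part of the lab instructions.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return shuffled, disjoint (train, test) row indices."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return permuted[n_test:], permuted[:n_test]

# 20000 is the number of samples reported for this dataset
train_idx, test_idx = train_test_split_indices(20000, test_fraction=0.2)
print(len(train_idx), len(test_idx))

# With pandas objects, the rows could then be selected positionally, e.g.
# X.iloc[train_idx], y.iloc[train_idx] for fitting and
# X.iloc[test_idx], y.iloc[test_idx] for evaluation.
```

Shuffling before splitting also keeps both classes represented in each subset when the file is ordered by class.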

In [134]:
classifier = LogisticRegression(0.1,100)

classifier.fit(X,y)

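# Evaluate on a 200-row slice (rows 9900 to 10099).
# Note: these rows are also part of the data used for fitting above.
x_test = X[9900:10100]
y_true = y[9900:10100]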


y_pred = classifier.predict(x_test)
# y_pred



### 4. Model evaluation
With predicted class labels and ground truths, we now evaluate the model performance through confusion matrix and numerical metrics. Specifically, you need to derive the following:
- Confusion matrix. Indicate the count for each cell of the table. Here, the positive class is 1 and the negative class is -1:
\begin{array}{|c|c|c|}
\hline
 & \textbf{Predicted Positive} & \textbf{Predicted Negative} \\
\hline
\textbf{Actual Positive} & \text{True Positive (TP)} & \text{False Negative (FN)} \\
\hline
\textbf{Actual Negative} & \text{False Positive (FP)} & \text{True Negative (TN)} \\
\hline
\end{array}
- Precision of each class and the average value:
$\frac{\text{True Positive (TP)}}{\text{True Positive (TP) + False Positive (FP)}}$
- Recall of each class and the average value:
$\frac{\text{True Positive (TP)}}{\text{True Positive (TP) + False Negative (FN)}}$
- F1-score of each class and the average value:
$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
- Accuracy:
$\frac{\text{True Positive (TP) + True Negative (TN)}}{\text{True Positive (TP) + True Negative (TN) + False Positive (FP) + False Negative (FN)}}$
- Answer the following questions:
    - Do you get the same performance for both classes? If not, which one performs better?
    - Change the learning rate or the number of iterations. Do you get the same performance? Is it better or worse? Why?

In [135]:
# Initialize confusion matrix elements
tp = fp = fn = tn = 0

# Compute TP, FP, FN, and TN
for actual, predicted in zip(y_true, y_pred):
    if actual == 1 and predicted == 1:
        tp += 1  # True Positive
    elif actual == -1 and predicted == 1:
        fp += 1  # False Positive
    elif actual == 1 and predicted == -1:
        fn += 1  # False Negative
    elif actual == -1 and predicted == -1:
        tn += 1  # True Negative

# Precision, Recall, F1-Score, and Accuracy calculations
precision_pos = tp / (tp + fp) if (tp + fp) > 0 else 0
precision_neg = tn / (tn + fn) if (tn + fn) > 0 else 0
recall_pos = tp / (tp + fn) if (tp + fn) > 0 else 0
recall_neg = tn / (tn + fp) if (tn + fp) > 0 else 0

# Average precision and recall
precision_avg = (precision_pos + precision_neg) / 2
recall_avg = (recall_pos + recall_neg) / 2

# F1-Scores
f1_pos = (2 * precision_pos * recall_pos) / (precision_pos + recall_pos) if (precision_pos + recall_pos) > 0 else 0
f1_neg = (2 * precision_neg * recall_neg) / (precision_neg + recall_neg) if (precision_neg + recall_neg) > 0 else 0
f1_avg = (f1_pos + f1_neg) / 2

# Accuracy
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Print the results
print("Confusion Matrix:")
print(f"TP: {tp}, FP: {fp}, FN: {fn}, TN: {tn}")
print("\nMetrics:")
print(f"Precision (Positive Class): {precision_pos}")
print(f"Precision (Negative Class): {precision_neg}")
print(f"Average Precision: {precision_avg}")
print(f"Recall (Positive Class): {recall_pos}")
print(f"Recall (Negative Class): {recall_neg}")
print(f"Average Recall: {recall_avg}")
print(f"F1-Score (Positive Class): {f1_pos}")
print(f"F1-Score (Negative Class): {f1_neg}")
print(f"Average F1-Score: {f1_avg}")
print(f"Accuracy: {accuracy}")

Confusion Matrix:
TP: 63, FP: 9, FN: 37, TN: 91

Metrics:
Precision (Positive Class): 0.875
Precision (Negative Class): 0.7109375
Average Precision: 0.79296875
Recall (Positive Class): 0.63
Recall (Negative Class): 0.91
Average Recall: 0.77
F1-Score (Positive Class): 0.7325581395348838
F1-Score (Negative Class): 0.7982456140350876
Average F1-Score: 0.7654018767849857
Accuracy: 0.77
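
As a check on the arithmetic, the printed metrics follow directly from the printed counts (TP = 63, FP = 9, FN = 37, TN = 91). For the negative class, TN plays the role of TP and the roles of FP and FN are swapped. The "average" values are unweighted means over the two classes.

$$
\begin{aligned}
\text{Precision}_{+} &= \frac{63}{63+9} = 0.875, & \text{Precision}_{-} &= \frac{91}{91+37} \approx 0.711,\\
\text{Recall}_{+} &= \frac{63}{63+37} = 0.63, & \text{Recall}_{-} &= \frac{91}{91+9} = 0.91,\\
F_{1,+} &= \frac{2 \times 0.875 \times 0.63}{0.875 + 0.63} \approx 0.733, & F_{1,-} &= \frac{2 \times 0.711 \times 0.91}{0.711 + 0.91} \approx 0.798,\\
\text{Accuracy} &= \frac{63+91}{63+91+9+37} = 0.77.
\end{aligned}
$$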
