# PH3022 - Machine Learning and Neural Computation - Assignment 05

# Q1

# Q1.a

Binary classification is a supervised learning task where each data sample is assigned to one of two possible categories. These two classes are usually encoded as:
$$y \in \{0, 1\}$$

Examples:
- Spam (1) vs. Not Spam (0)
- Car (1) vs. Van (0)
- Disease (1) vs. No Disease (0)

The goal of the classifier is to learn from training data and correctly assign future samples to class 0 or class 1.

# Q1.b

Logistic regression performs binary classification by predicting the probability that an input belongs to the positive class (class 1).

Given input features $x_1, x_2, \ldots, x_n$, logistic regression first computes a linear combination:
$$z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

Here:
- $w_i$ are the model weights,
- $b$ is the bias term,
- $z$ is the linear output.

To convert this linear output into a probability, the sigmoid function is applied:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This gives the probability of the sample being in class 1:
$$P(y = 1 \mid x) = \sigma(z)$$

Finally, classification is done using a threshold:
$$\hat{y} =\begin{cases}1, & \sigma(z) \ge 0.5 \\0, & \sigma(z) < 0.5\end{cases}$$

Thus, logistic regression uses the features to compute a weighted sum, converts it to a probability through the sigmoid function, and applies a threshold to determine the predicted class.
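The three steps above (weighted sum, sigmoid, threshold) can be sketched in a few lines of plain Python. This is only an illustration: the weights `w`, bias `b`, and sample `x` are hypothetical values chosen for the example, and `predict` is a helper name introduced here, not part of any library.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(x, w, b, threshold=0.5):
    """Return (probability of class 1, predicted label) for one sample."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # linear combination
    p = sigmoid(z)                                # P(y = 1 | x)
    return p, int(p >= threshold)                 # thresholded class

# Hypothetical weights, bias, and sample (not taken from any dataset).
w = [0.8, -0.4]
b = -0.1
x = [1.5, 2.0]

p, label = predict(x, w, b)
print(f"P(y=1|x) = {p:.3f}, predicted class = {label}")
```

Here $z = 0.3$, so the probability is slightly above 0.5 and the sample is assigned to class 1.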

# Q2

# Q2.a

For binary classification, the cross-entropy cost function for $m$ training samples is:
$$J(w, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right]$$

Where:
- $y^{(i)}$ is the true label (0 or 1),
- $\hat{y}^{(i)}$ is the predicted probability from the model,
- $w$ are the weights,
- $b$ is the bias,
- $m$ is the total number of samples.

This cost measures how well the predicted probabilities match the true labels.
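As a minimal sketch, the cost $J$ can be computed directly from lists of labels and predicted probabilities. The function name `bce_cost`, the clipping constant `eps`, and the sample values are assumptions made for this example; the clipping is only there to keep `log(0)` from occurring.

```python
import math

def bce_cost(y_true, y_prob, eps=1e-12):
    """Average binary cross-entropy over m samples (natural logarithm)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4]

cost = bce_cost(y_true, y_prob)
print(f"J = {cost:.4f}")
```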

# Q2.b

For a single training example, the binary cross-entropy loss is:
$$L(y, \hat{y}) = - \left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$

Where:
- $y$ is the true class label (0 or 1),
- $\hat{y} = \sigma(z)$ is the predicted probability from the sigmoid function.

This loss is close to zero when the predicted probability is close to the true label, and it grows without bound as the predicted probability approaches the opposite label.

# Q2.c

To understand how loss changes with the linear output $z$, we recall that:
$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$
where
$$z = w^T x + b$$

Thus, the cross-entropy loss as a function of $z$ becomes:

- For a true label of $y = 1$:
$$L = -\log(\sigma(z))$$
- For a true label of $y = 0$:
$$L = -\log(1 - \sigma(z))$$

A plot of loss vs. $z$ shows the following behaviour:

- When $y = 1$:
  - The loss is a monotonically decreasing function of $z$.
  - As $z \to +\infty$, $\sigma(z) \to 1$ and the loss approaches zero.

- When $y = 0$:
  - The loss is a monotonically increasing function of $z$.
  - As $z \to -\infty$, $\sigma(z) \to 0$ and the loss approaches zero.

Confidently incorrect predictions (large positive $z$ for $y = 0$ or large negative $z$ for $y = 1$) result in very high loss values: in these regions the loss grows roughly linearly with $|z|$.
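The shape of the two curves can be checked numerically without plotting. The sketch below uses the identities $-\log\sigma(z) = \log(1 + e^{-z})$ and $-\log(1 - \sigma(z)) = \log(1 + e^{z})$, which follow from the definition of the sigmoid; the helper names `softplus` and `loss_given_z` and the chosen $z$ values are assumptions for this example.

```python
import math

def softplus(t):
    """log(1 + e^t), computed without overflow for large |t|."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))

def loss_given_z(z, y):
    """Cross-entropy loss for one sample as a function of the linear output z."""
    # y = 1: -log(sigma(z))     = log(1 + e^{-z})
    # y = 0: -log(1 - sigma(z)) = log(1 + e^{z})
    return softplus(-z) if y == 1 else softplus(z)

for z in [-6, -3, 0, 3, 6]:
    print(f"z = {z:+d}  loss(y=1) = {loss_given_z(z, 1):.4f}  "
          f"loss(y=0) = {loss_given_z(z, 0):.4f}")
```

The two columns mirror each other: the loss for $y = 1$ at $z$ equals the loss for $y = 0$ at $-z$, and both equal $\log 2$ at $z = 0$.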

# Q2.d

Consider an email dataset where:
- Class 1 = non-spam  
- Class 0 = spam

Suppose logistic regression predicts spam (0),  
but the true label is non-spam (1).

This means:
- True label: $y = 1$
- Predicted probability of class 1: $\hat{y} < 0.5$, since the model chose class 0
- So the loss is:
$$L = -\log(\hat{y})$$

Because $\hat{y} < 0.5$, the loss is at least $-\log(0.5) \approx 0.693$ (natural logarithm), and it grows without bound as $\hat{y}$ approaches zero.

Therefore, we expect a high loss.

On the loss-vs-$z$ plot:
- The model produced a negative $z$ (leaning toward class 0),
- But for a true label of $y=1$, negative $z$ is the region where the loss is high, and it keeps rising as $z$ becomes more negative.

Thus, predicting spam when the email is actually non-spam results in a large cross-entropy loss, and the more confident the wrong prediction, the larger the loss.
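
As a rough numerical illustration, assume the model outputs $\hat{y} = 0.1$ for this email (a hypothetical value, not taken from any dataset) and that $\log$ is the natural logarithm. The loss for the true label $y = 1$ is then:
$$L = -\log(0.1) \approx 2.303$$
For comparison, a correct prediction made with the same confidence, $\hat{y} = 0.9$, would give:
$$L = -\log(0.9) \approx 0.105$$
Under these assumed values, the wrong prediction costs roughly twenty times more than the right one.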

# Q3

<p>The positive label is: <b>Car</b></p>

<table border="1" cellpadding="6" cellspacing="0">
    <tr>
        <th>Case</th>
        <th>Actual Label</th>
        <th>Predicted Label</th>
        <th>Classification Type</th>
    </tr>
    <tr>
        <td>1</td>
        <td>Car</td>
        <td>Van</td>
        <td><b>False Negative (FN)</b></td>
    </tr>
    <tr>
        <td>2</td>
        <td>Car</td>
        <td>Car</td>
        <td><b>True Positive (TP)</b></td>
    </tr>
    <tr>
        <td>3</td>
        <td>Van</td>
        <td>Car</td>
        <td><b>False Positive (FP)</b></td>
    </tr>
    <tr>
        <td>4</td>
        <td>Van</td>
        <td>Van</td>
        <td><b>True Negative (TN)</b></td>
    </tr>
</table>
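
The rule behind the table can be written as a small function. This is a sketch only: `outcome` is a helper name introduced here, and the four cases are the ones from the table above.

```python
def outcome(actual, predicted, positive="Car"):
    """Classify one (actual, predicted) pair relative to the positive label."""
    if predicted == positive:
        # The model said "positive": true if the actual label agrees.
        return "TP" if actual == positive else "FP"
    # The model said "negative": true if the actual label is also negative.
    return "TN" if actual != positive else "FN"

# (actual, predicted) pairs for cases 1 to 4.
cases = [("Car", "Van"), ("Car", "Car"), ("Van", "Car"), ("Van", "Van")]
results = [outcome(actual, predicted) for actual, predicted in cases]
print(results)
```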


# Q4

Given confusion matrix:

<table border="1" cellpadding="6" cellspacing="0">
    <tr>
        <th></th>
        <th>Actual Positive</th>
        <th>Actual Negative</th>
    </tr>
    <tr>
        <th>Predicted Positive</th>
        <td>80</td>
        <td>8</td>
    </tr>
    <tr>
        <th>Predicted Negative</th>
        <td>6</td>
        <td>75</td>
    </tr>
</table>

<p>From this confusion matrix:</p>


- True Positive (TP) = 80
- False Positive (FP) = 8
- False Negative (FN) = 6
- True Negative (TN) = 75

# Q4.a

$$TP = 80,\quad TN = 75,\quad FP = 8,\quad FN = 6$$

# Q4.b

Accuracy is:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Substitute values:
$$\text{Accuracy} = \frac{80 + 75}{80 + 75 + 8 + 6}= \frac{155}{169}\approx 0.917$$
So, accuracy $\approx 0.917$ (91.7%).

# Q4.c

Precision is:
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Precision} = \frac{80}{80 + 8}= \frac{80}{88}\approx 0.909$$
So, precision $\approx 0.909$ (90.9%).

# Q4.d

Recall (True Positive Rate) is:
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{Recall} = \frac{80}{80 + 6}= \frac{80}{86}\approx 0.930$$
So, recall $\approx 0.930$ (93.0%).

# Q4.e

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Substitute values:
$$F1 = 2 \times \frac{0.909 \times 0.930}{0.909 + 0.930}$$
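$$F1 \approx 2 \times \frac{0.845}{1.839}\approx 0.919$$
Using the unrounded precision and recall, which is equivalent to $F1 = \frac{2\,TP}{2\,TP + FP + FN} = \frac{160}{174}$, gives
$$F1 \approx 0.920$$
So, F1 score $\approx 0.920$ (92.0%).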
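
The four metrics can be cross-checked directly from the confusion-matrix counts. This is a minimal sketch in plain Python; the variable names are chosen for this example.

```python
# Counts read from the confusion matrix in Q4.
tp, tn, fp, fn = 80, 75, 8, 6

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

for name, value in [("Accuracy", accuracy), ("Precision", precision),
                    ("Recall", recall), ("F1", f1)]:
    print(f"{name:<9} = {value:.3f}")
```

Working with unrounded values avoids the small discrepancy that appears when precision and recall are rounded to three decimals before being combined.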

# Q5

Given classes in the dataset:

- White dwarf  
- Brown dwarf  
- Red dwarf  
- Blue Giant  
- Red Giant  

Total number of classes:
$$K = 5$$

# Q5.a

In one-vs-rest, each class is treated as the positive class once, while the remaining classes are grouped together as negative.

Number of classifiers:
$$\text{OvR classifiers} = K = 5$$
So, 5 binary classifiers are required.

# Q5.b

In one-vs-one, a classifier is trained for every pair of classes.

Number of classifiers:
$$\text{OvO classifiers} = \frac{K(K - 1)}{2}$$
$$= \frac{5 \times 4}{2} = 10$$
So, 10 binary classifiers are required.
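
Both counts can be confirmed by enumerating the classifiers explicitly. The sketch below uses only the standard library; the variable names are chosen for this example.

```python
from itertools import combinations

classes = ["White dwarf", "Brown dwarf", "Red dwarf", "Blue Giant", "Red Giant"]

# One-vs-rest: one classifier per class.
n_ovr = len(classes)

# One-vs-one: one classifier per unordered pair of classes.
pairs = list(combinations(classes, 2))
n_ovo = len(pairs)

print(f"OvR classifiers: {n_ovr}")
print(f"OvO classifiers: {n_ovo}")
for a, b in pairs:
    print(f"  {a} vs. {b}")
```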

# Q5.c

In the one-vs-rest method, we build five classifiers. Each classifier separates one class from all others.

Example:
- Classifier 1: White dwarf vs. (Brown dwarf + Red dwarf + Blue Giant + Red Giant)  
- Classifier 2: Brown dwarf vs. (White dwarf + Red dwarf + Blue Giant + Red Giant)  
- Classifier 3: Red dwarf vs. all others  
- Classifier 4: Blue Giant vs. all others  
- Classifier 5: Red Giant vs. all others  

For a new star:
1. Each classifier outputs a probability score.  
2. The class with the highest probability is chosen as the final prediction.

OvR is simple and works well when the classes are well separated.
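
The decision step can be sketched as an argmax over the five classifier outputs. The scores below are hypothetical values standing in for the outputs of five trained classifiers. Since each OvR classifier is trained independently, its scores need not sum to 1.

```python
# Hypothetical P(class | star) from each of the five one-vs-rest classifiers.
scores = {
    "White dwarf": 0.10,
    "Brown dwarf": 0.05,
    "Red dwarf": 0.72,
    "Blue Giant": 0.20,
    "Red Giant": 0.41,
}

# Final prediction: the class whose classifier is most confident.
prediction = max(scores, key=scores.get)
print(f"Predicted class: {prediction} (score {scores[prediction]:.2f})")
```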

# Q5.d

In the one-vs-one method, a classifier is built for each pair of classes. For five classes, this gives 10 classifiers.

Examples:
- White dwarf vs. Brown dwarf  
- White dwarf vs. Red dwarf  
- White dwarf vs. Blue Giant  
- White dwarf vs. Red Giant  
- Brown dwarf vs. Red dwarf  
- ... and so on for all pairs

For a new star:
1. Each classifier predicts one of the two classes.  
2. Every prediction counts as a vote for a class.  
3. The class with the highest number of votes is selected as the final prediction.

OvO can work well when classes overlap, since each classifier only has to separate two classes at a time, at the cost of training more classifiers (10 instead of 5 here).
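
The voting procedure can be sketched as follows. No real classifiers are trained here: `pairwise_predict` is a stand-in for a trained pairwise model, and the `strength` values are hypothetical numbers used only to decide each pairwise contest.

```python
from collections import Counter
from itertools import combinations

classes = ["White dwarf", "Brown dwarf", "Red dwarf", "Blue Giant", "Red Giant"]

# Hypothetical per-class "strength" used only to fake the pairwise outcomes.
strength = {"White dwarf": 0.2, "Brown dwarf": 0.1, "Red dwarf": 0.7,
            "Blue Giant": 0.3, "Red Giant": 0.5}

def pairwise_predict(a, b):
    """Stand-in for the binary classifier trained on classes a and b."""
    return a if strength[a] >= strength[b] else b

# Each of the 10 pairwise classifiers casts one vote.
votes = Counter(pairwise_predict(a, b) for a, b in combinations(classes, 2))

# The class with the most votes is the final prediction.
winner, n_votes = votes.most_common(1)[0]
print(dict(votes))
print(f"Predicted class: {winner} with {n_votes} votes")
```

With five classes, a class can receive at most four votes, one from each pairwise classifier it takes part in.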

# Q6

# Load Dataset and Split

In [1]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

# Load data
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression with Different Hyperparameters

In [2]:
model = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=200)
model.fit(X_train, y_train)

# Evaluate Model

In [3]:
y_pred = model.predict(X_test)

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

Confusion Matrix:
 [[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Classification Report:

              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



# Hyperparameter Discussion

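- Penalty: `l2` (the scikit-learn default) gives smooth regularisation by shrinking the weights toward zero
- C: `C` is the inverse regularisation strength; the moderate value 1.0 showed no sign of overfitting on this split
- Solver: `lbfgs` converges efficiently for multiclass problems
- Accuracy is 100% on the test set (30 of 30 samples classified correctly).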
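
One way to compare several hyperparameter settings is cross-validation on the training split. The sketch below is an assumption about how such a comparison could be run, not the procedure used above: the grid of `C` values, `cv=5`, and `max_iter=1000` are illustrative choices, and the penalty and solver are left at the scikit-learn defaults.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Same split as in the cells above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate C (illustrative grid).
cv_results = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_results[C] = scores.mean()
    print(f"C = {C:<5} mean CV accuracy = {cv_results[C]:.3f}")

best_C = max(cv_results, key=cv_results.get)
print(f"Best C by cross-validation: {best_C}")
```

Selecting `C` on the training folds keeps the 30 test samples untouched for the final evaluation.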

# Final Comment

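Logistic regression performs very well on the Iris dataset because the classes are almost linearly separable: Setosa is linearly separable from the other two species, while Versicolor and Virginica overlap only slightly. Regularisation helps prevent overfitting, and the lbfgs solver efficiently handles multiclass classification. Since the test set contains only 30 samples, the perfect score should be read as an estimate for this particular split.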