# Binary Classification Models

## Objectives:

1. **Build a Binary Classification Model with Linear Decision Boundary:**
   - Import necessary libraries.
   - Read the dataset from spambase folder, view it, and check for missing values.
   - Define functions for splitting the dataset and standardizing it using Scikit-Learn.
   - Convert dataset into numpy arrays, and split the dataset.
   - Train linear and non-linear/polynomial logistic regression models using Scikit-Learn.
   - Evaluate each model on training, cross-validation, and testing datasets.
   - Determine the optimal model.
   - Build a custom model based on the resulting optimal model.
   - Train and evaluate the custom model and compare results with the Scikit-Learn model.

2. **Build a Binary Classification Model with Non-Linear Decision Boundary:**
   - Read the dataset from customer_churn folder, view it, and check for missing values.
   - Convert dataset into numpy arrays, and split the dataset.
   - Train and evaluate linear and non-linear/polynomial logistic regression models using Scikit-Learn.
   - Determine the optimal model.
   - Based on the resulting optimal model, build a custom model.
   - Train and evaluate the custom model and compare results with the Scikit-Learn model.

## 1. Binary Classification Model with Linear Decision Boundary

In [1]:
# Import necessary libraries
import numpy as np
import plotly.express as px

##### This project uses data from the [`UCI Machine Learning Repository`](https://archive.ics.uci.edu/dataset/94/spambase). The dataset is licensed under a [`Creative Commons Attribution 4.0 International (CC BY 4.0) license`](https://creativecommons.org/licenses/by/4.0/legalcode). Variable names were added to the `spambase.data` file, and the dataset was converted into NumPy arrays to build the machine learning models.

In [2]:
import pandas as pd

data = pd.read_csv("spambase/spambase.data")
data

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Class
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278,1
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028,1
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259,1
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191,1
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88,0
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14,0
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118,0
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78,0


In [3]:
# Check for any missing values in the dataset
print(f"\nMissing values in dataset: \n{data.isna().any().values}")


Missing values in dataset: 
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False]


We have a clean dataset with **4,601** examples, **57** feature variables, and **1** binary target variable. Our goal is to build a binary classification model with a linear decision boundary. To check whether a linear boundary is adequate for this dataset, we first use scikit-learn to train and evaluate both linear and non-linear/polynomial logistic regression models. After identifying the best scikit-learn model, we build a custom model of the same type and compare its results with the scikit-learn model. Settling the model type first saves time, because only one custom model has to be written.

In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_dataset(X, y):
    # Splitting the data into training (60%), cross-validation (20%), and testing (20%) sets
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

    return X_train, y_train, X_val, y_val, X_test, y_test


def standardize_dataset(X_train, X_val, X_test):
    # Standardizing the datasets
    scaler = StandardScaler()

    # Fitting the scaler on the training data and transforming training, validation, and testing sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    X_test_scaled = scaler.transform(X_test)

    return scaler, X_train_scaled, X_val_scaled, X_test_scaled
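
As a rough illustration of what `standardize_dataset` does, the sketch below reproduces the same idea with plain NumPy on a made-up array: the mean and standard deviation are estimated from the training rows only and then reused for the other split. The toy arrays and the helper name `standardize_with_train_stats` are assumptions for illustration. The population standard deviation (`ddof=0`) is used because that is what `StandardScaler` computes.

```python
import numpy as np


def standardize_with_train_stats(X_train, X_other):
    # Statistics are estimated on the training rows only
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)  # population standard deviation (ddof=0)
    # Guard against constant columns, which have zero variance
    std = np.where(std == 0, 1.0, std)
    return (X_train - mean) / std, (X_other - mean) / std


# Hypothetical toy data: 3 training rows, 1 validation row, 2 features
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_val_toy = np.array([[2.0, 40.0]])

train_scaled, val_scaled = standardize_with_train_stats(X_train_toy, X_val_toy)
print(train_scaled.mean(axis=0))  # close to zero for every column
print(val_scaled)                 # scaled with the training statistics
```

The validation row is not guaranteed to have zero mean or unit variance after scaling, which is expected: it is only expressed in the units learned from the training set.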

In [5]:
# Convert the data into numpy arrays of features (X) and target (y)
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

# Splitting the dataset into training, validation, and test sets
X_train, y_train, X_val, y_val, X_test, y_test = split_dataset(X, y)

# Printing the shapes of the resulting datasets
print(f"Training set: {X_train.shape}, {y_train.shape}")
print(f"Validation set: {X_val.shape}, {y_val.shape}")
print(f"Test set: {X_test.shape}, {y_test.shape}")

Training set: (2760, 57), (2760,)
Validation set: (920, 57), (920,)
Test set: (921, 57), (921,)
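
The split sizes printed above can be reproduced with a little arithmetic. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the remainder to the training side; under that assumption the numbers match the shapes shown.

```python
import math

n_samples = 4601

# First split: test_size=0.4 (assumed to be rounded up)
n_temp = math.ceil(0.4 * n_samples)
n_train = n_samples - n_temp

# Second split: test_size=0.5 of the held-out part
n_test = math.ceil(0.5 * n_temp)
n_val = n_temp - n_test

print(n_train, n_val, n_test)  # 2760 920 921
```

This also explains why the test set has one more row than the validation set: 1,841 held-out rows cannot be halved evenly.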


In [6]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def train_eval_logistic_regression(X_train, y_train, X_val, y_val, X_test, y_test, degree=1):
    # Initialize logistic regression model (linear/polynomial)
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        StandardScaler(),
        LogisticRegression(random_state=42, max_iter=1000)
    )
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Evaluate on training set
    train_acc, train_prec, train_rec, train_f1 = evaluate_model(model, X_train, y_train)
    
    # Evaluate on validation set
    val_acc, val_prec, val_rec, val_f1 = evaluate_model(model, X_val, y_val)

    # Evaluate on testing set
    test_acc, test_prec, test_rec, test_f1 = evaluate_model(model, X_test, y_test)

    if degree == 1:
        print("Linear Logistic Regression:")
    else:
        print(f"\nPolynomial Degree {degree} Logistic Regression:")
    
    print(f"Training set - Accuracy: {train_acc:.4f}, Precision: {train_prec:.4f}, Recall: {train_rec:.4f}, "
            f"F1-score: {train_f1:.4f}")

    print(f"Validation set - Accuracy: {val_acc:.4f}, Precision: {val_prec:.4f}, Recall: {val_rec:.4f}, "
            f"F1-score: {val_f1:.4f}")
    
    print(f"Test set - Accuracy: {test_acc:.4f}, Precision: {test_prec:.4f}, Recall: {test_rec:.4f}, "
            f"F1-score: {test_f1:.4f}")
    
    return model


def evaluate_model(model, X, y):
    # Predict the target values using the provided model and features
    y_pred = model.predict(X)
    
    # Calculate the accuracy of the model
    accuracy = accuracy_score(y, y_pred)
    # Calculate the precision of the model
    precision = precision_score(y, y_pred)
    # Calculate the recall of the model
    recall = recall_score(y, y_pred)
    # Calculate the F1 score of the model
    f1 = f1_score(y, y_pred)
    
    return accuracy, precision, recall, f1


# Train and evaluate linear logistic regression model
linear_lr_model = train_eval_logistic_regression(X_train, y_train, X_val, y_val, X_test, y_test)

# Train and evaluate polynomial (degree 2) logistic regression model
poly2_lr_model = train_eval_logistic_regression(X_train, y_train, X_val, y_val, X_test, y_test, degree=2)

# Train and evaluate polynomial (degree 3) logistic regression model
poly3_lr_model = train_eval_logistic_regression(X_train, y_train, X_val, y_val, X_test, y_test, degree=3)

Linear Logistic Regression:
Training set - Accuracy: 0.9272, Precision: 0.9236, Recall: 0.8888, F1-score: 0.9059
Validation set - Accuracy: 0.9250, Precision: 0.9322, Recall: 0.8729, F1-score: 0.9016
Test set - Accuracy: 0.9175, Precision: 0.9135, Recall: 0.8733, F1-score: 0.8930

Polynomial Degree 2 Logistic Regression:
Training set - Accuracy: 0.9888, Precision: 0.9944, Recall: 0.9770, F1-score: 0.9856
Validation set - Accuracy: 0.9174, Precision: 0.9062, Recall: 0.8812, F1-score: 0.8936
Test set - Accuracy: 0.9294, Precision: 0.9162, Recall: 0.9036, F1-score: 0.9098

Polynomial Degree 3 Logistic Regression:
Training set - Accuracy: 0.9938, Precision: 0.9991, Recall: 0.9853, F1-score: 0.9921
Validation set - Accuracy: 0.9174, Precision: 0.9086, Recall: 0.8785, F1-score: 0.8933
Test set - Accuracy: 0.9349, Precision: 0.9292, Recall: 0.9036, F1-score: 0.9162


Across the training, validation, and test sets, the linear logistic regression model generalizes consistently. Its accuracy is nearly the same on all three (`92.72%` on training, `92.50%` on validation, `91.75%` on test), and on the test set it reaches a precision of about `91%`, a recall of about `87%`, and an F1-score of about `89%`. In contrast, the polynomial models of degree 2 and 3 reach much higher training accuracies (`98.88%` and `99.38%` respectively) but drop to `91.74%` on the validation set, below the linear model's `92.50%`. This gap between training and validation performance points to overfitting, even though both polynomial models score somewhat higher than the linear model on the test set (`92.94%` and `93.49%`).
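
The overfitting argument can be made concrete by comparing each model's training accuracy with its validation accuracy. The following sketch only re-uses the accuracies printed above; the dictionary layout and variable names are illustrative choices, not part of the notebook.

```python
# Accuracies copied from the scikit-learn output above
results = {
    "linear": {"train": 0.9272, "val": 0.9250},
    "degree 2": {"train": 0.9888, "val": 0.9174},
    "degree 3": {"train": 0.9938, "val": 0.9174},
}

# A large train-validation gap is a common symptom of overfitting
gaps = {name: acc["train"] - acc["val"] for name, acc in results.items()}
for name, gap in gaps.items():
    print(f"{name}: train-validation gap = {gap:.4f}")

# Model selection is done on the validation set, not the test set
best = max(results, key=lambda name: results[name]["val"])
print(f"Highest validation accuracy: {best}")
```

The gap is about 0.2 percentage points for the linear model and more than 7 percentage points for both polynomial models.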

Therefore, despite the polynomial models' high training performance, we select the linear logistic regression model: it has the highest validation accuracy and F1-score and the smallest gap between training and validation results, which suggests that it generalizes well to unseen data. We will now build a custom logistic regression model with a linear decision boundary and compare its results with this scikit-learn model.

The linear/polynomial logistic regression model can be represented as:

$$
\begin{equation}
\hat{\mathbf{y}} = \sigma(\mathbf{X} \cdot \mathbf{w} + b) \tag{1}
\end{equation}
$$

where:
- $\hat{\mathbf{y}}$ represents the predicted probabilities for each instance,
- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function that squashes the output into the range $(0, 1)$,
- $\mathbf{X}$ represents the matrix of input features,
- $\mathbf{w}$ represents the column vector of weights (coefficients),
- $b$ represents the bias term.
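
As a minimal sketch of this forward pass, the snippet below evaluates the model on a small made-up input. The values of `X`, `w`, and `b` are assumptions chosen for illustration, and the 0.5 threshold mirrors the one used later in the custom class.

```python
import numpy as np


def sigmoid(z):
    # Maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))


# Toy inputs (made up): 3 examples, 2 features
X = np.array([[0.5, -1.0], [1.5, 2.0], [-2.0, 0.0]])
w = np.array([[1.0], [-0.5]])  # column vector of weights
b = 0.25                       # bias term

y_hat = sigmoid(X @ w + b)           # predicted probabilities, shape (3, 1)
labels = (y_hat >= 0.5).astype(int)  # class 1 when the probability is at least 0.5

print(y_hat.ravel())
print(labels.ravel())
```

Because $\sigma(0) = 0.5$, thresholding the probability at 0.5 is the same as checking the sign of $\mathbf{X} \cdot \mathbf{w} + b$, which is why the decision boundary is linear in the features.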

To train the logistic regression model, we use the log loss (cross-entropy) as the loss function:

$$
\begin{equation}
\text{Log Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right] \tag{2}
\end{equation}
$$

where:
- $m$ is the number of data points,
- $y_i$ is the actual binary label (0 or 1) for the $i$-th data point,
- $\hat{y}_i$ is the predicted probability that the $i$-th instance belongs to class 1.
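
To get a feel for this loss, the sketch below evaluates it for three hypothetical sets of predictions on the same four labels. The helper name `log_loss` and the clipping constant are assumptions for illustration; the custom class later in this notebook adds a small epsilon inside the logarithm instead of clipping.

```python
import numpy as np


def log_loss(y, y_hat, eps=1e-10):
    # Clipping keeps the logarithm away from log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


y = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
uninformed = np.full(4, 0.5)
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = log_loss(y, confident_right)
loss_uninformed = log_loss(y, uninformed)  # equals ln(2)
loss_wrong = log_loss(y, confident_wrong)

print(loss_right, loss_uninformed, loss_wrong)
```

Predicting 0.5 for every example gives a loss of $\ln 2 \approx 0.6931$. This is the cost at the first iteration of a model whose weights and bias start at zero, and it matches the first line of the training logs in this notebook up to the effect of the epsilon term.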

To update the parameters $\mathbf{w}$ and $b$, we use gradient descent:

$$
\begin{equation}
w_{j, \text{new}} = w_{j, \text{old}} - \alpha \times \frac{\partial \text{Log Loss}}{\partial w_j} \tag{3}
\end{equation}
$$

$$
\begin{equation}
b_{\text{new}} = b_{\text{old}} - \alpha \times \frac{\partial \text{Log Loss}}{\partial b} \tag{4}
\end{equation}
$$

where:
- $j = 1, 2, 3, \ldots, n$
- $w_{j, \text{new}}$ and $w_{j, \text{old}}$ are the updated and current weights for the $j$-th feature, respectively,
- $b_{\text{new}}$ and $b_{\text{old}}$ are the updated and current bias terms, respectively,
- $\alpha$ is the learning rate,
- $\frac{\partial \text{Log Loss}}{\partial w_j}$ is the gradient of the Log Loss function with respect to the $j$-th weight,
- $\frac{\partial \text{Log Loss}}{\partial b}$ is the gradient of the Log Loss function with respect to the bias term.

The gradients with respect to the weights and bias are computed as follows, where $x^j_i$ denotes the value of the $j$-th feature for the $i$-th data point:

$$
\begin{equation}
\frac{\partial \text{Log Loss}}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) x^j_i \tag{5}
\end{equation}
$$

$$
\begin{equation}
\frac{\partial \text{Log Loss}}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \tag{6}
\end{equation}
$$
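
One way to gain confidence in equations (5) and (6) is to compare them with numerical derivatives of the log loss. The sketch below does this on random toy data using central finite differences; the data, the step size `h`, and the function names are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cost(X, y, w, b):
    # Log loss, equation (2)
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def gradients(X, y, w, b):
    # Analytic gradients, equations (5) and (6)
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y
    return X.T @ error / m, np.sum(error) / m


# Random toy problem: 20 examples, 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random((20, 1)) < 0.5).astype(float)
w = rng.normal(size=(3, 1))
b = 0.1

dw, db = gradients(X, y, w, b)

# Central finite differences, one parameter at a time
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.shape[0]):
    step = np.zeros_like(w)
    step[j] = h
    dw_num[j] = (cost(X, y, w + step, b) - cost(X, y, w - step, b)) / (2 * h)
db_num = (cost(X, y, w, b + h) - cost(X, y, w, b - h)) / (2 * h)

print(np.max(np.abs(dw - dw_num)), abs(db - db_num))
```

Both printed differences should be tiny compared with the gradients themselves; a large discrepancy would point to a mistake in the analytic formulas or in their implementation.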

In [60]:
class CustomLogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    
    def sigmoid(self, z):
        # Calculate the sigmoid of z
        return 1 / (1 + np.exp(-z))


    def initialize_parameters(self, n_features):
        # Initialize weights as a zero vector of shape (n_features, 1)
        self.weights = np.zeros((n_features, 1))
        
        # Initialize bias as zero
        self.bias = 0.0


    def compute_cost(self, y_hat, y):
        # Get the number of samples
        m = y.shape[0]

        # Small epsilon value to prevent log(0)
        epsilon = 1e-10
        
        # Compute the cost using logistic regression loss function
        cost = - (1 / m) * np.sum(y * np.log(y_hat + epsilon) + (1 - y) * np.log(1 - y_hat + epsilon))
        
        return cost 
   

    def fit(self, X, y):
        m, n = X.shape
        self.initialize_parameters(n)

        for i in range(self.num_iterations):
            # Forward propagation
            z = np.dot(X, self.weights) + self.bias
            y_hat = self.sigmoid(z)

            # Compute cost
            cost = self.compute_cost(y_hat, y)
            print(f"Iteration {i+1}/{self.num_iterations}: Cost {cost}")

            # Save cost at every 100 iterations
            if (i + 1) % 100 == 0:
                self.cost_history.append((i+1, cost))

            # Compute gradients
            dw = (1 / m) * np.dot(X.T, (y_hat - y))
            db = (1 / m) * np.sum(y_hat - y)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    
    def predict(self, X):
        # Compute the linear combination of input features and weights plus bias
        # Note: X can be polynomial features
        z = np.dot(X, self.weights) + self.bias
        
        # Apply sigmoid function to the linear combination
        y_hat = self.sigmoid(z)
        
        # Convert probabilities to binary predictions (0 or 1)
        return (y_hat >= 0.5).astype(int)
    

    def evaluate_model(self, X, y):
        # Predict the target values using the provided features
        y_pred = self.predict(X)
        
        # Calculate the accuracy of the model
        accuracy = accuracy_score(y, y_pred)
        # Calculate the precision of the model
        precision = precision_score(y, y_pred)
        # Calculate the recall of the model
        recall = recall_score(y, y_pred)
        # Calculate the F1 score of the model
        f1 = f1_score(y, y_pred)
    
        return accuracy, precision, recall, f1
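
A note on the `sigmoid` method above: for strongly negative `z`, `np.exp(-z)` can overflow, in which case NumPy emits a `RuntimeWarning` although the result still rounds to 0. One common alternative, shown here as a sketch and not as part of the class above, only ever exponentiates non-positive numbers. The function name `stable_sigmoid` is hypothetical.

```python
import numpy as np


def stable_sigmoid(z):
    # Expects a NumPy array; exp() is only evaluated on non-positive arguments
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0: 1 / (1 + exp(-z)), where exp(-z) <= 1
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0: exp(z) / (1 + exp(z)), where exp(z) < 1
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


z = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
s = stable_sigmoid(z)
print(s)
```

Both branches are algebraically the same function, so predictions are unchanged; only the overflow in the intermediate exponential is avoided.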

In [61]:
# Standardize the dataset
scaler, X_train_scaled, X_val_scaled, X_test_scaled = standardize_dataset(X_train, X_val, X_test)

In [65]:
# Train custom linear logistic regression model
custom_linear_lr_model = CustomLogisticRegression(learning_rate=11)
custom_linear_lr_model.fit(X_train_scaled, y_train.reshape(-1, 1))

Iteration 1/1000: Cost 0.6931471803599453
Iteration 2/1000: Cost 0.6189889650850795
Iteration 3/1000: Cost 0.44503665027582595
Iteration 4/1000: Cost 0.33920806550052
Iteration 5/1000: Cost 0.29310118904472166
Iteration 6/1000: Cost 0.2643758804588905
Iteration 7/1000: Cost 0.24608777742475238
Iteration 8/1000: Cost 0.23510829175623377
Iteration 9/1000: Cost 0.22909314559945454
Iteration 10/1000: Cost 0.22555597704828304
Iteration 11/1000: Cost 0.22297903867853738
Iteration 12/1000: Cost 0.22096152046012837
Iteration 13/1000: Cost 0.2193368540922247
Iteration 14/1000: Cost 0.21800618490945398
Iteration 15/1000: Cost 0.21690311213521402
Iteration 16/1000: Cost 0.21597880066026243
Iteration 17/1000: Cost 0.21519550795148176
Iteration 18/1000: Cost 0.2145231876809364
Iteration 19/1000: Cost 0.21393809006566336
Iteration 20/1000: Cost 0.21342171210625288
Iteration 21/1000: Cost 0.21296007259658178
Iteration 22/1000: Cost 0.2125427707363651
Iteration 23/1000: Cost 0.2121621712960062
...

In [66]:
# Convert cost history into numpy arrays
cost_hist = np.array(custom_linear_lr_model.cost_history)

# Plotly Express line chart
fig = px.line(
    x=cost_hist[:, 0],
    y=cost_hist[:, 1],
    title="Iteration vs Cost",
    labels={"x": "Iteration", "y": "Cost"}
)

# Show the plot
fig.show()

# Saved as plot_1.png in the current directory/folder

The above plot shows the decrease in cost as the number of iterations increases, indicating the proper functioning of gradient descent in minimizing cost and leading to optimal model parameters.

In [67]:
# Evaluate the custom linear logistic regression model
train_acc, train_prec, train_rec, train_f1 = custom_linear_lr_model.evaluate_model(X_train_scaled, 
                                                                                   y_train.reshape(-1, 1))
val_acc, val_prec, val_rec, val_f1 = custom_linear_lr_model.evaluate_model(X_val_scaled, y_val.reshape(-1, 1))
test_acc, test_prec, test_rec, test_f1 = custom_linear_lr_model.evaluate_model(X_test_scaled, y_test.reshape(-1, 1))

print("Custom Linear Logistic Regression:\n")

print(f"Training set - Accuracy: {train_acc:.4f}, Precision: {train_prec:.4f}, Recall: {train_rec:.4f}, "
        f"F1-score: {train_f1:.4f}")

print(f"Validation set - Accuracy: {val_acc:.4f}, Precision: {val_prec:.4f}, Recall: {val_rec:.4f}, "
        f"F1-score: {val_f1:.4f}")

print(f"Test set - Accuracy: {test_acc:.4f}, Precision: {test_prec:.4f}, Recall: {test_rec:.4f}, "
        f"F1-score: {test_f1:.4f}")

Custom Linear Logistic Regression:

Training set - Accuracy: 0.9308, Precision: 0.9243, Recall: 0.8980, F1-score: 0.9110
Validation set - Accuracy: 0.9239, Precision: 0.9269, Recall: 0.8757, F1-score: 0.9006
Test set - Accuracy: 0.9218, Precision: 0.9145, Recall: 0.8843, F1-score: 0.8992


Scikit-Learn Linear Logistic Regression:<br>

Training set - Accuracy: 0.9272, Precision: 0.9236, Recall: 0.8888, F1-score: 0.9059<br>
Validation set - Accuracy: 0.9250, Precision: 0.9322, Recall: 0.8729, F1-score: 0.9016<br>
Test set - Accuracy: 0.9175, Precision: 0.9135, Recall: 0.8733, F1-score: 0.8930

The custom linear logistic regression model slightly outperforms the Scikit-Learn model on the training and test sets. On the training set, the custom model achieves an accuracy of **0.9308**, precision of **0.9243**, recall of **0.8980**, and F1-score of **0.9110**, each marginally higher than the Scikit-Learn model's **0.9272**, **0.9236**, **0.8888**, and **0.9059**. On the test set, the custom model keeps its lead with an accuracy of **0.9218**, precision of **0.9145**, recall of **0.8843**, and F1-score of **0.8992**, compared to **0.9175**, **0.9135**, **0.8733**, and **0.8930** for the Scikit-Learn model.

On the validation set, the Scikit-Learn model is slightly better in accuracy (**0.9250** vs. **0.9239**), precision (**0.9322** vs. **0.9269**), and F1-score (**0.9016** vs. **0.9006**), while the custom model has slightly higher recall (**0.8757** vs. **0.8729**). All of these differences are small, at most about one percentage point, so the two models perform comparably.

With the goal of building a binary classification model with a linear decision boundary achieved, our next step is to develop a binary classification model with a non-linear/polynomial decision boundary.

## 2. Binary Classification Model with Non-Linear Decision Boundary

##### This project also uses data from the [`UCI Machine Learning Repository`](https://archive.ics.uci.edu/dataset/563/iranian+churn+dataset). The dataset is licensed under a [`Creative Commons Attribution 4.0 International (CC BY 4.0) license`](https://creativecommons.org/licenses/by/4.0/legalcode). The dataset was converted into NumPy arrays to build the machine learning models.

In [68]:
data_ = pd.read_csv("customer_churn/Customer Churn.csv")
data_

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value,Churn
0,8,0,38,0,4370,71,5,17,3,1,1,30,197.640,0
1,0,0,39,0,318,5,7,4,2,1,2,25,46.035,0
2,10,0,37,0,2453,60,359,24,3,1,1,30,1536.520,0
3,10,0,38,0,4198,66,1,35,1,1,1,15,240.020,0
4,3,0,38,0,2393,58,2,33,1,1,1,15,145.805,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3145,21,0,19,2,6697,147,92,44,2,2,1,25,721.980,0
3146,17,0,17,1,9237,177,80,42,5,1,1,55,261.210,0
3147,13,0,18,4,3157,51,38,21,3,1,1,30,280.320,0
3148,7,0,11,2,4695,46,222,12,3,1,1,30,1077.640,0


In [69]:
# Check for any missing values in the dataset
print(f"\nMissing values in dataset: \n{data_.isna().any().values}")


Missing values in dataset: 
[False False False False False False False False False False False False
 False False]


We have a clean dataset with **3,150** examples, **13** feature variables, and **1** binary target variable. Our goal is to build a binary classification model with a non-linear decision boundary. To check whether a non-linear boundary is warranted for this dataset, we follow the same approach as before: train and evaluate linear and polynomial scikit-learn models first, then build a custom model based on the best one.

In [70]:
# Convert the data_ into numpy arrays of features (X_) and target (y_)
X_ = data_.iloc[:, :-1].values
y_ = data_.iloc[:, -1].values

# Splitting the dataset into training, validation, and test sets
X_train_, y_train_, X_val_, y_val_, X_test_, y_test_ = split_dataset(X_, y_)

# Printing the shapes of the resulting datasets
print(f"Training set: {X_train_.shape}, {y_train_.shape}")
print(f"Validation set: {X_val_.shape}, {y_val_.shape}")
print(f"Test set: {X_test_.shape}, {y_test_.shape}")

Training set: (1890, 13), (1890,)
Validation set: (630, 13), (630,)
Test set: (630, 13), (630,)


In [71]:
# Train and evaluate linear logistic regression model
linear_lr_model_ = train_eval_logistic_regression(X_train_, y_train_, X_val_, y_val_, X_test_, y_test_)

# Train and evaluate polynomial (degree 2) logistic regression model
poly2_lr_model_ = train_eval_logistic_regression(X_train_, y_train_, X_val_, y_val_, X_test_, y_test_, degree=2)

# Train and evaluate polynomial (degree 3) logistic regression model
poly3_lr_model_ = train_eval_logistic_regression(X_train_, y_train_, X_val_, y_val_, X_test_, y_test_, degree=3)

# Train and evaluate polynomial (degree 4) logistic regression model
poly4_lr_model_ = train_eval_logistic_regression(X_train_, y_train_, X_val_, y_val_, X_test_, y_test_, degree=4)

# Train and evaluate polynomial (degree 5) logistic regression model
poly5_lr_model_ = train_eval_logistic_regression(X_train_, y_train_, X_val_, y_val_, X_test_, y_test_, degree=5)

Linear Logistic Regression:
Training set - Accuracy: 0.8947, Precision: 0.7882, Recall: 0.4512, F1-score: 0.5739
Validation set - Accuracy: 0.8984, Precision: 0.7869, Recall: 0.4848, F1-score: 0.6000
Test set - Accuracy: 0.8889, Precision: 0.7959, Recall: 0.3939, F1-score: 0.5270

Polynomial Degree 2 Logistic Regression:
Training set - Accuracy: 0.9328, Precision: 0.9087, Recall: 0.6364, F1-score: 0.7485
Validation set - Accuracy: 0.9317, Precision: 0.8415, Recall: 0.6970, F1-score: 0.7624
Test set - Accuracy: 0.9317, Precision: 0.9000, Recall: 0.6364, F1-score: 0.7456

Polynomial Degree 3 Logistic Regression:
Training set - Accuracy: 0.9519, Precision: 0.9402, Recall: 0.7407, F1-score: 0.8286
Validation set - Accuracy: 0.9429, Precision: 0.8621, Recall: 0.7576, F1-score: 0.8065
Test set - Accuracy: 0.9444, Precision: 0.9103, Recall: 0.7172, F1-score: 0.8023

Polynomial Degree 4 Logistic Regression:
Training set - Accuracy: 0.9598, Precision: 0.9108, Recall: 0.8249, F1-score: 0.8657
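Validation set - Accuracy: 0.9524, Precision: 0.8416, Recall: 0.8586, F1-score: 0.8500
Test set - Accuracy: 0.9476, Precision: 0.8587, Recall: 0.7980, F1-score: 0.8272
...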

The **Linear Logistic Regression** model reaches about 89 to 90% accuracy but struggles with **recall** (39 to 48%) and **F1-score** (53 to 60%), indicating difficulty in identifying positive instances. **Polynomial Degree 2 Logistic Regression** improves significantly, particularly in **recall** and **F1-score**, but still lags behind the higher-degree models. **Polynomial Degree 3 Logistic Regression** performs better again but does not match the balance seen with **Degree 4**.

**Polynomial Degree 4 Logistic Regression** achieves the best overall balance, with high **accuracy** and **F1-score** across all datasets, making it the most reliable. **Polynomial Degree 5**, although strong, shows signs of overfitting, with a drop in **recall** on the test set.

Thus, **Polynomial Degree 4** is selected as the optimal model, balancing performance and generalization. We will now build the custom **Polynomial Degree 4** logistic regression model and compare it with the scikit-learn model.
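
Before building the custom model, it helps to know how wide a degree-4 expansion of 13 features is. The sketch below counts the monomials of total degree 1 to 4, which is what a polynomial expansion without the constant column produces. The function name `n_polynomial_terms` is an illustrative assumption, and the closed form is cross-checked by brute-force enumeration on a small case.

```python
from itertools import combinations_with_replacement
from math import comb


def n_polynomial_terms(n_features, degree):
    # Monomials of total degree 1..degree (the constant term is excluded)
    return comb(n_features + degree, degree) - 1


# Cross-check the closed form by enumerating monomials for a small case
n, d = 3, 2
enumerated = sum(
    1
    for k in range(1, d + 1)
    for _ in combinations_with_replacement(range(n), k)
)
print(enumerated == n_polynomial_terms(n, d))  # True

n_columns = n_polynomial_terms(13, 4)
print(n_columns)  # 2379
```

With 2,379 columns and only 1,890 training rows, the expanded problem has more parameters than training examples, which is one reason to watch the validation metrics closely.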

In [72]:
# Apply the polynomial features to the dataset
poly = PolynomialFeatures(degree=4, include_bias=False)
X_train_poly = poly.fit_transform(X_train_)
X_val_poly = poly.transform(X_val_)
X_test_poly = poly.transform(X_test_)

In [73]:
# Standardize the dataset
scaler_, X_train_poly_scaled, X_val_poly_scaled, X_test_poly_scaled = standardize_dataset(X_train_poly, 
                                                                                          X_val_poly, X_test_poly)

In [94]:
# Train custom polynomial (degree 4) logistic regression model
custom_poly4_lr_model = CustomLogisticRegression(learning_rate=0.11, num_iterations=3000)
custom_poly4_lr_model.fit(X_train_poly_scaled, y_train_.reshape(-1, 1))

Iteration 1/3000: Cost 0.6931471803599453
Iteration 2/3000: Cost 1.6974590191289811
Iteration 3/3000: Cost 4.162578284708797
Iteration 4/3000: Cost 5.5565596652504565
Iteration 5/3000: Cost 0.6062594822897792
Iteration 6/3000: Cost 0.5992946752427507


Iteration 7/3000: Cost 2.6226310617286286
Iteration 8/3000: Cost 3.367974864314566
Iteration 9/3000: Cost 5.546113400142651
Iteration 10/3000: Cost 0.6220435200204215
Iteration 11/3000: Cost 0.5186499604885425
Iteration 12/3000: Cost 1.7691601101816339
Iteration 13/3000: Cost 3.635933434133673
Iteration 14/3000: Cost 4.994090277309957
Iteration 15/3000: Cost 0.5133470113996937
Iteration 16/3000: Cost 0.3924568998070142
Iteration 17/3000: Cost 0.3891886319961966
Iteration 18/3000: Cost 0.6348770576262728
Iteration 19/3000: Cost 2.4899280869082756
Iteration 20/3000: Cost 5.747895826779498
Iteration 21/3000: Cost 0.4948808791802765
Iteration 22/3000: Cost 0.6242458761116836
Iteration 23/3000: Cost 2.696723896894768
Iteration 24/3000: Cost 2.123306381755294
Iteration 25/3000: Cost 5.077490342073456
Iteration 26/3000: Cost 0.4704796486263446
Iteration 27/3000: Cost 0.39822346934843755
Iteration 28/3000: Cost 0.3899230823675609
Iteration 29/3000: Cost 0.4424126852445383
...

In [95]:
# Convert cost history into numpy arrays
cost_hist_ = np.array(custom_poly4_lr_model.cost_history)

# Plotly Express line chart
fig = px.line(
    x=cost_hist_[:, 0],
    y=cost_hist_[:, 1],
    title="Iteration vs Cost",
    labels={"x": "Iteration", "y": "Cost"}
)

# Show the plot
fig.show()

# Saved as plot_2.png in the current directory/folder

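The above plot, which samples the cost every 100 iterations, shows the cost decreasing as the number of iterations increases, indicating that gradient descent is reducing the cost overall. The per-iteration log, however, oscillates sharply in the early iterations (for example, the cost rises from 0.69 to 5.56 within the first four iterations before falling back to 0.61), which suggests that a learning rate of 0.11 is on the large side for these polynomial features.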
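
Because the cost history is only stored every 100 iterations, short-lived spikes do not appear in the plot. A small helper such as the one sketched below can flag them directly from a list of per-iteration costs. The function name `largest_increase` is hypothetical, and the two lists are the first six costs copied from the training logs of the linear and degree-4 custom models.

```python
def largest_increase(costs):
    # Returns (iteration, increase) for the largest rise between two
    # consecutive costs, or None if the sequence never increases.
    # Iterations are numbered from 1, as in the training log.
    worst = None
    for i in range(1, len(costs)):
        delta = costs[i] - costs[i - 1]
        if delta > 0 and (worst is None or delta > worst[1]):
            worst = (i + 1, delta)
    return worst


# First six costs from each training log
linear_costs = [0.6931471803599453, 0.6189889650850795, 0.44503665027582595,
                0.33920806550052, 0.29310118904472166, 0.2643758804588905]
poly4_costs = [0.6931471803599453, 1.6974590191289811, 4.162578284708797,
               5.5565596652504565, 0.6062594822897792, 0.5992946752427507]

print(largest_increase(linear_costs))  # None: the cost falls at every step
print(largest_increase(poly4_costs))   # largest jump occurs at iteration 3
```

For the degree-4 run the largest jump is about 2.47, between iterations 2 and 3, whereas the linear run decreases at every logged step.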

In [96]:
# Evaluate the custom polynomial (degree 4) logistic regression model
train_acc_, train_prec_, train_rec_, train_f1_ = custom_poly4_lr_model.evaluate_model(X_train_poly_scaled, 
                                                                                      y_train_.reshape(-1, 1))

val_acc_, val_prec_, val_rec_, val_f1_ = custom_poly4_lr_model.evaluate_model(X_val_poly_scaled, 
                                                                              y_val_.reshape(-1, 1))

test_acc_, test_prec_, test_rec_, test_f1_ = custom_poly4_lr_model.evaluate_model(X_test_poly_scaled, 
                                                                                  y_test_.reshape(-1, 1))

print("Custom Polynomial (degree 4) Logistic Regression:\n")

print(f"Training set - Accuracy: {train_acc_:.4f}, Precision: {train_prec_:.4f}, Recall: {train_rec_:.4f}, "
        f"F1-score: {train_f1_:.4f}")

print(f"Validation set - Accuracy: {val_acc_:.4f}, Precision: {val_prec_:.4f}, Recall: {val_rec_:.4f}, "
        f"F1-score: {val_f1_:.4f}")

print(f"Test set - Accuracy: {test_acc_:.4f}, Precision: {test_prec_:.4f}, Recall: {test_rec_:.4f}, "
        f"F1-score: {test_f1_:.4f}")

Custom Polynomial (degree 4) Logistic Regression:

Training set - Accuracy: 0.9571, Precision: 0.9219, Recall: 0.7946, F1-score: 0.8535
Validation set - Accuracy: 0.9444, Precision: 0.8404, Recall: 0.7980, F1-score: 0.8187
Test set - Accuracy: 0.9492, Precision: 0.8941, Recall: 0.7677, F1-score: 0.8261


Scikit-Learn Polynomial (degree 4) Logistic Regression:<br>

Training set - Accuracy: 0.9598, Precision: 0.9108, Recall: 0.8249, F1-score: 0.8657<br>
Validation set - Accuracy: 0.9524, Precision: 0.8416, Recall: 0.8586, F1-score: 0.8500<br>
Test set - Accuracy: 0.9476, Precision: 0.8587, Recall: 0.7980, F1-score: 0.8272


Both the custom polynomial logistic regression model and Scikit-Learn's implementation perform strongly with a degree of **4**. The custom model achieves high accuracy across the training (**95.71%**), validation (**94.44%**), and test (**94.92%**) sets, with notable precision (**92.19%**, **84.04%**, **89.41%**), recall (**79.46%**, **79.80%**, **76.77%**), and F1-scores (**85.35%**, **81.87%**, **82.61%**) respectively.

In comparison, Scikit-Learn's model demonstrates similar performance. It achieves training set metrics with an accuracy of **95.98%**, precision of **91.08%**, recall of **82.49%**, and F1-score of **86.57%**. On the validation set, it reports an accuracy of **95.24%**, precision of **84.16%**, recall of **85.86%**, and F1-score of **85.00%**. The test set yields **94.76%** accuracy, **85.87%** precision, **79.80%** recall, and an F1-score of **82.72%**.

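Both models demonstrate effective polynomial logistic regression. The custom model is competitive: on the test set it reaches slightly higher accuracy (**94.92%** vs. **94.76%**) and a nearly identical F1-score (**82.61%** vs. **82.72%**), while Scikit-Learn's model does better on the validation set (F1-score of **85.00%** vs. **81.87%**). The goal of building a binary classification model with a non-linear/polynomial decision boundary has been achieved.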