# Binary Classification Models

## Objectives

1. **Build a Binary Classification Model with Linear Decision Boundary:**
   - Import necessary libraries.
   - Import the dataset from the UCI Machine Learning Repository, view it, and check its shape and for missing values.
   - Define functions for splitting the dataset and standardizing it using Scikit-Learn.
   - Split and standardize the dataset.
   - Train linear and non-linear/polynomial logistic regression models using Scikit-Learn.
   - Evaluate each model on training, cross-validation, and testing datasets.
   - Determine the optimal model, which was linear logistic regression.
   - Build a custom model based on the resulting optimal model.
   - Train and evaluate the custom model and compare results with the Scikit-Learn model.

2. **Build a Binary Classification Model with Non-Linear Decision Boundary:**
   - Get the dataset from the same source as before, view it, and check its shape and for missing values.
   - Split and standardize the dataset.
   - Train and evaluate linear and non-linear/polynomial logistic regression models using Scikit-Learn.
   - Determine the optimal model, which was a polynomial degree **3** logistic regression model.
   - Based on the resulting optimal model, build a custom model.
   - Train and evaluate the custom model and compare results with the Scikit-Learn model.


## 1. Binary Classification Model with Linear Decision Boundary

In [1]:
# Import necessary libraries
import numpy as np
import plotly.express as px

##### This project uses data from the [`UCI Machine Learning Repository`](https://archive.ics.uci.edu/dataset/94/spambase). The dataset is licensed under a [`Creative Commons Attribution 4.0 International (CC BY 4.0) license`](https://creativecommons.org/licenses/by/4.0/legalcode). The features (X) and target (y) variables were converted into NumPy arrays for building a machine learning model.

In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
spambase = fetch_ucirepo(id=94)

# data (as pandas dataframes)
X = spambase.data.features
y = spambase.data.targets

In [4]:
# View first 10 rows of features data
X.head(10)

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191
5,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54
6,0.0,0.0,0.0,0.0,1.92,0.0,0.0,0.0,0.0,0.64,...,0.0,0.0,0.054,0.0,0.164,0.054,0.0,1.671,4,112
7,0.0,0.0,0.0,0.0,1.88,0.0,0.0,1.88,0.0,0.0,...,0.0,0.0,0.206,0.0,0.0,0.0,0.0,2.45,11,49
8,0.15,0.0,0.46,0.0,0.61,0.0,0.3,0.0,0.92,0.76,...,0.0,0.0,0.271,0.0,0.181,0.203,0.022,9.744,445,1257
9,0.06,0.12,0.77,0.0,0.19,0.32,0.38,0.0,0.06,0.0,...,0.0,0.04,0.03,0.0,0.244,0.081,0.0,1.729,43,749


In [5]:
# View target data
y

Unnamed: 0,Class
0,1
1,1
2,1
3,1
4,1
...,...
4596,0
4597,0
4598,0
4599,0


In [6]:
# Check the shape of X and y
print(f"Shape of X = {X.shape}")
print(f"Shape of y = {y.shape}")

# Check for any missing values for X and y
print(f"\nMissing values in X: \n{X.isna().any().values}")
print(f"\nMissing values in y: \n{y.isna().any().values}")

Shape of X = (4601, 57)
Shape of y = (4601, 1)

Missing values in X: 
[False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False]

Missing values in y: 
[False]


We have a clean dataset with **4,601** examples, **57** feature variables, and **1** binary target variable, with no missing values. Our goal is to build a binary classification model with a linear decision boundary. To check whether a linear boundary is adequate for this dataset, we will first use scikit-learn to build and evaluate both linear and non-linear/polynomial logistic regression models. After identifying the best-performing scikit-learn model, we will develop a custom model of the same type and compare its results with the scikit-learn model. This approach saves time by determining the appropriate model type before we build the custom model.

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def split_dataset(X, y):
    # Splitting the data into training (60%), cross-validation (20%), and testing (20%) sets
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

    return X_train, y_train, X_val, y_val, X_test, y_test


def standardize_dataset(X_train, X_val, X_test):
    # Standardizing the datasets
    scaler = StandardScaler()

    # Fitting the scaler on the training data and transforming training, validation, and testing sets
    X_train_scaled = scaler.fit_transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    X_test_scaled = scaler.transform(X_test)

    return scaler, X_train_scaled, X_val_scaled, X_test_scaled

In [8]:
# Convert the features (X) and target (y) variables into numpy arrays
X = X.to_numpy()
y = y["Class"].to_numpy()

# Splitting the dataset into training, validation, and test sets
X_train, y_train, X_val, y_val, X_test, y_test = split_dataset(X, y)

# Standardizing the feature datasets (training, validation, and test sets) to have zero mean and unit variance
scaler, X_train_scaled, X_val_scaled, X_test_scaled = standardize_dataset(X_train, X_val, X_test)

# Printing the shapes of the resulting datasets
print(f"Training set: {X_train_scaled.shape}, {y_train.shape}")
print(f"Validation set: {X_val_scaled.shape}, {y_val.shape}")
print(f"Test set: {X_test_scaled.shape}, {y_test.shape}")

Training set: (2760, 57), (2760,)
Validation set: (920, 57), (920,)
Test set: (921, 57), (921,)


In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def train_eval_logistic_regression(X_train, y_train, X_val, y_val, X_test, y_test, degree=1):
    # Initialize logistic regression model (linear/polynomial)
    model = make_pipeline(
        PolynomialFeatures(degree=degree),
        LogisticRegression(random_state=42, max_iter=1000)
    )
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Evaluate on training set
    train_acc, train_prec, train_rec, train_f1 = evaluate_model(model, X_train, y_train)
    
    # Evaluate on validation set
    val_acc, val_prec, val_rec, val_f1 = evaluate_model(model, X_val, y_val)

    # Evaluate on testing set
    test_acc, test_prec, test_rec, test_f1 = evaluate_model(model, X_test, y_test)

    # Print a header that identifies the model being reported
    if degree == 1:
        print("Linear Logistic Regression:")
    else:
        print(f"\nPolynomial Degree {degree} Logistic Regression:")
    
    print(f"Training set - Accuracy: {train_acc:.4f}, Precision: {train_prec:.4f}, Recall: {train_rec:.4f}, "
            f"F1-score: {train_f1:.4f}")

    print(f"Validation set - Accuracy: {val_acc:.4f}, Precision: {val_prec:.4f}, Recall: {val_rec:.4f}, "
            f"F1-score: {val_f1:.4f}")
    
    print(f"Test set - Accuracy: {test_acc:.4f}, Precision: {test_prec:.4f}, Recall: {test_rec:.4f}, "
            f"F1-score: {test_f1:.4f}")
    
    return model


def evaluate_model(model, X, y):
    # Predict the target values using the provided model and features
    y_pred = model.predict(X)
    
    # Calculate the accuracy of the model
    accuracy = accuracy_score(y, y_pred)
    # Calculate the precision of the model
    precision = precision_score(y, y_pred)
    # Calculate the recall of the model
    recall = recall_score(y, y_pred)
    # Calculate the F1 score of the model
    f1 = f1_score(y, y_pred)
    
    return accuracy, precision, recall, f1


# Train and evaluate linear logistic regression model
linear_lr_model = train_eval_logistic_regression(X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled, y_test)

# Train and evaluate polynomial (degree 2) logistic regression model
poly2_lr_model = train_eval_logistic_regression(X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled, y_test,
                                                degree=2)

# Train and evaluate polynomial (degree 3) logistic regression model
poly3_lr_model = train_eval_logistic_regression(X_train_scaled, y_train, X_val_scaled, y_val, X_test_scaled, y_test,
                                                degree=3)

Linear Logistic Regression:
Training set - Accuracy: 0.9283, Precision: 0.9246, Recall: 0.8906, F1-score: 0.9073
Validation set - Accuracy: 0.9250, Precision: 0.9322, Recall: 0.8729, F1-score: 0.9016
Test set - Accuracy: 0.9175, Precision: 0.9135, Recall: 0.8733, F1-score: 0.8930

Polynomial Degree 2 Logistic Regression:
Training set - Accuracy: 0.9891, Precision: 0.9926, Recall: 0.9798, F1-score: 0.9861
Validation set - Accuracy: 0.9185, Precision: 0.8997, Recall: 0.8923, F1-score: 0.8960
Test set - Accuracy: 0.9153, Precision: 0.9060, Recall: 0.8760, F1-score: 0.8908

Polynomial Degree 3 Logistic Regression:
Training set - Accuracy: 0.9949, Precision: 0.9991, Recall: 0.9881, F1-score: 0.9935
Validation set - Accuracy: 0.9163, Precision: 0.8969, Recall: 0.8895, F1-score: 0.8932
Test set - Accuracy: 0.9175, Precision: 0.8932, Recall: 0.8981, F1-score: 0.8956


Across the training, validation, and test sets, the linear logistic regression model generalizes consistently: its accuracy is `92.83%` on training, `92.50%` on validation, and `91.75%` on test. On the test set its precision is about `91%` and its recall about `87%`, giving an F1-score of approximately `89%`. In contrast, the polynomial models with degrees 2 and 3 reach much higher training accuracies (`98.91%` and `99.49%` respectively) but score no better than the linear model on the validation set (`91.85%` and `91.63%`). The gap of about 7 to 8 percentage points between their training and validation accuracy suggests overfitting to the training data.

Therefore, despite the polynomial models' higher training scores, the linear logistic regression model is the optimal choice: it has the best validation accuracy and F1-score and the smallest gap between training and held-out performance, which indicates that it generalizes well to unseen data. Now, we will build our custom logistic regression model with a linear decision boundary and compare its results with this optimal scikit-learn model.
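
One way to make the overfitting argument concrete is to compute the gap between training and validation accuracy for each model. The sketch below hard-codes the accuracies printed above; the dictionary layout, the variable names, and the selection rule (highest validation accuracy) are illustrative assumptions and not part of the original notebook.

```python
# Accuracies copied from the scikit-learn output above: (training, validation).
accuracies = {
    "linear": (0.9283, 0.9250),
    "poly2": (0.9891, 0.9185),
    "poly3": (0.9949, 0.9163),
}

# Generalization gap: accuracy lost when moving from training to validation data.
gaps = {name: train - val for name, (train, val) in accuracies.items()}

# One possible selection rule (an assumption): highest validation accuracy wins.
best_model = max(accuracies, key=lambda name: accuracies[name][1])

for name, gap in gaps.items():
    print(f"{name}: train-validation gap = {gap:.4f}")
print(f"Best validation accuracy: {best_model}")
```

The linear model has both the smallest gap and the highest validation accuracy, which is consistent with the choice made above.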

The linear/polynomial logistic regression model can be represented as:

$$
\begin{equation}
\hat{\mathbf{y}} = \sigma(\mathbf{X} \cdot \mathbf{w} + b) \tag{1}
\end{equation}
$$

where:
- $\hat{\mathbf{y}}$ represents the predicted probabilities for each instance,
- $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function that maps any real input into the open interval $(0, 1)$,
- $\mathbf{X}$ represents the matrix of input features,
- $\mathbf{w}$ represents the column vector of weights (coefficients),
- $b$ represents the bias term.
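
The model equation above can be sketched in a few lines of NumPy. This version splits the sigmoid into two branches so that `np.exp` is never called on a large positive number; the plain form $1 / (1 + e^{-z})$ returns the same values but can trigger overflow warnings for large negative inputs. The function names and the toy inputs are assumptions made for illustration.

```python
import numpy as np


def stable_sigmoid(z):
    """Sigmoid evaluated without overflowing np.exp for large |z|."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so this form is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) is below 1, so use the algebraically equivalent form.
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def forward(X, w, b):
    """Predicted probabilities y_hat = sigmoid(X . w + b)."""
    return stable_sigmoid(X @ w + b)


# Toy inputs (hypothetical): 3 samples, 2 features, weights as a column vector.
X = np.array([[0.0, 0.0], [1.0, 2.0], [-1000.0, 0.0]])
w = np.array([[1.0], [1.0]])
b = 0.0

y_hat = forward(X, w, b)
print(y_hat.ravel())
```

The first sample has $z = 0$, so its predicted probability is exactly 0.5, which is also the decision threshold used later for converting probabilities into class labels.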

To train the logistic regression model, we use the log loss (cross-entropy) as the loss function:

$$
\begin{equation}
\text{Log Loss} = -\frac{1}{m} \sum_{i=1}^{m} \left[y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)\right] \tag{2}
\end{equation}
$$

where:
- $m$ is the number of data points,
- $y_i$ is the actual binary label (0 or 1) for the $i$-th data point,
- $\hat{y}_i$ is the predicted probability that the $i$-th instance belongs to class 1.
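
As a small worked example of the log loss formula above, the sketch below evaluates it by hand for a few labels and probabilities. The function name and the sample values are made up for illustration, and the probabilities must lie strictly between 0 and 1 because this version omits the small epsilon guard against $\log(0)$.

```python
import math


def log_loss(y_true, y_prob):
    """Average cross-entropy between binary labels and predicted probabilities."""
    m = len(y_true)
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Only one of the two terms is non-zero for a given label.
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / m


# Mostly correct, fairly confident predictions give a low loss.
confident = log_loss([1, 0, 1], [0.9, 0.2, 0.6])

# Predicting 0.5 for everything gives a loss of ln(2), a useful baseline.
uninformed = log_loss([1, 0], [0.5, 0.5])

print(f"confident: {confident:.4f}, uninformed: {uninformed:.4f}")
```

A model with all weights and the bias set to zero predicts 0.5 for every sample, so its log loss is $\ln 2 \approx 0.693$; a cost below that value indicates the model has learned something from the data.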

To update the parameters $\mathbf{w}$ and $b$, we use gradient descent:

$$
\begin{equation}
w_{j, \text{new}} = w_{j, \text{old}} - \alpha \times \frac{\partial \text{Log Loss}}{\partial w_j} \tag{3}
\end{equation}
$$

$$
\begin{equation}
b_{\text{new}} = b_{\text{old}} - \alpha \times \frac{\partial \text{Log Loss}}{\partial b} \tag{4}
\end{equation}
$$

where:
- $j = 1, 2, 3, \ldots, n$ indexes the $n$ features,
- $w_{j, \text{new}}$ and $w_{j, \text{old}}$ are the updated and current weights for the $j$-th feature, respectively,
- $b_{\text{new}}$ and $b_{\text{old}}$ are the updated and current bias terms, respectively,
- $\alpha$ is the learning rate,
- $\frac{\partial \text{Log Loss}}{\partial w_j}$ is the gradient of the Log Loss function with respect to the $j$-th weight,
- $\frac{\partial \text{Log Loss}}{\partial b}$ is the gradient of the Log Loss function with respect to the bias term.

The gradients with respect to the weights and bias are computed as follows, where $x^j_i$ denotes the value of the $j$-th feature for the $i$-th data point:

$$
\begin{equation}
\frac{\partial \text{Log Loss}}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) x^j_i \tag{5}
\end{equation}
$$

$$
\begin{equation}
\frac{\partial \text{Log Loss}}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \tag{6}
\end{equation}
$$
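
A common sanity check for gradient formulas like the two above is to compare them with central finite differences. The following is a minimal sketch on synthetic data; the function names, the random data, and the step size `h` are assumptions for illustration, and the weights are stored as a 1-D vector rather than the column vector used elsewhere in this notebook.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, b, X, y):
    """Mean cross-entropy for weights w and bias b."""
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


def analytic_gradients(w, b, X, y):
    """Gradients of the log loss with respect to the weights and the bias."""
    m = X.shape[0]
    error = sigmoid(X @ w + b) - y
    dw = X.T @ error / m
    db = np.sum(error) / m
    return dw, db


# Synthetic data (hypothetical): 20 samples, 3 features, random binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)
b = 0.1

dw, db = analytic_gradients(w, b, X, y)

# Central finite differences: perturb one parameter at a time by +/- h.
h = 1e-6
dw_num = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = h
    dw_num[j] = (log_loss(w + step, b, X, y) - log_loss(w - step, b, X, y)) / (2 * h)
db_num = (log_loss(w, b + h, X, y) - log_loss(w, b - h, X, y)) / (2 * h)

max_diff = max(np.max(np.abs(dw - dw_num)), abs(db - db_num))
print(f"Largest difference between analytic and numerical gradients: {max_diff:.2e}")
```

If the analytic gradients are implemented correctly, the largest difference should be many orders of magnitude smaller than the gradients themselves.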

In [10]:
class CustomLogisticRegression:
    def __init__(self, learning_rate=0.01, num_iterations=1000):
        self.learning_rate = learning_rate
        self.num_iterations = num_iterations
        self.weights = None
        self.bias = None
        self.cost_history = []

    
    def sigmoid(self, z):
        # Calculate the sigmoid of z
        return 1 / (1 + np.exp(-z))


    def initialize_parameters(self, n_features):
        # Initialize weights as a zero vector of shape (n_features, 1)
        self.weights = np.zeros((n_features, 1))
        
        # Initialize bias as zero
        self.bias = 0    


    def compute_cost(self, y_hat, y):
        # Get the number of samples
        m = y.shape[0]

        # Small epsilon value to prevent log(0)
        epsilon = 1e-10
        
        # Compute the cost using logistic regression loss function
        cost = - (1 / m) * np.sum(y * np.log(y_hat + epsilon) + (1 - y) * np.log(1 - y_hat + epsilon))
        
        return cost 
   

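    def fit(self, X, y):
        # y must be a column vector of shape (m, 1) so it lines up with y_hat
        m, n = X.shape
        self.initialize_parameters(n)

        # Reset the cost history so repeated calls to fit do not accumulate old values
        self.cost_history = []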

        for i in range(self.num_iterations):
            # Forward propagation
            z = np.dot(X, self.weights) + self.bias
            y_hat = self.sigmoid(z)

            # Compute cost
            cost = self.compute_cost(y_hat, y)

            # Print cost every 100 iterations and save cost
            if (i + 1) % 100 == 0:
                print(f"Iteration {i + 1}/{self.num_iterations}: Cost {cost}")
                self.cost_history.append((i+1, cost))

            # Compute gradients
            dw = (1 / m) * np.dot(X.T, (y_hat - y))
            db = (1 / m) * np.sum(y_hat - y)

            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db

    
    def predict(self, X):
        # Compute the linear combination of input features and weights plus bias
        # Note: X can be polynomial features
        z = np.dot(X, self.weights) + self.bias
        
        # Apply sigmoid function to the linear combination
        y_hat = self.sigmoid(z)
        
        # Convert probabilities to binary predictions (0 or 1)
        return (y_hat >= 0.5).astype(int)
    

    def evaluate_model(self, X, y):
        # Predict the target values using the provided features
        y_pred = self.predict(X)
        
        # Calculate the accuracy of the model
        accuracy = accuracy_score(y, y_pred)
        # Calculate the precision of the model
        precision = precision_score(y, y_pred)
        # Calculate the recall of the model
        recall = recall_score(y, y_pred)
        # Calculate the F1 score of the model
        f1 = f1_score(y, y_pred)
    
        return accuracy, precision, recall, f1

In [11]:
# Train custom linear logistic regression model
custom_linear_lr_model = CustomLogisticRegression(learning_rate=11.3)
custom_linear_lr_model.fit(X_train_scaled, y_train.reshape(-1, 1))

Iteration 100/1000: Cost 0.2045616742480486
Iteration 200/1000: Cost 0.20230641267129337
Iteration 300/1000: Cost 0.20115133389425785
Iteration 400/1000: Cost 0.20041742052833028
Iteration 500/1000: Cost 0.19989627399501622
Iteration 600/1000: Cost 0.19949915791877024
Iteration 700/1000: Cost 0.1991816049423096
Iteration 800/1000: Cost 0.19891876204302394
Iteration 900/1000: Cost 0.19869555844276626
Iteration 1000/1000: Cost 0.19850225483520703


In [12]:
# Convert cost history into numpy arrays
cost_hist = np.array(custom_linear_lr_model.cost_history)

# Plotly Express line chart
fig = px.line(
    x=cost_hist[:, 0],
    y=cost_hist[:, 1],
    title="Iteration vs Cost",
    labels={"x": "Iteration", "y": "Cost"}
)

# Show the plot
fig.show()

# Saved as plot_1.png in the current directory/folder

The above plot shows the cost decreasing steadily as the number of iterations increases, which indicates that gradient descent is working as intended and moving the parameters toward a minimum of the cost function.
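
The same conclusion can be checked numerically from the costs logged during training. The sketch below uses the ten values printed above; the tolerance of 0.1% per 100 iterations is a hypothetical threshold chosen for illustration, not a rule used in the original notebook.

```python
# Costs logged every 100 iterations, copied from the training output above.
logged_costs = [
    0.2045616742480486,
    0.20230641267129337,
    0.20115133389425785,
    0.20041742052833028,
    0.19989627399501622,
    0.19949915791877024,
    0.1991816049423096,
    0.19891876204302394,
    0.19869555844276626,
    0.19850225483520703,
]

# Gradient descent with a suitable learning rate should lower the cost at every checkpoint.
strictly_decreasing = all(
    later < earlier for earlier, later in zip(logged_costs, logged_costs[1:])
)

# Relative improvement over the final 100 iterations.
last_relative_drop = (logged_costs[-2] - logged_costs[-1]) / logged_costs[-2]

# Hypothetical tolerance: below 0.1% improvement per 100 iterations counts as flattened out.
tolerance = 1e-3
has_flattened = last_relative_drop < tolerance

print(f"Strictly decreasing: {strictly_decreasing}")
print(f"Relative drop over the last 100 iterations: {last_relative_drop:.6f}")
print(f"Flattened out under tolerance {tolerance}: {has_flattened}")
```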

In [13]:
# Evaluate the custom linear logistic regression model
train_acc, train_prec, train_rec, train_f1 = custom_linear_lr_model.evaluate_model(X_train_scaled, y_train.reshape(-1, 1))
val_acc, val_prec, val_rec, val_f1 = custom_linear_lr_model.evaluate_model(X_val_scaled, y_val.reshape(-1, 1))
test_acc, test_prec, test_rec, test_f1 = custom_linear_lr_model.evaluate_model(X_test_scaled, y_test.reshape(-1, 1))

print("Custom Linear Logistic Regression:")

print(f"Training set - Accuracy: {train_acc:.4f}, Precision: {train_prec:.4f}, Recall: {train_rec:.4f}, "
        f"F1-score: {train_f1:.4f}")

print(f"Validation set - Accuracy: {val_acc:.4f}, Precision: {val_prec:.4f}, Recall: {val_rec:.4f}, "
        f"F1-score: {val_f1:.4f}")

print(f"Test set - Accuracy: {test_acc:.4f}, Precision: {test_prec:.4f}, Recall: {test_rec:.4f}, "
        f"F1-score: {test_f1:.4f}")

Custom Linear Logistic Regression:
Training set - Accuracy: 0.9308, Precision: 0.9243, Recall: 0.8980, F1-score: 0.9110
Validation set - Accuracy: 0.9239, Precision: 0.9269, Recall: 0.8757, F1-score: 0.9006
Test set - Accuracy: 0.9218, Precision: 0.9145, Recall: 0.8843, F1-score: 0.8992


Scikit-Learn Linear Logistic Regression:<br>
Training set - Accuracy: 0.9283, Precision: 0.9246, Recall: 0.8906, F1-score: 0.9073<br>
Validation set - Accuracy: 0.9250, Precision: 0.9322, Recall: 0.8729, F1-score: 0.9016<br>
Test set - Accuracy: 0.9175, Precision: 0.9135, Recall: 0.8733, F1-score: 0.8930

The custom linear logistic regression model performs on par with the Scikit-Learn model, with differences of roughly one percentage point or less on every metric. On the training set, the custom model achieves an accuracy of **0.9308**, precision of **0.9243**, recall of **0.8980**, and F1-score of **0.9110**; only its precision is marginally lower than the Scikit-Learn model's **0.9246**. On the test set, the custom model is slightly ahead on all four metrics, with an accuracy of **0.9218**, precision of **0.9145**, recall of **0.8843**, and F1-score of **0.8992**, compared to **0.9175**, **0.9135**, **0.8733**, and **0.8930** for the Scikit-Learn model. On the validation set, the Scikit-Learn model is slightly ahead with an accuracy of **0.9250**, precision of **0.9322**, and F1-score of **0.9016**, against **0.9239**, **0.9269**, and **0.9006** for the custom model, while the custom model has slightly better recall (**0.8757** versus **0.8729**). Overall, the custom model matches the Scikit-Learn model closely, including on unseen data (test set). With the goal of building a binary classification model with a linear decision boundary achieved, our next step is to develop a binary classification model with a non-linear/polynomial decision boundary.

## 2. Binary Classification Model with Non-Linear Decision Boundary

##### This project also uses data from the [`UCI Machine Learning Repository`](https://archive.ics.uci.edu/dataset/563/iranian+churn+dataset). The dataset is licensed under a [`Creative Commons Attribution 4.0 International (CC BY 4.0) license`](https://creativecommons.org/licenses/by/4.0/legalcode). The features (X_) and target (y_) variables were converted into NumPy arrays for building a machine learning model.

In [14]:
# fetch dataset
iranian_churn = fetch_ucirepo(id=563)

# data (as pandas dataframes)
X_ = iranian_churn.data.features
y_ = iranian_churn.data.targets

In [15]:
# View first 10 rows of features data
X_.head(10)

Unnamed: 0,Call Failure,Complains,Subscription Length,Charge Amount,Seconds of Use,Frequency of use,Frequency of SMS,Distinct Called Numbers,Age Group,Tariff Plan,Status,Age,Customer Value
0,8,0,38,0,4370,71,5,17,3,1,1,30,197.64
1,0,0,39,0,318,5,7,4,2,1,2,25,46.035
2,10,0,37,0,2453,60,359,24,3,1,1,30,1536.52
3,10,0,38,0,4198,66,1,35,1,1,1,15,240.02
4,3,0,38,0,2393,58,2,33,1,1,1,15,145.805
5,11,0,38,1,3775,82,32,28,3,1,1,30,282.28
6,4,0,38,0,2360,39,285,18,3,1,1,30,1235.96
7,13,0,37,2,9115,121,144,43,3,1,1,30,945.44
8,7,0,38,0,13773,169,0,44,3,1,1,30,557.68
9,7,0,38,1,4515,83,2,25,3,1,1,30,191.92


In [16]:
# View target data
y_

Unnamed: 0,Churn
0,0
1,0
2,0
3,0
4,0
...,...
3145,0
3146,0
3147,0
3148,0


In [17]:
# Check the shape of X_ and y_
print(f"Shape of X_ = {X_.shape}")
print(f"Shape of y_ = {y_.shape}")

# Check for any missing values for X_ and y_
print(f"\nMissing values in X_: \n{X_.isna().any().values}")
print(f"\nMissing values in y_: \n{y_.isna().any().values}")

Shape of X_ = (3150, 13)
Shape of y_ = (3150, 1)

Missing values in X_: 
[False False False False False False False False False False False False
 False]

Missing values in y_: 
[False]


We have a clean dataset with **3,150** examples, **13** feature variables, and **1** binary target variable, with no missing values. Our goal is to build a binary classification model with a non-linear decision boundary. To check whether a non-linear boundary suits this dataset better than a linear one, we will follow the same approach as before: compare linear and polynomial scikit-learn models first, then build a custom model based on the best one.

In [18]:
# Convert the features (X_) and target (y_) variables into numpy arrays
X_ = X_.to_numpy()
y_ = y_["Churn"].to_numpy()

In [19]:
# Splitting the dataset into training, validation, and test sets
X_train_, y_train_, X_val_, y_val_, X_test_, y_test_ = split_dataset(X_, y_)

# Standardizing the feature datasets (training, validation, and test sets) to have zero mean and unit variance
scaler_, X_train_scaled_, X_val_scaled_, X_test_scaled_ = standardize_dataset(X_train_, X_val_, X_test_)

# Printing the shapes of the resulting datasets
print(f"Training set: {X_train_scaled_.shape}, {y_train_.shape}")
print(f"Validation set: {X_val_scaled_.shape}, {y_val_.shape}")
print(f"Test set: {X_test_scaled_.shape}, {y_test_.shape}")

Training set: (1890, 13), (1890,)
Validation set: (630, 13), (630,)
Test set: (630, 13), (630,)


In [20]:
# Train and evaluate linear logistic regression model
linear_lr_model_ = train_eval_logistic_regression(X_train_scaled_, y_train_, X_val_scaled_, y_val_, X_test_scaled_, y_test_)

# Train and evaluate polynomial (degree 2) logistic regression model
poly2_lr_model_ = train_eval_logistic_regression(X_train_scaled_, y_train_, X_val_scaled_, y_val_, X_test_scaled_, y_test_,
                                                    degree=2)

# Train and evaluate polynomial (degree 3) logistic regression model
poly3_lr_model_ = train_eval_logistic_regression(X_train_scaled_, y_train_, X_val_scaled_, y_val_, X_test_scaled_, y_test_,
                                                    degree=3)

# Train and evaluate polynomial (degree 4) logistic regression model
poly4_lr_model_ = train_eval_logistic_regression(X_train_scaled_, y_train_, X_val_scaled_, y_val_, X_test_scaled_, y_test_,
                                                    degree=4)

# Train and evaluate polynomial (degree 5) logistic regression model
poly5_lr_model_ = train_eval_logistic_regression(X_train_scaled_, y_train_, X_val_scaled_, y_val_, X_test_scaled_, y_test_,
                                                    degree=5)

Linear Logistic Regression:
Training set - Accuracy: 0.8942, Precision: 0.7870, Recall: 0.4478, F1-score: 0.5708
Validation set - Accuracy: 0.8984, Precision: 0.7869, Recall: 0.4848, F1-score: 0.6000
Test set - Accuracy: 0.8889, Precision: 0.7959, Recall: 0.3939, F1-score: 0.5270

Polynomial Degree 2 Logistic Regression:
Training set - Accuracy: 0.9471, Precision: 0.9301, Recall: 0.7172, F1-score: 0.8099
Validation set - Accuracy: 0.9333, Precision: 0.8434, Recall: 0.7071, F1-score: 0.7692
Test set - Accuracy: 0.9397, Precision: 0.8588, Recall: 0.7374, F1-score: 0.7935

Polynomial Degree 3 Logistic Regression:
Training set - Accuracy: 0.9698, Precision: 0.9167, Recall: 0.8889, F1-score: 0.9026
Validation set - Accuracy: 0.9635, Precision: 0.8585, Recall: 0.9192, F1-score: 0.8878
Test set - Accuracy: 0.9524, Precision: 0.8485, Recall: 0.8485, F1-score: 0.8485

Polynomial Degree 4 Logistic Regression:
Training set - Accuracy: 0.9783, Precision: 0.9238, Recall: 0.9394, F1-score: 0.9316
Va

The evaluation results indicate that while the **Linear Logistic Regression** model shows decent performance, its lower **recall** and **F1-score** suggest it struggles with correctly identifying all positive instances. **Polynomial Degree 2 Logistic Regression** improves significantly, capturing more complex patterns and achieving better metrics across all sets. **Polynomial Degree 3** further enhances these metrics, demonstrating robust and consistent performance. **Polynomial Degree 4 Logistic Regression** shows high **accuracy** and **recall**, suggesting it captures more positive instances accurately, but **Polynomial Degree 5**, while maintaining high **accuracy**, indicates potential overfitting with a slight decrease in validation **precision** and test **recall** and **F1-score**. Therefore, the **Polynomial Degree 3** model is the optimal choice, balancing performance and generalization. Now, we will build our custom **Polynomial Degree 3** logistic regression model and compare the results with the optimal scikit-learn model.
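
Before building the custom degree-3 model, it helps to know how many columns a polynomial expansion produces. The sketch below counts the monomials of degree 1 up to the chosen degree, both with a closed-form binomial expression and by direct enumeration; the function names are hypothetical. It counts no constant column, on the assumption that the bias is handled as a separate parameter.

```python
from itertools import combinations_with_replacement
from math import comb


def count_polynomial_features(n_features: int, degree: int) -> int:
    """Count the monomials of degree 1..degree in n_features variables (no bias column)."""
    return sum(comb(n_features + d - 1, d) for d in range(1, degree + 1))


def count_by_enumeration(n_features: int, degree: int) -> int:
    """Same count, obtained by enumerating the index combinations directly."""
    return sum(
        1
        for d in range(1, degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )


# 13 features expanded to degree 3: 13 linear + 91 quadratic + 455 cubic terms.
n_churn = count_polynomial_features(13, 3)

# For comparison, the 57-feature dataset from the first part at degree 3.
n_spam = count_polynomial_features(57, 3)

print(f"13 features, degree 3: {n_churn} columns")
print(f"57 features, degree 3: {n_spam} columns")
```

Scikit-Learn's `PolynomialFeatures` adds a constant bias column by default, so its output has one more column than these counts. The much larger expansion for 57 features, relative to the 2,760 training rows, is consistent with the overfitting observed for the polynomial models in the first part.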

In [21]:
from itertools import combinations_with_replacement


def polynomial_features_multiple(X: np.ndarray, degree: int):
    """
    Generate polynomial features for a given standardized dataset X up to a specific degree.
    
    Parameters:
    X (np.ndarray): The standardized input dataset of shape (n_samples, n_features).
    degree (int): The degree of the polynomial features.
    
    Returns:
    np.ndarray: A new dataset with polynomial features of shape (n_samples, n_output_features).
    
    """
    
    n_features = X.shape[1]
    
    # List to hold the polynomial features
    features = []
    
    # Add polynomial features of all degrees from 1 to the specified degree
    for d in range(1, degree + 1):
        for indices in combinations_with_replacement(range(n_features), d):
            new_feature = np.prod(X[:, indices], axis=1)
            features.append(new_feature)
    
    return np.vstack(features).T

In [22]:
# Train custom polynomial (degree 3) logistic regression model
custom_poly3_lr_model = CustomLogisticRegression(learning_rate=0.39)
custom_poly3_lr_model.fit(polynomial_features_multiple(X_train_scaled_, 3), y_train_.reshape(-1, 1))

Iteration 100/1000: Cost 0.17299103043391806
Iteration 200/1000: Cost 0.13000487619638096
Iteration 300/1000: Cost 0.1118748593831729
Iteration 400/1000: Cost 0.10316749304994151
Iteration 500/1000: Cost 0.097527276515648
Iteration 600/1000: Cost 0.09335426309812214
Iteration 700/1000: Cost 0.09008193632326347
Iteration 800/1000: Cost 0.08748242079359296
Iteration 900/1000: Cost 0.08553166047931152
Iteration 1000/1000: Cost 0.08415279098185142


In [23]:
# Convert cost history into numpy arrays
cost_hist_ = np.array(custom_poly3_lr_model.cost_history)

# Plotly Express line chart
fig = px.line(
    x=cost_hist_[:, 0],
    y=cost_hist_[:, 1],
    title="Iteration vs Cost",
    labels={"x": "Iteration", "y": "Cost"}
)

# Show the plot
fig.show()

# Saved as plot_2.png in the current directory/folder

The above plot shows the cost decreasing steadily as the number of iterations increases, which indicates that gradient descent is working as intended and moving the parameters toward a minimum of the cost function.

In [24]:
# Evaluate the custom polynomial (degree 3) logistic regression model
train_acc_, train_prec_, train_rec_, train_f1_ = \
        custom_poly3_lr_model.evaluate_model(polynomial_features_multiple(X_train_scaled_, 3), y_train_.reshape(-1, 1))

val_acc_, val_prec_, val_rec_, val_f1_ = \
        custom_poly3_lr_model.evaluate_model(polynomial_features_multiple(X_val_scaled_, 3), y_val_.reshape(-1, 1))

test_acc_, test_prec_, test_rec_, test_f1_ = \
        custom_poly3_lr_model.evaluate_model(polynomial_features_multiple(X_test_scaled_, 3), y_test_.reshape(-1, 1))

print("Custom Polynomial (degree 3) Logistic Regression:")

print(f"Training set - Accuracy: {train_acc_:.4f}, Precision: {train_prec_:.4f}, Recall: {train_rec_:.4f}, "
        f"F1-score: {train_f1_:.4f}")

print(f"Validation set - Accuracy: {val_acc_:.4f}, Precision: {val_prec_:.4f}, Recall: {val_rec_:.4f}, "
        f"F1-score: {val_f1_:.4f}")

print(f"Test set - Accuracy: {test_acc_:.4f}, Precision: {test_prec_:.4f}, Recall: {test_rec_:.4f}, "
        f"F1-score: {test_f1_:.4f}")

Custom Polynomial (degree 3) Logistic Regression:
Training set - Accuracy: 0.9651, Precision: 0.9053, Recall: 0.8687, F1-score: 0.8866
Validation set - Accuracy: 0.9571, Precision: 0.8600, Recall: 0.8687, F1-score: 0.8643
Test set - Accuracy: 0.9476, Precision: 0.8511, Recall: 0.8081, F1-score: 0.8290


Scikit-Learn Polynomial (degree 3) Logistic Regression: <br>
Training set - Accuracy: 0.9698, Precision: 0.9167, Recall: 0.8889, F1-score: 0.9026 <br>
Validation set - Accuracy: 0.9635, Precision: 0.8585, Recall: 0.9192, F1-score: 0.8878 <br>
Test set - Accuracy: 0.9524, Precision: 0.8485, Recall: 0.8485, F1-score: 0.8485

Both the custom polynomial logistic regression model and Scikit-Learn's implementation perform strongly with a degree of **3**. The custom model achieves high accuracy on the training (**96.51%**), validation (**95.71%**), and test (**94.76%**) sets, with precision of **90.53%**, **86.00%**, and **85.11%**, recall of **86.87%**, **86.87%**, and **80.81%**, and F1-scores of **88.66%**, **86.43%**, and **82.90%** respectively. In comparison, Scikit-Learn's model scores slightly higher on every training metric (**96.98%**, **91.67%**, **88.89%**, **90.26%**) and on accuracy, recall, and F1-score for the validation (**96.35%**, **85.85%**, **91.92%**, **88.78%**) and test sets (**95.24%**, **84.85%**, **84.85%**, **84.85%**); the custom model is marginally ahead only on validation and test precision. The largest gaps are in recall, about 5 percentage points on validation and 4 on test. Both models demonstrate effective polynomial logistic regression capabilities, with the custom model remaining competitive despite its lower recall and F1-scores. So, with this comparison, our goal of building a binary classification model with a non-linear/polynomial decision boundary has been achieved successfully.
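
The test-set comparison can also be summarized as per-metric differences. The sketch below hard-codes the test scores reported above; the variable names and the sign convention (custom minus Scikit-Learn) are choices made for illustration.

```python
# Test-set scores copied from the outputs above.
custom_test = {"Accuracy": 0.9476, "Precision": 0.8511, "Recall": 0.8081, "F1-score": 0.8290}
sklearn_test = {"Accuracy": 0.9524, "Precision": 0.8485, "Recall": 0.8485, "F1-score": 0.8485}

# Positive values mean the custom model scores higher than the Scikit-Learn model.
differences = {
    metric: round(custom_test[metric] - sklearn_test[metric], 4) for metric in custom_test
}

# Metric on which the two models differ the most, in absolute terms.
largest_gap_metric = max(differences, key=lambda metric: abs(differences[metric]))

for metric, diff in differences.items():
    print(f"{metric}: {diff:+.4f}")
print(f"Largest gap: {largest_gap_metric}")
```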