**Author:** Ahmadreza Attarpour  
**Email:** [a.attarpour@mail.utoronto.ca](mailto:a.attarpour@mail.utoronto.ca)  

This notebook demonstrates the logistic regression algorithm on a simple classification task.

---

## **1. Logistic Regression Overview**
Logistic Regression is a classification algorithm used to predict **binary outcomes** (e.g., 0 or 1, Yes or No, Spam or Not Spam). It is based on the **sigmoid function**, which maps real-valued inputs into a range between **0 and 1**.

## **2. Mathematical Formulation**
Given an input feature vector \( X \), we compute a linear combination of the features:

$$
z = w_0 + w_1x_1 + w_2x_2 + \dots + w_nx_n
$$

where:
- \( z \) is the linear combination of inputs
- \( w_0 \) (bias term) and \( w_i \) (weights) are parameters of the model
- \( x_i \) are the feature values
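As a minimal sketch of the linear combination above, the snippet below evaluates \( z \) for one sample with NumPy. The numeric values of `w0`, `w`, and `x` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
w0 = 0.5                          # bias term w_0
w = np.array([1.0, -2.0, 0.25])   # weights w_1 .. w_n
x = np.array([2.0, 1.0, 4.0])     # feature values x_1 .. x_n

# z = w_0 + w_1*x_1 + ... + w_n*x_n
z = w0 + np.dot(w, x)
print(z)  # 1.5
```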

## **3. Sigmoid (Logistic) Function**
To convert the linear output \( z \) into a probability between 0 and 1, we apply the **sigmoid function**:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

Because \( e^{-z} > 0 \) for every real \( z \), the output always lies strictly between **0 and 1**.
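A direct translation of the sigmoid formula into NumPy might look like the sketch below. The helper name `sigmoid` and the sample inputs are assumptions made for this example. For very large negative inputs, `np.exp(-z)` can overflow, in which case a library routine such as `scipy.special.expit` is a safer choice.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs: a negative value, zero, and a positive value
z = np.array([-5.0, 0.0, 5.0])
p = sigmoid(z)
print(p)
```

The middle value is exactly 0.5, and the two outer values are mirror images of each other around 0.5.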

## **4. Hypothesis Function**
The probability of class **1** (positive class) is given by:

$$
P(y=1 | X) = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

Similarly, the probability of class **0** (negative class) is:

$$
P(y=0 | X) = 1 - \sigma(z)
$$

## **5. Cost Function (Log Loss)**
The cost function for logistic regression is called the **log loss (logarithmic loss)**:

$$
J(w) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]
$$

where:
- \( m \) is the number of training samples
- \( y^{(i)} \) is the actual class label (0 or 1)
- \( \hat{y}^{(i)} = \sigma(z^{(i)}) \) is the predicted probability for sample \( i \)
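The log loss formula can be sketched in a few lines of NumPy. The function name `binary_log_loss`, the clipping constant `eps`, and the toy labels and probabilities are assumptions made for this example. Clipping is added only to avoid evaluating `log(0)`.

```python
import numpy as np

def binary_log_loss(y, y_hat, eps=1e-15):
    """Mean log loss for binary labels y and predicted probabilities y_hat."""
    y = np.asarray(y, dtype=float)
    # Clip probabilities away from 0 and 1 so the logarithms stay finite
    y_hat = np.clip(np.asarray(y_hat, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Hypothetical labels and two sets of predicted probabilities
y = np.array([1, 0, 1, 0])
good_probs = np.array([0.9, 0.1, 0.8, 0.2])  # mostly agree with the labels
bad_probs = np.array([0.1, 0.9, 0.2, 0.8])   # mostly disagree with the labels

loss_good = binary_log_loss(y, good_probs)
loss_bad = binary_log_loss(y, bad_probs)
print(loss_good, loss_bad)
```

Confident predictions that agree with the labels give a small loss, while confident predictions that disagree are penalized heavily.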

## **6. Gradient Descent for Optimization**
To find the optimal weights \( w \), we use **gradient descent**:

$$
w_j := w_j - \alpha \frac{\partial J(w)}{\partial w_j}
$$

where:
- \( \alpha \) is the learning rate
- The gradient of the cost function with respect to \( w_j \) (taking \( x_0^{(i)} = 1 \) for the bias term \( w_0 \)) is:

$$
\frac{\partial J(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) x_j^{(i)}
$$

Repeating this update moves the weights step by step toward a minimum of the cost function.
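The update rule and gradient above can be put together into a small batch gradient descent loop. This is a minimal sketch, not the solver scikit-learn uses. The function name `fit_logistic_gd`, the defaults `alpha=0.1` and `n_iters=1000`, and the synthetic data are all assumptions made for this example.

```python
import numpy as np

def fit_logistic_gd(X, y, alpha=0.1, n_iters=1000):
    """Fit logistic regression weights with batch gradient descent."""
    m = X.shape[0]
    # Prepend a column of ones (x_0 = 1) so that w[0] plays the role of the bias w_0
    Xb = np.hstack([np.ones((m, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        y_hat = 1.0 / (1.0 + np.exp(-(Xb @ w)))  # predicted probabilities
        grad = Xb.T @ (y_hat - y) / m            # (1/m) * sum((y_hat - y) * x_j)
        w -= alpha * grad                        # w_j := w_j - alpha * dJ/dw_j
    return w

# Synthetic toy data (hypothetical): label is 1 when x_1 + x_2 > 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = fit_logistic_gd(X, y)
preds = (np.hstack([np.ones((200, 1)), X]) @ w >= 0).astype(float)
accuracy = np.mean(preds == y)
print(w, accuracy)
```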

## **7. Decision Boundary**
To classify a new sample, we use a **threshold**:

$$
\hat{y} =
\begin{cases}
1, & \text{if } \sigma(z) \geq 0.5 \\
0, & \text{otherwise}
\end{cases}
$$

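This means that if the predicted probability \( \sigma(z) \) is **greater than or equal to 0.5**, we classify the sample as **1**; otherwise, we classify it as **0**. Since \( \sigma(z) \geq 0.5 \) exactly when \( z \geq 0 \), the decision boundary is the set of points where \( z = 0 \).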
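The thresholding rule can be sketched as follows. The helper name `predict_label` and the sample values of \( z \) are hypothetical.

```python
import numpy as np

def predict_label(z, threshold=0.5):
    """Return 1 where sigmoid(z) >= threshold, else 0."""
    prob = 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))
    return (prob >= threshold).astype(int)

# Hypothetical linear outputs: below, exactly on, and above the boundary z = 0
z = np.array([-2.0, 0.0, 3.0])
labels = predict_label(z)
print(labels)  # [0 1 1]
```

The sample with \( z = 0 \) has probability exactly 0.5 and is therefore assigned to class 1 under the "greater than or equal" rule.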


---
Python Practice
---
We will use scikit-learn’s logistic regression classifier to classify patients as liver patients (liver disease) or non-liver patients (no disease).

We will use a dataset of 416 liver-patient records and 167 non-liver-patient records collected from the north-east of Andhra Pradesh, India. Ten variables are recorded for each patient, and the true label is in the column `Dataset`. The data was obtained from https://www.kaggle.com/datasets/uciml/indian-liver-patient-records. We will use the copy stored in HW1_data.csv on the course website for this assignment.


In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
import numpy as np
import matplotlib.pyplot as plt
# setting random seed for reproducibility
np.random.seed(1210)

In [10]:
# path to data
data_path = "/Users/ahmadreza/Documents/Files/PhD Files/code/code_repository/algorithms/logistic_regression/hw1_data.csv"

In [11]:
# Function for loading the data
def load_data(data_path: str) -> tuple[list, list, list]:
    # Load the dataset from the given CSV file
    df = pd.read_csv(data_path)
    print(df)
    
    # Initialize the encoder for categorical 'Gender' feature
    label_encoder = LabelEncoder()
    # Encode the 'Gender' column (LabelEncoder sorts classes alphabetically, so Female: 0, Male: 1)
    df['Gender'] = label_encoder.fit_transform(df['Gender'])
    
    # Extract target labels from 'Dataset' column
    y = df["Dataset"].to_numpy()
    
    # Drop the target column from the feature set
    df = df.drop(columns=["Dataset"])
    
    # Convert features to numpy array
    X = df.to_numpy()
    
    # Handle NaN values by replacing them with 0
    X = np.nan_to_num(X)
    
    # Split the dataset into training + validation and testing sets (80% train + validation, 20% test)
    train_val_X, test_X, train_val_y, test_y = train_test_split(X, y, test_size=0.20)
    
    # Split the training + validation set into train and validation sets
    # (0.10/0.80 = 12.5% of this subset, i.e. 70% train and 10% validation of the full data)
    train_X, val_X, train_y, val_y = train_test_split(train_val_X, train_val_y, test_size=(0.10/0.80))
    
    # Return the train, validation, and test sets as [features, labels] pairs
    return [train_X, train_y], [val_X, val_y], [test_X, test_y]
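
The integer codes produced by `LabelEncoder` follow the sorted order of the class labels. A quick way to check the mapping is sketched below on a small hypothetical Series rather than the real `Gender` column; the variable names are assumptions made for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the 'Gender' column
gender = pd.Series(["Male", "Female", "Male"])

encoder = LabelEncoder()
codes = encoder.fit_transform(gender)

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}
```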


In [12]:
# Load data
train, val, test = load_data(data_path)


     Age  Gender  Total_Bilirubin  Direct_Bilirubin  Alkaline_Phosphotase  \
0     65  Female              0.7               0.1                   187   
1     62    Male             10.9               5.5                   699   
2     62    Male              7.3               4.1                   490   
3     58    Male              1.0               0.4                   182   
4     72    Male              3.9               2.0                   195   
..   ...     ...              ...               ...                   ...   
578   60    Male              0.5               0.1                   500   
579   40    Male              0.6               0.1                    98   
580   52    Male              0.8               0.2                   245   
581   31    Male              1.3               0.5                   184   
582   38    Male              1.0               0.3                   216   

     Alamine_Aminotransferase  Aspartate_Aminotransferase  Total_Protiens  

In [44]:
def train_logistic_rec(
        train: tuple[list, list],
        val: tuple[list, list],
        standardization: bool = True, 
        solver: str = 'lbfgs',
        penalty: str = 'l2',
        fit_intercept: bool = True,
        max_iter: int = 100
) -> LogisticRegression:

    # Extract data and labels for training and validation
    train_X, train_y = train[0], train[1]
    val_X, val_y = val[0], val[1]

    # If standardization is enabled, scale the features to have zero mean and unit variance
    if standardization:
        scaler = StandardScaler().fit(train_X)  # Fit scaler on training data
        train_X = scaler.transform(train_X)  # Apply transformation to training data
        val_X = scaler.transform(val_X)  # Apply transformation to validation data

    # Initialize the logistic regression model
    model = LogisticRegression(penalty=penalty, solver=solver, fit_intercept=fit_intercept, max_iter=max_iter)
    model.fit(train_X, train_y)  # Fit the model on the training data
    print(f"Training accuracy: {model.score(train_X, train_y):.2f}")  # Compute the training accuracy
    print(f"Validation accuracy: {model.score(val_X, val_y):.2f}")  # Compute the validation accuracy
    print(f"model coefficients and intercept: {model.coef_}, {model.intercept_}")

    return model


    

In [27]:
def model_evaluation(
        model: LogisticRegression,
        test: tuple[list, list],
        standardization: bool = True,
) -> None:
    # Extract data and labels for testing
    test_X, test_y = test[0], test[1]
    if standardization:
        # Note: this scaler is fitted on the test set itself, not on the training data.
        # Reusing the scaler fitted during training would keep the preprocessing consistent with the model.
        scaler = StandardScaler().fit(test_X)
        test_X = scaler.transform(test_X)

    print(f"Test accuracy: {model.score(test_X, test_y):.2f}")  # Compute the test accuracy
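
The evaluation function above fits a new `StandardScaler` on the test features, so the test data is scaled with different statistics than the model saw during training. One common way to avoid this is to bundle the scaler and the classifier in a scikit-learn `Pipeline`, so the statistics learned from the training set are reused at prediction time. The sketch below uses synthetic stand-in data (not the liver dataset), and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the liver dataset)
rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_train = (X_train[:, 0] > 5.0).astype(int)
X_test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y_test = (X_test[:, 0] > 5.0).astype(int)

# The pipeline fits the scaler on the training data only and
# applies the same transformation whenever score/predict is called
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```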
    

In [51]:
model1 = train_logistic_rec(train,
                            val,
                            standardization=True,
                            solver='saga',
                            penalty='l1',
                            fit_intercept=True,
                            max_iter=100)

Training accuracy: 0.73
Validation accuracy: 0.76
model coefficients and intercept: [[-0.29245042 -0.04213527 -0.27462571 -0.90181007 -0.27261782 -1.08441888
  -0.63433761 -0.37197053  0.39236702 -0.13727922]], [-1.47372187]




In [52]:
model2 = train_logistic_rec(train,
                            val,
                            standardization=True,
                            penalty='l2',
                            fit_intercept=True,
                            max_iter=100)

Training accuracy: 0.72
Validation accuracy: 0.75
model coefficients and intercept: [[-0.30187614 -0.04353233 -0.33050351 -0.8726239  -0.28869057 -1.17073254
  -0.66645098 -0.63273211  0.76998586 -0.37284885]], [-1.53028852]


In [53]:
model_evaluation(model1, test, standardization=True)

Test accuracy: 0.74


In [50]:
model_evaluation(model2, test, standardization=True)

Test accuracy: 0.76
