# **Logistic Regression**

## **1 Introduction**

This notebook is my learning material. I use it to keep track of the concepts covered in the [Supervised Machine Learning: Regression and Classification](https://www.coursera.org/learn/machine-learning?specialization=machine-learning-introduction) course from the [Machine Learning Specialization](https://www.coursera.org/specializations/machine-learning-introduction) created by Andrew Ng.

The first part explains how to perform Logistic Regression using gradient descent.<br>
The second part shows how to solve this problem using [scikit-learn](https://scikit-learn.org/stable/index.html).

Throughout this notebook, I use the [Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data) published on Kaggle by UCI Machine Learning.

### **1.0.1 Imports**

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Options for pandas
pd.options.display.max_columns = 50
pd.options.display.max_rows = 30

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Options for seaborn
sns.set_style('darkgrid')
%matplotlib inline

from IPython import get_ipython
ipython = get_ipython()

# Autoreload extensions
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

### **1.1 Data**

#### **1.1.0.1 Import**

In [None]:
cancer = pd.read_csv('Breast Cancer WIsconsin (Diagnostic).csv')
cancer

#### **1.1.1 Exploratory Data Analysis**

In [None]:
cancer.info()
cancer.describe()

## **2 Two-variable Logistic Regression**

### **2.1 Data preparation**

In [None]:
# Retrieve features
data = cancer[['diagnosis', 'radius_mean', 'texture_mean']].copy()

# Cast diagnosis into integers
data['diagnosis'] = [1 if d == 'M' else 0 for d in data['diagnosis']]

data

In [None]:
sns.JointGrid(data=data, x='radius_mean', y='texture_mean',
              hue='diagnosis') \
   .plot_joint(sns.scatterplot) \
   .plot_marginals(sns.kdeplot,
                   fill=True)

In [None]:
# Z-Score normalization
r = data['radius_mean']
t = data['texture_mean']

# Divide by the standard deviation (not the variance) to get unit spread
data['radius_mean'] = (r - r.mean()) / r.std()
data['texture_mean'] = (t - t.mean()) / t.std()
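
As a sanity check, a z-score normalized feature should end up with mean 0 and standard deviation 1, which requires dividing by the standard deviation. A minimal sketch on made-up values (the `values_demo` Series is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical feature values, for illustration only
values_demo = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# Z-score: subtract the mean, then divide by the standard deviation
z_demo = (values_demo - values_demo.mean()) / values_demo.std()

# Both checks use pandas' default sample standard deviation (ddof=1)
print(z_demo.mean(), z_demo.std())
```

Note that pandas uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` uses the population one (`ddof=0`); on a dataset of this size the difference should be small.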

In [None]:
sns.scatterplot(data=data, x='radius_mean', y='texture_mean',
                hue='diagnosis', style='diagnosis')

In [None]:
# Split the data
training_data = data.sample(frac=0.8, random_state=25)
testing_data = data.drop(training_data.index)

X_train = training_data.drop('diagnosis', axis=1).to_numpy()
y_train = training_data['diagnosis'].values

X_test = testing_data.drop('diagnosis', axis=1).to_numpy()
y_test = testing_data['diagnosis'].values

### **2.2 Analysis**

#### **2.2.1 Model**

$$
g(z) = \frac{1}{1 + e^{-z}} \tag{1}
$$

In [None]:
def g(z):
    return 1 / (1 + np.exp(-z))
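
The sigmoid maps any real number into the interval between 0 and 1, with $g(0) = 0.5$ and $g(-z) = 1 - g(z)$. For very negative inputs, `np.exp(-z)` can overflow and emit a warning. The sketch below shows one possible way to avoid that; `g_stable` is a hypothetical helper, not part of the course material.

```python
import numpy as np

def g_stable(z):
    """Sigmoid that avoids overflow in exp for large |z| (hypothetical helper)."""
    z = np.asarray(z, dtype=float)
    # exp(-|z|) always lies in [0, 1], so it cannot overflow
    e = np.exp(-np.abs(z))
    # z >= 0: 1 / (1 + exp(-z));  z < 0: exp(z) / (1 + exp(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

z_demo = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
print(g_stable(z_demo))
```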

$$
f_{\vec{w}, b}(\vec{x}) = g(\vec{w} \cdot \vec{x} + b) \tag{2}
$$

In [None]:
def f(X, w, b):
    return g(np.dot(w, X.T) + b)

#### **2.2.2 Cost function**

$$
loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) = -y^{(i)} \log(f_{\vec{w},b}(\vec{x}^{(i)})) - (1 - y^{(i)}) \log(1 - f_{\vec{w},b}(\vec{x}^{(i)})) \tag{3}
$$

In [None]:
def compute_loss(X, y, w, b):
    return (-y * np.log(f(X, w, b))) - (1 - y) * np.log(1 - f(X, w, b))
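
If the model ever outputs exactly 0 or 1, `np.log` returns `-inf` and the loss becomes `inf` or `nan`. A common workaround is to clip the predictions slightly away from the boundaries. The sketch below assumes a clipping constant `eps=1e-15`, which is an arbitrary choice, and `log_loss_clipped` is a hypothetical name.

```python
import numpy as np

def log_loss_clipped(p, y, eps=1e-15):
    """Per-example loss from equation (3), with p clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

p_demo = np.array([0.0, 0.5, 1.0])  # predicted probabilities (made up)
y_demo = np.array([0, 1, 1])        # true labels (made up)
print(log_loss_clipped(p_demo, y_demo))
```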

$$
J(\vec{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} loss(f_{\vec{w},b}(\vec{x}^{(i)}), y^{(i)}) \tag{4}
$$

In [None]:
def compute_cost(X, y, w, b):
    m = X.shape[0]
    c = 0
    for i in range(m):
        c += compute_loss(X[i], y[i], w, b)
        
    # Equation (4) averages over m (the 1/2 factor belongs to squared-error cost)
    return c / m
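
Equation (4) can also be computed without the Python loop. The following is a minimal vectorized sketch on a tiny made-up dataset; `compute_cost_vectorized` and the `_demo` arrays are hypothetical names used only for illustration.

```python
import numpy as np

def compute_cost_vectorized(X, y, w, b):
    """Mean of the per-example losses, equation (4), without a Python loop."""
    p = 1 / (1 + np.exp(-(X @ w + b)))
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

# Tiny made-up dataset: 4 examples, 2 features
X_demo = np.array([[0.5, 1.0], [-1.0, 0.3], [1.5, -0.7], [-0.2, -1.2]])
y_demo = np.array([1, 0, 1, 0])

# With w = 0 and b = 0 every prediction is 0.5, so the cost is log(2)
cost_demo = compute_cost_vectorized(X_demo, y_demo, np.zeros(2), 0.0)
print(cost_demo)
```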

#### **2.2.3 Gradient**

$$
\begin{align}
\frac{\partial J(\vec{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\vec{w},b}(\vec{x}^{(i)}) -y^{(i)})x_j^{(i)} \tag{5}
\\
\frac{\partial J(\vec{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \tag{6}
\end{align}
$$

In [None]:
def compute_gradient(X, y, w, b):
    m = X.shape[0]
    
    dw = np.sum((f(X, w, b) - y) * X.T, axis=1) / m
    db = np.sum(f(X, w, b) - y) / m
    
    return dw, db
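
One way to gain confidence in equations (5) and (6) is to compare them against a numerical derivative of the cost. The sketch below does this with central finite differences on synthetic data; all names (`sigmoid`, `cost`, `gradient`, the `_demo` variables) and the step size `h` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(X, y, w, b):
    p = sigmoid(X @ w + b)
    return np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))

def gradient(X, y, w, b):
    err = sigmoid(X @ w + b) - y            # shape (m,)
    return X.T @ err / len(y), err.mean()   # equations (5) and (6)

# Synthetic data, not the cancer dataset
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
w_demo, b_demo, h = np.array([0.3, -0.2]), 0.1, 1e-6

dw, db = gradient(X_demo, y_demo, w_demo, b_demo)

# Central finite differences of the cost, one parameter at a time
dw_num = np.array([
    (cost(X_demo, y_demo, w_demo + h * e, b_demo)
     - cost(X_demo, y_demo, w_demo - h * e, b_demo)) / (2 * h)
    for e in np.eye(2)
])
db_num = (cost(X_demo, y_demo, w_demo, b_demo + h)
          - cost(X_demo, y_demo, w_demo, b_demo - h)) / (2 * h)

print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```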

#### **2.2.4 Gradient descent**

$$
\text{repeat until convergence} \left\{
    \begin{array}{ll}
        w_j \leftarrow w_j - \alpha \frac{1}{m} \sum_{i=0}^{m-1} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) x_j^{(i)} \\
        b \leftarrow b - \alpha \frac{1}{m} \sum_{i=0}^{m-1} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})
    \end{array} \tag{7}
\right.
$$

In [None]:
def gradient_descent(X, y, cost_function, gradient_function, alpha, epochs):
    cost_history = np.zeros(epochs)
    
    # Initial parameters: one weight per feature, plus the bias
    w = np.zeros(X.shape[1])
    b = 0.0
    
    for i in range(epochs):
        # Use the functions passed in as arguments
        dw, db = gradient_function(X, y, w, b)
        
        # Update parameters
        w -= alpha * dw
        b -= alpha * db
        
        # Save cost
        cost_history[i] = cost_function(X, y, w, b)
        
    return w, b, cost_history
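
To see the update rule at work in isolation, here is a compact, self-contained sketch run on synthetic two-class data. It is not the notebook's implementation: `fit_logistic`, the `_toy` variables, and the hyperparameters are assumptions chosen for the example. With a small enough learning rate, the cost should fall from one epoch to the next.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic(X, y, alpha=0.3, epochs=500):
    """Batch gradient descent for logistic regression (hypothetical helper)."""
    w, b = np.zeros(X.shape[1]), 0.0
    costs = np.zeros(epochs)
    for i in range(epochs):
        p = sigmoid(X @ w + b)
        # Step against the gradient, equations (5) and (6)
        w -= alpha * X.T @ (p - y) / len(y)
        b -= alpha * np.mean(p - y)
        # Cost after the update, clipped to keep the logs finite
        p = np.clip(sigmoid(X @ w + b), 1e-15, 1 - 1e-15)
        costs[i] = np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p))
    return w, b, costs

# Synthetic, noisy two-class data (not the cancer dataset)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + X_toy[:, 1] + noise > 0).astype(int)

w_toy, b_toy, costs_toy = fit_logistic(X_toy, y_toy)
print(costs_toy[0], costs_toy[-1])
```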

### **2.3 Results**

#### **2.3.1 Decision boundary**

In [None]:
w, b, cost_history = gradient_descent(X_train, y_train, 
                                      cost_function=compute_cost, gradient_function=compute_gradient,
                                      alpha=0.3, epochs=1500)

print(f'w, b found by gradient descent:\n {w}, {b}')

In [None]:
x0 = np.arange(data['radius_mean'].min(), data['radius_mean'].max())
x1 = (-b - w[0] * x0) / w[1]

sns.scatterplot(data=data, x='radius_mean', y='texture_mean',
                hue='diagnosis', style='diagnosis')

sns.lineplot(x=x0, y=x1,
             label='decision boundary',
             color='purple') \
   .fill_between(x0, x1, x1[1],
                 color='purple', alpha=0.2)
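
The expression used for `x1` follows from the 0.5 threshold. Since $g(z) = 0.5$ exactly when $z = 0$, the boundary is the set of points where the linear part of the model vanishes. With two features, writing $x_0$ for `radius_mean` and $x_1$ for `texture_mean`, and assuming $w_1 \neq 0$:

$$
\begin{align}
w_0 x_0 + w_1 x_1 + b &= 0 \\
x_1 &= \frac{-b - w_0 x_0}{w_1}
\end{align}
$$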

#### **2.3.2 Convergence**

In [None]:
sns.lineplot(x=range(cost_history.shape[0]), y=cost_history)

plt.xlabel('iteration')
plt.ylabel('cost')

#### **2.3.3 Predictions**

In [None]:
def predict(X, w, b, threshold=0.5):
    p = f(X, w, b)
    
    p[p >= threshold] = 1
    p[p < threshold] = 0
    
    return p

In [None]:
predictions = predict(X_test, w, b)

# The predictions were made on the held-out test set
print(f'test accuracy: {np.mean(predictions == y_test) * 100:.2f}%')
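
Accuracy alone does not say which kind of mistake the model makes, and in a diagnosis setting a missed malignant case (false negative) is usually the more costly one. The sketch below counts the four outcomes by hand on made-up labels; `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the results above.

```python
import numpy as np

# Made-up labels and predictions, for illustration only (1 = malignant)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = np.sum((y_pred_demo == 1) & (y_true_demo == 1))  # malignant, predicted malignant
fp = np.sum((y_pred_demo == 1) & (y_true_demo == 0))  # benign, predicted malignant
fn = np.sum((y_pred_demo == 0) & (y_true_demo == 1))  # malignant, predicted benign
tn = np.sum((y_pred_demo == 0) & (y_true_demo == 0))  # benign, predicted benign

accuracy = (tp + tn) / len(y_true_demo)
precision = tp / (tp + fp)  # share of predicted malignant that are malignant
recall = tp / (tp + fn)     # share of malignant cases that were caught
print(accuracy, precision, recall)
```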

In [None]:
# Use a loop variable that does not shadow the bias b
guess = ['correct' if ok else 'wrong' for ok in predictions == y_test]

sns.scatterplot(data=testing_data, x='radius_mean', y='texture_mean',
                hue=guess, style=guess)

## **3 Logistic Regression with scikit-learn**

### **3.1 Data preparation**

In [None]:
data = cancer.drop(['id', 'Unnamed: 32'], axis=1).copy()
data['diagnosis'] = [1 if d == 'M' else 0 for d in data['diagnosis']]

data

In [None]:
training_data = data.sample(frac=0.8, random_state=35)
testing_data = data.drop(training_data.index)

X_train = training_data.drop('diagnosis', axis=1).to_numpy()
y_train = training_data['diagnosis'].values

X_test = testing_data.drop('diagnosis', axis=1).to_numpy()
y_test = testing_data['diagnosis'].values
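
An alternative to sampling with pandas is scikit-learn's `train_test_split`, which can also keep the class proportions equal in both subsets through its `stratify` argument. A minimal sketch on synthetic stand-in data (the `_demo` arrays and the 30% positive rate are assumptions made for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and labels (30% positives)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo keeps the share of positives similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=35
)

print(X_tr.shape, X_te.shape, y_tr.mean(), y_te.mean())
```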

In [None]:
# Fit the scaler on the training features only (the label is not scaled)
scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)

X_train_norm

### **3.2 Analysis**

In [None]:
lr_model = LogisticRegression()
lr_model.fit(X_train_norm, y_train)

b = lr_model.intercept_
w = lr_model.coef_
print(f'w, b found:\n {w}, {b}')

### **3.3 Results**

In [None]:
# Reuse the scaler fitted on the training features; do not refit on test data
X_test_norm = scaler.transform(X_test)

print("Accuracy on test set:", lr_model.score(X_test_norm, y_test))
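
Scaling and model fitting can also be chained with `make_pipeline`, so that the scaling statistics are learned from the training data and reused automatically at prediction time. The sketch below uses synthetic stand-in data; the `_demo` arrays, the feature scales, and the 160/40 split are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 10 > 0).astype(int)
X_tr, X_te = X_demo[:160], X_demo[160:]
y_tr, y_te = y_demo[:160], y_demo[160:]

# fit() learns the scaling statistics from the training data only;
# score() reuses them to transform the test data before predicting
model_demo = make_pipeline(StandardScaler(), LogisticRegression())
model_demo.fit(X_tr, y_tr)

score_demo = model_demo.score(X_te, y_te)
print(score_demo)
```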