# scikit-learn - Logistic Regression

**Why Use scikit-learn for Logistic Regression Instead of Coding from Scratch?**

Implementing logistic regression from scratch is a great learning experience — it helps you understand how gradient descent, the sigmoid function, and cost functions work. However, when you're working on real-world datasets or production-level tasks, using a library like scikit-learn is far more efficient.

**Benefits of Using scikit-learn:**
- Numerical Stability: Handles large values in exponentials/logarithms internally without throwing overflow or division errors.

- Speed: Relies on optimized compiled code under the hood (for example, the liblinear solver is a C/C++ library) for faster computation.

- Convenience: One-liner functions for training, prediction, evaluation, and model tuning.

- Scalability: Easily integrates with pipelines, feature selection, cross-validation, and grid search.

- Robustness: Comes with built-in regularization, different solvers, and penalty options.

So, while it's crucial to learn the math by coding it manually, in practice we use libraries to focus on solving problems rather than reinventing the wheel.
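To make the numerical-stability point concrete, here is a minimal sketch comparing a naive log-sigmoid with a rearranged version built on `np.logaddexp`. It is only an illustration of the kind of issue a library guards against, not scikit-learn's internal code, and the function names `naive_log_sigmoid` and `stable_log_sigmoid` are made up for this example.

```python
import numpy as np

def naive_log_sigmoid(z):
    # Direct translation of log(1 / (1 + e^-z)); overflows for very negative z
    return np.log(1.0 / (1.0 + np.exp(-z)))

def stable_log_sigmoid(z):
    # log(sigmoid(z)) = -log(1 + e^-z) = -logaddexp(0, -z), which does not overflow
    return -np.logaddexp(0.0, -z)

z = np.array([-1000.0, 0.0, 1000.0])

# Silence the overflow / divide-by-zero warnings the naive version triggers
with np.errstate(over="ignore", divide="ignore"):
    naive = naive_log_sigmoid(z)   # first entry becomes -inf

stable = stable_log_sigmoid(z)     # all entries stay finite

print("naive: ", naive)
print("stable:", stable)
```

A hand-written cost function that hits `-inf` like this would break gradient descent, which is one reason to lean on a tested implementation.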

## Importing required libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler  # feature scaling
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score  # accuracy measures

## Preprocess data

In [2]:
df = pd.read_csv("titanic")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,class,alone,embarked
0,0,0,3,1,22.0,1,0,7.25,Third,False,0
1,1,1,1,0,38.0,1,0,71.2833,First,False,1
2,2,1,3,0,26.0,0,0,7.925,Third,True,0
3,3,1,1,0,35.0,1,0,53.1,First,False,0
4,4,0,3,1,35.0,0,0,8.05,Third,True,0


In [4]:
df = df.drop(['Unnamed: 0'], axis = 1)

In [5]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,class,alone,embarked
0,0,3,1,22.0,1,0,7.25,Third,False,0
1,1,1,0,38.0,1,0,71.2833,First,False,1
2,1,3,0,26.0,0,0,7.925,Third,True,0
3,1,1,0,35.0,1,0,53.1,First,False,0
4,0,3,1,35.0,0,0,8.05,Third,True,0


One-hot encoding values:

Question: Why are we one-hot encoding certain values in this dataset?

In [6]:
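# One-hot encode the categorical columns; drop_first=True drops one level per column
df = pd.get_dummies(df, columns=['class', 'alone'], drop_first=True)
# Cast every column to int; note this also truncates age and fare (e.g. 7.25 becomes 7)
df = df.astype(int)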

In [7]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class_Second,class_Third,alone_True
0,0,3,1,22,1,0,7,0,0,1,0
1,1,1,0,38,1,0,71,1,0,0,0
2,1,3,0,26,0,0,7,0,0,1,1
3,1,1,0,35,1,0,53,0,0,0,0
4,0,3,1,35,0,0,8,0,0,1,1
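As a hint for the question above: logistic regression needs numeric inputs, so a text column such as class cannot be passed in directly. The sketch below shows what `pd.get_dummies` with `drop_first=True` does on a tiny made-up frame (the `toy` DataFrame is an assumption for illustration, not part of the Titanic data).

```python
import pandas as pd

# Toy frame with one categorical column (values chosen for illustration)
toy = pd.DataFrame({"class": ["First", "Second", "Third", "First"]})

# drop_first=True drops the first level ("First"), which becomes the baseline:
# a row with class_Second == 0 and class_Third == 0 must be "First"
encoded = pd.get_dummies(toy, columns=["class"], drop_first=True).astype(int)

print(encoded)
```

Dropping one level avoids a redundant column, since its value is fully determined by the others.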


## Train Test Split



In [8]:
X = df.drop('survived', axis=1)
y = df['survived']

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [10]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
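The scaler is fitted on the training set only and then reused on the test set, so no information from the test rows leaks into preprocessing. A minimal sketch on made-up arrays (`train` and `test` here are stand-ins, not the Titanic features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrices standing in for X_train and X_test
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(scaler.mean_)   # mean learned from the training rows
print(test_scaled)    # test value expressed in training-set units
```

Calling `fit_transform` on the test set instead would compute a separate mean and standard deviation, putting the two sets on different scales.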

## Fitting the model to the training set

In [14]:
model = LogisticRegression(C=0.1)
model.fit(X_train_scaled, y_train)
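In `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` means stronger regularization and smaller coefficients. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for illustration; the exact numbers will differ on the Titanic data).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration (not the Titanic data)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

norms = {}
for C in [0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    norms[C] = float(np.linalg.norm(clf.coef_))  # overall size of the weights
    print(f"C={C:<5} coefficient norm = {norms[C]:.3f}")
```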

## Predicting values

In [15]:
y_pred = model.predict(X_test_scaled)
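`predict` returns hard class labels. If you need the underlying probabilities, for example to use a cut-off other than 0.5, `predict_proba` exposes them. A minimal sketch on synthetic data (the 0.7 threshold is a hypothetical choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]     # estimated P(class 1) for each row
default_labels = clf.predict(X_demo)        # equivalent to thresholding at 0.5
strict_labels = (proba >= 0.7).astype(int)  # a stricter, hypothetical cut-off

print(proba[:5].round(3))
print(default_labels[:5], strict_labels[:5])
```

Raising the threshold typically trades recall for precision, since fewer rows are labelled positive.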

## Accuracy Metrics

In [16]:
accuracy = accuracy_score(y_test, y_pred) * 100
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)

print(f"Accuracy: {accuracy:.2f}%")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"F1 Score:  {f1:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, zero_division=0))

Accuracy: 80.06%
Precision: 0.75
Recall:    0.73
F1 Score:  0.74

Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       219
           1       0.75      0.73      0.74       137

    accuracy                           0.80       356
   macro avg       0.79      0.79      0.79       356
weighted avg       0.80      0.80      0.80       356
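To see where precision, recall, and F1 come from, the sketch below recomputes them by hand from a confusion matrix. The labels are made up for illustration and are not the Titanic predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels: 3 true positives, 1 false negative,
# 1 false positive, 5 true negatives
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision_manual = tp / (tp + fp)   # of the predicted positives, how many were right
recall_manual = tp / (tp + fn)      # of the actual positives, how many were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, recall_manual, f1_manual)
```

These match what `precision_score`, `recall_score`, and `f1_score` return for the same labels.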



**See? It's much simpler and shorter with libraries like scikit-learn!**

Instead of manually coding every step (gradient descent, cost functions, and parameter updates), scikit-learn lets you focus on experimentation and results rather than boilerplate code.

With just a few lines, you get:

- Training

- Prediction

- Evaluation (accuracy, precision, recall, F1)

- Hyperparameter tuning

- Preprocessing pipelines
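The last item, preprocessing pipelines, is not demonstrated in the cells above. As a minimal sketch, `make_pipeline` can bundle the scaler and the model into one estimator. The data here is synthetic and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# fit() scales the training data and trains the model;
# score() and predict() reuse the fitted scaler automatically
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=0.1))
pipe.fit(X_tr, y_tr)

print(f"Test accuracy: {pipe.score(X_te, y_te):.2f}")
```

With a pipeline there is no separate scaled copy of the data to keep track of, which makes it harder to accidentally train or predict on unscaled features.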

## Hyperparameter tuning

Hyperparameter tuning is an integral part of improving a model's performance. Done manually, it is slow trial and error, and ML developers can spend hours on it.

scikit-learn provides search utilities such as RandomizedSearchCV and GridSearchCV that automate this: each one fits the model for many hyperparameter settings and scores every setting with cross-validation.

### Hyperparameter Tuning with RandomizedSearchCV for Logistic Regression
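RandomizedSearchCV samples a fixed number of hyperparameter combinations (n_iter) from the distributions we specify and scores each one with cross-validation. This covers a wide range of values at a much lower cost than trying every possible combination.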


In [18]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

In [19]:
log_reg = LogisticRegression(solver='liblinear')

param_dist = {
    'C': uniform(loc=0, scale=4),
    'penalty': ['l1', 'l2']
}

random_search = RandomizedSearchCV(
    estimator=log_reg,
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_train, y_train)

print(" Best Parameters:", random_search.best_params_)

best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

Fitting 5 folds for each of 50 candidates, totalling 250 fits
 Best Parameters: {'C': np.float64(0.12525316982223433), 'penalty': 'l2'}
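The `uniform(loc=0, scale=4)` entry in `param_dist` draws `C` evenly from the interval [0, 4]. Because regularization strength is often explored across orders of magnitude, a log-uniform distribution is a common alternative. The sketch below compares a few draws from each. The `loguniform` range is an assumption for illustration and was not used in the search above.

```python
from scipy.stats import loguniform, uniform

# uniform(loc=0, scale=4) draws C evenly from the interval [0, 4]
uniform_C = uniform(loc=0, scale=4).rvs(size=5, random_state=0)

# loguniform(1e-3, 1e2) spreads draws evenly across orders of magnitude,
# so small values of C are sampled about as often as large ones
log_C = loguniform(1e-3, 1e2).rvs(size=5, random_state=0)

print("uniform   :", uniform_C.round(3))
print("loguniform:", log_C.round(4))
```

Either object can be passed as a value in `param_distributions`, since RandomizedSearchCV only needs something with an `rvs` method.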


In [20]:
print("\n Evaluation Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred) * 100:.2f}%")
print(f"Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
print(f"F1 Score:  {f1_score(y_test, y_pred):.2f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


 Evaluation Metrics:
Accuracy:  80.62%
Precision: 0.77
Recall:    0.70
F1 Score:  0.74

Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.87      0.85       219
           1       0.77      0.70      0.74       137

    accuracy                           0.81       356
   macro avg       0.80      0.79      0.79       356
weighted avg       0.80      0.81      0.80       356



### Hyperparameter Tuning with GridSearchCV
GridSearchCV takes the exhaustive approach: it evaluates every combination in a parameter grid with cross-validation and keeps the best one. The grid below has 4 values of C and 2 penalties, so 8 candidates, and it ranks them by F1 score (scoring='f1') instead of accuracy.

In [21]:
from sklearn.model_selection import GridSearchCV

In [22]:
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

log_reg = LogisticRegression()

grid_search = GridSearchCV(log_reg, param_grid, cv=5, scoring='f1', verbose=1)
grid_search.fit(X_train, y_train)

print("🔍 Best Parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_

y_pred = best_model.predict(X_test)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
🔍 Best Parameters: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
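The search above is fitted on the unscaled `X_train`. One way to tune on scaled features without leaking validation folds into the scaler is to search over a pipeline, so the scaler is refitted inside each fold. This is a sketch on synthetic data. The step names `scale` and `clf` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid_demo = {
    "clf__C": [0.01, 0.1, 1, 10],
    "clf__penalty": ["l1", "l2"],
}

search = GridSearchCV(pipe, param_grid_demo, cv=5, scoring="f1")
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.2f}")
```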


In [24]:
print("Accuracy:", accuracy_score(y_test, y_pred) * 100, "%")
print("Classification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 79.7752808988764 %
Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.84      0.84       219
           1       0.74      0.73      0.74       137

    accuracy                           0.80       356
   macro avg       0.79      0.79      0.79       356
weighted avg       0.80      0.80      0.80       356



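In this run, RandomizedSearchCV gives slightly higher test accuracy (80.62%) than GridSearchCV (79.78%) and the baseline model (80.06%). The gaps are small, only two or three test samples out of 356, so treat the three models as roughly equivalent rather than declaring a clear winner. The comparison is also not like for like: the randomized search selected parameters by accuracy (the default scoring for classifiers), the grid search used scoring='f1', and both searches were fitted on the unscaled X_train, whereas the baseline used X_train_scaled.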

You can also manually experiment with hyperparameters.
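A manual experiment can be as simple as looping over a few values and comparing cross-validated scores. The sketch below does this for `C` on synthetic data. The candidate values and the `results` dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

results = {}
for C in [0.01, 0.1, 1, 10]:
    clf = LogisticRegression(C=C, max_iter=1000)
    # Mean accuracy over 5 cross-validation folds for this value of C
    results[C] = cross_val_score(clf, X_demo, y_demo, cv=5).mean()
    print(f"C={C:<5} mean CV accuracy = {results[C]:.3f}")

best_C = max(results, key=results.get)
print("Best C in this small sweep:", best_C)
```

Scoring each setting with cross-validation on the training data, rather than on the test set, keeps the test set as an untouched final check.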