<a href="https://colab.research.google.com/github/Shakesdydaa/Shakesdydaa/blob/main/model_optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install scikit-optimize



**HYPERPARAMETER TUNING**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from skopt import BayesSearchCV

In [None]:
# Load the Titanic dataset
titanic_data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

In [None]:
titanic_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
# Drop unneeded columns from the dataset
titanic_data = titanic_data.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])


In [None]:
# Convert categorical columns (Sex, Embarked) to numeric dummy variables
titanic_data = pd.get_dummies(titanic_data, drop_first=True)

In [None]:
# Check how many missing values each column has
print(titanic_data.isnull().sum())

# Impute missing values with each column's median
titanic_data.fillna(titanic_data.median(), inplace=True)


In [None]:
# Confirm that no missing values remain after imputation
titanic_data.isnull().sum()

Column,Missing values
Survived,0
Pclass,0
Age,0
SibSp,0
Parch,0
Fare,0
Sex_male,0
Embarked_Q,0
Embarked_S,0
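
The two preprocessing steps above (one-hot encoding and median imputation) can be seen on a small scale. This is a minimal sketch on a hypothetical four-row frame, not the real dataset; the values are made up for illustration, and only the `Age` column is imputed here.

```python
import pandas as pd

# Hypothetical toy frame that mimics three of the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# One-hot encode; drop_first=True drops the first category of each column
# (here "female" and "C") so the dummy columns are not redundant.
encoded = pd.get_dummies(toy, drop_first=True)

# Median imputation: the missing age becomes the median of 22, 26 and 35.
encoded["Age"] = encoded["Age"].fillna(encoded["Age"].median())

print(list(encoded.columns))
print(encoded["Age"].tolist())
```

The dropped categories are implied by the remaining columns: a row with `Embarked_Q` and `Embarked_S` both false embarked at "C".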


In [None]:
# Split the dataset into training and test sets
X = titanic_data.drop(columns=['Survived'])  # feature matrix
y = titanic_data['Survived']  # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Initialize the Logistic Regression
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
# Create the param_grid
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller = stronger)
    "penalty": ["l1", "l2"],  # Regularization type
    "solver": ["liblinear", "saga"]  # Optimizers that support L1 and L2
}

**Super Simple Hyperparameter Tuning for Logistic Regression**

C (Inverse of Regularization Strength) → Controls model complexity: smaller values mean stronger regularization and a simpler model.

Solver (Optimization Algorithm) → Decides how to find the best model.
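
To make the role of C concrete, here is a minimal sketch on synthetic data (generated with `make_classification`, an assumption made for illustration rather than the Titanic data). It fits the same model with three C values and prints the size of the learned coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

coef_norms = {}
for C in [0.01, 1, 100]:
    # Default penalty is L2; only the strength changes between runs.
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

A small C pulls the coefficients toward zero, while a large C lets them grow to fit the training data more closely.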


**What is the penalty Parameter in Logistic Regression?**

The penalty parameter in Logistic Regression selects the type of regularization, while C sets how strong it is. Regularization helps prevent overfitting by limiting how large the model's coefficients can grow.

**Types of Penalty in Logistic Regression**

L1 (Lasso) Regularization

Shrinks some coefficients to zero → Can remove unnecessary features.

Helps with feature selection.

Works with liblinear and saga solvers.

📝 Example: "penalty": "l1"

L2 (Ridge) Regularization

Shrinks coefficients towards zero but almost never makes them exactly zero.

Keeps all features but reduces their impact.

Works with every LogisticRegression solver, including liblinear, saga, lbfgs, and newton-cg.

📝 Example: "penalty": "l2"

Elastic Net (L1 + L2 Combined)

Mix of L1 and L2 → Shrinks some coefficients to zero but keeps others small.

More flexible than just L1 or L2.

Works with the saga solver only and needs an l1_ratio value to set the L1/L2 mix.

📝 Example: "penalty": "elasticnet", "l1_ratio": 0.5

No Regularization (None)

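No penalty, so the model is free to fit the training data as closely as it can.

Rarely used because it can overfit, especially when there are many features relative to the number of samples.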

📝 Example: "penalty": None
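
The difference between L1 and L2 shows up in the fitted coefficients. The sketch below uses synthetic data with many uninformative features (an assumption for illustration, not the Titanic data) and a fairly strong regularization setting, then counts how many coefficients each penalty sets exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 20 features, of which only 3 carry signal; the rest are noise.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, n_redundant=0, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for penalty in ["l1", "l2"]:
    # liblinear supports both penalties; C=0.05 is a strong regularization.
    model = LogisticRegression(penalty=penalty, C=0.05, solver="liblinear")
    model.fit(X_demo, y_demo)
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))
    print(penalty, "-> coefficients exactly zero:", zero_counts[penalty])
```

With L1, the noise features tend to drop out entirely, which is why it doubles as a feature selector.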

**Which One Should You Use?**

If you want feature selection → Use "penalty": "l1" (Lasso).

If you want stable performance → Use "penalty": "l2" (Ridge).

If you want some feature selection while still shrinking the remaining coefficients → Use "penalty": "elasticnet".

If you just want a quick model → "penalty": "l2" is the safest choice.
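
Because not every solver supports every penalty, one way to add elastic net to a search is to pass a list of dictionaries, so that each dictionary only contains valid combinations. This is a sketch, not the notebook's grid: the name `param_grid_with_enet` and the `l1_ratio` values are illustrative assumptions, and `ParameterGrid` is used only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.001, 0.01, 0.1, 1, 10, 100]

# Each dictionary is expanded separately, so invalid pairs such as
# elasticnet + liblinear are never generated.
param_grid_with_enet = [
    {"penalty": ["l1", "l2"], "solver": ["liblinear", "saga"], "C": C_values},
    {
        "penalty": ["elasticnet"],
        "solver": ["saga"],
        "l1_ratio": [0.2, 0.5, 0.8],  # hypothetical L1/L2 mixes
        "C": C_values,
    },
]

n_candidates = len(ParameterGrid(param_grid_with_enet))
print("Candidates to evaluate:", n_candidates)
```

The same list can be handed to `GridSearchCV` in place of a single dictionary.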

In [None]:
# Grid Search Method: evaluates every combination in param_grid
import time
grid_search = GridSearchCV(logreg, param_grid, cv=5)
start_time = time.time()  # timestamp before fitting
grid_search.fit(X_train, y_train)
end_time = time.time()  # timestamp after fitting
time_taken = end_time - start_time
print("Time Taken: ", time_taken)


Time Taken:  0.46903419494628906
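
The run time depends on how many models are trained. The grid above has 6 × 2 × 2 = 24 candidates, and with cv=5 that means 120 fits plus one final refit. The sketch below shows how to inspect the per-candidate results; it uses a smaller, hypothetical grid and synthetic data so that it runs on its own.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

small_grid = {"C": [0.01, 1, 100], "penalty": ["l1", "l2"]}
search = GridSearchCV(LogisticRegression(solver="liblinear"), small_grid, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per candidate, with the mean score over folds.
results = pd.DataFrame(search.cv_results_)
n_candidates = len(results)
print("Candidates:", n_candidates, "-> cross-validation fits:", n_candidates * 5)

columns = ["param_C", "param_penalty", "mean_test_score", "rank_test_score"]
print(results[columns].sort_values("rank_test_score"))
```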


**cv=5 → Cross-Validation Splits the Data**

What it does: Splits the training data into 5 roughly equal parts (folds).

How it works:

Train the model on 4 folds, validate on the remaining fold.

Repeat this 5 times, each time holding out a different fold for validation.

Average the 5 validation accuracies to get a more reliable estimate than a single split gives.
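
The fold procedure described above can be written out by hand and compared with scikit-learn's helper. This is a sketch on synthetic data (an assumption for illustration). It uses `StratifiedKFold`, which is what scikit-learn picks for classifiers when `cv` is given as an integer.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)
folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    # Train a fresh copy on 4 folds, score it on the held-out fold.
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))

# The library helper runs the same loop internally.
library_scores = cross_val_score(model, X_demo, y_demo, cv=folds)

print("Manual mean accuracy: ", round(float(np.mean(manual_scores)), 4))
print("Library mean accuracy:", round(float(library_scores.mean()), 4))
```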



In [None]:
# Show the best parameters found by the grid search
print("Best Parameters: ", grid_search.best_params_)

Best Parameters:  {'C': 0.01, 'penalty': 'l2', 'solver': 'saga'}


In [None]:
# Evaluate the best model from the grid search on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", round(accuracy*100,2), "%")

Accuracy:  79.89 %
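
In the workflow above, the scaler was fitted on the whole training set before cross-validation, so each validation fold has already influenced the scaling. One possible alternative is to put the scaler and the classifier in a `Pipeline`, which refits the scaler inside every fold. The sketch below assumes synthetic data and hypothetical step names (`scaler`, `clf`); it is not the notebook's original approach.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
pipe_grid = {"clf__C": [0.01, 0.1, 1, 10], "clf__penalty": ["l1", "l2"]}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5)
pipe_search.fit(X_tr, y_tr)

print("Best parameters:", pipe_search.best_params_)
print("Cross-validated accuracy:", round(pipe_search.best_score_, 3))
print("Held-out accuracy:", round(pipe_search.score(X_te, y_te), 3))
```

The cross-validated score and the held-out score measure different things: the first is an average over validation folds of the training data, the second is computed once on data the search never saw.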


In [None]:
### Randomized Search CV
# Samples n_iter (default 10) combinations from param_grid instead of trying all of them
randomized_search = RandomizedSearchCV(logreg, param_grid, cv=5)
randomized_search.fit(X_train, y_train)
print("Best parameters:", randomized_search.best_params_)
print("Best score:", randomized_search.best_score_)

# Optional: evaluate the best model on the test set
# best_model = randomized_search.best_estimator_
# y_pred = best_model.predict(X_test)
# accuracy = accuracy_score(y_test, y_pred)
# print("Accuracy: ", round(accuracy*100,2), "%")

Best parameters: {'solver': 'saga', 'penalty': 'l2', 'C': 0.01}
Best score: 0.8033389146065202
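
Randomized search is most useful when a parameter is drawn from a continuous distribution rather than a fixed list. The sketch below samples C from a log-uniform range using `scipy.stats.loguniform`; the range, `n_iter=15`, and the synthetic data are assumptions for illustration. Setting `random_state` makes the sampled candidates reproducible.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; not the Titanic dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

param_distributions = {
    "C": loguniform(1e-3, 1e2),  # any value between 0.001 and 100
    "penalty": ["l1", "l2"],     # lists are sampled uniformly
}

random_search_demo = RandomizedSearchCV(
    LogisticRegression(solver="liblinear"),
    param_distributions,
    n_iter=15,        # number of sampled candidates
    cv=5,
    random_state=42,  # reproducible sampling
)
random_search_demo.fit(X_demo, y_demo)

print("Best parameters:", random_search_demo.best_params_)
print("Best score:", round(random_search_demo.best_score_, 3))
```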


In [None]:
### Bayesian Search CV
# Lists in param_grid are treated as categorical choices; n_iter defaults to 50
bayesian_search = BayesSearchCV(logreg, param_grid, cv=5)
bayesian_search.fit(X_train, y_train)
print("Best parameters:", bayesian_search.best_params_)
print("Best score:", bayesian_search.best_score_)



Best parameters: OrderedDict([('C', 0.01), ('penalty', 'l2'), ('solver', 'saga')])
Best score: 0.8033389146065202
