#**Week - 4**

A brief summary of the concepts I learnt in Week 4.

#**CROSS VALIDATION**

Cross-validation is a statistical technique used in machine learning and data science to evaluate how well a model will generalize to unseen data.

Instead of splitting the dataset once into a training set and a test set, cross-validation splits the dataset into multiple parts, trains the model on some of them, and tests it on the remaining part—multiple times.



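- The most common variant is K-fold cross-validation: the data is split into K folds, and each fold serves as the test set exactly once while the other K-1 folds are used for training.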
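To make the fold idea concrete, here is a minimal pure-Python sketch of how plain K-fold could partition sample indices. The helper name `k_fold_indices` is hypothetical (it is not a scikit-learn function), and the sketch does no shuffling or stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for plain K-fold without shuffling."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current test fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


splits = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each index lands in exactly one test fold, so every sample is used for testing once and for training K-1 times.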

Below is a code implementation of cross-validation, using stratified K-fold on the Titanic dataset.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Load Titanic dataset
data = pd.read_csv("C:\\Users\\mdmar\\Downloads\\Titanic-Dataset.csv")

# Drop columns that are not useful
data = data.drop(['Name', 'Ticket', 'Cabin'], axis=1)

# Handle missing values (assign back instead of chained inplace=True, which pandas warns about)
data['Age'] = data['Age'].fillna(data['Age'].median())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])

# Encode categorical features
le = LabelEncoder()
data['Sex'] = le.fit_transform(data['Sex'])
data['Embarked'] = le.fit_transform(data['Embarked'])

# Drop remaining nulls if any
data = data.dropna()

# Features and target
X = data.drop('Survived', axis=1)
y = data['Survived']

# Function to return model score
def get_score(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Stratified K-Fold
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Score lists
scores_l = []
scores_svm = []
scores_rf = []

# Loop through folds
for train_index, test_index in folds.split(X, y):
    X_train_fold, X_test_fold = X.iloc[train_index], X.iloc[test_index]
    y_train_fold, y_test_fold = y.iloc[train_index], y.iloc[test_index]

    scores_l.append(get_score(LogisticRegression(max_iter=1000), X_train_fold, X_test_fold, y_train_fold, y_test_fold))
    scores_svm.append(get_score(SVC(), X_train_fold, X_test_fold, y_train_fold, y_test_fold))
    scores_rf.append(get_score(RandomForestClassifier(), X_train_fold, X_test_fold, y_train_fold, y_test_fold))

# Print average scores
print("Logistic Regression average score:", sum(scores_l)/len(scores_l))
print("SVM average score:", sum(scores_svm)/len(scores_svm))
print("Random Forest average score:", sum(scores_rf)/len(scores_rf))


Logistic Regression average score: 0.7968574635241302
SVM average score: 0.6430976430976431
Random Forest average score: 0.8136924803591471
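The manual fold loop above can also be written more compactly with scikit-learn's `cross_val_score`. The sketch below is a possible shorthand, not the original implementation. It uses synthetic data from `make_classification` as a stand-in (an assumption, so it runs without the Titanic CSV); the names `X_demo` and `y_demo` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); swap in the Titanic X and y from the cell above.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

# Same splitter settings as the manual loop: 3 stratified, shuffled folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# cross_val_score clones the model, fits it on each training fold and scores the held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv)

print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())
```

Looking at the per-fold scores as well as the mean gives a rough sense of how much the estimate varies between splits.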


In [2]:

# Single-split check: X_train_fold / X_test_fold still hold the last fold from the loop above
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_fold, y_train_fold)
print("Logistic Regression Accuracy:", lr.score(X_test_fold, y_test_fold))

# Support Vector Machine
svm = SVC()
svm.fit(X_train_fold, y_train_fold)
print("SVM Accuracy:", svm.score(X_test_fold, y_test_fold))

# Random Forest
rf = RandomForestClassifier(n_estimators=40)
rf.fit(X_train_fold, y_train_fold)
print("Random Forest Accuracy:", rf.score(X_test_fold, y_test_fold))

Logistic Regression Accuracy: 0.8047138047138047
SVM Accuracy: 0.6397306397306397
Random Forest Accuracy: 0.8047138047138047


#**HyperParameter Tuning**

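1. GridSearchCV: evaluates every combination of the values listed in a parameter grid.
2. RandomizedSearchCV: evaluates a fixed number (`n_iter`) of combinations sampled from the given lists or distributions.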

In [3]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


# Load the dataset
df = pd.read_csv("C:\\Users\\mdmar\\Downloads\\winequality-white.csv", sep=';')  # Note: Use sep=';' if it's from UCI
df.head()


| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


In [None]:
from sklearn.model_selection import train_test_split

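# Features and target for the wine data (assumes 'quality' is the label column)
X = df.drop('quality', axis=1)
y = df['quality']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)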

# Define model and parameters
model = SVC()
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Perform Grid Search
grid = GridSearchCV(model, param_grid, cv=5)
grid.fit(X_train, y_train)

# Best model and score
print("Best parameters:", grid.best_params_)
print("Best score:", grid.best_score_)
print("Test accuracy:", grid.score(X_test, y_test))

In [None]:
# Randomized Search
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Define model and parameter distributions
svc_model = SVC()
param_dist_svc = {
    'C': uniform(0.1, 10),  # uniform(loc, scale): samples C from [0.1, 10.1]
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}

# Randomized Search
rand_search_svc = RandomizedSearchCV(svc_model, param_dist_svc, n_iter=10, cv=5, random_state=42)
rand_search_svc.fit(X_train, y_train)

print("SVC Best Params:", rand_search_svc.best_params_)
print("SVC Best CV Score:", rand_search_svc.best_score_)
print("SVC Test Accuracy:", rand_search_svc.score(X_test, y_test))
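To compare the cost of the two searches above, the following sketch counts how many model fits each one needs during cross-validation. It only uses `itertools`, and it reuses the grid values, `cv=5` and `n_iter=10` from the cells above; the variable names are illustrative.

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}
cv_folds = 5   # cv=5 in both searches above
n_iter = 10    # n_iter=10 in the randomized search above

# Grid search: every combination of the listed values is a candidate.
grid_candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
grid_fits = len(grid_candidates) * cv_folds

# Randomized search: a fixed number of sampled candidates, whatever the size of the space.
random_fits = n_iter * cv_folds

print("Grid candidates:", len(grid_candidates))    # 3 * 3 * 2 = 18
print("Grid search CV fits:", grid_fits)           # 18 * 5 = 90
print("Randomized search CV fits:", random_fits)   # 10 * 5 = 50
```

With the default `refit=True`, both searches also refit the best candidate once on the full training set. The grid cost grows multiplicatively with every added parameter, while the randomized cost stays fixed by `n_iter`.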

#**Decision Trees and Random Forest**

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


# Load the Titanic data into df (the cells below use df, not data)
df = pd.read_csv("C:\\Users\\mdmar\\Downloads\\Titanic-Dataset.csv")
df.head()


In [6]:
# Drop text columns the model cannot use directly
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

# Fill missing values (assign back rather than using chained inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Encode categorical features as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})

# Features and target
X = df.drop('Survived', axis=1)
y = df['Survived']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)


In [None]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Random Forest Accuracy on Titanic dataset:", accuracy)


Random Forest Accuracy on Titanic dataset: 0.8268156424581006


#**Gradient Boosting**

Gradient Boosting is a powerful machine learning technique used for both regression and classification problems. It builds an ensemble of weak learners (usually decision trees) in a sequential manner, with each new model correcting the errors of the previous one.

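**How it works**
- Start with a weak model, like a small decision tree.
- Calculate the errors (residuals) between the predicted and actual values.
- Train a new model to predict those residuals.
- Add the new model to the ensemble, scaled by a learning rate so that the loss decreases.
- Repeat the process for many iterations.

Gradient boosting often reaches high accuracy on tabular data. The trade-off is that the trees are built one after another, so training is slower, and the results are sensitive to settings such as the learning rate and the number of trees.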
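The steps above can be sketched from scratch for a tiny 1-D regression problem with squared-error loss. This is a simplified illustration in pure Python, not how scikit-learn implements it; the helper names (`fit_stump`, `gradient_boost`) and the toy data are made up for the example.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (stump) that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                # start from a constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the current ensemble
        stump = fit_stump(xs, residuals)                # new weak learner fits the residuals
        stumps.append(stump)
        # Add the new learner, scaled by the learning rate.
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, preds


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

base, stumps, preds = gradient_boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
print("MSE of constant model:", initial_mse)
print("MSE after boosting:", final_mse)
```

For squared error the residuals are exactly the negative gradients of the loss, which is where the name comes from.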

In [11]:
# Example code for Gradient Boosting
# Reuses X_train, X_test, y_train, y_test from the Random Forest split above

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

#**XGBoost**

XGBoost stands for eXtreme Gradient Boosting, and it's an optimized, scalable, and regularized version of gradient boosting. It is one of the most powerful and popular algorithms in machine learning competitions and real-world applications.

**How it works**
1. Start with an initial prediction.
2. Compute the gradients of the loss (the errors) for the current model's predictions.
3. Build a new decision tree to fit these gradients.
4. Add the new tree to the model (boosting).
5. Repeat steps 2–4 for several rounds (boosting iterations).
6. Apply regularization (penalties on leaf weights and tree complexity) to limit overfitting.

It is usually faster than a plain gradient boosting implementation.

It adds built-in regularization and parallelized tree construction.
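To show where the regularization enters, here is a small sketch of the leaf weight and split gain formulas from XGBoost's documented objective, where G and H are the sums of first and second derivatives of the loss in a leaf. The function names are illustrative and this is not the library's code; `reg_lambda` and `gamma` mirror the names of the corresponding XGBoost parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logloss_grad_hess(y_true, raw_score):
    """First and second derivative of log loss with respect to the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def leaf_weight(G, H, reg_lambda):
    """Optimal leaf value: a larger reg_lambda shrinks it towards zero."""
    return -G / (H + reg_lambda)


def split_gain(G_L, H_L, G_R, H_R, reg_lambda, gamma):
    """Loss reduction from splitting a leaf; gamma is the penalty for adding a leaf."""
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma


# Toy data: four samples, initial raw score 0 (probability 0.5) for all of them.
labels = [0, 0, 1, 1]
grads_hess = [logloss_grad_hess(y, 0.0) for y in labels]

# Candidate split: first two samples go left, last two go right.
G_L = sum(g for g, _ in grads_hess[:2])
H_L = sum(h for _, h in grads_hess[:2])
G_R = sum(g for g, _ in grads_hess[2:])
H_R = sum(h for _, h in grads_hess[2:])

gain = split_gain(G_L, H_L, G_R, H_R, reg_lambda=1.0, gamma=0.0)
print("Left leaf weight:", leaf_weight(G_L, H_L, reg_lambda=1.0))
print("Right leaf weight:", leaf_weight(G_R, H_R, reg_lambda=1.0))
print("Split gain:", gain)
```

A split is only worth keeping when its gain is positive, so raising `gamma` prunes weak splits and raising `reg_lambda` makes each tree's contribution more conservative.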

In [None]:
# A small XGBoost example
import xgboost as xgb
from sklearn.metrics import accuracy_score

# use_label_encoder is omitted: recent XGBoost versions ignore it and print a warning
model = xgb.XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1, eval_metric='logloss')

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.8212290502793296


#**CatBoost**

CatBoost is a high-performance, open-source gradient boosting library developed by Yandex. It is specifically designed to handle categorical features automatically, without manual one-hot or label encoding, which is one of its main selling points compared with other boosting libraries such as XGBoost or LightGBM.

**How CatBoost Works**

CatBoost uses Gradient Boosting Decision Trees (GBDT), just like XGBoost and LightGBM, but with:

- A unique way to handle categorical variables natively (based on target statistics)

- A technique called Ordered Boosting that prevents target leakage

- Built-in handling of missing values in numeric features
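The target-statistics idea can be illustrated with a simplified sketch: each row's category is encoded using only the targets of rows that came before it, so a row never sees its own label. This is an assumption-laden toy version, not CatBoost's actual implementation (which works over random permutations); the function name, `prior_weight` and the toy data are hypothetical.

```python
def ordered_target_statistics(categories, targets, prior, prior_weight=1.0):
    """Encode each category using only the targets of earlier rows."""
    sums, counts = {}, {}
    encoded = []
    for cat, y in zip(categories, targets):
        s, n = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category.
        encoded.append((s + prior_weight * prior) / (n + prior_weight))
        # Only now is the current row's target added to the running statistics.
        sums[cat] = s + y
        counts[cat] = n + 1
    return encoded


categories = ["S", "C", "S", "S", "C"]
targets = [1, 0, 0, 1, 1]
prior = sum(targets) / len(targets)  # overall mean target, used for unseen categories

encoded = ordered_target_statistics(categories, targets, prior)
for cat, y, value in zip(categories, targets, encoded):
    print(cat, y, round(value, 4))
```

Because the encoding of a row depends only on earlier rows, the encoded feature cannot leak that row's own target into training, which is the leakage the ordered scheme is meant to avoid.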

In [None]:
!pip install catboost

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# Define CatBoost model
model = CatBoostClassifier(
    iterations=100,         # same as n_estimators in XGBoost
    depth=3,                # same as max_depth
    learning_rate=0.1,
    loss_function='Logloss',
    eval_metric='Accuracy',
    verbose=0               # suppress training output
)

# Train the model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.8156424581005587


#**AdaBoost**

AdaBoost (short for Adaptive Boosting) is one of the earliest and most popular ensemble learning techniques. It combines multiple weak learners (typically decision stumps—i.e., trees with depth 1) to form a strong classifier.

**How It Works**

- Assign a weight to each sample. Initially, all samples are weighted equally.
- Train a weak learner (e.g., a decision stump) on the weighted data.
- After each round:
  - Increase the weights of misclassified samples
  - Decrease the weights of correctly classified samples
- Train the next weak learner on the reweighted data.
- Repeat for multiple rounds.
- The final prediction is a weighted vote of all weak learners, where more accurate learners get a larger say.

It is a simple and effective method.
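The reweighting step can be sketched as a single round of the classic discrete AdaBoost update for labels in {-1, +1}. This is a textbook-style illustration; scikit-learn's `AdaBoostClassifier` may differ in details, and the function name and toy predictions are made up.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost update: returns the learner's vote (alpha) and the new sample weights.

    Assumes the weights sum to 1 and the weighted error is strictly between 0 and 0.5.
    """
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, y, h in zip(weights, y_true, y_pred) if y != h)
    # Vote strength of this learner: the lower the error, the larger alpha.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # y * h is +1 for correct predictions (weight shrinks) and -1 for mistakes (weight grows).
    new_weights = [w * math.exp(-alpha * y * h) for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(new_weights)
    return alpha, [w / total for w in new_weights]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]   # the weak learner gets only the last sample wrong
weights = [0.2] * 5           # equal starting weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print("alpha:", alpha)
print("new weights:", new_weights)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.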

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

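# Base weak learner: decision stump (this is also AdaBoost's default)
base_model = DecisionTreeClassifier(max_depth=1)

# AdaBoost model (the parameter is `estimator` in scikit-learn >= 1.2, `base_estimator` before that)
model = AdaBoostClassifier(estimator=base_model, n_estimators=100, learning_rate=0.1)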

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.7932960893854749


#**KNN**

K-Nearest Neighbors (KNN) is a simple, intuitive, and non-parametric machine learning algorithm used for classification and regression.


**How KNN Works**

When making a prediction for a new data point:
- Measure the distance (usually Euclidean) between the new point and every point in the training data.
- Select the K closest points (the "neighbors").
- For classification, assign the most common label among the K neighbors.
- For regression, predict the average of the K neighbors' values.

It works well with small datasets, since every prediction scans the whole training set.

It is simple to implement and understand. Because it relies on distances, features should be on comparable scales.
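As an illustration of the steps above, here is a minimal from-scratch KNN classifier in pure Python. It is a sketch for small data, not a replacement for `KNeighborsClassifier`; the function name and the toy points are invented for the example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = sorted(
        (math.dist(point, query), label) for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and take the most common one.
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, query=(2, 2)))
print(knn_predict(train_points, train_labels, query=(6, 5)))
```

There is no training step beyond storing the data, and every prediction scans all training points, which is why the method gets slow on large datasets.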

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Create model
model = KNeighborsClassifier(n_neighbors=5)

# Train
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.6983240223463687
