# Table of Contents

## I. Data Preparation for Modelling

1. [Types of Encoding](#types-of-encoding)
2. [Data Transformation Techniques](#data-transformation-techniques)
3. [Principal Component Analysis](#principal-component-analysis)
4. [Imbalance Treatment Methods](#imbalance-treatment-methods)
5. [Train Test Split](#train-test-split)

## II. Model Building

1. [Regression Algorithm (OLS)](#regression-algorithm-ols)
2. [Classification Algorithms](#classification-algorithms)
    - [Logistic Regression](#logistic-regression)
    - [SVM](#svm)
    - [KNN](#knn)
    - [Naive Bayes](#naive-bayes)
    - [Decision Tree](#decision-tree)
3. [Clustering Algorithms](#clustering-algorithms)
    - [K-Means](#k-means)
    - [Hierarchical](#hierarchical)
    - [DBSCAN](#dbscan)

## III. Ensembling Techniques

1. [Bagging (Random Forest)](#bagging-random-forest)
2. [Boosting](#boosting)
    - [AdaBoost](#adaboost)
    - [GradBoost](#gradboost)
    - [XGBoost](#xgboost)

## IV. Model Evaluation

1. [Confusion Matrix](#confusion-matrix)
2. [Accuracy Metrics](#accuracy-metrics)
3. [Model Validation Techniques](#model-validation-techniques)
    - [k-Fold Cross Validation](#k-fold-cross-validation)
    - [LOOCV](#loocv)

## V. Regularization

1. [Ridge](#Ridge)
2. [Lasso](#Lasso)
3. [ElasticNet](#ElasticNet)


# I. Data Preparation for Modelling

## <a id="types-of-encoding"></a>Types of Encoding

#### Label Encoding

Converts categorical values into integer values.

Assigns a unique integer to each category. Example for "Color": Red → 0, Green → 1, Blue → 2.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category'] = le.fit_transform(df['category'])

#### One-Hot Encoding

Creates binary columns for each category in a feature.

Example: For a categorical feature "Color" with values "Red," "Green," and "Blue," one-hot encoding creates three new columns: "Color_Red," "Color_Green," and "Color_Blue." If a sample is "Green," the encoded vector would be [0, 1, 0].

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
encoded_features = ohe.fit_transform(df[['category']])
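To make the [0, 1, 0] example concrete, here is a minimal NumPy sketch of the same idea. The toy `colors` list and the alphabetical column order are assumptions for illustration, and this is not how `OneHotEncoder` is implemented internally.

```python
import numpy as np

colors = ["Red", "Green", "Blue", "Green"]  # hypothetical toy data

# Sort the unique categories so the column order is deterministic.
categories = sorted(set(colors))  # ['Blue', 'Green', 'Red']
index = {cat: i for i, cat in enumerate(categories)}

# Each row of the identity matrix is the one-hot vector for one category.
one_hot = np.eye(len(categories), dtype=int)[[index[c] for c in colors]]

print(categories)
print(one_hot)
```

Every row contains exactly one 1, located in the column of that sample's category.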


#### Binary Encoding

Converts categories into binary code and then splits the digits into separate columns.

Example: For the same "Color" feature with values "Red," "Green," and "Blue," binary encoding might map "Red" to 01, "Green" to 10, and "Blue" to 11. This would result in 2 columns for binary representation.

In [None]:
import category_encoders as ce
# Encode only the 'category' column and keep the other columns of df
be = ce.BinaryEncoder(cols=['category'])
df = be.fit_transform(df)
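The two steps described above (integer code, then binary digits split into columns) can be sketched in plain Python. The toy `colors` list, the choice to start codes at 1, and the helper `to_bits` are assumptions for illustration; `category_encoders` may assign codes differently.

```python
colors = ["Red", "Green", "Blue", "Green"]  # hypothetical toy data

# Step 1: give each category an integer code, in order of first appearance.
codes = {cat: i for i, cat in enumerate(dict.fromkeys(colors), start=1)}

# Step 2: work out how many binary digits the largest code needs.
n_bits = max(codes.values()).bit_length()

# Step 3: write each code in binary and split the digits into columns.
def to_bits(code, width):
    return [int(digit) for digit in format(code, f"0{width}b")]

encoded = [to_bits(codes[c], n_bits) for c in colors]

print(codes)
print(encoded)
```

Three categories fit in two columns here, whereas one-hot encoding would need three; the saving grows with the number of categories.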

#### Target Encoding

Replaces each category with the mean of the target variable for that category. 

Useful in situations where there is a strong relationship between categorical features and the target variable.

In [None]:
import category_encoders as ce
te = ce.TargetEncoder()
df['category'] = te.fit_transform(df['category'], df['target'])


#### Frequency Encoding

Replaces each category with its frequency in the dataset.

Useful when the frequency of categories is significant.

In [None]:
df['category_freq'] = df['category'].map(df['category'].value_counts())

#### Ordinal Encoding

Converts categories into integers but respects the order of the categories.

Assigns integers based on an inherent order. Example: Small → 1, Medium → 2, Large → 3.

In [None]:
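from sklearn.preprocessing import OrdinalEncoder
# Pass the order explicitly (values taken from the example above);
# otherwise categories are sorted alphabetically. Codes start at 0.
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['category'] = oe.fit_transform(df[['category']]).ravel()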


## <a id="data-transformation-techniques"></a>Data Transformation Techniques

#### Normalization (Min-Max Scaling)

Rescales the data to fit within a specific range, typically [0, 1].

Useful when the data needs to be brought to a common scale without distorting differences in the ranges of values.

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)


#### Standardization (Z-score Normalization)

Rescales data to have a mean of 0 and a standard deviation of 1.

Useful when features have different units or scales, especially for algorithms that are sensitive to feature scale, such as SVM, KNN, and PCA.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = scaler.fit_transform(df)
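Min-max scaling and standardization each reduce to a one-line formula. The sketch below applies both to a hypothetical one-dimensional array `x`; it is meant to show the arithmetic, not to replace the scikit-learn scalers.

```python
import numpy as np

x = np.array([10.0, 20.0, 25.0, 50.0])  # hypothetical feature values

# Min-max scaling: (x - min) / (max - min) maps the data onto [0, 1].
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std gives mean 0 and standard deviation 1.
# np.std defaults to the population formula (ddof=0), which StandardScaler also uses.
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard.mean(), x_standard.std())
```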


#### Log Transformation

Applies the logarithm function to the data.

Useful for reducing skewness and handling exponential growth.

In [None]:
import numpy as np
# log1p computes log(1 + x), so zeros in the column are handled
df['log_transformed'] = np.log1p(df['column'])

#### Square Root Transformation

Applies the square root function to the data.

Useful for stabilizing variance and making the data more normally distributed.

In [None]:
df['sqrt_transformed'] = np.sqrt(df['column'])

#### Power Transformation (Box-Cox Transformation)

Applies a power transformation to stabilize variance and make the data more normally distributed.

Useful for dealing with skewed data.

In [None]:
from scipy.stats import boxcox
# Box-Cox needs strictly positive input; the +1 shift assumes the column is non-negative
df['boxcox_transformed'], _ = boxcox(df['column'] + 1)

#### Exponential Transformation

Applies the exponential function to the data.

Can be used to reverse log transformations or to handle data that grows exponentially.

In [None]:
df['exp_transformed'] = np.exp(df['column'])

#### Quantile Transformation

Transforms features to follow a uniform or normal distribution.

Useful for making non-normal data more normally distributed.

In [None]:
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(output_distribution='normal')
df_quantile_transformed = qt.fit_transform(df)


#### Rank Transformation

Replaces values with their rank in the data.

Useful for non-parametric methods and handling ordinal data.

In [None]:
df['rank_transformed'] = df['column'].rank()


#### Binning (Discretization)

Converts continuous data into discrete bins.

Useful for reducing the effect of minor observation errors and managing outliers.

In [None]:
import pandas as pd
df['binned'] = pd.cut(df['column'], bins=3, labels=["low", "medium", "high"])


#### Polynomial Transformation

Generates polynomial and interaction features.

Useful for modeling non-linear relationships.

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
df_poly = poly.fit_transform(df)


#### Yeo-Johnson Transformation

A power transformation technique that works on both positive and negative values.

Useful for achieving normality in skewed data.

In [None]:
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
df_yeojohnson_transformed = pt.fit_transform(df)


#### Feature Scaling (Robust Scaling)

Adjusts the scale of the features so that they are comparable.

Keeps features with large ranges from dominating the model. The robust scaler shown here centers on the median and scales by the interquartile range (IQR), making it more resistant to outliers than min-max scaling or standardization.

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled = scaler.fit_transform(df)


#### Logistic Transformation

Applies the logistic (sigmoid) function to the data.

Useful for squashing real-valued scores into the range (0, 1), for example to interpret them as probabilities.

In [None]:
df['logistic_transformed'] = 1 / (1 + np.exp(-df['column']))

## <a id="principal-component-analysis"></a>Principal Component Analysis


A statistical procedure that uses an orthogonal transformation to convert a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components.

In [None]:
#Standardize the Data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_standardized = scaler.fit_transform(df)

#Compute the Covariance Matrix
import numpy as np
covariance_matrix = np.cov(df_standardized.T)

#Calculate Eigenvalues and Eigenvectors (eigh is meant for symmetric matrices such as a covariance matrix)
eig_vals, eig_vecs = np.linalg.eigh(covariance_matrix)

#Sort Eigenvalues and Eigenvectors
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda x: x[0], reverse=True)

#Form the Principal Components
num_components = 2  # or any other number of desired components
selected_vectors = np.hstack([eig_pairs[i][1].reshape(df_standardized.shape[1], 1) for i in range(num_components)])

#Transform the Original Dataset
df_pca = df_standardized.dot(selected_vectors)


In [None]:
# By using a library

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
df_pca = pca.fit_transform(df_standardized)


#### Explained Variance

The proportion of the dataset's variance that each principal component explains.

Helps determine the number of principal components to retain.

In [None]:
explained_variance = pca.explained_variance_ratio_
import matplotlib.pyplot as plt
plt.plot(range(1, len(explained_variance) + 1), explained_variance, marker='o')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance')
plt.show()
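One common way to turn explained variance into a component count is to keep the smallest number of components whose cumulative ratio reaches a chosen threshold. The ratios and the 0.90 threshold below are made-up values for illustration; in practice the array would come from `pca.explained_variance_ratio_` after fitting PCA with all components.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted from largest to smallest.
explained_variance = np.array([0.55, 0.25, 0.12, 0.05, 0.03])

# Running total of variance explained by the first 1, 2, 3, ... components.
cumulative = np.cumsum(explained_variance)

threshold = 0.90  # assumed target; choose what suits the problem

# argmax returns the first position where the condition is True.
n_components = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative)
print(n_components)
```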

## <a id="imbalance-treatment-methods"></a>Imbalance Treatment Methods



#### Oversampling

Using SMOTE (Synthetic Minority Over-sampling Technique)

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

# Generate a sample dataset with class imbalance
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1,
                           flip_y=0, n_features=20, n_clusters_per_class=1,
                           n_samples=1000, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Print the class distribution before applying SMOTE
print("Class distribution before SMOTE:", Counter(y_train))

# Initialize SMOTE
smote = SMOTE(random_state=42)

# Fit and transform the training data using SMOTE
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Print the class distribution after applying SMOTE
print("Class distribution after SMOTE:", Counter(y_resampled))
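The core idea behind SMOTE is interpolation: a synthetic sample is placed somewhere on the line segment between a minority point and one of its minority-class neighbours. The sketch below is a simplified illustration with a hypothetical `synthesize` helper that always uses the single nearest neighbour; the `imblearn` implementation chooses among several nearest neighbours and handles many more details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in 2-D.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5]])

def synthesize(points, rng):
    # Pick a random minority point.
    i = rng.integers(len(points))
    # Find its nearest minority neighbour, excluding the point itself.
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf
    j = np.argmin(dists)
    # Place the new sample at a random position on the segment between them.
    lam = rng.random()
    return points[i] + lam * (points[j] - points[i])

new_point = synthesize(minority, rng)
print(new_point)
```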


#### Undersampling

Reducing the number of majority class samples.

In [None]:
from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler()
X_res, y_res = undersample.fit_resample(X, y)

# X: features from your dataset, an array
# y: labels in dataset, an array

# II. Model Building

## <a id="regression-algorithm-ols"></a>Regression Algorithm (OLS)

Linear regression is used to model the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # error on the held-out test set

## <a id="classification-algorithms"></a>Classification Algorithms

### <a id="logistic-regression"></a>Logistic Regression

Logistic regression is used for binary classification problems. It models the probability that a given input belongs to a particular class.

In [None]:
from sklearn.linear_model import LogisticRegression

# Assuming binary target variable
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

# max_iter: maximum number of iterations the solver is allowed to perform
# Default value: max_iter defaults to 100.
# If the solver does not converge within that number of iterations, you may receive a ConvergenceWarning.
# In such cases, increase max_iter.

### <a id="svm"></a>SVM

SVM is used for classification and regression tasks. 

It finds the hyperplane that best separates the classes in the feature space.

In [None]:
from sklearn.svm import SVC

classifier = SVC()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

### <a id="knn"></a>KNN

KNN is a simple, non-parametric algorithm used for both classification and regression tasks. 

It classifies a data point based on the majority class among its k-nearest neighbors in the feature space.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# n_neighbors: number of neighbors to consider for the KNN classifier
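The majority-vote rule can be written out by hand in a few lines. This is a minimal sketch with made-up data (`X_toy`, `y_toy`) and a hypothetical `knn_predict` helper, using Euclidean distance and no tie-breaking logic; `KNeighborsClassifier` is considerably more capable.

```python
import numpy as np
from collections import Counter

# Hypothetical 2-D training data with two classes.
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]])
y_toy = np.array([0, 0, 0, 1, 1])

def knn_predict(x, X, y, k=3):
    # Euclidean distance from x to every training point.
    distances = np.linalg.norm(X - x, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

pred_a = knn_predict(np.array([1.1, 1.0]), X_toy, y_toy)
pred_b = knn_predict(np.array([4.0, 4.0]), X_toy, y_toy)
print(pred_a, pred_b)
```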

### <a id="naive-bayes"></a>Naive Bayes

Naive Bayes is a probabilistic classifier based on Bayes' theorem.

It assumes that the predictors are independent of one another given the class.

In [None]:
from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

### <a id="decision-tree"></a>Decision Tree

Decision trees are used for both classification and regression tasks. 

They split the data into subsets based on the value of input features, creating a tree-like model of decisions.

In [None]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

## <a id="clustering-algorithms"></a>Clustering Algorithms

### <a id="k-means"></a>K-Means

Partitions data into k clusters, where each data point belongs to the cluster with the nearest mean.

Suitable for large datasets where the clusters are roughly spherical.

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_
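The "nearest mean" rule leads to a simple two-step loop: assign each point to its closest center, then move each center to the mean of its points. The sketch below runs that loop on made-up data with a fixed number of iterations and hand-picked starting centers; these are simplifying assumptions, and `KMeans` uses smarter initialization and a convergence check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated blobs in 2-D.
X_toy = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                   rng.normal(5.0, 0.3, size=(20, 2))])

k = 2
# Start from one point of each blob (an assumption that keeps the sketch short).
centers = X_toy[[0, 20]].copy()

for _ in range(10):
    # Assignment step: each point joins the cluster with the nearest mean.
    dists = np.linalg.norm(X_toy[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center moves to the mean of its assigned points.
    centers = np.array([X_toy[labels == j].mean(axis=0) for j in range(k)])

print(labels)
print(centers)
```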


### <a id="hierarchical"></a>Hierarchical

Builds a hierarchy of clusters using either agglomerative (bottom-up) or divisive (top-down) approaches.

Useful for creating dendrograms to visualize the data’s nested structure.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
linked = linkage(X, method='ward')
dendrogram(linked)


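### <a id="dbscan"></a>DBSCAN

Groups points that are closely packed together and marks points that lie alone in low-density regions as outliers.

Effective for datasets with noise and clusters of arbitrary shape. Because a single `eps` sets the density threshold, it is less suited to clusters whose densities differ widely.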

In [None]:
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X)


# III. Ensembling Techniques

## <a id="bagging-random-forest"></a>Bagging (Random Forest)

An extension of bagging that constructs multiple decision trees and merges them to get a more accurate and stable prediction.

Handles both classification and regression tasks well.

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)


## <a id="boosting"></a>Boosting

Trains multiple models (usually of the same type) sequentially, with each new model focusing on the errors of the previous ones, to reduce bias and improve accuracy.

Useful for turning weak learners that underfit into a strong combined model.

### <a id="adaboost"></a>AdaBoost

A type of boosting that adjusts the weights of incorrectly classified instances, focusing more on difficult cases.

Useful for improving the performance of weak classifiers.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
ada = AdaBoostClassifier(n_estimators=50)
ada.fit(X_train, y_train)


### <a id="gradboost"></a>GradBoost

Sequentially builds models that minimize a loss function, typically using decision trees.

Suitable for both classification and regression tasks.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=100)
gbc.fit(X_train, y_train)


### <a id="xgboost"></a>XGBoost

An optimized implementation of gradient boosting designed for speed and performance.

Highly efficient and scalable for large datasets.

In [None]:
import xgboost as xgb
xgb_model = xgb.XGBClassifier(n_estimators=100)
xgb_model.fit(X_train, y_train)

# IV. Model Evaluation 

## <a id="confusion-matrix"></a>Confusion Matrix

A table that summarizes the performance of a classification model by showing the true vs. predicted classifications.

Useful for evaluating the accuracy of a classification.

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true, y_pred)
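For a binary problem, the four cells of the table can be counted by hand. The labels below are made up for illustration; the nested list follows the layout scikit-learn uses for labels 0 and 1, with true classes as rows and predicted classes as columns.

```python
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical model predictions

pairs = list(zip(y_true, y_pred))

tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives

# Rows are true classes, columns are predicted classes.
cm_manual = [[tn, fp],
             [fn, tp]]
print(cm_manual)
```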


## <a id="accuracy-metrics"></a>Accuracy Metrics

Metrics used to evaluate the performance of a model, including accuracy, precision, recall, F1 score, etc.

Provides different perspectives on model performance.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
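Each of these metrics is a simple ratio of confusion-matrix counts for the binary case. The counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 40, 10, 20, 30

# Share of all predictions that are correct.
accuracy_manual = (tp + tn) / (tp + fp + fn + tn)

# Of everything predicted positive, how much really is positive.
precision_manual = tp / (tp + fp)

# Of everything that really is positive, how much was found.
recall_manual = tp / (tp + fn)

# Harmonic mean of precision and recall.
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(accuracy_manual, precision_manual, recall_manual, f1_manual)
```

Accuracy alone can look healthy on imbalanced data, which is why precision and recall are usually reported alongside it.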


## <a id="model-validation-techniques"></a>Model Validation Techniques

### <a id="k-fold-cross-validation"></a>k-Fold Cross Validation

Splits the data into k subsets and trains the model k times, each time using a different subset as the validation set and the remaining as the training set.

Provides a more reliable estimate of model performance.

In [None]:
from sklearn.model_selection import KFold, cross_val_score
kf = KFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=kf)
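To see what the splitting looks like, the sketch below builds the train and validation indices for each fold by hand. The sample count and `k` are arbitrary, and the folds are consecutive blocks, which is also how `KFold` behaves when shuffling is off.

```python
import numpy as np

n_samples, k = 10, 5  # hypothetical dataset size and number of folds
indices = np.arange(n_samples)

# Split the indices into k consecutive, nearly equal blocks.
folds = np.array_split(indices, k)

for i, val_idx in enumerate(folds):
    # Every fold except the i-th one is used for training.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train={train_idx}, validation={val_idx}")
```

Each observation appears in exactly one validation block, so every data point is used for validation once and for training k - 1 times.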


### <a id="loocv"></a>LOOCV

A special case of k-fold cross-validation where k equals the number of observations, i.e., each instance is used once as a validation set.

Useful for small datasets to maximize training data.

In [None]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)

# V. Regularization

## <a id="Ridge"></a>Ridge

Adds a penalty equal to the sum of the squared coefficients to the loss function, discouraging large coefficients.

Reduces overfitting by constraining the model.

In [None]:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)
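The effect of the squared penalty is visible in the closed-form solution, w = (XᵀX + αI)⁻¹Xᵀy. The sketch below uses synthetic data and a hypothetical `ridge_fit` helper with no intercept term; scikit-learn's `Ridge` fits an unpenalized intercept by default, so its numbers would differ slightly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical regression data with known coefficients plus a little noise.
X_toy = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_toy = X_toy @ true_w + rng.normal(scale=0.1, size=50)

def ridge_fit(X, y, alpha):
    # Closed form: solve (X^T X + alpha * I) w = X^T y, with no intercept.
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_fit(X_toy, y_toy, alpha=0.0)     # alpha = 0 is ordinary least squares
w_ridge = ridge_fit(X_toy, y_toy, alpha=10.0)  # a larger alpha shrinks the coefficients

print(np.linalg.norm(w_ols), np.linalg.norm(w_ridge))
```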


## <a id="Lasso"></a>Lasso

Adds a penalty equal to the sum of the absolute values of the coefficients to the loss function, leading to sparse models.

Useful for feature selection by shrinking some coefficients to zero.

In [None]:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
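Why the absolute-value penalty produces exact zeros can be seen from the soft-thresholding operator. In the special case of orthonormal features, the lasso solution is the least-squares solution passed through this operator; the sketch below shows only the operator on made-up coefficients, with a hypothetical `soft_threshold` helper, and is not scikit-learn's solver.

```python
import numpy as np

def soft_threshold(z, t):
    # Shrink every value toward zero by t; anything within [-t, t] becomes zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

coefs = np.array([3.0, -0.2, 0.05, -1.5])  # hypothetical unpenalized coefficients
shrunk = soft_threshold(coefs, 0.5)

print(shrunk)
```

Small coefficients are set to zero while large ones are only reduced, which is the behaviour that makes lasso usable for feature selection.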


## <a id="ElasticNet"></a>ElasticNet

Combines L1 and L2 regularization, adding both penalties to the loss function.

Useful when there are multiple features with high collinearity.

In [None]:
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=1.0, l1_ratio=0.5)
elastic_net.fit(X_train, y_train)
