# 🤖 Machine Learning Algorithms Summary

This notebook summarizes the main *supervised machine learning* algorithms, focusing on **classification** (though **regression** is also mentioned).

We’ll use the Titanic dataset as a practical example to compare:

✅ How they work

✅ When to use them

✅ Their advantages and disadvantages

✅ Which hyperparameters are important

✅ How well they predict survival


## Dataset preparation

In [11]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.preprocessing import StandardScaler

# Load dataset
df = sns.load_dataset('titanic')

# Preprocessing similar to previous notebooks: dropping irrelevant columns, imputing missing values and encoding categorical variables
df = df.drop(columns=['deck', 'embark_town', 'alive'])
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['sex'] = df['sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['embarked'], drop_first=True)

# Create additional features
df['family_size'] = df['sibsp'] + df['parch'] + 1
df['is_alone'] = (df['family_size'] == 1).astype(int)

# Scale numerical variables (mean 0, variance 1)
scaler = StandardScaler()
df[['age', 'fare']] = scaler.fit_transform(df[['age', 'fare']])

# Select the most relevant variables
features = ['pclass', 'sex', 'age', 'fare', 'family_size', 'is_alone', 'embarked_Q', 'embarked_S']
X = df[features]
y = df['survived'] # The goal of the model is to predict whether a given person survives or not

# Split the data into training and testing sets (80%-20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
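
One caveat about the cell above: the `StandardScaler` is fitted on the full dataset before the train/test split, so the test rows contribute to the mean and standard deviation used for scaling. The effect is probably small here, but a leak-free variant would learn the scaling statistics from the training rows only. The following is a minimal pure-Python sketch of that idea; the toy ages and the helper names `fit_scaler` and `transform` are hypothetical and not part of the notebook.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn the mean and population standard deviation from training values only."""
    return mean(values), pstdev(values)

def transform(values, mu, sigma):
    """Standardize values with statistics that were learned elsewhere."""
    return [(v - mu) / sigma for v in values]

# Hypothetical toy data: ages in a training split and a test split
train_ages = [22.0, 38.0, 26.0, 35.0, 54.0]
test_ages = [2.0, 27.0]

mu, sigma = fit_scaler(train_ages)               # statistics come from the training split only
train_scaled = transform(train_ages, mu, sigma)
test_scaled = transform(test_ages, mu, sigma)    # test rows reuse the training statistics

print(f"train mean after scaling: {mean(train_scaled):.3f}")
print(f"test values after scaling: {[round(v, 3) for v in test_scaled]}")
```

In scikit-learn the same effect is usually obtained by calling `scaler.fit_transform` on the training features and `scaler.transform` on the test features, or by placing the scaler and the model in a `Pipeline` so that cross-validation refits the scaler inside each fold.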

## Classification models

Since the target variable is 1 or 0, we'll start by applying classification models. We will test the following models:

- Logistic Regression
- K-Nearest Neighbors (KNN)
- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- Gradient Boosting

**A. Logistic Regression**

In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train the model and make predictions
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

### Generate a report with various classification metrics
'''
Precision: Of the ones I predicted as positive, how many actually were?
Recall: Of the actual positives, how many did I correctly predict?
F1-score: Harmonic mean of precision and recall
Support: Number of actual samples in each class (from y_test)

Accuracy: % of total correct predictions
Macro/Weighted avg: simple/weighted averages across classes
'''
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.86      0.83       105
           1       0.78      0.72      0.75        74

    accuracy                           0.80       179
   macro avg       0.80      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179
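
To make the metric definitions concrete, here is a small sketch that computes them by hand from confusion-matrix counts. The counts are an assumption: the notebook never prints the confusion matrix, so they were chosen only to be consistent with the rounded report for class 1 (survived).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for class 1, chosen to be consistent with the rounded report
tp, fp, fn, tn = 53, 15, 21, 90

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Because F1 is a harmonic mean, it is pulled towards the lower of the two values, so a model cannot reach a high F1 by maximizing only precision or only recall.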



Logistic regression computes a linear combination of the features and passes it through the sigmoid function to estimate the probability of a class (e.g., survival = 1). Each coefficient tells you how much the log-odds of the event increase or decrease per unit change in that feature.

**When to use it?**
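- When features have a roughly linear relationship with the log-odds of the class.
- When you need a fast, interpretable model that can serve as a baseline.

⚠️ **Limitations:** Cannot model non-linear relationships unless you add transformed or interaction features.

Important coefficients: features with coefficients (`model.coef_`) farthest from 0 (e.g., sex, pclass, fare). Coefficient sizes are only comparable when the features are on similar scales.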
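
The mechanics described above can be sketched in a few lines. The coefficients and intercept below are hypothetical values picked for illustration, not the ones fitted in the notebook, and `predict_proba` here is a plain helper function rather than the scikit-learn method of the same name.

```python
import math

def sigmoid(z):
    """Map a real-valued score (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, coefs, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(z)

# Hypothetical coefficients for (pclass, sex); NOT the values fitted in the notebook
coefs = [-1.0, 2.5]
intercept = 0.5

p_male = predict_proba([3, 0], coefs, intercept)    # third class, sex = 0 (male)
p_female = predict_proba([3, 1], coefs, intercept)  # third class, sex = 1 (female)

# exp(coefficient) is the odds ratio for a one-unit increase in that feature
odds_ratio_sex = math.exp(coefs[1])

print(f"P(survive | male, 3rd class)   = {p_male:.3f}")
print(f"P(survive | female, 3rd class) = {p_female:.3f}")
print(f"odds ratio for sex             = {odds_ratio_sex:.2f}")
```

The coefficient acts on the odds, not directly on the probability: the same coefficient produces a large probability change near 0.5 and a small one near 0 or 1.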

**B. K-Nearest Neighbors (KNN)**

In [13]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.89      0.86       105
           1       0.82      0.76      0.79        74

    accuracy                           0.83       179
   macro avg       0.83      0.82      0.82       179
weighted avg       0.83      0.83      0.83       179



KNN measures the distance between points and assigns each new point the majority class among its nearest neighbors (for example, if most of your 5 nearest neighbors survived, you probably did too). It produces no coefficients to interpret, but it works well when groups are clearly separated in feature space.

**When to use it?**
- When you have little data and the feature space has geometric meaning.
- When you want to skip an explicit training phase: KNN is a lazy learner that stores the data and only compares new points with their neighbors at prediction time.

⚠️ **Limitations:** Sensitive to feature scaling and noise. Slow with large datasets.

Hyperparameters: `n_neighbors`, `weights`, `metric`. There are no learned weights, but the variables that most affect the distance have the most influence (which is why scaling the data is important).
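
The sketch below shows the voting rule and why scaling matters. It is a from-scratch illustration on hypothetical toy data; it uses min-max scaling for simplicity, whereas the notebook standardizes with `StandardScaler`, and the helper names are made up for this example.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def min_max_scale(points, reference):
    """Rescale each coordinate to [0, 1] using the min and max found in `reference`."""
    columns = list(zip(*reference))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs)) for p in points]

# Hypothetical toy data: (age in years, fare in thousands) -> survived
train_X = [(20, 0.01), (22, 0.02), (24, 0.01), (60, 0.09), (62, 0.08), (64, 0.09)]
train_y = [0, 0, 0, 1, 1, 1]
query = (23, 0.09)  # young passenger who paid a high fare

raw_prediction = knn_predict(train_X, train_y, query, k=3)

scaled_train = min_max_scale(train_X, reference=train_X)
scaled_query = min_max_scale([query], reference=train_X)[0]
scaled_prediction = knn_predict(scaled_train, train_y, scaled_query, k=3)

print("prediction on raw features:   ", raw_prediction)
print("prediction on scaled features:", scaled_prediction)
```

On the raw features the age column dominates the distance because its numeric range is far larger, so the fare is effectively ignored; after rescaling both features contribute and the vote changes.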

**C. Decision trees**

In [14]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.91      0.84       105
           1       0.84      0.64      0.72        74

    accuracy                           0.80       179
   macro avg       0.81      0.77      0.78       179
weighted avg       0.80      0.80      0.79       179



Decision trees split the data based on feature conditions. They are highly interpretable, allowing you to see the rules used for each prediction.
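
As an illustration of how a split is chosen, the sketch below scores two candidate splits with Gini impurity, which is the default criterion of `DecisionTreeClassifier`. The rows and labels are hypothetical toy data, and the helper functions are written for this example only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, 0.5 for a 50/50 binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(rows, labels, feature_index, threshold):
    """Weighted Gini impurity of the two children after splitting on feature <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature_index] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature_index] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: (sex, pclass) -> survived
rows = [(1, 1), (1, 2), (1, 3), (0, 1), (0, 2), (0, 3), (0, 3), (1, 3)]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

parent = gini(labels)
by_sex = split_impurity(rows, labels, feature_index=0, threshold=0)
by_pclass = split_impurity(rows, labels, feature_index=1, threshold=2)

print(f"impurity before split: {parent:.3f}")
print(f"after sex <= 0:        {by_sex:.3f}")
print(f"after pclass <= 2:     {by_pclass:.3f}")
```

The tree greedily picks the split with the lowest weighted impurity (here the one on sex) and repeats the procedure inside each child node.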

**When to use them?**

- When you need interpretability and clear decision rules.
- Can easily capture non-linear relationships.

⚠️ **Limitations:** Prone to overfitting if not pruned (`max_depth`, `min_samples_leaf`).

Important features:
Those most used for splitting and reducing uncertainty (`model.feature_importances_`).


In [15]:
# To see the importance of the variables we can use 'feature_importances_'
feature_importance = pd.DataFrame({
    'feature': X_train.columns,
    'importance': model.feature_importances_
}).sort_values(by='importance', ascending=False)

print(feature_importance)

       feature  importance
1          sex    0.579608
0       pclass    0.200498
3         fare    0.081064
2          age    0.076410
4  family_size    0.048520
7   embarked_S    0.013900
5     is_alone    0.000000
6   embarked_Q    0.000000


**D. Random Forest**

In [16]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.87      0.86       105
           1       0.81      0.80      0.80        74

    accuracy                           0.84       179
   macro avg       0.83      0.83      0.83       179
weighted avg       0.84      0.84      0.84       179



Random Forest trains many decision trees, each on a bootstrap sample of the rows and a random subset of features, and combines their predictions to improve accuracy and reduce overfitting.

**When to use it?**
- When you want accuracy and robustness against overfitting.
- Very effective with mixed variables (numerical + categorical).

⚠️ **Limitations:** Less interpretable than a single tree, and slower and more memory-hungry to train and predict.

Hyperparameters: `n_estimators`, `max_depth`
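
The two ingredients of the method, bootstrap sampling and vote aggregation, can be sketched as follows. This is a simplified illustration with hypothetical row indices and votes; no trees are actually trained, and the function names are made up for this example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (some repeat, some are left out)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Combine per-tree class predictions into a single forest prediction."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(42)  # fixed seed, playing the role of random_state
rows = list(range(10))   # hypothetical row indices of a tiny training set

# Each tree would be trained on its own bootstrap sample
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for i, sample in enumerate(samples):
    print(f"tree {i}: trained on rows {sorted(sample)}")

# Hypothetical votes from five trees for one passenger
tree_votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(tree_votes)
print("forest prediction:", forest_prediction)
```

Note that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes; the hard vote is used here only because it is easier to read.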


**E. SVM (Support Vector Machine)**

In [17]:
from sklearn.svm import SVC

model = SVC(kernel='rbf', C=1, gamma='scale')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.88      0.84       105
           1       0.80      0.72      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179



SVM finds the hyperplane (decision boundary) that separates the classes with the widest margin. With a non-linear kernel such as RBF it can also handle classes that are not linearly separable, at the cost of being harder to interpret.

**When to use it?**
- Good for high-dimensional datasets with clear margins.
- Powerful with few observations and scaled data.

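⚠️ **Limitations:** Computationally expensive on large datasets. Less interpretable.

Hyperparameters: `C`, `kernel`, `gamma`. `C` controls the trade-off between a wide margin and misclassified training points, `kernel` sets the shape of the decision boundary, and `gamma` controls how far the influence of a single training point reaches.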
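
To give a feel for `gamma`, here is a minimal sketch of the RBF kernel, which measures the similarity between two points as exp(-gamma * squared distance). The two points are hypothetical, and `rbf_kernel` is a hand-written helper rather than a library call.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF similarity: 1 for identical points, decaying towards 0 with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two hypothetical passengers described by scaled (age, fare)
a = (0.0, 0.0)
b = (1.0, 1.0)

similarities = {gamma: rbf_kernel(a, b, gamma) for gamma in (0.1, 1.0, 10.0)}
for gamma, value in similarities.items():
    print(f"gamma={gamma:>4}: similarity={value:.4f}")
```

With a small `gamma`, distant points still count as similar and the boundary is smooth; with a large `gamma`, each point only influences its immediate surroundings, which makes the boundary more flexible but also more prone to overfitting.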

**F. Gradient Boosting**

In [18]:
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.89      0.84       105
           1       0.81      0.69      0.74        74

    accuracy                           0.80       179
   macro avg       0.81      0.79      0.79       179
weighted avg       0.80      0.80      0.80       179



Gradient Boosting builds trees sequentially, where each new tree corrects the errors of the ensemble built so far. It is powerful and flexible but harder to interpret without special tools.

**When to use it?**
- When predictive performance matters more than training time or interpretability.
- When relationships are complex, although it can overfit noisy data if the number of trees and the learning rate are not tuned.

⚠️ **Limitations:** Requires tuning of many hyperparameters and is slower to train.

Important variables: Again, available through `.feature_importances_`. You can also use techniques like SHAP for deeper explainability.
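
The "each tree corrects the previous errors" idea can be sketched from scratch. The example below is a simplification under stated assumptions: it uses squared-error loss on a one-dimensional regression problem with decision stumps, whereas `GradientBoostingClassifier` fits regression trees to the gradient of a classification loss. The data and function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds, learning_rate=0.5):
    """Squared-loss boosting: every new stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical toy data with a step-like, non-linear pattern
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

errors = {rounds: mse(gradient_boost(xs, ys, rounds), xs, ys) for rounds in (1, 5, 20)}
for rounds, error in errors.items():
    print(f"{rounds:>2} rounds: training MSE = {error:.4f}")
```

The training error keeps falling as rounds are added, which is also why `n_estimators` and `learning_rate` need tuning: past some point the extra trees fit noise rather than signal.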

### Model comparison

In [19]:
from sklearn.metrics import accuracy_score

models = {
    'Logistic Regression': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'Gradient Boosting': GradientBoostingClassifier()
}

# Compare the accuracy of each model used
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    print(f"{name}: {acc:.3f}")

Logistic Regression: 0.799
KNN: 0.832
Decision Tree: 0.788
Random Forest: 0.838
SVM: 0.810
Gradient Boosting: 0.804


**Cross-validation** splits the data into several folds, trains the model on all folds but one, and tests it on the held-out fold. The process is repeated until every fold has served as the test set, so every observation is used for both training and testing and the performance estimate depends less on one particular split.

Why is it important?
- It gives a more robust measure of the model's performance than a single train-test split.
- It helps detect overfitting or underfitting, because the model is evaluated on multiple subsets.
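
The fold mechanics can be sketched without any library. This is a plain, unshuffled k-fold split on hypothetical row indices; `k_fold_indices` is a helper written for this example.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Hypothetical dataset of 10 rows split into 5 folds
folds = list(k_fold_indices(n_samples=10, k=5))
for i, (train, test) in enumerate(folds):
    print(f"fold {i}: test={test} train={train}")
```

When `cross_val_score` receives a classifier and an integer `cv`, scikit-learn uses stratified folds, which additionally keep the class proportions similar in every fold; the sketch above omits that step.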

In [20]:
print("Cross-validated accuracy scores (5-fold):\n")

for name, model in models.items():
    # Compute cross-validated accuracy (5 folds by default)
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    print(f"{name}: Mean={scores.mean():.3f}, Std={scores.std():.3f}")

Cross-validated accuracy scores (5-fold):

Logistic Regression: Mean=0.794, Std=0.017
KNN: Mean=0.800, Std=0.019
Decision Tree: Mean=0.787, Std=0.023
Random Forest: Mean=0.802, Std=0.026
SVM: Mean=0.825, Std=0.019
Gradient Boosting: Mean=0.833, Std=0.025


### Conclusions

| Model               | Pros                                  | Cons                           |
|---------------------|---------------------------------------|--------------------------------|
| Logistic Regression | Simple, interpretable                 | Cannot capture non-linearity   |
| KNN                 | Easy, no training required            | Slow, sensitive to scaling     |
| Decision Tree       | Interpretable, captures non-linearity | Overfitting without pruning    |
| Random Forest       | Robust, accurate                      | Less interpretable             |
| SVM                 | Accurate in high-dimensional spaces   | Slow, less interpretable       |
| Gradient Boosting   | Very accurate                         | More expensive, complex tuning |

The best model depends on the problem, but understanding how each works is the first step toward a good solution. 🎯


## Regression model

We can reuse the same dataset for regression by changing the target variable: instead of a class, the goal is now a continuous variable, such as the `fare` field.

In [21]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Define the new target variable
y_reg = df['fare']  # Change to a continuous target
X_reg = df[['pclass', 'sex', 'age', 'family_size', 'is_alone', 'embarked_Q', 'embarked_S']]

# Split the data in the same way
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

# Linear regression model (one of the simplest)
reg_model = LinearRegression()
reg_model.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_reg = reg_model.predict(X_test_reg)

# Evaluate using regression metrics (here MSE and R²)
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R² Score: {r2:.2f}")

Mean Squared Error (MSE): 0.37
R² Score: 0.41


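As we can see, the regression model is not good enough. Because `fare` was standardized during preparation, the MSE of 0.37 is expressed in squared standard deviations rather than in currency, and an R² of 0.41 means the model explains only about 41% of the variance of the target.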

Therefore, we will explore two more models to test two hypotheses:
- A: The model is too simple and fails to capture the underlying relationships
- B: We don’t have enough features or data to model the target variable effectively

In [22]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Select a different set of variables.
y_reg = df['fare']  # Target variable
X_reg = df[['pclass', 'sex', 'age', 'family_size', 'parch', 'embarked_Q', 'embarked_S']]
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train_reg, y_train_reg)

# Predict
y_pred_reg = rf_regressor.predict(X_test_reg)

# Evaluate
mse = mean_squared_error(y_test_reg, y_pred_reg)
r2 = r2_score(y_test_reg, y_pred_reg)

print(f"Random Forest Regressor MSE: {mse:.2f}")
print(f"Random Forest Regressor R²: {r2:.2f}")

Random Forest Regressor MSE: 0.68
Random Forest Regressor R²: -0.08


In [23]:
from sklearn.ensemble import GradientBoostingRegressor

gb_regressor = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
gb_regressor.fit(X_train_reg, y_train_reg)

# Predict
y_pred_gb = gb_regressor.predict(X_test_reg)

# Evaluate
mse_gb = mean_squared_error(y_test_reg, y_pred_gb)
r2_gb = r2_score(y_test_reg, y_pred_gb)

print(f"Gradient Boosting Regressor MSE: {mse_gb:.2f}")
print(f"Gradient Boosting Regressor R²: {r2_gb:.2f}")

Gradient Boosting Regressor MSE: 0.53
Gradient Boosting Regressor R²: 0.15


In both cases, the model performs worse than the linear regression. The Random Forest even gets an R² below 0, which means it predicts worse than simply using the mean fare for every passenger.
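
To see why R² can be negative, it helps to compute it by hand: R² compares the model's squared error with that of a constant prediction equal to the mean of the target. The values below are hypothetical, and `r2_score_manual` is a hand-written helper that follows the same definition as `r2_score`.

```python
def r2_score_manual(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, where SS_tot is measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical target values and three sets of predictions
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5, 2.5, 2.5, 2.5]  # always predict the mean
good_model = [1.1, 1.9, 3.2, 3.8]
bad_model = [4.0, 3.0, 2.0, 1.0]      # systematically wrong

r2_baseline = r2_score_manual(y_true, mean_baseline)
r2_good = r2_score_manual(y_true, good_model)
r2_bad = r2_score_manual(y_true, bad_model)

print(f"mean baseline: R² = {r2_baseline:.2f}")
print(f"good model:    R² = {r2_good:.2f}")
print(f"bad model:     R² = {r2_bad:.2f}")
```

An R² of 0 therefore marks the point where a model is no better than the mean, and anything below it signals that the model has learned patterns that do not carry over to the evaluation data.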

Thus, we can conclude that with the current variables, we **cannot build a useful regression model**. In the future, we could try:

- Reviewing which features are more strongly related to the target.
- Doing feature engineering (creating new variables, normalizing, or scaling existing ones).
- Trying more robust models.
- Expanding the dataset.

## (APPENDIX A) GLM: Generalized Linear Models
GLMs (Generalized Linear Models) are a flexible extension of linear/logistic regression that allow modeling different types of response variables using appropriate distributions and link functions.

In [24]:
import statsmodels.api as sm

# Drop the one-hot encoded `embarked` columns, keeping the result in a new variable so X_train stays intact
X_train_glm = X_train.drop(columns=['embarked_Q', 'embarked_S'])

# Add a constant term
# In linear and GLM models, the intercept (or constant term) represents the baseline prediction when all variables are zero
X_glm = sm.add_constant(X_train_glm)

# Use the Binomial family, as this is equivalent to performing logistic regression (the goal is to predict a binary variable)
glm_model = sm.GLM(y_train, X_glm, family=sm.families.Binomial())
glm_results = glm_model.fit()

print(glm_results.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:               survived   No. Observations:                  712
Model:                            GLM   Df Residuals:                      705
Model Family:                Binomial   Df Model:                            6
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -317.93
Date:                Wed, 11 Jun 2025   Deviance:                       635.86
Time:                        06:37:34   Pearson chi2:                     711.
No. Iterations:                     5   Pseudo R-squ. (CS):             0.3505
Covariance Type:            nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const           1.7466      0.456      3.833      

The summary includes:
* Coefficients → how much each variable shifts the log-odds of the outcome
* Standard error → precision of each coefficient estimate
* z-value → coefficient divided by its standard error (for hypothesis testing)
* P-value → whether the coefficient is significantly different from 0
* Confidence intervals (95% CI) → range where the true coefficient is likely to fall
* Deviance, Pearson Chi2 → model fit metrics
* Pseudo R² → a measure of how well the model fits (analogous to regression R², but not directly comparable)
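
The relationship between these columns can be checked by hand. The sketch below recomputes the z-value, a two-sided p-value and a 95% interval for the `const` row, using the coefficient and standard error as printed in the summary above and assuming the usual normal approximation. `wald_summary` is a hypothetical helper, not a statsmodels function.

```python
import math

def wald_summary(coef, std_err, z_crit=1.96):
    """z-value, two-sided p-value and 95% CI for one coefficient (normal approximation)."""
    z = coef / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    ci = (coef - z_crit * std_err, coef + z_crit * std_err)
    return z, p_value, ci

# Values of the `const` row as printed in the summary above
z, p_value, (ci_low, ci_high) = wald_summary(coef=1.7466, std_err=0.456)

print(f"z = {z:.3f}")
print(f"p = {p_value:.5f}")
print(f"95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

The recomputed z differs from the printed 3.833 in the third decimal because the summary rounds the standard error before displaying it.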

**When to use it?**
- When strong statistical interpretability is needed
- When the project is scientific or academic
- When testing hypotheses about specific variables

Other families include Poisson (for counting events, e.g., number of purchases) and Gamma (for positive continuous values such as costs or durations).
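
What changes between families is mainly the link function that connects the linear predictor to the mean of the response. The sketch below illustrates this for the logit link (the default for Binomial) and the log link (the default for Poisson); the coefficients and the observation are hypothetical, and the Gamma family is left out to keep the example short.

```python
import math

# Inverse link functions: map the linear predictor eta to the mean of the response
INVERSE_LINKS = {
    "binomial (logit link)": lambda eta: 1.0 / (1.0 + math.exp(-eta)),  # probability in (0, 1)
    "poisson (log link)": lambda eta: math.exp(eta),                    # expected count > 0
}

def linear_predictor(features, coefs, intercept):
    """eta = intercept + sum(coef * feature), shared by every GLM family."""
    return intercept + sum(c * x for c, x in zip(coefs, features))

# Hypothetical coefficients and one observation
eta = linear_predictor(features=[1.0, 2.0], coefs=[0.4, -0.3], intercept=0.1)

means = {family: inverse_link(eta) for family, inverse_link in INVERSE_LINKS.items()}
for family, value in means.items():
    print(f"{family}: eta={eta:.2f} -> mean={value:.3f}")
```

The linear part is identical in both cases; only the transformation applied to it, and the distribution assumed around the resulting mean, depend on the family.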