<a href="https://github.com/YOUR-USERNAME/YOUR-REPOSITORY" target="_blank" 
   style="display: inline-flex; align-items: center; background-color: #24292e; color: white; 
          padding: 10px 15px; border-radius: 6px; text-decoration: none; font-family: Arial, sans-serif; 
          font-size: 16px; font-weight: bold;">
    <img src="https://github.githubassets.com/images/modules/logos_page/GitHub-Mark.png" 
         alt="GitHub Logo" width="30" style="margin-right: 10px;">
    Go to the Repository
</a>


<table style="border: none; border-collapse: collapse; width: 100%; padding: 15px;">
    <tr>
        <td style="vertical-align: middle; padding: 15px;">
            <p style="font-size: 24px; font-weight: bold; color: #0030A1; margin: 5px 0;">
                Hyperparameter Optimization
            </p>
            <p style="font-size: 18px; color: #0030A1; margin: 5px 0;">
                Data Science &amp; AI
            </p>
            <p style="font-size: 16px; font-style: italic; color: #555; margin: 10px 0;">
                Sebastián Reyes • 2024
            </p>
        </td>
    </tr>
</table>


---
## <font color='264CC7'> Introduction </font>

Throughout this notebook, we apply hyperparameter optimization to a k-nearest neighbors (k-NN) classifier, comparing an exhaustive grid search with a randomized search.

In [1]:
# Libraries to be imported
import kagglehub
import pandas as pd
import matplotlib.pyplot as plt
import joblib  # To save the model

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.neighbors import KNeighborsClassifier

--- 
## <font color='264CC7'> Classification </font>

### <font color='264CC7'> Data Preprocessing </font>


In [2]:
# Download the dataset
path = kagglehub.dataset_download("imakash3011/customer-personality-analysis")

# Load the dataset
df = pd.read_csv(f"{path}/marketing_campaign.csv", sep="\t")

# Data Exploration
print("🔍 Initial Data Exploration:\n")

# First records with display
print("📊 First 5 Records:")
display(df.head())


Downloading from https://www.kaggle.com/api/v1/datasets/download/imakash3011/customer-personality-analysis?dataset_version_number=1...


100%|██████████| 62.0k/62.0k [00:00<00:00, 815kB/s]

Extracting files...
🔍 Initial Data Exploration:

📊 First 5 Records:





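| | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |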


In [3]:
# Selection of relevant variables
variables = [
    "Income", "MntWines", "MntFruits", "MntMeatProducts",
    "MntFishProducts", "MntSweetProducts", "MntGoldProds",
    "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases",
    "NumStorePurchases", "NumWebVisitsMonth"
]

target = "Response"

# Initial cleaning: Removal of rows with missing values
df = df.dropna(subset=variables + [target])

# Split into features and labels
X = df[variables]
y = df[target]

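# Scaling the variables
# Note: the scaler is fitted on the full dataset before the split, so test-set
# statistics leak into training; fitting it on the training rows only avoids this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)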

# Split the dataset into training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")


Training set size: (1772, 12)
Test set size: (444, 12)
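
The preprocessing cell fits `StandardScaler` on all rows before splitting, so the test rows influence the scaling statistics. Below is a minimal sketch of one way to avoid that, assuming synthetic data from `make_classification` in place of the marketing dataset. The data and the names `X_demo`, `y_demo`, and `pipeline` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 12 numeric features (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=12, weights=[0.85, 0.15], random_state=42
)

# Split first, so the test rows are never seen while fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training rows only and reuses it at predict time
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=9)),
])
pipeline.fit(X_tr, y_tr)

test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

If such a pipeline is passed to `GridSearchCV` as the estimator, the k-NN parameters are addressed with the step prefix (for example `knn__n_neighbors`), and the scaler is refitted inside each cross-validation fold.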



### <font color='264CC7'> Model </font>


<div style="background-color: #edf1f8; border-color: #264CC7; border-left: 5px solid #264CC7; padding: 0.5em;">

<ul>
  <li>Show the model's hyperparameters.</li>
  <li>Explain the meaning of at least 4 hyperparameters.</li>
  <li>Select the hyperparameters you want to optimize.</li>
</ul>
</div>
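
For the first task, one way to list the hyperparameters is the estimator's `get_params()` method. This is a small sketch on a freshly constructed `KNeighborsClassifier` with default settings; the variable name `default_params` is illustrative.

```python
from sklearn.neighbors import KNeighborsClassifier

# get_params() returns every constructor hyperparameter with its current value
default_params = KNeighborsClassifier().get_params()

for name, value in sorted(default_params.items()):
    print(f"{name}: {value}")
```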


In [4]:
# Initial model with k=9
knn_optimal = KNeighborsClassifier(n_neighbors=9)
knn_optimal.fit(X_train, y_train)
y_test_pred = knn_optimal.predict(X_test)

# Show model metrics
print("📊 Initial model metrics:")
print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.2f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))

📊 Initial model metrics:
Accuracy: 0.87

Confusion Matrix:
[[376   1]
 [ 57  10]]

Classification Report:
              precision    recall  f1-score   support

           0       0.87      1.00      0.93       377
           1       0.91      0.15      0.26        67

    accuracy                           0.87       444
   macro avg       0.89      0.57      0.59       444
weighted avg       0.87      0.87      0.83       444
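
The reported metrics can be re-derived from the confusion matrix. The sketch below does that arithmetic by hand and also computes, as an added reference point that is not part of the original output, the accuracy of always predicting class 0.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
tn, fp = 376, 1
fn, tp = 57, 10
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # 386 / 444
precision_pos = tp / (tp + fp)        # 10 / 11
recall_pos = tp / (tp + fn)           # 10 / 67

# Reference point (added here): accuracy of always predicting the majority class 0
majority_baseline = (tn + fp) / total  # 377 / 444

print(f"Accuracy:            {accuracy:.2f}")           # 0.87
print(f"Precision (class 1): {precision_pos:.2f}")      # 0.91
print(f"Recall (class 1):    {recall_pos:.2f}")         # 0.15
print(f"Majority baseline:   {majority_baseline:.2f}")  # 0.85
```

Read this way, the 0.87 accuracy is only about two points above the majority-class baseline, while recall for class 1 stays at 0.15.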



### Hyperparameter Optimization with GridSearch

#### Meaning of 4 k-NN Model Hyperparameters

1. **`n_neighbors`**:
   - Specifies the number of nearest neighbors that vote on the class of a new point (default: `5`).
   - A low value makes the model more sensitive to noise, while a high value smooths the decision boundary and can underfit.

2. **`weights`**:
   - Determines how neighbors are weighted for the final decision: `'uniform'` gives every neighbor the same weight, while `'distance'` weights each neighbor by the inverse of its distance, so closer neighbors count more.

3. **`metric`**:
   - Specifies the distance function used to compare points. Some common values include `'euclidean'`, `'manhattan'`, and `'minkowski'` (the default).

4. **`p`**:
   - The power parameter of the `'minkowski'` metric: `p=1` is equivalent to the Manhattan distance and `p=2` to the Euclidean distance.
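
To make the role of `p` concrete, here is a small pure-Python sketch of the Minkowski distance. The helper `minkowski_distance` and the two sample points are illustrative and are not scikit-learn's implementation.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski_distance(point_a, point_b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(point_a, point_b, p=2)  # sqrt(9 + 16) = 5

print(f"p=1 (Manhattan): {manhattan}")
print(f"p=2 (Euclidean): {euclidean}")
```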

---

#### Selected Hyperparameters for Optimization

To optimize the k-NN model, the following hyperparameters will be tested:

1. **`n_neighbors`**:
   - Values to try: `[5, 7, 9, 11, 13]`.

2. **`weights`**:
   - Weighting methods: `['uniform', 'distance']`.

3. **`metric`**:
   - Distance metrics: `['euclidean', 'manhattan', 'minkowski']`.

4. **`p`**:
   - Only applies when using `'minkowski'`.
   - Values to try: `[1, 2]`.
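
Because `p` only applies to `'minkowski'`, a single flat grid pairs it with every metric and evaluates redundant combinations. The sketch below counts the candidates for the grid listed above and for a possible alternative that varies `p` only for `'minkowski'`. The names `split_grid` and `count_candidates` are hypothetical; a list of dictionaries is a form that `GridSearchCV` accepts for `param_grid`.

```python
from math import prod

# Flat grid as listed above: every metric is combined with every p
full_grid = {
    "n_neighbors": [5, 7, 9, 11, 13],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],
    "p": [1, 2],
}

# Alternative (assumption): a list of grids, so p is varied only for 'minkowski'
split_grid = [
    {
        "n_neighbors": [5, 7, 9, 11, 13],
        "weights": ["uniform", "distance"],
        "metric": ["euclidean", "manhattan"],
    },
    {
        "n_neighbors": [5, 7, 9, 11, 13],
        "weights": ["uniform", "distance"],
        "metric": ["minkowski"],
        "p": [1, 2],
    },
]


def count_candidates(grid):
    """Number of combinations in one dictionary-style grid."""
    return prod(len(values) for values in grid.values())


full_count = count_candidates(full_grid)                    # 5 * 2 * 3 * 2 = 60
split_count = sum(count_candidates(g) for g in split_grid)  # 20 + 20 = 40

print(f"Flat grid:  {full_count} candidates")
print(f"Split grid: {split_count} candidates")
```

Even the split version still repeats work, since `'minkowski'` with `p=1` or `p=2` reproduces the Manhattan and Euclidean distances.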

---


### <font color='264CC7'> Optimization by GridSearch </font>

In [5]:
param_grid = {
    "n_neighbors": [5, 7, 9, 11, 13],  # Values for the number of neighbors
    "weights": ["uniform", "distance"],  # Weighting methods
    "metric": ["euclidean", "manhattan", "minkowski"],  # Distances
    "p": [1, 2]  # Parameter p for 'minkowski' (1: Manhattan, 2: Euclidean)
}

# Set up and run Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,  # Number of cross-validation folds
    verbose=1,
    n_jobs=-1  # Use all available cores to speed up
)

grid_search.fit(X_train, y_train)

# Show optimization results
print("\nBest hyperparameters found:")
print(grid_search.best_params_)

print(f"\nBest cross-validation accuracy: {grid_search.best_score_:.2f}")

# Evaluate the optimized model on the test set
best_model = grid_search.best_estimator_
y_test_pred_optimized = best_model.predict(X_test)

# Show metrics of the optimized model
print("\n📊 Metrics of the optimized model on the test set:")
print(f"Accuracy: {accuracy_score(y_test, y_test_pred_optimized):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred_optimized))


Fitting 5 folds for each of 60 candidates, totalling 300 fits

Best hyperparameters found:
{'metric': 'manhattan', 'n_neighbors': 13, 'p': 1, 'weights': 'distance'}

Best cross-validation accuracy: 0.86

📊 Metrics of the optimized model on the test set:
Accuracy: 0.87

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       377
           1       0.79      0.16      0.27        67

    accuracy                           0.87       444
   macro avg       0.83      0.58      0.60       444
weighted avg       0.86      0.87      0.83       444



### <font color='264CC7'> Optimization by RandomSearch </font>

In [6]:
# Hyperparameter configuration for RandomizedSearchCV
param_distributions = {
    "n_neighbors": [3, 5, 7, 9, 11, 13, 15],  # Values for the number of neighbors
    "weights": ["uniform", "distance"],  # Weighting methods
    "metric": ["euclidean", "manhattan", "minkowski"],  # Distances
    "p": [1, 2, 3, 4, 5]  # Parameter p for 'minkowski'
}

# Set up and run RandomizedSearchCV with 25 iterations and 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=KNeighborsClassifier(),
    param_distributions=param_distributions,
    n_iter=25,  # Number of combinations to try
    scoring="accuracy",
    cv=5,  # Number of cross-validation folds
    verbose=1,
    random_state=42,
    n_jobs=-1  # Use all available cores to speed up
)

random_search.fit(X_train, y_train)

# Show optimization results
print("\nBest hyperparameters found:")
print(random_search.best_params_)

print(f"\nBest cross-validation accuracy: {random_search.best_score_:.2f}")

# Evaluate the optimized model on the test set
best_model = random_search.best_estimator_
y_test_pred_optimized = best_model.predict(X_test)

# Show metrics of the optimized model
print("\n📊 Metrics of the optimized model on the test set:")
print(f"Accuracy: {accuracy_score(y_test, y_test_pred_optimized):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred_optimized))

Fitting 5 folds for each of 25 candidates, totalling 125 fits

Best hyperparameters found:
{'weights': 'distance', 'p': 4, 'n_neighbors': 15, 'metric': 'minkowski'}

Best cross-validation accuracy: 0.86

📊 Metrics of the optimized model on the test set:
Accuracy: 0.87

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       377
           1       0.83      0.15      0.25        67

    accuracy                           0.87       444
   macro avg       0.85      0.57      0.59       444
weighted avg       0.86      0.87      0.83       444
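
To put `n_iter=25` in context, the following sketch counts how many combinations the randomized search space contains and what share of them is sampled. The variable names `total_combinations` and `coverage` are illustrative.

```python
from math import prod

# Search space used by the randomized search above
param_distributions = {
    "n_neighbors": [3, 5, 7, 9, 11, 13, 15],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan", "minkowski"],
    "p": [1, 2, 3, 4, 5],
}
n_iter = 25

total_combinations = prod(len(v) for v in param_distributions.values())  # 7 * 2 * 3 * 5 = 210
coverage = n_iter / total_combinations

print(f"{n_iter} of {total_combinations} combinations sampled ({coverage:.1%})")
```

When every entry is a list, as here, `RandomizedSearchCV` samples combinations without replacement, so the 25 candidates are distinct.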



### <font color='264CC7'> Model Saving </font>

In [7]:
# Configuration of the simplest model (the best parameters found by GridSearchCV)
simple_model_params = {'metric': 'manhattan', 'n_neighbors': 13, 'p': 1, 'weights': 'distance'}

# Create and train the model with the simplest parameters
simple_knn_model = KNeighborsClassifier(
    n_neighbors=simple_model_params['n_neighbors'],
    weights=simple_model_params['weights'],
    metric=simple_model_params['metric'],
    p=simple_model_params['p']
)

# Training with the entire training set
simple_knn_model.fit(X_train, y_train)

# Evaluation on the test set
y_test_pred_simple = simple_knn_model.predict(X_test)

# Show model results
print("\n📊 Metrics of the simplest model on the test set:")
print(f"Accuracy: {accuracy_score(y_test, y_test_pred_simple):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred_simple))


📊 Metrics of the simplest model on the test set:
Accuracy: 0.87

Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.99      0.93       377
           1       0.79      0.16      0.27        67

    accuracy                           0.87       444
   macro avg       0.83      0.58      0.60       444
weighted avg       0.86      0.87      0.83       444



In [8]:
simple_model_filename = "Simple knn model.pkl"
joblib.dump(simple_knn_model, simple_model_filename)
print(f"\nSimplest model saved as '{simple_model_filename}'.")


Simplest model saved as 'Simple knn model.pkl'.
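
The saved classifier expects inputs that were already scaled with the fitted `StandardScaler`, which is not stored in the `.pkl` file. A minimal sketch of one alternative, assuming synthetic data and a hypothetical file name, is to persist the scaler and the classifier together as a pipeline and check that the reloaded object reproduces the predictions.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the marketing dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=42)

# Bundle scaler and classifier so raw features can be passed at prediction time
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=13, weights="distance", metric="manhattan"),
)
model.fit(X_demo, y_demo)

# Save to a temporary directory and load the pipeline back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "knn_pipeline_demo.pkl")  # hypothetical file name
    joblib.dump(model, model_path)
    restored_model = joblib.load(model_path)

predictions_match = bool((model.predict(X_demo) == restored_model.predict(X_demo)).all())
print(f"Restored model reproduces predictions: {predictions_match}")
```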
