# Exercise for pipelines and hyperparameter tuning

- Classification task based on the wine dataset (`load_wine` from scikit-learn).
- The task is to compare the available models and pick the best one.
- Use k-fold cross-validation for evaluation.
- Optimize each model with hyperparameter tuning.
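
The evaluation step can be sketched as follows. This is a minimal illustration rather than the exercise solution: the use of `StratifiedKFold` with shuffling, the seed `random_state=0`, and `max_iter=1000` are assumptions made here for reproducibility and convergence.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold
# and never sees the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Assumed settings: 5 stratified folds, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"fold accuracies: {scores}")
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.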

## Typical hyperparameter ranges:
---
### Logistic Regression:
- C (Inverse of regularization strength): Common values lie on a logarithmic scale, such as 0.001, 0.01, 0.1, 1, 10, 100. Smaller values mean stronger regularization.
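
A logarithmic grid like the one above can be generated rather than typed out. The sketch below uses plain Python; the variable names are illustrative only, and `numpy.logspace(-3, 2, num=6)` is a common alternative.

```python
import math

# Six values spaced by a factor of 10, from 10^-3 to 10^2.
c_values = [10.0 ** exponent for exponent in range(-3, 3)]

# On a log scale the spacing is uniform: each step in log10 is 1.
log_steps = [math.log10(b) - math.log10(a) for a, b in zip(c_values, c_values[1:])]

print(c_values)
print(log_steps)
```

Searching in powers of ten covers several orders of magnitude with few candidates; a finer grid around the best value can follow in a second pass.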

### Random Forest:
- n_estimators (Number of trees in the forest): Usually ranges from 10 to 200, incremented by 10 or 50.
- max_features (The number of features to consider when looking for the best split): Can be sqrt, log2, or a fraction (0.1, 0.2, ..., 0.9).
- max_depth (Maximum depth of the tree): Can range from 10 to 100 or None (unlimited depth).
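
Combining several of these ranges multiplies the search cost, which the following sketch makes concrete. The specific values picked from each range, the fold count, and the `classifier__` prefix (which assumes the forest sits in a pipeline step named `classifier`) are assumptions for illustration.

```python
from itertools import product

# Hypothetical grid built from the ranges listed above. The "classifier__" prefix
# assumes the estimator sits in a Pipeline step named "classifier".
rf_grid = {
    "classifier__n_estimators": [10, 50, 100, 200],
    "classifier__max_features": ["sqrt", "log2", 0.5],
    "classifier__max_depth": [None, 10, 50, 100],
}

# Grid search tries every combination of the listed values.
combinations = list(product(*rf_grid.values()))
n_folds = 5  # assumed number of cross-validation folds
n_fits = len(combinations) * n_folds

print(len(combinations), "combinations,", n_fits, "model fits")
```

With 4 x 3 x 4 = 48 combinations and 5 folds, this grid already requires 240 model fits, so it is worth keeping each list short.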

### Support Vector Machine (SVM):
- C (Regularization parameter): As with logistic regression, usually searched on a logarithmic scale, such as 0.1, 1, 10, 100.
- gamma (Kernel coefficient for 'rbf', 'poly' and 'sigmoid'; ignored by the linear kernel): Common values are 0.001, 0.01, 0.1, 1, 10, 100.
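
Since gamma only affects some kernels, one option is to pass `GridSearchCV` a list of sub-grids so that gamma is searched only where it matters. The sketch below is one possible setup, not the exercise's reference solution; the choice of kernels and the step names `scaler` and `classifier` are assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])

# A list of dicts defines separate sub-grids, so gamma is only searched
# for the kernel that actually uses it.
svc_grid = [
    {"classifier__kernel": ["linear"], "classifier__C": [0.1, 1, 10, 100]},
    {
        "classifier__kernel": ["rbf"],
        "classifier__C": [0.1, 1, 10, 100],
        "classifier__gamma": [0.001, 0.01, 0.1, 1, 10, 100],
    },
]

search = GridSearchCV(pipeline, svc_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

This evaluates 4 linear candidates plus 24 RBF candidates instead of crossing every kernel with every gamma value.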

In [4]:
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# Load the wine dataset
X, y = load_wine(return_X_y=True)

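# Define the classifiers and their respective hyperparameters for tuning.
# Note: RandomForestClassifier has no random_state set, so its results can vary between runs.
classifiers = {
    'LogisticRegression': LogisticRegression(),
    'RandomForest': RandomForestClassifier(),
    'SVC': SVC()
}

# Keys follow the '<step name>__<parameter>' convention so that GridSearchCV
# routes each value to the 'classifier' step of the pipeline.
params = {
    'LogisticRegression': {'classifier__C': [0.01, 0.1, 1, 10, 100]},
    'RandomForest': {'classifier__n_estimators': [10, 50, 100, 200]},
    'SVC': {'classifier__C': [0.1, 1, 10, 100], 'classifier__gamma': [0.01, 0.1, 1, 10, 100]}
}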

# Create pipelines for each classifier
pipelines = {}
for clf_name, clf in classifiers.items():
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', clf)
    ])
    pipelines[clf_name] = pipeline

# Perform Grid Search with Cross-Validation
best_estimators = {}
for clf_name, pipeline in pipelines.items():
    grid_search = GridSearchCV(pipeline, params[clf_name], cv=5, scoring='accuracy')
    grid_search.fit(X, y)
    best_estimators[clf_name] = grid_search.best_estimator_

# Display the best estimator for each classifier
best_estimators

{'LogisticRegression': Pipeline(steps=[('scaler', StandardScaler()),
                 ('classifier', LogisticRegression(C=0.1))]),
 'RandomForest': Pipeline(steps=[('scaler', StandardScaler()),
                 ('classifier', RandomForestClassifier(n_estimators=200))]),
 'SVC': Pipeline(steps=[('scaler', StandardScaler()),
                 ('classifier', SVC(C=10, gamma=0.1))])}

In [3]:
# Repeat the grid search, this time also recording the best mean
# cross-validation accuracy for each classifier
best_estimators_scores = {}
for clf_name, pipeline in pipelines.items():
    grid_search = GridSearchCV(pipeline, params[clf_name], cv=5, scoring='accuracy')
    grid_search.fit(X, y)
    best_estimators[clf_name] = grid_search.best_estimator_
    best_estimators_scores[clf_name] = grid_search.best_score_

# Find the classifier with the highest cross-validation score
best_classifier_name = max(best_estimators_scores, key=best_estimators_scores.get)
best_classifier = best_estimators[best_classifier_name]
best_score = best_estimators_scores[best_classifier_name]

(best_classifier_name, best_classifier, best_score)


('SVC',
 Pipeline(steps=[('scaler', StandardScaler()),
                 ('classifier', SVC(C=10, gamma=0.1))]),
 0.9888888888888889)
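
Because `best_score_` is computed on the same folds that were used to select the hyperparameters, it tends to be somewhat optimistic. One common way to obtain a less biased estimate is nested cross-validation, sketched below for the SVC pipeline only. The shuffled `StratifiedKFold` splitters and their seeds are assumptions made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

pipeline = Pipeline([("scaler", StandardScaler()), ("classifier", SVC())])
param_grid = {
    "classifier__C": [0.1, 1, 10, 100],
    "classifier__gamma": [0.01, 0.1, 1, 10, 100],
}

# Inner loop selects hyperparameters; outer loop scores the whole tuning procedure
# on folds that the inner search never saw. Seeds are arbitrary assumptions.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="accuracy")
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")

print(f"nested CV accuracy: {nested_scores.mean():.3f} (std {nested_scores.std():.3f})")
```

The nested estimate describes how well the tuning procedure generalizes, so it is the fairer number to use when comparing the three model families.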