# Model selection: Pre-deep learning approach




1. Introduction to the Script
2. Preparing Your Data
3. Model Warmup for Classification and Regression
4. Visualizing Model Performance
5. Conclusion

---

### 1. Introduction to the Script

This script helps with the preliminary steps of machine learning model selection and evaluation. It provides functions for model warmup: training and tuning several machine learning models with grid search cross-validation. Both classification and regression tasks are supported.

Often, before turning to deep learning, we have preprocessed our data but still don't know which model to use. Whether the problem is regression or classification, there are many models to choose from, such as SVC, SVR, decision trees, and more.
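
To get a feel for what grid search cross-validation costs, the sketch below counts the model fits needed for a single grid. It reuses the decision tree grid from `warmup_classification()` and the script's `cv=5`. `ParameterGrid` is scikit-learn's helper for enumerating parameter combinations; the variable names `n_combinations`, `cv_folds`, and `total_fits` are illustrative and not part of the script.

```python
from sklearn.model_selection import ParameterGrid

# Decision tree grid as defined in warmup_classification().
param_grid_dt = {
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(1, 20)),
    "min_samples_split": list(range(2, 21)),
    "min_samples_leaf": list(range(1, 21)),
}

# Every combination of values is one candidate model: 2 * 19 * 19 * 20.
n_combinations = len(ParameterGrid(param_grid_dt))

# Each candidate is fitted once per cross-validation fold.
cv_folds = 5
total_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {total_fits} fits")
```

This grid alone amounts to 14,440 candidates and 72,200 fits, so on larger datasets it may be worth narrowing the ranges before running the full warmup.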

### Which models does this script support?

- DecisionTreeClassifier (see sklearn documentation of DTC) [link](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
- DecisionTreeRegressor (see sklearn documentation of DTR) [link](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html)
- GradientBoostingClassifier (see sklearn documentation of GBC) [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
- GradientBoostingRegressor (see sklearn documentation of GBR) [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)
- RandomForestClassifier (see sklearn documentation of RFC) [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- RandomForestRegressor (see sklearn documentation of RFR) [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)
- SVC (see sklearn documentation of SVC) [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)
- SVR (see sklearn documentation of SVR) [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)
- LogisticRegression (see sklearn documentation of LogisticRegression) [link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
- LinearRegression (see sklearn documentation of LinearRegression) [link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
- Ridge (see sklearn documentation of Ridge) [link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)


### 2. Preparing Your Data

Before using the script, ensure that your data is properly preprocessed and split into training and validation sets. The `X_train`, `y_train`, `X_val`, and `y_val` variables should contain the features and target variables for training and validation.
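
A minimal sketch of one way to produce these four variables is shown here. It assumes a synthetic dataset from `make_classification` as a stand-in for real data, an 80/20 split, and standard scaling; none of these choices come from the script itself, and your own preprocessing may differ.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption: 200 samples, 5 features).
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# Hold out 20% of the rows for validation, keeping class proportions similar.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training set only, then apply it to both sets,
# so no information from the validation set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_val = scaler.transform(X_val)

print(X_train.shape, X_val.shape)
```

Scaling is included because SVM and regularized linear models are sensitive to feature scale; tree-based models generally are not.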

### 3. Model Warmup for Classification and Regression

#### Classification:
To warm up classification models, call the `warmup_classification()` function with your training data (pass `multiclass=True` for problems with more than two classes). The function runs 5-fold grid search cross-validation to find the best hyperparameters for logistic regression, decision tree, gradient boosting, random forest, and support vector machine (SVM) classifiers, and returns the best estimator of each, in that order.
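
The pattern repeated for each classifier can be sketched on a much smaller scale as follows. This is an illustration, not the script's code: the data is synthetic, the grid is reduced to three values of `C`, and `max_iter=1000` is an added assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption, stands in for your data).
X_train, y_train = make_classification(n_samples=200, n_features=5, random_state=42)

# A deliberately small grid; the script searches a larger one.
param_grid = {"C": [0.1, 1, 10]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation, as in the script
    scoring="accuracy",
)
search.fit(X_train, y_train)

# best_estimator_ is refitted on the whole training set with the best parameters.
best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```

With the script's own function, the equivalent call would be `logistic, dt, gb, rf, svm = warmup_classification(X_train, y_train)`.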

#### Regression:
To warm up regression models, call the `warmup_regression()` function with your training data. As with classification, it runs 5-fold grid search cross-validation, here scored by negative mean squared error, to find the best hyperparameters for linear regression (fitted as `Ridge`, tuning `alpha`), decision tree, gradient boosting, random forest, and support vector regression (SVR) models.
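
A reduced sketch of the regression case is given here, assuming synthetic data from `make_regression` and a four-value `alpha` grid (both are illustrative choices). It also shows why the score is negated: scikit-learn maximizes scores, so mean squared error is reported with a minus sign.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption, stands in for your data).
X_train, y_train = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# best_score_ is the negated MSE; flip the sign to read it as an error.
best_mse = -search.best_score_
print(search.best_params_, best_mse)
```

With the script's own function, the equivalent call would be `linear, dt, gb, rf, svr = warmup_regression(X_train, y_train)`.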

### 4. Visualizing Model Performance

After warming up the models, you can compare their performance on the training and validation sets. `visualization_class()` handles classification models and `visualization_regress()` handles regression models. Each prints the accuracy (classification) or R-squared score (regression) of every model on the training and validation sets, as a percentage.
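
The comparison these functions perform can be sketched compactly with a loop over named models. The example assumes synthetic data and two untuned classifiers in place of the warmed-up ones; the `models` and `scores` dictionaries are illustrative names, not part of the script.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and split (assumption, stands in for your prepared data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Untuned models standing in for the estimators returned by the warmup.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
    "Decision Tree": DecisionTreeClassifier(random_state=0).fit(X_train, y_train),
}

# Score every model on both sets; a large train/val gap suggests overfitting.
scores = {}
for name, model in models.items():
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    scores[name] = (train_acc, val_acc)
    print(f"{name}: train={train_acc:.1%}, val={val_acc:.1%}")
```

For regression, the same loop would apply with `r2_score` in place of `accuracy_score`.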

### 5. Conclusion

Using this script, you can efficiently explore various machine learning models, tune their hyperparameters, and evaluate their performance on your dataset. Experiment with different models and hyperparameters to find the best model for your specific task.


In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import learning_curve, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVC, SVR
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression, Ridge

In [None]:
def warmup_classification(X_train, y_train, multiclass=False):
    # Logistic regression
    param_grid_logistic = {
        'C': [0.001, 0.01, 0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear', 'saga']
    }
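    log_r = LogisticRegression(random_state=42)
    if multiclass:
        # 'liblinear' cannot fit a multinomial model, so search only over 'saga',
        # which fits a multinomial model on multiclass targets by default.
        param_grid_logistic['solver'] = ['saga']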
    grid_search_logistic = GridSearchCV(
        log_r, 
        param_grid= param_grid_logistic, 
        cv=5, 
        scoring='accuracy'
    )
    grid_search_logistic.fit(X_train, y_train)
    best_model_logistic = grid_search_logistic.best_estimator_

    # Decision tree
    param_grid_dt = {
        'criterion': ['gini', 'entropy'],
        'max_depth': range(1,20),
        'min_samples_split': range(2,21),
        'min_samples_leaf': range(1,21)
    }
    grid_search_dt = GridSearchCV(
        DecisionTreeClassifier(random_state=42),
        param_grid= param_grid_dt,
        cv = 5,
        scoring = 'accuracy',
        n_jobs= -1
    )
    grid_search_dt.fit(X_train, y_train) 
    best_model_dt = grid_search_dt.best_estimator_
    
    # Gradient Boosting
    param_grid_gb = {
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'min_samples_split': [2, 4],
        'min_samples_leaf': [1, 2]
    }
    grid_search_gb = GridSearchCV(
        GradientBoostingClassifier(random_state=42),
        param_grid= param_grid_gb,
        cv= 5,
        n_jobs=-1,
        scoring='accuracy'
    )
    grid_search_gb.fit(X_train, y_train)
    best_model_gb = grid_search_gb.best_estimator_

    #Random forest
    param_grid_rf = {
        'n_estimators': [100, 200, 300],  
        'max_depth': [None, 10, 20, 30],  
        'min_samples_split': [2, 5, 10],  
        'min_samples_leaf': [1, 2, 4],    
        'bootstrap': [True, False]
    }
    grid_search_rf = GridSearchCV(
        RandomForestClassifier(random_state=42),
        param_grid= param_grid_rf,
        cv=5,
        verbose=2,
        n_jobs=-1
    )
    grid_search_rf.fit(X_train, y_train)
    best_model_rf = grid_search_rf.best_estimator_

    # SVM
    param_grid_svm = {
        'C': [0.1, 1, 10, 100],
        'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
        'kernel': ['rbf', 'poly', 'sigmoid', 'linear']
    }
    svm = SVC(random_state=42)
    if multiclass:
        svm = SVC(random_state=42, decision_function_shape='ovr')
    grid_search_svm = GridSearchCV(
        svm, 
        param_grid_svm, 
        refit=True, 
        verbose=3, 
        cv=5, 
        n_jobs=-1
    )
    grid_search_svm.fit(X_train, y_train)
    best_model_svm = grid_search_svm.best_estimator_

    return best_model_logistic, best_model_dt, best_model_gb, best_model_rf, best_model_svm

In [None]:
def warmup_regression(X_train, y_train):
    # Linear regression with L2 regularization (Ridge), tuning alpha
    param_grid_linear = {
        'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
    }
    grid_search_linear = GridSearchCV(
        Ridge(random_state=42), 
        param_grid=param_grid_linear, 
        cv=5, 
        scoring='neg_mean_squared_error'
    )
    grid_search_linear.fit(X_train, y_train)
    best_model_linear = grid_search_linear.best_estimator_

    # Decision Tree Regressor
    param_grid_dt = {
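        # 'mse' and 'mae' were renamed; the old names were removed in scikit-learn 1.2
        'criterion': ['squared_error', 'friedman_mse', 'absolute_error'],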
        'max_depth': range(1,20),
        'min_samples_split': range(2,21),
        'min_samples_leaf': range(1,21)
    }
    grid_search_dt = GridSearchCV(
        DecisionTreeRegressor(random_state=42),
        param_grid=param_grid_dt,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    grid_search_dt.fit(X_train, y_train)
    best_model_dt = grid_search_dt.best_estimator_
    
    # Gradient Boosting Regressor
    param_grid_gb = {
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'min_samples_split': [2, 4],
        'min_samples_leaf': [1, 2]
    }
    grid_search_gb = GridSearchCV(
        GradientBoostingRegressor(random_state=42),
        param_grid=param_grid_gb,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    grid_search_gb.fit(X_train, y_train)
    best_model_gb = grid_search_gb.best_estimator_

    # Random Forest Regressor
    param_grid_rf = {
        'n_estimators': [100, 200, 300],
        'max_depth': [3, 4, 5],
        'min_samples_split': [2, 4],
        'min_samples_leaf': [1, 2]
    }
    grid_search_rf = GridSearchCV(
        RandomForestRegressor(random_state=42),
        param_grid=param_grid_rf,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    grid_search_rf.fit(X_train, y_train)
    best_model_rf = grid_search_rf.best_estimator_

    # SVR
    param_grid_svr = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 'auto'],
        'kernel': ['rbf', 'poly', 'sigmoid', 'linear']
    }
    grid_search_svr = GridSearchCV(
        SVR(),
        param_grid=param_grid_svr,
        cv=5,
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    grid_search_svr.fit(X_train, y_train)
    best_model_svr = grid_search_svr.best_estimator_

    return best_model_linear, best_model_dt, best_model_gb, best_model_rf, best_model_svr

In [None]:
def warmup(X_train, y_train, problem='binary_classification'):
    # Dispatch to the matching warmup routine and return its tuned models.
    if problem == 'binary_classification':
        return warmup_classification(X_train, y_train)
    elif problem == 'multiclass_classification':
        return warmup_classification(X_train, y_train, multiclass=True)
    else:
        return warmup_regression(X_train, y_train)

In [None]:
# Example call (assumes X_train and y_train are already defined):
# linear, dt, gb, rf, svr = warmup_regression(X_train, y_train)
def visualization_class(logistic, decision_tree, gradient_boost, random_forest, svm, X_train, y_train, X_val, y_val):
    predictions_train = logistic.predict(X_train)
    predictions_val = logistic.predict(X_val)
    logistic_train = accuracy_score(predictions_train,y_train)
    logistic_val = accuracy_score(predictions_val, y_val)
    print("\033[1m" + "Logistic Regression" + "\033[0m"+":")
    print('train: ',logistic_train*100,'%')
    print('val: ',logistic_val*100,'%')

    predictions_train = decision_tree.predict(X_train)
    predictions_val = decision_tree.predict(X_val)
    dt_train = accuracy_score(predictions_train,y_train)
    dt_val = accuracy_score(predictions_val, y_val)
    print("\033[1m" + "Decision Tree" + "\033[0m"+":")
    print('train: ',dt_train*100,'%')
    print('val: ',dt_val*100,'%')

    predictions_train = gradient_boost.predict(X_train)
    predictions_val = gradient_boost.predict(X_val)
    gb_train = accuracy_score(predictions_train,y_train)
    gb_val = accuracy_score(predictions_val, y_val)
    print("\033[1m" + "Gradient Boost" + "\033[0m"+":")
    print('train: ',gb_train*100,'%')
    print('val: ',gb_val*100,'%')

    predictions_train = random_forest.predict(X_train)
    predictions_val = random_forest.predict(X_val)
    rf_train = accuracy_score(predictions_train,y_train)
    rf_val = accuracy_score(predictions_val, y_val)
    print("\033[1m" + "Random Forest" + "\033[0m"+":")
    print('train: ',rf_train*100,'%')
    print('val: ',rf_val*100,'%')

    predictions_train = svm.predict(X_train)
    predictions_val = svm.predict(X_val)
    svm_train = accuracy_score(predictions_train,y_train)
    svm_val = accuracy_score(predictions_val, y_val)
    print("\033[1m" + "SVM" + "\033[0m"+":")
    print('train: ',svm_train*100,'%')
    print('val: ',svm_val*100,'%')


In [None]:
def visualization_regress(linear, decision_tree, gradient_boost, random_forest, svr, X_train, y_train, X_val, y_val):
    predictions_train = linear.predict(X_train)
    predictions_val = linear.predict(X_val)
    linear_train = r2_score(y_train, predictions_train)
    linear_val = r2_score(y_val, predictions_val)
    print("\033[1m" + "Linear Regression" + "\033[0m" + ":")
    print('train: ', linear_train*100, '%')
    print('val: ', linear_val*100, '%')
    
    predictions_train = decision_tree.predict(X_train)
    predictions_val = decision_tree.predict(X_val)
    dt_train = r2_score(y_train, predictions_train)
    dt_val = r2_score(y_val, predictions_val)
    print("\033[1m" + "Decision Tree" + "\033[0m" + ":")
    print('train: ', dt_train*100, '%')
    print('val: ', dt_val*100, '%')
    
    predictions_train = gradient_boost.predict(X_train)
    predictions_val = gradient_boost.predict(X_val)
    gb_train = r2_score(y_train, predictions_train)
    gb_val = r2_score(y_val, predictions_val)
    print("\033[1m" + "Gradient Boost" + "\033[0m" + ":")
    print('train: ', gb_train*100, '%')
    print('val: ', gb_val*100, '%')
    
    predictions_train = random_forest.predict(X_train)
    predictions_val = random_forest.predict(X_val)
    rf_train = r2_score(y_train, predictions_train)
    rf_val = r2_score(y_val, predictions_val)
    print("\033[1m" + "Random Forest" + "\033[0m" + ":")
    print('train: ', rf_train*100, '%')
    print('val: ', rf_val*100, '%')
    
    predictions_train = svr.predict(X_train)
    predictions_val = svr.predict(X_val)
    svr_train = r2_score(y_train, predictions_train)
    svr_val = r2_score(y_val, predictions_val)
    print("\033[1m" + "SVR" + "\033[0m" + ":")
    print('train: ', svr_train*100, '%')
    print('val: ', svr_val*100, '%')