# ML Model Comparison

When comparing multiple models like __Decision Trees__, __Random Forest__, and __Logistic Regression__, the best practice is to follow a systematic workflow to ensure fair comparisons. This involves the following steps:

- Data Preparation: Preprocess the data (e.g., handling missing values, encoding categorical variables).
- Train-Test Split: Split the dataset into training and testing sets to evaluate model performance on unseen data.
- Cross-Validation: Use cross-validation (e.g., k-fold) to get a robust estimate of model performance.
- Hyperparameter Tuning: Tune hyperparameters for each model using techniques like GridSearchCV or RandomizedSearchCV.
- Evaluation Metrics: Use the same evaluation metrics (e.g., accuracy, precision, recall, F1-score) for comparison.
- Model Comparison: Compare the models based on their cross-validated performance and final evaluation on the test set.
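
As a rough illustration of the cross-validation step, the sketch below scores the three model families on identical folds. It assumes the Iris dataset and default hyperparameters; the `models` dictionary and the `cv_means` name are illustrative choices, not part of the workflow itself.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# One shared splitter with a fixed seed, so every model sees the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative, untuned models (hyperparameters left at their defaults).
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Sharing one splitter keeps the comparison fair: differences in the scores then come from the models rather than from which samples happened to land in which fold.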

### 1. Data Preparation

The example below uses the Iris dataset, which needs no cleaning. Substitute your own loaded and preprocessed dataset as needed:

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load your dataset (for example, Iris dataset)
from sklearn.datasets import load_iris
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)

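# Hold out 40% of the data, then halve it into test and validation sets
# (60% train, 20% test, 20% validation overall)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=42)

# Standardize the features (important for Logistic Regression).
# Fit the scaler on the training set only, then apply it to the other splits.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)  # scaled for consistency; not used in the steps below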
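
The cell above scales the training set once, before any cross-validation folds are drawn, so the scaler's statistics include samples that later serve as validation folds. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to put the scaler inside a `Pipeline` so it is re-fit on each training fold. The step names `scaler` and `clf` and the 80/20 split are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is a pipeline step, so it is fit only on the training part of each fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {"clf__C": [0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(f"Best CV accuracy: {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
print(f"Test accuracy: {search.score(X_test, y_test):.4f}")
```

On a small, clean dataset like Iris the difference is likely to be minor, but the pipeline form avoids the leak by construction and applies the same scaling automatically at prediction time.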

### 2. Define Models and Perform Hyperparameter Tuning

We will use GridSearchCV to tune the hyperparameters of each model, selecting the best configuration per model by 5-fold cross-validated accuracy.

In [None]:
# Define models
decision_tree = DecisionTreeClassifier(random_state=42)
random_forest = RandomForestClassifier(random_state=42)
logistic_regression = LogisticRegression(random_state=42, max_iter=1000)

# Define hyperparameter grids
param_grid_dt = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

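param_grid_lr = {
    'C': [0.1, 1, 10, 100],  # inverse regularization strength
    # liblinear handles multiclass targets one-vs-rest; lbfgs fits a multinomial model
    'solver': ['liblinear', 'lbfgs']
}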

# Initialize GridSearchCV for each model
grid_search_dt = GridSearchCV(estimator=decision_tree, param_grid=param_grid_dt, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_rf = GridSearchCV(estimator=random_forest, param_grid=param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)
grid_search_lr = GridSearchCV(estimator=logistic_regression, param_grid=param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the models on the training data
grid_search_dt.fit(X_train, y_train)
grid_search_rf.fit(X_train, y_train)
grid_search_lr.fit(X_train, y_train)
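
Grid search cost grows with the product of the grid sizes, so it can help to size the grids before fitting. The sketch below counts the candidates in the three grids defined above using `ParameterGrid`; the `param_grids` and `fit_counts` names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The same grids as in the cell above, collected in one dictionary.
param_grids = {
    "Decision Tree": {
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "Random Forest": {
        "n_estimators": [50, 100, 200],
        "max_depth": [3, 5, 10, None],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "Logistic Regression": {
        "C": [0.1, 1, 10, 100],
        "solver": ["liblinear", "lbfgs"],
    },
}

n_folds = 5  # matches cv=5 in the GridSearchCV calls

fit_counts = {}
for name, grid in param_grids.items():
    n_candidates = len(ParameterGrid(grid))
    # Every candidate is fit once per fold (the final refit is not counted here).
    fit_counts[name] = n_candidates * n_folds
    print(f"{name}: {n_candidates} candidates, {fit_counts[name]} cross-validation fits")
```

The Random Forest grid is three times the size of the Decision Tree grid because of the extra `n_estimators` axis, and each of its fits trains many trees, which is why it dominates the search time.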

### 3. Compare Models Based on Cross-Validation Scores

After fitting the models using GridSearchCV, we can compare the best cross-validated scores for each model.

In [4]:
print(f"Best Decision Tree Accuracy: {grid_search_dt.best_score_:.4f}")
print(f"Best Random Forest Accuracy: {grid_search_rf.best_score_:.4f}")
print(f"Best Logistic Regression Accuracy: {grid_search_lr.best_score_:.4f}")

print(f"Best Decision Tree Params: {grid_search_dt.best_params_}")
print(f"Best Random Forest Params: {grid_search_rf.best_params_}")
print(f"Best Logistic Regression Params: {grid_search_lr.best_params_}")

Best Decision Tree Accuracy: 0.9429
Best Random Forest Accuracy: 0.9429
Best Logistic Regression Accuracy: 0.9524
Best Decision Tree Params: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10}
Best Random Forest Params: {'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}
Best Logistic Regression Params: {'C': 100, 'solver': 'lbfgs'}
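
The best scores printed above differ by about one percentage point, so the spread across folds matters when judging whether the gap is meaningful. A minimal sketch of reading the mean and standard deviation for the winning configuration out of `cv_results_` follows. It refits two small searches on Iris so that it runs on its own; the reduced grids and the `summary` table are assumptions made for brevity.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Deliberately small grids so the example runs quickly.
searches = {
    "Decision Tree": GridSearchCV(
        DecisionTreeClassifier(random_state=42),
        {"max_depth": [3, 5, None]}, cv=5, scoring="accuracy"),
    "Logistic Regression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        {"C": [0.1, 1, 10]}, cv=5, scoring="accuracy"),
}

rows = []
for name, search in searches.items():
    search.fit(X, y)
    best = search.best_index_  # row of cv_results_ holding the best configuration
    rows.append({
        "model": name,
        "mean_cv_accuracy": search.cv_results_["mean_test_score"][best],
        "std_cv_accuracy": search.cv_results_["std_test_score"][best],
    })

summary = pd.DataFrame(rows).sort_values("mean_cv_accuracy", ascending=False)
print(summary.to_string(index=False))
```

If the gap between two means is smaller than the fold-to-fold standard deviation, the ranking should be treated as tentative.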


### 4. Evaluate the Best Model from Each Search on the Test Set

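Now evaluate the tuned models on the held-out test set, which played no part in tuning. Each `best_estimator_` has already been refit on the full training set with its best hyperparameters.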

In [5]:
# Best estimators
best_dt = grid_search_dt.best_estimator_
best_rf = grid_search_rf.best_estimator_
best_lr = grid_search_lr.best_estimator_

# Predictions on test data
y_pred_dt = best_dt.predict(X_test)
y_pred_rf = best_rf.predict(X_test)
y_pred_lr = best_lr.predict(X_test)

# Define a function to evaluate models
def evaluate_model(y_true, y_pred, model_name):
    """Print accuracy and class-weighted precision, recall, and F1 for one model."""
    accuracy = accuracy_score(y_true, y_pred)
    # 'weighted' averages the per-class scores, weighting each class by its support
    precision = precision_score(y_true, y_pred, average='weighted')
    recall = recall_score(y_true, y_pred, average='weighted')
    f1 = f1_score(y_true, y_pred, average='weighted')

    print(f"\n{model_name} Performance on Test Set")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")

# Evaluate each model
evaluate_model(y_test, y_pred_dt, "Decision Tree")
evaluate_model(y_test, y_pred_rf, "Random Forest")
evaluate_model(y_test, y_pred_lr, "Logistic Regression")


Decision Tree Performance on Test Set
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

Random Forest Performance on Test Set
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000

Logistic Regression Performance on Test Set
Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
F1 Score: 1.0000


### 5. Final Model Comparison

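Now that you’ve evaluated all models on the test set, compare their accuracy, precision, recall, and F1-score. In the run above all three models score 1.0000 on every test metric, so the test set cannot separate them; the cross-validated scores, where Logistic Regression leads slightly (0.9524 versus 0.9429), are the more informative signal. Choose the model that best fits your problem requirements (e.g., whether precision or recall matters more than accuracy).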
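
When weighted averages tie, per-class numbers can still show where models differ. The sketch below prints a confusion matrix and a per-class report for a single illustrative model; the shallow decision tree and the stratified 80/20 split are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# Illustrative model; any fitted classifier could be used here.
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 instead of a single weighted average.
print(classification_report(y_test, y_pred, target_names=data.target_names))
```

Off-diagonal entries in the confusion matrix show which classes a model confuses, which is often more useful for choosing between models than a single aggregate score.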

### Best Practices for Model Comparison

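- Cross-Validation: Use cross-validation to get reliable estimates of model performance. It does not prevent overfitting by itself, but it exposes it, because every score is computed on data the model did not see during fitting.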
- Hyperparameter Tuning: Tune hyperparameters for each model using techniques like GridSearchCV or RandomizedSearchCV to get the best possible configuration for each model.
- Use Multiple Metrics: Compare models based on multiple evaluation metrics (accuracy, precision, recall, F1-score) to capture different aspects of performance.
- Final Test Set Evaluation: After cross-validation and tuning, evaluate your models once on a separate test set to get an unbiased estimate of how they perform on unseen data. Reusing the test set for further tuning decisions biases that estimate.
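
RandomizedSearchCV is mentioned above as an alternative to GridSearchCV but not shown. A minimal sketch follows, assuming the Iris dataset and parameter ranges that mirror the Random Forest grid used earlier; the `n_iter=10` budget and the `randint` distributions are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Distributions are sampled from; plain lists are sampled uniformly.
param_distributions = {
    "n_estimators": randint(50, 201),     # integers from 50 to 200
    "max_depth": [3, 5, 10, None],
    "min_samples_split": randint(2, 11),  # integers from 2 to 10
    "min_samples_leaf": randint(1, 5),    # integers from 1 to 4
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,        # number of sampled configurations
    cv=5,
    scoring="accuracy",
    random_state=42,  # makes the sampled configurations reproducible
)
random_search.fit(X, y)

print(f"Best CV accuracy: {random_search.best_score_:.4f}")
print(f"Best params: {random_search.best_params_}")
```

The search cost is fixed by `n_iter` rather than by the size of the grid, which makes a randomized search a reasonable option when the full grid would be too expensive to fit.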