In [13]:
# Import libraries

from pandas import DataFrame
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

In [14]:
# Dataset Preparation

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

## Model Selection

You will use a logistic regression model as a baseline to apply the cross-validation techniques. Initialize the logistic regression model with the parameters provided below.

In [15]:
# Model Selection

# Initialize Logistic Regression
logit = LogisticRegression(max_iter=200, random_state=42)

## K-Fold Cross-Validation

Implement K-Fold cross-validation with 5 splits and ensure that the dataset is shuffled before splitting. Calculate and output the cross-validation scores and the average score. Use the logistic regression model.

In [16]:
# K-Fold Cross-Validation

kf = KFold(n_splits=5, shuffle=True, random_state=42)
logit_kfold_scores = cross_val_score(logit, X, y, cv=kf)

print(f"K-Fold CV scores: {[round(score,2) for score in logit_kfold_scores]}")
print(f"Average K-Fold CV score: {logit_kfold_scores.mean():.2f}")

K-Fold CV scores: [1.0, 1.0, 0.93, 0.97, 0.97]
Average K-Fold CV score: 0.97
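
To make explicit what `cross_val_score` does with a `KFold` splitter, here is a minimal sketch that runs the same five splits by hand. The names `model`, `manual_scores` and `helper_scores` are illustrative and not part of the exercise; the sketch assumes the default scorer for a classifier, which is accuracy.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: fit a fresh copy of the model on each training fold,
# then score it (accuracy) on the held-out fold.
manual_scores = []
for train_idx, test_idx in kf.split(X):
    fold_model = clone(model)
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The helper performs the same clone / fit / score cycle internally.
helper_scores = cross_val_score(model, X, y, cv=kf)
print(np.allclose(manual_scores, helper_scores))
```

Because the splitter has a fixed integer `random_state`, both passes iterate over the same folds, so the two score arrays should agree.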


## Stratified K-Fold Cross-Validation

The Iris dataset is balanced, but in practice datasets are often imbalanced. Stratified K-Fold cross-validation is a variation of K-Fold that returns stratified folds: each fold contains approximately the same percentage of samples of each target class as the complete set. Implement Stratified K-Fold CV and explain why it might be preferred over regular K-Fold for imbalanced datasets.

In [17]:
# Stratified K-Fold Cross-Validation

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
logit_strat_scores = cross_val_score(logit, X, y, cv=skf)

print(f"Stratified K-Fold CV scores: {[round(score,2) for score in logit_strat_scores]}")
print(f"Average Stratified K-Fold CV score: {logit_strat_scores.mean():.2f}")

Stratified K-Fold CV scores: [1.0, 0.97, 0.93, 1.0, 0.93]
Average Stratified K-Fold CV score: 0.97
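
Since Iris is balanced, the effect of stratification is hard to see in the scores above. The sketch below uses a hypothetical imbalanced label vector (90 samples of class 0, 10 of class 1, invented purely for illustration) and counts how many minority samples land in each test fold under both splitters. `X_imb`, `y_imb` and `minority_counts` are names assumed for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y_imb = np.array([0] * 90 + [1] * 10)
X_imb = np.zeros((100, 1))  # placeholder features; only the labels matter here


def minority_counts(splitter):
    """Number of class-1 samples in each test fold produced by the splitter."""
    return [int(y_imb[test_idx].sum()) for _, test_idx in splitter.split(X_imb, y_imb)]


kf_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
skf_counts = minority_counts(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("KFold minority samples per test fold:          ", kf_counts)
print("StratifiedKFold minority samples per test fold:", skf_counts)
```

With 10 minority samples and 5 folds, `StratifiedKFold` places exactly 2 in every test fold, whereas plain `KFold` makes no such guarantee and a fold may end up with very few minority samples.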


## Leave-One-Out Cross-Validation

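Leave-One-Out (LOO) is a cross-validation technique that holds out a single observation for testing and trains on all the others, repeating this once per observation. It uses almost all of the data for training in every split, but it can be computationally expensive, especially for large datasets, because the model is fitted as many times as there are observations. Implement LOO CV and discuss the trade-offs associated with using this method.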

In [18]:
# Leave-One-Out Cross-Validation

loo = LeaveOneOut()
logit_loo_scores = cross_val_score(logit, X, y, cv=loo)

print(f"Leave-One-Out CV scores: {[round(score,2) for score in logit_loo_scores]}")
print(f"Average Leave-One-Out CV score: {logit_loo_scores.mean():.2f}")

Leave-One-Out CV scores: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Average Leave-One-Out CV score: 0.97
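
The cost of LOO comes from the number of model fits. As a small sketch (the variable names `n_fits`, `kf_n` and `same_test_folds` are illustrative), the code below counts the splits for Iris and checks that `LeaveOneOut()` produces the same test folds as an unshuffled `KFold` with one fold per observation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)

loo = LeaveOneOut()
n_fits = loo.get_n_splits(X)  # one fit per observation
print(f"LOO requires {n_fits} model fits on Iris")

# LOO is the limiting case of K-Fold with K equal to the number of observations.
kf_n = KFold(n_splits=len(X))
same_test_folds = all(
    (loo_test == kf_test).all()
    for (_, loo_test), (_, kf_test) in zip(loo.split(X), kf_n.split(X))
)
print(same_test_folds)
```

Each LOO test fold holds a single observation, which is why every individual score in the output above is either 0.0 or 1.0.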


## Cross-Validation with the Random Forest Classifier

Now, let's try a different classifier. Use a RandomForestClassifier and perform the same three cross-validation techniques as before. Discuss the differences in the cross-validation scores between the logistic regression and the random forest classifiers.

In [19]:
# Cross-Validation with RF classifier


# Initialize Random Forest Classifier
rf = RandomForestClassifier(random_state=42)

# K-Fold Cross-Validation
rf_kfold_scores = cross_val_score(rf, X, y, cv=kf)

# Stratified K-Fold Cross-Validation
rf_strat_scores = cross_val_score(rf, X, y, cv=skf)

# Leave-One-Out Cross-Validation
rf_loo_scores = cross_val_score(rf, X, y, cv=loo)

print(f"Random Forest K-Fold CV scores: {[round(score,2) for score in rf_kfold_scores]}")
print(f"Average Random Forest K-Fold CV score: {rf_kfold_scores.mean():.2f}")
print(f"Stratified K-Fold CV scores: {[round(score,2) for score in rf_strat_scores]}")
print(f"Average Stratified K-Fold CV score: {rf_strat_scores.mean():.2f}")
print(f"Leave-One-Out CV scores: {[round(score,2) for score in rf_loo_scores]}")
print(f"Average Leave-One-Out CV score: {rf_loo_scores.mean():.2f}")

Random Forest K-Fold CV scores: [1.0, 0.97, 0.93, 0.93, 0.97]
Average Random Forest K-Fold CV score: 0.96
Stratified K-Fold CV scores: [0.97, 0.97, 0.93, 0.97, 0.9]
Average Stratified K-Fold CV score: 0.95
Leave-One-Out CV scores: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
Average Leave-One-Out CV score: 0.95

Comparing the logistic regression with the Random Forest classifier, we find that the logistic regression scores slightly higher under all three techniques (0.97 versus 0.95 to 0.96), although the difference is small. A plausible reason is that our dataset is relatively small (150 observations and 4 features); in such cases simpler models are often preferred, since they cost less to compute and perform about as well as more complex estimators such as the Random Forest classifier. Even if we had performed automatic hyperparameter tuning *(we will see how to do this in the next batch)*, we shouldn't expect the Random Forest classifier to perform significantly better.

## Analysis

Summarize the results from the previous tasks. Reflect on the average scores and the variability of the scores for each cross-validation technique. Discuss the implications of using each technique and the circumstances in which one might be preferred over the others.

In [22]:
perf = DataFrame(
    data = {
        'K-Fold': [logit_kfold_scores.mean(),rf_kfold_scores.mean()],
        'Stratified K-Fold': [logit_strat_scores.mean(),rf_strat_scores.mean()],
        'Leave-One-Out': [logit_loo_scores.mean(),rf_loo_scores.mean()]
    },
    index = ['Logistic Regression', 'Random Forest Classifier']
).round(2)
perf

| | K-Fold | Stratified K-Fold | Leave-One-Out |
|---|---|---|---|
| Logistic Regression | 0.97 | 0.97 | 0.97 |
| Random Forest Classifier | 0.96 | 0.95 | 0.95 |
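
The table reports averages only. To reflect on variability as well, a summary of the spread of the fold scores can help. The following is a minimal sketch, restricted to logistic regression and the two 5-fold splitters to keep it quick; the `splitters` dictionary and the `summary` frame are names assumed for this example, and the sample standard deviation (`ddof=1`) is one possible choice of spread measure.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200, random_state=42)

splitters = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
}

# Collect the mean and the spread of the per-fold accuracies for each splitter.
rows = {}
for name, cv in splitters.items():
    scores = cross_val_score(model, X, y, cv=cv)
    rows[name] = {
        "mean": scores.mean(),
        "std": scores.std(ddof=1),
        "min": scores.min(),
        "max": scores.max(),
    }

summary = pd.DataFrame.from_dict(rows, orient="index").round(3)
print(summary)
```

For Leave-One-Out the per-split scores are all 0 or 1, so a fold-level standard deviation is less informative there than for the 5-fold schemes.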


The Iris dataset contains the same number of observations for each species *(50 observations each)*, which means that the dataset is perfectly balanced. As a result, there should be no significant difference between the K-Fold and Stratified K-Fold cross-validation scores *(for both the logistic regression and the Random Forest classifier)*, and Stratified K-Fold cross-validation brings little benefit in this case.

The choice between **K-Fold** and **Leave-One-Out** depends on your business case. **If** your manager or client wants a performance estimate that is, on average, very close to the true unknown accuracy score (low bias), and accepts that your current estimate may be far from the one you would obtain with a different dataset (high variance), **then** you should use the **Leave-One-Out** cross-validation technique. **If**, instead, she accepts an estimate that is on average somewhat further from the true unknown accuracy score but wants it to change little from one dataset to another, **then** you should use the **K-Fold** cross-validation technique with K << N, N being the number of observations.
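
One rough way to explore this trade-off empirically is to compute both estimates on several different subsamples of the data and compare how much each one moves. The sketch below is only an illustration under stated assumptions: the subsamples of 60 observations, the 8 repetitions and the `max_iter=1000` setting are arbitrary choices made here, and subsamples drawn from the same 150 observations overlap, so they are an imperfect stand-in for truly different datasets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

loo_estimates, kfold_estimates = [], []
for seed in range(8):  # small, arbitrary number of repetitions to keep this fast
    # Stratified subsample of 60 observations, standing in for "a different dataset".
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=60, stratify=y, random_state=seed
    )
    loo_scores = cross_val_score(model, X_sub, y_sub, cv=LeaveOneOut())
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    kfold_scores = cross_val_score(model, X_sub, y_sub, cv=kf)
    loo_estimates.append(loo_scores.mean())
    kfold_estimates.append(kfold_scores.mean())

# Compare the centre and the spread of the two estimators across subsamples.
print(f"LOO:    mean={np.mean(loo_estimates):.3f}, std={np.std(loo_estimates, ddof=1):.3f}")
print(f"5-Fold: mean={np.mean(kfold_estimates):.3f}, std={np.std(kfold_estimates, ddof=1):.3f}")
```

On a dataset as easy as Iris the two spreads may turn out very similar, so this experiment is best read as a template for checking the trade-off on your own data rather than as evidence for either technique.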