#### What is cross-validation?

Cross-validation is a technique used in machine learning to estimate how well a predictive model will generalize to data it has not seen during training. Here's a simple explanation:

Imagine you have a dataset that you want to use to build a machine learning model. To check that your model hasn't just memorized the data (a problem known as overfitting), you need to test it on data it hasn't seen during training. However, you might not always have a separate test dataset. This is where cross-validation comes in.

1. Split the Data: You divide your dataset into smaller parts. A common method is k-fold cross-validation, where 'k' is a number like 5 or 10. In k-fold cross-validation, you split your data into 'k' parts (called folds) of roughly equal size.

2. Train and Test in Rounds: For each round, you use one of the 'k' parts as a test set and the rest as a training set. You train your model on the training set and then test it on the test set.

3. Rotate and Repeat: You repeat this process 'k' times, each time using a different part as the test set. This way, each part of your data gets to be used as a test set exactly once.

4. Average the Results: After you've trained and tested your model 'k' times, you average the performance metric (such as accuracy) across the rounds. This average is an estimate of how well your model is likely to perform on unseen data, and the spread of the per-round scores shows how stable that estimate is.
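
To make the rotation in steps 1 to 3 concrete, here is a minimal sketch that builds the folds by hand with NumPy on ten sample indices. The helper name `manual_kfold_indices` is hypothetical (it is not part of scikit-learn), and the folds are contiguous and unshuffled purely to keep the printout easy to read.

```python
import numpy as np


def manual_kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds.

    Hypothetical helper for illustration only; scikit-learn's KFold does this
    (and more) for you.
    """
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k parts of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]  # fold i is the test set in this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


splits = list(manual_kfold_indices(n_samples=10, k=5))
for round_no, (train_idx, test_idx) in enumerate(splits, start=1):
    print(f"Round {round_no}: test={test_idx.tolist()} train={train_idx.tolist()}")
```

Each index appears in exactly one test fold, so every sample is used for testing once and for training in the other k - 1 rounds.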

In [9]:
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)

# Choose the number of folds
k = 5

# Initialize k-fold cross-validation
kf = KFold(n_splits=k, shuffle=True, random_state=42)

# Initialize a model (using Logistic Regression here as an example)
model = LogisticRegression(max_iter=200)

# Store the accuracy for each fold
accuracies = []

# Perform k-fold cross-validation
for train_index, test_index in kf.split(X):
    # print(train_index, test_index)
    # Split the data into training and test sets for this fold
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Make predictions and evaluate the model
    predictions = model.predict(X_test)
    accuracy = accuracy_score(y_test, predictions)

    # Store the accuracy
    accuracies.append(accuracy)

# Calculate the average accuracy across all folds
average_accuracy = np.mean(accuracies)
print(f"Average Accuracy: {average_accuracy}")


Average Accuracy: 0.9733333333333334


In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load a sample dataset
X, y = load_iris(return_X_y=True)

# Initialize a model (using Logistic Regression here as an example)
model = LogisticRegression(max_iter=200)

# Choose the number of folds
k = 5

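# Perform k-fold cross-validation
# This function returns the scores for each fold automatically.
# Note: for a classifier, an integer cv uses StratifiedKFold without shuffling,
# so the folds are not the same as those from KFold(shuffle=True, random_state=42).
scores = cross_val_score(model, X, y, cv=k)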

# Calculate the average score across all folds
average_score = scores.mean()

print(f"Average Score: {average_score}")


Average Score: 0.9733333333333334
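
The two cells above report the same average on iris, but they do not necessarily use the same folds: with an integer `cv` and a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling. The sketch below passes explicit splitter objects to make that choice visible and also prints the spread of the per-fold scores. The variable names are illustrative, and no particular output values are claimed here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200)

# Same splitter settings as the manual loop: plain KFold with shuffling
shuffled_kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores_kfold = cross_val_score(model, X, y, cv=shuffled_kfold)

# Integer cv versus an explicit, unshuffled StratifiedKFold
scores_int = cross_val_score(model, X, y, cv=5)
scores_stratified = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5))

print(f"Shuffled KFold:     mean={scores_kfold.mean():.4f} std={scores_kfold.std():.4f}")
print(f"cv=5 (integer):     mean={scores_int.mean():.4f} std={scores_int.std():.4f}")
print(f"StratifiedKFold(5): mean={scores_stratified.mean():.4f} std={scores_stratified.std():.4f}")
```

Passing a splitter object rather than an integer is a reasonable default whenever you need shuffling, a fixed seed, or folds that match another experiment.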


In [7]:
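### Measuring recall with cross-validation

# Imports repeated so that this cell runs on its own
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score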

# Load a multiclass classification dataset
X, y = load_iris(return_X_y=True)

# Initialize the model
model = LogisticRegression(max_iter=200)

# Perform 5-fold cross-validation, measuring macro recall and weighted recall
scores_macro = cross_val_score(model, X, y, cv=5, scoring='recall_macro')
scores_weighted = cross_val_score(model, X, y, cv=5, scoring='recall_weighted')

# Calculate the average recall
average_recall_macro = scores_macro.mean()
average_recall_weighted = scores_weighted.mean()

print(f"Average Macro Recall: {average_recall_macro}")
print(f"Average Weighted Recall: {average_recall_weighted}")


'''

Macro Recall:

Definition: Macro recall is the unweighted mean of the per-class recalls. It treats all classes equally, regardless of their frequency in the dataset.
Calculation: If you have N classes, calculate the recall for each class, add the values up, and divide by N.
Use Case: Macro recall is useful when every class is equally important, including rare ones. A poorly predicted minority class lowers the score as much as a poorly predicted majority class would.


Weighted Recall:

Definition: Weighted recall also starts from the per-class recalls but takes a weighted average of them. The weight for each class is proportional to the number of true instances of that class in the dataset.
Calculation: Compute the recall for each class, then average these values, weighting each one by the proportion of true instances of that class in the dataset.
Use Case: Weighted recall is useful when the score should reflect the class distribution of the data. Under class imbalance it is dominated by the majority classes, and for single-label classification it equals overall accuracy, so it can hide poor recall on rare classes.

Note: The iris dataset has 50 samples per class and the stratified folds keep the classes balanced, so macro recall and weighted recall are identical here.

'''

Average Macro Recall: 0.9733333333333334
Average Weighted Recall: 0.9733333333333334
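
Because iris is balanced, the two averages above cannot differ. The following sketch uses a small, made-up imbalanced label set (eight samples of class 0, two of class 1; not taken from any real dataset) to show how macro and weighted recall diverge, and that weighted recall coincides with accuracy in this single-label setting.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels, invented for illustration: class 0 is the majority class
y_true = [0] * 8 + [1] * 2
# The classifier gets every class-0 sample right but misses one class-1 sample
y_pred = [0] * 8 + [0, 1]

per_class = recall_score(y_true, y_pred, average=None)      # recall per class
macro = recall_score(y_true, y_pred, average="macro")        # unweighted mean
weighted = recall_score(y_true, y_pred, average="weighted")  # weighted by class support
accuracy = accuracy_score(y_true, y_pred)

print(f"Per-class recall: {per_class.tolist()}")  # [1.0, 0.5]
print(f"Macro recall:     {macro:.2f}")           # (1.0 + 0.5) / 2 = 0.75
print(f"Weighted recall:  {weighted:.2f}")        # 0.8 * 1.0 + 0.2 * 0.5 = 0.90
print(f"Accuracy:         {accuracy:.2f}")        # 9 correct out of 10 = 0.90
```

The missed minority sample costs macro recall 0.25 but weighted recall only 0.10, which is why macro recall is usually the more sensitive of the two when rare classes matter.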
