

1. Import the necessary libraries:
   - `sklearn.datasets.load_breast_cancer`: function that loads the Breast Cancer Wisconsin (Diagnostic) dataset.
   - `sklearn.model_selection.StratifiedKFold`: cross-validator that performs stratified k-fold splitting, keeping the class proportions roughly equal across folds.
   - `sklearn.metrics.accuracy_score`, `sklearn.metrics.precision_score`, `sklearn.metrics.recall_score`, `sklearn.metrics.f1_score`: functions that compute evaluation metrics for a classification model.
   - `sklearn.linear_model.LogisticRegression`: class that implements a logistic regression classifier.


In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression


2. Load the Breast Cancer Wisconsin (Diagnostic) dataset:
   - The dataset is loaded using the `load_breast_cancer` function from `sklearn.datasets`.
   - The features (569 samples, 30 numeric columns) are stored in `X`.
   - The target labels are stored in `y` (0 = malignant, 1 = benign).


In [7]:
# Load the Breast Cancer Wisconsin (Diagnostic) Dataset
data = load_breast_cancer()
X = data.data
y = data.target
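Because the rest of the notebook treats the dataset as imbalanced, it can help to look at the class counts first. The following is a minimal sketch of one way to do that; the variable name `class_counts` is introduced here for illustration and does not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
y = data.target

# Count samples per class; data.target_names maps label index -> class name
class_counts = np.bincount(y)
for name, count in zip(data.target_names, class_counts):
    print(f"{name}: {count} samples ({count / len(y):.1%})")
```

The imbalance is moderate (roughly 37% malignant to 63% benign), which is still enough for plain k-fold splits to produce folds with noticeably different class ratios.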


3. Set the number of folds for cross-validation:
   - The variable `num_folds` is set to the desired number of folds for cross-validation.


In [8]:
# Set the number of folds for cross-validation
num_folds = 5



4. Perform Stratified K-Fold Cross-Validation:
   - An instance of the `StratifiedKFold` class is created; `shuffle=True` with a fixed `random_state` makes the shuffled splits reproducible.
   - The dataset is split into training and testing sets using the `split` method of `StratifiedKFold`.
   - The classifier is trained and evaluated once per fold, and empty lists are prepared to collect the per-fold scores.


In [9]:
# Perform Stratified K-Fold Cross-Validation
skf = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=42)
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []
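To illustrate what stratification does, the sketch below compares the fraction of label-1 samples in each test fold with the fraction in the full dataset. It reuses the same splitter settings as above; `overall_ratio` and `fold_ratios` are names introduced here for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Fraction of samples with label 1 in the whole dataset
overall_ratio = y.mean()
print(f"Overall fraction of label 1: {overall_ratio:.3f}")

fold_ratios = []
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    # Fraction of label 1 within this fold's test set
    ratio = y[test_index].mean()
    fold_ratios.append(ratio)
    print(f"Fold {fold}: {len(test_index)} test samples, fraction of label 1 = {ratio:.3f}")
```

Each fold's ratio should stay close to the overall ratio, which is the property that makes per-fold precision and recall comparable.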



5. Classify the imbalanced dataset:
   - A logistic regression classifier is created using `LogisticRegression` with `max_iter=10000` and `solver='saga'`.
   - The classifier is fitted on the training data of the current fold using the `fit` method.
   - The classifier predicts the labels of the fold's test data using the `predict` method.



6. Calculate evaluation metrics:
   - Accuracy, precision, recall, and F1-score are computed from the predicted labels and the true labels of the test data.



7. Store the evaluation metrics for each fold:
   - Each metric is appended to its own list (`accuracy_scores`, `precision_scores`, `recall_scores`, `f1_scores`), one entry per fold.


In [10]:

for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Perform classification on the imbalanced dataset
    classifier = LogisticRegression(max_iter=10000, solver='saga')  # Increase the max_iter value and use 'saga' solver
    classifier.fit(X_train, y_train)
    y_pred = classifier.predict(X_test)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Store the evaluation metrics for each fold
    accuracy_scores.append(accuracy)
    precision_scores.append(precision)
    recall_scores.append(recall)
    f1_scores.append(f1)
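The manual loop above can also be written more compactly with `cross_validate`, which accepts several scoring names at once. The sketch below additionally standardizes the features inside a pipeline, on the assumption that scaling is acceptable for this analysis; the `saga` solver is documented to converge quickly only when features have roughly the same scale. This is an alternative setup, not the notebook's original method, and its scores will differ from the ones reported in this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling lives inside the pipeline, so it is fitted on each training fold only
# (assumption: standardization is appropriate here; the original loop does not scale)
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=10000, solver="saga"),
)

results = cross_validate(
    model, X, y, cv=skf, scoring=["accuracy", "precision", "recall", "f1"]
)

# cross_validate returns one array per metric under the key "test_<name>"
for metric in ["accuracy", "precision", "recall", "f1"]:
    scores = results[f"test_{metric}"]
    print(f"Average {metric}: {scores.mean():.4f}")
```

Keeping the scaler in the pipeline avoids leaking test-fold statistics into training, which would happen if `X` were scaled once before splitting.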


8. Print the average evaluation metrics:
   - The average values of accuracy, precision, recall, and F1-score are calculated by summing the values and dividing by the number of folds.
   - The average evaluation metrics are printed to the console.


In [11]:
# Print the average evaluation metrics
print("Average Accuracy:", sum(accuracy_scores)/len(accuracy_scores))
print("Average Precision:", sum(precision_scores)/len(precision_scores))
print("Average Recall:", sum(recall_scores)/len(recall_scores))
print("Average F1-score:", sum(f1_scores)/len(f1_scores))


Average Accuracy: 0.9227138643067846
Average Precision: 0.9176437976437976
Average Recall: 0.9636150234741784
Average F1-score: 0.9399254992343857
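The precision, recall, and F1 values above use scikit-learn's default positive label, 1, which in this dataset is the benign class. If the malignant class (label 0) is the one of interest, the metric functions accept a `pos_label` argument. The sketch below recomputes recall for the malignant class with the same splitter and classifier settings, and reports the spread across folds as well as the mean; `malignant_recalls` is a name introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

malignant_recalls = []
for train_index, test_index in skf.split(X, y):
    classifier = LogisticRegression(max_iter=10000, solver="saga")
    classifier.fit(X[train_index], y[train_index])
    y_pred = classifier.predict(X[test_index])
    # pos_label=0 treats the malignant class as the positive class
    malignant_recalls.append(recall_score(y[test_index], y_pred, pos_label=0))

print(f"Malignant recall: {np.mean(malignant_recalls):.4f} "
      f"+/- {np.std(malignant_recalls):.4f}")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score depends on the particular split.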
