✅ END-TO-END MACHINE LEARNING

Think of Machine Learning (ML) like cooking:

Dataset = raw ingredients

Preprocessing = cleaning and cutting vegetables

Model = recipe

Training = cooking

Testing = tasting the food

Hyperparameters = choices in the recipe (salt, flame level)

Parameters = what the model learns from data
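
To make the last two items concrete, here is a minimal sketch, assuming scikit-learn's LogisticRegression on the Iris data. The values of `C` and `max_iter` are arbitrary picks for illustration: they are hyperparameters we choose before training, while `coef_` and `intercept_` are parameters the model learns during `.fit()`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us BEFORE training (illustrative values)
clf = LogisticRegression(C=0.5, max_iter=1000)

# Parameters: learned FROM the data during training
clf.fit(X, y)

# One row of weights per class, one column per feature
print("Learned weights shape:   ", clf.coef_.shape)
print("Learned intercepts shape:", clf.intercept_.shape)
```

Changing `C` changes how the model is trained; the learned weights are whatever training produces for that choice.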

✅ CODE: ML END-TO-END

✅ STEP 1: IMPORT THE REQUIRED LIBRARIES

In [1]:
# sklearn provides ready-made datasets and ML models
from sklearn.datasets import load_iris, load_breast_cancer, load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# ML models to test
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier


✅ STEP 2: CREATE A DICTIONARY OF DATASETS

In [2]:
# We create a Python dictionary called "datasets".
# A dictionary stores data in KEY : VALUE pairs.
# Example:
#     "Name" : "Arokia"
#
# Here, the KEY is the dataset name (a string),
# and the VALUE is the actual dataset loaded from sklearn.

datasets = {
    "Iris Dataset"       : load_iris(),          # Flower dataset (predict flower type)
    "Breast Cancer"      : load_breast_cancer(), # Medical dataset (predict cancer type)
    "Digits Recognition" : load_digits()         # Image dataset (predict handwritten digit)
}

# The datasets returned by these sklearn load_* functions share the SAME structure:
#   dataset.data   → features (X)
#   dataset.target → labels   (y)
#
# So we can loop through this dictionary and treat
# every dataset the same way in our ML pipeline.
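
As a quick check of that shared structure, the sketch below loads only the Iris dataset and prints the shapes and names it carries. It assumes nothing beyond the `load_iris` function already imported above.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Features: one row per flower, one column per measurement
print(iris.data.shape)
# Labels: one class index per flower
print(iris.target.shape)
# Human-readable names for the columns and the classes
print(iris.feature_names)
print(list(iris.target_names))
```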


✅ STEP 3: DEFINE MODELS TO TEST

In [3]:
# We create another dictionary called "models".
# KEY   = model name (string)
# VALUE = the actual ML model wrapped inside a Pipeline.
#
# ✅ What is a Pipeline?
# A Pipeline is like a "processing machine" where data enters,
# gets preprocessed, and then goes to the ML model.
#
# Why use a Pipeline here?
# 1) It standardizes the features automatically.
# 2) It sends the scaled data into the model.
# 3) It keeps your code clean and reduces mistakes, for example
#    by fitting the scaler on the training data only.
# 4) You train everything using ONE line: model.fit()

models = {

    # -------------------------------------------------------------
    # 1️⃣ LOGISTIC REGRESSION PIPELINE
    # -------------------------------------------------------------
    "Logistic Regression": Pipeline([
        ('scaler', StandardScaler()),      # Step 1: Standardize features
                                           # (Mean = 0, Std = 1)
                                           # Helps many models perform better

        ('clf', LogisticRegression())      # Step 2: The actual ML model
                                           # clf = "classifier"
    ]),

    # -------------------------------------------------------------
    # 2️⃣ DECISION TREE PIPELINE
    # -------------------------------------------------------------
    "Decision Tree": Pipeline([
        ('scaler', StandardScaler()),      # Scaling still included for consistency,
                                           # even though trees do NOT need scaling

        ('clf', DecisionTreeClassifier())  # Step 2: Decision Tree model
                                           # Learns rules like:
                                           # "If THIS > 5 → class A"
    ]),

    # -------------------------------------------------------------
    # 3️⃣ RANDOM FOREST PIPELINE
    # -------------------------------------------------------------
    "Random Forest": Pipeline([
        ('scaler', StandardScaler()),      # Again, scaling is safe and consistent

        ('clf', RandomForestClassifier())  # Step 2: Random Forest model
                                           # A group of many decision trees
                                           # (majority vote)
    ])
}

# Summary:
# - All models follow the same structure.
# - All models use the SAME preprocessing (scaling).
# - This makes it fair to compare them.
# - Later, we can simply loop through the "models" dictionary
#   to train & test every model easily.
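
To see what the scaling step actually learns, the sketch below fits one of these pipelines on Iris and looks inside it through `named_steps`. It is an illustration under the same 80/20 split used later in this notebook, not part of the comparison itself.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipe.fit(X_train, y_train)

# Each step can be reached by the name we gave it
scaler = pipe.named_steps['scaler']
print("Means learned by the scaler:", scaler.mean_)
print("Means of the training data: ", X_train.mean(axis=0))

# The scaler saw only the training rows, never the test rows
same_means = np.allclose(scaler.mean_, X_train.mean(axis=0))
print("Scaler fitted on training data only:", same_means)
```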


✅ STEP 4: LOOP THROUGH EACH DATASET & TEST EVERY MODEL

In [4]:
# We will loop through:
#   1) Each dataset (Iris, Breast Cancer, Digits)
#   2) Each model (Logistic Regression, Tree, Forest)
#
# For every dataset:
#   - Split it into train & test
#   - Train all models
#   - Test all models
#   - Print accuracy
#   - Pick the best model
# -------------------------------------------------

for dataset_name, dataset in datasets.items():
    # datasets.items() returns:
    # ("Iris Dataset", <iris_object>)
    # ("Breast Cancer", <cancer_object>)
    # ...
    # dataset_name = KEY (string)
    # dataset = VALUE (actual sklearn dataset)

    print("\n======================================")
    print(f"📌 DATASET: {dataset_name}")   # Shows which dataset we are testing
    print("======================================")

    # -------------------------------------------------
    # EXTRACT FEATURES (X) AND LABELS (y)
    # -------------------------------------------------
    # X = input data (what we use to predict)
    # y = output labels (the correct answers)
    X = dataset.data      # All feature columns
    y = dataset.target    # The category/class we want to predict

    # -------------------------------------------------
    # SPLIT INTO TRAINING AND TESTING SETS
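    # -------------------------------------------------
    # 80% → used to TRAIN the model
    # 20% → used to TEST the model (unseen data)
    # random_state=42 makes the split reproducible. The tree and forest
    # models have their own randomness, so their scores can still vary
    # slightly from run to run unless they are given a random_state too.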
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # -------------------------------------------------
    # INITIALIZE TRACKING VARIABLES
    # -------------------------------------------------
    best_model_name = None    # We will store the name of the best model
    best_accuracy = 0         # Store its highest accuracy score

    # -------------------------------------------------
    # LOOP THROUGH EACH MODEL
    # -------------------------------------------------
    for model_name, model in models.items():

        # -------------------------------
        # TRAIN THE MODEL
        # -------------------------------
        # This teaches the model using training data.
        # Pipeline automatically:
        #   1) Scales the data
        #   2) Trains the classifier
        model.fit(X_train, y_train)

        # -------------------------------
        # TEST THE MODEL
        # -------------------------------
        # .score() returns accuracy:
        # How many predictions were correct?
        accuracy = model.score(X_test, y_test)

        # Print model name + accuracy
        print(f"{model_name}: {accuracy:.3f}")

        # -------------------------------------------------
        # CHECK IF THIS MODEL IS THE BEST ONE (so far)
        # -------------------------------------------------
        if accuracy > best_accuracy:
            best_accuracy = accuracy       # Update best accuracy
            best_model_name = model_name   # Update best model

    # -------------------------------------------------
    # PRINT BEST MODEL FOR THIS DATASET
    # -------------------------------------------------
    print(f"✅ BEST MODEL: {best_model_name} with accuracy {best_accuracy:.3f}")



📌 DATASET: Iris Dataset
Logistic Regression: 1.000
Decision Tree: 1.000
Random Forest: 1.000
✅ BEST MODEL: Logistic Regression with accuracy 1.000

📌 DATASET: Breast Cancer
Logistic Regression: 0.974
Decision Tree: 0.947
Random Forest: 0.965
✅ BEST MODEL: Logistic Regression with accuracy 0.974

📌 DATASET: Digits Recognition
Logistic Regression: 0.972
Decision Tree: 0.847
Random Forest: 0.975
✅ BEST MODEL: Random Forest with accuracy 0.975
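
The tree and forest scores above may shift slightly if the notebook is rerun, because those models are created without a `random_state`. The sketch below shows one way to pin them down, assuming we are happy to fix a seed (the seed value 0 and the helper name `forest_accuracy` are hypothetical choices for this illustration).

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def forest_accuracy(seed):
    """Train a fresh Random Forest with a fixed seed and return test accuracy."""
    model = RandomForestClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# Same seed, same split: the two runs give the same score
first_run = forest_accuracy(0)
second_run = forest_accuracy(0)
print(f"Run 1: {first_run:.3f}  Run 2: {second_run:.3f}")
```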


SUMMARY

✅ You loop through datasets like this:

Iris → test all models → choose best

Breast Cancer → test all models → choose best

Digits → test all models → choose best

✅ Pipelines run the preprocessing steps you define (here, scaling) and then the model.

✅ .fit() = training

✅ .score() = accuracy on the data you pass in (here, the test set)

✅ You compare accuracy and choose the highest. On a tie (as on Iris), the first model tested is kept.

✅ Why multiple datasets?

Because each dataset has different patterns.
We want to check:

✅ Which model works on flowers? (Iris)

✅ Which model works on images? (Digits)

✅ Which model works on medical data? (Breast Cancer)

✅ Why Pipelines?

Pipeline = A box that automatically does:

Scale data

Train model

Instead of writing:

scaler.fit()

scaler.transform()

model.fit()

You write only:
pipeline.fit()
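
The two routes can be compared side by side. This is a small sketch on the Breast Cancer data, assuming the same scaler and Logistic Regression used in this notebook; it checks that the manual steps and the pipeline end up making the same predictions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Manual version: three separate steps, plus scaling the test data yourself
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
manual_model = LogisticRegression()
manual_model.fit(X_train_scaled, y_train)
manual_pred = manual_model.predict(X_test_scaled)

# Pipeline version: one object, one fit call
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipeline.fit(X_train, y_train)
pipeline_pred = pipeline.predict(X_test)

same_predictions = np.array_equal(manual_pred, pipeline_pred)
print("Same predictions:", same_predictions)
```

The pipeline also remembers to scale the test data at prediction time, which is the step most easily forgotten in the manual version.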

✅ Why train-test split?

We want to check if the model works on new data it has never seen.
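
A short sketch of why this matters, assuming a Decision Tree on the Breast Cancer data (the seed 0 is an arbitrary choice): the score on the training rows is typically higher than the score on rows the model has never seen, so only the second number tells us how the model might behave on new data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Accuracy on data the model already saw during training
train_accuracy = tree.score(X_train, y_train)
# Accuracy on data the model has never seen
test_accuracy = tree.score(X_test, y_test)

print(f"Train accuracy: {train_accuracy:.3f}")
print(f"Test accuracy:  {test_accuracy:.3f}")
```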

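✅ Why use .fit()?

Training = the model learns patterns from the training data.

✅ Why use .score()?

Testing = measuring accuracy on data the model has not seen.

✅ Why use .predict()?

Prediction = the model gives an answer for new inputs.

Example (for a model trained on Iris, which has 4 features per flower):

model.predict([[5.1, 3.5, 1.4, 0.2]])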
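
Here is that call in a complete, runnable form. It is a sketch that assumes a Logistic Regression pipeline trained on Iris with the same split settings as above; the four numbers are sepal length, sepal width, petal length and petal width in centimetres.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
model.fit(X_train, y_train)

# One new flower: note the double brackets (a list of rows)
new_flower = [[5.1, 3.5, 1.4, 0.2]]

predicted_class = model.predict(new_flower)[0]       # class index
predicted_name = iris.target_names[predicted_class]  # class name
print("Predicted class:", predicted_class, "->", predicted_name)
```

The input must have the same number of features as the training data, so this call would fail on a model trained on Digits or Breast Cancer.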

✅ FINAL OUTCOME

You built a system that:

✅ Loads multiple datasets

✅ Trains 3 different ML models

✅ Tests each model

✅ Automatically finds the best model

✅ Works for absolute beginners

✅ Fully end-to-end

IMBALANCED DATA

✅ 1. What is imbalanced data?

Imbalanced data means:

One class has many samples

Another class has very few

Example:

990 normal emails

10 spam emails

👉 Accuracy becomes misleading.

A model can say “Everything is normal” and still get 990/1000 correct → 99% accuracy. But it failed completely at detecting spam.

So accuracy alone is not a useful measure for imbalanced datasets.
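
The email example can be reproduced in a few lines. This sketch builds the 990/10 labels by hand (0 = normal, 1 = spam) and scores a "model" that always answers normal.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 990 normal emails (0) and 10 spam emails (1)
y_true = np.array([0] * 990 + [1] * 10)

# A lazy "model" that says "normal" for every email
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
spam_recall = recall_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:    {accuracy:.2f}")     # looks great
print(f"Spam recall: {spam_recall:.2f}")  # no spam was caught
```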

BEST METRICS FOR IMBALANCED DATA

✅ (1) Precision

Of all predicted positives, how many are correct? Precision = TP / (TP + FP).

Useful when false positives are dangerous. Example: Spam filter (don't mark real emails as spam).

✅ (2) Recall

Of all actual positives, how many did the model catch? Recall = TP / (TP + FN).

Useful when false negatives are dangerous. Example: Cancer detection (don’t miss sick patients).

✅ (3) F1-Score

Balance between Precision and Recall (their harmonic mean): F1 = 2 × Precision × Recall / (Precision + Recall).

Use when:

Data is imbalanced

You want a single, fair metric

✅ (4) Confusion Matrix

A table showing correct/incorrect predictions, broken down by class. Very useful to see mistakes clearly.
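
The four metrics can be worked through on a tiny hand-made example. The labels below are invented for illustration: 4 actual positives and 6 actual negatives, with the model catching 2 positives, missing 2, and raising 1 false alarm.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score
)

# 1 = positive class, 0 = negative class (made-up labels)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Rows = actual class (0, 1), columns = predicted class (0, 1)
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```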

✅ SUMMARY TABLE

| Situation                               | Best Metric          |
| --------------------------------------- | -------------------- |
| Data imbalanced                         | **F1-score**         |
| Missing positives is dangerous (cancer) | **Recall**           |
| False alarms are dangerous (spam)       | **Precision**        |
| Want a visual mistake summary           | **Confusion Matrix** |
| Balanced data                           | **Accuracy** is fine |


✅ THE SIMPLEST RULE

👉 For imbalanced datasets, do NOT rely on accuracy alone

👉 Use F1-score

Testing Models on Different Imbalanced Datasets

We will use 3 datasets:

1️⃣ Breast Cancer Dataset – slightly imbalanced

2️⃣ make_classification (custom synthetic data) – moderately imbalanced (90:10)

3️⃣ Credit Card Fraud Dataset (simulated) – extremely imbalanced

In [5]:
# STEP 1: IMPORT LIBRARIES
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# ML models to test
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Metrics for imbalanced data
from sklearn.metrics import precision_score, recall_score, f1_score


In [6]:
# STEP 2: CREATE MULTIPLE IMBALANCED DATASETS
# We make a dictionary called "datasets" that will store
# all our imbalanced datasets.
#
# Why?
# Because later we want to LOOP through them and test
# every ML model on every dataset easily.

datasets = {}

# ---------------------------------------------------------
# 1️⃣ Breast Cancer dataset (slightly imbalanced)
# ---------------------------------------------------------
# This dataset comes built-in from sklearn.
# It has two classes:
#   - 0 : malignant (cancer)
#   - 1 : benign (non-cancer)
#
# The dataset is not perfectly balanced:
# 212 malignant samples vs. 357 benign samples.
# Note: sklearn's precision/recall/F1 treat label 1 as the positive
# class by default, so on this dataset they describe the BENIGN class.
datasets["Breast Cancer"] = load_breast_cancer()

# ---------------------------------------------------------
# 2️⃣ Synthetic dataset (artificially created)
# ---------------------------------------------------------
# We create our own dataset using make_classification().
#
# n_samples   = total rows of data
# n_features  = number of columns/features
# weights     = class distribution
#               0.9 means 90%
#               0.1 means 10%
#
# So we are creating:
#   - Class 0 → about 90% of data
#   - Class 1 → about 10% of data
# (make_classification also flips a small share of labels at random
#  by default, flip_y=0.01, so the real ratio is close to 90:10
#  but not exact.)
#
# This simulates real-world imbalance like:
#   - disease (rare)
#   - machine failure (rare)
#   - fraud detection (rare)
X_syn, y_syn = make_classification(
    n_samples=3000,
    n_features=10,
    weights=[0.9, 0.1],        # 90% vs 10% imbalance
    random_state=42            # ensures the same output every time
)

# We store this dataset in our dictionary.
# We use a small dictionary for each dataset:
#     {"data": X, "target": y}
datasets["Synthetic 90:10"] = {"data": X_syn, "target": y_syn}

# ---------------------------------------------------------
# 3️⃣ Simulated Credit Card Fraud dataset
# ---------------------------------------------------------
# We create another synthetic dataset, but now with
# EXTREME imbalance:
#
#   - Class 0 → 99.2%
#   - Class 1 → 0.8% (almost no fraud)
#
# This mimics real financial fraud data,
# where fraud cases are VERY rare.
X_fraud, y_fraud = make_classification(
    n_samples=5000,
    n_features=15,
    weights=[0.992, 0.008],   # EXTREME imbalance (99:1)
    random_state=42
)

# Store this dataset in our main dictionary
datasets["Credit Fraud 99:1"] = {"data": X_fraud, "target": y_fraud}

# ---------------------------------------------------------
# ✅ Summary for Absolute Beginners:
# ---------------------------------------------------------
# ✅ We now have 3 datasets:
#     1) Breast Cancer (slightly imbalanced)
#     2) Synthetic 90:10 (moderately imbalanced)
#     3) Credit Fraud 99:1 (extremely imbalanced)
#
# ✅ All datasets are stored in ONE dictionary (datasets)
#    so we can loop through them later.
#
# ✅ Using imbalanced datasets helps us learn which models
#    and metrics work best when one class is very rare.
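
Before training, it helps to count how many samples each class really has. The sketch below regenerates the simulated fraud data with the same settings and counts the labels with `np.bincount`; the exact counts are not stated here because they depend on the generator's random label noise.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same settings as the simulated fraud dataset above
X_fraud, y_fraud = make_classification(
    n_samples=5000,
    n_features=15,
    weights=[0.992, 0.008],
    random_state=42
)

# counts[0] = number of class-0 rows, counts[1] = number of class-1 rows
counts = np.bincount(y_fraud)
minority_share = counts[1] / counts.sum()

print("Samples per class:", counts)
print(f"Share of class 1:  {minority_share:.2%}")
```

If the printed share is a little different from the requested 0.8%, that is expected: `make_classification` randomly flips a small fraction of labels by default.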


In [7]:
# STEP 3: DEFINE MODELS TO TEST

# We create a dictionary called 'models'
# Each key = name of the model (text)
# Each value = the actual model inside a Pipeline
models = {

    # -----------------------------------------------------
    # 1️⃣ Logistic Regression
    # -----------------------------------------------------
    "Logistic Regression": Pipeline([

        # Step A: Scale the data
        # StandardScaler makes all features (columns)
        # have similar ranges so the model learns better.
        ('scaler', StandardScaler()),

        # Step B: Our actual ML model
        # Logistic Regression is a simple classification model.
        ('clf', LogisticRegression())
    ]),


    # -----------------------------------------------------
    # 2️⃣ Decision Tree Classifier
    # -----------------------------------------------------
    "Decision Tree": Pipeline([

        # Step A: Scale features. This is not required for trees,
        # but it keeps everything consistent for beginners.
        ('scaler', StandardScaler()),

        # Step B: Decision Tree model
        # It learns by splitting data like a flowchart.
        ('clf', DecisionTreeClassifier())
    ]),


    # -----------------------------------------------------
    # 3️⃣ Random Forest Classifier
    # -----------------------------------------------------
    "Random Forest": Pipeline([

        # Step A: Scaling, again for consistency
        ('scaler', StandardScaler()),

        # Step B: Random Forest model
        # A forest = many decision trees working together.
        # Usually performs better than a single tree.
        ('clf', RandomForestClassifier())
    ])
}


Test ALL Models on ALL Imbalanced Datasets

In [9]:
# STEP 4: LOOP THROUGH EACH DATASET

# Loop through every dataset we created in Step 2
# dataset_name  = text (ex: "Breast Cancer")
# dataset       = actual data (X, y)
for dataset_name, dataset in datasets.items():

    # Print the dataset name so we know which one is running
    print("\n======================================")
    print("📌 DATASET:", dataset_name)
    print("======================================")

    # -----------------------------------------------------
    # Some datasets (like Breast Cancer) come as sklearn objects.
    # Others (our synthetic ones) are Python dictionaries.
    # So we check the type and extract X, y correctly.
    # -----------------------------------------------------

    # If dataset is a normal dictionary
    if isinstance(dataset, dict):
        X = dataset["data"]     # features (inputs)
        y = dataset["target"]   # labels (outputs)
    else:
        # If dataset is a sklearn dataset object
        X = dataset.data
        y = dataset.target

    # -----------------------------------------------------
    # Split data into Train (80%) and Test (20%)
    # Train → used to teach the model
    # Test → used to check how well the model performs
    # -----------------------------------------------------
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # -----------------------------------------------------
    # Variables to store the best model for this dataset
    # We use F1-Score because data is imbalanced.
    # -----------------------------------------------------
    best_model_name = None
    best_f1 = 0

    # -----------------------------------------------------
    # Test EACH model (logistic, tree, random forest)
    # -----------------------------------------------------
    for model_name, model in models.items():

        # ✅ Train the model using training data
        model.fit(X_train, y_train)

        # ✅ Make predictions on the test data
        y_pred = model.predict(X_test)

        # ✅ Probability of class 1. This would be the input for ROC-AUC;
        #    it is computed here but not used by the metrics below.
        y_proba = model.predict_proba(X_test)[:, 1]

        # -------------------------------------------------
        # ✅ Calculate evaluation metrics for IMBALANCED DATA
        # Precision → Of all predicted positive, how many correct?
        # Recall    → Of all actual positive, how many found?
        # F1 Score  → Balance between Precision & Recall
        # -------------------------------------------------
        precision = precision_score(y_test, y_pred, zero_division=0)
        recall = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)

        # Print metrics
        print(f"\n🔹 {model_name}")
        print("Precision:", round(precision, 3))
        print("Recall:   ", round(recall, 3))
        print("F1 Score: ", round(f1, 3))

        # -------------------------------------------------
        # Keep the model with the highest F1-Score
        # -------------------------------------------------
        if f1 > best_f1:
            best_f1 = f1
            best_model_name = model_name

    # ✅ After testing all models, print the best one
    print(f"\n✅ BEST MODEL (based on F1): {best_model_name} - F1 = {best_f1:.3f}")


📌 DATASET: Breast Cancer

🔹 Logistic Regression
Precision: 0.972
Recall:    0.986
F1 Score:  0.979

🔹 Decision Tree
Precision: 0.944
Recall:    0.958
F1 Score:  0.951

🔹 Random Forest
Precision: 0.959
Recall:    0.986
F1 Score:  0.972

✅ BEST MODEL (based on F1): Logistic Regression - F1 = 0.979

📌 DATASET: Synthetic 90:10

🔹 Logistic Regression
Precision: 0.871
Recall:    0.831
F1 Score:  0.85

🔹 Decision Tree
Precision: 0.864
Recall:    0.877
F1 Score:  0.87

🔹 Random Forest
Precision: 0.965
Recall:    0.846
F1 Score:  0.902

✅ BEST MODEL (based on F1): Random Forest - F1 = 0.902

📌 DATASET: Credit Fraud 99:1

🔹 Logistic Regression
Precision: 0.0
Recall:    0.0
F1 Score:  0.0

🔹 Decision Tree
Precision: 0.143
Recall:    0.231
F1 Score:  0.176

🔹 Random Forest
Precision: 0.75
Recall:    0.231
F1 Score:  0.353

✅ BEST MODEL (based on F1): Random Forest - F1 = 0.353
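
With a class this rare, a plain random split can leave the test set with only a handful of positive samples, which makes the scores jumpy. One common adjustment, sketched below, is the `stratify=y` option of `train_test_split`, which keeps the class proportions the same in both parts. This is not what the loop above does, and using it would change the numbers reported there.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same settings as the simulated fraud dataset
X, y = make_classification(
    n_samples=5000,
    n_features=15,
    weights=[0.992, 0.008],
    random_state=42
)

# stratify=y keeps the share of class 1 equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

train_share = y_train.mean()  # fraction of class-1 rows in training data
test_share = y_test.mean()    # fraction of class-1 rows in test data

print(f"Class 1 share in train: {train_share:.3%}")
print(f"Class 1 share in test:  {test_share:.3%}")
print("Positive samples in test:", int(y_test.sum()))
```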
