# Machine Learning Pipeline: Classification Task

This notebook follows a structured pipeline: preprocessing, model evaluation, hyperparameter tuning, and prediction.
The goal is to train a classifier on `TrainOnMe.csv`, select the model with the highest cross-validated accuracy, and use it to label `EvaluateOnMe.csv`.

## Step 1: Import Necessary Libraries
We load all required libraries for data manipulation, preprocessing, modeling, and evaluation.

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (
    RepeatedStratifiedKFold,
    cross_val_score,
    RandomizedSearchCV,
    train_test_split,
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from os import linesep

## Step 2: Load and Preprocess the Data
We define `fetch_data()` to load a CSV file, drop the unused columns `Unnamed: 0` and `x12`, and, for the training data, split the label column `y` from the features.

In [23]:
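def fetch_data(file_name):
    """Load a CSV file and drop unused columns.

    Returns (labels, features) if the file has a label column "y",
    otherwise the feature DataFrame alone.
    """
    df = pd.read_csv(file_name)
    # Drop the columns the pipeline does not use
    df = df.drop(columns=["Unnamed: 0", "x12"])
    # Check for the label column rather than the exact file name,
    # so the function also works when a path is passed in
    if "y" in df.columns:
        labels = df["y"]
        features = df.drop(columns=["y"])
        return labels, features
    return df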

# Load training data
y, X = fetch_data("TrainOnMe.csv")

# Display data summary
print(X.describe(include="all"))
print(X.head())

                 x1           x2           x3           x4           x5  \
count   5000.000000  5000.000000  5000.000000  5000.000000  5000.000000   
unique          NaN          NaN          NaN          NaN          NaN   
top             NaN          NaN          NaN          NaN          NaN   
freq            NaN          NaN          NaN          NaN          NaN   
mean     199.999115     0.014573   -99.948342    -1.053022   229.938275   
std        1.014776     0.707570     3.169226     0.006610     1.014779   
min      196.519800    -2.546980  -112.554330    -1.070990   226.472260   
25%      199.309460    -0.471633  -102.090328    -1.057280   229.246200   
50%      199.990365     0.018090  -100.005840    -1.053280   229.930135   
75%      200.703677     0.503020   -97.884070    -1.048160   230.642983   
max      203.424740     2.225160   -87.614030    -1.039130   233.358560   

                 x6       x7           x8           x9          x10  \
count   5000.000000     5000
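
Because the models below are compared on accuracy with stratified folds, it can help to check the class balance of the labels before going further. The following is a minimal sketch, assuming a small made-up label Series (`y_toy`) in place of the `y` returned by `fetch_data()`; the label values are hypothetical.

```python
import pandas as pd

# Hypothetical labels standing in for the `y` Series returned by fetch_data().
y_toy = pd.Series(["A", "B", "A", "C", "A", "B"], name="y")

# Relative class frequencies; a heavily skewed distribution would make
# plain accuracy a misleading metric.
class_share = y_toy.value_counts(normalize=True)
print(class_share)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = class_share.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.2f}")
```

Any model accuracy reported later is only informative relative to this majority-class baseline.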

## Step 3: Data Preprocessing
- **One-Hot Encoding** for categorical variable `x7`
- **Standard Scaling** for numerical features
- **Principal Component Analysis (PCA)**, enabled per classifier on the numerical features; its components are appended alongside the scaled features

In [21]:
NUMERIC_COLS = ["x1", "x2", "x3", "x4", "x5", "x6", "x8", "x9", "x10", "x11", "x13"]


def get_preprocessor(use_pca=False, n_components=None):
    """Build the ColumnTransformer, optionally adding a PCA branch on the numeric columns."""
    transformers = [
        ("encoder", OneHotEncoder(), ["x7"]),
        ("scaler", StandardScaler(), NUMERIC_COLS),
    ]

    if use_pca:
        # ColumnTransformer runs its transformers side by side, so this PCA
        # sees the raw numeric columns and its output is concatenated with
        # the scaled features. n_components=None keeps all components.
        transformers.append(("pca", PCA(n_components=n_components), NUMERIC_COLS))

    return ColumnTransformer(transformers)
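
Note that `ColumnTransformer` applies its transformers in parallel, so in `get_preprocessor()` the PCA branch receives the unscaled numeric columns and adds components next to the scaled features instead of replacing them. If the intent is to reduce the dimensionality of standardized features, one possible alternative is to chain the scaler and PCA in a nested `Pipeline`. The sketch below assumes a hypothetical helper `make_sequential_preprocessor` and a toy DataFrame with made-up values; it is not what produced the results recorded in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_sequential_preprocessor(numeric_cols, categorical_cols, n_components=None):
    """Hypothetical variant: scale first, then run PCA on the scaled columns."""
    numeric_steps = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_components)),
    ])
    return ColumnTransformer([
        ("encoder", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("numeric", numeric_steps, numeric_cols),
    ])


# Toy frame with made-up values; the column names only mimic the notebook's.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x1": rng.normal(200, 1, 20),
    "x2": rng.normal(0, 1, 20),
    "x3": rng.normal(-100, 3, 20),
    "x7": ["a", "b"] * 10,
})

pre = make_sequential_preprocessor(["x1", "x2", "x3"], ["x7"], n_components=2)
out = pre.fit_transform(toy)
print(out.shape)  # 2 one-hot columns + 2 principal components
```

With this layout the three numeric columns are replaced by two components, so the output really is lower-dimensional than the input.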

## Step 4: Model Selection
We evaluate six classifiers with repeated stratified 5-fold cross-validation (10 repeats) and keep the one with the highest mean accuracy.

In [33]:
# Define classifiers and whether they should use PCA
classifiers = {
    "Random Forest": (RandomForestClassifier(random_state=42), True, None),  # Full PCA
    "Decision Tree": (DecisionTreeClassifier(), True, None),  # Full PCA
    "Logistic Regression": (LogisticRegression(), True, 10),  # Max 10 components
    "SVM": (SVC(), True, 10),  # Max 10 components
    "Naïve Bayes": (GaussianNB(), False, None),  # No PCA
    "kNN": (KNeighborsClassifier(), True, 7)  # Slightly lower for stability
}

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
best_score = 0
best_model_name = None
best_model = None

for name, (model, use_pca, n_components) in classifiers.items():
    pipeline = Pipeline(steps=[("preprocessor", get_preprocessor(use_pca, n_components)), ("classifier", model)])
    scores = cross_val_score(pipeline, X, y, cv=rskf, scoring="accuracy")
    mean_score = scores.mean()
    print(f"{name}: {mean_score:.4f} (PCA: {use_pca}, n_components: {n_components})")
    if mean_score > best_score:
        best_score = mean_score
        best_model_name = name
        best_model = model
print(f"\nBest Model: {best_model_name} with Accuracy: {best_score:.4f}")

Random Forest: 0.8364 (PCA: True, n_components: None)
Decision Tree: 0.7752 (PCA: True, n_components: None)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic Regression: 0.5884 (PCA: True, n_components: 10)
SVM: 0.6561 (PCA: True, n_components: 10)
Naïve Bayes: 0.6621 (PCA: False, n_components: None)
kNN: 0.5666 (PCA: True, n_components: 7)

Best Model: Random Forest with Accuracy: 0.8364
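
The warnings in the output above come from the `LogisticRegression` solver reaching its iteration limit, and the message itself suggests raising `max_iter` or scaling the data. A minimal sketch of both, assuming synthetic data from `make_classification` in place of the notebook's dataset; `max_iter=1000` is an illustrative value, not a tuned one.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Scaling inside the pipeline means the scaler is refit on each training fold.
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = cross_val_score(clf, X_toy, y_toy, cv=cv, scoring="accuracy")

# Reporting the spread alongside the mean shows how stable the estimate is.
print(f"{scores.mean():.4f} +/- {scores.std():.4f} over {len(scores)} folds")
```

Printing the standard deviation next to the mean also makes it easier to judge whether the gaps between classifiers are larger than the fold-to-fold noise.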


In [37]:
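# Rebuild the pipeline around the winning classifier.
# get_preprocessor(True, None) matches the PCA settings used for Random Forest above.
pipeline = Pipeline(steps=[
    ("preprocessor", get_preprocessor(True, None)),
    ("classifier", best_model),
])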

## Step 5: Hyperparameter Tuning
We run **RandomizedSearchCV** to tune the selected model. The parameter grid is specific to Random Forest, so this step applies **only if Random Forest is the best model**.

In [38]:
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "classifier__n_estimators": [100, 200, 300, 500, 700, 900, 1200],
    "classifier__max_features": ["auto","sqrt", "log2"],
    "classifier__max_depth": [None, 10, 20, 30, 50, 70, 100, 150],
    "classifier__min_samples_split": [2, 5, 10, 20],
    "classifier__min_samples_leaf": [1, 2, 4, 5, 10],
    "classifier__bootstrap": [True, False]
}
rand_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_dist,
    n_iter=150,  
    scoring="accuracy", 
    cv=10,  
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit the search
rand_search.fit(X, y)

print("Best Parameters:", rand_search.best_params_)
print("Best Cross-Validation Score:", rand_search.best_score_)


Fitting 10 folds for each of 150 candidates, totalling 1500 fits


550 fits failed out of a total of 1500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
329 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/beriyesilaydin/Library/Python/3.9/lib/python/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/beriyesilaydin/Library/Python/3.9/lib/python/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/beriyesilaydin/Library/Python/3.9/lib/python/site-packages/sklearn/pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/Users/beriyesilaydin/Library/Python/3.9

Best Parameters: {'classifier__n_estimators': 700, 'classifier__min_samples_split': 5, 'classifier__min_samples_leaf': 2, 'classifier__max_features': 'sqrt', 'classifier__max_depth': 50, 'classifier__bootstrap': True}
Best Cross-Validation Score: 0.8442000000000001
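
The log above reports failed fits whose scores were set to NaN, and the traceback is cut off before the actual error. One way to see which candidates failed is to inspect `cv_results_`. The sketch below assumes synthetic data and a deliberately invalid `max_depth=-1` as a hypothetical failure trigger; it does not reproduce the failure seen in this run.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the notebook's data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

param_dist = {
    "n_estimators": [10, 20],
    "max_depth": [3, -1],  # -1 is invalid on purpose, to force failed fits
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=4,            # equals the grid size, so every combination is tried once
    cv=3,
    scoring="accuracy",
    random_state=42,
    error_score=np.nan,  # failed fits get NaN instead of aborting the search
)

# Silence the fit-failure warnings so the summary below is easier to read.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    search.fit(X_toy, y_toy)

# Candidates whose mean test score is NaN are the ones that failed to fit.
results = pd.DataFrame(search.cv_results_)
failed = results[results["mean_test_score"].isna()]
print(failed[["param_n_estimators", "param_max_depth"]])
```

Passing `error_score="raise"` instead, as the log suggests, stops at the first failing candidate and shows its full exception.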


## Step 6: Make Predictions
We use the **best tuned pipeline**, which `RandomizedSearchCV` refits on the full training set, to predict labels for the evaluation dataset.

In [41]:
# Making the predictions

best_model = rand_search.best_estimator_

X_eval = fetch_data('EvaluateOnMe.csv')

y_eval = best_model.predict(X_eval)

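# Save predictions to file, one label per line.
# In text mode Python translates "\n" to the platform line ending,
# so writing os.linesep here would produce "\r\r\n" on Windows.
with open("labels.txt", "w") as f:
    for prediction in y_eval:
        f.write(f"{prediction}\n")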

## Step 7: Compute Prediction Accuracy with Ground Truth Labels
We compare our predictions to the ground truth labels.

In [42]:
def read_labels(file_path):
    """Reads label files and returns them as lists."""
    with open(file_path, mode="r", encoding="utf-8") as file:
        return [line.strip() for line in file]

def calculate_label_correctness(file1_labels, file2_labels):
    """Compares two lists of labels and calculates correctness percentage."""
    min_length = min(len(file1_labels), len(file2_labels))
    file1_labels, file2_labels = file1_labels[:min_length], file2_labels[:min_length]

    matches = sum(1 for label1, label2 in zip(file1_labels, file2_labels) if label1 == label2)
    return (matches / min_length) * 100

csv_path = "EvaluationGT-5505b867-40a4-47b3-970b-d72adde17321.csv"
txt_path = "labels.txt"

csv_labels = read_labels(csv_path)
txt_labels = read_labels(txt_path)

percentage_correct = calculate_label_correctness(csv_labels, txt_labels)
print(f"Percentage of Matching Labels: {percentage_correct:.2f}%")

Percentage of Matching Labels: 84.82%
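
The manual comparison can be cross-checked with `sklearn.metrics.accuracy_score`. A minimal sketch, assuming two short made-up label lists in place of the ground-truth file and `labels.txt`:

```python
from sklearn.metrics import accuracy_score

# Hypothetical label lists standing in for the ground truth and labels.txt.
truth = ["A", "B", "A", "C"]
predicted = ["A", "B", "C", "C"]

# accuracy_score raises ValueError if the two lists differ in length,
# instead of silently comparing only the overlapping prefix.
accuracy = accuracy_score(truth, predicted) * 100
print(f"Percentage of Matching Labels: {accuracy:.2f}%")
```

Unlike `calculate_label_correctness()`, this fails loudly when the files have different numbers of labels, which can expose a missing line or a stray header row.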


## Step 8: Summary of Results
- The best model was **Random Forest**, with a mean cross-validated accuracy of **84.42%** after hyperparameter tuning (83.64% with default settings).
- Predictions were saved in `labels.txt`.
- Accuracy on the evaluation set, measured against the ground-truth labels, is **84.82%**.