## Comparative Analysis of Binary Classification Models on Synthetic and Real Data

In this project, we compare the performance of several machine learning models on binary classification tasks, using both a synthetic and a real-world dataset. The primary objective is to understand how well results obtained in a controlled, synthetic setting carry over to a noisier, real-world one.

Models Explored:
* Logistic Regression
* K-Nearest Neighbors (KNN)
* Neural Network (MLP)
* Decision Tree

Datasets:
* Synthetic Dataset: Generated using scikit-learn's `make_classification` function, designed to simulate a binary classification problem with clear patterns.
* Real Dataset: Utilizes the Pima Indians Diabetes Database, a well-known dataset in the medical domain aimed at predicting the onset of diabetes based on diagnostic measures.
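
To make "clear patterns" a little more concrete, here is a minimal sketch that generates the synthetic data with the same arguments used later in this notebook and inspects its shape and class balance. The names `X_demo`, `y_demo`, and `class_counts` are only for illustration. With scikit-learn's defaults, only 2 of the 20 features are informative and 2 are redundant linear combinations of them; the rest are noise.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same arguments as the synthetic dataset used later in this notebook
X_demo, y_demo = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Count how many samples fall into each of the two classes
class_counts = np.bincount(y_demo)

print("Feature matrix shape:", X_demo.shape)
print("Samples per class:", class_counts)
```

The classes should come out roughly balanced, since no `weights` argument is passed.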

In [1]:
# Import necessary libraries
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

### First, let's try with a synthetic dataset (for simplicity)

In [2]:
# Generate the synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [3]:
# List of models to train
models = {
    'Logistic Regression': LogisticRegression(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Neural Network': MLPClassifier(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier()
}
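
The Decision Tree and the Neural Network in the dictionary above are created without a `random_state`, so their accuracies can change slightly from run to run. The following is a minimal sketch, assuming a fixed seed is acceptable for this comparison, showing that passing `random_state` makes a Decision Tree repeatable. The seed value `0` and the `*_seed` variable names are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the synthetic data and split so the sketch is self-contained
X_seed, y_seed = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_seed_train, X_seed_test, y_seed_train, y_seed_test = train_test_split(
    X_seed, y_seed, test_size=0.2, random_state=42
)

# Two independent fits with the same seed (0 is an arbitrary, illustrative value)
first_run = DecisionTreeClassifier(random_state=0).fit(X_seed_train, y_seed_train).predict(X_seed_test)
second_run = DecisionTreeClassifier(random_state=0).fit(X_seed_train, y_seed_train).predict(X_seed_test)

print("Identical predictions:", np.array_equal(first_run, second_run))  # True
```

`MLPClassifier` accepts the same `random_state` parameter.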

In [4]:
# Train each model, predict and print performance
for model_name, model in models.items():
    # Fit the model (it is already instantiated in the dictionary)
    model.fit(X_train, y_train)

    # Make predictions
    predictions = model.predict(X_test)

    # Print the performance
    accuracy = accuracy_score(y_test, predictions)
    print(f"{model_name} Accuracy: {accuracy:.4f}")

Logistic Regression Accuracy: 0.8550
K-Nearest Neighbors Accuracy: 0.8100
Neural Network Accuracy: 0.8250
Decision Tree Accuracy: 0.8800
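
These numbers come from a single 80/20 split, so some of the gaps between models may be due to which samples happened to land in the test set. A minimal sketch of a more stable estimate uses 5-fold cross-validation via `cross_val_score`. The fold count, the two models shown, the tree's seed, and the names `X_cv`, `y_cv`, `cv_models`, and `cv_results` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Same synthetic data as above, regenerated so the sketch is self-contained
X_cv, y_cv = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Two of the four models, to keep the example short
cv_models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),  # seed is an illustrative choice
}

cv_results = {}
for model_name, model in cv_models.items():
    # Accuracy on each of the 5 held-out folds
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring='accuracy')
    cv_results[model_name] = (scores.mean(), scores.std())
    print(f"{model_name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

The standard deviation across folds gives a rough sense of how much a single-split accuracy can move.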


### Now let's try with a real dataset

A classic dataset for binary classification is the Pima Indians Diabetes Database, originally distributed through the UCI Machine Learning Repository (this notebook loads a CSV copy hosted on GitHub). The task is to predict whether or not a patient has diabetes from the diagnostic measurements included in the dataset. All patients are females of Pima Indian heritage who are at least 21 years old.

In [5]:
# Import some more libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler

In [6]:
# Load the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
columns = ["NumTimesPrg", "PlasmaGlucose", "BloodP", "SkinThick", "TwoHourSerIns", "BMI", "DiabetesPedigree", "Age", "Class"]
data = pd.read_csv(url, names=columns)

In [7]:
# Split dataset into features and target variable
X_real = data.drop('Class', axis=1)
y_real = data['Class']

# Standardize the features
scaler = StandardScaler()
X_real_scaled = scaler.fit_transform(X_real)

# Split dataset into train and test sets
X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(X_real_scaled, y_real, test_size=0.2, random_state=42)
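
Note that the cell above fits the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. A common alternative is to split first and wrap the scaler and the model in a pipeline, so that the scaler is fitted on the training rows only. Below is a minimal sketch of that pattern; it uses stand-in synthetic data with the same shape as the Pima features (768 rows, 8 columns) so that it runs offline, and the names `X_stand`, `y_stand`, `knn_pipeline`, and `scaler_step` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (NOT the Pima dataset): 768 rows, 8 features
X_stand, y_stand = make_classification(n_samples=768, n_features=8, n_informative=4, random_state=42)

# Split first, so the test rows are never seen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(X_stand, y_stand, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training rows, then fits KNN on the scaled result
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_pipeline.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting
pipeline_accuracy = knn_pipeline.score(X_te, y_te)
scaler_step = knn_pipeline.named_steps['standardscaler']

print(f"Rows used to fit the scaler: {scaler_step.n_samples_seen_}")
print(f"Pipeline accuracy on stand-in data: {pipeline_accuracy:.4f}")
```

Applying the same pattern to the real data may shift the reported accuracies slightly, since the scaling statistics would no longer include the test rows.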


In [8]:
# List of (the same) models to train
models = {
    'Logistic Regression': LogisticRegression(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Neural Network': MLPClassifier(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier()
}

In [9]:
# Train each model, predict and print performance
for model_name, model in models.items():
    # Fit the model (it is already instantiated in the dictionary)
    model.fit(X_real_train, y_real_train)

    # Make predictions
    predictions = model.predict(X_real_test)

    # Print the performance
    accuracy = accuracy_score(y_real_test, predictions)
    print(f"{model_name} Accuracy: {accuracy:.4f}")

Logistic Regression Accuracy: 0.7532
K-Nearest Neighbors Accuracy: 0.6883
Neural Network Accuracy: 0.7468
Decision Tree Accuracy: 0.7662


Some general observations regarding the models' performance:

* **General Trend**: Every model loses accuracy when moving from synthetic to real data, by roughly 8 to 12 percentage points. This is expected, as real data tends to be noisier and less cleanly structured than synthetic data, which is generated under controlled conditions.

* **Best Performing Model**: The Decision Tree had the highest accuracy on both datasets (0.8800 on synthetic data, 0.7662 on real data), although its lead over Logistic Regression on the real data is small (0.7662 vs. 0.7532). This could indicate that the Decision Tree captured the underlying patterns in both datasets reasonably well, though less so in the real data.

* **Worst Performing Model**: K-Nearest Neighbors had the lowest accuracy on both datasets and also the largest drop on the real dataset (0.8100 to 0.6883). A possible explanation is that KNN relies directly on distances between samples, so noisy or weakly informative features and overlapping classes in the real data can make a sample's nearest neighbors less representative of its class.

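* **Consistency of Logistic Regression and Neural Network**: The Neural Network showed the smallest decrease in accuracy on the real dataset (0.8250 to 0.7468), followed by Logistic Regression (0.8550 to 0.7532). Logistic Regression's drop is only slightly smaller than the Decision Tree's, so the evidence for greater robustness is clearest for the Neural Network. These comparisons rest on a single train/test split, and the real test set has only 154 samples (20% of 768 rows), so a difference of one percentage point corresponds to one or two samples and should be read with caution.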
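
The drops discussed above can be checked directly from the printed accuracies. The sketch below simply copies the numbers reported earlier in this notebook into two dictionaries (`synthetic_acc` and `real_acc` are names chosen here for illustration) and sorts the models by how much accuracy they lose.

```python
# Accuracies copied from the printed results above
synthetic_acc = {
    'Logistic Regression': 0.8550,
    'K-Nearest Neighbors': 0.8100,
    'Neural Network': 0.8250,
    'Decision Tree': 0.8800,
}
real_acc = {
    'Logistic Regression': 0.7532,
    'K-Nearest Neighbors': 0.6883,
    'Neural Network': 0.7468,
    'Decision Tree': 0.7662,
}

# Drop in accuracy when moving from synthetic to real data
drops = {name: synthetic_acc[name] - real_acc[name] for name in synthetic_acc}

# Print the models from largest to smallest drop
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {drop:.4f}")

# K-Nearest Neighbors: 0.1217
# Decision Tree: 0.1138
# Logistic Regression: 0.1018
# Neural Network: 0.0782
```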