# Machine Learning – Classification  
Dataset: Palmer Penguins  

## Task  
Predict penguin species based on physical measurements using supervised machine learning classifiers.


In [None]:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay
)

import seaborn as sns
import matplotlib.pyplot as plt

# set_theme is the current name; sns.set is kept only as an alias
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)


In [None]:

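url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
# dropna() removes rows with a missing value in ANY column, including
# columns such as `sex` that are not used as features below.
df = pd.read_csv(url).dropna().reset_index(drop=True)
df.head()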


In [None]:

features = [
    "bill_length_mm",
    "bill_depth_mm",
    "flipper_length_mm",
    "body_mass_g"
]

X = df[features]
y = df["species"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape


In [None]:

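# Logistic regression and KNN are wrapped in a pipeline so the scaler is
# fitted on the training data only. The decision tree splits on one
# feature at a time and is unaffected by feature scale, so it is used as-is.
models = {
    "Logistic Regression": Pipeline([
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000))
    ]),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN (k=5)": Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=5))
    ])
}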

results = {}


In [None]:

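# Fit each model on the training split and evaluate it on the held-out test split
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        # output_dict=True returns per-class precision/recall/F1 as a nested dict
        "report": classification_report(y_test, y_pred, output_dict=True),
        # rows are true species, columns are predicted species
        "confusion": confusion_matrix(y_test, y_pred)
    }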

results


In [None]:

for name, res in results.items():
    print(f"=== {name} ===")
    print(f"Accuracy: {res['accuracy']:.3f}")
    print(classification_report(y_test, models[name].predict(X_test)))


In [None]:

for name, res in results.items():
    disp = ConfusionMatrixDisplay(
        confusion_matrix=res["confusion"],
        display_labels=models[name].classes_
    )
    disp.plot()
    plt.title(name)
    plt.show()



## Interpretation  

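- **Logistic Regression**  
  - High accuracy, consistent with the species being close to linearly separable in the four measurements  
  - Interpretable coefficients (one weight per standardized feature and species)  

- **Decision Tree**  
  - Slightly lower generalization performance, which is common for a tree grown without a depth limit (no `max_depth` is set here)  
  - Easy to interpret decision logic  

- **KNN**  
  - Competitive accuracy  
  - Sensitive to feature scaling and choice of *k*, which is why it is wrapped in a `StandardScaler` pipeline and why *k* = 5 should be treated as an untuned default  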
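The KNN point about scaling and *k* can be checked directly. The sketch below is a minimal illustration on synthetic stand-in data, not on the penguin measurements: `make_blobs`, the inflated fourth column, the helper `knn_cv_accuracy`, and the list of *k* values are all assumptions introduced for the example. It compares 5-fold cross-validated accuracy with and without a `StandardScaler` for several values of *k*.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three classes, four features.
X_demo, y_demo = make_blobs(
    n_samples=300, centers=3, n_features=4, cluster_std=2.0, random_state=42
)
# Put one feature on a much larger scale, loosely mimicking how body mass
# in grams dwarfs bill measurements in millimetres.
X_demo[:, 3] *= 1000


def knn_cv_accuracy(k, scale):
    """Mean 5-fold CV accuracy of KNN with k neighbours (hypothetical helper)."""
    steps = [("scaler", StandardScaler())] if scale else []
    steps.append(("model", KNeighborsClassifier(n_neighbors=k)))
    return cross_val_score(Pipeline(steps), X_demo, y_demo, cv=5).mean()


k_values = [1, 3, 5, 11, 21]
knn_scores = {
    k: {
        "scaled": knn_cv_accuracy(k, scale=True),
        "unscaled": knn_cv_accuracy(k, scale=False),
    }
    for k in k_values
}

for k, scores in knn_scores.items():
    print(f"k={k:>2}  scaled={scores['scaled']:.3f}  unscaled={scores['unscaled']:.3f}")
```

To run the same comparison on the penguins, replace `X_demo` and `y_demo` with `X_train` and `y_train`, so that the test split stays untouched while choosing *k*.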

### Overall Conclusion  
Physical measurements are strong predictors of penguin species.  
The dataset is well-suited for classification, and even simple models achieve high performance.

## Limitations & Next Steps  
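- Limited dataset size, so accuracy estimated from a single 75/25 split can vary noticeably with the random seed  
- No hyperparameter tuning; all models use default or hand-picked settings  
- Extend with cross-validation, one-vs-rest ROC curves (the target has three classes), or ensemble methods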
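As one possible shape for the cross-validation and tuning steps, the sketch below combines stratified 5-fold cross-validation with a small grid search over a decision tree. It runs on synthetic stand-in data from `make_classification` rather than the penguin data, and the parameter grid is an illustrative assumption, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): three classes, four features.
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=1,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=42,
)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: an untuned tree, scored on each of the five folds.
baseline_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=cv
)

# Illustrative grid (assumption): limit depth and leaf size to curb overfitting.
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(f"Untuned tree: {baseline_scores.mean():.3f} +/- {baseline_scores.std():.3f}")
print(f"Best params:  {search.best_params_}")
print(f"Best CV acc:  {search.best_score_:.3f}")
```

In the notebook itself, the search would be fitted on `X_train` and `y_train` only. `best_score_` is chosen on the same folds used for selection and is therefore optimistic, so the final estimate should still come from `X_test`.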
