# Penguin Identification in NYC

Every day, a new penguin is mysteriously discovered in the streets of **New York City**. While no one knows exactly how they got there, one thing is certain — we need to **identify their species quickly**!

To solve this, we use three key physical features of each penguin:

- **Bill Length (mm)**
- **Bill Depth (mm)**
- **Flipper Length (mm)**

Using these features, we aim to **train a machine learning model** that can accurately predict the **species** of each penguin. This will allow us to categorize them efficiently.


In [32]:
import seaborn as sns
penguins = sns.load_dataset("penguins").dropna()
# info() prints its summary directly and returns None, so it is not wrapped in print()
penguins.info()

<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            333 non-null    object 
 1   island             333 non-null    object 
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 20.8+ KB
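Because `dropna()` discards rows silently, it can help to count what is lost before cleaning. The following is only a minimal sketch on a hypothetical toy frame (the values are made up for illustration and are not rows from the penguins dataset); the same calls should apply to the real DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows for illustration only, not real penguin measurements
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.0],
    "bill_depth_mm": [18.7, 17.4, 13.2, np.nan],
    "sex": ["Male", None, "Female", "Female"],
})

# Missing values per column, before any rows are dropped
missing_per_column = toy.isna().sum()

# Number of rows that dropna() would discard (at least one missing value)
rows_with_missing = int(toy.isna().any(axis=1).sum())

cleaned = toy.dropna()

print(missing_per_column)
print(f"dropna() removes {rows_with_missing} of {len(toy)} rows")
```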


# Feature Selection Test

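In the following section, we compare four feature selection approaches (**Filter**, **Wrapper**, **Embedded**, and **Permutation**) to see which performs best.  
Each starts from four candidate measurements: the three listed above plus `body_mass_g`. The Filter and Wrapper methods keep three of them, while the Random Forest used for the Embedded and Permutation entries is trained on all four. We compare the approaches by accuracy on a single 80/20 hold-out split and by 5-fold cross-validated accuracy to determine the most suitable one for predicting penguin species.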
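For reference, the Embedded and Permutation families can also drop features explicitly. The sketch below is one possible way to do that, not the notebook's own procedure: it assumes scikit-learn's `SelectFromModel` and `permutation_importance`, and it uses the bundled iris dataset as an offline stand-in (four numeric features, three classes) rather than the penguin data. All variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Stand-in data (iris), used only so the sketch runs offline
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Embedded: keep the 3 features with the highest impurity-based importances.
# threshold=-np.inf makes max_features the only selection criterion.
embedded = SelectFromModel(
    RandomForestClassifier(random_state=0), max_features=3, threshold=-np.inf
)
embedded.fit(X_tr, y_tr)
embedded_mask = embedded.get_support()

# Permutation: rank features by the mean accuracy drop when each is shuffled
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
perm = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
perm_top3 = np.argsort(perm.importances_mean)[::-1][:3]

print("Embedded keeps columns:   ", np.flatnonzero(embedded_mask))
print("Permutation top 3 columns:", np.sort(perm_top3))
```

Permutation importance is measured on held-out data here, so it reflects what the fitted model actually relies on when predicting, whereas the impurity-based importances come from the training process itself.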

In [43]:
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Define features and target
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
target = 'species'

X = penguins[features]
y = penguins[target]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

### -------------------------------
### Filter Method (Mutual Info)
selector = SelectKBest(score_func=mutual_info_classif, k=3)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)
filter_acc = accuracy_score(y_test, y_pred)
print("Accuracy (Filter - Mutual Info):", filter_acc)

### -------------------------------
### Wrapper Method (RFE)
model = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=model, n_features_to_select=3)
X_train_selected = rfe.fit_transform(X_train_scaled, y_train)
X_test_selected = rfe.transform(X_test_scaled)

model.fit(X_train_selected, y_train)
y_pred = model.predict(X_test_selected)
wrapper_acc = accuracy_score(y_test, y_pred)
print("Accuracy (Wrapper - RFE):", wrapper_acc)

### -------------------------------
### Embedded Method (Random Forest)
# Note: the forest is trained on all four features; none are dropped in this step
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
embedded_acc = accuracy_score(y_test, y_pred)
print("Accuracy (Embedded - RandomForest):", embedded_acc)

### -------------------------------
### Permutation Importance
# Note: permutation importance itself is not computed here; this refits the same
# forest as the embedded step, so both accuracies are identical by construction
model = RandomForestClassifier(random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
perm_acc = accuracy_score(y_test, y_pred)
print("Accuracy (Permutation Importance):", perm_acc)

# Summary Table
results = {
    "Method": ["Filter (MI)", "Wrapper (RFE)", "Embedded (RF)", "Permutation Importance"],
    "Accuracy": [filter_acc, wrapper_acc, embedded_acc, perm_acc]
}

results_df = pd.DataFrame(results)
print("\n Summary of Classification Results")
print(results_df)

Accuracy (Filter - Mutual Info): 1.0
Accuracy (Wrapper - RFE): 1.0
Accuracy (Embedded - RandomForest): 1.0
Accuracy (Permutation Importance): 1.0

✅ Summary of Classification Results
                   Method  Accuracy
0             Filter (MI)       1.0
1           Wrapper (RFE)       1.0
2           Embedded (RF)       1.0
3  Permutation Importance       1.0
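The accuracies above do not show which three features each selector kept. A minimal sketch of reading that back from a fitted `SelectKBest` or `RFE` via `get_support()` follows. It uses the bundled iris dataset as an offline stand-in, so the printed names are iris columns, not penguin columns; `kept_names` is a hypothetical helper introduced only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data (iris), used only so the sketch runs offline
iris = load_iris()
feature_names = list(iris.feature_names)
X_scaled_demo = StandardScaler().fit_transform(iris.data)

# Same selector settings as in the cell above: keep 3 features
filter_selector = SelectKBest(score_func=mutual_info_classif, k=3)
filter_selector.fit(X_scaled_demo, iris.target)

wrapper_selector = RFE(
    estimator=LogisticRegression(max_iter=1000), n_features_to_select=3
)
wrapper_selector.fit(X_scaled_demo, iris.target)


def kept_names(selector):
    """Return the names of the features a fitted selector retains."""
    return [name for name, keep in zip(feature_names, selector.get_support()) if keep]


filter_kept = kept_names(filter_selector)
wrapper_kept = kept_names(wrapper_selector)

print("Filter (MI) kept:  ", filter_kept)
print("Wrapper (RFE) kept:", wrapper_kept)
```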


In [44]:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.base import clone

# Base models
logreg = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(random_state=42)

# KFold strategy
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 1. Filter Method (SelectKBest + LogisticRegression)
filter_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=mutual_info_classif, k=3)),
    ('model', clone(logreg))
])
filter_scores = cross_val_score(filter_pipeline, X, y, cv=cv)

# 2. Wrapper Method (RFE + LogisticRegression)
rfe_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rfe', RFE(estimator=clone(logreg), n_features_to_select=3)),
    ('model', clone(logreg))
])
wrapper_scores = cross_val_score(rfe_pipeline, X, y, cv=cv)

# 3. Embedded Method (RandomForestClassifier)
embedded_scores = cross_val_score(clone(rf), X, y, cv=cv)

# 4. Permutation Importance shares the same model as Embedded
permutation_scores = embedded_scores  # same scores

# Print results
print("Cross-Validation Accuracy Results:")
print(f"Filter (Mutual Info + LR):     {filter_scores.mean():.4f}")
print(f"Wrapper (RFE + LR):            {wrapper_scores.mean():.4f}")
print(f"Embedded (Random Forest):      {embedded_scores.mean():.4f}")
print(f"Permutation Importance (RF):   {permutation_scores.mean():.4f}")

Cross-Validation Accuracy Results:
Filter (Mutual Info + LR):     0.9820
Wrapper (RFE + LR):            0.9910
Embedded (Random Forest):      0.9760
Permutation Importance (RF):   0.9760
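The means above hide how much the score varies from fold to fold. One way to expose that, sketched here under the assumption that the same pipeline layout is used, is to report the standard deviation alongside the mean. The example runs on the bundled iris dataset as an offline stand-in, so its numbers are not comparable to the penguin results.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (iris), used only so the sketch runs offline
X_demo, y_demo = load_iris(return_X_y=True)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Scaling and selection live inside the pipeline, so each fold fits them
# on its own training portion only
demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("rfe", RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)),
    ("model", LogisticRegression(max_iter=1000)),
])

fold_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv_demo)

print("Per-fold accuracy:", fold_scores.round(4))
print(f"Mean +/- std:      {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")
```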


# Conclusion

Although all four models reached **100%** accuracy on the single 80/20 hold-out split, 5-fold **cross-validation** gives a more reliable picture: the **Wrapper method (RFE + Logistic Regression)** performs best, with a mean accuracy of **0.9910**, ahead of the Filter method (**0.9820**) and the Random Forest used for the Embedded and Permutation entries (**0.9760**).

Because this score is averaged over five different splits, it depends less on one favorable train-test split than the hold-out result does.  
We therefore treat the **Wrapper method** as the most **robust and accurate** of the approaches tested for predicting penguin species from the selected features.
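To put these accuracies in perspective, they can be converted into approximate counts of misclassified penguins. The sketch below takes the 333 rows reported by `info()` and the cross-validation means printed above. It is only an approximation, since the five folds are not exactly equal in size.

```python
# Rows remaining after dropna(), as reported by penguins.info()
n_samples = 333

# Mean cross-validation accuracies copied from the results above
cv_means = {
    "Filter (MI + LR)": 0.9820,
    "Wrapper (RFE + LR)": 0.9910,
    "Embedded / Permutation (RF)": 0.9760,
}

# Approximate number of misclassified penguins across all folds
approx_errors = {
    name: round(n_samples * (1 - acc)) for name, acc in cv_means.items()
}

for name, errors in approx_errors.items():
    print(f"{name:<30} ~{errors} errors out of {n_samples}")
```

Under this approximation the methods differ by only a handful of penguins, so the ranking is best read as a modest preference rather than a decisive gap.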


In [45]:
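import joblib
import os

# Save the wrapper method (RFE + Logistic Regression), the best performer in cross-validation
os.makedirs("models", exist_ok=True)

# `model` still holds the RandomForestClassifier from the hold-out comparison, and
# cross_val_score only fits clones, so fit the wrapper pipeline explicitly first
rfe_pipeline.fit(X_train, y_train)

# The pipeline bundles the scaler, the RFE selector, and the classifier,
# so a separate scaler file is not needed
joblib.dump(rfe_pipeline, "models/wrapper_model.pkl")

# Optional: save label encoder
# joblib.dump(label_encoder, "models/label_encoder.pkl")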


['models/wrapper_model.pkl']
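A saved model can be restored with `joblib.load` and used for prediction right away. The following is a minimal round-trip sketch, assuming the scaler and classifier are stored together as one pipeline. It writes to a temporary directory under a hypothetical file name and uses the bundled iris dataset as an offline stand-in for the penguin data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (iris), used only so the sketch runs offline
X_demo, y_demo = load_iris(return_X_y=True)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, chosen for this example only
    path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(demo_pipeline, path)
    restored = joblib.load(path)

# The restored pipeline applies the same scaling and gives the same predictions
same_predictions = bool(
    np.array_equal(demo_pipeline.predict(X_demo), restored.predict(X_demo))
)
print("Predictions match after reload:", same_predictions)
```

Persisting the whole pipeline keeps preprocessing and model in sync, which avoids having to remember to apply a separately saved scaler before calling `predict`.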