A machine learning classification model that predicts a penguin's species from the physical measurements that distinguish each species.

Basic Workflow ---
1. Load the data from the dataset
2. Clean the data (if needed)
3. Prepare the features (X) and the target (y)
4. Split the data into training and testing sets (80:20)
5. Train the model on the training set (Random Forest model)
6. Test the model on data it was not trained on

Installing the required libraries

In [1]:
pip install pandas seaborn scikit-learn



Importing Libraries

In [2]:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

1. Input/Load the data (Palmer Penguins dataset)

In [7]:
df = sns.load_dataset("penguins")
df.head(5)

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


2. Cleaning our data (since many ML models can't work with NaN values)

In [8]:
#Checking for missing values, that is, NaN values
print("Missing values per column:")
print(df.isnull().sum())

#Dropping rows with missing values (since there is enough data)
df_clean = df.dropna()

#Comparing the size of the dataset before and after this step, to verify the successful cleaning
print(f"\nOriginal shape: {df.shape}")
print(f"\nCleaned shape:  {df_clean.shape}")

Missing values per column:
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
dtype: int64

Original shape: (344, 7)

Cleaned shape:  (333, 7)
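Dropping rows is the simplest option, and it costs only 11 of 344 rows here. As a point of comparison, the sketch below shows what filling the gaps could look like instead. The `toy` table and its values are made up for illustration and are not taken from the penguins dataset; in a real workflow the fill values would normally be computed from the training split only.

```python
import numpy as np
import pandas as pd

# Tiny stand-in table (illustrative values, not real penguin records)
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 46.5],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5000.0],
    "sex": ["Male", None, "Female", "Female"],
})

# Option 1: drop every row that contains a missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column median, categorical gaps with the mode
filled = toy.copy()
numeric_cols = filled.select_dtypes(include="number").columns
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())
filled["sex"] = filled["sex"].fillna(filled["sex"].mode()[0])

print(f"Rows after dropping: {len(dropped)}")
print(f"Rows after filling:  {len(filled)}")
print(f"Missing values left: {filled.isnull().sum().sum()}")
```

Filling keeps every row but replaces unknown measurements with estimates, so it is a trade-off rather than a strict improvement.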


3. Preparing the

a) Features (X) [that is, the physical measurements of each penguin], and

b) Target (y) [the species we want to predict]

In [9]:
#We select only numerical features for simplicity
feature_columns = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

X = df_clean[feature_columns]
y = df_clean['species']

#Preview the input data
X.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0
4,36.7,19.3,193.0,3450.0
5,39.3,20.6,190.0,3650.0
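The notebook keeps only the numeric columns. If the categorical columns (`island`, `sex`) were to be used as well, they would first need a numeric encoding. A minimal sketch with `pd.get_dummies` follows; the `toy` rows are made up for illustration.

```python
import pandas as pd

# Illustrative rows only; column names match the penguins dataset
toy = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 49.0],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Male"],
})

# One-hot encode the categorical columns; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["island", "sex"])

print(list(encoded.columns))
```

Each category becomes its own indicator column, which a Random Forest can consume alongside the measurements.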


4. Splitting the data into Training (80%) and Testing (20%) datasets

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Testing samples:  {len(X_test)}")

Training samples: 266
Testing samples:  67
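The 266/67 split follows from rounding: with a fractional `test_size`, scikit-learn rounds the test count up, so 20% of 333 rows gives 67. The sketch below also shows the optional `stratify` argument, which keeps class proportions the same in both splits. The labels here are synthetic stand-ins, not the real species counts.

```python
import math
from collections import Counter

from sklearn.model_selection import train_test_split

# Test-set size for 333 rows at test_size=0.2 (rounded up)
n_test = math.ceil(0.2 * 333)
print(f"Test rows: {n_test}, train rows: {333 - n_test}")

# Synthetic, imbalanced labels (illustrative counts only)
X_demo = [[i] for i in range(100)]
y_demo = ["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20

# stratify=y_demo preserves the 50/30/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(Counter(y_te))
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.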


5. Training our model, which is a Random Forest Classifier.
It trains many decision trees and combines their votes into a single prediction.

In [11]:
# 100 decision trees; random_state=42 fixes the seed so results are reproducible
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("Model trained successfully!")

Model trained successfully!
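To make the voting idea concrete, the sketch below rebuilds a forest's prediction from its individual trees. In scikit-learn the trees' class probabilities are averaged and the highest average wins. The iris dataset bundled with scikit-learn is used as a stand-in so the example runs on its own; `forest` and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data so the sketch is self-contained
X_iris, y_iris = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_iris, y_iris)

sample = X_iris[:1]

# Collect each tree's class probabilities for the sample, then average them
per_tree = np.array([tree.predict_proba(sample)[0] for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The class with the highest averaged probability is the forest's prediction
manual_class = forest.classes_[np.argmax(averaged)]
print(f"Manual vote: {manual_class}, forest.predict: {forest.predict(sample)[0]}")
```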


6. Testing/Evaluating our model

In [12]:
#Make predictions
y_pred = model.predict(X_test)

#Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")

print("\nDetailed Report:")
print(classification_report(y_test, y_pred))

Model Accuracy: 100.00%

Detailed Report:
              precision    recall  f1-score   support

      Adelie       1.00      1.00      1.00        31
   Chinstrap       1.00      1.00      1.00        13
      Gentoo       1.00      1.00      1.00        23

    accuracy                           1.00        67
   macro avg       1.00      1.00      1.00        67
weighted avg       1.00      1.00      1.00        67
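A perfect score on 67 test rows comes from one particular split, so it is worth checking how stable the accuracy is across several splits. The sketch below uses stratified 5-fold cross-validation. It runs on scikit-learn's bundled iris data as a stand-in; with the notebook's variables, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data so the sketch is self-contained
X_demo, y_demo = load_iris(return_X_y=True)

# Five folds, each keeping the class proportions of the full dataset
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
)

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy:   {scores.mean():.3f} (std {scores.std():.3f})")
```

A mean close to the single-split accuracy with a small spread suggests the result is not an accident of one split.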



Bonus: Testing the model with a custom penguin that is not in the training or test data

In [13]:
# Sample measurements to try; trailing comments give the expected species
# Column order: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
samples = [
    [50.0, 15.0, 220.0, 5000.0],  # Gentoo
    [38.0, 18.0, 180.0, 3500.0],  # Adelie
    [45.0, 16.0, 200.0, 4200.0],  # Gentoo
    [42.0, 20.0, 190.0, 3800.0],  # Adelie
]

# Wrap the second sample in a DataFrame with the same column names used for training
new_penguin = pd.DataFrame([samples[1]], columns=feature_columns)

prediction = model.predict(new_penguin)
print(f"Predicted species: {prediction[0]}")

Predicted species: Adelie
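The cell above scores a single row. A possible extension is to score several rows at once and report how confident the model is in each prediction. The helper `predict_with_confidence` and the toy training rows below are hypothetical and exist only so the sketch runs on its own; in the notebook, the already fitted `model` and the full `samples` list could be passed in instead.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]


def predict_with_confidence(fitted_model, rows):
    """Return one (label, highest class probability) pair per input row."""
    frame = pd.DataFrame(rows, columns=FEATURES)
    labels = fitted_model.predict(frame)
    confidence = fitted_model.predict_proba(frame).max(axis=1)
    return list(zip(labels, confidence))


# Toy stand-in model trained on a few made-up rows (not real penguin records)
toy_X = pd.DataFrame(
    [
        [38.0, 18.5, 182.0, 3600.0],
        [39.5, 18.0, 186.0, 3700.0],
        [47.5, 15.0, 217.0, 5100.0],
        [49.0, 15.5, 221.0, 5400.0],
    ],
    columns=FEATURES,
)
toy_y = ["Adelie", "Adelie", "Gentoo", "Gentoo"]
toy_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(toy_X, toy_y)

results = predict_with_confidence(
    toy_model,
    [[50.0, 15.0, 220.0, 5000.0], [38.0, 18.0, 180.0, 3500.0]],
)
for label, confidence in results:
    print(f"{label}: {confidence:.2f}")
```

A low maximum probability is a hint that the measurements fall between species and that the prediction deserves less trust.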
