# 🐧 Penguins Species Prediction

**Goal:** Build a simple machine learning model to predict the species of penguins based on physical measurements.


## 📊 Dataset

- Source: [Palmer Archipelago (Antarctica) Penguins Dataset](https://github.com/allisonhorst/penguins)
- Contains measurements of penguins’ physical attributes.
- Features used:
  - Culmen Length (mm)
  - Culmen Depth (mm)
  - Flipper Length (mm)
  - Body Mass (g)
- Target:
  - Species (Adelie, Chinstrap, Gentoo)


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score




In [4]:
# Load dataset
data = pd.read_csv('penguins.csv')


# Quick look
data.head()


| | studyName | Sample Number | Species | Region | Island | Stage | Individual ID | Clutch Completion | Date Egg | Culmen Length (mm) | Culmen Depth (mm) | Flipper Length (mm) | Body Mass (g) | Sex | Delta 15 N (o/oo) | Delta 13 C (o/oo) | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PAL0708 | 1 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A1 | Yes | 11/11/07 | 39.1 | 18.7 | 181.0 | 3750.0 | MALE | NaN | NaN | Not enough blood for isotopes. |
| 1 | PAL0708 | 2 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N1A2 | Yes | 11/11/07 | 39.5 | 17.4 | 186.0 | 3800.0 | FEMALE | 8.94956 | -24.69454 | NaN |
| 2 | PAL0708 | 3 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A1 | Yes | 11/16/07 | 40.3 | 18.0 | 195.0 | 3250.0 | FEMALE | 8.36821 | -25.33302 | NaN |
| 3 | PAL0708 | 4 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N2A2 | Yes | 11/16/07 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Adult not sampled. |
| 4 | PAL0708 | 5 | Adelie Penguin (Pygoscelis adeliae) | Anvers | Torgersen | Adult, 1 Egg Stage | N3A1 | Yes | 11/16/07 | 36.7 | 19.3 | 193.0 | 3450.0 | FEMALE | 8.76651 | -25.32426 | NaN |
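
The preview already shows gaps: one row has no measurements at all, and `Comments` is empty for most rows. Counting missing values per column before cleaning makes this visible. A minimal sketch, where `sample` is a hypothetical stand-in for `data` built from a few values in the preview:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: a few columns with values copied from the preview.
sample = pd.DataFrame({
    "Species": ["Adelie Penguin (Pygoscelis adeliae)"] * 4,
    "Culmen Length (mm)": [39.1, 39.5, 40.3, np.nan],
    "Body Mass (g)": [3750.0, 3800.0, 3250.0, np.nan],
    "Comments": ["Not enough blood for isotopes.", np.nan, np.nan, "Adult not sampled."],
})

# Count missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Count rows that have no missing value in any column.
complete_rows = len(sample.dropna())
print(f"Rows with no missing values: {complete_rows}")
```

On the real data, the same two calls on `data` show how many rows a plain `dropna()` would keep.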


## 🔧 Data Preprocessing

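- Clean the column names (lowercase, underscores, no parentheses or slashes).
- Drop rows that have a missing value in any column.
- Select the four measurement features and the `species` target for the model.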


In [6]:
print(data.columns)


Index(['studyName', 'Sample Number', 'Species', 'Region', 'Island', 'Stage',
       'Individual ID', 'Clutch Completion', 'Date Egg', 'Culmen Length (mm)',
       'Culmen Depth (mm)', 'Flipper Length (mm)', 'Body Mass (g)', 'Sex',
       'Delta 15 N (o/oo)', 'Delta 13 C (o/oo)', 'Comments'],
      dtype='object')


In [7]:
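# Clean column names: trim, lowercase, and replace or remove special characters.
# regex=False treats '(' and ')' as literal characters on every pandas version.
data.columns = (
    data.columns.str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
    .str.replace('(', '', regex=False)
    .str.replace(')', '', regex=False)
    .str.replace('/', '_', regex=False)
)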
print(data.columns)


Index(['studyname', 'sample_number', 'species', 'region', 'island', 'stage',
       'individual_id', 'clutch_completion', 'date_egg', 'culmen_length_mm',
       'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex',
       'delta_15_n_o_oo', 'delta_13_c_o_oo', 'comments'],
      dtype='object')
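
To see what the cleaning step does to a single name, the same sequence of replacements can be written as a plain function. This is only an illustrative sketch; `clean_column_name` is a hypothetical helper, not part of the notebook:

```python
def clean_column_name(name: str) -> str:
    """Hypothetical helper: apply the notebook's cleaning steps to one column name."""
    cleaned = name.strip().lower()
    # Spaces and slashes become underscores; parentheses are removed.
    for old, new in ((" ", "_"), ("(", ""), (")", ""), ("/", "_")):
        cleaned = cleaned.replace(old, new)
    return cleaned


print(clean_column_name("Culmen Length (mm)"))  # culmen_length_mm
print(clean_column_name("Delta 15 N (o/oo)"))   # delta_15_n_o_oo
```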


In [8]:
# Drop rows with missing values
data = data.dropna()

# Select features and target
X = data[['culmen_length_mm', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g']]
y = data['species']
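
Note that `dropna()` with no arguments also discards rows whose only gap is in a column the model never uses, such as `comments`. One possible alternative is to restrict the check to the modelling columns with the `subset` parameter. The sketch below uses a hypothetical toy frame and is not the notebook's method:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame that reuses the cleaned column names; the values are illustrative.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "culmen_length_mm": [39.1, np.nan, 46.1, 46.5],
    "culmen_depth_mm": [18.7, np.nan, 13.2, 17.9],
    "flipper_length_mm": [181.0, np.nan, 211.0, 192.0],
    "body_mass_g": [3750.0, np.nan, 4500.0, 3500.0],
    "comments": [np.nan, "Adult not sampled.", np.nan, np.nan],
})

feature_cols = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]

# Dropping on every column removes all four rows, because each one has a gap somewhere.
dropped_all = toy.dropna()

# Dropping only on the modelling columns keeps the three rows with full measurements.
dropped_subset = toy.dropna(subset=feature_cols + ["species"])

print(len(dropped_all), len(dropped_subset))
```

Comparing the two row counts on the real data is a quick way to check how much the choice matters.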


## 📈 Train-Test Split

- Split the dataset into **80% training** and **20% testing**.


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
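
With three species of unequal frequency, a plain random split can leave one class under-represented in the test set. A hedged sketch of a stratified variant, using the `stratify` parameter of `train_test_split` on made-up arrays (`X_demo` and `y_demo` are illustrative, not the penguin data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 100 samples, 4 features, three classes with unequal counts.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array(["Adelie"] * 50 + ["Gentoo"] * 30 + ["Chinstrap"] * 20)

# stratify=y_demo keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

labels, counts = np.unique(y_te, return_counts=True)
test_counts = {str(label): int(count) for label, count in zip(labels, counts)}
print(test_counts)
```

In the notebook this would amount to passing `stratify=y` to the existing call.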


## 🤖 Train Random Forest Classifier


In [12]:
model = RandomForestClassifier()
model.fit(X_train, y_train)


| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |
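
Because `RandomForestClassifier()` is created without a `random_state`, each run can grow a slightly different forest. The sketch below shows a reproducible fit that also reads `feature_importances_`. It runs on synthetic data from `make_classification`, which is an assumption for illustration and not the penguin measurements:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 200 samples, 4 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

# Fixing random_state makes the fitted forest repeatable across runs.
forest_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)
forest_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Importances are non-negative and sum to 1 across the features.
importances = forest_a.feature_importances_
same_importances = np.allclose(importances, forest_b.feature_importances_)
print(importances)
print(same_importances)
```

On the penguin data, pairing `model.feature_importances_` with `X.columns` would show which measurement the forest relies on most.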


## 📊 Model Evaluation

- Evaluate accuracy on the test dataset.


In [11]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy*100:.2f}%")


Model Accuracy: 100.00%
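
A single accuracy figure from one split says little about how stable the result is, particularly when the test set is small. One way to cross-check it is k-fold cross-validation with `cross_val_score`, plus a confusion matrix for per-class errors. The sketch below uses synthetic data as a stand-in, which is an assumption; substitute `X` and `y` from the preprocessing step:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in: 300 samples, 4 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)

clf = RandomForestClassifier(random_state=42)

# 5-fold cross-validation: one accuracy score per fold.
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Confusion matrix on a held-out split: rows are true classes, columns are predictions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
cm = confusion_matrix(y_te, clf.fit(X_tr, y_tr).predict(X_te))
print(cm)
```

A wide spread across folds, or off-diagonal counts concentrated in one class, would suggest the single-split accuracy is optimistic.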


## 🏃 How to Run

1. Install the dependencies: the notebook imports `pandas` and `scikit-learn`, and it needs Jupyter to run.
2. Make sure `penguins.csv` is in the notebook's working directory; the loading cell reads it with the relative path `'penguins.csv'`.
3. Run the notebook in Jupyter Notebook or Jupyter Lab.
