# Palmer Penguins Classification with Machine Learning Techniques

This notebook explores the Palmer Penguins dataset and builds classification models that predict penguin species from morphological measurements (bill length, bill depth, flipper length, body mass) together with island, sex, and study year. It walks step by step through data preprocessing, hyperparameter tuning, and model building.

The project focuses on the following key aspects:

 1) <b>Data Management and Preprocessing</b>: Processing the dataset to transform it into a format suitable for model training. This involves handling missing values, encoding categorical variables, and scaling numerical features.

 2) <b>Hyperparameter Tuning</b>: Using <b>*Grid Search Cross Validation*</b> to choose hyperparameters for both the <b>K-Nearest-Neighbor (KNN)</b> and <b>Decision Tree</b> models. For each model, the search picks the candidate value with the best cross-validated accuracy on the training data.

 3) <b>Model Building</b>: Constructing classification models using the best hyperparameters identified during the tuning phase. The notebook outlines the steps for training and testing these models, evaluating their performance, and comparing their effectiveness in predicting penguin species.

Together, these steps cover the full path from raw data to saved, reusable models for the Palmer Penguins classification problem.

Before diving into the project, we'll import the necessary libraries. We'll utilize Pandas to create a DataFrame for the Palmer Penguins dataset, scikit-learn for building the classification models, and Joblib for saving the trained models.

Let's proceed by importing these libraries.

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
import joblib

We load the Palmer Penguins dataset from a CSV file using the Pandas library.

In [2]:
dataframe = pd.read_csv('penguins.csv')
dataframe

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009


Upon inspection, the dataset contains rows with missing values (row 3 above, for example, has no measurements at all). Since the preprocessing and training steps below expect complete rows, we drop every row that contains a missing value.

In [3]:
dataframe_cleared = dataframe.dropna()
dataframe_cleared

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,male,2007
...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,male,2009
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,female,2009
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,male,2009
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,male,2009
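Before dropping rows, it can be useful to check how many values are missing in each column. The following is a minimal sketch on a small, made-up frame (`toy` and its values are hypothetical, not taken from `penguins.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics the structure of the penguins data.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Chinstrap"],
    "bill_length_mm": [39.1, np.nan, 46.1, 49.6],
    "sex": ["male", None, "female", "male"],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value, as the notebook does.
toy_cleared = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(toy_cleared)}")
```

Note that `dropna()` keeps the original index labels, which is why the index in the output above jumps from 2 to 4.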


The next step is to convert the categorical variables into numbers, because the scikit-learn models used here require numerical input. For `island` and `sex` we use `LabelEncoder`, which assigns an integer code to each category in sorted order; the `species` target is mapped with an explicit dictionary. Keep in mind that integer codes imply an order between categories that does not really exist, which is a simplification for a distance-based model such as KNN.

In [4]:
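categorical_columns = ['island', 'sex']

# Work on a copy so the cleaned DataFrame stays unchanged.
dataframe_encoded = dataframe_cleared.copy()

# LabelEncoder assigns integer codes to the sorted category names.
# A separate encoder is kept per column so each mapping can be inspected later.
encoders = {}
for col in categorical_columns:
    encoders[col] = preprocessing.LabelEncoder()
    dataframe_encoded[col] = encoders[col].fit_transform(dataframe_cleared[col])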

species_codes = {"Adelie": 1, "Chinstrap": 2, "Gentoo": 3}
dataframe_encoded['species'] = dataframe_encoded['species'].map(species_codes)

dataframe_encoded

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,2,39.1,18.7,181.0,3750.0,1,2007
1,1,2,39.5,17.4,186.0,3800.0,0,2007
2,1,2,40.3,18.0,195.0,3250.0,0,2007
4,1,2,36.7,19.3,193.0,3450.0,0,2007
5,1,2,39.3,20.6,190.0,3650.0,1,2007
...,...,...,...,...,...,...,...,...
339,2,1,55.8,19.8,207.0,4000.0,1,2009
340,2,1,43.5,18.1,202.0,3400.0,0,2009
341,2,1,49.6,18.2,193.0,3775.0,1,2009
342,2,1,50.8,19.0,210.0,4100.0,1,2009


Next we rescale the numerical features to the [0, 1] range with `MinMaxScaler`. Scaling matters most for distance-based models such as KNN, where features with larger magnitudes (body mass in grams versus bill depth in millimetres) would otherwise dominate the distance computation; decision trees are largely insensitive to it. We also shift `year` so that it starts at 0. Note that the scaler is fitted on the full dataset before the train/test split, so the test rows influence the scaling parameters; fitting it on the training split only would avoid this leakage.

In [5]:
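numerical_features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

# Rescale each measurement to the [0, 1] range.
scaler = preprocessing.MinMaxScaler()
dataframe_encoded[numerical_features] = scaler.fit_transform(dataframe_encoded[numerical_features])

# Note: df is another name for dataframe_encoded, not a copy.
df = dataframe_encoded
# Shift the study year so that 2007 becomes 0.
df['year'] = df['year'] - 2007
df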

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,1,2,0.254545,0.666667,0.152542,0.291667,1,0
1,1,2,0.269091,0.511905,0.237288,0.305556,0,0
2,1,2,0.298182,0.583333,0.389831,0.152778,0,0
4,1,2,0.167273,0.738095,0.355932,0.208333,0,0
5,1,2,0.261818,0.892857,0.305085,0.263889,1,0
...,...,...,...,...,...,...,...,...
339,2,1,0.861818,0.797619,0.593220,0.361111,1,2
340,2,1,0.414545,0.595238,0.508475,0.194444,0,2
341,2,1,0.636364,0.607143,0.355932,0.298611,1,2
342,2,1,0.680000,0.702381,0.644068,0.388889,1,2
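Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below checks this formula against `MinMaxScaler` on a few made-up body-mass values, and then illustrates the stricter variant of fitting the scaler on training rows only (the data and the train/test cut here are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical body-mass values in grams.
values = np.array([[3000.0], [3750.0], [4500.0], [6000.0]])

# Manual min-max formula: x' = (x - min) / (max - min)
manual = (values - values.min()) / (values.max() - values.min())

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(values)
print(np.allclose(manual, scaled))

# Stricter variant: learn min and max from the training rows only,
# then apply the same transformation to the held-out rows.
train, test = values[:3], values[3:]
scaler = MinMaxScaler().fit(train)
test_scaled = scaler.transform(test)
print(test_scaled.ravel())  # held-out values may fall outside [0, 1]
```

With the train-only variant, test values outside the training range are mapped outside [0, 1], which is expected and harmless for KNN and decision trees.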


With the dataset prepared, we separate the features from the target label, which is the penguin species. We then split the data into a 70% training set and a 30% test set, so that the models are trained on one portion and evaluated on data they have not seen. Fixing `random_state` makes the split reproducible.

In [6]:
X = df.drop(columns=['species'])
y = df['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
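The split above is purely random. An optional variation, not used in this notebook, is to pass `stratify=y` so that each species keeps roughly the same proportion in both parts. A minimal sketch on made-up data (`X_toy` and `y_toy` are hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, 3 imbalanced classes.
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 10 + [2] * 6 + [3] * 4)

# stratify=y_toy keeps the class proportions similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print("Test class counts:", dict(zip(*np.unique(y_te, return_counts=True))))
```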

We now look for the hyperparameters that give the best model performance. For the K Neighbors Classifier, the hyperparameter of interest is the number of neighbors, and we use Grid Search Cross Validation to choose it. The goal of hyperparameter tuning is to improve the model's predictive accuracy and its ability to generalize to unseen data.

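Grid Search Cross Validation systematically explores a predefined grid of hyperparameter values and evaluates each candidate through cross-validation: the training data is split into folds, and the model is repeatedly trained on all but one fold and validated on the held-out fold. Because no `cv` argument is passed below, `GridSearchCV` uses its default of 5 folds. Averaging the validation score across folds gives a more stable performance estimate than a single split.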

In [7]:
knn = KNeighborsClassifier()

param_grid = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]} 

grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_n_neighbors = grid_search.best_params_['n_neighbors']
print("Best n_neighbors:", best_n_neighbors)

Best n_neighbors: 1
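Beyond `best_params_`, the fitted search object exposes the score of every candidate in `cv_results_`, which helps judge how close the alternatives are. The sketch below is self-contained and uses scikit-learn's bundled Iris data as a stand-in, so the numbers it prints do not describe the penguins models:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (Iris), used only to keep the example self-contained.
X_demo, y_demo = load_iris(return_X_y=True)

param_grid = {"n_neighbors": list(range(1, 11))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the validation accuracy across folds.
results = pd.DataFrame(search.cv_results_)[
    ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score").head())
print("Best:", search.best_params_, "mean CV accuracy:", round(search.best_score_, 3))
```

If several candidates score within one standard deviation of each other, the choice between them is not strongly supported by the data.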


With the best number of neighbors identified, we create a K Neighbors Classifier with that value, train it on the training data, and measure its accuracy on the held-out test data.

In [8]:
knn_model = KNeighborsClassifier(n_neighbors=best_n_neighbors) 

knn_model.fit(X_train, y_train)

y_pred = knn_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.98
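Accuracy is a single number and does not show which species get confused with each other. A confusion matrix and a per-class report give that detail. The sketch below uses small, made-up label lists (`y_true_demo` and `y_pred_demo` are hypothetical, not the notebook's predictions) with the same 1/2/3 species codes:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true and predicted species codes (1=Adelie, 2=Chinstrap, 3=Gentoo).
y_true_demo = [1, 1, 2, 2, 3, 3, 3]
y_pred_demo = [1, 1, 2, 3, 3, 3, 3]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[1, 2, 3])
print(cm)

# Precision, recall, and F1 per species.
print(classification_report(
    y_true_demo, y_pred_demo,
    labels=[1, 2, 3],
    target_names=["Adelie", "Chinstrap", "Gentoo"],
))
```

In the notebook, the same calls could be applied to `y_test` and `y_pred`.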


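In the last step, we save the trained model with Joblib so that other applications can load and use it. The `models/` directory must already exist, because `joblib.dump` does not create missing directories.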

In [9]:
joblib.dump(knn_model, 'models/knn_model.pkl')

['models/knn_model.pkl']
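A saved model can be restored with `joblib.load`. The sketch below round-trips a stand-in model through a temporary directory (the Iris data, the `n_neighbors=3` setting, and the temporary path are assumptions made for the example):

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Stand-in model trained on Iris, used only to demonstrate the round trip.
X_demo, y_demo = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create the target directory first; joblib.dump will not do it.
    model_dir = os.path.join(tmp_dir, "models")
    os.makedirs(model_dir, exist_ok=True)
    model_path = os.path.join(model_dir, "knn_model.pkl")

    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = (restored.predict(X_demo) == model.predict(X_demo)).all()
print("Restored model reproduces predictions:", same_predictions)
```

An application that loads the penguins model needs to apply the same encoding and scaling to new inputs as was used during training, since the saved file contains only the classifier.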

We now repeat the same steps (hyperparameter tuning, model construction, training, testing, and saving) for the Decision Tree algorithm. This time the hyperparameter being tuned is the maximum depth of the tree. Once the best depth is found, we build, train, evaluate, and save the Decision Tree model so that it can also be used in other applications.

In [10]:
dt_classifier = DecisionTreeClassifier(criterion='entropy', random_state=42)

param_grid = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}

grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_max_depth = grid_search.best_params_['max_depth']
print("Best max_depth:", best_max_depth)

Best max_depth: 4


In [11]:
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=42, max_depth=best_max_depth)

dt_model.fit(X_train, y_train)

y_pred = dt_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.97


In [12]:
joblib.dump(dt_model, 'models/dt_model.pkl')

['models/dt_model.pkl']
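Since the KNN and Decision Tree sections follow the same tune, train, and evaluate pattern, the pattern can be wrapped in a small helper. The following is a sketch, not the notebook's own code: `tune_and_evaluate` is a hypothetical function, and the Iris data stands in for the penguins data so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def tune_and_evaluate(estimator, param_grid, X_train, y_train, X_test, y_test):
    """Hypothetical helper: grid-search on the training data, then score on the test data."""
    search = GridSearchCV(estimator, param_grid, scoring="accuracy")
    search.fit(X_train, y_train)
    # With the default refit=True, best_estimator_ is already retrained on all of X_train.
    test_accuracy = accuracy_score(y_test, search.best_estimator_.predict(X_test))
    return search.best_estimator_, search.best_params_, test_accuracy


# Stand-in data (Iris), split the same way as in the notebook.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": list(range(1, 11))}),
    "tree": (
        DecisionTreeClassifier(criterion="entropy", random_state=42),
        {"max_depth": list(range(1, 11))},
    ),
}

summary = {}
for name, (estimator, grid) in candidates.items():
    _, best_params, acc = tune_and_evaluate(estimator, grid, X_tr, y_tr, X_te, y_te)
    summary[name] = (best_params, acc)
    print(name, best_params, round(acc, 3))
```

Because `GridSearchCV` refits the best configuration on the full training set by default, the separate "rebuild and refit" step in the notebook produces an equivalent model to `best_estimator_`.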