## Penguin Feature Prediction

![penguins-1.jpg](attachment:penguins-1.jpg)

#### TABLE OF CONTENTS

#### 1. INTRODUCTION
    1.0 OVERVIEW

#### 2. REPOSITORY
    2.0 IMPORT LIBRARIES
    2.1 LOAD DATASET
 

#### 3. MODEL BUILDING
    3.0 REGRESSION MODEL
    3.1 CLASSIFICATION MODEL

#### 4.0 SUMMARY

#### 1. INTRODUCTION
    1.0 OVERVIEW

The goal of this task is to build two machine learning models, each predicting one feature of the penguin dataset: a regression model that predicts `year`, and a classification model that predicts `species`.


#### 2. REPOSITORY
    2.0 IMPORT LIBRARIES

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report

    2.1 LOAD DATASET

In [2]:
# Load dataset
newpenguindata = pd.read_csv("newpenguindata.csv")

In [3]:
# Set a seed for reproducibility
seed = 42


#### 3. MODEL BUILDING
    3.0 REGRESSION MODEL
    

In [4]:
# Regression Model (Predicting year based on numerical features)
# Features: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
X_regression = newpenguindata[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y_regression = newpenguindata["year"]

In [5]:
# Split the data into training and testing sets
X_train_regression, X_test_regression, y_train_regression, y_test_regression = train_test_split(
    X_regression, y_regression, test_size=0.2, random_state=seed)

In [6]:
# Create and train a Linear Regression model
regression_model = LinearRegression()
regression_model.fit(X_train_regression, y_train_regression)

In [7]:
# Make predictions using the regression model
y_pred_regression = regression_model.predict(X_test_regression)

In [8]:
# Evaluate the regression model (using Mean Squared Error)
mse = mean_squared_error(y_test_regression, y_pred_regression)
print(f"Regression Model MSE: {mse:.2f}")

Regression Model MSE: 0.60
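
MSE is expressed in squared units of the target, so 0.60 is easier to interpret next to a naive baseline. The following is a minimal sketch, not part of the original notebook: it assumes, hypothetically, that `year` is spread evenly over 2007, 2008 and 2009 (the real distribution in `newpenguindata.csv` may differ) and compares the reported MSE with that of a model that always predicts the mean year.

```python
# Sketch: compare the reported MSE with a predict-the-mean baseline.
# ASSUMPTION: the target values below are hypothetical (69 test rows,
# evenly spread over three years); they are not read from the dataset.

def mean_squared_error_py(y_true, y_pred):
    """Plain-Python MSE: average of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

hypothetical_years = [2007, 2008, 2009] * 23      # 69 values
mean_year = sum(hypothetical_years) / len(hypothetical_years)

baseline_mse = mean_squared_error_py(
    hypothetical_years, [mean_year] * len(hypothetical_years))

reported_mse = 0.60                               # value printed by the notebook
improvement = 1 - reported_mse / baseline_mse     # R^2-style skill score

print(f"Baseline MSE (always predict the mean): {baseline_mse:.2f}")
print(f"Relative improvement over baseline:     {improvement:.2f}")
```

Under this assumption the baseline scores about 0.67, so an MSE of 0.60 is only a small improvement. On the real data, the same check could be run by replacing `hypothetical_years` with `y_test_regression`.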


    3.1 CLASSIFICATION MODEL

In [9]:
# Classification Model (Predicting species based on numerical features)
# Features: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g
X_classification = newpenguindata[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
y_classification = newpenguindata["species"]

In [10]:
# Split the data into training and testing sets
X_train_classification, X_test_classification, y_train_classification, y_test_classification = train_test_split(
    X_classification, y_classification, test_size=0.2, random_state=seed)

In [11]:
# Create and train a Random Forest Classifier
classification_model = RandomForestClassifier(random_state=seed)
classification_model.fit(X_train_classification, y_train_classification)

In [12]:
# Make predictions using the classification model
y_pred_classification = classification_model.predict(X_test_classification)

In [13]:
# Evaluate the classification model (using Accuracy)
accuracy = accuracy_score(y_test_classification, y_pred_classification)
print(f"Classification Model Accuracy: {accuracy:.2f}")

Classification Model Accuracy: 0.97


In [14]:
# Print classification report for detailed evaluation
classification_report_output = classification_report(y_test_classification, y_pred_classification)
print("Classification Report:")
print(classification_report_output)

Classification Report:
              precision    recall  f1-score   support

      Adelie       1.00      0.94      0.97        32
   Chinstrap       0.93      1.00      0.96        13
      Gentoo       0.96      1.00      0.98        24

    accuracy                           0.97        69
   macro avg       0.96      0.98      0.97        69
weighted avg       0.97      0.97      0.97        69



#### 4.0 SUMMARY

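The regression model's Mean Squared Error (MSE) of 0.60 is the average squared difference between predicted and actual `year` values on the test set. Because MSE is expressed in squared units of the target, it can only be judged against the spread of `year` itself. If `year` takes only the values 2007 to 2009, as in the Palmer Penguins data (an assumption about `newpenguindata.csv`), a model that always predicts the mean year would score an MSE of roughly 0.67. An MSE of 0.60 is therefore only a marginal improvement, which suggests that the body measurements carry little information about the year of observation rather than indicating good predictive performance.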

The classification report shows that the Random Forest Classifier predicts penguin species very well from the four numerical features (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g).

The overall accuracy is 0.97: the model correctly classified 67 of the 69 penguins in the test set. This is a high level of accuracy and suggests the model is effective at predicting species.

Precision, recall, and F1-score are high for every species (Adelie, Chinstrap, Gentoo): all values are 0.93 or above, with the lowest being Chinstrap precision (0.93) and Adelie recall (0.94). The model therefore produces few false positives (high precision) and few false negatives (high recall) for each species.
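
The per-class figures can be traced back to counts. The sketch below uses a confusion matrix reconstructed to be consistent with the rounded report values (it is an assumption, not output from the notebook): two Adelie penguins are misclassified, one as Chinstrap and one as Gentoo. Precision divides each diagonal entry by its column total, recall by its row total.

```python
# Sketch: derive precision and recall from a confusion matrix.
# ASSUMPTION: this matrix is reconstructed from the rounded report values;
# the notebook does not print the actual confusion matrix.
labels = ["Adelie", "Chinstrap", "Gentoo"]
confusion = [
    [30, 1, 1],   # true Adelie
    [0, 13, 0],   # true Chinstrap
    [0, 0, 24],   # true Gentoo
]                 # columns: predicted Adelie, Chinstrap, Gentoo

metrics = {}
for i, label in enumerate(labels):
    true_positives = confusion[i][i]
    predicted_total = sum(row[i] for row in confusion)  # column sum
    actual_total = sum(confusion[i])                    # row sum (support)
    metrics[label] = {
        "precision": true_positives / predicted_total,
        "recall": true_positives / actual_total,
        "support": actual_total,
    }

correct = sum(confusion[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total

for label, m in metrics.items():
    print(f"{label:>10}  precision={m['precision']:.2f}  "
          f"recall={m['recall']:.2f}  support={m['support']}")
print(f"accuracy={accuracy:.2f}")
```

With the real predictions, `sklearn.metrics.confusion_matrix(y_test_classification, y_pred_classification)` would give the actual counts in place of the reconstructed ones.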

The macro and weighted averages of precision, recall, and F1-score are also high (0.96 to 0.98), indicating consistent performance across species. The weighted average takes the class distribution into account; the test set is not balanced (32 Adelie, 13 Chinstrap, 24 Gentoo), yet the weighted average stays high because every class, including the smallest, is predicted well.
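
The difference between the two averages is simple arithmetic: the macro average gives every class equal weight, while the weighted average weights each class by its support. A short sketch using the recall and support values printed in the classification report (variable names are illustrative):

```python
# Sketch: macro vs. weighted average of per-class recall,
# using the rounded values shown in the classification report.
recall = {"Adelie": 0.94, "Chinstrap": 1.00, "Gentoo": 1.00}
support = {"Adelie": 32, "Chinstrap": 13, "Gentoo": 24}

# Macro average: unweighted mean over classes.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class contributes in proportion to its support.
total_support = sum(support.values())
weighted_recall = sum(recall[c] * support[c] for c in recall) / total_support

print(f"macro avg recall:    {macro_recall:.2f}")
print(f"weighted avg recall: {weighted_recall:.2f}")
```

The weighted recall comes out slightly below the macro recall because Adelie, the largest class, has the lowest recall.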

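Support: The "support" column shows the number of test samples for each species. The model was evaluated on a relatively small test set (69 samples in total), so each misclassification shifts the accuracy by about 1.4 percentage points and the reported scores should be read with that uncertainty in mind.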
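
One way to quantify how much a 69-sample test set limits the conclusion is a confidence interval for the accuracy. The sketch below computes a 95% Wilson score interval, assuming 67 of the 69 test predictions were correct (the count implied by the reported 0.97) and treating the predictions as independent trials. Neither the helper `wilson_interval` nor the interval itself appears in the original notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2))
    return centre - half_width, centre + half_width

# ASSUMPTION: 67 correct out of 69, inferred from the rounded accuracy.
low, high = wilson_interval(successes=67, n=69)
print(f"95% interval for accuracy: {low:.2f} to {high:.2f}")
```

The interval spans roughly 0.90 to 0.99, so the test set supports the claim of high accuracy but not a precise figure.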