<a href="https://colab.research.google.com/github/SayuruA/Pattern_Recognition/blob/main/Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load the penguins dataset
df = sns.load_dataset("penguins")
df.dropna(inplace=True)
# Filter rows for 'Adelie' and 'Chinstrap' classes
selected_classes = ['Adelie', 'Chinstrap']
df_filtered = df[df['species'].isin(selected_classes)].copy()  # Copy so that adding a column below does not trigger SettingWithCopyWarning
# Initialize the LabelEncoder
le = LabelEncoder()
# Encode the species column
y_encoded = le.fit_transform(df_filtered['species'])
df_filtered['class_encoded'] = y_encoded
# Display the filtered and encoded DataFrame
print(df_filtered[['species', 'class_encoded']])
# Split the data into features (X) and target variable (y)
y = df_filtered['class_encoded'] # Target variable
X = df_filtered.drop(['species', 'island', 'sex','class_encoded'], axis=1)

       species  class_encoded
0       Adelie              0
1       Adelie              0
2       Adelie              0
4       Adelie              0
5       Adelie              0
..         ...            ...
215  Chinstrap              1
216  Chinstrap              1
217  Chinstrap              1
218  Chinstrap              1
219  Chinstrap              1

[214 rows x 2 columns]


Explanation:

1.   After dropping the rows with missing values from the *Penguins* data set (*df*), we keep only the rows belonging to the species *Adelie* and *Chinstrap* (*df_filtered*).
2.   The species names are converted to integer labels using *Label Encoding* and stored in a new column of *df_filtered* (*class_encoded*).
3.   *df_filtered* is split into the features *X* (the four numeric columns left after dropping *species*, *island*, *sex* and *class_encoded*) and the target *y* (*class_encoded*).
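
As a minimal sketch of what the label encoding step does, the snippet below runs `LabelEncoder` on a small hand-written list that stands in for `df_filtered['species']` (the toy list is an assumption, not data from the notebook). `LabelEncoder` sorts the class names, so *Adelie* gets code 0 and *Chinstrap* gets code 1, which matches the printed output above.

```python
from sklearn.preprocessing import LabelEncoder

# Toy species column standing in for df_filtered['species'] (illustrative values only).
species = ["Adelie", "Chinstrap", "Adelie", "Chinstrap"]

le = LabelEncoder()
encoded = le.fit_transform(species)

# classes_ holds the original labels in sorted order; each label's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the species names.
decoded = le.inverse_transform([0, 1])
print(list(decoded))
```

Keeping the fitted encoder (`le`) around makes it possible to turn a predicted class code back into a species name later.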



In [47]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,Male


In [31]:
#Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#Train the logistic regression model. Here we are using the SAGA solver to learn the weights.
logreg = LogisticRegression(solver='saga',max_iter=1000000,tol = 1e-6)
logreg.fit(X_train, y_train)
# Predict on the testing data
y_pred = logreg.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(logreg.coef_, logreg.intercept_)

Accuracy: 1.0
[[ 0.94267355 -0.23814816 -0.1373987  -0.00292342]] [-0.01238663]


Explanation:

1.   First, we divide the data set into *train* (80%) and *test* (20%) splits.
2.   We use *SAGA* as the solver (the optimization algorithm) that fits the logistic regression model.
3.   With the default values of *max_iter* and *tol* (100 and 1e-4) *SAGA* performs very poorly on this data, so both are changed.
4.   The *fit* method learns the model weights from the *train* set.
5.   *predict* returns the model's predicted class for any data point.
6.   *accuracy* is measured on the *test* set, using the formula




> Accuracy = (Number of Correct Predictions) / (Total Number of Samples)
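
The formula can be checked by hand on a small example. The sketch below uses made-up label arrays (not the notebook's `y_test` and `y_pred`) and computes the same ratio that `accuracy_score` returns.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_hat = np.array([0, 0, 1, 0, 0, 1, 1, 0])

# Count the positions where the prediction matches the true label.
n_correct = int(np.sum(y_true == y_hat))

# Accuracy = correct predictions / total number of samples.
accuracy = n_correct / len(y_true)
print(n_correct, len(y_true), accuracy)
```

Here 6 of the 8 predictions are correct, so the accuracy is 0.75.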

Note 1: **SAGA**
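* *SAGA* performance may depend on *preprocessing, regularization, class imbalance* and *solver hyperparameters*.
* *SAGA* can be very sensitive to unscaled data. Here the features are on very different scales (*body_mass_g* is in the thousands, *bill_depth_mm* in the tens), which is a likely reason it needs so many iterations.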
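
One common way to address the scaling issue is to standardize the features before fitting. The sketch below is not part of the original notebook: it uses synthetic two-feature data (an assumption, chosen to mimic one small-scale and one large-scale feature) and scikit-learn's `StandardScaler` inside a pipeline, with a much smaller `max_iter` than the one used above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the penguins set): one informative feature
# in the tens and one uninformative feature in the thousands.
rng = np.random.default_rng(0)
n = 100
small_scale = np.concatenate([rng.normal(38, 2, n), rng.normal(49, 2, n)])
large_scale = rng.normal(3700, 400, 2 * n)
X_demo = np.column_stack([small_scale, large_scale])
y_demo = np.array([0] * n + [1] * n)

# Standardize each feature to zero mean and unit variance before SAGA sees it.
scaled_saga = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="saga", max_iter=1000),
)
scaled_saga.fit(X_demo, y_demo)

train_accuracy = scaled_saga.score(X_demo, y_demo)
print("Training accuracy with scaling:", train_accuracy)
```

Putting the scaler in a pipeline means the same transformation is applied automatically whenever `predict` is called on new data.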

In [79]:
X_test.head()
# Extract the row at position 20 of X_test (the 21st row); note that the variable name says "first row"
x_test_first_row = X_test.iloc[[20]]
x_test_first_row

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
104,37.9,18.6,193.0,2925.0


In [80]:
logreg.predict(x_test_first_row)

array([0])

In [81]:
# Look up the row with index label 104 in df_filtered to check its true species
df_filtered.loc[[104]]

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,class_encoded
104,Adelie,Biscoe,37.9,18.6,193.0,2925.0,Female,0
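
The prediction for row 104 can also be reproduced by hand. The sketch below assumes that the coefficients and intercept printed after training belong to the same fitted model that produced the prediction (the cells were run out of order, so this is an assumption). It computes the linear score, passes it through the sigmoid, and thresholds at 0.5.

```python
import numpy as np

# Weights and intercept copied from the printed logreg.coef_ and logreg.intercept_.
w = np.array([0.94267355, -0.23814816, -0.1373987, -0.00292342])
b = -0.01238663

# Row 104: bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g.
x = np.array([37.9, 18.6, 193.0, 2925.0])

z = float(w @ x + b)                  # linear score w.x + b
p_class1 = 1.0 / (1.0 + np.exp(-z))   # sigmoid: estimated P(class 1 = Chinstrap)
predicted = int(p_class1 >= 0.5)      # class 1 if probability is at least 0.5
print(z, p_class1, predicted)
```

The score is negative, so the estimated probability of *Chinstrap* is below 0.5 and the predicted class is 0 (*Adelie*), in agreement with the true label of the row.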


In [58]:
import numpy as np
a = np.array([40.3,18.0,195.0,3250.0])
a

array([  40.3,   18. ,  195. , 3250. ])

In [64]:
# Convert a to a one-row DataFrame with the same column names as X_test
a_df = pd.DataFrame(a.reshape(1, -1), columns=X_test.columns)
a_df

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,40.3,18.0,195.0,3250.0


In [72]:
logreg.predict(x_test_first_row)

array([1])

In [24]:
X.head()

Unnamed: 0,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g
0,39.1,18.7,181.0,3750.0
1,39.5,17.4,186.0,3800.0
2,40.3,18.0,195.0,3250.0
4,36.7,19.3,193.0,3450.0
5,39.3,20.6,190.0,3650.0
