Load the penguins dataset using the following code.

In [4]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the penguins dataset
df = sns.load_dataset("penguins")

df.dropna(inplace=True)

# Filter rows for 'Adelie' and 'Chinstrap' classes
selected_classes = ['Adelie', 'Chinstrap']
df_filtered = df[df['species'].isin(selected_classes)].copy()  # Make a copy to avoid the warning

# Initialize the LabelEncoder
le = LabelEncoder()

# Encode the species column
y_encoded = le.fit_transform(df_filtered['species'])

df_filtered['class_encoded'] = y_encoded

# Display the filtered and encoded DataFrame
print(df_filtered[['species', 'class_encoded']])

# Split the data into features (X) and target variable (y)

y = df_filtered['class_encoded']  # Target variable
X = df_filtered.drop(['species', 'island', 'sex','class_encoded'], axis=1)

       species  class_encoded
0       Adelie              0
1       Adelie              0
2       Adelie              0
4       Adelie              0
5       Adelie              0
..         ...            ...
215  Chinstrap              1
216  Chinstrap              1
217  Chinstrap              1
218  Chinstrap              1
219  Chinstrap              1

[214 rows x 2 columns]



1. What is the purpose of "y_encoded = le.fit_transform(df_filtered['species'])"?

This line encodes the target variable 'species' as numerical labels. Many machine learning algorithms require the target variable to be in numerical form for training. The LabelEncoder from scikit-learn sorts the distinct species names and assigns integer codes in that order, so 'Adelie' becomes 0 and 'Chinstrap' becomes 1. These integer labels are then used as the classes for classification.
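The mapping can be checked on a few hand-written labels. This is a minimal sketch: the list `species` is a toy stand-in for df_filtered['species'], not the real column.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for df_filtered['species'] (not the real data)
species = ["Chinstrap", "Adelie", "Adelie", "Chinstrap"]

le = LabelEncoder()
encoded = le.fit_transform(species)

# classes_ holds the sorted unique labels; a label's position is its code
print(le.classes_.tolist())   # ['Adelie', 'Chinstrap']
print(encoded.tolist())       # [1, 0, 0, 1]

# inverse_transform maps the integer codes back to the original names
decoded = le.inverse_transform(encoded).tolist()
print(decoded)
```

Keeping the fitted `le` object around is useful because inverse_transform can turn model predictions back into species names.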


2. What is the purpose of "X = df_filtered.drop(['species', 'island', 'sex', 'class_encoded'], axis=1)"?

This line creates the feature matrix X by dropping the columns 'species', 'island', 'sex', and 'class_encoded' from df_filtered. 'species' and 'class_encoded' hold the target, so they must not appear among the features. 'island' and 'sex' are categorical string columns that logistic regression cannot use directly; they would first have to be converted into numerical representations, for example by one-hot encoding, which is done later in this notebook. The resulting X contains only the four numeric measurements (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) that are used to train the logistic regression model.
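The effect of the drop can be seen on a tiny frame. This is only a sketch: `toy` is a hypothetical three-row stand-in for df_filtered with made-up values and fewer measurement columns.

```python
import pandas as pd

# Hypothetical three-row stand-in for df_filtered (values are made up)
toy = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Dream", "Biscoe"],
    "bill_length_mm": [39.1, 46.5, 37.8],
    "body_mass_g": [3750.0, 3500.0, 3400.0],
    "sex": ["Male", "Female", "Female"],
    "class_encoded": [0, 1, 0],
})

# Drop the target columns and the string-valued columns; keep the measurements
X_toy = toy.drop(["species", "island", "sex", "class_encoded"], axis=1)
y_toy = toy["class_encoded"]

print(X_toy.columns.tolist())   # ['bill_length_mm', 'body_mass_g']
print(X_toy.dtypes)
```

Note that drop returns a new DataFrame; `toy` itself still contains all six columns.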


3. Why can we not use the "island" and "sex" features?

"Island" and "sex" are categorical features stored as strings. Some machine learning algorithms can handle categorical data directly, but scikit-learn's LogisticRegression requires numeric input features. To use categorical features in logistic regression, they must first be one-hot encoded or otherwise transformed into numeric representations. In this first experiment the two columns are simply dropped; they are one-hot encoded later in the notebook.

Split the data into training and testing sets.

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
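With 214 rows and test_size=0.2, the split should give 171 training rows and 43 test rows. The sketch below checks this on placeholder arrays; `X_demo` and `y_demo` are made up, the class counts are assumed, and the `stratify` argument is an optional extra that the notebook itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as df_filtered (214)
X_demo = np.arange(214 * 4).reshape(214, 4)
y_demo = np.array([0] * 146 + [1] * 68)  # assumed class counts, for illustration

# Same arguments as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)   # (171, 4) (43, 4)

# Optional: stratify=y keeps the class proportions similar in both parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(np.bincount(y_tr_s), np.bincount(y_te_s))
```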

Train the logistic regression model. Here we use the "saga" solver to learn the weights.

In [6]:
logreg = LogisticRegression(solver='saga')

logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(logreg.coef_, logreg.intercept_)


Accuracy: 0.5813953488372093
[[ 2.75633615e-03 -8.08986406e-05  4.77783153e-04 -2.87299611e-04]] [-8.39446233e-06]




4. Why is the accuracy low? Why does the "saga" solver perform poorly?


The accuracy with the 'saga' solver is low mainly because 'saga' is sensitive to feature scaling. It is a stochastic gradient method, and when the features differ greatly in magnitude (here body_mass_g is in the thousands while bill_depth_mm is in the tens) it converges very slowly. The coefficients printed above are all close to zero, which suggests that the solver stopped at the default limit of max_iter=100 before reaching a good solution. This is also why the accuracy increases when switching to the 'liblinear' solver, which is less sensitive to feature scaling.

More generally, a low initial accuracy can have several causes:

The features might not be well-suited for classification.

There could be class imbalance in the dataset.

The choice of solver ('saga') may not be optimal for this specific dataset.

In this case the later experiments point to the solver combined with unscaled features: the same features reach an accuracy of 1.0 with 'liblinear', and 'saga' also reaches 1.0 later in the notebook once the features are scaled.
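One way to check whether the solver actually converged is to look for a ConvergenceWarning and compare n_iter_ with max_iter. The sketch below does this on synthetic data, not the penguins data: `X_demo` has one feature in the tens and one in the thousands, chosen only to imitate the difference in scale.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data on very different scales (NOT the penguins data)
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
small = rng.normal(loc=np.where(y_demo == 0, 38.0, 49.0), scale=2.0)  # tens
large = rng.normal(loc=3700.0, scale=400.0, size=y_demo.size)         # thousands
X_demo = np.column_stack([small, large])

model = LogisticRegression(solver="saga")  # default max_iter=100

# Record warnings instead of printing them, so they can be inspected
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model.fit(X_demo, y_demo)

hit_limit = any(issubclass(w.category, ConvergenceWarning) for w in caught)
print("ConvergenceWarning raised:", hit_limit)
print("iterations used:", int(model.n_iter_[0]), "of", model.max_iter)
print("coefficients:", model.coef_)
```

If the warning is raised, the usual remedies are to scale the features or to increase max_iter.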






Change the solver to "liblinear".


In [8]:
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(logreg.coef_, logreg.intercept_)

Accuracy: 1.0
[[ 1.61343591 -1.4665703  -0.15152349 -0.00398479]] [-0.08866849]


5. Why is the accuracy higher now? Why does the "liblinear" solver perform better than the "saga" solver?

Changing the solver to "liblinear" raises the accuracy to 1.0 here. "liblinear" is more robust to unscaled features than "saga", and it is a good choice for small datasets such as this one or when other solvers do not perform well, so it can handle the data in its original scale. The coefficients are also of a meaningful size now, unlike the near-zero values obtained with "saga".
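To compare solvers side by side, the same unscaled data can be fitted in a loop. This is a rough sketch on synthetic data (the arrays below are made up and only imitate features on different scales), so the numbers it prints will not match the notebook output.

```python
import warnings

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature in the tens, one in the thousands
rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1], 100)
X_demo = np.column_stack([
    rng.normal(np.where(y_demo == 0, 38.0, 49.0), 2.0),
    rng.normal(3700.0, 400.0, y_demo.size),
])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

results = {}
for solver in ["saga", "liblinear", "lbfgs"]:
    clf = LogisticRegression(solver=solver)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # hide convergence warnings here
        clf.fit(X_tr, y_tr)
    # Store test accuracy and the number of iterations the solver used
    results[solver] = (clf.score(X_te, y_te), int(clf.n_iter_[0]))

for solver, (acc, n_iter) in results.items():
    print(f"{solver:10s} accuracy={acc:.3f} iterations={n_iter}")
```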

Repeat the above tasks after feature normalization and observe the accuracy levels.

In [9]:
from sklearn.preprocessing import StandardScaler
# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg = LogisticRegression(solver='saga')
logreg.fit(X_train_scaled, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

6. Now observe the accuracies for both the "liblinear" solver and the "saga" solver. Why has the accuracy of the "saga" solver increased?

Standardizing the features with StandardScaler rescales each one to a mean of 0 and a standard deviation of 1, using statistics computed on the training set. This helps the 'saga' solver converge faster and perform better because it reduces the impact of feature scales on the optimization process. Scaling often makes the solver more stable and effective, especially when the features have very different scales.
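Scaling and fitting can also be combined in one object so that the test data is always transformed with the training statistics. The following is a minimal sketch using make_pipeline on synthetic data; the arrays are made up and the pipeline is an alternative to the manual fit_transform / transform calls above, not what the notebook itself does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one feature in the tens, one in the thousands
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
X_demo = np.column_stack([
    rng.normal(np.where(y_demo == 0, 38.0, 49.0), 2.0),
    rng.normal(3700.0, 400.0, y_demo.size),
])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training data only, then reused for prediction
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="saga"))
pipe.fit(X_tr, y_tr)

acc = pipe.score(X_te, y_te)
print("Accuracy with scaling + saga:", acc)
print("Training means used by the scaler:", pipe.named_steps["standardscaler"].mean_)
```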


Extra

The accuracy of the "saga" solver may increase after feature normalization (for example with the "MaxAbsScaler" used later in this notebook) because feature scaling can have a significant impact on the performance of logistic regression, especially with the "saga" solver. The main reasons are:

Feature Scaling: The "saga" solver is sensitive to the scale of the features. When features have different scales, it can converge slowly or stop at a suboptimal solution. In the original, unscaled data the features differ widely: "bill_length_mm" and "bill_depth_mm" are in the tens, while "body_mass_g" is in the thousands. Scaling these features to similar magnitudes helps the solver converge more quickly and reach a better solution.

Normalization Effect: Normalization, in this case using "MaxAbsScaler," can improve the condition of the optimization problem. It ensures that each feature has values within a similar range ([-1, 1]), making the optimization landscape more favorable. This can help the "saga" solver find a better decision boundary, leading to improved classification accuracy.

Reducing Numerical Instabilities: Feature scaling can reduce numerical instabilities that might occur during the optimization process. When features are on different scales, it can cause problems like floating-point overflows or underflows, which can affect the solver's performance. Scaling mitigates these issues.

Convergence: The "saga" solver is designed for large datasets and can handle both L1 and L2 regularization. In some cases, with scaled features, it may converge more efficiently and find a solution that separates the classes better.
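What the two scalers compute can be written out by hand. The sketch below uses a small made-up array (`X_demo`, two columns imitating bill length and body mass) and compares the manual formulas with the scikit-learn results.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Made-up values: column 0 in the tens, column 1 in the thousands
X_demo = np.array([
    [39.1, 3750.0],
    [46.5, 3500.0],
    [37.8, 3400.0],
    [50.0, 3900.0],
])

# MaxAbsScaler: divide each column by its maximum absolute value
by_hand_maxabs = X_demo / np.abs(X_demo).max(axis=0)

# StandardScaler: subtract the column mean, divide by the population std (ddof=0)
by_hand_standard = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

same_maxabs = np.allclose(MaxAbsScaler().fit_transform(X_demo), by_hand_maxabs)
same_standard = np.allclose(StandardScaler().fit_transform(X_demo), by_hand_standard)
print(same_maxabs, same_standard)

print(by_hand_maxabs.round(3))    # every value lies in [-1, 1]
print(by_hand_standard.round(3))  # every column has mean 0 and std 1
```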

Run the following code to load the dataset again, and use logistic regression for the classification.

In [10]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# Load the penguins dataset
df = sns.load_dataset("penguins")

df.dropna(inplace=True)

# Filter rows for 'Adelie' and 'Chinstrap' classes
selected_classes = ['Adelie', 'Chinstrap']
df_filtered = df[df['species'].isin(selected_classes)].copy()  # Make a copy to avoid the warning

# Initialize the LabelEncoder
le = LabelEncoder()

# Encode the species column
y_encoded = le.fit_transform(df_filtered['species'])
df_filtered['class_encoded'] = y_encoded


df_filtered.head()

X = df_filtered.drop(['species', 'class_encoded'], axis=1)  # Choose features
y = df_filtered['class_encoded']  # Target variable

X.head()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression(solver='saga')

#logreg = LogisticRegression(max_iter=166, solver='newton-cg')
# logreg = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=100, multi_class='ovr', random_state=42)
logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(logreg.coef_, logreg.intercept_)

ValueError: could not convert string to float: 'Dream'

Reason for the above error

The error is caused by non-numeric (string) values in the feature columns. In particular, the 'island' and 'sex' columns contain categorical string values such as 'Dream'. Logistic regression, like many other machine learning models, requires numeric input features.
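The same failure can be reproduced on a tiny frame, and the offending columns can be listed before fitting. This is a sketch with made-up values; `toy` and `labels` are hypothetical names.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up frame with one string column and one numeric column
toy = pd.DataFrame({
    "island": ["Torgersen", "Dream", "Dream", "Biscoe"],
    "bill_length_mm": [39.1, 46.5, 50.0, 37.8],
})
labels = [0, 1, 1, 0]

# Columns with a non-numeric dtype are the ones that trigger the error
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)   # ['island']

# Fitting on the raw frame fails because strings cannot be converted to floats
try:
    LogisticRegression().fit(toy, labels)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```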

Changes made to address the error:

One-hot encoded the 'island' and 'sex' columns with pandas: pd.get_dummies(df_filtered, columns=['island', 'sex'], drop_first=True). OneHotEncoder from sklearn.preprocessing is an alternative, but it is not used in the code below.

pd.get_dummies replaces the original 'island' and 'sex' columns with the new indicator columns, so no string columns remain among the features.

7. What is the problem? Why can the algorithm not perform classification?

The problem is that the categorical features 'island' and 'sex' are still strings; they have not been transformed into numerical representations (e.g., by one-hot encoding or label encoding). Logistic regression, as implemented here, requires numerical features, so fitting fails with the ValueError shown above when it tries to convert a value such as 'Dream' to a float.

8. How can this issue be solved?

To solve this issue, transform the categorical features 'island' and 'sex' into numerical representations. One common approach is one-hot encoding, which creates a binary (0/1) column for each category. With drop_first=True the first category of each feature is omitted, because it is implied when all the other indicator columns of that feature are 0. You can use the following code to perform one-hot encoding:

df_filtered = pd.get_dummies(df_filtered, columns=['island', 'sex'], drop_first=True)
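The effect of drop_first=True is easier to see on a few rows. The sketch below uses a made-up frame `toy`; the dtype=int argument is an addition that is not in the notebook, used here because recent pandas versions return True/False columns by default.

```python
import pandas as pd

# Made-up rows covering three islands and two sexes
toy = pd.DataFrame({
    "island": ["Torgersen", "Dream", "Biscoe", "Dream"],
    "sex": ["Male", "Female", "Female", "Male"],
    "bill_length_mm": [39.1, 46.5, 37.8, 50.0],
})

# dtype=int gives 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, columns=["island", "sex"], drop_first=True, dtype=int)

# 'Biscoe' and 'Female' come first alphabetically, so their columns are dropped
print(encoded.columns.tolist())
print(encoded)
```

A Biscoe row is represented by island_Dream = 0 and island_Torgersen = 0, and a female penguin by sex_Male = 0.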


In [14]:
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


# Load the penguins dataset
df = sns.load_dataset("penguins")

df.dropna(inplace=True)

# Filter rows for 'Adelie' and 'Chinstrap' classes
selected_classes = ['Adelie', 'Chinstrap']
df_filtered = df[df['species'].isin(selected_classes)].copy()  # Make a copy to avoid the warning

# Initialize the LabelEncoder
le = LabelEncoder()

# Encode the species column
y_encoded = le.fit_transform(df_filtered['species'])
df_filtered['class_encoded'] = y_encoded

# One-hot encode the 'island' and 'sex' columns
df_filtered = pd.get_dummies(df_filtered, columns=['island', 'sex'], drop_first=True)

df_filtered.head()

X = df_filtered.drop(['species', 'class_encoded'], axis=1)  # Choose features
y = df_filtered['class_encoded']  # Target variable

X.head()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logreg = LogisticRegression(solver='saga')

#logreg = LogisticRegression(max_iter=166, solver='newton-cg')
# logreg = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=100, multi_class='ovr', random_state=42)
logreg.fit(X_train, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(logreg.coef_, logreg.intercept_)

Accuracy: 0.5813953488372093
[[ 2.75545030e-03 -8.65124398e-05  4.49807168e-04 -2.85834541e-04
   1.85223496e-04 -1.04988406e-04  1.07637223e-05]] [-8.6399497e-06]




Use the following code to visualize the encoding.

In [15]:
samples = df_filtered.groupby('sex_Male').head(1)
print(samples)
print()
samples = df_filtered.groupby('island_Torgersen').head(1)
print(samples)
print()
samples = df_filtered.groupby('island_Dream').head(1)
print(samples)

  species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  \
0  Adelie            39.1           18.7              181.0       3750.0   
1  Adelie            39.5           17.4              186.0       3800.0   

   class_encoded  island_Dream  island_Torgersen  sex_Male  
0              0             0                 1         1  
1              0             0                 1         0  

   species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  \
0   Adelie            39.1           18.7              181.0       3750.0   
20  Adelie            37.8           18.3              174.0       3400.0   

    class_encoded  island_Dream  island_Torgersen  sex_Male  
0               0             0                 1         1  
20              0             0                 0         0  

   species  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  \
0   Adelie            39.1           18.7              181.0       3750.0   
30  Adelie    

Use the following code to apply logistic regression.

In [16]:
X = df_filtered.drop(['species','class_encoded'], axis=1)

y = df_filtered['class_encoded']  # Target variable
print(X.shape, y.shape)
X.head()

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
from sklearn.preprocessing import MaxAbsScaler

# Scale each feature by its maximum absolute value (fitted on the training set only)
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logreg = LogisticRegression(solver='saga', max_iter=150)

#logreg = LogisticRegression(max_iter=166, solver='newton-cg')
# logreg = LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=100, multi_class='ovr', random_state=42)
logreg.fit(X_train_scaled, y_train)

# Predict on the testing data
y_pred = logreg.predict(X_test_scaled)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(logreg.coef_, logreg.intercept_)

(214, 7) (214,)
Accuracy: 1.0
[[ 3.63420412  0.16314238  0.62632368  0.10221005  2.59927011 -0.87718394
  -0.35907275]] [-5.99637185]


Why are we using "MaxAbsScaler" rather than "StandardScaler"?

The choice between "MaxAbsScaler" and "StandardScaler" depends on the nature of the data and the goals of scaling. "MaxAbsScaler" divides each feature by its maximum absolute value, so the scaled values lie in the range [-1, 1]. It does not shift or center the data, so zeros stay zero. This preserves the sparsity of the data, and the 0/1 columns produced by one-hot encoding keep their values.

In contrast, "StandardScaler" standardizes the features to have a mean of 0 and a standard deviation of 1, which can be beneficial when features have different scales and you want to give them equal importance.
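The sparsity point can be illustrated with a SciPy sparse matrix. This is a small sketch on a made-up matrix `X_sparse`; it assumes SciPy is installed, which is the case wherever scikit-learn is.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Made-up sparse matrix: most entries are zero
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 8.0, 1.0],
]))

# MaxAbsScaler only divides, so zeros stay zero and the result stays sparse
scaled = MaxAbsScaler().fit_transform(X_sparse)
print(sparse.issparse(scaled), scaled.nnz == X_sparse.nnz)
print(scaled.toarray())

# StandardScaler centers by default, which would turn zeros into non-zeros;
# scikit-learn refuses to do this on sparse input
try:
    StandardScaler().fit_transform(X_sparse)
    centering_rejected = False
except ValueError:
    centering_rejected = True
print("centering rejected:", centering_rejected)

# Without centering, StandardScaler also accepts sparse input
scaled_no_center = StandardScaler(with_mean=False).fit_transform(X_sparse)
```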

Use the following code to visualize feature scaling before and after normalization.

In [17]:
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# MaxAbsScaler: divide each feature by its maximum absolute value
scaler = MaxAbsScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)

# StandardScaler: zero mean and unit variance per feature
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)

[[0.59655172 0.98139535 0.93396226 0.91666667 0.         1.
  1.        ]
 [0.8862069  0.88372093 0.94811321 0.82291667 1.         0.
  1.        ]
 [0.68275862 0.8        0.9245283  0.73958333 0.         1.
  0.        ]
 [0.87586207 0.86046512 0.94811321 0.92708333 1.         0.
  1.        ]
 [0.7137931  0.86046512 0.95283019 0.80729167 0.         1.
  1.        ]
 [0.64310345 0.95348837 0.93867925 0.78645833 0.         1.
  1.        ]
 [0.65172414 0.85116279 0.82075472 0.70833333 0.         0.
  0.        ]
 [0.5862069  0.79534884 0.87264151 0.70833333 1.         0.
  0.        ]
 [0.73965517 0.81860465 0.9245283  0.97916667 0.         1.
  1.        ]
 [0.62068966 0.79534884 0.88207547 0.77083333 1.         0.
  0.        ]
 [0.82068966 0.85116279 0.91981132 0.80208333 1.         0.
  0.        ]
 [0.80517241 0.83255814 0.91981132 0.6875     1.         0.
  0.        ]
 [0.63103448 0.85581395 0.86792453 0.72395833 1.         0.
  0.        ]
 [0.72586207 0.88837209 0.91981132 0.8

What can you observe in the values of the "island_Dream", "island_Torgersen", and "sex_Male" features before and after scaling?

Before scaling, the values of these categorical features are either 0 or 1 because they are binary (indicating the presence or absence of a category). "MaxAbsScaler" does not change them, since their maximum absolute value is 1; this is visible in the last three columns of the first printed array. "StandardScaler" does change them: it subtracts the column mean and divides by the standard deviation, so each column still takes only two distinct values, but these are no longer 0 and 1. The continuous numeric features are rescaled by both scalers.
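The behaviour of a single 0/1 column under both scalers can be checked directly. This is a minimal sketch; the column `dummy` is made up and stands in for a feature such as "sex_Male".

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Made-up indicator column: three ones and two zeros
dummy = np.array([[0.0], [1.0], [1.0], [0.0], [1.0]])

maxabs = MaxAbsScaler().fit_transform(dummy)
standard = StandardScaler().fit_transform(dummy)

# MaxAbsScaler divides by max |x| = 1, so the column is unchanged
print(maxabs.ravel())

# StandardScaler uses mean 0.6 and std sqrt(0.24): 0 maps to about -1.2247
# and 1 maps to about 0.8165
print(standard.ravel().round(4))
```

If the indicator columns should stay 0/1 while the continuous columns are standardized, one option is to apply the scaler to the continuous columns only.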