# Iris Logistic Regression

The dataset consists of three classes of irises. The objective is to create a classifier that predicts whether or not an iris belongs to the 'Iris-setosa' class.

This means that we have two classes: 'Iris-setosa' and not-'Iris-setosa' (which includes 'Iris-versicolor' and 'Iris-virginica').

Identify your independent variable x.

Encode your dependent variable y such that 'Iris-setosa' is encoded as 0, and 'Iris-versicolor' and 'Iris-virginica' are both encoded as 1. (0 corresponds to the 'Iris-setosa' class, and 1 corresponds to the not-'Iris-setosa' class.)


Split the data into a training and test set.


Use sklearn’s logistic regression function to fit a model and make predictions on the test set.


Use sklearn to generate a confusion matrix, which compares the predicted labels to the actual labels (gold labels).


Analyse the confusion matrix and provide a prediction, in a comment, on whether the model is likely to have higher precision, higher recall, or similar precision and recall.


Write your own code to calculate the accuracy, precision, and recall, and check whether your prediction was right.

In [1]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

In [2]:
# importing the dataset
iris = load_iris()
data = pd.DataFrame(iris.data, columns=iris.feature_names)
data['Species'] = iris.target
data.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [3]:
print("\nColumns of the dataset:")
print(data.columns)


Columns of the dataset:
Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)', 'Species'],
      dtype='object')


In [4]:
print("\nValue counts of the target variable:")
print(data["Species"].value_counts())


Value counts of the target variable:
Species
0    50
1    50
2    50
Name: count, dtype: int64


Identify the independent variable x and encode the dependent variable y:

In [5]:
# Independent variables (features)
X = data.iloc[:,[0,1,2,3]].values
# Dependent variable (target)
y = data.iloc[:,4].values
# Encode the target variable such that 'Iris-setosa' is 0 and others are 1
y_binary = np.where(data['Species'] == 0, 0, 1)
# Display the first few rows to check the 'Iris-setosa' encoding
print("\nEncoded Target Variable Preview (Binary):")
print(y_binary[:5])


Encoded Target Variable Preview (Binary):
[0 0 0 0 0]


Split the data into a training and test set

In [6]:
# Scale the features so that the model is easier to fit
X = preprocessing.scale(X)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.25, random_state=0)

# Display the shape of the training and test sets
print("\nTraining and Test Set Shapes (X_train.shape, X_test.shape, y_train.shape, y_test.shape):")
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


Training and Test Set Shapes (X_train.shape, X_test.shape, y_train.shape, y_test.shape):
(112, 4) (38, 4) (112,) (38,)
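
One caveat about the cell above: `preprocessing.scale` is applied to the whole dataset before the split, so the test rows contribute to the scaling statistics. The following is a minimal sketch of one alternative, assuming a `StandardScaler` inside a pipeline is acceptable for this task. The names `X_raw`, `y_bin`, and `pipe` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_raw = iris.data                           # unscaled features
y_bin = np.where(iris.target == 0, 0, 1)    # setosa -> 0, everything else -> 1

# Split first, so the test rows are never seen while fitting the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_bin, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on the training rows only,
# then reuses those statistics when transforming the test rows
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(X_tr.shape, X_te.shape)
print(test_accuracy)
```

On a dataset this small and this cleanly separable, the difference is unlikely to matter much, but the pattern carries over to cases where it does.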


Fit a logistic regression model and make predictions on the test set

In [7]:
# Fit a model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions on test data
y_pred = log_reg.predict(X_test)

# Display the predictions
print("\nPredictions on the Test Set (Binary):")
print(y_pred)


Predictions on the Test Set (Binary):
[1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 0
 1]
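
The labels printed above are hard 0/1 decisions. As a rough sketch of where they come from, assuming the default `LogisticRegression` settings used in this notebook, the predicted label should agree with thresholding the class-1 probability from `predict_proba` at 0.5. The names `clf`, `proba`, and `labels_from_proba` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

iris = load_iris()
X_all = scale(iris.data)
y_bin = np.where(iris.target == 0, 0, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_bin, test_size=0.25, random_state=0
)

clf = LogisticRegression().fit(X_tr, y_tr)

# One row per test sample; columns follow clf.classes_, i.e. [P(class 0), P(class 1)]
proba = clf.predict_proba(X_te)

# Thresholding the class-1 probability at 0.5 reproduces the hard labels
labels_from_proba = (proba[:, 1] > 0.5).astype(int)
labels = clf.predict(X_te)

print(proba[:3].round(3))
print(np.array_equal(labels, labels_from_proba))
```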


### Measuring Model Performance


Generate and display the confusion matrix.

In [8]:
# Generate the confusion matrix
conf_mat = confusion_matrix(y_test, y_pred)

# Display confusion matrix
print("\nConfusion Matrix (Binary):")
print(conf_mat)


Confusion Matrix (Binary):
[[13  0]
 [ 0 25]]
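
The task asks for a prediction about precision versus recall based on this matrix. A small sketch of how that reading could be made, with the matrix values copied from the output above: sklearn puts the actual labels on the rows and the predicted labels on the columns, so a 2x2 matrix flattens to `tn, fp, fn, tp` when class 1 is the positive class. The variable `expectation` is illustrative.

```python
import numpy as np

# Values copied from the confusion matrix printed above
conf = np.array([[13, 0],
                 [0, 25]])

# Rows are actual labels, columns are predicted labels
tn, fp, fn, tp = conf.ravel()

# Precision is lowered by false positives, recall by false negatives,
# so comparing the two off-diagonal counts indicates which will be higher
if fp > fn:
    expectation = "recall higher than precision"
elif fn > fp:
    expectation = "precision higher than recall"
else:
    expectation = "similar precision and recall"

print(expectation)  # similar precision and recall
```

Here both off-diagonal counts are zero, so precision and recall should be equal.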


Analyse the confusion matrix and calculate accuracy, precision, and recall

In [9]:
# Calculate and display accuracy, precision, and recall using sklearn
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print(f"\nAccuracy (Binary): {accuracy}")
print(f"Precision (Binary): {precision}")
print(f"Recall (Binary): {recall}")


Accuracy (Binary): 1.0
Precision (Binary): 1.0
Recall (Binary): 1.0


Write your own code to calculate accuracy, precision, and recall

In [10]:
# Calculate accuracy, precision, and recall manually
tp = conf_mat[1, 1]  # True positives
tn = conf_mat[0, 0]  # True negatives
fp = conf_mat[0, 1]  # False positives
fn = conf_mat[1, 0]  # False negatives

# Accuracy
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
# Precision
precision_manual = tp / (tp + fp)
# Recall
recall_manual = tp / (tp + fn)

print(f"Manual Accuracy: {accuracy_manual}")
print(f"Manual Precision: {precision_manual}")
print(f"Manual Recall: {recall_manual}")

Manual Accuracy: 1.0
Manual Precision: 1.0
Manual Recall: 1.0
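
The manual formulas above divide by `tp + fp` and `tp + fn`, which are zero when the model never predicts the positive class or when the test set contains no positive samples. A minimal sketch of a guarded version follows. The helper name `precision_recall` and the choice to return 0.0 in the undefined case are assumptions made for illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Counts from the binary confusion matrix in this notebook
print(precision_recall(tp=25, fp=0, fn=0))  # (1.0, 1.0)

# A model that never predicts the positive class
print(precision_recall(tp=0, fp=0, fn=5))   # (0.0, 0.0)
```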


Compare manual results vs sklearn results

In [11]:
# Check if the manual calculations match the sklearn results
print("\nDo manual calculations match sklearn results?")


Do manual calculations match sklearn results?


In [12]:
# Check if the manual calculations match the sklearn results
print(f"Accuracy match: {accuracy == accuracy_manual}")

Accuracy match: True


In [13]:
# Check if the manual calculations match the sklearn results
print(f"Precision match: {precision == precision_manual}")

Precision match: True


In [14]:
# Check if the manual recall matches the sklearn result
print(f"Recall match: {recall == recall_manual}")

Recall match: True
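
The checks above use `==`, which works here because every metric is exactly 1.0. For metrics that are not round numbers, two correct calculations can differ in the last few bits, so a tolerance-based comparison is usually safer. A minimal sketch, where the helper name `metrics_match` and the tolerance value are assumptions:

```python
import math


def metrics_match(value_a, value_b, rel_tol=1e-9):
    """Return True when two metric values agree within a relative tolerance."""
    return math.isclose(value_a, value_b, rel_tol=rel_tol)


# Exact values such as 1.0 compare equal either way
print(metrics_match(1.0, 1.0))        # True

# Values that differ only by rounding fail == but pass the tolerance check
print(0.1 + 0.2 == 0.3)               # False
print(metrics_match(0.1 + 0.2, 0.3))  # True
```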


(Optional) Repeat this task, but keep all three categories, 'Iris-setosa', 'Iris-versicolor', and 'Iris-virginica', encoded as the numeric values 0, 1, and 2 respectively; this is now a three-class problem. Observe how this changes the confusion matrix.

Encode y such that 'Iris-setosa' is 0, 'Iris-versicolor' is 1, and 'Iris-virginica' is 2.

In [15]:
# Encode the target variable such that 'Iris-setosa' is 0, 'Iris-versicolor' is 1, and 'Iris-virginica' is 2
y_optional = data['Species']
# Display the first few rows to check the encoding
print("\nEncoded Target Variable Preview (All classes):")
print(y_optional.head())


Encoded Target Variable Preview (All classes):
0    0
1    0
2    0
3    0
4    0
Name: Species, dtype: int32


In [16]:
# Split the data into training and test sets
X_train_optional, X_test_optional, y_train_optional, y_test_optional = train_test_split(X, 
                y_optional, test_size=0.25, random_state=0)

# Display the shape of the training and test sets
print("\nTraining and Test Set Shapes (X_train_optional.shape, X_test_optional.shape, y_train_optional.shape, y_test_optional.shape):")
print(X_train_optional.shape, X_test_optional.shape, y_train_optional.shape, y_test_optional.shape)


Training and Test Set Shapes (X_train_optional.shape, X_test_optional.shape, y_train_optional.shape, y_test_optional.shape):
(112, 4) (38, 4) (112,) (38,)


In [17]:
# Initialize the logistic regression model for multi-class classification using one-vs-rest (OvR).
# Note: the multi_class argument is deprecated in scikit-learn 1.5 and later.
l_g_optional = LogisticRegression(multi_class='ovr')

# Fit the model for multi-class classification
l_g_optional.fit(X_train_optional, y_train_optional)

# Make predictions on the test set
y_pred_optional = l_g_optional.predict(X_test_optional)

# Display the predictions
print("\nPredictions on the Test Set (All classes):")
print(y_pred_optional)


Predictions on the Test Set (All classes):
[2 1 0 2 0 2 0 2 1 1 1 2 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [18]:
# Generate the confusion matrix
cm_optional = confusion_matrix(y_test_optional, y_pred_optional)

# Display the confusion matrix
print("\nConfusion Matrix for all classes problem:")
print(cm_optional)


Confusion Matrix for all classes problem:
[[13  0  0]
 [ 0 13  3]
 [ 0  1  8]]
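
With three classes, each class has its own precision and recall. As a sketch of how they can be read off the matrix above (values copied from the output): the diagonal holds the correct predictions, column sums give the number of times each class was predicted, and row sums give the number of actual samples per class. The variable names are illustrative.

```python
import numpy as np

# Values copied from the three-class confusion matrix printed above
cm = np.array([[13, 0, 0],
               [0, 13, 3],
               [0, 1, 8]])

correct = np.diag(cm)                          # correctly classified samples per class

precision_per_class = correct / cm.sum(axis=0) # divide by "times predicted" (columns)
recall_per_class = correct / cm.sum(axis=1)    # divide by "actual samples" (rows)
overall_accuracy = correct.sum() / cm.sum()

print(precision_per_class)
print(recall_per_class)
print(overall_accuracy)
```

Setosa stays perfectly separated, while the four errors all fall between versicolor and virginica.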


In [19]:
# Calculate and display accuracy, precision, and recall for multi-class classification using sklearn
accuracy_optional = accuracy_score(y_test_optional, y_pred_optional)
precision_optional = precision_score(y_test_optional, y_pred_optional, average='macro')
recall_optional = recall_score(y_test_optional, y_pred_optional, average='macro')

print(f"\nAccuracy (all-class): {accuracy_optional}")
print(f"Precision (all-class): {precision_optional}")
print(f"Recall (all-class): {recall_optional}")


Accuracy (all-class): 0.8947368421052632
Precision (all-class): 0.8852813852813853
Recall (all-class): 0.9004629629629629


In [20]:
# Calculate the F1 score for multi-class classification
f1_optional = f1_score(y_test_optional, y_pred_optional, average='macro')
print(f"F1 Score (Multi-class): {f1_optional}")

F1 Score (Multi-class): 0.888888888888889
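
The `average='macro'` setting takes the unweighted mean of the per-class scores. A plain-Python sketch of that calculation, using counts read from the three-class confusion matrix in this notebook, with a support-weighted mean shown for comparison. The helper `f1_from_counts` and the dictionary layout are assumptions for illustration.

```python
# Per-class counts read from the three-class confusion matrix:
# (true positives, false positives, false negatives)
counts = {
    "setosa": (13, 0, 0),
    "versicolor": (13, 1, 3),
    "virginica": (8, 3, 1),
}


def f1_from_counts(tp, fp, fn):
    """F1 is the harmonic mean of precision and recall: 2*tp / (2*tp + fp + fn)."""
    return 2 * tp / (2 * tp + fp + fn)


f1_per_class = {name: f1_from_counts(*c) for name, c in counts.items()}

# Macro average: every class counts equally
macro_f1 = sum(f1_per_class.values()) / len(f1_per_class)

# Weighted average: each class counts in proportion to its number of actual samples
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}
weighted_f1 = (
    sum(f1_per_class[name] * support[name] for name in counts)
    / sum(support.values())
)

print(f1_per_class)
print(macro_f1)
print(weighted_f1)
```

The macro value should agree with the sklearn result printed above, up to floating-point rounding.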


In [21]:
# Calculate the F1 score for each class and determine the hardest class
f1_scores_per_class = f1_score(y_test_optional, y_pred_optional, average=None)
hardest_class_index = int(np.argmin(f1_scores_per_class))
hardest_class = iris.target_names[hardest_class_index]
print(f"Hardest class: {hardest_class}")

Hardest class: virginica


In [22]:
# Calculate precision and recall for the 'virginica' class by treating it as a one-vs-rest binary problem
class_index = list(iris.target_names).index('virginica')
recall_virginica = recall_score(y_test_optional == class_index, y_pred_optional == class_index)
precision_virginica = precision_score(y_test_optional == class_index, y_pred_optional == class_index)
print(f"Recall for 'virginica': {recall_virginica}")
print(f"Precision for 'virginica': {precision_virginica}")

Recall for 'virginica': 0.8888888888888888
Precision for 'virginica': 0.7272727272727273
