## Lab 6 ##

Kar Lok Ng
8971216

In [95]:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

iris = load_iris(as_frame=True)
X = iris.data

# this way, all instances of 'virginica' are labelled as True
# and all other instances ('non-virginica') are labelled as False
y = iris.target_names[iris.target] == 'virginica'

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# training the model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)

# obtaining train/test label predictions
y_train_pred = log_reg.predict(X_train)
y_test_pred = log_reg.predict(X_test)

# training a DummyClassifier for context
dummy_clf = DummyClassifier()
dummy_clf.fit(X_train, y_train)

y_dummy_train_pred = dummy_clf.predict(X_train)
y_dummy_test_pred = dummy_clf.predict(X_test)

In [96]:
cm1 = confusion_matrix(y_train, y_train_pred)
cm2 = confusion_matrix(y_test, y_test_pred)

cm3 = confusion_matrix(y_train, y_dummy_train_pred)
cm4 = confusion_matrix(y_test, y_dummy_test_pred)

print("Train Confusion Matrix: ")
print(cm1)
print("")
print("Test Confusion Matrix: ")
print(cm2)
print("")
print("Train Dummy Confusion Matrix: ")
print(cm3)
print("")
print("Test Dummy Confusion Matrix: ")
print(cm4)
# Just a note for myself to keep myself sane.
# Rows are the true labels, columns are the predicted labels, and the
# classes are ordered [False, True], so the layout is:
# True Negative  | False Positive
# False Negative | True Positive

Train Confusion Matrix: 
[[72  2]
 [ 1 37]]

Test Confusion Matrix: 
[[26  0]
 [ 0 12]]

Train Dummy Confusion Matrix: 
[[74  0]
 [38  0]]

Test Dummy Confusion Matrix: 
[[26  0]
 [12  0]]


We observe that the logistic regression model performs far better than the dummy classifier.

The dummy classifier completely ignores the input features (X_train or X_test) and simply predicts the most frequent class label seen in y_train, which here is False (not virginica). That is why its second column is all zeros: it never predicts virginica. This provides a good baseline to compare our logistic regression classifier against.
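To make the baseline behaviour concrete, here is a minimal sketch on made-up toy data (the arrays `X_toy`, `y_toy` and `X_new` are hypothetical, not part of the lab). It uses `strategy="most_frequent"` explicitly; the default strategy, `"prior"`, returns the same labels from `predict`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: 5 samples whose majority label is False
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
y_toy = np.array([False, False, False, True, True])

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

# Wildly different inputs still get the same prediction,
# because the features are never looked at
X_new = np.array([[-50.0], [2.5], [1e6]])
preds = dummy.predict(X_new)
print(preds)
```

Every prediction is the majority training label, no matter what the feature values are.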

We see that, on the test set, the logistic regression classifier has no false negatives and no false positives. In other words, it predicted the label for the flower (virginica or not virginica) correctly for all 38 test observations. On the training data it also performed admirably, predicting the correct label for all but 3 of the 112 observations (2 false positives and 1 false negative).
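As a quick check of these numbers, the sketch below recomputes accuracy, precision and recall from the four confusion matrices printed above. The matrices are typed in by hand from that output, and the helper name `summarize` is my own, not something from scikit-learn.

```python
import numpy as np

# Confusion matrices copied from the printed output above
matrices = {
    "logreg train": np.array([[72, 2], [1, 37]]),
    "logreg test": np.array([[26, 0], [0, 12]]),
    "dummy train": np.array([[74, 0], [38, 0]]),
    "dummy test": np.array([[26, 0], [12, 0]]),
}

def summarize(cm):
    """Return (accuracy, precision, recall) for a 2x2 confusion matrix."""
    # scikit-learn layout: [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / cm.sum()
    # precision is undefined when nothing is predicted positive
    precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    return accuracy, precision, recall

results = {name: summarize(cm) for name, cm in matrices.items()}
for name, (acc, prec, rec) in results.items():
    print(f"{name:13s} accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

The dummy classifier still reaches roughly two-thirds accuracy simply because non-virginica is the majority class, but its recall for virginica is 0, which is why accuracy alone would flatter it.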

We can take a look at these 3 observations specifically, and see whether they have anything in common.

In [97]:
# false + false = 0
# true + true = 2
# false true/true false = 1
# thus if there's a difference in the two lists, it would equal 1
# unfortunately this method strips the information of what type
# of error we have (i.e. I don't know if it's a false positive or
# false negative)
diff_lst = list(y_train_pred.astype('int') + y_train.astype('int'))

# get their indices
diff_idx = [i for i in range(len(diff_lst)) if diff_lst[i] == 1]

# getting per-class aggregate values
iris_data = iris.data
iris_target = iris.target_names[iris.target]
iris_data['target'] = iris_target

# get the mean/std per class

print("Setosa ------------------")
print(iris_data[iris_data['target'] == 'setosa'].iloc[:, 0:4].mean())
print(iris_data[iris_data['target'] == 'setosa'].iloc[:, 0:4].std())
print("")
print("Versicolor ------------------")
print(iris_data[iris_data['target'] == 'versicolor'].iloc[:, 0:4].mean())
print(iris_data[iris_data['target'] == 'versicolor'].iloc[:, 0:4].std())
print("")
print("Virginica ------------------")
print(iris_data[iris_data['target'] == 'virginica'].iloc[:, 0:4].mean())
print(iris_data[iris_data['target'] == 'virginica'].iloc[:, 0:4].std())
print("")
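print('The features of the 3 mislabeled data points:')
# NOTE: diff_idx holds positions within X_train, while iris.data.iloc
# indexes the full, unshuffled dataset by position; X_train.iloc[diff_idx]
# would select the rows that were actually misclassified.
display(iris.data.iloc[diff_idx])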

Setosa ------------------
sepal length (cm)    5.006
sepal width (cm)     3.428
petal length (cm)    1.462
petal width (cm)     0.246
dtype: float64
sepal length (cm)    0.352490
sepal width (cm)     0.379064
petal length (cm)    0.173664
petal width (cm)     0.105386
dtype: float64

Versicolor ------------------
sepal length (cm)    5.936
sepal width (cm)     2.770
petal length (cm)    4.260
petal width (cm)     1.326
dtype: float64
sepal length (cm)    0.516171
sepal width (cm)     0.313798
petal length (cm)    0.469911
petal width (cm)     0.197753
dtype: float64

Virginica ------------------
sepal length (cm)    6.588
sepal width (cm)     2.974
petal length (cm)    5.552
petal width (cm)     2.026
dtype: float64
sepal length (cm)    0.635880
sepal width (cm)     0.322497
petal length (cm)    0.551895
petal width (cm)     0.274650
dtype: float64

The features of the 3 mislabeled data points:


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 46 | 5.1 | 3.8 | 1.6 | 0.2 | setosa |
| 54 | 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 108 | 6.7 | 2.5 | 5.8 | 1.8 | virginica |


My idea going into this was that each erroneous label might be explained by the observation being an outlier within its own class. However, that does not seem to be the case here: every feature of the 3 observations shown above lies within roughly 1 to 1.5 standard deviations of its class mean.
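The outlier check can be made explicit by turning each feature into a z-score relative to its own class. The sketch below does this for the three rows displayed above (46, 54 and 108); the names `species`, `class_mean`, `class_std` and `z_scores` are my own, and the standard deviation is the sample one (pandas default), matching the per-class values printed earlier.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data
species = pd.Series(
    iris.target_names[iris.target.to_numpy()],
    index=features.index,
    name="species",
)

# Per-class mean and sample std, broadcast back onto every row
class_mean = features.groupby(species).transform("mean")
class_std = features.groupby(species).transform("std")

# How many class standard deviations each value is from its class mean
z_scores = (features - class_mean) / class_std

rows = [46, 54, 108]  # the rows displayed above
print(z_scores.loc[rows].round(2))
```

A value with an absolute z-score well above 2 would be a candidate outlier; none of the three rows comes close to that.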

Perhaps these 3 observations lie close to the decision boundary between virginica and not-virginica, and were therefore labelled wrongly by our model.
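One way to probe the decision-boundary idea is to look at the predicted probability of virginica for each misclassified training row: values near 0.5 would mean the model was nearly undecided. The sketch below is a possible approach rather than part of the original lab. It selects the rows directly from `X_train` with a boolean mask (positions within `X_train` do not correspond to positions in the full dataset, since the split shuffles the rows) and also keeps the error type. The names `species`, `sp_train`, `wrong`, `misclassified` and `p_virginica` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X = iris.data
species = iris.target_names[iris.target.to_numpy()]
y = species == "virginica"

# Split the species names alongside X and y so all three stay aligned
X_train, X_test, y_train, y_test, sp_train, sp_test = train_test_split(
    X, y, species, random_state=42
)

log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train, y_train)
y_train_pred = log_reg.predict(X_train)

# Probability of the True (virginica) class for every training row
p_train = log_reg.predict_proba(X_train)[:, 1]

# Boolean mask of the misclassified training rows
wrong = y_train_pred != y_train

misclassified = X_train.loc[wrong].copy()
misclassified["species"] = sp_train[wrong]
# Predicted True but actually False -> false positive, otherwise false negative
misclassified["error"] = np.where(
    y_train_pred[wrong], "false positive", "false negative"
)
misclassified["p_virginica"] = p_train[wrong]

print(misclassified)
```

The index of `misclassified` holds the original row labels from the full dataset, so the rows can be cross-referenced against the per-class statistics above.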