In [40]:
# imports
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier

# Part 1: Load the dataset

In [2]:
# load the iris data from the sklearn library
# return the features and the target separately (return_X_y=True)
# as a pandas DataFrame and Series (as_frame=True)
iris_data, iris_target = load_iris(return_X_y=True, as_frame=True)

COL_S_L = "sepal length (cm)"
COL_S_W = "sepal width (cm)"
COL_P_L = "petal length (cm)"
COL_P_W = "petal width (cm)"

In [3]:
# output the first 15 rows of data
iris_data.head(15)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


In [4]:
# summary of the table information
iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
dtypes: float64(4)
memory usage: 4.8 KB


## About the dataset  
This dataset has 4 features: sepal length, sepal width, petal length and petal width. The labels are integer codes for the iris species, stored in the `iris_target` series. There are 3 kinds of iris in the dataset, and we can use the 4 features to identify each one.

# Part 2: Split the dataset into train and test

In [5]:
# 90% train and 10% test
features_train, features_test, labels_train, labels_test = train_test_split(
    iris_data, iris_target, test_size=0.1, random_state=53
)
print(f'features training size: {len(features_train)}')
print(f'features testing size: {len(features_test)}')
print(f'labels training size: {len(labels_train)}')
print(f'labels testing size: {len(labels_test)}')

features training size: 135
features testing size: 15
labels training size: 135
labels testing size: 15


# Part 3: Logistic Regression

In [6]:
# use sklearn to train a LogisticRegression model on the training set
clf = LogisticRegression().fit(features_train, labels_train)

In [7]:
sample_index = 11
sample_X = features_train.iloc[sample_index: sample_index+1, :]
sample_y = labels_train.iloc[sample_index]
# predict sample data
probas = clf.predict_proba(sample_X)
print(f'Probabilities of each class: {probas}')
print(f'Actual label: {sample_y}')

Probabilities of each class: [[9.84881804e-01 1.51181819e-02 1.41387858e-08]]
Actual label: 0


The result suggests that the 12th element of the training features (index 11) belongs to class 0 with probability 98.5%, to class 1 with probability 1.5%, and to class 2 with probability 0.0000014%. The most probable class matches the actual label.

In [8]:
clf.score(features_test, labels_test)

0.9333333333333333

The `score` function returns the mean accuracy of the model on `features_test` and `labels_test`. An accuracy of 93.3% means the model classifies 14 of the 15 test samples correctly, so it fits the testing data well.
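To make "mean accuracy" concrete, here is a minimal sketch of the same calculation in plain Python. The helper `mean_accuracy` and the label lists are hypothetical, made up for illustration; they are not the actual test labels from this notebook.

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels for a 15-sample test set (not the real split)
y_true = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]
y_pred = list(y_true)
y_pred[4] = 2  # exactly one misclassified sample

acc = mean_accuracy(y_true, y_pred)
print(f"{acc:.4f}")  # 0.9333, i.e. 14 correct out of 15
```

With only 15 test samples, each single mistake changes the accuracy by about 6.67 percentage points.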

In [9]:
print(clf.coef_)
print(clf.intercept_)

[[-0.38874638  0.98095147 -2.45856245 -1.05939201]
 [ 0.58105679 -0.37328934 -0.27315416 -0.8526411 ]
 [-0.19231041 -0.60766213  2.73171661  1.91203311]]
[  9.33326993   2.19963106 -11.532901  ]
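The printed coefficients and intercepts can be turned back into class probabilities by hand. The sketch below assumes the model uses the multinomial (softmax) formulation, which I believe is what recent scikit-learn versions do for a multiclass problem with the default solver; the helper names `softmax` and `predict_proba_row` are my own. It is applied to row 0 of `iris_data`, not to the sample used above.

```python
import math

# Values copied from the printed clf.coef_ and clf.intercept_
coef = [
    [-0.38874638, 0.98095147, -2.45856245, -1.05939201],
    [0.58105679, -0.37328934, -0.27315416, -0.8526411],
    [-0.19231041, -0.60766213, 2.73171661, 1.91203311],
]
intercept = [9.33326993, 2.19963106, -11.532901]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba_row(x):
    """One linear score per class (w . x + b), then softmax (assumed)."""
    scores = [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(coef, intercept)
    ]
    return softmax(scores)


# Row 0 of iris_data: sepal length, sepal width, petal length, petal width
x = [5.1, 3.5, 1.4, 0.2]
row_probas = predict_proba_row(x)
print([round(p, 4) for p in row_probas])
```

Each row of `coef_` belongs to one class, so the class with the largest linear score receives the largest probability.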


# Part 4: Support Vector Machine

In [10]:
# use sklearn to train a SVC on training set
svm_clf = svm.SVC(probability=True)
svm_clf.fit(features_train, labels_train)

SVC(probability=True)

In [11]:
svm_probas = svm_clf.predict_proba(sample_X)
print(f'Probabilities of each class: {svm_probas}')
print(f'Actual label: {sample_y}')

Probabilities of each class: [[0.97801188 0.01343567 0.00855245]]
Actual label: 0


The result suggests that the 12th element of the training features belongs to class 0 with probability 97.8%, to class 1 with probability 1.3%, and to class 2 with probability 0.86%. The most probable class matches the actual label.

In [12]:
svm_clf.score(features_test, labels_test)

0.9333333333333333

The `score` function measures the mean accuracy on the given testing dataset.  
An accuracy of 93.33% (14 of 15 test samples) shows that the model classifies the test data well.

# Part 5: Neural Network

In [18]:
# train an MLP classifier on the training set
MLP_clf = MLPClassifier(random_state=1).fit(features_train, labels_train)



In [20]:
MLP_probas = MLP_clf.predict_proba(sample_X)
print(f'Probabilities of each class: {MLP_probas}')
print(f'Actual label: {sample_y}')

Probabilities of each class: [[0.95232573 0.04546975 0.00220452]]
Actual label: 0


The result suggests that the 12th element of the training features belongs to class 0 with probability 95.2%, to class 1 with probability 4.5%, and to class 2 with probability 0.22%. The most probable class matches the actual label.

In [21]:
MLP_clf.score(features_test, labels_test)

0.8666666666666667

The `score` function tells us the mean accuracy on the given testing dataset.  
With the default limit of 200 iterations (`max_iter=200`) and the rectified linear unit (ReLU) activation function, the MLP classifier has an accuracy of 86.67% (13 of 15 test samples).

In [38]:
MLP_clf_600 = MLPClassifier(random_state=1, max_iter=600).fit(features_train, labels_train)
MLP_probas_600 = MLP_clf_600.predict_proba(sample_X)
print(f'Probabilities of each class: {MLP_probas_600}')
print(MLP_clf_600.score(features_test, labels_test))

Probabilities of each class: [[9.98998468e-01 1.00153237e-03 2.61583050e-15]]
0.9333333333333333




In [37]:
MLP_clf_700 = MLPClassifier(random_state=1, max_iter=700).fit(features_train, labels_train)
MLP_probas_700 = MLP_clf_700.predict_proba(sample_X)
print(f'Probabilities of each class: {MLP_probas_700}')
print(MLP_clf_700.score(features_test, labels_test))

Probabilities of each class: [[9.99275346e-01 7.24653546e-04 5.88429271e-17]]
0.9333333333333333


With `max_iter` raised to 600, the test accuracy already reaches 93.33%, and raising it to 700 does not change it. At around 700 iterations the MLP classifier reaches its optimum, with an accuracy of 93.33%.

In [30]:
MLP_clf_logistic = MLPClassifier(random_state=1, activation="logistic").fit(features_train, labels_train)
MLP_clf_logistic.score(features_test, labels_test)



0.9333333333333333

In [39]:
MLP_clf_logistic_1000 = MLPClassifier(random_state=1, max_iter=1000, activation="logistic").fit(features_train, labels_train)
MLP_probas_logistic_1000 = MLP_clf_logistic_1000.predict_proba(sample_X)
print(f'Probabilities of each class: {MLP_probas_logistic_1000}')
print(MLP_clf_logistic_1000.score(features_test, labels_test))

Probabilities of each class: [[9.95398651e-01 4.60134902e-03 2.62218707e-12]]
0.9333333333333333


Using the logistic function as the activation function and allowing up to 1000 iterations, this MLP classifier reaches its optimum with an accuracy of 93.33%.  
This is the same accuracy as the classifier using ReLU as the activation function, but this classifier needs more iterations to reach its optimum.
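For reference, the two activation functions compared above can be written in a few lines. This is a plain-Python sketch of the standard definitions, not the library's internal code; the function names are my own. The gradient of the logistic function never exceeds 0.25, which is one commonly cited reason a logistic network can need more iterations, although this notebook does not verify that explanation.

```python
import math


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)


def logistic(x):
    """Logistic (sigmoid) function: squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_grad(x):
    """Derivative of the logistic function; its maximum is 0.25 at x = 0."""
    s = logistic(x)
    return s * (1.0 - s)


for x in (-2.0, 0.0, 2.0):
    print(
        f"x={x:+.1f}  relu={relu(x):.3f}  "
        f"logistic={logistic(x):.3f}  logistic_grad={logistic_grad(x):.3f}"
    )
```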

# Part 6: K-Nearest Neighbors

In [43]:
# train a k-neighbors classifier
neigh = KNeighborsClassifier()
neigh.fit(features_train.values, labels_train.values)

KNeighborsClassifier()

In [44]:
neigh_probas = neigh.predict_proba(sample_X.values)
print(f'Probabilities of each class: {neigh_probas}')
print(f'Actual label: {sample_y}')

Probabilities of each class: [[1. 0. 0.]]
Actual label: 0


The KNN classifier assigns sample X to class 0, which matches our label. A probability of 1 means that all 5 nearest neighbors belong to class 0.

In [46]:
neigh.score(features_test.values, labels_test.values)

0.9333333333333333

We use the default of 5 neighbors and let the `KNeighborsClassifier` class choose the search algorithm.  
The KNN model gives us an accuracy of 93.33%.
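The KNN probabilities are easier to read once you see how they come about: with the default uniform weights, each class probability is the share of the k nearest neighbors that carry that label. Below is a minimal pure-Python sketch of that idea. The function `knn_predict_proba` and the tiny two-feature dataset are hypothetical, and the real class uses faster neighbor searches than this brute-force loop.

```python
import math
from collections import Counter


def knn_predict_proba(train_X, train_y, query, k=5, n_classes=3):
    """Share of the k nearest training points that belong to each class."""
    # Brute force: Euclidean distance from the query to every training point
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return [votes.get(c, 0) / k for c in range(n_classes)]


# Hypothetical (petal length, petal width) points, not the real training set
train_X = [
    [1.4, 0.2], [1.5, 0.2], [1.3, 0.2], [1.4, 0.3], [1.7, 0.4],
    [4.5, 1.5], [4.7, 1.4], [5.1, 1.9],
]
train_y = [0, 0, 0, 0, 0, 1, 1, 2]

# All 5 nearest neighbors are class 0, so the vote is unanimous
print(knn_predict_proba(train_X, train_y, [1.5, 0.3]))  # [1.0, 0.0, 0.0]

# A query near the class 1 and class 2 points gets a split vote
print(knn_predict_proba(train_X, train_y, [4.6, 1.5]))  # [0.4, 0.4, 0.2]
```

This is why KNN probabilities only take the values 0, 0.2, 0.4, 0.6, 0.8 and 1 when k is 5, and why they are not directly comparable to the smooth probabilities of the other models.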

# Part 7: Conclusions and takeaways  
Looking at the scores of all these models, the best test accuracy reached in this setup is about 93.33%, which corresponds to 14 of the 15 test samples.  
  
In my opinion, the neural network does the best job. Once it is configured to reach its optimum, the neural network gives the most confident correct prediction on the sample datapoint: it assigns the 12th datapoint of the training features to class 0 with probability 99.93% using ReLU at 700 iterations and 99.54% using the logistic activation function, while logistic regression gives 98.49% and SVM gives 97.8%.  
KNN works differently: it simply assigns the 12th datapoint to class 0.  
But the neural network has to be configured to get the best result. If it is not, the default setup gives an accuracy of 86.67%, which is clearly worse than the other models.  
Considering that logistic regression runs at most 100 iterations by default and gets a result similar to the tuned neural network, which was allowed 1000 iterations, logistic regression has the best overall performance at recognizing iris species.  
  
What surprised me is that the neural network requires much more configuration and more resources. It offers different activation functions and needs a large number of iterations to achieve its best model. These settings can really improve the performance of a neural network, but for recognizing iris species the improvement over the other models is limited.
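One way to put the confidence comparison above on a common scale is the negative log of the probability each model assigned to the true class (class 0) of the sample datapoint; smaller means more confident. The sketch below only reuses the probabilities printed earlier in this notebook. It covers a single training sample, so it is an illustration rather than a test-set metric. KNN is left out because its probability is a neighbor vote share.

```python
import math

# Probability assigned to the true class 0, copied from the printed outputs
p_true = {
    "logistic regression": 0.984881804,
    "SVC": 0.97801188,
    "MLP relu, max_iter=200": 0.95232573,
    "MLP relu, max_iter=600": 0.998998468,
    "MLP relu, max_iter=700": 0.999275346,
    "MLP logistic, max_iter=1000": 0.995398651,
}

# Rank the models from most to least confident on this one sample
ranked = sorted(p_true.items(), key=lambda item: item[1], reverse=True)
for name, p in ranked:
    print(f"{name:30s} p={p:.4f}  -log(p)={-math.log(p):.4f}")
```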