# Iris Dataset Project 

### Example

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from pandas.plotting import parallel_coordinates
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

In [3]:
# load through url
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
dataset = pd.read_csv(url, names = attributes)
dataset.columns = attributes

# or load through local csv using the line below
# dataset = pd.read_csv('data.csv')

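
If the UCI URL cannot be reached, one possible fallback is the copy of Iris bundled with scikit-learn. The sketch below is an assumption on my part rather than part of the original notebook: it rebuilds a frame with the same column names and `Iris-` prefixed labels. The bundled copy may differ from the UCI file in a couple of values, so results can shift slightly.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Column names used throughout this notebook
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

# load_iris ships with scikit-learn, so no network access is needed
iris = load_iris(as_frame=True)
dataset = iris.data.copy()
dataset.columns = attributes[:-1]

# Recreate the UCI-style labels, e.g. 0 -> 'Iris-setosa'
dataset["class"] = ["Iris-" + iris.target_names[t] for t in iris.target]

print(dataset.shape)
print(dataset.groupby("class").size())
```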

In [None]:
dataset

In [None]:
dataset.head(5)

In [None]:
# types for the columns
dataset.dtypes

In [None]:
# numerical summary, only applies to numerical columns
dataset.describe()

In [None]:
# number of instances in each class
dataset.groupby('class').size()

# Train - Test - Split

In [None]:
# Take out a test set
train, test = train_test_split(dataset, test_size = 0.4, stratify = dataset['class'], random_state = 42)



In [None]:
# number of instances in each class in training data
train.groupby('class').size()

*Note: I set 40 percent of data to be the test set to ensure there are enough data points to test the models.*
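
As a quick check of what `stratify` does, the sketch below repeats the same split settings on scikit-learn's bundled Iris arrays (an assumption, used here so the example runs without the URL) and counts the classes on each side. With 150 balanced rows and a 40% test set, each class should contribute 30 training rows and 20 test rows.

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as the split above: 40% test, stratified by class
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```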

## Exploratory Data Analysis 

In [None]:
# histograms
n_bins = 10
fig, axs = plt.subplots(2, 2)
axs[0,0].hist(train['sepal_length'], bins = n_bins);
axs[0,0].set_title('Sepal Length');
axs[0,1].hist(train['sepal_width'], bins = n_bins);
axs[0,1].set_title('Sepal Width');
axs[1,0].hist(train['petal_length'], bins = n_bins);
axs[1,0].set_title('Petal Length');
axs[1,1].hist(train['petal_width'], bins = n_bins);
axs[1,1].set_title('Petal Width');

# add some spacing between subplots
fig.tight_layout(pad=1.0);

In [None]:
sns.pairplot(train, hue="class", height = 2, palette = 'colorblind');

In [None]:
# correlation matrix (drop the string-valued 'class' column first)
corrmat = train.drop(columns='class').corr()
sns.heatmap(corrmat, annot = True, square = True);

The main takeaway is that the petal measurements are **highly positively correlated**, while the sepal measurements are uncorrelated. Note that the petal features also have relatively high correlation with sepal_length, but not with sepal_width.
Another useful visualization tool is the parallel coordinates plot, which represents each row as a line.

In [None]:
# parallel coordinates
parallel_coordinates(train, "class", color = ['blue', 'red', 'green']);

As we have seen before, petal measurements can separate species better than the sepal ones.

# Time to Build Classifiers 

Now we are ready to build some classifiers (woo-hoo!)

To make our lives easier, let’s separate out the class label and features:

In [None]:
# You need to fill this out using this format 
# Model development
X_train = train[['sepal_length','sepal_width','petal_length','petal_width']]#attributes[:-1]
y_train = train['class']
X_test = test[['sepal_length','sepal_width','petal_length','petal_width']]
y_test = test['class']

## Classification Tree 

The first classifier that comes to mind is a discriminative classification model called a classification tree. The reason is that we get to see the classification rules, which makes the model easy to interpret.

Let’s build one using sklearn (documentation), with a maximum depth of 3, and we can check its accuracy on the test data.

In [None]:
# your first  decision tree
mod_dt = DecisionTreeClassifier(max_depth = 3, random_state = 1)
mod_dt.fit(X_train,y_train)
prediction=mod_dt.predict(X_test)
print('The accuracy of the Decision Tree is',"{:.3f}".format(metrics.accuracy_score(prediction,y_test)))

How much of the test data does this decision tree classify correctly?

One nice thing about this model is that you can see the importance of each predictor through its feature_importances_ attribute. Try it out!

From the output and based on the indices of the four features, we know that the first two features (sepal measurements) are of no importance, and only the petal ones are used to build this tree.
Another nice thing about the decision tree is that we can visualize the classification rules through plot_tree:

In [None]:
#Build plot tree
print(mod_dt.feature_importances_) #42% and 58% contribution used for petal length and width, respectively
plt.figure(figsize = (10,8))
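# class_names must follow the sorted order in mod_dt.classes_, not the order of appearance in the test set
plot_tree(mod_dt, feature_names = ['sepal_length','sepal_width','petal_length','petal_width'], class_names = list(mod_dt.classes_), filled = True);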

Apart from each rule (e.g. the first criterion is petal_width ≤ 0.7), we can also see the Gini index (impurity measure) at each split, the assigned class, etc. Note that all terminal nodes are pure besides the two “light purple” boxes at the bottom. We can be less confident regarding instances in those two categories.
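
To make the Gini index concrete, here is a minimal sketch of the usual definition, one minus the sum of squared class proportions in a node. The helper name `gini_impurity` and the class counts are illustrative assumptions, not values read from the plotted tree.

```python
import numpy as np

def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts in one tree node."""
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

root = gini_impurity([30, 30, 30])   # balanced root node of a 90-row training set
pure = gini_impurity([30, 0, 0])     # node containing a single class
mixed = gini_impurity([0, 29, 3])    # hypothetical, nearly pure node
print(round(root, 3), round(pure, 3), round(mixed, 3))
```

A pure node scores 0, and the score grows as the classes in a node become more evenly mixed.
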
To demonstrate how easy it is to classify new data points, say a new instance has a petal length of 4.5cm and a petal width of 1.5cm, then we can predict it to be versicolor following the rules.
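
The same rule-following can be done in code. The sketch below is an illustration under assumptions: it refits a depth-3 tree on scikit-learn's bundled Iris copy with the same split settings, prints the rules with `export_text`, and classifies one hypothetical flower. The sepal values are placeholders, and the fitted tree may differ slightly from the one plotted above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X = pd.DataFrame(iris.data, columns=features)
y = iris.target_names[iris.target]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

# Print the learned rules as indented text
rules = export_text(tree, feature_names=features)
print(rules)

# Hypothetical new flower; the two sepal values are placeholders
new_flower = pd.DataFrame([[5.8, 3.0, 4.5, 1.5]], columns=features)
predicted = tree.predict(new_flower)[0]
print(predicted)
```
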
Since only the petal features are being used, we can visualize the decision boundary and plot the test data in 2D:

In [None]:
# Example code: will need some tweaking
# plot decision boundary for petal width vs petal length
plot_step = 0.01
plot_colors = "ryb"
xx, yy = np.meshgrid(np.arange(0, 7, plot_step), np.arange(0, 3, plot_step))
plt.tight_layout(h_pad=1, w_pad=1, pad=2.5)

selected_predictors = ["petal_length", "petal_width"]
mod_dt_1 = DecisionTreeClassifier(max_depth = 3, random_state = 1)
y_train_en = y_train.replace({'Iris-setosa':0,'Iris-versicolor':1,'Iris-virginica':2}).copy()
mod_dt_1.fit(X_train[selected_predictors],y_train_en)

pred_all = mod_dt_1.predict(np.c_[xx.ravel(), yy.ravel()])
pred_all = pred_all.reshape(xx.shape)

graph = plt.contourf(xx, yy, pred_all, cmap=plt.cm.RdYlBu)

plt.xlabel(selected_predictors[0])
plt.ylabel(selected_predictors[1])

# plot test data points
n_class = 3
cn = ['Iris-setosa','Iris-versicolor','Iris-virginica']
for i, color in zip(cn, plot_colors):
    # boolean mask selecting the test rows that belong to species i
    mask = (y_test == i).to_numpy()
    plt.scatter(X_test.loc[mask, selected_predictors[0]], X_test.loc[mask, selected_predictors[1]],
                c=color, label=i, edgecolor='black', s=20)
plt.legend()

plt.suptitle("Decision Boundary Shown in 2D with Test Data")
plt.axis("tight");

Out of the 60 data points, 59 are correctly classified. Another way to show the prediction results is through a confusion matrix:

In [None]:

# example code for confusion matrix; will need tweaking
# one versicolor misclassified
# (plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it)
disp = metrics.ConfusionMatrixDisplay.from_estimator(mod_dt, X_test, y_test,
                                                     display_labels=cn,
                                                     cmap=plt.cm.Blues,
                                                     normalize=None)
disp.ax_.set_title('Decision Tree Confusion matrix, without normalization');

Through this matrix, we see that there is one versicolor which we predict to be virginica.
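
For readers who want the numbers rather than the plot, `metrics.confusion_matrix` returns the same table as an array. The sketch below uses a handful of toy labels (not the notebook's real predictions) in which one versicolor is predicted as virginica, to show how rows and columns are laid out.

```python
from sklearn.metrics import confusion_matrix

labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Toy labels: one versicolor is predicted as virginica, everything else is correct
y_true = ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 2
y_hat = ["Iris-setosa"] * 2 + ["Iris-versicolor"] * 2 + ["Iris-virginica"] * 3

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

The single off-diagonal entry sits in the versicolor row and the virginica column.
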
One downside of building a single tree is its instability, which can be improved through ensemble techniques such as random forests, boosting, etc. For now, let’s move on to the next model.

## Gaussian Naive Bayes Classifier

One of the most popular classification models is Naive Bayes. It contains the word “Naive” because it has a key assumption of class-conditional independence, which means that given the class, each feature’s value is assumed to be independent of that of any other feature (read more: https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c).
We know that this is clearly not the case here, as evidenced by the high correlation between the petal features. Let’s examine the test accuracy using this model to see how robust it is when the assumption is violated:

In [None]:
# Example code for Gaussian Naive Bayes Classifier to help you make your own
mod_gnb_all = GaussianNB()
y_pred = mod_gnb_all.fit(X_train, y_train).predict(X_test)
print('The accuracy of the Gaussian Naive Bayes Classifier on test data is',"{:.3f}".format(metrics.accuracy_score(y_pred,y_test)))

What about the results if we only used petal features?

In [None]:
# Modify this example to make your own
# Gaussian Naive Bayes Classifier with two predictors
mod_gnb = GaussianNB()
y_pred = mod_gnb.fit(X_train[selected_predictors], y_train).predict(X_test[selected_predictors])
print('The accuracy of the Gaussian Naive Bayes Classifier with 2 predictors on test data is',"{:.3f}".format(metrics.accuracy_score(y_pred,y_test)))

Interestingly, using only two features results in more correctly classified points, suggesting the possibility of over-fitting when using all features. Overall, our Naive Bayes classifier did a decent job.

## Linear Discriminant Analysis (LDA)

If we use a multivariate Gaussian distribution to calculate the class-conditional density instead of taking a product of univariate Gaussian distributions (as in Naive Bayes), we then get an LDA model (read more: https://machinelearningmastery.com/linear-discriminant-analysis-for-machine-learning/). The key assumption of LDA is that the covariances are equal among classes. We can examine the test accuracy using all features and only petal features:

In [None]:
# Example LDA Classifier
mod_lda_all = LinearDiscriminantAnalysis()
y_pred = mod_lda_all.fit(X_train, y_train).predict(X_test)
print('The accuracy of the LDA Classifier on test data is',"{:.3f}".format(metrics.accuracy_score(y_pred,y_test)))


In [None]:

# Example LDA Classifier with two predictors
mod_lda = LinearDiscriminantAnalysis()
y_pred = mod_lda.fit(X_train[selected_predictors], y_train).predict(X_test[selected_predictors])
print('The accuracy of the LDA Classifier with two predictors on test data is',"{:.3f}".format(metrics.accuracy_score(y_pred,y_test)))

Using all features boosts the test accuracy of our LDA model.
To visualize the decision boundary in 2D, we can use our LDA model with only petals and also plot the test data:

In [None]:
# An example: please tweak it for your own code
# LDA with 2 predictors
mod_lda_1 = LinearDiscriminantAnalysis()
y_pred = mod_lda_1.fit(X_train[selected_predictors], y_train_en).predict(X_test[selected_predictors])

N = 300
X = np.linspace(0, 7, N)
Y = np.linspace(0, 3, N)
X, Y = np.meshgrid(X, Y)

g = sns.FacetGrid(test, hue="class", height=5, palette = 'colorblind').map(plt.scatter,"petal_length", "petal_width", ).add_legend()
my_ax = g.ax

# predict every grid point in one vectorized call instead of one call per point
Z = mod_lda_1.predict(np.c_[X.ravel(), Y.ravel()]).reshape(X.shape)

#Plot the filled and boundary contours
my_ax.contourf( X, Y, Z, 2, alpha = .1, colors = ('blue','green','red'))
my_ax.contour( X, Y, Z, 2, alpha = 1, colors = ('blue','green','red'))

# Add axis and title
my_ax.set_xlabel('Petal Length')
my_ax.set_ylabel('Petal Width')
my_ax.set_title('LDA Decision Boundaries with Test Data'); 

You should notice that four test points are misclassified: three virginica and one versicolor.
Now suppose we want to classify new data points with this model. We can just plot each point on this graph and predict its class according to the colored region it falls in.
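
Instead of reading the region off the plot, the same lookup can be done with `predict`. This is a minimal sketch under assumptions: it fits LDA on the petal columns of scikit-learn's bundled Iris copy (all rows, not the notebook's training split), and the new flower's measurements are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
petal_cols = ["petal_length", "petal_width"]
X = pd.DataFrame(iris.data[:, 2:4], columns=petal_cols)
y = iris.target_names[iris.target]

lda = LinearDiscriminantAnalysis().fit(X, y)

# Hypothetical new flower: petal length 4.5 cm, petal width 1.5 cm
new_point = pd.DataFrame([[4.5, 1.5]], columns=petal_cols)
label = lda.predict(new_point)[0]
probs = dict(zip(lda.classes_, lda.predict_proba(new_point)[0].round(3)))
print(label, probs)
```

`predict_proba` also shows how close the point is to a boundary, which the colored regions alone do not.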

## Quadratic Discriminant Analysis (QDA)

The difference between LDA and QDA is that QDA does NOT assume the covariances to be equal across classes, and it is called “quadratic” because the decision boundary is a quadratic function.
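
The covariance assumption can be inspected directly. In the sketch below (an illustration on scikit-learn's bundled Iris copy, not the notebook's split), `store_covariance=True` asks each model to keep its estimate: LDA holds one shared matrix, while QDA holds one matrix per class.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

X, y = load_iris(return_X_y=True)

# store_covariance=True asks each model to keep its covariance estimate(s)
lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)

# LDA: a single covariance matrix shared by all classes
print(lda.covariance_.shape)
# QDA: one covariance matrix per class
print(len(qda.covariance_), qda.covariance_[0].shape)
```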

In [None]:
# Example QDA Classifier
mod_qda_all = QuadraticDiscriminantAnalysis()
y_pred = mod_qda_all.fit(X_train, y_train).predict(X_test)
print('The accuracy of the QDA Classifier is',"{:.3f}".format(metrics.accuracy_score(y_pred,y_test)))

In [None]:
# QDA Classifier with two predictors
mod_qda = QuadraticDiscriminantAnalysis()
y_pred = mod_qda.fit(X_train[selected_predictors], y_train).predict(X_test[selected_predictors])
print('The accuracy of the QDA Classifier with two predictors is',"{:.3f}".format(metrics.accuracy_score(y_pred,y_test)))

It has the same accuracy as LDA in the case of all features, and it performs slightly better when only using petals.
Similarly, let’s plot the decision boundary for QDA (model with only petals):

In [None]:
# Example QDA with 2 predictors
mod_qda_1 = QuadraticDiscriminantAnalysis()
y_pred = mod_qda_1.fit(X_train.iloc[:,2:4], y_train_en).predict(X_test.iloc[:,2:4])

N = 300
X = np.linspace(0, 7, N)
Y = np.linspace(0, 3, N)
X, Y = np.meshgrid(X, Y)

g = sns.FacetGrid(test, hue="class", height=5, palette = 'colorblind').map(plt.scatter,"petal_length", "petal_width", ).add_legend()
my_ax = g.ax

# predict every grid point in one vectorized call instead of one call per point
Z = mod_qda_1.predict(np.c_[X.ravel(), Y.ravel()]).reshape(X.shape)

#Plot the filled and boundary contours
my_ax.contourf( X, Y, Z, 2, alpha = .1, colors = ('blue','green','red'))
my_ax.contour( X, Y, Z, 2, alpha = 1, colors = ('blue','green','red'))

# Add axis and title
my_ax.set_xlabel('Petal Length')
my_ax.set_ylabel('Petal Width')
my_ax.set_title('QDA Decision Boundaries with Test Data');

## K Nearest Neighbors (K-NN)

Now, let’s switch gears a little and take a look at a non-parametric, instance-based model called KNN (read more: https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761). It is a popular model since it is relatively simple and easy to implement. However, we need to be aware of the curse of dimensionality when the number of features gets large.
Let’s plot the test accuracy with different choices of K:

In [None]:
# an example to tweak
# KNN, first try 5
mod_5nn=KNeighborsClassifier(n_neighbors=5) 
mod_5nn.fit(X_train,y_train)
prediction=mod_5nn.predict(X_test)
print('The accuracy of the 5NN Classifier is',"{:.3f}".format(metrics.accuracy_score(prediction,y_test)))

In [None]:
# an example
# try different k
k_values = list(range(1, 11))
acc_list = []
for i in k_values:
    mod_knn = KNeighborsClassifier(n_neighbors=i)
    mod_knn.fit(X_train, y_train)
    prediction = mod_knn.predict(X_test)
    acc_list.append(metrics.accuracy_score(y_test, prediction))
# Series.append was removed in pandas 2.0, so build the Series from a list
acc_s = pd.Series(acc_list, index=k_values)

plt.plot(k_values, acc_s)
plt.suptitle("Test Accuracy vs K")
plt.xticks(k_values)
plt.ylim(0.9,0.98);

We can see that the accuracy is highest (about 0.965) when K is 3, or between 7 and 10. Compared to the previous models, it is less straightforward to classify new data points since we would need to look at each point's K closest neighbors in four-dimensional space.
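
The neighbor lookup itself is easy to reproduce in code. The sketch below is an illustration on scikit-learn's bundled Iris copy with the same split settings: it asks a fitted 5-NN model for each test row's nearest training rows via `kneighbors`, takes a majority vote by hand, and compares the result with `predict`. The tie-breaking rule in the comment reflects my understanding of scikit-learn's uniform-weight behavior.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

k = 5
knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)

# Indices of the k closest training rows for every test row
neighbor_idx = knn.kneighbors(X_te, return_distance=False)
neighbor_labels = y_tr[neighbor_idx]  # shape: (n_test, k)

# Majority vote among the neighbors (ties go to the smallest class index)
manual_pred = np.array(
    [np.bincount(row, minlength=3).argmax() for row in neighbor_labels]
)

print((manual_pred == knn.predict(X_te)).all())
```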

## Additional Work 

 - Make sure this project is in your GitHub repo (make it look nice)
 - Go to titus under module 12 and click on the extra questions link. There are 4 different categories of questions. Choose 1 from each category and add it to your project.
 - Add a conclusion and a summary of the project and what you learned

We explored the Iris dataset, and then built a few popular classifiers using sklearn. We saw that the petal measurements are more helpful at classifying instances than the sepal ones. Furthermore, most models achieved a test accuracy of over 95%.

## Conclusion/Summary
In this Supervised Machine Learning project, we tested the ability of an ML model to accurately predict the class (species) of the Iris flower, given the attributes sepal length, sepal width, petal length, and petal width. Fortunately, this dataset was clean and balanced, with 50 instances for each of the three Iris species, each described by the same four attributes. As a result, missing data and imbalanced class distributions were not factors we needed to address prior to our analysis.

Initially, we defined the test set as 40% of the data, with the training set having the other 60%. Altering these proportions to 30% and 70%, respectively, reduced the accuracy of prediction in some analyses, notably in the Gini values of the Classification Tree.
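
A comparison like this can be scripted. The sketch below is only an illustration: it uses scikit-learn's bundled Iris copy and one random seed, and refits the depth-3 tree for two test-set sizes. Because a single split is noisy, the printed accuracies should be read as indicative rather than conclusive.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for test_size in (0.3, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=42
    )
    tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
    results[test_size] = accuracy_score(y_te, tree.predict(X_te))

for test_size, acc in results.items():
    print(f"test_size={test_size:.1f}: accuracy={acc:.3f}")
```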

In our initial scatterplots of each pair of the four attributes, Iris setosa was clearly separated from the other two species, a result which was further demonstrated in the visualizations that followed. This separation was found to be mostly attributable to petal length and, in particular, petal width, which had a 58% contribution (feature importance) in the decision tree. Then, looking at the Gaussian Naive Bayes classification model, there was 95% accuracy when considering only the petal attributes (2 percentage points more than with all four attributes).

Taking 95% test accuracy as the threshold for success, our K-NN analysis showed the model to be successful with K values of 1, 3, 7, 8, 9, and 10.
