# Iris Dataset

![image.png](https://storage.googleapis.com/kaggle-datasets-images/19/19/default-backgrounds/dataset-cover.jpg)

The Iris dataset was used in R.A. Fisher's classic 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*.

It includes three iris species with 50 samples each, along with the following measurements for each flower:
- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)



# Dataset Loading
A collection of data is called a dataset. We need datasets to train machine learning models. Datasets have the following two components:

## Features
The variables of data are called its features. These are the inputs to a machine learning model.

**Feature names**: List of all the names of the features.

**Feature matrix**: The feature values for every item in the dataset, typically arranged as a matrix (or as a vector if there is only one feature).

## Label
The output variable that depends on the feature variables. This is what the machine learning model seeks to predict; it is sometimes called the target.

**Target names**: List of possible values that the model can output.

**Target vector**: True target value for each item in the dataset.

In [None]:
# Dataset loading code snippet:

from sklearn.datasets import load_iris

iris = load_iris() # load the iris dataset
X = iris.data # feature matrix
y = iris.target # target vector
feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)

# Data investigation

The iris dataset contains 150 data points, each with a feature vector containing four measurements. Each data point corresponds to a particular type of flower.

In the following exercises, we will familiarize ourselves with the format of the dataset.
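
As a quick sanity check of the numbers quoted above, the sketch below counts the data points and how many belong to each flower type. It reloads the dataset so that it runs on its own; the names `n_points` and `per_type` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# total number of data points in the feature matrix
n_points = len(X)

# count how many data points carry each integer label
labels, counts = np.unique(y, return_counts=True)
per_type = {
    str(iris.target_names[label]): int(count)
    for label, count in zip(labels, counts)
}

print("Data points:", n_points)
print("Per flower type:", per_type)
```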

## Data layout

The feature matrix X contains 150 data points with 4 features each.

In the space below, determine the layout of the data in X. How are the data points organized? Do the feature values make sense?

In [None]:
raise NotImplementedError("Add your code here!")


# Your answer here

## Data visualization

Before building a machine learning model, it is useful to visualize the feature and target data to understand the task at hand.

We can look for patterns within individual features or dependencies between features. These patterns will be useful for the model in learning to predict the target labels.

In [None]:
import matplotlib.pyplot as plt

sepal_length = X[:, 0]
plt.hist(sepal_length)
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Count")

In the space below, create histograms for the other features. Do any patterns emerge that may be useful for distinguishing between different types of flowers?

In [None]:
raise NotImplementedError("Add your code here!")


# Your answer here

Machine learning models can also learn from combinations of features. In the space below, try plotting scatter plots of pairs of features. Do any patterns emerge that may be useful for distinguishing between different types of flowers?

In [None]:
import matplotlib.pyplot as plt

sepal_length = X[:, 0]
sepal_width = X[:, 1]
plt.scatter(sepal_length, sepal_width)
plt.xlabel("Sepal Length (cm)")
plt.ylabel("Sepal Width (cm)")

So far we've just looked at the features, but it can also be useful to compare with the targets the model will learn to predict. In the space below, try replacing some of the variables plotted above with the target vector (y) to see if the patterns you found can distinguish between flower types.

_Note_: The target vector contains integers, but you can refer back to the target names to see what types of flowers they correspond to.
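
One way to make the note above concrete is to index the target names with the integer targets. This is only a sketch: the positions 0, 50, and 100 are arbitrary picks for illustration, and `some_targets` and `some_names` are hypothetical names.

```python
from sklearn.datasets import load_iris

iris = load_iris()
y = iris.target

# pick three arbitrary positions in the target vector
some_targets = y[[0, 50, 100]]

# target_names is a NumPy array, so integer targets can index it directly
some_names = iris.target_names[some_targets]

print("Integer targets:", some_targets)
print("Flower types:", some_names)
```

The same indexing trick can be used to build readable legend or axis labels when plotting.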

In [None]:
raise NotImplementedError("Add your code here!")

# Creating a Dataset

To check the accuracy of our model, we can split the dataset into two pieces: a training set and a testing set. We use the training set to fit the model and the testing set, which the model has not seen during training, to evaluate how well it performs.
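
To see what such a split does on a small scale, here is a minimal sketch using a made-up toy array of 10 data points rather than the iris data. The `toy_` names are assumptions for illustration only. The point to notice is that features and targets are shuffled together, so each row keeps its own label.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data: row i is [2*i, 2*i + 1] and its target is i
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    test_size=0.3,   # fraction of the data held out for testing
    random_state=1,  # fixed seed so the split is reproducible
)

print("Train:", toy_X_train.shape, "Test:", toy_X_test.shape)
print("Test targets:", toy_y_test)
```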

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,  # fraction of the data for testing
    random_state=1,  # ensures we get the same train and test sets every time
)

print("Train features:", X_train.shape)
print("Test features:", X_test.shape)

print("Train targets:", y_train.shape)
print("Test targets:", y_test.shape)

# Building a KNN Classifier

The first model we'll train is a [k-nearest neighbors](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) (KNN) classifier. This model classifies a new data point by finding the k closest data points in the training data and returning the most common target label among them.
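
The idea can be sketched in a few lines of NumPy. This is an illustrative toy version, not scikit-learn's implementation: `knn_predict` and the `toy_` arrays are hypothetical names, the data is made up, and Euclidean distance with a simple majority vote is assumed.

```python
import numpy as np

def knn_predict(train_X, train_y, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(train_X - x_new, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # most common label among those neighbors (ties go to the smallest label)
    return int(np.bincount(train_y[nearest]).argmax())

# made-up training data: two well-separated clusters
toy_X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9]])
toy_y_train = np.array([0, 0, 0, 1, 1])

pred_a = knn_predict(toy_X_train, toy_y_train, np.array([1.1, 1.0]))
pred_b = knn_predict(toy_X_train, toy_y_train, np.array([4.8, 5.1]))
print(pred_a, pred_b)
```

With k = 3, the second query point has two neighbors from class 1 and one from class 0, so the vote still returns class 1.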

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# create the classifier; n_neighbors is k, the number of neighbors used
classifier = KNeighborsClassifier(n_neighbors=3)

# fit the model to the training data
classifier.fit(X_train, y_train)

# predict the targets for the test data
y_pred = classifier.predict(X_test)

# measure the accuracy of the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

The parameters we choose for our model have a big impact on performance. In the space below, try some alternative values for k. 

How many neighbors are best for classifying the flower types?

What happens as you increase the number of neighbors to a very high value?

In [None]:
raise NotImplementedError("Add your code here!")


# Your answer here

# Training a Decision Tree

Decision trees are a type of model that learns a hierarchy of simple decision rules for classifying data points. The deeper the decision tree, the more complex the rules become.
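
To get a feel for what a hierarchy of decision rules looks like, the sketch below fits a shallow tree and prints its rules as text with scikit-learn's `export_text` helper. For illustration only, it fits on all 150 data points rather than a training split, and the depth of 2, the fixed `random_state`, and the names `rule_tree` and `rules` are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# a shallow tree keeps the printed rules short enough to read
rule_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
rule_tree.fit(iris.data, iris.target)

# each indented line is one threshold test on a single feature;
# leaf lines show the integer class the tree predicts
rules = export_text(rule_tree, feature_names=list(iris.feature_names))
print(rules)
```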

We can train a decision tree using the same technique as above for the KNN.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# create the classifier; change max_depth to control how deep the tree is
classifier = DecisionTreeClassifier(max_depth=3)

# fit the model to the training data
classifier.fit(X_train, y_train)

# predict the targets for the test data
y_pred = classifier.predict(X_test)

# measure the accuracy of the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Just like before, the parameters we choose for our model have a big impact on performance. In the space below, try some alternative values for max_depth. 

What is the model doing when the depth is 1?

How deep a tree is best for classifying the flower types?

In [None]:
raise NotImplementedError("Add your code here!")


# Your answer here

# Training a Neural Network

Multi-layer perceptrons (MLPs) are a type of neural network that uses one or more hidden layers to learn complex non-linear relationships between features in order to predict target labels. MLPs are trained using the [backpropagation](https://en.wikipedia.org/wiki/Backpropagation) algorithm.
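
The sketch below shows only the forward pass of a tiny MLP with one hidden layer, to illustrate how a non-linear hidden layer turns four features into three class probabilities. Everything in it is an assumption made for illustration: the weights are random and untrained (so the outputs are meaningless as predictions), the hidden size of 8, the ReLU activation, and the softmax output are arbitrary choices, and the two input rows are example measurements in the order of the four iris features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 4, 8, 3

# untrained, randomly initialized weights and zero biases
W1 = rng.normal(size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

def mlp_forward(X):
    """Map a batch of feature vectors to class probabilities."""
    # hidden layer: linear map followed by the ReLU non-linearity
    hidden = np.maximum(0.0, X @ W1 + b1)
    # output layer: linear map to one score per class
    logits = hidden @ W2 + b2
    # softmax, shifted by the row maximum for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

toy_batch = np.array([[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]])
probs = mlp_forward(toy_batch)
print(probs.shape)
print(probs.sum(axis=1))
```

Training with backpropagation would adjust `W1`, `b1`, `W2`, and `b2` so that the highest probability lands on the correct class.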

We can build an MLPClassifier with only minimal changes to the KNN classifier or decision tree code above. Follow the scikit-learn documentation to adapt that code and train an MLP on the iris dataset.

https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [None]:
raise NotImplementedError("Add your code here!")

How does the MLP performance compare to previous methods you tried?

Consider the patterns you saw in the data at the beginning of this notebook. Why might some methods be a better fit for the iris dataset?

In [None]:
# Your answer here

# Try on Other Datasets

The iris dataset is a relatively small and simple dataset with only four features. scikit-learn provides several other datasets with different characteristics. Try loading a few more datasets from the page below and test a few classifiers to see which perform well.

https://scikit-learn.org/stable/datasets/toy_dataset.html

In [None]:
raise NotImplementedError("Add your code here!")

Were any datasets particularly challenging for the models you've tested?

In [None]:
# Your answer here

# Try Other Classifiers

scikit-learn provides a broad set of interchangeable classifiers that can be applied to the datasets you've tested. In the space below, try a few more classifiers on any datasets that seemed challenging for the classifiers you've tried so far.

https://scikit-learn.org/stable/supervised_learning.html

Support vector machines (SVMs) are a powerful class of models for classifying data. The code below trains an SVM using scikit-learn.

In [None]:
from sklearn.svm import SVC
from sklearn import metrics

# create the classifier
classifier = SVC()

# fit the model to the training data
classifier.fit(X_train, y_train)

# predict the targets for the test data
y_pred = classifier.predict(X_test)

# measure the accuracy of the model
accuracy = metrics.accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)