# Project Iris
Use a [Multilayer Perceptron](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) to perform multiclass classification, predicting the type of iris from the flower's attributes.

## Importing Libraries and Data
The [Iris Data Set](http://archive.ics.uci.edu/ml/datasets/Iris) contains attribute information for 150 irises.  The attributes are sepal length, sepal width, petal length, petal width, and class (Iris Setosa, Iris Versicolour, or Iris Virginica).  All measurements are in centimeters.

### Importing Libraries

In [1]:
# We import pandas for dataframes
import pandas as pd

# Import NumPy for math functions and calculations
import numpy as np

# Import from sklearn library for machine learning
from sklearn.model_selection import train_test_split   # for splitting the dataset
from sklearn.neural_network import MLPClassifier       # the main model library
from sklearn.metrics import accuracy_score             # to calculate accuracy
from sklearn.preprocessing import StandardScaler       # to normalize/feature-scale data

# Import Operating System functions for finding the local directory
import os

# Import plotting library to display data
from matplotlib import pyplot as plt

### Importing Data

In [2]:
# Find the notebook directory path
# This is the directory where the iris.data file should be
notebook_path = os.path.abspath("Iris Classification.ipynb")
data_file = os.path.join(os.path.dirname(notebook_path), "iris.data")

# Read in dataset
iris_data = pd.read_csv(data_file, header=None)

FileNotFoundError: [Errno 2] No such file or directory: '/home/jenny/Documents/GitHub/JennySteichen.github.io/_portfolio/iris.data'
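
The `FileNotFoundError` above means `iris.data` was not in the notebook's directory. Below is a minimal sketch of a more defensive loader, assuming the file is a headerless CSV with five columns. The helper name `load_iris_csv` is hypothetical, and the demonstration writes two sample rows to a temporary file only so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Column names used throughout this notebook
COLUMN_NAMES = ["sepal len", "sepal width", "petal len", "petal width", "iristype"]


def load_iris_csv(path):
    """Read a headerless iris CSV, failing early with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the data file at {path.resolve()}; "
            "download iris.data and place it there."
        )
    # names= assigns the column labels at read time
    return pd.read_csv(path, header=None, names=COLUMN_NAMES)


# Demonstration on a temporary file holding two rows in the iris.data format
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_file = Path(tmp_dir) / "iris.data"
    sample_file.write_text(
        "5.1,3.5,1.4,0.2,Iris-setosa\n"
        "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    )
    sample = load_iris_csv(sample_file)

print(sample)
```

Because `names=` labels the columns while reading, a frame loaded this way already has the column names that the renaming step assigns.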

## Exploring the Data

In [None]:
# Print a selection from the data frame
print(iris_data.head())

### Change Column Names

In [None]:
# Add meaningful column names
iris_data.rename(columns = {0:"sepal len", 1:'sepal width', 2:'petal len', 
                            3:'petal width', 4:'iristype'},inplace=True)
print(iris_data.head())

### Check Data Statistics
The maximum, minimum, and mean of each feature column give a rough way to compare the distributions in each column.

In [None]:
feature_columns = ["sepal len", 'sepal width', 'petal len', 'petal width']
print("Maximums: ")
iris_data[feature_columns].max()

In [None]:
print("Minimums: ")
iris_data[feature_columns].min()

In [None]:
print("Means: ")
iris_data[feature_columns].mean()
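
The three statistics can also be collected into one table with `DataFrame.agg`. This is a small sketch on a stand-in frame named `sample` (an assumption so the snippet runs without `iris.data`); in the notebook the same call would apply to `iris_data[feature_columns]`.

```python
import pandas as pd

# Four sample rows in the notebook's column layout (stand-in for iris_data)
sample = pd.DataFrame({
    "sepal len": [5.1, 4.9, 7.0, 6.3],
    "sepal width": [3.5, 3.0, 3.2, 3.3],
    "petal len": [1.4, 1.4, 4.7, 6.0],
    "petal width": [0.2, 0.2, 1.4, 2.5],
})
feature_columns = ["sepal len", "sepal width", "petal len", "petal width"]

# agg() computes several statistics at once: one row per statistic
summary = sample[feature_columns].agg(["min", "max", "mean"])
print(summary)
```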

### Display the Data

In [None]:
# Split the data into multiple data frames based on the type of iris
setosas = iris_data[iris_data['iristype'] == 'Iris-setosa']
versicolors = iris_data[iris_data['iristype'] == 'Iris-versicolor']
virginicas = iris_data[iris_data['iristype'] == 'Iris-virginica']

# Function to plot iris_data using only two features (columns) and three colors
def plot_2d(x_str, y_str):
    ax = setosas.plot(x=x_str, y=y_str, kind='scatter', c='r', label='Iris setosa')
    versicolors.plot(x=x_str, y=y_str, kind='scatter', ax=ax, c='g', label='Iris versicolor')
    virginicas.plot(x=x_str, y=y_str, kind='scatter', ax=ax, c='b', label='Iris virginica')
    plt.title("Iris Dataset")
    # To save to file, call plt.savefig("iris_data_scatterplot.png") before plt.show()
    plt.show()

# Plot two features against each other in scatter plots
plot_2d('sepal len', 'sepal width')
plot_2d('sepal len', 'petal len')
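
As an alternative to keeping three separate data frames, the plot can loop over the groups of a single frame. The sketch below assumes a small stand-in frame `sample` and a hypothetical helper `plot_by_type`. The `Agg` backend line is there only so the snippet runs without a display and can be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
from matplotlib import pyplot as plt
import pandas as pd

# A few illustrative rows in the notebook's column layout
sample = pd.DataFrame({
    "sepal len": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal len": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "iristype": ["Iris-setosa", "Iris-setosa",
                 "Iris-versicolor", "Iris-versicolor",
                 "Iris-virginica", "Iris-virginica"],
})

# Same colors as plot_2d: red, green, blue
colors = {"Iris-setosa": "r", "Iris-versicolor": "g", "Iris-virginica": "b"}


def plot_by_type(data, x_str, y_str):
    """Scatter two feature columns, one color per iris type."""
    fig, ax = plt.subplots()
    for iris_type, group in data.groupby("iristype"):
        ax.scatter(group[x_str], group[y_str],
                   color=colors[iris_type], label=iris_type)
    ax.set_xlabel(x_str)
    ax.set_ylabel(y_str)
    ax.set_title("Iris Dataset")
    ax.legend()
    return ax


ax = plot_by_type(sample, "sepal len", "petal len")
```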


## Wrangle the Data
Edit the data to prepare it for developing the machine learning model.

### Correct Data Errors
The Iris dataset has [errors in two lines](http://archive.ics.uci.edu/ml/datasets/Iris).  "The 35th sample should be: 4.9,3.1,1.5,0.2,"Iris-setosa" where the error is in the fourth feature. The 38th sample: 4.9,3.6,1.4,0.1,"Iris-setosa" where the errors are in the second and third features."
We fix those errors below.  Because the row labels in the data frame start at 0, the 35th and 38th samples are the rows labeled 34 and 37.

In [None]:
# Correcting the two errors in the data
iris_data.loc[34,'petal width'] = 0.2
iris_data.loc[37, 'sepal width'] = 3.6
iris_data.loc[37, 'petal len'] = 1.4
print(iris_data.loc[34:38,:])
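
The corrections can also be written as a table and applied in a loop, which makes each change easy to verify afterwards. This is a sketch on a placeholder frame of zeros named `sample` (an assumption; it stands in for `iris_data`), using the corrected values from the quotation above.

```python
import pandas as pd

feature_columns = ["sepal len", "sepal width", "petal len", "petal width"]

# Placeholder frame: 40 rows of zeros standing in for iris_data
sample = pd.DataFrame(0.0, index=range(40), columns=feature_columns)

# Row label -> {column: corrected value}
corrections = {
    34: {"petal width": 0.2},
    37: {"sepal width": 3.6, "petal len": 1.4},
}

# Apply every correction
for row, fixes in corrections.items():
    for column, value in fixes.items():
        sample.loc[row, column] = value

# Confirm that each corrected cell now holds the intended value
for row, fixes in corrections.items():
    for column, value in fixes.items():
        assert sample.loc[row, column] == value

print(sample.loc[[34, 37]])
```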


### Replace Categorical Label
We replace the categorical variable "iristype" with a numeric variable "iris num". 

In [None]:
# Before converting column iristype to 'iris num', we check the set of column values.
print(iris_data.iristype.unique())

# Add a new column to iris_data called "iris num" that represents "iris type" as a number
iris_type_dict = {'Iris-setosa':1, 'Iris-versicolor':2, 'Iris-virginica':3}
iris_data['iris num'] = iris_data['iristype'].map(iris_type_dict)
print(iris_data)
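
One caveat with `map` is that a label missing from the dictionary silently becomes `NaN` instead of raising an error, which is why checking the unique values first is worthwhile. The sketch below shows this on a small `labels` series (an assumption) whose last entry is deliberately misspelled.

```python
import pandas as pd

iris_type_dict = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

# The last label is deliberately misspelled to show what map() does with it
labels = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-virginca"]
)
mapped = labels.map(iris_type_dict)

# Labels missing from the dictionary become NaN rather than raising an error
unmapped = labels[mapped.isna()].unique()
print("Unmapped labels:", list(unmapped))
```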

### Divide into variables and labels
We want the data in the form of a feature matrix X (the four numeric feature columns of the data)
and a label vector y (the iris num column).

In [None]:
y = pd.DataFrame(iris_data, columns=['iris num'])
X = pd.DataFrame(iris_data, columns=feature_columns)
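
The same split can be written with plain column selection. In this sketch `sample` is a stand-in for `iris_data` (an assumption), and `y` is taken as a one-dimensional Series rather than a one-column data frame.

```python
import pandas as pd

feature_columns = ["sepal len", "sepal width", "petal len", "petal width"]

# Three sample rows standing in for iris_data
sample = pd.DataFrame({
    "sepal len": [5.1, 7.0, 6.3],
    "sepal width": [3.5, 3.2, 3.3],
    "petal len": [1.4, 4.7, 6.0],
    "petal width": [0.2, 1.4, 2.5],
    "iris num": [1, 2, 3],
})

X = sample[feature_columns]   # DataFrame with one column per feature
y = sample["iris num"]        # Series with one label per row

print(X.shape, y.shape)
```

A one-dimensional `y` is the shape scikit-learn classifiers expect for labels, and `y.values.ravel()` still works on it.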

### Divide data into train and test sets
We divide the data into training (80%) and test (20%) sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
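
With only 150 samples, a purely random split can leave the three classes unevenly represented in the test set. The `stratify` parameter of `train_test_split` keeps the class proportions equal in both parts. The sketch below uses synthetic arrays with the same shape as the iris data (an assumption so it runs without `iris.data`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the iris layout: 150 rows, 4 features, 3 balanced classes
X = np.arange(600).reshape(150, 4)
y = np.repeat([1, 2, 3], 50)

# stratify=y preserves the class proportions in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_test)[1:])  # number of test samples in classes 1, 2, 3
```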

## Define the Model
We use the multiclass classifier [MLPClassifier](https://scikit-learn.org/stable/modules/neural_networks_supervised.html) with 2 hidden layers of 7 nodes each.

In [None]:
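# Initialize classifier with two hidden layers of 7 nodes each
# (hidden_layer_sizes gives the number of nodes in each hidden layer)
clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(7, 7), random_state=100)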

# Train the classifier
clf.fit(X_train, y_train.values.ravel())
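
`StandardScaler` is imported at the top of the notebook but never applied, and multilayer perceptrons are sensitive to feature scaling. Here is one possible sketch of combining the two in a pipeline. It is not the notebook's method. It loads scikit-learn's bundled copy of the iris data so that it runs on its own, and `max_iter=1000` is an assumed value.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the iris data (labels are 0, 1, 2 here)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# The scaler is fitted on the training data only and reused at prediction time
scaled_clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(solver="lbfgs", hidden_layer_sizes=(7, 7),
                  max_iter=1000, random_state=100),  # max_iter is an assumption
)
scaled_clf.fit(X_train, y_train)

test_accuracy = scaled_clf.score(X_test, y_test)
print("{0:.0%}".format(test_accuracy))
```

Wrapping both steps in a pipeline keeps the test set out of the scaler's fitted mean and variance.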

## Evaluate Model

In [None]:
# Make predictions
y_predicted = clf.predict(X_test)

In [None]:
# Evaluate accuracy
print("{0:.0%}".format(accuracy_score(y_test, y_predicted)))
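
Accuracy is a single number and does not show which classes get confused with each other. A confusion matrix does. The sketch below uses small hand-written label lists (an assumption, not output from this model) to show how the two relate.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written example labels: one versicolor (2) is predicted as virginica (3)
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3]

accuracy = accuracy_score(y_true, y_pred)
print("{0:.0%}".format(accuracy))

# Rows are true classes, columns are predicted classes, in the order of labels=
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2, 3])
print(matrix)
```

In the notebook, the same two calls would take `y_test` and `y_predicted`.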

## Conclusion
The choice of architecture (2 hidden layers with 7 nodes in each layer) was arbitrary.  While the accuracy is high on this small set, a test set of 30 samples gives only a coarse estimate, so more data would make the evaluation more reliable.  One way to improve the model is to create a series of models with different learning rates and hidden layer sizes.  Train each model on the training set, decide between the models using a separate validation set, and keep the test set for the final evaluation only.
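
A minimal sketch of that model-selection loop follows, under several assumptions: it uses scikit-learn's bundled copy of the iris data, a 60/20/20 split, and illustrative candidate values. It also switches the solver to `adam`, because `learning_rate_init` only takes effect with the `sgd` and `adam` solvers. The helper `build_model` is hypothetical.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 20% for testing, then split the rest 75/25 into train and validation
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1, stratify=y_rest
)

# Candidate settings (illustrative values, not tuned recommendations)
layer_options = [(7,), (7, 7), (14, 14)]
rate_options = [0.001, 0.01]


def build_model(layers, rate):
    """Scaled MLP; the adam solver is used so the learning rate has an effect."""
    return make_pipeline(
        StandardScaler(),
        MLPClassifier(solver="adam", hidden_layer_sizes=layers,
                      learning_rate_init=rate, max_iter=1000, random_state=100),
    )


# Train each candidate on the training set and score it on the validation set
best_score, best_config = -1.0, None
for layers, rate in product(layer_options, rate_options):
    model = build_model(layers, rate).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_score, best_config = score, (layers, rate)

# The test set is used once, for the selected configuration only
final_model = build_model(*best_config).fit(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
print("Selected:", best_config)
print("Test accuracy: {0:.0%}".format(test_accuracy))
```

scikit-learn's `GridSearchCV` offers a cross-validated version of the same search, which may suit a dataset this small better than a single validation split.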