## Introduction

The dataset we will be working on is the famous <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" target="_blank">Iris Dataset</a>. In 1936, an American botanist by the name of <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" target="_blank">Edgar Anderson</a> collected 50 samples of each of 3 different species of Iris (a type of flower), giving a total sample of 150. 

For each Iris, he recorded its Sepal Length (cm), Sepal Width (cm), Petal Length (cm), Petal Width (cm) and its species (Setosa, Versicolor or Virginica). 

<img src="image/Iris.png" alt="Iris" style="width:800px;"/>
<a href="https://en.wikipedia.org/wiki/Iris_flower_data_set" target="_blank">Source 1</a>
<a href="https://www.chegg.com/homework-help/questions-and-answers/following-visualization-show-four-data-attributes-sepal-length-sepal-width-petal-length-pe-q48732841" target="_blank">Source 2</a>

The dataset was introduced to the world of statistics by British statistician and biologist Ronald Fisher. Machine Learning techniques can be used on the dataset to predict the <a href="https://www.youtube.com/watch?v=FLuqwQgSBDw&ab_channel=AppliedAICourse" target="_blank">classification</a> of a given flower into one of three categories (in this case the species) based on its features (in this case Sepal Length, Sepal Width, Petal Length, Petal Width).

## Python Libraries 

For the analysis we will be using Python and the following packages within it. 

<a href="https://pandas.pydata.org/docs/getting_started/overview.html#:~:text=pandas%20is%20a%20Python%20package,world%20data%20analysis%20in%20Python." target="_blank">Pandas</a> - data manipulation and analysis for working with data structures.\
<a href="https://numpy.org/" target="_blank">NumPy</a> - fast computation on large multi-dimensional arrays.\
<a href="https://matplotlib.org/" target="_blank">Matplotlib</a> - data visualisations.\
<a href="https://seaborn.pydata.org/introduction.html" target="_blank">Seaborn</a> - data visualisations.\
<a href="https://scikit-learn.org/stable/" target="_blank">scikit-learn</a> - machine learning for predictive data analysis.

In [None]:
# Import packages we will be using throughout the analysis. 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

## Importing Dataset 

The dataset was downloaded from the <a href="https://archive.ics.uci.edu/ml/datasets/Iris" target="_blank">UCI Machine Learning Repository</a>. The raw data was saved in a text file ***iris.txt***. 

5.1,3.5,1.4,0.2,Iris-setosa\
4.9,3.0,1.4,0.2,Iris-setosa\
4.7,3.2,1.3,0.2,Iris-setosa\
4.6,3.1,1.5,0.2,Iris-setosa\
5.0,3.6,1.4,0.2,Iris-setosa\
5.4,3.9,1.7,0.4,Iris-setosa\
4.6,3.4,1.4,0.3,Iris-setosa\

As there were no column names given in the raw data, our first job was to add them in. 

In [146]:
# List of the column names to be added to the dataframe
collumnNames = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"] 

The data was then read in as a pandas dataframe named ***df*** and the column names added. 

In [147]:
# Reads in the dataset as a pandas dataframe and adds the column names to the top
df = pd.read_csv('iris.txt', names = collumnNames)

<a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html" target="_blank">Reading in CSV</a>
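To see why the `names` argument matters, the sketch below reads the first three raw rows shown above from an in-memory string rather than from ***iris.txt***. The variable names `raw_text`, `with_names` and `without_names` are our own and are only for illustration.

```python
import io

import pandas as pd

# The first three rows of the raw file as shown above; the file has no header row.
raw_text = """5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

column_names = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm", "Species"]

# With names=, pandas uses our labels and keeps the first line as data.
with_names = pd.read_csv(io.StringIO(raw_text), names=column_names)

# Without names=, pandas treats the first line as the header, so one row of data is lost.
without_names = pd.read_csv(io.StringIO(raw_text))

print(with_names.shape)
print(without_names.shape)
```

The first shape keeps all three rows, while the second has only two because the first flower was used up as column labels.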

## What our data looks like

Now that we have our dataset imported, let's have a look and see how our dataframe looks. 

In [148]:
# Shows the first 5 rows of data beneath the column names we added. 
print(df.head())

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa


In [149]:
# Tells us how many rows (150) and columns (5) we have. 
print(df.shape)

(150, 5)


In [150]:
# Data types for each variable
print(df.dtypes)

SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object


In [151]:
# Null values count
print(df.isnull().sum())

SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64


In [152]:
# Our 3 species have 50 counts each 
numberOfSpecies = df["Species"].value_counts()
print(numberOfSpecies)

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64


From the output we can see our dataframe contains 150 rows of data and 5 columns - Sepal Length, Sepal Width, Petal Length, Petal Width and Species. 

Sepal Length, Sepal Width, Petal Length and Petal Width are all floats, while Species is of type object with 3 distinct values (Setosa, Versicolor and Virginica), each appearing in 50 rows. There are no null values in our dataset. 

## Overview of Statistics

Next, let's get an overview of our variable statistics for the total sample size of 150:  

In [153]:
# Count, mean, standard deviation, minimum value, 25th percentile, median, 75th percentile 
# and maximum for each numerical variable, rounded to one decimal place. 

In [154]:
print(df.describe().round(1))

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count          150.0         150.0          150.0         150.0
mean             5.8           3.1            3.8           1.2
std              0.8           0.4            1.8           0.8
min              4.3           2.0            1.0           0.1
25%              5.1           2.8            1.6           0.3
50%              5.8           3.0            4.4           1.3
75%              6.4           3.3            5.1           1.8
max              7.9           4.4            6.9           2.5


We can then view the same statistics for each species: 

In [155]:
# Calculates the mean, standard deviation, min, median and max by each species of flower for each variable. 
SepalLengthBySpecies = df.groupby("Species")["SepalLengthCm"].agg([np.mean, np.std, np.min, np.median, np.max])
print("Sepal Length\n", SepalLengthBySpecies, "\n")

SepalWidthBySpecies = df.groupby("Species")["SepalWidthCm"].agg([np.mean, np.std, np.min, np.median, np.max])
print("Sepal Width\n", SepalWidthBySpecies, "\n")

PetalLengthBySpecies = df.groupby("Species")["PetalLengthCm"].agg([np.mean, np.std, np.min, np.median, np.max])
print("Petal Length\n", PetalLengthBySpecies, "\n")

PetalWidthBySpecies = df.groupby("Species")["PetalWidthCm"].agg([np.mean, np.std, np.min, np.median, np.max])
print("Petal Width\n", PetalWidthBySpecies, "\n")

Sepal Length
                   mean       std  amin  median  amax
Species                                             
Iris-setosa      5.006  0.352490   4.3     5.0   5.8
Iris-versicolor  5.936  0.516171   4.9     5.9   7.0
Iris-virginica   6.588  0.635880   4.9     6.5   7.9 

Sepal Width
                   mean       std  amin  median  amax
Species                                             
Iris-setosa      3.418  0.381024   2.3     3.4   4.4
Iris-versicolor  2.770  0.313798   2.0     2.8   3.4
Iris-virginica   2.974  0.322497   2.2     3.0   3.8 

Petal Length
                   mean       std  amin  median  amax
Species                                             
Iris-setosa      1.464  0.173511   1.0    1.50   1.9
Iris-versicolor  4.260  0.469911   3.0    4.35   5.1
Iris-virginica   5.552  0.551895   4.5    5.55   6.9 

Petal Width
                   mean       std  amin  median  amax
Species                                             
Iris-setosa      0.244  0.107210   0.1 

<a href="https://campus.datacamp.com/courses/data-manipulation-with-pandas/aggregating-dataframes?ex=11" target="_blank">Aggregating DataFrames</a>\
<a href="https://www.youtube.com/watch?v=pTjsr_0YWas&ab_channel=HackersRealm" target="_blank">Analysis</a>
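The four separate `groupby` calls above could also be written as one call that selects several columns at once. The sketch below is only an illustration on a few rows per species (not the full 150-row dataset), and it passes the aggregations as string names, so the result labels read `min` and `max` rather than `amin` and `amax`. The names `sample` and `stats` are our own.

```python
import pandas as pd

# A few rows per species, for illustration only (not the full dataset).
sample = pd.DataFrame({
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3,
    "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9],
    "PetalWidthCm": [0.2, 0.2, 0.2, 1.4, 1.5, 1.5],
})

# One groupby covering both columns; each column gets the same five statistics.
stats = sample.groupby("Species")[["PetalLengthCm", "PetalWidthCm"]].agg(
    ["mean", "std", "min", "median", "max"]
)

print(stats.round(3))

# A single value is looked up by species and by (column, statistic).
print(stats.loc["Iris-setosa", ("PetalLengthCm", "min")])
```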


In [None]:
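# Species is text, so only the numerical columns are used for the correlation matrix. 
corrMatrix = df.select_dtypes(include="number").corr()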
print(corrMatrix)

<a href="https://www.youtube.com/watch?v=pTjsr_0YWas&t=989s&ab_channel=HackersRealm" target="_blank">Correlation matrix</a>

Using <a href="https://www.w3schools.com/python/python_file_handling.asp" target="_blank">Python's File Open</a> we can write our summary statistics to a txt file. (See ***summaryFile()*** function in ***analysis.py***)

## Visualising the Data

Visualising the data makes it easier to gather insights than reading tables of numbers. 

In [None]:
# Draws one boxplot per variable, grouped by species, on a 2x2 grid. 

sns.set(style="darkgrid") # Sets the style of chart in Seaborn 

# Creates a 2x2 grid of subplots with figure size 12x12. 
# sharey="all" gives all 4 plots the same y-axis scale. 
fig, axs = plt.subplots(2, 2, figsize=(12, 12), sharey="all")

# Plots 4 boxplots of each variable by each species using df as data. 
# ax positions the plot in one of 4 locations 
# palette sets the colours of each of the species 

sns.boxplot(data=df, x="Species", y="SepalLengthCm", ax=axs[0, 0], palette="Set2")
sns.boxplot(data=df, x="Species", y="SepalWidthCm", ax=axs[0, 1], palette="Set2")
sns.boxplot(data=df, x="Species", y="PetalLengthCm", ax=axs[1, 0], palette="Set2")
sns.boxplot(data=df, x="Species", y="PetalWidthCm", ax=axs[1, 1], palette="Set2")
plt.show()

<h3 style="color:red">Observation 1</h3>

From our calculations and boxplots, we can see Setosa's Petal Lengths are much smaller than those of Versicolor and Virginica. 

In [None]:
sns.set(style="darkgrid") # Sets the style of chart in Seaborn

fig, axs = plt.subplots(2, 2, figsize=(12, 12)) # Creates placing for 2x2 subplots with size 12x12

# Plots 4 histograms of each variable by each species using df as data. 
# kde=True overlays a smoothed outline of each species' distribution 
# ax positions the plot in one of 4 locations 
# hue groups species and shows them as different colours
# palette sets the colours of each of the species 
sns.histplot(data=df, x="SepalLengthCm", kde=True, ax=axs[0, 0], hue="Species", palette="Set2")
sns.histplot(data=df, x="SepalWidthCm", kde=True, ax=axs[0, 1], hue="Species", palette="Set2")
sns.histplot(data=df, x="PetalLengthCm", kde=True, ax=axs[1, 0], hue="Species", palette="Set2")
sns.histplot(data=df, x="PetalWidthCm", kde=True, ax=axs[1, 1], hue="Species", palette="Set2")
plt.show()

<a href="https://www.youtube.com/watch?v=snkkKrek7TU&ab_channel=DataCamp" target="_blank">Plotting Histograms</a>\
<a href="https://python-graph-gallery.com/25-histogram-with-several-variables-seaborn" target="_blank">Seaborn code</a>\
<a href="https://seaborn.pydata.org/generated/seaborn.histplot.html" target="_blank">Seaborn Histogram</a>

<h3 style="color:red">Observation 2</h3>

From the histograms, we can again see that Setosa's Petal Lengths and Petal Widths are smaller than, and separable from, those of Versicolor and Virginica, with a low variance (i.e. a narrow spread of values). There is some overlap between Versicolor and Virginica, though Versicolor generally has smaller Petal Lengths and Petal Widths than Virginica. 

It is more difficult to identify differences between species for Sepal Length and Sepal Width. Generally, Setosa has the smallest Sepal Lengths, Virginica the largest and Versicolor in between, while Setosa has the largest Sepal Widths, followed by Virginica and Versicolor. The ranges of all three species overlap for both Sepal Length and Sepal Width. 

From this we can take that Petal Length is the strongest variable for classifying the species, closely followed by Petal Width. Sepal Length is stronger than Sepal Width but not as strong as the Petal measurements. 

In [None]:
# Plots each numerical variable against every other, and each variable against itself to show its distribution. 
# In this case kind="scatter" gives scatterplots. Each pair of variables appears twice, 
# with the axes switched.
# hue groups species and shows them as different colours
# markers sets a different shape for each species: o - circle, s - square, D - diamond. 
# palette sets the colours of each of the species 
sns.pairplot(df, kind="scatter", hue="Species", markers=["o", "s", "D"], palette="Set2")
plt.show()

<a href="https://python-graph-gallery.com/111-custom-correlogram" target="_blank">Seaborn code</a>\
<a href="https://www.youtube.com/watch?v=4yz4cMXCkuw&ab_channel=KimberlyFessel" target="_blank">Seaborn Scatterplots</a>

In [None]:
# Outputs a single scatterplot on its own. fit_reg is set to False as I do not want a regression line. 
sns.set(style="darkgrid")
sns.lmplot(data=df, x="PetalWidthCm", y="PetalLengthCm", fit_reg=False, hue="Species", markers=["o", "s", "D"], palette="Set2")
plt.show()

<h3 style="color:red">Observation 3</h3>

From the scatterplots, we can see Petal Length and Petal Width are the most useful variables for distinguishing between Species. Setosa can be clearly identified, while Versicolor and Virginica have some overlap. 

<a href="https://www.kaggle.com/code/mrdheer/beginner-s-guide-to-iris-dataset/notebook" target="_blank">Plotting Boxplots</a>

In [None]:
# Plots the corrMatrix computed above as a heatmap so it's easier to visualise. 
# Like the pairplot above, the heatmap shows each pair of variables twice. 
# fig, ax = plt.subplots() creates a figure of size 7x6
# sns.heatmap() fills the plot with values from corrMatrix; annot=True writes each value in its cell. 
# cmap="coolwarm" colours high correlation red and low correlation blue. 

fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(corrMatrix, annot = True, ax = ax, cmap="coolwarm")

<a href="https://www.youtube.com/watch?v=pTjsr_0YWas&t=989s&ab_channel=HackersRealm" target="_blank">Correlation matrix</a>

"<a href="https://www.investopedia.com/terms/c/correlation.asp#:~:text=Correlation%20is%20a%20statistical%20term,to%20have%20a%20positive%20correlation." target="_blank">Correlation</a> *is a statistical term describing the degree to which two variables move in coordination with one another. If the two variables move in the same direction, then those variables are said to have a positive correlation. If they move in opposite directions, then they have a negative correlation. Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient. The correlation coefficient's values range between -1.0 and 1.0."*

<h3 style="color:red">Observation 4</h3>

From this we can see Petal Length and Petal Width have the highest positive correlation of all pairs of variables (0.96). Petal Length and Sepal Length (0.87) and Petal Width and Sepal Length (0.82) also have high correlation. 

Petal Length and Sepal Width (-0.42) and Petal Width and Sepal Width (-0.36) both have low negative correlation. 
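To make the coefficient less of a black box, here is a minimal sketch of the Pearson correlation coefficient, which is the default method used by pandas' `corr()`. The function name `pearson_correlation` and the short number lists are our own and are not taken from the Iris data.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables move together around their means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Two variables that rise together, and two that move in opposite directions.
rising = pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_correlation([1, 2, 3, 4], [8, 6, 4, 2])

print(round(rising, 2), round(falling, 2))
```

The first pair gives a coefficient of 1 (a perfect positive relationship) and the second gives -1 (a perfect negative relationship); real measurements such as ours fall somewhere in between.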

## Machine Learning

"<a href="https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1&ab_channel=DataSchool" target="_blank">Machine Learning</a> *is the semi-automated extraction of knowledge from data".*

There are two main types of machine learning: 1. Supervised learning and 2. Unsupervised learning. Supervised learning is making predictions using data whose responses are already known, while unsupervised learning is extracting structure from data, i.e. grouping similar observations into categories. 

In the case of the Iris dataset we will be using supervised learning to predict the species of a given flower from what we have learned about the Iris's features (in this case Sepal Length, Sepal Width, Petal Length, Petal Width).

### Set up for Machine Learning

To be able to build machine learning models, it is best to convert categorical data (in this case the Species column) into numerical data. 

In [None]:
# Copy df to df2. Using .copy() makes df2 independent, so changes to df2 do not alter df. 
df2 = df.copy()

In [None]:
# Label encoder will label Species for us. 
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df2["Species"] = le.fit_transform(df2["Species"])

In [None]:
df2.head()

The Species column has now been relabeled:\
0 = Setosa\
1 = Versicolor\
2 = Virginica
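The numbering is not arbitrary: `LabelEncoder` sorts the distinct labels and numbers them in that order. The plain-Python sketch below imitates this on a short made-up list of labels. The names `classes`, `encode`, `encoded` and `decoded` are our own, and this is an illustration of the idea rather than scikit-learn's implementation.

```python
# A short list of labels in no particular order, for illustration only.
labels = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

# Sort the distinct labels, then number them 0, 1, 2 in that order.
classes = sorted(set(labels))
encode = {name: code for code, name in enumerate(classes)}

# Replace each label with its number, and convert the numbers back again.
encoded = [encode[name] for name in labels]
decoded = [classes[code] for code in encoded]

print(encode)
print(encoded)
```

Keeping the sorted list of classes around is what lets us turn a numerical prediction back into a species name later on.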

Now, we want to separate out the 4 attributes of our dataframe and call the result matrix X. 

In [None]:
# Remove Species column from df2 and set it as X
X = df2.drop(["Species"], axis=1)
X.head()

We also want to split our response data (i.e. Species) into Y. 

In [None]:
# Set Y as the Species column 
Y = df2["Species"]
Y.head()

We now want to split the data again into training (70%) and testing data (30%).

As mentioned above, we will be using supervised learning to predict the species of a given flower from what we have learned about the Iris's features. The training data is used to build a model to make these predictions. We then use the testing data to test the accuracy of the model built from the training data. 

It's important to note that we cannot train the model on 100% of our data and then use the same data to test its accuracy. If we did, the accuracy score would be misleadingly high (potentially 100%), because the model has already seen the correct response for every row.  

In [None]:
from sklearn.model_selection import train_test_split
# 70% training data
# 30% testing data
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)

From the results below we can see x_train and y_train now contain 70% (0.7 x 150 = 105 rows) of our data and x_test and y_test contain the remaining 30% (45 rows). The rows are assigned to each set at random. 

In [144]:
print("x_train:", x_train.shape)
print("y_train:", y_train.shape)
print("x_test:", x_test.shape)
print("y_test:", y_test.shape)

x_train: (105, 4)
y_train: (105,)
x_test: (45, 4)
y_test: (45,)
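The sizes above follow from a shuffle and a cut. The sketch below imitates the idea with row numbers only, using Python's `random` module. It is an illustration, not how scikit-learn implements it, and the names `indices`, `train_indices` and `test_indices` are our own. The fixed seed of 42 is an arbitrary choice that makes the shuffle repeatable; `train_test_split` offers a `random_state` argument for the same purpose.

```python
import random

n_rows = 150
test_fraction = 0.3

# Shuffle the row numbers 0..149. A fixed seed makes the shuffle repeatable.
indices = list(range(n_rows))
random.Random(42).shuffle(indices)

# The first 30% of the shuffled rows become test data, the rest training data.
n_test = round(n_rows * test_fraction)
test_indices = indices[:n_test]
train_indices = indices[n_test:]

print(len(train_indices), len(test_indices))
```

Because no seed is set in our own split above, the rows in each set, and therefore the accuracy scores that follow, can change every time the notebook is run.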


### Logistic Regression

Now that we have prepared our data we are able to build a model. We will firstly use <a href="https://www.youtube.com/watch?v=elojMnjn4kk&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=1&ab_channel=DataSchool" target="_blank">Logistic Regression</a>. Despite its name this is a linear model for classification rather than regression. 
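As a rough picture of how a linear model can classify, one common formulation gives each class a score and then converts the scores into probabilities with the softmax function; the class with the highest probability is the prediction. The scores below are made up for illustration. In practice the model learns the weights that produce them, and the details depend on the library and its settings.

```python
import math


def softmax(scores):
    """Turn a list of class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the highest score avoids very large exponentials.
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


# Made-up scores for classes 0 (Setosa), 1 (Versicolor) and 2 (Virginica).
class_scores = [2.0, 1.0, 0.1]
probabilities = softmax(class_scores)

# The predicted class is the one with the highest probability.
predicted_class = probabilities.index(max(probabilities))

print([round(p, 2) for p in probabilities], predicted_class)
```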

In [None]:
# The Logistic Regression model can be imported from sklearn
# We will name our model logreg and fit it to our training data
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(x_train,y_train)

Now that we have our model we can test it on our test data and store our predictions in an array *y_pred1*. 

In [None]:
y_pred1 = logreg.predict(x_test)

We can then calculate the accuracy of the model by comparing the actual results from y_test with the model's predicted values. 

In [None]:
from sklearn import metrics
accuracy = (metrics.accuracy_score(y_test, y_pred1) * 100)
print("Logistic Regression Model Accuracy: ", accuracy, "%" , sep="")
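Accuracy here is simply the share of predictions that match the actual species. The sketch below shows the calculation on two short made-up lists; the function name `accuracy_score_sketch` is our own, and it is meant only to illustrate what the score measures.

```python
def accuracy_score_sketch(y_true, y_pred):
    """Return the fraction of predictions that equal the actual values."""
    matches = sum(1 for actual, guess in zip(y_true, y_pred) if actual == guess)
    return matches / len(y_true)


# Made-up actual and predicted species codes: 4 of the 5 predictions are correct.
actual = [0, 1, 2, 2, 1]
predicted = [0, 1, 1, 2, 1]

print(accuracy_score_sketch(actual, predicted) * 100)
```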

### K-Nearest Neighbours

<a href="https://scikit-learn.org/stable/modules/neighbors.html" target="_blank">K-Nearest Neighbours</a> takes a value k greater than 0 from the user and searches for the k observations nearest to the unknown Iris. The predicted response for the unknown Iris is the most common Species among the k nearest observations.   
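The idea can be sketched in a few lines of plain Python: measure the straight-line distance from the unknown flower to every known flower, keep the k closest, and take a vote. The handful of rows below is for illustration only, and the function name `predict_knn` is our own; scikit-learn's classifier is more capable and more efficient than this sketch.

```python
import math
from collections import Counter

# A few known flowers: (sepal length, sepal width, petal length, petal width), species.
training_rows = [
    ((5.1, 3.5, 1.4, 0.2), "Iris-setosa"),
    ((4.9, 3.0, 1.4, 0.2), "Iris-setosa"),
    ((4.7, 3.2, 1.3, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Iris-versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris-versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "Iris-virginica"),
    ((5.8, 2.7, 5.1, 1.9), "Iris-virginica"),
    ((7.1, 3.0, 5.9, 2.1), "Iris-virginica"),
]


def predict_knn(query, rows, k):
    """Return the most common species among the k rows nearest to query."""
    # Sort the known flowers by their straight-line distance to the query.
    nearest = sorted(rows, key=lambda row: math.dist(query, row[0]))[:k]
    # Count the species of the k nearest flowers and take the most common.
    votes = Counter(species for _, species in nearest)
    return votes.most_common(1)[0][0]


print(predict_knn((5.0, 3.4, 1.5, 0.2), training_rows, k=3))
print(predict_knn((6.5, 3.0, 5.8, 2.2), training_rows, k=3))
```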

In [None]:
from sklearn.neighbors import KNeighborsClassifier
scores = []
k_range = list(range(1, 26)) # Values of k from 1 to 25
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train,y_train)
    y_pred2 = knn.predict(x_test)
    scores.append(metrics.accuracy_score(y_test, y_pred2))

In [None]:
sns.set(style="darkgrid")
plt.plot(k_range, scores)
plt.xlabel("Value of K for KNN")
plt.ylabel("Accuracy Score")

<a href="https://www.youtube.com/watch?v=RlQuVL6-qe8&list=PL5-da3qGB5ICeMbQuqbbCOQWcS6OYBr5A&index=4&ab_channel=DataSchool" target="_blank">Machine Learning in Python with scikit-learn</a>

In [None]:
# Converts ints in list to strings
k_range=[str(x) for x in k_range]

<a href="https://blog.finxter.com/how-to-convert-an-integer-list-to-a-string-list-in-python/" target="_blank">Convert integer list to string list</a>

In [None]:
# Creates dictionary with k-values as keys and scores as values
zip_iterator = zip(k_range, scores)
scoresDict = dict(zip_iterator)

<a href="https://www.adamsmith.haus/python/answers/how-to-create-a-dictionary-from-two-lists-in-python" target="_blank">Create Dictionary from two lists</a>

In [120]:
# Find max value in dictionary and print which k this refers to.
max_k = max(scoresDict, key=scoresDict.get)
accuracy=scoresDict[max_k]*100
print("KNN Model Accuracy where k=", max_k, ": ", accuracy, "%", sep="")

KNN Model Accuracy where k=8: 100.0%


<a href="https://stackoverflow.com/questions/268272/getting-key-with-maximum-value-in-dictionary" target="_blank">Max value in dictionary</a>
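Several values of k can share the top score, and `max` returns the first one it meets, which here is the smallest such k. The sketch below uses made-up scores to show this; the names `example_scores`, `tied_ks` and `first_best_k` are our own. It also works directly with integer values of k, so no conversion to strings is needed.

```python
# Made-up accuracy scores for k = 1..5, for illustration only.
k_values = list(range(1, 6))
example_scores = [0.93, 0.96, 0.98, 0.98, 0.96]

# Every k that reaches the highest score.
best_score = max(example_scores)
tied_ks = [k for k, score in zip(k_values, example_scores) if score == best_score]

# max() keeps the first (k, score) pair with the highest score, i.e. the smallest tied k.
first_best_k = max(zip(k_values, example_scores), key=lambda pair: pair[1])[0]

print(tied_ks, first_best_k)
```

As the split into training and testing data is random, the best k and its score can differ from one run of the notebook to the next.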

### Using the model with new data

In our run, KNN gave the higher model accuracy. We will therefore take the k that gave our maximum accuracy and build a new model using all 150 rows of data as our training data. 

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k=int(max_k)
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X,Y)

We can then ask the user to input measurements of a given Iris flower that we do not know the species of and predict the species from the data in the Iris dataset. 

In [134]:
print("Please input the measurements of the Iris:")
# float() is used rather than int() so that decimal measurements such as 5.1 are accepted.
sl = float(input("Sepal Length in cm: "))
sw = float(input("Sepal Width in cm: "))
pl = float(input("Petal Length in cm: "))
pw = float(input("Petal Width in cm: "))
X_new = [sl, sw, pl, pw]

Please input the measurements of the Iris:
Sepal Length in cm: 1
Sepal Width in cm: 2
Petal Length in cm: 3
Petal Width in cm: 4
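Measurements typed by a user arrive as text and may be decimals, blank or not numbers at all. The sketch below shows one way they could be checked before being passed to the model. The helper `parse_measurement` is hypothetical and not part of ***analysis.py***, and the rule that a measurement must be a finite number greater than 0 is our own assumption.

```python
import math


def parse_measurement(text):
    """Convert typed text to a positive float, raising ValueError if it is not valid."""
    value = float(text.strip())  # raises ValueError for text such as "abc"
    if not math.isfinite(value) or value <= 0:
        raise ValueError("measurement must be a number greater than 0")
    return value


# Try a valid decimal, a word and a negative number.
for typed in ["5.1", "abc", "-2"]:
    try:
        print(typed, "->", parse_measurement(typed))
    except ValueError:
        print(typed, "-> not a valid measurement")
```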


In [142]:
prediction=knn.predict([X_new])

In [156]:
# As we converted species into numerical values we have to change back to get the species name. 
if prediction[0]==0:
    print("Based on your inputted measurements, this is a Setosa.")
elif prediction[0]==1:
    print("Based on your inputted measurements, this is a Versicolor.")
else:
    print("Based on your inputted measurements, this is a Virginica.")

Based on your inputted measurements, this is a Versicolor.
