# Machine Learning and Statistics Project

**Tatjana Staunton**

***

>• The project is to create a notebook exploring classification algorithms applied on the iris flower data set associated with Ronald A. Fisher.

>• In your notebook, you should first explain what supervised learning is and then explain what classification algorithms are.

>• Describe at least one common classification algorithm and implement it using the `scikit-learn` Python library.

>• Throughout your notebook, use appropriate plots, mathematical notation, and diagrams to explain the relevant concepts.

![Iris Flowers](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Machine+Learning+R/iris-machinelearning.png)

#### Overview of Fisher's Iris Dataset

The Iris dataset is one of the best-known datasets in machine learning and statistics. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936. It records measurements taken from flowers of three iris species: setosa, versicolor, and virginica. Four features (attributes) are measured for each flower:
* Sepal length (in cm)
* Sepal width (in cm)
* Petal length (in cm)
* Petal width (in cm)

There are three classes:
* Iris setosa
* Iris versicolor
* Iris virginica

The dataset contains 150 instances, 50 for each of the three species. It has become a classic example in the field and is frequently used in tutorials and educational materials to introduce concepts of machine learning and data analysis.
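
The figures above can be checked directly. The following is a minimal sketch that assumes the copy of the Iris data bundled with `scikit-learn` (`sklearn.datasets.load_iris`) rather than the CSV file loaded later in this notebook; the variable names are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Load the copy of the Iris data that ships with scikit-learn (an assumption:
# the notebook itself reads a CSV file from the seaborn-data repository).
iris = load_iris()
X, y = iris.data, iris.target

# One row per flower, one column per measured feature.
print("Shape of the feature matrix:", X.shape)
print("Feature names:", iris.feature_names)

# Count how many flowers belong to each species.
class_counts = Counter(str(iris.target_names[label]) for label in y)
print("Instances per class:", dict(class_counts))
```

The feature matrix should have 150 rows and 4 columns, and each of the three species should appear 50 times.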

#### Supervised learning

>Supervised learning is a paradigm in machine learning where input objects (for example, a vector of predictor variables) and a desired output value (also known as human-labeled supervisory signal) train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.

>https://en.wikipedia.org/wiki/Supervised_learning
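
The generalization error cannot be observed directly, but it is commonly estimated by holding back part of the labelled data and measuring the error on it after training. The sketch below illustrates this idea under a few assumptions that are not part of the original notebook: the data comes from `sklearn.datasets.load_iris`, 30% of it is held out, and the model is a k-nearest neighbor classifier (described in the next section) with `n_neighbors=5`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Inputs (feature vectors) and desired outputs (species labels).
X, y = load_iris(return_X_y=True)

# Hold out 30% of the labelled data; the model never sees it during training.
# stratify=y keeps the three species equally represented in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Train on the labelled training data only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# The error on the held-out data is an estimate of the generalization error.
test_accuracy = model.score(X_test, y_test)
test_error = 1.0 - test_accuracy
print(f"Held-out accuracy: {test_accuracy:.3f}")
print(f"Estimated generalization error: {test_error:.3f}")
```

The exact numbers depend on the split (here fixed by `random_state=0`), so a single split gives only a rough estimate.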


#### K-nearest neighbor algorithm

>k-nearest neighbor algorithm:
This algorithm is used to solve the classification model problems. K-nearest neighbor or K-NN algorithm basically creates an imaginary boundary to classify the data. When new data points come in, the algorithm will try to predict that to the nearest of the boundary line.

>https://www.geeksforgeeks.org/k-nearest-neighbor-algorithm-in-python/
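
In more concrete terms, k-NN does not build an explicit boundary. It stores the training data and, for a new point, finds the k closest training points and assigns the class that is most common among them; the decision boundary is only implied by these votes. The sketch below spells out that procedure with NumPy and compares it with `KNeighborsClassifier` from `scikit-learn`. The helper `knn_predict`, the Euclidean distance, `k=5`, and the 70/30 split are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hand-written k-NN predictions for every held-out flower.
manual_pred = np.array([knn_predict(X_train, y_train, x, k=5) for x in X_test])

# The same task with scikit-learn's implementation.
sk_model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
sk_pred = sk_model.predict(X_test)

manual_accuracy = float(np.mean(manual_pred == y_test))
agreement = float(np.mean(manual_pred == sk_pred))
print(f"Accuracy of the hand-written k-NN: {manual_accuracy:.3f}")
print(f"Agreement with scikit-learn: {agreement:.3f}")
```

The two sets of predictions may differ on a few points where votes or distances are tied, because ties can be broken differently.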




#### Fisher's Iris Dataset

In [2]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


In [3]:
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv")
df

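| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |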


In [4]:
df.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [5]:
df.tail()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |


In [6]:
df.describe()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |


#### References
http://www.lac.inpe.br/~rafael.santos/Docs/CAP394/WholeStory-Iris.html

https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv

http://archive.ics.uci.edu/ml/datasets/Iris

https://en.wikipedia.org/wiki/Iris_flower_data_set

https://www.geeksforgeeks.org/k-nearest-neighbor-algorithm-in-python/

https://en.wikipedia.org/wiki/Supervised_learning




