# Scikit-Learn #

### The purpose of this Jupyter Notebook is to demonstrate three different algorithms from the scikit-learn library. ###

scikit-learn is an open-source Python library containing tools for predictive data analysis [25]. The library is built on NumPy, SciPy and matplotlib.

scikit-learn provides tools to help with:
 - Classification - Identifying which category an object belongs to
 - Regression - Predicting a continuous-valued attribute associated with an object
 - Clustering - Grouping similar objects into sets
 - Dimensionality reduction - Reducing the number of random variables to consider
 - Model selection - Comparing and validating parameters and models
 - Preprocessing - Feature extraction and normalization

For the purposes of this project, using the Iris Dataset, I will demonstrate the following three scikit-learn algorithms:
 1) Logistic Regression
 2) Nearest Neighbours
 3) Model Selection

I will also be using the following libraries to aid in visualisation and analysis:

- **Pandas** - an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools. [2]
- **NumPy** - the fundamental package for scientific computing with Python. [2]
- **Seaborn** - a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. [2]
- **sys** - system-specific parameters and functions. This module provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter. It is always available. [20]

## Fisher’s Iris Dataset
The dataset was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems”, where he used it as an example of linear discriminant analysis.
The dataset gives the measurements in centimetres of the sepal length and width and petal length and width of 50 flowers from each of three species of iris: setosa, versicolor, and virginica.
It contains 150 records with 5 attributes:
 - Sepal length in cm
 - Sepal width in cm
 - Petal length in cm
 - Petal width in cm
 - Species of iris: setosa, versicolor, virginica [1, 2]
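
As a quick sanity check of the figures above, the sketch below loads the copy of the Iris data that ships with scikit-learn and counts the records per species. This is an assumption for illustration only: the notebook itself reads a downloaded CSV, and the bundled copy uses different column names (for example `sepal length (cm)`).

```python
from sklearn.datasets import load_iris

# load_iris(as_frame=True) returns a Bunch whose .frame is a pandas DataFrame
# holding the four measurements plus an integer 'target' column.
iris = load_iris(as_frame=True)
frame = iris.frame.copy()

# Map the integer target codes (0, 1, 2) to the species names.
frame["species"] = frame["target"].map(dict(enumerate(iris.target_names)))
frame = frame.drop(columns="target")

print(frame.shape)                      # (rows, columns)
print(frame["species"].value_counts())  # number of flowers per species
```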

## Importing the Data
The Iris dataset was downloaded from the internet as a CSV file. The pandas library reads it into a data frame so that I can inspect the first few rows, and the sys library redirects printed output to a text file.

In [4]:
import sys
import pandas as pd
#Downloaded iris dataset from https://tinyurl.com/y8fovkyq

sys.stdout = open("variables_summary.txt", "w")

iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

iris_data.columns = ['sepal_length', 'sepal_width',
                     'petal_length', 'petal_width', 'variety']

iris_data.head(10)

iris_data.shape
#print(iris_data)

(150, 5)

## Summary Chart (or a much nicer view of the min, max, mean, std dev)

In [None]:
import sys
import pandas as pd

iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

iris_data.columns = ['sepal_length', 'sepal_width',
                     'petal_length', 'petal_width', 'variety']

#Gathered a summary of the data (count, mean, std deviation, minimum, 25%, 50%, 75%, maximum)
#Used call chaining to keep the code compact.

summary = iris_data.describe().transpose().head()
print(summary)

sys.stdout.close()

## Analysis:
Judging from the summary above, the measurements differ noticeably in their ranges, particularly sepal length and petal length. Next, we will see whether these differences are determined by the species of iris.
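
One way to check whether the spread depends on species is to summarise a single measurement per species. The sketch below is only an illustration and assumes the copy of the dataset bundled with scikit-learn, whose columns are named differently from the CSV used in this notebook.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
frame = iris.frame.copy()
frame["species"] = frame["target"].map(dict(enumerate(iris.target_names)))

# Minimum, mean and maximum petal length for each species.
per_species = (
    frame.groupby("species")["petal length (cm)"]
         .agg(["min", "mean", "max"])
)
print(per_species)
```

If the ranges of two species do not overlap, that measurement alone is enough to tell them apart.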

## Classification

## Boxplots
Boxplots are useful because they offer a quick and visually pleasing way to compare data[2][11].
In this case, I have four separate boxplots comparing the distribution of each variable across the varieties of iris. First I compare the sepal length, then the sepal width, then the petal length and, finally, the petal width for each species: Setosa, Versicolor and Virginica.

In [None]:
import seaborn as sns

sns.set(style="whitegrid", palette="BuGn_r",
        rc={'figure.figsize': (11.7, 8.27)})

title = "Compare Distributions of Sepal Length"

# Draw the boxplot and attach the title to its axes.
ax = sns.boxplot(x="variety", y="sepal_length", data=iris_data)
ax.set_title(title)

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

iris_data.columns = ['sepal_length', 'sepal_width',
                     'petal_length', 'petal_width', 'variety']

sns.set(style="whitegrid", palette="BuGn_r",
        rc={'figure.figsize': (11.7, 8.27)})

# Draw one boxplot per remaining measurement, each on its own figure.
# The species column is named 'variety' in this data frame.
plots = {"sepal_width": "Compare Distributions of Sepal Width",
         "petal_length": "Compare Distributions of Petal Length",
         "petal_width": "Compare Distributions of Petal Width"}

for column, title in plots.items():
    plt.figure()
    ax = sns.boxplot(x="variety", y=column, data=iris_data)
    ax.set_title(title)
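
To make clear what a boxplot encodes, the sketch below computes the quartiles, whisker ends and outliers for a list of values. It assumes the common convention (also seaborn's default) that whiskers reach the most extreme points within 1.5 times the interquartile range of the box. The helper name `five_number_summary` and the sample values are made up for illustration.

```python
import numpy as np


def five_number_summary(values):
    """Return the quantities a boxplot draws for one group of values."""
    arr = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    iqr = q3 - q1

    # Points beyond 1.5 * IQR from the box are drawn as individual outliers.
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = arr[(arr >= low_fence) & (arr <= high_fence)]

    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": arr[(arr < low_fence) | (arr > high_fence)].tolist(),
    }


# Illustrative values only, not taken from the Iris data.
summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(summary)
```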

## Scatterplots
Scatterplots of pairs of variables show that there is a distinct difference in size between the species.

In [None]:
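import pandas as pd
# The plots in this cell are drawn with Bokeh [13], so its plotting functions are imported here.
from bokeh.plotting import figure, output_file, show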

iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

iris_data.columns = ['sepal_length', 'sepal_width',
                     'petal_length', 'petal_width', 'variety']

output_file("test1.html")

color1 = '#FF1493'
color2 = '#9400D3'
color3 = '#008080'

#Adding colours
colormap = {'Setosa': color1, 'Versicolor': color2, 'Virginica': color3}
colors = [colormap[x] for x in iris_data['variety']]

#Comparing Petal Width and Petal Length across all three species
p = figure(title="Petal Width and Petal Length")
p.xaxis.axis_label = 'Petal Length'
p.yaxis.axis_label = 'Petal Width'

# scatter() with marker="diamond" draws the same glyph as the older diamond() method.
# No legend is configured because the glyph has no legend label.
p.scatter(iris_data["petal_length"], iris_data["petal_width"],
          marker="diamond", color=colors, fill_alpha=0.2, size=10)

show(p)

#Comparing Sepal Width and Sepal Length across all three species
output_file("test2.html")

#adding colors
colormap = {'Setosa': color1, 'Versicolor': color2, 'Virginica': color3}
colors = [colormap[x] for x in iris_data['variety']]

#adding labels
p = figure(title="Sepal Width and Sepal Length")
p.xaxis.axis_label = 'Sepal Length'
p.yaxis.axis_label = 'Sepal Width'

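# scatter() with marker="circle" draws fixed-size circles, like the older circle(size=...) call.
p.scatter(iris_data["sepal_length"], iris_data["sepal_width"],
          marker="circle", color=colors, fill_alpha=0.2, size=10)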

show(p)



## Analysis:
In the scatterplots we can see that the iris Setosa is clearly the smallest flower in terms of both sepal length and width and petal length and width. Iris Virginica, as we can see in the scatterplots, is the largest.

## Pairplot
I decided to use a pairplot here because it “creates a matrix of axes and shows the relationship for each pair of columns in a data frame. By default, it also draws the univariate distribution of each variable on the diagonal axis”[12].
This way, we have all the data points available to us in one place to analyse.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

iris_data.columns = ['sepal_length', 'sepal_width',
                     'petal_length', 'petal_width', 'variety']

sns.pairplot(iris_data, hue="variety",
             palette="GnBu_d", markers=["o", "s", "D"])

plt.show()

### Two Dimensions ###

In [None]:
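# Imports and data frame used in the remaining cells (assumes iris_data from above).
import matplotlib.pyplot as plt
import sklearn.linear_model as lm
import sklearn.model_selection as mod
import sklearn.neighbors as nei

# The remaining cells expect lowercase species names in a 'species' column.
df = iris_data.rename(columns={'variety': 'species'})
df['species'] = df['species'].str.lower()

# New figure.
fig, ax = plt.subplots()

# Scatter plot.
ax.plot(df['petal_width'], df['sepal_length'], '.')

# Set axis labels.
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length');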

In [None]:
# Seaborn is great for creating complex plots with one command.
sns.lmplot(x="petal_width", y="sepal_length", hue='species', data=df, fit_reg=False, height=10, aspect=1.5);

### Pyplot ###

In [None]:
# Segregate the data.
setos = df[df['species'] == 'setosa']
versi = df[df['species'] == 'versicolor']
virgi = df[df['species'] == 'virginica']

# New plot.
fig, ax = plt.subplots()

# Scatter plots.
ax.scatter(setos['petal_width'], setos['sepal_length'], label='Setosa')
ax.scatter(versi['petal_width'], versi['sepal_length'], label='Versicolor')
ax.scatter(virgi['petal_width'], virgi['sepal_length'], label='Virginica')

# Show the legend.
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.legend();

In [None]:
# How the segregation works.
df['species'] == 'virginica'

In [None]:
df[df['species'] == 'virginica'].head()
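
The two cells above rely on boolean masking. As a minimal sketch on a made-up four-row data frame (the name `toy` and its values are illustrative, not part of the Iris data): comparing a column with a value produces a Series of True/False, and indexing the frame with that Series keeps only the True rows.

```python
import pandas as pd

# A tiny illustrative frame, not the real data.
toy = pd.DataFrame({
    "petal_width": [0.2, 1.3, 2.1, 1.8],
    "species": ["setosa", "versicolor", "virginica", "virginica"],
})

# Step 1: the comparison yields one boolean per row.
mask = toy["species"] == "virginica"
print(mask.tolist())

# Step 2: indexing with the mask keeps the rows where it is True.
selected = toy[mask]
print(selected)
```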

### Using groupby() ###

In [None]:
# New plot.
fig, ax = plt.subplots()

# Using pandas groupby().
for species, data in df.groupby('species'):
    ax.scatter(data['petal_width'], data['sepal_length'], label=species)

# Show the legend.
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.legend();

In [None]:
# Group by typically takes a categorical variable.
x = df.groupby('species')
x

In [None]:
# Mean of each measurement per species.
x.mean()

In [None]:
# Looping through groupby().
for i, j in x:
    print()
    print(f"i is: '{i}'")
    print(f"j looks like:\n{j[:3]}")
    print()
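
The loop above works because iterating over a groupby object yields pairs of a group key and the sub-frame for that key. The sketch below shows the same idea on a small made-up frame (names and values are illustrative) and checks that an explicit loop gives the same per-group means as the built-in aggregation.

```python
import pandas as pd

# A tiny illustrative frame, not the real data.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_width": [0.2, 0.4, 2.0, 2.4],
})

# Explicit loop: each iteration receives the key and the matching rows.
means_by_loop = {}
for name, group in toy.groupby("species"):
    means_by_loop[name] = group["petal_width"].mean()

# The same result in one call.
means_by_method = toy.groupby("species")["petal_width"].mean()

print(means_by_loop)
print(means_by_method)
```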

### Test and Train Split

In [None]:
# Split the data frame in two.
train, test = mod.train_test_split(df)

In [None]:
# Show some training data.
train.head()

In [None]:
# The indices of the train array.
train.index

In [None]:
# Show some testing data.
test.head()
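
By default `train_test_split` shuffles the rows and holds out a quarter of them for testing, so every run of the cells above gives a different split. The sketch below, on a made-up frame of 150 rows, shows two optional arguments that are not used in this notebook: `random_state` makes the split repeatable and `stratify` keeps the species proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative frame with 150 rows and three equally sized classes.
toy = pd.DataFrame({
    "x": range(150),
    "species": ["setosa", "versicolor", "virginica"] * 50,
})

# random_state fixes the shuffle; stratify balances the classes across the split.
train, test = train_test_split(toy, random_state=0, stratify=toy["species"])

print(len(train), len(test))
print(test["species"].value_counts())
```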

### Two Dimensions: Test Train Split ###

In [None]:
# Segregate the training data.
setos = train[train['species'] == 'setosa']
versi = train[train['species'] == 'versicolor']
virgi = train[train['species'] == 'virginica']

# New plot.
fig, ax = plt.subplots()

# Scatter plots for training data.
ax.scatter(setos['petal_width'], setos['sepal_length'], marker='o', label='Setosa')
ax.scatter(versi['petal_width'], versi['sepal_length'], marker='o', label='Versicolor')
ax.scatter(virgi['petal_width'], virgi['sepal_length'], marker='o', label='Virginica')

# Scatter plot for testing data.
ax.scatter(test['petal_width'], test['sepal_length'], marker='x', label='Test data')

# Show the legend.
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.legend();

### Two Dimensions: Inputs and outputs ###

In [None]:
# Give the inputs and outputs convenient names.
inputs, outputs = train[['sepal_length', 'petal_width']], train['species']

In [None]:
# Peek at the inputs.
inputs.head()

### Two Dimensions: Logistic regression ###

In [None]:
# Create a new classifier.
lre = lm.LogisticRegression(random_state=0)

# Train the classifier on our data.
lre.fit(inputs, outputs)

In [None]:
# Ask the classifier to classify the test data.
predictions = lre.predict(test[['sepal_length', 'petal_width']])
predictions

In [None]:
# Eyeball the misclassifications.
predictions == test['species']

In [None]:
# What proportion were correct?
lre.score(test[['sepal_length', 'petal_width']], test['species'])
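
For a classifier, `score` returns the accuracy, that is, the fraction of test rows whose prediction matches the true label. The sketch below checks this by hand. It assumes the copy of the Iris data bundled with scikit-learn (columns 0 and 3 are sepal length and petal width) and sets `max_iter=1000`, which is my own choice to avoid convergence warnings rather than something used in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X = X[:, [0, 3]]  # sepal length and petal width only

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000, random_state=0)
clf.fit(X_train, y_train)

# Accuracy computed by hand from the predictions.
predictions = clf.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)

# predict_proba gives one probability per class for every test row.
probabilities = clf.predict_proba(X_test)

print(manual_accuracy, clf.score(X_test, y_test))
print(probabilities[:3])
```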

### Two Dimensions: Misclassified ###

In [None]:
# Add a column with the predictions to a copy of the test data frame.
test = test.assign(predicted=predictions)
test.head()

In [None]:
# Show the misclassified data.
misclass = test[test['predicted'] != test['species']]
misclass
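
A confusion matrix summarises the same information as the table of misclassified rows: each row is a true species, each column a predicted species. The sketch below uses `sklearn.metrics.confusion_matrix` on a handful of made-up labels, which are illustrative and not results from this notebook.

```python
from sklearn.metrics import confusion_matrix

labels = ["setosa", "versicolor", "virginica"]

# Illustrative true and predicted labels, not real results.
y_true = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
y_pred = ["setosa", "virginica", "virginica", "versicolor", "virginica"]

# Rows follow the true labels, columns the predicted labels, both in 'labels' order.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
print(matrix)
```

Off-diagonal entries are the misclassifications. Here one versicolor flower was predicted to be virginica.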

In [None]:
# Eyeball the descriptive statistics for the species.
train.groupby('species').mean()

In [None]:
# New plot.
fig, ax = plt.subplots()

# Plot all of the data, grouped by species.
for species, data in df.groupby('species'):
    ax.scatter(data['petal_width'], data['sepal_length'], label=species)
    
# Plot misclassified.
ax.scatter(misclass['petal_width'], misclass['sepal_length'], s=200, facecolor='none', edgecolor='r', label='Misclassified')

# Show the legend.
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.legend();

### Separating Setosa ###

In [None]:
# Another look at this plot.
sns.pairplot(df, hue='species');

In [None]:
# Give the inputs and outputs convenient names.
inputs = train[['sepal_length', 'petal_width']]

# Set 'versicolor' and 'virginica' to 'other'.
outputs = train['species'].apply(lambda x: x if x == 'setosa' else 'other')

# Eyeball outputs
outputs.unique()

In [None]:
# Create a new classifier.
lre = lm.LogisticRegression(random_state=0)

# Train the classifier on our data.
lre.fit(inputs, outputs)

In [None]:
actual = test['species'].apply(lambda x: x if x == 'setosa' else 'other')

# What proportion were correct?
lre.score(test[['sepal_length', 'petal_width']], actual)
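
Setosa is easy to separate because its petals are much narrower than those of the other two species. As a sketch of this, the code below classifies a flower as setosa using a single cut-off on petal width. The threshold of 0.8 cm is a hypothetical value chosen by eye, and the data is the copy bundled with scikit-learn rather than the CSV used above.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
petal_width = iris.data["petal width (cm)"]

# Target code 0 corresponds to setosa in the bundled copy.
is_setosa = iris.target == 0

# Hypothetical rule: anything with petals narrower than 0.8 cm is setosa.
threshold = 0.8
predicted_setosa = petal_width < threshold

accuracy = (predicted_setosa == is_setosa).mean()
print(accuracy)
```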

### Using All Possible Inputs ###

In [None]:
# Reload the iris data set from the local CSV file (the cells below continue to use df).
iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

In [None]:
# Split the data frame in two.
train, test = mod.train_test_split(df)

In [None]:
# Use all four possible inputs.
inputs, outputs = train[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], train['species']

In [None]:
# Create a new classifier.
lre = lm.LogisticRegression(random_state=0)

# Train the classifier on our data.
lre.fit(inputs, outputs)

In [None]:
# Ask the classifier to classify the test data.
predictions = lre.predict(test[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])
predictions

In [None]:
# Eyeball the misclassifications.
(predictions == test['species']).value_counts()

In [None]:
# What proportion were correct?
lre.score(test[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], test['species'])

###  K Nearest Neighbours Classifier ###

In [None]:
# Reload the iris data set from the local CSV file (the cells below continue to use df).
iris_data = pd.read_csv('tableconvert_csv_xin5ac.csv')

In [None]:
# Split the data frame in two.
train, test = mod.train_test_split(df)

In [None]:
# Use all four possible inputs.
inputs, outputs = train[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], train['species']

In [None]:
# Classifier.
knn = nei.KNeighborsClassifier()

In [None]:
# Fit.
knn.fit(inputs, outputs)

In [None]:
# Test.
knn.score(test[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], test['species'])

In [None]:
# Predict with the k nearest neighbours classifier (not the earlier logistic regression).
predictions = knn.predict(test[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])
(predictions == test['species']).value_counts()

In [None]:
# The score is just the accuracy in this case.
(predictions == test['species']).value_counts(normalize=True)
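
To show what the classifier does internally, here is a minimal from-scratch sketch of k nearest neighbours: measure the Euclidean distance from the query to every training point, take the k closest, and return the most common label among them. The function name `knn_predict` and the sample points are hypothetical. `KNeighborsClassifier` follows the same idea by default (five neighbours, Euclidean distance) but uses faster search structures.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of 'query' by majority vote among its k nearest points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Illustrative (petal length, petal width) points, not the full dataset.
points = [(1.4, 0.2), (1.5, 0.2), (1.3, 0.3),
          (4.7, 1.4), (4.5, 1.5), (5.9, 2.1)]
labels = ["setosa", "setosa", "setosa",
          "versicolor", "versicolor", "virginica"]

small_flower = knn_predict(points, labels, (1.6, 0.3), k=3)
medium_flower = knn_predict(points, labels, (4.6, 1.4), k=3)
print(small_flower, medium_flower)
```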

### Cross validation ###

In [None]:
knn = nei.KNeighborsClassifier()
scores = mod.cross_val_score(knn, df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], df['species'])
scores

In [None]:
print(f"Mean: {scores.mean()} \t Standard Deviation: {scores.std()}")

In [None]:
lre = lm.LogisticRegression(random_state=0)
scores = mod.cross_val_score(lre, df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']], df['species'])
scores
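
`cross_val_score` repeats the train and test procedure several times, each time holding out a different part of the data. For a classifier the default in recent scikit-learn versions is five stratified folds. The sketch below makes that splitter explicit so the fold sizes can be inspected. It assumes the copy of the Iris data bundled with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Five folds, each preserving the species proportions.
folds = StratifiedKFold(n_splits=5)

# One accuracy score per fold.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=folds)

# Number of held-out rows in each fold.
fold_sizes = [len(test_idx) for _, test_idx in folds.split(X, y)]

print(scores)
print(fold_sizes)
print(f"Mean: {scores.mean()} \t Standard Deviation: {scores.std()}")
```

Comparing the mean and standard deviation of the scores for two classifiers gives a fairer picture than a single train and test split.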

## References:
1. https://en.wikipedia.org/wiki/Iris_flower_data_set
2. https://github.com/RitRa/Project2018-iris/blob/master/Project%2B2018%2B-%2BFishers%2BIris%2Bdata%2Bset%2Banalysis.ipynb
3. https://tableconvert.com/?output=csv&data=https://gist.github.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv
4. https://realpython.com/python-csv/
5. https://stackoverflow.com/questions/1526607/extracting-data-from-a-csv-file-in-python
6. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
7. https://www.w3schools.com/python/numpy_intro.asp
8. https://seaborn.pydata.org/generated/seaborn.boxplot.html
9. https://seaborn.pydata.org/tutorial/color_palettes.html
10. https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
11. https://seaborn.pydata.org/generated/seaborn.scatterplot.html
12. https://seaborn.pydata.org/tutorial/distributions.html
13. https://docs.bokeh.org/en/latest/docs/reference/plotting.html
14. https://docs.bokeh.org/en/latest/docs/reference/colors.html
15. https://www.w3schools.com/colors/colors_groups.asp
16. https://www.kaggle.com/abhishekkrg/python-iris-data-visualization-and-explanation/data
17. https://stackoverflow.com/questions/7152762/how-to-redirect-print-output-to-a-file-using-python
18. https://kite.com/python/answers/how-to-redirect-print-output-to-a-text-file-in-python
19. https://unsplash.com/photos/gK6f8bKKic0
20. https://docs.python.org/3/library/sys.html
21. https://en.wikipedia.org/wiki/Iris_setosa
22. https://en.wikipedia.org/wiki/Iris_versicolor
23. https://en.wikipedia.org/wiki/Iris_virginica
24. https://github.com/vwalsh86/Iris-Data-Set-Project
25. https://scikit-learn.org/stable/