<img src="https://i2.wp.com/softwareengineeringdaily.com/wp-content/uploads/2016/09/scikit-learn-logo.png?resize=566%2C202&ssl=1" alt="image info" />

# Scikit-learn
***

* Author: John Paul Lee
* Github: JPLee01
* Email: G00387906@gmit.ie
* Created: 04-11-2021, Last update: XX-12-2021
* Machine Learning and Statistics: Investigation into the Scikit-learn and Scipy-Stats Python libraries.
***
* This Jupyter Notebook has been created to investigate the Scikit-learn Python library by offering an overview, demonstrations, plots and visualisations of the library.

**Lecturer:** Dr. Ian McLoughlin

The Project instructions can be found [here](https://github.com/JPLee01/Machine_Learning_and_Statistics/blob/main/Instructions.pdf)
***
As part of the project this notebook will deal with three main tasks:

1. Offer an overview of the Scikit-learn and Scipy-Stats Python libraries.
2. Demonstrate three Scikit-learn algorithms and a Scipy-Stats hypothesis test using ANOVA.
3. Create plots and visualisations as necessary.

## Preliminaries


## 1. Scikit-Learn

Scikit-learn is a machine learning library for the Python programming language.<sup>[1](#myfootnote1)</sup> Initially developed as a Google Summer of Code project by David Cournapeau in 2007, scikit-learn is now one of the most popular machine learning libraries on GitHub.<sup>[2](#myfootnote2)</sup> Built on the NumPy, SciPy and matplotlib libraries, scikit-learn has been described as the gold standard for machine learning in the Python ecosystem.<sup>[3](#myfootnote3)</sup> Scikit-learn's key concepts and features include:<sup>[4](#myfootnote4)</sup>
* Algorithmic decision-making methods, including:
    * **Classification:** identifying and categorizing data based on patterns.
    * **Regression:** predicting continuous data values based on the relationships found in existing data.
    * **Clustering:** automatic grouping of similar data points into clusters.
* Algorithms that support predictive analysis ranging from simple linear regression to neural network pattern recognition.
* Interoperability with NumPy, pandas, and matplotlib libraries.

Scikit-learn aims to provide a range of supervised and unsupervised learning algorithms to the user.<sup>[5](#myfootnote5)</sup>
* **Supervised Learning Algorithms** refer to algorithms which attempt to model relationships and dependencies between the target prediction output and the input features.<sup>[6](#myfootnote6)</sup> Within supervised learning algorithms the input variables (*x*) and an output variable (*Y*) are known, and the algorithm is employed to learn the mapping function from the input to the output. The goal of supervised learning algorithms is to approximate the mapping function so well that, given new input data (*x*), the output variable (*Y*) can be predicted for that data.<sup>[7](#myfootnote7)</sup> As the name suggests, supervised learning algorithms incorporate the use of a "supervisor" or "teacher". The supervisor is the labeled training data, which teaches the algorithm to detect the underlying patterns and relationships between the input data and the output labels.<sup>[8](#myfootnote8)</sup> The algorithm is refined until it is able to yield accurate labeling results when presented with never-before-seen data.<sup>[9](#myfootnote9)</sup> This concept is similar to the way a student learns under the supervision of a teacher.<sup>[10](#myfootnote10)</sup> Some of the advantages of supervised learning algorithms are that they offer the user a high degree of control over the training process and the definition of the classes.<sup>[11](#myfootnote11)</sup> Some of the disadvantages are that they are not seen as the most efficient option for dealing with complex tasks and can be very time intensive.<sup>[12](#myfootnote12)</sup>

* **Unsupervised Learning Algorithms** are a type of machine learning algorithm in which no pre-assigned labels or scores are provided with the training data.<sup>[13](#myfootnote13)</sup> As a result, these algorithms discover hidden patterns or data groupings without the need for human intervention (hence the name unsupervised).<sup>[14](#myfootnote14)</sup> While the goal of supervised learning is to predict outcomes for new data, the goal of unsupervised learning is to gain insights from large volumes of new data. The algorithm itself determines what is different or interesting in the dataset.<sup>[15](#myfootnote15)</sup> Due to their ability to discover similarities and differences in data, unsupervised learning algorithms are widely used in the fields of exploratory data analysis and image recognition.<sup>[16](#myfootnote16)</sup> Some of the advantages of unsupervised learning algorithms are that they can uncover hidden patterns which the user might not have previously considered, and that the opportunity for human error is drastically reduced.<sup>[17](#myfootnote17)</sup> Some of the disadvantages include questions over the accuracy of the results due to the lack of labeled data to train from, and no guarantee that the obtained results will be of benefit to the user.<sup>[18](#myfootnote18)</sup>

The processes for both supervised and unsupervised learning algorithms can be seen in the images below:<sup>[19](#myfootnote19)</sup>

<table><tr>
<td> <img src="https://bigdata-madesimple.com/wp-content/uploads/2018/02/Machine-Learning-Explained1.png" alt="Supervised" style="width: 450px; height: 450px;"/> </td>
<td> <img src="https://bigdata-madesimple.com/wp-content/uploads/2018/02/Machine-Learning-Explained2.png" alt="Unsupervised" style="width: 450px; height: 450px;"/> </td>
</tr></table>
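
The two processes described above can also be sketched in code. The example below is a minimal illustration rather than a definitive implementation: the choice of `KNeighborsClassifier` and `KMeans`, the 25% test split and the `random_state` values are all assumptions made for demonstration purposes. The supervised model is given both the inputs and the labels, while the unsupervised model only ever sees the inputs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load an embedded dataset: X holds the input features (x), y the known labels (Y).
X, y = load_iris(return_X_y=True)

# Supervised: hold back a test set to play the role of "never-before-seen" data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)  # learns the mapping from x to Y using the labels
test_accuracy = classifier.score(X_test, y_test)

# Unsupervised: only X is passed, so the algorithm never sees the labels.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = clusterer.fit_predict(X)

print(f"Supervised accuracy on unseen data: {test_accuracy:.2f}")
print("Unsupervised cluster sizes:", np.bincount(cluster_ids))
```

Note that the cluster numbers returned by `KMeans` are arbitrary identifiers and do not correspond to the class labels in `y`, which is one reason the accuracy of unsupervised results is harder to judge.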



Scikit-learn has six main areas in which it can be applied:<sup>[20](#myfootnote20)</sup>
1. **Supervised Learning Algorithms** - including Linear Regression, Support Vector Machines, Nearest Neighbors and Decision Trees.<sup>[21](#myfootnote21)</sup>
2. **Unsupervised Learning Algorithms** - including Gaussian Mixture Models, Clustering, Covariance Estimation and Density Estimation.<sup>[22](#myfootnote22)</sup>
3. **Model Selection and Evaluation** - Tools for selecting and evaluating models, such as Cross-validation and Validation Curves.<sup>[23](#myfootnote23)</sup>
4. **Inspection** - Inspection of the performance of algorithms through the use of Partial Dependence and Individual Conditional Expectation plots as well as Permutation Feature Importance.<sup>[24](#myfootnote24)</sup>
5. **Visualisations** - Application Programming Interface (API) for generating visualisations.<sup>[25](#myfootnote25)</sup>
6. **Dataset Transformations** - Library of transformers which may clean, reduce, expand or generate feature representations.<sup>[26](#myfootnote26)</sup>
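
Several of these areas are typically combined in a single workflow. The sketch below is one possible illustration under stated assumptions: the Iris dataset, the `StandardScaler` transformer, the `SVC` estimator, five-fold cross-validation and ten permutation repeats are all arbitrary choices made for demonstration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Dataset transformation (area 6) chained with a supervised estimator (area 1).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Model selection and evaluation (area 3): 5-fold cross-validation.
scores = cross_val_score(model, X, y, cv=5)
print("Accuracy per fold:", scores.round(2))
print(f"Mean accuracy: {scores.mean():.2f}")

# Inspection (area 4): how much does shuffling each feature reduce the score?
model.fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print("Mean importance per feature:", result.importances_mean.round(3))
```

Placing the scaler inside the pipeline means it is re-fitted on the training portion of each fold, so no information from the held-out fold leaks into the transformation.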

Due to its permissive simplified BSD license, scikit-learn is widely used in both academic and commercial settings, including by JP Morgan, Spotify and Booking.com.<sup>[27](#myfootnote27)</sup><sup>[28](#myfootnote28)</sup>

Scikit-learn also includes a number of embedded datasets which can be loaded without downloading any file from an external website.<sup>[29](#myfootnote29)</sup> These small, standard "toy" datasets are complemented by loaders for larger real world datasets.<sup>[30](#myfootnote30)</sup><sup>[31](#myfootnote31)</sup> The author will call on these datasets for different sections of the assessment.

## 2. Scikit-Learn Algorithms
For this section of the assessment the author will conduct a detailed analysis of three different scikit-learn algorithms. For each algorithm the author will provide an overview, implement it on a dataset, and discuss the results with supporting visualisations. The author will also compare and contrast the algorithms in terms of accuracy, ease of use and relevance.

To allow a fair comparison between the different algorithms, the same dataset will be used throughout. As previously discussed, scikit-learn has a number of embedded datasets. The author will call on the embedded Iris dataset for this section of the assessment.

### 2.1 Iris Dataset Overview
The data analysed in this project is the "Iris Flower Data
Set".<sup>[32](#myfootnote32)</sup> This data set was introduced by R.A. Fisher
in 1936 in his paper "The Use of Multiple Measurements in
Taxonomic Problems".<sup>[33](#myfootnote33)</sup> In this paper Fisher studied the
use of linear combinations of multiple characterising features of a
species to discriminate it from related species. Within the paper Fisher
studied the following three related species of Iris flowers (Iris setosa, Iris versicolor and Iris virginica):

![Iris Species](https://miro.medium.com/max/1400/0*Uw37vrrKzeEWahdB)

Fifty samples of each species were collected and analysed. (It should be
noted that the data for the Setosa and Versicolor species were already
available from a previous study by Fisher's colleague, the botanist Edgar
Anderson.) Within each species, Fisher studied four distinct
characteristics:

1.  Sepal Length (cm)
2.  Sepal Width (cm)
3.  Petal Length (cm)
4.  Petal Width (cm)

These characteristics can be seen below:

  ![Iris Characteristics](https://miro.medium.com/max/800/1*1q79O5DCx_XNrAARXSFzpg.png)

#### 2.1.1 Loading the Dataset
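
A minimal sketch of how the embedded Iris dataset might be loaded is given below. It uses scikit-learn's `load_iris` function; the variable names (`iris`, `X`, `y`) are illustrative assumptions rather than names taken from the rest of this notebook. The output can be used to check the fifty samples per species and the four characteristics described above.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the copy of the Iris dataset that ships with scikit-learn (no download needed).
iris = load_iris()
X = iris.data    # one row per flower, one column per measured characteristic
y = iris.target  # integer code (0, 1 or 2) identifying the species of each flower

print("Shape of the feature matrix:", X.shape)
print("Characteristics:", iris.feature_names)

# Count how many samples belong to each species.
for code, name in enumerate(iris.target_names):
    print(f"{name}: {np.sum(y == code)} samples")
```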

### 2.2 Support Vector Machines

### 2.3 K-Nearest Neighbors 

### 2.4 K-Means

### Summary and Conclusions 

## References
****

<a name="myfootnote1">1</a>: Fabian Pedregosa et al. - Scikit-learn: Machine Learning in Python, https://jmlr.org/papers/v12/pedregosa11a.html

<a name="myfootnote2">2</a>: Thomas Elliott - The State of the Octoverse: machine learning, https://github.blog/2019-01-24-the-state-of-the-octoverse-machine-learning/

<a name="myfootnote3">3</a>: George Seif - An Introduction to Scikit Learn: The Gold Standard of Python Machine Learning, https://www.kdnuggets.com/2019/02/introduction-scikit-learn-gold-standard-python-machine-learning.html

<a name="myfootnote4">4</a>: Active State - What Is Scikit-Learn In Python?, https://www.activestate.com/resources/quick-reads/what-is-scikit-learn-in-python/

<a name="myfootnote5">5</a>: Jason Brownlee - A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library, https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/

<a name="myfootnote6">6</a>: David Fumo - Types of Machine Learning Algorithms You Should Know, https://towardsdatascience.com/types-of-machine-learning-algorithms-you-should-know-953a08248861

<a name="myfootnote7">7</a>: Jason Brownlee - Supervised and Unsupervised Machine Learning Algorithms, https://machinelearningmastery.com/supervised-and-unsupervised-machine-learning-algorithms/

<a name="myfootnote8">8</a>: Java Point - Supervised Machine Learning, https://www.javatpoint.com/supervised-machine-learning

<a name="myfootnote9">9</a>: David Petersson - Supervised Learning, https://searchenterpriseai.techtarget.com/definition/supervised-learning

<a name="myfootnote10">10</a>: Data Robot - Supervised Machine Learning, https://www.datarobot.com/wiki/supervised-machine-learning/

<a name="myfootnote11">11</a>: Pythonista Planet - Pros and Cons of Supervised Machine Learning, https://pythonistaplanet.com/pros-and-cons-of-supervised-machine-learning/

<a name="myfootnote12">12</a>: Ronald van Loon - Machine learning explained: Understanding supervised, unsupervised, and reinforcement learning, https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/

<a name="myfootnote13">13</a>: Geoffrey Hinton - A Practical Guide to Training Restricted Boltzmann Machines, https://www.cs.toronto.edu/~hinton/absps/guideTR.pdf

<a name="myfootnote14">14</a>: IBM - Unsupervised Learning, https://www.ibm.com/cloud/learn/unsupervised-learning

<a name="myfootnote15">15</a>: IBM - Supervised vs. Unsupervised Learning: What’s the Difference?, https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning

<a name="myfootnote16">16</a>: Java Point - Unsupervised Machine Learning, https://www.javatpoint.com/unsupervised-machine-learning

<a name="myfootnote17">17</a>: Asquero - Advantages and Disadvantages of different types of machine learning algorithms, https://www.asquero.com/article/advantages-and-disadvantages-of-different-types-of-machine-learning-algorithms/

<a name="myfootnote18">18</a>: Pythonista Planet - Pros and Cons of Unsupervised Learning, https://pythonistaplanet.com/pros-and-cons-of-unsupervised-learning/

<a name="myfootnote19">19</a>: Ronald van Loon - Machine learning explained: Understanding supervised, unsupervised, and reinforcement learning, https://bigdata-madesimple.com/machine-learning-explained-understanding-supervised-unsupervised-and-reinforcement-learning/

<a name="myfootnote20">20</a>: Scikit-learn - User Guide Overview, https://scikit-learn.org/stable/user_guide.html

<a name="myfootnote21">21</a>: Scikit-learn - User Guide: Chapter 1. Supervised learning, https://scikit-learn.org/stable/supervised_learning.html

<a name="myfootnote22">22</a>: Scikit-learn - User Guide: Chapter 2. Unsupervised learning, https://scikit-learn.org/stable/unsupervised_learning.html

<a name="myfootnote23">23</a>: Scikit-learn - User Guide: Chapter 3. Model selection and evaluation, https://scikit-learn.org/stable/model_selection.html

<a name="myfootnote24">24</a>: Scikit-learn - User Guide: Chapter 4. Inspection, https://scikit-learn.org/stable/inspection.html

<a name="myfootnote25">25</a>: Scikit-learn - User Guide: Chapter 5. Visualizations, https://scikit-learn.org/stable/visualizations.html

<a name="myfootnote26">26</a>: Scikit-learn - User Guide: Chapter 6. 

<a name="myfootnote27">27</a>: Dataquest - Scikit-learn Tutorial: Machine Learning in Python, https://www.dataquest.io/blog/sci-kit-learn-tutorial/

<a name="myfootnote28">28</a>: Scikit-learn - Who is using scikit-learn?, https://scikit-learn.org/stable/testimonials/testimonials.html

<a name="myfootnote29">29</a>: Scikit-learn - User Guide: Chapter 7. Dataset loading utilities, https://scikit-learn.org/stable/datasets.html

<a name="myfootnote30">30</a>: Scikit-learn - User Guide: Chapter 7.1 Toy Datasets, https://scikit-learn.org/stable/datasets/toy_dataset.html

<a name="myfootnote31">31</a>: Scikit-learn - User Guide: Chapter 7.2 Real World Datasets, https://scikit-learn.org/stable/datasets/real_world.html

<a name="myfootnote32">32</a>: UCI Machine Learning Repository - Iris Data Set, http://archive.ics.uci.edu/ml/datasets/Iris

<a name="myfootnote33">33</a>: R.A. Fisher - The Use of Multiple Measurements in Taxonomic Problems, http://www.comp.tmu.ac.jp/morbier/R/Fisher-1936-Ann._Eugen.pdf

## Bibliography
***

Within the course of this assessment the following sources were also used for research purposes:
* Baglom - 10 Scikit-Learn Case Studies, Examples and Tutorials, http://www.baglom.com/b/10-scikit-learn-case-studies-examples-tutorials-cm572/
* Bogo to Bogo - Scikit-Learn : Unsupervised Learning - Clustering, https://www.bogotobogo.com/python/scikit-learn/scikit_machine_learning_Unsupervised_Learning_Clustering.php
* Code Cademy - What is Scikit-Learn?, https://www.codecademy.com/articles/scikit-learn
* Daniel Johnson - Supervised vs Unsupervised Learning: Key Differences, https://www.guru99.com/supervised-vs-unsupervised-learning.html
* Daniel Johnson - Unsupervised Machine Learning: What is, Algorithms, Example, https://www.guru99.com/unsupervised-machine-learning.html
* Data Robot - Unsupervised Machine Learning, https://www.datarobot.com/wiki/unsupervised-machine-learning/
* Geeks for Geeks - Supervised and Unsupervised learning, https://www.geeksforgeeks.org/supervised-unsupervised-learning/
* IBM - Markdown for Jupyter notebooks cheatsheet, https://www.ibm.com/docs/en/watson-studio-local/1.2.3?topic=notebooks-markdown-jupyter-cheatsheet
* IBM - Supervised Learning, https://www.ibm.com/cloud/learn/supervised-learning
* OpenDataScience - The A – Z of Supervised Learning, Use Cases, and Disadvantages, https://opendatascience.com/the-a-z-of-supervised-learning-use-cases-and-disadvantages/
* Quora - What are the advantages and disadvantages of a supervised learning machine?, https://www.quora.com/What-are-the-advantages-and-disadvantages-of-a-supervised-learning-machine
* Sadrach Pierre - A Comprehensive Guide to Scikit-Learn (Sklearn), https://builtin.com/machine-learning/scikit-learn-guide
* Sanatan Mishra - Unsupervised Learning and Data Clustering, https://towardsdatascience.com/unsupervised-learning-and-data-clustering-eeecb78b422a
* Scikit-learn - User Guide: Frequently Asked Questions, https://scikit-learn.org/stable/faq.html
* Scikit-learn - Official Website, https://scikit-learn.org/stable/
* Snehit Vaddi - Most used Scikit-Learn Algorithms Part-1|Snehit Vaddi, https://medium.com/analytics-vidhya/most-used-scikit-learn-algorithms-part-1-snehit-vaddi-7ec0c98e4edd
* Snehit Vaddi - Most used Scikit-Learn Algorithms Part-2|Snehit Vaddi, 
* Techopedia - Scikit-Learn, https://www.techopedia.com/definition/33860/scikit-learn
* Tech Vidvan - Unsupervised Learning – Machine Learning Algorithms, https://techvidvan.com/tutorials/unsupervised-learning/