# Scikit Learn

![scikit-learn logo](images/scikit-learn-logo.png)

[scikit-learn.org](https://scikit-learn.org/stable/)

## 1.0 Introduction

Scikit-learn (sklearn) is a free, open-source machine learning library for Python and is currently one of the most popular machine learning libraries on GitHub. It provides a selection of efficient tools for machine learning & statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. [1](#References)

The library is built upon [SciPy (Scientific Python)](https://www.scipy.org/) & integrates well with many other Python libraries, such as Matplotlib, NumPy & Pandas.

### Machine Learning

Machine learning involves the creation of computational models or algorithms from data that try to derive rules or procedures to explain the data or to predict future data.

The more data the system is presented with, the more refined the model becomes. The quality of the learned model also depends on the quality of the data used to train it. [2](#References)

Three important categories of machine learning are:

1. **Supervised learning** - Models which are trained with known, labeled datasets. The trained algorithm produces an inferred function to make predictions about the output values for unknown input values.

2. **Unsupervised learning** - Models which use data that is neither classified nor labeled. The algorithm explores the data & draws inferences from it to describe hidden structures, patterns or natural groupings.

3. **Reinforcement learning & deep learning** - Reinforcement learning trains models through trial and error, using a reward system to find the best action to take. Note that reinforcement learning & deep learning are currently not within the scope of the scikit-learn library, as defining the architecture requires extensive knowledge and efficient computing requires GPUs. Refer to [TensorFlow](https://www.tensorflow.org/), [Keras](https://keras.io/) and [PyTorch](https://pytorch.org/) for deep learning frameworks.
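
The difference between the first two categories can be sketched with the iris data that is used later in this document. This is only an illustration: the choice of `LogisticRegression` and `KMeans`, and the parameter values passed to them, are assumptions made for the example and not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Supervised: the estimator is given both the features X and the known labels y.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
predicted_labels = clf.predict(X[:5])

# Unsupervised: the estimator is given only X and groups similar samples itself.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X)

print(predicted_labels)
print(cluster_ids[:5])
```

Note that the cluster ids produced by `KMeans` are arbitrary integers and do not have to line up with the label codes in `y`.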

### Scikit Learn Features
The library is focused on modeling data. It is not focused on loading, manipulating and summarizing data. The [NumPy](https://numpy.org/) & [Pandas](https://pandas.pydata.org/) Python libraries facilitate the structuring & manipulation of large quantities of data, as well as the performance of mathematical operations on it.

Some of the groups of models provided by Sklearn include:

1. **Supervised Learning algorithms** − Linear Regression, Logistic Regression, Support Vector Machine (SVM), Decision Tree, Random Forest, k-nearest neighbors (KNN).
2. **Unsupervised Learning algorithms** − K-Means, Factor Analysis, Principal Component Analysis (PCA).
3. **Clustering** − divides unlabeled data into groups or clusters such that the data points within each cluster have similar features.
4. **Cross Validation** − checks the accuracy of supervised models on unseen data.
5. **Dimensionality Reduction** − reduces the number of attributes in the data, which can then be used for summarisation, visualisation and feature selection.
6. **Ensemble methods** − combine the predictions of multiple supervised models.
7. **Feature extraction** − extracts features from data to define the attributes in image and text data.
8. **Feature selection** − identifies useful attributes to create supervised models.
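
As a rough sketch of how some of these groups look in code, the example below combines an ensemble method, cross validation and dimensionality reduction on the iris data. The estimator choices and parameter values (`n_estimators=100`, `cv=5`, `n_components=2`) are assumptions picked for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Ensemble method: a random forest combines the predictions of many decision trees.
model = RandomForestClassifier(n_estimators=100, random_state=0)

# Cross validation: score the model on 5 held-out folds of the data.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# Dimensionality reduction: project the 4 features down to 2 components.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)
```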

Scikit-learn also incorporates a few small standard datasets that do not require downloading any files from external websites. These datasets are useful for quickly illustrating the behavior of the various algorithms implemented in scikit-learn.

See the [Dataset loading utilities](https://scikit-learn.org/stable/datasets.html) & [Toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html#toy-datasets) pages on the scikit-learn website for a list of these datasets. 
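
For example, a few of the bundled toy datasets can be loaded and compared by shape. This is a minimal sketch; the three loaders used here are just a sample of those listed on the pages above.

```python
from sklearn.datasets import load_digits, load_iris, load_wine

# Each loader returns a Bunch with `data` (features) and `target` (labels).
shapes = {}
for loader in (load_iris, load_wine, load_digits):
    dataset = loader()
    shapes[loader.__name__] = (dataset.data.shape, dataset.target.shape)
    print(loader.__name__, dataset.data.shape, dataset.target.shape)
```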

## 2.0 Scikit Learn Datasets

**Dataset Loading:**

Scikit-learn contains a number of small standard datasets which can easily be loaded through the package.

There are also miscellaneous tools to load datasets in other formats or from other locations, such as loading sample images, downloading datasets from openml.org or loading external datasets using Pandas & NumPy. Please refer to the [scikit-learn - loading other datasets page](https://scikit-learn.org/stable/datasets/loading_other_datasets.html#loading-other-datasets).
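
A minimal sketch of the Pandas & NumPy route is shown below. To keep it self-contained, an in-memory string stands in for an external CSV file; the column names and the three rows are assumptions made up for the example, written in the style of the iris data.

```python
import io

import pandas as pd

# Stand-in for an external CSV file; the column names here are hypothetical.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

# Split the table into a NumPy features matrix and a target array.
X_ext = df.drop(columns="species").to_numpy()
y_ext = df["species"].to_numpy()

print(X_ext.shape, y_ext.shape)
```

With a real file, the `io.StringIO(...)` argument would be replaced by the path to that file.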

In [1]:
# import Python libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

In [2]:
# set plot style
plt.style.use("ggplot")

# Increase the size of the output plots
plt.rcParams["figure.figsize"] = (10,7)

In [6]:
# loading the iris dataset directly through scikit-learn
iris = load_iris()

# uncomment print() function to view full output of the iris dataset
#print(iris)

print(type(iris))

<class 'sklearn.utils.Bunch'>


**Reviewing the Dataset**

Viewing the output of the loaded iris dataset shows that it is returned as a scikit-learn ```Bunch``` object, which is similar to a Python dictionary in that it consists of keys & values.
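
The dictionary-like behaviour can be checked directly. The short sketch below shows that each key is reachable both dictionary-style and as an attribute; the key name `not_a_key` is a made-up placeholder.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch exposes each key both dictionary-style and as an attribute.
same_object = iris["data"] is iris.data
print(same_object)

# Dictionary helpers such as `in` and .get() also work.
has_target = "target" in iris
fallback = iris.get("not_a_key", "missing")
print(has_target, fallback)
```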

In [8]:
# Output the keys of the iris dataset
print(iris.keys(), "\n")

# Output the feature names & target names
print("feature names = ", iris.feature_names, "\n")
print("target names = ", iris.target_names)

# Output the description of the iris dataset
print(iris.DESCR) 

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename']) 

feature names =  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'] 

target names =  ['setosa' 'versicolor' 'virginica']
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (hi

- ```data``` holds the input variables (the attributes or feature data of the dataset) in a NumPy array.
- ```target``` holds the output variables (the target or label data), which are determined by the feature variables and are also contained in a NumPy array.
- ```target_names``` is the list of names of the output variables.
- ```feature_names``` is the list of names of the input variables.
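
When the full ```Bunch``` is not needed, `load_iris` can also return these pieces in other forms. The sketch below assumes a scikit-learn version recent enough to support the `as_frame` parameter, which the `frame` key in the output above suggests.

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# as_frame=True fills the `frame` key with a pandas DataFrame that holds
# the four feature columns plus a final `target` column.
iris_df = load_iris(as_frame=True).frame
print(iris_df.shape)
print(list(iris_df.columns))
```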

The **features matrix** is a two-dimensional array with shape ```[n_samples, n_features]```, where the samples (rows) refer to the individual objects described by the dataset and the features (columns) refer to the distinct observations or values that describe each sample. The features matrix is often stored in a variable named ```X```.

In the iris dataset there are 150 samples of iris flowers with 4 features measured for each sample: sepal length, sepal width, petal length & petal width. As a two-dimensional array, it therefore has a shape of 150 x 4.

The **target array** is typically one-dimensional, with length ```n_samples```, and can consist of either continuous numerical values or discrete classes/labels. It is often stored in a variable named ```y```. The target array is usually the quantity that is to be predicted using the scikit-learn machine learning algorithms.

In the iris dataset the species of each flower (stored in ```target```) is the target array. The preceding features data is used to construct a model which can predict the species of a flower sample, i.e. is the measured flower sample a Setosa, Versicolour or Virginica?
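
The target array stores each species as an integer code rather than a name. As a small sketch, the codes can be mapped back to species names by indexing `target_names` with the target array; the variable names `species`, `names` and `counts` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# iris.target holds the integer codes 0, 1 and 2; indexing target_names
# with it gives the species name for every sample.
species = iris.target_names[iris.target]
print(species[:3])

# Count how many samples belong to each species.
names, counts = np.unique(species, return_counts=True)
print(dict(zip(names.tolist(), counts.tolist())))
```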

In [13]:
# assign the features data to variable X.
X = iris.data
print(X.shape)

# assign the target data to variable y.
y = iris.target
print(y.shape)

(150, 4)
(150,)
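
With ```X``` and ```y``` in this form, a model can be trained to predict the species. The following is a minimal sketch only: the k-nearest neighbors classifier, the 25% test split and the other parameter values are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold back a quarter of the samples so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training samples, then measure accuracy on the held-out samples.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print(accuracy)
```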


![sklearn_dataset](images/Scikit_learn_dataset.png)


## References

[1] [www.tutorialspoint.com, Scikit Learn - Introduction](https://www.tutorialspoint.com/scikit_learn/scikit_learn_introduction.htm)

[2] [Artificial Intelligence and the Future of Work, Thomas W. Malone](https://workofthefuture.mit.edu/wp-content/uploads/2020/12/2020-Research-Brief-Malone-Rus-Laubacher2.pdf)

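[3] [jakevdp.github.io, Python Data Science Handbook - Introducing Scikit-Learn](https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html)

[4] [www.educative.io, Scikit-learn cheat sheet - classification & regression methods](https://www.educative.io/blog/scikit-learn-cheat-sheet-classification-regression-methods)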