# scikit-learn
***
## 1) Introduction

Scikit-learn is a machine learning library for the Python programming language. It integrates with other Python packages, including NumPy, SciPy, matplotlib and pandas [3].

Scikit-learn was initially developed as a Google Summer of Code project in 2007 by David Cournapeau, then further developed by Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel from the French Institute for Research in Computer Science and Automation. The first public release was in January 2010 [1]. Many industries, from engineering to banking to social media, depend heavily on machine learning in their day-to-day operations, and the scikit-learn Python package is one of the most common tools used to apply it.

The scikit-learn library includes many built-in machine learning algorithms that cover a wide range of applications. The following section of this notebook gives an overview of what machine learning is and its different elements. Later in the notebook there are worked examples in which datasets are used to train several scikit-learn algorithms to make predictions.

## 2) Machine Learning
Within computer science there's a subfield known as Artificial Intelligence (AI) and within AI there's a subfield known as Machine Learning. In traditional computer science the programmer/developer writes the program line by line and provides the data for the program to run. In machine learning the programmer/developer gives the system the data and the output and trains the system to produce the program [7].
<br>
<br>
<div>
<img src="TraditionalML.PNG" width="300"/>
</div>

<center><b> Figure 1: Traditional Computer Science vs Machine Learning </b></center>
<br>
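The contrast in Figure 1 can be sketched in a few lines. This is a minimal illustration only: the rule `2 * x + 1` and the names `hand_written_rule`, `X` and `y` are assumptions invented for the example, not part of any dataset used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Traditional programming: the developer writes the rule by hand.
def hand_written_rule(x):
    return 2 * x + 1

# Machine learning: only the inputs and the matching outputs are supplied.
X = np.arange(10).reshape(-1, 1)   # inputs, one feature per row
y = 2 * X.ravel() + 1              # outputs generated by the (hidden) rule

# Training recovers the rule from the data.
model = LinearRegression().fit(X, y)
learned_slope = model.coef_[0]
learned_intercept = model.intercept_

print("learned slope and intercept:", learned_slope, learned_intercept)
print("hand-written:", hand_written_rule(12), "learned:", model.predict([[12]])[0])
```

In the first case the program is the input; in the second case the program (here a slope and an intercept) is the output of training.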

In machine learning, models/algorithms are trained using data, just as a human learns from past experience. The machine uses these models/algorithms to find patterns and make predictions, and those predictions are made automatically without human interaction.

<br>
<div>
<img src="MachineLearning.JPG" width="500"/>
</div>

<center><b> Figure 2: Human learning and traditional computer programming </b></center>
<br>


Machine learning has become increasingly popular in recent times with the advancements in computer science and technology. However, it's important to note that this ever-evolving area of computer science is not a new phenomenon, as it has been studied for decades. In 1959, Arthur Samuel, a pioneer in the field of machine learning (ML), defined it as the “field of study that gives computers the ability to learn without being explicitly programmed” [5]. In machine learning, algorithms are used to analyse data and, in doing so, look for patterns. The data types that are analysed include, but are not limited to:

* Numbers 
* Words 
* Images 
* Clicks 

Through data analysis and statistics the machine learns, and based on past conditions it can make predictions about the future [2]. Machine learning is so interesting and powerful because anything that can be digitally stored can be fed into a machine learning algorithm. Another interesting aspect of machine learning is how common it is. It's used on systems such as Netflix and YouTube to recommend what to watch, on search engines such as Google, and on social media platforms such as Instagram and Facebook for advertising [4].

In all of these systems the machine collects data about the film genres that interest you, the music you listen to, the products you buy, the links you click and the posts you like or dislike. Based on this past behaviour, the machine can predict with some confidence which movie you'd like to watch next, which song you'd like to listen to next, and where you will and won't spend your money [4].

### 2.1 Supervised and Unsupervised Learning
Machine learning can be subdivided into two areas: supervised learning and unsupervised learning. The key difference between the two is that the data used in supervised learning is labelled, whereas the data used in unsupervised learning is unlabelled.

#### 2.1.1 Supervised learning
Labelled data is data that includes the answer that the machine learning model has to predict [6]. Figure 3 shows labelled data being used to train the machine: the training data was labelled as "cats", and based on this the algorithm was able to predict correctly when presented with four animals, two of which were cats. Supervised and unsupervised learning are discussed in more detail in the following sections.

<div>
<img src="Supervised_machine_learning.png" width="700"/>
</div>

<center><b> Figure 3: Labelled data for supervised learning </b></center>
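
The idea in Figure 3 can be sketched with a tiny labelled dataset. The features (body weight and ear length), their values and the choice of `KNeighborsClassifier` are all assumptions made for illustration; the figure itself does not specify them.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical features per animal: [body weight in kg, ear length in cm]
X_labelled = [[4.0, 6.5], [3.5, 6.0], [5.0, 7.0],
              [30.0, 12.0], [25.0, 11.0], [35.0, 13.0]]
# The labels are the "answers" supplied with the training data.
y_labelled = ["cat", "cat", "cat", "dog", "dog", "dog"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_labelled, y_labelled)

# Two animals the model has never seen before.
new_animals = [[4.2, 6.4], [28.0, 11.5]]
predicted = clf.predict(new_animals)
print(predicted)
```

Because every training row carries a label, the predictions can later be compared against known answers, which is what makes accuracy measurable in supervised learning.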



#### 2.1.2 Unsupervised learning
Unsupervised learning means that the data is unlabelled, so there are no known answers against which to measure the accuracy of the model. In unsupervised learning the model is given the unlabelled dataset and tries to learn some structure or pattern from the data. Clustering algorithms are commonly used in unsupervised learning. Take, for example, a dataset of people's heights and weights for a given age, where the data isn't labelled male or female. An unsupervised clustering algorithm can group the data, potentially producing two distinct clusters, from which it could be inferred that the groups are male and female.

<div>
<img src="Unsupervised.PNG" width="600"/>
</div>

<center><b> Figure 4: Unsupervised and supervised learning </b></center>


What's interesting in Figure 4 is that the same dataset is used to demonstrate both supervised and unsupervised learning. In the supervised case, with labelled data, the groups can be split into male and female, for example. In the unsupervised case the data is simply separated into two clusters with no labels.
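
The height and weight clustering example can be sketched as follows. The data here is synthetic and the group means and spreads are invented for illustration; `KMeans` is one possible clustering algorithm, chosen as an assumption since the text does not name a specific one.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic [height in cm, weight in kg] measurements for two unlabelled groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[162.0, 60.0], scale=[5.0, 5.0], size=(50, 2))
group_b = rng.normal(loc=[180.0, 85.0], scale=[5.0, 5.0], size=(50, 2))
X = np.vstack([group_a, group_b])  # no labels are passed to the model

# Ask for two clusters; the algorithm only sees the measurements.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

print("cluster centres:\n", kmeans.cluster_centers_)
print("points per cluster:", np.bincount(cluster_ids))
```

The cluster ids (0 and 1) carry no meaning by themselves. Any interpretation, such as "male" and "female", is an inference made afterwards by a person looking at the clusters.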

## 3) Machine Learning Algorithms in scikit-learn
From the Machine Learning and Statistics module assessment specification - "Demonstrations of three interesting scikit-learn algorithms. You may choose these yourself, based on what is covered in class or otherwise. Note that the demonstrations are at your discretion – you may choose to have an overall spread of examples across the library or pick a particular part that you find interesting."

The approach for this part of the notebook is to pick three interesting scikit-learn algorithms. A high-level overview of each of the algorithms is provided in this section. In the following sections there are worked examples of each of the three algorithms. The algorithms selected for this notebook are:

* LinearRegression()
* RandomForestClassifier()
* KNeighborsClassifier()

#### 3.1) Regression and Classification
This notebook looks at supervised learning only. Two types of supervised machine learning algorithm are presented in this notebook:

* Regression
* Classification

Regression predicts continuous output values, while classification predicts discrete outputs. For instance, predicting someone's weight based on their height is a regression problem, whereas predicting someone's gender is a classification problem.
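
The difference in output type can be sketched with the height and weight example. The measurements and labels below are invented for illustration only, and `KNeighborsClassifier` is used simply as a convenient classifier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical measurements, invented for this sketch.
heights_cm = np.array([[150], [160], [165], [170], [180], [190]])
weights_kg = np.array([50.0, 58.0, 63.0, 68.0, 80.0, 92.0])
genders = np.array(["female", "female", "female", "male", "male", "male"])

# Regression: the target is a continuous number.
reg = LinearRegression().fit(heights_cm, weights_kg)
weight_pred = reg.predict([[175]])[0]

# Classification: the target is one of a fixed set of labels.
clf = KNeighborsClassifier(n_neighbors=3).fit(heights_cm, genders)
gender_pred = clf.predict([[175]])[0]

print("regression output:", weight_pred)
print("classification output:", gender_pred)
```

The regressor can return any value on a continuous scale, while the classifier can only return one of the labels it saw during training.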


#### 3.2) LinearRegression()
Within regression, linear regression is the most basic machine learning algorithm. The model fits a linear relationship between the inputs and the output during training and uses it to predict the output for new inputs. In this notebook LinearRegression() is used in two cases:

* predict diabetes disease progression (a continuous measure, rather than whether someone has diabetes).
* predict house prices in Boston.

The diabetes dataset is a built-in dataset and the Boston house prices dataset was sourced online (reference included below).
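
A minimal sketch of the first case is shown below, using the built-in diabetes dataset. The split ratio and `random_state` are assumptions chosen for the sketch and may differ from the worked example later in the notebook.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Features are ten baseline measurements; the target is a continuous
# measure of disease progression.
X, y = load_diabetes(return_X_y=True)

# Hold back 20% of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error:", round(mse, 2))
print("R2 score:", round(r2, 3))
```

An R2 score close to 1 would indicate that the linear model explains most of the variation in the target; a score near 0 indicates it explains very little.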

#### 3.3) RandomForestClassifier()
Random forests can be used for regression or for classification; RandomForestClassifier() is the classification version. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree [13]. The RandomForestClassifier() is used in this notebook along with the iris dataset. A model is built that can predict the iris class from a set of inputs it has never seen before.
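
A minimal sketch of this approach on the iris dataset is shown below. The values chosen for `n_estimators`, `max_samples`, `test_size` and `random_state` are assumptions for illustration, not the settings used in the worked example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

# 100 trees, each trained on a bootstrap sample of 80% of the training rows
# (max_samples only applies because bootstrap=True is the default).
forest = RandomForestClassifier(n_estimators=100, max_samples=0.8, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print("Test accuracy:", round(accuracy, 3))

# Predict the class of a single flower: sepal length, sepal width,
# petal length, petal width (all in cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]
predicted_class = iris.target_names[forest.predict(new_flower)[0]]
print("Predicted class:", predicted_class)
```

Each tree votes on the class and the forest combines the votes, which is why a forest is usually less prone to over-fitting than a single decision tree.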

#### 3.4) KNeighborsClassifier()
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other [14].
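
The "similar things are near to each other" idea can be sketched by looking at the neighbours directly. The query measurement and `n_neighbors=5` below are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

# A hypothetical flower: sepal length, sepal width, petal length, petal width (cm).
query = [[6.0, 2.9, 4.5, 1.5]]

# Find the five closest training samples and look at their labels.
distances, indices = knn.kneighbors(query)
neighbour_labels = iris.target[indices[0]]

# The prediction is the most common label among those neighbours.
prediction = knn.predict(query)[0]

print("Neighbour classes:", [iris.target_names[i] for i in neighbour_labels])
print("Predicted class:", iris.target_names[prediction])
```

With the default uniform weights, each neighbour gets one vote, so the predicted class is the majority class among the k closest training samples.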

In [None]:
import matplotlib.pyplot as plt
from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split # import train_test_split from scikit-learn library
from sklearn import linear_model # importing linear_model from scikit-learn
from sklearn.metrics import mean_squared_error, r2_score # importing some statistics packages from scikit-learn
import seaborn as sns

## 7) Classification of the Digits DataSet

In [None]:
digits = datasets.load_digits() # load the digits dataset

In [None]:
print(digits.DESCR) # dataset description

In [None]:
print(digits.data)
print('')
print(digits.target)

In [None]:
digits.images.shape

In [None]:
_, axes = plt.subplots(nrows=1, ncols=10, figsize=(16,4))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title('Training: %i' % label)

In [None]:
import pandas as pd

n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

pd.DataFrame(data)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, digits.target, test_size=0.5, shuffle=False)

In [None]:
# Support Vector Classifier
from sklearn.svm import SVC
svm = SVC(gamma=0.001, random_state=42)
svm.fit(X_train, y_train)

In [None]:
# Random Forest
from sklearn.ensemble import RandomForestClassifier  # not imported earlier in the notebook
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

In [None]:
y_test_pred_svm = svm.predict(X_test)
y_test_pred_rf = rf.predict(X_test)

### Visualisation - SVM

In [None]:
_, axes = plt.subplots(nrows=1, ncols=10, figsize=(16,4))
for ax, image, actual, prediction in zip(axes, X_test, y_test, y_test_pred_svm):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f'Prediction: {prediction}\n Actual: {actual}')

### Visualisation - RF

In [None]:
_, axes = plt.subplots(nrows=1, ncols=10, figsize=(16,4))
for ax, image, actual, prediction in zip(axes, X_test, y_test, y_test_pred_rf):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title(f'Prediction: {prediction}\n Actual: {actual}')

In [None]:
from sklearn.metrics import accuracy_score

accuracy_svm = accuracy_score(y_test, y_test_pred_svm)
accuracy_rf = accuracy_score(y_test, y_test_pred_rf)

print('Accuracy (SVM): ', round(accuracy_svm, 3))
print('Accuracy (RF): ', round(accuracy_rf, 3))

### SVM Confusion matrix

In [None]:
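from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (available from 1.0) replaces it
disp = ConfusionMatrixDisplay.from_estimator(svm, X_test, y_test)
disp.figure_.suptitle('Confusion Matrix')
#print(f'Confusion matrix:\n{disp.confusion_matrix}')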

### RF Confusion matrix

In [None]:
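from sklearn.metrics import ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead
disp = ConfusionMatrixDisplay.from_estimator(rf, X_test, y_test, cmap='cool')
disp.figure_.suptitle('Confusion Matrix')
#print(f'Confusion matrix:\n{disp.confusion_matrix}')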

# References
[4] https://www.technologyreview.com/2018/11/17/103781/what-is-machine-learning-we-drew-you-another-flowchart/

[5] https://www.udemy.com/course/scikit-learn-in-python-for-machine-learning-engineers/

[6] https://www.cloudfactory.com/data-labeling-guide#:~:text=What%20is%20labeled%20data%3F,machine%20learning%20model%20to%20predict.

[7] https://medium.com/analytics-vidhya/most-used-scikit-learn-algorithms-part-1-snehit-vaddi-7ec0c98e4edd

[8] https://www.youtube.com/watch?v=R15LjD8aCzc

[9] https://www.youtube.com/watch?v=XmSlFPDjKdc

[10] https://www.youtube.com/watch?v=i-k7G6wN96o

[11] https://www.youtube.com/watch?v=ngLyX54e1LU

[12] https://www.youtube.com/watch?v=1i0zu9jHN6U

[13] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

[14] https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761

# End