![GMIT_Logo.png](attachment:GMIT_Logo.png)

<br/>
 **Machine Learning and Statistics 2021**
 
 
 **Author:**  Richard Deegan 
 
 
 **Lecturer:** Ian McLoughlin
 
 
 **Student ID:** G00387896@gmit.ie

# Assessment Outline 

Create a Scikit-Learn Jupyter Notebook. Include a Jupyter notebook called scikit-learn.ipynb that contains the following.

10% A clear and concise overview of the scikit-learn Python library.

20% Demonstrations of three interesting scikit-learn algorithms. You may choose
these yourself, based on what is covered in class or otherwise. Note that the
demonstrations are at your discretion – you may choose to have an overall spread of examples across the library or pick a particular part that you find interesting.

10% Appropriate plots and other visualisations to enhance your notebook for viewers.

# Preliminaries 

To address the problem statement, several libraries must be imported. NumPy is imported because it provides numerical arrays and the numpy.random module, and Pandas is imported for working with tabular data. The matplotlib.pyplot and Seaborn libraries are used to turn numbers into user-friendly plots. The scikit-learn estimators, metrics and preprocessing tools used in this notebook are imported alongside them.

In [5]:
# Import the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split

The %matplotlib inline magic command is used so that plots are rendered inline within the Jupyter Notebook [1].

In [6]:
# Magic command used to visualise plots in Jupyter
%matplotlib inline

In [7]:
# control Seaborn aesthetics
# sns.set() resets the style and palette to their defaults, so set the
# figure size first and then apply the style and palette
sns.set(rc={'figure.figsize':(11.7,8.27)})
# use the darkgrid plot style for contrast
sns.set_style("darkgrid")
# set the default colour palette
sns.set_palette("colorblind")

In [8]:
# Set the seed for Numpy Random
np.random.seed(1212)

# Overview of Scikit-Learn

Scikit-learn is a machine learning package for the Python programming language. It is an efficient tool for predictive data analysis and can help the user develop a deeper understanding of data sets through machine learning algorithms.  Scikit-learn is built on NumPy, SciPy and matplotlib.  The software is open source and commercially usable, which makes it suitable for business use as well as research.  Scikit-learn has six main areas of application:

1. Classification
2. Regression
3. Clustering
4. Dimensionality Reduction
5. Model Selection
6. Preprocessing

<br/>


## 1. Classification

Classification assigns an object to a category based on a set of variables.  For example, given the X variables (features) of an object, a classifier predicts the y variable (label) of that object.  The quality of a classifier can then be measured with scoring metrics such as accuracy and precision.

Applications: Spam detection and image recognition.
Algorithms: Support Vector Machines, Stochastic Gradient Descent, Nearest Neighbors, Gaussian Processes, Decision Trees, and Neural Networks.
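
The features-to-label idea described above can be sketched as follows. This is a minimal illustration rather than one of the notebook's demonstrations: the bundled iris data set, the 25% test split and the RBF kernel are all assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Features (X) and labels (y) from the bundled iris data set (assumed for illustration)
X, y = load_iris(return_X_y=True)

# Hold back 25% of the samples so the classifier is scored on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a Support Vector Machine classifier and predict the held-out labels
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Score the predictions: overall accuracy and precision averaged over the classes
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="macro")
print(f"accuracy={accuracy:.3f}, macro precision={precision:.3f}")
```

Any other classifier from the list above could be swapped in for SVC, since scikit-learn estimators share the same fit and predict methods.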

<br/>

**Classifier Comparison**


![ClassificationExample.png](attachment:ClassificationExample.png)


<br/>




## 2. Regression

Regression predicts a continuous-valued attribute associated with an object.  It models the relationship between a dependent variable and one or more independent variables.  Essentially, regression analysis is a mathematical way of working out which of those variables have an impact on the outcome.

Applications: Drug response, Stock prices
Algorithms: Kernel ridge regression, Support Vector Machines, Stochastic Gradient Descent, Nearest Neighbors, Gaussian Processes, cross decomposition, multiclass, isotonic regression, Decision Trees, and Neural Networks.
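
As a small sketch of predicting a continuous value, the example below fits ordinary least squares (LinearRegression, chosen here for simplicity and not one of the algorithms listed above) to synthetic data. The true slope of 3 and intercept of 2 are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=100)

# scikit-learn expects a 2D feature matrix, so reshape x into one column
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

# The fitted slope and intercept should be close to the values used above
slope = model.coef_[0]
intercept = model.intercept_

# Predict the continuous value for a new, unseen input
prediction = model.predict(np.array([[4.0]]))[0]
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={prediction:.2f}")
```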

<br/>

**Regression using Gaussian Process Regression**

![RegressionExample.png](attachment:RegressionExample.png)

<br/>


## 3. Clustering
Clustering groups similar data points together into clusters.  Each clustering algorithm comes in two variants: a class, which implements the “fit” method to learn the clusters on training data, and a function, which, given training data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the “labels_” attribute.

Applications: Customer segmentation, Grouping experiment outcomes
Algorithms: K-Means, Affinity propagation, Mean-shift, Spectral clustering, Ward hierarchical clustering, agglomerative clustering, DBSCAN, OPTICS, Gaussian mixtures and BIRCH.
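
The two variants described above can be sketched with K-Means. The two well-separated synthetic groups are an assumption for illustration. Note that the k_means function returns the centroids and the inertia together with the label array.

```python
import numpy as np
from sklearn.cluster import KMeans, k_means

# Two well-separated synthetic groups of points (assumed for illustration)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(20, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(20, 2))
X = np.vstack([group_a, group_b])

# Class variant: fit() learns the clusters, labels_ holds the training labels
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
class_labels = model.labels_

# Function variant: returns the centroids, the integer labels and the inertia
centroids, function_labels, inertia = k_means(
    X, n_clusters=2, n_init=10, random_state=0
)

print(class_labels)
print(function_labels)
```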

<br/>

**Overview of Clustering Algorithms**

![ClusteringExample.png](attachment:ClusteringExample.png)


<br/>

## 4. Dimensionality Reduction
Dimensionality reduction reduces the overall number of features. This lowers the computational demands of training a model and also helps combat overfitting by keeping the features fed to the model fairly simple. Two of the primary algorithms used for unsupervised dimensionality reduction are Principal Component Analysis (PCA) and Singular Value Decomposition (SVD).

Applications: Visualization, Increased efficiency
Algorithms: Incremental PCA, Kernel Principal Component Analysis, Truncated singular value decomposition and latent semantic analysis, Dictionary Learning,
Factor Analysis, Independent component analysis, Non-negative matrix factorization and Latent Dirichlet Allocation.
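
A minimal sketch of reducing the number of features with PCA is shown below. The iris data set and the choice of two components are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# The iris data set has four features per sample (assumed for illustration)
X, _ = load_iris(return_X_y=True)

# Project the four original features onto the two leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance that the two components retain
explained = pca.explained_variance_ratio_.sum()
print(X.shape, "->", X_reduced.shape, f"variance retained: {explained:.3f}")
```

The two-column result can be passed straight to a scatter plot, which is why visualisation is one of the common applications.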

<br/>

**Dimension Reduction- Principal Component Analysis (PCA)**
![DimensionReductionExample.png](attachment:DimensionReductionExample.png)


<br/>

## 5. Model Selection

Scikit-learn provides model selection tools that allow the user to choose the best-fitting model for the data.  This leads to increased accuracy and a better choice of model. The parameters of a model can also be fine-tuned. Comparing, validating and choosing parameters and models are all possible.

Applications: Improved accuracy via parameter tuning
Algorithms: Grid search, cross validation and metrics
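
Parameter tuning with grid search and cross validation might look like the sketch below. The iris data set, the SVC estimator and the small parameter grid are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Candidate parameter values to compare (an illustrative grid, not a recommendation)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Try every combination and score each one with 5-fold cross validation
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# The best combination found and its mean cross-validated accuracy
best_params = search.best_params_
best_score = search.best_score_
print(best_params, f"{best_score:.3f}")
```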

<br/>

**Model Selection-Comparing Two Models**
![ModelSelectionExample.png](attachment:ModelSelectionExample.png)


<br/>




## 6. Preprocessing

Scikit-learn allows the user to preprocess data so that it can be handled more easily by machine learning algorithms.  For instance, text labels can be encoded as numerical values, which allows the algorithms to work with them.  Another popular preprocessing capability is scaling, which transforms features onto a comparable range (for example zero mean and unit variance) so that features with large values do not dominate the model.
    

Applications: Transforming input data such as text for use with machine learning algorithms.
Algorithms: preprocessing, feature extraction
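
The encoding and scaling steps described above can be sketched as follows. The label strings and the small feature array are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical text labels; LabelEncoder maps each class to an integer,
# with the classes sorted alphabetically (bad -> 0, good -> 1)
quality = ["bad", "good", "good", "bad", "good"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(quality)

# Two hypothetical features on very different scales
features = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

# StandardScaler rescales each column to zero mean and unit variance
scaler = StandardScaler()
scaled = scaler.fit_transform(features)

print(encoded)
print(scaled)
```

After scaling, both columns hold the same values even though the raw columns differed by a factor of 100.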


<br/>

**Preprocessing- Using Scaling**
![PreProcessingExample.png](attachment:PreProcessingExample.png)

<br/>

# Demonstration of three scikit-learn algorithms

## First Algorithm

## Second Algorithm

## Third Algorithm