In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix, rand_score
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import LinearSVR, LinearSVC
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Introduce the notebook, link to binder and nbviewer.

# Introduce Scikit-Learn

## What is Machine Learning?

https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained

Machine Learning is a form of "artificial intelligence" whereby a computer program learns the relationships within large datasets that allow it to predict the outcomes specified by the programmer. This is done using algorithms built on statistical methods, many of which have been in development for many years but until recently were not widely viable due to a lack of computer processing power. As computers have become faster, data storage has become cheaper, and the acquisition of data has become more widespread, we're now at the point where these algorithms and models can be applied to real world data to gain new insights that a human would not have been able to unearth within a reasonable amount of time.

It's not only in academia that machine learning has gained in popularity and usefulness. We encounter the outputs of machine learning models every day as we engage with software and services over the internet and on our devices. Machine learning models allow Amazon to know that I'm probably at least vaguely interested in a newly published English translation of a Chinese sci-fi novel (and it is correct) (https://www.amazon.science/the-history-of-amazons-recommendation-algorithm). They select the Facebook (https://blog.hootsuite.com/facebook-algorithm/) or LinkedIn posts (https://www.postbeyond.com/blog/how-linkedin-algorithm-works/) that I'm most likely to engage with, and they even make our photos look a bit better (https://aidaily.co.uk/articles/how-machine-learning-is-changing-your-smartphone-camera-1).

Machine learning is therefore all around us, and every day new applications and algorithms are being developed and tested that in one way or another will alter the way we engage with our world, hopefully for the better.


## Machine Learning Terminology

Before continuing, there are some key phrases and terms that we must get acquainted with when discussing machine learning. As the machine learning field sits at the intersection of computer science, software engineering and statistics, several keywords can share the same meaning, which can be confusing. In the pursuit of clarity, throughout this notebook I will be consistent in my use of the following terms and their definitions.

### Feature Variables, Target Variables & Labels

The whole concept of machine learning is based on using the data available to us to predict something that we do not know and which holds high value for us. For this we use a selection of input variables, known as “features”, in order to predict an output variable, known as the “target”. In scenarios where we are looking to predict what category something should fall into, the target may also be known as the “label”. For example, when predicting whether a customer will churn or not, the classification of churn or no-churn would be known as the “label”. https://medium.com/technology-nineleaps/some-key-machine-learning-definitions-b524eb6cb48

A common example often used in machine learning education materials (e.g. Kaggle - https://www.kaggle.com/c/house-prices-advanced-regression-techniques) is to use a set of inputs (features) to predict house prices (the target). In this scenario we can use the information we do know (size of the house, how many bedrooms and bathrooms it has, does it have a front and/or rear garden, what neighborhood is it located in) to predict something that we do not know but is important information to us (the house price). As the house price cannot be known until after the house has sold, it is information that is not available to us at the time of prediction.
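
To make the split between features and target concrete, here is a minimal sketch using pandas. The rows and column names (`size_sqm`, `sale_price`, and so on) are made up purely for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Made-up example rows, purely for illustration (not real sales data).
houses = pd.DataFrame(
    {
        "size_sqm": [70, 95, 120, 150],
        "bedrooms": [2, 3, 3, 4],
        "bathrooms": [1, 1, 2, 3],
        "has_garden": [0, 1, 1, 1],
        "sale_price": [210_000, 265_000, 340_000, 455_000],
    }
)

# Features: everything we know about a house before it is sold.
X = houses.drop(columns="sale_price")

# Target: the value we want to predict.
y = houses["sale_price"]

print(X.shape, y.shape)
```

By convention the feature table is usually called `X` and the target `y`, which is the naming Scikit-Learn's own documentation uses.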

### Supervised Learning & Unsupervised Learning

https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning

Supervised learning algorithms require an initial training dataset in which the algorithm is fed the target variable as well as the feature variables. This means that the algorithm can explore the relationship between the target and features, develop a model, and then apply that model to new unseen data in future. As target variables are a necessity for supervised learning, we must use historical data for the initial training of the model, and in some cases we must manually “label” the data so that a target variable is present. High quality historical data with relevant features and accurate labels or targets is therefore incredibly important for training a supervised learning algorithm. This approach is generally used for regression and classification analysis, where a prediction can clearly be correct or not because the target variable will at some point be known (e.g. when the house is sold).
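
The train-then-predict workflow described above can be sketched as follows. This is only an illustration under assumptions: the data is synthetic (generated with `make_regression`), and the choice of a decision tree and its `max_depth` is arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for historical records with known targets.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Hold back a quarter of the rows to play the role of new, unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeRegressor(max_depth=4, random_state=0)
model.fit(X_train, y_train)          # training sees features AND targets
predictions = model.predict(X_test)  # prediction sees features only

# Because the held-back targets are known, the predictions can be scored.
mse = mean_squared_error(y_test, predictions)
print(f"Test MSE: {mse:.1f}")
```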

As the name suggests, unsupervised learning algorithms have one major difference from their supervised peers - we do not have the target or labels available for them to train on. In these scenarios there is no currently known right or wrong answer, and so the unsupervised algorithms are used to unearth new insights that may not already be available to the data scientist. For example, it’s common to use an unsupervised clustering algorithm to form initial customer segments (https://towardsdatascience.com/customer-segmentation-using-k-means-clustering-d33964f238c3) which can then be adapted for use in a supervised classification algorithm in future. Unsupervised algorithms still need high quality and relevant feature data, but the lack of labeled data can make the starting costs lower. The drawback is that model accuracy may be more ambiguous, or that the predictions have less real world usefulness.
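
As a small sketch of the unsupervised case, the snippet below clusters synthetic points without ever passing a target. The data comes from `make_blobs` and the choice of three clusters is an assumption made for the example, not something the algorithm discovers for us.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic "customers" described by two features; the generated group
# labels are deliberately thrown away to mimic having no target.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
segments = kmeans.fit_predict(X)  # note: no y is passed

# Each row now has a cluster number, but the numbers carry no meaning
# until we inspect the clusters ourselves.
print(sorted(set(segments.tolist())))
```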



## What is Scikit-Learn?

Scikit-Learn is one of the most popular and frequently used machine learning libraries available in Python. It boasts a plethora of different algorithms of various types, and works efficiently with other popular Python data science libraries such as SciPy and NumPy (https://scikit-learn.org/stable/faq.html#why-does-scikit-learn-not-directly-work-with-for-example-pandas-dataframe). The off-the-shelf nature of the algorithms included in Scikit-Learn means that it meets the needs of most data scientists looking to implement machine learning, and the ease of use of the library itself allows for a gentle introduction to machine learning for new and aspiring data scientists.

All of the algorithms included in the library are well vetted and must follow a consistent interface, thus allowing users to have confidence in the algorithms they are using and to move between different algorithms with minimal additional training (other than learning some theory and new parameters) (https://scikit-learn.org/stable/faq.html#what-are-the-inclusion-criteria-for-new-algorithms).
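
To illustrate how little changes when moving between algorithms, here is a sketch that trains three different classifiers with the same `fit` and `score` calls. The data is synthetic and the parameter values are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only to have something to fit.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "linear SVM": LinearSVC(max_iter=10_000, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same call for every model
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {scores[name]:.2f}")
```

Only the constructor line differs between the three models; everything after it is identical.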

Scikit-Learn also provides a multitude of tools for preprocessing data so that it works effectively with a given algorithm (https://scikit-learn.org/stable/data_transforms.html), for selecting the most useful feature variables for a given dataset, and for measuring the algorithm’s performance (https://scikit-learn.org/stable/model_selection.html).
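
A minimal sketch of these three kinds of tools working together is shown below: a scaler for preprocessing, a univariate feature selector, and cross-validation for measuring performance. The data is synthetic, and keeping `k=4` features and using 5 folds are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Chain preprocessing, feature selection and the model into one estimator,
# so each cross-validation fold is scaled and selected on its own training rows.
pipeline = make_pipeline(
    StandardScaler(),              # preprocessing: zero mean, unit variance
    SelectKBest(f_classif, k=4),   # feature selection: keep the 4 best-scoring features
    LinearSVC(max_iter=10_000, random_state=0),
)

cv_scores = cross_val_score(pipeline, X, y, cv=5)  # accuracy on each of 5 folds
print(f"Mean accuracy: {cv_scores.mean():.2f}")
```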

The algorithms available in Scikit-Learn that I cover in this notebook fall into three main categories:
* Classification
* Regression
* Clustering

As we will see in this notebook, it is possible for a family of algorithms to have applications across more than one of these categories. For example in this notebook I will be showing the Classification and Regression variations of Support Vector Machines (SVMs) and Random Forests.

Other resources
https://www.codecademy.com/articles/scikit-learn


## Introduce the datasets I'll use
* One for regression
* One for classification / Clustering.

REMEMBER - THIS IS NOT ABOUT THE DATA ITSELF, IT'S ABOUT THE PACKAGE AND THE ALGORITHMS!

# Random Forest

* Introduce decision tree and how it works.
* Discuss pros & cons
* Introduce explainability of decision trees (visualize).
* Run a tree on the dataset and visualize.
* Introduce ensemble methods,
* Introduce Random Forest
* Discuss classification and regressor variants.
* Introduce MSE for regression and Confusion Matrix for classification.
* Run a forest on each dataset, visualize and do mse/confusion matrix.
* Discuss how we might optimize the algorithm - nodes, leaves, number of trees, etc.
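
The steps above can be sketched as follows. This is a rough outline under assumptions: the datasets are synthetic stand-ins (the real ones are introduced elsewhere in the notebook), the feature names are hypothetical, and the parameter values are arbitrary starting points rather than tuned choices.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-ins for the classification and regression datasets.
X_clf, y_clf = make_classification(n_samples=400, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]  # hypothetical names

Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)

# A single shallow tree can be printed as a readable set of if/else rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xc_train, yc_train)
tree_rules = export_text(tree, feature_names=feature_names)
print(tree_rules)

# A forest combines many trees, each grown on a bootstrap sample of the rows.
forest_clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
forest_clf.fit(Xc_train, yc_train)
cm = confusion_matrix(yc_test, forest_clf.predict(Xc_test))
print(cm)  # rows are true classes, columns are predicted classes

# The regressor variant is scored with mean squared error instead.
forest_reg = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=0)
forest_reg.fit(Xr_train, yr_train)
mse = mean_squared_error(yr_test, forest_reg.predict(Xr_test))
print(f"Regression MSE: {mse:.1f}")
```

`n_estimators` (number of trees), `max_depth` and `min_samples_leaf` are the usual first parameters to vary when optimizing a forest.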

# Support Vector Machines

* Introduce SVM and how it works
* Discuss pros and cons
* Visualize how it works
* Run SVC and SVR on the dataset
* Do MSE and Confusion Matrix on results.
* Visualize results.
* Discuss how it could be optimized.
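
A sketch of running the linear SVM classifier and regressor is given below. As before, the data is synthetic, and wrapping each model in a pipeline with `StandardScaler` is my own choice here, made because SVMs are sensitive to feature scale.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, LinearSVR

# Synthetic stand-ins for the classification and regression datasets.
X_clf, y_clf = make_classification(n_samples=400, n_features=5, random_state=0)
X_reg, y_reg = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=0)

Xc_train, Xc_test, yc_train, yc_test = train_test_split(X_clf, y_clf, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X_reg, y_reg, random_state=0)

# Classification: find a separating hyperplane with a wide margin.
svc = make_pipeline(
    StandardScaler(), LinearSVC(C=1.0, max_iter=10_000, random_state=0)
)
svc.fit(Xc_train, yc_train)
cm = confusion_matrix(yc_test, svc.predict(Xc_test))
print(cm)

# Regression: fit a linear function, penalizing errors larger than epsilon.
svr = make_pipeline(
    StandardScaler(), LinearSVR(C=1.0, epsilon=0.0, max_iter=10_000, random_state=0)
)
svr.fit(Xr_train, yr_train)
mse = mean_squared_error(yr_test, svr.predict(Xr_test))
print(f"Regression MSE: {mse:.1f}")
```

The regularization strength `C` is the natural first parameter to vary when optimizing either model.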

# K Means

* Introduce the algorithm.
* Discuss difference between unsupervised (this) and supervised (the other 2).
* Two different initialization schemes - random and k-means++ (the `init` parameter of `KMeans`). Discuss the differences.
* Introduce Rand Score as a scoring mechanism.
* Run it on data set, visualize and do a rand score.
* Discuss how it might be optimized, changing K etc.
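
The steps above can be sketched as follows, again on synthetic data. Note that the Rand score needs reference labels to compare against, so here I keep the groups that `make_blobs` generated; with real unlabeled data that comparison would not be available.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import rand_score

# Synthetic data with four known groups, kept only for scoring.
X, true_groups = make_blobs(n_samples=300, centers=4, random_state=0)

# Compare the two initialization schemes offered by the `init` parameter.
rand_scores = {}
for init in ("random", "k-means++"):
    kmeans = KMeans(n_clusters=4, init=init, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    # Rand score: share of point pairs on which the two labelings agree.
    rand_scores[init] = rand_score(true_groups, labels)
    print(f"{init}: {rand_scores[init]:.3f}")

# Changing K: inertia (within-cluster sum of squares) for a range of values.
inertias = {}
for k in range(2, 7):
    inertias[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: inertia={inertias[k]:.1f}")
```

Plotting inertia against `k` and looking for the point where the curve flattens is one common, if informal, way to choose the number of clusters.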

# Resources

* Train Test Split - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html?highlight=train%20test%20split#sklearn.model_selection.train_test_split

* R2 score - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score

* MSE - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error

* Confusion Matrix - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix

* Rand Score (Clustering) - https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation

* KMeans Clustering - https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

* Decision Tree Classification - https://scikit-learn.org/stable/modules/tree.html#classification

* Decision Tree Regression - https://scikit-learn.org/stable/modules/tree.html#regression

* Random Forest Regression - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

* Random Forest classification - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier

* SVM Classification - https://scikit-learn.org/stable/modules/svm.html#classification

* LinearSVC - https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

* SVM Regression - https://scikit-learn.org/stable/modules/svm.html#regression

* LinearSVR - https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR