# Scikit-Learn

***

# Overview of scikit-learn
<br>
Scikit-learn (formerly scikits.learn and also known as sklearn) is an open source machine learning library for the Python programming language. It was initially developed by David Cournapeau in 2007. It features a large number of common algorithms, including classification, regression, clustering and dimensionality reduction algorithms. It is designed to interoperate with the Python numerical and scientific libraries NumPy, pandas, SciPy and Matplotlib. (wiki: https://en.wikipedia.org/wiki/Scikit-learn) <br>

In general, **Machine Learning** is about creating models from data. Learning approaches are usually divided into 3 groups, namely supervised learning, unsupervised learning and reinforcement learning. Scikit-learn deals with the first two. It provides tools for model fitting, model selection and evaluation.


## So what are supervised learning and unsupervised learning?

#### Supervised learning
The data set comes with one or more columns of attributes that serve as the target values we want to predict. For example, in the famous Iris dataset, "species" is the attribute we want the model to predict. Depending on the data type of the target attribute(s), supervised learning is divided into two categories of algorithms, namely classification and regression.

*  **Classification**: target attributes are in discrete form, with a limited number of categories. The aim is to label each example with the correct category, e.g., correctly label the species in the Iris dataset.
*  **Regression**: target attributes are continuous variables, e.g. prediction of a salmon's length as a function of its age and weight.

#### Unsupervised learning
The data set does not come with a column of attributes to be predicted. The aim is to identify patterns in the data set.
*   **Clustering**: aims to discover groups of similar examples within the data
*   **Density estimation**: aims to determine the distribution of data within the sample
*   **Dimensionality reduction**: projects data from a high-dimensional space down to 2 or 3 dimensions for the purpose of visualization, summarisation and feature selection. (scikit-learn website)
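
The difference between the two settings can be sketched on the Iris dataset mentioned above. This is a minimal illustration, not a recommended modelling choice: the particular estimators (`DecisionTreeClassifier`, `KMeans`, `PCA`) and their parameter values are assumptions picked for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data, iris.target  # X: 150 x 4 measurements, y: species labels

# Supervised: the model is given both the features and the target.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
predicted_species = clf.predict(X[:3])

# Unsupervised (clustering): only X is given, groups are discovered.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_

# Unsupervised (dimensionality reduction): project 4 features down to 2.
X_2d = PCA(n_components=2).fit_transform(X)

print(predicted_species.shape, len(set(cluster_ids)), X_2d.shape)
```

Note that `fit` receives `y` only in the supervised case; the clustering and projection steps never see the species labels.
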

## Scikit-learn divides its algorithms into 6 categories as follows:

1. Classification
2. Regression
3. Clustering
4. Dimensionality reduction
5. Model selection 
    - Cross validation to check accuracy of supervised models
    - Ensemble methods for combining the predictions of multiple supervised models
6. Preprocessing 
    - Feature extraction to extract features from data to define the attributes in image and text data
    - Feature selection to identify useful attributes to create supervised models

ref:https://www.tutorialspoint.com/scikit_learn/scikit_learn_introduction.htm
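
As a rough sketch of how the preprocessing and model selection categories combine with a classifier, the snippet below scales the features and then checks accuracy with 5-fold cross validation. The choice of `StandardScaler`, `LogisticRegression` and `cv=5` is an assumption made for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Preprocessing (scaling) and classification chained into one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection: 5-fold cross validation returns one accuracy per fold.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```
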

The steps in using sklearn are as follows:
1. arrange the data into features (X) and a target (y, the value to be predicted),
2. divide the data set into a training set and a testing set,
3. fit the model on the training set with `fit()`, which is the training step,
4. call `predict()` on the testing set to predict the target,
5. then evaluate how well the model predicts.
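
The steps above can be sketched as follows. The dataset (Iris), the 70/30 split, the decision tree model and the accuracy metric are all assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Arrange the data into features (X) and target (y).
X, y = load_iris(return_X_y=True)

# 2. Divide the data into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 3. Fit the model on the training set.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Predict the target for the testing set.
y_pred = model.predict(X_test)

# 5. Evaluate how well the predictions match the true labels.
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```
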

import numpy as np

# Decision Tree based algorithms

The decision tree is the focus here: it is easy to understand, and we probably use something like it every day to make decisions.
Three topics related to decision trees (DT) are picked, namely Gini impurity, entropy (information gain), and bootstrap aggregating, which is an ensemble method used to improve accuracy.

### How does a decision tree classifier work?
https://www.analyticsvidhya.com/blog/2016/04/tree-based-algorithms-complete-tutorial-scratch-in-python/

**Decision Tree** is a **Supervised Learning** algorithm that is mainly used for classification problems. It works for both categorical **(Classification tree)** and continuous **(Regression tree)** dependent variables. It splits the population into two or more homogeneous groups using techniques such as the Gini index and entropy, and in doing so it identifies the most significant variables. <br>

It can also be used in the data exploration stage. For example, when working on a problem with information spread across hundreds of variables, a decision tree helps to identify the most significant ones.<br>
[ref]
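
One way to see which variables a fitted tree considers most significant is the `feature_importances_` attribute of scikit-learn's tree estimators. The sketch below assumes the Iris dataset and a depth limit of 3, both picked only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Pair each feature name with its importance and sort, largest first.
ranked = sorted(
    zip(iris.feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalised to sum to 1, so each value can be read as a share of the total impurity reduction achieved by the tree.
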

#### How does a tree-based algorithm decide where to split?
The choice of splits heavily affects a tree’s accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, the purity of the node increases with respect to the target variable. A decision tree tries splits on all available variables and then selects the split that results in the most homogeneous sub-nodes.

The algorithm selection also depends on the type of the target variable. Let’s look at the four most commonly used algorithms in decision trees:
1. Gini
2. Chi-square
3. Information gain
4. Reduction in Variance
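
Three of these measures are simple enough to write out by hand. The functions below are a plain-Python sketch (the helper names are made up for this example, and chi-square is left out); Gini and entropy apply to class labels, while variance applies to a continuous target.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    weighted = sum(len(child) / len(parent) * entropy(child) for child in children)
    return entropy(parent) - weighted


def variance(values):
    """Population variance, used for regression trees."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


mixed = ["a", "a", "b", "b"]
print(gini(mixed), entropy(mixed))  # 0.5 1.0
print(information_gain(mixed, [["a", "a"], ["b", "b"]]))  # 1.0
```

A perfectly mixed two-class node has the highest impurity (Gini 0.5, entropy 1 bit), and a split that separates the classes completely recovers all of it as information gain.
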

### Gini Impurity (decision tree) (supervised learning)


Random forest is a flexible, easy-to-use machine learning algorithm that produces a good result most of the time, even without hyper-parameter tuning.

It is also one of the most used algorithms, because of its simplicity and versatility (it can be used for both classification and regression tasks).

Random forests are an ensemble learning method for classification, regression and other tasks. They construct many decision trees and output the mode of the individual trees' predictions (for classification) or their mean (for regression).

A decision tree classifier works by breaking a dataset down into smaller and smaller subsets based on different criteria. A different splitting criterion is used at each division, and the number of examples gets smaller with every division.

Once the tree has divided the data down to one example, that example is put into the class that corresponds to its leaf. When multiple decision tree classifiers are combined, the result is called a random forest classifier.

**Classification Tree**
- try a split on each variable, then check and compare the impurity among variables (Gini impurity, information gain, entropy)
- total Gini impurity of a split = weighted average of the Gini impurities of the leaves
- a bit more calculation is needed if the variable is numeric, because candidate thresholds have to be tried
- the variable with the lowest impurity goes at the top of the tree; the calculation of Gini impurity then starts again for the remaining variables at each node, until no more splits are made --> the node becomes a leaf and classifies the examples that reach it

**Confidence in classification and overfitting**
- Cross validation
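
The weighted-average rule for the Gini impurity of a split, and the choice of the variable with the lowest impurity, can be sketched in plain Python. The toy data (two yes/no variables, `sunny` and `windy`, and a `play`/`stay` label) is entirely hypothetical and exists only to make the arithmetic easy to follow.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one leaf."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(leaves):
    """Total Gini impurity of a split: weighted average over its leaves."""
    total = sum(len(leaf) for leaf in leaves)
    return sum(len(leaf) / total * gini(leaf) for leaf in leaves if leaf)


def split_gini(rows, feature):
    """Split the rows on a yes/no feature and score the split."""
    yes = [label for features, label in rows if features[feature]]
    no = [label for features, label in rows if not features[feature]]
    return weighted_gini([yes, no])


# Hypothetical toy data: (features, label)
rows = [
    ({"sunny": True, "windy": False}, "play"),
    ({"sunny": True, "windy": True}, "play"),
    ({"sunny": True, "windy": False}, "play"),
    ({"sunny": False, "windy": True}, "stay"),
    ({"sunny": False, "windy": False}, "stay"),
    ({"sunny": False, "windy": True}, "play"),
]

scores = {feature: split_gini(rows, feature) for feature in ("sunny", "windy")}
best = min(scores, key=scores.get)  # lowest impurity goes at the top
print(scores, best)
```

Here splitting on `sunny` leaves one pure leaf and one mixed leaf (weighted impurity 2/9), while splitting on `windy` leaves two mixed leaves (4/9), so `sunny` would be placed at the top of the tree.
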

### Bootstrap aggregating (bagging) algorithm

#### Random Forest
https://builtin.com/data-science/random-forest-algorithm

https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76

https://www.datacamp.com/community/tutorials/random-forests-classifier-python

https://stackabuse.com/random-forest-algorithm-with-python-and-scikit-learn/


A random forest is essentially an ensemble of decision trees. A single decision tree is easy to build and interpret, but a random forest usually predicts more accurately on unseen data.
It builds multiple decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result in terms of accuracy and stability.
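
A quick way to compare a single tree with a forest is to cross-validate both on the same data. The sketch below assumes scikit-learn's built-in breast cancer dataset, 100 trees and 5 folds; the exact scores depend on the data and the library version, so none are quoted here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Mean accuracy over 5 folds for each model.
tree_score = cross_val_score(single_tree, X, y, cv=5).mean()
forest_score = cross_val_score(forest, X, y, cv=5).mean()

print(f"single tree: {tree_score:.3f}")
print(f"random forest: {forest_score:.3f}")
```
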

### Pruning algorithm

#### Random forest
- motivation: a single DT is easy to create but tends to classify new data inaccurately
- build a bootstrapped dataset, consider a random subset of variables at each step, then repeat --> a wide variety of trees (the forest)
- prediction: run the data down all the trees in the forest and see which option receives the most votes
- bagging: bootstrapping the data + using the aggregate to make a decision
- out-of-bag dataset (acts as a testing dataset): the samples left out of a tree's bootstrapped dataset; checking how many of them are incorrectly predicted by the trees that never saw them --> out-of-bag error (an estimate of the accuracy of the forest)
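
The notes above can be turned into a small hand-rolled version of bagging. This is only a sketch under assumptions (Iris data, 25 trees, `max_features="sqrt"` for the random subset of variables); in practice `RandomForestClassifier` does all of this internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
n_samples, n_trees, n_classes = len(X), 25, 3

# votes[i, c] counts the trees that voted class c for sample i,
# counting only trees for which sample i was out of bag.
votes = np.zeros((n_samples, n_classes), dtype=int)

for t in range(n_trees):
    # Bootstrapped dataset: draw row indices with replacement.
    in_bag = rng.integers(0, n_samples, size=n_samples)
    out_of_bag = np.setdiff1d(np.arange(n_samples), in_bag)

    # A random subset of variables is considered at each split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=t)
    tree.fit(X[in_bag], y[in_bag])

    # Each tree votes only on the rows it never saw during training.
    predictions = tree.predict(X[out_of_bag])
    votes[out_of_bag, predictions] += 1

# Aggregate: majority vote per sample, then compare with the truth.
has_votes = votes.sum(axis=1) > 0
majority = votes.argmax(axis=1)
oob_error = np.mean(majority[has_votes] != y[has_votes])
print(f"out-of-bag error: {oob_error:.3f}")
```
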
<br>
Flow:
1. build a random forest
2. estimate the accuracy of the forest --> change the number of variables used per step
3. then repeat a bunch of times and keep the most accurate forest
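
This flow maps onto the `max_features` and `oob_score` parameters of `RandomForestClassifier`. The sketch below assumes the breast cancer dataset, 200 trees and a hand-picked list of candidate values; it uses the out-of-bag score as the accuracy estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

oob_scores = {}
for max_features in (2, 5, 10, 20):  # number of variables tried per split
    # 1. Build a random forest.
    forest = RandomForestClassifier(
        n_estimators=200,
        max_features=max_features,
        oob_score=True,
        random_state=0,
    )
    forest.fit(X, y)
    # 2. Estimate its accuracy from the out-of-bag samples.
    oob_scores[max_features] = forest.oob_score_

# 3. After repeating, keep the setting with the best estimate.
best_max_features = max(oob_scores, key=oob_scores.get)
print(oob_scores, best_max_features)
```
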

**Clustering**
- fill in missing values with an initial guess, then refine the guesses by running the data down the trees repeatedly until the guesses converge



***
# End