
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/1280px-Scikit_learn_logo_small.svg.png" width=400/>

1. Largest **classical machine learning** library in Python
    - Deep learning and reinforcement learning are out of scope
    
2. Started as a Google Summer of Code Project

3. Implements a wide range of **machine learning models** and **other elements** in the machine learning pipeline (e.g., feature selection, hyperparameter tuning, validation, metrics, etc.)

The following shows the supervised machine learning pipeline:

- Pandas and other libraries handle most data-related operations

- Scikit-learn handles the machine-learning-related operations

- MLOps libraries handle deployment and feedback
<br><br>
![](https://i.imgur.com/jJK3PpD.png)

#### Recap: Model Evaluation

Any classifier corresponds to a decision boundary after training. There are three cases for such a decision boundary:

![](https://media.springernature.com/lw685/springer-static/image/chp%3A10.1007%2F978-3-030-89010-0_4/MediaObjects/484261_1_En_4_Fig1_HTML.png)

We never rely on the training data to compare models or hyperparameters due to the risk of overfitting: the model has already seen that data, so its training score is an optimistic estimate of how it performs on new data. Models and hyperparameters are therefore compared on a separate validation set. After comparing the models and choosing the best one, the reported score or metric must be computed on a new test set as well; there is a small chance that the validation set favored a specific model or set of hyperparameters due to its inherent structure, so the test set acts as an extra safety check.
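
The three-way split described above can be sketched with `train_test_split`. The toy data, the 60/20/20 proportions and the variable names here are assumptions made for illustration only, not a recommended recipe:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 100 rows, 3 features, binary labels
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First hold out 20% of the rows as the final test set
x_rest, x_test, y_rest, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
# Then split the remaining 80 rows into training (60) and validation (20)
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=0.25, random_state=0
)

print(x_train.shape, x_val.shape, x_test.shape)  # (60, 3) (20, 3) (20, 3)
```

Models and hyperparameters would then be compared on `(x_val, y_val)`, and `(x_test, y_test)` would be touched only once, for the final reported score.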

## ✨ Scikit-Learn Package

![](https://i.imgur.com/o6i5Cbu.png)

## ⭐ Scikit-Learn Model Basics

#### Today we consider supervised, semi-supervised and unsupervised learning with Scikit-learn

All models (or estimators) inherit from `sklearn.base.BaseEstimator`<br /><br />
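
As a quick check of this claim, the sketch below tests a few estimators against `BaseEstimator`. The three classes were picked arbitrarily as one classifier, one clustering model and one preprocessing transformer:

```python
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# One classifier, one clustering model, one transformer (arbitrary picks)
estimators = [LogisticRegression(), KMeans(n_clusters=3), StandardScaler()]

for est in estimators:
    # Each of these is an instance of the common base class
    print(type(est).__name__, isinstance(est, BaseEstimator))
```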

### 1) Instantiate the Model using `Model(..)`
To instantiate an arbitrary model `CMPModel` which lives in `sklearn.CMP`, use:
```python
from sklearn.CMP import CMPModel

cmp_model = CMPModel(α=0.8, β=1.1)         
```
Here `α` and `β` are hyperparameters that define characteristics of the model (e.g., K in K-Nearest Neighbors or K-Means). A model can have any number of hyperparameters, and each one controls some aspect of how the model is trained or how it makes predictions.
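
As a concrete instance of the placeholder above, here is a minimal sketch using `KNeighborsClassifier`, where `n_neighbors` plays the role of K. The values chosen are arbitrary:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameters are passed as keyword arguments at construction time
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")

# get_params() returns the hyperparameters as a dictionary
params = knn.get_params()
print(params["n_neighbors"], params["weights"])  # 3 distance

# set_params() changes them on the existing object (before fitting)
knn.set_params(n_neighbors=7)
print(knn.get_params()["n_neighbors"])  # 7
```

Any hyperparameter that is not passed explicitly keeps its default value.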
<br /><br /><br />

### 2) Fit the Model on the Training Data using `model.fit(..)`

Given a training set with `m` rows and `n` features:

- `x_train` should generally be a numpy array of dimensions `(m,n)` (or scipy sparse matrix)

- `y_train` should be a numpy array of dimensions `(m,)` (or `(m,k)` in some cases, such as multi-target regression or classification)
    - For classification, y is usually expected to have an integer or string datatype

```python
cmp_model.fit(x_train, y_train)
```

*This trains the model and sets the unknown parameters that are needed to perform inference.* This step can take a long time depending on the dataset!
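
To make the shapes and the learned parameters concrete, here is a small sketch with `LogisticRegression` on synthetic data. The data and the labeling rule are invented for illustration. Learned parameters are stored in attributes whose names end with an underscore:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training set (illustrative only): m=200 rows, n=4 features
rng = np.random.default_rng(0)
m, n = 200, 4
x_train = rng.normal(size=(m, n))
y_train = (x_train[:, 0] + x_train[:, 1] > 0).astype(int)  # integer labels, shape (m,)

model = LogisticRegression()
returned = model.fit(x_train, y_train)  # fit() returns the estimator itself

# Parameters learned during fit
print(model.coef_.shape)       # (1, 4): one weight per feature
print(model.intercept_.shape)  # (1,)
print(model.classes_)          # [0 1]
```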
<br /><br />

### 3) Predict on New Data using `model.predict(..)` 
Given a validation or test set with `m'` rows and `n` features:

- `x_val` should generally be a numpy array of dimensions `(m',n)` (or scipy sparse matrix)

```python
y_pred = cmp_model.predict(x_val)
```
The result will be a numpy array of dimensions `(m',)`, or `(m',k)` in some cases, as mentioned above. <br>

Sometimes we are interested in the probability or score behind a certain classification; for this we instead use `predict_proba(..)` or `decision_function(..)`:
```python
y_pred_prob = cmp_model.predict_proba(x_val)
```
The return value has a shape like `(m',q)`, where `q` is the number of probabilities or scores that have to be computed to make a decision about a single point in `x_val` (typically one per class).
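
The following sketch compares the three prediction methods on a synthetic three-class problem. The dataset, the slicing into training and validation rows, and the choice of `LogisticRegression` are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic 3-class data (illustrative): 150 rows, 2 features
x, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
x_train, y_train = x[:120], y[:120]
x_val = x[120:]  # m' = 30 rows

clf = LogisticRegression(max_iter=1000).fit(x_train, y_train)

y_pred = clf.predict(x_val)             # hard labels, shape (30,)
y_pred_prob = clf.predict_proba(x_val)  # one probability per class, shape (30, 3)
y_scores = clf.decision_function(x_val) # one raw score per class, shape (30, 3)

print(y_pred.shape, y_pred_prob.shape, y_scores.shape)
# Each row of probabilities sums to 1
print(np.allclose(y_pred_prob.sum(axis=1), 1.0))
```

For this model, the predicted label of each row is the class with the highest probability, so `predict` can be seen as a thin layer on top of `predict_proba`.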
<br /><br />


### 4) Evaluate the Model using `model.score(..)` or `some_metric(..)`
Evaluation can be done by comparing `y_pred` with `y_val` under some defined metric:
```python
from sklearn.metrics import some_metric
metric_val = some_metric(y_val, y_pred)
```

Models also usually come with `model.score(x_val, y_val)`, which returns a default metric (e.g., accuracy for classifiers and R² for regressors):
```python
metric_val = cmp_model.score(x_val, y_val)
```
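
As a concrete illustration of both routes, the sketch below evaluates a K-Nearest Neighbors classifier on the Iris dataset bundled with scikit-learn. The dataset, split size and `n_neighbors` value are arbitrary choices for the example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

x, y = load_iris(return_X_y=True)
x_train, x_val, y_train, y_val = train_test_split(
    x, y, test_size=0.3, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=5).fit(x_train, y_train)
y_pred = knn.predict(x_val)

# Route 1: explicit metric functions comparing y_val with y_pred
acc = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred, average="macro")

# Route 2: the model's default metric (accuracy for classifiers)
default_score = knn.score(x_val, y_val)

print(np.isclose(acc, default_score))  # True
```

Metric functions take the true labels first and the predictions second, which matters for asymmetric metrics such as precision and recall.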
<br /><br />

### 5) Save and Load the Model using `dump(model,..)` and `load(..)`
Once the model is saved, it can be deployed (only `predict` will be used in the deployment environment):
```python
from joblib import dump, load
# save locally
dump(cmp_model, 'cmp_model.joblib') 
# load
cmp_model = load('cmp_model.joblib') 
```
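
A minimal round-trip sketch follows, assuming a small decision tree and a temporary directory so that no file is left behind. The file name `tree_model.joblib` is hypothetical:

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

x, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(x, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tree_model.joblib")  # hypothetical file name
    dump(tree, path)       # serialize the fitted model to disk
    restored = load(path)  # deserialize it back into memory

# The restored model makes the same predictions as the original
same = np.array_equal(tree.predict(x), restored.predict(x))
print(same)  # True
```

Because joblib relies on pickle, only load model files from sources you trust, and preferably with the same scikit-learn version that was used to save them.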
<br />

<div align="center">
<img src="https://miro.medium.com/v2/resize:fit:616/0*WCaRN6ctmND048Xn.jpg" width=500 />
</div>

### In summary,
```python
# assume you have split your dataset into training (x_train, y_train) and validation sets (x_val, y_val) all numpy arrays

from sklearn.CMP import CMPModel
from sklearn.metrics import some_metric
from joblib import dump, load

# 1. instantiate
cmp_model = CMPModel(α=0.8, β=1.1)       
# 2. train the model
cmp_model.fit(x_train, y_train)
# 3. predict
y_pred = cmp_model.predict(x_val)
# 4. score
metric_val = some_metric(y_val, y_pred)
# 5. save the trained model for deployment
dump(cmp_model, 'cmp_model.joblib')
```
<br>
Semi-supervised models are a niche special case we will consider later.
<br>

### For Unsupervised Models
```python
# assume you have a dataset (x_data, y_data) which are numpy arrays

from sklearn.CMP import CMPModel
from joblib import dump, load

# 1. instantiate
cmp_model = CMPModel(α=0.8, β=1.1)       
# 2. train the model
cmp_model.fit(x_data)                             # Notice y_data dropped!
# 3. transform
x_trans = cmp_model.transform(x_data)             # Applies the unsupervised transformation. There is (usually) no predict.
# 4. score
metric_val = cmp_model.score(x_data)             # Notice y_data dropped
```
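
To make the template concrete, here is a sketch with `KMeans` on synthetic data. The dataset and hyperparameter values are assumptions for illustration. For `KMeans`, `transform` maps each point to its distances from the cluster centers, and `score` returns the negative of the clustering objective, so higher is better:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (illustrative); the labels returned by make_blobs are ignored
x_data, _ = make_blobs(n_samples=90, centers=3, n_features=2, random_state=0)

# 1. instantiate
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
# 2. train the model (no y needed)
kmeans.fit(x_data)
# 3. transform: distance of every point to each of the 3 centers
x_trans = kmeans.transform(x_data)
# 4. score: negative sum of squared distances to the closest center
metric_val = kmeans.score(x_data)

print(x_trans.shape)  # (90, 3)
print(metric_val <= 0)  # True
```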
<br>

There is also `fit_transform(x_data)`, which does both together. It exists because it is sometimes more efficient to do `fit` and `transform` in the same function. There may also be an `inverse_transform()` if it makes sense for the transformation to be invertible.
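
These three methods can be compared with `StandardScaler`, chosen here only because its transformation is simple and exactly invertible. The data is synthetic:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative): 50 rows, 3 features
rng = np.random.default_rng(0)
x_data = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# fit followed by transform ...
x_two_steps = StandardScaler().fit(x_data).transform(x_data)

# ... gives the same result as fit_transform
scaler = StandardScaler()
x_one_step = scaler.fit_transform(x_data)
print(np.allclose(x_two_steps, x_one_step))  # True

# inverse_transform maps the scaled data back to the original values
x_back = scaler.inverse_transform(x_one_step)
print(np.allclose(x_back, x_data))  # True
```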

As mentioned earlier, `Scikit-learn` offers no reinforcement learning and no (practical) deep learning.