# Introduction to scikit-learn

- This is an introduction to the scikit-learn (`sklearn`) API
- We will take a quick walkthrough of the API and how it is organized

# Types of scikit-learn objects

## Transformers
- Transform a dataset (for example by scaling or encoding it)
- fit() learns the parameters of the transformation from the data
- transform() applies the learned transformation to a dataset
- fit_transform() fits and then transforms in a single call (this can be more efficient than calling the two separately)
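
The three methods above can be sketched with `StandardScaler` from `sklearn.preprocessing`. This is a minimal illustration; the toy arrays `X_train` and `X_new` are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: 3 samples, 2 features (values chosen only for illustration).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_new = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # learns per-feature mean and std
X_train_scaled = scaler.transform(X_train)   # applies the learned statistics
X_new_scaled = scaler.transform(X_new)       # reuses statistics from X_train

# fit_transform() gives the same result as fit() followed by transform().
X_train_scaled_2 = StandardScaler().fit_transform(X_train)
```

Note that `X_new` is scaled with the mean and standard deviation learned from `X_train`, which is why `fit()` and `transform()` are separate steps.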

## Estimators
- Estimate model parameters from the training data and the hyperparameters
- fit() method is used to learn the parameters
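
As a small sketch of the split between hyperparameters and learned parameters, the example below fits `LinearRegression` on noise-free toy data generated from $y = 2x + 1$ (the data is an assumption made for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (noise-free, for illustration only).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# Hyperparameters are passed to the constructor...
model = LinearRegression(fit_intercept=True)
# ...and the model parameters are learned by fit().
model.fit(X, y)

# Learned parameters are stored in attributes that end with an underscore.
slope = model.coef_[0]
intercept = model.intercept_
```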

## Predictors
- Make predictions on a dataset using a trained model
- predict() method is used. Takes an array of samples as an argument and returns one prediction per sample
- score() method is used to measure the quality of the predictions on a given dataset
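
A minimal sketch of `predict()` and `score()`, again on made-up data that follows $y = 2x + 1$. For regressors `score()` returns $R^2$; for classifiers it returns mean accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = 2.0 * X_train.ravel() + 1.0

model = LinearRegression().fit(X_train, y_train)

# predict() expects a 2D array of shape (n_samples, n_features).
X_test = np.array([[4.0], [5.0]])
y_pred = model.predict(X_test)

# score() compares predictions on X_test against the true targets y_test.
y_test = np.array([9.0, 11.0])
r2 = model.score(X_test, y_test)
```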

***

Transformers $\rightarrow$ estimators $\rightarrow$ predictors

data preprocessing $\rightarrow$ training $\rightarrow$ inference
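
One way to chain these stages is `sklearn.pipeline.Pipeline`, which is not covered in these notes and is used here only as an illustration. The step names `"scale"` and `"clf"` are arbitrary labels chosen for the sketch.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),                  # data preprocessing
    ("clf", LogisticRegression(max_iter=1000)),   # training
])
pipe.fit(X, y)                       # fits the scaler, then the classifier
predictions = pipe.predict(X[:5])    # inference on the first five samples
```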

# Data API - scikit-learn

- The data API provides functionalities for loading, generating and preprocessing the training and the test data

Module | Functionality
--- | ---
sklearn.datasets | Loading datasets - custom as well as popular reference datasets
sklearn.preprocessing | Scaling, centering, normalization and binarization methods
sklearn.impute | Filling missing values
sklearn.feature_selection | Implementing feature selection algorithms
sklearn.feature_extraction | Implementing feature extraction from raw data
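
As a short sketch of two of the modules in the table, the example below fills a missing value with `SimpleImputer` and then rescales the features with `MinMaxScaler`. The input array is invented for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy data with one missing value in the second column.
X = np.array([[1.0, np.nan], [3.0, 4.0], [5.0, 8.0]])

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Rescale every feature to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_filled)
```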

***

# Model API - scikit-learn

- Implements **supervised** and **unsupervised** algorithms

### Regression


```python
sklearn.linear_model
```
- Contains linear, ridge, lasso models

```python
sklearn.tree
```
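
A rough sketch of how regressors from these modules share the same interface. The synthetic dataset, the hyperparameter values, and the dictionary name `models` are assumptions made for illustration, and the scores are computed on the training data only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Every regressor exposes the same fit / score interface.
train_r2 = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
```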

### Classification

1.
```python 
sklearn.linear_model
```
2.
```python 
sklearn.svm
```
3.
```python 
sklearn.tree
```
4.
```python 
sklearn.neighbors
```
5.
```python 
sklearn.naive_bayes
```
6.
```python 
sklearn.multiclass
```
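
As an illustration, classifiers from several of these modules can be trained and scored in the same way. The choice of the iris dataset and of the hyperparameters below is an assumption for the sketch, and accuracy is measured on the training data only.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "svm": SVC(kernel="rbf"),
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "naive_bayes": GaussianNB(),
}

# For classifiers, score() returns the mean accuracy on the given data.
train_accuracy = {
    name: clf.fit(X, y).score(X, y) for name, clf in classifiers.items()
}
```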
***

- To implement multi-output classification and regression:
```python
sklearn.multioutput 
```
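
A minimal sketch of multi-output regression with `MultiOutputRegressor`, which wraps a single-output regressor and fits one copy per target column. The random data and the choice of `Ridge` as the base model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Two targets per sample, so Y has shape (50, 2).
Y = np.column_stack([X[:, 0] + X[:, 1], 2.0 * X[:, 2]])

# Fits one independent Ridge model per target column.
model = MultiOutputRegressor(Ridge(alpha=1.0)).fit(X, Y)
Y_pred = model.predict(X[:4])
```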

- To implement popular clustering algorithms:
```python
sklearn.cluster
```
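
For example, k-means clustering could be run as sketched below. The synthetic blobs and the choice of three clusters are assumptions made for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2D data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # unsupervised: no y is passed
centers = kmeans.cluster_centers_    # one row per cluster
```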


***
***
# Model Evaluation API

The following module provides different metrics for model evaluation:


**`sklearn.metrics`**


- Classification metrics
- Regression metrics
- Clustering metrics
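
One metric from each family, as a small sketch on hand-written labels and values (all inputs are made up for the example):

```python
from sklearn.metrics import accuracy_score, adjusted_rand_score, mean_squared_error

# Classification: fraction of correctly predicted labels (3 of 4 here).
acc = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])

# Regression: mean of the squared errors, here (0 + 0 + 4) / 3.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Clustering: agreement between two labelings, ignoring the label names.
ari = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
```

In each call the first argument is the true value and the second is the prediction.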
    
***
***

# Model Selection API
**`sklearn.model_selection`** implements various model selection strategies, such as:

- cross-validation
- hyperparameter tuning
- plotting learning curves
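
A minimal sketch of the first two strategies, using `cross_val_score` and `GridSearchCV`. The dataset, the classifier, and the candidate values for `n_neighbors` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validation: one accuracy score per fold.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)

# Hyperparameter tuning: evaluate each candidate with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7]}, cv=5)
search.fit(X, y)
best_k = search.best_params_["n_neighbors"]
```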
    

***
***

If you want the documentation of some API, put `?` before the method or API name (this works in IPython and Jupyter)
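
The `?` syntax is specific to IPython and Jupyter. In a plain Python script, a rough counterpart is the standard `inspect` module (or the built-in `help()`); the sketch below uses `StandardScaler` only as an example object.

```python
import inspect
from sklearn.preprocessing import StandardScaler

# Plain-Python counterpart of `?StandardScaler` in IPython/Jupyter.
signature = inspect.signature(StandardScaler)
docstring = inspect.getdoc(StandardScaler)

print(signature)                  # constructor parameters and defaults
print(docstring.splitlines()[0])  # first line of the docstring
```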


In [3]:
?list

Init signature: list(iterable=(), /)
Docstring:
Built-in mutable sequence.

If no argument is given, the constructor creates a new empty list.
The argument must be an iterable if specified.
Type:           type
Subclasses:     _HashedSeq, StackSummary, DeferredConfigList, _ymd, SList, _ImmutableLineList, FormattedText, NodeList, _ExplodedList, Stack, ...


***
***
# Data Loading
- The general dataset API has three main kinds of interfaces:
    - **Loaders**: Used to load toy datasets bundled with the library
    - **Fetchers**: Used to download and load datasets from the internet
    - **Generators**: Used to generate controlled synthetic datasets
    

- Both loaders and fetchers return a **Bunch** object, which is a dictionary-like object with two keys of interest:

Key | Value
--- | ---
data | Array of shape $(n,m)$, with $n$ samples and $m$ features
target | Array of shape $(n,)$
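
As a small sketch, the two keys can be read from the Bunch returned by `load_iris` either as dictionary entries or as attributes:

```python
from sklearn.datasets import load_iris

iris = load_iris()     # returns a Bunch
X = iris["data"]       # dictionary-style access
y = iris.target        # attribute-style access works too

print(X.shape, y.shape)
```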


- Generators (and the other two, when called with the argument *return_X_y=True*) return a tuple $(X, y)$ of NumPy arrays:
    - $X$ has shape $(n,m)$
    - $y$ has shape $(n,)$
    
Usually, the function names follow this convention:

```
load_*
fetch_*
make_*
```
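
A minimal sketch of a loader and a generator that both return an $(X, y)$ tuple. The specific functions and argument values are picked only for illustration; a `fetch_*` function is left out because it needs an internet connection.

```python
from sklearn.datasets import load_wine, make_classification

# Loader with return_X_y=True: a plain (X, y) tuple instead of a Bunch.
X_wine, y_wine = load_wine(return_X_y=True)

# Generator: returns (X, y); random_state makes the data reproducible.
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

print(X_wine.shape, y_wine.shape)
print(X_syn.shape, y_syn.shape)
```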


![Screenshot%202023-10-04%20at%201.47.16%20AM.png](attachment:Screenshot%202023-10-04%20at%201.47.16%20AM.png)




![Screenshot%202023-10-04%20at%201.48.25%20AM.png](attachment:Screenshot%202023-10-04%20at%201.48.25%20AM.png)

![Screenshot%202023-10-04%20at%201.49.05%20AM.png](attachment:Screenshot%202023-10-04%20at%201.49.05%20AM.png)