# Machine Learning with scikit-learn

## Resources

This training material is available under a CC BY-NC-SA 4.0 license.  You can find it at:

> https://github.com/DavidMertz/ML-Webinar

Before attending this course, please configure the environments you will need.  Within the repository, find the file `requirements.txt` to install software using `pip`, or the file `environment.yml` to install software using `conda`.

Please contact me or my training company, [KDM Training](http://kdm.training), for hands-on, instructor-led training, onsite or remote.  Our email is info@kdm.training.

In [1]:
import sys
sys.version

'3.9.6 | packaged by conda-forge | (default, Jul 11 2021, 03:39:48) \n[GCC 9.3.0]'

In [2]:
import sklearn
sklearn.__version__

'0.24.2'

In [3]:
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    print("Intel accelerator not installed (not required)", file=sys.stderr)

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## What Is Machine Learning?


* Difference between "Deep Learning" and other ML techniques
* Overview of techniques used in Machine Learning
* Classification vs. Regression vs. Clustering
* Dimensionality Reduction
* Feature Engineering
* Feature Selection
* Categorical vs. Ordinal vs. Continuous variables
* One-hot encoding
* Hyperparameters
* Grid Search
* Metrics

<div><a href="WhatIsML.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>

## Exploring a Data Set

* Looking for anomalies and data integrity problems
* Cleaning data
* Massaging data format to be model-ready
* Choosing features and a target
* Train/test split

<div><a href="Exploring.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>

## Classification

* Choosing a model
* Feature importances
* Cut points in a decision tree
* Comparing multiple classifiers

<div><a href="Classification.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
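
As a minimal sketch of the topics above, the following compares two classifiers on a held-out split, then reads feature importances and cut points from the fitted tree. The bundled wine data set and the particular models are assumptions chosen for illustration; the notebook itself may use different data and estimators.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed stand-in data: the wine data set bundled with scikit-learn
data = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

# Fit several classifiers and compare accuracy on the held-out data
models = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in models.items()
}
print(scores)

# Feature importances of the fitted tree, largest first
tree = models["tree"]
ranked = sorted(
    zip(data.feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:3])

# Text rendering of the tree; each "<=" line is a cut point on one feature
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```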

## Regression

* Sample data sets in scikit-learn
* Linear regressors
* Probabilistic regressors
* Other regressors

<div><a href="Regression.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
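
A rough sketch of how the regressor families listed above might be compared. The diabetes sample data set, and the choice of `BayesianRidge` as the probabilistic regressor and `RandomForestRegressor` as the "other" regressor, are assumptions for illustration only.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.model_selection import train_test_split

# One of the sample data sets that ships with scikit-learn
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regressors = {
    "linear": LinearRegression(),
    "bayesian_ridge": BayesianRidge(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# .score() on a regressor returns R^2 on the data passed in
r2 = {
    name: reg.fit(X_train, y_train).score(X_test, y_test)
    for name, reg in regressors.items()
}
print(r2)

# A probabilistic regressor can also report uncertainty for each prediction
y_mean, y_std = regressors["bayesian_ridge"].predict(X_test, return_std=True)
print(y_mean[:3], y_std[:3])
```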

## Hyperparameters

* Understanding hyperparameters
* Manual search of parameter space
* GridSearchCV
* Attributes of grid search and wrapped model

<div><a href="Hyperparameters.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
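
The grid search workflow outlined above might look roughly like the following. The iris data, the k-nearest-neighbors model, and the parameter grid are assumptions picked to keep the sketch small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Every combination in the grid is tried: 4 x 2 = 8 candidates
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# Attributes of the grid search itself
print(search.best_params_)
print(search.best_score_)               # mean cross-validated accuracy
n_candidates = len(search.cv_results_["params"])

# The wrapped model, refit on all the data with the best parameters
best_model = search.best_estimator_
print(best_model.predict(X[:3]))
```

Because `refit=True` is the default, the fitted `search` object can also be used directly for prediction; it delegates to `best_estimator_`.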

## Clustering

* Overview of (some) clustering algorithms
* KMeans clustering
* Agglomerative clustering
* Density-based clustering: DBSCAN and HDBSCAN
* n_clusters, labels, and predictions
* Visualizing results

<div><a href="Clustering.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
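
To illustrate the difference between `n_clusters`, labels, and predictions, here is a small sketch on synthetic blobs. The data and parameter values are assumptions; HDBSCAN is left out because it is not part of the scikit-learn version shown at the top of this notebook.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three fairly compact groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# KMeans and agglomerative clustering are told how many clusters to find
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
agg = AgglomerativeClustering(n_clusters=3).fit(X)

# DBSCAN infers the number of clusters from density; label -1 marks noise
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
n_db_clusters = len(set(db.labels_) - {-1})
print(n_db_clusters)

# Every fitted clusterer exposes labels_ for the training data ...
print(km.labels_[:10], agg.labels_[:10], db.labels_[:10])

# ... but only some can assign clusters to new points with predict()
km_pred = km.predict(X[:5])
print(hasattr(agg, "predict"), hasattr(db, "predict"))
```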

## Decomposition

* Principal Component Analysis (PCA)
* Non-Negative Matrix Factorization (NMF)
* Latent Dirichlet Allocation (LDA)
* Independent Component Analysis (ICA)

<div><a href="Decomposition.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
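
A brief sketch of three of the decompositions listed above, all reducing the four iris features to two components. The data set and component count are assumptions; LDA is omitted here because it expects count data such as word frequencies.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import NMF, PCA, FastICA

X, _ = load_iris(return_X_y=True)

# PCA: orthogonal directions ordered by explained variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(pca.explained_variance_ratio_)

# NMF: requires non-negative input and yields non-negative factors
nmf = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X)

# ICA: looks for statistically independent source signals
ica = FastICA(n_components=2, random_state=0)
X_ica = ica.fit_transform(X)

print(X_pca.shape, W.shape, X_ica.shape)
```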

## Feature Expansion

* A Synthetic Example
* Polynomial Features
* One-Hot Encoding
* Binning Values

<div><a href="FeatureExpansion.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
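
The three expansion techniques named above can be sketched on tiny hand-made arrays. All values below are invented for illustration and do not come from the notebook.

```python
import numpy as np
from sklearn.preprocessing import (
    KBinsDiscretizer,
    OneHotEncoder,
    PolynomialFeatures,
)

# Polynomial features: columns become x0, x1, x0^2, x0*x1, x1^2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
print(X_poly)

# One-hot encoding: one column per category, in sorted order
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
encoder = OneHotEncoder()
onehot = encoder.fit_transform(colors).toarray()
print(encoder.categories_[0], onehot)

# Binning: three equal-width bins over the range 0..9
values = np.array([[0.0], [1.0], [2.0], [5.0], [8.0], [9.0]])
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(values)
print(X_binned.ravel())
```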

## Feature Selection

* Scaling with:
  * StandardScaler
  * RobustScaler
  * MinMaxScaler
  * Normalizer
* Univariate Selection
* Model-driven Selection

<div><a href="FeatureSelection.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
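
As a hedged illustration of the scalers and the two selection styles listed above, the sketch below applies each to the iris data. The data set, `f_classif` as the univariate score, and a random forest as the driving model are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

X, y = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)   # each column: mean 0, std 1
X_rob = RobustScaler().fit_transform(X)     # each column: median 0, scaled by IQR
X_mm = MinMaxScaler().fit_transform(X)      # each column: rescaled to [0, 1]
X_norm = Normalizer().fit_transform(X)      # each ROW: unit Euclidean length

# Univariate selection: keep the k features with the best individual score
X_best = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Model-driven selection: keep features the model rates above mean importance
forest = RandomForestClassifier(n_estimators=50, random_state=0)
X_model = SelectFromModel(forest).fit(X, y).transform(X)

print(X_best.shape, X_model.shape)
```

Note that `Normalizer` works per sample rather than per feature, which makes it different in kind from the other three scalers.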

## Pipelines

* Feature Selection and Engineering
* Grid search
* Model

<div><a href="Pipelines.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
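
The three pieces above can be combined into one object, sketched here under assumed choices of data set, steps, and parameter values. Step names such as `"select"` and `"model"` are arbitrary labels made up for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Feature engineering and selection steps, followed by the model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Parameters of a step are addressed as <step name>__<parameter name>
param_grid = {
    "select__k": [5, 10, 20],
    "model__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Searching over the whole pipeline means scaling and selection are refit inside each cross-validation fold, so no information from the validation fold leaks into those steps.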

## Robust Train/Test Splits

* cross_val_score
* ShuffleSplit
* KFold, RepeatedKFold, LeaveOneOut, LeavePOut, StratifiedKFold

<div><a href="TrainTest.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
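
A small sketch of how several of the splitters above plug into `cross_val_score`. The data set, model, and split counts are assumptions; `LeaveOneOut` and `LeavePOut` are passed the same way but create far more splits.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import (
    KFold,
    RepeatedKFold,
    ShuffleSplit,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Each splitter defines a different way to carve out train/test folds
splitters = {
    "kfold": KFold(n_splits=5, shuffle=True, random_state=0),
    "stratified": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "shuffle": ShuffleSplit(n_splits=5, test_size=0.25, random_state=0),
    "repeated": RepeatedKFold(n_splits=5, n_repeats=2, random_state=0),
}

# One score per split; the spread hints at how stable the estimate is
scores = {
    name: cross_val_score(model, X, y, cv=cv)
    for name, cv in splitters.items()
}
for name, vals in scores.items():
    print(name, len(vals), vals.mean(), vals.std())
```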

## Specialized and Custom Metrics

* Top N recommendations

<div><a href="CustomMetrics.ipynb"><img src="img/open-notebook.png" align="left"/></a></div>
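
One possible reading of a "top N" metric is sketched below: a prediction counts as correct when the true label is among the N highest-scoring classes. The function `top_n_accuracy` and the scores are hypothetical, written for illustration; the notebook may define its metric differently.

```python
import numpy as np


def top_n_accuracy(y_true, scores, classes, n=3):
    """Fraction of rows whose true label is among the n best-scoring classes.

    Hypothetical helper: `scores` has one row per sample and one column
    per entry of `classes`, e.g. the output of predict_proba().
    """
    top_idx = np.argsort(scores, axis=1)[:, -n:]      # n best columns per row
    top_labels = np.asarray(classes)[top_idx]
    hits = (top_labels == np.asarray(y_true)[:, None]).any(axis=1)
    return hits.mean()


# Made-up scores for four samples over four classes
classes = np.array(["a", "b", "c", "d"])
scores = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.50, 0.30, 0.15, 0.05],
    [0.05, 0.15, 0.20, 0.60],
    [0.40, 0.35, 0.15, 0.10],
])
y_true = np.array(["a", "b", "d", "c"])

print(top_n_accuracy(y_true, scores, classes, n=1))  # 0.25
print(top_n_accuracy(y_true, scores, classes, n=2))  # 0.5
print(top_n_accuracy(y_true, scores, classes, n=3))  # 0.75
```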