![Erudio logo](img/erudio-logo-small.png)
---
![Sklearn logo](img/scikit-learn-logo-small.png)

# Machine Learning with scikit-learn

## Machine Learning Libraries

There are [many software libraries available for machine
learning](https://github.com/josephmisiti/awesome-machine-learning).  Some of
them are listed below.

### For General Machine Learning

- **[scikit-learn](http://scikit-learn.org/)**: Free Software (BSD License).
The topic of this course.
- **[Spark MLlib](https://spark.apache.org/mllib/)**: Free Software (Apache-2.0 license).
Spark engine's built-in machine learning library, with interfaces to Java,
Scala, Python, and R. MLlib fits into Spark's APIs and interoperates with NumPy
in Python and R libraries. You can use any Hadoop data source (e.g. HDFS, HBase,
or local files), making it easy to plug into Hadoop workflows.
- **[mlpack](https://www.mlpack.org/)**: Free Software (3-clause BSD license;
Mozilla Public License v2.0; Boost Software License, version 1.0). 
A fast, flexible machine learning library, written in C++, that aims to provide
fast, extensible implementations of cutting-edge machine learning algorithms.
mlpack provides these algorithms as simple command-line programs, Python
bindings, and C++ classes which can then be integrated into larger-scale machine
learning solutions.
- **[Accord.NET Framework](http://accord-framework.net/)**: Free Software (LGPLv2.1).
The Accord.NET Framework is a .NET machine learning framework combined with
audio and image processing libraries completely written in C#. It is a complete
framework for building production-grade computer vision, computer audition,
signal processing and statistics applications even for commercial use.
- **[WEKA](https://www.cs.waikato.ac.nz/ml/weka/)**: Free Software (GPL).
Data Mining Software in Java. Weka is a collection of machine learning
algorithms for data mining tasks. It contains tools for data preparation,
classification, regression, clustering, association rules mining, and
visualization.
- **[Shogun](https://github.com/shogun-toolbox/shogun)**: Free Software (BSD 3-clause).
Shogun is among the oldest of machine learning libraries, but continues to be
well maintained and optimized. Shogun was created in 1999 and written in C++.
Via SWIG, Shogun can be used in Java, Python, C#, Ruby, R, Lua, Octave, and
Matlab. Shogun is designed for unified large-scale learning for a broad range of
feature types and learning settings, like classification, regression, or
explorative data analysis.
- **[XGBoost](https://xgboost.readthedocs.io)**: Free Software (Apache-2.0 license).
XGBoost is an optimized distributed gradient boosting library designed to be
highly efficient, flexible and portable. It implements machine learning
algorithms under the Gradient Boosting framework. XGBoost provides parallel
tree boosting (also known as GBDT or GBM). It is available for Python, R, Java,
Scala, and C++, and offers interfaces for integration with deep learning
frameworks.
- **[LightGBM](https://lightgbm.readthedocs.io)**: Free Software (MIT license).
LightGBM is a boosting framework that implements the gradient boosting decision
tree algorithm with histogram-based algorithms, which bucket continuous feature
values into discrete bins. LightGBM interfaces are available for Python, R,
Java, Scala, and C++. It runs on Linux, Windows, and macOS, and also in
distributed environments for faster model training.
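
The idea of bucketing continuous feature values into discrete bins, mentioned for LightGBM above, can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses equal-width bins, the helper names `make_bin_edges` and `to_bins` are hypothetical, and the feature values are made up. It is not how LightGBM itself chooses its bin boundaries.

```python
from bisect import bisect_right


def make_bin_edges(values, n_bins):
    """Return the n_bins - 1 interior edges of equal-width bins (an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def to_bins(values, edges):
    """Map each continuous value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


# Made-up feature column, for illustration only.
ages = [18.0, 22.5, 31.0, 40.0, 47.5, 58.0]

edges = make_bin_edges(ages, n_bins=4)
binned = to_bins(ages, edges)

print(edges)   # [28.0, 38.0, 48.0]
print(binned)  # [0, 0, 1, 2, 2, 3]
```

Once a feature is reduced to a handful of bin indices, a tree learner only needs to consider the bin boundaries as candidate split points rather than every distinct value, which is the general motivation for histogram-based training.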



### For Deep Learning

* **[TensorFlow](https://www.tensorflow.org/)**: Free Software (Apache 2.0 open
source license). TensorFlow is an open source software library for numerical
computation using data flow graphs, in which batches of data ("tensors") are
processed by a series of algorithms described by a graph. The movements of the
data through the system are called "flows". Graphs can be assembled with C++ or
Python and can be processed on CPUs or GPUs.
* **[PyTensor](https://pytensor.readthedocs.io)**: Free Software (BSD License).
PyTensor is a Python library that allows one to define, optimize/rewrite, and
evaluate mathematical expressions, especially ones involving multi-dimensional
arrays (e.g. numpy.ndarrays). Using PyTensor, it is possible to attain speeds
rivaling hand-crafted C implementations for problems involving large amounts of
data. PyTensor combines aspects of a computer algebra system (CAS) with aspects
of an optimizing compiler. It can also generate customized code for multiple
compiled languages and/or their Python-based interfaces, such as C, Numba, and
JAX. PyTensor is a fork of Aesara, which is a fork of Theano.
* **[Keras](https://keras.io/)**: Free Software (MIT License).
Keras is a high-level neural networks API, written in Python and capable of
running on top of JAX, TensorFlow, or PyTorch. It was developed with a focus on
enabling fast experimentation. Keras allows for easy and fast prototyping
(through user friendliness, modularity, and extensibility). It supports both
convolutional networks and recurrent networks, as well as combinations of the
two. It runs seamlessly on CPU and GPU.
* **[Caffe](http://caffe.berkeleyvision.org/)**: Free Software (BSD 2-Clause License). 
Caffe is a deep learning framework made with expression, speed, and modularity
in mind. It is developed by Berkeley AI Research (BAIR) and by community
contributors. Yangqing Jia created the project during his PhD at UC Berkeley.
Bindings for Python and MATLAB are part of the library. The project appears
inactive: at the time of writing, the last commit was about 4 years old.
* **[Chainer](https://chainer.org/)**: Free Software (MIT License).
Chainer supports CUDA computation and runs on multiple GPUs with little effort.
Chainer supports various network architectures including feed-forward nets,
convnets, recurrent nets and recursive nets. It also supports per-batch
architectures. Forward computation can include any Python control flow
statements without losing the ability to backpropagate. (Development efforts
have moved to PyTorch.)
* **[PyTorch](https://pytorch.org/)**: Free Software (custom BSD-ish license).
Optimized tensor library for deep learning on GPUs and CPUs. Tensors are similar
to numpy arrays and serve as PyTorch's primary data structures. It has a syntax
similar to numpy, making it more intuitive and easier to learn than other deep
learning frameworks. It supports dynamic computation graphs, meaning that the
graph is defined on the fly rather than predefined, making it easier to debug.
The Torch project has been absorbed by PyTorch since 2017.
* **[JAX](https://jax.readthedocs.io/)**: Free Software (Apache 2.0 open source
license). JAX is a Python library built on top of XLA, which is a
domain-specific compiler for linear algebra that can accelerate machine learning
computations. It provides composable transformations of Python + NumPy programs
and supports automatic differentiation for computing gradients. JAX functions
can be just-in-time compiled and parallelized across multiple devices such as
GPUs and TPUs. JAX supports techniques like vectorization, parallelization, and
kernel fusion to make computation faster. JAX is a tensor library, while
[Flax](https://flax.readthedocs.io) is a high-performance neural network library
and ecosystem for JAX.
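
Several entries above mention automatic differentiation and computation graphs that are built on the fly while ordinary Python control flow runs. The toy sketch below illustrates that idea in plain Python for scalars only. The `Value` class and the `model` function are hypothetical names invented for this example. It is not the implementation or API of PyTorch, Chainer, JAX, or any other library listed here.

```python
class Value:
    """A scalar that remembers which operation produced it."""

    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents    # Values this one was computed from
        self._grad_fns = grad_fns  # one function per parent: upstream grad -> parent grad

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (lambda g: g * other.data, lambda g: g * self.data))

    def backward(self):
        """Reverse-mode differentiation over the graph recorded so far."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, accumulating gradients.
        for node in reversed(order):
            for parent, grad_fn in zip(node._parents, node._grad_fns):
                parent.grad += grad_fn(node.grad)


def model(x, w, b, repeat):
    """Apply y = w * y + b `repeat` times; the loop decides the graph's shape."""
    y = x
    for _ in range(repeat):
        y = w * y + b
    return y


x, w, b = Value(3.0), Value(2.0), Value(1.0)
y = model(x, w, b, repeat=2)   # y = w*(w*x + b) + b
y.backward()

print(y.data)  # 15.0
print(x.grad)  # 4.0   (dy/dx = w**2)
print(w.grad)  # 13.0  (dy/dw = 2*w*x + b)
print(b.grad)  # 3.0   (dy/db = w + 1)
```

Because the graph is recorded as the Python code executes, changing `repeat` changes the graph without any separate graph-definition step, which is the property the define-by-run frameworks above emphasize.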


### Cloud Focused

* **[Amazon SageMaker](https://aws.amazon.com/sagemaker)**: Commercial. Amazon
SageMaker provides fully managed instances running Jupyter notebooks for
training data exploration and preprocessing. These notebooks are pre-loaded with
CUDA and cuDNN drivers for popular deep learning platforms, Anaconda packages,
and libraries for TensorFlow, Apache MXNet, Chainer, and PyTorch.
* **[Google Cloud Machine Learning](https://cloud.google.com/products/ai/)**:
Commercial. Google Cloud Machine Learning (ML) Engine is a managed service that
allows developers and data scientists to build and bring machine learning models
to production. Cloud ML Engine offers training and prediction services, which
can be used together or individually. Cloud ML provides access to Python
libraries TensorFlow, Keras, XGBoost, and scikit-learn.
* **[Azure Machine Learning](https://azure.microsoft.com/en-us/products/machine-learning)**: Commercial and proprietary.
Azure Machine Learning is a cloud service for accelerating and managing the
machine learning (ML) project lifecycle. ML professionals, data scientists, and
engineers can use it in their daily workflows to train and deploy models and
manage machine learning operations (MLOps). It can be used to build
models or use models built on an open source platform such as PyTorch,
TensorFlow or scikit-learn. It supersedes Azure ML Studio, which was scheduled
for retirement on August 31, 2024.

### Model and Data Versioning

* **[DVC Data Version Control](https://dvc.org/)**: Free Software (Apache 2.0
Open Source License). A tool for data management, ML pipeline automation, and
experiment management. It works as a Git extension to manage data and models.
* **[MLflow](https://mlflow.org/)**: Free Software (Apache 2.0 Open Source
License). Platform to streamline machine learning development, including
tracking experiments, packaging code into reproducible runs, and sharing and
deploying models. MLflow offers a set of lightweight APIs that can be used with
any existing machine learning application or library.

Back to intro...

<div><a href="SKLearn-01_WhatIsML.ipynb"><img src="img/open-notebook.png" align="right"/></a></div>


---

Materials licensed under [CC BY-NC-ND 4.0](https://creativecommons.org/licenses/by-nc-nd/4.0/) by the authors