<small><i>This notebook was put together by [Anderson Banihirwe](andersy005.github.io). Source and license info is on [GitHub](https://github.com/IMLAIR/Scikit-Learn-Primer).</i></small>

# A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library

![](http://3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com/wp-content/uploads/2014/04/scikit-learn.png)

## Table of Contents
- [I. Goals of this Tutorial](#Goals-of-this-Tutorial)
- [II. Where did it come from?](#Where-did-it-come-from?)
- [III. What is scikit-learn?](#What-is-scikit-learn?)
- [IV. What are the features?](#What-are-the-features?)
- [V. Preliminaries](#Preliminaries)
- [VI. Checking your installation](#Checking-your-installation)
- [VII. Useful Resources](#Useful-Resources)
  

## Goals of this Tutorial

- **Preliminaries: Specify the Python and library versions to be used**
- **Introduce a handy machine learning algorithms mind map**
- **Get an overview of the scikit-learn library and references where you can learn more**

**Preliminaries: Setup & introduction** 
* Making sure your computer is set up

**Handy machine learning algorithms mind map.**
[Jason Brownlee](http://machinelearningmastery.com/start-here/) created a handy mind map of 60+ algorithms organized by type.


![](https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=uxqkqzh8fsbg8aowf6xk)

## Where did it come from?
[back to top](#Table-of-Contents)

Scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007.

Later, Matthieu Brucher joined the project and started to use it as part of his thesis work. In 2010, INRIA got involved, and the first public release (v0.1 beta) was published in late January of that year.

The project now has more than 30 active contributors and has had paid sponsorship from INRIA, Google, Tinyclues and the Python Software Foundation.

## What is scikit-learn?
[back to top](#Table-of-Contents)

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
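
To make the "consistent interface" concrete, here is a minimal sketch in which two unrelated classifiers are trained and queried through the same `fit` and `predict` calls. The choice of dataset (`load_iris`) and of these two particular models is my own assumption for illustration; any other scikit-learn classifier should slot into the loop in the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# A small built-in dataset: 150 flowers, 4 measurements each, 3 classes.
X, y = load_iris(return_X_y=True)

# Two very different algorithms, chosen only for illustration.
models = (LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5))

for model in models:
    model.fit(X, y)                    # every estimator learns with fit()
    predictions = model.predict(X[:3])  # every predictor answers with predict()
    print(type(model).__name__, predictions)
```

Because the calls are identical, swapping one algorithm for another usually means changing a single line.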

It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial use.

The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:

- NumPy: Base n-dimensional array package
- SciPy: Fundamental library for scientific computing
- Matplotlib: Comprehensive 2D/3D plotting
- IPython: Enhanced interactive console
- SymPy: Symbolic mathematics
- Pandas: Data structures and analysis


Extensions or modules for SciPy are conventionally named [SciKits](http://scikits.appspot.com/scikits). This module provides learning algorithms, hence the name scikit-learn.

The vision for the library is a level of robustness and support required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance.

Although the interface is Python, C libraries are leveraged for performance, such as NumPy for array and matrix operations, [LAPACK](http://www.netlib.org/lapack/), [LibSVM](http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and the careful use of Cython.



## What are the features?
[back to top](#Table-of-Contents)

The library is focused on modeling data. It is not focused on loading, manipulating and summarizing data. For these features, refer to NumPy and Pandas.
![](http://3qeqpr26caki16dnhd19sv6by6v.wpengine.netdna-cdn.com/wp-content/uploads/2014/04/plot_mean_shift_1.png)
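
As a rough sketch of that division of labor, the snippet below prepares and summarizes some made-up data with NumPy and leaves only the modeling step to scikit-learn. The two point clouds, their parameters and the variable names are assumptions chosen for illustration, not data from this tutorial.

```python
import numpy as np
from sklearn.cluster import KMeans

# Data preparation with NumPy: two hypothetical, well-separated groups of 2-D points.
rng = np.random.RandomState(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

# Summarizing is also NumPy's job, not scikit-learn's.
print("column means:", X.mean(axis=0))

# Modeling with scikit-learn: group the unlabeled points into two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print("cluster sizes:", np.bincount(labels))
```

In a real project the preparation step would more likely read a file with pandas, but the hand-off is the same: scikit-learn receives a plain numeric array.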

Some popular groups of models provided by scikit-learn include:

- **Clustering:** for grouping unlabeled data, with algorithms such as KMeans.
- **Cross Validation:** for estimating the performance of supervised models on unseen data.
- **Datasets:** for test datasets and for generating datasets with specific properties for investigating model behavior.
- **Dimensionality Reduction:** for reducing the number of attributes in data for summarization, visualization and feature selection, with methods such as principal component analysis.
- **Ensemble methods:** for combining the predictions of multiple supervised models.
- **Feature extraction:** for defining attributes in image and text data.
- **Feature selection:** for identifying meaningful attributes from which to create supervised models.
- **Parameter Tuning:** for getting the most out of supervised models.
- **Manifold Learning:** for summarizing and depicting complex multi-dimensional data.
- **Supervised Models:** a vast array not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
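
Several of the groups above can be combined in a few lines. The following is a minimal sketch, not a recommended workflow: it generates a synthetic dataset, chains dimensionality reduction with an ensemble model, estimates accuracy by cross validation and tunes one parameter. The dataset sizes, step names (`reduce`, `forest`) and parameter grid are assumptions made up for this example, and the `sklearn.model_selection` module it imports requires scikit-learn 0.18 or later.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Datasets: generate a synthetic two-class problem with known properties.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Dimensionality reduction and an ensemble method chained into one estimator.
pipeline = Pipeline([
    ("reduce", PCA(n_components=4)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Cross validation: estimate accuracy on five held-out folds.
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Parameter tuning: search over the number of components kept by PCA.
search = GridSearchCV(pipeline, {"reduce__n_components": [2, 4, 6]}, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

The pipeline behaves like any other estimator, which is why it can be passed directly to `cross_val_score` and `GridSearchCV`.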

## Preliminaries
[back to top](#Table-of-Contents)

This tutorial requires the following packages:

- Python version 3.4+
- `numpy` version 1.8 or later: http://www.numpy.org/
- `scipy` version 0.15 or later: http://www.scipy.org/
- `matplotlib` version 1.3 or later: http://matplotlib.org/
- `scikit-learn` version 0.15 or later: http://scikit-learn.org
- `ipython`/`jupyter` version 3.0 or later, with notebook support: http://ipython.org
- `seaborn`: version 0.5 or later, used mainly for plot styling

The easiest way to get these is to use the [conda](http://store.continuum.io/) environment manager.
I suggest downloading and installing [miniconda](http://conda.pydata.org/miniconda.html).

The following command will install all required packages:
```
$ conda install numpy scipy matplotlib scikit-learn seaborn jupyter
```

Alternatively, you can download and install the (very large) Anaconda software distribution, found at https://store.continuum.io/.

## Checking your installation
[back to top](#Table-of-Contents)

You can run the following code to check the versions of the packages on your system:

(In the IPython notebook, press `shift` and `return` together to execute the contents of a cell.)

```python
import IPython
print('IPython:', IPython.__version__)

import numpy
print('numpy:', numpy.__version__)

import scipy
print('scipy:', scipy.__version__)

import matplotlib
print('matplotlib:', matplotlib.__version__)

import sklearn
print('scikit-learn:', sklearn.__version__)

import seaborn
print('seaborn', seaborn.__version__)
```

```
IPython: 6.0.0
numpy: 1.12.1
scipy: 0.19.0
matplotlib: 2.0.0
scikit-learn: 0.18.1
seaborn 0.7.1
```
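
If you would rather not compare the numbers by eye, the minimum versions listed under Preliminaries can be checked programmatically. This is a minimal sketch under two assumptions: it relies on `importlib.metadata`, which needs Python 3.8 or later (newer than the 3.4 minimum stated above), and `parse_version` and `check_versions` are hypothetical helpers written for this example that only compare the leading numeric parts of a version string.

```python
from importlib import metadata

# Minimum versions copied from the Preliminaries list.
MINIMUM_VERSIONS = {
    "numpy": "1.8",
    "scipy": "0.15",
    "matplotlib": "1.3",
    "scikit-learn": "0.15",
    "ipython": "3.0",
    "seaborn": "0.5",
}


def parse_version(text):
    """Turn '1.12.1' into (1, 12, 1); stop at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_versions(minimums):
    """Return {package: (installed_version_or_None, meets_minimum)}."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed at all.
            report[name] = (None, False)
            continue
        meets = parse_version(installed) >= parse_version(minimum)
        report[name] = (installed, meets)
    return report


for name, (installed, meets) in check_versions(MINIMUM_VERSIONS).items():
    status = "OK" if meets else "missing or too old"
    print(f"{name}: {installed} ({status})")
```

For anything beyond a quick sanity check, a dedicated version parser would handle pre-releases and other edge cases more carefully than this tuple comparison.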


## Useful Resources
[back to top](#Table-of-Contents)

### Documentation
I recommend starting out with the quick-start tutorial and flicking through the user guide and example gallery for algorithms that interest you.

Ultimately, scikit-learn is a library and the API reference will be the best documentation for getting things done.

- Quick Start Tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html
- User Guide http://scikit-learn.org/stable/user_guide.html
- API Reference http://scikit-learn.org/stable/modules/classes.html
- Example Gallery http://scikit-learn.org/stable/auto_examples/index.html


- **matplotlib:** http://matplotlib.org (see especially the gallery section)
- **IPython:** http://ipython.org (also check out http://nbviewer.ipython.org)