# Introduction to Scikit-Learn


**Lesson Goals**

In this lesson we focus on explaining the Scikit-learn Machine Learning Toolkit:

    You will learn how to install Scikit-learn and its dependencies.
    You will learn which Scikit-learn features support each stage of the Machine Learning workflow in your own projects.

**Introduction**

Scikit-learn is one of the most widely used machine learning libraries in Python. It is an open source library with contributions from hundreds of developers as well as support from major corporations like Google. Scikit-learn is built on top of NumPy and SciPy.

**Installation**

In this section, we will guide you through the process of installing Scikit-learn as a Python package. Scikit-learn depends on other packages, so first we have to check that those dependencies are installed.

**Dependencies**

Scikit-learn depends on NumPy and SciPy, so before installing the Scikit-learn package we will check that both are available. We can ask the Python interpreter to load NumPy, and it will reply with an error if NumPy is unavailable. Start up a Jupyter notebook and enter the following code in a cell:

In [1]:
import numpy as np

The Python interpreter will try to load the NumPy package. If the NumPy package is not installed, the Python interpreter will print a message similar to the following.

ImportError: No module named 'numpy'

In this case, we will install NumPy directly in Jupyter by adding an exclamation mark before the command, which tells the notebook to run it as a shell command.

**!pip install numpy**

Now we will check the installation of the SciPy package in a similar way. Enter the following in a new cell:

In [2]:
import scipy

If no error is reported, then the SciPy package is installed and accessible to the Python interpreter.

If the package is not installed, you can install it in the same way with the command:

**!pip install scipy**
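
The two checks above can also be done in a single cell without triggering an error. The following is a minimal sketch using the standard library's `importlib`; the helper name `is_installed` is our own and is not part of NumPy, SciPy or Scikit-learn.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of each dependency without raising an ImportError.
for name in ("numpy", "scipy"):
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Any package reported as missing can then be installed with the corresponding pip command.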


# Install Scikit-learn

So far you have checked that the required dependencies of the Scikit-learn package are in place, so now you are ready to proceed to install the Scikit-learn package.

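The quickest and easiest way is to install the scikit-learn package directly in Jupyter. As before, we add an exclamation mark before the command. Note that the package is published under the name `scikit-learn`, even though it is imported in code as `sklearn`.

**!pip install scikit-learn**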



The log should indicate that the installation was successful and note what version was installed.
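
As an additional check, the installed versions can be printed from a notebook cell. This is a small sketch that assumes all three packages were installed successfully; the version numbers you see will depend on your environment.

```python
import numpy
import scipy
import sklearn

# Collect the version string reported by each package.
versions = {
    "numpy": numpy.__version__,
    "scipy": scipy.__version__,
    "scikit-learn": sklearn.__version__,
}

for name, version in versions.items():
    print(f"{name}: {version}")
```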


# Main Functionalities

In previous lessons we have discussed the machine learning workflow. As an open source library, scikit-learn has benefited from many contributions that have turned the library into a great resource for all stages of the machine learning workflow.


**Load Dataset**

Scikit-learn comes bundled with several well-known public datasets to get you up to speed quickly, avoiding the hassle of finding and downloading datasets from the web. These bundled datasets can be loaded by name, without providing a path to the dataset file. Here is an example:

In [3]:
from sklearn import datasets
diabetesDataset = datasets.load_diabetes()
diabetesDataset

{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990842, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06832974, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286377, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04687948,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452837, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00421986,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

This dictionary-like data structure contains several components. The exact set depends on the Scikit-learn version; the five we will look at here are:

    The data, which is a numpy array with 442 rows and 10 columns



In [4]:
diabetesDataset.data

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])

    The target (the variable that we would like to predict), which is a one-dimensional array with 442 elements.

In [5]:
diabetesDataset.target

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28



    The description of the dataset

    The data filename (this contains the file path)

    The target filename (also contains a file path)
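
The components listed above can be inspected directly. The sketch below uses the variable name `diabetes` (our own choice) and only touches the `data`, `target` and `DESCR` attributes; the full list printed by `keys()` may differ between Scikit-learn versions.

```python
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Names of all components available in this version of the library.
print(list(diabetes.keys()))

# The predictors and the target have one row/element per patient.
print(diabetes.data.shape)    # (442, 10)
print(diabetes.target.shape)  # (442,)

# The description is a plain string; print only the first few hundred characters.
print(diabetes.DESCR[:300])
```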

While the library contains preloaded datasets, we can also use pandas DataFrames with scikit-learn. One major point is that when using a pandas DataFrame to perform a machine learning task with scikit-learn, we must separate the dataset into predictor variables and a response variable. The predictor variables are used to estimate the response variable (or target variable).
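
To illustrate the separation into predictors and response, here is a minimal sketch that builds a pandas DataFrame from the bundled diabetes data and then splits it. The column name `"target"` and the variable names `X` and `y` are our own conventions for this example.

```python
import pandas as pd
from sklearn import datasets

diabetes = datasets.load_diabetes()

# Build a single DataFrame holding both the predictors and the response.
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df["target"] = diabetes.target

# Separate the predictor variables (X) from the response variable (y).
X = df.drop(columns="target")
y = df["target"]

print(X.shape, y.shape)
```

In a real project the DataFrame would typically come from a file or a database, but the split into `X` and `y` follows the same pattern.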


# Preprocess Dataset

In previous lessons, we have mentioned a number of data preprocessing functions, including scaling and train/test splitting. Scikit-learn has many data preprocessing functions that are geared towards NumPy arrays.
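
As a rough illustration of these two steps, the sketch below splits the diabetes data and then standardizes it. The 80/20 split and the `random_state` value are arbitrary choices for the example. Note that `train_test_split` lives in `sklearn.model_selection`, while the scaler lives in `sklearn.preprocessing`.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_diabetes(return_X_y=True)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step.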


# Feature Selection

This is another stage of our Machine Learning workflow for which Scikit-Learn provides support. As mentioned previously in this course, there are multiple ways to select the best features for our model. Scikit-learn comes with a number of functions that help us perform this task.
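
One of the available approaches is univariate selection. The sketch below keeps the features with the highest F-statistic against the target using `SelectKBest`; the choice of `f_regression` as the scoring function and of `k=4` is an assumption made for illustration, not a recommendation.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, f_regression

X, y = datasets.load_diabetes(return_X_y=True)

# Score each feature against the target and keep the 4 highest-scoring ones.
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (442, 4)
print(selector.get_support())  # boolean mask marking the kept columns
```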


# Train a Model

Model training is the core functionality of Scikit-Learn. It provides a variety of Machine Learning algorithms, grouped according to whether a teaching signal (supervision) is available. In previous lessons, we introduced the interpretation of a training set as a set of solved problems. If a supervisor provided the solutions, then we can perform supervised Machine Learning. If no solutions are available, then we are restricted to unsupervised Machine Learning.
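
The difference between the two settings shows up in what is passed to `fit`. In the following sketch, `LinearRegression` and `KMeans` are picked only as familiar examples of each group, and the number of clusters is an arbitrary assumption.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X, y = datasets.load_diabetes(return_X_y=True)

# Supervised: the model is fitted on the predictors X and the known targets y.
regressor = LinearRegression()
regressor.fit(X, y)
predictions = regressor.predict(X[:5])

# Unsupervised: the model only sees X and looks for structure on its own.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = clusterer.fit_predict(X)

print(predictions.shape, cluster_labels.shape)
```

Both estimators follow the same `fit`/`predict` pattern, which is what makes it easy to swap algorithms in Scikit-learn.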


# Model Selection and Evaluation

Scikit-Learn's testing and tuning functionalities include those introduced in the previous lessons. Evaluation metrics are available as functions in the sklearn.metrics module, while tools for splitting data, cross-validation and parameter tuning are found in the sklearn.model_selection module.
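
As a closing illustration, the sketch below evaluates a linear regression model on a held-out test set and with 5-fold cross-validation. The model, the split size and the number of folds are assumptions chosen for the example.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = datasets.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set and predict on the unseen test set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Metrics from sklearn.metrics compare true and predicted targets.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Test MSE: {mse:.1f}, test R^2: {r2:.3f}")

# Cross-validation returns one score per fold (R^2 by default for regressors).
cv_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(cv_scores)
```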