# Introducing Scikit-Learn

<a href="http://scikit-learn.org">Scikit-Learn</a>
- a package that provides efficient implementations of a large number of common machine learning algorithms
- clean, uniform, and streamlined API
    - once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is straightforward
- useful and complete online documentation

## Data Representation in Scikit-Learn

### Data as table

- a two-dimensional grid of data
    - rows represent individual elements of the data set
    - columns represent quantities related to each of these elements

- Example: [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set)
    - analyzed by Ronald Fisher in 1936
    - download this dataset in the form of a Pandas ``DataFrame`` 

In [1]:
import seaborn as sns
import pandas as pd
train_size = 0.67



Download the __Iris__ dataset from the URL `https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data`, or load it from a local file if you already have it. The file has no header row, so use the list below as column names. Inspect the text file to see which character is used as the separator.

`'sepal length', 'sepal width', 'petal length', 'petal width', 'species'`

Name the dataframe `iris` and show its head.

As an alternative way of loading the data, use [this utility](https://scikit-learn.org/stable/datasets/index.html) included in scikit-learn.
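The alternative loader can be sketched as follows. This is a minimal example assuming a scikit-learn version that supports `as_frame=True` in `load_iris`; the names `bunch` and `iris_sk` are chosen here for illustration and are not part of the exercise.

```python
from sklearn.datasets import load_iris

# load_iris ships with scikit-learn, so no download is needed.
# as_frame=True asks for pandas objects instead of NumPy arrays.
bunch = load_iris(as_frame=True)

# bunch.frame holds the four measurements plus an integer 'target' column
iris_sk = bunch.frame
print(iris_sk.shape)
print(iris_sk.columns.tolist())

# The integer codes in 'target' map to these species names
print(bunch.target_names)
```

Note that the column names and the species encoding in this copy differ from those used for the CSV file (for example, the species is stored as an integer code rather than a string).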



In [2]:
iris_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
# The file has no header row, so the column names are passed explicitly
iris = pd.read_csv(iris_url, sep=',', header=None,
                   names=['sepal length', 'sepal width', 'petal length', 'petal width', 'species'])
iris.head(4) # show first 4 data rows

| | sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |


- each row refers to a single observed flower
    - the number of rows is the total number of flowers in the dataset.
    - *sample*: a single row
    - ``n_samples``: number of rows

- each column refers to a piece of information that describes each sample
    - *feature*: a single column
    - ``n_features``: the number of columns
        - each column has a data type: number (continuous), boolean, discrete (nominal or ordinal, represented with integers or strings)
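As a rough illustration of rows, columns, and data types, the sketch below rebuilds the four rows shown in the output above as a small `DataFrame` (the name `iris_head` is hypothetical) and inspects its shape and column types.

```python
import pandas as pd

# The four rows shown above, typed in by hand for a self-contained example
iris_head = pd.DataFrame({
    'sepal length': [5.1, 4.9, 4.7, 4.6],
    'sepal width': [3.5, 3.0, 3.2, 3.1],
    'petal length': [1.4, 1.4, 1.3, 1.5],
    'petal width': [0.2, 0.2, 0.2, 0.2],
    'species': ['Iris-setosa'] * 4,
})

# shape is (number of rows, number of columns)
n_rows, n_columns = iris_head.shape
print(n_rows, n_columns)

# One data type per column: floats for the measurements, strings for the species
print(iris_head.dtypes)
```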

#### Features matrix

The part of the data table containing the input attributes, that is, every column except the one we want to predict.

Usually referred to as ``X`` in the *scikit-learn* documentation.

Can be a:
- two-dimensional numpy array with shape ``[n_samples, n_features]``
- SciPy ``sparse matrix``
- Pandas ``DataFrame``

A NumPy array or sparse matrix holds a single data type for all of its elements, while a ``DataFrame`` can hold a different data type in each column.
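The three containers can be compared with a short sketch. The values are a few Iris measurements typed in by hand, and the variable names `X_array`, `X_sparse`, and `X_frame` are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Three samples, two features: shape is [n_samples, n_features]
X_array = np.array([[5.1, 3.5],
                    [4.9, 3.0],
                    [4.7, 3.2]])

# The same values stored as a SciPy sparse matrix (CSR format)
X_sparse = sparse.csr_matrix(X_array)

# The same values stored as a DataFrame with named columns
X_frame = pd.DataFrame(X_array, columns=['sepal length', 'sepal width'])

print(X_array.shape, X_sparse.shape, X_frame.shape)

# The array and the sparse matrix each have one dtype for all elements
print(X_array.dtype, X_sparse.dtype)
```

A sparse matrix only pays off when most entries are zero, which is not the case for these measurements; it appears here purely to show that the shape convention is the same.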


#### Target array

*label* or *target* array, by convention usually called ``y``
- usually one-dimensional, with length ``n_samples``
- generally contained in a NumPy array or Pandas ``Series``
- may have continuous numerical values, or discrete classes/labels
- usually it is the quantity we want to *predict from the data*
    - in statistical terms, it is the dependent variable

In the example, we may wish to construct a model that can predict the species of a flower from the other measurements.

The measurements of the flower components form the features matrix.

The ``species`` column can be considered the target array.
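One way to separate the two pieces is sketched below, using a small hand-typed `DataFrame` (hypothetically named `iris_head`) that repeats the four rows displayed earlier. The same two lines would apply to the full `iris` dataframe.

```python
import pandas as pd

# Four rows of the Iris data, typed in by hand for a self-contained example
iris_head = pd.DataFrame({
    'sepal length': [5.1, 4.9, 4.7, 4.6],
    'sepal width': [3.5, 3.0, 3.2, 3.1],
    'petal length': [1.4, 1.4, 1.3, 1.5],
    'petal width': [0.2, 0.2, 0.2, 0.2],
    'species': ['Iris-setosa'] * 4,
})

# Features matrix: every column except the one we want to predict
X = iris_head.drop(columns='species')

# Target array: the column we want to predict, as a one-dimensional Series
y = iris_head['species']

print(X.shape)  # (n_samples, n_features)
print(y.shape)  # (n_samples,)
```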