## Machine Learning with Python Scikit-Learn
There are several Python libraries that provide solid implementations of a range of machine learning algorithms. One of the best known is **Scikit-Learn** (scikit-learn Machine Learning in Python, n.d.); see the figure below (source: https://www.kdnuggets.com/2020/11/top-python-libraries-data-science-data-visualization-machine-learning.html).

![](images/python-libraries.png)
  
Its main characteristics are:
- efficient implementations of a large number of commonly used algorithms
- a clean, uniform, and streamlined API
- useful and complete online documentation
  
A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward (VanderPlas, 2016).
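
To make this uniformity concrete, here is a minimal sketch that drives two unrelated classifiers through exactly the same calls. The choice of the built-in iris dataset and of these two estimators is an assumption made for illustration only; any other pair of classifiers would work the same way.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here only as an illustration
X, y = load_iris(return_X_y=True)

# Two different algorithms, used through the same fit()/predict() interface
models = [LogisticRegression(max_iter=1000), KNeighborsClassifier(n_neighbors=5)]

predictions = {}
for model in models:
    model.fit(X, y)                                           # same call for every estimator
    predictions[type(model).__name__] = model.predict(X[:3])  # same call again

for name, labels in predictions.items():
    print(name, labels)
```

Only the line that builds the list of models changes when a different algorithm is tried; the rest of the code stays as it is.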

### Data Representation in Scikit-Learn
Machine learning is about creating models from data: for that reason, we’ll start by discussing how data can be represented in order to be understood by the computer. The best way to think about data within Scikit-Learn is in terms of tables of data.  
  
#### Data as table
- a basic table is a two-dimensional grid of data
- rows represent individual elements of the dataset
- columns represent quantities related to each of these elements
  
For example, in the Titanic dataset:
- each row refers to a single passenger
- the number of rows is the total number of passengers in the dataset. 
  
In general, we will refer to the rows of the matrix as samples, and the number of rows as n_samples. 
  
- each column of the data refers to a particular quantitative piece of information that describes each sample. 
  
In general, we will refer to the columns of the matrix as features, and the number of columns as n_features.
  
#### Features matrix
- a two-dimensional numerical array or matrix, which we will call the features matrix
- by convention, often stored in a variable named X
- assumed to be two-dimensional, with shape [n_samples, n_features]
- most often contained in a NumPy array or a Pandas DataFrame
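
As a small sketch of what such a features matrix looks like in practice, the example below builds a Pandas DataFrame by hand. The rows and the column names (Pclass, Age, Fare) are hypothetical values in the spirit of the Titanic data, not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical, hand-made rows in the spirit of the Titanic dataset
X = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Rows are samples, columns are features
n_samples, n_features = X.shape
print(X.shape)  # (4, 3)

# The same data as a plain NumPy array, which Scikit-Learn also accepts
X_array = X.to_numpy()
print(X_array.ndim)  # 2
```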
  
#### Target array
- we also generally work with a label or target array
- by convention we will usually call this array y. 
- usually one-dimensional, with length n_samples
- generally contained in a NumPy array or Pandas Series. 
- may have continuous numerical values, or discrete classes/labels. 
- some Scikit-Learn estimators can handle multiple target values in the form of a two-dimensional [n_samples, n_targets] target array
- we will primarily be working with the common case of a one-dimensional target array.
- the distinguishing feature of the target array is that it is usually the quantity we want to predict from the data: in statistical terms, it is the dependent variable. For example, in the Titanic case we may wish to construct a model that predicts whether a passenger survived based on the other features; in this case, the Survived column would be considered the target.
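
The split into features and target can be sketched as follows. The DataFrame is again a hypothetical, hand-made stand-in for the Titanic data; only the Survived column name comes from the text above.

```python
import pandas as pd

# Hypothetical, hand-made rows; "Survived" plays the role of the target column
passengers = pd.DataFrame({
    "Pclass": [3, 1, 3, 1],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Survived": [0, 1, 1, 1],
})

# Features matrix: every column except the target
X = passengers.drop(columns="Survived")

# Target array: the single column we want to predict
y = passengers["Survived"]

print(X.shape)  # (4, 3)
print(y.shape)  # (4,)
```

Note that X is two-dimensional while y is one-dimensional, and both have the same number of rows (n_samples).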
  
To summarize, the expected layout of features and target values can be visualized as follows:
  
![](images/scikit-learn.png)
 
### Scikit-Learn’s Estimator API
The Scikit-Learn API is designed with the following guiding principles in mind. 
  
- **Consistency**: all objects share a common interface drawn from a limited set of methods, with consistent documentation.
- **Inspection**: all specified parameter values are exposed as public attributes.
- **Limited object hierarchy**: only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.
- **Composition**: many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.
- **Sensible defaults**: when models require user-specified parameters (the so-called _hyperparameters_), the library defines an appropriate default value.
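
Some of these principles can be observed directly on an estimator object. The sketch below uses LinearRegression and StandardScaler purely as examples; the principles are not specific to these two classes.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Sensible defaults: an estimator can be created without any arguments
model = LinearRegression()

# Inspection: hyperparameters are available as public attributes ...
print(model.fit_intercept)  # True

# ... and, consistently for all estimators, through get_params()
params = model.get_params()
print(sorted(params))

# Composition: a scaler and a regressor chained into one estimator
pipeline = make_pipeline(StandardScaler(), LinearRegression())
step_names = [name for name, _ in pipeline.steps]
print(step_names)  # ['standardscaler', 'linearregression']
```

The resulting pipeline is itself an estimator, so it is fitted and applied with the same fit() and predict() methods as a single model.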

#### Basics of the API

Most commonly, the steps in using the Scikit-Learn estimator API are as follows:
1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
1. Choose model hyperparameters by instantiating this class with the desired values.
1. Arrange the data into a features matrix and target vector, as described above.
1. Fit the model to your data by calling the fit() method of the model instance.
1. Apply the model to new data:
   - for supervised learning, we often predict labels for unknown data using the predict() method.
   - for unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
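
The five steps above can be sketched as follows. The synthetic data (a noisy straight line) and the choice of LinearRegression and PCA are assumptions made for illustration; the same sequence of calls applies to other estimators.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression  # step 1: choose a model class

# Synthetic data (an assumption for illustration): y = 2x - 1 plus noise
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

# Step 2: choose hyperparameters by instantiating the class
model = LinearRegression(fit_intercept=True)

# Step 3: arrange the data; X must be two-dimensional, y one-dimensional
X = x[:, np.newaxis]  # shape (50, 1)

# Step 4: fit the model to the data
model.fit(X, y)
print(model.coef_, model.intercept_)  # learned slope and intercept

# Step 5 (supervised): predict targets for new, unseen inputs
X_new = np.linspace(-1, 11, 5)[:, np.newaxis]
y_new = model.predict(X_new)

# Step 5 (unsupervised): no target array, transform() instead of predict()
data_2d = np.column_stack([x, y])
pca = PCA(n_components=1)
pca.fit(data_2d)
projected = pca.transform(data_2d)  # shape (50, 1)
```

By convention, attributes learned during fit() carry a trailing underscore, such as coef_ and intercept_ here.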
