# Round 1 - Components of Machine Learning

<img src="https://github.com/Bookiebookie/MachineLearningwithPython/blob/master/R1_ComponentsML/AMLProblem.png?raw=true" alt="Drawing" style="width: 600px;"/>

Many machine learning (ML) problems and methods consist of three components: 

1. <b>Data</b> points as the basic (atomic) unit of information. Data points are characterized by features, which are properties that can be measured (or computed) easily. Besides features, data points are often associated with labels that represent some higher-level information or quantity of interest. In contrast to features, labels are difficult to acquire, and much of machine learning is about developing methods that estimate or predict the label of a data point from its features.  

2. A <b>hypothesis</b> space (also referred to as an ML model) consisting of computationally feasible predictor functions.

3. A <b>loss function</b> that is used to assess the quality of a particular predictor function. 
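
The three components can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numbers are made up, the hypothesis space is restricted to linear predictors, and the average squared error is one possible choice of loss function, not the only one.

```python
import numpy as np

# Component 1, data: m = 3 data points with n = 2 features each (made-up values).
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([5.0, 2.0, 5.0])   # one label per data point

# Component 2, hypothesis space: linear predictors h(x) = w^T x,
# one predictor for each choice of the weight vector w.
def predict(w, X):
    """Evaluate h(x) = w^T x for every row (data point) of X."""
    return X @ w

# Component 3, loss function: average squared error (an assumed choice).
def squared_error_loss(w, X, y):
    """Average squared difference between predicted and true labels."""
    return np.mean((predict(w, X) - y) ** 2)

w_good = np.array([1.0, 2.0])   # reproduces the labels exactly
w_bad = np.array([0.0, 0.0])    # always predicts zero

print(squared_error_loss(w_good, X, y))   # 0.0
print(squared_error_loss(w_bad, X, y))    # (25 + 4 + 25) / 3 = 18.0
```

A smaller loss indicates that the predictor fits the labeled data points better.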

## Learning Goals

* Learn to make useful definitions for what data points (examples, samples), features and labels are in different real-life applications. 
* Learn how to represent data as numpy arrays, which are the Python implementation of vectors and matrices.   
* Learn to use ("toy") datasets provided by the Python library `scikit-learn`. 
* Learn about the concept of hypothesis spaces. 
* Learn how to fit (linear) predictor functions to data. 

This notebook contains several student tasks which require you to write a few lines of Python code to solve small problems. In particular, you have to fill in the gaps marked as **Student Task**.

<b><center><font size=4>Additional material</font></center></b>

<b><font size=4>Videos</font></b>

* [Data](https://youtu.be/WWYRH3x7_5M), [Hypothesis Space](https://youtu.be/CDcRfak1Mh4), [Hypothesis Space of Linear Models](https://youtu.be/Mch5hmhVuiA), [Hypothesis Space of Decision Trees](https://youtu.be/0FmaLfjAaRE), [Hypothesis Space of Deep Learning](https://youtu.be/im8mweIrpAM), [Loss Functions](https://www.youtube.com/watch?v=Uv9lihDfsBs&t=4s)

<b><font size=4>Tutorials</font></b>

* An introduction to the components of ML can be found under [this link](https://arxiv.org/pdf/1910.12387.pdf).

* An introduction to the Python library `numpy` can be found under [this link](https://hackernoon.com/introduction-to-numpy-1-an-absolute-beginners-guide-to-machine-learning-and-data-science-5d87f13f0d51).

* The "Learn the Basics" and "Data Science Tutorial" sections can be found under [this link](https://www.learnpython.org/en/).

* A quick refresher on basic properties of matrices can be found under [this link](http://math.mit.edu/~gs/linearalgebra/linearalgebra5_1-3.pdf) and [this link](https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470549094.app1).

* An overview of mathematical notation can be found under [this link](https://en.wikipedia.org/wiki/List_of_mathematical_symbols).




## Data as Matrices and Vectors
<a id="Q1"></a>

To implement ML methods, we need to be able to efficiently **store and manipulate** data. [Vectors and matrices](https://en.wikipedia.org/wiki/Matrix_(mathematics)) are powerful tools for representing and manipulating data; they are, in turn, special cases of [tensors](https://en.wikipedia.org/wiki/Tensor). 

The data points arising in many application domains can often be characterized by a list of numeric attributes. These numeric attributes, or "features", $x_{r}$ can be stacked conveniently into a vector $\mathbf{x}=\big(x_{1},\ldots,x_{n}\big)^{T}$. Many ML methods, such as linear regression (see round 2) or logistic regression (see round 3), use predictor functions of the form $h(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x}$ with some weight vector $\mathbf{w}$. 

Once we restrict ourselves to linear functions of the form $h(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x}$, we can represent a predictor function by the weight vector $\mathbf{w}$. Indeed, given the weight vector $\mathbf{w}$, we can evaluate the predictor function for any feature vector $\mathbf{x}$ as $h(\mathbf{x}) = \mathbf{w}^{T} \mathbf{x}$. Thus, we can use vectors to represent not only the data but also the predictor functions applied to this data. 
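
As a minimal sketch of this idea, the snippet below stores a linear predictor as nothing more than its weight vector and evaluates it with an inner product. The numeric values of `w` and `x` are made up for illustration.

```python
import numpy as np

# Made-up weight vector w and feature vector x with n = 3 entries each.
w = np.array([0.5, -1.0, 2.0])
x = np.array([4.0, 1.0, 3.0])

# h(x) = w^T x, computed as the inner product of two 1-D arrays.
h_x = np.dot(w, x)

# The same value written out as the sum over the products w_r * x_r.
h_x_sum = sum(w_r * x_r for w_r, x_r in zip(w, x))

print(h_x)       # 0.5*4 - 1*1 + 2*3 = 7.0
print(h_x_sum)   # same value
```

Changing `w` gives a different predictor function, while the code that evaluates it stays the same.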

Assume we have a set of data points which we index with $i=1,\ldots,m$. The $i$th data point is characterized by the feature vector $\mathbf{x}^{(i)} = \big( x_{1}^{(i)}, \ldots, x^{(i)}_{n} \big)^{T}$. 
The accepted way to organize such data in ML is as follows: the features are stored in the ("feature") matrix **X**, with each row containing the feature vector of one data point ($m$ - number of data points) and each column containing the values of one feature across all data points ($n$ - number of features):

\begin{equation}
\mathbf{X}  = \begin{pmatrix} X_{1,1} & X_{1,2}& \ldots & X_{1,n} \\ 
X_{2,1} & X_{2,2}& \ldots & X_{2,n} \\ 
\vdots & \vdots & \vdots & \vdots \\ 
X_{m,1} & X_{m,2} & \ldots & X_{m,n} \end{pmatrix}\in \mathbb{R}^{m \times n}   \quad \quad (Eq.1)
\end{equation} 
\
The matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ is stored in Python as a numpy array of shape (m,n). The $i$th row of the matrix $\mathbf{X}$ is the feature vector $\mathbf{x}^{(i)}$ of the $i$th data point. 
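
The following sketch builds such a feature matrix by stacking feature vectors as rows. The values are invented for illustration; note that numpy indices start at 0, so the $i$th data point sits in row `i - 1`.

```python
import numpy as np

# Feature vectors of m = 4 data points, each with n = 3 features (made-up values).
x1 = np.array([5.1, 3.5, 1.4])
x2 = np.array([4.9, 3.0, 1.4])
x3 = np.array([6.2, 2.9, 4.3])
x4 = np.array([5.9, 3.0, 5.1])

# Stack the feature vectors as rows to obtain the feature matrix X.
X = np.vstack([x1, x2, x3, x4])

m, n = X.shape
print(m, n)       # 4 3

# Row i-1 holds the feature vector of the i-th data point.
print(X[1])       # feature vector of the 2nd data point

# Column r-1 holds the r-th feature of all data points.
print(X[:, 0])    # 1st feature of every data point
```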

Labels of data points are stored in vector **y**: 

\begin{equation}
\mathbf{y}  = \begin{pmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{m} \end{pmatrix}\in \mathbb{R}^{m}
\end{equation} 
\
The vector **y** is represented as a numpy array of shape (m,1).
\
\
\
$m$ - number of data points\
$n$ - number of features\
$\mathbf{X}$       - upper-case bold letters denote a matrix  \
$\mathbf{x}$       - lower-case bold letters denote a vector  \
$\mathbf{x}^{T}$   - transpose of vector x \
$x_{1}$            - first entry of vector x\
$x_{r}$            - $r$th entry of vector x\
$\mathbf{x}^{(i)}$ - feature vector of $i$th data point\
$x_{r}^{(i)}$      - $r$th feature of $i$th data point\
$\mathbb{R}$       - real numbers\
$\mathbb{R}^{n}$   - [real coordinate space](https://en.wikipedia.org/wiki/Real_coordinate_space) consisting of length-$n$ lists of real numbers \
$\mathbb{R}^{m \times n}$ - matrices with $m$ rows and $n$ columns of real-valued numbers
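
The notation above maps onto numpy indexing roughly as sketched below. The array values are made up, and the variable names are chosen only for this illustration.

```python
import numpy as np

# Feature matrix X with m = 2 data points and n = 3 features (made-up values).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Label vector y stored as a column, i.e. a numpy array of shape (m, 1).
y = np.array([10.0, 20.0]).reshape(-1, 1)

m, n = X.shape            # number of data points, number of features

# numpy counts from 0, the mathematical notation counts from 1.
i, r = 2, 3
x_i = X[i - 1]            # feature vector x^(i) of the i-th data point
x_i_r = X[i - 1, r - 1]   # r-th feature of the i-th data point

# A column vector has shape (n, 1); its transpose is a row of shape (1, n).
x_col = x_i.reshape(-1, 1)
x_row = x_col.T

print(y.shape)                     # (2, 1)
print(x_i_r)                       # 6.0
print(x_col.shape, x_row.shape)    # (3, 1) (1, 3)
```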



## Features and Labels 

Let us illustrate the main ML terminology using a concrete example. Imagine that we want to build a model for classifying songs according to their genre (such as "Pop", "Blues" or "Hip-Hop"). In this case the **data points** are songs: each song corresponds to one data point. To build a classifier for the song genre, we need some labeled data points, i.e., songs for which we know the correct genre. Each data point has several **features**, which characterize the song. Features include, e.g., the city where the song was produced, the length of the song's lyrics, its tempo or even the power spectrum of the audio signal. The quantity of interest or **label** in this case is the genre to which the song belongs. 
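
To make this concrete, the sketch below encodes four hypothetical songs as a feature matrix and a label vector. All values are invented for illustration, only two of the features mentioned above are used, and encoding the genres with `np.unique` is just one possible way to turn text labels into numbers.

```python
import numpy as np

# Four hypothetical songs; all values are invented for illustration.
# Columns: length of the lyrics (in words), tempo (in beats per minute).
X = np.array([[310.0, 120.0],
              [180.0,  70.0],
              [650.0,  95.0],
              [290.0, 128.0]])

# The genre of each song, i.e. its label, first as plain text.
genres = np.array(["Pop", "Blues", "Hip-Hop", "Pop"])

# Encode the genres as integers: np.unique returns the sorted distinct
# names and, for each song, the index of its genre in that sorted list.
genre_names, y = np.unique(genres, return_inverse=True)

print(X.shape)       # (4, 2): m = 4 songs, n = 2 features
print(genre_names)   # ['Blues' 'Hip-Hop' 'Pop']
print(y)             # [2 0 1 2]
```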

<img src="https://github.com/Bookiebookie/MachineLearningwithPython/blob/master/R1_ComponentsML/FeaturesLabels.jpg?raw=true" alt="Drawing" style="width: 1000px;"/>



<a id='Bonus1'></a>
<div class=" alert alert-warning">
    <b>Bonus Task.</b> Machine Learning in your life. 
    
Bonus task worth 50 points.
    
Produce a short video/slides/description in which some real-life situation is modelled as a machine learning problem. 
</div>

<b>Bonus Task Answer:</b> Weather prediction could be modelled as a machine learning problem.
In this model, a data point is the sensor data recorded on one day in a specific location or area. The features include pressure, humidity, temperature, wind, etc., and the labels include rainy, sunny, cloudy, etc. Although most of the features used in this model are numeric, the labels are not, so this is a classification problem. We collect data in different places over a long period to build this model and then use a loss function to assess how well the model performs. 

## Scikit-Learn Data

The Python library `scikit-learn` comes with a few standard datasets, for instance the [iris](https://scikit-learn.org/stable/datasets/index.html#iris-plants-dataset) and [digits](https://scikit-learn.org/stable/datasets/index.html#optical-recognition-of-handwritten-digits-dataset) datasets for classification and the [boston house prices](https://scikit-learn.org/stable/datasets/index.html#boston-house-prices-dataset) and [linnerrud](https://scikit-learn.org/stable/datasets/index.html#linnerrud-dataset) datasets for regression.
These are [Toy datasets](https://scikit-learn.org/stable/datasets/index.html#toy-datasets) - small datasets that do not require downloading any file from an external website. However, `scikit-learn` also provides significantly larger datasets, referred to as [Real world datasets](https://scikit-learn.org/stable/datasets/index.html#real-world-datasets), which can be accessed online. 

Find more information about `scikit-learn` datasets here: https://scikit-learn.org/stable/datasets/index.html

More datasets can be found here:

* https://archive.ics.uci.edu/ml/index.php
* https://www.kaggle.com/datasets

Let us now take a closer look at some of these `scikit-learn` datasets and try to identify features and labels for these datasets.

### Toy datasets

The code snippet below shows how to load datasets from `sklearn` and how to access the features and labels of the data points in these datasets. The small toy datasets are accessed through the `datasets` module, which is imported using the command `from sklearn import datasets`. 

These datasets are stored using the [`bunch` data type](https://pypi.org/project/bunch/), which is similar to the `dictionary` data type. A `bunch` object contains key-value pairs. Most datasets contain at least the keys `'data', 'target', 'target_names','DESCR'`. The value of the key `DESCR` is a short description of the dataset. The values of the `'target_names'` and `'target'` keys are the labels' names and the labels, respectively, of the data points. 
By default, the labels of data points are always numbers. 

In a classification problem, these numbers are integers starting from $0$. The values of the key ``target_names`` provide a textual description of the meaning of different label values. E.g., the labels of images could be $y=0$ or $y=1$ and the label names would be 0="Cat", 1="Dog". 
The value of the `'data'` key is the feature matrix (see <a href='#Q1'>(Eq.1)</a>). 
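
As a short sketch of how these keys are used, the snippet below loads the iris toy dataset and reads the feature matrix, the labels and the label names from the returned object. The variable names are chosen for this illustration only. Note that `scikit-learn` returns the labels as a 1-D array of shape (m,); `reshape(-1, 1)` converts it to the column shape (m,1) used above.

```python
from sklearn import datasets

# Load the iris toy dataset; no download from an external website is needed.
iris = datasets.load_iris()

X = iris['data']                 # feature matrix, one row per data point
y = iris['target']               # integer labels starting from 0
label_names = iris['target_names']

print(X.shape)                   # (150, 4): m = 150 data points, n = 4 features
print(y.shape)                   # (150,)
print(label_names)               # ['setosa' 'versicolor' 'virginica']

# Textual meaning of the label of the first data point.
print(label_names[y[0]])         # setosa

# Column version of the label vector with shape (m, 1).
y_col = y.reshape(-1, 1)
print(y_col.shape)               # (150, 1)
```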



<b><center><font size=3>Explore the dataset</font></center></b>
The **["Digits" dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits)** contains images of hand-written digits. This dataset can be used for testing a classification method to [recognize digits from hand-written images](https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html#recognizing-hand-written-digits).




```python
# import toy datasets from sklearn library
from sklearn import datasets

# load the digits dataset into the bunch object "digits"
digits = datasets.load_digits()
# print the keys of all (key,value) pairs contained in digits
print(digits.keys())
```

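
Continuing the exploration, the sketch below reads the feature matrix and labels of the digits dataset and turns one feature vector back into an image. The variable names are chosen for this illustration only.

```python
from sklearn import datasets

digits = datasets.load_digits()

X = digits['data']       # feature matrix: one row per image
y = digits['target']     # labels: the digit shown in each image

print(X.shape)           # (1797, 64): m = 1797 images, n = 64 pixel features
print(y.shape)           # (1797,)

# Each feature vector holds the 64 pixel intensities of an 8x8 image,
# so reshaping a row recovers the original image.
first_image = X[0].reshape(8, 8)
print(first_image.shape)     # (8, 8)
print(y[0])                  # label of the first image
```

The reshaped row agrees with the corresponding entry of `digits['images']`, which stores the same pixels in image form.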