# Data

**Contents**:
- [Machine Learning](#Machine-Learning)
  - [What is machine learning and why do we care?](#What-is-machine-learning-and-why-do-we-care?)
  - [Types of Machine Learning](#Types-of-Machine-Learning)
  - [Components of a model](#Components-of-a-model)
- [What is data?](#What-is-data?)
- [Reading and interpreting data](#Reading-and-interpreting-data)
- [n-dimensional data](#n-dimensional-data)

# Machine Learning

## What is machine learning and why do we care?
Through experience, humans learn to act intelligently - to acquire and apply knowledge. We are all able to learn to accomplish tasks for ourselves, some tasks with explicit guidance, and others without. For a human, a simple task is to be able to distinguish images of cats and dogs. A more complicated task would be to win a chess game. Both of these tasks require us to act intelligently and make decisions. Machine Learning is the field which focuses on getting machines to accomplish goals or tasks for themselves.

In general, the problem of accomplishing goals can be framed as trying to produce some intended output given an input. As such, machine learning is all about automatically learning to represent the relationships between inputs and outputs. These inputs and outputs can take many forms as illustrated below.

![](images/inp-out.jpg)

We'd like to be able to build algorithms that can learn to use inputs to predict useful outputs, and solve problems such as those shown above.

**Building machines which can model the relationships between inputs and outputs is the goal of machine learning.**

## Types of Machine Learning
There are three main categories of problems within Machine Learning. It's worth knowing these straight away, although in this chapter we only consider *supervised learning*. In the next chapter (Python for Data Scientists), we'll briefly investigate *unsupervised learning* too.

**Unsupervised learning.** Where we only have an input and try to predict something useful as an output, without being explicitly shown examples of what the output should be. This is typically done in order to better understand the underlying structure of the data. E.g. we have data about houses and try to split these examples into different clusters.

**Reinforcement Learning.** Where our algorithm controls an agent that interacts with its environment and has to learn what actions to take to perform a task. E.g. we are trying to get a robot to walk or an algorithm to learn how to win a game of chess.

**Supervised Learning.** Where we predict an output from an input, given examples of input-output pairs. E.g. we use different features about a house, such as location, number of rooms, etc., and try to predict the price.


## Components of a model
A machine learning algorithm is known as a model. In general, there are four components to a model:
1. Data
2. Model
3. Criterion
4. Optimiser

There are two 'modes' to a model/algorithm: training and inference. In *training* mode, models use *data* to be trained. The (input) data is fed to the *model*, which makes a prediction. A *criterion* is responsible for determining how far off the model's prediction was from the output in the dataset. Finally, an *optimiser* uses the criterion to tune the model, so that it performs better on the next data it sees. In *inference* mode, we have a model which has finished being trained. Its job is now just to make predictions given some data, so we do not need the criterion or the optimiser in inference mode.
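
To make the four components concrete, here is a minimal sketch of how they could fit together. It assumes a straight-line model, a mean squared error criterion, and plain gradient descent as the optimiser; the data and all the names (`model`, `criterion`, `optimiser_step`) are made up for illustration and are not the only way to do this.

```python
import numpy as np

# 1. Data: made-up inputs and outputs following y = 2x + 1 (illustrative only).
inputs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
targets = 2.0 * inputs + 1.0

# 2. Model: a straight line with two tunable parameters.
def model(x, weight, bias):
    return weight * x + bias

# 3. Criterion: mean squared error between predictions and targets.
def criterion(predictions, targets):
    return np.mean((predictions - targets) ** 2)

# 4. Optimiser: one step of gradient descent on the criterion.
def optimiser_step(x, targets, weight, bias, learning_rate=0.05):
    error = model(x, weight, bias) - targets
    grad_weight = 2.0 * np.mean(error * x)
    grad_bias = 2.0 * np.mean(error)
    return weight - learning_rate * grad_weight, bias - learning_rate * grad_bias

# Training mode: all four components are used.
weight, bias = 0.0, 0.0
initial_loss = criterion(model(inputs, weight, bias), targets)
for _ in range(2000):
    weight, bias = optimiser_step(inputs, targets, weight, bias)
final_loss = criterion(model(inputs, weight, bias), targets)

# Inference mode: only the trained model is needed.
prediction = model(5.0, weight, bias)
print(initial_loss, final_loss, prediction)
```

Notice that the last step calls only `model`: once training has finished, the criterion and optimiser play no part in making predictions.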

# What is data?
As we saw above, algorithms can take inputs and produce outputs of many different forms, types, and shapes. In essence, all of these inputs and outputs are types of data.

Data in and of itself is quite an abstract notion. What is data? How do we quantify it? How can we structure and organise data in a way that algorithms can interpret? This is what we will explore next.

Imagine we would like to predict the average salary of a person. That is - we will capture the notion of a 'person' as a group of data points, and then somehow use this to predict a single data point - or scalar - being the average salary we can expect that person would achieve.

How would we quantify a person in data? What kind of properties of that person would we capture? Which would be more useful or relevant for predicting their salary?

How could we structure this data in a format which algorithms can interpret?
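
One possible answer, sketched below, is to pick a fixed list of numeric properties and store their values in an array. The properties chosen here (`age_years`, `years_of_education`, `years_of_experience`) and their values are hypothetical; they are only meant to show the idea of turning a 'person' into numbers an algorithm can read.

```python
import numpy as np

# Hypothetical properties of one person; values are made up for illustration.
person = {
    "age_years": 43,
    "years_of_education": 16,
    "years_of_experience": 20,
}

# Fix an order for the properties so every person maps to the same layout.
feature_names = ["age_years", "years_of_education", "years_of_experience"]
feature_vector = np.array([person[name] for name in feature_names], dtype=float)

print(feature_vector)        # one number per property
print(feature_vector.shape)  # (3,)
```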


Let's import our own dataset.

In `data/income_data.csv`, we have a .csv file containing 30 data points - examples of people in terms of their Age, and their yearly income in thousands. This is what it looks like:

| Age (Years) | Salary (£K) |
|-------------|-------------|
| 43          | 40.3453     |
| 32          | 36.4522     |
| 64          | 55.2352     |


In [None]:
# Let's import some data.
import numpy as np
 
data_income = np.genfromtxt('data/income_data.csv', delimiter=',') ## Import income data and save to variable.

print(data_income)

X = data_income[:, 0] ## Extract an array of the ages from the data.
Y = data_income[:, 1] ## Extract an array of the salaries from the data.

print(X)
print(Y)
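
If you'd like to see what the `[:, 0]` and `[:, 1]` slicing does without needing the CSV file, the sketch below applies the same indexing to the three example rows from the table above. The variable names (`sample`, `ages`, `salaries`) are just for this illustration.

```python
import numpy as np

# The three example rows from the table above: [age, salary].
sample = np.array([
    [43, 40.3453],
    [32, 36.4522],
    [64, 55.2352],
])

ages = sample[:, 0]      # every row, column 0
salaries = sample[:, 1]  # every row, column 1

print(sample.shape)  # (3, 2): 3 data points, 2 columns
print(ages)
print(salaries)
```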

In [None]:
import plotly.graph_objects as go


fig = go.Figure(data=go.Scatter(x=X, y=Y, mode="markers"))
fig.update_layout(
    title="Age vs Salary",
    xaxis_title="Age (Years)",
    yaxis_title="Salary (£K / Year)",
)
fig.show()


# Reading and interpreting data

In the dataset we loaded above, our data had one input variable mapping to one output variable. The input variable was age, and the *label* was salary.

Obviously you all realise that this isn't a very realistic dataset. There are a lot more factors than just age which relate to salary. Perhaps the number of cups of coffee you drink a day also affects your salary:

| Coffee (#cups) | Age (Years) | Salary (£K) |
|----------------|-------------|-------------|
| 1              | 43          | 40.3453     |
| 2              | 32          | 36.4522     |
| 2              | 64          | 55.2352     |

How many input variables do we have now per label?

Input variables tend to be referred to as *features*. Here, Coffee and Age are the features we use to predict Salary. In a second, we'll expand on this, but the number of features we have is known as the *dimensionality*. Therefore, this new dataset is 2-dimensional.
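
As a quick check, the sketch below separates the rows of the coffee table above into features and labels and reads the dimensionality off the array's shape. The names `rows`, `features`, and `labels` are assumptions for this example only.

```python
import numpy as np

# Rows from the table above: [coffee (cups), age (years), salary (£K)].
rows = np.array([
    [1, 43, 40.3453],
    [2, 32, 36.4522],
    [2, 64, 55.2352],
])

features = rows[:, :2]  # input variables: coffee and age
labels = rows[:, 2]     # output variable: salary

num_examples, dimensionality = features.shape
print(num_examples)    # 3
print(dimensionality)  # 2
```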

Visualising a 2-d dataset in a similar fashion to 1-d is slightly more complicated:

In [None]:
data_income = np.genfromtxt('data/income_data_coffee.csv', delimiter=',') ## Import income data and save to variable.

print(data_income)

X = data_income[:, 0] ## Extract an array of the #coffees from the data.
Y = data_income[:, 1] ## Extract an array of the ages from the data.
Z = data_income[:, 2] ## Extract an array of the salaries from the data.

print(X)
print(Y)
print(Z)

In [None]:
fig = go.Figure(data=go.Scatter3d(x=X, y=Y, z=Z, mode="markers"))
fig.update_layout(scene=dict(
        xaxis_title="Number of Coffees",
        yaxis_title="Age (Years)",
        zaxis_title="Salary (£K / Year)"
    ),
    title="Age and Coffee vs Salary"
)
fig.show()

It's a bit harder to interpret than the 1-d example, but after playing around with it, you should be able to see some trends... that is, the more cups of coffee you drink, the higher your salary (the graph below makes this a bit clearer).

In [None]:
fig = go.Figure(data=go.Scatter(x=X, y=Z, mode="markers"))
fig.show()

## Shapes and Dimensions

Generally speaking, in a real dataset, one data point will be composed of many features/dimensions (i.e. input variables). Strictly speaking we don't need to have a one dimensional output either, but we will consider these kinds of cases where our label is multi-dimensional in future lectures.


Let's look at some notations and talk about how data can be structured.

![image](images/data.jpg)
![image](images/labels.jpg)


In the initial Age vs Salary dataset we introduced earlier on, there were $m$ inputs ($m = 30$ in our income data example). Each of these inputs has $1$ feature, and therefore our *feature vectors* are 1-dimensional.

What about our Age and Coffee vs Salary dataset? How many features does our input have?

In the general case, we have $m$ inputs, each of which will have $n$ features. There will also be $m$ outputs, one per input. Mathematically, you will often see this phrased as $$X = \{x_1, \dots, x_m\}, \; x_i \in \mathbf{R}^n \\ Y = \{y_1, \dots, y_m\}, \; y_i \in \mathbf{R}^1$$
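
In code, this notation usually shows up as array shapes. The sketch below assumes the common convention of stacking the $m$ feature vectors as the rows of a 2-d array; the sizes and values are placeholders chosen only so the shapes can be inspected.

```python
import numpy as np

m, n = 5, 3  # hypothetical sizes: 5 examples, 3 features each

# X stacks the m feature vectors x_1, ..., x_m as rows, each in R^n.
X = np.arange(m * n, dtype=float).reshape(m, n)
# Y holds one scalar label y_i per example.
Y = np.arange(m, dtype=float)

assert X.shape == (m, n)
assert Y.shape == (m,)

# Each input x_i is paired with its label y_i.
for x_i, y_i in zip(X, Y):
    print(x_i.shape, y_i)
```

Each `x_i` printed here has shape `(3,)`, i.e. it is one $n$-dimensional feature vector, and each `y_i` is a single number.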

Take a second to try and understand this.

# n-dimensional data

We've looked at the simpler 1-d and 2-d examples above. In most cases you'll be working with datasets which have significantly more features and more complicated data types (something we'll look at in Chapter 2). To provide some brief intuition and a teaser of what's to come, let's apply what we've just learnt to a larger dataset. We'll introduce **sklearn** and **pandas** later in the course so don't worry about them for now - we're just using them to load in the data to show it!

In [None]:
from sklearn import datasets
import pandas as pd

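# Dataset description: https://scikit-learn.org/stable/datasets/index.html#boston-house-prices-dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version.
house_prices = datasets.load_boston()
print(house_prices.DESCR)  # MEDV (median value of owner-occupied homes in $1000's) is the target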


features = house_prices.data
target = house_prices.target
col_names = house_prices.feature_names

dataframe = pd.DataFrame(features, columns=col_names)  # one column per feature
dataframe['target'] = target  # add the label as an extra column

print("A peek into our data...")
print(dataframe.head())
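
The same pattern (a features array, a list of column names, and a target column) can be tried on a much smaller scale. The sketch below reuses the three rows of the coffee table from earlier in this chapter rather than a scikit-learn dataset, so the column names here are taken from that table.

```python
import numpy as np
import pandas as pd

# Rows from the coffee table earlier in this chapter.
features = np.array([[1, 43], [2, 32], [2, 64]])
target = np.array([40.3453, 36.4522, 55.2352])
col_names = ["Coffee (#cups)", "Age (Years)"]

dataframe = pd.DataFrame(features, columns=col_names)
dataframe["target"] = target  # add the label as an extra column

print(dataframe.shape)  # (3, 3): 3 rows, 2 features + 1 target
print(dataframe.head())
```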

