## Data Processing
Once we have analysed the data and decided it is suitable for machine learning, we need to process it so that it is ready for input to a machine learning algorithm. In this notebook we will cover some basics of pre-processing data for machine learning. Specifically, we will look at (1) formatting the data correctly, and (2) normalising the data.

First, let us import the required packages and load the diabetes data.

In [None]:
import pandas as pd

In [None]:
df_diabetes = pd.read_csv('Data/diabetes.tsv', sep='\t', header=0)
df_diabetes.head()

### Formatting data for (supervised) machine learning
Supervised machine learning algorithms require the data to be formatted in a particular way. Specifically, the data needs to consist of an array of target values, often denoted $y$, and an array of attribute values, often denoted $X$.

Let's look at how we would map the diabetes DataFrame to this format.

First, let us create an array of targets, $y$, by extracting the **Y** column from the diabetes DataFrame, and print the head.

In [None]:
y = df_diabetes['Y']
y.head()

Now, let us create the array/DataFrame of attributes, denoted $X$, by dropping the target variable from the diabetes DataFrame, and print the head.

In [None]:
X = df_diabetes.drop('Y', axis=1)
X.head()

Finally, let us compare the shapes of $y$ and $X$.

In [None]:
print(y.shape)
print(X.shape)

We see that $X$ consists of $442$ samples of $10$ attributes. We also see that $y$ consists of $442$ samples; it is a *one-dimensional* array (as opposed to a two-dimensional one), so its shape has no entry for a second dimension.
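The difference between the two shapes can be seen on a small example. The sketch below uses a hypothetical toy DataFrame (`df_toy`, with made-up values) in place of the diabetes data, and also shows one possible way to obtain an explicit column vector in case an algorithm expects a two-dimensional target.

```python
import pandas as pd

# Hypothetical toy data standing in for the diabetes DataFrame (values are made up)
df_toy = pd.DataFrame({
    'AGE': [59, 48, 72],
    'BMI': [32.1, 21.6, 30.5],
    'Y': [151, 75, 141],
})

y_toy = df_toy['Y']               # a Series: one-dimensional
X_toy = df_toy.drop('Y', axis=1)  # a DataFrame: two-dimensional

print(y_toy.shape)  # one entry only: the number of samples
print(X_toy.shape)  # two entries: (samples, attributes)

# If a two-dimensional target is needed, reshape the underlying NumPy array
y_col = y_toy.to_numpy().reshape(-1, 1)
print(y_col.shape)
```
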

### Normalising data for (supervised) machine learning
*Data normalisation* is often required for machine learning algorithms to perform well. Essentially, this amounts to trying to make the data better suit the assumptions on which the algorithm was developed. Here we will explore transforming the attribute data to have zero mean and unit variance.
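
In symbols, using notation that is assumed here rather than taken from the dataset, write $x_{ij}$ for the value of attribute $j$ in sample $i$, $\mu_j$ for the mean of column $j$, and $\sigma_j$ for its standard deviation. The transformation can then be written as

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}
$$

so that each transformed column $z_j$ has mean $0$ and standard deviation $1$.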

First, let us use the `agg` method to calculate the *mean* and *standard deviation* of the columns of $X$.

In [None]:
# Use a list rather than a set so that the row order (mean, then std) is fixed
X.agg(['mean', 'std']).head()

Normalising the data is easy in Pandas: we simply subtract the mean and divide by the standard deviation. Let's do that and then calculate the *mean* and *standard deviation* of the columns of $X$ once more.

In [None]:
# Subtract each column's mean and divide by its standard deviation
X = (X - X.mean()) / X.std()
X.agg(['mean', 'std']).head()

We see that the standard deviation (std) is equal to $1$ (why does this mean the variance is also equal to $1$?). We also see that the mean is very close to zero (1.0e-17 is a very small number). This is close enough for our purpose.
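
One way to check the link between standard deviation and variance is to compute both on a small example. The sketch below uses a hypothetical toy DataFrame (`X_toy`, with made-up values) rather than the diabetes data. Since the variance is the square of the standard deviation, a standard deviation of $1$ gives a variance of $1^2 = 1$. Note that the pandas `std` and `var` methods both default to the sample estimate (`ddof=1`), so they are consistent with each other.

```python
import pandas as pd

# Hypothetical toy attributes (values are made up)
X_toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [10.0, 20.0, 40.0, 80.0],
})

# Same normalisation as above: subtract the mean, divide by the standard deviation
X_norm = (X_toy - X_toy.mean()) / X_toy.std()

print(X_norm.mean())  # very close to zero, up to floating-point rounding
print(X_norm.std())   # 1 for every column, up to rounding
print(X_norm.var())   # the square of the std, so also 1 up to rounding
```
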

The target array $y$ and the attributes array $X$ are now ready to be input into a machine learning algorithm.
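
As a rough illustration of what this last step might look like, the sketch below fits ordinary least squares with NumPy to a hypothetical toy dataset. The choice of algorithm, the toy values, and the names `df_toy`, `A`, and `coef` are all assumptions made for this example. Any supervised method that accepts an attribute array and a target array could be used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the target is constructed as exactly 2*a + 3*b + 1
df_toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 1.0, 4.0, 3.0, 6.0],
})
df_toy['Y'] = 2 * df_toy['a'] + 3 * df_toy['b'] + 1

# Format: separate the target from the attributes
y_toy = df_toy['Y']
X_toy = df_toy.drop('Y', axis=1)

# Normalise: zero mean and unit standard deviation per column
X_toy = (X_toy - X_toy.mean()) / X_toy.std()

# Fit ordinary least squares, with a column of ones for the intercept
A = np.column_stack([np.ones(len(X_toy)), X_toy.to_numpy()])
coef, *_ = np.linalg.lstsq(A, y_toy.to_numpy(), rcond=None)

# Because the toy target is exactly linear, the fit should reproduce it
y_pred = A @ coef
print(np.allclose(y_pred, y_toy.to_numpy()))
```
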