## Training and Testing
The real test of a machine learning algorithm comes when it is deployed and makes predictions on *new*, unseen data for which the target variable is unknown. An example is an algorithm predicting the level of disease progression for a diabetes patient whose attributes it has not seen before.

To estimate how a machine learning algorithm will perform on new data, we need to hold back some of the available dataset to test the algorithm after it has been trained. To do this, we split the dataset into what we refer to as a *training set* and a *test set*. In this notebook we will explore this idea further.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
%matplotlib inline

### Splitting a dataset into training and test sets
There are a few different ways to split a dataset into a training set and a test set. Some key decisions are:
* What proportion of the dataset is going to be saved for testing?
* How will the data points be allocated to the training and test sets?

The size of the test set (as a proportion of the dataset) depends on the problem and the data, but generally it will be in the range $[0.1,0.5]$, leaving $[0.5,0.9]$ of the dataset for training. It should be large enough to give a confident estimate of future performance, but small enough that the performance of the algorithm will not change too much if we retrain on the whole dataset.

The method for allocating data points to the training and test sets will also depend on the problem and the data, but a common approach is to allocate them at random. Let's explore what this looks like for a toy dataset $X$.

Run the code below.

In [None]:
# Generate synthetic data
X = np.arange(20)
print(f"Dataset: {X}")

# Specify size of test set as fraction of dataset
test_size=0.2

# Generate training and test sets
np.random.shuffle(X)
n = round(len(X)*test_size)
X_train, X_test = X[n:], X[:n]
print(f"Training set: {X_train}")
print(f"Testing set: {X_test}")

Now try changing `test_size` and re-running the above code.
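
In practice the features and the target have to be shuffled with the same ordering, so that each value of $X$ stays paired with its $y$ value. Below is a minimal sketch of one way to do this. It assumes a hypothetical helper `split_indices` (not part of NumPy or scikit-learn) and made-up toy arrays, and it uses a seeded generator so the split is repeatable.

```python
import numpy as np

def split_indices(n_samples, test_size, seed=0):
    """Hypothetical helper: return (train_idx, test_idx) for a random split."""
    rng = np.random.default_rng(seed)      # seeded so the split is repeatable
    idx = rng.permutation(n_samples)       # shuffled copy of 0..n_samples-1
    n_test = round(n_samples * test_size)  # same rounding as the cell above
    return idx[n_test:], idx[:n_test]

# Toy features and a target that is a known function of them
X_toy = np.arange(20)
y_toy = X_toy * 10

# Index both arrays with the same shuffled indices to keep pairs together
train_idx, test_idx = split_indices(len(X_toy), test_size=0.2)
X_toy_train, X_toy_test = X_toy[train_idx], X_toy[test_idx]
y_toy_train, y_toy_test = y_toy[train_idx], y_toy[test_idx]

print(f"Training set: {X_toy_train} -> {y_toy_train}")
print(f"Testing set: {X_toy_test} -> {y_toy_test}")
```

Splitting on indices rather than shuffling the array in place also leaves the original `X_toy` untouched.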

### Evaluating model performance on the test set
Luckily for us, scikit-learn has built-in features for dealing with training and testing. Let's introduce those while bringing together all the concepts covered so far to train a machine learning regression model and evaluate it on a test set of the diabetes data.

Step by step:

(1) Load the data

In [None]:
df_diabetes = pd.read_csv('Data/diabetes.tsv', sep='\t', header=0)
df_diabetes.head()

(2) Format and normalise (pre-process) the data into $X$ and $y$

In [None]:
y = df_diabetes['Y']
X = df_diabetes.drop('Y', axis = 1)
X=(X-X.mean())/X.std()

(3) Split $X$ and $y$ into training and test sets and check the shapes.

In [None]:
test_size = 0.2
# Pass the variable rather than a hard-coded value; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)
print(X.shape)
print(y.shape)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
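
To see what shapes to expect, here is a small sketch on synthetic data. The array sizes and the `_demo` names are made up for illustration and are not taken from the diabetes dataset. With `test_size=0.2`, a fifth of the rows should end up in the test set, and fixing `random_state` should give the same split each time.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 rows, 3 features (sizes chosen only for illustration)
X_demo = np.arange(150).reshape(50, 3)
y_demo = np.arange(50)

# Two calls with the same random_state
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

print(Xa_train.shape, Xa_test.shape)     # 40 training rows, 10 test rows
print(np.array_equal(Xa_test, Xb_test))  # identical splits for the same seed
```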

(4) Fit a linear regression model to the training set and calculate predictions for the *test* set.

In [None]:
lmod = LinearRegression()
lmod.fit(X_train, y_train)
y_pred = lmod.predict(X_test)

(5) Use the predictions to calculate the RMSE of the model on the *test* set.

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(rmse)
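
The RMSE is the square root of the mean of the squared differences between the true and predicted values. As a rough illustration, the same calculation can be written out with NumPy. The helper `rmse_by_hand` and the numbers below are hypothetical and are not results from the diabetes model.

```python
import numpy as np

def rmse_by_hand(y_true, y_hat):
    """Hypothetical helper: root mean squared error, written out step by step."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_hat, dtype=float)
    return np.sqrt(np.mean(errors ** 2))

# Made-up values: every prediction is off by exactly 10
y_true_demo = [100, 150, 200, 250]
y_hat_demo = [110, 140, 190, 260]
print(rmse_by_hand(y_true_demo, y_hat_demo))
```

Because the RMSE is in the same units as the target, it can be read directly as a typical size of the prediction error.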

(6) Plot y_test against y_pred

In [None]:
results = pd.DataFrame({'y_test':y_test,'y_pred':y_pred})
results.plot.scatter(x='y_test', y='y_pred');