## Regression Models
Regression models are an important set of tools in machine learning. In this notebook we will introduce the idea of regression. Specifically, we will cover (1) fitting a linear model to one dimension, and (2) evaluating the performance of the model on the training set.

First let's import the required packages.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
%matplotlib inline

You will see that we are using a new package, sklearn (scikit-learn), which is a free software machine learning library for the Python programming language. The new tools we have imported are the `LinearRegression` model (which we will use to perform regression) and the `mean_squared_error` metric (which we will use to measure the performance of the model).

### Fitting a linear model to one dimension

Let's import the diabetes data from the previous notebook.

In [None]:
df_diabetes = pd.read_csv('Data/diabetes.tsv', sep='\t', header=0)
df_diabetes.head()

And let's re-create the scatter plot of **Y** against **BMI**.

In [None]:
df_diabetes.plot.scatter(x='BMI', y='Y');

To capture the relationship between **Y** and **BMI** using regression, we will map the **Y** column to a target variable $y$ and the **BMI** column to an attribute variable $X$, and model the relationship between them with a linear model.
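
For reference, a one-dimensional linear model of this kind can be written as below. The symbols $\beta_0$ (intercept), $\beta_1$ (slope) and $\epsilon_i$ (the part of $y_i$ the line does not explain) are notation introduced here for illustration; they are not used elsewhere in this notebook.

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

Fitting the model by ordinary least squares, which is what `LinearRegression` does, means choosing $\beta_0$ and $\beta_1$ to minimise the sum of squared residuals:

$$\min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$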

First, let's create our target and attributes, and check the shape.

In [None]:
X = df_diabetes['BMI']
y = df_diabetes['Y']
print(X.shape)
print(y.shape)

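Sklearn expects attribute data to be two-dimensional, with one row per sample and one column per attribute, but our $X$ is currently one-dimensional. We therefore need to add the second dimension manually. It's as easy as redefining $X$ to be: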

In [None]:
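# Convert to a NumPy array first: recent pandas versions no longer allow Series[:, np.newaxis]
X = df_diabetes['BMI'].to_numpy()[:, np.newaxis]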
print(X.shape)
print(y.shape)
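
To see what adding the second dimension does, here is a small sketch on a made-up column. The `bmi` values below are hypothetical and only stand in for the real **BMI** data; the point is the change in shape, not the numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the BMI column (values are made up for illustration)
bmi = pd.Series([21.5, 30.1, 25.3, 27.8], name='BMI')

X_1d = bmi.to_numpy()          # one axis only: shape (4,)
X_2d = X_1d[:, np.newaxis]     # 4 samples, 1 attribute: shape (4, 1)
X_alt = X_1d.reshape(-1, 1)    # same result; -1 lets NumPy infer the row count

print(X_1d.shape, X_2d.shape, X_alt.shape)  # (4,) (4, 1) (4, 1)
```

Either `[:, np.newaxis]` or `reshape(-1, 1)` should work here; which one to use is mostly a matter of taste.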

Let's now create an instance of a linear model and fit the linear model to the data.

In [None]:
lmod = LinearRegression()
lmod.fit(X, y);
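
Once a `LinearRegression` has been fitted, the slope and intercept of the line are stored in its `coef_` and `intercept_` attributes. The sketch below shows this on a tiny synthetic dataset that lies exactly on the line $y = 2x + 1$; the names `x_demo`, `y_demo` and `demo_model` are made up for this example and are separate from the diabetes data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1 (an assumption for illustration)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X_demo = x_demo[:, np.newaxis]      # sklearn needs shape (n_samples, n_features)
y_demo = 2.0 * x_demo + 1.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

slope = demo_model.coef_[0]         # one coefficient per attribute
intercept = demo_model.intercept_
print(slope, intercept)             # approximately 2.0 and 1.0
```

The same two attributes can be read from `lmod` to see the line fitted to the **BMI** data.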

Finally, let's ask the model to make predictions for $X$ and see how well the model performs.

In [None]:
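# Predict y for every BMI value the model was trained on
y_pred = lmod.predict(X)
results = pd.DataFrame({'X': df_diabetes['BMI'], 'y': y, 'y_pred': y_pred})
# Plot the actual values, then overlay the predictions in black on the same axes
ax1 = results.plot.scatter(x='X', y='y');
ax2 = results.plot.scatter(x='X', y='y_pred', ax=ax1, c='k');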

### Evaluating the performance of the model on the training set
While the above plot is useful, it would be better if we could quantify the error between the actual values of $y$ and the predicted values $y_{pred}$. 

For this problem, we do this using the Root Mean Squared Error (RMSE), which is calculated as follows.

In [None]:
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(rmse)
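
To make the calculation less of a black box, the RMSE can also be computed by hand as $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $\hat{y}_i$ is the prediction for sample $i$. The sketch below does this on a few made-up numbers (`y_true_demo` and `y_pred_demo` are hypothetical, not taken from the diabetes data) and checks that it agrees with the sklearn-based version.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.0, 5.0, 8.0, 11.0])

residuals = y_true_demo - y_pred_demo            # [1, 0, -1, -2]
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)

# Same quantity via sklearn's mean_squared_error, as in the cell above
rmse_sklearn = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

print(rmse_manual, rmse_sklearn)
```

Because the errors are squared before averaging, large misses count for more than small ones, and the final square root puts the result back in the same units as $y$.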