# Sklearn

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# have plots render in notebook
%matplotlib inline

Scikit-learn, or sklearn, is one of the most commonly used Python packages for machine learning. It contains many different models for a wide range of problems, and is usually a good place to get started.

First, let's load in the house data we used in the pandas exercise.

In [None]:
df = pd.read_csv("kc-house-data.csv")

In [None]:
df.head()

The id and date will be used as the index, so let's set them.

#### Task: Convert the `date` column to a datetime type.

In [None]:
# your code goes here

#### Task: Now set a multi-index using first the `id` column and then the `date` column.

In [None]:
# your code goes here

We are going to try to predict the price of a house given the many other features in the data. To do this we will use perhaps the simplest machine learning model: linear regression.

Linear regression assumes that the relationship between the features and the target is linear, i.e. for a target $ y $ and features $x_j$ one has that
$$ y = \beta_0 + \sum_{j=1}^k \beta_j x_j + \epsilon $$ where
- k is the number of features
- each $\epsilon$ is normally distributed with mean zero and a fixed variance, and is independent of the features $x_j$

The term simple linear regression is often used to refer to the case when $k=1$.

In the above the prediction is
$$ \beta_0 + \sum_{j=1}^k \beta_j x_j $$
and the $\epsilon$ is the difference between the true value $y$ and the prediction. Fitting a machine learning model generally means minimising some loss function, and in this case the loss function is the sum of the squared errors over the data the model is fitted on.
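
As a rough illustration of what minimising the sum of squared errors means, the sketch below fits a one-feature model to synthetic data using NumPy's least-squares solver. The data, the coefficients 3 and 2 used to generate it, and all the `toy_` names are made up for this example; none of them come from the house data.

```python
import numpy as np

# synthetic data: y = 3 + 2x + noise (made-up coefficients, for illustration only)
rng = np.random.default_rng(0)
toy_x = rng.uniform(0, 10, size=200)
toy_y = 3.0 + 2.0 * toy_x + rng.normal(0, 1, size=200)

# design matrix: a column of ones (for beta_0) next to the feature column
toy_design = np.column_stack([np.ones_like(toy_x), toy_x])

# least-squares solution: the betas that minimise the sum of squared errors
toy_betas, *_ = np.linalg.lstsq(toy_design, toy_y, rcond=None)

# the value of the loss at the fitted coefficients
toy_errors = toy_y - toy_design @ toy_betas
toy_sse = np.sum(toy_errors**2)

print(toy_betas)  # should be close to the 3 and 2 used to generate the data
print(toy_sse)
```

Any other choice of coefficients, including the "true" 3 and 2, gives a sum of squared errors at least as large on this data.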

Let's set up our data for prediction.

In [None]:
target = df["price"]
features = df.drop("price", axis=1)

Below we import the linear regression class from sklearn. We have not come across classes directly yet in Python, but most machine learning libraries use them to represent models.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
simple_model = LinearRegression()

The above line of code has created a linear regression model, but it has not yet been fit to any data. We do this using one of the class's many methods, namely `fit`. A method is essentially just a function that is attached to the class, and it has access to any data stored on the object it is called on.
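
If classes are new to you, the toy example below may help. `MeanModel` is a hypothetical class invented for this sketch (it is not part of sklearn); it only shows how a method can store data on an object and how another method can read it back. The trailing underscore on `mean_` mimics the sklearn convention for attributes that are set during fitting.

```python
class MeanModel:
    """A toy 'model' that always predicts the mean of the targets it was fitted on."""

    def __init__(self):
        # no data has been seen yet, so there is nothing to store
        self.mean_ = None

    def fit(self, y):
        # a method can write data onto the object via `self`
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, n):
        # another method can read that stored data back
        return [self.mean_] * n


mean_model = MeanModel()        # create the object, not yet fitted
mean_model.fit([1.0, 2.0, 3.0])  # fitting stores the mean on the object
mean_predictions = mean_model.predict(2)
print(mean_model.mean_, mean_predictions)
```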

We'll first fit a model with just the sqft living space as a feature.

In [None]:
# note for the first argument we pass a list of column names to the dataframe, with just one entry
# this is because the first argument needs to be 2-dimensional, and passing just the string
# would create a 1-dimensional pandas series
simple_model.fit(features[["sqft_living"]], target)
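
To see the difference between the two ways of indexing, here is a small sketch on a made-up dataframe (`toy_frame` and its values are assumptions for illustration, not rows from the house data).

```python
import pandas as pd

# a tiny made-up dataframe, standing in for the real features
toy_frame = pd.DataFrame(
    {"sqft_living": [1000, 1500, 2000], "bedrooms": [2, 3, 4]}
)

one_dim = toy_frame["sqft_living"]    # a string selects a Series
two_dim = toy_frame[["sqft_living"]]  # a list of names selects a DataFrame

print(type(one_dim).__name__, one_dim.shape)  # Series (3,)
print(type(two_dim).__name__, two_dim.shape)  # DataFrame (3, 1)
```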

We've now fitted our model, but what coefficients do we get out? And how can we find out how good the model is? The coefficients can be accessed as attributes of the fitted model.

In [None]:
beta0 = simple_model.intercept_  # this is the intercept of the line
# coef_ is a numpy array of the feature coefficients, in this case of length 1
beta1 = simple_model.coef_[0]

Let's use these coefficients to make predictions!

#### Task: Write a function that takes a feature value, or a numpy array of feature values, and returns the prediction(s).

In [None]:
def get_prediction(x):
    """
    x: float, np.array
        The feature value, or an array of feature values
    """
    # your code here

The `LinearRegression` class unsurprisingly can do this automatically, using the `predict` method. It also has a `score` method which can be used to evaluate the model. Let's try it out now.
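
Before running these on the house data, here is a minimal sketch of both methods on made-up points that lie exactly on the line $y = 1 + 2x$. The arrays and the name `line_model` are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# made-up points lying exactly on the line y = 1 + 2x
line_X = np.array([[0.0], [1.0], [2.0], [3.0]])  # 2-dimensional: 4 rows, 1 column
line_y = 1.0 + 2.0 * line_X[:, 0]

line_model = LinearRegression().fit(line_X, line_y)

# predict also expects a 2-dimensional input, one row per prediction
line_predictions = line_model.predict(np.array([[4.0], [5.0]]))
# score returns R^2, which is (numerically) 1 for a perfect fit
line_r2 = line_model.score(line_X, line_y)

print(line_predictions)  # approximately 9 and 11
print(line_r2)
```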

In [None]:
simple_model.score(features[["sqft_living"]], target)

The metric this has returned is called $R^2$ (the coefficient of determination), which you may have come across before. It is related to the sum of the squared errors by the following formula:
$$ R^2 = 1 - \frac{\sum_{i=1}^n \epsilon_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2} $$ where
- $y_i$ refers to the $i$-th of $n$ datapoints
- $\epsilon_i$ refers to the $i$-th error
- $\bar{y}$ is the mean of the $\{y_i\}_{i=1}^n$
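
To make the formula concrete, here is a small worked example with made-up targets and predictions (the numbers are assumptions chosen so the arithmetic is easy to follow by hand).

```python
import numpy as np

# made-up true values and predictions, four datapoints
toy_true = np.array([1.0, 2.0, 3.0, 4.0])
toy_pred = np.array([1.5, 1.5, 3.5, 3.5])

# numerator: sum of squared errors, 4 * 0.5**2 = 1.0
toy_errors = toy_true - toy_pred
sum_squared_errors = np.sum(toy_errors**2)

# denominator: squared deviations from the mean 2.5, 2.25 + 0.25 + 0.25 + 2.25 = 5.0
total_sum_of_squares = np.sum((toy_true - toy_true.mean()) ** 2)

toy_r2 = 1 - sum_squared_errors / total_sum_of_squares
print(toy_r2)  # 1 - 1.0 / 5.0 = 0.8
```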

#### Task: Verify that the score output by the above method is the same as the value given by the right-hand side of the formula.

In [None]:
# your code here

How good is this value for $R^2$? The answer depends very much on the problem; for some problems an $R^2$ of 0.5 is fairly bad, whereas for others it could be world class. What we can say is that, on a given dataset, the closer $R^2$ is to 1.0 the smaller the squared errors of our predictions.

Since we've only used one feature we can easily plot the feature against the target along with a line of best fit. To do this we'll use matplotlib, one of the most popular plotting libraries in Python. Matplotlib actually has two APIs, and you can mix them within the same piece of code. We focus on just one of them below.
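
For reference, the two interfaces are the `pyplot` interface (functions called on `plt`, which is what this notebook uses) and the object-oriented interface (methods called on explicit figure and axes objects). The sketch below shows the object-oriented style on made-up points; the data and labels are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up points, roughly on a straight line
demo_x = np.array([0.0, 1.0, 2.0, 3.0])
demo_y = np.array([1.0, 2.9, 5.2, 6.8])

# object-oriented interface: create the figure and axes explicitly...
fig, ax = plt.subplots(figsize=(5, 5))
# ...then call methods on the axes object rather than functions on plt
ax.scatter(demo_x, demo_y)
ax.plot((0, 3), (1, 7), color="k")  # a line from (0, 1) to (3, 7)
ax.set_xlabel("toy feature")
ax.set_ylabel("toy target")
```

In a notebook with `%matplotlib inline` the figure renders when the cell finishes running.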

In [None]:
# this creates a square graph, size 10 by 10
plt.figure(figsize=(10, 10))
# a scatter plot with sqft_living on the x-axis and price on the y-axis
plt.scatter(features["sqft_living"], target)
# a line connecting two points, arguments in the form (x1, x2), (y1, y2)
plt.plot((0, 14000), (get_prediction(0), get_prediction(14000)), color="k")
# the next two lines add labels to the two axes
plt.xlabel("SQFT Living Space")
plt.ylabel("House Price")
# finally we display the graph
plt.show()

We'll get lots more practice with matplotlib throughout this training. Let's put what we've learned above into an exercise.

#### Task: Write a function that takes a feature name as input and produces a graph like the one above:
- the feature on the x-axis, the price on the y-axis
- a line of best fit through the points
- print out the $R^2$ as well to 3 d.p. (use the built-in function `round`)

In [None]:
def plot_feature(feature_name):
    """
    feature_name: str
        a column in the `features` dataframe
    """
    # your code here

Of course we don't have to use just one feature; we can use as many as we want. Let's try using all of them.

#### Task: Fit a linear model with all the features included and calculate the $R^2$

In [None]:
# your code here

Let's try to rewrite our `get_prediction` function from above using this new model. Of course the `predict` method still exists, but this way we can practice our numpy.

#### Task: Write a function that takes a numpy array of feature values and returns the predictions.

In [None]:
def get_prediction(array):
    """
    array: np.array
        an array of feature values, in the correct column order
    """
    # your code here

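The $R^2$ has improved quite a bit, but as we've included a lot more features we should check that the model has not overfit, i.e. that it does not perform much worse on data it was not trained on. To do this we use a function called `train_test_split` from sklearn, which holds back part of the data for testing.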

In [None]:
from sklearn.model_selection import train_test_split

Let's see what arguments it takes:

In [None]:
train_test_split?

To use `train_test_split` we pass it a sequence of arrays and specify the `test_size` parameter, here the fraction of rows to hold back for testing.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2
)
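
To see what comes back, here is a small sketch on made-up arrays with 10 rows. The `toy_` names and values are assumptions for illustration; `random_state` is an optional argument that makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: row i of the features is [2i, 2i + 1] and its target is i
toy_features = np.arange(20).reshape(10, 2)
toy_target = np.arange(10)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_features, toy_target, test_size=0.2, random_state=0
)

# 20% of 10 rows are held back for testing
print(toy_X_train.shape, toy_X_test.shape)  # (8, 2) (2, 2)
print(toy_y_train.shape, toy_y_test.shape)  # (8,) (2,)

# rows are shuffled, but features and targets stay aligned with each other
print(np.array_equal(toy_X_train[:, 0], 2 * toy_y_train))  # True
```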

Having split the data into a training and test set we can now see how well our model performs on unseen data.

#### Task: Fit a linear model on the training data, and compare its $R^2$ values on the training and testing sets. Is the model overfit?

In [None]:
# your code here

### Further Work

That's the end of this notebook. We'll dive into this material in a lot more depth in future sessions. Here are a couple of other things you could try if you have some spare time:
- The linear regression model didn't really overfit the data above. Try replacing it with a more advanced model from sklearn and repeat the last exercise (for example `RandomForestRegressor` from `sklearn.ensemble`).
- House prices actually form a time series, which means it's not really valid to use future house prices to make predictions of those in the past. A random split lets the model train on sales that happened after some of the test sales. Try to repeat the last exercise above, but make sure your training set happens earlier in time than your test set.
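
As a starting hint for the second idea, one possible approach is to pick a cutoff date and split on it. The sketch below does this on a made-up dataframe; the dates, prices and the `cutoff` value are all assumptions for illustration, and with the real data you would need to adapt it to wherever the dates live (for example the index).

```python
import pandas as pd

# a tiny made-up set of sales, deliberately not in date order
toy_sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2014-09-30", "2014-05-01", "2015-02-10", "2014-07-15", "2014-12-01"]
        ),
        "price": [310000, 300000, 360000, 320000, 350000],
    }
)

# hypothetical cutoff: everything before it is training data, the rest is test data
cutoff = pd.Timestamp("2014-11-01")

toy_sales = toy_sales.sort_values("date")
toy_train = toy_sales[toy_sales["date"] < cutoff]
toy_test = toy_sales[toy_sales["date"] >= cutoff]

print(len(toy_train), len(toy_test))  # 3 2
```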