In [None]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_boston  # note: deprecated in scikit-learn 1.0 and removed in 1.2, so this needs an older version
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import Ridge

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

from sklearn.metrics import mean_squared_error

# Boston Housing with Linear Regression

In this notebook we will work with the classic [Boston Housing dataset](https://scikit-learn.org/stable/datasets/index.html#boston-house-prices-dataset). 

Let's load the data and take a look at the information.

In [None]:
data = load_boston()

In [None]:
print(data.DESCR)

In [None]:
X = data.data
y = data.target

# Explore the data

The first step of any machine learning project is to explore the data a little bit. We already have a general idea because of the description given to us, but let's dig in a tiny bit more.

In [None]:
# printing the shape of your X matrix is always a good idea.
X.shape

We have 506 samples (housing records) and 13 columns (features).

In [None]:
y.shape

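These are the 506 "labels", or put another way, the "thing we are interested in predicting". In this case they are median home values: each record's target is the median value of owner-occupied homes in that area, in thousands of dollars, not the sale price of a single house.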

Let's put our data into a dataframe with pandas and examine the first 5 rows of each.

In [None]:
X = pd.DataFrame(X, columns=data.feature_names) # little trick to get the column names in correctly
y = pd.Series(y)

In [None]:
X.head()

In [None]:
y.head()

Ok, so we can glance over the first 5 rows just to get a "feel" for the data. Do we notice anything interesting? Well, we see that the range of values differs a lot from column to column, and we need to deal with that. So we can add scaling our data to our list of things to do. Before we run off and start scaling though, let's take a look at some basic statistics.

### pandas.DataFrame.describe()

This function is very useful to get a "feeling" for the dataset that you are working with. Describe will give you the common statistics of each column: the count, mean, std (standard deviation), min, max, and quartiles (25%, 50%, 75%). It's nice to scroll through and see if anything interesting pops out. Looking through the output below, do you notice anything interesting?

In [None]:
X.describe()

Here are a few observations. The `B` column has most of its data in the 375-400 range; we can tell because its 25% through max rows all fall in that range. That means values far below that range are probably outliers. The standard deviation for `AGE` is about 28, which is roughly 40% of its mean (about 69), so the `AGE` column probably has a _lot_ of variation in it. Finally, scanning the `count` row, every column has 506, which suggests we don't have any missing values, but we should double check that. Note that these kinds of observations may lead nowhere or they may lead somewhere. Getting to know your data is very important, maybe the most important thing you can do as a data scientist. So definitely take your time and write down your thoughts.
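One way to make the "how spread out is this column?" check concrete is to divide each column's standard deviation by its mean. The sketch below does this on a small made-up DataFrame; the names `toy`, `steady`, and `spread` and all the values are hypothetical, not the Boston data.

```python
import pandas as pd

# Hypothetical toy data standing in for a real feature matrix.
toy = pd.DataFrame({
    "steady": [10.0, 10.5, 9.5, 10.0],  # values close to the mean
    "spread": [1.0, 20.0, 5.0, 14.0],   # values far from the mean
})

# describe() returns a DataFrame indexed by statistic name.
stats = toy.describe()

# Ratio of standard deviation to mean for each column.
relative_spread = stats.loc["std"] / stats.loc["mean"]
print(relative_spread)
```

Both toy columns have the same mean, so the larger ratio for `spread` comes entirely from its larger standard deviation.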

Let's check for any empty values in our data now.

In [None]:
X.info() # this is a simple function that will get us some basic info about our dataframe.
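If you prefer an explicit count of missing values per column, `isna().sum()` is one option. This is a minimal sketch on a made-up DataFrame (the name `toy_missing` and its values are hypothetical), where one value has been deliberately set to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one deliberately missing value.
toy_missing = pd.DataFrame({
    "a": [1.0, 2.0, np.nan],
    "b": [4.0, 5.0, 6.0],
})

# isna() marks missing cells as True; summing counts them per column.
missing_per_column = toy_missing.isna().sum()
print(missing_per_column)

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())
print(total_missing)
```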

## Scaling our dataset

Ok, we need to scale our dataset. We learned about two scalers, so you can pick either one!

* StandardScaler
* MinMaxScaler

The order of operations goes like this:
1. initialize a scaler
2. call `fit` on our dataset
3. call `transform` on our dataset and assign the result to a new variable.

Note: `.fit` and `.transform` are called separately for some sneaky tricky reasons we will get into a bit later. For now just think to yourself about _why_ sklearn _might_ have separate `.fit` and `.transform` functions on its scalers.
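The three steps above can be sketched on a tiny made-up array, so you can see the mechanics before applying them to the real dataset. The names `toy`, `scaler_demo`, and `toy_scaled` are hypothetical, and `StandardScaler` is used here only as an example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

# 1. initialize a scaler
scaler_demo = StandardScaler()

# 2. fit: the scaler learns each column's mean and standard deviation
scaler_demo.fit(toy)
print(scaler_demo.mean_)

# 3. transform: apply the learned statistics and keep the result
toy_scaled = scaler_demo.transform(toy)
print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, even though they started on very different scales.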

In [None]:
## initialize a scaler here

scaler = # your choice of scaler here (check the imports to remember their names)

In [None]:
## fit your scaler here -- fitting stores what it learns on the scaler itself (`fit` returns the scaler), so you don't need to capture the result

scaler.fit(#put in our dataset here#)  

In [None]:
## transform our dataset now and return the value

X_transformed = #use the scaler transform function on our dataset to return a new dataset

# Train a model

Ok it's time to train our model.  We will want to choose one of the three options we imported earlier.

* LinearRegression
* SGDRegressor
* Ridge

## Scikit-Learn API 

Many (most?) objects in scikit-learn have a `.fit` method, which tells the object to learn whatever it was designed to learn. Objects that have a `.fit` method usually also have either a `.transform` or a `.predict` method. Objects with `.transform` are transformers (like the scaler we just used): their job is to learn some statistics from the data and transform it according to some logic (using the statistics that were learned). Objects with `.predict` are machine learning algorithms: when you call `.fit` on them, they learn whatever rules or functions the algorithm is designed to learn. When you call `.predict` on them, they take one or more data points as input and give you a prediction for each.

So, in summary we frequently see three methods

* `.fit` : this tells the object to learn
* `.transform` : this tells the object to apply its learning in the form of a transformation to the input. This method _returns_ an object (the transformed data)
* `.predict` : this tells the object to make a prediction on the input, using whatever function the model learned (from the `.fit` call it made earlier)
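To see `.fit` and `.predict` working together, here is a minimal sketch on made-up data that follows `y = 2x + 1` exactly. The names `x_toy`, `y_toy`, `model_demo`, and `preds` are hypothetical, and `LinearRegression` is just one of the possible model choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one feature, target is exactly 2x + 1.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * x_toy.ravel() + 1.0

# .fit: the model learns a slope and an intercept from the data
model_demo = LinearRegression()
model_demo.fit(x_toy, y_toy)
print(model_demo.coef_, model_demo.intercept_)

# .predict: apply the learned function to new inputs
preds = model_demo.predict(np.array([[4.0], [5.0]]))
print(preds)
```

Because the toy data is perfectly linear, the model should recover a slope close to 2 and an intercept close to 1.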

In [None]:
# initialize a model here

reg = # initialize your model here

In [None]:
# Train your model here with .fit


## umm... are we done?
So we fit a model, right? Unlike our previous work, we can't really plot this model: it's not a line in 2D, it's a hyperplane defined over 13 features. So what can we do? We need to evaluate our model somehow! What should we do?
How about we make a bunch of predictions and see how close they come to the real values? We can evaluate the model's mean squared error, the same metric we used to optimize it.

# Evaluate our model

Ok, let's use our trained model to make predictions, then we can evaluate those predictions against the real known `y` values. You will need to use your model's `.predict()` function, which needs some input to predict on.

In [None]:
## use your model to make predictions on the data

y_pred = # your model.predict here  

Ok, now we need to evaluate our predictions. We will use scikit-learn's built-in mean squared error metric for this. It's very important that you pass your arguments correctly to the evaluation function. All scikit-learn metrics use the `y_true, y_pred` format, which means you pass the ground truth first, followed by the prediction.

Well, actually, in this case MSE is the same regardless of which argument goes first, because the difference between the two is squared. That isn't true for every metric, so it's a good habit to keep the order right.
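Here is a small sketch of that point on made-up numbers (the names `y_true_toy` and `y_pred_toy` are hypothetical). It computes MSE by hand, checks that swapping the arguments makes no difference, and then uses `r2_score` as an example of a metric where the order does matter.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical ground truth and predictions.
y_true_toy = np.array([3.0, 5.0, 7.0])
y_pred_toy = np.array([2.0, 5.0, 9.0])

# MSE by hand: the mean of the squared differences.
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)

# Same value from scikit-learn, in either argument order.
mse_sklearn = mean_squared_error(y_true_toy, y_pred_toy)
mse_swapped = mean_squared_error(y_pred_toy, y_true_toy)
print(mse_manual, mse_sklearn, mse_swapped)

# R^2 depends on the variance of y_true, so swapping changes it.
r2_correct = r2_score(y_true_toy, y_pred_toy)
r2_swapped = r2_score(y_pred_toy, y_true_toy)
print(r2_correct, r2_swapped)
```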

In [None]:
score = mean_squared_error(y, y_pred)
print(f"the MSE is: {score}")

## That's a wrap!

So... are we done now? Are you happy with our model? What does our MSE even mean? Is it good? Is it bad? Would you be ready to roll this model out to production?
What questions do you have?

## Is this Model Overfit?
How do we have any idea if this model is overfit or not? Ideally we'd be able to plot it and visualize it. But we cannot! It is in 13D.
Do you have any reason to believe this model would _generalize_ to unseen data?  If yes, what is it? If no, why not?
Something _else_ is needed.... what is it?

## Please go on the forums and post about this question!

Try not to read the responses of other people who are ahead of you in the course. In fact, when you get ahead, make sure to help the newbies out and not spoil the surprise for them!
What is the funny feeling in our stomach from? What "smells" wrong here?

[LINK TO FORUM TOPIC](https://forum.codingnomads.co/t/post-your-answer-to-the-question-what-is-funny-in-my-boston-housing-problem-is-my-model-good/344)

## Bonus work
Try fitting all 3 of the models we imported. Further, see if you can make the MSE lower by adjusting any of the parameters (regularization?) that we have learned about so far.