# Regression models for House Pricing
In the last exercise we have prepared our dataset so that we can feed it to a machine learning model for regression. Building such a regression model will be the task of this exercise.

First, we **import** the standard libraries **numpy** as np and **pandas** as pd. Afterwards, we **read the pickled data** from the previous exercise. To do so, use the Pandas function **read_pickle()** with the file paths **'../data/houses_train.pkl'** and **'../data/houses_test.pkl'** as the arguments. Call the resulting dataframes **train** and **test**.

**Remark**: If the pickle files are not working, please use the csv files together with the method *.read_csv()*. Please set the optional argument *index_col* to 0 for the csv method. This will set the first column of the csv file to the index.
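The effect of *index_col* can be seen on a tiny example. The sketch below is only an illustration: the CSV text and its values are made up and stand in for one of the csv files of the exercise.

```python
import io
import pandas as pd

# Made-up miniature CSV; the first (unnamed) column holds the row labels
csv_text = ",LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n"

# index_col=0 turns the first column of the file into the dataframe index
demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(demo)
```

Without *index_col=0* the first column would show up as an ordinary column called 'Unnamed: 0'.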

In [None]:
# Import libraries


In [None]:
# Load dataframes


Before we train a model let us **separate** the **features** and the **target**. Hence, **create** variables called **X_train** and **X_test** which include all the data of the features from the training and test data, respectively, and variables called **y_train**, and **y_test** which contain only the target ('SalePrice').
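One possible way to split a dataframe into features and target is sketched below on a made-up miniature frame (the names toy, X_toy and y_toy as well as the values are assumptions for illustration; the real dataframes contain many more columns).

```python
import pandas as pd

# Made-up miniature frame with two features and the target column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "OverallQual": [7, 6, 7],
    "SalePrice": [208500, 181500, 223500],
})

X_toy = toy.drop(columns="SalePrice")  # every column except the target
y_toy = toy["SalePrice"]               # the target column only
print(X_toy.shape, y_toy.shape)
```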

In [None]:
# Separate features and target


## Regression Tree
Finally, we can train our first model: a regression tree. For this reason we import the following class:

In [None]:
# Just execute
from sklearn.tree import DecisionTreeRegressor

Please **create an object of that class called tree_reg**. As the argument of the constructor use **random_state=42**. For all the other parameters we use the default values. If you are interested in the possible arguments and default settings you can use 'Shift+Tab' or you may have a look at http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html.

In [None]:
# Create object of class DecisionTreeRegressor


Next, use the **fit(x,y)** method of the **tree_reg** object to train the model. As the argument use the training dataset **(X_train, y_train)**. The return value of that method is again the decision tree object, but this time with fitted model parameters. You do not need to assign the result to a new variable.

In [None]:
# Train the model


Unlike the preprocessing objects (e.g. Imputer), which provide a transform method, machine learning models provide a **predict(X) method**. For further details please have a look at the documentation.

Next, we want to use that predict method to compute predictions for house prices on our test dataset. Hence, please use the **predict** method of the **tree_reg** object on the test data **X_test** and save the result to a variable called **y_pred**.
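The fit/predict pattern can be tried out on synthetic data first. The following is a minimal sketch under the assumption of a made-up linear relationship; the names X_demo, y_demo and demo_tree are hypothetical and not part of the exercise.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends linearly on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 1, size=200)

demo_tree = DecisionTreeRegressor(random_state=42)

# fit returns the estimator itself, so no new variable is required
fitted = demo_tree.fit(X_demo[:150], y_demo[:150])

# predict returns a numpy array with one value per row of the input
demo_pred = demo_tree.predict(X_demo[150:])
print(fitted is demo_tree, type(demo_pred), demo_pred.shape)
```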

In [None]:
# Compute predictions for the test data


The return value is again a numpy array. We want to **compare** these values **to the true house prices** of the test dataset. For this reason, **construct** a dataframe called **results** which contains two columns: the true y values and the predictions. Call the two columns **'y_true'** and **'y_pred'**, respectively. Afterwards, print the first 10 values of the dataframe.

**Hint**: An easy way to do that is by using the *dictionary*-method.
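As an illustration of the dictionary approach, the sketch below combines a pandas Series and a numpy array of the same length into one dataframe. All values and the names ending in _demo are made up.

```python
import numpy as np
import pandas as pd

# Made-up true values (Series with its own index) and predictions (plain array)
y_true_demo = pd.Series([200000, 150000, 320000], index=[11, 12, 13])
y_pred_demo = np.array([210000.0, 140000.0, 300000.0])

# The keys of the dict become the column names; the index comes from the Series
results_demo = pd.DataFrame({"y_true": y_true_demo, "y_pred": y_pred_demo})
print(results_demo.head(10))
```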

In [None]:
# Create result dataframe


### Evaluation of our Results

At first glance our results do not look that bad. Of course, we want to quantify the performance of the regression model by using a **performance metric**. One common metric, which is also used as the splitting criterion when the regression tree is built, is the **mean squared error** (mse), i.e. the mean of the squared residuals. Please perform the following steps to compute the mse: first, **add a new column called 'residuals'** to the results dataframe which contains the residuals (**differences between y_true and y_pred**). Afterwards, take the **mean of the squared residuals** by using the mean function of numpy.
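To make the definition concrete, here is a small worked example with made-up numbers (the dataframe demo is hypothetical and independent of the house price data).

```python
import numpy as np
import pandas as pd

# Four made-up observations and predictions
demo = pd.DataFrame({"y_true": [3.0, 5.0, 2.0, 7.0],
                     "y_pred": [2.0, 5.0, 4.0, 8.0]})

# Residuals: 1, 0, -2, -1
demo["residuals"] = demo["y_true"] - demo["y_pred"]

# Mean of the squared residuals: (1 + 0 + 4 + 1) / 4
mse_demo = np.mean(demo["residuals"] ** 2)
print(mse_demo)
```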

In [None]:
# Compute residuals and mse


Of course, we do not have to compute that metric manually every time. **Sklearn provides a lot of different metrics** for regression, classification and clustering. Please **import** the function **mean_squared_error** from the module **sklearn.metrics**. Afterwards, use that function to **crosscheck your result**. Furthermore, take the square root of the result. This gives the **root mean squared error** (rmse). It measures the typical size of a prediction error in the units of the target; if the residuals have zero mean, it equals their standard deviation.
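The crosscheck can be sketched on made-up numbers as follows; the arrays are hypothetical, only mean_squared_error itself comes from sklearn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)

# The rmse is simply the square root of the mse
rmse_demo = np.sqrt(mse_sklearn)
print(mse_sklearn, mse_manual, rmse_demo)
```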

In [None]:
# Import mse function


In [None]:
# Compute mse


Hopefully, you got the same result.

Next, let us **plot the predictions** against the true house price values. For this, you can use the **seaborn** function **lmplot**. As the arguments use **data=results** and **height=8** and set **x** and **y** to **'y_pred'** and **'y_true'**, respectively. However, do not forget to **import the module first** and to issue the command **%matplotlib inline** afterwards.

In [None]:
# Import seaborn


In [None]:
# Create lmplot


The results do not look that bad. Of course, we can see **a few outliers** which are always hard to predict. For a perfect model all the points would lie on the diagonal (the line y_true = y_pred).

Next, we look at the **distribution of the residuals**. Ideally, the residuals should be **normally distributed** around zero. If you see a completely different shape, the model probably does not describe the data well and may be conceptually wrong. Can you think of an example?

Please use the function **distplot** of seaborn to look at the distribution of the residuals. (distplot has been deprecated since seaborn 0.11; in newer versions **histplot** with *kde=True* is a comparable replacement.)

In [None]:
# Plot distribution of the residuals


This looks a bit like a normal distribution. However, we have large residuals for some outliers. Furthermore, the distribution is slightly right skewed. Do you have any idea why this could be the case?

During the theory part we have seen another performance measure for regression, the **r2 score**, which can be seen as the proportion of the **variance** of the target that is **explained** by our model.

**Bonus:**

First, compute this score by hand. To do so, compute the **sum of all squared residuals** and **divide it by the total sum of squares (proportional to the variance of y_true)**. Afterwards, **subtract the result from the value one**. Crosscheck your result with the **r2_score** function from the module **sklearn.metrics**. The formula is given by

$$ r^2 = 1 - \frac{\sum_i residuals_i^2}{\sum_i(y_{true,i}- \bar{y}_{true})^2}, $$

where $\bar{y}_{true}$ represents the mean value of the observations.
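The formula can be checked on a few made-up numbers. The sketch below uses hypothetical arrays and compares the manual result with r2_score from sklearn.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

# Numerator: sum of squared residuals (here 6.0)
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)

# Denominator: total sum of squares around the mean of y_true (here 14.75)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```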

In [None]:
# Compute r2 score by hand


In [None]:
# Compute r2_score with sklearn function


Actually, the mean squared error is not the best metric for our target variable, even though our model uses this measure during training. **Can you think of any problem with the mean squared error regarding the distribution and scales of our target**?

Maybe you came up with the actual problem. If not, we will explain that later during the discussion of the exercise. A possible solution to overcome this problem is taking the log of the house prices and using it as the new target.

Please **create the new target variables** **y_test_log** and **y_train_log**. Afterwards, **retrain the model** and compute again the **lmplot** and the **r2_score** with the log data.


**Hint**: Use np.log() and np.exp() for the transformations.
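The two transformations are inverse to each other, which the following sketch illustrates on made-up prices. The log predictions are invented offsets and do not come from any model.

```python
import numpy as np

# Made-up house prices and their logarithms
y_demo = np.array([120000.0, 250000.0, 480000.0])
y_demo_log = np.log(y_demo)

# Invented log predictions: true log value plus a small offset
pred_log = y_demo_log + np.array([0.05, -0.02, 0.10])

# np.exp brings predictions back to the original price scale
pred_back = np.exp(pred_log)

# Metrics on the original scale are computed after the back transformation
rmse_original = np.sqrt(np.mean((y_demo - pred_back) ** 2))
print(pred_back, rmse_original)
```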

In [None]:
# Compute the log of the target variable


Train another model called **tree_reg_log** using the logarithmic target.

**Remark**: Instantiate a new DecisionTreeRegressor. Set the random_state again to 42.

In [None]:
# Train a new model


Compute the logarithmic predictions using the test dataset.

In [None]:
# Compute log predictions on test data


Add **two more columns** to the result dataframe containing the **true logarithmic house price values** and the **logarithmic predictions**. Call the columns 'y_true_log' and 'y_pred_log', respectively.

In [None]:
# Add result columns


In [None]:
# log r2_score


**Bonus**:

Compute the r2_score, the mse and the rmse on the back-transformed log predictions and log observations, i.e. on the original price scale.

**Hint**: Use the inverse transformation.

In [None]:
# original scale


In [None]:
# mse and rmse


Use lmplot to plot the log predictions against the true values.

In [None]:
# Create lmplot


**Bonus**: 

Plot the distribution of the logarithmic residuals and of the back transformed residuals.

Furthermore, do the following: 

Think of a regression model with a single feature X and a target label y. If you train a regression tree, how many dimensions does the hyperrectangle have? To solve this task, please draw a scatter plot with a linear relationship between X and y. Furthermore, draw (in a qualitative way) four hyperrectangles into the graph and the regression predictions of the regression tree. In addition, draw a regression line of a linear model.

In [None]:
# Log residuals


In [None]:
# Backtransformed residuals


### Visualization of the regression tree

In this part we want to visualize a regression tree. Since the previously trained trees have been built completely and not been pruned, the resulting plot would be way too large. Hence, we train another tree by setting the hyperparameter **max_depth** to **4**. Please create such a tree. Call the object **reg_tree_fixedDepth** and set the **random_state to 42**.
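The effect of **max_depth** can be illustrated on synthetic data. In the sketch below the data and the names deep and shallow are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 1, size=300)

# Fully grown tree versus a tree limited to four levels of splits
deep = DecisionTreeRegressor(random_state=42).fit(X_demo, y_demo)
shallow = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_demo, y_demo)

print("deep:   ", deep.get_depth(), "levels,", deep.get_n_leaves(), "leaves")
print("shallow:", shallow.get_depth(), "levels,", shallow.get_n_leaves(), "leaves")
```

A tree of depth four has at most 2^4 = 16 leaves, which is small enough to plot.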

In [None]:
# Instantiate a tree with fixed max depth


Train the new tree on the log data.

In [None]:
# Train tree


To visualize the tree we have to **import** the function **export_graphviz** **from** the module **sklearn.tree**. Please do so.

In [None]:
# Import export_graphviz


To export the tree, just execute the command below. Make sure that your tree object has exactly the same variable name.

In [None]:
# Just execute
export_graphviz(reg_tree_fixedDepth, out_file="tree_4_reg.dot",
                feature_names=X_train.columns.tolist(),
                filled=True, rounded=True)

Next, we have to render the *tree_4_reg.dot* file. For this, we have to install another package. Open a terminal and execute the command **sudo apt install graphviz**. Once the package has been installed, you can execute the command in the cell below. Afterwards, you should find an image of the tree in your jupyter home folder.

**Remark:** If you cannot install the package, you can copy the content of the *tree_4_reg.dot* file to a web application http://webgraphviz.com/.
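If copying from the file is inconvenient, the DOT source can also be obtained as a string by passing *out_file=None*. The sketch below assumes a small tree trained on synthetic data; the names small_tree and dot_source are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_graphviz

# Synthetic data and a very small tree, just for illustration
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = X_demo[:, 0] + rng.normal(0, 0.1, size=100)
small_tree = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_demo, y_demo)

# With out_file=None the function returns the DOT source instead of writing a file
dot_source = export_graphviz(small_tree, out_file=None,
                             feature_names=["x0", "x1"],
                             filled=True, rounded=True)
print(dot_source[:200])
```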

**Can you describe the decision tree?**

**Bonus**: Please try to recompute some of the *samples* and *values* for the first two levels. Remember that the tree shows values of the training dataset and not the test dataset.

In [None]:
%%bash
dot -Tpng tree_4_reg.dot -o tree_4_reg.png

In [None]:
# First level


In [None]:
# Second level (left)


In [None]:
# Second level (right)


## Random Forest

During the theory part we have already learned that a single tree overfits the data pretty often. Hence, an ensemble of uncorrelated trees like the random forest could be a better choice.

Please **import** the **RandomForestRegressor** from the module **sklearn.ensemble** and create an object **rf_reg**. Afterwards, train the model on the log data and compute the r2 score on the log data.

**Remark**: Do not forget to set the **random_state** again to **42**.

In [None]:
# Import Random Forest


In [None]:
# Create object rf_reg


In [None]:
# Train model on log data


In [None]:
# Compute r2 score on log data


Hey, this is much better. Maybe we can even improve upon this result by adding more trees to the forest? Please create **another RandomForest** with the **number of trees** set to **n_estimators=100** and set the **random_state** again to **42**. After the **model training** please compute again the **r2 score**. (In scikit-learn versions before 0.22 the default was 10 trees; since 0.22 the default is already 100, so in that case you may want to try an even larger value.)

In [None]:
# Create random forest with more trees and retrain and reevaluate the model


Hm, not much better, but at least a bit. Adding more trees generally does not hurt the quality of the model; it only costs computation time. However, at some point more trees do not improve the model any further.
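The influence of the number of trees can be explored with a small experiment. The following sketch uses synthetic data and hypothetical variable names; the scores it prints say nothing about the house price data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic data, split by hand into a training and a test part
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(0, 0.5, size=400)
X_tr, X_te, y_tr, y_te = X_demo[:300], X_demo[300:], y_demo[:300], y_demo[300:]

# Train one forest per setting and store the r2 score on the test part
scores = {}
for n_trees in (10, 100):
    forest = RandomForestRegressor(n_estimators=n_trees, random_state=42)
    forest.fit(X_tr, y_tr)
    scores[n_trees] = r2_score(y_te, forest.predict(X_te))
print(scores)
```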

### Feature Importance

Now, let us go on to compute the **feature importance**. The most important feature is the one which has reduced the mean squared error the most, summed over all the splits in which it is used. The random forest (and also the decision tree) has an **attribute** called **feature\_importances\_**.

Please use that attribute and give the extracted result a new name called **feature\_importances**. Afterwards, print the result.

In [None]:
# Extract feature importance


The result is an array which contains the proportion of the reduced mean squared error for each feature. Hence, summing over all values yields one. But which element corresponds to which feature? Fortunately, the order is the same as the column order in the feature dataframe X_train. Therefore, **extract the feature names** by accessing the **attribute columns** and transform it to a list by using the method **tolist()**. Call that list **features**.

In [None]:
# Create feature list


Next, we want to combine the feature names and the feature importances. There are several methods to do that.

One solution is creating a dataframe with those two columns:

Create a **dataframe** called **importance_df** using the *dictionary method* (pd.DataFrame(dict)). As keys of the dict use **'feature'** and **'importance'**, as the data use the arrays/lists **features** and **feature_importances**, respectively. Finally, **sort the dataframe** by the importance value in descending order and print the result.
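The whole procedure can be sketched on synthetic data as follows. The feature names x0, x1, x2 and the name importance_demo are made up; the target is constructed so that x0 dominates.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic features; the target depends strongly on x0 and weakly on x1
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 10, size=(300, 3)), columns=["x0", "x1", "x2"])
y_demo = 5.0 * X_demo["x0"] + 0.5 * X_demo["x1"] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Combine names and importances, then sort in descending order
importance_demo = pd.DataFrame({
    "feature": X_demo.columns.tolist(),
    "importance": forest.feature_importances_,
}).sort_values("importance", ascending=False)
print(importance_demo)
```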

In [None]:
# Create dataframe and sort it


Here, we only see the impact of single categories of the categorical variables.

**Bonus**: Can you sum all the values belonging to one categorical variable?

**Hint**:

1. Use the apply method and the split function on the column importance_df['feature']
2. Use groupby, sum, and sort
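The two steps of the hint could look roughly like this on a made-up table. The sketch assumes that the one-hot encoded columns are named as variable, underscore, category (e.g. 'Neighborhood_A'); the separator in the real data may differ.

```python
import pandas as pd

# Made-up importances; two categorical variables and one numerical feature
imp = pd.DataFrame({
    "feature": ["Neighborhood_A", "Neighborhood_B", "LotArea", "Style_X"],
    "importance": [0.10, 0.15, 0.60, 0.15],
})

# Step 1: keep only the part of the name before the first underscore
imp["variable"] = imp["feature"].apply(lambda name: name.split("_")[0])

# Step 2: sum the importances per original variable and sort them
summed = imp.groupby("variable")["importance"].sum().sort_values(ascending=False)
print(summed)
```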

In [None]:
# Create new column


In [None]:
# Compute sum


**This is the end of the exercise.**

### Bonus I: Linear Regression
Fit a linear regression model. To do so, load the model **LinearRegression** from the module **sklearn.linear_model**.

In [None]:
# Import model and create instance


In [None]:
# Train model


In [None]:
# Compute predictions


In [None]:
# Compute r2_score


In [None]:
# Add prediction to result df


In [None]:
# Plot prediction


In [None]:
# Plot residuals


### Bonus II: Linear Regression with Lasso regularization
Fit a linear regression model which includes a regularization term. You can think of such a term as a penalty which disfavors large coefficients and therefore sets some coefficients exactly to zero if the corresponding features are not very useful.

Hence, do similar steps as in Bonus I. The model is called Lasso and can be found in the module sklearn.linear_model.
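The effect of the penalty can be seen on synthetic data where only two of seven features matter. This is a minimal sketch: the data, the variable names and the value alpha=0.5 are assumptions chosen for illustration, not recommended settings for the house price data.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data: only the first two of seven features influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 7))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

ols = LinearRegression().fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)  # alpha controls the penalty strength

# Plain linear regression keeps small coefficients for the useless features,
# the Lasso shrinks all coefficients and sets the useless ones to zero
print(np.round(ols.coef_, 3))
print(np.round(lasso.coef_, 3))
```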