# Exercise 04: Regression I


Welcome to the fourth exercise for Applied Machine Learning. 

Your objectives for this session are to: 
- learn the basics of fitting a model to data, 
- fit linear regression and logistic regression models with `sklearn`, and
- interpret the outputs of the models.

____

### Part 1: Fitting a model to data with `sklearn`

Last time we looked into `sklearn` and how to use this Python library to implement decision trees. Today we will use the same library to implement linear and logistic regression. 

Linear regression predicts a continuous value for a variable based on the value of another variable. This is the classic ***regression*** task, with a continuous target variable.

Logistic regression predicts a class value for a variable based on the value of another variable. Although the model has regression in its name, logistic regression is for ***classification*** tasks with a categorical target variable.

Let's import our libraries for today.

In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression

Remember what we learned last week about `sklearn`? We need to train (or "fit") a pre-programmed algorithm (or "estimator") to some data so that it can be used to make predictions on new data.

Like last time, the data should be structured as two arrays:

* `X` = 2D array of samples/instances/observations (sometimes called the "samples matrix," "design matrix," or "features matrix"), where instances are represented as rows and features/attributes are represented as columns.

* `y` = 1D array where the ith entry corresponds to the target of the ith sample (row) of X. For regression tasks the target values are real numbers. For classification tasks the target values are integers (or any other discrete set of values). 
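The shapes described above can be checked directly. The following is a minimal sketch using `numpy`; the variable name `sizes` is made up for illustration. It shows one common way to turn a flat list of values for a single feature into the 2D layout that `sklearn` expects for `X`.

```python
import numpy as np

# A flat list of values for a single feature (hypothetical example data)
sizes = [1, 11, 21, 31]

# reshape(-1, 1) turns the flat list into 4 rows (instances) x 1 column (feature)
X = np.array(sizes).reshape(-1, 1)

# The target stays one-dimensional: one value per row of X
y = np.array([2.5, 16.2, 24.9, 38.1])

print(X.shape)  # (4, 1)
print(y.shape)  # (4,)
```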


Let's look at an example with the same basic structure as last week's example, but this time with just one attribute: 

In [2]:
# define our example model - we will use the Linear Regression model
example_lr = LinearRegression() 

# make some fake samples for X - 4 instances, 1 attribute
X = [[1],  
     [11],
     [21],
     [31]]

# make some fake target values for y - continuous numeric values because regression!
y = [2.5, 16.2, 24.9, 38.1]  

# fit our example model to the data
example_lr.fit(X, y)

# print the score
print("Score: {:.3f}".format(example_lr.score(X, y)))

Score: 0.993


Not so different from last week, right? 

The main change from last week is that today we won't use the `predict` function — we'll just check the `score` to see how well we can fit a model to the data given features and a target. 

However, one other thing we want to do today is inspect the coefficients of our regression models. This is done like so: 

In [3]:
coefficient = example_lr.coef_
print(coefficient)

[1.155]


Remember how to interpret this from lecture? The coefficient is the slope of the regression line. More specifically, this value tells us the average change in our dependent variable (`y`) for a one-unit change in our independent variable (or attribute, `x`). 
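To make the "slope" interpretation concrete, here is a small sketch that recomputes the coefficient for the toy data by hand, using the textbook least-squares formula for a single attribute rather than `sklearn`. The helper `predict` is defined here only for illustration and is not the `sklearn` method.

```python
import numpy as np

# Same toy data as in the example above
x = np.array([1, 11, 21, 31], dtype=float)
y = np.array([2.5, 16.2, 24.9, 38.1])

# Least-squares slope and intercept for one attribute
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

def predict(x_new):
    """Illustrative helper: evaluate the fitted line at x_new."""
    return intercept + slope * x_new

print(round(slope, 3))                      # 1.155, matching coef_ above
print(round(predict(12) - predict(11), 3))  # 1.155: one extra unit of x
```

Moving `x` by one unit changes the prediction by exactly the slope, which is what the coefficient expresses.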

____

### Part 2: Introduction to the dataset

Today we will be working with `HomesSoldHellerup.csv`, a dataset of housing sales in Hellerup. Each row represents a home that was sold, and the columns represent different attributes of the home. The attributes (columns) are listed for you below:

Attribute Information
- 1) `Road name`: The name of the road
- 2) `Road Number`: The road number of the home
- 3) `Type`: What type of home is it? (e.g., "Lejlighed" (apartment), "Villa")
- 4) `m2`: The area of the home given in square meters
- 5) `Build Year`: The year the home was built
- 6) `ZipCode`: The zip code which the home lies within
- 7) `City`: The city the home is located in
- 8) `Date of Sale`: The day the home was sold
- 9) `Type of Sale`: How was the home sold? 
- 10) `Price`: The final price of the home in DKK

Let's see if we can use these attributes to fit a linear regression model capable of estimating the price of a home.

# <font color='red'>TASK 1</font>

It's always a good idea to get an overview of the data we are about to work with. Start by reading the `HomesSoldHellerup.csv` file, using the `read_csv` function from `pandas`, and call the dataset `homes_df`. Then use the `value_counts()` method to check how many of each `Type` of home there are in the dataset. Then inspect the range of `Price`. 


*Hint: Note that the separator in this dataset is a semicolon and not a comma, which is the default. So you need to specify this with a parameter in the `read_csv()` function.*

In [4]:
# read in dataset
homes_df = pd.read_csv('HomesSoldHellerup.csv', sep=';')

In [5]:
# check value counts for `Type`
homes_df['Type'].value_counts()

Lejlighed          1237
Villa               742
Rækkehus            174
Stuehus               3
Erhverv               2
Døgninstitution       2
Name: Type, dtype: int64

In [32]:
# check the range of `Price`
print("Minimum price:", homes_df['Price'].min())
print("Maximum price:", homes_df['Price'].max())
print("Range of prices:", homes_df['Price'].max()-homes_df['Price'].min())

Minimum price: 133333
Maximum price: 50000000
Range of prices: 49866667


### Part 3: Fitting a linear regression model to the data

# <font color='red'>TASK 2</font>

To use a regression model we need at least one attribute and one target variable. The target is what we wish to predict based on one or more attributes. We are going to fit a linear regression with the `sklearn` library, so we will use much of the same syntax and functions as we did in week 3. 

Define your feature matrix `X` with the single feature, `m2`, which is the size of the home in square meters. Then define your target `y` as `Price` from the dataset. 

In [7]:
X = homes_df[['m2']]
y = homes_df['Price']

# <font color='red'>TASK 3</font>
Let's fit a simple linear regression model to `X` and `y`. If you have defined them correctly, this will estimate a coefficient based on the linear relationship between the size of the home in square meters and the price it was sold for. It could also potentially be used to predict the selling price of a home.

Try it. Fit a linear regression model to `X` and `y`, and report the score and the coefficient.


*Hint: Check out the documentation on the `sklearn` site for the linear regression to get a better idea of how it works:* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html.

In [8]:
# fit a linear regression model to X and y 
lr = LinearRegression()
lr.fit(X,y)

In [9]:
# print the score
print("Score: {:.3f}".format(lr.score(X, y)))

Score: 0.305


What is the "score" for this model? What measure is it exactly?

**Answer:** The "score" is the coefficient of determination of the prediction, *R^2*. The best possible score is 1.0, and the score can be negative because a model can fit arbitrarily worse than simply predicting the mean of `y`. 
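As a rough illustration of what *R^2* measures, the sketch below computes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It reuses the toy data from Part 1; the function name `r_squared` and the intercept value 1.945 are not from the exercise (the intercept was worked out by hand for this toy data), so treat them as assumptions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy data from Part 1
x = np.array([1, 11, 21, 31], dtype=float)
y = np.array([2.5, 16.2, 24.9, 38.1])

fitted_line = 1.945 + 1.155 * x        # slope from Part 1; intercept computed by hand
mean_only = np.full_like(y, y.mean())  # always predict the mean of y
all_zeros = np.zeros_like(y)           # a deliberately bad "model"

r2_fitted = r_squared(y, fitted_line)
r2_mean = r_squared(y, mean_only)
r2_zeros = r_squared(y, all_zeros)

print(round(r2_fitted, 3))  # 0.993
print(round(r2_mean, 3))    # 0.0
print(r2_zeros < 0)         # True
```

Predicting the mean gives a score of 0, so a negative score means the model does worse than that baseline.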

In [10]:
# print the coefficient
coefficient = lr.coef_
print(coefficient)

[36683.88180408]


What does this coefficient suggest about the relationship between `m2` and `Price`?

**Answer:** On average, price increases by about DKK 36,684 for every extra square meter (!).

### Part 4: Fitting a linear regression model to the data, with a categorical attribute

# <font color='red'>TASK 4</font>
Last week you learned how to make dummy variables from a categorical variable with the `get_dummies` method from the `pandas` library. Dummy variables are especially important for the implementation of regression models, so let's try it here.

Build a linear regression model just like you did in Part 3, but re-define `X` to use only the `Type` of home as the attribute this time. Since this is a categorical attribute, you must first make dummy variables.

Fit the model to `X` and `y`, then report and interpret the score and the coefficients.

In [11]:
# re-define X using the `Type` attribute
X = pd.get_dummies(homes_df[['Type']], drop_first = True)
X.head()

| | Type_Erhverv | Type_Lejlighed | Type_Rækkehus | Type_Stuehus | Type_Villa |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 |


In [12]:
# fit model
lr = LinearRegression(fit_intercept=True)
lr.fit(X,y)

In [13]:
# print score
print("Score: {:.3f}".format(lr.score(X, y)))

Score: 0.084


Based on this score, is this model better or worse than the one you fit with `m2` as the attribute?

**Answer:** It's worse. 0.084 < 0.305, and we want *R^2* to be as close to 1.0 as possible.

Use the code below to extract the regression coefficients, align them with the column names, and then print them.

In [14]:
# Extract the coefficients
coefficients = lr.coef_

# Get the names of the dummy variable columns
column_names = X.columns

# Create a dictionary with column names as keys and coefficients as values
coef_dict = dict(zip(column_names, coefficients))

# Print the dictionary
print(coef_dict)

{'Type_Erhverv': 9252499.999999793, 'Type_Lejlighed': -3082904.4373485288, 'Type_Rækkehus': -1581394.6436782412, 'Type_Stuehus': -3410152.000000147, 'Type_Villa': -418120.711590397}


How do we interpret this? Why is there not just one coefficient value?

*Hint: Are your coefficients all identical? If so, you've been affected by the ["dummy variable trap"](https://www.statology.org/dummy-variable-trap/). Go back to where you used `get_dummies`, set the argument `drop_first = True`, and re-run your code.*  

*The dummy variable trap generally does not affect a model's predictive performance (i.e., the `score` is not affected). However, when interpreting regression coefficients, we need to avoid multicollinearity — we don't want an attribute to be correlated with another, because then it's hard to distinguish the individual effects of the correlated attributes. If we straightforwardly create a dummy variable for each possible value of a categorical variable, then we often introduce multicollinearity (e.g., if we know that a property is a villa, then we also know that it's not an apartment, business, etc.).*

*The most common way to avoid the dummy variable trap is to create `n-1` dummy variables for a categorical variable with `n` unique values. We do this by setting `drop_first = True` with the `get_dummies` method.*
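The effect of `drop_first` can be seen on a tiny made-up example. The sketch below uses a hypothetical four-row DataFrame (not the Hellerup data) to show that the full set of dummy columns always sums to 1 across each row, which is the redundancy behind the trap, and that `drop_first = True` leaves `n-1` columns.

```python
import pandas as pd

# Hypothetical toy data with 3 unique categories
toy = pd.DataFrame({"Type": ["Villa", "Lejlighed", "Rækkehus", "Villa"]})

# One dummy per category: n columns
full = pd.get_dummies(toy[["Type"]])

# Drop the first category (alphabetically): n-1 columns
reduced = pd.get_dummies(toy[["Type"]], drop_first=True)

print(list(full.columns))        # three columns, one per category
print(list(reduced.columns))     # 'Type_Lejlighed' has been dropped
print(full.sum(axis=1).tolist()) # every row sums to 1
```

Because every row of the full set sums to 1, any one dummy column can be reconstructed from the others, so dropping one loses no information.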

**Answer:** There's a separate coefficient for each dummy variable you created because each one is a separate term in your regression equation. Each coefficient tells you how much `Price` is expected to differ for a property of the corresponding `Type` relative to the baseline category, whose expected price is the model's intercept term. In this case the baseline is `Døgninstitution`, the category that `drop_first = True` dropped. You can inspect the intercept term of your model with the code below.

In [18]:
lr.intercept_

7247500.0000001015
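Putting the intercept and the coefficients together: when the only attributes are the dummies of a single categorical variable, the intercept plus a dummy's coefficient equals the fitted price for that `Type`. The sketch below just does this arithmetic with the values printed above; the name `implied_price` is made up for illustration.

```python
# Values copied from the outputs above
intercept = 7247500.0000001015  # baseline: Døgninstitution
coef_dict = {
    "Type_Erhverv": 9252499.999999793,
    "Type_Lejlighed": -3082904.4373485288,
    "Type_Rækkehus": -1581394.6436782412,
    "Type_Stuehus": -3410152.000000147,
    "Type_Villa": -418120.711590397,
}

# Fitted price per type = baseline + offset for that type
implied_price = {name: intercept + coef for name, coef in coef_dict.items()}

for name, price in implied_price.items():
    print(f"{name}: DKK {price:,.0f}")
```

For example, the fitted price for a villa is about DKK 6.83 million, which is DKK 418,121 below the baseline.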

### Part 5: Fitting a logistic regression model to the data

So far we've been predicting price as a continuous target variable. But let's say, for illustrative purposes, we only care about whether the price is above or below the median price of all the houses on the market. Maybe I want to brag to my friends about having a home that costs more than at least 50% of all homes... I'd be a horrible dinner guest, but this makes for a nice classification task, so let's go with it.

# <font color='red'>TASK 5</font>

Add a new column to your main dataset, `homes_df`, which labels instances where the price is below the median price as 0 (indicating a "low" price) and all other instances 1 (indicating a "high" price). Name the column `Price_binary`.

In [19]:
k = homes_df['Price'].median()
k

4000000.0

In [20]:
homes_df['Price_binary'] = np.where(homes_df['Price']<k, 0, 1)

In [21]:
homes_df.columns

Index(['Road name', 'Road Number', 'Type', 'm2', 'Build Year', 'ZipCode',
       'City', 'Date of Sale', 'Type of Sale', 'Price', 'Price_binary'],
      dtype='object')

# <font color='red'>TASK 6</font>
Now we have a binary, categorical variable that we want to predict: `Price_binary`. So, we no longer have a regression task on our hands, but a classification task. For this we'll use logistic regression.

Implement a logistic regression model that uses `m2` as an attribute and `Price_binary` as a target variable. Fit the model to `X` and `y`, then report the score and the coefficient.

In [22]:
X = homes_df[['m2']]
y = homes_df['Price_binary']

In [23]:
# fit model
logr = LogisticRegression()
logr.fit(X,y)

In [24]:
#print score
print("Score: {:.3f}".format(logr.score(X, y)))

Score: 0.831


What is the "score" for this logistic regression model? What measure is it exactly?

**Answer:** The mean accuracy on the given data. This is more meaningful when doing a train-test split and evaluating on test data, but here it shows how the fitted sigmoid curve doesn't achieve perfect accuracy even on the data it's trained on.
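The mean accuracy can be sketched by hand as follows. This is only an illustration: the coefficient, intercept, and data below are made up, not the values fitted to the Hellerup data. It passes a linear score through a sigmoid, labels an instance "high" when the probability is at least 0.5, and counts the share of correct labels.

```python
import numpy as np

def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical model parameters and toy data (illustration only)
coef, intercept = 0.04, -4.0
m2 = np.array([50, 80, 110, 120, 200], dtype=float)
y_true = np.array([0, 1, 0, 1, 1])

# Probability of the "high" class, then a hard 0/1 label at the 0.5 threshold
p_high = sigmoid(intercept + coef * m2)
y_pred = (p_high >= 0.5).astype(int)

# Mean accuracy: fraction of instances labelled correctly
accuracy = np.mean(y_pred == y_true)
print(y_pred.tolist())  # [0, 0, 1, 1, 1]
print(accuracy)         # 0.6
```

Two of the five toy homes sit on the "wrong" side of the curve, so the accuracy is below 1 even though the model is evaluated on the same data it describes.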

In [25]:
#print coefficient
coefficient = logr.coef_
print(coefficient)

[[0.03555657]]


How do you interpret this coefficient? What does it tell us? Remember, it's a bit more confusing than linear regression.

**Answer:** The change in the *log-odds* of y per unit change in x, i.e., the change in the log-odds of a home having a "high" price for each additional square meter. 

When exponentiated, this becomes the *odds ratio*: the factor by which the odds of y are multiplied per unit change in x.

In [26]:
odds = np.exp(coefficient)
odds

array([[1.03619627]])

So, the above tells us that, for each additional square meter, the odds that the home will have a "high" price are multiplied by about 1.036, i.e., the odds increase a little bit (by about 3.6%). An odds ratio of 1.00 would mean absolutely no effect. 
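The multiplicative reading of the odds ratio can be checked with a few lines of arithmetic. The sketch below uses the coefficient reported above, but the intercept is a hypothetical value chosen for illustration, since the exercise does not print it. The odds ratio does not depend on the intercept, which cancels out.

```python
import math

coef = 0.03555657  # coefficient reported above
intercept = -4.5   # hypothetical: the fitted intercept is not shown in this exercise

def odds_high(m2):
    """Odds of a "high" price under the logistic model: exp(intercept + coef * m2)."""
    return math.exp(intercept + coef * m2)

# Odds ratio for 1 extra square meter and for 10 extra square meters
ratio_1 = odds_high(101) / odds_high(100)
ratio_10 = odds_high(110) / odds_high(100)

print(round(ratio_1, 3))   # 1.036
print(round(ratio_10, 3))  # equals ratio_1 ** 10, rounded
```

Because the effect is multiplicative, ten extra square meters multiply the odds by `ratio_1 ** 10`, not by ten times 3.6%.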

---------
**That's it for this week! Next week we'll build regression models with multiple attributes and see how well our models generalize.**