# Exercise 3

Exercise 3 includes a **written assignment** (10 points), a **programming assignment with 5 problems** (9 points) and a **feedback/workload assessment assignment** (1 point). For each problem you need to modify the notebook by adding your own programming solutions or written text. Remember to save and commit your changes locally, and push your changes to GitHub after each major change! Regular commits will help you to keep track of your changes (and revert them if needed). Pushing your work to GitHub will ensure that you don't lose any work in case your computer crashes (can happen!).

### Time allocation

**Completing this exercise takes approximately 10-15 hours** (based on statistics from previous years). However, the time it takes can vary significantly from student to student, so we **recommend that you start working on the exercise immediately to avoid problems meeting the deadline**.

### Due date

You should submit (push) your Exercise answers to your personal GitHub repository **by Friday 9.2**.

### Start your exercise in CSC Notebooks

Before you can start programming, you need to launch the CSC Notebook instance and clone your **personal copy of the Exercise repository** (i.e. something like `exercise-3-htenkanen`) there using Git. If you need help with this, [read the documentation on the course site](https://sustainability-gis.readthedocs.io/en/latest/lessons/L1/git-basics.html).

### Working with Jupyter Notebooks

Jupyter Notebooks are documents that contain computer code and rich text elements (such as text, figures, tables and links). They can be used and run inside the JupyterLab programming environment (e.g. at [notebooks.csc.fi](https://notebooks.csc.fi/)).

**A couple of hints**:

- You can **execute a cell** by clicking a given cell that you want to run and pressing <kbd>Shift</kbd> + <kbd>Enter</kbd> (or by clicking the "Play" button on top)
- You can **change the cell-type** between `Markdown` (for writing text) and `Code` (for writing/executing code) from the dropdown menu above. 

See [**further details and help from here**](https://pythongis.org/part1/chapter-01/nb/04-using-jupyterlab.html). 
 
### Hints 

If there are general questions arising from this exercise, we will add hints to the course website under [Exercise 3 description](https://sustainability-gis.readthedocs.io/en/latest/lessons/L3/exercise-3.html). 

<hr style="border:2px solid gray">

# Part 1: Written assignment (10 points)

In the "Economic inequalities, growth and green economy" and the "Spatial regression" lessons + tutorial 
this week, we went through different indicators that can be used to understand economic inequalities and how 
(spatial) regression models can be used e.g. to understand which factors influence pricing of Airbnb listings. 

Write approx. 0.5-2 pages (A4) of text in English. Remember to **cite your sources appropriately** when you use literature or other reading materials in your text. 

In this essay, you should cover the following questions:
 
 - What are the key characteristics, strengths and weaknesses of different indicators that aim to measure economic inequality in the world?
 - Do you find economic inequality problematic? How big differences in income/wealth should be acceptable?
 - What kind of changes do you think should be made for the economy to be more sustainable?
 - How can regression analysis be used to understand economic inequalities? Why is it important to take spatial effects into account when doing regression analysis? 
  
Use the lesson materials and the recommended readings (optional) as sources of information for answering these questions.

### Grading criteria for the essay

- Answers to the question(s): 4 points

- Reflection against literature + materials: 4 points

- Fluency / clarity of the text: 1 point

- Appropriate citation practices used: 1 point

----------------

## Answers / reflection

**Add your text here.**

*Hint: To "activate" this cell in Editing mode, double click this cell. If you want to get this cell back in the "Reading-mode", press <kbd>Shift</kbd> + <kbd>Enter</kbd>.*

## Hints

- If you need help in Markdown formatting (e.g. how to add headings, bold, italics, links etc.), please take a look at this excellent [guide / cheatsheet](https://www.markdownguide.org/cheat-sheet/) 

<hr style="border:2px solid gray">

# Part 2: Programming assignment (9 points)

In this exercise, we practice doing spatial regression using postal code level data from Finland that is openly available via Statistics Finland. We aim to understand what factors influence housing prices in the Helsinki Region. In this exercise, you will learn how to:

 - investigate the linear relationship between different attributes
 - create spatial weights using the pysal library
 - investigate spatial autocorrelation in the data using the Moran's I indicator
 - conduct Ordinary Least Squares regression and spatial regression using the pysal library


### Due date

The exercise should be returned by the end of Friday (19th of February, 2021).  

### Start your exercise in CSC Notebooks

Before you can start programming, you need to launch the CSC Notebook instance and clone your Exercise repository there.
If you need help with this, [read the documentation on the course site](https://sustainability-gis.readthedocs.io/en/latest/lessons/L1/git-basics.html).
 

## Input data

In this Exercise, we will use openly available postal code level data from Statistics Finland together with OpenStreetMap data. We use three different datasets: 
  1. [Paavo postal code level data](https://www.stat.fi/tup/paavo/paavon_aineistokuvaukset_en.html) - Provides many useful attributes that we can use as explanatory variables in our linear regression models.  
  2. [Average prices of dwellings](https://www.stat.fi/til/ashi/index_en.html) in housing companies in Finland (€ per square meter)
  3. OpenStreetMap data - Provides street network data that we use to calculate an accessibility index for each postal code area, which is used as one explanatory variable

## Hypothesis

In this exercise, we hypothesize that the following attributes might explain the average apartment price on a postal code level:

 1. Travel time to city center by public transport
 2. Number of jobs in the area
 3. Number of people living in the area with higher education degree
 4. Average size of the households (i.e. how many people live in a household)
 5. Average income of the households
 
In addition to these, we finally use a spatial lag model to consider that:

 6. The average price of the neighboring areas influences the price on a given area


## Helper functions

The following helper function will be used in the exercise for creating a routable transport network for r5py. **Execute this cell before starting to work on the exercise.**

In [None]:
def prepare_network(osm_fp, gtfs_fp):
    """
    Helper function to create a routable network for r5py based on OpenStreetMap data and GTFS data.
    
    Parameters
    ----------
    
    osm_fp : str
    
        Filepath to the OpenStreetMap PBF file (*.osm.pbf).
        
    gtfs_fp : str | list
    
        Filepath to the GTFS zip-file, or alternatively a list of multiple filepaths to GTFS Zipfiles.
         
    """
    
    from r5py import TransportNetwork

    if isinstance(gtfs_fp, str):
        gtfs_fp = [gtfs_fp]
    elif isinstance(gtfs_fp, list):
        # Check that the inputs are valid
        for item in gtfs_fp:
            assert isinstance(item, str), f"All objects in 'gtfs_fp' list should be filepath strings. Got '{type(item)}'."

    # Build network
    net = TransportNetwork(osm_fp,  gtfs_fp)
    return net

## Problem 0 - Download data

Before starting the exercise, download the necessary data by executing the following cell (you only need to do this once). Here, we can use the same data as we used in Exercise 2: 

In [None]:
# Download the data from an S3 bucket into 'data' folder
!wget -P data/ https://a3s.fi/swift/v1/AUTH_0914d8aff9684df589041a759b549fc2/Sustainability-GIS/Helsinki.zip
    
# Extract the contents
!unzip -q data/Helsinki.zip -d data/

If running the cell above does not work for some reason, you can [manually download the data](https://a3s.fi/swift/v1/AUTH_0914d8aff9684df589041a759b549fc2/Sustainability-GIS/Exercise-2-data.zip). If you do this, extract the contents of the Zip file (Exercise-2-data.zip) into the `<YOUR_FOLDER_CONTAINING_THIS_NOTEBOOK>/data` -folder.

## Problem 1 - Prepare data (1 point)

In this problem you should:

- Read the postal code level data from Statistics Finland and select the data for Helsinki Region based on municipality code (`kunta`)
- Remove unnecessary columns by selecting only the following attributes from the postal code data:

   - id --> Unique id for each row
   - posti_alue --> Postal code id
   - tp_tyopy --> Number of jobs in the area (total)
   - ko_yl_kork --> Number of people having higher education degree
   - te_takk --> Average size of the households
   - tr_ktu --> Average income of the households
   - geometry --> Geometries


- Read the apartment price data from the `apartment_prices_finland_2019.csv` file, which is located in the `data` directory. When reading the data, ensure that the data type of the `postal_code` attribute is string. Hint: Check the pandas documentation for `read_csv()` and the parameter `dtype`. 

- Make a table join between the postal code dataset and the apartment price dataset based on the `posti_alue` and `postal_code` attributes. 
- Investigate your data. Are there missing values in any of the selected attributes? If there are:

   - remove NaN values
   - reset the index

- Make a map of the average price per square meter in the Helsinki Region. As a result, you should have something like the following:

![Average housing prices](img/housing_prices.PNG)
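
As a rough illustration of the reading, joining and cleaning steps listed above, the sketch below uses tiny made-up tables instead of the real files. The postal codes, prices, income values and the in-memory CSV are assumptions for demonstration only; the column names `posti_alue`, `postal_code` and `tr_ktu` follow the exercise text, while `areas`, `prices`, `joined` and `cleaned` are hypothetical variable names.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the apartment price file (values are not real)
csv_text = "postal_code,price\n00100,8000\n00120,\n02100,5500\n"

# Reading with dtype=str keeps the leading zeros of the postal codes
prices = pd.read_csv(io.StringIO(csv_text), dtype={"postal_code": str})

# Made-up postal code table standing in for the Statistics Finland data
areas = pd.DataFrame(
    {"posti_alue": ["00100", "00120", "02100"], "tr_ktu": [45000, 52000, 61000]}
)

# Table join on the two differently named key columns
joined = areas.merge(prices, left_on="posti_alue", right_on="postal_code")

# Drop rows with missing values and renumber the index from zero
cleaned = joined.dropna().reset_index(drop=True)
print(cleaned)
```

Without `dtype`, pandas would most likely parse the postal codes as integers and drop the leading zeros, after which the keys would no longer match the `posti_alue` strings.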

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.


In [None]:
import geopandas as gpd
import pandas as pd

# URL for postal code level data from Statistics Finland
url = "http://geo.stat.fi/geoserver/postialue/wfs?request=GetFeature&typename=postialue:pno_tilasto_2019&outputformat=JSON"

# Read postal code data
data = gpd.read_file(url)

# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

## Problem 2 - Create an accessibility index for each postal code area (2 points)

In this problem, the objective is to create an accessibility index for our postal code areas based on the travel time by public transport between the Helsinki city center and each area. 

**Task 2.1.** Prepare a routable network:
 
- Use the `prepare_network()` function (above) to create the routable network. 
- As input for the function, you should specify the filepaths to the OSM and GTFS files which you downloaded in Problem 0:
   - pass the OpenStreetMap file using `osm_fp` parameter, and 
   - the GTFS Zipfile using `gtfs_fp` parameter. 

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.


In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

**Task 2.2.** Prepare origin and destination data:

*Destinations:*

- Calculate the centroid of each postal code area into a new column called `centroid` in the `data` GeoDataFrame (from Problem 1).
- Specify that the **active** source for geometries in the `data` GeoDataFrame is `centroid`:
    - Use the `.set_geometry()` -method of geopandas for specifying the active geometry
    - Hint: Read details about `.set_geometry()` from [geopandas documentation](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.set_geometry.html) 

- Reproject the data into `WGS84`:
    - Use the `.to_crs()` function to reproject the data into WGS84 coordinate reference system
    
*Origins*:
- Use `osmnx` library to geocode the point for **Helsinki Central Railway station** in a similar manner as in [Tutorial 2.2](https://sustainability-gis.readthedocs.io/en/latest/lessons/L2/r5py_calculating_travel_time_matrices.html): 
   - Store the data into a GeoDataFrame called `origin` 
   - Calculate the centroid of the Railway Station polygon into a column `geometry` of the `origin` GeoDataFrame (i.e. we overwrite the old Polygon geometry with a Point geometry)
   - Add an `id` column with a value `0` to the `origin` GeoDataFrame.
   
 <br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

**Task 2.3**. Calculate travel times by public transport from Helsinki Railway station to all postal code area centroids. Use `r5py` to do the travel time calculations as we have learned during week 2 (and in Exercise 2): 
   
- Initialize the `TravelTimeMatrixComputer` into a variable called `travel_time_matrix_computer` in a similar manner as shown in the [Tutorial 2.2](https://sustainability-gis.readthedocs.io/en/latest/lessons/L2/r5py_calculating_travel_time_matrices.html). As parameters, you should have the following:
   - For the `transport_network` parameter, you should pass the routable `network` which you created in **Task 2.1**.
   - For the `origins` parameter which indicates the location where the travel begins, you should pass the `origin` GeoDataFrame.
   - For the `destinations` parameter which indicates all the destination locations, you should pass the `data` GeoDataFrame.
   - For the `departure` parameter which indicates the time of departure, you should pass the time as a [datetime object](https://pythongis.org/part1/chapter-03/nb/03-temporal-data.html#constructing-datetime-objects) representing 7:30 in the morning on January 19th 2023 (see tutorial for example) 
   - For the `transport_modes` parameter which specifies the mode of travel or their combinations, you should define that the travel happens with a combination of `TRANSIT` and `WALK` (see tutorial for example). 
    
- Calculate the travel times and store them into a variable `ttm`. 
   - You can do this by executing the `.compute_travel_times()` method of the `travel_time_matrix_computer` which you initialized earlier.

- Make a table join between the postal code dataset (variable `data`) and the `ttm` DataFrame from the previous step. This allows us to map our results. 
   - You can do the table join by using the `data.merge()` method and use the `id` column in the `data` layer as the left key, and the `to_id` column in `ttm` as the right-key. 
   - Store the result of the table join into a variable called `access`
   - **Hint:** Read more details about how table joins work from [Pythongis.org website](https://pythongis.org/part1/chapter-03/nb/01-data-manipulation.html#table-joins-combining-dataframes-based-on-a-common-key)
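
The departure time and the table join described above can be sketched without running r5py at all. The small `ttm` table below is made up and only mimics the `from_id`, `to_id` and `travel_time` columns that a travel time matrix is assumed to contain; `data_demo` and `access_demo` are hypothetical stand-ins for the `data` and `access` variables.

```python
import datetime

import pandas as pd

# 7:30 in the morning on January 19th 2023, as a datetime object
departure = datetime.datetime(2023, 1, 19, 7, 30)

# Hypothetical stand-in for the postal code table (ids are made up)
data_demo = pd.DataFrame({"id": [1, 2, 3], "posti_alue": ["00100", "00120", "02100"]})

# Made-up travel time matrix with one origin (id 0) and three destinations
ttm = pd.DataFrame(
    {"from_id": [0, 0, 0], "to_id": [1, 2, 3], "travel_time": [5, 9, 27]}
)

# Join on differently named key columns: `id` on the left, `to_id` on the right
access_demo = data_demo.merge(ttm, left_on="id", right_on="to_id")
print(departure.isoformat())
print(access_demo[["posti_alue", "travel_time"]])
```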


<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
from r5py import TravelTimeMatrixComputer, TransportMode
import datetime

# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

**Task 2.4.** Calculate and visualize the accessibility index

- Currently the active `geometry` in our `access` GeoDataFrame is the centroid of a postal code area. For visualization purposes, we want to change this back to the Polygon geometries. To do this, you should set the active geometry of the GeoDataFrame as `geometry`:
  - Use the `.set_geometry()` method and specify the column `geometry` for this.

- Calculate the inverse distance for our `access` GeoDataFrame based on the values in `travel_time` column using formula: `abs( travel_time - max(travel_time) )`
  - Store the results in column `inverse_distance`

- Create an accessibility index for the `access` GeoDataFrame as a new column called `access_index` in which you should:
  
  - Standardize the inverse distance to scale 0-1 with formula: `inverse_distance / max(inverse_distance)`

- Visualize the access index using the `.plot` function. 
   - You should use the `access_index` as the `column` which will be used to visualize the data
   - You can pass `cmap="RdYlBu_r"` as the colormap if you want (also other colormaps are fine)
   - You can use `scheme="natural_breaks"` to classify the values according to the Natural Breaks classifier
   - You can use `k=8` to specify that you want to classify the values into 8 different classes

As a result for the accessibility index, you should have something like the following (the higher the value, the better the accessibility):

![Accessibility index](img/access_index.png)
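
To make the two formulas above concrete, here is a small worked example with made-up travel times (in minutes). The table `access_demo` is a hypothetical stand-in for the real `access` GeoDataFrame; only the column names come from the exercise text.

```python
import pandas as pd

# Made-up travel times in minutes for four areas
access_demo = pd.DataFrame({"travel_time": [10, 25, 40, 60]})

# Inverse distance: the longest travel time becomes 0, the shortest gets the largest value
access_demo["inverse_distance"] = (
    access_demo["travel_time"] - access_demo["travel_time"].max()
).abs()

# Scale to 0-1 by dividing by the maximum inverse distance
access_demo["access_index"] = (
    access_demo["inverse_distance"] / access_demo["inverse_distance"].max()
)
print(access_demo)
```

With these numbers, the area that is 10 minutes away gets an index of 1 and the area that is 60 minutes away gets an index of 0, which matches the "higher is better" reading of the index.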


<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()


**Task 2.5. Sanity check**: Plot the travel times (in column `travel_time`) against the price information, which should be available in the `access` GeoDataFrame, as a scatter plot using the `lmplot()` function of the seaborn library: 
 - check [Seaborn lmplot()](https://seaborn.pydata.org/generated/seaborn.lmplot.html).
 
As a result of this, you should get something like below:

![Travel times against Housing prices](img/price_against_distance.png)


<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

## Problem 3 - Make a correlation matrix (1 point)

In this problem, we start our statistical analysis by investigating the correlation between the price and the variables we have selected. One of the assumptions of Ordinary Least Squares regression is that there should be a linear relationship between the variables. Hence, before doing anything else, it is good to check whether this assumption holds. You should:

- Calculate a correlation matrix between all the variables in our model (i.e. price, access_index, tp_tyopy, ko_yl_kork, te_takk, tr_ktu). 
   - Use the `access` GeoDataFrame as the source data for the correlation matrix and store the results in a new DataFrame called `correlation_matrix`. 
   - Round the values so that they have only 2 decimals. 
   - Hint: Check the pandas documentation for `corr()` method (see [docs](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html)).
- Visualize the correlation matrix as a heatmap using Seaborn's [heatmap()](https://seaborn.pydata.org/generated/seaborn.heatmap.html) functionality. You should annotate the correlation values so that they are visible in the plot, and answer the questions (scroll down to find them).

As a result, you should have something like below (*correlation values for price are hidden*):

![Correlation matrix ](img/correlation_matrix.png)
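
The correlation and rounding steps can be tried out on a few made-up rows before applying them to the real data. In this sketch the values are invented and only three of the model variables are included; `demo`, `explanatory` and `high` are hypothetical names, and the heatmap itself is left out.

```python
import pandas as pd

# Made-up values for three variables (not real data)
demo = pd.DataFrame(
    {
        "price": [3000, 4200, 5100, 6500, 8000],
        "access_index": [0.2, 0.4, 0.5, 0.7, 0.9],
        "te_takk": [2.6, 2.4, 2.1, 1.9, 1.6],
    }
)

# Pairwise Pearson correlations, rounded to two decimals
correlation_matrix = demo.corr().round(2)
print(correlation_matrix)

# Flag absolute correlations above the 0.8 rule of thumb among the explanatory
# variables (the diagonal is trivially flagged, since each variable correlates
# perfectly with itself)
explanatory = ["access_index", "te_takk"]
high = correlation_matrix.loc[explanatory, explanatory].abs() > 0.8
print(high)
```

Checking the absolute value matters here, because a strong negative correlation between two explanatory variables signals multicollinearity just as much as a strong positive one.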

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

### Questions

- **Question 3.1.** Which three (3) variables have the strongest correlation with price, and what are their correlation coefficients?
- **Question 3.2.** Which variable might you want to drop due to low correlation with price? (Note: do not drop the variable from your data, only answer the question based on your understanding)
- **Question 3.3.** Multicollinearity should be avoided in OLS, meaning that there shouldn't be a relationship between the explanatory variables. One way to detect multicollinearity is to investigate whether there are high correlation values between the explanatory variables (a typical "rule of thumb" cutoff value is 0.8, although lower thresholds are used as well). Based on the correlation matrix, do you see issues with multicollinearity in our variables? Justify your answer with a sentence or two.   
  


### Answers

Answer the questions above by adding text after the `Answer` bullet points below:

(*Hint*: double-click this cell to activate editing)

- **Answer for Q3.1**: 
- **Answer for Q3.2**: 
- **Answer for Q3.3**: 

## Problem 4 - Is there spatial autocorrelation in the price values? (2 points)

As we learned during the lesson this week, spatial autocorrelation can considerably influence how well our statistical models work. Hence, it is good to find out whether our dependent variable (*price*) is spatially autocorrelated. Based on the map from Problem 1, we can already see that our values seem to cluster. But as humans are very good at seeing patterns (even where none exist), it is always good to investigate more analytically (here using the `Moran's I` indicator) whether the values are spatially autocorrelated or not. To do this, you should:

- Create spatial weights based on how the boundaries of our postal code areas touch each other. Create the weights based on the data GeoDataFrame using **Queen contiguity** and store the resulting weights into a variable `w`. For creating the spatial weights, you can use the `weights` submodule from the `pysal` library (see [docs](https://pysal.org/libpysal/api.html)). If you need further information, we also recommend checking [chapter 4](https://geographicdata.science/book/notebooks/04_spatial_weights.html) from the "Geographic Data Science with Python" book (Rey et al. *forthcoming*). It is also highly recommended to check the [lesson video 4.2](https://sustainability-gis.readthedocs.io/en/latest/lessons/L4/overview.html#lesson-videos) for additional details about spatial weights.

Sanity check (optional): If you plot the spatial weights (i.e. variable `w`) with the postal code areas, you should get something like the following as a result (hint: you can plot the weights as you would plot any GeoDataFrame):

![Queen contiguity](img/queen_contiguity_weights.PNG)

- Row standardize the weights as was shown during the tutorial this week.

- In our data, there are islands. Which indices of our data represent islands? 

- Remove the rows from our GeoDataFrame representing islands (check lesson video) by dropping the rows based on the island indices. Also reset the index at this point.

- Based on this *cleaned data*, recreate the Queen contiguity weights and store them again in variable `w` and row standardize the weights

- Calculate the `Moran's I` based on the "price" attribute and using the spatial weights that we created in the previous step. For doing this, you can use the [Moran()](https://pysal.org/esda/generated/esda.Moran.html) function from the pysal library, which accepts the Series of our price column as one parameter and the weights as another, check the pysal docs for details. What is the global Moran's I for our data?

- Create a Moran plot based on our data that allows us to investigate the spatial autocorrelation visually. For doing this, you can use the [plot_moran()](https://splot.readthedocs.io/en/latest/generated/splot.esda.plot_moran.html) function from pysal's splot submodule. Sanity check: if everything is correct, the Moran plot should look something like the following (*Moran's I value is hidden*):

![Moran plot](img/moran_plot.PNG)


**How to read the plot?** The plot on the right displays a positive relationship between the price values and their spatial lag (it also shows the global Moran index in the title). This means that we have positive spatial autocorrelation in our data, as similar values tend to be located close to each other: high values tend to be close to other high values, and low values tend to be close to other low values. The plot on the left shows what the distribution of Moran's I would look like if the data were spatially random. The blue vertical line (at x-axis position 0) shows where the mean of this spatially random distribution is, and the red vertical line on the very right shows the Moran's I of our data, i.e. it is clearly beyond what randomness would produce. 
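
To build some intuition for what the global Moran's I measures, the sketch below computes it from scratch with NumPy on a toy example of four areas in a row. This is not the pysal implementation and it skips the significance simulation; the function name `morans_i`, the toy weights and the values are all assumptions made for illustration.

```python
import numpy as np


def morans_i(values, w):
    """Global Moran's I for a 1-D array of values and an n x n weights matrix."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()              # deviations from the mean
    s0 = w.sum()                  # sum of all weights
    return (len(z) / s0) * (z @ w @ z) / (z @ z)


# Toy binary contiguity for four areas in a row: 0-1, 1-2, 2-3 are neighbours
w = np.array(
    [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ],
    dtype=float,
)

# Row standardization: each row is divided by its sum so that it adds up to 1
w = w / w.sum(axis=1, keepdims=True)

clustered = morans_i([1, 2, 3, 4], w)      # similar values next to each other
alternating = morans_i([1, 4, 1, 4], w)    # dissimilar values next to each other
print(clustered, alternating)
```

The clustered series yields a positive value and the alternating series a negative one. Note also that an island would have a row of zeros in the weights matrix, so the row standardization above would divide by zero, which is one practical reason for removing islands first.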

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()


## Problem 5 - Spatial regression (3 points)


As an overview, in this problem, we will:

1. Investigate the linear relationship between our variables using scatter plots with a fitted regression line

2. Apply log transformations to some of our variables (to ensure that the relationship between the dependent and explanatory variables is approximately linear)

3. Fit three different regression models:

   - A normal Ordinary Least Squares (OLS) regression model
   - Exogenous effects SLX model (i.e. spatially lagging explanatory variables)
   - Spatial lag model (i.e. considering the price of the neighboring areas also as an explanatory variable)



### Task 5.1. Investigate whether our explanatory variables seem to have linear relationship with the "price":

- Create a scatter matrix based on our model attributes (price, access_index, tp_tyopy, ko_yl_kork, te_takk, tr_ktu) and fit a regression line to the plot. Use seaborn [pairplot()](https://seaborn.pydata.org/generated/seaborn.pairplot.html#seaborn.pairplot) function to make the visualization. Hint: you can use the parameter `"kind"` to specify which kind of plot you want to visualize. For fitting a regression trend line to your plot, you can specify that the kind of plot you want is "reg" ([read more here](https://seaborn.pydata.org/tutorial/regression.html#plotting-a-regression-in-other-contexts)). 

Sanity check: The first row in the result should look something like the following:

![Scatter matrix with trend line](img/pairplot.PNG)

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

### Questions

- **Question 5.1.** Looking at the relationship between the price and other variables, which of the variables seem to have a clear linear relationship with the price? Justify your answer with a sentence or two.
- **Question 5.2.** Based on the scatter plot, which of the variables seem to have a negative relationship with the "price"? 

### Answers

Answer the questions above by adding text after the `Answer` bullet points below:

(*Hint*: double-click this cell to activate editing)

- **Answer for Q5.1**: 
- **Answer for Q5.2**: 

### Task 5.2. Logarithmic transformations 

Logarithmic transformation of variables in a regression model is a common way to handle situations where a non-linear relationship exists between the independent and dependent variables. Based on the insights from the previous step, we might want to do a log transformation to some of our attributes in order to make the relationship between attributes linear. Using a log transformation also helps with heavy skewness in our data (e.g. the price histogram shows that most of the values are concentrated on the left side of the histogram).

In this task, you should (use the `access` GeoDataFrame as input data):

- Make a logarithmic transformation (as shown in the [tutorial 4](https://sustainability-gis.readthedocs.io/en/latest/lessons/L4/spatial_regression.html#baseline-nonspatial-regression)) to the following attributes and store the values as shown below (making the attribute names easier to understand):

 - "price" --> "log_price" (the column name to which you should store the result)
 - "tp_tyopy" --> "log_n_jobs"
 - "ko_yl_kork" --> "log_high_edu"
 - "te_takk" --> "log_avg_household_size"
 - "tr_ktu" --> "log_avg_income"
 - **Hint:** In case calculating the logarithmic values with `np.log()` produces warnings or infinite values (because the logarithm of 0 is undefined), you might want to add a small decimal number such as `0.000001` to the input values when taking the logarithm (see example from Tutorial 4). 
 
 
- Create another scatter matrix in a similar manner as in Task 5.1, but now use the log transformed attributes instead of the original ones (notice that we didn't log transform the accessibility index which you should also include in the plot). 
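
The hint about zero values can be illustrated with a few made-up job counts. The table `demo` and its values are assumptions for demonstration; only the column names `tp_tyopy` and `log_n_jobs` come from the exercise text.

```python
import numpy as np
import pandas as pd

# Made-up job counts, including one area with zero jobs
demo = pd.DataFrame({"tp_tyopy": [0, 10, 100, 1000]})

# Adding a small constant avoids taking the logarithm of zero
demo["log_n_jobs"] = np.log(demo["tp_tyopy"] + 0.000001)
print(demo)
```

The constant barely changes the result for the non-zero rows, but the zero row ends up with a large negative value, so it is worth checking how such rows look in the scatter matrix.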

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

### Questions

- **Question 5.3**. Looking at the scatter matrix, what happened to the histograms of your variables? Explain with a sentence or two.
- **Question 5.4**. Looking at the scatter matrix, do the log transformed explanatory variables seem to have a clearer linear relationship with the dependent variable? Explain with a sentence or two.

### Answers

Answer the questions above by adding text after the `Answer` bullet points below:

(*Hint*: double-click this cell to activate editing)

- **Answer for Q5.3**: 
- **Answer for Q5.4**: 

### Task 5.3 - Ordinary Least Squares (OLS)

Here we do our first regression model using Ordinary Least Squares regression (use the `access` GeoDataFrame as input data):

1. Create an OLS model where "log_price" is the dependent variable and access_index and the logged variables (log_n_jobs, log_high_edu, log_avg_household_size, log_avg_income) are the explanatory variables
2. Print out the summary of the regression model and answer the questions (scroll down for questions)
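
To clarify what an OLS summary reports, the following sketch estimates the coefficients and the R-squared value by hand with NumPy on made-up numbers. It is not the pysal workflow used in the tutorial, and the data and the variable names (`x`, `y`, `design`) are purely illustrative assumptions.

```python
import numpy as np

# Made-up data: five observations, two explanatory variables
x = np.array([[0, 1], [1, 0], [2, 1], [3, 0], [4, 1]], dtype=float)
y = np.array([1.1, 2.4, 2.0, 3.6, 2.9])

# Add a column of ones so that the model includes an intercept
design = np.column_stack([np.ones(len(x)), x])

# Least squares estimate of the coefficients (intercept first)
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)

# R-squared: share of the variance in y that the fitted values reproduce
residuals = y - design @ coefficients
r_squared = 1 - (residuals @ residuals) / ((y - y.mean()) @ (y - y.mean()))
print(coefficients, r_squared)
```

When comparing coefficients, keep in mind that their size depends on the scale of each variable, so the "highest coefficient" is easiest to interpret when the variables are on comparable scales (for example after log transformation).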

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

### Questions

- **Question 5.5.** What is the R-squared value of our model?
- **Question 5.6.** Which of the explanatory variables has the highest coefficient?

### Answers

Answer the questions above by adding text after the `Answer` bullet points below:

(*Hint*: double-click this cell to activate editing)

- **Answer for Q5.5**: 
- **Answer for Q5.6**: 

### Task 5.4: Exogenous effects, SLX model 

In this task, we will learn how introducing spatial effects into the explanatory variables influences our regression model (use the `access` GeoDataFrame as input data). You should:

- Create row standardized Queen contiguity spatial weights based on our GeoDataFrame (alternatively, you can use the same weights which were created in Problem 4) 
- Create spatially lagged versions of our explanatory variables using the `weights.spatial_lag.lag_spatial()` function of pysal, i.e. the same approach which was introduced in the [tutorial](https://sustainability-gis.readthedocs.io/en/latest/lessons/L4/spatial_regression.html#spatially-lagged-exogenous-regressors-wx) and store them as columns into the data, named as follows:

  - "access_index" --> "w_access_index"
  - "log_n_jobs" --> "w_log_n_jobs"
  - "log_high_edu" --> "w_log_high_edu"
  - "log_avg_household_size" --> "w_log_avg_household_size"
  - "log_avg_income" --> "w_log_avg_income"
 
- Do a regular OLS regression in a similar manner as in Task 5.3 (having "log_price" as the dependent variable), but now use these spatially lagged attributes as explanatory variables. Print the summary of the model and answer the questions (scroll down to find them):
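
A spatially lagged variable is simply the weighted average of the neighbouring areas' values. The sketch below shows this with a plain NumPy matrix product on a toy example; it does not call pysal, and the weights matrix and the values are made up for illustration.

```python
import numpy as np

# Row-standardized weights for four areas in a row (0-1, 1-2, 2-3 are neighbours)
w = np.array(
    [
        [0.0, 1.0, 0.0, 0.0],
        [0.5, 0.0, 0.5, 0.0],
        [0.0, 0.5, 0.0, 0.5],
        [0.0, 0.0, 1.0, 0.0],
    ]
)

# Made-up values of one explanatory variable
access_index = np.array([1.0, 2.0, 3.0, 4.0])

# Spatial lag: for each area, the weighted average of its neighbours' values
w_access_index = w @ access_index
print(w_access_index)
```

Because the weights are row standardized, each lagged value is the mean of the neighbours' values; the area's own value is not included, since the diagonal of the weights matrix is zero.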

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

### Questions

- **Question 5.7.** What is the R-squared value of our model? Did it improve?
- **Question 5.8.** Which of the explanatory variables has the highest coefficient?

### Answers

Answer the questions above by adding text after the `Answer` bullet points below:

(*Hint*: double-click this cell to activate editing)

- **Answer for Q5.7**: 
- **Answer for Q5.8**: 

### Task 5.5: Spatial lag model


In this task, we will learn how introducing spatial effects also into the dependent variable influences our regression model, i.e. we will do a spatial lag model (use the `access` GeoDataFrame as input data). You should:

1. Create row standardized Queen contiguity spatial weights based on our GeoDataFrame (alternatively, you can use the same weights which were used in previous step) 
2. Conduct a spatial two stage least squares regression (i.e. a spatial lag model) in a similar manner as shown [in the tutorial](https://sustainability-gis.readthedocs.io/en/latest/lessons/L4/spatial_regression.html#spatially-lagged-endogenous-regressors-wy). This time, you should use the same explanatory variables as in our first OLS model, i.e. access_index, log_n_jobs, log_high_edu, log_avg_household_size, and log_avg_income. Print the summary of the model and answer the questions (scroll down to find them).

<br>

Please write your solution to the cell below (remove the `raise NotImplementedError()` code). You can create new cells as well if needed.

In [None]:
# REPLACE THE ERROR BELOW WITH YOUR OWN CODE
raise NotImplementedError()

### Questions

- **Question 5.9.** What is the Pseudo R-squared value of our model? 
- **Question 5.10.** Which of the explanatory variables has the highest coefficient? Does it make sense to you? Explain with a sentence or two.

### Answers

Answer the questions above by adding text after the `Answer` bullet points below:

(*Hint*: double-click this cell to activate editing)

- **Answer for Q5.9**: 
- **Answer for Q5.10**: 

## Problem 6 - How long did it take? Optional feedback (1 point)

To help us develop the exercises and understand how much time it took you to finish the Exercise, please provide an estimate of how many hours you spent on this exercise. *__Hint:__ To "activate" this cell in Editing mode, double click this cell. If you want to get this cell back in the "Reading-mode", press Shift+Enter.*

I spent approximately this many hours: **X hours**

In addition, if you would like to give any feedback about the exercise (optional), please provide it below:

**My feedback:**