# Brixton Room Prices

This notebook implements a regression model that predicts the going rate of rooms in shared houses in the London district of Brixton. The data is scraped from [SpareRoom](https://spareroom.co.uk).

The aim of the project is to write the code as a testable and maintainable Python package, with entry points to tune, train and test the model, so that it can easily be integrated into a CI/CD flow. The approach follows a [tutorial](https://morioh.com/p/d9fffafd5f92) on maintainable code.

## Load Packages

As the project uses custom-built classes and functions, written in a text editor / IDE for convenience, the `autoreload` extension below reloads that code automatically, so the notebook does not have to be restarted every time the code is altered.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)

## Scraping the Data

The `SpareRoomScraper` object takes a search ID on initialisation. The ID appears in the URL of any search made through SpareRoom and simply has to be copied and pasted; the search can be for any location. The class uses the `requests-html` library to scrape the website.

In [None]:
from scrape import SpareRoomScraper

scraper = SpareRoomScraper(search_id = 1097663249)
df = scraper.get_data()
df.head()
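
Rather than copying the ID by hand, it can be read from a pasted search URL. The sketch below assumes the ID is carried in a `search_id` query parameter; the example URL and the `extract_search_id` helper are hypothetical and are not part of `scrape.py`.

```python
from urllib.parse import urlparse, parse_qs


def extract_search_id(url: str) -> int:
    """Return the numeric search_id query parameter from a search URL."""
    query = parse_qs(urlparse(url).query)
    if "search_id" not in query:
        raise ValueError(f"no search_id parameter in {url!r}")
    return int(query["search_id"][0])


# Hypothetical URL shape: only the presence of the ID in the URL is known
example_url = "https://www.spareroom.co.uk/flatshare/?search_id=1097663249&mode=list"
search_id = extract_search_id(example_url)
print(search_id)
```

The resulting integer could then be passed to `SpareRoomScraper(search_id=...)` as in the cell above.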

In [None]:
#df.to_feather('./data/brixton.feather')

## Pipeline and ColumnTransformer

#### Preprocessing

The aim is to use sklearn's preprocessing framework for as much of the preprocessing as possible. The idea, inspired by the aforementioned [tutorial](https://morioh.com/p/d9fffafd5f92), is that the data can be prepared for modelling with a combination of `Pipeline` and `ColumnTransformer` objects, so that their parameters can be optimised through `GridSearchCV` in the same way as those of the regression models.

However, `Pipeline` objects are limited in that they can only transform the X matrix: any step that changes the target, or drops rows from it, must be completed externally beforehand. In this case, this means removing any listings that are whole properties rather than rooms, and removing or imputing any missing values within the dataframe.

Thus, while the parameters of many of the preprocessing classes and the regression algorithm can be altered through `GridSearchCV`, the imputation method must be chosen beforehand.
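
The row-level cleaning that has to happen before the pipeline might look like the sketch below. The column names (`property_type`, `price`, `deposit`), the `clean_rows` helper and the choice of median imputation are all assumptions made for illustration; the actual logic lives in `prep_for_model` and may differ.

```python
import pandas as pd


def clean_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop whole-property listings and handle missing values before the pipeline."""
    # Keep rooms only (hypothetical column name and label)
    rooms = df[df["property_type"] != "whole property"]
    # Rows without a target value cannot be used for training
    rooms = rooms.dropna(subset=["price"])
    # Impute a missing feature with its median (one possible choice)
    rooms = rooms.assign(deposit=rooms["deposit"].fillna(rooms["deposit"].median()))
    return rooms.reset_index(drop=True)


# Toy data, not taken from the scraped dataset
listings = pd.DataFrame({
    "property_type": ["room", "whole property", "room", "room"],
    "price": [850.0, 2400.0, None, 950.0],
    "deposit": [800.0, 2400.0, 700.0, None],
})
cleaned = clean_rows(listings)
print(cleaned)
```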

Many preprocessing steps are nevertheless completed within the pipeline, using the classes in `preprocessing.py`. These include:

  - The extraction of distinct pieces of information within some features into their own separate features, e.g. the original `distance_to_station` feature is split into the distance and the mode of transport used to get there (walk or bus).
  - Transformation of some features into a more useful form: `availability` is transformed into `available_in`, which is simply an integer giving the number of days between the current day and the room's availability date.
  - Encoding of Yes/No features in a binary format, and other categorical features in a one-hot format where appropriate.
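
The first two steps above can be sketched with the standard library alone. The raw text format (`"7 min walk"`), the function names and the decision to clamp past dates to zero are assumptions for illustration; the classes in `preprocessing.py` may handle these differently.

```python
import re
from datetime import date
from typing import Tuple


def split_distance(raw: str) -> Tuple[int, str]:
    """Split text such as '7 min walk' into (minutes, mode of transport)."""
    match = re.search(r"(\d+)\s*min(?:ute)?s?\s+(walk|bus)", raw.lower())
    if match is None:
        raise ValueError(f"unrecognised distance format: {raw!r}")
    return int(match.group(1)), match.group(2)


def days_until_available(available_on: date, today: date) -> int:
    """Number of days from today until the room is available (0 if already available)."""
    return max((available_on - today).days, 0)


print(split_distance("7 min walk"))
print(days_until_available(date(2021, 3, 15), today=date(2021, 3, 1)))
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the transformation deterministic and easy to test.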

In addition to `preprocessing.py`, the project also uses `encoders.py` to encode the features in a format suitable for modelling.
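
As a rough illustration of this kind of encoding, the sketch below uses sklearn's built-in encoders inside a `ColumnTransformer`. The column names and values are hypothetical, and the custom classes in `encoders.py` are not reproduced here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy features with hypothetical column names
features = pd.DataFrame({
    "bills_included": ["Yes", "No", "Yes"],
    "travel_mode": ["walk", "bus", "walk"],
    "available_in": [0, 14, 30],
})

encoder = ColumnTransformer(
    transformers=[
        # Yes/No becomes 1/0 by fixing the category order explicitly
        ("binary", OrdinalEncoder(categories=[["No", "Yes"]]), ["bills_included"]),
        # One column per category; unseen categories are encoded as all zeros
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["travel_mode"]),
    ],
    remainder="passthrough",  # numeric columns are passed through unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = encoder.fit_transform(features)
print(encoded)
```

The output columns are ordered as: the binary flag, the one-hot columns (categories sorted alphabetically), then the passed-through numeric column.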

#### Model building

The preprocessing steps are combined with a regression algorithm using a `Pipeline` object. We use a dummy estimator within the pipe so that the algorithm can be set with `GridSearchCV`.

In [25]:
from preprocessing import prep_for_model
from model import build_model
from sklearn import linear_model

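# Load the scraped listings and split them into train and test sets
df = pd.read_feather('./data/brixton.feather')
X_train, X_test, y_train, y_test = prep_for_model(df)

# Swap the pipeline's placeholder estimator for a linear regression
model = build_model()
model.set_params(model=linear_model.LinearRegression())
model.fit(X_train, y_train)
model.predict(X_train)[:5]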

array([ 902.59003035,  853.43336107,  915.55769054,  971.71401698,
       1063.98920047])
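
The placeholder-estimator pattern can be sketched as follows. This is a minimal stand-in rather than the project's `build_model`: the `build_placeholder_pipeline` helper, the single scaling step and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_placeholder_pipeline() -> Pipeline:
    """Preprocessing followed by a placeholder estimator that is replaced later."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("model", DummyRegressor()),  # stand-in; swapped via set_params or a grid search
    ])


# Synthetic data, used only to show that the swapped pipeline fits and predicts
X, y = make_regression(n_samples=50, n_features=4, noise=5.0, random_state=0)

pipe = build_placeholder_pipeline()
pipe.set_params(model=LinearRegression())  # replace the whole step by its name
pipe.fit(X, y)
predictions = pipe.predict(X[:5])
print(predictions)
```

Because the estimator is addressed by its step name, the same name can also appear as a key in a parameter grid, which makes the algorithm itself a tunable parameter.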

## Tuning and Testing

Now that we have a working pipeline, many different algorithms and parameters can easily be tested. The parameters to search over are listed in `config.py`.

The algorithms tested are Ridge and Lasso linear models along with Decision Tree and Random Forest regressors, scored with the coefficient of determination, R<sup>2</sup>.

In [26]:
from model import tune_model
df = pd.read_feather('./data/brixton.feather')
tune_model(df)

Best Hyperparameters: {'model': RandomForestRegressor(max_features=5), 'model__max_features': 5, 'model__n_estimators': 100}
Best score: 0.23953676341307698
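
A search of this shape, where the estimator is itself one of the tuned parameters, might be set up as in the sketch below. The grids, the synthetic data and the fold count are illustrative assumptions; the real grids are in `config.py` and `tune_model` may be implemented differently.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# The "model" step is a placeholder that each grid entry replaces
pipe = Pipeline([("model", DummyRegressor())])

# One dictionary per algorithm, so each only sees its own hyperparameters
param_grid = [
    {"model": [Ridge()], "model__alpha": [0.1, 1.0, 10.0]},
    {"model": [Lasso()], "model__alpha": [0.1, 1.0]},
    {"model": [DecisionTreeRegressor(random_state=0)], "model__max_depth": [3, 5]},
    {
        "model": [RandomForestRegressor(random_state=0)],
        "model__n_estimators": [20],
        "model__max_features": [2, 4],
    },
]

# Synthetic data standing in for the prepared listings
X, y = make_regression(n_samples=80, n_features=4, noise=5.0, random_state=0)

search = GridSearchCV(pipe, param_grid, scoring="r2", cv=3)
search.fit(X, y)
print("Best Hyperparameters:", search.best_params_)
print("Best score:", search.best_score_)
```

Here `best_score_` is the mean cross-validated R<sup>2</sup> of the best parameter combination, not a score on a held-out test set.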


In [27]:
from sklearn.ensemble import RandomForestRegressor
from model import train_model
from model import test_model

params = {
    'model': RandomForestRegressor(max_features = 5, n_estimators = 200),
    'model__max_features': 5,
    'model__n_estimators': 200
}

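# Fit the pipeline with the chosen parameters and save it to disk,
# then score the saved model on the test set
train_model(df, './models/model.joblib', params)
test_model(df, './models/model.joblib')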

R2 on the test set: -0.24240644689106028
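
R<sup>2</sup> is defined as 1 - SS_res / SS_tot, so it falls below zero whenever a model's squared error is larger than that of always predicting the mean of the observed values. The sketch below works through this by hand with made-up rents; the numbers are illustrative and are not taken from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


rents = [800.0, 900.0, 1000.0]        # illustrative observed rents
mean_baseline = [900.0, 900.0, 900.0]  # always predict the mean
poor_model = [1000.0, 800.0, 900.0]    # predictions further off than the mean

print(r_squared(rents, mean_baseline))  # 0.0
print(r_squared(rents, poor_model))     # -2.0
```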


## Results

The performance is clearly poor, even with the best-performing algorithm: the best score found during tuning is about 0.24, and the R<sup>2</sup> on the test set is negative (about -0.24). This is most likely because the features being collected are not the main drivers of price. For example, SpareRoom listings do not state the actual room size, a very important factor in how much the rent will be!

However, adding further features is relatively straightforward given the modular nature of the web scraper and the model testing, so improvements can be made over time.