# Web Scraping and Data Preprocessing

We'll be scraping weather data from https://williamprofit.github.io/ICDSS-Lecture-Webscraping/

Hopefully by the end we'll get a useful model to predict temperatures from multiple features.

We'll proceed as follows:
- Scrape website to create a dataset
- Visualise the data
- Preprocess the data and create a model

## 1. Web Scraping

In order to web scrape, we start by importing all necessary libraries and setting some useful constant values.

In [None]:
from bs4 import BeautifulSoup # Used to parse HTML files
import pandas as pd           # Used to store data
from requests import get      # Used to request website HTML page

# URL we'll be scraping from
URL = 'https://williamprofit.github.io/ICDSS-Lecture-Webscraping'
# File path to save the data to
FILE = './data.csv'

We create a container to hold our data (empty for now). It contains all the features we'll be extracting and our script will fill it as we go on.

In [None]:
# Code here
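
One possible shape for this container, as a minimal sketch: a dictionary mapping each feature name to an empty list. The features are the quantities this notebook plots later on; the exact key names (`wind_speed`, `dew_point` and so on) are assumptions, not names taken from the website.

```python
# Hypothetical column names; adjust them to match what the site actually exposes.
FEATURES = ['temperature', 'pressure', 'wind_speed', 'dew_point', 'humidity']

# One empty list per feature; the scraper appends one value per day.
data = {feature: [] for feature in FEATURES}
```

A dictionary of lists converts directly into a pandas `DataFrame` once it has been filled.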

From the main page, we need to get the URLs of all the sub pages to scrape from. This is done by scraping the page for the links to the subpages and then visiting them in turn.

In [None]:
# Code here
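
A minimal sketch of the link-collection step, assuming the main page lists its sub pages as plain `<a href="...">` links. The sample HTML and the `week1.html`/`week2.html` file names are made up so the example runs offline; in the notebook the HTML would come from something like `get(URL).text`.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

URL = 'https://williamprofit.github.io/ICDSS-Lecture-Webscraping'

# Made-up stand-in for the main page (the real markup may differ).
SAMPLE_MAIN_HTML = """
<ul>
  <li><a href="week1.html">Week 1</a></li>
  <li><a href="week2.html">Week 2</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_MAIN_HTML, 'html.parser')

# Collect every link target and resolve it against the site root.
week_pages = [urljoin(URL + '/', a['href'])
              for a in soup.find_all('a', href=True)]
```

If the real page also contains navigation or external links, the list would need filtering, for example by a CSS class or a URL prefix.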

Now that `week_pages` holds links to all the pages containing the data, we can traverse it and scrape the sub pages.

In [None]:
# Code here
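
The traversal could look roughly like the sketch below. It assumes each sub page holds one HTML table with a header row and one row per day; that layout, the sample HTML and the helper name `scrape_week` are all hypothetical, so the selectors will need adapting to the real pages.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one sub page (the real markup may differ).
SAMPLE_WEEK_HTML = """
<table>
  <tr><th>Temperature</th><th>Pressure</th><th>Humidity</th></tr>
  <tr><td>12.5</td><td>1012</td><td>80</td></tr>
  <tr><td>14.0</td><td>1009</td><td>75</td></tr>
</table>
"""

def scrape_week(html, data):
    """Append every table row found in `html` to the lists in `data`."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = soup.find_all('tr')
    # The header cells give the feature name for each column.
    header = [th.get_text(strip=True).lower() for th in rows[0].find_all('th')]
    for row in rows[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all('td')]
        for name, value in zip(header, cells):
            data[name].append(float(value))
    return data

data = {'temperature': [], 'pressure': [], 'humidity': []}
scrape_week(SAMPLE_WEEK_HTML, data)
```

In the notebook, the same helper would be called once per entry of `week_pages`, passing the downloaded HTML of each page.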

We can now save our data using pandas.

In [None]:
# Code here

We can now simply load our data from a file instead of web scraping again.

In [None]:
# Code here
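
Saving and reloading can be sketched as a round trip through a CSV file. The values below are made up, and a temporary directory stands in for the `FILE` path so the example leaves nothing behind.

```python
import os
import tempfile

import pandas as pd

# Made-up values standing in for the scraped data.
data = {'temperature': [12.5, 14.0], 'humidity': [80.0, 75.0]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv')
    # index=False keeps the row index out of the file.
    pd.DataFrame(data).to_csv(path, index=False)
    # Reading the file back gives a DataFrame with the same columns.
    loaded = pd.read_csv(path)
```

Without `index=False`, the reloaded frame would gain an extra unnamed column holding the old row index.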

## 2. Visualisation
To build a model that performs well, it's important to visualise the data first to get an intuition of how it could be modelled. We typically plot the feature we want to predict against the other features and compute correlation coefficients.

### Plotting graphs
We can start plotting graphs to visualise how the temperature evolves with different metrics. We create a list `days` which simply enumerates the days from 0 to `len(data) - 1`, to be used for the x-axis. This can be achieved by passing `len(data)` to the `range` function and converting the result to a list.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# Code here
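
As a small illustration of the `days` list, assuming `data` is a pandas `DataFrame` with one row per day (the values here are made up):

```python
import pandas as pd

# Made-up frame with one row per day.
data = pd.DataFrame({'temperature': [12.5, 14.0, 13.2]})

# range(n) yields 0, 1, ..., n - 1; list() materialises it.
days = list(range(len(data)))
```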

Let's plot the temperature together with the pressure. The `plot` function accepts any number of x/y pairs: `plt.plot(x1, y1, x2, y2, ...)`.

In [None]:
# Code here
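
A sketch of such a plot, using made-up readings in place of the scraped columns. The legend and axis label are additions for readability and are not required by `plt.plot` itself.

```python
import matplotlib.pyplot as plt

# Made-up readings, for illustration only.
temperature = [12.5, 14.0, 13.2, 11.8]
pressure = [1012, 1009, 1011, 1015]
days = list(range(len(temperature)))

# Two (x, y) pairs in one call draw two lines on the same axes.
lines = plt.plot(days, temperature, days, pressure)
plt.legend(lines, ['Temperature', 'Pressure'])
plt.xlabel('Day')
```

Because the two series live on very different scales, one line may look flat next to the other; plotting them on separate axes or normalising first are possible workarounds.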

We continue with temperature and wind speed:

In [None]:
# Code here

Now temperature and dew point:

In [None]:
# Code here

And finally temperature and humidity:

In [None]:
# Code here

The graphs can now be used to visually get an intuition of how each feature relates to the one we're trying to predict.

### Computing correlation coefficients
Graphs are useful, but they can sometimes be difficult to interpret, so a standard way of measuring how strongly two features are related comes in handy. Several metrics can help; we'll be looking at the covariance. Numpy has a handy `cov` function that we'll be using. Given two 1-D arrays, the function returns a 2x2 matrix (4 values), but we're only interested in the off-diagonal value at `[0][1]`, which is the covariance between the two inputs.

In [None]:
from numpy import cov

# Code here
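
A minimal sketch of the call, with made-up readings. The `corrcoef` line is an addition not used in the original notebook; it shows the same relationship on a unit-free scale between -1 and 1.

```python
import numpy as np

# Made-up readings: humidity falls as temperature rises.
temperature = np.array([10.0, 12.0, 14.0, 16.0])
humidity = np.array([90.0, 85.0, 80.0, 75.0])

# 2x2 matrix: variances on the diagonal, covariance off the diagonal.
cov_matrix = np.cov(temperature, humidity)
covariance = cov_matrix[0][1]

# Pearson correlation coefficient, read from the same position.
correlation = np.corrcoef(temperature, humidity)[0][1]
```

Here the covariance is negative because the two series move in opposite directions, and the correlation is -1 because the relationship is exactly linear.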

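This gives us a better idea of how our features relate to the temperature. A positive covariance means the two features tend to increase together; a negative covariance means one tends to decrease as the other increases. Keep in mind that the magnitude of the covariance depends on the units of both features, so it is only comparable between features measured on similar scales. Dividing by the product of the two standard deviations gives the Pearson correlation coefficient (`numpy.corrcoef`), which always lies between -1 and 1. We can use all this information to decide which features to select for our model.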

## 3. Preprocessing and Modelling

Our data is well balanced and is not categorical, so there is no need to balance, augment or one-hot-encode it. We'll only normalise it so that every column lies in the range `0-1`. For that we'll use the following formula $X_{\text{normalised}}=\frac{X-\min(X)}{\max(X)-\min(X)}$:

In [None]:
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Normalise data (min-max normalisation)
# Code here
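
The formula can be applied directly with pandas or through scikit-learn's `MinMaxScaler`; the sketch below does both on a made-up frame to show they agree. Which of the two the original notebook used is not stated, so treat either as one option.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up frame standing in for the scraped data.
data = pd.DataFrame({
    'temperature': [10.0, 12.0, 16.0],
    'pressure': [1000.0, 1010.0, 1020.0],
})

# Direct application of the formula, column by column.
manual = (data - data.min()) / (data.max() - data.min())

# Same result via MinMaxScaler (default feature_range is (0, 1)).
scaler = preprocessing.MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
```

Keeping the fitted `scaler` around has one practical advantage: `scaler.inverse_transform` can map predictions back to the original units.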

We now split our data into two vectors `X` and `y`. `y` contains the column we want to predict, namely the temperature. `X` contains all the columns (features) we'll use to predict the temperature. 

In [None]:
# Code here
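
A sketch of the split, assuming `data` is a `DataFrame` whose target column is named `temperature` (the column names and values are made up):

```python
import pandas as pd

# Made-up, already normalised frame.
data = pd.DataFrame({
    'temperature': [0.0, 0.5, 1.0],
    'pressure': [1.0, 0.4, 0.0],
    'humidity': [0.2, 0.0, 1.0],
})

# Target: the column we want to predict.
y = data['temperature']
# Features: every remaining column.
X = data.drop(columns=['temperature'])
```

To use only a subset of features, `X` could instead be built by indexing with an explicit list of column names.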

To better evaluate our model, we separate the data into a training set and a test set. The model is fitted on the training set; the test set holds data the model has not seen, so we can measure how well it generalises. To achieve this, scikit-learn provides the `train_test_split` function, which shuffles the data and splits it into four partitions: `X` and `y`, each divided into train and test.

In [None]:
# Code here
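
A minimal sketch of the call on made-up arrays. The 80/20 ratio and the fixed `random_state` are illustrative choices, not values taken from the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle, then hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

Fixing `random_state` makes the shuffle reproducible, which helps when comparing models across runs.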

We finally fit a linear regression model on the training set and gauge its performance on the test set:

In [None]:
# Code here
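
The last step can be sketched as follows. The data is synthetic and noise-free so the example is self-contained; real weather data will give a lower score. Note that for a regressor, `score` returns the coefficient of determination $R^2$ rather than a classification accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, noise-free data: y is an exact linear function of X.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training set only.
model = LinearRegression()
model.fit(X_train, y_train)

# R^2 on data the model has not seen (1.0 is a perfect fit).
r2 = model.score(X_test, y_test)
```

After fitting, `model.coef_` and `model.intercept_` hold the learned weights, which can be compared with the covariances computed earlier as a sanity check.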