# Web Scraping and Data Preprocessing

We'll be scraping weather data from https://williamprofit.github.io/ICDSS-Lecture-Webscraping/

Hopefully by the end we'll get a useful model to predict temperatures from multiple features.

We'll proceed as follows:
- Scrape the website to create a dataset
- Visualise the data
- Preprocess the data and create a model

## 1. Web Scraping

In order to web scrape, we start by importing all necessary libraries and setting some useful constant values.

In [None]:
from bs4 import BeautifulSoup # Used to parse HTML files
import pandas as pd           # Used to store data
from requests import get      # Used to request website HTML page

# URL we'll be scraping from
URL = 'https://williamprofit.github.io/ICDSS-Lecture-Webscraping'
# File path to save the data to
FILE = './data.csv'

We create a container to hold our data (empty for now). It contains all the features we'll be extracting and our script will fill it as we go on.

In [None]:
data = {
  'temp': [],
  'pressure': [],
  'wind_speed': [],
  'dew_point': [],
  'humidity': [],
}

From the main page, we need the URLs of all the subpages that hold the data. We get them by scraping the main page for links to the subpages, then visiting each one in turn.

In [None]:
# Get main page
resp = get(URL)

# Parse it using BeautifulSoup
soup = BeautifulSoup(resp.text, 'html.parser')

# Extract the links to the week pages
week_pages = soup.select('li > a')
# Extract the 'href' attribute
week_pages = [a.get('href') for a in week_pages]
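To see what the `'li > a'` selector picks up, here is a minimal sketch on a hand-written HTML snippet. The snippet and its file names are hypothetical stand-ins; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the main page (not the real markup)
html = """
<ul>
  <li><a href="week1.html">Week 1</a></li>
  <li><a href="week2.html">Week 2</a></li>
</ul>
<p><a href="about.html">About</a></p>
"""

demo_soup = BeautifulSoup(html, 'html.parser')

# 'li > a' matches only <a> tags that are direct children of an <li>,
# so the "about.html" link inside the <p> is ignored
demo_links = [a.get('href') for a in demo_soup.select('li > a')]
print(demo_links)
```

Only the two links nested directly inside list items are returned, which is why the selector is a convenient way to skip unrelated links on the page.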

Now that `week_pages` holds links to all the pages containing the data, we can traverse it and scrape the sub pages.

In [None]:
# Web scrape all week pages
for p in week_pages:
  url = URL + '/' + p # Append page's URL to base URL
  print('Scraping {}'.format(url))

  # Get HTML page
  resp = get(url)
  # Parse page
  soup = BeautifulSoup(resp.text, 'html.parser')

  # Get rows and omit 1st since it contains table headers
  rows = soup.select('tr')[1:]
  for row in rows:
    # Select all values from row
    vals = row.select('th')
    # Extract the text fields, dropping the first cell (it is not one of our features)
    vals = [v.text for v in vals][1:]

    # Store in data
    data['temp'].append(vals[0])
    data['dew_point'].append(vals[1])
    data['humidity'].append(vals[2])
    data['wind_speed'].append(vals[3])
    data['pressure'].append(vals[4])
    
print('Done.')
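The row-parsing logic can be tried offline on a small table. The sketch below assumes, like the loop above, that every cell is a `th` and that the first cell of each row is a label; the table contents are made up. It also skips rows with an unexpected number of cells, which is an optional safeguard rather than part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical week page; values and markup are made up for illustration
html = """
<table>
  <tr><th>Day</th><th>Temp</th><th>Dew point</th><th>Humidity</th><th>Wind speed</th><th>Pressure</th></tr>
  <tr><th>1</th><th>12.5</th><th>8.1</th><th>74</th><th>11</th><th>1012</th></tr>
  <tr><th>2</th><th>14.0</th><th>9.3</th><th>70</th><th>7</th><th>1009</th></tr>
  <tr><th>3</th><th>missing</th></tr>
</table>
"""

# Same column order as the page: temp, dew point, humidity, wind speed, pressure
columns = ['temp', 'dew_point', 'humidity', 'wind_speed', 'pressure']
table = {name: [] for name in columns}

demo_soup = BeautifulSoup(html, 'html.parser')
for row in demo_soup.select('tr')[1:]:                # skip the header row
    cells = [c.text for c in row.select('th')][1:]    # drop the label cell
    if len(cells) != len(columns):
        continue                                      # skip malformed rows
    for name, value in zip(columns, cells):
        table[name].append(float(value))              # convert text to a number

print(table)
```

Pairing the column names with the cells through `zip` keeps the mapping in one place, so a change in column order only needs to be made in the `columns` list.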

We can now save our data using pandas.

In [None]:
# Save data in csv file
print('Saving data to {}'.format(FILE))
# Create a pandas dataframe with given feature columns
data = pd.DataFrame(data, columns=['temp', 'pressure', 'wind_speed', 'dew_point', 'humidity'])
# Save data to file at path FILE, without the row index (it is not a feature)
data.to_csv(FILE, sep=',', index=False)

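We can now simply load our data from the file instead of scraping the website again. The scraped values were stored as text; provided every cell holds a plain number, `read_csv` parses them back as numeric columns.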

In [None]:
data = pd.read_csv(FILE)
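One detail to watch when round-tripping through CSV: by default `to_csv` also writes the DataFrame's row index, and `read_csv` then reads it back as an extra `Unnamed: 0` column. A minimal sketch with made-up values, using an in-memory buffer instead of a file, shows the difference that `index=False` makes:

```python
import io
import pandas as pd

df = pd.DataFrame({'temp': [12.5, 14.0], 'humidity': [74, 70]})  # made-up values

# Default behaviour: the row index is written as an unnamed first column
buf_default = io.StringIO()
df.to_csv(buf_default)
buf_default.seek(0)
cols_default = pd.read_csv(buf_default).columns.tolist()

# index=False writes only the feature columns
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
cols_no_index = pd.read_csv(buf_no_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```

If the index column is already present in a saved file, passing `index_col=0` to `read_csv` is another way to keep it out of the feature columns.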

## 2. Visualisation
To build a model that performs well, it's important to visualise the data first to get an intuition of how it could be modelled. We typically plot the feature we want to predict against the other features and compute correlation coefficients.

### Plotting graphs
We can start plotting graphs to visualise how the temperature evolves alongside the other metrics. We create a list `days` that enumerates the days from 0 to `len(data) - 1`, which will be used for the x-axis. This is done by calling `range` and converting the result to a list.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# x-axis contains days
days = list(range(len(data)))

Let's plot the temperature together with the pressure. The syntax of the plot function is as follows: `plt.plot(x1, y1, x2, y2, ...)`, where each x/y pair draws one line.

In [None]:
plt.xlabel('days')
plt.ylabel('temp, pressure')
plt.plot(days, data['temp'], days, data['pressure'])

We continue with temperature and wind speed:

In [None]:
plt.xlabel('days')
plt.ylabel('temp, wind_speed')
plt.plot(days, data['temp'], days, data['wind_speed'])

Now temperature and dew point:

In [None]:
plt.xlabel('days')
plt.ylabel('temp, dew_point')
plt.plot(days, data['temp'], days, data['dew_point'])

And finally temperature and humidity:

In [None]:
plt.xlabel('days')
plt.ylabel('temp, humidity')
plt.plot(days, data['temp'], days, data['humidity'])

The graphs can now be used to visually get an intuition of how each feature relates to the one we're trying to predict.

### Computing correlation coefficients
Graphs are useful, but they can be difficult to interpret, so a standard measure of how two features vary together comes in handy. Several metrics can help; we'll be looking at the covariance. NumPy has a `cov` function that we'll be using. For two input vectors it returns a 2×2 covariance matrix: the diagonal holds each variable's variance and both off-diagonal entries hold the covariance between the two, so we only need the value at `[0][1]`.

In [None]:
from numpy import cov

pressure_cov = cov(data['temp'], data['pressure'])[0][1]
wind_cov = cov(data['temp'], data['wind_speed'])[0][1]
dew_cov = cov(data['temp'], data['dew_point'])[0][1]
humidity_cov = cov(data['temp'], data['humidity'])[0][1]

print('temp vs pressure: cov={}'.format(pressure_cov))
print('temp vs wind_speed: cov={}'.format(wind_cov))
print('temp vs dew_point: cov={}'.format(dew_cov))
print('temp vs humidity: cov={}'.format(humidity_cov))

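This gives us a better idea of how our features relate to the temperature. The sign of the covariance gives the direction of the relationship: positive means the two features tend to increase together, negative means one tends to decrease as the other increases. Its magnitude, however, depends on the units and spread of each feature, so covariances for different feature pairs cannot be compared directly. Dividing by the product of the two standard deviations gives the Pearson correlation coefficient, which always lies in `[-1, 1]` and is comparable across pairs. We can use all this information to decide which features to select for our model.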
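Because covariance scales with the units of its inputs, a unit-free measure such as the Pearson correlation coefficient is easier to compare across feature pairs. The sketch below uses synthetic data (the numbers are made up, not taken from the scraped dataset) and NumPy's `corrcoef`, which returns a 2×2 matrix laid out like the one from `cov`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two features (not the scraped data)
temp_demo = rng.normal(15, 5, size=200)
pressure_hpa = 1013 - 0.8 * temp_demo + rng.normal(0, 2, size=200)
pressure_pa = pressure_hpa * 100      # same quantity expressed in another unit

# Covariance changes with the unit...
cov_hpa = np.cov(temp_demo, pressure_hpa)[0][1]
cov_pa = np.cov(temp_demo, pressure_pa)[0][1]

# ...while the correlation coefficient does not
corr_hpa = np.corrcoef(temp_demo, pressure_hpa)[0][1]
corr_pa = np.corrcoef(temp_demo, pressure_pa)[0][1]

print('cov  (hPa): {:.2f}   cov  (Pa): {:.2f}'.format(cov_hpa, cov_pa))
print('corr (hPa): {:.3f}   corr (Pa): {:.3f}'.format(corr_hpa, corr_pa))
```

The covariance grows by a factor of 100 when the unit changes, while the correlation stays the same, which is why the latter is the safer number to rank features by.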

## 3. Preprocessing and Modelling

Our data is well balanced and is not categorical, so there is no need to balance, augment or one-hot-encode it. We'll only normalise it so that every column lies in the range `[0, 1]`. For that we'll use min-max normalisation, $X_{normalised}=\frac{X-\min(X)}{\max(X)-\min(X)}$, where the minimum and maximum are taken per column:

In [None]:
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Normalise data (min-max normalisation)
data = (data - data.min()) / (data.max() - data.min())
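
As a quick worked example of the formula, here is the same expression applied to a tiny DataFrame with made-up values. pandas computes `min()` and `max()` per column and broadcasts them across the rows:

```python
import pandas as pd

# Made-up values, chosen so the result is easy to check by hand
df_demo = pd.DataFrame({
    'temp': [10.0, 15.0, 20.0],
    'pressure': [1000.0, 1010.0, 1020.0],
})

# Column-wise min-max normalisation
normalised = (df_demo - df_demo.min()) / (df_demo.max() - df_demo.min())
print(normalised)
```

Each column now runs from 0 (its minimum) to 1 (its maximum), whatever its original unit. Note that a column with a single constant value would give `0 / 0` and therefore `NaN`, so such columns are best dropped beforehand.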

We now split our data into two vectors `X` and `y`. `y` contains the column we want to predict, namely the temperature. `X` contains all the columns (features) we'll use to predict the temperature. 

In [None]:
# Features we use to predict the temperature
features = ['humidity', 'pressure', 'dew_point']

X = data[features].to_numpy()
y = data['temp'].to_numpy()

To better evaluate our model, we separate the data into a training set and a test set. The training set is fed to the model for fitting; the test set contains data the model has not seen, so we can measure how well it performs on unseen data. To achieve this, scikit-learn has the `train_test_split` function, which shuffles the data and splits it into four arrays: `X` and `y` for each of the train and test sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
  X,
  y,
  train_size=0.75,
  test_size=0.25,
  random_state=0,
)
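
To check what the split produces, here is a small sketch on made-up arrays (20 samples, 2 features). The values are arranged so that it is easy to confirm that each row of `X` stays paired with its `y` after shuffling:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i of X_demo is [2i, 2i + 1] and y_demo[i] is i
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo,
    y_demo,
    train_size=0.75,
    test_size=0.25,
    random_state=0,   # fixed seed so the shuffle is reproducible
)

print(Xd_train.shape, Xd_test.shape)

# Rows are shuffled, but features and targets stay aligned
print((Xd_train[:, 0] == 2 * yd_train).all())
```

With 20 samples, 15 end up in the training set and 5 in the test set. Fixing `random_state` makes the same rows land in each set on every run.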

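We finally fit a linear regression model on the training set and evaluate it on the test set. For a regressor, `score` returns the coefficient of determination $R^2$ rather than a classification accuracy; a value of 1.0 means a perfect fit: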

In [None]:
reg = LinearRegression()
# Train on training set
reg.fit(X=X_train, y=y_train)

# Get the R^2 score (coefficient of determination) on the test set
reg.score(X=X_test, y=y_test)
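
For a regressor such as `LinearRegression`, `score` returns the coefficient of determination $R^2 = 1 - SS_{res}/SS_{tot}$. A small sketch on synthetic data (the coefficients and noise level are arbitrary assumptions) recomputes it by hand to show what the number means:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: a linear signal plus a little noise (not the scraped data)
X_synth = rng.uniform(0, 1, size=(100, 3))
y_synth = X_synth @ np.array([0.5, -0.2, 0.8]) + rng.normal(0, 0.05, size=100)

# Fit on the first 75 rows and evaluate on the remaining 25
demo_reg = LinearRegression().fit(X_synth[:75], y_synth[:75])
Xs_test, ys_test = X_synth[75:], y_synth[75:]

# R^2 by hand: 1 - (residual sum of squares / total sum of squares)
ys_pred = demo_reg.predict(Xs_test)
ss_res = np.sum((ys_test - ys_pred) ** 2)
ss_tot = np.sum((ys_test - ys_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

r2_sklearn = demo_reg.score(Xs_test, ys_test)
print(r2_manual, r2_sklearn)
```

The two values agree. An $R^2$ of 1 means the predictions match the targets exactly, 0 means the model does no better than always predicting the mean of the test targets, and negative values are possible for a model that does worse than that.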

And that's it! We now have a handy model to predict temperatures.

_William Profit (williamprofit.com) on behalf of ICDSS (icdss.uk)_