# Introduction: Webscraper for ovfietsbeschikbaar.nl

For the course Online Data Collection and Management, our team scraped data on all stations in the Netherlands that offer OV-bikes (OV-fiets), the rental bikes provided by NS, the Dutch national railway company. In total, 284 stations in the Netherlands rent out these bikes. We scraped all 284 stations for a period of 7 days at 15-minute intervals. For each station we collected the total number of bikes and the number currently available, plus general information on the type of bicycle storage facility and the address. We also stored the date and time of each scrape, so that every availability figure can be tied to a specific moment. In summary, we scraped:

* Date
* Time
* Name of the station
* Total bikes 
* Current availability
* Type of facility
* Address
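
As a rough sanity check on the size of the resulting dataset, the expected number of station records can be computed from the figures above. This is a back-of-the-envelope sketch that assumes every scheduled run succeeds for every station; the variable names are ours, for illustration only.

```python
# Expected dataset size, assuming no failed runs or missing stations
STATIONS = 284          # stations offering OV-bikes
DAYS = 7                # length of the collection period
INTERVAL_MINUTES = 15   # time between two scraping runs

runs_per_day = 24 * 60 // INTERVAL_MINUTES   # 96 runs per day
total_runs = DAYS * runs_per_day             # 672 runs in total
expected_records = total_runs * STATIONS     # 190848 station records

print(runs_per_day, total_runs, expected_records)
```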

As mentioned above, we scraped for 7 days at 15-minute intervals. Running a scraper by hand every 15 minutes for a week is not feasible, so we used a virtual machine instead: we copied our code to it and ran our Python script every 15 minutes with a cronjob. The virtual machine was an EC2 instance offered by Amazon Web Services (AWS). AWS also offers S3 buckets, a cloud storage service that lets you store and retrieve any amount of data from anywhere on the web. We used an S3 bucket to upload the data to the cloud automatically every 15 minutes, so that it was stored safely. The roadmap of our project is shown in the figure below, where the project is divided into three parts:

<img src="https://scrapeovfiets.s3.amazonaws.com/Roadmap.png" alt="Roadmap">

# 1. Write scraper code in Python
The first step was to write the web scraper itself, which collects the OV-bike data from ovfietsbeschikbaar.nl. We break this down into 8 steps and go over them one by one:
### 1.1 Define the URL of the website and the user agent header
We first define the base URL from which we want to scrape the data. The user agent header tells the web server which browser the request claims to come from, so that the site responds as it would to a regular browser.

In [None]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import json
import datetime
import pytz
import time
import os, sys
import boto3

# Define the URL of the website and the user agent header
url = 'https://ovfietsbeschikbaar.nl/locaties'
header = {'User-agent': 'Mozilla/5.0'}

### 1.2 Get the HTML code from the website and process it with BeautifulSoup
BeautifulSoup parses the HTML source code into an object from which information can be extracted. It provides methods and attributes for navigating and searching the HTML document structure, which we use in the next steps to extract data from the `soup0` object.

In [None]:
# Get the HTML code from the website and process it with BeautifulSoup
res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
source_code_0 = res.text  # make website html code readable as text

# Make information "extractable" using BeautifulSoup
soup0 = BeautifulSoup(source_code_0, 'html.parser')
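
Because the script runs unattended, a single failed request would otherwise cost a full run. The sketch below shows one way to retry a download a few times before giving up. It is not part of our script: `fetch_with_retry`, its parameters, and the injected `get` callable are hypothetical names, and we assume `get` behaves like `requests.get` (it returns a response with `raise_for_status()` and `text`).

```python
import time


def fetch_with_retry(get, url, retries=3, delay_seconds=2.0):
    """Return the page text for `url`, retrying on any exception.

    `get` is any callable that takes a URL and returns a response object
    with `raise_for_status()` and `text`, e.g. a wrapped `requests.get`.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url)
            response.raise_for_status()  # turn HTTP error codes into exceptions
            return response.text
        except Exception as error:  # broad on purpose: network and HTTP errors
            last_error = error
            if attempt < retries:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

With requests, `get` could for example be `functools.partial(requests.get, headers=header, timeout=10)`, where the 10-second timeout is an arbitrary choice.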

### 1.3 Store the raw data in a dictionary
We store the raw data in a dictionary. This keeps a record of the exact HTML source code from which the data was extracted, which is useful for debugging and troubleshooting: it allows us (and others in the future) to see exactly what the page looked like at the time of extraction.

In [None]:
# store the raw data in a dictionary
raw_data_hoofdlink = {'html': str(soup0)}

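### 1.4 Find the list of locations on the page and get the links to the OV-fiets stations
We locate the anchor named `Locatielijst` on the overview page. The `div` that follows it contains one `a` element with class `panel-block` per station, and the `href` of each of these elements is the relative path of that station's page.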

In [None]:
# Find the list of locations on the page and get the links to the OV-fiets stations
locatielijst = soup0.find('a', attrs={'name': 'Locatielijst'})
stations = [a['href'] for a in locatielijst.find_next('div').find_all('a', {'class': 'panel-block'})]

### 1.5 Create a list of complete links to the OV-fiets stations
We start with an empty list and write a for loop over the list of stations created in the previous code chunk. For every station, the loop joins the base URL of the website with the relative path of that station. This gives us a list of seeds that we iterate over in the next parts to extract data from each sub-page.

In [None]:
# Create a list of links to the OV-fiets stations
links = []
for station in stations:
    combined_link = "https://ovfietsbeschikbaar.nl" + station
    links.append(combined_link)
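
The same list can also be built with `urllib.parse.urljoin` from the standard library, which handles leading slashes for us. This is an alternative sketch, not what our script does; `build_links` and the example paths are made up for illustration, and we assume the scraped `href` values are relative paths that start with a slash.

```python
from urllib.parse import urljoin

BASE_URL = "https://ovfietsbeschikbaar.nl"


def build_links(relative_paths, base_url=BASE_URL):
    """Turn relative station paths into absolute URLs."""
    return [urljoin(base_url, path) for path in relative_paths]


# Hypothetical relative paths, in the assumed shape of the `href` attributes
example_paths = ["/locaties/example-station-1", "/locaties/example-station-2"]
print(build_links(example_paths))
```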

### 1.6 Define a function to extract information from each OV-fiets station webpage
This function takes the URL of a station webpage as input and scrapes the relevant information from that page. In some cases an if/elif/else statement was needed, because the data was not always available in the place we expected it. Rather than letting the code silently ignore such cases, which would make it error-prone, we replaced the missing value with a descriptive message.
We also recorded the date and time of extraction with the datetime package. Finally, we stored all the scraped data in a dictionary, and we again stored the raw data for each individual link.

In [None]:
# Define the parse_website function to extract information from each OV-fiets station webpage
def parse_website(url):
    header = {'User-agent': 'Mozilla/5.0'}  # the user agent tells the server which browser the request claims to come from
    request = requests.get(url, headers=header)
    request.encoding = request.apparent_encoding  # use the encoding detected from the page content
    source_code = request.text  # decode the response body to text

    # Make information "extractable" using BeautifulSoup
    soup = BeautifulSoup(source_code, 'html.parser')

    # Scrape the relevant information
    station = soup.find(class_='title has-text-weight-bold').get_text()
    totaal = soup.find(title='Schatting op basis van de laaste drie maanden').get_text(strip=True)
    if soup.find(class_="grafiek-overlay").get_text(strip=True) == 'geen actueledata beschikbaar':
        beschikbaar = 'Geen actuele data beschikbaar'
    elif soup.find("div", {"class": "grafiek-overlay"}).find("span", {"class": "badge"})['data-badge'] == '!':
        beschikbaar = 'Vertraging in de data'
    else:
        beschikbaar = soup.find('td', string='Nu beschikbaar').find_next_sibling('td').get_text(strip=True)
    soort = soup.find('td', string='Type stalling').find_next_sibling('td').get_text(strip=True)
    adres = soup.find(class_='table is-narrow is-fullwidth').find('td').get_text()
    if adres.startswith('Adres:'):
        adres = adres.replace('Adres:', '')
    else:
        adres = 'Onbekend'

    # extract current date and time
    current_datetime = datetime.datetime.now(pytz.timezone('Europe/Amsterdam'))
    formatted_datetime = current_datetime.strftime("%Y-%m-%d %H:%M")

    # store the information in a dictionary
    data = {'date': formatted_datetime,
            'station': station,
            'totaal': totaal,
            'beschikbaar': beschikbaar,
            'type stalling': soort,
            'adres': adres}

    # store the raw data in a dictionary
    raw_data = {'date': formatted_datetime, 'html': str(soup)}

    return (data, raw_data)
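
Because `beschikbaar` holds either a number (as text) or one of the two status messages, analysis code has to separate the two cases later on. A minimal sketch of such a step is shown below; `split_availability` is a hypothetical helper that is not part of our scraper, and it assumes that a valid count consists of digits only.

```python
STATUS_MESSAGES = {"Geen actuele data beschikbaar", "Vertraging in de data"}


def split_availability(value):
    """Split the scraped `beschikbaar` field into (count, status).

    Returns (int, "ok") for a numeric value, (None, message) for one of the
    known status messages, and (None, "unparsed") for anything else.
    """
    text = value.strip()
    if text in STATUS_MESSAGES:
        return None, text
    if text.isdigit():
        return int(text), "ok"
    return None, "unparsed"


print(split_availability("12"))
print(split_availability("Vertraging in de data"))
```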

### 1.7 Store the raw HTML content in a JSON file
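We store the raw data in a file called raw_data.json and the structured scraped data in a file called ov_data.json. This way, we have two separate files: one with the source code of every page scraped and one with the structured scraped data (see 1.8). Each record is written as one JSON object per line, and the files are opened in append mode so that every run adds to the existing data.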

In [None]:
# store the raw data of the html structure in a json file
f = open('raw_data.json', 'a', encoding='utf-8')
f.write(json.dumps(raw_data_hoofdlink))
f.write('\n')  # new line to separate objects
f.close()

f = open('raw_data.json', 'a', encoding='utf-8')
for link in links:
    raw_data = parse_website(link)[1]
    f.write(json.dumps(raw_data))
    f.write('\n')  # new line to separate objects
f.close()

### 1.8 Store the scraped data in a JSON file

In [None]:
# store the scraped data in a json file
f = open('ov_data.json', 'a', encoding='utf-8')
for link in links:
    data = parse_website(link)[0]
    f.write(json.dumps(data))
    f.write('\n')  # new line to separate objects
f.close()
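
Both files use the JSON Lines layout: one JSON object per line, appended on every run. The sketch below shows how such a file can be written and read back, and how both files could be filled from a single call per link instead of two. The helper names (`append_jsonl`, `read_jsonl`, `scrape_once`) are hypothetical, and `parse` stands for any function that, like `parse_website`, returns a `(data, raw_data)` tuple.

```python
import json


def append_jsonl(path, record):
    """Append one record to `path` as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read all records back from a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def scrape_once(links, parse, data_path, raw_path):
    """Call `parse` once per link and store both parts of its result."""
    for link in links:
        data, raw_data = parse(link)
        append_jsonl(data_path, data)
        append_jsonl(raw_path, raw_data)
```

Calling `scrape_once(links, parse_website, 'ov_data.json', 'raw_data.json')` would request every station page once per run, so the structured record and the stored HTML come from the same response.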

# 2. Set up EC2 instance
After writing the Python scraping code, we set up an EC2 instance to run it as a scheduled task. The main purpose is to automate code that needs to be executed on a regular basis: once a scheduled task is set up on the EC2 instance, it runs automatically at the specified intervals without manual intervention. This way our Python code kept running every 15 minutes, even when our own laptops were shut down.

### 2.1 Launch an EC2 instance via AWS
We launched an EC2 instance via Amazon Web Services (AWS) by following the steps from the [Running Computations Remotely article](https://tilburgsciencehub.com/tutorials/scale-up/running-computations-remotely/cloud-computing/) on Tilburg Science Hub. This gave us an EC2 instance as shown in the picture below:

<img src="https://scrapeovfiets.s3.amazonaws.com/EC2.png" alt="EC2 instance" width = "900">

### 2.2 Connect to EC2 instance
The next step was to connect to the EC2 instance from our terminal. We first set the HOST and KEY variables by running the following commands in our terminal:
```dash
export HOST=<OUR-PUBLIC-IPV4-DNS>
export KEY=<PATH_TO_OUR_KEY_PAIR_FILE>
```
We then used the `ssh` command to actually connect to the instance:
```dash
ssh -i $KEY ec2-user@$HOST
```
We then got the following output in our terminal, which meant we were connected to our EC2 instance:

<img src="https://scrapeovfiets.s3.amazonaws.com/Connecting_to_EC2.png" alt="Connecting to EC2 instance" width = "500">

### 2.3 Move python script from own computer to EC2 instance
The next step was to copy our Python script to the EC2 instance so that the cronjob could run it. In the terminal, we first navigated (with `cd`) to the directory containing our Python script. We then ran a single command that copies the whole current directory to the EC2 instance:
```dash
scp -i $KEY -r $(pwd) ec2-user@$HOST:/home/ec2-user
```

### 2.4 Set up a cronjob to run the code every 15 minutes
The code now needed to be executed automatically every 15 minutes. We did this by setting up a cronjob on our EC2 instance. Step by step, we typed the following in our terminal while connected to the EC2 instance:
1. ` crontab -e `: this opened the vim editor
2. `I`: this switched vim to insert mode, so that we could type the cronjob
3. `*/15 * * * * /usr/bin/python3 /home/ec2-user/ScrapeFiets/ScrapeFiets.py `: the cron syntax '`*/15 * * * *`' means that the job runs at minutes 0, 15, 30 and 45 of every hour, every day. `/usr/bin/python3` is the path to the Python interpreter on our EC2 instance. To find the directory that contains this interpreter, we ran the code below once in Python on the instance. `/home/ec2-user/ScrapeFiets/ScrapeFiets.py ` is the path to our Python script.
4. ` :wq ` (after pressing Esc to leave insert mode): this saved the cronjob and closed the vim editor.

We could then verify that the cronjob was set up by running `crontab -l` and checking that our line was listed.

In [None]:
# find the directory that contains the Python interpreter on the EC2 instance
path = os.path.dirname(sys.executable)
print(path)
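
To make the schedule concrete, the sketch below expands the minute field `*/15` into the minutes at which the job fires. It only handles the `*` and `*/n` forms; `expand_minute_field` is a hypothetical helper written for illustration, not a full cron parser.

```python
def expand_minute_field(field):
    """Expand a cron minute field of the form '*' or '*/n' into minutes 0-59."""
    if field == "*":
        step = 1
    elif field.startswith("*/"):
        step = int(field[2:])
    else:
        raise ValueError(f"unsupported minute field: {field!r}")
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(range(0, 60, step))


print(expand_minute_field("*/15"))  # [0, 15, 30, 45]
```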

# 3. Linking to an S3 bucket
The last step was to link our EC2 instance to an S3 bucket. S3 is a cloud-based object storage service, also offered by Amazon Web Services (AWS).
We used an S3 bucket for backup and recovery: the data was uploaded to the bucket automatically every 15 minutes. This way it can be accessed whenever needed, and even if, in an extreme case, the EC2 instance crashes or gets hacked, we still have the latest retrieved data in the cloud. An S3 bucket can be set up in a few steps:

### 3.1 Creating an s3 bucket
First, we created an S3 bucket in the same AWS account in which we had created our EC2 instance. This was straightforward and took only minutes (see picture). Our S3 bucket was called 'scrapeovfiets'. The difficult part was creating the IAM role and using it to link the S3 bucket to the EC2 instance (see 3.2 and 3.3).

<img src="https://scrapeovfiets.s3.amazonaws.com/bucket.png" alt="s3 bucket" width = "900">

### 3.2 Creating an IAM role with s3:PutObject permissions
For the EC2 instance to be allowed to put objects in the s3 bucket, it needed to be assigned an IAM role. An IAM role can be compared to a permission that a mother grants to her child to play with certain toys in the house. The EC2 instance is the child, and the S3 bucket is the toy box. We thus needed to set up an IAM role in two steps:
1. Create a policy

    We first needed to create a policy, which is a permission to put objects in the S3 bucket (the terms policy and permission are used interchangeably here). We did this by typing the following JSON in the JSON editor for creating a policy:

    ```json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "S3PutObject",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": [
                    "arn:aws:s3:::scrapeovfiets/*"
                ]
            }
        ]
    }
    ```
    We then could give our policy a name and save it. 

2. Create IAM role and assign policy

    Then we created the actual IAM role. In this step we assigned the policy created in step 1 to our IAM role. We then gave our IAM role a name and saved it.
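
The same policy document can also be generated from Python, which makes the bucket name a parameter instead of a hard-coded string. This is only a sketch; `build_put_object_policy` is a hypothetical helper, and its output would still have to be pasted into the policy editor.

```python
import json


def build_put_object_policy(bucket_name):
    """Return an IAM policy document allowing s3:PutObject on one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "S3PutObject",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            }
        ],
    }


print(json.dumps(build_put_object_policy("scrapeovfiets"), indent=4))
```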


### 3.3 Assign IAM role to the EC2 instance
So far, we have created an S3 bucket and an IAM role with s3:PutObject permissions. We now want to assign this IAM role to the EC2 instance, so that the EC2 instance is allowed to put objects in the S3 bucket (we want the mother to give the child permission to play with the toys). We do this by attaching the IAM role to the EC2 instance.

<img src="https://scrapeovfiets.s3.amazonaws.com/AssignIAMrole.png" alt="Assign IAM role" width = "900">

After we have connected the IAM role with the EC2 instance, we can test in our terminal whether it worked. For this, we connect to the EC2 instance and type the following command:
```dash
aws iam list-users
```
If everything worked, we should get the output:

<img src="https://scrapeovfiets.s3.amazonaws.com/IAMroleterminal.png" alt="Assign IAM role" width = "350">

### 3.4 Include code in Python script that puts the json files in the s3 bucket
The last step was to include code in our Python script that puts the json files created in the script in our s3 bucket. This can be done with the following code:

In [None]:
# Make sure s3 bucket can be used (define parameters)
bucket_name = 'scrapeovfiets'
destination_file_key_data = 'ov_data.json'
destination_file_key_html = 'raw_data.json'

# Create an S3 client
s3 = boto3.client('s3')

# Upload the file to S3
s3.put_object(Body=open('ov_data.json', 'rb'), Bucket=bucket_name, Key=destination_file_key_data)
s3.put_object(Body=open('raw_data.json', 'rb'), Bucket=bucket_name, Key=destination_file_key_html)

time.sleep(1)
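
The two uploads can also be wrapped in a small function that uses the client's `upload_file` method, which opens and closes the local file itself. The sketch below is an alternative to the `put_object` calls above, not the code we ran; `upload_files` is a hypothetical helper, and it assumes each file is stored under its own file name as the object key.

```python
import os


def upload_files(s3_client, bucket_name, paths):
    """Upload each local file to `bucket_name`, using its file name as key."""
    uploaded_keys = []
    for path in paths:
        key = os.path.basename(path)
        # boto3 S3 clients provide upload_file(Filename, Bucket, Key)
        s3_client.upload_file(Filename=path, Bucket=bucket_name, Key=key)
        uploaded_keys.append(key)
    return uploaded_keys
```

With boto3 this could be called as `upload_files(boto3.client('s3'), 'scrapeovfiets', ['ov_data.json', 'raw_data.json'])`.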

# Wrap up
We have now set up a virtual machine that runs our Python script every 15 minutes and stores the JSON files with the data created by this Python code in the cloud!

The code in one block is pasted below:

In [None]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
import json
import datetime
import pytz
import time
import os, sys
import boto3

# finding the path to the python installation on the EC2 instance
path = os.path.dirname(sys.executable)
print(path)

# make sure s3 bucket can be used
bucket_name = 'scrapeovfiets'
destination_file_key_data = 'ov_data.json'
destination_file_key_html = 'raw_data.json'

# Define the URL of the website and the user agent header
url = 'https://ovfietsbeschikbaar.nl/locaties'
header = {'User-agent': 'Mozilla/5.0'}
# Get the HTML code from the website and process it with BeautifulSoup
res = requests.get(url, headers=header)
res.encoding = res.apparent_encoding
source_code_0 = res.text  # make website html code readable as text

# Make information "extractable" using BeautifulSoup
soup0 = BeautifulSoup(source_code_0, 'html.parser')

# store the raw data in a dictionary
raw_data_hoofdlink = {'html': str(soup0)}

# Find the list of locations on the page and get the links to the OV-fiets stations
locatielijst = soup0.find('a', attrs={'name': 'Locatielijst'})
stations = [a['href'] for a in locatielijst.find_next('div').find_all('a', {'class': 'panel-block'})]

# Create a list of links to the OV-fiets stations
links = []
for station in stations:
    combined_link = "https://ovfietsbeschikbaar.nl" + station
    links.append(combined_link)

# Define the parse_website function to extract information from each OV-fiets station webpage
def parse_website(url):
    header = {'User-agent': 'Mozilla/5.0'}  # the user agent tells the server which browser the request claims to come from
    request = requests.get(url, headers=header)
    request.encoding = request.apparent_encoding  # use the encoding detected from the page content
    source_code = request.text  # decode the response body to text

    # Make information "extractable" using BeautifulSoup
    soup = BeautifulSoup(source_code, 'html.parser')

    # Scrape the relevant information
    station = soup.find(class_='title has-text-weight-bold').get_text()
    totaal = soup.find(title='Schatting op basis van de laaste drie maanden').get_text(strip=True)
    if soup.find(class_="grafiek-overlay").get_text(strip=True) == 'geen actueledata beschikbaar':
        beschikbaar = 'Geen actuele data beschikbaar'
    elif soup.find("div", {"class": "grafiek-overlay"}).find("span", {"class": "badge"})['data-badge'] == '!':
        beschikbaar = 'Vertraging in de data'
    else:
        beschikbaar = soup.find('td', string='Nu beschikbaar').find_next_sibling('td').get_text(strip=True)
    soort = soup.find('td', string='Type stalling').find_next_sibling('td').get_text(strip=True)
    adres = soup.find(class_='table is-narrow is-fullwidth').find('td').get_text()
    if adres.startswith('Adres:'):
        adres = adres.replace('Adres:', '')
    else:
        adres = 'Onbekend'

    # extract current date and time
    current_datetime = datetime.datetime.now(pytz.timezone('Europe/Amsterdam'))
    formatted_datetime = current_datetime.strftime("%Y-%m-%d %H:%M")

    # store the information in a dictionary
    data = {'date': formatted_datetime,
            'station': station,
            'totaal': totaal,
            'beschikbaar': beschikbaar,
            'type stalling': soort,
            'adres': adres}

    # store the raw data in a dictionary
    raw_data = {'date': formatted_datetime, 'html': str(soup)}

    return (data, raw_data)

# store the raw data of the html structure in a json file
f = open('raw_data.json', 'a', encoding='utf-8')
f.write(json.dumps(raw_data_hoofdlink))
f.write('\n')  # new line to separate objects
f.close()

f = open('raw_data.json', 'a', encoding='utf-8')
for link in links:
    raw_data = parse_website(link)[1]
    f.write(json.dumps(raw_data))
    f.write('\n')  # new line to separate objects
f.close()

# store the scraped data in a json file
f = open('ov_data.json', 'a', encoding='utf-8')
for link in links:
    data = parse_website(link)[0]
    f.write(json.dumps(data))
    f.write('\n')  # new line to separate objects
f.close()

# Create an S3 client
s3 = boto3.client('s3')

# Upload the file to S3
s3.put_object(Body=open('ov_data.json', 'rb'), Bucket=bucket_name, Key=destination_file_key_data)
s3.put_object(Body=open('raw_data.json', 'rb'), Bucket=bucket_name, Key=destination_file_key_html)

time.sleep(1)