<a href="https://colab.research.google.com/github/lpc49/LuxPollen/blob/main/LuxPollen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pollen in Luxembourg

We use the historical pollen data available for Luxembourg to predict pollen concentrations over the next 10 days. 
<br>
We then visualize the predictions in a web app built with Dash. 

In [None]:
import pandas as pd

## Web scraping from Pollen.lu

In this section we scrape the data directly from the pollen.lu website. We do not use the data from https://data.public.lu/en/ because it is not kept up to date.
<br>
Looping through every year and week from 1992 to today means requesting around a thousand pages, which takes a few minutes. That is acceptable for our purpose, but we also make our output available as a CSV file, current as of the end of September 2021. Expand the next two subsections to see how it was obtained.

### Example for the first week of 1992

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
response = requests.get("http://www.pollen.lu/index.php?qsPage=data&year=1992&week=0&qsLanguage=Fra")
response.status_code    # We expect 200 as a response status

200

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')
soup.title              # We expect the following title: <title>Pollen</title>

<title>Pollen</title>

In [None]:
html_tables = soup.find_all('table')    # Storing the html tables

We can see that the pollen data is stored in table number 5 (`html_tables[5]`). 
<br>
The table does not present its header in a standard \<th> tag but as a sub-table inside the table's first row (see below). 
<br>
Note also that the data is split by week, with the URL of each week's data stored in the \<option> tags. 

In [None]:
pollen_table = html_tables[5]
print(pollen_table)

<table width="100%">
<tr>
<td width="5"> </td>
<td>
<div class="content">
<h1>Données de l'année 1992</h1>
<p>Il n’y a plus de pollens allergisants dans l’air.
(Actualisation: 28.09.2021)</p>
<form action="index.php?qsPage=data&amp;year=1992&amp;week=1&amp;qsLanguage=Fra" method="post" name="week">
<p align="center">
Faites un choix:<br>
<select name="cboWeek" onchange="jumpMenu('parent',this,0)">
<option selected="" value="index.php?qsPage=data&amp;year=1992&amp;week=0&amp;qsLanguage=Fra">
            du 1992-01-01 au 1992-01-04            </option>
<option value="index.php?qsPage=data&amp;year=1992&amp;week=1&amp;qsLanguage=Fra">
            du 1992-01-05 au 1992-01-11            </option>
<option value="index.php?qsPage=data&amp;year=1992&amp;week=2&amp;qsLanguage=Fra">
            du 1992-01-12 au 1992-01-18            </option>
<option value="index.php?qsPage=data&amp;year=1992&amp;week=3&amp;qsLanguage=Fra">
            du 1992-01-19 au 1992-01-25            </option>
<option va
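
To make the role of the \<option> tags concrete, here is a minimal sketch that extracts the weekly links from an abridged copy of the markup printed above. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs offline; the names `OptionCollector`, `SNIPPET`, `BASE_URL` and `week_options` are illustrative and not part of the notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "http://www.pollen.lu/"  # assumed base for the relative links

# Abridged copy of the <select> element printed above.
SNIPPET = """
<select name="cboWeek" onchange="jumpMenu('parent',this,0)">
<option selected="" value="index.php?qsPage=data&amp;year=1992&amp;week=0&amp;qsLanguage=Fra">
            du 1992-01-01 au 1992-01-04            </option>
<option value="index.php?qsPage=data&amp;year=1992&amp;week=1&amp;qsLanguage=Fra">
            du 1992-01-05 au 1992-01-11            </option>
</select>
"""


class OptionCollector(HTMLParser):
    """Collect (absolute URL, label) pairs from <option> tags."""

    def __init__(self):
        super().__init__()
        self.options = []    # finished (url, label) pairs
        self._value = None   # value attribute of the option currently being read
        self._text = []      # text chunks of the option currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            # Attribute values arrive already unescaped (&amp; becomes &).
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._value is not None:
            label = " ".join("".join(self._text).split())  # collapse whitespace
            self.options.append((urljoin(BASE_URL, self._value), label))
            self._value = None


collector = OptionCollector()
collector.feed(SNIPPET)
collector.close()
week_options = collector.options

for url, label in week_options:
    print(label, "->", url)
```

Each pair holds the absolute URL of a weekly page and the date range shown in the drop-down menu.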

Since we will reuse the same procedure for all relevant years and weeks, we define the following function:

In [None]:
def pollen_df_from_table(pollen_table):
    dfs = pd.read_html(str(pollen_table))                 # parse the outer table and the sub-table nested in its first row
    df = dfs[0].iloc[1:, :].copy()                        # the pollen data is in dfs[0]; skip its first row (the nested header)
    df.columns = dfs[1].values.tolist()[0]                # use the header row, which is stored in dfs[1]
    df = df.transpose()                                   # transpose so that species are columns and dates are rows
    df.columns = df.iloc[1].tolist()                      # name the columns after the species' Latin names (second row), as a flat list
    df = df.drop(['Français', 'Latin', 'Deutsch', 'Lëtzebuergesch'])   # drop the four species-name rows, keeping only the dates
    df.index = pd.to_datetime(df.index)                   # convert the index to datetime
    df = df.astype(float)                                 # convert the values to float
    return df

In [None]:
pollen_df_from_table(pollen_table)

Unnamed: 0,Ambrosia,Artemisia,Asteraceae,Alnus,Betula,Ericaceae,Carpinus,Castanea,Quercus,Chenopodium,Cupressaceae,Acer,Fraxinus,Gramineae,Fagus,Juncaceae,Aesculus,Larix,Corylus,Juglans,Umbellifereae,Ulmus,Urtica,Rumex,Populus,Pinaceae,Plantago,Platanus,Salix,Cyperaceae,Filipendula,Sambucus,Tilia
1992-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Looping for each year and week

We start by storing all the weekly URLs in a list. 
<br> 
The site's week numbering is inconsistent: the first week of the year is sometimes 0 and sometimes 1, the last week is sometimes 39 and sometimes 51, there are erroneous URLs for 2001, and so on.

In [None]:
from datetime import datetime
current_year = datetime.today().year


In [None]:
weekly_url = []

for year in range(1992, current_year+1):
    url_year = 'http://www.pollen.lu/index.php?qsPage=data&year='+str(year)+'&week=0&qsLanguage=Fra'  # url from which we will extract the url for weekly data for that year, from the 'option' html tags
    response = requests.get(url_year)
    soup = BeautifulSoup(response.text, 'html.parser')
    html_tables = soup.find_all('table')
    link_table = html_tables[5]                                                                       # option tags containing the url for weekly data are stored in table 5
    for option in link_table.find_all('option'):
        link = option['value']
        url_year_week = 'http://www.pollen.lu/'+link
        weekly_url.append(url_year_week)
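
As a side note, the yearly entry URL can also be assembled from its query parameters rather than by string concatenation. The sketch below is one possible way to do it with the standard library; the helper name `data_url` and its default arguments are our own choices, and the parameter names are simply those visible in the URLs above.

```python
from urllib.parse import urlencode

BASE_URL = "http://www.pollen.lu/index.php"


def data_url(year, week=0, language="Fra"):
    """Build the data page URL for a given year and week (hypothetical helper)."""
    params = {"qsPage": "data", "year": year, "week": week, "qsLanguage": language}
    return BASE_URL + "?" + urlencode(params)


first_page = data_url(1992)
print(first_page)
```

Keeping the parameters in a dictionary makes it easier to change the language or the week without editing the string by hand.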

In [None]:
# Uncomment below to see the url list 
# weekly_url

['http://www.pollen.lu/index.php?qsPage=data&year=1992&week=0&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=1&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=2&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=3&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=4&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=5&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=6&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=7&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=8&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=9&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=10&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=1992&week=11&qsLanguage=Fra',
 'http://www.pollen.lu/index.php?qsPage=data&year=

Note that for weeks 22 through 25 of 2001 the data does not exist: the corresponding URLs have an empty `week` parameter, which causes problems later. We therefore remove these entries:

In [None]:
bad_url = 'http://www.pollen.lu/index.php?qsPage=data&year=2001&week=&qsLanguage=Fra'    # URL with an empty week, listed once each for weeks 22 to 25 of 2001
for _ in range(4):
    weekly_url.remove(bad_url)                                                            # each call removes one occurrence
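
A more general alternative to removing the four entries by hand is to drop every URL whose `week` parameter is empty. The following is only a sketch of that idea: `has_week` is a hypothetical helper, and `sample_urls` is a made-up list mimicking the pattern seen for 2001, not scraped data.

```python
from urllib.parse import parse_qs, urlparse


def has_week(url):
    """Return True when the URL carries a non-empty `week` query parameter."""
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    return query.get("week", [""])[0] != ""


# Made-up sample for illustration only.
sample_urls = [
    "http://www.pollen.lu/index.php?qsPage=data&year=2001&week=21&qsLanguage=Fra",
    "http://www.pollen.lu/index.php?qsPage=data&year=2001&week=&qsLanguage=Fra",
    "http://www.pollen.lu/index.php?qsPage=data&year=2001&week=&qsLanguage=Fra",
    "http://www.pollen.lu/index.php?qsPage=data&year=2001&week=26&qsLanguage=Fra",
]

valid_urls = [url for url in sample_urls if has_week(url)]
print(len(sample_urls) - len(valid_urls), "URL(s) dropped")
```

Unlike the explicit removal, this filter would also catch empty weeks in other years, which may or may not be desirable.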


In [None]:
pollen_dfs = []                                                 # this will be a list of small dataframes that we concatenate at the end of the loop

for url_weekly_data in weekly_url:
    response = requests.get(url_weekly_data)
    soup = BeautifulSoup(response.text, 'html.parser')
    html_tables = soup.find_all('table')
    pollen_table = html_tables[5]                           # as above, the weekly pollen data is in table 5
    pollen_df = pollen_df_from_table(pollen_table)          # formatting the weekly pollen data as a dataframe with pollen_df_from_table
    pollen_dfs.append(pollen_df)                            # adding the weekly pollen dataframe to the list

result = pd.concat(pollen_dfs, ignore_index=False)              # concatenating all the weekly pollen dataframes into a single result dataframe
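
With around a thousand requests, a single network error would interrupt the loop above. One possible safeguard, sketched here under our own assumptions, is to retry each page a few times and record the URLs that never succeed. The function `fetch_all` and its parameters are hypothetical; the download function is passed in as `fetch` so that the example can run offline with a simulated `fake_fetch`.

```python
import time


def fetch_all(urls, fetch, retries=2, pause=0.0, errors=(OSError,)):
    """Call fetch(url) for each URL, retrying on failure.

    Returns (pages, failed): pages maps each URL to its content,
    failed lists the URLs that raised on every attempt.
    """
    pages, failed = {}, []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                pages[url] = fetch(url)
                break                      # success: stop retrying this URL
            except errors:
                if attempt < retries:
                    time.sleep(pause)      # wait before the next attempt
        else:
            failed.append(url)             # every attempt raised
    return pages, failed


# Simulated downloader: one URL fails once, another always fails.
calls = {}


def fake_fetch(url):
    calls[url] = calls.get(url, 0) + 1
    if url.endswith("broken"):
        raise OSError("simulated network failure")
    if url.endswith("flaky") and calls[url] == 1:
        raise OSError("simulated timeout")
    return "<html>" + url + "</html>"


pages, failed = fetch_all(["a/ok", "a/flaky", "a/broken"], fake_fetch)
print(sorted(pages), failed)
```

In the notebook, `fetch` could be something like `lambda url: requests.get(url, timeout=10).text`; the exceptions raised by `requests` derive from `OSError`, so the default `errors` value would cover them.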
     


In [None]:
result

Unnamed: 0,Ambrosia,Artemisia,Asteraceae,Alnus,Betula,Ericaceae,Carpinus,Castanea,Quercus,Chenopodium,Cupressaceae,Acer,Fraxinus,Gramineae,Fagus,Juncaceae,Aesculus,Larix,Corylus,Juglans,Umbellifereae,Ulmus,Urtica,Rumex,Populus,Pinaceae,Plantago,Platanus,Salix,Cyperaceae,Filipendula,Sambucus,Tilia
1992-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-02,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-03,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1992-01-05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021-12-27,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-12-28,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-12-29,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2021-12-30,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Acquiring weather data

In [None]:
# Weather data source (data.public.lu): https://data.public.lu/en/datasets/r/a67bd8c0-b036-4761-b161-bdab272302e5

In [59]:
len(test_list)

1001