# Weather website web scraping

## Overview

While a lot of local weather data is available online, it is often scattered and not easily downloadable.
The weather station considered here only exposes its data one day at a time. To collect a whole year or more, scraping seems to be the fastest solution.

## Implementation
We use the web-scraping library BeautifulSoup4, with lxml as the parser.

In [2]:
from lxml import html
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date

In [None]:
from suivi_meteo import functions
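
The `functions` module is not shown in this notebook. Since the station serves one page per day, `url_from_datetime` presumably turns a date into the address of that day's page. The following is only a minimal sketch of that idea: the base URL and the `date` query parameter are placeholders, not the real address of the station.

```python
import pandas as pd

# Placeholder address: the real station URL is not given in this notebook.
BASE_URL = "https://example.com/station/daily"


def url_from_datetime(day: pd.Timestamp) -> str:
    """Build the (hypothetical) URL of the page holding one day of data."""
    # Zero-padded ISO date, so that each day maps to exactly one URL.
    return f"{BASE_URL}?date={day.strftime('%Y-%m-%d')}"


print(url_from_datetime(pd.Timestamp("20230101")))
```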

### Data structure
For each day, the website provides a table with one row per hour: a timestamp, the temperature and the rainfall.
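
To make this structure concrete, here is a minimal sketch of how such a table could be turned into a raw dataframe. The HTML sample, its column headers and the helper `dataframe_from_table` are invented for illustration; the real page markup and the real `functions.dataframe_from_soup` may differ. The sketch uses the built-in `html.parser` to avoid an extra dependency; `"lxml"` can be passed instead.

```python
import re

import bs4
import pandas as pd

# Invented sample page: the real markup of the station's website may differ.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Heure locale</th><th>Temperature</th><th>Pluie</th></tr>
  <tr><td>00h</td><td>5.1</td><td>0</td></tr>
  <tr><td>01h</td><td>4.8</td><td>0.2</td></tr>
</table>
</body></html>
"""


def dataframe_from_table(table: bs4.Tag) -> pd.DataFrame:
    """Read every row of an HTML table as text; the first row is the header."""
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    header, *body = rows
    return pd.DataFrame(body, columns=header)


soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
# Find a text node containing 'locale', then walk up to the enclosing table.
table = soup.find(string=re.compile("locale")).find_parent("table")
raw = dataframe_from_table(table)
print(raw)
```

Using `find_parent("table")` rather than a fixed chain of `.parent` calls keeps the lookup valid if the nesting depth around the header text changes.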

In [7]:
# Time span for which we collect data.
# Here for example one week
origin = pd.Timestamp('20230101')
enddate = pd.Timestamp('20230107')

In [None]:
# Daily time range from origin to enddate (both included)
dti = pd.date_range(origin, enddate, freq="D")

# Output dataframe definition
output = pd.DataFrame({"time":[], "temperature":[], "precipitations":[]})

# Scrape each day in turn and append its hourly table to the output
for day in dti:
    print(day)
    url = functions.url_from_datetime(day)
    soup = functions.soup_from_url(url)
    # Locate the text node 'locale', then climb four levels up to reach the table
    tablesoup = soup.find(string='locale').parent.parent.parent.parent
    uglydf = functions.dataframe_from_soup(tablesoup)
    df = functions.dataframe_cleanup(uglydf, day)
    output = pd.concat([output, df], axis=0)

# Save the output as CSV; zero-padded dates keep the file name unambiguous
path = "../results/meteo_" + origin.strftime("%Y%m%d") + "_" + enddate.strftime("%Y%m%d") + ".csv"
output.to_csv(path)
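
As a possible variation on the loop above, the daily frames can be collected in a list and concatenated once at the end, which avoids copying the growing dataframe at every iteration. This is only a sketch: `fake_daily_frame` is a hypothetical stand-in for the scraping of one day and returns dummy values.

```python
import pandas as pd


def fake_daily_frame(day: pd.Timestamp) -> pd.DataFrame:
    """Stand-in for scraping one day: 24 hourly rows of dummy values."""
    times = day + pd.to_timedelta(list(range(24)), unit="h")
    return pd.DataFrame({"time": times, "temperature": 0.0, "precipitations": 0.0})


origin = pd.Timestamp("20230101")
enddate = pd.Timestamp("20230107")

# One dataframe per day, concatenated a single time at the end.
frames = [fake_daily_frame(day) for day in pd.date_range(origin, enddate, freq="D")]
output = pd.concat(frames, ignore_index=True)
print(output.shape)
```

Passing `ignore_index=True` gives the result a single continuous index instead of repeating each day's row numbers.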


### Caveat
Note that this method does not transfer well to other websites: the table lookup and the data cleaning function depend on this site's page structure and must be adapted.
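
To illustrate what adapting the cleaning step can involve, here is a minimal sketch of a cleaning function whose site-specific parts are passed in as parameters. It is not the `dataframe_cleanup` from `suivi_meteo`: the name `clean_daily_table`, the raw column headers and the value formats (`"01h"`, `"5,1 C"`) are all assumptions made for this example.

```python
import pandas as pd

# Matches the leading number of a cell, with either '.' or ',' as decimal mark.
NUMBER = r"(-?\d+(?:[.,]\d+)?)"


def clean_daily_table(raw: pd.DataFrame, day: pd.Timestamp, column_names: dict) -> pd.DataFrame:
    """Turn a raw table of strings into typed columns.

    `column_names` maps the site's headers to 'time', 'temperature'
    and 'precipitations'; it is the part to change for another website.
    """
    df = raw.rename(columns=column_names)
    # Hour strings such as "01h" become full timestamps on the given day.
    hours = df["time"].str.extract(r"(\d+)", expand=False).astype(int)
    df["time"] = day + pd.to_timedelta(hours, unit="h")
    for col in ("temperature", "precipitations"):
        # Keep the number, drop any unit suffix; unparsable cells become NaN.
        numbers = df[col].str.extract(NUMBER, expand=False).str.replace(",", ".", regex=False)
        df[col] = pd.to_numeric(numbers, errors="coerce")
    return df[["time", "temperature", "precipitations"]]


raw = pd.DataFrame({
    "Heure locale": ["00h", "01h"],
    "Temperature": ["5,1 C", "-0.4 C"],
    "Pluie": ["0 mm", "traces"],
})
names = {"Heure locale": "time", "Temperature": "temperature", "Pluie": "precipitations"}
clean = clean_daily_table(raw, pd.Timestamp("2023-01-01"), names)
print(clean)
```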