### Purpose

The citation date is not always available from the Elasticsearch request, so we instead search for the first year written as '(YYYY)' in the description meta tag of the dataset web page.
View the page source to see its structure and tags. Here we read the content of the ```<meta name="description">``` tag.

Examples:
 * https://doi.pangaea.de/10.1594/PANGAEA.826455
   - Elasticsearch request: https://ws.pangaea.de/es/dataportal-oa-icc/pansimple/_search?q=_id:PANGAEA.826455
 * https://doi.pangaea.de/10.1594/PANGAEA.959648
   - Elasticsearch request: https://ws.pangaea.de/es/dataportal-oa-icc/pansimple/_search?q=_id:PANGAEA.959648
   
Other tags also contain dates, but they all appear as ```<meta name="DC.relation"```, so we cannot tell the 'Supplement to', 'Source' and 'Documentation' entries apart.

In [1]:
import requests
from bs4 import BeautifulSoup
import re

#### Try some code

In [2]:
url = "https://doi.pangaea.de/10.1594/PANGAEA.959648"
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
content = soup.find('meta', {'name':'description'}).get('content')
content

'Takolander, Antti; Cabeza, Mar; Leskinen, Elina (2023): Seawater carbonate chemistry and ecophysiology of brown macroalga Fucus vesiculosus L. PANGAEA, https://doi.pangaea.de/10.1594/PANGAEA.959648 (DOI registration in progress)'

In [3]:
m = re.search(r"([0-9]{4})", content)
m.group(0)

'2023'
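
The pattern in the previous cell matches the first run of four digits anywhere in the string. If a description had no year, it would pick up `1594` from the DOI instead. A minimal sketch of a stricter variant follows, assuming the year always appears in parentheses as in the description shown above. The helper name `extract_year` and the `without_year` string are hypothetical and only serve as illustration.

```python
import re

# Hypothetical helper: only accept a four-digit year written as "(YYYY)".
YEAR_IN_PARENS = re.compile(r"\(([0-9]{4})\)")

def extract_year(description):
    """Return the first '(YYYY)' year in `description` as an int, or None."""
    m = YEAR_IN_PARENS.search(description or "")
    return int(m.group(1)) if m else None

# Description returned for PANGAEA.959648 (copied from the output above)
with_year = ("Takolander, Antti; Cabeza, Mar; Leskinen, Elina (2023): "
             "Seawater carbonate chemistry and ecophysiology of brown macroalga "
             "Fucus vesiculosus L. PANGAEA, "
             "https://doi.pangaea.de/10.1594/PANGAEA.959648 "
             "(DOI registration in progress)")

# Made-up description without a year, for illustration only
without_year = "Some dataset. PANGAEA, https://doi.pangaea.de/10.1594/PANGAEA.959648"

print(extract_year(with_year))                         # 2023
print(extract_year(without_year))                      # None
print(re.search(r"[0-9]{4}", without_year).group(0))   # 1594 (taken from the DOI)
```

Returning `None` when no parenthesised year is present makes a missing date visible instead of silently reporting part of the DOI.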

#### Make a function from the previous lines

In [4]:
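def get_date_from_description(url):
    """Return the first '(YYYY)' year found in the page's description meta tag, or None."""
    try:
        # A timeout avoids hanging forever on an unresponsive server
        page = requests.get(url, timeout=30)
        soup = BeautifulSoup(page.content, "html.parser")
        content = soup.find('meta', {'name':'description'}).get('content')
        # Match a four-digit year in parentheses, e.g. "(2023)"
        m = re.search(r"\(([0-9]{4})\)", content)
        date = int(m.group(1))
    except (requests.RequestException, AttributeError, TypeError):
        # Request failed, meta tag or content missing, or no year found
        date = None
    return date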

In [5]:
get_date_from_description("https://doi.pangaea.de/10.1594/PANGAEA.959648")

2023
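
`get_date_from_description` returns `None` whenever the lookup fails, and there are several reasons it can: the page has no description meta tag, the tag has no `content` attribute, or the content contains no year. The sketch below separates the parsing from the HTTP request so these cases can be checked without network access. The function name `year_from_html` and the three HTML snippets are made up for illustration and are not real PANGAEA markup.

```python
import re
from bs4 import BeautifulSoup

def year_from_html(html):
    """Return the first '(YYYY)' year in the description meta tag, or None."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", {"name": "description"})
    if tag is None:
        # The page has no description meta tag
        return None
    # The tag may exist without a content attribute
    content = tag.get("content") or ""
    m = re.search(r"\(([0-9]{4})\)", content)
    return int(m.group(1)) if m else None

# Minimal stand-in pages (hypothetical, for illustration only)
ok = '<head><meta name="description" content="Doe, J (2023): Title. PANGAEA"></head>'
no_tag = "<head><title>No description here</title></head>"
no_year = '<head><meta name="description" content="Title only"></head>'

print(year_from_html(ok))       # 2023
print(year_from_html(no_tag))   # None
print(year_from_html(no_year))  # None
```

With the parsing isolated this way, the network call would reduce to passing `requests.get(url).content` to the helper.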