<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/DW_scrape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a multi-lingual text corpus

In this notebook, we extract text from the news portal [Deutsche Welle](https://www.dw.com/en/) (DW), which publishes content in 30+ languages.

Our main goal is to use some articles available from the DW website to create a small corpus in the VERT format.

We will:
* Gather URLs to individual articles
* Extract useful information from the dynamically loaded HTML pages
* Use the Stanza toolkit to tokenize the texts and normalize the tokens
* Store the corpus data in a VERT format file

NB: This notebook is for educational purposes only. Please respect DW's intellectual property rights (IPR).

## Corpus structure

The final corpus [VERT file](https://www.sketchengine.eu/glossary/vertical-file/) should store all the tokenized texts and the metadata information of each document.

Within this corpus, we will use two structures - documents (`<doc>`) and sentences (`<s>`) - where the division into sentences is handled by the `Stanza` toolkit.

Each document tag (`<doc>`) shall contain the following attributes:
* language
* portal (always DW in this case)
* article path
* title
* author
* time
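
As a rough illustration of the target layout, the sketch below assembles one document in VERT form from already tokenized sentences. The helper names (`doc_open_tag`, `to_vert`) and the sample metadata are hypothetical and only serve this example; the attribute keys follow the names used by the code later in this notebook, and escaping attribute values with `html.escape` is an assumption about how to keep quotes in titles from breaking the `<doc>` tag.

```python
from html import escape

def doc_open_tag(meta):
    """Build a <doc ...> opening line, escaping quotes and angle brackets in values."""
    attrs = " ".join(
        f'{key}="{escape(str(value), quote=True)}"' for key, value in meta.items()
    )
    return f"<doc {attrs}>"

def to_vert(meta, sentences):
    """sentences: a list of sentences, each a list of (token, pos, lemma) tuples."""
    lines = [doc_open_tag(meta)]
    for sentence in sentences:
        lines.append("<s>")
        # One token per line, columns separated by tabs
        lines.extend("\t".join(columns) for columns in sentence)
        lines.append("</s>")
    lines.append("</doc>")
    return "\n".join(lines) + "\n"

# Hypothetical sample data, not taken from DW
example_meta = {
    "language": "en",
    "portal": "Deutsche Welle",
    "link": "/en/example-article/a-00000000",
    "title": 'A "quoted" title',
    "author": "Jane Doe",
    "time": "01/01/2024",
}
example_sentences = [
    [("Hello", "INTJ", "hello"), ("world", "NOUN", "world"), (".", "PUNCT", ".")]
]

vert_text = to_vert(example_meta, example_sentences)
print(vert_text)
```

Each token occupies one line, and the structural tags (`<doc>`, `<s>`) sit on lines of their own.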

## Initial set-up

In real-world situations, the information on websites is often **loaded dynamically**, which means that our previous approach (cf. `TextExtraction.ipynb`) of simply reading the retrieved HTML file does not work.

To address this issue, we will use a browser controlled through Python `WebDriver` tooling to render the pages, so that the article contents are present in the HTML when we scrape it:
* [Selenium](https://selenium-python.readthedocs.io/) with Python
* [Geckodriver](https://github.com/mozilla/geckodriver) – required driver for calling the Firefox browser (see Selenium with Python documentation Section 1.5 for other browser drivers)
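
To make the problem concrete, the sketch below compares two hypothetical HTML snapshots of the same page: the raw server response, where the article container is still empty, and the page after the browser has run its scripts. Both snippets are invented for illustration and do not reproduce DW's actual markup; the parser uses only the standard library rather than BeautifulSoup, to keep the example self-contained.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self.paragraphs[-1] += data

def extract_paragraphs(html_text):
    collector = ParagraphCollector()
    collector.feed(html_text)
    return collector.paragraphs

# Hypothetical raw response: the content is filled in later by JavaScript
STATIC_HTML = (
    '<html><body><div id="root"></div>'
    '<script src="app.js"></script></body></html>'
)
# Hypothetical page source after the browser has rendered the page
RENDERED_HTML = (
    '<html><body><div id="root">'
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</div></body></html>"
)

static_paragraphs = extract_paragraphs(STATIC_HTML)
rendered_paragraphs = extract_paragraphs(RENDERED_HTML)
print(len(static_paragraphs), len(rendered_paragraphs))
```

Only the rendered snapshot contains paragraphs to extract, which is why the scraper in this notebook reads the page source from the browser instead of the plain HTTP response.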

In [None]:
%%capture

!pip install selenium
!pip install pandas
!pip install stanza

!wget https://github.com/mozilla/geckodriver/releases/download/v0.34.0/geckodriver-v0.34.0-linux64.tar.gz
!tar -xvzf geckodriver*
!chmod +x geckodriver
!sudo mv geckodriver /usr/local/bin/

In [None]:
import os
import re
import sys
import json
import requests
from bs4 import BeautifulSoup

import time
import stanza
import selenium
from selenium import webdriver
from selenium.webdriver import FirefoxOptions
from selenium.webdriver.common.by import By
import pandas as pd



---



## HTML parsing

In [None]:
MAX_COUNT = 10
DW_URL = "http://www.dw.com"
CORPUS_FILE = "stanza_output.vert"
LANGUAGES = {"de", "en", "es", "fr", "uk"}

In [None]:
info = []

def scrape_article(soup, url):
    a = {}

    language = url[len(DW_URL):].split("/")[1]
    filename = url.split("/")[-1]

    a["language"] = language
    a["portal"] = "Deutsche Welle"
    a["link"] = url

    with open(os.getcwd()+"/"+language+"/"+filename+".txt", "w", encoding="utf-8") as file:
        # Read the title; a missing element raises AttributeError on None
        header = soup.find("header")
        try:
            title = header.find("h1").text
            a["title"] = title
            file.write(title+"\n")
        except AttributeError:
            a["title"] = ""

        # Read article author
        try:
            author = header.find("div", {"class": re.compile(".*author-details.*")}).text
            a["author"] = author
            file.write(author+"\n")
        except AttributeError:
            a["author"] = ""

        # Read article publishing date (named so it does not shadow the time module)
        try:
            published = header.find("time").text
            a["time"] = published
            file.write(published+"\n")
        except AttributeError:
            a["time"] = ""

        file.write("\n")
        # Read main article text; initialise the key first so it always exists,
        # even when the content area is not found
        a["text"] = ""
        try:
            main_div = soup.find("div", {"class": re.compile(".*content-area.*")})
            main_text = main_div.find("div", {"class": re.compile(".*rich-text.*")})
            for p in main_text.find_all("p"):
                a["text"] += p.text+"\n"
                file.write(p.text+"\n")
        except AttributeError:
            pass

    print(a)
    info.append(a)

In [None]:
def scraper(urls):
    opts = FirefoxOptions()
    opts.add_argument("--headless")
    browser = webdriver.Firefox(options=opts)

    try:
        for url in urls:
            url = DW_URL + url
            browser.get(url)
            soup = BeautifulSoup(browser.page_source, "html.parser")
            scrape_article(soup, url)
    finally:
        # Always close the browser, even if loading or parsing a page fails
        browser.quit()

In [None]:
# Let's test that the scraper works
if not os.path.exists(os.getcwd()+"/fr"):
    os.mkdir(os.getcwd()+"/fr")

TEST_URL = "/fr/lubero-beni-butembo-meutres-serie-tutsis-m23-appels-au-calme-reportage/a-69649209"

scraper([TEST_URL])
info.clear()

## Article URLs

Many news sites offer RSS feeds that make it easy to access newly published articles. DW also has a couple of [links](https://corporate.dw.com/en/rss-feeds/a-68693346) that allow for access to the news feed. However, these feeds are not suited for the purposes of creating a multi-lingual text corpus.

We will instead recursively follow the article links provided in each individual article. Additionally, we will only examine those pages whose ID starts with "a-", as these are text-based articles as opposed to video- or image-based news stories.
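
The traversal idea can be sketched independently of the network. The example below is a minimal illustration, not the notebook's implementation: it walks a small hypothetical link graph (`FAKE_SITE`, with invented paths) breadth-first instead of recursively, keeps visited paths in a set, and stops once `max_count` articles are collected. The helper names `is_text_article` and `collect_articles` are assumptions made for this example.

```python
from collections import deque

def is_text_article(path):
    """True when the last path segment starts with "a-" (a text article ID)."""
    return path.rstrip("/").split("/")[-1].startswith("a-")

def collect_articles(start, get_links, max_count):
    """Breadth-first traversal gathering up to max_count distinct article paths."""
    seen = {start}
    found = []
    queue = deque([start])
    while queue and len(found) < max_count:
        page = queue.popleft()
        for link in get_links(page):
            if link in seen or not is_text_article(link):
                continue
            seen.add(link)
            found.append(link)
            queue.append(link)  # follow links from this article later
            if len(found) >= max_count:
                break
    return found

# Hypothetical link graph standing in for real pages
FAKE_SITE = {
    "/en": ["/en/story-one/a-1", "/en/clip/video-2", "/en/story-three/a-3"],
    "/en/story-one/a-1": ["/en/story-three/a-3", "/en/story-four/a-4"],
    "/en/story-three/a-3": ["/en/story-one/a-1", "/en/gallery/g-5"],
    "/en/story-four/a-4": [],
}

result = collect_articles("/en", lambda page: FAKE_SITE.get(page, []), max_count=10)
print(result)
```

An explicit queue avoids deep recursion on long link chains, and the `seen` set gives constant-time membership checks; whether that matters for a limit as small as ten articles is a judgment call.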

In [None]:
seen = []
articles = []

def fetch_articles(url):
    if len(articles) >= MAX_COUNT:
        return

    seen.append(url)
    r = requests.get(url)

    soup = BeautifulSoup(r.text, "html.parser")

    # print(soup.prettify())
    teasers = soup.find_all("div", {"class": "teaser-data"})
    for t in teasers:
        if len(articles) >= MAX_COUNT:
            return
        link = t.find("a")
        if link is None or not link.get("href"):
            continue  # teaser without a usable link
        article = link.get("href")
        artinfo = article.split("/")[-1]
        if artinfo[:2] != "a-":
            continue  # skip video and image stories
        if article not in articles:
            articles.append(article)
        if DW_URL + article not in seen:
            fetch_articles(DW_URL + article)

In [None]:
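from html import escape

def analyze_articles(nlp, output_file=CORPUS_FILE):
    with open(output_file, 'a', encoding="utf-8") as s_file:
        for d in info:
            # Collapse whitespace and escape quotes so attribute values
            # cannot break the single-line <doc> tag
            attrs = {k: escape(" ".join(d[k].split()), quote=True)
                     for k in ("language", "portal", "link", "title", "author", "time")}
            s_file.write('<doc language="{language}" portal="{portal}" link="{link}" title="{title}" author="{author}" time="{time}">\n'.format(**attrs))
            for s in nlp(d["text"]).sentences:
                s_file.write('<s>\n')
                for w in s.words:
                    # One token per line: word form, POS tag, lemma
                    s_file.write(f'{w.text}\t{w.upos}\t{w.lemma}\n')
                s_file.write("</s>\n")
            s_file.write("</doc>\n")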

In [None]:
def fetch_lang(language, stanza_code=""):
    seen.clear()
    articles.clear()
    if not os.path.exists(os.getcwd()+"/"+language):
        os.mkdir(os.getcwd()+"/"+language)
    dw_lang = DW_URL + "/" + language
    fetch_articles(dw_lang)
    scraper(articles)

    if not stanza_code:
        stanza_code = language
    stanza.download(stanza_code)
    nlp = stanza.Pipeline(stanza_code, processors='tokenize,pos,lemma')
    analyze_articles(nlp)
    info.clear()

# fetch_lang("fr")
info.clear()
open(CORPUS_FILE, "w").close()
for lang in LANGUAGES:
    fetch_lang(lang)

## Quick corpus analysis

Here we count the total tokens, unique tokens and unique lemmas in the texts of each language.
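
The counting itself can be sketched without pandas. The example below is a minimal illustration on a tiny hand-written VERT sample (invented for this purpose, not scraped from DW); the function name `corpus_stats` is hypothetical, and the sketch assumes the language is the first attribute of each `<doc>` tag and that token lines hold exactly three tab-separated columns.

```python
from collections import defaultdict

def corpus_stats(vert_lines):
    """Count total tokens, unique tokens and unique lemmas per language."""
    stats = defaultdict(lambda: {"total": 0, "tokens": set(), "lemmas": set()})
    lang = ""
    for line in vert_lines:
        line = line.rstrip("\n")
        if line.startswith("<doc"):
            # The language is the first quoted attribute value
            lang = line.split('"')[1]
        elif line.count("\t") == 2:
            token, _pos, lemma = line.split("\t")
            entry = stats[lang]
            entry["total"] += 1
            entry["tokens"].add(token)
            entry["lemmas"].add(lemma)
    return {
        lang: {
            "total": entry["total"],
            "unique_tokens": len(entry["tokens"]),
            "unique_lemmas": len(entry["lemmas"]),
        }
        for lang, entry in stats.items()
    }

# Hand-written sample in the token / POS / lemma column order
SAMPLE = """<doc language="en" portal="Deutsche Welle">
<s>
The\tDET\tthe
cats\tNOUN\tcat
saw\tVERB\tsee
the\tDET\tthe
cat\tNOUN\tcat
</s>
</doc>
"""

sample_stats = corpus_stats(SAMPLE.splitlines())
print(sample_stats)
```

Unique tokens are case-sensitive here ("The" and "the" count separately), while lemmas merge inflected and capitalised forms, so the lemma count is typically the smaller of the two.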

In [None]:
def print_corpus_info():
    rows = []
    lang = ""
    with open(CORPUS_FILE, "r", encoding="utf-8") as corpus:
        for line in corpus:
            if line.startswith("<doc"):
                # The language is the first attribute of the <doc> tag
                lang = line.split("\"")[1]
            elif not line.startswith("<") or line.startswith("<\t"):
                # Token line: token, POS tag and lemma separated by tabs
                token = line.strip().split("\t")
                rows.append({'lang': lang, 'token': token[0], 'pos': token[1], 'lemma': token[2]})

    # Build the DataFrame once; growing it row by row with df.loc is slow
    df = pd.DataFrame(rows, columns=['lang','token','pos','lemma'])
    print(df.head())
    return df

df = print_corpus_info()

In [None]:
# Print out each language data
for lang in LANGUAGES:
    print(f"Language: {lang}")
    print(f"Total tokens: {len(df[df['lang'] == lang])}")
    print(f"Unique tokens: {df[df['lang'] == lang]['token'].nunique()}")
    print(f"Unique lemmas: {df[df['lang'] == lang]['lemma'].nunique()}")
    # print(df[(df['lang'] == lang) & (df['pos'] != "PUNCT")]['lemma'].value_counts()[:10])
    print("")