This is a script for scraping **news items** from the **DOCUMENTA FIFTEEN WEBSITE**: https://documenta-fifteen.de/en/

First of all, you need to connect this Colab notebook with your Google Drive and define the directory for input and output data.

In [None]:
## mount drive
from google.colab import drive
drive.mount("/content/drive")
directory="/content/drive/My Drive/Colab_NLP_UM/"

Then we need to install additional packages.

In [None]:
## install packages that are not part of Python's standard distribution

!pip install beautifulsoup4
!pip install requests

Now we can import the packages and run the actual code. In the **first section**, the script will navigate to the DOCUMENTA15 news site and identify the links to individual news items.

It will also download the individual news pages as HTML files for offline use and store them in a Google Drive folder.
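
The output folders have to exist before the script writes to them, otherwise `open()` raises a `FileNotFoundError`. A minimal sketch for creating them, assuming the folder names `Documenta15_XML` and `Documenta15_TXT` that the script uses further down. The helper name `ensure_output_dirs` and the temporary base folder are illustrative and not part of the original script:

```python
import os
import tempfile

def ensure_output_dirs(base_dir, subfolders=("Documenta15_XML", "Documenta15_TXT")):
    """Create the output folders below base_dir if they do not exist yet."""
    created = []
    for name in subfolders:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        created.append(path)
    return created

# Demo with a temporary folder; in Colab, pass the Drive `directory` instead
demo_base = tempfile.mkdtemp()
output_dirs = ensure_output_dirs(demo_base)
print(output_dirs)
```

Because of `exist_ok=True`, the cell can be re-run without errors.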

The page where all news items are linked is: https://documenta-fifteen.de/en/news/

The links to the news items are contained in an "unordered list" within the HTML source code. Each list item looks like this:

```
<ul class="filter-list">
<li class="filter-list__item" x-show="isFiltered('general ')">
<a  @mouseenter="{
                        $refs.previewImage.src = 'https://documenta-fifteen.de/wp-content/uploads/2022/09/d15_Fridericianum_2_2022_09_24_©_Nicolas_Wefers_001.jpg'; 
                        $refs.previewImageWrap.classList.add('active');
                    }" 
    @mouseleave="$refs.previewImageWrap.classList.remove('active');"  href="https://documenta-fifteen.de/en/news/documenta-fifteen-closes-with-very-good-attendance-figures/" class="news-list__item">
                            <div class="news-list__date">26.9.2022</div>
                            <div class="news-list__textbox">
<h3 class="news-list__headline">documenta fifteen closes with very good attendance figures </h3>
<p class="news-list__informal">General</p>
                            </div>
                        </a>
                    </li>
```
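
To illustrate how such a list item can be parsed, here is a minimal sketch that runs BeautifulSoup on a shortened copy of the snippet above (the `@mouseenter` and `@mouseleave` attributes are left out for brevity). The variable names are illustrative, and the built-in `html.parser` is an assumption, chosen so that no extra parser has to be installed:

```python
from bs4 import BeautifulSoup

# Shortened copy of one list item from the news overview page
sample_html = """
<ul class="filter-list">
<li class="filter-list__item">
<a href="https://documenta-fifteen.de/en/news/documenta-fifteen-closes-with-very-good-attendance-figures/" class="news-list__item">
<div class="news-list__date">26.9.2022</div>
<div class="news-list__textbox">
<h3 class="news-list__headline">documenta fifteen closes with very good attendance figures </h3>
<p class="news-list__informal">General</p>
</div>
</a>
</li>
</ul>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

items = []
for li in sample_soup.find_all("li", class_="filter-list__item"):
    a_tag = li.find("a")
    if a_tag is None or not a_tag.get("href"):
        continue  # skip list items without a usable link
    items.append({
        "url": a_tag["href"],
        "date": li.find("div", class_="news-list__date").get_text(strip=True),
        "headline": li.find("h3", class_="news-list__headline").get_text(strip=True),
    })

print(items)
```

Only the `href` value is needed for downloading. The date and headline are collected here merely to show that the same list item carries this metadata as well.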

The **second part** of the code parses each news page and extracts only the actual news content. In the source code, this content sits inside the `<div class="news__content">` HTML tag.
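
A minimal sketch of this extraction step, run on an invented miniature page (the real news pages are much larger, and the sample text is made up for illustration). The helper name `extract_news_text` is hypothetical. It returns `None` when the container is missing, so that one unusual page does not stop a whole run:

```python
from bs4 import BeautifulSoup

def extract_news_text(html):
    """Return the plain text inside <div class="news__content">, or None."""
    page = BeautifulSoup(html, "html.parser")
    content = page.find("div", class_="news__content")
    if content is None:
        return None
    # separator keeps paragraphs on separate lines; strip removes padding
    return content.get_text(separator="\n", strip=True)

# Invented sample page, for illustration only
sample_page = """
<html><body>
<nav>Menu</nav>
<div class="news__content">
<p>First paragraph.</p>
<p>Second paragraph.</p>
</div>
<footer>Imprint</footer>
</body></html>
"""

print(extract_news_text(sample_page))
print(extract_news_text("<html><body><p>No news here</p></body></html>"))
```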

The plain text extracted here is written to one TXT file per item, and all TXT files are stored in a separate folder on Google Drive.
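
The file name for each item can be derived from the last path segment of its URL. A small sketch using only the standard library; the helper name `slug_from_url` is hypothetical, and the temporary folder and placeholder text stand in for the Drive folder and the real news text:

```python
import os
import tempfile
from urllib.parse import urlparse

def slug_from_url(url):
    """Return the last non-empty path segment of a URL."""
    path = urlparse(url).path  # e.g. "/en/news/some-item/"
    return path.rstrip("/").split("/")[-1]

url = "https://documenta-fifteen.de/en/news/documenta-fifteen-closes-with-very-good-attendance-figures/"
slug = slug_from_url(url)

# Write placeholder text to <slug>.txt with explicit UTF-8 encoding
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, slug + ".txt")
with open(out_path, "w", encoding="utf-8") as f:
    f.write("placeholder news text")

print(out_path)
```

Setting `encoding="utf-8"` explicitly avoids platform-dependent defaults when the text contains non-ASCII characters.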

In [None]:
# import packages (unused imports removed)

import os
import requests
from bs4 import BeautifulSoup

# URL to be called

NEWS_url="https://documenta-fifteen.de/en/news/"

# function to extract html document from given url
# as suggested on https://www.geeksforgeeks.org/beautifulsoup-scraping-link-from-html/

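def getHTMLdocument(url):

    # request for HTML document of given url
    response1 = requests.get(url)

    # stop with an error if the server reports a problem (e.g. 404)
    response1.raise_for_status()

    # response body is returned as a string of HTML source code
    return response1.text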

# download and parse the news overview page
# "html.parser" is Python's built-in parser and needs no extra installation

html_document = getHTMLdocument(NEWS_url)
soup = BeautifulSoup(html_document, 'html.parser')

# find links for individual files

links = soup.find_all("li", {"class" : "filter-list__item"})

# number of links found

print("Number of links found: ", len(links))

# output folders on Google Drive (created if they do not exist yet)

html_dir = os.path.join(directory, "Documenta15_XML")
txt_dir = os.path.join(directory, "Documenta15_TXT")
os.makedirs(html_dir, exist_ok=True)
os.makedirs(txt_dir, exist_ok=True)

# traverse list and get individual news URLs
# enumerate() numbers the items, starting at 1

for counter, lnk in enumerate(links, start=1):

    # get the link target of the list item; skip items without a link

    l_tag = lnk.find("a")
    if l_tag is None or not l_tag.get("href"):
        print("No link found in list item no.", counter)
        continue
    l_rel = l_tag.get("href")
    print(counter, l_rel)

    # use new URL to access individual HTML files
    # get entire page content including metadata

    response2 = requests.get(l_rel)
    outfile = response2.text

    # generate new file names based on link names

    new_l = l_rel.replace("https://documenta-fifteen.de/en/news/", "").strip("/")

    # save each HTML file to drive

    with open(os.path.join(html_dir, new_l + ".html"), "w", encoding="utf-8") as f:
        f.write(outfile)

    # get news content only; skip pages without a news__content container

    soup2 = BeautifulSoup(outfile, "html.parser")
    news_content = soup2.find("div", {"class": "news__content"})
    if news_content is None:
        print("No news content found in file no.", counter)
        continue
    news_text = news_content.text

    # save news content to TXT files on drive

    with open(os.path.join(txt_dir, new_l + ".txt"), "w", encoding="utf-8") as f:
        f.write(news_text)
    print("File no.", counter, "downloaded!")

Script provided by Monika Barget, Maastricht University, February 2022