# Chapter 3: Scraping Websites and Extracting Data

## Setup
The directories are set up. If you are working with Google Colab, the required files are copied and the required libraries are installed.

## Note

Code lines marked with ### contain values that can be adjusted.

In [1]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master'
    os.system(f'wget {GIT_ROOT}/ch03/setup.py')

%run -i setup.py

You are working on a local system.
Files will be searched relative to "..".


## Load Python Settings

General imports, default values for formatting in Matplotlib, Pandas, etc.

In [2]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# 1. Blueprint: Downloading and Interpreting robots.txt

In [3]:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reuters.com/robots.txt")
rp.read()

In [4]:
rp.can_fetch("*", "https://www.reuters.com/arc/outboundfeeds/news-sitemap/?outputType=xml")

True

In [5]:
rp.can_fetch("*", "https://www.reuters.com/finance/stocks/option")

False
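
The same parser can also be fed the text of a robots.txt file directly, which makes it easy to see how rules are evaluated without a network request. The following is a minimal sketch; the robots.txt content and the `example.com` URLs are made up for illustration and are not taken from Reuters.

```python
import urllib.robotparser

# Hypothetical robots.txt content, for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
Sitemap: https://example.com/sitemap.xml
"""

rp_demo = urllib.robotparser.RobotFileParser()
# parse() takes the file content as a list of lines instead of downloading it
rp_demo.parse(ROBOTS_TXT.splitlines())

allowed = rp_demo.can_fetch("*", "https://example.com/news/article-1")
blocked = rp_demo.can_fetch("*", "https://example.com/private/data")
delay = rp_demo.crawl_delay("*")   # seconds to wait between requests, or None
sitemaps = rp_demo.site_maps()     # list of sitemap URLs, or None (Python 3.8+)

print(allowed, blocked, delay, sitemaps)
```

Besides `can_fetch`, the `crawl_delay` and `site_maps` values are worth checking before crawling: the first tells you how fast you may request pages, the second points to the sitemap used in the next blueprint.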

# 2. Blueprint: Finding URLs from sitemap.xml

In [6]:
# xmltodict must be installed
import xmltodict
import requests

sitemap = xmltodict.parse(requests.get('https://www.reuters.com/arc/outboundfeeds/news-sitemap/?outputType=xml').text)

In [7]:
# Only look at a few of the URLs
urls = [url["loc"] for url in sitemap["urlset"]["url"]]
print("\n".join(urls[0:5])) ###

https://www.reuters.com/business/energy/vattenfall-ameresco-operate-zero-carbon-heat-network-bristol-2022-03-30/
https://www.reuters.com/world/asia-pacific/covid-cases-asia-surpass-100-million-reuters-tally-2022-03-30/
https://www.reuters.com/lifestyle/sports/injured-marsh-out-pakistan-matches-heads-india-join-ipl-team-2022-03-30/
https://www.reuters.com/world/middle-east/dubais-dewa-says-increases-ipo-size-17-could-raise-much-57-bln-2022-03-30/
https://www.reuters.com/world/middle-east/dubais-new-crypto-regulator-brings-uae-firm-bitoasis-under-its-wing-2022-03-30/
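
If `xmltodict` is not available, the `<loc>` entries can also be read with the standard library. This is a sketch under assumptions: the sitemap text below is a made-up two-entry example, and `sitemap_locs` is a hypothetical helper name. It relies only on the namespace defined by the sitemap protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap content, for illustration only
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a/</loc><lastmod>2022-03-30</lastmod></url>
  <url><loc>https://example.com/b/</loc></url>
</urlset>"""

# Sitemap elements live in this XML namespace, so lookups must be qualified
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_locs(xml_text):
    """Return the text of every <loc> element inside a <url> element."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

print(sitemap_locs(SITEMAP_XML))
```

Unlike the dictionary access used above, `findall` always returns a list, so a sitemap with a single `<url>` entry needs no special handling.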


# 3. Blueprint: Finding URLs from RSS

Unfortunately, Reuters has since removed its RSS feed. We therefore use a saved copy from the Internet Archive.

In [8]:
# feedparser must be installed
import feedparser
feed = feedparser.parse('http://web.archive.org/web/20200613003232if_/http://feeds.reuters.com/Reuters/worldNews')

In [9]:
[(e.title, e.link) for e in feed.entries]

[('Mexico City to begin gradual exit from lockdown on Monday',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/OQtkVdAqHos/mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R'),
 ('Mexico reports record tally of 5,222 new coronavirus cases',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/Rkz9j2G7lJU/mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B'),
 ('Venezuela supreme court to swear in new electoral council leaders, government says',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/cc3R5aq4Ksk/venezuela-supreme-court-to-swear-in-new-electoral-council-leaders-government-says-idUSKBN23J39T'),
 ("One-fifth of Britain's coronavirus patients were infected in hospitals: Telegraph",
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/1_7Wb0S_6-8/one-fifth-of-britains-coronavirus-patients-were-infected-in-hospitals-telegraph-idUSKBN23J382'),
 ('France to lift border controls for EU travellers on June 15',
  'http://feeds.reuters.com/~r/

In [10]:
[e.id for e in feed.entries]

['https://www.reuters.com/article/us-health-coronavirus-mexico-city/mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-mexico/mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-venezuela-politics/venezuela-supreme-court-to-swear-in-new-electoral-council-leaders-government-says-idUSKBN23J39T?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-britain-hospitals/one-fifth-of-britains-coronavirus-patients-were-infected-in-hospitals-telegraph-idUSKBN23J382?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-france-borders/france-to-lift-border-controls-for-eu-travellers-on-june-15-idUSKBN23J385?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-brazil/brazils-covid-19-deaths-sur
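
The `id` values carry the query string `?feedType=RSS&feedName=worldNews`, which only records where the link came from. When the URLs are later used as keys or filenames, it can help to strip that part. A minimal sketch with the standard library follows; `strip_query` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# One of the feed ids shown above
entry_id = ("https://www.reuters.com/article/us-health-coronavirus-mexico/"
            "mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B"
            "?feedType=RSS&feedName=worldNews")

clean_url = strip_query(entry_id)
print(clean_url)
```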

# 4. Blueprint: Downloading HTML Pages with Python

In [11]:
%%time
s = requests.Session()
for url in urls[0:5]:   ###
    # Use the part of the string between the second-to-last and the last '/' as the filename
    file = url.split("/")[-2] ### was -1
    
    r = s.get(url)
    if file == '':
        print("error with URL %s" %url)
    else:
        print(file)
        with open(file, "w+b") as f:
            f.write(r.text.encode('utf-8'))
                

vattenfall-ameresco-operate-zero-carbon-heat-network-bristol-2022-03-30
covid-cases-asia-surpass-100-million-reuters-tally-2022-03-30
injured-marsh-out-pakistan-matches-heads-india-join-ipl-team-2022-03-30
dubais-dewa-says-increases-ipo-size-17-could-raise-much-57-bln-2022-03-30
dubais-new-crypto-regulator-brings-uae-firm-bitoasis-under-its-wing-2022-03-30
CPU times: total: 46.9 ms
Wall time: 698 ms


In [12]:
with open("urls.txt", "w+b") as f:
    f.write("\n".join(urls).encode('utf-8'))
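
Whether the filename sits at index `-1` or `-2` after splitting depends on whether the URL ends with a slash. The following sketch sidesteps that by using the last non-empty path segment. `filename_from_url` and its `suffix` parameter are hypothetical and not part of the original notebook, which saves the files without an extension.

```python
from urllib.parse import urlsplit

def filename_from_url(url, suffix=".html"):
    """Build a filename from the last non-empty path segment of a URL.

    Returns None if the URL has no path segment (for example, a bare domain).
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1] + suffix if segments else None

# Works with and without a trailing slash
print(filename_from_url(
    "https://www.reuters.com/world/asia-pacific/"
    "covid-cases-asia-surpass-100-million-reuters-tally-2022-03-30/"))
print(filename_from_url(
    "https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT"))
```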

# 5. Blueprint: Extraction with Regular Expressions

We first have to download a single article.

In [13]:
url = 'https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

file = url.split("/")[-1] + ".html"

print(file)

r = requests.get(url)

with open(file, "w+", encoding="utf-8") as f:
    f.write(r.text)

us-health-vaping-marijuana-idUSKBN1WG4KT.html


In [14]:
import re
with open(file, "r", encoding="utf-8") as f:
    html = f.read()
    g = re.search(r'<title>(.*)</title>', html, re.MULTILINE|re.DOTALL)
    if g:
        print(g.groups()[0])

Banned in Boston: Without vaping, medical marijuana patients must adapt | Reuters
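
The pattern above is greedy, so on a page with more than one `<title>` element (inline SVG icons often carry one) it would capture everything up to the last closing tag. A slightly more defensive variant is sketched below. The sample page is made up, and `extract_title` is a hypothetical helper name.

```python
import re
from html import unescape

# Non-greedy match stops at the first closing </title> tag
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def extract_title(page):
    """Return the first <title> text with HTML entities decoded, or None."""
    match = TITLE_RE.search(page)
    return unescape(match.group(1)).strip() if match else None

# Hypothetical page with an entity-encoded title and a second <title> in an SVG
sample = ("<html><head><title>Biden says budget targets Trump&#x27;s "
          "&#x27;fiscal mess&#x27; | Reuters</title></head>"
          "<body><svg><title>Logo</title></svg></body></html>")

print(extract_title(sample))
```

`html.unescape` turns entities such as `&#x27;` back into the characters they stand for, which a plain regular expression leaves untouched.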


# 6. Blueprint: Using an HTML Parser for Extraction

Reuters changes its content structure constantly. Unfortunately, the markup has been *obfuscated*, so the methods shown here no longer work without massive changes.

In this Jupyter notebook, we download the articles from the Internet Archive, which still has the old, permanently unchanged HTML structure.

In [15]:
WA_PREFIX = "http://web.archive.org/web/20200118131624/"
html = s.get(WA_PREFIX + url).text

In [16]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup.select("h1.ArticleHeader_headline")

[<h1 class="ArticleHeader_headline">Banned in Boston: Without vaping, medical marijuana patients must adapt</h1>]

## Extracting the Title/Headline

In [17]:
soup.h1

<h1 class="ArticleHeader_headline">Banned in Boston: Without vaping, medical marijuana patients must adapt</h1>

In [18]:
soup.h1.text

'Banned in Boston: Without vaping, medical marijuana patients must adapt'

In [19]:
soup.title.text

'\n                Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

In [20]:
soup.title.text.strip()

'Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

## Extracting the Article Text

In [21]:
soup.select_one("div.StandardArticleBody_body").text

'BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah NouvelageThe 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. There are other ways to get the desired effect from  marijuana, and patients have filled dispensaries across the state in recent days to ask about edible or smokeable forms. But Medeiros has come to depend on her battery-powered pen, and wondered how she would cope without her usual supply of cartridges.  “In the midst of something where I’m on t
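
The extracted text still contains non-breaking spaces (`\xa0`) and runs of multiple spaces. A small cleanup step, sketched here with the standard library, collapses them. `normalize_whitespace` is a hypothetical helper name.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace into a single regular space.

    For str patterns, \\s also matches Unicode whitespace such as the
    non-breaking space U+00A0.
    """
    return re.sub(r"\s+", " ", text).strip()

# Fragments taken from the output above
print(normalize_whitespace("began to worry.\xa0 FILE PHOTO: An employee"))
print(normalize_whitespace("the desired effect from  marijuana"))
```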

## Extracting Image Captions

In [22]:
soup.select("div.StandardArticleBody_body figure")

[<figure class="Image_zoom" style="padding-bottom:"><div class="LazyImage_container LazyImage_dark" style="background-image:none"><img aria-label="FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage" src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20"/><div class="LazyImage_image LazyImage_fallback" style="background-image:url(//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20);background-position:center center;background-color:inherit"></div></div><div aria-label="Expand Image Slideshow" class="Image_expand-button" role="button" tabindex="0"><svg focusable="false" height="18px" version="1.1" viewbox="0 0 18 18" width="18px"

Variants

In [23]:
soup.select("div.StandardArticleBody_body figure img")

[<img aria-label="FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage" src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20"/>,
 <img src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991145&amp;r=LYNXMPEF9039M"/>]

In [24]:
soup.select("div.StandardArticleBody_body figcaption")

[<figcaption><div class="Image_caption"><span>FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage</span></div></figcaption>,
 <figcaption class="Slideshow_caption">Slideshow<span class="Slideshow_count"> (2 Images)</span></figcaption>]

## Extracting the URL

In [25]:
soup.find("link", {'rel': 'canonical'})['href']

'http://web.archive.org/web/20200118131624/https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

In [26]:
soup.select_one("link[rel=canonical]")['href']

'http://web.archive.org/web/20200118131624/https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

## Extracting List Information (Authors)

In [27]:
soup.find("meta", {'name': 'Author'})['content']

'Jacqueline Tempera'

Variants

In [28]:
sel = "div.BylineBar_first-container.ArticleHeader_byline-bar div.BylineBar_byline span"
soup.select(sel)

[<span><a href="/web/20200118131624/https://www.reuters.com/journalists/jacqueline-tempera" target="_blank">Jacqueline Tempera</a>, </span>,
 <span><a href="/web/20200118131624/https://www.reuters.com/journalists/jonathan-allen" target="_blank">Jonathan Allen</a></span>]

In [29]:
[a.text for a in soup.select(sel)]

['Jacqueline Tempera, ', 'Jonathan Allen']
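
The first entry still carries the separator `', '` from the byline markup. One way to tidy the list is sketched below, using the values from the output above.

```python
raw_authors = ['Jacqueline Tempera, ', 'Jonathan Allen']

# Strip surrounding whitespace first, then a trailing comma if present
authors = [a.strip().rstrip(",").strip() for a in raw_authors]
print(authors)
```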

## Extracting Text of Links (Section)


In [30]:
soup.select_one("div.ArticleHeader_channel a").text

'Health News'

## Extracting the Reading Time

In [31]:
soup.select_one("p.BylineBar_reading-time").text

'6 Min Read'

## Extracting Attributes (ID)

In [32]:
soup.select_one("div.StandardArticle_inner-container")['id']

'USKBN1WG4KT'

Alternative: URL
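
A sketch of this alternative: in the old URL scheme the article ID follows the marker `-id` at the end of the path, so it can be read with a regular expression. `article_id_from_url` is a hypothetical helper name, and the pattern assumes IDs consisting of uppercase letters and digits, as in the example.

```python
import re

def article_id_from_url(url):
    """Return the ID after the final '-id' marker of an old-style URL, or None."""
    match = re.search(r"-id([A-Z0-9]+)/?$", url)
    return match.group(1) if match else None

print(article_id_from_url(
    "https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT"))
```

URLs in the newer scheme, which end in a date, do not contain such an ID, and the function returns `None` for them.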

## Extracting Attribution

In [33]:
soup.select_one("p.Attribution_content").text

'Reporting Jacqueline Tempera in Brookline and Boston, Massachusetts, and Jonathan Allen in New York; Editing by Frank McGurty and Bill Berkrot'

## Extracting Timestamps

In [34]:
ptime = soup.find("meta", { 'property': "og:article:published_time"})['content']
print(ptime)

2019-10-01T19:23:16+0000


In [35]:
from dateutil import parser
parser.parse(ptime)

datetime.datetime(2019, 10, 1, 19, 23, 16, tzinfo=tzutc())

In [36]:
parser.parse(soup.find("meta", { 'property': "og:article:modified_time"})['content'])

datetime.datetime(2019, 10, 1, 19, 23, 16, tzinfo=tzutc())
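
When the timestamp format is known and fixed, as it is here, the standard library is sufficient. This sketch parses the value printed above with an explicit format string; `ptime_demo` is just a copy of that value.

```python
from datetime import datetime, timezone

ptime_demo = "2019-10-01T19:23:16+0000"

# %z accepts a UTC offset such as +0000 and yields a timezone-aware datetime
published = datetime.strptime(ptime_demo, "%Y-%m-%dT%H:%M:%S%z")

print(published.isoformat())
print(published == datetime(2019, 10, 1, 19, 23, 16, tzinfo=timezone.utc))
```

`dateutil` remains the more convenient choice when the exact format is not known in advance.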

# 7. Blueprint: Spidering

In [37]:
import requests
from bs4 import BeautifulSoup
import os.path
from dateutil import parser

def download_archive_page(page):
    filename = "page-%06d.html" % page
    if not os.path.isfile(filename):
        url = "https://www.reuters.com/news/archive/" + \
               "?view=page&page=%d&pageSize=10" % page
        r = requests.get(url)
        with open(filename, "w+", encoding="utf-8") as f:
            f.write(r.text)

def parse_archive_page(page_file):
    with open(page_file, "r", encoding="utf-8") as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = ["https://www.reuters.com" + a['href'] 
               for a in soup.select("article.story div.story-content a")]
    return hrefs

def download_article(url):
    # Check whether the article has already been downloaded
    filename = url.split("/")[-1] + ".html"
    if not os.path.isfile(filename):
        r = requests.get(url)
        with open(filename, "w+", encoding="utf-8") as f:
            f.write(r.text)

def parse_article(article_file):
    def find_obfuscated_class(soup, klass):
        return soup.find_all(lambda tag: tag.has_attr("class") and (klass in " ".join(tag["class"])))
    
    with open(article_file, "r", encoding="utf-8") as f:
        html = f.read()
    r = {}
    print(html)
    soup = BeautifulSoup(html, 'html.parser')
    r['url'] = soup.find("link", {'rel': 'canonical'})['href']
    r['id'] = r['url'].split("-")[-1]
    # This may fail, so test for None
    if soup.h1 is not None:
        r['headline'] = soup.h1.text
# Commented out because it aborts with an overflow error ###
#    r['section'] = find_obfuscated_class(soup, "ArticleHeader-channel")[0].text
    r['text'] = "\n".join([t.text for t in find_obfuscated_class(soup, "Paragraph-paragraph")])
# Commented out because it aborts with an overflow error
#    r['authors'] = find_obfuscated_class(soup, "Attribution-attribution")[0].text
    r['time'] = soup.find("meta", { 'property': "og:article:published_time"})['content']
    return r
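
`parse_archive_page` builds absolute URLs by prefixing the domain, which only works while every `href` is site-relative. A sketch of a more forgiving variant uses `urljoin`, which also leaves absolute links untouched. `absolutize` and the sample links are hypothetical, and dropping duplicates is an assumption about what is wanted, not part of the original function.

```python
from urllib.parse import urljoin

BASE = "https://www.reuters.com/news/archive/"

def absolutize(hrefs, base=BASE):
    """Resolve hrefs against the page URL and drop duplicates, keeping order."""
    resolved = [urljoin(base, h) for h in hrefs]
    return list(dict.fromkeys(resolved))

# Hypothetical mix of relative, absolute, and repeated links
links = absolutize([
    "/article/us-ukraine-crisis-snapshot/example-idUSKCN2LO026",
    "https://www.reuters.com/world/",
    "/article/us-ukraine-crisis-snapshot/example-idUSKCN2LO026",
])
print(links)
```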

In [38]:
# Download 2 pages of the archive (range(1, 3) yields pages 1 and 2)
for p in range(1, 3): ###
    download_archive_page(p)

In [39]:
# Parse the archive pages and add the links to article_urls
import glob

article_urls = []

for page_file in glob.glob("page-*.html"):
    article_urls += parse_archive_page(page_file)
print(article_urls)

['https://www.reuters.com/article/us-health-coronavirus-asia-cases/covid-cases-in-asia-surpass-100-million-reuters-tally-idUSKCN2LR0GD', 'https://www.reuters.com/article/us-ukraine-crisis-snapshot/ukraine-and-russia-what-you-need-to-know-right-now-idUSKCN2LO026', 'https://www.reuters.com/article/us-ukraine-crisis/ukraine-isnt-naive-zelenskiy-says-after-russian-pledge-to-scale-down-attack-on-kyiv-idUSKCN2LR023', 'https://www.reuters.com/article/us-health-coronavirus-hongkong/hong-kong-leader-says-citys-brain-drain-unarguable-idUSKCN2LR09T', 'https://www.reuters.com/article/us-health-coronavirus-china/shanghai-expands-covid-lockdown-as-new-daily-caseload-surges-by-a-third-idUSKCN2LR01V', 'https://www.reuters.com/article/us-australia-weather/thousands-of-australians-flee-homes-as-floods-inundate-towns-idUSKCN2LR03R', 'https://www.reuters.com/article/health-coronavirus-china-ipo/china-9-billion-ipo-plans-stalled-amid-covid-outbreak-filings-idUSKCN2LR06J', 'https://www.reuters.com/article/u

In [40]:
# Download the articles
for url in article_urls:
    download_article(url)
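
The loop above requests one article after another without pausing. The sketch below combines the "skip if already on disk" check with a short delay after each real request. `download_if_missing` and its `fetch` parameter are hypothetical. Passing the fetch function in keeps the sketch independent of `requests` and makes it easy to try offline.

```python
import os
import time

def download_if_missing(url, filename, fetch, delay=1.0):
    """Save fetch(url) to filename unless the file already exists.

    Returns True if a request was made, False if the cached file was used.
    """
    if os.path.isfile(filename):
        return False
    text = fetch(url)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(text)
    time.sleep(delay)  # pause so the next request does not follow immediately
    return True
```

In the notebook, the call could look like `download_if_missing(url, filename, lambda u: requests.get(u).text)`.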

In [41]:
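import pandas as pd

# Collect one dict per article and build the DataFrame in a single step
# (DataFrame.append was removed in pandas 2.0)
records = [parse_article(article_file)
           for article_file in glob.glob("*-id???????????.html")]
df = pd.DataFrame(records)
df['time'] = pd.to_datetime(df.time)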

<!DOCTYPE html><html lang="en"><head><title>Biden makes lynching a U.S. hate crime, signs Emmett Till law | Reuters</title><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="apple-itunes-app" content="app-id=602660809"/><script>(function(){
      var current_location = window.location.href;

      if (current_location.indexOf('/info-pages/supported-browsers/') === -1) {
        var supportFetchApi = 'fetch' in window;
        var supportCSSGrid = window.CSS && CSS.supports('display', 'grid');

        if (!supportFetchApi && !supportCSSGrid) {
          window.location.href = '/info-pages/supported-browsers/';
        }
      }
    })()</script><script src="/pf/resources/dist/reuters/js/index.js?d=80" async="" data-config="{&quot;ADMIN&quot;:false,&quot;SEGMENT_WRITE_KEY&quot;:&quot;IEWBqQ8VWHijTQxb7lEBGFGS9uIJzigZ&quot;,&quot;SEGMENT_WRITE_KEY_MOBILE&quot;:&quot;YlmAIaFBxsNtlVJdfuSV0ncE931ghRtS&quot;,&quot;API_ORIGIN&quot;:&quot;https://api-reuters-reuter

<!DOCTYPE html><html lang="en"><head><title>Biden says budget targets Trump&#x27;s &#x27;fiscal mess,&#x27; raises taxes on wealthy | Reuters</title><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="apple-itunes-app" content="app-id=602660809"/><script>(function(){
      var current_location = window.location.href;

      if (current_location.indexOf('/info-pages/supported-browsers/') === -1) {
        var supportFetchApi = 'fetch' in window;
        var supportCSSGrid = window.CSS && CSS.supports('display', 'grid');

        if (!supportFetchApi && !supportCSSGrid) {
          window.location.href = '/info-pages/supported-browsers/';
        }
      }
    })()</script><script src="/pf/resources/dist/reuters/js/index.js?d=80" async="" data-config="{&quot;ADMIN&quot;:false,&quot;SEGMENT_WRITE_KEY&quot;:&quot;IEWBqQ8VWHijTQxb7lEBGFGS9uIJzigZ&quot;,&quot;SEGMENT_WRITE_KEY_MOBILE&quot;:&quot;YlmAIaFBxsNtlVJdfuSV0ncE931ghRtS&quot;,&quot;API_ORIGIN&quot;:&quot;

<!DOCTYPE html><html lang="en"><head><title>Florida governor signs bill limiting LGBTQ instruction in schools | Reuters</title><meta name="viewport" content="width=device-width, initial-scale=1"/><meta name="apple-itunes-app" content="app-id=602660809"/><script>(function(){
      var current_location = window.location.href;

      if (current_location.indexOf('/info-pages/supported-browsers/') === -1) {
        var supportFetchApi = 'fetch' in window;
        var supportCSSGrid = window.CSS && CSS.supports('display', 'grid');

        if (!supportFetchApi && !supportCSSGrid) {
          window.location.href = '/info-pages/supported-browsers/';
        }
      }
    })()</script><script src="/pf/resources/dist/reuters/js/index.js?d=80" async="" data-config="{&quot;ADMIN&quot;:false,&quot;SEGMENT_WRITE_KEY&quot;:&quot;IEWBqQ8VWHijTQxb7lEBGFGS9uIJzigZ&quot;,&quot;SEGMENT_WRITE_KEY_MOBILE&quot;:&quot;YlmAIaFBxsNtlVJdfuSV0ncE931ghRtS&quot;,&quot;API_ORIGIN&quot;:&quot;https://api-reuters-re

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [42]:
df

Unnamed: 0,url,id,headline,text,time
0,https://www.reuters.com/world/us/biden-make-lynching-us-hate-crime-with-signing-emmett-till-bill-2022-03-29/,29/,"Biden makes lynching a U.S. hate crime, signs Emmett Till law",,2022-03-30 02:34:35+00:00
1,https://www.reuters.com/world/us/bidens-budget-boost-military-raise-taxes-billionaires-2022-03-28/,28/,"Biden says budget targets Trump's 'fiscal mess,' raises taxes on wealthy",,2022-03-29 05:17:04+00:00
2,https://www.reuters.com/business/china-9-bln-ipo-plans-stalled-amid-covid-outbreak-filings-estimate-2022-03-30/,30/,China $9 bln IPO plans stalled amid COVID outbreak - filings,,2022-03-30 04:31:43+00:00
3,https://www.reuters.com/business/media-telecom/streaming-showdown-ceremony-shakeup-set-stage-oscar-sunday-2022-03-27/,27/,"'CODA' takes top prize, Will Smith slaps Chris Rock at Oscars",,2022-03-30 01:41:02+00:00
4,https://www.reuters.com/world/asia-pacific/covid-cases-asia-surpass-100-million-reuters-tally-2022-03-30/,30/,COVID cases in Asia surpass 100 million - Reuters tally,,2022-03-30 06:31:37+00:00
5,https://www.reuters.com/world/us/exclusive-two-former-us-officials-help-ethics-probe-trump-ally-clark-source-says-2022-03-29/,29/,"EXCLUSIVE Two former U.S. officials help ethics probe of Trump ally Clark, source says",,2022-03-29 21:24:30+00:00
6,https://www.reuters.com/world/us/florida-governor-says-he-is-signing-bill-limiting-lgbtq-instruction-schools-2022-03-28/,28/,Florida governor signs bill limiting LGBTQ instruction in schools,,2022-03-29 08:42:37+00:00
7,https://www.reuters.com/legal/government/floridas-desantis-vetoes-state-congressional-map-tells-lawmakers-try-again-2022-03-29/,29/,,,2022-03-29 20:34:55+00:00
8,https://www.reuters.com/world/asia-pacific/hong-kong-leader-says-citys-brain-drain-unarguable-2022-03-30/,30/,Hong Kong leader says city's brain drain 'unarguable',,2022-03-30 05:30:30+00:00
9,https://www.reuters.com/world/europe/villages-near-kyiv-how-ukraine-has-kept-russias-army-bay-2022-03-29/,29/,"In villages near Kyiv, how Ukraine has kept Russia's army at bay",,2022-03-30 04:17:20+00:00


In [43]:
df.sort_values("time")

Unnamed: 0,url,id,headline,text,time
22,https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT,idUSKBN1WG4KT,"Banned in Boston: Without vaping, medical marijuana patients must adapt","BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\nThe 32-year-old massage th...",2019-10-01 18:35:04+00:00
20,https://www.reuters.com/world/europe/ukraine-russia-what-you-need-know-right-now-2022-03-25/,25/,Ukraine and Russia: What you need to know right now,,2022-03-29 00:33:35+00:00
1,https://www.reuters.com/world/us/bidens-budget-boost-military-raise-taxes-billionaires-2022-03-28/,28/,"Biden says budget targets Trump's 'fiscal mess,' raises taxes on wealthy",,2022-03-29 05:17:04+00:00
6,https://www.reuters.com/world/us/florida-governor-says-he-is-signing-bill-limiting-lgbtq-instruction-schools-2022-03-28/,28/,Florida governor signs bill limiting LGBTQ instruction in schools,,2022-03-29 08:42:37+00:00
16,https://www.reuters.com/world/us/us-capitol-riot-probe-wants-talk-with-justice-thomas-wife-report-2022-03-28/,28/,U.S. Capitol attack probe may seek interview with Justice Thomas' wife,,2022-03-29 10:08:01+00:00
15,https://www.reuters.com/world/us/fda-authorizes-second-booster-pfizerbiontech-covid-shot-2022-03-29/,29/,FDA authorizes second booster of Pfizer/BioNTech COVID shot,,2022-03-29 14:50:15+00:00
7,https://www.reuters.com/legal/government/floridas-desantis-vetoes-state-congressional-map-tells-lawmakers-try-again-2022-03-29/,29/,,,2022-03-29 20:34:55+00:00
10,https://www.reuters.com/world/middle-east/saudi-led-coalition-halt-military-operations-yemen-help-negotiations-spa-2022-03-29/,29/,Saudi-led coalition to halt military operations in Yemen to help negotiations -SPA,,2022-03-29 21:10:34+00:00
18,https://www.reuters.com/world/us-senators-try-avoid-weeks-long-delay-russia-trade-measure-2022-03-29/,29/,U.S. senators try to avoid weeks-long delay in Russia trade measure,,2022-03-29 21:20:03+00:00
5,https://www.reuters.com/world/us/exclusive-two-former-us-officials-help-ethics-probe-trump-ally-clark-source-says-2022-03-29/,29/,"EXCLUSIVE Two former U.S. officials help ethics probe of Trump ally Clark, source says",,2022-03-29 21:24:30+00:00


# 8. Blueprint: Density-Based Text Extraction

In [44]:
from readability import Document

doc = Document(html)
doc.title()

'Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

In [45]:
doc.short_title()

'Banned in Boston: Without vaping, medical marijuana patients must adapt'

In [46]:
doc.summary()

'<html><body><div><div class="StandardArticleBody_body"><p>BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 </p><div class="PrimaryAsset_container"><div class="Image_container" tabindex="-1"><figure class="Image_zoom"></figure><figcaption><div class="Image_caption"><span>FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage</span></div></figcaption></div></div><p>The 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. </p><p>There are other ways to get the desired effect from  marijuana, and patients h

In [47]:
doc.url

In [48]:
density_soup = BeautifulSoup(doc.summary(), 'html.parser')
density_soup.body.text

'BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah NouvelageThe 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. There are other ways to get the desired effect from  marijuana, and patients have filled dispensaries across the state in recent days to ask about edible or smokeable forms. But Medeiros has come to depend on her battery-powered pen, and wondered how she would cope without her usual supply of cartridges.  “In the midst of something where I’m on t
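
To give an intuition for what "density" means here, the following toy sketch compares the share of visible text in two HTML fragments. It is not how `readability` works internally. The `text_density` helper and the navigation fragment are made up for illustration; the paragraph is taken from the article above.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect all text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def text_density(fragment):
    """Ratio of visible text characters to total characters of the fragment."""
    if not fragment:
        return 0.0
    collector = TextCollector()
    collector.feed(fragment)
    collector.close()
    text = "".join(collector.chunks).strip()
    return len(text) / len(fragment)

# Hypothetical navigation block: much markup, little text
nav = ('<ul><li><a href="/world">World</a></li>'
       '<li><a href="/business">Business</a></li></ul>')
# Article paragraph: mostly text
body = ('<p>BOSTON (Reuters) - In the first few days of the four-month ban on '
        'all vaping products in Massachusetts, Laura Lee Medeiros, a medical '
        'marijuana patient, began to worry.</p>')

best = max([nav, body], key=text_density)
print(round(text_density(nav), 2), round(text_density(body), 2))
```

Blocks with a high share of text are likely to belong to the article body, while navigation and teaser blocks consist mostly of markup.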

# 9. Blueprint: Scrapy

Unfortunately, the code for `scrapy` cannot be adapted easily, which is one more argument for using *up-to-date* separate libraries. In this version, it still collects the titles of the articles, but nothing more.

In [49]:
# Scrapy must be installed
import scrapy
import logging


class ReutersArchiveSpider(scrapy.Spider):
    name = 'reuters-archive'
    
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'FEED_FORMAT': 'json',
        'FEED_URI': 'reuters-archive.json'
    }
    
    start_urls = [
        'https://www.reuters.com/news/archive/',
    ]

    def parse(self, response):
        for article in response.css("article.story div.story-content a"):
            yield response.follow(article.css("a::attr(href)").extract_first(), self.parse_article)

        next_page_url = response.css('a.control-nav-next::attr(href)').extract_first()
        # 'and' short-circuits, so the substring test is skipped when there is no next page
        if next_page_url is not None and 'page=2' not in next_page_url:
            yield response.follow(next_page_url, self.parse)

    def parse_article(self, response):
        yield {
          'title': response.css('h1::text').extract_first().strip(),
        }
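
Newer Scrapy releases (2.1 and later) configure feed exports through a single `FEEDS` dictionary. `FEED_FORMAT` and `FEED_URI` are deprecated there. The following settings sketch is meant as an equivalent of the `custom_settings` above. The name `custom_settings_feeds` is hypothetical, and the `overwrite` option assumes Scrapy 2.4 or later.

```python
import logging

# Possible replacement for custom_settings in ReutersArchiveSpider
custom_settings_feeds = {
    'LOG_LEVEL': logging.WARNING,
    'FEEDS': {
        # key: output path, value: options for this feed
        'reuters-archive.json': {
            'format': 'json',
            'encoding': 'utf8',
            'overwrite': True,   # replace the file instead of appending to it
        },
    },
}
```

Overwriting matters for the JSON format in particular: appending a second run to an existing file would produce two concatenated arrays, which is not valid JSON.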

In [50]:
# This can be run only once per Jupyter kernel session (Twisted's reactor cannot be restarted)
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()

process.crawl(ReutersArchiveSpider)
process.start()

2022-03-30 09:00:35 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-30 09:00:35 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:22:46) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Windows-10-10.0.19044-SP0
2022-03-30 09:00:35 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30}


In [51]:
glob.glob("*.json")

['reuters-archive.json']

In [52]:
# Use a context manager so the file is closed after reading
with open("reuters-archive.json", "r") as f:
    content = f.read()
print(content)

[
{"title": "Islamic State leader and family blended in among Syrians uprooted by war"},
{"title": "Biden admin eases Trump-era solar tariffs but doesn't end them"},
{"title": "Putin and Xi unveil alliance at Olympics, mixing politics and sport"},
{"title": "Omicron surge likely slowed U.S. job growth, losses possible"},
{"title": "Ahmaud Arbery killer withdraws guilty plea on U.S. hate-crime charges"},
{"title": "Republican senator urges U.S. to monitor China's digital yuan push during Olympics"},
{"title": "U.S. House set to pass sweeping vote on China competition bill"},
{"title": "U.S. winter storm leaves hundreds of thousands without power"},
{"title": "Oil hits fresh seven-year highs, closing out seven weeks of gains"},
{"title": "Texas man charged with threatening Georgia election officials pleads not guilty"},
{"title": "News Corp suspects China behind cyberattack on its system"},
{"title": "Loyal to Trump, Republican Party moves to censure U.S. Reps. Cheney, Kinzinger"},
{"tit
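
Because the feed is a JSON array of objects, it can be turned into Python data directly instead of being printed as text. A minimal sketch follows; `sample` reuses two of the entries above in place of the full file content.

```python
import json

# Two entries from the output above, standing in for the full file content
sample = """[
{"title": "Oil hits fresh seven-year highs, closing out seven weeks of gains"},
{"title": "News Corp suspects China behind cyberattack on its system"}
]"""

titles = [item["title"] for item in json.loads(sample)]
print(len(titles), titles[0])
```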