# Chapter 3: Scraping Websites and Extracting Data

## Setup
The directory locations are set. If you are working on Google Colab, the required files are copied and the required libraries are installed.

## Note

Code lines marked with `###` contain values that can be adjusted.

In [1]:
import sys, os
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    GIT_ROOT = 'https://github.com/blueprints-for-text-analytics-python/blueprints-text/raw/master'
    os.system(f'wget {GIT_ROOT}/ch03/setup.py')

%run -i setup.py

You are working on a local system.
Files will be searched relative to "..".


## Load Python Settings

General imports and default formatting settings for Matplotlib, pandas, etc.

In [2]:
%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = 'png'

# 1. Blueprint: Downloading and Interpreting robots.txt

In [3]:
import urllib.robotparser
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.reuters.com/robots.txt")
rp.read()

In [4]:
rp.can_fetch("*", "https://www.reuters.com/arc/outboundfeeds/news-sitemap/?outputType=xml")

True

In [5]:
rp.can_fetch("*", "https://www.reuters.com/finance/stocks/option")

False
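
The same check can be tried without a network request by handing the rules to the parser directly. The following is a minimal sketch: the rule lines and the `example.com` URLs are made up for illustration and are not the actual Reuters `robots.txt`, and `offline_rp` is a hypothetical name.

```python
import urllib.robotparser

# Hypothetical rules for illustration, not the real Reuters file
rules = """\
User-agent: *
Disallow: /finance/stocks/option
Allow: /
""".splitlines()

offline_rp = urllib.robotparser.RobotFileParser()
offline_rp.parse(rules)  # parse() accepts an iterable of lines

allowed = offline_rp.can_fetch("*", "https://www.example.com/world/")
blocked = offline_rp.can_fetch("*", "https://www.example.com/finance/stocks/option")
print(allowed, blocked)
```

Python's parser applies the first rule whose path prefix matches, so the order of the `Disallow` and `Allow` lines matters in this sketch.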

# 2. Blueprint: Finding URLs in sitemap.xml

In [6]:
# xmltodict must be installed
import xmltodict
import requests

sitemap = xmltodict.parse(requests.get('https://www.reuters.com/arc/outboundfeeds/news-sitemap/?outputType=xml').text)

In [7]:
# Look at just a few of the URLs
urls = [url["loc"] for url in sitemap["urlset"]["url"]]
print("\n".join(urls[0:5])) ###

https://www.reuters.com/lifestyle/sports/us-owned-haas-terminate-russian-racer-mazepins-contract-2022-03-05/
https://www.reuters.com/world/europe/ukraine-interior-ministry-adviser-says-there-will-be-more-civilian-evacuation-2022-03-05/
https://www.reuters.com/markets/europe/top-wrap-1-europes-largest-nuclear-power-plant-fire-after-russian-attack-mayor-2022-03-04/
https://www.reuters.com/lifestyle/sports/china-off-dominant-start-with-eight-medals-day-one-winter-games-2022-03-05/
https://www.reuters.com/lifestyle/sports/haas-f1-terminates-contract-with-russian-racer-mazepin-team-2022-03-05/
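
If `xmltodict` is not available, a sitemap can also be read with the standard library. This is a minimal sketch, assuming the standard sitemap namespace; the XML snippet is a shortened, made-up stand-in for the downloaded file, and `sample_urls` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for a sitemap; the example.com URLs are placeholders
sample_sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/world/article-one/</loc></url>
  <url><loc>https://www.example.com/world/article-two/</loc></url>
</urlset>"""

# Sitemap elements live in a namespace, so it has to be named in the query
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sample_sitemap.encode("utf-8"))
sample_urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]
print("\n".join(sample_urls))
```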


# 3. Blueprint: Finding URLs in RSS

Unfortunately, Reuters has since removed its RSS feed. We therefore use a saved copy from the Internet Archive.

In [8]:
# feedparser must be installed
import feedparser
feed = feedparser.parse('http://web.archive.org/web/20200613003232if_/http://feeds.reuters.com/Reuters/worldNews')

In [9]:
[(e.title, e.link) for e in feed.entries]

[('Mexico City to begin gradual exit from lockdown on Monday',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/OQtkVdAqHos/mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R'),
 ('Mexico reports record tally of 5,222 new coronavirus cases',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/Rkz9j2G7lJU/mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B'),
 ('Venezuela supreme court to swear in new electoral council leaders, government says',
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/cc3R5aq4Ksk/venezuela-supreme-court-to-swear-in-new-electoral-council-leaders-government-says-idUSKBN23J39T'),
 ("One-fifth of Britain's coronavirus patients were infected in hospitals: Telegraph",
  'http://feeds.reuters.com/~r/Reuters/worldNews/~3/1_7Wb0S_6-8/one-fifth-of-britains-coronavirus-patients-were-infected-in-hospitals-telegraph-idUSKBN23J382'),
 ('France to lift border controls for EU travellers on June 15',
  'http://feeds.reuters.com/~r/

In [10]:
[e.id for e in feed.entries]

['https://www.reuters.com/article/us-health-coronavirus-mexico-city/mexico-city-to-begin-gradual-exit-from-lockdown-on-monday-idUSKBN23K00R?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-mexico/mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-venezuela-politics/venezuela-supreme-court-to-swear-in-new-electoral-council-leaders-government-says-idUSKBN23J39T?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-britain-hospitals/one-fifth-of-britains-coronavirus-patients-were-infected-in-hospitals-telegraph-idUSKBN23J382?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-france-borders/france-to-lift-border-controls-for-eu-travellers-on-june-15-idUSKBN23J385?feedType=RSS&feedName=worldNews',
 'https://www.reuters.com/article/us-health-coronavirus-brazil/brazils-covid-19-deaths-sur
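
The `id` values carry query parameters (`feedType`, `feedName`) that are not needed to identify an article. One way to strip them is sketched here with the standard library; the helper name `strip_query` is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# One of the entry ids shown in the output above
entry_id = ("https://www.reuters.com/article/us-health-coronavirus-mexico/"
            "mexico-reports-record-tally-of-5222-new-coronavirus-cases-idUSKBN23K00B"
            "?feedType=RSS&feedName=worldNews")
clean_url = strip_query(entry_id)
print(clean_url)
```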

# 4. Blueprint: Downloading HTML Pages with Python

In [11]:
%%time
s = requests.Session()
for url in urls[0:5]:   ###
    # Use the part of the URL between the second-to-last and the last '/' as the filename
    file = url.split("/")[-2] ### was -1
    
    r = s.get(url)
    if file == '':
        print("error with URL %s" %url)
    else:
        print(file)
        with open(file, "w+b") as f:
            f.write(r.text.encode('utf-8'))
                

us-owned-haas-terminate-russian-racer-mazepins-contract-2022-03-05
ukraine-interior-ministry-adviser-says-there-will-be-more-civilian-evacuation-2022-03-05
top-wrap-1-europes-largest-nuclear-power-plant-fire-after-russian-attack-mayor-2022-03-04
china-off-dominant-start-with-eight-medals-day-one-winter-games-2022-03-05
haas-f1-terminates-contract-with-russian-racer-mazepin-team-2022-03-05
CPU times: total: 31.2 ms
Wall time: 352 ms


In [12]:
with open("urls.txt", "w+b") as f:
    f.write("\n".join(urls).encode('utf-8'))
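
Taking the second-to-last element of the split URL only works when the URL ends with a slash, as the sitemap URLs do. A small helper that handles both forms might look like this. It is a sketch: `filename_from_url` is a hypothetical name and the `example.com` URLs are placeholders.

```python
from urllib.parse import urlsplit

def filename_from_url(url):
    """Return the last non-empty path segment of a URL, or '' if there is none."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[-1] if segments else ""

with_slash = filename_from_url("https://www.example.com/world/some-article-2022-03-05/")
without_slash = filename_from_url("https://www.example.com/article/some-article-idUSKBN1WG4KT")
print(with_slash)
print(without_slash)
```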

# 5. Blueprint: Extraction with Regular Expressions

We first need to download a single article.

In [13]:
url = 'https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

file = url.split("/")[-1] + ".html"

print(file)

r = requests.get(url)

with open(file, "w+", encoding="utf-8") as f:
    f.write(r.text)

us-health-vaping-marijuana-idUSKBN1WG4KT.html


In [14]:
import re
with open(file, "r", encoding="utf-8") as f:
    html = f.read()
    # Non-greedy match, so that only the first <title> element is captured
    g = re.search(r'<title>(.*?)</title>', html, re.MULTILINE|re.DOTALL)
    if g:
        print(g.group(1))

Banned in Boston: Without vaping, medical marijuana patients must adapt | Reuters
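
With `re.DOTALL`, a greedy `(.*)` runs up to the last `</title>` in the document, which matters when a page contains more than one `<title>` element (inline SVG graphics often have one). A non-greedy `(.*?)` stops at the first closing tag. A minimal sketch on a made-up HTML fragment:

```python
import re

# Made-up fragment with a second <title> inside an inline SVG
sample_html = ("<html><head><title>Page title</title></head>"
               "<body><svg><title>Icon</title></svg></body></html>")

greedy = re.search(r"<title>(.*)</title>", sample_html, re.DOTALL).group(1)
lazy = re.search(r"<title>(.*?)</title>", sample_html, re.DOTALL).group(1)

print(greedy)  # spans everything up to the last </title>
print(lazy)    # only the content of the first <title>
```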


# 6. Blueprint: Using an HTML Parser for Extraction

Reuters keeps changing the structure of its pages. Unfortunately, the markup has since been *obfuscated*, so the methods shown here would no longer work on the live site without massive changes.

In this Jupyter notebook, we therefore download the articles from the Internet Archive, which still has the old, permanently unchanged HTML structure.

In [15]:
WA_PREFIX = "http://web.archive.org/web/20200118131624/"
html = s.get(WA_PREFIX + url).text

In [16]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
soup.select("h1.ArticleHeader_headline")

[<h1 class="ArticleHeader_headline">Banned in Boston: Without vaping, medical marijuana patients must adapt</h1>]

## Extracting the Title/Headline

In [17]:
soup.h1

<h1 class="ArticleHeader_headline">Banned in Boston: Without vaping, medical marijuana patients must adapt</h1>

In [18]:
soup.h1.text

'Banned in Boston: Without vaping, medical marijuana patients must adapt'

In [19]:
soup.title.text

'\n                Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

In [20]:
soup.title.text.strip()

'Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

## Extracting the Article Text

In [21]:
soup.select_one("div.StandardArticleBody_body").text

'BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah NouvelageThe 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. There are other ways to get the desired effect from  marijuana, and patients have filled dispensaries across the state in recent days to ask about edible or smokeable forms. But Medeiros has come to depend on her battery-powered pen, and wondered how she would cope without her usual supply of cartridges.  “In the midst of something where I’m on t

## Extracting Image Captions

In [22]:
soup.select("div.StandardArticleBody_body figure")

[<figure class="Image_zoom" style="padding-bottom:"><div class="LazyImage_container LazyImage_dark" style="background-image:none"><img aria-label="FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage" src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20"/><div class="LazyImage_image LazyImage_fallback" style="background-image:url(//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20);background-position:center center;background-color:inherit"></div></div><div aria-label="Expand Image Slideshow" class="Image_expand-button" role="button" tabindex="0"><svg focusable="false" height="18px" version="1.1" viewbox="0 0 18 18" width="18px"

Variants

In [23]:
soup.select("div.StandardArticleBody_body figure img")

[<img aria-label="FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage" src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991144&amp;r=LYNXMPEF9039L&amp;w=20"/>,
 <img src="//web.archive.org/web/20200118131624im_/https://s3.reutersmedia.net/resources/r/?m=02&amp;d=20191001&amp;t=2&amp;i=1435991145&amp;r=LYNXMPEF9039M"/>]

In [24]:
soup.select("div.StandardArticleBody_body figcaption")

[<figcaption><div class="Image_caption"><span>FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage</span></div></figcaption>,
 <figcaption class="Slideshow_caption">Slideshow<span class="Slideshow_count"> (2 Images)</span></figcaption>]

## Extracting the URL

In [25]:
soup.find("link", {'rel': 'canonical'})['href']

'http://web.archive.org/web/20200118131624/https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

In [26]:
soup.select_one("link[rel=canonical]")['href']

'http://web.archive.org/web/20200118131624/https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT'

## Extracting List Information (Authors)

In [27]:
soup.find("meta", {'name': 'Author'})['content']

'Jacqueline Tempera'

Variants

In [28]:
sel = "div.BylineBar_first-container.ArticleHeader_byline-bar div.BylineBar_byline span"
soup.select(sel)

[<span><a href="/web/20200118131624/https://www.reuters.com/journalists/jacqueline-tempera" target="_blank">Jacqueline Tempera</a>, </span>,
 <span><a href="/web/20200118131624/https://www.reuters.com/journalists/jonathan-allen" target="_blank">Jonathan Allen</a></span>]

In [29]:
[a.text for a in soup.select(sel)]

['Jacqueline Tempera, ', 'Jonathan Allen']
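
The text of the `span` elements still contains the separating comma. Stripping it yields clean names. A minimal sketch that uses the list shown above as input (`raw_authors` and `authors` are names chosen for this example):

```python
raw_authors = ['Jacqueline Tempera, ', 'Jonathan Allen']

# Remove surrounding whitespace and a trailing comma from each name
authors = [name.strip().rstrip(",").strip() for name in raw_authors]
print(authors)
```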

## Extracting the Text of Links (Section)


In [30]:
soup.select_one("div.ArticleHeader_channel a").text

'Health News'

## Extracting the Reading Time

In [31]:
soup.select_one("p.BylineBar_reading-time").text

'6 Min Read'

## Extracting Attributes (ID)

In [32]:
soup.select_one("div.StandardArticle_inner-container")['id']

'USKBN1WG4KT'

Alternative: take the ID from the URL.
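
For old-style article URLs such as the one used in this notebook, a regular expression is enough to pull out the ID. This is a sketch: the pattern is an assumption based on the URL scheme visible here, in which the slug ends with `-id` followed by the ID, and it will not match the newer date-based URLs.

```python
import re

article_url = "https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT"

# Assumes the old URL scheme, in which the slug ends with "-id<ID>"
m = re.search(r"-id([A-Z0-9]+)/?$", article_url)
article_id = m.group(1) if m else None
print(article_id)
```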

## Extracting Attribution

In [33]:
soup.select_one("p.Attribution_content").text

'Reporting Jacqueline Tempera in Brookline and Boston, Massachusetts, and Jonathan Allen in New York; Editing by Frank McGurty and Bill Berkrot'

## Extracting Timestamps

In [34]:
ptime = soup.find("meta", { 'property': "og:article:published_time"})['content']
print(ptime)

2019-10-01T19:23:16+0000


In [35]:
from dateutil import parser
parser.parse(ptime)

datetime.datetime(2019, 10, 1, 19, 23, 16, tzinfo=tzutc())

In [36]:
parser.parse(soup.find("meta", { 'property': "og:article:modified_time"})['content'])

datetime.datetime(2019, 10, 1, 19, 23, 16, tzinfo=tzutc())
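
If `dateutil` is not installed, this particular format can also be parsed with the standard library. A sketch, assuming the timestamp always has the form shown above (`ptime_sample` and `published` are names chosen for this example):

```python
from datetime import datetime, timezone

ptime_sample = "2019-10-01T19:23:16+0000"

# %z accepts a UTC offset written as +HHMM
published = datetime.strptime(ptime_sample, "%Y-%m-%dT%H:%M:%S%z")
print(published.isoformat())
```

Unlike `dateutil.parser.parse`, `strptime` raises a `ValueError` as soon as the input deviates from the given format, which can be useful for spotting layout changes early.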

# 7. Blueprint: Spidering

In [37]:
import requests
from bs4 import BeautifulSoup
import os.path
from dateutil import parser

def download_archive_page(page):
    filename = "page-%06d.html" % page
    if not os.path.isfile(filename):
        url = "https://www.reuters.com/news/archive/" + \
               "?view=page&page=%d&pageSize=10" % page
        r = requests.get(url)
        with open(filename, "w+", encoding="utf-8") as f:
            f.write(r.text)

def parse_archive_page(page_file):
    with open(page_file, "r", encoding="utf-8") as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    hrefs = ["https://www.reuters.com" + a['href'] 
               for a in soup.select("article.story div.story-content a")]
    return hrefs

def download_article(url):
    # Check whether the article has already been downloaded
    filename = url.split("/")[-1] + ".html"
    if not os.path.isfile(filename):
        r = requests.get(url)
        with open(filename, "w+", encoding="utf-8") as f:
            f.write(r.text)

def parse_article(article_file):
    def find_obfuscated_class(soup, klass):
        return soup.find_all(lambda tag: tag.has_attr("class") and (klass in " ".join(tag["class"])))
    
    with open(article_file, "r", encoding="utf-8") as f:
        html = f.read()
    r = {}
    soup = BeautifulSoup(html, 'html.parser')
    r['url'] = soup.find("link", {'rel': 'canonical'})['href']
    r['id'] = r['url'].split("-")[-1]
    r['headline'] = soup.h1.text
# Commented out because it aborts with an overflow ###
#    r['section'] = find_obfuscated_class(soup, "ArticleHeader-channel")[0].text
    r['text'] = "\n".join([t.text for t in find_obfuscated_class(soup, "Paragraph-paragraph")])
# Commented out because it aborts with an overflow
#    r['authors'] = find_obfuscated_class(soup, "Attribution-attribution")[0].text
    r['time'] = soup.find("meta", { 'property': "og:article:published_time"})['content']
    return r

In [38]:
# Download 3 pages of the archive (the upper bound of range is exclusive)
for p in range(1, 4): ###
    download_archive_page(p)

In [39]:
# Parse the archive pages and add the links to article_urls
import glob

article_urls = []

for page_file in glob.glob("page-*.html"):
    article_urls += parse_archive_page(page_file)
print(article_urls)

['https://www.reuters.com/article/us-ukraine-crisis/russia-announces-limited-ceasefire-in-ukraine-to-allow-evacuations-but-continues-broad-offensive-idUSKBN2L204Z', 'https://www.reuters.com/article/us-china-parliament-taiwan/china-pledges-peaceful-growth-of-taiwan-ties-but-opposes-foreign-interference-idUSKBN2L202K', 'https://www.reuters.com/article/us-usa-china-diplomacy/new-u-s-ambassador-nicholas-burns-arrives-in-china-idUSKBN2L206Z', 'https://www.reuters.com/article/us-southchinasea-china-vietnam/china-announces-south-china-sea-drills-close-to-vietnam-coast-idUSKBN2L2082', 'https://www.reuters.com/article/us-pakistan-blast-mosque/at-least-58-killed-in-suicide-bombing-at-shiite-mosque-in-pakistan-idUSKBN2L10RI', 'https://www.reuters.com/article/us-ukraine-crisis/russian-law-on-fake-news-prompts-media-to-halt-reporting-as-websites-blocked-idUSKCN2L100P', 'https://www.reuters.com/article/us-ukraine-crisis-usa-nuclear/u-s-embassy-in-ukraine-calls-nuclear-power-plant-attack-war-crime-id

In [40]:
# Download the articles
for url in article_urls:
    download_article(url)
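
When many articles are downloaded in a row, it is considerate to pause between requests. The following sketch wraps a download function with a fixed delay. `download_all`, its `delay` parameter, and the stand-in `fake_download` are assumptions for illustration and not part of the original blueprint.

```python
import time

def download_all(urls, download, delay=1.0):
    """Call download(url) for every URL, sleeping `delay` seconds in between."""
    done = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # no pause before the first request
        download(url)
        done.append(url)
    return done

# Stand-in for download_article so that the sketch runs without network access
fetched = []
def fake_download(url):
    fetched.append(url)

result = download_all(["https://www.example.com/a", "https://www.example.com/b"],
                      fake_download, delay=0.01)
print(result)
```

In the notebook, the call would then read `download_all(article_urls, download_article)`.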

In [41]:
import pandas as pd

# DataFrame.append was removed in pandas 2.0, so collect the rows first
records = [parse_article(article_file)
           for article_file in glob.glob("*-id???????????.html")]
df = pd.DataFrame(records)
df['time'] = pd.to_datetime(df.time)

In [42]:
df

Unnamed: 0,url,id,headline,text,time
0,https://www.reuters.com/article/us-nuclear-power-analysis-idUSKBN2L12FN,idUSKBN2L12FN,Analysis-Russian attacks spur debate about nuclear power as climate fix,WASHINGTON (Reuters) -Russia’s takeover of Europe’s largest nuclear power plant in Ukraine should spur companies and policymakers to be more careful in plans to build reactors to fight climate cha...,2022-03-04 22:10:35+00:00
1,https://www.reuters.com/world/asia-pacific/mosque-blast-northwestern-pakistani-kills-five-wounds-dozens-police-2022-03-04/,04/,At least 58 killed in suicide bombing at Shi'ite mosque in Pakistan,,2022-03-05 05:16:26+00:00
2,https://www.reuters.com/world/us/biden-heads-wisconsin-after-delivering-his-first-state-union-speech-2022-03-02/,02/,"Biden touts infrastructure, Ukraine support on Wisconsin trip",,2022-03-02 22:35:26+00:00
3,https://www.reuters.com/world/china-announces-south-china-sea-drills-close-vietnam-coast-2022-03-05/,05/,China announces South China Sea drills close to Vietnam coast,,2022-03-05 07:00:23+00:00
4,https://www.reuters.com/world/china/china-pledges-peaceful-growth-taiwan-ties-opposes-foreign-interference-2022-03-05/,05/,"China pledges peaceful growth of Taiwan ties, but opposes foreign interference",,2022-03-05 08:57:06+00:00
5,https://www.reuters.com/world/us/exclusive-americans-broadly-support-ukraine-no-fly-zone-russia-oil-ban-poll-2022-03-04/,04/,"EXCLUSIVE Americans broadly support Ukraine no-fly zone, Russia oil ban -poll",,2022-03-04 21:05:06+00:00
6,https://www.reuters.com/world/us/house-oversight-panel-postpones-big-oil-hearing-amid-ukraine-crisis-2022-03-02/,02/,House oversight panel postpones Big Oil hearing amid Ukraine crisis,,2022-03-02 21:15:36+00:00
7,https://www.reuters.com/world/more-russians-ukrainians-seek-asylum-us-mexico-border-2022-03-04/,04/,"More Russians, Ukrainians seek asylum at U.S.-Mexico border",,2022-03-04 21:14:58+00:00
8,https://www.reuters.com/business/aerospace-defense/nato-meets-ukraine-calls-no-fly-zone-hinder-russia-2022-03-04/,04/,"NATO rejects Ukraine no-fly zone, unhappy Zelenskiy says this means more bombing",,2022-03-04 21:40:14+00:00
9,https://www.reuters.com/world/new-us-ambassador-nicholas-burns-arrives-china-2022-03-05/,05/,New U.S. ambassador Nicholas Burns arrives in China,,2022-03-05 08:56:33+00:00


In [43]:
df.sort_values("time")

Unnamed: 0,url,id,headline,text,time
25,https://www.reuters.com/article/us-health-vaping-marijuana-idUSKBN1WG4KT,idUSKBN1WG4KT,"Banned in Boston: Without vaping, medical marijuana patients must adapt","BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\nThe 32-year-old massage th...",2019-10-01 18:35:04+00:00
6,https://www.reuters.com/world/us/house-oversight-panel-postpones-big-oil-hearing-amid-ukraine-crisis-2022-03-02/,02/,House oversight panel postpones Big Oil hearing amid Ukraine crisis,,2022-03-02 21:15:36+00:00
2,https://www.reuters.com/world/us/biden-heads-wisconsin-after-delivering-his-first-state-union-speech-2022-03-02/,02/,"Biden touts infrastructure, Ukraine support on Wisconsin trip",,2022-03-02 22:35:26+00:00
15,https://www.reuters.com/world/us/texas-republican-quits-us-house-race-admits-affair-with-former-isis-war-bride-2022-03-02/,02/,"Texas Republican quits U.S. House race, admits affair with former ISIS war bride",,2022-03-02 22:40:26+00:00
24,https://www.reuters.com/world/europe/ukraine-russia-what-you-need-know-right-now-2022-03-02/,02/,Ukraine and Russia: What you need to know right now,,2022-03-03 21:51:55+00:00
22,https://www.reuters.com/world/us/us-ramp-up-staffing-havana-embassy-process-some-visas-2022-03-03/,03/,U.S. to boost staffing at Havana Embassy to process some visas,,2022-03-03 23:02:39+00:00
20,https://www.reuters.com/business/solid-us-job-gains-forecast-february-unemployment-rate-seen-dipping-39-2022-03-04/,04/,Solid U.S. job gains forecast in February; unemployment rate seen dipping to 3.9%,,2022-03-04 05:02:19+00:00
19,https://www.reuters.com/article/us-ukraine-crisis-usa-nuclear-idUSKBN2L11IA,idUSKBN2L11IA,U.S. Embassy in Ukraine calls nuclear power plant attack 'war crime',"WASHINGTON (Reuters) -The U.S. Embassy in Ukraine said that attacking a nuclear power plant is a war crime, after Russia on Friday seized a Ukrainian nuclear facility that is the biggest in Europe...",2022-03-04 14:23:49+00:00
16,https://www.reuters.com/world/us/us-archives-turns-over-trump-white-house-visitor-logs-jan-6-committee-2022-03-04/,04/,U.S. Archives turns over Trump White House visitor logs to Jan. 6 committee,,2022-03-04 16:39:04+00:00
13,https://www.reuters.com/markets/us/sept-11-victims-seek-seizure-iran-oil-us-owned-tanker-2022-03-04/,04/,Sept. 11 victims seek seizure of Iran oil from U.S.-owned tanker,,2022-03-04 18:40:37+00:00


# 8. Blueprint: Density-Based Text Extraction

In [44]:
from readability import Document

doc = Document(html)
doc.title()

'Banned in Boston: Without vaping, medical marijuana patients must adapt - Reuters'

In [45]:
doc.short_title()

'Banned in Boston: Without vaping, medical marijuana patients must adapt'

In [46]:
doc.summary()

'<html><body><div><div class="StandardArticleBody_body"><p>BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 </p><div class="PrimaryAsset_container"><div class="Image_container" tabindex="-1"><figure class="Image_zoom"></figure><figcaption><div class="Image_caption"><span>FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah Nouvelage</span></div></figcaption></div></div><p>The 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. </p><p>There are other ways to get the desired effect from  marijuana, and patients h

In [47]:
doc.url

In [48]:
density_soup = BeautifulSoup(doc.summary(), 'html.parser')
density_soup.body.text

'BOSTON (Reuters) - In the first few days of the four-month ban on all vaping products in Massachusetts, Laura Lee Medeiros, a medical marijuana patient, began to worry.\xa0 FILE PHOTO: An employee puts down an eighth of an ounce marijuana after letting a customer smell it outside the Magnolia cannabis lounge in Oakland, California, U.S. April 20, 2018. REUTERS/Elijah NouvelageThe 32-year-old massage therapist has a diagnosis of post-traumatic stress disorder (PTSD) from childhood trauma. To temper her unpredictable panic attacks, she relied on a vape pen and cartridges filled with the marijuana derivatives THC and CBD from state dispensaries. There are other ways to get the desired effect from  marijuana, and patients have filled dispensaries across the state in recent days to ask about edible or smokeable forms. But Medeiros has come to depend on her battery-powered pen, and wondered how she would cope without her usual supply of cartridges.  “In the midst of something where I’m on t

# 9. Blueprint: Scrapy

Unfortunately, the code for `scrapy` cannot be adapted easily. This is one more argument for using *up-to-date*, separate libraries. In this version, the spider still collects the titles of the articles, but nothing more.

In [49]:
# Scrapy must be installed
import scrapy
import logging


class ReutersArchiveSpider(scrapy.Spider):
    name = 'reuters-archive'
    
    custom_settings = {
        'LOG_LEVEL': logging.WARNING,
        'FEED_FORMAT': 'json',
        'FEED_URI': 'reuters-archive.json'
    }
    
    start_urls = [
        'https://www.reuters.com/news/archive/',
    ]

    def parse(self, response):
        for article in response.css("article.story div.story-content a"):
            yield response.follow(article.css("a::attr(href)").extract_first(), self.parse_article)

        next_page_url = response.css('a.control-nav-next::attr(href)').extract_first()
        # `and` short-circuits, so the membership test is skipped when there is no next page
        if next_page_url is not None and 'page=2' not in next_page_url:
            yield response.follow(next_page_url, self.parse)

    def parse_article(self, response):
        yield {
          'title': response.css('h1::text').extract_first().strip(),
        }

In [50]:
# This can be run only once per kernel session in a Jupyter notebook,
# because the Twisted reactor used by Scrapy cannot be restarted
from scrapy.crawler import CrawlerProcess
process = CrawlerProcess()

process.crawl(ReutersArchiveSpider)
process.start()

2022-03-05 10:50:40 [scrapy.utils.log] INFO: Scrapy 2.6.1 started (bot: scrapybot)
2022-03-05 10:50:40 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.2.0, Python 3.8.12 | packaged by conda-forge | (default, Oct 12 2021, 21:22:46) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 22.0.0 (OpenSSL 1.1.1m  14 Dec 2021), cryptography 36.0.1, Platform Windows-10-10.0.19044-SP0
2022-03-05 10:50:40 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 30}


In [51]:
glob.glob("*.json")

['reuters-archive.json']

In [52]:
# The with statement closes the file again after reading
with open("reuters-archive.json", "r", encoding="utf-8") as f:
    content = f.read()
print(content)

[
{"title": "Islamic State leader and family blended in among Syrians uprooted by war"},
{"title": "Biden admin eases Trump-era solar tariffs but doesn't end them"},
{"title": "Putin and Xi unveil alliance at Olympics, mixing politics and sport"},
{"title": "Omicron surge likely slowed U.S. job growth, losses possible"},
{"title": "Ahmaud Arbery killer withdraws guilty plea on U.S. hate-crime charges"},
{"title": "Republican senator urges U.S. to monitor China's digital yuan push during Olympics"},
{"title": "U.S. House set to pass sweeping vote on China competition bill"},
{"title": "U.S. winter storm leaves hundreds of thousands without power"},
{"title": "Oil hits fresh seven-year highs, closing out seven weeks of gains"},
{"title": "Texas man charged with threatening Georgia election officials pleads not guilty"},
{"title": "News Corp suspects China behind cyberattack on its system"},
{"title": "Loyal to Trump, Republican Party moves to censure U.S. Reps. Cheney, Kinzinger"},
{"tit
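
Because the feed is written as JSON, it can also be loaded into Python objects instead of being printed as text. A sketch on a two-entry excerpt of the output above (`sample_content` and `titles` are names chosen for this example):

```python
import json

# Two-entry excerpt of the file content shown above
sample_content = """[
{"title": "Islamic State leader and family blended in among Syrians uprooted by war"},
{"title": "Biden admin eases Trump-era solar tariffs but doesn't end them"}
]"""

titles = [item["title"] for item in json.loads(sample_content)]
print(len(titles))
print(titles[0])
```

For the real file, `json.load(f)` on the opened file object gives the same kind of list.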