# Web Scraping

In [23]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd

#### Web Scraping an Article from Tagesschau

In [2]:
url = 'https://www.tagesschau.de/ausland/ischgl-corona-111.html'
requests.get(url).ok
# ".ok" is True when the HTTP status code is below 400, i.e. the request did not fail

True
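
To make the meaning of `.ok` concrete, here is a minimal sketch that needs no network access. It builds bare `requests.Response` objects by hand, which is an assumption made for illustration only; in the notebook the response comes from `requests.get(url)`. The helper name `is_ok` is hypothetical.

```python
import requests

def is_ok(status_code):
    """Return the .ok flag of a hand-built Response with the given status code."""
    response = requests.Response()      # built by hand, no request is sent
    response.status_code = status_code
    return response.ok                  # False only for 4xx and 5xx codes

print(is_ok(200))   # success
print(is_ok(301))   # redirect codes still count as "ok"
print(is_ok(404))   # client error
```

For a real request, `response.raise_for_status()` is the stricter alternative: it raises an exception for the same codes instead of returning `False`.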

In [4]:
html = requests.get(url).content
print(html)

b'<!DOCTYPE html>\n<html lang="de">\n<head>\n<meta http-equiv="X-UA-Compatible" content="IE=edge"/>\n<title>Corona-Hotspot Ischgl: Justiz ermittelt gegen vier Personen | tagesschau.de</title>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>\n<meta http-equiv="pragma" content="no-cache"/>\n<meta http-equiv="cache-control" content="private"/>\n<meta name="viewport" content="width=device-width, user-scalable=yes, initial-scale=1.0, minimum-scale=1.0"/>\n<meta name="apple-mobile-web-app-capable" content="no"/>\n<meta name="description" content="Ischgl soll ma\xc3\x9fgeblich zur Virus-Verbreitung in Teilen Europas beigetragen haben. Was lief schief im Tiroler Skiort? 10.000 Seiten Material hat die Justiz zusammengetragen - gegen vier Personen wird ermittelt. <em>Von Clemens Verenkotte.</em>" />\n<meta name="keywords" content="Nachrichten, Inland, Ausland, Wirtschaft, Sport, Kultur Reportage, Bericht, News, Tagesthemen, Aktuell, Neu, Neuigkeiten, Hintergrund, Hintergrund

In [6]:
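#soup = BeautifulSoup(html, 'lxml')
soup = BeautifulSoup(html, 'html.parser')

# 'lxml' and 'html.parser' both parse the HTML into a tree and give similar results on well-formed pages
# 'html.parser' ships with Python; 'lxml' is a separate package and is usually faster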

In [7]:
soup.find_all(['h1','h2', 'h3', 'p'])

# ".find_all" returns every matching tag; ".find" returns only the first match
# the name can be a single tag name or a list of tag names
# the result is a list of Tag objects with several attributes, one of which is the text

[<p>Detail Navigation:</p>,
 <h1>
 <span class="dachzeile">Corona-Hotspot Ischgl</span>
 <span class="headline">Justiz ermittelt gegen vier Personen</span>
 </h1>,
 <p class="text"><span class="stand">Stand: 30.09.2020 13:30 Uhr</span></p>,
 <p class="text small"><strong>Ischgl soll maßgeblich zur Virus-Verbreitung in Teilen Europas beigetragen haben. Was lief schief im Tiroler Skiort? 10.000 Seiten Material hat die Justiz zusammengetragen - gegen vier Personen wird ermittelt.</strong></p>,
 <p class="autorenzeile small">Von Clemens Verenkotte, ARD-Studio Wien</p>,
 <p class="text small"> Die Staatsanwaltschaft Innsbruck hat gegen vier Beschuldigte rund um die Umsetzung der Quarantäne-Verordnungen im Paznauntal und der Verkehrsbeschränkungen in Ischgl Mitte März dieses Jahres Ermittlungsverfahren eingeleitet. Dies bestätigte der Sprecher der Staatsanwaltschaft Innsbruck, Hansjörg Mayr, gegenüber dem <em>ARD-Studio Wien</em>. Auskünfte dazu, um wen es sich bei den Beschuldigten handele,
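
The difference between `.find` and `.find_all` is easier to see on a tiny document. The following sketch uses a made-up HTML string (`sample_html`) instead of the Tagesschau page, so it runs without a network connection.

```python
from bs4 import BeautifulSoup

# made-up HTML, only for illustration
sample_html = "<h1>Title</h1><p>First</p><p>Second</p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")

first_p = sample_soup.find("p")              # a single Tag, or None if nothing matches
all_p = sample_soup.find_all("p")            # a list-like ResultSet of Tags
mixed = sample_soup.find_all(["h1", "p"])    # several tag names, returned in document order

print(first_p.text)
print([tag.text for tag in all_p])
print([tag.name for tag in mixed])
```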

In [12]:
# now extract the text attribute, using the "h2" tag as an example
# step-by-step alternative:
#headings2 = soup.find_all('h2')
#[i.text for i in headings2]
heading2 = [i.text for i in soup.find_all('h2')]
print(heading2)

['Corona-Verordnung Stunden später an "schwarzen Brettern"', '"Mehr konnten wir nicht tun"', '10.000 Seiten Beweismaterial gesichtet', 'Überblick über die tagesschau.de-Seiten und weitere ARD Online-Angebote']


In [14]:
# to get all link tags ("a"):
[i.text for i in soup.find_all("a")]
# ".text" gives the visible link text, not the URL (the URL is stored in the "href" attribute)
# [i.text for i in soup.find_all(['a'])] works as well, but the list brackets aren't needed for a single tag name

['',
 'Hauptnavigation',
 'Zum Inhalt',
 'Zur Suche',
 'Zum Seitenanfang',
 '',
 'Zum Inhalt',
 'ARD Navigation',
 'ARD Home',
 'Nachrichten',
 'Sport',
 'Börse',
 'Ratgeber',
 'Wissen',
 'Kultur',
 'Kinder',
 'Die\xa0ARD',
 'Fernsehen',
 'Radio',
 'ARD Mediathek',
 '',
 '',
 '',
 '',
 '\n\n',
 '',
 'Menü',
 'Startseite',
 'Videos & Audios',
 'Startseite Videos & Audios',
 'Livestream',
 'Die wichtigsten Nachrichten als Video',
 'Tagesschau in 100 Sekunden',
 'Letzte Sendung',
 'Tagesschau 20 Uhr',
 'Tagesschau 20 Uhr (Gebärdensprache)',
 'Tagesthemen',
 'Nachtmagazin',
 'Tagesschau24',
 'Bericht aus Berlin',
 '#kurzerklärt',
 'Sendungsarchiv',
 'Podcasts',
 'Politik im Radio',
 'Bildergalerien',
 'Inland',
 'Startseite Inland',
 'Dossiers',
 'DeutschlandTrend',
 'Regional',
 'Sieben-Tage-Überblick',
 'Bundestagswahl',
 'Ausland',
 'Startseite Ausland',
 'Dossiers',
 'Nachrichten aus der EU',
 'Sieben-Tage-Überblick',
 'Investigativ',
 'Wirtschaft',
 'Startseite Wirtschaft',
 'Dossiers
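
The output above contains many empty strings and no URLs, because `.text` only returns the visible link text. A possible way to collect text and target together is sketched below. The HTML in `sample_html` is invented for the example and is not the real Tagesschau markup.

```python
from bs4 import BeautifulSoup

# invented markup, only for illustration
sample_html = '''
<a href="/inland/">Inland</a>
<a href="https://www.ard.de">ARD Home</a>
<a name="top"></a>
<a href="/ausland/">  Ausland </a>
'''
sample_soup = BeautifulSoup(sample_html, "html.parser")

links = [
    (a.get_text(strip=True), a["href"])
    for a in sample_soup.find_all("a", href=True)   # only anchors that carry an href
    if a.get_text(strip=True)                        # skip anchors without visible text
]
print(links)
```

`href=True` restricts the search to tags that have the attribute at all, so `a["href"]` cannot raise a `KeyError` here.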

In [13]:
# now extract the text attribute of the "h1", "h2" and "p" tags
text = [i.text for i in soup.find_all(['h1', 'h2', 'p'])]
print("\n".join(text))

# "\n".join(...) puts each list element on its own line
# the empty lines in the output come from newlines inside the elements themselves (e.g. the <h1>)
# ".text" is part of bs4

Detail Navigation:

Corona-Hotspot Ischgl
Justiz ermittelt gegen vier Personen

Stand: 30.09.2020 13:30 Uhr
Ischgl soll maßgeblich zur Virus-Verbreitung in Teilen Europas beigetragen haben. Was lief schief im Tiroler Skiort? 10.000 Seiten Material hat die Justiz zusammengetragen - gegen vier Personen wird ermittelt.
Von Clemens Verenkotte, ARD-Studio Wien
 Die Staatsanwaltschaft Innsbruck hat gegen vier Beschuldigte rund um die Umsetzung der Quarantäne-Verordnungen im Paznauntal und der Verkehrsbeschränkungen in Ischgl Mitte März dieses Jahres Ermittlungsverfahren eingeleitet. Dies bestätigte der Sprecher der Staatsanwaltschaft Innsbruck, Hansjörg Mayr, gegenüber dem ARD-Studio Wien. Auskünfte dazu, um wen es sich bei den Beschuldigten handele, würden nicht erteilt.
Corona-Verordnung Stunden später an "schwarzen Brettern"
 Damit richten sich rund sechs Monate nach der teilweise chaotisch abgelaufenen Abreise von rund 10.000 Touristen aus Ischgl und dem Paznauntal am Freitag, den 13. Mä
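
Where the empty lines come from can be checked on a small made-up snippet. The sketch below compares `.text` with `get_text(" ", strip=True)`, which strips each text fragment and joins the fragments with a space. The HTML string is an assumption that only mimics the `<h1>` with two `<span>` children seen earlier.

```python
from bs4 import BeautifulSoup

# invented markup that mimics a headline with two spans on separate lines
sample_html = "<h1>\n<span>Kicker</span>\n<span>Headline</span>\n</h1><p>Body text.</p>"
sample_soup = BeautifulSoup(sample_html, "html.parser")
tags = sample_soup.find_all(["h1", "p"])

raw = "\n".join(tag.text for tag in tags)                          # keeps the newlines inside <h1>
clean = "\n".join(tag.get_text(" ", strip=True) for tag in tags)   # one line per tag

print(repr(raw))
print(repr(clean))
```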

#### Writing a function for getting the soup from a URL

In [16]:
# alternative parser: replace 'html.parser' with 'lxml'
def make_soup(url):
    # download the page and parse it into a BeautifulSoup object
    html = requests.get(url).content
    soup = BeautifulSoup(html, 'html.parser')
    return soup
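
A slightly more defensive variant is sketched below. It is not part of the original notebook: the name `make_soup_checked` and the default `timeout` of 10 seconds are assumptions chosen for the example.

```python
import requests
from bs4 import BeautifulSoup

def make_soup_checked(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors.

    Hypothetical variant of make_soup; the timeout value is an arbitrary choice.
    """
    response = requests.get(url, timeout=timeout)   # give up instead of waiting forever
    response.raise_for_status()                     # raises requests.HTTPError on 4xx/5xx
    return BeautifulSoup(response.content, "html.parser")

# usage (needs network access):
# soup = make_soup_checked("https://en.wikipedia.org/wiki/List_of_police_dog_breeds")
```

Without the status check, an error page would be parsed like any other page, and later `find_all` calls would simply return unexpected or empty results.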

#### Web Scraping a Wikipedia Article - example:  Retrieve list of dogs

In [24]:
url = "https://en.wikipedia.org/wiki/List_of_police_dog_breeds"
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())  # prints the parsed HTML with one tag per line and indentation

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of police dog breeds - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c6d51784-6453-4f44-9335-092da7f94d13","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_police_dog_breeds","wgTitle":"List of police dog breeds","wgCurRevisionId":980139453,"wgRevisionId":980139453,"wgArticleId":17333279,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Articles to be merged from July 2

In [25]:
# the same with our make_soup function

url = 'https://en.wikipedia.org/wiki/List_of_police_dog_breeds'
soup = make_soup(url)
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of police dog breeds - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"c6d51784-6453-4f44-9335-092da7f94d13","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_police_dog_breeds","wgTitle":"List of police dog breeds","wgCurRevisionId":980139453,"wgRevisionId":980139453,"wgArticleId":17333279,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles with short description","Short description is different from Wikidata","Articles to be merged from July 2

In [49]:
help(soup.find_all)

Help on method find_all in module bs4.element:

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
    Extracts a list of Tag objects that match the given
    criteria.  You can specify the name of the Tag and any
    attributes you want the Tag to have.
    
    The value of a key-value pair in the 'attrs' map can be a
    string, a list of strings, a regular expression object, or a
    callable that takes a string and returns whether or not the
    string matches for some custom definition of 'matches'. The
    same is true of the tag name.
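
The matching rules described in the help text can be tried out on a small example. The sketch below uses invented markup that loosely resembles a table of contents. It passes a string, a list, a regular expression and a callable as attribute values, and also shows the `limit` parameter.

```python
import re
from bs4 import BeautifulSoup

# invented markup, only for illustration
sample_html = '''
<ul>
  <li class="toclevel-1"><a href="#Tracking_dogs">Tracking dogs</a></li>
  <li class="toclevel-2"><a href="#See_also">See also</a></li>
  <li><a href="https://example.org/">External</a></li>
</ul>
'''
sample_soup = BeautifulSoup(sample_html, "html.parser")

by_string = sample_soup.find_all("li", attrs={"class": "toclevel-1"})
by_list = sample_soup.find_all("li", attrs={"class": ["toclevel-1", "toclevel-2"]})  # any of these
by_regex = sample_soup.find_all("a", attrs={"href": re.compile(r"^#")})              # in-page links
by_callable = sample_soup.find_all(
    "a", attrs={"href": lambda value: value is not None and value.startswith("http")}
)
limited = sample_soup.find_all("li", limit=2)                                        # stop after two hits

print(len(by_string), len(by_list), len(by_regex), len(by_callable), len(limited))
```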



In [19]:
soup.find_all("li")
# finds all list items (<li> tags)

[<li class="toclevel-1 tocsection-1"><a href="#All_police_dog_breeds_used_in_law_enforcement"><span class="tocnumber">1</span> <span class="toctext">All police dog breeds used in law enforcement</span></a></li>,
 <li class="toclevel-1 tocsection-2"><a href="#Illicit-substance_detection_dogs"><span class="tocnumber">2</span> <span class="toctext">Illicit-substance detection dogs</span></a></li>,
 <li class="toclevel-1 tocsection-3"><a href="#Tracking_dogs"><span class="tocnumber">3</span> <span class="toctext">Tracking dogs</span></a></li>,
 <li class="toclevel-1 tocsection-4"><a href="#Cadaver-sniffing_dogs"><span class="tocnumber">4</span> <span class="toctext">Cadaver-sniffing dogs</span></a></li>,
 <li class="toclevel-1 tocsection-5"><a href="#See_also"><span class="tocnumber">5</span> <span class="toctext">See also</span></a></li>,
 <li class="toclevel-1 tocsection-6"><a href="#References"><span class="tocnumber">6</span> <span class="toctext">References</span></a></li>,
 <li><a hr

In [31]:
# "li" tags are list items
dogs = [i.text.replace('[', '').replace(']', '').strip() for i in soup.find_all('li') if re.match('[A-Z]', i.text)][0:17]
print(dogs)
# re.match('[A-Z]', i.text) keeps only items whose text starts with a capital letter
# str.strip() without arguments returns a copy of the string with leading and trailing whitespace removed
# replace() only removes the brackets, presumably of footnote markers like "[1]", so the digit stays (e.g. 'Border Collie1')
# [0:17] keeps the first 17 matches

#alternative_way:
# dogs = [i.text.replace('[', '').replace(']', '').strip() for i in soup.find_all('li') if re.match('[A-Z]', i.text)]
# dogs[0:17]


['Airedale Terrier', 'Akita', 'Belgian Malinois', 'Belgian Sheepdog', 'Staffordshire Bull Terrier', 'Border Collie1', 'Bouvier des Flandres', 'Boxer', 'Doberman Pinscher', 'Dutch Shepherd', 'German Shepherd', 'Giant Schnauzer', 'Indian pariah dog2', 'Labrador Retriever', 'Rottweiler', 'Weimaraner', 'Bloodhound']
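
Entries such as 'Border Collie1' keep a trailing digit because only the brackets are removed. One possible fix is to delete the whole marker with a regular expression. The sketch below assumes that the markers look like `[1]` and uses simplified, invented list markup rather than the real Wikipedia HTML.

```python
import re
from bs4 import BeautifulSoup

# simplified, invented markup; the real page structure may differ
sample_html = '''
<ul>
  <li><a href="/wiki/Border_Collie">Border Collie</a><sup>[1]</sup></li>
  <li><a href="/wiki/Boxer">Boxer</a></li>
  <li><a href="/wiki/Indian_pariah_dog">Indian pariah dog</a><sup>[2]</sup></li>
</ul>
'''
sample_soup = BeautifulSoup(sample_html, "html.parser")

footnote = re.compile(r"\[\d+\]")   # matches markers such as [1] or [12]
breeds = [
    footnote.sub("", li.text).strip()
    for li in sample_soup.find_all("li")
    if re.match(r"[A-Z]", li.text)   # same capital-letter filter as in the notebook cell
]
print(breeds)
```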


In [32]:
# put the list into a pandas DataFrame
pd.DataFrame(dogs)

Unnamed: 0,0
0,Airedale Terrier
1,Akita
2,Belgian Malinois
3,Belgian Sheepdog
4,Staffordshire Bull Terrier
5,Border Collie1
6,Bouvier des Flandres
7,Boxer
8,Doberman Pinscher
9,Dutch Shepherd
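
The column in the output above is simply labelled `0`. Passing a dictionary gives the column a readable name. The column name `breed` and the shortened list `sample_dogs` are assumptions for this sketch.

```python
import pandas as pd

# shortened sample list; the notebook uses the full "dogs" list
sample_dogs = ["Airedale Terrier", "Akita", "Belgian Malinois"]

df = pd.DataFrame({"breed": sample_dogs})   # the dict key becomes the column name
print(df)
```

`pd.DataFrame(sample_dogs, columns=["breed"])` should give the same result.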


In [80]:
# to find all link tags that belong to a specified class
soup.find_all('a', {'class':'mw-jump-link'})

[<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#searchInput">Jump to search</a>]
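
Besides the attribute dictionary, Beautiful Soup offers the `class_` keyword and CSS selectors via `.select`. The sketch below shows all three on a small HTML string. The third link is invented so that there is something to filter out.

```python
from bs4 import BeautifulSoup

# two jump links as in the output above, plus one invented link of another class
sample_html = '''
<a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#searchInput">Jump to search</a>
<a class="external text" href="https://example.org/">Elsewhere</a>
'''
sample_soup = BeautifulSoup(sample_html, "html.parser")

via_attrs = sample_soup.find_all("a", {"class": "mw-jump-link"})   # attribute dictionary
via_keyword = sample_soup.find_all("a", class_="mw-jump-link")     # class_ avoids the reserved word "class"
via_css = sample_soup.select("a.mw-jump-link")                     # CSS selector syntax

print([a["href"] for a in via_css])
```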


