# Website scraping

---

Group name: K

---


Setup:

In [3]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


## Requests

We fetch the web page with the `get` function from `requests`:

In [4]:
url = 'https://fivethirtyeight.com/features/house-control-republicans/'

In [5]:
html = requests.get(url)

Use `status_code` to check whether the request succeeded:

In [6]:
html.status_code

200

Verify that the request returned the expected status and URL:

In [7]:
assert html.status_code == 200
assert html.url == "https://fivethirtyeight.com/features/house-control-republicans/"
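
The status check above can also be folded into a small helper. The following is a minimal sketch, not part of the original notebook: the function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. `raise_for_status()` raises an exception for 4xx and 5xx responses, so a failed request cannot go unnoticed.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML text of `url` (hypothetical helper, not from the notebook)."""
    # A timeout keeps the call from hanging forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```

It could be called as `fetch_html(url)` with the `url` defined above; the returned string can be passed directly to `BeautifulSoup`.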

## Beautiful Soup

Inspect the HTML with Beautiful Soup:

In [8]:
soup = BeautifulSoup(html.text, 'html.parser')

In [9]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <link href="//dcf.espn.com" rel="dns-prefetch"/>
  <script type="text/javascript">
   function setCookie(name,value,days) {
				var expires = "";
				if (days) {
					var date = new Date();
					date.setTime(date.getTime() + (days*24*60*60*1000));
					expires = "; expires=" + date.toUTCString();
				}
				document.cookie = name + "=" + (value || "") + expires + "; path=/";
			}
			setCookie('country','ca',1);
  </script>
  <script src="https://dcf.espn.com/TWDC-DTCI/prod/Bootstrap.js">
  </script>
  <title>
   Republicans Won The House — Barely | FiveThirtyEight
  </title>
  <meta content="max-image-preview:large" name="robots">
   <link as="font" crossorigin="anonymous" href="https://fivethirtyeight.com/wp-content/themes/espn-fivethirtyeight/dist/fonts/AtlasGrotesk-Bold-Web.woff2" rel="preload" type="font/woff2"/>
   <link as=

In [10]:
print(soup.get_text())









Republicans Won The House — Barely | FiveThirtyEight

Skip to main content







FiveThirtyEight





Search



Search




ABC News

Menu




											Republicans Won The House — Barely									
Share on Facebook
Share on Twitter






						Politics					



						Sports					



						Science					



						Podcasts					



						Video					



						ABC News					
















2022 Election
Republicans Won The House — Barely



By Nathaniel Rakich


									Nov. 16, 2022, at 7:12 PM								


 







As expected, Republicans have taken control of the U.S. House of Representatives, ending two years of Democratic rule in Washington, D.C. But what wasn’t expected was how narrow the Republicans’ majority would be.
As of Wednesday evening, ABC News estimates that Republicans have won at least 218 seats and Democrats have won at least 210. 
Seven districts remain undecided: four wher
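
The output of `get_text()` contains many blank lines because every whitespace-only text node between tags is kept. As a minimal sketch, `get_text` also accepts a `separator` and a `strip` argument that strip each text fragment and drop the empty ones. The HTML below is a made-up sample, not the scraped page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with whitespace between the tags
sample_html = """
<html><head><title>Example title</title></head>
<body>
  <nav>  Menu  </nav>

  <p>First paragraph.</p>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# strip=True strips each text fragment and skips empty ones;
# separator="\n" puts every remaining fragment on its own line
clean_text = sample_soup.get_text(separator="\n", strip=True)
print(clean_text)
```

Applied to the notebook's `soup`, the same call would give one non-empty text fragment per line.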

## Inspecting the title:

In [11]:
soup.title

<title>Republicans Won The House — Barely | FiveThirtyEight</title>

In [12]:
soup.title.string

'Republicans Won The House — Barely | FiveThirtyEight'

In [13]:
soup.title.parent.name

'head'

In [14]:
title_text = []
title_text.append(soup.title.text)
title_text

['Republicans Won The House — Barely | FiveThirtyEight']
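
The `<title>` string carries the site name after the headline. If only the headline is wanted, the suffix can be split off. This is a sketch that assumes the site name always follows the last `" | "` separator, which is only verified for this one page.

```python
# Title string as returned by soup.title.string above
raw_title = "Republicans Won The House — Barely | FiveThirtyEight"

# Split once from the right so a " | " inside the headline itself would survive
headline = raw_title.rsplit(" | ", 1)[0]
print(headline)
```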

Insert the title into a pandas DataFrame:

In [16]:
df_title = pd.DataFrame({"Titel" : title_text})
df_title

| | Titel |
|---|---|
| 0 | Republicans Won The House — Barely \| FiveThirt... |


Find and output the author's name:

In [22]:
authors_text = []

# Collect the link text of every <a> tag whose class attribute is 'author url fn'
for link in soup.find_all('a', {'class': 'author url fn'}):
    authors_text.append(link.text)

authors_text

['Nathaniel Rakich']
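
Passing the full string `'author url fn'` to `find_all` matches the class attribute literally, so the same classes in a different order would be missed. A CSS selector via `select` checks each class independently. The sketch below uses made-up markup and names to show the difference.

```python
from bs4 import BeautifulSoup

# Made-up markup: same three classes, written in two different orders
sample_html = (
    '<a class="fn author url" href="#">Jane Doe</a>'
    '<a class="author url fn" href="#">John Roe</a>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Exact string match on the class attribute: order matters
exact = [a.get_text(strip=True)
         for a in sample_soup.find_all("a", {"class": "author url fn"})]

# CSS selector: every listed class must be present, in any order
by_selector = [a.get_text(strip=True)
               for a in sample_soup.select("a.author.url.fn")]

print(exact)
print(by_selector)
```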

Insert the author's name into a pandas DataFrame:

In [23]:
df_author = pd.DataFrame({"Author" : authors_text})
df_author

| | Author |
|---|---|
| 0 | Nathaniel Rakich |


Combine the article title with the author DataFrame:

In [25]:
df_author.join(df_title)

| | Author | Titel |
|---|---|---|
| 0 | Nathaniel Rakich | Republicans Won The House — Barely \| FiveThirt... |


## Body:

Extract the text (body) of the article:

In [17]:
body = []

# Collect the full text of every <div> with the class 'entry-content'
for div in soup.find_all('div', {'class': 'entry-content'}):
    body.append(div.text)

body

['\n\n\n\n2022 Election\nRepublicans Won The House — Barely\n\n\n\nBy Nathaniel Rakich\n\n\n\t\t\t\t\t\t\t\t\tNov. 16, 2022, at 7:12 PM\t\t\t\t\t\t\t\t\n\n\n \n\n\n\n\n\n\n\nAs expected, Republicans have taken control of the U.S. House of Representatives, ending two years of Democratic rule in Washington, D.C. But what wasn’t expected was how narrow the Republicans’ majority would be.\nAs of Wednesday evening, ABC News estimates that Republicans have won at least 218 seats and Democrats have won at least 210. \nSeven districts remain undecided: four where Democrats currently lead and three where Republicans do. If those leads all hold (which seems likely), Republicans will go into the 118th Congress with a 221-to-214 majority.\xa0\n\nSeveral House races are still up in the air\nDistricts where ABC News has not yet reported a projected winner, as of 7:04 p.m. Eastern\nRace\nDemocrat\nRepublican\nPercent reporting\nVote margin\nVote share margin\nAK-1\nPeltola\xa0i\nPalin\n81%\n53,297\nD

## Problem:

The article is made up of several 'building blocks' rather than one continuous text (the article text has no class of its own). Because the article contains several charts, the text of those charts is included in the output as well.
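
One possible way around this problem is to collect only the `<p>` elements inside the container. This is a sketch under the assumption that the running text sits in `<p>` tags while chart content sits in other elements such as tables; that has not been checked against the real page, and the markup below is a made-up sample.

```python
from bs4 import BeautifulSoup

# Made-up sample: two paragraphs with a chart table in between
sample_html = """
<div class="entry-content">
  <p>First paragraph of the article.</p>
  <table><tr><th>Race</th><td>AK-1</td></tr></table>
  <p>Second paragraph of the article.</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Restrict the search to the article container, then keep only <p> tags
container = sample_soup.find("div", class_="entry-content")
paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]

article_text = "\n".join(paragraphs)
print(article_text)
```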

Store the data in a pandas DataFrame:

In [27]:
df_body = pd.DataFrame({"Artikeltext" : body})
df_body

| | Artikeltext |
|---|---|
| 0 | \n\n\n\n2022 Election\nRepublicans Won The Hou... |


Combine all data in a single pandas DataFrame using a left join (the default for `join`, which matches rows on the index):

In [28]:
df = df_title.join(df_author).join(df_body)
df

| | Titel | Author | Artikeltext |
|---|---|---|---|
| 0 | Republicans Won The House — Barely \| FiveThirt... | Nathaniel Rakich | \n\n\n\n2022 Election\nRepublicans Won The Hou... |
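
Since each list holds exactly one entry here, the same table could also be built in one step from a dictionary. This is a sketch with placeholder values standing in for `title_text`, `authors_text` and `body`. Unlike `join`, this constructor requires all lists to have the same length.

```python
import pandas as pd

# Placeholder lists standing in for the scraped title_text, authors_text and body
title_text = ["Example title"]
authors_text = ["Example author"]
body = ["Example article text"]

# One column per list; all lists must have the same length
df_direct = pd.DataFrame(
    {"Titel": title_text, "Author": authors_text, "Artikeltext": body}
)
print(df_direct)
```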


Convert the DataFrame to a CSV file and save it in data/raw:

In [30]:
df.to_csv("/Users/helenbruker/Uni/OMM4/BigData/homework-1/data/raw/webscraping.csv")
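
The absolute path above only exists on one machine. A more portable variant could use a path relative to the project folder, as in the sketch below; the helper name `save_raw_csv` and its defaults are assumptions for illustration. `index=False` keeps the row index from being written as an extra unnamed column.

```python
from pathlib import Path

import pandas as pd


def save_raw_csv(df, directory="data/raw", filename="webscraping.csv"):
    """Write `df` as CSV into `directory` and return the file path (hypothetical helper)."""
    target_dir = Path(directory)
    # Create the folder (and any missing parents) if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # index=False omits the row index from the file
    df.to_csv(target, index=False)
    return target
```

Calling `save_raw_csv(df)` from the project root would then write `data/raw/webscraping.csv` relative to the current working directory.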