# Scraping tables 

Extracting data manually is sometimes necessary for a single website, but it would be very time-consuming if you had to do it for thousands of websites. You can write scripts that extract the data automatically. This process is called scraping.

In [None]:
import requests 
from bs4 import BeautifulSoup
import pandas as pd
import re

> Taken from the very good Chris Albon tutorials: https://chrisalbon.com/python/beautiful_soup_scrape_table.html

In this notebook I'll show you how to scrape this simple table, so you get an idea of how scraping works:
    "http://www1.wdr.de/nachrichten/landespolitik/landtagswahl/wss-kriminalitaet-barrierefrei-104.html"

Requests is a Python package that fetches data from a URL into Python.

In [None]:
r = requests.get("http://www1.wdr.de/nachrichten/landespolitik/landtagswahl/wss-kriminalitaet-barrierefrei-104.html")

In [None]:
r
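
Evaluating `r` only shows the status of the request, for example `<Response [200]>`. As a minimal sketch of a slightly more defensive download, assuming a hypothetical helper name `fetch_html` and a 10-second timeout (neither is part of this notebook), the request could be wrapped like this:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of
    # silently parsing an error page later on.
    response.raise_for_status()
    return response.text


# Usage (needs network access):
# html = fetch_html("http://www1.wdr.de/nachrichten/landespolitik/landtagswahl/wss-kriminalitaet-barrierefrei-104.html")
```

The returned text can be passed to BeautifulSoup in the same way as `r.text`.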

> We will now use a very helpful Python scraping package called 'BeautifulSoup':
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/)

In [None]:
soup = BeautifulSoup(r.text, 'lxml')

In [None]:
soup.table

https://tutorial.djangogirls.org/en/html/

## What is HTML?

HTML is a simple code that is interpreted by your web browser – such as Chrome, Firefox or Safari – to display a web page for the user.

HTML stands for "HyperText Markup Language". HyperText means it's a type of text that supports hyperlinks between pages. Markup means we have taken a document and marked it up with code to tell something (in this case, a browser) how to interpret the page. HTML code is built with tags, each one starting with < and ending with >. These tags represent markup elements.

HTML is not our topic today. If you want to get into it, this is a good start:  
https://www.w3schools.com/

https://www.w3schools.com/html/html_tables.asp

```
An HTML table is defined with the <table> tag.
Each table row is defined with the <tr> tag. 
A table header is defined with the <th> tag. 
By default, table headings are bold and centered. A table data/cell is defined with the <td> tag.
```
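
To see how these tags look from Python, here is a small sketch on a hand-written table. The data and the variable names `demo`, `headers` and `rows` are made up for illustration, and it uses Python's built-in `html.parser` instead of `lxml` so it runs without extra dependencies:

```python
from bs4 import BeautifulSoup

# A tiny hand-written table (illustrative data, not taken from the WDR page).
html = """
<table>
  <tr><th>Year</th><th>Cases</th></tr>
  <tr><td>2001</td><td>10</td></tr>
  <tr><td>2002</td><td>12</td></tr>
</table>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
demo = BeautifulSoup(html, "html.parser")

# <th> cells hold the header, <td> cells hold the data.
headers = [th.get_text(strip=True) for th in demo.find_all("th")]
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in demo.find_all("tr")
    if tr.find("td") is not None  # skip the header row, which has no <td>
]

print(headers)
print(rows)
```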

## You do it 
https://www.w3schools.com/html/tryit.asp?filename=tryhtml_table_intro

Please open the link and change one data cell and one table header.

If you format the `soup.table` output nicely, you get:

In [None]:
<div class="table">
    <table class="thleft"><caption>Bekannt gewordene Fälle in NRW</caption>
        <thead>
            <tr class="headlines">
                <th class="entry">Jahr</th>
                <th class="entry">Bekannt gewordene Fälle</th>
            </tr>
        </thead>
        
        <tbody>
                    <tr class="data"><td class="entry">2005</td><td class="entry">1.503.451</td></tr>
                    <tr class="data"><td class="entry">2006</td><td class="entry">1.491.897</td></tr>
                    [...]
                    <tr class="data"><td class="entry">2015</td><td class="entry">1.517.448</td></tr>
                    <tr class="data"><td class="entry">2016</td><td class="entry">1.469.426</td></tr>
        </tbody>
    </table>
</div>

## Scrape it 

In [None]:
# Create empty lists to store the scraped data in
jahr = []
fälle = []

In [None]:
# Find the first element whose class is 'table'
table = soup.find(class_='table')
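
`find` returns only the first matching element, and `None` if nothing matches. A minimal sketch on a hand-written snippet that mimics the structure shown above (the names `snippet`, `demo`, `first_match` and `no_match` are illustrative only):

```python
from bs4 import BeautifulSoup

snippet = """
<div class="table">
  <table class="thleft"><tr><td>2005</td></tr></table>
</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

# The first element with class "table" is the <div>, not the <table> inside it.
first_match = demo.find(class_="table")
# A class that does not occur in the document gives None.
no_match = demo.find(class_="does-not-exist")

print(first_match.name)
print(no_match)
```

Because the `<table>` sits inside that `<div>`, calling `find_all('tr')` on the result still reaches every row.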

In [None]:
for row in table.find_all('tr')[1:]:
    print(row)

In [None]:
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    print(col)

In [None]:
for row in table.find_all('tr')[1:]:
    col = row.find_all('td')
    column_1 = col[0].string.strip()
    print(column_1)
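
A note on `.string`: it works here because each cell contains nothing but text. If a cell holds several pieces, such as a tag plus text, `.string` is `None` and `.strip()` would fail. A small sketch with hand-written cells (illustrative data) shows the difference and uses `get_text` as one possible alternative:

```python
from bs4 import BeautifulSoup

# Hand-written cells for illustration; the second one mixes a tag and text.
demo = BeautifulSoup(
    "<table><tr><td>2005</td><td><b>2006</b> (est.)</td></tr></table>",
    "html.parser",
)
simple_cell, mixed_cell = demo.find_all("td")

print(simple_cell.string)                    # the plain text of the cell
print(mixed_cell.string)                     # None, because the cell has two children
print(mixed_cell.get_text(" ", strip=True))  # all text pieces, joined by a space
```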

In [None]:
# Find all the <tr> tag pairs, skip the first one (the header row), then for each row:
for row in table.find_all('tr')[1:]:
    # Get all the <td> tag pairs in this <tr> tag pair
    col = row.find_all('td')

    # Get the string inside the 1st <td> tag pair
    column_1 = col[0].string.strip()
    # and append it to the jahr list
    jahr.append(column_1)

    # Get the string inside the 2nd <td> tag pair
    column_2 = col[1].string.strip()
    # and append it to the fälle list
    fälle.append(column_2)

In [None]:
# Map each column name to its list of scraped values
columns = {'jahr': jahr, 'fälle': fälle}

# Create a dataframe from the columns variable
df = pd.DataFrame(columns, columns=['jahr', 'fälle'])

In [None]:
df
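
The scraped values are still strings, and the case counts use a dot as thousands separator (for example `1.503.451`). A minimal sketch of converting them to numbers, shown on a separate two-row frame named `demo_df` (a made-up name) with values copied from the table above:

```python
import pandas as pd

# Two rows copied from the table above; the full column works the same way.
demo_df = pd.DataFrame(
    {"jahr": ["2005", "2006"], "fälle": ["1.503.451", "1.491.897"]}
)

# Remove the thousands separators, then convert both columns to integers.
demo_df["fälle"] = demo_df["fälle"].str.replace(".", "", regex=False).astype(int)
demo_df["jahr"] = demo_df["jahr"].astype(int)

print(demo_df.dtypes)
```

`regex=False` matters here: as a regular expression, `.` would match every character.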

## You do it 

Please scrape this table:  

http://www1.wdr.de/nachrichten/landespolitik/landtagswahl/wss-kinderarmut-barrierefrei-100.html