## The Web Scraping Recipe

Scraping information from the web takes three steps:
1. **MAPPING**: Finding the URLs of the pages that contain the information you want.
2. **DOWNLOAD**: Fetching the pages via HTTP.
3. **PARSE**: Extracting the information from the HTML.

You could also add steps such as `connection`, `storing`, and `logging`.
   

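The three steps can be sketched as three small functions. This is a minimal illustration, not the course's implementation: the function names (`map_urls`, `download`, `parse`, `scrape`), the `example.com` URL, and the `?page=` pattern are all assumptions, and a regular expression stands in for a real HTML parser so the sketch runs offline.

```python
import re


def map_urls(base_url, n_pages):
    """MAPPING: build the list of page URLs to visit (pages assumed 1-indexed)."""
    return [f"{base_url}?page={page}" for page in range(1, n_pages + 1)]


def download(url):
    """DOWNLOAD: fetch one page over HTTP and return its HTML as text."""
    import requests  # imported lazily so the other steps run without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def parse(html):
    """PARSE: pull the text of every <h2> heading out of the HTML."""
    return [m.strip() for m in re.findall(r"<h2[^>]*>(.*?)</h2>", html, flags=re.S)]


def scrape(urls, fetch=download):
    """Run download + parse for each mapped URL."""
    return {url: parse(fetch(url)) for url in urls}


# Offline demonstration: a stand-in for the download step (hypothetical content).
def fake_fetch(url):
    return f"<html><h2>Heading for {url[-1]}</h2></html>"


urls = map_urls("https://example.com/list", 2)
results = scrape(urls, fetch=fake_fetch)
print(results)
```

Passing the download step in as `fetch` keeps the mapping and parsing logic testable without touching the network; in real use you would call `scrape(urls)` and let it fall back to `download`.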

### Packages used
Today we will mainly build on the Python skills you have acquired so far; tomorrow we will look into more specialized packages.

* for connecting to the internet we use: **requests**
* for parsing: **beautifulsoup** and regular expressions (**re**)
* for automatic browsing / screen scraping: **selenium**
* for mitigating errors (e.g. pausing between requests) we use: **time**

We will write our scrapers in basic Python; for larger projects, consider looking into the package **scrapy**.
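As an illustration of how **time** can help mitigate errors, the sketch below retries a failed request after a pause. The helper name `fetch_with_retry` and its parameters are hypothetical, and the `flaky_fetch` function only simulates a server that fails twice; with **requests** you could pass something like `lambda u: requests.get(u, timeout=10)` as `fetch`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url); on failure, wait and try again up to `retries` times.

    `fetch` is any callable that returns the page or raises an exception.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait longer after each failure
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


page = fetch_with_retry(flaky_fetch, "https://example.com", retries=3, delay=0.01)
print(page, calls["count"])
```

A short `time.sleep` between successful requests serves a similar purpose: it keeps the scraper from overloading the server.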

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns
import yfinance as yf
import os

In [24]:
import requests
from bs4 import BeautifulSoup
import re
import selenium
import time
import pandas as pd

In [29]:
import requests
response = requests.get('https://isdsucph.github.io/isds2021/')

In [32]:
# NBA website
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html' # link to the website
dfs = pd.read_html(url) # parses all tables found on the page.
dfs[1]

Unnamed: 0,Western Conference,W,L,W/L%,GB,PS/G,PA/G,SRS
0,Houston Rockets*,65,17,0.793,—,112.4,103.9,8.21
1,Golden State Warriors*,58,24,0.707,7.0,113.5,107.5,5.79
2,Portland Trail Blazers*,49,33,0.598,16.0,105.6,103.0,2.6
3,Oklahoma City Thunder*,48,34,0.585,17.0,107.9,104.4,3.42
4,Utah Jazz*,48,34,0.585,17.0,104.1,99.8,4.47
5,New Orleans Pelicans*,48,34,0.585,17.0,111.7,110.4,1.48
6,San Antonio Spurs*,47,35,0.573,18.0,102.7,99.8,2.89
7,Minnesota Timberwolves*,47,35,0.573,18.0,109.5,107.3,2.35
8,Denver Nuggets,46,36,0.561,19.0,110.0,108.5,1.57
9,Los Angeles Clippers,42,40,0.512,23.0,109.0,109.0,0.15


In [41]:
url = 'https://www.basketball-reference.com/leagues/NBA_2018.html'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('h2')[0].text

'Conference Standings'
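The same `find_all` pattern can be tried without a network connection by parsing a small HTML string. The snippet below is made up for illustration (only the first heading appears in the output above); it shows that `find_all` returns a list of tags and that indexing an empty list would fail, so a guard is useful.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page.
html = """
<html><body>
  <h2>Conference Standings</h2>
  <h2>Division Standings</h2>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag, in document order.
h2_tags = soup.find_all("h2")
headings = [tag.get_text(strip=True) for tag in h2_tags]

# Guard against pages where the tag is missing before indexing with [0].
first_heading = headings[0] if headings else None

print(headings)
print(first_heading)
```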

In [126]:
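# Mapping exercise - job listing website
import math

url = 'https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=2&q=python'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

# element: <span class="d-md-none">162 job fundet</span>
label = soup.find('span', attrs={'class': 'd-md-none'}).text
jobs = int(re.search(r'\d+', label).group())  # first number in the label, whatever its length
print(jobs)

# 20 jobs per page, so round up to include the last, partially filled page
n_pages = math.ceil(jobs / 20)
for i in range(n_pages):
    print(i)
# not sure what this step does; it appears to build the URL of each result page
#for i in range(n_pages):
    #print('https://www.jobindex.dk/jobsoegning/storkoebenhavn?page=' + str(i) +'&q=python')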

162
0
1
2
3
4
5
6
7
8
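The mapping step here amounts to turning a result count into a list of page URLs. A minimal sketch follows, assuming 20 jobs per page (as in the comment above) and assuming that page numbering starts at 1; the helper names `count_jobs` and `page_urls` are hypothetical.

```python
import math
import re


def count_jobs(label):
    """Extract the first integer from a label such as '162 job fundet'."""
    match = re.search(r"\d+", label)
    if match is None:
        raise ValueError(f"no number found in {label!r}")
    return int(match.group())


def page_urls(base_url, query, n_jobs, per_page=20, first_page=1):
    """Build one URL per result page; ceil covers a partially filled last page."""
    n_pages = math.ceil(n_jobs / per_page)
    return [
        f"{base_url}?page={page}&q={query}"
        for page in range(first_page, first_page + n_pages)
    ]


n_jobs = count_jobs("162 job fundet")
urls = page_urls("https://www.jobindex.dk/jobsoegning/storkoebenhavn", "python", n_jobs)
print(n_jobs, len(urls))
print(urls[0])
print(urls[-1])
```

Using `math.ceil` rather than `round(...) + 1` avoids requesting an extra page when the count is an exact multiple of the page size (for example, 160 jobs fill exactly 8 pages).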


In [107]:
# keep logging as a habit
import scraping_class
logfile = 'log.csv'  # name your log file
connector = scraping_class.Connector(logfile)
print(connector)

<scraping_class.scraping_class.Connector object at 0x0000026E614382B0>

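To make the idea of logging concrete, here is a minimal sketch of a wrapper that records every request in a CSV file. It is not the `scraping_class.Connector` API, whose internals are not shown here: the class name `LoggingFetcher`, its fields, and the stand-in `fetch` function are all assumptions for illustration.

```python
import csv
import os
import tempfile
import time


class LoggingFetcher:
    """Hypothetical helper: wraps a fetch function and logs each call to a CSV file."""

    FIELDS = ["timestamp", "url", "ok", "seconds", "error"]

    def __init__(self, logfile, fetch):
        self.logfile = logfile
        self.fetch = fetch
        # Write the header once, when the log file is first created.
        if not os.path.exists(logfile):
            with open(logfile, "w", newline="", encoding="utf-8") as handle:
                csv.writer(handle).writerow(self.FIELDS)

    def get(self, url):
        start = time.time()
        result, error = None, ""
        try:
            result = self.fetch(url)
        except Exception as exc:  # record the failure instead of crashing the run
            error = repr(exc)
        row = [
            time.strftime("%Y-%m-%d %H:%M:%S"),
            url,
            error == "",
            round(time.time() - start, 3),
            error,
        ]
        with open(self.logfile, "a", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerow(row)
        return result


# Offline demonstration with a stand-in fetch function and a temporary log file.
log_path = os.path.join(tempfile.mkdtemp(), "log.csv")
fetcher = LoggingFetcher(log_path, fetch=lambda url: f"<html>{url}</html>")
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/b")

with open(log_path, newline="", encoding="utf-8") as handle:
    rows = list(csv.reader(handle))
print(rows[0])
print(len(rows))
```

A log like this makes it possible to see afterwards which pages were fetched, when, and which ones failed.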

In [145]:
# Exercise (could not solve)
url = 'https://job.jobnet.dk/CV/FindWork?Offset=0&SortValue=BestMatch'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')

jobs = soup.find_all('span', attrs={'class': 'sr-only'})  # the keyword is `attrs`, not `attr`
print(jobs)

[]
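When `find_all` returns an empty list, two common causes are a filter that does not match anything and content that is added by JavaScript after the page loads (which **requests** never sees). The filter syntax can be checked offline on a small HTML string. The snippet below is made up for illustration; it uses the two standard ways of filtering by class, the `attrs` dictionary and the `class_` keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, standing in for a downloaded page.
html = """
<div id="results">
  <span class="sr-only">Job title</span>
  <span class="sr-only badge">New</span>
  <span class="visible">Copenhagen</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways to filter by CSS class.
by_attrs = soup.find_all("span", attrs={"class": "sr-only"})
by_class = soup.find_all("span", class_="sr-only")

# A tag matches if "sr-only" is one of its classes, so the second span counts too.
texts = [span.get_text() for span in by_attrs]
print(texts)
print(len(by_class))
```

If the filter works on a snippet like this but still finds nothing on the real page, it is worth inspecting `page.text` directly to see whether the elements are present in the downloaded HTML at all; if they are not, a browser-driving tool such as **selenium** may be needed.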
