In [1]:
import pandas as pd

# Web Scraping: Getting Data from the Web


## What Is Web Scraping?

<img class="progressiveMedia-image js-progressiveMedia-image" data-src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png" src="https://cdn-images-1.medium.com/max/1600/1*GOyqaID2x1N5lD_rhTDKVQ.png">

### Why Web Scraping for Data Science?

## Network Complexity

## HTTP

## HTTP in Python: The Requests Library

[Requests: HTTP for Humans](https://2.python-requests.org/en/master/)

In [2]:
import requests

In [3]:
url = 'http://example.com/'

In [4]:
r = requests.get(url)

In [5]:
r

<Response [200]>

In [6]:
type(r)

requests.models.Response

In [7]:
r.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 50px;\n        background-color: #fff;\n        border-radius: 1em;\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        body {\n            background-color: #fff;\n        }\n        div {\n            width: auto;\n            margin: 0 auto;\n            border-radius: 0;\n            padding: 1em;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n

In [8]:
r.status_code

200

In [9]:
r.reason

'OK'

In [10]:
r.headers

{'Content-Encoding': 'gzip', 'Accept-Ranges': 'bytes', 'Cache-Control': 'max-age=604800', 'Content-Type': 'text/html; charset=UTF-8', 'Date': 'Mon, 27 May 2019 14:24:59 GMT', 'Etag': '"1541025663"', 'Expires': 'Mon, 03 Jun 2019 14:24:59 GMT', 'Last-Modified': 'Fri, 09 Aug 2013 23:54:35 GMT', 'Server': 'ECS (dcb/7F37)', 'Vary': 'Accept-Encoding', 'X-Cache': 'HIT', 'Content-Length': '606'}

In [11]:
r.request

<PreparedRequest [GET]>

In [12]:
r.request.headers

{'User-Agent': 'python-requests/2.21.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
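The `PreparedRequest` shown above can also be built without sending anything, which is a convenient way to see how query parameters and headers end up in the outgoing request. The following is a minimal sketch; the `season` parameter and the `User-Agent` string are made up for illustration.

```python
import requests

# Build a request object without sending it over the network.
req = requests.Request(
    'GET',
    'http://example.com/',
    params={'season': 1},                      # hypothetical query parameter
    headers={'User-Agent': 'my-scraper/0.1'},  # made-up identifier for illustration
)
prepared = req.prepare()

print(prepared.method)                 # GET
print(prepared.url)                    # http://example.com/?season=1
print(prepared.headers['User-Agent'])  # my-scraper/0.1
```

To actually send such a request, the same `params` and `headers` arguments can be passed to `requests.get`.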

## HTML and CSS

<img class="progressiveMedia-image js-progressiveMedia-image" data-src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png" src="https://cdn-images-1.medium.com/max/1600/1*x9mxFBXnLU05iPy19dGj7g.png">

### Hypertext Markup Language: HTML

Page link: https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687

In [13]:
import requests

In [14]:
url_got = 'https://en.wikipedia.org/w/index.php?title=List_of_Game_of_Thrones_episodes&oldid=802553687'

In [15]:
r = requests.get(url_got)

In [16]:
r.text[:1000]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of Game of Thrones episodes - Wikipedia</title>\n<script>document.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":898999050,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Articles containing potentially dated statements from August 2017","All articles containing potentially dated statements","Official website not in Wikidata","Featured lists","Game of Thrones episodes","Lists of American drama television series episodes","Lists of fantasy television series episodes"],"wgBreakFrames":!1,"wgPageContentLanguage"

Some commonly used HTML tags:

- `<p>...</p>` to enclose a paragraph;
- `<br>` to set a line break;
- `<table>...</table>` to start a table block; inside it, `<tr>...</tr>` is used for the rows and `<td>...</td>` for the cells;
- `<img>` for images;
- `<h1>...</h1>` to `<h6>...</h6>` for headers;
- `<div>...</div>` to indicate a “division” in an HTML document, basically used to group a set of elements;
- `<a>...</a>` for hyperlinks;
- `<ul>...</ul>` and `<ol>...</ol>` for unordered and ordered lists respectively; inside these, `<li>...</li>` is used for each list item.
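To see these tags working together, here is a small hand-written page (invented for illustration, not taken from Wikipedia) and a hypothetical `TagCounter` helper built on Python's standard-library `html.parser` module. It is only a sketch of how a parser walks through the tags.

```python
from collections import Counter
from html.parser import HTMLParser

# A made-up document that uses the tags listed above.
SAMPLE_HTML = """
<h1>Episodes</h1>
<p>First line<br>second line</p>
<div>
  <table>
    <tr><td>1</td><td>Winter Is Coming</td></tr>
    <tr><td>2</td><td>The Kingsroad</td></tr>
  </table>
  <ul>
    <li><a href="/wiki/IGN">IGN</a></li>
  </ul>
</div>
"""


class TagCounter(HTMLParser):
    """Count how often each opening tag appears."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, including <br>.
        self.counts[tag] += 1


counter = TagCounter()
counter.feed(SAMPLE_HTML)
print(counter.counts['tr'], counter.counts['td'], counter.counts['br'])
```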

## Using Your Browser as a Development Tool

## The Beautiful Soup Library

In [17]:
html_content = r.text

> **[beautifulsoup4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**: Beautiful Soup tries to organize complexity: it helps to parse, structure and organize the oftentimes very messy web by fixing bad HTML and presenting us with an easy-to-work-with Python structure.

In [18]:
from bs4 import BeautifulSoup

In [19]:
html_soup = BeautifulSoup(html_content, 'html.parser')

Beautiful Soup relies on an underlying parser to do so, and several are available in Python:
- `html.parser`: a built-in Python parser that is decent (especially when using recent versions of Python 3) and requires no extra installation.
- `lxml`: very fast, but requires an extra installation.
- `html5lib`: aims to parse web pages in exactly the same way as a web browser does, but is a bit slower and also requires an extra installation.
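As a small illustration of "fixing bad HTML", the sketch below feeds a deliberately broken fragment (made up for this example) to Beautiful Soup with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: neither <p> nor <b> is ever closed.
broken_html = '<p>Unclosed paragraph with <b>bold text'

soup = BeautifulSoup(broken_html, 'html.parser')

# Beautiful Soup closes the open tags when it reaches the end of the input.
print(soup)
print(soup.find('b').get_text())
```

Other parsers may repair the same fragment differently, so results can vary when the parser is switched.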

The two methods used most often to search the parsed document are:

- `find(name, attrs, recursive, string, **keywords)`
- `find_all(name, attrs, recursive, string, limit, **keywords)`
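The arguments in these signatures are easiest to understand on a tiny document. The fragment below is made up for illustration (loosely modeled on Wikipedia headings); it is a sketch of typical usage rather than a full tour of the API.

```python
from bs4 import BeautifulSoup

# A made-up fragment, loosely modeled on the headings of a Wikipedia page.
html = """
<div id="content">
  <h1 id="firstHeading" class="firstHeading">List of episodes</h1>
  <h2><span class="mw-headline">Series overview</span></h2>
  <h2><span class="mw-headline">Episodes</span></h2>
  <h2>Navigation menu</h2>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# find returns the first match (or None); find_all returns a list of matches.
print(soup.find('h1').get_text())
print(len(soup.find_all('h2')))

# attrs and keyword arguments filter on attributes; class_ avoids the reserved word.
print(soup.find(attrs={'id': 'firstHeading'}).name)
print(len(soup.find_all('span', class_='mw-headline')))

# limit caps the number of results; string matches on the element's text.
print(len(soup.find_all('h2', limit=2)))
print(soup.find('h2', string='Navigation menu'))

# recursive=False only looks at direct children, so the nested spans are skipped.
content = soup.find('div', id='content')
print(len(content.find_all('span', recursive=False)))
```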

In [20]:
html_soup.find('h1')

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [21]:
html_soup.find(id='firstHeading')

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [23]:
all_h2=html_soup.find_all('h2')
all_h2

[<h2>Contents</h2>,
 <h2><span class="mw-headline" id="Series_overview">Series overview</span></h2>,
 <h2><span class="mw-headline" id="Episodes">Episodes</span></h2>,
 <h2><span class="mw-headline" id="Home_media_releases">Home media releases</span></h2>,
 <h2><span class="mw-headline" id="Ratings">Ratings</span></h2>,
 <h2><span class="mw-headline" id="References">References</span></h2>,
 <h2><span class="mw-headline" id="External_links">External links</span></h2>,
 <h2>Navigation menu</h2>]

In [25]:
first_h1=html_soup.find('h1')
first_h1

<h1 class="firstHeading" id="firstHeading" lang="en">List of <i>Game of Thrones</i> episodes</h1>

In [26]:
first_h1.name

'h1'

In [27]:
first_h1.contents

['List of ', <i>Game of Thrones</i>, ' episodes']

In [28]:
first_h1.text

'List of Game of Thrones episodes'

In [29]:
first_h1.attrs

{'id': 'firstHeading', 'class': ['firstHeading'], 'lang': 'en'}

In [32]:
first_h1.get_text('876')

'List of 876Game of Thrones876 episodes'

In [33]:
first_h1['id']

'firstHeading'

In [38]:
cite=html_soup.find_all('cite',limit=4,class_='citation')
cite

[<cite class="citation web">Fowler, Matt (April 8, 2011). <a class="external text" href="http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">"Game of Thrones: "Winter is Coming" Review"</a>. <a href="/wiki/IGN" title="IGN">IGN</a>. <a class="external text" href="https://web.archive.org/web/20120817073932/http://tv.ign.com/articles/116/1160215p1.html" rel="nofollow">Archived</a> from the original on August 17, 2012<span class="reference-accessdate">. Retrieved <span class="nowrap">September 22,</span> 2016</span>.</cite>,
 <cite class="citation news">Fleming, Michael (January 16, 2007). <a class="external text" href="http://www.variety.com/article/VR1117957532.html?categoryid=14&amp;cs=1" rel="nofollow">"HBO turns <i>Fire</i> into fantasy series"</a>. <i><a href="/wiki/Variety_(magazine)" title="Variety (magazine)">Variety</a></i>. <a class="external text" href="https://web.archive.org/web/20120516224747/http://www.variety.com/article/VR1117957532?refCatId=14" rel="nofollow">A

In [39]:
cite[0].text

'Fowler, Matt (April 8, 2011). "Game of Thrones: "Winter is Coming" Review". IGN. Archived from the original on August 17, 2012. Retrieved September 22, 2016.'

In [43]:
link=cite[0].find('a').get('href')
link

'http://tv.ign.com/articles/116/1160215p1.html'

In [45]:
for c in cite:
    print(c.find('a').get('href'))

http://tv.ign.com/articles/116/1160215p1.html
http://www.variety.com/article/VR1117957532.html?categoryid=14&cs=1
http://www.emmys.com/shows/game-thrones
https://web.archive.org/web/20120401123724/http://travel.usatoday.com/destinations/story/2012-04-01/Where-the-HBO-hit-Game-of-Thrones-was-filmed/53876876/1


In [46]:
html_soup.text[:500]

'\n\n\n\nList of Game of Thrones episodes - Wikipedia\ndocument.documentElement.className=document.documentElement.className.replace(/(^|\\s)client-nojs(\\s|$)/,"$1client-js$2");RLCONF={"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_Game_of_Thrones_episodes","wgTitle":"List of Game of Thrones episodes","wgCurRevisionId":898999050,"wgRevisionId":802553687,"wgArticleId":31120069,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"w'

In [53]:
episodes=[]

In [54]:
ep_tables=html_soup.find_all('table',class_='wikiepisodetable')

In [55]:
len(ep_tables)

7

In [56]:
for table in ep_tables:
    rows=table.find_all('tr')
    # The first row of each table holds the column names.
    headers=[header.text for header in table.find('tr').find_all('th')]
    for row in rows[1:]:
        # Cells may be <th> or <td>, so collect both.
        values=[col.text for col in row.find_all(['th','td'])]
        if values:
            # zip pairs each header with its value and stops at the shorter list.
            episodes.append(dict(zip(headers,values)))

In [57]:
episodes[0]

{'No.overall': '1',
 'No. inseason': '1',
 'Title': '"Winter Is Coming"',
 'Directed by': 'Tim Van Patten',
 'Written by': 'David Benioff & D. B. Weiss',
 'Original air date': 'April\xa017,\xa02011\xa0(2011-04-17)',
 'U.S. viewers(millions)': '2.22[20]'}

In [58]:
pd.DataFrame(episodes)

| | Directed by | No. inseason | No.overall | Original air date | Title | U.S. viewers(millions) | Written by |
|---|---|---|---|---|---|---|---|
| 0 | Tim Van Patten | 1 | 1 | April 17, 2011 (2011-04-17) | "Winter Is Coming" | 2.22[20] | David Benioff & D. B. Weiss |
| 1 | Tim Van Patten | 2 | 2 | April 24, 2011 (2011-04-24) | "The Kingsroad" | 2.20[21] | David Benioff & D. B. Weiss |
| 2 | Brian Kirk | 3 | 3 | May 1, 2011 (2011-05-01) | "Lord Snow" | 2.44[22] | David Benioff & D. B. Weiss |
| 3 | Brian Kirk | 4 | 4 | May 8, 2011 (2011-05-08) | "Cripples, Bastards, and Broken Things" | 2.45[23] | Bryan Cogman |
| 4 | Brian Kirk | 5 | 5 | May 15, 2011 (2011-05-15) | "The Wolf and the Lion" | 2.58[24] | David Benioff & D. B. Weiss |
| 5 | Daniel Minahan | 6 | 6 | May 22, 2011 (2011-05-22) | "A Golden Crown" | 2.44[25] | Story by : David Benioff & D. B. Weiss Telepla... |
| 6 | Daniel Minahan | 7 | 7 | May 29, 2011 (2011-05-29) | "You Win or You Die" | 2.40[26] | David Benioff & D. B. Weiss |
| 7 | Daniel Minahan | 8 | 8 | June 5, 2011 (2011-06-05) | "The Pointy End" | 2.72[27] | George R. R. Martin |
| 8 | Alan Taylor | 9 | 9 | June 12, 2011 (2011-06-12) | "Baelor" | 2.66[28] | David Benioff & D. B. Weiss |
| 9 | Alan Taylor | 10 | 10 | June 19, 2011 (2011-06-19) | "Fire and Blood" | 3.04[29] | David Benioff & D. B. Weiss |
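The scraped values still carry footnote markers such as `[20]`, non-breaking spaces (`\xa0`), and quote characters around titles. A minimal cleaning sketch follows; the helper names `clean_viewers`, `clean_air_date`, and `clean_title` are hypothetical, and the sample row copies values from `episodes[0]` above.

```python
import re


def clean_viewers(raw):
    """Turn a value such as '2.22[20]' into the float 2.22."""
    without_refs = re.sub(r'\[\d+\]', '', raw)  # drop footnote markers like [20]
    return float(without_refs)


def clean_air_date(raw):
    """Extract the ISO date in parentheses from the air-date cell."""
    match = re.search(r'\((\d{4}-\d{2}-\d{2})\)', raw)
    return match.group(1) if match else None


def clean_title(raw):
    """Remove the surrounding double quotes from an episode title."""
    return raw.strip('"')


row = {
    'Title': '"Winter Is Coming"',
    'Original air date': 'April\xa017,\xa02011\xa0(2011-04-17)',
    'U.S. viewers(millions)': '2.22[20]',
}
print(clean_title(row['Title']))
print(clean_air_date(row['Original air date']))
print(clean_viewers(row['U.S. viewers(millions)']))
```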


## Web APIs

### Example of Using an API

https://github.com/HackerNews/API

In [78]:
articles=[]

In [79]:
url = 'https://hacker-news.firebaseio.com/v0'

In [80]:
top_stories=requests.get(url+'/topstories.json')
top_stories.text[:50]

'[20022534,20019647,20019874,20019206,20022186,2001'

In [81]:
top_stories=top_stories.json()

In [82]:
top_stories[:5]

[20022534, 20019647, 20019874, 20019206, 20022186]

In [83]:
for story_id in top_stories[:5]:
    story_url=url+f'/item/{story_id}.json'
    print(story_url)
    r=requests.get(story_url)
    story_dict=r.json()
    articles.append(story_dict)

https://hacker-news.firebaseio.com/v0/item/20022534.json
https://hacker-news.firebaseio.com/v0/item/20019647.json
https://hacker-news.firebaseio.com/v0/item/20019874.json
https://hacker-news.firebaseio.com/v0/item/20019206.json
https://hacker-news.firebaseio.com/v0/item/20022186.json


In [84]:
articles[0]

{'by': 'tlb',
 'descendants': 3,
 'id': 20022534,
 'kids': [20022753, 20022752, 20022751],
 'score': 14,
 'time': 1558968224,
 'title': 'Power Is Overrated',
 'type': 'story',
 'url': 'https://www.nytimes.com/2019/05/21/opinion/power-is-overrated.html'}
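The fields of a story need a little interpretation: `time` is a Unix timestamp, and `kids` lists the ids of the story's comments. The sketch below works offline on a trimmed copy of the dictionary shown above; it only builds the comment URLs and does not fetch them.

```python
from datetime import datetime, timezone

# The story dictionary shown above, trimmed to the fields used here.
story = {
    'id': 20022534,
    'kids': [20022753, 20022752, 20022751],
    'time': 1558968224,
    'title': 'Power Is Overrated',
}

# 'time' is a Unix timestamp (seconds since 1970-01-01 UTC).
posted_at = datetime.fromtimestamp(story['time'], tz=timezone.utc)
print(posted_at.isoformat())

# Comment ids in 'kids' can be requested through the same /item/<id>.json endpoint.
base_url = 'https://hacker-news.firebaseio.com/v0'
comment_urls = [f'{base_url}/item/{kid}.json' for kid in story['kids']]
print(comment_urls[0])
```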

### Importing Data from the Web with pandas

##### [Open Data Slovenia (OPSI)](https://podatki.gov.si/)


On the OPSI portal you will find everything from data and tools to useful resources that you can use to develop web and mobile applications, design your own infographics, and more.

Example: https://support.spatialkey.com/spatialkey-sample-csv-data/

In [85]:
data=pd.read_csv('http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv')

In [86]:
data.head(5)

| | street | city | zip | state | beds | baths | sq__ft | type | sale_date | price | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3526 HIGH ST | SACRAMENTO | 95838 | CA | 2 | 1 | 836 | Residential | Wed May 21 00:00:00 EDT 2008 | 59222 | 38.631913 | -121.434879 |
| 1 | 51 OMAHA CT | SACRAMENTO | 95823 | CA | 3 | 1 | 1167 | Residential | Wed May 21 00:00:00 EDT 2008 | 68212 | 38.478902 | -121.431028 |
| 2 | 2796 BRANCH ST | SACRAMENTO | 95815 | CA | 2 | 1 | 796 | Residential | Wed May 21 00:00:00 EDT 2008 | 68880 | 38.618305 | -121.443839 |
| 3 | 2805 JANETTE WAY | SACRAMENTO | 95815 | CA | 2 | 1 | 852 | Residential | Wed May 21 00:00:00 EDT 2008 | 69307 | 38.616835 | -121.439146 |
| 4 | 6001 MCMAHON DR | SACRAMENTO | 95824 | CA | 2 | 1 | 797 | Residential | Wed May 21 00:00:00 EDT 2008 | 81900 | 38.51947 | -121.435768 |


## Web Scraping Using pandas

> Web page: https://www.fdic.gov/bank/individual/failed/banklist.html

`pandas.read_html`: reads HTML tables into a list of DataFrame objects. See the [documentation](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.read_html.html).



In [87]:
tables=pd.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')

In [88]:
data1=tables[0]

In [89]:
data1.head(2)

| | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|---|---|---|---|---|---|---|
| 0 | Washington Federal Bank for Savings | Chicago | IL | 30570 | Royal Savings Bank | December 15, 2017 | February 1, 2019 |
| 1 | The Farmers and Merchants State Bank of Argonia | Argonia | KS | 17719 | Conway Bank | October 13, 2017 | February 21, 2018 |
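In the output above the date columns appear as plain text. A minimal sketch of converting them to real dates follows; the two rows are re-typed by hand so the example runs without downloading the page, and the `banks` name is only illustrative.

```python
import pandas as pd

# The two rows shown above, re-typed by hand so the example runs offline.
banks = pd.DataFrame({
    'Bank Name': ['Washington Federal Bank for Savings',
                  'The Farmers and Merchants State Bank of Argonia'],
    'Closing Date': ['December 15, 2017', 'October 13, 2017'],
})

# Parse the text dates into real datetime values.
banks['Closing Date'] = pd.to_datetime(banks['Closing Date'], format='%B %d, %Y')

# Once parsed, the column supports date operations such as sorting by time.
print(banks['Closing Date'].dt.year.tolist())
print(banks.sort_values('Closing Date')['Bank Name'].iloc[0])
```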


## Examples

### Scraping and Visualizing IMDB Ratings

Page: http://www.imdb.com/title/tt0944947/episodes

In [90]:
import requests
from bs4 import BeautifulSoup
url = 'http://www.imdb.com/title/tt0944947/episodes'

In [91]:
episodes=[]
rankings=[]

In [94]:
for season in range(1,9):
    # The season number is sent as a query parameter (...?season=1)
    r=requests.get(url,params={'season':season})
    if r.status_code==200:
        soup=BeautifulSoup(r.text,'html.parser')
        listing=soup.find('div',class_='eplist')
        # Each direct child <div> of the listing is treated as one episode
        for epnr,div in enumerate(listing.find_all('div',recursive=False)):
            episode=f'{season}.{epnr+1}'
            rating_el=div.find(class_='ipl-rating-star__rating')
            print(episode,rating_el)
            print('____________________')
            # Skip episodes without a rating element instead of failing on None
            if rating_el is None:
                continue
            rating=float(rating_el.get_text(strip=True))
            episodes.append(episode)
            rankings.append(rating)

1.1 <span class="ipl-rating-star__rating">9.1</span>
____________________
1.2 <span class="ipl-rating-star__rating">8.8</span>
____________________
1.3 <span class="ipl-rating-star__rating">8.7</span>
____________________
1.4 <span class="ipl-rating-star__rating">8.8</span>
____________________
1.5 <span class="ipl-rating-star__rating">9.1</span>
____________________
1.6 <span class="ipl-rating-star__rating">9.2</span>
____________________
1.7 <span class="ipl-rating-star__rating">9.3</span>
____________________
1.8 <span class="ipl-rating-star__rating">9.1</span>
____________________
1.9 <span class="ipl-rating-star__rating">9.6</span>
____________________
1.10 <span class="ipl-rating-star__rating">9.5</span>
____________________
2.1 <span class="ipl-rating-star__rating">8.9</span>
____________________
2.2 <span class="ipl-rating-star__rating">8.6</span>
____________________
2.3 <span class="ipl-rating-star__rating">8.9</span>
____________________
2.4 <span class="ipl-rating-star__rat

In [95]:
rankings[:20]


[9.1,
 8.8,
 8.7,
 8.8,
 9.1,
 9.2,
 9.3,
 9.1,
 9.6,
 9.5,
 8.9,
 8.6,
 8.9,
 8.9,
 8.9,
 9.1,
 9.0,
 8.9,
 9.7,
 9.5]
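The ratings are easier to compare once they are grouped by season. The sketch below uses only the 20 values printed above and assumes they cover ten episodes in each of the first two seasons (the truncated output does not show the later labels); the names `ratings`, `labels`, and `by_season` are illustrative.

```python
from collections import defaultdict
from statistics import mean

# First 20 ratings from the output above; the labels assume ten episodes
# in each of the first two seasons.
ratings = [9.1, 8.8, 8.7, 8.8, 9.1, 9.2, 9.3, 9.1, 9.6, 9.5,
           8.9, 8.6, 8.9, 8.9, 8.9, 9.1, 9.0, 8.9, 9.7, 9.5]
labels = [f'{season}.{ep}' for season in (1, 2) for ep in range(1, 11)]

# Group ratings by the season part of each 'season.episode' label.
by_season = defaultdict(list)
for label, rating in zip(labels, ratings):
    season = int(label.split('.')[0])
    by_season[season].append(rating)

for season, values in sorted(by_season.items()):
    print(season, round(mean(values), 2))
```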

### Scraping Fast Track data

Page: https://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/

In [96]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

In [97]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [98]:
page=requests.get(urlpage)

In [100]:
soup = BeautifulSoup(page.text, 'html.parser')

In [101]:
table=soup.find('table',class_='tableSorter')

In [102]:
len(table)

3

In [103]:
results=table.find_all('tr')
len(results)

101

In [104]:
rows=[]

for row in results[0].find_all('th'):
    rows.append(row.contents[0])

In [107]:
rows

['Rank',
 'Company',
 'Location',
 'Year end',
 'Annual sales rise over 3 years',
 'Latest sales £000s',
 'Staff',
 'Comment']

In [108]:
rows=[]
rows.append(['Rank','Company Name','Webpage','Description',
             'Location','Year End','Annual Sales Rise Over 3 Years',
             'Sales £000s','Staff','Comments'])

In [109]:
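# Loop over all rows, skipping the header row (which has no <td> cells).
# After the loop, `data` holds the cells of the last row, explored in the next cells.
for result in results:
    data=result.find_all('td')
    if len(data)==0:
        continue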

In [110]:
# write columns to variables
rank = data[0].getText()
company = data[1].getText()
location = data[2].getText()
yearend = data[3].getText()
salesrise = data[4].getText()
sales = data[5].getText()
staff = data[6].getText()
comments = data[7].getText()

In [111]:
rank

'100'

In [116]:
companyname=data[1].find('span',class_='company-name').text
companyname

'Brompton Technology'

In [118]:
description=company.replace(companyname,'')
description

'Video technology provider'

In [119]:
sales.strip('*')

'5,250'

In [120]:
url=data[1].find('a').get('href')

In [121]:
url

'https://www.fasttrack.co.uk/company_profile/brompton-technology/'

In [124]:
page=requests.get(url)
soup=BeautifulSoup(page.text,'html.parser')

In [125]:
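try:
    # Look for a link in the last row of the first table on the profile page
    tableRow=soup.find('table').find_all('tr')[-1]
    webpage=tableRow.find('a').get('href')
except (AttributeError, IndexError):
    # No table, no rows, or no link on the profile page
    webpage=None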

In [126]:
webpage

'http://www.bromptontech.com'

#### The Complete Program

In [135]:
# import libraries
from bs4 import BeautifulSoup
import requests
import csv

In [136]:
# specify the url
urlpage =  'http://www.fasttrack.co.uk/league-tables/tech-track-100/league-table/'

In [137]:
page=requests.get(urlpage)

In [138]:
soup = BeautifulSoup(page.text, 'html.parser')

In [139]:
table=soup.find('table',class_='tableSorter')

In [140]:
results=table.find_all('tr')
len(results)

101

In [141]:
rows=[]
rows.append(['Rank','Company Name','Webpage','Description',
             'Location','Year End','Annual Sales Rise Over 3 Years',
             'Sales £000s','Staff','Comments'])

In [143]:
for result in results:
    data=result.find_all('td')
    # Skip the header row, which has no <td> cells
    if len(data)==0:
        continue
    # write columns to variables
    rank = data[0].getText()
    company = data[1].getText()
    location = data[2].getText()
    yearend = data[3].getText()
    salesrise = data[4].getText()
    sales = data[5].getText()
    staff = data[6].getText()
    comments = data[7].getText()

    # Split the company cell into name and description
    companyname=data[1].find('span',class_='company-name').text
    description=company.replace(companyname,'')
    # Remove footnote markers and thousands separators from the sales figure
    sales=sales.strip('*').strip('†').replace(',','')
    # Follow the link to the company profile page to look up its website
    profile_url=data[1].find('a').get('href')
    profile_page=requests.get(profile_url)
    profile_soup=BeautifulSoup(profile_page.text,'html.parser')
    try:
        tableRow=profile_soup.find('table').find_all('tr')[-1]
        webpage=tableRow.find('a').get('href')
    except (AttributeError, IndexError):
        # No table, no rows, or no link on the profile page
        webpage=None
    rows.append([rank,companyname,webpage,description,location,yearend,salesrise,sales,staff,comments])

In [144]:
with open('OUT_companies.csv','w',newline='',encoding='utf-8') as f_output:
    csv_output=csv.writer(f_output)
    csv_output.writerows(rows)
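A quick way to check the result is to read the file back. The sketch below is self-contained: it writes a shortened stand-in for `rows` (fewer columns, with only the values for rank 100 that appear in this notebook) to a temporary directory and then loads it with `csv.DictReader`.

```python
import csv
import os
import tempfile

# Shortened stand-in for `rows`: fewer columns, and only the rank-100 values
# that appear earlier in this notebook.
rows = [
    ['Rank', 'Company Name', 'Webpage', 'Description', 'Sales £000s'],
    ['100', 'Brompton Technology', 'http://www.bromptontech.com',
     'Video technology provider', '5250'],
]

path = os.path.join(tempfile.mkdtemp(), 'OUT_companies.csv')

# Write with an explicit encoding so the '£' in the header survives.
with open(path, 'w', newline='', encoding='utf-8') as f_output:
    csv.writer(f_output).writerows(rows)

# Read the file back; DictReader uses the first row as field names.
with open(path, newline='', encoding='utf-8') as f_input:
    companies = list(csv.DictReader(f_input))

print(companies[0]['Company Name'], int(companies[0]['Sales £000s']))
```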