# Introduction to Web Scraping

To begin, we will examine the reddit page dealing with Machine Learning.  Our goal is to scrape the basic information for posts.

![](images/reddit.png)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes'

In [6]:
response = requests.get(url)

In [7]:
response

<Response [200]>

In [9]:
response.text[:300]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of 21 Jump Street episodes - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<sc'

In [11]:
soup = BeautifulSoup(response.text, 'html.parser')

#Pass response text into Beautiful soup

In [12]:
soup.find('h2')

#find anything that was a h2 header

<h2>Contents</h2>

In [15]:
all_h2 = soup.find_all('h2')
for header in all_h2[:5]:
    print(header.text)

Contents
Series Overview[edit]
Season 1 (1987)[edit]
Season 2 (1987-88)[edit]
Season 3 (1988-89)[edit]


In [16]:
soup.find('h2').text

'Contents'

In [None]:
soup.find('p')

In [None]:
len(soup.find_all('p'))

In [None]:
len(soup.find_all('h2'))

In [None]:
soup.find('a', {'data-click-id': 'body'})['href']

In [None]:
links = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)

In [None]:
links

In [None]:
links = []
titles = []
bodys = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    title = soup2.find('h2')
    body = soup2.find_all('p')
    titles.append(title)
    bodys.append(body)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [None]:
df.head()

### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a dataframe that contains the information displayed on the Wikipedia page "List of 2018 Albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. What label is putting out the most music?  Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})

In [2]:
url = 'https://www.basketball-reference.com/players/a/anthoca01.html'

In [3]:
response = requests.get(url)

In [4]:
response

<Response [200]>

In [6]:
response.text[:500]

'\n<!DOCTYPE html>\n<html data-version="klecko-" data-root="/home/bbr/build" itemscope itemtype="https://schema.org/WebSite" lang="en" class="no-js" >\n<head>\n    <meta charset="utf-8">\n    <meta http-equiv="x-ua-compatible" content="ie=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=2.0" />\n    <link rel="dns-prefetch" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242" />\n\n<!-- no:cookie fast load the css.           -->\n<link rel="preconnect" h'

In [7]:
soup = BeautifulSoup(response.text, 'html.parser')

In [59]:
print(soup.prettify())

<!DOCTYPE html>
<html class="no-js" data-root="/home/bbr/build" data-version="klecko-" itemscope="" itemtype="https://schema.org/WebSite" lang="en">
 <head>
  <meta charset="utf-8"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
   <link href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242" rel="dns-prefetch"/>
   <!-- no:cookie fast load the css.           -->
   <link crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net" rel="preconnect"/>
   <link crossorigin="" href="https://d2cwpp38twqe55.cloudfront.net" rel="preconnect"/>
   <style>
   </style>
   <link as="style" crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242/css/bbr/sr-min.css" onload="this.rel='stylesheet'" rel="preload"/>
   <noscript>
    <link href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242/css/bbr/sr-min.css" rel="stylesheet" type="text/css"/>
   </noscript>
   <link as="scri

TypeError: 'NoneType' object is not subscriptable

In [75]:
body = list(soup.children)[3]

In [78]:
list(body.children)[1]

<head>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1.0, maximum-scale=2.0" name="viewport">
<link href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242" rel="dns-prefetch"/>
<!-- no:cookie fast load the css.           -->
<link crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net" rel="preconnect"/>
<link crossorigin="" href="https://d2cwpp38twqe55.cloudfront.net" rel="preconnect"/>
<link as="style" crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242/css/bbr/sr-min.css" onload="this.rel='stylesheet'" rel="preload"/>
<noscript><link href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242/css/bbr/sr-min.css" rel="stylesheet" type="text/css"/></noscript>
<link as="script" crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242/js/bbr/sr-min.js" rel="preload"/>
<link as="fetch" crossorigin="" href="https://d2p3bygnnzw9w3.cloudfront.net/req/201807242/icons/sr

In [79]:
body.get_text()



In [84]:
soup.find_all('tbody')[0].get_text()

'\n2003-0419DENNBASF828236.57.617.9.4260.82.6.3226.815.3.444.4495.06.4.7772.23.86.12.81.20.53.02.721.0\n2004-0520DENNBASF757534.87.116.4.4310.62.1.2666.514.3.455.4486.17.6.7961.93.85.72.60.90.43.03.120.8\n2005-0621DENNBASF808036.89.519.7.4810.51.9.2439.017.8.506.4937.28.9.8081.53.44.92.71.10.52.72.926.5\n2006-0722DENNBASF656538.210.622.4.4760.62.3.26810.020.1.499.4897.18.7.8082.23.86.03.81.20.43.63.128.9\n2007-0823DENNBASF777736.49.519.2.4920.82.1.3548.717.1.509.5116.07.7.7862.35.17.43.41.30.53.33.325.7\n2008-0924DENNBASF666634.58.118.3.4431.02.6.3717.215.7.455.4695.67.1.7931.65.26.83.41.10.43.03.022.8\n2009-1025DENNBASF696938.210.021.8.4580.92.7.3169.119.1.478.4787.48.9.8302.24.46.63.21.30.43.03.328.2\n2010-1126TOTNBASF777735.78.919.5.4551.23.3.3787.616.3.470.4876.67.9.8381.55.87.32.90.90.62.72.925.6\n2010-1126DENNBASF505035.58.719.3.4520.82.5.3337.916.8.470.4746.98.3.8231.56.17.62.80.90.62.82.725.2\n2010-1126NYKNBASF272736.29.119.9.4612.04.6.4247.215.2.472.5106.17.0.8721.55.26.73.00.

In [63]:
[type(item) for item in list(soup.children)]

[bs4.element.NavigableString,
 bs4.element.Doctype,
 bs4.element.NavigableString,
 bs4.element.Tag,
 bs4.element.NavigableString]

In [21]:
soup.find_all('h2')[:10]

[<h2>Per Game</h2>,
 <h2>View on stats.nba.com</h2>,
 <h2>Player News</h2>,
 <h2>Totals</h2>,
 <h2>Per 36 Minutes</h2>,
 <h2>Per 100 Poss</h2>,
 <h2>Advanced</h2>,
 <h2>Shooting</h2>,
 <h2>Play-by-Play</h2>,
 <h2>Playoffs Per Game</h2>]

In [28]:
len(soup.find_all('h2'))

28

In [33]:
all_h2 = soup.find_all('h2')
for header in all_h2[:10]:
    print(header.text)

Per Game
View on stats.nba.com
Player News
Totals
Per 36 Minutes
Per 100 Poss
Advanced
Shooting
Play-by-Play
Playoffs Per Game


In [24]:
soup.find_all('li')[:10]

[<li><a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference</a></li>,
 <li><a href="https://www.baseball-reference.com/">Baseball</a></li>,
 <li><a href="https://www.pro-football-reference.com/">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>,
 <li class="current"><a href="https://www.basketball-reference.com/">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>,
 <li><a href="https://www.hockey-reference.com/">Hockey</a></li>,
 <li><a href="https://fbref.com/">Soccer</a></li>,
 <li><a href="https://www.sports-reference.com/blog/">Blog</a></li>,
 <li><a href="https://stathead.com/">Stathead</a></li>,
 <li><a href="https://widgets.sports-reference.com/">Widgets</a></li>,
 <li><a href="https://www.sports-reference.com/feedback/">Questions or Comments?</a></li>]

In [30]:
soup.findAll('li')[:10]

[<li><a href="https://www.sports-reference.com/"><svg height="15px" width="20px"><use xlink:href="#ic-sr-pennant"></use></svg> Sports Reference</a></li>,
 <li><a href="https://www.baseball-reference.com/">Baseball</a></li>,
 <li><a href="https://www.pro-football-reference.com/">Football</a> <a href="https://www.sports-reference.com/cfb/">(college)</a></li>,
 <li class="current"><a href="https://www.basketball-reference.com/">Basketball</a> <a href="https://www.sports-reference.com/cbb/">(college)</a></li>,
 <li><a href="https://www.hockey-reference.com/">Hockey</a></li>,
 <li><a href="https://fbref.com/">Soccer</a></li>,
 <li><a href="https://www.sports-reference.com/blog/">Blog</a></li>,
 <li><a href="https://stathead.com/">Stathead</a></li>,
 <li><a href="https://widgets.sports-reference.com/">Widgets</a></li>,
 <li><a href="https://www.sports-reference.com/feedback/">Questions or Comments?</a></li>]

In [39]:
all_li = soup.find_all('li')
for element in all_li[:10]:
    print(element.text)

 Sports Reference
Baseball
Football (college)
Basketball (college)
Hockey
Soccer
Blog
Stathead
Widgets
Questions or Comments?


In [36]:
soup.find_all('tr')[10:]

[<tr class="light_text partial_table" id="per_game.2011"><th class="left " data-stat="season" scope="row"><a href="/players/a/anthoca01/gamelog/2011/">2010-11</a><span class="sr_star"></span></th><td class="center " data-stat="age">26</td><td class="left " data-stat="team_id"><a href="/teams/NYK/2011.html">NYK</a></td><td class="left " data-stat="lg_id"><a href="/leagues/NBA_2011.html">NBA</a></td><td class="center " data-stat="pos">SF</td><td class="right " data-stat="g">27</td><td class="right " data-stat="gs">27</td><td class="right " data-stat="mp_per_g">36.2</td><td class="right " data-stat="fg_per_g">9.1</td><td class="right " data-stat="fga_per_g">19.9</td><td class="right " data-stat="fg_pct">.461</td><td class="right " data-stat="fg3_per_g">2.0</td><td class="right " data-stat="fg3a_per_g">4.6</td><td class="right " data-stat="fg3_pct">.424</td><td class="right " data-stat="fg2_per_g">7.2</td><td class="right " data-stat="fg2a_per_g">15.2</td><td class="right " data-stat="fg2_

In [97]:
all_tr = soup.find_all('tr')
for element in all_tr[1:15]:
    print(element.text)

2003-0419DENNBASF828236.57.617.9.4260.82.6.3226.815.3.444.4495.06.4.7772.23.86.12.81.20.53.02.721.0
2004-0520DENNBASF757534.87.116.4.4310.62.1.2666.514.3.455.4486.17.6.7961.93.85.72.60.90.43.03.120.8
2005-0621DENNBASF808036.89.519.7.4810.51.9.2439.017.8.506.4937.28.9.8081.53.44.92.71.10.52.72.926.5
2006-0722DENNBASF656538.210.622.4.4760.62.3.26810.020.1.499.4897.18.7.8082.23.86.03.81.20.43.63.128.9
2007-0823DENNBASF777736.49.519.2.4920.82.1.3548.717.1.509.5116.07.7.7862.35.17.43.41.30.53.33.325.7
2008-0924DENNBASF666634.58.118.3.4431.02.6.3717.215.7.455.4695.67.1.7931.65.26.83.41.10.43.03.022.8
2009-1025DENNBASF696938.210.021.8.4580.92.7.3169.119.1.478.4787.48.9.8302.24.46.63.21.30.43.03.328.2
2010-1126TOTNBASF777735.78.919.5.4551.23.3.3787.616.3.470.4876.67.9.8381.55.87.32.90.90.62.72.925.6
2010-1126DENNBASF505035.58.719.3.4520.82.5.3337.916.8.470.4746.98.3.8231.56.17.62.80.90.62.82.725.2
2010-1126NYKNBASF272736.29.119.9.4612.04.6.4247.215.2.472.5106.17.0.8721.55.26.73.00.90.62.43.326

In [27]:
soup.find_all('th')[:10]

[<th aria-label="If listed as single number, the year the season ended.★ - Indicates All-Star for league.Only on regular season tables." class=" poptip sort_default_asc center" data-stat="season" data-tip="If listed as single number, the year the season ended.&lt;br&gt;★ - Indicates All-Star for league.&lt;br&gt;Only on regular season tables." scope="col">Season</th>,
 <th aria-label="Age of Player at the start of February 1st of that season." class=" poptip sort_default_asc center" data-stat="age" data-tip="Age of Player at the start of February 1st of that season." scope="col">Age</th>,
 <th aria-label="Team" class=" poptip sort_default_asc center" data-stat="team_id" data-tip="Team" scope="col">Tm</th>,
 <th aria-label="League" class=" poptip sort_default_asc center" data-stat="lg_id" data-tip="League" scope="col">Lg</th>,
 <th aria-label="Position" class=" poptip sort_default_asc center" data-stat="pos" data-tip="Position" scope="col">Pos</th>,
 <th aria-label="Games" class=" popti

In [31]:
len( soup.find_all('th'))

52

In [49]:
soup.find('tr#per_game.2004.full_table')

In [53]:
help(BeautifulSoup)

Help on class BeautifulSoup in module bs4:

class BeautifulSoup(bs4.element.Tag)
 |  This class defines the basic interface called by the tree builders.
 |  
 |  These methods will be called by the parser:
 |    reset()
 |    feed(markup)
 |  
 |  The tree builder may call these methods from its feed() implementation:
 |    handle_starttag(name, attrs) # See note about return value
 |    handle_endtag(name)
 |    handle_data(data) # Appends to the current data node
 |    endData(containerClass=NavigableString) # Ends the current data node
 |  
 |  No matter how complicated the underlying parser is, you should be
 |  able to build a tree using 'start tag' events, 'end tag' events,
 |  'data' events, and "done with data" events.
 |  
 |  If you encounter an empty-element tag (aka a self-closing tag,
 |  like HTML's <br> tag), call handle_starttag and then
 |  handle_endtag.
 |  
 |  Method resolution order:
 |      BeautifulSoup
 |      bs4.element.Tag
 |      bs4.element.PageElement
 | 

In [55]:
class BoxScore:
    def __init__(
            self,
            first_name,
            last_name,
            date,
            team,
            opponent,
            is_home,
            seconds_played,
            field_goals,
            field_goal_attempts,
            three_point_field_goals,
            three_point_field_goal_attempts,
            free_throws,
            free_throw_attempts,
            offensive_rebounds,
            defensive_rebounds,
            total_rebounds,
            assists,
            steals,
            blocks,
            turnovers,
            personal_fouls,
            points
    ):
        self.first_name = first_name
        self.last_name = last_name
        self.date = date
        self.team = team
        self.opponent = opponent
        self.is_home = is_home
        self.seconds_played = seconds_played
        self.field_goals = field_goals
        self.field_goal_attempts = field_goal_attempts
        self.three_point_field_goals = three_point_field_goals
        self.three_point_field_goal_attempts = three_point_field_goal_attempts
        self.free_throws = free_throws
        self.free_throw_attempts = free_throw_attempts
        self.offensive_rebounds = offensive_rebounds
        self.defensive_rebounds = defensive_rebounds
        self.total_rebounds = total_rebounds
        self.assists = assists
        self.steals = steals
        self.blocks = blocks
        self.turnovers = turnovers
        self.personal_fouls = personal_fouls
        self.points = points

In [56]:
class PlayerSeasonStatistics:
    def __init__(
            self,
            first_name,
            last_name,
            age,
            team,
            position,
            games_played,
            games_started,
            minutes_played,
            field_goals_made,
            field_goals_attempted,
            three_point_field_goals_made,
            three_point_field_goals_attempted,
            two_point_field_goals_made,
            two_point_field_goals_attempted,
            free_throws_made,
            free_throws_attempted,
            offensive_rebounds,
            defensive_rebounds,
            assists,
            steals,
            blocks,
            turnovers,
            personal_fouls,
            points
    ):
        self.age = age
        self.first_name = first_name
        self.last_name = last_name
        self.team = team
        self.position = position
        self.games_played = games_played
        self.games_started = games_started
        self.minutes_played = minutes_played
        self.field_goals_made = field_goals_made
        self.field_goals_attempted = field_goals_attempted
        self.three_point_field_goals_made = three_point_field_goals_made
        self.three_point_field_goals_attempted = three_point_field_goals_attempted
        self.two_point_field_goals_made = two_point_field_goals_made
        self.two_point_field_goals_attempted = two_point_field_goals_attempted
        self.free_throws_made = free_throws_made
        self.free_throws_attempted = free_throws_attempted
        self.offensive_rebounds = offensive_rebounds
        self.defensive_rebounds = defensive_rebounds
        self.assists = assists
        self.steals = steals
        self.blocks = blocks
        self.turnovers = turnovers
        self.personal_fouls = personal_fouls
        self.points = points

In [98]:
soup_2 = BeautifulSoup(response.content, 'html.parser')

### Tweepy

- Sign into Twitter apps (https://apps.twitter.com/)
- Create application and retrieve `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.  
- Follow example below filling in your info.  For more info, see the Tweepy documentation [here](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).

In [17]:
consumer_key = 'I7JmWNmyqhzu0VIVpn1DtUAsc'
consumer_secret = 'DApKlKIMg2MMdiDEZNE3p3WtSNUF1tq6G70W0TBP4w7AVcZZkk'
access_token = '2772844623-sad4NXYbTZhingQY2KycKK6bT3exbjp9dvl25lt'
access_token_secret = '06ukALDQTMOCXG2bMXfSmxWMhdoXhzXD3qvUZLKy3YpRb'

In [19]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [20]:
user = api.get_user('thrashermag')

In [21]:
for tweet in user.timeline():
    print(tweet.text)

Double the handcuffs equals triple the fun. Or pain. Probably pain. Watch the latest episode Tuesday night at 9 pm… https://t.co/CcyguAD6wC
On Friday, attendees pulled up to the historic Max Fish bar in the Lower East Side of Manhattan from all over the E… https://t.co/hoZvHDxdKW
With a nod to past, but firmly planted in the future, this offering from @dcshoes is everything you could ever want… https://t.co/w4nxVWK4lh
With Chico and Carroll as their guides, watch Real tackle some '90s moves and f--ked up Goofy Boy fashion.… https://t.co/BoaQSv0QSD
It all kicks off with a beautiful block-to-block line and just keeps getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU
Element attacks Sacto with help from Cardiel, Carroll and Chico gives Real an early-90s makeover and the F Troop ba… https://t.co/qWHqhBL2MW
T-Funk blows the doors off an NYC hotspot with a rail combo bordering on the absurd. Hell YES.… https://t.co/PbmnhhaqCO
A solid crew went over to Omar Hassan’s back

In [22]:
print(user.followers_count)

415866


In [23]:
tweets = []
for tweet in user.timeline(count = 200):
    tweets.append(tweet.text)

In [24]:
tweets[:5]

['Double the handcuffs equals triple the fun. Or pain. Probably pain. Watch the latest episode Tuesday night at 9 pm… https://t.co/CcyguAD6wC',
 'On Friday, attendees pulled up to the historic Max Fish bar in the Lower East Side of Manhattan from all over the E… https://t.co/hoZvHDxdKW',
 'With a nod to past, but firmly planted in the future, this offering from @dcshoes is everything you could ever want… https://t.co/w4nxVWK4lh',
 "With Chico and Carroll as their guides, watch Real tackle some '90s moves and f--ked up Goofy Boy fashion.… https://t.co/BoaQSv0QSD",
 'It all kicks off with a beautiful block-to-block line and just keeps getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU']

In [25]:
user_BK = api.get_user('Rotoworld_BK')

In [26]:
print(user_BK.followers_count)

110575


In [27]:
tweets = []
for tweet in user_BK.timeline(count = 200):
    tweets.append(tweet.text)

In [29]:
tweets[:15]

['Clippers acquire Johnathan Motley from Dallas https://t.co/RuWmR7Cfek',
 'RT @MikeSGallagher: New fantasy hoops pod! @docktora and I go over the new-look Hawks with tons of John Collins hype, some of the low-key t…',
 'Dakari Johnson traded to Memphis for Martin https://t.co/eK26sbIJEz',
 'Carmelo Anthony intends to sign with Rockets https://t.co/3sjVswxC6V',
 'Tobias Harris turns down $80m extension offer https://t.co/T0R9W8pp4M',
 'Report: Cavs have spoken with David Nwaba https://t.co/DxFe4CUNjU',
 'Shams: Alex Len finalizing deal with Hawks https://t.co/4FhdkFLs6Z',
 'Shams: Kings, Nemanja Bjelica agree to deal https://t.co/Ifts6aWR0Z',
 'Shams: Lakers, Michael Beasley agree to deal https://t.co/ejBXkueNNs',
 'Shams: Magic trade for Dakari Johnson https://t.co/DLHwLESp3G',
 'Shams: Kings, Yogi Ferrell agree to deal https://t.co/qv6bXokAPF',
 'Jonah Bolden signs 4-year deal with Sixers https://t.co/YRbgjN5Bem',
 'Shams: Sixers trading Richaun Holmes to PHX https://t.co/tTjxlDdCgC'

In [32]:
user_BK.favourites_count

9

In [33]:
user.favourites_count

1050

### Open Table

![](images/open_table.png)

Finding restaurants in New York City. (https://www.opentable.com/new-york-restaurant-listings)  Is there good Indian food in the Upper West Side?  Where?  What are people saying is good?