# Introduction to Web Scraping

To begin, we will examine the reddit page dealing with Machine Learning.  Our goal is to scrape the basic information for posts.

![](images/reddit.png)

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [5]:
url = 'https://en.wikipedia.org/wiki/List_of_21_Jump_Street_episodes'

In [6]:
response = requests.get(url)

In [7]:
response

<Response [200]>

In [9]:
response.text[:300]

'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of 21 Jump Street episodes - Wikipedia</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<sc'

In [11]:
soup = BeautifulSoup(response.text, 'html.parser')

#Pass response text into Beautiful soup

In [12]:
soup.find('h2')

#find anything that was a h2 header

<h2>Contents</h2>

In [15]:
all_h2 = soup.find_all('h2')
for header in all_h2[:5]:
    print(header.text)

Contents
Series Overview[edit]
Season 1 (1987)[edit]
Season 2 (1987-88)[edit]
Season 3 (1988-89)[edit]


In [16]:
soup.find('h2').text

'Contents'

In [None]:
soup.find('p')

In [None]:
len(soup.find_all('p'))

In [None]:
len(soup.find_all('h2'))

In [None]:
soup.find('a', {'data-click-id': 'body'})['href']

In [None]:
links = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)

In [None]:
links

In [None]:
links = []
titles = []
bodys = []
for i in soup.find_all('a', {'data-click-id': 'body'}):
    url_link = 'https://www.reddit.com' + i['href']
    links.append(url_link)
    response = requests.get(url_link)
    soup2 = BeautifulSoup(response.text, 'html.parser')
    title = soup2.find('h2')
    body = soup2.find_all('p')
    titles.append(title)
    bodys.append(body)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame({'links': links, 'title': titles, 'body': bodys})

In [None]:
df.head()

### Wikipedia Exercise

Scraping Wikipedia tables and adding information found through links.

![](images/wiki_table.png)

Problem:

1. Create a dataframe that contains the information displayed on the Wikipedia page "List of 2018 Albums".
2. What is Sub Pop releasing in 2018?
3. Did Drake put anything out?
4. What label is putting out the most music?  Visualize this.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_2018_albums'

In [None]:
response = requests.get(url)

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
soup.find('table', {'class':'wikitable'})

### Tweepy

- Sign into Twitter apps (https://apps.twitter.com/)
- Create application and retrieve `consumer_key`, `consumer_secret`, `access_token`, and `access_token_secret`.  
- Follow example below filling in your info.  For more info, see the Tweepy documentation [here](http://tweepy.readthedocs.io/en/v3.5.0/getting_started.html#introduction).

In [17]:
consumer_key = 'I7JmWNmyqhzu0VIVpn1DtUAsc'
consumer_secret = 'DApKlKIMg2MMdiDEZNE3p3WtSNUF1tq6G70W0TBP4w7AVcZZkk'
access_token = '2772844623-sad4NXYbTZhingQY2KycKK6bT3exbjp9dvl25lt'
access_token_secret = '06ukALDQTMOCXG2bMXfSmxWMhdoXhzXD3qvUZLKy3YpRb'

In [19]:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

In [20]:
user = api.get_user('thrashermag')

In [21]:
for tweet in user.timeline():
    print(tweet.text)

Double the handcuffs equals triple the fun. Or pain. Probably pain. Watch the latest episode Tuesday night at 9 pm… https://t.co/CcyguAD6wC
On Friday, attendees pulled up to the historic Max Fish bar in the Lower East Side of Manhattan from all over the E… https://t.co/hoZvHDxdKW
With a nod to past, but firmly planted in the future, this offering from @dcshoes is everything you could ever want… https://t.co/w4nxVWK4lh
With Chico and Carroll as their guides, watch Real tackle some '90s moves and f--ked up Goofy Boy fashion.… https://t.co/BoaQSv0QSD
It all kicks off with a beautiful block-to-block line and just keeps getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU
Element attacks Sacto with help from Cardiel, Carroll and Chico gives Real an early-90s makeover and the F Troop ba… https://t.co/qWHqhBL2MW
T-Funk blows the doors off an NYC hotspot with a rail combo bordering on the absurd. Hell YES.… https://t.co/PbmnhhaqCO
A solid crew went over to Omar Hassan’s back

In [22]:
print(user.followers_count)

415866


In [23]:
tweets = []
for tweet in user.timeline(count = 200):
    tweets.append(tweet.text)

In [24]:
tweets[:5]

['Double the handcuffs equals triple the fun. Or pain. Probably pain. Watch the latest episode Tuesday night at 9 pm… https://t.co/CcyguAD6wC',
 'On Friday, attendees pulled up to the historic Max Fish bar in the Lower East Side of Manhattan from all over the E… https://t.co/hoZvHDxdKW',
 'With a nod to past, but firmly planted in the future, this offering from @dcshoes is everything you could ever want… https://t.co/w4nxVWK4lh',
 "With Chico and Carroll as their guides, watch Real tackle some '90s moves and f--ked up Goofy Boy fashion.… https://t.co/BoaQSv0QSD",
 'It all kicks off with a beautiful block-to-block line and just keeps getting better. Resounding pop, majestic style… https://t.co/hl76dds6tU']

In [25]:
user_BK = api.get_user('Rotoworld_BK')

In [26]:
print(user_BK.followers_count)

110575


In [27]:
tweets = []
for tweet in user_BK.timeline(count = 200):
    tweets.append(tweet.text)

In [29]:
tweets[:15]

['Clippers acquire Johnathan Motley from Dallas https://t.co/RuWmR7Cfek',
 'RT @MikeSGallagher: New fantasy hoops pod! @docktora and I go over the new-look Hawks with tons of John Collins hype, some of the low-key t…',
 'Dakari Johnson traded to Memphis for Martin https://t.co/eK26sbIJEz',
 'Carmelo Anthony intends to sign with Rockets https://t.co/3sjVswxC6V',
 'Tobias Harris turns down $80m extension offer https://t.co/T0R9W8pp4M',
 'Report: Cavs have spoken with David Nwaba https://t.co/DxFe4CUNjU',
 'Shams: Alex Len finalizing deal with Hawks https://t.co/4FhdkFLs6Z',
 'Shams: Kings, Nemanja Bjelica agree to deal https://t.co/Ifts6aWR0Z',
 'Shams: Lakers, Michael Beasley agree to deal https://t.co/ejBXkueNNs',
 'Shams: Magic trade for Dakari Johnson https://t.co/DLHwLESp3G',
 'Shams: Kings, Yogi Ferrell agree to deal https://t.co/qv6bXokAPF',
 'Jonah Bolden signs 4-year deal with Sixers https://t.co/YRbgjN5Bem',
 'Shams: Sixers trading Richaun Holmes to PHX https://t.co/tTjxlDdCgC'

### Open Table

![](images/open_table.png)

Finding restaurants in New York City. (https://www.opentable.com/new-york-restaurant-listings)  Is there good Indian food in the Upper West Side?  Where?  What are people saying is good?