![ADSA Logo](http://i.imgur.com/BV0CdHZ.png?2 "ADSA Logo")

# Spring 2019 ADSA Workshop - Introduction to Web Scraping



## What is web scraping?
Web scraping (also called "web harvesting", "web data extraction", or even "web data mining") can be defined as "the construction of an agent to download, parse, and organize data from the web in an automated manner".

***
## Using `urllib` to Access Web Data and `BeautifulSoup` to Parse It

`urllib` is an easy-to-use package in Python's standard library for fetching URLs (Uniform Resource Locators). You can use it to read web content directly in your code. We are going to use it to build an app that gets weather data.

`BeautifulSoup`, on the other hand, is an HTML and XML parser. It builds a parse tree from a web page, which you can then search and navigate to reach individual tags. This makes it a very useful tool for web scraping.

Let's start by seeing what reading the Python.org homepage through `urllib` looks like. Then we will use `BeautifulSoup` to print all the links present in the webpage!   
For more information visit: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [1]:
import urllib
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'http://python.org'
response = urlopen(url) 
html = response.read()

# Parse the page with BeautifulSoup so that we can pull out just the links.
soup = BeautifulSoup(html, "html.parser")  # Create a soup object; the next line prints its class.
print(type(soup))
#print(soup.prettify())

<class 'bs4.BeautifulSoup'>


In [2]:
for link in soup.find_all('a', href=True):  # Find every <a> tag that has an href attribute
    if "http" in link['href']:
        print(link['href'])

https://docs.python.org
https://pypi.python.org/
http://plus.google.com/+Python
http://www.facebook.com/pythonlang?fref=ts
http://twitter.com/ThePSF
http://brochure.getpython.info/
https://docs.python.org/3/license.html
https://wiki.python.org/moin/BeginnersGuide
https://devguide.python.org/
https://docs.python.org/faq/
http://wiki.python.org/moin/Languages
http://python.org/dev/peps/
https://wiki.python.org/moin/PythonBooks
https://wiki.python.org/moin/
https://www.python.org/psf/codeofconduct/
http://planetpython.org/
http://pyfound.blogspot.com/
http://pycon.blogspot.com/
https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
https://docs.python.org
http://blog.python.org
http://feedproxy.google.com/~r/PythonInsider/~3/4U66sA2wtWw/python-371rc1-and-367rc1-now-available.html
http://feedproxy.google.com/~r/PythonInsider/~3/5EA0ClmtbD8/python-356-and-python-349-are-now.html
http://feedproxy.

The loop above prints every link on the page whose `href` contains "http". The complete source HTML of the page is still stored in the `html` variable (as a `bytes` object, since we did not decode it), and we can do whatever else we want with it.
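As a small offline illustration of the same link-extraction idea, here is a minimal sketch that parses a made-up HTML snippet instead of a live page. The snippet and the names `sample_html`, `sample_soup`, and `absolute_links` are assumptions for illustration only. It uses `startswith("http")`, which is slightly stricter than checking whether "http" appears anywhere in the link.

```python
from bs4 import BeautifulSoup

# A tiny, made-up page so the example runs without a network connection.
sample_html = """
<html><body>
  <a href="https://docs.python.org">Docs</a>
  <a href="/about/">About</a>
  <a name="top">No href here</a>
  <a href="http://planetpython.org/">Planet</a>
</body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# href=True skips <a> tags that have no href attribute;
# startswith("http") keeps only absolute http/https links.
absolute_links = [
    a["href"]
    for a in sample_soup.find_all("a", href=True)
    if a["href"].startswith("http")
]
print(absolute_links)
```

The relative link `/about/` and the anchor without an `href` are both skipped.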

### Build a Weather Reporting Program!

Let's now use the `urllib` module to build a small program that tells you the city and the current weather when you give it the ZIP code of a place.
For the weather data, we will use the service OpenWeatherMap.org. Copy the URL http://api.openweathermap.org/data/2.5/weather?zip=61820,us&appid=cf7f4e0a615b5f48f4601377a2c98a75 into the address bar in a new tab. The page shows the current weather data for ZIP code 61820 (Champaign) as plain text. Let's load this information through `urllib`.

In [3]:
appid = 'cf7f4e0a615b5f48f4601377a2c98a75'
zipcode = '61820'
url = 'http://api.openweathermap.org/data/2.5/weather?zip={},us&APPID={}'.format(zipcode, appid)
response = urlopen(url)
weather_html = response.read().decode('utf-8')

print(weather_html)

{"coord":{"lon":-88.24,"lat":40.12},"weather":[{"id":800,"main":"Clear","description":"clear sky","icon":"01d"}],"base":"stations","main":{"temp":296.31,"pressure":1020,"humidity":80,"temp_min":294.05,"temp_max":299.05},"visibility":16093,"wind":{"speed":1.46,"deg":108.513},"clouds":{"all":1},"dt":1538933700,"sys":{"type":1,"id":968,"message":0.0044,"country":"US","sunrise":1538913364,"sunset":1538954660},"id":420012386,"name":"Bloomington","cod":200}
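The request URL can also be assembled with `urllib.parse.urlencode` instead of `str.format`, which takes care of escaping the query parameters. The sketch below is one possible way to do it; the helper name `build_weather_url`, the `country` parameter, and the `YOUR_API_KEY` placeholder are assumptions, not part of the workshop code.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.openweathermap.org/data/2.5/weather"


def build_weather_url(zipcode, appid, country="us"):
    """Return the request URL for a ZIP code (hypothetical helper)."""
    params = {"zip": "{},{}".format(zipcode, country), "appid": appid}
    # safe="," keeps the comma in "61820,us" readable instead of "%2C".
    return BASE_URL + "?" + urlencode(params, safe=",")


print(build_weather_url("61820", "YOUR_API_KEY"))
```

The resulting string can be passed to `urlopen` in the same way as the formatted URL.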


The string that we have received is formatted as JSON, whose objects look very similar to Python dictionaries. Let's parse this JSON data into a Python dictionary, and also pretty-print it so that we can understand the structure of the data.

In [4]:
from json import JSONDecoder, dumps

decoder = JSONDecoder()
weather_data = decoder.decode(weather_html)
pretty_weather_data = dumps(weather_data, indent=2, separators=(',', ': '))

print(pretty_weather_data)

{
  "coord": {
    "lon": -88.24,
    "lat": 40.12
  },
  "weather": [
    {
      "id": 800,
      "main": "Clear",
      "description": "clear sky",
      "icon": "01d"
    }
  ],
  "base": "stations",
  "main": {
    "temp": 296.31,
    "pressure": 1020,
    "humidity": 80,
    "temp_min": 294.05,
    "temp_max": 299.05
  },
  "visibility": 16093,
  "wind": {
    "speed": 1.46,
    "deg": 108.513
  },
  "clouds": {
    "all": 1
  },
  "dt": 1538933700,
  "sys": {
    "type": 1,
    "id": 968,
    "message": 0.0044,
    "country": "US",
    "sunrise": 1538913364,
    "sunset": 1538954660
  },
  "id": 420012386,
  "name": "Bloomington",
  "cod": 200
}
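A shorter route to the same result is `json.loads`, which parses a JSON string in one call without creating a `JSONDecoder` object first. The sketch below uses a shortened, hard-coded copy of the response so that it runs offline; the names `sample_response` and `sample_data` are assumptions for illustration.

```python
import json

# A shortened copy of the API response, hard-coded so no request is needed.
sample_response = (
    '{"main": {"temp": 296.31, "humidity": 80}, '
    '"name": "Bloomington", "cod": 200}'
)

sample_data = json.loads(sample_response)  # JSON text -> Python dict

print(type(sample_data).__name__)
print(sample_data["name"], sample_data["main"]["temp"])
print(json.dumps(sample_data, indent=2))  # pretty-print, as above
```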


To build our program we need two pieces of information: the `name` field, and the `temp` field inside the `main` sub-dictionary. The temperature is reported in kelvins, so we convert it to Fahrenheit.

In [5]:
city = weather_data['name']

temp_kelvin = weather_data['main']['temp']
temp_fah = 1.8 * (temp_kelvin - 273.15) + 32

print("We are in {0} and it is {1} degrees outside!".format(city, temp_fah))

We are in Bloomington and it is 73.68800000000005 degrees outside!
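The long tail of digits in the output is floating-point noise from the conversion. One way to tidy it up, sketched here with a hypothetical helper named `kelvin_to_fahrenheit`, is to round the result before printing it.

```python
def kelvin_to_fahrenheit(temp_kelvin, ndigits=1):
    """Convert kelvins to degrees Fahrenheit, rounded to ndigits decimals."""
    return round(1.8 * (temp_kelvin - 273.15) + 32, ndigits)


print(kelvin_to_fahrenheit(296.31))  # the temperature from the response above
```

Alternatively, the format string itself can do the rounding, for example `"{1:.1f}"` in place of `"{1}"`.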


Let's put all of this into a single, easy-to-use function.

In [0]:
def tell_me_weather(zipcode):
    # Fetch the current weather for a US ZIP code and print a short report.
    appid = 'cf7f4e0a615b5f48f4601377a2c98a75'
    url = 'http://api.openweathermap.org/data/2.5/weather?zip={0},us&APPID={1}'.format(zipcode, appid)
    response = urlopen(url)
    weather_html = response.read().decode('utf-8')

    decoder = JSONDecoder()
    weather_data = decoder.decode(weather_html)
    city = weather_data['name']

    temp_kelvin = weather_data['main']['temp']
    temp_fah = 1.8 * (temp_kelvin - 273.15) + 32

    print("You are in {0} and it is {1} degrees outside!".format(city, temp_fah))

Now let's use our new `tell_me_weather` function!

In [7]:
tell_me_weather(61801)
tell_me_weather(60601)
tell_me_weather(94102)

You are in Bloomington and it is 75.93800000000005 degrees outside!
You are in Chicago and it is 59.57600000000009 degrees outside!
You are in San Francisco and it is 72.46400000000003 degrees outside!
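The calls above pass ZIP codes as integers, which works for these three. ZIP codes that begin with a zero (for example 02139) cannot be written as integer literals in Python 3, so it can be safer to handle them as strings. A minimal sketch, using a hypothetical helper named `normalize_zip`:

```python
def normalize_zip(zipcode):
    """Return a 5-character ZIP code string (hypothetical helper)."""
    # str() accepts both ints and strings; zfill restores leading zeros.
    return str(zipcode).strip().zfill(5)


print(normalize_zip(61801))
print(normalize_zip(2139))
print(normalize_zip("02139"))
```

A function like `tell_me_weather` could call this on its argument before building the URL.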


### Exercise
Scrape all of the thread titles from the first page of posts in a subreddit.

Hints:
1) Use your browser's "Inspect" tool on the page to find the tag that holds each title. Pass this tag to the `find_all` function to get each of the titles.
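To illustrate the hint without giving away the answer, here is a sketch of `find_all` with a tag name and a CSS class on made-up HTML. The `h3` tag and the `post-title` class are purely hypothetical; the real page uses different markup, which is what the Inspect tool is for.

```python
from bs4 import BeautifulSoup

# Made-up markup: the tag and class names are NOT the ones Reddit uses.
listing_html = """
<div class="post"><h3 class="post-title">First headline</h3></div>
<div class="post"><h3 class="post-title">Second headline</h3></div>
<div class="sidebar"><h3>Not a post</h3></div>
"""

listing_soup = BeautifulSoup(listing_html, "html.parser")

# class_ (with a trailing underscore) filters on the CSS class,
# and get_text(strip=True) returns the tag's text without extra whitespace.
found_titles = [
    tag.get_text(strip=True)
    for tag in listing_soup.find_all("h3", class_="post-title")
]
print(found_titles)
```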

In [0]:
url = 'https://www.reddit.com/r/worldnews/hot/'


Now find the number of comments that each of the Reddit threads has received.

Hint: Some of the threads are promotions.

Now create a dataframe with your titles and comment counts. The columns should be named "title" and "comment_count".
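For the last step, one common way to build such a dataframe is from a dictionary of equal-length lists. The sketch below uses placeholder data rather than real scraped values; `example_titles` and `example_counts` are assumed names standing in for whatever lists your scraping code produces.

```python
import pandas as pd

# Placeholder values, standing in for the scraped results.
example_titles = ["First headline", "Second headline"]
example_counts = [120, 45]

# Each dictionary key becomes a column name.
df = pd.DataFrame({"title": example_titles, "comment_count": example_counts})
print(df)
```

Both lists must have the same length, so promoted threads that lack a comment count need to be skipped or filled in before this step.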