# Importing Data from the Internet

**Lesson Plan**
* Import and locally save datasets from the web
* Load datasets into pandas DataFrames
* Make HTTP requests (GET requests)
* Scrape web data such as HTML
* Parse HTML into useful data (Beautiful Soup)
    * Packages: ***urllib*** and ***requests***
    
## urllib package
* Provides an interface for fetching data from the web
* `urlopen()` accepts URLs instead of file names

In [1]:
from urllib.request import urlretrieve
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'

# Download the file and save it as 'winequality-white.csv' in the sample_data folder (the folder must already exist)
urlretrieve(url, 'sample_data/winequality-white.csv')

('sample_data/winequality-white.csv',
 <http.client.HTTPMessage at 0x2177d37c908>)
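`urlretrieve()` returns a `(filename, headers)` tuple, as the output above shows, and it raises an error if the target folder does not exist. Here is a minimal sketch of one way to guard against that. The helper name `download_to` is hypothetical, and the demo uses a local `file://` URL as a stand-in for a web address so that it runs without a network connection.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_to(url, dest):
    """Save the resource at `url` to `dest`, creating parent folders first."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # urlretrieve returns (local filename, headers); only the path is kept here
    saved_path, _headers = urlretrieve(url, str(dest))
    return Path(saved_path)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote CSV: a small local file addressed by a file:// URL
    source = Path(tmp) / "source.csv"
    source.write_text("a;b\n1;2\n")

    saved = download_to(source.as_uri(), Path(tmp) / "sample_data" / "copy.csv")
    downloaded_text = saved.read_text()
    print(downloaded_text)
```

With a real dataset, the same call would take the `http://` or `https://` URL in place of `source.as_uri()`.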

## Importing data from the web into pandas DataFrames

In [2]:
import pandas as pd
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
df = pd.read_csv(url, sep=';')
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


## GET requests using urllib

In [3]:
from urllib.request import urlopen, Request
url = "https://www.google.com"
request = Request(url)
response = urlopen(request)
html = response.read()
response.close()
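In the cell above, `response.read()` returns raw bytes, and `response.close()` is skipped if an exception is raised before it runs. Here is a minimal sketch of one way to handle both: a `with` block closes the response automatically, and the bytes are decoded using the charset the response advertises. The helper name `fetch_text` and the UTF-8 fallback are assumptions made for illustration. The demo uses a `data:` URL, which `urllib` can open without a network connection, in place of a real web address.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def fetch_text(url, fallback_encoding="utf-8"):
    """Fetch `url` and return its body as a str (hypothetical helper)."""
    request = Request(url)
    # The with block closes the response even if read() raises
    with urlopen(request) as response:
        raw = response.read()  # bytes
        # Use the charset from the Content-Type header when one is given
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset)


# Offline stand-in for a web page: the HTML is embedded in a data: URL
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")
html = fetch_text(demo_url)
print(html)
```

The same function accepts an `https://` URL such as the one used in the cell above.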

## GET requests using requests (not the same as urllib's `Request` class)

In [4]:
import requests
url = "https://www.wikipedia.org/"
req = requests.get(url)
headers = req.headers
headers

{'Date': 'Wed, 14 Aug 2019 00:14:52 GMT', 'Content-Type': 'text/html', 'Content-Length': '18685', 'Connection': 'keep-alive', 'Server': 'mw1239.eqiad.wmnet', 'Cache-Control': 's-maxage=86400, must-revalidate, max-age=3600', 'ETag': 'W/"129d0-58f5c4cdc3919"', 'Last-Modified': 'Mon, 05 Aug 2019 10:37:52 GMT', 'Backend-Timing': 'D=232 t=1565507952102932', 'Content-Encoding': 'gzip', 'Vary': 'X-Seven, Accept-Encoding', 'X-Varnish': '922336708 909575905, 679934806 216828023', 'Age': '67630', 'X-Cache': 'cp1079 hit/6, cp1075 hit/644450', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Set-Cookie': 'WMF-Last-Access=14-Aug-2019;Path=/;HttpOnly;secure;Expires=Sun, 15 Sep 2019 00:00:00 GMT, WMF-Last-Access-Global=14-Aug-2019;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Sun, 15 Sep 2019 00:00:00 GMT, GeoIP=US:IN:Fishers:39.96:-86.02:v4; Path=/; secure; Domain=.wikipedia.org', 'X-Analyt

## Scraping the web in Python

**HTML**
* Mix of unstructured and structured data
* Structured data either:
    * Follows a pre-defined data model, or
    * Is organized in a defined manner
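To make the distinction concrete, here is a small sketch that pulls the structured part (table rows) out of a page that also contains free-form prose. It uses the standard library's `html.parser` rather than Beautiful Soup, purely so that it runs with no extra packages. The class name `TableCellCollector` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])  # start a new row
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep text only while inside a table cell; prose elsewhere is ignored
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


sample_html = """
<p>Our two best sellers this week, in no particular order.</p>
<table>
  <tr><td>red</td><td>5</td></tr>
  <tr><td>white</td><td>6</td></tr>
</table>
"""

collector = TableCellCollector()
collector.feed(sample_html)
rows = collector.rows
print(rows)
```

The paragraph is unstructured text, while the table follows a fixed row-and-column layout that a parser can rely on.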


### BeautifulSoup
* Parse and extract structured data from HTML
* Make tag soup beautiful and extract information

In [5]:
from bs4 import BeautifulSoup
import requests
url = "https://www.crummy.com/software/BeautifulSoup"
req = requests.get(url)
html_doc = req.text
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
soup = BeautifulSoup(html_doc, 'html.parser')
soup

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup: We called him Tortoise because he taught us.</title>
<link href="mailto:leonardr@segfault.org" rev="made"/>
<link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
<meta content="Beautiful Soup: a library designed for screen-scraping HTML and XML." name="Description"/>
<meta content="Markov Approximation 1.4 (module: leonardr)" name="generator"/>
<meta content="Leonard Richardson" name="author"/>
</head>
<body alink="red" bgcolor="white" link="blue" text="black" vlink="660066">
<img align="right" src="10.1.jpg" width="250"/><br/>
<p>You didn't write that awful page. You're just trying to get some
data out of it. Beautiful Soup is here to help. Since 2004, it's been
saving programmers hours or days of work on quick-turnaround
screen scraping projects.</

In [6]:
# Get the Title of the webpage
title = soup.title
title

<title>Beautiful Soup: We called him Tortoise because he taught us.</title>

In [7]:
# Get the Text of the webpage
text = soup.text
text

'\n\n\nBeautiful Soup: We called him Tortoise because he taught us.\n\n\n\n\n\n\n\n\nYou didn\'t write that awful page. You\'re just trying to get some\ndata out of it. Beautiful Soup is here to help. Since 2004, it\'s been\nsaving programmers hours or days of work on quick-turnaround\nscreen scraping projects.\n\nBeautiful Soup\n"A tremendous boon." -- Python411 Podcast\n[ Download | Documentation | Hall of Fame | Source | Changelog | Discussion group  | Zine ]\nIf you use Beautiful Soup as part of your work, please consider a Tidelift subscription. This will support many of the free software projects your organization depends on, not just Beautiful Soup.\nIf Beautiful Soup is useful to you on a personal level, you might like to read Tool Safety, a short zine I wrote about what I learned about software development from working on Beautiful Soup. Thanks!\n\nIf you have questions, send them to the discussion\ngroup. If you find a bug, file it.\nBeautiful Soup is a Python library designe

In [8]:
# Find all 'a' tags
a_tags = soup.find_all('a')
for link in a_tags:
    print(link.get('href'))

bs4/download/
#Download
bs4/doc/
#HallOfFame
https://code.launchpad.net/beautifulsoup
https://bazaar.launchpad.net/%7Eleonardr/beautifulsoup/bs4/view/head:/CHANGELOG
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
zine/
https://tidelift.com/subscription/pkg/pypi-beautifulsoup4?utm_source=pypi-beautifulsoup4&utm_medium=referral&utm_campaign=website
zine/
https://groups.google.com/forum/?fromgroups#!forum/beautifulsoup
https://bugs.launchpad.net/beautifulsoup/
http://lxml.de/
http://code.google.com/p/html5lib/
bs4/doc/
None
bs4/download/
http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html
download/3.x/BeautifulSoup-3.2.1.tar.gz
None
http://www.nytimes.com/2007/10/25/arts/design/25vide.html
https://github.com/reddit/reddit/blob/85f9cff3e2ab9bb8f19b96acd8da4ebacc079f04/r2/r2/lib/media.py
http://www.harrowell.org.uk/viktormap.html
http://svn.python.org/view/tracker/importer/
http://www2.ljworld.com/
http://www.b-list.org/weblog/2010/nov/02/news-done-broke/
h
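The output above mixes absolute URLs, relative paths such as `bs4/download/`, in-page anchors such as `#Download`, and `None` for `a` tags that have no `href`. Here is a minimal sketch of one way to normalize such a list with `urllib.parse.urljoin`. The `hrefs` list is a hand-copied sample of the values printed above, and the base URL with a trailing slash is an assumption about the address the page is actually served from.

```python
from urllib.parse import urljoin

# Assumed base: the page URL with a trailing slash, so relative paths
# resolve underneath /software/BeautifulSoup/ rather than /software/
base_url = "https://www.crummy.com/software/BeautifulSoup/"

# Sample of values printed by link.get('href') above
hrefs = ["bs4/download/", "#Download", None, "http://lxml.de/"]

absolute_links = [
    urljoin(base_url, href)   # absolute URLs pass through unchanged
    for href in hrefs
    if href is not None       # skip <a> tags without an href attribute
]

for link in absolute_links:
    print(link)
```

When the page is fetched with `requests`, the response's `url` attribute holds the final address after any redirects and can be used as the base instead of a hard-coded string.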