# Importing data from the Internet


We will import and save data from the web. This involves making HTTP requests with the [Requests](https://requests.readthedocs.io/en/master/) library, scraping web data (HTML), and parsing that HTML into useful data with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Python's standard library also includes [urllib](https://docs.python.org/3/library/urllib.html) for working with URLs.

## Importing flat files

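To import data sets from websites, we use the urllib.request module. Two of its functions are used here:

* urlopen(url) - opens a URL and returns a response object
* urlretrieve(url, filename) - downloads the resource at url and saves it to a local file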

In [2]:
from urllib.request import urlretrieve
import pandas as pd

# Assign url of file
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
# Save file locally
urlretrieve(url, 'winequality-white.csv')
# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-white.csv', sep=';')
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
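
The download step can be tried without a network connection, because urlretrieve() also accepts file:// URLs. The sketch below is only an illustration: the temporary directory, the file names source.csv and copy.csv, and the three-column sample are assumptions made for this example, and the standard csv module stands in for pandas.

```python
import csv
import tempfile
from pathlib import Path
from urllib.request import urlretrieve

# Stand-in for a remote file: a small semicolon-separated file on local disk.
# (File names and the reduced set of columns are assumptions for this sketch.)
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("fixed acidity;volatile acidity;quality\n7.0;0.27;6\n6.3;0.3;6\n")

# urlretrieve() accepts file:// URLs as well as http:// ones.
url = source.as_uri()
target = workdir / "copy.csv"

# It returns the local filename and the response headers.
saved_path, headers = urlretrieve(url, str(target))

# Read the saved copy back, using ';' as the separator like sep=';' above.
with open(saved_path, newline="") as f:
    rows = list(csv.reader(f, delimiter=";"))

print(rows[0])                    # header row
print(len(rows) - 1, "data rows")
```

With a real data set, url would be the http:// address and the saved file could be passed to pd.read_csv as in the cell above.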


## Importing non-flat files from the web

In [3]:
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xls
xls = pd.read_excel(url, sheet_name = None)

# Print the sheetnames to the shell
print(xls.keys())

odict_keys(['1700', '1900'])


### Importing HTML data using urllib.request library

* URL: Uniform Resource Locator. A reference to a web resource.
* HTTP: HyperText Transfer Protocol. The foundation of data communication for the web.
* HTML: HyperText Markup Language. The standard markup language for web pages.

In [4]:
from urllib.request import urlopen, Request

url = "http://quotes.toscrape.com/random"
request = Request(url) #Packages the request
response = urlopen(request) #Sends the request and catches the response
html = response.read() #Reads the response body as bytes (the raw HTML)
response.close()
print(html)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe difference between genius and stupidity is: genius has its limits.\xe2\x80\x9d</span>\n        <span>by <small class="
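
The b'...' prefix in the output shows that response.read() returns bytes, not text, which is why the quotation marks appear as \xe2\x80\x9c. A minimal sketch of decoding them, assuming the page is UTF-8 encoded as its meta charset tag declares (the bytes literal below is copied from the output above, not fetched):

```python
# A fragment of the raw bytes shown in the output above.
raw = (b'<span class="text" itemprop="text">\xe2\x80\x9cThe difference between '
       b'genius and stupidity is: genius has its limits.\xe2\x80\x9d</span>')

# Decode the bytes into a str using the encoding the page declares.
text = raw.decode("utf-8")

print(type(raw).__name__, "->", type(text).__name__)
print(text)
```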

In [8]:
# We can also use a with block, which closes the response for us:

with urlopen("http://quotes.toscrape.com/random") as response:
    s = response.read()
    print(s)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cThe real lover is the man who can thrill you by kissing your forehead or smiling into your eyes or just staring into space

### Importing HTML data using Requests library

In [9]:
# Import package
import requests

url = "http://quotes.toscrape.com/random"
r = requests.get(url) #Packages the request, sends it, and catches the response
text = r.text #Extract the response body as text (str)
print(text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“Anyone who has never made a mistake has never tried anything new.”</span>
        <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a h
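
requests.get() does the packaging and the sending in one call. The two steps can also be written out separately, which may help when inspecting what is about to be sent. This is a rough sketch rather than the usual way to use the library: the User-Agent value and the fetch_text helper are hypothetical names introduced for the example, and nothing is sent until fetch_text is called.

```python
import requests

url = "http://quotes.toscrape.com/random"

# Step 1: package the request without sending it.
# (The User-Agent value is a made-up label for this example.)
request = requests.Request("GET", url, headers={"User-Agent": "data-import-notes"})
prepared = request.prepare()
print(prepared.method, prepared.url)


def fetch_text(prepared_request, timeout=10):
    """Step 2: send a prepared request and return the body as text.

    Hypothetical helper; raises requests.HTTPError on a 4xx/5xx status.
    """
    with requests.Session() as session:
        response = session.send(prepared_request, timeout=timeout)
        response.raise_for_status()
        return response.text
```

Calling fetch_text(prepared) would then perform the network request and return the same kind of string as r.text above.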

### BeautifulSoup

Raw HTML is complicated and difficult to read. Beautiful Soup transforms a complex HTML document into a tree of Python objects. The main kinds of object are:

* A Tag object corresponds to an XML or HTML tag in the original document.
* A tag may have any number of attributes. You can access them as a dictionary through .attrs.
* The BeautifulSoup object represents the parsed document as a whole.
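
These objects can be inspected on a small fragment before working with a full page. The sketch below parses a single link taken from the page used in this section, and it assumes Python's built-in html.parser as the parser:

```python
from bs4 import BeautifulSoup

# One link from the page parsed later in this section.
html_doc = '<a href="pics.html"><img border="0" src="images/IMG_2192.jpg"/></a>'

soup = BeautifulSoup(html_doc, "html.parser")  # the parsed document as a whole
tag = soup.a                                   # a Tag object for the <a> element

print(type(soup).__name__)
print(type(tag).__name__)
print(tag.name)        # the tag's name
print(tag.attrs)       # all attributes as a dictionary
print(tag["href"])     # a single attribute, accessed like a dictionary key
print(tag.img.attrs)   # attributes of the nested <img> tag
```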

In [24]:
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/~guido/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser') #Create a BeautifulSoup object from the HTML; naming the parser avoids a warning
pretty_soup = soup.prettify() #Prettify the BeautifulSoup object
print(pretty_soup)

<html>
 <head>
  <title>
   Guido's Personal Home Page
  </title>
 </head>
 <body bgcolor="#FFFFFF" text="#000000">
  <h1>
   <a href="pics.html">
    <img border="0" src="images/IMG_2192.jpg"/>
   </a>
   Guido van Rossum - Personal Home Page
   <a href="pics.html">
    <img border="0" height="216" src="images/guido-headshot-2019.jpg" width="270"/>
   </a>
  </h1>
  <p>
   <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
    <i>
     "Gawky and proud of it."
    </i>
   </a>
  </p>
  <h3>
   <a href="images/df20000406.jpg">
    Who I Am
   </a>
  </h3>
  <p>
   Read
my
   <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
    "King's
Day Speech"
   </a>
   for some inspiration.
  </p>
  <p>
   I am the author of the
   <a href="http://www.python.org">
    Python
   </a>
   programming language.  See also my
   <a href="Resume.html">
    resume
   </a>
   and my
   <a href="Publications.html">
    publicati

In [27]:
#If you want the <head> tag
soup.head

<head>
<title>Guido's Personal Home Page</title>
</head>

In [26]:
#If you want the <title> tag
soup.title

<title>Guido's Personal Home Page</title>

In [20]:
#we can use the string attribute of the title tag
soup.title.string

"Guido's Personal Home Page"

In [60]:
#We can look at the body of the page
#print(soup.body)

In [29]:
#Accessing a tag by name returns only the first tag with that name, here the first <a> tag
soup.a

<a href="pics.html"><img border="0" src="images/IMG_2192.jpg"/></a>

In [30]:
#If you need to get all the <a> tags, you’ll need to use find_all()
soup.find_all('a')

[<a href="pics.html"><img border="0" src="images/IMG_2192.jpg"/></a>,
 <a href="pics.html"><img border="0" height="216" src="images/guido-headshot-2019.jpg" width="270"/></a>,
 <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm"><i>"Gawky and proud of it."</i></a>,
 <a href="images/df20000406.jpg">Who I Am</a>,
 <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">"King's
 Day Speech"</a>,
 <a href="http://www.python.org">Python</a>,
 <a href="Resume.html">resume</a>,
 <a href="Publications.html">publications list</a>,
 <a href="bio.html">brief bio</a>,
 <a href="http://legacy.python.org/doc/essays/">writings</a>,
 <a href="http://legacy.python.org/doc/essays/ppt/">presentations</a>,
 <a href="interviews.html">interviews</a>,
 <a href="pics.html">pictures of me</a>,
 <a href="http://neopythonic.blogspot.com">my new blog</a>,
 <a href="http://www.artima.com/weblogs/index.jsp?blogger=12088">old
 blog</a>,
 <a href="

In [61]:
#To get the hyperlinks, extract the 'href' attribute of each <a> tag (<a> tags define hyperlinks).
a_tags = soup.find_all('a') #Find all 'a' tags

for link in a_tags:
    print(link.get('href')) #hyperlink reference

pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
images/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
Resume.html
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
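
Many of the href values printed above are relative (for example pics.html), so they cannot be requested as they stand. One way to resolve them, sketched here with urljoin from the standard library and assuming the page URL used earlier as the base:

```python
from urllib.parse import urljoin

# The page the links were scraped from.
base_url = "https://www.python.org/~guido/"

# A few of the href values printed above: two relative, one already absolute.
hrefs = ["pics.html", "images/df20000406.jpg", "http://www.python.org"]

# urljoin() resolves relative references and leaves absolute URLs unchanged.
absolute_links = [urljoin(base_url, href) for href in hrefs]

for link in absolute_links:
    print(link)
```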


In [42]:
h3_tags = soup.find_all('h3')
h3_tags

[<h3><a href="images/df20000406.jpg">Who I Am</a></h3>,
 <h3>How to Reach Me</h3>,
 <h3>My Name</h3>,
 <h3>More Hyperlinks</h3>,
 <h3>The Audio File Formats FAQ</h3>]

In [44]:
for tag in h3_tags:
    print(tag.text) #Print the text inside each <h3> tag

Who I Am
How to Reach Me
My Name
More Hyperlinks
The Audio File Formats FAQ


Tags also have attributes that let us navigate through the structure of the document. We can move down and up the tree through a tag's .children and .parent attributes.
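
As a small illustration of moving up and down the tree, the sketch below parses one heading from the page on its own, again assuming html.parser as the parser:

```python
from bs4 import BeautifulSoup

# One heading from the page, parsed on its own.
html_doc = '<h3><a href="images/df20000406.jpg">Who I Am</a></h3>'
soup = BeautifulSoup(html_doc, "html.parser")

link = soup.a

# Up: .parent is the tag that directly contains this one.
parent_name = link.parent.name
print(parent_name)

# Down: .children iterates over the direct children of a tag.
h3_children = list(soup.h3.children)
print(h3_children)

# The only child of the <a> tag is its text.
link_children = [str(child) for child in link.children]
print(link_children)
```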

In [62]:
p_tag = soup.find_all('p')
for tag in p_tag:
    print(tag.text)

"Gawky and proud of it."

Read
my "King's
Day Speech" for some inspiration.


I am the author of the Python
programming language.  See also my resume
and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
pictures of me,
my new blog, and
my old
blog on Artima.com.  I am
@gvanrossum on Twitter.


I am retired, working on personal projects (and maybe a book).
I have worked for Dropbox, Google, Elemental Security, Zope
Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
my resume.)  I created Python while at CWI.


You can send email for me to guido (at) python.org.
I read everything sent there, but I receive too much email to respond
to everything.


My name often poses difficulties for Americans.


Pronunciation: in Dutch, the "G" in Guido is a hard G,
pronounced roughly like the "ch" in Scottish "loch".  (Listen to the
sound clip.)  However, if you're
American, you may also pronounce it as the Italian "Guido".  I'm not
too worrie