# Importing data from the Internet


We will import and save data from the web: making HTTP requests with the [Requests](https://requests.readthedocs.io/en/master/) library, scraping web data (HTML), and parsing that HTML into useful data with the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. Python also has a built-in [urllib](https://docs.python.org/3/library/urllib.html) package for working with URLs.

## Importing flat files

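To import data sets from websites, we use the urllib.request module from the standard library. Two of its functions are used here:

* urlopen(url) - opens a URL and returns a file-like response object
* urlretrieve(url, filename) - downloads the resource at url and saves it to the local file filename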
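
As a minimal sketch of the difference between the two functions, the example below uses a small temporary file and a file:// URL as a stand-in for a remote data set, so it runs without network access. The file names, contents, and variable names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve

# Create a small local file that stands in for a remote CSV (hypothetical data)
workdir = Path(tempfile.mkdtemp())
source = workdir / "source.csv"
source.write_text("a;b\n1;2\n")
url = source.as_uri()  # a file:// URL pointing at the temporary file

# urlopen returns a file-like response object; read() returns the body as bytes
with urlopen(url) as response:
    raw = response.read()

# urlretrieve copies the resource to a local path and returns (filename, headers)
target = workdir / "copy.csv"
saved_path, headers = urlretrieve(url, str(target))

print(raw)
print(saved_path)
```

With an http:// or https:// URL the calls look the same; only the value of url changes.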

In [1]:
from urllib.request import urlretrieve
import pandas as pd

# Assign url of file
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
# Save file locally
urlretrieve(url, 'winequality-white.csv')
# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-white.csv', sep=';')
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
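
The sep=';' argument matters because this file separates its fields with semicolons rather than commas. A small sketch of the effect, using a hypothetical three-column sample held in memory instead of the downloaded file:

```python
import io

import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the wine file
sample_text = "fixed acidity;pH;quality\n7.0;3.0;6\n6.3;3.3;6\n"

# With the correct separator, each field becomes its own column
df_ok = pd.read_csv(io.StringIO(sample_text), sep=';')

# With the default separator (a comma), every line is read as one single field
df_wrong = pd.read_csv(io.StringIO(sample_text))

print(df_ok.shape)     # (2, 3)
print(df_wrong.shape)  # (2, 1)
```

pd.read_csv also accepts a URL directly, so saving the file first with urlretrieve is only needed when a local copy is wanted.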


## Importing non-flat files from the web

In [2]:
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read all sheets of the Excel file into a dict of DataFrames keyed by sheet name
# (reading .xls files requires the xlrd package)
xls = pd.read_excel(url, sheet_name=None)

# Print the sheet names
print(xls.keys())

odict_keys(['1700', '1900'])


### Importing HTML data using urllib.request library

* URL: Uniform Resource Locator. It is a reference (an address) to a web resource.
* HTTP: HyperText Transfer Protocol. It is the foundation of data communication for the web.
* HTML: HyperText Markup Language. It is the standard markup language for web pages.
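
To make the parts of a URL concrete, here is a short sketch that splits the URL used in this section with urlsplit from urllib.parse. The variable names are illustrative only.

```python
from urllib.parse import urlsplit

# Split a URL into its components
parts = urlsplit("http://quotes.toscrape.com/random")

print(parts.scheme)  # http (the protocol used to fetch the resource)
print(parts.netloc)  # quotes.toscrape.com (the host that serves it)
print(parts.path)    # /random (the resource on that host)
```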

In [3]:
from urllib.request import urlopen, Request

url = "http://quotes.toscrape.com/random"
request = Request(url)  # Package the request
response = urlopen(request)  # Send the request and catch the response
html = response.read()  # Read the response body as bytes (the raw HTML)
response.close()  # Close the connection
print(html)

b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bootstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/login">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xe2\x80\x9cSome people never go crazy. What truly horrible lives they must lead.\xe2\x80\x9d</span>\n        <span>by <small class="a
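
The b'...' prefix in the output shows that response.read() returns bytes, not a string. A minimal sketch of turning such bytes into text, assuming the page is encoded as UTF-8 (as its meta charset tag declares); the short byte string below is a hand-made excerpt rather than a live response:

```python
# A short excerpt of raw bytes like those returned by response.read()
raw = b'<span class="text">\xe2\x80\x9cSome people never go crazy.\xe2\x80\x9d</span>'

# Decode the bytes into a Python string (assumes UTF-8)
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)  # bytes str
print(text)
```

After decoding, the escaped byte sequences appear as the curly quotation marks they encode.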

### Importing HTML data using Requests library

In [4]:
# Import package
import requests

url = "http://quotes.toscrape.com/random"
r = requests.get(url)  # Package the request, send it, and catch the response
text = r.text  # Extract the response body as a string
print(text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“That&#39;s the problem with drinking, I thought, as I poured myself a drink. If something bad happens you drink in an attempt to forget; if something good happens you 

### BeautifulSoup

Raw HTML is complicated and difficult to read. Beautiful Soup transforms a complex HTML document into a complex tree of Python objects that can be navigated and searched.
Some of these objects:

* A 'Tag' object corresponds to an XML or HTML tag in the original document.
* A tag may have any number of attributes. You can access the dictionary of attributes directly as '.attrs'.
* The BeautifulSoup object represents the parsed document as a whole.
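
The objects listed above can be inspected on a one-line document. This is a small sketch; the HTML snippet and the choice of the built-in 'html.parser' are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document
html = '<p class="story" id="intro">Once upon a time</p>'
soup = BeautifulSoup(html, 'html.parser')

tag = soup.p  # the first <p> tag, returned as a Tag object

print(type(soup).__name__)  # BeautifulSoup
print(type(tag).__name__)   # Tag
print(tag.name)             # p
print(tag.attrs)            # {'class': ['story'], 'id': 'intro'}
print(tag['id'])            # intro
print(tag.get_text())       # Once upon a time
```

Note that 'class' comes back as a list, because an HTML element can carry several classes at once.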

In [5]:
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/~guido/'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'html.parser')  # Create a BeautifulSoup object from the HTML; naming the parser avoids a warning
pretty_soup = soup.prettify()  # Prettify the BeautifulSoup object
print(pretty_soup)

<html>
 <head>
  <title>
   Guido's Personal Home Page
  </title>
 </head>
 <body bgcolor="#FFFFFF" text="#000000">
  <h1>
   <a href="pics.html">
    <img border="0" src="images/IMG_2192.jpg"/>
   </a>
   Guido van Rossum - Personal Home Page
   <a href="pics.html">
    <img border="0" height="216" src="images/guido-headshot-2019.jpg" width="270"/>
   </a>
  </h1>
  <p>
   <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
    <i>
     "Gawky and proud of it."
    </i>
   </a>
  </p>
  <h3>
   <a href="images/df20000406.jpg">
    Who I Am
   </a>
  </h3>
  <p>
   Read
my
   <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
    "King's
Day Speech"
   </a>
   for some inspiration.
  </p>
  <p>
   I am the author of the
   <a href="http://www.python.org">
    Python
   </a>
   programming language.  See also my
   <a href="Resume.html">
    resume
   </a>
   and my
   <a href="Publications.html">
    publicati

In [6]:
# Navigating the tree

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # Name the parser explicitly to avoid a warning

In [7]:
# Get the <head> tag
soup.head

<head><title>The Dormouse's story</title></head>

In [8]:
# Get the <title> tag
soup.title

<title>The Dormouse's story</title>

In [9]:
# Get the <body> tag
soup.body

<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>

In [10]:
# Get the first <b> tag
soup.b

<b>The Dormouse's story</b>

In [11]:
# Using a tag name as an attribute will give you only the first tag by that name
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [12]:
# To get all the <a> tags, use find_all()
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
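
find_all() can also filter on attributes, and find() returns only the first match. The sketch below uses a shortened copy of the document above; the variable names and the 'html.parser' choice are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# All <a> tags whose class is 'sister' (class_ avoids the Python keyword 'class')
sisters = soup.find_all('a', class_='sister')
names = [a.get_text() for a in sisters]

# The first tag with a given id
second = soup.find(id='link2')

print(names)              # ['Elsie', 'Lacie', 'Tillie']
print(second.get_text())  # Lacie
```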

In [13]:
# <a> tags define hyperlinks. The URL each one points to is stored in its 'href' attribute.

a_tags = soup.find_all('a')  # Find all <a> tags

for link in a_tags:
    print(link.get('href'))  # Print the hyperlink reference

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
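
Real pages often contain relative links, such as href="pics.html" on the page scraped earlier. One possible way to turn them into absolute URLs is urljoin from urllib.parse, sketched here on a hand-written two-link snippet (the link texts are hypothetical):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# URL of the page the links were found on
base_url = 'https://www.python.org/~guido/'

# Hypothetical snippet with one relative and one absolute link
html = '<a href="pics.html">Pictures</a> <a href="http://www.python.org">Python</a>'
soup = BeautifulSoup(html, 'html.parser')

# Pair each link text with its absolute URL
links = [(a.get_text(), urljoin(base_url, a.get('href'))) for a in soup.find_all('a')]

for text, href in links:
    print(text, href)
```

Relative links are resolved against base_url, while links that are already absolute are left unchanged.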
