# Importing Data in Python

In [1]:
import sys
import datetime as dt

In [2]:
# Notebook Info
nb_info = {'Author':'Simon Zahn', 'Last Updated':dt.datetime.now().strftime('%Y-%m-%d %H:%M'), 'Python Version':sys.version }

for k,v in nb_info.items():
    print((k + ':').ljust(18), str(v))

Author:            Simon Zahn
Last Updated:      2019-07-28 15:49
Python Version:    3.7.3 (default, Apr 24 2019, 15:29:51) [MSC v.1915 64 bit (AMD64)]


--------------------------
### Table of Contents
1. [Imports and Top Matter](#Imports-and-Top-Matter)
1. [Importing Data from the Web](#Importing-Data-from-the-Web)
1. [Introduction to APIs and JSONs](#Introduction-to-APIs-and-JSONs)
--------------------------

### Imports and Top Matter
[[back to top]](#Table-of-Contents)

In [32]:
# standard library
from urllib.request import urlretrieve, urlopen, Request

# general
import requests
from bs4 import BeautifulSoup

# IPython
from IPython.display import display, Image

# analysis
import numpy as np
import pandas as pd
from scipy import stats, special

# plotting
import matplotlib.pyplot as plt
import seaborn as sns

In [5]:
pal = sns.color_palette()

### Importing Data from the Web
[[back to top]](#Table-of-Contents)

In [10]:
# Assign url of file
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally (the 'data' folder must already exist)
urlretrieve(url, 'data/winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('data/winequality-red.csv', sep=';')
display(df.head())

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
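
The download step above writes into a `data` folder and fetches the file again on every run. As a minimal sketch, assuming we only want to download when the local copy is missing, the helper below (`download_if_missing` is a hypothetical name, not part of `urllib` or pandas) creates the folder if needed and skips the download when the file is already there.

```python
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest as a Path.

    Hypothetical helper for illustration, not part of urllib or pandas.
    """
    dest = Path(dest)
    # Create the target folder (e.g. 'data/') if it is not there yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Only hit the network when there is no local copy
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest
```

With the names used in this notebook, a call could look like `download_if_missing(url, 'data/winequality-red.csv')`, and the returned path can be passed straight to `pd.read_csv()`.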


We just imported a file from the web, saved it locally, and loaded it into a DataFrame. To load a file from the web into a DataFrame without first saving it locally, we can pass the URL straight to pandas: call `pd.read_csv()` with the URL as the first argument and the separator as the `sep` keyword argument.

In [12]:
# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')
display(df.head())

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
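
To see what the `sep=';'` argument is doing, here is a rough sketch that uses the standard library's `csv` module as a stand-in for pandas. The two-line sample string is an assumption: it only imitates the shape of the wine file, with three of its columns.

```python
import csv
import io

# Two lines shaped like the wine file: fields separated by ';' rather than ','
raw = 'fixed acidity;volatile acidity;quality\n7.4;0.7;5\n'

with_semicolon = list(csv.reader(io.StringIO(raw), delimiter=';'))
with_comma = list(csv.reader(io.StringIO(raw)))  # default delimiter is ','

print(with_semicolon[0])  # ['fixed acidity', 'volatile acidity', 'quality']
print(with_comma[0])      # ['fixed acidity;volatile acidity;quality']
```

With the wrong separator, the whole header ends up in a single field, which is why the separator has to be stated explicitly for this file.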


`pd.read_csv()` has close relatives that let us load many other file types, not only flat files. Now let's use `pd.read_excel()` to import an Excel spreadsheet.

In [14]:
# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xl
xl = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xl.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xl['1700'].head())

odict_keys(['1700', '1900'])
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000


Now let's perform an HTTP GET request. We will send a GET request to DataCamp's servers to extract information from the teach page, "http://www.datacamp.com/teach/documentation".

In [17]:
# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Be polite and close the response!
response.close()

<class 'http.client.HTTPResponse'>


We just packaged and sent a GET request to "http://www.datacamp.com/teach/documentation" and caught the response, which is an `http.client.HTTPResponse` object. The question remains: what can we do with this response?

Since it came from an HTML page, we can read it to extract the HTML: an `http.client.HTTPResponse` object has a `read()` method.

In [27]:
# Specify the url
url = "http://www.datacamp.com/teach/documentation"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Be polite and close the response!
response.close()

In [28]:
# print first 500 characters of the HTML
print(html[:500])

b'<!doctype html>\n<html lang="en" data-direction="ltr">\n  <head>\n    <link href="https://fonts.intercomcdn.com" rel="preconnect" crossorigin>\n      <script src="https://www.googletagmanager.com/gtag/js?id=UA-39297847-9" async="async" nonce="YiZNnCUb11wKNi5AHU3oBZRaEw1DgVrSOA+6ooGhFI8="></script>\n      <script nonce="YiZNnCUb11wKNi5AHU3oBZRaEw1DgVrSOA+6ooGhFI8=">\n        window.dataLayer = window.dataLayer || [];\n        function gtag(){dataLayer.push(arguments);}\n        gtag(\'js\', new Date());\n  '
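
The `b'...'` prefix in the output above shows that `read()` returns bytes, not a string. As a minimal sketch, assuming we want text and are happy to fall back to UTF-8 when the server does not declare a charset, the helper below (`fetch_text` is a hypothetical name) decodes the body and uses a `with` block so the response is closed for us.

```python
from urllib.request import Request, urlopen


def fetch_text(url, default_charset="utf-8"):
    """Return the body at url as a str (hypothetical helper for illustration)."""
    request = Request(url)
    # The with-block closes the response even if read() raises
    with urlopen(request) as response:
        raw = response.read()  # bytes
        # Charset declared in the Content-Type header, if any
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)
```

Calling `fetch_text(url)[:500]` with the URL from the cell above would then give a `str` rather than a `bytes` object.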


Now that we've got our heads around making HTTP requests with the `urllib` package, let's do the same with the higher-level `requests` library. We'll once again request DataCamp's "http://www.datacamp.com/teach/documentation" page. Note that, unlike in the previous examples using `urllib`, we don't have to close the connection explicitly when using `requests`.

In [31]:
# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(type(text))
print(text[:500])

<class 'str'>
<!doctype html>
<html lang="en" data-direction="ltr">
  <head>
    <link href="https://fonts.intercomcdn.com" rel="preconnect" crossorigin>
      <script src="https://www.googletagmanager.com/gtag/js?id=UA-39297847-9" async="async" nonce="BjijvvUNAMUAhga9VJZ7LANkmbZww41onoJPjXJfRzo="></script>
      <script nonce="BjijvvUNAMUAhga9VJZ7LANkmbZww41onoJPjXJfRzo=">
        window.dataLayer = window.dataLayer || [];
        function gtag(){dataLayer.push(arguments);}
        gtag('js', new Date());
  


Now let's learn how to use the BeautifulSoup package to parse, prettify and extract information from HTML. We'll scrape the data from the webpage of Guido van Rossum, Python's very own Benevolent Dictator for Life.

In [40]:
# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML, naming the parser explicitly: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Print first 430 characters of Guido's text
print(soup.get_text()[:430])

<title>Guido's Personal Home Page</title>


Guido's Personal Home Page




Guido van Rossum - Personal Home Page


"Gawky and proud of it."
Who
I Am
Read
my "King's
Day Speech" for some inspiration.

I am the author of the Python
programming language.  See also my resume
and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
pictures of me,
my new blog, and
my old
blog on Artima.com.  I am
@gvanrossum on Twitter


In [44]:
url = 'https://www.python.org/~guido/'    # Specify url
r = requests.get(url)                     # Package the request, send the request and catch the response: r
html_doc = r.text                         # Extracts the response as html: html_doc
soup = BeautifulSoup(html_doc, 'html.parser')  # create a BeautifulSoup object from the HTML: soup

print(soup.title, end='\n\n')             # Print the title of Guido's webpage

a_tags = soup.find_all('a')               # Find all 'a' tags (which define hyperlinks): a_tags

print('List of links:',''.center(30, '-'), sep='\n')
for link in a_tags:                       # Print the URLs to the shell
    print(link.get('href'))

<title>Guido's Personal Home Page</title>

List of links:
------------------------------
pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
http://www.dropbox.com
Resume.html
http://groups.google.com/groups?q=comp.lang.python
http://stackoverflow.com
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
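
Several of the links above, such as `pics.html`, are relative to the page they came from. A small sketch of how they could be turned into absolute URLs with `urllib.parse.urljoin` from the standard library follows; the three `href` values are copied from the output above.

```python
from urllib.parse import urljoin

base_url = 'https://www.python.org/~guido/'

# A few href values copied from the output above
hrefs = ['pics.html', 'images/license.jpg', 'http://www.python.org']

# Relative links are resolved against base_url; absolute ones pass through unchanged
absolute_links = [urljoin(base_url, href) for href in hrefs]
for link in absolute_links:
    print(link)

# https://www.python.org/~guido/pics.html
# https://www.python.org/~guido/images/license.jpg
# http://www.python.org
```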


### Introduction to APIs and JSONs
[[back to top]](#Table-of-Contents)

Let's move on to JSON files. We'll start by loading one into our Python environment and exploring it. Here, we'll load the JSON file 'a_movie.json' into the variable `json_data`, which will be a dictionary. We'll then explore the contents by printing the key-value pairs of `json_data` to the shell.
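
Loading a local JSON file can be sketched as follows. Since 'a_movie.json' itself is not included here, the example first writes a small stand-in file to a temporary folder; its three fields are an assumption chosen for illustration, not the real file's full contents.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for 'a_movie.json': a few fields only, written to a temporary folder
sample = {"Title": "The Social Network", "Year": "2010", "Rated": "PG-13"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "a_movie.json"
    path.write_text(json.dumps(sample), encoding="utf-8")

    # Load the JSON file into a dictionary: json_data
    with open(path, encoding="utf-8") as json_file:
        json_data = json.load(json_file)

# Print each key-value pair in json_data
for key, value in json_data.items():
    print(key + ':', value)
```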

We'll pull some movie data down from the Open Movie Database (OMDB) using their API. The movie we'll query the API about is The Social Network. Recall that to query the API about the movie Hackers, Hugo's query string was 'http://www.omdbapi.com/?t=hackers' and had a single argument t=hackers.

Note: OMDB has since changed its API, and you now also have to specify an API key. This means adding another argument to the URL: `apikey=72bc447a`.
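
Rather than typing the query string by hand, it can be assembled from a dictionary with `urllib.parse.urlencode`, which escapes the values and joins the arguments with `&`. This is a minimal sketch; `build_omdb_url` is a hypothetical helper name, not part of any OMDB client.

```python
from urllib.parse import urlencode


def build_omdb_url(title, api_key):
    """Build an OMDB query URL (hypothetical helper, not part of any OMDB client)."""
    # urlencode escapes each value (spaces become '+') and joins the pairs with '&'
    query = urlencode({'apikey': api_key, 't': title})
    return 'http://www.omdbapi.com/?' + query


url = build_omdb_url('the social network', '72bc447a')
print(url)  # http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network
```

If you are already using `requests`, passing the same dictionary through the `params` argument of `requests.get()` should produce an equivalent URL.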

In [47]:
url = 'http://www.omdbapi.com?apikey=72bc447a&t=the+social+network'      # Assign URL to variable: url
r = requests.get(url)                                                    # Package the request, send the request and catch the response: r

print(r.text)                                                            # Print the text of the response

{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"Harvard student Mark Zuckerberg creates the social networking site. That would become known as Facebook but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 165 wins & 168 nominations.","Poster":"https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"95%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"571,335","imdbID":"tt

We've just queried our first API programmatically in Python and printed the text of the response to the shell. However, the response is actually JSON, so we can go one step further and decode it. We can then print the key-value pairs of the resulting dictionary.
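
As a rough sketch of what decoding does, the standard library's `json.loads` turns JSON text into Python objects. The string below is a shortened copy of the response text printed above.

```python
import json

# Shortened copy of the response text printed above
text = ('{"Title":"The Social Network","Year":"2010",'
        '"Ratings":[{"Source":"Rotten Tomatoes","Value":"95%"}]}')

movie = json.loads(text)

print(type(movie))                   # <class 'dict'>
print(type(movie['Ratings']))        # <class 'list'>
print(movie['Ratings'][0]['Value'])  # 95%
print(type(movie['Year']))           # <class 'str'>
```

Note that `Year` stays a string because the API sends it in quotes. It would need an explicit `int(movie['Year'])` to be used as a number.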

In [49]:
url = 'http://www.omdbapi.com?apikey=72bc447a&t=the+social+network'      # Assign URL to variable: url
r = requests.get(url)                                                    # Package the request, send the request and catch the response: r
json_data = r.json()                                                     # Decode the JSON data into a dictionary: json_data

for k in json_data.keys():                                               # Print each key-value pair in json_data
    print(k + ': ', json_data[k])

Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors:  Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot:  Harvard student Mark Zuckerberg creates the social networking site. That would become known as Facebook but is later sued by two brothers who claimed he stole their idea, and the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  USA
Awards:  Won 3 Oscars. Another 165 wins & 168 nominations.
Poster:  https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '95%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.7
imdbVotes:  571,335
imdbID:  tt1285016
Type:  movie
DVD:

We're going to throw one more API at you: the Wikipedia API (documented [here](https://www.mediawiki.org/wiki/API:Main_page)). We'll figure out how to find and extract information from the Wikipedia page for Pizza. What gets a bit wild here is that our query will return nested JSON, that is, JSON objects within JSON objects, but Python handles that by translating them into dictionaries within dictionaries.
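
To illustrate "dictionaries within dictionaries", the sketch below uses a hand-written dictionary shaped like this kind of response. The structure and the shortened extract are assumptions for illustration; a real response carries more fields. Because the innermost dictionary is keyed by a page ID, the loop walks over the values so the ID does not have to be known in advance.

```python
# Hand-written stand-in shaped like the API response; the extract is shortened
sample_response = {
    'query': {
        'pages': {
            '24768': {
                'pageid': 24768,
                'title': 'Pizza',
                'extract': '<p><b>Pizza</b> is a savory dish of Italian origin.</p>',
            }
        }
    }
}

# 'pages' is keyed by page ID, so loop over its values instead of hard-coding the ID
extracts = []
for page in sample_response['query']['pages'].values():
    print(page['title'])
    extracts.append(page['extract'])

print(extracts[0])
```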

In [51]:
# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response, and decode JSON
r = requests.get(url)
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)

<p class="mw-empty-elt">
</p>

<p><b>Pizza</b> (<small>Italian: </small><span title="Representation in the International Phonetic Alphabet (IPA)">[ˈpittsa]</span>, <small>Neapolitan: </small><span title="Representation in the International Phonetic Alphabet (IPA)">[ˈpittsə]</span>) is a savory dish of Italian origin, consisting of a usually round, flattened base of leavened wheat-based dough topped with tomatoes, cheese, and various other ingredients (anchovies, olives, meat, etc.) baked at a high temperature, traditionally in a wood-fired oven. In formal settings, like a restaurant, pizza is eaten with knife and fork, but in casual settings it is cut into wedges to be eaten while held in the hand. Small pizzas are sometimes called pizzettas.
</p><p>The term <i>pizza</i> was first recorded in the 10th century in a Latin manuscript from the Southern Italian town of Gaeta in Lazio, on the border with Campania. Modern pizza was invented in Naples, and the dish and its variants have since 