# Importing Data in Python (Part 2)

## Chapter 1: Importing Data From the Internet

### Importing flat files from the web

#### The urllib package
* Provides interface for fetching data across the web
* `urlopen()` - accepts URLs instead of file names
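
As a minimal sketch of the point above, `urlopen()` takes a URL where `open()` would take a file name. To stay runnable offline, the example reads a temporary local file through its `file://` URL; the helper name `fetch_text` and the sample contents are made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Open a URL and return its body decoded as text."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Write a tiny semicolon-delimited file, then read it back through its file:// URL.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.csv"
    sample_path.write_bytes(b"quality;alcohol\n5;9.4\n")
    sample_text = fetch_text(sample_path.as_uri())

print(sample_text)
```

The same helper should work with an `http://` or `https://` URL, since `urlopen()` picks the handler from the URL scheme.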

#### How to automate file download in Python

In [1]:
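import ssl

# Workaround: turns off HTTPS certificate verification for urllib.
# Only use this for URLs you trust.
ssl._create_default_https_context = ssl._create_unverified_context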

In [2]:
from urllib.request import urlretrieve

url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

urlretrieve(url, 'winequality-red.csv')

('winequality-red.csv', <http.client.HTTPMessage at 0x105827ba8>)
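
When automating downloads, it can help to skip files that are already on disk. The following is a sketch under that assumption; `download_if_missing` is a hypothetical helper, and a local `file://` URL stands in for the remote CSV so the example runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_if_missing(url, destination):
    """Download url to destination unless the file already exists."""
    destination = Path(destination)
    if not destination.exists():
        urlretrieve(url, str(destination))
    return destination


with tempfile.TemporaryDirectory() as tmp_dir:
    source = Path(tmp_dir) / "source.csv"
    source.write_bytes(b"quality;alcohol\n5;9.4\n")
    target = Path(tmp_dir) / "copy.csv"

    download_if_missing(source.as_uri(), target)  # first call copies the file
    source.write_bytes(b"changed\n")
    download_if_missing(source.as_uri(), target)  # second call is skipped
    cached_bytes = target.read_bytes()

print(cached_bytes)
```

The trade-off is that a stale local copy is never refreshed; delete the file to force a new download.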

#### Read in file using pandas directly (without saving locally)

In [3]:
import pandas as pd

url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

df = pd.read_csv(url, sep=';')  # the file is semicolon-delimited

df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


#### Read in excel files from the web

In [4]:
import pandas as pd

url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

xl = pd.read_excel(url, sheet_name=None)  # reads in all sheets (dict keyed by sheet name)

print(xl.keys())  # prints sheet names

print(xl['1700'].head())


odict_keys(['1700', '1900'])
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000


### HTTP requests to import files from the web

#### HTTP
* HyperText Transfer Protocol
* Foundation of data communication for the web
* HTTPS - more secure form of HTTP
* Going to a website = sending HTTP request
    * GET request
* `urlretrieve()` performs a GET request
* HTML - HyperText Markup Language
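
To make the GET request idea concrete, here is a small sketch that builds `Request` objects without sending them and inspects which HTTP method urllib would use. The `q=python` body is an arbitrary placeholder.

```python
from urllib.request import Request

url = "https://www.wikipedia.org/"

get_request = Request(url)                     # no body: defaults to GET
post_request = Request(url, data=b"q=python")  # a body switches the default to POST

print(get_request.get_method())
print(post_request.get_method())
print(get_request.type, get_request.host)      # URL scheme and host
```

Nothing is sent over the network here; the request only goes out once it is passed to `urlopen()`.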

#### GET requests using urllib

In [6]:
from urllib.request import urlopen, Request

url = "https://www.wikipedia.org/"

request = Request(url)

response = urlopen(request)

html = response.read()

print(html[:500])

response.close()

b'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n<meta charset="utf-8">\n<title>Wikipedia</title>\n<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">\n<![if gt IE 7]>\n<script>\ndocument.documentElement.className = document.documentElement.className.replace( /(^|\\s)no-js(\\s|$)/, "$1js-enabled$2" );\n</script>\n<![endif]>\n<!--[if lt IE 7]><meta http-equiv="imagetoolbar" content="no">'
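
The `b'...'` prefix in the output shows that `response.read()` returns bytes, not text. A minimal sketch of the decoding step, using the first few lines of the output above and assuming the page is UTF-8 encoded (as its `<meta charset>` tag states):

```python
# First lines of the raw response shown above, as bytes.
raw_html = (
    b'<!DOCTYPE html>\n<html lang="mul" class="no-js">\n<head>\n'
    b'<meta charset="utf-8">\n<title>Wikipedia</title>'
)

# Decode bytes into a str so it can be printed and searched as text.
text_html = raw_html.decode("utf-8")

print(type(raw_html).__name__, type(text_html).__name__)
print(text_html)
```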


#### GET requests using requests

In [7]:
import requests

url = "https://www.wikipedia.org/"

r = requests.get(url)

text = r.text

print(text[:500])

<!DOCTYPE html>
<html lang="mul" class="no-js">
<head>
<meta charset="utf-8">
<title>Wikipedia</title>
<meta name="description" content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.">
<![if gt IE 7]>
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<![endif]>
<!--[if lt IE 7]><meta http-equiv="imagetoolbar" content="no">


### Scraping the web in Python

#### HTML
* Mix of unstructured and structured data

#### BeautifulSoup
* Parse and extract structured data from HTML
* Make tag soup beautiful and extract information

In [9]:
from bs4 import BeautifulSoup

import requests

url = 'https://www.crummy.com/software/BeautifulSoup/'

r = requests.get(url)

html_doc = r.text

soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly to avoid a warning

print(soup.prettify()[:500])

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/transitional.dtd">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <title>
   Beautiful Soup: We called him Tortoise because he taught us.
  </title>
  <link href="mailto:leonardr@segfault.org" rev="made"/>
  <link href="/nb/themes/Default/nb.css" rel="stylesheet" type="text/css"/>
  <meta content="Beautiful Soup: a library designed for screen-scraping HTML and X


In [10]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text[:500])

<title>Guido's Personal Home Page</title>


Guido's Personal Home Page




Guido van Rossum - Personal Home Page


"Gawky and proud of it."
Who
I Am
Read
my "King's
Day Speech" for some inspiration.

I am the author of the Python
programming language.  See also my resume
and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
pictures of me,
my new blog, and
my old
blog on Artima.com.  I am
@gvanrossum on Twitter.

I am retired, working on personal projects (and maybe a book).
I ha


In [11]:
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))

<title>Guido's Personal Home Page</title>
pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
http://metalab.unc.edu/Dave/Dr-Fun/df200004/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088
https://twitter.com/gvanrossum
Resume.html
http://groups.google.com/groups?q=comp.lang.python
http://stackoverflow.com
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
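
Many of the `href` values printed above are relative (for example `pics.html`). A sketch of one way to turn them into absolute URLs with the standard library's `urljoin`; the `hrefs` list is a hand-picked subset of the output above, plus a `None` to mimic an `<a>` tag with no `href` attribute.

```python
from urllib.parse import urljoin

base_url = "https://www.python.org/~guido/"

# Subset of the hrefs printed above; None stands for a tag without an href.
hrefs = [
    "pics.html",
    "images/license.jpg",
    "http://www.python.org",
    "https://twitter.com/gvanrossum",
    None,
]

# Relative links are resolved against the page URL; absolute ones pass through.
absolute_links = [urljoin(base_url, href) for href in hrefs if href]

for link in absolute_links:
    print(link)
```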


## Chapter 2: Interacting with APIs to import data from the web

### Introduction to APIs and JSONs

#### APIs
* Application Programming Interface
* Protocols and routines
    * Building and interacting with software applications
    
#### JSONs
* JavaScript Object Notation
* Real-time server-to-browser communication
* Popularized by Douglas Crockford
* Human readable

#### Loading JSONs in Python

In [None]:
import json

with open('datasets/tweets.json', 'r') as json_file:
    json_data = json.load(json_file)
    
type(json_data)

In [None]:
for key, value in json_data.items():
    print(key + ':', value)
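
Since `datasets/tweets.json` is not included here, the following self-contained sketch writes a small made-up JSON file to a temporary directory and loads it the same way. The `sample` record is invented for illustration and is not real tweet data.

```python
import json
import tempfile
from pathlib import Path

# Invented record; real tweet JSON has many more fields.
sample = {"text": "Hello, world", "lang": "en", "user": {"screen_name": "example_user"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = Path(tmp_dir) / "sample_tweet.json"
    json_path.write_text(json.dumps(sample), encoding="utf-8")

    # json.load reads from a file object and returns Python objects.
    with open(json_path, "r", encoding="utf-8") as json_file:
        loaded = json.load(json_file)

print(type(loaded).__name__)
for key, value in loaded.items():
    print(key + ":", value)

# json.loads does the same for a string already in memory.
from_string = json.loads('{"lang": "en"}')
```

A JSON object becomes a Python `dict`, so nested values are reached with ordinary indexing, for example `loaded["user"]["screen_name"]`.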

### APIs and interacting with the world wide web

#### What is an API?
* Set of protocols and routines
* Bunch of code
    * Allows two software programs to communicate with each other
    
#### Connecting to an API in Python


In [13]:
import requests

url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

r = requests.get(url)

print(r.text)

{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin (screenplay), Ben Mezrich (book)","Actors":"Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"USA","Awards":"Won 3 Oscars. Another 165 wins & 168 nominations.","Poster":"https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"579,002","imdbID":"tt1285

In [14]:
json_data = r.json()

for key, value in json_data.items():
    print(key + ":", value)

Title: The Social Network
Year: 2010
Rated: PG-13
Released: 01 Oct 2010
Runtime: 120 min
Genre: Biography, Drama
Director: David Fincher
Writer: Aaron Sorkin (screenplay), Ben Mezrich (book)
Actors: Jesse Eisenberg, Rooney Mara, Bryan Barter, Dustin Fitzsimons
Plot: As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.
Language: English, French
Country: USA
Awards: Won 3 Oscars. Another 165 wins & 168 nominations.
Poster: https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore: 95
imdbRating: 7.7
imdbVotes: 579,002
imdbID: tt1285016
Type: movie
DVD: 11 Jan 2011
BoxOffice: 
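
Note that `Ratings` is a list of dictionaries and that numeric fields such as `Year` arrive as strings. A short sketch of working with that structure, using a fragment of the response above parsed with `json.loads` (which yields the same kind of dictionary as `r.json()`):

```python
import json

# Fragment of the OMDb response printed above.
response_text = (
    '{"Title":"The Social Network","Year":"2010",'
    '"Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},'
    '{"Source":"Rotten Tomatoes","Value":"96%"},'
    '{"Source":"Metacritic","Value":"95/100"}]}'
)

movie = json.loads(response_text)

# Flatten the list of {"Source": ..., "Value": ...} dicts into one lookup table.
ratings_by_source = {rating["Source"]: rating["Value"] for rating in movie["Ratings"]}

# Values are strings, so convert explicitly before doing arithmetic.
year = int(movie["Year"])

print(ratings_by_source["Rotten Tomatoes"])
print(year + 1)
```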

#### What was that URL?

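* http - making an HTTP request
* www.omdbapi.com - querying the OMDb API
* ?t=hackers
    * Query string
    * Returns data for a movie with the title (t) `Hackers`
* The request above follows the same pattern: `t=the+social+network`, plus an `apikey` parameter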
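
The URL anatomy above can be sketched with `urllib.parse`: build the query string from a dictionary, then split the finished URL back into its parts. No request is sent, so no API key is needed.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://www.omdbapi.com/"

# Build the query string from parameters instead of typing it by hand.
query_string = urlencode({"t": "hackers"})
url = base_url + "?" + query_string
print(url)

# Split the URL back into scheme, host and query parameters.
parts = urlsplit(url)
params = parse_qs(parts.query)
print(parts.scheme, parts.netloc, params)

# Spaces in a value are encoded as '+', as in the request sent earlier.
encoded_title = urlencode({"t": "the social network"})
print(encoded_title)
```

With the requests package, passing a dictionary as `requests.get(url, params=...)` does this encoding for you.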

## Chapter 3: Diving deep into the Twitter API

### The Twitter API and Authentication

#### You'll learn:
* How to stream data from the Twitter API
* How to filter incoming tweets for keywords
* About API Authentication and OAuth
* How to use the Tweepy Python package

#### Access the Twitter API
* Create an account

#### Twitter has a number of APIs
* REST APIs
    * Allow the user to read and write Twitter data
* The Streaming APIs - The Public Stream

#### Using Tweepy: Authentication handler

In [None]:
import tweepy, json

access_token = "..."
access_token_secret = "..."
consumer_key = "..."
consumer_secret = "..."

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
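
The four credentials are secrets, so one common option is to read them from environment variables rather than typing them into the notebook. The sketch below assumes that approach; the variable names (`TWITTER_CONSUMER_KEY` and so on) and the helper `load_credentials` are hypothetical, and the demo uses placeholder values instead of real keys.

```python
import os

# Hypothetical environment variable names, one per credential.
CREDENTIAL_VARS = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(environ=None):
    """Return the four credentials, failing loudly if any is missing."""
    environ = os.environ if environ is None else environ
    missing = [var for var in CREDENTIAL_VARS.values() if var not in environ]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARS.items()}


# Demo with placeholder values; in practice, call load_credentials() with no argument.
demo_environ = {var: "placeholder-" + name for name, var in CREDENTIAL_VARS.items()}
credentials = load_credentials(demo_environ)
print(sorted(credentials))
```

The resulting values would then be passed to `tweepy.OAuthHandler(...)` and `auth.set_access_token(...)` as in the cell above.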