# Getting Data
This notebook shows how to download data that is available on the Internet. It covers the formats in which data is most commonly published, with example Python code and command-line utilities for retrieving each one.

TOPIC1: Getting data from a Web URL: text, HTML, XML, PDF.

TOPIC2: Crawling/Scraping data from the Web (entire websites).

TOPIC3: Getting data via APIs (JSON format).

## TOPIC1: Getting data from a Web URL: text, HTML, XML, PDF.

In [1]:
#To check which Python version and virtual environment Jupyter Notebook uses
import sys
print(sys.executable)
print()
print(sys.version_info)
print()
print(sys.path)

#If you find that Jupyter Notebook does not point to the required virtual environment
#remove the venv and re-create using
#conda create --name comp47350py37 python=3.7 jupyter
#Use 'conda install' or 'pip install' to re-install required packages

C:\Users\User\Anaconda3\envs\comp47350py37\python.exe

sys.version_info(major=3, minor=7, micro=2, releaselevel='final', serial=0)

['C:\\Users\\User\\Anaconda3\\OOP_Jupiter\\Data Analytics', 'C:\\Users\\User\\Anaconda3\\envs\\comp47350py37\\python37.zip', 'C:\\Users\\User\\Anaconda3\\envs\\comp47350py37\\DLLs', 'C:\\Users\\User\\Anaconda3\\envs\\comp47350py37\\lib', 'C:\\Users\\User\\Anaconda3\\envs\\comp47350py37', '', 'C:\\Users\\User\\Anaconda3\\envs\\comp47350py37\\lib\\site-packages', 'C:\\Users\\User\\Anaconda3\\envs\\comp47350py37\\lib\\site-packages\\IPython\\extensions', 'C:\\Users\\User\\.ipython']
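The printout above can be complemented with a small programmatic check. The sketch below uses only the standard library; the helper name `describe_environment` is ours, and it assumes that the `CONDA_DEFAULT_ENV` variable is set only while a conda environment is active. Note that `in_venv` detects environments created with `venv`/`virtualenv`; a conda environment is usually reported through `conda_env` instead.

```python
import os
import sys

def describe_environment():
    """Summarise which interpreter and environment this kernel is using."""
    return {
        "executable": sys.executable,
        "python_version": "{}.{}.{}".format(*sys.version_info[:3]),
        # True for environments created with venv/virtualenv
        "in_venv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # Name of the active conda environment, or None if conda is not active
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }

info = describe_environment()
for key, value in info.items():
    print(f"{key}: {value}")
```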


In [5]:
#Import all required packages

#Import package 'requests' for downloading content from URLs
import requests
# Import package for reading csv files 
import pandas as pd
#import package 'beautifulsoup' to extract the content of HTML fields 
from bs4 import BeautifulSoup
#import package 'feedparser'
#Feedparser is a library to parse RSS/XML feeds, these are files with a specific XML structure
#If you don't have it, install using conda or pip, e.g.,: conda install feedparser
import feedparser
#import package 'json' to parse json objects
import json
#Import the necessary methods from the "twitter" library
from twitter import Twitter, OAuth, TwitterHTTPError, TwitterStream

#Look at the package structure to understand how to use it
#print(dir(requests))

#Look at individual functions
help(requests.get)
#As an alternative can use '?', same as help() but opens a new window
#?requests.get

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the body of the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response
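For a GET request, current versions of requests append `params` to the URL as a query string. The sketch below shows what that encoding looks like using only the standard library, so it runs without a network connection. The base URL and the parameter names are hypothetical.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint and parameters, for illustration only
base_url = "https://example.com/search"
params = {"q": "alice in wonderland", "page": 2}

# Build the URL that a GET request with these parameters would target
full_url = f"{base_url}?{urlencode(params)}"
print(full_url)

# Decode the query string again to check the round trip
decoded = parse_qs(urlsplit(full_url).query)
print(decoded)
```

With requests, the same parameters would be passed as `requests.get(base_url, params=params)`.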



In [6]:
#Get a text file.
#Get book "Alice's Adventures in Wonderland" from Project Gutenberg, in text format

#Give the URL for the file to be downloaded
url='http://www.gutenberg.org/cache/epub/11/pg11.txt'

#Look at the object returned by requests.get()
#Use the name 'response' rather than 'object', which would shadow a Python built-in
response = requests.get(url)
#print(response.encoding)
#print(response.links)

#Get the content from the downloaded text file
text_page = response.text
#print(text_page)

#Look at the first 500 characters of the book
print(text_page[:500])


﻿Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Alice's Adventures in Wonderland

Author: Lewis Carroll

Posting Date: June 25, 2008 [EBook #11]
Release Date: March, 1994
[Last updated: December 20, 2011
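A downloaded UTF-8 text file may begin with an invisible byte-order mark (BOM), which then shows up as the first character of the string. The sketch below, on assumed sample bytes rather than the real download, shows the difference between decoding with `utf-8` and with `utf-8-sig`. With requests, the same decoding could be applied to the raw bytes in `response.content`.

```python
import codecs

# Assumed sample: the first bytes of a UTF-8 file that starts with a BOM
raw_bytes = codecs.BOM_UTF8 + "Project Gutenberg's Alice".encode("utf-8")

plain = raw_bytes.decode("utf-8")      # the BOM survives as '\ufeff'
clean = raw_bytes.decode("utf-8-sig")  # the BOM is removed

print(repr(plain[:8]))
print(repr(clean[:8]))
```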


In [7]:
# Reading from a csv file, into a data frame
df = pd.read_csv('MotorInsuranceFraudClaimABTFull.csv')

# Show the data frame
#df

# Show first 10 rows of data frame
# The rows are indexed starting from 0
df.head(10)

# Check how many rows and columns this dataframe has
#df.shape


Unnamed: 0,ID,Insurance Type,Income of Policy Holder,Marital Status,Num Claimants,Injury Type,Overnight Hospital Stay,Claim Amount,Total Claimed,Num Claims,Num Soft Tissue,% Soft Tissue,Claim Amount Received,Fraud Flag
0,1,CI,0,,2,Soft Tissue,No,1625,3250,2,2.0,1.0,0,1
1,2,CI,0,,2,Back,Yes,15028,60112,1,0.0,0.0,15028,0
2,3,CI,54613,Married,1,Broken Limb,No,-99999,0,0,0.0,0.0,572,0
3,4,CI,0,,3,Serious,Yes,270200,0,0,0.0,0.0,270200,0
4,5,CI,0,,4,Soft Tissue,No,8869,0,0,0.0,0.0,0,1
5,6,CI,0,,1,Broken Limb,Yes,17480,0,0,0.0,0.0,17480,0
6,7,CI,52567,Single,3,Broken Limb,No,3017,18102,2,1.0,0.5,0,1
7,8,CI,0,,2,Back,Yes,7463,0,0,0.0,0.0,7463,0
8,9,CI,0,,1,Soft Tissue,No,2067,0,0,,0.0,2067,0
9,10,CI,42300,Married,4,Back,No,2260,0,0,0.0,0.0,2260,0
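CSV data that has been downloaded as text does not need to be saved to disk before it is parsed. The sketch below is a standard-library illustration using three columns and two rows copied from the table above. Treating `-99999` as a marker for a missing claim amount is our assumption, not something the data set documents.

```python
import csv
import io

# Three columns and two rows copied from the table above; in practice this
# string would be the text of a downloaded CSV file.
csv_text = """ID,Injury Type,Claim Amount
1,Soft Tissue,1625
3,Broken Limb,-99999
"""

MISSING_SENTINEL = "-99999"  # assumption: this value marks a missing amount

rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    amount = record["Claim Amount"]
    # Convert the amount to an integer, or to None when it is the sentinel
    record["Claim Amount"] = None if amount == MISSING_SENTINEL else int(amount)
    rows.append(record)

print(rows)
```

pandas accepts the same kind of in-memory text: `pd.read_csv(io.StringIO(csv_text), na_values=[-99999])` would load it into a data frame with the sentinel read as missing.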


In [8]:
#Get an HTML file.
#Get news article from IrishTimes website.

#Give the URL for the file to be downloaded
url = 'http://www.irishtimes.com/news/science/evidence-of-new-planet-10-times-size-of-earth-in-solar-system-1.2505306'

#Get the content from the downloaded html file
html_page = requests.get(url).text

#Look at the format of the html file
print(html_page[:5000])




                                        
                                                                                                                                                                                                                                                            
        

                    
                
        
    
        
<!doctype html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.irishtimes.com/" version="HTML+RDFa 1.1"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.irishtimes.com/" version="HTML+RDFa 1.1"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en" prefix="og:http://ogp.me/ns# fb:http://www.facebook.com/2008/fbml irishtimes:http://www.irishtimes.com/" version="HTML+RDFa 1.1"> <![endif]-->
<

In [9]:
#Use package 'beautifulsoup' to extract the content of HTML fields 
#Need to know the HTML structure and the tags containing the information we need
#To look at the HTML file open it in a text editor, look for the tags that contain headline, subheadline, article body 
#If you don't have beautifulsoup4 installed, run in shell: conda install beautifulsoup4

# Method to parse the structure of an html page using package beautifulsoup.
# The code looks for specific tags in the html structure and extracts the content
def getArticleDetailsByUrl(url):
    page = requests.get(url)
    soup = BeautifulSoup(page.text,"html.parser")
    #soup.prettify()
    
    headline = soup.title.string if soup.title else None
    # The description meta tag may be missing on some pages
    description_tag = soup.find("meta", attrs={"name": "description"})
    subheadline = description_tag.get("content") if description_tag else None

    doc_body = ''
    # Only Irish Times pages use this <article>/<p class="no_name"> layout
    if "The Irish Times" in soup.text and soup.article is not None:
        for body_p_tag in soup.article.find_all("p", attrs={"class": "no_name"}):
            doc_body += body_p_tag.get_text() + '\n'

    # Label the source from the URL
    source = "IrishTimes" if "irishtimes" in url else "Other"

    first_sentence = doc_body.split(".")[0]

    return [headline, subheadline, first_sentence, doc_body, source]

# Main code that calls our parsing method getArticleDetailsByUrl(url) for specific html pages.
if __name__ == '__main__':
    article_url = 'http://www.irishtimes.com/news/science/evidence-of-new-planet-10-times-size-of-earth-in-solar-system-1.2505306'
    #print(getArticleDetailsByUrl(article_url))
    
    print("\nField by field:\n")
    [headline, subheadline, first_sentence, doc_body, source] = getArticleDetailsByUrl(article_url)
    print("Headline:\n", headline, "\n")
    print("Subheadline:\n", subheadline, "\n")
    print("First sentence:\n", first_sentence, "\n")
    print("Article body:\n", doc_body)


Field by field:

Headline:
 Evidence of new planet ‘10 times size of Earth’ in solar system 

Subheadline:
 Existence of ‘Planet Nine’ shown by examining gravitational influence on other bodies 

First sentence:
 Evidence of a hidden giant planet on the fringes of the solar system has been uncovered by scientists using computer simulations 

Article body:
 Evidence of a hidden giant planet on the fringes of the solar system has been uncovered by scientists using computer simulations.
The mysterious world, nicknamed Planet Nine, is about 10 times more massive than the Earth, thought to be gaseous, and similar to Uranus or Neptune.
Scientists believe it traces a highly elongated orbit and takes between 10,000 and 20,000 years to make just one journey around the sun.
Planet Nine is, on average, about 20 times further from the sun than Neptune, which orbits at a distance of about 4.5 billion kilometres.
Once there were thought to be nine planets in the solar system, but then the outermost
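The BeautifulSoup lookups above depend on knowing which tags hold the headline and the description. As a dependency-free illustration of the same idea, the sketch below uses the standard library's `html.parser` on a small made-up page. The class name `HeadParser` and the sample HTML are assumptions for illustration and do not reproduce the Irish Times markup.

```python
from html.parser import HTMLParser

class HeadParser(HTMLParser):
    """Collect the <title> text and the description <meta> content."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.description = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attributes.get("name") == "description":
            self.description = attributes.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text is only collected while we are inside <title>...</title>
        if self.in_title:
            self.title += data

# Made-up page used instead of a real download
sample_html = """<html><head><title>Sample headline</title>
<meta name="description" content="Sample subheadline"></head>
<body><article><p class="no_name">First paragraph.</p></article></body></html>"""

parser = HeadParser()
parser.feed(sample_html)
print(parser.title)
print(parser.description)
```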

In [10]:
#Downloading and working with an XML file
#Get the whole RSS feed for the Irish Times news articles
#This is an XML file listing the URLs of individual news articles published online
#Need to know the structure of the XML to be able to extract text from specific tags

#Parse the XML file to retrieve the URLs for individual news articles.
#Parse each article's HTML page
def scrapRSSFeed(rss_feed):
    d = feedparser.parse(rss_feed)
    #print(d)
    #print(d['entries'], "\n")
        
    for item in d['entries']:
        #Extract an article URL
        article_url = item['link']
        [headline, subheadline, first_sentence, doc_body, source] = getArticleDetailsByUrl(article_url)
        print("Article:", headline, "\n")
        
if __name__ == '__main__':

    #The URL of the XML file
    url='http://www.irishtimes.com/cmlink/news-1.1319192'
    xml_page = requests.get(url).text
    
    #Look at the structure of the XML file
    #To have a proper look, open the XML file with a text editor
    print(xml_page[:600])

    # Call the method that parses a given XML file
    scrapRSSFeed(url)

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title><![CDATA[The Irish Times - News]]></title>
    <link>/cmlink/the-irish-times-news-1.1319192</link>
    <description>
                    
          </description>
    <lastBuildDate>Thu, 31 Jan 2019 09:40:11 +0000</lastBuildDate>
    
                  <language></language>
              
          <item>
        <title><![CDATA[This could be the funniest piece of Brexit news so far this year]]></title>
        <link>https://www.irishtimes.com/news/offbeat/this-could-be-the-funniest-piece-of-brexit-news-so-far-this-year-1.3777305
Article: This could be the funniest piece of Brexit news so far this year 

Article: James Reilly calls for tax incentives to encourage over 55s to downsize 


Article: Funeral of four Donegal crash victims to take place today 

Article: Psychiatric nurses begin overtime ban over pay 

Article: From Democritus to Einstein, the long search for the tiny atom 

Article: Apes can be lazy and never ge
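An RSS feed is plain XML, so the article links can also be extracted with the standard library when feedparser is not available. The sketch below parses a small made-up feed that follows the same `<channel>`/`<item>`/`<link>` structure as the output above; the feed content and URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 document with the same structure as the feed above
sample_rss = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title><![CDATA[Example feed]]></title>
    <item>
      <title><![CDATA[First article]]></title>
      <link>https://example.com/first</link>
    </item>
    <item>
      <title><![CDATA[Second article]]></title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(sample_rss)

# Each <item> describes one article; collect its title and URL
entries = [(item.findtext("title"), item.findtext("link"))
           for item in root.iter("item")]

for title, link in entries:
    print(title, "->", link)
```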

In [2]:
#Get a PDF file, save it to disk.

#This cell needs 'requests'; import it here so the cell also runs on its own
import requests

# Give url of the PDF file
url='http://www.greenteapress.com/thinkpython/thinkpython.pdf'
# Download the pdf file into request_object
request_object = requests.get(url)

#PDF is a binary format. Use request_object.content (bytes) instead of request_object.text
#Look at the first 500 bytes of the downloaded content
print(request_object.content[:500])

#Write the binary content to a file named 'thinkpython.pdf' in the current directory
with open("thinkpython.pdf", "wb") as pdffile:
    pdffile.write(request_object.content)

#Check that it downloaded the file to the current directory.
#%ls

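Before writing downloaded bytes to disk, it can be worth checking that they really are a PDF, since a failed download often returns an HTML error page instead. PDF files start with the marker `%PDF-`. The sketch below checks for it on stand-in payloads; the helper names `looks_like_pdf` and `save_if_pdf` are ours, and with requests the bytes would come from `response.content`.

```python
import os
import tempfile

def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF header marker."""
    return data.startswith(b"%PDF-")

def save_if_pdf(data: bytes, path: str) -> bool:
    """Write data to path in binary mode only when it looks like a PDF."""
    if not looks_like_pdf(data):
        return False
    with open(path, "wb") as pdf_file:
        pdf_file.write(data)
    return True

# Stand-in payloads, not real downloads
fake_pdf = b"%PDF-1.5\n% minimal stand-in, not a real document\n"
error_page = b"<html><body>404 Not Found</body></html>"

target = os.path.join(tempfile.mkdtemp(), "example.pdf")
print(save_if_pdf(fake_pdf, target))
print(save_if_pdf(error_page, target))
```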

In [12]:
# Extract the text from the pdf file.
# Use the 'pdftotext' command line utility for your operating system, e.g., Unix, macOS, Windows.
# Google "pdftotext install" and install for your operating system.

#Use 'pdftotext' to extract the text content from the pdf file to a text file with 
# the same name, but extension .txt
# Can call pdftotext in the command line or directly from Jupyter Notebook.
!pdftotext -enc UTF-8 thinkpython.pdf thinkpython.txt

#Read the text file with text extracted from the original PDF
with open('thinkpython.txt', 'rb') as f:
    text_content = f.read().decode('UTF-8')

print(text_content[:500]) 

Think Python
How to Think Like a Computer Scientist

Version 2.0.17

Think Python
How to Think Like a Computer Scientist

Version 2.0.17

Allen Downey

Green Tea Press
Needham, Massachusetts

Copyright © 2012 Allen Downey. Green Tea Press 9 Washburn Ave Needham MA 02492 Permission is granted to copy, distribute, and/or modify this document under the terms of the Creative Commons Attribution-NonCommercial 3.0 Unported License, which is available at http: //creativecommons.org/licenses/by-nc/3.
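The `!pdftotext` line only works inside a notebook. In a plain Python script the same command can be run with `subprocess`. The sketch below is one way to do that; the function names are ours, and `pdf_to_text` assumes that `pdftotext` is installed and on the PATH, raising an error otherwise.

```python
import shutil
import subprocess

def build_pdftotext_command(pdf_path, txt_path, encoding="UTF-8"):
    """Return the argument list for the pdftotext call used above."""
    return ["pdftotext", "-enc", encoding, pdf_path, txt_path]

def pdf_to_text(pdf_path, txt_path):
    """Convert a PDF to text with pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed or not on PATH")
    # check=True raises CalledProcessError if pdftotext exits with an error
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    with open(txt_path, encoding="utf-8") as text_file:
        return text_file.read()

print(build_pdftotext_command("thinkpython.pdf", "thinkpython.txt"))
```

Calling `pdf_to_text("thinkpython.pdf", "thinkpython.txt")` would then return the same text that the cell above reads from the file.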


In [13]:
# An alternative way to parse PDF to TXT using pdfminer3k package.
# Example here: https://gist.github.com/vinovator/c78c2cb63d62fdd9fb67
# Also useful read: https://mikethecanuck.wordpress.com/2016/12/29/parsing-pdfs-using-python/

## TOPIC2: Crawling/Scraping data from the Web (entire websites).

As an alternative to using the Python package *requests*, you can use the command line *wget* utility to download an HTML page from a given URL or to download an entire website. If you don't have *wget* on your computer, first install it for your platform.

The *wget* tool is well suited to crawling entire websites or parts of them. It recursively follows URLs up to a given depth.
The first wget command in this section downloads part of a website into a local folder named *en.wikipedia.org*. The parameter -r turns on recursive download, -l sets the depth to which wget follows URLs from the original URL, and --no-parent tells wget never to ascend to the parent directory of the given path, so only pages below that path are downloaded. See http://linuxreviews.org/quicktips/wget/ for more details.
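To make the meaning of the depth limit and of --no-parent concrete, the Python sketch below runs a breadth-first crawl over a small in-memory link graph instead of the real Web. It is a simplified model under stated assumptions, not how wget is implemented: the `LINKS` dictionary, the example.com URLs, and the function name `crawl` are all hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit

# Hypothetical link graph standing in for real pages: URL -> links on that page
LINKS = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/about",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/deep"],
    "https://example.com/docs/a/deep": [],
    "https://example.com/about": [],
}

def crawl(start_url, max_depth, links=LINKS):
    """Breadth-first crawl that stays under the start path, up to max_depth."""
    start = urlsplit(start_url)
    # Directory part of the start URL, e.g. '/docs/'
    base_path = start.path.rsplit("/", 1)[0] + "/"
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # like -l: do not follow links beyond this depth
        for link in links.get(url, []):
            target = urlsplit(link)
            same_host = target.netloc == start.netloc
            below_parent = target.path.startswith(base_path)  # like --no-parent
            if same_host and below_parent and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

print(crawl("https://example.com/docs/", max_depth=1))
```

With `max_depth=1` only the start page and the pages it links to directly are visited, and `/about` is skipped because it lies outside `/docs/`.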

In [14]:
! wget https://en.wikipedia.org/wiki/Main_Page -r -l 1 --no-parent

--2019-01-31 10:06:32--  https://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org... 91.198.174.192, 2620:0:862:ed1a::1
Connecting to en.wikipedia.org|91.198.174.192|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 74975 (73K) [text/html]
Saving to: ‘en.wikipedia.org/wiki/Main_Page’


2019-01-31 10:06:32 (1.56 MB/s) - ‘en.wikipedia.org/wiki/Main_Page’ saved [74975/74975]

Loading robots.txt; please ignore errors.
--2019-01-31 10:06:32--  https://en.wikipedia.org/robots.txt
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 27329 (27K) [text/plain]
Saving to: ‘en.wikipedia.org/robots.txt’


2019-01-31 10:06:32 (29.5 MB/s) - ‘en.wikipedia.org/robots.txt’ saved [27329/27329]

FINISHED --2019-01-31 10:06:32--
Total wall clock time: 0.2s
Downloaded: 2 files, 100K in 0.05s (2.09 MB/s)
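The log above shows wget fetching robots.txt, the file in which a site states which paths crawlers should stay away from. A Python crawler can check the same rules with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so that it runs offline; the rules and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one is served at <site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a generic crawler ("*") may fetch each URL
allowed_news = parser.can_fetch("*", "https://example.com/news/today")
allowed_private = parser.can_fetch("*", "https://example.com/private/data")
print(allowed_news)
print(allowed_private)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download and parse the real file instead.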


In [15]:
#Need to stop crawling after a short while, otherwise it may fill the hard disk or we will get banned by the website
! wget http://www.irishtimes.com/ -r -l 1 --no-parent

--2019-01-31 10:06:32--  http://www.irishtimes.com/
Resolving www.irishtimes.com... 151.101.18.174
Connecting to www.irishtimes.com|151.101.18.174|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://www.irishtimes.com/ [following]
--2019-01-31 10:06:32--  https://www.irishtimes.com/
Connecting to www.irishtimes.com|151.101.18.174|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 314772 (307K) [text/html]
Saving to: ‘www.irishtimes.com/index.html’


2019-01-31 10:06:33 (3.35 MB/s) - ‘www.irishtimes.com/index.html’ saved [314772/314772]

Loading robots.txt; please ignore errors.
--2019-01-31 10:06:33--  https://www.irishtimes.com/robots.txt
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 666 [text/plain]
Saving to: ‘www.irishtimes.com/robots.txt’


2019-01-31 10:06:33 (19.8 MB/s) - ‘www.irishtimes.com/robots.txt’ saved [666/666]

Loading robots.txt; please i

HTTP request sent, awaiting response... 200 OK
Length: 10795 (11K) [text/xml]
Saving to: ‘www.irishtimes.com/rss/work-rss-1.2458565’


2019-01-31 10:06:33 (29.7 MB/s) - ‘www.irishtimes.com/rss/work-rss-1.2458565’ saved [10795/10795]

--2019-01-31 10:06:33--  https://www.irishtimes.com/rss/apple-rss-1.2384003
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 100281 (98K) [text/xml]
Saving to: ‘www.irishtimes.com/rss/apple-rss-1.2384003’


2019-01-31 10:06:33 (4.36 MB/s) - ‘www.irishtimes.com/rss/apple-rss-1.2384003’ saved [100281/100281]

--2019-01-31 10:06:33--  https://www.irishtimes.com/assets/css/framework.min.css?rev=20190130T102447
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 85562 (84K) [text/css]
Saving to: ‘www.irishtimes.com/assets/css/framework.min.css?rev=20190130T102447’


2019-01-31 10:06:33 (4.25 MB/s) - ‘www.irishtimes.com/assets/css/framework.min.

HTTP request sent, awaiting response... 200 OK
Length: 197495 (193K) [text/html]
Saving to: ‘www.irishtimes.com/news/world/uk.4’


2019-01-31 10:06:37 (5.12 MB/s) - ‘www.irishtimes.com/news/world/uk.4’ saved [197495/197495]

--2019-01-31 10:06:37--  https://www.irishtimes.com/news/world/europe
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 200712 (196K) [text/html]
Saving to: ‘www.irishtimes.com/news/world/europe’


2019-01-31 10:06:37 (9.09 MB/s) - ‘www.irishtimes.com/news/world/europe’ saved [200712/200712]

--2019-01-31 10:06:37--  https://www.irishtimes.com/news/world/us
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 197186 (193K) [text/html]
Saving to: ‘www.irishtimes.com/news/world/us.4’


2019-01-31 10:06:38 (4.70 MB/s) - ‘www.irishtimes.com/news/world/us.4’ saved [197186/197186]

--2019-01-31 10:06:38--  https://www.irishtimes.com/news/world/africa
Reusi


2019-01-31 10:06:45 (3.69 MB/s) - ‘www.irishtimes.com/news/science/citizen-science’ saved [115482/115482]

--2019-01-31 10:06:45--  https://www.irishtimes.com/news/consumer
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/news/consumer.3’

www.irishtimes.com/     [ <=>                ] 207.89K  --.-KB/s    in 0.1s    

2019-01-31 10:06:45 (1.39 MB/s) - ‘www.irishtimes.com/news/consumer.3’ saved [212878]

--2019-01-31 10:06:45--  https://www.irishtimes.com/news/offbeat
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/news/offbeat’

www.irishtimes.com/     [ <=>                ] 157.40K  --.-KB/s    in 0.05s   

2019-01-31 10:06:45 (3.39 MB/s) - ‘www.irishtimes.com/news/offbeat’ saved [161180]

--2019-01-31 10:06:45--  https://www.irishtimes.com/news/highligh


2019-01-31 10:06:49 (2.91 MB/s) - ‘www.irishtimes.com/news/world/uk/irritation-with-london-over-brexit-now-bordering-on-anger-1.3776545’ saved [207246/207246]

--2019-01-31 10:06:49--  https://www.irishtimes.com/news/politics/brexit-wary-stalemate-looms-as-dublin-brussels-rebuff-may-over-backstop-1.3777266
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/news/politics/brexit-wary-stalemate-looms-as-dublin-brussels-rebuff-may-over-backstop-1.3777266?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fnews%2Fpolitics%2Fbrexit-wary-stalemate-looms-as-dublin-brussels-rebuff-may-over-backstop-1.3777266 [following]
--2019-01-31 10:06:49--  https://www.irishtimes.com/news/politics/brexit-wary-stalemate-looms-as-dublin-brussels-rebuff-may-over-backstop-1.3777266?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fnews%2Fpolitics%2Fbrexit-wary-stalemate

HTTP request sent, awaiting response... 200 OK
Length: 116053 (113K) [text/html]
Saving to: ‘www.irishtimes.com/news/offbeat/this-could-be-the-funniest-piece-of-brexit-news-so-far-this-year-1.3777305’


2019-01-31 10:06:50 (3.40 MB/s) - ‘www.irishtimes.com/news/offbeat/this-could-be-the-funniest-piece-of-brexit-news-so-far-this-year-1.3777305’ saved [116053/116053]

--2019-01-31 10:06:50--  https://www.irishtimes.com/polopoly_fs/1.3777304.1548927388!/image/image.jpg_gen/derivatives/box_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 4299 (4.2K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3777304.1548927388!/image/image.jpg_gen/derivatives/box_140/image.jpg’


2019-01-31 10:06:50 (158 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3777304.1548927388!/image/image.jpg_gen/derivatives/box_140/image.jpg’ saved [4299/4299]

--2019-01-31 10:06:50--  https://www.irishtimes.com/opinion/ireland-cannot-afford-leo

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/news/health/woman-24-who-died-a-week-after-giving-birth-had-sepsis-says-coroner-1.3776483?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fnews%2Fhealth%2Fwoman-24-who-died-a-week-after-giving-birth-had-sepsis-says-coroner-1.3776483 [following]
--2019-01-31 10:06:51--  https://www.irishtimes.com/news/health/woman-24-who-died-a-week-after-giving-birth-had-sepsis-says-coroner-1.3776483?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fnews%2Fhealth%2Fwoman-24-who-died-a-week-after-giving-birth-had-sepsis-says-coroner-1.3776483
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/news/health/woman-24-who-died-a-week-after-giving-birth-had-sepsis-says-coroner-1.3776483’

www.irishtimes.com/     [ <=>                ] 111.26K  --.-KB/s    in

HTTP request sent, awaiting response... 200 OK
Length: 113195 (111K) [text/html]
Saving to: ‘www.irishtimes.com/business/energy-and-resources/dublin-based-firm-drawn-into-south-african-corruption-scandal-1.3776344’


2019-01-31 10:06:52 (3.31 MB/s) - ‘www.irishtimes.com/business/energy-and-resources/dublin-based-firm-drawn-into-south-african-corruption-scandal-1.3776344’ saved [113195/113195]

--2019-01-31 10:06:52--  https://www.irishtimes.com/polopoly_fs/1.3776343.1548883208!/image/image.jpg_gen/derivatives/box_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 2967 (2.9K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3776343.1548883208!/image/image.jpg_gen/derivatives/box_140/image.jpg’


2019-01-31 10:06:52 (20.7 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3776343.1548883208!/image/image.jpg_gen/derivatives/box_140/image.jpg’ saved [2967/2967]

--2019-01-31 10:06:52--  https://www.irishtimes.com/busi

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/business/financial-services/deutsche-bank-sees-merger-with-rival-commerzbank-report-1.3777314?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fbusiness%2Ffinancial-services%2Fdeutsche-bank-sees-merger-with-rival-commerzbank-report-1.3777314 [following]
--2019-01-31 10:06:53--  https://www.irishtimes.com/business/financial-services/deutsche-bank-sees-merger-with-rival-commerzbank-report-1.3777314?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fbusiness%2Ffinancial-services%2Fdeutsche-bank-sees-merger-with-rival-commerzbank-report-1.3777314
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 110097 (108K) [text/html]
Saving to: ‘www.irishtimes.com/business/financial-services/deutsche-bank-sees-merger-with-rival-commerzbank-report-1.3777314’


2019-01-31 10:06:55 (2.22 MB/s) - ‘www.irishtim

HTTP request sent, awaiting response... 200 OK
Length: 110196 (108K) [text/html]
Saving to: ‘www.irishtimes.com/business/technology/why-huawei-is-too-great-a-security-gamble-for-5g-networks-1.3776173’


2019-01-31 10:06:56 (3.17 MB/s) - ‘www.irishtimes.com/business/technology/why-huawei-is-too-great-a-security-gamble-for-5g-networks-1.3776173’ saved [110196/110196]

--2019-01-31 10:06:56--  https://www.irishtimes.com/polopoly_fs/1.3776172.1548862008!/image/image.jpg_gen/derivatives/box_140_140/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 4923 (4.8K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3776172.1548862008!/image/image.jpg_gen/derivatives/box_140_140/image.jpg’


2019-01-31 10:06:56 (63.4 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3776172.1548862008!/image/image.jpg_gen/derivatives/box_140_140/image.jpg’ saved [4923/4923]

--2019-01-31 10:06:56--  https://www.irishtimes.com/life-and-style/healt

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/sport/rugby/international/brian-o-driscoll-backs-ireland-to-beat-england-if-they-can-contain-vunipola-1.3776419?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fsport%2Frugby%2Finternational%2Fbrian-o-driscoll-backs-ireland-to-beat-england-if-they-can-contain-vunipola-1.3776419 [following]
--2019-01-31 10:06:57--  https://www.irishtimes.com/sport/rugby/international/brian-o-driscoll-backs-ireland-to-beat-england-if-they-can-contain-vunipola-1.3776419?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fsport%2Frugby%2Finternational%2Fbrian-o-driscoll-backs-ireland-to-beat-england-if-they-can-contain-vunipola-1.3776419
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 110638 (108K) [text/html]
Saving to: ‘www.irishtimes.com/sport/rugby/international/brian-o-driscoll-backs-ireland-to-beat-eng

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/culture/heritage/the-global-irish-brand-may-be-a-clich%C3%A9-now-but-it-s-not-a-fantasy-1.3768021?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fculture%2Fheritage%2Fthe-global-irish-brand-may-be-a-clich%25C3%25A9-now-but-it-s-not-a-fantasy-1.3768021 [following]
Incomplete or invalid multibyte sequence encountered
--2019-01-31 10:07:12--  https://www.irishtimes.com/culture/heritage/the-global-irish-brand-may-be-a-clich%C3%A9-now-but-it-s-not-a-fantasy-1.3768021?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fculture%2Fheritage%2Fthe-global-irish-brand-may-be-a-clich%25C3%25A9-now-but-it-s-not-a-fantasy-1.3768021
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 108945 (106K) [text/html]
Saving to: ‘www.irishtimes.com/culture/heritage/the-global-irish-brand-may-be-a-clich\303\251-now-

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/culture/film/how-to-train-your-dragon-the-hidden-world-a-franchise-we-re-going-to-miss-1.3774977’

www.irishtimes.com/     [ <=>                ] 102.49K  --.-KB/s    in 0.03s   

2019-01-31 10:07:13 (3.87 MB/s) - ‘www.irishtimes.com/culture/film/how-to-train-your-dragon-the-hidden-world-a-franchise-we-re-going-to-miss-1.3774977’ saved [104946]

--2019-01-31 10:07:13--  https://www.irishtimes.com/culture/heritage/century/war-and-peace
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 158395 (155K) [text/html]
Saving to: ‘www.irishtimes.com/culture/heritage/century/war-and-peace’


2019-01-31 10:07:14 (7.62 MB/s) - ‘www.irishtimes.com/culture/heritage/century/war-and-peace’ saved [158395/158395]

--2019-01-31 10:07:14--  https://www.irishtimes.com/culture/heritage/from-great-war-to-a-more-intimate-one-1.3760281
Reusing 


2019-01-31 10:07:15 (2.96 MB/s) - ‘www.irishtimes.com/life-and-style/health-family/a-man-in-your-50s-nutrition-exercise-and-overall-health-advice-you-should-know-1.3765258’ saved [113156/113156]

--2019-01-31 10:07:15--  https://www.irishtimes.com/polopoly_fs/1.3765251.1548072395!/image/image.jpg_gen/derivatives/box_300_160/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 13019 (13K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3765251.1548072395!/image/image.jpg_gen/derivatives/box_300_160/image.jpg’


2019-01-31 10:07:15 (23.2 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3765251.1548072395!/image/image.jpg_gen/derivatives/box_300_160/image.jpg’ saved [13019/13019]

--2019-01-31 10:07:15--  https://www.irishtimes.com/life-and-style/health-family
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 211703 (207K) [text/html]
Saving to: ‘www.irishtime


2019-01-31 10:07:15 (4.41 MB/s) - ‘www.irishtimes.com/news/world/brexit/can-ireland-s-economy-benefit-from-the-brexit-debacle-1.3763039’ saved [142673/142673]

--2019-01-31 10:07:15--  https://www.irishtimes.com/polopoly_fs/1.3763039.1547837286!/image/image.png_gen/derivatives/box_300/image.png
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 26031 (25K) [image/png]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3763039.1547837286!/image/image.png_gen/derivatives/box_300/image.png’


2019-01-31 10:07:15 (29.7 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3763039.1547837286!/image/image.png_gen/derivatives/box_300/image.png’ saved [26031/26031]

--2019-01-31 10:07:15--  https://www.irishtimes.com/news/politics/first-look-at-leinster-house-s-extensive-renovations-1.3758552
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.c

HTTP request sent, awaiting response... 200 OK
Length: 115074 (112K) [text/html]
Saving to: ‘www.irishtimes.com/life-and-style/abroad/i-ve-lived-in-london-for-a-decade-but-have-never-stopped-writing-about-ireland-1.3775006’


2019-01-31 10:07:17 (3.32 MB/s) - ‘www.irishtimes.com/life-and-style/abroad/i-ve-lived-in-london-for-a-decade-but-have-never-stopped-writing-about-ireland-1.3775006’ saved [115074/115074]

--2019-01-31 10:07:17--  https://www.irishtimes.com/polopoly_fs/1.3775005.1548783498!/image/image.jpg_gen/derivatives/box_140_75/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 3613 (3.5K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3775005.1548783498!/image/image.jpg_gen/derivatives/box_140_75/image.jpg’


2019-01-31 10:07:17 (17.8 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3775005.1548783498!/image/image.jpg_gen/derivatives/box_140_75/image.jpg’ saved [3613/3613]

--2019-01-31 10:07:17--  http

HTTP request sent, awaiting response... 302 Temporarily Moved
Location: https://www.irishtimes.com/advertising-feature/new-phase-of-millerstown-family-houses-in-kilcock-launching-now-1.3770420?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fadvertising-feature%2Fnew-phase-of-millerstown-family-houses-in-kilcock-launching-now-1.3770420 [following]
--2019-01-31 10:07:19--  https://www.irishtimes.com/advertising-feature/new-phase-of-millerstown-family-houses-in-kilcock-launching-now-1.3770420?mode=sample&auth-failed=1&pw-origin=https%3A%2F%2Fwww.irishtimes.com%2Fadvertising-feature%2Fnew-phase-of-millerstown-family-houses-in-kilcock-launching-now-1.3770420
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 106251 (104K) [text/html]
Saving to: ‘www.irishtimes.com/advertising-feature/new-phase-of-millerstown-family-houses-in-kilcock-launching-now-1.3770420’


2019-01-31 10:07:19 (3.27 MB/s) - ‘www.irishtimes.c

HTTP request sent, awaiting response... 200 OK
Length: 10699 (10K) [image/png]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3775956.1548850648!/image/image.png_gen/derivatives/box_140/image.png’


2019-01-31 10:07:20 (42.3 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3775956.1548850648!/image/image.png_gen/derivatives/box_140/image.png’ saved [10699/10699]

--2019-01-31 10:07:20--  https://www.irishtimes.com/news/world/historian-goes-viral-after-berating-billionaires-at-davos-1.3776268
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 142470 (139K) [text/html]
Saving to: ‘www.irishtimes.com/news/world/historian-goes-viral-after-berating-billionaires-at-davos-1.3776268’


2019-01-31 10:07:20 (3.53 MB/s) - ‘www.irishtimes.com/news/world/historian-goes-viral-after-berating-billionaires-at-davos-1.3776268’ saved [142470/142470]

--2019-01-31 10:07:20--  https://www.irishtimes.com/polopoly_fs/1.3776268.1548920975!/image/image.png_gen/

HTTP request sent, awaiting response... 301 Moved Permanently
Location: /weather [following]
--2019-01-31 10:07:22--  https://www.irishtimes.com/weather
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 148537 (145K) [text/html]
Saving to: ‘www.irishtimes.com/news/weather’


2019-01-31 10:07:22 (7.40 MB/s) - ‘www.irishtimes.com/news/weather’ saved [148537/148537]

--2019-01-31 10:07:22--  https://www.irishtimes.com/polopoly_fs/1.3071674!/image/image.jpg_gen/derivatives/landscape_620/image.jpg
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 34922 (34K) [image/jpeg]
Saving to: ‘www.irishtimes.com/polopoly_fs/1.3071674!/image/image.jpg_gen/derivatives/landscape_620/image.jpg’


2019-01-31 10:07:22 (136 MB/s) - ‘www.irishtimes.com/polopoly_fs/1.3071674!/image/image.jpg_gen/derivatives/landscape_620/image.jpg’ saved [34922/34922]

--2019-01-31 10:07:22--  https://www.iri

HTTP request sent, awaiting response... 200 OK
Length: 196958 (192K) [text/html]
Saving to: ‘www.irishtimes.com/policy-and-terms/terms-conditions’


2019-01-31 10:07:23 (8.51 MB/s) - ‘www.irishtimes.com/policy-and-terms/terms-conditions’ saved [196958/196958]

--2019-01-31 10:07:23--  https://www.irishtimes.com/policy-and-terms/privacy-policy
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 107756 (105K) [text/html]
Saving to: ‘www.irishtimes.com/policy-and-terms/privacy-policy’


2019-01-31 10:07:23 (73.1 MB/s) - ‘www.irishtimes.com/policy-and-terms/privacy-policy’ saved [107756/107756]

--2019-01-31 10:07:23--  https://www.irishtimes.com/policy-and-terms/community-standards
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 127742 (125K) [text/html]
Saving to: ‘www.irishtimes.com/policy-and-terms/community-standards’


2019-01-31 10:07:24 (7.67 MB/s) - ‘www.irishtim


2019-01-31 10:07:27 (3.50 MB/s) - ‘www.irishtimes.com/sport/other-sports’ saved [317717/317717]

--2019-01-31 10:07:27--  https://www.irishtimes.com/sport/women-in-sport
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 404007 (395K) [text/html]
Saving to: ‘www.irishtimes.com/sport/women-in-sport’


2019-01-31 10:07:28 (1.85 MB/s) - ‘www.irishtimes.com/sport/women-in-sport’ saved [404007/404007]

--2019-01-31 10:07:28--  https://www.irishtimes.com/sport/comment
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 168403 (164K) [text/html]
Saving to: ‘www.irishtimes.com/sport/comment’


2019-01-31 10:07:28 (7.22 MB/s) - ‘www.irishtimes.com/sport/comment’ saved [168403/168403]

--2019-01-31 10:07:28--  https://www.irishtimes.com/business/economy
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/

HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.irishtimes.com/life-and-style/motors.5’

www.irishtimes.com/     [ <=>                ] 189.55K  --.-KB/s    in 0.04s   

2019-01-31 10:07:35 (5.27 MB/s) - ‘www.irishtimes.com/life-and-style/motors.5’ saved [194102]

--2019-01-31 10:07:35--  https://www.irishtimes.com/life-and-style/fashion
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 160341 (157K) [text/html]
Saving to: ‘www.irishtimes.com/life-and-style/fashion.5’


2019-01-31 10:07:35 (3.39 MB/s) - ‘www.irishtimes.com/life-and-style/fashion.5’ saved [160341/160341]

--2019-01-31 10:07:35--  https://www.irishtimes.com/culture/books
Reusing existing connection to www.irishtimes.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 214121 (209K) [text/html]
Saving to: ‘www.irishtimes.com/culture/books’


2019-01-31 10:07:35 (3.63 MB/s) - ‘www.irishtimes.com/culture/boo

For a pure Python crawler we can use the Python *wget* package or the *scrapy* package (Scrapy supports Python 3 from version 1.1 onwards, so it is not limited to Python 2.7). 
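
To make the idea of a crawler more concrete, here is a minimal sketch of its core step: finding the links on a page that stay on the same site. It uses only the standard library (`html.parser` and `urllib.parse`) rather than *wget* or *scrapy*, and the names `LinkCollector`, `same_site_links` and the sample HTML are made up for this illustration. A real crawler would also download each page, respect robots.txt and pause between requests.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(html, base_url):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        if urlparse(absolute).netloc == host and absolute not in links:
            links.append(absolute)
    return links


# Hypothetical page content, used only to demonstrate the function
sample_html = """
<a href="/sport/comment">Comment</a>
<a href="https://www.irishtimes.com/culture/books">Books</a>
<a href="https://example.org/elsewhere">External</a>
<a href="/sport/comment">Comment again</a>
"""
print(same_site_links(sample_html, "https://www.irishtimes.com/"))
```

Feeding the returned links back into a queue of pages to visit is what turns this single step into a crawl of an entire website.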

## TOPIC3: Getting data via APIs.
### JSON format: 
JavaScript Object Notation - a text format used widely for web-based resource sharing. Many APIs return data in JSON.

Create a file named *example.json* using the Python code below to write a given string to a file.

In [16]:
json_string = """
{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}"""
with open("example.json", "w") as file:
    file.write(json_string)    

In [17]:
# Run the shell command "cat" to look at the file
# (on Windows, where "cat" may not be available, use: !type example.json)
# The sign ! tells Jupyter Notebook that the following command is a shell command.
!cat example.json


{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}
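
As an alternative to writing a hand-typed string, a Python dictionary can be serialised directly with `json.dump`. The sketch below is an illustration only: it uses a shortened version of the glossary and a hypothetical file name, `example_roundtrip.json`, written to a temporary directory so that *example.json* is left untouched.

```python
import json
import os
import tempfile

# A shortened version of the glossary, built as a Python dictionary
glossary = {"glossary": {"title": "example glossary", "GlossDiv": {"title": "S"}}}

# Write to a temporary directory so the notebook's own example.json is untouched
path = os.path.join(tempfile.mkdtemp(), "example_roundtrip.json")
with open(path, "w") as file:
    json.dump(glossary, file, indent=4)

# Read the file back and confirm nothing was lost on the way
with open(path) as file:
    restored = json.load(file)

print(restored == glossary)
```

Letting `json.dump` produce the text avoids syntax slips such as a missing comma or quote in a manually written JSON string.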

In [18]:
# Open the file in a with-block so it is closed after loading
with open('example.json') as file:
    json_data = json.load(file)
#json_data looks like a nested Python dictionary
print(json_data)

{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}}}}


In [19]:
#We can refer to different fields of the json object
print(json_data['glossary']['title'])
print(json_data['glossary']['GlossDiv']['title'])
print(json_data['glossary']['GlossDiv']['GlossList']['GlossEntry']['ID'])

example glossary
S
SGML
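
Chained lookups such as the ones above raise a `KeyError` as soon as one level is missing, which is common with real API responses. One possible way to guard against that is a small helper that walks a path of keys and returns a default instead. The helper name `get_nested` is an invention for this sketch, not part of the `json` module.

```python
import json


def get_nested(data, path, default=None):
    """Follow a sequence of keys/indexes into nested dicts and lists."""
    current = data
    for key in path:
        try:
            current = current[key]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or a non-container value
            return default
    return current


# A trimmed copy of the glossary structure, parsed from a string
doc = json.loads(
    '{"glossary": {"GlossDiv": {"title": "S", "GlossList": {"GlossEntry": '
    '{"GlossDef": {"GlossSeeAlso": ["GML", "XML"]}}}}}}'
)

entry_path = ["glossary", "GlossDiv", "GlossList", "GlossEntry"]
print(get_nested(doc, entry_path + ["GlossDef", "GlossSeeAlso", 1]))
print(get_nested(doc, ["glossary", "missing", "title"], default="n/a"))
```

The first call reaches into a list by index and the second falls back to the default because the key `missing` does not exist.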


In the example below we use a URL called an API endpoint and the *requests* package to get a JSON file, in the same way as we fetched data from a URL above.


In [20]:
url='https://data.colorado.gov/resource/4ykn-tg5h.json'
json_dataset = requests.get(url).text
print(len(json_dataset))
#Look at the first 500 characters of the json list
print(json_dataset[:500])

with open("data_colorado_gov.json", "w") as file:
    file.write(json_dataset)


668198
[{"agentorganizationname":"The Corporation Company","agentprincipaladdress1":"7700 E Arapahoe Rd Ste 220","agentprincipalcity":"Centennial","agentprincipalcountry":"US","agentprincipalstate":"CO","agentprincipalzipcode":"80112-1268","entityformdate":"1959-03-24T00:00:00.000","entityid":"19871028279","entityname":"DEUTSCHE BANK TRUST COMPANY AMERICAS","entitystatus":"Good Standing","entitytype":"Foreign Corporation","jurisdictonofformation":"NY","principaladdress1":"60 Wall Street","principalcity
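
The response text is a JSON list of records, so once parsed it behaves like a Python list of dictionaries (with *requests*, `requests.get(url).json()` returns the parsed object directly). The sketch below shows one way to summarise such a list. The function name `count_by_field` and the sample records are made up for illustration; only the field names `entityname` and `entitystatus` are taken from the output above.

```python
import json
from collections import Counter


def count_by_field(json_text, field):
    """Parse a JSON list of records and count the values of one field."""
    records = json.loads(json_text)
    return Counter(record.get(field, "<missing>") for record in records)


# Made-up records that only mimic the field names seen in the output above
sample = json.dumps([
    {"entityname": "EXAMPLE ONE LLC", "entitystatus": "Good Standing"},
    {"entityname": "EXAMPLE TWO INC", "entitystatus": "Delinquent"},
    {"entityname": "EXAMPLE THREE CO", "entitystatus": "Good Standing"},
    {"entityname": "EXAMPLE FOUR LTD"},
])

counts = count_by_field(sample, "entitystatus")
print(counts.most_common())
```

Using `record.get` with a placeholder keeps the count going when a record lacks the field, which can happen in open datasets.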


## Twitter API

You must have a Twitter account and Twitter OAuth credentials, available from https://apps.twitter.com/. 
For now you can use the credentials below, but Twitter may reject connections if too many people use the same credentials.
It is important to create and use your own authentication credentials. The credentials below will be reset after this lab.
Create a new application (using your own Twitter account) and then generate access tokens. See this tutorial for more details:
http://socialmedia-class.org/twittertutorial.html
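
Once you have your own credentials, it is safer to keep them out of the notebook itself. A minimal sketch, assuming you have stored them in environment variables: the variable names (`TWITTER_ACCESS_TOKEN` and so on) and the helper `load_twitter_credentials` are conventions invented for this example, not something Twitter or the Python Twitter package requires.

```python
import os

REQUIRED_KEYS = ("ACCESS_TOKEN", "ACCESS_SECRET", "CONSUMER_KEY", "CONSUMER_SECRET")


def load_twitter_credentials(env=None, prefix="TWITTER_"):
    """Read the four OAuth values from environment variables.

    The variable names (TWITTER_ACCESS_TOKEN, ...) are an assumption
    made for this sketch.
    """
    env = os.environ if env is None else env
    missing = [prefix + key for key in REQUIRED_KEYS if not env.get(prefix + key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[prefix + key] for key in REQUIRED_KEYS}


# Demonstration with a stand-in mapping instead of the real environment
fake_env = {"TWITTER_" + key: "placeholder-" + key.lower() for key in REQUIRED_KEYS}
creds = load_twitter_credentials(fake_env)
print(sorted(creds))
```

The resulting dictionary can then be passed on in the same order the notebook uses: `OAuth(creds["ACCESS_TOKEN"], creds["ACCESS_SECRET"], creds["CONSUMER_KEY"], creds["CONSUMER_SECRET"])`.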

In [21]:
# Using Twitter Search API to get public tweets from the past
# Initiate the connection to Twitter API
# Twitter API returns data in JSON format

# Variables that contain the user credentials to access Twitter API 
# ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
# ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
# CONSUMER_KEY = 'YOUR API KEY'
# CONSUMER_SECRET = 'YOUR API SECRET'
ACCESS_TOKEN = '2839893905-pBXUzdrHCNXyjfPuBpSwxNbH1zyEpRaa2sXK0Jd'
ACCESS_SECRET = 'eNtB7YTAfsMhPIQtKji8aQT7zQFpFfDPR2lQ89WKfgI1U'
CONSUMER_KEY = 'ZqPrfLpc0znZlz3kW2a22VmUa'
CONSUMER_SECRET = 'BHD19T0DmUV2XVvEhUAgvpXMx0nGfxevAtr53NbCd9jQjPyTqn'

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)
twitter = Twitter(auth=oauth)
            
# Search for the latest 100 tweets about "#wearenotwaiting"
iterator = twitter.search.tweets(q='#wearenotwaiting', result_type='recent', lang='en', count=100)
#print(json.dumps(iterator, indent=4))

# Write one tweet per line; the with-block closes the file when done
with open("twitter_search_100tweets.json", "w") as file:
    for tweet in iterator['statuses']:
        #print(json.dumps(tweet))
        file.write(json.dumps(tweet)+"\n")

Assuming the previous tweets were saved in a file *twitter_search_100tweets.json*, read the file and look at the tweets. 
If you couldn't get the above code to work, use the sample file provided.

In [22]:
# We use the file saved from last step as example
with open('twitter_search_100tweets.json', 'r') as f:
    tweets_file = f.readlines()
#print(tweets_file)

for line in tweets_file:
    #print(line)
    try:
        # Read in one line of the file, convert it into a json object 
        tweet = json.loads(line.strip())
        #print(tweet)
        if 'text' in tweet: # only messages that contain a 'text' field are tweets
#             print(tweet['id']) # This is the tweet's id
#             print(tweet['created_at']) # when the tweet posted
#             print(tweet['text']) # content of the tweet
                        
#             print(tweet['user']['id']) # id of the user who posted the tweet
#             print(tweet['user']['name']) # name of the user, e.g. "Wei Xu"
#             print(tweet['user']['screen_name']) # name of the user account, e.g. "cocoweixu"

#             hashtags = []
#             for hashtag in tweet['entities']['hashtags']:
#             	hashtags.append(hashtag['text'])
#             print(hashtags)
            date = tweet['created_at']
            tweet_id = tweet['id']  # named tweet_id to avoid shadowing the built-in id()
            text = tweet['text']
            nfollowers = tweet['user']['followers_count']
            nfriends = tweet['user']['friends_count']
            hashtags = [hashtag['text'] for hashtag in tweet['entities']['hashtags']]
            users = [user_mention['screen_name'] for user_mention in tweet['entities']['user_mentions']]
            urls = [url['expanded_url'] for url in tweet['entities']['urls']]
    
            media_urls = []
            if 'media' in tweet['entities']:
                media_urls = [media['media_url'] for media in tweet['entities']['media']]
    
            print([date, tweet_id, text, hashtags, users, urls, media_urls, nfollowers, nfriends])
    except (ValueError, KeyError):
        # The line is not valid JSON or lacks an expected field (this sometimes happens)
        print("JSON error!!!")
        continue

['Thu Jan 31 10:04:13 +0000 2019', 1090913635399020544, "@_viq @Shnoune Thank you! You're really very helpful. It's all V technical for a non tech person. I'm a bit stupid… https://t.co/JNmtUpenUc", [], ['_viq', 'Shnoune'], ['https://twitter.com/i/web/status/1090913635399020544'], [], 880, 443]
['Thu Jan 31 09:03:54 +0000 2019', 1090898457374662658, '@punchesbears @OpenAPS I used #openaps and worked really well! But a newer pump like I use now (DanaRS) is way quic… https://t.co/Vb3McVYKKo', ['openaps'], ['punchesbears', 'OpenAPS'], ['https://twitter.com/i/web/status/1090898457374662658'], [], 53, 118]
['Wed Jan 30 22:42:42 +0000 2019', 1090742126420344844, 'RT @SthSue: We’re looking forward to hearing from Dr Matthew Guy about remote monitoring of the Medtronic 600-series insulin pumps, openAPS…', [], ['SthSue'], [], [], 3226, 393]
['Wed Jan 30 22:42:06 +0000 2019', 1090741973105938433, 'RT @SthSue: We’re looking forward to hearing from Dr Matthew Guy about remote monitoring of the Med

In [23]:
#Using Twitter Streaming API to stream tweets in real-time
#Gather all tweets containing a given keyword
#You can also gather all tweets of a given user, check Twitter Streaming API details.

# Variables that contain the user credentials to access Twitter API 
# ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
# ACCESS_SECRET = 'YOUR ACCESS TOKEN SECRET'
# CONSUMER_KEY = 'YOUR API KEY'
# CONSUMER_SECRET = 'YOUR API SECRET'
ACCESS_TOKEN = '2839893905-pBXUzdrHCNXyjfPuBpSwxNbH1zyEpRaa2sXK0Jd'
ACCESS_SECRET = 'eNtB7YTAfsMhPIQtKji8aQT7zQFpFfDPR2lQ89WKfgI1U'
CONSUMER_KEY = 'ZqPrfLpc0znZlz3kW2a22VmUa'
CONSUMER_SECRET = 'BHD19T0DmUV2XVvEhUAgvpXMx0nGfxevAtr53NbCd9jQjPyTqn'

oauth = OAuth(ACCESS_TOKEN, ACCESS_SECRET, CONSUMER_KEY, CONSUMER_SECRET)

# Initiate the connection to Twitter Streaming API
twitter_stream = TwitterStream(auth=oauth)

# Get a sample of the public data published on Twitter in real-time
#iterator = twitter_stream.statuses.sample()
# Get tweets in English that match the keywords "data analytics"
iterator = twitter_stream.statuses.filter(track="data analytics", language="en")

# Print each tweet in the stream to the screen 
# Here we set it to stop after getting 10 tweets. 
# You don't have to set it to stop, but can continue running 
# the Twitter API to collect data for days or even longer. 
tweet_count = 10

# The with-block closes the file once the loop ends
with open("data_analytics_twitter_stream_10tweets.json", "w") as file:
    for tweet in iterator:
        tweet_count -= 1
        # Twitter Python Tool wraps the data returned by Twitter 
        # as a TwitterDictResponse object.
        # We convert it back to the JSON format to print/store
        #print(json.dumps(tweet))  
        file.write(json.dumps(tweet)+"\n")

        # The command below will do pretty printing for JSON data, try it out
        print(json.dumps(tweet, indent=4))

        if tweet_count <= 0:
            break

{
    "created_at": "Thu Jan 31 10:07:55 +0000 2019",
    "id": 1090914567247613953,
    "id_str": "1090914567247613953",
    "text": "RT @BigData_Fr: via @RichardEudes - Integrating factor of Big data and Artificial intelligence in Business https://t.co/zcbV0r6jRd #ai, #an\u2026",
    "source": "<a href=\"http://discussionexpress.com\" rel=\"nofollow\">Information Critical</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 16967457,
        "id_str": "16967457",
        "name": "Don Robinson",
        "screen_name": "keepcliming",
        "location": "New Jersey",
        "url": null,
        "description": "The byproducts of innovative thinking are assembled into revolutionary products and services. #crosstraining #wellness\n #bigdata #poetry #environment",
        "translator_type": "none",
       

{
    "created_at": "Thu Jan 31 10:08:15 +0000 2019",
    "id": 1090914652182274049,
    "id_str": "1090914652182274049",
    "text": "RT @DataScienceFr: via @RichardEudes - Integrating factor of Big data and Artificial intelligence in Business https://t.co/EJFlsLpA8m #ai,\u2026",
    "source": "<a href=\"http://discussionexpress.com\" rel=\"nofollow\">Information Critical</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 16967457,
        "id_str": "16967457",
        "name": "Don Robinson",
        "screen_name": "keepcliming",
        "location": "New Jersey",
        "url": null,
        "description": "The byproducts of innovative thinking are assembled into revolutionary products and services. #crosstraining #wellness\n #bigdata #poetry #environment",
        "translator_type": "none",
        

{
    "created_at": "Thu Jan 31 10:08:36 +0000 2019",
    "id": 1090914738324860928,
    "id_str": "1090914738324860928",
    "text": "RT @DataScientistsF: via @RichardEudes - Integrating factor of Big data and Artificial intelligence in Business https://t.co/7G5gGwpxeL #ai\u2026",
    "source": "<a href=\"http://discussionexpress.com\" rel=\"nofollow\">Information Critical</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 16967457,
        "id_str": "16967457",
        "name": "Don Robinson",
        "screen_name": "keepcliming",
        "location": "New Jersey",
        "url": null,
        "description": "The byproducts of innovative thinking are assembled into revolutionary products and services. #crosstraining #wellness\n #bigdata #poetry #environment",
        "translator_type": "none",
       

{
    "created_at": "Thu Jan 31 10:08:56 +0000 2019",
    "id": 1090914822865199104,
    "id_str": "1090914822865199104",
    "text": "RT @AnalyticsFrance: via @RichardEudes - Integrating factor of Big data and Artificial intelligence in Business https://t.co/doc8ZUnpxY #ai\u2026",
    "source": "<a href=\"http://discussionexpress.com\" rel=\"nofollow\">Information Critical</a>",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 16967457,
        "id_str": "16967457",
        "name": "Don Robinson",
        "screen_name": "keepcliming",
        "location": "New Jersey",
        "url": null,
        "description": "The byproducts of innovative thinking are assembled into revolutionary products and services. #crosstraining #wellness\n #bigdata #poetry #environment",
        "translator_type": "none",
       

{
    "created_at": "Thu Jan 31 10:10:00 +0000 2019",
    "id": 1090915092688977920,
    "id_str": "1090915092688977920",
    "text": "\"Financial considerations likely played a role. Optum, a sister company to UnitedHealthcare, sells data analytics s\u2026 https://t.co/zXwQdLOsrB",
    "source": "<a href=\"https://buffer.com\" rel=\"nofollow\">Buffer</a>",
    "truncated": true,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 44143505,
        "id_str": "44143505",
        "name": "Rich Duszak, MD",
        "screen_name": "RichDuszak",
        "location": "Emory University",
        "url": "https://duszak.me/",
        "description": "Grateful. Radiologist, researcher, teacher, policy wonk, and patient advocate. Striving each day to pay it forward and put the care back in healthcare.",
        "translator_type": "none",
      

{
    "created_at": "Thu Jan 31 10:10:04 +0000 2019",
    "id": 1090915108497305601,
    "id_str": "1090915108497305601",
    "text": "The power of analytics for insurance fraud detection. Fraud detection has become a lengthy and exhausting process f\u2026 https://t.co/mzYxvcQcNC",
    "display_text_range": [
        0,
        140
    ],
    "source": "<a href=\"https://www.hootsuite.com\" rel=\"nofollow\">Hootsuite Inc.</a>",
    "truncated": true,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 765924384678744064,
        "id_str": "765924384678744064",
        "name": "LexisNexisInsureIN",
        "screen_name": "LexisInsureIN",
        "location": "Mumbai, India",
        "url": "http://www.lexisnexis.com/risk/in/insurance",
        "description": "@LexisNexisRisk's Insurance-related tweets for India: data, analytics, life

{
    "created_at": "Thu Jan 31 10:11:05 +0000 2019",
    "id": 1090915363812855810,
    "id_str": "1090915363812855810",
    "text": "Healthcare Data Analytics Market to Witness Huge Growth by 2025 | Allscripts, Cerner, Health Catalyst, IBM, Inovalo\u2026 https://t.co/Up0LqR2JSj",
    "display_text_range": [
        0,
        140
    ],
    "source": "<a href=\"https://dlvrit.com/\" rel=\"nofollow\">dlvr.it</a>",
    "truncated": true,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 804337567,
        "id_str": "804337567",
        "name": "Concord News Now",
        "screen_name": "concordnewsnow",
        "location": null,
        "url": "http://www.concordnewsnow.com/",
        "description": "Concord News Now is one of the trusted websites for scalable distribution of your products worldwide through press releases.",
    