# Getting Data
This notebook shows how to download data available on the Internet. We cover the formats in which data is most commonly published, and practice with example Python code and command-line utilities for getting data.

TOPIC1: Getting data from a Web URL: text, HTML, XML, PDF.

TOPIC2: Crawling/Scraping data from the Web (entire websites).

TOPIC3: Getting data via APIs (JSON format).

## TOPIC1: Getting data from a Web URL: text, HTML, XML, PDF.

In [1]:
import sys

In [2]:
#Check which Python version and virtual environment this Jupyter Notebook uses
print(sys.executable)
print(sys.version_info)
#print(sys.path)

#If Jupyter Notebook does not point to the required virtual environment,
#remove the environment and re-create it, for example:
#conda create --name comp47350py311 python=3.11 jupyter
#Then use 'pip install' to re-install the required packages

/Users/georgianaifrim/miniconda3/envs/comp47350py311/bin/python
sys.version_info(major=3, minor=11, micro=7, releaselevel='final', serial=0)


In [29]:
#Import all required packages
#If you don't have these packages, install using: pip install <package-name>

#Import package 'requests' for URL scraping
import requests
#Import package 'pandas' for reading csv files
import pandas as pd
#Import package 'beautifulsoup' to extract the content of HTML fields
#pip install bs4
from bs4 import BeautifulSoup

#pip install newspaper3k
import newspaper

#import package 'feedparser'
#Feedparser is a library to parse RSS/XML feeds, these are files with a specific XML structure
import feedparser
#import package 'json' to parse json objects
import json

import time

#Look at the package structure to understand how to use it
#print(dir(requests))

#Look at individual functions
#help(requests.get)

#Alternatively, use '?', which works like help() but opens a new window
#?requests.get

In [4]:
#Get a text file.
#Get book "Alice's Adventures in Wonderland" from Project Gutenberg, in text format

#Give the URL for the file to be downloaded
url='https://www.gutenberg.org/files/11/11-0.txt'
#Look at the object returned by requests.get()
requests_object = requests.get(url)

#print(requests_object.content)

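#If the server does not declare a character set, requests may fall back to Latin-1
#and garble the UTF-8 text; set the encoding explicitly ('utf-8-sig' also drops the BOM)
requests_object.encoding = 'utf-8-sig'

#Get the content from the downloaded text file
text_page = requests_object.text
#print(text_page)

#Look at the first 500 characters of the book
print(text_page[:500])

The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll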

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you
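
When a server does not declare a character set, `requests` may decode UTF-8 text with a Latin-1 fallback, which turns a byte-order mark (BOM) and curly quotes into stray characters such as `ï»¿`. The sketch below reproduces the effect on a small, hypothetical byte string that stands in for `requests_object.content`; the variable names are illustrative only.

```python
# Hypothetical sample standing in for requests_object.content:
# a UTF-8 byte-order mark (BOM) followed by text with a curly apostrophe.
raw_bytes = b"\xef\xbb\xbfAlice\xe2\x80\x99s Adventures"

# Decoding UTF-8 bytes with the wrong codec produces garbled characters
wrong = raw_bytes.decode("latin-1")

# 'utf-8-sig' decodes UTF-8 and drops a leading BOM if one is present
right = raw_bytes.decode("utf-8-sig")

print(repr(wrong))
print(repr(right))
```

The same codec name can be assigned to the `encoding` attribute of a `requests` response before reading `.text`.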


In [5]:
# Reading from a csv file, into a data frame
df = pd.read_csv('MotorInsuranceFraudClaimABTFull.csv')

# Check how many rows and columns this dataframe has
print("number of rows and columns:", df.shape)

# Show first 10 rows of data frame
# The rows are indexed starting from 0
df.head(10)

# Show last 10 rows of data frame
# The rows are indexed starting from 0
#df.tail(10)

number of rows and columns: (500, 14)


Unnamed: 0,ID,Insurance Type,Income of Policy Holder,Marital Status,Num Claimants,Injury Type,Overnight Hospital Stay,Claim Amount,Total Claimed,Num Claims,Num Soft Tissue,% Soft Tissue,Claim Amount Received,Fraud Flag
0,1,CI,0,,2,Soft Tissue,No,1625,3250,2,2.0,1.0,0,1
1,2,CI,0,,2,Back,Yes,15028,60112,1,0.0,0.0,15028,0
2,3,CI,54613,Married,1,Broken Limb,No,-99999,0,0,0.0,0.0,572,0
3,4,CI,0,,3,Serious,Yes,270200,0,0,0.0,0.0,270200,0
4,5,CI,0,,4,Soft Tissue,No,8869,0,0,0.0,0.0,0,1
5,6,CI,0,,1,Broken Limb,Yes,17480,0,0,0.0,0.0,17480,0
6,7,CI,52567,Single,3,Broken Limb,No,3017,18102,2,1.0,0.5,0,1
7,8,CI,0,,2,Back,Yes,7463,0,0,0.0,0.0,7463,0
8,9,CI,0,,1,Soft Tissue,No,2067,0,0,,0.0,2067,0
9,10,CI,42300,Married,4,Back,No,2260,0,0,0.0,0.0,2260,0


In [30]:
#Get an HTML file.
#Get a news article from The Irish Times website.

#Give the URL for the file to be downloaded
url = "https://www.irishtimes.com/ireland/2024/01/16/low-temperature-warning-issued-for-entire-country-by-met-eireann-with-snow-warning-for-several-counties/"

#Get the content from the downloaded html file
html_page = requests.get(url).text
#Look at the format of the html file
print(html_page[:500])

#Write the content to a file; the 'with' block closes the file even if writing fails
with open("low-temperature-warning-issued-for-entire-country.html", "w", encoding="utf-8") as file:
    file.write(html_page)

<!DOCTYPE html><html lang="en"><head><script data-integration="inlineScripts">
    (function() {
      var _sf_async_config = window._sf_async_config = (window._sf_async_config || {});
      _sf_async_config.uid = 31036;
      _sf_async_config.domain = "irishtimes.com";
      _sf_async_config.useCanonical = true;
      _sf_async_config.useCanonicalDomain = true;
      _sf_async_config.sections = "ireland";
      _sf_async_config.authors = "Jade Wilson";
      _sf_async_config.flickerControl = fa
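
The raw HTML mixes scripts, markup, and article text, so pulling out specific fields requires a parser. The sketch below uses only the standard library's `html.parser` on a small hypothetical page (`sample_html` stands in for `html_page`); the class name `TitleAndLinkParser` is made up for this illustration. The *BeautifulSoup* package imported earlier offers a more convenient interface for the same kind of task.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the <title> text and the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it
        if self.in_title:
            self.title += data


# Hypothetical page standing in for html_page
sample_html = """<!DOCTYPE html><html><head><title>Cold snap</title></head>
<body><p>Read <a href="/ireland/">more</a> and <a href="/weather/">weather</a>.</p>
</body></html>"""

parser = TitleAndLinkParser()
parser.feed(sample_html)
print(parser.title)
print(parser.links)
```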


In [31]:
# We can download a news article and parse it using the newspaper3k library
# https://buildmedia.readthedocs.org/media/pdf/newspaper/latest/newspaper.pdf
# newspaper cannot parse all types of HTML files; for more complex file structures we still need 'beautifulsoup'
from newspaper import Article

url = "https://www.irishtimes.com/ireland/2024/01/16/low-temperature-warning-issued-for-entire-country-by-met-eireann-with-snow-warning-for-several-counties/"
article = Article(url)
article.download()

#print(article.html)

article.parse()
print("Authors:", article.authors)
print("Date:", article.publish_date)
print("Title:", article.title)
print("Text:", article.text)
print("\nURL:", article.url)


Authors: ['Jade Wilson', 'Tue Jan -']
Date: 2024-01-16 00:00:00
Text: Low clouds lie over the snow-covered Black Mountain behind the Harland & Wolff cranes on the east side of Belfast Lough. Photograph: Liam McBurney/PA Wire



The lowest overnight air temperature of -6.3 degrees was recorded at Moore Park, Co Cork at 5am on Monday.

The record lowest air temperature for January was recorded on this day back in 1881, at Markree, Co Sligo. It was -19.1 degrees.

READ MORE

Today, the provisional lowest air temperature is -7.4 degrees, recorded at Thomastown in Co Kilkenny, Met Éireann said.





The rest of the week is to remain bitterly cold with temperatures continuing to fall below -5 at night.


Wednesday, Thursday and Friday will be “bitterly cold days”, he anticipated, with temperatures barely above freezing. The wind chill factor will make it feel like minus 8 in the east of the country. There is also a chance of snow in some parts of the north and northwest.



In [33]:
#Downloading and working with an XML file
#Get the whole RSS feed for the Irish Times news articles
#This is an XML file listing the URLs of individual news articles published online
#Need to know the structure of the XML to be able to extract text from specific tags

#Parse the XML file to retrieve the URLs for individual news articles.
#Parse each article's HTML page

def getArticleDetailsByUrl(url):
    article = Article(url)
    article.download()

    #print(article.html)
    article.parse()
    authors = article.authors
    date = article.publish_date
    title = article.title
    text = article.text

    return [authors, date, title, text]

def scrapeRSSFeed(rss_feed):
    d = feedparser.parse(rss_feed)
    #print(d)
    #print(d['entries'], "\n")
        
    for item in d['entries']:
        #Extract an article URL
        article_url = item['link']
        print(article_url)
        try:
            [authors, date, title, text] = getArticleDetailsByUrl(article_url)
            print("\nArticle title:", title, "\n")
            print("\nArticle first paragraph:", text.split("\n")[0], "\n")
        #newspaper reports a failed download as an ArticleException raised by parse(), so catch it as well
        except (requests.RequestException, newspaper.article.ArticleException) as e:
            print("[Error]: " + str(e))
        #Introduce a delay after each article to avoid overloading the Irish Times server
        time.sleep(5)
            
#Here you have your very own RSS feed reader in a few lines of code.
if __name__ == '__main__':

    #The URL of the XML file
    url='https://www.irishtimes.com/rss/irish-times-top-10-stories-1.4019566'
    xml_page = requests.get(url).text
    
    #Look at the structure of the XML file
    #To have a proper look, open the XML file with a text editor
    print(xml_page[:1000])

    # Call the method that parses a given XML file
    scrapeRSSFeed(url)

<?xml version="1.0" encoding="UTF-8"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Irish Times Feeds]]></title><link>https://www.irishtimes.com</link><atom:link href="https://www.irishtimes.com/arc/outboundfeeds/digest-resolver/homepage-top/" rel="self" type="application/rss+xml"/><description><![CDATA[Irish Times Feeds News Feed]]></description><lastBuildDate>Wed, 17 Jan 2024 16:42:27 +0000</lastBuildDate><language>en</language><ttl>1</ttl><sy:updatePeriod>hourly</sy:updatePeriod><sy:updateFrequency>1</sy:updateFrequency><item><title><![CDATA[Irish beef exports to China to resume immediately]]></title><link>https://www.irishtimes.com/ireland/2024/01/17/irish-beef-exports-to-china-to-resume-immediately/</link><guid isPermaLink="true">https://www.iri
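
To see how text is pulled from specific tags once the XML structure is known, here is a minimal sketch using the standard library's `xml.etree.ElementTree` rather than *feedparser*. The feed below is a hypothetical, shortened example with the same `<rss><channel><item>` layout as the output above; its titles and links are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened feed with the same <rss><channel><item> layout
sample_rss = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
<title><![CDATA[Example Feed]]></title>
<item><title><![CDATA[First story]]></title><link>https://example.com/first/</link></item>
<item><title><![CDATA[Second story]]></title><link>https://example.com/second/</link></item>
</channel></rss>"""

# Parse from bytes so the encoding declaration in the XML header is honoured
root = ET.fromstring(sample_rss.encode("utf-8"))

# Each <item> holds one article; read its <title> and <link> children
items = [(item.findtext("title"), item.findtext("link")) for item in root.iter("item")]

for title, link in items:
    print(title, "->", link)
```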

In [34]:
#Get a PDF file, save it to disk.

# Give url of the PDF file
url='http://www.greenteapress.com/thinkpython/thinkpython.pdf'
# Download the pdf file into request_object
request_object = requests.get(url)

#PDF is a binary format. Use request_object.content instead of request_object.text
#Look at the content of the file; it looks like gibberish since it is a binary format.
#To make sense of the content, we need tools that can read the PDF format and extract plain text.
#See the next cell for the pdftotext tool.
print(request_object.content[:500])

#Write the binary content to a file named 'thinkpython.pdf' on your machine's disk
with open("thinkpython.pdf", "wb") as pdffile:
    pdffile.write(request_object.content)

#Check that it downloaded the file to the current directory.
#%ls

b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n2 0 obj <<\n/Type /ObjStm\n/N 100\n/First 804\n/Length 1113      \n/Filter /FlateDecode\n>>\nstream\nx\xda\x9dV\xdbn\x9cH\x10}\x9f\xaf\xa8\xc7d\xb5\x8a\xe9\x0b\xdd\xb0\x8a\x12E\x9b8\xca\xc3*Vl%\xcf\x1d\xe8\x19\xa30\x80\x1a\xb0=\xfb\xf5{\x8a\x8b\xed\xec\xa5\x07\xed\x83M\x0f\xd49UuNu\x83\xa0\x84RR\x82\x0ciA9YA"\xa1\xdc\x90P$dFB\x930\x92\x84%\x91c\x99\x91\x14)\xfeHj\xbb\x93\x92\xa4\xc1\x1d\xe0\x93\x04KR\xe0\x919)\xa3p\x87T\x86\x8b\x02-\x9ek\xd2\xd2\x92\xb2\xa4S<\xcfH\xdb\x84\xf3\xa5\x89\xdciI)\xb2\xe8\x94R\x8d\x8b\xa14\xc3%\'#\xb0L\xc8(<Pd\x8c\xc5s26\xa7\xd4\xa2J`3\xb2\xa00\x82\xacU;\xd4h\xf3\x8cLJ\x19R\x1bC\x19\x12\x99\x9c\xb2\x1c\xcf\xd1\x10\xaa\xb3\x8a\xf2\tDy\x96\x00\x84F\x05P\xe81Q\x922n\xdc\xe8]\x86f\x13\x0b\x9a\x94\x84H2\xca \x85\x90\t\x88pE!y\x82\xab\x01\x07\xf4\x11\xb9%VE&9a)\xa4\xc2\x15|2e2\xc1\x02\xaa\x9dH\xc0\xc8\x02\x89\x04\x94J1?\xcbk\xf8\x0eHU.9\x13\x94\xe6_,\xbbF[\x82\x85\xd7)\x17\x01b\x9d\xe1\x1f\xc4\x17)\x07B~\x91j\xb9\x130@\xa4h\x19\t\xb1\
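
A server can answer with an HTML error page instead of the requested file, so it can be worth checking the downloaded bytes before saving them. The sketch below tests for the `%PDF-` signature that opens the output above; `looks_like_pdf` is a hypothetical helper and the error page is an invented example.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the byte string starts with the PDF signature."""
    return data.startswith(b"%PDF-")


# First bytes of a PDF download, as in the output above
pdf_header = b"%PDF-1.5\n%\xd0\xd4\xc5\xd8\n"
# Hypothetical error page returned in place of the file
html_error = b"<!DOCTYPE html><html>Not Found</html>"

print(looks_like_pdf(pdf_header))
print(looks_like_pdf(html_error))
```

With *requests*, calling `raise_for_status()` on the response before using its content is a complementary check on the HTTP status code.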

In [35]:
import pdftotext

# Make sure you have downloaded the "thinkpython.pdf" file in your current folder
# http://www.greenteapress.com/thinkpython/thinkpython.pdf

# Load your PDF
with open('thinkpython.pdf', 'rb') as f:
    pdf = pdftotext.PDF(f)
    
# What kind of object is this?
#print(type(pdf))

# What are the methods and variables of this object.
#print(dir(pdf))

# Get more detail about how to use this object
# print(help(pdf))

# How many pages?
print("pages:", len(pdf))

# Iterate over all the pages
#for page in pdf:
#    print("\n=====newpage:=====\n", page)

# Read some individual pages
print("Page 0:\n", pdf[0])
print("Page 1:\n", pdf[1])

# Read all the text into one string
string_pdf = "\n\n".join(pdf)

# Print the first 500 characters in the string
print("\n\nThe first 500 symbols in the string:\n", string_pdf[:500])

pages: 240
Page 0:
                    Think Python
How to Think Like a Computer Scientist



                             Version 2.0.17

Page 1:
 


The first 500 symbols in the string:
                    Think Python
How to Think Like a Computer Scientist



                             Version 2.0.17




                   Think Python
How to Think Like a Computer Scientist



                               Version 2.0.17




                        Allen Downey


                       Green Tea Press
                       Needham, Massachusetts


Copyright © 2012 Allen Downey.


Green Tea Press
9 Washburn Ave
Needham MA 02492
Permission is granted to copy, distribute, a


## TOPIC2: Crawling data from the Web.

As an alternative to using the Python package *requests*, you can use the command line *wget* utility to download an HTML page from a given URL or to download an entire website. If you don't have *wget* on your computer, first install it for your platform.

The *wget* tool is well suited to crawling entire websites or parts of them. With the -r option it follows links recursively, up to a given depth.
The example below downloads part of a website into a local folder named *en.wikipedia.org*. The parameter -l sets the depth to which wget follows links from the original URL. The parameter --no-parent tells wget never to ascend to the parent directory, so only pages below the given path are downloaded. See http://linuxreviews.org/quicktips/wget/ for more details.
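
The idea behind a depth-limited recursive download can be sketched in a few lines of Python. This is an illustration of the general technique, not of how wget is implemented: `SITE` is a hypothetical in-memory link graph that replaces real page downloads, and `crawl` is a made-up helper that mimics the depth limit and the stay-below-the-start-directory rule.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical link graph standing in for real pages: URL -> links found on that page
SITE = {
    "https://example.org/wiki/Main_Page": ["/wiki/A", "/wiki/B", "/about"],
    "https://example.org/wiki/A": ["/wiki/C"],
    "https://example.org/wiki/B": [],
    "https://example.org/wiki/C": [],
    "https://example.org/about": [],
}


def crawl(start_url, max_depth):
    """Breadth-first crawl limited to max_depth and to the start URL's directory."""
    start = urlparse(start_url)
    parent_path = start.path.rsplit("/", 1)[0] + "/"  # '/wiki/' for the start URL below
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for href in SITE.get(url, []):
            link = urljoin(url, href)
            parts = urlparse(link)
            # Stay on the same host and below the start directory
            if parts.netloc != start.netloc or not parts.path.startswith(parent_path):
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


print(crawl("https://example.org/wiki/Main_Page", max_depth=1))
```

A breadth-first queue keeps the depth of each page explicit, which makes the depth limit a single comparison.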

In [36]:
#Crawl the website to depth 1. To stop downloading interrupt the kernel from the menu above.
! wget -r -l 1 --no-parent https://en.wikipedia.org/wiki/Main_Page 

--2024-01-17 16:44:58--  https://en.wikipedia.org/wiki/Main_Page
Resolving en.wikipedia.org (en.wikipedia.org)... 185.15.59.224
Connecting to en.wikipedia.org (en.wikipedia.org)|185.15.59.224|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 99789 (97K) [text/html]
Saving to: ‘en.wikipedia.org/wiki/Main_Page’


2024-01-17 16:44:59 (3.12 MB/s) - ‘en.wikipedia.org/wiki/Main_Page’ saved [99789/99789]

Loading robots.txt; please ignore errors.
--2024-01-17 16:44:59--  https://en.wikipedia.org/robots.txt
Reusing existing connection to en.wikipedia.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 27524 (27K) [text/plain]
Saving to: ‘en.wikipedia.org/robots.txt’


2024-01-17 16:44:59 (39.7 MB/s) - ‘en.wikipedia.org/robots.txt’ saved [27524/27524]

FINISHED --2024-01-17 16:44:59--
Total wall clock time: 0.2s
Downloaded: 2 files, 124K in 0.03s (3.90 MB/s)


In [37]:
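#Download a page together with the files needed to display it (-p), saving HTML files with an .html extension (-E).
#If a download keeps running, stop it after a short while, otherwise it may fill your hard disk or get you banned by the website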
! wget -E -p -l 1 --no-parent http://www.kdnuggets.com/

URL transformed to HTTPS due to an HSTS policy
--2024-01-17 16:45:04--  https://www.kdnuggets.com/
Resolving www.kdnuggets.com (www.kdnuggets.com)... 104.26.3.64, 172.67.68.178, 104.26.2.64
Connecting to www.kdnuggets.com (www.kdnuggets.com)|104.26.3.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.kdnuggets.com/index.html’

www.kdnuggets.com/i     [ <=>                ] 192.53K  --.-KB/s    in 0.04s   

2024-01-17 16:45:04 (5.06 MB/s) - ‘www.kdnuggets.com/index.html’ saved [197154]

Loading robots.txt; please ignore errors.
URL transformed to HTTPS due to an HSTS policy
--2024-01-17 16:45:04--  https://www.kdnuggets.com/robots.txt
Reusing existing connection to www.kdnuggets.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 270 [text/plain]
Saving to: ‘www.kdnuggets.com/robots.txt’


2024-01-17 16:45:04 (51.5 M

For a pure Python approach we can use the Python *wget* package to download individual files, or the *scrapy* package to crawl websites (current versions of scrapy run on Python 3).
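
The wget logs above show that it loads *robots.txt* before crawling. A Python crawler can consult the same rules with the standard library's `urllib.robotparser`. The sketch below parses a hypothetical set of rules from a list of lines, whereas a real crawler would fetch the site's own *robots.txt*; the user agent name `my-crawler` is also made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would fetch <site>/robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
]

rp = RobotFileParser()
rp.parse(robots_lines)

# Ask whether a given user agent may download a given URL
allowed_page = rp.can_fetch("my-crawler", "https://example.org/wiki/Main_Page")
blocked_page = rp.can_fetch("my-crawler", "https://example.org/private/data")
# Number of seconds the site asks crawlers to wait between requests
delay = rp.crawl_delay("my-crawler")

print(allowed_page, blocked_page, delay)
```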

## TOPIC3: Getting data via APIs.
### JSON format: 
JavaScript Object Notation - a text format used widely for web-based resource sharing. Many packages and APIs return data in JSON.

Create a file named *example.json* using the Python code below to write a given string to a file.

In [38]:
json_string = """
{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}"""
with open("example.json", "w") as file:
    file.write(json_string)    

In [39]:
# Run shell command "cat" to look at the file
# The sign ! tells Jupyter Notebook that the following command is a shell/terminal command.
!cat example.json


{
    "glossary": {
        "title": "example glossary",
		"GlossDiv": {
            "title": "S",
			"GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
					"SortAs": "SGML",
					"GlossTerm": "Standard Generalized Markup Language",
					"Acronym": "SGML",
					"Abbrev": "ISO 8879:1986",
					"GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
						"GlossSeeAlso": ["GML", "XML"]
                    },
					"GlossSee": "markup"
                }
            }
        }
    }
}

In [40]:
#Open the file in a 'with' block so that it is closed after reading
with open('example.json') as file:
    json_data = json.load(file)
#json_data looks like a nested Python dictionary
print(json_data)

{'glossary': {'title': 'example glossary', 'GlossDiv': {'title': 'S', 'GlossList': {'GlossEntry': {'ID': 'SGML', 'SortAs': 'SGML', 'GlossTerm': 'Standard Generalized Markup Language', 'Acronym': 'SGML', 'Abbrev': 'ISO 8879:1986', 'GlossDef': {'para': 'A meta-markup language, used to create markup languages such as DocBook.', 'GlossSeeAlso': ['GML', 'XML']}, 'GlossSee': 'markup'}}}}}


In [41]:
#We can refer to different fields of the json object
print(json_data['glossary']['title'])
print(json_data['glossary']['GlossDiv']['title'])
print(json_data['glossary']['GlossDiv']['GlossList']['GlossEntry']['ID'])

example glossary
S
SGML
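
Chained indexing raises a `KeyError` as soon as one field is missing, which is common in real API responses. A minimal sketch of a safer lookup is shown below; `get_nested` is a hypothetical helper, not part of the `json` module, and the JSON string is a shortened version of the glossary example.

```python
import json

# Shortened version of the glossary example above
json_text = '{"glossary": {"title": "example glossary", "GlossDiv": {"title": "S"}}}'
data = json.loads(json_text)


def get_nested(obj, *keys, default=None):
    """Follow a chain of dictionary keys; return default if any key is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


print(get_nested(data, "glossary", "GlossDiv", "title"))
print(get_nested(data, "glossary", "missing", "title", default="n/a"))
```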


In the example below we use a URL called an API endpoint, together with the *requests* package, to get a JSON file, in the same way as we downloaded data from a URL above.


In [42]:
url='https://data.colorado.gov/resource/4ykn-tg5h.json'
json_dataset = requests.get(url).text
print(len(json_dataset))
#Look at the first 500 characters of the json list
print(json_dataset[:500])

with open("data_colorado_gov.json", "w") as file:
    file.write(json_dataset)


626473
[{"entityid":"18861217679","entityname":"DENVER UNION CORPROATION, Dissolved January 17, 1983","principaladdress1":"1512 LARIMER STREET #760","principalcity":"Denver","principalstate":"CO","principalzipcode":"80202","entitystatus":"Voluntarily Dissolved","jurisdictonofformation":"CO","entitytype":"Corporation","agentfirstname":"JOHN","agentmiddlename":"F.","agentlastname":"O'DEA","agentprincipaladdress1":"1512 LARIMER STREET #760","agentprincipalcity":"DENVER","agentprincipalstate":"CO","agentpr
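
The response text is a single string, so `len()` above counts characters rather than records. Parsing it with `json.loads` gives a list of dictionaries that can be counted and queried. The sketch below uses a hypothetical two-record excerpt; the field names are taken from the output above, but the values are invented.

```python
import json

# Hypothetical two-record excerpt; field names follow the output above
json_dataset_sample = """[
  {"entityid": "1", "entityname": "EXAMPLE ONE", "principalcity": "Denver",
   "entitystatus": "Voluntarily Dissolved"},
  {"entityid": "2", "entityname": "EXAMPLE TWO", "principalcity": "Boulder"}
]"""

# Parse the JSON text into a list of dictionaries, one per record
records = json.loads(json_dataset_sample)
print("number of records:", len(records))

# Records may lack some fields, so use .get() with a fallback value
cities = [rec.get("principalcity") for rec in records]
statuses = [rec.get("entitystatus", "unknown") for rec in records]
print(cities)
print(statuses)
```

A list of dictionaries like `records` can also be passed to `pd.DataFrame(records)` to continue the analysis in pandas.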
