# Working with Text Data

Often raw data comes from all kinds of text documents: structured documents (HyperText Markup Language/HTML, eXtensible Markup Language/XML, Comma Separated Values/CSV, and JavaScript Object Notation/JSON files) or unstructured documents (plain, human-readable text). 

As a matter of fact, unstructured text is perhaps the hardest data source to work with because the processing software has to infer the meaning of the data items.

#### Hint
Azure Notebooks intentionally restricts access to external URLs. This is most likely to prevent people from using the Notebooks service to mount denial-of-service attacks against other sites.


# Processing HTML Files

Module **BeautifulSoup** is used for parsing, accessing, and modifying HTML and XML documents. 
You can construct a BeautifulSoup object from a markup string, a markup file, or a URL of a markup document on the Web.



In [2]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Construct soup from a string
soup1 = BeautifulSoup("<HTML><HEAD>«headers»</HEAD>«body»</HTML>", "lxml")

# Construct soup from a local file
#soup2 = BeautifulSoup(open("myDoc.html"), "lxml")

# Construct soup from a Web document
# Remember that urlopen() does not add "http://"!
soup3 = BeautifulSoup(urlopen("http://www.networksciencelab.com/"), "lxml")


HTTPError: HTTP Error 403: Forbidden

**BeautifulSoup** supports four parsers. Only "html.parser" is part of the Python standard library; "lxml" (which also provides the "xml" parser) and "html5lib" are separate packages that must be installed:
* "html.parser" (built into Python, decent speed, not very lenient; used for “simple” HTML documents);
* "lxml" (very fast, lenient);
* "xml" (for XML files only);
* "html5lib" (very slow, extremely lenient; used for HTML documents with complicated structure, or for all HTML documents if the parsing speed is not an issue).

When the soup is ready, you can pretty print the original markup document with the function *soup.prettify()*.



In [7]:
soup1.prettify()

'<html>\n <head>\n </head>\n <body>\n  <p>\n   «headers»«body»\n  </p>\n </body>\n</html>'

Function soup.get_text() returns the text part of the markup document with all tags removed. Use this function to convert markup to plain text when it’s the plain text you are interested in.

In [2]:
htmlString = '''
    <HTML>
    <HEAD><TITLE>My document</TITLE></HEAD>
    <BODY>Main text.</BODY></HTML>
'''
soup = BeautifulSoup(htmlString, "lxml")
soup.get_text()


'\nMy document\nMain text.\n'

Often markup tags are used to locate certain file fragments. For example, you might be interested in the first row of the first table.

BeautifulSoup uses a consistent approach to all vertical and horizontal relations between tags. The relations are expressed as attributes of the tag objects and resemble a file system hierarchy.

+ The first cell in the first row of the first table is *soup.body.table.tr.td* .

+ Any tag t has a name t.name, a string value (t.string with the original content and a generator t.stripped_strings of the strings with surrounding white space removed), the parent t.parent, the next and the previous siblings (t.next_sibling and t.previous_sibling), and zero or more children t.children (tags within tags).

+ BeautifulSoup developers implemented access to HTML tag attributes through a Python dictionary interface. If object t represents a hyperlink, then the string value of the destination of the hyperlink is t["href"].

+ Perhaps the most useful soup functions are *soup.find()* and *soup.find_all()*, which find the first instance or all instances of a certain tag. 
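
The relations listed above can be tried out on a small document. This is a minimal sketch, not part of the original notebook: the sample markup is made up for illustration, and it uses "html.parser" only because that parser needs no extra installation.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample document, used only for illustration
html = """
<html><body>
<table>
  <tr><td>first cell</td><td>second cell</td></tr>
  <tr><td>third cell</td></tr>
</table>
<p>See <a href="http://example.com/">  an example  </a></p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

cell = soup.body.table.tr.td       # first cell, first row, first table
print(cell.name)                   # tag name
print(cell.string)                 # text content of the tag
print(cell.parent.name)            # enclosing tag
print(cell.next_sibling.string)    # neighboring cell in the same row

link = soup.find("a")
print(link["href"])                # attribute access, dictionary style
print(list(link.stripped_strings)) # strings without surrounding white space

# children mixes tags and text nodes; keep only the tags
rows = [child.name for child in soup.table.children if isinstance(child, Tag)]
print(rows)
```

Note that t.children also yields the white space between tags as text nodes, which is why the last line filters on Tag.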

**Examples:**
+ All instances of the H2-tag (HTML parsers convert tag names to lowercase, so search for "h2"):

    level2headers = soup.find_all("h2")
    
    
+ All bold or italic formats:

    formats = soup.find_all(["i", "b", "em", "strong"])


+ All tags that have a certain attribute (for example, id="link3"):

    soup.find_all(id="link3")
  
  
+ All hyperlinks and also the destination URL of the first link, using either the dictionary notation or the tag.get() function:

    links = soup.find_all("a")
    
    firstLink = links[0]["href"]
    
    # or
    
    firstLink = links[0].get("href")
    
    If the attribute is not present, the dictionary notation raises a KeyError, and *tag.get()* returns None. Use the *tag.has_attr()* function to check the presence of an attribute before you extract it.
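
A minimal sketch of how the two notations behave when an attribute is missing. The two-link fragment is a made-up example, and "html.parser" is used only because it is always available.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second anchor has no href attribute
soup = BeautifulSoup(
    '<a href="http://example.com/">one</a><a name="top">two</a>',
    "html.parser")
first, second = soup.find_all("a")

print(first["href"])         # dictionary notation
print(second.get("href"))    # get() returns None for a missing attribute

try:
    second["href"]           # dictionary notation raises KeyError instead
    missing = False
except KeyError:
    missing = True
print(missing)

# has_attr() lets you test before extracting
urls = [a["href"] for a in soup.find_all("a") if a.has_attr("href")]
print(urls)
```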



The following expression combines BeautifulSoup and list comprehension to extract all links and their respective URLs and labels (useful for recursive Web crawling):

In [8]:
with urlopen("http://www.networksciencelab.com/") as doc:
    soup = BeautifulSoup(doc, "lxml")

In [10]:
links = []
for link in soup.find_all("a"):
    if link.has_attr("href"):
        links.append( (link.string, link["href"]) )
        
print(links[:10])

[('Networks of Music Groups as Success Predictors', 'http://www.slideshare.net/DmitryZinoviev/networks-of-music-groups-as-success-predictors'), ('Network Science Workshop', 'http://www.slideshare.net/DmitryZinoviev/workshop-20212296'), ('Resilience in Transaction-Oriented Networks', 'http://www.slideshare.net/DmitryZinoviev/resilience-in-transactional-networks'), ('Peer Ratings in Massive Online Social Networks', 'http://www.slideshare.net/DmitryZinoviev/peer-ratings-in-massive-online-social-networks'), ('Semantic Networks of Interests in Online NSSI Communities', 'http://www.slideshare.net/DmitryZinoviev/presentation-31680572'), ('Towards an Ideal Store', 'http://www.slideshare.net/DmitryZinoviev/10-monthsymposiumbeta'), ('D.Zinoviev, "Analyzing Cultural Domains with Python,"', 'https://media.pragprog.com/newsletters/2016-04-06.html'), ('D. Zinoviev, D. Stefanescu, G. Fireman, and L. Swenson, "Semantic networks of interests in online non-suicidal self-injury communities,"', 'http://dh

In [11]:
# shorter with list comprehension
links = [(link.string, link["href"])
    for link in soup.find_all("a")
    if link.has_attr("href")]

print(links[:10])

[('Networks of Music Groups as Success Predictors', 'http://www.slideshare.net/DmitryZinoviev/networks-of-music-groups-as-success-predictors'), ('Network Science Workshop', 'http://www.slideshare.net/DmitryZinoviev/workshop-20212296'), ('Resilience in Transaction-Oriented Networks', 'http://www.slideshare.net/DmitryZinoviev/resilience-in-transactional-networks'), ('Peer Ratings in Massive Online Social Networks', 'http://www.slideshare.net/DmitryZinoviev/peer-ratings-in-massive-online-social-networks'), ('Semantic Networks of Interests in Online NSSI Communities', 'http://www.slideshare.net/DmitryZinoviev/presentation-31680572'), ('Towards an Ideal Store', 'http://www.slideshare.net/DmitryZinoviev/10-monthsymposiumbeta'), ('D.Zinoviev, "Analyzing Cultural Domains with Python,"', 'https://media.pragprog.com/newsletters/2016-04-06.html'), ('D. Zinoviev, D. Stefanescu, G. Fireman, and L. Swenson, "Semantic networks of interests in online non-suicidal self-injury communities,"', 'http://dh

# Handling CSV Files

Comma separated values (CSV) is a structured text file format used to store and move tabular or nearly tabular data. It dates back to 1972 and is a format of choice for Microsoft Excel, OpenOffice Calc, and other spreadsheet software.

Data.gov, a U.S. government Web site that provides access to publicly available data, alone provides 11,783 data sets in the CSV format.

A CSV file consists of columns representing variables and rows representing records. (Data scientists often call them observations.)

The fields in a record are typically separated by commas, but other delimiters, such as tabs (tab separated values [TSV]), colons, semicolons, and vertical bars, are also common.

Keep in mind that sometimes what looks like a delimiter is not a delimiter at all. To allow delimiter-like characters within a field as a part of the variable value (as in ...,"Hello, world",...), enclose the fields in quote characters.

Python module csv provides a CSV reader and a CSV writer.

Both objects take a previously opened text file handle as the first parameter (in the example, the file is opened with the newline='' option to avoid the need to strip the lines). You may provide the delimiter and the quote character, if needed, through the optional parameters delimiter and quotechar. Other optional parameters control the escape character, the line terminator, and so on. 

    import csv

    with open("somefile.csv", newline='') as infile:
        reader = csv.reader(infile, delimiter=',', quotechar='"')
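
To see the quote character at work, the reader can be pointed at a small in-memory sample instead of a file. This is only a sketch: the two records are made up, and io.StringIO stands in for an opened text file.

```python
import csv
import io

# Hypothetical two-record sample; the second field of the first record
# contains a comma, so it is enclosed in quote characters
sample = io.StringIO('1,"Hello, world",3.5\n2,plain text,4.0\n')

records = list(csv.reader(sample, delimiter=',', quotechar='"'))
for record in records:
    print(len(record), record)
```

Both records come back with three fields; the comma inside the quotes is kept as part of the value.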

The first record of a CSV file often contains column headers and may be treated differently from the rest of the file. This is not a feature of the CSV format itself, but simply a common practice.

A CSV reader provides an iterator interface for use in a for loop. The iterator returns the next record as a list of string fields. The reader does not convert the fields to any numeric data type (it’s still our job!) and does not strip them of the leading white spaces, unless instructed by passing the optional parameter *skipinitialspace=True*.
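
The following sketch illustrates both points on a made-up sample with a space after each comma. The column names and values are assumptions chosen for the example.

```python
import csv
import io

# Hypothetical sample with a space after each delimiter
sample = "name, age, height\nAlice, 37, 1.70\nBob, 41, 1.82\n"

# Without skipinitialspace the leading spaces stay in the fields
raw = list(csv.reader(io.StringIO(sample)))
print(raw[1])

# With skipinitialspace=True they are dropped
reader = csv.reader(io.StringIO(sample), skipinitialspace=True)
header = next(reader)    # the first record holds the column headers
# The fields are still strings, so convert them explicitly
people = [(name, int(age), float(height)) for name, age, height in reader]
print(header)
print(people)
```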


In the following example we’ll use the csv module to extract the “Answer.Age” column from a CSV file. We’ll assume that the index of the column is not known, but that the column definitely exists.

First, open the file and read the data:
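
A minimal sketch of this step, under stated assumptions: the file name demo.csv and the sample records are made up, and only the column name "Answer.Age" comes from the text. The sample file is written first so that the sketch runs on its own.

```python
import csv
import os
import tempfile

# Made-up sample data; only the "Answer.Age" column name comes from the text
sample = ("WorkerId,Answer.Age,Answer.Gender\n"
          "w1,23,female\nw2,35,male\nw3,41,female\n")
path = os.path.join(tempfile.mkdtemp(), "demo.csv")    # hypothetical file
with open(path, "w", newline='') as outfile:
    outfile.write(sample)

# Open the file and read the data
with open(path, newline='') as infile:
    data = list(csv.reader(infile))

# Locate the column by its header, then convert the fields to numbers
age_index = data[0].index("Answer.Age")
ages = [int(row[age_index]) for row in data[1:]]
print(ages)
```

Looking up the index in the header record keeps the code independent of the column order in the file.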

# Reading JSON Files
JavaScript Object Notation (JSON) is a lightweight data interchange format.

Unlike pickle, JSON is language independent, but more restricted in terms of data representation.

Many popular Web sites, such as Twitter, Facebook, and Yahoo! Weather provide APIs that use JSON as the data interchange format.

JSON supports the following data types:
+ atomic data types: strings, numbers, true, false, null
+ arrays: an array corresponds to a Python list; it is enclosed in square brackets []; the items in an array do not have to be of the same data type:

    [1, 3.14, "a string", true, null]


+ objects: an object corresponds to a Python dictionary; it is enclosed in curly braces {}; every item consists of a key and a value, separated by a colon:

    {"age" : 37, "gender" : "male", "married" : true}


+ any recursive combinations of arrays, objects, and atomic data types (arrays of objects, objects with arrays as item values, and so on)
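
As a quick illustration of how these types map to Python, the two samples from the list can be parsed with the standard json module. The nested sample is a made-up addition.

```python
import json

array = json.loads('[1, 3.14, "a string", true, null]')
obj = json.loads('{"age" : 37, "gender" : "male", "married" : true}')
# Hypothetical nested sample: an object holding an array that holds an object
nested = json.loads('{"name": "x", "scores": [1, 2, {"bonus": null}]}')

print(array)     # a Python list; true and null become True and None
print(obj)       # a Python dictionary
print(nested["scores"][2]["bonus"])
```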


Storing complex data into a JSON file is called serialization. 

The opposite operation is called deserialization. 

Python handles JSON serialization and deserialization via the functions in the module json.

Function *dump()* exports (“dumps”) a representable Python object to a previously opened text file. 

Function *dumps()* exports a representable Python object to a text string (for the purpose of pretty printing or interprocess communications).

Both functions are responsible for serialization.
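
A short sketch of both serialization functions. The record is made up, and io.StringIO stands in for a previously opened text file so that the example leaves no file behind.

```python
import io
import json

# Hypothetical record used only for illustration
record = {"age": 37, "gender": "male", "married": True, "children": None}

# dumps() returns a string; indent and sort_keys make it easier to read
text = json.dumps(record, indent=2, sort_keys=True)
print(text)

# dump() writes to an opened text file; io.StringIO stands in for one here
outfile = io.StringIO()
json.dump(record, outfile)
print(outfile.getvalue())
```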

Function *loads()* converts a valid JSON string into a Python object (it “loads” the object into Python). This conversion is always possible. 

In the same spirit, function *load()* converts the content of a previously opened text file into one Python object. It is an error to store more than one object in a JSON file, but if an existing file still contains more than one object, you can read it as text, convert the text into an array of objects (by adding square brackets around the text and comma separators between the individual objects), and use *loads()* to deserialize the text to a list of objects.
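
The workaround can be sketched as follows, assuming the simplest layout in which every object sits on its own line. Both the layout and the sample text are assumptions; a file with objects spread over several lines would need a different way to find the object boundaries.

```python
import io
import json

# Assumed layout: one JSON object per line (hypothetical sample)
text = '{"id": 1, "tag": "a"}\n{"id": 2, "tag": "b"}\n{"id": 3, "tag": "c"}\n'

# Join the non-empty lines with commas and wrap the result in square brackets
lines = [line for line in text.splitlines() if line.strip()]
array_text = "[" + ",".join(lines) + "]"

objects = json.loads(array_text)
print(len(objects), objects[0])

# The repaired text is now a single JSON value, so load() accepts it as well
same_objects = json.load(io.StringIO(array_text))
print(same_objects == objects)
```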



In [16]:
import json

object1 = {'ObjectInterpolator': 1629,  'PointInterpolator': 1675, 'RectangleInterpolator': 2042}
# Serialize an object to a string
json_string = json.dumps(object1)
print(json_string)

# Parse a string as JSON
object2 = json.loads(json_string)

# Tadaaam! Despite the two conversions, object1 and object2 still have the same value.


{"ObjectInterpolator": 1629, "PointInterpolator": 1675, "RectangleInterpolator": 2042}


In [None]:
# Your Turn

# Broken Link Detector

Write a program that, given a URL of a Web page, reports the names and destinations of broken links in the page. 

For the purpose of this exercise, a link is broken if an attempt to open it with urllib.request.urlopen() fails.
