# IST 652 - Lecture 9A

-----

## Review:
- Semi-structured Data
- HTML
- XML

## Explore:
- JSON
- Unicode

---

## Review

### Semistructured Data
- What is it?
- What are the two data formats we covered last class?
- How are they structured?
- What are some of its primary uses?
- Why isn't this data structured?

![image.png](attachment:image.png)

### HTML
- What is it?
- Where is it primarily used?
- How is the data structured?
- How did we fetch HTML data from the internet using Python?
- How did we convert HTML structure to parsable data objects in Python?
- How is HTML different from XML?

### XML
- What is it?
- Where is it primarily used?
- How is the data structured?
- How did we fetch XML data from the internet using Python?
- How did we convert XML structure to parsable data objects in Python?

---

## Semi-structured Data <i>(Continued)</i>

## JSON
![image.png](attachment:image.png)
- JavaScript Object Notation
- http://www.json.org

#### JSON is the <u>third</u> of the main data interchange formats that we will look at. 
- Has a <b>lightweight</b> format:
    - It makes use of representations of data structures that are both easy for humans to read and for parsers to translate into internal data structures.

<span style="color:darkorange"><b>The difference between JSON and XML:</b></span>
- http://json.org/example.html

#### The two main structures used are what JSON calls objects and arrays:
- Objects are collections of name/value pairs
    - Think Python dictionaries
- Arrays are ordered lists of values
    - Think Python lists

#### The format rules are as follows:

- An <b><u>object</u></b> is an unordered set of name/value pairs, consisting of outer curly braces “{” and “}”, with members separated by commas, and each member takes the form string : value. <br><br>

- An <b><u>array</u></b>  is an ordered collection of values, consisting of a pair of outer square brackets “[” and “]”, with values separated by commas.

A value (or data point) can be one of the following:
- An object
- An array
- A string
- A number
- true or false
- null.

A string can have any Unicode character, with the default encoding being UTF-8
- Can also be UTF-16 or UTF-32

### Example:

![image.png](attachment:image.png)

---
When creating JSON for a particular interchange purpose, you can:
- Give a JSON schema that outlines the data structures
- Give the names of the name/value pairs
- Give type and description information for each value.

However, in practice, people often give an example.
- For example, in the Twitter documentation of their API, each call returns a JSON object
    - e.g. https://dev.twitter.com/rest/reference/get/statuses/user_timeline

In order to use JSON in our programs, we will use the <i>json</i> python package that converts json strings to Python internal data structures of lists and dictionaries, and can convert those structures back to strings.
    
- NOTE: Instead of always storing data in files, we’ll also look at storing data in NoSQL databases.


## Python <i>json</i>

<b>- json docs:</b> https://docs.python.org/3.4/library/json.html

![image.png](attachment:image.png)

In [2]:
import json

The Python json package has functions to handle json formatted data.
- The following function parses the json string and produces a python structure consisting of dictionaries and lists corresponding to the json structure.
- NOTE: There is no need for further functions to access the data structure. 

In [None]:
json.loads(jsonstring)

For a list of python data types corresponding to json entities:
- https://pythonspot.com/json-encoding-and-decoding-with-python/
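
As a minimal sketch of that mapping, the snippet below parses a small hand-written JSON string with json.loads and prints the Python type of each value. The `sample` text and its field names are invented for illustration and do not come from any real feed.

```python
import json

# Hypothetical JSON text, invented for this illustration
sample = '''
{
  "name": "quake",
  "mag": 6.1,
  "felt": 214,
  "reviewed": true,
  "nst": null,
  "coordinates": [-76.2801, 4.5596, 113.28],
  "source": {"net": "us"}
}
'''

parsed = json.loads(sample)

# JSON object -> dict, array -> list, string -> str,
# number -> int or float, true/false -> bool, null -> None
for key, value in parsed.items():
    print(f"{key!r:15} {type(value).__name__:10} {value!r}")
```

Because the result is ordinary dictionaries and lists, normal indexing such as `parsed["source"]["net"]` is all that is needed to reach nested values.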

The json package also has a function <i>dumps</i> to convert python data structures to a json string.
- This can be used to save json data in a file, but is primarily used to “pretty print” json structures, with optional parameters for sort and indent.

In [None]:
json.dumps(python_object, sort_keys = True, indent=4)
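
The following sketch shows json.dumps on a small Python dictionary, once in compact form and once pretty-printed, followed by a round trip back through json.loads. The `record` dictionary is a made-up example, not data from the lecture.

```python
import json

# Hypothetical Python structure, invented for this illustration
record = {"place": "El Dovio, Colombia", "mag": 6.1, "tsunami": False, "nst": None}

compact = json.dumps(record)                            # one-line JSON string
pretty = json.dumps(record, sort_keys=True, indent=4)   # sorted keys, 4-space indent
print(compact)
print(pretty)

# Round trip: parsing the string gives back an equal Python structure
restored = json.loads(pretty)
print(restored == record)
```

Note how Python's `False` and `None` are written as JSON `false` and `null` in the output string.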

#### Now we’ll make a collection to hold some earthquake data from the USGS earthquake web site:
- http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php

This page shows the format of the json that can be downloaded from this web site.  Let’s use the “significant earthquakes” from the past 30 days.

In [4]:
import urllib.request
import json

In [33]:
earthquake_url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_month.geojson"

- This gets the result from the web site (which is in python bytes) and converts it to a string using the decode() function.
![image.png](attachment:image.png)

In [38]:
response = urllib.request.urlopen(earthquake_url)
type(response)

http.client.HTTPResponse

In [39]:
json_string = response.read().decode('utf-8')
json_string[:500]

'{"type":"FeatureCollection","metadata":{"generated":1553456424000,"url":"https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_month.geojson","title":"USGS Significant Earthquakes, Past Month","status":200,"api":"1.7.0","count":10},"features":[{"type":"Feature","properties":{"mag":6.1,"place":"7km NW of El Dovio, Colombia","time":1553368876610,"updated":1553455398055,"tz":-300,"url":"https://earthquake.usgs.gov/earthquakes/eventpage/us1000jkrq","detail":"https://earthquake.usgs.g'

- Now we use the json package to transform the string to Python data structures consisting of lists and dictionaries. <br><br>

- The outermost level is a dictionary and we can look at the keys, comparing them with the format displayed at the web site.


In [40]:
eq_parsed_json = json.loads(json_string)
type(eq_parsed_json)

dict

In [41]:
eq_parsed_json.keys()

dict_keys(['type', 'metadata', 'features', 'bbox'])

In [42]:
eq_parsed_json['type']

'FeatureCollection'

In [43]:
eq_parsed_json['metadata']

{'api': '1.7.0',
 'count': 10,
 'generated': 1553456424000,
 'status': 200,
 'title': 'USGS Significant Earthquakes, Past Month',
 'url': 'https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_month.geojson'}

We can even dive deeper into the nested dictionaries:

In [44]:
title = eq_parsed_json['metadata']['title']
title

'USGS Significant Earthquakes, Past Month'

Now the earthquakes themselves are in a list under <u>features</u>.
- Let’s get the first one and look at its structure, again comparing with the web site:

In [51]:
quakelist = eq_parsed_json['features']
quake1 = quakelist[0]

In [49]:
len(quakelist)

10

In [53]:
type(quake1)

dict

In [79]:
quakelist[1].keys()

dict_keys(['type', 'properties', 'geometry', 'id'])

- We can continue to dive deeper into the structure of the data, but we can also get a more readable view of the format by pretty-printing it with json.dumps:

In [62]:
print(json.dumps(quake1, indent=2))

{
  "type": "Feature",
  "properties": {
    "mag": 6.1,
    "place": "7km NW of El Dovio, Colombia",
    "time": 1553368876610,
    "updated": 1553455398055,
    "tz": -300,
    "url": "https://earthquake.usgs.gov/earthquakes/eventpage/us1000jkrq",
    "detail": "https://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/us1000jkrq.geojson",
    "felt": 214,
    "cdi": 6,
    "mmi": 4.763,
    "alert": "green",
    "status": "reviewed",
    "tsunami": 0,
    "sig": 701,
    "net": "us",
    "code": "1000jkrq",
    "ids": ",us1000jkrq,",
    "sources": ",us,",
    "types": ",dyfi,geoserve,ground-failure,losspager,moment-tensor,origin,phase-data,shakemap,",
    "nst": null,
    "dmin": 1.972,
    "rms": 1.28,
    "gap": 27,
    "magType": "mww",
    "type": "earthquake",
    "title": "M 6.1 - 7km NW of El Dovio, Colombia"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [
      -76.2801,
      4.5596,
      113.28
    ]
  },
  "id": "us1000jkrq"
}
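
To show how individual values are reached in this nested structure, here is a small sketch that works on a trimmed copy of the record printed above (named `sample_quake` here so it does not depend on the live feed). It assumes that the `time` field is milliseconds since the Unix epoch, which is how the USGS feed documentation describes it; treat that as an assumption if you use another source.

```python
from datetime import datetime, timezone

# Trimmed copy of the record printed above (only a few properties kept)
sample_quake = {
    "type": "Feature",
    "properties": {
        "mag": 6.1,
        "place": "7km NW of El Dovio, Colombia",
        "time": 1553368876610,
    },
    "geometry": {"type": "Point", "coordinates": [-76.2801, 4.5596, 113.28]},
    "id": "us1000jkrq",
}

props = sample_quake["properties"]

# The coordinates array is unpacked as longitude, latitude, depth
longitude, latitude, depth = sample_quake["geometry"]["coordinates"]

# Assumption: "time" is milliseconds since 1970-01-01 UTC
when = datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc)

print(sample_quake["id"], props["mag"], props["place"])
print(when.isoformat(), longitude, latitude, depth)
```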


### How can we convert semi-structured data to structured data?

- Let a pandas dataframe be the structured data set
- Let's assemble this dataframe iteratively
    - Each row should be a single earthquake from the quakelist
    - The columns should be composed of the properties
    - AND also the flattened geometry coordinates: ['longitude', 'latitude', 'depth']

In [64]:
import pandas as pd

In [195]:
# initialize an empty list to collect one single-row dataframe per earthquake
# we'll concatenate them into the master df (called df) at the end
# The only criterion is that the dataframe columns are the same
# let the "id" value be the dataframe index
from io import StringIO

frames = []

# quakelist is a list of dictionaries
for quake in quakelist:
    # we are dumping (i.e. converting a dictionary to a json string) for demonstration purposes
    # StringIO wraps the string, since newer pandas versions expect a file-like object here
    df_json = pd.read_json(StringIO(json.dumps(quake))).reset_index()
    # we want to unstack a tall dataframe to a wide dataframe by setting the columns to each of the properties
    df_props = df_json.pivot(index='id', columns='index', values='properties')

    # we want to convert the array of geometry values to unique columns in a dataframe
    geo_labels = ['id', 'longitude', 'latitude', 'depth']
    geo_series = df_json[df_json['index'] == 'coordinates'].iloc[0]
    geo_vals = [geo_series['id']] + geo_series['geometry']
    geo_df = pd.DataFrame([geo_vals], columns=geo_labels).set_index('id', drop=False)

    # the pandas join method combines data from two separate df's, matching on the index value.
    # For combining df's not on the index, use the df.merge method instead.
    df_prop_geo = df_props.join(geo_df)
    frames.append(df_prop_geo)

# DataFrame.append was removed in pandas 2.0, so we stack the rows with pd.concat instead
df = pd.concat(frames)

In [212]:
df.describe()

| | longitude | latitude | depth |
|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 |
| mean | -75.46473 | 7.012377 | 92.69 |
| std | 93.786708 | 25.730986 | 123.489809 |
| min | -177.8845 | -32.0238 | 0.76 |
| 25% | -115.794 | -15.3675 | 12.295 |
| 50% | -80.007883 | 11.2448 | 23.035 |
| 75% | -67.166525 | 30.251208 | 117.57 |
| max | 167.6506 | 38.280333 | 358.34 |


In [211]:
df.iloc[:, :-3].describe()

| | alert | cdi | code | coordinates | detail | dmin | felt | gap | ids | mag | ... | status | time | title | tsunami | type | types | tz | updated | url | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8 | 10.0 | 10 | 0.0 | 10 | 9.0 | 10 | 10 | 10 | 10.0 | ... | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 | 10 |
| unique | 1 | 8.0 | 10 | 0.0 | 10 | 9.0 | 10 | 10 | 10 | 9.0 | ... | 1 | 10 | 10 | 2 | 1 | 8 | 7 | 10 | 10 | 10 |
| top | green | 5.2 | 1000jb8d | | https://earthquake.usgs.gov/earthquakes/feed/v... | 0.03542 | 127 | 37 | ,us1000jaz5,se60233907, | 6.3 | ... | reviewed | 1553095438690 | M 6.3 - 52km E of Luganville, Vanuatu | 0 | earthquake | ,dyfi,geoserve,ground-failure,losspager,moment... | -300 | 1552956958607 | https://earthquake.usgs.gov/earthquakes/eventp... | pr2019071006 |
| freq | 8 | 2.0 | 1 | | 1 | 1.0 | 1 | 1 | 1 | 2.0 | ... | 10 | 1 | 1 | 8 | 10 | 3 | 3 | 1 | 1 | 1 |
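
A plainer way to reach the same kind of table, sketched here under the assumption that every feature has the `id` / `properties` / `geometry` layout shown earlier, is to build one flat dictionary per earthquake using only the standard library. The helper name `flatten_quake` and the two sample records are hypothetical, invented for this illustration.

```python
def flatten_quake(quake):
    """Return one flat dict (a future table row) for a GeoJSON-style feature."""
    row = {"id": quake["id"]}
    row.update(quake["properties"])                    # one column per property
    lon, lat, depth = quake["geometry"]["coordinates"]
    row.update({"longitude": lon, "latitude": lat, "depth": depth})
    return row


# Hypothetical, heavily trimmed records for illustration
sample_quakelist = [
    {"id": "a1", "properties": {"mag": 6.1, "alert": "green"},
     "geometry": {"type": "Point", "coordinates": [-76.28, 4.56, 113.28]}},
    {"id": "b2", "properties": {"mag": 5.2, "alert": None},
     "geometry": {"type": "Point", "coordinates": [167.65, -15.37, 12.3]}},
]

rows = [flatten_quake(q) for q in sample_quakelist]
for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` can then be handed to pandas, for example with `pd.DataFrame(rows).set_index('id')`, which avoids the pivot and join steps at the cost of not demonstrating them.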


---

## Unicode

#### How to handle Unicode for Python 3:
- https://docs.python.org/3/howto/unicode.html

![image.png](attachment:image.png)

Unicode is an industry standard for defining all of the characters that can be used for text data.  The standard is defined by the Unicode Consortium:
- http://unicode.org/consortium/consort.html

### A character is the smallest component of text.
- There are standard language characters, like    <i>A,   B,   C,   È,   Í,   Ω</i>
- There are also special purpose characters like symbols and pictographs (like emojis)

### Unicode defines each character as a code point.
- A code point is an integer value, usually written in base 16 (hexadecimal).
- The code point is assigned a standard name, describing it as an ideal.
- Code points have no implementation, fonts or anything to do with the representation of the character in an actual medium.
- Characters can be represented on the screen or on paper by a set of graphical elements called a glyph.

Here are some example code point values, example glyphs, and character names. U+ is used as a prefix for the hexadecimal notation of the code point; for example, U+0061 is the hexadecimal number 0x0061, which is equivalent to the decimal number 97:

U+0061  |  ‘a’;   LATIN SMALL LETTER A <br><br>
U+0394  |  'Δ';    GREEK CAPITAL LETTER DELTA <br><br>
U+007B  | ‘{‘;    LEFT CURLY BRACKET <br><br>
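
The same three code points can be checked from Python with the built-in ord() and chr() functions. This is only a small illustration; the U+ formatting is done by hand with a format specifier.

```python
for ch in ["a", "\u0394", "{"]:
    code_point = ord(ch)                      # character -> integer code point
    label = f"U+{code_point:04X}"             # conventional U+ hexadecimal notation
    print(label, code_point, repr(chr(code_point)))   # chr() goes back to the character
```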

### Full Emoji List 😀:
https://unicode.org/emoji/charts/full-emoji-list.html
![image.png](attachment:image.png)

<u>Unicode allows for the definition of over a million code points, from 0 to the largest hexadecimal number <b>10FFFF</b></u>
- These are defined in layers, with many unused.  The standard itself is huge;  here is a guide to reading it:
    - http://www.cs.tut.fi/~jkorpela/unicode/guide.html
<br><br>

- There are separate guides for particular parts of Unicode, such as emojis:  http://emojipedia.org/unicode-8.0/

#### So how should we represent these code points as characters inside computers? 

- Suppose that we just represented them as 6-digit hexadecimal numbers so that we can represent all 1 million code points.
- Each hexadecimal digit takes 4 bits, so this would mean 6 * 4 = 24 bits for every character (in practice rounded up to a 32-bit integer).
    - This would be really wasteful of space, since most text uses only the low code points, so instead we represent the code points by an encoding.
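
To make the size trade-off concrete, this sketch compares a fixed-width encoding (UTF-32, which spends 4 bytes on every code point) with UTF-8 for plain ASCII text. The sample string is arbitrary.

```python
import sys

largest = 0x10FFFF                      # highest Unicode code point
print(largest, largest.bit_length())    # decimal value and number of bits needed
print(sys.maxunicode == largest)        # Python 3 strings cover the full range

text = "hello"
fixed = text.encode("utf-32-be")        # fixed width: 4 bytes per character
variable = text.encode("utf-8")         # variable width: 1 byte per ASCII character
print(len(fixed), len(variable))
```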

#### To represent characters in computers, a character encoding is used to map the code point to a sequence of binary numbers.
- Historically, early attempts at defining character sets were local to particular languages.
- Two such encodings arose for English and western European languages:
    - ASCII – represents English characters and some symbols in 7 bits.
    - Latin-1 – represents additional characters needed for western European languages in 8 bits

#### The most widely used character encoding for Unicode is UTF-8.
- This character encoding can represent all of the Unicode characters as a sequence of 8 bit bytes.
    - It efficiently uses fewer bytes for the lower code points;  here are the rules:

- If the code point is < 128, it’s represented by the corresponding byte value.
- If the code point is >= 128, it’s turned into a sequence of two, three, or four bytes, where each byte of the sequence is between 128 and 255.

The Wikipedia article on UTF-8 gives more details about the encoding:
- https://en.wikipedia.org/wiki/UTF-8
- <b>Note</b>: not every sequence of bytes is validly formatted UTF-8.
- Code points 0 to 127 correspond to ASCII and code points 0 to 255 to Latin-1; only the ASCII range is stored as the same single byte in UTF-8.
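
The rules above can be observed directly by encoding a few characters and looking at the resulting bytes. The characters chosen here are arbitrary examples from different code point ranges.

```python
# Arbitrary characters from different ranges: a, é, Δ, U+A000, and an emoji
samples = ["a", "\u00e9", "\u0394", "\ua000", "\U0001F600"]

lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    # list(bytes) shows each byte as an integer between 0 and 255
    print(f"U+{ord(ch):04X}", len(encoded), list(encoded))

# Latin-1 stores é as one byte, while UTF-8 needs two bytes for the same code point
latin1_e = "\u00e9".encode("latin-1")
utf8_e = "\u00e9".encode("utf-8")
print(latin1_e, utf8_e)
```

Code points below 128 come out as a single byte equal to the ASCII value, and every byte of a multi-byte sequence falls in the range 128 to 255.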

## Unicode in Python 3

One of the main motivations in defining Python 3 was to make Unicode be the basis of the string type, instead of ASCII as in Python 2.
- <b>So every string in Python 3 is Unicode, and the default encoding is <u>UTF-8</u>.</b>

#### So the problem with character encodings is I/O
- How to input characters to be stored in Unicode and how to output them using various pieces of software:
    - Python interpreter to terminal output
    - Python print function to terminal output
    - Files in whatever OS you’re using
    - Databases like Mongo
    - Microsoft Word
    - Web browsers, etc.
- Exactly which Unicode characters can be input or output will depend on the devices and the operating system.

##### <span style="color:darkorange"><b> When we type characters from our keyboard, Python automatically converts them to Unicode. </b></span>
- But what about Unicode characters that are not on our keyboards?
- And some of you may have keyboard mappings for other languages.
- As you type these examples into the python interpreter, you may get different results as python interacts with your operating system to give terminal output.

<b>Digression: Python literals:</b>
- Note that normally when you type in a number, Python interprets it as a decimal number.
    - If you want a hexadecimal number, you prefix it with 0x.

In [213]:
15    # the decimal number 15

15

In [214]:
0xFF    # hexadecimal numbers

255

In [217]:
255  # the decimal number 255

255

In [218]:
0xFF == 255

True

- We can also use hexadecimal in strings with the escape sequence \x followed by 2 hexadecimal digits; in a string, the value is interpreted as a Unicode code point:

In [219]:
'\xFFabc'

'ÿabc'

<b> / End Digression </b>

### Encoding in Unicode

We can also give Unicode characters by their code point (hexadecimal) number, using the escape sequences:
- <b>\u</b> (with 4 hex digits)
- <b>\U</b> (with 8 hex digits)

In [220]:
'\u0394'

'Δ'

In [221]:
"\U00000394"

'Δ'

- Or we can use the escape sequence <b>\N</b> with the official Unicode name:

In [222]:
"\N{GREEK CAPITAL LETTER DELTA}"

'Δ'

### Bytes

There is a type called <b>bytes</b>, where you can create a sequence of bytes either using keyboard characters or the hexadecimal notation.
- The bytes should be enclosed in quotes after the letter b:

In [223]:
b'\xFFab'

b'\xffab'

In [224]:
type(b'ab')

bytes

In [225]:
type(b'\xFFab')

bytes

In [226]:
len(b'\xFFab')

3

Python gives you some control over these transitions of text between various devices and Unicode strings through the decode() and encode() methods.

- <b>bytes.decode()</b>
    - Converts from bytes to Unicode strings

- <b>str.encode()</b>
    - Converts from strings to bytes
- Both take a requested encoding and an error handler, which gives some control over <u>UnicodeEncodeError</u> and <u>UnicodeDecodeError</u>.
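
As a minimal sketch of the two directions, the snippet below encodes a string to bytes, decodes it back, and then shows what happens when the bytes are decoded with a codec that cannot represent them. The sample text is arbitrary.

```python
text = "\u0394 = change"                  # arbitrary sample string starting with Δ

data = text.encode("utf-8")               # str -> bytes
print(type(data).__name__, data)

restored = data.decode("utf-8")           # bytes -> str
print(restored == text)

# Decoding with the wrong codec can fail; the errors argument controls what happens
try:
    data.decode("ascii")                  # default errors="strict" raises
except UnicodeDecodeError as exc:
    print("strict decode failed:", exc.reason)

# "replace" substitutes U+FFFD for each byte that cannot be decoded
lenient = data.decode("ascii", errors="replace")
print(lenient)
```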

These are the bytes.decode() examples from the HowTo illustrating the various error handlers.
- Note that the byte ‘\x80’ on its own is not valid UTF-8.

In [229]:
bytes1 = b'\x80abc'

In [230]:
bytes1.decode("utf-8", "strict")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte

In [231]:
bytes1.decode("utf-8", "replace")

'�abc'

In [234]:
bytes1.decode("utf-8", "backslashreplace")

'\\x80abc'

In [235]:
bytes1.decode("utf-8", "ignore")

'abc'

The tutorial also notes functions to create one character Unicode strings and code point numbers:

In [236]:
chr(57344)

'\ue000'

In [238]:
ord('\ue000')

57344

- These are the str.encode() examples from the HowTo.
- Encoding a python string to UTF-8 should always succeed, but there may be errors if you try to encode to ascii or latin-1.

In [239]:
u = chr(40960) + 'abcd' + chr(1972)

In [240]:
type(u)

str

In [243]:
'ꀀabcd\u07b4'

'ꀀabcd\u07b4'

In [242]:
print(u)

ꀀabcd޴


##### These examples show the options to handle the UnicodeEncodeErrors

In [244]:
u.encode('ascii', 'ignore')

b'abcd'

In [246]:
u.encode('ascii', 'replace')

b'?abcd?'

In [247]:
u.encode('ascii', 'xmlcharrefreplace')

b'&#40960;abcd&#1972;'

In [248]:
u.encode('ascii', 'backslashreplace')

b'\\ua000abcd\\u07b4'

An interesting Python module is <i>unicodedata</i>, which can give you the category and the official name of Unicode characters (the code point number itself comes from ord()).

In [249]:
import unicodedata

In [250]:
u = chr(233) + chr(0x0bf2) + chr(3972) + chr(6000) + chr(13231)

In [251]:
u

'é௲྄ᝰ㎯'

In [252]:
for i, c in enumerate(u):
    print(i, '%04x' % ord(c), unicodedata.category(c), end=" ")
    print(unicodedata.name(c))


0 00e9 Ll LATIN SMALL LETTER E WITH ACUTE
1 0bf2 No TAMIL NUMBER ONE THOUSAND
2 0f84 Mn TIBETAN MARK HALANTA
3 1770 Lo TAGBANWA LETTER SA
4 33af So SQUARE RAD OVER S SQUARED


Python is able to read and write files in UTF-8 by using the encoding parameter of the open() function.
- <b>But note that this can give errors.</b>

#### Here is some advice about various file situations, encodings and error handlers:
- http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
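
Here is a minimal sketch of file I/O with an explicit encoding, including one way a read can fail. The file name and temporary directory are placeholders chosen for the example.

```python
import os
import tempfile

text = "\ua000abcd\u07b4"                                  # string with two non-ASCII characters
path = os.path.join(tempfile.mkdtemp(), "sample.txt")     # throwaway file location

# Write and read back with an explicit encoding
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Reading the same bytes as ASCII fails unless an error handler is given
try:
    with open(path, "r", encoding="ascii") as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the bytes that cannot be decoded
with open(path, "r", encoding="ascii", errors="ignore") as f:
    ascii_only = f.read()

print(round_trip == text, strict_failed, repr(ascii_only))
```

Passing the encoding explicitly avoids depending on the platform default, which can differ between operating systems.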

But for us, the most practical use of all this is to use the str.encode() function in order to output text to devices that can’t display every character.
- Here is an example, using an error handler that drops any non-ASCII characters before printing:

In [255]:
text = 'ꀀabcd\u07b4'

In [259]:
print(text.encode('ascii','ignore'))

b'abcd'


In [260]:
print(text.encode('latin-1','ignore'))

b'abcd'
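
If you want a string rather than a bytes object for printing, one option is to encode and then decode again with the same limited encoding. The helper name `to_printable` is hypothetical, and the sample text is made up for this sketch.

```python
def to_printable(text, encoding="ascii", errors="replace"):
    """Round-trip text through a limited encoding so only representable characters remain."""
    return text.encode(encoding, errors).decode(encoding)


text = "\ua000abcd\u07b4 caf\u00e9"

print(to_printable(text))                        # unsupported characters become '?'
print(to_printable(text, errors="ignore"))       # unsupported characters are dropped
print(to_printable(text, "latin-1", "ignore"))   # é survives because Latin-1 includes it
```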


In [280]:
'😀'.encode("utf-8", "ignore")

b'\xf0\x9f\x98\x80'

In [282]:
'😀'.encode("utf-8", "ignore").decode()

'😀'

---