# Repurposing Data Sources

_All language is but a poor translation._

Franz Kafka 

Sometimes data lives in formats that take extra work to ingest.  For common and explicitly data-oriented formats, common libraries already have readers built into them.  Data frame libraries, for example, read a huge number of different file types.  At worst, slightly less common formats have their own more specialized libraries that provide a relatively straightforward path between the original format and the general purpose data processing library you wish to use.

A greater difficulty often arises because a given format is not *per se* a data format, but exists for a different purpose.  Nonetheless, often there is data somehow embedded or encoded in the format that we would like to utilize.  

For example, web pages are generally designed for human readers and often rendered by web browsers with "quirks modes" that deal with not-quite-HTML, as is often needed.  Portable Document Format (PDF) documents are similar in having intended human readers in mind, and yet also often containing tabular or other data that we would like to process as data scientists. 

In both cases, we would rather have the data itself in some separate, easily ingestible, format; but reality does not always live up to our hopes.  Image formats likewise are intended for presentation of pictures to humans; but we sometimes wish to characterize or analyze collections of images in some data science or machine learning manner.

Still other formats are indeed intended as data formats, but they are unusual enough that common readers for the formats will not be available.  Generally, custom text formats are manageable, especially if you have some documentation of what the rules of the format are.  Custom binary formats are usually more work, but possible to decode if the need is sufficiently pressing and other encodings do not exist.

In [None]:
from src.setup import *

## Web Scraping

* HTML tables
* Non-tabular data
* Command-line scraping

A great deal of interesting data lives on web pages, and often, unfortunately, we do not have access to the same data in more structured data formats.  In the best cases, the data we are interested in at least lives within HTML tables inside of a web page; however, even where tables are defined, often the content of the cells has more than only the numeric or categorical values of interest to us.  For example, a given cell might contain commentary on the data point or a footnote providing a source for the information.  At other times, of course, the data we are interested in is not in HTML tables at all, but structured in some other manner across a web page.

The examples I'll show will use the Python library **BeautifulSoup** to parse web pages.  Within Python, **Scrapy** is another popular library.  For the R language, **rvest** is often useful.  In Ruby, **Nokogiri** is similar to Python's BeautifulSoup.  **Colly** is an approximately equivalent library for Go.  **Scraper** is Rust's answer.  And so on for other programming languages.

BeautifulSoup is friendly and is remarkably well able to handle malformed HTML.  In the real world, what gets called "HTML" is often only loosely conformant to any actual format standards, and hence web browsers, for example, are quite sophisticated (and complicated) in providing reasonable rendering of only vaguely structured tag soups.

### HTML Tables

<img src="img/Flu2009-infobox.png" alt="2009 Flu Infobox" width="40%"/>

Let's look at some data from Wikipedia to illustrate web scraping.  While there are surely other sources for similar data that we could locate, we will collect our data from the Wikipedia article on the 2009 flu pandemic.  Some data genuinely only readily exists on web pages.

In [None]:
url_flu_2009 = "https://te.wikipedia.org/wiki/%E0%B0%AE%E0%B1%82%E0%B0%B8:2009_flu_pandemic_data"

## Retrieve the Web Page

In [None]:
import requests
resp = requests.get(url_flu_2009)
resp.status_code

Constructing a script for web scraping, in practice, inevitably involves some trial-and-error.  We generally need to eyeball the filtered and indexed elements, and refine this selection through repetition.  

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.content)
tables = soup.find_all("table")
print(f"Found {len(tables)} tables")
for i, table in enumerate(tables):
    # Not every table has a class attribute; default to an empty list
    classes = " ".join(table.get("class", []))
    print(f"Table {i} classes: {classes}")

In this case, I have already looked at the HTML source of Wikipedia pages, and I know that the style of table used for the mortality data is a `vertical-navbox`.  We can select it for further processing.

In [None]:
mortality = tables[1]
for tr in mortality.find_all("tr"):
    # Text of each cell in the row, stripped of surrounding whitespace
    row = [td.text.strip() for td in tr.find_all("td")]
    print(row)

We've made progress here.  But there also remain some problems.  

The first row and last few rows are citational and footnote information that we do not need for the current purpose.  We can also see that the string for "Other European countries and Central Asia" lacks an internal space because an HTML `<br/>` tag had occurred where we want a single string.  

We would also like to convert the string version of numbers to integers.

In [None]:
rows = []
for tr in mortality.find_all("tr")[2:13]:
    td = tr.find_all('td')
    region = td[0].text.strip().replace("sand", "s and")
    count = int(td[1].text.replace(",", ""))
    rows.append([region, count])

The shown code is brief, but I needed to experiment and eyeball the data a fair amount to arrive at it.  The list-of-lists we created can easily be put into a DataFrame or other structure for analytic purposes.

In [None]:
pl.DataFrame(rows, orient="row", schema={
    "Region": pl.String, "Deaths": pl.Int32
}).style.fmt_integer(columns="Deaths")

Obviously this is a small example that could easily be typed in manually.  The general techniques shown might be applied to a much larger table.  More significantly, they might also be used to scrape a table on a web page that is updated frequently.  2009 is strictly historical, but other data is updated every day, or even every minute, and a few lines like the ones shown could pull down current data each time it needs to be processed.
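The missing space at the `<br/>` tag, patched above with a string replacement, can also be avoided when the text is extracted.  The following is only a sketch on a made-up HTML fragment (the counts and the second region name are invented, and `table_rows` is a hypothetical helper); it relies on the `separator` argument of BeautifulSoup's `get_text()`.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the shape of the mortality table
html = """
<table>
  <tr><th>Region</th><th>Deaths</th></tr>
  <tr><td>Other European countries<br/>and Central Asia</td><td>1,234</td></tr>
  <tr><td>Somewhere else</td><td>56</td></tr>
</table>
"""

def table_rows(table):
    """Hypothetical helper: collect [region, count] pairs from a table."""
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) != 2:      # skip header or footnote rows
            continue
        # A separator puts a space where <br/> split the text
        region = cells[0].get_text(" ", strip=True)
        count = int(cells[1].get_text(strip=True).replace(",", ""))
        rows.append([region, count])
    return rows

soup = BeautifulSoup(html, "html.parser")
rows = table_rows(soup.find("table"))
print(rows)
```

Filtering on the number of cells is one possible alternative to slicing rows by position; which is more robust depends on the page.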

### Non-Tabular Data

<img src="img/HTTP-status-codes.png" alt="HTTP status codes" width="75%"/>

For an example of a non-tabular web page, we will again use Wikipedia.  In a slightly self-referential way, we will look at the article that lists HTTP status codes in a term/definition layout.  A portion of that page renders in my browser as shown.

Numerous other codes are listed in the article that are not in the screenshot.  Moreover, there are section divisions and other descriptive elements or images throughout the page.  Fortunately, Wikipedia tends to be very regular and predictable in its use of markup.

In [None]:
url_http = ("https://en.wikipedia.org/w/index.php?"
            "title=List_of_HTTP_status_codes&oldid=947767948")

The first thing we need to do is actually retrieve the HTML content.

In [None]:
import requests
resp = requests.get(url_http)
print(resp.status_code)
pprint(resp.content[98100:98800], width=55)

The raw HTML we retrieved is not especially easy to work with.  Even apart from the fact it is compacted to remove extra whitespace, the general structure is a "tag soup" with various things nested in various places. Basic string methods or regular expressions do not help us very much in identifying the parts we are interested in.

In [None]:
class More: pass
more = More()
more.text = "..."

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(resp.content)

codes = soup.find_all('dt')
for code in codes[:5] + [more] + codes[-5:]:
    print(code.text)

This is handled much more easily with BeautifulSoup.  As we've seen already, doing so involves first creating a "soup" object from the raw HTML, then using methods of that soup to pick out the elements we care about for our data set. 

We notice that the status codes themselves are each contained within an HTML `<dt>` element.  The first and last few of the elements identified by this tag are shown.

Everything within the `codes` variable is a status code.  However, I only know that from manual inspection of all of them (eyeballing fewer than 100 items is not difficult; doing so with a million would be infeasible).  

If we look back at the original web page, we can see that the 530 status code isn't captured because the page formatting is inconsistent.

In [None]:
def find_dds_after(node):
    dds = []
    sib = node.next_sibling
    while True:     # Loop until a break
        # Last sibling within page section
        if sib is None:
            break
        # Text nodes have no element name
        elif not sib.name: 
            sib = sib.next_sibling
            continue
        # A definition node
        if sib.name == 'dd':
            dds.append(sib)
            sib = sib.next_sibling
        # Finished the <dd> definition nodes
        else:
            break
    return dds

In the BeautifulSoup API, the empty space between elements is a node of plain text that contains exactly the characters (including whitespace) inside that span.  It is tempting to use the API called `node.find_next_siblings()` in this task.  

That method will fetch too much, however, including all subsequent `<dt>` elements after the current one.  Instead, we can use the property `.next_sibling` to get each one, and stop when needed.
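The difference can be seen on a tiny, made-up definition list.  This is only an illustrative sketch: the markup is invented, and the loop is a compact variant of the sibling walk in `find_dds_after()`.  Even when restricted to `dd` elements, `find_next_siblings()` continues past the next term.

```python
from bs4 import BeautifulSoup

# Made-up definition list: the first term has two definitions
html = ("<dl><dt>401 Unauthorized</dt>\n<dd>first</dd>\n<dd>second</dd>\n"
        "<dt>402 Payment Required</dt>\n<dd>third</dd></dl>")
soup = BeautifulSoup(html, "html.parser")
first_dt = soup.find("dt")

# Fetches every later <dd> sibling, even one belonging to the next <dt>
too_many = [dd.text for dd in first_dt.find_next_siblings("dd")]

# Walking one sibling at a time lets us stop at the first other element
just_right = []
sib = first_dt.next_sibling
while sib is not None:
    name = getattr(sib, "name", None)   # None for plain text nodes
    if name == "dd":
        just_right.append(sib.text)
    elif name is not None:              # any other element ends the run
        break
    sib = sib.next_sibling

print(too_many)
print(just_right)
```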

The custom function I wrote is straightforward, but special to this purpose.  Perhaps it is extensible to similar definition lists one finds in other HTML documents.  BeautifulSoup provides numerous useful APIs, but they are building blocks for constructing custom extractors rather than foreseeing every possible structure in an HTML document. To understand it, let us look at a couple of the status codes.

In [None]:
for code in codes[23:26]:
    print(code.text)
    for dd in find_dds_after(code):
        print("  ", dd.text[:40], "...")

The HTTP 401 response contains two separate definition blocks. Let's apply the function across all the HTTP code numbers.  What is returned is a list of definition blocks; for our purpose we will join the text of each of these with a newline.

In [None]:
data = []
for code in codes:
    # All codes are 3 character numbers
    number = code.text[:3]
    # parenthetical is not part of status
    text, note = code.text[4:], ""
    if " (" in text:
        text, note = text.split(" (", 1)
        note = note.rstrip(")")
    # Compose description from list of strings
    description = "\n".join(t.text for t in find_dds_after(code))
    data.append([int(number), text, note, description])

From the Python list of lists, we can create a DataFrame for further work on the data set.

In [None]:
(pd.DataFrame(
    data, columns=["Code", "Text", "Note", "Description"])
    .set_index('Code')
    .sort_index()
    .head(6))

The two HTML examples we looked at are not general to all the web pages you may wish to scrape data from.  Organization into tables and into definition lists are certainly two common uses of HTML to represent data, but many other conventions might be used. 

Particular domain-specific (or more likely page-specific) `class` and `id` attributes on elements are also a common way to mark the structural role of different data elements.  Be prepared to try many variations on your web scraping code before you get it right.  Generally, your iteration will be a narrowing process: each stage *needs to* include the information desired, and refinement becomes a matter of removing the parts you do not want.
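As a minimal sketch of that narrowing process, assume a hypothetical page fragment in which an `id` marks the block of interest and `class` attributes mark the role of each value.  The markup and all attribute names below are invented for illustration; only the BeautifulSoup calls (`find(id=...)` and `find_all(..., class_=...)`) are real.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the id and class names are invented
html = """
<div id="readings">
  <span class="station">North</span> <span class="temp">21.5</span>
  <span class="station">South</span> <span class="temp">24.0</span>
</div>
<div id="footer"><span class="temp">not a reading</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Narrow first by id, then by class within that element
block = soup.find(id="readings")
stations = [s.text for s in block.find_all("span", class_="station")]
temps = [float(t.text) for t in block.find_all("span", class_="temp")]
readings = dict(zip(stations, temps))
print(readings)
```

Searching within `block` rather than the whole soup is what keeps the stray `temp` element in the footer out of the result.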

### Command-Line Scraping

In [None]:
%%bash
base='https://en.wikipedia.org/w/index.php?title='
url="$base"'List_of_HTTP_status_codes&oldid=947767948'
lynx -dump "$url" | sed -n '399,404p'

Text is often much easier to parse than is structured and nested HTML. "Flat is better than nested" according to the _Zen of Python_. 

The web browsers `lynx` and `links` provide a quick way to move the *content* you care about into plain text, which is relatively easy to parse.  Often looking for patterns of indentation or vertical space, searching for particular keywords, or similar text processing will get the data you need more quickly than the trial-and-error of parsing libraries like BeautifulSoup.

These text-mode web browsers both have a `-dump` switch that outputs non-interactive text to STDOUT.  Each of them has a variety of other switches that can tweak the rendering of the text.  

The output from these two tools is similar, but the rest of your scripting will need to pay attention to the minor differences.  Each of these browsers will do a very good job of dumping 90% of web pages as text that is easy to process.  Of the problem 10%, often one or the other tool will produce something reasonable to parse.  In certain cases, one of these browsers may produce useful results and the other will not.  Fortunately, it is easy simply to try both for a given task or site.

In [None]:
%%bash
base='https://en.wikipedia.org/w/index.php?title='
url="$base"'List_of_HTTP_status_codes&oldid=947767948'
links -dump "$url" | sed -n '367,372p'

Here is the `links` version of the rendering.

Obviously, I experimented, in both cases, to find the exact line ranges of output that correspond.  You can see that only incidental formatting differences exist in this friendly HTML page.

The only differences in the rendering are a one-space difference in the indentation of the definition element and some difference in the formatting of footnote links in the text.

## How to Parse HTTP Status Code Descriptions

* Look for a line that starts with 3 spaces followed by a 3 digit number;
* Accumulate all non-blank lines following that, stop at blank line;
* Strip the footnote/link markers from the texts;
* Split the code number and text in the same manner as in the previous example.

Obviously, we *could* accomplish extraction of these status codes with BeautifulSoup or other libraries.  But tools already exist that do a lot of special casing and simplification for us.

The few steps mentioned in bullets are relatively general for any web page that utilizes definition lists.  You can work out the exact code yourself, but it should not take more than 10 lines in a high-level language such as Python.
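One possible reading of the bulleted steps is sketched below.  The sample text is made up to resemble the dumps (it is not actual tool output), `parse_status_codes` is a hypothetical name, and the pattern for footnote markers assumes they look like `[34]`.

```python
import re

# Made-up text in the general shape of the dumps described above
dump = """\
   401 Unauthorized (RFC 7235)
          First line of a description that wraps
          onto a second line.[33]

   402 Payment Required
          A one-line description.[34]
"""

def parse_status_codes(text):
    records, current = [], None
    for line in text.splitlines():
        if m := re.match(r" {3}(\d{3}) (.*)", line):
            # Split "Text (note)" as in the BeautifulSoup version
            title, _, note = m.group(2).partition(" (")
            current = [int(m.group(1)), title, note.rstrip(")"), []]
            records.append(current)
        elif not line.strip():
            current = None            # blank line ends a description
        elif current is not None:
            # Strip footnote/link markers such as [34]
            current[3].append(re.sub(r"\[\d+\]", "", line).strip())
    return [[c, t, n, "\n".join(d)] for c, t, n, d in records]

parsed = parse_status_codes(dump)
for record in parsed:
    print(record)
```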

## Portable Document Format

* Identifying tabular regions
* Extracting plain text

There are a great many commercial tools to extract data which has become hidden away in PDF files. Unfortunately, many organizations—government, corporate, and others—issue reports in PDF format but do not provide data formats more easily accessible to computer analysis and abstraction.  This is common enough to have provided impetus for a cottage industry of tools for semi-automatically extracting data back out of these reports. 

I recommend using open source tools for extraction of data from PDFs.  One of these is the command-line tool `pdftotext` which is part of the **Xpdf** and the derived **Poppler** software suites.  The second is a Java tool called **tabula-java**.  Tabula-java is in turn the underlying engine for the GUI tool **Tabula**, and also has language bindings for Ruby (**tabula-extractor**), Python (**tabula-py**), R (**tabulizer**), and Node.js (**tabula-js**). 

There are two main elements that are likely to interest us in a PDF file.  An obvious one is tables of data, and those are often embedded in PDFs.  Otherwise, a PDF can often simply be treated as a custom text format. Various kinds of lists, bullets, captions, or simply paragraph text, might have data of interest to us.

<img src="img/preface-1.png" alt="Preface page 5" width="50%"/>

__Page 5 of Preface__

I wrote a book called _Cleaning Data for Effective Data Science_ a few years ago, on which this course is based. During writing, I exported its preface to a PDF.

There are three tables, in particular, which we would like to capture.

On page 5 of a draft of my preface, a table is rendered both by Pandas and as an R tibble, with corresponding minor presentation differences.  

<img src="img/preface-2.png" alt="Preface page 7" width="50%"/>

__Page 7 of Preface__

On page 7 another table is included that looks somewhat different again.

Running tabula-java requires a rather long command line, so I have created a small bash script to wrap it on my personal system:

```bash
#!/bin/bash
# script: tabula
# Adjust for your personal system path
TPATH='/home/dmertz/bin'
JAR='tabula-1.0.5-jar-with-dependencies.jar'
java -jar "$TPATH/$JAR" "$@"
```

Extraction will sometimes automatically recognize tables per page with the `--guess` option, but you can get better control by specifying a portion of a page where tabula-java should look for a table.  We simply output to STDOUT in the following code cells, but outputting to a file is just another option switch.

In [None]:
%%bash
tabula -g -t -p5 data/Preface-snapshot.pdf

Tabula does a good, but not perfect, job. The Pandas style of setting the name of the index column below the other headers threw it off slightly.  There is also a spurious first column that is usually empty strings, but has a header as the output cell number.  However, these small defects are very easy to clean up, and we have a very nice CSV of the actual data in the table.

Remember from just above, however, that page 5 actually had *two tables* on it.  Tabula-java only captured the first one, which is not unreasonable, but is not all the data we might want.  Slightly more custom instructions (arrived at by moderate trial-and-error to find the region of interest) can capture the second one.

In [None]:
%%bash
tabula -a'%72,13,90,100' -fTSV -p5 data/Preface-snapshot.pdf

To illustrate the output options, we chose tab-delimited rather than comma-separated for the output.  A JSON output is also available.  Moreover, by adjusting the left margin (as percent, but as typographic points is also an option), we can eliminate the unnecessary row numbers.  As before, the ingestion is good but not perfect.  The tibble formatting of data type markers is superfluous for us.  Discarding the two rows with unnecessary data is straightforward.

Finally for this example, let us capture the table on page 7 that does not have any of those data frame library extra markers.  This one is probably more typical of the tables you will encounter in real work.  For the example, we use points rather than page percentage to indicate the position of the table.

In [None]:
%%bash
tabula -p7 -a'120,0,220,500' data/Preface-snapshot.pdf 

The extraction here is perfect, although the table itself is less than ideal in that it repeats the number/color pairs twice.  However, that is likewise easy enough to modify using data frame libraries.

The tool tabula-java, as the name suggests, is only really useful for identifying tables.  In contrast, pdftotext creates a *best-effort* purely text version of a PDF.  Most of the time this is quite good.  From that, standard text processing and extraction techniques usually work well, including those that parse tables.  Moreover, since an entire document (or a part of it selected by pages) is output, we can also work with other elements like bullet lists, raw prose, or other identifiable data elements of a document.

In [None]:
%%bash
# Start with page 7, tool writes to .txt file 
# Use layout mode to preserve horizontal position
pdftotext -f 7 -layout data/Preface-snapshot.pdf
# Remove up to 25 leading spaces from each line
# Wrap other lines that are too wide
sed -E 's/^ {0,25}//' data/Preface-snapshot.txt |
    fmt -s | 
    head -16

The tabular part in the middle would be simple to read as a fixed width format.  The bullets at top or the paragraph at bottom might be useful for other data extraction purposes.  In any case, it is plain text at this point, which is easy to work with.
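As a rough sketch of reading such a layout-preserved table, assume lines that hold two number/color pairs side by side.  The sample lines below are made up rather than taken from the actual PDF, and simple whitespace splitting stands in for true fixed-width slicing, which would be needed if any cell contained internal spaces.

```python
# Made-up lines standing in for the tabular middle of the text output
lines = """\
Number   Color     Number   Color
     1   red            3   blue
     2   green          4   black
""".splitlines()

pairs = []
for line in lines[1:]:          # skip the header line
    fields = line.split()       # columns are separated by runs of spaces
    # Each line holds two number/color pairs side by side
    for num, color in zip(fields[0::2], fields[1::2]):
        pairs.append((int(num), color))
pairs.sort()
print(pairs)
```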

Let us turn now to analyzing images, mostly for their metadata and overall statistical characteristics.

## Image Formats

_As the Chinese say, 1001 words is worth more than a picture._

John McCarthy

The quote McCarthy plays off of is not, of course, of ancient Chinese origin.  Like much early 20th century American sinophilia—inevitably tinged with sinophobia—it originated with an advertising agency.  Henrik Ibsen had said "A thousand words leave not the same deep impression as does a single deed" prior to his 1906 death.  This was adapted in March 1911, by Arthur Brisbane speaking to the Syracuse Advertising Men's Club, as "Use a picture. It's worth a thousand words." Later repetitions added the alleged source as a "Chinese proverb," or even a false attribution to Confucius specifically, presumably to lend credence to the slogan.

**Concepts**:

* OCR and image recognition (outside scope)
* Color models
* Pixel statistics
* Channel preprocessing
* Image metadata

For certain purposes, raster images are themselves the data sets of interest to us.  "Raster" just means rectangular collections of pixel values. The field of machine learning around image recognition and image processing is far outside the scope of this book.  The few techniques in this section might be useful to get your data ready to the point of developing input to those tools, but no further than that.  Also not considered in this book are other kinds of recognition of the *content* of images at a high-level.  For example, optical character recognition (OCR) tools might recognize an image as containing various strings and numbers as rendered fonts, and those values might be the data we care about.

If you have the misfortune of having data that is only available in printed and scanned form, you most certainly have my deep sympathy.  Scanning the images using OCR is likely to produce noisy results with many misrecognitions.  Detecting those is addressed in chapter 4 (*Anomaly Detection*); essentially you will get either wrong strings or wrong numbers when these errors happen, and ideally the errors will be identifiable.  However, the specifics of those technologies are not within the current scope.

For this section, we merely want to present tools to read in images as numeric arrays, and perform a few basic processing steps that might be used in your downstream data analysis or modeling.  Within Python, the library **Pillow** is the go-to tool (backward compatible successor to **PIL**, which is deprecated).  Within R, the **imager** library seems to be most widely used for the general purpose tasks of this section.  As a first task, let us examine and describe the raster images used in the creation of this book.

In [None]:
from PIL import Image, ImageOps

for fname in glob('img/*'):
    with Image.open(fname) as im:
        print(fname, im.format, "%dx%d" % im.size, im.mode)

We see that mostly PNG images were used, with a smaller number of JPEGs.  Each has certain spatial dimensions, by width then height, and each is either RGB, or RGBA if it includes an alpha channel.  Other images might be HSV format.  Converting between color spaces is easy enough using tools like Pillow and imager, but it is important to be aware of which model a given image uses.  
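A minimal sketch of checking and converting the color model with Pillow follows.  It builds a tiny solid-color image in memory rather than using one of the book's files, so the image itself is purely an assumption for illustration.

```python
from PIL import Image

# A tiny solid-red RGB image created in memory, so no file is needed
rgb = Image.new("RGB", (4, 2), (255, 0, 0))

hsv = rgb.convert("HSV")     # hue, saturation, value
gray = rgb.convert("L")      # single 8-bit luminance channel

print(rgb.mode, hsv.mode, gray.mode)
print(hsv.getpixel((0, 0)))
```

The spatial dimensions are unchanged by the conversion; only the meaning (and sometimes the number) of the channels differs, which is why it matters to know which model an array of pixel values came from.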

<img src="img/Konfuzius-1770.jpg" />

Let us analyze the contours of the pixels.

### Pixel Statistics

We can work on getting a feel for the data, which at heart is simply an array of values, with some tools the library provides.  In the case of imager which is built on **CImg**, the internal representation is 4-dimensional.  Each plane is  an X by Y grid of pixels (left-to-right, top-to-bottom).  However, the format can represent a stack of images—for example, an animation—in the depth dimension.  The several color channels (if the image is not grayscale) are the final dimension of the array.  The Confucius example is a single image, so the third dimension is of length one.  Let us look at some summary data about the image.

In [None]:
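%%R
# Load the image if the setup code has not already done so
library(imager)
confucius <- load.image("img/Konfuzius-1770.jpg")
grayscale(confucius) %>% 
    hist(main="Luminance values in Confucius drawing") 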

In [None]:
%%R
# Save histogram to disk
png("img/(Ch03)Luminance values in Confucius drawing.png", width=1200)
grayscale(confucius) %>% 
    hist(main="Luminance values in Confucius drawing") 
# Close the graphics device so the file is fully written
invisible(dev.off())

Perhaps we would like to look at the distribution only of one color channel instead.

In [None]:
%%R
B(confucius) %>% 
    hist(main="Blue values in Confucius drawing")

In [None]:
%%R
# Save histogram to disk
png("img/(Ch03)Blue values in Confucius drawing.png", width=1200)
B(confucius) %>% 
    hist(main="Blue values in Confucius drawing")
# Close the graphics device so the file is fully written
invisible(dev.off())

The histograms above simply utilize the standard R histogram function.  There is nothing special about the fact that the data represents an image.  We could perform whatever statistical tests or summarizations we wanted on the data to make sure it *makes sense* for our purpose; a histogram is only a simple example to show the concept.  We can also easily transform the data into a tidy data frame.  As of this writing, there is an "impedance error" in converting directly to a tibble, so the below cell uses an intermediate data.frame format.  Tibbles are *often* but not *always* drop-in replacements when functions were written to work with data.frame objects.

In [None]:
%%R
data <- as.data.frame(confucius) %>%
    as_tibble %>%
    # channels 1, 2, 3 (RGB) as factor
    mutate(cc = as.factor(cc))
data

With Python and PIL/Pillow, working with image data is very similar.  As in R, the image is an array of pixel values with some metadata attached to it.  Just for fun, we use a variable name with Chinese characters to illustrate that such is supported in Python.

In [None]:
# Courtesy name: Zhòngní (仲尼)
# "Kǒng Fūzǐ" (孔夫子) was coined by 16th century Jesuits
仲尼 = Image.open('img/Konfuzius-1770.jpg')
data = np.array(仲尼)
print("Image shape:", data.shape)
print("Some values\n", data[:2, :, :])

In the Pillow format, images are stored as 8-bit unsigned integers rather than as floating-point numbers in [0.0, 1.0] range.  Converting between these is easy enough, of course, as is other normalization.  For example, for many neural network tasks, the preferred representation is values centered at zero with standard deviation of one.  The array used to hold Pillow images is 3-dimensional since it does not have provision for stacking multiple images in the same object.
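Both conversions can be sketched in a few lines of NumPy.  The small array below is made up and merely stands in for image data of shape (height, width, channels); standardizing over the whole array, rather than per channel, is an arbitrary choice for the illustration.

```python
import numpy as np

# Small made-up array standing in for 8-bit image data
pixels = np.array([[[0, 128, 255], [64, 64, 64]]], dtype=np.uint8)

# Scale to floating-point values in the range [0.0, 1.0]
unit = pixels.astype(np.float64) / 255.0

# Standardize to zero mean and unit standard deviation
standardized = (unit - unit.mean()) / unit.std()

print(unit.min(), unit.max())
print(standardized.mean(), standardized.std())
```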

Let us look at perhaps the most important aspect of images to data scientists.

### Metadata

Photographic images may contain metadata embedded inside them.  Specifically, the *Exchangeable Image File Format* (Exif) specifies how such metadata can be embedded in JPEG, TIFF, and WAV formats (the last is an audio format).  Digital cameras typically add this information to the images they create, often including details such as timestamp and latitude/longitude location.

Some of the data fields within an Exif mapping are textual, numeric, or tuples; others are binary data.  Moreover, the *keys* in the mapping are from ID numbers that are not meaningful to humans directly; this mapping is a published standard, but some equipment makers may introduce their own IDs as well.  The binary fields contain a variety of types of data, encoded in various ways.  For example, some cameras may attach small preview images as Exif metadata; but simpler fields are also encoded.

The below function will utilize Pillow to return two dictionaries, one for the textual data, the other for the binary data.  Tag IDs are expanded to human readable names, where available.  Pillow uses "camel case" for these names, but other tools have different variations on capitalization and punctuation within the tag names.  The casing by Pillow is what I like to call Bactrian case—as opposed to Dromedary case—both of which differ from Python's usual "snake case" (e.g. `BactrianCase` versus `dromedaryCase` versus `snake_case`).

In [None]:
from PIL.ExifTags import TAGS

def get_exif(img):
    txtdata, bindata = dict(), dict()
    for tag_id in (exifdata := img.getexif()):
        # Lookup tag name from tag_id if available
        tag = TAGS.get(tag_id, tag_id)
        data = exifdata.get(tag_id)
        if isinstance(data, bytes):
            bindata[tag] = data
        else:
            txtdata[tag] = data
    return txtdata, bindata

Let us check whether the Confucius image has any metadata attached.

In [None]:
get_exif(仲尼)  # Zhòngní, i.e. Confucius

We see that this image does not have any such metadata.  Let us look instead at a photograph taken of the author next to a Lenin statue in Minsk.

In [None]:
# Could continue using multi-lingual variable names by
# choosing `Ленин`, `Ульянов` or `Мінск`
dqm = Image.open('img/DQM-with-Lenin-Minsk.jpg')
ImageOps.scale(dqm, 0.1)

This image, taken with a digital camera, indeed has Exif metadata.  These generally concern photographic settings, which are perhaps valuable to analyze in comparing images.  This example also has a timestamp, although not in this case a latitude/longitude position (the camera used did not have a GPS sensor).  Location data, where available, can obviously be valuable for many purposes.
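Exif date/time fields are stored as text in the form `YYYY:MM:DD HH:MM:SS`, with colons inside the date as well.  A small sketch of converting one to a Python `datetime` follows; the value shown is hypothetical and not taken from the photograph.

```python
from datetime import datetime

# Hypothetical value of an Exif date/time field; note the colons in the date
raw = "2019:06:15 14:03:27"
taken = datetime.strptime(raw, "%Y:%m:%d %H:%M:%S")
print(taken.isoformat())
```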

In [None]:
txtdata, bindata = get_exif(dqm)
txtdata

One detail we notice in the textual data is that the tag ID 34864 was not unaliased by Pillow.  I can locate external documentation indicating that the ID should indicate "Exif.Photo.SensitivityType" but Pillow is apparently unaware of that ID.  The byte strings may contain data you wish to utilize, but the meaning given to each field is different and must be compared to reference definitions.  For example, the field `ExifVersion` is defined as ASCII bytes, but *not* as UTF-8 encoded bytes like regular text field values.  We can view that using:

In [None]:
bindata['ExifVersion'].decode('ascii')

In contrast, the tag named `ComponentsConfiguration` consists of four bytes, with each byte representing a color code.  The function `get_exif()` produces separate text and binary dictionaries (`txtdata` and `bindata`).  Let us decode that field of `bindata` with a new special function.

In [None]:
def components(cc):
    colors = {0: None,
              1: 'Y', 2: 'Cb', 3: 'Cr',
              4: 'R', 5: 'G', 6: 'B'}
    return [colors.get(c, 'reserved') for c in cc]

In [None]:
components(bindata['ComponentsConfiguration'])

Other binary fields are encoded in other ways.  The specification was originally developed by the Japan Electronic Industries Development Association (JEIDA) and is now maintained by its successor JEITA together with CIPA.  This section intends only to give you a feel for working with this kind of metadata, and is by no means a complete reference.

Let us turn our attention now to the specialized binary data formats we sometimes need to work with.

## Binary Serialized Data Structures

> I usually solve problems by letting them devour me.<br/>–Franz Kafka 

**Concepts**:

* Prefer existing libraries
* Bytes and struct data types
* Offset layout of data

There are a great many binary formats that data might live in.  Everything very popular has grown good open source libraries, but you may encounter some legacy or in-house format for which this is not true.  Good general advice is that unless there is an ongoing and/or performance sensitive need for processing an unusual format, try to leverage existing parsers.  Custom formats can be tricky, and if one is uncommon, it is as likely as not also to be underdocumented.

If an existing tool is only available in a language you do not wish to use for your main data science work, nonetheless see if that can be easily leveraged to act only as a means to export to a more easily accessed format.  A fire-and-forget tool might be all you need, even if it is one that runs recurringly, but asynchronously with the actual data processing you need to perform.

For this section, let us assume that the optimistic situation is not realized, and we have nothing beyond some bytes on disk, and some possibly flawed documentation to work with.  Writing the custom code is much more the job of a systems engineer than a data scientist; but we data scientists need to be polymaths, and we should not be daunted by writing a little bit of systems code.

For this relatively short section, we look at a simple and straightforward binary format.  Moreover, this is a real-world data format for which we do not actually need a custom parser.  Having an actual well-tested, performant, and bullet-proof parser to compare our toy code with is a good way to make sure we do the right thing.  Specifically, we will read data stored in the [NumPy NPY format](https://docs.scipy.org/doc/numpy/reference/generated/numpy.lib.format.html#module-numpy.lib.format), which is documented as follows (abridged):

* The first 6 bytes are a magic string: exactly `\x93NUMPY`.
* The next 1 byte is an unsigned byte: the major version number of the file format, e.g. `\x01`.
* The next 1 byte is an unsigned byte: the minor version number of the file format, e.g. `\x00`. 
* The next 2 bytes form a little-endian unsigned short int: the length of the header data HEADER_LEN.
* The next HEADER_LEN bytes are an ASCII string which contains a Python literal expression of a dictionary.
* Following the header comes the array data.
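The fixed-width fields in this layout map directly onto a `struct` format string.  As a minimal sketch, assuming a hand-built version 1.0 preamble in memory (the bytes and the `header` text here are illustrative, not read from the chapter's data file), the first ten bytes can be decoded in one call:

```python
import struct

# Illustrative header text; a real file pads this and ends it with a newline
header = b"{'descr': '<f8', 'fortran_order': False, 'shape': (2,)}\n"

# Magic string, major version, minor version, then HEADER_LEN as '<H'
preamble = b'\x93NUMPY' + bytes([1, 0]) + struct.pack('<H', len(header))

# '<6sBBH': 6-byte string, two unsigned bytes, little-endian unsigned short
magic, major, minor, header_len = struct.unpack('<6sBBH', preamble)
print(magic, major, minor, header_len)
print("Preamble size:", struct.calcsize('<6sBBH'))
```

Decoding the whole preamble at once is compact, but reading field by field, as the rest of this section does, makes it easier to stop early when the magic string is wrong.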

First, we read in some binary data using the standard reader, using Python and NumPy, to understand what type of object we are trying to reconstruct.  It turns out that the serialization was of a 3-dimensional array of 64-bit floating-point values.  A small size was chosen for this section, but of course real-world data will generally be much larger.

In [None]:
arr = np.load('data/binary-3d.npy')
print(arr, '\n', arr.shape, arr.dtype)

Visually examining the bytes is a good way to have a better feel for what is going on with the data.  NumPy is, of course, a clearly and correctly documented project; but for some hypothetical format, this is an opportunity to potentially identify problems with the documentation not matching the actual bytes.  More subtle issues may arise in the more detailed parsing; for example, the meaning of bytes in a particular location can be contingent on flags occurring elsewhere.  Data science is, in surprisingly large part, a matter of eyeballing data.

In [None]:
%%bash
hexdump -Cv data/binary-3d.npy
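On a system without the `hexdump` utility, a few lines of Python give a similar view.  This is only a rough sketch in the spirit of `hexdump -C`; the function name, the exact column layout, and the sample bytes are my own choices rather than anything standard.

```python
def hexdump(data, width=16):
    """Return a simple offset / hex / printable-ASCII dump of bytes."""
    pad = width * 3 - 1            # room for `width` two-digit hex pairs
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hexpart = ' '.join(f'{b:02x}' for b in chunk)
        # Show printable ASCII as-is, everything else as a dot
        text = ''.join(chr(b) if 32 <= b < 127 else '.' for b in chunk)
        lines.append(f'{offset:08x}  {hexpart:<{pad}}  |{text}|')
    return '\n'.join(lines)

# Illustrative bytes resembling the start of an NPY file
dump = hexdump(b'\x93NUMPY\x01\x00v\x00')
print(dump)
```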

As a first step, let us make sure the file really does match the type we expect in having the correct "magic string."  Many kinds of files are identified by a characteristic and distinctive first few bytes.  In fact, `file`, the common utility on Unix-like systems, uses exactly this knowledge via a database describing many file types.  For a hypothetical rare file type (i.e. not NumPy), this utility may not know about the format; nonetheless, the file might still have such a header.

In [None]:
%%bash
file data/binary-3d.npy
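The same idea is easy to imitate for formats that `file` does not know about.  The sketch below uses a tiny, hand-written table of well-known signatures; the `MAGIC` dictionary and the `sniff()` helper are hypothetical names for illustration, and a real tool would consult a far larger database.

```python
# A few well-known leading byte signatures (far from exhaustive)
MAGIC = {
    b'\x93NUMPY': 'NumPy NPY array',
    b'\x89PNG\r\n\x1a\n': 'PNG image',
    b'%PDF-': 'PDF document',
    b'\x1f\x8b': 'gzip compressed data',
}

def sniff(first_bytes):
    """Guess a file type from its first few bytes."""
    for magic, description in MAGIC.items():
        if first_bytes.startswith(magic):
            return description
    return 'unknown'

print(sniff(b'\x93NUMPY\x01\x00'))
print(sniff(b'just some text'))
```

For an in-house format, adding its signature to such a table is often the cheapest sanity check you can run before any real parsing.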

With that, let us open a file handle for the file, and proceed with trying to parse it according to its specification.  For this, in Python, we will simply open the file in bytes mode, so as not to convert to text, and read various segments of the file to verify or process portions.  For this format, we will be able to process it strictly sequentially, but in other cases it might be necessary to seek to particular byte positions within the file.  The Python `struct` module will allow us to parse basic numeric types from bytestrings.  The `ast` module will let us create Python data structures from raw strings without the security risk that `eval()` carries.

In [None]:
import struct, ast
binfile = open('data/binary-3d.npy', 'rb')

# Check that the magic header is correct
if binfile.read(6) == b'\x93NUMPY':
    vermajor = ord(binfile.read(1))
    verminor = ord(binfile.read(1))
    print(f"Data appears to be NPY format, "
          f"version {vermajor}.{verminor}")
else:
    print("Data in unsupported file format")
    print("*** ABORT PROCESSING ***")
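Sequential reads are all the NPY format requires, but formats with offset tables or optional sections need random access.  As a small sketch of what that looks like, assuming an in-memory stand-in for a file (the bytes and the length value 118 are made up for illustration), `seek()` and `tell()` let us jump to a known byte position and back:

```python
import io
import struct

# Illustrative bytes laid out like the first ten bytes of an NPY file
fh = io.BytesIO(b'\x93NUMPY\x01\x00' + struct.pack('<H', 118))

fh.seek(8)                       # absolute offset of the length field
(header_len,) = struct.unpack('<H', fh.read(2))
print(header_len, fh.tell())     # position is now just past the field

fh.seek(-4, io.SEEK_CUR)         # step back 4 bytes from where we are
major, minor = fh.read(2)        # iterating over bytes yields integers
print(major, minor)
```

A real file object opened with mode `'rb'` supports the same calls.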

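Next we need to determine how long the header is, and then read it in.  The header is always ASCII in NPY version 1, but may be UTF-8 in version 3.  Since ASCII is a subset of UTF-8, decoding as UTF-8 does no harm even if we do not check the version.  The width of the length field is a different matter: versions 2 and 3 store HEADER_LEN as a 4-byte little-endian unsigned int, so the 2-byte read used here is valid only for version 1 files such as ours.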

In [None]:
# Little-endian short int (tuple 0 element)
header_len = struct.unpack('<H', binfile.read(2))[0]
# Read specified number of bytes
header = binfile.read(header_len)
# Convert header bytes to a dictionary
# Use safer ast.literal_eval()
header_dict = ast.literal_eval(header.decode('utf-8'))
print(f"Read {header_len} bytes "
      f"into dictionary: \n{header_dict}")
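A reader meant for more than one file would want to honor the version bytes.  The following is a minimal sketch, not NumPy's own implementation: `read_npy_header()` and `fake_npy()` are hypothetical helpers, and the test bytes are built in memory rather than read from disk.  It relies on the documented difference that version 1 stores the header length in 2 bytes while versions 2 and 3 use 4 bytes.

```python
import ast
import io
import struct

def read_npy_header(fh):
    """Return ((major, minor), header_dict) from a binary file handle."""
    if fh.read(6) != b'\x93NUMPY':
        raise ValueError("not an NPY file")
    major, minor = fh.read(2)      # iterating over bytes yields integers
    # Version 1 uses a 2-byte length; versions 2 and 3 use 4 bytes
    fmt = '<H' if major == 1 else '<I'
    (header_len,) = struct.unpack(fmt, fh.read(struct.calcsize(fmt)))
    header = fh.read(header_len).decode('utf-8')
    # literal_eval() accepts only Python literals, never function calls
    return (major, minor), ast.literal_eval(header)

def fake_npy(major):
    """Build illustrative preamble-plus-header bytes for testing."""
    header = b"{'descr': '<f8', 'fortran_order': False, 'shape': (2, 3)}\n"
    fmt = '<H' if major == 1 else '<I'
    return (b'\x93NUMPY' + bytes([major, 0])
            + struct.pack(fmt, len(header)) + header)

for major in (1, 2, 3):
    print(read_npy_header(io.BytesIO(fake_npy(major))))
```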

While this dictionary stored in the header gives a nice description of the dtype, value order, and the shape, the convention used by NumPy for value types is different from that used in the `struct` module.  We can define a (partial) mapping to obtain the correct spelling of the data type for the reader.  We only define this mapping for some data types encoded as *little-endian*, but the *big-endian* versions would simply have a greater-than sign instead.  The key for 'fortran_order' indicates whether the fastest or slowest varying dimension is contiguous in memory.  Most systems use "C order" instead.

We are not aiming for high efficiency here, but for minimal code.  Therefore, I will expediently read the actual data into a simple list of values first, then later convert that to a NumPy array.

In [None]:
# Define spelling of data types and find the struct code
# (with '<', struct uses standard sizes: h=2, l=4, q=8 bytes)
dtype_map = {'<i2': '<h', '<i4': '<l', '<i8': '<q',
             '<f2': '<e', '<f4': '<f', '<f8': '<d'}
dtype = header_dict['descr']
fcode = dtype_map[dtype]
# Determine number of bytes from dtype spec
nbytes = int(dtype[2:])

# List to hold values
values = []

# Python 3.8+ "walrus operator"
while val_bytes := binfile.read(nbytes):
    values.append(struct.unpack(fcode, val_bytes)[0])
    
print("Values:", values)
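Two small refinements are worth sketching.  First, `struct.calcsize()` can confirm that each `struct` code in a mapping really has the byte width that the NumPy spelling promises.  Second, `struct.iter_unpack()` or a repeat count decodes a whole buffer without a manual read loop.  The `payload` bytes here are made up for illustration rather than taken from the data file.

```python
import struct

dtype_map = {'<i2': '<h', '<i4': '<l', '<i8': '<q',
             '<f2': '<e', '<f4': '<f', '<f8': '<d'}

# Each struct code must occupy as many bytes as the dtype spelling says
for descr, code in dtype_map.items():
    assert struct.calcsize(code) == int(descr[2:]), (descr, code)

# Illustrative payload: four little-endian 64-bit floats
payload = struct.pack('<4d', 1.5, -2.0, 0.25, 8.0)
fcode = dtype_map['<f8']

# Option 1: iterate over fixed-size records in the buffer
values = [v for (v,) in struct.iter_unpack(fcode, payload)]

# Option 2: one call with a repeat count, e.g. '<4d'
count = len(payload) // struct.calcsize(fcode)
same_values = list(struct.unpack(f'<{count}{fcode[1:]}', payload))

print(values, values == same_values)
```

Both options require the buffer length to be an exact multiple of the item size, which is itself a useful check against a truncated file.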

Let us convert the raw values into an actual NumPy array of appropriate shape and dtype now.  We also check whether the values were stored in Fortran or C order, since that determines how the flat sequence maps onto the dimensions.

In [None]:
shape = header_dict['shape']
order = 'F' if header_dict['fortran_order'] else 'C'
# The storage order matters when reshaping the flat values
newarr = np.array(values, dtype=dtype)
newarr = newarr.reshape(shape, order=order)
print(newarr, '\n', newarr.shape, newarr.dtype)
print("\nMatched standard parser:", (arr == newarr).all())
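The difference between C order and Fortran order comes down to which index varies fastest as we walk through the flat sequence of values.  As a sketch of the arithmetic, the hypothetical helper `flat_offset()` below computes where the element at a given index sits among the serialized values, under either convention:

```python
def flat_offset(index, shape, fortran_order=False):
    """Position of `index` within the flat sequence of stored values."""
    dims = list(zip(index, shape))
    if fortran_order:
        # Fortran order: the FIRST index varies fastest
        dims.reverse()
    offset = 0
    for i, n in dims:
        # Horner-style accumulation; the last pair processed varies fastest
        offset = offset * n + i
    return offset

shape = (2, 3, 4)
c_pos = flat_offset((1, 0, 2), shape)
f_pos = flat_offset((1, 0, 2), shape, fortran_order=True)
print("C order:", c_pos, " Fortran order:", f_pos)
```

Multiplying such an offset by the item size, and adding the length of everything that precedes the array data, gives the byte position of a single element; that would let a reader seek directly to one value instead of decoding the whole file.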

Just as binary data can be oddball, so can text.

## Denouement

> They invaded the hexagons, showed credentials which were not always false, 
> leafed through a volume with displeasure and condemned whole shelves: their 
> hygienic, ascetic furor caused the senseless perdition of millions of books.<br/>
> –Jorge Luis Borges (The Library of Babel)

**Topics**: Web Scraping; Portable Document Format; Image Formats; Binary Formats; Custom Text Formats

This chapter contemplated data sources that you may not, in your first thought, think of as *data* per se.  Within web pages and PDF documents, the intention is usually to present human readable content that only contains analyzable data as a secondary concern.  In the ideal situation, whoever produced those less structured documents will also provide structured versions of the same data; however, that ideal situation is only occasionally realized.  A few nicely written Free Software libraries let us do a reasonable job of extracting meaningful data from these sources, albeit always in a way that is somewhat specific to the particular document, or at least to the family of revisions of a particular document.

Images are a very common interest in machine learning.  Drawing various conclusions about or characterizations of the content portrayed in images is a key application of deep neural networks, for example.  While those actual machine learning techniques are outside the scope of this particular book, this chapter introduced you to the basic APIs for acquiring an array/tensor representation of images, and performing some basic correction or normalization that will aid in those later machine learning models.

There are also formats that, while directly intended as a means of recording and communicating data, are not widely used, and tooling to read them directly may not be available to you.  The specific examples we presented, for both binary and textual custom formats, are ones for which library support exists (less so for the text formats this chapter examines), but the general reasoning and approach to creating custom ingestion tools resemble what you will need when you encounter an antiquated, in-house, or merely idiosyncratic format.

The next chapter begins the next saga of this book.  These early chapters paid special attention to data formats you need to work with.  The next two chapters look at problems characteristic of data elements per se, not only their representation.  We begin by looking for anomalies in data.