### ST445 Managing and Visualizing Data

# The Shape of Data

### Week 2 Lecture, MT 2017 - Kenneth Benoit

# Plan today

* Questions and administration
* Representing text data: Unicode
* Representing dates
* Representing sparse matrix formats
* Data and datasets
    * "tidy" data
    * reshaping data
    * normalization forms
* Lab preview

# How to represent text data: encoding

-   a “character set” is a list of characters with associated numerical
    representations

-   ASCII: the original character set, uses just 7 bits ($2^7 = 128$
    characters) – see <http://ergoemacs.org/emacs/unicode_basics.html>

-   ASCII was later extended, e.g. ISO-8859
    <http://www.ic.unicamp.br/~stolfi/EXPORT/www/ISO-8859-1-Encoding.html>,
    using 8 bits ($2^8$)

-   but this became a jungle, with no standards:
    <http://en.wikipedia.org/wiki/Character_encoding>

# Solution: Unicode

-   Unicode was developed to provide a unique number (a “code point”)
    to every known character – even some that are “unknown”

-   more here

# Unicode details

- here
- here


# Unicode must still be _encoded_

-   problem: there are far more code points than fit into 8-bit
    encodings. Hence there are multiple ways to *encode* the Unicode
    code points

-   *variable-byte* encodings use multiple bytes as needed. The advantage
    is efficiency: ASCII characters need just one byte in UTF-8, and the
    first 256 Unicode code points were set to their ASCII and ISO-8859-1
    equivalents

-   the two most common are UTF-8 and UTF-16, built from 8-bit and 16-bit units respectively
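
To make the difference between a code point and its encoding concrete, the following base R sketch looks at two characters picked purely for illustration (é, U+00E9, and the euro sign, U+20AC). The conversion to UTF-16 assumes the platform's `iconv` recognises the encoding name `"UTF-16LE"`.

```r
# Build the characters from their Unicode code points (illustrative choices).
e_acute <- intToUtf8(0x00E9)   # é
euro    <- intToUtf8(0x20AC)   # euro sign

# The code point is a single number, independent of any encoding.
utf8ToInt(e_acute)             # 233
utf8ToInt(euro)                # 8364

# UTF-8 is variable-width: ASCII needs 1 byte, é needs 2, the euro sign 3.
charToRaw("a")                 # 61
charToRaw(e_acute)             # c3 a9
charToRaw(euro)                # e2 82 ac

# The same code point in UTF-16 (little-endian): one 16-bit unit, e9 00.
utf16 <- iconv(e_acute, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
utf16
```

The code point of é never changes; only the bytes used to store it differ between encodings.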

# Text encoding: Caution

-   Input texts can be very different

-   Much text-production software (e.g. MS Office-based products) still
    tends to use proprietary encodings, such as Windows-1252

-   Windows tends to use UTF-16, while Mac and other Unix-based
    platforms use UTF-8

-   Your eyes can deceive you: a client may display gibberish but the
    encoding might still be as intended

-   No easy method of detecting encodings (except in HTML meta-data)
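
The "gibberish" case can be reproduced in a few lines of base R. This is a minimal sketch, assuming a single illustrative character (é): the stored bytes are valid UTF-8 throughout, and only the encoding declared for them changes.

```r
# "é" encoded as UTF-8 is the two bytes c3 a9.
utf8_bytes <- charToRaw(intToUtf8(0x00E9))

# Correct reading: declare the bytes to be UTF-8.
right <- rawToChar(utf8_bytes)
Encoding(right) <- "UTF-8"

# Wrong reading: declare the same bytes to be Latin-1 (ISO-8859-1).
wrong <- rawToChar(utf8_bytes)
Encoding(wrong) <- "latin1"

nchar(right)                   # 1 character
nchar(wrong)                   # 2 characters: each byte is read as its own character
utf8ToInt(enc2utf8(wrong))     # 195 169, the code points of "Ã" and "©"
```

The underlying bytes are identical in both cases, which is why a garbled display does not by itself show that a file is damaged.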


# A note on "meta-data"

-   Data that provides information about other (primary) data  
-   Usually not meant to be analyzed as data itself  

-   Example: HTML

    ```html
    <!DOCTYPE html>
    <html class="client-nojs" lang="en" dir="ltr">
    <head>
    <meta charset="UTF-8"/>
    <title>Metadata - Wikipedia</title>
    ```

-   Example of a standard attempting to address this need:
    [Dublin Core Metadata Initiative](http://dublincore.org/documents/dc-text/)

### Representing dates: Different formats?

![date orders](figs/dateformatorder.png "Date format order")
    

# Representing dates

| Description	| Format	| Examples |
|:------------|:--------|:---------|
| American month and day	      | mm/dd	           | "5/12", "10/27" |
| American month, day and year	| mm/dd/y	         | "1/17/2006" |
| Four digit year, month and day with slashes	| YY/mm/dd | "2008/6/30", "1978/12/22" |
| Four digit year and month (GNU)	| YY-mm	         | "2008-6", "2008-06", "1978-12" |
| Year, month and day with dashes	| y-mm-dd	       | "2008-6-30", "78-12-22", "8-6-21" |
| Day, month and four digit year	| dd-mm-YY	     | "30-6-2008" |
| Day, month and two digit year   | dd.mm.yy	     | "30.6.08" |
| Day, textual month and year	    | dd-m y	       | "30-June 2008" |
| Textual month and four digit year | m YY	       | "June 2008",  "March 1879" |
| Four digit year and textual month | YY  m	       | "2008 June" |
| Textual month, day and year	    | m dd, y	       | "April 17, 1790" |
| Day and textual month	          | d m	           | "1 July" |
| Month abbreviation, day and year	| M-DD-y	     | "May-09-78", "Apr-17-1790" |
| Year, month abbreviation and day	| y-M-DD	     | "78-Dec-22", "1814-MAY-17" |
| Year (and just the year)        |	YY	| "1978", "2008" |
| Textual month (and just the month)	| m	| "March", "jun", "DEC" |
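
Because the same string can follow several of these conventions, software has to be told which layout to expect. A minimal base R sketch using `as.Date()` with explicit format strings is shown here; the input strings are taken from the table or made up for illustration, and textual month names are left out because parsing them depends on the locale.

```r
# Each input needs a format string describing its layout.
as.Date("1/17/2006", format = "%m/%d/%Y")   # American month/day/year
as.Date("2008/6/30", format = "%Y/%m/%d")   # year/month/day with slashes
as.Date("30-6-2008", format = "%d-%m-%Y")   # day-month-year
as.Date("30.6.08",   format = "%d.%m.%y")   # two-digit year, read as 2008

# The same string is a different date under a different convention.
us <- as.Date("5/12/2006", format = "%m/%d/%Y")   # 12 May 2006
eu <- as.Date("5/12/2006", format = "%d/%m/%Y")   # 5 December 2006
format(c(us, eu))                                 # "2006-05-12" "2006-12-05"
```

Printing a `Date` in R uses the year-month-day order regardless of how the input was written.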

# ISO8601: Imposing common standards

* Purpose: to provide an unambiguous and well-defined method of representing dates and times
* Goal: to avoid misinterpretation of numeric representations of dates and times, particularly when data are transferred between countries with different conventions for writing numeric dates and times
* First published in 1988
* Introduces a common notation, and a common order (most-to-least-significant order [YYYY]-[MM]-[DD])
    - matches lexicographical order with chronological order
* Uses codes for date and time elements, to represent dates (and times) in either a basic format (no separators) or in an extended format with added separators (to enhance human readability)


## ISO8601 formatting components

|  Symbol  |  Meaning       |  Example   |  Notes |
|:---------|:---------------|:----------:|:-------|
| YYYY     | 4-digit year   |  2017    | Avoids the "Y2K" problem |
| MM       | 2-digit month  | 10  | |
| DD       | 2-digit day of the month | 03 | |
| Www      | Week number, prefixed by "W" | W52 | |
| D        | Weekday number |  2 | Starts on Monday! |
| hh       | hour (0-24)    |  10 | 24 is only used to denote midnight at the end of a calendar day |
| mm       | minute (0-59)  |  05 | |
| ss       | second (0-60)  |  20 | 60 is only used to denote an added leap second |
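
These components map closely onto the conversion codes of R's `format()` for date-times. The sketch below is a minimal illustration, assuming a single made-up instant whose year, month, day and time echo the example column.

```r
# A fixed instant in UTC, chosen for illustration.
x <- as.POSIXct("2017-10-03 10:05:20", tz = "UTC")

format(x, "%Y-%m-%d")             # extended calendar date: "2017-10-03"
format(x, "%Y%m%d")               # basic calendar date:    "20171003"
format(x, "%Y-%m-%dT%H:%M:%SZ")   # date and time in UTC:   "2017-10-03T10:05:20Z"
format(x, "%G-W%V-%u")            # ISO week date:          "2017-W40-2"

# Most-to-least-significant order means sorting the strings sorts the dates.
dates <- c("2017-10-03", "1978-12-22", "2008-06-30")
sort(dates)                       # "1978-12-22" "2008-06-30" "2017-10-03"
```

The weekday number `2` in the week date is a Tuesday, since ISO 8601 weeks start on Monday.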


## Coordinated Universal Time (UTC)

* World standard for time
* Does not observe Daylight Saving Time
* Often used interchangeably with Greenwich Mean Time (GMT), but GMT is no longer precisely defined by the scientific community
* [Time zones around the world](https://en.wikipedia.org/wiki/List_of_UTC_time_offsets) are expressed using positive or negative offsets from UTC

* French v. English
    - English speakers originally proposed _CUT_ (for "coordinated universal time")
    - French speakers proposed _TUC_ (for "temps universel coordonné") 
    - Compromise: _UTC_

# POSIX Time

- A system for describing a point in time, defined as the number of seconds that have elapsed since 00:00:00 UTC, Thursday, 1 January 1970
- Also known as "Unix time", or "[epoch time](https://en.wikipedia.org/wiki/Epoch_(reference_date)#Computing)", because it represents elapsed time from a defined "epoch"
- Problem: How much elapsed time can you store?  

* The _Year 2038 Problem_
    -  Many Unix-like operating systems keep time as seconds elapsed since the epoch date of 1 January 1970
    -  With signed 32-bit integers, these systems cannot encode times after 03:14:07 UTC on 19 January 2038
    -  Times beyond that will wrap around and be stored internally as a negative number, which these systems will interpret as having occurred on 13 December 1901
    -  A solution: 64-bit signed integers push the wraparound date roughly 292 billion years into the future, about 20 times the estimated age of the universe
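
The dates quoted above can be checked by converting the limits of a signed 32-bit integer into date-times. This is a minimal base R sketch; R itself stores date-times as double-precision numbers, so the 32-bit limits are only emulated here as ordinary numeric values.

```r
# POSIX time is just a count of seconds since 1970-01-01 00:00:00 UTC.
epoch <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
format(epoch, "%Y-%m-%d %H:%M:%S")     # "1970-01-01 00:00:00"

# Largest and smallest values of a signed 32-bit integer.
max32 <-  2^31 - 1                     # 2147483647
min32 <- -2^31                         # -2147483648

last_ok <- as.POSIXct(max32, origin = "1970-01-01", tz = "UTC")
wrapped <- as.POSIXct(min32, origin = "1970-01-01", tz = "UTC")
format(last_ok, "%Y-%m-%d %H:%M:%S")   # "2038-01-19 03:14:07"
format(wrapped, "%Y-%m-%d %H:%M:%S")   # "1901-12-13 20:45:52"

# Rough horizon of a signed 64-bit counter, in billions of years (about 292).
horizon <- (2^63 / (365.25 * 24 * 3600)) / 1e9
horizon
```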

# Compression

-   Seeks to economize on space by replacing recurring items with shorter
    patterns from which the uncompressed data can be reconstructed

-   Common in formats for graphics and video encoding

-   “Lossless” formats compress data without reducing information -
    examples are .zip and .gz compression

-   Avoiding redundancy (and the errors it invites) is also a principle
    in normalized relational data forms

-   Also very important for sparse matrix representations, where many of
    the cells are zero, but it would be very wasteful to record a double
    precision numeric zero for each of these non-informative cells
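
Lossless compression can be tried directly in base R with `memCompress()` and `memDecompress()`. The sketch below uses a deliberately repetitive, made-up input so that the saving is easy to see; real data will usually compress less well.

```r
# Highly repetitive input: the same three bytes repeated 10,000 times.
original <- charToRaw(strrep("abc", 10000))

compressed <- memCompress(original, type = "gzip")
restored   <- memDecompress(compressed, type = "gzip")

length(original)                 # 30000 bytes
length(compressed)               # far fewer bytes
identical(restored, original)    # TRUE: nothing was lost
```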

## Compression: Sparse matrix formats

* used because many kinds of matrix are very sparse, for example document-term matrices, which are commonly 80-90% sparse

```r
suppressPackageStartupMessages(library("quanteda"))
mydfm <- dfm(data_corpus_inaugural[1:5])
head(mydfm, nfeature = 10)
## Document-feature matrix of: 5 documents, 10 features (38% sparse).
## 5 x 10 sparse Matrix of class "dfmSparse"
##                  features
## docs              fellow-citizens  of the senate and house representatives : among vicissitudes
##   1789-Washington               1  71 116      1  48     2               2 1     1            1
##   1793-Washington               0  11  13      0   2     0               0 1     0            0
##   1797-Adams                    3 140 163      1 130     0               2 0     4            0
##   1801-Jefferson                2 104 130      0  81     0               0 1     1            0
##   1805-Jefferson                0 101 143      0  93     0               0 0     7            0
```

### "simple triplet" format

-   “simple triplet” format

    - $i$:   indexes row

    - $j$:   indexes column

    - $x$:   indicates value
    
(indexes will be from zero)

#### example:

This matrix:

```
     [,1] [,2] [,3] [,4]
[1,]    1    0    3   12
[2,]    0    0   10    1
[3,]    2    0    0    0    
```

Would be represented by:
- $i$: 0 2 0 1 0 1
- $j$: 0 0 2 2 3 3
- $x$: 1 2 3 10 12 1
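
The triplets for this example can be derived mechanically. The following is a minimal base R sketch rather than how any particular sparse matrix package stores its data: it lists the nonzero cells column by column, converts the indexes to start from zero, and then rebuilds the dense matrix as a check.

```r
# The 3 x 4 example matrix.
m <- matrix(c(1, 0,  3, 12,
              0, 0, 10,  1,
              2, 0,  0,  0), nrow = 3, byrow = TRUE)

# Locate the nonzero cells; which() walks the matrix column by column.
nz <- which(m != 0, arr.ind = TRUE)

i <- unname(nz[, "row"]) - 1L   # zero-based row indexes:    0 2 0 1 0 1
j <- unname(nz[, "col"]) - 1L   # zero-based column indexes: 0 0 2 2 3 3
x <- m[nz]                      # values:                    1 2 3 10 12 1

# The triplets are enough to rebuild the dense matrix exactly.
rebuilt <- matrix(0, nrow = nrow(m), ncol = ncol(m))
rebuilt[cbind(i + 1L, j + 1L)] <- x
identical(rebuilt, m)           # TRUE
```

Only 6 of the 12 cells are stored, at the cost of two index values per stored cell.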




###   “compressed sparse column” format

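- More compact than the simple triplet format, because column indexes are not stored once per nonzero value

    - $i$:   indexes row

    - $p$:   for each column, the position in $i$ and $x$ of its first nonzero element, plus a final entry giving the total number of nonzero elements

    - $x$:   indicates value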
    
    
-   "compressed sparse row" format is also possible

#### example:

This matrix:

```
     [,1] [,2] [,3] [,4]
[1,]    1    0    3   12
[2,]    0    0   10    1
[3,]    2    0    0    0    
```

Would be represented by:
- $i$: 0 2 0 1 0 1
- $p$: 0 2 2 4 6
- $x$: 1 2 3 10 12 1
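
A minimal base R sketch of how the column pointers $p$ arise and how they are used to read one column back out. The helper `get_column()` is hypothetical and written only for this illustration; the `dgCMatrix` class of the Matrix package stores its data in slots with the same names (`i`, `p`, `x`).

```r
m <- matrix(c(1, 0,  3, 12,
              0, 0, 10,  1,
              2, 0,  0,  0), nrow = 3, byrow = TRUE)

nz <- which(m != 0, arr.ind = TRUE)      # nonzero cells, column by column
i  <- unname(nz[, "row"]) - 1L           # 0 2 0 1 0 1
x  <- m[nz]                              # 1 2 3 10 12 1

# Column pointers: running count of nonzeros before each column,
# plus a final entry holding the total number of nonzeros.
per_col <- tabulate(nz[, "col"], nbins = ncol(m))   # 2 0 2 2
p <- c(0L, cumsum(per_col))                         # 0 2 2 4 6

# Hypothetical helper: recover one column (1-based k) from i, p and x.
get_column <- function(k) {
  col <- numeric(nrow(m))
  idx <- p[k] + seq_len(p[k + 1] - p[k])   # positions of column k in i and x
  col[i[idx] + 1L] <- x[idx]
  col
}
get_column(3)    # 3 10 0
get_column(2)    # 0 0 0 (an empty column takes no space in i or x)
```

With 4 columns, $p$ needs 5 entries however many nonzero values there are, whereas the triplet format needs one column index per nonzero value.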




# Dataset manipulation

### What is a “Dataset”?

-   A dataset is a “rectangular” formatted table of data in which all
    the values of the same variable must be in a single column

-   Many of the datasets we use have been artificially reshaped in order
    to fulfill this criterion of rectangularity

### Revisiting basic data concepts

-   The difference between tables and *datasets*

-   This is a (partial) dataset:

                      district    incumbf wonseatf
        1      Carlow Kilkenny Challenger     Lost
        2      Carlow Kilkenny Challenger     Lost
        5      Carlow Kilkenny  Incumbent      Won
        100 Donegal South West Challenger     Lost
        459            Wicklow  Incumbent      Won
        464            Wicklow Challenger     Lost

-   This is a table:

                   Lost Won
        Challenger  266  60
        Incumbent    32 106

-   The key with a dataset is that
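
The step from a dataset to a table can be sketched in base R. The data frame below contains only the six rows displayed above, typed in by hand, so its counts are much smaller than those in the full cross-tabulation.

```r
# The six displayed rows, entered by hand as a data frame (a dataset).
cand <- data.frame(
  district = c("Carlow Kilkenny", "Carlow Kilkenny", "Carlow Kilkenny",
               "Donegal South West", "Wicklow", "Wicklow"),
  incumbf  = c("Challenger", "Challenger", "Incumbent",
               "Challenger", "Incumbent", "Challenger"),
  wonseatf = c("Lost", "Lost", "Won", "Lost", "Won", "Lost"),
  stringsAsFactors = FALSE
)

# Cross-tabulating two variables turns the dataset into a table of counts.
tab <- table(cand$incumbf, cand$wonseatf)
tab

# The table keeps counts only: districts and individual rows cannot be
# recovered from it.
```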

### Example: Comparative Manifesto Project dataset

Note: Available from <https://manifestoproject.wzb.eu/>


### Example: Comparative Manifesto Project dataset

This is “wide” format: ![image](figures/cmpdata.png)

### Long v. wide formats

-   reshape

    -   the “old” R way to do this, using `stats::reshape()`

    -   problem: confusing and difficult to use

-   reshape2

    -   from Hadley Wickham’s `reshape2` package

    -   data is first `melt`ed into long format

    -   then `cast` into the desired format (with `dcast()` or `acast()`)
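
A minimal sketch of the melt-then-cast workflow, assuming the `reshape2` package is installed. The data frame and its column names (`party`, `econ`, `welfare`) are made up for illustration and are not taken from the Comparative Manifesto Project data.

```r
library(reshape2)

# Made-up wide data: one row per party, one column per (hypothetical) category.
wide <- data.frame(party   = c("A", "B"),
                   econ    = c(10, 20),
                   welfare = c(30, 40),
                   stringsAsFactors = FALSE)

# melt: wide -> long, one row per party-category pair.
long <- melt(wide, id.vars = "party",
             variable.name = "category", value.name = "share")
long

# dcast: long -> wide again, one column per category.
wide_again <- dcast(long, party ~ category, value.var = "share")
wide_again
```

In the long format every value of `share` sits in a single column, which is the rectangular layout a dataset requires.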

# Upcoming

-------

* **Lab**: Working with Jupyter, working with Github, setting up a web page
* **Next week**: The shape of data