### ST445 Managing and Visualizing Data

# Introduction to Data

### Week 1 Lecture, MT 2017 - Kenneth Benoit

## Data is Fundamental


> _"You can have data without information, but you cannot have information without data."_ – Daniel Keys Moran, an American computer programmer and science fiction writer.



> _"In God we trust. All others must bring data."_ – W. Edwards Deming, statistician, professor, author, lecturer, and consultant.

# Structured data: Index cards

* Origins in the 18th century, with botanist [Carl Linnaeus](https://en.wikipedia.org/wiki/Carl_Linnaeus), who needed to record the species that he was studying

* This was a form of _database_
    - each piece of information about a species formed a _field_
    - each species' entry in the system formed a _record_
    - the records were _indexed_ using some reference system
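The same vocabulary maps directly onto code. A minimal sketch, assuming a handful of species chosen for illustration and hypothetical field names (`genus`, `species`, `common_name`) that do not come from the lecture:

```python
# Each record is one "card"; each key in the dict is a field.
# The species and field names are illustrative assumptions.
records = [
    {"genus": "Bellis", "species": "perennis", "common_name": "daisy"},
    {"genus": "Quercus", "species": "robur", "common_name": "English oak"},
    {"genus": "Rosa", "species": "canina", "common_name": "dog rose"},
]

# An index maps a reference value to the position of its record.
index_by_genus = {record["genus"]: position for position, record in enumerate(records)}

# Look up a record through the index rather than scanning every card.
oak = records[index_by_genus["Quercus"]]
print(oak["common_name"])  # English oak
```

The index here plays the role of the drawer labels in a card catalog: it says where to look, while the record itself holds the fields.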

### Heyday: Use in libraries to catalog books

![card catalog](figs/cardcatalog.jpg "Card catalog") ![card catalog room](figs/cardcatalog3.jpg "Card catalog room")
    

### a record looked like this

![card record](figs/cardcatalog2.jpg "Card record")

## Dewey decimal system

* a proprietary library classification system first published in the United States by Melvil Dewey in 1876
* scheme is made up of ten classes, each divided into ten divisions, each having ten sections
* the notation uses Arabic numerals: a three-digit whole number identifies the main class, division, and section, and decimals create further subdivisions


* Example:

    ```
    500 Natural sciences and mathematics
        510 Mathematics
            516 Geometry
                516.3 Analytic geometries
                    516.37 Metric differential geometries
                        516.375 Finsler Geometry
    ```
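The hierarchy in the example can be recovered from the notation alone. A rough sketch, assuming the input always has a full three-digit whole part; `dewey_hierarchy` is a hypothetical helper name, not part of any library:

```python
def dewey_hierarchy(number: str) -> list[str]:
    """Return the chain of Dewey notations from main class down to `number`."""
    whole, _, decimals = number.partition(".")
    chain = [
        whole[0] + "00",   # main class, e.g. 500
        whole[:2] + "0",   # division, e.g. 510
        whole,             # section, e.g. 516
    ]
    # Each extra decimal digit adds one further level of subdivision.
    for length in range(1, len(decimals) + 1):
        chain.append(f"{whole}.{decimals[:length]}")
    return chain


print(dewey_hierarchy("516.375"))
# ['500', '510', '516', '516.3', '516.37', '516.375']
```

The sketch does not remove duplicates, so a number such as `500` would yield the same notation three times.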

# How to index?

* Problem: Could only sort the cards one way
* Re-referencing was literally a manual operation

* Contrast with the idea of electronic indexes, where assets are stored once, and many indexing and reference systems can be applied
    ![photos](figs/photos.png "Apple Photos")
* In most photographic software, edits are "non-destructive", and stored separately from the original images
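The idea of storing assets once and indexing them many ways might look like this in code. This is a sketch only: the file names, years, and tags are invented for illustration.

```python
from collections import defaultdict

# Each asset is stored exactly once, keyed by a unique ID.
# File names, years, and tags are invented for illustration.
photos = {
    1: {"file": "beach.jpg", "year": 2016, "tags": ["holiday", "sea"]},
    2: {"file": "graduation.jpg", "year": 2017, "tags": ["family"]},
    3: {"file": "harbour.jpg", "year": 2017, "tags": ["holiday", "sea"]},
}

# Two independent indexes over the same store; neither copies the photos.
by_year = defaultdict(list)
by_tag = defaultdict(list)
for photo_id, photo in photos.items():
    by_year[photo["year"]].append(photo_id)
    for tag in photo["tags"]:
        by_tag[tag].append(photo_id)

print(by_year[2017])  # [2, 3]
print(by_tag["sea"])  # [1, 3]
```

Adding a third way of sorting means building one more dictionary, not re-filing the photos, which is exactly what a physical card catalog could not do.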

## Modern database manager

* "Normalizes" data into relational tables, linked by "keys" (more on this in [Week 3](https://lse-st445.github.io/#week-3-creating-and-managing-databases))

    ![relational data](figs/relational_data.png "Relational data")
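As a small illustration of tables linked by keys, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `authors` and `books` tables and their columns are assumptions made for this example, not the schema shown in the figure.

```python
import sqlite3

# An in-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(author_id)
    );
    INSERT INTO authors VALUES (1, 'Jonathan Swift'), (2, 'Douglas Adams');
    INSERT INTO books VALUES
        (1, 'Gulliver''s Travels', 1),
        (2, 'The Hitchhiker''s Guide to the Galaxy', 2),
        (3, 'The Restaurant at the End of the Universe', 2);
""")

# The author's name is stored once; books refer to it through the key.
rows = conn.execute("""
    SELECT books.title, authors.name
    FROM books
    JOIN authors ON books.author_id = authors.author_id
    ORDER BY books.book_id
""").fetchall()

for title, name in rows:
    print(f"{title} by {name}")
conn.close()
```

Because each author appears in only one row, correcting a name means changing one value rather than every book record that mentions it.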


## Punch cards (and legacy systems)

* How we used to enter programs into the computer
   ![punch card](figs/punchard.jpg "Punchcard")
* Pre-computer origins: 18th-century use in textile looms
* Responsible for the 80-character line-width legacy: the standard IBM punch card had 80 columns


### Who knows what this is?

![piano roll](figs/pianoroll2.jpg "Piano roll")

That is a [piano roll](https://en.wikipedia.org/wiki/Piano_roll), where music to be played by a [player piano](https://en.wikipedia.org/wiki/Player_piano) is encoded.

# Changes in the world of data

* volume of data in the modern world: 90% of the world's data [generated in the last _two years_](https://www.sciencedaily.com/releases/2013/05/130522085217.htm)


* and that was _in 2013_

* SKA: Square Kilometer Array
    - a southern hemisphere radio telescope with a total collecting area of about 1 km$^2$
    - will generate 1 exabyte _daily_ = $1 \times 10^{18}$ bytes
    - = 1,000,000,000,000,000,000 bytes

* compare this with the Apollo Guidance Computer (1966), which guided the first humans to the moon
    - 16-bit wordlength, 2048 words RAM (magnetic core memory) = _4KB_
    - 36,864 words ROM (core rope memory) = _72KB_
    ![](figs/agc.jpg)
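The memory figures follow from simple arithmetic. A quick check, assuming 16-bit words as stated above and 1,024 bytes per KB:

```python
BYTES_PER_KB = 1024

word_bits = 16
ram_words = 2048
rom_words = 36_864

ram_bytes = ram_words * word_bits // 8   # 4,096 bytes
rom_bytes = rom_words * word_bits // 8   # 73,728 bytes

print(ram_bytes / BYTES_PER_KB)  # 4.0
print(rom_bytes / BYTES_PER_KB)  # 72.0

# How many copies of the AGC's RAM would fit in one exabyte (10**18 bytes)?
exabyte = 10**18
print(exabyte // ram_bytes)      # 244140625000000
```

The ROM holds 73,728 bytes, which is 72 KB at 1,024 bytes per KB, or roughly 73.7 KB if a KB is taken as 1,000 bytes.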


# Basic units of data

* Bits
   - smallest unit of storage, a 0 or 1
   - anything that can store two states - now "transistors", used to be vacuum tubes
   - with $n$ bits, can store $2^n$ patterns - so one byte can store 256 patterns


* Bytes
   - eight _bits_ = one _byte_
   - an eight-bit encoding can represent characters as numbers, e.g. `A` is stored as 65
   
  ![ASCII](figs/ASCII.png)
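Both points can be checked directly in Python. A short sketch using only built-in functions:

```python
# Number of distinct patterns that n bits can represent.
for n_bits in (1, 2, 8):
    print(n_bits, "bits ->", 2 ** n_bits, "patterns")

# Characters are stored as numbers: ord() gives the code, chr() reverses it.
code = ord("A")
bit_pattern = format(code, "08b")  # the same code written as eight bits

print(code)         # 65
print(bit_pattern)  # 01000001
print(chr(code))    # A
```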

### multi-byte units

| unit     | abbreviation | total bytes  | nearest decimal equivalent |
|:--------:|:------------:|-------------:|---------------------------:|
| kilobyte |     KB       | 1,024^1      |             1000^1         |
| megabyte |     MB       | 1,024^2      |             1000^2         |
| gigabyte |     GB       | 1,024^3      |             1000^3         |
| terabyte |     TB       | 1,024^4      |             1000^4         |
| petabyte |     PB       | 1,024^5      |             1000^5         |
| exabyte  |     EB       | 1,024^6      |             1000^6         |
| zettabyte|     ZB       | 1,024^7      |             1000^7         |
| yottabyte|     YB       | 1,024^8      |             1000^8         |

* this is why 1GB is greater than 1 billion bytes

![decimal v. binary](figs/decimalvbinary.png)
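The gap between the binary and decimal definitions grows with each unit. A small sketch of the arithmetic behind the table:

```python
# Compare each unit's binary size (powers of 1,024) with its decimal
# counterpart (powers of 1,000).
units = ["KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

for power, unit in enumerate(units, start=1):
    binary_bytes = 1024 ** power
    decimal_bytes = 1000 ** power
    excess = 100 * (binary_bytes / decimal_bytes - 1)
    print(f"1 {unit}: binary is {excess:.1f}% larger than decimal")

gigabyte = 1024 ** 3
print(gigabyte)  # 1073741824
```

At the kilobyte level the difference is 2.4%; by the gigabyte it is about 7.4%, which is why a drive sold by its decimal capacity appears smaller when reported in binary units.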

# Data types: Generically

* objects are _bound_ to an identifier, e.g.

```Python
temperature = 98.6
print(temperature)
# 98.6
```


* here, `temperature` is a variable name assigned to the literal floating-point object with the value of 98.6
* in Python, this is an instance of the **float** class
* identifiers in R and Python are _case-sensitive_
* some identifiers are typically reserved, e.g.
    ```Python
    False, True, None, or, and  # Python
    FALSE, TRUE, NA             # R
    ```

* Nearly all programming languages support comments, which are written for humans to read
    - this is anything that follows the `#` character in both Python and R
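Case sensitivity and reserved words are easy to verify in Python. A brief sketch using the standard-library `keyword` module; the second variable, `Temperature`, is introduced here only to make the point:

```python
import keyword

temperature = 98.6   # degrees Fahrenheit
Temperature = 37.0   # a different identifier: names are case-sensitive

print(temperature, Temperature)  # 98.6 37.0

# Reserved words cannot be used as identifiers.
print(keyword.iskeyword("None"))         # True
print(keyword.iskeyword("temperature"))  # False
print(len(keyword.kwlist))               # count varies by Python version
```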

> "Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."  -- Donald Knuth, _Literate Programming_ (1984)

### Instantiation

* objects have _classes_, meaning they represent a "type" of object
* _instantiation_ means creating a new instance of that class

* "immutable" objects cannot be subsequently changed
    
| **Python class** | **Immutable** | **Description**                   |  **R class** |
|:-----------------|:-------------:|:----------------------------------|:------------:|
| bool             |      Yes      | Boolean value                     |    logical   |
| int              |      Yes      | integer number                    |    integer   |
| float            |      Yes      | floating-point number             |    numeric   |
| list             |       No      | mutable sequence of objects       |     list     |
| tuple            |      Yes      | immutable sequence of objects     |       -      |
| str              |      Yes      | character string                  |   character  |
| set              |       No      | unordered set of distinct objects |       -      |
| frozenset        |      Yes      | immutable form of set class       |       -      |
| dict             |       No      | dictionary                        | (named) list |
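The difference between mutable and immutable classes shows up as soon as you try to change an object in place. A minimal sketch with a list, a tuple, and a string:

```python
numbers_list = [1, 2, 3]
numbers_list[0] = 99          # lists are mutable: in-place change works

numbers_tuple = (1, 2, 3)
try:
    numbers_tuple[0] = 99     # tuples are immutable: this raises TypeError
except TypeError as error:
    print("tuple:", error)

greeting = "hello"
shouted = greeting.upper()    # str methods return a new object
print(numbers_list)           # [99, 2, 3]
print(greeting, shouted)      # hello HELLO
```

"Changing" an immutable object really means creating a new instance and binding a name to it; the original object is left untouched.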

# How to index data

* Endian debate
    - Big-endian byte order: the most significant byte (the "big end") of the data is stored at the lowest address
    - Little-endian byte order: the least significant byte (the "little end") of the data is stored at the lowest address
    - The terms come from _Gulliver's Travels_, where two factions fight over which end of a boiled egg to crack
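Byte order can be seen by converting an integer to bytes both ways. A short sketch, with the value `0x12345678` chosen purely for illustration:

```python
import sys

value = 0x12345678   # a four-byte integer, chosen for illustration

big = value.to_bytes(4, byteorder="big")
little = value.to_bytes(4, byteorder="little")

print(big.hex())      # 12345678 (most significant byte first)
print(little.hex())   # 78563412 (least significant byte first)
print(sys.byteorder)  # byte order of the machine running the code

# Reading the bytes back with the matching order recovers the value.
restored = int.from_bytes(little, byteorder="little")
```

The same four bytes mean different numbers depending on the order assumed, which is why file formats and network protocols have to state their byte order explicitly.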

### (indexing data cont.)

* index from 0 or from 1?

   - where an index begins counting, when addressing elements of a data object
   - [most languages index from 0](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28array%29#Array_system_cross-reference_list)
   - human ages - do they index from 0?

```Python
string_example = "Hello World"
string_example[0:5]
# 'Hello'
```

* Python indexes from 0, R from 1.  See [R example here](R_indexing.ipynb)


# git

* `git`: a revision control system
* Allows for complete history of changes, branching, staging areas, and flexible and distributed workflows
* simplified workflow (from [Anita Cheng's excellent blog post](http://anitacheng.com/git-for-non-developers))

   ![](figs/git.jpg)

* Works through the command line, or through GUI clients, or through most IDEs and (good) editors

![command-line git](figs/commandgit.png "command-line git")

* or through editors (here, the excellent [Atom](https://atom.io) editor)

    ![atom git](figs/gitatom.png "atom git")

# GitHub

* a website and hosting platform for git repositories

* and so much more...
    * publishing websites: http://kenbenoit.net, whose source code is at https://github.com/kbenoit/kbenoit.github.io
    * "continuous integration" hooks: https://github.com/kbenoit/spacyr (for instance - see the badges)
    * Issue tracking: https://github.com/kbenoit/spacyr/issues
    * Inspecting code: (e.g.) https://github.com/kbenoit/spacyr/blob/master/R/python-functions.R

* [GitHub classroom](https://classroom.github.com)
* Free stuff for students! https://education.github.com/pack


# More great resources for using git/GitHub

* [An easy git Cheatsheet](http://rogerdudler.github.io/git-guide/files/git_cheat_sheet.pdf), by Nina Jaeschke and Roger Dudler 
* [git - the simple guide](http://rogerdudler.github.io/git-guide/) by Roger Dudler

### Some people have entire, open-source, user-commented books online, such as: 

* [_R for Data Science_](http://r4ds.had.co.nz)
    - with [source code here](https://github.com/hadley/r4ds)
    - with [GitHub issues here](https://github.com/hadley/r4ds/issues)
    - and pull requests - [examples here](https://github.com/hadley/r4ds/pulls)

# git Example

### Fixing a broken R Jupyter notebook

This Jupyter notebook is broken and needs to be fixed:

https://github.com/lse-st445/lectures/blob/master/week01/R_example.ipynb


### How to fix it:
* clone the repository
* edit the file
* stage the changes
* commit the changes
* issue a "pull request"

# Markdown (and other markup languages)

* Idea of a "markup" language: HTML, XML, LaTeX
* "Markdown"
    - Created by John Gruber as a simple way for non-programming types to write in an easy-to-read format that could be converted directly into HTML
    - No opening or closing tags
    - Plain text, and can be read when not rendered
* Markdown has [many "flavours"](https://github.com/commonmark/CommonMark/wiki/Markdown-Flavors)

# Markdown example

This is a markdown example.
* bullet list 1
* bullet list 2

> "[I love deadlines. I like the whooshing sound they make as they fly by.](https://www.brainyquote.com/quotes/quotes/d/douglasada134151.html?src=t_funny)"  
-- _Douglas Adams_

----
```
# Markdown example

This is a markdown example
* bullet list 1
* bullet list 2

> "[I love deadlines. I like the whooshing sound they make as they fly by.](https://www.brainyquote.com/quotes/quotes/d/douglasada134151.html?src=t_funny)"  
-- _Douglas Adams_
```


A good reference for Markdown: https://ia.net/writer/support/general/markdown-guide/.

# Upcoming

-------

* **Lab**: Working with Jupyter, working with GitHub, setting up a web page
* **Next week**: The shape of data