### ST445 Managing and Visualizing Data

# Introduction to Data

### Week 1 Lecture, MT 2017 - Kenneth Benoit

## Data is Fundamental


> _"You can have data without information, but you cannot have information without data."_ – Daniel Keys Moran, an American computer programmer and science fiction writer.



> _"In God we trust. All others must bring data."_ – W. Edwards Deming, statistician, professor, author, lecturer, and consultant.

# Structured data: Index cards

* Origins in the 18th century, with botanist [Carl Linnaeus](https://en.wikipedia.org/wiki/Carl_Linnaeus), who needed to record the species he was studying

* This was a form of _database_
    - each piece of information about a species formed a _field_
    - each species' entry in the system formed a _record_
    - the records were _indexed_ using some reference system

* Heyday: Use in libraries to catalog books

    ![card catalog](figs/cardcatalog.jpg "Card catalog")

* a record looked like this

![card record](figs/cardcatalog2.jpg "Card record")
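
The field / record / index vocabulary above maps directly onto simple Python structures. The following is only a minimal sketch: the field names and the three sample species are illustrative assumptions, not taken from any real catalog.

```python
# Each card is a record; each key in the dict is a field.
# The species below are illustrative sample data only.
records = [
    {"genus": "Quercus", "species": "robur", "common_name": "English oak"},
    {"genus": "Bellis", "species": "perennis", "common_name": "common daisy"},
    {"genus": "Rosa", "species": "canina", "common_name": "dog rose"},
]

# An index maps a reference value (here, the genus) to the position of a
# record, much as a drawer ordered by genus lets you go straight to one card.
index_by_genus = {record["genus"]: position
                  for position, record in enumerate(records)}

card = records[index_by_genus["Rosa"]]
print(card["common_name"])  # dog rose
```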

## Dewey decimal system

* a proprietary library classification system first published in the United States by Melvil Dewey in 1876
* scheme is made up of ten classes, each divided into ten divisions, each having ten sections
* the system's notation uses Arabic numerals, with three-digit whole numbers for the main classes and sub-classes, and decimals for further divisions


* Example:

    ```
    500 Natural sciences and mathematics
        510 Mathematics
            516 Geometry
                516.3 Analytic geometries
                    516.37 Metric differential geometries
                        516.375 Finsler Geometry
    ```
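
Because each extra digit narrows the topic, the chain of broader classes can be read straight off the number. A rough sketch of that idea follows; `dewey_ancestors` is a hypothetical helper written for this illustration, and it assumes the part before the decimal point always has three digits.

```python
def dewey_ancestors(call_number):
    """Return the chain of classes leading to a Dewey number, broadest first."""
    whole, _, decimals = call_number.partition(".")
    chain = []
    # main class (e.g. 500), division (e.g. 510), section (e.g. 516)
    for candidate in (whole[0] + "00", whole[:2] + "0", whole):
        if candidate not in chain:
            chain.append(candidate)
    # Each extra decimal digit narrows the topic one step further.
    for n in range(1, len(decimals) + 1):
        chain.append(whole + "." + decimals[:n])
    return chain

print(dewey_ancestors("516.375"))
# ['500', '510', '516', '516.3', '516.37', '516.375']
```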

# How to index?

* Problem: Could only sort the cards one way
* Re-referencing was literally a manual operation

* Contrast with the idea of electronic indexes, where assets are stored once, and many indexing and reference systems can be applied
    ![photos](figs/photos.png "Apple Photos")
* In most photographic software, edits are "non-destructive", and stored separately from the original images
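
As a rough illustration of "store once, index many ways", the sketch below keeps each photo in a single dictionary and builds two independent indexes over it. The file names, tags, and the shape of the `edits` record are hypothetical; real photo software stores this differently.

```python
from collections import defaultdict

# Each photo is stored exactly once, under a unique id (hypothetical data).
photos = {
    1: {"file": "IMG_0001.jpg", "year": 2016, "tags": ["beach", "family"]},
    2: {"file": "IMG_0002.jpg", "year": 2017, "tags": ["london"]},
    3: {"file": "IMG_0003.jpg", "year": 2017, "tags": ["family", "london"]},
}

# Two indexes over the same assets; adding a third needs no re-sorting.
by_year = defaultdict(list)
by_tag = defaultdict(list)
for photo_id, photo in photos.items():
    by_year[photo["year"]].append(photo_id)
    for tag in photo["tags"]:
        by_tag[tag].append(photo_id)

# Non-destructive edits: kept apart from the originals, keyed by the same id.
edits = {3: [{"op": "crop", "box": (0, 0, 800, 600)}]}

print(by_year[2017])     # [2, 3]
print(by_tag["family"])  # [1, 3]
```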

## Modern database manager

* "Normalizes" data into relational tables, linked by "keys" (more on this in [Week 3](https://lse-st445.github.io/#week-3-creating-and-managing-databases))

    ![relational data](figs/relational_data.png "Relational data")
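
A small sketch of two relational tables linked by a key, using Python's built-in `sqlite3` module. The `authors` and `books` tables and their columns are made up for this example and are not the schema shown in the figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE books (
        book_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER REFERENCES authors(author_id)
    );
    INSERT INTO authors VALUES (1, 'Douglas Adams');
    INSERT INTO books VALUES
        (1, 'The Hitchhiker''s Guide to the Galaxy', 1),
        (2, 'The Restaurant at the End of the Universe', 1);
""")

# The author's name is stored once; the key author_id links it to each book.
rows = conn.execute("""
    SELECT books.title, authors.name
    FROM books
    JOIN authors ON books.author_id = authors.author_id
    ORDER BY books.book_id
""").fetchall()

for title, name in rows:
    print(f"{title} by {name}")
conn.close()
```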


## Punch cards (and legacy systems)

* How we used to enter programs into the computer
   ![punch card](figs/punchard.jpg "Punchcard")
* Pre-computer origins: 18th-century use in textile looms
* Responsible for the 80-character legacy


* Who knows what this is?
   ![piano roll](figs/pianoroll.png "Piano roll")

That is a [piano roll](https://en.wikipedia.org/wiki/Piano_roll), where music to be played by a [player piano](https://en.wikipedia.org/wiki/Player_piano) is encoded.

# Changes in the world of data

* volume of data in the modern world
* Apollo landing module (from inaugural lecture)

* and hugely varied and complex ways for computers to communicate data

# Basic units of data

https://web.stanford.edu/class/cs101/bits-bytes.html

* Bits
* Bytes
* Why we count in hexadecimal (for example)
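
A few lines of Python make the relationship between bits, bytes, and hexadecimal concrete. This is a small sketch using only built-in functions; the value 202 is an arbitrary choice.

```python
n = 202

print(bin(n))         # 0b11001010  (eight bits, i.e. one byte)
print(hex(n))         # 0xca        (two hex digits per byte)
print(int("ca", 16))  # 202
print(2 ** 8)         # 256 distinct values fit in one byte

# Each hex digit stands for exactly four bits:
high, low = n >> 4, n & 0b1111
print(format(high, "04b"), format(low, "04b"))  # 1100 1010
print(format(high, "x"), format(low, "x"))      # c a

# Text is stored as bytes too:
print("data".encode("utf-8").hex())  # 64617461
```

Hexadecimal is convenient because one byte is always exactly two hex digits, whereas its decimal form can take one to three digits.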


# Data types: Generically



# Data types: Python



# Data types: R



# git

* `git`: a revision control system
* Allows for complete history of changes, branching, staging areas, and flexible and distributed workflows


* Works through the command line, or through GUI clients, or through most IDEs and (good) editors

![command-line git](figs/commandgit.png "command-line git")

* or through editors (here, the excellent [Atom](https://atom.io) editor)

    ![atom git](figs/gitatom.png "atom git")

# GitHub

* a website and hosting platform for git repositories

* and so much more...
    * publishing websites: http://kenbenoit.net, whose source code is at https://github.com/kbenoit/kbenoit.github.io
    * "continuous integration" hooks: https://github.com/kbenoit/spacyr (for instance - see the badges)
    * Issue tracking: https://github.com/kbenoit/spacyr/issues
    * Inspecting code: (e.g.) https://github.com/kbenoit/spacyr/blob/master/R/python-functions.R

* [GitHub classroom](https://classroom.github.com)
* Free stuff for students! https://education.github.com/pack


# git Example

## Fixing a broken R Jupyter notebook



# Markdown (and other markup languages)

* Idea of a "markup" language
    - HTML
    - XML
    - LaTeX
* "Markdown" and why it exists


# Markdown example

This is a markdown example.
* bullet list 1
* bullet list 2

> "[I love deadlines. I like the whooshing sound they make as they fly by.](https://www.brainyquote.com/quotes/quotes/d/douglasada134151.html?src=t_funny)"  
-- _Douglas Adams_

Markdown source:

---

```
# Markdown example

This is a markdown example.
* bullet list 1
* bullet list 2

> "[I love deadlines. I like the whooshing sound they make as they fly by.](https://www.brainyquote.com/quotes/quotes/d/douglasada134151.html?src=t_funny)"  
-- _Douglas Adams_
```
---

A good reference for Markdown: https://ia.net/writer/support/general/markdown-guide/


### Static Semantics in Programming Languages

* Rules for forming meaningful syntactically valid strings

```python
'MY'/470  # syntactically valid, but raises a TypeError
```
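
To separate the two kinds of error, the sketch below uses the built-in `compile()` to check syntax without running anything. One caveat: Python only detects the type mismatch when the expression is evaluated, not in a separate static pass, so this approximates the distinction rather than showing a true static check. The variable names are chosen for this example.

```python
# Syntax error: the string is not a well-formed expression at all.
try:
    compile("'MY' / / 470", "<demo>", "eval")
    syntax_ok = True
except SyntaxError:
    syntax_ok = False

# Well-formed, so compiling succeeds...
code = compile("'MY' / 470", "<demo>", "eval")

# ...but dividing a string by an integer has no meaning, which Python
# reports as a TypeError when the expression is evaluated.
try:
    eval(code)
    error_name = None
except TypeError as err:
    error_name = type(err).__name__

print(syntax_ok, error_name)  # False TypeError
```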

### Semantics in Programming Languages

* The meaning associated with a syntactically correct string that has no static semantic errors
* Programming languages have simple semantics — statements have only one meaning

* **But this may not be the meaning the programmer had in mind!**
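
A short illustration of a program that is valid and unambiguous yet does not do what its author intended. Both functions are hypothetical examples; the only difference is `//` (floor division) versus `/` (true division).

```python
def mean_buggy(values):
    # Valid Python, but floor division (//) discards the fractional part.
    return sum(values) // len(values)

def mean(values):
    # True division (/) gives the intended result.
    return sum(values) / len(values)

marks = [60, 65]
print(mean_buggy(marks))  # 62
print(mean(marks))        # 62.5
```

Python raises no error for `mean_buggy`: the statement has exactly one meaning, just not the intended one.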

## Types of Programming Languages

* Low-level vs. high-level
* General vs. application-targeted
* Interpreted vs. compiled

## Computer Program

* A sequence of definitions and commands
  * Commands (or "statements") instruct the computer to do something
* For interpreted languages:
  * Programs are executed by the language interpreter (or "shell")
  * They can be typed directly in the shell 
  * Or they can be stored in a file and run from the shell


## Objects, Data Types, and Expressions

* Programs manipulate objects
* Objects have types
  * Scalar — indivisible
  * Non-scalar — with internal structure
* Expressions combine objects and operators


```python
# scalar objects
2
0.125
True

# non-scalar objects
'This is a string.'
[1, 2, 3, 'a', 'x']

# expressions
2 / 0.125
'MY' + '470'
```
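
To see the type of each of these objects, the built-in `type()` can be applied to them. A brief sketch follows; the names `examples`, `quotient`, and `joined` are introduced only for this illustration.

```python
examples = [2, 0.125, True, 'This is a string.', [1, 2, 3, 'a', 'x']]
for obj in examples:
    print(repr(obj), '->', type(obj).__name__)

# Expressions produce new objects, which have types of their own.
quotient = 2 / 0.125   # 16.0
joined = 'MY' + '470'  # 'MY470'
print(quotient, type(quotient).__name__)
print(joined, type(joined).__name__)
```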

## Variables

* Variables associate objects with a name

```python
a = 3.14
b = 11.2
c = a*(b**2)
```

```python
pi = 3.14
diameter = 11.2
area = pi*(diameter**2)
```

* **Variable names help humans read programs!**

* **Comments also improve readability!**

```python
pi = 3.14
diameter = 11.2 # diameter of circle
area = pi*((diameter/2)**2) # estimate area of circle using diameter
```
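
One possible further refinement, offered as a sketch rather than as part of the original example: name the intermediate radius, and take pi from the standard `math` module instead of typing an approximation. The variable `approx` is added here only to compare the two results.

```python
import math

diameter = 11.2                  # diameter of circle
radius = diameter / 2            # naming the intermediate step
area = math.pi * radius ** 2     # area using the library value of pi
approx = 3.14 * radius ** 2      # area using the rounded value 3.14

print(round(area, 2), round(approx, 2))  # 98.52 98.47
```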

## Computer Bugs

![Computer Bug](figs/bug.jpg "Computer Bug")

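Often described as the first actual computer bug. On September 9, 1947, operators of the Harvard Mark II computer found this moth trapped in a relay and taped it into the logbook. Grace Hopper, then a member of the Mark II team and later a Rear Admiral, made the story famous.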

## What Is Computation?

We use programming languages to write programs that instruct computers to perform algorithms, which calculate results or process data.


-------

* **Lab**: Installing Anaconda, working with Jupyter, and uploading assignments on GitHub
* **Next week**: Data types in Python