[Oregon Curriculum Network](http://4dsolutions.net/ocn/)

[Home](School_of_Tomorrow.ipynb)

# Data Visualization (Part Two)

<a data-flickr-embed="true"  href="https://www.flickr.com/photos/kirbyurner/26084029938/in/album-72157693427665102/" title="P1040200"><img src="https://live.staticflickr.com/4763/26084029938_70b453b3c2.jpg" width="500" height="333" alt="P1040200"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
<div style="text-align: center">
Political Data Layer (PDL)
</div>

## Snap Shots in Time

You might think of a photograph as a slice of time, like a slice of bread with some jam on it, where the jam is "whatever is happening" and where the slice has some thickness.  It takes time for photons to register on a chemical film, or on photosensitive resistors.  However, a trend in photography has been to heighten the sensitivity of the recording material and to speed up the frame rate, meaning each individual slice of "bread" (time) is becoming thinner.  Slow motion movies take advantage of this fact.

Speaking of movies, as soon as you put the time slices back together, you start to see rates of change.  In delta calculus, developed by Newton, Leibniz and other greats, we look at very small changes (differentiation), then add up those changes (integration) to get more summary results.  Even short movies, in placing one time slice after another in quick progression, say 30 frames per second (fps), may tell us a lot more about what we're looking at.
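A minimal sketch of that idea, using made-up measurements (the `position` numbers are an assumption for illustration, not data from this notebook): `np.diff` plays the role of differencing between frames, and `np.cumsum` adds the differences back up.

```python
import numpy as np

fps = 30  # frames per second, as in the text

# Hypothetical measurements, one per frame
position = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

changes = np.diff(position)   # frame-to-frame differences
rates = changes * fps         # change per second between frames

# Adding the differences back up recovers the original series
rebuilt = position[0] + np.concatenate(([0.0], np.cumsum(changes)))
```

The `changes` array has one fewer element than `position`, since each change sits between two frames.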

## Time Series Data

The time series in question might be "one snap shot a day" or "one per second" or "one per 1/100th of a second".  Or a time series might feature snap shots taken once a year, as when chronicling the slow advance or retreat of a glacier. 
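As a rough illustration, here is how "one snap shot a day" might look in pandas, and how daily slices can be summarized into weekly ones.  The values are invented, and the start date is chosen only because it falls on a Monday.

```python
import numpy as np
import pandas as pd

# Fourteen daily snapshots starting Monday 2024-01-01 (values are made up)
days = pd.date_range("2024-01-01", periods=14, freq="D")
daily = pd.Series(np.arange(14), index=days)

# Summarize into weekly totals; the default "W" bins end on Sundays
weekly = daily.resample("W").sum()
```

Changing the `freq` argument is how you would move between coarser and finer time slices.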

However, given the wide range of instrumentation out there, you may not be getting literal photographs.  Most measurements, including images, have a numeric format and when it comes to noticing or registering differences, we may use numeric techniques rather than our own eyes and brains, to judge what a movie is showing.

More likely yet, we use a combination of visualizations, intuitions, and numerical computations, in various feedback loops, to help build comprehension through apprehension (see Synergetics for more contextualizing passages).

A lot of time series data is financial in nature.  Analysts look for patterns and seek to anticipate trends.

Jake VanderPlas is [one of the top authorities](https://jakevdp.github.io/PythonDataScienceHandbook/03.11-working-with-time-series.html) on using ```pandas``` to do data science.  The link is to his copyleft book.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests  # a new tool, for communicating using the HTTP protocol

In [2]:
glossary_df = pd.read_json("glossary2.json")
glossary_df["sort_column"] = glossary_df.index.str.upper()
glossary_df.sort_values(['sort_column'], axis=0, ascending=True, inplace=True)
del glossary_df["sort_column"]  # now that the df is sorted, delete the sorting column

## Globes and Maps as Data Visualization Tools

<a data-flickr-embed="true"  href="https://www.flickr.com/photos/kirbyurner/28494828118/in/album-72157693427665102/" title="DGGS / Global Matrix"><img src="https://live.staticflickr.com/1741/28494828118_573fc6c57b_n.jpg" width="320" height="180" alt="DGGS / Global Matrix"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
<div style="text-align: center">
Data Visualization Tool
</div>

You likely agree that maps and globes are informative.  Many of us use maps and globes daily, especially maps, because we are trying to find routes to follow, to specific places we search for.  Places get stored with sufficient data, by means of latitude and longitude especially, to have them show up on maps and globes relative to other places.

Sometimes our globes and maps have little to do with Spaceship Earth in that we're in a gaming or simulation environment and working with randomly generated planets on which phenomena evolve over time.  Sometimes we're working with other existing astronomical bodies, such as the Moon or Mars.  We have detailed maps for these spherical bodies, mostly thanks to "spy satellite" technology that was originally developed for self (Earth) observation.

<a data-flickr-embed="true"  href="https://www.flickr.com/photos/kirbyurner/2908472176/in/album-72157607673506906/" title="Tailgate Tableau"><img src="https://live.staticflickr.com/3110/2908472176_3ca704ca1d_n.jpg" width="320" height="240" alt="Tailgate Tableau"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>
<div style="text-align: center">
Razz the Subaru with Map and Dog Food<br />
OCN HQS
</div>


### Empirical versus Non-empirical Simulations

Let's also remember that simulations and models kick in (become relevant) as soon as we move towards anticipating future developments, or filling in gaps for which we have only a little data.

The latter process (filling in gaps) is sometimes referred to as "interpolation", wherein we're "filling holes" based on what we know of some surroundings.  All of this kind of thinking is guesswork in some ways, provided we think new data might arise that could confirm or cast doubt on our guesses and models.  
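A small sketch of interpolation, assuming made-up sample points and the simplest possible rule (a straight line between neighbors), using `np.interp`:

```python
import numpy as np

# Known measurements, with a hole at x = 2 (values are invented)
known_x = np.array([0.0, 1.0, 3.0, 4.0])
known_y = np.array([0.0, 10.0, 30.0, 40.0])

# Linear interpolation guesses the missing value from its neighbors
gap_x = 2.0
estimate = np.interp(gap_x, known_x, known_y)
```

The estimate is only a guess.  A later measurement at `x = 2` could confirm it or cast doubt on the straight-line assumption.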

Studies subject to correction or confirmation or disconfirmation are considered "empirical", meaning continuing to gather data makes sense.  A science fiction computer game that takes a planet into a simulated future may be non-empirical because it's a made-up reality.

"What really happens" is not an issue when "reality" is not in a position to correct us. On the other hand, inside the novel or story, the fictional characters (such as they exist) may have their simulated empirical concerns.

### Philosophical Caveats

Let's investigate some philosophical concerns relating to whether a given study is empirical or not.  People come to reality with very different beliefs and assumptions relative to others.  We do not always see agreement on the core terms of a debate and so what counts as evidence for or against a particular model may itself be in dispute.

## Importing Data

### Case Study:  Periodic Table to DataFrame (over the web)

In the previous notebook, we looked at JSON as a viable format for both storing and streaming data on a global basis, meaning to and from any IP:port address.  Internet Protocol (IP) is not the only protocol available for transmitting signals, so we should keep in mind that JSON, as a data format, sits higher up in the protocol stack than the lower-level TCP/IP that we find in today's internet.
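A quick sketch of what "JSON as a format" means in practice, using only Python's standard `json` module.  The record shown is the hydrogen entry that appears later in this notebook.

```python
import json

record = {"H": [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"]}

payload = json.dumps(record)    # Python object -> JSON text, ready to store or send
restored = json.loads(payload)  # JSON text -> Python object again
```

Numbers stay numbers and strings stay strings across the round trip, which matters once we start worrying about column types.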

The Oregon Curriculum Network is currently sponsoring [a simple Flask website](http://thekirbster.pythonanywhere.com/), hosted by PythonAnywhere.com, that features all the source code and data on Github, meaning you're free to clone this website and operate it locally, on localhost. 

In [3]:
http_request  = "http://thekirbster.pythonanywhere.com/api/elements?elem=all"
http_response = requests.get(http_request)
http_response.status_code # if not 200, the request failed

200
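Checking `status_code` by eye works in a notebook.  In a script you might prefer to have a failed request raise an exception.  The helper below is a hypothetical sketch (the name `fetch_json` and the ten second timeout are assumptions, not part of this notebook's code):

```python
import requests

def fetch_json(url, timeout=10):
    """Return the decoded JSON payload from url, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
    return response.json()
```

Calling `fetch_json("http://thekirbster.pythonanywhere.com/api/elements?elem=all")` should return the same dictionary as the cells here, provided the site is reachable.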

In [4]:
periodic_table = http_response.json()
periodic_table['H']

[1, 'H', 'Hydrogen', 1.008, 'diatomic nonmetal', 1498013115, 'KTU']

In [5]:
symbols = np.array(list(periodic_table.keys()))

In [6]:
symbols[:10]

array(['H', 'N', 'As', 'Re', 'Fr', 'Hs', 'Fe', 'Mn', 'Yb', 'Xe'],
      dtype='<U2')

In [7]:
values = np.array(list(periodic_table.values()))

In [8]:
values[0]

array(['1', 'H', 'Hydrogen', '1.008', 'diatomic nonmetal', '1498013115',
       'KTU'], dtype='<U39')

Notice even the numeric values are quoted here, meaning they'll be treated as Unicode objects when it comes time to sort.  Our wish is to have appropriate number types in these columns.  The appropriate time to perform the conversion is when we have access to the ```.astype``` method of the pandas ```Series```, as these comprise the columns of the various sorted pandas ```DataFrame``` objects we aim to produce.
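To see why the quoting matters, here is a small sketch with a few atomic numbers stored as text (the sample values are picked for illustration):

```python
import pandas as pd

numbers_as_text = pd.Series(["1", "7", "33", "108"])

# Sorting text compares character by character, so "108" lands before "33"
text_sorted = numbers_as_text.sort_values().tolist()

# After conversion, the sort is numeric
numeric_sorted = numbers_as_text.astype(int).sort_values().tolist()
```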

In [9]:
values[:10, 1]

array(['H', 'N', 'As', 'Re', 'Fr', 'Hs', 'Fe', 'Mn', 'Yb', 'Xe'],
      dtype='<U39')

In [10]:
values.shape

(118, 7)

At this juncture, with ```values``` a two-dimensional numpy array of seven columns per row, we might as well apply the ndarray's native ```astype``` method to the columns that need it.  This method produces a new array with the elements converted.  The issue in this case is that atomic number and atomic mass came across quoted, as strings.

We should study how that happened.  Was [the source website](http://thekirbster.pythonanywhere.com/api/elements?elem=Cs) somehow the culprit?  No, that looks good.  Actually ```periodic_table['H']```, direct from the HTTP response object (with JSON payload) looked fine...

In [11]:
periodic_table['Cs']

[55, 'Cs', 'Cesium', 132.905451966, 'alkali metal', 1493462392, 'KTU']

The issue was turning these lists, of numbers and strings mixed together, into ndarrays of these same elements. An ndarray needs all its elements to be of the same type.  Numbers get coerced into strings at this point.
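The coercion is easy to reproduce on a single shortened row.  As a sketch, the second array below asks numpy for `dtype=object`, one way to keep the original Python types, though at the cost of numpy's fast numeric operations:

```python
import numpy as np

row = [55, "Cs", "Cesium", 132.905451966]

as_strings = np.array(row)                # numpy picks one common type: a string type
as_objects = np.array(row, dtype=object)  # each element keeps its Python type
```

Here `as_strings[0]` is the text `'55'`, while `as_objects[0]` is still the integer `55`.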

No matter, as we have the ability to slice and coerce back, in the process of creating and concatenating the three ```Series``` objects, plus an index.  That's what we undertake to accomplish below, with one of the several ways to construct a DataFrame.

In this case, the leftmost argument is a dictionary, with keys being column names, and values being single columns from the ```values``` matrix.

Notice the handsome frame.  The ```.head``` method defaults to displaying the top five rows.  You're free to pass an integer argument to see more.  Do you think ```.tail``` works?  How about ```.head``` with negative numbers?

In [12]:
p_table = pd.DataFrame({"Name":   values[:, 2], 
                        "Number": values[:, 0].astype(np.int8), 
                        "Mass":   values[:, 3].astype(np.float64)},
                        index = symbols)
p_table.index.name = "Symbol"
p_table.head()

| Symbol | Name | Number | Mass |
|---|---|---|---|
| H | Hydrogen | 1 | 1.008 |
| N | Nitrogen | 7 | 14.007 |
| As | Arsenic | 33 | 74.921596 |
| Re | Rhenium | 75 | 186.2071 |
| Fr | Francium | 87 | 223.0 |


In [13]:
p_table.tail()

| Symbol | Name | Number | Mass |
|---|---|---|---|
| Pt | Platinum | 78 | 195.0849 |
| Ar | Argon | 18 | 39.9481 |
| He | Helium | 2 | 4.002602 |
| Cr | Chromium | 24 | 51.99616 |
| Cd | Cadmium | 48 | 112.4144 |


In [14]:
p_table.head(-100)

| Symbol | Name | Number | Mass |
|---|---|---|---|
| H | Hydrogen | 1 | 1.008 |
| N | Nitrogen | 7 | 14.007 |
| As | Arsenic | 33 | 74.921596 |
| Re | Rhenium | 75 | 186.2071 |
| Fr | Francium | 87 | 223.0 |
| Hs | Hassium | 108 | 269.0 |
| Fe | Iron | 26 | 55.8452 |
| Mn | Manganese | 25 | 54.938044 |
| Yb | Ytterbium | 70 | 173.0451 |
| Xe | Xenon | 54 | 131.2936 |
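For the record, a negative argument means "all rows except this many".  A tiny sketch on a five-row frame (a stand-in for ```p_table```, built here only for illustration):

```python
import pandas as pd

demo = pd.DataFrame({"Number": [1, 2, 3, 4, 5]},
                    index=["H", "He", "Li", "Be", "B"])

all_but_last_two = demo.head(-2)   # drops the last two rows
all_but_first_two = demo.tail(-2)  # drops the first two rows
```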


In [15]:
p_table.dtypes  # mission accomplished, reconversion has occurred

Name       object
Number       int8
Mass      float64
dtype: object
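An alternative worth knowing about: building the DataFrame straight from the dictionary of lists skips the ndarray step, so the numbers never become strings.  This is a sketch on two sample records.  Only "Number", "Name" and "Mass" come from this notebook; the other column names are hypothetical labels.

```python
import pandas as pd

sample = {
    "H":  [1, "H", "Hydrogen", 1.008, "diatomic nonmetal", 1498013115, "KTU"],
    "Cs": [55, "Cs", "Cesium", 132.905451966, "alkali metal", 1493462392, "KTU"],
}

# "Sym", "Category", "Updated" and "Initials" are made-up column labels
columns = ["Number", "Sym", "Name", "Mass", "Category", "Updated", "Initials"]

alt_table = pd.DataFrame.from_dict(sample, orient="index", columns=columns)
alt_table = alt_table[["Name", "Number", "Mass"]]  # keep the same three columns
alt_table.index.name = "Symbol"
```

With this route pandas infers an integer type for Number and a float type for Mass on its own, though not the compact ```int8``` chosen above.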

Check out the pandas API documentation for [how to sort by an index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html).

In [16]:
p_table_by_symbol = p_table.sort_index(axis=0, ascending=True, inplace=False)
p_table_by_symbol.head()

| Symbol | Name | Number | Mass |
|---|---|---|---|
| Ac | Actinium | 89 | 227.0 |
| Ag | Silver | 47 | 107.86822 |
| Al | Aluminium | 13 | 26.981539 |
| Am | Americium | 95 | 243.0 |
| Ar | Argon | 18 | 39.9481 |


Next let's sort the original ```p_table``` DataFrame object by Number instead of by Symbol.

You might want to [check the API docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) again, as the arguments are a little bit different.

In [17]:
# inplace = False is the default, but it doesn't hurt to be explicit
p_table_by_number = p_table.sort_values(["Number"], ascending=True, inplace=False) 
p_table_by_number.head()

| Symbol | Name | Number | Mass |
|---|---|---|---|
| H | Hydrogen | 1 | 1.008 |
| He | Helium | 2 | 4.002602 |
| Li | Lithium | 3 | 6.94 |
| Be | Beryllium | 4 | 9.012183 |
| B | Boron | 5 | 10.81 |
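The same method sorts on any column and in either direction.  A short sketch, using a four-row stand-in for ```p_table``` with values copied from the outputs above:

```python
import pandas as pd

demo = pd.DataFrame({"Name":   ["Hydrogen", "Cesium", "Helium", "Lithium"],
                     "Number": [1, 55, 2, 3],
                     "Mass":   [1.008, 132.905451966, 4.002602, 6.94]},
                    index=["H", "Cs", "He", "Li"])

heaviest_first = demo.sort_values("Mass", ascending=False)
```

A single column name may be passed as a plain string.  A list is needed only when sorting on several columns at once.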


Notice that you don't need to import Python itself.  That's because Python is the Kernel behind the scenes running all these code cells.  One specifies the Kernel upon starting a new Jupyter Notebook.

In [18]:
# glossary is now a pandas DataFrame
glossary_df.loc["TCP/IP"] =  "at the basis of the ARPAnet, funded by DARPA, and later the generic internet"
glossary_df.loc["delta calculus"] = "differential and integral calculus, popularized by Newton and Leibniz"
glossary_df.loc["lambda calculus"] = "a logic of functions, invented by Alonzo Church et al, a basis of CS"
glossary_df.loc["Series"] = "the pandas object for a single column of data, free-standing or in a DataFrame"
glossary_df.loc["TCP"] = "Transmission Control Protocol"
glossary_df.loc["IP"] = "Internet Protocol"
glossary_df.loc["IP packet"] = "A routable chunk of data obeying the IP protocol (format), with payload"
glossary_df.loc["CS"] = "Computer Science"
glossary_df.loc["STEM"] = "Science, Technology, Engineering and Mathematics"
glossary_df.loc["PATH"] = "Philosophy, Anthropology, Theater and History"
glossary_df.loc["STEAM"] = "STEM with Anthropology added (Anthropology > Art)"

glossary_df["sort_column"] = glossary_df.index.str.upper()
glossary_df.sort_values(['sort_column'], axis=0, ascending=True, inplace=True)
del glossary_df["sort_column"]  # now that the df is sorted, delete the sorting column
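The temporary ```sort_column``` gives a case-insensitive sort.  Assuming pandas 1.1 or later, the ```key``` argument of ```sort_index``` can do the same job without the extra column.  A sketch on a three-row stand-in for the glossary:

```python
import pandas as pd

demo = pd.DataFrame({"definition": ["part of a notebook",
                                    "a set of functions",
                                    "comma-separated values"]},
                    index=["cell", "API", "CSV"])

# The key function receives the whole index and returns the values to sort by
demo_sorted = demo.sort_index(key=lambda idx: idx.str.upper())
```

A plain ```sort_index()``` would put every uppercase entry ahead of ```cell```, since uppercase letters sort before lowercase ones.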

In [19]:
pd.set_option('display.max_colwidth', None)  # no limit on column width (older pandas versions used -1)
glossary_df

|  | definition |
|---|---|
| API | a set of functions that take variable arguments, providing programmed control of something |
| Bayesian | inferential methods useable even in the absense of any prospect for controlled studies |
| cell | a Jupyter Notebook consists of mostly Code and Markdown cells |
| code cell | where runnable code, interpreted by the Kernel, is displayed and color coded |
| CS | Computer Science |
| CSV | comma-separated values, one of the simplest data sharing formats |
| DataFrame | the star of the pandas package, providing ndarrays with a framing infrastructure |
| delta calculus | differential and integral calculus, popularized by Newton and Leibniz |
| DOM | the Document Object Model is a tree of graph of a document in a web browser |
| HTML | hypertext markup language, almost an XML, defines the DOM in tandem with CSS |


In [20]:
glossary_df.to_json("glossary3.json")
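To check that nothing is lost when saving, a round trip can be tried in memory first.  This sketch uses a two-row stand-in for the glossary and an ```io.StringIO``` buffer in place of a file on disk:

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {"definition": ["Computer Science", "comma-separated values"]},
    index=["CS", "CSV"],
)

as_text = demo.to_json()                       # no path given, so a JSON string comes back
restored = pd.read_json(io.StringIO(as_text))  # read it back from a file-like object
```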