
Here is a quick import and export of the Chicago crime stats data set that several of us had used earlier, so that we can easily shrink it, as well as alter it as needed.

-----

First, some sense of where we're at:


In [10]:
%pwd

'C:\\Users\\Angie\\Documents\\ANLT_233_Dynamic\\ANLT233-Dynamic_Viz\\Lessons'


In other words, if you are also at the top level of the shared GitHub repository for this class **when you launch your IPython Notebook** (and you're using Python 3), you can exactly mimic the actions I'm taking here.

-----

Below I will change directories and show a listing of what is in the directory.  By the time you are doing the work, there will likely be more files; however, the files used here will not change.


In [11]:
%cd ..
%cd Lessons

C:\Users\Angie\Documents\ANLT_233_Dynamic\ANLT233-Dynamic_Viz
C:\Users\Angie\Documents\ANLT_233_Dynamic\ANLT233-Dynamic_Viz\Lessons


In [12]:
%ls data

 Volume in drive C is Windows8_OS
 Volume Serial Number is B4C7-F861

 Directory of C:\Users\Angie\Documents\ANLT_233_Dynamic\ANLT233-Dynamic_Viz\Lessons\data

08/31/2016  08:23 PM    <DIR>          .
08/31/2016  08:23 PM    <DIR>          ..
08/27/2016  10:15 AM         2,076,928 Boundaries - Community Areas (current).geojson
08/27/2016  10:15 AM           239,976 Boundaries - Community Areas (current).topojson
08/27/2016  10:16 AM        75,580,015 Chicago_crime_dataset.csv
08/27/2016  10:17 AM       163,256,928 Chicago_crime_dataset.json
08/31/2016  12:43 PM            49,544 LAenergyusage.json
08/31/2016  07:50 PM           100,573 mini_Chicago_crime_columns.json
08/31/2016  07:50 PM           153,996 mini_Chicago_crime_records.json
08/27/2016  10:17 AM             1,641 README.md
08/27/2016  10:17 AM    <DIR>          Uganda_shapefiles
               8 File(s)    241,459,601 bytes
               3 Dir(s)  680,297,861,120 bytes free



The data sets that we've used in the past, and that Heather Steich generated, are the `Chicago_crime_dataset.csv` and `Chicago_crime_dataset.json`.

-----

Here we'll use the JSON format, **because** it is a bit more complex to manage, **and** it allows for more flexibility.

(JSON, remember, stands for **J**ava**S**cript **O**bject **N**otation.  Therefore, it is designed to work well with JavaScript, which is the language we will primarily use here.  However, as we've discussed before, CSV files are **necessarily** 2-dimensional dataframe-like structures, whereas JSON can have an arbitrary - and inconsistent - depth within elements.)
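
To make the "arbitrary and inconsistent depth" point concrete, here is a minimal sketch using only Python's standard `json` module.  The two records and the `depth` helper are invented for illustration (they are not part of the Chicago data set); the point is just that a flat record and a deeply nested one can sit side by side in the same JSON list, which a CSV cannot express.

```python
import json

# Hypothetical records, made up for illustration only.
flat_record = {"CA Name": "Loop", "Arrest": 1}
nested_record = {
    "CA Name": "Loop",
    "Crimes": [
        {"Primary Type": "BATTERY", "Arrest": 1},
        {"Primary Type": "THEFT", "Arrest": 0, "Weather": {"Min TempF": 29}},
    ],
}


def depth(value):
    """Count the levels of dict/list nesting in a parsed JSON value."""
    if isinstance(value, dict):
        return 1 + max((depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((depth(v) for v in value), default=0)
    return 0  # strings, numbers, booleans, and None add no nesting


# Both records survive a JSON round trip even though their depths differ.
records = json.loads(json.dumps([flat_record, nested_record]))
depths = [depth(r) for r in records]
print(depths)
```

The first record is what one CSV row looks like; the second has no natural 2-dimensional representation without flattening or repeating values.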

I am purposely choosing the more complex format because I would rather the lesson focus upon harder things, not easier.


In [13]:
import pandas as pd

crime_df = pd.read_json('./data/Chicago_crime_dataset.json')

In [14]:
crime_df.shape

(297978, 35)

In [15]:
crime_df.head()

Unnamed: 0,Area Per Capita Income,Area Prop Age<18 or Age>64,Area Prop Age>16 Unemployed,Area Prop Age>25 w/o HS Diploma,Area Prop Households Below Poverty,Area Prop Housing Crowded,Arrest,CA Name,CST,Cloud Cover,...,Max Visibility Miles,Max Wind MPH,Min DewPtF,Min Humidity,Min Sea Lvl PresIn,Min TempF,Min Visibility Miles,PrecipIn,Primary Type,Public Transit Rides
0,13089,0.393,0.139,0.451,0.236,0.144,0,Brighton Park,1325376000000,7,...,10,33,16,56,29.52,29,2,0.04,BATTERY,227947
1,65526,0.135,0.057,0.031,0.147,0.015,1,Loop,1325376000000,7,...,10,33,16,56,29.52,29,2,0.04,BATTERY,227947
10,12961,0.432,0.196,0.213,0.424,0.082,1,East Garfield Park,1325376000000,7,...,10,33,16,56,29.52,29,2,0.04,BATTERY,227947
100,71551,0.215,0.051,0.036,0.123,0.008,0,Lincoln Park,1325376000000,7,...,10,33,16,56,29.52,29,2,0.04,CRIMINAL DAMAGE,227947
1000,15957,0.379,0.226,0.244,0.286,0.063,0,Austin,1325376000000,7,...,10,33,16,56,29.52,29,2,0.04,THEFT,227947


In [16]:
crime_df.dtypes

Area Per Capita Income                       float64
Area Prop Age<18 or Age>64                   float64
Area Prop Age>16 Unemployed                  float64
Area Prop Age>25 w/o HS Diploma              float64
Area Prop Households Below Poverty           float64
Area Prop Housing Crowded                    float64
Arrest                                         int64
CA Name                                       object
CST                                            int64
Cloud Cover                                    int64
Community Area                                 int64
Date                                  datetime64[ns]
Day of Week                                   object
Description                                   object
Domestic                                       int64
Events                                        object
Hardship Index                               float64
ID                                             int64
Latitude                                     f


Yep, it's that really big data set which Heather created for the earlier visualizations class.



-----

Let's make a smaller subset to work with for our dynamic visualizations, but not as tiny as what you'll find in the examples within either of the books ([*D3.js in Action*](https://www.manning.com/books/d3-js-in-action) or [*Interactive Data Visualization (for the web)*](http://chimera.labs.oreilly.com/books/1230000000345)).  My biggest issue with both of these books is that the data sets they use are so tiny and simple that they do not lend themselves easily to the real world situations you will come across.

Here, let's keep the following columns:

 - CA Name
 - Primary Type
 - Arrest
 - Area Per Capita Income
 - Area Prop Age>16 Unemployed
 - Area Prop Households Below Poverty
 - Hardship Index

Why these columns?

 - To keep the data set simple enough that all of the information on each observation is easy to see and understand.  CA Name is the level of granularity (it is the neighborhood name), Primary Type is the type of crime, Arrest is whether or not an arrest was made, and the remaining columns capture a few salient characteristics of the neighborhood.
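
One option for picking out columns like these is to select them by name rather than by position, which keeps working if the column order ever changes.  The sketch below uses a tiny stand-in dataframe (`toy_df`, a hypothetical name, with a few values copied from the `head()` output above), not the full `crime_df`, and shows that the two approaches give the same result.

```python
import pandas as pd

# Toy stand-in for crime_df: a handful of columns in the same alphabetical order.
toy_df = pd.DataFrame({
    "Area Per Capita Income": [13089, 65526],
    "Arrest": [0, 1],
    "CA Name": ["Brighton Park", "Loop"],
    "Cloud Cover": [7, 7],
    "Primary Type": ["BATTERY", "BATTERY"],
})

# Columns to keep, in the order we want them to appear.
keep_cols = ["CA Name", "Primary Type", "Arrest", "Area Per Capita Income"]

# Selection by name.
by_name = toy_df[keep_cols]

# The equivalent positional selection, with the positions looked up from the names.
positions = [toy_df.columns.get_loc(col) for col in keep_cols]
by_position = toy_df.iloc[:, positions]

print(positions)
print(by_name.equals(by_position))
```

Looking positions up with `columns.get_loc` is also a handy way to double check a hand-written list of column numbers.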

-----

Next, let's keep only 10 observations (or records or rows) per neighborhood.  Why?

 - We want an even sampling between neighborhoods.
 - We want the data set to be small enough that once visualizations are moving and dynamic, we don't see severe lag and slowdown.
     - There are ways to make visualizations run more quickly, such as using the [HTML Canvas Element](http://www.w3schools.com/html/html5_canvas.asp) that the *D3.js in Action* book recommends, but we will not worry about speed issues for the moment, and simply shrink our data set so that we will not have significant lag.




In [17]:
cols_oi = [7,33,6,0,2,4,16] # These are the column numbers of the columns mentioned above.
mini_df = crime_df.groupby('CA Name').head(10).iloc[:,cols_oi]
mini_df.head(4)

Unnamed: 0,CA Name,Primary Type,Arrest,Area Per Capita Income,Area Prop Age>16 Unemployed,Area Prop Households Below Poverty,Hardship Index
0,Brighton Park,BATTERY,0,13089,0.139,0.236,84
1,Loop,BATTERY,1,65526,0.057,0.147,3
10,East Garfield Park,BATTERY,1,12961,0.196,0.424,83
100,Lincoln Park,CRIMINAL DAMAGE,0,71551,0.051,0.123,2



Notice above that I simply took the `head(10)` of each neighborhood.  I am presuming that there is no natural ordering of the original data set, so that taking the top 10 of each neighborhood is sufficiently like taking a random sample of each neighborhood.

For example / tutorial purposes, this is fine.  For rigorous analysis, ensure random sampling - or even better, [sampling reflecting the population](https://en.wikipedia.org/wiki/Stratified_sampling) - but again, this course is not about statistics.
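
If you did want a random sample per neighborhood instead of the first rows, one possible approach (a sketch, not what was done to create the files in this lesson) is to shuffle the whole dataframe first and then take the `head` of each group.  The dataframe below is a made-up toy, and `n_per_group` and the seed are arbitrary choices for illustration.

```python
import pandas as pd

# Toy data: three neighborhoods with 5, 4, and 2 observations.
toy_df = pd.DataFrame({
    "CA Name": ["Loop"] * 5 + ["Austin"] * 4 + ["Lincoln Park"] * 2,
    "Arrest": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1],
})

n_per_group = 3  # the lesson uses 10; 3 keeps the toy example small

# Shuffle every row (frac=1 keeps all rows), then keep the first
# n_per_group rows of each neighborhood in the shuffled order.
# random_state fixes the shuffle so the result is repeatable.
shuffled = toy_df.sample(frac=1, random_state=0)
sampled = shuffled.groupby("CA Name").head(n_per_group)

# Neighborhoods with fewer than n_per_group rows simply keep all of them.
counts = sampled["CA Name"].value_counts().to_dict()
print(counts)
```

Because `head` never asks for more rows than a group has, small neighborhoods do not raise an error; they just contribute fewer rows.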

-----

Now let's simply "spit them out" as JSON files, as discussed above, so that we can start to use them in our D3 files, our beginnings of dynamic visualization.


In [18]:
mini_df.to_json('./data/mini_Chicago_crime_columns.json', orient='columns')
mini_df.to_json('./data/mini_Chicago_crime_records.json', orient='records')


Why am I spitting it out in 2 different data formats?  What's the difference between `orient='columns'` (the default), and `orient='records'`?

The [`pandas documentation`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html) discusses this, but somewhat cryptically, saying:

 - columns : dict like {column -> {index -> value}}
 - records : list like [{column -> value}, ... , {column -> value}]

-----

Let's break down what this is saying.

Columns:

We'll have a higher level **dictionary**, where the keys will be the columns, and the values will be another dictionary where the keys are the indices and the values are, well, the values.

Put another way, each **observation** is identified by the `pandas` dataframe **row index** as the inner key, and each column represents the outer, higher level key.

E.g., here's a snippet from our Chicago data:

    {"CA Name":
      {"0.0":"Brighton Park",
       "1.0":"Loop",
       "10.0":"East Garfield Park",
       ...
      },
     "Primary Type":
      {"0.0":"BATTERY",
      "1.0":"BATTERY",
      "10.0":"BATTERY",
      ...
      },
      .
      .
      .
    }

-----

Records:

We'll have a higher level **list**, where each element in the list will consist of an observation.  The list element is a dictionary with the column as the key and its value, again, as the value.

E.g., here's a snippet from our Chicago crime data:

    [
      {"CA Name":"Brighton Park",
       "Primary Type":"BATTERY",
       "Arrest":0,
       ...
      },
      {"CA Name":"Loop",
       "Primary Type":"BATTERY",
       "Arrest":1,
       ...
      },
      .
      .
      .
    ]
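
To see both layouts side by side without opening the exported files, here is a small sketch on a two-row toy dataframe (the values are copied from the snippets above; `toy_df` is a hypothetical name).  When `to_json` is called without a file path it returns the JSON text as a string, which we can parse back with the standard `json` module to inspect the structure.

```python
import json

import pandas as pd

toy_df = pd.DataFrame({
    "CA Name": ["Brighton Park", "Loop"],
    "Primary Type": ["BATTERY", "BATTERY"],
    "Arrest": [0, 1],
})

# No path given, so each call returns a JSON string instead of writing a file.
as_columns = json.loads(toy_df.to_json(orient="columns"))
as_records = json.loads(toy_df.to_json(orient="records"))

# columns: outer key is the column, inner key is the row index (as a string).
print(as_columns["CA Name"]["1"])

# records: a list position picks the observation, then the column name picks the value.
print(as_records[1]["CA Name"])
```

Both lookups reach the same value; the difference is whether you go column first and then row (`columns`), or row first and then column (`records`).  Note that `records` drops the row index entirely.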


-----

Why all this hassle with exporting the data and figuring out how to format it?  Since we're more savvy with R / Python, it will be easier to get the data **mostly** into the right format in Python or R.  Next, we'll look at the D3 side of things to see how these 2 JSON structures look different within D3 / JavaScript.


