## Intro

The purpose of today's class is to explore data using **interactive visualizations**. Interactivity is a key part of modern dataviz. It lets the users of your visualizations get their own feel for the data, and it lets you create richer visualizations, where people who use your work can expose more of the data by exploring.


## Part 1: Video Lectures and Reading

Starting this week, we'll be playing around with *explanatory data visualization*. Roughly speaking this means using data visualization to communicate your results to others. Thus, there are new things to think about. 

Until today we have worked with static data visualization. However, exploratory data analysis is about exploring the multi-faceted nature of data, and *interactive dataviz* is a handy tool for doing that! It allows you to play with the data: Toggle the view. Zoom. Drag. Show more details. All those things. Those are a key part of modern data visualization. 

The video below provides context about these points.

We start with the video and then read a bit from a scientific article about types of explanatory dataviz. (*The video is from an old version of the class that used D3, so just ignore those parts.*)

[![Video lecture on explanatory data visualization](https://img.youtube.com/vi/yHKYMGwefso/0.jpg)](https://www.youtube.com/watch?v=yHKYMGwefso)

> *Exercises*: Explanatory data visualization
> * What are the three key elements to keep in mind when you design an explanatory visualization?
> * In the video I talk about (1) *overview first*,  (2) *zoom and filter*,  (3) *details on demand*. 
>   - Go online and find a visualization that follows these principles (don't use one from the video). 
>   - Explain how your visualization achieves (1)-(3). It might be useful to use screenshots to illustrate your explanation.
> * Explain in your own words: How is explanatory data analysis different from exploratory data analysis?

## Part 2: Interactive visualizations with Bokeh



To really master interactive visualizations, you will need to work with JavaScript, especially [D3](https://d3js.org). Given the limited time available for this class, we can't squeeze that in. But luckily Python has some pretty good options for interactive visualizations. You can find a range of different options [here](https://mode.com/blog/python-interactive-plot-libraries/).

Today, we'll explore [`Bokeh`](https://docs.bokeh.org/en/latest/), which brings lots of nice interactive functionality to Python. To work with Bokeh, we first set up our system:

1. If you haven't installed it yet, please do so. You can simply follow [these steps](https://docs.bokeh.org/en/latest/docs/first_steps/installation.html).

2. To include Bokeh in your notebooks you can follow the [Bokeh: Using with Jupyter](https://docs.bokeh.org/en/latest/docs/user_guide/output/jupyter.html#jupyter) guide. Come back to this one when you need it.

3. We aim to give you a gentle start with Bokeh, so I am going to include more example code than usual in the following.
   * **HINT 1**: If you're not an experienced Python user, I recommend going to the [official user's guide](https://docs.bokeh.org/en/latest/docs/user_guide.html#userguide) and working through it. Start by clicking "Introduction" in the linked page. That page has a glossary, a section on output methods, stuff on settings, and interfaces that you can scroll through. The next page, *Basic Plotting*, is where the action is. Spend some time working through that.
   * **HINT 2**: And by "working through it", I mean copy, paste, and run the code in your own notebook. 

Ok. Let's get started. First a general announcement on the data.

> **Announcement**
> * During this entire lecture, as always, we are going to work with the SF Crime Data. 
> * We will use data for the **period 2010-2017**.


Now, to get you in the mood here's a little gif to illustrate what the goal of this exercise is:

In [1]:
import numpy as np 
import pandas as pd 

In [2]:
df=pd.read_csv('Police_Department_Incident_Reports__Historical_2003_to_May_2018.csv')

![Movie](https://github.com/suneman/socialdata2023/blob/main/files/week8_1.gif?raw=true)

If the gif isn't displaying on your system, you can download it [here](https://github.com/suneman/socialdata2023/blob/main/files/week8_1.gif) and display locally.


> ***Exercise***: Recreate the results from **Week 2** as an interactive visualisation (shown in the gif). To complete the exercise, follow the steps below to create your own version of the dataviz.

### Data prep

A key step is to set up the data right. So for this one, we'll be pretty strict about the steps. The workflow is

1. Take the data for the period of 2010-2017 and group it by hour-of-the-day.
2. We would like to be able to easily compare how the distributions of crimes differ from each other, not their absolute numbers, so we will work on *normalized data*:
    * To normalize the data within a crime category, you simply divide the count for each hour by the total number of crimes of that type. (To give a concrete example: in the `ASSAULT` category, divide the number of assaults in the 1st hour by the total number of assaults, then divide the number of assaults in the 2nd hour by the total number of assaults, and so on.)
    * Your life will be easiest if you organize your dataframe as shown in [this helpful screenshot](https://github.com/suneman/socialdata2023/blob/main/files/W6_Part2_data.png).
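
The normalization in step 2 can be sketched on a toy table. This is a minimal illustration, not the intended solution: the tiny dataframe and its column names (`Category`, `hour`) are made up for the example, and `pd.crosstab` with `normalize="columns"` is just one of several ways to get one column per crime type whose values sum to 1.

```python
import pandas as pd

# Toy incident log (made-up data): one row per incident
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ASSAULT", "ASSAULT", "ASSAULT", "ROBBERY", "ROBBERY"],
    "hour":     [0, 0, 1, 2, 1, 1],
})

# Count incidents per (hour, category), then divide each column by its total
shares = pd.crosstab(toy["hour"], toy["Category"], normalize="columns")
print(shares)
```

Each column of `shares` now describes how one crime type is distributed over the hours, independent of how common that crime is overall.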

If you've followed these steps, your data should be ready! Take a moment to celebrate. We now follow the [Bokeh guide for categorical data](https://docs.bokeh.org/en/latest/docs/user_guide/basic/bars.html):



1. First, let's convert our **Pandas DataFrame** to a **Bokeh ColumnDataSource**: 
  > ```python
  > from bokeh.models import ColumnDataSource
  >
  > source = ColumnDataSource(your_processed_dataframe)
  > # this is the standard way to hand your df over to Bokeh
  > ```
2. We also need to create an empty figure (we will add our stuff here later on). Mini sub-exercise: Find a guide online on how to define a figure in Bokeh. Here is a little help:
  > ```python
  > p = figure(...., x_range = FactorRange(factors=hours), ...) 
  > # p is the standard name for figures in Bokeh
  > # make sure to add x_range. In my case hours is a list of the form ['1', '2', '3' ... , '24']
  > # read up on the FactorRange in the guide
  > # do not forget to add other attributes to the figure, e.g. title, axis names and so on
  > ```
3. Now we are going to add the bars. In order to do so, we will use **vbar** (see the guide for help):
  > ```python
  > bar = {} # to store vbars
  > ### here we will do a for loop:
  > for indx, i in enumerate(focuscrimes):
  >     bar[i] = p.vbar(x='name_of_the_column_that_contain_hours', top=i, source=source,
  >                     ### we will create a vbar for each focuscrime
  >                     legend_label=i,  muted_alpha=..., muted = ....) 
  > # i stands for a column that we use, top=y; we are specifying that our numbers come from column i
  > # read up on what legend_label, muted and muted_alpha do... you can add more attributes (you HAVE TO)
  > ```
4. The last thing to do is to make your legend interactive and display the figure:
  > ```python
  > p.legend.click_policy = "mute" ### assigns the click policy (you can also try "hide")
  > show(p) # displays your plot
  > ```
5. You will notice that the legend appears in the middle of the figure (and it occludes some of the data). In order to fix this look into [this guide](https://stackoverflow.com/questions/26254619/position-of-the-legend-in-a-bokeh-plot) as a start. Below are some code snippets that you can use to deal with this problem (but read the guide first):
  > ```python
  > items = [] ### for the custom legend // you need to figure out where to add it
  > items.append((i, [bar[i]])) ### figure out where to add it
  > legend = Legend(items=..., location=.....) ## figure out where to add it
  > p.add_layout(...., ...) ## figure out where to add it
  > ### if you read the guide, it will make sense
  > ```

Now you should be able to recreate this amazing visualisation.
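
To see how the pieces from steps 1-5 fit together, here is a minimal sketch on made-up data. The category names `A` and `B`, the colors, and the numbers are all assumptions for illustration, so treat it as a starting point rather than the solution to the exercise. Note that the values in the `hour` column are strings, because a `FactorRange` axis is categorical.

```python
from bokeh.models import ColumnDataSource, FactorRange, Legend
from bokeh.plotting import figure, show

# Made-up shares for two hypothetical categories over three "hours"
hours = ["1", "2", "3"]  # categorical factors must be strings
data = {"hour": hours, "A": [0.5, 0.3, 0.2], "B": [0.2, 0.2, 0.6]}
source = ColumnDataSource(data)

p = figure(x_range=FactorRange(factors=hours), title="Toy example")

items = []  # (label, [renderer]) pairs for a legend placed outside the plot
for name, color in zip(["A", "B"], ["steelblue", "orange"]):
    renderer = p.vbar(x="hour", top=name, width=0.8, source=source,
                      color=color, alpha=0.6, muted_alpha=0.05)
    items.append((name, [renderer]))

# Build the legend by hand and attach it to the right of the plot area
legend = Legend(items=items, click_policy="mute")
p.add_layout(legend, "right")

# show(p)  # uncomment to display the plot
```

Clicking a legend entry then fades the corresponding bars to `muted_alpha` instead of removing them.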


**EXTRA feature**: If you're interested in detailed instructions for more Bokeh visualizations for your final project, you can find more inspiration **[here](https://github.com/suneman/socialdata2021/blob/main/lectures/Week8_extra_bokeh.ipynb)**.

a) Keep the data from 2010 to 2017

In [3]:
df['Date']=pd.to_datetime(df['Date']) 

In [4]:
# create the column year (Date was converted to datetime in the previous cell)
df["year"] = df["Date"].dt.year

In [5]:
# keep the years from 2010 to 2017
df = df[df["year"] < 2018]
df = df[df["year"] >= 2010]
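
The two filters above can also be written as a single boolean mask. A small sketch on made-up dates (the toy dataframe is an assumption for illustration; `Series.between` is inclusive on both ends by default):

```python
import pandas as pd

# Made-up dates around the edges of the period
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2009-12-31", "2010-01-01", "2017-12-31", "2018-01-01"])
})
toy["year"] = toy["Date"].dt.year

# Keep only rows whose year lies in 2010..2017 (both ends included)
in_period = toy[toy["year"].between(2010, 2017)]
print(in_period)
```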

In [6]:
# derive the month, day-of-week, and hour-of-day columns used for grouping
df["month"] = df["Date"].dt.month          # create the column month
df["dayofWeek"] = df["Date"].dt.dayofweek  # create the column dayofWeek (Monday = 0)
df["Time"] = pd.to_datetime(df["Time"])    # convert the time column to datetime
df["hoursOfday"] = df["Time"].dt.hour      # create the column hoursOfday
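
If the time column is stored as `HH:MM` strings (an assumption about the CSV worth checking with `df["Time"].head()`), passing an explicit format makes the parsing unambiguous. A minimal sketch on made-up values:

```python
import pandas as pd

times = pd.Series(["00:05", "08:30", "23:59"])  # made-up time strings

# Parse with an explicit format, then pull out the hour component
hours = pd.to_datetime(times, format="%H:%M").dt.hour
print(hours.tolist())
```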

In [7]:
# this frame is not actually needed later on
df_hourOfday = df.groupby(["hoursOfday"])["PdId"].count().reset_index(name= "countby_hourodfay")

In [8]:
# groupby the 2 columns that I want 
df_cat_hourofday = df.groupby(["Category" , "hoursOfday"])["PdId"].count().reset_index(name= "count_cat_hourofday")

In [None]:
# NORMALIZE THE DATA 

In [9]:
df_cat_total = df_cat_hourofday.groupby(["Category"])['count_cat_hourofday'].sum().reset_index( name = "count_cat_total")

In [10]:
df_cat_total

Unnamed: 0,Category,count_cat_total
0,ARSON,2042
1,ASSAULT,87204
2,BAD CHECKS,304
3,BRIBERY,493
4,BURGLARY,45781
5,DISORDERLY CONDUCT,4667
6,DRIVING UNDER THE INFLUENCE,3274
7,DRUG/NARCOTIC,45802
8,DRUNKENNESS,4933
9,EMBEZZLEMENT,1349


In [11]:
df_normal = pd.merge(df_cat_hourofday, df_cat_total, on = "Category")   # merge the 2 above dataframes

In [12]:
df_normal

Unnamed: 0,Category,hoursOfday,count_cat_hourofday,count_cat_total
0,ARSON,0,139,2042
1,ARSON,1,128,2042
2,ARSON,2,134,2042
3,ARSON,3,138,2042
4,ARSON,4,124,2042
...,...,...,...,...
850,WEAPON LAWS,19,749,11523
851,WEAPON LAWS,20,627,11523
852,WEAPON LAWS,21,609,11523
853,WEAPON LAWS,22,725,11523


In [13]:
df_normal["share"] = df_normal['count_cat_hourofday'] / df_normal['count_cat_total'] 
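
The merge-and-divide steps above can also be condensed with `groupby(...).transform("sum")`, which broadcasts each category total back onto its rows. The sketch below uses made-up counts and a hypothetical column name `count`, and ends with a sanity check that the shares of every category sum to 1.

```python
import numpy as np
import pandas as pd

# Made-up counts per (category, hour)
toy = pd.DataFrame({
    "Category":   ["ARSON", "ARSON", "ASSAULT", "ASSAULT", "ASSAULT"],
    "hoursOfday": [0, 1, 0, 1, 2],
    "count":      [3, 1, 5, 3, 2],
})

# Divide each count by the total of its own category
toy["share"] = toy["count"] / toy.groupby("Category")["count"].transform("sum")

# Sanity check: within every category the shares should add up to 1
totals = toy.groupby("Category")["share"].sum()
print(totals)
assert np.allclose(totals, 1.0)
```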

In [14]:
df_normal

Unnamed: 0,Category,hoursOfday,count_cat_hourofday,count_cat_total,share
0,ARSON,0,139,2042,0.068071
1,ARSON,1,128,2042,0.062684
2,ARSON,2,134,2042,0.065622
3,ARSON,3,138,2042,0.067581
4,ARSON,4,124,2042,0.060725
...,...,...,...,...,...
850,WEAPON LAWS,19,749,11523,0.065000
851,WEAPON LAWS,20,627,11523,0.054413
852,WEAPON LAWS,21,609,11523,0.052851
853,WEAPON LAWS,22,725,11523,0.062918


In [51]:
# keep only the focuscrimes

In [15]:
focuscrimes = sorted(['WEAPON LAWS', 'PROSTITUTION', 'DRIVING UNDER THE INFLUENCE', 'ROBBERY', 'BURGLARY', 'ASSAULT', 'DRUNKENNESS', 'DRUG/NARCOTIC', 'TRESPASS', 'LARCENY/THEFT', 'VANDALISM', 'VEHICLE THEFT', 'STOLEN PROPERTY', 'DISORDERLY CONDUCT'])  # sorted list, so the column, color, and legend order is stable

In [19]:
df_filtered = df_normal[df_normal.loc[:,"Category"].isin(focuscrimes)]

In [None]:
# equivalent to what I did above
# df_filtered = df_normal[df_normal['Category'].isin(focuscrimes)] 

In [22]:
df_filtered

Unnamed: 0,Category,hoursOfday,count_cat_hourofday,count_cat_total,share
24,ASSAULT,0,4837,87204,0.055468
25,ASSAULT,1,4338,87204,0.049745
26,ASSAULT,2,3910,87204,0.044837
27,ASSAULT,3,2029,87204,0.023267
28,ASSAULT,4,1223,87204,0.014025
...,...,...,...,...,...
850,WEAPON LAWS,19,749,11523,0.065000
851,WEAPON LAWS,20,627,11523,0.054413
852,WEAPON LAWS,21,609,11523,0.052851
853,WEAPON LAWS,22,725,11523,0.062918


In [32]:
# reassign the result instead of using inplace=True, which triggers pandas' SettingWithCopyWarning on a filtered frame
df_filtered = df_filtered.drop(columns=["count_cat_hourofday", "count_cat_total"])
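
Modifying a filtered dataframe in place can make pandas emit a `SettingWithCopyWarning` (in the versions that have this warning), because it cannot tell whether you meant to change the original frame or the slice. A hedged sketch of one common way around it, using a made-up frame: take an explicit `.copy()` of the filtered rows and reassign the result of `drop`.

```python
import pandas as pd

# Made-up frame standing in for the normalized counts
toy = pd.DataFrame({
    "Category": ["ASSAULT", "ARSON"],
    "count":    [5, 3],
    "share":    [0.6, 0.4],
})

# Explicit copy: subset is now independent of toy
subset = toy[toy["Category"].isin(["ASSAULT"])].copy()

# drop returns a new frame; the original toy keeps all its columns
subset = subset.drop(columns=["count"])
print(subset)
```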


In [33]:
df_filtered

Unnamed: 0,Category,hoursOfday,share
24,ASSAULT,0,0.055468
25,ASSAULT,1,0.049745
26,ASSAULT,2,0.044837
27,ASSAULT,3,0.023267
28,ASSAULT,4,0.014025
...,...,...,...
850,WEAPON LAWS,19,0.065000
851,WEAPON LAWS,20,0.054413
852,WEAPON LAWS,21,0.052851
853,WEAPON LAWS,22,0.062918


In [58]:
df_filtered["share"][0:24]  # for the first crime 

24    0.055468
25    0.049745
26    0.044837
27    0.023267
28    0.014025
29    0.011857
30    0.015573
31    0.022212
32    0.033324
33    0.035732
34    0.040480
35    0.042750
36    0.053690
37    0.046454
38    0.046810
39    0.052371
40    0.052371
41    0.053232
42    0.051626
43    0.052177
44    0.051167
45    0.052589
46    0.049837
47    0.048404
Name: share, dtype: float64

In [46]:
df_filtered["Category"].unique()

array(['ASSAULT', 'BURGLARY', 'DISORDERLY CONDUCT',
       'DRIVING UNDER THE INFLUENCE', 'DRUG/NARCOTIC', 'DRUNKENNESS',
       'LARCENY/THEFT', 'PROSTITUTION', 'ROBBERY', 'STOLEN PROPERTY',
       'TRESPASS', 'VANDALISM', 'VEHICLE THEFT', 'WEAPON LAWS'],
      dtype=object)

In [43]:
# create an empty dataframe with the column names and the number of rows that I want
df_new = pd.DataFrame(columns=sorted(focuscrimes), index=range(0, 24))

In [48]:
df_new = df_new.sort_index(axis=1)  # put the columns in alphabetical order to match the df above

In [49]:
df_new

Unnamed: 0,ASSAULT,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,LARCENY/THEFT,PROSTITUTION,ROBBERY,STOLEN PROPERTY,TRESPASS,VANDALISM,VEHICLE THEFT,WEAPON LAWS
0,,,,,,,,,,,,,,
1,,,,,,,,,,,,,,
2,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,
5,,,,,,,,,,,,,,
6,,,,,,,,,,,,,,
7,,,,,,,,,,,,,,
8,,,,,,,,,,,,,,
9,,,,,,,,,,,,,,


In [141]:
df_new.columns[0] # to grab each column

'ASSAULT'

In [143]:
df_new.columns[13]

'WEAPON LAWS'

In [None]:
df_filtered["share"][0:24] # the values for the first crime

In [None]:
df_filtered["share"][24:48]  # the values for the second crime

In [139]:
# FILL ALL THE COLUMNS OF THE NEW MATRIX (ONEBYONE) ACCORDING TO THE COLUMNS OF THE df_filtered_dataframe

j = 0
k = 24

for i in range(14):
    df_new[df_new.columns[i]] = df_filtered["share"][j:k].values
    j += 24
    k += 24
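
The loop above relies on every category having exactly 24 rows, sorted by hour. As an alternative sketch (toy data; the three-hour range is only for brevity), `DataFrame.pivot` reshapes the long table into one column per category, and `reindex` plus `fillna` covers hours for which a category has no row.

```python
import pandas as pd

# Made-up long-format shares; ROBBERY has no row for hour 1
long_df = pd.DataFrame({
    "Category":   ["ASSAULT", "ASSAULT", "ASSAULT", "ROBBERY", "ROBBERY"],
    "hoursOfday": [0, 1, 2, 0, 2],
    "share":      [0.5, 0.3, 0.2, 0.6, 0.4],
})

wide = (
    long_df.pivot(index="hoursOfday", columns="Category", values="share")
    .reindex(range(3))  # use range(24) for real hours
    .fillna(0.0)        # hours without any incidents get a share of 0
)
print(wide)
```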

In [145]:
df_new # this is the dataframe in the form the exercise asks for

Unnamed: 0,ASSAULT,BURGLARY,DISORDERLY CONDUCT,DRIVING UNDER THE INFLUENCE,DRUG/NARCOTIC,DRUNKENNESS,LARCENY/THEFT,PROSTITUTION,ROBBERY,STOLEN PROPERTY,TRESPASS,VANDALISM,VEHICLE THEFT,WEAPON LAWS
0,0.055468,0.040191,0.052282,0.121869,0.035064,0.080276,0.039479,0.129656,0.056201,0.044247,0.027969,0.054945,0.035913,0.054413
1,0.049745,0.027653,0.038354,0.114539,0.020654,0.077235,0.025431,0.095748,0.060538,0.033786,0.021191,0.038576,0.024113,0.039486
2,0.044837,0.031432,0.032569,0.098656,0.016746,0.07014,0.015607,0.060436,0.061111,0.029686,0.025391,0.035994,0.018234,0.032891
3,0.023267,0.032765,0.018642,0.047954,0.012489,0.027367,0.009971,0.036367,0.037957,0.023325,0.021382,0.026022,0.011841,0.022737
4,0.014025,0.029379,0.014999,0.01741,0.009279,0.014393,0.006543,0.019501,0.023943,0.020356,0.015559,0.017797,0.010011,0.016662
5,0.011857,0.025644,0.06021,0.01069,0.005284,0.005068,0.006631,0.01019,0.019642,0.016398,0.038851,0.01467,0.009991,0.007377
6,0.015573,0.022892,0.119777,0.012523,0.010982,0.009122,0.010058,0.008433,0.017849,0.017246,0.075601,0.017059,0.015274,0.011542
7,0.022212,0.032284,0.101564,0.008552,0.024759,0.020069,0.015137,0.005095,0.016559,0.024314,0.07436,0.021645,0.022757,0.021609
8,0.033324,0.048688,0.074566,0.008247,0.0324,0.017231,0.025859,0.004919,0.019749,0.02799,0.064147,0.031377,0.032932,0.024212
9,0.035732,0.044997,0.053139,0.013134,0.038252,0.021488,0.031942,0.002811,0.02233,0.031807,0.05651,0.029629,0.033159,0.031849


In [146]:
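# insert the hour column again because the plot needs it
# (strings '1'..'24', so the values match the categorical x_range factors; '1' is the first hour of the day)
df_new.insert(loc=0, column='hour', value=[str(h) for h in range(1, 25)])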

VISUALIZATION

In [148]:
import bokeh
from bokeh.models import ColumnDataSource, FactorRange
from bokeh.plotting import figure, show
from bokeh.models import Legend
from bokeh.palettes import Category20 as Palette

In [149]:
source = ColumnDataSource(df_new)

In [150]:
hours = ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','21','22','23','24']

In [151]:

# create the figure frame etc.
p = figure(width=1000, y_range=(0, 0.2), x_range=FactorRange(factors=hours),  # hours comes from the list above
           tools="", toolbar_location=None, title="Crimes per hour")

bar = {}  # to store vbars
items = []  # (label, [renderer]) pairs for the custom legend
colors = Palette[len(focuscrimes)]  # so that each category gets a different color

# create a vbar for each focus crime
for indx, i in enumerate(focuscrimes):
    # x='hour' is the hour column of the dataframe; top=i takes the bar heights from column i
    # source=source pulls the data from my dataframe
    # muted_alpha=0 is the transparency of a muted bar (0 = invisible); change it to see how it looks
    # muted takes True or False and sets whether a bar starts out muted; test it with False
    bar[i] = p.vbar(x='hour', top=i, source=source, color=colors[indx],
                    muted_alpha=0., muted=True)
    items.append((i, [bar[i]]))

legend = Legend(items=items)
p.add_layout(legend, 'right')
p.legend.click_policy = "mute"  # assigns the click policy (you can also try "hide")
show(p)  # displays your plot

## Part 3: Narrative Dataviz

Let's finish up with some reading

*Reading*: [Narrative Visualization: Telling Stories with Data](http://vis.stanford.edu/files/2010-Narrative-InfoVis.pdf) by Edward Segel and Jeffrey Heer. We'll read sections 1-3 today. (And the rest a bit later).

When you get to section 3 it's fun to open up the examples mentioned by the authors in a browser and explore them as you read the text. 

> *Exercise*: Answer a couple of questions about the paper.
> 
> * What is the *Oxford English Dictionary's* definition of a narrative?
> * What is your favorite visualization among the examples in section 3? Explain why in a few words.