# Interactive Plotting w/ Bokeh

Today we'll be going over how to do some more advanced things with interactive plots. 

In [1]:
import pandas as pd
import chardet
from config import gmaps_key
from pathlib import Path
from bokeh.io import output_notebook, show
from bokeh.plotting import figure, gmap
from bokeh import events
from bokeh.models import CustomJS, Div, Button, GMapOptions, Dropdown, ColumnDataSource, HoverTool
from bokeh.layouts import column, row
output_notebook()

data_path = Path.cwd() / 'data/boston_crime.csv'

## Character Encodings
Depending on the application that created a CSV and where it was created, the file may use a different encoding. For example, older Excel versions on Windows often saved files in the encoding Python calls 'cp1252', and other countries have different encodings based on their local language. This has largely gone by the wayside in modern work since everything should be UTF-8, but you will still see non-standard encodings every so often.

[Here are the encodings read_csv supports.](https://docs.python.org/3/library/codecs.html#standard-encodings)
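
To see why the encoding matters, here is a small sketch using only the standard library. The word "café" is just an illustrative string, not something from the dataset. The same bytes decode cleanly under one codec and fail or garble under another.

```python
# Encode a string the way an older Windows tool might have saved it.
raw = "café".encode("cp1252")
print(raw)                      # b'caf\xe9'

# latin-1 and cp1252 agree on this byte, so decoding works.
print(raw.decode("latin-1"))    # café

# The lone byte 0xE9 is not valid UTF-8, so a strict decode fails.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 failed:", err.reason)

# The reverse mistake does not raise; it silently produces garbled text.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)                  # cafÃ©
```

The last case is the one to watch for: reading with the wrong encoding often produces no error at all, only odd characters in the data.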

The best way of telling what type of encoding you're dealing with is to make sure that the person giving you the data puts it in UTF-8. That's not always possible, and for those times there's chardet. Chardet is a Python library that uses statistical heuristics to guess what kind of encoding a file has. It takes a long time on large files and isn't always accurate, but it's usually worth a try.

I don't recommend that you run the code chunk below. It takes forever, because this dataset is enormous. The output does state that the encoding is 'latin-1', which is correct. 

In [None]:
def detect_encoding(data_path):
    """
    detect_encoding()
    Takes in a Path object and prints the predicted encoding and confidence.
    
    Gets: data_path, a Path object
    Returns: nothing
    """
    with open(data_path, 'rb') as read_file:
        print(chardet.detect(read_file.read()))
        
        
detect_encoding(data_path)
# results in 'latin-1' but takes forever
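
If the full read is too slow, one option is to hand chardet only the start of the file. This is a sketch, not a guarantee: `read_head`, `detect_encoding_sample`, and the 100,000-byte default are hypothetical choices, and a sample that happens to contain only ASCII characters can give a different answer than the whole file would.

```python
from pathlib import Path


def read_head(data_path, n_bytes=100_000):
    """
    read_head(data_path, n_bytes)
    Reads at most n_bytes raw bytes from the start of a file.

    Gets: data_path, a Path object; n_bytes, an int
    Returns: a bytes object
    """
    with open(data_path, 'rb') as read_file:
        return read_file.read(n_bytes)


def detect_encoding_sample(data_path, n_bytes=100_000):
    """
    detect_encoding_sample(data_path, n_bytes)
    Runs chardet on only the first n_bytes of a file.

    Gets: data_path, a Path object; n_bytes, an int
    Returns: chardet's result dict (includes 'encoding' and 'confidence')
    """
    # Imported here so read_head() still works where chardet isn't installed.
    import chardet
    return chardet.detect(read_head(data_path, n_bytes))
```

Calling `detect_encoding_sample(data_path)` would then stand in for `detect_encoding(data_path)`. Treat a low confidence value as a sign to try a larger sample.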

## Comments
The best way to write comments in Python is, in my opinion and in the opinion of PEP 8, to use docstrings. The comments I've placed at the top of these functions are examples of docstrings.

For more on docstrings, [see here](https://www.python.org/dev/peps/pep-0257/).

There are of course other places you might want to comment. For instance, it probably makes sense to write a one-line comment about a particularly complicated bit of logic if you think you won't understand it later.
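
As a small illustration of both kinds of comment, here is a sketch with a hypothetical helper, `in_boston_box`, that is not part of the notebook. The docstring says what the function is for, and the one-line comment explains the only non-obvious line. Unlike a `#` comment, the docstring is also available at runtime through `help()` or `__doc__`.

```python
def in_boston_box(lat, long):
    """
    in_boston_box(lat, long)
    Checks whether a coordinate falls inside a rough box around Boston.

    Gets: lat and long, floats in decimal degrees
    Returns: True or False
    """
    # Chained comparisons read like the math: 41 < lat < 43.
    return 41 < lat < 43 and -73 < long < -69


print(in_boston_box.__doc__)              # prints the docstring text
print(in_boston_box(42.36, -71.06))       # True
```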

In [7]:
def import_data(data_path):
    """
    import_data(data_path)
    Receives a Path object and uses that to read in a csv
    and return it. Currently hardcoding encoding because this
    will only be used for one csv.
    
    Gets: data_path, a Path object
    Returns: a Pandas Dataframe
    """
    return pd.read_csv(data_path, encoding='latin-1', low_memory = False)
    
    
df = import_data(data_path)

In [8]:
print(df.shape)
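# Note: df.head without parentheses prints the bound method (and with it the
# whole frame's repr), as seen below; call df.head() for the first five rows only.
print(df.head)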

(327820, 17)
<bound method NDFrame.head of        INCIDENT_NUMBER  OFFENSE_CODE  OFFENSE_CODE_GROUP  \
0           I182080058          2403  Disorderly Conduct   
1           I182080053          3201       Property Lost   
2           I182080052          2647               Other   
3           I182080051           413  Aggravated Assault   
4           I182080050          3122            Aircraft   
...                ...           ...                 ...   
327815   I050310906-00          3125     Warrant Arrests   
327816   I030217815-08           111            Homicide   
327817   I030217815-08          3125     Warrant Arrests   
327818   I010370257-00          3125     Warrant Arrests   
327819       142052550          3125     Warrant Arrests   

                        OFFENSE_DESCRIPTION DISTRICT REPORTING_AREA SHOOTING  \
0                      DISTURBING THE PEACE      E18            495      NaN   
1                           PROPERTY - LOST      D14            795      NaN

We have an enormous amount of data. Let's pare it down to:
1. Rows where the latitude and longitude aren't null
2. Rows where the latitude and longitude make sense for Boston
3. A random sample of 1/20 of the rows that remain

In [9]:
def subset(df):
    """
    subset(df)
    Subsets the dataset by keeping only:
    1. Rows where the latitude and longitude aren't null
    2. Rows where the latitude and longitude make sense for Boston
    3. A random sample of 1/20 of the rows that remain
    
    Gets: df, a Pandas dataframe
    Returns: df, a Pandas dataframe
    """
    df = df[(df['Lat'].notnull()) & (df['Long'].notnull())]
    df = df[(df['Lat'] > 41) & (df['Lat'] < 43)]
    df = df[(df['Long'] > -73) & (df['Long'] < -69)]
    df = df.sample(frac=.05, axis = 'index')
    return df

df = subset(df)
print(df.shape)

(15321, 17)
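
Because `sample` draws a different set of rows on every run, the plot will change each time the notebook is re-executed. A sketch of a repeatable variant follows. `subset_reproducible`, the `seed` argument, and the tiny made-up frame are illustrative and not part of the notebook. It also uses `Series.between`, which is inclusive at both ends (the original uses strict `<` and `>`) and is already False for NaN, so the null check comes for free.

```python
import pandas as pd


def subset_reproducible(df, frac=0.05, seed=0):
    """
    subset_reproducible(df, frac, seed)
    Keeps rows inside the Boston bounding box, then draws a
    repeatable random sample of them.

    Gets: df, a Pandas dataframe with 'Lat' and 'Long' columns
    Returns: a Pandas dataframe
    """
    # between() is False for NaN, so null coordinates drop out here too.
    in_box = df['Lat'].between(41, 43) & df['Long'].between(-73, -69)
    # A fixed random_state makes sample() return the same rows every run.
    return df[in_box].sample(frac=frac, random_state=seed)


# Made-up data: 40 plausible points, one missing value, one placeholder.
toy = pd.DataFrame({
    'Lat': [42.3] * 40 + [float('nan'), -1.0],
    'Long': [-71.1] * 40 + [-71.1, -1.0],
})
first = subset_reproducible(toy)
second = subset_reproducible(toy)
print(len(first))                           # 2
print(first.index.equals(second.index))     # True
```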


## Bokeh Plot

Finally, our Bokeh plot. 

We first turn our dataframe into a ColumnDataSource, which gives Bokeh named columns to work with. We don't strictly have to, but without it we wouldn't be able to reference columns like `@OFFENSE_DESCRIPTION` in tooltips, for example.

We then set our Google Map options. The latitude and longitude are where the map is centered, then we have some options about the type of map we want (e.g. roadmap, satellite, or terrain) and the amount of zoom.

We're using the gmap plot from Bokeh, which implicitly uses Google's JavaScript Maps API. To use it, you need a Google API key. They don't charge for a small amount of use. Look [here](https://docs.bokeh.org/en/latest/docs/user_guide/geo.html) for more on this. You might notice that one of those examples looks a lot like the plot below. That's not a coincidence. I'm importing my key from a config file that I have not uploaded to git, so running the code below on your machine will not work for you. If you want to use this plot style, please talk to me about it beforehand, because it could be rather bad if you upload your API key to GitHub (even though your repos are private).
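
One common alternative to a `config.py` file is to read the key from an environment variable, which keeps it out of the repository entirely. The sketch below assumes a variable named `GMAPS_API_KEY`; that name and the `load_gmaps_key` helper are hypothetical, not something Bokeh or Google requires.

```python
import os


def load_gmaps_key(var_name="GMAPS_API_KEY"):
    """
    load_gmaps_key(var_name)
    Reads a Google Maps API key from an environment variable.

    Gets: var_name, a string naming the environment variable
    Returns: the key as a string
    """
    key = os.environ.get(var_name)
    if not key:
        # Fail early with a clear message instead of a blank map later.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With that in place, `gmaps_key = load_gmaps_key()` could replace the `from config import gmaps_key` line. If you do keep a `config.py`, listing it in `.gitignore` serves the same purpose.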

We also have a little selection tool. We could use this to subset our data, but that would require more JavaScript than we want you to write for your assignment. For now, it just colors points based on the selection.

In [11]:
source = ColumnDataSource(df)

map_options = GMapOptions(lat=42.359955, lng=-71.059886, map_type="roadmap", zoom=11)
tooltips = [
    ("Date", "@OCCURRED_ON_DATE"),
    ("Offense Description", "@OFFENSE_DESCRIPTION"),
]


p = gmap(gmaps_key, title="Boston Crime", map_options=map_options, tools="box_select")
p.circle('Long', 'Lat', size=2, fill_alpha=0.6, line_color=None, source=source)
div = Div(width=400)
layout = column(row(p, div))
p.add_tools(HoverTool(tooltips=tooltips))

p.js_on_event(events.SelectionGeometry, CustomJS(args=dict(div=div), code="""
div.text = "Selection! <p> <p>" + JSON.stringify(cb_obj.geometry, undefined, 2);
"""))

show(layout)
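
If you don't have an API key, the hover and selection tools can still be tried on an ordinary `figure` with no map underneath. This is a rough sketch on four made-up points near the map center above (the dates are invented; the descriptions are borrowed from the printed output earlier). It swaps `gmap` for `figure` and uses `scatter`, so it is not the plot above, only the same tooltip setup.

```python
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Made-up rows that reuse the column names from the crime data.
source = ColumnDataSource(data={
    'Lat': [42.36, 42.35, 42.37, 42.34],
    'Long': [-71.06, -71.07, -71.05, -71.08],
    'OCCURRED_ON_DATE': ['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04'],
    'OFFENSE_DESCRIPTION': ['DISTURBING THE PEACE', 'PROPERTY - LOST',
                            'DISTURBING THE PEACE', 'PROPERTY - LOST'],
})

tooltips = [
    ("Date", "@OCCURRED_ON_DATE"),
    ("Offense Description", "@OFFENSE_DESCRIPTION"),
]

# A plain figure needs no key; the axes are just longitude and latitude.
p = figure(title="Boston Crime (no basemap)", tools="box_select",
           x_axis_label="Long", y_axis_label="Lat")
p.scatter('Long', 'Lat', size=8, fill_alpha=0.6, line_color=None, source=source)
p.add_tools(HoverTool(tooltips=tooltips))
```

Calling `show(p)` after `output_notebook()` renders it in the notebook.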