# Data Preparation

Run this notebook to (1) parse the CSV into a Pandas DataFrame, (2) pickle the DataFrame and store it on disk, and (3) create bounding boxes and store the pickled version on disk.

The pickled objects will be used by the Web Application.

(Note: at some point we may want to consider using Dask DataFrame all the way and bypassing Pandas entirely. But for now let's keep things simple.)

In [1]:
import pandas as pd
import pickle
from scripts import utils

In [2]:
df = pd.read_csv('../data/dftRoadSafety_Accidents_2016.csv', usecols=["Accident_Index", "Longitude", "Latitude"])

In [3]:
df.shape

(136621, 3)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 136621 entries, 0 to 136620
Data columns (total 3 columns):
Accident_Index    136621 non-null object
Longitude         136614 non-null float64
Latitude          136614 non-null float64
dtypes: float64(2), object(1)
memory usage: 3.1+ MB


In [5]:
# drop the rows with a missing Longitude or Latitude
df = df.dropna()

In [6]:
df.shape

(136614, 3)
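The seven rows removed here are the ones without coordinates, since `Longitude` and `Latitude` are the only columns with nulls in the `df.info()` output above. A bare `dropna()` drops a row when any column is null, so it could discard more than intended if further columns are loaded later. A minimal sketch of restricting the check to the coordinate columns, using a small made-up frame (the `toy` data is hypothetical, not taken from the CSV):

```python
import pandas as pd

# Toy stand-in for the accidents frame (values are made up for illustration).
toy = pd.DataFrame({
    "Accident_Index": ["A1", "A2", "A3", "A4"],
    "Longitude": [-0.279, None, 0.185, -0.474],
    "Latitude": [51.585, 51.450, None, 51.544],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Drop only the rows that lack a coordinate.
cleaned = toy.dropna(subset=["Longitude", "Latitude"])

print(missing_per_column.to_dict())
print(len(toy) - len(cleaned), "rows dropped")
```

With only three columns loaded, as in this notebook, `dropna()` and `dropna(subset=[...])` should behave the same; the explicit form simply documents the intent.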

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136614 entries, 0 to 136620
Data columns (total 3 columns):
Accident_Index    136614 non-null object
Longitude         136614 non-null float64
Latitude          136614 non-null float64
dtypes: float64(2), object(1)
memory usage: 4.2+ MB


In [8]:
# double check - just to make sure we don't have duplicate `Accident_Index` values
df["Accident_Index"].nunique()

136614
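The unique count matches the row count, so there are no repeated identifiers. The same check can be made explicit instead of compared by eye. A small sketch on made-up identifiers (the `ids` series is hypothetical), using `Series.is_unique` and `Series.duplicated`:

```python
import pandas as pd

# Made-up identifiers with one deliberate repeat.
ids = pd.Series(["A1", "A2", "A2", "A3"], name="Accident_Index")

has_duplicates = not ids.is_unique

# keep=False marks every member of a duplicated group, not just the later repeats.
repeated = ids[ids.duplicated(keep=False)].unique().tolist()

print(has_duplicates, repeated)
```

In this notebook, a line such as `assert df["Accident_Index"].is_unique` would fail loudly if a later data refresh introduced repeats.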

In [9]:
# add the Web Mercator `webm_x` and `webm_y` columns from our Longitude and Latitude values
df = utils.add_webm_xys(df)

In [10]:
df.head()

| | Accident_Index | Longitude | Latitude | webm_x | webm_y |
|---|---|---|---|---|---|
| 0 | 2016010000005 | -0.279323 | 51.584754 | -31094.094127 | 6725389.0 |
| 1 | 2016010000006 | 0.184928 | 51.449595 | 20586.090793 | 6701211.0 |
| 2 | 2016010000008 | -0.473837 | 51.543563 | -52747.293559 | 6718013.0 |
| 3 | 2016010000016 | -0.164442 | 51.404958 | -18305.599705 | 6693241.0 |
| 4 | 2016010000018 | -0.40658 | 51.483139 | -45260.278567 | 6707205.0 |
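The source of `utils.add_webm_xys` is not shown in this notebook, so the following is only a sketch of the standard spherical Web Mercator (EPSG:3857) projection that the `webm_x` / `webm_y` values appear to follow. The function name `lonlat_to_webm` and the constant name are made up for illustration.

```python
import math

# WGS84 semi-major axis in metres, the sphere radius used by Web Mercator.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_webm(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator metres.

    Hypothetical helper: an assumption about what utils.add_webm_xys
    computes per row, not a copy of it.
    """
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# Coordinates of the first row shown by df.head().
x, y = lonlat_to_webm(-0.279323, 51.584754)
print(round(x, 6), round(y))
```

The projection is undefined at the poles, which is not a concern for UK coordinates.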


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136614 entries, 0 to 136620
Data columns (total 5 columns):
Accident_Index    136614 non-null object
Longitude         136614 non-null float64
Latitude          136614 non-null float64
webm_x            136614 non-null float64
webm_y            136614 non-null float64
dtypes: float64(4), object(1)
memory usage: 6.3+ MB


In [12]:
# Parquet alternative, currently disabled (to_parquet takes the output path as its first argument):
# df.to_parquet('/Users/johnny/repos/bokeh-app-uk-road-accidents-viz/data')
df.to_pickle("../data/dftRoadSafety_Accidents_2016_tiny.pkl")

In [13]:
# check that we can load the DataFrame back from disk
usecols = ["Accident_Index", "Longitude", "Latitude", "webm_x", "webm_y"]
df2 = pd.read_pickle('../data/dftRoadSafety_Accidents_2016_tiny.pkl')[usecols]

In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 136614 entries, 0 to 136620
Data columns (total 5 columns):
Accident_Index    136614 non-null object
Longitude         136614 non-null float64
Latitude          136614 non-null float64
webm_x            136614 non-null float64
webm_y            136614 non-null float64
dtypes: float64(4), object(1)
memory usage: 6.3+ MB


Looks good!
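Comparing the two `info()` summaries confirms the shape and dtypes, but not the values. A stricter round-trip check could look like the sketch below. It uses a made-up two-row frame and a temporary directory rather than the real `../data` file; `assert_frame_equal` comes from `pandas.testing`.

```python
import os
import tempfile

import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up frame with a non-contiguous index, like the one dropna() leaves behind.
original = pd.DataFrame(
    {
        "Accident_Index": ["A1", "A4"],
        "Longitude": [-0.279, -0.474],
        "Latitude": [51.585, 51.544],
    },
    index=[0, 3],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    original.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises AssertionError if values, dtypes, index or column order differ.
assert_frame_equal(original, restored)
round_trip_ok = original.equals(restored)
print(round_trip_ok)
```

Applied to this notebook, the equivalent would be `assert_frame_equal(df, df2)` right after loading `df2`.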

In [15]:
# each entry is ((lon_min, lon_max), (lat_min, lat_max)) in degrees
bboxes = {
  "gb": ((-15.381, 7.251), (48.749, 61.502)),
  "gb_mainland": ( (-12.129, 5.120), (49.710, 58.745)),
  "gb_long": ((-8.745, 2.241), (48.749, 61.502)),
  "gb_wide": ((-21.709, 15.293), (48.749, 61.502)),
  "london": ((-0.643, 0.434), (51.200, 51.761)),
  "london_2": ((-0.1696, 0.0130), (51.4546, 51.5519)),
  "london_3": ((-0.1330, -0.0235), (51.4741, 51.5322)),  
  "manchester": ((-3.049, -1.505), (52.975, 53.865))
}
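Judging by the values, each bounding box appears to be stored as `((lon_min, lon_max), (lat_min, lat_max))` in degrees. Under that assumption, a point-in-box test is a pair of range checks. The helper `in_bbox` below is hypothetical and only illustrates how the Web Application might consume these tuples.

```python
# Assumed layout: ((lon_min, lon_max), (lat_min, lat_max)), in degrees.
london = ((-0.643, 0.434), (51.200, 51.761))
manchester = ((-3.049, -1.505), (52.975, 53.865))


def in_bbox(lon, lat, bbox):
    """Return True if the point lies inside the box (edges included)."""
    (lon_min, lon_max), (lat_min, lat_max) = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max


# Two coordinate pairs from the df.head() output, plus one made-up point near Manchester.
points = [(-0.279323, 51.584754), (0.184928, 51.449595), (-2.244, 53.483)]
inside_london = [p for p in points if in_bbox(p[0], p[1], london)]
print(inside_london)
```

On the DataFrame itself, the same filter could be written with `df["Longitude"].between(lon_min, lon_max) & df["Latitude"].between(lat_min, lat_max)`.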

In [16]:
with open('../data/bboxes.pkl', 'wb') as handle:
    pickle.dump(bboxes, handle, protocol=pickle.HIGHEST_PROTOCOL)

In [18]:
# check that we can load the dict back from disk
with open("../data/bboxes.pkl", "rb") as handle:
    bboxes2 = pickle.load(handle)

In [19]:
bboxes2

{'gb': ((-15.381, 7.251), (48.749, 61.502)),
 'gb_long': ((-8.745, 2.241), (48.749, 61.502)),
 'gb_mainland': ((-12.129, 5.12), (49.71, 58.745)),
 'gb_wide': ((-21.709, 15.293), (48.749, 61.502)),
 'london': ((-0.643, 0.434), (51.2, 51.761)),
 'london_2': ((-0.1696, 0.013), (51.4546, 51.5519)),
 'london_3': ((-0.133, -0.0235), (51.4741, 51.5322)),
 'manchester': ((-3.049, -1.505), (52.975, 53.865))}

In [20]:
type(bboxes2)

dict
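Checking the type shows that a dict came back, and comparing it with the original goes one step further. A self-contained sketch of the dump/load round trip, using a temporary directory and a two-entry subset of the boxes so that it does not touch `../data`:

```python
import os
import pickle
import tempfile

bboxes = {
    "london": ((-0.643, 0.434), (51.200, 51.761)),
    "manchester": ((-3.049, -1.505), (52.975, 53.865)),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bboxes.pkl")
    # Context managers close the file handles even if pickling fails.
    with open(path, "wb") as handle:
        pickle.dump(bboxes, handle, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# Dict equality compares keys and nested tuples by value.
same_content = restored == bboxes
print(same_content)
```

In the notebook itself, the equivalent one-liner would be `assert bboxes2 == bboxes`.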