## Quantitative Methods 2:  Data Science and Visualisation

## Workshop 8: Working with Spatiotemporal Data
In this workshop, we will work with data that contains information about space and time, and show different ways of presenting this data, with the goal of producing fully-fledged maps.

### Aims:

- Plot and summarise spatial data
- Create simple point maps
- Understand the basics of projection

In [1]:
#install geopandas....

## Downloading the Data
Let's grab the data we will need this week from our course website and save it into our data folder. If you've not already created a data folder then do so using the following command. 

Don't worry if it generates an error; that just means you've already got a data folder.

In [2]:
!mkdir data

mkdir: cannot create directory 'data': File exists


In [2]:
!mkdir data/wk8
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk8/tweet_data.csv -o ./data/wk8/tweet_data.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  156k  100  156k    0     0   998k      0 --:--:-- --:--:-- --:--:--  998k
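
If the shell commands above are not available on your machine, the same two steps can be sketched in plain Python. This is a minimal sketch using only the standard library; the helper names `ensure_dir` and `download` are made up for illustration and are not part of the course materials.

```python
import os
import urllib.request

DATA_URL = "https://s3.eu-west-2.amazonaws.com/qm2/wk8/tweet_data.csv"
DATA_PATH = os.path.join("data", "wk8", "tweet_data.csv")


def ensure_dir(path):
    """Create a folder (and any parent folders); do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path


def download(url, target):
    """Save the file at `url` to `target`, creating the folder first."""
    ensure_dir(os.path.dirname(target))
    urllib.request.urlretrieve(url, target)
    return target
```

Calling `download(DATA_URL, DATA_PATH)` then mirrors the `mkdir` and `curl` cells, without the error message when the folder already exists.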


`------------------------------`

In [None]:
import pandas
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline
plt.style.use('ggplot')
# Default figure size (width, height) in inches
plt.rcParams['figure.figsize'] = (10, 8)

## Point and Areal Data

We're going to look at some *point data*, data which has a spatial location but not an extent - this can be contrasted with *areal data*, where data is reported or represented as covering, or relating to, a specific geographical region. An example of this second category would be the released census geography - which, as we saw earlier in the term, is reported on a bespoke areal unit called the Output Area.

Today we will look at point data. For this purpose, we'll be looking at some data from Twitter - data which has a detailed spatial position as well as a time and date.

## The birds sing a pretty song

Let's start by loading in the Twitter data and running head() to take a look at what the dataset contains. The dataset has information about tweeters but not the content of the tweets:

In [None]:
data_path = "./data/wk8/tweet_data.csv"

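# DataFrame.from_csv has been removed from pandas; read_csv replaces it.
# The first column becomes the index and 'dateT' is parsed as a datetime.
tweets = pandas.read_csv(data_path, index_col=0, parse_dates=['dateT'])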
tweets.head()

## Exercise
What does each data column represent?

## Working with datetime data
Datetime data is a little trickier to work with - it has structure which allows the extraction of hours, minutes, seconds, and so on. For example, we can just take the time part: 

In [None]:
tweets['dateT'].dt.time.head()
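
To see what the `.dt` accessor gives us without loading the full file, here is a small sketch on made-up rows. The column names `dateT`, `Lat` and `Lon` follow the workshop data, but the `id` column, the timestamps and the coordinates are invented for illustration.

```python
import io
import pandas

# Made-up rows for illustration only; the real file has more columns.
csv_text = """id,dateT,Lat,Lon
0,2014-06-01 12:03:15,51.50,-0.12
1,2014-06-01 12:17:42,51.52,-0.10
2,2014-06-01 12:58:01,51.49,-0.14
"""

# parse_dates turns the text timestamps into proper datetime values
demo = pandas.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["dateT"])

# Each component of the timestamp is available through the .dt accessor
hours = demo["dateT"].dt.hour
minutes = demo["dateT"].dt.minute
seconds = demo["dateT"].dt.second
print(minutes.tolist())
```

The same pattern works on `tweets['dateT']`, and the result can be stored as a new column in the usual way.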

## Exercise
What temporal extent does the data cover? How do we need to structure our approach?

## Exercise
Create a new column in the dataframe which stores the "minute" component of the timestamp, and use it to create a histogram of the data over the course of an hour in five minute intervals. Make sure that your graph includes a title and labelled axes.

You can also execute this code in one line if you don't mind seeing a lot of dots - I've not included the parameters to give this some polish - we will leave that as an exercise.

In [None]:
tweets['dateT'].dt.minute.hist()

## My First Map
The beauty of this data is that the data points have x and y values, so plotting them as a scatter graph will give us our first approximation (with caveats) of a map of the data:

In [None]:
tweets.plot(
    kind='scatter',
    x='Lon',
    y='Lat',
    title="Location of Tweets")
plt.xlabel("Longitude [degrees]")
plt.ylabel("Latitude [degrees]")

Notice that we've laid out the code so it's easier to see the multiple arguments in plot() - this is just the same as:

In [None]:
tweets.plot(kind='scatter',x='Lon',y='Lat', title="Location of Tweets")
plt.xlabel("Longitude [degrees]")
plt.ylabel("Latitude [degrees]")

## Exercise
Change the style of the above map using the optional arguments:

- *alpha=*: to set the opacity - 1 being opaque and 0 being transparent. Set the transparency so that you can see busy areas *and* individual points
- *color=*: to set the colo*u*r to 'red'
- *s=*: to set the point size to 50

So we have something that looks a bit like a heat map, and even looks a bit Gaussian. Let's see what a histogram of this data looks like:

In [None]:
tweets['Lat'].hist(bins = 50)
plt.xlabel("Latitude [degrees]")
plt.ylabel("Number of Tweets")

In [None]:
ax = tweets['Lon'].hist(bins=50)
plt.xlabel("Longitude [degrees]")
ax.set_ylabel("Number of Tweets")

## The Naivest Projection

The above tweet plot implicitly uses the *equirectangular* projection, which maps longitude onto the x axis, and latitude onto the y axis. What is the problem with this?

Projection is hugely complex and mathematically fiddly - luckily, we'll be working with packages which mostly do the heavy lifting for us. It's still worth thinking about projection a bit, as the process of taking points on a sphere and translating that to a flat surface is never a perfect one.

If we look at the picture below, then clearly the distance between $p$ and $q$ is the length of the curve on the sphere rather than the straight line between them. The closer $p$ and $q$ are, the closer this curve is to a straight line, and we can use a *linear mapping* - i.e. the x coordinate is a linear function of longitude, and the y coordinate is a linear function of latitude.

In [None]:
from IPython.display import Image
Image("https://s3.eu-west-2.amazonaws.com/qm2/wk8/great-circle-distance.png")
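
One way to picture a linear mapping is to convert degrees into kilometres around a fixed reference point. The sketch below is an illustration under stated assumptions, not the method used by any mapping package: the function name `local_xy_km` is hypothetical, the Earth is treated as a sphere of radius 6371 km, and the reference point (51.5, 0.0) is simply a round number near London.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical Earth)


def local_xy_km(lat, lon, lat0, lon0, radius_km=EARTH_RADIUS_KM):
    """Map (lat, lon) in degrees to local x, y in km around (lat0, lon0)."""
    # x shrinks with the cosine of the reference latitude
    x = radius_km * math.cos(math.radians(lat0)) * math.radians(lon - lon0)
    # y depends only on the difference in latitude
    y = radius_km * math.radians(lat - lat0)
    return x, y


# Ground distance covered by one degree of longitude and of latitude
x_per_deg, _ = local_xy_km(51.5, 1.0, 51.5, 0.0)
_, y_per_deg = local_xy_km(52.5, 0.0, 51.5, 0.0)
print(x_per_deg, y_per_deg)
```

At this latitude one degree of longitude covers noticeably less ground than one degree of latitude, so a plot of raw degrees with equal axis scales is stretched east-west.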

## Extension: Not-so-great circles

Calculating great circle distances, the "real" way of figuring out the distance between two points on a sphere, is fairly complex. Thankfully, there's a small angle approximation.

If the two points (1 and 2) sit at $(\phi_1,\lambda_1)$ and $(\phi_2,\lambda_2)$, given as (latitude, longitude), then let $\Delta\phi = \phi_1 - \phi_2$ and $\Delta\lambda = \lambda_1 - \lambda_2$.

If $\Delta\phi$ and $\Delta\lambda$ are small enough, you can calculate the distance $D$ with : 

$$D = R \sqrt{(\Delta\phi)^2+(\cos(\bar{\phi})\,\Delta\lambda)^2}$$


where $\bar{\phi}$ is the mean latitude of the two points, $\frac{1}{2}(\phi_1+\phi_2)$, and R is the radius of the earth. 

How small is small enough if we want to use this approximation? Well, it depends on how much error you are willing to incur. But generally, if the angles are much less than one radian, the errors will be small. Radians, you say? Yes, everything in the above equations assumes angles are expressed in radians. One radian is $180/\pi$ degrees, or about 57 degrees, and Python has a utility function for converting between the two (`math.radians`).

For reference, the errors accumulated over the size of London are tens of metres.
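
The approximation above can be sketched in a few lines and compared against a great circle calculation. This is a minimal sketch: the function names are hypothetical, the Earth radius of 6371 km is an assumed mean value, the comparison uses the haversine formula (which is not derived in this workshop), and the two test points are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def small_angle_distance_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Approximate distance from the formula above; inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi1 - phi2
    d_lambda = math.radians(lon1) - math.radians(lon2)
    mean_phi = 0.5 * (phi1 + phi2)
    return radius_km * math.sqrt(d_phi**2 + (math.cos(mean_phi) * d_lambda)**2)


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great circle distance from the haversine formula; inputs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2)**2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2)**2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Two made-up points within London
approx = small_angle_distance_km(51.50, -0.20, 51.55, 0.00)
exact = haversine_km(51.50, -0.20, 51.55, 0.00)
print(approx, exact, abs(approx - exact))
```

Note that both functions convert degrees to radians before doing any trigonometry, which is the step most easily forgotten.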

## British Values

The British National Grid provides projected values in metres, so we can get by without doing projections "on the fly" just yet. If we plot these values, it will look pretty similar, for the reasons outlined above - over the few km of London, most projection methods are quite close to the linear mapping we've done.

In [None]:
ax = tweets.plot(
    kind='scatter',
    x='OSGB_Lon',
    y='OSGB_Lat',
    title="Location of Tweets")
ax.set_xlabel("projected Longitude")
ax.set_ylabel("projected Latitude")

## Exercise: Describing the Data

1) Find the data centroid (lat, lon)

2) Calculate the x, y and total extent of the data, in km (or miles). (Use the projected [OSGB] data for that.)

Hint: use commands which capture the maximum, minimum and mean of the data - describe() is a useful one here.
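
As a hint at the mechanics only, here is a sketch of reading extents out of `describe()` on a tiny made-up table. The three coordinate pairs are invented, and taking the diagonal as the "total extent" is just one possible reading of the question.

```python
import math
import pandas

# Three made-up projected points (metres), for illustration only
demo = pandas.DataFrame({
    "OSGB_Lon": [528000.0, 530500.0, 533000.0],
    "OSGB_Lat": [179000.0, 181000.0, 184000.0],
})

summary = demo.describe()  # rows include 'mean', 'min' and 'max'

# Extent = max - min, converted from metres to kilometres
x_extent_km = (summary.loc["max", "OSGB_Lon"] - summary.loc["min", "OSGB_Lon"]) / 1000
y_extent_km = (summary.loc["max", "OSGB_Lat"] - summary.loc["min", "OSGB_Lat"]) / 1000

# One possible "total extent": the diagonal of the bounding box
diagonal_km = math.hypot(x_extent_km, y_extent_km)
print(x_extent_km, y_extent_km, diagonal_km)
```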


## Exercise

How might we go about calculating the geographical extent which contains 95% of tweets? Assuming the distribution is Gaussian in both variables, estimate a) the latitude limits which contain 95% of tweets, b) the longitude limits which contain 95% of tweets. Then c) add lines showing these limits to the tweet graph and d) save the figure as an image using plt.savefig(*filename*).

What proportion of tweets is held within this box?

How do you think the following elements influence the above result?

- The 2D nature of the data

- Asymmetry of the Gaussian (i.e. if $\sigma_x \neq \sigma_y$)

- Whether the data is Gaussian!

- What other approaches could you take with this data?

In [None]:
Image("https://s3.eu-west-2.amazonaws.com/qm2/wk8/tweets.png")
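
To explore the question about the box without touching the tweet data, one option is to simulate points that really are Gaussian and count how many fall inside the two sets of limits. This is a sketch on synthetic data: the sample size, means, standard deviations and the helper name `limits_95` are all assumptions made for illustration, and the limits use the usual mean ± 1.96 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, independent Gaussian coordinates (not the tweet data)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)
y = rng.normal(loc=0.0, scale=2.0, size=100_000)


def limits_95(values):
    """Return mean -/+ 1.96 standard deviations for a 1D array."""
    m, s = values.mean(), values.std()
    return m - 1.96 * s, m + 1.96 * s


x_lo, x_hi = limits_95(x)
y_lo, y_hi = limits_95(y)

# Fraction inside each pair of limits separately, and inside the box
inside_x = (x > x_lo) & (x < x_hi)
inside_y = (y > y_lo) & (y < y_hi)
fraction_x = inside_x.mean()
fraction_box = (inside_x & inside_y).mean()
print(fraction_x, fraction_box)
```

For independent Gaussian variables the box holds about 0.95 × 0.95 ≈ 0.90 of the points rather than 0.95, which is worth bearing in mind when comparing with the real tweets.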

## Working in 2D

It's clear that we can learn from 1D about how we can approach 2D data. But there are limitations, and treating 2D as two sets of 1D data doesn't work for everything. We need to find ways to carry out histograms and other aggregations in 2D - and the first of these is hexbinning.

## Hexbinning

We can also use a [hexbin clustering](http://pandas-docs.github.io/pandas-docs-travis/visualization.html#hexagonal-bin-plot) method, which is similar to binning in a histogram: the more points we have in a hexagon, the warmer the colour. Here, we count the number of data points in each hexagon, in the same way that we count the number of data points in each bin for 1D data.

Luckily, the code is very easy to execute, and requires only small changes:

In [None]:
ax = tweets.plot(
    kind='hexbin',
    x='OSGB_Lon', y='OSGB_Lat',
    gridsize=50,
    title="Tweet Density (Hex Bin)",
    cmap='coolwarm',
    )
plt.xlabel("projected Longitude")
plt.ylabel("projected Latitude")

Possible values for *cmap* are (the exact list depends on your version of matplotlib): Spectral, summer, coolwarm, Wistia_r, pink_r, Set1, Set2, Set3, brg_r, Dark2, prism, PuOr_r, afmhot_r, terrain_r, PuBuGn_r, RdPu, gist_ncar_r, gist_yarg_r, Dark2_r, YlGnBu, RdYlBu, hot_r, gist_rainbow_r, gist_stern, PuBu_r, cool_r, cool, gray, copper_r, Greens_r, GnBu, gist_ncar, spring_r, gist_rainbow, gist_heat_r, Wistia, OrRd_r, CMRmap, bone, gist_stern_r, RdYlGn, Pastel2_r, spring, terrain, YlOrRd_r, Set2_r, winter_r, PuBu, RdGy_r, spectral, rainbow, flag_r, jet_r, RdPu_r, gist_yarg, BuGn, Paired_r, hsv_r, bwr, cubehelix, Greens, PRGn, gist_heat, spectral_r, Paired, hsv, Oranges_r, prism_r, Pastel2, Pastel1_r, Pastel1, gray_r, jet, Spectral_r, gnuplot2_r, gist_earth, YlGnBu_r, copper, gist_earth_r, Set3_r, OrRd, gnuplot_r, ocean_r, brg, gnuplot2, PuRd_r, bone_r, BuPu, Oranges, RdYlGn_r, PiYG, CMRmap_r, YlGn, binary_r, gist_gray_r, Accent, BuPu_r, gist_gray, flag, bwr_r, RdBu_r, BrBG, Reds, Set1_r, summer_r, GnBu_r, BrBG_r, Reds_r, RdGy, PuRd, Accent_r, Blues, autumn_r, autumn, cubehelix_r, nipy_spectral_r, ocean, PRGn_r, Greys_r, pink, binary, winter, gnuplot, RdYlBu_r, hot, YlOrBr, coolwarm_r, rainbow_r, Purples_r, PiYG_r, YlGn_r, Blues_r, YlOrBr_r, seismic, Purples, seismic_r, RdBu, Greys, BuGn_r, YlOrRd, PuOr, PuBuGn, nipy_spectral, afmhot
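
The counting behind a hexbin plot is the same idea as a 2D histogram. As a rough illustration, the sketch below does the counting with square bins using `numpy.histogram2d`; this is a stand-in for hexagonal bins rather than what the hexbin plot computes internally, and the coordinates are synthetic values chosen to look vaguely like projected positions in metres.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic projected coordinates in metres (not the tweet data)
x = rng.normal(530000, 3000, size=1000)
y = rng.normal(180000, 2000, size=1000)

# Count the points falling into each cell of a 10 x 10 grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=10)

print(counts.shape)   # one count per grid cell
print(counts.sum())   # every point lands in exactly one cell
```

Each entry of `counts` plays the role of one hexagon's colour in the plot above: the larger the count, the warmer the colour.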

## Extension: Splitting Mapping Data by Time

In the next section, we use both space and time to show different geographical distributions at different times. We'll select on index, splitting the dataset in two.

In [None]:
# Split by row position: the first 750 rows, then everything after them
early = tweets.iloc[:750]
late = tweets.iloc[750:]

In [None]:
early.head()

In [None]:
late.head()
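
Splitting by row position only separates early from late tweets if the rows are sorted by time. An alternative, sketched here on made-up rows, is to select on the timestamp itself. The column names follow the workshop data, but the timestamps, coordinates and the cut-off time are invented for illustration.

```python
import pandas

# Made-up rows, deliberately not in time order
demo = pandas.DataFrame({
    "dateT": pandas.to_datetime([
        "2014-06-01 12:40", "2014-06-01 12:05",
        "2014-06-01 12:55", "2014-06-01 12:20",
    ]),
    "Lat": [51.50, 51.52, 51.49, 51.51],
    "Lon": [-0.12, -0.10, -0.14, -0.11],
})

# Hypothetical cut-off halfway through the hour
cutoff = pandas.Timestamp("2014-06-01 12:30")

# Boolean masks pick rows by time, whatever order they are stored in
early_demo = demo[demo["dateT"] < cutoff]
late_demo = demo[demo["dateT"] >= cutoff]
print(len(early_demo), len(late_demo))
```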

## Exercise
Plot both sets of tweets onto the same axes so they can be compared. Try and make your plot look like the image below.

In [None]:
Image("https://s3.eu-west-2.amazonaws.com/qm2/wk8/two_times.png")

We can visually inspect the spatial plots of the two time frames using hexbin plots; in this case there's not much to see...

In [None]:
ax = early.plot(
    kind='hexbin',
    x='OSGB_Lon', 
    y='OSGB_Lat',
    gridsize=50,
    title="Tweet Density (Hex Bin), early tweets",
    cmap='coolwarm',
    )
plt.xlabel("projected Longitude")
plt.ylabel("projected Latitude")

In [None]:
ax = late.plot(
    kind='hexbin',
    x='OSGB_Lon', 
    y='OSGB_Lat',
    gridsize=50,
    title="Tweet Density (Hex Bin), late tweets",
    cmap='coolwarm',
    )
plt.xlabel("projected Longitude")
plt.ylabel("projected Latitude")