# Geography

In the previous tutorial ([Running a query](02-running-a-query.ipynb)), you learned how to run a FlowKit query and get the result as a pandas DataFrame.

In this tutorial you will learn:

- How to choose which set of locations query results are aggregated to, using the 'aggregation_unit' query parameter,  
- How to get a query result including the geographic boundaries of the spatial areas,  
- How to use geopandas to visualise query results on a map, or save as a shapefile.  

## Spatial aggregation in FlowKit

Most FlowKit queries aggregate results to a set of geographic locations. The choice of locations is set by the `aggregation_unit` parameter of a query. In [the previous tutorial](02-running-a-query.ipynb), you got results aggregated to districts by setting `aggregation_unit="admin2"`. The available aggregation units can vary between FlowKit deployments. In the Ghana FlowCloud deployment used in these tutorials, the available aggregation units are:

- `"admin0"`: country level (i.e. the whole of Ghana)
- `"admin1"`: region level
- `"admin2"`: district level
- `"lon-lat"`: cell tower level (results include the longitude and latitude of each cell tower)

Note that when doing analysis using a FlowKit server with real CDR data, your access token may restrict which aggregation units you are allowed to use.

In this tutorial, you will learn how to associate query results with the geometries of the locations to which they are aggregated.

## Getting a query result with geography data

We start by importing flowclient, and also geopandas (which we'll use for working with geospatial data).

In [None]:
import flowclient as fc
import geopandas as gpd

Next, create a connection as we did in [tutorial 1](01-getting-started-with-flowclient.ipynb):

In [None]:
token = ""  # Paste your access token between the quotes

conn = fc.connect(
    url="https://api.flowcloud-ghana.flowminder.org",
    token=token,
)

Now define a query. We'll use the same 'unique subscriber counts' query that we used in [tutorial 2](02-running-a-query.ipynb).

In [None]:
subscriber_counts_query = fc.unique_subscriber_counts(
    connection=conn,
    start_date="2016-01-01",
    end_date="2016-01-02",
    aggregation_unit="admin2",
)

We can now get the result of this query. In the previous tutorial we got the result as a pandas DataFrame, which contained the P-code that identifies each district but did not include the geographic boundaries of the districts. This time, we will specify `format="geojson"` to get the result as a GeoJSON dictionary instead of a pandas DataFrame. GeoJSON is a data format that can represent geometric shapes, including points, lines and polygons, so the GeoJSON result will contain the geographic boundaries of the admin2 districts.

In [None]:
subscriber_counts_query_result_geojson = subscriber_counts_query.get_result(format="geojson")

This time, the query result is a dictionary instead of a DataFrame.

In [None]:
type(subscriber_counts_query_result_geojson)
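
As a rough sketch of what such a dictionary looks like, the hand-written example below follows the GeoJSON `FeatureCollection` layout: a list of features, each with a `geometry` and a `properties` dictionary. The P-codes, counts and square shapes are made up for illustration. A real FlowKit result will contain actual district boundaries and may carry additional properties.

```python
# Hand-written stand-in for a GeoJSON query result (all values are made up).
example_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # One closed ring of [longitude, latitude] pairs.
                "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]],
            },
            "properties": {"pcod": "EX0101", "value": 120},
        },
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [1.0, 1.0], [1.0, 0.0]]],
            },
            "properties": {"pcod": "EX0102", "value": 80},
        },
    ],
}

# The tabular part of the result lives in each feature's "properties".
values_by_pcod = {
    feature["properties"]["pcod"]: feature["properties"]["value"]
    for feature in example_geojson["features"]
}

# The shapes live in each feature's "geometry".
geometry_types = {feature["geometry"]["type"] for feature in example_geojson["features"]}

print(values_by_pcod)
print(geometry_types)
```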

This dictionary format is less convenient than the pandas DataFrames that we used in the previous tutorial. It is easier to work with if we load the GeoJSON result into a geopandas GeoDataFrame, which behaves like a pandas DataFrame but can also hold geometry objects. We imported the geopandas library at the beginning of this tutorial (using the name `gpd`), so we can now use `gpd.GeoDataFrame.from_features(...)` to load our GeoJSON dictionary into a GeoDataFrame:

In [None]:
subscriber_counts_query_result_geopandas = gpd.GeoDataFrame.from_features(subscriber_counts_query_result_geojson)
subscriber_counts_query_result_geopandas

We can see that, as well as the 'pcod' and 'value' columns that we saw in the previous tutorial, we now have 'geometry' and 'centroid' columns. The 'geometry' column contains the shape of each district. The 'centroid' column contains the coordinates of the point at the centre of each district; we won't use it in this tutorial.

Geopandas makes it easy for us to plot the subscriber counts as coloured polygons on a map (i.e. a choropleth map), using the `plot` method. The parameter `column="value"` means that each polygon will be coloured according to the value in the 'value' column (which is the subscriber count), and `legend=True` means that a colour bar is included next to the map.

In [None]:
subscriber_counts_query_result_geopandas.plot(column="value", legend=True)

We can also use geopandas to easily save geographic data to a file. For example, to save as a shapefile we can use the `to_file` method, and provide a filename ending with ".shp":

In [None]:
subscriber_counts_query_result_geopandas.to_file("admin2_subscriber_counts_20160101.shp")
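
If a shapefile is not needed, one alternative is to write the GeoJSON dictionary itself to disk with Python's built-in `json` module. The sketch below uses a small made-up dictionary and a hypothetical filename in place of the real query result.

```python
import json
import os
import tempfile

# Made-up stand-in for the dictionary returned by get_result(format="geojson").
example_geojson = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-1.0, 7.9]},
            "properties": {"pcod": "EX0101", "value": 120},
        }
    ],
}

# Hypothetical output path in a temporary directory; choose your own in practice.
output_path = os.path.join(tempfile.mkdtemp(), "example_subscriber_counts.geojson")

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(example_geojson, f)

# Reading the file back gives a dictionary equal to the one we saved.
with open(output_path, encoding="utf-8") as f:
    reloaded = json.load(f)

print(reloaded == example_geojson)
```

A file written this way can later be opened with `gpd.read_file`.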

## Getting geography data separately, and joining to query results

In some cases, you may want to do some further processing of query results before joining these results to the geography data. In this situation it is sometimes easier to use the flowclient `get_geography` function to get the geography data by itself.

As an example, let's calculate the average number of events per subscriber in each region. We'll do the calculation using query results as pandas DataFrames (without geography data), and then join these results to the region boundaries at the end.

### 1. Define and run queries

The average number of events per subscriber is the event count divided by the subscriber count, so we need to run two queries: a `location_event_counts` query and a `unique_subscriber_counts` query. This time we want to calculate the result per region, instead of per district, so we set the `aggregation_unit` parameter to `"admin1"` instead of `"admin2"`. Again, we'll run the queries for a single day: 1st January 2016.

In [None]:
event_counts_query = fc.location_event_counts(
    connection=conn,
    start_date="2016-01-01",
    end_date="2016-01-02",
    aggregation_unit="admin1",
    count_interval="day",
)

subscriber_counts_query = fc.unique_subscriber_counts(
    connection=conn,
    start_date="2016-01-01",
    end_date="2016-01-02",
    aggregation_unit="admin1",
)
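
The queries above cover one day using `start_date="2016-01-01"` and `end_date="2016-01-02"`, which suggests that the end date itself is not included. Under that assumption, a small helper (hypothetical, not part of flowclient) can build the two date strings for a window of any length:

```python
from datetime import date, timedelta


def date_window(start: str, days: int) -> tuple:
    """Return (start_date, end_date) strings covering `days` whole days.

    Assumes the end date is exclusive, as the one-day example suggests.
    """
    start_day = date.fromisoformat(start)
    end_day = start_day + timedelta(days=days)
    return start_day.isoformat(), end_day.isoformat()


start_date, end_date = date_window("2016-01-01", days=1)
print(start_date, end_date)  # 2016-01-01 2016-01-02
```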

Next we set both queries running:

In [None]:
event_counts_query.run()
subscriber_counts_query.run()

### 2. Get results

In [the previous tutorial](02-running-a-query.ipynb) you learned how to get a query result as a pandas DataFrame, using `get_result()` with no arguments. We'll use this to get the results of our two queries. First, the event counts:

In [None]:
event_counts_query_result_dataframe = event_counts_query.get_result()

In [None]:
event_counts_query_result_dataframe

This time, the 'pcod' column is the admin1 P-code which identifies each of the regions. The 'value' column is the event count - let's rename it to make this clearer (using the pandas `rename` method):

In [None]:
event_counts_query_result_dataframe = event_counts_query_result_dataframe.rename(columns={"value": "event_count"})

Now get the result of the subscriber counts query (again, as a DataFrame):

In [None]:
subscriber_counts_query_result_dataframe = subscriber_counts_query.get_result()
subscriber_counts_query_result_dataframe

Again, let's rename the 'value' column so it's clear this is the subscriber count:

In [None]:
subscriber_counts_query_result_dataframe = subscriber_counts_query_result_dataframe.rename(columns={"value": "subscriber_count"})

### 3. Calculate average events per subscriber

Now that we have the results of both queries, we can merge the two result dataframes using the pandas `merge` method, joining on the 'pcod' column (which is the admin1 region ID).

In [None]:
joined_results = event_counts_query_result_dataframe.merge(subscriber_counts_query_result_dataframe, on="pcod")
joined_results

Next, we can calculate the average events per subscriber (which is event count divided by subscriber count), and add this as a new column:

In [None]:
joined_results["events_per_subscriber"] = joined_results["event_count"] / joined_results["subscriber_count"]
joined_results
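
To see what these two steps do, here is a small self-contained sketch with made-up numbers in place of the query results. The P-codes and counts are illustrative only. It also uses the optional `validate` argument of `merge`, which raises an error if a P-code appears more than once in either table.

```python
import pandas as pd

# Made-up stand-ins for the two renamed query results.
event_counts = pd.DataFrame(
    {"pcod": ["EX01", "EX02", "EX03"], "event_count": [300, 150, 90]}
)
subscriber_counts = pd.DataFrame(
    {"pcod": ["EX01", "EX02"], "subscriber_count": [100, 60]}
)

# merge() performs an inner join by default, so "EX03" (missing from
# subscriber_counts) does not appear in the output.
joined = event_counts.merge(subscriber_counts, on="pcod", validate="one_to_one")

# Element-wise division gives one average per region.
joined["events_per_subscriber"] = joined["event_count"] / joined["subscriber_count"]
print(joined)
```

Passing `how="left"` to `merge` would instead keep every row of the left-hand table, with `NaN` where no match exists.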

### 4. Get the geography data

Earlier in this tutorial, we got the result of a query as a GeoJSON dictionary containing the geographic boundaries. This time, we already have our result (and we have done some further analysis to calculate a new result: events per subscriber), so we just need the geographic boundaries.

We can use the flowclient `get_geography` function to get the admin1 region boundaries as a GeoJSON dictionary. `get_geography` requires two parameters:

- `connection`: this is the FlowKit connection `conn` that we created at the beginning of the tutorial, and used in the query definitions,  
- `aggregation_unit`: we used the "admin1" aggregation unit in the queries, so we use the same here to get the corresponding geographies.  

As before, we can use `gpd.GeoDataFrame.from_features` to load the GeoJSON dictionary into a geopandas GeoDataFrame.

In [None]:
admin1_geojson = fc.get_geography(connection=conn, aggregation_unit="admin1")
admin1_geopandas = gpd.GeoDataFrame.from_features(admin1_geojson)
admin1_geopandas

This time, the GeoDataFrame has 'pcod', 'geometry' and 'centroid' columns, but not 'value' because it does not include any query results.

Let's plot the admin1 polygons, to see that they correspond to Ghana's regions:

In [None]:
admin1_geopandas.plot()

### 5. Join the geographies to the 'events per subscriber' data

Now that we have the 'events per subscriber' data and the geographic boundaries, we can merge them using the 'pcod' region ID column.

In [None]:
joined_results_with_geography = admin1_geopandas.merge(joined_results, on="pcod")

**Note:** If we had done `joined_results.merge(admin1_geopandas, on="pcod")`, the result would be an ordinary pandas DataFrame, not a GeoDataFrame, so it is best to do `admin1_geopandas.merge(joined_results, on="pcod")`.
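
Conceptually, this merge is a lookup by P-code: each boundary is matched to the row with the same 'pcod'. The sketch below does the same thing by hand on a made-up GeoJSON dictionary, without geopandas, by copying each value into the matching feature's `properties`. It illustrates the idea and is not a replacement for the merge above. All P-codes, shapes and values are invented.

```python
# Made-up region boundaries in GeoJSON form (stand-in for get_geography output).
example_boundaries = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]]},
            "properties": {"pcod": "EX01"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [[[1, 0], [2, 0], [2, 1], [1, 0]]]},
            "properties": {"pcod": "EX02"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [[[2, 0], [3, 0], [3, 1], [2, 0]]]},
            "properties": {"pcod": "EX03"},
        },
    ],
}

# Made-up calculated values, keyed by P-code.
events_per_subscriber = {"EX01": 3.0, "EX02": 2.5}

joined_features = []
for feature in example_boundaries["features"]:
    pcod = feature["properties"]["pcod"]
    if pcod not in events_per_subscriber:
        continue  # inner-join behaviour: skip boundaries with no matching value
    # Build new dictionaries so the original boundaries are left unchanged.
    properties = {**feature["properties"], "events_per_subscriber": events_per_subscriber[pcod]}
    joined_features.append({**feature, "properties": properties})

joined_geojson = {"type": "FeatureCollection", "features": joined_features}
print([f["properties"] for f in joined_geojson["features"]])
```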

We can now plot the events per subscriber on a map, by using the value in the 'events_per_subscriber' column to colour each polygon:

In [None]:
joined_results_with_geography.plot(column="events_per_subscriber", legend=True)

## Summary

In this tutorial you learned:

- how to choose which set of locations (termed "aggregation unit") query results will be aggregated to, and what aggregation units are available,  
- how to get a query result with geography data, as a GeoJSON dictionary (using `query.get_result(format="geojson")`),  
- how to load a GeoJSON dictionary into a geopandas GeoDataFrame, and use geopandas to plot or save the data,  
- how to get geography data separately (using `get_geography`), and join this to data in a pandas DataFrame.  

If you're using Binder to run this tutorial, and you would like to keep a copy of this notebook with any changes you have made, don't forget to download a copy (`File`→`Download`).

Now that you have completed these three tutorials, you know how to use FlowClient to connect to a FlowKit server; how to define, run, check the status and get the result of a query; and how to associate query results with the geographic boundaries of the locations to which they are aggregated.