## Plotting geospatial data on a map

In this first activity for geoplotlib, you'll combine the methods learned in the previous exercise with theoretical knowledge from previous lessons.   
Besides wrangling data, you need to find the area that matches a given set of attributes.   

Before we can start, however, we need to import our dataset.   
For this activity, we'll work with geo-spatial data that contains all cities with their coordinates and their population.

**Note:**   
This time the dataset is not included in the data folder. You have to download it from here:   
https://www.kaggle.com/max-mind/world-cities-database#worldcitiespop.csv

#### Loading the dataset

In [4]:
# importing the necessary dependencies
import numpy as np
import pandas as pd
import geoplotlib

In [1]:
# loading the Dataset (make sure to have the dataset downloaded)


**Note:**   
If we import our dataset without defining the dtype of column *Region* as String, we will get a warning telling us that it has a mixed datatype.   
We can get rid of this warning by explicitly defining the type of the values in this column by using the `dtype` parameter.   
`dtype={'Region': str}`
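
The note above can be sketched as follows. The inline CSV rows are made-up sample values so that the snippet runs without the download, and the column names are assumed to match the Kaggle file. For the real data, pass the path of your downloaded `worldcitiespop.csv` to `pd.read_csv` instead of the `StringIO` buffer.

```python
import io

import pandas as pd

# Tiny illustrative sample; column names are assumed to match worldcitiespop.csv
sample_csv = io.StringIO(
    "Country,City,AccentCity,Region,Population,Latitude,Longitude\n"
    "de,cologne,Cologne,07,963395,50.933333,6.95\n"
    "gb,london,London,H9,7421228,51.514125,-0.093689\n"
)

# Reading Region as a string keeps codes such as "07" intact and
# avoids the mixed-type warning on the full file
dataset = pd.read_csv(sample_csv, dtype={"Region": str})
print(dataset["Region"].tolist())
```

Without the `dtype` argument, pandas has to guess a type for *Region*, which mixes purely numeric codes with alphanumeric ones.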

In [2]:
# looking at the data types of each column


**Note:**   
Here we can see the dtypes of each column.   
Since the String type is not a primitive datatype, it's displayed as `object` here.

In [3]:
# showing the first 5 entries of the dataset


---

#### Mapping `Latitude` and `Longitude` to `lat` and `lon`

Most datasets won't be in the format that you want to have. Some of them might have their latitude and longitude values hidden in a different column.   
This is where the data wrangling skills from lesson 1 are needed.   

For the given dataset, the transformation is easy: we simply need to map the `Latitude` and `Longitude` columns to the `lat` and `lon` columns that geoplotlib uses.

In [4]:
# mapping Latitude to lat and Longitude to lon


**Note:**   
Geoplotlib's methods expect the dataset columns `lat` and `lon` for plotting. This means your dataframe has to be transformed to resemble this structure.   
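
A minimal sketch of this mapping, using a small made-up DataFrame in place of the loaded dataset:

```python
import pandas as pd

# Made-up rows standing in for the loaded dataset
dataset = pd.DataFrame({
    "City": ["cologne", "london"],
    "Latitude": [50.93, 51.51],
    "Longitude": [6.95, -0.09],
})

# Copy the coordinates into the column names geoplotlib looks up
dataset["lat"] = dataset["Latitude"]
dataset["lon"] = dataset["Longitude"]
print(dataset[["City", "lat", "lon"]])
```

Copying keeps the original columns available. If you don't need them any more, `dataset.rename(columns={"Latitude": "lat", "Longitude": "lon"})` would work as well.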

---

#### Understanding our data

It's your first day at work, your boss hands you this dataset and wants you to dig into it and find the areas with the most adjacent cities that have a population of more than 100k.   
He needs this information to figure out where to expand next.   

To get a feeling for how many datapoints the dataset contains, we'll plot the whole dataset using dots.

In [5]:
# plotting the whole dataset with dots
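
One possible way to fill in this cell. The sketch assumes that `geoplotlib.dot` accepts a DataFrame directly once it has `lat` and `lon` columns. `plot_all_cities` is a hypothetical helper name, and the two sample rows are made up. The call is left commented out because it opens a window.

```python
import pandas as pd


def plot_all_cities(data):
    """Render every row of `data` as a dot (opens a geoplotlib window)."""
    import geoplotlib  # imported lazily: only needed when actually plotting

    geoplotlib.dot(data)
    geoplotlib.show()


# Made-up sample with the lat/lon columns geoplotlib expects
cities = pd.DataFrame({"lat": [50.93, 51.51], "lon": [6.95, -0.09]})
# plot_all_cities(cities)  # in the notebook, pass the full dataset instead
```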


Other than seeing the density of our datapoints, we also need to get some information about how the data is distributed.

In [6]:
# number of countries and cities


In [7]:
# number of cities per country (first 20 entries)


In [8]:
# average number of cities per country
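
The three distribution questions above could be answered roughly like this. The rows are made up, and the `Country` and `City` column names are assumed to match the downloaded file.

```python
import pandas as pd

# Made-up rows; Country is assumed to hold lowercase country codes
dataset = pd.DataFrame({
    "Country": ["de", "de", "gb", "jp", "jp", "jp"],
    "City": ["cologne", "bonn", "london", "tokyo", "osaka", "kyoto"],
})

# Number of distinct countries and total number of city rows
n_countries = dataset["Country"].nunique()
n_cities = len(dataset)

# Cities per country, largest first (add .head(20) on the full dataset)
cities_per_country = (
    dataset.groupby("Country")["City"].count().sort_values(ascending=False)
)

# Average number of cities per country
avg_cities = cities_per_country.mean()
print(n_countries, n_cities, avg_cities)
```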


Since we are only interested in areas with densely placed cities and high population, we can filter out cities without a population. 

#### Reducing our data

Our dataset lists more than 3 million cities. Many of them are really small and can be ignored, given our objective for this activity.   
We only want to look at those cities that have a value given for their population.

**Note:**   
If you're having trouble filtering your dataset, you can always check back with the activities in lesson 1.

In [11]:
# filter for cities with a population entry (Population > 0)
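
A sketch of this filter on made-up rows. It assumes that missing populations are read as `NaN`, which pandas does for empty CSV fields.

```python
import numpy as np
import pandas as pd

# Made-up rows; two of them have no population entry
dataset = pd.DataFrame({
    "City": ["hamlet", "cologne", "village", "london"],
    "Population": [np.nan, 963395, np.nan, 7421228],
})

# NaN > 0 evaluates to False, so rows without a population are dropped
dataset_with_pop = dataset[dataset["Population"] > 0]
print(len(dataset), "->", len(dataset_with_pop))
```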


In [12]:
# displaying the first 5 items from dataset_with_pop


In [13]:
# showing all cities with a defined population with a dot density plot


**Note:**   
The visualization now renders faster, and we can already see which areas contain more cities.   

Following the request from our boss, we shall only consider areas that have a high density of adjacent cities with a population of more than 100k.

In [14]:
# dataset with cities with population of >= 100k


In [15]:
# displaying all cities >= 100k population with a fixed bounding box (WORLD) in a dot density plot
from geoplotlib.utils import BoundingBox


**Note:**   
In order to get the same view on our map every time, we can set the bounding box to the constant viewport declared in the geoplotlib library.   
We can also instantiate the BoundingBox class with values for north, west, south, and east.
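
The 100k filter and the fixed viewport can be sketched together. `plot_world_dots` is a hypothetical helper, the rows are made up, and the custom bounding box values in the comment are illustrative only. The plotting call is commented out because it opens a window.

```python
import pandas as pd


def plot_world_dots(data):
    """Dot density plot with the viewport fixed to the whole world."""
    import geoplotlib  # lazy imports: only needed when a window is opened
    from geoplotlib.utils import BoundingBox

    geoplotlib.dot(data)
    geoplotlib.set_bbox(BoundingBox.WORLD)
    # A custom viewport (illustrative values) would look like:
    # geoplotlib.set_bbox(BoundingBox(north=56, west=-11, south=46, east=16))
    geoplotlib.show()


# Made-up rows standing in for dataset_with_pop
dataset_with_pop = pd.DataFrame({
    "City": ["smalltown", "cologne", "london"],
    "Population": [25000, 963395, 7421228],
    "lat": [50.0, 50.93, 51.51],
    "lon": [8.0, 6.95, -0.09],
})

# Keep only cities with at least 100k inhabitants
dataset_pop_100k = dataset_with_pop[dataset_with_pop["Population"] >= 100_000]
# plot_world_dots(dataset_pop_100k)
```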

---

#### Finding the best area

After reducing our data, we can now use more complex plots to filter down our data even more.   
Thinking back to the first exercise, we've seen that histograms and Voronoi plots can give us a quick visual representation of the density of data.

**Note:**   
Try playing around with the different color maps of the plotting methods. Sometimes a different color map improves not only the visuals but also the amount of information you can take from the visualization.

In [16]:
# using filled voronoi to find dense areas
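
A hedged sketch for this cell. `plot_voronoi` is a hypothetical helper, and the `max_area` and `alpha` values are illustrative choices rather than required settings. The color map name is passed through, so you can try different ones.

```python
def plot_voronoi(data, cmap="hot_r"):
    """Filled Voronoi plot of `data` (opens a geoplotlib window)."""
    import geoplotlib  # lazy imports: only needed when a window is opened
    from geoplotlib.utils import BoundingBox

    # Passing a color map fills the cells; smaller cells mean denser areas
    geoplotlib.voronoi(data, cmap=cmap, max_area=1e3, alpha=255)
    geoplotlib.set_bbox(BoundingBox.WORLD)
    geoplotlib.show()


# plot_voronoi(dataset_pop_100k)            # default color map
# plot_voronoi(dataset_pop_100k, "Blues")   # try another color map
```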


In the Voronoi plot, we can see tendencies.   
Germany, Great Britain, Nigeria, India, Japan, Java, the East Coast of the USA, and Brazil stick out.   
We can now filter our data again and only look at those countries to find the best-suited one.   

---

#### Final call

After meeting with your boss, he tells you that we want to stick to Europe when it comes to expanding.   
Filter your data for Germany and Great Britain only and decide which area is your final proposal.

In [17]:
# filter 100k dataset for cities in Germany and GB


In [18]:
# using Delaunay triangulation to find the most dense area
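
The last two cells could look roughly like this. The sketch assumes that the `Country` column holds lowercase country codes (`de`, `gb`). The rows are made up, and `plot_delaunay` is a hypothetical helper whose call is commented out because it opens a window.

```python
import pandas as pd


def plot_delaunay(data, cmap="hot_r"):
    """Delaunay triangulation of `data` (opens a geoplotlib window)."""
    import geoplotlib  # lazy import: only needed when a window is opened

    geoplotlib.delaunay(data, cmap=cmap)
    geoplotlib.show()


# Made-up rows standing in for the 100k dataset
dataset_pop_100k = pd.DataFrame({
    "Country": ["de", "gb", "jp", "de"],
    "City": ["cologne", "london", "tokyo", "dusseldorf"],
    "lat": [50.93, 51.51, 35.69, 51.22],
    "lon": [6.95, -0.09, 139.75, 6.78],
})

# Keep only cities in Germany and Great Britain
dataset_europe = dataset_pop_100k[dataset_pop_100k["Country"].isin(["de", "gb"])]
print(dataset_europe["City"].tolist())
# plot_delaunay(dataset_europe)
```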


Looking at our Delaunay visualization, we can quickly see that the area around Cologne and Düsseldorf sticks out.   
With those insights, we can now get back to our boss and talk about what we found out.

**Note:**   
As mentioned before, it's important to know which visualization type helps you achieve the best insights.   
For example, we could have simply used a dot density map in the final call, which would also have given us an idea of where most cities are.   
However, Delaunay triangulation is a good approach here because it makes details pop almost instantly.