In [None]:
# DELETE BEFORE RELEASE
import os
import sys
nb_dir = os.path.split(os.getcwd())[0] + "/src"
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

# `stedsans`

This notebook demonstrates the most prominent current capabilities of `stedsans`. 
We strongly recommend running the notebook in Google Colab:
<br>
<br>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MalteHB/stedsans/blob/main/notebooks/stedsans_demo.ipynb)


If you run the notebook on your local machine, consider installing [Anaconda](https://docs.anaconda.com/anaconda/install/) and then installing the package `geopandas` with the `conda` package manager from an Anaconda-integrated terminal, which gives you the pre-built binaries:

```bash
conda install geopandas
```

## Setup

We will start off by installing `stedsans` using the `pip` package manager.

In [None]:
#!pip -q install stedsans

If you are using Google Colab, Linux or macOS, feel free to install `geopandas` using `pip`. If you are using Windows, however, install `geopandas` using the `conda` package manager.

In [None]:
# For Google Colab, Linux or MacOS:
#!pip -q install geopandas

# For Windows:
# !conda install geopandas

__Importing packages__

We start off by importing the main module of `stedsans`.

In [None]:
from stedsans import stedsans

## Language capabilities of `stedsans`

`stedsans` can take either a Danish or an English sentence and extract its entities using either [Ælæctra](https://huggingface.co/Maltehb/-l-ctra-danish-electra-small-cased-ner-dane) or [BERT](https://huggingface.co/dslim/bert-base-NER), respectively.

The intended use of `stedsans` is to initialize a stedsans object with a sentence.

In [None]:
# Define the sentence
danish_sentence = "Malte er mit navn, og jeg bor på Testvej 13, Aarhus C"

# By default stedsans assumes the language is Danish
default_stedsans = stedsans(sentence = danish_sentence)

Once a `stedsans` instance has been initialized with a sentence, one can simply call the `extract_entities()` method and print the entities.

In [None]:
default_entities = default_stedsans.extract_entities()

print(default_entities)

### Multilinguistic stedsans 
#### (duolinguistic for now...)
By default, `stedsans` assumes the language is Danish, but the language can also be specified using the `language` argument. `stedsans` currently only supports Danish and English sentences, but future enhancements will add more languages.

In [None]:
danish_stedsans = stedsans(danish_sentence, language="danish")

danish_entities = danish_stedsans.extract_entities()

print(danish_entities)

In [None]:
english_sentence = "Hello my name is Malte and i live in Aarhus C"

english_stedsans = stedsans(english_sentence, language="english")

english_entities = english_stedsans.extract_entities()

print(english_entities)

Once a `stedsans` instance has been initialized, we can also use it to predict entities in other sentences.

In [None]:
new_danish_sentence = "Jakob er min gode samarbejdspartners navn, og han bor også i Aarhus C"

danish_sentence_entities = default_stedsans.extract_entities(new_danish_sentence)

print(danish_sentence_entities)

In [None]:
new_english_sentence = "Jakob is the name my good cooperator, and he also lives in Aarhus C"

english_sentence_entities = english_stedsans.extract_entities(new_english_sentence)

print(english_sentence_entities)

Notice here how the different models have different predictive capabilities. The Danish Ælæctra notices 'C' as part of the location whereas BERT does not.

## Geographic capabilities of `stedsans`

To show the basic geographical functionality of `stedsans`, we will start off by initializing a `stedsans` instance, `geo_demo`, with an English text string, and printing the location and organization entities it finds.

In [None]:
txt = "Stedsans was developed by two knights who both live in Aarhus C. \
        They are both fans of FC Midtjylland which is a football team residing at MCH Arena. \
        One comes from Randers, which is home to one of the best beers in the world. \
        The other comes from a small city, not too far from LEGOLAND. \
        One of their favourite locations is Knebel which is located on Mols Djursland. \
        They are not hateful people, but they are not too fond of AGF. What is there really to like about AGF? \
        If you ever come close to Aarhus, feel free to pay them a visit. \
        They will gladly take you on a tourist tour to both see Bruuns Galleri and Dokk1. \
        And if you like beer they will gladly beknight you at Guldhornene Aarhus."

geo_demo = stedsans(sentence = txt, language = 'english')

geo_demo.print_entities()

We can then use the `get_coordinates()` function to obtain a list of coordinates, a `pandas` dataframe and a `geopandas` dataframe.

In [None]:
coords, df, gdf = geo_demo.get_coordinates()

print("List of coordinates:\n", coords)

In [None]:
print("Overview of the pandas dataframe:\n",df)

In [None]:
print("Overview of the geopandas dataframe:\n",gdf)

As we can see, the two dataframes differ in that the `geopandas` dataframe also contains the `geometry` column, which enables a user of `stedsans` to do additional geospatial analyses. Also note how the geoparsing fails: it has returned a place in France, Aérodrome d'Agen-La Garenne, for the entity 'AGF', which actually corresponds to a substandard football club located in Aarhus.

Because of this ambiguity, `get_coordinates()` also takes the parameters `limit` and `limit_area`, which are very convenient when you only want locations inside a specific area.

In [None]:
coords, df, gdf = geo_demo.get_coordinates(limit="country", limit_area="Denmark", output_language="en")

print("List of coordinates:\n", coords)

In [None]:
print("Overview of the pandas dataframe:\n",df)

In [None]:
print("Overview of the geopandas dataframe:\n",gdf)

Now the entity 'AGF' has been removed and we are only getting the geotagged locations inside of Denmark.
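
With coordinates in hand, simple geographic calculations become possible. The following is a minimal sketch, independent of `stedsans`, of the great-circle (haversine) distance between two points. It assumes coordinates are given as `(latitude, longitude)` pairs in degrees, which may differ from the order `get_coordinates()` returns, and the city coordinates are approximate values chosen for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Approximate coordinates, assumed for illustration only
aarhus = (56.1629, 10.2039)
randers = (56.4607, 10.0364)

distance = haversine_km(aarhus, randers)
print(f"Aarhus to Randers: {distance:.1f} km")
```

For real analyses, the `geometry` column of the `geopandas` dataframe is usually the better starting point, since it carries the coordinate reference system along with the points.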

### Basic Visualization: Plotting points onto a map

To visualize the points we have extracted from the text we can plot them on an interactive `Folium` map by using the `plot_locations()` function. We will continue using our already defined instance of `stedsans`, `geo_demo`.

In [None]:
geo_demo.plot_locations(limit="country", limit_area="Denmark")

We can also pass a mapping layer into a `stedsans` instance. `stedsans` ships with a `geopandas` dataframe created from a shapefile of Denmark divided into municipalities. It is provided by the `GeoData` class and can be retrieved by calling `GeoData.municipalities()`. The original shapefile was in Danish, and the column names are therefore still in Danish.

In [None]:
from stedsans.data.load_data import GeoData
denmark = GeoData.municipalities()
print("First five rows of the denmark dataframe:\n", denmark.head())

Here we see that there are several columns with Danish names, including REGIONNAVN, which translates into REGIONNAME, but also a geometry column for the different polygons.

By setting the `layer` argument of `plot_locations()` to this dataframe, we can use it as a base layer. This gives us a non-interactive map created with `matplotlib`. 

In [None]:
geo_demo.plot_locations(limit="country", limit_area="Denmark", output_language="en", layer=denmark)

### Statistical Tools: Point Patterns

Since `stedsans` is intended to be more than a geographical visualization tool, it also comes with the ability to compute Q-statistics, enabling a quick statistical analysis of the distribution of the points by testing for complete spatial randomness.

We will continue to use the `geo_demo` instance of `stedsans` from the previous section.

To get the Q-statistics `stedsans` provides the function `get_quad_stats()`.

In [None]:
geo_demo.get_quad_stats()

`get_quad_stats()` provides us with a χ²-value and a p-value to determine whether our points are distributed completely at random. In this instance, with χ² = 140 and a p-value of 0.001, we can reject the hypothesis of complete spatial randomness.
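
To make the idea behind a quadrat test concrete, here is a minimal sketch in plain Python. It is not the implementation used by `stedsans`. It divides the bounding box of the points into a grid, counts the points per cell, and compares the counts with the uniform expectation. The function names and the toy points are hypothetical.

```python
def quadrat_counts(points, squares):
    """Count points in a squares x squares grid over the points' bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    min_x, min_y = min(xs), min(ys)
    # Fall back to 1.0 if all points share a coordinate, to avoid dividing by zero
    width = (max(xs) - min_x) / squares or 1.0
    height = (max(ys) - min_y) / squares or 1.0

    counts = [[0] * squares for _ in range(squares)]
    for x, y in points:
        # Points on the upper edge belong to the last cell
        col = min(int((x - min_x) / width), squares - 1)
        row = min(int((y - min_y) / height), squares - 1)
        counts[row][col] += 1
    return counts


def chi_square(counts):
    """Chi-square statistic against an equal expected count in every cell."""
    cells = [c for row in counts for c in row]
    expected = sum(cells) / len(cells)
    return sum((c - expected) ** 2 / expected for c in cells)


# Hypothetical toy data: four clustered points and one outlier
toy_points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (1.0, 1.0)]
toy_counts = quadrat_counts(toy_points, squares=2)
toy_chi2 = chi_square(toy_counts)
print(toy_counts, toy_chi2)
```

A large statistic relative to a χ² distribution with (number of cells − 1) degrees of freedom indicates clustering or regularity rather than randomness. Computing the p-value itself is left out here, since it needs a χ² distribution function such as the one in `scipy.stats`.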

We are also able to plot the points into quadrants using the `plot_quad_count()` function. 

In this example we specify the number of `squares` per axis to be 4.

In [None]:
geo_demo.plot_quad_count(squares = 4)

### Advanced Visualizations 1: Plotting choropleth plots

`stedsans` also provides the ability to plot choropleth plots. This can be done by using the function `plot_choropleth()`. 

By default, `plot_choropleth()` uses the entire world as its base layer.

In [None]:
geo_demo.plot_choropleth()

When the locations are only in Denmark (and France), such a plot is not very informative. Luckily, similarly to the `plot_locations()` function, `plot_choropleth()` also lets us pass in a base layer with the `layer` argument. We will use the already initialized `denmark` layer.

In [None]:
geo_demo.plot_choropleth(layer=denmark)

By specifying the layer we get a much more informative view of the distribution of tagged locations.

`plot_choropleth()` also has a `group_by` option, which lets you group the fill of the choropleth plot by a variable in the input layer.

In [None]:
geo_demo.plot_choropleth(layer=denmark, group_by='REGIONNAVN')
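
Conceptually, a choropleth needs a count of points per polygon, optionally aggregated by a grouping column. The sketch below illustrates that idea in plain Python with a ray-casting point-in-polygon test. It is not how `stedsans` or `geopandas` implement it, and the toy layer, its column names and the points are all hypothetical.

```python
from collections import Counter


def point_in_polygon(point, polygon):
    """Ray-casting test: is the (x, y) point inside the polygon's vertex list?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points(layer, points, group_by="name"):
    """Count points per polygon, keyed by the chosen column of the layer."""
    counts = Counter({row[group_by]: 0 for row in layer})
    for p in points:
        for row in layer:
            if point_in_polygon(p, row["polygon"]):
                counts[row[group_by]] += 1
                break  # each point is counted at most once
    return dict(counts)


# Hypothetical toy layer: three unit-square "municipalities" in two "regions"
toy_layer = [
    {"name": "A", "region": "West", "polygon": [(0, 0), (1, 0), (1, 1), (0, 1)]},
    {"name": "B", "region": "West", "polygon": [(1, 0), (2, 0), (2, 1), (1, 1)]},
    {"name": "C", "region": "East", "polygon": [(2, 0), (3, 0), (3, 1), (2, 1)]},
]
toy_points = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5), (2.5, 0.5)]

per_municipality = count_points(toy_layer, toy_points)
per_region = count_points(toy_layer, toy_points, group_by="region")
print(per_municipality)
print(per_region)
```

Grouping by a coarser column simply merges the counts of all polygons that share a value, which is why a plot grouped by region shows fewer, larger filled areas.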

If we only want a view of the location distribution of a specific region in Denmark, we can create a subset layer, `region_m`. Here we choose the region name column `REGIONNAVN` and subset it to 'Region Midtjylland', which translates to the Central Jutland Region.

In [None]:
region_m = denmark[denmark["REGIONNAVN"] == "Region Midtjylland"] 

print("First five rows of the region_m dataframe:\n", region_m.head())

print("\nUnique regions in region_m:", region_m["REGIONNAVN"].unique())

Now we have a layer containing only the municipalities whose `REGIONNAVN` is 'Region Midtjylland'.

We can use this to create a choropleth plot of only the locations in the Central Jutland Region of Denmark, divided into the different municipalities. We will also make use of the `title` argument to give the plot a title.

In [None]:
geo_demo.plot_choropleth(layer=region_m, title = 'Central Jutland Region')

If it was not clear before, it is now evident that the most represented municipality in the text example, according to the language models, is the municipality of Aarhus. 

### Advanced Visualizations 2: Heatmaps

# Den Store Danske - Jylland

## Reading in the article

In [None]:
from stedsans.data.load_data import Articles  # assumed to live alongside GeoData
jylland_article = Articles.jylland()

## Initialising stedsans object

In [None]:
geo_demo = stedsans(file = jylland_article, language = 'danish')

## Plotting locations

### Plotting on interactive leaflet map

In [None]:
geo_demo.plot_locations()

### Plotting on shapefile layer

In [None]:
geo_demo.plot_locations(layer=denmark, on_map=True)

In [None]:
geo_demo.plot_locations(layer=region_m, on_map=True)

## Plotting heatmaps

### No restrictions

In [None]:
geo_demo.plot_heatmap()

### Plotting only locations in Denmark

In [None]:
geo_demo.plot_heatmap(limit = 'country', limit_area = 'Danmark')

### Bounding search area to Region Midtjylland

In [None]:
geo_demo.plot_heatmap(bounding_box=((55.9,7.6),(56.6,10.9)), bounded=True)
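
As a rough illustration of what bounding a search area means, the sketch below filters points against a box in plain Python. It assumes the `bounding_box` tuple is ordered as `((min_lat, min_lon), (max_lat, max_lon))`, which the values above suggest but which is not confirmed here. The helper name and the approximate city coordinates are hypothetical.

```python
def in_bounding_box(point, bounding_box):
    """Return True if a (lat, lon) point lies inside the bounding box."""
    (min_lat, min_lon), (max_lat, max_lon) = bounding_box
    lat, lon = point
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


# Same box as in the cell above (assumed order: south-west, then north-east)
bounding_box = ((55.9, 7.6), (56.6, 10.9))

# Approximate coordinates, for illustration only
cities = {
    "Aarhus": (56.16, 10.20),
    "Randers": (56.46, 10.04),
    "Copenhagen": (55.68, 12.57),
}

inside = [name for name, point in cities.items()
          if in_bounding_box(point, bounding_box)]
print(inside)
```

Note that a rectangular box is only an approximation of an administrative region, so points just outside Region Midtjylland can still fall inside it.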

## Choropleth maps

### Plotting choropleth map of points bounded to Region Midtjylland on map of Denmark

In [None]:
geo_demo.plot_choropleth()

In [None]:
geo_demo.plot_choropleth(layer=denmark, title='Jylland - Den Store Danske \n Bounded to Region Midtjylland', group_by='DAGI_ID', bounding_box=((55.9,7.6),(56.6, 10.9)), bounded=True)

### Plotting choropleth map grouped by region

In [None]:
geo_demo.plot_choropleth(layer=denmark, title='Jylland - Den Store Danske \n Grouped by Region', group_by='REGIONNAVN', bounding_box=((54.6,7.8),(57.8, 15.2)), bounded=False)

### Plotting choropleth map grouped by municipalities

In [None]:
geo_demo.plot_choropleth(layer=denmark, title='Jylland - Den Store Danske \n Unbounded', group_by='DAGI_ID')

## Quadrat Statistics

In [None]:
geo_demo.get_quad_stats(limit = 'country', limit_area = 'Danmark')

In [None]:
geo_demo.plot_quad_count(limit = 'country', limit_area = 'Danmark')