# `stedsans`

This notebook demonstrates the most prominent current capabilities of `stedsans`.
We strongly recommend running it in Google Colab:
<br>
<br>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/MalteHB/stedsans/blob/main/notebooks/stedsans_demo.ipynb)


If you run the notebook on your local machine, consider installing [Anaconda](https://docs.anaconda.com/anaconda/install/) and then installing `geopandas` with the `conda` package manager from an Anaconda-integrated terminal, which gives you the pre-built binaries:

```bash
conda install geopandas
```

## Setup

We will start off by installing `stedsans` using the `pip` package manager.

In [None]:
!pip install -q stedsans==0.0.13a0

If you are using Google Colab, Linux or macOS, you can also install `geopandas` using `pip`. On Windows, install `geopandas` with the `conda` package manager instead.

In [None]:
# For Google Colab, Linux or MacOS:
!pip -q install geopandas==0.9.0

__Importing packages__

We start off by importing the main module of `stedsans`.

In [None]:
from stedsans import stedsans

## Language capabilities of `stedsans`

`stedsans` can take either a Danish or an English sentence and extract its entities using [Ælæctra](https://huggingface.co/Maltehb/-l-ctra-danish-electra-small-cased-ner-dane) or [BERT](https://huggingface.co/dslim/bert-base-NER), respectively.

The intended use of `stedsans` is to initialize a stedsans object with a text input.

In [None]:
# Define the sentence
danish_sentence = "Malte er mit navn, og jeg bor på Testvej 13, Aarhus C"

# By default stedsans assumes the language is Danish
default_stedsans = stedsans(sentence = danish_sentence)

After a `stedsans` instance has been initialized with a sentence, we can simply call the `extract_entities()` function and print the entities.

In [None]:
default_entities = default_stedsans.extract_entities()

print(default_entities)

### Multilingual stedsans
#### (bilingual for now...)
By default `stedsans` assumes the language is Danish, but the language can also be specified using the `language` argument. `stedsans` currently only handles Danish and English sentences, but future enhancements will add more languages.

In [None]:
danish_stedsans = stedsans(danish_sentence, 
                           language="danish")

danish_entities = danish_stedsans.extract_entities()

print(danish_entities)

In [None]:
english_sentence = "Hello my name is Malte and i live in Aarhus C"

english_stedsans = stedsans(english_sentence, 
                            language="english")

english_entities = english_stedsans.extract_entities()

print(english_entities)

Once a `stedsans` instance has been initialized, we can also use it to extract entities from other sentences.

In [None]:
new_danish_sentence = "Jakob er min gode samarbejdspartners navn, og han bor også i Aarhus C"

danish_sentence_entities = default_stedsans.extract_entities(new_danish_sentence)

print(danish_sentence_entities)

In [None]:
new_english_sentence = "Jakob is the name my good cooperator, and he also lives in Aarhus C"

english_sentence_entities = english_stedsans.extract_entities(new_english_sentence)

print(english_sentence_entities)

Notice here how the different models have different predictive capabilities. The Danish Ælæctra notices *'C'* as part of the location whereas BERT does not.

## Geographic capabilities of `stedsans`

To show the basic geospatial functionality of `stedsans`, we will start off by initializing a `stedsans` instance, `geo_demo`, with an English text string and printing the location and organization entities that were found.

In [None]:
txt = "Stedsans was developed by two knights who both live in Aarhus C. \
        They are both fans of FC Midtjylland which is a football team residing at MCH Arena. \
        One comes from Randers, which is home to one of the best beers in the world. \
        The other comes from a small city, not too far from LEGOLAND. \
        One of their favourite locations is Knebel which is located on Mols Djursland. \
        They are not hateful people, but they are not too fond of AGF. What is there really to like about AGF? \
        If you ever come close to Aarhus, feel free to pay them a visit. \
        They will gladly take you on a tourist tour to both see Bruuns Galleri and Dokk1. \
        And if you like beer they will gladly beknight you at Guldhornene Aarhus."

geo_demo = stedsans(sentence = txt, 
                    language = 'english')

geo_demo.print_entities()
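
Keeping only the location and organization entities amounts to filtering the NER output by label. The sketch below assumes each entity is a dictionary with `word` and `entity_group` keys, similar to the aggregated output of a Hugging Face token-classification pipeline. The structure actually returned by `stedsans` may differ, and both the sample data and the helper `filter_entities` are hypothetical.

```python
# Hypothetical sample resembling aggregated NER output; not actual stedsans output.
sample_entities = [
    {"word": "Malte", "entity_group": "PER"},
    {"word": "Aarhus C", "entity_group": "LOC"},
    {"word": "FC Midtjylland", "entity_group": "ORG"},
]


def filter_entities(entities, labels=("LOC", "ORG")):
    """Return the words of all entities whose label is in `labels`."""
    return [e["word"] for e in entities if e["entity_group"] in labels]


# Locations and organizations are the entity types that can be geocoded.
print(filter_entities(sample_entities))
# Restrict further to locations only.
print(filter_entities(sample_entities, labels=("LOC",)))
```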

We can then use the `get_coordinates()` function to obtain a list of coordinates, a `pandas` dataframe and a `geopandas` geodataframe.

In [None]:
coords, df, gdf = geo_demo.get_coordinates()

print("List of coordinates:\n", 
      coords)

In [None]:
print("Overview of the pandas dataframe:\n",
      df)

In [None]:
print("Overview of the geopandas dataframe:\n",
      gdf)

As we see, the two dataframes differ in that the `geopandas` dataframe also contains the `geometry` column, which enables a user of `stedsans` to do additional geospatial analyses. Also note how the geoparsing fails at geocoding one of the entities. It has returned a place in France, Aérodrome d'Agen-La Garenne, for the entity *'AGF'*, which actually corresponds to a substandard football club located in Aarhus.

Because of this ambiguity, `get_coordinates()` also takes the parameters `limit` and `limit_area`, which are very convenient when you only want locations inside a specific area.

In [None]:
coords, df, gdf = geo_demo.get_coordinates(limit="country", 
                                           limit_area="Denmark")

print("List of coordinates:\n", 
      coords)

In [None]:
print("Overview of the pandas dataframe:\n",
      df)

In [None]:
print("Overview of the geopandas dataframe:\n",
      gdf)

Now the entity *'AGF'* has been removed and we are only getting the geotagged locations inside of Denmark.
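
Conceptually, `limit` and `limit_area` act as a filter on one field of each geocoded record. The following is a minimal sketch of that idea, not the implementation used by `stedsans`. The record layout and the helper `limit_to_area` are assumptions, and the coordinates are rough illustrative values.

```python
# Hypothetical geocoded records; coordinates are approximate and for illustration only.
records = [
    {"entity": "Aarhus C", "country": "Denmark", "lat": 56.16, "lon": 10.21},
    {"entity": "AGF", "country": "France", "lat": 44.17, "lon": 0.59},
    {"entity": "Randers", "country": "Denmark", "lat": 56.46, "lon": 10.04},
]


def limit_to_area(records, limit, limit_area):
    """Keep only records whose `limit` field (e.g. 'country') equals `limit_area`."""
    return [r for r in records if r.get(limit) == limit_area]


kept = limit_to_area(records, limit="country", limit_area="Denmark")
print([r["entity"] for r in kept])
```

Note that a filter like this only removes wrongly geocoded entities. It does not find the correct location for them.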

### Basic Visualization: Plotting points onto a map

`stedsans` comes with some basic example text files, datasets and shapefiles that can be used to explore the package. They are also ideal for this demonstration of the capabilities of `stedsans`.

Right now, we will load a Danish article (sorry, non-Danish-speakers) about Jutland, or Jylland in Danish, by loading it using the `Articles` class. The article can be read below.

In [None]:
from stedsans.data.load_data import Articles

jylland_article = Articles.jylland()

print(jylland_article)

We will now use the article to initialize a new `stedsans` instance, called `jutland_demo` and print the extracted entities to get a brief overview of what we are dealing with.

In [None]:
jutland_demo = stedsans(file = jylland_article, 
                        language = 'danish')

jutland_demo.print_entities()

To visualize the points we have extracted from the article we can plot them on an interactive `Folium` and `Leaflet.js` map by using the `plot_locations()` function.

In [None]:
jutland_demo.plot_locations()

By default `plot_locations()` uses the 'cartodbpositron' tileset. However, you can also tell it to use 'OpenStreetMap' or any other `Folium`-supported map layer with the `tiles` keyword. For additional configuration options for the `tiles` argument, see the [Folium documentation](https://python-visualization.github.io/folium/modules.html).

In [None]:
jutland_demo.plot_locations(tiles="OpenStreetMap")

Here we see that we have places all around the world. However, we know that the article concerns Jutland, and it seems a bit dubious that an article regarding Jutland mentions locations across three continents.
In situations like this, where it is known beforehand that all or most locations should be constrained to a specific area, all `stedsans` functions take two powerful parameters that can be used to specify a bounding box. The `bounding_box` argument lets the user define two coordinate pairs that represent the corners of the bounding box. The Boolean `bounded` then determines how the geocoder should handle the bounding box. By default, `bounded` is set to `False`. In this setting, the specified bounding box only serves as an extra heuristic for the importance score ranking in the geocoder: results that lie within the confined area are given a higher importance score. If `bounded` is set to `True`, the bounding box categorically restricts the geocoder to only search for locations within the borders of the box. We can try bounding the search to a box around Region Midtjylland to see if more locations are found in this area than before:

In [None]:
jutland_demo.plot_locations(bounding_box=((55.9, 7.6), (56.6, 10.9)), 
                            bounded=True)

We can see that by bounding the search, the geocoder often locates more places in the specified area.
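
The difference between the two `bounded` settings can be illustrated with a small ranking sketch. This is not how the underlying geocoder is implemented: the candidate list, the importance scores and the `boost` value are made up, and the helper names are hypothetical. It only shows the contrast between a soft preference (`bounded=False`) and a hard restriction (`bounded=True`), with corners given as (latitude, longitude) pairs as in the call above.

```python
def in_bounding_box(point, bounding_box):
    """Check whether a (lat, lon) point lies inside a box given by two corner pairs."""
    (lat_a, lon_a), (lat_b, lon_b) = bounding_box
    lat, lon = point
    return (min(lat_a, lat_b) <= lat <= max(lat_a, lat_b)
            and min(lon_a, lon_b) <= lon <= max(lon_a, lon_b))


def rank_candidates(candidates, bounding_box, bounded=False, boost=0.5):
    """Rank (name, point, importance) candidates; `boost` is a made-up bonus."""
    ranked = []
    for name, point, importance in candidates:
        inside = in_bounding_box(point, bounding_box)
        if bounded and not inside:
            continue  # bounded=True: discard everything outside the box
        # bounded=False: candidates inside the box only get a higher score
        ranked.append((importance + (boost if inside else 0.0), name))
    return [name for _, name in sorted(ranked, reverse=True)]


box = ((55.9, 7.6), (56.6, 10.9))
# Two hypothetical matches for the same ambiguous place name
candidates = [
    ("match in Central Jutland", (56.45, 9.40), 0.4),
    ("match elsewhere", (44.17, 0.59), 0.6),
]

print(rank_candidates(candidates, box, bounded=False))
print(rank_candidates(candidates, box, bounded=True))
```

With `bounded=False` both candidates survive, but the one inside the box is ranked first. With `bounded=True` the candidate outside the box is dropped entirely.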

If we wanted the geocoder to extract locations without restrictions, but at a later stage wanted to subset and visualise only the points located in e.g. Denmark, it would be best to use the `limit` and `limit_area` arguments and set them to 'country' and 'Denmark' respectively.

We can also pass a mapping layer into a `stedsans` instance. One of the datasets provided with `stedsans` is a `geopandas` dataframe, created from a shapefile of Denmark with municipality division. This dataset is provided by the `GeoData` class and can be retrieved by calling `GeoData.municipalities()`. The original shapefile was in Danish, and the column names are therefore still in Danish.

In [None]:
from stedsans.data.load_data import GeoData

denmark = GeoData.municipalities()

print("First five rows of the denmark dataframe:\n", 
      denmark.head())

Each row in this GeoDataFrame represents a municipality and has a polygon defined in the geometry column. The other columns hold various information on the municipalities, e.g., individual IDs (`DAGI_ID`) and the name of the region in which they are located (`REGIONNAVN`). These variables can be used for grouping the data in some of the other functions.

By setting the `layer` argument of `plot_locations()` to the loaded dataframe, we can use it as a base layer. This gives us a non-interactive map created using `matplotlib`. Setting the Boolean `on_map` argument to `True` means that only points located within one of the polygons are kept.

In [None]:
jutland_demo.plot_locations(layer=denmark,
                            on_map=True)
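
Keeping only the points that fall within a polygon is a point-in-polygon test. The sketch below uses the classic ray-casting approach in plain Python to show the idea. It is an illustration only: the helpers `point_in_polygon` and `keep_on_map` are hypothetical, the square is a toy polygon, and in practice a geometry library would handle this on the `geometry` column.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is the (x, y) point inside the polygon (list of vertices)?"""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def keep_on_map(points, polygons):
    """Keep only the points that lie inside at least one polygon."""
    return [p for p in points if any(point_in_polygon(p, poly) for poly in polygons)]


# Toy square standing in for a municipality polygon
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
points = [(2, 2), (5, 2), (1, 3)]

print(keep_on_map(points, [square]))
```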

### Statistical Tools: Point Patterns

Since `stedsans` is intended for more than geographical visualizations, it also comes with the ability to do Q-statistics, enabling a quick statistical analysis of the distribution of the points by checking for complete spatial randomness.

We will continue to use the `jutland_demo` instance of `stedsans` from the previous section.

To get the Q-statistics `stedsans` provides the function `get_quad_stats()`.

In [None]:
jutland_demo.get_quad_stats()

`get_quad_stats()` provides us with a χ² value and a p-value for testing whether our points are consistent with complete spatial randomness. In this instance, with χ² = 858.333 and a p-value of 0.001, we can reject the null hypothesis of complete spatial randomness and conclude that the points do not appear to be distributed randomly.
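
To make the statistic more concrete, here is a minimal sketch of how a quadrat-based χ² value can be computed by hand: the study area is divided into a grid, the points in each cell are counted, and the counts are compared with the uniform count expected under complete spatial randomness. The helpers `quadrat_counts` and `chi_square` and the toy points are hypothetical, the sketch assumes the points span a non-zero extent on both axes, and it does not compute a p-value. It is not necessarily how `get_quad_stats()` works internally.

```python
def quadrat_counts(points, squares):
    """Count points in each cell of a squares x squares grid over their bounding box."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    counts = [0] * (squares * squares)
    for x, y in points:
        # Clamp so that points on the upper edges fall into the last cell
        col = min(int((x - x_min) / (x_max - x_min) * squares), squares - 1)
        row = min(int((y - y_min) / (y_max - y_min) * squares), squares - 1)
        counts[row * squares + col] += 1
    return counts


def chi_square(counts):
    """Sum of (observed - expected)^2 / expected with a uniform expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


# Toy points: five clustered in one corner, three spread out
points = [(0, 0), (0.1, 0.1), (0.2, 0.2), (0.1, 0.3), (0.3, 0.1),
          (0.9, 0.1), (0.1, 0.9), (1, 1)]

counts = quadrat_counts(points, squares=2)
print("Counts per cell:", counts)
print("Chi-square:", chi_square(counts))
```

The more unevenly the points are spread over the cells, the larger the statistic becomes.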

We are also able to plot the points into quadrants using the `plot_quad_count()` function. 

In this example we specify the number of `squares` per axis to be 4.

In [None]:
jutland_demo.plot_quad_count(squares = 4)

Again we can specify the `limit` and the `limit_area`.

In [None]:
jutland_demo.get_quad_stats(limit="country", 
                            limit_area="Denmark")

In [None]:
jutland_demo.plot_quad_count(squares = 4,
                             limit="country", 
                             limit_area="Denmark"
                             )

### Advanced Visualizations 1: Plotting choropleth plots

`stedsans` also provides the ability to plot choropleth plots. This can be done by using the function `plot_choropleth()`. 

By default `plot_choropleth()` uses the entire world as its base layer.

In [None]:
jutland_demo.plot_choropleth()

When we only have locations in Denmark, such a plot might be deemed rather uninformative. Luckily, similarly to the `plot_locations()` function, `plot_choropleth()` also gives us the opportunity to pass a base layer into it with the `layer` argument. We will use the already initialized `denmark` layer.

In [None]:
jutland_demo.plot_choropleth(layer=denmark)

By specifying the layer we get a much more informative view of the distribution of tagged locations.

`plot_choropleth()` also has a `group_by` functionality, where you can specify that the filling of the choropleth plot should be grouped by a variable in the input layer.
Here we pass the region name variable called `REGIONNAVN`.

In [None]:
jutland_demo.plot_choropleth(layer=denmark, 
                             group_by='REGIONNAVN')
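
Grouping changes the unit that the counts are aggregated over: instead of one count per municipality, there is one count per value of the grouping variable. The sketch below shows this aggregation step on its own with a hypothetical lookup table and hypothetical point assignments. It does not reproduce the plotting or the internals of `plot_choropleth()`.

```python
from collections import Counter

# Hypothetical lookup from municipality to region, mirroring the REGIONNAVN column
municipality_region = {
    "Aarhus": "Region Midtjylland",
    "Randers": "Region Midtjylland",
    "Aalborg": "Region Nordjylland",
}

# Hypothetical municipality in which each geocoded point was found
point_municipalities = ["Aarhus", "Aarhus", "Randers", "Aalborg"]

# Without grouping: one count per municipality
per_municipality = Counter(point_municipalities)

# With grouping: counts are pooled per region
per_region = Counter(municipality_region[m] for m in point_municipalities)

print(per_municipality)
print(per_region)
```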

If we only want to get a view of the location distribution of a specific region in Denmark, we can create a subset layer, `region_m`. Here we use the region name column `REGIONNAVN` and keep only the rows for *'Region Midtjylland'*, which translates to the Central Jutland Region.

In [None]:
region_m = denmark[denmark["REGIONNAVN"] == "Region Midtjylland"] 

print("First five rows of the region_m dataframe:\n", 
      region_m.head())

print("\nUnique regions in region_m:", 
      region_m["REGIONNAVN"].unique())

Now we have a layer containing only the municipalities whose `REGIONNAVN` is 'Region Midtjylland'.

We can use this to create a choropleth plot of only the locations in the Central Jutland Region of Denmark, divided into the different municipalities. We will also make use of the `title` argument to give the plot a nice title.

In [None]:
jutland_demo.plot_choropleth(layer=region_m, 
                             title = 'Jylland - Den Store Danske \n Central Jutland Region')

If you want to visualize the entirety of the kingdom of Denmark, but only show the distribution of places in the Central Jutland Region, you can specify a `bounding_box` and set `bounded=True`. Note how this also slightly changes (improves) which locations are retrieved in the region.

In [None]:
# Creating a bounding box of the Central Jutland Region
bounding_box = ((55.9, 7.6),
                (56.6, 10.9))

jutland_demo.plot_choropleth(layer=denmark, 
                             title='Jylland - Den Store Danske \n Bounded to Region Midtjylland', 
                             bounding_box=bounding_box, 
                             bounded=True)

In [None]:
jutland_demo.plot_choropleth(layer=denmark, 
                             title='Jylland - Den Store Danske \n Grouped by Region', 
                             group_by='REGIONNAVN', 
                             bounding_box=((54.6,7.8),(57.8, 15.2)), 
                             bounded=False)

### Advanced Visualizations 2: Heatmaps

Lastly, `stedsans` also provides the ability to create heatmaps by using the function `plot_heatmap()`.

In [None]:
jutland_demo.plot_heatmap()

Again, we can specify the `limit` and the `limit_area`.

In [None]:
jutland_demo.plot_heatmap(limit = 'country', 
                          limit_area = 'Denmark')

And we can set a bounding box.

In [None]:
jutland_demo.plot_heatmap(bounding_box=((55.9,7.6),(56.6,10.9)), 
                          bounded=True)

Keeping the argument names consistent across multiple functions should make `stedsans` more accessible to its users.

There are lots of features to come and the developers are looking forward to continuously enhancing the features, and providing additional geospatial analytical tools.

We hope you enjoyed the demonstration of the current capabilities of `stedsans`. 

Thank you!

### Contact

For help or further information feel free to connect with either of the main developers:

**Malte Højmark-Bertelsen**
<br />
[hjb@kmd.dk](mailto:hjb@kmd.dk?subject=[GitHub]%20stedsans)


[<img align="left" alt="MalteHB | Twitter" width="30px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/twitter.svg" />][twitter]
[<img align="left" alt="MalteHB | LinkedIn" width="30px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/linkedin.svg" />][linkedin]

<br />


[twitter]: https://twitter.com/malteH_B
[linkedin]: https://www.linkedin.com/in/maltehb

**Jakob Grøhn Damgaard** 
<br />
[bokajgd@gmail.com](mailto:bokajgd@gmail.com?subject=[GitHub]%20stedsans)


[<img align="left" alt="Jakob Grøhn Damgaard | Twitter" width="30px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/twitter.svg" />][twitter-jakob]
[<img align="left" alt="Jakob Grøhn Damgaard | LinkedIn" width="30px" src="https://cdn.jsdelivr.net/npm/simple-icons@v3/icons/linkedin.svg" />][linkedin-jakob]

<br />

[twitter-jakob]: https://twitter.com/JakobGroehn
[linkedin-jakob]: https://www.linkedin.com/in/jakob-gr%C3%B8hn-damgaard-04ba51144/