# Make Location names shorter
> Some location names are really long - over 150 characters. How can they be shortened?

## Description

The names of locations, particularly private ones, can be long. Sometimes very long.
They contain the coordinates, county or state names, and country codes, which add
a lot of clutter and make them hard to read when displayed in a list, e.g. on a
web page. This notebook looks at the length of the names and what can be done to
shorten them.

This notebook looks at the data for Portugal, since the code presented here was used
to shorten the location names for the https://www.ebirders.pt/ site. However, the
lessons should be applicable to any region.

Interestingly, the format of the location names varies, presumably depending on whether
the website or the app is used to record the observations. It almost certainly varies
between versions of the mobile app too. There is no direct evidence for this, but the
variation is visible in the location names listed below.

In [10]:
import re
import datetime as dt

from django.conf import settings
from django.db.models.functions import Length

from IPython.display import display, HTML

import tabulate

from ebird.api.data.loaders import APILoader
from ebird.api.data.models import Location

## Load Data
Load data for the past 2 days. This will take a while, particularly
with an empty database, since it has to load all the species and observers.
However, it should not take too long, and you can see the progress printed
in the output. Adding checklists might also fail with a timeout or handshake
error if the eBird servers are slow to respond. This happens occasionally.
If it does, simply re-run the cell.
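
Re-running the cell by hand can also be automated. The following is a minimal sketch of a generic retry helper; `retry`, `attempts` and `delay` are names made up for this illustration, and since the exact exceptions raised on a timeout or handshake error are not documented here, it catches `Exception` by default.

```python
import time


def retry(func, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or the attempts run out."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            # Give up and re-raise the error after the final attempt.
            if attempt == attempts:
                raise
            # Otherwise pause briefly before trying again.
            time.sleep(delay)


# Hypothetical usage with the loader from the next cell:
# retry(lambda: loader.add_checklists(country_code, date), delay=5.0)
```

Narrowing `exceptions` to the specific errors you actually see would avoid retrying on genuine bugs.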

In [3]:
loader = APILoader(settings.EBIRD_API_KEY, locales=settings.EBIRD_LOCALES)

country_code = "PT"

today = dt.date.today()
number_of_days = 2
dates = [today - dt.timedelta(days=n) for n in range(number_of_days)]

for date in dates:
    loader.add_checklists(country_code, date)

loader.run_filters()
loader.publish()

2025-07-24 15:41 [INFO] Adding checklists: PT, 2025-07-24
2025-07-24 15:41 [INFO] Visits made: 76 
2025-07-24 15:41 [INFO] Adding checklist: S262360471
2025-07-24 15:41 [INFO] Scraping checklist: S262360471
2025-07-24 15:41 [INFO] Added observer: João Rodrigues
2025-07-24 15:41 [INFO] Added species: mallar3, Mallard
2025-07-24 15:41 [INFO] Adding checklist: S262363431
2025-07-24 15:41 [INFO] Scraping checklist: S262363431
2025-07-24 15:41 [INFO] Added observer: Tom Bedford
2025-07-24 15:41 [INFO] Adding checklist: S262362232
2025-07-24 15:41 [INFO] Scraping checklist: S262362232
2025-07-24 15:41 [INFO] Added observer: Francisco Pires
2025-07-24 15:41 [INFO] Added species: norlap, Northern Lapwing
2025-07-24 15:41 [INFO] Added species: grnsan, Green Sandpiper
2025-07-24 15:41 [INFO] Added species: yelgul1, Yellow-legged Gull
2025-07-24 15:41 [INFO] Added species: gubter2, Gull-billed Tern
2025-07-24 15:41 [INFO] Added species: whisto1, White Stork
2025-07-24 15:41 [INFO] Added species: 

## The Names
Sorted by length, longest first:

In [4]:
locations = Location.objects.annotate(length=Length('original')).order_by("-length")
for location in locations:
    print(location.original)

EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro, PT (37,285, -7,629)
329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT (41,179, -8,683)
24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa, PT (38.73, -9.304) Portugal
CM1142, União das freguesias de Cerva e Limões, Vila Real, PT (41,417, -7,819)
15 Avenida da República, Vila Real de Santo António, Faro, PT (37.192, -7.413)
PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)
01. Rua São Bartolomeu de La Torre area, Tavira, Faro, PT (37.122, -7.657)
Lezíria Grande de Vila Franca de Xira--área geral (acesso condicionado)
12 Avenida 12 de Novembro, Alcains, Castelo Branco, PT (39,914, -7,47)
Avenida da Barrinha 15–16, Praia de Mira PT-Centro 40.45503, -8.80143
28 Rua Frederico Garcia Selades, Cadima, Coimbra, PT (40,326, -8,643)
PP Arriba Fóssil da Costa da Caparica--Mata e Ribeira da Foz do Rego
Estação de Metro das Sete Bicas, Senhora da Hora (41,183, -8,652)
Avenida da Barrinha, Pr

Let's deal with the elephants in the list first. Roadside locations, for example:
```
Autoestrada do Sul, União das freguesias de Alcácer do Sal (Santa Maria do Castelo e Santiago) e Santa Susana, Setúbal, PT (38,509, -8,58)
```
where the name attempts to describe the location using the nearest municipalities,
are an extravagant waste of characters, particularly since you still need the
coordinates to have any idea of where the location is. To compound the problem, these
locations are typically one-time submissions, so their value is somewhat
limited.

If we examine the remaining names, then it is easier to see some general patterns:
```
<site>, <town/city>, <state>, <country code> <coordinates>
<site>, <town/city> <country code>-<region> <coordinates>
<site>, <town/city>, <county>, <region>, <country code> <coordinates>
```
though there is quite a lot of variation, as the observer can edit the name in any
way they want.
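
To see which of these shapes a given name follows, a few regular expression searches are enough. This is only a sketch: `name_features` is a hypothetical helper, the patterns assume the Portuguese country code `PT`, and hand-edited names may well defeat them.

```python
import re

# Two decimal numbers (decimal point or decimal comma) separated by "," or "x".
COORDINATES = r"[-\u2212]?\d{1,2}[.,]\d{1,7}[,x] ?[-\u2212]?\d{1,3}[.,]\d{1,7}"


def name_features(name):
    """Report which of the common elements appear in a location name."""
    return {
        "coordinates": bool(re.search(COORDINATES, name)),
        "country_code": bool(re.search(r", PT\b", name)),
        "region": bool(re.search(r"\bPT-\w+", name)),
    }


names = [
    "329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT (41,179, -8,683)",
    "Avenida da Barrinha 15–16, Praia de Mira PT-Centro 40.45503, -8.80143",
    "PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)",
]

for name in names:
    print(name_features(name))
```

The first name reports coordinates and a country code, the second reports coordinates and a region, and the hotspot name reports none of the three.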

Portugal is interesting as the district (state) and county are often (primarily) named
after the largest city, which is the administrative centre, so you end up with names
such as "Av. de Berlim 46, Lisboa, Lisboa, PT (38.768,-9.115)". Since only two of the three
administrative layers are present, it's impossible to determine whether it is city + county,
or city + state. It's probably city + state, but there's no real way of knowing.

The best option for reducing the length of the names is to display the name + town/city
and remove the rest, since county, district and country are available separately and
can be displayed as needed. That gives the maximum flexibility for displaying
names that avoid duplication/repetition and are easy to read.

If you have been paying attention, the obvious and simplest answer is to split the
name on commas, keep the first two elements of the list, and discard everything else.
That works, but not always, since commas are sometimes used to separate the elements
in the name and sometimes not. That means the names have to be checked
and corrected when necessary. For example:

In [6]:
for location in locations[:20]:
    print(location.original)
    print(",".join(location.original.split(",")[:2]))
    print()

EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro, PT (37,285, -7,629)
EM508, União das freguesias de Tavira (Santa Maria e Santiago)

329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT (41,179, -8,683)
329 Avenida Comendador Ferreira de Matos, Matosinhos

24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa, PT (38.73, -9.304) Portugal
24 Rua José de Sousa Monteiro, Porto Salvo

CM1142, União das freguesias de Cerva e Limões, Vila Real, PT (41,417, -7,819)
CM1142, União das freguesias de Cerva e Limões

15 Avenida da República, Vila Real de Santo António, Faro, PT (37.192, -7.413)
15 Avenida da República, Vila Real de Santo António

PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)
PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)

01. Rua São Bartolomeu de La Torre area, Tavira, Faro, PT (37.122, -7.657)
01. Rua São Bartolomeu de La Torre area, Tavira

Lezíria Grande de Vila Franca de Xira--área geral (ac

The practical solution is to start at the end of the name and chop things off in
the following order:

1. Coordinates (and any trailing comments)
2. Country code, or country code and region
3. State (district)

Note that #2 mentions "region". There is another administrative layer where districts
(states) are grouped into regions. These are either NUTS 1 in the case of Açores or
Madeira, or NUTS 2 where the districts are on the mainland. These only exist in the
name and are not present anywhere else. eBird locations only support two levels:
subnational1 (district/state) and subnational2 (county).

## Step 1 - Removing Coordinates

In [12]:
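# Latitude and longitude: an optional sign (ASCII hyphen or Unicode minus),
# then digits with either a decimal point or a decimal comma.
latitude_regex = r"[-\u2212]?\d{1,2}[.,]\d{1,7}"
longitude_regex = r"[-\u2212]?\d{1,3}[.,]\d{1,7}"
# Capture everything before the coordinates. The parentheses are optional and
# anything after the coordinates (trailing comments) is discarded.
coordinates_regex = r"^(.*)\b,? \(?%s[,x] ?%s\)?.*$" % (
    latitude_regex,
    longitude_regex,
)

def remove_coordinates(name):
    # Return the name unchanged when no coordinates are found.
    if re.match(coordinates_regex, name):
        name = re.sub(coordinates_regex, r"\1", name)
    return name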
    
for location in locations[:20]:
    name = remove_coordinates(location.original)
    print(location.original)
    print(name)
    print()


EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro, PT (37,285, -7,629)
EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro, PT

329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT (41,179, -8,683)
329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT

24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa, PT (38.73, -9.304) Portugal
24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa, PT

CM1142, União das freguesias de Cerva e Limões, Vila Real, PT (41,417, -7,819)
CM1142, União das freguesias de Cerva e Limões, Vila Real, PT

15 Avenida da República, Vila Real de Santo António, Faro, PT (37.192, -7.413)
15 Avenida da República, Vila Real de Santo António, Faro, PT

PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)
PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)

01. Rua São Bartolomeu de La Torre area, Tavira, Faro, PT (37.122, -7.657)
01. Rua São Bartolomeu de La Torre area, Tav
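
The coordinate pattern can be exercised on plain strings, without a database, to check the less common variants it allows for: a Unicode minus sign, an `x` separator, and no parentheses. This is a sketch only. `strip_coordinates` is a hypothetical stand-in for `remove_coordinates`, and the second and third sample names are made up for illustration. It also relies on the fact that `re.sub` returns the string unchanged when nothing matches, so the `re.match` guard is not strictly needed.

```python
import re

latitude_regex = r"[-\u2212]?\d{1,2}[.,]\d{1,7}"
longitude_regex = r"[-\u2212]?\d{1,3}[.,]\d{1,7}"
coordinates_regex = r"^(.*)\b,? \(?%s[,x] ?%s\)?.*$" % (
    latitude_regex,
    longitude_regex,
)


def strip_coordinates(name):
    # re.sub leaves the name untouched when the pattern does not match.
    return re.sub(coordinates_regex, r"\1", name)


samples = [
    "Estação de Metro das Sete Bicas, Senhora da Hora (41,183, -8,652)",
    "Praia de Mira (40.455, \u22128.801)",  # made-up name with a Unicode minus
    "Quinta do Lago 37.03x-8.02",  # made-up name with an "x" separator
    "24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa, PT (38.73, -9.304) Portugal",
    "PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)",
]

for sample in samples:
    print(strip_coordinates(sample))
```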

## Step 2 - Remove Country Code/Region

In [15]:
country_regex = r"^(.*), PT$"
region_regex = r"^(.*) PT-[\w\s]+$"

def remove_country(name):
    if re.match(country_regex, name):
        name = re.sub(country_regex, r"\1", name)
    return name

def remove_region(name: str) -> str:
    if re.match(region_regex, name):
        name = re.sub(region_regex, r"\1", name)
    return name


for location in locations[:20]:
    name = remove_coordinates(location.original)
    name = remove_country(name)
    name = remove_region(name)
    print(location.original)
    print(name)
    print()


EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro, PT (37,285, -7,629)
EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro

329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT (41,179, -8,683)
329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto

24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa, PT (38.73, -9.304) Portugal
24 Rua José de Sousa Monteiro, Porto Salvo, Lisboa

CM1142, União das freguesias de Cerva e Limões, Vila Real, PT (41,417, -7,819)
CM1142, União das freguesias de Cerva e Limões, Vila Real

15 Avenida da República, Vila Real de Santo António, Faro, PT (37.192, -7.413)
15 Avenida da República, Vila Real de Santo António, Faro

PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)
PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)

01. Rua São Bartolomeu de La Torre area, Tavira, Faro, PT (37.122, -7.657)
01. Rua São Bartolomeu de La Torre area, Tavira, Faro

Lezíria G
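
The country code is hard-wired to `PT` in the patterns above. For other regions, one possible approach is to build the two patterns from the country code. This is a sketch under the assumption that names elsewhere follow the same `, <code>` and ` <code>-<region>` shapes, which has not been checked here; `make_country_remover` and the Spanish example name are hypothetical.

```python
import re


def make_country_remover(country_code):
    """Return a function that strips a trailing country code or region."""
    code = re.escape(country_code)
    country_regex = re.compile(r"^(.*), %s$" % code)
    region_regex = re.compile(r"^(.*) %s-[\w\s]+$" % code)

    def remove(name):
        # Each substitution is a no-op when its pattern does not match.
        name = country_regex.sub(r"\1", name)
        name = region_regex.sub(r"\1", name)
        return name

    return remove


remove_pt = make_country_remover("PT")
remove_es = make_country_remover("ES")

print(remove_pt("329 Avenida Comendador Ferreira de Matos, Matosinhos, Porto, PT"))
print(remove_pt("Avenida da Barrinha 15–16, Praia de Mira PT-Centro"))
print(remove_es("Calle Mayor 1, Madrid, ES"))  # made-up name
```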

## Step 3 - Remove State

def remove_state(name, location):
    # Escape the state name so any regex metacharacters in it match literally.
    state_regex = r"^(.*), %s$" % re.escape(location.state.name)
    if re.match(state_regex, name):
        name = re.sub(state_regex, r"\1", name)
    return name
    
for location in locations[:20]:
    name = remove_coordinates(location.original)
    name = remove_country(name)
    name = remove_region(name)
    name = remove_state(name, location)
    print(location.original)
    print(name)
    print()
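
Since the state is always matched as a literal suffix, a regular expression is not strictly required for this step. A minimal regex-free sketch, using a hypothetical `remove_trailing_part` helper that takes the state name as a plain string rather than a `Location`:

```python
def remove_trailing_part(name, part):
    """Drop ", <part>" from the end of the name, if it is there."""
    suffix = ", " + part
    if name.endswith(suffix):
        return name[: -len(suffix)]
    return name


# Only the final "Lisboa" is removed, leaving the site name and the city.
print(remove_trailing_part("Av. de Berlim 46, Lisboa, Lisboa", "Lisboa"))
# Names that do not end with the state are returned unchanged.
print(
    remove_trailing_part(
        "PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)",
        "Lisboa",
    )
)
```

Because the suffix is compared literally, no escaping of the state name is needed.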

## Putting it all together

In [18]:
for location in locations:
    name = remove_coordinates(location.original)
    name = remove_country(name)
    name = remove_region(name)
    name = remove_state(name, location)
    print(name)
    print("%s, %s, %s" % (location.county.name, location.state.name, location.country.name))
    print()

EM508, União das freguesias de Tavira (Santa Maria e Santiago)
Tavira, Faro, Portugal

329 Avenida Comendador Ferreira de Matos, Matosinhos
Matosinhos, Porto, Portugal

24 Rua José de Sousa Monteiro, Porto Salvo
Oeiras, Lisboa, Portugal

CM1142, União das freguesias de Cerva e Limões
Ribeira de Pena, Vila Real, Portugal

15 Avenida da República, Vila Real de Santo António
Vila Real de Santo António, Faro, Portugal

PN Sintra-Cascais--Parque e Palácio Nacional da Pena (acesso condicionado)
Sintra, Lisboa, Portugal

01. Rua São Bartolomeu de La Torre area, Tavira
Tavira, Faro, Portugal

Lezíria Grande de Vila Franca de Xira--área geral (acesso condicionado)
Vila Franca de Xira, Lisboa, Portugal

12 Avenida 12 de Novembro, Alcains
Castelo Branco, Castelo Branco, Portugal

Avenida da Barrinha 15–16, Praia de Mira
Mira, Coimbra, Portugal

28 Rua Frederico Garcia Selades, Cadima
Cantanhede, Coimbra, Portugal

PP Arriba Fóssil da Costa da Caparica--Mata e Ribeira da Foz do Rego
Almada, Setúba

## Conclusion
Reducing the length of location names by trimming the coordinates, country code,
region, and state is easily doable. Chaining together small functions, each of which
returns either the cleaned name or the original, makes it easy to build a small library
that removes all the redundancy from a name that is displayed
alongside the names of the country and state.
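
As a rough illustration of that chaining, the steps can be collected in a list and applied in order. This sketch works on plain strings rather than `Location` objects, so the state name is passed in explicitly. `strip`, `shorten` and the constant names are made up for this example, and the patterns are the Portugal-specific ones from the steps above.

```python
import re
from functools import partial

LATITUDE = r"[-\u2212]?\d{1,2}[.,]\d{1,7}"
LONGITUDE = r"[-\u2212]?\d{1,3}[.,]\d{1,7}"
COORDINATES = re.compile(r"^(.*)\b,? \(?%s[,x] ?%s\)?.*$" % (LATITUDE, LONGITUDE))
COUNTRY = re.compile(r"^(.*), PT$")
REGION = re.compile(r"^(.*) PT-[\w\s]+$")


def strip(regex, name):
    # Keep the captured prefix; the name is unchanged when nothing matches.
    return regex.sub(r"\1", name)


def remove_state(name, state):
    return strip(re.compile(r"^(.*), %s$" % re.escape(state)), name)


def shorten(name, state):
    """Apply each cleaning step in turn, working from the end of the name."""
    steps = [
        partial(strip, COORDINATES),
        partial(strip, COUNTRY),
        partial(strip, REGION),
        partial(remove_state, state=state),
    ]
    for step in steps:
        name = step(name)
    return name


print(shorten(
    "EM508, União das freguesias de Tavira (Santa Maria e Santiago), Faro, PT (37,285, -7,629)",
    "Faro",
))
print(shorten(
    "Avenida da Barrinha 15–16, Praia de Mira PT-Centro 40.45503, -8.80143",
    "Coimbra",
))
```

Keeping the steps in a list makes the order explicit and makes it straightforward to add or drop a step for another region.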