# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [None]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import library and download language model

After installing stanza, we import it into our notebook, together with the other libraries used below (`requests`, `time` and `pandas`).

In [None]:
import stanza
import requests
import time
import pandas as pd

## Creating the pipeline

Download the English language model and build the pipeline. We specify that it should only tokenize the text, expand multi-word tokens and perform Named Entity Recognition:


In [None]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
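
Once the pipeline is loaded, calling `nlp(text)` returns a document whose `ents` attribute lists the recognized entities, each with a `text` and a `type`. The sketch below counts the place-like entities in such a document. The helper name `extract_place_names` and the choice of labels (`GPE`, `LOC` and `FAC`, from the OntoNotes tag set of the NER model loaded above) are assumptions made for illustration, and the stand-in document exists only so that the example runs without loading the models.

```python
from collections import Counter
from types import SimpleNamespace

# Assumption: these OntoNotes labels are the ones we treat as "places".
PLACE_LABELS = ("GPE", "LOC", "FAC")

def extract_place_names(doc, labels=PLACE_LABELS):
    """Count the entities in `doc` whose type is one of `labels`.

    `doc` is anything with an `ents` list whose items have `.text` and
    `.type`, which is how a stanza Document exposes its named entities.
    """
    return Counter(ent.text for ent in doc.ents if ent.type in labels)

# Stand-in for the result of nlp("..."), so the example runs offline.
fake_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="Gaza", type="GPE"),
    SimpleNamespace(text="Khan Yunis", type="GPE"),
    SimpleNamespace(text="Gaza", type="GPE"),
    SimpleNamespace(text="United Nations", type="ORG"),
])

place_counts = extract_place_names(fake_doc)
print(place_counts)
```

With the real pipeline, the call would be `extract_place_names(nlp(text))` for any string `text`.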


# Geocoding

Geocoding is the process of finding coordinates for a place.

The process uses APIs (Application Programming Interfaces): internet services that are designed to be called by applications rather than read by humans.

There are many APIs that provide geocoding services. They typically have a database of place names and their coordinates. If you send a geocoding API a place name, it returns the coordinates of that place (and perhaps some other data). Many of these APIs are not free. In our case, we'll use the free GeoNames API to find the coordinates of our place names.
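
A request to the GeoNames search service is an ordinary URL whose query string carries the search parameters. As a minimal sketch of how such a URL is put together, the helper `build_search_url` and the user name `demo_user` below are hypothetical placeholders, not part of GeoNames itself:

```python
from urllib.parse import urlencode

BASE_URL = "http://api.geonames.org/searchJSON"

def build_search_url(place, username, max_rows=5):
    """Return a GeoNames search URL for `place` (illustrative helper)."""
    # urlencode escapes characters that are not allowed in a URL,
    # for example the space in "Khan Yunis".
    query = urlencode({"q": place, "maxRows": max_rows, "username": username})
    return f"{BASE_URL}?{query}"

url = build_search_url("Gaza", "demo_user")
spaced = build_search_url("Khan Yunis", "demo_user", max_rows=1)
print(url)
print(spaced)
```

Building the query string with `urlencode` rather than by hand keeps place names with spaces or accented characters from producing a broken URL.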

First, try it out by pasting the following URL into your browser (make sure to replace `<your_user_name>` with your GeoNames user name):

`http://api.geonames.org/searchJSON?q=Gaza&maxRows=5&username=<your_user_name>`

Paste the response here:

{
  "totalResultsCount": 5276,
  "geonames": [
    {
      "adminCode1": "GZ",
      "lng": "34.46672",
      "geonameId": 281133,
      "toponymName": "Gaza",
      "countryId": "6254930",
      "fcl": "P",
      "population": 410000,
      "countryCode": "PS",
      "name": "Gaza",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a first-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.50161",
      "fcode": "PPLA"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.48347",
      "geonameId": 281129,
      "toponymName": "Jabālyā",
      "countryId": "6254930",
      "fcl": "P",
      "population": 168568,
      "countryCode": "PS",
      "name": "Jabalia",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "populated place",
      "adminName1": "Gaza Strip",
      "lat": "31.5272",
      "fcode": "PPL"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.30627",
      "geonameId": 281124,
      "toponymName": "Khān Yūnis",
      "countryId": "6254930",
      "fcl": "P",
      "population": 173183,
      "countryCode": "PS",
      "name": "Khan Yunis",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a second-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.34018",
      "fcode": "PPLA2"
    },
    {
      "adminCode1": "02",
      "lng": "33",
      "geonameId": 1046058,
      "toponymName": "Gaza Province",
      "countryId": "1036973",
      "fcl": "A",
      "population": 1422460,
      "countryCode": "MZ",
      "name": "Gaza Province",
      "fclName": "country, state, region,...",
      "adminCodes1": {
        "ISO3166_2": "G"
      },
      "countryName": "Mozambique",
      "fcodeName": "first-order administrative division",
      "adminName1": "Gaza Province",
      "lat": "-23.5",
      "fcode": "ADM1"
    },
    {
      "adminCode1": "GZ",
      "lng": "34.24357",
      "geonameId": 281102,
      "toponymName": "Rafaḩ",
      "countryId": "6254930",
      "fcl": "P",
      "population": 126305,
      "countryCode": "PS",
      "name": "Rafah",
      "fclName": "city, village,...",
      "adminCodes1": {

      },
      "countryName": "Palestine",
      "fcodeName": "seat of a second-order administrative division",
      "adminName1": "Gaza Strip",
      "lat": "31.29722",
      "fcode": "PPLA2"
    }
  ]
}
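
Two details of this response matter for geocoding: `lat` and `lng` are stored as strings, and a search for "Gaza" also matches a province in Mozambique. The sketch below reads an abbreviated copy of the response (the first and fourth results only). The helper name `pick_coordinates` and its `country_code` filter are illustrative assumptions, not something the GeoNames API provides.

```python
import json

# Abbreviated copy of the response shown above (two of the five results).
response_text = """
{
  "totalResultsCount": 5276,
  "geonames": [
    {"name": "Gaza", "countryCode": "PS", "lat": "31.50161", "lng": "34.46672"},
    {"name": "Gaza Province", "countryCode": "MZ", "lat": "-23.5", "lng": "33"}
  ]
}
"""

def pick_coordinates(results, country_code=None):
    """Return (lat, lng) as floats for the first result, optionally
    restricted to one country; return None when nothing matches."""
    for entry in results.get("geonames", []):
        if country_code is None or entry.get("countryCode") == country_code:
            return float(entry["lat"]), float(entry["lng"])
    return None

results = json.loads(response_text)
print(pick_coordinates(results))                     # first result overall
print(pick_coordinates(results, country_code="MZ"))  # first result in Mozambique
print(pick_coordinates(results, country_code="FR"))  # no result in France
```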

I have created a function, `get_coordinates`, that takes a place name and your GeoNames user name as arguments and returns the coordinates of the place. Please fill in your user name and run the code cell to make the function available:

In [None]:
# Import the necessary libraries
import requests
import time
import pandas as pd

# Your GeoNames user name
geonames_username = "shabirkarim"

def get_coordinates(place, username=geonames_username, fuzzy=0, timeout=1):
    """Call the GeoNames API and return the coordinates of the top result for `place`."""
    # Wait a short while so that we don't overload the server
    # (`timeout` is the pause in seconds before each request):
    time.sleep(timeout)
    url = "http://api.geonames.org/searchJSON"
    params = {
        "q": place,
        "username": username,
        "fuzzy": fuzzy,
        "maxRows": 1,              # return only the top result
        "isNameRequired": "true"   # the search term must be part of the place name
    }
    # Make the request to the GeoNames API
    response = requests.get(url, params=params)
    results = response.json()

    # Try to extract the coordinates of the first result
    try:
        result = results["geonames"][0]
        return {"latitude": result["lat"], "longitude": result["lng"]}
    except (IndexError, KeyError):  # no result was found
        return {"latitude": "NA", "longitude": "NA"}
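
One caveat: `get_coordinates` returns "NA" both when a place is not found and when the request itself fails. I am assuming here that GeoNames reports problems such as an unknown user name or an exceeded request limit as a JSON object with a `status` key instead of a `geonames` list. Under that assumption, a sketch for telling the two cases apart could look like this (the helper name `classify_response` and the sample payloads are hand-written for illustration, not real API output):

```python
def classify_response(results):
    """Tell an API error apart from an empty search result."""
    if "status" in results:
        # Assumed error shape: {"status": {"message": "...", "value": <code>}}
        message = results["status"].get("message", "unknown problem")
        return f"error: {message}"
    if not results.get("geonames"):
        return "no match"
    return "ok"

# Hand-written sample payloads:
found = {"geonames": [{"lat": "31.50161", "lng": "34.46672"}]}
empty = {"totalResultsCount": 0, "geonames": []}
failed = {"status": {"message": "user does not exist.", "value": 10}}

for payload in (found, empty, failed):
    print(classify_response(payload))
```

Checking for errors before a long geocoding run avoids filling the gazetteer with "NA" values that are really failed requests.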


Now, reuse the function above to get the coordinates for the place names we stored in the `ner_counts.tsv` file.

Write a new TSV file, `NER_gazetteer.tsv`, which contains three columns: placename, latitude, longitude.

In [None]:
# Get the place names from the TSV file
df = pd.read_csv("/content/ner_counts.tsv", sep="\t")

# Check CHATGPT Solution 3.1
# Keep only the unique place names to avoid unnecessary geocoding
unique_places = df["Place"].unique()
# Print how many unique places we need to geocode
print(f"We need to geocode {len(unique_places)} places")

# Create an empty list to store the place names and their coordinates
# Check CHATGPT Solution 3.2
gazetteer = []

# Loop through the unique place names and get the coordinates of each one
for place in unique_places:
    coords = get_coordinates(place)
    # Append a dictionary with the place name and its coordinates to the list
    gazetteer.append({
        "placename": place,
        "latitude": coords["latitude"],
        "longitude": coords["longitude"]
    })

# Convert the list of dictionaries to a pandas DataFrame
gazetteer_df = pd.DataFrame(gazetteer)

# Write the coordinates to a TSV file
gazetteer_df.to_csv("/content/NER_gazetteer.tsv", sep="\t", index=False)
print("Saved to NER_gazetteer.tsv")


We need to geocode 557 places
Saved to NER_gazetteer.tsv
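
After a run like this it is worth checking how many places could not be geocoded. By default, `pd.read_csv` reads the string "NA" back as a missing value, so the failed rows can be selected with `isna()`. The sketch below uses a small made-up gazetteer (the place "Atlantis" is invented) written to an in-memory buffer, so it runs without the real file:

```python
import io
import pandas as pd

# Small made-up gazetteer in the same three-column layout as the file above.
sample = pd.DataFrame([
    {"placename": "Gaza", "latitude": "31.50161", "longitude": "34.46672"},
    {"placename": "Atlantis", "latitude": "NA", "longitude": "NA"},
])

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
sample.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer, sep="\t")

# "NA" was parsed as a missing value, so isna() finds the failed rows.
missing = loaded[loaded["latitude"].isna()]
print(f"{len(missing)} of {len(loaded)} places have no coordinates")
print(missing["placename"].tolist())
```

For the real file, replace the buffer with the path `/content/NER_gazetteer.tsv`.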
