# 2B. Use stanza to extract all place names from (part of) the corpus

## Installation

Run the code cell below to install stanza:

In [None]:
# Installing Stanza: an NLP library, used for tasks such as Named Entity Recognition (NER).
!pip install stanza



## Import library and download language model

After installing it, we import stanza into our notebook.

In [None]:
# Import Stanza so that we can perform NER in this notebook
import stanza

## Creating the pipeline

Download the English language model and build the pipeline (we specify that it should only tokenize the text, separate multiword tokens and perform Named Entity Recognition):


In [None]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


### Multiple files
Since we can do this for one file, we can also do it for a large number of files!

Let's download our FASDH25 git repository here. Cloning a git repository is a shell command rather than Python code, so in Colab we put an exclamation mark before `git` (as we did with `pip`). Complete the command below and run it:

In [None]:
# clone our FASDH25 folder here:
!git clone https://github.com/Faizan6661/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.
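
The `fatal` message above appears when the clone cell is re-run in a session where the folder already exists and is not empty; in that case the existing copy can simply be reused. A minimal sketch of a guard for this situation, assuming a hypothetical helper named `needs_clone` (it is not part of the original notebook):

```python
import os

def needs_clone(repo_dir):
    """Return True unless repo_dir is an existing, non-empty directory."""
    # git refuses to clone into a directory that exists and is not empty
    return not os.path.isdir(repo_dir) or not os.listdir(repo_dir)

# hypothetical usage: decide whether the clone cell still has to be run
if needs_clone("FASDH25-portfolio2"):
    print("Repository not found: run the git clone cell.")
else:
    print("Repository already present: skipping the clone.")
```

The check only looks at whether the folder exists and has content; it does not verify that the folder really is a complete copy of the repository.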


We can now loop through the articles in the folder as we did when we were using regex to find filenames:

In [None]:
# bring in the module needed to interact with the file system
import os

# create an empty dictionary to record places and how often they appear
places = {}
# set the location of the folder containing the articles
folder = "/content/FASDH25-portfolio2/articles"
# start a counter to track how many articles are from January 2024
jan_2024_article_count = 0

# go through every file in the folder
for filename in os.listdir(folder):
    # look for filenames that include '2024-01'
    if "2024-01" in filename:
        # increment the article counter
        jan_2024_article_count += 1
        # build the full path to the file
        path = os.path.join(folder, filename)
        # open the file for reading
        with open(path, encoding="utf-8") as file:
            # load the entire file content into a string
            text = file.read()
            # analyze the text using the previously created NLP pipeline
            doc = nlp(text)
            # go through each identified named entity in the text
            for e in doc.entities:
                # check if the entity is a geopolitical entity (GPE) or a location (LOC)
                if e.type in ["GPE", "LOC"]:
                    # strip surrounding whitespace from the entity text
                    place = e.text.strip()
                    # increase the count for this place
                    places[place] = places.get(place, 0) + 1

# display the total number of January 2024 articles
print("Number of articles from January 2024:", jan_2024_article_count)
# show the dictionary of places and their frequencies
print(places)
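
The filtering and counting step of the loop can be tried without loading the model. The sketch below uses a hypothetical `Entity` stand-in that mimics the `.text` and `.type` attributes read from `doc.entities`, and swaps the manual `places.get(place, 0) + 1` pattern for `collections.Counter`, which should produce the same tallies. The sample entities are made up for illustration and do not come from the corpus.

```python
from collections import Counter, namedtuple

# hypothetical stand-in for the entity objects found in doc.entities
Entity = namedtuple("Entity", ["text", "type"])

# made-up sample entities, for illustration only
sample_entities = [
    Entity("Gaza", "GPE"),
    Entity("Red Sea", "LOC"),
    Entity("United Nations", "ORG"),
    Entity(" Gaza ", "GPE"),
]

# keep only geopolitical entities and locations, then count them
place_counts = Counter(
    e.text.strip() for e in sample_entities if e.type in ("GPE", "LOC")
)

print(dict(place_counts))  # {'Gaza': 2, 'Red Sea': 1}
```

The organisation is dropped because its type is neither `GPE` nor `LOC`, and the two spellings of Gaza are merged because `strip()` removes the surrounding spaces.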


### Clean up the named entity names
We’ll now examine the extracted data to identify any duplicate place names. After that, we’ll standardize the names to ensure consistency, and finally, combine the counts of repeated entries under a single, unified version of each place name.

In [None]:
import re

normalized_places = {}

for place, count in places.items():
    # Remove possessive endings like 's
    place = re.sub(r"[’'`]s\b", "", place)

    # Strip out punctuation marks (this also removes hyphens, so a name like "al-Bayda" would become "alBayda")
    place = re.sub(r"[^\w\s]", "", place)

    # Remove a leading "the" (e.g., "The Netherlands" → "Netherlands"); see conversation 1 in the AI documentation
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # case-insensitive match for "the"

    # Combine counts for duplicate entries after normalization
    if place in normalized_places:
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# Display the final cleaned place names along with their total frequencies
print(normalized_places)


{}
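
To see what the three regex steps do, it can help to run them on a handful of names. The sketch below wraps the same substitutions in a hypothetical helper called `normalize_place` and applies it to made-up sample counts (not taken from the corpus):

```python
import re

def normalize_place(place):
    """Apply the same three regex steps as the cell above to one name."""
    place = re.sub(r"[’'`]s\b", "", place)   # drop possessive endings
    place = re.sub(r"[^\w\s]", "", place)    # drop punctuation (hyphens too)
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)  # drop leading "the"
    return place

# made-up sample counts, for illustration only
sample = {
    "Gaza": 5,
    "Gaza's": 2,
    "the Netherlands": 1,
    "Netherlands": 3,
    "Bab al-Mandeb Strait": 1,
}

# merge the counts of names that normalize to the same string
merged = {}
for name, count in sample.items():
    key = normalize_place(name)
    merged[key] = merged.get(key, 0) + count

print(merged)  # {'Gaza': 7, 'Netherlands': 4, 'Bab alMandeb Strait': 1}
```

The last entry shows a side effect of the punctuation step: hyphens are removed without leaving a space, which is how forms such as `Bab alMandeb Strait` can end up in the counts.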


### Storing data in a tsv file

We can now store the counts in a tsv file, so we can reuse it in a different script.

Let's create a tsv file with two columns: "name" and "frequency".
We'll create the tsv file in two steps:

1. We create the header: that is, the column names, separated by tabs.
2. We loop through all the place names and create a new row in the table for each place. Each row contains the place name and its frequency, separated by a tab. Each row has to start on a new line, so we also have to add a newline character `\n` to the row. Should we add it at the beginning of the line, at the end, or both?



In [None]:
filename = "ner_counts.tsv"
# open the file in writing mode and with unicode UTF-8 encoding:
with open(filename, mode="w", encoding="utf-8") as file:
    # create the header of the tsv file, which consists of the column names separated by a tab:
    header = "name\tfrequency\n"
    # write the header to the file:
    file.write(header)
    # now loop through the normalized_places dictionary and create a new row for each item:
    for entity, count in normalized_places.items():
        # each row holds the place name and its count, separated by a tab and ended by a newline:
        row = f"{entity}\t{count}\n"
        # finally, write the row to the file:
        file.write(row)
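
As an alternative to building each row by hand, Python's standard `csv` module can write tab-separated rows when given `delimiter="\t"`. This is a sketch of that swapped-in approach, not the method used in the cell above; the counts are made up, and the output goes to an in-memory buffer so that nothing is written to disk.

```python
import csv
import io

# made-up sample counts, for illustration only
sample_counts = {"Gaza": 7, "Red Sea": 2}

# write to an in-memory buffer; a real script would pass an open file instead
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["name", "frequency"])   # header row
for name, frequency in sample_counts.items():
    writer.writerow([name, frequency])   # one row per place

print(buffer.getvalue())
```

When writing to a real file with the `csv` module, the Python documentation recommends opening it with `newline=""` so that line endings are not translated twice.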

The file is now stored in our Colab session environment. You can see it by clicking the folder icon in the left-hand toolbar in Colab. Double-click the file to view it in Colab, or right-click it and choose "Download" to download it.

To access it in your script, use the path `/content/ner_counts.tsv`
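
A script that reuses the file will usually want the counts back as numbers rather than as text. The sketch below is one possible way to do that with `csv.DictReader`; to stay self-contained it reads from a string holding the first three rows of the file's printed contents instead of opening `/content/ner_counts.tsv`.

```python
import csv
import io

# stand-in for the saved file: its header and first three rows
tsv_text = "name\tfrequency\nMorocco\t14\nIsrael\t392\nGaza\t414\n"

# in a real script, replace io.StringIO(tsv_text) with
# open("/content/ner_counts.tsv", encoding="utf-8", newline="")
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")

# map each place name to its frequency, converted back to an integer
counts = {row["name"]: int(row["frequency"]) for row in reader}

# sort the places by frequency, highest first
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
print(top)  # [('Gaza', 414), ('Israel', 392), ('Morocco', 14)]
```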

In [None]:
with open("/content/ner_counts.tsv", encoding="utf-8") as file:
  print(file.read())

name	frequency
Morocco	14
Israel	392
Gaza	414
Rabat	3
United States	43
United Arab Emirates	4
UAE	3
Bahrain	4
Sudan	1
US	202
Western Sahara	4
Washington	21
Tel Aviv	12
Algeria	3
Marrakesh	1
Maghreb	1
Ukraine	7
Saudi Arabia	4
California	1
West Bank	58
Dena	1
Oakland	1
South Africa	47
Jordan	9
Jerusalem	5
East Jerusalem	5
Egypt	13
Qatar	20
Kuala Lumpur	2
Malaysia	3
Palestine	27
Indonesia	1
Jakarta	1
Johannesburg	1
London	6
Paris	5
Vienna	1
Berlin	2
Amman	3
Washington DC	5
UK	13
Manchester	1
Yemen	54
India	5
Hyderabad	1
Colombo Kollupitiya	1
Namibia	9
Germany	13
Palestinian Territories	1
Sweden	2
Iran	65
Kerman	4
Lebanon	50
Bethlehem	3
Nairoukh	1
China	12
Italy	4
Spain	3
Turkey	14
Shawawra	1
Hague	6
Gaza Strip	40
Khan Younis	7
Syria	14
Mazzeh	1
Damascus	3
Houthis	1
Red Sea	68
BabelMandeb Strait	1
Gulf of Aden	5
Sanaa	7
United Kingdom	9
Hodeidah	2
Taiz	2
Dhamar	1
alBayda	1
Saada	2
Arabian Sea	2
Bab alMandeb Strait	1
Asia	5
Europe	6
Kuwait	1
Middle East	22
Ankara	7
West	4
Tehran	4
South Car