# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [1]:
!pip install stanza

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch>=1.3.0->stanza)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch>=1.3.0->stanza)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch>=1.3.0->stanza)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata 

## Import the library

After installing it, we import stanza into our notebook.

In [3]:
#Importing stanza
import stanza

## Creating the pipeline

Download the English language model and build the pipeline. The `processors` argument restricts the pipeline to three steps: tokenization (`tokenize`), multi-word token expansion (`mwt`) and Named Entity Recognition (`ner`):


In [4]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline(lang="en", processors='tokenize,mwt,ner')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cuda
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!
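
The NER model loaded above is trained on OntoNotes, whose label set includes `GPE` (countries, cities, states) and `LOC` (other locations such as seas and mountain ranges). As a minimal sketch of how place names can be pulled out of a processed document, the helper below filters entities by those two labels. The function name `extract_places` and the stand-in document are illustrative assumptions: the stand-in only mimics the `.entities`, `.text` and `.type` attributes of a stanza document, so that the sketch runs without downloading any model.

```python
from types import SimpleNamespace

PLACE_TYPES = {"GPE", "LOC"}  # entity labels treated as place names


def extract_places(doc):
    """Return the stripped text of every entity in `doc` labelled GPE or LOC."""
    return [ent.text.strip() for ent in doc.entities if ent.type in PLACE_TYPES]


# Stand-in for a processed document (hypothetical data, not real model output).
fake_doc = SimpleNamespace(entities=[
    SimpleNamespace(text="Gaza", type="GPE"),
    SimpleNamespace(text="the Red Sea ", type="LOC"),
    SimpleNamespace(text="Netanyahu", type="PERSON"),
])

print(extract_places(fake_doc))  # ['Gaza', 'the Red Sea']
```

With the real pipeline, the same helper would be called as `extract_places(nlp(text))`.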


In [5]:
# Clone the GitHub repository containing your article corpus
!git clone https://github.com/Aqsa-2004/FASDH25-portfolio2.git

Cloning into 'FASDH25-portfolio2'...
remote: Enumerating objects: 4437, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 4437 (delta 23), reused 15 (delta 13), pack-reused 4399 (from 2)
Receiving objects: 100% (4437/4437), 19.25 MiB | 19.03 MiB/s, done.
Resolving deltas: 100% (41/41), done.


In [6]:
import os

# Create a dictionary to store place name counts
places = {}

# Folder path to your cloned repository's articles
folder = "/content/FASDH25-portfolio2/articles"

january_articles = 0  # Counter for January 2024 articles

In [13]:
# Loop through the article files from January 2024.
# Note: re-running this cell without first re-running the cell above
# adds to the existing counts instead of starting from zero.
for filename in os.listdir(folder):
    # keep only .txt files whose name begins with "2024-01-":
    if filename.startswith("2024-01-") and filename.endswith(".txt"):
        january_articles += 1  # count the article
        path = os.path.join(folder, filename)  # build the path to the file

        # open and read the file:
        with open(path, encoding="utf-8") as file:
            text = file.read()

        # use the nlp pipeline to analyse the text:
        doc = nlp(text)

        # select only the entities that are place names:
        for e in doc.entities:
            if e.type in ["GPE", "LOC"]:
                place = e.text.strip()
                # update the dictionary with this place:
                if place in places:
                    places[place] += 1
                else:
                    places[place] = 1

# Finally, print how many articles were processed and the frequency of each place
print("Number of articles from January 2024:", january_articles)
print(places)



Number of articles from January 2024: 1127
{'Israel': 5456, 'Gaza': 5503, 'Palestine': 427, 'the United States': 335, 'Welch’s': 4, 'US': 2441, 'Iraq': 212, 'United States': 142, 'West': 87, 'the Global South': 8, 'Qatar': 232, 'Gulf': 37, 'Egypt': 145, 'East Jerusalem': 85, 'Netanyahu’s': 23, 'Gaza Strip': 106, 'the Gaza Strip': 427, 'South Africa': 689, 'Russia': 153, 'Ukraine': 170, 'China': 89, 'South Africa’s': 30, 'Malaysia': 26, 'Turkey': 92, 'Jordan': 147, 'Bolivia': 14, 'Maldives': 4, 'Namibia': 39, 'Pakistan': 74, 'Columbia': 10, 'Khan Younis': 76, 'Middle East': 88, 'The Hague': 114, 'Bangladesh': 7, 'Comoros': 7, 'Djibouti': 14, 'Netherlands': 51, 'The United States': 73, 'The United Kingdom': 10, 'Myanmar': 22, 'Beirut': 282, 'Dahiyeh': 22, 'Lebanon': 586, 'Iran': 687, 'Yemen': 624, 'Beirut’s Shatila': 4, 'Red Sea': 171, 'Africa': 98, 'the Red Sea': 662, 'Gulf of Aden': 14, 'the Cape of Good Hope': 43, 'Singapore': 8, 'the Gulf of Aden': 83, 'The Red Sea': 18, 'Mediterrane
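
The if/else tally used above can also be expressed with `collections.Counter` from the standard library, which treats missing keys as zero. The sketch below is an alternative formulation, not the notebook's own method; the `articles` list is hypothetical data standing in for the GPE/LOC entities the pipeline would return for each article.

```python
from collections import Counter

# Hypothetical per-article lists of place names (not real model output).
articles = [
    ["Gaza", "Israel", "Gaza"],
    ["Gaza", "the Red Sea"],
]

places_counter = Counter()
for article_places in articles:
    # update() adds 1 for every occurrence in the list
    places_counter.update(article_places)

print(places_counter.most_common(1))  # [('Gaza', 3)]
```

A `Counter` behaves like a dictionary, so it can be passed to later steps unchanged, and `most_common()` returns the places sorted by frequency.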

In [14]:
import re

normalized_places = {}

for place, count in places.items():
    # Step 1: Remove possessives like 's
    place = re.sub(r"[’'`]s\b", "", place)

    # Step 2: Remove remaining punctuation
    place = re.sub(r"[^\w\s]", "", place)

    # Step 3: Remove a leading "the" (case-insensitive, so "The" is removed too)
    place = re.sub(r"^the\s+", "", place, flags=re.IGNORECASE)

    # Step 4: Merge counts for normalized places
    if place in normalized_places:
        normalized_places[place] += count
    else:
        normalized_places[place] = count

# Printing the clean place names
print(normalized_places)

{'Israel': 5567, 'Gaza': 5565, 'Palestine': 427, 'United States': 557, 'Welch': 4, 'US': 2483, 'Iraq': 218, 'West': 87, 'Global South': 8, 'Qatar': 235, 'Gulf': 37, 'Egypt': 149, 'East Jerusalem': 85, 'Netanyahu': 23, 'Gaza Strip': 550, 'South Africa': 719, 'Russia': 153, 'Ukraine': 170, 'China': 95, 'Malaysia': 26, 'Turkey': 92, 'Jordan': 151, 'Bolivia': 14, 'Maldives': 4, 'Namibia': 39, 'Pakistan': 74, 'Columbia': 10, 'Khan Younis': 76, 'Middle East': 354, 'Hague': 133, 'Bangladesh': 7, 'Comoros': 7, 'Djibouti': 14, 'Netherlands': 51, 'United Kingdom': 153, 'Myanmar': 22, 'Beirut': 293, 'Dahiyeh': 22, 'Lebanon': 597, 'Iran': 696, 'Yemen': 645, 'Beirut Shatila': 4, 'Red Sea': 855, 'Africa': 98, 'Gulf of Aden': 97, 'Cape of Good Hope': 43, 'Singapore': 8, 'Mediterranean': 38, 'Indian Ocean': 8, 'Europe': 103, 'Asia': 61, 'Spain': 25, 'Canada': 152, 'Australia': 49, 'Britain': 48, 'Germany': 109, 'Italy': 37, 'Switzerland': 31, 'Finland': 11, 'Estonia': 4, 'Japan': 33, 'Austria': 12, 'R
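
To see how the merging works on individual names, the sketch below wraps the same three regular expressions in a function and applies it to a few variants copied from the raw output above. The function name `normalize_place` and the final `.strip()` are additions for illustration. Only variants visible in the truncated raw printout are included, so the `Red Sea` total here (851) is lower than the 855 in the full result, which presumably merges further variants.

```python
import re
from collections import Counter


def normalize_place(name):
    """Strip possessives, punctuation and a leading 'the' from a place name."""
    name = re.sub(r"[’'`]s\b", "", name)                       # possessive 's
    name = re.sub(r"[^\w\s]", "", name)                        # other punctuation
    name = re.sub(r"^the\s+", "", name, flags=re.IGNORECASE)   # leading "the"
    return name.strip()


# Counts copied from the raw (un-normalized) output above.
raw_counts = {
    "South Africa": 689,
    "South Africa’s": 30,
    "Red Sea": 171,
    "the Red Sea": 662,
    "The Red Sea": 18,
}

merged = Counter()
for name, count in raw_counts.items():
    merged[normalize_place(name)] += count

print(dict(merged))  # {'South Africa': 719, 'Red Sea': 851}
```

One side effect is worth noting: because the leading article is always removed, `The Hague` becomes `Hague`, as the normalized output above shows. Names where the article is part of the place name may need an exception list.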

In [20]:
filename = "ner_counts.tsv"
# open the file for writing:
with open(filename, mode="w", encoding="utf-8") as file:
    # write the header row of the tsv file:
    header = "place\tcount\n"
    file.write(header)
    # loop through the normalized_places dictionary, writing one row per place:
    for place, count in normalized_places.items():
        row = f"{place}\t{count}\n"
        file.write(row)

# reopen the file and print the normalised results:
with open(filename, encoding="utf-8") as file:
    print(file.read())


place	count
Israel	5567
Gaza	5565
Palestine	427
United States	557
Welch	4
US	2483
Iraq	218
West	87
Global South	8
Qatar	235
Gulf	37
Egypt	149
East Jerusalem	85
Netanyahu	23
Gaza Strip	550
South Africa	719
Russia	153
Ukraine	170
China	95
Malaysia	26
Turkey	92
Jordan	151
Bolivia	14
Maldives	4
Namibia	39
Pakistan	74
Columbia	10
Khan Younis	76
Middle East	354
Hague	133
Bangladesh	7
Comoros	7
Djibouti	14
Netherlands	51
United Kingdom	153
Myanmar	22
Beirut	293
Dahiyeh	22
Lebanon	597
Iran	696
Yemen	645
Beirut Shatila	4
Red Sea	855
Africa	98
Gulf of Aden	97
Cape of Good Hope	43
Singapore	8
Mediterranean	38
Indian Ocean	8
Europe	103
Asia	61
Spain	25
Canada	152
Australia	49
Britain	48
Germany	109
Italy	37
Switzerland	31
Finland	11
Estonia	4
Japan	33
Austria	12
Romania	15
West Bank	576
Syria	289
October7	7
Jerusalem	92
Dearborn	47
Michigan	46
Mackinac Island	4
Great Lakes	4
Lake Michigan	4
Afghanistan	24
Texas	12
Beit Nabala	4
Idlib	10
Hamas	21
Tel Aviv	171
Washington	221
Cairo	21
Doha	69
Nuseira