# Using stanza for Named Entity Recognition (continued)

## Installation

Run the code cell below to install stanza:

In [12]:
!pip install stanza  # for stanza installation



## Import library and download language model

After installing it, we import stanza into our notebook.

In [13]:
import stanza
import os
import requests
import time

## Creating the pipeline

Download the English language model and build the pipeline. We request only the tokenize and ner processors, so the pipeline tokenizes the text and performs Named Entity Recognition; as the log output shows, stanza also loads the multi-word token (mwt) processor:


In [14]:
# Download the language model:
stanza.download("en")

# Create the pipeline, specifying the language:
nlp = stanza.Pipeline("en", processors="tokenize,ner")

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:File exists: /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Loading these models for language: en (English):
| Processor | Package                   |
-----------------------------------------
| tokenize  | combined                  |
| mwt       | combined                  |
| ner       | ontonotes-ww-multi_charlm |

INFO:stanza:Using device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: mwt
INFO:stanza:Loading: ner
INFO:stanza:Done loading processors!


# Cloning the repository

In [15]:
!git clone https://github.com/AkramHussain123/FASDH25-portfolio2.git

fatal: destination path 'FASDH25-portfolio2' already exists and is not an empty directory.
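The "fatal" message above appears because the folder was already cloned in an earlier run, so it is harmless here. If the message is unwanted, the clone can be guarded with a check. The following is only a sketch: the helper name clone_if_missing is hypothetical, and it assumes git is available on the machine.

```python
import os
import subprocess


def clone_if_missing(repo_url, target_dir):
    """Clone repo_url into target_dir unless a non-empty folder is already there.

    Hypothetical helper: returns True if git clone was run, False if skipped.
    """
    if os.path.isdir(target_dir) and os.listdir(target_dir):
        print(f"'{target_dir}' already exists, skipping clone.")
        return False
    # check=True raises an error if git itself fails
    subprocess.run(["git", "clone", repo_url, target_dir], check=True)
    return True


# Usage in the notebook (left commented out so nothing is downloaded here):
# clone_if_missing(
#     "https://github.com/AkramHussain123/FASDH25-portfolio2.git",
#     "FASDH25-portfolio2",
# )
```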


# Counting the articles

In [17]:
import os

# set the path where we saved the articles
folder = "/content/FASDH25-portfolio2/articles"

# dictionary that maps each place name to its number of occurrences
places = {}

# counter for the January 2024 articles, starting at 0
january_files = 0

# go through each file in the folder
for filename in os.listdir(folder):

    # keep only the articles whose filename contains "2024-01" (January 2024)
    if "2024-01" in filename:

        # count this article
        january_files += 1

        # Create the full path to the current file
        path = os.path.join(folder, filename)

        # Open the file and read its contents as a string
        with open(path, encoding="utf-8") as file:
            text = file.read()

        # run the stanza pipeline on the article text
        doc = nlp(text)

        # go through each sentence in the article
        for sent in doc.sentences:
            # go through each named entity in the sentence
            for ent in sent.ents:
                # keep only places: GPE (countries, cities, states) and LOC (other locations)
                if ent.type in ["GPE", "LOC"]:
                    # add 1 to this place's count (starting from 0 if it is new)
                    places[ent.text] = places.get(ent.text, 0) + 1

# print the total number of articles found for January 2024
print(f"Total number of articles in January 2024: {january_files}")

# Print the dictionary of places and their occurrence counts
print(places)

Total number of articles in January 2024: 326
{'Morocco': 13, 'Israel': 1593, 'Gaza': 1605, 'Rabat': 3, 'United States': 40, 'the United Arab Emirates': 13, 'UAE': 7, 'Bahrain': 11, 'Sudan': 3, 'US': 706, 'Western Sahara': 3, 'Washington': 60, 'Tel Aviv': 49, 'Algeria': 7, 'Marrakesh': 1, 'the Western Sahara': 1, 'Morocco’s': 1, 'Maghreb': 1, 'Ukraine': 47, 'Saudi Arabia': 39, 'California': 3, 'West Bank': 120, 'Dena': 1, 'Israel’s': 31, 'Oakland': 1, 'the United States': 97, 'South Africa': 200, 'Jordan': 42, 'Jerusalem': 26, 'East Jerusalem': 23, 'Egypt': 43, 'Qatar': 64, 'Kuala Lumpur': 4, 'Malaysia': 8, 'Palestine': 124, 'Indonesia’s': 1, 'Jakarta': 2, 'Johannesburg': 4, 'London': 17, 'Paris': 8, 'Vienna': 1, 'Berlin': 5, 'Amman': 6, 'Washington DC': 3, 'UK': 95, 'Manchester': 1, 'Yemen': 182, 'Washington, DC': 4, 'India': 50, 'Hyderabad': 1, 'Colombo’s Kollupitiya': 1, 'Namibia': 10, 'Germany': 31, 'Palestinian Territories': 1, 'Sweden': 2, 'Iran': 206, 'Kerman': 6, 'Lebanon': 175
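The raw dictionary is hard to read at a glance. One way to see the most frequent places is to sort by count. The sketch below uses collections.Counter on a handful of counts copied from the output above; the variable names sample_counts and top_three are only illustrative.

```python
from collections import Counter

# A few counts copied from the output above (not the full dictionary)
sample_counts = Counter(
    {"Morocco": 13, "Israel": 1593, "Gaza": 1605, "US": 706, "Rabat": 3}
)

# most_common(n) returns the n (place, count) pairs with the highest counts
top_three = sample_counts.most_common(3)

for place, count in top_three:
    print(f"{place}\t{count}")
```

In the notebook, the same call should work on the full dictionary with Counter(places).most_common(10).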

# Clean up the named entity names
Check if the data contains duplicates and merge the duplicates, using conditions: e.g., add the count for “Gaza’s” to “Gaza” and remove “Gaza’s” from the dictionary.

In [None]:
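def clean_up(place):
  # avoid duplicate names: strip a possessive ending,
  # written with either a straight (') or a curly (’) apostrophe
  for suffix in ("'s", "’s", "'", "’"):
    if place.endswith(suffix):
      return place[:-len(suffix)]
  return place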
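The function above only normalises a single name; the counts themselves still have to be merged. The sketch below shows one possible way to do that. The helper names normalize_place and merge_counts are hypothetical, and the sample counts are copied from the output above. Note that the names in that output use the curly apostrophe (’), so the check has to cover it as well as the straight one.

```python
from collections import Counter


def normalize_place(name):
    """Strip a trailing possessive marker (straight or curly apostrophe)."""
    for suffix in ("'s", "’s", "'", "’"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def merge_counts(counts):
    """Add up the counts of all names that normalise to the same place."""
    merged = Counter()
    for name, count in counts.items():
        merged[normalize_place(name)] += count
    return dict(merged)


# A few counts copied from the output above
sample = {"Israel": 1593, "Israel’s": 31, "Morocco": 13, "Morocco’s": 1, "Houthis’": 3}
merged_sample = merge_counts(sample)
print(merged_sample)
# {'Israel': 1624, 'Morocco': 14, 'Houthis': 3}
```

Variants such as "US", "U.S." and "the United States" are not caught by this rule; merging them would need an explicit mapping from variant to preferred name.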


# Storing data in tsv file

Write the results to a TSV file called “ner_counts.tsv” with two columns: the place name and its count.

In [18]:
filename = "ner_counts.tsv"

# open the file for writing with UTF-8 encoding so non-ASCII characters are preserved
with open(filename, mode="w", encoding="utf-8") as file:
  # column names for the top of the TSV file, separated by a tab
  # (the spaces around \t are written to the file as part of the column names)
  header = "name \t frequency \n"
  # write the header line to the file
  file.write(header)

  # go through each place in the dictionary and write a row for it
  for name, frequency in places.items():
    row = f"{name}\t{frequency}\n"
    # write this row with the place name and its count
    file.write(row)
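As an alternative, the same step can be sketched with Python's built-in csv module, which inserts the tab delimiter itself, so no stray spaces end up in the header. The function names write_counts_tsv and read_counts_tsv are hypothetical; the column names placename and count follow the task description above rather than the header used in the cell.

```python
import csv


def write_counts_tsv(counts, path):
    """Write a {placename: count} dictionary to a tab-separated file."""
    # newline="" lets the csv module control line endings itself
    with open(path, mode="w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["placename", "count"])
        for name, count in counts.items():
            writer.writerow([name, count])


def read_counts_tsv(path):
    """Read the file back into a {placename: count} dictionary."""
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return {row["placename"]: int(row["count"]) for row in reader}


# Usage in the notebook (assuming the places dictionary from above):
# write_counts_tsv(places, "ner_counts.tsv")
# print(read_counts_tsv("ner_counts.tsv"))
```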


In [19]:
# open the file in read mode and print everything inside

with open("/content/ner_counts.tsv", "r", encoding="utf-8") as file:
  print(file.read())

name 	 frequency 
Morocco	13
Israel	1593
Gaza	1605
Rabat	3
United States	40
the United Arab Emirates	13
UAE	7
Bahrain	11
Sudan	3
US	706
Western Sahara	3
Washington	60
Tel Aviv	49
Algeria	7
Marrakesh	1
the Western Sahara	1
Morocco’s	1
Maghreb	1
Ukraine	47
Saudi Arabia	39
California	3
West Bank	120
Dena	1
Israel’s	31
Oakland	1
the United States	97
South Africa	200
Jordan	42
Jerusalem	26
East Jerusalem	23
Egypt	43
Qatar	64
Kuala Lumpur	4
Malaysia	8
Palestine	124
Indonesia’s	1
Jakarta	2
Johannesburg	4
London	17
Paris	8
Vienna	1
Berlin	5
Amman	6
Washington DC	3
UK	95
Manchester	1
Yemen	182
Washington, DC	4
India	50
Hyderabad	1
Colombo’s Kollupitiya	1
Namibia	10
Germany	31
Palestinian Territories	1
Sweden	2
Iran	206
Kerman	6
Lebanon	175
Bethlehem	4
Nairoukh	1
China	28
Italy	10
Spain	7
Turkey	25
Shawawra	1
The Hague	33
South Africa’s	8
the Gaza Strip	123
Khan Younis	23
Syria	83
Mazzeh	2
Damascus	17
U.S.	11
Houthis’	3
the Red Sea	194
the Bab-el-Mandeb Strait	1
the Gulf of Aden	23
Sanaa	15
the 