This notebook uses the [Stanford Named Entity Recognizer](https://nlp.stanford.edu/software/CRF-NER.shtml) to label sequences of words in a text that are the names of people.

It takes a single file as input and produces a single file as output, with each name marked as `NAME|PERSON` (e.g. `James|PERSON BAKER|PERSON`).

#### Set up NLTK and NER

In [1]:
from nltk.tag.stanford import StanfordNERTagger
from nltk.tokenize import word_tokenize
import nltk
import re

!wget 'https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip'
!unzip stanford-ner-2018-10-16.zip

nltk.download('punkt')

--2023-02-23 13:59:14--  https://nlp.stanford.edu/software/stanford-ner-2018-10-16.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 302 FOUND
Location: https://downloads.cs.stanford.edu/nlp/software/stanford-ner-2018-10-16.zip [following]
--2023-02-23 13:59:15--  https://downloads.cs.stanford.edu/nlp/software/stanford-ner-2018-10-16.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180358328 (172M) [application/zip]
Saving to: ‘stanford-ner-2018-10-16.zip’


2023-02-23 13:59:46 (5.77 MB/s) - ‘stanford-ner-2018-10-16.zip’ saved [180358328/180358328]

Archive:  stanford-ner-2018-10-16.zip
   creating: stanford-ner-2018-10-16/
  inflating: stanford-ner-2018-10-16/R

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
st = StanfordNERTagger('/content/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
                       '/content/stanford-ner-2018-10-16/stanford-ner.jar',
                       encoding='utf-8')
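
NLTK's Stanford wrappers run the tagger as a Java subprocess, so tagging fails if either file path is wrong or Java is not installed. The check below is a minimal sketch and not part of the original notebook; `missing_prerequisites` is a hypothetical helper name, and the paths are the ones used in the cell above.

```python
import os
import shutil

def missing_prerequisites(model_path, jar_path):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for path in (model_path, jar_path):
        if not os.path.isfile(path):
            problems.append(f"file not found: {path}")
    # The tagger is launched through the `java` executable
    if shutil.which("java") is None:
        problems.append("java not found on PATH")
    return problems

print(missing_prerequisites(
    '/content/stanford-ner-2018-10-16/classifiers/english.all.3class.distsim.crf.ser.gz',
    '/content/stanford-ner-2018-10-16/stanford-ner.jar'))
```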

#### Import File

In [3]:
# Download test file

!wget http://www.gutenberg.org/files/2600/2600-0.txt

--2023-02-23 14:01:58--  http://www.gutenberg.org/files/2600/2600-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.gutenberg.org/files/2600/2600-0.txt [following]
--2023-02-23 14:01:58--  https://www.gutenberg.org/files/2600/2600-0.txt
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3359405 (3.2M) [text/plain]
Saving to: ‘2600-0.txt’


2023-02-23 14:01:59 (18.6 MB/s) - ‘2600-0.txt’ saved [3359405/3359405]



In [None]:
# Upload a file of your choice

from google.colab import files

uploaded = files.upload()

Saving part_aa.txt to part_aa.txt


In [4]:
# To analyse a file of your choice, replace '2600-0.txt' with the name of your uploaded file.

with open('2600-0.txt', encoding="utf-8") as inputtext:
    text = inputtext.read()

#### Run classification

In [5]:
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)

In [6]:
print(classified_text)

# Don't worry if the output is 'data rate exceeded'; that means the output is too large for the output window.

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
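
Since printing every tagged token of a full novel overwhelms the output window, it can be easier to inspect a small slice and a tag count instead. The following is a sketch, not part of the original notebook: `preview_tags` and the `sample` list are hypothetical, and in the notebook you would pass `classified_text` in place of `sample`.

```python
from collections import Counter

def preview_tags(tagged, limit=20):
    """Return the first `limit` (token, tag) pairs and a count of every tag."""
    head = list(tagged[:limit])
    counts = Counter(tag for _, tag in tagged)
    return head, counts

# Hypothetical sample standing in for `classified_text`
sample = [("Christine", "PERSON"), ("Lagarde", "PERSON"),
          ("visited", "O"), ("France", "LOCATION"), (".", "O")]

head, counts = preview_tags(sample, limit=3)
print(head)
print(counts.most_common())
```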



#### Reformat the classifier tags (only run one option!)

In [9]:
# Option 1 (with `NAME|PERSON` output):

mylist = []

# Loop through, add first element to list (and 'PERSON' if applicable)
for i in classified_text: 
    if i[1] == "PERSON":    
        mylist.append(i[0])
        mylist.append("|"+i[1])     
    else:
        mylist.append(i[0])

# Convert list to string    
mystring = " ".join(mylist)

# Remove whitespace before the characters . , / and |
tidied = re.sub(r"\s(?=[.,/|])", "", mystring)
print(tidied)

# Don't worry if the output is 'data rate exceeded'; that means the output is too large for the output window.

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
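
To see what Option 1 produces without printing the whole book, the same steps can be run on a few hand-written tuples. This is an illustrative sketch: `mark_people` is a hypothetical wrapper around the Option 1 logic, and `sample` is made-up data in the `(token, tag)` shape that `st.tag` returns.

```python
import re

def mark_people(tagged):
    """Join tokens into text, appending |PERSON to every token tagged PERSON."""
    parts = []
    for token, tag in tagged:
        parts.append(token)
        if tag == "PERSON":
            parts.append("|" + tag)
    joined = " ".join(parts)
    # Remove the space that the join puts before . , / and |
    return re.sub(r"\s(?=[.,/|])", "", joined)

# Hypothetical sample standing in for `classified_text`
sample = [("While", "O"), ("in", "O"), ("France", "LOCATION"), (",", "O"),
          ("Christine", "PERSON"), ("Lagarde", "PERSON"), ("spoke", "O"), (".", "O")]

print(mark_people(sample))
# → While in France, Christine|PERSON Lagarde|PERSON spoke.
```

Note that the tidy-up step also removes the space before any `|` or `/` that already occurs in the source text, not only before the inserted markers.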



In [None]:
# Option 2 (with `NAME/PERSON` output):

mylist = []
# Loop through, add the first element to the list (followed by '/' and 'PERSON' if applicable)
for token, tag in classified_text:
    if tag == "PERSON":
        mylist.extend([token, "/", tag])
    else:
        mylist.append(token)

# Convert list to string  
mystring = " ".join(mylist)

# Remove whitespace before . , and /; then remove whitespace after /
tidied = re.sub(r"\s(?=[.,/])", "", mystring)
tidied = re.sub(r"/\s", "/", tidied)
print(tidied)

# Don't worry if the output is 'data rate exceeded'; that means the output is too large for the output window.

While in France, Christine/PERSON Lagarde/PERSON discussed short-term stimulus efforts in a recent interview with the Wall Street Journal.


#### Print output to file

In [10]:
# Print to file (rename the output file as appropriate)

with open("output.txt", "w", encoding="utf-8") as f:
    print(tidied, file=f)
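
As a quick sanity check on the result, the marked names can be pulled back out of the Option 1 text. This is a sketch under assumptions, not part of the original notebook: `tagged_names` is a hypothetical helper, it expects the `NAME|PERSON` format, and `sample` is made-up text standing in for `tidied` or the contents of `output.txt`.

```python
import re
from collections import Counter

def tagged_names(marked_text):
    """Return the tokens marked as TOKEN|PERSON, in order of appearance."""
    return re.findall(r"(\S+)\|PERSON\b", marked_text)

# Hypothetical sample standing in for the Option 1 output
sample = "While in France, Christine|PERSON Lagarde|PERSON spoke. Lagarde|PERSON left."

names = tagged_names(sample)
print(names)
print(Counter(names).most_common(1))
```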