In [31]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
import os
import requests

os.makedirs("data", exist_ok=True)
with open("data/wiki_us.txt", "wb") as f:
  request = requests.get("https://raw.githubusercontent.com/wjbmattingly/freecodecamp_spacy/main/data/wiki_us.txt")
  f.write(request.content)

___

# **1. Linguistic Annotations**

In [None]:
with open("data/wiki_us.txt", "r") as f:
  text = f.read()

## **1.1. Creating a Doc Container**

In [None]:
doc = nlp(text)

In [None]:
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [None]:
len(doc), len(text)

(654, 3521)

In [None]:
for token in text[:10]:
  print(token)

T
h
e
 
U
n
i
t
e
d


In [None]:
for token in doc[:10]:
  print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


``len(doc)`` counts tokens, whereas ``len(text)`` counts every character in the string: letters, punctuation, whitespace, and so on.

In [None]:
for token in text.split()[:10]:
  print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


Unlike ``split()``, the Doc object separates punctuation from a token when the punctuation is not part of that token (compare ``(U.S.A.`` above with ``(`` and ``U.S.A.``). ``split()`` only breaks the string on whitespace, so punctuation stays attached to the neighboring word.
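As a small sketch of this difference, the snippet below compares ``str.split()`` with spaCy's tokenizer on a short phrase. It assumes spaCy is installed and uses ``spacy.blank("en")``, which provides the English tokenizer without a trained model; the sample phrase and the variable names are only illustrative.

```python
import spacy

# A blank English pipeline contains only the tokenizer (no trained components).
nlp_blank = spacy.blank("en")

sample = "America (U.S.A. or USA), commonly known"

whitespace_tokens = sample.split()
spacy_tokens = [token.text for token in nlp_blank(sample)]

print(whitespace_tokens)  # punctuation stays glued to the words
print(spacy_tokens)       # brackets and the comma become their own tokens
```

The tokenizer should yield more items than ``split()`` because each bracket and comma is counted as a separate token.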

___

## **1.2. Sentence Boundary Detection**

``split('.')`` breaks the text at every period, so it splits not only sentences but also decimal numbers, abbreviations containing periods, and so on. This is where spaCy's sentence boundary detection is very helpful.
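To make the problem concrete, here is a minimal plain-Python sketch of the naive approach. The sample sentence is made up for illustration; it only needs to contain an abbreviation and a decimal number.

```python
sample = "The U.S. covers 3.8 million square miles. It has 50 states."

# Naive approach: break the string at every period.
naive_parts = [part.strip() for part in sample.split(".") if part.strip()]

for part in naive_parts:
    print(repr(part))
```

The two sentences come back as five fragments, because the periods in ``U.S.`` and ``3.8`` are treated as sentence boundaries too.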

In [None]:
for sent in doc.sents:
  print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [None]:
sentence1 = list(doc.sents)[0]
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

``doc.sents`` is a generator, so it isn't subscriptable; we need to turn it into a list before indexing.
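The behaviour is the same for any Python generator. The sketch below uses a hypothetical stand-in generator, ``fake_sents``, instead of a real spaCy object, and shows a few ways to get items out of it; the same pattern should carry over to a sentence generator.

```python
from itertools import islice

def fake_sents():
    # Hypothetical stand-in for a sentence generator: yields one item at a time.
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

try:
    fake_sents()[0]  # generators do not support indexing
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

first = next(fake_sents())                  # only the first item
first_two = list(islice(fake_sents(), 2))   # only the first two items
everything = list(fake_sents())             # full list, then index freely
print(first, first_two, everything[2])
```

``next()`` and ``islice()`` avoid building the whole list, which can matter for long documents when only the first few sentences are needed.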

___

## **1.3. Token Attributes**

Let's pick two tokens: a verb (``known``) and a proper noun (``States``).

In [None]:
token1 = sentence1[12]
token2 = sentence1[2]
print(token1)
print(token2)

known
States


``text`` gives the token's text as a string.

In [None]:
print(token1.text)
print(token2.text)

known
States


``head`` returns the token's syntactic parent. For 'States', the nominal subject, that is the verb 'is'.

In [None]:
token2.head

is

``left_edge`` and ``right_edge`` return the leftmost and rightmost tokens of the token's syntactic subtree. This is useful when a token belongs to a multi-word phrase, as 'States' belongs to **'The United States of America'**: here ``left_edge`` gives **'The'** and ``right_edge`` gives **'America'**.
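The idea of a subtree and its edges can be sketched in plain Python. The parse below is a hypothetical toy (the ``heads`` list is hand-written, not produced by spaCy), and ``subtree`` is an illustrative helper rather than a spaCy function.

```python
# Hypothetical toy parse: heads[i] is the index of token i's syntactic parent.
tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
heads = [2, 2, 5, 2, 3, 5, 7, 5]  # "is" is the root, so it points to itself

def subtree(index):
    """Return the sorted indices of a token and all of its descendants."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in found and child not in found:
                found.add(child)
                changed = True
    return sorted(found)

nodes = subtree(tokens.index("States"))
left_edge = tokens[nodes[0]]    # leftmost token of the subtree
right_edge = tokens[nodes[-1]]  # rightmost token of the subtree
print(left_edge, right_edge)
```

In this toy parse every word of the phrase hangs off 'States', directly or indirectly, so the subtree runs from 'The' to 'America'.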


In [None]:
token2.left_edge

The

In [None]:
token2.right_edge

America

Let's look at the entity type.
``ent_type`` gives the integer ID of the entity type, whereas ``ent_type_`` returns the string equivalent of that entity type.

In [None]:
token2.ent_type

384

In [None]:
token2.ent_type_

'GPE'

``GPE:`` **geopolitical entity**

Countries, cities, states

``ent_iob_`` returns

**'I'**, if the token is inside an entity

**'B'**, if the token begins an entity

**'O'**, if the token is outside an entity

In this case, 'States' is inside the entity 'The United States of America'.

In [None]:
token2.ent_iob_

'I'
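The B/I/O scheme can be sketched in a few lines of plain Python. ``iob_tags`` is a hypothetical helper written for illustration, and the entity span below is hand-written rather than predicted by a model.

```python
def iob_tags(num_tokens, entity_spans):
    """Tag each token index as 'B', 'I', or 'O' given (start, end) spans.

    Spans use token indices with an exclusive end, like doc[start:end].
    """
    tags = ["O"] * num_tokens
    for start, end in entity_spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Hypothetical entity covering "The United States of America".
tags = iob_tags(len(tokens), [(0, 5)])

for token, tag in zip(tokens, tags):
    print(token, tag)
```

Only the first token of the entity gets 'B'; 'States' sits in the middle of the span, which is why it is tagged 'I'.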

``lemma_`` returns the base form of the token

**known** -- **know**

In [None]:
token1.lemma_

'know'

``morph`` returns the morphological analysis of the word.

**known** -- past participle (perfect aspect)

In [None]:
token1.morph

Aspect=Perf|Tense=Past|VerbForm=Part

``pos_`` stands for **part of speech** and returns the grammatical category of the token in the sentence.

*known -- verb*

*States -- proper noun*

In [None]:
token1.pos_

'VERB'

In [None]:
token2.pos_

'PROPN'

``dep_`` returns the syntactic dependency.

*States -- nominal subject*

*known -- Clausal Modifier of Noun (e.g. relative clauses)*

In [None]:
token1.dep_

'acl'

In [None]:
token2.dep_

'nsubj'

``lang_`` returns the language code of the token

In [None]:
token1.lang_

'en'

In [None]:
token2.lang_

'en'

___

## **1.4. Part of Speech Tagging (POS)**

In [None]:
sentence2 = "Serra has been accepted from MIT."
doc2 = nlp(sentence2)

In [None]:
for token in doc2:
  print(token.text, token.pos_, token.dep_)

Serra NOUN nsubjpass
has AUX aux
been AUX auxpass
accepted VERB ROOT
from ADP prep
MIT PROPN pobj
. PUNCT punct


In [None]:
from spacy import displacy
displacy.render(doc2, style="dep")

___

## **1.5. Named Entity Recognition**

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Soviet Union

In [None]:
displacy.render(doc, style="ent")

___

# **2. Word Vectors and spaCy**

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_md")

In [None]:
with open("data/wiki_us.txt", "r") as f:
  text = f.read()

In [None]:
doc = nlp(text)

In [None]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


## **2.1. Similarity**

In [None]:
import numpy as np

word = "country"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[word]]]), n=10)

words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
words

['country—0,467',
 'nationâ\x80\x99s',
 'countries-',
 'continente',
 'Carnations',
 'pastille',
 'бесплатно',
 'Argents',
 'Tywysogion',
 'Teeters']

In [None]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

In [None]:
print(doc1, "«-»", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. «-» Fast food tastes very good. 0.691649353055761


In [None]:
doc3 = nlp("MIT is the best university in the world.")

In [None]:
print(doc1, "«-»", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. «-» MIT is the best university in the world. 0.42171362720334976


In [None]:
doc4 = nlp("I enjoy apples.")
doc5 = nlp("I enjoy oranges.")

In [None]:
print(doc4, "«-»", doc5, doc4.similarity(doc5))

I enjoy apples. «-» I enjoy oranges. 0.9775702131220241


In [None]:
doc6 = nlp("I enjoy hamburgers.")

In [None]:
print(doc5, "«-»", doc6, doc5.similarity(doc6))

I enjoy oranges. «-» I enjoy hamburgers. 0.9628306772893752


In [None]:
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "«-»", burgers, french_fries.similarity(burgers))

salty fries «-» hamburgers 0.6938489675521851
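By default, spaCy estimates similarity as the cosine similarity of vectors, using the average of the word vectors for a Doc or Span. The plain-Python sketch below illustrates that idea with made-up three-dimensional vectors; the numbers and the ``toy_vectors`` table are hypothetical and have nothing to do with the real ``en_core_web_md`` vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def average(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Hypothetical 3-dimensional "word vectors"; real models use hundreds of dimensions.
toy_vectors = {
    "apples": [0.9, 0.1, 0.0],
    "oranges": [0.8, 0.2, 0.1],
    "enjoy": [0.1, 0.9, 0.2],
    "physics": [0.0, 0.1, 0.9],
}

doc_a = average([toy_vectors["enjoy"], toy_vectors["apples"]])
doc_b = average([toy_vectors["enjoy"], toy_vectors["oranges"]])
doc_c = average([toy_vectors["physics"]])

sim_ab = cosine_similarity(doc_a, doc_b)
sim_ac = cosine_similarity(doc_a, doc_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

Because the vectors are averaged, words shared between two short texts pull the scores together, which is one plausible reason "I enjoy oranges." and "I enjoy hamburgers." score so high above.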


___

# **3. spaCy's Pipelines**

If we only need one task, such as splitting a text into sentences, we can create a blank pipeline for the language of the text and add only the component we need.

This runs much faster than a trained model with many components, but it is usually less accurate: the ``sentencizer`` splits on punctuation using simple rules, whereas the trained pipeline sets sentence boundaries with its parser.

In [None]:
nlp = spacy.blank("en")

In [None]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x79c8ea9b2bc0>

In [None]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
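As a quick usage sketch, the blank pipeline with a ``sentencizer`` can be run on a short text. This assumes spaCy v3 is installed; the sample text and the name ``nlp_sents`` are illustrative.

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer; no trained model needed.
nlp_sents = spacy.blank("en")
nlp_sents.add_pipe("sentencizer")

sample = "spaCy is fast. The sentencizer splits on punctuation! Does it work?"
sentences = [sent.text for sent in nlp_sents(sample).sents]

print(nlp_sents.pipe_names)
for sentence in sentences:
    print(sentence)
```

The pipeline contains a single component, so nothing else (tagging, parsing, NER) is computed for the text.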

In [None]:
nlp2 = spacy.load("en_core_web_sm")

In [None]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

___

# **4. Rules Based spaCy**

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")

## **4.1. spaCy Entity Ruler**

In [None]:
text = "West Chestertenfieldville was referenced in Mr. Deeds."
doc = nlp(text)

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


In [None]:
ruler = nlp.add_pipe("entity_ruler")

In [None]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [None]:
patterns = [
    {"label" : "FILM",
     "pattern" : "Mr. Deeds"}
]

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc2 = nlp(text)
for ent in doc2.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Deeds PERSON


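Although we added the pattern, the result doesn't change.

The reason is that the entity ruler comes after Named Entity Recognition (``ner``) in the pipeline. By default the entity ruler does not overwrite entities that are already set, so 'Deeds' keeps the PERSON label assigned by ``ner``.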
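The ordering effect can be reproduced without a trained model. In the sketch below, ``toy_person_tagger`` is a hypothetical component that stands in for ``ner`` by labelling 'Deeds' as PERSON, and the entity ruler is added after it. The ``overwrite_ents`` setting of the entity ruler (spaCy v3 assumed) decides whether the ruler may replace that earlier label; it is shown here as an alternative to reordering, not as the approach this notebook takes.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Span

@Language.component("toy_person_tagger")
def toy_person_tagger(doc):
    # Hypothetical stand-in for a statistical NER: labels "Deeds" as PERSON.
    doc.ents = [Span(doc, t.i, t.i + 1, label="PERSON") for t in doc if t.text == "Deeds"]
    return doc

def run_pipeline(overwrite):
    nlp_toy = spacy.blank("en")
    nlp_toy.add_pipe("toy_person_tagger")
    # The ruler runs after the tagger, mirroring the ordering problem above.
    ruler_toy = nlp_toy.add_pipe("entity_ruler", config={"overwrite_ents": overwrite})
    ruler_toy.add_patterns([{"label": "FILM", "pattern": "Mr. Deeds"}])
    doc_toy = nlp_toy("West Chestertenfieldville was referenced in Mr. Deeds.")
    return [(ent.text, ent.label_) for ent in doc_toy.ents]

kept = run_pipeline(overwrite=False)     # earlier PERSON label survives
replaced = run_pipeline(overwrite=True)  # ruler replaces the overlapping entity
print(kept)
print(replaced)
```

With ``overwrite_ents`` left at ``False``, the ruler skips any match that touches a token already inside an entity.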

In [None]:
nlp2 = spacy.load("en_core_web_sm")

In [None]:
ruler = nlp2.add_pipe("entity_ruler", before="ner")

In [None]:
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc3 = nlp2(text)

for ent in doc3.ents:
  print(ent.text, ent.label_)

West Chestertenfieldville GPE
Mr. Deeds FILM


___

## **4.2. spaCy Matcher**

In [None]:
import spacy
from spacy.matcher import Matcher

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])

In [None]:
doc = nlp("My email address is serurays@gmail.com")
matches = matcher(doc)

In [None]:
matches

[(16571425990740197027, 4, 5)]

The first element of each tuple is the match ID: the hash of the name the pattern was registered under (``EMAIL_ADDRESS``), which can be looked up in ``nlp.vocab``.

The second element is the start token index and the third is the end token index (exclusive). The email address is the token at index 4, so the match covers ``doc[4:5]``.
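A minimal sketch of unpacking a match tuple is shown below. It assumes spaCy v3 and uses a blank pipeline, since ``LIKE_EMAIL`` is a lexical attribute that needs no trained model; the address and the variable names are placeholders.

```python
import spacy
from spacy.matcher import Matcher

nlp_blank = spacy.blank("en")
email_matcher = Matcher(nlp_blank.vocab)
email_matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

# Hypothetical address used only for illustration.
email_doc = nlp_blank("My email address is user@example.com")

results = []
for match_id, start, end in email_matcher(email_doc):
    rule_name = nlp_blank.vocab.strings[match_id]  # hash -> registered rule name
    results.append((rule_name, start, end, email_doc[start:end].text))

print(results)
```

Slicing the Doc with ``start`` and ``end`` returns the matched Span, which is usually more useful than the raw indices.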

In [None]:
nlp.vocab[matches[0][0]].text

'EMAIL_ADDRESS'

In [None]:
import os
os.makedirs("data", exist_ok=True)

In [None]:
with open("data/wiki_mlk.txt", "w") as f:
  f.write(
  """Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Christian minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.

King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights.[1] He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King was one of the leaders of the 1963 March on Washington, where he delivered his "I Have a Dream" speech on the steps of the Lincoln Memorial, and helped organize two of the three Selma to Montgomery marches during the 1965 Selma voting rights movement. The civil rights movement achieved pivotal legislative gains in the Civil Rights Act of 1964, Voting Rights Act of 1965, and the Fair Housing Act of 1968.

The SCLC put into practice the tactics of nonviolent protest with some success by strategically choosing the methods and places in which protests were carried out. There were several dramatic standoffs with segregationist authorities, who frequently responded violently.[2] King was jailed several times. Federal Bureau of Investigation (FBI) director J. Edgar Hoover considered King a radical and made him an object of the FBI's COINTELPRO from 1963 forward. FBI agents investigated him for possible communist ties, spied on his personal life, and secretly recorded him. In 1964, the FBI mailed King a threatening anonymous letter, which he interpreted as an attempt to make him commit suicide.[3]

On October 14, 1964, King won the Nobel Peace Prize for combating racial inequality through nonviolent resistance. In his final years, he expanded his focus to include opposition towards poverty and the Vietnam War. In 1968, King was planning a national occupation of Washington, D.C., to be called the Poor People's Campaign, when he was assassinated on April 4 in Memphis, Tennessee. James Earl Ray, a fugitive from the Missouri State Penitentiary, was convicted of the assassination, though the King family believes he was a scapegoat; the assassination remains the subject of conspiracy theories. King's death was followed by national mourning, as well as anger leading to riots in many U.S. cities. King was posthumously awarded the Presidential Medal of Freedom in 1977 and the Congressional Gold Medal in 2003. Martin Luther King Jr. Day was established as a holiday in cities and states throughout the United States beginning in 1971; the federal holiday was first observed in 1986. Hundreds of streets in the U.S. have been renamed in his honor, and King County in Washington was rededicated for him. The Martin Luther King Jr. Memorial on the National Mall in Washington, D.C., was dedicated in 2011."""
  )

In [None]:
with open("data/wiki_mlk.txt", "r") as f:
  text = f.read()

In [None]:
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Christian minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.

King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights.[1] He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. Kin

In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS" : "PROPN"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)

In [None]:
len(matches)

115

In [None]:
for mt in matches[:10]:
  print(mt, doc[mt[1]:mt[2]])

(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 66, 67) Martin


In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS" : "PROPN", "OP" : "+"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)

In [None]:
len(matches)

195

In [None]:
for mt in matches[:10]:
  print(mt, doc[mt[1]:mt[2]])

(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


In [None]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS" : "PROPN", "OP" : "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)

In [None]:
len(matches)

68

In [None]:
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

(451313080118390996, 66, 71) Martin Luther King Sr.
(451313080118390996, 536, 541) Martin Luther King Jr. Day
(451313080118390996, 591, 596) Martin Luther King Jr. Memorial
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 143, 147) Southern Christian Leadership Conference
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 243, 246) Civil Rights Act
(451313080118390996, 249, 252) Voting Rights Act
(451313080118390996, 257, 260) Fair Housing Act
(451313080118390996, 319, 322) J. Edgar Hoover
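Without ``greedy``, the ``+`` operator reports every sub-span of a run of proper nouns; with ``greedy="LONGEST"`` only the longest non-overlapping spans remain. The plain-Python sketch below illustrates that filtering idea on the token indices from the earlier output. ``keep_longest`` is a hypothetical helper, not spaCy's actual implementation.

```python
def keep_longest(spans):
    """Keep the longest spans and drop any span that overlaps a kept one.

    Each span is a (start, end) pair of token indices with an exclusive end.
    """
    # Longest first; ties are broken by the earlier start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    used = set()
    for start, end in ordered:
        if not any(i in used for i in range(start, end)):
            kept.append((start, end))
            used.update(range(start, end))
    return kept

# All sub-spans reported for tokens 0..3 ("Martin Luther King Jr.").
candidates = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3),
              (0, 4), (1, 4), (2, 4), (3, 4)]

longest = keep_longest(candidates)
print(longest)
```

Processing the longest spans first would also explain why the matches above are not in document order until they are sorted by their start index.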


In [None]:
matches.sort(key = lambda x: x[1])

In [None]:
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 66, 71) Martin Luther King Sr.
(451313080118390996, 72, 73) King
(451313080118390996, 82, 84) United States
(451313080118390996, 95, 97) Jim Crow
(451313080118390996, 106, 107) King
(451313080118390996, 132, 133) Montgomery


In [None]:
matcher = Matcher(nlp.vocab)
# match one or more proper nouns followed by a verb
patterns = [{"POS" : "PROPN", "OP" : "+"},
            {"POS" : "VERB"}]
matcher.add("PROPER_NOUNS", [patterns], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])

In [None]:
len(matches)

6

In [None]:
for match in matches[:10]:
  print(match, doc[match[1]:match[2]])

(3232560085755078826, 72, 74) King advanced
(3232560085755078826, 106, 108) King participated
(3232560085755078826, 319, 323) J. Edgar Hoover considered
(3232560085755078826, 364, 366) FBI mailed
(3232560085755078826, 391, 393) King won
(3232560085755078826, 552, 555) United States beginning


In [None]:
# Alice in Wonderland
import requests

with open("data/alice_in_wonderland.txt", "wb") as f:
  request = requests.get("https://gist.githubusercontent.com/phillipj/4944029/raw/75ba2243dd5ec2875f629bf5d79f6c1e4b5a8b46/alice_in_wonderland.txt")
  f.write(request.content)

In [None]:
with open("data/alice_in_wonderland.txt", "r") as f:
  data = f.read()

In [None]:
text = data[2000:5000]
text = text.replace("\n", " ")
text = text.replace("`", "'")
text

" she had plenty of time as she went down to look about her and to wonder what was going to happen next.  First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs.  She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was empty:  she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it.    'Well!' thought Alice to herself, 'after such a fall as this, I shall think nothing of tumbling down stairs!  How brave they'll all think me at home!  Why, I wouldn't say anything about it, even if I fell off the top of the house!' (Which was very likely true.)    Down, down, down.  Would the fall NEVER come to an end!  'I wonder how many miles 

In [None]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
patterns = [{"ORTH" : "'"},
            {"IS_ALPHA" : True, "OP" : "+"},
            {"IS_PUNCT" : True, "OP" : "*"},
            {"ORTH" : "'"},
            {"POS" : "VERB", "LEMMA" : {"IN" : speak_lemmas}},
            {"POS" : "PROPN"}]
matcher.add("PROPER_NOUN", [patterns], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)

In [None]:
len(matches)

1

In [None]:
matches, doc[matches[0][1] : matches[0][2]]

([(451313080118390996, 149, 155)], 'Well!' thought Alice)

In [None]:
matcher = Matcher(nlp.vocab)
patterns = [
            {"IS_ALPHA" : True, "OP" : "+"},
            {"IS_PUNCT" : True, "OP" : "*"},
            {"ORTH" : "'"}
            ]
matcher.add("PROPER_NOUNS", [patterns], greedy = "LONGEST")
doc = nlp(text)
matches = matcher(doc)

In [None]:
len(matches)

11

In [None]:
matches

[(3232560085755078826, 197, 209),
 (3232560085755078826, 545, 555),
 (3232560085755078826, 481, 489),
 (3232560085755078826, 243, 249),
 (3232560085755078826, 704, 710),
 (3232560085755078826, 714, 720),
 (3232560085755078826, 588, 593),
 (3232560085755078826, 359, 363),
 (3232560085755078826, 669, 673),
 (3232560085755078826, 104, 107),
 (3232560085755078826, 150, 153)]

In [None]:
for match in matches[:10]:
  print(match, doc[match[1] : match[2]])

(3232560085755078826, 197, 209) even if I fell off the top of the house!'
(3232560085755078826, 545, 555) perhaps I shall see it written up somewhere.'
(3232560085755078826, 481, 489) is this New Zealand or Australia?'
(3232560085755078826, 243, 249) fallen by this time?'
(3232560085755078826, 704, 710) Do cats eat bats?'
(3232560085755078826, 714, 720) Do bats eat cats?'
(3232560085755078826, 588, 593) I should think!'
(3232560085755078826, 359, 363) got to?'
(3232560085755078826, 669, 673) I wonder?'
(3232560085755078826, 104, 107) ORANGE MARMALADE'


In [None]:
matcher = Matcher(nlp.vocab)
patterns = [
            {"IS_TITLE" : True},
            {"IS_ALPHA" : True, "OP" : "+"},
            {"IS_PUNCT" : True, "OP" : "*"}
            ]
matcher.add("PROPER_NOUNS", [patterns], greedy = "LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key = lambda x: x[1])

In [None]:
len(matches)

38

In [None]:
for match in matches[:10]:
  print(match, doc[match[1] : match[2]])

(3232560085755078826, 86, 100) She took down a jar from one of the shelves as she passed;
(3232560085755078826, 154, 158) Alice to herself,
(3232560085755078826, 166, 175) I shall think nothing of tumbling down stairs!
(3232560085755078826, 176, 179) How brave they
(3232560085755078826, 189, 191) I would
(3232560085755078826, 199, 210) I fell off the top of the house!' (
(3232560085755078826, 210, 217) Which was very likely true.)
(3232560085755078826, 225, 234) Would the fall NEVER come to an end!
(3232560085755078826, 236, 242) I wonder how many miles I
(3232560085755078826, 254, 266) I must be getting somewhere near the centre of the earth.


___

## **4.3. Custom Components in spaCy**

In [None]:
import spacy

In [None]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Britain is a place. Mary is a doctor.")

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)

Britain GPE
Mary PERSON


In [None]:
from spacy.language import Language

In [None]:
@Language.component("remove_gpe")
def remove_gpe(doc):
  original_ents = list(doc.ents)
  for ent in doc.ents:
    if ent.label_ == "GPE":
      original_ents.remove(ent)
  doc.ents = original_ents
  return doc

In [None]:
nlp.add_pipe("remove_gpe")

In [None]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [None]:
doc = nlp("Britain is a place. Mary is a doctor.")

for ent in doc.ents:
  print(ent.text, ent.label_)

Mary PERSON


___

# **5. Working with Multi-Word Token Entities and RegEx in spaCy**

In [None]:
import spacy

In [None]:
text = "This is a sample number (555) 555-5555."

In [None]:
nlp = spacy.blank("en")

In [None]:
ruler = nlp.add_pipe("entity_ruler")

In [None]:
patterns = [
    {
        "label" : "PHONE_NUMBER",
        # raw string, so that \d is not treated as a string escape
        "pattern" : [{"TEXT" : {"REGEX" : r"((\d){3}-(\d){4})"} }]
    }
]

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc = nlp(text)

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)

The ``REGEX`` operator is applied to one token at a time. The tokenizer splits ``555-5555`` at the dash into separate tokens, so no single token matches the pattern and no entity is found.
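One possible workaround, sketched below, is to describe the number as a sequence of tokens: three digits, a dash, four digits. This assumes spaCy v3 and the default English tokenizer; the sample text and the variable names are illustrative.

```python
import spacy

nlp_phone = spacy.blank("en")
phone_ruler = nlp_phone.add_pipe("entity_ruler")

# One dictionary per token: three digits, a dash, four digits.
phone_patterns = [
    {
        "label": "PHONE_NUMBER",
        "pattern": [
            {"TEXT": {"REGEX": r"^\d{3}$"}},
            {"ORTH": "-"},
            {"TEXT": {"REGEX": r"^\d{4}$"}},
        ],
    }
]
phone_ruler.add_patterns(phone_patterns)

phone_doc = nlp_phone("This is a sample number (555) 555-5555 for testing")
phone_tokens = [token.text for token in phone_doc]
phone_ents = [(ent.text, ent.label_) for ent in phone_doc.ents]

print(phone_tokens)
print(phone_ents)
```

Printing the tokens first is a useful habit when a pattern does not match, because it shows exactly how the tokenizer divided the text.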

In [None]:
text = "This is a sample number 5555555."

In [None]:
patterns = [
    {
        "label" : "PHONE_NUMBER",
        # raw string, so that \d is not treated as a string escape
        "pattern" : [{"TEXT" : {"REGEX" : r"((\d){5})"}}]
    }
]

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
ruler = nlp.add_pipe("entity_ruler")

In [None]:
ruler.add_patterns(patterns)

In [None]:
doc = nlp(text)

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)

5555555 PHONE_NUMBER


In [None]:
import re

In [None]:
text = "Paul Newman was an American actor, but Paul Holywood is a British TV Host. The name Paul is quite common."

In [None]:
pattern = r"Paul [A-Z]\w+"

In [None]:
matches = re.finditer(pattern, text)

In [None]:
for match in matches:
  print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 52), match='Paul Holywood'>


In [None]:
import spacy
from spacy.tokens import Span

In [None]:
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.ents)
original_ents = list(doc.ents)
multi_word_token_ents = []
for match in re.finditer(pattern, doc.text):
  start, end = match.span()
  span = doc.char_span(start, end)
  if span is not None:
    multi_word_token_ents.append((span.start, span.end, span.text))
for ent in multi_word_token_ents:
  start, end, name = ent
  per_ent = Span(doc, start, end, label="PERSON")
  original_ents.append(per_ent)
doc.ents = original_ents
for ent in doc.ents:
  print(ent.text, ent.label_)

()
Paul Newman PERSON
Paul Holywood PERSON
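The ``if span is not None`` check matters because ``doc.char_span`` returns ``None`` when the character offsets do not line up with token boundaries. The sketch below shows this, assuming spaCy v3, with a regex that deliberately stops in the middle of a token; the ``alignment_mode`` argument is one way to snap the offsets to whole tokens.

```python
import re
import spacy

nlp_align = spacy.blank("en")
align_doc = nlp_align("Paul Newman was an American actor.")

# This regex deliberately stops in the middle of the token "Newman".
partial = re.search(r"Paul New", align_doc.text)
start_char, end_char = partial.span()

strict_span = align_doc.char_span(start_char, end_char)  # default: strict alignment
expanded_span = align_doc.char_span(start_char, end_char, alignment_mode="expand")

print(strict_span)    # None, because character 8 is not a token boundary
print(expanded_span)
```

Whether to skip misaligned matches or expand them depends on the task; expanding can pull in text that the regex never matched.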


In [None]:
from spacy.language import Language

@Language.component("paul_ner")
def paul_ner(doc):
  original_ents = list(doc.ents)
  multi_word_token_ents = []
  for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
      multi_word_token_ents.append((span.start, span.end, span.text))
  for ent in multi_word_token_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)
  doc.ents = original_ents
  return doc

In [None]:
nlp2 = spacy.blank("en")

In [None]:
nlp2.add_pipe("paul_ner")

In [None]:
doc = nlp2(text)
for ent in doc.ents:
  print(ent.text, ent.label_)

Paul Newman PERSON
Paul Holywood PERSON


In [None]:
from spacy.language import Language

@Language.component("cinema_ner")
def cinema_ner(doc):
  pattern = r"Holywood"
  original_ents = list(doc.ents)
  multi_word_token_ents = []
  for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
      multi_word_token_ents.append((span.start, span.end, span.text))
  for ent in multi_word_token_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)
  doc.ents = original_ents
  return doc

In [None]:
nlp3 = spacy.load("en_core_web_sm")
nlp3.add_pipe("cinema_ner")
nlp3.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'cinema_ner': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [None]:
doc3 = nlp3(text)

ValueError: [E1010] Unable to set entity information for token 9 which is included in more than one span in entities, blocked, missing or outside.

In [None]:
from spacy.language import Language
from spacy.util import filter_spans

@Language.component("cinema_ner")
def cinema_ner(doc):
  pattern = r"Holywood"
  original_ents = list(doc.ents)
  multi_word_token_ents = []
  for match in re.finditer(pattern, doc.text):
    start, end = match.span()
    span = doc.char_span(start, end)
    if span is not None:
      multi_word_token_ents.append((span.start, span.end, span.text))
  for ent in multi_word_token_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="CINEMA")
    original_ents.append(per_ent)
  filtered = filter_spans(original_ents)
  doc.ents = filtered
  return doc

In [None]:
nlp4 = spacy.load("en_core_web_sm")
nlp4.add_pipe("cinema_ner")

In [None]:
doc4 = nlp4(text)

In [None]:
for ent in doc4.ents:
  print(ent.text, ent.label_)

Paul Newman PERSON
American NORP
Paul Holywood PERSON
British NORP
Paul PERSON


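The error occurred because token 9 (Holywood) belonged to two overlapping spans: the PERSON entity "Paul Holywood" already assigned by the model's ner component, and the CINEMA entity added by cinema_ner. doc.ents cannot contain overlapping spans.

filter_spans resolves the overlap by keeping the longer span, which here is the PERSON entity.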
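
The behaviour of `filter_spans` can be checked in isolation, without the trained model. This is a small sketch on a blank pipeline with two hand-built overlapping spans; `nlp_demo`, `doc_demo`, `person` and `cinema` are illustrative names only.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp_demo = spacy.blank("en")
doc_demo = nlp_demo("Paul Holywood is a British TV host.")

person = Span(doc_demo, 0, 2, label="PERSON")  # tokens "Paul Holywood"
cinema = Span(doc_demo, 1, 2, label="CINEMA")  # token "Holywood" only

# Assigning [person, cinema] to doc.ents directly would raise the overlap error.
# filter_spans keeps the longest span among overlapping candidates.
kept = filter_spans([cinema, person])
doc_demo.ents = kept
print([(span.text, span.label_) for span in kept])
```

Note that the order of the input list does not decide the winner; span length does. If the custom label should win instead, the overlapping model entity would have to be removed before assigning `doc.ents`.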

___

# **6. Applied spaCy Financial NER**

In [None]:
import spacy
import pandas as pd



In [None]:
df = pd.read_csv("data/2022_03_17_02_06_nasdaq.csv")
len(df)

8339

In [None]:
df.drop("Unnamed: 0", axis=1, inplace=True)

In [None]:
df.columns

Index(['symbol', 'name', 'price', 'pricing_changes',
       'pricing_percentage_changes', 'sector', 'industry', 'market_cap',
       'share_volume', 'earnings_per_share', 'annualized_dividend',
       'dividend_pay_date', 'symbol_yield', 'beta', 'errors'],
      dtype='object')

In [None]:
df.head()

,symbol,name,price,pricing_changes,pricing_percentage_changes,sector,industry,market_cap,share_volume,earnings_per_share,annualized_dividend,dividend_pay_date,symbol_yield,beta,errors
0,AAPL,Apple Inc. Common Stock,157.51,2.42,(+1.56%),Technology,Computer Manufacturing,2699423838000,63429579,$6.04,$0.88,"Feb 10, 2022",0.58%,1.18,False
1,MSFT,Microsoft Corporation Common Stock,289.56,2.41,(+0.84%),Technology,Computer Software: Prepackaged Software,2143429080429,22790662,$9.39,$2.48,"Jun 9, 2022",0.9%,0.91,False
2,GOOG,Alphabet Inc. Class C Capital Stock,2639.755,46.545,(+1.79%),Technology,Internet and Information Services,1724718735878,900760,$112.23,,,,1.06,False
3,GOOGL,Alphabet Inc. Class A Common Stock,2629.01,45.05,(+1.74%),Technology,Internet and Information Services,1718961675672,1008687,$112.23,,,,1.06,False
4,AMZN,"Amazon.com, Inc. Common Stock",3009.07,61.74,(+2.09%),Consumer Services,Catalog/Specialty Distribution,1511267897700,2623915,$64.78,,,,1.11,False


In [None]:
companies = df.name.tolist()
symbols = df.symbol.tolist()

In [None]:
companies = [item.replace("Common Stock", "").replace("Stock", "").replace("Inc.", "").replace("Corporation", "").strip() for item in companies]
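
It is worth checking what the chained `replace` calls actually leave behind. The sketch below applies the same chain to three names from the `df.head()` output above, and adds a hypothetical `tidy` helper (not part of the original notebook) that also collapses repeated spaces and strips trailing commas.

```python
import re

# Names copied from the df.head() output.
names = [
  "Apple Inc. Common Stock",
  "Alphabet Inc. Class C Capital Stock",
  "Amazon.com, Inc. Common Stock",
]

def strip_suffixes(name):
  """Same chained replace as in the cell above."""
  for junk in ("Common Stock", "Stock", "Inc.", "Corporation"):
    name = name.replace(junk, "")
  return name.strip()

def tidy(name):
  """Hypothetical extra step: collapse whitespace and drop stray commas."""
  name = re.sub(r"\s+", " ", strip_suffixes(name))
  return name.strip(" ,")

for name in names:
  print(repr(strip_suffixes(name)), "->", repr(tidy(name)))
```

The plain chain leaves a double space in the Alphabet name and a trailing comma on Amazon.com, which may be one reason some company names in running text fail to match.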

In [None]:
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
patterns = []
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

for symbol in symbols:
  patterns.append({"label" : "STOCK", "pattern" : symbol})
  # also match tickers written with a one-letter suffix, e.g. AAPL.O or KR.N
  for letter in letters:
    patterns.append({"label" : "STOCK", "pattern" : f"{symbol}.{letter}"})

for company in companies:
  patterns.append({"label" : "COMPANY", "pattern" : company})

ruler.add_patterns(patterns)

In [None]:
text = '''
Sept 10 (Reuters) - Wall Street's main indexes were subdued on Friday as signs of higher inflation and a drop in Apple shares following an unfavorable court ruling offset expectations of an easing in U.S.-China tensions.

Data earlier in the day showed U.S. producer prices rose solidly in August, leading to the biggest annual gain in nearly 11 years and indicating that high inflation was likely to persist as the pandemic pressures supply chains. read more .

"Today's data on wholesale prices should be eye-opening for the Federal Reserve, as inflation pressures still don't appear to be easing and will likely continue to be felt by the consumer in the coming months," said Charlie Ripley, senior investment strategist for Allianz Investment Management.

Apple Inc (AAPL.O) fell 2.7% following a U.S. court ruling in "Fortnite" creator Epic Games' antitrust lawsuit that stroke down some of the iPhone maker's restrictions on how developers can collect payments in apps.


Sponsored by Advertising Partner
Sponsored Video
Watch to learn more
Report ad
Apple shares were set for their worst single-day fall since May this year, weighing on the Nasdaq (.IXIC) and the S&P 500 technology sub-index (.SPLRCT), which fell 0.1%.

Sentiment also took a hit from Cleveland Federal Reserve Bank President Loretta Mester's comments that she would still like the central bank to begin tapering asset purchases this year despite the weak August jobs report. read more

Investors have paid keen attention to the labor market and data hinting towards higher inflation recently for hints on a timeline for the Federal Reserve to begin tapering its massive bond-buying program.

The S&P 500 has risen around 19% so far this year on support from dovish central bank policies and re-opening optimism, but concerns over rising coronavirus infections and accelerating inflation have lately stalled its advance.


Report ad
The three main U.S. indexes got some support on Friday from news of a phone call between U.S. President Joe Biden and Chinese leader Xi Jinping that was taken as a positive sign which could bring a thaw in ties between the world's two most important trading partners.

At 1:01 p.m. ET, the Dow Jones Industrial Average (.DJI) was up 12.24 points, or 0.04%, at 34,891.62, the S&P 500 (.SPX) was up 2.83 points, or 0.06%, at 4,496.11, and the Nasdaq Composite (.IXIC) was up 12.85 points, or 0.08%, at 15,261.11.

Six of the eleven S&P 500 sub-indexes gained, with energy (.SPNY), materials (.SPLRCM) and consumer discretionary stocks (.SPLRCD) rising the most.

U.S.-listed Chinese e-commerce companies Alibaba and JD.com , music streaming company Tencent Music (TME.N) and electric car maker Nio Inc (NIO.N) all gained between 0.7% and 1.4%


Report ad
Grocer Kroger Co (KR.N) dropped 7.1% after it said global supply chain disruptions, freight costs, discounts and wastage would hit its profit margins.

Advancing issues outnumbered decliners by a 1.12-to-1 ratio on the NYSE and by a 1.02-to-1 ratio on the Nasdaq.

The S&P index recorded 14 new 52-week highs and three new lows, while the Nasdaq recorded 49 new highs and 38 new lows.
'''

In [None]:
doc = nlp(text)
for ent in doc.ents:
  print(ent.text, ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
ET STOCK
Dow COMPANY
TME.N STOCK
NIO.N STOCK
KR.N STOCK
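
Two of these hits are false positives: `ET` comes from the time stamp "1:01 p.m. ET" and `Dow` from "Dow Jones Industrial Average". One possible mitigation is to leave ambiguous strings out of the pattern list. The sketch below assumes a hand-written stop list; `demo_symbols` and `stops` are hypothetical stand-ins, not values taken from the CSV.

```python
import spacy

# Small stand-in for the `symbols` list built from the CSV.
demo_symbols = ["AAPL", "ET", "KR"]

# Hypothetical stop list: tickers that collide with ordinary abbreviations.
stops = {"ET"}

nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns(
  [{"label": "STOCK", "pattern": s} for s in demo_symbols if s not in stops]
)

doc_demo = nlp_demo("At 1:01 p.m. ET, AAPL and KR were both lower.")
print([(ent.text, ent.label_) for ent in doc_demo.ents])
```

A stop list trades recall for precision: a genuine mention of the excluded ticker would no longer be tagged, so the list should stay short and be reviewed against real text.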


In [None]:
from spacy import displacy
displacy.render(doc, style="ent")