# Natural Language Processing with SpaCy & Python - Course
* YouTube Video from FreeCodeCamp - https://www.youtube.com/watch?v=dIUTsFT2MeQ
* Introduction to SpaCy 3 Book - http://spacy.pythonhumanities.com/intro.html
* GitHub Repository - https://github.com/wjbmattingly/freecodecamp_spacy
* SpaCy Docs - https://spacy.io/api/doc

## Intro to NLP
![image.png](attachment:f7d18450-22ef-4154-a926-2a1345ad338c.png)
### What is NLP
**Natural language processing (NLP)** is a field of computer science that gives computers the ability to understand and process human language. It is a subfield of artificial intelligence (AI).

**Natural language understanding (NLU)** is a subfield of natural language processing (NLP) that deals with the ability of computers to understand human language. NLU is used in a variety of applications, including:
* **Machine translation**: NLU is used to understand the meaning of text in one language so that it can be translated into another language.
* **Question answering**: NLU is used to understand the meaning of questions so that they can be answered.
* **Chatbots**: NLU is used to understand the meaning of user input so that chatbots can have conversations with humans.
* **Virtual assistants**: NLU is used to understand the meaning of user requests so that virtual assistants can help users with tasks such as setting reminders, making appointments, and finding information.


**Natural Language Toolkit (NLTK)** is a free, open-source Python library for natural language processing (NLP). NLTK provides a wide range of NLP tools and resources.

## Installing SpaCy
SpaCy install page - https://spacy.io/usage

In [1]:
!pip install -U pip setuptools wheel
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting setuptools
  Downloading setuptools-67.8.0-py3-none-any.whl (1.1 MB)
Installing collected packages: setuptools
  Attempting uninstall: setuptools
    Found existing installation: setuptools 59.8.0
    Uninstalling setuptools-59.8.0:
      Successfully uninstalled setuptools-59.8.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
momepy 0.6.0 requires shapely>=2, but you have shapely 1.8.5.post1 which is incompatible.
opentelemetry-api 1.17.0 requires importlib-metadata~=6.0.0, but you have importlib-metadata 5.2.0 which is incompatible.
pymc3 3.11.5 requires numpy<1.22.2,>=1.15.0, but you have numpy 1.23.5 which is incompatible.
pymc3 3.11.5 requires scipy<1.8.0,>=1.7.3, but you have scipy 1.10.1 which is incompatible

In [2]:
import spacy 

caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io_plugins.so: undefined symbol: _ZN3tsl6StatusC1EN10tensorflow5error4CodeESt17basic_string_viewIcSt11char_traitsIcEENS_14SourceLocationE']
caused by: ['/opt/conda/lib/python3.10/site-packages/tensorflow_io/python/ops/libtensorflow_io.so: undefined symbol: _ZTVN10tensorflow13GcsFileSystemE']


In [3]:
# Load the small English pipeline
# If this runs without error, the model was downloaded correctly
nlp = spacy.load("en_core_web_sm")

## What are Containers in SpaCy
**Containers** in SpaCy are objects that store linguistic annotations for a piece of text. They are used to represent the structure and meaning of text, and they can be used for a variety of NLP tasks, such as named entity recognition, part-of-speech tagging, and sentiment analysis. There are four main types of containers in SpaCy:

* **Doc**: A Doc object represents a complete document. It contains a list of Token objects, and it also stores information about the document's structure, such as its sentences and named entities.
* **Token**: A Token object represents a single token in a document. It contains information about the token's text, its part-of-speech tag, and its dependency relations to other tokens in the document.
* **Span**: A Span object represents a contiguous span of tokens in a document. It can be used to represent phrases, clauses, or other linguistic units.
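* **Lexeme**: A Lexeme object is an entry in the vocabulary: a word type with no surrounding context, as opposed to a Token, which is a word in a particular document. It stores context-independent attributes such as the word's text, its lowercase form, and flags like whether it is alphabetic.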
![image.png](attachment:96b61ef9-78f6-4fb1-8d42-714c158ea446.png)
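
As a rough illustration of how the four containers relate, the sketch below creates one of each. It assumes a blank English pipeline (`spacy.blank("en")`), chosen here so that no trained model is required, and the sample sentence is made up for the example.

```python
import spacy
from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token

# A blank pipeline only tokenizes, which is enough to create the containers.
nlp = spacy.blank("en")
doc = nlp("New York is a big city.")

token = doc[0]               # Token: a single word in this document ("New")
span = doc[0:2]              # Span: a slice of the Doc ("New York")
lexeme = nlp.vocab["city"]   # Lexeme: a context-free vocabulary entry

print(type(doc).__name__, type(token).__name__,
      type(span).__name__, type(lexeme).__name__)
print(span.text, "|", lexeme.text, "|", lexeme.is_alpha)
```

Slicing a `Doc` always gives a `Span`, while indexing with a single integer gives a `Token`.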

## Linguistic Annotation

In [4]:
# Reading Data
with open("/kaggle/input/wiki-us/wiki_us.txt","r") as f:
    text = f.read()
    
print("Text : \n",text)

Text : 
 The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British col

In [5]:
# Calling nlp() splits the text into tokens (words, punctuation) instead of characters
doc = nlp(text)
print("Doc : \n",doc)
print("Doc Length :",len(doc)," ","Text Length :",len(text))

Doc : 
 The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colo

In [6]:
for token in text[0:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [7]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [8]:
# The problem with str.split(): punctuation such as '(' and ')' stays attached to the words
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [9]:
# Iterate over the document sentence by sentence
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [10]:
# doc.sents is a generator, so it cannot be indexed directly:
#     sentence1 = doc.sents[0]   # raises TypeError
# Convert it to a list first, then index
sentence1 = list(doc.sents)[0]
print("Working Cell: ",sentence1)

Working Cell:  The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
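
If only the first sentence is needed, an alternative to building the whole list is to take one item from the generator. This is a minimal sketch, assuming a blank pipeline with the rule-based `sentencizer` and a made-up two-sentence text rather than the notebook's Wikipedia file.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, no trained model needed
doc = nlp("The capital is Washington. The largest city is New York.")

# doc.sents yields Span objects lazily; next() takes just the first one
first_sentence = next(iter(doc.sents))

# list() materializes every sentence so they can be indexed
all_sentences = list(doc.sents)

print(first_sentence.text)
print(len(all_sentences))
```

The list form is more convenient when several sentences are accessed by position, as in the cells that follow.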


In [11]:
token2 = sentence1[2]
print(token2)

States


In [12]:
token2.text # return string

'States'

In [13]:
token2.left_edge

The

In [14]:
token2.right_edge

America

In [15]:
token2.ent_type # integer ID of the entity label (here, the ID for GPE)

384

In [16]:
token2.ent_type_

'GPE'

In [17]:
# IOB code of the token relative to an entity:
# 'B' = begins an entity, 'I' = inside an entity, 'O' = outside any entity
token2.ent_iob_

'I'
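
To see all three IOB codes side by side, the sketch below labels an entity by hand on a blank pipeline. The sentence and the hand-assigned `GPE` span are assumptions made for illustration; a trained pipeline would normally predict the entity itself.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("The United States is large")

# Hand-label tokens 1-2 ("United States") as a GPE entity.
# Tokens not covered by an entity are marked as outside ("O") by default.
doc.set_ents([Span(doc, 1, 3, label="GPE")])

iob_tags = [(token.text, token.ent_iob_, token.ent_type_) for token in doc]
for row in iob_tags:
    print(row)
```

The first token of the entity gets `B`, the following entity token gets `I`, and every other token gets `O` with an empty entity type.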

In [18]:
token2.lemma_ # Lemma (dictionary base form) of the word

'States'

In [19]:
print(sentence1[12]) # Original
print(sentence1[12].lemma_) # Lemma

known
know


In [20]:
token2.morph # Morphological (grammatical) features
# Singular

Number=Sing

In [21]:
print(sentence1[12])
print(sentence1[12].morph) #  Perfect , Past , Participle

known
Aspect=Perf|Tense=Past|VerbForm=Part


In [22]:
token2.pos_ # Part of speech
# Proper Noun

'PROPN'

In [23]:
token2.dep_ # Dependency label
# Nominal subject

'nsubj'

In [24]:
token2.lang_ # Language of the doc object
# English

'en'

In [25]:
text = "Mike enjoys playing football"
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football


In [26]:
for token in doc2:
    print("Text:",token.text ,",Pos:",token.pos_ ,",Dep:",token.dep_)

Text: Mike ,Pos: PROPN ,Dep: nsubj
Text: enjoys ,Pos: VERB ,Dep: ROOT
Text: playing ,Pos: VERB ,Dep: xcomp
Text: football ,Pos: NOUN ,Dep: dobj


In [27]:
from spacy import displacy
displacy.render(doc2 , style = 'dep')

## Named Entity Recognition (NER)
**Named entity recognition (NER)** is the task of finding named entities in text, such as people, organizations, and locations, and assigning each one a label. It is challenging because natural language is ambiguous and there are many ways to refer to the same entity. For example, the word "Acme" could refer to the company Acme Corporation, or it could refer to the highest point of something. NER systems typically use a combination of rule-based and statistical methods to identify named entities.

In [28]:
for ent in doc.ents:
    print(ent.text , ent.label_) 
# GPE > Geopolitical entity
# The model may produce wrong labels, e.g. "the American Revolutionary War" > ORG

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Soviet Union

In [29]:
displacy.render(doc , style = "ent")

## Word Vectors
**Word vectors** are a type of data structure that represents words as vectors of real numbers. They are used in natural language processing (NLP) to represent the meaning of words and to perform tasks such as text classification, sentiment analysis, and question answering.

**Word vectors** are typically learned by **word embedding** algorithms, which derive the vector for a word from the contexts in which it appears in a large corpus of text. For example, the word "cat" might be represented by the vector [0.1, 0.2, 0.3, 0.4, 0.5]. The individual numbers are usually not interpretable on their own; what matters is that words used in similar contexts, such as "cat" and "kitten", end up with similar vectors.

**Word vectors** have several advantages over traditional methods of representing words, such as bag-of-words and n-grams. Word vectors are more compact and efficient, and they can be used to represent the meaning of words in a more nuanced way. This makes them well-suited for a variety of NLP tasks.
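
Similarity between two word vectors is commonly measured with cosine similarity, which is also the default estimate behind spaCy's `similarity()` methods. The sketch below computes it in plain Python; the three-dimensional vectors are invented for the example and are far smaller than real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not taken from any real model
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat~dog: {cat_dog:.3f}, cat~car: {cat_car:.3f}")
```

Vectors that point in nearly the same direction score close to 1, so "cat" and "dog" come out as more similar than "cat" and "car" in this toy setup.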

In [30]:
import spacy

In [31]:
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.5.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


In [32]:
nlp = spacy.load("en_core_web_md")

In [33]:
with open("/kaggle/input/wiki-us/wiki_us.txt","r") as f:
    text = f.read()

In [34]:
doc = nlp(text)
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [35]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [36]:
import numpy as np

your_word = 'country'
# Find the 10 vectors closest to the vector for 'country'

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), n=10)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]  # similarity scores for the returned keys
print(f"Most words that are similar to '{your_word}' :",words)

Most words that are similar to 'country' : ['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [37]:
# Calculate document similarity
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1 , "<->" , doc2, "Similarity :", doc1.similarity(doc2))

doc3 = nlp("The Empire State Building is in New York.")
print(doc1 , "<->" , doc3, "Similarity :", doc1.similarity(doc3))

doc4 = nlp("I enjoy oranges.")
doc5 = nlp("I enjoy apples.")
print(doc4 , "<->" , doc5, "Similarity :", doc4.similarity(doc5))

doc6 = nlp("I enjoy burgers.")
print(doc4 , "<->" , doc6, "Similarity :", doc4.similarity(doc6))

I like salty fries and hamburgers. <-> Fast food tastes very good. Similarity : 0.691649353055761
I like salty fries and hamburgers. <-> The Empire State Building is in New York. Similarity : 0.1766669125394067
I enjoy oranges. <-> I enjoy apples. Similarity : 0.9775702131220241
I enjoy oranges. <-> I enjoy burgers. Similarity : 0.9628306772893752


In [38]:
# Similarity of tokens and spans
salty_fries = doc1[2:4]   # Span: "salty fries"
hamburgers = doc1[5]      # Token: "hamburgers"
print(salty_fries, "<->", hamburgers, salty_fries.similarity(hamburgers))

salty fries <-> hamburgers 0.6938489675521851


## Pipelines
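An **NLP pipeline** is the sequence of steps followed to build end-to-end NLP software. (In spaCy, "pipeline" also has a narrower meaning: the components, such as the tagger, parser, and NER, that `nlp` runs on a text to produce a `Doc`.) A typical project pipeline consists of the following steps: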

* **Data Acquisition**: The first step is to acquire the data that will be used to train the NLP model. This data can be in the form of text, code, or other forms of human language.
* **Data Cleaning**: The next step is to clean the data. This involves removing any errors or inconsistencies in the data.
* **Feature Extraction**: The next step is to extract features from the data. Features are the input to the NLP model. They can be extracted using a variety of methods, such as bag-of-words, n-grams, or topic modeling.
* **Modeling**: The next step is to train the NLP model. This involves using a machine learning algorithm to learn the relationship between the features and the labels.
* **Evaluation**: The next step is to evaluate the NLP model. This involves testing the model on a held-out dataset and measuring its accuracy.
* **Deployment**: The final step is to deploy the NLP model. This involves making the model available to users so that they can use it to perform tasks, such as text classification, sentiment analysis, or question answering.
![image.png](attachment:473a762c-49f8-4ffe-a8eb-386409ee5342.png)

In [39]:
nlp = spacy.blank("en")


In [40]:
nlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x7e670e85f580>

In [41]:
nlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'doc.sents': {'assigns': ['sentencizer'], 'requires': []},
  'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}}}
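
A blank pipeline contains only the tokenizer, so sentence boundaries are unavailable until a component that assigns them is added. The sketch below checks this before and after adding the `sentencizer`; the two-sentence text is made up for the example.

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)  # no components yet, only the tokenizer runs

doc_before = nlp("First sentence. Second sentence.")
has_sents_before = doc_before.has_annotation("SENT_START")

# Add the rule-based sentence splitter and process the text again
nlp.add_pipe("sentencizer")
print(nlp.pipe_names)

doc_after = nlp("First sentence. Second sentence.")
has_sents_after = doc_after.has_annotation("SENT_START")
sentences = [sent.text for sent in doc_after.sents]
print(has_sents_before, has_sents_after, sentences)
```

This matches the `analyze_pipes()` summary, where `sentencizer` is listed as the component that assigns `token.is_sent_start` and `doc.sents`.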

In [42]:
nlp2 = spacy.load("en_core_web_sm")
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att

### Entity Ruler
In spaCy, the **entity ruler** is a pipeline component that adds named entities to `Doc.ents` using token-based rules or exact phrase matches. It can be combined with the statistical EntityRecognizer to boost accuracy, or used on its own to implement a purely rule-based entity recognition system.
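
The two kinds of pattern can be sketched on a blank pipeline, where the ruler is the only source of entities. The lowercase test sentence and the choice of `LOWER` for the token pattern are assumptions made for this example.

```python
import spacy

nlp = spacy.blank("en")  # no statistical NER, rules only
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Exact phrase match: the pattern is a plain string
    {"label": "FILM", "pattern": "Mr. Deeds"},
    # Token-based match: one dict per token, here case-insensitive via LOWER
    {"label": "GPE", "pattern": [{"LOWER": "west"}, {"LOWER": "chestertenfieldville"}]},
])

doc = nlp("west chestertenfieldville was referenced in Mr. Deeds.")
found = [(ent.text, ent.label_) for ent in doc.ents]
print(found)
```

A phrase pattern only matches the exact token texts, while a token pattern can relax the match, for instance by ignoring case.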

In [43]:
import spacy

nlp = spacy.load("en_core_web_sm")
text = "West Chestertenfieldville was references in Mr. Deeds."

In [44]:
doc = nlp(text)
print(doc)

West Chestertenfieldville was references in Mr. Deeds.


In [45]:
for ent in doc.ents:
    print(ent.text , ent.label_)
    # Not quite right: "Mr. Deeds" is a film title, but only "Deeds" is tagged, and as PERSON

West Chestertenfieldville GPE
Deeds PERSON


In [46]:
# Create an entity ruler and add it to the end of the pipeline
ruler = nlp.add_pipe("entity_ruler")
nlp.analyze_pipes()

# 'entity_ruler' now appears as the last component

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [47]:
patterns = [
    {"label" : "GPE" , "pattern" : "West Chestertenfieldville"}
]

In [48]:
ruler.add_patterns(patterns)

In [49]:
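# In this run, West Chestertenfieldville is labeled GPE, but the model already did that on its own.
# In the video the label was still wrong at this point: the ruler runs after 'ner'
# and, by default, does not overwrite entities that 'ner' has already set.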
doc2 = nlp(text)
for ent in doc2.ents:
    print(ent.text , ent.label_)

West Chestertenfieldville GPE
Deeds PERSON
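
Component order matters because, by default, the entity ruler leaves alone any tokens that an earlier component has already marked as an entity. The sketch below imitates that situation without a trained model: a first ruler stands in for the statistical `ner` component (an assumption made purely for illustration), and a later ruler is tried with and without the `overwrite_ents` setting.

```python
import spacy

nlp = spacy.blank("en")

# Stand-in for an earlier component that tags only "Deeds" as PERSON
first = nlp.add_pipe("entity_ruler", name="first_ruler")
first.add_patterns([{"label": "PERSON", "pattern": "Deeds"}])

# A later ruler keeps existing entities by default...
late_default = nlp.add_pipe("entity_ruler", name="late_default")
late_default.add_patterns([{"label": "FILM", "pattern": "Mr. Deeds"}])
ents_default = [(e.text, e.label_) for e in nlp("I watched Mr. Deeds.").ents]

# ...unless it is created with overwrite_ents=True
nlp.remove_pipe("late_default")
late_overwrite = nlp.add_pipe(
    "entity_ruler", name="late_overwrite", config={"overwrite_ents": True}
)
late_overwrite.add_patterns([{"label": "FILM", "pattern": "Mr. Deeds"}])
ents_overwrite = [(e.text, e.label_) for e in nlp("I watched Mr. Deeds.").ents]

print(ents_default)
print(ents_overwrite)
```

Placing the ruler before `ner`, as the next cells do, is the other way to give the rules priority.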


In [50]:
nlp2 = spacy.load("en_core_web_sm")

In [51]:
# Place the entity ruler before 'ner' so that its matches take precedence
ruler = nlp2.add_pipe("entity_ruler" , before = "ner")
ruler.add_patterns(patterns)

In [52]:
doc = nlp2(text)

In [53]:
for ent in doc.ents:
    print(ent.text ,ent.label_ )

West Chestertenfieldville GPE
Deeds PERSON


In [54]:
# Now we can see 'entity_ruler' before 'ner'
nlp2.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [55]:
nlp3 = spacy.load("en_core_web_sm")

In [56]:
patterns = [
    {"label" : "GPE" , "pattern" : "West Chestertenfieldville"},
    {"label" : "FILM" , "pattern" : "Mr. Deeds"}
]

In [57]:
ruler = nlp3.add_pipe('entity_ruler' , before = 'ner')
ruler.add_patterns(patterns)

In [58]:
doc = nlp3(text)

In [59]:
for ent in doc.ents:
    print(ent.text , ent.label_)

West Chestertenfieldville GPE
Mr. Deeds FILM


> **Toponym resolution** is one of many open problems in NLP.

**Toponym resolution** is the process of identifying the real-world geographic location of a place name mentioned in a text. It is a challenging task because place names can be ambiguous, and the same place name can refer to different locations in different contexts.

### Matcher
In spaCy, the **Matcher** is a rule-based matching engine that finds sequences of tokens fitting patterns you define. Matchers are often used for tasks such as named entity recognition, where the goal is to identify entities such as people, organizations, and locations in text.

**Matchers** work by using a set of rules to match patterns in text. The rules can be based on the text's tokens, their part-of-speech tags, or other features. For example, a matcher might have a rule that matches the pattern "the company". When the matcher is applied to text, it will find all of the places where the pattern "the company" appears.

**Matchers** can be used to find patterns in both structured and unstructured text. In structured text, such as a table or a form, the patterns are often fixed. For example, a matcher might be used to find all of the phone numbers in a table. In unstructured text, such as a news article or a blog post, the patterns can be more complex. For example, a matcher might be used to find all of the mentions of a particular company in a news article.
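
The "the company" rule mentioned above could be written as follows. This is a minimal sketch on a blank pipeline; the rule name `THE_COMPANY`, the sample sentence, and the use of `LOWER` to ignore case are choices made for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Two consecutive tokens whose lowercase forms are "the" and "company"
pattern = [{"LOWER": "the"}, {"LOWER": "company"}]
matcher.add("THE_COMPANY", [pattern])

doc = nlp("The company said the company would grow.")
spans = [(start, end, doc[start:end].text) for _, start, end in matcher(doc)]
print(spans)
```

Each dictionary in a pattern describes exactly one token, so a two-word phrase needs two dictionaries.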

In [60]:
import spacy
from spacy.matcher import Matcher

In [61]:
nlp = spacy.load("en_core_web_sm")

In [62]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "LIKE_EMAIL" : True # boolean attributes must be set explicitly to True
    }
]
matcher.add("EMAIL_ADDRESS" ,[pattern])

In [63]:
doc = nlp("This is an email address : test@gmail.com")
matches = matcher(doc)
print(matches) # Each match is a tuple: (match ID, start token, end token)

[(16571425990740197027, 6, 7)]


In [64]:
print(nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS
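
The first element of each match tuple is the hash of the rule name, and the slice `doc[start:end]` recovers the matched text. The sketch below shows both lookups on a blank pipeline; using `nlp.vocab.strings` for the name lookup is one option alongside the `nlp.vocab[...]` form above.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("Contact : test@gmail.com")
match_id, start, end = matcher(doc)[0]

rule_name = nlp.vocab.strings[match_id]  # hash -> rule name
matched_text = doc[start:end].text       # token slice -> matched text
print(rule_name, matched_text)
```

The string store works in both directions, so `nlp.vocab.strings["EMAIL_ADDRESS"]` returns the same integer that appears in the match tuple.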


In [65]:
with open("/kaggle/input/wiki-mlk/wiki_mlk.txt","r") as f:
    text = f.read()
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his famous 

In [66]:
nlp = spacy.load("en_core_web_sm")

In [67]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "POS" : "PROPN"
    }
]
matcher.add("PROPER_NOUN" , [pattern])

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))

print("Match ID , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  102
Match ID , Start Token , End Token
(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist


In [68]:
# "OP": "+" matches one or more consecutive tokens; the spaCy Matcher docs list the other operators
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "POS" : "PROPN",
        "OP" : "+"
    }
]
matcher.add("PROPER_NOUN" , [pattern])

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))

print("Match ID , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  175
Match ID , Start Token , End Token
(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


In [69]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "POS" : "PROPN",
        "OP" : "+"
    }
]
matcher.add("PROPER_NOUN" , [pattern] , greedy = "LONGEST")

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))

print("Match ID , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  61
Match ID , Start Token , End Token
(451313080118390996, 83, 88) Martin Luther King Sr.
(451313080118390996, 469, 474) Martin Luther King Jr. Day
(451313080118390996, 536, 541) Martin Luther King Jr. Memorial
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 128, 132) Southern Christian Leadership Conference
(451313080118390996, 247, 251) Director J. Edgar Hoover
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 325, 328) Nobel Peace Prize
(451313080118390996, 422, 425) James Earl Ray
(451313080118390996, 463, 466) Congressional Gold Medal
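
The effect of `greedy="LONGEST"` is easier to see on a short text. The sketch below uses a blank pipeline, so it matches on `IS_TITLE` instead of `POS` (a substitution made because a blank pipeline has no tagger); the sentence and the rule name `TITLE_RUN` are invented for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
doc = nlp("Martin Luther King spoke.")
pattern = [{"IS_TITLE": True, "OP": "+"}]

# Default behaviour: every sub-sequence that fits the pattern is returned
all_matcher = Matcher(nlp.vocab)
all_matcher.add("TITLE_RUN", [pattern])
all_texts = sorted(doc[s:e].text for _, s, e in all_matcher(doc))

# greedy="LONGEST": overlapping candidates are reduced to the longest one
longest_matcher = Matcher(nlp.vocab)
longest_matcher.add("TITLE_RUN", [pattern], greedy="LONGEST")
longest_texts = [doc[s:e].text for _, s, e in longest_matcher(doc)]

print(all_texts)
print(longest_texts)
```

As the output above suggests, the filtered matches are not returned in document order, which is why the next cell sorts them by start token.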


In [70]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "POS" : "PROPN",
        "OP" : "+"
    }
]
matcher.add("PROPER_NOUN" , [pattern] , greedy = "LONGEST")

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))
matches.sort(key = lambda x : x[1])

print("Match ID , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  61
Match ID , Start Token , End Token
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 23, 24) Baptist
(451313080118390996, 49, 50) King
(451313080118390996, 69, 71) Mahatma Gandhi
(451313080118390996, 83, 88) Martin Luther King Sr.
(451313080118390996, 89, 90) King
(451313080118390996, 113, 114) King


In [71]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "POS" : "PROPN",
        "OP" : "+"
    },
    {
        "POS" : "VERB"
    }
]
matcher.add("PROPER_NOUN" , [pattern] , greedy = "LONGEST")

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))
matches.sort(key = lambda x : x[1])

print("Match ID , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  7
Match ID , Start Token , End Token
(451313080118390996, 49, 51) King advanced
(451313080118390996, 89, 91) King participated
(451313080118390996, 113, 115) King led
(451313080118390996, 167, 169) King helped
(451313080118390996, 247, 252) Director J. Edgar Hoover considered
(451313080118390996, 322, 324) King won
(451313080118390996, 485, 488) United States beginning


In [72]:
import json
with open("/kaggle/input/alice-nlp/alice.json",'r') as f:
    data = json.load(f)
    
text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [73]:
text = text.replace("`" , "'")  
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [74]:
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "ORTH" : "'"
    },
    {
        "IS_ALPHA" : True,
        "OP" : "+"
    },
    {
        "IS_PUNCT" : True,
        "OP" : '*'
    },
    {
        "ORTH" : "'"
    }
]
matcher.add("PROPER_NOUN" , [pattern] , greedy = "LONGEST")

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))
matches.sort(key = lambda x : x[1])

print("Lexeme , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  2
Lexeme , Start Token , End Token
(451313080118390996, 47, 58) 'and what is the use of a book,'
(451313080118390996, 60, 67) 'without pictures or conversation?'


In [75]:
speak_lemmas = ["think" , 'say']
matcher = Matcher(nlp.vocab)
pattern = [
    {
        "ORTH" : "'"
    },
    {
        "IS_ALPHA" : True,
        "OP" : "+"
    },
    {
        "IS_PUNCT" : True,
        "OP" : '*'
    },
    {
        "ORTH" : "'"
    },
    {
        "POS" : "VERB",
        "LEMMA" : {
            "IN" : speak_lemmas
        }
    },
    # Grabbing the speaker
    {
        "POS" : "PROPN",
        "OP" : "+"
    },
    # Grabbing the second part of the quote
    {
        "ORTH" : "'"
    },
    {
        "IS_ALPHA" : True,
        "OP" : "+"
    },
    {
        "IS_PUNCT" : True,
        "OP" : '*'
    },
    {
        "ORTH" : "'"
    },
    
]
matcher.add("PROPER_NOUN" , [pattern] , greedy = "LONGEST")

doc = nlp(text)
matches = matcher(doc)
print("Matches Length : ",len(matches))
matches.sort(key = lambda x : x[1])

print("Lexeme , Start Token , End Token")
for match in matches[:10]:
    print(match , doc[match[1] : match[2]])

Matches Length :  1
Lexeme , Start Token , End Token
(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [76]:
for text in data[0][2]:
    text = text.replace("`" , "'")  
    doc = nlp(text)
    matches = matcher(doc)
    print("Matches Length : ",len(matches))
    matches.sort(key = lambda x : x[1])
    for match in matches[:10]:
        print(match , doc[match[1] : match[2]])

Matches Length :  1
(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0
Matches Length :  0


In [77]:
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace( "`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


### Custom Components
A **custom component** is a user-defined function that is added to the spaCy pipeline. It receives a `Doc`, optionally modifies it, and returns it. Custom components can be used for a variety of tasks, such as:

* Adding custom metadata to documents and tokens
* Updating built-in attributes, such as the named entity spans
* Performing custom text analysis tasks, such as sentiment analysis or named entity recognition
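
As a minimal sketch of the first use case (adding custom metadata), the component below stores a count of alphabetic tokens on each document. The extension name `n_alpha_tokens` and the component name `count_alpha_tokens` are hypothetical names chosen for this illustration, and a blank English pipeline is assumed so that no model download is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical extension attribute for this sketch; force=True lets the
# cell be re-run without raising an "already registered" error.
Doc.set_extension("n_alpha_tokens", default=0, force=True)


@Language.component("count_alpha_tokens")
def count_alpha_tokens(doc):
    # A component receives a Doc, may modify it, and must return it
    doc._.n_alpha_tokens = sum(1 for token in doc if token.is_alpha)
    return doc


nlp_demo = spacy.blank("en")           # tokenizer only, no trained components
nlp_demo.add_pipe("count_alpha_tokens")

demo_doc = nlp_demo("Britain is a place. Mary is a doctor.")
print(nlp_demo.pipe_names)
print(demo_doc._.n_alpha_tokens)
```

The custom value lives under `doc._`, which keeps user-defined attributes separate from spaCy's built-in ones.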

In [78]:
import spacy

In [79]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Britain is a place. Mary is a doctor.")

In [80]:
for ent in doc.ents:
    print(ent.text , ent.label_)

Britain GPE
Mary PERSON


In [81]:
from spacy.language import Language

In [82]:
@Language.component("remove_gpe")
def remove_gpe(doc):
    # Keep every entity except those labelled GPE
    doc.ents = [ent for ent in doc.ents if ent.label_ != "GPE"]
    return doc

In [83]:
nlp.add_pipe("remove_gpe")

<function __main__.remove_gpe(doc)>

In [84]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [85]:
doc = nlp("Britain is a place. Mary is a doctor.")
for ent in doc.ents:
    print(ent.text , ent.label_)

Mary PERSON


In [86]:
nlp.to_disk("/kaggle/working/saved_new_core_web_sm")

### RegEx
* RegEx Docs : https://docs.pexip.com/admin/regex_reference.htm
* RegEx Editor : https://regexr.com/

**A regular expression (regex or regexp)** is a sequence of characters that specifies a search pattern. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

**RegEx** can be used to check if a string contains the specified search pattern. It can also be used to extract substrings from a string that match the specified pattern.
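
Both uses can be sketched with Python's built-in `re` module. The sample sentence and the date pattern below are made up for illustration: `re.search` checks whether the pattern occurs at all, and `re.findall` extracts every matching substring.

```python
import re

# Hypothetical sample text and pattern, used only for this sketch
sample = "Invoices were issued on 2021-09-10 and 2021-10-01."
date_pattern = r"\d{4}-\d{2}-\d{2}"   # four digits, dash, two digits, dash, two digits

# 1) Check whether the string contains the pattern
has_date = re.search(date_pattern, sample) is not None

# 2) Extract every substring that matches the pattern
dates = re.findall(date_pattern, sample)

print(has_date)  # True
print(dates)     # ['2021-09-10', '2021-10-01']
```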



In [87]:
import re

In [88]:
text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."

In [89]:
pattern = r"Paul [A-Z]\w+"

In [90]:
matches = re.finditer(pattern,text)
for match in matches:
    print(match)

<re.Match object; span=(0, 11), match='Paul Newman'>
<re.Match object; span=(39, 53), match='Paul Hollywood'>


In [91]:
import spacy
from spacy.tokens import Span

In [92]:
nlp = spacy.blank("en")
doc = nlp(text)
print(doc.ents)
original_ents = list(doc.ents)
mwt_ents = []
for match in re.finditer(pattern ,doc.text):
    start , end = match.span()
    span = doc.char_span(start,end)
    if span is not None:
        mwt_ents.append((span.start , span.end , span.text))

for ent in mwt_ents:
    start ,end , name = ent
    per_ent = Span(doc , start , end , label = "PERSON")
    original_ents.append(per_ent)
doc.ents = original_ents
print(doc.ents)

for ent in doc.ents:
    print(ent.text , ent.label_)

()
(Paul Newman, Paul Hollywood)
Paul Newman PERSON
Paul Hollywood PERSON


In [93]:
print(mwt_ents)

[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]


In [94]:
from spacy.language import Language

@Language.component("paul_ner")
def paul_ner(doc):
    original_ents = list(doc.ents)
    pattern = r"Paul [A-Z]\w+"    
    mwt_ents = []
    for match in re.finditer(pattern ,doc.text):
        start , end = match.span()
        span = doc.char_span(start,end)
        if span is not None:
            mwt_ents.append((span.start , span.end , span.text))

    for ent in mwt_ents:
        start ,end , name = ent
        per_ent = Span(doc , start , end , label = "PERSON")
        original_ents.append(per_ent)
    doc.ents = original_ents
    return doc

In [95]:
nlp2 = spacy.blank("en")
nlp2.add_pipe("paul_ner")

<function __main__.paul_ner(doc)>

In [96]:
doc2 = nlp2(text)
print(doc2.ents)

(Paul Newman, Paul Hollywood)


In [97]:
from spacy.language import Language
from spacy.util import filter_spans
@Language.component("cinema_ner")
def cinema_ner(doc):
    original_ents = list(doc.ents)
    pattern = r"Hollywood"
    mwt_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        cinema_ent = Span(doc, start, end, label="CINEMA")
        original_ents.append(cinema_ent)
    # filter_spans removes overlapping spans, preferring the longest one
    filtered = filter_spans(original_ents)
    doc.ents = filtered
    return doc

In [98]:
nlp3 = spacy.load("en_core_web_sm")
nlp3.add_pipe("cinema_ner")

<function __main__.cinema_ner(doc)>

In [99]:
doc3 = nlp3(text)
for ent in doc3.ents:
    print(ent.text , ent.label_)


Paul Newman PERSON
American NORP
Paul Hollywood PERSON
British NORP
Paul PERSON
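
Note that no `CINEMA` entity appears in the output above. This is consistent with the documented behavior of `filter_spans`: when spans overlap, the longest span is kept, so the two-token `Paul Hollywood` (PERSON) wins over the one-token `Hollywood` (CINEMA). The function below is a simplified plain-Python sketch of that selection rule on `(start, end, label)` tuples; it is not spaCy's actual implementation, and `keep_longest_spans` is a hypothetical name.

```python
def keep_longest_spans(spans):
    """Keep non-overlapping (start, end, label) spans, longest first."""
    # Longest spans first; ties are broken by the earlier start position
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, seen_tokens = [], set()
    for start, end, label in ordered:
        # Skip a span if any of its tokens is already covered
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end, label))
            seen_tokens.update(range(start, end))
    return sorted(kept, key=lambda s: s[0])


# Token offsets for the two "Paul ..." spans follow the earlier output of
# print(mwt_ents); (9, 10) is the single token "Hollywood".
candidates = [(0, 2, "PERSON"), (8, 10, "PERSON"), (9, 10, "CINEMA")]
result = keep_longest_spans(candidates)
print(result)  # [(0, 2, 'PERSON'), (8, 10, 'PERSON')]
```

If the `CINEMA` label should win instead, one option is to drop the overlapping model entities before appending the new span rather than relying on span length.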


# SpaCy Financial NER

In [100]:
import spacy
import pandas as pd

In [101]:
df = pd.read_csv("/kaggle/input/stocks/stocks.tsv" , sep='\t')
df.sample(10)

Unnamed: 0,Symbol,CompanyName,Industry,MarketCap
5289,TSEM,Tower Semiconductor,Semiconductors & Semiconductor Equipment,3.17B
2331,GPRO,GoPro,Household Durables,1.44B
2145,FWP,Forward Pharma A/S,Biotechnology,43.03M
5348,TYHT,Shineco,Personal Products,80.39M
1965,FFIE,Faraday Future Intelligent Electric,Auto Manufacturers,2.91B
3259,MACU,Mallard Acquisition,Shell Companies,136.81M
111,ADTN,"ADTRAN, Inc.",Communications Equipment,951.81M
580,AYLA,Ayala Pharmaceuticals,Biotechnology,186.33M
4227,PRG,PROG Holdings,Rental & Leasing Services,3.04B
1533,DIBS,1stdibs.com,Internet Retail,594.47M


In [102]:
symbols = df.Symbol.tolist()
companies = df.CompanyName.tolist()
print(symbols[:10])

['A', 'AA', 'AAC', 'AACG', 'AADI', 'AAIC', 'AAL', 'AAMC', 'AAME', 'AAN']


In [103]:
df2 = pd.read_csv("/kaggle/input/indexes/indexes.tsv" , sep = "\t")
df2

Unnamed: 0,IndexName,IndexSymbol
0,Dow Jones Industrial Average,DJIA
1,Dow Jones Transportation Average,DJT
2,Dow Jones Utility Average Index,DJU
3,NASDAQ 100 Index (NASDAQ Calculation),NDX
4,NASDAQ Composite Index,COMP
5,NYSE Composite Index,NYA
6,S&P 500 Index,SPX
7,S&P 400 Mid Cap Index,MID
8,S&P 100 Index,OEX
9,NASDAQ Computer Index,IXCO


In [104]:
indexes = df2.IndexName.tolist()
index_symbols = df2.IndexSymbol.tolist()

In [105]:
df3 = pd.read_csv("/kaggle/input/stock-exchanges/stock_exchanges.tsv" , sep = "\t")
df3.sample(10)

Unnamed: 0,BloombergExchangeCode,BloombergCompositeCode,Country,Description,ISOMIC,Google Prefix,EODcode,NumStocks
73,SE,SW,Switzerland,Six Swiss Exchange,XSWX,SWX,SW,127
17,CY,,Cyprus,Cyprus Stock Exchange,XCYS,,,1
98,UV,US,USA,OTC markets,OOTC,OTCMKTS,US,2433
76,TG,,Taiwan,Tapei Exchange,ROCO,,TWO,22
40,IM,,Italy,Borsa Italiana S.P.A.,MTAA,BIT,MI,146
75,SY,,Syria,Damascus Securities Exchange,XDSE,,,1
0,AF,AR,Argentina,Bolsa de Comercio de Buenos Aires,XBUE,,BA,12
2,AT,AU,Australia,Asx - All Markets,XASX,ASX,AU,875
66,SJ,,South Africa,Johannesburg Stock Exchange,XJSE,JNB,JSE,170
93,UF,US,USA,CBOE BATS BZX,BATS,BATS,US,1


In [106]:
exchanges = df3.ISOMIC.tolist()+df3['Google Prefix'].tolist()+df3.Description.tolist()
print(exchanges[:10])

['XBUE', 'XNEC', 'XASX', 'XWBO', 'XBAH', 'XDHA', 'XBRU', 'BVMF', 'XCNQ', 'XTSE']


In [107]:
stops = ['two']
nlp = spacy.blank('en')
ruler = nlp.add_pipe('entity_ruler')
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
patterns = []
for symbol in symbols:
    patterns.append({'label' : "STOCK" , "pattern" : symbol})
    for l in letters : patterns.append({'label' : "STOCK" , "pattern" : symbol+f".{l}"})
for company in companies:
    if company not in stops:
        patterns.append({'label' : "COMPANY" , "pattern" : company})
for index in indexes:
    patterns.append({'label' : "INDEX" , "pattern" : index})
    words = index.split()
    patterns.append({'label' : "INDEX" , "pattern" : " ".join(words[:2])})
for index in index_symbols:
    patterns.append({'label' : "INDEX" , "pattern" : index})
    
for e in exchanges:
    patterns.append({"label" : "STOCK_EXCHANGE" , "pattern" : e})
ruler.add_patterns(patterns)

In [108]:
# `text` still holds the Paul Newman sentence from the RegEx section
doc = nlp(text)
for ent in doc.ents:
    print(ent.text , ent.label_)

TV STOCK
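
`TV` is tagged as a stock because it appears in the symbol list, and the ruler matches the bare string without any context. One possible mitigation, sketched below under assumptions, is to skip ambiguous bare symbols while still keeping their suffixed forms such as `TV.N`. The helper `build_stock_patterns`, the `stop_symbols` argument, and the sample symbols are hypothetical and not part of the original notebook.

```python
import string


def build_stock_patterns(symbols, stop_symbols=frozenset()):
    """Build entity-ruler style pattern dicts for stock tickers."""
    patterns = []
    for symbol in symbols:
        # Bare symbols that collide with ordinary words are skipped
        if symbol not in stop_symbols:
            patterns.append({"label": "STOCK", "pattern": symbol})
        # Suffixed forms (e.g. "TV.N") are unambiguous, so always keep them
        for letter in string.ascii_uppercase:
            patterns.append({"label": "STOCK", "pattern": f"{symbol}.{letter}"})
    return patterns


sample_symbols = ["AAPL", "TV"]               # tiny illustrative sample
stock_patterns = build_stock_patterns(sample_symbols, stop_symbols={"TV"})
bare = [p["pattern"] for p in stock_patterns if "." not in p["pattern"]]

print(len(stock_patterns))  # 53
print(bare)                 # ['AAPL']
```

The resulting list has the same shape as the `patterns` list above, so it could be passed to `ruler.add_patterns` in the same way.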


In [109]:
#source: https://www.reuters.com/business/futures-rise-after-biden-xi-call-oil-bounce-2021-09-10/
text = '''
Sept 10 (Reuters) - Wall Street's main indexes were subdued on Friday as signs of higher inflation and a drop in Apple shares following an unfavorable court ruling offset expectations of an easing in U.S.-China tensions.

Data earlier in the day showed U.S. producer prices rose solidly in August, leading to the biggest annual gain in nearly 11 years and indicating that high inflation was likely to persist as the pandemic pressures supply chains. read more .

"Today's data on wholesale prices should be eye-opening for the Federal Reserve, as inflation pressures still don't appear to be easing and will likely continue to be felt by the consumer in the coming months," said Charlie Ripley, senior investment strategist for Allianz Investment Management.

Apple Inc (AAPL.O) fell 2.7% following a U.S. court ruling in "Fortnite" creator Epic Games' antitrust lawsuit that stroke down some of the iPhone maker's restrictions on how developers can collect payments in apps.


Sponsored by Advertising Partner
Sponsored Video
Watch to learn more
Report ad
Apple shares were set for their worst single-day fall since May this year, weighing on the Nasdaq (.IXIC) and the S&P 500 technology sub-index (.SPLRCT), which fell 0.1%.

Sentiment also took a hit from Cleveland Federal Reserve Bank President Loretta Mester's comments that she would still like the central bank to begin tapering asset purchases this year despite the weak August jobs report. read more

Investors have paid keen attention to the labor market and data hinting towards higher inflation recently for hints on a timeline for the Federal Reserve to begin tapering its massive bond-buying program.

The S&P 500 has risen around 19% so far this year on support from dovish central bank policies and re-opening optimism, but concerns over rising coronavirus infections and accelerating inflation have lately stalled its advance.


Report ad
The three main U.S. indexes got some support on Friday from news of a phone call between U.S. President Joe Biden and Chinese leader Xi Jinping that was taken as a positive sign which could bring a thaw in ties between the world's two most important trading partners.

At 1:01 p.m. ET, the Dow Jones Industrial Average (.DJI) was up 12.24 points, or 0.04%, at 34,891.62, the S&P 500 (.SPX) was up 2.83 points, or 0.06%, at 4,496.11, and the Nasdaq Composite (.IXIC) was up 12.85 points, or 0.08%, at 15,261.11.

Six of the eleven S&P 500 sub-indexes gained, with energy (.SPNY), materials (.SPLRCM) and consumer discretionary stocks (.SPLRCD) rising the most.

U.S.-listed Chinese e-commerce companies Alibaba and JD.com , music streaming company Tencent Music (TME.N) and electric car maker Nio Inc (NIO.N) all gained between 0.7% and 1.4%


Report ad
Grocer Kroger Co (KR.N) dropped 7.1% after it said global supply chain disruptions, freight costs, discounts and wastage would hit its profit margins.

Advancing issues outnumbered decliners by a 1.12-to-1 ratio on the NYSE and by a 1.02-to-1 ratio on the Nasdaq.

The S&P index recorded 14 new 52-week highs and three new lows, while the Nasdaq recorded 49 new highs and 38 new lows.
'''

In [110]:
doc = nlp(text)
for ent in doc.ents:
    print(ent.text , ent.label_)

Apple COMPANY
Apple COMPANY
AAPL.O STOCK
Apple COMPANY
Nasdaq COMPANY
S&P 500 INDEX
S&P 500 INDEX
ET STOCK
Dow Jones Industrial Average INDEX
S&P 500 INDEX
Nasdaq COMPANY
S&P 500 INDEX
JD.com COMPANY
TME.N STOCK
NIO.N STOCK
Kroger COMPANY
KR.N STOCK
NYSE STOCK_EXCHANGE
Nasdaq COMPANY
Nasdaq COMPANY


In [111]:
from spacy import displacy

In [112]:
doc = nlp(text)
displacy.render(doc , style = "ent")