# NLP - Spacy Basics 

NLP vs NLU 

Areas in NLP: 

- Named Entity Recoginition (NER) 
- Part-of-speech (POS) Tagging
- Syntactic Parsing 
- Text Categorization 
- Coreference resolution
- machine translation 

Areas in NLU: 

- Relation Extraction 
- Paraphrasing 
- Semantic Parsing 
- Sentiment Analysis 
- Question and Answering 
- Summarization 

SpaCy is a NLP framework in Python. Performs quickly, has better accuracy and scalability. 

Ref: https://spacy.pythonhumanities.com/intro.html

In [1]:
!pip install spacy



In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     --------------------------------------- 12.8/12.8 MB 18.7 MB/s eta 0:00:00
[38;5;2m[+] Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy

In [4]:
nlp = spacy.load('en_core_web_sm')

### Containers in spacy 

- SpaCy oojects that contains a large amount of data about a text. 
- This can be broken down abstractly as: 
   `DOC -> Sent (Sentence) -> Token (Characters) -> Span -> SpanGroups`

In [5]:
#data from: https://github.com/wjbmattingly/freecodecamp_spacy/tree/main
with open ('data/wiki_us.txt', 'r') as f: 
    text = f.read()

In [6]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [7]:
#create a doc object, which contains the metadata of the text -> which counts individual tokens
doc = nlp(text)

In [8]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [9]:
print(len(text))
print(len(doc))

3525
652


In [10]:
for token in text[0:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [11]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [12]:
for token in text.split()[0:10]: 
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


### Sentence Boundry Detection (SBD)

- is the identification of sentences in a text

In [13]:
# seperates each sentences from the doc 
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [14]:
sentence1 = doc.sents[0]
print(sentence1) 

TypeError: 'generator' object is not subscriptable

In [15]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


### Token Attributes 

In [16]:
token2 = sentence1[2]
print(token2)

States


In [17]:
token2.text

'States'

In [18]:
token2.head # syntatic parent

is

In [19]:
token2.left_edge

The

In [20]:
token2.right_edge

America

In [21]:
token2.ent_type_

'GPE'

In [22]:
token2.ent_iob_

'I'

In [23]:
token2.lemma_

'States'

In [24]:
sentence1[12].lemma_

'know'

In [25]:
token2.morph

Number=Sing

In [26]:
token2.pos_

'PROPN'

In [27]:
token2.lang_

'en'

### Part of Speech Tagging (POS)

In [28]:
text2 = "Mike enjoys playing football"
doc2 = nlp(text2)
print(doc2)

Mike enjoys playing football


In [29]:
for token in doc2: 
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj


In [30]:
from spacy import displacy
# to visualize how each words are connected to one another
displacy.render(doc2, style='dep')

### Named Entity Recognition 

In [31]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanishโ€“American War and World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVEN

In [32]:
displacy.render(doc, style='ent')

### Word Vectors / Word Embeddings

- `en_core_web_md` → a library that contains a large English medium model
- Word vectors are numerical representations of the words in multidimensional space
- utilized to make a computer system understand a word → since they can't understand texts efficiently 
- to understand how other words have a relation to others 

In [33]:
import spacy
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.5.0/en_core_web_md-3.5.0-py3-none-any.whl (42.8 MB)
     --------------------------------------- 42.8/42.8 MB 16.4 MB/s eta 0:00:00
[38;5;2m[+] Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_md')


In [34]:
import numpy as np

In [35]:
nlp2 = spacy.load('en_core_web_md')

In [36]:
with open('data/wiki_us.txt', 'r') as f:
    text = f.read()

In [37]:
doc3 = nlp2(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [38]:
# how the word 'country' relates to others:
your_word = "country"

ms = nlp2.vocab.vectors.most_similar(
    np.asarray([nlp2.vocab.vectors[nlp2.vocab.strings[your_word]]]), n=10)
words = [nlp2.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)


['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [39]:
doc1 = nlp2("I like salty fries and hamburgers.")
doc2 = nlp2("Fast food tastes very good.")

In [40]:
print(doc1, '<->', doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


In [41]:
doc3 = nlp2("The Tower Bridge is in London.")

In [42]:
print(doc2, '<->', doc3, doc2.similarity(doc3))

Fast food tastes very good. <-> The Tower Bridge is in London. 0.2800160111470374


In [43]:
doc4 = nlp2('I enjoy apples.')
doc5 = nlp2('I enjoy apples')

In [44]:
print(doc4, '<->', doc5, doc4.similarity(doc5))

I enjoy apples. <-> I enjoy apples 0.9385818249058347


In [45]:
doc6 = nlp2('I enjoy burgers.')
print(doc4, '<->', doc6, doc4.similarity(doc6))

I enjoy apples. <-> I enjoy burgers. 0.9722734327467586


In [46]:
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

salty fries <-> hamburgers 0.6938489675521851


### spaCy's Pipelines

#### Standard Pipes
- spacy provides standard pipelines (models) in which the input text goes through a couple of stages that alters the data

#### Custom Pipes
- spacy allows customized pipelines to be created according to the problem in hand, below is the code to create a custom pipeline in spacy

In [47]:
#creates a blank model, that consists of the English tokenizer
nlp3 = spacy.blank('en')

In [48]:
#breaks down texts into sentences
nlp3.add_pipe('sentencizer') 

<spacy.pipeline.sentencizer.Sentencizer at 0x18bd9fafa80>

In [54]:
nlp3.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'doc.sents': {'assigns': ['sentencizer'], 'requires': []},
  'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}}}

In [56]:
nlp.analyze_pipes() #nlp = en_core_web_sm

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
  'parser': [],
  'attribute_ruler': [],
  'lemmatizer': [],
  'ner': []},
 'att