In [1]:
print("Hello Spacy")

Hello Spacy


In [2]:
import spacy
print(spacy.__version__)

3.8.2


In [3]:
nlp = spacy.load("en_core_web_sm")

## 2.2 Containers

Containers are spacy objects that contain a large quantity of data about a text. When we analyze texts with the spacy framework, we create different container objects to do that. Here is a full list of all spacy containers, We will be focusing on three (emboldened): Doc, Span, and Token.

In [4]:
with open ("data/wiki_us.txt", "r") as f:
    text = f.read()

In [5]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [6]:
## Doc Container
doc = nlp(text)

In [7]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [8]:
print (len(doc))
print (len(text))

652
3525


In [9]:
for token in text[:10]:
    print (token)

T
h
e
 
U
n
i
t
e
d


In [10]:
for token in doc[:20]:
    print (token)

The
United
States
of
America
(
U.S.A.
or
USA
)
,
commonly
known
as
the
United
States
(
U.S.
or


Tokens are a fundamental building block of spaCy or any NLP framework. They can be words or punctuation marks. Tokens are something that has syntactic purpose in a sentence and is self-contained.

In [11]:
doc1 = nlp("don't")
for token in doc1: 
    print(token)

do
n't


## 2.3 Sentence Boundary Detection (SBD)

In [12]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [13]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


## 2.4 Token Attributes

In [14]:
token2 = sentence1[2]
print(token2)

States


In [15]:
token2.text

'States'

In [16]:
# This tells to which word it is governed by, in this case, the primary verb, “is”, as it is part of the noun subject.
token2.head

is

In [17]:
# If part of a sequence of tokens that are collectively meaningful, known as multi-word tokens, this will tell us where the multi-word token begins.
token2.left_edge

The

In [18]:
# This will tell us where the multi-word token ends.
token2.right_edge

,

In [19]:
# entity name | Named Entity Recognition
print(token2.ent_type)
print(token2.ent_type_)

384
GPE


In [20]:
# IOB is a method of annotating a text. In this case, we see “I” because states is inside an entity, that is to say that it is part of the United States of America.
token2.ent_iob_

'I'

In [21]:
# Lemmatization
print(token2.lemma_)
print(sentence1[12].lemma_) # known


States
know


In [22]:
# Morphological analysis
sentence1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part

In [23]:
# Coarse-grained part-of-speech from the Universal POS tag set
token2.pos_

'PROPN'

In [24]:
# Syntactic dependency relation.
token2.dep_

'nsubj'

In [25]:
token2.lang_

'en'

In [26]:
for token in sentence1:
    print (token.text, token.pos_, token.dep_)

The DET det
United PROPN compound
States PROPN nsubj
of ADP prep
America PROPN pobj
( PUNCT punct
U.S.A. PROPN appos
or CCONJ cc
USA PROPN conj
) PUNCT punct
, PUNCT punct
commonly ADV advmod
known VERB acl
as ADP prep
the DET det
United PROPN compound
States PROPN pobj
( PUNCT punct
U.S. PROPN appos
or CCONJ cc
US PROPN conj
) PUNCT punct
or CCONJ cc
America PROPN conj
, PUNCT punct
is AUX ROOT
a DET det
country NOUN attr
primarily ADV advmod
located VERB acl
in ADP prep
North PROPN compound
America PROPN pobj
. PUNCT punct


## Display dependencies between tokens

In [27]:
from spacy import displacy
displacy.render(sentence1, style="dep")

##  2.6 Named Entity Recognition

In [28]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanishâ€“American War and World War I EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Sov

In [29]:
displacy.render(doc, style="ent")

# 3. Word Vectors and spaCy

In [30]:
import spacy
!python -m spacy download en_core_web_md

Collecting en-core-web-md==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.8.0/en_core_web_md-3.8.0-py3-none-any.whl (33.5 MB)
     ---------------------------------------- 0.0/33.5 MB ? eta -:--:--
     ---------------------------------------- 0.0/33.5 MB ? eta -:--:--
     --------------------------------------- 0.0/33.5 MB 162.5 kB/s eta 0:03:26
     --------------------------------------- 0.0/33.5 MB 217.9 kB/s eta 0:02:34
     --------------------------------------- 0.1/33.5 MB 416.7 kB/s eta 0:01:21
     ---------------------------------------- 0.2/33.5 MB 1.1 MB/s eta 0:00:31
     ---------------------------------------- 0.3/33.5 MB 1.2 MB/s eta 0:00:29
     ---------------------------------------- 0.3/33.5 MB 1.1 MB/s eta 0:00:31
     ---------------------------------------- 0.3/33.5 MB 1.1 MB/s eta 0:00:31
      --------------------------------------- 0.6/33.5 MB 1.5 MB/s eta 0:00:23
      ---------------------------------------


[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [31]:
nlp = spacy.load("en_core_web_md")
with open ("data/wiki_us.txt", "r") as f:
    text = f.read()
doc = nlp(text)
sentence1 = list(doc.sents)[0]

## 3.1 Word Vectors

word vectors is the the representation words as matrices so the computer can process them quickly.

**Bag of words** : Tom loves to eat chocolate. == 1, 2, 3, 4, 5 . The limitation of this method is the converting of words without giving a meaning. that's why we'll use the word vectors which give the dict a multidimensional meaning.

In [34]:
!pip install PyDictionary

Collecting PyDictionary


[notice] A new release of pip is available: 24.0 -> 24.3.1
[notice] To update, run: python.exe -m pip install --upgrade pip



  Downloading PyDictionary-2.0.1-py3-none-any.whl.metadata (4.0 kB)
Collecting bs4 (from PyDictionary)
  Using cached bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting goslate (from PyDictionary)
  Downloading goslate-1.5.4.tar.gz (14 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting beautifulsoup4 (from bs4->PyDictionary)
  Using cached beautifulsoup4-4.12.3-py3-none-any.whl.metadata (3.8 kB)
Collecting futures (from goslate->PyDictionary)
  Downloading futures-3.0.5.tar.gz (25 kB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with s

##3.1.1 Use of Synonyms

In [35]:
from PyDictionary import PyDictionary

dictionary=PyDictionary()
text = "Tom loves to eat chocolate"

words = text.split()
for word in words:
    syns = dictionary.synonym(word)
    print (f"{word}: {syns[0:5]}\n")

Tom has no Synonyms in the API


TypeError: 'NoneType' object is not subscriptable

In [39]:
from PyDictionary import PyDictionary

dictionary=PyDictionary()

words  = ["like", "love"]
for word in words:
    syns = dictionary.synonym(word)
    print (f"{word}: {syns[0:5]}\n")

like has no Synonyms in the API


TypeError: 'NoneType' object is not subscriptable

--> Limitations : cimputer can't process words, not always accurate ...

### 3.1.2. What do Word Vectors Look Like?


In [40]:
sentence1[0].vector

array([-6.5276e-01,  2.3873e-01, -2.3325e-01,  1.8608e-01,  3.7674e-01,
       -5.4116e-02, -1.9189e-01,  2.2731e-01, -9.2528e-02,  1.8388e+00,
       -1.8715e-01, -2.2237e-01,  3.1873e-01,  1.1472e-01,  3.8304e-01,
        2.0092e-01, -2.7932e-01,  2.3462e+00, -3.9846e-01, -1.9525e-01,
       -2.5649e-01,  2.5508e-01,  9.4618e-02, -4.1082e-01, -3.4191e-01,
       -1.9499e-01,  1.7814e-01,  5.3463e-03, -4.7565e-01,  2.8022e-01,
       -2.1920e-01,  6.0433e-01,  2.9309e-01, -2.4232e-01,  5.2700e-01,
        3.9024e-01, -5.6955e-01,  3.7620e-01, -2.3126e-01, -2.9921e-01,
        4.5643e-02,  1.4555e-01,  1.4231e-01,  1.0587e-01,  4.1210e-01,
       -2.5261e-01, -3.2090e-02, -5.2830e-01, -2.6925e-02,  2.6227e-01,
        1.6375e-01,  9.9259e-02,  3.1664e-01, -1.1040e-01,  2.5732e-01,
       -4.0720e-01, -6.9903e-02, -1.3189e-01, -5.5753e-01, -1.4815e-01,
       -3.3673e-01, -3.6122e-01,  2.1905e-02,  6.8589e-01, -8.0151e-02,
       -1.2327e-01, -5.0595e-02, -1.3694e-01,  2.7306e-01, -1.48

In [53]:
import numpy as np
import spacy

## let's find the words most closely related to the word dog.
 
nlp = spacy.load("en_core_web_md")
    
your_word = "dog"
    
ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]), 
    n=10,
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)

['pooch', 'CHINCHILLA', 'CORGI', 'adopt', 'cattery', 'ADOPTED', 'CHINCHILLAS', 'goldendoodles', 'cockapoos', 'sighthound']


## 3.2. Doc Similarity

 Through word vectors we can calculate the similarity between two documents.

In [55]:
nlp = spacy.load("en_core_web_md")  # make sure to use larger package!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.8015960454940796


## 3.3. Word Similarity

In [56]:
# Similarity of tokens and spans
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

salty fries <-> hamburgers 0.5733411312103271
