In [5]:
!pip install spacy




In [7]:
!python3.10 -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import spacy
import numpy as np
from scipy.spatial import distance

### Load the small (sm) model

In [3]:
nlp = spacy.load('en_core_web_sm')

In [4]:
# Read the text file into a string
with open("/Users/mohammad/Desktop/Spacy_Tutorial/wiki_us.txt", "r") as file:
    text = file.read()


In [5]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [5]:
# create a doc object
doc = nlp(text)

In [6]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [7]:
print(len(text))  # number of characters in the raw string
print(len(doc))   # number of tokens in the Doc

3525
652


In [8]:
# Iterating over a plain string yields single characters, not words
for token in text[:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [9]:
# Access the tokens in the doc container (each Token carries valuable attributes and metadata about that piece of text)
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [12]:
# Iterating over "doc" differs from the string split method: split() only breaks on whitespace, so punctuation stays attached to words.
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known
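
The difference between the two outputs above can be reproduced on a short string. This is a minimal sketch, assuming spaCy is installed; it uses `spacy.blank("en")`, which contains only the rule-based English tokenizer, so no trained model is needed. The names `nlp_blank` and `sample` are illustrative and not part of the notebook.

```python
import spacy

# A blank English pipeline holds only the tokenizer (assumption: spaCy 3.x).
nlp_blank = spacy.blank("en")

# Illustrative sample: the opening words of the text used above.
sample = "The United States of America (U.S.A. or USA)"

# str.split() breaks on whitespace only, so brackets stay glued to words.
split_tokens = sample.split()

# The spaCy tokenizer separates brackets but keeps "U.S.A." together.
spacy_tokens = [token.text for token in nlp_blank(sample)]

print(split_tokens)
print(spacy_tokens)
```

The whitespace split keeps `(U.S.A.` and `USA)` as single items, while the tokenizer emits the brackets as tokens of their own.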


## Sentence Boundary Detection

In [10]:
# Access sentences in the doc container:
for sent in doc.sents:
    print(sent , end = "\n\n")

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]

At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]

The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.

The national capital is Washington, D.C., and the most populous city is New York.



Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.

The United States emerged from the thirteen British colon

In [11]:
# "doc.sents" is a generator: it can be iterated over, but it is not subscriptable,
# so doc.sents[0] raises a TypeError.

# Alternative: materialize the generator as a list once, then index into it:
sentences = list(doc.sents)
sent_1 = sentences[0]
sent_2 = sentences[1]
sent_3 = sentences[2]


print(sent_1)
print(sent_2)
print(sent_3)




The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
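
The generator behaviour described in the cell above can be seen without spaCy. The sketch below uses a hypothetical stand-in generator, `sentence_stream`, in place of `doc.sents`, and shows `itertools.islice` as one possible way to take only the first few items without building the full list.

```python
from itertools import islice


def sentence_stream():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."


# Indexing a generator fails because generators do not support subscripting.
try:
    sentence_stream()[0]
    indexing_worked = True
except TypeError as err:
    indexing_worked = False
    print("Indexing failed:", err)

# islice consumes only as many items as requested.
first_two = list(islice(sentence_stream(), 2))
print(first_two)
```

For a long document this avoids materializing every sentence when only the first few are needed.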


## Token Attributes

### Each token in the doc carries its own attributes, which provide valuable metadata

In [56]:
token_2 = sent_1[2]

print("Sentence 1: {}".format(sent_1) , end="\n\n")
print("token_2: {}".format(token_2))

Sentence 1: The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

token_2: States


In [37]:
# .text attribute
token_2.text

'States'

In [44]:
# left_edge / right_edge: the leftmost and rightmost tokens of this token's syntactic subtree
print(token_2.left_edge)
print(token_2.right_edge)

The
,


In [57]:
# entity type: ent_type is the integer ID of the label, ent_type_ is the readable string
print(token_2.ent_type)
print(token_2.ent_type_)


384
GPE


In [63]:
'''
I: the token is inside an entity
O: the token is outside any entity
B: the token begins an entity
'''

print(token_2.ent_iob_)


I
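
To illustrate how the I/O/B codes fit together, here is a minimal sketch that groups tokens into entity spans from their IOB tags. It is plain Python rather than spaCy's own implementation; the function `iob_to_spans` is hypothetical, and the tags and labels below are hand-written to mirror the entities shown in this notebook.

```python
def iob_to_spans(tokens, iob_tags, labels):
    """Group tokens into (entity_text, label) pairs using their IOB tags."""
    spans = []
    current_tokens, current_label = [], None
    for token, iob, label in zip(tokens, iob_tags, labels):
        if iob == "B":
            # A new entity starts; close the previous one if it is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close whatever was open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


# Hand-written example data (an assumption, not model output).
tokens = ["The", "United", "States", "of", "America", "is", "in", "North", "America"]
iob_tags = ["B", "I", "I", "I", "I", "O", "O", "B", "I"]
labels = ["GPE"] * 5 + ["", ""] + ["LOC", "LOC"]

spans = iob_to_spans(tokens, iob_tags, labels)
print(spans)
```

In this data "States" is tagged `I` because it sits inside the entity that begins at "The".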


In [67]:
# lemma_: the base (dictionary) form of the word
print(sent_1[12])
print(sent_1[12].lemma_)

known
know


In [83]:
# lemma_: the base (dictionary) form of the word
print(sent_1[29])         # type: Token
print(sent_1[29].lemma_)  # type: str


located
locate


In [84]:
# morph: morphological analysis of the token
print(token_2.morph)

# Number=Sing: the token is singular

Number=Sing


In [87]:
# morph analysis: morphology of the token
print(sent_1[29].morph)

Aspect=Perf|Tense=Past|VerbForm=Part


In [92]:
# pos_: part of speech
print(token_2.pos_)     # PROPN: proper noun
print(sent_1[12].pos_)  # VERB: verb
print(sent_1[28].pos_)  # ADV: adverb

PROPN
VERB
ADV


In [95]:
# dep_: the token's syntactic dependency relation to its head in the sentence
print(token_2.dep_)
print(sent_1[12].dep_)
print(sent_1[28].dep_)

nsubj
acl
advmod


In [97]:
# lang_: the language of the vocabulary the token belongs to
print(token_2.lang_)  # en: English

en


In [98]:
# define a new text
text_2 = "Hi! This is Mo, welcome to my house."
mydoc = nlp(text_2)

print(mydoc)

Hi! This is Mo, welcome to my house.


In [99]:
for token in mydoc:
    print(token.text , token.pos_ , token.dep_)

Hi INTJ ROOT
! PUNCT punct
This PRON nsubj
is AUX ROOT
Mo PROPN attr
, PUNCT punct
welcome VERB acomp
to ADP prep
my PRON poss
house NOUN pobj
. PUNCT punct


In [100]:
# Visualize the dependency parse of the text
from spacy import displacy
displacy.render(mydoc, style="dep")  # dep: dependency


## Named Entity Recognition

In [101]:
for ent in doc.ents:
    print(ent.text , ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
third- or fourth CARDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million MONEY
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775–1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
The Spanish–American War and World War I EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean

In [102]:
# Visualize the entities in doc
displacy.render(doc , style = "ent") #ent: entity

#### The small pre-trained model mislabels some entities in the doc (for example, "more than 331 million" is tagged MONEY and "the American Revolutionary War" is tagged ORG). Next, we download a larger model, which may recognize more of the entities correctly.

In [104]:
# Download the medium model

!python3.10 -m spacy download en_core_web_md

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-3.2.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_md')


## Word Vectors

### Load the medium (md) model

In [105]:
# The medium model includes word vectors (the small model does not ship with them).
nlp_md = spacy.load("en_core_web_md")

In [106]:
print(sent_1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [130]:
# Find the words closest in meaning (most similar) to the input word, using word vectors.
# This calls spaCy's own Vectors.most_similar on the model's vector table.

your_word = "united"

ms = nlp_md.vocab.vectors.most_similar(
    np.asarray([nlp_md.vocab.vectors[nlp_md.vocab.strings[your_word]]]), n=20)
# most_similar returns (keys, best_rows, scores); the keys are hash IDs, mapped back to strings here
words = [nlp_md.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity scores, one per returned word
print(words)

['United', 'UNited', 'AMerica', 'America', 'UsA', 'USA', 'KINDOM', 'Kingdom', 'England', 'cotswold', 'REPUBLICS', 'Republic', 'Europe', 'EUrope', 'union', 'TEAMSTERS', 'canda', 'Canada', 'REUNIFY', 'BRITIAN']


In [140]:
# Measure the similarity of two sentences (the same method also works for individual words)

doc_1 = nlp_md("Genious people work hard everyday.")
doc_2 = nlp_md("It is a fact that you should try hard to achieve this goal.")

print(doc_1, "<->", doc_2)
print("similarity: {:.3f}".format(doc_1.similarity(doc_2)))

Genious people work hard everyday. <-> It is a fact that you should try hard to achieve this goal.
similarity: 0.826
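
By default, `Doc.similarity` estimates similarity as the cosine of the two documents' averaged word vectors. The sketch below works through that computation on toy two-dimensional vectors in plain Python. The helper names `mean_vector` and `cosine_similarity` and the vectors themselves are illustrative assumptions, not values from the model.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "word vectors" for two tiny documents (made up for illustration).
doc_a_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_b_vectors = [[1.0, 1.0]]

similarity = cosine_similarity(mean_vector(doc_a_vectors), mean_vector(doc_b_vectors))
print("similarity: {:.3f}".format(similarity))
```

Because the vectors are averaged, word order is ignored, which is one reason two loosely related sentences can still score fairly high.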


## Pipes and Pipelines

In [6]:
# Create a blank pipeline, to which we can add specific pipes.
mynlp = spacy.blank("en")

In [7]:
# Add a pipe to the blank pipeline
# In this case, we add a sentencizer
mynlp.add_pipe("sentencizer")

<spacy.pipeline.sentencizer.Sentencizer at 0x1227c7800>

In [8]:
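# Use the custom pipeline to extract the sentences from a text

import requests
from bs4 import BeautifulSoup
s = requests.get("https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt")
# Strip any markup, re-join words hyphenated across line breaks, and flatten newlines to spaces
soup = BeautifulSoup(s.content).text.replace("-\n", "").replace("\n", " ")
# Raise the character limit: the text is longer than spaCy's default max_length of 1,000,000
mynlp.max_length = 5278439

# Create a doc to contain the text
doc = mynlp(soup)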

In [17]:
# Number of sentences in the text
print(len(list(doc.sents)))

94133


In [14]:
for sent in list(doc.sents)[0:10]:
    print(sent , end = "\n\n")

This is the 100th Etext file presented by Project Gutenberg, and is presented in cooperation with World Library, Inc., from their Library of the Future and Shakespeare CDROMS.

 Project Gutenberg often releases Etexts that are NOT placed in the Public Domain!!

 Shakespeare  *This Etext has certain copyright implications you should read!*

 <>  *Project Gutenberg is proud to cooperate with The World Library* in the presentation of The Complete Works of William Shakespeare for your reading for education and entertainment.

 HOWEVER, THIS IS NEITHER SHAREWARE NOR PUBLIC DOMAIN. .

.AND UNDER THE LIBRARY OF THE FUTURE CONDITIONS OF THIS PRESENTATION. .

.NO CHARGES MAY BE MADE FOR *ANY* ACCESS TO THIS MATERIAL.

 YOU ARE ENCOURAGED!!

TO GIVE IT AWAY TO ANYONE YOU LIKE, BUT NO CHARGES ARE ALLOWED!!

  **Welcome To The World of Free Plain Vanilla Electronic Texts**  **Etexts Readable By Both Humans and By Computers, Since 1971**  *These Etexts Prepared By Hundreds of Volunteers and Donatio

In [18]:
# Analyze the created pipeline 
mynlp.analyze_pipes()

{'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'],
   'requires': [],
   'scores': ['sents_f', 'sents_p', 'sents_r'],
   'retokenizes': False}},
 'problems': {'sentencizer': []},
 'attrs': {'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []},
  'doc.sents': {'assigns': ['sentencizer'], 'requires': []}}}
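
The same pipeline construction can be tried on a short string without downloading the large text. This is a minimal sketch, assuming spaCy 3.x is installed; the names `nlp_rules` and `sample` are illustrative and separate from the `mynlp` object used above.

```python
import spacy

# Start from an empty English pipeline and add only the rule-based sentencizer.
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

# Illustrative sample text with three sentence-final punctuation marks.
sample = "Pipelines are modular. Each pipe adds annotations! Does the order matter?"

# The sentencizer sets sentence boundaries, which makes doc.sents available.
sentences = [sent.text for sent in nlp_rules(sample).sents]

print(nlp_rules.pipe_names)
for sentence in sentences:
    print(sentence)
```

Because the sentencizer relies on punctuation rules rather than a dependency parse, it is fast on large texts, but it can mis-split on unusual punctuation, as in the Shakespeare output above.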