In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [4]:
with open('data/wiki_us.txt', 'r') as file:
    text = file.read()

In [5]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

### Work with Doc object

In [6]:
doc = nlp(text)

In [7]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [8]:
print(len(text))
print(len(doc))

3525
652


In [9]:
for token in text[:10]:
    print(token)
    

T
h
e
 
U
n
i
t
e
d


In [10]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


This is why the Doc object is so much more valuable. Iterating over the raw text string yields one character at a time, including whitespace and punctuation, whereas iterating over the Doc object yields individual tokens: words, punctuation marks, and so on.

In [11]:
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


Problem: split() only splits on whitespace, so punctuation stays attached to the neighboring words ("(U.S.A." and "USA),"). With the Doc object, by contrast, we get the individual tokens.
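To make the difference concrete, here is a minimal sketch that compares whitespace splitting with a crude punctuation-aware rule. The regular expression is only an illustrative stand-in written for this note; it is not how spaCy's tokenizer works, and the sample string is a shortened piece of the text above.

```python
import re

sample = "America (U.S.A. or USA), commonly known"

# Whitespace splitting leaves brackets and commas glued to words.
whitespace_tokens = sample.split()

# Crude stand-in for a tokenizer: brackets and commas become their own
# tokens, everything else is grouped into runs of non-space characters.
# This is NOT spaCy's algorithm, only an illustration of the idea.
crude_tokens = re.findall(r"[()\[\],]|[^\s()\[\],]+", sample)

print(whitespace_tokens)
print(crude_tokens)
```

Even this toy rule separates "(" from "U.S.A." while keeping the periods inside the abbreviation, which is the behavior seen in the Doc output.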

### Sentence Boundary Detection (SBD)

Identification of sentences within a text

In [13]:
# Iterate over the sents attribute of the Doc object
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es
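Sentence detection is harder than it looks. As a rough illustration (the sample sentence and the splitting rule are made up for this note), a naive rule that ends a sentence at every period followed by whitespace breaks on abbreviations such as "U.S.":

```python
import re

sample = "It is known as the U.S. or US. It has 50 states."

# Naive rule: a sentence ends at every period followed by whitespace.
naive_sentences = re.split(r"(?<=\.)\s+", sample)

for sentence in naive_sentences:
    print(repr(sentence))
```

The rule yields three pieces instead of two because it cuts after "U.S.", which is why a trained pipeline is used for this step.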

#### Important:
The "en_core_web_sm" pipeline does not include word vectors. Larger pipelines are typically more accurate, and that includes sentence detection.

In [15]:
# Try to access the first sentence by index (this raises an error)
sentence1 = doc.sents[0]

TypeError: 'generator' object is not subscriptable

#### Important:
'generator' object is not subscriptable.

doc.sents is a generator: it can be iterated over, but it cannot be indexed. To access a sentence by position, we need to convert it to a list first.


In [16]:
sentence1 = list(doc.sents)[0]
print(sentence1)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
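Converting to a list builds every sentence up front. When only the first few items are needed, next() or itertools.islice can pull them from a generator directly. The sketch below uses a hypothetical fake_sents() generator as a stand-in for doc.sents so that it runs without a spaCy model; the same calls should work on doc.sents, since the error above shows it is a generator.

```python
from itertools import islice


def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences one at a time."""
    yield "The United States of America is a country."
    yield "It consists of 50 states."
    yield "The national capital is Washington, D.C."


# Indexing a generator fails, just like doc.sents[0] above.
try:
    fake_sents()[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# next() pulls only the first item, without building a full list.
first = next(fake_sents())

# islice takes a bounded number of items from the front.
first_two = list(islice(fake_sents(), 2))

print(indexing_failed)
print(first)
print(first_two)
```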


### Token Attributes

* .text
* .head
* .left_edge
* .right_edge
* .ent_type_
* .ent_iob_
* .lemma_
* .morph
* .pos_
* .dep_
* .lang_

In [18]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [19]:
token_2 = doc[2]
print(token_2)

States


##### .text 
Returns the raw string value of the token.

In [20]:
token_2.text

'States'

##### .left_edge, .right_edge
left_edge and right_edge return the leftmost and rightmost tokens of a token's syntactic subtree, that is, the token itself plus every token that depends on it. When a token heads a multi-word phrase with a collective meaning, left_edge gives the start of that phrase and right_edge gives its end.

In [26]:
# The United States of America
token_2.left_edge, token_2.right_edge

(The, America)
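The idea behind the two edges can be sketched in plain Python. The head indices below are a hypothetical parse of the phrase, chosen for illustration ("States" is treated as the root of the phrase), and the helper functions are not part of spaCy.

```python
words = ["The", "United", "States", "of", "America"]
# heads[i] is the index of the word that word i depends on.
# Hypothetical parse: "States" is treated as its own head (phrase root).
heads = [2, 2, 2, 2, 3]


def subtree_indices(index, heads):
    """Return the indices of a word and of every word that depends on it."""
    found = {index}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in found and child not in found:
                found.add(child)
                changed = True
    return found


def edges(index, words, heads):
    """Return the leftmost and rightmost words of the subtree rooted at index."""
    indices = subtree_indices(index, heads)
    return words[min(indices)], words[max(indices)]


print(edges(2, words, heads))  # subtree of "States"
print(edges(3, words, heads))  # subtree of "of"
```

A word with no dependents, such as "The" here, is its own left and right edge.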

##### .ent_type, .ent_type_

Gives the entity type of the token: ent_type returns an integer ID, and ent_type_ returns the corresponding string label.

In [28]:
token_2.ent_type

384

In [29]:
token_2.ent_type_

'GPE'

'GPE' ------> Geopolitical entity

##### .ent_iob_

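Tells us whether the token is part of a larger entity: 'B' means the token begins an entity, 'I' means it is inside an entity, and 'O' means it is outside any entity.

The integer form, .ent_iob, uses 3 for 'B', 1 for 'I', and 2 for 'O'.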

In [41]:
token_2.ent_iob_ , token_2.ent_iob

('I', 1)

In [39]:
token_3 = doc[0]
token_4 = doc[1]
token_5 = doc[5]
print(token_3, token_4, token_5)

The United (


In [42]:
token_3.ent_iob_ ,token_5.ent_iob_

('B', 'O')

In [43]:
token_3.ent_iob ,token_5.ent_iob

(3, 2)
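As a rough sketch of how these codes fit together, the snippet below maps the integer codes to their letters and groups consecutive B/I tokens into one entity string. The codes for "United", "of" and "America" are assumed for illustration (only "The", "States" and "(" were printed), and group_entities is a hypothetical helper. In practice spaCy already exposes entity spans through doc.ents, so this is only meant to show what the tags encode.

```python
# Integer codes as seen in the printed outputs: 3 = 'B', 1 = 'I', 2 = 'O'.
IOB_CODES = {3: "B", 1: "I", 2: "O"}

# (text, ent_iob) pairs; codes for "United", "of" and "America" are assumed.
tagged = [("The", 3), ("United", 1), ("States", 1),
          ("of", 1), ("America", 1), ("(", 2)]


def group_entities(tagged):
    """Collect runs of B/I tokens into entity strings."""
    entities, current = [], []
    for text, code in tagged:
        tag = IOB_CODES.get(code, "O")
        if tag == "B":
            # A new entity starts; close the previous one if any.
            if current:
                entities.append(" ".join(current))
            current = [text]
        elif tag == "I" and current:
            current.append(text)
        else:
            # Outside any entity: close the open one if any.
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


print(group_entities(tagged))
```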

##### .lemma_

The lemma, i.e. the root (dictionary) form of the word.

In [47]:
token_6 = doc[12]  # appears as "known" in the doc
token_6.lemma_

'know'

##### .morph

Gives the morphological analysis of the token, i.e. what the word is morphologically.

This will be very useful when writing rules to extract information, because the rules can look for tokens with specific morphological features.

In [52]:
token_2.text, token_2.morph,token_3.text, token_3.morph,token_4.text, token_4.morph,token_5.text, token_5.morph

('States',
 Number=Sing,
 'The',
 Definite=Def|PronType=Art,
 'United',
 Number=Sing,
 '(',
 PunctSide=Ini|PunctType=Brck)

In [55]:
sentence2 = list(doc.sents)[2]
token_7 = sentence2[4]
print(token_7)

miles


In [56]:
token_7.morph

Number=Plur

Number=Sing marks a singular form; Number=Plur marks a plural form.

In [60]:
sentence3 = list(doc.sents)[3]
token_8 = sentence3[3]
token_8.text, token_8.morph

('shares', Number=Sing|Person=3|Tense=Pres|VerbForm=Fin)

###### Person=3: 
This indicates that the token is in the third person. In the context of verbs, it means that the action is performed by someone or something other than the speaker (first person) or the listener (second person).

###### Tense=Pres: 
This indicates that the tense of the token is present tense. It describes actions that are happening currently or regularly.

###### VerbForm=Fin: 
This indicates that the token represents a finite verb form. Finite verbs are those that are inflected according to the grammatical features of person, number, tense, mood, and voice.

In [62]:
sentence1[12].text, sentence1[12].morph

('known', Aspect=Perf|Tense=Past|VerbForm=Part)

Output: perfect past participle.
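The feature strings printed above follow a Key=Value|Key=Value layout, so they are easy to turn into a dictionary for rule writing. This is a minimal sketch on plain strings, assuming str(token.morph) yields text in the form shown in the outputs; parse_morph and is_past_participle are hypothetical helpers, not spaCy functions.

```python
def parse_morph(features):
    """Turn 'Key=Value|Key=Value' into a dictionary."""
    if not features:
        return {}
    return dict(pair.split("=", 1) for pair in features.split("|"))


def is_past_participle(morph):
    """Example rule: match tokens tagged as past participles."""
    return morph.get("VerbForm") == "Part" and morph.get("Tense") == "Past"


shares = parse_morph("Number=Sing|Person=3|Tense=Pres|VerbForm=Fin")
known = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")

print(shares["Tense"])
print(is_past_participle(known))
print(is_past_participle(shares))
```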

##### .pos_

The pos_ attribute identifies the token's part of speech.

token_2.pos_ is tagged as a proper noun. You may need to spend some time learning grammar rules, because NLP does require an understanding of grammar.

In [63]:
token_2.pos_

'PROPN'

###### Some examples

* Noun (NOUN): A word that represents a person, place, thing, or idea.
* Verb (VERB): A word that describes an action, occurrence, or state of being.
* Adjective (ADJ): A word that modifies or describes a noun or pronoun.
* Adverb (ADV): A word that modifies or describes a verb, adjective, or other adverb.
* Pronoun (PRON): A word that replaces or substitutes for a noun or noun phrase.
* Preposition (ADP): A word that shows the relationship between a noun or pronoun and other words in a sentence.
* Conjunction (CCONJ for coordinating, SCONJ for subordinating): A word that connects words, phrases, or clauses.
* Determiner (DET): A word that introduces a noun and expresses reference to something specific or something that is not specifically identified.
* Interjection (INTJ): A word or phrase that expresses emotion or exclamation.

##### .dep_

The dependency relation: the role the selected token plays in the sentence.

In [64]:
token_2.dep_

'nsubj'

* ROOT: The main/root word of the sentence.
* nsubj: The subject of a verb.
* dobj: The direct object of a verb.
* attr: The attribute or complement of a verb.
* prep: A prepositional modifier.
* advcl: A dependent clause functioning as an adverbial modifier.
* conj: A conjunct (word or phrase) that shares the same head word.
* punct: Punctuation marks.

##### .lang_

The language of the Doc object.

In [66]:
token_2.lang_

'en'

### Part-of-Speech Tagging

A more detailed look at part-of-speech tags and the dependency parser, and at how to analyze their output.

In [67]:
text = "Mike enjoys playing football."
doc2 = nlp(text)
for token in doc2:
    print(f"{token.text:<10} {token.pos_:<10} {token.dep_:<10}")

Mike       PROPN      nsubj     
enjoys     VERB       ROOT      
playing    VERB       xcomp     
football   NOUN       dobj      
.          PUNCT      punct     


Here we can see the basic grammatical structure of the sentence. We can also visualize this information to see how the words relate to one another.
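Before moving to the visualization, here is a small sketch of how the dependency labels can be used directly. The rows are copied from the printed table, and first_with_dep is a hypothetical helper. Note that this flat table drops the information about which word each token depends on; with a real Doc, token.head provides it.

```python
# (text, pos_, dep_) rows copied from the printed output.
rows = [
    ("Mike", "PROPN", "nsubj"),
    ("enjoys", "VERB", "ROOT"),
    ("playing", "VERB", "xcomp"),
    ("football", "NOUN", "dobj"),
    (".", "PUNCT", "punct"),
]


def first_with_dep(rows, dep):
    """Return the text of the first row with the given dependency label, or None."""
    return next((text for text, _, row_dep in rows if row_dep == dep), None)


summary = {
    "subject": first_with_dep(rows, "nsubj"),
    "main_verb": first_with_dep(rows, "ROOT"),
    "complement": first_with_dep(rows, "xcomp"),
    "object": first_with_dep(rows, "dobj"),
}
print(summary)
```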

In [70]:
from spacy import displacy
displacy.render(doc2, style="dep") 

###### displacy styles:
dep: This style visualizes the syntactic dependencies between words in the form of arrows connecting the words. It helps in understanding the grammatical structure of sentences.

ent: This style highlights named entities in the text, such as persons, organizations, locations, etc., with different colors for each entity type.

In [72]:
from spacy import displacy
displacy.render(doc2, style="ent") 