In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
text = """
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies established along the East Coast. Disputes over taxation and political representation with Great Britain led to the American Revolutionary War (1775–1783), which established independence. In the late 18th century, the U.S. began expanding across North America, gradually obtaining new territories, sometimes through war, frequently displacing Native Americans, and admitting new states; by 1848, the United States spanned the continent. Slavery was legal in the southern United States until the second half of the 19th century when the American Civil War led to its abolition. The Spanish–American War and World War I established the U.S. as a world power, a status confirmed by the outcome of World War II.

During the Cold War, the United States fought the Korean War and the Vietnam War but avoided direct military conflict with the Soviet Union. The two superpowers competed in the Space Race, culminating in the 1969 spaceflight that first landed humans on the Moon. The Soviet Union's dissolution in 1991 ended the Cold War, leaving the United States as the world's sole superpower.

The United States is a federal republic and a representative democracy with three separate branches of government, including a bicameral legislature. It is a founding member of the United Nations, World Bank, International Monetary Fund, Organization of American States, NATO, and other international organizations. It is a permanent member of the United Nations Security Council. Considered a melting pot of cultures and ethnicities, its population has been profoundly shaped by centuries of immigration. The country ranks high in international measures of economic freedom, quality of life, education, and human rights, and has low levels of perceived corruption. However, the country has received criticism concerning inequality related to race, wealth and income, the use of capital punishment, high incarceration rates, and lack of universal health care.

The United States is a highly developed country, accounts for approximately a quarter of global GDP, and is the world's largest economy. By value, the United States is the world's largest importer and the second-largest exporter of goods. Although its population is only 4.2% of the world's total, it holds 29.4% of the total wealth in the world, the largest share held by any country. Making up more than a third of global military spending, it is the foremost military power in the world; and it is a leading political, cultural, and scientific force internationally.[23]
"""

2.2. Creating a Doc Container


In [4]:
doc = nlp(text)

In [5]:
doc


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies es

In [6]:
len(doc)

654

In [7]:
len(text)

3527

In [9]:
# Iterating over a plain string yields single characters, not words
for char in text[:11]:
    print(char)



T
h
e
 
U
n
i
t
e
d


In [11]:
# Iterating over a Doc yields Token objects (words, punctuation, whitespace)
for token in doc[:11]:
    print(token)



The
United
States
of
America
(
U.S.A.
or
USA
)


In [13]:
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


In [14]:
words = text.split()[:10]

In [15]:
# enumerate keeps the index in step with the slice, replacing the manual counter
for i, token in enumerate(doc[5:8], start=5):
    print(f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")

SpaCy Token 5:
America
Word Split 5:
(U.S.A.


SpaCy Token 6:
(
Word Split 6:
or


SpaCy Token 7:
U.S.A.
Word Split 7:
USA),
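
The mismatch above comes from two different splitting strategies. The following is a minimal sketch of that difference on a short sample string. It assumes a blank English pipeline (`spacy.blank("en")`, tokenizer only) rather than `en_core_web_sm`, and the `sample` string is just an excerpt chosen for illustration.

```python
import spacy

# Assumption: a blank English pipeline is enough here because only the
# rule-based tokenizer is needed, so no trained model is loaded.
blank_nlp = spacy.blank("en")

sample = "The United States of America (U.S.A. or USA), commonly known"

# spaCy separates punctuation into its own tokens.
spacy_tokens = [token.text for token in blank_nlp(sample)]

# str.split() only cuts on whitespace, so punctuation stays attached.
split_words = sample.split()

print(spacy_tokens)
print(split_words)
```

In the notebook's own `doc`, the leading newline of `text` is itself token 0, which is why index 5 is `America` in `doc` but `(U.S.A.` in the whitespace-split list.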




Sentence Boundary Detection

In [16]:
for sent in doc.sents:
    print(sent)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies e

In [17]:
# doc.sents is a generator, so convert it to a list before indexing
sentence1 = list(doc.sents)[0]
print(sentence1)


The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
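
Because `doc.sents` produces sentences lazily, the first sentence can also be fetched without building the full list. The sketch below is illustrative only: it assumes a blank English pipeline with spaCy's rule-based `sentencizer` component standing in for the parser-based boundaries of `en_core_web_sm`, and the two-sentence string is made up for the example.

```python
import spacy

# Assumption: blank pipeline plus the rule-based "sentencizer", used here
# as a lightweight stand-in for the trained model's sentence boundaries.
blank_nlp = spacy.blank("en")
blank_nlp.add_pipe("sentencizer")

toy_doc = blank_nlp("It consists of 50 states. The national capital is Washington.")

# Take only the first sentence from the generator.
first_sentence = next(iter(toy_doc.sents))

# Materialize every sentence when random access is needed.
all_sentences = list(toy_doc.sents)

print(first_sentence.text)
print(len(all_sentences))
```

Each sentence is a `Span`, so it can be indexed token by token just like `sentence1` above.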


Token Attributes                                                  
The token object contains a lot of different attributes that are vital to performing NLP in spaCy. We will be working with a few of them, such as:

.text

.head

.left_edge

.right_edge

.ent_type_

.ent_iob_

.lemma_

.morph

.pos_

.dep_

.lang_

In [18]:
token2 = sentence1[2]
print(token2)

United


In [19]:
token2.text

'United'

In [20]:
# The syntactic parent, or “governor”, of this token
token2.head

States

In [21]:
# The leftmost token of this token’s syntactic descendants
token2.left_edge

United

In [22]:
# The rightmost token of this token’s syntactic descendants
token2.right_edge

United
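
For `United`, both edges are the token itself because it has no dependents. The following sketch shows the same attributes on a token that does have dependents. It builds a tiny `Doc` by hand: the words, `heads`, and `deps` values are hand-written assumptions for illustration, not output from `en_core_web_sm`.

```python
import spacy
from spacy.tokens import Doc

# Assumption: a blank English pipeline only supplies the vocab here.
blank_nlp = spacy.blank("en")

words = ["The", "United", "States", "is", "large"]
# Hypothetical hand-written parse: each head is an absolute token index,
# and the root ("is") points to itself.
heads = [2, 2, 3, 3, 3]
deps = ["det", "compound", "nsubj", "ROOT", "acomp"]
toy_doc = Doc(blank_nlp.vocab, words=words, heads=heads, deps=deps)

states = toy_doc[2]
print(states.head)        # syntactic parent of "States"
print(states.left_edge)   # leftmost token in the subtree of "States"
print(states.right_edge)  # rightmost token in the subtree of "States"

# The subtree is the token plus all of its syntactic descendants.
subtree_words = [t.text for t in states.subtree]
print(subtree_words)
```

Here the left edge of `States` reaches back to `The`, since the determiner and the compound modifier both hang off it in this toy parse.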

Entity Type

In [24]:
# Integer ID of the entity type; the underscore variant below returns the string label
token2.ent_type

384

In [25]:
token2.ent_type_

'GPE'
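
The integer and string forms refer to the same label, with the vocabulary's string store mapping between them. A minimal sketch of that relationship follows. It assumes a hand-built `Doc` whose entity span is set manually, so the `GPE` annotation is an illustrative assumption and not a model prediction.

```python
import spacy
from spacy.tokens import Doc, Span

# Assumption: blank pipeline and a manually annotated entity span.
blank_nlp = spacy.blank("en")
toy_doc = Doc(blank_nlp.vocab, words=["The", "United", "States", "is", "large"])
toy_doc.ents = [Span(toy_doc, 0, 3, label="GPE")]

token = toy_doc[1]
print(token.ent_type)   # integer ID of the label
print(token.ent_type_)  # human-readable label

# The string store resolves the integer ID back to the label text.
resolved = toy_doc.vocab.strings[token.ent_type]
print(resolved)
```

The same pairing applies to the other attributes used here: the name without a trailing underscore gives an integer, and the name with the underscore gives the string.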

Ent IOB                                                           
IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.

In [26]:
token2.ent_iob_

'I'
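
To make the scheme concrete, the sketch below groups per-token IOB codes back into whole entities using plain Python. The `tagged` triples are hand-written for illustration (they are an assumption, not copied model output), and `group_entities` is a hypothetical helper rather than part of spaCy.

```python
def group_entities(tagged):
    """Merge (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    current_tokens, current_label = [], None

    def flush():
        # Store the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for text, iob, label in tagged:
        if iob == "B":
            # "B" always starts a new entity.
            flush()
            current_tokens, current_label = [text], label
        elif iob == "I" and current_tokens:
            # "I" continues the entity that is currently open.
            current_tokens.append(text)
        else:
            # "O" or "" closes any open entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return entities


# Hypothetical tags for illustration only.
tagged = [
    ("The", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
    ("borders", "O", ""), ("Canada", "B", "GPE"), ("and", "O", ""),
    ("Mexico", "B", "GPE"),
]
entities = group_entities(tagged)
print(entities)
```

With a processed `Doc`, spaCy already exposes the grouped spans as `doc.ents`; the helper only shows what the per-token codes encode.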

Lemma

In [27]:
token2.lemma_

'United'

In [28]:
sentence1[12].lemma_

'commonly'

Part of Speech

In [29]:
token2.pos_

'PROPN'

Syntactic Dependency

In [30]:
token2.dep_

'compound'

Language                                                           
Language of the parent document’s vocabulary

In [31]:
token2.lang_

'en'

Part of Speech Tagging
In computational linguistics, identifying parts of speech is essential. spaCy offers an easy way to parse a text and tag each token. Below, we iterate over each token (word or punctuation mark) in the first sentence and print its part of speech.

In [32]:
# Print each token next to its own coarse-grained part-of-speech tag
for token in sentence1:
    print(token.text, " | ", token.pos_)


  |  SPACE
The  |  DET
United  |  PROPN
States  |  PROPN
of  |  ADP
America  |  PROPN
(  |  PUNCT
U.S.A.  |  PROPN
or  |  CCONJ
USA  |  PROPN
)  |  PUNCT
,  |  PUNCT
commonly  |  ADV
known  |  VERB
as  |  ADP
the  |  DET
United  |  PROPN
States  |  PROPN
(  |  PUNCT
U.S.  |  PROPN
or  |  CCONJ
US  |  PROPN
)  |  PUNCT
or  |  CCONJ
America  |  PROPN
,  |  PUNCT
is  |  AUX
a  |  DET
country  |  NOUN
primarily  |  ADV
located  |  VERB
in  |  ADP
North  |  PROPN
America  |  PROPN
.  |  PUNCT
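
A quick way to summarize a tagged sentence is to tally its tags. The sketch below does this in plain Python, with the part-of-speech tags copied in order from the output above so that it runs without a loaded model. The variable names are illustrative.

```python
from collections import Counter

# POS tags copied in order from the output above, one per token of sentence1.
tags = (
    "SPACE DET PROPN PROPN ADP PROPN PUNCT PROPN CCONJ PROPN PUNCT PUNCT "
    "ADV VERB ADP DET PROPN PROPN PUNCT PROPN CCONJ PROPN PUNCT CCONJ "
    "PROPN PUNCT AUX DET NOUN ADV VERB ADP PROPN PROPN PUNCT"
).split()

pos_counts = Counter(tags)

# List the tags from most to least frequent.
for tag, count in pos_counts.most_common():
    print(f"{tag:<6} {count}")
```

With a live pipeline, the same tally could be built directly with `Counter(token.pos_ for token in sentence1)`.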


In [35]:
from spacy import displacy
displacy.render(sentence1, style="ent")