# Introduction to spaCy

## Containers
Containers are spaCy objects that hold a large amount of data about a text. When we analyze texts with the spaCy framework, we create different container objects to do so.

* Doc
* DocBin
* Example
* Language
* Lexeme
* Span
* SpanGroup
* Token
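
To make a few of these containers concrete, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes and needs no model download, and a short sample sentence made up for illustration.

```python
import spacy
from spacy.language import Language
from spacy.lexeme import Lexeme
from spacy.tokens import Doc, Span, Token

# Blank English pipeline: tokenizer only, no trained components (assumption
# made here so the sketch runs without downloading a model).
nlp = spacy.blank("en")

doc = nlp("The United States is a country.")  # Doc: the whole processed text
token = doc[1]                                 # Token: one item of the Doc
span = doc[1:3]                                # Span: a slice of the Doc
lexeme = nlp.vocab["country"]                  # Lexeme: a context-free vocabulary entry

for name, obj in [("nlp", nlp), ("doc", doc), ("token", token),
                  ("span", span), ("lexeme", lexeme)]:
    print(name, "->", type(obj).__name__)

print(token.text, "|", span.text)
```

Indexing a `Doc` with an integer gives a `Token`, while slicing it gives a `Span`; both are views onto the same underlying `Doc`.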

In [1]:
import spacy

In [2]:
model = spacy.load("en_core_web_sm")

In [3]:
with open("data/wiki_us.txt") as f:
    txt = f.read()

In [4]:
print(txt)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

### Doc Container
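The Doc container holds a processed text: its tokens and the annotations assigned by the pipeline, including sentence boundaries. In NLP, sentence boundary detection, or SBD, is the identification of sentences in a text. This may seem fairly easy to do with rules, such as splitting on periods, but abbreviations like "U.S.A." make simple rules unreliable.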
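
To see why simple rules fall short, consider a naive splitter that breaks after every period followed by whitespace. This is only an illustrative sketch: the `naive_split` helper is hypothetical and not part of spaCy, and the sample text is the opening of the passage above plus a shortened second sentence.

```python
import re

text = (
    "The United States of America (U.S.A. or USA), commonly known as the "
    "United States (U.S. or US) or America, is a country primarily located "
    "in North America. It consists of 50 states."
)

def naive_split(s):
    """Split after any period that is followed by whitespace."""
    return [part for part in re.split(r"(?<=\.)\s+", s) if part]

pieces = naive_split(text)
for piece in pieces:
    print(repr(piece))
```

The text contains two sentences, but the rule yields four pieces because it also splits after "U.S.A." and "U.S.". A trained pipeline avoids most of these mistakes by using context rather than punctuation alone.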

In [5]:
doc = model(txt)

In [6]:
txt[:10]

'The United'

In [7]:
# The Doc object holds the tokenized text, so slicing returns tokens, not characters
doc[:10]

The United States of America (U.S.A. or USA)

In [8]:
sentence = list(doc.sents)[0]

In [9]:
sentence

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

### Tokens
The Token object has many attributes that are essential for performing NLP with spaCy.

* .text
* .head
* .left_edge
* .right_edge
* .ent_type_
* .ent_iob_
* .lemma_
* .morph
* .pos_
* .dep_
* .lang_

In [10]:
token = sentence[2]

In [11]:
token

States

In [12]:
'''
Verbatim text content.
'''
token.text

'States'

In [13]:
'''
The syntactic parent, or "governor", of this token.
'''
token.head

is

In [14]:
'''
The leftmost token of this token's syntactic descendants.
If the token heads a sequence of tokens that is collectively meaningful
(here called a multi-word token), this tells us where that sequence begins.
'''
token.left_edge

The

In [15]:
'''
The rightmost token of this token's syntactic descendants.
This will tell us where the multi-word token ends.
'''
token.right_edge

,
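
The idea behind `left_edge` and `right_edge` can be sketched in plain Python. The example below is not spaCy's implementation. It assumes a hand-written toy parse in which each token stores the index of its head and the root points to itself, and `subtree_edges` is a hypothetical helper.

```python
tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Toy head indices (an assumption for illustration, not model output).
# heads[i] is the index of token i's syntactic parent; the root "is" heads itself.
heads = [2, 2, 5, 2, 3, 5, 7, 5]

def subtree_edges(heads, i):
    """Return (leftmost, rightmost) indices among token i and its descendants."""
    members = {i}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in members and child not in members:
                members.add(child)
                changed = True
    return min(members), max(members)

left, right = subtree_edges(heads, 2)  # subtree of "States"
print(tokens[left], "...", tokens[right])
```

In this toy parse the subtree of "States" runs from "The" to "America". In the real parse above, the right edge is the comma because the parser attached more of the sentence to "States".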

In [16]:
'''
Named entity type.
'''
token.ent_type, token.ent_type_

(384, 'GPE')
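
Many token attributes come in pairs like this: the name without a trailing underscore returns an integer ID, and the name with the underscore returns the readable string. The sketch below shows how the two relate through the vocabulary's string store. It assumes a blank pipeline and a hand-assigned entity, since a blank pipeline has no entity recognizer.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; no trained entity recognizer
doc = nlp("United States is a country")

# Hand-assigned entity for illustration; a trained model would predict this.
doc.ents = [Span(doc, 0, 2, label="GPE")]

token = doc[1]               # "States"
label_id = token.ent_type    # integer ID
label_str = token.ent_type_  # readable string

# The string store maps the integer ID back to its string.
print(label_id, label_str, nlp.vocab.strings[label_id] == label_str)
```

The integer form is compact and fast to compare, while the string form is meant for people reading the output.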

In [17]:
'''
IOB code of named entity tag.
"B" means the token begins an entity,
"I" means it is inside an entity,
"O" means it is outside an entity,
and "" means no entity tag is set.
'''
token.ent_iob_

'I'
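
The IOB scheme itself is easy to reproduce by hand. The sketch below assigns tags from a list of entity spans. Both the `iob_tags` helper and the span boundaries are assumptions chosen for illustration; they are not taken from the model's output.

```python
tokens = ["The", "United", "States", "of", "America", "is", "a", "country"]
# Hypothetical entity spans as (start, end, label), with end exclusive.
spans = [(0, 5, "GPE")]

def iob_tags(n_tokens, spans):
    """Tag every token O, then mark each span's first token B and the rest I."""
    tags = ["O"] * n_tokens
    for start, end, _label in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags

tags = iob_tags(len(tokens), spans)
for tok, tag in zip(tokens, tags):
    print(f"{tok:10}{tag}")
```

Under this assumed span, "States" is neither the first token of the entity nor outside it, so it receives "I", matching the value shown above.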

In [18]:
'''Base form of the token, with no inflection suffixes.'''
token.lemma_

'States'

In [19]:
doc[12], doc[12].lemma_

(known, 'know')

In [20]:
'''
Morphological analysis
'''
token.morph

Number=Sing

In [21]:
'''
Coarse-grained part-of-speech from the Universal POS tag set.
'''
token.pos_

'PROPN'

In [22]:
'''
Syntactic dependency relation.
'''
token.dep_

'nsubj'

In [23]:
'''
Language of the parent document’s vocabulary.
'''
token.lang_

'en'