# spaCy Introduction

In [1]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.3.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl (12.8 MB)
Collecting pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4
  Downloading pydantic-1.8.2-cp39-cp39-macosx_10_9_x86_64.whl (2.7 MB)
Collecting smart-open<6.0.0,>=5.2.1
  Downloading smart_open-5.2.1-py3-none-any.whl (58 kB)
Installing collected packages: smart-open, pydantic, en-core-web-sm
  Attempting uninstall: smart-open
    Found existing installation: smart-open 5.1.0
    Uninstalling smart-open-5.1.0:
      Successfully uninstalled smart-open-5.1.0
  Attempting uninstall: pydantic
    Found existing installation: pydantic 1.10.1
    Uninstalling pydantic-1.10.1:
      Successfully uninstalled pydantic-1.10.1


In [2]:
import spacy

## Creating an NLP object

In [3]:
# Load the small English pipeline; calling nlp(text) returns a processed Doc

nlp = spacy.load("en_core_web_sm")

## Reading from a File

In [5]:
with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()

In [6]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

Reference: http://spacy.pythonhumanities.com/01_02_linguistic_annotations.html

## Creating a `Doc` object

In [7]:
doc = nlp(text)

In [8]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

# Tokens

In [9]:
print(len(text))  # number of characters in the raw string
print(len(doc))   # number of tokens in the Doc

3521
654


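spaCy's tokenizer separates punctuation from words while keeping abbreviations such as `U.S.A.` intact, which a simple `split()` does not do. Compare the three outputs below: characters of the raw string, spaCy tokens, and whitespace-split words.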

In [11]:
# Iterating over a plain string yields single characters, not words
for char in text[0:10]:
    print(char)

T
h
e
 
U
n
i
t
e
d


In [12]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [13]:
for token in text.split()[0:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known
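To make the difference concrete without loading a model, here is a minimal sketch that compares `str.split()` with a small regular-expression tokenizer. The pattern `TOKEN_PATTERN` is a hypothetical simplification written for this example. It is not spaCy's tokenization algorithm, which uses language-specific rules and exceptions.

```python
import re

sample = "The United States of America (U.S.A. or USA), commonly known"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# Hypothetical simplified pattern (an assumption, not spaCy's rules):
# dotted abbreviations first, then runs of word characters,
# then any single punctuation mark.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.)+|\w+|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens[5:8])  # punctuation stays attached
print(regex_tokens[5:10])      # punctuation becomes separate tokens
```

Even this rough pattern reproduces the `(`, `U.S.A.`, `or`, `USA`, `)` sequence that spaCy produced above, while `split()` yields `(U.S.A.` and `USA),`.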


## Finding Sentences

In [14]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [15]:
sentence1 = doc.sents[0]
print(sentence1)

TypeError: 'generator' object is not subscriptable

`doc.sents` is a generator, so it cannot be indexed directly. To pick out a particular sentence, convert it to a list first.

In [17]:
# doc.sents is a generator, so convert it to a list before indexing
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
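The same behaviour applies to any Python generator, so it can be explored without spaCy. In the sketch below, `sentences()` is a hypothetical stand-in for `doc.sents`. It shows three ways to reach an item: `next()` for the first one, `itertools.islice` for a later one, and `list()` when random access is needed.

```python
from itertools import islice

def sentences():
    """Hypothetical stand-in for doc.sents: yields sentences lazily."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

# Indexing a generator raises TypeError, just like doc.sents[0].
try:
    sentences()[0]
    indexing_failed = False
except TypeError:
    indexing_failed = True

first = next(sentences())                     # first item only
second = next(islice(sentences(), 1, None))   # skip one item, take the next
as_list = list(sentences())                   # materialize everything

print(indexing_failed, first, second, len(as_list))
```

For a long document, `next()` or `islice` avoids building the full list when only one early sentence is needed.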


In [18]:
all_sentences = list(doc.sents)
print(all_sentences)

[The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America., It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j], At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d], The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world., The national capital is Washington, D.C., and the most populous city is New York., 

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century., The United States emerged from the thirteen British colo

In [19]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [20]:
token2 = sentence1[2]
print(token2)

States


# Basic Processing Steps

## The Five Phase Analysis
### Lexical and Morphological Analysis

- Scans the text and divides it into paragraphs, sentences, and words



### Syntactic Analysis

- Checks grammar, word agreement, and the relationships among words
- Statements that don't make sense are rejected at this stage


### Semantic Analysis

- Finding the meaning of the statements
- Literal meaning of phrases and words


### Discourse Integration

- Interprets the meaning of a sentence in light of the sentences before and after it


### Pragmatic Analysis

- Discovers the intended effect
- Applies a set of rules that characterize dialogues

# Token Attributes

- .text $\to$ the verbatim text of the token
- .left_edge $\to$ the leftmost token of this token's syntactic subtree
- .right_edge $\to$ the rightmost token of this token's syntactic subtree
- .ent_type_ $\to$ entity type
- .ent_iob_ $\to$ “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.
- .lemma_ $\to$ lemma (base form)
- .lang_ $\to$ language
- .morph $\to$ morphological analysis
- .pos_ $\to$ part of speech tagging
- .dep_ $\to$ syntactic dependency

In [21]:
token2.text

'States'

# Left and Right Edges

In [22]:
token2.left_edge

The

In [23]:
token2.right_edge

America
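The edges are the first and last tokens of the subtree that hangs off a token in the dependency parse. The sketch below computes them by hand for the phrase "The United States of America". The `heads` list is an assumption made for illustration (each entry is the index of that token's head, with the root pointing to itself). It is not output copied from the parser.

```python
# Assumed (hypothetical) dependency heads for the phrase; not parser output.
tokens = ["The", "United", "States", "of", "America"]
heads = [2, 2, 2, 2, 3]  # "States" heads the phrase, "America" attaches to "of"

def subtree_indices(index):
    """Return the indices of a token and all of its descendants."""
    found = [index]
    for child, head in enumerate(heads):
        if head == index and child != index:
            found.extend(subtree_indices(child))
    return found

def left_edge(index):
    """Leftmost token in the subtree rooted at `index`."""
    return tokens[min(subtree_indices(index))]

def right_edge(index):
    """Rightmost token in the subtree rooted at `index`."""
    return tokens[max(subtree_indices(index))]

print(left_edge(2), right_edge(2))  # subtree rooted at "States"
print(left_edge(3), right_edge(3))  # subtree rooted at "of"
```

Under these assumed heads, the subtree of "States" spans from "The" to "America", which matches the values spaCy returned for `token2`.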

# Entity Type

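`.ent_type_` gives the named-entity type of the token as predicted by the loaded model, here `en_core_web_sm`. The attribute without the trailing underscore, `.ent_type`, returns the same label as an integer ID.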

In [24]:
token2.ent_type

384

In [25]:
token2.ent_type_

'GPE'

In [26]:
token2.ent_iob_

'I'
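The IOB tags become more useful once they are grouped back into whole entities. The following is a rough sketch of that grouping in plain Python. The `tagged` rows are hypothetical triples shaped like `(token.text, token.ent_iob_, token.ent_type_)`, and `group_entities` is a helper written for this example, not a spaCy function.

```python
# Hypothetical rows shaped like (token.text, token.ent_iob_, token.ent_type_).
tagged = [
    ("The", "B", "GPE"),
    ("United", "I", "GPE"),
    ("States", "I", "GPE"),
    ("of", "I", "GPE"),
    ("America", "I", "GPE"),
    ("(", "O", ""),
    ("U.S.A.", "B", "GPE"),
    ("or", "O", ""),
]

def group_entities(rows):
    """Merge B/I-tagged tokens into (entity text, label) pairs."""
    entities, current, label = [], [], None
    for text, iob, ent_type in rows:
        if iob == "B":                      # a new entity starts here
            if current:
                entities.append((" ".join(current), label))
            current, label = [text], ent_type
        elif iob == "I" and current:        # continue the open entity
            current.append(text)
        else:                               # "O" or unset: close any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

print(group_entities(tagged))
```

In spaCy itself this grouping is already available through `doc.ents`, so the helper is only meant to show what the "B", "I", and "O" tags encode.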

In [29]:
print(sentence1[12])

known


# Morphological Analysis

In [30]:
token2.morph

Number=Sing

In [32]:
print(sentence1[12])
sentence1[12].morph

known


Aspect=Perf|Tense=Past|VerbForm=Part
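The morphology output is a `|`-separated list of `Feature=Value` pairs. As a minimal sketch, the hypothetical helper `parse_morph` below turns such a string into a dictionary using only string methods. spaCy's own morphology object also provides a `to_dict()` method for the same purpose.

```python
def parse_morph(features):
    """Split a 'Key=Value|Key=Value' feature string into a dict."""
    if not features:
        return {}
    return dict(pair.split("=", 1) for pair in features.split("|"))

# Feature strings copied from the outputs above.
known = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
states = parse_morph("Number=Sing")

print(known["Tense"], known["VerbForm"])
print(states)
```

Read this way, "known" is a past participle with perfect aspect, and "States" is marked as singular.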

# .pos_ (Part of Speech)

`.pos_` gives the token's coarse-grained part-of-speech tag.

In [33]:
token2.pos_

'PROPN'

Here, `States` is tagged as a proper noun (`PROPN`).

# Syntactic Dependency

In [34]:
token2.dep_

'nsubj'

In [35]:
token2.lang_

'en'

In [36]:
# Use a separate name so the Wikipedia text loaded earlier is not overwritten
text2 = "Mike enjoys playing football"
doc2 = nlp(text2)
print(doc2)


Mike enjoys playing football


In [37]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj


# Rendering Dependency Parsing

In [40]:
from spacy import displacy
displacy.render(doc2, style="dep")

Further reading on parsing and parse trees: `https://web.stanford.edu/~jurafsky/slp3/14.pdf`

# Entity Rendering

In [42]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
third- or DATE
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million QUANTITY
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
1775–1783 CARDINAL
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War EVENT
Spanish NORP
World War I EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVE
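A long entity listing is easier to digest when summarized by label. The sketch below counts labels with `collections.Counter`. The `pairs` list holds a handful of rows copied from the output above. In the notebook the same pairs could presumably be built from `(ent.text, ent.label_) for ent in doc.ents`.

```python
from collections import Counter

# A few (text, label) rows copied from the entity output above.
pairs = [
    ("The United States of America", "GPE"),
    ("North America", "LOC"),
    ("50", "CARDINAL"),
    ("Indian", "NORP"),
    ("Canada", "GPE"),
    ("Mexico", "GPE"),
    ("the 16th century", "DATE"),
    ("World War I", "EVENT"),
]

# Count how often each entity label occurs.
label_counts = Counter(label for _, label in pairs)

# Collect the entity texts that share a given label.
gpe_names = [text for text, label in pairs if label == "GPE"]

print(label_counts.most_common(1))
print(gpe_names)
```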

In [43]:
displacy.render(doc, style="ent")