In [None]:
!pip install spacy



In [3]:
!python -m spacy download en_core_web_sm

^C


Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ---------------------------------------- 12.8/12.8 MB 13.8 MB/s  0:00:01
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
!python -m spacy download es_core_news_sm

In [4]:
import spacy

In [5]:
nlp = spacy.load('en_core_web_sm')

In [6]:
doc = nlp('This is a sentence')

In [7]:
doc[0]

This

## spaCy Linguistic Annotations

In [8]:
with open("data/wiki_us.txt", "r") as f:
    text = f.read()

In [9]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [10]:
doc = nlp(text)

In [11]:
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [12]:
print(len(text))
print(len(doc))

3525
652


In [13]:
print([i for i in text[:10]])
print([i for i in doc[:10]])  # iterates over individual tokens (words, punctuation marks, ...)

['T', 'h', 'e', ' ', 'U', 'n', 'i', 't', 'e', 'd']
[The, United, States, of, America, (, U.S.A., or, USA, )]


In [14]:
print([i for i in text.split()[:10]])  # note that split() neither removes nor separates the parentheses, unlike spaCy

['The', 'United', 'States', 'of', 'America', '(U.S.A.', 'or', 'USA),', 'commonly', 'known']
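To see why `split()` and a tokenizer disagree, here is a minimal sketch that separates punctuation from words with a regular expression. It is only a crude approximation for illustration: spaCy's tokenizer relies on language-specific rules and exception lists, not on this pattern, and `rough_tokenize` is a hypothetical helper name.

```python
import re

# Crude approximation only: NOT how spaCy tokenizes internally.
TOKEN_PATTERN = re.compile(
    r"(?:[A-Za-z]\.){2,}"  # dotted abbreviations such as U.S.A.
    r"|\w+"                # runs of word characters
    r"|[^\w\s]"            # any single punctuation mark
)

def rough_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

sample = "The United States of America (U.S.A. or USA), commonly known"
print(sample.split()[:10])          # whitespace split keeps "(U.S.A." glued together
print(rough_tokenize(sample)[:10])  # punctuation becomes separate tokens
```

On this sample the regex separates the parentheses while keeping `U.S.A.` whole, but it will fail on many cases a real tokenizer handles.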


Each token has attributes

## Sentence Boundary Detection (SBD)

SBD is the identification of sentences in a text.

One approach would be split("."), but in English the period is also used in abbreviations (U.S.A.). We could add a rule requiring the period to be preceded by a lowercase letter, but there is no need to complicate things when spaCy already handles this.
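The two manual approaches mentioned above can be compared in a short sketch. The sample text is a shortened, hand-written variant of the Wikipedia passage, and the regular expression is just one possible way to express the "lowercase before the period" rule.

```python
import re

sample = (
    "The United States of America (U.S.A. or USA) is a country. "
    "It consists of 50 states. The capital is Washington, D.C., and "
    "the most populous city is New York."
)

# Naive approach: every period ends a "sentence", so abbreviations are shredded.
naive = [part.strip() for part in sample.split(".") if part.strip()]

# Rule-based approach: split only at whitespace that follows a lowercase
# letter plus a period and precedes an uppercase letter.
rule_based = re.split(r"(?<=[a-z]\.)\s+(?=[A-Z])", sample)

print(len(naive), naive)
print(len(rule_based), rule_based)
```

The rule works on this sample, but it misses a sentence that ends in a digit, an uppercase letter, or a footnote marker such as `[j]`, which is why a trained sentence segmenter is the more practical choice.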

In [15]:
for sent in doc.sents:
    print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [16]:
sentence1 = doc.sents[0]  # Error: doc.sents is a generator, so it cannot be indexed
print(sentence1)

# A generator is a Python object that produces values one at a time, on demand, instead of computing them all up front.

TypeError: '_cython_3_2_1.generator' object is not subscriptable

In [17]:
def contador():
    yield 1
    yield 2
    yield 3

gen = contador()
print(gen)

print(next(gen))
print(next(gen))
print(next(gen))
print(next(gen))

<generator object contador at 0x0000019A6D6E6EA0>
1
2
3


StopIteration: 
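The `StopIteration` above is the normal signal that a generator is exhausted. The sketch below shows two common ways to avoid handling it by hand, reusing the same `contador` function: a `for` loop (or comprehension), and the optional default argument of `next()`.

```python
def contador():
    yield 1
    yield 2
    yield 3

# A for loop or comprehension catches StopIteration internally and just stops.
values = [value for value in contador()]

# next() accepts a default that is returned once the generator is exhausted.
gen = contador()
for _ in range(3):
    next(gen)
after_end = next(gen, None)

print(values)
print(after_end)
```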

A generator expression builds a generator in a single line:

In [18]:
gen = (x * 2 for x in range(5))
gen

<generator object <genexpr> at 0x0000019A6D6E6DC0>

In [19]:
sentence1 = list(doc.sents)[0]
print(sentence1)  # Works: list() materializes the generator, so it can be indexed

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
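Converting to a list processes every sentence, even when only the first one is needed. A lazier alternative is sketched below with `next()` and `itertools.islice`. `fake_sents` is a hypothetical stand-in for `doc.sents` so that the example runs without a model; with a real `Doc` the same idea would be `next(iter(doc.sents))`.

```python
from itertools import islice

def fake_sents():
    """Hypothetical stand-in for doc.sents: yields sentences lazily."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

# First item only: next() pulls a single value and stops.
first = next(iter(fake_sents()))

# First two items: islice() stops early instead of consuming everything.
first_two = list(islice(fake_sents(), 2))

print(first)
print(first_two)
```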


### Token attributes

* .text
* .head
* .left_edge
* .right_edge
* .ent_type_
* .ent_iob_
* .lemma_
* .morph
* .pos_
* .dep_
* .lang_

In [21]:
for token in doc:
    print(token)
    break

The


In [22]:
token2 = sentence1[2]
print(token2)

States


.text --> str

In [24]:
print(type(token2.text))
token2.text

<class 'str'>


'States'

.left_edge --> The

.left_edge returns the leftmost token of this token's syntactic subtree. When the token belongs to a multi-word unit, or heads several components that together form a larger span, it tells us where that span starts. .right_edge is the rightmost counterpart.


In [25]:
token2.left_edge

The

In [26]:
token2.right_edge

,
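The idea behind the two edges can be sketched without spaCy on a toy dependency tree: collect a token's descendants, then take the smallest and largest positions. The `heads` list below is an assumption written by hand for illustration (each entry is the index of that token's head, and the root points to itself); it is not model output, and `subtree_indices` and `edges` are hypothetical helpers.

```python
tokens = ["Mike", "enjoys", "playing", "football"]
# Assumed head index for each token; the root ("enjoys") points to itself.
heads = [1, 1, 1, 2]

def subtree_indices(heads, index):
    """Return the indices of `index` and all of its syntactic descendants."""
    result = {index}
    changed = True
    while changed:
        changed = False
        for child, head in enumerate(heads):
            if head in result and child not in result:
                result.add(child)
                changed = True
    return result

def edges(tokens, heads, index):
    """Return the (leftmost, rightmost) tokens of the subtree rooted at `index`."""
    indices = subtree_indices(heads, index)
    return tokens[min(indices)], tokens[max(indices)]

for i, token in enumerate(tokens):
    left, right = edges(tokens, heads, i)
    print(f"{token}: left_edge={left}, right_edge={right}")
```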

.ent_type --> entity type as an integer ID

.ent_type_ --> entity type name as a string

In [27]:
token2.ent_type

384

In [None]:
token2.ent_type_ #Geopolitical Entity

'GPE'

.ent_iob_ --> IOB code of the token's named entity tag

B means the token begins an entity, I means it is inside an entity, and O means it is outside any entity.

In [29]:
token2.ent_iob_

'I'
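To make the B/I/O scheme concrete, the sketch below groups tagged tokens back into entity spans. The `tagged` list is hand-written for illustration, not output from the model, and `group_entities` is a hypothetical helper.

```python
def group_entities(tagged):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, iob, label in tagged:
        if iob == "B":
            # A new entity starts; flush the previous one if there is any.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == "I" and current_words:
            current_words.append(text)
        else:
            # "O" (or a stray "I"): close any open entity.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities

# Hand-written tags, assumed for illustration only.
tagged = [
    ("The", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
    ("is", "O", ""), ("in", "O", ""),
    ("North", "B", "LOC"), ("America", "I", "LOC"),
]
print(group_entities(tagged))
```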

.lemma_ --> root form of the word or token

In [31]:
token2.lemma_ #root form

'States'

In [33]:
sentence1[12].lemma_

'know'

.morph --> morphological analysis of the token

The features are printed as Key=Value pairs joined by |. For example, NounType=Prop|Number=Sing means proper noun, singular.

In [36]:
token2.morph

Number=Sing

In [None]:
sentence1[12].morph  # Perfect past participle; these features will be important for defining grammatical rules

Aspect=Perf|Tense=Past|VerbForm=Part
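As a rough idea of how such features could feed a grammatical rule, the sketch below parses the printed string form into a dictionary and checks for a past participle. It works only on the string representation shown above; `parse_morph` and `is_past_participle` are hypothetical helpers, and spaCy's own morph object has its own accessors that are not used here.

```python
def parse_morph(morph_string):
    """Turn 'Key=Value|Key=Value' into a dict; an empty string gives {}."""
    if not morph_string:
        return {}
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))

def is_past_participle(features):
    """Example rule: the token is a participle in the past tense."""
    return features.get("VerbForm") == "Part" and features.get("Tense") == "Past"

known = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
states = parse_morph("Number=Sing")

print(known, is_past_participle(known))
print(states, is_past_participle(states))
```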

.pos_ = Part of Speech

In [None]:
token2.pos_  #Proper Noun 

'PROPN'

.dep_ --> dependency relation

In [None]:
token2.dep_  # dependency relation: nsubj (nominal subject)

'nsubj'

.lang_ --> language

In [42]:
token2.lang_

'en'

In [43]:
text = "Mike enjoys playing football"
doc2 = nlp(text)

print(doc2)

Mike enjoys playing football


In [45]:
for token in doc2: 
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
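The dependency labels printed above are enough to pull out a rough "who does what" summary. The sketch below does this on the printed rows, copied by hand so that it runs without a model; `first_with_dep` is a hypothetical helper.

```python
# (text, pos, dep) rows copied from the output above.
rows = [
    ("Mike", "PROPN", "nsubj"),
    ("enjoys", "VERB", "ROOT"),
    ("playing", "VERB", "xcomp"),
    ("football", "NOUN", "dobj"),
]

def first_with_dep(rows, dep):
    """Return the text of the first row whose dependency label is `dep`."""
    return next((text for text, _pos, row_dep in rows if row_dep == dep), None)

summary = {
    "subject": first_with_dep(rows, "nsubj"),
    "main_verb": first_with_dep(rows, "ROOT"),
    "complement": first_with_dep(rows, "xcomp"),
    "object": first_with_dep(rows, "dobj"),
}
print(summary)
```

With a real `Doc`, the same lookup could iterate over the tokens and compare `token.dep_` instead of reading tuples.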


In [46]:
from spacy import displacy

displacy.render(doc2, style = "dep")