In [67]:
#importing spacy
import spacy

In [68]:
nlp = spacy.load('en_core_web_sm')

In [69]:
with open('data/article.txt') as f:
    text = f.read()

In [70]:
text

'Saudi Arabia has suffered a string of deadly shootings and bomb attacks in recent months, many of which the Daesh (so-called IS) terrorist organization have claimed responsibility for. In July, suicide bombers struck three cities across Saudi Arabia, killing at least four security officers. The apparently coordinated attacks came on the penultimate day of the Muslim holy month of Ramadan. Recently, the Ministry of Interior foiled a planned suicide attack on a mosque in the Qatif region, in eastern Saudi Arabia, on Aug. 24. Indeed, the Ministry of Interior has successfully repelled hundreds of terrorist operations since 2003, displaying security expertise that is admired worldwide. Many countries have benefited from this experience in counter terrorism. However, success at security level does not correspond to a counter terrorist ideology that has seeped into our communities through social media or by any other means.\n\nThe discourse and ideology of any terrorist organization, be it A

In [71]:
doc = nlp(text)

In [72]:
print(doc)

Saudi Arabia has suffered a string of deadly shootings and bomb attacks in recent months, many of which the Daesh (so-called IS) terrorist organization have claimed responsibility for. In July, suicide bombers struck three cities across Saudi Arabia, killing at least four security officers. The apparently coordinated attacks came on the penultimate day of the Muslim holy month of Ramadan. Recently, the Ministry of Interior foiled a planned suicide attack on a mosque in the Qatif region, in eastern Saudi Arabia, on Aug. 24. Indeed, the Ministry of Interior has successfully repelled hundreds of terrorist operations since 2003, displaying security expertise that is admired worldwide. Many countries have benefited from this experience in counter terrorism. However, success at security level does not correspond to a counter terrorist ideology that has seeped into our communities through social media or by any other means.

The discourse and ideology of any terrorist organization, be it Al-Q

In [73]:
print(f'Len of text: {len(text)}')
print(f'Len of doc: {len(doc)}')

Len of text: 4282
Len of doc: 776


In [74]:
# Iterating over a plain string yields single characters, not words
for char in text[0:10]:
    print(char)

S
a
u
d
i
 
A
r
a
b


In [75]:
for token in doc[:10]:
    print(token)

Saudi
Arabia
has
suffered
a
string
of
deadly
shootings
and


#### Difference between the text object and the doc object

In [76]:
for token in text.split()[:10]:
    print(token)

Saudi
Arabia
has
suffered
a
string
of
deadly
shootings
and
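
For the first ten words, `text.split()` and the `doc` object happen to agree, but the two diverge as soon as punctuation appears: `split()` only cuts on whitespace, whereas a tokenizer separates punctuation into its own tokens. The sketch below illustrates this with a rough regular expression standing in for a tokenizer. The regex is an illustrative assumption, not spaCy's actual tokenization rules.

```python
import re

sample = (
    "Saudi Arabia has suffered a string of deadly shootings and bomb attacks "
    "in recent months, many of which"
)

# str.split() only cuts on whitespace, so the comma stays attached to "months"
whitespace_tokens = sample.split()

# Rough stand-in for a tokenizer (an assumption for illustration only):
# runs of word characters, or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)

print("months," in whitespace_tokens)   # the comma is part of the word
print("months" in rough_tokens, "," in rough_tokens)
print(len(whitespace_tokens), len(rough_tokens))
```

Differences like this one are among the reasons `len(doc)` does not have to match the number of whitespace-separated words in `text`.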


## Sentence Boundary Detection (SBD)
In NLP, sentence boundary detection, or SBD, is the identification of sentences in a text. Again, this may seem fairly easy to do with rules. One could use `split(".")`, but in English the period also marks abbreviations such as "Aug.". You could, again, write rules that only treat a period as a boundary when it is not followed by a lowercase word, but again I ask, "why bother?" With spaCy, we can have all sentences separated in seconds through SBD.
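
To make the abbreviation problem concrete, here is a minimal sketch that applies the naive `split(".")` rule to two sentences copied from the article. It uses only the standard library. The variable names are mine.

```python
sample = (
    "Recently, the Ministry of Interior foiled a planned suicide attack on a mosque "
    "in the Qatif region, in eastern Saudi Arabia, on Aug. 24. "
    "Indeed, the Ministry of Interior has successfully repelled hundreds of terrorist "
    "operations since 2003, displaying security expertise that is admired worldwide."
)

# Naive rule: every period ends a sentence
naive_sentences = [part.strip() for part in sample.split(".") if part.strip()]

for part in naive_sentences:
    print(repr(part))

# Two real sentences come back as three pieces, because "Aug." is cut in the middle
print(len(naive_sentences))
```

The first piece ends at "on Aug" and the second piece is just "24", which is exactly the kind of error a rule-based splitter has to be patched against.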

In [77]:
for sent in doc.sents:
    print(sent)

Saudi Arabia has suffered a string of deadly shootings and bomb attacks in recent months, many of which the Daesh (so-called IS) terrorist organization have claimed responsibility for.
In July, suicide bombers struck three cities across Saudi Arabia, killing at least four security officers.
The apparently coordinated attacks came on the penultimate day of the Muslim holy month of Ramadan.
Recently, the Ministry of Interior foiled a planned suicide attack on a mosque in the Qatif region, in eastern Saudi Arabia, on Aug. 24.
Indeed, the Ministry of Interior has successfully repelled hundreds of terrorist operations since 2003, displaying security expertise that is admired worldwide.
Many countries have benefited from this experience in counter terrorism.
However, success at security level does not correspond to a counter terrorist ideology that has seeped into our communities through social media or by any other means.


The discourse and ideology of any terrorist organization, be it Al-

In [81]:
sentence1 = list(doc.sents)[0]
print(sentence1)

Saudi Arabia has suffered a string of deadly shootings and bomb attacks in recent months, many of which the Daesh (so-called IS) terrorist organization have claimed responsibility for.


In [84]:
len(list(doc.sents))

26

In [85]:
for sentence in doc.sents:
    print('sentence:',sentence)

sentence: Saudi Arabia has suffered a string of deadly shootings and bomb attacks in recent months, many of which the Daesh (so-called IS) terrorist organization have claimed responsibility for.
sentence: In July, suicide bombers struck three cities across Saudi Arabia, killing at least four security officers.
sentence: The apparently coordinated attacks came on the penultimate day of the Muslim holy month of Ramadan.
sentence: Recently, the Ministry of Interior foiled a planned suicide attack on a mosque in the Qatif region, in eastern Saudi Arabia, on Aug. 24.
sentence: Indeed, the Ministry of Interior has successfully repelled hundreds of terrorist operations since 2003, displaying security expertise that is admired worldwide.
sentence: Many countries have benefited from this experience in counter terrorism.
sentence: However, success at security level does not correspond to a counter terrorist ideology that has seeped into our communities through social media or by any other means.

In [79]:
for token in doc[:10]:
    print(token)

Saudi
Arabia
has
suffered
a
string
of
deadly
shootings
and


In [80]:
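# sentence1 must be the first-sentence Span from list(doc.sents)[0];
# if it still holds an int, indexing raises
# "TypeError: 'int' object is not subscriptable"
token2 = sentence1[1]
print(token2)

Arabia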

In [None]:
token2.text

'Arabia'

In [None]:
token2.left_edge

Saudi

In [None]:
token2.right_edge

Arabia

In [None]:
token2.ent_type #type of entity

384

In [None]:
token2.ent_type_ # "GPE" --> Geopolitical entity

'GPE'

In [None]:
token2.ent_iob_ # inside, outside, beginning of entity

'I'

In [None]:
token2.ent_iob

1
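
The IOB scheme marks each token as the Beginning of an entity, Inside one, or Outside any entity. In spaCy's integer encoding, `1` corresponds to `'I'`, which is why "Arabia" (the second token of "Saudi Arabia") reports `'I'` and `1` above. The following is a minimal sketch of how such tags can be derived from entity spans. The `iob_tags` helper and its `(start, end, label)` span format are hypothetical and written for illustration. They are not part of spaCy.

```python
def iob_tags(tokens, entities):
    """Return one IOB tag per token.

    entities is a list of (start, end, label) token spans, with end exclusive.
    This helper is a hypothetical illustration, not a spaCy function.
    """
    tags = ["O"] * len(tokens)            # default: outside any entity
    for start, end, label in entities:
        tags[start] = f"B-{label}"        # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens are inside it
    return tags


tokens = ["Saudi", "Arabia", "has", "suffered"]
tags = iob_tags(tokens, [(0, 2, "GPE")])

for token, tag in zip(tokens, tags):
    print(token, tag)
```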

In [None]:
token2.lemma_

'Arabia'

In [None]:
sentence1[3].lemma_

'suffer'

In [None]:
print(sentence1[3])

suffered


In [None]:
token2.morph

Number=Sing

In [None]:
print(sentence1[3])
sentence1[3].morph

suffered


Aspect=Perf|Tense=Past|VerbForm=Part

In [None]:
token2.pos_ # pos_ --> part of speech, PROPN --> Proper Noun

'PROPN'

In [None]:
token2.dep_ # dependency relation, nsubj --> nominal subject

'nsubj'

In [None]:
token2.lang_ # Language of the doc object

'en'

In [None]:
text = "Mike enjoys playing football"
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football


In [None]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj


In [None]:
from spacy import displacy
displacy.render(doc2, style='dep')

In [None]:
displacy.render(doc, style='dep')

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Saudi Arabia GPE
recent months DATE
July DATE
three CARDINAL
Saudi Arabia GPE
at least four CARDINAL
Muslim NORP
the Ministry of Interior ORG
Qatif GPE
Saudi Arabia GPE
Aug. 24 DATE
hundreds CARDINAL
2003 DATE
Al-Qaeda ORG
three CARDINAL
Arab NORP
Islamic NORP
the Ottoman Empire GPE
Islamic NORP
Islamic NORP
Islamic NORP
Caliph PERSON
Muslims NORP
East LOC
West LOC
the Sykes-Picot Agreement ORG
Islamic NORP
the Muslim Brotherhood ORG
Egypt GPE
the Gulen Movement ORG
Turkey GPE
India GPE
Al-Qaeda ORG
Al-Qaeda ORG
Saudi Arabia GPE
Saudi NORP
Al-Qaeda ORG
Yemen GPE
Today DATE
Islamic NORP
Mohammad Bin Naif Counseling and Care Centre PERSON
Assakina LOC
the Hedaya Centre FAC
the Sawab Centre ORG
Gulf LOC
Arab NORP
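
A long entity listing is easier to read once it is summarised by label. As a small sketch, the snippet below tallies the first thirteen `(text, label)` pairs copied from the output above with `collections.Counter`. With a live `doc`, the same pairs could be built from `doc.ents`.

```python
from collections import Counter

# First thirteen (text, label) pairs, copied from the printed output
entity_pairs = [
    ("Saudi Arabia", "GPE"),
    ("recent months", "DATE"),
    ("July", "DATE"),
    ("three", "CARDINAL"),
    ("Saudi Arabia", "GPE"),
    ("at least four", "CARDINAL"),
    ("Muslim", "NORP"),
    ("the Ministry of Interior", "ORG"),
    ("Qatif", "GPE"),
    ("Saudi Arabia", "GPE"),
    ("Aug. 24", "DATE"),
    ("hundreds", "CARDINAL"),
    ("2003", "DATE"),
]

# Count how often each entity label occurs
label_counts = Counter(label for _, label in entity_pairs)

for label, count in label_counts.most_common():
    print(label, count)
```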


In [None]:
displacy.render(doc, style='ent')

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
with open('data/article.txt', 'r') as f:
    text = f.read()

In [None]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

Saudi Arabia has suffered a string of deadly shootings and bomb attacks in recent months, many of which the Daesh (so-called IS) terrorist organization have claimed responsibility for.


In [None]:
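import numpy as np
y_word = 'Arabia'

# Note: en_core_web_sm ships without static word vectors, so this lookup fails with E058;
# a pipeline that includes vectors (for example en_core_web_md) is needed for it to work.
ms = nlp.vocab.vectors.most_similar(
    np.asarray(
        [nlp.vocab.vectors[
            nlp.vocab.strings[y_word]]]
        ),
    n=10)
# most_similar returns (keys, best_rows, scores); index 0 selects the single query row
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]  # similarity scores, one row per query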

KeyError: '[E058] Could not retrieve vector for key 16771399832892321903.'
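
To show what a "most similar" lookup does conceptually, here is a minimal standard-library sketch that ranks words by cosine similarity. It assumes cosine similarity is the ranking criterion, which is my understanding of `most_similar` but is not confirmed by this notebook. The three-dimensional `toy_vectors` are invented for illustration and are not real spaCy vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, made up for this example only
toy_vectors = {
    "Arabia": [0.9, 0.1, 0.0],
    "Yemen": [0.8, 0.2, 0.1],
    "football": [0.0, 0.1, 0.9],
}

query = "Arabia"

# Rank every other word by its similarity to the query, highest first
ranked = sorted(
    (word for word in toy_vectors if word != query),
    key=lambda word: cosine_similarity(toy_vectors[query], toy_vectors[word]),
    reverse=True,
)

print(ranked)
```

With real vectors the idea is the same, only the table is much larger and the vectors have hundreds of dimensions.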