## Tokenization

In [50]:
#!pip install spacy
import spacy

### Creation of a blank language object

In [114]:
nlp = spacy.blank("en") 

# spacy.blank("en") creates a blank English pipeline.
# The nlp object turns raw text into a Doc, so we can already tokenize with it,
# but no other pipeline components have been added yet.

<img src="spacy_blank_pipeline.jpg" height="100" width="500"/>

In [115]:
import pandas as pd

In [116]:
df = pd.read_excel("D:/github/Python/Deep Learning/annotation.xlsx")

In [117]:
df.head()

Unnamed: 0,sentence,indicator,factor,source
0,"In war, three quarters turns on morale; the ba...",Morale,moral,"Napoleon, as cited in Partington, 1996."
1,"Where superiority of numbers is overwhelming, ...",numerical superiority,physical,"Clausewitz, 1989, p.196"
2,Grand strategy should calculate and develop ec...,"economic resources, man-power",physical,"Hart, 1991, 322"
3,"Beyond geography, money has always been the gr...",finance,physical,"Smith, 2019, 19."
4,We have no right to judge cities by their appe...,numerical superiority,physical,"Thucydides, 1972, 41"


In [118]:
df.shape

(41, 4)

### Creation of a document (paragraph)

In [119]:
df.sentence.values[0]

'In war, three quarters turns on morale; the balance of manpower counts only for the remaining quarter.'

In [120]:
doc = nlp(df.sentence.values[0])

In [122]:
nlp.pipe_names 

# By default we have only the tokenizer; no pipeline components have been added yet.

[]

In [121]:
for token in doc:
    print(token)

In
war
,
three
quarters
turns
on
morale
;
the
balance
of
manpower
counts
only
for
the
remaining
quarter
.


In [59]:
doc[0]

In

In [60]:
df.sentence.values[0].split()

['In',
 'war,',
 'three',
 'quarters',
 'turns',
 'on',
 'morale;',
 'the',
 'balance',
 'of',
 'manpower',
 'counts',
 'only',
 'for',
 'the',
 'remaining',
 'quarter.']
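
The two outputs above differ only in how punctuation is handled. The following is a minimal sketch of that comparison, assuming the same blank English pipeline and the first sentence of the dataset pasted in as a literal string (the variable names are illustrative, not from the original notebook):

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained components
text = ("In war, three quarters turns on morale; the balance of manpower "
        "counts only for the remaining quarter.")

spacy_tokens = [token.text for token in nlp(text)]
split_tokens = text.split()

# str.split() keeps punctuation glued to the neighbouring word,
# while the spaCy tokenizer emits it as separate tokens.
only_in_spacy = [t for t in spacy_tokens if t not in split_tokens]

print(len(spacy_tokens), len(split_tokens))
print(only_in_spacy)
```

This is why `split()` yields fewer items: `'war,'`, `'morale;'` and `'quarter.'` each count as one element there, whereas spaCy separates the word from the punctuation mark.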

In [61]:
type(nlp)  # the Language object for English

spacy.lang.en.English

In [62]:
type(doc)

spacy.tokens.doc.Doc

In [63]:
type(token)

spacy.tokens.token.Token

In [64]:
span = doc[1:5]
type(span)

spacy.tokens.span.Span
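
A `Span` is a slice of a `Doc` and behaves like a view onto it. The sketch below, which assumes a shortened version of the sentence and a blank pipeline, shows the token and character offsets that a span carries:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("In war, three quarters turns on morale.")

span = doc[1:5]                          # tokens 1..4; the end index is exclusive
print(span.text)
print(span.start, span.end, len(span))   # token offsets within the doc
print(span.start_char, span.end_char)    # character offsets into doc.text

# Slicing the raw text with the character offsets gives back the same string.
print(doc.text[span.start_char:span.end_char] == span.text)
```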

In [65]:
token0 = doc[0]
token0

In

In [66]:
dir(token0)  # attributes and methods of the Token object

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang

In [67]:
token0.is_alpha

True

In [68]:
token0.like_num

False

In [69]:
token0.is_currency

False

In [70]:
doc

In war, three quarters turns on morale; the balance of manpower counts only for the remaining quarter.
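
Rather than inspecting one token at a time, the same attributes can be collected for every token in a document. This is a small sketch under the assumption of a blank English pipeline and a shortened sentence; the choice of columns is only illustrative:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("In war, three quarters turns on morale.")

# One row per token: position, character offset, text and a few lexical flags.
rows = [
    (t.i, t.idx, t.text, t.is_alpha, t.is_punct, t.like_num)
    for t in doc
]

print("i  idx  text       alpha  punct  like_num")
for i, idx, text, alpha, punct, num in rows:
    print(f"{i:<2} {idx:<4} {text:<10} {alpha!s:<6} {punct!s:<6} {num}")
```

These flags are lexical attributes, so they are available even in a blank pipeline; attributes such as `pos_` or `lemma_` stay empty until a trained pipeline is loaded.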

### Customizing the tokenizer

In [71]:
from spacy.symbols import ORTH

doc = nlp("gimme double cheese extra large healthy pizza")

nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"}
])


doc = nlp("gimme double cheese extra large healthy pizza")

[i.text for i in doc]

# The tokenizer is deliberately simple: it only splits the text into segments and never changes the underlying text.

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
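
To make the effect of the special case easier to see, the sketch below tokenizes the same string before and after the rule is added. It assumes a fresh blank pipeline; the names `before`, `after` and `pieces` are illustrative:

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")
text = "gimme double cheese extra large healthy pizza"

before = [t.text for t in nlp(text)]

# The ORTH pieces must concatenate back to the original string ("gim" + "me").
pieces = [{ORTH: "gim"}, {ORTH: "me"}]
assert "".join(p[ORTH] for p in pieces) == "gimme"
nlp.tokenizer.add_special_case("gimme", pieces)

doc = nlp(text)
after = [t.text for t in doc]

print(before[:2])
print(after[:3])
print(doc.text == text)  # the underlying text is unchanged
```

The rule only changes where the token boundaries fall. Because the pieces have to add up to the original string, a special case cannot be used to rewrite the text itself.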

### Sentence tokenizer

In [107]:
doc = nlp(df.sentence.values[12])

In [108]:
doc

It is important to decide sort of war we are going to fight. If we can neither defeat them at sea nor take away from them the resources on which their navy depends, we shall do ourselves more harm than good. 

In [109]:
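# Add the rule-based sentencizer to the blank pipeline so that the nlp object knows how to split sentences.
# The guard avoids an error when this cell is run more than once.
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")

# Re-create the doc so that sentence boundaries are set.
doc = nlp(df.sentence.values[12])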

In [110]:
nlp.pipe_names

['sentencizer']

In [111]:
for i in doc.sents: 
    print(i)

It is important to decide sort of war we are going to fight.
If we can neither defeat them at sea nor take away from them the resources on which their navy depends, we shall do ourselves more harm than good.
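
A document only has sentence boundaries if it was processed after the sentencizer was added. The following sketch illustrates this with `Doc.has_annotation`, assuming a fresh blank pipeline and the text above pasted in as a literal string:

```python
import spacy

text = ("It is important to decide sort of war we are going to fight. "
        "If we can neither defeat them at sea nor take away from them the "
        "resources on which their navy depends, we shall do ourselves more "
        "harm than good.")

nlp = spacy.blank("en")
doc_before = nlp(text)                       # no sentence boundaries set
print(doc_before.has_annotation("SENT_START"))

nlp.add_pipe("sentencizer")                  # rule-based, punctuation-driven
doc_after = nlp(text)                        # re-process the text
print(doc_after.has_annotation("SENT_START"))

sentences = [sent.text for sent in doc_after.sents]
print(len(sentences))
```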


## Pre-trained pipeline

<img src="spacy_loaded_pipeline.jpg" height="100" width="500"/>

<h3>Download trained pipeline</h3>

To download a trained pipeline, use a command such as:

python -m spacy download en_core_web_sm

This downloads the small (sm) pipeline for the English language.

Further instructions: https://spacy.io/usage/models#quickstart

In [126]:
# run this command to install: python -m spacy download en_core_web_sm 

nlp = spacy.load("en_core_web_sm")

The sm in en_core_web_sm stands for small. Other sizes, such as medium and large, are also available; see https://spacy.io/usage/models#quickstart

In [127]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [129]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x201c0e77170>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x201c0e76390>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x201c0e985f0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x201c1017c90>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x201c1013f10>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x201c0e984a0>)]

In [133]:
doc = nlp(df.sentence.values[0])

for token in doc:
    print(token, token.pos_, token.lemma_)
    
# lemma: the base (dictionary) form of a word
# pos: the part-of-speech tag

In ADP in
war NOUN war
, PUNCT ,
three NUM three
quarters NOUN quarter
turns VERB turn
on ADP on
morale NOUN morale
; PUNCT ;
the DET the
balance NOUN balance
of ADP of
manpower NOUN manpower
counts VERB count
only ADV only
for ADP for
the DET the
remaining VERB remain
quarter NOUN quarter
. PUNCT .
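
Once tokens carry `pos_` and `lemma_`, they can be used to summarize or filter a text. The sketch below assumes that `en_core_web_sm` has been downloaded and skips the analysis if it has not; the set of tags kept is an arbitrary choice for illustration:

```python
from collections import Counter

import spacy

try:
    nlp = spacy.load("en_core_web_sm")   # assumes the model has been downloaded
except OSError:
    nlp = None                           # model not installed; nothing to tag

pos_counts = Counter()
content_lemmas = []
if nlp is not None:
    doc = nlp("In war, three quarters turns on morale; the balance of "
              "manpower counts only for the remaining quarter.")
    # How often each coarse part-of-speech tag occurs.
    pos_counts = Counter(token.pos_ for token in doc)
    # Keep lemmas of nouns and verbs only, dropping punctuation and function words.
    content_lemmas = [t.lemma_ for t in doc if t.pos_ in {"NOUN", "VERB"}]
    print(pos_counts.most_common(3))
    print(content_lemmas)
```

The exact tags and lemmas depend on the model version, so the output may differ slightly from the table above.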


<h3>Named Entity Recognition</h3>

In [177]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names


doc = nlp(df.sentence.str.cat(sep=";"))

In [179]:
doc[:6]

In war, three quarters turns

In [182]:
print(doc.ents)

(three quarters, Grand, War, Trojans, ten years, fraction of Greek Army, one, Spartans, Corcyra, 64, Persians, Salamis, Hellas, Athens, three, navy, First, secondly, War, British, Falklands, a hundred, Strategy, three, winter, summer)


In [183]:
for ent in doc.ents:
    print(ent.text, "-", ent.label_, "-", spacy.explain(ent.label_))

three quarters - DATE - Absolute or relative dates or periods
Grand - ORG - Companies, agencies, institutions, etc.
War - EVENT - Named hurricanes, battles, wars, sports events, etc.
Trojans - NORP - Nationalities or religious or political groups
ten years - DATE - Absolute or relative dates or periods
fraction of Greek Army - ORG - Companies, agencies, institutions, etc.
one - CARDINAL - Numerals that do not fall under another type
Spartans - NORP - Nationalities or religious or political groups
Corcyra - PERSON - People, including fictional
64 - CARDINAL - Numerals that do not fall under another type
Persians - NORP - Nationalities or religious or political groups
Salamis - ORG - Companies, agencies, institutions, etc.
Hellas - GPE - Countries, cities, states
Athens - GPE - Countries, cities, states
three - CARDINAL - Numerals that do not fall under another type
navy - ORG - Companies, agencies, institutions, etc.
First - ORDINAL - "first", "second", etc.
secondly - ORDINAL - "first"

In [138]:
from spacy import displacy

In [140]:
displacy.render(doc, style = "ent")
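
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. This is a minimal sketch that needs no trained model: it assumes a blank pipeline and sets one entity by hand purely for illustration, whereas a trained `ner` component would normally fill `doc.ents`:

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("In war, three quarters turns on morale.")

# Hand-set entity for illustration only: tokens 3..4 ("three quarters").
doc.ents = [Span(doc, 3, 5, label="DATE")]

# jupyter=False returns the HTML markup as a string instead of displaying it.
html = displacy.render(doc, style="ent", jupyter=False)

print(html[:80])
```

The returned string can be written to an `.html` file and opened in a browser.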

### Manual uploading 

In the image below you can see the sentencizer component in the pipeline.

<img src="sentecizer.jpg" height="100" width="200"/>

In [147]:
nlp = spacy.blank("en")

doc = nlp(df.sentence.values[0])

for ent in doc.ents:
    print(ent.text)

In [148]:
source_nlp=spacy.load("en_core_web_sm")

nlp = spacy.blank("en")  # create a blank pipeline
nlp.add_pipe("ner", source=source_nlp)  # copy the trained ner component into it
nlp.pipe_names

['ner']

In [149]:
doc = nlp(df.sentence.values[0])
for ent in doc.ents:
    print(ent.text)

three quarters
the remaining quarter


<h3>Further reading</h3>

https://spacy.io/usage/processing-pipelines#pipelines