### <b> Tokenisation </b>
 Tokenisation is the process of splitting text into meaningful segments (tokens). <br>
 * Site offering NLP APIs: https://www.firstlanguage.in/

 <img src="spacy_blank_pipeline.jpg" height="200" width="500"/>

In [2]:
import spacy

In [3]:
# create a blank English pipeline (tokenizer only, no trained components)
nlp = spacy.blank("en")

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate")

for token in doc:
    print(token)

Dr.
Strange
loves
pav
bhaji
of
mumbai
as
it
costs
only
2
$
per
plate
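
For comparison, here is what a plain whitespace split gives on the same sentence. This is only a sketch to show why a tokenizer is useful; `naive_tokens` is a name made up for this example.

```python
sentence = "Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate"

# str.split() only cuts on whitespace, so the amount and the currency sign stay glued together
naive_tokens = sentence.split()

print(len(naive_tokens))   # 14 (the spaCy tokenizer above produced 15 tokens)
print(naive_tokens[-3:])   # ['2$', 'per', 'plate']
```

The spaCy tokenizer separates `2` and `$`, which is what makes per-token checks such as "is this a number?" possible later on.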


In [4]:
type(nlp)

spacy.lang.en.English
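
A blank pipeline contains only the tokenizer. One quick way to check this (a minimal sketch; `blank_nlp` is just an illustrative name) is to look at `pipe_names`, which lists the components that run after tokenization:

```python
import spacy

blank_nlp = spacy.blank("en")

# pipe_names lists the components that run after the tokenizer; a blank pipeline has none
print(blank_nlp.pipe_names)   # []
```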

In [5]:
# some of the lexical attributes available on each Token

for token in doc:
    print(token, "==>", "index: ", token.i, "is_alpha:", token.is_alpha, 
          "is_punct:", token.is_punct, 
          "like_num:", token.like_num,
          "is_currency:", token.is_currency,
         )

Dr. ==> index:  0 is_alpha: False is_punct: False like_num: False is_currency: False
Strange ==> index:  1 is_alpha: True is_punct: False like_num: False is_currency: False
loves ==> index:  2 is_alpha: True is_punct: False like_num: False is_currency: False
pav ==> index:  3 is_alpha: True is_punct: False like_num: False is_currency: False
bhaji ==> index:  4 is_alpha: True is_punct: False like_num: False is_currency: False
of ==> index:  5 is_alpha: True is_punct: False like_num: False is_currency: False
mumbai ==> index:  6 is_alpha: True is_punct: False like_num: False is_currency: False
as ==> index:  7 is_alpha: True is_punct: False like_num: False is_currency: False
it ==> index:  8 is_alpha: True is_punct: False like_num: False is_currency: False
costs ==> index:  9 is_alpha: True is_punct: False like_num: False is_currency: False
only ==> index:  10 is_alpha: True is_punct: False like_num: False is_currency: False
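2 ==> index:  11 is_alpha: False is_punct: False like_num: True is_currency: False
$ ==> index:  12 is_alpha: False is_punct: False like_num: False is_currency: True
per ==> index:  13 is_alpha: True is_punct: False like_num: False is_currency: False
plate ==> index:  14 is_alpha: True is_punct: False like_num: False is_currency: False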
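
These attributes can be combined. The sketch below pairs every number-like token with a currency token that immediately follows it; the names `amounts` and `next_token` are made up for this example, and the pairing rule is an assumption about how prices are written in the sample sentence, not a general price parser.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate")

amounts = []
# stop one token early so that doc[token.i + 1] always exists
for token in doc[:-1]:
    next_token = doc[token.i + 1]
    if token.like_num and next_token.is_currency:
        amounts.append(token.text + next_token.text)

print(amounts)   # ['2$']
```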

In [6]:
# reading data from file

with open("students.txt") as f:
    text = f.readlines()
text

['Dayton high school, 8th grade students information\n',
 '\n',
 'Name\tbirth day   \temail\n',
 '-----\t------------\t------\n',
 'Virat   5 June, 1882    virat@kohli.com\n',
 'Maria\t12 April, 2001  maria@sharapova.com\n',
 'Serena  24 June, 1998   serena@williams.com \n',
 'Joe      1 May, 1997    joe@root.com\n',
 '\n',
 '\n',
 '\n']

In [7]:
text = ' '.join(text)
print(text)

Dayton high school, 8th grade students information
 
 Name	birth day   	email
 -----	------------	------
 Virat   5 June, 1882    virat@kohli.com
 Maria	12 April, 2001  maria@sharapova.com
 Serena  24 June, 1998   serena@williams.com 
 Joe      1 May, 1997    joe@root.com
 
 
 



In [8]:
# extracting the email addresses

doc = nlp(text)
emails = []

for token in doc:
    if token.like_email:
        emails.append(token.text)
emails

['virat@kohli.com',
 'maria@sharapova.com',
 'serena@williams.com',
 'joe@root.com']
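
The same idea can be wrapped in a small helper. This is a minimal sketch: `extract_emails` is a hypothetical function name, and the two rows are copied from the file contents shown above so that the example runs without `students.txt`.

```python
import spacy


def extract_emails(text, nlp):
    """Return the text of every token that looks like an email address."""
    return [token.text for token in nlp(text) if token.like_email]


# Sample rows copied from the students.txt output above, so no file is needed here.
lines = [
    "Virat   5 June, 1882    virat@kohli.com\n",
    "Maria\t12 April, 2001  maria@sharapova.com\n",
]

emails = extract_emails(" ".join(lines), spacy.blank("en"))
print(emails)   # ['virat@kohli.com', 'maria@sharapova.com']
```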

In [9]:
# using a blank Hindi pipeline

nlp = spacy.blank("hi")

doc = nlp("भैया जी! 5000 ₹ उधार थे वो वापस देदो")
for token in doc:
    print(token)

भैया
जी
!
5000
₹
उधार
थे
वो
वापस
देदो


## Customising the Tokenizer

In [10]:
from spacy.symbols import ORTH

doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gimme', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']

In [16]:
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"}, 
    {ORTH:"me"},
])

doc = nlp("gimme double cheese extra large healthy pizza")
tokens = [token.text for token in doc]
tokens

['gim', 'me', 'double', 'cheese', 'extra', 'large', 'healthy', 'pizza']
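
Two things are worth knowing about special cases: they match the exact string, and they only change where the text is cut, not the text itself. The sketch below registers both casings of the word on a fresh blank English pipeline and checks that `doc.text` is unchanged; the sentence is made up for this example.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# Special cases match the exact string, so register each casing you expect to see.
# The ORTH pieces must join back to the original word.
for word, pieces in {"gimme": ["gim", "me"], "Gimme": ["Gim", "me"]}.items():
    nlp.tokenizer.add_special_case(word, [{ORTH: piece} for piece in pieces])

text = "Gimme pizza and gimme cheese"
doc = nlp(text)
tokens = [token.text for token in doc]

print(tokens)             # ['Gim', 'me', 'pizza', 'and', 'gim', 'me', 'cheese']
print(doc.text == text)   # True: only the token boundaries changed
```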

<h3>Sentence Tokenization or Segmentation</h3>

In [19]:
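# add a sentencizer to the pipeline (a pipeline is simply a set of components staged together);
# doc.sents needs a component that sets sentence boundaries, and the guard keeps the cell re-runnable
# note: nlp is still the blank Hindi pipeline from above, which has no "Dr." exception,
# so the "." is split off as its own token and ends the first sentence
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")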

doc = nlp("Dr. Strange loves pav bhaji of mumbai as it costs only 2$ per plate")

for sentence in doc.sents:
    print(sentence)

Dr.
Strange loves pav bhaji of mumbai as it costs only 2$ per plate
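
As a point of comparison, the sketch below adds the rule-based `sentencizer` to a fresh blank English pipeline (`en_nlp` is a name used only here) and runs it on a two-sentence variant of the example text. It assumes the English tokenizer keeps `Dr.` as a single token, as in the first output of this section, so the abbreviation's period does not trigger a sentence break.

```python
import spacy

en_nlp = spacy.blank("en")
en_nlp.add_pipe("sentencizer")   # rule-based: starts a new sentence after ".", "!", "?" tokens

doc = en_nlp("Dr. Strange loves pav bhaji of mumbai. It costs only 2$ per plate.")
sentences = [sentence.text for sentence in doc.sents]

for sentence in sentences:
    print(sentence)
# Dr. Strange loves pav bhaji of mumbai.
# It costs only 2$ per plate.
```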
