# Tokenization
Segmenting text into words, punctuation marks, etc.

# Loading the package

In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut only exists in spaCy 2.x

# Alternative: import en_core_web_sm; nlp = en_core_web_sm.load()

In [2]:
# Parse a text string into a Doc of tokens
docx = nlp("Sharp rise in patent applications for self-driving vehicles in Europe")

In [3]:
docx

Sharp rise in patent applications for self-driving vehicles in Europe

In [4]:
docx2 = nlp("A study published today, 6 November 2018, by the European Patent Office (EPO) reveals that innovation in self-driving vehicles (SDV) is accelerating fast")

# Reading a file

In [5]:
with open("examplefile.txt", encoding="utf-8") as f:
    myfile = f.read()

In [6]:
doc_file = nlp(myfile)

In [7]:
doc_file

A study published today, 6 November 2018, by the European Patent Office (EPO) reveals that innovation in self-driving vehicles (SDV) is accelerating fast. 
Additionally, study finds that patent protection strategies in the area of self-driving vehicle technology more closely resemble those in the information and communication (ICT) sector than those in the traditional automotive industry.

https://www.epo.org/news-issues/news/2018/20181106.html


In [8]:
# The full text is overwhelming, so show only the first 20 tokens
doc_file[:20]

A study published today, 6 November 2018, by the European Patent Office (EPO) reveals that innovation

# Sentence Tokens

In [9]:
# show output by sentence
for sentence in doc_file.sents:
    print(sentence)

A study published today, 6 November 2018, by the European Patent Office (EPO) reveals that innovation in self-driving vehicles (SDV) is accelerating fast. 

Additionally, study finds that patent protection strategies in the area of self-driving vehicle technology more closely resemble those in the information and communication (ICT) sector than those in the traditional automotive industry.


https://www.epo.org/news-issues/news/2018/20181106.html




In [10]:
# Number each sentence to keep track of its position
for num,sentence in enumerate(doc_file.sents):
    print(f'{num}: {sentence}') 

0: A study published today, 6 November 2018, by the European Patent Office (EPO) reveals that innovation in self-driving vehicles (SDV) is accelerating fast. 

1: Additionally, study finds that patent protection strategies in the area of self-driving vehicle technology more closely resemble those in the information and communication (ICT) sector than those in the traditional automotive industry.


2: https://www.epo.org/news-issues/news/2018/20181106.html




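spaCy divides the text into sentences. With the default pipeline, the boundaries come from the dependency parser rather than from punctuation alone; here they coincide with the full stops, and the weblink is treated as a separate sentence.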
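
To see what purely punctuation-based splitting would look like, here is a minimal sketch using only Python's `re` module. It is not how spaCy segments sentences; the helper name `split_sentences` and the rule "split after `.`, `!` or `?` when whitespace follows" are assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Split text after '.', '!' or '?' when the mark is followed by whitespace."""
    # The lookbehind keeps the punctuation mark attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


sample = (
    "A study published today, 6 November 2018, by the European Patent Office (EPO) "
    "reveals that innovation in self-driving vehicles (SDV) is accelerating fast. \n\n"
    "https://www.epo.org/news-issues/news/2018/20181106.html\n\n"
)

for num, sentence in enumerate(split_sentences(sample)):
    print(f"{num}: {sentence}")
```

The dots inside the weblink are not followed by whitespace, so the link stays in one piece. A rule this simple would still break on abbreviations such as "Dr. Smith", which is one reason to rely on a trained pipeline instead.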

In [11]:
# Word Tokens
for token in doc_file:
    print(token.text)

A
study
published
today
,
6
November
2018
,
by
the
European
Patent
Office
(
EPO
)
reveals
that
innovation
in
self
-
driving
vehicles
(
SDV
)
is
accelerating
fast
.


Additionally
,
study
finds
that
patent
protection
strategies
in
the
area
of
self
-
driving
vehicle
technology
more
closely
resemble
those
in
the
information
and
communication
(
ICT
)
sector
than
those
in
the
traditional
automotive
industry
.



https://www.epo.org/news-issues/news/2018/20181106.html





Note how each word and each punctuation mark is a separate token, while the weblink is kept as a single token.

### List of Word Tokens

In [12]:
[token.text for token in doc_file ]

['A',
 'study',
 'published',
 'today',
 ',',
 '6',
 'November',
 '2018',
 ',',
 'by',
 'the',
 'European',
 'Patent',
 'Office',
 '(',
 'EPO',
 ')',
 'reveals',
 'that',
 'innovation',
 'in',
 'self',
 '-',
 'driving',
 'vehicles',
 '(',
 'SDV',
 ')',
 'is',
 'accelerating',
 'fast',
 '.',
 '\n',
 'Additionally',
 ',',
 'study',
 'finds',
 'that',
 'patent',
 'protection',
 'strategies',
 'in',
 'the',
 'area',
 'of',
 'self',
 '-',
 'driving',
 'vehicle',
 'technology',
 'more',
 'closely',
 'resemble',
 'those',
 'in',
 'the',
 'information',
 'and',
 'communication',
 '(',
 'ICT',
 ')',
 'sector',
 'than',
 'those',
 'in',
 'the',
 'traditional',
 'automotive',
 'industry',
 '.',
 '\n\n',
 'https://www.epo.org/news-issues/news/2018/20181106.html',
 '\n\n']

The list holds the same tokens: the weblink is still a single token, and the line breaks show up as whitespace tokens ('\n' and '\n\n').

In [13]:
# For comparison: plain splitting on spaces, which leaves punctuation attached to the words
doc_file.text.split(" ")

['A',
 'study',
 'published',
 'today,',
 '6',
 'November',
 '2018,',
 'by',
 'the',
 'European',
 'Patent',
 'Office',
 '(EPO)',
 'reveals',
 'that',
 'innovation',
 'in',
 'self-driving',
 'vehicles',
 '(SDV)',
 'is',
 'accelerating',
 'fast.',
 '\nAdditionally,',
 'study',
 'finds',
 'that',
 'patent',
 'protection',
 'strategies',
 'in',
 'the',
 'area',
 'of',
 'self-driving',
 'vehicle',
 'technology',
 'more',
 'closely',
 'resemble',
 'those',
 'in',
 'the',
 'information',
 'and',
 'communication',
 '(ICT)',
 'sector',
 'than',
 'those',
 'in',
 'the',
 'traditional',
 'automotive',
 'industry.\n\nhttps://www.epo.org/news-issues/news/2018/20181106.html\n\n']
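
The difference between splitting on spaces and real tokenization can be sketched with a small regular expression. This is only an approximation for illustration, not spaCy's tokenizer (which uses language-specific rules and exceptions); `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names.

```python
import re

# Order matters: try a weblink first, then a run of word characters,
# then any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"https?://\S+|\w+|[^\w\s]")


def simple_tokenize(text):
    """Return words, punctuation marks and weblinks as separate tokens."""
    return TOKEN_PATTERN.findall(text)


text = "Office (EPO) reveals that innovation in self-driving vehicles (SDV) is accelerating fast."

print(text.split(" "))        # punctuation stays glued to the words
print(simple_tokenize(text))  # punctuation becomes separate tokens
print(simple_tokenize("https://www.epo.org/news-issues/news/2018/20181106.html"))
```

Unlike the spaCy output above, this sketch discards whitespace entirely, so no '\n' tokens appear.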

In [14]:
# Let's create a small example Doc to see our results
ex_doc = nlp("Hello hello HELLO HeLLO")

In [15]:
ex_doc.text.split(" ")

['Hello', 'hello', 'HELLO', 'HeLLO']

### Word Shape

In [16]:
for word in ex_doc:
    print(word.text)

Hello
hello
HELLO
HeLLO


### Word Shape As Hash Value

In [17]:
for word in ex_doc:
    print(word.text,word.shape)

Hello 16072095006890171862
hello 13110060611322374290
HELLO 13862804199789047564
HeLLO 2558401074738440126


A hash value such as 16072095006890171862 (the shape of Hello) is not readable for us, so we go one step further and ask for the string form.
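
In spaCy, an attribute without a trailing underscore returns an integer ID, and the same attribute with an underscore returns the string behind that ID. The idea of such a two-way table can be sketched in plain Python. `ToyStringStore` is a hypothetical class, and hashing with SHA-256 is an assumption of this sketch; spaCy uses a different hash function, so the numbers will not match the ones printed above.

```python
import hashlib


class ToyStringStore:
    """Toy table mapping 64-bit integer IDs back to the strings they stand for."""

    def __init__(self):
        self._by_id = {}

    def add(self, string):
        # Derive a stable 64-bit ID from the string (assumption: first 8 bytes of SHA-256).
        digest = hashlib.sha256(string.encode("utf8")).digest()
        key = int.from_bytes(digest[:8], "big")
        self._by_id[key] = string
        return key

    def lookup(self, key):
        # Recover the readable string from its integer ID.
        return self._by_id[key]


store = ToyStringStore()
shape_id = store.add("Xxxxx")
print(shape_id)                # an opaque integer, like word.shape
print(store.lookup(shape_id))  # the readable form, like word.shape_
```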

### Word Shape As Readable Representation

In [18]:
for word in ex_doc:
    print(word.text,word.shape_)

Hello Xxxxx
hello xxxx
HELLO XXXX
HeLLO XxXXX


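So, it shows that Hello has 5 characters: the first is uppercase (X) and the rest are lowercase (x). Runs of the same shape character are truncated after four, which is why hello and HELLO appear as xxxx and XXXX.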
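
The shape strings above can be reproduced with a few lines of plain Python. This is a sketch of the idea, not spaCy's own implementation; the function name `word_shape` and the `max_run` parameter are assumptions for illustration.

```python
def word_shape(text, max_run=4):
    """Map letters to X/x and digits to d, keeping at most max_run repeats in a row."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and other characters are kept as they are
        if symbol == previous:
            run_length += 1
        else:
            previous = symbol
            run_length = 1
        # Drop the symbol once the current run is longer than max_run.
        if run_length <= max_run:
            shape.append(symbol)
    return "".join(shape)


for word in ["Hello", "hello", "HELLO", "HeLLO"]:
    print(word, word_shape(word))
```

Under these rules the four example words give Xxxxx, xxxx, XXXX and XxXXX, matching the output of the cell above.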

In [19]:
for word in ex_doc:
    print("Token =>", word.text, "Shape ",word.shape_,word.is_alpha,word.is_stop)

Token => Hello Shape  Xxxxx True False
Token => hello Shape  xxxx True False
Token => HELLO Shape  XXXX True False
Token => HeLLO Shape  XxXXX True False


- word.text shows the text of the token, e.g. Hello
- word.shape shows the hash value of the word shape
- word.shape_ shows the word shape as a readable string
- word.is_alpha shows whether the token consists only of alphabetic characters (True or False)
- word.is_stop shows whether the token is a stop word (True or False)
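
The two boolean flags can be illustrated without spaCy. `is_alpha` corresponds to Python's `str.isalpha()`, so digits make it False, while a stop-word check is a lookup in a list of very common words. The stop list `TOY_STOP_WORDS`, the helper `describe`, the lowercasing step and the sample word SDV2 are assumptions of this sketch; spaCy ships its own, much larger stop list.

```python
# Hypothetical, tiny stop list used only for this illustration.
TOY_STOP_WORDS = {"a", "the", "in", "is", "that"}


def describe(word):
    """Return the alphabetic, alphanumeric and stop-word flags for one word."""
    return {
        "text": word,
        "is_alpha": word.isalpha(),   # letters only
        "is_alnum": word.isalnum(),   # letters and/or digits
        "is_stop": word.lower() in TOY_STOP_WORDS,
    }


for word in ["Hello", "2018", "SDV2", "the", "self-driving"]:
    print(describe(word))
```

Note that SDV2 is alphanumeric but not alphabetic, which is the distinction `is_alpha` draws.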