
# spaCy Tokenizer

In [None]:
import spacy

In [None]:
# NLP Tokenization

nlp = spacy.blank("en")

text = "Let's go the N.Y.!"
doc = nlp(text)
for token in doc:
    print(token)

Let
's
go
the
N.Y.
!


In [None]:
# Retrieve the third token (index 2)
doc[2]

go

In [None]:
# Retrieve a span of tokens (indices 2 through 4)
doc[2:5]

go the N.Y.

In [None]:
# Show the doc type
type(doc)

spacy.tokens.doc.Doc
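
The two indexing cells above return different object types: an integer index yields a single `Token`, while a slice yields a `Span` over the same `Doc`. A minimal sketch on the same sentence; the variable names `third_token` and `middle_span` are illustrative only:

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")
doc = nlp("Let's go the N.Y.!")

third_token = doc[2]     # integer index -> Token
middle_span = doc[2:5]   # slice -> Span covering tokens 2, 3 and 4

print(type(third_token).__name__, "|", third_token.text)
print(type(middle_span).__name__, "|", middle_span.text)
print("number of tokens:", len(doc))
```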

In [None]:
type(nlp)

## Check Token Attributes

In [None]:
text = "Tony gave two $ to Peter."
doc = nlp(text)

for token in doc:
    print(token, "==>", token.i,
          "is_alpha:", token.is_alpha,
          "is_punct:", token.is_punct,
          "like_num:", token.like_num,
          "is_currency:", token.is_currency)

Tony ==> 0 is_alpha: True is_punct: False like_num: False is_currency: False
gave ==> 1 is_alpha: True is_punct: False like_num: False is_currency: False
two ==> 2 is_alpha: True is_punct: False like_num: True is_currency: False
$ ==> 3 is_alpha: False is_punct: False like_num: False is_currency: True
to ==> 4 is_alpha: True is_punct: False like_num: False is_currency: False
Peter ==> 5 is_alpha: True is_punct: False like_num: False is_currency: False
. ==> 6 is_alpha: False is_punct: True like_num: False is_currency: False


## Grab all of the Emails

In [None]:
with open("/content/drive/MyDrive/Colab Notebooks/NLP/Media/students.txt") as f:
    text = f.readlines()

text = " ".join(text)
text



In [None]:
doc = nlp(text)

emails = []
for token in doc:
    if token.like_email:
        emails.append(token)

emails

[virat@kohli.com, maria@sharapova.com, serena@williams.com, joe@root.com]
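
Note that `emails` above holds `Token` objects, not strings. If plain strings are needed, `token.text` can be collected instead. A sketch on a short inline sample (the sample text is made up for illustration and stands in for `students.txt`):

```python
import spacy

nlp = spacy.blank("en")

# Hypothetical sample standing in for the contents of students.txt
sample = "Virat virat@kohli.com Maria maria@sharapova.com"
doc = nlp(sample)

# Keep only the text of tokens that look like an email address
email_strings = [token.text for token in doc if token.like_email]
print(email_strings)
```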

## Use the Bengali Language in spaCy

In [None]:
nlp = spacy.blank("bn")

bn_text = "তোমার নাম কি? এই নাও ১০০০ টাকা। ফুলগুলো কিন্তু অনেক সুন্দর।"
doc = nlp(bn_text)

# Print each Bengali token with its currency and number flags
for token in doc:
    print(f"{token} is currency: {token.is_currency} & is number: {token.like_num}")

তোমার is currency: False & is number: False
নাম is currency: False & is number: False
কি is currency: False & is number: False
? is currency: False & is number: False
এই is currency: False & is number: False
নাও is currency: False & is number: False
১০০০ is currency: False & is number: True
টাকা is currency: False & is number: False
। is currency: False & is number: False
ফুলগুলো is currency: False & is number: False
কিন্তু is currency: False & is number: False
অনেক is currency: False & is number: False
সুন্দর is currency: False & is number: False
। is currency: False & is number: False


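Comment: spaCy does not flag "টাকা" (taka, the currency of Bangladesh) as a currency. `is_currency` checks for currency symbols such as `$`, so a currency written out as a word is not detected.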
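
One possible workaround for the undetected "টাকা" is to check tokens against a small hand-written set of currency words in addition to `is_currency`. The sketch below assumes such a set; `CURRENCY_WORDS` and `find_amounts` are hypothetical helpers, not part of spaCy.

```python
import spacy

nlp = spacy.blank("bn")

# Assumption: a hand-maintained set of currency words (not provided by spaCy)
CURRENCY_WORDS = {"টাকা"}

def find_amounts(doc):
    """Return (number, currency) pairs where a number-like token precedes a currency."""
    pairs = []
    # Pair every token with the token that follows it
    for current, following in zip(doc, doc[1:]):
        is_money = following.is_currency or following.text in CURRENCY_WORDS
        if current.like_num and is_money:
            pairs.append((current.text, following.text))
    return pairs

doc = nlp("এই নাও ১০০০ টাকা।")
amounts = find_amounts(doc)
print(amounts)
```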

## Customize the spaCy Tokenizer

In [None]:
from spacy.symbols import ORTH

# Add a special case that splits "gimme" into "gim" and "me"
nlp = spacy.blank("en")
nlp.tokenizer.add_special_case("gimme", [
    {ORTH: "gim"},
    {ORTH: "me"}
])

text = "gimme double extra large healthy pizza."
doc = nlp(text)
for token in doc:
    print(token)

gim
me
double
extra
large
healthy
pizza
.


## Sentence Segmentation with the Sentencizer

In [None]:
nlp.add_pipe("sentencizer")
text = "Hello Sylhet! You are a wonderful city in the World."
doc = nlp(text)

tokens = [token.text for token in doc]
tokens

['Hello',
 'Sylhet',
 '!',
 'You',
 'are',
 'a',
 'wonderful',
 'city',
 'in',
 'the',
 'World',
 '.']

# Exercise

## 1. Grab all the URLs from the text using spaCy

In [None]:
text='''
Look for data to help you address the question. Governments are good
sources because data from public research is often freely available. Good
places to start include http://www.data.gov/, and http://www.science.
gov/, and in the United Kingdom, http://data.gov.uk/.
Two of my favorite data sets are the General Social Survey at http://www3.norc.org/gss+website/,
and the European Social Survey at http://www.europeansocialsurvey.org/.
'''

nlp = spacy.blank("en")
doc = nlp(text)

urls = []
for token in doc:
    if token.like_url:
        urls.append(token.text)

urls

['http://www.data.gov/',
 'http://www.science',
 'http://data.gov.uk/.',
 'http://www3.norc.org/gss+website/',
 'http://www.europeansocialsurvey.org/.']
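
Two things stand out in the output above: `http://www.science` is cut short because that URL is broken across two lines in the source text, and two URLs keep a trailing period from the sentence punctuation. A small post-processing step can remove the trailing punctuation. The helper name `clean_url` is hypothetical, and this sketch does not try to repair the URL that was split across lines.

```python
# URLs exactly as returned by the like_url loop above
raw_urls = [
    "http://www.data.gov/",
    "http://www.science",
    "http://data.gov.uk/.",
    "http://www3.norc.org/gss+website/",
    "http://www.europeansocialsurvey.org/.",
]

def clean_url(url):
    """Strip sentence punctuation that stayed attached to the end of the URL."""
    return url.rstrip(".,;")

cleaned = [clean_url(u) for u in raw_urls]
for url in cleaned:
    print(url)
```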

## 2. Extract all transactions
Extract every amount of money together with its currency from the sentence. The output should look like: two $, 500 €

In [None]:
transactions = "Tony gave two $ to Peter, Bruce gave 500 € to Steve"

doc = nlp(transactions)
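for token in doc:
    # Guard the look-ahead so a number at the very end of the text cannot raise IndexError
    if token.like_num and token.i + 1 < len(doc) and doc[token.i + 1].is_currency:
        print(token.text, doc[token.i + 1].text)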

two $
500 €
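
If the pairs are needed as data instead of printed lines, a variant can iterate over every token except the last and look at its right-hand neighbour with `Token.nbor()`, so the look-ahead never runs past the end of the `Doc`. This is a sketch of one possible approach; the name `extract_transactions` is hypothetical.

```python
import spacy

nlp = spacy.blank("en")

def extract_transactions(doc):
    """Collect (amount, currency) pairs where a number is followed by a currency symbol."""
    pairs = []
    # doc[:-1] skips the final token, so nbor() always has a neighbour to return
    for token in doc[:-1]:
        following = token.nbor()
        if token.like_num and following.is_currency:
            pairs.append((token.text, following.text))
    return pairs

doc = nlp("Tony gave two $ to Peter, Bruce gave 500 € to Steve")
pairs = extract_transactions(doc)
print(pairs)
```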
