# Tokenization

In [1]:
import nltk

In [2]:
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
text = " Hello everyone. Hope all are fine and doing well. Hope you find the book interesting"
tokenizer.tokenize(text)

[' Hello everyone.',
 'Hope all are fine and doing well.',
 'Hope you find the book interesting']

##### Tokenization of text in other languages

To tokenize text in a language other than English, load the corresponding language pickle file from tokenizers/punkt and pass the text in that language as the argument of the tokenizer's tokenize() function.

In [3]:
french_tokenizer=nltk.data.load('tokenizers/punkt/french.pickle')
text1="Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britanniquedeLevallois-Perret. Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire. L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire"
french_tokenizer.tokenize(text1)

['Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage collège franco-britanniquedeLevallois-Perret.',
 'Deux agressions en quelques jours, voilà ce qui a motivé hier matin le débrayage Levallois.',
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, janvier , d'un professeur d'histoire.",
 "L'équipe pédagogique de ce collège de 750 élèves avait déjà été choquée par l'agression, mercredi , d'un professeur d'histoire"]

### Tokenization of sentences into words: TreebankWordTokenizer()

In [4]:
from nltk.tokenize import TreebankWordTokenizer
tokenizer = TreebankWordTokenizer()
tokenizer.tokenize("Have a nice day. I hope you find the book interesting.")

['Have',
 'a',
 'nice',
 'day.',
 'I',
 'hope',
 'you',
 'find',
 'the',
 'book',
 'interesting',
 '.']

### Splitting punctuation: WordPunctTokenizer()

In [5]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
tokenizer.tokenize(" Don't hesitate to ask questions.")

['Don', "'", 't', 'hesitate', 'to', 'ask', 'questions', '.']

### Tokenization using regular expressions (regex)

In [6]:
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"[\w]+")  # raw string, so \w is not treated as a string escape
tokenizer.tokenize("Don't hesitate to ask questions")

['Don', 't', 'hesitate', 'to', 'ask', 'questions']

In [7]:
from nltk.tokenize import regexp_tokenize
sent="Don't hesitate to ask questions"
print(regexp_tokenize(sent, pattern=r'\w+|\$[\d\.]+|\S+'))

['Don', "'t", 'hesitate', 'to', 'ask', 'questions']
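To make the two patterns above easier to compare, here is a minimal sketch that uses only the standard `re` module. It relies on the observation that, for these patterns without capturing groups, `re.findall` yields the same tokens as the NLTK calls above; it illustrates the patterns and is not a description of NLTK's internals.

```python
import re

sentence = "Don't hesitate to ask questions"

# \w+ keeps only runs of letters, digits and underscores,
# so the apostrophe is dropped and "Don't" splits into two tokens.
word_tokens = re.findall(r"\w+", sentence)

# The alternation tries \w+ first, then a dollar amount, then any run of
# non-whitespace characters, so the leftover "'t" is kept as one token.
mixed_tokens = re.findall(r"\w+|\$[\d\.]+|\S+", sentence)

print(word_tokens)
print(mixed_tokens)
```

The order of the alternatives matters: putting `\S+` first would make it swallow every whitespace-delimited chunk before the other alternatives are tried.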


## Conversion into lowercase and uppercase

In [8]:
text='HARdWork IS KEy to SUCCESS'
print(text.lower())

hardwork is key to success


In [9]:
print(text.upper())

HARDWORK IS KEY TO SUCCESS


# Dealing with stop words

In [10]:
from nltk.corpus import stopwords
stops=set(stopwords.words('english'))
words=["Don't", 'hesitate','to','ask','questions']
[word for word in words if word not in stops]

["Don't", 'hesitate', 'ask', 'questions']

In [11]:
#test
test1 = "Have a nice day. I hope you find the book interesting."
a = tokenizer.tokenize(test1)
[word for word in a if word not in stops]

['Have', 'nice', 'day', 'I', 'hope', 'find', 'book', 'interesting']
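In the output above, 'Have' and 'I' survive the filter because the membership test is case-sensitive and the stop-word list is lowercase. The sketch below shows one way to filter case-insensitively. `small_stops` is a hand-written stand-in for `stopwords.words('english')`, used so that the example runs without the NLTK data.

```python
# Hypothetical miniature stop-word set standing in for stopwords.words('english').
small_stops = {"have", "a", "i", "you", "the"}

tokens = ["Have", "a", "nice", "day", "I", "hope", "you",
          "find", "the", "book", "interesting"]

# Lowercase each token only for the membership test, so the original
# casing of the kept tokens is preserved.
filtered = [tok for tok in tokens if tok.lower() not in small_stops]
print(filtered)
```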

In [12]:
stopwords.fileids()

['arabic',
 'azerbaijani',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'greek',
 'hungarian',
 'indonesian',
 'italian',
 'kazakh',
 'nepali',
 'norwegian',
 'portuguese',
 'romanian',
 'russian',
 'slovene',
 'spanish',
 'swedish',
 'tajik',
 'turkish']

### Example of the replacement of a text with another text

In [13]:
import re

# Each pair is (regex pattern, replacement). Replacements are raw strings
# so that \g<1> is passed to re.sub as a group reference.
replacement_patterns = [
    (r'won\'t', 'will not'),
    (r'can\'t', 'cannot'),
    (r'i\'m', 'i am'),
    (r'ain\'t', 'is not'),
    (r'(\w+)\'ll', r'\g<1> will'),
    (r'(\w+)n\'t', r'\g<1> not'),
    (r'(\w+)\'ve', r'\g<1> have'),
    (r'(\w+)\'s', r'\g<1> is'),
    (r'(\w+)\'re', r'\g<1> are'),
    (r'(\w+)\'d', r'\g<1> would'),
]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns): 
        self.patterns = [(re.compile(regex), repl) for (regex, repl) in patterns]
    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            s = re.sub(pattern, repl, s) 
        return s
    

replacer=RegexpReplacer()
replacer.replace("Don't hesitate to ask questions")

'Do not hesitate to ask questions'

In [14]:
replacer.replace("She must've gone to the market but she didn't go")

'She must have gone to the market but she did not go'

RegexpReplacer.replace() substitutes every match of each replacement pattern with its corresponding replacement string. Here, must've is replaced by must have and didn't is replaced by did not, because replacement_patterns pairs the patterns (\w+)\'ve and (\w+)n\'t with the replacements \g<1> have and \g<1> not.
The same mechanism is not limited to contractions; it can substitute any token with any other token.
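Because the rules are applied one after another, their order matters: a specific rule such as `won't` has to run before the generic `n't` rule. The following self-contained sketch illustrates this and also shows a plain token substitution. The helper `apply_rules` and the `NBA` rule are illustrative names and not part of the notebook.

```python
import re

def apply_rules(text, rules):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text

specific_first = [(r"won't", "will not"), (r"(\w+)n't", r"\g<1> not")]
generic_first = list(reversed(specific_first))

good = apply_rules("I won't go", specific_first)
bad = apply_rules("I won't go", generic_first)

# The same mechanism can replace any token with another one.
expanded = apply_rules("The NBA is popular",
                       [(r"\bNBA\b", "National Basketball Association")])

print(good)      # I will not go
print(bad)       # I wo not go
print(expanded)  # The National Basketball Association is popular
```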

# Lemmatization 

Lemmatization reduces a word to its lemma, that is, its dictionary base form, taking the word's part of speech into account. WordNetLemmatizer relies on WordNet's built-in morphy() function. The input word is returned unchanged if it is not found in WordNet. The pos argument gives the part-of-speech category of the input word and defaults to noun ('n'), which is why 'working' is reduced to 'work' only when pos='v' is passed. Consider an example of lemmatization in NLTK:

In [15]:
from nltk.stem import WordNetLemmatizer
lemmatizer_output = WordNetLemmatizer()
lemmatizer_output.lemmatize('working')

'working'

In [16]:
lemmatizer_output.lemmatize('working',pos='v')

'work'

In [17]:
lemmatizer_output.lemmatize('works')

'work'

In [18]:
lemmatizer_output.lemmatize('children')

'child'

In [19]:
lemmatizer_output.lemmatize('are')

'are'

In [20]:
lemmatizer_output.lemmatize('consequences')

'consequence'

In [21]:
from nltk.stem import PorterStemmer
stemmer_output=PorterStemmer()
stemmer_output.stem('happiness')

'happi'

In [22]:
lemmatizer_output.lemmatize('happiness')

'happiness'

# Similarity measure

In [23]:
from nltk.metrics import edit_distance, jaccard_distance
edit_distance("relate","relation")

3

In [24]:
edit_distance("suggestion","calculation")

7
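By default, `edit_distance` computes the Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. Here is a minimal dynamic-programming sketch of that idea. The function name `levenshtein` is illustrative, and this is not NLTK's implementation.

```python
def levenshtein(a, b):
    # previous[j] is the distance between the characters of `a` processed
    # so far (excluding the current one) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

print(levenshtein("relate", "relation"))
print(levenshtein("suggestion", "calculation"))
```

The two calls should reproduce the values 3 and 7 shown above.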

• Jaccard(X,Y) = |X∩Y| / |X∪Y|
• Jaccard(X,X) = 1
• Jaccard(X,Y) = 0 if X∩Y = ∅

In [25]:
X=set([10,20,30,40])
Y=set([20,30,60])
print(jaccard_distance(X,Y))

0.6
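Note that `jaccard_distance` returns one minus the Jaccard coefficient, which is why the result above is 0.6 and not 0.4. A small sketch with plain Python sets makes the relation explicit. The helper name `jaccard_coefficient` is illustrative.

```python
def jaccard_coefficient(x, y):
    """|X ∩ Y| / |X ∪ Y| for two sets that are not both empty."""
    return len(x & y) / len(x | y)

X = {10, 20, 30, 40}
Y = {20, 30, 60}

coefficient = jaccard_coefficient(X, Y)   # 2 shared elements out of 5
distance = 1 - coefficient                # the quantity printed above

print(coefficient, distance)
print(jaccard_coefficient(X, X))          # identical sets
print(jaccard_coefficient(X, {60, 70}))   # disjoint sets
```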


### Finding the two most similar sentences between two articles

In [26]:
# Read the first article; the with-block closes the file automatically.
with open('NBA1.txt', 'r', encoding='utf-8') as open_file:
    file_to_string = open_file.read()
type(file_to_string)

str

- We start by performing the replacement of contractions.

In [27]:
text_replaced = replacer.replace(file_to_string)

print(file_to_string[-156:-142])
print(text_replaced[-158:-142])
print(type(text_replaced))

We’ll find out

“We’ll find out
<class 'str'>
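The output above shows that `We’ll` was left unchanged: the article uses the typographic apostrophe (’, U+2019), while the replacement patterns are written with the straight apostrophe ('). One possible fix, sketched here with a single rule and the standard `re` module, is to normalize the apostrophes before applying the replacer. The variable names are illustrative.

```python
import re

raw = "\u201cWe\u2019ll find out"   # “We’ll find out, as in the article

# The straight-apostrophe pattern does not match the typographic one.
unchanged = re.sub(r"(\w+)'ll", r"\g<1> will", raw)

# Normalizing U+2019 to ' first lets the same pattern match.
normalized = raw.replace("\u2019", "'")
expanded = re.sub(r"(\w+)'ll", r"\g<1> will", normalized)

print(unchanged == raw)
print(expanded)
```

Applying such a normalization in the notebook would change the tokens shown in the cells that follow, so it is offered only as an option.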


- Tokenizing into sentences.

In [28]:
tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(text_replaced)
sentences[0:2]

['The NBA’s history in China is more than three decades old.',
 'China Central Television struck a deal with the league in 1987 to offer games for free, and their relationship prospered in the 1990s, as the Chicago Bulls were busy winning championships and Michael Jordan was becoming a global icon.']

- Tokenizing into words.

In [29]:
tokenizer=RegexpTokenizer(r"[\w]+")

for i in range(len(sentences)):
    sentences[i] = tokenizer.tokenize(sentences[i])
sentences[0]


['The',
 'NBA',
 's',
 'history',
 'in',
 'China',
 'is',
 'more',
 'than',
 'three',
 'decades',
 'old']


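- We now remove stop words. The cell below filters only the first sentence and does not store the result back into sentences. Capitalized words such as 'The' are kept because the NLTK stop-word list is lowercase.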



In [30]:
sentences[0]

['The',
 'NBA',
 's',
 'history',
 'in',
 'China',
 'is',
 'more',
 'than',
 'three',
 'decades',
 'old']

In [31]:
from nltk.corpus import stopwords
stops=set(stopwords.words('english'))
words= sentences[0]
[word for word in words if word not in stops]



['The', 'NBA', 'history', 'China', 'three', 'decades', 'old']
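The cell above only displays the filtered first sentence. To remove stop words from every sentence and keep the result, the filtered lists can be assigned back, as in this sketch. `small_stops` and `toy_sentences` are hypothetical stand-ins for the NLTK stop-word list and the tokenized article.

```python
# Hypothetical stand-ins for stopwords.words('english') and `sentences`.
small_stops = {"s", "in", "is", "more", "than", "has", "only", "then"}
toy_sentences = [
    ["The", "NBA", "s", "history", "in", "China", "is", "more",
     "than", "three", "decades", "old"],
    ["The", "NBA", "has", "only", "become", "more", "popular",
     "in", "China", "since", "then"],
]

# Build a new list of filtered sentences instead of only displaying one.
toy_sentences = [[word for word in sentence if word not in small_stops]
                 for sentence in toy_sentences]

print(toy_sentences[0])
```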

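- Lemmatization for each sentence. Since no pos argument is passed, every token is treated as a noun, which is why 'as' becomes 'a' and 'was' becomes 'wa' in the output of the lemmatization cell below.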

In [32]:
sentences

[['The',
  'NBA',
  's',
  'history',
  'in',
  'China',
  'is',
  'more',
  'than',
  'three',
  'decades',
  'old'],
 ['China',
  'Central',
  'Television',
  'struck',
  'a',
  'deal',
  'with',
  'the',
  'league',
  'in',
  '1987',
  'to',
  'offer',
  'games',
  'for',
  'free',
  'and',
  'their',
  'relationship',
  'prospered',
  'in',
  'the',
  '1990s',
  'as',
  'the',
  'Chicago',
  'Bulls',
  'were',
  'busy',
  'winning',
  'championships',
  'and',
  'Michael',
  'Jordan',
  'was',
  'becoming',
  'a',
  'global',
  'icon'],
 ['The',
  'NBA',
  'has',
  'only',
  'become',
  'more',
  'popular',
  'in',
  'China',
  'since',
  'then'],
 ['The',
  'league',
  's',
  'official',
  'Chinese',
  'language',
  'account',
  'on',
  'Weibo',
  'Inc',
  's',
  'short',
  'messaging',
  'service',
  'has',
  'more',
  'followers',
  'than',
  'its',
  'account',
  'on',
  'Twitter'],
 ['Floor',
  'seats',
  'to',
  'the',
  'Lakers',
  'vs',
  'Nets',
  'game',
  'on',
  'Thursd

In [33]:
from nltk.stem import WordNetLemmatizer
lemmatizer_output=WordNetLemmatizer()


for i in range(len(sentences)):
    for j in range(len(sentences[i])):
        sentences[i][j] = lemmatizer_output.lemmatize(sentences[i][j])
sentences[1]

['China',
 'Central',
 'Television',
 'struck',
 'a',
 'deal',
 'with',
 'the',
 'league',
 'in',
 '1987',
 'to',
 'offer',
 'game',
 'for',
 'free',
 'and',
 'their',
 'relationship',
 'prospered',
 'in',
 'the',
 '1990s',
 'a',
 'the',
 'Chicago',
 'Bulls',
 'were',
 'busy',
 'winning',
 'championship',
 'and',
 'Michael',
 'Jordan',
 'wa',
 'becoming',
 'a',
 'global',
 'icon']

- Joining the words back into sentences.

In [34]:
for i in range(len(sentences)):
    sentences[i] = ' '.join(sentences[i])
sentences[0]

'The NBA s history in China is more than three decade old'

In [35]:
#Replacer replace
#Tokenize
#Remove stopwords
#Lemmatize

In [36]:
#Replacer replace
with open('NBA2.txt', 'r', encoding='utf-8') as open_file2:
    file_to_string2 = open_file2.read()
text_replaced2 = replacer.replace(file_to_string2)

#Tokenize
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences2 = tokenizer.tokenize(text_replaced2)

#Tokenize words
from nltk.tokenize import RegexpTokenizer
tokenizer=RegexpTokenizer(r"[\w]+")

for i in range(len(sentences2)):
    sentences2[i] = tokenizer.tokenize(sentences2[i])
    
#Remove stop words

from nltk.corpus import stopwords
stops=set(stopwords.words('english'))

for i in range(len(sentences2)):
    sentences2[i] = [word for word in sentences2[i] if word not in stops]
    
#Lemmatize

from nltk.stem import WordNetLemmatizer
lemmatizer_output=WordNetLemmatizer()

for i in range(len(sentences2)):
    for j in range(len(sentences2[i])):
        sentences2[i][j] = lemmatizer_output.lemmatize(sentences2[i][j])

        
#Join the words back into a sentence.

for i in range(len(sentences2)):
    sentences2[i] = ' '.join(sentences2[i])

In [37]:
sentences2[0]

'SHANGHAI A day scheduled tipoff Brooklyn Nets Los Angeles Lakers game Shanghai crisis National Basketball Association appeared closer resolution'

In [38]:
def preprocess_text(path):
    """Read the file at `path` and return its sentences as cleaned strings."""
    with open(path, 'r', encoding='utf-8') as f_open:
        raw_text = f_open.read()

    # Load the Punkt sentence tokenizer explicitly: the global `tokenizer`
    # was rebound to a RegexpTokenizer in the cells above.
    sentence_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences_a = sentence_tokenizer.tokenize(raw_text)

    # Expand contractions in each sentence.
    processed_a = [replacer.replace(sentence) for sentence in sentences_a]

    # Split each sentence into words.
    wtokenizer = TreebankWordTokenizer()
    w_processed_a = [wtokenizer.tokenize(sentence) for sentence in processed_a]

    # Remove stop words.
    stops = set(stopwords.words('english'))
    s_w_processed_a = [[wd for wd in words if wd not in stops]
                       for words in w_processed_a]

    # Lemmatize the remaining words.
    lm = WordNetLemmatizer()
    lm_s_w_processed_a = [[lm.lemmatize(wd) for wd in words]
                          for words in s_w_processed_a]

    # Rebuild each sentence from its lemmatized words.
    return [' '.join(s) for s in lm_s_w_processed_a]

We are now going to compare all sentences by measuring the similarity between two sentences using Jaccard's coefficient.

In [41]:
def get_jaccard_sim(str1, str2):
    """Jaccard coefficient between the word sets of two sentences."""
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

We compare every sentence of the first article with every sentence of the second and keep the indices of the pair with the highest Jaccard coefficient.

In [43]:
# Track the best-scoring pair of sentences across the two articles.
maximum, imax, jmax = 0.0, 0, 0
for i in range(len(sentences)):
    for j in range(len(sentences2)):
        sim = get_jaccard_sim(sentences[i], sentences2[j])
        if sim > maximum:
            maximum = sim
            imax = i
            jmax = j

In [44]:
print(maximum)
print(sentences[imax])
print(sentences2[jmax])

0.1346153846153846
There would be a social stability cost to banning the NBA in China The Friday night tweet by Houston Rockets general manager Daryl Morey on a banned platform in China ha suddenly thrust the NBA into the turbulent water of Sino American politics
Chinese broadcaster sponsor suspended aspect cooperation NBA commissioner said apologize tweet Rockets general manager Daryl Morey though league also said regrettable upset Chinese fan
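The nested search for the best pair can also be written as a single `max` over all index pairs. The sketch below is self-contained: the two short sentence lists are made-up examples, and `jaccard_similarity` repeats the word-set computation of `get_jaccard_sim` with an added guard for two empty sentences.

```python
from itertools import product

def jaccard_similarity(str1, str2):
    a, b = set(str1.split()), set(str2.split())
    union = a | b
    # Guard against division by zero when both sentences are empty.
    return len(a & b) / len(union) if union else 0.0

# Made-up stand-ins for the preprocessed sentences of the two articles.
article1 = ["the league is popular in China", "tickets were expensive"]
article2 = ["fans bought expensive tickets", "the league is popular abroad"]

# Score every (i, j) pair and keep the one with the highest similarity.
imax, jmax = max(
    product(range(len(article1)), range(len(article2))),
    key=lambda pair: jaccard_similarity(article1[pair[0]], article2[pair[1]]),
)
best = jaccard_similarity(article1[imax], article2[jmax])

print(best, article1[imax], "|", article2[jmax])
```

The comparison is quadratic in the number of sentences, which is acceptable for two short articles but would need a different approach for large corpora.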
