# **NLP: Text Processing**





# **Processing Text Data with NLTK**

-- Italian recipes data

- Data set of Italian recipes from https://www.gutenberg.org/ebooks/24407 (public domain)

- The txt format of this has been split into multiple files, one recipe per file.

- The data can be found in /recipes/{1, 2, ..., 220}.txt

- There are 220 recipes

In [1]:
#!unzip recipes.zip

In [1]:
#importing the dataset of recepes files
import os

data_folder = r'recipes/recipes/'
all_recipe_files = [os.path.join(data_folder, fname)
                    for fname in os.listdir(data_folder)]

In [2]:
len(all_recipe_files)#it will print the total length of file

220

In [3]:
all_recipe_files[0:5]#it will print the first 05 recepies file

['recipes/recipes/1.txt',
 'recipes/recipes/10.txt',
 'recipes/recipes/100.txt',
 'recipes/recipes/101.txt',
 'recipes/recipes/102.txt']

In [6]:
documents = {}#crearting the dictionary

for recipe_fname in all_recipe_files:
    bname = os.path.basename(recipe_fname)

    recipe_number = os.path.splitext(bname)[0]
    
    with open(recipe_fname, 'r') as f:
        documents[recipe_number] = f.read()
#it creates the a python dictionary called documents,then iterates over a list of file names stored in the 'all_racipe_files' variable
#for each file name, the code extracts the base file name(i.e name without directory path or file extension)
#and store basename in bname variable

#then extract the recipe number from the base file name using 'os.path.splitext'
#it stored in receipe_no

#then it reads the contents of each recipe file in all_recipe_files, stores the contents in a dictionary called documents,
#and uses the recipe number as the key to access the corresponding recipe in the dictionary.


In [7]:
list(documents.items())[:4]
#it will give list of first four item

[('1',
  '\nBROTH OR SOUP STOCK\n\n(Brodo)\n\nTo obtain good broth the meat must be put in cold water, and then\nallowed to boil slowly. Add to the meat some pieces of bones and "soup\ngreens" as, for instance, celery, carrots and parsley. To give a brown\ncolor to the broth, some sugar, first browned at the fire, then diluted\nin cold water, may be added.\n\nWhile it is not considered that the broth has much nutritive power, it\nis excellent to promote the digestion. Nearly all the Italian soups are\nmade on a basis of broth.\n\nA good recipe for substantial broth to be used for invalids is the\nfollowing: Cut some beef in thin slices and place them in a large\nsaucepan; add some salt. Pour cold water upon them, so that they are\nentirely covered. Cover the saucepan so that it is hermetically closed\nand place on the cover a receptacle containing water, which must be\nconstantly renewed. Keep on a low fire for six hours, then on a strong\nfire for ten minutes. Strain the liquid in che

In [8]:
corpus_all_in_one = ' '.join([doc for doc in documents.values()])
#above line is used to convert into string

print("Number of docs: {}".format(len(documents)))
print("Corpus size (char): {}".format(len(corpus_all_in_one)))

Number of docs: 220
Corpus size (char): 161170


In [10]:
corpus_all_in_one[0:1000]
#it will print the first1000 strings

'\nBROTH OR SOUP STOCK\n\n(Brodo)\n\nTo obtain good broth the meat must be put in cold water, and then\nallowed to boil slowly. Add to the meat some pieces of bones and "soup\ngreens" as, for instance, celery, carrots and parsley. To give a brown\ncolor to the broth, some sugar, first browned at the fire, then diluted\nin cold water, may be added.\n\nWhile it is not considered that the broth has much nutritive power, it\nis excellent to promote the digestion. Nearly all the Italian soups are\nmade on a basis of broth.\n\nA good recipe for substantial broth to be used for invalids is the\nfollowing: Cut some beef in thin slices and place them in a large\nsaucepan; add some salt. Pour cold water upon them, so that they are\nentirely covered. Cover the saucepan so that it is hermetically closed\nand place on the cover a receptacle containing water, which must be\nconstantly renewed. Keep on a low fire for six hours, then on a strong\nfire for ten minutes. Strain the liquid in cheese cloth

## **Tokenisation**

-- Tokenisation is the process of splitting a raw string into a list of tokens

-- What is a token? We're interested in meaningful units of text

- Words
- Phrases
- Punctuation
- Numbers
- Dates
- Currencies
- Hashtags


In [11]:

import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ronak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
from nltk.tokenize import word_tokenize

text = "NLP is awesome! It is very easy to learn."

print(word_tokenize(text))

['NLP', 'is', 'awesome', '!', 'It', 'is', 'very', 'easy', 'to', 'learn', '.']


In [13]:
from nltk.tokenize import sent_tokenize

print()




In [15]:
# Returns the list of syllables of words

all_tokens = [t for t in word_tokenize(corpus_all_in_one)]

print("Total number of tokens: {}".format(len(all_tokens)))
#it will print the length of token from the corpus

Total number of tokens: 33719


In [13]:
all_tokens[:10]

['BROTH', 'OR', 'SOUP', 'STOCK', '(', 'Brodo', ')', 'To', 'obtain', 'good']

## **Counting Words**

-- Simple word count using collections.Counter

-- We are interested in finding:

- how many times a word occurs across the whole corpus (total number of occurrences)
- in how many documents a word occurs

In [17]:
from collections import Counter

total_term_frequency =Counter(all_tokens)

total_term_frequency.most_common(20)
#it will print the most common occuring or repeating word

[('the', 1933),
 (',', 1726),
 ('.', 1568),
 ('and', 1435),
 ('a', 1076),
 ('of', 988),
 ('in', 811),
 ('with', 726),
 ('it', 537),
 ('to', 452),
 ('or', 389),
 ('is', 337),
 ('(', 295),
 (')', 295),
 ('be', 266),
 ('them', 248),
 ('butter', 231),
 ('on', 220),
 ('water', 205),
 ('little', 198)]

## **Stop-words**

-- We notice that some of the most common words above are not very interesting.

-- These words are called stop-words, and they don't provide any particular meaning in isolation (articles, conjunctions, pronouns, etc.)

-- Notice:

- there is no "universal" list of stop-words
- removing stop-words can be useful or damaging depending on the application
- e.g. if you remove stop-words, what do you do with "The Who", "to be or not to be" and similar phrases?

In [19]:
nltk.download('stopwords')#downloading the nltk stopword module

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ronak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [22]:
from nltk.corpus import stopwords
import string

print(stopwords.words('english'))#it will print all the stop words in nltk stopword module
print("------------")
print(len(stopwords.words('english')))#it will print the length of the nltk stopword
print('-----------')
print(string.punctuation)#it will print punctuation

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [23]:
stop_list = stopwords.words('english') + list(string.punctuation)
#above line stop words from nltk stopword module and it also generates punctuation from string module
#and store in the stop_list

tokens_no_stop = [token for token in all_tokens
                 if token not in stop_list]
#above line will  keep token that are not in the 'stop_list' variable

total_term_frequency_no_stop = Counter(tokens_no_stop)
#it will calculate the repeating word and then stored in a variable 
total_term_frequency_no_stop.most_common(20)
#it will print the 20 most common occuring token 

[('butter', 231),
 ('water', 205),
 ('little', 198),
 ('put', 197),
 ('one', 186),
 ('salt', 185),
 ('fire', 169),
 ('half', 169),
 ('two', 157),
 ('When', 132),
 ('sauce', 128),
 ('pepper', 128),
 ('add', 125),
 ('cut', 125),
 ('flour', 116),
 ('piece', 116),
 ('The', 111),
 ('sugar', 100),
 ('saucepan', 100),
 ('oil', 99)]

In [24]:
# Find in how many documents a word occurs, after removing stopwords

document_frequency = Counter()
#above line creates an empty counter object  to store the frequency of each unique token across all documets

for recepe_number, content in documents.items():
    tokens = word_tokenize(content)
    tokens = [token for token in tokens if token not in stop_list]
    unique_tokens = set(tokens)
    document_frequency.update(unique_tokens)
    
document_frequency.most_common(20)
#In summary, this code calculates the document frequency of each word in a collection of documents after removing stop words,
#and prints the 20 most common words across all documents.

[('salt', 142),
 ('butter', 137),
 ('put', 126),
 ('water', 125),
 ('one', 117),
 ('fire', 115),
 ('little', 107),
 ('pepper', 106),
 ('two', 105),
 ('half', 105),
 ('When', 101),
 ('cut', 94),
 ('piece', 87),
 ('add', 83),
 ('saucepan', 81),
 ('oil', 78),
 ('sauce', 75),
 ('flour', 74),
 ('The', 72),
 ('Put', 69)]

## **Text Normalisation**

-- Replacing tokens with a canonical form, so we can group together different spelling/variations of the same word

-- Examples:

- lowercasing
-  stemming
-  American-to-British mapping
- synonym mapping

-- **Stemming** is the process of reducing a word to its base/root form, called stem

## **Stemming**

In [25]:
from nltk.stem import PorterStemmer

sentence = "Hello, You have build a very good application and I love using this product."

words = word_tokenize(sentence)

ps = PorterStemmer()

words_stem = [ps.stem(word) for word in words]
words_stem

['hello',
 ',',
 'you',
 'have',
 'build',
 'a',
 'veri',
 'good',
 'applic',
 'and',
 'i',
 'love',
 'use',
 'thi',
 'product',
 '.']

In [27]:
# Stemmed Recipe Corpus

# stemmer = PorterStemmer()

# all_tokens_lower = [t.lower(t for t in all_tokens)]

# tokens_normalised = [stemmer.stem(t) for t in all_tokens_lower if t not in stop_list]

# total_term_frequency_normalised = Counter(tokens_normalised)

# total_term_frequency_normalised.most_common(20)

## **Lemmatization**

**Lemmatization** is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. Lemmatization is similar to stemming but it brings context to the words. So it links words with similar meaning to one word.

Text preprocessing includes both Stemming as well as Lemmatization. Many times people find these two terms confusing. Some treat these two as same. Actually, lemmatization is preferred over Stemming because lemmatization does morphological analysis of the words.

Lemmatization is the process of converting a word to its base form. The difference between stemming and lemmatization is, lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.



**Wordnet** is an NLTK corpus reader, a lexical database for English. It can be used to find the meaning of words, synonym or antonym. One can define it as a semantically oriented dictionary of English. 

In [28]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ronak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [29]:
from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
  
print("rocks :", lemmatizer.lemmatize("rocks")) 
print("corpora :", lemmatizer.lemmatize("corpora")) 
  
# a denotes adjective in "pos" 
print("better :", lemmatizer.lemmatize("better", pos ="a")) 

rocks : rock
corpora : corpus
better : good


- **Stemming v/s Lemmatization**

In [46]:
# Stemming
#Stemming involves removing the suffix of a word to obtain its root form, also known as the stem. For example, the stem of the words "running", "runner", and "runs" is "run". 

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
ps=PorterStemmer()
lem=WordNetLemmatizer()
text = "studies studying cries cry"

tokenization = nltk.word_tokenize(text)

for w in tokenization:
  print("Stemming for {} is {}".format(w, ps.stem(w)))
print("--------------------------------------------------")
for w in tokenization:
  print("Lemma for {} is {}".format(w, lem.lemmatize(w)))   

Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri
--------------------------------------------------
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry


In [47]:
# Lemmatized Recipe Corpus
#Lemmatization, on the other hand, involves reducing words to their base form or lemma by taking into account their context and part of speech. For example, the lemma of the words "running", "runner", and "runs" is "run". 

from nltk.stem import WordNetLemmatizer 
  
lemmatizer = WordNetLemmatizer() 
  
all_tokens_lower = [t.lower() for t in all_tokens]

tokens_lemmatized = [lemmatizer.lemmatize(t) for t in all_tokens_lower
                    if t not in stop_list]

total_term_frequency_normalised = Counter(tokens_lemmatized)

total_term_frequency_normalised.most_common(20)

[('put', 276),
 ('butter', 243),
 ('piece', 211),
 ('one', 210),
 ('water', 209),
 ('little', 198),
 ('salt', 197),
 ('half', 173),
 ('fire', 169),
 ('cut', 166),
 ('egg', 163),
 ('two', 162),
 ('add', 160),
 ('sauce', 152),
 ('pepper', 130),
 ('flour', 123),
 ('sugar', 116),
 ('brown', 105),
 ('saucepan', 101),
 ('onion', 101)]

## **n-grams**

-- When we are interested in phrases rather than single terms, we can look into n-grams

-- An n-gram is a sequence of n adjacent terms.

-- Commonly used n-grams include bigrams (n=2) and trigrams (n=3).

In [48]:
from nltk import ngrams

phrases = Counter(ngrams(tokens_lemmatized,2))

phrases.most_common(20)

[(('salt', 'pepper'), 106),
 (('piece', 'butter'), 82),
 (('grated', 'cheese'), 55),
 (('put', 'fire'), 47),
 (('season', 'salt'), 46),
 (('bread', 'crumb'), 41),
 (('tomato', 'sauce'), 36),
 (('complete', 'cooking'), 34),
 (('brown', 'stock'), 30),
 (('thin', 'slice'), 29),
 (('olive', 'oil'), 27),
 (('little', 'piece'), 27),
 (('small', 'piece'), 27),
 (('low', 'fire'), 25),
 (('put', 'saucepan'), 25),
 (('chopped', 'fine'), 25),
 (('half', 'ounce'), 25),
 (('serve', 'hot'), 22),
 (('yolk', 'egg'), 22),
 (('boiling', 'water'), 22)]

In [49]:
phrases = Counter(ngrams(tokens_lemmatized,3))
phrases.most_common(20)

[(('season', 'salt', 'pepper'), 44),
 (('bread', 'crumb', 'ground'), 13),
 (('cut', 'thin', 'slice'), 13),
 (('taste', 'lemon', 'peel'), 12),
 (('pinch', 'grated', 'cheese'), 11),
 (('good', 'olive', 'oil'), 10),
 (('small', 'piece', 'butter'), 10),
 (('saucepan', 'piece', 'butter'), 9),
 (('another', 'piece', 'butter'), 9),
 (('cut', 'little', 'piece'), 9),
 (('crumb', 'ground', 'fine'), 9),
 (('cut', 'small', 'piece'), 9),
 (('half', 'inch', 'thick'), 9),
 (('greased', 'butter', 'sprinkled'), 9),
 (('medium', 'sized', 'onion'), 9),
 (('ounce', 'sweet', 'almond'), 9),
 (('tomato', 'sauce', '12'), 8),
 (('little', 'piece', 'butter'), 8),
 (('three', 'half', 'ounce'), 8),
 (('fire', 'piece', 'butter'), 7)]

# **spaCy**

In [50]:
import spacy
from spacy import displacy # for visualization
nlp = spacy.load("en_core_web_sm")

## **Parts of Speech with spaCy**

In [51]:
s1 = nlp("He drinks a drink")

for word in s1:
    print(word.text, word.pos_,"|",spacy.explain(word.pos_))

He PRON | pronoun
drinks VERB | verb
a DET | determiner
drink NOUN | noun


In [52]:
s2 = nlp("I fish a fish")

for word in s2:
    print(word.text,word.pos_, spacy.explain(word.pos_),word.tag_)

I PRON pronoun PRP
fish VERB verb VBP
a DET determiner DT
fish NOUN noun NN


In [53]:
spacy.explain('VBP')

'verb, non-3rd person singular present'

## **Syntactic Dependency**

- It helps us to know the relation between tokens

- How each word is connected and dependent on each other

In [54]:
s3 = nlp("Kiara likes Sid")

for word in s3:
    print((word.text, word.tag_, word.pos_, word.dep_))

('Kiara', 'NNP', 'PROPN', 'nsubj')
('likes', 'VBZ', 'VERB', 'ROOT')
('Sid', 'NNP', 'PROPN', 'dobj')


In [55]:
spacy

<module 'spacy' from 'C:\\Users\\ronak\\anaconda3\\lib\\site-packages\\spacy\\__init__.py'>

In [56]:
from spacy import displacy
displacy.render(s3, style='dep', jupyter=True)

## **Lemmatization with spaCy**  

In [57]:
doc_lemma = nlp("studying student study studies studio studious")

for word in doc_lemma:
    print("Token=>", word.text, "Lemma=>", word.lemma_, word.pos_)

Token=> studying Lemma=> study VERB
Token=> student Lemma=> student NOUN
Token=> study Lemma=> study NOUN
Token=> studies Lemma=> study NOUN
Token=> studio Lemma=> studio NOUN
Token=> studious Lemma=> studious ADJ


In [61]:
doc_lemma=nlp("good goods run running runner was be were")
for word in doc_lemma:
    print("Token->",word.text,"lemma->",word.lemma_,word.pos_)

Token-> good lemma-> good ADJ
Token-> goods lemma-> good NOUN
Token-> run lemma-> run VERB
Token-> running lemma-> run VERB
Token-> runner lemma-> runner NOUN
Token-> was lemma-> be AUX
Token-> be lemma-> be AUX
Token-> were lemma-> be AUX


## **NER with spaCy**

In [62]:
wikitext = nlp(u"By 2020 the telecom company Orange, will relocate from Turkey to Orange County in the U.S. close to Apple.It will cost them 2 billion dollars.")

for entity in wikitext.ents:
    print(entity.text, entity.label_)

2020 DATE
Orange NORP
Turkey GPE
Orange County GPE
U.S. GPE
Apple ORG
2 billion dollars MONEY


In [63]:
# What does GPE means
spacy.explain('GPE')

'Countries, cities, states'

## **Entity Types**


In [64]:
def explain_text_entities(text):
    doc = nlp(text)
    for ent in doc.ents:
        print(f'{ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')

In [65]:
explain_text_entities('Tesla has gained 20% market share in the months since')

Tesla, Label: ORG, Companies, agencies, institutions, etc.
20%, Label: PERCENT, Percentage, including "%"
the months, Label: DATE, Absolute or relative dates or periods


## **Semantic Similarity**

object1.similarity(object2)

- Uses:

- Recommendation systems

- Data Preprocessing eg removing duplicates

In [66]:
import warnings
warnings.filterwarnings("ignore")

# Similarity of object
doc1 = nlp("India")
doc2 = nlp("India")
doc3 = nlp("Lion")

print(doc1.similarity(doc2))
print(doc1.similarity(doc3))

1.0
0.5523053837367063
