
# **Text Preprocessing using NLTK**

---



Install NLTK:

    !pip install nltk
  


In [0]:
import nltk

<b>NLTK ships with many corpora (text datasets) to work with,</b> e.g. the Brown corpus.

In [0]:
# nltk.download('brown')
# from nltk.corpus import brown
# print(brown.categories())
# print(len(brown.categories()))
# data = brown.sents(categories='adventure')
# data
# len(data)
# data[0]
# ' '.join(data[1]) ## to print sentence in adventure category

# Basic NLP Pipeline
- Data Collection
- Tokenization, stopword removal, stemming
- Building a common vocabulary
- Vectorizing the documents
- Performing Classification/ Clustering

### 2. Tokenization and Stopword Removal

In [0]:
document = """It was a very pleasant day. The weather was cool and there were light showers. I went to the market to buy some fruits. """
sentence = "Send all the documents realated to SSD chapters 9,10,11 at vkr@ece.com"

In [6]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [0]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [8]:
# sent_tokenize() splits a multi-sentence string (document/paragraph) into a list of sentence strings
sents = sent_tokenize(document)
print(sents)

['It was a very pleasant day.', 'The weather was cool and there were light showers.', 'I went to the market to buy some fruits.']


In [9]:
sents[0]

'It was a very pleasant day.'

In [0]:
# Two ways to split a sentence into words:
# 1. Python's built-in str.split() method (splits on whitespace only)
# 2. word_tokenize() from NLTK (also separates some punctuation, such as '@', into its own token)

In [11]:
sentence.split()

['Send',
 'all',
 'the',
 'documents',
 'realated',
 'to',
 'SSD',
 'chapters',
 '9,10,11',
 'at',
 'vkr@ece.com']

In [12]:
word_list = word_tokenize(sentence)
print(word_list)

['Send', 'all', 'the', 'documents', 'realated', 'to', 'SSD', 'chapters', '9,10,11', 'at', 'vkr', '@', 'ece.com']


### Stopword Removal

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
from nltk.corpus import stopwords

In [0]:
sw  = set(stopwords.words('english'))

In [16]:
print(sw)

{'she', 'and', "you've", 'while', 'below', 'until', 'after', "she's", 'for', 'it', 'nor', 'myself', "shouldn't", 'd', 'did', 'my', 'such', 'haven', 'few', 'whom', 'weren', 'of', 'again', 'most', 'themselves', 'our', 'o', 'that', 'off', "didn't", 'been', 'be', 'from', 'yourselves', 'me', 'ourselves', 'no', "mustn't", 'any', "don't", 'own', 'doing', 'll', 'to', 'needn', 'all', 'didn', "won't", 'once', 'who', 'if', 'the', 'had', 'more', 'other', 's', 'wasn', 'they', 'an', "should've", 'were', 'hers', 'now', 'both', 'is', 'he', 'before', 'shan', 'then', 'am', 'wouldn', 'as', 'above', 'have', 'her', 'at', 'them', 'on', 'over', "mightn't", 'during', 'yourself', 'your', 'mightn', 'against', 'with', "wasn't", 'can', 'do', 'than', 'his', 'himself', 'which', 'was', "you'll", 've', 'here', 'you', "that'll", 'each', 'theirs', 'isn', 'will', 'does', 'just', 'itself', 'aren', 'shouldn', 'very', 'those', 'having', "wouldn't", 'has', 'mustn', 'what', 'ain', "you're", 'because', 'or', 'couldn', "couldn

In [0]:
def remove_stopwords(word_list):
    useful_words = [w for w in word_list if w not in sw]
    return useful_words

In [18]:
useful_words= remove_stopwords(word_tokenize(sents[1]))
print(useful_words)

['The', 'weather', 'cool', 'light', 'showers', '.']
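
Note that `'The'` survives in the output above: the stopword set contains only lowercase words, and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows. The function name `remove_stopwords_ci` and the tiny `toy_stopwords` set are assumptions made for this illustration; in the notebook the real set would be `sw`.

```python
# Hypothetical, tiny stopword set used only for this illustration;
# in the notebook the real set comes from stopwords.words('english').
toy_stopwords = {"the", "was", "and", "there", "were"}

def remove_stopwords_ci(word_list, stopword_set):
    """Drop every token whose lowercase form is in stopword_set."""
    return [w for w in word_list if w.lower() not in stopword_set]

tokens = ["The", "weather", "was", "cool", "and", "there", "were", "light", "showers", "."]
filtered = remove_stopwords_ci(tokens, toy_stopwords)
print(filtered)  # ['weather', 'cool', 'light', 'showers', '.']
```

The original casing of the kept tokens is preserved; only the comparison is done in lowercase.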


### Tokenization using Regular Expression 
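Limitations of `word_tokenize`: it keeps numbers and cannot be customised for special token formats (note how it splits the e-mail address above). NLTK's `RegexpTokenizer` class lets us define tokens with a regular expression instead.

For help with writing regular expressions, see the cheatsheet and tester at [Regex Tester](https://www.regexpal.com/).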

In [0]:
from nltk.tokenize import RegexpTokenizer

In [20]:
# [a-zA-Z] matches a single letter; + extends the match to a run of one or more letters, so each run becomes one token
tokenizer = RegexpTokenizer("[a-zA-Z]+")
useful_text = tokenizer.tokenize(sentence)
useful_text

['Send',
 'all',
 'the',
 'documents',
 'realated',
 'to',
 'SSD',
 'chapters',
 'at',
 'vkr',
 'ece',
 'com']

In [21]:
tokenizer = RegexpTokenizer("[a-zA-Z@]+")
useful_text = tokenizer.tokenize(sentence)
useful_text

['Send',
 'all',
 'the',
 'documents',
 'realated',
 'to',
 'SSD',
 'chapters',
 'at',
 'vkr@ece',
 'com']
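
Adding `@` to the character class still splits the address at the dot. One possible fix, sketched here with Python's built-in `re` module, is to allow `.` inside a token as well. The pattern below is an assumption chosen for this sentence only; passing the same pattern to `RegexpTokenizer` should give the same tokens, since by default it collects all non-overlapping matches of the pattern.

```python
import re

sentence = "Send all the documents realated to SSD chapters 9,10,11 at vkr@ece.com"

# Letters, '@' and '.' may all appear inside a token, so the address stays in one piece.
# Caveat: a '.' that ends a sentence would stay attached to the preceding word.
pattern = r"[a-zA-Z@.]+"
tokens = re.findall(pattern, sentence)
print(tokens)
# ['Send', 'all', 'the', 'documents', 'realated', 'to', 'SSD', 'chapters', 'at', 'vkr@ece.com']
```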

## Stemming
- Reduces inflected words (verb forms, plurals) to their stem (root form)
- Preserves the meaning of the sentence while reducing the number of unique tokens
- jumps, jumping, jumped, jump ==> jump
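
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is a toy, not the Porter, Snowball or Lancaster algorithm; the function `toy_stem`, its suffix list and the minimum stem length are all assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s"), min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["jumps", "jumping", "jumped", "jump"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['jump', 'jump', 'jump', 'jump']
```

Real stemmers apply many more rules (and exceptions) than this, but the principle of rule-based suffix removal is the same.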

In [22]:
text= """Foxes love to make jumps.The quick brown fox was seen jumping over the 
        lovely dog from a 6ft feet high wall"""

words_list = tokenizer.tokenize(text.lower())
print(words_list)

['foxes', 'love', 'to', 'make', 'jumps', 'the', 'quick', 'brown', 'fox', 'was', 'seen', 'jumping', 'over', 'the', 'lovely', 'dog', 'from', 'a', 'ft', 'feet', 'high', 'wall']


### Stemmers in NLTK

- 1) Snowball Stemmer (multilingual)
- 2) Porter Stemmer
- 3) Lancaster Stemmer

In [0]:
from nltk.stem.snowball import PorterStemmer,SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

In [0]:
ps = PorterStemmer()

In [25]:
print(ps.stem('playing'))
print(ps.stem('played'))
print(ps.stem('plays'))
print(ps.stem('play'))

play
play
play
play


In [0]:
ss = SnowballStemmer('english')      # Snowball is multilingual, so the language must be specified

In [27]:
print(ss.stem('playing'))
print(ss.stem('played'))
print(ss.stem('plays'))
print(ss.stem('play'))

play
play
play
play


In [28]:
print(ss.stem('lovely'))
print(ss.stem('loving'))
print(ss.stem('lover'))

love
love
lover


In [29]:
print(ss.stem('fighting'))
print(ss.stem('fight--ing'))   # stemming is rule-based suffix stripping, not a dictionary lookup, so 'ing' is removed even from a non-word

fight
fight--


In [30]:
ls = LancasterStemmer()
ls.stem("teeth")

print(ls.stem("teenager")) # Lancaster (the most aggressive of the three)
print(ps.stem("teenager")) # Porter
print(ss.stem('teenager')) # Snowball

teen
teenag
teenag


**All in One Function**

In [0]:
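def text_preprocess(text):
  """Tokenize, remove stopwords, then stem each remaining word."""
  word_splits = tokenizer.tokenize(text)  # regex tokenizer defined above
  # sw holds lowercase words only and text is not lowercased here, so capitalised stopwords such as 'I' are kept
  useful_words = remove_stopwords(word_splits)
  return [ls.stem(w) for w in useful_words]  # Lancaster stemmer (it also lowercases its output)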


In [0]:
strr = " I am playing cricket and after that i will watch cricket , then I will go for studying . "

In [33]:
strrr = text_preprocess(strr)
print(strrr) 

['i', 'play', 'cricket', 'watch', 'cricket', 'i', 'go', 'study']


## Lemmatization
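Like stemming, lemmatization reduces a word to a base form. The difference is that it looks the word up in a dictionary (WordNet) and returns its lemma instead of stripping suffixes by rule, so the result is always a valid word.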

In [34]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [35]:
from nltk.stem import WordNetLemmatizer

l = WordNetLemmatizer()
l.lemmatize("crying", pos="v")  # pos="v" treats the word as a verb; the default is pos="n" (noun)

'cry'

# **Bag of Words Model Construction**

## Building Common Vocab



In [0]:
corpus = [
        'Indian cricket team will wins World Cup, says Capt. Virat Kohli. World cup will be held at Sri Lanka.',
        'We will win next Lok Sabha Elections, says confident Indian PM',
        'The nobel laurate won the hearts of the people',
        'The movie Raazi is an exciting Indian Spy thriller based upon a real story'
]

First we collect the unique words of the corpus into a vocabulary and map every word to an integer index; this mapping is then used to convert each document into a vector of word counts.
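
The vocabulary-building step can be sketched by hand in plain Python. This is a simplified illustration, not how `CountVectorizer` is implemented: tokens are obtained with `str.split()`, whereas `CountVectorizer` applies its own token pattern. The toy documents and the helper `to_count_vector` are assumptions made for this example.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # hypothetical toy corpus

# Collect the unique lowercase tokens and assign column indices in alphabetical order.
unique_words = sorted({w for d in docs for w in d.lower().split()})
vocab = {word: idx for idx, word in enumerate(unique_words)}

def to_count_vector(doc, vocab):
    """Count how often each vocabulary word occurs in doc; unknown words are ignored."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocab)
    for word, idx in vocab.items():
        vector[idx] = counts.get(word, 0)
    return vector

vectors = [to_count_vector(d, vocab) for d in docs]
print(vocab)    # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]

# A word that is not in the vocabulary ('bird') simply contributes nothing.
print(to_count_vector("the bird sat", vocab))  # [0, 0, 1, 0, 1]
```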

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
cv = CountVectorizer()

In [0]:
vectorized_corpus = cv.fit_transform(corpus)  # fit learns the vocabulary; transform converts each document into a count vector

In [40]:
vectorized_corpus  #It is a sparse matrix 

<4x42 sparse matrix of type '<class 'numpy.int64'>'
	with 47 stored elements in Compressed Sparse Row format>

In [0]:
vectorized_corpus = vectorized_corpus.toarray() #converting sparse matrix into array

In [42]:
print(vectorized_corpus[0]) #to see the first document
print(len(vectorized_corpus[0]))

[0 1 0 1 1 0 1 2 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0
 2 0 1 0 2]
42


In [43]:
print(cv.vocabulary_)    # mapping from each word to its column index in the vector

{'indian': 12, 'cricket': 6, 'team': 31, 'will': 37, 'wins': 39, 'world': 41, 'cup': 7, 'says': 27, 'capt': 4, 'virat': 35, 'kohli': 14, 'be': 3, 'held': 11, 'at': 1, 'sri': 29, 'lanka': 15, 'we': 36, 'win': 38, 'next': 19, 'lok': 17, 'sabha': 26, 'elections': 8, 'confident': 5, 'pm': 23, 'the': 32, 'nobel': 20, 'laurate': 16, 'won': 40, 'hearts': 10, 'of': 21, 'people': 22, 'movie': 18, 'raazi': 24, 'is': 13, 'an': 0, 'exciting': 9, 'spy': 28, 'thriller': 33, 'based': 2, 'upon': 34, 'real': 25, 'story': 30}


In [44]:
print(len(cv.vocabulary_.keys()))

42


In [0]:
# Reverse mapping or inverse transform i.e. from vector to original sentence
numbers = vectorized_corpus[2]

In [46]:
numbers

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 1, 0])

In [47]:
s = cv.inverse_transform(numbers.reshape(1, -1))  # inverse_transform expects a 2D array (one row per document)
print(s)     # word order is lost, which is why the model is called a Bag of Words model

[array(['hearts', 'laurate', 'nobel', 'of', 'people', 'the', 'won'],
      dtype='<U9')]


## Vectorization with Stopwords removal

In [0]:
def myTokenizer(document):
  words = tokenizer.tokenize(document.lower())
  words = remove_stopwords(words)
  return words

In [49]:
myTokenizer("this is some FUNCTION")

['function']

In [0]:
cv = CountVectorizer(tokenizer=myTokenizer)

In [0]:
vectorized_corpus = cv.fit_transform(corpus).toarray()

In [52]:
#vectorized_corpus
print(vectorized_corpus)
print(len(vectorized_corpus[0]))

[[0 1 0 1 2 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 2]
 [0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0]]
33


In [53]:
cv.inverse_transform(vectorized_corpus)

[array(['capt', 'cricket', 'cup', 'held', 'indian', 'kohli', 'lanka',
        'says', 'sri', 'team', 'virat', 'wins', 'world'], dtype='<U9'),
 array(['confident', 'elections', 'indian', 'lok', 'next', 'pm', 'sabha',
        'says', 'win'], dtype='<U9'),
 array(['hearts', 'laurate', 'nobel', 'people'], dtype='<U9'),
 array(['based', 'exciting', 'indian', 'movie', 'raazi', 'real', 'spy',
        'story', 'thriller', 'upon'], dtype='<U9')]

In [0]:
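# Test data is transformed with the vocabulary learned from the training corpus; unseen words (here 'rocks') are ignored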
test_corpus = [
               'Cricket Rocks!',
]

In [55]:
vectorized_test_corpus = cv.transform(test_corpus).toarray()
print(vectorized_test_corpus)

[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


### Features in Bag of Words Model
- Unigrams
- Bigrams, Trigrams
- N-Grams

In [0]:
sent_1 = ["this is good movie"]
sent_2 = ["this is not good movie"]

In [0]:
cv = CountVectorizer()

In [58]:
docs = [sent_1[0],sent_2[0]]
cv.fit_transform(docs).toarray()

array([[1, 1, 1, 0, 1],
       [1, 1, 1, 1, 1]])

In [0]:
# Bigrams treat every pair of adjacent words as a feature, which helps capture negation (e.g. "not good")
# ngram_range = (1,1) -> unigrams only
# ngram_range = (2,2) -> bigrams only
# ngram_range = (3,3) -> trigrams only
# ngram_range = (1,n) -> all n-grams from unigrams up to n-grams

In [60]:
cv = CountVectorizer(tokenizer=myTokenizer,ngram_range=(1,1))
vectorized_corpus = cv.fit_transform(corpus)
vc = vectorized_corpus.toarray()

print(cv.vocabulary_)

{'indian': 9, 'cricket': 3, 'team': 26, 'wins': 31, 'world': 32, 'cup': 4, 'says': 22, 'capt': 1, 'virat': 29, 'kohli': 10, 'held': 8, 'sri': 24, 'lanka': 11, 'win': 30, 'next': 15, 'lok': 13, 'sabha': 21, 'elections': 5, 'confident': 2, 'pm': 18, 'nobel': 16, 'laurate': 12, 'hearts': 7, 'people': 17, 'movie': 14, 'raazi': 19, 'exciting': 6, 'spy': 23, 'thriller': 27, 'based': 0, 'upon': 28, 'real': 20, 'story': 25}
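
The cell above uses `ngram_range=(1,1)`, so it still produces unigrams only. To see what bigram features look like, here is a small plain-Python sketch; the helper `ngrams` is an assumption made for illustration, and `CountVectorizer(ngram_range=(2,2))` builds features of this kind from its own tokens.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent_1 = "this is good movie".split()
sent_2 = "this is not good movie".split()

bigrams_1 = ngrams(sent_1, 2)
bigrams_2 = ngrams(sent_2, 2)
print(bigrams_1)  # ['this is', 'is good', 'good movie']
print(bigrams_2)  # ['this is', 'is not', 'not good', 'good movie']

# Bigrams that occur only in the negated sentence
only_in_2 = sorted(set(bigrams_2) - set(bigrams_1))
print(only_in_2)  # ['is not', 'not good']
```

With unigrams the two sentences differ only by the feature `not`; with bigrams the negation shows up as the features `is not` and `not good`.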


# Tf-idf Normalisation
- Down-weight features that occur very often, because they carry less information
- The information a term carries decreases as the number of documents it occurs in increases
- So we define another quantity, the inverse document frequency, which associates a weight with every term

Tf-idf -> term frequency - inverse document frequency<br>
tf(t,d) -> count of term t in document d<br>
idf(t,D) = log(N/(1+count(t,D)))<br>
tf-idf(t,d) = tf(t,d) * idf(t,D)<br>
where, N = total number of documents in the corpus D<br>
  &nbsp;&nbsp;&nbsp;&nbsp; count(t,D) = number of documents in D that contain t

Note: scikit-learn's `TfidfVectorizer` uses, by default, the smoothed variant idf(t) = ln((1+N)/(1+count(t,D))) + 1 and then L2-normalises each document vector, so its numbers differ from the plain formula above.

In [0]:
sent_1 = "this is good movie"
sent_2 = "this was good movie"
sent_3 = "this is not good movie"

In [0]:
corpus = [sent_1,sent_2,sent_3]

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
tfidf = TfidfVectorizer()

In [65]:
vectorized_corpus = tfidf.fit_transform(corpus).toarray()
print(vectorized_corpus)

[[0.46333427 0.59662724 0.46333427 0.         0.46333427 0.        ]
 [0.41285857 0.         0.41285857 0.         0.41285857 0.69903033]
 [0.3645444  0.46941728 0.3645444  0.61722732 0.3645444  0.        ]]


In [66]:
print(tfidf.vocabulary_)

{'this': 4, 'is': 1, 'good': 0, 'movie': 2, 'was': 5, 'not': 3}
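
The matrix printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: raw term counts, the smoothed idf `ln((1+N)/(1+df)) + 1`, and L2 normalisation of each row. The helper name `tfidf_row` is an assumption made for this illustration.

```python
import math
from collections import Counter

docs = ["this is good movie", "this was good movie", "this is not good movie"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})  # alphabetical column order
n_docs = len(docs)

# Document frequency: the number of documents that contain each term
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed idf (assumed default: smooth_idf=True)
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Raw tf * idf for every vocabulary term, then L2-normalise the row."""
    tf = Counter(tokens)
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

rows = [tfidf_row(doc) for doc in tokenized]
print(vocab)
for row in rows:
    print([round(x, 8) for x in row])
```

Terms that occur in every document (`good`, `movie`, `this`) get the minimum idf of 1, while `not` and `was`, which occur in a single document, get the largest weights.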


In [67]:
tfidf_vectorizer = TfidfVectorizer(tokenizer=myTokenizer,ngram_range=(1,1),norm='l2')

vectorized_corpus = tfidf_vectorizer.fit_transform(corpus).toarray()
print(vectorized_corpus)
print(tfidf_vectorizer.vocabulary_)

[[0.70710678 0.70710678]
 [0.70710678 0.70710678]
 [0.70710678 0.70710678]]
{'good': 0, 'movie': 1}
