# NLP
## Basic NLP Pipelines
- Data Collection
- Tokenisation and Stopword Removal: split the text into tokens, then drop common stopwords that carry little meaning.
- Stemming: reduce each word to its base form, e.g. jumped, jumping and jumps all become jump. This reduces the number of words in the dictionary.
- After these two steps are done, we start building a common vocabulary.
    - It was raining and the cat was jumping: [rain, cat, jump]
- Finally, we vectorise the documents and feed them to a classification/clustering algorithm.
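
As a rough illustration of the steps listed above, here is a minimal pure-Python sketch. The names `TOY_STOPWORDS`, `toy_stem` and `toy_pipeline` are hypothetical, and the tiny stopword set and naive suffix stripper are only stand-ins for the NLTK stopword list and stemmers used in the cells that follow.

```python
import re

# Assumption: a tiny hand-written stopword set, not NLTK's English list.
TOY_STOPWORDS = {"it", "was", "and", "the"}

def toy_stem(word):
    # Naive suffix stripping, only a stand-in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_pipeline(sentence):
    # 1. Tokenise: lowercase and keep alphabetic runs only
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # 2. Remove stopwords
    kept = [t for t in tokens if t not in TOY_STOPWORDS]
    # 3. Stem what is left
    return [toy_stem(t) for t in kept]

print(toy_pipeline("It was raining and the cat was jumping"))
# ['rain', 'cat', 'jump']
```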

In [17]:
from nltk.corpus import brown
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [18]:
data = brown.sents(categories='editorial')[:100]

In [19]:
print(data)

[['Assembly', 'session', 'brought', 'much', 'good'], ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.'], ...]


In [20]:
print(len(data))

100


### Tokenisation

In [21]:
text = "It was a very pleasant day, the weather was cool and there were light showers. I went to the market to buy some fruits."
text = text.lower()
print(text)

it was a very pleasant day, the weather was cool and there were light showers. i went to the market to buy some fruits.


In [22]:
from nltk.tokenize import sent_tokenize,word_tokenize

In [23]:
words = word_tokenize(text)
print(words)

['it', 'was', 'a', 'very', 'pleasant', 'day', ',', 'the', 'weather', 'was', 'cool', 'and', 'there', 'were', 'light', 'showers', '.', 'i', 'went', 'to', 'the', 'market', 'to', 'buy', 'some', 'fruits', '.']


In [24]:
from nltk.corpus import stopwords

sw  = set(stopwords.words('english'))

In [25]:
def filter_words(words):
    useful_words = [w for w in words if w not in sw]
    return useful_words

##### Tokenizer using a regular expression, so that unwanted punctuation marks and characters do not appear:

In [26]:
from nltk.tokenize import RegexpTokenizer

In [27]:
tokenizer  = RegexpTokenizer("[a-zA-Z]+")

words = tokenizer.tokenize(text)
print(words)

['it', 'was', 'a', 'very', 'pleasant', 'day', 'the', 'weather', 'was', 'cool', 'and', 'there', 'were', 'light', 'showers', 'i', 'went', 'to', 'the', 'market', 'to', 'buy', 'some', 'fruits']
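
The pattern decides what counts as a token. The sketch below uses the standard-library `re.findall`, which, as far as I know, yields the same matches as `RegexpTokenizer` for a pattern without groups. The sample sentence and the second pattern are hypothetical, chosen only to show what a letters-only pattern drops.

```python
import re

sample = "A 6ft wall, isn't it?".lower()

# Letters only: digits and apostrophes act as separators and are lost
letters_only = re.findall(r"[a-zA-Z]+", sample)

# A wider character class keeps digits and apostrophes inside tokens
wider = re.findall(r"[a-zA-Z0-9']+", sample)

print(letters_only)  # ['a', 'ft', 'wall', 'isn', 't', 'it']
print(wider)         # ['a', '6ft', 'wall', "isn't", 'it']
```

With the letters-only pattern, a token such as `6ft` is reduced to `ft`, so the pattern should be chosen with the data in mind.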


In [28]:
words = filter_words(words)
print(words)

['pleasant', 'day', 'weather', 'cool', 'light', 'showers', 'went', 'market', 'buy', 'fruits']


### Stemming
- Snowball Stemmer (Multilingual)
- Porter Stemmer
- Lancaster Stemmer

In [29]:
text= "Foxes love to make jumps.The quick brown fox was seen jumping over the lovely dog from a 6ft feet high wall"

words_list = tokenizer.tokenize(text.lower())
print(words_list)

['foxes', 'love', 'to', 'make', 'jumps', 'the', 'quick', 'brown', 'fox', 'was', 'seen', 'jumping', 'over', 'the', 'lovely', 'dog', 'from', 'a', 'ft', 'feet', 'high', 'wall']


In [30]:
word_list = filter_words(words_list) #Remove the stopwords
print(word_list)

['foxes', 'love', 'make', 'jumps', 'quick', 'brown', 'fox', 'seen', 'jumping', 'lovely', 'dog', 'ft', 'feet', 'high', 'wall']


In [31]:
from nltk.stem.snowball import PorterStemmer,SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer('english')

In [32]:
ps.stem("Teenager")

'teenag'

In [64]:
ps.stem("Quickly")

'quickli'

In [33]:
ls.stem("Teenager")

'teen'

In [34]:
ss.stem("Teenager")

'teenag'

In [35]:
for i in range(len(word_list)):
    word_list[i] = ss.stem(word_list[i])

In [36]:
print(word_list)

['fox', 'love', 'make', 'jump', 'quick', 'brown', 'fox', 'seen', 'jump', 'love', 'dog', 'ft', 'feet', 'high', 'wall']


In [37]:
def pipeline_vocab(text):
    # Lowercase first: the stopword list is lowercase, so "The" would otherwise slip through
    words = tokenizer.tokenize(text.lower())
    words = [w for w in words if w not in sw]
    # Stem every remaining token
    return [ss.stem(w) for w in words]

text= "Foxes love to make jumps.The quick brown fox was seen jumping over the lovely dog from a 6ft feet high wall"

new_words = pipeline_vocab(text)
print(new_words)

['fox', 'love', 'make', 'jump', 'quick', 'brown', 'fox', 'seen', 'jump', 'love', 'dog', 'ft', 'feet', 'high', 'wall']


### Building Common Vocabulary and Vectorizing Documents (based upon Bag of Words Model)

In [38]:
corpus = [
        'Indian cricket team will wins World Cup, says Capt. Virat Kohli. World cup will be held at Sri Lanka.',
        'We will win next Lok Sabha Elections, says confident Indian PM',
        'The nobel laurate won the hearts of the people',
        'The movie Raazi is an exciting Indian Spy thriller based upon a real story'
]

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [40]:
cv = CountVectorizer()

In [41]:
vectorized_corpus = cv.fit_transform(corpus).toarray()

In [42]:
print(vectorized_corpus)
print(len(vectorized_corpus[0]))

[[0 1 0 1 1 0 1 2 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1
  0 2 0 1 0 2]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0
  1 1 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 3 0 0 0
  0 0 0 0 1 0]
 [1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0
  0 0 0 0 0 0]]
42


In [43]:
print(cv.vocabulary_) #Dictionary - Word -> Index

{'indian': 12, 'cricket': 6, 'team': 31, 'will': 37, 'wins': 39, 'world': 41, 'cup': 7, 'says': 27, 'capt': 4, 'virat': 35, 'kohli': 14, 'be': 3, 'held': 11, 'at': 1, 'sri': 29, 'lanka': 15, 'we': 36, 'win': 38, 'next': 19, 'lok': 17, 'sabha': 26, 'elections': 8, 'confident': 5, 'pm': 23, 'the': 32, 'nobel': 20, 'laurate': 16, 'won': 40, 'hearts': 10, 'of': 21, 'people': 22, 'movie': 18, 'raazi': 24, 'is': 13, 'an': 0, 'exciting': 9, 'spy': 28, 'thriller': 33, 'based': 2, 'upon': 34, 'real': 25, 'story': 30}


In [44]:
import numpy as np
vector = np.ones((42,))
vector[3:7] = 0

print(vector)
print(len(vector))

[1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
42


In [45]:
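# Recent scikit-learn versions expect a 2D array (n_samples, n_features), so reshape to one row
print(cv.inverse_transform(vector.reshape(1, -1)))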

[array(['an', 'at', 'based', 'cup', 'elections', 'exciting', 'hearts',
       'held', 'indian', 'is', 'kohli', 'lanka', 'laurate', 'lok',
       'movie', 'next', 'nobel', 'of', 'people', 'pm', 'raazi', 'real',
       'sabha', 'says', 'spy', 'sri', 'story', 'team', 'the', 'thriller',
       'upon', 'virat', 'we', 'will', 'win', 'wins', 'won', 'world'],
      dtype='<U9')]


In [46]:
cv.vocabulary_["capt"]

4
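
To make the word-to-index mapping concrete, here is a small pure-Python sketch of the bag-of-words idea. It is not how `CountVectorizer` is implemented. The toy corpus and the helpers `to_vector` and `to_words` are hypothetical; the only behaviour mirrored from the output above is that the vocabulary is indexed in alphabetical order.

```python
import re
from collections import Counter

toy_corpus = ["the cat sat", "the cat saw the dog"]
tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in toy_corpus]

# Vocabulary: word -> column index, in alphabetical order
all_words = sorted({w for doc in tokenised for w in doc})
vocab = {w: i for i, w in enumerate(all_words)}

def to_vector(tokens):
    # Count each token and place the count in that word's column
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        vec[vocab[word]] = count
    return vec

def to_words(vec):
    # Rough analogue of inverse_transform: words with a non-zero count
    return [w for w, i in vocab.items() if vec[i] > 0]

vectors = [to_vector(t) for t in tokenised]
print(vocab)                 # {'cat': 0, 'dog': 1, 'sat': 2, 'saw': 3, 'the': 4}
print(vectors)               # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
print(to_words(vectors[0]))  # ['cat', 'sat', 'the']
```

Word order is lost in the vectors: only the counts per vocabulary entry survive.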

### Unigram: Bag of words model
- The vocabulary holds the frequency of single words (unigrams) only.
- Counting single words ignores word order, so it sometimes fails to capture the meaning of a sentence.
- In such cases we pair up neighbouring words using bigrams, trigrams and, in general, n-grams.
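
The pairing of neighbouring words can be sketched in a few lines. The helper `ngrams` below is hypothetical and only illustrates the idea on an already tokenised, stemmed sentence; it makes no claim about how scikit-learn builds n-grams internally.

```python
def ngrams(tokens, n_min, n_max):
    # Slide a window of each size n over the tokens and join the words with a space
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = ["indian", "cricket", "team", "win"]

print(ngrams(tokens, 1, 2))
# ['indian', 'cricket', 'team', 'win', 'indian cricket', 'cricket team', 'team win']

print(ngrams(tokens, 2, 3))
# ['indian cricket', 'cricket team', 'team win', 'indian cricket team', 'cricket team win']
```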

In [47]:
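# Use our own tokenise -> stopword removal -> stemming pipeline as the tokenizer
cv = CountVectorizer(tokenizer=pipeline_vocab)
vectorized_corpus = cv.fit_transform(corpus)
vc = vectorized_corpus.toarray()
print(vc[0])
# Recent scikit-learn versions expect a 2D array, so pass a single row reshaped to (1, n_features)
print(cv.inverse_transform(vc[0].reshape(1, -1)))
print(len(vc[0]))

# Switch on the word at index 0 by hand and map the vector back to words
vc[0][0] = 1
v = vc[0]
print(v)
print(cv.inverse_transform(v.reshape(1, -1)))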

[0 1 0 1 2 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 2]
[array(['capt', 'cricket', 'cup', 'held', 'indian', 'koh', 'lanka', 'say',
       'sri', 'team', 'virat', 'win', 'world'], dtype='<U8')]
32
[1 1 0 1 2 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 1 2]
[array(['base', 'capt', 'cricket', 'cup', 'held', 'indian', 'koh', 'lanka',
       'say', 'sri', 'team', 'virat', 'win', 'world'], dtype='<U8')]


### Bigrams and Trigrams and N-Grams

In [56]:
cv = CountVectorizer(tokenizer=pipeline_vocab, ngram_range=(1,3))
# ngram_range specifies which n-gram sizes to extract.
# (1,2) means the vectorization uses both unigrams and bigrams
# Similarly, (2,3) means bigrams and trigrams only
# Here (1,3) extracts unigrams, bigrams and trigrams
vectorized_corpus = cv.fit_transform(corpus)
vc = vectorized_corpus.toarray()

print(cv.vocabulary_)
v = vc[0]
print(len(v))


{'indian': 28, 'cricket': 9, 'team': 74, 'win': 86, 'world': 91, 'cup': 12, 'say': 63, 'capt': 3, 'virat': 83, 'koh': 34, 'held': 25, 'sri': 71, 'lanka': 37, 'indian cricket': 29, 'cricket team': 10, 'team win': 75, 'win world': 89, 'world cup': 92, 'cup say': 15, 'say capt': 64, 'capt virat': 4, 'virat koh': 84, 'koh world': 35, 'cup held': 13, 'held sri': 26, 'sri lanka': 72, 'indian cricket team': 30, 'cricket team win': 11, 'team win world': 76, 'win world cup': 90, 'world cup say': 94, 'cup say capt': 16, 'say capt virat': 65, 'capt virat koh': 5, 'virat koh world': 85, 'koh world cup': 36, 'world cup held': 93, 'cup held sri': 14, 'held sri lanka': 27, 'next': 47, 'lok': 41, 'sabha': 60, 'elect': 17, 'confid': 6, 'pm': 54, 'win next': 87, 'next lok': 48, 'lok sabha': 42, 'sabha elect': 61, 'elect say': 18, 'say confid': 66, 'confid indian': 7, 'indian pm': 31, 'win next lok': 88, 'next lok sabha': 49, 'lok sabha elect': 43, 'sabha elect say': 62, 'elect say confid': 19, 'say conf

### Tf-idf Normalisation
- If a word occurs in many documents, it is less likely to help us categorise a particular document.
- So, we need to focus more on terms that are specific to one document.
- So we define a weight for every term, tf-idf (term frequency times inverse document frequency).

##### tf: number of times a particular term occurs in a particular document
##### idf - inverse document frequency: log(n/(1+count(D,t)))
- n is the number of documents
- count(D,t) is the number of documents in D that contain the term t
#### weight of word = wd = tf*idf
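
The weight defined above can be worked through by hand. The sketch below reads count(D,t) as the number of documents that contain the term (the usual document-frequency interpretation) and uses a hypothetical toy corpus and helper names. scikit-learn's `TfidfVectorizer` uses, by default, a smoothed variant of idf and L2-normalises each row, so its numbers will not match this sketch exactly.

```python
import math

# Assumption: three already tokenised toy documents
docs = [
    ["cricket", "world", "cup", "world"],
    ["election", "win"],
    ["cricket", "win"],
]

def idf(term, docs):
    n = len(docs)                            # number of documents
    df = sum(1 for d in docs if term in d)   # documents containing the term
    return math.log(n / (1 + df))

def tf_idf(term, doc, docs):
    tf = doc.count(term)                     # occurrences in this document
    return tf * idf(term, docs)

# "world" appears twice in one document only: high weight
print(tf_idf("world", docs[0], docs))    # 2 * log(3/2), roughly 0.81
# "cricket" appears in two of three documents: log(3/3) = 0
print(tf_idf("cricket", docs[0], docs))  # 0.0
```

Under this exact formula a term present in every document gets a slightly negative idf, since n/(1+n) is below 1.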

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [63]:
tfidf = TfidfVectorizer(tokenizer = pipeline_vocab, ngram_range=(1,1))
tfidf_vector = tfidf.fit_transform(corpus).toarray()
print(tfidf_vector)
print(tfidf.vocabulary_)

[[0.         0.23802376 0.         0.23802376 0.47604753 0.
  0.         0.         0.23802376 0.15192748 0.23802376 0.23802376
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.18766067 0.
  0.23802376 0.         0.23802376 0.         0.         0.23802376
  0.18766067 0.47604753]
 [0.         0.         0.36153669 0.         0.         0.36153669
  0.         0.         0.         0.23076418 0.         0.
  0.         0.36153669 0.         0.36153669 0.         0.
  0.36153669 0.         0.         0.36153669 0.28503968 0.
  0.         0.         0.         0.         0.         0.
  0.28503968 0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.5        0.         0.         0.         0.
  0.5        0.         0.         0.         0.5        0.5
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.        ]
