# **Bag Of Words Pipeline**

* Get the Data/Corpus
* Tokenisation, Stopword Removal
* Stemming And Lemmatization
* Building a Vocab
* Vectorization
* Classification
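
The middle steps of this pipeline can be sketched in plain Python before bringing in any library. This is a minimal illustration only: the two-sentence corpus and the tiny stopword set are hypothetical, and stemming and classification are left out to keep it short.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, chosen only for illustration.
docs = ["The cat sat on the mat", "The dog sat on the log"]
stop_words = {"the", "on"}

def tokenize(text):
    # Lowercase the text and keep runs of alphabetic characters only.
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text):
    # Tokenise, then drop stopwords.
    return [w for w in tokenize(text) if w not in stop_words]

tokenized = [preprocess(d) for d in docs]

# Vocabulary: every unique token gets a column index, in sorted order.
unique_tokens = sorted({w for doc in tokenized for w in doc})
vocab = {w: i for i, w in enumerate(unique_tokens)}

def vectorize(tokens):
    # Count how often each vocabulary word occurs in the document.
    counts = Counter(tokens)
    return [counts.get(w, 0) for w in vocab]

vectors = [vectorize(t) for t in tokenized]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length count vector with one column per vocabulary word; word order is discarded, which is what "bag of words" refers to.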

# **1. Get The Data**

In [167]:
import nltk
from nltk.corpus import brown

In [168]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [169]:
data = brown.sents(categories="fiction")

# **2. Tokenisation, Stopword Removal**

In [170]:
from nltk.tokenize import word_tokenize,sent_tokenize

In [171]:
document = """It was a very pleasant day. The weather was cool and there were light showers. 
I went to the market to buy some fruits."""

sentence = "Send all the 50 documents related to chapters 1,2,3 at prateek@cb.com"

In [172]:
sents = sent_tokenize(document)
print(sents)

['It was a very pleasant day.', 'The weather was cool and there were light showers.', 'I went to the market to buy some fruits.']


In [173]:
words = word_tokenize(sentence)
print(words)

['Send', 'all', 'the', '50', 'documents', 'related', 'to', 'chapters', '1,2,3', 'at', 'prateek', '@', 'cb.com']


In [174]:
from nltk.corpus import stopwords

In [175]:
sw = set(stopwords.words("english"))
sw.remove('not')

In [176]:
def remove_stopwords(text,stopwords):
    useful_words = [w for w in text if w not in stopwords]
    return useful_words

In [177]:
useful_text = remove_stopwords(words,sw)
print(useful_text)

['Send', '50', 'documents', 'related', 'chapters', '1,2,3', 'prateek', '@', 'cb.com']


In [178]:
from nltk.tokenize import RegexpTokenizer

In [179]:
token = RegexpTokenizer('[a-zA-Z.@]+')
useful_text = token.tokenize(sentence)
print(useful_text)

['Send', 'all', 'the', 'documents', 'related', 'to', 'chapters', 'at', 'prateek@cb.com']


# **3. Stemming & Lemmatization**
* A process that reduces inflected words (verb forms, plurals) to their root form
* Preserves the meaning of the sentence while reducing the number of unique tokens
* Example - jumps, jumping, jumped, jump ==> jump
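
To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy sketch, not the Porter, Snowball or Lancaster algorithm; the function name `naive_stem` and its three-suffix rule are assumptions made purely for illustration.

```python
def naive_stem(word):
    # Strip the first matching suffix, keeping at least three characters of stem.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["jumps", "jumping", "jumped", "jump"]
stems = [naive_stem(w) for w in words]
print(stems)
print(len(set(words)), "unique tokens ->", len(set(stems)))
```

All four surface forms collapse to a single token, so the vocabulary shrinks. Real stemmers apply many more rules, and a lemmatizer looks words up in a dictionary instead of stripping suffixes.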

In [180]:
text= """Foxes love to make jumps.The quick brown fox was seen jumping over the 
        lovely dog from a 6ft feet high wall"""

In [181]:
from nltk.stem import SnowballStemmer,PorterStemmer,LancasterStemmer,WordNetLemmatizer

In [182]:
ps = PorterStemmer()
print(ps.stem('loving'))
print(ps.stem('lovely'))

love
love


In [183]:
ss = SnowballStemmer('english')
print(ss.stem('jumping'))
print(ss.stem('jumps'))

jump
jump


In [184]:
ls = LancasterStemmer()
print(ls.stem('walking'))
print(ls.stem('walked'))

walk
walk


In [185]:
wn = WordNetLemmatizer()
print(wn.lemmatize("aardwolves"))
print(wn.lemmatize('jumps'))

aardwolf
jump


# **4. Building a Vocab And Vectorization**

In [186]:
corpus = [
        'Indian cricket team will wins World Cup, says Capt. Virat Kohli. World cup will be held at Sri Lanka.',
        'We will win next Lok Sabha Elections, says confident Indian PM',
        'The nobel laurate won the hearts of the people.',
        'The movie Raazi is an exciting Indian Spy thriller based upon a real story.'
]

In [187]:
from sklearn.feature_extraction.text import CountVectorizer

In [188]:
cv = CountVectorizer()
vectorize_corpus = cv.fit_transform(corpus)
vectorize_corpus = vectorize_corpus.toarray()

In [189]:
print(vectorize_corpus)
print(len(vectorize_corpus[0]))

[[0 1 0 1 1 0 1 2 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1
  0 2 0 1 0 2]
 [0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0
  1 1 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 3 0 0 0
  0 0 0 0 1 0]
 [1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0
  0 0 0 0 0 0]]
42


In [190]:
print(cv.vocabulary_)

{'indian': 12, 'cricket': 6, 'team': 31, 'will': 37, 'wins': 39, 'world': 41, 'cup': 7, 'says': 27, 'capt': 4, 'virat': 35, 'kohli': 14, 'be': 3, 'held': 11, 'at': 1, 'sri': 29, 'lanka': 15, 'we': 36, 'win': 38, 'next': 19, 'lok': 17, 'sabha': 26, 'elections': 8, 'confident': 5, 'pm': 23, 'the': 32, 'nobel': 20, 'laurate': 16, 'won': 40, 'hearts': 10, 'of': 21, 'people': 22, 'movie': 18, 'raazi': 24, 'is': 13, 'an': 0, 'exciting': 9, 'spy': 28, 'thriller': 33, 'based': 2, 'upon': 34, 'real': 25, 'story': 30}


In [191]:
print(len(cv.vocabulary_.keys()))

42


In [192]:
for i in range(len(corpus)):
    # inverse_transform expects a 2D array, so slice to keep the row dimension
    s = " ".join(cv.inverse_transform(vectorize_corpus[i:i+1])[0])
    print(str(i+1)+". "+s)

1. at be capt cricket cup held indian kohli lanka says sri team virat will wins world
2. confident elections indian lok next pm sabha says we will win
3. hearts laurate nobel of people the won
4. an based exciting indian is movie raazi real spy story the thriller upon


# **Vectorization With Stopword Removal**

In [193]:
def myTokenizer(document):
    words = token.tokenize(document.lower())
    useful_words = remove_stopwords(words,sw)
    return useful_words
cv = CountVectorizer(tokenizer=myTokenizer)

In [194]:
vectorize_corpus = cv.fit_transform(corpus).toarray()
print(vectorize_corpus)

[[0 1 0 1 2 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 2]
 [0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0]]


In [195]:
print(len(cv.vocabulary_.keys()))

33


In [196]:
for i in range(len(corpus)):
    # inverse_transform expects a 2D array, so slice to keep the row dimension
    s = " ".join(cv.inverse_transform(vectorize_corpus[i:i+1])[0])
    print(str(i+1)+". "+s)

1. capt. cricket cup held indian kohli. lanka. says sri team virat wins world
2. confident elections indian lok next pm sabha says win
3. hearts laurate nobel people.
4. based exciting indian movie raazi real spy story. thriller upon


In [197]:
test_corpus = [
    'Indian Cricket Rock!'
]

In [198]:
test_vector = cv.transform(test_corpus).toarray()
print(test_vector)

[[0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]


# **More ways to Create Features**
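* Unigrams - every word is a feature
* Bigrams - every pair of consecutive words is a feature
* Trigrams - every run of three consecutive words is a feature
* n-grams - the general case: every run of n consecutive words is a feature
* TF-IDF Normalisation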
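
The n-gram features listed above can be generated with a sliding window over the tokens. The helper `ngrams` below is a hypothetical name used for illustration, and splitting on whitespace is a simplifying assumption.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the tokens and join each window into one feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this is not good movie".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

Bigrams and trigrams keep some local word order: "not good" becomes a feature of its own, which a unigram model cannot distinguish from "good" and "not" appearing separately.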

In [199]:
sent_1  = ["this is good movie"]
sent_2 = ["this is good movie but actor is not present"]
sent_3 = ["this is not good movie"]

In [207]:
docs = [sent_1[0],sent_3[0]]
cv = CountVectorizer(ngram_range=(1,3))
cv.fit_transform(docs).toarray()

array([[1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0],
       [1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1]], dtype=int64)

In [208]:
print(cv.vocabulary_)

{'this': 11, 'is': 2, 'good': 0, 'movie': 7, 'this is': 12, 'is good': 3, 'good movie': 1, 'this is good': 13, 'is good movie': 4, 'not': 8, 'is not': 5, 'not good': 9, 'this is not': 14, 'is not good': 6, 'not good movie': 10}


# **TF-IDF Normalisation**

* Down-weight features that occur very often, because they carry less information
* A term becomes less informative as the number of documents it occurs in increases
* So we define a weight for every term, the term frequency-inverse document frequency (TF-IDF), which is high for terms that are frequent in a document but rare across the corpus
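
The weighting can be worked out by hand. The sketch below assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row, which is what scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenisation and the helper name `tfidf_vector` are simplifications introduced here.

```python
import math
from collections import Counter

corpus = ["this is good movie", "this was good movie", "this is not good movie"]
docs = [d.split() for d in corpus]

# Sorted vocabulary: good, is, movie, not, this, was
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed inverse document frequency (assumed formula, see the text above).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    # Raw term count times IDF, then scale the row to unit L2 length.
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_vector(d) for d in docs]
for row in matrix:
    print([round(x, 4) for x in row])
```

Terms that occur in every document ("this", "good", "movie") get the minimum IDF of 1, while "was" and "not", each seen in only one document, get the largest weight in their rows.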

In [209]:
sent_1  = "this is good movie"
sent_2 = "this was good movie"
sent_3 = "this is not good movie"

corpus = [sent_1,sent_2,sent_3]

In [210]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [212]:
tfidf = TfidfVectorizer()
vc = tfidf.fit_transform(corpus).toarray()
print(vc)

[[0.46333427 0.59662724 0.46333427 0.         0.46333427 0.        ]
 [0.41285857 0.         0.41285857 0.         0.41285857 0.69903033]
 [0.3645444  0.46941728 0.3645444  0.61722732 0.3645444  0.        ]]


In [213]:
print(tfidf.vocabulary_)

{'this': 4, 'is': 1, 'good': 0, 'movie': 2, 'was': 5, 'not': 3}
