# **Bag Of Words Pipeline**

* Get the Data/Corpus
* Tokenisation, Stopword Removal
* Stemming And Lemmatization
* Building a Vocab
* Vectorization
* Classification
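
The steps above can be sketched end to end in plain Python. This is a minimal illustration, not the notebook's NLTK-based implementation: the two-sentence corpus, the `toy_stopwords` set, and the helper names `tokenize`, `preprocess`, and `vectorize` are all hypothetical, stemming is skipped, and classification is left out.

```python
import re
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only
corpus = ["the cat sat on the mat", "the dog chased the cat"]
toy_stopwords = {"the", "on"}

def tokenize(text):
    # Lowercase and keep alphabetic runs (a stand-in for a real tokenizer)
    return re.findall(r"[a-z]+", text.lower())

def preprocess(text):
    # Tokenise, then drop stopwords
    return [w for w in tokenize(text) if w not in toy_stopwords]

docs = [preprocess(d) for d in corpus]

# Vocabulary: each unique token gets a column index (sorted for a stable order)
vocab = {word: i for i, word in enumerate(sorted({w for d in docs for w in d}))}

def vectorize(tokens):
    # One count per vocabulary word, in vocabulary order
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

vectors = [vectorize(d) for d in docs]
print(vocab)
print(vectors)
```

Each document becomes a fixed-length list of word counts, which is the form a classifier would consume in the last step.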

# **1. Get The Data**

In [3]:
import nltk
from nltk.corpus import brown

# Run once if the Brown corpus is not installed locally:
# nltk.download('brown')

In [4]:
print(brown.categories())

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [5]:
data = brown.sents(categories="fiction")

# **2. Tokenisation, Stopword Removal**

In [6]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [7]:
document = """It was a very pleasant day. The weather was cool and there were light showers. 
I went to the market to buy some fruits."""

sentence = "Send all the 50 documents related to chapters 1,2,3 at prateek@cb.com"

In [8]:
sents = sent_tokenize(document)
print(sents)

['It was a very pleasant day.', 'The weather was cool and there were light showers.', 'I went to the market to buy some fruits.']


In [9]:
words = word_tokenize(sentence)
print(words)

['Send', 'all', 'the', '50', 'documents', 'related', 'to', 'chapters', '1,2,3', 'at', 'prateek', '@', 'cb.com']


In [10]:
from nltk.corpus import stopwords

In [18]:
sw = set(stopwords.words("english"))
# Keep "not" as a useful word: negation changes the meaning of a sentence
sw.remove('not')

In [19]:
def remove_stopwords(text, stopword_set):
    # Parameter renamed so it no longer shadows the imported `stopwords` module
    useful_words = [w for w in text if w not in stopword_set]
    return useful_words

In [22]:
useful_text = remove_stopwords(words,sw)
print(useful_text)

['Send', '50', 'documents', 'related', 'chapters', '1,2,3', 'prateek', '@', 'cb.com']
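
NLTK's English stopword list is all lowercase, so the membership test in `remove_stopwords` is case-sensitive: a capitalised "The" at the start of a sentence would not be removed. A minimal sketch of a case-insensitive variant follows; the function name `remove_stopwords_ci` and the small `toy_sw` set are hypothetical stand-ins, not part of the notebook.

```python
def remove_stopwords_ci(tokens, stopword_set):
    # Compare the lowercased token, but return the token in its original case
    return [w for w in tokens if w.lower() not in stopword_set]

# Hypothetical stand-in for the NLTK stopword set
toy_sw = {"the", "to", "at", "all"}
tokens = ["The", "market", "is", "open", "to", "all"]

filtered = remove_stopwords_ci(tokens, toy_sw)
print(filtered)
```

Whether to lowercase before filtering is a design choice; it can be undesirable if capitalisation is needed later, for example to spot proper nouns.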


In [26]:
from nltk.tokenize import RegexpTokenizer

In [31]:
token = RegexpTokenizer('[a-zA-Z.@]+')
useful_text = token.tokenize(sentence)
print(useful_text)

['Send', 'all', 'the', 'documents', 'related', 'to', 'chapters', 'at', 'prateek@cb.com']
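
With its default settings, `RegexpTokenizer` returns every match of the pattern, so the same behaviour can be reproduced with the standard `re` module. The sketch below is only meant to show what the pattern `[a-zA-Z.@]+` keeps and drops; the second example sentence is taken from the `document` string earlier in this section.

```python
import re

# Same character class as in the cell above: letters, '.', and '@'
pattern = re.compile(r"[a-zA-Z.@]+")

sentence = "Send all the 50 documents related to chapters 1,2,3 at prateek@cb.com"
tokens = pattern.findall(sentence)
print(tokens)

# Because '.' is part of the character class, a sentence-final period
# stays attached to the last word
trailing = pattern.findall("It was a very pleasant day.")
print(trailing)
```

The email address survives as a single token and the numbers are dropped, but the price is that ordinary full stops are no longer split off from words.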


# **3. Stemming & Lemmatization**
* A process that reduces inflected word forms (verb tenses, plurals) to a common root form
* Preserves the meaning of the sentence while keeping the number of unique tokens small
* Example: jumps, jumping, jumped, jump ==> jump
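
To make the idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter, Snowball, or Lancaster algorithm; the function `toy_stem` and its suffix list are hypothetical and only handle the example above.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    # Strip the first matching suffix, but keep at least three characters
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

forms = ["jumps", "jumping", "jumped", "jump"]
stems = [toy_stem(w) for w in forms]
unique_stems = set(stems)
print(stems)
print(unique_stems)
```

Four surface forms collapse into a single vocabulary entry. Real stemmers apply many more rules and exceptions than this sketch does.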

In [35]:
text = """Foxes love to make jumps. The quick brown fox was seen jumping over the
lovely dog from a 6ft high wall"""

In [54]:
from nltk.stem import SnowballStemmer, PorterStemmer, LancasterStemmer, WordNetLemmatizer

In [55]:
ps = PorterStemmer()
print(ps.stem('loving'))
print(ps.stem('lovely'))

love
love


In [56]:
ss = SnowballStemmer('english')
print(ss.stem('jumping'))
print(ss.stem('jumps'))

jump
jump


In [57]:
ls = LancasterStemmer()
print(ls.stem('walking'))
print(ls.stem('walked'))

walk
walk


In [61]:
wn = WordNetLemmatizer()
print(wn.lemmatize("jumping"))
print(wn.lemmatize('jumps'))

jumping
jump

`lemmatize` treats each word as a noun unless told otherwise (`pos='n'` by default), which is why "jumping" is returned unchanged. Passing `pos='v'` makes it return "jump".
