<h2>Bag of Words </h2> <br>
The main idea behind bag of words: "... the more a word appears, the more important it is ..."
<br>
<h3>without pre-processing </h3><br>
In this first example, we build a bag of words from the raw, unprocessed text.

In [4]:
from nltk.tokenize import word_tokenize

In [5]:
from collections import Counter

<b>Counter</b> objects.<br> A Counter is a dict subclass for counting hashable objects. It is an unordered collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts.

Example:

In [6]:
# initialize the counter with the letters of a string
c = Counter('abcdaab')

# look up the count of each letter; a missing key ('e') returns 0
for letter in 'abcde':
    print(letter, c[letter])

a 3
b 2
c 1
d 1
e 0
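
The description above mentions that counts may be zero or negative. A minimal sketch of how that can arise, using `Counter.subtract` (the subtracted letters are illustrative and not part of the notebook):

```python
from collections import Counter

c = Counter('abcdaab')

# subtract() lowers counts in place and, unlike the - operator,
# keeps entries whose count drops to zero or below
c.subtract('ccd')

print(c['c'])    # -1
print(c['d'])    # 0
print(c['e'])    # 0, a missing key reports zero
print('e' in c)  # False, the lookup did not add the key
```

Looking up a missing key returns 0 instead of raising `KeyError`, which is why the loop over 'abcde' above prints `e 0`.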


Define the text to be processed:

In [7]:
txt = "The trader is in the pit before the opening of the markets. \
The trader walks in the pit. Everybody screams at the trader."

Initialize the Counter with the tokenized text, converting the text to lower case first:

In [8]:
a = Counter(word_tokenize(txt.lower()))

Look at the two most common words:

In [9]:
a.most_common(2)

[('the', 7), ('trader', 3)]

The result above is not very useful: the most frequent token is the stopword 'the'. We clearly need to remove stopwords and punctuation.
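
Punctuation inflates the counts in the same way. The sketch below uses a simple regular expression as a stand-in for `word_tokenize` (an assumption made so the snippet runs without NLTK; the real tokenizer may split some text differently):

```python
import re
from collections import Counter

txt = ("The trader is in the pit before the opening of the markets. "
       "The trader walks in the pit. Everybody screams at the trader.")

# crude tokenizer (assumption): runs of word characters,
# or single punctuation marks
rough_tokens = re.findall(r"\w+|[^\w\s]", txt.lower())
counts = Counter(rough_tokens)

print(counts['the'])     # 7
print(counts['trader'])  # 3
print(counts['.'])       # 3, the full stop is as frequent as 'trader'
```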
<br><hr>
<h3>with pre-processing</h3>
Sequence of actions:
<ol>
<li>tokenization<br>
<li>lowercasing words <br>
<li>stemming (listed for completeness; the code below does not apply it) <br>
<li>removing punctuation<br>
<li>removing stopwords<br>
</ol>
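
Stemming reduces inflected forms such as 'screams' and 'scream' to a common stem so that they are counted together. NLTK ships stemmers in `nltk.stem` (for example `PorterStemmer`). The function `crude_stem` below is only a hypothetical toy stand-in, far cruder than a real stemmer, meant to show the effect on the counts:

```python
from collections import Counter

def crude_stem(word):
    """Toy stemmer (illustrative only): drop one trailing 's'."""
    if len(word) > 3 and word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word

words = ['screams', 'scream', 'markets', 'walks', 'pit']

# without stemming, 'screams' and 'scream' are separate entries
print(Counter(words)['scream'])   # 1

# with the toy stemmer, both forms fall under 'scream'
stemmed_counts = Counter(crude_stem(w) for w in words)
print(stemmed_counts['scream'])   # 2
```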

In [10]:
from nltk.corpus import stopwords

The string method <b>isalpha</b> returns True if the string is non-empty and consists only of alphabetic characters, and False otherwise. <br>
The line of code below tokenizes the lowercased text and keeps only the tokens that are purely alphabetic, which drops punctuation and numbers.

In [11]:
tokens = [w for w in word_tokenize(txt.lower()) if w.isalpha()]
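
To see which kinds of tokens this filter keeps, `isalpha` can be tried on a few hand-picked strings (the samples are illustrative, not actual tokenizer output):

```python
samples = ['trader', '.', "n't", '2nd', 'markets', '']

# print each sample next to the result of isalpha()
for s in samples:
    print(repr(s), s.isalpha())

# the same filter as in the list comprehension above
kept = [s for s in samples if s.isalpha()]
print(kept)  # ['trader', 'markets']
```

Tokens containing an apostrophe or a digit are dropped along with pure punctuation, so the filter is stricter than "remove punctuation" alone.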

Next, we remove the stop words from the tokens:

In [12]:

stop_words = set(stopwords.words('english'))  # build the lookup set once
tokens_no_stopwords = [t for t in tokens if t not in stop_words]

Count the remaining tokens with a Counter and print the two most common words:

In [13]:
a = Counter(tokens_no_stopwords)
for word, count in a.most_common(2):
    print(word, '(', count, ')')

trader ( 3 )
pit ( 2 )


<h4>Grouping texts together</h4>
<br>

Counter instances support arithmetic and set operations for aggregating results. Given two Counter objects c1 and c2:<br>
Combined counts: c1 + c2
<br>
Subtraction (keeping only positive counts): c1 - c2
<br>
Intersection (taking minimums): c1 & c2
<br>
Union (taking maximums): c1 | c2
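
A small sketch of the four operations on two hand-made counters (the words and counts are illustrative, chosen to be easy to check by hand):

```python
from collections import Counter

c1 = Counter({'pit': 2, 'trader': 3})
c2 = Counter({'pit': 1, 'shout': 1})

combined = c1 + c2    # pit: 3, trader: 3, shout: 1
difference = c1 - c2  # pit: 1, trader: 3 ('shout' would be -1, so it is dropped)
common = c1 & c2      # pit: 1 (minimum of 2 and 1)
union = c1 | c2       # pit: 2, trader: 3, shout: 1 (maximum per word)

for name, result in [('+', combined), ('-', difference),
                     ('&', common), ('|', union)]:
    print(name, dict(result))

# subtraction is not symmetric: only positive counts survive
print(dict(c2 - c1))  # {'shout': 1}
```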

In [14]:
txt_2 = "There is a lot of activity in the pit. People shout and scream. Tickets are written"

In [15]:
tokens_2 = [w for w in word_tokenize(txt_2.lower()) if w.isalpha()]
tokens_no_stopwords_2 = [t for t in tokens_2 if t not in stopwords.words('english')]

In [16]:
b = Counter(tokens_no_stopwords_2)

In [17]:
c = a & b
c

Counter({'pit': 1})

In [18]:
c = a + b
c

Counter({'trader': 3,
         'pit': 3,
         'opening': 1,
         'markets': 1,
         'walks': 1,
         'everybody': 1,
         'screams': 1,
         'lot': 1,
         'activity': 1,
         'people': 1,
         'shout': 1,
         'scream': 1,
         'tickets': 1,
         'written': 1})

In [19]:
c = a - b
c

Counter({'trader': 3,
         'pit': 1,
         'opening': 1,
         'markets': 1,
         'walks': 1,
         'everybody': 1,
         'screams': 1})