## Bag of words

A common way to capture text features is to count how often particular words appear in a text, ignoring word order.

A 'dictionary' here is a list of words to count (not a Python `dict`).

In this example we will count the positive and the negative words in the business description section.
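
Before loading the real files, the idea can be sketched on a toy sentence. This is a minimal illustration using only the standard library; the sentence, the two word lists and the regex tokenizer are made up for the example and are not the notebook's data or method.

```python
import re
from collections import Counter

# Hypothetical toy inputs, not taken from the notebook's data files
text = "Strong growth offset weak demand, and growth continued."
positive_words = {"strong", "growth"}
negative_words = {"weak", "loss"}

# Lowercase and keep alphabetic runs only (a crude tokenizer for illustration)
tokens = re.findall(r"[a-z]+", text.lower())

# The "bag": how often each word occurs, ignoring word order
bag = Counter(tokens)

# Sum the counts of the words that appear in each word list;
# a Counter returns 0 for words that never occur in the text
n_positive = sum(bag[w] for w in positive_words)
n_negative = sum(bag[w] for w in negative_words)

print(bag.most_common(3))
print("positive:", n_positive, "negative:", n_negative)
```

Repeated words count every time they occur, which is why the result is a count of word occurrences rather than of distinct words.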

In [None]:
# read business section
with open('apple_business_section.txt' , 'r', encoding='utf-8') as myfile:    
    item1 =  myfile.read()     

In [None]:
item1[0:200]

In [None]:
# load dictionaries (one word per line)
# positive words
with open('positive words.txt' , 'r') as myfile:
    # strip() removes the newline and any stray whitespace around each word
    pos = [word.lower().strip() for word in myfile.readlines()]

print(len(pos))
pos[0:10]

In [None]:
# negative words
with open('negative words.txt' , 'r') as myfile:
    # strip() removes the newline and any stray whitespace around each word
    neg = [word.lower().strip() for word in myfile.readlines()]

In [None]:
print( len(neg) )
# first 10
neg[0:10]

### Stemmed

Before counting, you can stem both the text and the dictionaries, or stem neither. What matters is consistency: a stemmed word list will not reliably match unstemmed tokens, and vice versa.
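
The effect of consistency can be seen on a toy example. The sketch below assumes NLTK is installed; the token list and word list are hypothetical and chosen so that the inflected forms only match once both sides are stemmed.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical toy inputs, not taken from the notebook's data files
sample_tokens = ["losses", "declined", "growth"]
sample_neg = ["loss", "decline"]

# Unstemmed tokens against an unstemmed list: inflected forms are missed
neg_set = set(sample_neg)
raw_matches = [w for w in sample_tokens if w in neg_set]

# Stem both sides the same way before matching
neg_stems = {ps.stem(w) for w in sample_neg}
stemmed_matches = [w for w in sample_tokens if ps.stem(w) in neg_stems]

print("unstemmed matches:", raw_matches)
print("stemmed matches:", stemmed_matches)
```

Stemming only one side would be the worst of both options, since stems such as the one produced for 'decline' are often not dictionary words themselves.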

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# several words can share one stem, so use a set to drop duplicates;
# sorting makes the preview below reproducible
neg_stemmed = sorted({ps.stem(w) for w in neg})
print( len(neg_stemmed) )
neg_stemmed[0:10]

## Example

In [None]:
import nltk
import re
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# list of stop words (needs the NLTK stopwords data, downloadable once with nltk.download)
stopWords = set(stopwords.words('english') ) 

# remove numbers, see https://stackoverflow.com/questions/57030670/how-to-remove-punctuation-and-numbers-during-tweettokenizer-step-in-nlp
item1 = re.sub(r'\d+', '', item1)

# tokens excluding stopwords and punctuation
business_tokens = [
    x.lower()
    for x in word_tokenize(item1)
    if x.lower() not in stopWords and x not in string.punctuation
]

print( 'number of words:',  len(business_tokens) )
# first 10
business_tokens[0:10]

In [None]:
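# sets make each membership test fast (a list is scanned word by word)
pos_set, neg_set = set(pos), set(neg)

# tokens that appear in the positive word list
pos_matches = [w for w in business_tokens if w in pos_set]

# tokens that appear in the negative word list
neg_matches = [w for w in business_tokens if w in neg_set]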

In [None]:
neg_matches[0:15]

In [None]:
len(business_tokens)

In [None]:
print('#positive words:', len(pos_matches), pos_matches)

In [None]:
print('#negative words:', len(neg_matches), neg_matches)
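
Raw match lists are hard to read for a long document. One possible way to summarise them, shown here as a sketch rather than as part of the original analysis, is to tally the matches per word and express the totals as a share of all tokens. The three sample lists are hypothetical stand-ins for `business_tokens`, `pos_matches` and `neg_matches`.

```python
from collections import Counter

# Hypothetical stand-ins for the notebook's business_tokens, pos_matches and neg_matches
sample_tokens = ["company", "strong", "growth", "risk", "loss", "growth", "market", "risk"]
sample_pos_matches = ["strong", "growth", "growth"]
sample_neg_matches = ["risk", "loss", "risk"]

# Which dictionary words drive each count
pos_counts = Counter(sample_pos_matches)
neg_counts = Counter(sample_neg_matches)

# Shares of all tokens, so that documents of different length are comparable
pos_share = len(sample_pos_matches) / len(sample_tokens)
neg_share = len(sample_neg_matches) / len(sample_tokens)

print(pos_counts.most_common())
print(neg_counts.most_common())
print(f"positive share: {pos_share:.3f}, negative share: {neg_share:.3f}")
```

Dividing by the number of tokens is an assumption about what a useful comparison looks like; other normalisations (for example by sentence count) are equally possible.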