## Bag of words

A common way to capture text features is to count how often words appear in a text, ignoring the order in which they occur.

A 'dictionary' (not a Python dictionary) is a list of words to count.

In this example we will count the positive and negative words in the business description section.
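
As a minimal sketch of the counting idea, assuming a toy sentence and two tiny word lists made up for illustration (they are not taken from the filing or from the dictionaries loaded below), the core step looks roughly like this:

```python
from collections import Counter

# Hypothetical toy inputs, for illustration only
text = "strong growth offset by a loss and a further loss in services"
positive_words = {"strong", "growth", "gain"}
negative_words = {"loss", "decline"}

# Naive whitespace tokenization; the notebook uses nltk's word_tokenize instead
tokens = text.lower().split()
counts = Counter(tokens)

# A Counter returns 0 for words that never occur, so missing words are safe
positive_count = sum(counts[w] for w in positive_words)
negative_count = sum(counts[w] for w in negative_words)
print(positive_count, negative_count)
```

Note that 'loss' is counted twice because it appears twice: a bag of words keeps frequencies, not just presence.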

In [1]:
# read business section
with open('apple_business_section.txt', 'r', encoding='utf-8') as myfile:
    item1 = myfile.read()

In [2]:
item1[0:200]

'Item 1.    Business\nCompany Background\nThe Company designs, manufactures and markets smartphones, personal computers, tablets, wearables and accessories, and sells a variety of related services. The C'

In [8]:
# load dictionaries
# positive words
with open('positive words.txt', 'r') as myfile:
    pos = [word.lower().strip('\n') for word in myfile.readlines()]

print(len(pos))
pos[0:10]

354


['able',
 'abundance',
 'abundant',
 'acclaimed',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishing',
 'accomplishment',
 'accomplishments']

In [9]:
# negative words
with open('negative words.txt', 'r') as myfile:
    neg = [word.lower().strip('\n') for word in myfile.readlines()]

In [10]:
print( len(neg) )
# first 10
neg[0:10]

2355


['abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 'abandonments',
 'abandons',
 'abdicated',
 'abdicates',
 'abdicating',
 'abdication']

### Stemmed

When counting a bag of words, you can stem both the text and the dictionaries, or leave both unstemmed. What matters is consistency: a stem such as 'penalti' will not match the unstemmed word 'penalty'.

In [15]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# stem every negative word; the set drops duplicate stems
neg_stemmed = list({ps.stem(w) for w in neg})
print( len(neg_stemmed) )
neg_stemmed[0:10]

915


['improprieti',
 'late',
 'conspiraci',
 'confin',
 'penalti',
 'shortfal',
 'sentenc',
 'forfeit',
 'malic',
 'insurrect']
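
To illustrate why consistency matters, here is a small sketch under stated assumptions. `toy_stem` is a hypothetical, deliberately crude suffix stripper used only so the example runs without NLTK; it is not the Porter algorithm. `count_matches` is likewise a helper introduced for illustration. In the notebook you could pass `ps.stem` in place of `toy_stem`.

```python
def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def count_matches(tokens, dictionary, stem=None):
    """Count tokens found in the dictionary, optionally stemming both sides."""
    if stem is not None:
        tokens = [stem(t) for t in tokens]
        dictionary = [stem(w) for w in dictionary]
    lookup = set(dictionary)  # set membership is fast and drops duplicate stems
    return sum(1 for t in tokens if t in lookup)


# Hypothetical toy data
tokens = ["company", "abandoned", "project", "abandoning", "plans"]
dictionary = ["abandoned", "abandoning"]

unstemmed = count_matches(tokens, dictionary)               # both sides raw
stemmed = count_matches(tokens, dictionary, stem=toy_stem)  # both sides stemmed
# Inconsistent: stemmed tokens compared against the unstemmed dictionary
mixed = count_matches([toy_stem(t) for t in tokens], dictionary)
print(unstemmed, stemmed, mixed)
```

The two consistent variants find the same two matches, while the mixed variant finds none, because the stem 'abandon' is not itself an entry in the unstemmed dictionary.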

## Example

In [22]:
import nltk
import re
from nltk import FreqDist
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# list of stop words 
stopWords = set(stopwords.words('english') ) 

# remove numbers, see https://stackoverflow.com/questions/57030670/how-to-remove-punctuation-and-numbers-during-tweettokenizer-step-in-nlp
item1 = re.sub(r'\d+', '', item1)

# lowercase the tokens, then drop stop words and punctuation
business_tokens = [
    x.lower()
    for x in word_tokenize(item1)
    if x.lower() not in stopWords and x not in string.punctuation
]

print( 'number of words:',  len(business_tokens) )
# first 10
business_tokens[0:10]

number of words: 2844


['item',
 'business',
 'company',
 'background',
 'company',
 'designs',
 'manufactures',
 'markets',
 'smartphones',
 'personal']

In [17]:
# positive (a set makes the membership test fast)
pos_set = set(pos)
pos_matches = [w for w in business_tokens if w in pos_set]

# negative
neg_set = set(neg)
neg_matches = [w for w in business_tokens if w in neg_set]

In [18]:
neg_matches[0:15]

['accidental',
 'damage',
 'loss',
 'force',
 'downward',
 'infringing',
 'cut',
 'loss',
 'shortage',
 'declines',
 'harassment',
 'questions',
 'concerns',
 'hazards',
 'crisis']

In [19]:
len(business_tokens)

2844

In [20]:
print('#positive words:', len(pos_matches), pos_matches)

#positive words: 42 ['exclusive', 'improvement', 'advancements', 'successfully', 'innovative', 'innovation', 'strong', 'opportunities', 'collaborate', 'able', 'advances', 'successfully', 'enhance', 'success', 'innovative', 'innovations', 'able', 'opportunity', 'enable', 'success', 'succeed', 'opportunity', 'collaborative', 'succeed', 'encouraged', 'profitability', 'able', 'improvement', 'advancements', 'successfully', 'innovative', 'achieve', 'able', 'successfully', 'effective', 'innovative', 'attractive', 'advantage', 'collaborate', 'improve', 'advantages', 'able']
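
The list of matches contains repeats, so it can help to tally how often each word occurs. A minimal sketch, assuming a short hypothetical sample rather than the full list printed above, using `collections.Counter` from the standard library (nltk's `FreqDist`, imported earlier, is a `Counter` subclass and could be used the same way):

```python
from collections import Counter

# Hypothetical sample of matched words, not the full list from the notebook
sample_matches = ["able", "innovative", "able", "success", "innovative", "able"]

match_counts = Counter(sample_matches)
# The two most frequent matched words with their counts
top_two = match_counts.most_common(2)
print(top_two)
```

In the notebook, the same call on `pos_matches` or `neg_matches` would show which dictionary words drive the totals.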


In [26]:
print('#negative words:', len(neg_matches), neg_matches)
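
Raw counts depend on the length of the text, so one common way to make them comparable across documents is to divide by the number of tokens. The sketch below assumes a hypothetical helper `share_of_tokens` and plugs in only the figures printed earlier in this section (42 positive matches out of 2844 tokens); the negative count is not shown here, so it is left out.

```python
def share_of_tokens(match_count, token_count):
    """Return matches as a fraction of all tokens (0.0 if there are no tokens)."""
    if token_count == 0:
        return 0.0
    return match_count / token_count


# Figures printed earlier in this section: 42 positive matches, 2844 tokens
positive_share = share_of_tokens(42, 2844)
print(f"positive share: {positive_share:.2%}")
```

In the notebook, the same helper could be called with `len(neg_matches)` and `len(business_tokens)` to get the negative share.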

