# Prep files for sentiment analysis with Deep Learning
#### The data comes from Reddit comments that are already split into a positive file and a negative file, each containing a large number of sentences.
NLTK tokenization and lemmatization are used to extract the features. Let's go through it step by step. The code is split into small functions so that it is easier to read and debug.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
import random
import pickle
from collections import Counter

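We reduce each word to its base form using WordNetLemmatizer (strictly speaking, this is lemmatization rather than stemming). This is the basic usage; we could also pass part-of-speech tags to the lemmatizer, which would be an interesting extension. I don't know in advance how many lines the files contain, so I'll set an upper limit (say 10000000).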

In [2]:
lemmatizer = WordNetLemmatizer()
hm_lines = 10000000

To build the lexicon, we filter words by how often they occur in the data. Let's say a word is kept if it occurs more than 50 and fewer than 1000 times. We can tweak these bounds at any time.

In [10]:
def create_lexicon(pos,neg):
	print("Creating lexicon...")
	lexicon = []
	# Collect every token from both files (at most hm_lines lines per file).
	for fi in [pos,neg]:
		fpath = "/Users/tejasreddy9/Documents/PyNotebooks/Reddit_data/"+fi
		with open(fpath,'r') as f:
			print("In file",fi,"...")
			contents = f.readlines()
			for l in contents[:hm_lines]:
				all_words = word_tokenize(l.lower())
				lexicon.extend(all_words)
			print("Done tokeninzing. Closed",fi,".")
	# Reduce each token to its lemma so inflected forms are counted together.
	lexicon = [lemmatizer.lemmatize(i) for i in lexicon]
	print("Extracted lexicon.")
	w_counts = Counter(lexicon)
	print("Counting frequencies of lexicon.")

	# Keep only words that occur more than 50 and fewer than 1000 times.
	l2 = [w for w in w_counts if 1000 > w_counts[w] > 50]
	print("Permissible frequencies: 50 to 1000")
	print("Total number of words in lexicon:",len(l2))
	print("-"*80)
	return l2

Let's test the function we just built. Here is a glimpse of what it computes.

In [12]:
lex = create_lexicon("pos.txt","neg.txt")
print(len(lex))
print(lex[:10])

Creating lexicon...
In file pos.txt ...
Done tokeninzing. Closed pos.txt .
In file neg.txt ...
Done tokeninzing. Closed neg.txt .
Extracted lexicon.
Counting frequencies of lexicon.
Permissible frequencies: 50 to 1000
Total number of words in lexicon: 423
--------------------------------------------------------------------------------
423
['be', 'new', '``', 'he', 'going', 'make', 'even', 'than', 'or', 'so']
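
The frequency filter is easier to see on a toy input. The following is a minimal sketch, not the notebook's own code: `filter_by_frequency` is a hypothetical helper, and the thresholds are shrunk to 1 and 5 so that the small made-up token list produces a visible result.

```python
from collections import Counter

def filter_by_frequency(tokens, low, high):
    """Keep words whose count is strictly between low and high."""
    counts = Counter(tokens)
    return [w for w, c in counts.items() if low < c < high]

# Made-up tokens: "the" is too common, "zyzzyva" is too rare.
tokens = ["the"] * 6 + ["good"] * 3 + ["bad"] * 3 + ["zyzzyva"]
lexicon = filter_by_frequency(tokens, low=1, high=5)
print(lexicon)  # ['good', 'bad']
```

Very frequent words carry little sentiment information, and very rare words are unlikely to recur in new sentences, which is the motivation for cutting at both ends.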


In the sample handling below, we attach the classification to each sample as a one-hot vector, i.e., [1, 0] for positive sentiment and [0, 1] for negative sentiment.

For the features, we keep one feature vector per sentence, with the same length as the lexicon. We scan the sentence, and whenever a word is in the lexicon, we increment the component of the vector at that word's index. This is how the featureset is built.

In [16]:
def sample_handling(sample, lexicon, classification):
	print("Sample Handling...")
	featureset = []
	fpath = "/Users/tejasreddy9/Documents/PyNotebooks/Reddit_data/"+sample
	with open(fpath,'r') as f:
		print("In file",sample,"...")
		print("Making featureset. Each feature in the format (featurevalues|classification) .......|.. ")
		contents = f.readlines()

		for l in contents[:hm_lines]:
			current_words = word_tokenize(l.lower())
			current_words = [lemmatizer.lemmatize(i) for i in current_words]
			features = np.zeros(len(lexicon))
			for word in current_words:
				if word.lower() in lexicon:
					i = lexicon.index(word.lower())
					features[i] += 1
			features = list(features)
			featureset.append([features, classification]) 
		print("Done. Closed",sample,".")

	print("Sampled features(n):",len(featureset))
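	# Each row pairs a long feature list with a 2-item label, so recent NumPy versions need dtype=object here.
	print("Shape(n, total classes):",np.array(featureset, dtype=object).shape)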
	print("-"*80)
	return featureset
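
To see what a single row of the featureset looks like, here is a small sketch with a made-up three-word lexicon. It is a variation rather than the function above: `vectorize` and `word_to_index` are names introduced for illustration, and the dictionary lookup is my assumption for avoiding a scan of the whole lexicon list for every word.

```python
def vectorize(words, word_to_index):
    """Count how often each lexicon word occurs in `words`."""
    features = [0] * len(word_to_index)
    for word in words:
        idx = word_to_index.get(word)  # None if the word is not in the lexicon
        if idx is not None:
            features[idx] += 1
    return features

toy_lexicon = ["good", "bad", "movie"]
word_to_index = {w: i for i, w in enumerate(toy_lexicon)}

sentence = "good movie , really good".split()
row = [vectorize(sentence, word_to_index), [1, 0]]  # features + one-hot label
print(row)  # [[2, 0, 1], [1, 0]]
```

Words outside the lexicon ("," and "really" here) are simply ignored, so every sentence maps to a vector of the same length.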

Let's test this function to check that it's working as intended.

In [17]:
x = sample_handling("pos.txt", lex, [1,0])
print(len(x))
print(np.array(x, dtype=object).shape)
print(x[:10])

Sample Handling...
In file pos.txt ...
Making featureset. Each feature in the format (featurevalues|classification) .......|.. 
Done. Closed pos.txt .
Sampled features(n): 5331
Shape(n, total classes): (5331, 2)
--------------------------------------------------------------------------------
5331
(5331, 2)
[[[1.0, 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,

Let's define our labels, then shuffle all the samples and split the data into training and validation sets.

Shuffling is essential here. Without it, all positive samples would come before all negative ones, and a network trained on the data in that order would see long runs of a single class. Its weights would keep swinging toward whichever class it saw most recently, which tends to make training unstable.

In [19]:
def create_features_set_and_labels(pos, neg, test_size=0.1):
	lexicon = create_lexicon(pos, neg)
	features = []
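	# Use the file names passed in rather than hard-coded ones.
	features += sample_handling(pos,lexicon,[1,0])
	features += sample_handling(neg,lexicon,[0,1])
	random.shuffle(features)
	# Rows pair a long feature list with a 2-item label, so an object array is needed.
	features = np.array(features, dtype=object)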
	print("Setting Labels.")
	testing_size = int(test_size * len(features))

	print("Shape of final features combined.",features.shape)
	print("Total size:",len(features))
	print("Testing size:",testing_size)
	train_x = list(features[:,0][:-testing_size])
	train_y = list(features[:,1][:-testing_size])
	test_x = list(features[:,0][-testing_size:])
	test_y = list(features[:,1][-testing_size:])

	print("train_x -",len(train_x))
	print("train_y -",len(train_y))
	print("test_x -",len(test_x))
	print("test_y -",len(test_y))
	print("-"*80)
	return train_x,train_y,test_x,test_y
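
An alternative way to do the shuffle and split, sketched here under my own assumptions, is to keep features and labels in two separate regular arrays and apply the same random permutation to both. `shuffle_and_split` and its `seed` parameter are hypothetical and not part of the notebook; the toy data stands in for the real featureset.

```python
import numpy as np

def shuffle_and_split(x, y, test_size=0.1, seed=0):
    """Shuffle x and y with one shared permutation, then split off a test set."""
    x = np.asarray(x, dtype=np.float32)
    y = np.asarray(y, dtype=np.float32)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))  # same order for features and labels
    x, y = x[order], y[order]
    n_test = int(test_size * len(x))
    split = len(x) - n_test  # explicit index, so n_test == 0 is handled
    return x[:split], y[:split], x[split:], y[split:]

# Toy data: 10 "positive" rows followed by 10 "negative" rows.
x = np.arange(40).reshape(20, 2)
y = np.array([[1, 0]] * 10 + [[0, 1]] * 10)
train_x, train_y, test_x, test_y = shuffle_and_split(x, y, test_size=0.1)
print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)
# (18, 2) (18, 2) (2, 2) (2, 2)
```

Splitting at an explicit index also sidesteps a corner case of negative slicing: `a[:-0]` is an empty slice, so a test size that rounds down to zero would otherwise leave the training set empty.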

Now we're all set: we have our building blocks. Let's invoke this function on the input data from the Reddit comments.

In [20]:
if __name__ == '__main__':
    train_x,train_y,test_x,test_y = create_features_set_and_labels('pos.txt','neg.txt')
    with open('sentiment_set.pickle','wb') as f:
        print("Started pickling..")
        pickle.dump([train_x,train_y,test_x,test_y],f)
        print("Saved and closed pickle.")

Creating lexicon...
In file pos.txt ...
Done tokeninzing. Closed pos.txt .
In file neg.txt ...
Done tokeninzing. Closed neg.txt .
Extracted lexicon.
Counting frequencies of lexicon.
Permissible frequencies: 50 to 1000
Total number of words in lexicon: 423
--------------------------------------------------------------------------------
Sample Handling...
In file pos.txt ...
Making featureset. Each feature in the format (featurevalues|classification) .......|.. 
Done. Closed pos.txt .
Sampled features(n): 5331
Shape(n, total classes): (5331, 2)
--------------------------------------------------------------------------------
Sample Handling...
In file neg.txt ...
Making featureset. Each feature in the format (featurevalues|classification) .......|.. 
Done. Closed neg.txt .
Sampled features(n): 5331
Shape(n, total classes): (5331, 2)
--------------------------------------------------------------------------------
Setting Labels.
Shape of final features combined. (10662, 2)
Total size: 1066

We have now saved our training and validation datasets to a pickle file. It's not surprising that the file is about 130MB: every sentence is stored as a dense feature vector with one entry per lexicon word, so even rows that are mostly zeros take up space.
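
One likely contributor to the file size is that each count ends up pickled as its own scalar object inside a Python list. The sketch below compares that layout with a single compact array on a small all-zero matrix. The 100-row size and the `uint8` dtype are assumptions for illustration, and `uint8` only works if no count exceeds 255.

```python
import pickle
import numpy as np

n_rows, n_features = 100, 423
counts = np.zeros((n_rows, n_features))

# Layout similar to the notebook: a list of lists of NumPy float scalars.
as_lists = [list(row) for row in counts]
# Compact alternative: one array with a one-byte integer per count.
as_array = counts.astype(np.uint8)

size_lists = len(pickle.dumps(as_lists))
size_array = len(pickle.dumps(as_array))
print("list of lists:", size_lists, "bytes")
print("uint8 array:  ", size_array, "bytes")

# The array survives a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(as_array))
print(np.array_equal(restored, counts))
```

The exact byte counts depend on the Python and NumPy versions, so none are quoted here, but the array form should come out considerably smaller.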