# Sentiment 140 Dataset
## Data Prep
#### Let's see what happens if we increase the dataset that we have. Let's use this 1.4Million sentences dataset and build a model with it.
#### You can download the dataset from [Sentiment140](http://help.sentiment140.com/for-students). These are the tweet data they collected.
#### THIS IS TOO SLOW ON CPU, I'LL PUT INSTRUCTIONS FOR RUNNING ON GPU

We'll run all these cells once, so that data is preprocessed and can be used for training the actual model.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import pickle
import numpy as np
import pandas as pd

In [2]:
# WordNetLemmatizer from NLTK package, used to extract root words from the sentences.
lemmatizer = WordNetLemmatizer()

I've done the basic preprocessing.

1. Given data is comma-seperated quoted strings.
2. Data has no column names on the head.
3. I figured out that, in each entry, first string related to the sentiment attached to that entry. It's value is in `0,2,4` indicating `negative, neutral, positive` respectively.
4. There are other strings for each entry describing the user, timestamp, etc. 
5. Also I figured, in each entry, last string is the tweet itself on which the sentiment was determined.
6. I've only taken two classes `0,4` for simplicity. Negative sentiment and positive sentiment. Assigned the one-hot classification vectors `[1,0]` representing negative sentiment and `[0,1]` representing positive sentiment.
7. Then, I created a preprocessed file contianing the classification detail and the tweet itself seperated by triple colon (`:::`).

Data is too large, so I decided to use buffering.

In [8]:
def init_process(fin,fout):
	fin_path = "/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/"+fin
	fout_path = "/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/"+fout
	outfile = open(fout_path,'a')
	with open(fin_path, buffering=200000, encoding='latin-1') as f:
		try:
			for line in f:
				line = line.replace('"','')
				initial_polarity = line.split(',')[0]
				if initial_polarity == '0':
					initial_polarity = [1,0]
				elif initial_polarity == '4':
					initial_polarity = [0,1]

				tweet = line.split(',')[-1]
				outline = str(initial_polarity)+':::'+tweet
				outfile.write(outline)
		except Exception as e:
			print(str(e))
		print("Closed",fin)
	outfile.close()
	print("Closed",fout)

Run the below preprocessing only once. I'm commenting it out for now. As I'm using append, it will append to the bottom each time you run and causing duplicated data. Buffering is in 200,000 bytes.

In [9]:
# init_process('training.1600000.processed.noemoticon.csv','train_set.csv')
# init_process('testdata.manual.2009.06.14.csv','test_set.csv')

Closed training.1600000.processed.noemoticon.csv
Closed train_set.csv
Closed testdata.manual.2009.06.14.csv
Closed test_set.csv


Now, we'll create lexicon, and store it in pickle so that we keep track of code reusability. We'll load in the pickle whenever needed, it holds the lexicon for us. Run this when we have our `train_set.csv` is obtained from above.

In [13]:
def create_lexicon(fin):
	lexicon = []
	fin_path = "/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/"+fin
	with open(fin_path, 'r', buffering=100000, encoding='latin-1') as f:
		try:
			counter = 1
			content = ''
			for line in f:
				counter += 1
				if (counter/2500.0).is_integer():
					tweet = line.split(':::')[1]
					content += ' '+tweet
					words = word_tokenize(content)
					words = [lemmatizer.lemmatize(i) for i in words]
					lexicon = list(set(lexicon + words))
					print(counter, len(lexicon))

		except Exception as e:
			print(str(e))
	
	print(len(lexicon))
	with open('/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/lexicon-2500-2638.pickle','wb') as f:
		pickle.dump(lexicon,f)

Similar to the preprocessing cell, we do run this only once to obtain our lexicon pickled. I'll comment this for now. Buffering is in 100,000 bytes.

In [15]:
# create_lexicon('train_set.csv')

2500 26
5000 32
7500 41
10000 63
12500 85
15000 89
17500 91
20000 108
22500 113
25000 120
27500 128
30000 141
32500 159
35000 173
37500 177
40000 185
42500 198
45000 212
47500 224
50000 240
52500 253
55000 266
57500 267
60000 269
62500 280
65000 284
67500 291
70000 299
72500 306
75000 321
77500 329
80000 336
82500 345
85000 356
87500 358
90000 361
92500 374
95000 388
97500 399
100000 402
102500 412
105000 418
107500 425
110000 429
112500 438
115000 448
117500 450
120000 453
122500 460
125000 465
127500 466
130000 468
132500 470
135000 480
137500 490
140000 496
142500 500
145000 507
147500 514
150000 528
152500 532
155000 536
157500 540
160000 548
162500 556
165000 560
167500 562
170000 566
172500 571
175000 580
177500 583
180000 588
182500 594
185000 597
187500 600
190000 607
192500 611
195000 621
197500 624
200000 628
202500 630
205000 632
207500 641
210000 644
212500 648
215000 651
217500 659
220000 674
222500 684
225000 689
227500 694
230000 698
232500 700
235000 706
237500 714
2400

We got a large lexicon-set having `2638` unique lexicon determined by NLTK. This indicates that we need our neural network's initial layer to be having `2638` nodes on as input.

We need to convert our training set into feature vectors, so that we can train our models. 

As we have observed that for nearly 3MB data we used in the first attempt, vectors produced are of complete size 140MB. But, the data we have is 113MB intially, this could scale upto 20GB. 

DO NOT RUN THIS ON CPU. WE WILL DO THIS INLINE.

In [16]:
def convert_to_vec(fin,fout,lexicon_pickle):
	fin_path = "/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/"+fin
	fout_path = "/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/"+fout
	lexicon_path = "/Users/tejasreddy9/Documents/PyNotebooks/Sentiment140/"+lexicon_pickle
	with open(lexicon_path,'rb') as f:
		lexicon = pickle.load(f)
	outfile = open(fout_path,'a')
	with open(fin_path, buffering=20000, encoding='latin-1') as f:
		counter = 0
		for line in f:
			counter +=1
			label = line.split(':::')[0]
			tweet = line.split(':::')[1]
			current_words = word_tokenize(tweet.lower())
			current_words = [lemmatizer.lemmatize(i) for i in current_words]

			features = np.zeros(len(lexicon))

			for word in current_words:
				if word.lower() in lexicon:
					index_value = lexicon.index(word.lower())
					# OR DO +=1, test both
					features[index_value] += 1

			features = list(features)
			outline = str(features)+'::'+str(label)+'\n'
			outfile.write(outline)

		print(counter)

Shuffle the training data, as per the norms of neutral networks. Otherwise, it tries to do the unnecessary re-adjustment makes the weigths so large layer by layer, and at some threshold point, subsequent layers will have negligible contribution. Hence, shuffling is mandatory.

In [17]:
def shuffle_data(fin):
	df = pd.read_csv(fin, error_bad_lines=False)
	df = df.iloc[np.random.permutation(len(df))]
	print(df.head())
	df.to_csv('train_set_shuffled.csv', index=False)

Run this also only once, we just meant to shuffle it once before we go.

In [None]:
shuffle_data('train_set.csv')

For, testing dataset also, we'll create a pickle so that we can load it whenever needed.

In [None]:
def create_test_data_pickle(fin):

	feature_sets = []
	labels = []
	counter = 0
	with open(fin, buffering=20000) as f:
		for line in f:
			try:
				features = list(eval(line.split('::')[0]))
				label = list(eval(line.split('::')[1]))

				feature_sets.append(features)
				labels.append(label)
				counter += 1
			except:
				pass
	print(counter)
	feature_sets = np.array(feature_sets)
	labels = np.array(labels)



This also run only once. 

In [None]:
create_test_data_pickle('processed-test-set.csv')