In this notebook we extract and pre-process all the data we need to train our neural network.
We have the following data sets:
- Trump tweets from before he announced he was running (http://trumptwitterarchive.com)
    - 25967 tweets
- A collection of tweets made by regular users in 2010 (https://archive.org/details/twitter_cikm_2010)
    - 25967 tweets
- Barack Obama's tweet history until March 2017 (https://community.periscopedata.com/t/x1fy7p/barack-obamas-tweet-history)
    - 6735 tweets
- Tweets made by Democrat politicians as of May 2018 (https://www.kaggle.com/kapastor/democratvsrepublicantweets)
    - 42068 tweets
 
The original data set contained no politicians besides Trump, which caused a lot of false positives on tweets that mentioned political issues or America. That's why the Obama and Democrat tweets were added.

# Extracting data from Trump tweets

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import json
import csv
import re

In [2]:
trump_data = []
# Load the condensed archive for every year from 2009 through 2015
for year in range(2009, 2016):
    with open('data/condensed_{}.json'.format(year)) as json_file:
        json_data = json.load(json_file)
    for tweet in json_data:
        trump_data.append(tweet["text"])

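Now we clean up the data. We replace links with a "LINK" token and Twitter mentions with a "MENT" token. Note that the link pattern matches from the start of the URL through the end of the line, so any text that follows a link on the same line is replaced along with it.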

In [3]:
for index,tweet in enumerate(trump_data):
    trump_data[index] = re.sub(r'https?:\/\/.*[\r\n]*', 'LINK', tweet, flags=re.MULTILINE)

In [4]:
for index,tweet in enumerate(trump_data):
    trump_data[index] = re.sub(r'@[^\s]+', 'MENT', tweet, flags=re.MULTILINE)
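
The two substitutions above can also be wrapped in a single helper, which makes their behaviour easy to check on one string. This is only a sketch: `clean_tweet` and the sample tweet are made up for illustration, while the two patterns are copied from the cells above.

```python
import re

# Patterns copied from the cells above
LINK_RE = re.compile(r'https?:\/\/.*[\r\n]*', flags=re.MULTILINE)
MENT_RE = re.compile(r'@[^\s]+', flags=re.MULTILINE)

def clean_tweet(text):
    """Hypothetical helper: replace links first, then mentions."""
    text = LINK_RE.sub('LINK', text)
    return MENT_RE.sub('MENT', text)

# Made-up sample tweet
sample = "Thanks @someuser! Read more at http://example.com/post today"
cleaned = clean_tweet(sample)
print(cleaned)  # Thanks MENT Read more at LINK
```

The output shows two side effects of these patterns: the word after the link is swallowed by `.*`, and the "!" attached to the mention is swallowed by `[^\s]+`.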

# Non-Trump data

We now extract and process all the negative data points (non-Trump)

In [5]:
#Democrat Tweets
#Due to the large size of this file, you will need to download it from the Kaggle link above and add it to the /data subdirectory
i=0
non_trump_data = []
with open('data/ExtractedTweets.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        if i == 0:
            i+=1
        else:
            if row[0] == "Democrat":
                non_trump_data.append(row[2])
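
The header-skipping and filtering logic can be tried without the Kaggle file by reading from an in-memory string. This is a minimal sketch, not the notebook's code: `tweets_for_party` is a hypothetical helper, and the sample rows and header names are invented. The only thing taken from the cell above is that the party is in column 0 and the tweet text in column 2.

```python
import csv
import io

def tweets_for_party(csvfile, party):
    """Return the third column of every row whose first column equals `party`."""
    reader = csv.reader(csvfile, delimiter=',')
    next(reader, None)  # skip the header row (does nothing on an empty file)
    return [row[2] for row in reader if len(row) >= 3 and row[0] == party]

# Made-up sample data; the header names are an assumption
sample_csv = io.StringIO(
    'Party,Handle,Tweet\n'
    'Democrat,RepExampleA,"Proud to vote, again, for this bill"\n'
    'Republican,RepExampleB,Tax cuts are working\n'
    'Democrat,RepExampleC,Town hall tonight\n'
)
democrat_tweets = tweets_for_party(sample_csv, "Democrat")
print(democrat_tweets)
```

Using `csv.reader` rather than splitting on commas matters here, because tweet text often contains commas inside quoted fields.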

In [6]:
#Obama Tweets
i=0
with open('data/obama_tweets.csv') as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    for row in readCSV:
        if i == 0:
            i+=1
        else:
            non_trump_data.append(row[0])

In [7]:
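#Regular User tweets
#Take the first tweet of each distinct user until we have as many as there are Trump tweets
i = 0
user_dict = {}  #users we have already sampled a tweet from
with open("data/training_set_tweets.txt") as f:
    while i<len(trump_data):
        line = next(f)
        split_line = line.split('\t')
        #skip repeat users and malformed lines (fewer than 3 tab-separated fields)
        if split_line[0] not in user_dict and len(split_line) >= 3:
            user_dict[split_line[0]] = 1
            non_trump_data.append(split_line[2])
            i+=1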
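
The sampling rule used above (one tweet per distinct user, up to a fixed limit) can be sketched on a few in-memory lines. `first_tweet_per_user` is a hypothetical helper and the sample lines are invented. Unlike the cell above, this version strips the trailing newline and simply stops when the input runs out instead of calling `next(f)` past the end of the file.

```python
import io

def first_tweet_per_user(lines, limit):
    """Collect field 2 of the first well-formed line seen for each user (field 0)."""
    seen_users = set()
    tweets = []
    for line in lines:
        if len(tweets) >= limit:
            break
        fields = line.rstrip('\n').split('\t')
        # Skip malformed lines and users we have already sampled
        if len(fields) >= 3 and fields[0] not in seen_users:
            seen_users.add(fields[0])
            tweets.append(fields[2])
    return tweets

# Made-up sample in the assumed layout: user id, tweet id, text
fake_file = io.StringIO(
    "u1\t100\tfirst tweet from u1\n"
    "u1\t101\tsecond tweet from u1\n"
    "malformed line\n"
    "u2\t102\tfirst tweet from u2\n"
)
sampled = first_tweet_per_user(fake_file, limit=5)
print(sampled)
```

Taking one tweet per user keeps a single prolific account from dominating the negative examples.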

We pre-process them the same way we did the Trump tweets.

In [8]:
for index,tweet in enumerate(non_trump_data):
    non_trump_data[index] = re.sub(r'https?:\/\/.*[\r\n]*', 'LINK', tweet, flags=re.MULTILINE)

In [9]:
for index,tweet in enumerate(non_trump_data):
    non_trump_data[index] = re.sub(r'@[^\s]+', 'MENT', tweet, flags=re.MULTILINE)

In [10]:
#Example
non_trump_data[0]

'Today, Senate Dems vote to #SaveTheInternet. Proud to support similar #NetNeutrality legislation here in the House… LINK'

We concatenate the two lists to create our training set.

In [11]:
training_data = trump_data + non_trump_data

And we see that we have a total of ~100k tweets

In [12]:
len(training_data)

100736

# Converting the data to word count vectors

We will tokenize our data set and build a vocabulary of known words using CountVectorizer. We will also save this vectorizer so that we can apply it to new inputs when we want to classify them with our model.
We cap the vocabulary at 10k features (words), keeping only the most frequent ones. We are not interested in words that appear rarely.

In [13]:
vectorizer = CountVectorizer(max_features=10000)
vectorizer.fit(training_data)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=10000, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)
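
To make the effect of `max_features` concrete, here is a pure-Python sketch of roughly what the fit step does: lowercase, tokenize with the same `token_pattern` shown in the output above, keep the most frequent words, and number them alphabetically. `build_vocabulary` and the toy corpus are invented for illustration. This is not scikit-learn's implementation, and ties at the cut-off may be resolved differently there.

```python
import re
from collections import Counter

# Same token pattern as in the CountVectorizer output above: words of 2+ characters
TOKEN_RE = re.compile(r'(?u)\b\w\w+\b')

def build_vocabulary(documents, max_features):
    """Hypothetical helper: map the most frequent tokens to alphabetical indices."""
    counts = Counter()
    for doc in documents:
        counts.update(TOKEN_RE.findall(doc.lower()))
    kept = [word for word, _ in counts.most_common(max_features)]
    return {word: index for index, word in enumerate(sorted(kept))}

toy_corpus = ["Make America great again", "America is great", "Great day"]
toy_vocabulary = build_vocabulary(toy_corpus, max_features=2)
print(toy_vocabulary)  # {'america': 0, 'great': 1}
```

Only "great" (3 occurrences) and "america" (2 occurrences) survive the cap. Every other word is dropped from the vocabulary and ignored when encoding.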

Let's take a look at the vocabulary we have extracted. Every word is assigned a unique integer identifier. For example, "trump" has been assigned the number 9126.

In [14]:
#Large output. Uncomment if you want to take a look
#print(vectorizer.vocabulary_)

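We now encode our training data using this vocabulary. The result is a sparse count vector with one column per vocabulary word (a bag-of-words encoding, referred to here as one hot encoding). We will convert it later, as we prefer label based encoding (a list of vocabulary indices) as the input for our neural network (fewer inputs).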

In [15]:
vector_data = vectorizer.transform(training_data)

In [16]:
reverse_word_index = {value: key for key, value in vectorizer.vocabulary_.items()}

We also need to create the labels for our data set. 1 for a Trump tweet. 0 for not Trump.

In [17]:
data_labels = [1 for _ in range(len(training_data))]

In [18]:
for i in range(len(trump_data), len(training_data)):
    data_labels[i] = 0
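
Because the Trump tweets come first in `training_data`, the same labels can also be built in one expression. This is a small sketch with a hypothetical helper name and toy sizes:

```python
def make_labels(n_positive, n_negative):
    """Hypothetical helper: 1 for each positive example, then 0 for each negative one."""
    return [1] * n_positive + [0] * n_negative

toy_labels = make_labels(2, 3)
print(toy_labels)  # [1, 1, 0, 0, 0]
```

In the notebook the two arguments would correspond to `len(trump_data)` and `len(non_trump_data)`.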

Example of one particular tweet, and how the encoded version looks (it's a 10k long array)

In [19]:
training_data[25966]

'"MENT MENT Thanks Donald. Now run for president! Fulfill your purpose! "To much is given, much is required"'

In [20]:
vector_data[25966].toarray()

array([[0, 0, 0, ..., 0, 0, 0]])

And how it looks after we convert it to label based encoding. Every number corresponds to one word in the vocabulary we've seen before. We can see the number "2818", which corresponds to "donald".

In [21]:
test = [sparse_row.indices for sparse_row in vector_data[25966]]
test

[array([2818, 3669, 3781, 3915, 4834, 5674, 5924, 6174, 6925, 7152, 7502,
        7744, 8914, 9035, 9973], dtype=int32)]
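
The relationship between the two encodings, and the role of a reverse word index, can be sketched with plain Python lists. The vocabulary and count vector below are toy values, not taken from the fitted vectorizer.

```python
# Toy vocabulary (word -> column index) and a count vector over its 5 columns
vocabulary = {"donald": 0, "for": 1, "president": 2, "run": 3, "thanks": 4}
count_vector = [1, 0, 1, 1, 0]

# Label based encoding: the indices of the columns with a non-zero count
indices = [i for i, count in enumerate(count_vector) if count > 0]

# Reverse index (column index -> word) to decode the indices back to words
reverse_index = {index: word for word, index in vocabulary.items()}
words = [reverse_index[i] for i in indices]

print(indices)  # [0, 2, 3]
print(words)    # ['donald', 'president', 'run']
```

The indices list only the words that are present, so it is much shorter than the full vector. It keeps neither the original word order nor how often each word occurred.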

So let's convert our one hot encoding to label based encoding

In [22]:
label_data = [None for _ in range(len(training_data))]

In [23]:
for index,one_hot in enumerate(vector_data):
    # one_hot is the single-row sparse matrix for this tweet; wrap its indices in a list as before
    label_data[index] = [one_hot.indices]

And finally we'll pickle the data set, labels, and vectorizer so we can easily import them in our model

In [24]:
import pickle

In [25]:
filename = 'tweet_data'
outfile = open(filename,'wb')

In [26]:
pickle.dump(label_data,outfile)
outfile.close()

In [27]:
filename = 'data_labels'
outfile = open(filename,'wb')

In [28]:
pickle.dump(data_labels,outfile)
outfile.close()

In [29]:
import pickle
filename = "vectorizer"
outfile = open(filename, 'wb')

In [30]:
pickle.dump(vectorizer,outfile)
outfile.close()
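
As a possible alternative to the separate open/dump/close cells, the files can be written with `with` blocks, which close the file even if `pickle.dump` raises. This is a sketch only: `save_pickle` and `load_pickle` are hypothetical helpers, and the round trip runs in a temporary directory on toy data.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Hypothetical helper: pickle `obj` to `path`, closing the file afterwards."""
    with open(path, 'wb') as outfile:
        pickle.dump(obj, outfile)

def load_pickle(path):
    """Hypothetical helper: load a pickled object back from `path`."""
    with open(path, 'rb') as infile:
        return pickle.load(infile)

example_labels = [1, 1, 0]
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'data_labels')
    save_pickle(example_labels, path)
    restored = load_pickle(path)

print(restored == example_labels)  # True
```

The same `load_pickle` pattern is what the model side would use to read `tweet_data`, `data_labels` and `vectorizer` back in.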