# Tweet Prediction - Using NLP to Classify Political Leanings of Tweets

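Inspired by Assignment \#5, we train a linear Support Vector Classifier to classify the political leanings of Russian bot Tweets based on their content. In practice this is scikit-learn's SGDClassifier, whose default hinge loss yields a linear SVM trained with stochastic gradient descent. The classifier is trained specifically to "guess" whether a Tweet is left-leaning or right-leaning. The training labels come from the account_type assigned to the account that posted each Tweet.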

## Setup

We first install dependencies and import modules used to train our NLP model.

In [1]:
! pip install --user nltk



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import itertools
import collections
import numpy as np

In [2]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report, precision_recall_fscore_support

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Loading and Cleaning Data

Load our Tweets from a CSV file into a DataFrame, and then clean them:

1. We only want to focus on English Tweets in the United States region.
2. All content should be lower-cased, since the sample data uses capitalization somewhat randomly.
3. Finally, because we are only interested in Left/Right Tweets, filter out any Tweets whose account_type is not "Left" or "Right".

In [4]:
tweets_df = pd.read_csv('./data/IRAhandle_tweets_1.csv')

In [5]:
# .copy() makes each filtered result an independent DataFrame, so the column
# assignments that follow do not raise pandas' SettingWithCopyWarning
english_tweets = tweets_df[(tweets_df['language']=='English') & (tweets_df['region']=="United States")].copy()
english_tweets['content'] = english_tweets['content'].str.lower()
english_tweets = english_tweets[(english_tweets['account_type'] == 'Right') | (english_tweets['account_type'] == 'Left')].copy()


In [6]:
english_tweets['account_type'].unique()

array(['Right', 'Left'], dtype=object)

In [7]:
# Quick sanity check that only "Left" and "Right" account types remain after cleaning
assert(sorted(set(english_tweets['account_type'])) == ["Left", "Right"])
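
The three cleaning steps can be tried out on a small stand-in frame. This is a minimal sketch, not the notebook's actual data: only the column names (`content`, `language`, `region`, `account_type`) come from the CSV used above, while the rows and the `"Other"` account type are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the real CSV; the rows are invented for illustration.
toy = pd.DataFrame({
    "content": ["Rally TONIGHT", "March On SATURDAY", "Bonjour", "Weather Update"],
    "language": ["English", "English", "French", "English"],
    "region": ["United States"] * 4,
    "account_type": ["Right", "Left", "Right", "Other"],  # "Other" is hypothetical
})

# Steps 1 and 3: keep English, United States, Left/Right rows only.
mask = (
    (toy["language"] == "English")
    & (toy["region"] == "United States")
    & toy["account_type"].isin(["Left", "Right"])
)

# .copy() returns an independent frame, so the assignment below
# does not trigger SettingWithCopyWarning.
cleaned = toy[mask].copy()

# Step 2: lower-case the Tweet text.
cleaned["content"] = cleaned["content"].str.lower()

print(cleaned[["content", "account_type"]])
```

Building a single boolean mask and copying once keeps the filter logic in one place, which may be easier to audit than chaining several filtered frames.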

### Label Matrix Representations

The classifier expects numeric labels, so we need to assign a value to each of our groups.

We assign 0.0 to "Right" and 1.0 to "Left" account_types.

In [8]:
def convert_label(label):
    if label == "Right": return 0.0
    elif label == "Left": return 1.0
    else: return label

We then apply this function to the account_type column of our english_tweets dataframe.

In [9]:
english_tweets['y'] = english_tweets['account_type'].apply(convert_label)

In [10]:
# Once again, make sure we're doing things right - this is useful for quick validations
assert(sorted(set(english_tweets['y'])) == [0., 1.])
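
As an aside, the same label encoding can be written with a dictionary and `Series.map`. The sketch below uses a small made-up Series rather than the real `english_tweets` frame, and the `LABELS` name is introduced here only for illustration. One behavioural difference from `convert_label`: `map` turns any unmapped value into `NaN` instead of passing it through unchanged, so a `notna()` check can catch unexpected categories.

```python
import pandas as pd

# Hypothetical mapping name; the values mirror the notebook's convention.
LABELS = {"Right": 0.0, "Left": 1.0}

# Toy stand-in for english_tweets['account_type'].
account_type = pd.Series(["Right", "Left", "Left", "Right"])

y = account_type.map(LABELS)

# Any category missing from LABELS would show up as NaN here.
assert y.notna().all()

print(y.tolist())  # [0.0, 1.0, 1.0, 0.0]
```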

## Vectorizing Our Data

We will vectorize our data using the TF-IDF method. Our SGDClassifier will be trained on these vectors.

One limitation we ran into is that our hardware cannot transform more than about 20,000 Tweets at a time. We therefore batch-train the SGDClassifier using 2000 TF-IDF fit-transformed batches, each built from a random sample of 20,000 Tweets.
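
A rough back-of-the-envelope calculation suggests why batches of this size can strain memory once the TF-IDF output is converted to a dense array. The sketch uses the 20,000-row sample size and the 52,752-column shape that this notebook reports for its test matrix; the figure of 15 non-zero terms per Tweet is purely an assumption for illustration, as is the idea that dense conversion is the main cause of the limit.

```python
n_rows, n_cols = 20_000, 52_752  # batch size and vocabulary size from this notebook

# Dense float64 array: every cell costs 8 bytes.
dense_gb = n_rows * n_cols * 8 / 1e9
print(f"dense:  {dense_gb:.2f} GB")  # dense:  8.44 GB

# Sparse CSR estimate, ASSUMING about 15 non-zero terms per Tweet (hypothetical):
# 8 bytes per value + 4 bytes per column index, plus one 4-byte row pointer per row.
nnz = n_rows * 15
sparse_mb = (nnz * (8 + 4) + (n_rows + 1) * 4) / 1e6
print(f"sparse: {sparse_mb:.2f} MB")  # sparse: 3.68 MB
```

Under these assumptions the sparse representation is roughly three orders of magnitude smaller, which is one reason to keep the vectorizer output sparse where the estimator allows it.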

In [11]:
# Create the SGDClassifier (default hinge loss, i.e. a linear SVM) that we will batch-train
clf = SGDClassifier()

In [12]:
# Create the TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer="word", 
                             tokenizer=word_tokenize, 
                             stop_words=stopwords.words('english'))
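
To see what the vectorizer produces, here is a minimal sketch on three made-up sentences. It assumes scikit-learn's built-in English stop-word list and default tokenizer instead of the NLTK ones configured above, so it runs without any NLTK downloads; the resulting vocabulary will therefore differ from the notebook's.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only.
docs = [
    "the wall must be built",
    "healthcare is a right",
    "build the wall now",
]

# Assumption: built-in stop words and default tokenizer, not NLTK's.
vec = TfidfVectorizer(analyzer="word", stop_words="english")
X = vec.fit_transform(docs)

# One row per document, one column per vocabulary term.
print(X.shape)
print(sparse.issparse(X))      # fit_transform returns a sparse matrix
print(sorted(vec.vocabulary_)) # the learned terms
```

Each row is a weighted bag of words: terms that appear in many documents receive lower weights than terms that are distinctive to a few.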

## Training the SGDClassifier

### Fitting the Data
Batch-train the SGDClassifier with `partial_fit()`, passing in one random sample of 20,000 Tweets per iteration. The loop below currently runs a single iteration.

In [13]:
for i in range(1):
    # Train using sample sizes of 20,000
    data = english_tweets.sample(20000)
    # Fit the vocabulary on the first batch only; later batches reuse it via transform(),
    # because partial_fit() requires the same number of features on every call
    if i == 0:
        et_x = vectorizer.fit_transform(data['content'])
    else:
        et_x = vectorizer.transform(data['content'])
    et_y = data['y']
    # Train CLF on the sparse TF-IDF matrix (SGDClassifier accepts sparse input, so no .toarray())
    # Pass both classes explicitly in case a batch happens to contain only one label
    clf.partial_fit(et_x, et_y, classes=np.array([0.0, 1.0]))

0




52752
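
If the loop is later extended to many batches, every call to `partial_fit()` has to see the same number of features. One way to get that without fitting a vocabulary at all is a stateless `HashingVectorizer`. This is a swapped-in technique, not what the notebook does: it has no IDF weighting, and the texts, labels, batch size, and `n_features` value below are all made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up toy data; 0.0 = "Right", 1.0 = "Left" as in the notebook.
texts = [
    "build the wall", "lower taxes now", "secure the border", "support our troops",
    "healthcare for all", "raise the minimum wage", "climate action now", "fund public schools",
]
labels = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Stateless: hashes tokens into a fixed number of columns, so no fit step is needed.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf_sketch = SGDClassifier(random_state=0)
classes = np.array([0.0, 1.0])  # passed explicitly; a batch may contain only one label

rng = np.random.default_rng(0)
for _ in range(5):
    idx = rng.choice(len(texts), size=4, replace=False)
    X_batch = hasher.transform([texts[i] for i in idx])  # stays sparse
    clf_sketch.partial_fit(X_batch, labels[idx], classes=classes)

print(clf_sketch.coef_.shape)  # (1, 262144)
```

The trade-off is that hashed features cannot be mapped back to words, so inspecting which terms drive a prediction becomes harder than with a fitted TF-IDF vocabulary.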



### How Accurate Are We?

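Let's draw a random sample of 20,000 Tweets (the size doesn't have to be 20,000) and check how accurate the SGDClassifier is. Because this sample comes from the same pool the classifier was trained on, some test Tweets may have been seen during training, so the scores are likely optimistic.

TODO: These numbers aren't really meaningful right now because we've only done 1 iteration on a small set; this will change once the code is integrated into the master code.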

In [15]:
# Let's see how accurate (or inaccurate) this is

test_data = english_tweets.sample(20000)
test_data_vector = vectorizer.transform(test_data['content']).toarray()
test_data_y = test_data['y']

results = clf.predict(test_data_vector)
print(classification_report(test_data_y, results))

(20000, 52752)
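
For a less optimistic estimate, the data can be split once into disjoint training and test portions before any fitting. The following is a minimal sketch on made-up texts, assuming a plain `TfidfVectorizer` and a single `fit()` call instead of the notebook's NLTK tokenizer and batch training; the variable names are introduced here for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Made-up toy data; 0.0 = "Right", 1.0 = "Left" as in the notebook.
texts = [
    "build the wall", "lower taxes now", "secure the border", "support our troops",
    "healthcare for all", "raise the minimum wage", "climate action now", "fund public schools",
]
labels = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

# Hold out 25% of the Tweets; stratify keeps both classes in each portion.
train_txt, test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on the training portion only, then reuse it for the test portion.
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_txt)
X_test = vec.transform(test_txt)

model = SGDClassifier(random_state=0).fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

With only eight toy sentences the printed scores mean nothing; the point is the order of operations, where the test Tweets never influence either the vocabulary or the classifier.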