# Outline, Notes, Questions

##### Steps:
1. Preprocessing:
    - import
    - label
    - group
    - cleaning with regex
    - subsample

2. Create custom tokenizer with spacy
    - define stop words
    - specify lemmatize
    - remove stop words and punctuation

3. Train/test split
    - start with test_size = .25, lower it as I tune the model

4. Create the pipeline
    - vectorizer
    - classifier

5. Fit the model to the data

6. Test the results

7. Refine the model and the data

8. Explore results
    - confusion matrix
    - precision
    - accuracy

9. Visualize

##### Questions:
What happens if I don't group the tweets?

How do I see the tweets/accounts that were confused?
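
One possible way to answer this, sketched on toy data: put the test texts, true labels, and predictions side by side and keep the rows where they disagree. The Series and the `preds` list below are hypothetical stand-ins for `X_test`, `y_test`, and the output of `predict`; the sketch assumes the test split keeps its original pandas index.

```python
import pandas as pd

# Toy stand-ins for X_test, y_test and the model's predictions (hypothetical data).
X_test = pd.Series(["tweet a", "tweet b", "tweet c", "tweet d"], index=[10, 11, 12, 13])
y_test = pd.Series([0, 1, 1, 0], index=[10, 11, 12, 13])
preds = [0, 0, 1, 1]  # what pipeline.predict(X_test) might return

# Line the predictions up with the original index so rows can be traced back.
results = pd.DataFrame(
    {"text": X_test, "actual": y_test, "predicted": preds},
    index=X_test.index,
)

# Keep only the rows where the prediction disagrees with the label.
confused = results[results["actual"] != results["predicted"]]
print(confused)
```

If the index does carry over from `grouped_tweets`, something like `grouped_tweets.loc[confused.index, 'author']` should then give the confused accounts.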

##### To do:
- Even out the sample in ungrouped tweets

# Code

#### Packages

In [1]:
import pandas as pd
import numpy as np
import chardet
import re
import spacy

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, classification_report, confusion_matrix

#### Data Import

In [198]:
congress_tweets = pd.read_csv("politician_tweets.csv")
troll_tweets = pd.read_csv("troll_tweets.csv")
trump_tweets = pd.read_csv("trump_tweets.csv", encoding ='Windows-1252')

#### Subsetting and renaming columns, grouping by author

In [199]:
%%time
# Congress
congress_tweets = congress_tweets[['Handle', 'Tweet']]
congress_tweets.columns = ['author', 'text']
congress_tweets['class'] = 0

# Trump
trump_tweets['author'] = 'realdonaldtrump'
trump_tweets['class'] = 0
trump_tweets = trump_tweets.sample(n=200, random_state = 1)

# Trolls
troll_tweets = troll_tweets[['author', 'content']]
troll_tweets['class'] = 1
troll_tweets.rename(columns = {'content': 'text'}, inplace = True)

# Merging
labeled_tweets = pd.merge(congress_tweets, trump_tweets, how = 'outer')
labeled_tweets = pd.merge(labeled_tweets, troll_tweets, how = 'outer')
print(labeled_tweets.shape)

# Dropping URLs, hyphens, periods, and double quotes
def drop_characters(tweet):
    tweet = re.sub(r'http\S+', '', tweet)
    tweet = re.sub(r'-', '', tweet)
    tweet = re.sub(r'\.', '', tweet)
    tweet = re.sub(r'"', '', tweet)
    return tweet
labeled_tweets['text'] = labeled_tweets['text'].apply(drop_characters)



(525223, 3)
Wall time: 3.44 s
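
A quick check of what `drop_characters` does to a single tweet. The sample string is made up for illustration. Because the characters are deleted rather than replaced with a space, hyphenated words get fused ("pre-tax" becomes "pretax"), and the function assumes every value in `text` is a string (`re.sub` raises a `TypeError` on missing values).

```python
import re

def drop_characters(tweet):
    tweet = re.sub(r'http\S+', '', tweet)  # URLs
    tweet = re.sub(r'-', '', tweet)        # hyphens
    tweet = re.sub(r'\.', '', tweet)       # periods
    tweet = re.sub(r'"', '', tweet)        # double quotes
    return tweet

sample = 'U.S. "pre-tax" income https://t.co/abc123'  # made-up tweet
cleaned = drop_characters(sample)
print(repr(cleaned))  # 'US pretax income ' (note the trailing space)
```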


In [307]:
# Applying weights to the observations for even sample size
class_num = labeled_tweets['class'].nunique()
print(class_num)

# each class gets an equal share (1/class_num) of the total sampling weight
pol_class_weight = (1/class_num)/(len(labeled_tweets[labeled_tweets['class'] == 0]))
print(pol_class_weight)

troll_class_weight = (1/class_num)/(len(labeled_tweets[labeled_tweets['class'] == 1]))
print(troll_class_weight)

labeled_tweets['weight'] = pol_class_weight
labeled_tweets.loc[labeled_tweets['class'] == 1, 'weight'] = troll_class_weight

2
7.379093552148054e-06
1.092982180018537e-06
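
The arithmetic behind those two weights, as a small sketch. The class sizes are taken from the shapes printed elsewhere in this notebook (67,559 congressional tweets plus 200 Trump tweets, and 457,464 troll tweets); the variable names are only for illustration. Each class ends up carrying half of the total sampling weight, although a weighted draw without replacement will not necessarily give an exactly even split.

```python
# Class sizes from the printed shapes: politicians = Congress + Trump sample.
n_pol = 67559 + 200
n_troll = 457464
class_num = 2

pol_w = (1 / class_num) / n_pol
troll_w = (1 / class_num) / n_troll
print(pol_w, troll_w)

# Total weight per class: each should come out at 0.5 (up to float rounding).
print(pol_w * n_pol, troll_w * n_troll)
```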


In [308]:
# creating a subsample of tweets because the full data set takes too long to process
ungrouped_tweets = labeled_tweets.sample(n=75000, weights = 'weight', random_state = 2)
ungrouped_tweets = ungrouped_tweets.drop('weight', axis = 1)  # drop() returns a copy, so reassign
grouped_tweets = ungrouped_tweets.groupby(['author', 'class'])['text'].apply(' '.join).reset_index()
grouped_tweets['author'].nunique()

759

#### Spacy Tokenizer

In [90]:
def spacy_tokenizer(tweet):
    tweet = nlp(tweet)
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    tweet = [tok for tok in tweet if (tok not in stopwords and tok not in not_allowed)] 
    return tweet

#### Train, Test, Split

In [312]:
# Grouped Tweets
X = grouped_tweets['text']
Y = grouped_tweets['class']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .25, random_state = 3)

# Ungrouped Tweets
A = ungrouped_tweets['text']
B = ungrouped_tweets['class']
A_train, A_test, b_train, b_test = train_test_split(A, B, test_size = .25, random_state = 4)

#### Pipeline Creation

In [232]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer = spacy_tokenizer)),
    ('fit', MultinomialNB())
])

#### Run the Model

In [317]:
%%time
# total time: 7-8 min
# grouped data fit, 5-6 minutes
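# Pipeline.fit returns the pipeline itself, so fit a clone to keep this model separate
from sklearn.base import clone
group_clf = clone(pipeline).fit(X_train, y_train)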
# grouped data predict, 1-2 minutes
group_preds = group_clf.predict(X_test)

Wall time: 7min 47s


In [314]:
%%time
# total time: 18-20 minutes
# ungrouped data fit, 14-15 minutes
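# Pipeline.fit returns the pipeline itself, so fit a clone to keep this model separate
from sklearn.base import clone
ungroup_clf = clone(pipeline).fit(A_train, b_train)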
# ungrouped data predict, 4-5 minutes
ungroup_preds = ungroup_clf.predict(A_test)

Wall time: 14min 30s
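
A note on reusing one pipeline for both data sets: `Pipeline.fit` returns the pipeline object itself, so two names assigned from two `fit` calls on the same pipeline refer to a single model that holds only the most recent fit. The sketch below shows this on toy corpora (hypothetical data, with the default tokenizer standing in for `spacy_tokenizer`) and uses `sklearn.base.clone` as one way to get an independent copy.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpora (hypothetical); labels follow the notebook: 0 = politician, 1 = troll.
docs_a, labels_a = ["maga rally tonight", "vote on the bill", "fake news again"], [1, 0, 1]
docs_b, labels_b = ["town hall today", "breaking wow unreal"], [0, 1]

pipeline = Pipeline([("vect", TfidfVectorizer()), ("fit", MultinomialNB())])

first = pipeline.fit(docs_a, labels_a)
second = pipeline.fit(docs_b, labels_b)
same_object = first is second  # both names point at the refitted pipeline

# clone() builds a fresh, unfitted copy with the same parameters.
independent = clone(pipeline).fit(docs_a, labels_a)
is_copy = independent is not pipeline

print(same_object, is_copy)
```

Predicting right after each fit, as the cells above do, still gives the right predictions; the risk only appears when an earlier model is reused after a later fit.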


#### Evaluate Accuracy

In [318]:
# grouped predictions
print('Confusion Matrix:')
print(confusion_matrix(y_test, group_preds))

print('Accuracy Score:')
print(accuracy_score(y_test, group_preds))

print('Precision Score:')
print(precision_score(y_test, group_preds))

Confusion Matrix:
[[102   0]
 [  6  82]]
Accuracy Score:
0.968421052631579
Precision Score:
1.0


In [316]:
# ungrouped predictions
print('Confusion Matrix:')
print(confusion_matrix(b_test, ungroup_preds))

print('Accuracy Score:')
print(accuracy_score(b_test, ungroup_preds))

print('Precision Score:')
print(precision_score(b_test, ungroup_preds))

Confusion Matrix:
[[7436  779]
 [ 574 9961]]
Accuracy Score:
0.92784
Precision Score:
0.9274674115456238
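
The scores above can be reproduced by hand from the two confusion matrices. This is a minimal sketch assuming the layout `confusion_matrix` uses for labels 0 and 1, that is `[[TN, FP], [FN, TP]]` with troll (class 1) as the positive class. Recall is not reported in the notebook; it is added here only because it shows the missed trolls that precision hides.

```python
def metrics(cm):
    """Compute scores from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,   # share of all predictions that are right
        "precision": tp / (tp + fp),     # share of predicted trolls that are trolls
        "recall": tp / (tp + fn),        # share of actual trolls that were caught
    }

grouped = metrics([[102, 0], [6, 82]])
ungrouped = metrics([[7436, 779], [574, 9961]])
print(grouped)
print(ungrouped)
```

For the grouped model, precision is 1.0 because no politician was labeled a troll, while the 6 missed troll accounts only show up in recall (82 of 88).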


## Working Code

Congressional Tweets

In [3]:
congress_tweets = congress_tweets[['Handle', 'Tweet']]
congress_tweets.columns = ['author', 'text']
congress_tweets['account_category'] = 'politician'
print(congress_tweets.shape)
congress_tweets.head()

(67559, 3)


|  | author | text | account_category |
|---|---|---|---|
| 0 | RepDarrenSoto | Today, Senate Dems vote to #SaveTheInternet. P... | politician |
| 1 | RepDarrenSoto | Hurricane Maria left approx $90 billion in dam... | politician |
| 2 | RepDarrenSoto | .@realDonaldTrump official policy to separate ... | politician |
| 3 | RepDarrenSoto | Thank you to my mom Jean and all the mothers a... | politician |
| 4 | RepDarrenSoto | We paid our respects at Nat’l Law Enforcement ... | politician |


In [4]:
# Grouping by author
congress_tweets = congress_tweets.groupby('author')['text'].apply(' '.join).reset_index()

Trump Tweets

In [5]:
trump_tweets['author'] = 'realdonaldtrump'
trump_tweets['account_category'] = 'politician'
trump_tweets = trump_tweets.sample(n=200)
print(trump_tweets.shape)
trump_tweets.head()

(200, 3)


|  | text | author | account_category |
|---|---|---|---|
| 174 | James Comey is a proven LEAKER &amp; LIAR. Vir... | realdonaldtrump | politician |
| 460 | After years of rebuilding OTHER nations we are... | realdonaldtrump | politician |
| 55 | ....great people of Montana will not stand for... | realdonaldtrump | politician |
| 576 | The Democrats are pushing for Universal Health... | realdonaldtrump | politician |
| 79 | .@JimRenacci has worked so hard on Tax Reducti... | realdonaldtrump | politician |


In [6]:
trump_tweets = trump_tweets.groupby('author')['text'].apply(' '.join).reset_index()

Troll Tweets

In [4]:
troll_tweets = troll_tweets[['author', 'content', 'account_category']]
troll_tweets['account_category'] = 'troll'
troll_tweets.rename(columns = {'content': 'text'}, inplace = True)
print(troll_tweets.shape)
troll_tweets.head()

(457464, 3)


|  | author | text | account_category |
|---|---|---|---|
| 0 | 10_GOP | "We have a sitting Democrat US Senator on tria... | troll |
| 1 | 10_GOP | Marshawn Lynch arrives to game in anti-Trump s... | troll |
| 2 | 10_GOP | JUST IN: President Trump dedicates Presidents ... | troll |
| 3 | 10_GOP | Dan Bongino: "Nobody trolls liberals better th... | troll |
| 4 | 10_GOP | '@SenatorMenendez @CarmenYulinCruz Doesn't mat... | troll |


In [95]:
troll_tweets = troll_tweets.groupby('author')['text'].apply(' '.join).reset_index()

In [96]:
troll_tweets['author'].nunique()

325

#### Merge Data Frame

In [6]:
labeled_tweets = pd.merge(congress_tweets, trump_tweets, how = 'outer')
labeled_tweets = pd.merge(labeled_tweets, troll_tweets, how = 'outer')
print(labeled_tweets.shape)

(525223, 3)


#### Tokenize and process

In [4]:
nlp = spacy.load('en')

In [22]:
# progress bar
def log_progress(sequence, every=None, size=None, name='Items'):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )

In [87]:
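# explicit import so the stop word list does not depend on spacy.lang.en already being loaded
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS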

import string
punctuations = string.punctuation

In [88]:
#use this to remove additional strings/punctuation
not_allowed = string.punctuation + string.digits + '-' + '.' + '"'
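
Something to keep in mind when filtering with `tok not in not_allowed`: because `not_allowed` is a string, `in` is a substring test, not a per-character test. The sketch below shows the effect on a few made-up tokens and one hypothetical alternative that drops tokens consisting only of unwanted characters. `unwanted_chars` and the token list are illustrative names, not part of the notebook.

```python
import string

not_allowed = string.punctuation + string.digits + '-' + '.' + '"'
tokens = ['vote', '12', '21', '', '!', '...']  # made-up tokens

# String membership is a substring test: '12' and '' are substrings, '21' and '...' are not.
kept_by_string = [tok for tok in tokens if tok not in not_allowed]
print(kept_by_string)  # ['vote', '21', '...']

# Hypothetical alternative: drop tokens made up entirely of unwanted characters.
unwanted_chars = set(not_allowed)
kept_by_chars = [tok for tok in tokens if not all(ch in unwanted_chars for ch in tok)]
print(kept_by_chars)  # ['vote']
```

The substring behavior does have one useful side effect: empty strings left over after `strip()` are always removed, since `''` is a substring of every string.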

Using generator to lemmatize

In [67]:
%%time
spacy_tweets = nlp.pipe(labeled_tweets.iloc[:,1], batch_size = 1000, n_threads = 3)

Wall time: 0 ns


In [30]:
%%time
clean_tweets = []
for tweet in log_progress(spacy_tweets, every = 1):
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    tweet = [tok for tok in tweet if (tok not in stopwords and tok not in punctuations)] 
    clean_tweets.append(tweet)

A Jupyter Widget

Wall time: 6.86 s


Custom tokenizer using spaCy. It lemmatizes each token and also removes stop words and punctuation.

In [90]:
def spacy_tokenizer(tweet):
    tweet = nlp(tweet)
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    tweet = [tok for tok in tweet if (tok not in stopwords and tok not in not_allowed)] 
    return tweet

Multithreading lemmatizer

In [69]:
from multiprocessing.dummy import Pool as ThreadPool 
pool = ThreadPool(1) 

In [70]:
%%time
results = pool.map(spacy_tokenizer, labeled_tweets.iloc[:1000,1])

Wall time: 13.6 s


In [54]:
%%time 
test = []
for tweet in labeled_tweets.iloc[:1000,1]:
    tweet = spacy_tokenizer(tweet)
    test.append(tweet)

Wall time: 16.2 s
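
A note on the comparison above: `ThreadPool(1)` has a single worker thread, so `pool.map` processes the tweets one after another and the timing difference is probably noise. The sketch below shows the same `map` pattern with several workers, using a trivial `count_words` function as a hypothetical stand-in for `spacy_tokenizer`. Even with more workers, threads may not speed up CPU-bound Python work much because of the global interpreter lock, so treat this as a pattern to benchmark, not a guaranteed improvement.

```python
from multiprocessing.dummy import Pool as ThreadPool

def count_words(tweet):
    # Stand-in for spacy_tokenizer (hypothetical helper).
    return len(tweet.split())

tweets = ["one two three", "four five", "six"]  # made-up tweets

# map() spreads the calls over 4 worker threads and returns results in input order.
with ThreadPool(4) as pool:
    threaded = pool.map(count_words, tweets)

sequential = [count_words(t) for t in tweets]
print(threaded, sequential)
```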


TFIDF Vectorizer

In [117]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(tokenizer = spacy_tokenizer)

In [167]:
%%time
sparse_matrix = vect.fit_transform(grouped_tweets.iloc[:,1])

Wall time: 4min 27s


In [152]:
print(vect.get_feature_names())






Train-test-split

In [None]:
from sklearn.model_selection import train_test_split

X = grouped_tweets['text']
Y = grouped_tweets['class']

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = .25, random_state = 3)

Pipeline

Steps in the pipeline:
1. tfidf vectorizer
   
   a. lemmatize
   
2. fit the model

In [168]:
from sklearn.pipeline import Pipeline

In [None]:
pipeline = Pipeline([
    ('vect', TfidfVectorizer(tokenizer = spacy_tokenizer)),
    # placeholder for "some model"; MultinomialNB is the classifier used in the Code section
    ('fit', MultinomialNB())
])