# Text Classification + Trump's Tweet Sentiment Analysis 

This notebook builds a text classifier from Cornell's movie review dataset (http://www.cs.cornell.edu/people/pabo/movie-review-data/) and applies the model to a dataset of Trump's tweets since his inauguration.

In [165]:
import numpy as np
import pandas as pd
import re
import pickle
import json
import nltk
from sklearn.datasets import load_files
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from nltk.corpus import stopwords
nltk.download('stopwords')
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# load the labelled movie reviews used to train the sentiment model
reviews = load_files('txt_sentoken')
X, y = reviews.data, reviews.target

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/207454/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Pre-Process dataset
The reviews first need to be cleaned of anything that could add noise. The steps below replace non-word characters with spaces, convert the text to lowercase, remove single characters, and collapse extra whitespace.

In [174]:
def pre_process(X):
    text = []
    for i in range(0,len(X)):
        review = re.sub(r'\W', ' ', str(X[i]))
        # convert to lower case
        review = review.lower()
        # removes all single characters
        review = re.sub(r'\s+[a-z]\s+', ' ', review)
        # removes a single character at the start of a review
        # (the anchor must be an unescaped ^; \^ would only match a literal caret)
        review = re.sub(r'^[a-z]\s+', ' ', review)
        # remove all the extra spaces we just added
        review = re.sub(r'\s+', ' ', review)
        text.append(review)
    return text
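
To see what each substitution does, the same steps can be traced on a single string. This is only a sketch: the sample sentence is made up, `clean` is a hypothetical helper that mirrors `pre_process` for one string, and the start-of-string pattern is written with an unescaped `^` anchor, which is my reading of what the comment in the function intends.

```python
import re

def clean(text):
    """Apply the same regex steps as pre_process to one string."""
    text = re.sub(r'\W', ' ', text)            # non-word characters -> spaces
    text = text.lower()                        # convert to lower case
    text = re.sub(r'\s+[a-z]\s+', ' ', text)   # isolated single letters inside the text
    text = re.sub(r'^[a-z]\s+', ' ', text)     # a single letter at the very start
    text = re.sub(r'\s+', ' ', text)           # collapse runs of whitespace
    return text

sample = "A truly GREAT film -- I loved it!"   # hypothetical review text
cleaned = clean(sample)
print(repr(cleaned))
```

Note that the result can keep a leading or trailing space; the vectorizer's tokenizer ignores it, so it does not need to be stripped.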

## Create Pipelines
I will be testing two models, Ridge Classifier and Logistic Regression. In both cases the first step is TF-IDF, which transforms words into vectors and does some additional text cleanup. I'm creating one pipeline per model so I can later test them both on sample datasets.

In [175]:
# Pipeline components
vec   = TfidfVectorizer()
ridge = RidgeClassifier()
logReg    = LogisticRegression()

ri_pipe = make_pipeline(vec, ridge)
lr_pipe = make_pipeline(vec, logReg)

In [176]:
X = pre_process(X)

## Vectorizing dataset using TF-IDF
TF (Term Frequency) counts how often each word occurs within a single review; the classic definition divides that count by the total number of words in the review. IDF (Inverse Document Frequency) takes the log of the number of reviews divided by the number of reviews that contain the word, so words that appear in few reviews receive a higher weight than words that appear almost everywhere. I am also passing in a list of stop words (words that add little meaning to a sentence, such as he, she, they), which the vectorizer drops from the vocabulary entirely.
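
To make the weighting concrete, here is a small hand-rolled sketch on a made-up three-review corpus. It follows what I understand to be scikit-learn's default variant (raw counts for TF, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row); the corpus and the helper name `tfidf` are hypothetical, and this is not the library's actual implementation.

```python
import math
from collections import Counter

docs = [                       # hypothetical three-review corpus
    "great film great cast",
    "boring film",
    "great fun",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# document frequency: in how many reviews does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    counts = Counter(doc)                                    # raw term frequency
    weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))   # L2 normalisation
    return {t: w / norm for t, w in weights.items()}

vectors = [tfidf(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the second review, "boring" occurs in only one review while "film" occurs in two, so "boring" ends up with the larger weight even though both appear once.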

In [177]:
# applying Tfidf model on the data
ri_pipe.steps[0][1].set_params(max_features=2000, min_df=3, max_df=0.6, stop_words=stopwords.words('english'))
ri_pipe.steps[0][1].fit_transform(X).toarray()

lr_pipe.steps[0][1].set_params(max_features=2000, min_df=3, max_df=0.6, stop_words=stopwords.words('english'))
lr_pipe.steps[0][1].fit_transform(X).toarray()




array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.06887219, 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.12007883, 0.        , 0.06321361, ..., 0.        , 0.        ,
        0.        ]])

## Split dataset to training and test sets

In [178]:
# creating training and test sets
text_train, text_test, sent_train, sent_test = train_test_split(X, y, test_size=0.2,random_state=0)

## Fitting our training set with classification models
The first pipeline uses Logistic Regression to train the model. Logistic Regression is used on classification problems to predict the probability that a sample belongs to class 1 rather than class 0. I then make predictions on the test set and look at the score (mean accuracy) to see how it performs on my dataset.

In [179]:
# training the classifier with logistic regression
lr_pipe.fit(text_train, sent_train)




Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.6, max_features=2000,
                                 min_df=3, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                       

In [180]:
prediction = lr_pipe.predict(text_test)

In [185]:
lr_pipe.score(text_test, sent_test)

0.8475
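
As a rough picture of what the fitted logistic regression does with each TF-IDF vector: it takes a weighted sum of the features, squashes it through the sigmoid into a probability, and thresholds that probability at 0.5. The weights, bias, and feature values below are invented for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features, threshold=0.5):
    """Return (probability of class 1, predicted label)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    return p, int(p >= threshold)

weights = [2.0, -3.0]   # hypothetical weights for two terms, e.g. "great" and "boring"
bias = 0.0

p_pos, label_pos = predict(weights, bias, [0.8, 0.0])   # review dominated by "great"
p_neg, label_neg = predict(weights, bias, [0.0, 0.7])   # review dominated by "boring"
print(round(p_pos, 3), label_pos)
print(round(p_neg, 3), label_neg)
```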

For the second pipeline I'm using the Ridge Classifier, which treats classification as a regularized (L2-penalized) linear regression problem, to train my model. I will use it to predict my test data and apply 10-fold cross-validation on the training set to see how its F1 score varies across different sections of my data.

In [59]:
# training the classifier with ridge classifier
ri_pipe.fit(text_train, sent_train)

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.6, max_features=2000,
                                 min_df=3, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...],
                                 strip_accents=None, sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                

In [64]:
ri_pipe.predict(text_test)

array([1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0,

In [65]:
scores = cross_val_score(estimator=ri_pipe, X=text_train, y=sent_train, cv=10, scoring="f1")
scores

array([0.8742515 , 0.78362573, 0.84705882, 0.81481481, 0.86956522,
       0.88505747, 0.80981595, 0.84146341, 0.85534591, 0.85      ])
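
Ten separate numbers are hard to compare at a glance, so it helps to summarise them. The sketch below copies the F1 scores from the output above and reports their mean and spread using only the standard library.

```python
from statistics import mean, stdev

# F1 scores copied from the cross_val_score output above
scores = [0.8742515, 0.78362573, 0.84705882, 0.81481481, 0.86956522,
          0.88505747, 0.80981595, 0.84146341, 0.85534591, 0.85]

mean_f1 = mean(scores)
spread = stdev(scores)          # sample standard deviation across the folds
print(f"mean F1 = {mean_f1:.3f}, std = {spread:.3f}")
print(f"min = {min(scores):.3f}, max = {max(scores):.3f}")
```

The mean comes out at roughly 0.843, with individual folds ranging from about 0.78 to about 0.89.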

## Exporting our pipelines into pickled files and importing them 

In [66]:
# Pickling the logistic regression pipeline (vectorizer + classifier)
with open('logistic_regression.pickle', 'wb') as f:
    pickle.dump(lr_pipe, f)

# Pickling the ridge classifier pipeline (vectorizer + classifier)
with open('ridge_classification.pickle', 'wb') as f:
    pickle.dump(ri_pipe, f)

In [192]:
# Load both pickled pipelines back from disk
with open('logistic_regression.pickle', 'rb') as f:
    lr = pickle.load(f)
    
with open('ridge_classification.pickle', 'rb') as f:
    rc = pickle.load(f)

In [193]:
rc.predict(['You are the best person ever, I love your day'])

array([1])
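
The dump-and-load round trip works the same way for any picklable object. The sketch below uses an in-memory buffer and a plain dictionary as a hypothetical stand-in for a fitted pipeline, so it runs without the trained models.

```python
import io
import pickle

model = {"name": "demo", "weights": [0.1, 0.2]}   # hypothetical stand-in for a fitted pipeline

buffer = io.BytesIO()
pickle.dump(model, buffer)        # same call as used for the pipelines, but writing to memory
buffer.seek(0)                    # rewind before reading back
restored = pickle.load(buffer)

print(restored == model)          # equal content
print(restored is model)          # but a separate object
```

Two caveats apply to the real files: unpickling can execute arbitrary code, so only load pickles you created or trust, and a pickled scikit-learn pipeline is best loaded with the same library version that produced it.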

## Applying sentiment analysis on Trump's tweets using my models
I am importing all of Trump's tweets since his inauguration from a JSON file stored locally and keeping only the fields I need, which I then convert into a dataframe.

In [209]:
data = {}
with open('tweets.json') as f:
    data = json.load(f)
    
keys = ['text', 'favorite_count', 'created_at']
# keep only the fields listed in keys for each tweet
abbreviated_tweets = [{key: tweet[key] for key in keys} for tweet in data["tweets"]]

# build one column (list of values) per key, ready for the DataFrame constructor
df_dict = {key: [tweet[key] for tweet in abbreviated_tweets] for key in keys}
df = pd.DataFrame(df_dict)
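
The reshaping from a list of tweets into per-column lists can be checked on a tiny inline sample. The JSON below is made up (including the extra `lang` field, which stands for any field I don't need); only the top-level `"tweets"` key and the three kept field names come from the code above.

```python
import json

# hypothetical two-tweet file content with one extra field per tweet
raw = '''
{"tweets": [
  {"text": "Hello", "favorite_count": 10, "created_at": "Wed Mar 25 21:41:27 +0000 2020", "lang": "en"},
  {"text": "World", "favorite_count": 5,  "created_at": "Wed Mar 25 20:04:56 +0000 2020", "lang": "en"}
]}
'''
data = json.loads(raw)
keys = ['text', 'favorite_count', 'created_at']

# one small dict per tweet, restricted to the wanted keys
rows = [{key: tweet[key] for key in keys} for tweet in data["tweets"]]
# one list per key, i.e. a column-oriented view of the same data
columns = {key: [row[key] for row in rows] for key in keys}
print(columns)
```

Since `pd.DataFrame` also accepts a list of dictionaries, passing `rows` directly would give the same dataframe without the second comprehension.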


## Pre-processing tweets 
To clean the data as much as possible I'm removing URLs, HTML tags, and emoji. A spelling corrector is defined as well, but its call is left commented out in the cell that applies the cleaning functions.

In [210]:
# cleaning tweet functions
def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'',text)

def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)

def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

# !pip install pyspellchecker
from spellchecker import SpellChecker
spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
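            # correction() may return None when it finds no candidate (an assumption about newer pyspellchecker releases); keep the original word then
            corrected_text.append(spell.correction(word) or word)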
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

In [211]:

df['text'] = df['text'].apply(remove_URL)
df['text'] = df['text'].apply(remove_html)
df['text'] = df['text'].apply(remove_emoji)
# df['text']=df['text'].apply(lambda x : remove_punct(x))
# df['text']=df['text'].apply(lambda x : correct_spellings(x))
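
A quick way to sanity-check the cleaning functions is to run the same patterns on one made-up tweet. The sketch below reuses the URL and HTML regexes from above; the sample text and the helper name `clean_tweet` are hypothetical. It also shows `html.unescape` from the standard library, which is not part of the original cleaning but would turn entities such as `&amp;` back into `&` (the tag regex does not touch them).

```python
import html
import re

URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>')

def clean_tweet(text):
    """Strip URLs first, then HTML tags."""
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    return text

sample = "Great story! <b>Thank you</b> https://t.co/abc123"   # hypothetical tweet
cleaned = clean_tweet(sample)
print(repr(cleaned))

# optional extra step, not in the original notebook: decode HTML entities
unescaped = html.unescape("hydroxychloroquine &amp; azithromycin")
print(unescaped)
```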

## Applying the Ridge Classifier on the tweets

In [212]:
prediction = rc.predict(df['text'])

In [213]:
prediction

array([1, 0, 1, ..., 1, 1, 1])

In [214]:
df["prediction"] = prediction
df

| | text | favorite_count | created_at | prediction |
|---|---|---|---|---|
| 0 | ....I have been packed all day with meetings, ... | 45397 | Wed Mar 25 21:41:27 +0000 2020 | 1 |
| 1 | I hear that Fake News CNN just reported that I... | 54754 | Wed Mar 25 21:41:26 +0000 2020 | 0 |
| 2 | The LameStream Media is the dominant force in ... | 99300 | Wed Mar 25 20:04:56 +0000 2020 | 1 |
| 3 | Today is National #MedalofHonorDay. Join me in... | 51664 | Wed Mar 25 15:45:36 +0000 2020 | 1 |
| 4 | Congratulations to Prime Minister Abe of Japan... | 127643 | Wed Mar 25 14:54:47 +0000 2020 | 1 |
| ... | ... | ... | ... | ... |
| 16722 | "@mike_pence: Congratulations to @RealDonaldTr... | 74081 | Tue Dec 20 02:50:25 +0000 2016 | 1 |
| 16723 | "@Franklin_Graham: Congratulations to Presiden... | 50087 | Tue Dec 20 02:46:01 +0000 2016 | 1 |
| 16724 | RT @DanScavino: #TrumpTrain | 0 | Tue Dec 20 01:31:21 +0000 2016 | 1 |
| 16725 | We did it! Thank you to all of my great suppor... | 223249 | Mon Dec 19 23:51:41 +0000 2016 | 1 |


In [222]:
positive = df[df["prediction"] == 1]
negative = df[df["prediction"] == 0]
total_positive = positive.shape
total_negative = negative.shape
print(total_positive, total_negative)

(11834, 4) (4893, 4)
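
The two shapes are easier to interpret as shares of the whole. This small sketch takes the row counts from the output above and converts them to percentages.

```python
n_positive, n_negative = 11834, 4893      # row counts from the shapes printed above
total = n_positive + n_negative

share_positive = n_positive / total
share_negative = n_negative / total
print(f"positive: {share_positive:.1%}, negative: {share_negative:.1%}")
```

Roughly 71% of the tweets are labelled positive. Keep in mind that the classifier was trained on movie reviews, so these labels should be read as a rough signal rather than a measured sentiment of the tweets.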


In [218]:
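import matplotlib.pyplot as plt

# `barplot` is not defined anywhere in this notebook, so plot the predicted class counts instead
class_counts = df["prediction"].value_counts().sort_index()
class_counts.plot(kind='bar', figsize=(16,8), title='Predicted sentiment (0 = negative, 1 = positive)');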

['https co hofqef1hum',
 'https co ipslmvefja',
 'rt michaelcoudrey new data french study has demonstrated evidence that the combination of hydroxychloroquine amp azithromycin are highly ',
 'rt michaelcoudrey everyone who thinks coronavirus is harmless or doesn matter should rethink that opinion immediately this is an extr ',
 'white house news conference at 12 30 m thank you ',
 ' be put in use immediately people are dying move fast and god bless everyone us_fda stevefda cdcgov dhsgov',
 'hydroxychloroquine amp azithromycin taken together have real chance to be one of the biggest game changers in the history of medicine the fda has moved mountains thank you hopefully they will both works better with international journal of antimicrobial agents ',
 'great story thank you to mr young of jonesboro arkansas https co i9xh8vxfs2',
 'https co mllfftqv19',
 'https co 2waufzwbsa',
 'rt secretarysonny to the heroes in the s food supply chain we salute you https co 3wiuuntupb',
 'rt secretserv

(46667, 2000)
