# Predicting whether Trump's Tweets Will Have Vast Retweets

## Introduction

Donald John Trump, the 45th president of the United States (2017 to 2021), was a prolific Twitter user. Although his Twitter account was later suspended, many of the tweets he had already published were widely retweeted. The [Trump Tweets](https://www.kaggle.com/austinreese/trump-tweets) dataset on Kaggle records his tweets up to June 2020. In this project, we use machine learning techniques to answer one question: **Can we predict whether a tweet posted by Donald Trump will go "viral" (i.e., receive more than 10,000 retweets)?**

## IDE

First, we import the packages used in this project and load the Trump Tweets dataset.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

plt.rcParams["font.size"] = 16

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier

In [2]:
tweets_df = pd.read_csv("realdonaldtrump.csv", index_col=0)
y = tweets_df["retweets"] > 10_000
X = tweets_df["content"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=321)

Because we are predicting whether a tweet is "viral" from its text alone, the target `y` is a boolean flag for more than 10,000 retweets, and the only feature `X` is the `content` column. The other columns of the original dataset are ignored, and no further cleaning is applied before modelling. We also hold out a test set (25% of the rows, the `train_test_split` default) for the final assessment.
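As a minimal sketch of how the boolean target behaves, the snippet below applies the same `> 10_000` rule to a small made-up table. The data frame `toy_df` and its values are hypothetical and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for realdonaldtrump.csv
toy_df = pd.DataFrame(
    {
        "content": ["tweet a", "tweet b", "tweet c", "tweet d"],
        "retweets": [120, 25_000, 9_999, 10_000],
    }
)

# Same rule as in the notebook: strictly more than 10,000 retweets is "viral"
toy_y = toy_df["retweets"] > 10_000

# The mean of a boolean Series is the fraction of True values
viral_share = toy_y.mean()
# Share of the more common class; a majority-class baseline scores exactly this
majority_share = max(viral_share, 1 - viral_share)

print(toy_y.tolist(), viral_share, majority_share)
```

Note that a tweet with exactly 10,000 retweets is not labelled "viral", because the comparison is strict.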

## Applying the Models

Because the feature is raw text, we use `CountVectorizer`, a common bag-of-words technique that turns each tweet into a vector of word counts. `X` has only one column, so no other preprocessing (such as scaling or imputation) is needed. To obtain a baseline, we first apply `DummyClassifier`, the simplest classification model.

In [3]:
countvec = CountVectorizer(stop_words="english")
dm = DummyClassifier()
pipe_dm = make_pipeline(countvec, dm)
cross_val_results_dm = pd.DataFrame(
    cross_validate(pipe_dm, X_train, y_train, return_train_score=True)
)
dummy_result = pd.DataFrame(cross_val_results_dm.mean())
dummy_result.columns = ["DummyClassifier"]
dummy_result

| | DummyClassifier |
|---|---|
| fit_time | 0.438671 |
| score_time | 0.095556 |
| test_score | 0.738543 |
| train_score | 0.738543 |


The cross-validation score of `DummyClassifier` is about 74%. Because the dummy model always predicts the most frequent class, this means that the majority class (whichever of "viral" and "not viral" is more common) makes up about 74% of the training tweets. Any useful model has to beat this number. Note that `DummyClassifier` does not actually use the `CountVectorizer` features: it predicts by looking only at the most frequent label in the training set. A model that learns from the text is needed, so next we use Logistic Regression.
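The following sketch shows that a majority-class baseline gives the same accuracy regardless of the features it receives. The labels and feature matrices are made up, and `strategy="most_frequent"` is set explicitly here as an assumption; the notebook relies on the library default, which in recent scikit-learn versions also predicts the most frequent class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 3 of 4 belong to the majority class
toy_labels = np.array([False, False, False, True])

# Two very different feature matrices with the same number of rows
features_a = np.zeros((4, 1))
features_b = np.arange(8).reshape(4, 2)

dummy = DummyClassifier(strategy="most_frequent")

# Accuracy equals the share of the majority class in both cases
score_a = dummy.fit(features_a, toy_labels).score(features_a, toy_labels)
score_b = dummy.fit(features_b, toy_labels).score(features_b, toy_labels)

print(score_a, score_b)
```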

In [4]:
lr = LogisticRegression(max_iter=1000)
pipe_lr = make_pipeline(countvec, lr)
cross_val_results = pd.DataFrame(
    cross_validate(pipe_lr, X_train, y_train, return_train_score=True)
)
lr_result = pd.DataFrame(cross_val_results.mean())
lr_result.columns = ["Logistic Regression"]
lr_result

| | Logistic Regression |
|---|---|
| fit_time | 1.117634 |
| score_time | 0.094519 |
| test_score | 0.89789 |
| train_score | 0.967045 |


As the results show, Logistic Regression clearly outperforms the baseline: its cross-validation score is about 90%, compared with about 74%. The training score (about 97%) is noticeably higher than the validation score, which suggests some overfitting. At the end of this project, we also look at predicted probabilities to examine the model further.

## Hyperparameter Tuning and Coefficients

In the previous part, we applied Logistic Regression with its default hyperparameters. To see whether the result can be improved, we run a grid search over `max_features` of the `CountVectorizer` and the inverse regularization strength `C` of the Logistic Regression.

In [5]:
max_features = [10, 100, 1000, 10_000, 100_000]
C_vals = 10.0 ** np.arange(-1.5, 2, 0.5)
param_grid = {
    "countvectorizer__max_features": max_features,
    "logisticregression__C": C_vals,
}

pipe_tune = make_pipeline(CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000))
grid_search = GridSearchCV(pipe_tune, param_grid, verbose=1, n_jobs=-1)
grid_search.fit(X_train, y_train);

Fitting 5 folds for each of 35 candidates, totalling 175 fits
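The fit count reported above can be reproduced with a short calculation. This is a sketch that assumes the default of 5 cross-validation folds, since no `cv` argument is passed to `GridSearchCV`; the names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration.

```python
import numpy as np

# Same grids as in the notebook cell
max_features = [10, 100, 1000, 10_000, 100_000]
C_vals = 10.0 ** np.arange(-1.5, 2, 0.5)  # exponents -1.5, -1.0, ..., 1.5

# Every combination of max_features and C is one candidate
n_candidates = len(max_features) * len(C_vals)
n_folds = 5  # assumed GridSearchCV default when cv is not specified
n_fits = n_candidates * n_folds

print(np.round(C_vals, 3))
print(n_candidates, n_fits)
```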


In [6]:
grid_search.best_params_

{'countvectorizer__max_features': 100000, 'logisticregression__C': 1.0}

In [7]:
grid_search.best_score_

0.8978900824847041

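The best cross-validation score (0.8979) is the same as the score of the untuned model. This is consistent with the selected hyperparameters shown above: `C=1.0` is the `LogisticRegression` default, and the identical score suggests that the `max_features=100000` cap does not remove any words from the vocabulary. We now train the Logistic Regression model again on the full training set with these hyperparameters.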

In [8]:
pipe_best = make_pipeline(
    CountVectorizer(stop_words="english", max_features=100000),
    LogisticRegression(max_iter=1000, C=1.0),
)
pipe_best.fit(X_train, y_train)

Pipeline(steps=[('countvectorizer',
                 CountVectorizer(max_features=100000, stop_words='english')),
                ('logisticregression', LogisticRegression(max_iter=1000))])

Furthermore, we can inspect the words with the highest and lowest coefficients learned from the training set, that is, the words that push the prediction most strongly towards "viral" (positive coefficients) or "not viral" (negative coefficients).

In [9]:
vec_from_pipe = pipe_best.named_steps["countvectorizer"]
lr_from_pipe = pipe_best.named_steps["logisticregression"]

feature_names = np.array(vec_from_pipe.get_feature_names_out())
coeffs = lr_from_pipe.coef_.flatten()
word_coeff_df = pd.DataFrame(coeffs, index=feature_names, columns=["Coefficient"])
word_coeff_df.sort_values(by="Coefficient", ascending=False)

| | Coefficient |
|---|---|
| harassment | 2.731876 |
| mini | 2.712430 |
| fake | 2.692801 |
| coronavirus | 2.434258 |
| transcripts | 2.380516 |
| ... | ... |
| 1pic | -2.295077 |
| trump2016 | -2.316185 |
| barackobama | -2.565437 |
| trump2016pic | -2.637216 |


The table above shows the words with the highest coefficients at the top and the words with the lowest coefficients at the bottom.
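One way to read these coefficients: in logistic regression the log-odds are linear in the features, so each additional occurrence of a word multiplies the odds of "viral" by `exp(coefficient)`, holding the other word counts fixed. The sketch below applies this to two coefficients copied from the table; the variable names are introduced here for illustration.

```python
import math

# Coefficients copied from the table above
coef_harassment = 2.731876
coef_trump2016pic = -2.637216

# Multiplicative change in the odds of "viral" per extra occurrence of the word
odds_factor_up = math.exp(coef_harassment)
odds_factor_down = math.exp(coef_trump2016pic)

print(f"harassment: odds multiplied by about {odds_factor_up:.1f}")
print(f"trump2016pic: odds multiplied by about {odds_factor_down:.3f}")
```

A factor above 1 raises the odds of "viral" and a factor below 1 lowers them. These are statements about the fitted model, not causal claims about the words themselves.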

## Result

In this part, we evaluate the tuned Logistic Regression model on the test data. We also generate predicted probabilities and find the tweet in the test set that the model considers most likely to be "viral".

In [10]:
print("Test score of the tuned Logistic Regression model: %f" % (pipe_best.score(X_test, y_test)))

Test score of the tuned Logistic Regression model: 0.899243


In [11]:
viral_probs = pipe_best.predict_proba(X_test)[:,1]
highest_prob = np.argmax(viral_probs)
print("The most 'viral' tweet in the testing set is: '%s'" % X_test.iloc[highest_prob])

The most 'viral' tweet in the testing set is: 'Corrupt politician Adam Schiff wants people from the White House to testify in his and Pelosi’s disgraceful Witch Hunt, yet he will not allow a White House lawyer, nor will he allow ANY of our requested witnesses. This is a first in due process and Congressional history!'


In [12]:
print("associated probability is " + str(viral_probs[highest_prob]))

associated probability is 0.9999999325332923
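The probability lookup above can be illustrated on a tiny made-up corpus. Everything in this sketch (`toy_train`, `toy_target`, `toy_test`, and the pipeline `toy_pipe`) is hypothetical. It also looks up the column for the `True` class through `classes_` instead of hard-coding index 1, which gives the same column for a boolean target.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training texts and labels
toy_train = ["fake news witch hunt", "fake witch hunt", "happy birthday", "thank you happy"]
toy_target = [True, True, False, False]
toy_test = ["thank you", "witch hunt fake news", "happy news"]

toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression())
toy_pipe.fit(toy_train, toy_target)

# predict_proba has one column per class, ordered as in classes_
true_col = list(toy_pipe.classes_).index(True)
toy_probs = toy_pipe.predict_proba(toy_test)[:, true_col]

# Position of the test text with the highest predicted probability of True
top = int(np.argmax(toy_probs))
print(toy_test[top], toy_probs[top])
```

The value returned is the model's predicted probability, not an observed retweet count, so the selected tweet is the one the model is most confident about rather than the one that was actually retweeted most.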


## Discussion

In this project, we used Logistic Regression to predict whether a tweet is "viral", and its performance is satisfactory. The final test accuracy is about 89.9%, well above the 74% baseline. We also found the tweet in the test set that the model rates as most likely to be "viral"; its predicted probability is very close to 1. In conclusion, Logistic Regression combined with `CountVectorizer` performs well at predicting whether a tweet receives a large number of retweets. The same approach could be applied to similar real-world problems, such as estimating which posts are likely to spread widely on the internet, which is useful in public relations.