# Covid19 Tweet Truth Analysis

Coded by Luna McBride

This dataset contains the training, validation, and test CSVs, along with Excel copies of the train and test files, a CSV with the actual test labels, and ERNIE test results. For this analysis, I will ignore the Excel files (they duplicate the CSVs) and the ERNIE results. I will also act as if the test answer file did not exist during the testing phase, sticking with a basic approach: train, validate, then see what the model decides for the test set.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import nltk #Natural Language Toolkit for Processing
from nltk.corpus import stopwords #Get the Stopwords to Remove

import re #Regular Expressions
import html #Messing with HTML content, like &amp;
import string #String Processing

import tensorflow as tf #Import tensorflow in order to use Keras
from tensorflow.keras.preprocessing.text import Tokenizer #Add the keras tokenizer for tweet tokenization
from tensorflow.keras.preprocessing.sequence import pad_sequences #Add padding to help the Keras Sequencing
import tensorflow.keras.layers as L #Import the layers as L for quicker typing
from tensorflow.keras.optimizers import Adam #Pull the adam optimizer for usage

from tensorflow.keras.losses import SparseCategoricalCrossentropy #Loss function being used
from sklearn.model_selection import train_test_split #Train Test Split

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/covid19-fake-news-dataset-nlp/Constraint_Val.csv
/kaggle/input/covid19-fake-news-dataset-nlp/Constraint_Train.xlsx
/kaggle/input/covid19-fake-news-dataset-nlp/Constraint_Test.csv
/kaggle/input/covid19-fake-news-dataset-nlp/Constraint_Test.xlsx
/kaggle/input/covid19-fake-news-dataset-nlp/english_test_with_labels.csv
/kaggle/input/covid19-fake-news-dataset-nlp/test_ernie2.0_results.csv
/kaggle/input/covid19-fake-news-dataset-nlp/Constraint_Train.csv


In [2]:
twTrain = pd.read_csv("../input/covid19-fake-news-dataset-nlp/Constraint_Train.csv") #Load the tweet (tw) training set
twTrain.head() #Take a peek at the data

Unnamed: 0,id,tweet,label
0,1,The CDC currently reports 99031 deaths. In gen...,real
1,2,States reported 1121 deaths a small rise from ...,real
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,5,Populous states can generate large case counts...,real


In [3]:
twValid = pd.read_csv("../input/covid19-fake-news-dataset-nlp/Constraint_Val.csv") #Load the tweet (tw) validation set
twValid.head() #Take a peek at the data

Unnamed: 0,id,tweet,label
0,1,Chinese converting to Islam after realising th...,fake
1,2,11 out of 13 people (from the Diamond Princess...,fake
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real


In [4]:
twTest = pd.read_csv("../input/covid19-fake-news-dataset-nlp/Constraint_Test.csv") #Load the tweet (tw) testing set
twTest.head() #Take a peek at the data

Unnamed: 0,id,tweet
0,1,Our daily update is published. States reported...
1,2,Alfalfa is the only cure for COVID-19.
2,3,President Trump Asked What He Would Do If He W...
3,4,States reported 630 deaths. We are still seein...
4,5,This is the sixth time a global health emergen...


---

# Check for Null Values

In [5]:
print("Training Set:\n", twTrain.isnull().any()) #Check for null values in the training set
print("Validation Set:\n", twValid.isnull().any()) #Check for null values in the validation set
print("Testing Set:\n", twTest.isnull().any()) #Check for null values in the testing set

Training Set:
 id       False
tweet    False
label    False
dtype: bool
Validation Set:
 id       False
tweet    False
label    False
dtype: bool
Testing Set:
 id       False
tweet    False
dtype: bool


There are no null values in the dataset.

---

# Data Exploration

In [6]:
print(twTrain["tweet"][0]) #Print a simple tweet example
print(twTrain["tweet"][300]) #Print a more typical tweet example

The CDC currently reports 99031 deaths. In general the discrepancies in death counts between different sources are small and explicable. The death toll stands at roughly 100000 people today.
NEW: There have been numerous #COVID19 outbreaks on recent cruise ship voyages. @CDCDirector has extended the previous No Sail Order to prevent the spread of COVID-19 among crew onboard. https://t.co/OTWJgCN8wQ https://t.co/sbHX4p907F


The set appears to mix drier, plain-text tweets with more typical tweets (with hashtags, mentions, and links). Since the typical kind is present, I will have to do the usual tweet cleaning.

In [7]:
print("Training Labels:\n", twTrain["label"].value_counts()) #See the training labels
print("Validation Labels:\n", twValid["label"].value_counts()) #See the validation labels

Training Labels:
 real    3360
fake    3060
Name: label, dtype: int64
Validation Labels:
 real    1120
fake    1020
Name: label, dtype: int64


The labels appear to be fairly balanced. I will need to get dummies for these to turn real and fake into 1 and 0, but because neither label dominates, the model should pick up on both without too much difficulty.
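To put a number on "balanced", here is a minimal sketch that works out the majority-class baseline from the counts printed above, meaning the accuracy from always guessing the most common label. The helper name `majorityBaseline` is my own for illustration and is not used elsewhere in the notebook.

```python
#majorityBaseline: accuracy obtained by always predicting the most common label
#Input: a dictionary mapping each label to its count
#Output: the most common label and the share of rows it covers
def majorityBaseline(counts):
    label = max(counts, key = counts.get) #Label with the highest count
    return label, counts[label] / sum(counts.values()) #Share of rows carrying that label

trainCounts = {"real": 3360, "fake": 3060} #Counts printed for the training set
validCounts = {"real": 1120, "fake": 1020} #Counts printed for the validation set

print(majorityBaseline(trainCounts)) #Baseline for the training set
print(majorityBaseline(validCounts)) #Baseline for the validation set
```

Both sets come out at roughly 52% real, so any model has to clear about 0.52 accuracy to beat always guessing "real".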

---

# Tweet Processing

In [8]:
punctuations = string.punctuation #List of punctuations to remove
print(punctuations) #See the punctuations the string library has

STOP = stopwords.words("english") #Get the NLTK stopwords
print(STOP) #See what NLTK considers stopwords

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only

In [9]:
#CleanTweets: parses the tweets and removes mentions, links, punctuation, and stop words (digits are kept)
#Input: the list of tweets that need parsing
#Output: the parsed tweets
def cleanTweets(tweetParse):
    for i in range(0,len(tweetParse)):
        tweet = tweetParse[i] #Putting the tweet into a variable so that it is not calling tweetParse[i] over and over
        tweet = html.unescape(tweet) #Removes leftover HTML elements, such as &amp;
        tweet = re.sub(r"@\w+", " ", tweet) #Removes @mentions entirely, as other people's usernames carry no meaning here
        tweet = re.sub(r"http\S+", " ", tweet) #Removes links, as links in themselves provide no data in tweet analysis
        
        tweet = "".join([char for char in tweet if char not in punctuations]) #Removes the punctuation defined above
        tweet = tweet.lower() #Lowercases the tweet so it matches the lowercase stop word list
    
        tweetWord = tweet.split() #Splits the tweet into individual words
        tweetParse[i] = "".join([word + " " for word in tweetWord if word not in STOP]) #Keeps only the non-stop words, each followed by a space
        
    return tweetParse #Returns the parsed tweets

This code is reworked from my original coronavirus tweet sentiment analysis from earlier in the pandemic (https://www.kaggle.com/lunamcbride24/coronavirus-tweet-processing). I have changed it to use NLTK instead of spaCy, since the NLTK stopwords do not require loading a spaCy model. I have also used the string library to get punctuation instead of a bulky hard-coded list, and I removed the number remover, as I feel that numbers may be a key factor here (especially with the name Covid-19, which would otherwise lose the 19 and become just covid, which has a different connotation). These were things I wanted to change about the original after playing with Keras for TripAdvisor reviews (https://www.kaggle.com/lunamcbride24/hotel-review-keras-classification-project).

This may be a note to myself, but I did both of those projects half a year ago. This is why you should keep your code well-commented.
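To make the order of the cleaning steps easier to follow, here is a minimal single-string sketch of the same pipeline. It is an illustration under assumptions, not the notebook's function: `cleanOne`, `DEMO_STOP`, and the sample tweet (including its example.com link) are made up for the demo, and the tiny stop word set stands in for the NLTK list so the snippet runs without downloading anything.

```python
import html #Decode HTML entities such as &amp;
import re #Regular expressions for mentions and links
import string #Punctuation list

#Assumption: a tiny stand-in stop word set, NOT the NLTK list used in the notebook
DEMO_STOP = {"the", "of", "on", "to", "has", "have", "been", "there"}

#cleanOne: apply the cleanTweets steps to one string
#Input: a single raw tweet and a set of stop words
#Output: the cleaned tweet
def cleanOne(tweet, stop = DEMO_STOP):
    tweet = html.unescape(tweet) #Turn &amp; back into & before punctuation removal
    tweet = re.sub(r"@\w+", " ", tweet) #Drop @mentions
    tweet = re.sub(r"http\S+", " ", tweet) #Drop links
    tweet = "".join(char for char in tweet if char not in string.punctuation) #Drop punctuation
    tweet = tweet.lower() #Lowercase so words match the stop word set
    return " ".join(word for word in tweet.split() if word not in stop) #Drop stop words (no trailing space here)

sample = "NEW: Tests &amp; cases of COVID-19 on the ship. @CDCDirector https://example.com/x"
print(cleanOne(sample)) #new tests cases covid19 ship
```

Two things stand out in the result. "COVID-19" becomes "covid19" because the hyphen is punctuation while the digits are kept, and a tweet made up only of mentions and links comes back as an empty string, which is why a blank-tweet check is worth running after cleaning.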

In [10]:
twTrain["cleanTweet"] = cleanTweets(twTrain["tweet"].copy()) #Clean the training tweets
twTrain.head() #Take a look at the dataset

Unnamed: 0,id,tweet,label,cleanTweet
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently reports 99031 deaths general dis...
1,2,States reported 1121 deaths a small rise from ...,real,states reported 1121 deaths small rise last tu...
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost uses pandemic...
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid testing laborator...
4,5,Populous states can generate large case counts...,real,populous states generate large case counts loo...


In [11]:
twValid["cleanTweet"] = cleanTweets(twValid["tweet"].copy()) #Clean the validation tweets
twValid.head() #Take a peek at the dataset

Unnamed: 0,id,tweet,label,cleanTweet
0,1,Chinese converting to Islam after realising th...,fake,chinese converting islam realising muslim affe...
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 caused bacterium virus treated aspirin
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praises donald trump’s c...
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 skys explains latest covid19 data governme...


In [12]:
twTest["cleanTweet"] = cleanTweets(twTest["tweet"].copy()) #Clean the testing tweets
twTest.head() #Take a peek at the dataset

Unnamed: 0,id,tweet,cleanTweet
0,1,Our daily update is published. States reported...,daily update published states reported 734k te...
1,2,Alfalfa is the only cure for COVID-19.,alfalfa cure covid19
2,3,President Trump Asked What He Would Do If He W...,president trump asked would catch coronavirus ...
3,4,States reported 630 deaths. We are still seein...,states reported 630 deaths still seeing solid ...
4,5,This is the sixth time a global health emergen...,sixth time global health emergency declared in...


---

# Check for Post-Processing Blank Tweets

In [13]:
print("Training: \n", twTrain.loc[twTrain["cleanTweet"] == ""]) #Check for Training Blank Tweets
print("Validation: \n", twValid.loc[twValid["cleanTweet"] == ""]) #Check for Validation Blank Tweets
print("Testing: \n", twTest.loc[twTest["cleanTweet"] == ""]) #Check for Testing Blank Tweets

Training: 
 Empty DataFrame
Columns: [id, tweet, label, cleanTweet]
Index: []
Validation: 
 Empty DataFrame
Columns: [id, tweet, label, cleanTweet]
Index: []
Testing: 
 Empty DataFrame
Columns: [id, tweet, cleanTweet]
Index: []


In [14]:
print(twTrain["tweet"][300]) #Print a more typical tweet example
print(twTrain["cleanTweet"][300]) #Print the tweet after processing to show link and stopword removal

NEW: There have been numerous #COVID19 outbreaks on recent cruise ship voyages. @CDCDirector has extended the previous No Sail Order to prevent the spread of COVID-19 among crew onboard. https://t.co/OTWJgCN8wQ https://t.co/sbHX4p907F
new numerous covid19 outbreaks recent cruise ship voyages extended previous sail order prevent spread covid19 among crew onboard 


There were no blank tweets created in any set. Tweets can become blank if they were just user names and links, so I just needed to make sure.

---

# Label Encoding

Because this is a binary classification problem, the get_dummies function in pandas effectively produces label-encoded values. The real column it creates holds 1 for real and 0 for not real, which here necessarily means fake. That is equivalent to label encoding.

In [15]:
dummyTrain = pd.get_dummies(twTrain["label"]) #Get the dummies for the training set
print(dummyTrain) #Show the dummies

      fake  real
0        0     1
1        0     1
2        1     0
3        0     1
4        0     1
...    ...   ...
6415     1     0
6416     1     0
6417     1     0
6418     1     0
6419     0     1

[6420 rows x 2 columns]


That real column shows the encoded values for real vs fake. I will be taking the real column as the encoded values.

In [16]:
twTrain["encodedLabel"] = dummyTrain["real"] #Get the encoded labels from the "real" dummies
twTrain.head() #Take a peek at the data

Unnamed: 0,id,tweet,label,cleanTweet,encodedLabel
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently reports 99031 deaths general dis...,1
1,2,States reported 1121 deaths a small rise from ...,real,states reported 1121 deaths small rise last tu...,1
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost uses pandemic...,0
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid testing laborator...,1
4,5,Populous states can generate large case counts...,real,populous states generate large case counts loo...,1


In [17]:
twValid["encodedLabel"] = pd.get_dummies(twValid["label"])["real"] #Get the encoded labels for the validation set
twValid.head() #Take a peek at the data

Unnamed: 0,id,tweet,label,cleanTweet,encodedLabel
0,1,Chinese converting to Islam after realising th...,fake,chinese converting islam realising muslim affe...,0
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...,0
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 caused bacterium virus treated aspirin,0
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praises donald trump’s c...,0
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 skys explains latest covid19 data governme...,1


---

# Tokenizing and Padding

In [18]:
trainClean = twTrain["cleanTweet"].copy() #Get the training clean tweets
testClean = twTest["cleanTweet"].copy() #Get the testing clean tweets
validClean = twValid["cleanTweet"].copy() #Get the validation clean tweets

trVaClean = pd.concat([trainClean, validClean], ignore_index = True) #Combine the training and validation tweets (Series.append was removed in pandas 2.0)
allCleanTweet = pd.concat([trVaClean, testClean], ignore_index = True) #Combine all of the tweets into one series
print(len(allCleanTweet)) #Print the length to show they are all together

10700


In [19]:
token = Tokenizer() #Initialize the tokenizer (set here so all of the datasets are in the same tokenizer)
token.fit_on_texts(allCleanTweet) #Fit the tokenizer to all of the tweets

In [20]:
#TokenizeTweet: turn the tweets into tokens for Keras to use
#Input: a set of tweets
#Output: a set of padded sequences representing the tweets
def tokenizeTweet(tweets):
    texts = token.texts_to_sequences(tweets) #Convert the tweets into sequences for keras to use
    texts = pad_sequences(texts, padding='post') #Pad with trailing zeros up to the longest tweet in this set
    
    return texts #Return the padded sequences
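
For readers who want to see what the tokenizer and padding steps produce, here is a rough plain-Python sketch of the idea. It is not the Keras implementation: `buildWordIndex` and `toPaddedSequences` are names I made up, and the sketch only mirrors the documented behaviour that word ids start at 1 in order of descending frequency and that 0 is reserved for padding.

```python
from collections import Counter #Word frequency counting

#buildWordIndex: number the words from 1 in order of descending frequency (0 is left for padding)
#Input: a list of cleaned tweets
#Output: a dictionary mapping each word to its id
def buildWordIndex(texts):
    counts = Counter(word for text in texts for word in text.split()) #Count every word
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())} #Most frequent word gets id 1

#toPaddedSequences: map words to ids, then add trailing zeros up to maxLen
#Input: a list of cleaned tweets, the word index, and the target length
#Output: a list of equal-length id lists
def toPaddedSequences(texts, wordIndex, maxLen):
    padded = []
    for text in texts:
        ids = [wordIndex[word] for word in text.split() if word in wordIndex] #Unknown words are skipped
        ids = ids[-maxLen:] #Keep the last maxLen ids if too long, like the Keras default truncating='pre'
        padded.append(ids + [0] * (maxLen - len(ids))) #Post-padding with zeros
    return padded

demo = ["alfalfa cure covid19", "covid19 caused bacterium virus"] #Two cleaned tweets from the tables above
index = buildWordIndex(demo)
print(index["covid19"]) #1, since it is the most frequent word
print(toPaddedSequences(demo, index, maxLen = 5)) #[[2, 3, 1, 0, 0], [1, 4, 5, 6, 0]]
```

One design point worth noting: the sketch takes an explicit `maxLen`, whereas `tokenizeTweet` above calls `pad_sequences` without `maxlen`, so each set is padded to its own longest tweet. Passing one shared `maxlen` for all three sets would keep their widths identical.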

In [21]:
texts = tokenizeTweet(twTrain["cleanTweet"].copy()) #Collect the tokenized tweet sequences
twTrain["tweetSequence"] = list(texts) #Add this data to the dataframe
twTrain.head() #Take a peek at the dataset

Unnamed: 0,id,tweet,label,cleanTweet,encodedLabel,tweetSequence
0,1,The CDC currently reports 99031 deaths. In gen...,real,cdc currently reports 99031 deaths general dis...,1,"[77, 177, 165, 10128, 7, 622, 5449, 73, 1293, ..."
1,2,States reported 1121 deaths a small rise from ...,real,states reported 1121 deaths small rise last tu...,1,"[9, 11, 10130, 7, 634, 311, 44, 1051, 2812, 9,..."
2,3,Politically Correct Woman (Almost) Uses Pandem...,fake,politically correct woman almost uses pandemic...,0,"[6967, 1495, 361, 448, 2075, 21, 2813, 6968, 2..."
3,4,#IndiaFightsCorona: We have 1524 #COVID testin...,real,indiafightscorona 1524 covid testing laborator...,1,"[19, 10131, 14, 12, 193, 16, 3457, 194, 40, 69..."
4,5,Populous states can generate large case counts...,real,populous states generate large case counts loo...,1,"[6970, 9, 5450, 403, 30, 1293, 469, 4, 2, 86, ..."


In [22]:
textsValid = tokenizeTweet(twValid["cleanTweet"].copy()) #Collect tokenized tweet sequences
twValid["tweetSequence"] = list(textsValid) #Add this data to the dataframe
twValid.head() #Take a peek at the dataset

Unnamed: 0,id,tweet,label,cleanTweet,encodedLabel,tweetSequence
0,1,Chinese converting to Islam after realising th...,fake,chinese converting islam realising muslim affe...,0,"[164, 6256, 3034, 17185, 664, 451, 3, 8085, 75..."
1,2,11 out of 13 people (from the Diamond Princess...,fake,11 13 people diamond princess cruise ship inti...,0,"[417, 477, 5, 4675, 4057, 1529, 1381, 17186, 3..."
2,3,"COVID-19 Is Caused By A Bacterium, Not Virus A...",fake,covid19 caused bacterium virus treated aspirin,0,"[1, 436, 2932, 22, 774, 2661, 0, 0, 0, 0, 0, 0..."
3,4,Mike Pence in RNC speech praises Donald Trump’...,fake,mike pence rnc speech praises donald trump’s c...,0,"[2111, 2021, 7433, 3246, 5261, 175, 1417, 1, 6..."
4,5,6/10 Sky's @EdConwaySky explains the latest #C...,real,610 skys explains latest covid19 data governme...,1,"[17188, 3382, 985, 91, 1, 29, 66, 2539, 61, 3,..."


In [23]:
textsTest = tokenizeTweet(twTest["cleanTweet"].copy()) #Collect tokenized tweet sequences
twTest["tweetSequence"] = list(textsTest) #Add this data to the dataframe
twTest.head() #Take a peek at the dataset

Unnamed: 0,id,tweet,cleanTweet,tweetSequence
0,1,Our daily update is published. States reported...,daily update published states reported 734k te...,"[41, 31, 84, 9, 11, 19400, 6, 4232, 4, 2, 6281..."
1,2,Alfalfa is the only cure for COVID-19.,alfalfa cure covid19,"[19401, 115, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,3,President Trump Asked What He Would Do If He W...,president trump asked would catch coronavirus ...,"[70, 47, 624, 135, 2383, 3, 271, 3, 0, 0, 0, 0..."
3,4,States reported 630 deaths. We are still seein...,states reported 630 deaths still seeing solid ...,"[9, 11, 19402, 7, 109, 966, 2386, 125, 887, 73..."
4,5,This is the sixth time a global health emergen...,sixth time global health emergency declared in...,"[3392, 69, 293, 15, 359, 1255, 525, 15, 2950, ..."


---

# Model Training

In [24]:
size = len(token.word_index) + 1 #Set the number of words for the size

tf.keras.backend.clear_session() #Clear any previous model building

epoch = 3 #Number of runs through the data
batchSize = 32 #The number of items in each batch
outputDimensions = 16 #Length of the embedding vector learned for each word
units = 256 #Number of LSTM units in each direction

model = tf.keras.Sequential([ #Start the sequential model, doing one layer after another in a sequence
    L.Embedding(size, outputDimensions, input_length = texts.shape[1]), #Map each word id to a vector of length outputDimensions
    L.Bidirectional(L.LSTM(units, return_sequences = True)), #Make it so the model looks both forward and backward at the data
    L.GlobalMaxPool1D(), #Take the max of each feature over all time steps
    L.Dropout(0.3), #Zero out 30% of the inputs during training to prevent overfitting
    L.Dense(64, activation="relu"), #Fully connected layer with 64 units
    L.Dropout(0.3), #Zero out 30% of the inputs during training to prevent overfitting
    L.Dense(3) #Output logits, one per class (real/fake only needs 2; the third unit never matches a label)
])


model.compile(loss = SparseCategoricalCrossentropy(from_logits = True), #Compile the model with a SparseCategorical loss function
              optimizer = 'adam', metrics = ['accuracy'] #Add an adam optimizer and collect the accuracy along the way
             )

history = model.fit(texts, twTrain["encodedLabel"], epochs = epoch, validation_split = 0, batch_size = batchSize) #Fit the model to the data

Epoch 1/3
Epoch 2/3
Epoch 3/3
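
Since the final layer outputs raw logits and the loss is created with `from_logits = True`, it may help to see what that loss computes for a single tweet. The following is a plain-Python sketch of the standard formula (softmax, then the negative log of the true class probability). The function names and the logit values are made up for illustration and do not come from the trained model.

```python
import math #Exponentials and logarithms

#softmax: turn raw logits into probabilities that sum to 1
def softmax(logits):
    top = max(logits) #Subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

#sparseCrossEntropy: negative log probability assigned to the true class
#Input: the logits for one example and its integer label
#Output: the loss for that example
def sparseCrossEntropy(logits, label):
    return -math.log(softmax(logits)[label])

logits = [0.2, 2.5, -1.0] #Hypothetical outputs of the last Dense layer for one tweet
print(softmax(logits)) #Class 1 (real) receives most of the probability
print(sparseCrossEntropy(logits, 1)) #Small loss when the true label is 1
print(sparseCrossEntropy(logits, 0)) #Larger loss when the true label is 0
```

"Sparse" here only means that the labels are integers such as 0 and 1 rather than one-hot vectors, which is why the `encodedLabel` column can be passed to `fit` directly.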


---

# Validate

In [25]:
predict = np.argmax(model.predict(textsValid), axis = -1) #Predict labels by taking the highest logit (predict_classes was removed in TensorFlow 2.6)
loss, accuracy = model.evaluate(textsValid, twValid["encodedLabel"]) #Get the loss and accuracy on the validation set

#Print the loss and accuracy
print("Validation Loss: ", loss)
print("Validation Accuracy: ", accuracy)



Validation Loss:  0.23154787719249725
Validation Accuracy:  0.9116822481155396


In [26]:
pd.set_option("display.max_colwidth", 1000) #Show as much of the tweet as possible

validLabel = twValid["encodedLabel"].copy() #Get the encoded labels (1 for real, 0 for fake)
validLabel = pd.DataFrame(validLabel) #Convert to a dataframe to hold more data
validLabel["predictions"] = predict #Add the predictions to the dataframe
validLabel["tweet"] = twValid["tweet"].copy() #Add the original tweet for comparison sake
validLabel.head() #Compare

Unnamed: 0,encodedLabel,predictions,tweet
0,0,0,Chinese converting to Islam after realising that no muslim was affected by #Coronavirus #COVD19 in the country
1,0,1,11 out of 13 people (from the Diamond Princess Cruise ship) who had intially tested negative in tests in Japan were later confirmed to be positive in the United States.
2,0,0,"COVID-19 Is Caused By A Bacterium, Not Virus And Can Be Treated With Aspirin"
3,0,0,Mike Pence in RNC speech praises Donald Trump’s COVID-19 “seamless” partnership with governors and leaves out the president's state feuds: https://t.co/qJ6hSewtgB #RNC2020 https://t.co/OFoeRZDfyY
4,1,1,6/10 Sky's @EdConwaySky explains the latest #COVID19 data and government announcement. Get more on the #coronavirus data here👇 https://t.co/jvGZlSbFjH https://t.co/PygSKXesBg


This is just in case someone is interested in going line by line. Of the rows showing in my dashboard (which is very cropped), the second tweet was flagged as real despite being fake. Its wording does seem a bit more reasonable than the others. It probably could have fooled me too.

Note: both this and the test predictions will display their full lists at the bottom of the notebook for ease of access
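
For anyone who would rather have a summary than read line by line, the label and prediction columns can be tallied into outcome counts. This is a small sketch using only the five rows displayed above. `confusionCounts` and its key names are illustrative, and the "other" bucket is there because a three-unit output layer could in principle predict a class that is neither 0 nor 1.

```python
#confusionCounts: tally the possible outcomes for binary labels (1 for real, 0 for fake)
#Input: the true labels and the predicted labels
#Output: a dictionary of outcome counts
def confusionCounts(labels, predictions):
    counts = {"realAsReal": 0, "fakeAsFake": 0, "fakeAsReal": 0, "realAsFake": 0, "other": 0}
    for label, prediction in zip(labels, predictions):
        if label == 1 and prediction == 1:
            counts["realAsReal"] += 1 #Real tweet correctly flagged as real
        elif label == 0 and prediction == 0:
            counts["fakeAsFake"] += 1 #Fake tweet correctly flagged as fake
        elif label == 0 and prediction == 1:
            counts["fakeAsReal"] += 1 #Fake tweet that fooled the model
        elif label == 1 and prediction == 0:
            counts["realAsFake"] += 1 #Real tweet wrongly flagged as fake
        else:
            counts["other"] += 1 #Any prediction outside 0 and 1
    return counts

labels = [0, 0, 0, 0, 1] #encodedLabel for the five rows displayed above
predictions = [0, 1, 0, 0, 1] #predictions for the same five rows

counts = confusionCounts(labels, predictions)
accuracy = (counts["realAsReal"] + counts["fakeAsFake"]) / len(labels)
print(counts)
print(accuracy) #0.8 on these five rows
```

Swapping in the full `encodedLabel` and `predictions` columns would give the same breakdown for the whole validation set.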

---

# Test Set Predictions

In [27]:
predictTest = np.argmax(model.predict(textsTest), axis = -1) #Predict labels by taking the highest logit (predict_classes was removed in TensorFlow 2.6)



In [28]:
tweetTest = twTest["tweet"].copy() #Get the original tweets
tweetTest = pd.DataFrame(tweetTest) #Put the tweets into a dataframe
tweetTest["prediction"] = predictTest #Add in the predictions
tweetTest = tweetTest[["prediction", "tweet"]] #Change column order to line up with the validation dataframe's order
tweetTest.head() #Show the tests

Unnamed: 0,prediction,tweet
0,1,Our daily update is published. States reported 734k tests 39k new cases and 532 deaths. Current hospitalizations fell below 30k for the first time since June 22. https://t.co/wzSYMe0Sht
1,0,Alfalfa is the only cure for COVID-19.
2,0,President Trump Asked What He Would Do If He Were To Catch The Coronavirus https://t.co/3MEWhusRZI #donaldtrump #coronavirus
3,1,States reported 630 deaths. We are still seeing a solid national decline. Death reporting lags approximately 28 days from symptom onset according to CDC models that consider lags in symptoms time in hospital and the death reporting process. https://t.co/LBmcot3h9a
4,1,This is the sixth time a global health emergency has been declared under the International Health Regulations but it is easily the most severe-@DrTedros https://t.co/JvKC0PTett


The ones displayed do seem to make sense in context. The "President Trump Asked What He Would Do If He Were To Catch The Coronavirus https://t.co/3MEWhusRZI #donaldtrump #coronavirus" tweet has less to do with the virus itself or truth claims, which is a bit odd, but the rest make sense. I will have all of the test and validation sets fully shown below for those who want to look deeper. An accuracy of about 91% on the validation set is very good, so I can reasonably expect the model to be fairly accurate on the test set as well.

---

# Tweets with Predictions: Full Data

## Validation

In [29]:
pd.set_option("display.max_rows", 10000) #Show as much as possible
validLabel #Show the validation set

Unnamed: 0,encodedLabel,predictions,tweet
0,0,0,Chinese converting to Islam after realising that no muslim was affected by #Coronavirus #COVD19 in the country
1,0,1,11 out of 13 people (from the Diamond Princess Cruise ship) who had intially tested negative in tests in Japan were later confirmed to be positive in the United States.
2,0,0,"COVID-19 Is Caused By A Bacterium, Not Virus And Can Be Treated With Aspirin"
3,0,0,Mike Pence in RNC speech praises Donald Trump’s COVID-19 “seamless” partnership with governors and leaves out the president's state feuds: https://t.co/qJ6hSewtgB #RNC2020 https://t.co/OFoeRZDfyY
4,1,1,6/10 Sky's @EdConwaySky explains the latest #COVID19 data and government announcement. Get more on the #coronavirus data here👇 https://t.co/jvGZlSbFjH https://t.co/PygSKXesBg
5,1,1,No one can leave managed isolation for any reason without returning a negative test. If they refuse a test they can then be held for a period of up to 28 days. ⁣ ⁣ On June the 16th exemptions on compassionate grounds have been suspended. ⁣ ⁣
6,1,1,#IndiaFightsCorona India has one of the lowest #COVID19 mortality globally with less than 2% Case Fatality Rate. As a result of supervised home isolation &amp; effective clinical treatment many States/UTs have CFR lower than the national average. https://t.co/QLiK8YPP7E
7,1,1,RT @WHO: #COVID19 transmission occurs primarily through direct indirect or close contact with infected people through their saliva and res…
8,0,0,News and media outlet ABP Majha on the basis of an internal memo of South Central Railway reported that a special train has been announced to take the stranded migrant workers home.
9,0,0,"“Church services can’t resume until we’re all vaccinated, says Bill Gates.”"


## Test

In [30]:
tweetTest #Show the test set

Unnamed: 0,prediction,tweet
0,1,Our daily update is published. States reported 734k tests 39k new cases and 532 deaths. Current hospitalizations fell below 30k for the first time since June 22. https://t.co/wzSYMe0Sht
1,0,Alfalfa is the only cure for COVID-19.
2,0,President Trump Asked What He Would Do If He Were To Catch The Coronavirus https://t.co/3MEWhusRZI #donaldtrump #coronavirus
3,1,States reported 630 deaths. We are still seeing a solid national decline. Death reporting lags approximately 28 days from symptom onset according to CDC models that consider lags in symptoms time in hospital and the death reporting process. https://t.co/LBmcot3h9a
4,1,This is the sixth time a global health emergency has been declared under the International Health Regulations but it is easily the most severe-@DrTedros https://t.co/JvKC0PTett
5,1,Low #vitaminD was an independent predictor of worse prognosis in patients with COVID-19. https://t.co/CGD6Kphn31 https://t.co/chtni8K4Jd
6,1,A common question: why are the cumulative outcome numbers smaller than the current outcome numbers? A: Most states report current but a few states report cumulative. They are apples and oranges and we don't feel comfortable filling in state cumulative boxes with current #s.
7,1,The government should consider bringing in any new national lockdown rules over Christmas rather than now says an Oxford University professor https://t.co/pdOls6cqoN
8,1,Our daily update is published. We’ve now tracked more than 2.9 million tests up 119k from yesterday. That's the smallest reported increase since April 1. Note that we can only track tests that a state reports. And not all states report all tests. See: https://t.co/PZrmH4bl5Y https://t.co/2588xW5yNm
9,1,Breakdown of testing: 4 air crew 97 hotel &amp; health staff in the facility 71 former hotel guests &amp; 2 exempted individuals who returned negative tests. And there are over 200 current guests in the Novotel who were swabbed on Tuesday &amp; Wednesday. Their results were negative.⁣
