# 1. Introduction<a name="introduction"></a>
### With this notebook, we will attempt to find the best model to classify these tweets. We'll start with some data cleaning and some visualizations to get a little more familiar with the data, and from there we'll explore some more traditional classification methods. After that, we'll train BERT on the tweets and compare the results with the other models.

In [None]:
!pip install chart_studio

In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import re 
import string
import requests
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from wordcloud import WordCloud
import chart_studio.plotly as py
import plotly.graph_objects as go
from plotly.offline import download_plotlyjs, plot, iplot

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import os
from os import path

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
import nltk
nltk.download('wordnet')

In [None]:
data_dict = {}
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        data_dict[filename] = os.path.join(dirname,filename)
        print(os.path.join(dirname, filename))
print(data_dict)

In [None]:
# Load in training data
train = pd.read_csv(data_dict['Corona_NLP_train.csv'], encoding = 'latin1')
# Copy training data
df = train.copy()
df.head()

In [None]:
# Load in test data
test_df = pd.read_csv(data_dict['Corona_NLP_test.csv'], encoding = 'latin1')
test_df.head()

# 2. Data Cleaning <a name="data_cleaning"></a>

In [None]:
# Check for nulls
df.info()

In [None]:
# Replace missing locations with the string 'None'
df['Location'] = df['Location'].fillna('None')
df.head()

In [None]:
# Build the set of English stopwords used by the cleaning function.
stops = set(stopwords.words('english'))

# Function that cleans tweets for classification. 
def clean_tweet(tweet):
    # Remove hyperlinks.
    tweet= re.sub(r'https?://\S+|www\.\S+','',tweet)
    # Remove html
    tweet = re.sub(r'<.*?>','',tweet)
    # Remove numbers (Do we want to remove numbers? Death toll?)
    tweet = re.sub(r'\d+','',tweet)
    # Remove mentions
    tweet = re.sub(r'@\w+','',tweet)
    # Remove punctuation
    tweet = re.sub(r'[^\w\s\d]','',tweet)
    # Remove whitespace
    tweet = re.sub(r'\s+',' ',tweet).strip()
    # Lowercase first so capitalized stopwords ("The", "And") are also caught
    tweet = tweet.lower()
    # Remove stopwords
    tweet = " ".join(word for word in tweet.split() if word not in stops)

    return tweet

In [None]:
# Check function
example2 = df['OriginalTweet'][1]
clean_tweet(example2)

In [None]:
# Apply text cleaning function to training and test dataframes.
df['newTweet'] = df['OriginalTweet'].apply(lambda x: clean_tweet(x))
test_df['newTweet'] = test_df['OriginalTweet'].apply(lambda x: clean_tweet(x))
df.head()

### Here, we'll define a couple functions to either stem or lemmatize the tweets. These methods will be compared during classification to see which one gives us the model with the greatest accuracy. 

In [None]:
def token_stem(tweet):
    tk = TweetTokenizer()
    stemmer = PorterStemmer()
    tweet = tk.tokenize(tweet)
    tweet = [stemmer.stem(word) for word in tweet]
    tweet = " ".join(tweet)
    return tweet

In [None]:
def token_lemma(tweet):
    tk = TweetTokenizer()
    lemma = WordNetLemmatizer()
    tweet = tk.tokenize(tweet)
    tweet = [lemma.lemmatize(word) for word in tweet]
    tweet = " ".join([word for word in tweet])
    return tweet

In [None]:
tweet = df['newTweet'][1]
tweet

In [None]:
print(token_stem(tweet))
print('\n')
print(token_lemma(tweet))

### See the differences in these techniques? Stemming converts words to their 'stems', while lemmatizing brings the words to their 'lemmas', or dictionary forms. 

In [None]:
df['stemTweet'] = df['newTweet'].apply(lambda x: token_stem(x))
df['lemmaTweet'] = df['newTweet'].apply(lambda x: token_lemma(x))
df.head()

In [None]:
# Create more useful labels for classification.
# We will take the original 5 possibilities and
# reduce them to 3, folding the "Extremely" variants into Positive and Negative.
def make_label(sentiment):
    
    label = ''
    if 'Positive' in sentiment: 
        label = 1
    if 'Negative' in sentiment:
        label = -1
    if 'Neutral' in sentiment:
        label = 0
    return label

In [None]:
# Apply make_label function to training and test dataframes.
df['label'] = df['Sentiment'].apply(lambda x: make_label(x))
test_df['label'] = test_df['Sentiment'].apply(lambda x: make_label(x))
df.head()

### Below are some of the common locations found in the tweets that will help us properly map more tweets to a particular country.

In [None]:
# Some frequent US locations
us_filters = ('New York', 'New York, NY', 'NYC', 'NY', 'Washington, DC', 'Los Angeles, CA',
             'Seattle, Washington', 'Chicago', 'Chicago, IL', 'California, USA', 'Atlanta, GA',
             'San Francisco, CA', 'Boston, MA', 'New York, USA', 'Texas, USA', 'Austin, TX',
              'Houston, TX', 'New York City', 'Philadelphia, PA', 'Florida, USA', 'Seattle, WA',
             'Washington, D.C.', 'San Diego, CA', 'Las Vegas, NV', 'Dallas, TX', 'Denver, CO',
             'New Jersey, USA', 'Brooklyn, NY', 'California', 'Michigan, USA', 'Minneapolis, MN',
             'Virginia, USA', 'Miami, FL', 'Texas', 'Los Angeles', 'United States', 'San Francisco',
             'Indianapolis, IN', 'Pennsylvania, USA', 'Phoenix, AZ', 'New Jersey', 'Baltimore, MD',
             'CA', 'FL', 'DC', 'TX', 'IL', 'MA', 'PA', 'GA', 'NC', 'NJ', 'WA', 'VA', 'PAK', 'MI', 'OH',
             'CO', 'AZ', 'D.C.', 'WI', 'MD', 'MO', 'TN', 'Florida', 'IN', 'NV', 'MN', 'OR','LA', 'Michigan',
             'CT', 'SC', 'OK', 'Illinois', 'Ohio', 'UT', 'KY', 'Arizona', 'Colorado')

# Frequent locations for other nations
uk_filters = ('England', 'London', 'london', 'United Kingdom', 'united kingdom',
              'England, United Kingdom', 'London, UK', 'London, England',
              'Manchester, England', 'Scotland, UK', 'Scotland', 'Scotland, United Kingdom',
              'Birmingham, England', 'UK', 'Wales')
india_filters = ('New Delhi, India', 'Mumbai', 'Mumbai, India', 'New Delhi', 'India', 
                 'Bengaluru, India')
australia_filters = ('Sydney, Australia', 'New South Wales', 'Melbourne, Australia', 'Sydney',
                     'Sydney, New South Wales', 'Melbourne, Victoria', 'Melbourne', 'Australia')
canada_filters = ('Toronto, Ontario', 'Toronto', 'Ontario, Canada', 'Toronto, Canada', 'Canada',
                  'Vancouver, British Columbia', 'Ontario', 'Victoria', 'British Columbia', 'Alberta',)
south_africa_filters = ('Johannesburg, South Africa', 'Cape Town, South Africa', 'South Africa')
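# Single-entry filters need a trailing comma: ('Ireland') is just a string,
# and `x in 'Ireland'` would be a substring test rather than tuple membership.
# The bare country name is listed too, because df['country'] only holds the
# text after the last comma of the location.
nigeria_filters = ('Lagos, Nigeria', 'Nigeria')
kenya_filters = ('Nairobi, Kenya', 'Kenya')
france_filters = ('Paris, France', 'France')
ireland_filters = ('Ireland',)
new_zealand_filters = ('New Zealand',)
pakistan_filters = ('Pakistan',)
malaysia_filters = ('Malaysia',)
uganda_filters = ('Kampala, Uganda', 'Uganda')
singapore_filters = ('Singapore',)
germany_filters = ('Germany', 'Deutschland')
switz_filters = ('Switzerland',)
uae_filters = ('United Arab Emirates', 'Dubai')
spain_filters = ('Spain',)
belg_filters = ('Belgium',)
phil_filters = ('Philippines',)
hk_filters = ('Hong Kong',)
ghana_filters = ('Ghana',)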
# These all have large counts. Need to be removed from rest of data
other_filters = ('None', 'Worldwide', 'Global', 'Earth', '??')
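
A small aside on filter tuples with a single entry. In Python, parentheses around one string do not make a tuple; only a trailing comma does. The names `as_string` and `as_tuple` below are made up for illustration and are not used elsewhere in the notebook.

```python
# A parenthesized string is still a string; the trailing comma makes a tuple.
as_string = ('Ireland')
as_tuple = ('Ireland',)

print(type(as_string).__name__)
print(type(as_tuple).__name__)

# `in` on a string is a substring test, so fragments of the name match.
print('land' in as_string)
# `in` on a tuple only matches whole entries.
print('land' in as_tuple)
print('Ireland' in as_tuple)
```

This matters for the country mapping because a substring test would also match short or partial location strings.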

In [None]:
df['country'] = df['Location'].apply(lambda x: x.split(",")[-1].strip() if ("," in x) else x)

In [None]:
df.head()

In [None]:
# Changing strings found with filters into 3-letter ISO country codes.
# The pairs are applied in order, one pass per filter, as before.
country_codes = [
    ('USA', us_filters), ('GBR', uk_filters), ('IND', india_filters),
    ('AUS', australia_filters), ('CAN', canada_filters), ('ZAF', south_africa_filters),
    ('KEN', kenya_filters), ('NGA', nigeria_filters), ('SGP', singapore_filters),
    ('FRA', france_filters), ('NZL', new_zealand_filters), ('PAK', pakistan_filters),
    ('MYS', malaysia_filters), ('IRL', ireland_filters), ('UGA', uganda_filters),
    ('DEU', germany_filters), ('CHE', switz_filters), ('ARE', uae_filters),
    ('ESP', spain_filters), ('BEL', belg_filters), ('PHL', phil_filters),
    ('GHA', ghana_filters), ('HKG', hk_filters), ('None', other_filters),
]
for code, filters in country_codes:
    df['country'] = df['country'].apply(lambda x, c=code, f=filters: c if x in f else x)

In [None]:
df['country'].value_counts()

# 3. Visualizations <a name="viz"></a>

In [None]:
# Top 30 because that's where the labeled countries end.
# rename_axis + reset_index(name=...) yields the 'Country' and 'Tweets' columns
# regardless of how value_counts names its output.
places_df = (df['country'].value_counts()
             .head(30)
             .rename_axis('Country')
             .reset_index(name = 'Tweets'))
# Remove 'None' location
places_df = places_df[places_df['Country'] != 'None']

In [None]:
data = dict(type='choropleth',
            colorscale = 'inferno',
            locations = places_df['Country'],
            z = places_df['Tweets'],
            #locationmode = 'USA-states',
            text = places_df['Tweets'],
            marker = dict(line = dict(color = 'rgb(255,255,255)',width = 2)),
            colorbar = {'title':"Number of Tweets"}
            ) 

layout = dict(title = 'Number of Tweets By Country',
              geo = dict(#scope='usa',
                         showlakes = False,
                         lakecolor = 'rgb(85,173,240)',
                         projection_type='equirectangular')
             )

choromap = go.Figure(data = [data],layout = layout)

In [None]:
iplot(choromap)

### The vast majority of tweets come from English-speaking countries, which makes sense, since these tweets are all in English. The largest contributor is the USA, followed by the UK and Canada.


In [None]:
# image courtesy of https://tse2.mm.bing.net/th?id=OIP.VLv_PpEOc8TDwuTNvj5hWQHaHa&pid=Api
#img = Image.open(data_dict['rona4.jpeg'])
mask = np.array(Image.open(data_dict['rona4.jpeg']))

# Positive WordCloud
pos_df = df[df['label'] == 1]
pos_text = pos_df['newTweet'].to_string(index = False)
pos_text = re.sub(r'\n','',pos_text)
pos_cloud = WordCloud(colormap = 'Greens', mask = mask).generate(pos_text)

# Neutral WordCloud
neut_df = df[df['label'] == 0]
neut_text = neut_df['newTweet'].to_string(index = False)
neut_text = re.sub(r'\n','', neut_text)
neut_cloud = WordCloud(colormap = 'Blues', mask = mask).generate(neut_text)

# Negative wordcloud
neg_df = df[df['label'] == -1]
neg_text = neg_df['newTweet'].to_string(index = False)
neg_text = re.sub(r'\n','', neg_text)
neg_cloud = WordCloud(colormap = 'Reds', mask = mask).generate(neg_text)

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = [30,20])
ax1.imshow(pos_cloud)
ax1.set_title('Positive Cloud', fontsize = 30)
ax1.axis('off')
ax2.imshow(neut_cloud)
ax2.set_title('Neutral Cloud', fontsize = 30)
ax2.axis('off')
ax3.imshow(neg_cloud)
ax3.set_title('Negative Cloud', fontsize = 30)
ax3.axis('off')

### Tried to use an image of the coronavirus for the mask, it certainly could have turned out better...
### 'Grocery store', 'price', 'supermarket', and 'online shopping' being frequent in positive, neutral, and negative tweets is interesting.  Some stand-out negative terms are 'panic buying' and 'toilet paper'. For positive, 'hand sanitizer' catches my attention. 

In [None]:
def ngram_df(corpus,nrange,n=None):
    vec = CountVectorizer(stop_words = 'english',ngram_range=nrange).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    total_list=words_freq[:n]
    df=pd.DataFrame(total_list,columns=['text','count'])
    return df

In [None]:
unigram_df = ngram_df(df['newTweet'],(1,1),20)
bigram_df = ngram_df(df['newTweet'],(2,2),20)
trigram_df = ngram_df(df['newTweet'],(3,3),20)

In [None]:
unigram_df['text'][::-1]

In [None]:
# Set both options in one call; a second sns.set() would reset font_scale to its default.
sns.set(font_scale = 1.3, rc={'figure.figsize':(11.7,8.27)})
sns.barplot(data = unigram_df, y = 'text', x = 'count')

### 'Prices' being the most frequent unigram after covid/coronavirus may be due to rising food prices and other various shortages.

In [None]:
# Set both options in one call; a second sns.set() would reset font_scale to its default.
sns.set(font_scale = 1.3, rc={'figure.figsize':(11.7,8.27)})
sns.barplot(data = bigram_df, y = 'text', x = 'count')

### Grocery store way outpacing covid bigrams is pretty interesting. Online shopping, hand sanitizer, toilet paper, and panic buying are all within the realm of expectation. 

In [None]:
# Set both options in one call; a second sns.set() would reset font_scale to its default.
sns.set(font_scale = 1.3, rc={'figure.figsize':(11.7,8.27)})
sns.barplot(data = trigram_df, y = 'text', x = 'count')

### Grocery store dominates these trigrams. People may be concerned about the safety of grocery shopping during a pandemic, and the health of the grocery store workers. 

# 4. Classification <a name="classification"></a>

In [None]:
# Set X and y.
X = df['newTweet']
y = df['label']

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
X[1]

### We'll try 4 different classifiers here: SVC, Logistic Regression, Naive Bayes, and Random Forest. We'll also test whether these models perform better with Term Frequency-Inverse Document Frequency (TF-IDF) vectors or with simple counts. A term's TF-IDF weight grows with the number of times it appears in a document (tweet) and shrinks with the number of documents (tweets) that contain it, which helps pick out the more distinctive words for classification. Additionally, we'll use cross-validation to gauge each model's accuracy and variance across multiple splits of the data.
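
To make the TF-IDF weighting concrete, here is a minimal hand-rolled sketch on three made-up tweets. It uses the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, which is the default form documented for scikit-learn's `TfidfVectorizer`, but it skips the vector normalization that the vectorizer also applies. The helper `tfidf` and the sample `docs` are assumptions for illustration only.

```python
import math
from collections import Counter

# Three made-up "tweets", already cleaned and lowercased.
docs = [
    "grocery store prices rising",
    "grocery store empty shelves",
    "stay home stay safe",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw term count times a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first tweet, 'prices' and 'rising' get a higher weight than 'grocery' and 'store', because the latter two also appear in the second tweet.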

In [None]:
clf = dict({'SVC': LinearSVC(max_iter = 5000),
            'Logisitc': LogisticRegression(max_iter = 5000),
            'NaiveBayes': MultinomialNB(),
            'RandomForest': RandomForestClassifier(),
           })

In [None]:
def make_models(clf, vectorizer, X_train, y_train, cv = 5):
    """Cross-validate each classifier in `clf`; return one accuracy row per fold."""
    results = []
    for model_name, classifier in clf.items():
        model = Pipeline([('vectorizer', vectorizer),
                          ('clf', classifier)])
        # cross_val_score clones and fits the pipeline on every fold,
        # so a separate model.fit call here would only add run time.
        scores = cross_val_score(model, X_train, y_train, cv = cv)
        for fold, score in enumerate(scores):
            results.append((model_name, fold, score))

    return pd.DataFrame(results, columns=['model_name', 'fold', 'accuracy'])

In [None]:
# Number of folds for K-fold cross validation
cv = 10

### Takes a good bit to run (over 30 minutes)...10 fold cross validation on 4 separate classifiers will take a while.
### Results are saved to 'pipe_results.csv' if you want to save time. 
### Skip down a few cells to see where I load the results if you don't want to run each model.
### Logistic and RandomForest take much longer than SVC and NaiveBayes.
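
To see where the run time comes from, here is a rough sketch of how k-fold cross-validation carves up the data, plus a count of the fits this section asks for. The helper `kfold_indices` is hypothetical and uses plain contiguous folds; `cross_val_score` does its own splitting and, for classifiers with an integer `cv`, stratifies the folds by class.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(kfold_indices(n_samples=23, n_folds=10))
print([len(val) for _, val in folds])

# Fits needed for the full comparison: 4 classifiers, 2 vectorizers,
# 3 preprocessing variants (none, stem, lemma), 10 folds each.
n_fits = 4 * 2 * 3 * 10
print(n_fits)
```

Each sample lands in exactly one validation fold, so every model is trained ten times on roughly 90% of the training data.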

In [None]:
tfidf_df = make_models(clf, TfidfVectorizer(), X_train, y_train, cv)
count_df = make_models(clf, CountVectorizer(), X_train, y_train, cv)
tfidf_df['vectorizer'] = 'tfidf'
count_df['vectorizer'] = 'count'
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
combined_df = pd.concat([tfidf_df, count_df])
combined_df['method'] = 'none'
combined_df.head(10)

In [None]:
# Set X and y.
X = df['stemTweet']
y = df['label']

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

stem_tfidf_df = make_models(clf, TfidfVectorizer(), X_train, y_train, cv)
stem_count_df = make_models(clf, CountVectorizer(), X_train, y_train, cv)

stem_tfidf_df['method'] = 'stem'
stem_tfidf_df['vectorizer'] = 'tfidf'
stem_count_df['method'] = 'stem'
stem_count_df['vectorizer'] = 'count'
stem_df = pd.concat([stem_tfidf_df, stem_count_df])

In [None]:
# Set X and y.
X = df['lemmaTweet']
y = df['label']

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lemma_tfidf_df = make_models(clf, TfidfVectorizer(), X_train, y_train, cv)
lemma_count_df = make_models(clf, CountVectorizer(), X_train, y_train, cv)

lemma_tfidf_df['vectorizer'] = 'tfidf'
lemma_tfidf_df['method'] = 'lemma'
lemma_count_df['vectorizer'] = 'count'
lemma_count_df['method'] = 'lemma'
lemma_df = pd.concat([lemma_tfidf_df, lemma_count_df])

In [None]:
all_df = pd.concat([lemma_df, stem_df, combined_df])

In [None]:
# Skip to here to avoid running the models
all_df = pd.read_csv(data_dict['pipe_results.csv'])

In [None]:
sns.set(font_scale = 1.4)
sns.catplot(x = 'model_name', y = 'accuracy', hue = 'method', height = 7,
            data = all_df, kind = 'box', col = 'vectorizer', palette = 'rainbow')

### Naive Bayes and RandomForest do much worse than Logistic and SVC, and make the boxplots fairly hard to look at. Let's drop them for better visuals. 

In [None]:
no_nb = all_df[all_df['model_name'] != 'NaiveBayes']
no_nb_rf = no_nb[no_nb['model_name'] != 'RandomForest']
sns.set(font_scale = 1.4)
sns.catplot(x = 'model_name', y = 'accuracy', hue = 'method', height = 7,
            data = no_nb_rf, kind = 'box', col = 'vectorizer', palette = 'rainbow')

### SVC does better when using tfidf, and Logistic Regression does better when using count. Stemming seems to do worse than lemmatization in terms of accuracy, although lemmatization has more outliers. The best results tend to come from using neither lemmatization nor stemming on the tweets.
### SVC using tfidf and Logistic with count have approximately the same median, but the SVC has less variance and a slightly more even distribution. 
### It should be noted that the differences in accuracies between the best performing models are very small, and are probably due to the random splits more than methodology.  Bearing that in mind, I would select the LinearSVC using tfidf and no lemma/stem because it takes MUCH less time to run than the logistic regression, and based on these results, it has less variance. 

In [None]:
accuracies = all_df.groupby(['model_name', 'method', 'vectorizer']).accuracy.mean()
stdDev = all_df.groupby(['model_name', 'method', 'vectorizer']).accuracy.std()
metrics_df = pd.concat([accuracies, stdDev], axis = 1, ignore_index = True)
# Second column is the standard deviation of the fold accuracies.
metrics_df.columns = ['mean_acc', 'std_acc']

In [None]:
metrics_df.sort_values(by = ['mean_acc','method'], ascending = False).head(10)

## Again, this displays just how small the accuracy differences are between the best models. For the sake of efficiency, an SVC using tfidf vectors is recommended. Let's fit one and explore the results more thoroughly.

In [None]:
# Set X and y.
X = df['newTweet']
y = df['label']

# Set vectorizer for feature extraction.
vectorizer = TfidfVectorizer()

# Split data into training and test sets to fit the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define model for predictions
model = Pipeline([('vectorizer',vectorizer),
                  ('clf', LinearSVC(max_iter = 5000))])

model.fit(X_train, y_train)

In [None]:
# Predictions on the held-out 20% split of the training data
val_preds = model.predict(X_test)

print('Accuracy:', accuracy_score(y_test, val_preds))
print('\n')
print(classification_report(y_test, val_preds))
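
For reference, the precision and recall columns in a classification report can be reproduced by hand. The sketch below uses made-up labels (`toy_true`, `toy_pred`) with the notebook's coding (-1 negative, 0 neutral, 1 positive) and a hypothetical helper `precision_recall`; it is not part of the pipeline above.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up true and predicted labels for eight tweets.
toy_true = [1, 1, 1, -1, -1, 0, 0, 1]
toy_pred = [1, 1, -1, -1, 1, 0, 1, 1]

for label in (-1, 0, 1):
    p, r = precision_recall(toy_true, toy_pred, label)
    print(f"label {label:2d}: precision={p:.2f} recall={r:.2f}")
```

Precision asks how many of the tweets predicted as a class really belong to it; recall asks how many of the tweets in a class the model found.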

### ~80% accuracy on the held-out 20% of the training data, not too bad. Precision and recall are significantly lower for neutral tweets than for positive or negative ones, possibly due to the lower support, but it could also be that neutral tweets are harder to classify. This model appears to be slightly better at predicting positive tweets than negative tweets.
### Now, we'll see how the model performs on the test data.

In [None]:
# Set X and y.
X2 = test_df['newTweet']
y2 = test_df['label']


test_preds = model.predict(X2)
print('Accuracy:', accuracy_score(y2, test_preds))
print('\n')
print(classification_report(y2, test_preds))

### The model does a little worse on the test data than on the held-out split of the training data. Let's see if we can improve the accuracy by tuning some parameters.

In [None]:
# Dictionary of parameters that can be tuned
model.get_params()

In [None]:
# GridSearchCV goes through specified parameter values and finds the best ones. 
from sklearn.model_selection import GridSearchCV

# We'll try a few different options here.
# max_df floats are proportions of documents; an integer 1 would mean "at most 1 document".
hyperparameters = { 'vectorizer__max_df': [1.0, 0.9, 0.95, .85],
                    'vectorizer__ngram_range': [(1,1), (1,2), (2,2),(2,3)],
                  }
model_tune = GridSearchCV(model, hyperparameters, cv=5)

# Fit and tune model
model_tune.fit(X_train, y_train)
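
As a rough guide to the cost of a grid search, the sketch below enumerates every combination in a grid of the same shape (four values for each of two parameters) and counts the fits for 5-fold cross-validation. The `grid` values are illustrative, with `max_df` written as floats so that each one reads as a proportion of documents.

```python
from itertools import product

# Illustrative grid: 4 values x 4 values.
grid = {
    'vectorizer__max_df': [1.0, 0.95, 0.9, 0.85],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2), (2, 3)],
}
n_folds = 5

# Every candidate is one choice per parameter.
names = list(grid)
candidates = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(candidates))
print(len(candidates) * n_folds)
print(candidates[0])
```

With refitting enabled, one more fit on the full training set follows once the best candidate is known.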

In [None]:
# These are the best parameters according to the GridSearch
model_tune.best_params_

In [None]:
# With refit=True (the default), GridSearchCV refits the pipeline on the best settings.
model_tune.best_estimator_

In [None]:
preds = model_tune.predict(X2)
print('Accuracy:', accuracy_score(y2, preds))
print('\n')
print(classification_report(y2, preds))

### Looks like our tuning didn't improve accuracy at all. Let's take a look at some of the mislabeled tweets.

In [None]:
test_df['pred_label'] = test_preds
test_df.head()

In [None]:
mislabel_df = test_df[test_df['label'] != test_preds]
mislabel_df.head()

In [None]:
# Work on a copy so the assignments below do not write into a slice of test_df.
mislabel_df = mislabel_df.copy()
mislabel_df['Location'] = mislabel_df['Location'].fillna('None')
mislabel_df['country'] = mislabel_df['Location'].apply(lambda x: x.split(",")[-1].strip() if ("," in x) else x)

# Changing strings found with filters into 3-letter ISO country codes.
# The pairs are applied in order, one pass per filter, as before.
country_codes = [
    ('USA', us_filters), ('GBR', uk_filters), ('IND', india_filters),
    ('AUS', australia_filters), ('CAN', canada_filters), ('ZAF', south_africa_filters),
    ('KEN', kenya_filters), ('NGA', nigeria_filters), ('SGP', singapore_filters),
    ('FRA', france_filters), ('NZL', new_zealand_filters), ('PAK', pakistan_filters),
    ('MYS', malaysia_filters), ('IRL', ireland_filters), ('UGA', uganda_filters),
    ('DEU', germany_filters), ('CHE', switz_filters), ('ARE', uae_filters),
    ('ESP', spain_filters), ('BEL', belg_filters), ('PHL', phil_filters),
    ('GHA', ghana_filters), ('HKG', hk_filters), ('None', other_filters),
]
for code, filters in country_codes:
    mislabel_df['country'] = mislabel_df['country'].apply(lambda x, c=code, f=filters: c if x in f else x)

In [None]:
mislabel_df['country'].value_counts()

### The mislabels mostly come from the USA, which is where the majority of the tweets are from anyway. Let's take a look at a few of the tweets themselves.

In [None]:
mislabel_df.head()

In [None]:
mislabel_df.tail()

In [None]:
# Neutral tweet
print(mislabel_df['OriginalTweet'][7])
print('\n')
print(mislabel_df['newTweet'][7])

### I wonder how the 'Â' character, and other words carrying similar marks, affected the predicted label. Words that may have contributed to the model predicting positive might be 'surgical' and 'healthworkers'.

In [None]:
# Negative tweet
print(mislabel_df['OriginalTweet'][15])
print('\n')
print(mislabel_df['newTweet'][15])

### Here the model predicted positive, and the 'Â' showed up again in a mislabel. The model may have seen 'free' and 'rights' and labeled it as positive, even though to a human reader, this is quite clearly a negative tweet.
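
One plausible source of the stray 'Â', not confirmed for this dataset, is text that was stored as UTF-8 but decoded as latin1. The sketch below shows the effect on a made-up string containing a non-breaking space; `original`, `misread`, and `repaired` are illustrative names.

```python
# A non-breaking space (U+00A0) is two bytes in UTF-8: 0xC2 0xA0.
original = "price\u00a0rise"
raw_bytes = original.encode("utf-8")

# Decoding those bytes as latin1 turns the 0xC2 byte into 'Â'.
misread = raw_bytes.decode("latin1")
print(repr(misread))

# Reversing the wrong decode recovers the original text.
repaired = misread.encode("latin1").decode("utf-8")
print(repaired == original)
```

The round trip only works when the bytes really were valid UTF-8; otherwise the final `decode` raises an error, so it would need a guard before being applied to a whole column.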

In [None]:
# Positive tweet
print(mislabel_df['OriginalTweet'][3779])
print('\n')
print(mislabel_df['newTweet'][3779])

### The model predicted negative for this tweet, while the true label is positive. Lower prices are indeed a positive thing for consumers. Perhaps the model took 'stuck', 'coronavirus', and 'covid' to be more negative than the other words in the tweet.

### A possible update to improve accuracy of this model may involve handling the accented letters in a better way. 
### However, if we want better accuracy, we should try BERT. We'll fit a BERT model and see how well it does.

# 5. BERT <a name="bert"></a>

In [None]:
import torch
from tqdm.notebook import tqdm

from transformers import BertTokenizer

from torch.utils.data import TensorDataset

import transformers
from transformers import BertForSequenceClassification

#import numpy as np
#import pandas as pd
#import re

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encode the classes for BERT. We'll keep using the 3 labels we made earlier.  
encoder = LabelEncoder()
df['encoded_sentiment'] = encoder.fit_transform(df['label'])
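
The encoding step is needed because the classification head expects class indices that start at 0, while the labels here are -1, 0, and 1. A plain-Python sketch of the mapping that a label encoder produces (sorted unique labels mapped to their positions) is shown below, using a made-up `labels` list.

```python
# Made-up sentiment labels in the notebook's coding.
labels = [1, -1, 0, 1, -1]

# Sorted unique labels define the class order; each label maps to its position.
classes = sorted(set(labels))
to_index = {label: i for i, label in enumerate(classes)}

encoded = [to_index[label] for label in labels]
# Indexing back into `classes` inverts the encoding.
decoded = [classes[i] for i in encoded]

print(classes)
print(encoded)
print(decoded == labels)
```

Keeping the class order around is what later allows predicted indices to be turned back into the original -1/0/1 labels.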

In [None]:
# Set X and y.
X = df['newTweet']
y = df['encoded_sentiment']

# Split data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased',do_lower_case=True)

In [None]:
# Tokenize the training tweets into fixed-length sequences of token IDs.
encoded_data_train = tokenizer.batch_encode_plus(
    X_train.tolist(),
    truncation = True,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    max_length=50,
    return_tensors='pt'
)

# Tokenize the held-out tweets the same way.
encoded_data_test = tokenizer.batch_encode_plus(
    X_test.tolist(),
    truncation = True,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    max_length=50,
    return_tensors='pt'
)
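
Padding and truncation give every tweet the same length, and the attention mask tells the model which positions hold real tokens. The toy helper below, `pad_or_truncate`, is a simplified stand-in: it uses made-up token ids and an assumed pad id of 0, and unlike a real tokenizer it does not preserve a closing special token when truncating.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# A short sequence gets padded; a long one gets cut.
short_ids, short_mask = pad_or_truncate([5, 6, 7, 8], max_length=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_length=6)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

A `max_length` of 50 is a trade-off: longer sequences keep more of each tweet but cost more memory and time per batch.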

In [None]:
# Get inputs and attention masks from previously encoded data. 
input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(y_train.values)

input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(y_test.values)

# Instantiate TensorDataset
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)

In [None]:
# Initialize the model. 
model = transformers.BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=3,
                                                      output_attentions=False,
                                                      output_hidden_states=False)

In [None]:
# DataLoaders for running the model
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=128)

dataloader_test = DataLoader(dataset_test, 
                                   sampler=SequentialSampler(dataset_test), 
                                   batch_size=128)

In [None]:
# Setting hyperparameters
from transformers import get_linear_schedule_with_warmup

# torch.optim.AdamW stands in for the AdamW class that transformers deprecated.
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=1e-5,
                              eps=1e-8)
                  
epochs = 10

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
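
A linear schedule with warmup scales the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero. The function below, `linear_warmup_decay`, is a hand-written sketch of that shape for illustration, not the library's code; with zero warmup steps, as configured above, the rate simply decays from its starting value.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Ramp up from 0 to 1 over the warmup steps.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 down to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 1e-5
total_steps = 10
for step in (0, 5, 10):
    print(step, base_lr * linear_warmup_decay(step, 0, total_steps))
```

Because the schedule is stepped once per batch, the total number of steps is the number of batches per epoch times the number of epochs.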


In [None]:
from sklearn.metrics import f1_score

def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
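
The weighted F1 used here averages the per-class F1 scores, weighting each class by its number of true examples. A small hand-computed sketch on made-up logits follows; `weighted_f1`, `toy_logits`, and `toy_labels` are illustrative names, and the argmax step mirrors how logits are turned into class predictions.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Made-up logits for four tweets over three classes.
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.3, 0.2, 0.9], [1.1, 0.4, 0.2]]
toy_labels = [0, 1, 2, 2]

# Predicted class = index of the largest logit in each row.
toy_preds = [max(range(len(row)), key=row.__getitem__) for row in toy_logits]
print(toy_preds)
print(weighted_f1(toy_labels, toy_preds))
```

Weighting by support means the larger classes dominate the score, which is worth keeping in mind when the neutral class is the smallest.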

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
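# Fall back to the CPU when no GPU is available (training will be much slower).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')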

In [None]:
model.to(device)

for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        # The batch tensors are already on the device after the line above.
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # The loss is already averaged over the batch; len(batch) is the tuple length (3), not the batch size.
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')


In [None]:
def evaluate(dataloader_test):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_test:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_test) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
val_loss, predictions, true_vals = evaluate(dataloader_test)
val_f1 = f1_score_func(predictions, true_vals)

In [None]:
print('Val Loss = ', val_loss)
print('Val F1 = ', val_f1)

In [None]:
encoded_classes = encoder.classes_
predicted_category = [encoded_classes[np.argmax(x)] for x in predictions]
true_category = [encoded_classes[x] for x in true_vals]

In [None]:
# accuracy_score was imported from sklearn.metrics at the top of the notebook.
print('Accuracy Score = ', accuracy_score(true_category, predicted_category))

In [None]:
print(classification_report(true_category, predicted_category))

### 87% accuracy is about 7% better than what we got with the SVC.

## Now, we'll use the test dataset to evaluate BERT.

In [None]:
# Reuse the already-fitted encoder so the label-to-index mapping matches the one used in training.
test_df['encoded_sentiment'] = encoder.transform(test_df['label'])

# Set X and y.
X = test_df['newTweet']
y = test_df['encoded_sentiment']

encoded_data_test = tokenizer.batch_encode_plus(
    X, 
    truncation = True,
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    max_length=50, 
    return_tensors='pt'
)

input_ids_test = encoded_data_test['input_ids']
attention_masks_test = encoded_data_test['attention_mask']
labels_test = torch.tensor(y.values)

# PyTorch TensorDataset instance
dataset_test = TensorDataset(input_ids_test, attention_masks_test, labels_test)

dataloader_test = DataLoader(dataset_test,
                             sampler=SequentialSampler(dataset_test),
                             batch_size=128)

In [None]:
val_loss, predictions, true_vals = evaluate(dataloader_test)
val_f1 = f1_score_func(predictions, true_vals)

In [None]:
encoded_classes = encoder.classes_
predicted_category = [encoded_classes[np.argmax(x)] for x in predictions]
true_category = [encoded_classes[x] for x in true_vals]

# accuracy_score was imported from sklearn.metrics at the top of the notebook.
print('Accuracy Score = ', accuracy_score(true_category, predicted_category))
print('\n')
print(classification_report(true_category, predicted_category))

### On the actual test data, the model reaches 85% accuracy, which is ~6% better than the LinearSVC did on this data. Additional hyperparameter tuning could probably squeeze out some more accuracy, since I did not experiment with the learning rate. Another option is to feed the model the lemmatized or stemmed tweets; this run used the cleaned tweets rather than the stemmed or lemmatized ones.
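
### One simple way to approach the learning-rate tuning mentioned above is to fine-tune once per candidate value and keep the one with the best validation score. The sketch below shows only the selection loop. `pick_best_lr` and `synthetic_score` are hypothetical names, and `synthetic_score` returns made-up numbers so the example runs on its own; in practice it would be replaced by a function that fine-tunes the model at that learning rate and returns the validation F1.

```python
def pick_best_lr(candidate_lrs, train_and_score):
    """Run train_and_score once per learning rate and return the best one."""
    results = {lr: train_and_score(lr) for lr in candidate_lrs}
    best_lr = max(results, key=results.get)
    return best_lr, results


def synthetic_score(lr):
    # Placeholder only: a synthetic number, NOT a real model result.
    # Swap in a function that fine-tunes at `lr` and returns validation F1.
    return 1.0 - abs(lr - 3e-5) * 1e4


# Candidate values in the range commonly tried when fine-tuning BERT.
best_lr, results = pick_best_lr([2e-5, 3e-5, 5e-5], synthetic_score)
print(best_lr)
```

### Each candidate should start from a freshly loaded pretrained model, and the comparison should use a validation split rather than the test set, so the final test score stays an honest estimate.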

# 6. Conclusion <a name="conclusion"></a>

### Unsurprisingly, BERT performs better than an SVC or logistic regression. However, it was surprising to see lemmatization and stemming perform a bit worse than just leaving the words alone. It was also curious that the Random Forest Classifier lagged a bit behind the SVC and the logistic regression. Among the traditional classifiers, we chose the SVC with TF-IDF because it was the most accurate and runs much faster than the logistic regression on this data. But when it comes to raw accuracy, BERT is decidedly better than the SVC.

### To further increase prediction accuracy, one should try tuning the hyperparameters of BERT, or testing other pretrained HuggingFace transformers on this dataset. 

### A big thank you to all these notebooks:

https://www.kaggle.com/immvab/transformers-covid-19-tweets-sentiment-analysis/comments

https://www.kaggle.com/arushi2/covid19-tweets-geo-and-sentiment-analysis#data

https://www.kaggle.com/datatattle/battle-of-ml-classification-models

https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines

