## Introduction
Hello there!

In this project, the goal is to build two models, `Logistic Regression` and `LSTM`, that can detect and classify the sentiment (`positive`, `negative` or `neutral`) of COVID-19-related tweets. We'll also do some exploratory data analysis along the way.

The dataset used can be found [here](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification)

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# data preprocessing
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords


# model building
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D

# metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

import re

%matplotlib inline
pd.options.display.max_rows = 300

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Understanding the Data

In [None]:
# load the datasets
train = pd.read_csv("../input/covid-19-nlp-text-classification/Corona_NLP_train.csv", encoding="latin-1")
test = pd.read_csv("../input/covid-19-nlp-text-classification/Corona_NLP_test.csv")

In [None]:
train.head()

In [None]:
test.head()

In [None]:
print(train.info(), '\n')
print(test.info())

    - UserName and ScreenName are randomly generated fields used only for unique identification. Their values wouldn't impact our model, so we will drop both columns
    - Location is the only column with missing values
    - TweetAt, which contains the dates the tweets were made, has an object datatype - we'll convert it to a datetime datatype
    - Sentiment is the target variable

#### Duplicates and Null values

In [None]:
# drop duplicate entries
train.drop_duplicates(inplace= True)
test.drop_duplicates(inplace=True)

In [None]:
# drop UserName and ScreenName columns
train.drop(['UserName', 'ScreenName'], axis=1, inplace=True)
test.drop(['UserName', 'ScreenName'], axis=1, inplace=True)

In [None]:
# show columns with missing values
plt.figure(figsize=(14,4))
for index, df in enumerate([train, test]):
    plt.subplot(1,2, index+1)
    sns.heatmap(df.isnull(), cmap='viridis', yticklabels= False).set_title('train' if index==0 else 'test')

plt.show()

In [None]:
# check number of missing values
print(train.isnull().sum())

In [None]:
# check Location
print(train.Location.value_counts(normalize= True, dropna= False)[:30] *100)

- About 21% of the Location data is missing
- The Location values include both cities and countries and do not follow a consistent pattern, which makes the column quite challenging to clean. However, I'll tidy it up a bit by replacing cases such as `"London, England"` with just `"London"` and `"Los Angeles, CA"` with `"Los Angeles"`

In [None]:
train.Location = train.Location.str.split(',').str[0]
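
For illustration, here is the same split-and-keep-the-first-part idea applied to a few made-up location strings in plain Python. This is only a sketch; the values are hypothetical and the `None` entry stands in for a missing location, which is left untouched:

```python
# made-up location values, including a missing one
locations = ["London, England", "Los Angeles, CA", "India", None]

# keep only the text before the first comma; leave missing values as they are
cleaned = [loc.split(",")[0] if isinstance(loc, str) else loc for loc in locations]

print(cleaned)
```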

## Exploratory Data Analysis

### Sentiment

In [None]:
print(train.Sentiment.value_counts(normalize=True) * 100)

To make analysis easier, let's rename the "Extremely Positive", "Extremely Negative" labels to "Positive" and "Negative" respectively

In [None]:
# replace "Extremely Positive/Negative" with "Positive/Negative"
train["Sentiment"] = train["Sentiment"].str.replace("Extremely Negative", "Negative")
train["Sentiment"] = train["Sentiment"].str.replace("Extremely Positive", "Positive")

test['Sentiment'] = test.Sentiment.str.replace('Extremely Positive', 'Positive')
test['Sentiment'] = test.Sentiment.str.replace('Extremely Negative', 'Negative')

In [None]:
# plot of tweet sentiment distribution
plt.figure(figsize=(6,6))

sentiments = train.Sentiment.value_counts()

sns.set_palette("coolwarm")
plt.pie(sentiments,
        labels= sentiments.index,
        autopct='%1.1f%%', startangle=80, 
        pctdistance=0.82, textprops={"fontsize": 14})

centreCircle = plt.Circle((0,0),0.65,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centreCircle)

plt.tight_layout()
plt.title("How much of our tweet data is +ve -ve or neutral?", x=0.53, fontsize= 16)

plt.show()

    - The tweets are mostly either positive or negative, with just about 20% of the tweet data classified as neutral

### Location
Let's check out the places around the world that tweeted the most about COVID. We'll also check out the mood of these tweets.

In [None]:
# plot of top cities/countries
plt.style.use("fivethirtyeight")

plt.figure(figsize=(16, 6))
location = sns.countplot(x= 'Location', data= train, hue="Sentiment", order=train.Location.value_counts()[:10].index)
location.set_title("Which places tweeted the most about COVID-19?", y=1.05)

def axis_labels(ax):
    ax.set_ylabel("Number of tweets")
    ax.set_xlabel("")

axis_labels(location)

plt.show()

    - Most covid-related tweets seem to come from four major countries - the United Kingdom, the USA, Canada and India.
    - London and New York lead the way in terms of cities that tweeted the most about covid19
    - We also observe a pattern: there are more positive tweets than negative in all cities/countries except England. This follows the general trend in our data, as we have more positive tweets than negative and more negative ones than neutral

### Tweet At
As `TweetAt` contains the dates the tweets in our data were made, let's find out:
- the period over which our tweet data was gathered
- the most frequent day(s) of the week and month(s) in which users made covid-related tweets

For the latter, we need to create new day and month columns

In [None]:
# convert the TweetAt column to datetime
# NOTE: if the raw strings are day-first (dd-mm-yyyy), pass dayfirst=True here,
# otherwise a value such as 01-04-2020 may be read as January 4 instead of April 1
train['TweetAt'] = pd.to_datetime(train['TweetAt'])

# create day of the week and month columns
# dt.month is 1-based (January == 1), so use the built-in name accessors
# instead of a hand-written 0-based lookup table
train['day'] = train['TweetAt'].dt.day_name()
train['month'] = train['TweetAt'].dt.month_name()

In [None]:
print(f"First tweet: {train['TweetAt'].dt.date.min()}, Last tweet: {train['TweetAt'].dt.date.max()}")

    Our tweet data, which contains covid-related tweets made only in 2020, was collected over an 11-month period (January 4, 2020 through to December 4, 2020)
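
The range above depends on how the day and month are read from the raw `TweetAt` strings. As a minimal sketch, using a made-up date string rather than a row from the dataset, the same text yields two different dates depending on the assumed field order:

```python
from datetime import datetime

# hypothetical raw value, chosen because both readings are valid dates
raw = "01-04-2020"

day_first = datetime.strptime(raw, "%d-%m-%Y")    # read as 1 April 2020
month_first = datetime.strptime(raw, "%m-%d-%Y")  # read as 4 January 2020

print("day-first:  ", day_first.date())
print("month-first:", month_first.date())
```

If the file turns out to be day-first, passing `dayfirst=True` (or an explicit `format`) to `pd.to_datetime` keeps the two fields from being swapped.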

In [None]:
plt.figure(figsize=(14, 6))
days = sns.countplot(x="day", data=train)
days.set_title("What days were the most covid-related tweets made in 2020?", 
                                             y=1.05)

def add_labels(ax, space):
    for rect in ax.patches:
        width = rect.get_width()
        height = rect.get_height()
        total = train.shape[0]
        
        ax.text(rect.get_x() + width/2,
               height + space,
               '{}%'.format(int(np.round(height/total*100))),
                ha="center")

add_labels(days, 100)
axis_labels(days)
plt.show()

    - About 35% of the tweets were made on a Tuesday/Wednesday, with Sunday having the least engagement

In [None]:
plt.figure(figsize=(14, 6))
months = sns.countplot(x="month", data=train)
months.set_title("Which months in 2020 were the most covid-related tweets made?", 
                                             y=1.05)

add_labels(months, 300)
axis_labels(months)
plt.show()

- A whopping 64% of the tweets were made in April!
- This could perhaps be because it was around this period that the number of cases and the death toll first skyrocketed.
- According to the timeline of COVID-19 events in this [article](https://www.thinkglobalhealth.org/article/updated-timeline-coronavirus), the week of March 30–April 4 saw worldwide coronavirus cases exceed one million, with millions of Americans filing for unemployment and major sporting events such as the Wimbledon tennis tournament getting canceled for the first time in a very long time. These were very serious and sudden events that shook the world and hence got people talking and tweeting a lot.

### Tweets

In [None]:
# check out the first ten tweets
def tweets(df, n, col_name="OriginalTweet"):
    for tweet_no, tweet in enumerate(df[col_name][:n]):
        print(tweet_no+1, tweet, '\n')
        print("*" * 60, '\n')
        
tweets(train, 10)

    The tweet data looks really unclean (well .. as expected) - but before proceeding to prepare our tweet text for modelling, let's explore the most frequent hashtags and top mentions in our data

#### Most common #hashtags

In [None]:
from wordcloud import WordCloud,STOPWORDS,ImageColorGenerator
from PIL import Image

In [None]:
def create_wordCloud(pattern):
    """create word cloud visualization
    
    arguments:
        pattern (str): regex pattern to extract certain text from the data
    """
    data = train["OriginalTweet"].str.extractall(pattern)[0].value_counts()

    data.index = data.index.map(str)                                                       # convert data index to string
    data_wc = WordCloud(max_words = 500, colormap='Dark2_r', 
                        background_color='white').generate_from_frequencies(data)          # generate word cloud

    # display the cloud
    fig = plt.figure()
    fig.set_figwidth(12) # set width
    fig.set_figheight(12) # set height

    plt.imshow(data_wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()
    
# create word cloud of the most frequently used hashtags
hashtag = r"(#\w+)"
create_wordCloud(hashtag)

#### Most Mentions

In [None]:
# create word cloud of most frequent mentions
mentions = r"(@\w+)"
create_wordCloud(mentions)

    - Former US president Donald Trump is unsurprisingly among the most tagged persons. UK Prime Minister Boris Johnson and Indian Prime Minister Narendra Modi also gathered a number of mentions
    - CNN, BBCNews and SkyNews are the most tagged news channels, with Piers Morgan being the most tagged TV personality
    - Retail companies such as Tesco, Walmart and Morrisons got a lot of mentions too

## Text Preprocessing
The next step is to clean and prepare our tweet data for modeling. So, we proceed to:
- Remove all hashtags, mentions, links and numbers
- Remove stop words (common words like "the", "a" etc)
- Tokenize and vectorize words, i.e. convert tweet words to numbers

But first, we combine the training and test dataframes, then keep just the features relevant to our model building - `OriginalTweet` and `Sentiment`

In [None]:
# combine train and test dataframes
combined = pd.concat([train, test], ignore_index= True)

# select relevant features: tweet and Sentiments
combined = combined.loc[:, ["OriginalTweet", "Sentiment"]]

# load stop words (requires the NLTK stopwords corpus and tokenizer data to be available)
stop_word = set(stopwords.words('english'))       # a set makes the membership test below fast

def clean_tweet(text):
    text = re.sub(r"#\w+", " ", text)            # remove hashtags
    text = re.sub(r"@\w+", " ", text)            # remove mentions
    text = re.sub(r"http\S+", " ", text)         # remove urls
    text = re.sub(r"[^a-zA-Z]", " ", text)       # remove non-letters (digits, punctuation etc)
    text = text.lower().strip()                  # convert tweet to lowercase and strip
    
    text = " ".join(word for word in text.split() if word not in stop_word)   # remove stop words
    
    text = " ".join(nltk.word_tokenize(text))    # tokenize text
      
    return text

# clean OriginalTweet and assign the result to a new "tweet" column
combined['tweet'] = combined['OriginalTweet'].apply(clean_tweet)
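
To see what the regex steps do, here is a small walk-through on a made-up tweet (not taken from the dataset). It mirrors only the `re.sub` steps of `clean_tweet` and leaves out stop-word removal and NLTK tokenization so that it runs without extra downloads:

```python
import re

# hypothetical tweet text for illustration
sample = "Panic buying at @Tesco again!! 50% of shelves empty #coronavirus https://t.co/abc123"

step1 = re.sub(r"#\w+", " ", sample)        # drop hashtags
step2 = re.sub(r"@\w+", " ", step1)         # drop mentions
step3 = re.sub(r"http\S+", " ", step2)      # drop urls
step4 = re.sub(r"[^a-zA-Z]", " ", step3)    # keep letters only
cleaned = " ".join(step4.lower().split())   # lowercase and collapse whitespace

print(cleaned)
```

Note that the digits and the percent sign disappear along with the punctuation, so a figure like "50%" leaves no trace in the cleaned text.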

In [None]:
# print first few tweets to confirm the data is rid of non-word characters
tweets(combined, 7, "tweet")

In [None]:
# most common words in our tweet data
corpus = ",".join(combined.tweet)
wc_stopwords = set(STOPWORDS)   # separate name so the nltk `stopwords` import is not shadowed
tweet_wc = WordCloud(max_words = 500, colormap='Dark2_r', 
                        background_color='white', stopwords=wc_stopwords).generate(corpus)   

# display the cloud
fig = plt.figure()
fig.set_figwidth(10) # set width
fig.set_figheight(10) # set height

plt.imshow(tweet_wc, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
# encode Sentiment label values
le = LabelEncoder()
combined.Sentiment = le.fit_transform(combined.Sentiment)

# split the combined data back into the original training and test sets
train = combined[: len(train)]
test = combined[len(train):].reset_index(drop=True)

# separate the test set into features and labels
X_test = test.tweet
y_test = test.Sentiment


# split training set into training and validation set
X_train, X_val, y_train, y_val = train_test_split(train.tweet,
                                                    train.Sentiment, test_size=0.2,random_state=42)

In [None]:
# initialize vectorizer
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1,2), min_df=5).fit(X_train)

X_train = vectorizer.transform(X_train)
X_val = vectorizer.transform(X_val)
X_test = vectorizer.transform(X_test)
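
With `ngram_range=(1,2)` the vectorizer counts single words as well as pairs of adjacent words. The sketch below imitates that counting in plain Python on a made-up sentence. It is a simplification: the helper `ngram_counts` is hypothetical and does not reproduce `CountVectorizer`'s tokenization, stop-word filtering or `min_df` pruning.

```python
from collections import Counter

def ngram_counts(text, n_min=1, n_max=2):
    """Count all word n-grams of length n_min..n_max in a whitespace-split text."""
    words = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

# made-up cleaned tweet
counts = ngram_counts("stock food stock food online")
print(counts["stock"], counts["stock food"], counts["food online"])
```

Each distinct unigram or bigram becomes one column of the feature matrix, which is why adding bigrams grows the vocabulary quickly and why a `min_df` cutoff is useful.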

## Modeling

### Logistic Regression

In [None]:
# initialize model and fit it on the training data
logmodel = LogisticRegression(max_iter=10000)
logmodel.fit(X_train, y_train)

# check 5-fold cross-validated accuracy on the training data
cross_val_score(logmodel, X_train, y_train, cv=5, verbose=1, n_jobs=-1).mean()

In [None]:
# extract labels from encoder
labels = list(le.classes_)

In [None]:
# make predictions
val_pred = logmodel.predict(X_val)
test_pred = logmodel.predict(X_test)

# print classification report (true labels first, predictions second)
print(classification_report(y_val, val_pred, target_names= labels), '\n')
print(classification_report(y_test, test_pred, target_names= labels))
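
`classification_report` takes the true labels as its first argument and the predictions as its second, and precision and recall are defined relative to that order. As a minimal sketch with made-up labels (the helper `precision_recall` is hypothetical, written only for this illustration), swapping the two arrays swaps the two metrics for a class:

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from paired label lists."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # how often the class was predicted
    actual = sum(t == label for t in y_true)      # how often the class truly occurs
    return tp / predicted, tp / actual

# made-up labels for illustration only
y_true_demo = ["Positive", "Positive", "Positive", "Negative"]
y_pred_demo = ["Positive", "Negative", "Negative", "Negative"]

print(precision_recall(y_true_demo, y_pred_demo, "Positive"))  # correct order
print(precision_recall(y_pred_demo, y_true_demo, "Positive"))  # arguments swapped
```

Accuracy is unaffected by the order, but the per-class precision and recall columns trade places.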

In [None]:
# check validation and test accuracy
print('accuracy score on validation set: ', accuracy_score(y_val, val_pred))
print('accuracy score on test set:', accuracy_score(y_test, test_pred))

    The model performs about the same on both the validation set and the given test dataset
    
Next, we check out how the LSTM model will perform on our data

### LSTM

In [None]:
max_features = 20000                                            # maximum number of words to take from corpus
tokenizer = Tokenizer(num_words=max_features, split=' ')            # initialize tokenizer
tokenizer.fit_on_texts(train['tweet'].values)                   # fit tokenizer on training data


# longest tweet measured in tokens (len(x) on the raw string would count characters instead)
max_len = int(train.tweet.str.split().str.len().max())
vocab_length = len(tokenizer.word_index)

In [None]:
print("Number of unique tokens:", vocab_length)
print("Maximum sequence length:", max_len)

In [None]:
# get text sequences from training and test dataframes
train_x = tokenizer.texts_to_sequences(train['tweet'].values)
X_test = tokenizer.texts_to_sequences(test['tweet'].values)


# adding padding of zeros to obtain uniform length for all sequences
train_x = pad_sequences(train_x, maxlen= max_len)
X_test = pad_sequences(X_test, maxlen= max_len)

# encode sentiment label values
train_y_encoded = pd.get_dummies(train['Sentiment']).values
y_test_encoded = pd.get_dummies(test['Sentiment']).values


# split training data 
X_train, X_val, Y_train, y_val = train_test_split(train_x, train_y_encoded, test_size = 0.33, random_state = 42)
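
By default `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, also truncates from the front. A plain-Python sketch of that default behaviour, using made-up token ids (the helper `pad_pre` is hypothetical and only mimics the defaults):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                                   # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)    # fill the front with zeros
    return padded

result = pad_pre([[5, 8], [3, 1, 4, 1, 5, 9]], maxlen=4)
print(result)
```

Padding at the front keeps the real tokens next to the end of the sequence, which is the part an LSTM reads last before producing its output.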

In [None]:
print(train_x.shape, X_test.shape)
print(train_y_encoded.shape, y_test_encoded.shape)

In [None]:
print(X_train.shape,Y_train.shape)
print(X_val.shape, y_val.shape)

#### Model Building

In [None]:
embed_dim = 16
lstm_out = 196

model = Sequential()
model.add(Embedding(vocab_length + 1, embed_dim, input_length = max_len))   # +1: Tokenizer indices start at 1, 0 is the padding value
model.add(SpatialDropout1D(0.4))
model.add(LSTM(lstm_out, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3,activation='softmax'))

model.compile(loss = 'categorical_crossentropy', 
              optimizer='adam',
              metrics = ['accuracy'])
print(model.summary())
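
The parameter counts printed by `model.summary()` can be cross-checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the layer sizes defined above. The vocabulary size is a placeholder value, since the real one depends on the fitted tokenizer, and the helper functions are hypothetical names used only here:

```python
def embedding_params(input_dim, embed_dim):
    # one embed_dim-sized vector per vocabulary index
    return input_dim * embed_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

embed_dim, lstm_out, n_classes = 16, 196, 3
vocab_size = 20000  # placeholder, not the dataset's actual vocabulary size

print("embedding:", embedding_params(vocab_size, embed_dim))
print("lstm:     ", lstm_params(embed_dim, lstm_out))
print("dense:    ", dense_params(lstm_out, n_classes))
```

Dropout layers add no trainable parameters, so they do not appear in this tally.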

#### Model Training

In [None]:
model.fit(X_train, Y_train, 
          validation_data=(X_val, y_val), 
          epochs=5, batch_size= 32, 
          shuffle=True)

#### Model Evaluation

In [None]:
# evaluating model on test dataset
model.evaluate(X_test, y_test_encoded, verbose=0)

In [None]:
predictions = model.predict(X_test)
predictions = np.argmax(predictions, axis=1)

# predictions = model.predict_classes(X_test)
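
`model.predict` returns one row of class probabilities per tweet, and `argmax` picks the index of the largest one. A small sketch with made-up probabilities, assuming the label order `['Negative', 'Neutral', 'Positive']` that `LabelEncoder` would produce by sorting the three class names:

```python
# made-up softmax outputs for three tweets (each row sums to 1)
probs = [
    [0.10, 0.20, 0.70],
    [0.60, 0.30, 0.10],
    [0.25, 0.50, 0.25],
]
label_names = ["Negative", "Neutral", "Positive"]  # assumed encoder order

# index of the largest probability in each row, then map back to a name
pred_idx = [row.index(max(row)) for row in probs]
pred_labels = [label_names[i] for i in pred_idx]

print(pred_idx, pred_labels)
```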

In [None]:
# classification report
print(classification_report(y_test, predictions, target_names= labels))

    - As seen, the LSTM model yields a better performance on our data (84% accuracy) than the Logistic Regression (79% accuracy)
    - While 79-84% is a fairly good accuracy score, the performance of each model can still be improved further by tuning its hyperparameters

##### Author: Ayomide Aderonmu