# GR5293 - Proj2 - Group9
## NLP with tweets related to COVID
#### NLP pipeline with sentiment prediction
* Tokenization
    > Split text into tokens (sentences or words). For this project, we split each document into sentences for automatic summarization, and into words for sentiment analysis and topic modeling.
* Remove stop words and other uninformative tokens
* Lemmatization
    > We use lemmatization rather than stemming because lemmatization keeps words interpretable in their context, whereas stemming can distort their meaning.
* EDA: word cloud for each sentiment
    > Identify what people with different emotions were concerned about
* EDA: Word2Vec with clustering
    > Word2Vec: effective for detecting synonymous words or suggesting additional words for a partial sentence
    <br>
    Clustering methods: K-means + DBSCAN
    <br>
    Use all the words of a specific part of speech from all the documents (e.g., all nouns or all adjectives)
* Topic Modeling: feature extraction by TF-IDF + Latent Dirichlet Allocation
    > Build a pipeline with k-fold cross-validation to find the best number of topics
* Automatic summarization
    > Identify what most people were thinking or tweeting about
* Sentiment Analysis: classification of sentiment (5 classes: Neutral / Positive / Extremely Positive / Negative / Extremely Negative)
    > Potential Model: BERT?
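
The topic-modeling step above (TF-IDF features, LDA, and k-fold cross-validation over the number of topics) could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the toy `docs` corpus, the candidate grid `[2, 3, 4]`, and `cv=3` are all made-up assumptions, and the selection metric is simply scikit-learn's default `LatentDirichletAllocation.score` (an approximate log-likelihood), which the notes above do not specify.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy stand-in corpus (invented for illustration, not the tweet data).
docs = [
    "supermarket shelves empty panic buying toilet paper",
    "grocery store prices rising food stock shortage",
    "online shopping delivery slots booked for weeks",
    "panic buying leaves supermarket shelves empty again",
    "oil prices fall as demand drops during lockdown",
    "stock market crash investors worried about economy",
    "food bank donations needed for elderly neighbours",
    "hand sanitizer and masks sold out at every store",
    "delivery drivers and grocery workers are heroes",
    "consumer demand shifts to online shopping",
    "gas prices drop as people stay home",
    "elderly shoppers get dedicated supermarket hours",
]

# TF-IDF feature extraction followed by LDA, as one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("lda", LatentDirichletAllocation(random_state=0)),
])

# Candidate topic counts (assumed values for the sketch).
param_grid = {"lda__n_components": [2, 3, 4]}

# With no explicit scoring, GridSearchCV falls back to the pipeline's score,
# i.e. LDA's approximate log-likelihood on each held-out fold.
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs)

best_k = search.best_params_["lda__n_components"]
print("Selected number of topics:", best_k)
```

Held-out likelihood does not always agree with how interpretable the topics look, so the selected value is best treated as a starting point and checked against the top words per topic.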

In [1]:
import string
import numpy as np
import pandas as pd
import sklearn
import nltk
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os
import re

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
os.getcwd()

'/Users/kangshuoli/Documents/VScode_workspace/GR5293/EODS-Project2-Group9/EODS_Project2_Group9/doc'

In [4]:
train = pd.read_csv('../data/Corona_NLP_train.csv', encoding = 'latin1')
test = pd.read_csv('../data/Corona_NLP_test.csv', encoding = 'latin1')

df = pd.concat([train, test])
df = df.loc[:, ["TweetAt", "OriginalTweet", "Sentiment"]]
df.index = np.arange(df.shape[0], dtype = int)
df

Unnamed: 0,TweetAt,OriginalTweet,Sentiment
0,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,16-03-2020,My food stock is not the only one which is emp...,Positive
4,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
...,...,...,...
44950,16-03-2020,Meanwhile In A Supermarket in Israel -- People...,Positive
44951,16-03-2020,Did you panic buy a lot of non-perishable item...,Negative
44952,16-03-2020,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral
44953,16-03-2020,Gov need to do somethings instead of biar je r...,Extremely Negative


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44955 entries, 0 to 3797
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   OriginalTweet  44955 non-null  object
 1   Sentiment      44955 non-null  object
dtypes: object(2)
memory usage: 1.0+ MB


### Data Cleaning
* Filter out user handles starting with "@", links starting with "https://" or "www.", and HTML tags enclosed in "<" and ">"
* Remove numbers
* Remove punctuation
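
The first bullet mentions filtering out "@" handles, which the helper functions in this notebook do not cover explicitly. A minimal sketch of how that step might look is shown here, applied to a made-up sample string together with the other cleaning steps. The names `remove_mentions`, `collapse_whitespace`, and `MENTION_RE` are hypothetical and not part of the notebook, and the pattern assumes handles consist of word characters only.

```python
import re
import string

# Hypothetical pattern: "@" followed by letters, digits, or underscores.
MENTION_RE = re.compile(r'@\w+')

def remove_mentions(text):
    """Drop Twitter-style handles such as @user_name."""
    return MENTION_RE.sub('', text)

def collapse_whitespace(text):
    """Squeeze the gaps left behind by removed tokens into single spaces."""
    return ' '.join(text.split())

# Made-up example tweet, not taken from the dataset.
sample = "@MeNyrbie thanks! See https://t.co/abc123 <b>now</b> 24/7"

cleaned = remove_mentions(sample)                          # handles first
cleaned = re.sub(r'https?://\S+|www\.\S+', '', cleaned)    # links
cleaned = re.sub(r'<.*?>', '', cleaned)                    # HTML tags
cleaned = cleaned.lower()                                  # lowercase
cleaned = re.sub(r'\d+', '', cleaned)                      # numbers
cleaned = cleaned.translate(str.maketrans('', '', string.punctuation))
cleaned = collapse_whitespace(cleaned)

print(cleaned)  # thanks see now
```

Handles need to be removed before punctuation: once "@" has been stripped, the handle text can no longer be told apart from ordinary words. Note that this simple pattern would also clip the domain part of an email address.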

In [None]:
def remove_urls(text):
    url_remove = re.compile(r'https?://\S+|www\.\S+')
    return url_remove.sub(r'', text)

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def lower(text):
    low_text = text.lower()
    return low_text

def remove_num(text):
    remove = re.sub(r'\d+', '' ,text)
    return remove

def remove_punctuation(text):
    clean_list = [char for char in text if char not in string.punctuation]
    clean_str = ''.join(clean_list)
    return clean_str

# Apply the cleaning steps in order; punctuation is stripped last so the
# URL and HTML patterns can still match.
df['OriginalTweet'] = (
    df['OriginalTweet']
    .apply(remove_urls)
    .apply(remove_html)
    .apply(lower)
    .apply(remove_num)
    .apply(remove_punctuation)
)


def classes_def(x):
    # Collapse the five original labels into three: positive, negative, neutral
    if x == 'Extremely Positive' or x == 'Positive':
        return "positive"
    elif x == "Extremely Negative" or x == 'Negative':
        return "negative"
    else:
        return "neutral"
    
train['Sentiment'] = train['Sentiment'].apply(lambda x: classes_def(x))

train.head()