# GR5293 - Proj2 - Group9
## NLP with tweets related to COVID
#### NLP pipeline with sentiment prediction
* Tokenization
    > Split text into tokens (sentences or words). Here we split each document into sentences for automatic summarization, and into words for sentiment analysis and topic modeling.
* Remove stop words and other uninformative tokens
* Lemmatization
    > We use lemmatization rather than stemming because lemmatization relies on morphological analysis and returns valid dictionary forms, which keeps words interpretable in context. Stemming only strips affixes and can produce tokens with an incorrect meaning.
* EDA: wordCloud with different sentiment
    > Identify what people with different emotions were talking about
* EDA: Word2vec with Clustering
    > Word2Vec: Effective for detecting the synonymous words or suggesting additional words for a partial sentence

    Clustering methods: K-means + DBScan

    Use all the words with a specific part-of-speech tag from all the documents (e.g. all nouns or all adjectives)
* (word2vec w/ recommendation)?
* Topic Modeling: Feature extraction by TFIDF + Latent Dirichlet Allocation
    Build a pipeline with K-fold cross-validation to find the best number of topics
* Automatic summarization
    > Identify what most people were thinking or tweeting about
* Sentiment Analysis: Classification for sentiment (5 classes: Neutral / Positive / Extremely Positive / Negative / Extremely Negative)
  
    Potential models: LightGBM / stacking / BERT?
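
As a rough illustration of the first two pipeline steps, the sketch below splits a tweet into sentences and word tokens with regular expressions and then drops stop words. The `STOP_WORDS` set and the `split_sentences` / `tokenize_words` helpers are hypothetical stand-ins written for this example; the notebook itself imports NLTK's tokenizers and stop-word list for this purpose.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "not", "my"}

def split_sentences(text):
    """Split text on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize_words(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tweet = "My food stock is empty. Please do not panic!"
print(split_sentences(tweet))
print(tokenize_words(tweet))
```

Sentence-level output feeds the summarization step, while the word-level tokens are what sentiment analysis and topic modeling would consume.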
#### Preprocessing
* One-hot encoding
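
The notebook does not show which column is one-hot encoded. Assuming it is the `Sentiment` label, a minimal sketch could look like the following; the class order in `CLASSES` and the `one_hot` helper are assumptions made for illustration only.

```python
# The five sentiment classes named in the pipeline description (order is an assumption)
CLASSES = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]

def one_hot(label, classes=CLASSES):
    """Return a 0/1 list with a single 1 at the position of `label`."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]

labels = ["Positive", "Extremely Negative", "Neutral"]
encoded = [one_hot(label) for label in labels]
for label, row in zip(labels, encoded):
    print(label, row)
```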

In [34]:
import numpy as np
import pandas as pd
import sklearn
import nltk
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os
import re
from sklearn import pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
import lightgbm as lgb
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%xmode plain
os.getcwd()

Exception reporting mode: Plain


'/Users/kangshuoli/Documents/VScode_workspace/GR5293/EODS-Project2-Group9/EODS_Project2_Group9/doc'

### Read in filtered data

In [35]:
df = pd.read_csv("../data/Corona_NLP_filtered.csv")
df.drop(
    "Unnamed: 0", 
    axis = 1,
    inplace = True
)
df

Unnamed: 0,OriginalTweet,Sentiment,Tweet_filtered,Word_list,Senten_list,Senten_list_filtered
0,advice Talk to your neighbours family to excha...,Positive,advice talk neighbour family exchange phone nu...,"['advice', 'talk', 'neighbour', 'family', 'exc...",['advice Talk to your neighbours family to exc...,['advice talk to your neighbours family to exc...
1,Coronavirus Australia: Woolworths to give elde...,Positive,coronavirus australia woolworth give elderly d...,"['coronavirus', 'australia', 'woolworth', 'giv...",['Coronavirus Australia: Woolworths to give el...,['coronavirus australia woolworths to give eld...
2,My food stock is not the only one which is emp...,Positive,food stock one empty please dont panic enough ...,"['food', 'stock', 'one', 'empty', 'please', 'd...",['My food stock is not the only one which is e...,['my food stock is not the only one which is e...
3,"Me, ready to go at supermarket during the #COV...",Extremely Negative,ready go supermarket covid outbreak im paranoi...,"['ready', 'go', 'supermarket', 'covid', 'outbr...","['Me, ready to go at supermarket during the #C...",['me ready to go at supermarket during the cov...
4,As news of the regionÂs first confirmed COVID...,Positive,news regionâ  first confirmed covid case came...,"['news', 'regionâ', '\x92', 'first', 'confirme...",['As news of the regionÂ\x92s first confirmed ...,['as news of the regionâ\x92s first confirmed ...
...,...,...,...,...,...,...
44248,Meanwhile In A Supermarket in Israel -- People...,Positive,meanwhile supermarket israel people dance sing...,"['meanwhile', 'supermarket', 'israel', 'people...",['Meanwhile In A Supermarket in Israel -- Peop...,['meanwhile in a supermarket in israel people...
44249,Did you panic buy a lot of non-perishable item...,Negative,panic buy lot nonperishable item echo need foo...,"['panic', 'buy', 'lot', 'nonperishable', 'item...",['Did you panic buy a lot of non-perishable it...,['did you panic buy a lot of nonperishable ite...
44250,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral,asst prof economics talking recent research co...,"['asst', 'prof', 'economics', 'talking', 'rece...","[""Asst Prof of Economics was on talking about ...",['asst prof of economics was on talking about ...
44251,Gov need to do somethings instead of biar je r...,Extremely Negative,gov need somethings instead biar je rakyat ass...,"['gov', 'need', 'somethings', 'instead', 'biar...","[""Gov need to do somethings instead of biar je...",['gov need to do somethings instead of biar je...


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44253 entries, 0 to 44252
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   OriginalTweet         44253 non-null  object
 1   Sentiment             44251 non-null  object
 2   Tweet_filtered        44251 non-null  object
 3   Word_list             44251 non-null  object
 4   Senten_list           44250 non-null  object
 5   Senten_list_filtered  44249 non-null  object
dtypes: object(6)
memory usage: 2.0+ MB


### Data Cleaning
* Most done in data_cleaning.ipynb
* Drop rows with missing values

In [43]:
df.dropna(
    axis = 0, 
    how = "any", 
    inplace = True
)
df.index = np.arange(df.shape[0], dtype = int)

In [44]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44249 entries, 0 to 44248
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   OriginalTweet         44249 non-null  object
 1   Sentiment             44249 non-null  object
 2   Tweet_filtered        44249 non-null  object
 3   Word_list             44249 non-null  object
 4   Senten_list           44249 non-null  object
 5   Senten_list_filtered  44249 non-null  object
dtypes: object(6)
memory usage: 2.4+ MB


## Word-level Analysis
1. wordCloud
2. word2Vec w/ clustering
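
Word2Vec-style analysis usually compares words by the cosine similarity of their embedding vectors. The sketch below shows that comparison on made-up 3-dimensional vectors; `toy_vectors` and `most_similar` are hypothetical and are not trained embeddings or gensim's API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "embeddings" for illustration only
toy_vectors = {
    "shop": [0.9, 0.1, 0.0],
    "store": [0.8, 0.2, 0.1],
    "virus": [0.0, 0.1, 0.9],
}

def most_similar(word, vectors):
    """Return the (other_word, similarity) pair closest to `word`."""
    others = [
        (w, cosine_similarity(vectors[word], v))
        for w, v in vectors.items()
        if w != word
    ]
    return max(others, key=lambda pair: pair[1])

print(most_similar("shop", toy_vectors))
```

Clustering methods such as K-means or DBSCAN then group words whose vectors lie close together under a chosen distance.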

## Document-level / Sentence-level Analysis

1. Sentence-level automatic summarization
* Extractive summarization: pick the original sentence which can represent the main focus of the whole document
* Abstractive summarization: generate new sentences for summary
    > Purely extractive summaries often give better results than automatic abstractive summaries. Abstractive methods must handle semantic representation, inference and natural language generation, which are harder problems than data-driven approaches such as sentence extraction.

We use a frequency-driven approach here.
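
A frequency-driven extractive scorer can be sketched on toy data as follows: each word gets a weight, words below a threshold are ignored, and a sentence's score is the sum of the remaining weights. The weights below are invented rather than taken from the dataset, and `score_sentence` is a hypothetical helper.

```python
from statistics import median

# Invented word weights standing in for summed TF-IDF scores
word_weight = {"panic": 5.0, "buy": 4.0, "food": 3.0, "people": 1.0, "today": 0.5}
threshold = median(word_weight.values())  # words below the median are ignored

def score_sentence(sentence, weights, threshold):
    """Sum the weights of the words that reach the threshold."""
    total = 0.0
    for word in sentence.split():
        weight = weights.get(word, 0.0)  # unknown words contribute nothing
        if weight >= threshold:
            total += weight
    return total

sentences = ["people panic buy food", "people today", "buy food today"]
scores = {s: score_sentence(s, word_weight, threshold) for s in sentences}
best = max(scores, key=scores.get)
print(best, scores[best])
```

The highest-scoring sentences are the ones kept as the extractive summary.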

##### Extractive method w/ TFIDF
##### Model: Centroid-based summarization

In [66]:
# Centroid-based summarization

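# combine all sentences into one list
# The list columns come back from read_csv as strings such as "['a', 'b']",
# so parse them into real lists first (extend() on a string would add single characters).
# Assumes every cell is a valid Python list literal.
import ast

sentence_list = []
sentence_list_filtered = []
for i in np.arange(df.shape[0]):
    sentence_list.extend(ast.literal_eval(df.loc[i, "Senten_list"]))
    sentence_list_filtered.extend(ast.literal_eval(df.loc[i, "Senten_list_filtered"]))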

tfidf_sum = TfidfVectorizer(
    norm = "l2", # The cosine similarity between two vectors is their dot product when l2 norm has been applied.
    stop_words = None, # already filtered
    preprocessor = None, 
    lowercase = False, # already lowered
    max_df = 0.9, 
    min_df = 10, 
    ngram_range = (1,5)
)
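tfidf_word = tfidf_sum.fit_transform(df["Tweet_filtered"])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists from 1.0 on
vocab_sum = tfidf_sum.get_feature_names_out()
vocab_sum = [x.replace(' ', '_') for x in vocab_sum]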


# sum all the tfidf score for each word
tfidf_word_all_doc = pd.Series(
    np.array(tfidf_word.sum(axis = 0)).reshape(-1, 1).ravel(), 
    index = vocab_sum
)

# set a threshold to filter out the word that are not important
threshold = tfidf_word_all_doc.median()


# for each sentence get the sentence centroid score by summing up the word score
vocab_set = set(vocab_sum) # set lookup avoids scanning the whole vocabulary list for every word

def get_centroid_score(sent_):
    score = 0
    word_list = re.split(r'\s+', sent_)
    for word in word_list:
        if word in vocab_set: # get rid of some mis-tokenized words
            if tfidf_word_all_doc[word] >= threshold:
                score += tfidf_word_all_doc[word]
    return score

# get the sentence score by filtered sentence
sentence_score_dict = {}
for i, curr_senten in enumerate(sentence_list_filtered):
    sentence_score_dict[sentence_list[i]] = get_centroid_score(curr_senten)


In [None]:
# save the sentence_score_dict
import pickle
# pickle writes bytes, so the file must be opened in binary mode ('wb')
with open('centroid_based_sentence_score.txt', 'wb') as file:
    pickle.dump(sentence_score_dict, file)

In [None]:
for i in np.arange(len(sentence_list)):
    if sentence_score_dict[sentence_list[i]] >= 200:
        print(sentence_list[i])

2. Document-level sentiment classification