# GR5293 - Proj2 - Group9
## NLP with tweets related to COVID
#### NLP pipeline with sentiment prediction
* Tokenization
    > Split text into tokens (sentences or words). For this project, we split each document into sentences for automatic summarization, and into words for sentiment analysis and topic modeling.
* Screen out stop words and other meaningless tokens
* Lemmatization
    > We use lemmatization rather than stemming because lemmatization keeps words interpretable in their context, whereas stemming can reduce a word to a form with the wrong meaning. Morphological analysis of the words is therefore important.
* EDA: wordCloud with different sentiment
    > Identify what people with different emotions were concerned about
* EDA: Word2vec with Clustering
    > Word2Vec: Effective for detecting synonymous words or suggesting additional words for a partial sentence

    Clustering methods: K-means + DBSCAN

    Use all the words of a specific part of speech from all the documents (e.g. all nouns or all adjectives)
* (word2vec w/ recommendation)?
* Topic Modeling: Feature extraction by TFIDF + Latent Dirichlet Allocation
    Build a pipeline with KFoldCV to find the best topic number
* Automatic summarization
    > Identify what most people were thinking or tweeting about
* Sentiment Analysis: Classification for sentiment (5 classes: Neutral / Positive / Extremely Positive / Negative / Extremely Negative)
  
    Potential Model: lightGBM / stacking / BERT?
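
The first two pipeline steps (tokenization and stop-word filtering) can be illustrated with a dependency-free sketch. This is not the notebook's actual preprocessing, which relies on NLTK tokenizers and the NLTK stop-word corpus. The regular expressions, the tiny `STOP_WORDS` set, and the function names below are assumptions made for illustration only.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's `stopwords` corpus
# (imported later in this notebook) is far larger.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "at", "my", "not"}

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize_words(text):
    """Lowercase the text and keep runs of letters (drops digits, punctuation, '#', '@')."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

tweet = "My food stock is not empty. Please stay calm!"
sentences = split_sentences(tweet)                    # sentence-level tokens
words = remove_stop_words(tokenize_words(tweet))      # word-level tokens
print(sentences)
print(words)
```

Sentence-level tokens feed the summarization step, while the filtered word-level tokens feed sentiment analysis and topic modeling.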
#### Preprocessing
* One-hot encoding
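
The outline does not say which column is one-hot encoded. Assuming it is the `Sentiment` label, a minimal sketch of the idea looks like this (the class order and the `one_hot` helper are illustrative choices, not taken from the notebook):

```python
# The five sentiment classes named in the pipeline outline.
CLASSES = ["Extremely Negative", "Negative", "Neutral", "Positive", "Extremely Positive"]

def one_hot(label, classes=CLASSES):
    """Return a 0/1 list with a single 1 at the position of `label`."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if label == c else 0 for c in classes]

encoded = [one_hot(s) for s in ["Positive", "Neutral", "Extremely Negative"]]
for row in encoded:
    print(row)
```

With pandas, `pd.get_dummies(df["Sentiment"])` produces the equivalent indicator columns directly.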

In [24]:
import numpy as np
import pandas as pd
import sklearn
import nltk
import spacy
import matplotlib.pyplot as plt
import seaborn as sns
import time
import os
import re
from sklearn import pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestClassifier
from gensim.models import Word2Vec
from scipy.spatial.distance import cosine
import lightgbm as lgb
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%xmode plain
os.getcwd()

Exception reporting mode: Plain


'/Users/kangshuoli/Documents/VScode_workspace/GR5293/EODS-Project2-Group9/EODS_Project2_Group9/doc'

### Read in filtered data

In [12]:
df = pd.read_csv("../data/Corona_NLP_filtered.csv")
df.drop(
    "Unnamed: 0", 
    axis = 1,
    inplace = True
)
df

Unnamed: 0,OriginalTweet,Sentiment,Tweet_filtered,Word_list,Senten_list
0,advice Talk to your neighbours family to excha...,Positive,advice talk neighbour family exchange phone nu...,"['advice', 'talk', 'neighbour', 'family', 'exc...",['advice talk to your neighbours family to exc...
1,Coronavirus Australia: Woolworths to give elde...,Positive,coronavirus australia woolworth give elderly d...,"['coronavirus', 'australia', 'woolworth', 'giv...",['coronavirus australia: woolworths to give el...
2,My food stock is not the only one which is emp...,Positive,food stock one empty please dont panic enough ...,"['food', 'stock', 'one', 'empty', 'please', 'd...",['my food stock is not the only one which is e...
3,"Me, ready to go at supermarket during the #COV...",Extremely Negative,ready go supermarket covid outbreak im paranoi...,"['ready', 'go', 'supermarket', 'covid', 'outbr...","['me, ready to go at supermarket during the #c..."
4,As news of the regionÂs first confirmed COVID...,Positive,news regionâ  first confirmed covid case came...,"['news', 'regionâ', '\x92', 'first', 'confirme...",['as news of the regionâ\x92s first confirmed ...
...,...,...,...,...,...
44248,Meanwhile In A Supermarket in Israel -- People...,Positive,meanwhile supermarket israel people dance sing...,"['meanwhile', 'supermarket', 'israel', 'people...",['meanwhile in a supermarket in israel -- peop...
44249,Did you panic buy a lot of non-perishable item...,Negative,panic buy lot nonperishable item echo need foo...,"['panic', 'buy', 'lot', 'nonperishable', 'item...",['did you panic buy a lot of non-perishable it...
44250,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral,asst prof economics talking recent research co...,"['asst', 'prof', 'economics', 'talking', 'rece...","[""asst prof of economics @cconces was on @nbcp..."
44251,Gov need to do somethings instead of biar je r...,Extremely Negative,gov need somethings instead biar je rakyat ass...,"['gov', 'need', 'somethings', 'instead', 'biar...","[""gov need to do somethings instead of biar je..."


In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44253 entries, 0 to 44252
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   OriginalTweet   44253 non-null  object
 1   Sentiment       44251 non-null  object
 2   Tweet_filtered  44251 non-null  object
 3   Word_list       44250 non-null  object
 4   Senten_list     44249 non-null  object
dtypes: object(5)
memory usage: 1.7+ MB


### Data Cleaning
* Mostly done in data_cleaning.ipynb
* Drop rows with missing values

In [14]:
df.dropna(
    axis = 0, 
    how = "any", 
    inplace = True
)

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44249 entries, 0 to 44252
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   OriginalTweet   44249 non-null  object
 1   Sentiment       44249 non-null  object
 2   Tweet_filtered  44249 non-null  object
 3   Word_list       44249 non-null  object
 4   Senten_list     44249 non-null  object
dtypes: object(5)
memory usage: 2.0+ MB


## Word-level Analysis
1. wordCloud
2. word2Vec w/ clustering

## Document-level / Sentence-level Analysis

1. Sentence-level automatic summarization
* Extractive summarization: pick the original sentences that best represent the main focus of the whole document
* Abstractive summarization: generate new sentences for summary
    > Purely extractive summaries often give better results than automatic abstractive summaries. This is because abstractive summarization methods must cope with problems such as semantic representation, inference and natural language generation, which are relatively harder than data-driven approaches such as sentence extraction.

We use a frequency-driven approach here.
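
As a rough illustration of what "frequency-driven" means, the sketch below scores each sentence by the mean corpus frequency of its words and keeps the highest-scoring ones. It is a simplified stand-in, not the TF-IDF centroid model used in this notebook. The function name, the regex tokenizer, and the toy sentences are assumptions.

```python
import re
from collections import Counter

def summarize_by_frequency(sentences, n_keep=1):
    """Score each sentence by the mean corpus frequency of its words; keep the top n_keep."""
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # restore the original sentence order
    return [sentences[i] for i in keep]

docs = [
    "panic buying empties supermarket shelves",
    "supermarket shelves are empty after panic buying",
    "the weather is nice today",
]
summary = summarize_by_frequency(docs, n_keep=1)
print(summary)
```

Averaging rather than summing the frequencies keeps long sentences from winning simply because they contain more words.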

##### Extractive method w/ TFIDF
##### Model: Centroid-based summarization

In [33]:
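# Centroid-based summarization
import ast

# Senten_list comes back from the CSV as a string (assumed to be a valid Python
# list literal), so parse it; iterate over values because dropna() left index gaps.
all_sentence_list = []
for sentence_str in df["Senten_list"]:
    all_sentence_list.extend(ast.literal_eval(sentence_str))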

# sentence cleaning



# tfidf_sum = TfidfVectorizer(
#     norm = "l2", # The cosine similarity between two vectors is their dot product when l2 norm has been applied.
#     stop_words = None, # already filtered
#     preprocessor = None, 
#     lowercase = False, # already lowered
#     max_df = 0.9, 
#     min_df = 10, 
#     ngram_range = (1,5)
# )
# tfidf_df = tfidf_sum.fit_transform(df["Tweet_filtered"])
# vocab_sum = tfidf_sum.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
# vocab_sum = [x.replace(' ', '_') for x in vocab_sum]

# # Set a threshold to filter the word that are important
# threshold = 0.5

# # Construct the centroid of clusters, use tfidf score as the centroid score
# mask_df = tfidf_df < threshold # sparse matrix
# print(type(tfidf_df))

<class 'scipy.sparse.csr.csr_matrix'>


In [18]:
tfidf_df.shape # 13765 features in total (n-grams of 1 to 5 words, not single words only)

(44249, 13765)
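
The `norm = "l2"` comment in the vectorizer settings states that, after L2 normalization, the dot product of two vectors equals their cosine similarity. A small numeric check of that identity, using plain Python lists in place of TF-IDF rows (the helper names and the example vectors are illustrative):

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denom = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / denom if denom else 0.0

u, v = [1.0, 2.0, 0.0], [2.0, 1.0, 1.0]
dot_of_normalized = dot(l2_normalize(u), l2_normalize(v))
cos_uv = cosine_similarity(u, v)
print(dot_of_normalized, cos_uv)
```

This is why an L2-normalized TF-IDF matrix can be multiplied by its own transpose to obtain all pairwise cosine similarities at once.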

In [None]:
# #Calculates cosine similarity
# def similarity(v1, v2):
#     score = 0.0
#     if np.count_nonzero(v1) != 0 and np.count_nonzero(v2) != 0:
#         score = ((1 - cosine(v1, v2)) + 1) / 2
#     return score

# def get_tf_idf(sentences):
#     vectorizer = CountVectorizer()
#     sent_word_matrix = vectorizer.fit_transform(sentences)

#     transformer = TfidfTransformer(norm=None, sublinear_tf=False, smooth_idf=False)
#     tfidf = transformer.fit_transform(sent_word_matrix)
#     tfidf = tfidf.toarray()

#     centroid_vector = tfidf.sum(0)
#     centroid_vector = np.divide(centroid_vector, centroid_vector.max())

#     feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

#     relevant_vector_indices = np.where(centroid_vector > 0.3)[0]

#     word_list = list(np.array(feature_names)[relevant_vector_indices])
#     return word_list

# #Populate word vector with all embeddings.
# #This word vector is a look up table that is used
# #for getting the centroid and sentences embedding representation.
# def word_vectors_cache(sentences, embedding_model):
#     word_vectors = dict()
#     for sent in sentences:
#         words = word_tokenize(sent)  # NLTK tokenizer imported at the top of the notebook
#         for w in words:
#             word_vectors.update({w: embedding_model.wv[w]})
#     return word_vectors

# # Sentence embedding representation with sum of word vectors
# def build_embedding_representation(words, word_vectors, embedding_model):
#     embedding_representation = np.zeros(embedding_model.vector_size, dtype="float32")
#     word_vectors_keys = set(word_vectors.keys())
#     count = 0
#     for w in words:
#         if w in word_vectors_keys:
#             embedding_representation = embedding_representation + word_vectors[w]
#             count += 1
#     if count != 0:
#        embedding_representation = np.divide(embedding_representation, count)
#     return embedding_representation

# def summarize(text, emdedding_model):
#     raw_sentences = sent_tokenize(text)
#     clean_sentences = cleanup_sentences(text)
#     for i, s in enumerate(raw_sentences):
#         print(i, s)
#     for i, s in enumerate(clean_sentences):
#         print(i, s)
#     centroid_words = get_tf_idf(clean_sentences)
#     print(len(centroid_words), centroid_words)
#     word_vectors = word_vectors_cache(clean_sentences, emdedding_model)
#     #Centroid embedding representation
#     centroid_vector = build_embedding_representation(centroid_words, word_vectors, emdedding_model)
#     sentences_scores = []
#     for i in range(len(clean_sentences)):
#         scores = []
#         words = clean_sentences[i].split()

#         #Sentence embedding representation
#         sentence_vector = build_embedding_representation(words, word_vectors, emdedding_model)

#         #Cosine similarity between sentence embedding and centroid embedding
#         score = similarity(sentence_vector, centroid_vector)
#         sentences_scores.append((i, raw_sentences[i], score, sentence_vector))
#     sentence_scores_sort = sorted(sentences_scores, key=lambda el: el[2], reverse=True)
#     for s in sentence_scores_sort:
#         print(s[0], s[1], s[2])
#     count = 0
#     sentences_summary = []
#     #Handle redundancy
#     for s in sentence_scores_sort:
#         if count > 100:
#             break
#         include_flag = True
#         for ps in sentences_summary:
#             sim = similarity(s[3], ps[3])
#             if sim > 0.95:
#                 include_flag = False
#         if include_flag:
#             sentences_summary.append(s)
#             count += len(s[1].split())

#         sentences_summary = sorted(sentences_summary, key=lambda el: el[0], reverse=False)

#     summary = "\n".join([s[1] for s in sentences_summary])
#     print(summary)
#     return summary


# clean_sentences = cleanup_sentences(text)
# words = []
# for sent in clean_sentences:
#     words.append(word_tokenize(sent))  # NLTK tokenizer imported at the top of the notebook
# model = Word2Vec(words, min_count=1, sg = 1)
# summarize(text, model)
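
The commented-out draft above depends on helpers that are not defined in this notebook (for example `cleanup_sentences`) and on a trained Word2Vec model. As a compact, runnable illustration of the same centroid idea, the sketch below swaps the embedding representation for plain TF-IDF bag-of-words vectors. It is a simplification under stated assumptions, not the notebook's final method. The IDF formula `ln(N / df) + 1` with no normalization mirrors the `smooth_idf=False, norm=None` settings in the draft, the `0.3` and `0.95` thresholds are copied from it, and `max_sentences` replaces the draft's word-count budget. Because TF-IDF weights are non-negative, cosine similarity already lies in [0, 1], so the draft's rescaling step is not needed here.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase and keep runs of letters (a stand-in for the NLTK tokenizer)."""
    return re.findall(r"[a-z]+", sentence.lower())

def tfidf_vectors(sentences):
    """Return (vocab, rows): raw tf * idf per sentence, with idf = ln(N / df) + 1."""
    docs = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    rows = [[d[w] * (math.log(n / df[w]) + 1.0) for w in vocab] for d in docs]
    return vocab, rows

def cosine_sim(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def centroid_summary(sentences, topic_threshold=0.3, dup_threshold=0.95, max_sentences=2):
    """Pick the sentences closest to the TF-IDF centroid, skipping near-duplicates."""
    vocab, rows = tfidf_vectors(sentences)
    if not vocab:
        return []
    # Centroid: column sums scaled by the largest sum; weak terms are zeroed out.
    sums = [sum(col) for col in zip(*rows)]
    top = max(sums)
    centroid = [s / top if s / top > topic_threshold else 0.0 for s in sums]
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine_sim(rows[i], centroid),
        reverse=True,
    )
    chosen = []
    for i in ranked:
        if len(chosen) == max_sentences:
            break
        # Skip sentences nearly identical to one already selected.
        if all(cosine_sim(rows[i], rows[j]) <= dup_threshold for j in chosen):
            chosen.append(i)
    return [sentences[i] for i in sorted(chosen)]

docs = [
    "panic buying empties supermarket shelves",
    "panic buying empties supermarket shelves",
    "supermarket shelves are empty after panic buying",
    "the weather is nice today",
]
summary = centroid_summary(docs)
print("\n".join(summary))
```

The redundancy check matters for tweets in particular, since retweets and near-copies would otherwise fill the summary with the same sentence.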

2. Document-level sentiment classification