## Data Cleaning
* Tokenization
    > Split text into tokens (sentences or words). For this task, we split each document into sentences for automatic summarization, and into words for sentiment analysis and topic modeling.
* Remove stop words and other uninformative tokens
* Lemmatization
    > We use lemmatization rather than stemming because lemmatization maps each word to a dictionary form, which keeps the words interpretable, whereas stemming can truncate a word into a form with a distorted meaning. Lemmatization therefore relies on a morphological analysis of each word.
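To make the stemming versus lemmatization contrast concrete, here is a toy sketch. Both functions are hypothetical stand-ins written for illustration: `toy_stem` is a crude suffix stripper and `toy_lemmatize` looks words up in a tiny hand-written dictionary. Neither reproduces NLTK's stemmers or `WordNetLemmatizer`.

```python
# Toy illustration only: neither function is NLTK's implementation.
SUFFIXES = ("ies", "ing", "es", "s")  # hypothetical, crude suffix rules

def toy_stem(word):
    """Strip the first matching suffix, regardless of meaning."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hypothetical hand-written lemma dictionary standing in for WordNet.
TOY_LEMMAS = {"studies": "study", "families": "family", "neighbours": "neighbour"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word itself."""
    return TOY_LEMMAS.get(word, word)

for w in ["studies", "families", "neighbours"]:
    print(w, "| stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

The stems `stud` and `famil` are not words, while the lemmas stay readable, which is the property the later analysis steps depend on.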

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import sent_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

In [2]:
train = pd.read_csv('../data/Corona_NLP_train.csv', encoding = 'latin1')
test = pd.read_csv('../data/Corona_NLP_test.csv', encoding = 'latin1')

df = pd.concat([train, test], ignore_index=True)
df = df[["OriginalTweet", "Sentiment"]]
df

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative
...,...,...
44950,Meanwhile In A Supermarket in Israel -- People...,Positive
44951,Did you panic buy a lot of non-perishable item...,Negative
44952,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral
44953,Gov need to do somethings instead of biar je r...,Extremely Negative


In [3]:
# tw_tokenizer = TweetTokenizer()
# df["Word_list"] = df["OriginalTweet"].apply(
#     lambda x: tw_tokenizer.tokenize(x)
# )
# df["Senten_list"] = df["OriginalTweet"].apply(
#     lambda x: tw_tokenizer.tokenize_sents(x)
# )

In [17]:
import string
import re

def remove_urls(text):
    url_remove = re.compile(r'https?://\S+|www\.\S+')
    return url_remove.sub(r'', text)

def remove_stop_word(text):
    # Assumed completion of an unfinished stub; not used in this cell, since
    # stop words are filtered after tokenization in the word-level step.
    english_stop_words = set(stopwords.words('english'))
    no_stop_word = [word for word in text.split() if word.lower() not in english_stop_words]
    return ' '.join(no_stop_word)

def remove_html(text):
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def lower(text):
    low_text = text.lower()
    return low_text

def remove_num(text):
    remove = re.sub(r'\d+', '', text)
    return remove

def remove_punctuation(text):
    clean_list = [char for char in text if char not in string.punctuation]
    clean_str = ''.join(clean_list)
    return clean_str

def remove_at(text):
    # The pattern needs whitespace after the mention, so a mention at the
    # very end of a tweet is kept.
    at = re.compile(r'@.+?\s')
    no_at = re.sub(at, '', text)
    return no_at

# Lowercasing and punctuation removal are deferred to the later steps.
df["Tweet_filtered"] = (
    df["OriginalTweet"]
    .apply(remove_urls)
    .apply(remove_at)
    .apply(remove_html)
    .apply(remove_num)
)

In [5]:
df

Unnamed: 0,OriginalTweet,Sentiment,Tweet_filtered
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,and and
1,advice Talk to your neighbours family to excha...,Positive,advice Talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,Coronavirus Australia: Woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,My food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,"Me, ready to go at supermarket during the #COV..."
...,...,...,...
44950,Meanwhile In A Supermarket in Israel -- People...,Positive,Meanwhile In A Supermarket in Israel -- People...
44951,Did you panic buy a lot of non-perishable item...,Negative,Did you panic buy a lot of non-perishable item...
44952,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral,Asst Prof of Economics was on talking about he...
44953,Gov need to do somethings instead of biar je r...,Extremely Negative,Gov need to do somethings instead of biar je r...


## Convert the filtered corpus into word lists and sentence lists

### Word-level
* Lowercase the text
* Strip punctuation; hashtag words are kept, only the `#` is removed
* Filter out stop words via nltk
* Lemmatization via WordNetLemmatizer
* Drop rows with fewer than 5 remaining words

In [6]:
stop_words = set(stopwords.words('english'))
tw_tokenizer = TweetTokenizer()
df["Word_list"] = df["Tweet_filtered"].apply(
    lambda x: x.lower()
).apply(
    lambda x: remove_punctuation(x)
).apply(
    lambda x: tw_tokenizer.tokenize(x)
).apply(
    lambda x: [item for item in x if item not in stop_words]
)
wnl = WordNetLemmatizer()
df["Word_list"] = df["Word_list"].apply(
    lambda x: [wnl.lemmatize(item) for item in x]
)

In [7]:
df["Effective_word_count"] = df["Word_list"].apply(
    lambda x: len(x)
)
df = df.loc[df["Effective_word_count"] >= 5,["OriginalTweet", "Sentiment", "Tweet_filtered", "Word_list"]]
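The length filter can also be written without a helper column. This is a minimal sketch on made-up rows, assuming the `Word_list` cells are Python lists as above; `toy` and `MIN_WORDS` are illustrative names.

```python
import pandas as pd

# Made-up rows for illustration; real rows come from the tokenization step.
toy = pd.DataFrame({
    "Sentiment": ["Positive", "Neutral"],
    "Word_list": [["food", "stock", "one", "empty", "please"], ["ok"]],
})

MIN_WORDS = 5  # same threshold as the cell above

# Series.str.len() also works on list-valued cells, so no apply/lambda is needed.
kept = toy[toy["Word_list"].str.len() >= MIN_WORDS]
print(kept)
```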

In [8]:
df

Unnamed: 0,OriginalTweet,Sentiment,Tweet_filtered,Word_list
1,advice Talk to your neighbours family to excha...,Positive,advice Talk to your neighbours family to excha...,"[advice, talk, neighbour, family, exchange, ph..."
2,Coronavirus Australia: Woolworths to give elde...,Positive,Coronavirus Australia: Woolworths to give elde...,"[coronavirus, australia, woolworth, give, elde..."
3,My food stock is not the only one which is emp...,Positive,My food stock is not the only one which is emp...,"[food, stock, one, empty, please, dont, panic,..."
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,"Me, ready to go at supermarket during the #COV...","[ready, go, supermarket, covid, outbreak, im, ..."
5,As news of the regionÂs first confirmed COVID...,Positive,As news of the regionÂs first confirmed COVID...,"[news, regionâ, , first, confirmed, covid, ca..."
...,...,...,...,...
44950,Meanwhile In A Supermarket in Israel -- People...,Positive,Meanwhile In A Supermarket in Israel -- People...,"[meanwhile, supermarket, israel, people, dance..."
44951,Did you panic buy a lot of non-perishable item...,Negative,Did you panic buy a lot of non-perishable item...,"[panic, buy, lot, nonperishable, item, echo, n..."
44952,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral,Asst Prof of Economics was on talking about he...,"[asst, prof, economics, talking, recent, resea..."
44953,Gov need to do somethings instead of biar je r...,Extremely Negative,Gov need to do somethings instead of biar je r...,"[gov, need, somethings, instead, biar, je, rak..."


### Sentence-level
* Lowercase the text
* Lemmatization via WordNetLemmatizer

In [9]:
df["Senten_list"] = df["OriginalTweet"].apply(
    lambda x: sent_tokenize(x)
).apply(
    lambda x: [remove_urls(item) for item in x]
).apply(
    lambda x: [remove_at(item) for item in x]
).apply(
    lambda x: [remove_html(item) for item in x]
).apply(
    lambda x: [remove_num(item) for item in x]
)

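# Note: wnl.lemmatize expects a single word, so a multi-word sentence passed
# to it comes back unchanged; in practice only lowercasing and punctuation
# removal take effect here.
df["Senten_list_filtered"] = df["Senten_list"].apply(
    lambda x: [item.lower() for item in x]
).apply(
    lambda x: [wnl.lemmatize(item) for item in x]
).apply(
    lambda x: [remove_punctuation(item) for item in x]
)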
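`WordNetLemmatizer.lemmatize` works on one word at a time, so lemmatizing sentences needs a per-token pass. The sketch below shows one way to do that. `lemmatize_sentence` and the toy dictionary are hypothetical names added here so the example runs without NLTK data; in the notebook one would pass `wnl.lemmatize` as the second argument.

```python
def lemmatize_sentence(sentence, lemmatize_word):
    """Apply a word-level lemmatizer to each whitespace-separated token."""
    return ' '.join(lemmatize_word(token) for token in sentence.split())

# Hypothetical stand-in for a real lemmatizer (e.g. wnl.lemmatize).
toy_lemmas = {"neighbours": "neighbour", "numbers": "number"}

def toy_lemmatize(word):
    return toy_lemmas.get(word, word)

sentences = ["talk to your neighbours", "exchange phone numbers"]
print([lemmatize_sentence(s, toy_lemmatize) for s in sentences])
```

Whitespace splitting is the simplest choice here; a tokenizer such as `TweetTokenizer` could replace `sentence.split()` if punctuation has not been stripped yet.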

In [10]:
df

Unnamed: 0,OriginalTweet,Sentiment,Tweet_filtered,Word_list,Senten_list,Senten_list_filtered
1,advice Talk to your neighbours family to excha...,Positive,advice Talk to your neighbours family to excha...,"[advice, talk, neighbour, family, exchange, ph...",[advice Talk to your neighbours family to exch...,[advice talk to your neighbours family to exch...
2,Coronavirus Australia: Woolworths to give elde...,Positive,Coronavirus Australia: Woolworths to give elde...,"[coronavirus, australia, woolworth, give, elde...",[Coronavirus Australia: Woolworths to give eld...,[coronavirus australia woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,My food stock is not the only one which is emp...,"[food, stock, one, empty, please, dont, panic,...",[My food stock is not the only one which is em...,[my food stock is not the only one which is em...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,"Me, ready to go at supermarket during the #COV...","[ready, go, supermarket, covid, outbreak, im, ...","[Me, ready to go at supermarket during the #CO...",[me ready to go at supermarket during the covi...
5,As news of the regionÂs first confirmed COVID...,Positive,As news of the regionÂs first confirmed COVID...,"[news, regionâ, , first, confirmed, covid, ca...",[As news of the regionÂs first confirmed COVI...,[as news of the regionâs first confirmed covi...
...,...,...,...,...,...,...
44950,Meanwhile In A Supermarket in Israel -- People...,Positive,Meanwhile In A Supermarket in Israel -- People...,"[meanwhile, supermarket, israel, people, dance...",[Meanwhile In A Supermarket in Israel -- Peopl...,[meanwhile in a supermarket in israel people ...
44951,Did you panic buy a lot of non-perishable item...,Negative,Did you panic buy a lot of non-perishable item...,"[panic, buy, lot, nonperishable, item, echo, n...",[Did you panic buy a lot of non-perishable ite...,[did you panic buy a lot of nonperishable item...
44952,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral,Asst Prof of Economics was on talking about he...,"[asst, prof, economics, talking, recent, resea...",[Asst Prof of Economics was on talking about h...,[asst prof of economics was on talking about h...
44953,Gov need to do somethings instead of biar je r...,Extremely Negative,Gov need to do somethings instead of biar je r...,"[gov, need, somethings, instead, biar, je, rak...",[Gov need to do somethings instead of biar je ...,[gov need to do somethings instead of biar je ...


In [11]:
for item in df["Senten_list_filtered"][:10]:
    print(item)

['advice talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist gp set up online shopping accounts if poss adequate supplies of regular meds but not over order']
['coronavirus australia woolworths to give elderly disabled dedicated shopping hours amid covid outbreak ']
['my food stock is not the only one which is empty', 'please dont panic there will be enough food for everyone if you do not take more than you need', 'stay calm stay safe', 'covidfrance covid covid coronavirus confinement confinementotal confinementgeneral ']
['me ready to go at supermarket during the covid outbreak', 'not because im paranoid but because my food stock is litteraly empty', 'the coronavirus is a serious thing but please dont panic', 'it causes shortage\r\r\n\r\r\ncoronavirusfrance restezchezvous stayathome confinement ']
['as news of the regionâ\x92s first confirmed covid case came out of sullivan county last week people flock

### Document-level
* Lowercase the text
* Lemmatization via WordNetLemmatizer

In [12]:
df["Tweet_filtered"] = df["Tweet_filtered"].apply(
    lambda x: x.lower()
).apply(
    lambda x: remove_punctuation(x)
).apply(
    lambda x: tw_tokenizer.tokenize(x)
).apply(
    lambda x: [item for item in x if item not in stop_words]
).apply(
    lambda x: [wnl.lemmatize(item) for item in x]
).apply(
    lambda x: ' '.join(x)
)

### Check for missing values

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44251 entries, 1 to 44954
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   OriginalTweet         44251 non-null  object
 1   Sentiment             44251 non-null  object
 2   Tweet_filtered        44251 non-null  object
 3   Word_list             44251 non-null  object
 4   Senten_list           44251 non-null  object
 5   Senten_list_filtered  44251 non-null  object
dtypes: object(6)
memory usage: 2.4+ MB


In [14]:
df.isna().sum().sum()

0

### Output filtered data

In [15]:
df = df.reset_index(drop=True)
df

Unnamed: 0,OriginalTweet,Sentiment,Tweet_filtered,Word_list,Senten_list,Senten_list_filtered
0,advice Talk to your neighbours family to excha...,Positive,advice talk neighbour family exchange phone nu...,"[advice, talk, neighbour, family, exchange, ph...",[advice Talk to your neighbours family to exch...,[advice talk to your neighbours family to exch...
1,Coronavirus Australia: Woolworths to give elde...,Positive,coronavirus australia woolworth give elderly d...,"[coronavirus, australia, woolworth, give, elde...",[Coronavirus Australia: Woolworths to give eld...,[coronavirus australia woolworths to give elde...
2,My food stock is not the only one which is emp...,Positive,food stock one empty please dont panic enough ...,"[food, stock, one, empty, please, dont, panic,...",[My food stock is not the only one which is em...,[my food stock is not the only one which is em...
3,"Me, ready to go at supermarket during the #COV...",Extremely Negative,ready go supermarket covid outbreak im paranoi...,"[ready, go, supermarket, covid, outbreak, im, ...","[Me, ready to go at supermarket during the #CO...",[me ready to go at supermarket during the covi...
4,As news of the regionÂs first confirmed COVID...,Positive,news regionâ  first confirmed covid case came...,"[news, regionâ, , first, confirmed, covid, ca...",[As news of the regionÂs first confirmed COVI...,[as news of the regionâs first confirmed covi...
...,...,...,...,...,...,...
44246,Meanwhile In A Supermarket in Israel -- People...,Positive,meanwhile supermarket israel people dance sing...,"[meanwhile, supermarket, israel, people, dance...",[Meanwhile In A Supermarket in Israel -- Peopl...,[meanwhile in a supermarket in israel people ...
44247,Did you panic buy a lot of non-perishable item...,Negative,panic buy lot nonperishable item echo need foo...,"[panic, buy, lot, nonperishable, item, echo, n...",[Did you panic buy a lot of non-perishable ite...,[did you panic buy a lot of nonperishable item...
44248,Asst Prof of Economics @cconces was on @NBCPhi...,Neutral,asst prof economics talking recent research co...,"[asst, prof, economics, talking, recent, resea...",[Asst Prof of Economics was on talking about h...,[asst prof of economics was on talking about h...
44249,Gov need to do somethings instead of biar je r...,Extremely Negative,gov need somethings instead biar je rakyat ass...,"[gov, need, somethings, instead, biar, je, rak...",[Gov need to do somethings instead of biar je ...,[gov need to do somethings instead of biar je ...


In [16]:
df.to_csv("../data/Corona_NLP_filtered.csv")
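One thing to keep in mind when reloading this file: CSV stores list-valued cells as their string representation, so the list columns come back as plain strings. A minimal sketch of restoring them, using an in-memory buffer and made-up rows instead of the real file:

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"Word_list": [["panic", "buy"], ["stay", "safe"]]})

buffer = io.StringIO()   # stands in for the CSV file on disk
toy.to_csv(buffer)       # list cells are written as text such as "['panic', 'buy']"
buffer.seek(0)

# ast.literal_eval parses the text back into a Python list, column by column.
restored = pd.read_csv(
    buffer,
    index_col=0,
    converters={"Word_list": ast.literal_eval},
)
print(type(restored.loc[0, "Word_list"]))
```

Passing `index_col=0` also avoids an extra `Unnamed: 0` column, since `to_csv` writes the index by default.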