# Parsing text files and text preprocessing of Covid19 Tweets

Text documents, such as crawled web data, usually consist of topically coherent segments. Within
each segment, word usage is expected to follow a more consistent lexical distribution than it
does across the whole data set. A linear partition of texts into topic segments can be used for
text analysis tasks such as passage retrieval in information retrieval (IR), document
summarization, recommender systems, and learning-to-rank methods.

This project consists of 2 main tasks.

In the first task, I extract data from a very large number of semi-structured text files, each containing thousands of tweets related to Covid19. Then I transform the extracted data into XML format, following some pre-specified standards.

In the second task, I preprocess a large number of tweets and convert them into numerical representations that are suitable as input for recommender-system and information-retrieval algorithms.

## Table of Contents
1. [Parsing Text Files](#1)
2. [Text Preprocessing](#2)

## 1. Parsing Text Files <a class="anchor" id="1"></a>

In this section, I extract data from the semi-structured text files in the `Covid19Tweets` folder. Each text file contains information about tweets in attributes such as "id", "text" and "created_at". My task is to extract the data and transform it into XML format with the following elements:
- id: 19-digit number
- text: the actual tweet
- created_at: date and time at which the tweet was created

To parse the data into XML format correctly, we need to understand the structure of an XML file, as well as how to write emoji to XML, since many tweets contain emoji, which cannot be written out the way ordinary text can.

The specifications are as follows:
- The ids are unique, so if there are multiple instances of the same tweet, I keep only 1 of them in the final XML file.
- Non-English tweets are filtered out of the dataset, so the final XML contains only tweets in English.

Later on, I realized that there are surrogate pairs that need to be handled correctly, so that they can be converted into their proper emoji forms.
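To make the surrogate-pair issue concrete, here is a minimal sketch with a made-up string (the sample text is an assumption, not taken from the dataset). It shows two separate surrogate code points being merged into the single emoji character they encode.

```python
# Hypothetical tweet text in which an emoji arrived as two separate surrogate escapes
raw = "so funny \ud83d\ude02"

# Encoding with 'surrogatepass' lets the lone surrogates through as UTF-16 code units;
# decoding the bytes again merges the pair into one character
fixed = raw.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(raw), len(fixed))               # the fixed string is one character shorter
print(fixed == "so funny \U0001F602")     # True
```

The same round trip leaves text without surrogates unchanged, so it should be safe to apply to every tweet.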

In [1]:
#Import libraries
import re
import langid
import os

In [3]:
#ast.literal_eval is a safer alternative to eval() for parsing literal dictionaries
import ast

#Create the relative path to the folder that contains all the tweet text files
dir_path="./Covid19Tweets"

#Create an empty dictionary that maps each date to its list of tweet dictionaries
tweet_dict={}

#Keep track of the tweet ids already stored, so that each id is kept only once
seen_ids=set()

for filename in os.listdir(dir_path):
    tweet_list=[]

    #Read the whole file once; the context manager closes it afterwards
    with open(os.path.join(dir_path,filename),"r",encoding="UTF-8") as f:
        content=f.read()

    #Use regex to extract all the smaller dictionaries (still in string form) into a list
    records=re.findall(r"{(?:(?!\"data\")).+?}",content)

    #Filter out corrupted tweets: a valid record must contain all three attributes
    records=[rec for rec in records
             if "\"text\"" in rec and "\"id\"" in rec and "\"created_at\"" in rec]

    #Retain only tweets that are classified as English
    records=[rec for rec in records if langid.classify(rec)[0]=='en']

    #Convert each element in the list into a proper dictionary.
    #Some entries contain unescaped meta characters and cannot be parsed, so they are skipped
    for rec in records:
        try:
            dictionary=ast.literal_eval(rec)
        except (ValueError,SyntaxError):
            continue
        if dictionary["id"] not in seen_ids:
            seen_ids.add(dictionary["id"])
            tweet_list.append(dictionary)

    #Get the proper date, which is the first 10 characters of the filename
    tweet_date=filename[:10]

    #For each day as a key, the corresponding value is the list of dictionaries created above.
    #One day can be spread over multiple files, so if the day already exists,
    #we append the new dictionaries to the list of that day
    tweet_dict.setdefault(tweet_date,[]).extend(tweet_list)
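The extraction step can be tried on a small sample. The sketch below assumes a file layout in which the records sit inside an outer `{"data": [...]}` wrapper; the sample string is made up for illustration. It uses `ast.literal_eval`, a stricter standard-library alternative to `eval()`, and a set to keep one tweet per id.

```python
import ast
import re

# Hypothetical sample that mimics one line of a tweet file (assumed layout)
content = ('{"data":[{"id":"1","text":"hello","created_at":"2020-03-22"},'
           '{"id":"2","text":"bye","created_at":"2020-03-22"},'
           '{"id":"1","text":"hello","created_at":"2020-03-22"}]}')

# Same pattern as in the cell above: skip the outer {"data": ...} wrapper
# and match each inner {...} record lazily up to its first closing brace
records = re.findall(r"{(?:(?!\"data\")).+?}", content)

# Parse each record and keep only the first occurrence of every id
seen_ids, tweets = set(), []
for rec in records:
    try:
        tweet = ast.literal_eval(rec)
    except (ValueError, SyntaxError):
        continue  # skip records that are not valid literals
    if tweet["id"] not in seen_ids:
        seen_ids.add(tweet["id"])
        tweets.append(tweet)

print(len(records), [t["id"] for t in tweets])
```

Because the match stops at the first `}`, a tweet whose text itself contains a closing brace would be cut short and then fail to parse, which is one reason to expect some unparseable records.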

Next, we have to deal with surrogate pairs. We need to convert them into their emoji forms and then check again with langid whether each tweet is classified as English. We retain only the tweets that are classified as English.

In [6]:
for day in tweet_dict.keys():
    english_tweets=[]
    for tweet in tweet_dict[day]:
        #Merge each surrogate pair into the single character (emoji) that it encodes
        tweet['text']=tweet['text'].encode('utf-16','surrogatepass').decode('utf-16')
        #Classify the converted text again and keep the tweet only if it is English
        if langid.classify(tweet['text'])[0]=='en':
            english_tweets.append(tweet)
    tweet_dict[day]=english_tweets

We take another look at the modified `tweet_dict` dictionary.

In [10]:
tweet_dict['2020-03-22'][:20]

[{'text': 'More than a dozen NYC inmates test positive for COVID-19 https://t.co/v9ZqTL2fCu',
  'id': '1241583710194950145',
  'created_at': '2020-03-22T04:33:18.000Z'},
 {'text': "@shytigress @dharmvirjangra9 @GenDADange @GenPanwar @cdrcshekhar @narendravarma49 @JaganNKaushik @URRao10 @nutan_jyot @IndiaKaPrahari @BHARATMACHINE99 @NaniBellary @nalini51purohit @WishMaster2019 @Bharatwashi1 @gouranga1964 @SethiVed @KEYESEN2000 @sinhrann @RulesElsa @J_o_l_i_e @venkatarat @surewrap @Savitritvs @RBhamaria @Kumaran92023000 @Drsunandambal @ravi_sec @kailashkaushik8 @UnchaTiranga @BillionIndian @roydebasis @1PM Boris Johnson tells Britons not to visit parents on Mother's Day because of #coronavirus\n\nBoris Johnson\xa0has urged the British public not to visit their parents on\xa0Mother’s Day\xa0as he warned that the\xa0NHS\xa0was in danger of being “overwhelmed”\n https://t.co/2P8VsDQFvq",
  'id': '1241583710396272643',
  'created_at': '2020-03-22T04:33:18.000Z'},
 {'text': 'Please Stay at Hom

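We can see that the surrogate pairs have been converted into their emoji forms. Now we can start to transform the data into XML format. To write the tweets above into the XML file, we call `encode('ascii', 'xmlcharrefreplace')`, which replaces every non-ASCII character with an XML character reference, and then `decode("utf-8")` to turn the resulting bytes back into a string.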

In [8]:
#The context manager closes the file even if an error occurs while writing
with open("Covid19Tweets_parsed.xml",'w',encoding='utf-8') as outfile:
    outfile.write('<?xml version="1.0" encoding="utf-8"?>\n')
    outfile.write('<data>\n')

    #Write one <tweets> element per day and one <tweet> element per tweet
    for day,tweets in tweet_dict.items():
        outfile.write('<tweets date="'+day+'">')
        for tweet in tweets:
            #Replace every non-ASCII character, such as an emoji, with an XML character reference
            text=tweet['text'].encode('ascii','xmlcharrefreplace')
            text=text.decode('utf-8')
            outfile.write('<tweet id="'+tweet['id']+'">'+text+'</tweet>')
        outfile.write('</tweets>')

    outfile.write('</data>')
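The character-reference step can be checked in isolation. The sketch below uses a made-up tweet text and parses the result back with the standard library to confirm that the emoji survives. It also applies `xml.sax.saxutils.escape` first, which is an addition for illustration: it is only needed if the tweet text may contain raw `&`, `<` or `>`, and it would double-escape text that is already escaped.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

text = "masks < 2m & \U0001F637"  # hypothetical tweet text with an emoji

# Escape the XML special characters first, then turn non-ASCII characters into references
safe = escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")
print(safe)  # masks &lt; 2m &amp; &#128567;

# Parsing the element back recovers the original text, emoji included
element = ET.fromstring('<tweet id="1">' + safe + '</tweet>')
print(element.text == text)  # True
```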

## 2. Text Preprocessing <a class="anchor" id="2"></a>

In this section, I work with a second data file, an Excel workbook that contains 80+ sheets of tweets with 2000 tweets per sheet. I generate the corpus vocabulary and sort it alphabetically. Afterwards, for each sheet, I compute the 100 most frequent unigrams and the 100 most frequent bigrams. Lastly, I generate the sparse representation of the Excel file.

In [11]:
#Import libraries

#For reading dataframe
import pandas as pd  

#Natural Language Toolkit
import nltk

#for tokenization
from nltk import RegexpTokenizer

#for stemming
from nltk.stem import PorterStemmer

#for computing the frequency distribution of a set of word tokens
from nltk.probability import FreqDist

#for flattening the per-sheet token lists into one list
from itertools import chain

#for extracting n-grams
from nltk.util import ngrams

#for ensuring bigrams are not split into 2 words
from nltk.tokenize import MWETokenizer

#for creating sparse matrix
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
#Load data
excel_data=pd.ExcelFile('Covid19Tweets_Part2.xlsx')
excel_data

<pandas.io.excel._base.ExcelFile at 0x266199c2088>

In [13]:
#There are 81 sheets in this excel file
len(excel_data.sheet_names)

81

Next, we create the stopword list from the `stopwords_en.txt` file.

In [14]:
#Store the stopwords in a set so that membership checks are fast
with open('stopwords_en.txt') as f:
    stopwords_list=set(f.read().splitlines())

In [17]:
#We create 2 empty dictionaries and an empty list to store values later
mydict,mybidict,mylist={},{},[]

#Tokens are runs of letters, optionally followed by one hyphen or apostrophe and more letters
tokenizer=RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

#We will loop over all the sheets in the excel_data
for sheet in excel_data.sheet_names:
    dataset=excel_data.parse(sheet)

    #Drop the empty columns and the incomplete rows to shift the dataframe to the correct position
    dataset=dataset.dropna(axis=1,thresh=1)
    df=dataset.dropna()

    #Some dataframes already have the correct column header, so only fix the header when 'text' is missing
    if 'text' not in df.columns:
        #Make the first row the column header and remove the redundant first row afterwards
        df.columns=df.iloc[0]
        df=df[1:]

    #Only retain tweets that are classified as English
    is_english=df['text'].apply(lambda x:langid.classify(str(x))[0]=='en').astype(bool)
    df=df[is_english]

    #Join all the rows in the 'text' column into one string and perform case normalization.
    #Joining the raw strings avoids the escape sequences that str() of a list would add
    textstr=' '.join(str(tweet) for tweet in df['text']).lower()

    #Use the regex tokenizer on the newly created string
    tokens=tokenizer.tokenize(textstr)

    #Add all the tokens created into mylist, which will be used to create bigrams later
    mylist.extend(tokens)

    #Store the sheet name and all its tokens in mybidict, which is used to find the top frequent bigrams later on
    mybidict[sheet]=list(tokens)

    #Retain the list of tokens after removing stopwords
    ind_filtered_tokens=[token for token in tokens if token not in stopwords_list]

    #Store this sheet name and list of tokens as a key-value pair in mydict
    mydict[sheet]=ind_filtered_tokens
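To see what the tokenization pattern keeps and drops, the sketch below applies the same regular expression with `re.findall` to a made-up sentence. This assumes that `RegexpTokenizer.tokenize` returns the same matches as `re.findall` for a pattern without capturing groups.

```python
import re

# Same pattern as in the cell above: letters, optionally followed by
# ONE hyphen or apostrophe and more letters
pattern = r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?"

# Hypothetical sample text, lower-cased as in the preprocessing loop
sample = "Stay-at-home order isn't over, COVID19!".lower()
tokens = re.findall(pattern, sample)
print(tokens)  # ['stay-at', 'home', 'order', "isn't", 'over', 'covid']
```

Note that only one hyphen is allowed inside a token, so "stay-at-home" is split into "stay-at" and "home", and digits are dropped, so "covid19" becomes "covid".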

Now we move on to remove context-dependent stop words, then perform stemming and remove tokens that are very short.

In [18]:
#Create words_df to record the number of documents each word appears in, by counting each word only once per document
words_df=list(chain.from_iterable([set(value) for value in mydict.values()]))

#Create words_tf to hold every occurrence of each word across all documents
words_tf=list(chain.from_iterable(mydict.values()))

#This gives the document frequency of each word
doc_freq=FreqDist(words_df)

#Keep only the words that appear in at least 5 and at most 60 documents
kept_words={k:v for k,v in doc_freq.items() if 5<=v<=60}

#Get all the occurrences of the words collected above into a new list fil_list
fil_list=[i for i in words_tf if i in kept_words]

#Perform stemming on each of these words in the newly created list fil_list
stemmer=PorterStemmer()
stem_fil_token=[stemmer.stem(i) for i in fil_list]

#Retain only tokens with a length of at least 3 in a new list filtered_token
filtered_token=[token for token in stem_fil_token if len(token)>=3]
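The difference between document frequency and term frequency in the cell above can be sketched with the standard library on toy data. The sheet names, tokens and thresholds below are made up for illustration; the cell above uses the thresholds 5 and 60.

```python
from collections import Counter
from itertools import chain

# Hypothetical token lists for three sheets
docs = {
    "sheet1": ["covid", "mask", "home"],
    "sheet2": ["covid", "mask", "mask"],
    "sheet3": ["covid", "vaccine"],
}

# Document frequency: each word counts once per sheet, hence the set()
doc_freq = Counter(chain.from_iterable(set(tokens) for tokens in docs.values()))

# Toy thresholds: drop words that are too common (3 sheets) or too rare (1 sheet)
min_df, max_df = 2, 2
kept_words = {word for word, df in doc_freq.items() if min_df <= df <= max_df}

# Keep every occurrence of the surviving words
kept_tokens = [t for t in chain.from_iterable(docs.values()) if t in kept_words]
print(doc_freq["covid"], doc_freq["mask"], kept_tokens)
```

Here "covid" appears in every sheet and behaves like a context-dependent stop word, while "home" and "vaccine" are too rare to be kept.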

Then we use the PMI (pointwise mutual information) measure to find the top 200 meaningful bigrams, join each of them with an underscore and add them to the `filtered_token` list above.

In [19]:
bigram_measures=nltk.collocations.BigramAssocMeasures()
finder=nltk.collocations.BigramCollocationFinder.from_words(mylist)
meaningful_bi=finder.nbest(bigram_measures.pmi,200)
for i in meaningful_bi:
    filtered_token.append(i[0]+'_'+i[1])
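For intuition about what the PMI score rewards, here is a minimal sketch that computes it by hand on a toy token list, using the usual definition log2(P(x,y) / (P(x)P(y))) with counts over the token stream. The token list is made up, and the details of NLTK's own implementation may differ.

```python
import math
from collections import Counter

# Hypothetical token stream
tokens = "stay home stay safe wash hands stay home".split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def pmi(w1, w2):
    """Pointwise mutual information of the bigram (w1, w2), in bits."""
    return math.log2(bigram_counts[(w1, w2)] * n / (word_counts[w1] * word_counts[w2]))

print(pmi("wash", "hands"))            # 3.0
print(round(pmi("stay", "home"), 3))   # 1.415
```

The pair that occurs only once scores higher than the pair that occurs twice, because PMI favors words that rarely appear apart. On a real corpus this tends to push very rare bigrams to the top, which is why a minimum-frequency filter such as the finder's `apply_freq_filter` is often applied before ranking.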

Then we sort the list in alphabetical order, and also retain only 1 entry per token to remove duplicates

In [20]:
sorted_list=sorted(filtered_token)
sorted_list=list(dict.fromkeys(sorted_list))

Lastly for this part, we create a text file to store the output. Each line contains a token in the list, followed by its index in the list

In [22]:
with open('Vocab.txt','w') as f:
    #sorted_list has no duplicates, so enumerate gives the same index as sorted_list.index
    for index,token in enumerate(sorted_list):
        f.write(token+':'+str(index)+'\n')

In [23]:
#Create a text file to store the top 100 frequent unigrams
with open ('top100_unigram.txt','w') as f:
    for day in mydict.keys():
        stemmed_value=[stemmer.stem(i) for i in mydict[day]]
        f.write(day +':'+ str(FreqDist(stemmed_value).most_common(100))+'\n')

In [24]:
#Create a text file to store the top 100 frequent bigrams
with open('top100_bigram.txt','w') as f:
    for day in mydict.keys():
        bigrams=ngrams(mybidict[day],n=2)
        f.write(day+':'+str(FreqDist(bigrams).most_common(100))+'\n')

Lastly, we generate the sparse representation of the data using count vectors.

In [25]:
vectorizer=CountVectorizer(analyzer='word')

Since we introduced 200 meaningful bigrams into our vocab, we need to use `MWETokenizer` to make sure those bigrams are not split into 2 individual words.

In [26]:
mwetokenizer=MWETokenizer(meaningful_bi)
colloc_tweet=dict((day,mwetokenizer.tokenize(tweet)) for day,tweet in mybidict.items())

Then we start a loop to write the text file that holds the sparse representation of the Excel file.

In [27]:
#Map each vocab token to its index once, instead of searching sorted_list repeatedly
vocab_index={token:idx for idx,token in enumerate(sorted_list)}

with open('Countvec.txt','w') as f:
    for day in colloc_tweet.keys():

        #For each loop, start the line with the sheet name, which is the key of colloc_tweet
        f.write(day+',')

        #Join all the tokens of that day into one big string
        mystr=' '.join(colloc_tweet[day])

        #Then transform the list of 1 element, the whole string, into a count vector
        data_features=vectorizer.fit_transform([mystr])
        counts=data_features.toarray()[0]

        #Get all the bigrams for each day (tokens that only exist after MWE tokenization)
        diff=set(colloc_tweet[day])-set(mybidict[day])

        #vocabulary_ maps every word to its column in the count vector,
        #so each word is paired with its own count
        index_counts={}
        for word,col in vectorizer.vocabulary_.items():

            #Only stem unigrams
            if word not in diff:
                word=stemmer.stem(word)

            #If the stemmed unigram or the bigram is in the vocab, add its count to that vocab index,
            #so that words sharing one stem are written as a single entry
            if word in vocab_index:
                idx=vocab_index[word]
                index_counts[idx]=index_counts.get(idx,0)+int(counts[col])

        #Write the index of each token, as given in the vocab, and its count
        f.write(','.join(str(idx)+':'+str(count) for idx,count in sorted(index_counts.items())))
        f.write('\n')
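The output format of the sparse representation can be sketched without any third-party library. The vocabulary, sheet name and tokens below are made up, and `collections.Counter` stands in for the count vectorizer purely for illustration.

```python
from collections import Counter

# Hypothetical sorted vocabulary, with one underscore-joined bigram
vocab = sorted({"covid", "home", "mask", "stay_home"})
vocab_index = {token: idx for idx, token in enumerate(vocab)}

# Hypothetical tokens of one sheet; "lockdown" is not in the vocabulary
tokens = ["covid", "stay_home", "covid", "lockdown"]

# Count every token, then keep only the ones that have a vocabulary index
counts = Counter(tokens)
pairs = sorted((vocab_index[t], c) for t, c in counts.items() if t in vocab_index)

# One line per sheet: the sheet name followed by index:count pairs
line = "sheet1," + ",".join(f"{idx}:{count}" for idx, count in pairs)
print(line)  # sheet1,0:2,3:1
```

Only the tokens that occur in a sheet are written, which is what makes the representation sparse: every index that is missing from a line has an implied count of 0.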