# Parsing text files and text preprocessing of Covid19 Tweets

In this project, using only a few simple Python libraries (`re`, `os` and `langid`), I parse a very large folder containing thousands of semi-structured text files, each of which contains numerous tweets related to Covid19.

## Table of Contents
1. [Parsing Text Files](#1)
2. [Text Preprocessing](#2)

## 1. Parsing Text Files <a class="anchor" id="1"></a>

In this section, I extract data from the semi-structured text files in the `Covid19Tweets` folder. Each text file stores tweets together with their "id", "text" and "created_at" attributes. My task is to extract the data and transform it into XML format with the following elements:
- id: 19-digit number
- text: the actual tweet
- created_at: date and time that the tweet was created

The following specifications apply:
- The ids are unique, so if the same tweet appears more than once, I keep only one copy in the final XML file.
- Non-English tweets are filtered out, so the final XML file contains only tweets in English.
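
The uniqueness requirement can be illustrated with a minimal sketch. The helper name `keep_first_by_id` and the sample records are made up for illustration and are not part of the project code; the language filter relies on `langid` and is left out here.

```python
def keep_first_by_id(tweets):
    """Return the tweets with duplicate ids removed, keeping the first occurrence."""
    seen_ids = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen_ids:
            seen_ids.add(tweet["id"])
            unique.append(tweet)
    return unique


# Made-up sample records (not taken from the dataset)
sample = [
    {"id": "1000000000000000001", "text": "Stay safe", "created_at": "2020-03-22T04:33:18.000Z"},
    {"id": "1000000000000000002", "text": "Wash your hands", "created_at": "2020-03-22T04:33:19.000Z"},
    {"id": "1000000000000000001", "text": "Stay safe", "created_at": "2020-03-22T04:33:18.000Z"},
]

unique_tweets = keep_first_by_id(sample)
print(len(unique_tweets))  # 2
```

Tracking the ids in a set keeps each membership test cheap, which matters once thousands of files are processed.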

In [1]:
#Import libraries
import re
import langid
import os

In [3]:
import ast

# Relative path to the folder that contains all the tweet text files
dir_path = "./Covid19Tweets"

# Map each date to the list of tweet dictionaries collected for that day
tweet_dict = {}
# Ids that have already been stored, so that each tweet is kept only once
seen_ids = set()
required_keys = ('"text"', '"id"', '"created_at"')

for filename in os.listdir(dir_path):
    tweet_list = []
    # The records are taken from the last line of each file
    content = ""
    with open(os.path.join(dir_path, filename), encoding="UTF-8") as f:
        for line in f:
            content = line

    # Use regex to extract all the smaller dictionaries (still in string form) into a list
    records = re.findall(r"{(?:(?!\"data\")).+?}", content)

    # Filter out corrupted records, i.e. those missing any of the three attributes
    records = [rec for rec in records if all(key in rec for key in required_keys)]

    # Retain only the records that langid classifies as English
    records = [rec for rec in records if langid.classify(rec)[0] == 'en']

    # Convert each record string into a proper dictionary.
    # Records with unescaped meta characters cannot be evaluated and are skipped.
    for rec in records:
        try:
            dictionary = ast.literal_eval(rec)
        except Exception:
            continue
        if dictionary["id"] not in seen_ids:
            seen_ids.add(dictionary["id"])
            tweet_list.append(dictionary)

    # The date is the first 10 characters of the filename
    tweet_date = filename[:10]

    # One day can be spread over several files, so lists that belong to
    # the same day are merged under the same key
    tweet_dict.setdefault(tweet_date, []).extend(tweet_list)
# Takes about 10 minutes to run, since there are thousands of text files
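
To see what the regular expression in the cell above extracts, here is a minimal sketch on a made-up sample line. The `{"data": [...]}` layout is an assumption inferred from the pattern itself, not a confirmed description of the files, and `ast.literal_eval` stands in for `eval()` because it only accepts Python literals.

```python
import ast
import re

# Made-up sample line mimicking the assumed file layout (not real data)
sample_line = (
    '{"data": ['
    '{"id": "1000000000000000001", "text": "Stay home", "created_at": "2020-03-22T04:33:18.000Z"}, '
    '{"id": "1000000000000000002", "text": "Wash your hands", "created_at": "2020-03-22T04:33:19.000Z"}'
    ']}'
)

# The negative lookahead skips the outer {"data": ...} wrapper,
# and the lazy .+? stops at the first closing brace
records = re.findall(r"{(?:(?!\"data\")).+?}", sample_line)

# Turn each record string into a dictionary
tweets = [ast.literal_eval(record) for record in records]

print(len(records))     # 2
print(tweets[0]["id"])  # 1000000000000000001
```

Because the match ends at the first `}`, a tweet whose text itself contains a closing brace would be cut short by this pattern.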

Next, we have to deal with surrogate pairs. We convert them into their emoji forms and then check again with `langid` whether each tweet is classified as English. Only the tweets classified as English are retained.

In [6]:
for day, tweets in tweet_dict.items():
    english_tweets = []
    for tweet in tweets:
        # Merge each surrogate pair into the single character it represents
        tweet['text'] = tweet['text'].encode('utf-16', 'surrogatepass').decode('utf-16')
        # Keep the tweet only if it is still classified as English
        if langid.classify(tweet['text'])[0] == 'en':
            english_tweets.append(tweet)
    tweet_dict[day] = english_tweets
# Takes about 5 minutes to run
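
To make the conversion in the cell above concrete, here is a small sketch on a made-up string. The two escaped code units form a surrogate pair, and the round trip through UTF-16 with the `surrogatepass` error handler merges them into a single emoji character.

```python
# Two separate surrogate code points, as they appear when "\ud83d\ude00"
# is evaluated as a Python string literal
raw = "Stay home \ud83d\ude00"

# Encoding with 'surrogatepass' writes the surrogates as UTF-16 code units;
# decoding then reads the pair back as one character
fixed = raw.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(raw), len(fixed))             # 12 11
print(fixed == "Stay home \U0001F600")  # True
```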

We take another look at the modified `tweet_dict` dictionary:

In [9]:
tweet_dict['2020-03-22']

[{'text': 'More than a dozen NYC inmates test positive for COVID-19 https://t.co/v9ZqTL2fCu',
  'id': '1241583710194950145',
  'created_at': '2020-03-22T04:33:18.000Z'},
 {'text': "@shytigress @dharmvirjangra9 @GenDADange @GenPanwar @cdrcshekhar @narendravarma49 @JaganNKaushik @URRao10 @nutan_jyot @IndiaKaPrahari @BHARATMACHINE99 @NaniBellary @nalini51purohit @WishMaster2019 @Bharatwashi1 @gouranga1964 @SethiVed @KEYESEN2000 @sinhrann @RulesElsa @J_o_l_i_e @venkatarat @surewrap @Savitritvs @RBhamaria @Kumaran92023000 @Drsunandambal @ravi_sec @kailashkaushik8 @UnchaTiranga @BillionIndian @roydebasis @1PM Boris Johnson tells Britons not to visit parents on Mother's Day because of #coronavirus\n\nBoris Johnson\xa0has urged the British public not to visit their parents on\xa0Mother’s Day\xa0as he warned that the\xa0NHS\xa0was in danger of being “overwhelmed”\n https://t.co/2P8VsDQFvq",
  'id': '1241583710396272643',
  'created_at': '2020-03-22T04:33:18.000Z'},
 {'text': 'Please Stay at Hom

We can see that the surrogate pairs have been converted into their emoji forms. Now we can transform the data into XML format. Calling `encode('ascii', 'xmlcharrefreplace')` replaces every non-ASCII character with a numeric character reference, and `decode("utf-8")` turns the resulting bytes back into a string that can be written to the XML file.
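
As a minimal sketch of this encoding step, the made-up string below is converted to ASCII with character references and then parsed back with `xml.etree.ElementTree` (used here only for the check, not by the project code) to confirm that the original characters are recovered.

```python
import xml.etree.ElementTree as ET

# Made-up tweet text with an emoji and an accented letter
text = "Stay home \U0001F600 caf\u00e9"

# Non-ASCII characters become numeric character references
ascii_text = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_text)  # Stay home &#128512; caf&#233;

# An XML parser resolves the references back to the original characters
element = ET.fromstring("<tweet>" + ascii_text + "</tweet>")
print(element.text == text)  # True
```

Only non-ASCII characters are replaced by this error handler; ASCII markup characters such as `&` and `<` are left as they are.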

In [8]:
with open("Covid19Tweets_parsed.xml", 'w', encoding='utf-8') as outfile:
    outfile.write('<?xml version="1.0" encoding="utf-8"?>\n')
    outfile.write('<data>\n')

    # Write one <tweets> element per day
    for day, tweets in tweet_dict.items():
        outfile.write(f'<tweets date="{day}">')
        for tweet in tweets:
            # Replace non-ASCII characters with numeric character references
            text = tweet['text'].encode('ascii', 'xmlcharrefreplace').decode('utf-8')
            outfile.write(f'<tweet id="{tweet["id"]}">{text}</tweet>')
        outfile.write('</tweets>')

    outfile.write('</data>')

## 2. Text Preprocessing <a class="anchor" id="2"></a>