#### Date: 12/09/2020

##### Version: 1.0

#### Environment: Python 3.7.4 and Jupyter notebook

### Libraries used:

* os (for working with the system path and listing files, included in Anaconda Python 3.7)
* re (for regular expressions, included in Anaconda Python 3.7)
* langid (for language classification; a third-party package that may need to be installed separately, e.g. with `pip install langid`)


## 1.  Import libraries 

In [1]:
# Code to import libraries 
from langid import classify
import os 
import re

## 2. Parse the text files under /31224075/


* Use `os.listdir` to list all the files in the data directory.

* Keep only the `.txt` files and sort the file names, so that the files are processed in date order.

* The sorted `filelist` is then used to read the files.
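
The listing and sorting steps above can also be written in a compact form. The following is a minimal sketch rather than the notebook's own code: the helper name `list_txt_files`, the temporary directory, and the file names in it are made up for illustration.

```python
import os
import tempfile


def list_txt_files(directory):
    """Return the .txt file names in `directory`, sorted by name (hypothetical helper)."""
    # os.listdir gives no ordering guarantee, so the names are sorted explicitly.
    return sorted(f for f in os.listdir(directory) if f.endswith(".txt"))


# Demonstration on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["2020-03-23_001.txt", "2020-03-22_235.txt", "notes.md"]:
        open(os.path.join(tmp, name), "w").close()
    demo_files = list_txt_files(tmp)

print(demo_files)
```

Because the file names start with a date in year-month-day form, sorting them as plain strings also sorts them chronologically.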

In [2]:
%%time

# Initializing the data directory path
dir = "31224075/" 

# os.listdir returns the files in arbitrary order, so the file names are
# sorted explicitly to match the date logic used when writing the output.

# Ordering the files in the directory by file name, i.e. in date order
filelist = []
for file in os.listdir( dir ):
  if file.endswith( ".txt" ):
      filelist.append(file)
filelist.sort()  # sort file names


CPU times: user 1.91 ms, sys: 0 ns, total: 1.91 ms
Wall time: 2.47 ms


## 3. Initializing the variables used by the preprocessing steps

In [3]:
%%time
# Initializing the dictionaries, text holder and counters used below
tweet_dict = {}
date_dict = {}
final_text = ()
count_tweets = 0
count_english_tweets = 0

CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 9.78 µs


## 4. Steps and approach followed for the task below

1. With the files in hand, it was necessary to find the best way to store the data for the 2418 files. My initial approach was to keep two dictionaries, one mapping date to text and another mapping ID to text. However, to reduce the runtime and make the code more efficient, I changed the code to store the data in a single dictionary.

2. Another runtime improvement came from extracting the date from the file name instead of the created_at field, which added to the run time initially.

> FileName : 2020-03-22_235 --> The first 10 characters are the date, and this format is consistent across the list of files.

3. A "withheld" pattern was an exception found when I ran the file folder for submission. I added an extra regex to handle this situation.

4. After reading the file and its lines, start wrangling the data by matching the {} that enclose the data contents in the file.

The text data file given is in the format of : 

> {"data":[{"text":"DADO QUE A GRANDE MÍDIA E O PRÓPRIO  MISTÉRIO DA SAÚDE NÃO MOSTRA A POPULAÇÃO. https://t.co/XHsulXGJ2p", "created_at":"2020-06-07T15:08:30.000Z", "id":"1269647427436515332"},{"text":"@guardian Weekend #Fatality Hangover - 77 #Covid19 Died is a Lie, True Figs come out from Tues till nxt W/e #BorisJohnson #coronavirus #Crimesagainstsociety #Blamegame #buckstopsatthrTOP https://t.co/F7aQs2e8Ld", "created_at":"2020-06-07T15:08:30.000Z", "id":"1269647427562323970"},...,]} 

So the first pattern in the regex looks for everything inside the {} in the file lines.  
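
The brace-matching step can be tried on a small sample. This is a minimal sketch: the sample string is a shortened, made-up record in the same shape as the data shown above, and the names `sample`, `record_pattern` and `records` are illustrative only.

```python
import re

# Shortened, made-up sample in the same shape as the data shown above.
sample = (
    '{"data":[{"text":"hello world", "created_at":"2020-06-07T15:08:30.000Z", "id":"1"},'
    '{"text":"second tweet", "created_at":"2020-06-07T15:08:31.000Z", "id":"2"}]}'
)

# Lazy match from an opening brace up to the first closing brace.
record_pattern = r'\s*{(.*?)}'
records = [m.group(1) for m in re.finditer(record_pattern, sample)]

for record in records:
    print(record)
```

Note that the first record still carries the leading `"data":[{` prefix, because the match starts at the outermost brace. The later text and ID patterns search inside the record, so the prefix does no harm. A closing brace inside a tweet's text would end the match early, which is a limitation of this pattern.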

5. If the pattern matches, look for the text pattern in the matched group, and retrieve the final_text from its capturing group.

> Text Pattern - `"text":\s*"(.*?)(?<!\\)"`

* `"text":` matches the characters "text": literally (case sensitive)
* `\s` matches any whitespace character (equal to `[\r\n\t\f\v ]`)
* `*` quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
* `"` matches the character " literally (case sensitive)
* 1st capturing group `(.*?)`
    * `.` matches any character (except for line terminators)
    * `*?` quantifier: matches between zero and unlimited times, as few times as possible, expanding as needed (lazy)
* Negative lookbehind `(?<!\\)`: asserts that the regex below does not match immediately before the current position
    * `\\` matches the character \ literally (case sensitive)
* `"` matches the character " literally (case sensitive)

Text_pattern captures everything between the double quotes, ending at the first double quote that is not preceded by a backslash.


For example:

>  "text":"@guardian Weekend #Fatality Hangover - 77 #Covid19 Died is a Lie, True Figs come out from Tues till nxt W/e #BorisJohnson #coronavirus #Crimesagainstsociety #Blamegame #buckstopsatthrTOP https://t.co/F7aQs2e8Ld"

```python
final_text = text.group(1)
```

`text.group(1)` will then return:
  
  > @guardian Weekend #Fatality Hangover - 77 #Covid19 Died is a Lie, True Figs come out from Tues till nxt W/e #BorisJohnson #coronavirus #Crimesagainstsociety #Blamegame #buckstopsatthrTOP https://t.co/F7aQs2e8Ld
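
The effect of the negative lookbehind is easiest to see on a text that contains an escaped double quote. The sketch below uses a made-up record; `naive_pattern` is a hypothetical variant without the lookbehind, included only for comparison.

```python
import re

text_pattern = r'"text":\s*"(.*?)(?<!\\)"'
naive_pattern = r'"text":\s*"(.*?)"'  # hypothetical variant without the lookbehind

# Made-up record whose text contains escaped double quotes (\").
record = r'"text":"He said \"stay home\" today", "id":"42"'

captured = re.search(text_pattern, record).group(1)
naive = re.search(naive_pattern, record).group(1)

print(captured)  # the full text, escaped quotes included
print(naive)     # cut off at the first escaped quote
```

Without the lookbehind, the lazy group stops at the first double quote it meets, even when that quote is escaped and therefore part of the tweet.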
  
6. With the final text in hand, check whether it is classified as English using langid's classify method, and proceed to the further preprocessing steps only if it passes this condition.

> langid.py is a Standalone Language Identification (LangID) tool.

7. Similarly, look for the ID tag within the matched group for the {} pattern.

> ID Pattern: - `"id":\s*"(.*?)"`

8. Now that the string has been parsed from the text field, the emoji characters in the text also need to be handled.

9. The emoji characters are expressed as surrogate pairs. The term "surrogate pair" refers to a means of encoding Unicode characters with high code points in the UTF-16 encoding scheme.

10. The text read from the file does not contain actual surrogate pairs. Instead, it contains backslash-encoded code points for surrogate pairs, stored as plain text. Hence the text is first encoded to ASCII with the backslashreplace error handler.

> Backslashreplace is an error handler which replaces any unencodable characters with backslash escape sequences.

11. Next, decode the bytes with the special "unicode-escape" codec. This preserves the other characters in the string and turns each escape sequence into a character, so at this point the two surrogates are present in the text as two separate characters. Then encode the text to UTF-16 with the surrogatepass error handler, which lets the surrogate code points through, and decode it from UTF-16 again. This final decode interprets the surrogate pair as a single character.

12. Special characters in the text such as the ampersand, less-than, greater-than, double quote and apostrophe need to be escaped as &amp; &lt; &gt; &quot; and &apos; respectively.

13. Replace the special characters with their escaped XML forms. In the code below, only the ampersand, double quote and apostrophe are replaced; the less-than and greater-than replacements are commented out.

14. Finally, store the final text and the date as the value in the dictionary, with the ID as the key.
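
The emoji decoding described in the steps above can be followed one stage at a time. This is a minimal sketch on a made-up string; the intermediate names `ascii_bytes`, `with_surrogates` and `decoded` are illustrative and do not appear in the notebook, which chains the calls instead.

```python
# Text as read from the file: the emoji is spelled out as two
# backslash-escaped surrogate code units, not as a real character.
raw_text = r"Stay safe \ud83d\ude37 everyone"

# 1. Make the string pure ASCII; any real non-ASCII character would be
#    turned into a backslash escape as well.
ascii_bytes = raw_text.encode("ascii", "backslashreplace")

# 2. Interpret every backslash escape; this yields two lone surrogates.
with_surrogates = ascii_bytes.decode("unicode-escape")

# 3. Round-trip through UTF-16 so the pair is merged into one code point.
decoded = with_surrogates.encode("utf-16", "surrogatepass").decode("utf-16")

print(len(with_surrogates), len(decoded))
print(decoded)
```

One thing to keep in mind is that the "unicode-escape" codec interprets every backslash sequence in the text, not only the surrogate escapes, so sequences such as `\n` are turned into real characters too.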


In [4]:
%%time

# Reading the files in the sorted filelist
for file in filelist:
    #Substring to get only the date part from the file name
    date = file[:10]
    with open( os.path.join( dir, file ) ,"r", encoding='UTF-8') as fd: 
        #Reading the file lines
        file_lines = fd.readline() #.strip()
        
        # Removing the withheld pattern found in one of the files
        withheld_pattern = r'"withheld":\s*{(.*?)}'
        file_lines = re.sub(withheld_pattern, "", file_lines)
        
        # Searching for the pattern between the {} brackets
        pattern = r'\s*{(.*?)}'
        
        # For every match found with the pattern in the file line
        for match in re.finditer(pattern, file_lines):
            count_tweets +=1
            # Find the following text pattern in the matched record
            text_pattern = r'"text":\s*"(.*?)(?<!\\)"'
            text = re.search(text_pattern, match.group(1))
            # Skip the record if it has no text field
            if text is None:
                continue
            # Store the captured text as the final text
            final_text = text.group(1)
            
            # Checking for English tweets using langid classify 
            x = classify(final_text)
            
            # If the text is English it will store the values in the dict with the ID as the key along with the date (from the file_name)
            if x[0] == 'en':
                count_english_tweets +=1
                # Similarly, look for the ID tag with the regex below:
                id_pattern = r'"id":\s*"(.*?)"'
                ID = re.search(id_pattern, match.group(0))
            
                # Encoding the final text to ASCII with backslashreplace, which returns
                # a bytes object in which every non-ASCII character is replaced by a
                # backslash escape sequence such as \uNNNN
                final_text = final_text.encode('ascii', 'backslashreplace')

                # Decoding the escape sequences, then merging the emoji surrogate pairs
                # by round-tripping through UTF-16 with surrogatepass
                final_text = final_text.decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')

                # Replacing the XML special characters in the final text with their escaped forms.
                # Each pair is (escaped form, regex for the original character).
                xml_escaped =[
                    ("&amp;", r"&(?!#\d{4};|apos;|quot;)"),
                    ("&quot;", '"'),
                    ("&apos;", "'")
                    #("&gt;", ">"),
                    #("&lt;", "<")
                    ]

                # Replacing the special characters found with their escaped forms:
                for xml_char , og in xml_escaped:
                    final_text = re.sub(og, xml_char, final_text)

                tweet_dict[ID.group(1)] = [final_text, date]
            


CPU times: user 55.3 s, sys: 14 s, total: 1min 9s
Wall time: 36.6 s


## 5. Creating the output XML File

1. Following the dates in the dictionary, write each tweet inside the tweets tag for its date, with its ID and text.
2. When a new date is read, close the current tweets tag and start a new tweets tag for the new date.
3. Finally, end the XML file with the closing tags.
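
The same grouping-by-date structure can also be built with the standard library's `xml.etree.ElementTree` instead of manual string writes. This is a sketch of an alternative technique, not the notebook's method. The dictionary `sample_tweets` is made up, and it holds raw, unescaped text: ElementTree escapes `&`, `<` and `>` itself, so text that was already escaped by hand would end up escaped twice. It also leaves quotes and apostrophes in element text as they are, so the output is not identical to the hand-written format.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for tweet_dict: id -> [text, date], already in date order.
sample_tweets = {
    "1": ["Stay safe & wash your hands", "2020-03-22"],
    "2": ["5 < 6 is still true", "2020-03-22"],
    "3": ["New day, new numbers", "2020-03-23"],
}

root = ET.Element("data")
current_date = None
tweets_el = None
for tweet_id, (text, date) in sample_tweets.items():
    if date != current_date:
        # Start a new <tweets> group whenever the date changes.
        tweets_el = ET.SubElement(root, "tweets", date=date)
        current_date = date
    tweet_el = ET.SubElement(tweets_el, "tweet", id=tweet_id)
    tweet_el.text = text  # raw text; escaping happens on serialization

xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

With this approach the closing tags are produced by the serializer, so there is no need to track when a tweets tag has to be closed.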


In [5]:
%%time

# Writing the data to the XML file

fw = open('31224075_test.xml', 'w', encoding='UTF-8')
fw.write('<?xml version="1.0" encoding="UTF-8"?>\n')
fw.write("<data>\n")

# Date Logic :
# Let the initial date be stored as default
temp_date = 'default'

# Passing the tweet dictionary with text and date
for k, v in tweet_dict.items():
    date = v[1]
    # Checking whether the date differs from the date of the previous tweet
    if(temp_date != date):
        if(temp_date != 'default'):
            fw.write("</tweets>\n") #Marks the end of tweets tag if date is different
        
        #Sets the current date as date and starts the tweets tag with date
        temp_date = date
        tweet_date = '<tweets date="'+temp_date+'">\n'
        fw.write(tweet_date)
    
    #Write the tweet with the ID from the key and text as the value
    tweets = '<tweet id="'+k+'">'+v[0]+'</tweet>\n'
    fw.write(tweets)

# Final file write in the end closing tags      
fw.write("</tweets>\n")
fw.write("</data>")
fw.close()


CPU times: user 32.4 ms, sys: 86 µs, total: 32.5 ms
Wall time: 31.4 ms


## 6. Summary


In [6]:
#Give a short summary of your work done above, such as your findings.

print(" The total number of files for the XML folder: " + str(len(filelist)) + " files" ) 
print(" The average file processing time : 17 mins ")  
print(" The total no. of tweets in the file : " + str(count_tweets)) 
print(" The total no. of English tweets processed is : " + str(count_english_tweets))



 The total number of files for the XML folder: 176 files
 The average file processing time : 17 mins 
 The total no. of tweets in the file : 17600
 The total no. of English tweets processed is : 8725


## Conclusion and Learnings: 

When converting a file from one format to another, it helps to look for the structures that recur in both formats, and then use regex to match the data needed in the required format. However, regex is not always the best option, as many libraries and APIs perform the same task in a simpler way.
Still, a regex pattern can usually be found for the job.
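
As an illustration of the library route mentioned above, the standard `json` module can parse the same kind of data. This is a sketch under the assumption that a file's contents are valid JSON, which the excerpt shown earlier suggests but does not confirm; the sample string is made up.

```python
import json

# Made-up sample in the same shape as the data shown earlier; the emoji is
# written as an escaped surrogate pair and the text has escaped quotes.
sample = (
    '{"data":[{"text":"Stay safe \\ud83d\\ude37 \\"please\\"", '
    '"created_at":"2020-06-07T15:08:30.000Z", "id":"1"}]}'
)

tweets = json.loads(sample)["data"]
parsed = {t["id"]: t["text"] for t in tweets}

print(parsed)
```

Here the JSON parser resolves the escaped quotes and combines the surrogate pair into a single emoji character on its own, so the separate decoding steps are not needed on this route.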

Text language classification can be done with a powerful library called langid, which recognizes over 95 different languages. Depending on the use case, langid can be configured as needed.

Encoding is a major factor when reading and writing a file. One should take care of the special characters in the file and the escape sequences of a particular format. In this use case, emojis had to be round-tripped through UTF-16 and various special characters needed to be escaped in the final format.

## Reference 

1. https://docs.python.org/2/library/re.html
2. https://stackoverflow.com/questions/58949094/converting-surrogate-pairs-to-emoji-python3
3. https://www.nltk.org/howto/collocations.html
4. https://www.advancedinstaller.com/user-guide/xml-escaped-chars.html
5. https://www.tidytextmining.com/ngrams.html