Libraries used:
* re 2.2.1 (for regular expressions, included in Anaconda Python 3.6)
* os (for interacting with files and the operating system)
* langid (for language detection)
* xmltodict (for checking that the generated XML file can be parsed)

## 1. Introduction
This assignment covers extracting data from semi-structured files. We are given text files containing tweets, from which we extract each tweet's date, id and text. The extracted tweets are then written to a file in XML format.

More details for each task will be given in the following sections.

## 2. Import libraries
* os for listing and reading the files
* langid for keeping only English-language tweets
* re for regular expressions
* xmltodict for checking the XML output

In [8]:
import os
import langid
import re
import xmltodict

## 3. Loading the files
* We load every file in the part1 directory whose name ends with .txt

In [2]:
path = './part1/'
text_files = [f for f in os.listdir(path) if f.endswith('.txt')]
print(len(text_files))

2413


## 4. Cleaning the data
* Stray characters left over from encoding errors are removed
* Escaped emojis and other special characters (\uXXXX sequences) are decoded
* Quotes that were opened but never closed are completed

In [3]:
#Removing stray characters left over from encoding errors
def clean_text(text):
    return (text.replace('ï', '').replace('¸', '').replace('', '')
            .replace('â', '')
            .replace('', ''))

#Fixing the emojis and the special characters 
def fix_emojis(t):
    i = 0
    to_change = None
    changes = []
    while i<=(len(t)):
        if t[i:i+2] == '\\u':
            if to_change is None:to_change=t[i: i+6]
            else: to_change += t[i: i+6]
            i +=6
        else: 
            if to_change is not None:
                s = to_change.encode('utf-8').decode('unicode-escape').encode('utf-16', 'surrogatepass').decode('utf-16')
                changes.append((to_change, s))
                to_change=None
            i = i+1
    for o, n in sorted(changes, key=lambda x: len(x[1]), reverse=True):
        t = t.replace(o, n)
    return t

#Some of the quotes were not completely closed. Hence, completing the quotes 
def fix_unclosed_quotes(t):
    for i in reversed(t):
        if i=='”' or i=='"':
            return t
        elif i=='“':
            return t+'"'
    return t
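
The decoding chain inside fix_emojis can be tried in isolation on a small input. The sketch below is a compact regex-based variant, not the notebook's own function: the name `decode_unicode_escapes` and the sample string are assumptions for illustration. It finds each run of literal `\uXXXX` escapes and decodes the run as a whole, so a surrogate pair becomes a single character.

```python
import re

# Hypothetical helper (not part of the notebook): decode runs of literal
# \uXXXX escapes, joining surrogate pairs into single characters.
ESCAPE_RUN = re.compile(r'(?:\\u[0-9a-fA-F]{4})+')

def decode_unicode_escapes(text):
    def decode_run(match):
        raw = match.group(0)
        # 'unicode-escape' turns each \uXXXX into one code unit; the UTF-16
        # round trip with 'surrogatepass' merges surrogate pairs.
        return (raw.encode('ascii')
                   .decode('unicode-escape')
                   .encode('utf-16', 'surrogatepass')
                   .decode('utf-16'))
    return ESCAPE_RUN.sub(decode_run, text)

sample = 'Good morning \\ud83d\\ude00 caf\\u00e9'  # assumed sample input
decoded = decode_unicode_escapes(sample)
# ascii() keeps the printout readable on terminals without emoji support
print(ascii(decoded))
```

As with the loop-based version, an unpaired surrogate escape would still raise an error in the final decode step, so this sketch assumes the escapes in the data are well formed.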

## 5. Parsing the data 
1. We use regular expressions to extract the date and the id
2. Date - the timestamp starts with a date in the format yyyy-mm-dd, so we match it with \"\d{4}-\d{2}-\d{2}.*?Z\" and keep only the part before the T
    * \" matches the character " literally
    * \d matches a digit from 0-9, and {4} repeats it exactly 4 times
    * .*? matches any characters (except line terminators), as few as possible
    * Z matches the character Z literally (case sensitive)
3. For the id we use :\"(\d{2,19})\"
    * \d{2,19} matches between 2 and 19 digits (each in [0-9]); 19 is the upper bound because the ids in this data are 19 digits long
    * the parentheses capture only the digits, without the surrounding :" and "
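
The two patterns described above can be checked on a single made-up tweet fragment. This is a minimal sketch: the sample string, its field order and the names `DATE_RE` and `ID_RE` are assumptions for illustration, not taken from the data files.

```python
import re

# Hypothetical tweet fragment; field names and values are assumed.
sample = '"id":"1306123456789012345","created_at":"2020-09-16T08:30:00.000Z","text":"hello"'

DATE_RE = re.compile(r'"\d{4}-\d{2}-\d{2}.*?Z"')
ID_RE = re.compile(r':"(\d{2,19})"')

timestamp = DATE_RE.findall(sample)[0]            # whole quoted timestamp
tweet_date = timestamp.split('T')[0].strip('"')   # keep the part before the T
tweet_id = ID_RE.findall(sample)[0]               # captured digits only

print(tweet_date)
print(tweet_id)
```

Raw strings (the `r'...'` prefix) are used so that `\d` reaches the regex engine unchanged and the quotes need no escaping.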

In [10]:
#parsing the tweets using regex 
def parse_tweet(tweet):
    tweet_date = re.findall(r'"\d{4}-\d{2}-\d{2}.*?Z"', tweet)[0].split('T')[0].strip('"')
    if len(tweet_date)>10: tweet_date = tweet_date[:10]
    tweet_id = re.findall(r':"(\d{2,19})"', tweet)[0]
    if len(tweet_id)>19: tweet_id = tweet_id[:19]
    tweet = re.split('","|},"', tweet)

    tweet_text=None
    for part in tweet:
        part = part.lstrip('"')
        if part.startswith('text":'):
            tweet_text = part[len('text":"'):]
            tweet_text = tweet_text.replace('\\n', '\n').replace('\\"', '"')
            if tweet_text[-1] == '"': tweet_text = tweet_text[:-1]
            #calling the emojis function and fixing the tweet
            tweet_text = fix_emojis(tweet_text)
            #If it comes across any unclosed quotes fixing that as well
            tweet_text = fix_unclosed_quotes(tweet_text)
    for i in [tweet_date, tweet_id, tweet_text]:
        try:
            assert i is not None and i != ''
        except Exception as e:
            print(tweet)
            raise e
        
    return tweet_date, tweet_id, tweet_text


* Opening each file
* Stripping the leading {"data": wrapper, dropping the trailing "errors" section and splitting the rest into individual tweets
* Keeping a tweet only if langid classifies it as English
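
The stripping and splitting steps listed above can be traced on a small payload. This is a sketch under the assumption that each file has the outer shape shown in `raw`; the sample content itself is made up.

```python
# Hypothetical file content; the real files are assumed to share this outer shape.
raw = '{"data":[{"id":"11","text":"a"},{"id":"22","text":"b"}],"errors":[{"detail":"x"}]}'

# lstrip takes a set of characters, not a prefix: it removes leading characters
# while they belong to the set, and stops at '[' here.
body = raw.lstrip('{"data:')
body = body.split(',"errors":')[0]   # drop the trailing errors section
body = body[2:-2]                    # drop the outer '[{' and '}]'
tweets_raw = body.split('},{')       # one string per tweet

print(tweets_raw)
```

Splitting on `},{` relies on that sequence only appearing between tweets, which is an assumption about the data rather than a guarantee.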


In [11]:
tweets = {}
discarded_tweets = []
texts = []
ids = []
for file in text_files:
    #Opening the file 
    with open(path + file, 'r', encoding='utf-8', errors='replace') as f:
        text = f.read()      
        #Stripping the leading {"data": wrapper, dropping the "errors" section and splitting into tweets
        text = text.lstrip('{"data:').split(',"errors":')[0][2:-2].split('},{')
        for tweet in text:
            tweet_date, tweet_id, tweet_text = parse_tweet(tweet)
            tweet_text = clean_text(tweet_text) 
            lang, score = langid.classify(tweet_text)
            #Checking the language of the tweet; English tweets are grouped by date
            if lang == 'en':
                if tweet_date not in tweets.keys():
                    tweets[tweet_date] = []

                tweets[tweet_date].append((tweet_id, tweet_text, score))
                
                texts.append(tweet_text)
                ids.append(tweet_id)
            else:
                discarded_tweets.append((tweet_date, tweet_id, tweet_text, score))
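
The grouping of English tweets by date can also be written with `collections.defaultdict`, which removes the explicit key check. The sketch below uses made-up, already-classified records in place of the langid call; the names `records`, `tweets_by_date` and `discarded` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical, already-classified records: (date, id, text, language).
records = [
    ('2020-09-16', '11', 'good morning', 'en'),
    ('2020-09-16', '22', 'bonjour', 'fr'),
    ('2020-09-17', '33', 'see you soon', 'en'),
]

tweets_by_date = defaultdict(list)  # a missing date starts as an empty list
discarded = []
for date, tweet_id, text, lang in records:
    if lang == 'en':
        tweets_by_date[date].append((tweet_id, text))
    else:
        discarded.append((date, tweet_id, text))

print(dict(tweets_by_date))
```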

## 6. Writing it into xml file
* Creating the xml file 29999715.xml


In [14]:
#Writing the tweets, grouped by date, into the xml file
xmlfile_path = './29999715.xml'
with open(xmlfile_path, 'w', encoding="UTF-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write("<data>\n")
    for date, date_tweets in tweets.items():
        f.write(f'<tweets date="{date}">\n')
        for tweet in date_tweets:
            tweet_id, tweet_text = tweet[0], tweet[1]
            f.write(f'<tweet id="{tweet_id}">{tweet_text}</tweet>\n')
        f.write("</tweets>\n")
    f.write("</data>\n")
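
The cell above writes the tweet text into the file as it is. If a tweet could contain a raw `&` or `<`, the output would no longer be well-formed XML. A minimal sketch of one way to guard against that uses the standard library's `xml.sax.saxutils`; the helper name `tweet_element` is hypothetical and not part of the notebook.

```python
from xml.sax.saxutils import escape, quoteattr

def tweet_element(tweet_id, tweet_text):
    # Hypothetical helper: quoteattr adds the surrounding quotes and escapes
    # the attribute value; escape handles &, < and > in the element text.
    return f'<tweet id={quoteattr(tweet_id)}>{escape(tweet_text)}</tweet>'

print(tweet_element('11', 'cats & dogs <3'))
```

Text that already contains entities such as `&amp;` would be escaped a second time by this approach, so whether it fits depends on how the source data encodes those characters.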

## 7. Verifying that the xml is loadable
* Parsing the file with xmltodict and checking that the number of tweets and ids matches what was written

In [13]:
import xmltodict

with open(xmlfile_path, 'r', encoding='utf-8') as f:  
    xml_text = f.read()
    
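#force_list keeps 'tweets' and 'tweet' as lists even when there is only one of them
parsed = xmltodict.parse(xml_text, force_list=('tweets', 'tweet'))['data']['tweets']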
output_texts = [] 
output_ids = []
for p in parsed:
    for i in p['tweet']:
        output_texts.append(i['#text'])
        output_ids.append(i['@id'])
        
assert len(texts) == len(output_texts)
assert len(ids) == len(output_ids)
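
The same kind of check can be sketched with the standard library's `xml.etree.ElementTree` instead of xmltodict. This is an alternative shown for illustration, run on a small made-up document with the same layout as the generated file, not on the notebook's actual output.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with the same layout as the generated file.
xml_text = """<data>
<tweets date="2020-09-16">
<tweet id="11">good morning</tweet>
</tweets>
<tweets date="2020-09-17">
<tweet id="33">see you soon</tweet>
<tweet id="44">bye</tweet>
</tweets>
</data>"""

root = ET.fromstring(xml_text)
# iter('tweet') yields every tweet element, however many each date holds.
output_ids = [t.get('id') for t in root.iter('tweet')]
output_texts = [t.text for t in root.iter('tweet')]

print(output_ids)
```

Because the elements are collected in document order, the resulting lists can be compared item by item with the ids and texts that were written, not only by length.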

## 8. References
1. re — Regular expression operations — Python 3.8.6rc1 documentation. (2020). Retrieved 16 September 2020, from https://docs.python.org/3/library/re.html
2. os — Miscellaneous operating system interfaces — Python 3.8.6rc1 documentation. (2020). Retrieved 16 September 2020, from https://docs.python.org/3/library/os.html
3. langid. (2020). Retrieved 16 September 2020, from https://pypi.org/project/langid/1.1dev/
4. xmltodict. (2020). Retrieved 16 September 2020, from https://pypi.org/project/xmltodict/
