# ABSA data processing
This notebook pre-processes the xml format data files into pkl file that we will be using for the experiments. 

Before following the steps in this notebook, you have to 1) download the data and 2) rename it into a specific format.

1. Go to http://alt.qcri.org/semeval2016/task5/index.php?id=data-and-tools and download training and test data of Arabic (ar), Chinese (zh), English (en), Russian (ru), Spanish (es). You can select subtask 1 for all languages, and you will have to sign up to get access to the data. For Chinese, we used Mobile Phone reviews, but you can use the digital camera one too. 
2. Rename all downloaded data to the format of `{two_character_lang_code}_{train/test}.xml`. For example, the Chinese training file should be renamed to `zh_train.xml`, and the Russian test file should be `ru_test.xml`.
3. make new directory called `data/` and move all files under the directory.
4. Execute this notebook and you are good to go! -- it will create `absa.pkl` under the same `data/` directory.

In [1]:
import pickle
import xml.etree.ElementTree as ET

In [2]:
def aggregate_labels(opinions:list) -> "str":
    labels, categories = zip(*opinions)
    if len(set(labels)) == 1:
        return "pos" if labels[0] == "positive" else "neg"
    else:
        num_pos, num_neg = labels.count("positive"), labels.count("negative")
        if num_pos > num_neg:
            return "pos"
        elif num_neg < num_pos:
            return "neg"
        else:
            return None

In [3]:
data = {}
for lang in ["en","es","ru","ar","zh"]:
    data[lang] = {}
    for mode in ["train","test"]:
        tree = ET.parse(f'data/{lang}_{mode}.xml')
        root = tree.getroot()
        xs, ys = [], []
        for child in root:
            for sentences in child:
                for sent in sentences:
                    if sent.find("Opinions"):
                        opinions = [(opinion.attrib["polarity"], opinion.attrib["category"]) for opinion in sent.find("Opinions")]
                        if len(opinions) > 0:
                            xs.append(sent.find("text").text)
                            ys.append(aggregate_labels(opinions))
        len_old = len(ys)
        xs, ys = zip(*[(x,y) for x,y in zip(xs, ys) if y!=None])
        print(f"{lang}-{mode}: {len_old-len(ys)}/{len_old}")
        assert len(xs) == len(ys)
        data[lang][mode] = list(zip(xs,ys))

en-train: 100/1708
en-test: 32/587
es-train: 91/1626
es-test: 27/677
ru-train: 70/2733
ru-test: 43/908
ar-train: 352/4790
ar-test: 80/1225
zh-train: 0/1333
zh-test: 0/529


In [4]:
with open("data/absa.pkl","wb") as file:
    pickle.dump(data, file)