# User annotation pre-processing script

This script can be used to manually clean the user annotations for the AcX system.

### Preparation:
 - Use google-cloud-sdk to download the annotations in .json format.
 - Add the .json files to a single folder (I called it 'google_annotation/json_data/', but you can of course change the name and the code).
 - When you clean the data for the first time, comment out the following lines:
     - clean_annotations = pd.read_csv('./clean_data/clean_annotations_nl.csv')
     - already_clean_docs = set(clean_annotations['doc_id'])
     - !! ALSO, REMOVE "already_clean_docs" from dirty_docs = total_docs - already_clean_docs !!
     - Undo all of this once you have cleaned the first documents and have a clean_annotations_XX.csv file.
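
As an alternative to commenting lines in and out, the first-run case could be handled by checking whether the clean file already exists. This is only a minimal sketch: the helper `load_clean_annotations` and the `COLUMNS` constant are hypothetical names that are not part of the notebook, and the column list is assumed to match the one used further down.

```python
import os
import pandas as pd

# Assumed column layout of the clean annotations file
COLUMNS = ['acronym', 'expansion', 'language', 'type', 'doc_id', 'annotator']

def load_clean_annotations(path):
    """Return the saved clean annotations, or an empty frame on the first run."""
    if os.path.exists(path):
        return pd.read_csv(path)
    # First run: nothing has been cleaned yet
    return pd.DataFrame(columns=COLUMNS)

clean_annotations = load_clean_annotations('./clean_data/clean_annotations_nl.csv')
already_clean_docs = set(clean_annotations['doc_id'])
```

With this approach `already_clean_docs` is simply an empty set on the first run, so `dirty_docs = total_docs - already_clean_docs` would not need to be changed.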

### Cleaning the documents:
* [Difference Finder](#Difference-Finder): The code in these cells checks for differences between the annotators of the same document. There are two common differences:
     1. One annotator missed an acronym.
     2. The annotators wrote the acronym or expansion in two different ways.

  A document is returned if one or both of these differences occur.

* [Manual cleaning](#Manual-cleaning): Cleaning the documents is done with pandas. Please follow these steps:
    1. Enter the document of interest --> document = '....'
    2. Check the output of differences and add the row numbers in the cell below (row_num_1 & row_num_2).
    3. Use the next cell to change the values inside the data frame. You can change values with the .iloc[.., ..] statement, or add a row with the extra_row variable in combination with the append statement (pd.concat in pandas 2.0 and later, where DataFrame.append was removed).
    4. All the acronyms and expansions for the document of interest are printed below for one last check.
    5. [Saving changes](#Saving-changes): Finally, you can append and save the output to clean_annotations_xx.csv. I recommend doing this after every document.
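
Steps 3 and 4 above can be sketched on a small toy data frame. The rows, the document name `doc1` and the `example.com` addresses are made up for illustration. The row is added with `pd.concat`, on the assumption that a recent pandas version is installed.

```python
import pandas as pd

# Toy annotations for one document (made-up values)
toy_df = pd.DataFrame({
    'acronym':   ['MD', 'MD'],
    'expansion': ['Managing Director', 'Managing Directer'],
    'language':  ['en', 'en'],
    'type':      ['out_expansion', 'out_expansion'],
    'doc_id':    ['doc1', 'doc1'],
    'annotator': ['a@example.com', 'b@example.com'],
})

# Step 3a: fix a value by position (row 1, column 1 = expansion)
toy_df.iloc[1, 1] = 'Managing Director'

# Step 3b: add a row that one annotator missed
extra_row = {'acronym': 'ITV', 'expansion': 'Independent Television', 'language': 'en',
             'type': 'out_expansion', 'doc_id': 'doc1', 'annotator': 'b@example.com'}
toy_df = pd.concat([toy_df, pd.DataFrame([extra_row])], ignore_index=True)

# Step 4: print everything for the document for one last check
print(toy_df[toy_df['doc_id'] == 'doc1'].sort_values(by='acronym'))
```

Note that `.iloc` works on positions, so the row numbers only match the printed index as long as the data frame has a default 0..n-1 index.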

### Some other issues you might encounter 
- Mail issue: The document IDs are generated by splitting the names of the .json files on an underscore (_). However, some email addresses contain an underscore themselves, which leads to incorrect splits. This is solved by matching those addresses explicitly, so add every email address with an underscore to the "exception_mails" list.
- Documents with only one annotator: Some documents will not yet be annotated by two people. You can ignore these documents for now.
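
The mail issue can be illustrated with a small helper. This is a sketch, not the code used in the notebook: `split_file_name` is a hypothetical function, the file names are assumed to follow the pattern `<doc_id>_<annotator>.json`, and `someone@example.com` is a made-up address.

```python
import os

def split_file_name(file_name, exception_mails=()):
    """Split '<doc_id>_<annotator>.json' into (doc_id, annotator)."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    # E-mail addresses that contain an underscore are matched explicitly
    for mail in exception_mails:
        if stem.endswith('_' + mail):
            return stem[:-(len(mail) + 1)], mail
    # Otherwise the annotator is everything after the last underscore
    doc_id, _, annotator = stem.rpartition('_')
    return doc_id, annotator

print(split_file_name('Maybach_someone@example.com.json'))
print(split_file_name('Maybach_jesher_a@hotmail.com.json', ['jesher_a@hotmail.com']))
```

Without the exception list, the second file name would be split at the underscore inside the address, which gives the wrong document ID and annotator.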

In [None]:
"""
Script for processing and cleaning the Google Drive annotations.
The output is a .csv file with the acronyms and expansions per document.
Date: 23-05-2022
"""

In [None]:
import os
import pandas as pd
import re

## Loading the Annotations

In [None]:
rootdir = '/Users/jesher/Desktop/Master data science UvA/Semester 2/Thesis/google_annotation/json_data/'
df = pd.DataFrame(columns=['acronym', 'expansion', 'language', 'type'])

# Loading the clean annotations (only run this if you already have clean annotations)
clean_annotations = pd.read_csv('./clean_data/clean_annotations_nl.csv')

In [None]:

def cleaning_raw_annotation(df, rootdir):
    # Creating the file directories
    for subdir, dirs, files in os.walk(rootdir):
        for file in files:
            abs_path = os.path.join(subdir, file)
            individual_json = pd.read_json(abs_path)

            # Extracting the annotators
            exception_mails = ['jesher_a@hotmail.com']
            file_name = os.path.basename(abs_path)

            # Default: the annotator is the part after the last underscore, minus '.json'
            annotators = file_name.split('_')[-1][:-5]

            # Filter for mails with an underscore: these are matched explicitly
            for mail in exception_mails:
                if mail in file_name:
                    annotators = mail
                    break

            # Extracting the document IDs (strip the annotator, the '_' and '.json')
            doc_id = file_name.replace(annotators, '')[:-6]

            # Transforming the json  in a proper format
            individual_json = individual_json.transpose()
            individual_json.reset_index(inplace=True)
            individual_json = individual_json.rename(columns={'index':'acronym'})
            individual_json['doc_id'] = doc_id
            individual_json['annotator'] = annotators

            # adding everything together in a df
            df = pd.concat([df, individual_json], ignore_index=True)

            # fill the missing values in the language column
            df['language'] = df['language'].replace('', "other language")
            
            # remove additional white spaces
            df['acronym'] = df['acronym'].str.strip()
            df['expansion'] = df['expansion'].str.strip()

    return df



In [None]:
# Run this cell only at the beginning! It rebuilds annotation_df from the raw files, so any unsaved manual corrections will be lost if it is executed again
annotation_df = cleaning_raw_annotation(df, rootdir)
annotation_df

In [None]:
# Corpus info
print('Number of annotators: {}'.format(len(set(annotation_df['annotator']))))
print('Number of documents: {}'.format(len(set(annotation_df['doc_id']))))
print('Number of clean documents: {}'.format(len(set(clean_annotations['doc_id']))))

## Difference Finder <a class="anchor" id="Difference-Finder"></a>
The following code section looks for differences between the annotations of the same document.
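
The idea behind the check can be sketched on toy data: pairs of (acronym, expansion) that both annotators entered identically cancel out, and whatever remains needs manual attention. The helper name `unmatched_rows` and all values below are made up for illustration.

```python
import pandas as pd

def unmatched_rows(annotation_df, doc_id):
    """Return the rows of one document whose (acronym, expansion) pair occurs only once."""
    doc = annotation_df[annotation_df['doc_id'] == doc_id]
    return doc.drop_duplicates(subset=['acronym', 'expansion'], keep=False)

# Toy data: 'MD' matches, 'ITV' is written in two ways, 'UvA' was missed by one annotator
toy = pd.DataFrame({
    'acronym':   ['MD', 'MD', 'ITV', 'ITV', 'UvA'],
    'expansion': ['Managing Director', 'Managing Director',
                  'Independent Television', 'Independent TV',
                  'Universiteit van Amsterdam'],
    'doc_id':    ['doc1'] * 5,
    'annotator': ['a@example.com', 'b@example.com', 'a@example.com',
                  'b@example.com', 'a@example.com'],
})

print(unmatched_rows(toy, 'doc1'))
# An odd number of rows hints at a missed acronym
print(len(toy[toy['doc_id'] == 'doc1']) % 2 != 0)
```

In this toy case the two 'ITV' rows and the single 'UvA' row are returned, while the matching 'MD' rows are dropped.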

In [None]:
# Collect the document IDs: all annotated documents and the ones that are already clean
total_docs = set(annotation_df['doc_id'])
already_clean_docs = set(clean_annotations['doc_id'])

# dirty document info
dirty_docs = total_docs - already_clean_docs
print("Number of dirty documents: {}".format(len(dirty_docs)))


# Return all doc_ids with an uneven number of acronyms (which means...)
print("\nDOCUMENTS WITH AN UNEVEN NUMBER OF ACRONYMS:")
dirt_1 = []
for i in set(annotation_df['doc_id']):
    if i in dirty_docs:
        if len(annotation_df[annotation_df['doc_id'] == str(i)]) % 2 != 0:
            dirt_1.append(i)
            print("  -", i)
        
print("\nDOCUMENTS WITH UNMATCHED ROWS:")
dirt_2 = []
for i in set(annotation_df['doc_id']):
    sub_set_df = annotation_df[annotation_df['doc_id'] == i]
    if i in dirty_docs:
        # Drop every (acronym, expansion) pair that occurs more than once;
        # whatever remains has no matching row from the other annotator
        if len(sub_set_df.drop_duplicates(subset=["acronym", "expansion"], keep=False)) != 0:
            dirt_2.append(i)
            print("  -", i)

del sub_set_df

## Adding clean documents to the clean data

In [None]:
all_dirty =  dirt_2 + dirt_1
doc_subs = set.union(already_clean_docs, all_dirty)


all_docs = set(annotation_df['doc_id'])
good_docs = all_docs - doc_subs
print('Clean documents: \n{}'.format(good_docs))



# adding all clean documents together
clean_doc_all = annotation_df[annotation_df['doc_id'].isin(good_docs)].copy()
    
if set(clean_doc_all['doc_id']) == good_docs:
    print('\nAll good')
    

clean_annotations = pd.concat([clean_annotations, clean_doc_all])

# Save the changes in the main .csv file
# clean_annotations.to_csv('./clean_data/clean_annotations_nl.csv', index=False)

## Manual cleaning <a class="anchor" id="Manual-cleaning"></a>

In [None]:
# Run this cell only after row_num_1 and row_num_2 have been set in the comparison cell below
print(annotation_df.iloc[row_num_1, 1])
print(annotation_df.iloc[row_num_2, 1])


In [None]:
# show the annotations that have issues
document = "Maybach"

sub_set_df = annotation_df[annotation_df['doc_id'] == document]
sub_set_df.drop_duplicates(subset=["acronym", "expansion"], keep=False).sort_values(by='acronym')

In [None]:
# Fill in the index to see if the values are the same
row_num_1 = 375
row_num_2 = 998

# Results
print("Are the acronyms the same:")
print(annotation_df.iloc[row_num_1, 0] == annotation_df.iloc[row_num_2, 0])
print("\nAre the expansions the same:")
print(annotation_df.iloc[row_num_1, 1] == annotation_df.iloc[row_num_2, 1])
print(annotation_df.iloc[row_num_1, 1])
print(annotation_df.iloc[row_num_2, 1])

In [None]:
pd.set_option('display.max_rows', None)
# fill in the index of the cell you want to change
# annotation_df.iloc[80, 0] = "MD"

# Use the code below if you need to add a new row
# extra_row = {'acronym':'ITV-netwerk', 'expansion':'Independent Television-netwerk', 'language':'en', 'type':'out_expansion', 'doc_id':document, 'annotator':'jesher.appels@gmail.com'}


# Show the results
# annotation_df = pd.concat([annotation_df, pd.DataFrame([extra_row])], ignore_index=True)   # <-- uncomment if you need to add an extra row (DataFrame.append was removed in pandas 2.0)
results = annotation_df[annotation_df['doc_id'] == document]  
results.sort_values(by='acronym')

In [None]:
# Remove a row from the results by its index label (uncomment and adjust the label)
# results = results.drop(565)

## Saving the changes <a class="anchor" id="Saving-changes"></a>
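
The save step could also be wrapped in a small function that removes exact duplicate rows before writing, so that saving the same document twice does not pile up copies. This is a sketch under assumptions: `append_and_save` is a hypothetical helper that is not part of the notebook, and the toy rows and temporary path are made up.

```python
import os
import tempfile
import pandas as pd

def append_and_save(clean_annotations, results, path):
    """Append the cleaned rows, drop exact duplicates and write the .csv file."""
    combined = pd.concat([clean_annotations, results], ignore_index=True)
    combined = combined.drop_duplicates(keep='last')
    combined.to_csv(path, index=False)
    return combined

# Toy example: the row for 'MD' is already in the clean file
clean = pd.DataFrame({'acronym': ['UvA', 'MD'], 'doc_id': ['doc1', 'doc1']})
new = pd.DataFrame({'acronym': ['MD', 'ITV'], 'doc_id': ['doc1', 'doc1']})

out_path = os.path.join(tempfile.mkdtemp(), 'clean_annotations_toy.csv')
combined = append_and_save(clean, new, out_path)
print(combined)
```

Only rows that are identical in every column are dropped here. Two rows that differ in, for example, the annotator would both be kept.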

In [None]:
# Collect the cleaned rows of the current document and add them to the clean annotations
df_empty = pd.DataFrame(columns=['acronym', 'expansion','language', 'type', 'doc_id','annotator'])
clean_annotations_sub = pd.concat([df_empty, results])
clean_annotations = pd.concat([clean_annotations, clean_annotations_sub])

# Save the changes in the main .csv file
clean_annotations.to_csv('./clean_data/clean_annotations_nl.csv', index=False)

In [None]:
# Show the IDs of all clean documents in alphabetical order (a set by itself has no order)
sorted(set(clean_annotations['doc_id']))

In [None]:
# pd.set_option('display.max_rows', None)
# clean_annotations[clean_annotations['doc_id'] =='Waterschapsverkiezingen'].sort_values(by='acronym')

In [None]:
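# Preview the clean annotations without exact duplicate rows
# (the result is not assigned, so clean_annotations itself is unchanged)
clean_annotations.drop_duplicates(keep='last')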

## Adding a separate file

In [None]:
# file_name  = "/Users/jesher/Desktop/Master data science UvA/Semester 2/Thesis/google_annotation/clean_data/clean_annotations_nl_wouter.csv"
# second_file = pd.read_csv(file_name)

# second_file

In [None]:
# clean_annotations = pd.concat([clean_annotations, second_file])

# Save the changes in the main .csv file
# clean_annotations.to_csv('./clean_data/clean_annotations_nl.csv', index=False)