# Goals

1. Ingest and 'clean' annotations spanning multiple JSON files, annotators, and informed consent documents
1. Ingest and 'clean' raw textual content from informed consent documents (including documents corrupted by upstream processes)
1. Align annotations against their respective informed consent documents
1. Save canonical files for the next analysis steps
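
As a rough illustration of the cleaning and alignment goals, here is a minimal sketch. It is not the code in `custom_classes`; the helper names (`clean_text`, `document_id`, `align_span`), the whitespace rule, and the use of a SHA-256 hash as a document id are all assumptions made for this example.

```python
import hashlib
import re


def clean_text(raw: str) -> str:
    """Normalize line endings and collapse runs of spaces/tabs (hypothetical rule)."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    return re.sub(r"[ \t]+", " ", text).strip()


def document_id(text: str) -> str:
    """Derive a stable id from the text itself (assumption: SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def align_span(doc_text: str, span_text: str, start: int) -> int:
    """Return the offset of span_text in doc_text, or -1 if it is absent.

    The recorded offset is trusted when it still matches; otherwise the
    occurrence closest to the recorded offset is used.
    """
    if doc_text[start:start + len(span_text)] == span_text:
        return start
    hits = [m.start() for m in re.finditer(re.escape(span_text), doc_text)]
    if not hits:
        return -1
    return min(hits, key=lambda pos: abs(pos - start))


# Example: an offset recorded before cleaning no longer matches exactly.
doc = clean_text("I  agree to\r\nparticipate.  I agree to share data.")
offset = align_span(doc, "I agree", 20)
```

Falling back to the nearest occurrence is one way to keep an annotation usable when cleaning has shifted its offsets; spans that return -1 would need manual review.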

# Dependencies

In [1]:
import sys
import os
import json
import pandas as pd
import hashlib
import spacy
import re
import random
from importlib import reload
from datetime import datetime
from collections import defaultdict
from pprint import pprint

import custom_classes

# Load Raw Data

In [2]:
%time
annotations = custom_classes.Raw_Annotations()
annotations.save_annotations()
annotations.save_document_map()

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs
Saved to: `processed_annotations/ANNOTATIONS_02-14-2020.csv`
Saved to: `processed_annotations/DOCUMENT_MAP_02-14-2020.json`
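
The saved file names carry a date stamp in what looks like month-day-year order. A minimal sketch of how such a path could be built follows; `dated_path` is a hypothetical helper, and the `%m-%d-%Y` format is inferred from the output above rather than taken from `custom_classes`.

```python
from datetime import datetime
from pathlib import Path


def dated_path(prefix, suffix, out_dir="processed_annotations", when=None):
    """Build `<out_dir>/<prefix>_<MM-DD-YYYY><suffix>` (format inferred, not confirmed)."""
    when = when or datetime.now()
    stamp = when.strftime("%m-%d-%Y")
    return Path(out_dir) / f"{prefix}_{stamp}{suffix}"


# Reproduce the two file names shown in the cell output.
csv_path = dated_path("ANNOTATIONS", ".csv", when=datetime(2020, 2, 14))
map_path = dated_path("DOCUMENT_MAP", ".json", when=datetime(2020, 2, 14))
```

Stamping outputs by date keeps earlier runs from being overwritten, at the cost of later steps needing to know which dated file to load.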


In [3]:
help(custom_classes)

Help on module custom_classes:

NAME
    custom_classes

DESCRIPTION
    Get a list of files from the `data/` directory.
    Each is a `.json` file with annotations for multiple informed consent documents.
    Each `.json` file corresponds to a single annotator.

CLASSES
    builtins.object
        Raw_Annotations
    
    class Raw_Annotations(builtins.object)
     |  Raw_Annotations(data_dir='../data/', nlp_lib='en_core_web_lg')
     |  
     |  A class to help manage annotations from DataTurks
     |  
     |  Methods defined here:
     |  
     |  __init__(self, data_dir='../data/', nlp_lib='en_core_web_lg')
     |      Args: 
     |          - data_dir (str): a path to a directory of json files 
     |              containing annotations
     |  
     |  build_document_map(self)
     |      A function to get a map of documents and ids
     |  
     |  format_annotator_name(self, filename)
     |      A function to return the formatted name of an annotator given a consistently 
     |      na