##### -.-.-.-.-.-. OBJECTIVE .-.-.-.-.-.-.-
This program builds a consolidated LIST_CONSOL database from two sources:
1. MetaDrive metadata from LIST_ARCH, describing the files collected in the LLITRA collaboration between Digital iuris and BJGS
2. The anonymized data from LIST_CONC, which constitutes the textual material for identifying legal concepts

##### -.-.-.-.-.-. How to run .-.-.-.-.-.-.-

Step 1. Choose the folder where the files from Universal Data Tool containing the anonymized data are located to construct LIST_CONC

Step 2. Choose the folder where the files collected through BJGS are located to construct MD5 identification and Path for LIST_ARCH

Step 3. Select the CSV file holding the rest of LIST_ARCH data from MetaDrive

##### -.-.-.-.-.-.-. PROCESS .-.-.-.-.-.-.-.-

The data from Step 1 is first cleaned so that it is anonymized, and a path is assigned to each record; the records are stored in LIST_CONC.
The data from Steps 2 and 3 is then concatenated in LIST_ARCH. Each record of the anonymized database is completed with the MetaDrive metadata of the file from which it was extracted.

LIST_ARCH is compared with LIST_CONC to assign each record in the latter to the appropriate record in the former, based on the path and file name in both lists.
- If a record from LIST_CONC has no MetaDrive data, or is not anonymized, it is added to a LIST_CORR database pending correction.
- If a record from LIST_CONC has a corresponding LIST_ARCH record, the result is stored in LIST_CONSOL.

The program never modifies the input data: the OS module is used only to read it.
The transformation and load steps are performed in memory before LIST_CONSOL and LIST_CORR are saved.
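
The matching rule described in this section can be sketched with a pandas left join. This is only an illustration under stated assumptions, not the notebook's own code: the toy frames, the `is_anonymized` column and the helper name `split_consol_corr` are hypothetical and were introduced for the example.

```python
import pandas as pd

def split_consol_corr(list_conc, list_arch, keys=("file_name", "file_dirname")):
    """Split LIST_CONC records into consolidated and pending-correction sets.

    Hypothetical helper: a record is consolidated only if it is anonymized
    and has a matching LIST_ARCH row on the given keys.
    """
    # indicator=True adds a "_merge" column telling where each row was found
    merged = list_conc.merge(list_arch, on=list(keys), how="left", indicator=True)
    matched = (merged["_merge"] == "both") & merged["is_anonymized"]
    list_consol = merged[matched].drop(columns="_merge")
    list_corr = merged[~matched].drop(columns="_merge")
    return list_consol, list_corr

# Toy data: doc2 is not anonymized, doc3 has no LIST_ARCH row
list_conc = pd.DataFrame({
    "file_name": ["doc1", "doc2", "doc3"],
    "file_dirname": ["A", "A", "B"],
    "is_anonymized": [True, False, True],
})
list_arch = pd.DataFrame({
    "file_name": ["doc1", "doc2"],
    "file_dirname": ["A", "A"],
    "file_MD5": ["md5-1", "md5-2"],
})
list_consol, list_corr = split_consol_corr(list_conc, list_arch)
```

With this toy data, only `doc1` should reach `list_consol`; the other two records should end up in `list_corr`.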

In [1]:
import os, hashlib, ntpath, json, re, csv
import pandas as pd
from tkinter import Tk, filedialog, messagebox
## CUR_DIR = str(os.path.dirname(__file__))
# Notebooks do not define __file__, so the current working directory is used as the starting folder of the dialogs.
CUR_DIR = os.getcwd()

## 1. Get folders and file location

In [2]:
def select_data(choosefileorfolder, boxTitle, boxDescription):
    """Show an info box, then ask for a folder (1) or a file (2) and return its path."""
    if choosefileorfolder not in (1, 2):
        raise ValueError("choosefileorfolder must be 1 (folder) or 2 (file)")

    root = Tk() # Create the Tk root window required by the dialogs.
    root.withdraw() # Hide the small empty tkinter window.
    root.attributes('-topmost', True) # Keep the dialogs above all other windows.

    messagebox.showinfo(boxTitle, boxDescription)
    if choosefileorfolder == 1:
        path_from_box = filedialog.askdirectory(initialdir=CUR_DIR) # Returns the chosen folder as str
    else:
        path_from_box = filedialog.askopenfilename(initialdir=CUR_DIR) # Returns the chosen file as str
    root.destroy() # Release the hidden root window.

    if not path_from_box: # The dialog was cancelled
        raise SystemExit("No path selected.")
    return path_from_box

Step 1. Choose the folder where the files from Universal Data Tool containing the anonymized data are located to construct LIST_CONC

In [3]:
folder_UDT_Path = select_data(1, "Folder with UDT files", "Please select the folder where the files from Universal Data Tool containing the anonymized data are located to construct LIST_CONC.")



Step 2. Choose the folder where the files collected through BJGS are located to construct MD5 identification and Path for LIST_ARCH

In [4]:
folder_BJGS_RAW_Path = select_data(1, "Folder with BJGS PDF and DOCX files", "Please select the folder where the files collected through BJGS are located to construct MD5 identification and Path for LIST_ARCH")



Step 3. Select the CSV file holding the rest of LIST_ARCH data from MetaDrive

In [5]:
CSV_Metadrive_Path = select_data(2, "CSV file with Metadrive Data", "Please select the CSV File where the Metadrive data is stored to populate the rest of LIST_ARCH data from MetaDrive")



## 2. Building the pandas DataFrame from BJGS files with file_MD5, file_path, file_name and file_dirname

In [6]:
def mnp_dict(folder_path):
    """Walk folder_path and index every .doc, .docx and .pdf file once per distinct MD5."""
    arch_dict = {"file_MD5":[], "file_name":[], "file_path":[], "file_dirname":[]}
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            if file.endswith((".doc", ".docx", ".pdf")):
                new_path_result = os.path.join(root, file)
                new_name_result = ntpath.basename(new_path_result)
                with open(new_path_result, "rb") as handle:          # The file is closed once it is hashed
                    new_md5_result = hashlib.md5(handle.read()).hexdigest()
                new_dirname     = os.path.basename(root)

                if new_md5_result not in arch_dict["file_MD5"]:      # Skip files whose content is already indexed
                    arch_dict["file_MD5"].append(new_md5_result)
                    arch_dict["file_path"].append(new_path_result)
                    arch_dict["file_name"].append(new_name_result)
                    arch_dict["file_dirname"].append(new_dirname)

    return arch_dict

LIST_ARCH = mnp_dict(folder_BJGS_RAW_Path)
df_arch = pd.DataFrame.from_dict(LIST_ARCH)
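
The MD5 identification above reads each file fully into memory before hashing it. For large PDF files, a chunked read gives the same digest with a bounded memory footprint. The sketch below is an assumption-laden illustration: the helper name `file_md5` and the 1 MiB chunk size are not part of the notebook.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Usage on a throwaway file standing in for a collected document
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"example bytes")
digest = file_md5(tmp.name)
os.remove(tmp.name)
```

The digest is identical to `hashlib.md5(data).hexdigest()` computed on the whole content, so it could replace the one-shot read without changing the resulting `file_MD5` values.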

## 3. Building the dataframe from MetaDrive CSV file

In [7]:
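with open(CSV_Metadrive_Path) as csv_file:                      # Read the whole export, one string per line
    input_file = csv_file.read().split('\n')
CSV_Metadrive = {}
keys = input_file[0].split('''"\t"''')                          # Header fields are separated by "<TAB>"
keys[-1] = keys[-1][:-1]                                        # Drop the closing quote of the last header
keys = [key.replace('\ufeff','') for key in keys]               # Drop the byte order mark, if any
input_file.pop(0)

for key in keys:
    CSV_Metadrive[key] = []

for line in input_file:
    if not line:                                                # Skip blank lines such as a trailing newline
        continue
    for i, element in enumerate(line.split('''"\t"''')):
        CSV_Metadrive[keys[i]].append(element)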


df_metadrive = pd.DataFrame.from_dict(CSV_Metadrive)

df_metadrive['Name'] = df_metadrive['Name'].str[1:]
df_metadrive['Folder'] = df_metadrive['Folder'].str[:-1]
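
If the MetaDrive export is a regular tab-separated file with double-quoted fields (an assumption: the manual parsing above suggests it, but the file layout is not shown), `pandas.read_csv` can handle the quotes and the byte order mark directly. The sketch below uses a toy file; the helper name `read_metadrive` and the sample values are hypothetical, and only the column names come from the notebook.

```python
import os
import tempfile
import pandas as pd

def read_metadrive(path):
    """Read a tab-separated, double-quoted export into a DataFrame of strings.

    Assumes a UTF-8 file that may start with a byte order mark.
    """
    return pd.read_csv(
        path,
        sep="\t",
        quotechar='"',
        encoding="utf-8-sig",   # drops a leading BOM if present
        dtype=str,              # keep every field as text
        keep_default_na=False,  # empty fields stay "" instead of NaN
    )

# Toy file standing in for the real export
sample = (
    '"Name"\t"Categoría"\t"Folder"\n'
    '"doc1.pdf"\t"Civil"\t"EXP-1"\n'
    '"doc2.pdf"\t""\t"EXP-2"\n'
)
with tempfile.NamedTemporaryFile("w", delete=False, suffix=".csv",
                                 encoding="utf-8-sig") as tmp:
    tmp.write(sample)
df_demo = read_metadrive(tmp.name)
os.remove(tmp.name)
```

With this approach the quote stripping on `Name` and `Folder` would no longer be needed, since the parser removes the quotes from every field.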

In [8]:
# Sanity check: print every column whose length differs from 749 rows.
for key, value in CSV_Metadrive.items():
    if len(value) != 749:
        print(key)

Name
Tipo de Juicio
Cuaderno
Materia
Juzgado
Secretaria
Núm. Exp.
Categoría
Tipo de documento
Etapa procesal
¿Tiene anexos?
Documentos anexos
Documento relacionado
¿Quién lo emite?
¿A quién se dirige?
Término (días)
Folio
Description
Fecha de promoción
Fecha de publicación
Fecha de surtimiento de efectos
Acuerdo que le recae
Día de la notificación
Document Link
Modified
Created
Modified By
Last viewed by me
Owner
Folder


## 4. Generate dataframe from JSONs (UDT)
A. Start from the TXT to correct the samples of the different JSON files

In [9]:
def return_jsons(path_dir = folder_UDT_Path):
    
    return_list_json = []
    for root, dirs, files in os.walk(path_dir): 
        for file in files:
            if file.endswith('.json'):
                return_list_json.append(os.path.join(root, file))
    #return [os.path.join(root, file) for root, dirs, files in os.walk(path_dir) for file in files if file.endswith('.json')]
    return return_list_json

In [10]:
import unicodedata as ud

def remove_accents(input_str):
    """Return input_str without combining marks (accents, diaeresis, tilde)."""
    nfkd_form = ud.normalize('NFKD', input_str)
    return "".join([c for c in nfkd_form if not ud.combining(c)])

dict2dump = {
    '_id':[], 
    'document':[], 
    'file_name':[], 
    'file_path':[], 
    'file_dirname':[], 
    'annotation.entities.0.text':[], 
    'annotation.entities.0.label':[], 
    'annotation.entities.0.start':[], 
    'annotation.entities.0.end':[]}

for json_path in return_jsons():
    ## Treat file data to dump it in a dictionary suitable for a pandas DataFrame:
    with open(json_path, 'r') as file:
        data = json.load(file)                                                                          # Load the file into the data variable
        data.pop("interface",404)                                                                       # Delete interface key
        data.pop("name",404)                                                                            # Delete name key
        for i in data['samples']:                                                                       # For each sample
            dict2dump['_id'].append(i['_id'])                                                           # Append _id in the dict
            dict2dump['document'].append(i['document'])                                                 # Append document in the dict (complete phrase)
            dict2dump['file_name'].append(os.path.basename(file.name))                                  # Append file name, taken from the open file object
            dict2dump['file_path'].append(os.path.dirname(file.name))                                   # Append the path of the containing folder
            dict2dump['file_dirname'].append(os.path.basename(os.path.dirname(file.name)))              # Same for directory name
            if ('annotation' in i):                                                                     # Check if key "annotation" exists
                if len(i['annotation']['entities']) == 0:                                               # if so, check if it has zero entities
                    numannot = 0
                    while True:                                                                         # if it has zero entities, we need to populate the appropriate entities columns with NaN
                        if f'annotation.entities.{numannot}.text' in dict2dump:                         # Check if the entities column exists
                            dict2dump[f'annotation.entities.{numannot}.text'].append('NaN')             # If the entity columns exists, then populate the four columns
                            dict2dump[f'annotation.entities.{numannot}.label'].append('NaN')
                            dict2dump[f'annotation.entities.{numannot}.start'].append('NaN')
                            dict2dump[f'annotation.entities.{numannot}.end'].append('NaN')
                            numannot += 1                                                               # Repeat with next entity number
                        else:
                            break                                                                       # Stop populating if entity number does not exist
                
                else:                                                                                   # If it has at least one entity
                    numannot = 0
                    for annot in i['annotation']['entities']:                                           # For each entity in the sample annotation

                        if (f'annotation.entities.{numannot}.text' in dict2dump):                       # Check if entity number is in the dict.
                            dict2dump[f'annotation.entities.{numannot}.text'].append(annot['text'])     # If so, dump values for the four entities columns
                            dict2dump[f'annotation.entities.{numannot}.label'].append(annot['label'])
                            dict2dump[f'annotation.entities.{numannot}.start'].append(annot['start'])
                            dict2dump[f'annotation.entities.{numannot}.end'].append(annot['end'])

                        else:                                                                           # If entity number is not in the dict
                            dict2dump[f'annotation.entities.{numannot}.text']= []                       # Create the four columns with the new entity number
                            dict2dump[f'annotation.entities.{numannot}.label']= []
                            dict2dump[f'annotation.entities.{numannot}.start']= []
                            dict2dump[f'annotation.entities.{numannot}.end']= []

                            for p in range(len(dict2dump['_id'])-1):                                    # Get the number of rows in a range, minus the last one
                                dict2dump[f'annotation.entities.{numannot}.text'].append('NaN')         # For each row, populate the new four new entity columns with NaN
                                dict2dump[f'annotation.entities.{numannot}.label'].append('NaN')
                                dict2dump[f'annotation.entities.{numannot}.start'].append('NaN')
                                dict2dump[f'annotation.entities.{numannot}.end'].append('NaN')
                            
                            dict2dump[f'annotation.entities.{numannot}.text'].append(annot['text'])     # Append the last row of the four new entity columns
                            dict2dump[f'annotation.entities.{numannot}.label'].append(annot['label'])   # with the four value of the entity
                            dict2dump[f'annotation.entities.{numannot}.start'].append(annot['start'])
                            dict2dump[f'annotation.entities.{numannot}.end'].append(annot['end'])                            
                            
                        numannot += 1                                                                   # Repeat with next entity number
                    while True:
                        if f'annotation.entities.{numannot}.text' in dict2dump:                         # Check if other entities column exists
                            dict2dump[f'annotation.entities.{numannot}.text'].append('NaN')             # If so, then populate the four columns
                            dict2dump[f'annotation.entities.{numannot}.label'].append('NaN')
                            dict2dump[f'annotation.entities.{numannot}.start'].append('NaN')
                            dict2dump[f'annotation.entities.{numannot}.end'].append('NaN')
                            numannot += 1
                        else:
                            break

            else:                                                                                       # If key "annotation" does not exist
                numannot = 0
                while True:
                    if f'annotation.entities.{numannot}.text' in dict2dump:                             # Check if the entities column exists
                        dict2dump[f'annotation.entities.{numannot}.text'].append('NaN')                 # If the entity columns exists, then populate the four columns
                        dict2dump[f'annotation.entities.{numannot}.label'].append('NaN')
                        dict2dump[f'annotation.entities.{numannot}.start'].append('NaN')
                        dict2dump[f'annotation.entities.{numannot}.end'].append('NaN')
                        numannot += 1
                    else:
                        break

df_json_udt = pd.DataFrame.from_dict(dict2dump)

# Normalize the join keys: hyphens become spaces in directory names, accents are removed everywhere
df_json_udt['file_dirname'] = df_json_udt['file_dirname'].str.replace('-', ' ', regex=False).map(remove_accents)
df_arch['file_dirname'] = df_arch['file_dirname'].str.replace('-', ' ', regex=False).map(remove_accents)

df_arch['file_name'] = df_arch['file_name'].map(remove_accents)
df_json_udt['file_name'] = df_json_udt['file_name'].map(remove_accents)
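
The wide `annotation.entities.N.*` layout built by the loop above can also be obtained by flattening each sample into one dictionary and letting pandas fill in the missing cells. The following is a sketch under assumptions, not the notebook's method: `flatten_sample` is a hypothetical helper, the samples are toy data, and the `start`/`end` offsets are illustrative values only.

```python
import pandas as pd

def flatten_sample(sample):
    """Flatten one UDT sample into a single wide row (hypothetical helper)."""
    row = {"_id": sample["_id"], "document": sample["document"]}
    # A sample may have no "annotation" key, or an empty list of entities
    entities = sample.get("annotation", {}).get("entities", [])
    for n, entity in enumerate(entities):
        for field in ("text", "label", "start", "end"):
            row[f"annotation.entities.{n}.{field}"] = entity[field]
    return row

samples = [
    {"_id": "a", "document": "Juan firmó el contrato"},
    {"_id": "b", "document": "Juan demandó a Pedro",
     "annotation": {"entities": [
         {"text": "Juan", "label": "PER", "start": 0, "end": 4},
         {"text": "Pedro", "label": "PER", "start": 15, "end": 20},
     ]}},
]
# pandas aligns the dictionaries on their keys and fills the gaps with NaN
df_demo = pd.DataFrame([flatten_sample(s) for s in samples])
```

One difference matters for the rest of the notebook: here the missing cells are real `NaN` values, whereas the loop above writes the string `'NaN'`, which later cells compare against. Adopting this layout would therefore require switching those comparisons to `pd.isna`.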



## 5. Join dataframes (df_json_udt | df_arch | df_metadrive) on file and directory names

A. Join df_json_udt & df_arch on file and directory names

In [11]:
# The dots are escaped so that only a literal extension is removed (an unescaped "." matches any character)
df_arch['file_name'] = df_arch['file_name'].replace(r"\.pdf", "", regex=True)

df_json_udt['file_name'] = df_json_udt['file_name'].replace(r"\.udt\.json", "", regex=True)
df_json_udt['file_name'] = df_json_udt['file_name'].replace(r"\.json", "", regex=True)


df_json_arch = pd.merge(df_json_udt, df_arch, on=['file_name', 'file_dirname'], how='inner')
df_json_arch['file_dirname'] = df_json_arch['file_dirname'].replace('Ó','Ó', regex = True)
df_json_arch.rename(columns = {'file_path_x':'file_path'}, inplace = True)
del df_json_arch['file_path_y']
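
Both joins only work if file and directory names are normalized the same way on every dataframe. One way to keep the rules in a single place is a pair of small helpers, sketched below. The names `normalize_file_name` and `normalize_dirname` and the extension list are assumptions for illustration; the `$` anchor is a deliberate difference from the unanchored replacements used in the notebook.

```python
import re
import unicodedata

# Known extensions, longest alternatives first so ".docx" is not cut to "x"
_EXTENSION = re.compile(r"\.(udt\.json|json|pdf|docx|doc|jpeg|jpg|png|txt)$",
                        re.IGNORECASE)

def strip_accents(text):
    """Remove combining marks after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def normalize_file_name(name):
    """Drop a trailing known extension, then remove accents."""
    return strip_accents(_EXTENSION.sub("", name))

def normalize_dirname(dirname):
    """Turn hyphens and slashes into spaces, then remove accents."""
    return strip_accents(dirname.replace("-", " ").replace("/", " "))

name_a = normalize_file_name("Promoción inicial.udt.json")
name_b = normalize_file_name("20131115 - DOCUMENTO 4.pdf")
name_c = normalize_file_name("Anexo.docx")
dir_a = normalize_dirname("EJECUCIÓN-SENTENCIA")
```

On a dataframe, the helpers would be applied with `df['file_name'].map(normalize_file_name)` and `df['file_dirname'].map(normalize_dirname)` before each merge.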

B. Final join with df_metadrive on file and directory names

In [12]:
df_metadrive.rename(columns = {'Name':'file_name', 'Folder':'file_dirname'}, inplace = True)
df_metadrive['file_dirname'] = df_metadrive['file_dirname'].replace('-', ' ', regex = True)
# Strip the known extensions in one pass; "docx" comes before "doc" so that ".docx" does not leave a stray "x"
df_metadrive['file_name'] = df_metadrive['file_name'].replace(r"\.(pdf|jpeg|jpg|png|docx|doc|txt)", "", regex=True)
df_metadrive['file_dirname'] = df_metadrive['file_dirname'].replace("/"," ", regex=True)

df_metadrive['file_dirname'] = df_metadrive['file_dirname'].map(remove_accents)
df_metadrive['file_name'] = df_metadrive['file_name'].map(remove_accents)

df_final = pd.merge(df_metadrive, df_json_arch, on=['file_name', 'file_dirname'], how='inner')

In [13]:
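# Control: file names present in the JSON/ARCH join but missing from the MetaDrive data
metadrive_names = set(df_metadrive['file_name'])
ctrl_list = [name for name in pd.unique(df_json_arch['file_name']) if name not in metadrive_names]

print(ctrl_list)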



['20131115 - DOCUMENTO 4']


In [15]:
dict_iob ={
    'Word':[], 
    'Tag':[],
    'MD5':[],
    'type':[],
    'category':[]}

for ind in df_final.index:
    if "EXP" in df_final['document'][ind]:
        sample = df_final['document'][ind].replace("."," ")
    else:
        sample = df_final['document'][ind]

    sample = remove_accents(sample)
    
    if df_final['annotation.entities.0.start'][ind] == 'NaN':   # If there is not any anotation,
        for w in sample.split(" "):                             # then store the whole sample spliting with space
            dict_iob['Word'].append(w)
            dict_iob['Tag'].append('O')
            dict_iob['MD5'].append(df_final['file_MD5'][ind])
            dict_iob['type'].append(df_final['Tipo de documento'][ind])
            dict_iob['category'].append(df_final['Categoría'][ind])
        continue                                                # and continue

    else:

        i = 0
        annotationlist = {'start':[],'document':[],'label':[],'end':[]}     # Make a dict of whole anotations
        while True:
            try:
                if df_final[f'annotation.entities.{i}.start'][ind] == 'NaN':
                    break                
                else:
                    annotationlist['start'].append(df_final[f'annotation.entities.{i}.start'][ind])
                    annotationlist['document'].append(remove_accents(df_final[f'annotation.entities.{i}.text'][ind].replace("."," ")))
                    annotationlist['label'].append(df_final[f'annotation.entities.{i}.label'][ind])
                    annotationlist['end'].append(df_final[f'annotation.entities.{i}.end'][ind])
                i+=1     
            except KeyError:                                    # No column exists for entity number i
                break
        
        sampleinit = sample[:annotationlist['start'][0]]
        for w in sampleinit.split(" "):
            dict_iob['Word'].append(w)
            dict_iob['Tag'].append('O')
            dict_iob['MD5'].append(df_final['file_MD5'][ind])
            dict_iob['type'].append(df_final['Tipo de documento'][ind])
            dict_iob['category'].append(df_final['Categoría'][ind])
        
        df_annot = pd.DataFrame.from_dict(annotationlist)
        for indannot in df_annot.index:
            for w in df_annot['document'][indannot].split(" "):
                dict_iob['Word'].append(w)
                if dict_iob['Tag'][-1] == ('I-' + df_annot['label'][indannot]):
                    dict_iob['Tag'].append('I-' + df_annot['label'][indannot])
                    dict_iob['MD5'].append(df_final['file_MD5'][ind])
                    dict_iob['type'].append(df_final['Tipo de documento'][ind])
                    dict_iob['category'].append(df_final['Categoría'][ind])
                else:
                    if (dict_iob['Tag'][-1].startswith("B")) and (dict_iob['Tag'][-1][2:] == df_annot['label'][indannot]):
                        dict_iob['Tag'].append('I-' + df_annot['label'][indannot])
                        dict_iob['MD5'].append(df_final['file_MD5'][ind])
                        dict_iob['type'].append(df_final['Tipo de documento'][ind])
                        dict_iob['category'].append(df_final['Categoría'][ind])
                    else:
                        dict_iob['Tag'].append('B-' + df_annot['label'][indannot])
                        dict_iob['MD5'].append(df_final['file_MD5'][ind])
                        dict_iob['type'].append(df_final['Tipo de documento'][ind])
                        dict_iob['category'].append(df_final['Categoría'][ind])
            try:
                if df_annot['end'][indannot]+1 == df_annot['start'][indannot+1]:
                    continue
                else:
                    sampleinbetween = sample[df_annot['end'][indannot]:df_annot['start'][indannot+1]]
                    for w in sampleinbetween.split(" "):
                        dict_iob['Word'].append(w)
                        dict_iob['Tag'].append('O')
                        dict_iob['MD5'].append(df_final['file_MD5'][ind])
                        dict_iob['type'].append(df_final['Tipo de documento'][ind])
                        dict_iob['category'].append(df_final['Categoría'][ind])
            except KeyError:                                    # Last annotation: there is no next start offset
                continue

        samplefinal = sample[annotationlist['end'][-1]:]
        for w in samplefinal.split(" "):
            dict_iob['Word'].append(w)
            dict_iob['Tag'].append('O')
            dict_iob['MD5'].append(df_final['file_MD5'][ind])
            dict_iob['type'].append(df_final['Tipo de documento'][ind])
            dict_iob['category'].append(df_final['Categoría'][ind])
        
df_iob = pd.DataFrame.from_dict(dict_iob)  

PUNCTUATION = '''!"#$%&'()*+,-:;<=>?@[]^_`{|}~—”�'''

df_iob['Sentence #'] = ''
df_iob.at[0, 'Sentence #'] = 'Sentence : 0'

i = 1
for ind in df_iob.index:
    # A period closes the sentence: the next word, if any, opens a new one
    if '.' in df_iob.at[ind, 'Word'] and ind + 1 in df_iob.index:
        df_iob.at[ind+1, 'Sentence #'] = f'Sentence : {i}'
        if any(x in df_iob.at[ind+1, 'Word'] for x in PUNCTUATION) and len(df_iob.at[ind+1, 'Word'])<2:
            df_iob.at[ind, 'Word'] = ' ' + df_iob.at[ind, 'Word']
        i += 1

    for character in PUNCTUATION:                               # Strip punctuation from the word
        df_iob.at[ind, 'Word'] = df_iob.at[ind, 'Word'].replace(character, '')

    if df_iob.at[ind, 'Sentence #'] == "":                      # Words inside a sentence inherit its number
        df_iob.at[ind, 'Sentence #'] = df_iob.at[ind-1, 'Sentence #']

# Drop the rows left empty by the punctuation clean-up and renumber the index
df_iob = df_iob[df_iob['Word'] != ""].reset_index(drop=True)

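for ind in df_iob.index:
    if ind > 0:
        prev_tag = df_iob.at[ind-1, 'Tag']
        # An I- tag right after an O tag must open a new entity
        if df_iob.at[ind, 'Tag'].startswith('I') and prev_tag == 'O':
            df_iob.at[ind, 'Tag'] = 'B' + df_iob.at[ind, 'Tag'][1:]
        # A B- tag that continues the label of the previous word becomes I-
        if df_iob.at[ind, 'Tag'].startswith('B') and df_iob.at[ind, 'Tag'][1:] == prev_tag[1:]:
            df_iob.at[ind, 'Tag'] = 'I' + df_iob.at[ind, 'Tag'][1:]
        # An entity cannot continue across a sentence boundary
        if df_iob.at[ind, 'Tag'].startswith('I') and df_iob.at[ind-1, 'Sentence #'] != df_iob.at[ind, 'Sentence #']:
            df_iob.at[ind, 'Tag'] = "B" + df_iob.at[ind, 'Tag'][1:]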

first_column = df_iob.pop('Sentence #')
df_iob.insert(0, 'Sentence #', first_column)
df_iob


In [None]:
df_iob.to_csv('df_iob.csv')
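
The IOB conversion above rebuilds the word list by slicing the sample around each annotation. A more compact way to reach a similar result is to tokenize once and look up, for each token, the annotation span that contains its first character. The sketch below is an illustration under assumptions, not a drop-in replacement: `spans_to_iob` is a hypothetical helper, offsets are assumed to be end-exclusive, and the sample sentence is toy data.

```python
import re

def spans_to_iob(text, entities):
    """Tag whitespace-separated tokens with B-/I-/O labels.

    `entities` is a list of dicts with `start`, `end` (assumed end-exclusive)
    and `label`. A token belongs to an entity if it starts inside its span.
    """
    rows = []
    previous_entity = None
    for match in re.finditer(r"\S+", text):
        # Index of the entity whose span contains the token start, if any
        entity_index = next(
            (n for n, e in enumerate(entities)
             if e["start"] <= match.start() < e["end"]),
            None,
        )
        if entity_index is None:
            tag = "O"
        elif entity_index == previous_entity:
            tag = "I-" + entities[entity_index]["label"]
        else:
            tag = "B-" + entities[entity_index]["label"]
        previous_entity = entity_index
        rows.append((match.group(), tag))
    return rows

text = "Juan Perez demando a Pedro"
entities = [
    {"start": 0, "end": 10, "label": "PER"},
    {"start": 21, "end": 26, "label": "PER"},
]
tagged = spans_to_iob(text, entities)
```

Unlike the cell above, this sketch opens a new `B-` tag for every annotated entity, even when the neighbouring entity carries the same label, and it does not handle sentence numbering or punctuation clean-up.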