# Dataset preprocessing for Relation Extraction

### Steps to follow:

#### 1. Load the dataset and use the ast library to preserve the data types

##### 1.1. Create a copy of the dataset to work with

#### 2. Extract sentences, entity pairs and labels

#### 3. Mask the target entities in each sentence based on the entity-pair fields & create the prompt structure for each sentence

#### 4. Save the prompted copy in JSON format for training.

## 1. Load the dataset and use the ast library to preserve the data types
### 1.1. Create a copy of the dataset to work with

In [1]:
import pandas as pd
import numpy as np
import ast
import regex as re
import json
from prompt_processing import process_re_prompt, process_re_prompt_blackbox

In [2]:
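df = pd.read_csv('/home/tensorboard/Documentos/1. D4R/Relation extraction/Relations_Diego_Corrections_14-02-24_Version_2(1).csv')
# Work on a copy so the raw DataFrame stays untouched
df_to_work = df.copy()
# CSV cells are read back as strings; parse the token lists into real Python lists
df_to_work['Phrase_tokenized'] = df_to_work['Phrase_tokenized'].apply(ast.literal_eval)
# Drop the leftover 'Unnamed: 0' index column
df_to_work = df_to_work.drop(columns='Unnamed: 0')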
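
The `ast.literal_eval` step matters because a CSV file stores every cell as text, so a column of token lists comes back as strings that only look like lists. The following is a minimal sketch of that round trip using just the standard library; the two-token row is made up for illustration and is not taken from the dataset.

```python
import ast
import csv
import io

# Hypothetical row: a sentence and its token list, written the way a CSV export would.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Sentence", "Phrase_tokenized"])
writer.writerow(["Doña Ana", str(["Doña", "Ana"])])

# Reading the data back yields a plain string for every cell.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_tokens = row["Phrase_tokenized"]
print(type(raw_tokens).__name__)

# literal_eval parses the Python literal back into a list without executing code.
tokens = ast.literal_eval(raw_tokens)
print(type(tokens).__name__, tokens)
```

`ast.literal_eval` only accepts Python literals (strings, numbers, lists, dicts and similar), which makes it a safer choice than `eval` for data loaded from a file.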

## 2. Extract sentences, entity pairs and labels

In [3]:
df_reduced = df_to_work.copy()

df_reduced = df_reduced[['Sentence', 'Number_of_entities', 'Entity_1', 'Entity_2', 'Entity_1_type', 'Entity_2_type', 'Relations']]

df_reduced = df_reduced.rename(columns={'Relations': 'Label'})

In [4]:
specifics = set(df_reduced['Entity_1_type'].values.tolist())
specifics

{'ORG', 'PERSON', 'PERSON_REFERENCE', 'PLACE'}

In [5]:
rows_with_nulls = df_reduced[df_reduced.isnull().any(axis=1)]
rows_with_nulls

Unnamed: 0,Sentence,Number_of_entities,Entity_1,Entity_2,Entity_1_type,Entity_2_type,Label
115,Y que le parece que [PERSON]Pedro de Cazalla...,4,Pedro de Cazalla,este confesante,PERSON,PERSON_REFERENCE,


In [6]:
df_reduced.dropna(inplace=True)


In [7]:
rows_with_nulls = df_reduced[df_reduced.isnull().any(axis=1)]
rows_with_nulls

Unnamed: 0,Sentence,Number_of_entities,Entity_1,Entity_2,Entity_1_type,Entity_2_type,Label


In [8]:
df_reduced['Number_of_entities'].value_counts()

Number_of_entities
5    469
4    323
3    183
1    157
2     99
Name: count, dtype: int64

## 3. Mask the target entities in each sentence based on the entity-pair fields & 4. Create the prompt structure for each sentence and add it as a field in the copy of the dataset. Also, create position embeddings.


In [9]:
new_df_list = df_reduced[['Sentence', 'Label']].copy()  # explicit copy avoids SettingWithCopyWarning

In [12]:
# Strip the bracketed entity-type tags (e.g. [PERSON]) from each sentence
new_df_list['Sentence'] = new_df_list['Sentence'].apply(lambda x: re.sub(r'\[.*?\]', '', x))
new_df_list


Unnamed: 0,Sentence,Label
0,Doña Ana Enríquez y de oídas lo depuso en 23...,Actos Procesales
1,Doña Juana de Fonseca lo depuso en 19 de abr...,Actos Procesales
2,"Fray Alonso de Horozco depone de oídas , lo...",Actos Procesales
3,Testigoº5 Doña Antonia de Branches depuso a ...,Actos Procesales
4,Testigoº No declarado en el proceso Fray Ant...,Actos Procesales
...,...,...
1227,Siendo a todo ello testigos Juan Velázquez d...,Roles Procesales
1228,Siendo a todo ello testigos Juan Velázquez d...,Roles Procesales
1229,Siendo a todo ello testigos Juan Velázquez d...,Roles Procesales
1230,Siendo a todo ello testigos Juan Velázquez d...,Roles Procesales
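
To make the tag-stripping step concrete, here is a small sketch of the same `\[.*?\]` pattern applied to a single string. It uses the standard-library `re` module (the notebook imports the third-party `regex` package under that name, which accepts the same pattern), and the sample sentence is invented, modelled only on the `[PERSON]` prefix style visible in the dataset. The helper name `strip_tags` is hypothetical.

```python
import re  # stand-in for the `regex` package used in the notebook

# Non-greedy match: each bracketed tag is removed on its own.
TAG_PATTERN = re.compile(r"\[.*?\]")

def strip_tags(sentence: str) -> str:
    """Remove every bracketed tag, leaving the entity text in place."""
    return TAG_PATTERN.sub("", sentence)

# Made-up sample sentence with two entity-type tags.
sample = "Testigo [PERSON]Juan depuso en [PLACE]Valladolid"
print(strip_tags(sample))
```

Note that the pattern removes any bracketed span, so editorial brackets inside a sentence would be dropped along with the entity tags.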


In [13]:
new_df_list.to_csv('test_20-06-24.csv')

In [11]:
new_df_list = process_re_prompt(df_reduced=df_reduced)


In [None]:
new_df_list_black_box = process_re_prompt_blackbox(df_reduced=df_reduced)
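
The implementation of `process_re_prompt` lives in `prompt_processing` and is not shown in this notebook. Purely as an illustration of what an entity-masking and prompt-building step can look like, the sketch below wraps the two entities of a row in marker tokens and pairs the resulting prompt with its label. The function names (`mark_entities`, `build_prompt`), the `[E1]`/`[E2]` markers, the prompt wording and the sample row are all assumptions, not the actual method used here.

```python
def mark_entities(sentence: str, entity_1: str, entity_2: str) -> str:
    """Wrap the first occurrence of each entity in hypothetical marker tokens.

    Limitation of this sketch: if entity_2 also occurs inside entity_1,
    the second replacement may land inside the first marked span.
    """
    marked = sentence.replace(entity_1, f"[E1]{entity_1}[/E1]", 1)
    marked = marked.replace(entity_2, f"[E2]{entity_2}[/E2]", 1)
    return marked

def build_prompt(row: dict) -> dict:
    """Build one training record from a row with the notebook's column names."""
    marked = mark_entities(row["Sentence"], row["Entity_1"], row["Entity_2"])
    return {
        # Assumed prompt template, for illustration only.
        "prompt": f"Sentence: {marked}\nRelation between [E1] and [E2]:",
        "label": row["Label"],
    }

# Made-up row using the column names of df_reduced.
row = {
    "Sentence": "Juan depuso en Valladolid",
    "Entity_1": "Juan",
    "Entity_2": "Valladolid",
    "Label": "Actos Procesales",
}
example = build_prompt(row)
print(example["prompt"])
print(example["label"])
```

A list of such dictionaries can be passed directly to `json.dump`, which is one reason to build plain Python records rather than keep the prompts in a DataFrame.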

In [13]:
import time

# Time the original function (perf_counter is a monotonic, high-resolution clock)
start_time = time.perf_counter()
result_original = process_re_prompt(df_reduced)
end_time = time.perf_counter()
print(f"Original function took {end_time - start_time:.2f} seconds")

# Time the improved function
start_time = time.perf_counter()
result_improved = process_re_prompt_blackbox(df_reduced)
end_time = time.perf_counter()
print(f"Improved function took {end_time - start_time:.2f} seconds")


Original function took 0.08 seconds
Improved function took 0.04 seconds
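
Both measurements are well under a tenth of a second and come from a single run each, so the difference could partly be noise. One way to get a steadier comparison is to repeat each call several times with `timeit` and keep the best result. The sketch below shows the idea with two stand-in functions (`loop_version` and `comprehension_version` are hypothetical placeholders); in the notebook they would be replaced by `process_re_prompt` and `process_re_prompt_blackbox`.

```python
import timeit

def best_time(func, *args, repeat=5, number=10):
    """Return the best average seconds per call over several repeats."""
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number

# Stand-in workloads, used only so the sketch runs on its own.
def loop_version(items):
    out = []
    for item in items:
        out.append(item.upper())
    return out

def comprehension_version(items):
    return [item.upper() for item in items]

data = ["frase"] * 1000
print(f"loop: {best_time(loop_version, data):.6f} s per call")
print(f"comprehension: {best_time(comprehension_version, data):.6f} s per call")
```

Taking the minimum over repeats is a common convention because slower runs usually reflect interference from other processes rather than the code itself.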


In [11]:
with open('prompted_RE_senteces_19-06-24.json', 'w', encoding='utf-8') as f:
    json.dump(new_df_list, f, ensure_ascii=False, indent=4)