# Brat Tags Data Analysis

## IDs in Brat 

- T: text-bound annotation
- R: relation
- E: event
- A: attribute
- M: modification (alias for attribute, for backward compatibility)
- N: normalization [new in v1.3]

<br> <br>

### Annotation ID conventions
All annotation IDs consist of a single upper-case character identifying the annotation type, followed by a number. The initial ID characters map to annotation types as listed above.
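As a minimal sketch of this convention, an ID can be split into its type character and its number with a regular expression. The helper name `split_annotation_id` is an assumption made for illustration; it is not part of brat.

```python
import re

# One upper-case letter followed by one or more digits, e.g. "T12"
ID_PATTERN = re.compile(r"^([A-Z])(\d+)$")

def split_annotation_id(annotation_id):
    """Hypothetical helper: split an ID such as 'T12' into ('T', 12)."""
    match = ID_PATTERN.match(annotation_id)
    if match is None:
        raise ValueError(f"not a letter-plus-number annotation ID: {annotation_id!r}")
    return match.group(1), int(match.group(2))

print(split_annotation_id("T1"))
print(split_annotation_id("R12"))
```

The analysis later in this notebook does the same thing column-wise by stripping the digits from the `id` column.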

<br>

#### Entity annotations
Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a "start end" offset pair).

| ID 	| Type And Span      	| Text     	|
|----	|--------------------	|----------	|
| T1 	| Organization 0 4   	| Sony     	|
| T3 	| Organization 33 41 	| Ericsson 	|
| T4 	| Country 75 81      	| Sweden   	|
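The sketch below parses one entity line into its parts, assuming the tab-separated layout used in the `.ann` files read later in this notebook and a single contiguous span. The `Entity` tuple and `parse_entity_line` are illustrative names, not brat APIs.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str

def parse_entity_line(line):
    """Hypothetical helper: parse 'ID<TAB>Type start end<TAB>text'."""
    ann_id, type_and_span, text = line.rstrip("\n").split("\t")
    # Assumes one contiguous span, i.e. exactly "Type start end"
    ent_type, start, end = type_and_span.split(" ")
    return Entity(ann_id, ent_type, int(start), int(end), text)

entity = parse_entity_line("T1\tOrganization 0 4\tSony")
print(entity)
```

In this example the span length (`end - start`) equals the length of the annotated text.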

<br> 

#### Attribute and modification annotations
Attribute annotations are binary or multi-valued "flags" that specify further aspects of other annotations. Attributes have a unique ID and are defined by reference to the ID of the annotation that the attribute marks and the attribute value.

| ID 	| Type & Entity ID  	|
|----	|-------------------	|
| A1 	| Negation T1       	|
| A2 	| Confidence T2     	|
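A minimal sketch of reading an attribute line, under the assumption that the body is `Name TargetID` for binary flags with an optional third token carrying the value for multi-valued attributes. The function name and the sample value `L1` are illustrative only.

```python
def parse_attribute_line(line):
    """Hypothetical helper: parse 'ID<TAB>Name Target [Value]'."""
    ann_id, body = line.rstrip("\n").split("\t")[:2]
    parts = body.split()
    name, target = parts[0], parts[1]
    # Binary attributes carry no value token
    value = parts[2] if len(parts) > 2 else None
    return {"id": ann_id, "name": name, "target": target, "value": value}

binary = parse_attribute_line("A1\tNegation T1")
multi = parse_attribute_line("A2\tConfidence T2 L1")
print(binary)
print(multi)
```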

<br>

#### Relation annotations
Binary relations have a unique ID and are defined by their type (e.g. Origin, Part-of) and their arguments.

| ID 	| Type and Args          	|
|----	|------------------------	|
| R1 	| Origin Arg1:T3 Arg2:T4 	|
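A relation line can be parsed the same way. The sketch below turns the `Arg1:T3 Arg2:T4` part into a dictionary; `parse_relation_line` is a hypothetical helper and it tolerates a trailing tab after the arguments.

```python
def parse_relation_line(line):
    """Hypothetical helper: parse 'ID<TAB>Type Arg1:ID Arg2:ID'."""
    ann_id, body = line.rstrip("\n").split("\t")[:2]
    rel_type, *raw_args = body.split()
    # Each argument looks like "Arg1:T3": role name, colon, annotation ID
    arguments = dict(arg.split(":", 1) for arg in raw_args)
    return {"id": ann_id, "type": rel_type, "args": arguments}

relation = parse_relation_line("R1\tOrigin Arg1:T3 Arg2:T4")
print(relation)
```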



## First Iteration

<br>


- Hand-picked: Tweets were hand-picked if they were clear examples of true public service reports, or if they did not look like a report but actually were one (confusing tweets).

- Sampled: A random sample of tweets was drawn from the original dataframe. The sampling was done separately.

<br>

- Hand-picked


| id_parsed 	|          annotation_parsed 	| Count 	|
|----------:	|---------------------------:	|------:	|
| A         	| without-service            	| 63    	|
|           	| location                   	| 38    	|
|           	| duration                   	| 16    	|
|           	| time                       	| 10    	|
|           	| fake-information           	| 2     	|
|           	| with-service               	| 2     	|
|           	| reason                     	| 1     	|
| T         	| circumstantial-information 	| 65    	|
|           	| social-report              	| 39    	|
|           	| electricity                	| 33    	|
|           	| gasoline                   	| 25    	|
|           	| water                      	| 6     	|
|           	| gas                        	| 4     	|

<br>

- Sampled

| id_parsed 	|          annotation_parsed 	| Count 	|
|----------:	|---------------------------:	|-----:	|
| A         	| without-service            	| 24   	|
|           	| location                   	| 24   	|
|           	| time                       	| 10   	|
|           	| utility-company            	| 5    	|
|           	| duration                   	| 4    	|
|           	| fake-information           	| 4    	|
|           	| politician                 	| 4    	|
|           	| reason                     	| 3    	|
|           	| news-company               	| 3    	|
|           	| with-service               	| 2    	|
|           	| other                      	| 1    	|
| T         	| circumstantial-information 	| 41   	|
|           	| electricity                	| 26   	|
|           	| social-report              	| 25   	|
|           	| twitter-account            	| 12   	|
|           	| gasoline                   	| 3    	|
|           	| water                      	| 1    	|
|           	| water                      	| 1    	|

In [3]:
import pandas as pd
import numpy as np 
pd.set_option('display.max_colwidth', None)



### Read Data
original_df = pd.read_csv('../brat-v1.3_Crunchy_Frog/data/first-iter/balanced_dataset_brat.ann', sep = '\t',header = None)

# Rename columns
original_df.columns = ['id', 'annotation', 'text']

# Strip the digits from the ID to know whether it's an entity (T) or an attribute (A)
original_df['id_parsed'] = original_df['id'].str.replace(r'\d', '', regex=True)

# Strip span offsets and referenced IDs (T & A) so only the attribute or entity name remains
original_df['annotation_parsed'] = original_df['annotation'].str.replace(r'[\dTA]', '', regex=True).str.strip()


# Remove relation tags
# Change relation IDs to null
original_df['id_parsed'] = original_df['id_parsed'].replace('R', np.nan)

# Remove nulls
original_df.dropna(subset=['id_parsed'], inplace=True)

# Group by id_parsed and annotation_parsed and count the rows in each group
df = original_df.groupby(['id_parsed', 'annotation_parsed'], sort=True).size().to_frame('count')

# Sort by count (descending), then re-sort on the first index level only.
# The second sort reorders by ID type, so the count order survives within each type.
df.sort_values('count', ascending=False)\
    .sort_index(level=[0], ascending=[True])

| id_parsed | annotation_parsed          | count |
|-----------|----------------------------|------:|
| A         | without-service            | 63    |
|           | location                   | 38    |
|           | duration                   | 16    |
|           | time                       | 10    |
|           | news-company               | 4     |
|           | fake-information           | 2     |
|           | with-service               | 2     |
|           | reason                     | 1     |
| T         | circumstantial-information | 65    |
|           | social-report              | 39    |
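The ordering shown above (grouped by ID type, counts descending within each type) can also be produced with a single explicit sort on two keys, which avoids relying on the interaction between two consecutive sorts. This is a sketch on toy data; the rows are made up for illustration and are not the notebook's annotations.

```python
import pandas as pd

# Toy annotations (illustrative values only)
toy = pd.DataFrame({
    "id_parsed": ["A", "A", "A", "A", "A", "A", "T", "T", "T"],
    "annotation_parsed": ["location", "location", "location",
                          "duration", "duration", "time",
                          "electricity", "electricity", "water"],
})

# Count rows per (id_parsed, annotation_parsed) pair
counts = (
    toy.groupby(["id_parsed", "annotation_parsed"])
       .size()
       .reset_index(name="count")
)

# One sort: ID type ascending, then count descending within each type
ordered = counts.sort_values(["id_parsed", "count"], ascending=[True, False])
print(ordered.to_string(index=False))
```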


## Sampled

The following data was randomly sampled with the helper script `data_sampler.py`:

- Random_state: 58 
- Sample: 30 

` python data_sampler.py 58 30 brat-v1.3_Crunchy_Frog/data/first-iter/sampled_58_30.txt `
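The contents of `data_sampler.py` are not shown in this notebook. As a rough sketch of what a reproducible sample with these two parameters could look like in pandas, the function `sample_tweets` and the toy `full_text` values below are assumptions made for illustration.

```python
import pandas as pd

def sample_tweets(df, random_state, n):
    """Hypothetical stand-in: draw n rows without replacement, reproducibly."""
    return df.sample(n=n, random_state=random_state)

# Toy dataframe standing in for the original tweets
tweets = pd.DataFrame({"full_text": [f"tweet {i}" for i in range(100)]})

sample = sample_tweets(tweets, random_state=58, n=30)
print(len(sample))
```

Fixing `random_state` means that re-running the sampler on the same dataframe returns the same rows.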


In [7]:
complete_df = pd.read_csv('tagging-set-original_for_jupyter_tagging.csv')

# pd.DataFrame(complete_df.sample(30, random_state = 9).full_text)



# header=None keeps the first annotation row from being consumed as the header
# (without it, one circumstantial-information row is dropped: 64 instead of 65)
test_balance = pd.read_csv('../brat-v1.3_Crunchy_Frog/data/first-iter/balanced_dataset_brat.ann', sep = '\t', header = None)


# Rename columns
test_balance.columns = ['id', 'annotation', 'text']

# Strip the digits from the ID to know whether it's an entity (T) or an attribute (A)
test_balance['id_parsed'] = test_balance['id'].str.replace(r'\d', '', regex=True)

# Strip span offsets and referenced IDs (T & A) so only the attribute or entity name remains
test_balance['annotation_parsed'] = test_balance['annotation'].str.replace(r'[\dTA]', '', regex=True).str.strip()


# Remove relation tags
# Change relation IDs to null
test_balance['id_parsed'] = test_balance['id_parsed'].replace('R', np.nan)

# Remove nulls
test_balance.dropna(subset=['id_parsed'], inplace=True)

# Group by id_parsed and annotation_parsed and count the rows in each group
df = test_balance.groupby(['id_parsed', 'annotation_parsed'], sort=True).size().to_frame('count')

# Sort by count (descending), then re-sort on the first index level only.
# The second sort reorders by ID type, so the count order survives within each type.
df.sort_values('count', ascending=False)\
    .sort_index(level=[0], ascending=[True])


| id_parsed | annotation_parsed          | count |
|-----------|----------------------------|------:|
| A         | without-service            | 63    |
|           | location                   | 38    |
|           | duration                   | 16    |
|           | time                       | 10    |
|           | news-company               | 4     |
|           | fake-information           | 2     |
|           | with-service               | 2     |
|           | reason                     | 1     |
| T         | circumstantial-information | 65    |
|           | social-report              | 39    |


### Reading Data to input into baseline model

Strategy:

<br>

- Drop attributes and relations.
- Split the dataframe on spaces into annotation, span start, and span end.
- Sort the dataframe by the second column (span start).
- Load the original text file. The first column is the position in the document where each line starts.
- Join on the line start. Schema: line start and tweet.

<br>

Roadblock

- The problem with this approach is that the line start positions in the text file do not match the spans in the annotation file. A span covers only the tagged text, and the first word of a tweet is not necessarily tagged, so a span's start need not coincide with a line start. The same holds for the end of the last tag's span versus the end of the line.


#### Helper Function

This helper function reads brat's annotation file (.ann) and parses it into a dataframe with the following schema:


| Column            	| Dtype  	|
|-------------------	|--------	|
| id                	| object 	|
| annotation        	| object 	|
| text              	| object 	|
| id_parsed         	| object 	|
| annotation_parsed 	| object 	|

In [8]:
import numpy as np
import pandas as pd

def annotation_parser(dir_ann_file, print_grouped_annotations = False):
    '''
        Helper function to parse Brat's annotation file (.ann). 
            Returns a dataframe with the following schema: 
                | id                	| object 	|
                | annotation        	| object 	|
                | text              	| object 	|
                | id_parsed         	| object 	|
                | annotation_parsed 	| object 	|
    --------------------------------------------------------------
    
    Params:
        
        dir_ann_file: String. 
            Path to the .ann file.
        
        print_grouped_annotations: Bool. Default = False. 
            Prints the count of entities and attributes.
    '''
    
    # Read data
    to_parse_df = pd.read_csv(dir_ann_file, sep = '\t', header = None)

    # Rename columns
    to_parse_df.columns = ['id', 'annotation', 'text']

    # Strip the digits from the ID to know whether it's an entity (T) or an attribute (A)
    to_parse_df['id_parsed'] = to_parse_df['id'].str.replace(r'\d', '', regex=True)

    # Strip span offsets and referenced IDs (T & A) so only the attribute or entity name remains
    to_parse_df['annotation_parsed'] = to_parse_df['annotation'].str.replace(r'[\dTA]', '', regex=True).str.strip()


    # Remove relation tags
    # Change relation IDs to null
    to_parse_df['id_parsed'] = to_parse_df['id_parsed'].replace('R', np.nan)

    # Remove nulls
    to_parse_df.dropna(subset=['id_parsed'], inplace=True)

    # Group by id_parsed and annotation_parsed and count the rows in each group
    df = to_parse_df.groupby(['id_parsed', 'annotation_parsed'], sort=True).size().to_frame('count')

    # Sort by count (descending), then re-sort on the first index level only,
    # so the count order survives within each ID type
    if print_grouped_annotations:
        print(df.sort_values('count', ascending=False)\
            .sort_index(level=[0], ascending=[True]))
    
    return to_parse_df

### Read Annotations

<br>

- Use the `annotation_parser` helper function to read the data.
- Subset only entities for simplicity's sake.
- Create the span columns and store them in the dataframe `split_ann`.

In [57]:

# Read sampled data
sampled_ann = annotation_parser('../brat-v1.3_Crunchy_Frog/data/first-iter/balanced_dataset_brat.ann') # 308 rows 

# Subset Entities and rewrite dataframe
sampled_ann = sampled_ann[sampled_ann.id_parsed == 'T'] # 109 rows

# Create span columns. Split by space.
split_ann = sampled_ann.annotation.str.split(' ', expand = True)

# Rename columns
split_ann.columns = ['Entities', 'first_char', 'last_char'] # Rows keep the annotation file's order, which is only roughly ascending by first_char

# Create a new column with each annotation's text (sampled_ann already holds only entities).
split_ann['text'] = sampled_ann['text']

split_ann.head()

|     | Entities                   | first_char | last_char | text |
|-----|----------------------------|------------|-----------|------|
| 0   | circumstantial-information | 9          | 56        | terrazas del club hipico calle bolivia |
| 2   | electricity                | 0          | 7         | #sinluz |
| 4   | electricity                | 60         | 67        | #sinluz |
| 6   | circumstantial-information | 69         | 90        | valle de la pascua |
| 8   | circumstantial-information | 92         | 113       | desde las 7 00 am |
| ... | ...                        | ...        | ...       | ... |
| 303 | social-report              | 6320       | 6453      | pongan la luz por favor el calor es insoportable @corpoelecinfo @fbritomaestre @mppeevzla @omarprietogob @diariopanorama |
| 310 | twitter-account            | 1214       | 1229      | @luisgonzaloprz |
| 317 | twitter-account            | 1881       | 1896      | @traffivalencia |
| 322 | twitter-account            | 2074       | 2084      | @vtvcanal8 |


### Read Text and Merge with Annotation File

<br>

Approach

- The base strategy is to find the starting position of each new line and join it with the `first_char` column in the annotation table.

<br>

Roadblock

- As noted above, the line start positions in the text file do not match the annotation spans: a span covers only the tagged text, so the first tag of a tweet need not start where the line starts, and the last tag need not end where the line ends.

<br>

To do:

- One option is to use fuzzy string matching to match the `text` column of `split_ann` against the original text dataframe.
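A minimal sketch of the fuzzy matching idea, using only `difflib` from the standard library. The function `best_tweet_match` and the scoring rule (fraction of the annotation text covered by matching blocks) are assumptions for illustration, not a tested solution for this dataset.

```python
from difflib import SequenceMatcher

def best_tweet_match(snippet, tweets):
    """Hypothetical helper: return (index, score) of the tweet most similar to snippet."""
    def score(tweet):
        # An exact substring is a perfect match
        if snippet in tweet:
            return 1.0
        matcher = SequenceMatcher(None, snippet, tweet, autojunk=False)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        # Fraction of the snippet's characters found in the tweet
        return matched / max(len(snippet), 1)

    scores = [score(tweet) for tweet in tweets]
    best = max(range(len(tweets)), key=scores.__getitem__)
    return best, scores[best]

tweets = [
    "#sinluz valle de la pascua desde las 7 00 am",
    "20 horas #sinluz",
]
print(best_tweet_match("valle de la pascua", tweets))
print(best_tweet_match("20 horas", tweets))
```

One caveat with matching on text alone: short annotations such as `#sinluz` occur in many tweets, so ties are resolved arbitrarily (here, the first tweet wins).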

In [87]:
# Import Regex Module
import re

# Set max column width to none in order to read all the tweet's text. 
pd.options.display.max_colwidth = None

# Read text
text = pd.read_csv('../brat-v1.3_Crunchy_Frog/data/first-iter/balanced_dataset_brat.txt', header=None )

# Rename column to tweets
text.columns = ['tweets']

# Convert tweets to string
text_string = text.tweets.to_string( header = False)

# Remove extra spaces and replace them with just 1 space.
text_string = re.sub(' +', ' ', text_string)

#last_char = [pos for pos, char in enumerate(text_string) if char == '\n']

# last_char

Attempt to solve the roadblock

In [105]:
# Read the .txt file line by line (assumes the file is UTF-8 encoded)
with open('../brat-v1.3_Crunchy_Frog/data/first-iter/balanced_dataset_brat.txt', encoding='utf-8') as f:
    # Keep the trailing "\n" of every line so that line lengths add up to document offsets
    lines = f.readlines()

    # Store (length, text) for every line, newline included,
    # assuming one line corresponds to a single tweet
    tuple_tweets = [(len(l), l) for l in lines]

    start, end, text_ = list(), list(), list()
    new_start = 0
    for ttw in tuple_tweets:
        # offset of the first character of the line
        start.append(new_start)
        # offset of the last character of the line (the newline)
        end.append(new_start + ttw[0] - 1)
        text_.append(ttw[1])
        
        # starting offset of the next line
        new_start = new_start + ttw[0]

    df = pd.DataFrame({
            "start": start,
            "end": end,
            "text": text_
            })
df.head(10)

|   | start | end | text |
|---|-------|-----|------|
| 0 | 0     | 58  | #sinluz terrazas del club hipico calle bolivia \n |
| 1 | 59    | 59  | \n |
| 2 | 60    | 113 | #sinluz valle de la pascua desde las 7 00 am\n |
| 3 | 114   | 114 | \n |
| 4 | 115   | 139 | 20 horas #sinluz\n |
| 5 | 140   | 140 | \n |
| 6 | 141   | 200 | hoy toca ramen a la luz de una vela #sinluz #lptm\n |
| 7 | 201   | 201 | \n |
| 8 | 202   | 395 | dios si nos devolves la loooz me vuelvo provida heterosexual catolico apostolico romano y practicamente homofobico y machirulo te va todo eso #sinluz\n |
| 9 | 396   | 396 | \n |


In [104]:
split_ann.head(10)

|    | Entities                   | first_char | last_char | text |
|----|----------------------------|------------|-----------|------|
| 0  | circumstantial-information | 9          | 56        | terrazas del club hipico calle bolivia |
| 2  | electricity                | 0          | 7         | #sinluz |
| 4  | electricity                | 60         | 67        | #sinluz |
| 6  | circumstantial-information | 69         | 90        | valle de la pascua |
| 8  | circumstantial-information | 92         | 113       | desde las 7 00 am |
| 10 | circumstantial-information | 115        | 124       | 20 horas |
| 12 | electricity                | 132        | 139       | #sinluz |
| 14 | social-report              | 141        | 184       | hoy toca ramen a la luz de una vela |
| 15 | electricity                | 186        | 193       | #sinluz |
| 17 | social-report              | 202        | 382       | dios si nos devolves la loooz me vuelvo provida heterosexual catolico apostolico romano y practicamente homofobico y machirulo te va todo eso |
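Comparing the two tables, every annotation span falls inside the `start`/`end` range of exactly one line, so an exact join on `first_char` is not needed. One possible way to assign each annotation to its tweet is an as-of join that picks the last line starting at or before `first_char`. The sketch below uses `pd.merge_asof` on a few rows copied from the tables above; the frame names `line_offsets` and `ann` are illustrative.

```python
import pandas as pd

# A few line offsets copied from the line table above
line_offsets = pd.DataFrame({
    "start": [0, 60, 115],
    "end": [58, 113, 139],
})

# A few annotation spans copied from split_ann (offsets arrive as strings after str.split)
ann = pd.DataFrame({
    "Entities": ["circumstantial-information", "electricity", "electricity",
                 "circumstantial-information", "electricity"],
    "first_char": ["9", "0", "60", "115", "132"],
    "last_char": ["56", "7", "67", "124", "139"],
})

# merge_asof needs numeric keys of matching dtype, sorted ascending
ann[["first_char", "last_char"]] = ann[["first_char", "last_char"]].astype("int64")
ann = ann.sort_values("first_char")

# For each annotation, take the last line whose start is <= first_char
matched = pd.merge_asof(
    ann, line_offsets,
    left_on="first_char", right_on="start",
    direction="backward",
)

# Sanity check: the span must also end inside the matched line
# (last_char is exclusive, end is the inclusive offset of the newline)
matched["inside_line"] = matched["last_char"] <= matched["end"] + 1
print(matched)
```

Joining the result back to the line table on `start` would then attach the full tweet text to each annotation.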
