# 1. Short research

## Tutorials
- [Good and short explanation of the Brazilian court system](https://www.brazilcounsel.com/blog/understanding-the-structure-of-the-brazilian-court-system)
- [Another good description](https://www.brasildefato.com.br/2018/01/19/what-is-the-structure-of-brazilian-judicial-branch)


## Our data from the R script
- The data that we scraped with [these R scripts](https://github.com/jjesusfilho) come from the [TJSP page](https://esaj.tjsp.jus.br/cjsg/consultaCompleta.do?f=1), i.e. the *Tribunal de Justiça (TJ) of São Paulo*. This is a **state court** that hears civil, family and criminal cases and functions mostly as a court of appeals.
  - *"The highest court of a state judicial system is the Court of Justice (Portuguese: Tribunal de Justiça). Each Brazilian state has only one Court of Justice, headquartered in the State's capital, functioning mostly as an appellate court. Second instance judgments are usually rendered by three judges, called desembargadores, however in specific cases the decision may be made by a single Judge. Large courts are usually divided into different sections, specialized by subject matter"* [Wikipedia](https://en.wikipedia.org/wiki/Judiciary_of_Brazil)
  - We have only *cjpg*, i.e. the consultation of **first-instance judges**: the script basically downloads data from the [database of sentences](http://esaj.tjsp.jus.br/cjpg/), so our texts are first-instance rulings rather than appellate decisions.
  
  
  
## BrCAD-5
- The other dataset, [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5), comes from the Brazilian Federal Small Claims Courts (FSCC) within the jurisdiction of the 5th Regional **Federal Court** (TRF5).
  - TRF5 covers Northeast Brazil, see [map](https://pt.wikipedia.org/wiki/Tribunais_Regionais_Federais). São Paulo lies outside this region.
 

## Conclusion
- Since São Paulo is not part of the TRF5 region, and TJSP is a state court while TRF5 is a federal one, the data that we scraped with the [R script](https://github.com/jjesusfilho) cannot, as far as I understand, be the same as the data from [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5).
- A quick comparison below does not reveal any overlap.

In [1]:
import os
import pandas as pd
from pathlib import Path

# 2. Loading the data

- Brazilian court decisions scraped by us
- Only the first two files are loaded, because the full dataset is too large to hold in memory

In [2]:
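# Folder with the scraped TJSP decisions (JSON Lines files)
_path = Path('/Users/vetonmatoshi/Library/CloudStorage/OneDrive-BernerFachhochschule/Datensaetze_Pretraining/brazilian_caselaw/results_as_json')

# Read only the first two files; the full set is too large to load at once
all_data = list()
for p in _path.glob('**/*'):
    if not p.is_file():
        continue  # the recursive glob also matches directories
    if len(all_data) == 2:
        break
    all_data.append(pd.read_json(p, lines=True))

all_data = pd.concat(all_data)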


In [3]:
all_data.shape

(2880227, 13)
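
Two files already give about 2.9 million rows, so scanning the whole dataset probably requires streaming it instead of concatenating DataFrames. A minimal sketch using only the standard library, assuming the files are JSON Lines with one decision object per line (which `lines=True` suggests). The helper name `iter_field` and the demo records are hypothetical:

```python
import json
import tempfile
from pathlib import Path


def iter_field(root, field):
    """Yield `field` from every JSON Lines record under `root`, one at a time."""
    for path in sorted(Path(root).glob('**/*')):
        if not path.is_file():
            continue  # skip sub-directories matched by the recursive glob
        with path.open(encoding='utf-8') as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # ignore blank lines
                record = json.loads(line)
                if field in record:
                    yield record[field]


# Tiny demo on a temporary directory with made-up records
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'part_0.json'
    demo_file.write_text(
        '{"julgado": "texto A"}\n{"julgado": "texto B"}\n', encoding='utf-8'
    )
    demo_texts = list(iter_field(tmp, 'julgado'))

print(demo_texts)
```

Because `iter_field` is a generator, only one line is held in memory at a time, so it could be pointed at the full `results_as_json` folder.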

In [4]:
all_data.head()

| | processo | pagina | hora_coleta | duplicado | classe | assunto | magistrado | comarca | foro | vara | disponibilizacao | julgado | cd_doc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.007337e+19 | 1 | | False | Procedimento Comum Cível | Interpretação / Revisão de Contrato | Lincoln Antônio Andrade de Moura | Guarulhos | Foro de Guarulhos | 10ª Vara Cível | 16454 | TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COM... | 68000780E0000-224-PG5GRU-19843231 |
| 1 | 4.023716e+19 | 1 | | False | Procedimento Comum Cível | Acidente de Trânsito | Lincoln Antônio Andrade de Moura | Guarulhos | Foro de Guarulhos | 10ª Vara Cível | 16454 | TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COM... | 680005FQG0000-224-PG5GRU-19843173 |
| 2 | 4.019649e+19 | 1 | | False | Procedimento Comum Cível | Indenização por Dano Moral | Lincoln Antônio Andrade de Moura | Guarulhos | Foro de Guarulhos | 10ª Vara Cível | 16454 | TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COM... | 680002DH00000-224-PG5GRU-19843122 |
| 3 | 4.00551e+19 | 1 | | False | Procedimento Comum Cível | Ato / Negócio Jurídico | Lincoln Antônio Andrade de Moura | Guarulhos | Foro de Guarulhos | 10ª Vara Cível | 16454 | TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COM... | 680000HYK0000-224-PG5GRU-19843080 |
| 4 | 4.033468e+19 | 1 | | False | Procedimento Comum Cível | Direitos / Deveres do Condômino | Lincoln Antônio Andrade de Moura | Guarulhos | Foro de Guarulhos | 10ª Vara Cível | 16454 | TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COM... | 680006EPW0000-224-PG5GRU-19843029 |
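
Note that `processo` was parsed as a float (`1.007337e+19`). A float64 cannot hold a 20-digit case number exactly, so this column cannot be matched against `case_number` in BrCAD-5 as it stands. A small illustration with a made-up 20-digit number (not taken from the data):

```python
# Hypothetical 20-digit case number, digits only (not from the dataset)
original = 10073371220168260225

# float64 has a 53-bit mantissa; between 2**63 and 2**64 (about 0.92e19 to 1.84e19)
# only multiples of 2048 are representable, so the trailing digits are rounded away
as_float = float(original)
round_trip = int(as_float)

print(original)
print(round_trip)
print(original == round_trip)
```

Reading the column as a string should avoid the rounding. `pd.read_json` has a `dtype` parameter that may help here (for instance `dtype={'processo': str}`), but this is untested on these files.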


- Brazilian court decisions from [BrCAD-5](https://www.kaggle.com/datasets/eliasjacob/brcad5)

In [5]:
os.listdir('./archive/')

['human_experts_data.csv',
 'pretrained_models',
 'language_modeling_texts.parquet',
 'valid_en.parquet',
 'expert_label_identification.csv',
 'humans_en.parquet',
 'train_en.parquet',
 'test_en.parquet']

In [6]:
brcad = pd.read_parquet('./archive/language_modeling_texts.parquet', engine='pyarrow')


In [7]:
brcad.shape

(3128292, 2)

In [8]:
brcad.head()

| | case_number | full_text_first_instance_court_ruling |
|---|---|---|
| 0 | 0519514-80.2010.4.05.8300 | SENTENÇA Homologo o acordo celebrado pelas par... |
| 1 | 0502940-39.2011.4.05.8302 | SENTENÇA A parte autora pleiteia a revisão da ... |
| 2 | 0514951-09.2011.4.05.8300 | SENTENÇA Cuida a hipótese de ação especial cív... |
| 3 | 0508179-48.2016.4.05.8302 | SENTENÇA I – RELATÓRIO Trata-se de ação especi... |
| 4 | 0501738-37.2005.4.05.8302 | SENTENÇA Trata-se de ação proposta em face do ... |
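
The `case_number` values look like the unified CNJ numbering layout `NNNNNNN-DD.AAAA.J.TR.OOOO`. Assuming that layout, `J` identifies the branch of the judiciary (to my understanding `4` for federal and `8` for state courts) and `TR` identifies the court within that branch. The rows shown here all carry `4.05`, which would be consistent with TRF5. A minimal sketch for splitting a number into its parts (the function name `parse_cnj` and the group names are hypothetical):

```python
import re

# NNNNNNN-DD.AAAA.J.TR.OOOO (assumed CNJ unified numbering layout)
CNJ_PATTERN = re.compile(
    r'^(?P<sequence>\d{7})-(?P<check>\d{2})\.(?P<year>\d{4})'
    r'\.(?P<branch>\d)\.(?P<court>\d{2})\.(?P<origin>\d{4})$'
)


def parse_cnj(case_number):
    """Return the parts of a case number as a dict, or None if it does not match."""
    match = CNJ_PATTERN.match(case_number.strip())
    return match.groupdict() if match else None


parts = parse_cnj('0519514-80.2010.4.05.8300')
print(parts['branch'], parts['court'])  # prints: 4 05
```

Applied to the whole column, e.g. `brcad.case_number.map(parse_cnj)`, this would allow counting how many rows fall under branch `4`, court `05`.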


# 3. Compare

In [9]:
set(all_data.julgado).intersection(set(brcad.full_text_first_instance_court_ruling))

set()

- No overlap: none of the ruling texts in the loaded TJSP sample exactly matches a text in BrCAD-5
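
The exact-string intersection would miss rulings that differ only in whitespace, casing or Unicode composition. A slightly more tolerant check is sketched here, assuming such superficial differences are the main risk. `fingerprint` and `overlap` are hypothetical helpers and the sample strings are made up:

```python
import hashlib
import re
import unicodedata


def fingerprint(text):
    """Hash a text after normalizing Unicode form, whitespace and case."""
    normalized = unicodedata.normalize('NFKC', text)
    normalized = re.sub(r'\s+', ' ', normalized).strip().casefold()
    return hashlib.sha1(normalized.encode('utf-8')).hexdigest()


def overlap(texts_a, texts_b):
    """Return the fingerprints that occur in both collections of texts."""
    prints_a = {fingerprint(t) for t in texts_a if isinstance(t, str)}
    prints_b = {fingerprint(t) for t in texts_b if isinstance(t, str)}
    return prints_a & prints_b


sample_a = ['SENTENÇA  Homologo o acordo.', 'Outra decisão']
sample_b = ['sentença homologo o acordo.', 'Texto diferente']
common = overlap(sample_a, sample_b)
print(len(common))  # prints: 1
```

With the notebook's data the call would be `overlap(all_data.julgado, brcad.full_text_first_instance_court_ruling)`. The `isinstance` guard skips missing values, and storing hashes instead of full texts keeps the two sets small.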