# Install dependencies

* IMPORTANT: You only need to run this once, when you first spin up your Docker notebook server.
* IMPORTANT: You need to restart the kernel after you run this.
* TODO: These dependencies should eventually be moved into a Dockerfile.

* INFO: Read more about [what the hell is pip](https://pip.pypa.io/en/stable/) and [why installing dependencies via pip in a notebook is a stupid idea, but OK since we are prototyping here](https://towardsdev.com/pip-install-from-jupyter-notebook-485c218b50b).

In [None]:
!pip install pandas
!pip install datasets
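
After restarting the kernel, it can help to confirm that the packages are actually visible to the notebook's interpreter. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical and not part of the original notebook.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The two packages installed by the cell above.
for name in ("pandas", "datasets"):
    print(name, installed_version(name) or "NOT INSTALLED")
```

If either line prints `NOT INSTALLED`, the kernel was probably not restarted or pip installed into a different environment.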

# Import libraries

Also set up the data directory.

In [10]:
import os
import re
import glob
import shutil
import string
import pathlib

# set the data directory
data_dir = os.path.abspath(os.path.join(os.getcwd(),'..','data'))

# give matplotlib a folder to save its configs in
os.environ['MPLCONFIGDIR'] = os.path.join(data_dir,'plt_configs')
import matplotlib.pyplot as plt

# give huggingface a folder to save its cache in too
# (only needed if you are using a huggingface dataset)
os.environ['HF_HOME'] = os.path.join(data_dir,'hf_cache')
import datasets

import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras.layers import TextVectorization

In [22]:
# Set up directories, creating them if they do not exist within the data folder.

# We want our data directory to contain two folders, one for each class we are predicting (the "yes" and "no" judgment labels).

filings_dir = os.path.join(data_dir,'filings')
os.makedirs(filings_dir, exist_ok=True)

yes_dir = os.path.join(filings_dir,'yes')
os.makedirs(yes_dir, exist_ok=True)

no_dir = os.path.join(filings_dir,'no')
os.makedirs(no_dir, exist_ok=True)

# Download the data

* [huggingface datasets quickstart](https://huggingface.co/docs/datasets/quickstart)
* [joelito/brazilian_court_decisions](https://huggingface.co/datasets/joelito/brazilian_court_decisions)

In [2]:
dataset = datasets.load_dataset("joelito/brazilian_court_decisions")

Using custom data configuration joelito--brazilian_court_decisions-cc57c1c8c69e3b04
Found cached dataset json (/tf/data/hf_cache/datasets/joelito___json/joelito--brazilian_court_decisions-cc57c1c8c69e3b04/0.0.0/e6070c77f18f01a5ad4551a8b7edfba20b8438b7cad4d94e6ad9378022ce4aab)



# Inspect the data 

* [Know your dataset](https://huggingface.co/docs/datasets/access)

In [13]:
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

complete_dataset = pd.concat([train_df,test_df])
complete_dataset.head()

| | process_number | orgao_julgador | publish_date | judge_relator | ementa_text | decision_description | judgment_text | judgment_label | unanimity_text | unanimity_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0800304-08.2018.8.02.0000 | Tribunal Pleno | 12/03/2019 | Des. João Luiz Azevedo Lessa | DIREITO PENAL E PROCESSUAL PENAL. REVISÃO CRIM... | DIREITO PENAL E PROCESSUAL PENAL. REVISÃO CRIM... | REVISÃO CRIMINAL JULGADA PARCIALMENTE PROCEDENTE | partial | DECISÃO UNÂNIME | unanimity |
| 1 | 0700071-82.2015.8.02.0040 | 3ª Câmara Cível | 09/02/2019 | Des. Celyrio Adamastor Tenório Accioly | APELAÇÃO CÍVEL. MANDADO DE SEGURANÇA. SERVIDOR... | APELAÇÃO CÍVEL. MANDADO DE SEGURANÇA. SERVIDOR... | RECURSO CONHECIDO E NÃO PROVIDO | no | -2 | not_determined |
| 2 | 0801729-70.2018.8.02.0000 | 2ª Câmara Cível | 25/02/2019 | Des. Pedro Augusto Mendonça de Araújo | PROCESSUAL CIVIL. EMBARGOS DE DECLARAÇÃO EM AG... | PROCESSUAL CIVIL. EMBARGOS DE DECLARAÇÃO EM AG... | RECURSO CONHECIDO E REJEITADO | no | DECISÃO UNÂNIME | unanimity |
| 3 | 0804894-62.2017.8.02.0000 | 2ª Câmara Cível | 19/03/2019 | Des. Klever Rêgo Loureiro | AGRAVO DE INSTRUMENTO. AUXÍLIO DOENÇA. SUSPENS... | AGRAVO DE INSTRUMENTO. AUXÍLIO DOENÇA. SUSPENS... | RECURSO CONHECIDO E PROVIDO | yes | DECISÃO UNÂNIME | unanimity |
| 4 | 0702761-41.2014.8.02.0001 | 1ª Câmara Cível | 13/02/2019 | Des. Fábio José Bittencourt Araújo | DIREITO DO CONSUMIDOR. APELAÇÃO INTERPOSTA EM ... | DIREITO DO CONSUMIDOR. APELAÇÃO INTERPOSTA EM ... | APELO CONHECIDO E PROVIDO EM PARTE | partial | UNANIMIDADE | unanimity |


In [29]:
yes_df = complete_dataset[complete_dataset['judgment_label'] == 'yes']
no_df = complete_dataset[complete_dataset['judgment_label'] == 'no']

# TODO what about partials?
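
Before deciding what to do with the `partial` rows, it may be worth checking how many rows the yes/no filter actually discards. The sketch below uses a small made-up dataframe (`demo_df`) as a stand-in for `complete_dataset`; only the `judgment_label` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for complete_dataset; only the label column matters here.
demo_df = pd.DataFrame(
    {"judgment_label": ["yes", "no", "partial", "no", "partial"]}
)

# How many rows carry each label?
label_counts = demo_df["judgment_label"].value_counts()

# Rows kept by the yes/no filter, and rows silently dropped by it.
kept = demo_df[demo_df["judgment_label"].isin(["yes", "no"])]
dropped = len(demo_df) - len(kept)

print(label_counts)
print("rows dropped by the yes/no filter:", dropped)
```

Running the same two calls on `complete_dataset` would show whether the dropped share is large enough to justify a third class.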

Write the text from the dataframes into a folder/file structure with one folder per class: the "yes" folder contains the texts of the "yes" decisions and the "no" folder contains the texts of the "no" decisions.


In [24]:
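# Write each "yes" text to its own file: 0.txt, 1.txt, ...
# The texts are Portuguese, so set the encoding explicitly.
for i, text in enumerate(yes_df['ementa_text']):
    with open(os.path.join(yes_dir, str(i)+'.txt'), 'w', encoding='utf-8') as f:
        f.write(text)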

In [25]:
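# Write each "no" text to its own file: 0.txt, 1.txt, ...
# The texts are Portuguese, so set the encoding explicitly.
for i, text in enumerate(no_df['ementa_text']):
    with open(os.path.join(no_dir, str(i)+'.txt'), 'w', encoding='utf-8') as f:
        f.write(text)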
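
The two loops above differ only in the source dataframe and the target directory, so they could be folded into one helper. This is a sketch, not the notebook's own code: `write_class_files` is a hypothetical name, and the demo writes made-up texts to a temporary directory rather than to `data_dir`.

```python
import os
import tempfile


def write_class_files(texts, class_dir):
    """Write each text to <class_dir>/<index>.txt and return how many were written."""
    os.makedirs(class_dir, exist_ok=True)
    count = 0
    for i, text in enumerate(texts):
        path = os.path.join(class_dir, f"{i}.txt")
        # Explicit UTF-8 so accented Portuguese characters survive on any platform.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        count += 1
    return count


# Demo with made-up texts; demo_root stands in for data_dir/filings.
demo_root = tempfile.mkdtemp()
n_yes = write_class_files(
    ["RECURSO CONHECIDO E PROVIDO."], os.path.join(demo_root, "yes")
)
n_no = write_class_files(
    ["RECURSO CONHECIDO E NÃO PROVIDO.", "APELO IMPROVIDO."],
    os.path.join(demo_root, "no"),
)
print(sorted(os.listdir(demo_root)), n_yes, n_no)
```

In the notebook this would be called as `write_class_files(yes_df['ementa_text'], yes_dir)` and `write_class_files(no_df['ementa_text'], no_dir)`.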

# Load the dataset for training
* [Load text](https://www.tensorflow.org/tutorials/load_data/text)

In [28]:
batch_size = 32
seed = 42

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    pathlib.Path(filings_dir),
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed)

Found 2869 files belonging to 2 classes.
Using 2296 files for training.


In [31]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(10):
    print("Filing: ", text_batch.numpy()[i])
    print("Label:", label_batch.numpy()[i])

Filing:  b'AGRAVO DE INSTRUMENTO. A\xc3\x87\xc3\x83O DE COBRAN\xc3\x87A DE HONOR\xc3\x81RIOS ADVOCAT\xc3\x8dCIOS CONTRATUAIS. VERBA FUNDEB (ANTIGO FUNDEF). VINCULA\xc3\x87\xc3\x83O CONSTITUCIONAL. IMPRESCIND\xc3\x8dVEL A SUA APLICA\xc3\x87\xc3\x83O TOTAL PARA O DESENVOLVIMENTO E MANUTEN\xc3\x87\xc3\x83O DO ENSINO, EFETIVANDO O DIREITO CONSTITUCIONAL \xc3\x80 EDUCA\xc3\x87\xc3\x83O DE QUALIDADE. TESE FIXADA NO SUPERIOR TRIBUNAL DE JUSTI\xc3\x87A PELA IMPOSSIBILIDADE DA RETEN\xc3\x87\xc3\x83O. DEMAIS QUEST\xc3\x95ES. NECESSIDADE DE DILA\xc3\x87\xc3\x83O PROBAT\xc3\x93RIA. RECURSO CONHECIDO E IMPROVIDO. DECIS\xc3\x83O UN\xc3\x82NIME.'
Label: 0
Filing:  b'AGRAVO DE INSTRUMENTO. BUSCA E APREENS\xc3\x83O. DECIS\xc3\x83O QUE DENEGOU A ORDEM. PREENCHIMENTO DOS REQUISITOS PREVISTOS NO DECRETO-LEI N\xc2\xba 911/69. RECURSO CONHECIDO E PROVIDO.'
Label: 1
Filing:  b'AGRAVO DE INSTRUMENTO. A\xc3\x87\xc3\x83O DE BUSCA E APREENS\xc3\x83O. ALEGA\xc3\x87\xc3\x83O DE OFENSA AOS PRINC\xc3\x8dPIOS DA RAZO

In [32]:
for i, label in enumerate(raw_train_ds.class_names):
  print("Label", i, "corresponds to", label)

Label 0 corresponds to no
Label 1 corresponds to yes


In [35]:
# Create a validation set.
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    pathlib.Path(filings_dir),
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',
    seed=seed)

Found 2869 files belonging to 2 classes.
Using 573 files for validation.


# Prepare the dataset for training
* [Prepare the dataset for training](https://www.tensorflow.org/tutorials/load_data/text#prepare_the_dataset_for_training)

In [36]:
# TODO Vectorize etc...
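
One way to fill in this step is the `TextVectorization` layer already imported at the top of the notebook. The sketch below is only an illustration under stated assumptions: `max_tokens` and `sequence_length` are guessed values, the default standardization is used, and a few made-up sentences stand in for the text half of `raw_train_ds`.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

# Assumed hyperparameters; tune them for the real corpus.
max_tokens = 10000
sequence_length = 250

vectorize_layer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=sequence_length,
)

# Made-up stand-in for the text half of raw_train_ds.
sample_texts = tf.data.Dataset.from_tensor_slices(
    [
        "RECURSO CONHECIDO E PROVIDO.",
        "RECURSO CONHECIDO E IMPROVIDO.",
        "AGRAVO DE INSTRUMENTO. RECURSO CONHECIDO.",
    ]
).batch(2)

# Build the vocabulary from the training text only.
vectorize_layer.adapt(sample_texts)

# Each string becomes a fixed-length row of integer token ids (0 is padding).
vectorized = vectorize_layer(tf.constant(["RECURSO CONHECIDO E PROVIDO."]))
vocabulary = vectorize_layer.get_vocabulary()
print(vectorized.shape)
print(vocabulary[:5])
```

With the real data, the layer would be adapted on `raw_train_ds.map(lambda text, label: text)` so that the validation texts do not leak into the vocabulary.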

# Load data and train model
* [tf text in 5 mins](https://codesearchonline.com/natural-language-processing-with-tensorflow-cheat-sheet/)

In [None]:
# TODO add code here
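
As a placeholder for this step, here is a minimal sketch of a binary text classifier in the spirit of the TensorFlow text tutorials. It is not the notebook author's model: the architecture, hyperparameters, and the four made-up training examples are all assumptions chosen only to keep the example small and runnable.

```python
import tensorflow as tf
from tensorflow.keras import layers, losses

# Made-up examples; label 1 = "yes", label 0 = "no", as in class_names above.
texts = [
    "RECURSO CONHECIDO E PROVIDO.",
    "RECURSO CONHECIDO E IMPROVIDO.",
    "APELO CONHECIDO E PROVIDO.",
    "RECURSO CONHECIDO E REJEITADO.",
]
labels = [1, 0, 1, 0]

# Assumed hyperparameters.
max_tokens = 1000
sequence_length = 16

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=sequence_length,
)
vectorize_layer.adapt(tf.data.Dataset.from_tensor_slices(texts).batch(2))

# Turn (text, label) batches into (token ids, label) batches.
train_ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(2)
train_ds = train_ds.map(lambda text, label: (vectorize_layer(text), label))

# Small embedding + average pooling model with a sigmoid output.
model = tf.keras.Sequential(
    [
        layers.Embedding(input_dim=max_tokens, output_dim=8),
        layers.GlobalAveragePooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ]
)
model.compile(
    optimizer="adam",
    loss=losses.BinaryCrossentropy(),
    metrics=[tf.keras.metrics.BinaryAccuracy()],
)

history = model.fit(train_ds, epochs=1, verbose=0)

# Predicted probability of the "yes" class for each text.
predictions = model.predict(vectorize_layer(tf.constant(texts)), verbose=0)
print(predictions.shape)
```

With the real data, `train_ds` would come from `raw_train_ds` and `raw_val_ds` would be passed as `validation_data` after the same vectorization step.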

# Save model and test prediction

In [None]:
# TODO add code here