# 02 - Exploratory Data Analysis
This notebook performs exploratory data analysis on the preprocessed data.

## 1. Import Required Libraries

In [3]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)
plt.style.use('ggplot')

## 2. Load Processed Data

In [5]:
# Load processed data
df = pd.read_json('../data/processed/data_processed_test.json', lines=True)

# Display basic info
print("DataFrame Info:")
df.info()
print("\nFirst few rows:")
df.head()

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 28 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   processo              11 non-null     float64
 1   Classe 1ª Instância   11 non-null     object 
 2   Assunto 1ª Instância  11 non-null     object 
 3   Comarca 1ª Instância  11 non-null     object 
 4   foro                  11 non-null     object 
 5   vara                  11 non-null     object 
 6   julgado               11 non-null     object 
 7   cod_doc               11 non-null     object 
 8   Instância             11 non-null     object 
 9   sentença              11 non-null     object 
 10  Comarca               11 non-null     object 
 11  Requerinte            11 non-null     object 
 12  Réu                   11 non-null     object 
 13  Assunto 2ª Instância  0 non-null      float64
 14  Classe 2ª Instância   0 non-null      float64
 15  cd_proces

Unnamed: 0,processo,Classe 1ª Instância,Assunto 1ª Instância,Comarca 1ª Instância,foro,vara,julgado,cod_doc,Instância,sentença,Comarca,Requerinte,Réu,Assunto 2ª Instância,Classe 2ª Instância,cd_processo,data,dispositivo,cdacordao,Comarca 2ª Instância,orgao_julgador,ementa,Recurso 2º Grau,cd_doc,Assunto 2º Instância,latitude,longitude,processed_text
0,3463420000000000.0,Cumprimento de Sentença contra a Fazenda Pública,PROFISSIONAIS DE APOIO,Taquaritinga,Foro de Taquaritinga,Juizado Especial Cível e Criminal,"TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COMARCA de Taquaritinga Foro de Taquaritinga Juizado Especial Cível e Criminal Rua Visconde do Rio Branco, 71, Taquaritinga-SP - cep 15900-000 Horário de...",H70005JZY0000-619-PG5ARQA-45511889,1º Instância,procedente,taquaritinga,Pessoa Física,Estado,,,,,,,,,,,,,-21.425208,-48.537399,"[Paulo, taquaritinga, taquaritinga, cível, criminal, rua, visconder, Rio, branco, taquaritinga, sp, cep, horário, atendimento, Min, Min, laudar, digital, classe, assunto, cumprimento, fazendo, púb..."
1,1.001941e+19,Mandado de Segurança Cível,Estabelecimentos de Ensino,Presidente Epitácio,Foro de Presidente Epitácio,1ª Vara,"TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COMARCA de Presidente Epitácio Foro de Presidente Epitácio 1ª Vara Av.Presidente Vargas 1-31, Presidente Epitacio - SP - cep 19470-000 Horário de Atendim...",DD0004XTS0000-481-PG5PP-84166446,1º Instância,procedente,presidente epitacio,Pessoa Física,Escola Pública,,,,,,,,,,,,,-21.765074,-52.11114,"[Paulo, presidente, epitácio, presidente, epitácio, av, presidente, Vargas, presidente, epitacio, sp, cep, horário, atendimento, Min, Min, laudar, digital, classe, assunto, segurança, cível, ensin..."
2,1.003939e+19,Procedimento do Juizado Especial Cível,Prestação de Serviços,Cerqueira César,Foro de Cerqueira César,Juizado Especial Cível e Criminal,"TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COMARCA de Cerqueira César Foro de Cerqueira César Juizado Especial Cível e Criminal Rua Olimpio Pavan nº 355, Cerqueira Cesar - SP - cep 18760-000 10039...",210004Q3A0000-136-PG5ARCT-114815115,1º Instância,procedente,cerqueira cesar,Pessoa Física,Pessoa Física,,,,,,,,,,,,,-23.035314,-49.165052,"[Paulo, Cerqueira, césar, cerqueirar, césar, cível, criminal, rua, olimpio, pavan, cerqueira, cesar, sp, cep, laudar, classe, assunto, procedimento, cível, prestação, serviço, unesvi, união, Ensin..."
3,1.007129e+19,Mandado de Segurança Cível,Estabelecimentos de Ensino,Lins,Foro de Lins,2ª Vara Cível,"TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COMARCA de Lins - Foro de Lins - 2ª Vara Cível Rua Gil Pimentel Moura, 51, Centro CEP: 16400-920 - Lins - SP Telefone: (14) 3511-1525 - E-mail: Lins2cv@t...",8Y0003PG10000-322-PG5ARCT-114470170,1º Instância,parcial,lins,Pessoa Física,Pessoa Física,,,,,,,,,,,,,-21.649421,-49.682866,"[Paulo, lins, Lins, cível, rua, Gil, pimentel, Moura, Centro, cep, lim, sp, telefone, Mail, Lins, cv, jus, br, horário, atendimento, Min, Min, laudar, digital, classe, assunto, segurança, cível, E..."
4,1.000134e+19,Procedimento Comum Cível,Matrícula - Ausência de Pré-Requisito,Rosana,Foro de Rosana,Vara Única,"TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COMARCA de Rosana Foro de Rosana Vara Única Rua Curimbatá, 788/802 QD 12 - D. Primavera, Primavera - SP - cep 19274-000 Horário de Atendimento ao Público...",EB00018UI0000-515-PG5SJCA-131051340,1º Instância,procedente,rosana,Pessoa Física,Estado,,,,,,,,,,,,,-22.488279,-52.836266,"[Paulo, rosano, rosano, único, rua, curimbatá, qd, primavera, primavera, sp, cep, horário, atendimento, Min, Min, laudar, digital, classe, assunto, procedimento, comum, cível, acesso, matrículo, a..."


In [6]:
pd.set_option('display.max_columns', None)
df.describe(include='all')

Unnamed: 0,processo,Classe 1ª Instância,Assunto 1ª Instância,Comarca 1ª Instância,foro,vara,julgado,cod_doc,Instância,sentença,Comarca,Requerinte,Réu,Assunto 2ª Instância,Classe 2ª Instância,cd_processo,data,dispositivo,cdacordao,Comarca 2ª Instância,orgao_julgador,ementa,Recurso 2º Grau,cd_doc,Assunto 2º Instância,latitude,longitude,processed_text
count,11.0,11,11,11,11,11,11,11,11,11,11,11,11,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,11.0,11.0,11
unique,,5,7,11,11,7,11,11,1,3,11,1,3,,,,,,,,,,,,,,,11
top,,Procedimento Comum Cível,PROFISSIONAIS DE APOIO,Taquaritinga,Foro de Taquaritinga,Juizado Especial Cível e Criminal,"TRIBUNAL DE JUSTIÇA DO ESTADO DE SÃO PAULO COMARCA de Taquaritinga Foro de Taquaritinga Juizado Especial Cível e Criminal Rua Visconde do Rio Branco, 71, Taquaritinga-SP - cep 15900-000 Horário de...",H70005JZY0000-619-PG5ARQA-45511889,1º Instância,procedente,taquaritinga,Pessoa Física,Estado,,,,,,,,,,,,,,,"[Paulo, taquaritinga, taquaritinga, cível, criminal, rua, visconder, Rio, branco, taquaritinga, sp, cep, horário, atendimento, Min, Min, laudar, digital, classe, assunto, cumprimento, fazendo, púb..."
freq,,6,2,1,1,3,1,1,11,7,1,11,7,,,,,,,,,,,,,,,1
mean,8.20039e+18,,,,,,,,,,,,,,,,,,,,,,,,,-22.125617,-49.032649,
std,4.053214e+18,,,,,,,,,,,,,,,,,,,,,,,,,0.61699,1.899863,
min,1442920000000000.0,,,,,,,,,,,,,,,,,,,,,,,,,-23.035314,-52.836266,
25%,1.000274e+19,,,,,,,,,,,,,,,,,,,,,,,,,-22.565849,-49.423959,
50%,1.001359e+19,,,,,,,,,,,,,,,,,,,,,,,,,-21.855157,-48.537399,
75%,1.001963e+19,,,,,,,,,,,,,,,,,,,,,,,,,-21.707247,-47.678201,


## 3. EDA

In [7]:
def calculate_basic_stats(df):
    """Calculate and display basic statistics about the dataset."""
    stats = {
        'Total Documents': len(df),
        'Average Words per Document': df['processed_text'].apply(len).mean(),
        'Unique Words': len(set([word for doc in df['processed_text'] for word in doc])),
        'Most Common Words': Counter([word for doc in df['processed_text'] for word in doc]).most_common(10),
        'Missing Values': df.isnull().sum()
    }
    return stats

# Display basic statistics
stats = calculate_basic_stats(df)
for key, value in stats.items():
    print(f"\n{key}:")
    print(value)


Total Documents:
11

Average Words per Document:
623.0

Unique Words:
2025

Most Common Words:
[('fls', 94), ('escolar', 72), ('profissional', 65), ('apoio', 58), ('educação', 58), ('atendimento', 53), ('especializar', 53), ('deficiênciar', 51), ('ensino', 49), ('necessidade', 49)]

Missing Values:
processo                 0
Classe 1ª Instância      0
Assunto 1ª Instância     0
Comarca 1ª Instância     0
foro                     0
vara                     0
julgado                  0
cod_doc                  0
Instância                0
sentença                 0
Comarca                  0
Requerinte               0
Réu                      0
Assunto 2ª Instância    11
Classe 2ª Instância     11
cd_processo             11
data                    11
dispositivo             11
cdacordao               11
Comarca 2ª Instância    11
orgao_julgador          11
ementa                  11
Recurso 2º Grau         11
cd_doc                  11
Assunto 2º Instância    11
latitude                

In [9]:
# Function to calculate descriptive statistics
def calculate_frequencies(df, column):
    """Calculates the absolute and relative frequency of a column"""
    abs_freq = df[column].value_counts()
    rel_freq = df[column].value_counts(normalize=True) * 100  # Converting to percentage
    freq_table = pd.DataFrame({'Absolute Frequency': abs_freq, 'Relative Frequency (%)': rel_freq})
    return freq_table

# Example for the "sentence" column
sentence_freq_table = calculate_frequencies(df, "sentença")

# Example for "Class 1st Instance"
class_freq_table = calculate_frequencies(df, "Classe 1ª Instância")

This section presents the **Exploratory Data Analysis (EDA)** of the set of court cases. It aims to understand the dataset's structure, identify patterns, and check statistical distributions, using descriptive statistics, visualizations, and an inspection of missing values.

#### **3.1. Database Structure**

The analyzed dataset contains **4,259 records** and **27 variables**, covering procedural information such as **class, subject, comarca (judicial district), forum, court division (vara), decision, instance, plaintiff, and defendant**. The cell outputs shown in this notebook come from the 11-record test file loaded in Section 2, so their counts and percentages differ from the figures quoted in the text.

The initial inspection revealed **missing values**, mainly in variables associated with the **2nd instance**. This indicates that most of the analyzed cases belong to the **1st instance**. The variables with the highest completion rate include `processo` (100%), `Instância` (100%), `sentença` (86.2%) and `Requerinte` (89.4%). In contrast, the variables `Assunto 2ª Instância`, `Classe 2ª Instância`, and `Recurso 2º Grau` show low completion rates, suggesting that few cases in the dataset reach the appeal stage.
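
The completion rates quoted above can be derived directly from the share of non-null values per column. The following is a minimal sketch, assuming a pandas DataFrame; the helper name `completion_rates` and the three-row `sample` frame are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def completion_rates(frame: pd.DataFrame) -> pd.Series:
    """Return the percentage of non-null values per column, highest first."""
    return (frame.notna().mean() * 100).sort_values(ascending=False)


# Hypothetical three-row sample, used only to illustrate the calculation
sample = pd.DataFrame({
    "processo": [1, 2, 3],
    "sentença": ["procedente", None, "parcial"],
    "Recurso 2º Grau": [None, None, None],
})

rates = completion_rates(sample)
print(rates.round(1))
```

In the notebook itself, the same helper could be called as `completion_rates(df)` to rank every column by how complete it is.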


#### **3.2. Distribution of Procedural Classes in the 1st Instance**

Procedural classes are unevenly distributed, with two categories accounting for most cases (Figure 1): **Mandado de Segurança Cível** (civil writ of mandamus, 44.0%) and **Procedimento Comum Cível** (ordinary civil procedure, 35.1%).

In [10]:
class_freq_table

| Classe 1ª Instância | Absolute Frequency | Relative Frequency (%) |
| --- | --- | --- |
| Procedimento Comum Cível | 6 | 54.545455 |
| Mandado de Segurança Cível | 2 | 18.181818 |
| Cumprimento de Sentença contra a Fazenda Pública | 1 | 9.090909 |
| Procedimento do Juizado Especial Cível | 1 | 9.090909 |
| Procedimento do Juizado Especial da Fazenda Pública | 1 | 9.090909 |


In [11]:
# Visualization 1: Distribution of Classes of Processes in the 1st Instance
fig1 = px.histogram(df,
                    x="Classe 1ª Instância",
                    title="Distribution of Classes of Processes in the 1st Instance",
                    labels={"Classe 1ª Instância": "Process Class"}
                    )
fig1.update_xaxes(categoryorder='total descending')
fig1.show()

#### **3.3. Distribution of Legal Subjects in the 1st Instance**

Figure 2 shows the distribution of the most frequent legal subjects:
- **Education Establishments** (44.1%) and **Basic and Secondary Education** (35.1%) are the most recurring themes.
- **Constitutional Guarantees** (12.9%), **People with Disabilities** (4.8%) and **Pre-school Care** (1.3%) are also represented.

The high volume of actions related to education and social inclusion suggests that the judiciary plays a relevant role in guaranteeing fundamental rights.
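
The subject percentages can be tabulated in the same way as the class frequencies. The sketch below is illustrative only: the helper `top_categories` and the four-row `sample` frame are hypothetical, and the column name `Assunto 1ª Instância` is taken from the dataset loaded in Section 2.

```python
import pandas as pd


def top_categories(frame: pd.DataFrame, column: str, n: int = 5) -> pd.DataFrame:
    """Return the n most frequent values of a column with absolute and relative counts."""
    counts = frame[column].value_counts()
    table = pd.DataFrame({
        "Absolute Frequency": counts,
        "Relative Frequency (%)": (counts / counts.sum() * 100).round(1),
    })
    return table.head(n)


# Hypothetical sample, used only to illustrate the calculation
sample = pd.DataFrame({
    "Assunto 1ª Instância": [
        "Estabelecimentos de Ensino",
        "Estabelecimentos de Ensino",
        "Prestação de Serviços",
        "PROFISSIONAIS DE APOIO",
    ]
})

subject_table = top_categories(sample, "Assunto 1ª Instância", n=3)
print(subject_table)
```

Applied to the notebook's `df`, the resulting table could be passed to `px.bar` in the same way as the class and sentence tables.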


In [12]:
# Frequency table of procedural classes (reused by the summary cell below)
df_classes = df["Classe 1ª Instância"].value_counts().reset_index()
df_classes.columns = ["Procedural Class", "Absolute Frequency"]
df_classes["Relative Frequency (%)"] = (df_classes["Absolute Frequency"] / df_classes["Absolute Frequency"].sum()) * 100

# Bar chart labelled with the relative frequency of each class
fig = px.bar(df_classes, x="Procedural Class", y="Absolute Frequency",
             text=df_classes["Relative Frequency (%)"].apply(lambda x: f"{x:.1f}%"),
             title="Distribution of Classes of Processes in the 1st Instance")

fig.update_xaxes(categoryorder='total descending')
fig.update_traces(textposition='outside')
fig.show()


#### **3.4. Distribution of Sentences**

Most cases result in a **procedente** decision, that is, a ruling in favor of the plaintiff (Figure 3).

Distribution of decisions:
- **Procedente** (claim granted): predominant outcome (50.7%).
- **Parcialmente procedente** (partially granted): a significant portion of cases (28.0%).
- **Improcedente** (claim denied): less common, but present (18.5%).
- **Homologação** (judicial ratification): a minority of cases (2.8%).

The high rate of procedência may indicate that the petitions presented have a solid legal basis.


In [13]:
# Creating dataframe for visualization
df_sentenca = df["sentença"].value_counts().reset_index()
df_sentenca.columns = ["Sentence Type", "Absolute Frequency"]
df_sentenca["Relative Frequency (%)"] = (df_sentenca["Absolute Frequency"] / df_sentenca["Absolute Frequency"].sum()) * 100

# Creating bar chart with percentage
fig = px.bar(df_sentenca, x="Sentence Type", y="Absolute Frequency",
             text=df_sentenca["Relative Frequency (%)"].apply(lambda x: f"{x:.1f}%"),
             title="Sentence Distribution (Absolute and Relative Frequencies)")

fig.update_traces(textposition='outside')
fig.show()


In [14]:
sentence_freq_table

| sentença | Absolute Frequency | Relative Frequency (%) |
| --- | --- | --- |
| procedente | 7 | 63.636364 |
| parcial | 3 | 27.272727 |
| improcedente | 1 | 9.090909 |


#### **3.5. Distribution of Processes by Comarca**

The geographical analysis shows that cases are concentrated in a small number of comarcas (judicial districts). Figures 4 and 5 present the overall distribution and the comarcas with the highest volume of lawsuits.

- **São Paulo** (20.5%) has the highest number of cases, followed by **Cabreúva** (9.7%) and **Lençóis Paulista** (4.7%).
- Comarcas such as **Francisco Morato** (4.6%), **Guarulhos** (3.7%) and **Campinas** (2.5%) also show high judicial demand.
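
One way to quantify this concentration is to add a cumulative share to the per-comarca percentages. The sketch below is an assumption-laden illustration: `concentration_table` is a hypothetical helper, and the ten-row `sample` frame uses made-up counts rather than the real data.

```python
import pandas as pd


def concentration_table(frame: pd.DataFrame, column: str = "Comarca", top_n: int = 3) -> pd.DataFrame:
    """Return the top_n categories with their share and cumulative share (in %)."""
    share = frame[column].value_counts(normalize=True) * 100
    table = pd.DataFrame({
        "Share (%)": share,
        "Cumulative Share (%)": share.cumsum(),
    })
    return table.head(top_n).round(1)


# Hypothetical sample with made-up counts, used only to illustrate the calculation
sample = pd.DataFrame({
    "Comarca": ["sao paulo"] * 4 + ["campinas"] * 3 + ["lins"] * 2 + ["rosana"]
})

top_comarcas = concentration_table(sample, top_n=3)
print(top_comarcas)
```

The cumulative column makes it easy to state how many comarcas are needed to cover a given fraction of all cases.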


In [15]:
# Visualization 4: Counting occurrences of each Comarca
df_comarca = df["Comarca"].value_counts().reset_index()
df_comarca.columns = ["Comarca", "Absolute Frequency"]
df_comarca["Relative Frequency (%)"] = (df_comarca["Absolute Frequency"] / df_comarca["Absolute Frequency"].sum()) * 100

fig = px.bar(df_comarca, x="Comarca", y="Absolute Frequency",
             text=df_comarca["Relative Frequency (%)"].apply(lambda x: f"{x:.1f}%"),
             title="Number of Processes by Comarca (Absolute and Relative)")

fig.update_xaxes(categoryorder='total descending')
fig.update_traces(textposition='outside')
fig.show()


#### **3.6. Geospatial Mapping of Lawsuits**

Figure 6 presents a **heatmap** of the distribution of lawsuits in the analyzed territory.

- Regions in **red** indicate higher concentration of lawsuits.
- São Paulo and its metropolitan region have **high lawsuit density**.
- Three intense clusters are observed, centered on **São Paulo**, **Campinas** and **Ribeirão Preto**, in decreasing order of intensity.

These observations are consistent with the expectation that more populous areas of the state have more legal activity.


In [16]:
import folium
from folium.plugins import HeatMap

# Aggregate count of processos per comarca
comarca_counts = df["Comarca"].value_counts().to_dict()

# Define min and max radius sizes
min_radius = 3
max_radius = 30

# Compute max and min process count for scaling
min_process_count = min(comarca_counts.values())
max_process_count = max(comarca_counts.values())

# Function to scale radius dynamically
def scale_radius(process_count):
    if max_process_count == min_process_count:
        return min_radius  # Avoid division by zero if all values are the same
    return min_radius + (max_radius - min_radius) * ((process_count - min_process_count) / (max_process_count - min_process_count))

# Create a base map centered on São Paulo
# (named lawsuit_map so the built-in map() is not shadowed)
lawsuit_map = folium.Map(location=[-23.5505, -46.6333], zoom_start=7)

# Keep one row per comarca with valid coordinates, so each comarca
# gets a single marker and a single heat point
df_locations = df.drop_duplicates(subset="Comarca").dropna(subset=["latitude", "longitude"])

# Add dynamically sized circle markers based on aggregated process count per comarca
for _, row in df_locations.iterrows():
    comarca_name = row["Comarca"]
    process_count = comarca_counts.get(comarca_name, 0)  # Get aggregated count of processes
    circle_radius = scale_radius(process_count)  # Scale dynamically

    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=circle_radius,  # Dynamic size based on process count
        color="blue",
        fill=True,
        fill_color="blue",
        fill_opacity=0.6,  # Adjust transparency
        popup=f"{comarca_name}<br>Processos: {process_count}",
        tooltip=comarca_name
    ).add_to(lawsuit_map)

# Add HeatMap weighted by the number of processes in each comarca
heat_data = [[row["latitude"], row["longitude"], comarca_counts.get(row["Comarca"], 0)] for _, row in df_locations.iterrows()]
HeatMap(heat_data, max_zoom=13, radius=20).add_to(lawsuit_map)

# Display the map
lawsuit_map
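
As a worked illustration of the min-max scaling behind `scale_radius`, the sketch below maps case counts linearly onto the 3 to 30 pixel range used in the cell above. The function `scale_linear` and the comarca names and counts are hypothetical; unlike `scale_radius`, the bounds are passed as arguments instead of being read from global variables.

```python
def scale_linear(value, low, high, out_min=3.0, out_max=30.0):
    """Map value from the range [low, high] linearly onto [out_min, out_max]."""
    if high == low:
        return out_min  # All counts are equal: avoid division by zero
    return out_min + (out_max - out_min) * (value - low) / (high - low)


# Hypothetical counts per comarca, used only to illustrate the scaling
counts = {"A": 1, "B": 5, "C": 9}
low, high = min(counts.values()), max(counts.values())

radii = {name: scale_linear(count, low, high) for name, count in counts.items()}
print(radii)
```

The smallest count receives the minimum radius, the largest receives the maximum, and intermediate counts are placed proportionally in between.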


#### **3.7. Relationship between Plaintiffs and Defendants**

Figure 7 shows the distribution of plaintiffs and defendants, separated by category. The main findings include:
- **Individuals** are the most frequent plaintiffs, mainly against **the State and private universities**.
- **Associations and public bodies** also appear as plaintiffs.
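
The plaintiff and defendant pairings can also be summarized numerically with a contingency table. This is a minimal sketch using `pd.crosstab`; the four-row `sample` frame is hypothetical (including the `Associação` category), while the column names `Requerinte` and `Réu` follow the dataset loaded in Section 2.

```python
import pandas as pd

# Hypothetical sample, used only to illustrate the cross-tabulation
sample = pd.DataFrame({
    "Requerinte": ["Pessoa Física", "Pessoa Física", "Pessoa Física", "Associação"],
    "Réu": ["Estado", "Estado", "Escola Pública", "Estado"],
})

# Rows are plaintiff types, columns are defendant types; margins add row and column totals
pair_counts = pd.crosstab(
    sample["Requerinte"],
    sample["Réu"],
    margins=True,
    margins_name="Total",
)
print(pair_counts)
```

Each cell counts how often a plaintiff type faces a defendant type, which complements the per-plaintiff histograms in the figure.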


In [18]:
import plotly.subplots as sp
import plotly.graph_objects as go

# Visualization: one histogram of defendants (Réu) per plaintiff (Requerinte) type
df_req = df.dropna(subset=["Requerinte"])

unique_requerentes = df_req["Requerinte"].unique()
num_unique = len(unique_requerentes)

# Determine the subplot grid dynamically
rows = (num_unique // 2) + (num_unique % 2)  # Ensures there are enough rows
cols = 2 if num_unique > 1 else 1

# Create the figure with multiple subplots
fig5 = sp.make_subplots(rows=rows, cols=cols, subplot_titles=[f"Plaintiff: {req}" for req in unique_requerentes])

row, col = 1, 1
for req in unique_requerentes:
    subset = df_req[df_req["Requerinte"] == req]
    trace = go.Histogram(x=subset["Réu"], name=req)
    fig5.add_trace(trace, row=row, col=col)

    # Advance to the next grid cell, wrapping to a new row when needed
    col += 1
    if col > cols:
        col = 1
        row += 1

# Update the figure layout
fig5.update_layout(title_text="Distribution of Plaintiffs and Defendants (Separated by Plaintiff Type)")
fig5.update_xaxes(categoryorder='total descending')

# Show the figure
fig5.show()

In [23]:
df_sentenca

|  | Sentence Type | Absolute Frequency | Relative Frequency (%) |
| --- | --- | --- | --- |
| 0 | procedente | 7 | 63.636364 |
| 1 | parcial | 3 | 27.272727 |
| 2 | improcedente | 1 | 9.090909 |


In [24]:
summary = f"""
- The dataset contains a total of {len(df):,} processes.
- The most common procedural class is "{df_classes.iloc[0, 0]}", representing {df_classes.iloc[0, 2]:.1f}% of the cases.
- The most common judicial subject is "{df['Assunto 1ª Instância'].mode()[0]}", being the subject of {df['Assunto 1ª Instância'].value_counts(normalize=True).iloc[0] * 100:.1f}% of the processes.
- Regarding the sentences, the "procedente" decision represents {df_sentenca[df_sentenca['Sentence Type'] == 'procedente']['Relative Frequency (%)'].values[0]:.1f}% of the total.
- The comarca with the highest volume of processes is "{df_comarca.iloc[0, 0]}", accounting for {df_comarca.iloc[0, 2]:.1f}% of the analyzed cases.
"""

print(summary)


- The dataset contains a total of 11 processes.
- The most common procedural class is "Procedimento Comum Cível", representing 54.5% of the cases.
- The most common judicial subject is "DIREITO PROCESSUAL CIVIL E DO TRABALHO", being the subject of 18.2% of the processes.
- Regarding the sentences, the "procedente" decision represents 63.6% of the total.
- The comarca with the highest volume of processes is "taquaritinga", accounting for 9.1% of the analyzed cases.



## **Conclusion**

The exploratory data analysis revealed relevant patterns in the volume and distribution of the analyzed court cases. The main findings include:

- **Predominance of writs of mandamus (mandados de segurança) and ordinary civil actions**, indicating demands focused on guaranteeing fundamental rights.
- **Education and social assistance as recurring themes**, reinforcing the importance of the judiciary in defending social rights.
- **High concentration of cases in the city of São Paulo**, reflecting its relevance in the legal scenario.
- **Most sentences are procedente (claim granted)**, suggesting that the petitions presented have consistent grounds.
- **Geospatial distribution reveals areas of high litigation**, which can inform future analyses of the decentralization of the judicial system.

The obtained results provide a solid basis for the next stages of the study, including the application of topic modeling to identify latent patterns in the analyzed processes.