Project Details:
Legal Text Mining
This project is provided by "Open Legal Data" and Dr. Malte Ostendorff.
# Introduction
Courts publish tens of thousands of decisions every year, yet most remain unread except by specialists.
Open Legal Data (openlegaldata.io) is an open-source, non-profit initiative that collects court decisions and
other legal information to turn this hidden trove into an open, machine-readable corpus. Open Legal Data
provides this corpus through a web interface, bulk downloads, and a REST API. In doing so, Open Legal
Data aims to make the judiciary more transparent through open data and to help
people without legal training understand the justice system, in line with the Open-Data Principles and
the Free Access to Law movement.
# Case Study Problem
By transforming raw legal text into structured evidence, the project shows how judicial and legislative
signals can be quantified, visualized, and explored to uncover trends, guide policy and practice, and open
new avenues for legal research - work that lies squarely at the intersection of data science and law.
# Case Study Material
In the Legal Text Mining project, you will receive a dataset provided by Open Legal Data. The dataset
consists of a pre-cleaned dump of court decisions enriched with basic metadata (court, chamber, date, ECLI,
citations).
# Case Study Core Tasks
In the summer school project, you will develop an analytic pipeline to discover insights hidden in the large
amount of legal text, for example:
• Topic & trend analysis:
tracing how climate-risk litigation or consumer-protection disputes evolve across time and jurisdictions;
• Citation and precedent networks:
mapping which judgments or legislative acts act as hubs and how influence propagates through the
legal system;
• Outcome modelling:
linking textual or structural features to damages, penalties, or success rates, and examining regional
or court-level disparities.
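
As a rough illustration of the citation-network task, the sketch below counts how often each decision is cited (its in-degree) to surface candidate hubs. The edge list and the case identifiers are invented for illustration, and the citation field in the actual dataset may be structured differently.

```python
from collections import Counter

# Hypothetical (citing, cited) pairs; the identifiers are made up.
citations = [
    ("case_A", "case_C"),
    ("case_B", "case_C"),
    ("case_D", "case_C"),
    ("case_D", "case_B"),
]


def citation_hubs(edges, top_n=3):
    """Return the top_n most-cited decisions together with their in-degree."""
    in_degree = Counter(cited for _, cited in edges)
    return in_degree.most_common(top_n)


hubs = citation_hubs(citations)
print(hubs)
```

In-degree is only the simplest hub measure; a graph library with centrality scores would be a natural next step once the real citation data is available.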

Install required libraries

In [2]:
!pip install -q datasets transformers scikit-learn pandas numpy matplotlib seaborn

I tried to load the dataset directly from Kaggle.

In [5]:
import pandas as pd
import os

# Check files in the dataset
os.listdir("/kaggle/input/ecthrnaacl2021")


['dev.jsonl', 'test.jsonl', 'train.jsonl']
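
The three files listed above can also be read in one pass. The following is a minimal sketch, assuming each split is a JSON Lines file named `<split>.jsonl`; the helper `load_splits` is hypothetical, and the demo runs on a temporary directory with invented records rather than on the Kaggle path.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_splits(data_dir, names=("train", "dev", "test")):
    """Read each <name>.jsonl file in data_dir into a DataFrame."""
    data_dir = Path(data_dir)
    return {
        name: pd.read_json(data_dir / f"{name}.jsonl", lines=True)
        for name in names
    }


# Demo on a throwaway directory with one invented record per split.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("train", "dev", "test"):
        record = {"case_id": f"demo-{name}", "violated_articles": ["6"]}
        (Path(tmp) / f"{name}.jsonl").write_text(json.dumps(record) + "\n")
    splits = load_splits(tmp)

for name, df in splits.items():
    print(name, df.shape)
```

In the notebook, the call would presumably be `load_splits("/kaggle/input/ecthrnaacl2021")`.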

In [None]:
Now check all three JSON Lines files one by one to see what kind of data each contains.

In [6]:
import pandas as pd

df_train = pd.read_json("/kaggle/input/ecthrnaacl2021/train.jsonl", lines=True)
df_train.head()


Unnamed: 0,case_id,case_no,title,judgment_date,facts,applicants,defendants,allegedly_violated_articles,violated_articles,court_assessment_references,silver_rationales,gold_rationales
0,001-59587,25702/94,CASE OF K. AND T. v. FINLAND,2001-07-12,[11. At the beginning of the events relevant ...,"[K., T.]",[FINLAND],"[13, 8]",[8],"{'8': ['12', '140', '155', '156', '157', '158'...","[1, 13, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30...",[]
1,001-59591,42527/98,CASE OF PRINCE HANS-ADAM II OF LIECHTENSTEIN v...,2001-07-12,[9. The applicant is the monarch of Liechtens...,[PRINCE HANS-ADAM II OF LIECHTENSTEIN],[GERMANY],"[14, P1-1, 6]",[],"{'6': ['12', '15', '24', '25', '26', '27', '28...","[3, 6]",[]
2,001-59590,33071/96,CASE OF MALHOUS v. THE CZECH REPUBLIC,2001-07-12,[9. In June 1949 plots of agricultural land o...,[MALHOUS],[CZECH REPUBLIC],[6],[6],"{'6': ['13', '14', '35', '40', '41', '42', '43...","[4, 5]",[]
3,001-59588,29032/95,CASE OF FELDEK v. SLOVAKIA,2001-07-12,"[8. In 1991 Mr Dušan Slobodník, a research wo...",[FELDEK],[SLOVAKIA],"[14, 10, 9]",[10],{'10': ['35']},[27],[]
4,001-59589,44759/98,CASE OF FERRAZZINI v. ITALY,2001-07-12,"[9. The applicant is an Italian citizen, born...",[FERRAZZINI],[ITALY],"[14, 6]",[],"{'6': ['13', '14', '35', '40', '41', '42', '43...",[4],[]


In [7]:
import pandas as pd

df_dev = pd.read_json("/kaggle/input/ecthrnaacl2021/dev.jsonl", lines=True)
df_dev.head()


Unnamed: 0,case_id,case_no,title,judgment_date,facts,applicants,defendants,allegedly_violated_articles,violated_articles,court_assessment_references,silver_rationales,gold_rationales
0,001-160404,20488/11,CASE OF KOSIŃSKI v. POLAND,2016-02-09,[5. The applicant was born in 1983 and is det...,[KOSIŃSKI],[POLAND],[8],[],{},[],[]
1,001-160406,61050/11,CASE OF MESCEREACOV v. THE REPUBLIC OF MOLDOVA,2016-02-09,[5. The applicant was born in 1982 and is cur...,[MESCEREACOV],[MOLDOVA],[3],[3],"{'3': ['10', '11', '9']}","[4, 5, 6]",[]
2,001-160417,40852/05,CASE OF SHLYCHKOV v. RUSSIA,2016-02-09,[5. The applicant was born in 1955 and lives ...,[SHLYCHKOV],[RUSSIA],"[3, 6]","[6, 3]","{'3': ['18', '24', '27', '34', '37', '47'], '6...","[10, 12, 13, 14, 19, 22, 29, 32]",[]
3,001-160432,38395/12,CASE OF DALLAS v. THE UNITED KINGDOM,2016-02-11,[6. The applicant was born in 1977 and lives ...,[DALLAS],[UNITED KINGDOM],[7],[],"{'7': ['10', '11', '27', '29', '30', '31', '32...","[2, 3, 4, 5, 21, 23, 24, 25, 26, 27, 28, 31, 3...",[]
4,001-160425,42534/09,"CASE OF MITROVA AND SAVIK v. ""THE FORMER YUGOS...",2016-02-11,[6. The applicants were born in 1983 and 2007...,"[MITROVA, SAVIK]",[FORMER YUGOSLAV MACEDONIA],"[11, 5, 6, 8]",[],"{'11': ['67', '68', '69', '70', '71', '72', '7...",[],[]


In [10]:
df_test = pd.read_json("/kaggle/input/ecthrnaacl2021/test.jsonl", lines=True)
df_test.head(5)


Unnamed: 0,case_id,case_no,title,judgment_date,facts,applicants,defendants,allegedly_violated_articles,violated_articles,court_assessment_references,silver_rationales,gold_rationales
0,001-177349,21272/12,CASE OF BECKER v. NORWAY,2017-10-05,"[5. The applicant is a journalist for DN.no, ...",[BECKER],[NORWAY],[10],[10],"{'10': ['11', '12', '13', '14', '15', '16', '1...","[1, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18...",[]
1,001-177341,33015/06,CASE OF VOSKOBOYNIKOV v. UKRAINE,2017-10-05,[5. The applicant was born in 1940 and lives ...,[VOSKOBOYNIKOV],[UKRAINE],"[13, 8]",[8],"{'8': ['27', '28'], '13': ['22']}",[17],[]
2,001-177354,45758/14,CASE OF BERÁNEK v. THE CZECH REPUBLIC,2017-10-05,[5. The applicant was born in 1965 and lives ...,[BERÁNEK],[CZECH REPUBLIC],[6],[6],{'6': ['12']},[7],[]
3,001-177342,32598/07,CASE OF SUKHANOV v. UKRAINE,2017-10-05,[5. The applicant was born in 1967 and lives ...,[SUKHANOV],[UKRAINE],[6],[6],"{'6': ['10', '11', '12', '15', '18', '9']}","[4, 5, 6, 7]",[]
4,001-177355,34197/15,CASE OF MITEV v. BULGARIA,2017-10-05,[5. The applicant was born in 1967 and lives ...,[MITEV],[BULGARIA],"[13, 3, 6]",[],"{'6': ['10', '11', '12', '15', '18', '9']}","[4, 5, 6, 7, 10, 13]",[]
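
The previews show that `facts` and `violated_articles` hold lists. For outcome modelling, one possible first step is to derive a binary label and a single text field per case. This is a sketch under assumptions: the rows below are invented stand-ins that only mimic the column layout, and the derived column names (`has_violation`, `facts_text`, `n_facts`) are hypothetical.

```python
import pandas as pd

# Toy rows that mimic the column layout shown above; the values are invented.
df = pd.DataFrame({
    "case_id": ["demo-1", "demo-2", "demo-3"],
    "facts": [["1. First fact.", "2. Second fact."], ["1. Only fact."], []],
    "allegedly_violated_articles": [["6", "8"], ["3"], ["10"]],
    "violated_articles": [["8"], [], ["10"]],
})

# 1 if the court found at least one violation, else 0.
df["has_violation"] = df["violated_articles"].apply(lambda arts: int(len(arts) > 0))

# Join the fact paragraphs into one string per case for text models.
df["facts_text"] = df["facts"].apply(" ".join)

# Number of fact paragraphs, a simple structural feature.
df["n_facts"] = df["facts"].apply(len)

print(df[["case_id", "has_violation", "n_facts"]])
```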


Let's check the columns

In short, check just the column names of this dataset.

In [9]:
df_train.columns


Index(['case_id', 'case_no', 'title', 'judgment_date', 'facts', 'applicants',
       'defendants', 'allegedly_violated_articles', 'violated_articles',
       'court_assessment_references', 'silver_rationales', 'gold_rationales'],
      dtype='object')

In [11]:
df_test.columns

Index(['case_id', 'case_no', 'title', 'judgment_date', 'facts', 'applicants',
       'defendants', 'allegedly_violated_articles', 'violated_articles',
       'court_assessment_references', 'silver_rationales', 'gold_rationales'],
      dtype='object')

In [12]:
df_dev.columns

Index(['case_id', 'case_no', 'title', 'judgment_date', 'facts', 'applicants',
       'defendants', 'allegedly_violated_articles', 'violated_articles',
       'court_assessment_references', 'silver_rationales', 'gold_rationales'],
      dtype='object')
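
Rather than comparing the three printed indexes by eye, the check can be automated. The sketch below uses a hypothetical helper `columns_match` and empty stand-in frames with a shortened column list; in the notebook the inputs would be `df_train`, `df_dev` and `df_test`.

```python
import pandas as pd


def columns_match(frames):
    """Return True if every DataFrame has the same columns in the same order."""
    column_lists = [list(df.columns) for df in frames.values()]
    return all(cols == column_lists[0] for cols in column_lists)


# Stand-in empty frames with a shortened, illustrative column list.
cols = ["case_id", "case_no", "title"]
frames = {
    "train": pd.DataFrame(columns=cols),
    "dev": pd.DataFrame(columns=cols),
    "test": pd.DataFrame(columns=cols),
}
ok = columns_match(frames)
print(ok)

# A split with an extra column should be flagged as a mismatch.
mismatched = dict(frames, test=pd.DataFrame(columns=cols + ["extra"]))
bad = columns_match(mismatched)
print(bad)
```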