GitHub - MiriamSchirmer/genocide-transcript-corpus: Dataset of text sections of genocide-related court transcripts.

Genocide Transcript Corpus (GTC)

The Genocide Transcript Corpus (GTC) provides transcript data from three different genocide tribunals: the Extraordinary Chambers in the Courts of Cambodia (ECCC), the International Criminal Tribunal for Rwanda (ICTR), and the International Criminal Tribunal for the Former Yugoslavia (ICTY).

GTC Version 2 - June 2023

Besides metadata on the respective tribunal and transcript, this version also annotates text segments that include potentially traumatic witness experiences.

The updated version of the GTC contains 52,845 text segments from a total of 90 transcripts, each of which can be attributed to an individual person or to court proceedings. The final data set includes the following variables:

Case information: tribunal, case number, accused
Transcript information: document ID, URL of the original transcript, date
Witness information: witness name or pseudonym, number of witnesses per transcript
Text information: speaker (e.g., Witness, LawyerQA), text, trauma label
Annotation information: annotation ID, start ID, and document ID

Codebook V2

Variable Name	Description
tribunal	Name of the tribunal (ICTY, ICTR, or ECCC)
id_transcript	Document/transcript number/ID
case	Case number/ID
accused	Name of the accused
date	Date of the respective trial day corresponding to the transcript
text	Transcript segment separated by speaker
trauma	Potentially trauma-related content: 0 = not containing trauma-related content; 1 = containing trauma-related content
role	Legal role of the person speaking: Witness; Accused; JudgeProc (Judge talking about procedural matters); JudgeQA (Judge examining a witness); LawyerProc (Lawyer talking about procedural matters); LawyerQA (Lawyer examining a witness); Proceedings (procedural matters)
witnesses	Names or pseudonyms of witnesses
n_witnesses	Number of witnesses examined in the hearing
start	Starting point of the respective segment annotation (useful for ordering the data chronologically)
id_annotation	Annotation ID of the segment
id_document	Document ID in reference to the annotation process
url	Link to transcript
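
As a rough illustration of how these variables fit together, the sketch below reads rows with the Codebook V2 column names, keeps witness segments labeled as trauma-related, and orders the segments of each transcript by `start`. It uses only the Python standard library and a tiny made-up sample rather than the real corpus. The comma delimiter and the numeric storage of `trauma` and `start` are assumptions that have not been checked against `Dataset_GTC-V2.csv`, and the helper names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows using Codebook V2 column names (not real corpus data).
SAMPLE_CSV = """id_transcript,role,trauma,start,text
T1,Witness,1,120,placeholder segment c
T1,LawyerQA,0,40,placeholder segment b
T1,Witness,0,0,placeholder segment a
T2,Witness,1,10,placeholder segment d
"""


def read_segments(handle):
    """Read rows from an open CSV handle and convert numeric columns."""
    rows = []
    for row in csv.DictReader(handle):
        # Assumption: values may be stored as "1" or "1.0".
        row["trauma"] = int(float(row["trauma"]))
        row["start"] = int(float(row["start"]))
        rows.append(row)
    return rows


def witness_trauma_segments(rows):
    """Keep segments spoken by a witness and labeled trauma = 1."""
    return [r for r in rows if r["role"] == "Witness" and r["trauma"] == 1]


def ordered_by_transcript(rows):
    """Group segments by transcript and sort each group by `start`."""
    grouped = defaultdict(list)
    for r in rows:
        grouped[r["id_transcript"]].append(r)
    return {tid: sorted(segs, key=lambda r: r["start"])
            for tid, segs in grouped.items()}


rows = read_segments(io.StringIO(SAMPLE_CSV))
trauma_rows = witness_trauma_segments(rows)
ordered = ordered_by_transcript(rows)
```

To try this on the actual file, the `io.StringIO(SAMPLE_CSV)` handle could be swapped for `open("Dataset_GTC-V2.csv", newline="", encoding="utf-8")`; the encoding is likewise an assumption.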

Please refer to the corresponding paper for further context, including details on the labeling process:

Miriam Schirmer, Isaac Misael Olguín Nolasco, Edoardo Mosca, Shanshan Xu, and Jürgen Pfeffer. 2023. Uncovering Trauma in Genocide Tribunals: An NLP Approach Using the Genocide Transcript Corpus. In Nineteenth International Conference on Artificial Intelligence and Law (ICAIL 2023), June 19–23, 2023, Braga, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3594536.3595147

GTC Version 1 - June 2022

All samples were labeled according to whether they contain a witness’s description of experienced violence. Violence in this context includes accounts of experienced torture, interrogation, death, beating, psychological violence, experienced military attacks, destruction of villages, and looting.

The transcript data was divided into equally large text chunks of 250 words each. Numbers and punctuation were removed.
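
The preprocessing described above can be sketched roughly as follows. This is not the authors' code: the exact cleaning rules are not given here, so the regular expressions, the choice to delete punctuation rather than replace it with spaces, and the decision to drop a trailing remainder shorter than 250 words are all assumptions made for illustration.

```python
import re

CHUNK_SIZE = 250  # words per chunk, as stated for GTC Version 1


def clean_text(text):
    """Remove digits and punctuation, keeping letters and whitespace."""
    text = re.sub(r"\d+", "", text)          # drop numbers
    text = re.sub(r"[^\w\s]|_", "", text)    # drop punctuation and underscores
    return text


def chunk_words(text, size=CHUNK_SIZE):
    """Split cleaned text into equally sized chunks of `size` words."""
    words = clean_text(text).split()
    # Assumption: a trailing remainder shorter than `size` is discarded
    # so that every chunk has exactly `size` words.
    n_full = len(words) // size
    return [" ".join(words[i * size:(i + 1) * size]) for i in range(n_full)]


# Made-up input: two leading words plus 600 filler words.
sample = "On 12 April, 1994: " + "word " * 600
chunks = chunk_words(sample)
```

With this input, the cleaned text has 602 words, so two full chunks are produced and the last 102 words are dropped under the stated assumption.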

Codebook V1

Variable Name	Description
paragraph	A text passage from a genocide tribunal transcript (250 words each).
label	Violence-related content: 0 = not containing violence; 1 = containing violence
tribunal	The specific tribunal the transcript data is from: 1 = ECCC; 2 = ICTY; 3 = ICTR
witness	The witness's name or a pseudonym.
document	The document number / ID.
case	The case number / ID.
date	The trial date.
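
Unlike Version 2, which stores the tribunal by name, Version 1 encodes it as an integer. The following minimal sketch decodes those codes and counts labels per tribunal. The mapping comes from the Codebook V1 table above, while the sample rows and helper names are made up for illustration and do not reflect the real data.

```python
from collections import Counter

# Tribunal codes as listed in Codebook V1.
TRIBUNAL_CODES = {1: "ECCC", 2: "ICTY", 3: "ICTR"}


def decode_tribunal(code):
    """Translate a numeric V1 tribunal code into the tribunal's acronym."""
    return TRIBUNAL_CODES[int(code)]


def violence_counts_by_tribunal(rows):
    """Count rows per (tribunal acronym, violence label) pair."""
    counts = Counter()
    for row in rows:
        key = (decode_tribunal(row["tribunal"]), int(row["label"]))
        counts[key] += 1
    return counts


# Made-up rows using Codebook V1 column names (values as read from a CSV).
sample_rows = [
    {"tribunal": "1", "label": "1"},
    {"tribunal": "2", "label": "0"},
    {"tribunal": "3", "label": "1"},
    {"tribunal": "3", "label": "1"},
]
counts = violence_counts_by_tribunal(sample_rows)
```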

General Note: All transcripts used are openly accessible on the respective courts' websites.

Folders and files

GTC_Version1
logs
src
01_WebScraping_TribunalTranscriptCases.ipynb
02_InfoExtractionFromJson.ipynb
03_BinaryClassification.ipynb
04_ActiveLearning.ipynb
05_AnalysisOfComparableModels.ipynb
06_XAI_Shap.ipynb
Dataset_GTC-V2.csv
README.md
