# Web scraping of tribunal case transcripts

This notebook shows how transcripts of tribunal cases are retrieved through web scraping and then cleaned.

Following general ethical guidelines for web scraping, the crawling permissions of the three tribunal websites were checked first:
- [International Criminal Tribunal for the former Yugoslavia](https://www.icty.org/): the permissions can be found [here](https://www.icty.org/robots.txt)
- [Extraordinary Chambers in the Courts of Cambodia](https://www.eccc.gov.kh/): the permissions can be found [here](https://www.eccc.gov.kh/robots.txt)
- [International Criminal Tribunal for Rwanda](https://ucr.irmct.org/): no robots.txt file was found
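
The permission check described above can also be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper name `is_allowed` and the example rules are hypothetical and do not reproduce the real robots.txt of any tribunal.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # accepts an iterable of robots.txt lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only (NOT the content of a real robots.txt)
example_rules = [
    "User-agent: *",
    "Disallow: /admin/",
]

print(is_allowed(example_rules, "https://example.org/x/cases/transcript.htm"))
print(is_allowed(example_rules, "https://example.org/admin/login"))
```

To check a live site instead, `RobotFileParser.set_url(...)` followed by `read()` downloads and parses the file before `can_fetch` is called.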

### Imports

In [None]:
#import sys  
#!{sys.executable} -m pip install PyPDF2

In [35]:
%load_ext autoreload
%autoreload 2

import requests
from bs4 import BeautifulSoup
from os import listdir
from os.path import isfile, join
from PyPDF2 import PdfFileReader

import src.cleaning_transcripts as cleaning_transc

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Globals

In [53]:
GLB_FILE_ENCODING_UTF8 = "utf8"
GLB_FILE_WRITE_OP = "w"
GLB_FILE_BINARY_OP = "wb"

GLB_HTML_P_TAG = "p"
GLB_HTML_PARSER = "html.parser"

GLB_PATH_OUTPUT_DIRECTORY_ICTY = "output/clean_transcripts/icty"
GLB_PATH_INPUT_DIRECTORY_ECCC = "input/transcripts/eccc"
GLB_PATH_OUTPUT_DIRECTORY_ECCC = "output/clean_transcripts/eccc"
GLB_PATH_INPUT_DIRECTORY_ICTR = "input/transcripts/ictr"
GLB_EXTENSION_TXT = ".txt"

GLB_CHAR_NEWLINE = "\n"

GLB_COURT_PREFIX_FILE_ICTY = "ICTY_"
GLB_COURT_PREFIX_FILE_ECCC = "ECCC_"

DEBUG = True

## International Criminal Tribunal for the former Yugoslavia

In [37]:
transcript_icty_case_url = "https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm"

In [50]:
if DEBUG:
    # Download one transcript page from the "International Criminal Tribunal for the former Yugoslavia"
    response = requests.get(transcript_icty_case_url)

    # Parse the raw page content (response.content) as HTML
    content_soup = BeautifulSoup(response.content, GLB_HTML_PARSER)

    # Uncomment to print the parsed page as formatted HTML
    #print(content_soup.prettify())

    # The transcript text is stored in <p> tags, so collect all of them
    list_p = content_soup.find_all(GLB_HTML_P_TAG)
    print(f'Number of retrieved paragraphs of transcript {transcript_icty_case_url} is {len(list_p)}')
    print("="*50)

    counter = 0
    for paragraph in list_p:
        clean_paragraph = cleaning_transc.cleanParagraphsICFYtranscript(str(paragraph))
        if clean_paragraph != cleaning_transc.GLB_EMPTY_STRING:
            counter+=1
            print(clean_paragraph)

Number of retrieved paragraphs of transcript https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm is 2756
THE INTERNATIONAL CRIMINAL TRIBUNAL 		CASE NO. IT-94-1-T
FOR THE FORMER YUGOSLAVIA
IN THE TRIAL CHAMBER
Friday, 19th July 1996
SUADA RAMIC, recalled.
THE PRESIDING JUDGE:  Miss Hollis, could you continue, please?
MISS HOLLIS:  Yes, your Honour.
Examined by MISS HOLLIS, continued.
THE PRESIDING JUDGE:  You may be seated, Mrs. Ramic.  You are still under
oath, the oath that you took yesterday. You may be seated.  Thank you.
MISS HOLLIS:  Perhaps we could turn off one of those microphones and, Mrs.
Ramic, if you could sit back a bit from the microphone?  Thank you.
Mrs. Ramic, when we finished yesterday you had indicated that after
the events that occurred in the military barracks you a few days later
went to your village to visit your brother and after some time you
came back again to your home in Prijedor.  Do you recall your home
being searched after your return to Prijedor?
A.

#### Save Documents

In [39]:
list_url_cases_icty = [transcript_icty_case_url]

for index_case, url_html_case in enumerate(list_url_cases_icty):
    response = requests.get(url_html_case)
    
    content_soup = BeautifulSoup(response.content, GLB_HTML_PARSER)
    
    # The transcript text is stored in <p> tags
    list_p = content_soup.find_all(GLB_HTML_P_TAG)

    counter = 0
    # Use the file name of the URL (without extension) as case identifier
    id_case = url_html_case[url_html_case.rindex("/")+1: url_html_case.rindex(".")]
    output_file = join(GLB_PATH_OUTPUT_DIRECTORY_ICTY, GLB_COURT_PREFIX_FILE_ICTY + id_case + GLB_EXTENSION_TXT)
    with open(output_file, GLB_FILE_WRITE_OP, encoding=GLB_FILE_ENCODING_UTF8) as f:
        for paragraph in list_p:
            clean_paragraph = cleaning_transc.cleanParagraphsICFYtranscript(str(paragraph))
            if clean_paragraph != cleaning_transc.GLB_EMPTY_STRING:
                counter += 1
                f.write(clean_paragraph + GLB_CHAR_NEWLINE)

    # Report the URL processed in this iteration, not the global example URL
    print(f'{index_case+1}) Number of retrieved paragraphs of transcript {url_html_case} is {len(list_p)} was reduced to {counter}')
    

1) Number of retrieved paragraphs of transcript https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm is 2756 was reduced to 2486
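
The output file name in the cell above is derived by slicing the URL with `rindex`. A slightly more defensive variant is sketched below, assuming the case identifier is always the last path segment of the URL without its extension; the helper name `output_path_for` is hypothetical and not part of `src.cleaning_transcripts`.

```python
from os.path import join
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_path_for(url, directory, prefix, extension=".txt"):
    """Build the output file path for a transcript URL."""
    # urlparse drops any query string; stem removes only the last extension
    stem = PurePosixPath(urlparse(url).path).stem
    return join(directory, prefix + stem + extension)


print(output_path_for(
    "https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm",
    "output/clean_transcripts/icty",
    "ICTY_",
))
```

Unlike plain slicing, this keeps working if a URL carries a query string after the file name.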


## Extraordinary Chambers in the Courts of Cambodia

In [40]:
transcript_eccc_case_url = "https://www.eccc.gov.kh/sites/default/files/documents/courtdoc/%5Bdate-in-tz%5D/E1_41.1_TR001_20090701_Final_EN_Pub.pdf"

In [49]:
if DEBUG:
    list_all_sentences = list()
    # Extracting information from the "Extraordinary Chambers in the Courts of Cambodia"
    # Download the PDF and store it locally
    response = requests.get(transcript_eccc_case_url)
    doc_name = transcript_eccc_case_url[transcript_eccc_case_url.rindex("/")+1:]

    with open(join(GLB_PATH_INPUT_DIRECTORY_ECCC, doc_name), GLB_FILE_BINARY_OP) as f:
        f.write(response.content)

    # Read the PDF page by page
    pdf = PdfFileReader(join(GLB_PATH_INPUT_DIRECTORY_ECCC, doc_name))
    number_of_pages = len(pdf.pages)
    #print(f"Number of pages {number_of_pages}")
    index_page_of_interest = 1
    pattern_was_found = False
    for index_page in range(number_of_pages):
        page_pdf = pdf.pages[index_page]
        text_page = page_pdf.extract_text()

        # Keep pages from the first one that contains the begin-of-content pattern
        if not pattern_was_found and cleaning_transc.GLB_ECCC_PATTERN_BEGIN_CONTENT_OF_INTEREST in text_page:
            pattern_was_found = True

        if pattern_was_found:
            list_aux = cleaning_transc.cleanPagePdfECCCtranscript(text_page, index_page_of_interest)
            list_all_sentences.extend(list_aux)
            index_page_of_interest += 1

    print(f"Total num of sentences from PDF file {len(list_all_sentences)}")
    print("="*50)

    # Initialise the counter here so the cell does not depend on earlier cells
    counter = 0
    for sentence in list_all_sentences:
        clean_sentence = cleaning_transc.cleanSentenceECCCtranscript(sentence)
        if clean_sentence != cleaning_transc.GLB_EMPTY_STRING:
            counter += 1
            print(clean_sentence)
    

Total num of sentences from PDF file 2600
P R O C E E D I N G S
(Judges enter courtroom)
MR. PRESIDENT:
Please be seated.  The Court is now in session.
According to our schedule, today we're going to hear the
testimony of another survivor; the third person among the nine
survivors of S-21.
The lawyer, I note your presence.  Would you like to make any
comments?
MS. STUDZINSKY:
Mr. President, good morning.  Your Honours, good morning, dear
colleagues.
Yes, I would like to make some observations and also I'm seeking
for clarification before we hear the next survivor.
We have observed that Mr. Chum Mey yesterday was overwhelmed
sometimes when he accounted his story and he had to cry, and he
could not control his emotions any more.  He shares his
traumatization as well as the next survivor, who is my client
together with Cambodian colleagues, and he shares this situation
with other survivors, victims, civil parties and witnesses.
I would like to make a proposal.  I would like that the Chamb

#### Save Documents

In [45]:
list_url_cases_eccc = [transcript_eccc_case_url]

for index_case, url_html_case in enumerate(list_url_cases_eccc):
    list_all_sentences = list()
    
    response = requests.get(url_html_case)
    # Derive the file name from the URL processed in this iteration
    doc_name = url_html_case[url_html_case.rindex("/")+1:]

    # Store the downloaded PDF locally
    with open(join(GLB_PATH_INPUT_DIRECTORY_ECCC, doc_name), GLB_FILE_BINARY_OP) as f:
        f.write(response.content)
    
    # Get content of the PDF
    pdf = PdfFileReader(join(GLB_PATH_INPUT_DIRECTORY_ECCC, doc_name))
    number_of_pages = len(pdf.pages)
    
    index_page_of_interest = 1
    pattern_was_found = False
    for index_page in range(number_of_pages):
        page_pdf = pdf.pages[index_page]
        text_page = page_pdf.extract_text()

        # Keep pages from the first one that contains the begin-of-content pattern
        if not pattern_was_found and cleaning_transc.GLB_ECCC_PATTERN_BEGIN_CONTENT_OF_INTEREST in text_page:
            pattern_was_found = True

        if pattern_was_found:
            list_aux = cleaning_transc.cleanPagePdfECCCtranscript(text_page, index_page_of_interest)
            list_all_sentences.extend(list_aux)
            index_page_of_interest += 1
        
    counter = 0
    id_case = url_html_case[url_html_case.rindex("/")+1: url_html_case.rindex(".")]
    output_file = join(GLB_PATH_OUTPUT_DIRECTORY_ECCC, GLB_COURT_PREFIX_FILE_ECCC + id_case + GLB_EXTENSION_TXT)
    with open(output_file, GLB_FILE_WRITE_OP, encoding=GLB_FILE_ENCODING_UTF8) as f:
        for sentence in list_all_sentences:
            clean_paragraph = cleaning_transc.cleanSentenceECCCtranscript(sentence)
            if clean_paragraph != cleaning_transc.GLB_EMPTY_STRING:
                counter += 1
                f.write(clean_paragraph + GLB_CHAR_NEWLINE)
            
    print(f'{index_case+1}) Number of retrieved paragraphs of transcript {url_html_case} is {len(list_all_sentences)} was reduced to {counter}')
    

1) Number of retrieved paragraphs of transcript https://www.eccc.gov.kh/sites/default/files/documents/courtdoc/%5Bdate-in-tz%5D/E1_41.1_TR001_20090701_Final_EN_Pub.pdf is 2600 was reduced to 2482
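
The page loop above skips the front matter of the PDF and keeps every page from the first one that contains the begin-of-content pattern. That selection logic can be sketched in isolation as a small generator. The function name `pages_from_marker` is hypothetical, and the marker string used in the example is an assumption taken from the first line of the cleaned output; the real value of `GLB_ECCC_PATTERN_BEGIN_CONTENT_OF_INTEREST` is not shown in this notebook.

```python
def pages_from_marker(page_texts, marker):
    """Yield (index_of_interest, text) from the first page containing marker onwards."""
    found = False
    index_of_interest = 1  # 1-based counter over the kept pages only
    for text in page_texts:
        if not found and marker in text:
            found = True
        if found:
            yield index_of_interest, text
            index_of_interest += 1


# Toy page texts for illustration; the marker value is an assumption
pages = ["cover page", "table of contents", "P R O C E E D I N G S\nfirst", "second"]
for number, text in pages_from_marker(pages, "P R O C E E D I N G S"):
    print(number, repr(text))
```

Separating the selection from the cleaning makes it possible to test the skipping behaviour on plain strings without downloading a PDF.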


## International Criminal Tribunal for Rwanda

In [51]:
transcript_ictr_case_url = "https://ucr.irmct.org/LegalRef/CMSDocStore/Public/English/Transcript/NotIndexable/ICTR-96-04/TRS13317R0000613662.PDF"

In [59]:
if DEBUG:
    list_all_sentences = list()
    # Extracting information from the "International Criminal Tribunal for Rwanda"
    # Download the PDF and store it locally
    response = requests.get(transcript_ictr_case_url)
    # Derive the file name from the ICTR URL itself (not from the ECCC URL)
    doc_name = transcript_ictr_case_url[transcript_ictr_case_url.rindex("/")+1:]

    with open(join(GLB_PATH_INPUT_DIRECTORY_ICTR, doc_name), GLB_FILE_BINARY_OP) as f:
        f.write(response.content)

    # Read the PDF page by page (no ICTR-specific cleaning is applied yet)
    pdf = PdfFileReader(join(GLB_PATH_INPUT_DIRECTORY_ICTR, doc_name))
    number_of_pages = len(pdf.pages)
    for index_page in range(number_of_pages):
        page_pdf = pdf.pages[index_page]
        text_page = page_pdf.extract_text()
    
    

In [58]:
# Inspect the extracted text of a single page (index 2, i.e. the third page)
page_pdf = pdf.pages[2]
text_page = page_pdf.extract_text()

print(text_page)

AKAYESU 
2 MR. PRESIDENT: 
3 21 FEB 97 
3 
(Interpreter) Could the Registry please 
4 remind us of the case on today's docket? 
5 THE REGISTRY: 
6 The Trial Chamber One of the 
7 International Criminal Tribunal Rwanda, 
8 composed of Judge Laity Kama presiding, 
9 Judge Navanethem Pillay and Judge Lennart 
10 Aspegren, is now in session for the 
11 continued trial in the matter of the 
12 
13 
14 MR. PRESIDENT: Prosecutor versus Jean Paul Akayesu, case 
number ICTR-9-4-T. I'm obliged. 
15 Thank you very much. 
16 Bailiff, please bring in the witness. 
17 Good morning, Madame. 
18 THE WITNESS: 
19 
20 MR. PRESIDENT: (Interpreter) Good morning. 
21 Is the defence ready to begin its cross 
22 examination of the witness? I give the 
23 defence the floor. 
24 MR. MONTHE: 
25 (Interpreter) Mr. President, your 
REX LEAR, OFFICIAL REPORTER 
ICTR -CHAMBER I 
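
The extracted page above mixes the transcript text with the line numbers (1 to 25) printed in the margin of the original document. The notebook does not include a cleaning step for ICTR yet; the following is only a rough sketch of one possible first pass, assuming a margin number appears at the start of a line followed by a space or stands alone on its line. The function name `strip_line_numbers` is hypothetical and not part of `src.cleaning_transcripts`.

```python
import re

# A margin line number from 1 to 25 at the start of a line,
# followed by whitespace or standing alone on the line
LINE_NUMBER = re.compile(r"^(?:[1-9]|1\d|2[0-5])(?:\s+|$)")


def strip_line_numbers(page_text):
    """Return the non-empty lines of a page with leading margin numbers removed."""
    cleaned = []
    for line in page_text.splitlines():
        line = LINE_NUMBER.sub("", line.strip())
        if line:
            cleaned.append(line)
    return cleaned


sample = (
    "(Interpreter) Could the Registry please \n"
    "4 remind us of the case on today's docket? \n"
    "5 THE REGISTRY: \n"
    "12 \n"
)
for line in strip_line_numbers(sample):
    print(line)
```

This heuristic also strips a number between 1 and 25 that genuinely starts a sentence, and it does not handle pages where the margin numbers are extracted as a separate column, so it would need refinement before being used on the full transcript.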


In [61]:
pdf = PdfFileReader(join(GLB_PATH_INPUT_DIRECTORY_ICTR, doc_name))
number_of_pages = len(pdf.pages)

for index_page in range(number_of_pages):
    page_pdf = pdf.pages[index_page]
    text_page = page_pdf.extract_text()
    print(text_page)

1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
--'"-~ 20 
21 
22 
23 
24 
23 
24 
25 21 FEB 97 
AKAYESU 
ICt:R-
CRIMINAL REGISTRY . 
RECEIVED 
INTERNATIONAjlltlptq~llf\.rpri2f[Q9NAL FOR RWANDA 
In the Matter of 
JEAN-PAUL AKAYESU 
Case No. ICTR-96-4-T 
DATE: 21 February 1997 
TRIBUNAL MEMBERS: President Laity Kama 
Lennart Aspergren 
Navanethem Pillay 
PROSECUTORS: Pierre Richard Prosper 
DEFENCE: Sara Darehshori 
Yakob Haile Mariam 
Patrice Monthe 
Nicolas Tiangaye 
REX LEAR COURT REPORTER 
ICTR -CHAMBER I 
1 
-AKAYESU 
1 
2 I N D E X 
3 ALISON DES FORGES, PH.D. 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 Cross-Examination 
REX LEAR COURT REPORTER 
ICTR -CHAMBER I 
2 21 FEB 97 
8 
AKAYESU 
2 MR. PRESIDENT: 
3 21 FEB 97 
3 
(Interpreter) Could the Registry please 
4 remind us of the case on today's docket? 
5 THE REGISTRY: 
6 The Trial Chamber One of the 
7 International Criminal Tribunal Rwanda, 
8 composed of Judge Laity Kama 

1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 r-, 
19 
20 
21 
22 
23 
24 
25 AKAYESU 
Q. 21 FEB 97 
33 
changes in the policies. At a given 
time, they had accepted the power of the 
king. But some months later they changed 
direction and they reduced the power of 
the king for a certain period. And then 
later on, a few months later, they went 
back to their initial stance. 
So, they, at one point in time, forced 
the king to accept a certain amount of 
change in the powers that he had. And at 
that time the king had to order the 
freedom of worship for all his subjects, 
because there was some who protested that 
they did not have freedom of worship. 
And this was the reaction of the colonial 
administration, to impose that liberty -­
that freedom. 
I thank you, Doctor. Now, at this level, 
before we come to the establishment of 
political parties, historically, were 
there problems of coexistence between the 
various peoples in Rwanda? 
Did --were people aware that 

1 
2 
3 
4 
5 
6 
7 
8 
9 , __ 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 AKAYESU 
Q. 21 FEB 97 
65 
perhaps I did not express myself 
correctly, that it was because of these 
experiences, of these first elections, 
that the Hutu political leaders were able 
to establish, in their own minds and in 
their followers' minds, the idea that a 
political majority can --or could be the 
same thing as an ethnic majority. 
Very well. At that time, after the 
referendum of September '61, we have 
independence, July 1st, 1962. And the 
government the structure of the 
government changes. We go from a 
monarchy to a much more modern state 
which is based along the lines of 
democracy. That is to say that the 
leaders of the Collines become 
bourgmestre --are replaced with 
bourgmestres, those who govern the 
provinces became the prefectures, and 
instead of a king we have a --a 
president. 
But, in your opinion, I don't know if 
you've studied this issue, but what was 
the rea

1 
2 
3 
4 
5 
6 
7 
8 
9 ,-
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 AKAYESU 21 FEB 97 
93 
(Tape No. 2, a.m., concludes and Tape No. 
3 , a. m. , begins . ) 
BY MR. MONTHE: (Cont'g.) 
Q. 
A. 
Q. 
A. 
Q. 
A. (Off microphone) --Hutu and Tutsi, were 
necessarily members of MRND. So, there 
were Tutsi and Hutu (sic) within the 
power --within the party? 
Yes. There were Huti --Hutu and Tutsi, 
there was no choice within the party and 
within the administration. 
If I was a member of the party, given the 
fact that I belonged to a party, which 
was leading the country, I could easily 
have obtained a post in the 
administration? 
No, not necessarily. Not necessarily. 
The fact of being a member of MRND was 
not adequate to make one a member of the 
to enable one to get an 
administrative post. 
But, from what you said, I seem to 
understand that the areas which were 
closed was particularly the army. But 
that there were people in the government. 
However, it's a mat

1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 AKAYESU 
Q. 
A. 21 FEB 97 
125 
attacks by the RPF, which discouraged 
some people within the country, and that 
there was need for a redefinition of the 
political conflict which was becoming an 
ethnic one. 
This is a matter which seems to me 
important for discussion. I would like 
to know if you can give us some examples, 
if possible. 
Could you please tell us how this --this 
methodology of spreading ethnic 
discrimination, how did it spread, how 
was it translated by the powers that be? 
Because you did tell us that, obviously, 
there was the economic crisis. There was 
the enemy, the RPF, outside the country. 
And the powers that be were trying to 
turn away from the true reasons. But how 
did this actually manifest itself in -­
in social life, in everyday life? 
What were the effects that were manifest, 
to --to say, for instance, that the 
Tutsi were discriminated against? 
If you were t

1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 
25 AKAYESU 
A. 
Q. 
A. 
Q. 
A. 
Q. 
A. 21 FEB 97 
radio in Rwanda; is that true? 
Yes, That's true. 154 
These radio press releases, communiques, 
did they indicate clearly that the Tutsi 
were going to attack Hutu in Bugesera? 
I don't have the specific words on hand. 
But I believe that if you're referring to 
the 
Report from our commission, you will find 
the data in this report as regards this 
broadcast. 
I did read your report and found it to be 
quite interesting, rest assured, Doctor. 
My question is the following. This Radio 
Rwanda broadcast, you said that the radio 
was directly under the head of the state; 
is that true? 
Yes, that's true. 
And you said that it was directly under 
the head of the state. So, could we 
therefore deduce that if this radio 
broadcast was broadcasted, then we could 
say that this had the approval or support 
of the presidency? 
In the Rwandan system, as I kno

1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 AKAYESU 21 FEB 97 
185 
so that we can find a solution. And it 
is in this objective that we have tried 
to contact the two parties. 
But I believe that the next time this 
is why I'm also saying that this is the 
last time, the next time it will be up to 
the prosecutor to approach the defence 
and discuss and come to the Tribunal to 
discuss --to make a proposal and the 
Tribunal will rule. It is not the rule 
of the game that the prosecution takes 
these decisions. That is not the rule of 
the game. That must be very clear. 
So, I thank you very much, Counsel, for 
your concessions. As far as I'm 
concerned, I wanted to ask you and I am 
grateful you accepted. This is the last 
time that I ask you to make concessions 
of this nature. Each person must be 
responsible. And I want to be very clear 
21 in my statement here. 
22 We shall adjourn the trial and it shall 
23 resume to hear the two witnesses on the 
24 6th