# Web scraping of tribunal transcript cases

This notebook shows how to retrieve tribunal case transcripts through web scraping and clean them into plain text.

Following general ethical guidelines for web scraping, the scraping permissions were checked for the three tribunals:
- [International Criminal Tribunal for the former Yugoslavia](https://www.icty.org/); its permissions are published in [robots.txt](https://www.icty.org/robots.txt)
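The permission check can also be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`; the helper name `is_allowed` and the sample rules are assumptions made for illustration and do not reproduce the real ICTY `robots.txt`.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules in `robots_txt` let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the lines of the file, so no network access is needed here
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; NOT the real ICTY robots.txt
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.org/x/cases/transcript.htm"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.org/admin/login"))
```

To check a live site instead, `RobotFileParser` can download the file itself through `set_url(...)` followed by `read()`.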

### Imports

In [19]:
%load_ext autoreload
%autoreload 2

import requests
from bs4 import BeautifulSoup
from os import listdir
from os.path import isfile, join

import src.cleaning_transcripts as cleaning_transc

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


### Globals

In [27]:
GLB_FILE_ENCODING_UTF8 = "utf8"
GLB_FILE_WRITE_OP = "w"
GLB_HTML_P_TAG = "p"
GLB_HTML_PARSER = "html.parser"

GLB_PATH_OUTPUT_DIRECTORY_ICTY = "output_cleaning_transcripts/icty"
GLB_EXTENSION_TXT = ".txt"

GLB_CHAR_NEWLINE = "\n"

GLB_COURT_PREFIX_FILE_ICTY = "ICTY_"

## International criminal tribunal for the former Yugoslavia

In [21]:
# Extracting information from the "International criminal tribunal for the former Yugoslavia"
transcript_icty_case_url = "https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm"
response = requests.get(transcript_icty_case_url)

In [22]:
# Get just the content of the page
#response.__dict__["_content"]

content_soup = BeautifulSoup(response.content, "html.parser")

# Print the parsed content as indented HTML
#print(content_soup.prettify())

# Get the list of <p> elements (the transcript text is stored in paragraphs)
list_p = content_soup.find_all("p")
print(f'Number of retrieved paragraphs of transcript {transcript_icty_case_url} is {len(list_p)}')

Number of retrieved paragraphs of transcript https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm is 2756


In [23]:
#print(list_p[0:20])

In [24]:
#import src.cleaning_transcripts as cleaning_transc
counter = 0
for paragraph in list_p:
    clean_paragraph = cleaning_transc.cleanParagraphsICFYtranscript(str(paragraph))
    if clean_paragraph != cleaning_transc.GLB_EMPTY_STRING:
        counter+=1
        print(clean_paragraph)

THE INTERNATIONAL CRIMINAL TRIBUNAL 		CASE NO. IT-94-1-T
FOR THE FORMER YUGOSLAVIA
IN THE TRIAL CHAMBER
Friday, 19th July 1996
SUADA RAMIC, recalled.
THE PRESIDING JUDGE:  Miss Hollis, could you continue, please?
MISS HOLLIS:  Yes, your Honour.
Examined by MISS HOLLIS, continued.
THE PRESIDING JUDGE:  You may be seated, Mrs. Ramic.  You are still under
oath, the oath that you took yesterday. You may be seated.  Thank you.
MISS HOLLIS:  Perhaps we could turn off one of those microphones and, Mrs.
Ramic, if you could sit back a bit from the microphone?  Thank you.
Mrs. Ramic, when we finished yesterday you had indicated that after
the events that occurred in the military barracks you a few days later
went to your village to visit your brother and after some time you
came back again to your home in Prijedor.  Do you recall your home
being searched after your return to Prijedor?
A.  Yes.
Q.  Did you recognise any of the people who searched your home?
A.  Yes.
Q.  Who did you recognise?
A. 
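The cleaning function `cleanParagraphsICFYtranscript` lives in `src.cleaning_transcripts` and is not shown in this notebook. The block below is only a hypothetical stand-in that sketches the kind of work such a function might do: strip the HTML tags, decode entities, and return an empty string for paragraphs that carry no transcript text. The name `clean_paragraph_sketch`, the constant `EMPTY_STRING`, and the page-marker rule are all assumptions, not the author's implementation.

```python
import html
import re

EMPTY_STRING = ""  # hypothetical counterpart of cleaning_transc.GLB_EMPTY_STRING

_TAG_PATTERN = re.compile(r"<[^>]+>")
# Assumed rule: a paragraph holding only a page marker carries no transcript text
_PAGE_MARKER_PATTERN = re.compile(r"^(Page\s+)?\d+$", re.IGNORECASE)


def clean_paragraph_sketch(raw_paragraph: str) -> str:
    """Strip markup from one <p> element and drop content-free paragraphs."""
    text = _TAG_PATTERN.sub("", raw_paragraph)  # remove HTML tags
    text = html.unescape(text)                  # decode entities such as &amp;
    text = text.replace("\xa0", " ").strip()    # normalise non-breaking spaces
    if _PAGE_MARKER_PATTERN.match(text):        # e.g. "Page 12" or a bare "12"
        return EMPTY_STRING
    return text


print(clean_paragraph_sketch("<p><b>MISS HOLLIS:</b> Yes, your Honour.</p>"))
```

Returning an empty string rather than `None` keeps the caller's filter a plain string comparison, which matches how the loop above tests against `GLB_EMPTY_STRING`.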

#### Generate document

In [30]:
list_url_cases_icty = [transcript_icty_case_url]

for index_case, url_html_case in enumerate(list_url_cases_icty):
    response = requests.get(url_html_case)

    content_soup = BeautifulSoup(response.content, GLB_HTML_PARSER)
    list_p = content_soup.find_all(GLB_HTML_P_TAG)

    # The case id is the file name in the URL without its extension, e.g. "960719ed"
    id_case = url_html_case[url_html_case.rindex("/")+1: url_html_case.rindex(".")]
    path_output_file = join(GLB_PATH_OUTPUT_DIRECTORY_ICTY, GLB_COURT_PREFIX_FILE_ICTY + id_case + GLB_EXTENSION_TXT)

    # Write only the non-empty cleaned paragraphs, one per line
    counter = 0
    with open(path_output_file, GLB_FILE_WRITE_OP, encoding=GLB_FILE_ENCODING_UTF8) as f:
        for paragraph in list_p:
            clean_paragraph = cleaning_transc.cleanParagraphsICFYtranscript(str(paragraph))
            if clean_paragraph != cleaning_transc.GLB_EMPTY_STRING:
                counter += 1
                f.write(clean_paragraph + GLB_CHAR_NEWLINE)

    # Report the URL processed in this iteration, not the global example URL
    print(f'{index_case+1}) Number of retrieved paragraphs of transcript {url_html_case} is {len(list_p)} was reduced to {counter}')

1) Number of retrieved paragraphs of transcript https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm is 2756 was reduced to 2486
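The output file name is derived from the last segment of the transcript URL. As an alternative to slicing with `rindex`, the sketch below uses `urllib.parse` and `pathlib` from the standard library, which also copes with URLs that carry a query string. The helper name `output_name_for` is an assumption; its default prefix and extension mirror `GLB_COURT_PREFIX_FILE_ICTY` and `GLB_EXTENSION_TXT`.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def output_name_for(url: str, prefix: str = "ICTY_", extension: str = ".txt") -> str:
    """Build the output file name from the last path segment of `url`."""
    # urlparse isolates the path, so query strings and fragments are ignored
    stem = PurePosixPath(urlparse(url).path).stem
    return f"{prefix}{stem}{extension}"


print(output_name_for("https://www.icty.org/x/cases/tadic/trans/en/960719ed.htm"))
# ICTY_960719ed.txt
```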
