## Generating the dataset

A dataframe is created for each article. In each dataframe:

- Each row represents a legal case from the ECHR.

- The columns are the case id (`Itemid`), the path of the text document (`Document_path`), the judgement text (`Judgement_text`), and the judicial decision (`Judgement`).

While creating the dataframes, the following cleaning steps were applied:

1. Any case with missing judgement text was marked as 'unavailable'.

2. Each dataframe was shuffled. Merging the violation and no-violation cases to form an article's dataframe left all violation rows on top, followed by the no-violation rows.
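The two cleaning steps can be sketched on a toy dataframe. This is only an illustration: the case ids and texts are made up, and `random_state` is an assumption added so that the shuffle is repeatable (the notebook itself shuffles without a seed).

```python
import pandas as pd

# Toy stand-in for one article's dataframe: violation rows first, then no-violation.
toy_df = pd.DataFrame({
    "Itemid": ["001-1", "001-2", "001-3", "001-4"],
    "Judgement_text": ["text a", "", "text c", "text d"],
    "Judgement": ["violation", "violation", "no-violation", "no-violation"],
})

# Step 1: mark empty judgement text as 'unavailable'.
toy_df["Judgement_text"] = toy_df["Judgement_text"].replace("", "unavailable")

# Step 2: shuffle the rows so the two labels are no longer grouped together.
# random_state is a hypothetical seed, used here only for reproducibility.
shuffled_df = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)

print(shuffled_df)
```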

The dataframes were then combined into a single dataframe and stored in a PostgreSQL database.

In [17]:
from os import path, listdir, mkdir
import pandas as pd
import numpy as np
import psycopg2
from sqlalchemy import create_engine

In [2]:
# dataframe for each article 
def article_df(input_folder, article_number):
    ''' Collect data for textual analysis and make dataframe for each article 
    
    parameters
    ----------
    input_folder : str
                   path for where judgement documents per article is stored
    article_number : str
                     article number to retrieve the judgement documents 
    '''
    
    art = 'article_{}'.format(article_number)
    folder = path.join(input_folder, art)
    judgement = ['violation', 'no-violation']
    df_list = []
    
    for j in judgement:
        df = pd.read_csv(path.join(folder, 'case_outcome.csv'))
        id_list = list(df[df.Judgement == j].Itemid) # get case id list from case outcome df
        
        doc_list = listdir(path.join(folder, j)) # get list of available text docs in the folder
        doc_name_list = [x.replace('_Judgement_text.txt', '') for x in doc_list]
    
        # find missing cases  
        id_list.sort()
        doc_name_list.sort()
        doc_path = []
        text_list = []
        if id_list == doc_name_list: 
            for i in id_list:
                text_path = '{}_Judgement_text.txt'.format(i)
                doc_path.append(text_path)
                # loading the text
                # the with-block closes the file automatically
                with open(path.join(folder, j, text_path), 'r') as t:
                    text = t.read()
                    text_list.append(text)
            # check empty textfile
            if '' in text_list:
                empty_count = text_list.count('')
                text_list = ['unavailable' if doc == '' else doc for doc in text_list]
                print("{} missing {} documents in {}".format(empty_count ,j, art))
            else:
                print ("No missing {} documents in {}".format(j, art)) 
                
        # missing docs            
        else: 
            diff_len = len(id_list) - len(doc_name_list)
            missing_id = np.setdiff1d(id_list, doc_name_list)
            for i in id_list:
                text_path = '{}_Judgement_text.txt'.format(i)
                if i in missing_id:
                    doc_path.append('unavailable')
                    text_list.append('unavailable')
                else:
                    doc_path.append(text_path)
                    # the with-block closes the file automatically
                    with open(path.join(folder, j, text_path), 'r') as t:
                        text = t.read()
                        text_list.append(text)
            # check empty textfile
            if '' in text_list:
                empty_count = text_list.count('')
                text_list = ['unavailable' if doc == '' else doc for doc in text_list]
                print ("{} missing {} documents in {}".format(diff_len + empty_count, j, art))
            else:
                print ("{} missing {} documents in {}".format(diff_len, j, art))
                
        data_frame = pd.DataFrame(list(zip(id_list, doc_path, text_list)), 
                                  columns = ['Itemid', 'Document_path', 'Judgement_text'])
        data_frame['Judgement'] = j
        df_list.append(data_frame)

    article_df = pd.concat([df_list[0], df_list[1]], ignore_index = True)
    
    return article_df
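The missing-document check in the function above compares the case ids from `case_outcome.csv` with the text files found on disk. A minimal sketch of that comparison using a set difference is shown here. The helper name `find_missing_ids` and the sample ids are hypothetical, and a temporary folder stands in for the real `HUDOC_data` directory.

```python
import tempfile
from pathlib import Path

SUFFIX = "_Judgement_text.txt"

def find_missing_ids(case_ids, doc_folder):
    """Return the case ids that have no text file in doc_folder, sorted."""
    # Strip the common suffix from each file name to recover the case id.
    available = {p.name[: -len(SUFFIX)] for p in Path(doc_folder).glob("*" + SUFFIX)}
    return sorted(set(case_ids) - available)

# Usage on a throwaway folder with made-up ids: '001-2' has no text file.
with tempfile.TemporaryDirectory() as tmp:
    for case_id in ["001-1", "001-3"]:
        (Path(tmp) / (case_id + SUFFIX)).write_text("some text")
    missing = find_missing_ids(["001-1", "001-2", "001-3"], tmp)

print(missing)  # ['001-2']
```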

In [3]:
input_folder = './HUDOC_data/docs_per_article'
article_number = ['2', '3', '5', '6', '8', '10', '11', '13', '14']

In [4]:
article_df_list = []
for a in article_number:
    df = article_df(input_folder, a)
    article_df_list.append(df)

4 missing violation documents in article_2
1 missing no-violation documents in article_2
9 missing violation documents in article_3
3 missing no-violation documents in article_3
8 missing violation documents in article_5
No missing no-violation documents in article_5
32 missing violation documents in article_6
7 missing no-violation documents in article_6
8 missing violation documents in article_8
4 missing no-violation documents in article_8
3 missing violation documents in article_10
No missing no-violation documents in article_10
1 missing violation documents in article_11
No missing no-violation documents in article_11
13 missing violation documents in article_13
2 missing no-violation documents in article_13
4 missing violation documents in article_14
3 missing no-violation documents in article_14


In [5]:
# shuffle rows of each df
article_df_list = [df.sample(frac = 1).reset_index(drop = True) for df in article_df_list]

In [None]:
mkdir('./articles')
for a in range(0, 9):
    # saving dataframe for each article as csv file
    article_df_list[a].to_csv('articles/article_{}'.format(article_number[a]), index = False)
    # provide article label to indicate which dataframe the entries come from
    article_df_list[a]['Article'] = article_number[a]

In [16]:
# concatenate the dataframes
concat_df = pd.concat(article_df_list)
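One detail to be aware of: each article dataframe had its index reset after shuffling, so a plain `pd.concat` keeps those per-article labels and the combined index contains repeats. The sketch below uses made-up rows to show the difference that `ignore_index=True` makes. Whether repeated labels matter depends on how the index is used later, for example whether it is written out as a column.

```python
import pandas as pd

# Two toy article dataframes, each with its own 0-based index.
part_a = pd.DataFrame({"Itemid": ["001-1", "001-2"], "Article": "2"})
part_b = pd.DataFrame({"Itemid": ["001-3", "001-4"], "Article": "3"})

kept = pd.concat([part_a, part_b])                      # keeps the original labels
fresh = pd.concat([part_a, part_b], ignore_index=True)  # renumbers the rows

print(list(kept.index))   # [0, 1, 0, 1]
print(list(fresh.index))  # [0, 1, 2, 3]
```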

In [21]:
# store it into PostgreSQL
engine = create_engine("postgresql://postgres:xfkLVeMj@localhost/hudoc")
con = engine.connect()
table_name = 'judgement_text'
concat_df.to_sql(table_name, con)
print(engine.table_names())
con.close()

['case_info_14', 'case_info_10', 'case_info_p1-1', 'case_info_6', 'case_info_13', 'case_info_5', 'case_info_3', 'case_info_8', 'case_info_34', 'case_info_2', 'case_info_11', 'case_info', 'judgement_text']
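`Engine.table_names()` was deprecated in SQLAlchemy 1.4 and removed in 2.0, so on newer versions the table listing can be obtained through the inspector instead. The following is a minimal sketch of the same store-and-list step under some assumptions: an in-memory SQLite database stands in for the PostgreSQL instance so the example runs anywhere, and the table name `judgement_text_demo` and the rows are made up. `if_exists="replace"` and `index=False` are optional choices, not what the cell above used.

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

# In-memory SQLite as a stand-in for the PostgreSQL database (assumption).
engine = create_engine("sqlite:///:memory:")

demo_df = pd.DataFrame({
    "Itemid": ["001-1", "001-2"],
    "Judgement_text": ["text a", "unavailable"],
    "Judgement": ["violation", "no-violation"],
    "Article": ["2", "3"],
})

# engine.begin() opens a connection and commits when the block exits.
with engine.begin() as con:
    # if_exists="replace" lets the cell be re-run; index=False skips the index column.
    demo_df.to_sql("judgement_text_demo", con, if_exists="replace", index=False)

# The inspector replaces the removed Engine.table_names().
print(inspect(engine).get_table_names())  # ['judgement_text_demo']
```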
