# Assignment 2: Detecting Contradictions in Law Documents
We have a Slovene dataset and an English dataset; we can work with either or combine both. The ultimate goal is to create a system that legal professionals could use: when drafting a new section of a law or contract, the system should automatically identify existing sections that potentially contradict it.

In [46]:
import numpy as np
import pandas as pd

## Task 1 - Data preparation and exploration       
Load the provided JSONL datasets and transform them into a format suitable for use with your model. For the **Slovene** dataset, this may involve *creating pairs of sections with binary labels (contradicting/not contradicting)*. For the **English** dataset, use the provided labels directly or merge classes into binary categories if appropriate.
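Merging the English labels into binary categories can be sketched as below. This is only an illustration on toy rows: the column name `is_contradiction` and the choice to treat everything except `Contradiction` as the negative class are assumptions, not part of the assignment text.

```python
import pandas as pd

# Toy rows standing in for the real data; the label names match the English dataset.
toy = pd.DataFrame({
    "hypothesis": ["h1", "h2", "h3"],
    "label": ["Entailment", "NotMentioned", "Contradiction"],
})

# Assumption: "Contradiction" is the positive class (1), every other label is negative (0).
toy["is_contradiction"] = (toy["label"] == "Contradiction").astype(int)

print(toy["is_contradiction"].tolist())  # [0, 0, 1]
```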

In [47]:
# for now, only the English dataset is used (the Slovene one may be added later)
# load the dataset: the English one comes in two files, train and test
en_train = pd.read_json('Seminar 2 datasets/data/English dataset/train.jsonl', lines=True)
en_test = pd.read_json('Seminar 2 datasets/data/English dataset/test.jsonl', lines=True)

#print(en_train.head())

# we don't need doc_id or key
en_train = en_train.drop(columns=['doc_id', 'key'])
en_test = en_test.drop(columns=['doc_id', 'key'])

labelVal_train = en_train['label'].value_counts()
labelVal_test = en_test['label'].value_counts()

print(labelVal_train)
print(labelVal_test)


label
Entailment       3530
NotMentioned     2820
Contradiction     841
Name: count, dtype: int64
label
Entailment       968
NotMentioned     903
Contradiction    220
Name: count, dtype: int64


Perform a basic exploratory analysis that answers the following questions:
* Are there missing values or duplicate entries?      
* How many examples are in the chosen dataset?
* How long are typical sections?
* Are contradictions balanced across the datasets?

In [48]:
missing_or_empty_train = en_train.isnull() | (en_train == "")
total_issues = missing_or_empty_train.sum()
print(f"Missing or empty values (like empty strings) in train set:\n{total_issues}")
missing_or_empty_test = en_test.isnull() | (en_test == "")
total_issues = missing_or_empty_test.sum()
print(f"Missing or empty values (like empty strings) in test set:\n{total_issues}")

# so: there are no NaN values, but 'premise' is an empty string in exactly the rows labelled NotMentioned

# drop those rows; note that this removes the NotMentioned class entirely
en_train = en_train[en_train["premise"] != ""]
en_test = en_test[en_test["premise"] != ""]

labelVal_train = en_train['label'].value_counts()
labelVal_test = en_test['label'].value_counts()

print(labelVal_train)
print(labelVal_test)


Missing or empty values (like empty strings) in train set:
premise       2820
hypothesis       0
label            0
dtype: int64
Missing or empty values (like empty strings) in test set:
premise       903
hypothesis      0
label           0
dtype: int64
label
Entailment       3530
Contradiction     841
Name: count, dtype: int64
label
Entailment       968
Contradiction    220
Name: count, dtype: int64


In [49]:

# check for duplicate rows (identical in every column)
dupl_train = en_train.value_counts()
dupl_train = dupl_train[dupl_train > 1]
print(dupl_train)

dupl_test = en_test.value_counts()
dupl_test = dupl_test[dupl_test > 1]
print(dupl_test)


premise  hypothesis

In [50]:
# getting rid of duplicates
en_train = en_train.drop_duplicates(keep="first")
en_test = en_test.drop_duplicates(keep='first')
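`drop_duplicates` only removes rows that match in every column. A possible extra check, sketched here on toy data, is to look for the same premise/hypothesis pair carrying different labels. The toy frame and the names `exact_dupes` and `conflicts` are illustrative; whether such conflicts occur in the real dataset is not known from the output above.

```python
import pandas as pd

# Toy frame with one exact duplicate (A, x) and one label conflict (B, y).
toy = pd.DataFrame({
    "premise": ["A", "A", "B", "B", "C"],
    "hypothesis": ["x", "x", "y", "y", "z"],
    "label": ["Entailment", "Entailment", "Entailment", "Contradiction", "Entailment"],
})

# Exact duplicates: every column matches (these are what drop_duplicates removes).
exact_dupes = toy[toy.duplicated(keep=False)]

# Conflicts: the same premise/hypothesis pair appears with more than one distinct label.
labels_per_pair = toy.groupby(["premise", "hypothesis"])["label"].nunique()
conflicts = labels_per_pair[labels_per_pair > 1]

print(len(exact_dupes))       # 2
print(list(conflicts.index))  # [('B', 'y')]
```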

In [51]:
# how many examples in a chosen dataset
print(f"Examples in a train set: {len(en_train)}")
print(f"Examples in a test set: {len(en_test)}")

Examples in a train set: 4292
Examples in a test set: 1165


In [53]:
# length of typical sections
en_train["premise_len"] = en_train["premise"].str.len()
en_train["hypothesis_len"] = en_train["hypothesis"].str.len()
en_test["premise_len"] = en_test["premise"].str.len()
en_test["hypothesis_len"] = en_test["hypothesis"].str.len()

print(en_train[["premise_len", "hypothesis_len"]].describe())
print(en_test[["premise_len", "hypothesis_len"]].describe())


       premise_len  hypothesis_len
count  4292.000000     4292.000000
mean    489.243942       98.609273
std     339.184711       27.187489
min      36.000000       66.000000
25%     262.000000       80.000000
50%     393.000000       95.000000
75%     606.000000      119.000000
max    3098.000000      162.000000
       premise_len  hypothesis_len
count  1165.000000     1165.000000
mean    487.676395       98.139914
std     353.770771       27.346670
min      64.000000       66.000000
25%     244.000000       80.000000
50%     382.000000       90.000000
75%     619.000000      119.000000
max    2489.000000      162.000000
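The lengths above are in characters. Since models operate on tokens rather than characters, a whitespace word count can serve as a rough additional proxy. A minimal sketch on made-up sentences (the column names `premise_chars` and `premise_words` are illustrative):

```python
import pandas as pd

# Two made-up premises standing in for real sections.
toy = pd.DataFrame({
    "premise": [
        "The Receiving Party shall keep all information secret.",
        "No licence is granted.",
    ],
})

# Character length, as in the cell above.
toy["premise_chars"] = toy["premise"].str.len()

# Whitespace-separated word count: split each string, then count the pieces.
toy["premise_words"] = toy["premise"].str.split().str.len()

print(toy["premise_words"].tolist())  # [8, 4]
```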


In [None]:
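# are contradictions balanced across datasets?
# normalize=True returns proportions instead of counts
print(en_train["label"].value_counts(normalize=True))
print(en_test["label"].value_counts(normalize=True))
# the label distribution is nearly identical in train and test (about 80% Entailment, 20% Contradiction),
# but the classes themselves are imbalanced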

label
Entailment       0.805918
Contradiction    0.194082
Name: proportion, dtype: float64
label
Entailment       0.813734
Contradiction    0.186266
Name: proportion, dtype: float64


Split the data into **train**, **validation**, and **test** sets. Prepare tokenized text representations suitable for traditional machine learning models.
Additionally, include a short **data presentation and visualization section**. Visualize the data distribution, text lengths, and example contradictions using tables and graphs.
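Because the English dataset already ships with a test file, one option is to carve the validation set out of the training data. The sketch below does a stratified hold-out with pandas only and builds a simple bag-of-words representation. The toy frame, `val_frac`, the seed, and the regex tokenizer are all assumptions chosen for illustration; scikit-learn's `train_test_split(..., stratify=...)` and `TfidfVectorizer` would be typical library alternatives.

```python
import re
from collections import Counter

import pandas as pd

# Toy stand-in for the cleaned training frame (same column names as en_train).
toy = pd.DataFrame({
    "premise": [f"Clause {i} shall remain confidential." for i in range(10)],
    "hypothesis": ["The party must keep information secret."] * 10,
    "label": ["Entailment"] * 8 + ["Contradiction"] * 2,
})

# Stratified hold-out: sample the same fraction from every label group,
# so the validation set keeps the 80/20 label ratio of the toy data.
val_frac = 0.5  # hypothetical value, chosen so the toy numbers come out even
val = toy.groupby("label").sample(frac=val_frac, random_state=0)
train = toy.drop(val.index)


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits (a deliberately simple scheme)."""
    return re.findall(r"[a-z0-9]+", text.lower())


# One token list per example: premise tokens followed by hypothesis tokens.
train_tokens = [
    tokenize(p) + tokenize(h)
    for p, h in zip(train["premise"], train["hypothesis"])
]

# Bag-of-words counts per example, with the vocabulary built from the training split only.
train_bow = [Counter(tokens) for tokens in train_tokens]
vocab = sorted({tok for tokens in train_tokens for tok in tokens})

print(len(train), len(val))
print(val["label"].value_counts().to_dict())
```

Building the vocabulary from the training split only keeps validation and test text from leaking into the features.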