**This notebook checks how relevant a cited case is to the input case that cites it**


*   For the initial stage, we use .txt files. We will update them during each phase.
*   This notebook uses 3 .txt files:

1.   Case1.txt - the input case
2.   Subcase1.txt - the sub-paragraphs of the input case in which the citation appears
3.   Citation1.txt - the whole case document that is being cited





To measure relevance, we use **cosine similarity**, a measure that is also common in extractive text summarization. Cosine similarity measures how similar two vectors are by taking the cosine of the angle between them. In our case, each text is represented as a TF-IDF vector. If two texts use the same terms in the same proportions, the angle between their vectors is zero and the score is 1. If they share no terms, the vectors are orthogonal and the score is 0. Because TF-IDF weights are non-negative, the score always lies between 0 and 1.
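To make the definition concrete, here is a minimal sketch of the formula cos θ = (a · b) / (‖a‖ ‖b‖) in plain Python. The function name `cosine_sim` and the toy vectors are illustrative assumptions; the notebook itself relies on scikit-learn's `cosine_similarity` further down.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

# Toy term-weight vectors, made up for illustration
same_direction = cosine_sim([1, 2, 0], [2, 4, 0])  # same proportions, different length
no_overlap = cosine_sim([1, 0, 0], [0, 3, 5])      # no shared non-zero terms
print(same_direction, no_overlap)
```

The first pair points in the same direction, so its score is 1 up to floating-point rounding; the second pair shares no non-zero component, so its score is 0.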

# 1. Uploading and reading necessary files

**For a human-generated legal document**

In [10]:
from google.colab import files

uploaded = files.upload()

Saving Case1.txt to Case1 (1).txt
Saving Citation1.txt to Citation1 (1).txt
Saving Subcase1.txt to Subcase1 (1).txt


In [11]:
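# Read the three input files.
# Note: the upload output above shows that Colab saved this upload as
# 'Case1 (1).txt' etc., which suggests files with the original names already
# existed. The paths below therefore read those earlier copies.
with open('Case1.txt', 'r') as file:
    original_case = file.read()

with open('Citation1.txt', 'r') as file:
    cited_case = file.read()

with open('Subcase1.txt', 'r') as file:
    sub_case = file.read()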

# 2. Text Preprocessing

In [12]:
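import re
import string

def preprocess_text(text):
    """Lowercase the text and strip ASCII punctuation."""
    text = text.lower()
    # re.escape keeps characters such as ']', '\' and '-' literal inside the character class
    text = re.sub('[' + re.escape(string.punctuation) + ']', '', text)
    # Planning to add more stop words and delimiters
    return text

original_case = preprocess_text(original_case)
cited_case = preprocess_text(cited_case)
sub_case = preprocess_text(sub_case)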

# 3. TF-IDF

Relevance between Case1.txt and Citation1.txt - This compares the whole input case with the cited case

In [13]:
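from sklearn.feature_extraction.text import TfidfVectorizer

# Fit one TF-IDF vocabulary on both documents so their vectors share the same feature space
vectorizer = TfidfVectorizer()
vectorizer.fit([original_case, cited_case])
# Each transform call returns a 1 x vocabulary-size sparse row vector
original_case_vector = vectorizer.transform([original_case])
cited_case_vector = vectorizer.transform([cited_case])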

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_score = cosine_similarity(original_case_vector, cited_case_vector)[0][0]
print(similarity_score)

0.7387157948898612


Relevance between Subcase1.txt and Citation1.txt - This compares the cited case with only the sub-paragraphs of the input case in which that particular citation is used

In [15]:
vectorizer = TfidfVectorizer()
vectorizer.fit([sub_case, cited_case])
sub_case_vector = vectorizer.transform([sub_case])
cited_case_vector = vectorizer.transform([cited_case])

similarity_score = cosine_similarity(sub_case_vector, cited_case_vector)[0][0]
print(similarity_score)

0.7143842702535722
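Both comparisons above repeat the same fit, transform and score steps, so they could be wrapped in a small helper. The sketch below is one way to do that; `tfidf_similarity` is a hypothetical name, not part of the notebook, and the toy strings are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(text_a, text_b):
    """Cosine similarity of two texts under a TF-IDF model fitted on just those two texts."""
    vectorizer = TfidfVectorizer()
    # fit_transform learns the shared vocabulary and returns one sparse row per text
    vectors = vectorizer.fit_transform([text_a, text_b])
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

# Toy inputs, made up for illustration
identical = tfidf_similarity("the court dismissed the appeal",
                             "the court dismissed the appeal")
disjoint = tfidf_similarity("alpha beta gamma", "delta epsilon zeta")
print(identical, disjoint)
```

Since `fit_transform` is equivalent to `fit` followed by `transform`, calling `tfidf_similarity(original_case, cited_case)` or `tfidf_similarity(sub_case, cited_case)` on the preprocessed texts should give the same scores as the cells above.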


**We observe that the relevance score between the whole document and the cited document is higher (0.74 vs. 0.71)**

*   This helps us decide whether to compare the whole document or only the sub-paragraph.
*   The exact sub-paragraph alone scored 0.43. Adding the two paragraphs before and after it, which brings the excerpt to about 70% of the whole text, raised the score to 0.71.
*   So, we have decided to use the whole document instead of sub-paragraphs.
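The context-widening step described above (the citing sub-paragraph plus a number of paragraphs on either side) can be sketched as follows. This is only an illustration: the function names are hypothetical, and splitting paragraphs on blank lines is an assumption about how the case files are laid out.

```python
def split_paragraphs(text):
    """Split on blank lines (an assumption about the layout of the case text)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def context_window(paragraphs, index, k):
    """Return paragraph `index` joined with up to k paragraphs before and after it."""
    start = max(0, index - k)
    end = min(len(paragraphs), index + k + 1)
    return "\n\n".join(paragraphs[start:end])

def coverage(excerpt, full_text):
    """Fraction of the full text, by character count, that the excerpt covers."""
    return len(excerpt) / len(full_text) if full_text else 0.0

# Toy document, made up for illustration; paragraph 2 plays the citing sub-paragraph
doc = "\n\n".join(f"paragraph {i}" for i in range(6))
paras = split_paragraphs(doc)
for k in (0, 1, 2):
    excerpt = context_window(paras, 2, k)
    print(k, round(coverage(excerpt, doc), 2))
```

Each excerpt can then be preprocessed and scored against the cited case exactly as in the cells above, which shows how the score changes as more surrounding context is included.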





**Testing an AI-generated case and its citation**

In [16]:
from google.colab import files

uploaded = files.upload()

Saving Case2.txt to Case2.txt
Saving Citation2.txt to Citation2.txt


In [17]:
#Reading files
with open('Case2.txt', 'r') as file:
    original_case = file.read()

with open('Citation2.txt', 'r') as file:
    cited_case = file.read()

In [18]:
# preprocess_text is already defined in section 2, so it is reused here.
# sub_case is not re-processed: no sub-paragraph file was uploaded for Case2,
# so that variable still holds the text from Subcase1.txt.
original_case = preprocess_text(original_case)
cited_case = preprocess_text(cited_case)

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectorizer.fit([original_case, cited_case])
original_case_vector = vectorizer.transform([original_case])
cited_case_vector = vectorizer.transform([cited_case])

In [20]:
from sklearn.metrics.pairwise import cosine_similarity

similarity_score = cosine_similarity(original_case_vector, cited_case_vector)[0][0]
print(similarity_score)

0.6120052702656775


**The similarity score (0.61) is somewhat lower than the score for the human-generated document (0.74)**

Possible future work

*   Work on multiple datasets for both human-generated and AI-generated cases
*   Analyse which scores count as good, moderate or poor by using mean scores
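As a rough sketch of the mean-score analysis, the snippet below summarises similarity scores per group. The function name `summarise` and the group labels are assumptions made for illustration. The only data used are the two whole-document scores reported in this notebook, so the output is not yet meaningful; more case pairs would be needed before drawing any thresholds.

```python
from statistics import mean, pstdev

def summarise(scores_by_group):
    """Return {group: (mean, population std dev, n)} for each non-empty group of scores."""
    return {
        group: (mean(scores), pstdev(scores), len(scores))
        for group, scores in scores_by_group.items()
        if scores  # skip groups with no scores yet
    }

# The only scores available so far: one whole-document score per group
scores = {
    "human": [0.7387157948898612],
    "ai": [0.6120052702656775],
}
summary = summarise(scores)
for group, (avg, spread, n) in summary.items():
    print(f"{group}: mean={avg:.3f} std={spread:.3f} n={n}")
```

Once each group holds several scores, the per-group mean and spread give a first basis for deciding which score ranges to treat as good, moderate or poor.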

