# Legal Data Viewer

This notebook displays the original legal documents alongside the clauses extracted from them, filtered by topic (Privacy, Liability).

In [1]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

In [2]:
# Load the original documents and the filtered clauses written by src/preprocessing.py
original_path = '../data/parquets/mistral_instruction_data.parquet'
processed_path = '../data/preprocessed/filtered_clauses.parquet'

try:
    df_orig = pd.read_parquet(original_path)
    print(f"Original Data Loaded: {df_orig.shape}")
except FileNotFoundError:
    print("Original data file not found.")

try:
    df_proc = pd.read_parquet(processed_path)
    print(f"Processed Data Loaded: {df_proc.shape}")
except FileNotFoundError:
    print("Processed data file not found. Please run src/preprocessing.py first.")

Original Data Loaded: (100, 4)
Processed Data Loaded: (107, 3)
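
The loading cell leaves `df_orig` or `df_proc` undefined when a file is missing, which is why the later cells guard with `'df_proc' in locals()`. One possible alternative, sketched here under the assumption that a small helper is acceptable, returns `None` for a missing file so later cells can test `is not None` instead. The helper name `load_parquet_or_none` and its `label` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def load_parquet_or_none(path, label):
    """Return the DataFrame stored at `path`, or None if the file is missing.

    Hypothetical helper for illustration; `label` is only used in messages.
    """
    if not Path(path).exists():
        # Report the missing file and signal it to the caller with None.
        print(f"{label} file not found: {path}")
        return None
    df = pd.read_parquet(path)
    print(f"{label} loaded: {df.shape}")
    return df
```

With this in place, a call such as `df_proc = load_parquet_or_none(processed_path, "Processed data")` always binds the name, and a guard becomes `if df_proc is not None:`.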


## Statistics

In [3]:
if 'df_proc' in locals():
    print("Topic Distribution:")
    print(df_proc['matched_topic'].value_counts())

Topic Distribution:
matched_topic
Privacy      54
Liability    53
Name: count, dtype: int64
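
Raw counts are easier to compare across runs when paired with percentage shares. The following is a minimal sketch, assuming the topic column is named `matched_topic` as in the output above. The function name `topic_summary` is hypothetical, and the toy frame is made-up data, not a sample of the real dataset.

```python
import pandas as pd


def topic_summary(df, column="matched_topic"):
    """Return clause counts and percentage shares per topic."""
    counts = df[column].value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "share_pct": share})


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({"matched_topic": ["Privacy", "Privacy", "Privacy", "Liability"]})
summary = topic_summary(toy)
print(summary)
```

In the notebook, `topic_summary(df_proc)` would be the equivalent call.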


## Sample Verification

Let's look at a few extracted clauses and their original context (if mapped).

In [4]:
if 'df_proc' in locals():
    print("--- Sample Extracted Clauses ---")
    display(df_proc.sample(5))

--- Sample Extracted Clauses ---


Unnamed: 0,original_index,clause_text,matched_topic
6,1,"IDENTIFICATION, REVIEW, AND PUBLIC DISCLOSURE OF HUMAN RIGHTS \n RECORDS REGARDING GUATEMALA AND HONDURAS.",Privacy
41,13,The initial privacy impact analysis or a summary \n shall be signed by the senior agency official with primary \n responsibility for privacy policy and be published in the \n Federal Register at the time of the publication of a general \n notice of proposed rulemaking for the rule.,Privacy
96,87,"(6) Drug.--The term ``drug'' has the meaning given such \n term in section 201(g)(1) of the Federal Food, Drug, and \n Cosmetic Act (21 U.S.C. 321(g)(1)).\n (7) Economic damages.--The term ``economic damages'' means \n objectively verifiable monetary losses incurred as a result of \n the provision of, use of, or payment for (or failure to \n provide, use, or pay for) health care services or medical \n products such as past and future medical expenses, loss of past \n and future earnings, cost of obtaining domestic services, loss \n of employment, loss due to death, burial costs, and loss of \n business or employment opportunities.",Liability
38,13,2. REQUIREMENT THAT AGENCY RULEMAKING TAKE INTO CONSIDERATION \n IMPACTS ON INDIVIDUAL PRIVACY.,Privacy
69,36,and\n (iii) the researcher has in place \n appropriate safeguards to protect the privacy \n and confidentiality of any information about \n identifiable individuals,Privacy
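
A plain `sample(5)` changes on every run and can under-represent one topic, as in the draw above. A possible variation, sketched here on toy data, draws a fixed number of clauses per topic with a fixed seed. The toy frame, the seed value, and `n=2` are illustrative assumptions.

```python
import pandas as pd

# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "clause_text": [f"clause {i}" for i in range(6)],
    "matched_topic": ["Privacy"] * 3 + ["Liability"] * 3,
})

# Draw two clauses from each topic; random_state makes the draw repeatable.
per_topic = toy.groupby("matched_topic").sample(n=2, random_state=0)
print(per_topic)
```

Applied to the notebook's data, the same call on `df_proc` would yield an equal number of Privacy and Liability clauses on every run.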


In [5]:
if 'df_proc' in locals() and 'df_orig' in locals():
    # Pick one random clause and take the index of its source document
    sample_idx = df_proc.sample(1).iloc[0]['original_index']
    
    print(f"--- Original Document (Index: {sample_idx}) ---")
    print(df_orig.loc[sample_idx, 'input'][:1000] + "...") # Truncate for display
    
    print(f"\n--- Extracted Clauses for Index {sample_idx} ---")
    display(df_proc[df_proc['original_index'] == sample_idx])

--- Original Document (Index: 1) ---
SECTION 1. SHORT TITLE.

    This Act may be cited as the ``Human Rights Information Act''.

SEC. 2. FINDINGS.

    Congress finds the following:
            (1) The people of the United States consider the national 
        and international protection and promotion of human rights and 
        the rule of law the most important values of any democracy. The 
        founding fathers defined human rights prominently in the Bill 
        of Rights, giving those rights a special priority and 
        protection in the Constitution.
            (2) Federal agencies are in possession of documents 
        pertaining to gross human rights violations abroad which are 
        needed by foreign authorities to document, investigate, and 
        subsequently prosecute instances of continued and systematic 
        gross human rights violations, including those directed against 
        citizens of the United States.
            (3) The United States will co

Unnamed: 0,original_index,clause_text,matched_topic
6,1,"IDENTIFICATION, REVIEW, AND PUBLIC DISCLOSURE OF HUMAN RIGHTS \n RECORDS REGARDING GUATEMALA AND HONDURAS.",Privacy
7,1,"(a) In General.--Notwithstanding any other provision of law, the \nprovisions of this Act shall govern the declassification and public \ndisclosure of human rights records by agencies.",Privacy
8,1,"SEC. 5. GROUNDS FOR POSTPONEMENT OF PUBLIC DISCLOSURE OF RECORDS.\n\n (a) In General.--An agency may postpone public disclosure of a \nhuman rights record or particular information in a human rights record \nonly if the agency determines that there is clear and convincing \nevidence that--\n (1) the threat to the military defense, intelligence \n operations, or conduct of foreign relations of the United \n States raised by public disclosure of the human rights record \n is of such gravity that it outweighs the public interest, and \n such public disclosure would reveal--\n (A) an intelligence agent whose identity currently \n requires protection",Privacy
9,1,and\n (iii) the disclosure of which would \n interfere with the conduct of intelligence \n activities,Privacy
10,1,"or\n (C) any other matter currently relating to the \n military defense, intelligence operations, or conduct \n of foreign relations of the United States, the \n disclosure of which would demonstrably impair the \n national security of the \n United States",Privacy
11,1,(2) the public disclosure of the human rights record would \n reveal the name or identity of a living individual who provided \n confidential information to the United States and would pose a \n substantial risk of harm to that individual,Privacy
12,1,"(3) the public disclosure of the human rights record could \n reasonably be expected to constitute an unwarranted invasion of \n personal privacy, and that invasion of privacy is so \n substantial that it outweighs the public interest",Privacy
13,1,"or\n (4) the public disclosure of the human rights record would \n compromise the existence of an understanding of confidentiality \n currently requiring protection between a Government agent and a \n cooperating individual or a foreign government, and public \n disclosure would be so harmful that it outweighs the public \n interest.",Privacy
14,1,"(b) Special Treatment of Certain Information.--It shall not be \ngrounds for postponement of disclosure of a human rights record that an \nindividual named in the human rights record was an intelligence asset \nof the United States Government, although the existence of such \nrelationship may be withheld if the criteria set forth in subsection \n(a) are met.",Privacy
15,1,(a) Duties of the Appeals Panel.--The Interagency Security \nClassification Appeals Panel or any other entity subsequently \nestablished by law or Executive order and charged with carrying out the \nfunctions currently carried out by such Panel (referred to in this Act \nas the ``Appeals Panel'') shall review all determinations by an agency \nto postpone public disclosure of any human rights record.,Privacy
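
Spot checks like the one above cover a single document. A broader check, sketched here, counts clauses per source document and lists any clause whose `original_index` has no matching row in the original data. It assumes that `original_index` holds labels of `df_orig.index`, which is how the lookup cell above uses it. The function name `check_mapping` and the toy frames are hypothetical.

```python
import pandas as pd


def check_mapping(df_proc, df_orig, key="original_index"):
    """Summarize how clauses map back to source documents.

    Returns (clauses_per_doc, orphaned): clause counts per source document,
    and the clause rows whose key has no matching label in df_orig.index.
    """
    clauses_per_doc = df_proc.groupby(key).size()
    orphaned = df_proc[~df_proc[key].isin(df_orig.index)]
    return clauses_per_doc, orphaned


# Toy frames for illustration only; not taken from the real dataset.
toy_orig = pd.DataFrame({"input": ["doc zero", "doc one"]}, index=[0, 1])
toy_proc = pd.DataFrame({
    "original_index": [1, 1, 5],
    "clause_text": ["a", "b", "c"],
    "matched_topic": ["Privacy", "Privacy", "Liability"],
})
per_doc, orphaned = check_mapping(toy_proc, toy_orig)
print(per_doc)
print(orphaned)
```

An empty `orphaned` frame on the real data would indicate that every extracted clause can be traced back to a loaded document.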
