In [None]:
# AI Cleaning of Content Section

In [1]:
import pandas as pd
import numpy as np
import google.generativeai as genai
import os
import dotenv
import time
from datetime import datetime
import json
from tqdm import tqdm


In [2]:


df=pd.read_csv(r"Title18.csv", encoding='latin-1')
print(df.shape)
print(df.columns)

(1647, 4)
Index(['Section', 'Url', 'Content', 'Metadata'], dtype='object')


In [3]:
df['Content'][1]

'Whoever, within the special maritime and territorial jurisdiction of the United States, by force and violence, or by intimidation, takes or attempts to take from the person or presence of another anything of value, shall be imprisoned not more than fifteen years.(June 25, 1948, ch. 645, 62 Stat. 796; Pub. L. 103Ã¢Â\x80Â\x93322, title XXXII, Ã\x82Â§320903(a)(1), Sept. 13, 1994, 108 Stat. 2124.)Historical and Revision NotesBased on title 18, U.S.C., 1940 ed., Ã\x82Â§463 (Mar. 4, 1909, ch. 321, Ã\x82Â§284, 35 Stat. 1144).Words "within the special maritime and territorial jurisdiction of the United States" were added to restrict the place of the offense to those places described in section 451 of title 18, U.S.C., 1940 ed., now section 7 of this title.Minor changes were made in phraseology.Editorial NotesAmendments1994-Pub. L. 103Ã¢Â\x80Â\x93322 inserted "or attempts to take" after "takes".Statutory Notes and Related SubsidiariesShort Title of 1996 AmendmentPub. L. 104Ã¢Â\x80Â\x93217, Ã\x

In [4]:


# Load environment variables from .env file
dotenv.load_dotenv()
genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))

In [7]:
systemprompt1= """You are a helpful expert AI Assistant on United States Code for Title 18. Your job is to preprocess and clean the text from the title 18 excrepts.
You will be given with law text from Title 18, remove any unnecessary non english characters and unrelated text to law and return the clean and meaningful law text. Please do not modify or alter the meaning of the text.   """


systemprompt= """ You are an expert AI Assistant specializing in the United States Code, specifically Title 18 (Crimes and Criminal Procedure). Your primary function is to process and clean text excerpts from this title.

**Task:**

You will receive text input containing legal information extracted from Title 18 of the U.S. Code. Your task is to preprocess and clean this text, ensuring it is suitable for further analysis or use.  This involves several specific steps:

1. **Remove Non-English Characters:**  Identify and remove any characters that are not part of standard English text. This includes, but is not limited to:
    * Special symbols (e.g., Ã‚Â, ¶, †, ‡, /x,/n, Next >>[Print], << Previous, Result 1 of 1 , etc., except for those that are standard legal symbols like the section symbol §).
    * Control characters.
    * Characters from other languages (e.g., accented characters, characters from non-Latin alphabets).
    * Extraneous punctuation or symbols not typically found in legal text.

2. **Remove Unrelated Text:**  Identify and remove any text that is not directly related to the legal content of Title 18. This may include:
    * Headings, subheadings, or titles if they are redundant or not part of the core legal text. If headings are essential to understanding the structure of the law, keep them.
    * Editorial notes, annotations, or commentary that are not part of the official legal text.
    * Footnotes, unless they contain essential legal information. If you keep footnotes, ensure they are formatted clearly and consistently.
    * References to other sections or titles, unless they are crucial for understanding the current excerpt. If kept, ensure they are formatted consistently.
    * Table of contents entries.
    * Anything that is clearly not part of the codified law itself.

3. **Preserve Meaning:**  Crucially, you **must not** modify or alter the meaning of the legal text.  Your cleaning process should only remove extraneous or non-essential elements; it should not change the legal content in any way.

4. **Return Clean and Meaningful Text:** The output you provide should be the cleaned and processed legal text. It should be grammatically correct, clearly formatted, and ready for further use.  Maintain the original structure and hierarchy of the legal text where possible (e.g., keep paragraphs, subsections, etc.).  If the input text includes citations to other legal sources, maintain the format of those citations.

**Example Input:**
"Whoever, within the special maritime and territorial jurisdiction of the United States, by force and violence, or by intimidation, takes or attempts to take from the person or presence of another anything of value, shall be imprisoned not more than fifteen years.(June 25, 1948, ch. 645, 62 Stat. 796; Pub. L. 103Ã¢Â€Â“322, title XXXII, Ã‚Â§320903(a)(1), Sept. 13, 1994, 108 Stat. 2124.)Historical and Revision NotesBased on title 18, U.S.C., 1940 ed., Ã‚Â§463 (Mar. 4, 1909, ch. 321, Ã‚Â§284, 35 Stat. 1144).Words "within the special maritime and territorial jurisdiction of the United States" were added to restrict the place of the offense to those places described in section 451 of title 18, U.S.C., 1940 ed., now section 7 of this title.Minor changes were made in phraseology.Editorial NotesAmendments1994-Pub. L. 103Ã¢Â€Â“322 inserted "or attempts to take" after "takes".Statutory Notes and Related SubsidiariesShort Title of 1996 AmendmentPub. L. 104Ã¢Â€Â“217, Ã‚Â§1, Oct. 1, 1996, 110 Stat. 3020, provided that: "This Act [amending section 2119 of this title] may be cited as the 'Carjacking Correction Act of 1996'.""

**Example Output (Cleaned):**
"Whoever, within the special maritime and territorial jurisdiction of the United States, by force and violence, or by intimidation, takes or attempts to take from the person or presence of another anything of value, shall be imprisoned not more than fifteen years.
(June 25, 1948, ch. 645, 62 Stat. 796 ; Pub. L. 103–322, title XXXII, §320903(a)(1), Sept. 13, 1994, 108 Stat. 2124 .)
Historical and Revision Notes
Based on title 18, U.S.C., 1940 ed., §463 (Mar. 4, 1909, ch. 321, §284, 35 Stat. 1144 ).
Words "within the special maritime and territorial jurisdiction of the United States" were added to restrict the place of the offense to those places described in section 451 of title 18, U.S.C., 1940 ed., now section 7 of this title.
Minor changes were made in phraseology.


Editorial Notes
Amendments
1994-Pub. L. 103–322 inserted "or attempts to take" after "takes".


Statutory Notes and Related Subsidiaries
Short Title of 1996 Amendment
Pub. L. 104–217, §1, Oct. 1, 1996, 110 Stat. 3020 , provided that: "This Act [amending section 2119 of this title] may be cited as the 'Carjacking Correction Act of 1996'." "

"""

In [5]:
prompt = f"""
    You are an AI assistant that cleans U.S. legal text while **preserving all original headings and structure**.
    
    **Rules for Cleaning:**
    - **Do NOT add new headings.** Only keep the ones already present in the text.
    - **Remove encoding artifacts** (e.g., Ã‚Â, Ã¢Â€Â“).
    - **Remove HTML/ Page Elements like Next >>[Print], << Previous, Result 1 of 1 , etc., except for those that are standard legal symbols like the section symbol §).
    - **Maintain original section titles and bold formatting** (e.g., **Historical and Revision Notes**).
    - **Do NOT insert extra information, commentary, or inferred text.**
    - **Preserve all legal citations and amendments.**
    - **Normalize spacing and punctuation for readability.**

    **Now clean the text while keeping all existing headings and returning only the cleaned version. Do NOT add new headings or modify structure.**
    """

In [None]:
count = 0
errors = 0
model = genai.GenerativeModel('models/gemini-2.0-flash-001',
                              system_instruction=prompt)

for index, row in df.iterrows():
    try:
        response_content = model.generate_content(row['Content'])
        df.at[index, 'Content'] = response_content.text
    except Exception as e:
        df.at[index, 'Content'] = ""
        print(f"Error in Content at index {index}: {e}")
        errors += 1
    
    try:
        response_section = model.generate_content(row['Section'])
        df.at[index, 'Section'] = response_section.text
    except Exception as e:
        df.at[index, 'Section'] = ""
        print(f"Error in Section at index {index}: {e}")
        errors += 1
    
    try:
        response_metadata = model.generate_content(row['Metadata'])
        df.at[index, 'Metadata'] = response_metadata.text
    except Exception as e:
        df.at[index, 'Metadata'] = ""
        print(f"Error in Metadata at index {index}: {e}")
        errors += 1
    
    count += 1
    print(count)
    if count % 100 == 0:
        print(f"Milestone!!!!!!!!!!! {count}")

print("Total errors:", errors)


In [None]:
df.to_csv(r"Title18_reprocessed.csv", index=False)

In [12]:
print(df['Processed_Content'][140])

(a) Action and Jurisdiction.—Any national of the United States injured in his or her person, property, or business by reason of an act of international terrorism, or his or her estate, survivors, or heirs, may sue therefor in any appropriate district court of the United States and shall recover threefold the damages he or she sustains and the cost of the suit, including attorney’s fees.

(b) Estoppel Under United States Law.—A final judgment or decree rendered in favor of the United States in any criminal proceeding under section 1116, 1201, 1203, or 2332 of this title or section 46314, 46502, 46505, or 46506 of title 49 shall estop the defendant from denying the essential allegations of the criminal offense in any subsequent civil proceeding under this section.

(c) Estoppel Under Foreign Law.—A final judgment or decree rendered in favor of any foreign state in any criminal proceeding shall, to the extent that such judgment or decree may be accorded full faith and credit under the law

In [7]:
def clean_data_with_tracking(df, prompt, checkpoint_dir='checkpoints', resume=True):
    """
    Clean data with progress tracking and ability to resume from last checkpoint
    
    Args:
        df: pandas DataFrame with Content, Section, and Metadata columns
        prompt: system instruction prompt for the Gemini model
        checkpoint_dir: directory to store checkpoints
        resume: whether to attempt to resume from last checkpoint
    """
    # Initialize Gemini model
    model = genai.GenerativeModel('models/gemini-2.0-flash-001',
                                 system_instruction=prompt)
    
    # Create checkpoint directory if it doesn't exist
    os.makedirs(checkpoint_dir, exist_ok=True)
    
    # Initialize or load progress state
    state_file = os.path.join(checkpoint_dir, 'cleaning_state.json')
    if resume and os.path.exists(state_file):
        with open(state_file, 'r') as f:
            state = json.load(f)
        start_index = state['last_processed_index'] + 1
        errors = state['errors']
        print(f"\nResuming from row {start_index}")
    else:
        start_index = 0
        errors = {
            'Content': [],
            'Section': [],
            'Metadata': []
        }
    
    start_time = time.time()
    total_rows = len(df)
    
    # Create progress bar starting from resume point
    progress_bar = tqdm(total=total_rows, initial=start_index, desc="Processing rows")
    
    def save_state(current_index):
        """Save current progress and errors to checkpoint"""
        state = {
            'last_processed_index': current_index,
            'errors': errors,
            'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
        }
        with open(state_file, 'w') as f:
            json.dump(state, f)
        
        # Save the DataFrame checkpoint
        checkpoint_file = os.path.join(checkpoint_dir, 'data_checkpoint.csv')
        df.to_csv(checkpoint_file, index=False)
        
        print(f"\nCheckpoint saved at row {current_index + 1}")
    
    def process_column(row, index, column_name):
        try:
            # Skip if already processed (in case of resume)
            if pd.notna(row[column_name]) and resume and index < start_index:
                return None
                
            response = model.generate_content(row[column_name])
            df.at[index, column_name] = response.text
            return None
        except Exception as e:
            error_info = {
                'index': index,
                'error': str(e),
                'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            }
            errors[column_name].append(error_info)
            df.at[index, column_name] = ""
            return error_info

    try:
        for index, row in df.iloc[start_index:].iterrows():
            # Process each column
            for column in ['Content', 'Section', 'Metadata']:
                error = process_column(row, index, column)
                if error:
                    print(f"\nError in {column} at row {index}: {error['error']}")
            
            # Update progress
            progress_bar.update(1)
            elapsed_time = time.time() - start_time
            processed_rows = index - start_index + 1
            avg_time_per_row = elapsed_time / processed_rows
            remaining_rows = total_rows - (index + 1)
            estimated_time_remaining = remaining_rows * avg_time_per_row
            
            # Print detailed status every 10 rows
            if (index + 1) % 10 == 0:
                print(f"\nStatus Update:")
                print(f"Processed {index + 1}/{total_rows} rows ({((index + 1)/total_rows)*100:.1f}%)")
                print(f"Elapsed time: {elapsed_time/60:.1f} minutes")
                print(f"Estimated time remaining: {estimated_time_remaining/60:.1f} minutes")
                print(f"Current error count: {sum(len(e) for e in errors.values())}")
            
            # Save checkpoint every 100 rows or if there's an error
            if (index + 1) % 100 == 0 or any(len(e) > 0 for e in errors.values()):
                save_state(index)
    
    except KeyboardInterrupt:
        print("\nProcess interrupted by user. Saving checkpoint...")
        save_state(index)
        raise
    
    except Exception as e:
        print(f"\nUnexpected error occurred. Saving checkpoint...")
        save_state(index)
        raise
    
    finally:
        progress_bar.close()
    
    # Final summary
    print("\nProcessing Complete!")
    print("=" * 50)
    print(f"Total rows processed: {total_rows}")
    print(f"Total time taken: {elapsed_time/60:.1f} minutes")
    print("\nError Summary:")
    for column, column_errors in errors.items():
        print(f"{column}: {len(column_errors)} errors")
    
    return df, errors

try:
    cleaned_df, error_log = clean_data_with_tracking(
        df,
        prompt=prompt,
        checkpoint_dir='checkpoints',
        resume=True
    )
except KeyboardInterrupt:
    print("\nProcess stopped by user. You can resume from the last checkpoint later.")
except Exception as e:
    print(f"\nCritical error occurred: {e}")
    print("You can resume from the last checkpoint later.")

Processing rows: 100%|█████████▉| 1641/1647 [2:13:53<02:03, 20.62s/it]


Error in Metadata at row 1640: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1641


Processing rows: 100%|█████████▉| 1642/1647 [2:14:10<01:37, 19.45s/it]


Error in Metadata at row 1641: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1642


Processing rows: 100%|█████████▉| 1643/1647 [2:14:23<01:10, 17.58s/it]


Error in Metadata at row 1642: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1643


Processing rows: 100%|█████████▉| 1644/1647 [2:14:30<00:43, 14.36s/it]


Error in Metadata at row 1643: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1644


Processing rows: 100%|█████████▉| 1645/1647 [2:14:35<00:23, 11.57s/it]


Error in Metadata at row 1644: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1645


Processing rows: 100%|█████████▉| 1646/1647 [2:14:39<00:09,  9.18s/it]


Error in Metadata at row 1645: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1646


Processing rows: 100%|██████████| 1647/1647 [2:14:41<00:00, 11.92s/it]


Error in Metadata at row 1646: Could not create `Blob`, expected `Blob`, `dict` or an `Image` type(`PIL.Image.Image` or `IPython.display.Image`).
Got a: <class 'float'>
Value: nan

Checkpoint saved at row 1647

Processing Complete!
Total rows processed: 1647
Total time taken: 134.7 minutes

Error Summary:
Content: 1 errors
Section: 50 errors
Metadata: 455 errors





In [8]:
df.to_csv(r"Title18_reprocessed.csv", index=False, encoding='utf-8', errors='replace')