In [None]:
import google.generativeai as genai
import os
import dotenv
dotenv.load_dotenv()
genai.configure(api_key=os.getenv('GOOGLE_API_KEY'))


In [3]:
systemprompt= """ You are an expert AI Assistant specializing in the United States Code, specifically Title 18 (Crimes and Criminal Procedure). Your primary function is to process and clean text excerpts from this title.

**Task:**

You will receive text input containing legal information extracted from Title 18 of the U.S. Code. Your task is to preprocess and clean this text, ensuring it is suitable for further analysis or use.  This involves several specific steps:

1. **Remove Non-English Characters:**  Identify and remove any characters that are not part of standard English text. This includes, but is not limited to:
    * Special symbols (e.g., Ã‚Â, ¶, †, ‡, /x,/n, Next >>[Print], << Previous, Result 1 of 1 , etc., except for those that are standard legal symbols like the section symbol §).
    * Control characters.
    * Characters from other languages (e.g., accented characters, characters from non-Latin alphabets).
    * Extraneous punctuation or symbols not typically found in legal text.

2. **Remove Unrelated Text:**  Identify and remove any text that is not directly related to the legal content of Title 18. This may include:
    * Headings, subheadings, or titles if they are redundant or not part of the core legal text. If headings are essential to understanding the structure of the law, keep them.
    * Editorial notes, annotations, or commentary that are not part of the official legal text.
    * Footnotes, unless they contain essential legal information. If you keep footnotes, ensure they are formatted clearly and consistently.
    * References to other sections or titles, unless they are crucial for understanding the current excerpt. If kept, ensure they are formatted consistently.
    * Table of contents entries.
    * Anything that is clearly not part of the codified law itself.

3. **Preserve Meaning:**  Crucially, you **must not** modify or alter the meaning of the legal text.  Your cleaning process should only remove extraneous or non-essential elements; it should not change the legal content in any way.

4. **Return Clean and Meaningful Text:** The output you provide should be the cleaned and processed legal text. It should be grammatically correct, clearly formatted, and ready for further use.  Maintain the original structure and hierarchy of the legal text where possible (e.g., keep paragraphs, subsections, etc.).  If the input text includes citations to other legal sources, maintain the format of those citations.

**Example Input:**
"Whoever, within the special maritime and territorial jurisdiction of the United States, by force and violence, or by intimidation, takes or attempts to take from the person or presence of another anything of value, shall be imprisoned not more than fifteen years.(June 25, 1948, ch. 645, 62 Stat. 796; Pub. L. 103Ã¢Â€Â“322, title XXXII, Ã‚Â§320903(a)(1), Sept. 13, 1994, 108 Stat. 2124.)Historical and Revision NotesBased on title 18, U.S.C., 1940 ed., Ã‚Â§463 (Mar. 4, 1909, ch. 321, Ã‚Â§284, 35 Stat. 1144).Words "within the special maritime and territorial jurisdiction of the United States" were added to restrict the place of the offense to those places described in section 451 of title 18, U.S.C., 1940 ed., now section 7 of this title.Minor changes were made in phraseology.Editorial NotesAmendments1994-Pub. L. 103Ã¢Â€Â“322 inserted "or attempts to take" after "takes".Statutory Notes and Related SubsidiariesShort Title of 1996 AmendmentPub. L. 104Ã¢Â€Â“217, Ã‚Â§1, Oct. 1, 1996, 110 Stat. 3020, provided that: "This Act [amending section 2119 of this title] may be cited as the 'Carjacking Correction Act of 1996'.""

**Example Output (Cleaned):**
"Whoever, within the special maritime and territorial jurisdiction of the United States, by force and violence, or by intimidation, takes or attempts to take from the person or presence of another anything of value, shall be imprisoned not more than fifteen years.
(June 25, 1948, ch. 645, 62 Stat. 796 ; Pub. L. 103–322, title XXXII, §320903(a)(1), Sept. 13, 1994, 108 Stat. 2124 .)
Historical and Revision Notes
Based on title 18, U.S.C., 1940 ed., §463 (Mar. 4, 1909, ch. 321, §284, 35 Stat. 1144 ).
Words "within the special maritime and territorial jurisdiction of the United States" were added to restrict the place of the offense to those places described in section 451 of title 18, U.S.C., 1940 ed., now section 7 of this title.
Minor changes were made in phraseology.


Editorial Notes
Amendments
1994-Pub. L. 103–322 inserted "or attempts to take" after "takes".


Statutory Notes and Related Subsidiaries
Short Title of 1996 Amendment
Pub. L. 104–217, §1, Oct. 1, 1996, 110 Stat. 3020 , provided that: "This Act [amending section 2119 of this title] may be cited as the 'Carjacking Correction Act of 1996'." "

"""

In [4]:
model = genai.GenerativeModel('models/gemini-1.5-flash',
                              system_instruction=systemprompt)
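
Calls to a hosted model can fail transiently (rate limits, timeouts), and the processing loop further down skips a file on its first failure. A minimal sketch of a retry wrapper with exponential backoff is shown here. `generate_with_retry` and its parameters are hypothetical names introduced for illustration; they are not part of `google.generativeai`, and the defaults are untuned guesses.

```python
import time

def generate_with_retry(generate, prompt, retries=3, base_delay=1.0):
    """Call generate(prompt), retrying with exponential backoff on any exception.

    `generate` is any callable that takes the prompt, e.g. model.generate_content.
    This helper is an illustrative sketch, not part of the original notebook.
    """
    for attempt in range(retries):
        try:
            return generate(prompt)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller decide what to do
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)
```

Under these assumptions, the loop could call `generate_with_retry(model.generate_content, full_content)` in place of the direct `model.generate_content(full_content)` call and keep its existing `try`/`except` for the final failure.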

In [5]:
import pandas as pd
import os
import glob
import re

# Define the folder path where your CSV files are stored
folder_path = r'Extracted_Data'  # Change this to your folder path

# Get all CSV files in the folder
csv_files = glob.glob(os.path.join(folder_path, "*.csv"))

# Initialize a list to store the merged data
merged_data = []

# Strip every non-ASCII character except the section symbol (§)
def clean_text(text):
    cleaned_text = re.sub(r'[^\x00-\x7F§]+', '', str(text))
    return cleaned_text

# Process each file
for file in csv_files:
    df = pd.read_csv(file, encoding='utf-8')
    print(f"Processing: {file}")

    # Extract the file name without extension
    file_name = os.path.basename(file)

    # Check if 'Content' column exists
    if 'Content' in df.columns:
        # Concatenate all 'Content' rows into a single string
        full_content = " ".join(df['Content'].dropna().astype(str))
        # 'Metadata' is optional; fall back to an empty string if the column is missing
        full_metadata = " ".join(df['Metadata'].dropna().astype(str)) if 'Metadata' in df.columns else ""
        # Strip non-ASCII debris before sending the text to the model
        full_content = clean_text(full_content)
        try:
            response_content = model.generate_content(full_content)
        except Exception as exc:
            # Skip this file rather than aborting the whole run
            print(f"Error with model generation, skipping file: {exc}")
            continue
        full_metadata = clean_text(full_metadata)
        # Append to the list as a dictionary ('Content' holds the raw response object for now)
        merged_data.append({'Chapter': file_name, 'Content': response_content, 'Metadata': full_metadata})
    else:
        print(f"Skipping {file}, 'Content' column not found.")

# Create a new DataFrame with the merged content
new_merged_df = pd.DataFrame(merged_data)

# Display the final DataFrame
print(new_merged_df.head())

# Save the new DataFrame to a CSV file
#new_merged_df.to_csv(os.path.join(folder_path, 'Merged_Content.csv'), index=False, encoding='utf-8')



Processing: Extracted_Data\Chapter1-3_Robbery_and_Burglary.csv
Processing: Extracted_Data\Chapter101_Records_and_Reports.csv
Processing: Extracted_Data\Chapter102_Riots.csv
Processing: Extracted_Data\Chapter105_Sabotage.csv
Processing: Extracted_Data\Chapter107_Seamen_and_Stowaways.csv
Processing: Extracted_Data\Chapter109A_Sexual_Abuse.csv
Processing: Extracted_Data\Chapter109B_SEX_OFFENDER_AND_CRIMES_AGAINST_CHILDREN_REGISTRY.csv
Processing: Extracted_Data\Chapter109_Searches_and_seizures.csv
Processing: Extracted_Data\Chapter10_Biological_Weapons.csv
Processing: Extracted_Data\Chapter110A_Domestic_Violence_and_stalking.csv
Processing: Extracted_Data\Chapter110_SEXUAL_EXPLOITATION_AND_OTHER_ABUSE_OF_CHILDREN.csv
Processing: Extracted_Data\Chapter111A_DESTRUCTION_OF_OR_INTERFERENCE_WITH_VESSELS_OR_MARITIME_FACILITIES.csv
Processing: Extracted_Data\Chapter111_Shipping.csv
Processing: Extracted_Data\Chapter113A_TELEMARKETING_AND_EMAIL_MARKETING_FRAUD.csv
Processing: Extracted_Data\Chapt
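
One side effect of the ASCII-only filter in `clean_text` is worth checking: a typographic dash between two numbers is deleted outright, so a citation such as Pub. L. 103-322 written with an en dash collapses to 103322 before the model ever sees it. The sketch below reproduces that and shows one possible workaround. `clean_text_keep_dashes` and the `DASHES` table are hypothetical additions, and they only help with correctly decoded dashes; mojibake sequences like those in the prompt's example input would still be stripped whole.

```python
import re

def clean_text(text):
    # Same filter as in the cell above: drop non-ASCII except the section symbol
    return re.sub(r'[^\x00-\x7F§]+', '', str(text))

# Hypothetical variant: map typographic dashes to '-' before filtering
DASHES = {0x2013: '-', 0x2014: '-'}  # en dash and em dash code points

def clean_text_keep_dashes(text):
    return clean_text(str(text).translate(DASHES))

sample = "Pub. L. 103\u2013322, \u00a7320903(a)(1)"
print(clean_text(sample))              # Pub. L. 103322, §320903(a)(1)
print(clean_text_keep_dashes(sample))  # Pub. L. 103-322, §320903(a)(1)
```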

In [10]:
# 'Content' holds the response objects returned by model.generate_content(); keep only their text
new_merged_df['Content'] = new_merged_df['Content'].apply(lambda x: x.text if hasattr(x, 'text') else x)

# Now, 'Content' column contains only the extracted text
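
`hasattr` only swallows `AttributeError`, so a `.text` property that raises any other exception would propagate out of the `apply` call. The `google.generativeai` response object is understood to raise `ValueError` from `.text` when a response carries no usable text part (for example when it was blocked); treat that as an assumption to verify against the installed version. A minimal sketch of a more defensive accessor, with `safe_text` as a hypothetical helper name:

```python
def safe_text(obj, default=""):
    """Return obj unchanged if it is a string, obj.text if readable, else default."""
    if isinstance(obj, str):
        return obj
    try:
        return obj.text
    except (AttributeError, ValueError):
        # AttributeError: the object has no .text at all
        # ValueError: assumed signal that the response has no usable text
        return default
```

It could be applied as `new_merged_df['Content'].apply(safe_text)`, which leaves an empty string wherever no text could be read.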


In [23]:
# Remove the Markdown bold markers (**) that the model adds to its output
new_merged_df['Content'] = new_merged_df['Content'].str.replace(r'\*\*', '', regex=True)
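
To see what this substitution does and does not touch, here is the same pattern applied to a single made-up string with the standard `re` module. The sample text and the `strip_bold` name are illustrative only.

```python
import re

def strip_bold(text):
    # Delete every literal '**' pair; single asterisks are left alone
    return re.sub(r'\*\*', '', text)

sample = "**§ 1301.** Importing or transporting lottery tickets.\n* a list item"
print(strip_bold(sample))
```

Only the doubled asterisks are removed, so a bullet that starts with a single `*` survives unchanged.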

In [38]:
import random
# Print one randomly chosen chapter as a spot check
ran = random.randint(0, len(new_merged_df) - 1)
print(new_merged_df['Content'].iloc[ran])

TITLE 18, CHAPTER 61: LOTTERIES

§ 1301. Importing or transporting lottery tickets.

Whoever brings into the United States for the purpose of disposing of the same, or knowingly deposits with any express company or other common carrier for carriage, or carries in interstate or foreign commerce any paper, certificate, or instrument purporting to be or to represent a ticket, chance, share, or interest in or dependent upon the event of a lottery, gift enterprise, or similar scheme, offering prizes dependent in whole or in part upon lot or chance, or any advertisement of, or list of the prizes drawn or awarded by means of, any such lottery, gift enterprise, or similar scheme; or, being engaged in the business of procuring for a person in 1 State such a ticket, chance, share, or interest in a lottery, gift enterprise or similar scheme conducted by another State (unless that business is permitted under an agreement between the States in question or appropriate authorities of those States), k

In [42]:
new_merged_df['Chapter'].unique()

array(['Chapter1-3_Robbery_and_Burglary.csv',
       'Chapter101_Records_and_Reports.csv', 'Chapter102_Riots.csv',
       'Chapter105_Sabotage.csv', 'Chapter107_Seamen_and_Stowaways.csv',
       'Chapter109A_Sexual_Abuse.csv',
       'Chapter109B_SEX_OFFENDER_AND_CRIMES_AGAINST_CHILDREN_REGISTRY.csv',
       'Chapter109_Searches_and_seizures.csv',
       'Chapter10_Biological_Weapons.csv',
       'Chapter110A_Domestic_Violence_and_stalking.csv',
       'Chapter110_SEXUAL_EXPLOITATION_AND_OTHER_ABUSE_OF_CHILDREN.csv',
       'Chapter111A_DESTRUCTION_OF_OR_INTERFERENCE_WITH_VESSELS_OR_MARITIME_FACILITIES.csv',
       'Chapter111_Shipping.csv',
       'Chapter113A_TELEMARKETING_AND_EMAIL_MARKETING_FRAUD.csv',
       'Chapter113B_Terrorism.csv', 'Chapter113C_Torture.csv',
       'Chapter113_Stolen_property.csv',
       'Chapter114_TRAFFICKING_IN_CONTRABAND_CIGARETTES_AND_SMOKELESS_TOBACCO.csv',
       'Chapter115_TREASON_SEDITION,_AND_SUBVERSIVE_ACTIVITIES.csv',
       'Chapter117_TRANSPOR

In [40]:
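# Build the path with os.path.join so the backslash is not read as an escape and the path stays portable
new_merged_df.to_csv(os.path.join("Title18_CSV_Data", "Title18_processed_chapters.csv"), index=False, encoding='utf-8')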