# Task
Modify the code to allow the user to upload a CSV dataset, perform sentiment analysis on the 'Text' column, and save the results to a new CSV file named "predicted_output.csv".

## Add file upload functionality

### Subtask:
Add code to the notebook to prompt the user to upload a CSV file and get the path to the uploaded file.


**Reasoning**:
Import the necessary module and use it to prompt the user for a file upload, then store the path of the uploaded file.



In [1]:
!pip install transformers torch pandas
from google.colab import files

uploaded = files.upload()
uploaded_file_path = list(uploaded.keys())[0]
print(f"Uploaded file: {uploaded_file_path}")

Saving custom_legal_dataset (1).csv to custom_legal_dataset (1).csv
Uploaded file: custom_legal_dataset (1).csv
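
`files.upload()` returns a dictionary that maps each uploaded file name to its contents as bytes, and the cell above takes the first key without checking it. Below is a minimal sketch of a stricter selection step. The helper name `pick_uploaded_csv` is hypothetical, and because it only inspects the dictionary, it can be tried outside Colab with a hand-built dict.

```python
def pick_uploaded_csv(uploaded):
    """Return the name of the first non-empty .csv file in an upload dict.

    `uploaded` is assumed to map file names to bytes, as returned by
    google.colab.files.upload(). This helper is illustrative only.
    """
    if not uploaded:
        raise ValueError("No file was uploaded.")
    for name, content in uploaded.items():
        if name.lower().endswith(".csv") and len(content) > 0:
            return name
    raise ValueError(f"No non-empty CSV among: {sorted(uploaded)}")


# Stand-in for the dict that files.upload() would return in Colab
fake_upload = {
    "notes.txt": b"hello",
    "custom_legal_dataset (1).csv": b"Text,Sentiment\n",
}
print(pick_uploaded_csv(fake_upload))
```

In the notebook, the result would replace `list(uploaded.keys())[0]` as the value of `uploaded_file_path`.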


## Modify data loading to use the uploaded file

### Subtask:
Update the data loading part of the code in cell `UNYkNccLZJU_` to read the CSV file uploaded by the user.


**Reasoning**:
Modify the code to use the uploaded file path for reading the CSV data.



**Reasoning**:
Earlier attempts at this cell failed on undefined names, which were fixed one at a time. First, `pd` and `sys` were not defined, so `pandas` (as `pd`), `sys`, and `os` (used later in the code) had to be imported; a previous cell handles those imports. Next, the traceback showed that `AutoTokenizer` and `AutoModelForSeq2SeqLM` had not been imported from the `transformers` library. Finally, `args` was not defined: the original code used it to specify the output file path, so the hardcoded output file name is now assigned to its own variable.



In [5]:
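# Standard-library and pandas imports are repeated here so this cell runs on its own
import os
import sys

import pandas as pd
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer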

# Load and process data
try:
    # uploaded_file_path was set by the upload cell above
    print(f"Attempting to load {uploaded_file_path}")
    # Read the uploaded CSV; malformed rows are skipped with a warning
    data = pd.read_csv(uploaded_file_path, encoding='utf-8', on_bad_lines='warn')
    print(f"Raw columns detected: {data.columns.tolist()}")
    if data.empty:
        raise ValueError("Input CSV is empty.")

    # Adjust if "Text,Sentiment" is a single column
    if "Text,Sentiment" in data.columns and ("Text" not in data.columns or "Sentiment" not in data.columns):
        print("Splitting 'Text,Sentiment' column...")
        # Split at the last comma: the text itself may contain commas, the label does not
        data[["Text", "Sentiment"]] = data["Text,Sentiment"].str.rsplit(",", n=1, expand=True)
    elif "Text" not in data.columns or "Sentiment" not in data.columns:
        raise ValueError("CSV must have 'Text' and 'Sentiment' columns or a 'Text,Sentiment' column to split.")

    data["Text"] = data["Text"].astype(str).str.strip().str.strip('"')
    data["Sentiment"] = data["Sentiment"].astype(str).str.strip().str.strip('"')
    texts = data["Text"].tolist()
    expected_sentiments = data["Sentiment"].tolist()
    print(f"Loaded {len(texts)} rows from CSV: {texts[:5]}...")  # Show first five for better debug
except FileNotFoundError as e:
    print(f"Error: {e}")
    sys.exit(1)
except Exception as e:
    print(f"Error loading or processing CSV: {e}")
    sys.exit(1)

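# Set up FLAN-T5 model
import torch  # installed in the first cell

try:
    model_name = "google/flan-t5-base"
    # Use the GPU when one is available; otherwise fall back to the CPU
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
except Exception as e:
    print(f"Error loading model: {e}")
    sys.exit(1)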

# Sentiment classification function
def get_sentiment(text):
    print(f"Processing text: {text[:100]}...") # Print first 100 characters of text
    prompt = f"Classify the sentiment of this legal text as Positive, Negative, or Neutral. Examples: 'The case was resolved favorably.' → Positive, 'The contract terms are unfair.' → Negative, 'The summary is neutral.' → Neutral. Text: {text}\nSentiment:"
    try:
        # Move the inputs to whichever device the model is on
        inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(model.device)
        outputs = model.generate(**inputs, max_length=10, num_return_sequences=1)
        sentiment = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        # Match case-insensitively; anything unrecognized falls back to Neutral
        label = sentiment.lower()
        if "positive" in label:
            return "Positive"
        elif "negative" in label:
            return "Negative"
        else:
            return "Neutral"
    except Exception as e:
        print(f"Error processing text '{text[:100]}...': {e}") # Print error with truncated text
        return "Neutral"


# Apply to all texts
sentiments = [get_sentiment(text) for text in texts]

# Create and save output
output_df = pd.DataFrame({"Statement": texts, "Predicted": sentiments, "Expected": expected_sentiments})

# Define the output file path
output_csv_path = "predicted_output.csv"

# Create the directory for the output CSV if it doesn't exist
output_dir = os.path.dirname(output_csv_path)
if output_dir and not os.path.exists(output_dir):
    os.makedirs(output_dir, exist_ok=True)
    print(f"Created directory: {output_dir}")

try:
    output_df.to_csv(output_csv_path, index=False)
    print(f"Results saved to {output_csv_path}")
    # Display the output table
    print("\nSentiment Analysis Table:")
    print(output_df.to_string(index=False))
    # Verify file creation
    if os.path.exists(output_csv_path):
        print(f"Verified: Output file exists at {output_csv_path}")
    else:
        print(f"Warning: Output file was not created at {output_csv_path}")
except PermissionError:
    print(f"Error: Permission denied when saving to {output_csv_path}. Check directory permissions.")
    sys.exit(1)
except Exception as e:
    print(f"Error saving output: {e}")
    sys.exit(1)

Attempting to load custom_legal_dataset (1).csv
Raw columns detected: ['Text', 'Sentiment']
Loaded 25 rows from CSV: ['The court ruled in favor of the plaintiff, awarding full compensation.', 'The contract terms were deemed unfair and unenforceable by the judge.', 'The legal review concluded with no significant issues identified.', 'The attorney provided exceptional support during the trial.', 'The settlement negotiations broke down due to irreconcilable differences.']...
Processing text: The court ruled in favor of the plaintiff, awarding full compensation....
Processing text: The contract terms were deemed unfair and unenforceable by the judge....
Processing text: The legal review concluded with no significant issues identified....
Processing text: The attorney provided exceptional support during the trial....
Processing text: The settlement negotiations broke down due to irreconcilable differences....
Processing text: The new law ensures equal rights for all citizens without bias...
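
Because the output table stores both `Predicted` and `Expected` for every row, agreement between the two can be summarized once the run finishes. The sketch below is not part of the original notebook: `summarize_agreement` is a hypothetical helper, and the sample labels are made up for illustration rather than taken from the run above.

```python
from collections import Counter


def summarize_agreement(predicted, expected):
    """Return (accuracy, Counter of (expected, predicted) mismatches)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("no rows to compare")
    # Compare case-insensitively so 'positive' and 'Positive' agree
    pairs = [
        (e.strip().lower(), p.strip().lower())
        for p, e in zip(predicted, expected)
    ]
    correct = sum(1 for e, p in pairs if e == p)
    mismatches = Counter((e, p) for e, p in pairs if e != p)
    return correct / len(pairs), mismatches


# Made-up labels, only to show the call shape
accuracy, mismatches = summarize_agreement(
    ["Positive", "Negative", "Neutral", "Neutral"],
    ["Positive", "Negative", "Neutral", "Positive"],
)
print(f"Accuracy: {accuracy:.2f}")
print(dict(mismatches))
```

In the notebook, the two arguments would be `sentiments` and `expected_sentiments`.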

## Clean up dummy data logic

### Subtask:
Remove or comment out the code related to creating dummy data and handling the `--dummy` argument as it will no longer be the primary method for getting input.


**Reasoning**:
Remove the argparse logic and the dummy data creation code to rely solely on the uploaded file.



In [6]:
# Load and process data
try:
    # uploaded_file_path was set by the upload cell above
    print(f"Attempting to load {uploaded_file_path}")
    # Read the uploaded CSV; malformed rows are skipped with a warning
    data = pd.read_csv(uploaded_file_path, encoding='utf-8', on_bad_lines='warn')
    print(f"Raw columns detected: {data.columns.tolist()}")
    if data.empty:
        raise ValueError("Input CSV is empty.")

    # Adjust if "Text,Sentiment" is a single column
    if "Text,Sentiment" in data.columns and ("Text" not in data.columns or "Sentiment" not in data.columns):
        print("Splitting 'Text,Sentiment' column...")
        # Split at the last comma: the text itself may contain commas, the label does not
        data[["Text", "Sentiment"]] = data["Text,Sentiment"].str.rsplit(",", n=1, expand=True)
    elif "Text" not in data.columns or "Sentiment" not in data.columns:
        raise ValueError("CSV must have 'Text' and 'Sentiment' columns or a 'Text,Sentiment' column to split.")

    data["Text"] = data["Text"].astype(str).str.strip().str.strip('"')
    data["Sentiment"] = data["Sentiment"].astype(str).str.strip().str.strip('"')
    texts = data["Text"].tolist()
    expected_sentiments = data["Sentiment"].tolist()
    print(f"Loaded {len(texts)} rows from CSV: {texts[:5]}...")  # Show first five for better debug
except FileNotFoundError as e:
    print(f"Error: {e}")
    sys.exit(1)
except Exception as e:
    print(f"Error loading or processing CSV: {e}")
    sys.exit(1)

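# Set up FLAN-T5 model
import torch  # installed in the first cell

try:
    model_name = "google/flan-t5-base"
    # Use the GPU when one is available; otherwise fall back to the CPU
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
except Exception as e:
    print(f"Error loading model: {e}")
    sys.exit(1)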

# Sentiment classification function
def get_sentiment(text):
    print(f"Processing text: {text[:100]}...") # Print first 100 characters of text
    prompt = f"Classify the sentiment of this legal text as Positive, Negative, or Neutral. Examples: 'The case was resolved favorably.' → Positive, 'The contract terms are unfair.' → Negative, 'The summary is neutral.' → Neutral. Text: {text}\nSentiment:"
    try:
        # Move the inputs to whichever device the model is on
        inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True).to(model.device)
        outputs = model.generate(**inputs, max_length=10, num_return_sequences=1)
        sentiment = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        # Match case-insensitively; anything unrecognized falls back to Neutral
        label = sentiment.lower()
        if "positive" in label:
            return "Positive"
        elif "negative" in label:
            return "Negative"
        else:
            return "Neutral"
    except Exception as e:
        print(f"Error processing text '{text[:100]}...': {e}") # Print error with truncated text
        return "Neutral"


# Apply to all texts
sentiments = [get_sentiment(text) for text in texts]

# Create and save output
output_df = pd.DataFrame({"Statement": texts, "Predicted": sentiments, "Expected": expected_sentiments})

# Define the output file path
output_csv_path = "predicted_output.csv"

# Create the directory for the output CSV if it doesn't exist
output_dir = os.path.dirname(output_csv_path)
if output_dir and not os.path.exists(output_dir):
    os.makedirs(output_dir, exist_ok=True)
    print(f"Created directory: {output_dir}")

try:
    output_df.to_csv(output_csv_path, index=False)
    print(f"Results saved to {output_csv_path}")
    # Display the output table
    print("\nSentiment Analysis Table:")
    print(output_df.to_string(index=False))
    # Verify file creation
    if os.path.exists(output_csv_path):
        print(f"Verified: Output file exists at {output_csv_path}")
    else:
        print(f"Warning: Output file was not created at {output_csv_path}")
except PermissionError:
    print(f"Error: Permission denied when saving to {output_csv_path}. Check directory permissions.")
    sys.exit(1)
except Exception as e:
    print(f"Error saving output: {e}")
    sys.exit(1)

Attempting to load custom_legal_dataset (1).csv
Raw columns detected: ['Text', 'Sentiment']
Loaded 25 rows from CSV: ['The court ruled in favor of the plaintiff, awarding full compensation.', 'The contract terms were deemed unfair and unenforceable by the judge.', 'The legal review concluded with no significant issues identified.', 'The attorney provided exceptional support during the trial.', 'The settlement negotiations broke down due to irreconcilable differences.']...
Processing text: The court ruled in favor of the plaintiff, awarding full compensation....
Processing text: The contract terms were deemed unfair and unenforceable by the judge....
Processing text: The legal review concluded with no significant issues identified....
Processing text: The attorney provided exceptional support during the trial....
Processing text: The settlement negotiations broke down due to irreconcilable differences....
Processing text: The new law ensures equal rights for all citizens without bias...

## Summary:

### Data Analysis Key Findings

*   The code was successfully modified to allow users to upload a CSV file for sentiment analysis using `google.colab.files.upload()`.
*   The data loading process was updated to read from the uploaded file path.
*   The code now successfully identifies and loads data from CSVs with separate 'Text' and 'Sentiment' columns or a combined 'Text,Sentiment' column.
*   The sentiment analysis is performed using the FLAN-T5 model on the 'Text' column of the uploaded data.
*   The results, including the original text, predicted sentiment, and expected sentiment, are saved to a new CSV file named "predicted\_output.csv".
*   The dummy data generation logic and the use of `argparse` for handling dummy data were successfully removed, ensuring the code relies solely on the uploaded file for input.
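
For the combined `Text,Sentiment` case, where the split happens matters, because legal text often contains commas itself. The plain-Python sketch below contrasts splitting at the first comma with splitting at the last. The combined value is made up (the first sample sentence with an assumed label), and the sketch assumes the sentiment label is always the final field.

```python
# Made-up combined value: a sample sentence followed by an assumed label
row = "The court ruled in favor of the plaintiff, awarding full compensation.,Positive"

# Splitting at the first comma cuts the sentence in half
first_text, first_label = row.split(",", 1)

# Splitting at the last comma keeps the sentence intact
text, _, label = row.rpartition(",")

print(repr(first_label))
print(repr(text), repr(label))
```

In pandas, `Series.str.rsplit(",", n=1, expand=True)` is the counterpart of the last-comma split.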

### Insights or Next Steps

*   The current implementation assumes the uploaded CSV has either 'Text' and 'Sentiment' columns or a 'Text,Sentiment' column; adding more robust error handling or input validation for different column names could improve usability.
*   Consider adding a feature to allow the user to specify the output file name instead of hardcoding "predicted\_output.csv".
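
Both suggestions can be sketched without touching the model code. The helpers below are hypothetical (`resolve_column` and `build_output_path` are not in the notebook): the first matches column names case-insensitively against a list of accepted aliases, which are assumptions, and the second derives the output file name from an optional user-supplied value.

```python
def resolve_column(columns, aliases):
    """Return the actual column name matching any alias, ignoring case and spaces."""
    lookup = {str(c).strip().lower(): c for c in columns}
    for alias in aliases:
        key = alias.strip().lower()
        if key in lookup:
            return lookup[key]
    raise ValueError(f"None of {aliases} found in columns {list(columns)}")


def build_output_path(user_value=None, default="predicted_output.csv"):
    """Use the user's file name if given, adding a .csv extension when missing."""
    name = (user_value or "").strip() or default
    return name if name.lower().endswith(".csv") else name + ".csv"


# Alias lists are illustrative assumptions, not part of the original code
sample_columns = [" text ", "Label"]
text_col = resolve_column(sample_columns, ["Text", "Statement"])
label_col = resolve_column(sample_columns, ["Sentiment", "Label"])
print(repr(text_col), repr(label_col))
print(build_output_path(), build_output_path("results"))
```

In the notebook, `data.columns` would be passed as `columns`, and the returned names would stand in for the literal `"Text"` and `"Sentiment"` lookups.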
