# Sentiment Analysis for Oil & Gas Articles Using a Local Llama Model

## Overview

This notebook contains a Python script that performs sentiment analysis on a dataset of oil and gas industry articles. For each article, the script assigns a sentiment (Positive, Neutral, Negative, or N/A) to each of six predefined categories.

*Key Components*

1. Data Preparation

- The script loads a pre-cleaned dataset of articles.
- It removes specific rows that are known to be unprocessable.

2. File Management

- The script checks for an existing results file.
- If the file doesn't exist, it creates an empty DataFrame for the results and saves it to CSV.

3. Sentiment Analysis Process

- The script uses a pre-defined template for sentiment analysis.
- It processes articles one by one, assigning sentiments to each category.
- Results are saved to a CSV file after each successful analysis.

4. Error Handling and Retries

- The script retries an article when the model response cannot be parsed or an error is raised.
- After `MAX_TRIES` failed attempts (5 in this notebook), it writes "No JSON found" to every category for that article.

5. Progress Tracking

- The script prints progress updates every 10 iterations.
- It displays the number of articles processed and the elapsed time.

In [11]:
# Import Libraries
import os
import sys
import time
import pandas as pd


In [12]:
# Setup paths
project_dir = os.path.abspath(os.path.join(os.getcwd(), '../..'))
source_code_dir = os.path.join(project_dir, '10_Source_Code')
sys.path.append(source_code_dir)

# import local .py helper functions
import llama_setup as ls
import data_setup as ds

In [13]:
# Global Variables
CATEGORIES = [
        "Finance",
        "Production",
        "Reserves / Exploration / Acquisitions / Mergers / Divestments",
        "Environment / Regulatory / Geopolitics",
        "Alternative Energy / Lower Carbon",
        "Oil Price / Natural Gas Price / Gasoline Price"
        ]

SENTIMENT_RESULTS_FILE_PATH = 'Full_data_LLama_model_sentiment_analysis_results.csv'
ROWS_TO_DROP = ['PQ-2840736837']
MAX_TRIES = 5

In [14]:
# Load and prepare data
text_df = ds.load_cleaned_data()
text_df = ds.drop_unprocessable_rows(text_df, ROWS_TO_DROP)

In [15]:
# Check if sentiment analysis results file exists and create if it doesn't
if not ds.check_file_exists(SENTIMENT_RESULTS_FILE_PATH):
    empty_sentiment_df = ds.create_empty_sentiment_df(text_df, CATEGORIES)
    ds.save_dataframe_to_csv(empty_sentiment_df, SENTIMENT_RESULTS_FILE_PATH)
    print(f"Created and saved an empty sentiment analysis DataFrame to {SENTIMENT_RESULTS_FILE_PATH}")
else:
    print("The file exists in the current directory.")

The file exists in the current directory.
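
For reference, the empty results table could look like the following sketch: one row per article ID and one initially empty column per category. The real `ds.create_empty_sentiment_df` is not shown in this notebook, so the function name `make_empty_sentiment_df`, the `Unique_ID` column name, and the use of missing values as the "not yet processed" marker are all assumptions.

```python
import pandas as pd

def make_empty_sentiment_df(unique_ids, categories, id_column="Unique_ID"):
    """Build a results frame with one row per ID and empty category columns.

    Hypothetical stand-in for ds.create_empty_sentiment_df.
    """
    df = pd.DataFrame({id_column: list(unique_ids)})
    for category in categories:
        df[category] = pd.NA  # filled in later, one article at a time
    return df

# Made-up IDs and a shortened category list, for illustration only.
demo = make_empty_sentiment_df(["A-1", "A-2"], ["Finance", "Production"])
print(demo.shape)
```

Keeping one pre-created row per article is what lets a later step look for rows that are still empty, rather than tracking progress separately.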


In [16]:
# Define the template
TEMPLATE = """<s>Classify the following article into categories with sentiment (Positive, Neutral, Negative, N/A if not applicable or not mentioned) and provide the output in the specified dictionary format.
Example:
Article: ExxonMobil announced a significant increase in quarterly profits due to rising oil prices and increased production levels.
Output: {{"Finance": "Positive", 'Production': "Positive", "Reserves / Exploration / Acquisitions / Mergers / Divestments": 'Neutral', "Environment / Regulatory / Geopolitics": 'Neutral', "Alternative Energy / Lower Carbon": 'Neutral', "Oil Price / Natural Gas Price / Gasoline Price": "Positive"}}

Example:
Article: Chevron plans to invest heavily in renewable energy projects, aiming to reduce its carbon footprint over the next decade.
Output: {{'Finance': 'Neutral', 'Production': 'Neutral', "Reserves / Exploration / Acquisitions / Mergers / Divestments": 'Neutral', "Environment / Regulatory / Geopolitics": "Positive", "Alternative Energy / Lower Carbon": "Positive", "Oil Price / Natural Gas Price / Gasoline Price": 'Neutral'}}

Example:
Article: BP faced regulatory challenges in its latest drilling project, delaying operations and increasing costs.
Output: {{'Finance': 'Negative', "Production": 'Negative', "Reserves / Exploration / Acquisitions / Mergers / Divestments": 'Negative', "Environment / Regulatory / Geopolitics": 'Negative', "Alternative Energy / Lower Carbon": 'Neutral', "Oil Price / Natural Gas Price / Gasoline Price": 'Neutral'}}

Article: {article}

Output only the EXACT dictionary format:
{{"Finance": '[Sentiment]', "Production": '[Sentiment]', "Reserves / Exploration / Acquisitions / Mergers / Divestments": '[Sentiment]', "Environment / Regulatory / Geopolitics": '[Sentiment]', "Alternative Energy / Lower Carbon": '[Sentiment]', "Oil Price / Natural Gas Price / Gasoline Price": '[Sentiment]'}}

Do not use any other format or additional information. Please provide the output in the specified format only.</s>"""
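
The doubled braces in the template matter if the prompt is filled with Python's `str.format` (or a prompt-template class that follows the same brace convention): `{{` and `}}` render as literal braces, while `{article}` is substituted. How `ls.get_ollama_response` fills the template is not shown in this notebook, so the sketch below is an assumption. It uses a shortened stand-in template; the names `MINI_TEMPLATE` and `build_prompt` are hypothetical.

```python
# Shortened stand-in for TEMPLATE; name and content are illustrative only.
MINI_TEMPLATE = (
    "Article: {article}\n"
    "Output only the EXACT dictionary format:\n"
    '{{"Finance": "[Sentiment]", "Production": "[Sentiment]"}}'
)

def build_prompt(article: str, template: str = MINI_TEMPLATE) -> str:
    """Fill the {article} placeholder; doubled braces become literal braces."""
    return template.format(article=article)

prompt = build_prompt("Shell reported lower quarterly output.")
print(prompt)
```

Only the template is parsed for placeholders, so braces that happen to appear inside an article's text are inserted unchanged.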

In [17]:
# Function to process a single unique ID
def process_unique_id(unique_id):
    """Process a single article for sentiment analysis.

    Args:
        unique_id (str): The unique identifier for the article.

    Returns:
        bool: True if analysis was successful, False otherwise.

    Tries to perform sentiment analysis on an article, updating results in a CSV file.
    If unsuccessful after MAX_TRIES attempts, saves "No JSON found" for each category.
    """
    
    for _ in range(MAX_TRIES):
        try:
            company, source, headline, text = ds.get_model_inputs(text_df, unique_id)
            response = ls.get_ollama_response(text, TEMPLATE)
            sentiment_dict = ds.extract_and_convert_to_dict(response)

            if isinstance(sentiment_dict, dict):
                ds.update_csv(SENTIMENT_RESULTS_FILE_PATH, unique_id, sentiment_dict, CATEGORIES)
                return True
            print("Error: Sentiment dictionary not found. Retrying...")
        except Exception as e:
            print(f"Error: {e}. Retrying...")
    
    print(f"Max retries reached for Unique_ID '{unique_id}'. Inserting 'No JSON found' for each category.")
    sentiment_dict = {category: "No JSON found" for category in CATEGORIES}
    ds.update_csv(SENTIMENT_RESULTS_FILE_PATH, unique_id, sentiment_dict, CATEGORIES)
    return False
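
`ds.extract_and_convert_to_dict` lives in the helper module and is not shown here. One way such a helper could work, assuming the model returns a single flat dictionary somewhere in its reply, is to locate the first `{...}` span with a regular expression and parse it with `ast.literal_eval`. That parser accepts the mixed single and double quotes used in the template examples, which strict `json.loads` would reject. The names `extract_dict`, `is_valid_sentiment_dict`, and `ALLOWED` are hypothetical.

```python
import ast
import re

ALLOWED = {"Positive", "Neutral", "Negative", "N/A"}

def extract_dict(response: str):
    """Return the first flat {...} literal in the response as a dict, else None."""
    match = re.search(r"\{[^{}]*\}", response)
    if match is None:
        return None
    try:
        parsed = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError, TypeError):
        return None
    return parsed if isinstance(parsed, dict) else None

def is_valid_sentiment_dict(d, categories):
    """Check that every category is present and carries an allowed label."""
    return (
        isinstance(d, dict)
        and set(d) == set(categories)
        and all(value in ALLOWED for value in d.values())
    )

reply = """Sure! {"Finance": 'Positive', 'Production': "N/A"} Hope this helps."""
result = extract_dict(reply)
print(result)
```

The `isinstance(sentiment_dict, dict)` check in `process_unique_id` accepts any dictionary; a stricter check along the lines of `is_valid_sentiment_dict` would also trigger a retry when the model omits a category or invents a label.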

In [18]:

# Process articles until none remain with empty sentiment values
count = 0
start_time = time.time()

while True:
    unique_id = ds.find_first_unique_id_with_empty_values(SENTIMENT_RESULTS_FILE_PATH, CATEGORIES)
    if not unique_id:
        break

    process_unique_id(unique_id)
    count += 1
    if count % 10 == 0:
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(elapsed_time, 60)
        print(f"Iteration: {count}, Elapsed Time: {int(minutes)} minutes and {seconds:.2f} seconds")

print("Processing complete.")

Row with Unique_ID 'PQ-2324792481' has been updated.
Row with Unique_ID 'PQ-2323314971' has been updated.
Row with Unique_ID 'PQ-2323189609' has been updated.
Row with Unique_ID 'PQ-2300871043' has been updated.
Row with Unique_ID 'PQ-2300871006' has been updated.
Row with Unique_ID 'PQ-2293021546' has been updated.
Row with Unique_ID 'PQ-2268136378' has been updated.
Row with Unique_ID 'PQ-2244937767' has been updated.
Row with Unique_ID 'PQ-2244906592' has been updated.
Row with Unique_ID 'PQ-2244478210' has been updated.
Iteration: 10, Elapsed Time: 1 minutes and 10.30 seconds
Row with Unique_ID 'PQ-2244475922' has been updated.
Row with Unique_ID 'PQ-3047045957' has been updated.
Row with Unique_ID 'PQ-3040189451' has been updated.
Row with Unique_ID 'PQ-3034936247' has been updated.
Row with Unique_ID 'PQ-2942028692' has been updated.
Row with Unique_ID 'PQ-2938179872' has been updated.
Row with Unique_ID 'PQ-2933659819' has been updated.
Row with Unique_ID 'PQ-2933332666' has bee

KeyboardInterrupt: 
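
Because results are written to the CSV after every article, an interrupted run (such as the `KeyboardInterrupt` above) can be resumed by rerunning the loop: it picks up at the first row whose category cells are still empty. `ds.find_first_unique_id_with_empty_values` is not shown in this notebook; the sketch below is one possible standard-library version, assuming the results file has a `Unique_ID` column plus one column per category. The function name `first_unprocessed_id` is hypothetical.

```python
import csv
import io

def first_unprocessed_id(csv_file, categories, id_column="Unique_ID"):
    """Return the ID of the first row with any empty category cell, else None."""
    for row in csv.DictReader(csv_file):
        if any(not (row.get(cat) or "").strip() for cat in categories):
            return row[id_column]
    return None

# In-memory stand-in for the results file; the IDs here are made up.
sample = io.StringIO(
    "Unique_ID,Finance,Production\n"
    "A-1,Positive,Neutral\n"
    "A-2,,\n"
    "A-3,,\n"
)
print(first_unprocessed_id(sample, ["Finance", "Production"]))
```

Rows marked "No JSON found" count as filled under this check, so articles that exhausted their retries are not picked up again on the next run.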