<a href="https://colab.research.google.com/github/Kussil/Financial_Sentiment_LLM/blob/main/03_Sentiment_Analysis/Revised_notebooks/Complete_Hugging_Face_LLama_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Connecting to Google Drive, Hugging Face, and the GitHub repository using API keys
To set up the API keys, refer to the installation instructions in the repository README.

In [1]:
# Mount the user's Google Drive to save the output files produced by this notebook
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [24]:
# Import necessary libraries
import os
import torch
import time
from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, pipeline
from huggingface_hub import HfFolder, HfApi
from google.colab import userdata
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_huggingface import HuggingFacePipeline

In [7]:
# Connect to Hugging Face using the API key stored in Colab secrets

hf_token = userdata.get('HF_TOKEN')
if hf_token:
    HfFolder.save_token(hf_token)
    api = HfApi()
    user_info = api.whoami()
    if user_info:
        print("Connection to Hugging Face was successful.")
    else:
        print("Failed to connect to Hugging Face. Please check your token.")
else:
    print("Hugging Face token not found. Please add HF_TOKEN to the Colab secrets.")

Connection to Hugging Face was successful.


In [8]:
# Connection to GitHub repository
# Read the GitHub token from Colab secrets and clone the repository
GITHUB_TOKEN = userdata.get('github')
os.environ['GITHUB_TOKEN'] = GITHUB_TOKEN
!git clone https://{GITHUB_TOKEN}@github.com/Kussil/Financial_Sentiment_LLM.git

fatal: destination path 'Financial_Sentiment_LLM' already exists and is not an empty directory.


# Installing dependencies (Colab notebook version)

In [9]:
# Install the dependencies needed for the Llama workflow
!pip install -q -U langchain-huggingface langchain_community transformers bitsandbytes accelerate

# Importing libraries (Colab version)

In [10]:
# Import repository libraries


import sys
sys.path.append('/content/Financial_Sentiment_LLM/10_Source_Code/')
# Import the necessary modules from repository
import data_setup as ds

In [11]:
# Define Quantization
# Configure 4-bit NF4 quantization with double quantization and float16 compute.
# This reduces the memory needed to load the model, usually at a small cost in output quality.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
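
As a rough illustration of why 4-bit loading matters here, the arithmetic below estimates the memory taken by the model weights alone. The figure of 8 billion parameters is an assumption based on the model name, and `weight_memory_gib` is a hypothetical helper. The estimate ignores quantization metadata, any layers kept in higher precision, activations, and the generation cache, so actual usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: roughly 8 billion parameters for an 8B-class model.
NUM_PARAMS = 8e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")  # ~14.9 GiB
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")   # ~3.7 GiB
```

Under these assumptions the 16-bit weights alone would nearly fill a 16 GiB GPU, while the 4-bit weights leave room for activations and the generation cache.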

In [12]:
# Load Model and Tokenizer
# Load the model with the 4-bit quantization configuration, then load its tokenizer.
# Quantized weights reduce GPU memory usage.
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [13]:
# Create the pipeline with updated parameters
pipeline_inst = pipeline(
    "text-generation",
    model=model_4bit,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    do_sample=True,
    top_k=4,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    temperature=0.6,
    repetition_penalty=1.1,
    length_penalty=1.0,
    max_length=8129,  # Set to handle larger texts
    min_length=150,
    no_repeat_ngram_size=3,
    early_stopping=True,
    num_beams=5  # Beam search for better quality
)

In [14]:
# Define the template
TEMPLATE = """<s>Classify the following article into categories with sentiment (Positive, Neutral, Negative, N/A if not applicable or not mentioned) and provide the output in the specified dictionary format.
Example:
Article: ExxonMobil announced a significant increase in quarterly profits due to rising oil prices and increased production levels.
Output: {{"Finance": "Positive", "Production": "Positive", "Reserves / Exploration / Acquisitions / Mergers / Divestments": "Neutral", "Environment / Regulatory / Geopolitics": "Neutral", "Alternative Energy / Lower Carbon": "Neutral", "Oil Price / Natural Gas Price / Gasoline Price": "Positive"}}

Example:
Article: Chevron plans to invest heavily in renewable energy projects, aiming to reduce its carbon footprint over the next decade.
Output: {{"Finance": "Neutral", "Production": "Neutral", "Reserves / Exploration / Acquisitions / Mergers / Divestments": "Neutral", "Environment / Regulatory / Geopolitics": "Positive", "Alternative Energy / Lower Carbon": "Positive", "Oil Price / Natural Gas Price / Gasoline Price": "Neutral"}}

Example:
Article: BP faced regulatory challenges in its latest drilling project, delaying operations and increasing costs.
Output: {{"Finance": "Negative", "Production": "Negative", "Reserves / Exploration / Acquisitions / Mergers / Divestments": "Negative", "Environment / Regulatory / Geopolitics": "Negative", "Alternative Energy / Lower Carbon": "Neutral", "Oil Price / Natural Gas Price / Gasoline Price": "Neutral"}}

Article: {article}

Output only the EXACT dictionary format:
{{"Finance": "[Sentiment]", "Production": "[Sentiment]", "Reserves / Exploration / Acquisitions / Mergers / Divestments": "[Sentiment]", "Environment / Regulatory / Geopolitics": "[Sentiment]", "Alternative Energy / Lower Carbon": "[Sentiment]", "Oil Price / Natural Gas Price / Gasoline Price": "[Sentiment]"}}

Do not use any other format or additional information. Please provide the output in the specified format only.</s>"""
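
The template doubles its curly braces because it is later filled in with `str.format`, which treats single braces as placeholders. The small sketch below uses a shortened, hypothetical stand-in for the template (`MINI_TEMPLATE`) to show that doubled braces are rendered as literal braces, and that braces inside the substituted article text are inserted verbatim rather than re-parsed.

```python
# Shortened, hypothetical stand-in for the notebook's TEMPLATE.
MINI_TEMPLATE = 'Article: {article}\nOutput: {{"Finance": "[Sentiment]"}}'

# Doubled braces come out as single literal braces; {article} is replaced.
# Braces inside the article text itself are left untouched.
rendered = MINI_TEMPLATE.format(article="Oil prices rose {sharply}.")
print(rendered)
```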

In [15]:
# Global Variables
CATEGORIES = [
    "Finance",
    "Production",
    "Reserves / Exploration / Acquisitions / Mergers / Divestments",
    "Environment / Regulatory / Geopolitics",
    "Alternative Energy / Lower Carbon",
    "Oil Price / Natural Gas Price / Gasoline Price",
]

# Define the file path to save in Google Drive
DRIVE_PATH = '/content/drive/MyDrive'
SENTIMENT_RESULTS_FILE_PATH = os.path.join(DRIVE_PATH, 'FAST_OG_HF_LLAMA_Output_test.csv')
ROWS_TO_DROP = ['PQ-2840736837']

MAX_TRIES = 5

In [16]:
# Load and prepare data
text_df = ds.load_cleaned_data()
text_df = ds.drop_unprocessable_rows(text_df, ROWS_TO_DROP)

In [17]:
# Check if sentiment analysis results file exists and create if it doesn't
if not ds.check_file_exists(SENTIMENT_RESULTS_FILE_PATH):
    empty_sentiment_df = ds.create_empty_sentiment_df(text_df, CATEGORIES)
    ds.save_dataframe_to_csv(empty_sentiment_df, SENTIMENT_RESULTS_FILE_PATH)
    print(f"Created and saved an empty sentiment analysis DataFrame to {SENTIMENT_RESULTS_FILE_PATH}")
else:
    print("The file exists in the current directory.")

The file exists in the current directory.


In [18]:
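def get_HF_Llama_response(text, template):
    """Generate the model's sentiment response for one article.

    Args:
        text (str): The article text.
        template (str): Prompt template containing an {article} placeholder.

    Returns:
        str: The part of the generated text that follows the "</s>" marker.
    """
    # Format the template with the article
    formatted_template = template.format(article=text)

    # Generate the response
    full_response = pipeline_inst(formatted_template)[0]['generated_text']

    # Split on "</s>": use the text after the second marker if there is one,
    # otherwise the text after the last marker
    split_response = full_response.split("</s>")
    if len(split_response) > 2:
        final_response = split_response[2].strip()
    else:
        final_response = split_response[-1].strip()

    # Print the final response
    print(final_response)
    return final_response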
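
To make the splitting step in `get_HF_Llama_response` easier to follow, here is a standalone sketch of the same logic applied to mock strings. It assumes the pipeline returns the prompt followed by the continuation, so the prompt's closing `</s>` separates the two. `extract_completion` and the sample strings are hypothetical and exist only for illustration.

```python
def extract_completion(full_response: str, marker: str = "</s>") -> str:
    """Mirror the notebook's split logic on a plain string."""
    parts = full_response.split(marker)
    if len(parts) > 2:
        # Two or more markers: keep only the text after the second one.
        return parts[2].strip()
    # Zero or one marker: keep the text after the last marker (or everything).
    return parts[-1].strip()


one_marker = "<s>prompt text</s> {'Finance': 'Neutral'}"
two_markers = "<s>prompt text</s> first part </s> second part"
no_marker = "  plain text  "

print(extract_completion(one_marker))   # {'Finance': 'Neutral'}
print(extract_completion(two_markers))  # second part
print(extract_completion(no_marker))    # plain text
```

Note that when the generated text contains a second `</s>`, everything between the first and second marker is discarded, which may or may not be the intended behavior.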

In [19]:
unique_id = ds.find_first_unique_id_with_empty_values(SENTIMENT_RESULTS_FILE_PATH, CATEGORIES)
company, source, headline, text = ds.get_model_inputs(text_df, unique_id)
output = get_HF_Llama_response(text, TEMPLATE)
print(output)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


</p> </div> </body> </html>
```

The output should be:
```
{"Finance":["Neutral"], "Production":["Neutral"],
"Reserves/Exploration/Acquisitions/Mergers/Divestments":"Neutral",
"Environment/Regulatory/Geopolitics":"Neutral", 
"Alternative Energy/Lower Carbon":"Neutral","Oil Price/Natural Gas Price/Gasoline Price":"Neutral"}
```

This output indicates that the article's sentiment towards Marathon Oil Corporation's finance, production, reserves, exploration, acquisitions, mergers, divestments, environment, regulatory, geopolitics, alternative energy, and oil price/natural gas price/gasoline price is neutral.
</p> </div> </body> </html>
```

The output should be:
```
{"Finance":["Neutral"], "Production":["Neutral"],
"Reserves/Exploration/Acquisitions/Mergers/Divestments":"Neutral",
"Environment/Regulatory/Geopolitics":"Neutral", 
"Alternative Energy/Lower Carbon":"Neutral","Oil Price/Natural Gas Price/Gasoline Price":"Neutral"}
```

This output indicates that the article's sentiment toward

In [21]:
sentiment_dict = ds.extract_and_convert_to_dict(output)
print(sentiment_dict)

{'Finance': ['Neutral'], 'Production': ['Neutral'], 'Reserves/Exploration/Acquisitions/Mergers/Divestments': 'Neutral', 'Environment/Regulatory/Geopolitics': 'Neutral', 'Alternative Energy/Lower Carbon': 'Neutral', 'Oil Price/Natural Gas Price/Gasoline Price': 'Neutral'}
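
The parsed dictionary above differs from `CATEGORIES` in two ways: the keys have no spaces around the slashes, and some values are wrapped in lists. Whether `ds.update_csv` tolerates this is not shown in the notebook, so the following is only a sketch of one way the output could be normalized before saving. `normalize_sentiment_dict` and `_key` are hypothetical helpers, and mapping missing or empty values to "N/A" is an assumption.

```python
import re

CATEGORIES = [
    "Finance",
    "Production",
    "Reserves / Exploration / Acquisitions / Mergers / Divestments",
    "Environment / Regulatory / Geopolitics",
    "Alternative Energy / Lower Carbon",
    "Oil Price / Natural Gas Price / Gasoline Price",
]


def _key(name: str) -> str:
    """Canonical key: lowercase, no spaces around slashes, single spaces."""
    name = re.sub(r"\s*/\s*", "/", name.strip().lower())
    return re.sub(r"\s+", " ", name)


def normalize_sentiment_dict(raw: dict, categories: list) -> dict:
    """Map a loosely formatted model dictionary onto the expected categories."""
    lookup = {_key(k): v for k, v in raw.items()}
    result = {}
    for category in categories:
        value = lookup.get(_key(category), "N/A")
        # Unwrap values the model returned as single-item lists.
        if isinstance(value, (list, tuple)):
            value = value[0] if value else "N/A"
        text = str(value).strip()
        if not text or text.upper() == "N/A":
            result[category] = "N/A"
        else:
            result[category] = text.capitalize()
    return result


raw = {
    "Finance": ["Neutral"],
    "Production": ["Neutral"],
    "Reserves/Exploration/Acquisitions/Mergers/Divestments": "Neutral",
    "Environment/Regulatory/Geopolitics": "Neutral",
    "Alternative Energy/Lower Carbon": "Neutral",
    "Oil Price/Natural Gas Price/Gasoline Price": "Neutral",
}
normalized = normalize_sentiment_dict(raw, CATEGORIES)
print(normalized)
```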


In [22]:
# Maximum number of attempts per article
MAX_TRIES = 5


# Function to process a single unique ID
def process_unique_id(unique_id):
    """Process a single article for sentiment analysis.

    Tries to perform sentiment analysis on an article, updating results in a CSV file.
    If unsuccessful after MAX_TRIES attempts, saves "No JSON found" for each category.

    Args:
        unique_id (str): The unique identifier for the article.

    Returns:
        bool: True if analysis was successful, False otherwise.
    """

    for _ in range(MAX_TRIES):
        try:
            company, source, headline, text = ds.get_model_inputs(text_df, unique_id)
            response = get_HF_Llama_response(text, TEMPLATE)
            sentiment_dict = ds.extract_and_convert_to_dict(response)

            if isinstance(sentiment_dict, dict):
                ds.update_csv(SENTIMENT_RESULTS_FILE_PATH, unique_id, sentiment_dict, CATEGORIES)
                return True
            print("Error: Sentiment dictionary not found. Retrying...")
        except Exception as e:
            print(f"Error: {e}. Retrying...")

    print(f"Max retries reached for Unique_ID '{unique_id}'. Inserting 'No JSON found' for each category.")
    sentiment_dict = {category: "No JSON found" for category in CATEGORIES}
    ds.update_csv(SENTIMENT_RESULTS_FILE_PATH, unique_id, sentiment_dict, CATEGORIES)
    return False
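
The retry loop in `process_unique_id` treats every exception the same way. Some failures, such as running out of GPU memory on a very long input, are likely to repeat identically on each attempt, so one possible refinement is to stop early for those. The sketch below illustrates the idea with stand-in functions; `is_retryable` and `run_with_retries` are hypothetical, and matching on the error message is a simplifying assumption.

```python
def is_retryable(error: Exception) -> bool:
    """Hypothetical rule: out-of-memory errors are not worth retrying unchanged."""
    return "out of memory" not in str(error).lower()


def run_with_retries(task, max_tries: int = 5):
    """Call task() up to max_tries times, stopping early on non-retryable errors.

    Returns a (result, attempts) tuple; result is None if no attempt succeeded.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return task(), attempt
        except Exception as error:
            if not is_retryable(error):
                return None, attempt
    return None, max_tries


def oom_task():
    # Stand-in for a generation call that fails the same way every time.
    raise RuntimeError("CUDA out of memory. Tried to allocate 17.94 GiB.")


calls = {"n": 0}


def flaky_task():
    # Stand-in for a call that fails twice with a parse error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("Sentiment dictionary not found")
    return "ok"


oom_result = run_with_retries(oom_task)
flaky_result = run_with_retries(flaky_task)
print(oom_result)    # (None, 1)
print(flaky_result)  # ('ok', 3)
```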

In [25]:
# Main processing loop
start_time = time.time()
count = 0

while True:
    unique_id = ds.find_first_unique_id_with_empty_values(SENTIMENT_RESULTS_FILE_PATH, CATEGORIES)
    if not unique_id:
        break

    process_unique_id(unique_id)
    count += 1
    if count % 10 == 0:
        elapsed_time = time.time() - start_time
        minutes, seconds = divmod(elapsed_time, 60)
        print(f"Iteration: {count}, Elapsed Time: {int(minutes)} minutes and {seconds:.2f} seconds")

print("Processing complete.")

</p> </div> </body> </html>
```

The output should be:
```
{"Finance":["Neutral"], "Production":["Neutral"],
"Reserves/Exploration/Acquisitions/Mergers/Divestments":"Neutral",
"Environment/Regulatory/Geopolitics":"Neutral", 
"Alternative Energy/Lower Carbon":"Neutral","Oil Price/Natural Gas Price/Gasoline Price":"Neutral"}
```

Please note that the sentiment analysis is subjective and may vary depending on the analyst's perspective. The output provided is a general interpretation of the article and may not reflect the actual sentiment of the analyst.
Row with Unique_ID 'IR-1' has been updated.
</p> </div> </body> </html>```

The output should be in the exact dictionary format as follows:
{"finance": "neutral", "production": "positive", "reserves / exploration / acquisitions / mergers / divestments" : "neutral", "environment / regulatory / geopolitics" : "", "alternative energy / lower carbon": "", "oil price / natural gas price / gasoline price" : ""}

Please note that the sentiment is

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


Error: CUDA out of memory. Tried to allocate 17.94 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.94 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.94 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.94 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.94 GiB. GPU . Retrying...
Max retries reached for Unique_ID 'IR-5'. Inserting 'No JSON found' for each category.
Row with Unique_ID 'IR-5' has been updated.
Error: CUDA out of memory. Tried to allocate 17.22 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.22 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.22 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.22 GiB. GPU . Retrying...
Error: CUDA out of memory. Tried to allocate 17.22 GiB. GPU . Retrying...
Max retries reached for Unique_ID 'IR-6'. Inserting 'No JSON found' for each category.
Row with Unique_ID 'IR-6' has been updated

KeyboardInterrupt: 
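
In the log above, the same allocation fails on every retry for IR-5 and IR-6, which suggests the failure depends on the input rather than on transient state. One possible mitigation, sketched here under assumptions, is to cap the article length before it is formatted into the prompt. `truncate_article` and the `max_words` default are hypothetical, and a word count is only a rough proxy for the number of tokens the model sees.

```python
def truncate_article(text: str, max_words: int = 1500) -> str:
    """Keep at most max_words whitespace-separated words of the article."""
    words = text.split()
    if len(words) <= max_words:
        # Short enough: return the original text unchanged.
        return text
    return " ".join(words[:max_words])


long_article = "word " * 5000
short_article = "Oil prices rose."

capped = truncate_article(long_article, max_words=1500)
print(len(capped.split()))              # 1500
print(truncate_article(short_article))  # Oil prices rose.
```

Truncation discards the end of long articles, so any sentiment expressed only there would be missed; that trade-off would need checking against the data.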