<a href="https://colab.research.google.com/github/Rohit999zzz/-Domain-Specific-LLM-for-Financial-Analysis/blob/main/final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Financial statement generation and summarization system:

1. Data Ingestion and Preprocessing:

Data Sources: The system ingests data from two primary sources:
Transaction Data: Raw transactional data containing information about individual transactions, such as date, amount, account type, etc. (e.g., transaction_data.csv).
Financial Reports: Textual financial reports containing detailed financial information and insights (e.g., financial_reports.csv).
Preprocessing:
Transaction Data: The process_transaction_data function cleans, transforms, and aggregates the raw transaction data into a structured format suitable for model input. It parses transaction dates, drops rows whose dates cannot be parsed, groups transactions by account type and month, and computes the sum and mean of the amounts and the number of transactions in each group.
Financial Reports: The text of the financial reports is combined into a single corpus, split line by line into training and validation sets (80/20), and tokenized using the GPT-2 tokenizer.
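The aggregation step described above can be sketched on a few rows of made-up data. This is a minimal illustration, not the notebook's exact function: the sample values, the helper name aggregate_monthly, and the output column names are assumptions chosen for readability, while the input column names follow the notebook's code.

```python
import pandas as pd

# Hypothetical sample rows; the column names follow the notebook's transaction data.
sample = pd.DataFrame({
    "Transaction_ID": [1, 2, 3, 4],
    "Transaction_Date": ["2024-01-05", "2024-01-20", "2024-02-03", "not a date"],
    "Account_Type": ["Savings", "Savings", "Checking", "Checking"],
    "Amount": [100.0, 300.0, 50.0, 75.0],
})

def aggregate_monthly(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate transactions by account type and calendar month (illustrative helper)."""
    out = df.copy()
    # Unparseable dates become NaT and are dropped
    out["Transaction_Date"] = pd.to_datetime(out["Transaction_Date"], errors="coerce")
    out = out.dropna(subset=["Transaction_Date"]).copy()
    out["Month"] = out["Transaction_Date"].dt.to_period("M").astype(str)
    # Named aggregation produces flat, readable column names
    return (
        out.groupby(["Account_Type", "Month"])
        .agg(
            total_amount=("Amount", "sum"),
            mean_amount=("Amount", "mean"),
            n_transactions=("Transaction_ID", "count"),
        )
        .reset_index()
    )

monthly = aggregate_monthly(sample)
print(monthly)
```

The row with the invalid date is discarded before grouping, so it contributes to neither the totals nor the counts.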
2. Model Fine-tuning:

Model Selection: The system uses the pre-trained GPT-2 language model as the foundation for financial statement generation and summarization. GPT-2 is a decoder-only transformer trained to predict the next token, which makes it a natural fit for free-form text generation; the code loads the "gpt2" checkpoint.
Fine-tuning: The pre-trained GPT-2 model is fine-tuned on the tokenized financial report text, which adapts it to the language and concepts of financial statements and summaries. The structured transaction data is passed to the model as prompt text at inference time.
Custom Trainer: A CustomTrainer class, subclassing the Hugging Face Trainer, overrides compute_loss with a shifted cross-entropy loss for the language modeling task: the logits at position t are scored against the token at position t + 1.
Dataset Loading: The datasets library is used to load a relevant financial summarization dataset from the Hugging Face Hub, providing additional training data and improving the model's understanding of financial language.
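The shifted loss used for causal language modeling can be sketched in isolation with plain PyTorch. This is a minimal illustration under stated assumptions: the function name causal_lm_loss and the random tensors are hypothetical, and the value -100 is the default ignore_index of PyTorch's cross-entropy, used here to exclude padding positions.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Next-token cross-entropy (illustrative helper).

    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len).
    """
    # The logits at position t predict the token at position t + 1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=ignore_index,  # targets equal to this value do not contribute
    )

torch.manual_seed(0)
logits = torch.randn(2, 5, 11)          # random stand-in for model output
labels = torch.randint(0, 11, (2, 5))   # random stand-in for token ids
labels[1, 3:] = -100                    # pretend the tail of the second row is padding
loss = causal_lm_loss(logits, labels)
print(loss.item())
```

Because the ignored targets are skipped, changing the logits that would have predicted them leaves the loss unchanged.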
3. Inference and Task-Specific Functions:

Financial Statement Generation: The generate_financial_statement function takes structured transaction data as input and uses the fine-tuned GPT-2 model to generate a financial statement based on the provided data.
Financial Report Summarization: The summarize_financial_report function takes a financial report text as input and utilizes the fine-tuned model to generate a concise summary of key financial insights.
Query Answering: The answer_query function enables users to ask questions about financial topics, and the model attempts to provide answers based on its learned knowledge.
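The three inference functions differ mainly in the instruction prefix placed in front of the input. A minimal sketch of that prompt construction follows; the prefixes are the ones used in the notebook's code, while the PROMPT_PREFIXES table and the build_prompt helper are hypothetical names introduced only for illustration.

```python
# Prefixes taken from the notebook's inference functions
PROMPT_PREFIXES = {
    "statement": "Generate financial statement from data:\n",
    "summary": "Summarize key financial insights:\n",
    "query": "Query: ",
}

def build_prompt(task: str, payload) -> str:
    """Return the text that would be tokenized for a given task (illustrative helper)."""
    if task not in PROMPT_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    # Structured inputs such as dicts are converted to their string form
    return PROMPT_PREFIXES[task] + str(payload)

print(build_prompt("query", "What drives quarterly revenue growth?"))
```

Keeping the prefixes in one place makes it easier to use the same wording when preparing training examples and when prompting the model.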
4. Model Deployment and Usage:

Model Saving: The fine-tuned GPT-2 model and tokenizer are saved to disk for future use.
Example Usage: The code provides example scenarios demonstrating how the model can be used for financial statement generation, report summarization, and query answering.
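A saved model can be loaded back with the same from_pretrained calls used for the base checkpoint. The sketch below assumes the model and tokenizer were saved to ./financial_gpt2, as in the notebook's code; the helper name load_finetuned is hypothetical.

```python
def load_finetuned(model_dir: str = "./financial_gpt2"):
    """Load the fine-tuned model and tokenizer from disk (illustrative helper)."""
    # Imported here so that defining the helper does not trigger the heavy import
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    model.eval()  # disable dropout for inference
    return model, tokenizer
```

Calling load_finetuned() in a new session would return the pair needed by the inference functions, provided the directory exists.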
Overall Architecture Diagram:
Key Components:
Data Sources: Transaction data and financial reports.
Preprocessing: Data cleaning, transformation, and aggregation.
Model: Pre-trained GPT-2 language model.
Fine-tuning: Adapting the model to financial data.
Custom Trainer: Managing the training process.
Inference Functions: Generating statements, summaries, and answering queries.
Model Deployment: Saving and loading the fine-tuned model.
This architecture lets the system ingest financial data, adapt GPT-2 to financial text through fine-tuning, and perform three tasks in support of financial analysis: generating financial statements, summarizing reports, and answering questions.

In [None]:
import pandas as pd
import numpy as np
import transformers
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TrainingArguments, Trainer
from sklearn.model_selection import train_test_split
import torch

# Step 1: Data Preparation
# Assume `transaction_data.csv` contains raw transactional data
# and `financial_reports.csv` contains financial reports for analysis.

transaction_data = pd.read_csv("transaction_data.csv")
financial_reports = pd.read_csv("financial_reports.csv")

# Process raw transactional data into a structured format for model input.
def process_transaction_data(data):
    # Ensure every column used below is present
    required = ['Transaction_Date', 'Account_Type', 'Amount', 'Transaction_ID']
    missing = [col for col in required if col not in data.columns]
    if missing:
        raise KeyError(f"Missing required column(s) in the input data: {missing}")

    # Parse dates into a new DataFrame so the caller's data is not modified
    data = data.assign(
        Transaction_Date=pd.to_datetime(data['Transaction_Date'], errors='coerce')
    )

    # Drop rows with invalid or missing dates (copy to avoid SettingWithCopyWarning)
    data = data.dropna(subset=['Transaction_Date']).copy()

    # Create 'Month' column based on 'Transaction_Date'
    data['Month'] = data['Transaction_Date'].dt.to_period('M')

    # Aggregate by 'Account_Type' and 'Month'; named aggregation yields flat
    # column names without the trailing underscores produced by joining a MultiIndex
    aggregated_data = data.groupby(['Account_Type', 'Month']).agg(
        Amount_sum=('Amount', 'sum'),
        Amount_mean=('Amount', 'mean'),
        Transaction_ID_count=('Transaction_ID', 'count')
    ).reset_index()

    return aggregated_data

structured_transactions = process_transaction_data(transaction_data)
structured_transactions.to_csv("structured_transactions.csv", index=False)

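# Prepare text data for fine-tuning the model
# Combine financial text data into a single corpus
all_text = "\n".join(financial_reports['Content'].dropna().astype(str).tolist())

# Treat each non-empty line as one training example (blank lines would become empty inputs)
lines = [line.strip() for line in all_text.split("\n") if line.strip()]

# Split data into training and evaluation
train_texts, val_texts = train_test_split(lines, test_size=0.2, random_state=42)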

# Step 2: Tokenization
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)

# Create Dataset objects for PyTorch
class FinancialDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # For causal language modeling the labels are the input ids; padding
        # positions are set to -100 so the cross-entropy loss ignores them.
        labels = item['input_ids'].clone()
        labels[item['attention_mask'] == 0] = -100
        item['labels'] = labels
        return item

train_dataset = FinancialDataset(train_encodings)
val_dataset = FinancialDataset(val_encodings)

# Step 3: Model Fine-Tuning
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Resize token embeddings after adding a new pad token
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="./results",           # output directory
    run_name="fine_tune_financial_model",  # Custom run name
    num_train_epochs=3,               # total number of training epochs
    per_device_train_batch_size=4,    # batch size for training
    per_device_eval_batch_size=4,     # batch size for evaluation
    warmup_steps=500,                 # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                # strength of weight decay
    logging_dir="./logs",            # directory for storing logs
    logging_steps=10,
    save_steps=500
)

# Custom Trainer that computes the causal language modeling loss explicitly
class CustomTrainer(Trainer):
    # num_items_in_batch is accepted because recent versions of Trainer pass it to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # Shift so that the logits at position t are scored against the token at position t + 1
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # CrossEntropyLoss skips targets equal to -100 (its default ignore_index)
        loss_fct = torch.nn.CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return (loss, outputs) if return_outputs else loss

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer
)

trainer.train()

# Step 4: Inference and Task-Specific Functions
# GPT-2's context window is 1024 tokens, so the prompt length plus
# max_new_tokens must stay within it.
MAX_CONTEXT = 1024

def _generate(prompt, max_new_tokens):
    """Run the fine-tuned model on a prompt and return only the newly generated text."""
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=MAX_CONTEXT - max_new_tokens  # leave room for the generated tokens
    ).to(model.device)  # the Trainer may have moved the model to a GPU
    output = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=max_new_tokens,  # counts generated tokens only, unlike max_length
        pad_token_id=tokenizer.pad_token_id
    )
    # Drop the prompt tokens so that only the continuation is decoded
    new_tokens = output[0][inputs.input_ids.shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Generate financial statements from structured transactional data
def generate_financial_statement(data):
    return _generate("Generate financial statement from data:\n" + str(data), max_new_tokens=512)

# Summarize the key insights of a financial report
def summarize_financial_report(report_text):
    return _generate("Summarize key financial insights:\n" + report_text, max_new_tokens=150)

# Answer a free-form question about a financial topic
def answer_query(query):
    return _generate("Query: " + query, max_new_tokens=100)

# Example usage
result = generate_financial_statement(structured_transactions.head().to_dict())
summary = summarize_financial_report("Company X's revenue grew by 20% in Q3, driven by...")
query_result = answer_query("What were the effects of the 2008 recession on global markets?")

# Save the fine-tuned model
model.save_pretrained("./financial_gpt2")
tokenizer.save_pretrained("./financial_gpt2")


In [None]:
# Print the results:
print("Financial Statement:\n", result)
print("\nSummary:\n", summary)
print("\nQuery Result:\n", query_result)

# Further processing (example):
# You can store the results in a file, analyze the text,
# or use them as input to other functions or models.
with open("financial_statement.txt", "w") as f:
    f.write(result)