# Analysing the best-performing model with Captum to understand its decision-making process

##### This notebook follows the official Captum tutorial for BERT models: https://captum.ai/tutorials/Bert_SQUAD_Interpret

In [1]:
import pandas as pd
import torch
from captum.attr import LayerIntegratedGradients, visualization as viz
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

# Prepare data and load model

In [2]:
# Load prediction and test data
model_pred_df = pd.read_csv(
    '../2. models/predictions/nickmuchi-sec-bert-finetuned-finance-classification_False_FinancialStrengthText_MAMConfig_2024-04-30_predictions.csv')
test_df = pd.read_csv('../1. data/final/test.csv')

# Identify the row indices with the three lowest and three highest values in prediction column '1'
lowest_indices = model_pred_df['1'].nsmallest(3).index
highest_indices = model_pred_df['1'].nlargest(3).index

# Retrieve the corresponding texts for analysis
# (this assumes the predictions file and test.csv share the same row index)
lowest_texts = test_df.loc[lowest_indices, 'FinancialStrengthText']
highest_texts = test_df.loc[highest_indices, 'FinancialStrengthText']

In [3]:
# Set up the computation device for PyTorch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [4]:
# Load the pre-trained model and its configuration from a checkpoint
checkpoint_path = '../results/final/llm/llm_fine-tine_MAMConfig/nickmuchi-sec-bert-finetuned-finance-classification/checkpoint-138'
config = AutoConfig.from_pretrained(checkpoint_path)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path, config=config)
model.to(device)
model.eval()  # Evaluation mode: disables dropout
model.zero_grad()  # Clear any stored gradients before computing attributions

In [5]:
# Load the tokenizer that converts text into token IDs for the model
tokenizer = AutoTokenizer.from_pretrained('nickmuchi/sec-bert-finetuned-finance-classification')

# Special token IDs used to build the input and reference sequences
ref_token_id = tokenizer.pad_token_id  # Padding token, used to fill the reference (baseline) sequence
sep_token_id = tokenizer.sep_token_id  # Separator token (marks the end of a sequence in BERT)
cls_token_id = tokenizer.cls_token_id  # Classifier token (marks start of sequence in BERT)
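
The three special-token IDs above are enough to build a reference (baseline) sequence for a text: it has the same length as the input, keeps `[CLS]` and `[SEP]` in place, and replaces every other token with padding. A minimal sketch with plain lists follows. The ID values (101, 102, 0), the example token IDs, and the function name `build_input_and_reference` are assumptions for illustration and are not read from this tokenizer.

```python
# Placeholder special-token IDs (assumed values, for illustration only)
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def build_input_and_reference(text_ids):
    """Wrap token IDs in [CLS]/[SEP] and build a same-length padded reference."""
    input_ids = [CLS_ID] + text_ids + [SEP_ID]
    ref_ids = [CLS_ID] + [PAD_ID] * len(text_ids) + [SEP_ID]
    return input_ids, ref_ids


# Hypothetical token IDs for a three-token text
input_ids, ref_ids = build_input_and_reference([2054, 2003, 2023])
print(input_ids)
print(ref_ids)
```

Because the two sequences line up position by position, each attribution can be read as the effect of replacing padding with the actual token at that position.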

# Attribution analysis with Captum

#### Helper functions

In [6]:
def predict(inputs):
    """Run a forward pass through the model and return the logits."""
    return model(inputs)[0]

def construct_input_ref_pair(text, ref_token_id, sep_token_id, cls_token_id):
    """Prepare input and reference pairs required for Captum's attribution analysis."""
    text_ids = tokenizer.encode(text, add_special_tokens=False)
    input_ids = [cls_token_id] + text_ids + [sep_token_id]
    ref_input_ids = [cls_token_id] + [ref_token_id] * len(text_ids) + [sep_token_id]
    return torch.tensor([input_ids], device=device), torch.tensor([ref_input_ids], device=device)


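def custom_forward(inputs):
    """Return the class-0 softmax probability as a 1-element tensor, the output Captum attributes."""
    preds = predict(inputs)
    # NOTE: [0][0] reads row 0 only. If the forward pass receives several rows
    # (for example a batch of interpolation steps), a per-row output such as
    # torch.softmax(preds, dim=1)[:, 0] may be what is intended.
    return torch.softmax(preds, dim=1)[0][0].unsqueeze(-1)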


def summarize_attributions(attributions):
    """Sum attributions over the embedding dimension and scale the result to unit L2 norm."""
    attributions = attributions.sum(dim=-1).squeeze(0)
    return attributions / torch.norm(attributions)
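
To make the reduction in `summarize_attributions` concrete, the sketch below reproduces it in plain Python on a made-up attribution matrix of 3 tokens by 2 embedding dimensions. The numbers and the function name `summarize` are invented for illustration. Each token's attributions are summed over the embedding dimension, then the resulting vector is divided by its L2 norm.

```python
import math

# Hypothetical per-token, per-dimension attributions: 3 tokens x 2 embedding dims
token_attributions = [
    [0.2, 0.1],   # token 0
    [-0.5, 0.1],  # token 1
    [0.0, 0.0],   # token 2
]


def summarize(attributions):
    """Sum each token's attributions, then scale the vector to unit L2 norm."""
    per_token = [sum(row) for row in attributions]
    norm = math.sqrt(sum(v * v for v in per_token))
    return [v / norm for v in per_token]


scores = summarize(token_attributions)
print(scores)       # roughly [0.6, -0.8, 0.0]
print(sum(scores))  # the kind of total reported as "Attribution Score"
```

Since every text is rescaled by its own norm, the per-token scores show relative importance within one text; their magnitudes are not directly comparable across texts.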

### Analyze the text

In [7]:
def analyze_texts(texts):
    """Analyze given texts to visualize model attributions and their impact on predictions."""
    lig = LayerIntegratedGradients(custom_forward, model.bert.embeddings)

    for text in texts:
        input_ids, ref_input_ids = construct_input_ref_pair(text, ref_token_id, sep_token_id, cls_token_id)
        score = predict(input_ids)

        attributions, delta = lig.attribute(inputs=input_ids, baselines=ref_input_ids, return_convergence_delta=True)
        attributions_sum = summarize_attributions(attributions)

        indices = input_ids[0].detach().tolist()
        all_tokens = tokenizer.convert_ids_to_tokens(indices)

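        # Prepare data for visualization (arguments are positional)
        probs = torch.softmax(score, dim=1)[0]
        score_vis = viz.VisualizationDataRecord(
            attributions_sum,        # word_attributions
            probs[0],                # pred_prob: probability of class 0, whatever the predicted class
            torch.argmax(probs),     # pred_class
            0,                       # true_class: hardcoded to 0, not read from test_df
            text,                    # attr_class: the raw text is shown in this column
            attributions_sum.sum(),  # attr_score
            all_tokens,              # raw_input_ids: tokens to display
            delta)                   # convergence_score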

        print('\033[1m', 'Visualization For Score', '\033[0m')
        viz.visualize_text([score_vis])
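
`lig.attribute` above returns both attributions and a convergence delta. As a rough illustration of what those two quantities mean, here is a minimal pure-Python sketch of integrated gradients on a toy two-feature function. It is not Captum's implementation, and Captum's own approximation method and internals may differ. The function `f`, its weights, the inputs, and the step count are all made up for the example. The delta is computed as the sum of the attributions minus `f(x) - f(baseline)`.

```python
import math


def f(x):
    """Toy 'model': a sigmoid over a weighted sum of two features."""
    z = 2.0 * x[0] - 1.0 * x[1]
    return 1.0 / (1.0 + math.exp(-z))


def grad_f(x):
    """Analytic gradient of f with respect to each feature."""
    s = f(x)
    return [2.0 * s * (1.0 - s), -1.0 * s * (1.0 - s)]


def integrated_gradients(x, baseline, n_steps=50):
    """Approximate IG with a midpoint Riemann sum along the straight path."""
    avg_grad = [0.0] * len(x)
    for k in range(n_steps):
        alpha = (k + 0.5) / n_steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            avg_grad[i] += g / n_steps
    # Scale the averaged gradient by the distance from the baseline
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [1.0, 0.5]
baseline = [0.0, 0.0]
attributions = integrated_gradients(x, baseline)
delta = sum(attributions) - (f(x) - f(baseline))
print(attributions, delta)
```

A delta close to zero means the attributions add up to the change in output between baseline and input; a large delta suggests the approximation (or the forward function being attributed) deserves a closer look.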

### Perform analysis on texts with the lowest and highest model predictions

In [8]:
print("Analyzing lowest texts:")
analyze_texts(lowest_texts)

Analyzing lowest texts:
Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,0 (0.96),"Anhui Conch Cement boasts a strong financial position. As at end-2022, the companys net gearing ratio was one of the lowest among Chinas leading cement makers, given its net cash position. The firms adjusted EBITDA/interest expense ratio has been improving from 10 times in 2012 to more than 40 times in 2022. Such a strong balance sheet implies little solvency risk and allows the company to issue debts to fund future acquisitions, if necessary.",3.03,"[CLS] an ##hui conc ##h cement boa ##sts a strong financial position . as at end - 2022 , the companys net gear ##ing ratio was one of the lowest among china ##s leading cement makers , given its net cash position . the firms adjusted ebitda / interest expense ratio has been improving from 10 times in 2012 to more than 40 times in 2022 . such a strong balance sheet implies little solvency risk and allows the company to issue debts to fund future acquisitions , if necessary . [SEP]"


Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,0 (0.96),"Veolias net debt was EUR 18.1 billion at year-end 2022, largely above the EUR 9.5 billion at the end of 2021 due to the acquisition of Suez that was completed in January 2022. We project net debt to decrease to EUR 17.5 billion by 2027 as operating cash flow will more than cover investments and dividends. This will entail a decrease in the net debt/EBITDA ratio from 2.9 in 2022 to 2.2 in 2027which is below the groups ceiling of 3leaving headroom for acquisitions. We forecast the dividend to grow by 9.7% per year on average between 2022 and 2027, in line with current EPS growth, as targeted by the group through 2025 and implying a decent average payout of 66%. This points to a 2027 dividend of EUR 1.78.",1.52,"[CLS] ve ##olia ##s net debt was eur 18 . 1 billion at year - end 2022 , largely above the eur 9 . 5 billion at the end of 2021 due to the acquisition of suez that was completed in january 2022 . we project net debt to decrease to eur 17 . 5 billion by 2027 as operating cash flow will more than cover investments and dividends . this will entail a decrease in the net debt / ebitda ratio from 2 . 9 in 2022 to 2 . 2 in 2027 ##wh ##ich is below the groups ceiling of 3 ##le ##aving head ##room for acquisitions . we forecast the dividend to grow by 9 . 7 % per year on average between 2022 and 2027 , in line with current eps growth , as targeted by the group through 2025 and imply ##ing a dece ##nt average payout of 66 % . this points to a 2027 dividend of eur 1 . 78 . [SEP]"


Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,0 (0.95),"We think Zhonghuans financial status is healthy.Zhonghuans debt to equity ratio ranged from 0.3 times to 0.5 times between 2018 and 2021 but jumped to 0.8 times in 2022 due to large debt-financed capital expenditure. The interest coverage ratio increased to 8.1 times in 2022 from 2.1 times in 2018, and the current ratio rose to 1.4 times in 2022 from 0.8 times in 2018, suggesting little risk in meeting near-term obligations.",-0.04,"[CLS] we think zhong ##hua ##ns financial status is healthy . zhong ##hua ##ns debt to equity ratio ranged from 0 . 3 times to 0 . 5 times between 2018 and 2021 but jump ##ed to 0 . 8 times in 2022 due to large debt - financed capital expenditure . the interest coverage ratio increased to 8 . 1 times in 2022 from 2 . 1 times in 2018 , and the current ratio rose to 1 . 4 times in 2022 from 0 . 8 times in 2018 , suggesting little risk in meeting near - term obligations . [SEP]"


In [9]:
print("Analyzing highest texts:")
analyze_texts(highest_texts)

Analyzing highest texts:
Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,1 (0.12),"LiveRamp ended fiscal 2023 with approximately $497 million in cash and cash equivalents and no debt. The firm generated operating cash flow of $34.4 million. We expect the firm to generate cash from operations due to revenue growth and margin expansion going forward. We have modeled a 32% average annual growth in cash from operations. Capital expenditure is likely to stay between 1% and 1.5% of revenue given the firms cloud-based structure. LiveRamp generated free cash flow for the first time during fiscal 2022, which we expect will continue. We expect free cash flow to be 11.5% of revenue in fiscal 2028 from 5% in fiscal 2023.",1.18,"[CLS] liver ##amp ended fiscal 2023 with approximately $ 497 million in cash and cash equivalents and no debt . the firm generated operating cash flow of $ 34 . 4 million . we expect the firm to generate cash from operations due to revenue growth and margin expansion going forward . we have modeled a 32 % average annual growth in cash from operations . capital expenditure is likely to stay between 1 % and 1 . 5 % of revenue given the firms cloud - based structure . liver ##amp generated free cash flow for the first time during fiscal 2022 , which we expect will continue . we expect free cash flow to be 11 . 5 % of revenue in fiscal 2028 from 5 % in fiscal 2023 . [SEP]"


Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,1 (0.13),"Broadridges financial health is sound, in our view. As of June 30, 2023, Broadridge had debt of approximately $3.4 billion, putting gross debt/adjusted EBITDA at about 2.4 times. We're not concerned about the firm's leverage, given the firm's resilient business model. Of the $4.2 billion in fee revenue that Broadridge generated in fiscal 2023, over 90% was classified as recurring. Also, during the last financial crisis, equity proxy position count was flat to slightly negative and mutual fund/exchange-traded fund positions grew.",-5.31,"[CLS] broad ##ridge ##s financial health is sound , in our view . as of june 30 , 2023 , broad ##ridge had debt of approximately $ 3 . 4 billion , putting gross debt / adjusted ebitda at about 2 . 4 times . we ' re not concerned about the firm ' s leverage , given the firm ' s resilient business model . of the $ 4 . 2 billion in fee revenue that broad ##ridge generated in fiscal 2023 , over 90 % was classified as recurring . also , during the last financial crisis , equity proxy position count was flat to slightly negative and mutual fund / exchange - traded fund positions grew . [SEP]"


Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,1 (0.15),"F5 is a financially sound firm, in our opinion. The company has a strong balance sheet, has solidly generated strong free cash flow, and possesses a war chest of cash. As of Sept. 30, 2023, F5 possessed $803 million in cash and investments and no debt on its books. . In our view, F5 could continue to pursue possible acquisitions, potentially targeting software and cloud-based application delivery, security, management, and analytics technologies. F5 has a solid track record of repurchasing shares and expects to return half of its free cash flow to investors via share buybacks starting in fiscal 2023. The company has never paid a dividend, and we do not foresee F5 paying one in our explicit forecast. With its strong balance sheet and free cash generation, we expect F5 to continue bolting on acquisitions in areas that support application delivery and security.",0.31,"[CLS] f ##5 is a financially sound firm , in our opinion . the company has a strong balance sheet , has solid ##ly generated strong free cash flow , and possesses a war chest of cash . as of sept . 30 , 2023 , f ##5 possessed $ 803 million in cash and investments and no debt on its books . . in our view , f ##5 could continue to pursue possible acquisitions , potentially targeting software and cloud - based application delivery , security , management , and analytics technologies . f ##5 has a solid track record of repurchasing shares and expects to return half of its free cash flow to investors via share buyback ##s starting in fiscal 2023 . the company has never paid a dividend , and we do not foresee f ##5 paying one in our explicit forecast . with its strong balance sheet and free cash generation , we expect f ##5 to continue bolt ##ing on acquisitions in areas that support application delivery and security . [SEP]"


### Manually add and analyze a specific custom text

In [10]:
custom_text = [
    "As of June 2023, AMD has $6.3 billion in cash and cash equivalents against total debt of $2.5 billion. AMD took on debt to acquire Xilinx, but Xilinx generates healthy cash flow, and now that AMD has gained meaningful market share in PC and server CPUs, we are comfortable that AMD will generate healthy free cash flow and should be able to work down its debt obligations over time. AMD does not pay a dividend but has bought back shares in recent years as part of a share-repurchase program announced in Feb 2022the company has about $6 billion remaining as of June 2023. Wed expect any capital distributions in the years ahead to be done via additional buybacks as part of this program."]
analyze_texts(custom_text)

Visualization For Score


True Label,Predicted Label,Attribution Label,Attribution Score,Word Importance
0.0,1 (0.29),"As of June 2023, AMD has $6.3 billion in cash and cash equivalents against total debt of $2.5 billion. AMD took on debt to acquire Xilinx, but Xilinx generates healthy cash flow, and now that AMD has gained meaningful market share in PC and server CPUs, we are comfortable that AMD will generate healthy free cash flow and should be able to work down its debt obligations over time. AMD does not pay a dividend but has bought back shares in recent years as part of a share-repurchase program announced in Feb 2022the company has about $6 billion remaining as of June 2023. Wed expect any capital distributions in the years ahead to be done via additional buybacks as part of this program.",0.87,"[CLS] as of june 2023 , amd has $ 6 . 3 billion in cash and cash equivalents against total debt of $ 2 . 5 billion . amd took on debt to acquire xilinx , but xilinx generates healthy cash flow , and now that amd has gained meaningful market share in pc and server cpu ##s , we are comfortable that amd will generate healthy free cash flow and should be able to work down its debt obligations over time . amd does not pay a dividend but has bought back shares in recent years as part of a share - repurchase program announced in feb 2022 ##the company has about $ 6 billion remaining as of june 2023 . wed expect any capital distributions in the years ahead to be done via additional buyback ##s as part of this program . [SEP]"
