# Sign Up and Log In to Hugging Face

To use models hosted on Hugging Face, you need your own account and an access token. Follow the steps below:

1. Create an account on [Hugging Face](https://huggingface.co/).
2. Click your profile picture (top right).
3. Select "Access Tokens".
4. Select "Create new token".
5. Enter a name for your token in the "Token name" field.
6. Scroll down the page and select "Create token" (**please don't select any other option before you click the "Create token" button**).
7. Copy the token into this notebook.

If you run into problems, see the step-by-step guide on [Notion](https://sideways-perfume-247.notion.site/Ollama-FinBERT-Installation-1c8cdd37722280b4bb5cfc04b66db8c4?pvs=4).

In [None]:
pip install transformers huggingface_hub

In [None]:
from huggingface_hub import login

# Replace the placeholder with your own token; never share or commit a real token
HF_TOKEN = "DDDDDDDDDD"
login(token=HF_TOKEN, add_to_git_credential=True)
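
Hard-coding a token makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the token is stored in an environment variable named `HF_TOKEN` and falling back to a hidden prompt; `resolve_token` is a hypothetical helper, not part of `huggingface_hub`:

```python
import os
from getpass import getpass


def resolve_token(env=os.environ, prompt=getpass):
    """Return the token from the HF_TOKEN environment variable, or ask for it."""
    token = env.get("HF_TOKEN")
    if not token:
        # getpass hides the typed characters, so the token is not echoed
        token = prompt("Hugging Face token: ")
    return token.strip()


# Usage in the notebook (not executed here):
# from huggingface_hub import login
# login(token=resolve_token())
```

The `env` and `prompt` parameters exist only so the helper can be tried without a real token.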

In [None]:
!huggingface-cli whoami

HuiChing


# Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]

Collecting datasets
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2025.3.0,>=2023.1.0 (from fsspec[http]<=2025.3.0,>=2023.1.0->datasets)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.5.1-py3-none-any.whl (491 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

## Pipeline

The pipelines are a great and easy way to use models for inference. These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use.

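The most basic object in the Transformers library is the `pipeline()` function, which bundles a pretrained model with its preprocessing and postprocessing steps. We only need to pass in text to get the expected answer. Commonly used pipelines include:

* `sentiment-analysis`: sentiment analysis
* `zero-shot-classification`: classification without task-specific training examples
* `text-generation`: text generation
* `fill-mask`: filling in masked words or spans
* `ner`: named entity recognition
* `question-answering`: question answering
* `summarization`: summarization
* `translation`: machine translation
* `feature-extraction`: obtaining a vector representation of text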


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Device set to use cuda:0


[{'label': 'POSITIVE', 'score': 0.9598046541213989}]

In [None]:
classifier(
    ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"]
)

[{'label': 'POSITIVE', 'score': 0.9598046541213989},
 {'label': 'NEGATIVE', 'score': 0.9994558691978455}]

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0


{'sequence': 'This is a course about the Transformers library',
 'labels': ['education', 'business', 'politics'],
 'scores': [0.844595193862915, 0.11197695881128311, 0.04342786595225334]}
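
In the output above, the `labels` and `scores` lists are aligned position by position. A small sketch of reading such a result, using the values shown above with scores rounded; `top_label` is a hypothetical helper:

```python
# Output of the zero-shot pipeline above, with scores rounded for readability
result = {
    "sequence": "This is a course about the Transformers library",
    "labels": ["education", "business", "politics"],
    "scores": [0.8446, 0.1120, 0.0434],
}


def top_label(zero_shot_result):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(zero_shot_result["labels"], zero_shot_result["scores"])
    return max(pairs, key=lambda pair: pair[1])


label, score = top_label(result)
print(f"{label}: {score:.2%}")
```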

In [None]:
from transformers import pipeline

generator = pipeline("text-generation")
generator("In this course, we will teach you how to")

No model was supplied, defaulted to openai-community/gpt2 and revision 607a30d (https://huggingface.co/openai-community/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cuda:0
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to connect the dots between our personal experience and professional opportunities and the jobs we fill, and how you can find your work within the best environment for you to succeed.'}]

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)

Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'In this course, we will teach you how to use the web browser and how to use the following skills to keep your computer running smoothly so that your'},
 {'generated_text': 'In this course, we will teach you how to convert to Catholicism via our own resources. In many ways, our goal is to help you understand how'}]

In [None]:
from transformers import pipeline

unmasker = pipeline("fill-mask")
unmasker("This course will teach you all about <mask> models.", top_k=10)

No model was supplied, defaulted to distilbert/distilroberta-base and revision fb53ab8 (https://huggingface.co/distilbert/distilroberta-base).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at distilbert/distilroberta-base were not used when initializing RobertaForMaskedLM: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


[{'score': 0.19619838893413544,
  'token': 30412,
  'token_str': ' mathematical',
  'sequence': 'This course will teach you all about mathematical models.'},
 {'score': 0.04052715003490448,
  'token': 38163,
  'token_str': ' computational',
  'sequence': 'This course will teach you all about computational models.'},
 {'score': 0.03301798552274704,
  'token': 27930,
  'token_str': ' predictive',
  'sequence': 'This course will teach you all about predictive models.'},
 {'score': 0.031941574066877365,
  'token': 745,
  'token_str': ' building',
  'sequence': 'This course will teach you all about building models.'},
 {'score': 0.024522822350263596,
  'token': 3034,
  'token_str': ' computer',
  'sequence': 'This course will teach you all about computer models.'},
 {'score': 0.023129606619477272,
  'token': 774,
  'token_str': ' role',
  'sequence': 'This course will teach you all about role models.'},
 {'score': 0.019632143899798393,
  'token': 265,
  'token_str': ' business',
  'sequence': 'This course will teach you all about business models.'},
 ... (output truncated)

In [None]:
from transformers import pipeline

# aggregation_strategy="simple" replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


[{'entity_group': 'PER',
  'score': np.float32(0.9981694),
  'word': 'Sylvain',
  'start': 11,
  'end': 18},
 {'entity_group': 'ORG',
  'score': np.float32(0.9796019),
  'word': 'Hugging Face',
  'start': 33,
  'end': 45},
 {'entity_group': 'LOC',
  'score': np.float32(0.9932106),
  'word': 'Brooklyn',
  'start': 49,
  'end': 57}]
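
The `start` and `end` fields are character offsets into the input string, so the entities can be recovered by slicing. A short sketch using the entities from the output above (scores omitted); `group_by_type` is a hypothetical helper:

```python
from collections import defaultdict

text = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# Entities copied from the output above (scores omitted)
entities = [
    {"entity_group": "PER", "word": "Sylvain", "start": 11, "end": 18},
    {"entity_group": "ORG", "word": "Hugging Face", "start": 33, "end": 45},
    {"entity_group": "LOC", "word": "Brooklyn", "start": 49, "end": 57},
]


def group_by_type(text, entities):
    """Map each entity type to the text spans found for it."""
    grouped = defaultdict(list)
    for entity in entities:
        # start/end are character offsets into the original string
        grouped[entity["entity_group"]].append(text[entity["start"]:entity["end"]])
    return dict(grouped)


print(group_by_type(text, entities))
```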

In [None]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)

{'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}


In [None]:
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer(
    """
    America has changed dramatically during recent years. Not only has the number of
    graduates in traditional engineering disciplines such as mechanical, civil,
    electrical, chemical, and aeronautical engineering declined, but in most of
    the premier American universities engineering curricula now concentrate on
    and encourage largely the study of engineering science. As a result, there
    are declining offerings in engineering subjects dealing with infrastructure,
    the environment, and related issues, and greater concentration on high
    technology subjects, largely supporting increasingly complex scientific
    developments. While the latter is important, it should not be at the expense
    of more traditional engineering.

    Rapidly developing economies such as China and India, as well as other
    industrial countries in Europe and Asia, continue to encourage and advance
    the teaching of engineering. Both China and India, respectively, graduate
    six and eight times as many traditional engineers as does the United States.
    Other industrial countries at minimum maintain their output, while America
    suffers an increasingly serious decline in the number of engineering graduates
    and a lack of well-educated engineers.
"""
)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cuda:0


[{'summary_text': ' The number of engineering graduates in the United States has declined in recent years . China and India graduate six and eight times as many traditional engineers as the U.S. does . Rapidly developing economies such as China continue to encourage and advance the teaching of engineering . There are declining offerings in engineering subjects dealing with infrastructure, infrastructure, the environment, and related issues .'}]

In [None]:
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
translator("Ce cours est produit par Hugging Face.")

[{'translation_text': 'This course is produced by Hugging Face.'}]


In [None]:
from transformers import pipeline

sentence_vec = pipeline("feature-extraction")
sentence_vec("Hugging Face transformer rocks.")

No model was supplied, defaulted to distilbert/distilbert-base-cased and revision 6ea8117 (https://huggingface.co/distilbert/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.



Device set to use cuda:0


[[[0.3262096047401428,
   0.06740760058164597,
   -0.1800243854522705,
   -0.293072372674942,
   -0.3830827474594116,
   0.03734317421913147,
   0.08649615198373795,
   0.001205121399834752,
   -0.03756137937307358,
   -1.0100231170654297,
   -0.4392012059688568,
   0.23930427432060242,
   -0.09029210358858109,
   -0.0776762068271637,
   -0.6154975295066833,
   0.03333872929215431,
   0.2637574374675751,
   0.27685225009918213,
   -0.23289339244365692,
   -0.06546097248792648,
   0.13497962057590485,
   -0.18861621618270874,
   0.5004849433898926,
   -0.465809166431427,
   0.206257626414299,
   -0.015177843160927296,
   0.27487713098526,
   0.07840054482221603,
   -0.15195757150650024,
   0.32629838585853577,
   0.009691542945802212,
   0.3115546405315399,
   0.016048867255449295,
   0.09666934609413147,
   -0.1988566666841507,
   0.2858717143535614,
   -0.12985780835151672,
   -0.3907455801963806,
   0.0011373927118256688,
   -0.17238102853298187,
   -0.4628649652004242,
   ... (output truncated)

# Details inside the Pipeline

## Tokenization
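Preprocessing with a tokenizer

Like other neural networks, Transformer models cannot process raw text directly, so the first step of our pipeline is to convert the text input into numbers the model can understand. To do this we use a tokenizer, which is responsible for:

* Splitting the input into words, subwords, or symbols (such as punctuation), which are called tokens
* Mapping each token to an integer
* Adding other inputs that may be useful to the model

All of this preprocessing must be done in exactly the same way as when the model was pretrained, so we first need to download that information from the Model Hub. To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Given the checkpoint name of our model, it automatically fetches the data associated with the model's tokenizer and caches it, so it is downloaded only the first time you run the code below.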
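
The three tokenizer steps can be sketched with a toy example. This is only an illustration under stated assumptions: the vocabulary is hypothetical, the splitting is plain whitespace and punctuation, and the `[CLS]`/`[SEP]`/`[UNK]` markers mimic BERT-style special tokens. Real tokenizers use learned subword vocabularies instead.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}


def toy_tokenize(text):
    """Step 1: split lowercased text into word and punctuation tokens."""
    tokens = []
    for word in text.lower().split():
        core = word.rstrip("!?.,")
        if core:
            tokens.append(core)
        tokens.extend(word[len(core):])  # each trailing punctuation mark is its own token
    return tokens


def toy_encode(text):
    """Steps 2 and 3: map tokens to integers and add special tokens."""
    tokens = ["[CLS]"] + toy_tokenize(text) + ["[SEP]"]
    return [VOCAB.get(token, VOCAB["[UNK]"]) for token in tokens]


print(toy_tokenize("I hate this so much!"))
print(toy_encode("I hate this so much!"))
```

Words missing from the vocabulary fall back to the `[UNK]` id, which is one reason real tokenizers prefer subword units.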

In [None]:
from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
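
With `padding=True`, shorter sequences in the batch are padded to the length of the longest one, and the attention mask tells the model which positions are real tokens. A minimal sketch of that idea, assuming right-side padding and a pad id of 0; the token ids are hypothetical:

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every sequence to the longest one and build the attention mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two sentences of different length
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```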

You can try different models to see different tokenization approaches.

In [None]:
from transformers import AutoTokenizer

checkpoint = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

## Going through the model
We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an AutoModel class which also has a from_pretrained() method:

In [None]:
pip install huggingface_hub[hf_xet]

In [None]:
from transformers import AutoModel

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)

In [None]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer

In [None]:
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")


A high-dimensional vector?
The vector output by the Transformer module is usually large. It generally has three dimensions:

* Batch size: The number of sequences processed at a time (2 in our example).
* Sequence length: The length of the numerical representation of the sequence (16 in our example).
* Hidden size: The vector dimension of each model input.
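
The three dimensions can be made concrete with placeholder values. This sketch builds a nested list of zeros with the batch size and sequence length from the example; the hidden size of 768 is an assumption (the value used by DistilBERT-base), and `nested_shape` is a hypothetical helper:

```python
def nested_shape(values):
    """Return the shape of a regular, non-empty nested list."""
    shape = []
    while isinstance(values, list):
        shape.append(len(values))
        values = values[0]
    return tuple(shape)


# Placeholder zeros standing in for real hidden states
batch_size, sequence_length, hidden_size = 2, 16, 768
hidden_states = [[[0.0] * hidden_size for _ in range(sequence_length)]
                 for _ in range(batch_size)]
print(nested_shape(hidden_states))
```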

In [None]:
outputs = model(**inputs)
outputs

In [None]:
print(outputs.last_hidden_state)

In [None]:
print(outputs.last_hidden_state.shape)

In [None]:
from transformers import AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
outputs

In [None]:
import torch
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

In [None]:
print(model.config.id2label)
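
Softmax turns the raw logits into probabilities, and `id2label` turns the winning index into a readable label. A plain-Python sketch of both steps; the logits are illustrative values assumed for the two example sentences, not read from this notebook, and the label mapping is assumed to match what `model.config.id2label` reports for this checkpoint:

```python
import math


def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shifted = [x - max(logits) for x in logits]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for the two example sentences (assumed values)
batch_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
# Assumed label mapping of the SST-2 sentiment checkpoint
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

for logits in batch_logits:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(id2label[best], round(probs[best], 4))
```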

Llama 4

* Downloading the model may take considerable time and disk space.

In [None]:
pip install --upgrade transformers

In [None]:
from transformers import AutoModel, AutoTokenizer
checkpoint = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)

# Applications: ClimateBERT



This cell shows how to load a pre-trained model from Hugging Face. You can also find the Python code for loading the model in the [ClimateBERT model card](https://huggingface.co/climatebert/distilroberta-base-climate-sentiment).

We are going to apply sentiment analysis to trademark text using **ClimateBERT**. This model, a variant of DistilRoBERTa fine-tuned on climate-related text, outputs a sentiment label (opportunity, neutral, or risk) and a probability score for each text.

In [None]:
# If you encounter an import error, try upgrading Python to 3.11.x or run the notebook on Google Colab
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline
from transformers.pipelines.pt_utils import KeyDataset
from tqdm.auto import tqdm

model_name = "climatebert/distilroberta-base-climate-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, max_len=512)

climateBERT_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)


Device set to use cpu


In [None]:
# sample dataset
import pandas as pd
trademark = pd.read_csv("/content/gtm_sample.csv")  # small sample only, because inference can take a long time
trademark.iloc[0:10,:]

Unnamed: 0,id,text
0,88299029,"Technical analysis, planning, continuous monit..."
1,79176667,Shoulder bags; handbags; Boston bags; [ waist ...
2,86740530,"Software for the creation, storage and retriev..."
3,73434623,Electric Motors for Space Vehicles
4,79362423,Voltage regulators; battery chargers; electric...
5,86825134,Computer programs for controlling lighting fix...
6,79244114,Surveying instruments; weighing apparatus and ...
7,77697608,Engineering services for building and property...
8,88847445,Dietary and nutritional supplements for endura...
9,97082186,Research and development in the field of energ...


In [None]:
# The model outputs a label and a score for each text.
# 'label' is the predicted category: opportunity, neutral, or risk.
# 'score' is the predicted probability of that label (between 0.333 and 1 for three classes).

# Run the pipeline once per row and reuse the result for both columns.
results = trademark['text'].apply(lambda x: climateBERT_pipe(x)[0])
trademark['label'] = results.apply(lambda r: r['label'])
trademark['score'] = results.apply(lambda r: r['score'])
trademark

Unnamed: 0,id,text,label,score
0,88299029,"Technical analysis, planning, continuous monit...",neutral,0.855523
1,79176667,Shoulder bags; handbags; Boston bags; [ waist ...,neutral,0.547645
2,86740530,"Software for the creation, storage and retriev...",risk,0.778602
3,73434623,Electric Motors for Space Vehicles,opportunity,0.441063
4,79362423,Voltage regulators; battery chargers; electric...,neutral,0.498441
5,86825134,Computer programs for controlling lighting fix...,opportunity,0.410641
6,79244114,Surveying instruments; weighing apparatus and ...,neutral,0.492469
7,77697608,Engineering services for building and property...,neutral,0.555585
8,88847445,Dietary and nutritional supplements for endura...,opportunity,0.623114
9,97082186,Research and development in the field of energ...,opportunity,0.717397
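
Pipelines also accept a list of texts, as in the sentiment example earlier in this notebook, so a whole column can be classified in a single call. A hedged sketch of that pattern: `classify_texts` is a hypothetical helper, and `fake_classifier` is a stand-in used here so the example runs without downloading a model.

```python
def classify_texts(texts, classify):
    """Run `classify` once on the whole list and split labels from scores."""
    results = classify(list(texts))  # a single batched call for all rows
    labels = [r["label"] for r in results]
    scores = [r["score"] for r in results]
    return labels, scores


def fake_classifier(texts):
    """Stand-in for a text-classification pipeline, for illustration only."""
    return [{"label": "neutral", "score": 0.5} for _ in texts]


labels, scores = classify_texts(["Electric motors", "Shoulder bags"], fake_classifier)
print(labels, scores)

# With the real pipeline (not executed here):
# trademark["label"], trademark["score"] = classify_texts(trademark["text"], climateBERT_pipe)
```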


# Applications: FinBERT

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)


Device set to use cpu


In [None]:
import pandas as pd
fintext = pd.read_csv("/content/fintext.csv")  # only 5 rows, because inference can take a long time
fintext.iloc[0:10,:]

Unnamed: 0,ID,text
0,1,The company's earnings call revealed strong pe...
1,2,Recent financial reports indicate a continued ...
2,3,"As the global economy recovers, investor inter..."
3,4,Due to supply chain disruptions and rising raw...
4,5,Latest economic data shows persistent inflatio...


In [None]:
# To do 3: classify the text of the finance report
# Run the pipeline once per row and reuse the result for both columns.
results = fintext['text'].apply(lambda x: pipe(x)[0])
fintext['fin_label'] = results.apply(lambda r: r['label'])
fintext['fin_score'] = results.apply(lambda r: r['score'])
fintext

Unnamed: 0,ID,text,fin_label,fin_score
0,1,The company's earnings call revealed strong pe...,positive,0.95191
1,2,Recent financial reports indicate a continued ...,positive,0.953901
2,3,"As the global economy recovers, investor inter...",positive,0.922274
3,4,Due to supply chain disruptions and rising raw...,negative,0.974687
4,5,Latest economic data shows persistent inflatio...,negative,0.964536


# Applications: BERT for patents

In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="anferico/bert-for-patents")




Device set to use cuda:0


In [None]:
# https://patents.google.com/patent/JPH01235865A/en?oq=1235865
abs_1 = "To detect an electromagnetic wave caused by a corona generated from inside of an electric apparatus in the manner of distinguishing from the noise come from the outside by detecting a signal corresponding to the difference between receiving signals of two antennas which are provided in each location having a high and a weak enough receiving sensitivity. CONSTITUTION:An antenna 15 is provided in the location near the electric apparatus, where the electromagnetic wave caused by the corona discharge of the electric apparatus is sensitively receivable. An antenna 16 is provided in the location at an appropriate distance from the electric apparatus, where the receiving sensitivity for the electromagnetic wave caused by the corona discharge inside of the electric apparatus is low enough, and also in such a location that the receiving sensitivity for an outcoming noise source is made to be equivalent to that of the antenna 15. The signal corresponding to the difference between receiving signals of the antennas 15 and 16 obtained by circuits 17, 18 is amplified by a differential amplifier 19. Thereby, only the electromagnetic wave caused by the corona discharge inside of the electric apparatus can be taken out by actually negating the outcoming noise, so the generation of the insulation abnormality inside of the electric apparatus is surely detected by means of comparing the signal with a reference level in a decision circuit 20."

abs_2 = "To detect an electromagnetic sensor caused by a corona generated from inside of an electric apparatus in the manner of distinguishing from the noise come from the outside by detecting a signal corresponding to the difference between receiving signals of two antennas which are provided in each location having a high and a weak enough receiving sensitivity. CONSTITUTION:An antenna 15 is provided in the location near the electric apparatus, where the electromagnetic wave caused by the corona discharge of the electric apparatus is sensitively receivable. An antenna 16 is provided in the location at an appropriate distance from the electric apparatus, where the receiving sensitivity for the electromagnetic wave caused by the corona discharge inside of the electric apparatus is low enough, and also in such a location that the receiving sensitivity for an outcoming noise source is made to be equivalent to that of the antenna 15. The signal corresponding to the difference between receiving signals of the antennas 15 and 16 obtained by circuits 17, 18 is amplified by a differential amplifier 19. Thereby, only the electromagnetic wave caused by the corona discharge inside of the electric apparatus can be taken out by actually negating the outcoming noise, so the generation of the insulation abnormality inside of the electric apparatus is surely detected by means of comparing the signal with a reference level in a decision circuit 20."

# https://patents.google.com/patent/AU2020201868B2/en?inventor=Hiromasa+Iwashita
abs_3 = "Provided is a plastic bottle that can be depressed and restored without damaging the bottle shoulder portion even with thin plastic bottles and that can effectively handle top loads. A shoulder portion (3) has, in order from the top, a first circumferential rib (311), a second circumferential rib (312), and a third circumferential rib (313), each being annular, on the same axis as an opening (2) and is configured so that a plastic bottle (1) transitions to a depressed state in which the bottle is depressed down by deformation that starts at the first circumferential rib (311), the second circumferential rib (312), and the third circumferential rib (313) when a top load (F) is acting on the bottle and so that the depressed state can be maintained even after the top load (F) is removed."


In [None]:
# Mean-pool the token embeddings into a single vector per abstract
vec_1 = pipe(abs_1, return_tensors="pt")[0].numpy().mean(axis=0)
vec_2 = pipe(abs_2, return_tensors="pt")[0].numpy().mean(axis=0)
vec_3 = pipe(abs_3, return_tensors="pt")[0].numpy().mean(axis=0)
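
The feature-extraction pipeline returns one vector per token, and `.mean(axis=0)` averages them into a single fixed-size vector per abstract. A plain-Python sketch of that mean pooling on hypothetical values (3 tokens, 4 dimensions); `mean_pool` is an illustrative helper:

```python
def mean_pool(token_vectors):
    """Average token vectors column by column into one fixed-size vector."""
    n_tokens = len(token_vectors)
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


# Hypothetical 3-token text with 4-dimensional embeddings
tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]
print(mean_pool(tokens))
```

The pooled vector has the same length regardless of how many tokens the text contains, which is what makes abstracts of different lengths comparable.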

In [None]:
vec_1

array([ 0.5160622 ,  0.09970355, -0.02904806, ..., -0.9063538 ,
       -0.26012313, -1.2861388 ], dtype=float32)

In [None]:
len(vec_1)

1024

In [None]:
# Cosine similarity between the mean-pooled abstract embeddings computed above

import numpy as np

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

similarity_1_2 = cosine_similarity(vec_1, vec_2)
similarity_1_3 = cosine_similarity(vec_1, vec_3)
similarity_2_3 = cosine_similarity(vec_2, vec_3)

print(f"Cosine similarity between abs_1 and abs_2: {similarity_1_2}")
print(f"Cosine similarity between abs_1 and abs_3: {similarity_1_3}")
print(f"Cosine similarity between abs_2 and abs_3: {similarity_2_3}")


Cosine similarity between abs_1 and abs_2: 0.9997245073318481
Cosine similarity between abs_1 and abs_3: 0.8104997277259827
Cosine similarity between abs_2 and abs_3: 0.8098441362380981
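
With more than a handful of abstracts it is easier to compute every pairwise similarity at once. The sketch below is a dependency-free illustration that uses short toy vectors in place of the 1024-dimensional embeddings; `pairwise_cosine` and `toy_vectors` are hypothetical names, not part of the notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_cosine(vectors):
    """Return an n x n nested list of cosine similarities."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy stand-ins for vec_1, vec_2, vec_3
toy_vectors = [
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.9],
    [0.0, 1.0, 0.0],
]
sim_matrix = pairwise_cosine(toy_vectors)
for row in sim_matrix:
    print([round(s, 3) for s in row])
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle carries information. Mean-pooled BERT embeddings tend to give fairly high cosine values even for unrelated texts, so the ranking of pairs is usually more informative than the absolute numbers.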


In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("feature-extraction", model="anferico/bert-for-patents")


# Application: Gender Classification by Name

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer from the Hub
model_name = "imranali291/genderize"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Define a mapping for the predicted numeric label
label_map = {
    0: "F",
    1: "M"
}

# Example inference function
def predict_gender(name):
    inputs = tokenizer(name, return_tensors="pt", padding=True, truncation=True, max_length=32)
    outputs = model(**inputs)
    predicted_label = outputs.logits.argmax(dim=-1).item()
    #return model.config.id2label[predicted_label]
    #return label_map[predicted_label]
    return predicted_label

print(predict_gender("Alex"))   # Prints the class index: 1 (maps to 'M' in label_map)
print(predict_gender("Maria"))  # Prints the class index: 0 (maps to 'F' in label_map)

1
0


In [None]:
inputs = tokenizer('Hui-Ching Chuang', return_tensors="pt", padding=True, truncation=True, max_length=32)
outputs = model(**inputs)
outputs.logits.argmax(dim=-1).item()

1

In [None]:
inputs = tokenizer('Mary', return_tensors="pt", padding=True, truncation=True, max_length=32)
outputs = model(**inputs)
outputs.logits.argmax(dim=-1).item()

0
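
The cells above print raw class indices. Here is a small sketch of how the two logits could be turned into a readable label plus a confidence. It assumes the `label_map` from the earlier cell (0 is "F", 1 is "M") matches the model's training labels; `label_with_confidence` and the example logits are hypothetical.

```python
import math

label_map = {0: "F", 1: "M"}  # assumed to match the model's label order

def label_with_confidence(logits, label_map):
    """Softmax the logits and return (label, probability) of the top class."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return label_map[best], probs[best]

# Made-up logits standing in for outputs.logits[0].tolist()
label, prob = label_with_confidence([0.2, 1.5], label_map)
print(label, round(prob, 3))
```

Reporting the probability alongside the label makes it easier to spot names where the model is close to undecided.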

# Application: GPT usage scores
Reference: Sheng, J., Sun, Z., Yang, B., & Zhang, A. L. (2024). Generative AI and asset management. Available at SSRN 4786575.


## Meta Llama

In [None]:
pip install transformers accelerate torch sentencepiece

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

In [None]:
# load conference call data
import pandas as pd
earning_call_2001 = pd.read_pickle("/content/earning_call_2001_master.pkl")
earning_call = earning_call_2001.iloc[:10].copy()  # copy so that adding columns later does not write into a slice
earning_call.columns

FileNotFoundError: [Errno 2] No such file or directory: '/content/earning_call_2001_master.pkl'

In [None]:
earning_call['n_chars'] = earning_call['Text'].str.len()
earning_call

Unnamed: 0,Section ID,Name,Company,Job Title,Text,section,Participant Type,Meeting Time,File Date,Company Ticker,Event ID,num_coporate_participatns,num_conference_participatns,num_sec_presentation,num_sec_QA,ChatGPT_Investment_Score,n_chars
0,1,Unidentified Audience Member,,,1,Presentation,,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,1
1,2,Operator,,,Ladies and gentlemen thank you for standing by...,Presentation,,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,679
2,3,RICHARD M. SCHULZE,,,Thank you Lea. Good morning everyone. Thank ...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,3534
3,4,DARREN JACKSON,,,"Thanks Dick, and good morning everyone. My co...",Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,12511
4,5,RICHARD M. SCHULZE,,,Thanks Darren. First I'd like to provide some ...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,5538
5,6,Operator,,,"Ladies and gentlemen, if you have a question p...",Presentation,,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,579
6,7,DAN WEWER,,,Good morning. A question about the appliance p...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,358
7,8,RICHARD M. SCHULZE,,,"Wade, do you want to respond to this?",Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,37
8,9,WADE R. FENN,,,There is clearly elasticity in this kind of of...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,696
9,10,DAN WEWER,,,And just one other question and I'll hang up. ...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0,181


In [None]:
earning_call['n_chars']

Unnamed: 0,n_chars
0,1
1,679
2,3534
3,12511
4,5538
5,579
6,358
7,37
8,696
9,181


In [None]:
transcript = earning_call.iloc[1]["Text"]
words = transcript.split()
num_words = len(words)
transcript

'Ladies and gentlemen thank you for standing by and welcome to the Best Buy fourth quarter and fiscal 2001 yearend conference call.  At this time, all participants are in a listen-only mode.  Later, we will conduct a question and answer session. At that time, if you have a question, you will need to press 1 on your touchtone phone.  As a reminder, this call is being recorded for playback, and will be available at noon, central time today.  If you need assistance on the call, please press 0 and then the * key and a specialist will assist you offline.  I would now like to turn the conference over to Mr. Richard M. Schulze, Chairman and CEO of Best Buy.  Please go ahead sir.'

In [None]:
def chunk_text_by_words(text: str, max_words: int = 10):
    words = text.split()
    return [
        " ".join(words[i:i+max_words])
        for i in range(0, len(words), max_words)
    ]



chunks     = chunk_text_by_words(transcript)

for i, c in enumerate(chunks, 1):
    preview = c[:200].replace("\n", " ")
    print(f"Chunk {i} (~{len(c.split())} words): {preview}…")

Chunk 1 (~10 words): Ladies and gentlemen thank you for standing by and welcome…
Chunk 2 (~10 words): to the Best Buy fourth quarter and fiscal 2001 yearend…
Chunk 3 (~10 words): conference call. At this time, all participants are in a…
Chunk 4 (~10 words): listen-only mode. Later, we will conduct a question and answer…
Chunk 5 (~10 words): session. At that time, if you have a question, you…
Chunk 6 (~10 words): will need to press 1 on your touchtone phone. As…
Chunk 7 (~10 words): a reminder, this call is being recorded for playback, and…
Chunk 8 (~10 words): will be available at noon, central time today. If you…
Chunk 9 (~10 words): need assistance on the call, please press 0 and then…
Chunk 10 (~10 words): the * key and a specialist will assist you offline.…
Chunk 11 (~10 words): I would now like to turn the conference over to…
Chunk 12 (~10 words): Mr. Richard M. Schulze, Chairman and CEO of Best Buy.…
Chunk 13 (~4 words): Please go ahead sir.…
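
Fixed-size chunks can cut a phrase in two, as with "question and answer" / "session" in chunks 4 and 5 above. One common mitigation is to let consecutive chunks overlap by a few words. This is an illustrative variation, not the chunking used elsewhere in this notebook; `chunk_text_with_overlap` and its `overlap` parameter are hypothetical.

```python
def chunk_text_with_overlap(text, max_words=10, overlap=2):
    """Split text into word chunks where each chunk repeats the last
    `overlap` words of the previous one."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current chunk already reaches the end of the text
    return chunks

demo_chunks = chunk_text_with_overlap("a b c d e f g", max_words=3, overlap=1)
print(demo_chunks)
```

With `overlap=0` this reduces to the same behavior as `chunk_text_by_words`.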


In [None]:
# Prompt builder (same as the paper).
# Input: a chunk of transcript text. Output: the full prompt/question for the LLM to answer.
def build_prompt(chunk_text: str) -> str:
    return (
        "The following text is an excerpt from a company’s earnings call transcripts. "
        "You are a finance expert. Based on this text only, please answer the following question:\n\n"
        "How does the firm plan to change its capital spending over the next year? "
        "There are five choices: Increase substantially, increase, no change, decrease, and decrease substantially. "
        "Please select one of the above five choices for each question and provide a one-sentence explanation of your choice for each question. "
        "The format for the answer to each question should be “choice - explanation.” "
        "If no relevant information is provided related to the question, answer “no information is provided.”\n\n"
        "[Part of an earnings call transcript:]\n\n"
        f"{chunk_text}"
    )

test = "Ladies and gentlemen thank you for standing by and welcome"
prompt = build_prompt(test)
prompt

'The following text is an excerpt from a company’s earnings call transcripts. You are a finance expert. Based on this text only, please answer the following question:\n\nHow does the firm plan to change its capital spending over the next year? There are five choices: Increase substantially, increase, no change, decrease, and decrease substantially. Please select one of the above five choices for each question and provide a one-sentence explanation of your choice for each question. The format for the answer to each question should be “choice - explanation.” If no relevant information is provided related to the question, answer “no information is provided.”\n\n[Part of an earnings call transcript:]\n\nLadies and gentlemen thank you for standing by and welcome'

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import re
from tqdm.auto import tqdm

# Load a Llama chat model (alternatives are commented out)
#model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
#model_id = "meta-llama/Llama-3.1-8B"
model_id = "meta-llama/Llama-2-7b-chat-hf"

# The repo is gated, so you need an HF token with access (see the login step at the top):
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    use_safetensors=True,    # load .safetensors weights
    token=True               # `use_auth_token` is deprecated in favor of `token`
)

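# 6. Query LLaMA
def query_llama(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # greedy decoding; temperature has no effect when sampling is off
        )
    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign because decoding may not reproduce the prompt exactly.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()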



tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Fetching 2 files:   0%|          | 0/2 [00:00<?, ?it/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]



In [None]:
response = query_llama(prompt)

KeyboardInterrupt: 

In [None]:
# Full script: Replicate research using Hugging Face LLaMA

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import re
from tqdm.auto import tqdm

# 1. Load your earnings-call data
#earning_call_2001 = pd.read_pickle("/content/earning_call_2001_master.pkl")
#earning_call = earning_call_2001.copy()  # remove .copy() or adjust slicing for testing

# 2. Choice-to-score mapping
CHOICE_SCORE = {
    "increase substantially":  1.0,
    "increase":                0.5,
    "no change":               0.0,
    "decrease":               -0.5,
    "decrease substantially": -1.0,
    "no information is provided": 0.0
}

# 3. Prompt builder (same as paper)
def build_prompt(chunk_text: str) -> str:
    return (
        "The following text is an excerpt from a company’s earnings call transcripts. "
        "You are a finance expert. Based on this text only, please answer the following question:\n\n"
        "How does the firm plan to change its capital spending over the next year? "
        "There are five choices: Increase substantially, increase, no change, decrease, and decrease substantially. "
        "Please select one of the above five choices for each question and provide a one-sentence explanation of your choice for each question. "
        "The format for the answer to each question should be “choice - explanation.” "
        "If no relevant information is provided related to the question, answer “no information is provided.”\n\n"
        "[Part of an earnings call transcript:]\n\n"
        f"{chunk_text}"
    )

# 4. Chunking function (≤2,500 words)
def chunk_text_by_words(text: str, max_words: int = 2500):
    words = text.split()
    return [
        " ".join(words[i:i+max_words])
        for i in range(0, len(words), max_words)
    ]

# 5. Load the Llama model (note: Llama-3.1-8B is a base model, not an instruction-tuned one)
#model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"
model_id = "meta-llama/Llama-3.1-8B"

# The repo is gated, so you need an HF token with access:
tokenizer = AutoTokenizer.from_pretrained(model_id, token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    use_safetensors=True,    # load .safetensors weights
    token=True               # `use_auth_token` is deprecated in favor of `token`
)

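# 6. Query LLaMA
def query_llama(prompt: str, max_new_tokens: int = 256) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False  # greedy decoding; temperature has no effect when sampling is off
        )
    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(prompt) can misalign because decoding may not reproduce the prompt exactly.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()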

# 7. Parse "choice - explanation"
_choices_regex = re.compile(
    r'^(increase substantially|increase|no change|decrease substantially|decrease|no information is provided)\b',
    flags=re.IGNORECASE
)
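def parse_response(text: str):
    text = text.strip()
    m = _choices_regex.match(text)  # the pattern is already case-insensitive
    if m:
        choice = m.group(1).lower()
        explanation = text[m.end():].lstrip(" -:")
    else:
        # No recognizable choice: keep the whole reply as the explanation
        choice = "no information is provided"
        explanation = text
    return choice, explanation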

# 8. Process each earnings call
chunk_records = []
call_scores   = []

for idx, row in tqdm(earning_call.iterrows(), total=len(earning_call)):
    section_id = row["Section ID"]
    transcript = str(row["Text"] or "")
    chunks     = chunk_text_by_words(transcript)

    scores = []
    for i, chunk in enumerate(chunks, start=1):
        prompt = build_prompt(chunk)
        try:
            reply       = query_llama(prompt)
            choice, exp = parse_response(reply)
            score       = CHOICE_SCORE[choice]
        except Exception as e:
            choice, exp, score = "error", str(e), 0.0

        chunk_records.append({
            "Section ID": section_id,
            "chunk":      i,
            "choice":     choice,
            "score":      score,
            "explanation": exp
        })
        scores.append(score)

    # Compute firm-quarter-level score = average of chunk scores
    avg_score = sum(scores) / len(scores) if scores else 0.0
    call_scores.append({
        "Section ID": section_id,
        "ChatGPT_Investment_Score": avg_score
    })

# 9. Build DataFrames & Merge back
df_chunks = pd.DataFrame(chunk_records)
df_calls  = pd.DataFrame(call_scores)
earning_call = earning_call.merge(df_calls, on="Section ID", how="left")

# 10. Save results
df_chunks.to_csv("chunk_level_results_llama.csv", index=False)
df_calls.to_csv("call_level_scores_llama.csv", index=False)

print("Done!  • chunk-level → chunk_level_results_llama.csv  • call-level → call_level_scores_llama.csv")




Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]



  0%|          | 0/10 [00:00<?, ?it/s]

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


Done!  • chunk-level → chunk_level_results_llama.csv  • call-level → call_level_scores_llama.csv


In [None]:
chunks

["And just one other question and I'll hang up. Could you discuss the performance in the smaller markets during the fourth quarter, and does Benchmark get the major market locations?"]

In [None]:
df_chunks

Unnamed: 0,Section ID,chunk,choice,score,explanation
0,1,1,no information is provided,0.0,that we have been able to increase our capital...
1,2,1,no information is provided,0.0,
2,3,1,no information is provided,0.0,
3,4,1,no information is provided,0.0,"t the industry, we see a number of positive tr..."
4,5,1,no information is provided,0.0,
5,6,1,no information is provided,0.0,"nd then, on the capital spending, I think you ..."
6,7,1,no information is provided,0.0,rnings call transcript:]\n\nWe are not going t...
7,8,1,no information is provided,0.0,a little bit of a change in our capital spendi...
8,9,1,no information is provided,0.0,very different animal than it was 10 years ago...
9,10,1,no information is provided,0.0,he capital spending over the next year?\n\n[Pa...


In [None]:
df_calls

Unnamed: 0,Section ID,ChatGPT_Investment_Score
0,1,0.0
1,2,0.0
2,3,0.0
3,4,0.0
4,5,0.0
5,6,0.0
6,7,0.0
7,8,0.0
8,9,0.0
9,10,0.0
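
All ten scores above are 0.0 because every chunk was labelled "no information is provided". For comparison, the sketch below shows how the same averaging behaves on a made-up mix of answers. The choice-to-score mapping is the one from the script; `call_level_score` and `example_choices` are hypothetical.

```python
CHOICE_SCORE = {
    "increase substantially": 1.0,
    "increase": 0.5,
    "no change": 0.0,
    "decrease": -0.5,
    "decrease substantially": -1.0,
    "no information is provided": 0.0,
}

def call_level_score(choices):
    """Average the numeric scores of one call's chunk-level choices."""
    scores = [CHOICE_SCORE[c.lower()] for c in choices]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up chunk answers for a single call
example_choices = ["Increase", "no change", "decrease substantially"]
score = call_level_score(example_choices)
print(round(score, 4))
```

Because "no information is provided" also maps to 0.0, chunks without relevant content pull the average toward zero just like an explicit "no change".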


In [None]:
earning_call

Unnamed: 0,Section ID,Name,Company,Job Title,Text,section,Participant Type,Meeting Time,File Date,Company Ticker,Event ID,num_coporate_participatns,num_conference_participatns,num_sec_presentation,num_sec_QA,ChatGPT_Investment_Score
0,1,Unidentified Audience Member,,,1,Presentation,,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
1,2,Operator,,,Ladies and gentlemen thank you for standing by...,Presentation,,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
2,3,RICHARD M. SCHULZE,,,Thank you Lea. Good morning everyone. Thank ...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
3,4,DARREN JACKSON,,,"Thanks Dick, and good morning everyone. My co...",Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
4,5,RICHARD M. SCHULZE,,,Thanks Darren. First I'd like to provide some ...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
5,6,Operator,,,"Ladies and gentlemen, if you have a question p...",Presentation,,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
6,7,DAN WEWER,,,Good morning. A question about the appliance p...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
7,8,RICHARD M. SCHULZE,,,"Wade, do you want to respond to this?",Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
8,9,WADE R. FENN,,,There is clearly elasticity in this kind of of...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0
9,10,DAN WEWER,,,And just one other question and I'll hang up. ...,Presentation,Conference,"APRIL 03, 2001 / 2:00PM GMT",2001-Apr-03,BBY.N,138864923344,0,14,87,0,0.0


## OpenAI (Paid API)

In [None]:
import os
import openai
import pandas as pd
from tqdm.auto import tqdm

# ── 1. API Key ────────────────────────────────────────────────────────────────
openai.api_key = os.getenv("OPENAI_API_KEY")  # or set directly: "sk-..."

# ── 2. Load your data ───────────────────────────────────────────────────────
earning_call_2001 = pd.read_pickle("/content/earning_call_2001_master.pkl")
earning_call = earning_call_2001.copy()  # or .iloc[:10] for a quick test

# ── 3. Choice ⇆ Score mapping ───────────────────────────────────────────────
CHOICE_SCORE = {
    "increase substantially":  1.0,
    "increase":                0.5,
    "no change":               0.0,
    "decrease":               -0.5,
    "decrease substantially": -1.0,
    "no information is provided": 0.0
}

# ── 4. Prompt builder ────────────────────────────────────────────────────────
def build_prompt(chunk_text: str) -> str:
    return (
        "The following text is an excerpt from a company’s earnings call transcripts. "
        "You are a finance expert. Based on this text only, please answer the following question:\n\n"
        "How does the firm plan to change its capital spending over the next year? "
        "There are five choices: Increase substantially, increase, no change, decrease, and decrease substantially. "
        "Please select one of the above five choices for each question and provide a one-sentence explanation of your choice for each question. "
        "The format for the answer to each question should be “choice - explanation.” "
        "If no relevant information is provided related to the question, answer “no information is provided.”\n\n"
        "[Part of an earnings call transcript:]\n\n"
        f"{chunk_text}"
    )

# ── 5. Chunking by word count ─────────────────────────────────────────────────
def chunk_text_by_words(text: str, max_words: int = 2500):
    words = text.split()
    return [' '.join(words[i:i+max_words])
            for i in range(0, len(words), max_words)]

# ── 6. Query GPT-3.5-turbo ────────────────────────────────────────────────────
def query_gpt35(prompt: str) -> str:
    # openai>=1.0 interface; the older openai.ChatCompletion.create was removed in 1.0
    resp = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

# ── 7. Parse "choice - explanation" ───────────────────────────────────────────
import re
_choices_pattern = re.compile(r'^(increase substantially|increase|no change|decrease substantially|decrease|no information is provided)\b',
                              flags=re.IGNORECASE)

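def parse_response(text: str):
    text = text.strip()
    m = _choices_pattern.match(text)  # the pattern is already case-insensitive
    if m:
        choice = m.group(1).lower()
        explanation = text[m.end():].lstrip(" -:")
    else:
        # No recognizable choice: keep the whole reply as the explanation
        choice = "no information is provided"
        explanation = text
    return choice, explanation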

# ── 8. Process all calls ─────────────────────────────────────────────────────
chunk_records = []
call_scores   = []

for idx, row in tqdm(earning_call.iterrows(), total=len(earning_call)):
    section_id = row["Section ID"]
    transcript = str(row["Text"] or "")  # same guard against non-string values as in the LLaMA script
    chunks     = chunk_text_by_words(transcript)

    scores = []
    for i, chunk in enumerate(chunks, start=1):
        prompt = build_prompt(chunk)
        try:
            reply       = query_gpt35(prompt)
            choice, exp = parse_response(reply)
            score       = CHOICE_SCORE[choice]
        except Exception as e:
            choice, exp, score = "error", str(e), 0.0

        chunk_records.append({
            "Section ID": section_id,
            "chunk":      i,
            "choice":     choice,
            "score":      score,
            "explanation": exp
        })
        scores.append(score)

    # firm-quarter-level (call-level) score = average of chunk scores
    avg_score = sum(scores) / len(scores) if scores else 0.0
    call_scores.append({
        "Section ID": section_id,
        "ChatGPT_Investment_Score": avg_score
    })

# ── 9. Build DataFrames & Merge ──────────────────────────────────────────────
df_chunks = pd.DataFrame(chunk_records)
df_calls  = pd.DataFrame(call_scores)

# Merge back into your original earning_call table if you like:
earning_call = earning_call.merge(df_calls, on="Section ID", how="left")

# ── 10. (Optional) Save results ──────────────────────────────────────────────
df_chunks.to_csv("chunk_level_results.csv", index=False)
df_calls.to_csv("call_level_scores.csv", index=False)

print("Done!  • chunk-level results → chunk_level_results.csv  • call-level scores → call_level_scores.csv")
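
The loop above records any exception as an "error" row with a score of 0.0, so a transient failure such as a rate limit or timeout is counted as a neutral answer. One possible mitigation is to retry with exponential backoff before giving up. The helper below is a sketch: `call_with_retries` is a hypothetical name, and the demo uses a fake request instead of the real API.

```python
import time

def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))

# Demo with a fake request that fails twice before succeeding
attempts = {"n": 0}

def flaky_request():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return "increase - management plans more store openings"

delays = []  # record the waits instead of actually sleeping
reply = call_with_retries(flaky_request, sleep=delays.append)
print(reply, delays)
```

In the processing loop this would wrap the API call as `call_with_retries(lambda: query_gpt35(prompt))`.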


# Fine-tune Google-bert
https://huggingface.co/docs/transformers/training

In [None]:
from huggingface_hub import login

login()


In [None]:
pip install datasets




In [None]:
import datasets

**Data Fields**
* 'text': The review texts are escaped using double quotes ("), and any internal double quote is escaped by 2 double quotes (""). New lines are escaped by a backslash followed by an "n" character, that is "\n".
* 'label': Corresponds to the score associated with the review (between 1 and 5). In the loaded dataset the labels are stored as class indices 0 to 4.

**Data Splits**

The Yelp reviews full star dataset is constructed by randomly taking 130,000 training samples and 10,000 testing samples for each review star from 1 to 5. In total there are 650,000 training samples and 50,000 testing samples.

https://huggingface.co/datasets/Yelp/yelp_review_full

In [None]:
!pip install --upgrade datasets huggingface_hub




In [None]:
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")

README.md:   0%|          | 0.00/6.72k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/299M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/23.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)

tokenizer_config.json:   0%|          | 0.00/49.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/436k [00:00<?, ?B/s]

Map:   0%|          | 0/650000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

Prepare a small subset of the dataset for fine-tuning

In [None]:
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(100))

In [None]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 100
})

In [None]:
small_train_dataset['label'][:5]

[4, 2, 4, 0, 2]

In [None]:
small_train_dataset['text'][:5]



["I stalk this truck.  I've been to industrial parks where I pretend to be a tech worker standing in line, strip mall parking lots, and of course the farmer's market.  The bowls are so so absolutely divine.  The owner is super friendly and he makes each bowl by hand with an incredible amount of pride.  You gotta eat here guys!!!",
 "who really knows if this is good pho or not, i was hung tha fuck over and in desperate need of pho therapy. :P but it totally hit the spot and came out super freakin fast!!! omg! aaahhhhh.....\\n\\ni'm pretty sure it wasn't bad pho tho...meat, noodles, broth, all a-ok. the coffee was good too. thought i was gettin ripped off for a $3 cup of coffee but they gave me a big cup so it's all good! :)\\n\\nima make pho a must the next time i go to vegas again fo sure!!! yum! :D",
 'I LOVE Bloom Salon... all of their stylist are very qualified and provide excellent hair care...I prefer to book my appointments with Andrea, but if she is not available I am not afraid

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)


config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.0/84.0 kB 3.3 MB/s eta 0:00:00
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


Downloading builder script:   0%|          | 0.00/4.20k [00:00<?, ?B/s]
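
To make explicit what `compute_metrics` reports: accuracy is the share of samples whose highest-scoring class (the argmax over the logits) equals the true label. The dependency-free sketch below reproduces that calculation on made-up logits; `accuracy_from_logits` and the toy data are hypothetical.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose argmax equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Four made-up samples with five classes each (star ratings 0 to 4)
toy_logits = [
    [0.1, 2.0, 0.3, 0.0, -1.0],  # predicted class 1
    [1.5, 0.2, 0.1, 0.0, 0.3],   # predicted class 0
    [0.0, 0.1, 0.2, 0.3, 3.0],   # predicted class 4
    [0.9, 0.8, 0.7, 0.6, 0.5],   # predicted class 0
]
toy_labels = [1, 0, 3, 0]
acc = accuracy_from_logits(toy_logits, toy_labels)
print(acc)
```

With five balanced classes, random guessing gives an accuracy of about 0.2, which is a useful baseline when reading the results of a short fine-tuning run.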

In [None]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", eval_strategy="epoch",
    report_to="none")

In [None]:
trainer = Trainer(
    model = model,
    args  = training_args,
    train_dataset = small_train_dataset,
    eval_dataset  = small_eval_dataset,
    compute_metrics = compute_metrics
)

In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,No log,1.600719,0.25
2,No log,1.534475,0.41
3,No log,1.521755,0.29


TrainOutput(global_step=39, training_loss=1.5234808310484276, metrics={'train_runtime': 52.6226, 'train_samples_per_second': 5.701, 'train_steps_per_second': 0.741, 'total_flos': 78935442739200.0, 'train_loss': 1.5234808310484276, 'epoch': 3.0})
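
The `global_step=39` in the output can be checked by hand. The arithmetic below assumes the `TrainingArguments` defaults of a per-device batch size of 8 and 3 epochs, a single device, and no gradient accumulation; none of these were overridden in the cell above.

```python
import math

num_train_samples = 100       # size of small_train_dataset
per_device_batch_size = 8     # TrainingArguments default
num_epochs = 3                # TrainingArguments default

# The last batch of each epoch is smaller (4 samples) but still counts as a step
steps_per_epoch = math.ceil(num_train_samples / per_device_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

This gives 13 steps per epoch and 39 steps overall. The "No log" entries in the Training Loss column are most likely because the default logging interval (500 steps) is longer than the whole run.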

Evaluate on the test set

In [None]:

dataset_test = dataset["test"].shuffle(seed=12356).select(range(100))

model.eval()
trainer.predict(dataset_test).metrics

{'test_loss': 1.507846713066101,
 'test_accuracy': 0.34,
 'test_runtime': 3.2065,
 'test_samples_per_second': 31.186,
 'test_steps_per_second': 4.054}

Save the fine-tuned model

In [None]:
trainer.save_model('content/google-bert-yelp/')


# Fine-tune FinBERT (10-fold CV)
https://huggingface.co/docs/transformers/training

https://github.com/yya518/FinBERT/blob/master/finetune.ipynb

Ref: My conversation with ChatGPT o1

10-Fold Cross-Validation with FinBERT and Hugging Face Trainer
* Introduction
  In this guide, we demonstrate how to perform 10-fold cross-validation for a binary text classification task using Hugging Face’s Trainer with the FinBERT model (ProsusAI/finbert). FinBERT is a BERT-based model pre-trained on financial text and fine-tuned for sentiment analysis in the financial domain. We will adapt it for our binary classification target. The dataset is a pandas DataFrame (fund_strategies) with a text column (Strategy_Description) and a binary label column (machine). We will:
  * Split the data into an 85% train+validation set and a 15% holdout test set (stratified by the machine label).
  * Use Stratified 10-fold Cross-Validation on the 85% train/val set to train and evaluate FinBERT with fixed hyperparameters.
  * Collect metrics (accuracy, precision, recall, F1-score, and ROC AUC) after each epoch for both training and validation.
  * Compute the average performance across the 10 folds.
  * Retrain the model on the full 85% training data (or use the best fold’s model) and evaluate on the 15% holdout test set.
  * Visualize training vs. validation metrics per epoch for one fold to inspect training dynamics.

## Data Preparation and Splitting
First, we stratify-split the DataFrame into an 85% training set and a 15% holdout test set. Stratified sampling ensures the class distribution of machine (our binary label) is preserved in both splits. We use scikit-learn’s train_test_split for this, passing the label column as the stratify argument. We also set a random seed for reproducibility.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data into `macro`, a pandas DataFrame with 'Strategy_Description' and 'machine' columns.
# Stratified split: 85% train_val, 15% test.
# Both outputs should have a similar class distribution due to stratification.

macro = pd.read_csv("fund_strategies.csv")

train_val_df, test_df = train_test_split(
    macro,
    test_size = 0.15,
    stratify  = macro['machine'],   # stratify on the same DataFrame that is being split
    random_state = 1103
)
print(train_val_df['machine'].value_counts(normalize=True))
print(test_df['machine'].value_counts(normalize=True))


FileNotFoundError: [Errno 2] No such file or directory: 'fund_strategies.csv'

Next, prepare for 10-fold cross-validation on train_val_df. We use StratifiedKFold to generate 10 splits, again preserving class ratios in each fold. Each fold will use ~90% of the train_val data for training and ~10% for validation (since 10-fold). We initialize StratifiedKFold(n_splits=10, shuffle=True, random_state=1103) and obtain train/val indices for each fold.

In [None]:
from sklearn.model_selection import StratifiedKFold
import numpy as np

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1103)
fold_splits = list(skf.split(train_val_df, train_val_df['machine']))
print(f"Total folds: {len(fold_splits)}")  # Should print 10
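
To make the split sizes concrete, here is a small worked example. The dataset size of 1,000 rows is hypothetical (the real row count is not shown here), and the fold sizes are approximate because StratifiedKFold balances each class separately. The batch size and epoch count are the fixed values used in this guide.

```python
import math

n_rows = 1000                      # hypothetical dataset size
test_size = round(0.15 * n_rows)   # 15% holdout test set
train_val_size = n_rows - test_size

n_folds = 10
val_per_fold = train_val_size // n_folds          # ~10% of train_val
train_per_fold = train_val_size - val_per_fold    # ~90% of train_val

# Rough training cost of the whole cross-validation
batch_size, epochs = 10, 30
steps_per_fold = math.ceil(train_per_fold / batch_size) * epochs
total_steps = steps_per_fold * n_folds

print("holdout test rows:", test_size)
print("train_val rows:", train_val_size)
print("per fold: train", train_per_fold, "/ val", val_per_fold)
print("optimizer steps per fold:", steps_per_fold, "| all folds:", total_steps)
```

The last line is a reminder that 10 folds at 30 epochs each multiplies the training cost, which is worth estimating before starting the loop.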


## Model and Tokenizer Setup
We load the FinBERT tokenizer and model using Hugging Face Transformers. Since the FinBERT checkpoint was originally fine-tuned for 3-class sentiment, we specify num_labels=2 together with ignore_mismatched_sizes=True; this discards the 3-class head and randomly initializes a new classification layer for 2 classes. We also disable any external experiment tracking (like Weights & Biases) to use only built-in logging. Key hyperparameters (fixed for all folds as specified) are:
* Learning rate: 2e-5
* Batch size: 10 (for both training and eval)
* Epochs: 30
* Weight decay: 0.01
* Max sequence length: 512 tokens

We use these to configure TrainingArguments for the Trainer. We enable evaluation at each epoch (eval_strategy="epoch", called evaluation_strategy in older Transformers releases) so that validation metrics are computed after every epoch. We also set logging_strategy="epoch" to log training loss each epoch and save_strategy="epoch" to write a checkpoint each epoch. A seed is set for reproducibility. The Trainer uses a GPU automatically when one is available (that is, when torch.cuda.is_available() is True).


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments, set_seed

model_name = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
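# Prepare the model for binary classification; ignore_mismatched_sizes=True lets the
# 3-class FinBERT head be replaced by a freshly initialized 2-class head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, ignore_mismatched_sizes=True)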

# Define training arguments common to all folds
training_args = TrainingArguments(
    output_dir = "finbert_cv",      # output directory for model files (can include fold index later)
    learning_rate = 2e-5,
    per_device_train_batch_size = 10,
    per_device_eval_batch_size = 10,
    num_train_epochs = 30,
    weight_decay = 0.01,
    eval_strategy = "epoch",       # evaluate at the end of each epoch
    logging_strategy = "epoch",    # log training loss at the end of each epoch
    save_strategy = "epoch",       # save a checkpoint each epoch (set to "no" to skip checkpoints and save disk space)
    report_to="none",              # disable W&B or other logging
    seed=42
)
set_seed(42)  # set seed for reproducibility (affects weight init, etc.)
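
The keyword for per-epoch evaluation has been renamed across Transformers releases (eval_strategy in the code above, evaluation_strategy in older releases). If the notebook has to run on more than one version, a small helper can pick whichever name the installed class accepts. This is only a sketch: `eval_strategy_kwarg` is a hypothetical helper, and the two stand-in classes exist purely so the example runs without Transformers installed.

```python
import inspect

def eval_strategy_kwarg(args_cls, value="epoch"):
    """Return the evaluation-strategy keyword that args_cls.__init__ accepts."""
    params = inspect.signature(args_cls.__init__).parameters
    if "eval_strategy" in params:
        return {"eval_strategy": value}
    if "evaluation_strategy" in params:
        return {"evaluation_strategy": value}
    raise TypeError(f"{args_cls.__name__} accepts neither keyword")

# Stand-ins for a newer and an older TrainingArguments signature (illustration only)
class NewStyleArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.output_dir, self.eval_strategy = output_dir, eval_strategy

class OldStyleArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.output_dir, self.evaluation_strategy = output_dir, evaluation_strategy

print(eval_strategy_kwarg(NewStyleArgs))
print(eval_strategy_kwarg(OldStyleArgs))
```

With Transformers installed, the same helper could be used as `TrainingArguments(output_dir="finbert_cv", **eval_strategy_kwarg(TrainingArguments))`.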


## 10-Fold Cross-Validation Training

We loop through each fold, create the training and validation subsets, tokenize them, and train a model. To efficiently handle data, we use the 🤗 Datasets library. We convert our pandas subsets to Dataset objects and then apply the tokenizer. Tokenization is done with padding and truncation up to max_length=512. We rename the label column to "label" since by default Hugging Face expects a column named "label" for Trainer. We remove any unnecessary columns (like the original text or DataFrame index) after tokenization, keeping only the model inputs and label.

Each fold’s training uses the same hyperparameters and runs for 30 epochs. At the end of each epoch, the Trainer evaluates on the fold’s validation set and computes our metrics via a compute_metrics function. We define compute_metrics to calculate accuracy, precision, recall, F1, and ROC AUC from the model predictions and true labels. We use scikit-learn metrics for these calculations. In binary classification, ROC AUC is computed using the probability of the positive class (roc_auc_score expects either probability estimates or decision function scores for the positive class). Our compute_metrics will apply softmax to logits to get probabilities.
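
To see what the softmax step contributes, the sketch below works through a tiny example in plain Python. The logits and labels are made up for illustration, and `roc_auc` is a simple pairwise implementation written for this example, not scikit-learn's. For two classes, the softmax probability of class 1 equals the sigmoid of the logit difference, so ranking by probability and ranking by raw logit difference give the same ROC AUC.

```python
import math

def softmax_pos(logit0, logit1):
    """Probability of class 1 under a 2-class softmax (numerically stabilized)."""
    m = max(logit0, logit1)
    e0, e1 = math.exp(logit0 - m), math.exp(logit1 - m)
    return e1 / (e0 + e1)

def roc_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical (logit for class 0, logit for class 1) pairs and true labels
logits = [(2.0, -1.0), (0.5, 0.3), (-0.2, 1.1), (-1.5, 2.5), (0.1, 0.4)]
labels = [0, 1, 1, 1, 0]

probs = [softmax_pos(a, b) for a, b in logits]
diffs = [b - a for a, b in logits]

auc_from_probs = roc_auc(labels, probs)
auc_from_diffs = roc_auc(labels, diffs)
print("AUC from softmax probabilities:", auc_from_probs)
print("AUC from logit differences:   ", auc_from_diffs)

# Softmax over two logits is the sigmoid of their difference
sigmoid = [1.0 / (1.0 + math.exp(-d)) for d in diffs]
print("max |softmax - sigmoid|:", max(abs(p - s) for p, s in zip(probs, sigmoid)))
```

Because AUC depends only on the ranking of the scores, the softmax matters for reporting calibrated-looking probabilities rather than for the AUC value itself.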


In [None]:
from datasets import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
import torch  # needed for torch.tensor in compute_metrics
import torch.nn.functional as F

# Define the metric computation function
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # Get predicted class indices
    preds = np.argmax(logits, axis=1)
    # Accuracy
    acc = accuracy_score(labels, preds)
    # Precision, Recall, F1 (binary)
    prec, rec, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    # AUC (use probability of class 1)
    # Softmax to get probabilities for classes
    probs = F.softmax(torch.tensor(logits), dim=1).numpy()  # convert logits to probabilities
    auc = roc_auc_score(labels, probs[:, 1])  # assume label '1' is positive class
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1, "roc_auc": auc}

# Tokenize a batch of texts, padding/truncating every example to 512 tokens
def tokenize_batch(batch):
    return tokenizer(batch["Strategy_Description"],
                     padding="max_length", truncation=True, max_length=512)


# Loop over each fold
fold_metrics = []  # to collect metrics for each fold's validation
for fold_idx, (train_idx, val_idx) in enumerate(fold_splits, start=1):
    print(f"Starting fold {fold_idx}...")

    # Subset the DataFrame for this fold
    train_df = train_val_df.iloc[train_idx]
    val_df = train_val_df.iloc[val_idx]

    # Convert to Hugging Face Datasets
    train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
    val_dataset = Dataset.from_pandas(val_df.reset_index(drop=True))

    # Tokenize the datasets
    train_dataset = train_dataset.map(tokenize_batch, batched=True)
    val_dataset = val_dataset.map(tokenize_batch, batched=True)

    # Rename target column to 'label' and remove other columns
    train_dataset = train_dataset.rename_column("machine", "label")
    val_dataset = val_dataset.rename_column("machine", "label")

    # Remove the text column now that we have tokenized inputs
    train_dataset = train_dataset.remove_columns(["Strategy_Description"])
    val_dataset = val_dataset.remove_columns(["Strategy_Description"])

    # (If any pandas index column like "__index_level_0__" is present, remove that too)
    for col in train_dataset.column_names:
        if col not in ["input_ids", "attention_mask", "token_type_ids", "label"]:
            train_dataset = train_dataset.remove_columns(col)
    for col in val_dataset.column_names:
        if col not in ["input_ids", "attention_mask", "token_type_ids", "label"]:
            val_dataset = val_dataset.remove_columns(col)

    # Set format for PyTorch
    train_dataset.set_format("torch")
    val_dataset.set_format("torch")

    # Initialize a fresh model for this fold
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, ignore_mismatched_sizes=True)

    # Update output_dir to avoid collisions (optional)
    training_args.output_dir = f"finbert_cv/fold{fold_idx}"

    # Initialize Trainer for this fold
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
        compute_metrics=compute_metrics
    )
    # Train the model for this fold
    trainer.train()
    # Evaluate on validation set to get final metrics for this fold
    fold_metrics.append(trainer.evaluate(eval_dataset=val_dataset))
    # (Optionally, free up GPU memory here or delete model if not needed further)


A few notes on the above code:
* We use trainer.train(), which trains for 30 epochs and automatically evaluates on the validation set at the end of each epoch (because of eval_strategy="epoch" in our training arguments). The compute_metrics function is used to calculate metrics on each evaluation.
* We collect the final evaluation metrics for the fold via trainer.evaluate (though we could also use the last entry of trainer.state.log_history for validation metrics).
* We do not do hyperparameter tuning here; the hyperparameters are fixed for all folds. We also didn’t implement early stopping (though one could add an EarlyStoppingCallback if desired). Each fold runs for the full 30 epochs.
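
For readers considering the early-stopping option mentioned in the last note, the sketch below shows the general patience rule in plain Python: stop once the monitored validation score has failed to improve for a set number of consecutive evaluations. It illustrates the idea only and is not the library's implementation; the function name and the validation scores are hypothetical.

```python
def early_stop_epoch(val_scores, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops early."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score > best + min_delta:
            # New best score: reset the patience counter
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical per-epoch validation F1 scores
val_f1 = [0.61, 0.70, 0.74, 0.73, 0.74, 0.72, 0.71, 0.75]
print(early_stop_epoch(val_f1, patience=3))
print(early_stop_epoch(val_f1, patience=5))
```

In Transformers, the equivalent behavior comes from passing `EarlyStoppingCallback(early_stopping_patience=...)` in the Trainer's `callbacks` list; it is normally used together with `load_best_model_at_end=True` and a `metric_for_best_model` in the training arguments.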

During training, the Trainer logs training loss and validation metrics each epoch. If needed, we can also compute training-set metrics after each epoch by evaluating on the training data. This isn’t done by default (the Trainer only evaluates on eval_dataset), but we can achieve it with a custom callback or by manually calling trainer.evaluate(train_dataset) at epoch end. The logged history (trainer.state.log_history) contains a list of dictionaries with metrics for each logging step. For example, entries with 'eval_loss', 'eval_accuracy', etc., correspond to validation metrics at each evaluation, and entries with 'loss' correspond to training loss during training. By parsing this, we can retrieve per-epoch metrics for both training and validation to later analyze or plot.
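
As a minimal sketch of that parsing step, the function below merges the training-loss entry and the evaluation entry of each epoch into one row. The `log_history` list here is a made-up excerpt shaped like `trainer.state.log_history` (the numbers are not real results), and `per_epoch_table` is a hypothetical helper name.

```python
# Made-up excerpt shaped like trainer.state.log_history
log_history = [
    {"loss": 0.69, "learning_rate": 1.9e-5, "epoch": 1.0, "step": 77},
    {"eval_loss": 0.65, "eval_accuracy": 0.62, "eval_f1": 0.58,
     "eval_runtime": 1.2, "epoch": 1.0, "step": 77},
    {"loss": 0.55, "learning_rate": 1.8e-5, "epoch": 2.0, "step": 154},
    {"eval_loss": 0.57, "eval_accuracy": 0.71, "eval_f1": 0.69,
     "eval_runtime": 1.1, "epoch": 2.0, "step": 154},
    {"train_runtime": 12.3, "train_loss": 0.62, "epoch": 2.0, "step": 154},
]

def per_epoch_table(history):
    """Merge training-loss and evaluation entries into one dict per epoch."""
    rows = {}
    for record in history:
        epoch = record.get("epoch")
        if epoch is None:
            continue
        row = rows.setdefault(epoch, {})
        if "loss" in record:                      # training log entry
            row["train_loss"] = record["loss"]
        for key, value in record.items():         # evaluation log entry
            if key.startswith("eval_") and not key.endswith(("_runtime", "_per_second")):
                row[key] = value
    return [{"epoch": epoch, **rows[epoch]} for epoch in sorted(rows)]

table = per_epoch_table(log_history)
for row in table:
    print(row)
```

If an epoch was evaluated more than once (for example by an extra `trainer.evaluate()` call after training), the later entry simply overwrites the earlier one for that epoch.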

## Cross-Validation Results

After running all 10 folds, we have collected the validation metrics for each fold in fold_metrics. We can compute the average metrics across the 10 folds to get an overall estimate of performance. For instance:

In [None]:
import pandas as pd

metrics_df = pd.DataFrame(fold_metrics)
avg_metrics = metrics_df.mean()
std_metrics = metrics_df.std()
print("Average CV metrics across 10 folds:")
for metric, avg_val in avg_metrics.items():
    print(f"  {metric}: {avg_val:.4f} ± {std_metrics[metric]:.4f}")


This will output the mean and standard deviation of every entry returned by trainer.evaluate over the 10 validation folds: the metrics of interest (eval_accuracy, eval_precision, eval_recall, eval_f1, eval_roc_auc) as well as eval_loss and the runtime statistics. Typically, we pay attention to the average validation accuracy and F1 (and AUC, especially for imbalanced classes) as robust estimates of model performance on unseen data. The K-fold process gives us a distribution of metrics rather than a single number, providing more insight into variability and reliability.
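
The aggregation can be checked by hand on a small example. The sketch below uses three made-up fold results (not real outputs) and the standard library instead of pandas. `statistics.stdev` is the sample standard deviation, which matches the default of pandas' `DataFrame.std()`.

```python
from statistics import mean, stdev

# Made-up per-fold results shaped like trainer.evaluate() output
fold_metrics_example = [
    {"eval_loss": 0.52, "eval_accuracy": 0.80, "eval_f1": 0.78, "eval_runtime": 1.9, "epoch": 30.0},
    {"eval_loss": 0.48, "eval_accuracy": 0.84, "eval_f1": 0.82, "eval_runtime": 2.1, "epoch": 30.0},
    {"eval_loss": 0.55, "eval_accuracy": 0.82, "eval_f1": 0.80, "eval_runtime": 2.0, "epoch": 30.0},
]

# Report only the metrics of interest, skipping runtime and epoch entries
keep = ("eval_accuracy", "eval_f1")
summary = {}
for key in keep:
    values = [fold[key] for fold in fold_metrics_example]
    summary[key] = (mean(values), stdev(values))

for key, (avg, sd) in summary.items():
    print(f"{key}: {avg:.4f} ± {sd:.4f}")
```

Selecting the columns explicitly keeps runtime statistics out of the report, which makes the printed summary easier to read.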


## Retraining on Full Training Data and Final Evaluation

With cross-validation done, we can choose how to proceed for the final model: either select the best-performing fold’s model or retrain a new model on the entire 85% training set. Often, after cross-validation, the best practice is to train a final model on all available training data (here, the full 85%) using the same hyperparameters (since we did no hyperparameter tuning, there's no risk of overfitting the test by this choice). This final model should in theory perform similarly to the average fold performance, possibly slightly better since it has more training data.

Let's retrain on the full train_val_df (85% data) and then evaluate on the holdout 15% test set:

In [None]:
# Prepare full training dataset (85%) for final training
full_train_dataset = Dataset.from_pandas(train_val_df.reset_index(drop=True))
full_train_dataset = full_train_dataset.map(tokenize_batch, batched=True)
full_train_dataset = full_train_dataset.rename_column("machine", "label")
full_train_dataset = full_train_dataset.remove_columns(["Strategy_Description"])
# Remove any extra columns and set format
for col in full_train_dataset.column_names:
    if col not in ["input_ids", "attention_mask", "token_type_ids", "label"]:
        full_train_dataset = full_train_dataset.remove_columns(col)
full_train_dataset.set_format("torch")

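# Initialize new model for final training
final_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2, ignore_mismatched_sizes=True)

# Separate arguments for the final run: no eval_dataset is passed to this Trainer,
# so per-epoch evaluation is turned off (hyperparameters match the CV folds)
final_args = TrainingArguments(
    output_dir = "finbert_cv/final_model",
    learning_rate = 2e-5,
    per_device_train_batch_size = 10,
    per_device_eval_batch_size = 10,
    num_train_epochs = 30,
    weight_decay = 0.01,
    eval_strategy = "no",
    logging_strategy = "epoch",
    save_strategy = "epoch",
    report_to = "none",
    seed = 42
)
final_trainer = Trainer(
    model=final_model,
    args=final_args,
    train_dataset=full_train_dataset,
    compute_metrics=compute_metrics
)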
final_trainer.train()

# Prepare holdout test dataset
test_dataset = Dataset.from_pandas(test_df.reset_index(drop=True))
test_dataset = test_dataset.map(tokenize_batch, batched=True)
test_dataset = test_dataset.rename_column("machine", "label")
test_dataset = test_dataset.remove_columns(["Strategy_Description"])
for col in test_dataset.column_names:
    if col not in ["input_ids", "attention_mask", "token_type_ids", "label"]:
        test_dataset = test_dataset.remove_columns(col)
test_dataset.set_format("torch")

# Evaluate on holdout test set
test_metrics = final_trainer.evaluate(eval_dataset=test_dataset)
print("Holdout test metrics:", test_metrics)


After this, test_metrics will contain the accuracy, precision, recall, F1, and ROC AUC on the 15% holdout test set. This is the final check of our model’s performance on completely unseen data. We expect these results to be in line with (perhaps slightly lower than) the cross-validation average metrics, assuming the 10-fold CV did not reveal any major issues. If the test performance is significantly worse, it could indicate some overfitting or that the holdout set has different characteristics.


## Visualization of Training vs. Validation Metrics

It’s often useful to visualize the training process for one of the folds to ensure the model is learning properly and to detect any overfitting. We can extract the per-epoch metric history from the Trainer. For example, using trainer.state.log_history of one fold’s trainer, we can gather the training and validation accuracy for each epoch and plot them using Matplotlib.

Figure: Training vs. validation accuracy per epoch for one cross-validation fold.

In this fold, the training accuracy (yellow line) improves steadily over the 30 epochs, eventually nearing 100%. The validation accuracy (orange line) climbs initially and then plateaus, with minor fluctuations, reaching about 85–88% by the end. We observe a gap between training and validation accuracy in later epochs, which suggests slight overfitting after around epoch 20 (training accuracy continues to rise while validation stabilizes or dips). Plotting metrics like this for each fold can help verify if training was stable and decide if early stopping or fewer epochs might be warranted. To create such a plot, we retrieve the logged metrics. For example:


In [None]:
import matplotlib.pyplot as plt

# Copy the log of one fold's trainer (here: the trainer left over from the last fold)
history = list(trainer.state.log_history)  # list of dicts

# Validation accuracy is logged at every evaluation as 'eval_accuracy'.
# Keying by epoch drops the duplicate entry from the extra trainer.evaluate() call after training.
val_acc = {rec['epoch']: rec['eval_accuracy'] for rec in history if 'eval_accuracy' in rec}

# Training accuracy is not logged by default. It appears as 'train_accuracy' only if the
# training set was evaluated each epoch with metric_key_prefix="train" (e.g. from a callback).
# Calling trainer.evaluate() here instead would score only the final model, not each epoch's model.
train_acc = {rec['epoch']: rec['train_accuracy'] for rec in history if 'train_accuracy' in rec}

# Plot whichever curves are available
if train_acc:
    plt.plot(list(train_acc), list(train_acc.values()), label="Train Accuracy")
plt.plot(list(val_acc), list(val_acc.values()), label="Validation Accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Training vs Validation Accuracy (one fold)")
plt.legend()
plt.show()


## Conclusion
Using Hugging Face’s Trainer with cross-validation provides a robust estimate of model performance. We used 10-fold stratified CV to mitigate the variance of a single train-test split, as recommended for reliable evaluation. With fixed hyperparameters (learning rate, epochs, etc.), we fine-tuned the FinBERT model on each fold and monitored metrics each epoch. After averaging the fold results and training a final model on all training data, we evaluated on a holdout set to check that the model generalizes well. This process is reproducible (fixed random seeds) and leverages Hugging Face’s built-in functionality for training and evaluation. The approach can be adapted to other transformer models and datasets, and enhancements such as early stopping, hyperparameter search, or model ensembling (using the fold models) can be explored if needed.

References: K-fold cross-validation concept and benefits, Hugging Face Trainer usage for evaluation per epoch, StratifiedKFold for preserving class distribution, and FinBERT model details. The Hugging Face forums and Stack Overflow provide examples of implementing cross-validation with datasets and Trainer, as well as tips for accessing training logs. These guided our implementation to ensure a structured and effective cross-validation training workflow.

# Fine-tune FinBERT (Hyper-parameter Search)

https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/002_configurations.html#sphx-glr-tutorial-10-key-features-002-configurations-py

https://huggingface.co/docs/transformers/v4.51.1/en/hpo_train
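
The cell that follows splits the data twice with test_size = 0.15: first to hold out a test set, then to carve a validation set out of the remainder. The short calculation below shows the resulting share of the full dataset in each part. It uses exact fractions and ignores the rounding to whole rows that happens on a real dataset.

```python
from fractions import Fraction

test_frac = Fraction(15, 100)              # first split: 15% of all rows
remaining = 1 - test_frac                  # 85% left for train + validation
eval_frac = remaining * Fraction(15, 100)  # second split: 15% of the remainder
train_frac = remaining - eval_frac

print("train:", float(train_frac))
print("eval: ", float(eval_frac))
print("test: ", float(test_frac))
```

So the validation set used during the search is somewhat smaller than the test set, and the model sees a little under three quarters of the data during training.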

In [None]:

import torch
print("Using device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU")

# --- Required Imports ---
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from sklearn.metrics import accuracy_score, f1_score, precision_score, roc_auc_score
from scipy.special import softmax
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments
)


# --- Train/Validation/Test Split ---
temp_df, test_df = train_test_split(
    macro,
    test_size = 0.15,
    random_state = 1103,
    stratify = macro["machine"]
)

train_df, eval_df = train_test_split(
    temp_df,
    test_size = 0.15,
    random_state = 1103,
    stratify = temp_df["machine"]
)

# --- Convert the splits to Hugging Face Datasets ---
train_dataset = Dataset.from_pandas(train_df)
test_dataset  = Dataset.from_pandas(test_df)
eval_dataset  = Dataset.from_pandas(eval_df)

# Rename label column
train_dataset = train_dataset.rename_column("machine", "labels")
test_dataset  = test_dataset.rename_column("machine", "labels")
eval_dataset  = eval_dataset.rename_column("machine", "labels")

# --- Tokenizer & Model ---
model_name = "ProsusAI/finbert"  # You can change to another FinBERT variant
tokenizer = AutoTokenizer.from_pretrained(model_name)

# --- Tokenization Function ---
def tokenize_function(examples):
    return tokenizer(
        examples["Strategy_Description"],
        truncation = True,
        padding    = "max_length",
        max_length = 512
    )

train_dataset = train_dataset.map(tokenize_function, batched=True)
test_dataset  = test_dataset.map(tokenize_function, batched=True)
eval_dataset  = eval_dataset.map(tokenize_function, batched=True)


train_dataset = train_dataset.remove_columns(["Strategy_Description"])
test_dataset  = test_dataset.remove_columns(["Strategy_Description"])
eval_dataset  = eval_dataset.remove_columns(["Strategy_Description"])

train_dataset.set_format("torch")
test_dataset.set_format("torch")
eval_dataset.set_format("torch")

# --- model_init: returns a freshly initialized model for every trial ---

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels = 2,  # binary classification
        ignore_mismatched_sizes=True
    )


# --- Metrics Function ---
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=1)

    #acc = accuracy_score(labels, preds)
    #f1 = f1_score(labels, preds, average="weighted")
    #precision = precision_score(labels, preds, average="binary")
    probs = softmax(logits, axis=1)[:, 1]
    auc = roc_auc_score(labels, probs)

    return {
      #   "accuracy": acc,
      #   "precision": precision,
      #  "f1": f1,
        "auc": auc
    }

# Define Our Hyperparameter Space
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-6, 5e-5, log=True),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [5, 10, 20]),
    }

# --- Training Arguments ---
training_args = TrainingArguments(
    output_dir="./finbert_finetuned",
    eval_strategy = "epoch",
    save_strategy = "epoch",
    per_device_train_batch_size = 8,
    weight_decay = 0.01,
    logging_dir = "./logs",
    load_best_model_at_end = True,
    report_to = "none"
)

# --- Trainer ---
trainer = Trainer(
    model = None,
    args  = training_args,
    train_dataset = train_dataset,
    eval_dataset  = eval_dataset,
    model_init=model_init,
    compute_metrics=compute_metrics
)

# --- Train and Evaluate ---
best_run = trainer.hyperparameter_search(
    direction ="maximize",            # higher is better; the default objective sums the metrics from compute_metrics (here only AUC)
    backend ="optuna",
    hp_space = optuna_hp_space,
    n_trials = 10,
    )

print(best_run.hyperparameters)

# Extract the best hyperparameters
best_lr = best_run.hyperparameters["learning_rate"]
best_epochs = best_run.hyperparameters["num_train_epochs"]


final_training_args = TrainingArguments(
    output_dir="./finbert_best_run",
    eval_strategy="epoch",   # Evaluate each epoch
    save_strategy="epoch",         # Save at each epoch (optional, if you want)
    logging_strategy="epoch",      # Log at each epoch
    learning_rate=best_lr,
    num_train_epochs=best_epochs,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    load_best_model_at_end=True,
    report_to="none"  # no W&B
)

final_trainer = Trainer(
    model=None,
    model_init=model_init,         # Reinitialize model each time
    args=final_training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)

# ==============================
# 1) Train with the best hyperparameters, save the model, and plot the training process
# ==============================
final_trainer.train()
final_trainer.save_model("./my_finbert_model")
tokenizer.save_pretrained("./my_finbert_model")  # the tokenizer was not passed to the Trainer, so save it directly


history = final_trainer.state.log_history
import matplotlib.pyplot as plt

# We'll parse the log_history for each epoch
train_loss_vals, eval_loss_vals, eval_auc_vals = [], [], []
train_epochs, eval_epochs = [], []

for entry in history:
    if "loss" in entry and "epoch" in entry:
        # Training log
        train_loss_vals.append(entry["loss"])
        train_epochs.append(entry["epoch"])

    if "eval_loss" in entry:
        # Validation log
        eval_loss_vals.append(entry["eval_loss"])
        eval_auc_vals.append(entry["eval_auc"])  # or whichever metrics are in your logs
        eval_epochs.append(entry["epoch"])

# PLOT training vs validation loss
plt.figure()
plt.plot(train_epochs, train_loss_vals, label="Train Loss")
plt.plot(eval_epochs, eval_loss_vals, label="Val Loss")
plt.legend()
plt.show()

# PLOT validation AUC
plt.figure()
plt.plot(eval_epochs, eval_auc_vals, label="Val AUC", marker='o')
plt.legend()
plt.show()

# ==============================
# 2) Load the saved model
# ==============================
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_path = "./my_finbert_model"
new_tokenizer = AutoTokenizer.from_pretrained(model_path)
new_model = AutoModelForSequenceClassification.from_pretrained(model_path)

# ==============================
# 3) Predict on new funds
# ==============================
from transformers import Trainer, TrainingArguments
from datasets import Dataset
import numpy as np
import torch.nn.functional as F
import torch


dummy_args = TrainingArguments(
    output_dir="dummy_out",
    per_device_eval_batch_size=8
)

new_trainer = Trainer(
    model=new_model,
    args=dummy_args
)

# new_dataset must be a tokenized Dataset in the same format as the training/validation datasets
predictions = new_trainer.predict(new_dataset)
pred_logits = predictions.predictions
pred_classes = np.argmax(pred_logits, axis=1)
print("Predicted classes:", pred_classes)


probs = F.softmax(torch.tensor(pred_logits), dim=1).numpy()
positive_prob = probs[:, 1]
print("Probability of class=1:", positive_prob)
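
A note on the search space defined in optuna_hp_space above: passing log=True to trial.suggest_float asks Optuna to sample the learning rate in the log domain, so each order of magnitude between 1e-6 and 5e-5 gets similar attention. The sketch below imitates that with plain log-uniform sampling from the standard library. It is an illustration of the idea only, not Optuna's sampler (which also adapts to earlier trials), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
low, high = 1e-6, 5e-5
samples = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# Under log-uniform sampling, about half the draws fall below the geometric midpoint
geometric_mid = math.sqrt(low * high)
share_below_geo = sum(s < geometric_mid for s in samples) / len(samples)

# Far more than half fall below the arithmetic midpoint
arithmetic_mid = (low + high) / 2
share_below_arith = sum(s < arithmetic_mid for s in samples) / len(samples)

print("share below geometric midpoint: ", share_below_geo)
print("share below arithmetic midpoint:", share_below_arith)
```

This is why a log scale is the usual choice for learning rates: a plain uniform draw over the same range would rarely try values near 1e-6.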