In [1]:
from transformers import BertForSequenceClassification, BertTokenizer
import torch

In [2]:
tokenizer = BertTokenizer.from_pretrained('ProsusAI/finbert')
model = BertForSequenceClassification.from_pretrained('ProsusAI/finbert')

In [3]:
def get_sentiment(tokens):
    outputs = model(**tokens)
    probabilities = torch.nn.functional.softmax(outputs[0], dim=-1)
    return probabilities

In [4]:
txt = '''
1. Higher levels of retail participation in crypto than traditional commodity markets pose unique challenges for regulators.
One in five Americans report having traded cryptocurrency, and polls suggest crypto trading is more common among younger adults, men, and racial minorities. This is quite different from other financial instruments regulated by the CFTC, Benham noted. “You’re going to have more vulnerable investors… It’s incumbent on us to educate, to inform, to disclose risks involved.”

Michael Piwowar, a former Securities and Exchange Commissioner and now executive director of the Milken Institute Center for Financial Markets, linked increased Congressional attention to growth in retail crypto: “If you got one in five households that have interacted with crypto… [members of Congress] are going to start hearing it from their constituents.” Legislation to regulate digital assets has been introduced by Senators Lummis and Gillibrand, Stabenow and Boozman, and Toomey, as well as Representative Gottheimer. The Treasury is actively negotiating bipartisan stablecoin legislation with House Financial Services Committee Chair Waters and Ranking Member McHenry. Benham said that stablecoins, digital currency meant to always be equal to one dollar, are more of a “payment mechanism” and thus should be regulated by prudential banking regulators.

Digital asset regulation may require addressing crypto exchanges and digital wallets. American University Law Professor Hilary Allen noted that the stablecoin legislation under discussion does not, saying, “That is a gaping hole… Almost every major stablecoin… is affiliated with an exchange that profits from trading in that stablecoin.” Mark Wetjen, a former CFTC commissioner and current head of policy and regulatory strategy for FTX (one of the largest crypto exchanges), agreed: “The exchanges are the gateways to the entire crypto space, and so oversight of them is probably most important.” He pushed back that there was no current regulation, noting the requirement for state level licenses, such as New York’s Bitlicense: “If you want to list derivatives on bitcoin, for example, you need a license… so it may not be as dire a situation.”

2. Crypto challenges traditional regulatory distinction between securities and commodities.
Traditionally, the SEC regulates securities while the CFTC regulates commodities and derivatives. Whether crypto is a security or commodity remains unclear, as various subcomponents of the crypto ecosystem challenge existing regulatory divisions. For instance, the SEC recently argued  that nine different crypto tokens were securities in an insider trading case while a federal judge ruled that virtual currency like Bitcoin constitutes a commodity.

Benham called on Congress to provide clarity on which of the hundreds – if not thousands – of coins in existence are securities versus commodities: “Ultimately, we’d like to see law drawing lines.” Piwowar said the lack of clarity creates unwelcome delays as many crypto-related applications before the SEC are “not getting answers” on whether their products represent securities. The result is that some crypto firms are “going outside the United States” to locate their business. Allen cautioned, though, that Congressional action could also constitute an indication that the government supports crypto. She warned against letting crypto into the regulated sphere for fear of giving it “implicit guarantees.”

A solution to the regulatory turf battle could be merging the SEC and CFTC, which Piwowar endorsed, as have many others. Congress, however, has shown little appetite to do so given the different Congressional committee jurisdictions involved.

3. CFTC will restructure to better protect consumers and more effectively regulate markets.
Benham announced several changes at the CFTC during the Brookings event. First, LabCFTC will become the Office of Technology Innovation, reporting directly to the Chairman’s office. Behnam justified this by stating, “We are past the incubator stage, and digital assets and decentralized financial technologies have outgrown their sandboxes.” Second, CFTC’s Office of Customer Education and Outreach will be realigned within the Office of Public Affairs, which Behnam said would “leverage resources and a broader understanding of the issues facing the general public towards addressing the most critical needs in the most vulnerable communities.” Restructuring within a regulator may appear a bureaucratic shuffle but can reflect changes in internal power, agency focus, and prioritization. Directly reporting to the chair increases an office’s authority and prestige.

4. Is crypto a passing fad (or worse, a bubble that threatens financial markets)?
Allen argued that crypto is “purposely less efficient and more complicated than a more centralized system,” and does not have any societal value. FTX’s Wetjen disagreed: “The difference here with blockchain as the underpinning means by which you can transfer value is that there are absolutely no gates.” Piwowar broadly agreed with Wetjen that “We’re going to have the new generation of Amazons and Googles come out of this stuff,” but cautioned that while he was at the SEC, “Nine out of ten [crypto applications] were outright fraud, and then out of the one out of ten, nine out of ten of those were probably fraud.” Since January 2021, over 46,000 people have collectively lost over $1 billion to scams involving crypto.

Everyone wants to avoid a repeat of the 2008 global financial crisis. To do so, regulators have focused on avoiding and mitigating “systemic risk” to the financial system. Asked if he sees a “clear and present danger to the existing economic system,” Benham said he did not, pointing out that crypto is not sufficiently interconnected to pose systemic risk. He noted the decrease in crypto values over the past several months did not cause ripples in the financial system or the broader economy. Piwowar turned the question of systemic risk back onto the actions of financial regulators asking: “What is systemic risk?  It’s the risk that a federal policymaker is going to bail out a bank, either directly or indirectly.” Allen agreed that bailing out crypto would be a mistake quipping: “If anything should be able to fail, it should be crypto, which isn’t… funding productive economic capacity.”

Allen also noted the similarity in arguments centered on American global competitiveness which promoted lax regulation for derivatives: “It’s almost identical to the rhetoric we saw around swaps in the 1990s.” Credit default swaps, like crypto now, faced loose regulation and ultimately helped fuel the subprime mortgage crisis. Behnam noted that one of 2008’s biggest lessons was the need for the CFTC to promote market transparency in the “OTC [over-the-counter] derivative space.” Crypto proponents point to the underlying technology as being inherently more transparent, while critics point to the lack of understanding of aspects of the market, such as what assets back stablecoins like Tether.

5. Does crypto increase financial inclusion?
Cryptocurrency proponents frequently cite financial inclusion as a major benefit linking the higher usage of youth and communities of color who have higher rates of being unbanked or underbanked by traditional finance. Allen cautioned against “predatory inclusion” arguing that, “Because there’s no productive capacity behind them, their value derives from finding someone else to buy them from you.” Wetjen responded, blending his experience serving as a CFTC Commissioner with his time in the crypto industry: “From my own experience… at the CFTC, there’s plenty of authority that’s already in place for the agency to… be pretty thoughtful and relatively prescriptive, even in terms of what actually should be disclosed to, particularly, retail investors, or users of a platform such as FTX.” He argued that the right policy is “giving people the opportunity to be involved and invest in the space that they like but making sure that it’s done with the right safeguards.”
'''

In [5]:
tokens = tokenizer.encode_plus(txt, add_special_tokens=False)
tokens

Token indices sequence length is longer than the specified maximum sequence length for this model (1651 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': [1015, 1012, 3020, 3798, 1997, 7027, 6577, 1999, 19888, 2080, 2084, 3151, 19502, 6089, 13382, 4310, 7860, 2005, 25644, 1012, 2028, 1999, 2274, 4841, 3189, 2383, 7007, 19888, 10085, 3126, 7389, 5666, 1010, 1998, 14592, 6592, 19888, 2080, 6202, 2003, 2062, 2691, 2426, 3920, 6001, 1010, 2273, 1010, 1998, 5762, 14302, 1012, 2023, 2003, 3243, 2367, 2013, 2060, 3361, 5693, 12222, 2011, 1996, 12935, 13535, 1010, 3841, 3511, 3264, 1012, 1523, 2017, 1521, 2128, 2183, 2000, 2031, 2062, 8211, 9387, 1529, 2009, 1521, 1055, 7703, 2006, 2149, 2000, 16957, 1010, 2000, 12367, 1010, 2000, 26056, 10831, 2920, 1012, 1524, 2745, 14255, 12155, 9028, 1010, 1037, 2280, 12012, 1998, 3863, 5849, 1998, 2085, 3237, 2472, 1997, 1996, 6501, 2368, 2820, 2415, 2005, 3361, 6089, 1010, 5799, 3445, 7740, 3086, 2000, 3930, 1999, 7027, 19888, 2080, 1024, 1523, 2065, 2017, 2288, 2028, 1999, 2274, 3911, 2008, 2031, 11835, 2098, 2007, 19888, 2080, 1529, 1031, 2372, 1997, 3519, 1033, 2024, 2183, 2000, 2707, 499

In [6]:
tokens.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])

In [7]:
print(len(tokens.input_ids))
print(len(tokens.token_type_ids))
print(len(tokens.attention_mask))

1651
1651
1651


In [8]:
input_ids = tokens['input_ids']
token_type_ids = tokens['token_type_ids']
attention_mask = tokens['attention_mask']

In [9]:
input_ids[: 10]

[1015, 1012, 3020, 3798, 1997, 7027, 6577, 1999, 19888, 2080]

In [10]:
start = 0
window_len = 512
total_len = len(input_ids)
loop = True

while loop:
    end = start + window_len
    if end >= total_len:
        loop = False
        end = total_len
    
    print(f'start = {start}')
    print(f'end = {end}')
    start = end

start = 0
end = 512
start = 512
end = 1024
start = 1024
end = 1536
start = 1536
end = 1651


In [11]:
def chunk_text_to_window_size_and_predict_proba(attention_mask, input_ids, total_len):
    proba_list = []
    start = 0
    window_len = 510

    loop = True

    while loop:
        end = start + window_len
        if end >= total_len:
            loop = False
            end = total_len

        #1 ==> define the text chunk
        input_ids_chunk = input_ids[start:end]
        attention_mask_chunk = attention_mask[start:end]

        #2 ==> append cls and sep
        input_ids_chunk = [101] + input_ids_chunk + [102]
        attention_mask_chunk = [1] + attention_mask_chunk + [1]

        #3 ==> convert dict to pytorch tensor
        input_dict = {
            'input_ids' : torch.Tensor([input_ids_chunk]).long(),
            'attention_mask' : torch.Tensor([attention_mask_chunk]).int()
        }

        outputs = model(**input_dict)
        probabilities = torch.nn.functional.softmax(outputs[0], dim=-1)
        proba_list.append(probabilities)

        start = end

    return proba_list

In [12]:
proba_list = chunk_text_to_window_size_and_predict_proba(input_ids, attention_mask, total_len)
proba_list

[tensor([[0.3153, 0.1912, 0.4935]], grad_fn=<SoftmaxBackward0>),
 tensor([[0.3153, 0.1912, 0.4935]], grad_fn=<SoftmaxBackward0>),
 tensor([[0.3153, 0.1912, 0.4935]], grad_fn=<SoftmaxBackward0>),
 tensor([[0.0490, 0.0577, 0.8933]], grad_fn=<SoftmaxBackward0>)]

In [13]:
stacks = torch.stack(proba_list)
stacks

tensor([[[0.3153, 0.1912, 0.4935]],

        [[0.3153, 0.1912, 0.4935]],

        [[0.3153, 0.1912, 0.4935]],

        [[0.0490, 0.0577, 0.8933]]], grad_fn=<StackBackward0>)

In [14]:
shape = stacks.shape
shape

torch.Size([4, 1, 3])

In [15]:
torch.reshape(stacks, (shape[0], shape[2]))

tensor([[0.3153, 0.1912, 0.4935],
        [0.3153, 0.1912, 0.4935],
        [0.3153, 0.1912, 0.4935],
        [0.0490, 0.0577, 0.8933]], grad_fn=<ViewBackward0>)

In [16]:
def get_mean_from_proba(proba_list):
    with torch.no_grad():
        stacks = torch.stack(proba_list)
        stacks = stacks.resize(stacks.shape[0], stacks.shape[2])
        mean = stacks.mean(dim=0)
    return mean

mean = get_mean_from_proba(proba_list)



In [17]:
torch.argmax(mean).item()

2

In [18]:
tokens = tokenizer.encode_plus(txt, add_special_tokens=False, return_tensors='pt')
print(len(tokens))
tokens

3


{'input_ids': tensor([[1015, 1012, 3020,  ..., 2015, 1012, 1524]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}

In [19]:
input_ids_chunks = tokens['input_ids'][0].split(510)
attention_mask_chunks = tokens['attention_mask'][0].split(510)

In [21]:
def get_input_ids_and_attention_mask_chunks():
    chunksize = 512
    input_ids_chunks = list(tokens['input_ids'][0].split(chunksize - 2))
    attention_mask_chunks = list(tokens['attention_mask'][0].split(chunksize - 2))

    for i in range(len(input_ids_chunks)):
        input_ids_chunks[i] = torch.cat([
            torch.tensor([101]), input_ids_chunks[i], torch.tensor([102])
        ])

        attention_mask_chunks[i] = torch.cat([
            torch.tensor([1]), attention_mask_chunks[i], torch.tensor([1])
        ])

        pad_len = chunksize - input_ids_chunks[i].shape[0]

        if pad_len > 0:
            input_ids_chunks[i] = torch.cat([
                input_ids_chunks[i], torch.Tensor([0] * pad_len)
            ])
            attention_mask_chunks[i] = torch.cat([
                attention_mask_chunks[i], torch.Tensor([0] * pad_len)
            ])

    return input_ids_chunks, attention_mask_chunks
    


In [22]:
input_ids_chunks, attention_mask_chunks = get_input_ids_and_attention_mask_chunks()


In [23]:
input_ids = torch.stack(input_ids_chunks)
attention_mask = torch.stack(attention_mask_chunks)

input_dict = {
    'input_ids' : input_ids.long(),
    'attention_mask' : attention_mask.int()
}

input_dict

{'input_ids': tensor([[  101,  1015,  1012,  ...,  1010,  1996,   102],
         [  101, 10819,  3728,  ...,  1521,  2128,   102],
         [  101,  2183,  2000,  ...,  2078,  5838,   102],
         [  101,  1010, 23293,  ...,     0,     0,     0]]),
 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 0, 0, 0]], dtype=torch.int32)}

In [24]:
output = model(**input_dict)

probabilities = torch.nn.functional.softmax(output[0], dim=-1)

mean_probabilities = probabilities.mean(dim=0)

mean_probabilities

tensor([0.0734, 0.1197, 0.8068], grad_fn=<MeanBackward1>)