<a href="https://colab.research.google.com/github/Crystal-Reshea/FinBert-Albert-nlp/blob/main/Fine_Tuning_Albert_for_QA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-Tuning ALBERT on SQuAD 2.0

In [None]:
pip install transformers

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import json
def read_squad(file_name):
  """
  Read a SQuAD 2.0 JSON file and return parallel lists of
  contexts, questions, and answers (one entry per answer)
  """
  # open JSON file and load into dictionary
  with open(file_name, 'rb') as file:
    squad2_dict = json.load(file)

  contexts = []
  questions = []
  answers = []
  # iterate through all articles in the SQuAD data
  for key in squad2_dict['data']:
    for passage in key['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
          question = qa['question']
          # check if we need to be extracting from 'answers' or 'plausible_answers'
          if 'plausible_answers' in qa.keys():
              access = 'plausible_answers'
          else:
              access = 'answers'
          for answer in qa[access]:
            # append data to lists
            contexts.append(context)
            questions.append(question)
            answers.append(answer)
  # return formatted data lists once every article has been processed
  return contexts, questions, answers
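
To make the nesting concrete, the sketch below flattens a tiny hand-written record shaped like the SQuAD 2.0 JSON. The record contents and the `flatten_squad` name are invented for illustration; only the key names (`data`, `paragraphs`, `qas`, `answers`, `plausible_answers`) follow the layout that `read_squad` expects.

```python
# Toy record in the SQuAD 2.0 layout (contents are invented for illustration).
toy_squad = {
    "data": [
        {
            "paragraphs": [
                {
                    "context": "ALBERT is a lite version of BERT.",
                    "qas": [
                        {
                            "question": "What is ALBERT?",
                            "answers": [{"text": "a lite version of BERT", "answer_start": 10}],
                        },
                        {
                            "question": "Who trained ALBERT?",
                            "answers": [],
                            "plausible_answers": [{"text": "BERT", "answer_start": 28}],
                        },
                    ],
                }
            ]
        }
    ]
}


def flatten_squad(squad_dict):
    """Return parallel lists with one entry per (context, question, answer)."""
    contexts, questions, answers = [], [], []
    for article in squad_dict["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                # prefer 'plausible_answers' when the key is present
                key = "plausible_answers" if "plausible_answers" in qa else "answers"
                for answer in qa[key]:
                    contexts.append(paragraph["context"])
                    questions.append(qa["question"])
                    answers.append(answer)
    return contexts, questions, answers


toy_contexts, toy_questions, toy_answers = flatten_squad(toy_squad)
for q, a in zip(toy_questions, toy_answers):
    print(q, "->", a["text"])
```

Each answer dictionary carries only `text` and `answer_start`, which is why the end index has to be computed in a separate step.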

In [None]:
train_path = '/content/drive/MyDrive/NLP_POC/train-v2.0.json'
dev_path = '/content/drive/MyDrive/NLP_POC/dev-v2.0.json'

# execute our read SQuAD function for training and validation sets
train_contexts, train_questions, train_answers = read_squad(train_path)
val_contexts, val_questions, val_answers = read_squad(dev_path)

In [None]:
def add_end_idx(answers, contexts):
    # loop through each answer-context pair
    for answer, context in zip(answers, contexts):
        # target_text is the answer we are looking for within context
        target_text = answer['text']
        # where the answer starts in context
        start_index = answer['answer_start']
        # where the answer should end
        end_index = start_index + len(target_text)

        # sometimes the annotated start index is off by a few characters
        if context[start_index:end_index] == target_text:
            # if the end index is correct, we add it to the dictionary
            answer['answer_end'] = end_index
        else:
            # otherwise try shifting the span left by 1 to 3 characters
            for n in range(1,4):
                if context[start_index-n:end_index-n] == target_text:
                    answer['answer_start'] = start_index - n
                    answer['answer_end'] = end_index - n

add_end_idx(train_answers, train_contexts)
add_end_idx(val_answers, val_contexts)
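
The shift check is easier to follow on a single string. The sketch below is a standalone variant, not the function used above: `find_answer_span` and `max_shift` are names chosen for this example, and unlike `add_end_idx` it returns `None` when no nearby match exists instead of leaving the answer untouched.

```python
def find_answer_span(context, text, start, max_shift=3):
    """Return (start, end) of `text` in `context`, trying small left shifts."""
    for shift in range(max_shift + 1):
        s = start - shift
        e = s + len(text)
        if s >= 0 and context[s:e] == text:
            return s, e
    return None  # no match within the allowed shift


toy_context = "The album was released on June 28, 2011 in the US."
# The true start of "June 28, 2011" is 26; 28 simulates an annotation that is off by two.
span = find_answer_span(toy_context, "June 28, 2011", 28)
print(span, repr(toy_context[span[0]:span[1]]))
```

The end index is exclusive, so `context[start:end]` reproduces the answer text exactly.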

In [None]:
from transformers import AlbertTokenizerFast
tokenizer = AlbertTokenizerFast.from_pretrained('albert-base-v2')


In [None]:
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
val_encodings = tokenizer(val_contexts, val_questions, truncation=True, padding=True)

In [None]:
tokenizer.decode(train_encodings['input_ids'][250])

'[CLS] her fourth studio album 4 was released on june 28, 2011 in the us. 4 sold 310,000 copies in its first week and debuted atop the billboard 200 chart, giving beyonce her fourth consecutive number-one album in the us. the album was preceded by two of its singles "run the world (girls)" and "best thing i never had", which both attained moderate success. the fourth single "love on top" was a commercial success in the us. 4 also produced four other singles; "party", "countdown", "i care" and "end of time". "eat, play, love", a cover story written by beyonce for essence that detailed her 2010 career break, won her a writing award from the new york association of black journalists. in late 2011, she took the stage at new york\'s roseland ballroom for four nights of special performances: the 4 intimate nights with beyonce concerts saw the performance of her 4 album to a standing room only.[SEP] where did beyonce perform for four nights of standing room only concerts in 2011?[SEP]<pad><pa

In [None]:
def add_token_positions(encodings, answers):
  """
  Converts the character-level answer start and
  end positions into token-level positions
  that the model can be trained on
  """
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
      # append start/end token position using char_to_token method
      start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
      end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

      # if start position is None, the answer passage has been truncated
      if start_positions[-1] is None:
          start_positions[-1] = tokenizer.model_max_length
      # if the end position is None, char_to_token landed on a space, so step back until a token is found
      shift = 1
      while end_positions[-1] is None:
          end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - shift)
          shift += 1
  # update our encodings object with the new token-based start/end positions
  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

# apply function to our data
add_token_positions(train_encodings, train_answers)
add_token_positions(val_encodings, val_answers)
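
Why the end position sometimes comes back as `None` is clearer with a toy tokenizer. The sketch below splits on single spaces instead of using ALBERT's subword vocabulary, and its `whitespace_offsets` and `char_to_token` helpers are stand-ins written for this example, not the `transformers` methods. The idea is the same: `answer_end` is exclusive, so it often points at the space after the answer, which belongs to no token.

```python
def whitespace_offsets(text):
    """Toy tokenizer: split on single spaces and record each token's (start, end) span."""
    spans, pos = [], 0
    for word in text.split(" "):
        spans.append((pos, pos + len(word)))
        pos += len(word) + 1  # skip the single space separator
    return spans


def char_to_token(spans, char_index):
    """Return the index of the token covering `char_index`, or None (e.g. for a space)."""
    for i, (start, end) in enumerate(spans):
        if start <= char_index < end:
            return i
    return None


text = "saved for 90 days so"
spans = whitespace_offsets(text)
answer_start, answer_end = 10, 17  # text[10:17] == "90 days"; the end is exclusive

start_token = char_to_token(spans, answer_start)
end_token = char_to_token(spans, answer_end)  # lands on the space after "days"
shift = 1
while end_token is None:  # step back one character at a time until a token is hit
    end_token = char_to_token(spans, answer_end - shift)
    shift += 1

print(start_token, end_token)
```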

In [None]:
train_encodings.keys()

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions'])

In [None]:
from transformers import AlbertForQuestionAnswering
model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')

Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForQuestionAnswering: ['predictions.decoder.bias', 'predictions.dense.weight', 'predictions.LayerNorm.bias', 'predictions.LayerNorm.weight', 'predictions.bias', 'predictions.decoder.weight', 'predictions.dense.bias']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN t

In [None]:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# build datasets for both our training and validation sets
train_dataset = SquadDataset(train_encodings)
val_dataset = SquadDataset(val_encodings)

In [None]:
from torch.utils.data import DataLoader
# torch.optim.AdamW is used in place of the deprecated transformers.AdamW
from torch.optim import AdamW
from tqdm import tqdm

# setup GPU/CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize AdamW optimizer (Adam with decoupled weight decay, which reduces the chance of overfitting)
optim = AdamW(model.parameters(), lr=5e-5)

# initialize data loader for training data
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for epoch in range(3):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # reset the gradients accumulated in the previous step
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # extract loss
        loss = outputs[0]
        # backpropagate to compute the gradient of the loss for every parameter
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|██████████| 16290/16290 [2:06:57<00:00,  2.14it/s, loss=0.796]
Epoch 1: 100%|██████████| 16290/16290 [2:06:56<00:00,  2.14it/s, loss=1.16]
Epoch 2: 100%|██████████| 16290/16290 [2:06:56<00:00,  2.14it/s, loss=0.769]
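
The loss shown in the progress bar is computed by the model itself. To my understanding, the question-answering heads in `transformers` average two cross-entropy terms, one over the start logits and one over the end logits. The plain-Python sketch below reproduces that arithmetic for a single example; the logits and positions are made-up numbers, and `cross_entropy` is a helper written for this illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Made-up logits for a four-token sequence.
start_logits = [0.1, 2.0, 0.3, -1.0]
end_logits = [-0.5, 0.2, 1.5, 0.0]
start_position, end_position = 1, 2

start_loss = cross_entropy(start_logits, start_position)
end_loss = cross_entropy(end_logits, end_position)
qa_loss = (start_loss + end_loss) / 2
print(round(qa_loss, 4))
```

The value printed per batch in the training loop is the mean of this quantity over the eight examples in the batch, which is why it fluctuates from step to step.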


In [None]:
model_path = '/content/drive/MyDrive/NLP_POC/models'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/NLP_POC/models/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_POC/models/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_POC/models/tokenizer.json')

# Using the New Model
## Bringing in Model


In [None]:
pip install transformers

In [None]:
import torch

In [None]:
import transformers
from transformers import AlbertForQuestionAnswering
from transformers import AlbertTokenizerFast

In [None]:
# get new model and tokenizer from path
model = AlbertForQuestionAnswering.from_pretrained('/content/drive/MyDrive/NLP_POC/models')
tokenizer = AlbertTokenizerFast.from_pretrained('/content/drive/MyDrive/NLP_POC/models')

## Function to Process Answers through Model

In [None]:
def answer_question(question, answer_text):
  '''
  Takes a `question` string and an `answer_text` string (which contains the
  answer), and identifies the words within the `answer_text` that are the
  answer. Returns them as a string.
  '''
  # get input ids from the tokenizer
  input_ids = tokenizer.encode(question, answer_text, truncation=True)
  # get segment ids
  sep_index, num_seg_a, num_seg_b, segment_ids = get_segment_ids(input_ids)
  # get outputs 
  outputs = model(torch.tensor([input_ids]), # The tokens representing our input text.
                  token_type_ids=torch.tensor([segment_ids]), # The segment IDs to differentiate question from answer_text
                  return_dict=True) 

  start_scores = outputs.start_logits
  end_scores = outputs.end_logits
  # construct answer
  answer = construct_answer(start_scores, end_scores, input_ids)
  return answer

In [None]:
def get_segment_ids(input_ids): 
  # find SEP token id
  sep_index = input_ids.index(tokenizer.sep_token_id)
  # the number of segment a and b tokens
  num_seg_a = sep_index + 1
  num_seg_b = len(input_ids) - num_seg_a
  # create segment ids of 1 and 0 
  segment_ids = [0]*num_seg_a + [1]*num_seg_b
  assert len(segment_ids) == len(input_ids), "The length of segment ids and input ids must be equal"
  return sep_index, num_seg_a, num_seg_b, segment_ids
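
A short worked example of the segment-id layout, using made-up token ids so it runs without a tokenizer. `segment_ids_from` is a standalone copy of the logic above that takes the separator id as a parameter; the ids 2 and 3 stand in for `[CLS]` and `[SEP]` and are assumptions of this sketch.

```python
def segment_ids_from(input_ids, sep_token_id):
    """Mark everything up to and including the first separator as segment 0, the rest as 1."""
    sep_index = input_ids.index(sep_token_id)
    num_seg_a = sep_index + 1
    return [0] * num_seg_a + [1] * (len(input_ids) - num_seg_a)


# Hypothetical ids: 2 = [CLS], 3 = [SEP]; the other numbers stand in for word pieces.
toy_ids = [2, 101, 102, 103, 3, 201, 202, 3]
toy_segments = segment_ids_from(toy_ids, sep_token_id=3)
print(toy_segments)  # [0, 0, 0, 0, 0, 1, 1, 1]
```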

In [None]:
def construct_answer(start_scores, end_scores, input_ids):
  # find the tokens with the highest start and end scores
  answer_start = int(torch.argmax(start_scores))
  answer_end = int(torch.argmax(end_scores))
  tokens = tokenizer.convert_ids_to_tokens(input_ids)

  # SentencePiece marks the first piece of every word with '\u2581' (rendered as '▁')
  word_start = '\u2581'
  answer = ''
  for token in tokens[answer_start:answer_end + 1]:
      if token.startswith(word_start):
          # start of a new word: drop the marker and separate it with a space
          answer += ' ' + token[len(word_start):]
      else:
          # continuation piece: attach it directly to the previous piece
          answer += token
  # remove the leading space added before the first word
  return answer.strip()
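
Taking the two argmax values independently can produce an end index that comes before the start index, which yields a broken or empty answer. A common alternative, shown here as a sketch rather than as what this notebook does, scores every (start, end) pair with start <= end and a bounded length, then keeps the best one. `best_span` and `max_answer_len` are names chosen for this example, and the scores are made up.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return the (start, end) pair with the highest summed score and start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        # only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


start_scores = [0.1, 0.2, 3.0, 0.5]
end_scores = [2.5, 0.1, 0.3, 2.0]
# Independent argmax would give start=2 and end=0, which is not a valid span.
span = best_span(start_scores, end_scores)
print(span)  # (2, 3)
```

The function expects plain Python lists, so logits from the model would first need converting, for example with `outputs.start_logits[0].tolist()`.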

## Sample Question Answering about Erica

In [None]:
questions = ["What languages does Erica know?", "Do I need the app to use Erica?", "What does Erica use to work?","How long does Erica keep my conversations?"]
context = "Erica leverages the latest technologies, in advanced analytics and cognitive messaging to serve as your trusted financial assistant. Erica is able to consider a range of data within Bank of America, like your cash flow, balances, transaction history and upcoming bills, to help you stay on top of your finances. Right now, Erica is exclusively available in the Mobile Banking app (app versions 7.6 and above). Just download the app today to get started! Erica is also planning to be available in Online Banking. For now, Erica is only available in English, but it is expected to learn Spanish. We keep a record of your conversations with Erica for quality assurance, to maintain an accurate account of your requests, identify opportunities to make Erica's responses more helpful and ensure Erica's performance is optimal. When you speak with Erica by voice, the discussions are recorded and saved for 90 days so they can be analyzed to help refine listening skills."

In [None]:
from pprint import pprint as pp
for i in range(len(questions)): 
  pp(questions[i]+ ': ' + answer_question(questions[i], context))
  print("")

'What languages does Erica know?: english,'

'Do I need the app to use Erica?: just download the app today to get started!'

('What does Erica use to work?: advanced analytics and cognitive messaging to '
 'serve as your trusted financial assistant.')

'How long does Erica keep my conversations?: 90 days'



# Question Answering on 10-K Form

In [None]:
file = '/content/drive/MyDrive/NLP_POC/bby-202110k.txt'

## Pre-Processing 10-K Form

In [None]:
import re
def process_text(file_name): 
  # collect only the necessary lines of text
  data = line_collection(file_name)
  # collect ITEM names 
  text, toc = find_toc(data)
  # create list of items in table of contents
  items = list(toc.keys())
  # return dictionary of item content pairs
  return extract_text(text,toc)

In [None]:
def line_collection(file_name):
  data = []
  with open(file_name, 'r') as file:
    for line in file:  # read the file and remove unnecessary lines
      new_line = line.replace('\n',' ')
      # skip lines that are obviously not needed
      if re.sub(r"\s+", "", line).lower() == "tableofcontents" or len(line) <= 3 or line.startswith("PART"):
        continue
      # short, capitalized, letters-only lines are treated as headings within Items
      is_heading = (8 <= len(new_line) < 50 and new_line[0].isupper()
                    and "." not in new_line and re.sub(r"\s+", "", new_line).isalpha())
      if is_heading:
        # keep the trailing newline so the text can later be split at headings
        data.append(line.upper())
      else:
        data.append(new_line)
  # the with-block closes the file, so no explicit close() is needed
  return data

In [None]:
def find_toc(data): 
  toc = {}
  # Adding names of headers to table of contents dictionary
  for line in data: 
    if line.startswith("ITEM") or line == 'SIGNATURES': 
      toc[line] = ""
  # Converting list to string
  text = "".join(data) 
  return(text, toc)

In [None]:
def extract_text(text,toc):
  items = list(toc.keys())
  # Collecting text between headers and adding them to dictionary
  for i in range(1, len(items)): 
    start = items[i-1]
    end = items[i]
    # escape the headers so characters such as '.' or '(' are matched literally
    toc[start] = re.search(r'((?<=' + re.escape(start) + ').*(?=' + re.escape(end) + '))', text, re.S | re.M)[0]
  return toc, items
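
The header-to-header extraction can be tried on a short string. The sketch below is a simplified variant under stated assumptions: it uses a capture group instead of lookarounds, a non-greedy match so it stops at the first occurrence of the closing header, and returns `None` when a header is missing. `text_between` and the toy filing text are invented for this example.

```python
import re


def text_between(text, start, end):
    """Return the text between two literal headers, or None if either is missing."""
    # re.escape makes '.' and other special characters in the headers match literally
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else None


toy_filing = (
    "ITEM 1. BUSINESS We sell things. "
    "ITEM 1A. RISK FACTORS Things may not sell. "
    "ITEM 2. PROPERTIES"
)
business = text_between(toy_filing, "ITEM 1. BUSINESS", "ITEM 1A. RISK FACTORS")
print(repr(business))
```

Checking for a missing match before indexing avoids the `TypeError` that `re.search(...)[0]` raises when a header does not appear in the text.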

## Function to split up text in Items by headings

In [None]:
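def fill_item_dict(arr_split):
  # avoid naming the variable `dict`, which would shadow the built-in type
  item_dict = {}
  for i in range(1,len(arr_split)):
    # the heading is the last run of all-caps words in the previous chunk
    heading = re.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b',arr_split[i-1])[-1]
    content = arr_split[i]
    item_dict[heading] = content
  return item_dict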
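
A toy run shows what the heading regex picks up. The chunks below are invented, and `split_by_headings` is a standalone copy of the same logic written for this example. Each chunk ends with the all-caps heading of the section that follows it, so the last all-caps run in chunk `i-1` names the content in chunk `i`.

```python
import re

HEADING_PATTERN = r'\b[A-Z]+(?:\s+[A-Z]+)*\b'


def split_by_headings(chunks):
    """Map the last all-caps run in each chunk to the text of the next chunk."""
    sections = {}
    for i in range(1, len(chunks)):
        heading = re.findall(HEADING_PATTERN, chunks[i - 1])[-1]
        sections[heading] = chunks[i]
    return sections


toy_chunks = [
    "Intro text NET SALES",
    "Sales rose 5%. GROSS PROFIT",
    "Profit fell.",
]
toy_sections = split_by_headings(toy_chunks)
for heading, content in toy_sections.items():
    print(heading, "->", content)
```

Note that each value still ends with the heading of the next section, because the heading sits at the end of the chunk it was found in.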


## Extracting Text from Item #7 of 10-K Form




In [None]:
# process 10-K data
file_name = '/content/drive/MyDrive/NLP_POC/bby-202110k.txt'
text_dict, toc_list = process_text(file_name)

In [None]:
# string of all relevant item 7 content
item7 = text_dict[toc_list[7]]
# list of item 7 content split by new lines
item7_split = item7.split('\n')
# dictionary of all item 7 content organized by heading
item7_dict = fill_item_dict(item7_split)

In [None]:
item7_headings = list(item7_dict.keys())
pp(item7_headings)

['OF OPERATIONS',
 'OVERVIEW',
 'RESTRUCTURING AND BUSINESS TRANSFORMATION',
 'SUMMARY OF FINANCIAL PERFORMANCE',
 'RESULTS OF OPERATIONS',
 'FISCAL YEAR ENDED',
 'NET SALES',
 'PERCENTAGE',
 'PERCENTAGE CHANGE',
 'COST OF SALES',
 'GROSS PROFIT',
 'GOODWILL AND OTHER IMPAIRMENTS',
 'GAIN ON EXTINGUISHMENT OF DEBT',
 'LOSS BEFORE PROVISION FOR INCOME TAXES',
 'BENEFIT FROM INCOME TAXES',
 'NET LOSS',
 'OPERATING LOSS',
 'INCOME TAXES',
 'TRANSFORMATION',
 'LIQUIDITY AND CAPITAL RESOURCES',
 'TOTAL CONTRACTUAL OBLIGATIONS',
 'SEASONALITY',
 'INFLATION',
 'CRITICAL ACCOUNTING POLICIES']


# Answering Questions about sections of Item 7: Management’s Discussion and Analysis of Financial Condition and Results of Operations

## Transformations
We are executing on a comprehensive plan to transform our business and position us for long-term success under the leadership of our President and CEO Mark Tritton, who joined the Company on November 4, 2019. Mr. Tritton has been assessing our operations, portfolio, capabilities and culture and is developing and implementing the initial stages of a strategic plan designed to re-establish our leading position as the preferred omnichannel home destination, which is grounded in five key pillars: product, price, promise, place and people. With these five pillars as our framework, and a singular purpose to make it easy for customers to feel at home, we are embracing a commitment to build and manage a modern, durable omnichannel model.

Early actions include the extensive restructure of our leadership team. Interim leaders were appointed in merchandising, marketing, digital, stores, operations, finance, legal and human resources. During fiscal 2020, we announced the hiring of a new leadership team, consisting of the following:

- On March 4, 2020, Joe Hartsig joined the Company as Executive Vice President, Chief Merchandising Officer of the Company and President of Harmon Stores Inc.;
- On May 4, 2020, Gustavo Arnal joined the Company as Executive Vice President, Chief Financial Officer and Treasurer;
- On May 11, 2020, Rafeh Masood joined the Company as Executive Vice President, Chief Digital Officer;
- On May 11, 2020, Gregg Melnick assumed the role of Executive Vice President, Chief Stores Officer. Previously, Mr. Melnick served as the Company’s interim Chief Digital Officer;
- On May 18, 2020, John Hartmann joined the Company as Chief Operating Officer of the Company and President, buybuy BABY;
- On May 18, 2020, Arlene Hong joined the Company as Executive Vice President, Chief Legal Officer and Corporate Secretary;
- On May 26, 2020, Cindy Davis joined the Company as Executive Vice President, Chief Brand Officer of the Company and President, Decorist; and
- On September 28, 2020, Lynda Markoe joined the Company as Executive Vice President, Chief People and Culture Officer.

As discussed in "Overview" above, as part of our business transformation, we are also pursuing deliberate actions as part of our restructuring program to drive profit improvement over the next two-to-three years. We expect to reinvest a portion of the expected cost savings into future growth initiatives.

In [None]:
transformation = item7_dict['TRANSFORMATION']

In [None]:
context = transformation
questions = ["Who is the president of the company?", "What are the five key pillars of the strategic plan?", 'How does the company plan to grow?' ]

In [None]:
for i in range(len(questions)):
  print("Question: " + questions[i] + "\nAnswer:  " + answer_question(questions[i], context))
  print("")

Question: Who is the president of the company?
Answer:  mark tritton,

Question: What are the five key pillars of the strategic plan?
Answer:  product, price, promise, place and people.

Question: How does the company plan to grow?
Answer:  reinvest a portion of the expected cost savings into future growth initiatives.

