<a href="https://colab.research.google.com/github/Crystal-Reshea/FinBert-Albert-nlp/blob/main/Hugging_Face_Task_and_QA_Example_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# A 🤗 tour of transformer applications

In this notebook we take a tour of transformer applications. The transformer architecture is very versatile and lets us perform many NLP tasks with only minor modifications. For this reason transformers have been applied to a wide range of NLP tasks such as classification, named entity recognition, and translation.

## Pipeline

We experiment with models for these tasks using the high-level API called pipeline. The pipeline takes care of all preprocessing and returns cleaned up predictions. The pipeline is primarily used for inference where we apply fine-tuned models to new examples.

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/pipeline.png?raw=1" alt="Overview of the pipeline API" title="Pipeline" width=800>
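To make the three stages concrete, here is a minimal sketch of what a pipeline does conceptually: preprocess the text, run a model, and post-process the raw scores into a label. The vocabulary, weights, and function names are hypothetical stand-ins, not the library's implementation; a real pipeline loads a tokenizer and a pretrained model instead.

```python
import math

# Hypothetical toy vocabulary and per-token weights for [NEGATIVE, POSITIVE];
# a real pipeline loads these from a pretrained checkpoint.
VOCAB = {"great": 0, "love": 1, "terrible": 2, "broken": 3}
WEIGHTS = [(-1.0, 2.0), (-0.5, 1.5), (2.0, -1.0), (1.5, -0.5)]
LABELS = ["NEGATIVE", "POSITIVE"]

def preprocess(text):
    """Tokenize: lowercase, split on whitespace, map known words to ids."""
    words = (w.strip(".,!?") for w in text.lower().split())
    return [VOCAB[w] for w in words if w in VOCAB]

def model(token_ids):
    """'Forward pass': sum the per-token weights into one logit per label."""
    logits = [0.0, 0.0]
    for token_id in token_ids:
        for k, weight in enumerate(WEIGHTS[token_id]):
            logits[k] += weight
    return logits

def postprocess(logits):
    """Softmax the logits and return the top label in pipeline-style output."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]

def toy_pipeline(text):
    return postprocess(model(preprocess(text)))

print(toy_pipeline("The figure arrived broken, terrible service!"))
```

The output mimics the list-of-dicts shape that the real pipelines in this notebook return.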

## Setup

Before we start we need to make sure we have the transformers library installed as well as the sentencepiece tokenizer which we'll need for some models.

In [None]:
%%capture
!pip install transformers
!pip install sentencepiece

Furthermore, we create a text wrapper to format long texts nicely.

In [None]:
import textwrap
wrapper = textwrap.TextWrapper(width=80, break_long_words=False, break_on_hyphens=False)

## Classification

We start by setting up an example text that we would like to analyze with a transformer model. This looks like your standard customer feedback from a transformer:

In [None]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

print(wrapper.fill(text))

Dear Amazon, last week I ordered an Optimus Prime action figure from your online
store in Germany. Unfortunately, when I opened the package, I discovered to my
horror that I had been sent an action figure of Megatron instead! As a lifelong
enemy of the Decepticons, I hope you can understand my dilemma. To resolve the
issue, I demand an exchange of Megatron for the Optimus Prime figure I ordered.
Enclosed are copies of my records concerning this purchase. I expect to hear
from you soon. Sincerely, Bumblebee.


One of the most common tasks in NLP and especially when dealing with customer texts is _sentiment analysis_. We would like to know if a customer is satisfied with a service or product and potentially aggregate the feedback across all customers for reporting.

For text classification the model gets all the inputs and makes a single prediction as shown in the following example:

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/clf_arch.png?raw=1" alt="Text classification architecture" title="Text classification" width=600>

We can achieve this by setting up a `pipeline` object which wraps a transformer model. When initializing it we need to specify the task. Sentiment analysis is a form of text classification, where a single label is assigned to a whole piece of text.

In [None]:
from transformers import pipeline

sentiment_pipeline = pipeline('text-classification')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


Downloading:   0%|          | 0.00/629 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/255M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

You can see a warning message: we did not specify in the pipeline which model we would like to use. In that case it loads a default model. The `distilbert-base-uncased-finetuned-sst-2-english` model is a small BERT variant trained on [SST-2](https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary) which is a sentiment analysis dataset.

You'll notice that the first time you run the pipeline the model is downloaded from the 🤗 Hub. On later runs the cached model is used.

Now we are ready to run our example through pipeline and look at some predictions:

In [None]:
sentiment_pipeline(text)

[{'label': 'NEGATIVE', 'score': 0.9015457034111023}]

The model predicts negative sentiment with high confidence, which makes sense. The pipeline returns a list of dicts with the predictions. We can also pass several texts at once, in which case the list contains one dict per text.
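Since the predictions come back as plain dicts, aggregating feedback across customers for reporting needs only a few lines. The sketch below assumes a list of predictions in the same shape as the pipeline output; the values are made up for illustration and `summarize_sentiment` is a hypothetical helper, not part of the library.

```python
from collections import defaultdict

def summarize_sentiment(predictions):
    """Count predictions per label and average their confidence scores."""
    scores_by_label = defaultdict(list)
    for pred in predictions:
        scores_by_label[pred["label"]].append(pred["score"])
    return {
        label: {"count": len(scores), "mean_score": sum(scores) / len(scores)}
        for label, scores in scores_by_label.items()
    }

# Hypothetical predictions in the shape the pipeline returns for a list of texts.
predictions = [
    {"label": "NEGATIVE", "score": 0.90},
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.70},
]
print(summarize_sentiment(predictions))
```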

## Named entity recognition

Let's see if we can do something a little more sophisticated. Instead of just finding the overall sentiment let's see if we can extract named entities such as organizations, locations, or individuals from the text. This task is called named entity recognition (NER). Instead of predicting just a class for the whole text a class is predicted for each token, thus this task belongs to the category of token classification:

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/ner_arch.png?raw=1" alt="Token classification (NER) architecture" title="Token classification" width=550>

Again, we just load a pipeline for the NER task without specifying a model. This will load a default BERT model that has been trained on the [CoNLL-2003](https://huggingface.co/datasets/conll2003) dataset.

In [None]:
ner_pipeline = pipeline('ner')

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



When we pass our text through the model we get a list of dicts, each corresponding to one detected entity. Since multiple tokens can correspond to a single entity, we can apply an aggregation strategy that merges tokens if the same class appears in consecutive tokens.

In [None]:
entities = ner_pipeline(text, aggregation_strategy="simple")
print(entities)

[{'entity_group': 'ORG', 'score': 0.87900966, 'word': 'Amazon', 'start': 5, 'end': 11}, {'entity_group': 'MISC', 'score': 0.9908588, 'word': 'Optimus Prime', 'start': 36, 'end': 49}, {'entity_group': 'LOC', 'score': 0.9997547, 'word': 'Germany', 'start': 90, 'end': 97}, {'entity_group': 'MISC', 'score': 0.55656844, 'word': 'Mega', 'start': 208, 'end': 212}, {'entity_group': 'PER', 'score': 0.5902572, 'word': '##tron', 'start': 212, 'end': 216}, {'entity_group': 'ORG', 'score': 0.66969234, 'word': 'Decept', 'start': 253, 'end': 259}, {'entity_group': 'MISC', 'score': 0.498349, 'word': '##icons', 'start': 259, 'end': 264}, {'entity_group': 'MISC', 'score': 0.77536106, 'word': 'Megatron', 'start': 350, 'end': 358}, {'entity_group': 'MISC', 'score': 0.98785394, 'word': 'Optimus Prime', 'start': 367, 'end': 380}, {'entity_group': 'PER', 'score': 0.8120963, 'word': 'Bumblebee', 'start': 502, 'end': 511}]


Let's clean up the outputs a bit:

In [None]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")

Amazon: ORG (0.88)
Optimus Prime: MISC (0.99)
Germany: LOC (1.00)
Mega: MISC (0.56)
##tron: PER (0.59)
Decept: ORG (0.67)
##icons: MISC (0.50)
Megatron: MISC (0.78)
Optimus Prime: MISC (0.99)
Bumblebee: PER (0.81)


It seems that the model found most of the named entities but was confused about the class of the transformer characters. This is no surprise since the original dataset probably did not contain many transformer characters. For this reason it makes sense to further fine-tune a model on your own dataset!
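The output also shows names split into word pieces (`Mega` / `##tron`) because the two pieces received different labels, so the `"simple"` strategy did not merge them. As a rough post-processing sketch, adjacent pieces can be glued back together when a `##` piece starts exactly where the previous span ends. Keeping the label of the higher-scoring piece is an assumption of this sketch, and `merge_word_pieces` is a hypothetical helper.

```python
def merge_word_pieces(entities):
    """Merge '##' continuation pieces into the entity that ends where they start."""
    merged = []
    for ent in entities:
        prev = merged[-1] if merged else None
        if prev and ent["word"].startswith("##") and ent["start"] == prev["end"]:
            # Assumption: keep the label of the more confident piece.
            if ent["score"] > prev["score"]:
                prev["entity_group"] = ent["entity_group"]
                prev["score"] = ent["score"]
            prev["word"] += ent["word"][2:]  # drop the '##' prefix
            prev["end"] = ent["end"]
        else:
            merged.append(dict(ent))  # copy so the input list stays untouched
    return merged

# The four split entries copied from the pipeline output above.
entities = [
    {"entity_group": "MISC", "score": 0.55656844, "word": "Mega", "start": 208, "end": 212},
    {"entity_group": "PER", "score": 0.5902572, "word": "##tron", "start": 212, "end": 216},
    {"entity_group": "ORG", "score": 0.66969234, "word": "Decept", "start": 253, "end": 259},
    {"entity_group": "MISC", "score": 0.498349, "word": "##icons", "start": 259, "end": 264},
]
merged = merge_word_pieces(entities)
for entity in merged:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
```

The pipeline also accepts other `aggregation_strategy` values such as `"first"`, `"average"`, and `"max"`, which resolve labels at the word level and may make this manual step unnecessary.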

## Summarization

Let's see if we can go beyond these natural language understanding tasks (NLU) where BERT excels and delve into the generative domain. Note that generation is much more expensive since we usually generate one token at a time and need to run this several times.

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/gen_steps.png?raw=1" alt="Step-by-step text generation" title="Generation steps" width=600>
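The cost of generation can be seen in a toy greedy decoding loop: every new token requires another call to the model. The lookup table below is a hypothetical stand-in for a real language model, so this is only a sketch of the control flow, not of how the summarization pipeline is implemented.

```python
# Toy next-token table standing in for a real language model (hypothetical data).
NEXT_TOKEN_SCORES = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"robot": 0.7, "package": 0.3},
    "robot": {"arrived": 0.8, "</s>": 0.2},
    "arrived": {"</s>": 0.9, "the": 0.1},
}

def toy_model(tokens):
    """One 'forward pass': scores for the next token given the tokens so far."""
    return NEXT_TOKEN_SCORES[tokens[-1]]

def greedy_decode(max_new_tokens=10):
    """Generate tokens one at a time, always picking the highest-scoring one."""
    tokens = ["<s>"]
    forward_passes = 0
    for _ in range(max_new_tokens):
        scores = toy_model(tokens)
        forward_passes += 1
        next_token = max(scores, key=scores.get)
        if next_token == "</s>":  # end-of-sequence token stops generation
            break
        tokens.append(next_token)
    return tokens[1:], forward_passes

generated, passes = greedy_decode()
print(generated, passes)
```

Three generated tokens need four model calls here (the last one produces the end-of-sequence token), whereas a classification model needs a single call for the whole text.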

A popular task involving generation is summarization. Let's see if we can use a transformer to generate a summary for us:

In [None]:
summarization_pipeline = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



This model was trained on the [CNN/Dailymail dataset](https://huggingface.co/datasets/cnn_dailymail) to summarize news articles.

In [None]:
outputs = summarization_pipeline(text, max_length=45, clean_up_tokenization_spaces=True)
print(wrapper.fill(outputs[0]['summary_text']))

 Bumblebee ordered an Optimus Prime action figure from your online store in
Germany. Unfortunately, when I opened the package, I discovered to my horror
that I had been sent an action figure of Megatron instead.


## Translation

But what if there is no model in the language of my data? You can still try to translate the text. The Helsinki NLP team has provided over 1000 language pair models for translation. Here we load one that translates English to Japanese:

In [None]:
translator = pipeline("translation_en_to_ja", model="Helsinki-NLP/opus-tatoeba-en-ja")

Let's translate a text to Japanese:

In [None]:
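# use a separate variable so the customer `text` stays available for the question-answering example
translation_input = 'At the MLT workshop in Tokyo we gave an introduction about Transformers.'

In [None]:
outputs = translator(translation_input, clean_up_tokenization_spaces=True)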
print(wrapper.fill(outputs[0]['translation_text']))

東京のMLTワークショップで,トランスフォーマーについて紹介しました.


We can see that the text is clearly not perfectly translated, but the core meaning stays the same. Another cool application of translation models is data augmentation via backtranslation!
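Backtranslation can be sketched as translating a sentence into a pivot language and back, then keeping the result as a paraphrase. The two translator functions below are hypothetical lookup-table stand-ins so the sketch runs offline; in practice each would wrap a translation pipeline (for example `lambda s: translator(s)[0]['translation_text']`) plus a second model for the reverse direction, which this notebook does not load.

```python
def back_translate(text, to_pivot, from_pivot):
    """Translate text into a pivot language and back to obtain a paraphrase."""
    return from_pivot(to_pivot(text))

def augment(texts, to_pivot, from_pivot):
    """Return the originals plus any back-translations that differ from them."""
    augmented = list(texts)
    for text in texts:
        paraphrase = back_translate(text, to_pivot, from_pivot)
        if paraphrase != text:
            augmented.append(paraphrase)
    return augmented

# Hypothetical stand-in translators (lookup tables), not real models.
EN_TO_PIVOT = {"I want a refund.": "PIVOT:refund-request"}
PIVOT_TO_EN = {"PIVOT:refund-request": "I would like my money back."}

def fake_to_pivot(sentence):
    return EN_TO_PIVOT.get(sentence, sentence)

def fake_from_pivot(sentence):
    return PIVOT_TO_EN.get(sentence, sentence)

examples = augment(["I want a refund."], fake_to_pivot, fake_from_pivot)
print(examples)
```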

## Question-answering

We have now seen an example of text and token classification using transformers. However, there are more interesting tasks we can use transformers for. One of them is question-answering. In this task the model is given a question and a context and needs to find the answer to the question within the context. This problem can be rephrased into a classification problem: For each token the model needs to predict whether it is the start or the end of the answer. In the end we can extract the answer by looking at the span between the token with the highest start probability and highest end probability:

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/qa_arch.png?raw=1" alt="Question-answering architecture" title="Question answering" width=600>

You can imagine that this requires quite a bit of pre- and post-processing logic. Good thing that the pipeline takes care of all that!
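Part of that post-processing is choosing a valid span. Taking the best start and the best end independently can produce an end that lies before the start, so one common approach is to score every pair with start ≤ end and a bounded length. The sketch below uses made-up scores and a hypothetical `best_span` helper; it illustrates the idea rather than the pipeline's actual code.

```python
def best_span(start_scores, end_scores, max_answer_len=15):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = (None, None, float("-inf"))
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Hypothetical per-token scores for a 12-token input.
start_scores = [0.1, 0.0, 0.2, 1.0, 0.3, 0.0, 0.5, 2.0, 5.0, 1.5, 0.2, 0.4]
end_scores = [0.0, 0.1, 0.0, 6.0, 0.2, 0.1, 0.0, 0.3, 0.5, 1.0, 0.4, 4.5]

naive_start = max(range(len(start_scores)), key=start_scores.__getitem__)
naive_end = max(range(len(end_scores)), key=end_scores.__getitem__)
print("independent argmax:", naive_start, naive_end)  # end lies before start
print("best valid span:", best_span(start_scores, end_scores))
```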

In [None]:
qa_pipeline = pipeline("question-answering")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)



This default model is trained on the canonical [SQuAD dataset](https://huggingface.co/datasets/squad). Let's see if we can ask it what the customer wants:

In [None]:
question = "What does the customer want?"

outputs = qa_pipeline(question=question, context=text)
outputs

{'answer': 'an exchange of Megatron',
 'end': 358,
 'score': 0.6312924027442932,
 'start': 335}

Awesome, that sounds about right!

## Custom Model for Question-Answering

In this example we fine-tune ALBERT, a lightweight BERT variant, on the SQuAD 2.0 dataset for question-answering.

### Pre-Processing

First we are going to take the training data and transform it into a format that can be understood by the model. 

In [None]:
import json

def read_squad(file_name):
  """
  Read a SQuAD 2.0 JSON file and flatten it into parallel
  lists of contexts, questions, and answers
  """
  # open the JSON file and load it into a dictionary
  with open(file_name, 'rb') as file:
    squad2_dict = json.load(file)

  contexts = []
  questions = []
  answers = []
  # iterate through all articles in the SQuAD data
  for article in squad2_dict['data']:
    for passage in article['paragraphs']:
      context = passage['context']
      for qa in passage['qas']:
        question = qa['question']
        # check if we need to extract from 'answers' or 'plausible_answers'
        if 'plausible_answers' in qa:
          access = 'plausible_answers'
        else:
          access = 'answers'
        for answer in qa[access]:
          # append one entry per (context, question, answer) triple
          contexts.append(context)
          questions.append(question)
          answers.append(answer)
  # return the flattened lists once every article has been processed
  return contexts, questions, answers

In [None]:
train_path = '/content/drive/MyDrive/NLP_POC/train-v2.0.json'
# execute our read SQuAD function for training
train_contexts, train_questions, train_answers = read_squad(train_path)

Now that the context, questions, and answers are extracted we must find where the answers begin and end in the context.

In [None]:
def add_end_idx(answers, contexts):
    # loop through each answer-context pair
    for answer, context in zip(answers, contexts):
        # target_text is the answer we are looking for within context
        target_text = answer['text']
        # where the answer starts in context
        start_index = answer['answer_start']
        # where the answer should end
        end_index = start_index + len(target_text)

        # sometimes the answers are slightly shifted 
        if context[start_index:end_index] == target_text: 
            # if the end index is correct, we add to the dictionary
            answer['answer_end'] = end_index
        else:
            for n in range(1,4):
                if context[start_index-n:end_index-n] == target_text:
                    answer['answer_start'] = start_index - n
                    answer['answer_end'] = end_index - n
            

add_end_idx(train_answers, train_contexts)
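To make the offset check concrete, here is a small standalone sketch of the same idea on a toy sentence: compute the end offset from the start offset and the answer length, and if the slice does not match, try shifting the span a few characters to the left. `locate_answer` is a hypothetical helper written for this illustration; it returns the offsets instead of modifying the answer dict.

```python
def locate_answer(context, answer_text, answer_start, max_shift=3):
    """Return (start, end) character offsets of answer_text in context, or None."""
    for shift in range(max_shift + 1):
        start = answer_start - shift
        end = start + len(answer_text)
        if start >= 0 and context[start:end] == answer_text:
            return start, end
    return None

context = "Optimus Prime leads the Autobots."
# Correct annotation: 'Autobots' starts at character 24.
print(locate_answer(context, "Autobots", 24))
# An annotation that is two characters too late resolves to the same span.
print(locate_answer(context, "Autobots", 26))
# An annotation that is too far off cannot be repaired and yields None.
print(locate_answer(context, "Autobots", 5))
```

Returning `None` for unrepairable annotations makes it easy to count or drop them before training.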

In [4]:
!pip install transformers
from transformers import AlbertTokenizerFast
tokenizer = AlbertTokenizerFast.from_pretrained('albert-base-v2')


Next we tokenize the context-question pairs and convert the character-level start and end positions into token positions that the model can be trained on.

In [None]:
train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)

In [None]:
tokenizer.decode(train_encodings['input_ids'][250])

In [None]:
def add_token_positions(encodings, answers):
  """
  Converts the character-level start and end
  positions of each answer into token positions
  and stores them in the encodings
  """
  start_positions = []
  end_positions = []
  for i in range(len(answers)):
      # append start/end token position using char_to_token method
      start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
      end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))

      # if start position is None, the answer passage has been truncated
      if start_positions[-1] is None:
          start_positions[-1] = tokenizer.model_max_length
      # end position cannot be found, char_to_token found space, so shift position until found
      shift = 1
      while end_positions[-1] is None:
          end_positions[-1] = encodings.char_to_token(i, answers[i]['answer_end'] - shift)
          shift += 1
  # update our encodings object with the new token-based start/end positions
  encodings.update({'start_positions': start_positions, 'end_positions': end_positions})

# apply function to our data
add_token_positions(train_encodings, train_answers)
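The character-to-token mapping can be illustrated with a toy whitespace tokenizer that records each token's character offsets. This is only a sketch: the real tokenizer splits text into subword pieces, and both functions below are hypothetical re-implementations for illustration. It also shows why an exclusive end offset can land on a space and needs to be shifted back.

```python
def tokenize_with_offsets(text):
    """Whitespace tokenizer returning (token, start, end) triples."""
    tokens, position = [], 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        tokens.append((word, start, end))
        position = end
    return tokens

def char_to_token(tokens, char_index):
    """Index of the token covering char_index, or None (e.g. on a space)."""
    for i, (_, start, end) in enumerate(tokens):
        if start <= char_index < end:
            return i
    return None

context = "Optimus Prime leads the Autobots."
tokens = tokenize_with_offsets(context)
answer_start, answer_end = 8, 13  # character span of "Prime" (end is exclusive)

start_token = char_to_token(tokens, answer_start)
end_token = char_to_token(tokens, answer_end)  # points at the space after "Prime"
if end_token is None:
    # step back one character to land inside the last answer token
    end_token = char_to_token(tokens, answer_end - 1)
print(start_token, end_token)
```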

### Model Development

This section demonstrates how ALBERT was trained on the SQuAD dataset that we just preprocessed.

In [5]:
from transformers import AlbertTokenizer, AlbertForQuestionAnswering
model = AlbertForQuestionAnswering.from_pretrained('albert-base-v2')


Some weights of the model checkpoint at albert-base-v2 were not used when initializing AlbertForQuestionAnswering: ['predictions.LayerNorm.bias', 'predictions.bias', 'predictions.dense.weight', 'predictions.decoder.bias', 'predictions.decoder.weight', 'predictions.dense.bias', 'predictions.LayerNorm.weight']
- This IS expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing AlbertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of AlbertForQuestionAnswering were not initialized from the model checkpoint at albert-base-v2 and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN t

In [None]:
import torch

class SquadDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __getitem__(self, idx):
        return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}

    def __len__(self):
        return len(self.encodings.input_ids)

# build dataset for the training dataset
train_dataset = SquadDataset(train_encodings)

In [None]:
from torch.utils.data import DataLoader
# the AdamW class in transformers is deprecated in favor of the PyTorch one
from torch.optim import AdamW
from tqdm import tqdm

# setup GPU/CPU
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# move model over to detected device
model.to(device)
# activate training mode of model
model.train()
# initialize the AdamW optimizer (Adam with decoupled weight decay, which reduces
# the chance of overfitting; PyTorch's default weight_decay is 0.01)
optim = AdamW(model.parameters(), lr=5e-5)

# initialize data loader for training data
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for epoch in range(3):
    # set model to train mode
    model.train()
    # setup loop (we use tqdm for the progress bar)
    loop = tqdm(train_loader, leave=True)
    for batch in loop:
        # reset the gradients accumulated in the previous step
        optim.zero_grad()
        # pull all the tensor batches required for training
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        start_positions = batch['start_positions'].to(device)
        end_positions = batch['end_positions'].to(device)
        # train model on batch and return outputs (incl. loss)
        outputs = model(input_ids, attention_mask=attention_mask,
                        start_positions=start_positions,
                        end_positions=end_positions)
        # extract loss
        loss = outputs[0]
        # backpropagate to compute the gradient of the loss for every trainable parameter
        loss.backward()
        # update parameters
        optim.step()
        # print relevant info to progress bar
        loop.set_description(f'Epoch {epoch}')
        loop.set_postfix(loss=loss.item())

Epoch 0: 100%|██████████| 16290/16290 [2:06:57<00:00,  2.14it/s, loss=0.796]
Epoch 1: 100%|██████████| 16290/16290 [2:06:56<00:00,  2.14it/s, loss=1.16]
Epoch 2: 100%|██████████| 16290/16290 [2:06:56<00:00,  2.14it/s, loss=0.769]


In [None]:
model_path = '/content/drive/MyDrive/NLP_POC/models'
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

('/content/drive/MyDrive/NLP_POC/models/tokenizer_config.json',
 '/content/drive/MyDrive/NLP_POC/models/special_tokens_map.json',
 '/content/drive/MyDrive/NLP_POC/models/tokenizer.json')

### Post-Processing
This section demonstrates the steps taken during post-processing. We encode the question and context with the tokenizer and, once they have been run through the model, process the output into an answer. We will see how this can be done manually and with Hugging Face's pipeline.

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
# get new model and tokenizer from path
model = AlbertForQuestionAnswering.from_pretrained('/content/drive/MyDrive/NLP_POC/models')
tokenizer = AlbertTokenizerFast.from_pretrained('/content/drive/MyDrive/NLP_POC/models')


In [6]:
import string

In [None]:
def answer_question(question, answer_text):
  '''
  Takes a `question` string and an `answer_text` string (which contains the
  answer), identifies the words within the `answer_text` that are the
  answer, and returns them as a string.
  '''
  # get input ids from the tokenizer
  input_ids = tokenizer.encode(question, answer_text, truncation=True)
  # get segment ids
  sep_index, num_seg_a, num_seg_b, segment_ids = get_segment_ids(input_ids)
  # get outputs 
  outputs = model(torch.tensor([input_ids]), # the tokens representing our input text.
                  token_type_ids=torch.tensor([segment_ids]), # the segment IDs to differentiate question from answer_text
                  return_dict=True) 

  start_scores = outputs.start_logits
  end_scores = outputs.end_logits
  # construct answer
  answer = construct_answer(start_scores, end_scores, input_ids)
  return answer

In [None]:
def get_segment_ids(input_ids): 
  # find SEP token id
  sep_index = input_ids.index(tokenizer.sep_token_id)
  # the number of segment a and b tokens
  num_seg_a = sep_index + 1
  num_seg_b = len(input_ids) - num_seg_a
  # create segment ids: 0 for the question (up to and including the first separator token), 1 for the rest
  segment_ids = [0]*num_seg_a + [1]*num_seg_b
  assert len(segment_ids) == len(input_ids), "The length of segment ids and input ids must be equal"
  return sep_index, num_seg_a, num_seg_b, segment_ids

In [None]:
def construct_answer(start_scores, end_scores, input_ids):
  # find the tokens with the highest start and end scores
  answer_start = int(torch.argmax(start_scores))
  answer_end = int(torch.argmax(end_scores))
  tokens = tokenizer.convert_ids_to_tokens(input_ids)

  # ALBERT's SentencePiece tokenizer marks the first piece of each word with "▁"
  word_start = "▁"
  # start with the first answer token, dropping its word-start marker if present
  answer = tokens[answer_start].lstrip(word_start)

  # append the remaining answer tokens
  for i in range(answer_start + 1, answer_end + 1):
      if tokens[i].startswith(word_start):
          # a new word begins: add a space, then the token without its marker
          answer += " " + tokens[i][1:]
      else:
          # a continuation piece (subword, digit, or punctuation): attach it directly
          answer += tokens[i]
  return answer
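The joining rule for SentencePiece tokens can be checked in isolation on a short token list. The token sequences below are hypothetical examples of how ALBERT might split the text, and `join_pieces` is an illustrative helper rather than a library function.

```python
SPIECE_UNDERLINE = "▁"  # character SentencePiece uses to mark the start of a word

def join_pieces(tokens):
    """Concatenate SentencePiece tokens, turning each word-start marker into a space."""
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

print(join_pieces(["▁90", "▁days"]))
print(join_pieces(["▁app", "▁versions", "▁7", ".", "6"]))
```

The tokenizer's own `decode` method performs this kind of joining from token ids, which is a convenient alternative when the ids of the answer span are at hand.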

### Results

#### Question Answering about Erica

Context:  

"Erica leverages the latest technologies, in advanced analytics and cognitive messaging to serve as your trusted financial assistant. Erica is able to consider a range of data within Bank of America, like your cash flow, balances, transaction history and upcoming bills, to help you stay on top of your finances.Right now, Erica is exclusively available in the Mobile Banking app (app versions 7.6 and above). Just download the app today to get started!Erica is also planning to be available in Online Banking.or now, Erica is only available in English, but it is expected to learn Spanish.we keep a record of your conversations with Erica for quality assurance, to maintain an accurate account of your requests, identify opportunities to make Erica's responses more helpful and ensure Erica's performance is optimal. When you speak with Erica by voice, the discussions are recorded and saved for 90 days so they can be analyzed to help refine listening skills."

Questions:
1. What languages does Erica know?
2. Do I need the app to use Erica?
3. What does Erica use to work?
4. How long does Erica keep my conversations?

In [68]:
questions = ["What languages does Erica know?", "Do I need the app to use Erica?", "What does Erica use to work?","How long does Erica keep my conversations?"]
context = "Erica leverages the latest technologies, in advanced analytics and cognitive messaging to serve as your trusted financial assistant. Erica is able to consider a range of data within Bank of America, like your cash flow, balances, transaction history and upcoming bills, to help you stay on top of your finances.Right now, Erica is exclusively available in the Mobile Banking app (app versions 7.6 and above). Just download the app today to get started!Erica is also planning to be available in Online Banking.or now, Erica is only available in English, but it is expected to learn Spanish.we keep a record of your conversations with Erica for quality assurance, to maintain an accurate account of your requests, identify opportunities to make Erica's responses more helpful and ensure Erica's performance is optimal. When you speak with Erica by voice, the discussions are recorded and saved for 90 days so they can be analyzed to help refine listening skills."

In [16]:
from transformers import pipeline

In [28]:
qa_pipeline = pipeline(task="question-answering", model=model, tokenizer=tokenizer)
for question in questions:
  answer = qa_pipeline(question=question, context=context)['answer']
  print(f"Question: {question}\nAnswer: {answer}\n")

Question: What languages does Erica know?
Answer: English,

Question: Do I need the app to use Erica?
Answer: Just download the app today to get started!Erica

Question: What does Erica use to work?
Answer: advanced analytics and cognitive messaging to serve as your trusted financial assistant.

Question: How long does Erica keep my conversations?
Answer: 90 days



#### Question Answering using data from Form 10-K

The Form 10-K is an annual report that provides an overview of a company's business and financial condition. Today we are looking at Bed Bath & Beyond's Form 10-K from 2021.



#### Pre-Processing 10-K Form

In [98]:
file = '/content/drive/MyDrive/NLP_POC/bby-202110k.txt'

In [99]:
import re
def process_text(file_name):
  # collect only the necessary lines of text
  data = line_collection(file_name)
  # join the lines into one string and collect the ITEM headings
  text, toc = find_toc(data)
  # return the heading-to-content dictionary and the ordered list of headings
  return extract_text(text, toc)

In [100]:
def line_collection(file_name):
  data = []
  with open(file_name, 'r') as file:
    for line in file:  # read the file and drop unnecessary lines
      new_line = line.replace('\n', ' ')
      # skip lines that are obviously not needed
      if re.sub(r"\s+", "", line).lower() == "tableofcontents" or len(line) <= 3 or line.startswith("PART"):
        continue
      # short, purely alphabetic lines without a period are treated as headings within Items:
      # store them upper-cased and keep their newline so the text can be split on it later
      is_heading = (8 <= len(new_line) < 50 and new_line[0].isupper()
                    and "." not in new_line and re.sub(r"\s+", "", new_line).isalpha())
      if is_heading:
        data.append(line.upper())
      else:
        data.append(new_line)
  return data

In [101]:
def find_toc(data): 
  toc = {}
  # Adding names of headers to table of contents dictionary
  for line in data: 
    if line.startswith("ITEM") or line == 'SIGNATURES': 
      toc[line] = ""
  # Converting list to string
  text = "".join(data) 
  return(text, toc)

In [102]:
def extract_text(text, toc):
  items = list(toc.keys())
  # collect the text between consecutive headers and add it to the dictionary
  for i in range(1, len(items)):
    start = items[i-1]
    end = items[i]
    # escape the headings so characters such as "." or "(" are matched literally
    pattern = r'(?<=' + re.escape(start) + r').*(?=' + re.escape(end) + r')'
    toc[start] = re.search(pattern, text, re.S | re.M)[0]
  return toc, items
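The heading-based extraction can be tried on a miniature, made-up filing. The sketch below uses capture groups with a non-greedy match instead of the lookarounds above, which is a choice of this illustration; `split_by_headings` and the sample text are hypothetical.

```python
import re

def split_by_headings(text, headings):
    """Map each heading to the text between it and the next heading."""
    sections = {}
    for current, following in zip(headings, headings[1:]):
        pattern = re.escape(current) + r"(.*?)" + re.escape(following)
        match = re.search(pattern, text, flags=re.S)
        sections[current] = match.group(1).strip() if match else ""
    return sections

# Hypothetical miniature filing; a real 10-K is far longer.
text = (
    "ITEM 1. BUSINESS We sell home goods. "
    "ITEM 1A. RISK FACTORS Demand may fall. "
    "SIGNATURES"
)
headings = ["ITEM 1. BUSINESS", "ITEM 1A. RISK FACTORS", "SIGNATURES"]
sections = split_by_headings(text, headings)
print(sections)
```

A non-greedy match stops at the first occurrence of the next heading, which matters when a heading string appears more than once in the document.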

##### Function to split up text in Items by headings

In [103]:
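def fill_item_dict(arr_split):
  # avoid naming the variable `dict`, which would shadow the built-in type
  item_dict = {}
  for i in range(1, len(arr_split)):
    # the last all-caps phrase of the previous chunk is the heading of this chunk
    heading = re.findall(r'\b[A-Z]+(?:\s+[A-Z]+)*\b', arr_split[i-1])[-1]
    content = arr_split[i]
    item_dict[heading] = content
  return item_dict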


In [104]:
import pandas as pd
item7 = pd.read_csv('/content/drive/MyDrive/NLP_POC/bby_10k_item7')
item7 = item7.to_dict('dict')

In [105]:
# process 10-K data
file_name = '/content/drive/MyDrive/NLP_POC/bby-202110k.txt'
text_dict, toc_list = process_text(file_name)

##### Processing Item 7 of 10-K 

In [106]:
# string of all relevant item 7 content
item7 = text_dict[toc_list[7]]
# list of item 7 content split by new lines
item7_split = item7.split('\n')
# dictionary of all item 7 content organized by heading
item7_dict = fill_item_dict(item7_split)

#### Question Answering on Item 7 of 10-K

Context:  
"We are executing on a comprehensive plan to transform our business and position
us for long-term success under the leadership of our President and CEO Mark
Tritton, who joined the Company on November 4, 2019. Mr. Tritton has been
assessing our operations, portfolio, capabilities and culture and is developing
and implementing the initial stages of a strategic plan designed to re-establish
our leading position as the preferred omnichannel home destination, which is
grounded in five key pillars: product, price, promise, place and people. With
these five pillars as our framework, and a singular purpose to make it easy for
customers to feel at home, we are embracing a commitment to build and manage a
modern, durable omnichannel model. Early actions include the extensive
restructure of our leadership team. Interim leaders were appointed in
merchandising, marketing, digital, stores, operations, finance, legal and human
resources. During fiscal 2020, we announced the hiring of a new leadership team,
consisting of the following: On March 4, 2020, Joe Hartsig joined the Company as
Executive Vice President, Chief Merchandising Officer of the Company and
President of Harmon Stores Inc.; On May 4, 2020, Gustavo Arnal joined the
Company as Executive Vice President, Chief Financial Officer and Treasurer; On
May 11, 2020, Rafeh Masood joined the Company as Executive Vice President, Chief
Digital Officer; On May 11, 2020, Gregg Melnick assumed the role of Executive
Vice President, Chief Stores Officer. Previously, Mr. Melnick served as the
Company’s interim Chief Digital Officer; On May 18, 2020, John Hartmann joined
the Company as Chief Operating Officer of the Company and President, buybuy
BABY; On May 18, 2020, Arlene Hong joined the Company as Executive Vice
President, Chief Legal Officer and Corporate Secretary; On May 26, 2020, Cindy
Davis joined the Company as Executive Vice President, Chief Brand Officer of the
Company and President, Decorist; and On September 28, 2020, Lynda Markoe joined
the Company as Executive Vice President, Chief People and Culture Officer. As
discussed in "Overview" above, as part of our business transformation, we are
also pursuing deliberate actions as part of our restructuring program to drive
profit improvement over the next two-to-three years. We expect to reinvest a
portion of the expected cost savings into future growth initiatives."

Questions: 
1. Who is the president of the company?
2. What are the five key pillars of the strategic plan?
3. How does the company plan to grow?

In [108]:
context = item7_dict['TRANSFORMATION']
questions = ["Who is the president of the company?", "What are the five key pillars of the strategic plan?", 'How does the company plan to grow?']

In [109]:
qa_pipeline = pipeline(task="question-answering", model=model, tokenizer=tokenizer)
for question in questions:
  answer = qa_pipeline(question=question, context=context)['answer']
  print(f"Question: {question}\nAnswer: {answer}\n")

Question: Who is the president of the company?
Answer: Rafeh Masood

Question: What are the five key pillars of the strategic plan?
Answer: product, price, promise, place and people.

Question: How does the company plan to grow?
Answer: reinvest a portion of the expected cost savings into future growth initiatives.



# More pipelines

There are many more pipelines that you can experiment with. Look at the following list for an overview:

In [None]:
from transformers import pipelines
for task in pipelines.SUPPORTED_TASKS:
    print(task)

audio-classification
automatic-speech-recognition
feature-extraction
text-classification
token-classification
question-answering
table-question-answering
fill-mask
summarization
translation
text2text-generation
text-generation
zero-shot-classification
conversational
image-classification
object-detection


Transformers not only work for NLP but can also be applied to other modalities. Let's have a look at a few.

### Computer vision

Recently, transformer models have also entered computer vision. Check out the DETR model on the [Hub](https://huggingface.co/facebook/detr-resnet-101-dc5):

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/object_detection.png?raw=1" alt="Object detection example" title="Object detection" width=400>

### Audio

Another promising area is audio processing. Speech-to-text in particular has seen some promising advancements recently. See for example the [wav2vec2 model](https://huggingface.co/facebook/wav2vec2-base-960h):

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/speech2text.png?raw=1" alt="Speech-to-text example" title="Speech to text" width=400>

### Table QA

Finally, a lot of real-world data is still in the form of tables. Being able to query tables is very useful and with [TAPAS](https://huggingface.co/google/tapas-large-finetuned-wtq) you can do tabular question-answering:

<img src="https://github.com/huggingface/workshops/blob/main/machine-learning-tokyo/images/tapas.png?raw=1" alt="Table question-answering with TAPAS" title="Table QA" width=400>

# Cache

Whenever we load a new model from the Hub it is cached on the machine you are running on. If you run these examples on Colab this is not an issue since the session's storage is wiped when the session ends anyway. However, if you run this notebook on your laptop you might have just filled several GB of your hard drive. By default the cache is saved under `~/.cache/huggingface/` (in the `transformers` subfolder for older library versions, in `hub` for newer ones). Make sure to clear it from time to time if your hard drive starts to fill up.
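A quick way to see how much space the cache takes is to sum the file sizes below the cache folder. This is a minimal sketch using only the standard library; the path is an assumption based on the default location and `dir_size_bytes` is a hypothetical helper. Symlinks are skipped so that files referenced from several places are not counted twice.

```python
from pathlib import Path

def dir_size_bytes(path):
    """Total size in bytes of all regular files below path (0 if it is missing)."""
    root = Path(path).expanduser()
    if not root.exists():
        return 0
    return sum(
        f.stat().st_size
        for f in root.rglob("*")
        if f.is_file() and not f.is_symlink()
    )

# Assumed cache location; adjust if your installation uses a different folder.
cache_dir = "~/.cache/huggingface"
print(f"{dir_size_bytes(cache_dir) / 1e9:.2f} GB")
```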