This Colab uses the [BC3: British Columbia Conversation Corpora](https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/bc3.html) to build a training dataset for Google Cloud Vertex AI Entity Extraction. The dataset is used to train an email signature extraction model.

## Load Python Modules
First, let's load the packages needed to parse the XML file provided by the University of British Columbia.


In [3]:
! pip install bs4 lxml



In [36]:
from bs4 import BeautifulSoup
import lxml
import html
import pandas as pd
import random
import re
import json

## Download Email Data

In [5]:
! wget https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/bc3/bc3.1.0.zip

--2021-09-03 22:06:59--  https://www.cs.ubc.ca/cs-research/lci/research-groups/natural-language-processing/bc3/bc3.1.0.zip
Resolving www.cs.ubc.ca (www.cs.ubc.ca)... 142.103.6.5
Connecting to www.cs.ubc.ca (www.cs.ubc.ca)|142.103.6.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124299 (121K) [application/zip]
Saving to: ‘bc3.1.0.zip’


2021-09-03 22:07:00 (519 KB/s) - ‘bc3.1.0.zip’ saved [124299/124299]



In [6]:
! unzip bc3.1.0.zip

Archive:  bc3.1.0.zip
  inflating: annotation.dtd          
  inflating: annotation.xml          
  inflating: corpus.dtd              
  inflating: corpus.xml              
  inflating: README.txt              


## Load Email Data from XML

In [10]:
with open("./corpus.xml", "r") as file:
  soup = BeautifulSoup(file, "lxml")

In [24]:
# Regex that removes the quoted prior thread: everything from "On ... wrote: " to the end of the message
regex = re.compile(r'On .* wrote: .*', flags=re.DOTALL)
email_soup = soup.find_all('text')
emails = []

for email in email_soup:
  # email_text = email.text.replace("\n"," ") # remove new lines
  email_text = regex.sub('', email.text) # remove prior thread
  emails.append(html.unescape(email_text)) # decode HTML entities such as &amp;
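The thread-stripping regex is worth checking on a small input before trusting it on the whole corpus. The sketch below applies the same pattern to three hypothetical messages (the names and text are made up, not taken from BC3). It suggests two things to watch for: because `.*` is greedy and `re.DOTALL` lets it span lines, an earlier "On " in the body also starts the cut, and a reply header with no space after "wrote:" is not matched at all.

```python
import re

# Same pattern as in the cell above.
quoted_thread = re.compile(r'On .* wrote: .*', flags=re.DOTALL)

# Hypothetical messages for illustration only (not from the BC3 corpus).
reply = (
    "Thanks, that works for me.\n"
    "Jane Doe\n"
    "Example Corp\n"
    "On Mon, Jan 1, 2001, John Smith wrote: Can we meet on Tuesday?\n"
)
# The body itself starts with "On ", so the match begins at the first character.
early_on = (
    "On Friday I am out of office.\n"
    "Jane Doe\n"
    "On Mon, Jan 1, 2001, John Smith wrote: Can we meet?\n"
)
# "wrote:" is followed by a newline instead of a space, so the pattern does not match.
newline_after_wrote = (
    "Sounds good.\n"
    "Jane Doe\n"
    "On Mon, Jan 1, 2001, John Smith wrote:\n"
    "Can we meet?\n"
)

cleaned_reply = quoted_thread.sub('', reply)
cleaned_early_on = quoted_thread.sub('', early_on)
cleaned_newline = quoted_thread.sub('', newline_after_wrote)

print(repr(cleaned_reply))     # body and signature kept, quoted thread removed
print(repr(cleaned_early_on))  # empty string: the whole message was removed
print(cleaned_newline == newline_after_wrote)  # True: nothing was removed
```

If either edge case matters for your data, spot-check a few entries of `emails` by eye before exporting.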

## Write Email Data to JSONL
Vertex AI can import a JSONL file directly as training data, and it also handles the train/test split automatically.

In [39]:
with open('./bc3_emails.jsonl', 'w') as outfile:
  for e in emails:
    # One JSON object per line, with the email body under "textContent"
    json.dump({"textContent": e}, outfile)
    outfile.write('\n')
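Before uploading, it can help to read the file back and confirm that every line is a standalone JSON object with a string `textContent` field. The helper below is a minimal sketch: `count_jsonl_records` is a hypothetical name introduced here, and the demo writes its own throwaway file with made-up records so the cell runs on its own.

```python
import json
import os
import tempfile

def count_jsonl_records(path, required_key="textContent"):
    """Count records in a JSONL file, raising if a line lacks a string `required_key`."""
    count = 0
    with open(path, "r", encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
            if not isinstance(record, dict) or not isinstance(record.get(required_key), str):
                raise ValueError(f"line {line_number}: missing string field {required_key!r}")
            count += 1
    return count

# Demo on a temporary file with hypothetical records (not corpus data).
demo_records = [{"textContent": "Hello,\nJane"}, {"textContent": "Thanks!\nJohn"}]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl")
    with open(demo_path, "w", encoding="utf-8") as outfile:
        for record in demo_records:
            json.dump(record, outfile)  # newlines inside strings are escaped as \n
            outfile.write("\n")
    demo_count = count_jsonl_records(demo_path)

print(demo_count)
```

In this notebook, `count_jsonl_records('./bc3_emails.jsonl')` should return the same value as `len(emails)`.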