# Configuring Notebook

1. Follow the "GETTING STARTED" steps in the README first
1. Click "Select Kernel" in the upper-right corner of this pane
1. Click "Python Environments..."
1. Click ".venv (Python 3.12.3)"
1. Click "Run All"
1. When prompted to install ipykernel, click "Install"

In [1]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


## First we create a project

A project houses all of the data for an experiment, including the raw data sources, the model, its weights and biases, and the context (question/answer pairs).

In [2]:
from fy_bot.project import create_project, delete_project, project_exists

PROJECT_NAME = "tax_examples"

# This notebook will start from scratch
# So, if the project already exists delete it
# and create a new project
if project_exists(PROJECT_NAME):
    delete_project(PROJECT_NAME)

create_project(PROJECT_NAME)

2024-06-18 23:31:47:INFO:Deleting project: green_ggs...
2024-06-18 23:31:47:INFO:Project deleted.
2024-06-18 23:31:47:INFO:Creating new project: green_ggs...
2024-06-18 23:31:47:INFO:Project Created.
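
As a rough illustration of the "delete if it exists, then create" pattern above, here is a minimal sketch that treats a project as a plain directory. The `sketch_*` helpers and the on-disk layout are assumptions made for illustration (only the "downloads" and "raw" folder names come from this notebook); the real `fy_bot.project` module may work differently.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical layout: one directory per project with "downloads" and "raw"
# subfolders. The real fy_bot.project module may organize things differently.
SUBFOLDERS = ("downloads", "raw")


def sketch_project_exists(root: Path, name: str) -> bool:
    """Return True if a directory for the project already exists."""
    return (root / name).is_dir()


def sketch_create_project(root: Path, name: str) -> Path:
    """Create the project directory and its subfolders."""
    project_dir = root / name
    for sub in SUBFOLDERS:
        (project_dir / sub).mkdir(parents=True, exist_ok=True)
    return project_dir


def sketch_delete_project(root: Path, name: str) -> None:
    """Remove the project directory and everything inside it."""
    shutil.rmtree(root / name)


def demo_fresh_project() -> list[str]:
    """Recreate a project from scratch in a temp dir and list its folders."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sketch_create_project(root, "tax_examples")
        # Same pattern as the cell above: delete first if it already exists.
        if sketch_project_exists(root, "tax_examples"):
            sketch_delete_project(root, "tax_examples")
        project_dir = sketch_create_project(root, "tax_examples")
        return sorted(p.name for p in project_dir.iterdir())
```

Calling `demo_fresh_project()` runs entirely inside a temporary directory, so it does not touch any real project data.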


## Next we add data sources to the project

We use `add_document` with a URL to download a document. The downloaded file is placed in the project's downloads folder and then scraped for raw text, which is saved in the project's raw folder.

In [3]:
from fy_bot.datasource import add_document, compile_corpus

# Adding a document to the project downloads the file
# and puts it in the "downloads" folder of the project.
# It then extracts the raw text from the downloaded file
# and saves it in the "raw" folder of the project.

# Tax publications
# https://www.irs.gov/forms-pubs/ebook

# Your Federal Income Tax (For Individuals)
add_document(PROJECT_NAME, "https://www.irs.gov/pub/ebook/p17.epub")

# Federal Income Tax Withholding Methods
add_document(PROJECT_NAME, "https://www.irs.gov/pub/ebook/p15t.epub")

# Armed Forces' Tax Guide
add_document(PROJECT_NAME, "https://www.irs.gov/pub/ebook/p3.epub")

2024-06-18 23:31:53:INFO:Adding document https://www.site.uottawa.ca/~lucia/courses/2131-02/A2/trythemsource.txt to project green_ggs
2024-06-18 23:31:54:INFO:File downloaded successfully...
2024-06-18 23:31:54:INFO:Raw text successfully extracted.
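
To give a feel for what "extracting raw text" from a downloaded EPUB can involve, here is a minimal standard-library sketch. It relies on the fact that an EPUB is a ZIP archive of (X)HTML files. The names `sketch_extract_epub_text` and `make_tiny_epub` are hypothetical, the download step is omitted, and this is not necessarily how `add_document` does it.

```python
import io
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text nodes, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs) -> None:
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag) -> None:
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data: str) -> None:
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def sketch_extract_epub_text(epub_bytes: bytes) -> str:
    """Return the text of every HTML/XHTML file in an EPUB archive.

    Files are visited in name order, not in the EPUB's declared reading order.
    """
    texts = []
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        for name in sorted(archive.namelist()):
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(archive.read(name).decode("utf-8", errors="ignore"))
                parser.close()
                texts.append(" ".join(parser.chunks))
    return "\n".join(texts)


def make_tiny_epub() -> bytes:
    """Build a minimal in-memory archive standing in for a downloaded file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/epub+zip")
        archive.writestr(
            "chapter1.xhtml",
            "<html><body><h1>Filing Status</h1><p>Choose one status.</p></body></html>",
        )
    return buffer.getvalue()
```

The tiny in-memory archive only exists so the sketch can be tried without network access; a real run would pass the bytes of the downloaded file instead.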


## Compile the corpus

Because a project can contain multiple documents, we must compile them into a single corpus. During compilation, all raw texts are aggregated and cleaned. Cleaning removes every sentence that is syntactically invalid: since the sources may be scraped PDFs, ebooks, transcripts, and so on, the raw text often contains nonsensical fragments. We discard those and keep only syntactically valid sentences. The logic for this check can be found in the `__is_syntactically_correct` method in fy_bot/datasource.py.

Additional cleaning converts all characters to lowercase, removes special characters, strips unnecessary whitespace, and so on. The logic for this cleaning can be found in the `compile_corpus` function in fy_bot/datasource.py.

In [4]:
# Compiling the corpus cleans all the raw data files and
# saves the aggregated file to corpus.txt in the project folder.
compile_corpus(PROJECT_NAME)

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\clayt\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
2024-06-18 23:31:54:INFO:Compiling corpus...
Cleaning text..: 100%|██████████| 1/1 [00:00<00:00,  1.58it/s]
Writing corpus to disk..: 100%|██████████| 74/74 [00:00<00:00, 73846.89it/s]
2024-06-18 23:31:55:INFO:Corpus compilation complete.
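
The cleaning steps described in this section can be sketched roughly as follows. This is a simplified stand-in, not the code in fy_bot/datasource.py: the `sketch_*` names are hypothetical, sentences are split with a regex rather than NLTK's punkt tokenizer (which the log output above suggests is involved), and the "looks like a sentence" heuristic is an assumption that is much cruder than a real syntactic check.

```python
import re


def sketch_clean_sentence(sentence: str) -> str:
    """Lowercase, drop special characters, and collapse whitespace."""
    lowered = sentence.lower()
    # Keep letters, digits, whitespace, and basic punctuation; drop the rest.
    no_specials = re.sub(r"[^a-z0-9\s.,?!']", "", lowered)
    return re.sub(r"\s+", " ", no_specials).strip()


def sketch_looks_like_sentence(sentence: str) -> bool:
    """Crude validity heuristic (an assumption, not a real syntax check):
    at least three words, ends in terminal punctuation, and at least 80%
    of the words contain a letter."""
    words = sentence.split()
    if len(words) < 3 or not sentence.endswith((".", "?", "!")):
        return False
    alphabetic = sum(1 for word in words if re.search(r"[a-z]", word))
    return alphabetic / len(words) >= 0.8


def sketch_compile(raw_texts: list[str]) -> list[str]:
    """Aggregate raw texts into a list of cleaned, plausible sentences."""
    corpus = []
    for text in raw_texts:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for piece in re.split(r"(?<=[.?!])\s+", text):
            cleaned = sketch_clean_sentence(piece)
            if sketch_looks_like_sentence(cleaned):
                corpus.append(cleaned)
    return corpus
```

For example, `sketch_compile(["Who must file? §§ ## 7 8 9."])` keeps the question and discards the scraped debris after it.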


## Create context

To create a chatbot, we need to train our model on context (i.e., question/answer dialog). Since all of our data sources are raw text rather than question/answer pairs, we must generate this context as a preprocessing step. Currently, we do this using a pretrained T5 model (https://huggingface.co/mrm8488/t5-base-finetuned-question-generation-ap). We can explore other options, but for now it does the job. The logic for the context generation can be found in fy_bot/context_generation.py. As part of this process, three files are created: context.txt, questions.txt, and answers.txt. context.txt holds the context in question/answer format, questions.txt holds only the questions, and answers.txt holds only the answers.

In [5]:
from fy_bot.context_generation import generate_context_question

questions = generate_context_question(PROJECT_NAME, device)

print(questions)

  from .autonotebook import tqdm as notebook_tqdm
2024-06-18 23:31:57:INFO:Compiling corpus...
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.
Generating context questions...: 100%|██████████| 74/74 [00:25<00:00,  2.87it/s]
Writing context...: 100%|██████████| 50/50 [00:00<?, ?it/s]

{'What do i not like about samiam?': 'i do not like that samiam!', 'Would you like green eggs and ham?': 'do would you like green eggs and ham?', 'Do you like themsamiam?': 'i do not like themsamiam.', 'What do i not like about green eggs and ham?': 'i do not like green eggs and ham!', 'Would you like them here or there?': 'would you like them here or there?', 'i would not like them here or there.': 'i would not like them here or there.', 'Where would i find them?': 'i would not like them anywhere.', "What do you think about samiam's comments?": 'i do not like them samiam.', 'Would you like them in a house?': 'would you like them in a house?', 'What would you like to do with a mouse?': 'would you like then with a mouse?', 'Do i like them in a house?': 'i do not like them in a house.', 'What do i not like with a mouse?': 'i do not like them with a mouse.', 'Do i like them here or there?': 'i do not like them here or there.', 'What do i not like about them?': 'i do not like them anywhere
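
To make the three output files concrete, here is a minimal sketch of writing a question-to-answer dictionary like the one printed above to context.txt, questions.txt, and answers.txt. The helper names and the file format (one item per line, tab-separated pairs in context.txt) are assumptions for illustration; the files produced by fy_bot may be laid out differently.

```python
import tempfile
from pathlib import Path


def sketch_write_context(project_dir: Path, qa_pairs: dict[str, str]) -> None:
    """Write context.txt, questions.txt, and answers.txt for a project.

    Assumed format: one item per line, with question and answer separated
    by a tab in context.txt. The real files may be laid out differently.
    """
    questions = list(qa_pairs.keys())
    answers = list(qa_pairs.values())
    context_lines = [f"{q}\t{a}" for q, a in qa_pairs.items()]

    (project_dir / "questions.txt").write_text("\n".join(questions), encoding="utf-8")
    (project_dir / "answers.txt").write_text("\n".join(answers), encoding="utf-8")
    (project_dir / "context.txt").write_text("\n".join(context_lines), encoding="utf-8")


def demo_write_context() -> dict[str, list[str]]:
    """Write two pairs taken from the output above and read the files back."""
    pairs = {
        "Would you like them in a house?": "would you like them in a house?",
        "Do i like them in a house?": "i do not like them in a house.",
    }
    with tempfile.TemporaryDirectory() as tmp:
        project_dir = Path(tmp)
        sketch_write_context(project_dir, pairs)
        return {
            name: (project_dir / name).read_text(encoding="utf-8").splitlines()
            for name in ("context.txt", "questions.txt", "answers.txt")
        }
```

Keeping questions and answers on matching line numbers means the two files can be read back in parallel and zipped into pairs again.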




In [6]:
from fy_bot.model import build_model

model = build_model()

Some weights of BertForQuestionAnswering were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [7]:
from fy_bot.model import get_tokenized_context

question_encodings, answer_encodings = get_tokenized_context(PROJECT_NAME)