# Collect data from SEC.gov form PDF Files



The objective of this project is to build a question-answering model using content extracted from forms available on sec.gov (U.S. Securities and Exchange Commission). While base GPT-3 models excel at answering questions when the answers are directly within the text, they often generate speculative answers when the information is not present in the given context.

To create a model that answers questions only when there's ample context, we begin by constructing a dataset of questions and answers derived from text paragraphs. To train the model to discern when answers are genuinely present, we introduce adversarial examples. These examples involve questions that do not align with the context. In such situations, we instruct the model to respond with "Insufficient context for answering the question."

Our project will unfold across three notebooks:

> **Data Collection (This Notebook):** In this initial notebook, we concentrate on gathering data that was not part of GPT-3's pre-training. We selected data from sec.gov and downloaded various PDF forms to compile our dataset. We then partitioned the content into smaller token-based segments, which will serve as the context for posing and addressing questions.



> **Question Asking and Answering:** In the second notebook, we will leverage the Davinci-instruct model to ask questions based on SEC forms and generate answers based on the provided context.

> **Adversarial Questions and Discriminator Model:** In the third notebook, we will employ the dataset containing context, questions, and answers to create adversarial pairs of questions and context where the question was not originally generated from that context. In these scenarios, the model will be prompted to respond with "Insufficient context for answering the question." Additionally, we will train a discriminator model capable of predicting whether a question can be answered based on the provided context. This step will further enhance the model's ability to discern suitable responses based on context.
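The adversarial pairing described for the third notebook can be sketched as follows. This is only an illustration using the standard library: the function name `make_adversarial_pairs`, the dictionary keys `context`, `question`, and `answer`, and the toy rows are all hypothetical, and the third notebook may build its pairs differently.

```python
import random

INSUFFICIENT = "Insufficient context for answering the question."

def make_adversarial_pairs(rows, seed=0):
    """Pair each question with a context it was NOT generated from.

    `rows` is a list of dicts with the hypothetical keys "context",
    "question" and "answer".
    """
    rng = random.Random(seed)  # fixed seed keeps the pairing reproducible
    adversarial = []
    for i, row in enumerate(rows):
        # Candidate contexts: every other row whose context text differs.
        others = [
            r["context"]
            for j, r in enumerate(rows)
            if j != i and r["context"] != row["context"]
        ]
        if not others:
            continue  # nothing to mismatch against
        adversarial.append({
            "context": rng.choice(others),
            "question": row["question"],
            "answer": INSUFFICIENT,
        })
    return adversarial

# Toy rows standing in for the context/question/answer dataset.
rows = [
    {"context": "Context A", "question": "Question about A?", "answer": "Answer A"},
    {"context": "Context B", "question": "Question about B?", "answer": "Answer B"},
    {"context": "Context C", "question": "Question about C?", "answer": "Answer C"},
]

for pair in make_adversarial_pairs(rows):
    print(pair["question"], "|", pair["context"], "|", pair["answer"])
```

Random mismatching is a simplification: a question may occasionally still be answerable from the context it is paired with, so such pairs are noisy negatives rather than guaranteed ones.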

## Installing Dependencies

In [None]:
 !pip install pypdf

Collecting pypdf
  Downloading pypdf-3.16.4-py3-none-any.whl (276 kB)
Installing collected packages: pypdf
Successfully installed pypdf-3.16.4


In [8]:
 !pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Co

## Data Parsing using PyPDF

The pypdf library lets us read, manipulate, and write PDF documents in Python. One of its key features is the `PdfReader` class, which parses a PDF file and gives access to its pages, text, images, and metadata. In this notebook we use `PdfReader` to extract the text content of each form.

In [None]:
from pypdf import PdfReader
import requests
from io import BytesIO

def fetch_pdf_from_url(url):
    # Send an HTTP GET request to the provided URL.
    response = requests.get(url)

    # Check if the request was successful (status code 200).
    if response.status_code != 200:
        raise Exception("Failed to fetch the PDF.")

    # Return the content of the PDF as a BytesIO object.
    return BytesIO(response.content)

def parse_pdf_with_pypdf(pdf_file):
    # Create a PdfReader object to read the PDF.
    # pdf_file can be a file path or a file-like object such as BytesIO.
    reader = PdfReader(pdf_file)

    # Initialize an empty string to store the extracted text.
    text = ''

    # Iterate through each page in the PDF and extract its text.
    for page in reader.pages:
        text = text + page.extract_text()

    # Return the concatenated text extracted from all pages.
    return text

In [None]:
# Prompt the user to enter a PDF file URL for processing and store it in the pdf_url variable.
pdf_url = input('Enter the file URL for processing: ')

Enter the file URL for processing: https://www.sec.gov/files/form1-n.pdf


In [None]:
# Use 'fetch_pdf_from_url' to fetch the PDF content from the provided URL and store in memory
pdf_in_memory = fetch_pdf_from_url(pdf_url)
# Use 'parse_pdf_with_pypdf' to parse the content of the PDF and store the parsed content in a string variable.
pdf_content = parse_pdf_with_pypdf(pdf_in_memory)
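`fetch_pdf_from_url` checks only the HTTP status code, and a server can return a successful response whose body is not a PDF (for example, an HTML notice page). A minimal sketch of a check that could run before parsing is shown below. The helper name `looks_like_pdf` is hypothetical, and the check is deliberately strict: it assumes the file begins with the `%PDF-` header marker, whereas some PDF readers tolerate a few leading bytes before it.

```python
from io import BytesIO

PDF_MAGIC = b"%PDF-"  # header marker at the start of a PDF file

def looks_like_pdf(buffer: BytesIO) -> bool:
    """Return True if the buffer starts with the PDF header marker."""
    start = buffer.tell()
    header = buffer.read(len(PDF_MAGIC))
    buffer.seek(start)  # restore the read position for the parser
    return header == PDF_MAGIC

print(looks_like_pdf(BytesIO(b"%PDF-1.7\n")))               # True
print(looks_like_pdf(BytesIO(b"<html>Forbidden</html>")))   # False
```

In this notebook the check could be applied to `pdf_in_memory` before it is passed to `parse_pdf_with_pypdf`, raising an error when it returns `False`.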

## Splitting the data and storing in a dataframe

The parsed PDF content is split on line breaks, and consecutive lines are then concatenated back together, up to a maximum token limit, to create the context passages for question answering. The final content is stored in a DataFrame along with the number of tokens in each passage.
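The split-then-merge idea can be illustrated on a toy input. The sketch below is a simplified stand-in for the notebook's functions: `pack_lines` is a hypothetical name, it returns a list of tuples rather than a DataFrame, and it counts whitespace-separated words instead of GPT-2 tokens so that it runs without any third-party packages.

```python
def pack_lines(lines, max_tokens, count_tokens=lambda s: len(s.split())):
    """Greedily merge consecutive lines into chunks of at most max_tokens.

    count_tokens defaults to a whitespace word count, a stand-in for the
    GPT-2 tokenizer used in this notebook.
    """
    chunks = []
    current, current_tokens = [], 0
    for line in lines:
        line = line.strip()
        n = count_tokens(line)
        # Start a new chunk when this line would push the total past the limit.
        if current and current_tokens + n > max_tokens:
            chunks.append((current_tokens, " ".join(current)))
            current, current_tokens = [], 0
        current.append(line)
        current_tokens += n
    if current:
        chunks.append((current_tokens, " ".join(current)))
    return chunks

text = "one two\nthree four five\nsix\nseven eight nine ten eleven twelve"
for n, chunk in pack_lines(text.split("\n"), max_tokens=5):
    print(n, chunk)
# 5 one two three four five
# 1 six
# 6 seven eight nine ten eleven twelve
```

The last chunk shows a property of this greedy approach: a single line that is already longer than the limit is kept as its own oversized chunk rather than being split.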

In [64]:
from pandas import DataFrame
import pandas as pd
from transformers import GPT2TokenizerFast

# Initialize the GPT-2 tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    """Count the number of tokens in a string"""
    return len(tokenizer.encode(text, add_special_tokens=False))

def split_context(text: str) -> DataFrame:
    """
    Take a text string as input and return a DataFrame containing
    the token count and content of each paragraph.
    """
    # Initialize a dictionary to store data for the DataFrame
    data = {
        "num_tokens": [],  # List to store the token count for each paragraph
        "content": []      # List to store the content of each paragraph
    }

    # Split the input text into paragraphs using '\n' as the delimiter
    paragraphs = text.split('\n')

    # Iterate through each paragraph in the input text
    for paragraph in paragraphs:
        # Calculate the token count for the current paragraph using count_tokens
        token_count = count_tokens(paragraph)

        # Append the token count and the content of the current paragraph to the data dictionary
        data["num_tokens"].append(token_count)
        data["content"].append(paragraph.strip())

    # Create the DataFrame once, after all paragraphs have been processed
    return pd.DataFrame(data)

def create_dataframe(df: DataFrame, max_tokens: int) -> DataFrame:
    """
    Greedily merge consecutive rows of df into chunks of at most max_tokens.
    A single row that is already longer than max_tokens is kept as its own chunk.
    """
    # Start the first chunk with the first row
    current_content = df.at[0, 'content']
    current_num_tokens = df.at[0, 'num_tokens']

    # Initialize lists to store the updated data
    new_num_tokens = []
    new_content = []

    # Iterate through the remaining rows of the DataFrame
    for i in range(1, len(df)):
        num_tokens = df.at[i, 'num_tokens']
        content = df.at[i, 'content']

        # Check whether adding the current row would exceed the max token limit
        if current_num_tokens + num_tokens <= max_tokens:
            current_content += ' ' + content
            current_num_tokens += num_tokens
        else:
            # Save the concatenated content and num_tokens for the finished chunk
            new_num_tokens.append(current_num_tokens)
            new_content.append(current_content)

            # Start a new chunk with the current row
            current_content = content
            current_num_tokens = num_tokens

    # Append the last chunk
    new_num_tokens.append(current_num_tokens)
    new_content.append(current_content)

    # Create a new DataFrame with the updated values.
    # num_tokens is the sum of the per-line counts, so it approximates the
    # token count of the joined content.
    new_df = pd.DataFrame({'num_tokens': new_num_tokens, 'content': new_content})

    return new_df



In [69]:
# Split the PDF content into smaller chunks, creating a DataFrame
df_split = split_context(pdf_content)

# Set a maximum number of tokens per chunk
max_tokens = 1000

# Create a final DataFrame where content is grouped into chunks of up to max_tokens
df_final = create_dataframe(df_split, max_tokens)

# Display the first few rows of the final DataFrame
df_final.head()

|   | num_tokens | content |
|---|---|---|
| 0 | 993 | OMB APPROVAL OMB Number: 3235-0554 Expires: F... |
| 1 | 997 | This collection of information has been review... |
| 2 | 986 | The exchange consents that service of any civi... |
| 3 | 900 | Exhibit D Describe the manner of operation of ... |


The final DataFrame above contains each content chunk together with its token count. We will use these chunks as context for question answering with OpenAI models. We store this DataFrame as a CSV file in Google Drive so that it can be used in the next notebook.

In [72]:
# Save the final DataFrame to a CSV file on Google Drive
df_final.to_csv("/content/drive/MyDrive/DAMG7245/pdf_content_openai.csv", index=False)