# Reshaped Case Notebook - Luyang Busser

## Setting up the environment

In [11]:
import os
from dotenv import load_dotenv, find_dotenv

# Load the environment variables from the .env file
load_dotenv('environment.env')

# Access the environment variables
OPENAI_API_VERSION = os.getenv("OPENAI_API_VERSION")
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Example usage of the loaded environment variables
print("API Version:", OPENAI_API_VERSION)
print("Endpoint URL:", AZURE_OPENAI_ENDPOINT)
# Avoid printing sensitive information like API keys.


API Version: 2023-09-15-preview
Endpoint URL: https://oai-potential-hires.openai.azure.com/
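
Before creating a client, it can help to fail early when a variable is missing and to mask keys if they ever need to be shown. The sketch below is only an illustration: `REQUIRED_VARS`, `missing_vars`, and `mask` are hypothetical names that are not part of the notebook, and the helper accepts any mapping so that `os.environ` can be passed in after `load_dotenv` has run.

```python
# Hypothetical helpers, not part of the original notebook.
REQUIRED_VARS = (
    "OPENAI_API_VERSION",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
)


def missing_vars(env, required=REQUIRED_VARS):
    """Return the required names that are absent or empty in `env`."""
    return [name for name in required if not env.get(name)]


def mask(secret, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example with a plain dict standing in for os.environ.
example_env = {"OPENAI_API_VERSION": "2023-09-15-preview", "AZURE_OPENAI_ENDPOINT": ""}
print("Missing:", missing_vars(example_env))
print("Masked key:", mask("abcdefgh"))
```

In the notebook this could be called as `missing_vars(os.environ)` right after the variables are loaded.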


## Chatbot: Azure OpenAI

The following block of code connects to the Azure OpenAI API. We are curious to see how you will solve the case. Good luck!

In [2]:
from openai import AzureOpenAI

# Setting up the Azure OpenAI client with required credentials and endpoint
client = AzureOpenAI(
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    api_key=AZURE_OPENAI_API_KEY,
    api_version=OPENAI_API_VERSION
)

# Define a prompt for the OpenAI Chat API
prompt = """
Define Generative AI
"""

# Set a system and user message
messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
]

# Request a completion from the OpenAI Chat API using the client
completion = client.chat.completions.create(
    model="gpt35PotentialHires", 
    messages=messages,
    temperature=0.7,
    max_tokens=800,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

# Print the response
print(completion.choices[0].message.content)

Generative AI refers to a type of artificial intelligence that is capable of creating new content or data that has not been explicitly programmed or inputted by humans. It uses advanced algorithms and deep learning techniques to analyze and learn from existing data, and then generate new data that is similar or related to the original data. This can include things like images, music, text, videos, and even entire virtual environments. Generative AI is often used in creative fields such as art, music, and design, but also has practical applications in areas such as medicine, finance, and engineering.



## Upload pdf file section (test)

In [None]:
import io
import ipywidgets as widgets
from IPython.display import display
from langchain.document_loaders import PyPDFLoader, PyMuPDFLoader

GLOBAL_LOADER = None

# Create the file upload widget
uploader = widgets.FileUpload(
    accept='.pdf',  # Accept only PDF files
    multiple=False,  # Allow uploading only one file
    description='Upload PDF'
)

def process_uploaded_pdf(change):
    """Load the PDF selected in the upload widget using PyMuPDFLoader."""
    global GLOBAL_LOADER
    if uploader.value:
        # Get the uploaded file details
        uploaded_file = uploader.value[0]
        filename = uploaded_file['name']

        # Load and process the PDF with PyMuPDFLoader.
        # The loader reads from a path, so the file must exist in the working directory.
        loader = PyMuPDFLoader(filename)
        pages = loader.load()

        GLOBAL_LOADER = loader
        # Display basic information about the PDF
        num_pages = len(pages)
        first_page_sample = pages[0].page_content[:500] if pages else "No content found."

        # Output the results
        print(f"Uploaded file: {filename}")
        print(f"Number of pages: {num_pages}")
        print("----------------------")
        print("Uploading complete!")
        
# Attach the file processing function to the upload event
uploader.observe(process_uploaded_pdf, names='value')

# Display the upload widget
display(uploader)

## Recursive splitter

More flexible when it comes to answering deeper questions. It also handles structural information in the documents, such as a change from table format to text format.

In [16]:
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter


r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
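    # Note: in recent LangChain versions the lookbehind pattern below is matched
    # literally unless is_separator_regex=True is passed.
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]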
)
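
To make the idea behind recursive splitting concrete, here is a simplified pure-Python sketch. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, there is no chunk overlap, and separators are plain strings. It tries the coarsest separator first, greedily merges pieces up to the size limit, and only falls back to finer separators for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators):
    """Split `text` into chunks of at most `chunk_size` characters.

    Simplified sketch: no overlap, plain-string separators only.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while it still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Intro paragraph.\n\n"
    "Second paragraph that is a bit longer than the limit.\n"
    "Short line."
)
demo_chunks = recursive_split(sample, 30, ["\n\n", "\n", " ", ""])
for chunk in demo_chunks:
    print(repr(chunk))
```

Paragraph and line boundaries are respected where possible, and only the overly long sentence is broken at word level.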


In [17]:
GLOBAL_LOADER = PyMuPDFLoader("Microsoft_2023_Trimmed.pdf")
text = GLOBAL_LOADER.load()
splits = r_splitter.split_documents(text)
print(f"Number of document chunks: {len(splits)}")
print(splits[0])

Number of document chunks: 380
page_content='UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\n \nFORM 10-K\n \n \n☒\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 \n \n \nFor the Fiscal Year Ended June 30, 2023\n \n \n \nOR\n \n \n☐\nTRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n \n \n \nFor the Transition Period From                  to\nCommission File Number 001-37845 \n \nMICROSOFT CORPORATION\n \n \nWASHINGTON\n \n91-1144442\n(STATE OF INCORPORATION)' metadata={'source': 'Microsoft_2023_Trimmed.pdf', 'file_path': 'Microsoft_2023_Trimmed.pdf', 'page': 0, 'total_pages': 39, 'format': 'PDF 1.4', 'title': 'Form 10-K for Microsoft Corp filed 07/27/2023', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.5', 'producer': 'macOS Version 14.2.1 (Build 23C71) Quartz PDFContext, AppendMode 1.1', 'creationDate': "D:20240416090710Z00'00'", 'modDate': "D:202404161

## Token splitter

This might be more compatible with the underlying LLM and its inherent token context length. In the end it is less flexible, though, and I do not know the exact limits anyway. So I would rather have the benefits of a flexible splitter that maintains the structural hierarchy than the potential compatibility of this one.
For a real-world application, when I know which model we are working with, I might be able to combine the best of both worlds and apply a hybrid approach.

In [112]:
text_splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=5)
docs = text_splitter.split_documents(text)
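
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a small sliding-window sketch. This is an assumption-laden stand-in: `token_chunks` is a hypothetical helper, and a list of integers is used in place of real model tokens, which the actual splitter obtains from a tokenizer.

```python
def token_chunks(tokens, chunk_size, chunk_overlap):
    """Return windows of `chunk_size` tokens, each sharing
    `chunk_overlap` tokens with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Integers stand in for token ids here.
windows = token_chunks(list(range(12)), chunk_size=5, chunk_overlap=1)
for window in windows:
    print(window)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a small overlap keeps neighbouring chunks connected without duplicating much text.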

## Embedding pre-process

### Open AI embedding (test)

In [55]:
from langchain_openai import AzureOpenAIEmbeddings
from langchain.vectorstores import Chroma
import openai

# Encode text using Azure OpenAI embeddings. This fails because the deployment
# is a chat model (gpt-35-turbo), not an embedding model; see the error below.
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="gpt35PotentialHires",
    openai_api_version=OPENAI_API_VERSION,)
vector_store_dir = "data/vector_store"

#Store them into Chroma vector store
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=vector_store_dir
)

BadRequestError: Error code: 400 - {'error': {'code': 'OperationNotSupported', 'message': 'The embeddings operation does not work with the specified model, gpt-35-turbo. Please choose different model and try again. You can learn more about which models can be used with each operation here: https://go.microsoft.com/fwlink/?linkid=2197993.'}}

### Using Sentence-BERT instead and Chroma

As shown above, the way I tried to get Azure OpenAI embeddings did not work. So instead I started looking at free, open-source embedding models on Hugging Face. I have worked with BERT models before, and Sentence-BERT made a lot of sense.

Sentence-BERT is designed to capture the semantic meaning of sentences efficiently and to allow direct comparison and similarity search without extra computational overhead.

In [9]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
persist_dir = "data/vector_store"
model_kwargs = {'device': 'cpu'}
embeddings = HuggingFaceEmbeddings(model_name = 'all-MiniLM-L6-v2', model_kwargs=model_kwargs) #SBERT embeddings
# Create a vector store
vector_store = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory=persist_dir
)
vector_store.persist()
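
What "direct comparison" of sentence embeddings means can be sketched in a few lines of plain Python. The vectors below are made-up two-dimensional toy values, not real Sentence-BERT output, and `cosine_similarity` and `top_k` are hypothetical helper names. Cosine similarity is used for illustration only; the vector store itself may rank with a different distance metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy vectors standing in for embedded chunks.
toy_docs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
toy_query = [1.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))
```

A real store does the same ranking over a few hundred dimensions per chunk, usually with an index that avoids comparing against every vector.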

In [10]:
print("Number of vectors stored:", vector_store._collection.count())

#Test query to see if results of vector store are sound.
q = "What is the employee count?"

vector_store.similarity_search(q, k=3)

Number of vectors stored: 380


[Document(page_content='PART I\nItem 1\n \nAs of June 30, 2023, we employed approximately 221,000 people on a full-time basis, 120,000 in the U.S. and 101,000 internationally. Of the total \nemployed people, 89,000 were in operations, including manufacturing, distribution, product support, and consulting services; 72,000 were in product \nresearch and development; 45,000 were in sales and marketing; and 15,000 were in general and administration. Certain employees are subject to \ncollective bargaining agreements.\nOur Culture', metadata={'author': '', 'creationDate': "D:20240416090710Z00'00'", 'creator': 'wkhtmltopdf 0.12.5', 'file_path': 'Microsoft_2023_Trimmed.pdf', 'format': 'PDF 1.4', 'keywords': '', 'modDate': "D:20240416160138Z00'00'", 'page': 7, 'producer': 'macOS Version 14.2.1 (Build 23C71) Quartz PDFContext, AppendMode 1.1', 'source': 'Microsoft_2023_Trimmed.pdf', 'subject': '', 'title': 'Form 10-K for Microsoft Corp filed 07/27/2023', 'total_pages': 39, 'trapped': ''}),
 Doc

### Sentence BERT and FAISS

FAISS is an alternative vector store to Chroma. I have worked with FAISS before: it is a lightweight package that works well across the board, has great tools integrated, and is GPU-optimised. I decided to go with Chroma, though, since it is new to me and I wanted to see its performance and capabilities.

In [19]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
persist_dir = "data/vector_store/FAISS"
model_kwargs = {'device': 'cpu'}

embeddings = HuggingFaceEmbeddings(model_name = 'all-MiniLM-L6-v2', model_kwargs= model_kwargs)
vector_store = FAISS.from_documents(documents= splits, embedding=embeddings)

In [22]:
print("Number of vectors stored:", vector_store.index.ntotal)

#Test query to see if results of vector store are sound.
q = "What is the employee count?"

vector_store.similarity_search(q, k=3)

Number of vectors stored: 380


[Document(page_content='PART I\nItem 1\n \nAs of June 30, 2023, we employed approximately 221,000 people on a full-time basis, 120,000 in the U.S. and 101,000 internationally. Of the total \nemployed people, 89,000 were in operations, including manufacturing, distribution, product support, and consulting services; 72,000 were in product \nresearch and development; 45,000 were in sales and marketing; and 15,000 were in general and administration. Certain employees are subject to \ncollective bargaining agreements.\nOur Culture', metadata={'source': 'Microsoft_2023_Trimmed.pdf', 'file_path': 'Microsoft_2023_Trimmed.pdf', 'page': 7, 'total_pages': 39, 'format': 'PDF 1.4', 'title': 'Form 10-K for Microsoft Corp filed 07/27/2023', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.5', 'producer': 'macOS Version 14.2.1 (Build 23C71) Quartz PDFContext, AppendMode 1.1', 'creationDate': "D:20240416090710Z00'00'", 'modDate': "D:20240416160138Z00'00'", 'trapped': ''}),
 Doc

## Retrieval of embeddings

In [23]:
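#Define the LLM to use (the LangChain wrapper, not the openai client class imported earlier)
from langchain_openai import AzureOpenAI
llm = AzureOpenAI(deployment_name = 'gpt35PotentialHires')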
#Example prompt
user_prompt = "What company is described in the document and what is their business model?"
#Obtain the top 5 most relevant and then return the top 2 from that pool based on MMR.
vector_store.max_marginal_relevance_search(user_prompt, k=2, fetch_k=5)


[Document(page_content='DOCUMENTS INCORPORATED BY REFERENCE\nPortions of the definitive Proxy Statement to be delivered to shareholders in connection with the Annual Meeting of Shareholders to be held on December 7, 2023 are incorporated by \nreference into Part III.', metadata={'source': 'Microsoft_2023_Trimmed.pdf', 'file_path': 'Microsoft_2023_Trimmed.pdf', 'page': 0, 'total_pages': 39, 'format': 'PDF 1.4', 'title': 'Form 10-K for Microsoft Corp filed 07/27/2023', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.5', 'producer': 'macOS Version 14.2.1 (Build 23C71) Quartz PDFContext, AppendMode 1.1', 'creationDate': "D:20240416090710Z00'00'", 'modDate': "D:20240416160138Z00'00'", 'trapped': ''}),
 Document(page_content='PART I\nItem 1A\n \nBusiness model competition\nCompanies compete with us based on a growing variety of business models.\n•Even as we transition more of our business to infrastructure-, platform-, and software-as-a-service business model, the l
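
The MMR step described in the cell above (fetch the most relevant candidates, then pick a diverse subset) can be sketched as follows. This is a minimal illustration with made-up toy vectors; the function names and the default `lambda_mult=0.5` are assumptions for this sketch and the library's internals may differ in detail.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, doc_vecs, k=2, fetch_k=5, lambda_mult=0.5):
    """Pick k indices that balance relevance to the query against
    redundancy with what has already been selected."""
    # Step 1: keep the fetch_k candidates most similar to the query.
    by_relevance = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = by_relevance[:fetch_k]

    # Step 2: greedily add the candidate with the best trade-off score.
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are near-duplicates; document 2 covers a different angle.
mmr_query = [1.0, 1.0]
mmr_docs = [[1.0, 0.8], [1.0, 0.75], [0.7, 1.0]]
print(mmr(mmr_query, mmr_docs, k=2, fetch_k=3))
```

Ranking purely by similarity would return the two near-duplicates (indices 0 and 1); the redundancy penalty makes the second pick the more distinct document instead.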

### Query retriever (test)

Here are some test blocks that I used only to understand the capabilities of LangChain query retrievers. They are not used to get the end results.

In [12]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

document_content_description = "yearly report microsoft 2023"
metadata_field_info = [
    AttributeInfo(
        name="chapter",
        description="The chapter the chunk is from and the part according to the content page of the document",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the document the chunk is from",
        type="integer",
    ),
]


In [19]:
from langchain_openai import AzureOpenAI

retriever = SelfQueryRetriever.from_llm(llm, vector_store, document_content_description,  metadata_field_info)

In [20]:
question = "What did the document say about employee number in part 1?"
docs = retriever.get_relevant_documents(question)

OutputParserException: Parsing text
```json
{
    "query": "employee number",
    "filter": "eq(\"chapter\", \"Part 1\")"
}
```


<< Example 4. >>
Data Source:
```json
{
    "content": "Financial Statement of Microsoft Corporation",
    "attributes": {
        "year": {
            "type": "string",
            "description": "The fiscal year of the financial statement"
        },
        "revenue": {
            "type": "integer",
            "description": "Revenue in millions of dollars"
        },
        "net_income": {
            "type": "integer",
            "description": "Net income in millions of dollars"
        }
    }
}
```

User Query:
What is the net income and revenue of Microsoft in 2019?

Structured Request:
```json
{
    "query": "",
    "filter": "eq(\"year\", \"2019\")"
}
```

<< Example 5. >>
Data Source:
```json
{
    "content": "Lyrics of a song",
    "attributes": {
        "artist": {
            "type": "string",
            "description": "Name of the song artist"
        },
        "length": {
            "type": "
 raised following error:
Got invalid JSON object. Error: Extra data: line 5 column 1 (char 80)
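
The "Extra data" message is what Python's `json` module raises when more text follows a complete JSON value, which matches the output above: the model kept generating further examples after the structured request. The sketch below reproduces that failure on a shortened copy of the output and shows one possible workaround, parsing only the first object with `json.JSONDecoder.raw_decode`. The helper name `first_json_object` is hypothetical, and this does not change how the LangChain parser itself behaves.

```python
import json


def first_json_object(text):
    """Parse the first JSON object in `text`, ignoring anything after it."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Shortened stand-in for the model output shown above.
llm_output = r'''{
    "query": "employee number",
    "filter": "eq(\"chapter\", \"Part 1\")"
}

<< Example 4. >>
Data Source: (the model kept generating more examples)'''

try:
    json.loads(llm_output)
    strict_ok = True
except json.JSONDecodeError as err:
    strict_ok = False
    print("Strict parsing failed:", err.msg)

parsed = first_json_object(llm_output)
print(parsed)
```

Limiting the generation with a stop sequence or a smaller token budget would be another way to avoid the trailing text in the first place.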

In [None]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_store.as_retriever()
)

compressed_docs = compression_retriever.get_relevant_documents(question)

In [None]:
from langchain.retrievers import SVMRetriever
from langchain.retrievers import TFIDFRetriever

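# from_texts expects plain strings, so pull the text out of each Document chunk
texts = [doc.page_content for doc in splits]
svm_retriever = SVMRetriever.from_texts(texts, embeddings)
tfidf_retriever = TFIDFRetriever.from_texts(texts)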


docs_svm=svm_retriever.get_relevant_documents(question)
docs_svm[0]

### QA methods

Here are some final experiments with the Retrieval Question Answering methodology in LangChain.

In [24]:
from langchain.chains import RetrievalQA


qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever()
)

result = qa_chain({"query": question})
result["result"]

' \nPossible answers: \n- Strategic and competitive risks\n- Third parties claiming that the company infringes their intellectual property\n- Claims and lawsuits against the company\n- Operational risks, including excessive outages, data losses, and disruptions of online services if the company fails to maintain an adequate operations infrastructure.<|im_end|>'

Below is a way to prompt the model to answer in a specific way, which is useful for getting slightly more consistent results.

In [50]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible. Use professional wording and do not include any unrelated information.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
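
To see what the model actually receives, the placeholder substitution can be mimicked with plain string formatting. This is a rough sketch under assumptions: the template is shortened, `build_prompt` is a hypothetical helper, and joining the retrieved chunks with a blank line is assumed here to approximate how the chain fills `{context}`.

```python
# Shortened version of the template above, for illustration only.
TEMPLATE = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def build_prompt(chunks, question, separator="\n\n"):
    """Join retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


example_prompt = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What company is described in the document?",
)
print(example_prompt)
```

Printing the filled prompt like this makes it easier to judge whether the retrieved chunks contain enough information to answer the question.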

In [51]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [52]:
question  = "What company is described in the document and what is their business model?"

result = qa_chain({"query": question})
print(result["result"])

 The document describes a company that is not specified, but they use a variety of business models, including licensing software and infrastructure-, platform-, and software-as-a-service. They market and distribute their products and services through OEMs, direct, and distributors and resellers. The financial metrics of the company are disclosed in the MD&A or the Notes to Financial Statements (Part II, Item 8 of this Form 10-K).<|im_end|>


In [53]:
question = "What are the main risks this company is facing?"

result = qa_chain({"query": question})
print(result["result"])

 The company is facing intense competition, risk of intellectual property infringement, and claims and lawsuits against them. The company may also experience excessive outages, data losses, and disruptions of their online services if they fail to maintain an adequate operations infrastructure. The risks may adversely affect their business, financial condition, results of operations, cash flows, and the trading price of their common stock.<|im_end|>
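
The printed answers above end with a literal `<|im_end|>` marker. One possible way to tidy them before display is a small post-processing step; `clean_answer` below is a hypothetical helper, not something the notebook or LangChain provides.

```python
END_MARKER = "<|im_end|>"


def clean_answer(raw, marker=END_MARKER):
    """Cut the text at the first end marker and trim surrounding whitespace."""
    return raw.split(marker, 1)[0].strip()


raw_answer = " The company is facing intense competition.<|im_end|>"
print(clean_answer(raw_answer))
```

It could be applied as `clean_answer(result["result"])` before printing.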


## Putting it all together

In [25]:
#just a temporary variable to store the vector store to allow use outside scope of cell/method
global_vector_store = None

In [27]:
import io
from langchain_openai import AzureOpenAI
# from langchain_openai import AzureChatOpenAI
import ipywidgets as widgets
from IPython.display import display
from langchain.document_loaders import PyPDFLoader, PyMuPDFLoader
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
# Global variable to hold the vector store

def load_db(file):
    global global_vector_store

    # load documents
    loader = PyMuPDFLoader(file)
    documents = loader.load()

    # Split documents. You can tune the chunk size and overlap to see what gives the most consistent results.
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

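    # Check if the global vector store already exists.
    # Note: the cached store is reused regardless of `file`, so reset
    # global_vector_store to None before loading a different document.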
    if global_vector_store is None:
        # define embedding, you can also change the embedding model to fit your needs.
        model_kwargs = {'device': 'cpu'}
        embeddings = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2', model_kwargs=model_kwargs)

        # create vector database from data
        # We could also try FAISS instead of Chroma, found results obtained by Chroma to be good enough for now.
        global_vector_store = Chroma.from_documents(
            documents=docs,
            embedding=embeddings,
            persist_directory="data/vector_store"
        )
    else:
        print("Using existing vector store")

    # define retriever
    template = """Answer only the below question given the context. If you don't know the answer, just say that you don't know. Use only a few sentences maximum. Keep the answer as concise as possible. Keep it professional. Include page: and section: at the end which is the source of your answer.
        {context}
        Question: {question}
        Answer:"""
    QA_CHAIN_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)
    # Define the LLM
    llm = AzureOpenAI(
        deployment_name="gpt35PotentialHires",
        max_tokens=100,
        temperature=0.7,
        stop="\n\n",
    )
    qa_chain = RetrievalQA.from_chain_type(llm,
                                           retriever=global_vector_store.as_retriever(),
                                           return_source_documents=True,
                                           chain_type_kwargs={"prompt": QA_CHAIN_PROMPT})
    return qa_chain
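
`load_db` keeps a single store in `global_vector_store`, so a later call with a different file reuses the vectors of the first document. A minimal sketch of an alternative is to cache one store per file path. Everything here is hypothetical: `_STORES`, `get_store`, and the `build` callable (which in the notebook would wrap the loading, splitting, and `Chroma.from_documents` steps) are illustrative names only.

```python
_STORES = {}  # hypothetical per-file cache: path -> vector store


def get_store(file, build):
    """Return the cached store for `file`, building it on first use.

    `build` is any callable that turns a file path into a vector store.
    """
    if file not in _STORES:
        _STORES[file] = build(file)
    return _STORES[file]


# Demo with a stand-in builder that records which files were processed.
built = []


def fake_build(path):
    built.append(path)
    return {"source": path}


first = get_store("report.pdf", fake_build)
other = get_store("paper.pdf", fake_build)
again = get_store("report.pdf", fake_build)
print(built)
```

With a persistent store, a separate persist directory per file would be needed as well, since documents written to the same directory end up in the same collection.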

In [5]:
qa = load_db("Microsoft_2023_Trimmed.pdf")

                stop was transferred to model_kwargs.
                Please confirm that stop is what you intended.


In [6]:
question  = "What company is described in the document and what is their business model?"

result = qa({"query": question})
print("-"*100)
print(result["result"])
print("-"*100)

  warn_deprecated(


----------------------------------------------------------------------------------------------------
 The document describes Industry Solutions, formerly Microsoft Consulting Services, and their business model is to provide consulting services to clients to help them build and implement technology solutions. page: 45, section: Industry Solutions
----------------------------------------------------------------------------------------------------


In [7]:
question  = "What are the main risks this company is facing?"

result = qa({"query": question})
print("-"*100)
print(result["result"])
print("-"*100)

----------------------------------------------------------------------------------------------------
 The company is facing strategic and competitive risks. Competition in the technology sector is intense, and the barriers to entry in many of the company's businesses are low. The company's competitors range in size from diversified global companies with significant research and development resources to small, specialized firms. Many of the areas in which the company competes evolve rapidly with changing and disruptive technologies, shifting user needs, and are subject to these risks. (Page: 1, Section: ITEM 1A)
        
----------------------------------------------------------------------------------------------------


## Extracting financial table

Here we extract the financial table using pdfplumber. I found that PyMuPDFLoader did not give clean results: its output contained a lot of special characters and information from outside the tables.

Note that I currently hardcode the page to look for tables. Ideally, this would be done automatically, but for the purpose of the case and for time management I decided to do it this way.
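
One simple way the page lookup might be automated is a keyword scan over the text of each page. The sketch below is only an idea, not what the notebook does: `find_table_pages` is a hypothetical helper, the keywords are an assumption about what an income statement page contains, and the page texts are expected to be extracted beforehand (for example with pdfplumber's `page.extract_text()`).

```python
def find_table_pages(page_texts, keywords=("Revenue", "Net income")):
    """Return the indices of pages whose text contains every keyword."""
    hits = []
    for index, text in enumerate(page_texts):
        lowered = (text or "").lower()  # tolerate pages without text
        if all(keyword.lower() in lowered for keyword in keywords):
            hits.append(index)
    return hits


# Toy page texts standing in for extracted PDF pages.
toy_pages = [
    "UNITED STATES SECURITIES AND EXCHANGE COMMISSION",
    None,
    "Revenue: Service and other 147,216 ... Net income 72,361",
]
print(find_table_pages(toy_pages))
```

The returned indices could then replace the hardcoded page index when opening the PDF.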

In [37]:
import pdfplumber
import pandas as pd

# Open the PDF file
with pdfplumber.open("Microsoft_2023_Trimmed.pdf") as pdf:
    # The financial tables are on the last page of this PDF (hardcoded)
    last_page = pdf.pages[-1]  # Adjust the page index as needed
    # Extract all tables from the page (a list of tables, each a list of rows)
    table = last_page.extract_tables()
    # Define a prompt asking the LLM for code that loads the table into a DataFrame
    prompt = """
    I have extracted a page with tabular data from a PDF document. Please only show how to format it into code that used to visualise it using pandas dataframe.
    """

    # Set a system and user message
    messages=[
            {"role": "system", "content": f"The following text contains financial data extracted from a report: {table}"},
            {"role": "user", "content": prompt}]

    completion = client.chat.completions.create(
        model="gpt35PotentialHires", 
        messages=messages,
        temperature=0.7,
        max_tokens=800,
        top_p=0.95,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )

    # Print the response
    print(completion.choices[0].message.content)

Here is the code to format the extracted data into a pandas dataframe:

```python
import pandas as pd

data = [['Revenue:', '', '', '', '', '', '', '', '', '', '', '', ''],
        ['Service and other', '', '', '147,216', '', '', '', '125,538', '', '', '', '97,014', ''],
        ['Cost of revenue:', '', '', '', '', '', '', '', '', '', '', '', ''],
        ['Service and other', '', '', '48,059', '', '', '', '43,586', '', '', '', '34,013', ''],
        ['Gross margin', '', '', '146,052', '', '', '', '135,620', '', '', '', '115,856', ''],
        ['Sales and marketing', '', '', '22,759', '', '', '', '21,825', '', '', '', '20,117', ''],
        ['Operating income', '', '', '88,523', '', '', '', '83,383', '', '', '', '69,916', ''],
        ['Income before income taxes', '', '', '89,311', '', '', '', '83,716', '', '', '', '71,102', ''],
        ['Net income', '', '$', '72,361', '', '', '$', '72,738', '', '', '$', '61,271', ''],
        [None, None, None, None, '', None, None, None, '', None,

I let the LLM find and format the table so that it can be visualised with pandas. The LLM output is then pasted into the `data` variable below to build the table.

I got several different results; the following cells are examples of what the LLM produced.

In [29]:
import pandas as pd

data = [
    ['Revenue:', '', '', '', '', '', '', '', '', '', '', '', ''],
    ['Service and other', '', '', '147,216', '', '', '', '125,538', '', '', '', '97,014', ''],
    ['Cost of revenue:', '', '', '', '', '', '', '', '', '', '', '', ''],
    ['Service and other', '', '', '48,059', '', '', '', '43,586', '', '', '', '34,013', ''],
    ['Gross margin', '', '', '146,052', '', '', '', '135,620', '', '', '', '115,856', ''],
    ['Sales and marketing', '', '', '22,759', '', '', '', '21,825', '', '', '', '20,117', ''],
    ['Operating income', '', '', '88,523', '', '', '', '83,383', '', '', '', '69,916', ''],
    ['Income before income taxes', '', '', '89,311', '', '', '', '83,716', '', '', '', '71,102', ''],
    ['Net income', '', '$', '72,361', '', '', '$', '72,738', '', '', '$', '61,271', ''],
    [None, None, None, None, '', None, None, None, '', None, None, None, ''],
    ['Basic', '', '$', '9.72', '', '', '$', '9.70', '', '', '$', '8.12', ''],
    ['Weighted average shares outstanding:', '', '', '', '', '', '', '', '', '', '', '', ''],
    ['Diluted', '', '', '7,472', '', '', '', '7,540', '', '', '', '7,608', '']
]

df = pd.DataFrame(data)
df = df.loc[:, df.apply(lambda col: col.nunique() > 1)]
df.columns = [''] * df.shape[1]
print(df.to_string(index=False))

                                                                           
                            Revenue:                                       
                   Service and other      147,216      125,538       97,014
                    Cost of revenue:                                       
                   Service and other       48,059       43,586       34,013
                        Gross margin      146,052      135,620      115,856
                 Sales and marketing       22,759       21,825       20,117
                    Operating income       88,523       83,383       69,916
          Income before income taxes       89,311       83,716       71,102
                          Net income    $  72,361    $  72,738    $  61,271
                                None None    None None    None None    None
                               Basic    $    9.72    $    9.70    $    8.12
Weighted average shares outstanding:                                       
            

In [36]:
# replace the extracted data with your own
data = [[['Revenue:', '', '', '', '', '', '', '', '', '', '', '', '']],
        [['Service and other', '', '', '147,216', '', '', '', '125,538', '', '', '', '97,014', '']],
        [['Cost of revenue:', '', '', '', '', '', '', '', '', '', '', '', '']],
        [['Service and other', '', '', '48,059', '', '', '', '43,586', '', '', '', '34,013', '']],
        [['Gross margin', '', '', '146,052', '', '', '', '135,620', '', '', '', '115,856', '']],
        [['Sales and marketing', '', '', '22,759', '', '', '', '21,825', '', '', '', '20,117', '']],
        [['Operating income', '', '', '88,523', '', '', '', '83,383', '', '', '', '69,916', '']],
        [['Income before income taxes', '', '', '89,311', '', '', '', '83,716', '', '', '', '71,102', '']],
        [['Net income', '', '$', '72,361', '', '', '$', '72,738', '', '', '$', '61,271', ''], [None, None, None, None, '', None, None, None, '', None, None, None, '']],
        [['Basic', '', '$', '9.72', '', '', '$', '9.70', '', '', '$', '8.12', '']],
        [['Weighted average shares outstanding:', '', '', '', '', '', '', '', '', '', '', '', '']],
        [['Diluted', '', '', '7,472', '', '', '', '7,540', '', '', '', '7,608', '']]]

# create the dataframe (note: the Q1-2020 to Q3-2020 column labels came from the LLM output, not from the extracted rows)
df = pd.DataFrame([row[0] for row in data], columns=['Category', '', '', 'Q1-2020', '', '', '', 'Q2-2020', '', '', '', 'Q3-2020', ''])

print(df)

                                Category       Q1-2020         Q2-2020         \
0                               Revenue:                                        
1                      Service and other       147,216         125,538          
2                       Cost of revenue:                                        
3                      Service and other        48,059          43,586          
4                           Gross margin       146,052         135,620          
5                    Sales and marketing        22,759          21,825          
6                       Operating income        88,523          83,383          
7             Income before income taxes        89,311          83,716          
8                             Net income    $   72,361      $   72,738      $   
9                                  Basic    $     9.72      $     9.70      $   
10  Weighted average shares outstanding:                                        
11                          
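
The extracted rows in the cells above are mostly padding: empty strings, `None`, and stand-alone `$` cells. As an alternative to asking the LLM to format them, the rows could be cleaned deterministically. The sketch below is a simplified illustration: `clean_row` is a hypothetical helper, and it assumes every remaining cell is a plain number with optional thousands separators (no parentheses for negatives, no footnote marks).

```python
def clean_row(row):
    """Return (label, values) with blank and '$' cells dropped and
    numeric strings such as '147,216' converted to floats."""
    label = row[0]
    values = []
    for cell in row[1:]:
        if cell in (None, "", "$"):
            continue
        values.append(float(cell.replace(",", "")))
    return label, values


# A few rows copied from the extracted data above.
rows = [
    ['Gross margin', '', '', '146,052', '', '', '', '135,620', '', '', '', '115,856', ''],
    ['Net income', '', '$', '72,361', '', '', '$', '72,738', '', '', '$', '61,271', ''],
    ['Basic', '', '$', '9.72', '', '', '$', '9.70', '', '', '$', '8.12', ''],
]
cleaned = dict(clean_row(row) for row in rows)
for label, values in cleaned.items():
    print(label, values)
```

The resulting label-to-values mapping can be passed to `pd.DataFrame.from_dict(cleaned, orient="index")` once the correct column labels are known from the PDF.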

## Extra testing with other PDF

In [38]:
qa_bonus = load_db("what_makes_in_context_work.pdf")

                stop was transferred to model_kwargs.
                Please confirm that stop is what you intended.


In [41]:
question  = "What is the main idea of this paper?"

result = qa_bonus({"query": question})
print("-"*100)
print(result["result"])
print("-"*100)

----------------------------------------------------------------------------------------------------
 The paper discusses the idea of in-context learning, which is a novel approach to learning that makes use of demonstrations, and the authors believe that it could lead to new insights and opportunities in NLP. page:001 section:We are glad that all reviewers ﬁnd that the paper is novel (8jk5, LQ6N, 92YB, 7E5P), of interest to the broader NLP community (LQ6N, 92YB, 
----------------------------------------------------------------------------------------------------


In [42]:
question  = "What domain is this research for?"

result = qa_bonus({"query": question})
print("-"*100)
print(result["result"])
print("-"*100)

----------------------------------------------------------------------------------------------------
 This research is focused on advancing the state-of-the-art in computer science and a broad range of other disciplines. Page: 10 Section: Microsoft Research
----------------------------------------------------------------------------------------------------
