
### Step 1: Installing Necessary Libraries

Before we begin, it's essential to install the required packages. Here's a breakdown of what each package does:

- `openai`: The official Python client for the OpenAI API. It facilitates interactions with OpenAI's GPT models.
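- `langchain`: A framework for building applications on top of language models. It supplies the document loaders, text splitters, vector store integrations, and chains used below.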
- `tiktoken`: An OpenAI library used to count the number of tokens in a string without making an API call.
- `unstructured[slack]` and `unstructured_inference`: Libraries that handle unstructured data, with a specific extension for Slack.
- `singlestoredb`: The official Python client for SingleStore DB to interact with SingleStore databases.
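
Because these libraries change quickly, it can help to record which versions were actually installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative and not part of any of the packages above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(
    ["openai", "langchain", "tiktoken", "unstructured", "singlestoredb"]
)
for name, ver in report.items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cell gives a quick record of the environment the rest of the notebook was executed against.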
    

In [None]:
!pip install openai langchain tiktoken
!pip install "unstructured[slack]" unstructured_inference
!pip install singlestoredb

Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Collecting langchain
  Downloading langchain-0.0.326-py3-none-any.whl (1.9 MB)
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.1-py3-none-any.whl (27 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langsmith<0.1.0,>=0.0.52 (from langchain)
  Downloading langsmith-0.0.54-py3-none-any.whl (43 kB)

Collecting singlestoredb
  Downloading singlestoredb-0.9.3-cp36-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB)
Collecting parsimonious (from singlestoredb)
  Downloading parsimonious-0.10.0-py3-none-any.whl (48 kB)
Collecting sqlparams (from singlestoredb)
  Downloading sqlparams-5.1.0-py3-none-any.whl (16 kB)
Installing collected packages: sqlparams, parsimonious, singlestoredb
Successfully installed parsimonious-0.10.0 singlestoredb-0.9.3 sqlparams-5.1.0


### Step 2: Configuration and API Initialization

To interact with the OpenAI API and SingleStore database:

1. Import the necessary modules.
2. Set the OpenAI API key.
3. Configure the SingleStore database connection details.

**Note:** Keep your API keys confidential and avoid hardcoding them directly into scripts or notebooks. The cell below reads the OpenAI key from an environment variable.
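
One way to follow this advice is to fail early with a clear message when a required variable is missing, instead of hitting a bare `KeyError` or an authentication error later. This is a minimal sketch; `read_secret` is a hypothetical helper, and the demonstration uses an explicit dictionary rather than the real environment.

```python
import os


def read_secret(name, env=None):
    """Return a required secret from the environment, failing early if it is unset."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Set the {name} environment variable before running this notebook."
        )
    return value


# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"OPENAI_KEY": "sk-example"}
print(read_secret("OPENAI_KEY", demo_env) == "sk-example")
```

In the notebook itself, `read_secret("OPENAI_KEY")` could stand in for the direct `os.environ[...]` lookup.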

In [None]:
import os
import openai

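# Read the key from the environment rather than hardcoding it
openai.api_key = os.environ["OPENAI_KEY"]
# LangChain's OpenAI wrappers look for OPENAI_API_KEY, so expose the key under that name too
os.environ.setdefault("OPENAI_API_KEY", openai.api_key)
# Connection URL used by the SingleStoreDB vector store
os.environ["SINGLESTOREDB_URL"] = "<REPLACE SINGLESTOREDB URL>"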


### Step 3: Importing Modules

These modules provide functionalities ranging from text retrieval to embeddings generation:

- `RetrievalQA`: Used for retrieving answers to questions based on a dataset.
- `load_qa_chain`: A function to load specific question-answering chains.
- `TextLoader`: Helps in loading documents or texts.
- `OpenAIEmbeddings`: Generates embeddings using OpenAI models.
- `OpenAI`: LangChain's wrapper around OpenAI language models, used here as the LLM that generates answers.
- `PromptTemplate`: Useful for creating structured prompts for language models.
- `CharacterTextSplitter`: Splits text on a separator and groups the pieces into chunks of a bounded character length.
- `SingleStoreDB`: Enables interactions with SingleStore databases.
    

In [None]:
from langchain.chains import RetrievalQA
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import SingleStoreDB


### Step 4: Data Ingestion from Unstructured's Slack Connector

In this step, the goal is to fetch data from a Slack channel. The `unstructured-ingest` command allows for the ingestion of unstructured data from various sources, in this case, Slack:

- The specific Slack channel ID and token are used to fetch data.
- The data is stored locally in specified directories for download and structured output.
- The `--partition-by-api` flag runs partitioning through Unstructured's hosted API instead of locally.
- The hosted API requires an API key, passed with `--api-key`.

Executing this command will result in the Slack channel's data being ingested and ready for further processing.
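
Since the command carries two secrets (the Slack token and the Unstructured API key), it can be convenient to build it in a function and mask those values before printing or logging it. The sketch below uses only the flags shown in this step; `build_ingest_command` and `redact` are hypothetical helper names, and the placeholder values are not real credentials.

```python
def build_ingest_command(channel, slack_token, api_key,
                         download_dir="slack-ingest-download",
                         output_dir="slack-ingest-output"):
    """Assemble the unstructured-ingest argument list used in this step."""
    return [
        "unstructured-ingest", "slack",
        "--channels", channel,
        "--token", slack_token,
        "--download-dir", download_dir,
        "--structured-output-dir", output_dir,
        "--partition-by-api",
        "--api-key", api_key,
        "--reprocess", "--preserve-downloads",
    ]


def redact(command, secret_flags=("--token", "--api-key")):
    """Return a copy of the command with secret values masked, safe to print."""
    masked = list(command)
    for i, arg in enumerate(masked[:-1]):
        if arg in secret_flags:
            masked[i + 1] = "***"
    return masked


command = build_ingest_command("<CHANNEL ID>", "<SLACK TOKEN>", "<UNSTRUCTURED API KEY>")
print(" ".join(redact(command)))
```

The unmasked `command` list is what would be handed to `subprocess`, while only the redacted copy is ever displayed.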
    

In [None]:
import subprocess

# Ingest a Slack channel and partition it through Unstructured's hosted API
command = [
    "unstructured-ingest",
    "slack",
    "--channels", "<REPLACE CHANNEL ID>",
    "--token", "<REPLACE SLACK TOKEN>",
    "--download-dir", "slack-ingest-download",
    "--structured-output-dir", "slack-ingest-output",
    "--partition-by-api",
    "--api-key", "<REPLACE UNSTRUCTURED API KEY>",
    "--reprocess", "--preserve-downloads"
]

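# Run the command; capture stderr too, otherwise `error` is None and the failure branch cannot decode it
process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = process.communicate()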

# Print output
if process.returncode == 0:
    print('Command executed successfully. Output:')
    print(output.decode())
else:
    print('Command failed. Error:')
    print(error.decode())

2023-09-06 17:03:03,993 MainProcess INFO     Processing 1 docs
2023-09-06 17:03:06,489 SpawnPoolWorker-2 DEBUG    File exists: slack-ingest-download/C044N0YV08G.xml, skipping get_file
2023-09-06 17:03:06,489 SpawnPoolWorker-2 INFO     Processing slack-ingest-download/C044N0YV08G.xml
2023-09-06 17:03:06,489 SpawnPoolWorker-2 DEBUG    Using remote partition (https://api.unstructured.io/general/v0/general)
2023-09-06 17:03:09,851 SpawnPoolWorker-2 INFO     Wrote slack-ingest-output/C044N0YV08G.json


Command executed successfully. Output:


### Step 5: Loading Ingested Text Data

After ingesting data from Slack:

1. A `TextLoader` object loads the text data saved in the previous step.
2. The `load()` method reads the content, making it available for further processing in the notebook.    
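
Before loading, it may be worth checking which file the ingest step actually wrote for the channel, because the file extension can differ between connector versions. The following sketch uses only the standard library; `find_downloads` is a hypothetical helper, and the channel ID and temporary directory in the demonstration are made up.

```python
import tempfile
from pathlib import Path


def find_downloads(download_dir, channel_id):
    """List files in download_dir named after the channel, whatever their extension."""
    return sorted(Path(download_dir).glob(f"{channel_id}.*"))


# Demonstration against a temporary directory with one fake download.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "C0000000000.txt").write_text("hello from slack")
    found = [p.name for p in find_downloads(tmp, "C0000000000")]

print(found)
```

In the notebook, the path returned for the real channel ID in `slack-ingest-download` would be the one to pass to `TextLoader`.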

In [None]:
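# Path of the file downloaded in Step 4; adjust the name and extension to match what was written to slack-ingest-download/
loader = TextLoader("slack-ingest-download/C044N0YV08G.txt")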
documents = loader.load()

### Step 6: Splitting Text into Manageable Chunks

Given the potentially large volume of ingested data, it's beneficial to split texts into smaller chunks:

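- The `CharacterTextSplitter` utility splits text on a separator (a blank line, `"\n\n"`, by default) and then merges the pieces into chunks of up to `chunk_size` characters, with `chunk_overlap` characters shared between neighbouring chunks.
- A piece that contains no separator cannot be split further, which is why the output below warns about chunks longer than the specified 1000 characters.
- Smaller chunks make later stages more efficient, especially generating embeddings and performing retrievals.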
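
To make the splitting behaviour concrete, here is a simplified sketch of separator-based chunking in plain Python. It is not LangChain's implementation: it ignores overlap entirely, and `split_on_separator` is an illustrative name. It does show why a chunk can end up longer than the requested size when a single piece has no separator inside it.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Split on the separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short one\n\n" + "x" * 30 + "\n\nshort two\n\nshort three"
chunks = split_on_separator(sample, chunk_size=25)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The 30-character piece survives as its own oversized chunk, while the two short pieces at the end are merged because together they still fit within 25 characters.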

In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000,
                                      chunk_overlap=100)

texts = text_splitter.split_documents(documents)

Created a chunk of size 1316, which is longer than the specified 1000
Created a chunk of size 1405, which is longer than the specified 1000
Created a chunk of size 1054, which is longer than the specified 1000
Created a chunk of size 1418, which is longer than the specified 1000
Created a chunk of size 2421, which is longer than the specified 1000
Created a chunk of size 1156, which is longer than the specified 1000
Created a chunk of size 1398, which is longer than the specified 1000
Created a chunk of size 3522, which is longer than the specified 1000
Created a chunk of size 1212, which is longer than the specified 1000
Created a chunk of size 3927, which is longer than the specified 1000
Created a chunk of size 4182, which is longer than the specified 1000
Created a chunk of size 1735, which is longer than the specified 1000
Created a chunk of size 1459, which is longer than the specified 1000
Created a chunk of size 1396, which is longer than the specified 1000
Created a chunk of s

### Step 7: Setting Up Embeddings

Embeddings are dense vector representations of text data. They play a crucial role in information retrieval systems:

- The `OpenAIEmbeddings` class generates embeddings by calling OpenAI's embedding models.
- These embeddings will be used later for storing text chunks in the database and for efficient retrieval.
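
Retrieval works because texts with similar meaning get vectors that point in similar directions. The sketch below computes cosine similarity on hand-written toy vectors; real embeddings from OpenAI have far more dimensions, and the numbers here are invented purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # unrelated direction

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Vectors pointing the same way score 1 regardless of length, and unrelated (orthogonal) vectors score 0.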

In [None]:
embeddings = OpenAIEmbeddings()

### Step 8: Storing Text in SingleStoreDB

To enable efficient retrieval of information, the processed text chunks, along with their embeddings, are stored in a SingleStore database:

- The `SingleStoreDB.from_documents` method is employed to achieve this. It takes the text chunks and the embedding model, computes an embedding for each chunk, and stores both in the specified database table.
- This setup ensures that when queries are posed later, the system can quickly retrieve relevant text chunks based on their embeddings.
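
Conceptually, the retrieval step ranks stored chunks by how well their vectors match the query vector and returns the best few. The in-memory sketch below uses a dot-product score as an assumed similarity measure; the database performs the equivalent ranking at scale, and the texts and vectors here are toy data, not output from the notebook.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vector, rows, k=2):
    """rows is a list of (text, vector); return the k texts with the highest score."""
    ranked = sorted(rows, key=lambda row: dot(query_vector, row[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


rows = [
    ("table extraction from PDFs", [0.9, 0.1, 0.0]),
    ("bringing your own model", [0.1, 0.9, 0.0]),
    ("speeding up PDF parsing", [0.7, 0.0, 0.3]),
]
results = top_k([1.0, 0.0, 0.0], rows, k=2)
print(results)
```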

In [None]:
docsearch = SingleStoreDB.from_documents(texts,
                                         embeddings,
                                         table_name="pdf_docs3")

### Step 9: Setting Up RetrievalQA with a Prompt Template

The RetrievalQA chain is configured with a custom prompt template:

1. The prompt template (`PROMPT`) is defined first and passed to the chain through `chain_type_kwargs`.
2. The `RetrievalQA.from_chain_type` method builds the chain with the "stuff" chain type, which places all retrieved chunks into a single prompt, using the specified OpenAI model and the retriever obtained from the vector store.

This configuration ensures that queries to the model are structured according to the defined template, enhancing clarity and consistency in interactions.
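
The effect of the "stuff" approach can be illustrated with plain string formatting: the retrieved chunks are joined and substituted for `{context}`, and the user's query replaces `{question}`. This is only a sketch of the idea, not LangChain's code; `stuff_prompt` is a hypothetical helper and the two chunks are placeholders.

```python
PROMPT_TEXT = """
Use the following pieces of context to answer the question at the end. If you're not sure, just say so.
If there are potential multiple answers, summarize them as possible answers.

{context}

Question: {question}
Answer:
"""


def stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the template."""
    return PROMPT_TEXT.format(context=separator.join(chunks), question=question)


filled = stuff_prompt(["chunk A", "chunk B"], "How does Unstructured tool chunk texts?")
print(filled)
```

Because every retrieved chunk lands in one prompt, this approach is simple but bounded by the model's context window, which is one reason the chunk size chosen earlier matters.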

In [None]:
# use prompt template
prompt_template = """
Use the following pieces of context to answer the question at the end. If you're not sure, just say so.
If there are potential multiple answers, summarize them as possible answers.

{context}

Question: {question}
Answer:
"""

PROMPT = PromptTemplate(template=prompt_template,
                        input_variables=["context", "question"])

In [None]:
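# Build the RetrievalQA chain with the custom prompt
# Standalone "stuff" QA chain; RetrievalQA below builds its own, so this object is not used further
qa_chain = load_qa_chain(OpenAI(), chain_type="stuff")
chain_type_kwargs = {"prompt": PROMPT}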

qa = RetrievalQA.from_chain_type(llm=OpenAI(model_name='gpt-4-0613'),
                                 chain_type="stuff",
                                 retriever=docsearch.as_retriever(),
                                 chain_type_kwargs=chain_type_kwargs)



### Step 10: Query about a Specific Task

To ask a question about the ingested Slack data:

1. A query string is set up, asking a specific question.
2. The `run` method of the RetrievalQA chain is invoked with the query.

The cells below pose several different questions; each answer is generated from the Slack messages retrieved for that query.
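
When asking many questions in a row, a small helper can collect the answers in one place and keep going if a single call fails. This is a sketch under the assumption that the question-answering function takes a string and returns a string, as `qa.run` does in this notebook; `run_queries` and `fake_ask` are hypothetical names, and `fake_ask` merely stands in for the real chain.

```python
def run_queries(ask, queries):
    """Call ask(query) for each query and collect answers, recording errors as text."""
    answers = {}
    for query in queries:
        try:
            answers[query] = ask(query)
        except Exception as exc:  # keep going if one query fails
            answers[query] = f"ERROR: {exc}"
    return answers


def fake_ask(query):
    """Stand-in for the real chain so the example runs without API access."""
    return f"(answer to: {query})"


answers = run_queries(
    fake_ask,
    ["How can I extract table from a PDF file?", "Can I bring my own model for inferences?"],
)
for question, answer in answers.items():
    print(question, "->", answer)
```

With the real chain, the call would be `run_queries(qa.run, [...])`.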

In [None]:
query = "How can I extract table from a PDF file?"
qa.run(query)

'From the context given, here are the steps for extracting tables from a PDF file:\n\n1. Inferring table structure is off by default, and for now only works on PDFs. Use it by calling `partition("yourfile.pdf", pdf_infer_table_structure=True)`, or `partition_pdf("yourfile.pdf", infer_table_structure=True)`.\n2. The extraction works best when the OCR for the table cells is done by PaddleOCR. \n3. The table structure is stored in the element metadata, under the `text_as_html` property, in HTML form.\n\nPlease note that the platform is currently working on resolving some known issues on table extraction, such as:\n- Reports of some cases where the text from cells is duplicated in non-table elements.\n- Reports of the table structure not being properly captured when the table spans multiple pages.'

In [None]:
query = "Can I bring my own model for inferences?"
qa.run(query)

'Yes, you can use your own model for inferences. However, it would need to be wrapped in the `UnstructuredObjectDetectionModel` class to ensure the inputs and outputs align for the downstream functionality. You may follow the interface defined in the Chipper model implementation or refer to the documentation in the unstructured-inference repository for more precise information.'

In [None]:
query = "How does Unstructured tool chunk texts?"
qa.run(query)

"The Unstructured tool chunks texts by grabbing text of the same type in roughly paragraph sized chunks. The specific chunking logic depends on the document type and it's more successful with chunking some document types than others."

In [None]:
query = "How can I speed up the execution of PDF parsing?"
qa.run(query)

'There are not many quick solutions for this, but one strategy suggested is to process individual pages in parallel. However, there are limits to speed improvement when working on a CPU. It seems the support is working on improving the speed. Also, depending on the parameters you use, like `strategy="auto"`, and the value of `pdf_infer_table_structure`, speed can vary. Changes to `strategy` might result in different PDF elements being shown. Merging related sequential elements at `partition` time is also considered to speed up the process, but it is not implemented yet.'

In [None]:
query = "How to extract image URLs from HTML content?"
qa.run(query)

"The user wants to extract image URLs from the HTML content of news articles. They've explored the source code and come across a potential method but are unclear on how to utilize it. There are several proposed solutions to this issue:\n\n1. The first solution was to use a cleaning 'brick' or module on the result of the partition function to extract the URL. However, the user was confused by this solution, as the cleaner module indicated was for email extraction, not image URL extraction.\n\n2. The user was later informed that there isn't a defined pattern for catching image URLs from HTML content within the codebase. An issue was created for this and a solution was being devised, involving a pattern similar to other solutions.\n\n3. Another solution was to extract the URLs from the JSON response structure from the partition response. There is no clear consensus or final solution provided in the context given. The engineering team will be working on a more generic extraction brick/modu