<a href="https://colab.research.google.com/github/JohnRTurner/JohnRTurner.github.io/blob/master/PDF_Table_Extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.2 LTS
Release:	22.04
Codename:	jammy


## Check Java environment and install tabula-py

tabula-py requires a Java runtime, so first check which Java version is available on your machine.

In [None]:
!java -version

openjdk version "11.0.20" 2023-07-18
OpenJDK Runtime Environment (build 11.0.20+8-post-Ubuntu-1ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20+8-post-Ubuntu-1ubuntu122.04, mixed mode, sharing)
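The same check can be scripted from Python, which is handy when the notebook runs unattended. This is a minimal sketch using only the standard library; the helper name `java_available` is hypothetical and not part of tabula-py.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs successfully."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with status 0 when the runtime starts correctly.
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Java found" if java_available() else "Java not found; install a JRE first")
```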


After confirming the Java environment, install tabula-py with pip.

In [None]:
# To be more precise, it's better to use `{sys.executable} -m pip install tabula-py`
!pip install -q tabula-py singlestoredb
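The comment in the cell above suggests invoking pip through the running interpreter. A small sketch of that idea follows; the helper name `pip_install_command` is hypothetical, and the command is only built here, not executed.

```python
import sys


def pip_install_command(*packages: str) -> list[str]:
    """Build a pip command that targets the interpreter running this code."""
    return [sys.executable, "-m", "pip", "install", "-q", *packages]


cmd = pip_install_command("tabula-py", "singlestoredb")
print(" ".join(cmd))
# To actually install, pass the list to subprocess.check_call(cmd).
```

Using `sys.executable` means the packages land in the same environment the notebook kernel imports from, which a bare `pip` on PATH does not guarantee.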

Before trying tabula-py, check your environment with tabula-py's `environment_info()` function, which shows the Python version, the Java version, and details of your OS.

In [None]:
import tabula
tabula.environment_info()

Python version:
    3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]
Java version:
    openjdk version "11.0.20" 2023-07-18
OpenJDK Runtime Environment (build 11.0.20+8-post-Ubuntu-1ubuntu122.04)
OpenJDK 64-Bit Server VM (build 11.0.20+8-post-Ubuntu-1ubuntu122.04, mixed mode, sharing)
tabula-py version: 2.7.0
platform: Linux-5.15.109+-x86_64-with-glibc2.35
uname:
    uname_result(system='Linux', node='6ef6d8e2e9c2', release='5.15.109+', version='#1 SMP Fri Jun 9 10:57:30 UTC 2023', machine='x86_64')
linux_distribution: ('Ubuntu', '22.04', 'jammy')
mac_ver: ('', ('', '', ''), '')


## Read a PDF with `read_pdf()` function

Let's read a PDF. tabula-py's `read_pdf()` function accepts a local path, a URL, or a file-like object; this example reads a local copy stored at `./pdfs/paper.pdf`.

In [None]:
import tabula
pdf_path = "./pdfs/paper.pdf"
dfs = tabula.read_pdf(pdf_path, stream=True, pages='all')

# read_pdf returns a list of DataFrames
print(len(dfs))
dfs[1]

Aug 28, 2023 7:17:26 PM org.apache.pdfbox.pdmodel.font.PDType1Font <init>
Aug 28, 2023 7:17:46 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
Aug 28, 2023 7:17:46 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
Aug 28, 2023 7:17:47 PM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode



10


| | Model | Pre-training Supervision | Pre-training Dataset | Acc (%) |
|---|---|---|---|---|
| 0 | TSN (RGB+Flow) [26] | Supervised: action labels | Kinetics | 36.5* |
| 1 | S3D [16] | Unsupervised: MIL-NCE on ASR | HT100M | 37.5* |
| 2 | ClipBERT [12] | Supervised: captions | COCO + Visual Genome | 30.8 |
| 3 | VideoCLIP [28] | Unsupervised: NCE on ASR | HT100M | 39.4 |
| 4 | SlowFast [10] | Supervised: action labels | Kinetics | 32.9 |
| 5 | TimeSformer [4] | Supervised: action labels | Kinetics | 48.3 |
| 6 | LwDS: TimeSformer [4] | Unsupervised: k-means on ASR | HT100M | 46.5 |
| 7 | LwDS: TimeSformer | Distant supervision | HT100M | 54.1 |
| 8 | VideoTF (SC) | Unsupervised: NN on ASR | HT100M | 47.0 |
| 9 | VideoTF (DM) | Distant supervision | HT100M | 54.8 |
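Some values in the `Acc (%)` column of the output above carry a footnote marker (for example `36.5*`), so the column holds text rather than numbers. One way to normalise such cells before loading them into a numeric database column is sketched below; `parse_accuracy` is a hypothetical helper, not a tabula-py function.

```python
import re
from typing import Optional


def parse_accuracy(raw) -> Optional[float]:
    """Extract the leading number from a cell such as '36.5*'; None if absent."""
    match = re.match(r"\s*(-?\d+(?:\.\d+)?)", str(raw))
    return float(match.group(1)) if match else None


# Sample cells copied from the extracted table.
sample = ["36.5*", "37.5*", "30.8", "54.8"]
cleaned = [parse_accuracy(value) for value in sample]
print(cleaned)  # [36.5, 37.5, 30.8, 54.8]
```

Assuming the DataFrame column is named `Acc (%)` as in the output, the helper could be applied with `dfs[1]["Acc (%)"].map(parse_accuracy)`.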


# Load PDF table into SingleStore

In [None]:
!python --version

Python 3.10.12


In [None]:
import os
import singlestoredb

os.environ["SINGLESTOREDB_URL"] = "admin:SingleStore1!@svc-f1a640fd-31f3-4150-8558-8ee0260c94ad-dml.aws-virginia-5.svc.singlestore.com:3306/s2labs"
conn = singlestoredb.connect()
cur = conn.cursor()

# Create tables for the data extracted from the PDF.
# doc_values holds the rows parsed from the document's table.
# loc_values holds a copy of those rows with AccPerc multiplied by 2, so that
# LangChain has a difference between the two datasets to report.
cur.execute('DROP TABLE IF EXISTS doc_values;')
cur.execute('DROP TABLE IF EXISTS loc_values;')
cur.execute('''
CREATE TABLE IF NOT EXISTS doc_values (
    `Model` VARCHAR(255),
    `PreTraining Supervision` VARCHAR(255),
    `PreTrainingDataset` VARCHAR(255),
    `AccPerc` FLOAT
);
''')

cur.execute('''
CREATE TABLE IF NOT EXISTS loc_values (
    `Model` VARCHAR(255),
    `PreTraining Supervision` VARCHAR(255),
    `PreTrainingDataset` VARCHAR(255),
    `AccPerc` FLOAT
);
''')

# Convert the DataFrame to a list of tuples
data_tuples = [tuple(x) for x in dfs[1].to_records(index=False)]

# Insert the table extracted from the pdf
cur.executemany('''
    INSERT INTO doc_values (`Model`,`PreTraining Supervision`, `PreTrainingDataset`,`AccPerc`)
    VALUES (
      %s,
      %s,
      %s,
      %s
    )''', data_tuples)

# Copy the rows from doc_values to loc_values, multiplying AccPerc by 2
cur.execute('''
    INSERT INTO loc_values (`Model`,`PreTraining Supervision`, `PreTrainingDataset`,`AccPerc`)
    SELECT
    `Model`,
    `PreTraining Supervision`,
    `PreTrainingDataset`,
    `AccPerc` * 2
    FROM doc_values
''')

# Commit the transaction
conn.commit()

# Close the connection
cur.close()
conn.close()
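The load-then-duplicate pattern in the cell above can be tried without a SingleStore workspace. The sketch below uses the standard library's `sqlite3` as a stand-in database, which is an assumption made purely for illustration: SQLite uses `?` placeholders where the SingleStore driver uses `%s`, and its column types differ.

```python
import sqlite3

# Stand-in rows copied from the extracted table (footnote markers removed).
rows = [
    ("TSN (RGB+Flow) [26]", "Supervised: action labels", "Kinetics", 36.5),
    ("ClipBERT [12]", "Supervised: captions", "COCO + Visual Genome", 30.8),
    ("VideoTF (DM)", "Distant supervision", "HT100M", 54.8),
]

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
for table in ("doc_values", "loc_values"):
    cur.execute(
        f'CREATE TABLE {table} ("Model" TEXT, "PreTraining Supervision" TEXT, '
        '"PreTrainingDataset" TEXT, "AccPerc" REAL)'
    )

# Parameterised bulk insert: one tuple per row, one placeholder per column.
cur.executemany("INSERT INTO doc_values VALUES (?, ?, ?, ?)", rows)

# Copy every row into loc_values with the accuracy doubled.
cur.execute(
    'INSERT INTO loc_values SELECT "Model", "PreTraining Supervision", '
    '"PreTrainingDataset", "AccPerc" * 2 FROM doc_values'
)
conn.commit()

doubled = cur.execute(
    'SELECT "Model", "AccPerc" FROM loc_values ORDER BY rowid'
).fetchall()
print(doubled)
conn.close()
```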


# Local Llama SQLDatabaseChain
This lets you ask the LLM questions about the data that was loaded from the PDF into SingleStore.

## Setup Llama

In [None]:
!pip install langchain ctransformers ctransformers[gptq] --quiet

## Download and install Llama2 7B GPTQ

In [None]:
from langchain.llms import CTransformers
llm = CTransformers(model='TheBloke/Llama-2-7B-GPTQ')


## Test to ensure LLM is working correctly

In [None]:
print(llm('AI is going to'))

 be a big factor in the evolution of technology. surely it will evolve into something that is as ubiquitous as electricity, but at present it's nowhere near there yet.
AI is going to have to get smarter before we can trust it to drive a car.
The idea that self-driving cars could result in fewer people owning cars sounds like something a liberal would propose to me. I hope not though because the only way they will be able to afford it is if it's free or subsidized by taxpayers. It's kind of like "free" healthcare. People don't understand how expensive it really is and therefore underestimate its real cost. So we have all these liberals claiming that their taxes won't go up because they are going to get this great benefit without realizing how much taxes will actually need to be raised in order to pay for the "benefit".
I think self-driving cars will become more popular as the price of gas gets higher and the cost of driving becomes a problem. But I don't see people giving up their cars 

In [None]:
# Ask the LLM to compare the doc_values and loc_values tables
# The singlestoredb:// URI needs the SQLAlchemy dialect for SingleStore:
#!pip install sqlalchemy-singlestoredb
from langchain.utilities import SQLDatabase
from langchain_experimental.sql import SQLDatabaseChain

db = SQLDatabase.from_uri(
    'singlestoredb://admin:SingleStore1!@svc-f1a640fd-31f3-4150-8558-8ee0260c94ad-dml.aws-virginia-5.svc.singlestore.com:3306/s2labs',
    include_tables=['doc_values', 'loc_values'],
    sample_rows_in_table_info=5
)
db_chain = SQLDatabaseChain.from_llm(
    llm,
    db,
    verbose=True)
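The connection string appears in several cells with the credentials written inline. A hedged alternative is to assemble it from environment variables, as sketched below. The `S2_*` variable names, the fallback values, and the `build_singlestore_url` helper are all hypothetical, and the sketch assumes the driver decodes percent-encoded credentials the way SQLAlchemy-style URLs do.

```python
import os
from urllib.parse import quote


def build_singlestore_url(user, password, host, port, database, scheme=None):
    """Assemble a connection URL, percent-encoding the user and password."""
    credentials = f"{quote(user, safe='')}:{quote(password, safe='')}"
    url = f"{credentials}@{host}:{port}/{database}"
    return f"{scheme}://{url}" if scheme else url


# Hypothetical variable names; the fallbacks are placeholders, not real settings.
url = build_singlestore_url(
    user=os.environ.get("S2_USER", "admin"),
    password=os.environ.get("S2_PASSWORD", "change-me"),
    host=os.environ.get("S2_HOST", "localhost"),
    port=int(os.environ.get("S2_PORT", "3306")),
    database=os.environ.get("S2_DATABASE", "s2labs"),
    scheme="singlestoredb",  # omit for the SINGLESTOREDB_URL form without a scheme
)
```

The resulting string could be passed to `SQLDatabase.from_uri(url)`, or, built without the scheme, assigned to `os.environ["SINGLESTOREDB_URL"]`.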

In [None]:
db_chain.run("Find the differences between doc_values and loc_values?")



> Entering new SQLDatabaseChain chain...
Find the differences between doc_values and loc_values?
SQLQuery:SELECT * FROM doc_values JOIN loc_values ON doc_values.Model = loc_values.Model LIMIT 5;
SQLResult: [('TSN (RGB+Flow) [26]', 'Supervised: action labels', 'Kinetics', 36.5, 'TSN (RGB+Flow) [26]', 'Supervised: action labels', 'Kinetics', 73.0), ('LwDS: TimeSformer', 'Distant supervision', 'HT100M', 54.1, 'LwDS: TimeSformer', 'Distant supervision', 'HT100M', 108.2), ('LwDS: TimeSformer [4]', 'Unsupervised: k-means on ASR', 'HT100M', 46.5, 'LwDS: TimeSformer [4]', 'Unsupervised: k-means on ASR', 'HT100M', 93.0), ('TimeSformer [4]', 'Supervised: action labels', 'Kinetics', 48.3, 'TimeSformer [4]', 'Supervised: action labels', 'Kinetics', 96.6), ('ClipBERT [12]', 'Supervised: captions', 'COCO + Visual Genome', 30.8, 'ClipBERT [12]', 'Supervised: captions', 'COCO + Visual Genome', 61.6)]
Answer:Both tables have the same number of mo

'Both tables have the same number of models (5). All their models are called TSN (26) and they are both Supervised: action labels. However, only doc_values has a PreTraining Supervision column which is always Unsupervised: NCE on ASR. Also, loc_values has a PreTrainingDataset column which always is HT100M while the PreTrainingDataset of doc_values can be COCO + Visual Genome or Kinetics. Finally, the AccPerc value of loc_values is higher than that of doc_values for all models.'
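The joined rows returned by the generated query can also be compared directly in Python, which gives a deterministic check to set beside the model's free-text answer. The sketch below copies the five rows from the SQLResult above; `describe_differences` is a hypothetical helper.

```python
# Each row holds the four doc_values columns followed by the four loc_values columns.
joined_rows = [
    ("TSN (RGB+Flow) [26]", "Supervised: action labels", "Kinetics", 36.5,
     "TSN (RGB+Flow) [26]", "Supervised: action labels", "Kinetics", 73.0),
    ("LwDS: TimeSformer", "Distant supervision", "HT100M", 54.1,
     "LwDS: TimeSformer", "Distant supervision", "HT100M", 108.2),
    ("LwDS: TimeSformer [4]", "Unsupervised: k-means on ASR", "HT100M", 46.5,
     "LwDS: TimeSformer [4]", "Unsupervised: k-means on ASR", "HT100M", 93.0),
    ("TimeSformer [4]", "Supervised: action labels", "Kinetics", 48.3,
     "TimeSformer [4]", "Supervised: action labels", "Kinetics", 96.6),
    ("ClipBERT [12]", "Supervised: captions", "COCO + Visual Genome", 30.8,
     "ClipBERT [12]", "Supervised: captions", "COCO + Visual Genome", 61.6),
]


def describe_differences(rows):
    """For each joined row, report whether the text columns match and the AccPerc ratio."""
    report = []
    for row in rows:
        doc, loc = row[:4], row[4:]
        text_matches = doc[:3] == loc[:3]
        ratio = round(loc[3] / doc[3], 6)
        report.append((doc[0], text_matches, ratio))
    return report


report = describe_differences(joined_rows)
for model, text_matches, ratio in report:
    print(f"{model}: text columns match={text_matches}, loc/doc AccPerc ratio={ratio}")
```

For these rows the text columns are identical and only `AccPerc` differs, by a factor of 2, which matches how `loc_values` was populated.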

## Parse and embed PDF Documents to SingleStore
This lets you parse your PDFs, embed them, and store the embeddings in SingleStore.

In [None]:
!pip install llama-cpp-python pypdf singlestoredb langchain_experimental -q
!curl --silent -o paper.pdf https://arxiv.org/pdf/2303.13519.pdf > /dev/null


In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("./paper.pdf")
pages = loader.load_and_split()
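`load_and_split()` returns the PDF text as a list of smaller documents, so that each piece can be embedded on its own. The sketch below illustrates the general idea with a fixed-size window and overlap. It is a deliberately simplified stand-in, not LangChain's splitter, and `split_with_overlap` with its parameters is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.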

In [None]:
import os
from langchain.embeddings import GPT4AllEmbeddings
from langchain.vectorstores import SingleStoreDB

# SingleStore connection URL
os.environ["SINGLESTOREDB_URL"] = "admin:SingleStore1!@svc-f1a640fd-31f3-4150-8558-8ee0260c94ad-dml.aws-virginia-5.svc.singlestore.com:3306/s2labs"

# The path of where your pdfs are stored
f_path  = './pdfs'
pdfs    = os.listdir(f_path)

# Loop through each file and load it into SingleStore
for p in pdfs:
  if p.endswith('.pdf'):
    loader = PyPDFLoader(f'{f_path}/{p}')
    pages = loader.load_and_split()
    vectorstore = SingleStoreDB.from_documents(
      documents=pages,
      embedding=GPT4AllEmbeddings(),
      table_name="my_docs"
    )

Found model file at  /root/.cache/gpt4all/ggml-all-MiniLM-L6-v2-f16.bin


In [None]:
# Verify the document has been embedded
import singlestoredb
conn = singlestoredb.connect()
cur = conn.cursor()

cur.execute('SELECT * FROM my_docs LIMIT 10')
for r in cur:
  print(r)

In [None]:
# Ask questions about your documents using vector similarity search
# (the distance metric is whichever one the vector store was configured with)
query = "Write a summary about these documents"
docs = vectorstore.similarity_search(query)  # Find documents that correspond to the query
print(docs[0].page_content)
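Similarity search ranks stored chunks by how close their embedding vectors are to the query's embedding, and the ranking depends on the metric. The sketch below contrasts dot product with Euclidean distance. The two-dimensional vectors are made up for illustration and are not real embeddings, and `rank` is a hypothetical helper.

```python
import math


def dot(a, b):
    """Dot product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank(query, docs, metric="dot"):
    """Return document names ordered from most to least similar to the query."""
    if metric == "dot":
        return sorted(docs, key=lambda name: dot(query, docs[name]), reverse=True)
    if metric == "euclidean":
        return sorted(docs, key=lambda name: euclidean(query, docs[name]))
    raise ValueError(f"unknown metric: {metric}")


# Toy vectors, not produced by any embedding model.
query = [1.0, 0.0]
docs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [3.0, 3.0]}

by_dot = rank(query, docs, metric="dot")
by_euclidean = rank(query, docs, metric="euclidean")
print(by_dot)        # ['c', 'a', 'b']
print(by_euclidean)  # ['a', 'b', 'c']
```

The long vector `c` wins under dot product but comes last under Euclidean distance, which is why it matters to know which metric the store uses and whether the embeddings are normalised.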