## Approach

Extract features from whitepapers.

No need for real understanding here (models can't do this yet), so we don't really need NLP: text extraction (using OCR) that we can then search for features is enough.

Features from whitepapers:

1. Number of pages
2. Number of words
3. Number of equations (?)
4. Number of images
5. Number of references
6. Acronym count: pow, pos, apy, roi
7. Word count: leverage, price, attack, token
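A minimal sketch of how features 6 and 7 could be counted once the text is available. The names `ACRONYMS`, `KEYWORDS` and `count_terms` are made up for illustration, and whole-word, case-insensitive matching is an assumption here (so that "pos" doesn't match "position").

```python
import re
from collections import Counter

# Term lists taken from the feature list above; the constant names are illustrative.
ACRONYMS = ("pow", "pos", "apy", "roi")
KEYWORDS = ("leverage", "price", "attack", "token")


def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term in text."""
    # Split into lowercase alphabetic words so "pos" does not match "position".
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: words[term] for term in terms}
```

Note that this only matches exact words, so "tokens" would not count towards "token"; whether plurals should count is a choice still to be made.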

## Experiments

### Nougat

OK, so this takes way too long without a GPU. Not an option.

### Document LLM

Let's try a document understanding model from Hugging Face.

In [2]:
from transformers import pipeline

In [3]:
nlp = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

Basically, the pipeline needs individual page images as input, and the per-page outputs then have to be aggregated... which I can't be bothered to do.
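For reference, the aggregation step could look roughly like the sketch below. It assumes each page has already been run through the pipeline, and that each result is either a dict or a list of dicts with a `score` and an `answer` key. `best_answer` is a hypothetical helper, and keeping the highest-scoring answer is just one possible aggregation rule.

```python
def best_answer(page_results):
    """Return (page_index, answer) with the highest score across pages, or None."""
    best = None
    for page_index, result in enumerate(page_results):
        # Accept either a single answer dict or a list of candidate answers.
        candidates = [result] if isinstance(result, dict) else list(result)
        for candidate in candidates:
            if best is None or candidate["score"] > best[1]["score"]:
                best = (page_index, candidate)
    return best
```

The per-page results would come from something like `nlp(image=page_image, question=...)`, with `page_image` rendered from the PDF by a separate tool (not tried here).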

### Element Extraction

In [7]:
from PyPDF2 import PdfReader

In [33]:
reader = PdfReader("whitepaper_examples/Bitcoin_BTC.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[8]
text = page.extract_text()

In [51]:
reader.pdf_header

'%PDF-1.5'

In [34]:
print(text)

References
[1] W. Dai, "b-money," http://www.weidai.com/bmoney.txt, 1998.
[2] H. Massias, X.S. Avila, and J.-J. Quisquater, "Design of a secure timest amping service with minimal 
trust requirements," In 20th Symposium on Information Theory in the Benelux , May 1999.
[3] S. Haber, W.S. Stornetta, "How to time-stamp a digital document," In  Journal of Cryptology , vol 3, no 
2, pages 99-111, 1991.
[4] D. Bayer, S. Haber, W.S. Stornetta, "Improving the efficiency and re liability of digital time-stamping," 
In Sequences II: Methods in Communication, Security and Computer Science , pages 329-334, 1993.
[5] S. Haber, W.S. Stornetta, "Secure names for bit-strings," In Proceedings of the 4th ACM Conference 
on Computer and Communications Security , pages 28-35, April 1997.
[6] A. Back, "Hashcash - a denial of service counter-measure," 
http://www.hashcash.org/papers/hashcash.pdf, 2002.
[7] R.C. Merkle, "Protocols for public key cryptosystems," In Proc. 1980 Symposium on Security and 
Privacy

In [25]:
number_of_pages

9

In [29]:
reader = PdfReader("whitepaper_examples/Binance_BNB.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[5]
text = page.extract_text()

In [31]:
len(page.images)

3

## Extraction Examples

### How to get features

1. Number of pages --> `len(reader.pages)`
2. Number of words --> split each page's text on whitespace, sum the counts over all pages
3. Number of equations --> count "=" characters in the page text (rough proxy)
4. Number of images --> `len(page.images)`, summed over all pages
5. Number of references --> search for "References" as the first element of a page, count the refs below it
6. Acronym count: pow, pos, apy, roi --> count occurrences in the page text
7. Word count: leverage, price, attack, token --> count occurrences in the page text
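Features 1 to 5 could be put together roughly as follows. This is a sketch, not a tested pipeline: the helper names are invented, the `[n]` reference pattern is assumed from the Bitcoin example above, and counting "=" is only a crude proxy for equations.

```python
import re


def count_words(text):
    """Number of whitespace-separated tokens in the text."""
    return len(text.split())


def count_equations(text):
    """Crude proxy: every "=" character counts as one equation."""
    return text.count("=")


def count_references(page_text):
    """Count "[n]" entries on a page whose first non-empty line is "References"."""
    lines = [line.strip() for line in page_text.splitlines() if line.strip()]
    if not lines or lines[0] != "References":
        return 0
    return sum(1 for line in lines[1:] if re.match(r"\[\d+\]", line))


def pdf_features(path):
    """Collect features 1 to 5 for one PDF."""
    # Imported here so the text helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    return {
        "pages": len(reader.pages),
        "words": sum(count_words(t) for t in texts),
        "equations": sum(count_equations(t) for t in texts),
        "images": sum(len(page.images) for page in reader.pages),
        "references": sum(count_references(t) for t in texts),
    }
```

Usage would be something like `pdf_features("whitepaper_examples/Bitcoin_BTC.pdf")`. A reference list that continues onto a second page would be undercounted, since only pages starting with "References" are considered.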