## Approach

Extract features from whitepapers.

No real understanding of the text is needed here (models can't do this yet), so we don't really need NLP: just text extraction (OCR or similar), after which we can search the extracted text for features.

Features from whitepapers:

1. Number of pages
2. Number of words
3. Number of equations (?)
4. Number of images
5. Number of references
6. Acronym count: pow, pos, apy, roi
7. Word count: leverage, price, attack, token

## Experiments

### Nougat

Ok yeah so this takes waaaay too long without a GPU. Not an option.

### Document LLM

Let's try a document understanding model from Hugging Face.

In [2]:
from transformers import pipeline

In [3]:
nlp = pipeline(
    "document-question-answering",
    model="impira/layoutlm-document-qa",
)

Basically need to feed it individual page images and then aggregate the outputs... which I can't be bothered to do.
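For the record, the aggregation step itself could be small. The sketch below assumes each page has already been run through the pipeline separately and that every result is a dict with `answer` and `score` keys; the function name `best_answer` and the sample values are made up for illustration, and the real pipeline output may carry more fields.

```python
def best_answer(per_page_results):
    """Pick the highest-scoring answer across pages.

    per_page_results: list of (page_number, results) pairs, where results is
    a list of dicts with at least 'answer' and 'score' (assumed shape).
    """
    candidates = [
        (result["score"], page_number, result["answer"])
        for page_number, results in per_page_results
        for result in results
    ]
    if not candidates:
        return None  # no page produced any answer
    score, page_number, answer = max(candidates)
    return {"page": page_number, "answer": answer, "score": score}


# Made-up per-page outputs, only to show the shape of the input
sample = [
    (1, [{"answer": "9", "score": 0.12}]),
    (2, []),
    (3, [{"answer": "21 million", "score": 0.87}, {"answer": "2008", "score": 0.05}]),
]
print(best_answer(sample))
```

Taking the single best score is only one possible rule; for count-style questions it may make more sense to sum or vote across pages.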

### Element Extraction

In [7]:
from PyPDF2 import PdfReader  

In [33]:
reader = PdfReader("whitepaper_examples/Bitcoin_BTC.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[8]
text = page.extract_text()

In [51]:
reader.pdf_header

'%PDF-1.5'

In [34]:
print(text)

References
[1] W. Dai, "b-money," http://www.weidai.com/bmoney.txt, 1998.
[2] H. Massias, X.S. Avila, and J.-J. Quisquater, "Design of a secure timest amping service with minimal 
trust requirements," In 20th Symposium on Information Theory in the Benelux , May 1999.
[3] S. Haber, W.S. Stornetta, "How to time-stamp a digital document," In  Journal of Cryptology , vol 3, no 
2, pages 99-111, 1991.
[4] D. Bayer, S. Haber, W.S. Stornetta, "Improving the efficiency and re liability of digital time-stamping," 
In Sequences II: Methods in Communication, Security and Computer Science , pages 329-334, 1993.
[5] S. Haber, W.S. Stornetta, "Secure names for bit-strings," In Proceedings of the 4th ACM Conference 
on Computer and Communications Security , pages 28-35, April 1997.
[6] A. Back, "Hashcash - a denial of service counter-measure," 
http://www.hashcash.org/papers/hashcash.pdf, 2002.
[7] R.C. Merkle, "Protocols for public key cryptosystems," In Proc. 1980 Symposium on Security and 
Privacy
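Given page text like the output above, counting references can be sketched as "find the References heading, then count lines that start with a bracketed number". This is a rough sketch under that assumption: `count_references` is a hypothetical helper, and it will miss reference lists that use another numbering style or that continue onto a following page.

```python
import re


def count_references(page_text: str) -> int:
    lines = page_text.splitlines()
    # Locate the heading line; stop if the page has no reference section
    start = None
    for index, line in enumerate(lines):
        if line.strip().lower() == "references":
            start = index + 1
            break
    if start is None:
        return 0
    # Entries begin with "[n]"; wrapped continuation lines do not
    numbers = set()
    for line in lines[start:]:
        match = re.match(r"\s*\[(\d+)\]", line)
        if match:
            numbers.add(int(match.group(1)))
    return len(numbers)


sample = """References
[1] W. Dai, "b-money," http://www.weidai.com/bmoney.txt, 1998.
[2] H. Massias, X.S. Avila, and J.-J. Quisquater, "Design of a secure timestamping service with minimal
trust requirements," In 20th Symposium on Information Theory in the Benelux, May 1999.
[3] S. Haber, W.S. Stornetta, "How to time-stamp a digital document," In Journal of Cryptology, vol 3, no
2, pages 99-111, 1991."""
print(count_references(sample))
```

Collecting the numbers in a set keeps a reference from being counted twice if the same marker appears at the start of more than one line.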

In [25]:
number_of_pages

9

In [29]:
reader = PdfReader("whitepaper_examples/Binance_BNB.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[5]
text = page.extract_text()

In [31]:
len(page.images)

3
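Since `page.images` is per page, a document-level image count needs a sum over all pages. A minimal sketch, with `count_images` as a hypothetical helper name; the stand-in pages below only mimic the `.images` attribute so the snippet runs without a PDF.

```python
from types import SimpleNamespace


def count_images(pages) -> int:
    # Works on any iterable of objects exposing an `images` sequence,
    # e.g. the `reader.pages` of a PdfReader
    return sum(len(page.images) for page in pages)


# Stand-in pages for illustration only
fake_pages = [
    SimpleNamespace(images=["a", "b", "c"]),
    SimpleNamespace(images=[]),
    SimpleNamespace(images=["d"]),
]
print(count_images(fake_pages))
```

With a real file this would be called as `count_images(reader.pages)`. As far as I know, `page.images` only reports embedded raster images, so diagrams drawn as vector graphics would not be counted.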

### How to get features

1. Number of pages --> `len(reader.pages)`
2. Number of words --> split each page's text on whitespace, sum the counts over all pages
3. Number of equations --> count occurrences of "=" (rough proxy)
4. Number of images --> sum `len(page.images)` over all pages
5. Number of references --> find the page whose text starts with "References", count the numbered entries below it
6. Acronym count: pow, pos, apy, roi --> search in page text
7. Word count: leverage, price, attack, token --> search in page text
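The text-based items in the list above can be sketched with the standard library alone. The helper names are hypothetical, and matching is assumed to be case-insensitive and whole-word, so that "pos" does not match inside "position"; that choice is mine, not something the list specifies.

```python
import re
from collections import Counter

ACRONYMS = ("pow", "pos", "apy", "roi")
KEYWORDS = ("leverage", "price", "attack", "token")


def count_words(text: str) -> int:
    # split() with no argument splits on any run of whitespace
    return len(text.split())


def count_terms(text: str, terms) -> dict:
    # Lowercase the text, tokenize into alphanumeric runs, count exact matches
    tokens = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return {term: tokens[term] for term in terms}


def count_equations(text: str) -> int:
    # Crude proxy: every "=" counts as one equation
    return text.count("=")


sample = "PoS and PoW differ. The token price moved; position size = 2 tokens."
print(count_words(sample))
print(count_terms(sample, ACRONYMS))
print(count_terms(sample, KEYWORDS))
print(count_equations(sample))
```

Exact matching means plurals such as "tokens" are not counted towards "token"; whether they should be is a feature-design decision.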

## Accessing white papers

In [30]:
import requests
import io
from bs4 import BeautifulSoup as bts

In [64]:
url = 'https://whitepaper.io/document/718/ethereum-whitepaper'
result = requests.get(url, headers={"User-Agent":"Mozilla/5.0"})
soup = bts(result.text, 'html.parser')

In [65]:
soup

In [66]:
soup.select_one("span").attrs.get("data-value", None)

In [67]:
soup.find('div', class_="flex flex-col flex-1").object.attrs['data']

'https://api-new.whitepaper.io/documents/pdf?id=H1ugBX9Bd'
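The lookup above raises an `AttributeError` if the `div` or the embedded `object` is missing. A more defensive variant is sketched below. Note that it swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies, and that it simply takes the first `<object>` tag with a `data` attribute. The names `ObjectDataFinder` and `find_pdf_url` and the sample HTML are illustrative only.

```python
from html.parser import HTMLParser


class ObjectDataFinder(HTMLParser):
    """Remember the `data` attribute of the first <object> tag that has one."""

    def __init__(self):
        super().__init__()
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        if tag == "object" and self.pdf_url is None:
            self.pdf_url = dict(attrs).get("data")


def find_pdf_url(html: str):
    finder = ObjectDataFinder()
    finder.feed(html)
    return finder.pdf_url  # None when no embedded object was found


sample_html = (
    '<div class="flex flex-col flex-1">'
    '<object data="https://example.invalid/doc.pdf" type="application/pdf"></object>'
    "</div>"
)
print(find_pdf_url(sample_html))
print(find_pdf_url("<p>no embed here</p>"))
```

Returning `None` instead of raising makes it easy to skip pages where the PDF is not embedded this way.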

In [1]:
urlpdf='https://api-new.whitepaper.io/documents/pdf?id=SksIiBd6z'

In [26]:
response = requests.get(urlpdf)
with io.BytesIO(response.content) as f:
    pdf = PdfReader(f)
    number_of_pages = len(pdf.pages)
    # Use .images: len() of the page itself counts the keys of the page dictionary, not images
    number_images = len(pdf.pages[5].images)
    print(pdf.pages[5].extract_text())

10. Privacy
The traditional banking model achieves a level of p rivacy by limiting access to information to the 
parties involved and the trusted third party.  The necessity to announce all transactions publicly 
precludes this method, but privacy can still be mai ntained by breaking the flow of information in 
another place: by keeping public keys anonymous.  T he public can see that someone is sending 
an amount to someone else, but without information linking the transaction to anyone.  This is 
similar to the level of information released by sto ck exchanges, where the time and size of 
individual trades, the "tape", is made public, but without telling who the parties were.
As an additional firewall, a new key pair should be  used for each transaction to keep them 
from being linked to a common owner.  Some linking is still unavoidable with multi-input 
transactions, which necessarily reveal that their i nputs were owned by the same owner.  The risk 
is that if the owner of a key i
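When downloading many whitepapers, some responses may turn out to be error pages rather than PDFs. A cheap guard is to check for the `%PDF-` marker (the same header seen earlier via `reader.pdf_header`) before handing the bytes to `PdfReader`. This is a sketch with a hypothetical helper name, and the 1024-byte window is an assumption meant to tolerate a little leading junk.

```python
def looks_like_pdf(content: bytes) -> bool:
    # PDF files start with a "%PDF-<version>" header; allow some leading bytes
    return b"%PDF-" in content[:1024]


print(looks_like_pdf(b"%PDF-1.5\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html></html>"))
```

In the download cell this would be used as `if looks_like_pdf(response.content): ...`, skipping or logging anything that fails the check.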

In [28]:
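number_images

6

Careful: `len()` of a PyPDF2 page returns the number of keys in the page dictionary, so this 6 is not an image count. The image count for the page is `len(pdf.pages[5].images)`.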