# Introduction

In this notebook, we will use `qwen2.5:14b` for text analysis and `Qwen-VL` for image analysis.

For text/table extraction, we will use [Unstructured.io](https://unstructured.io/). To keep things easy to follow, this lesson demonstrates how to extract data from a single PDF document.

Finally, we will use [Tesla's Q3 2023 financial report](https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf) as the source document.

# Example of using Qwen-VL

A good guide to using `Qwen-VL` is the [official tutorial](https://github.com/QwenLM/Qwen-VL/blob/master/TUTORIAL.md).

In [1]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig

# If you expect the results to be reproducible, set a random seed.
# torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".



In [2]:
query = tokenizer.from_list_format([
    {'image': 'Rebecca_(1939_poster)_Small.jpeg'},
    {'text': 'What is the name of the movie in the poster?'},
])

In [3]:
response, history = model.chat(tokenizer, query=query, history=None)
print(response)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


The name of the movie in the poster is "Rebecca."
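
The query above is a list of dictionaries, with image entries followed by a text entry. As a minimal sketch, the helper below builds that list for one or more images; `build_list_format_query` is a hypothetical name introduced here for illustration and is not part of Qwen-VL. It only prepares the input and does not call the model.

```python
def build_list_format_query(image_paths, question):
    """Build the list of dicts that is passed to `tokenizer.from_list_format`."""
    # One {'image': ...} entry per image, followed by a single {'text': ...} entry.
    entries = [{"image": str(path)} for path in image_paths]
    entries.append({"text": question})
    return entries


entries = build_list_format_query(
    ["Rebecca_(1939_poster)_Small.jpeg"],
    "What is the name of the movie in the poster?",
)
print(entries)
# With the tokenizer loaded earlier: query = tokenizer.from_list_format(entries)
```

Keeping the list construction separate makes it easy to reuse the same question across several images.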


# Get PDF

The first step is to download the report used as the source document for this lesson.

In [8]:
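from pathlib import Path

import requests

url = "https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2023-Update-3.pdf"
output_file = "./data/TeslaQ3.pdf"
try:
    # A timeout keeps the cell from hanging if the server does not respond.
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raise on 4xx/5xx responses

    # Make sure the target directory exists before writing.
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, "wb") as file:
        file.write(response.content)

    print(f"File downloaded and saved as: {output_file}")
except requests.exceptions.RequestException as e:
    print(f"Error downloading file: {e}")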


File downloaded and saved as: ./data/TeslaQ3.pdf
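
A successful HTTP response does not guarantee that the saved file is a PDF (a server can return an HTML page instead). As a quick sanity check, the sketch below tests whether the file starts with the `%PDF-` header; `looks_like_pdf` is a hypothetical helper added here for illustration, and it returns `False` if the file does not exist.

```python
from pathlib import Path


def looks_like_pdf(path):
    """Return True if `path` is an existing file that starts with the PDF header."""
    file_path = Path(path)
    if not file_path.is_file():
        return False
    with file_path.open("rb") as handle:
        # PDF files begin with "%PDF-" followed by a version number.
        return handle.read(5) == b"%PDF-"


# Path taken from the download cell above.
print(looks_like_pdf("./data/TeslaQ3.pdf"))
```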


# Text/tables

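Unstructured's PDF partitioning relies on two external tools, [poppler](https://poppler.freedesktop.org/) and [tesseract](https://github.com/tesseract-ocr/tesseract), which must be installed and available on `PATH`. If poppler is missing, `partition_pdf` fails with the `PDFInfoNotInstalledError` shown in the output below.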
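
Before running the partitioning cell, it can help to confirm that both tools are reachable. The sketch below uses the standard library's `shutil.which` to look for `pdfinfo` (a command-line tool that ships with poppler) and `tesseract`; `find_missing_tools` is a hypothetical helper name introduced for this example.

```python
import shutil


def find_missing_tools(executables):
    """Return the names in `executables` that are not found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


# `pdfinfo` ships with poppler; `tesseract` is the OCR engine's command-line binary.
missing = find_missing_tools(["pdfinfo", "tesseract"])
if missing:
    print(f"Not found on PATH: {', '.join(missing)}")
else:
    print("poppler and tesseract were found on PATH.")
```

If either tool is reported as missing, install it with your system's package manager and restart the notebook kernel so the updated `PATH` is picked up.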

In [10]:
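from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="./data/TeslaQ3.pdf",
    infer_table_structure=True,       # keep the structure of extracted tables
    chunking_strategy="by_title",     # start a new chunk at each detected title
    max_characters=4000,              # hard limit on chunk size
    new_after_n_chars=3800,           # soft limit: start a new chunk past this length
    combine_text_under_n_chars=2000,  # merge sections shorter than this into one chunk
)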


PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?