# GLM-4 Vision Small and Complex Text Recognition Task

This tutorial demonstrates GLM-4V's ability to recognize small text in complex images through two tasks: reading densely packed handwriting and identifying locations on a map.

## 1. Set your Environment
First, import the libraries, set the API key as an environment variable, and create the client. Then define a helper that converts a JPEG or PNG image to a base64-encoded string.

In [1]:
import os
from zhipuai import ZhipuAI

os.environ["ZHIPUAI_API_KEY"] = "your api key"
client = ZhipuAI()
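
Hard-coding the key in a notebook makes it easy to leak. As a minimal sketch, assuming the key has already been exported in the shell under the same `ZHIPUAI_API_KEY` name used above, a small check can fail early with a clear message. The helper name `require_api_key` is hypothetical and is not part of the `zhipuai` SDK.

```python
import os


def require_api_key(var_name="ZHIPUAI_API_KEY"):
    """Return the API key from the environment, or raise if it is missing.

    Hypothetical helper for illustration; not part of the zhipuai SDK.
    """
    key = os.environ.get(var_name, "").strip()
    # Treat the tutorial's placeholder value as "not set".
    if not key or key == "your api key":
        raise RuntimeError(
            f"Set the {var_name} environment variable before creating the client."
        )
    return key
```

Calling `require_api_key()` just before `client = ZhipuAI()` surfaces a missing key before the first request is sent.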

In [2]:
import base64
import io
from PIL import Image


def image_to_base64(image_path):
    """Convert an image to a base64-encoded PNG string."""
    with Image.open(image_path) as image:
        buffered = io.BytesIO()
        # Re-encode as PNG; use format="JPEG" instead for JPEG input.
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue()).decode()
    return img_str


## 2. Recognize small, crammed handwriting
<img src="data/scanned_notes.png" width="500" >



Above is a page of linear algebra theorems handwritten by a student. Because of the limited paper size, the writing is very small, sometimes crammed together, and full of mathematical symbols. Copying the page out again by hand in order to reuse its content would be tedious, so we use GLM-4V's image recognition to reduce the student's workload.

Here, we ask the model what Theorem 3.8 says:

In [3]:
def analyze_image(image_path, question):
    """
    Ask GLM-4V a question about an image.

    Parameters:
        image_path (str): The path to the image file.
        question (str): The question to ask about the image.

    Returns:
        str: The answer to the question.
    """
    # Encode the image so it can be sent inline with the request.
    base64_image = image_to_base64(image_path)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": question
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_image
                    }
                }
            ]
        }
    ]

    # Low temperature and top_p keep the answer close to deterministic.
    response = client.chat.completions.create(
        model="glm-4v",
        messages=messages,
        temperature=0.1,
        top_p=0.1
    )

    return response.choices[0].message.content

answer = analyze_image(
    "data/scanned_notes.png",
    "From this scanned photo of handwritten notes, what is theorem 3.8?",
)

print(answer)

Theorem 3.8 states: "Let T: R^m → R^n. Then T is a linear transformation if and only if T(x) = Ax for some nxm matrix A." 

This means that a function T from R^m to R^n is a linear transformation if it can be represented by multiplying every vector in its domain by a fixed nxm matrix A. Conversely, any function satisfying this equation is necessarily a linear transformation.


Passing the base64 string to the model returns the response above. The theorem is transcribed correctly, and the model also adds a brief two-sentence explanation.

At other times, the returned theorem is mostly correct but contains a small error (marked in red):

**From this scanned photo of handwritten notes, what is theorem 3.23?**

<blockquote style="padding: 10px; border-left: 5px solid #cccccc;">

Theorem 3.23 from the provided image states: "Suppose that A is an invertible matrix with AB=I_n. Then BA=I_<span style="color:red">m</span>, and the matrix B such that AB=BA=I_n is unique."

</blockquote>
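
For reference, here is a sketch of what the response to the Theorem 3.23 question should have said. It assumes the notes state the standard result for a square invertible matrix; the scan itself is not reproduced here.

```latex
\[
A \in \mathbb{R}^{n \times n} \text{ invertible},\quad AB = I_n
\;\Longrightarrow\;
B = A^{-1}(AB) = A^{-1}
\;\Longrightarrow\;
BA = A^{-1}A = I_n .
\]
```

Under that assumption, the subscript $m$ in the model's $BA = I_m$ should be $n$, and $B = A^{-1}$ is the unique matrix with $AB = BA = I_n$.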

## 3. Identify locations with different colors in maps with complex backgrounds
<img src="data/map.png" width="500">

Now we use GLM-4V to identify the places shown in the Google Maps screenshot above.

In [4]:
def analyze_image(image_path, question):
    """
    Ask GLM-4V a question about an image.

    Parameters:
        image_path (str): The path to the image file.
        question (str): The question to ask about the image.

    Returns:
        str: The answer to the question.
    """
    # Encode the image so it can be sent inline with the request.
    base64_image = image_to_base64(image_path)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": question
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_image
                    }
                }
            ]
        }
    ]

    # Low temperature and top_p keep the answer close to deterministic.
    response = client.chat.completions.create(
        model="glm-4v",
        messages=messages,
        temperature=0.1,
        top_p=0.1
    )

    return response.choices[0].message.content

answer = analyze_image("data/map.png", "What are the names of the locations?")

print(answer)

The image shows a map with several labeled locations. From top to bottom and from left to right, the visible labels include:

1. USPS Distribution Center
2. Blue Plate
3. Ever
4. SalonDJ Hairtography
5. W Fulton St
6. N Ada St
7. W Fulton Market
8. The Elizabeth Chicago
9. Peoria Packing Butcher Shop
10. Latin Rhythms Academy of Dance & Performance
11. Bottom Lounge
12. Gyu-Kaku Japanese BBQ
13. City Winery Chicago
14. Kaiser Tiger Beer Garden


<blockquote style="text-color: #f0f0f0; padding: 10px; border-left: 5px solid #cccccc;">

The image shows a map with several labeled locations. From top to bottom and from left to right, the visible labels include:
1. USPS Distribution Center
2. Blue Plate
3. Ever
4. SalonDJ Hairtography
5. W Fulton St
6. <span style="color:red">N Ada St</span>
7. W Fulton Market
8. The Elizabeth Chicago
9. Peoria Packing Butcher Shop
10. Latin Rhythms Academy of Dance & Performance
11. Bottom Lounge
12. Gyu-Kaku Japanese BBQ
13. City Winery Chicago
14. Kaiser Tiger Beer Garden

</blockquote>

Even against the cluttered map background, the model reads the place names largely accurately, with only one error (marked in red).
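
Checking a response like this by eye gets tedious as the list grows. As a minimal sketch, the numbered items can be parsed out of the response text and compared with a hand-written list of expected labels. `parse_numbered_list` is a hypothetical helper, and the `expected` set below is illustrative only; it is not a verified ground truth for the map.

```python
import re


def parse_numbered_list(text):
    """Extract the items of a '1. item' style list from a model response."""
    items = []
    for line in text.splitlines():
        # Match an optional indent, a number, a dot, whitespace, then the item.
        match = re.match(r"\s*\d+\.\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items


# A shortened copy of the response above, used as sample input.
response_text = """The image shows a map with several labeled locations.
1. USPS Distribution Center
2. Blue Plate
3. Ever
"""

labels = parse_numbered_list(response_text)

# Illustrative expected labels; replace with labels checked against the image.
expected = {"USPS Distribution Center", "Blue Plate", "Ever"}
unexpected = [label for label in labels if label not in expected]

print(labels)
print(unexpected)
```

Anything left in `unexpected` is a candidate misreading to inspect against the image.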

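The model can also distinguish locations by marker color (errors are marked in red):

**What are the names of the locations in orange?**

<blockquote style="padding: 10px; border-left: 5px solid #cccccc;">

The locations with an orange icon on the map are:

1. <span style="color:red">Blue Plate</span>
2. Ever
3. Gyu-Kaku Japanese BBQ
4. Kaiser Tiger Beer Garden

</blockquote>

**What are the names of the locations in blue?**

<blockquote style="padding: 10px; border-left: 5px solid #cccccc;">

The locations in blue on the map are:

1. Peoria Packing Butcher Shop
2. City Winery Chicago
3. <span style="color:red">Solo Salon – Elizabeth</span>

</blockquote>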
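
Asking several color-specific questions about the same image only changes the text part of the request. A minimal sketch, assuming the same message layout as the cells above, builds one payload per question. `build_messages` is a hypothetical helper name, and `"<base64 string>"` is a placeholder for the output of `image_to_base64("data/map.png")`.

```python
def build_messages(question, base64_image):
    """Build the single-turn message list used in the cells above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": base64_image}},
            ],
        }
    ]


questions = [
    "What are the names of the locations in orange?",
    "What are the names of the locations in blue?",
]

# Placeholder image data; in practice pass image_to_base64("data/map.png").
payloads = [build_messages(q, "<base64 string>") for q in questions]
```

Each entry in `payloads` can then be passed as `messages=` to `client.chat.completions.create`, exactly as in the cells above.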

## 4. Bonus: Asking GLM-4V whether 9.9 or 9.11 is larger

As of July 15, 2024, many large models fail to compare 9.11 and 9.9 correctly, which is a curious small weakness. We give GLM-4V an image with the text "Which number is larger between 9.9 and 9.11?" at its center and instruct it to answer the question.


<img src="data/compare_numbers.png" width="500">

We get the following response, which wrongly concludes that 9.11 is larger:

```
To determine which number is larger between 9.9 and 9.11, you can look at them side by side:

- 9.9 has a decimal point followed by 2 digits, which are 9 and 9.
- 9.11 has a decimal point followed by 3 digits, which are 9, 1, and 1.

Since 9.11 has more digits after the decimal point than 9.9 does, it is the larger number.
```

In [5]:
def analyze_image(image_path, question):
    """
    Ask GLM-4V a question about an image.

    Parameters:
        image_path (str): The path to the image file.
        question (str): The question to ask about the image.

    Returns:
        str: The answer to the question.
    """
    # Encode the image so it can be sent inline with the request.
    base64_image = image_to_base64(image_path)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": question
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": base64_image
                    }
                }
            ]
        }
    ]

    # Low temperature and top_p keep the answer close to deterministic.
    response = client.chat.completions.create(
        model="glm-4v",
        messages=messages,
        temperature=0.1,
        top_p=0.1
    )

    return response.choices[0].message.content

answer = analyze_image("data/compare_numbers.png", "Answer the question thoughtfully")

print(answer)

To determine which number is larger between 9.9 and 9.11, we can look at them side by side:

- 9.9 has a decimal point followed by 2 digits, which are 9 and 9.
- 9.11 has a decimal point followed by 3 digits, which are 9, 1, and 1.

We compare these numbers from left to right, ignoring any zeros that might be in front of the decimal point since they don't affect the comparison.

Since 9 is greater than 1, and both have the same number after the decimal point (which is 9), 9.9 is larger than 9.11.

Therefore, the answer is that 9.9 is larger than 9.11.


With the prompt steering it in this way, GLM-4V extracts the question from the image, avoids the pitfall, and answers correctly.
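
The mistake in the first response is to compare the numbers by how many digits follow the decimal point. As a quick check of the arithmetic, the sketch below pads both numbers to the same number of decimal places and compares them exactly with Python's standard `decimal` module. The variable names are illustrative.

```python
from decimal import Decimal

a = Decimal("9.9")
b = Decimal("9.11")

# Pad both numbers to the same number of decimal places before comparing.
places = max(-a.as_tuple().exponent, -b.as_tuple().exponent)
padded_a = f"{a:.{places}f}"  # "9.90"
padded_b = f"{b:.{places}f}"  # "9.11"

# After padding, 90 hundredths is clearly more than 11 hundredths.
print(padded_a, padded_b, a > b)
```

Written as 9.90 and 9.11, the two numbers share the integer part 9, and 90 hundredths exceeds 11 hundredths, so 9.9 is the larger number.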