# GLM-4-Long: Convert a Lecture Transcript to Note-taking Format

This tutorial demonstrates how GLM-4-Long handles long text through the task of converting a lecture transcript into note-taking format. In daily life, we often need to read lengthy texts, which can be time-consuming. For example, a student who missed a lecture or is nearing a deadline might need to quickly grasp the main points of a long text. Summarizing long texts, such as lecture transcripts, is very useful in these scenarios.


## 1. Set up the API Key


First, set the ZHIPUAI_API_KEY environment variable and initialize the ZhipuAI client.

In [6]:
import os
from zhipuai import ZhipuAI

os.environ["ZHIPUAI_API_KEY"] = "your api key"
client = ZhipuAI()
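
The cell above writes the key into the notebook itself. As a hedged alternative, the sketch below reads a key that was already exported in the shell and fails early with a clear message if it is missing. The helper name `require_api_key` and the `DEMO_API_KEY` variable are hypothetical and exist only for this illustration.

```python
import os

def require_api_key(env_var="ZHIPUAI_API_KEY"):
    """Return the value of env_var, or raise a clear error if it is unset or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the client.")
    return key

# Demonstration with a throwaway variable so the example runs without a real key.
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_api_key("DEMO_API_KEY"))
```

With a real key exported as ZHIPUAI_API_KEY, calling `require_api_key()` before `ZhipuAI()` turns a missing key into an immediate, readable error instead of a failed request later on.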

## 2. Open the Lecture Transcript File

Then, open the lecture transcript file and specify its word count.

In [7]:
lecture_transcript_path = "data/lecture_transcript.txt"

with open(lecture_transcript_path, "r", encoding="utf-8") as file:
    lecture_text = file.read()

WORD_COUNT = 20899
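
The word count above is entered by hand. A minimal sketch of deriving it from the text instead, assuming that whitespace-separated tokens are an acceptable definition of a word (the same rule the chunking step uses to measure sentences); `count_words` and the sample string are hypothetical names for this illustration.

```python
def count_words(text):
    """Count whitespace-separated tokens in text."""
    return len(text.split())

sample = "Today we will cover gradient descent. It is an iterative method."
print(count_words(sample))  # prints 11
```

Applied to the notebook, this would read `WORD_COUNT = count_words(lecture_text)`. The result may differ slightly from a count produced by a word processor, which can treat punctuation and hyphens differently.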

## 3. Preprocess and Chunk the Transcript File

Now, install spaCy, an NLP library, and load the en_core_web_sm model, a small English language model used to preprocess the file. We then define two functions: one splits the text into a list of sentences, and the other groups the sentences into chunks with a specified maximum length, here one fiftieth of the total word count (roughly 418 words for this transcript).

In [8]:
!pip3 install spacy
!python -m spacy download en_core_web_sm

import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    doc = nlp(text)
    sentences = [sent.text for sent in doc.sents]
    return sentences

def chunk_text(sentences, max_chunk_size=WORD_COUNT/50):
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence.split())
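        # Start a new chunk when this sentence would push the current one past the limit.
        # Requiring a non-empty current_chunk avoids emitting an empty chunk when a
        # single sentence is longer than max_chunk_size.
        if current_chunk and current_length + sentence_length > max_chunk_size: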
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(sentence)
        current_length += sentence_length

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks
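
To make the chunking behavior concrete, here is a self-contained sketch of the same greedy packing idea on toy sentences, without spaCy. The function name `pack_sentences` and the toy data are hypothetical. Sentences are added to the current chunk until the next one would exceed the word limit, at which point a new chunk begins.

```python
def pack_sentences(sentences, max_words):
    """Greedily group sentences into chunks of at most max_words words each.

    A single sentence longer than max_words still becomes its own chunk.
    """
    chunks, current, length = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and length + n > max_words:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n
    if current:
        chunks.append(" ".join(current))
    return chunks

toy = ["One two three.", "Four five.", "Six seven eight nine.", "Ten."]
for chunk in pack_sentences(toy, max_words=5):
    print(len(chunk.split()), chunk)
# prints:
# 5 One two three. Four five.
# 5 Six seven eight nine. Ten.
```

Because sentences are never split, chunk boundaries always fall between sentences, which keeps each request to the model grammatically complete.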


## 4. Summarize the Lecture Transcript

We use the GLM-4-Long model to summarize the lecture transcript into notes in two rounds, which avoids the information loss that a single summarization pass can cause. How much each round condenses the text is controlled by a summary ratio, a value between 0 and 1 that gives the length of the output notes relative to the input text. I used a ratio of 0.2 for both rounds, so each round turns a 1000-word input into notes of roughly 200 words.

In [9]:
def summarize_chunk(chunk, summary_ratio):
    response = client.chat.completions.create(
            model="glm-4-long",
            messages=[
                {
                    "role": "system",
                    "content": f"You are an assistant that reads a long lecture transcript and summarizes it into a short and concise note-taking format. The summary should be around {summary_ratio:.0%} of the original length."
                },
                {
                    "role": "user",
                    "content": chunk
                },
            ],
            top_p=0.7,
            temperature=0.9
        )
    summarized_text = response.choices[0].message.content
    return summarized_text

def summarize_text(text, summary_ratio):
    sentences = preprocess_text(text)

    max_chunk_size = int(WORD_COUNT / 50)
    chunks = chunk_text(sentences, max_chunk_size)

    summarized_chunks = []
    for chunk in chunks:
        summarized_chunk = summarize_chunk(chunk, summary_ratio)
        if summarized_chunk:
            summarized_chunks.append(summarized_chunk)

    summarized_text = " ".join(summarized_chunks)

    return summarized_text

first_summary_ratio = 0.2
first_summarized_text = summarize_text(lecture_text, first_summary_ratio)

second_summary_ratio = 0.2
final_summarized_text = summarize_text(first_summarized_text, second_summary_ratio)
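
As a quick check on what two rounds at a ratio of 0.2 imply, the arithmetic below works out the target lengths for this transcript. These are only the lengths requested in the prompt. The model is not guaranteed to hit them, so the actual notes may be longer or shorter.

```python
WORD_COUNT = 20899   # word count of the transcript, as set earlier
ratio = 0.2          # summary ratio used in both rounds

target_after_round_1 = round(WORD_COUNT * ratio)
target_after_round_2 = round(WORD_COUNT * ratio * ratio)
overall_ratio = ratio * ratio

print(target_after_round_1, target_after_round_2, f"{overall_ratio:.0%}")
# prints: 4180 836 4%
```

In other words, the two rounds together aim for notes of about 4% of the original length.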

## 5. Format and Save as a Markdown File

Finally, ask the model once more to convert the notes into Markdown format, which gives clearly formatted notes, and save them locally for quick reading.

In [10]:
markdown_notes = client.chat.completions.create(
            model="glm-4-long",
            messages=[
                {
                    "role": "system",
                    "content": "Convert the summary to markdown format. Organize information into headings and subheadings, with no big paragraphs and no more than 5 bullet points under a subheading.",
                },
                {
                    "role": "user",
                    "content": final_summarized_text,
                }
            ],
            top_p=0.7,
            temperature=0.9
        )

with open("data/summarized_notes.md", "w", encoding="utf-8") as file:
    file.write(markdown_notes.choices[0].message.content)
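
The system prompt asks for no more than 5 bullet points under a subheading, but the model may not always comply. The sketch below is one way to check the saved notes. It assumes headings start with `#` and bullets start with `-`, `*`, or `+`. The helper `bullets_per_heading` and the sample text are hypothetical.

```python
import re

def bullets_per_heading(markdown_text):
    """Map each heading line to the number of bullet items before the next heading.

    Nested bullets are counted too; a repeated heading text resets its count.
    """
    counts = {}
    current = None
    for line in markdown_text.splitlines():
        stripped = line.strip()
        if re.match(r"#{1,6}\s", stripped):
            current = stripped
            counts[current] = 0
        elif current is not None and re.match(r"[-*+]\s", stripped):
            counts[current] += 1
    return counts

sample = "# Lecture\n## Topic A\n- a\n- b\n## Topic B\n" + "".join(
    f"- item {i}\n" for i in range(6)
)
counts = bullets_per_heading(sample)
over_limit = [heading for heading, n in counts.items() if n > 5]
print(over_limit)  # prints ['## Topic B']
```

Running the same check on the text written to data/summarized_notes.md shows which sections, if any, would need to be regenerated or trimmed by hand.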