<p align="center">
  <a href="https://github.com/google/langextract">
    <img src="https://raw.githubusercontent.com/google/langextract/main/docs/_static/logo.svg"
         alt="LangExtract Logo" width="128" />
  </a>
</p>

<h1 align="center">LangExtract</h1>

<p align="center">
  <a href="https://pypi.org/project/langextract/">
    <img src="https://img.shields.io/pypi/v/langextract.svg" />
  </a>
  <a href="https://github.com/google/langextract">
    <img src="https://img.shields.io/github/stars/google/langextract.svg?style=social&label=Star" />
  </a>
  <img src="https://github.com/google/langextract/actions/workflows/ci.yaml/badge.svg" />
  <a href="https://doi.org/10.5281/zenodo.17015089">
    <img src="https://zenodo.org/badge/DOI/10.5281/zenodo.17015089.svg" />
  </a>
</p>

<p align="center">
  <img src="https://raw.githubusercontent.com/google/langextract/main/docs/_static/romeo_juliet_basic.gif"
       alt="Romeo and Juliet Basic Visualization" />
</p>


In [8]:
!pip install langextract




##  Google AI Studio

You can access it via **Google AI Studio** at the following link:  
üîó https://aistudio.google.com/


In [None]:
import os
os.environ["LANGEXTRACT_API_KEY"] = "YOUR_API_KEY"



In [7]:
import langextract as lx
import textwrap

# 1. ƒê·ªãnh nghƒ©a Prompt: B·∫°n mu·ªën tr√≠ch xu·∫•t c√°i g√¨?
prompt = textwrap.dedent("""\
    Tr√≠ch xu·∫•t t√™n nh√¢n v·∫≠t, c·∫£m x√∫c v√† m·ªëi quan h·ªá theo th·ª© t·ª± xu·∫•t hi·ªán.
    S·ª≠ d·ª•ng ch√≠nh x√°c t·ª´ ng·ªØ trong vƒÉn b·∫£n, kh√¥ng ƒë∆∞·ª£c vi·∫øt l·∫°i (paraphrase).
""")

# 2. T·∫°o v√≠ d·ª• m·∫´u (Few-shot example):
# B∆∞·ªõc n√†y r·∫•t quan tr·ªçng ƒë·ªÉ d·∫°y model c√°ch tr·∫£ v·ªÅ d·ªØ li·ªáu ƒë√∫ng √Ω b·∫°n.
examples = [
    lx.data.ExampleData(
        text="ROMEO: Nh∆∞ng khoan! √Ånh s√°ng n√†o l√≥e qua c·ª≠a s·ªï kia? ƒê√≥ l√† ph∆∞∆°ng ƒê√¥ng, v√† Juliet l√† m·∫∑t tr·ªùi.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character", # Lo·∫°i d·ªØ li·ªáu
                extraction_text="ROMEO",      # Text ch√≠nh x√°c trong vƒÉn b·∫£n
                attributes={"emotional_state": "kinh ng·∫°c"} # Thu·ªôc t√≠nh b·ªï sung
            ),
            lx.data.Extraction(
                extraction_class="relationship",
                extraction_text="Juliet l√† m·∫∑t tr·ªùi",
                attributes={"type": "·∫©n d·ª•"}
            ),
        ]
    )
]

# 3. Ch·∫°y tr√≠ch xu·∫•t
input_text = "N√†ng Juliet nh√¨n ƒë·∫Øm ƒëu·ªëi l√™n nh·ªØng v√¨ sao, tr√°i tim n√†ng ƒëau nh√≥i v√¨ nh·ªõ Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro", # Ch·ªçn model (nhanh v√† r·∫ª)
)

# 4. Xu·∫•t k·∫øt qu·∫£ v√† t·∫°o file HTML ƒë·ªÉ xem
lx.io.save_annotated_documents([result], output_name="ket_qua.jsonl", output_dir=".")
html = lx.visualize("ket_qua.jsonl")

# Save the results to a JSONL file
lx.io.save_annotated_documents([result], output_name="ket_qua.jsonl", output_dir=".")

# Generate the visualization from the file
html_content = lx.visualize("ket_qua.jsonl")
with open("visualization.html", "w") as f:
    if hasattr(html_content, 'data'):
        f.write(html_content.data)  # For Jupyter/Colab
    else:
        f.write(html_content)


[94m[1mLangExtract[0m: model=[92mgemini-2.5-flash[0m [00:00][A
[94m[1mLangExtract[0m: model=[92mgemini-2.5-flash[0m, current=[92m79[0m chars, processed=[92m0[0m chars:  [00:00][A
[94m[1mLangExtract[0m: model=[92mgemini-2.5-flash[0m, current=[92m79[0m chars, processed=[92m0[0m chars:  [00:23]

[94m[1mLangExtract[0m: Saving to [92mket_qua_fix.jsonl[0m: 1 docs [00:00, 784.28 docs/s]

[92m‚úì[0m Saved [1m1[0m documents to [92mket_qua_fix.jsonl[0m




[94m[1mLangExtract[0m: Loading [92mket_qua_fix.jsonl[0m: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 1.81k/1.81k [00:00<00:00, 5.64MB/s]

[92m‚úì[0m Loaded [1m1[0m documents from [92mket_qua_fix.jsonl[0m
ƒê√£ s·ª≠a l·ªói v√† l∆∞u file th√†nh c√¥ng!





# Scaling to Longer Documents

In [13]:
import langextract as lx
import textwrap

# 1. Prompt & Example (Gi·ªØ nguy√™n ho·∫∑c d√πng c√°i ƒë∆°n gi·∫£n)
prompt = textwrap.dedent("""\
    Extract main characters and their locations.
    Keep extractions verbatim from text.
""")

examples = [
    lx.data.ExampleData(
        text="Alice sat by the river bank.",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="Alice",
                attributes={"location": "river bank"}
            )
        ]
    )
]

# 2. Ch·∫°y tr√≠ch xu·∫•t t·ª´ URL
# T√†i li·ªáu v√≠ d·ª• d√πng Romeo & Juliet t·ª´ Project Gutenberg
print("ƒêang t·∫£i v√† x·ª≠ l√Ω vƒÉn b·∫£n t·ª´ URL...")
result = lx.extract(
    text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt", # Link vƒÉn b·∫£n
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-3-flash-preview",
    # --- C√ÅC THAM S·ªê N√ÇNG CAO ---
    extraction_passes=2,    # Qu√©t 2 l·∫ßn ƒë·ªÉ tƒÉng ƒë·ªô ph·ªß (recall)
    max_workers=10,         # X·ª≠ l√Ω song song 10 lu·ªìng cho nhanh
    max_char_buffer=2000    # Gi·ªõi h·∫°n k√≠ch th∆∞·ªõc ng·ªØ c·∫£nh (context)
)

# 3. L∆∞u k·∫øt qu·∫£
lx.io.save_annotated_documents([result], output_dir=".", output_name="url_test.jsonl")
print("Xong! H√£y visualize file url_test.jsonl")

ƒêang t·∫£i v√† x·ª≠ l√Ω vƒÉn b·∫£n t·ª´ URL...


[94m[1mLangExtract[0m: Downloading [92mhttps://www.gutenberg.org/files/1513/1513-0.txt[0m: 100%|[38;2;66;133;244m‚ñà[0m| 145k/145k [00:00<0[0m

[92m‚úì[0m Downloaded [1m142,570[0m characters ([1m25,976[0m words) from [94m1513-0.txt[0m



[94m[1mLangExtract[0m: model=[92mgemini-3-flash-preview[0m, current=[92m19,807[0m chars, processed=[92m0[0m chars:  [03:38]


KeyboardInterrupt: 