# Web Research (STORM)

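`STORM` is a research assistant designed by Shao et al. that extends the idea of "outline-driven RAG" for richer article generation.  
STORM is designed to generate Wikipedia-style articles on a user-provided topic. It applies two main insights to produce more organized and comprehensive articles:

1. Creating an outline (planning) by querying similar topics helps improve coverage.
2. Multi-perspective, grounded (in search) conversation simulation helps increase the reference count and information density.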

In [3]:
import os
from dotenv import load_dotenv

load_dotenv()



True

In [4]:
%%capture --no-stderr
%pip install -U langchain_community langchain_openai langchain_fireworks langgraph wikipedia duckduckgo-search tavily-python

# Select LLM models

In [5]:
from langchain_openai import ChatOpenAI

fast_llm = ChatOpenAI(model_name="gpt-4o-mini")
long_context_llm = ChatOpenAI(model_name="gpt-4o")


In [6]:
from typing import List, Optional
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

direct_gen_outline_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a Wikipedia writer. Write an outline for a Wikipedia page about a user-provided topic. Be comprehensive and specific.",
        ),
        ("user", "{topic}"),
    ]
)


class Subsection(BaseModel):
    subsection_title: str = Field(..., title='Title of the subsection')
    description: str = Field(..., title="Content of the subsection")

    @property
    def as_str(self) -> str:
        return f'###{self.subsection_title}\n\n{self.description}'.strip()
    
class Section(BaseModel):
    section_title: str = Field(..., title='Title of the section')
    description: str = Field(..., title="Content of the section")
    subsections: Optional[List[Subsection]] = Field(
        default=None,
        title='Titles and descriptions for each subsection of the Wikipedia page.',
    )

    @property
    def as_str(self) -> str:
        subsections = "\n\n".join(
            f"###{subsection.subsection_title}\n\n{subsection.description}"
            for subsection in self.subsections or []
        )
        return f"##{self.section_title}\n\n{self.description}\n\n{subsections}".strip()
    
class Outline(BaseModel):
    page_title: str = Field(..., title='Title of the Wikipedia page')
    sections: List[Section] = Field(
        default_factory=list,
        title='Titles and descriptions for each section of the Wikipedia page.',
    )

    @property
    def as_str(self) -> str:
        sections = "\n\n".join(section.as_str for section in self.sections)
        return f"# {self.page_title}\n\n{sections}".strip()
    
generate_outline_direct = direct_gen_outline_prompt | fast_llm.with_structured_output(Outline)

In [9]:
example_topic = "Creating LLM which can generate Kazakh language"

initial_outline = generate_outline_direct.invoke({"topic": example_topic})
print(initial_outline.as_str)

# Creating Large Language Models for the Kazakh Language

##Introduction

An overview of large language models (LLMs) and their significance in natural language processing, with a specific focus on the Kazakh language.

###Definition of Large Language Models

Explains what LLMs are, their architecture, and their capabilities.

###Importance of Kazakh Language Processing

Discusses the need for Kazakh language LLMs for preserving culture, enhancing communication, and supporting local applications.

##Development of Kazakh LLMs

Overview of the methodologies and technologies involved in creating LLMs for the Kazakh language.

###Data Collection

Details the sources and methods used to gather text data in Kazakh, including public datasets, web scraping, and collaborations.

###Preprocessing Text Data

Explains techniques used to clean and prepare the Kazakh text data for training, including tokenization and normalization.

###Model Training Techniques

Describes the machine learning frame

# Expand Topics

In [10]:
gen_related_topics_prompt = ChatPromptTemplate.from_template(
    """I'm writing a Wikipedia page for a topic mentioned below. 
    Please identify and recommend some Wikipedia pages on closely related subjects. 
    I'm looking for examples that provide insights into interesting aspects commonly associated with this topic, 
    or examples that help me understand the typical content and structure included in Wikipedia pages for similar topics.
    Please list as many subjects and URLs as you can.
    Topic of interest: {topic}
"""
)

class RelatedSubjects(BaseModel):
    topics: List[str] = Field(
        description="Comprehensive list of related subjects as background research.",
    )

expand_chain = gen_related_topics_prompt | fast_llm.with_structured_output(RelatedSubjects)

In [11]:
related_subjects = await expand_chain.ainvoke({"topic": example_topic})
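
The cell above relies on Jupyter's support for top-level `await`. In a plain Python script the same call would need an event loop. The following is a minimal sketch of that pattern, assuming nothing about the real chain: `fake_expand` is a hypothetical stand-in for `expand_chain.ainvoke` that returns canned topics, so the example runs without an API key.

```python
import asyncio
from typing import List


async def fake_expand(topic: str) -> List[str]:
    """Hypothetical stand-in for `expand_chain.ainvoke`; returns canned topics."""
    await asyncio.sleep(0)  # yield to the event loop, as a real network call would
    return [f"{topic} (overview)", f"{topic} (history)"]


async def main() -> List[str]:
    # In the notebook this would be: await expand_chain.ainvoke({"topic": ...})
    return await fake_expand("Kazakh language")


# asyncio.run creates an event loop, runs the coroutine, and closes the loop.
topics = asyncio.run(main())
print(topics)
```

`asyncio.run` raises a `RuntimeError` when an event loop is already running, which is the case inside Jupyter, so this form is meant for scripts only.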


In [13]:
related_subjects.topics

['Language Modeling',
 'Kazakh Language',
 'Natural Language Processing',
 'Large Language Models',
 'Machine Learning',
 'Artificial Intelligence',
 'Linguistic Resources for Lesser-Known Languages',
 'Text Generation in Natural Language Processing',
 'GPT (Generative Pre-trained Transformer)',
 'Multilingual Models',
 'Low-resource Languages',
 'Computational Linguistics',
 'Speech Recognition in Kazakh',
 'Ethical AI in Language Processing',
 'Cultural Significance of the Kazakh Language']

# Generate Perspectives

In [14]:
class Editor(BaseModel):
    affilation: str = Field(
        description="Primary affiliation of the editor.",
    )
    name: str = Field(
        description="Name of the editor.", pattern=r"^[a-zA-Z0-9_-]{1,64}$"
    )
    role: str = Field(
        description="Role of the editor in the context of the topic.",
    )
    description: str = Field(
        description="Description of the editor's focus, concerns, and motives.",
    )

    @property
    def persona(self) -> str:
        return f"Name: {self.name}\nRole: {self.role}\nAffiliation: {self.affilation}\nDescription: {self.description}\n"

class Perspectives(BaseModel):
    editors: List[Editor] = Field(
        description="Comprehensive list of editors with their roles and affiliations.",
        # Add a pydantic validation/restriction to be at most M editors
    )


gen_perspectives_prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You need to select a diverse (and distinct) group of Wikipedia editors who will work together to create a comprehensive article on the topic.\
            Each of them represents a different perspective, role, or affiliation related to this topic.\
            You can use other Wikipedia pages of related topics for inspiration. For each editor, add a description of what they will focus on.\
            Wiki page outlines of related topics for inspiration:
    {examples}""",
        ),
        ("user", "Topic of interest: {topic}")
    ]
)

gen_perspectives_chain = gen_perspectives_prompt | fast_llm.with_structured_output(Perspectives)
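
The `name` field of `Editor` is constrained by a regular expression that allows 1 to 64 letters, digits, underscores, or hyphens. The sketch below checks a few candidate names against that rule using Python's `re` module. The helper `is_valid_editor_name` and the sample names are hypothetical, and `re` is used only for illustration: pydantic and the model provider may evaluate the pattern with a different regex engine.

```python
import re

# Same rule as the Editor.name pattern: 1 to 64 letters, digits, "_" or "-".
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9_-]{1,64}$")


def is_valid_editor_name(name: str) -> bool:
    """Return True when `name` satisfies the editor-name rule (hypothetical helper)."""
    return NAME_PATTERN.fullmatch(name) is not None


# Hypothetical sample names, chosen to exercise the rule.
candidates = ["Aigerim_Linguist", "nlp-engineer-01", "Dr. Smith", ""]
for candidate in candidates:
    print(repr(candidate), is_valid_editor_name(candidate))

# Whitespace inside the braces changes the meaning: Python's `re` no longer
# reads "{1, 64}" as a repetition count and treats it as literal text instead.
spaced_pattern = re.compile(r"^[a-zA-Z0-9_-]{1, 64}$")
print(spaced_pattern.fullmatch("Aigerim_Linguist"))
```

Names containing spaces or punctuation such as `Dr. Smith` fail the check, so a persona prompt that yields human-style names can trip this constraint.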

