In [15]:
%pip install langchain langchain-openai langchain-community pydantic --upgrade --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
scrapegraphai 1.8.0 requires langchain==0.1.15, but you have langchain 0.2.12 which is incompatible.
scrapegraphai 1.8.0 requires langchain-openai==0.1.6, but you have langchain-openai 0.1.20 which is incompatible.
scrapegraphai 1.8.0 requires tiktoken==0.6.0, but you have tiktoken 0.7.0 which is incompatible.
llama-index 0.6.14 requires typing-extensions==4.5.0, but you have typing-extensions 4.9.0 which is incompatible.
llama-index 0.6.14 requires typing-inspect==0.8.0, but you have typing-inspect 0.9.0 which is incompatible.
langserve 0.1.0 requires langchain-core<0.2.0,>=0.1.0, but you have langchain-core 0.2.28 which is incompatible.
langchain-groq 0.1.3 requires langchain-core<0.2.0,>=0.1.45, but you have langchain-core 0.2.28 which is incompatible.
langchain-google-genai 1.0.3 requires langchain-core<0

In [36]:
from typing import List
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, SystemMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser
from pydantic import BaseModel, Field

In [37]:
model = ChatOpenAI(model="gpt-4o")

In [43]:
class Sections(BaseModel):
    outline_sections: List[str] = Field(description="The sections of the blog post outline. If the point is a nested point, then add a number to the start of it.")

In [52]:
# 1. Blog post outline chain:
blog_post_outline_system_prompt = SystemMessagePromptTemplate.from_template(
    '''You are a helpful assistant that writes blog post outlines. The outline must be incredibly long, extensive and detailed.
    You are writing an article on the topic of: {topic}.
    '''
)
blog_post_outline_chat_prompt = ChatPromptTemplate.from_messages([blog_post_outline_system_prompt])
blog_post_outline_runnable = blog_post_outline_chat_prompt | model.with_structured_output(Sections)

In [48]:
# 2. Create the blog post chain:
blog_post_generation_system_prompt = SystemMessagePromptTemplate.from_template(
    """You are a helpful assistant that writes blog posts. The blog post must be incredibly long, extensive and detailed.
    Here is the article topic: {topic}.
    Here are the last 3 sections of the article that have been generated: {previous_article_sections}
    Here are the next 3 sections of the article to be generated: {next_three_article_sections}
    You must render the article in structured .md content.
    You must only produce the content, never include the section headings as these are added later.
    Current section content: """
)
blog_post_generation_chat_prompt = ChatPromptTemplate.from_messages(
    [blog_post_generation_system_prompt]
)
blog_post_generation_runnable = (
    blog_post_generation_chat_prompt | model | StrOutputParser()
)

In [46]:
# 3. Generate the blog post outline:
outline_result = blog_post_outline_runnable.invoke({
    'topic': 'What is data engineering?'
})
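
The generation loop in step 4 gives each model call a window of up to three earlier sections and up to three upcoming headings. A minimal sketch of that windowing as a pure function, with no model calls, makes the slicing easier to check at the start and end of the outline. The name `section_windows`, its `size` parameter, and the sample outline are hypothetical and not part of the notebook.

```python
from typing import List, Tuple


def section_windows(
    sections: List[str], size: int = 3
) -> List[Tuple[List[str], str, List[str]]]:
    """Return (previous, current, upcoming) heading windows, one per section.

    Hypothetical helper: it mirrors the slices used in the generation loop
    but works on headings only.
    """
    windows = []
    for i, current in enumerate(sections):
        # Up to `size` headings before the current one (fewer near the start).
        previous = sections[max(0, i - size):i]
        # Up to `size` headings after the current one (fewer near the end).
        upcoming = sections[i + 1:i + 1 + size]
        windows.append((previous, current, upcoming))
    return windows


# Illustrative outline, not real model output.
outline = ["Introduction", "1. Definition", "2. Importance", "3. Tools", "4. Careers"]
for previous, current, upcoming in section_windows(outline):
    print(f"{current!r}: previous={previous}, upcoming={upcoming}")
```

Python slices clamp to the list bounds, so the first section gets an empty `previous` list and the last section gets an empty `upcoming` list without any special-casing.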

In [51]:
# 4. Sequentially generate every section of the article, passing a window of up to 3 sections before and after the current one
history = []

for i, current_section in enumerate(outline_result.outline_sections):
    previous_sections = outline_result.outline_sections[max(0, i - 3) : i]
    previous_content = "\n".join(history[max(0, i - 3) : i])
    next_sections = outline_result.outline_sections[i + 1 : i + 4]

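    # The prompt has no slot for the current heading, so pass it alongside the
    # upcoming headings; otherwise the model cannot tell which section to write.
    section_content = blog_post_generation_runnable.invoke(
        {
            "topic": "What is data engineering?",
            "previous_article_sections": f"{previous_sections}\n\n{previous_content}",
            "next_three_article_sections": (
                f"Section to write now: {current_section}\n"
                f"Sections that follow: {next_sections}"
            ),
        }
    )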

    history.append(f"## {current_section}\n\n{section_content}\n\n")
    print(f"Generated section: {current_section}")

# Print or save the full blog post
full_blog_post = "\n".join(history)
print(full_blog_post)

Generated section: Introduction
Generated section: 1. Definition of Data Engineering
Generated section: 2. The Importance of Data Engineering
## Introduction

## Definition of Data Engineering

At its core, data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. It involves the development and maintenance of architectures such as databases and large-scale processing systems. These systems are crucial for transforming raw data into actionable insights and for making data accessible to data scientists, analysts, and other stakeholders within an organization.

Data engineering encompasses a wide range of activities, including:

1. **Data Collection**: Gathering data from various sources such as databases, APIs, and real-time streams.
2. **Data Storage**: Organizing and storing data in a manner that makes it easily retrievable and manageable. This often involves using databases, data lakes, and data warehouses.
3. **Data Tran