# Summarising emails
In this notebook, we're going to learn how to summarise emails.

In [2]:
!pip install unstructured langchain openai python-dotenv pandas



## Loading emails 📧
Let's start by loading an email with LangChain's `UnstructuredEmailLoader`, which is built on the [unstructured](https://pypi.org/project/unstructured/) library.

In [187]:
from langchain.document_loaders import UnstructuredEmailLoader

In [188]:
loader = UnstructuredEmailLoader("emails/1124445847.eml")
duck_email = loader.load()

In [189]:
duck_email

[Document(page_content="Featured Community Member Archie Sarre Wood. fast DuckDB-powered dashboards, DuckDB & Golang, and more.\n    \n    \n\n\n  \n  \n\nHey, friend 👋\n\nIt’s\xa0Marcos\xa0again, aka “DuckDB News Reporter” with another issue of “This Month in the DuckDB Ecosystem for August 2023.\n\nThis month proves what we are already seeing in our internal channels: the DuckDB ecosystem is growing stronger with time: more companies like\xa0Rill Data\xa0are considering using DuckDB for production environments, more people\xa0are considering DuckDB for fast data analysis development, and so on.\n\nAs always we share here, this is a two-way conversation: if you have any feedback on this newsletter, feel free to send us an email to\xa0duckdbnews@motherduck.com\n\nMarcos\n\nFeatured Community Member\n\n \n      \n         \n         \n      \n     \n     \n       Archie Sarre Wood \n       \n         Archie Sarre Wood is Head of community at\xa0 Evidence \xa0an open source, code-based a
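To get a feel for what loading an `.eml` file involves, here is a minimal sketch that pulls the subject and plain-text body out of a raw message using only Python's standard `email` package. It is not how `unstructured` works internally. The sample message, its addresses, and the `extract_plain_text` helper are all made up for illustration.

```python
from email import message_from_string
from email.policy import default

# Hypothetical raw message standing in for the contents of an .eml file.
raw_email = """\
From: newsletter@example.com
To: reader@example.com
Subject: This Month in the DuckDB Ecosystem
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

Hey, friend

The DuckDB ecosystem is growing stronger with time.
"""


def extract_plain_text(raw: str) -> dict:
    """Return the subject and the plain-text body of a raw email."""
    # policy=default gives us the modern EmailMessage API (get_body, get_content).
    message = message_from_string(raw, policy=default)
    body_part = message.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {"subject": str(message["subject"]), "text": body.strip()}


parsed = extract_plain_text(raw_email)
print(parsed["subject"])
print(parsed["text"])
```

The loader used in this notebook goes further than this: it returns LangChain `Document` objects whose `page_content` holds the extracted text.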

## Configuring the LLM ⚙️
Next, we're going to set up OpenAI and LangChain.

In [190]:
from langchain.chat_models import ChatOpenAI

import dotenv
dotenv.load_dotenv()

True

In [191]:
llm_gpt35 = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-16k")

## Let the summarisation begin! 📝 

In [192]:
from langchain.chains.summarize import load_summarize_chain

In [193]:
summarize_chain_gpt35 = load_summarize_chain(llm_gpt35, chain_type="stuff")

In [194]:
%%time
summarize_chain_gpt35.run(duck_email)

CPU times: user 7.11 ms, sys: 2.58 ms, total: 9.69 ms
Wall time: 3.74 s


'The DuckDB ecosystem is growing stronger with more companies considering using DuckDB for production environments and more people using it for fast data analysis development. The newsletter features Archie Sarre Wood, who has built a VS Code extension for DuckDB. Other topics covered include the Modern Data Stack, analyzing StackOverflow data, fast DuckDB-powered dashboards, a comparison study with other databases, using DuckDB with Golang, and upcoming events.'
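Roughly speaking, the `stuff` chain type puts the text of every loaded document into a single prompt and makes one LLM call. The sketch below mimics that idea without calling any model. The `Doc` class, the `TEMPLATE` string, and `stuff_prompt` are hypothetical stand-ins, not LangChain's own classes or its built-in prompt.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


# Hypothetical template, not the library's built-in summarisation prompt.
TEMPLATE = 'Summarise the following:\n\n"{text}"\n\nSummary:'


def stuff_prompt(docs, template=TEMPLATE, separator="\n\n"):
    """Join every document's text and substitute it into one prompt."""
    text = separator.join(doc.page_content for doc in docs)
    return template.format(text=text)


docs = [Doc("Hey, friend"), Doc("Featured Community Member")]
print(stuff_prompt(docs))
```

Because everything goes into one request, the whole email has to fit inside the model's context window, which is one reason to reach for a larger-context model when emails are long.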

## Can we change the way it does the summary? 📇 

In [195]:
from langchain.prompts import PromptTemplate

In [197]:
prompt_template = """Pull out a maximum of 5 interesting things as bullet points from the following:
"{text}"
The 5 things are:"""
prompt = PromptTemplate.from_template(prompt_template)

chain = load_summarize_chain(llm_gpt35, chain_type="stuff", prompt=prompt)

In [198]:
print(chain.run(duck_email))

1. Archie Sarre Wood is the Head of community at Evidence and has built a VS Code extension for DuckDB.
2. DuckDB is powerful enough to analyze rich datasets like StackOverflow data.
3. The integration between Motherduck and Rill Data allows for fast DuckDB-powered dashboards.
4. DuckDB was compared to Spark, Elasticsearch, and MongoDB in terms of performance and cost.
5. There is a comprehensive guide available for using DuckDB with Go.
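The prompt asks for bullet points but does not pin down their format, so a reply might come back numbered (as above) or with dashes. If we want the items as a Python list, a small parser helps. This is a sketch under that assumption; `split_bullets` is a hypothetical helper and the dashed sample is made up.

```python
import re

# Matches lines that start with "1.", "1)", "-", "*" or "•" followed by text.
BULLET = re.compile(r"^\s*(?:\d+[.)]|[-*•])\s+(.*\S)\s*$")


def split_bullets(summary: str) -> list[str]:
    """Return the text of each bullet line, ignoring non-bullet lines."""
    items = []
    for line in summary.splitlines():
        match = BULLET.match(line)
        if match:
            items.append(match.group(1))
    return items


numbered = (
    "1. DuckDB is powerful enough to analyze rich datasets like StackOverflow data.\n"
    "2. There is a comprehensive guide available for using DuckDB with Go."
)
dashed = "- first point\n- second point"  # made-up example of a dashed reply

print(split_bullets(numbered))
print(split_bullets(dashed))
```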


## How much does it cost? 💰
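We probably want to know how much it'll cost to create these summaries. LangChain's `get_openai_callback` context manager records the tokens used and the estimated cost of the OpenAI calls made inside it.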

In [199]:
from langchain.callbacks import get_openai_callback

In [200]:
with get_openai_callback() as cb:
  print(chain.run(duck_email))  
print(cb)

1. Archie Sarre Wood is the Head of community at Evidence and has built a VS Code extension for DuckDB.
2. DuckDB is powerful enough to analyze rich datasets like StackOverflow data.
3. The integration between Motherduck and Rill Data allows for fast DuckDB-powered dashboards.
4. DuckDB was compared to Spark, Elasticsearch, and MongoDB in terms of performance and cost.
5. There is a comprehensive guide available for using DuckDB with Go.
Tokens Used: 1074
	Prompt Tokens: 979
	Completion Tokens: 95
Successful Requests: 1
Total Cost (USD): $0.003317
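The total above can be reproduced by hand from the token counts. Here's a minimal sketch, assuming `gpt-3.5-turbo-16k` is billed at $0.003 per 1,000 prompt tokens and $0.004 per 1,000 completion tokens. Those rates are not stated anywhere in this notebook; they are an assumption chosen because they match the reported total. `estimate_cost` is a hypothetical helper.

```python
# Assumed USD rates per 1,000 tokens; not stated in this notebook.
PROMPT_RATE_PER_1K = 0.003
COMPLETION_RATE_PER_1K = 0.004


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost in USD for one request, given separate prompt and completion rates."""
    return (
        prompt_tokens * PROMPT_RATE_PER_1K
        + completion_tokens * COMPLETION_RATE_PER_1K
    ) / 1000


# Token counts taken from the callback output above.
cost = estimate_cost(prompt_tokens=979, completion_tokens=95)
print(f"${cost:.6f}")
```

Most of the cost comes from the prompt tokens here, because the whole email is sent as input while the summary itself is short.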


In [201]:
import time

def run_with_cost(chain, data):
  # Time a single chain run and capture its OpenAI token usage and cost
  start = time.time()
  with get_openai_callback() as cb:
    summary = chain.run(data)
  end = time.time()

  return {
    "model": chain.llm_chain.llm.model_name,
    "summary": summary,
    "cost": cb.total_cost,
    "tokens": cb.total_tokens,
    "promptTokens": cb.prompt_tokens,
    "completionTokens": cb.completion_tokens,
    "timeTaken": end - start
  }

In [202]:
result = run_with_cost(chain, duck_email)
result

{'model': 'gpt-3.5-turbo-16k',
 'summary': '1. Archie Sarre Wood is the Head of community at Evidence and has built a VS Code extension for DuckDB.\n2. DuckDB is powerful enough to analyze rich datasets like StackOverflow data.\n3. The integration between Motherduck and Rill Data allows for fast DuckDB-powered dashboards.\n4. DuckDB was compared to Spark, Elasticsearch, and MongoDB in terms of performance and cost.\n5. There is a comprehensive guide available for using DuckDB with Go.',
 'cost': 0.003317,
 'tokens': 1074,
 'promptTokens': 979,
 'completionTokens': 95,
 'timeTaken': 4.7362470626831055}
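The timing part of this pattern can be pulled out into a reusable helper. The sketch below uses `time.perf_counter`, a monotonic clock intended for measuring elapsed time, as an alternative to `time.time`. Both `timed` and `fake_summarise` are hypothetical; the latter stands in for `chain.run` so the example runs without an API key.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_summarise(text):
    """Hypothetical stand-in for chain.run so the sketch runs offline."""
    time.sleep(0.01)  # pretend to wait for the API
    return text[:20]


summary, seconds = timed(fake_summarise, "Hey, friend. It's Marcos again.")
print(summary)
print(f"{seconds:.3f} seconds")
```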

## Comparing different models 🅰️🆚🅱️
Let's have a look at how summaries vary with different models.

In [203]:
models = ["gpt-3.5-turbo-16k", "gpt-3.5-turbo", "gpt-4"]
results = []
for model_name in models:
  print(f"Running {model_name}")
  llm = ChatOpenAI(temperature=0, model_name=model_name)
  summary_chain = load_summarize_chain(llm, chain_type="stuff", prompt=prompt)
  result = run_with_cost(summary_chain, duck_email)
  results.append(result)
  print(f"Finished in  {result['timeTaken']} seconds")

Running gpt-3.5-turbo-16k
Finished in  4.3329689502716064 seconds
Running gpt-3.5-turbo
Finished in  4.607836961746216 seconds
Running gpt-4
Finished in  13.31515908241272 seconds


In [204]:
import pandas as pd
from style_pandas import style_dataframe

In [205]:
style_dataframe(pd.DataFrame(results))

| model | summary | cost | tokens | promptTokens | completionTokens | timeTaken |
| --- | --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo-16k | 1. Archie Sarre Wood is the Head of community at Evidence and has built a VS Code extension for DuckDB. 2. DuckDB is powerful enough to analyze rich datasets like StackOverflow data. 3. The integration between Motherduck and Rill Data allows for fast DuckDB-powered dashboards. 4. DuckDB was compared to Spark, Elasticsearch, and MongoDB in terms of performance and cost. 5. There is a comprehensive guide available for using DuckDB with Go. | $0.003 | 1074 | 979 | 95 | 4.332969 |
| gpt-3.5-turbo | 1. Archie Sarre Wood is the Head of community at Evidence and has built a VS Code extension for DuckDB. 2. DuckDB is powerful enough to analyze rich datasets like StackOverflow data. 3. The integration between Motherduck and Rill Data allows for fast DuckDB-powered dashboards. 4. DuckDB was compared to Spark, Elasticsearch, and MongoDB in terms of performance and cost. 5. There is a comprehensive guide available for using DuckDB with Go. | $0.002 | 1074 | 979 | 95 | 4.607837 |
| gpt-4 | - The DuckDB ecosystem is growing stronger with more companies like Rill Data considering using DuckDB for production environments and more people considering DuckDB for fast data analysis development. - Archie Sarre Wood, Head of community at Evidence, has built a VS Code extension for DuckDB that allows users to connect to a local, in-memory or MotherDuck DuckDB instance and run queries. - Rahul Joshi talks about a super simple and highly customizable approach to the Modern Data Stack in a box with dtl, DuckDB, Motherduck and Metabase. - Katie Staveley shows how the integration between Motherduck and Rill Data works smoothly in her post about fast DuckDB-powered dashboards. - Sergey Olontsev provides an interactive guide for using DuckDB with Golang, demonstrating how to combine the power of the two. | $0.040 | 1148 | 979 | 169 | 13.315159 |
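To make the trade-off easier to read, we can express each model's cost and time relative to the cheapest and fastest run. This is a small sketch using the figures reported above, typed in by hand. The displayed costs are rounded to three decimal places, so the cost ratios are only approximate. `relative_to_min` is a hypothetical helper.

```python
# Figures copied from the results above (costs are the rounded display values).
results = [
    {"model": "gpt-3.5-turbo-16k", "cost": 0.003, "timeTaken": 4.332969},
    {"model": "gpt-3.5-turbo", "cost": 0.002, "timeTaken": 4.607837},
    {"model": "gpt-4", "cost": 0.040, "timeTaken": 13.315159},
]


def relative_to_min(rows, key):
    """Map each model to its value divided by the smallest value for that key."""
    baseline = min(row[key] for row in rows)
    return {row["model"]: round(row[key] / baseline, 1) for row in rows}


relative_cost = relative_to_min(results, "cost")
relative_time = relative_to_min(results, "timeTaken")
print(relative_cost)
print(relative_time)
```

On this single email, the gpt-4 run cost roughly 20 times as much as the cheapest run and took about 3 times as long as the fastest one, in exchange for a more detailed summary.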
