# Overall plan
  - Provide a specific article
  - Get back a tweet that imitates "The Telegraph"

# Steps
  1. Get the article.
  2. Extract the title and the content.
  3. Create a summary of the article.
  4. Generate a tweet that imitates "The Telegraph"

In [67]:
import nest_asyncio

# Needed for jupyter notebooks to work with async tasks
nest_asyncio.apply()

# llama-index ships an easy-to-use news article reader; load_data takes a list of URL strings
from llama_index.readers.web import NewsArticleReader


article_link = "https://www.telegraph.co.uk/world-news/2024/04/16/australian-woman-dies-after-drinking-infusion/"
article = NewsArticleReader(html_to_text=True).load_data([article_link])

In [68]:
article[0].metadata

{'title': 'Australian woman dies after drinking suspected toxic mushroom infusion',
 'link': 'https://www.telegraph.co.uk/world-news/2024/04/16/australian-woman-dies-after-drinking-infusion/',
 'authors': ['Andrea Hamblin', 'In Melbourne'],
 'language': 'en',
 'description': 'Rachael Dixon went into cardiac arrest after ingesting unknown substance at alternative healing centre north-west of Melbourne',
 'publish_date': datetime.datetime(2024, 4, 16, 0, 0),
 'keywords': ['mushrooms',
  'suspected',
  'soul',
  'woman',
  'toxic',
  'skincare',
  'centre',
  'mushroom',
  'dies',
  'infusion',
  'treatment',
  'barn',
  'wild',
  'drinking',
  'substance',
  'australian',
  'health'],

In [69]:
# Convert the document from llama-index to a langchain document

langchain_article = article[0].to_langchain_format()

In [70]:
langchain_article.metadata

{'title': 'Australian woman dies after drinking suspected toxic mushroom infusion',
 'link': 'https://www.telegraph.co.uk/world-news/2024/04/16/australian-woman-dies-after-drinking-infusion/',
 'authors': ['Andrea Hamblin', 'In Melbourne'],
 'language': 'en',
 'description': 'Rachael Dixon went into cardiac arrest after ingesting unknown substance at alternative healing centre north-west of Melbourne',
 'publish_date': datetime.datetime(2024, 4, 16, 0, 0),
 'keywords': ['mushrooms',
  'suspected',
  'soul',
  'woman',
  'toxic',
  'skincare',
  'centre',
  'mushroom',
  'dies',
  'infusion',
  'treatment',
  'barn',
  'wild',
  'drinking',
  'substance',
  'australian',
  'health'],

## Completed
- llama-index generates all of the metadata and the summary of the article.

## Next Steps
- Get all the tweets from the Twitter account of "The Telegraph"
  - [tweet dataset](https://kaggle.com/datasets/shashank1558/preprocessed-twitter-tweets/data)
- Store them in a vector database
- Use the vector database to generate a tweet that imitates "The Telegraph"

In [20]:
# load the documents from csv files

#%pip install unstructured
from langchain_community.document_loaders import DirectoryLoader

# The .txt files are the dataset's csv files, pivoted into plain text.
# The dataset contains only comma-separated tweet text, with no header row.
directory_loader = DirectoryLoader("tweets/", 
    glob="*.txt", 
    show_progress=True, 
    use_multithreading=True,
)
raw_docs = directory_loader.load()

100%|██████████| 3/3 [00:09<00:00,  3.08s/it]
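The notebook does not show how the csv files were pivoted into the `.txt` files loaded above. A minimal sketch of one way to do it, assuming each csv cell holds one tweet and that a blank line between tweets is acceptable (the function name `pivot_tweets` and the sample rows are hypothetical):

```python
import csv
import io


def pivot_tweets(csv_text: str) -> str:
    """Turn headerless, comma-separated tweets into one tweet per paragraph."""
    reader = csv.reader(io.StringIO(csv_text))
    tweets = []
    for row in reader:
        for cell in row:
            cell = cell.strip()
            if cell:  # skip empty cells left by trailing commas
                tweets.append(cell)
    # A blank line between tweets keeps each tweet a separate paragraph.
    return "\n\n".join(tweets)


# Hypothetical sample in the assumed layout: tweets separated by commas.
sample = 'I miss going to gigs in Liverpool unhappy,"I want fun plans this weekend unhappy"\n'
print(pivot_tweets(sample))
```

Using the `csv` module rather than `str.split(",")` keeps quoted tweets that contain commas in one piece.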


In [21]:
print(raw_docs)

[Document(page_content="How unhappy  some dogs like it though\n\ntalking to my over driver about where I'm goinghe said he'd love to go to New York too but since Trump it's probably not\n\nDoes anybody know if the Rand's likely to fall against the dollar? I got some money  I need to change into R but it keeps getting stronger unhappy\n\nI miss going to gigs in Liverpool unhappy\n\nThere isnt a new Riverdale tonight ? unhappy\n\nit's that A*dy guy from pop Asia and then the translator so they'll probs go with them around Aus unhappy\n\nWho's that chair you're sitting in? Is this how I find out. Everyone knows now. You've shamed me in pu\n\ndon't like how jittery caffeine makes me sad\n\nMy area's not on the list unhappy  think I'll go LibDems anyway\n\nI want fun plans this weekend unhappy\n\nWhen can you notice me. unhappy  what?\n\nAhhhhh! You recognized LOGAN!!! Cinemax shows have a BAD track record for getting cancelled unhappy\n\nErrr dude.... They're gone unhappy  Asked other leag

## Completed
- Documents loaded

## Next Steps
- Split the documents into smaller chunks to reduce the number of tokens in each one

In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=150,
)

split_docs = text_splitter.split_documents(raw_docs)
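To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph and line breaks, so its real chunks will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 150-character overlap plays in the cell above: text cut at a chunk boundary still appears intact in at least one chunk.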

In [23]:
# %pip install langchain-chroma
# !ollama pull nomic-embed-text

from langchain_chroma import Chroma
from langchain_community.embeddings.ollama import OllamaEmbeddings

# Generate the embeddings for the documents 
embeddings = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="nomic-embed-text",
    temperature=0.1,
    show_progress=True,
    num_thread=16,
)

In [24]:
from os.path import isdir

force_embeds = False
if not isdir("./chroma") or force_embeds:
    # Embed the chunks and store them in a persistent vector store
    Chroma.from_documents(
        documents=split_docs,
        embedding=embeddings,
        persist_directory="./chroma",
    )

vectorstore = Chroma(
    persist_directory="./chroma", 
    embedding_function=embeddings
)

## Completed
- Defined embedding function
- Saved to vector database (chroma)

In [182]:
retriever = vectorstore.as_retriever(
    search_type="mmr",
    # search_kwargs={"k": 6},
)
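The `"mmr"` search type stands for maximal marginal relevance: instead of returning the closest chunks only, it trades relevance to the query against similarity to chunks already picked. The sketch below illustrates the idea on toy 2-D vectors. It is an illustration under stated assumptions, not the vector store's actual implementation; `mmr_select`, `cosine`, and the example vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates, balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
candidates = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]  # toy "embeddings"
print(mmr_select(query, candidates, k=2))                   # [1, 2]
print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # [1, 0]
```

With `lambda_mult=1.0` the selection is pure nearest-neighbour and returns two nearly identical vectors; at `0.5` the second pick is the less similar candidate. For this notebook that should mean a more varied set of example tweets in the prompt.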

In [183]:
from langchain_community.llms.ollama import Ollama

llm = Ollama(
    base_url="http://localhost:11434",
    model="mistral",
    temperature=0.1,
    num_thread=16,
)

In [184]:
def format_retrieved_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [190]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

template = """Use the following output examples to transform the input text into the output.
Do not include any information that is not necessary to match the example output's style, tone, sparsity, briefness, and vocabulary.
Match the length of the output to the average length of the output examples.
After the expected output, give some brief instructions on how to match the examples better.

Output Examples:
{context}

Input Text: 
{question}

Output:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_retrieved_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)
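Before calling the model, it can help to see what text the chain hands to it. The sketch below fills a shortened copy of the template with plain `str.format`, which approximates what the prompt step does with `{context}` and `{question}`. `FakeDoc` is a hypothetical stand-in for a retrieved document, and the example tweets and input text are placeholders.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document (only page_content is used)."""
    page_content: str


def format_retrieved_docs(docs):
    # Same helper as in the notebook: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


# Shortened version of the notebook's template, for illustration only.
template = "Output Examples:\n{context}\n\nInput Text:\n{question}\n\nOutput:"

docs = [
    FakeDoc("I miss going to gigs in Liverpool unhappy"),
    FakeDoc("I want fun plans this weekend unhappy"),
]
prompt_text = template.format(
    context=format_retrieved_docs(docs),
    question="Australian woman dies after drinking suspected toxic mushroom infusion",
)
print(prompt_text)
```

Printing the filled prompt makes it easy to check whether the retrieved examples actually resemble the target style before spending time on generation.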

In [191]:
request = str(article[0].text)

rag_chain.invoke(request)

OllamaEmbeddings:   0%|          | 0/1 [00:00<?, ?it/s]

OllamaEmbeddings: 100%|██████████| 1/1 [00:03<00:00,  3.47s/it]


