# Overall plan
  - Provide a specific article
  - Get back a tweet that imitates "The Telegraph"

# Steps
  1. Get the article.
  2. Extract the title and the content.
  3. Create a summary of the article.
  4. Generate a tweet that imitates "The Telegraph".

In [343]:
import nest_asyncio

# Needed for jupyter notebooks to work with async tasks
nest_asyncio.apply()

# LlamaIndex ships an easy-to-use news article reader; load_data takes a list of URL strings
from llama_index.readers.web import NewsArticleReader


article_link = "https://www.telegraph.co.uk/health-fitness/wellbeing/sex/rachel-johnson-sex-questions-relationship-dilemmas/"
article = NewsArticleReader(html_to_text=True).load_data([article_link])

## Completed
- llama-index generates all of the metadata and the summary of the article.

## Next Steps
- Get all the tweets from the twitter account of "The Telegraph"
  - [tweet dataset](https://kaggle.com/datasets/shashank1558/preprocessed-twitter-tweets/data)
- Store them in a vector database
- Use the vector database to generate a tweet that imitates "The Telegraph"

In [344]:
# load the documents from csv files

#%pip install unstructured
from os.path import isdir

from langchain_community.document_loaders import DirectoryLoader

# The text files are the pivoted CSV files.
# The dataset is only comma-separated tweet content; there are no headers.
# Set force_embeds to True to reload the documents and rebuild the vector store.
force_embeds = False
raw_docs = None
if not isdir("./chroma") or force_embeds:
    directory_loader = DirectoryLoader("tweets/", 
        glob="*.txt", 
        show_progress=True, 
        use_multithreading=True,
    )
    raw_docs = directory_loader.load()

## Completed
- Documents loaded
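
The loader cell above reads `*.txt` files that its comments describe as pivoted, headerless CSVs. The notebook does not show how those files were produced, so the following is only one possible sketch, assuming "pivoted" means writing each comma-separated tweet on its own line. The function name `csv_to_lines` and the file names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def csv_to_lines(csv_path: Path, txt_path: Path) -> int:
    """Write every non-empty CSV cell to txt_path, one per line; return the count."""
    count = 0
    with csv_path.open(newline="", encoding="utf-8") as src, \
            txt_path.open("w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            for cell in row:
                # Collapse internal whitespace so each tweet stays on one line
                tweet = " ".join(cell.split())
                if tweet:
                    dst.write(tweet + "\n")
                    count += 1
    return count


# Demo on a throwaway file (hypothetical sample data, not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "tweets.csv"
    txt_path = Path(tmp) / "tweets.txt"
    csv_path.write_text('first tweet,"second, with a comma",third tweet\n', encoding="utf-8")
    written = csv_to_lines(csv_path, txt_path)
    lines = txt_path.read_text(encoding="utf-8").splitlines()

print(written, lines)
```

Using the `csv` module rather than `str.split(",")` keeps quoted tweets that contain commas intact.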

## Next Steps
- Split the documents into smaller chunks to reduce the number of tokens in each one
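
To make the idea of chunking concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries a list of separators before falling back to characters, while this version just cuts fixed-size windows. The function name `split_text` is hypothetical.

```python
def split_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "tweet " * 500  # 3000 characters of stand-in text
chunks = split_text(sample, chunk_size=1000, chunk_overlap=0)
print(len(chunks), [len(c) for c in chunks])
```

With `chunk_overlap=0`, as in this notebook, the chunks partition the text exactly, so joining them gives back the original string.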

In [346]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, 
    chunk_overlap=0,
)

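# split_docs stays None when the documents were not reloaded (the store already exists)
split_docs = None
if raw_docs is not None:
    split_docs = text_splitter.split_documents(raw_docs)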

In [347]:
# %pip install langchain-chroma
# !ollama pull nomic-embed-text

from langchain_chroma import Chroma
from langchain_community.embeddings.ollama import OllamaEmbeddings

# Generate the embeddings for the documents 
embeddings = OllamaEmbeddings(
    base_url="http://localhost:11434",
    model="nomic-embed-text",
    temperature=0.1,
    show_progress=True,
    num_thread=16,
)

In [348]:
from os.path import isdir

# Build and persist the vector store only when it is missing or a rebuild is forced
if not isdir("./chroma") or force_embeds:
    # store in a vector store
    Chroma.from_documents(
        documents=split_docs,
        embedding=embeddings,
        persist_directory="./chroma",
    )

vectorstore = Chroma(
    persist_directory="./chroma", 
    embedding_function=embeddings
)

## Completed
- Defined embedding function
- Saved to vector database (chroma)

In [349]:
retriever = vectorstore.as_retriever(
    search_type="similarity",
#    search_kwargs={"k":6}
)
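
As a rough illustration of what a similarity retriever does, the sketch below ranks stored items by how close their vectors are to a query vector and returns the top `k`. The vectors are hand-made toys, not real embeddings, and cosine similarity is used only for illustration; the metric the vector store actually applies depends on its configuration. The names `cosine_similarity`, `top_k`, and `toy_store` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k stored items most similar to query_vec."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy 3-dimensional "embeddings" (made up for the example)
toy_store = [
    ("tweet about politics", [1.0, 0.0, 0.0]),
    ("tweet about sport", [0.0, 1.0, 0.0]),
    ("tweet about health", [0.0, 0.0, 1.0]),
    ("tweet about health and sport", [0.0, 0.7, 0.7]),
]

results = top_k([0.0, 0.1, 0.9], toy_store, k=2)
print(results)
```

The `k` argument plays the same role as the `k` in the commented-out `search_kwargs` above: it caps how many documents are handed to the prompt.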

In [350]:
from langchain_community.llms.ollama import Ollama

llm = Ollama(
    base_url="http://localhost:11434",
    model="mistral",
    temperature=0.1,
    num_thread=16,
)

In [351]:
def format_retrieved_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
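
A quick way to check the helper without a running vector store is to feed it stand-in objects. The sketch below assumes only that each retrieved document exposes a `page_content` attribute, which is all the function touches; `FakeDoc` is a hypothetical placeholder, not a LangChain class.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_retrieved_docs(docs):
    # Same body as the notebook cell: join document texts with blank lines
    return "\n\n".join(doc.page_content for doc in docs)


docs = [FakeDoc("First retrieved tweet."), FakeDoc("Second retrieved tweet.")]
context = format_retrieved_docs(docs)
print(context)
```

The blank line between entries keeps the retrieved tweets visually separate once they are pasted into the prompt.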

In [352]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

template = """Use the following pieces of context to generate a short sentence based on the article provided.
Keep the tweet as concise, sparse, and close to the context's provided language use as possible.

{context}

Article: {question}

Short Sentence:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_retrieved_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)
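
The piped chain above can be hard to read at first, so here is a plain-Python sketch of the same data flow: the input string is used once as the retrieval query and once, unchanged, as the `{question}` value. The retriever and model are replaced by hypothetical stubs (`fake_retrieve`, `fake_llm`), and the template is a shortened stand-in, so this shows the wiring only, not real output.

```python
# Shortened stand-in for the notebook's prompt template
TEMPLATE = "Context:\n{context}\n\nArticle: {question}\n\nShort Sentence:"


def fake_retrieve(query: str) -> list[str]:
    """Hypothetical retriever: ignores the query and returns fixed tweets."""
    return ["Fixed tweet one.", "Fixed tweet two."]


def fake_llm(prompt: str) -> str:
    """Hypothetical model: reports the prompt size instead of generating text."""
    return f"(model reply to a {len(prompt)}-character prompt)"


def run_chain(user_input: str) -> tuple[str, str]:
    """Mirror the chain: retrieve, format, fill the template, call the model."""
    docs = fake_retrieve(user_input)           # "context": retriever | format_retrieved_docs
    context = "\n\n".join(docs)
    prompt = TEMPLATE.format(context=context,  # custom_rag_prompt
                             question=user_input)  # "question": RunnablePassthrough()
    return prompt, fake_llm(prompt)            # llm | StrOutputParser()


prompt, reply = run_chain("Summary of the article goes here.")
print(prompt)
print(reply)
```

One consequence of this wiring: whatever string is passed to `invoke`, including any instructions prepended to the article summary, is also the text used as the retrieval query.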

In [353]:
article[0].metadata['summary']

'British journalist, television presenter, and author Rachel Johnson will answer your queries about sex and relationships in an upcoming Q&A.\nAs The Telegraph’s sex and relationships agony aunt, Rachel has responded to many readers’ dilemmas in her column, including concerns about marital affairs; queries about threesomes; intimacy issues; and how to be more adventurous in bed – and is keen to help more readers.\nIf you’ve got a burning question you’d like answered, submit it via the form below.\nIf you’d prefer to remain anonymous, simply tick the checkbox.'

In [354]:
request = "Match the context in vocabulary, writing style, and tone for the following content:\n\n" + str(article[0].metadata['summary'])

rag_chain.invoke(request)

OllamaEmbeddings:   0%|          | 0/1 [00:00<?, ?it/s]

OllamaEmbeddings: 100%|██████████| 1/1 [00:03<00:00,  3.44s/it]


" Rachel Johnson, a British journalist, television presenter, and author, will answer your sex and relationship questions in an upcoming Q&A as The Telegraph's agony aunt. Submit your burning question via the form below for her response, with the option to remain anonymous if preferred."

In [355]:
llm.invoke("Generate a short sentence that generally describes the given: \n\n" + str(article[0].metadata['summary']))

' Rachel Johnson, known for her work as a British journalist, television presenter, and author, will answer anonymous questions about sex and relationships in an upcoming Q&A session for The Telegraph.'