# Rational Future Prediction
Goal: build an LLM system that uses retrieval-augmented generation (RAG) to predict the next innovation in a particular field, treating the prediction as a time series forecasting problem.

## Step 1: Gather data for the vector database from a website containing "timeline" data.

In [35]:
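from bs4 import BeautifulSoup
import requests

# Download the timeline page and flatten its HTML into whitespace-joined text.
r = requests.get("https://www.telephonetribute.com/timeline.html")
html = r.text
clean_text = ' '.join(BeautifulSoup(html, "html.parser").stripped_strings)
# Keep the segment after the fourth "1881" on the page; split() drops the year, so prepend it again.
text = "1881"+clean_text.split("1881")[4]
text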

'1881: First long-distance line, between\nBoston and Providence, R.I. *[Note 5] 1908: AT&T President Theodore Vail sets a goal of Universal Service: providing\nhigh-quality telephone service to any American who wants it. *[Note 5] 1915: AT&T engineers transmit speech across the Atlantic Ocean by wireless\nradio. *[Note 5] 1917: Bell System engineers demonstrate oneway radiotelephone transmission from\nairplane to ground. *[Note 5] 1919: AT&T introduces dial telephones. *[Note 5] 1923: AT&T longdistance lines connect stations in the first network-radio\nbroadcast. *[Note 5] 1925: The Research and Engineering Departments of AT&T and the Western Electric\nCompany are incorporated as ell Telephone Laboratories. *[Note 5] 1927: Wireless telephone service between New York and London. *[Note 5] 1927: Don Juan, the first fulllength movie with sound uses AT&T equipment and\nthe next year, Warner Bros. produces The Jazz Singer with AT&T equipment. *[Note\n5] 1947: Bell Labs scientists Shockley, 

## To extract data in the format ['year']: ['sentence describing innovation'], prompt a Mistral model.

In [2]:
import getpass

from langchain_community.chat_models.huggingface import ChatHuggingFace
from langchain_community.llms import HuggingFaceEndpoint

# Prompt for the Hugging Face token instead of hard-coding it.
token = getpass.getpass("huggingface token")

# Wrap the hosted Mistral endpoint in a chat interface.
endpoint = HuggingFaceEndpoint(repo_id="mistralai/Mistral-7B-v0.1", huggingfacehub_api_token=token)
llm = ChatHuggingFace(llm=endpoint)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/devam/.cache/huggingface/token
Login successful


None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [39]:
filteredText = llm.invoke(f"Filter out data in the form of 'year':'sentence' from the following text document: {text}")

In [56]:
data = filteredText.content
with open('data.txt', 'w') as f:  # the context manager closes the file after writing
    n_chars = f.write(data)
n_chars

2550
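
Because the scraped text printed in Step 1 appears to follow a regular `year: sentence *[Note n]` pattern, the LLM's extraction could be cross-checked with a deterministic parse. The following is a minimal sketch under that assumption. `parse_timeline`, `NOTE_MARKER`, and `ENTRY` are hypothetical names that are not part of the notebook, and the pattern is inferred only from the excerpt shown above, so other parts of the page may not match it.

```python
import re

# Assumption: each entry ends with a footnote marker such as "*[Note 5]",
# possibly with a line break inside it.
NOTE_MARKER = re.compile(r"\*\[Note\s*\d+\]")
# Assumption: each entry starts with a four-digit year followed by a colon.
ENTRY = re.compile(r"^(\d{4}):\s*(.+)$", re.DOTALL)


def parse_timeline(raw: str) -> list[tuple[int, str]]:
    """Split raw timeline text into (year, sentence) pairs."""
    events = []
    for piece in NOTE_MARKER.split(raw):
        match = ENTRY.match(piece.strip())
        if match is None:
            continue  # skip fragments that do not look like "year: sentence"
        year, sentence = match.groups()
        # Collapse the line breaks left over from the HTML layout.
        events.append((int(year), " ".join(sentence.split())))
    return events


sample = (
    "1881: First long-distance line, between\nBoston and Providence, R.I. *[Note 5] "
    "1919: AT&T introduces dial telephones. *[Note 5]"
)
events = parse_timeline(sample)
for year, sentence in events:
    print(f"{year}: {sentence}")
```

Comparing these pairs with the contents of `data.txt` is one way to spot entries that the model dropped or reworded.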

## Step 2: Add the data to a vector database.

In [8]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
raw_documents = TextLoader('/Users/devam/PycharmProjects/rationalFuturePrediction/data.txt').load()
text_splitter = CharacterTextSplitter(separator="\n", chunk_size=20, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)
db = Chroma.from_documents(documents, OpenAIEmbeddings())

Created a chunk of size 67, which is longer than the specified 20
Created a chunk of size 139, which is longer than the specified 20
Created a chunk of size 81, which is longer than the specified 20
Created a chunk of size 99, which is longer than the specified 20
Created a chunk of size 38, which is longer than the specified 20
Created a chunk of size 84, which is longer than the specified 20
Created a chunk of size 135, which is longer than the specified 20
Created a chunk of size 61, which is longer than the specified 20
Created a chunk of size 151, which is longer than the specified 20
Created a chunk of size 80, which is longer than the specified 20
Created a chunk of size 74, which is longer than the specified 20
Created a chunk of size 106, which is longer than the specified 20
Created a chunk of size 104, which is longer than the specified 20
Created a chunk of size 85, which is longer than the specified 20
Created a chunk of size 39, which is longer than the specified 20
Creat

In [11]:
retriever = db.as_retriever()
docs = retriever.invoke("What happened in 2004?")
docs[0]

Document(page_content='1994: AT&T and McCaw merger completed.', metadata={'source': '/Users/devam/PycharmProjects/rationalFuturePrediction/data.txt'})

## Step 3: Create a vanilla agent that combines an LLM with the vector database.
Assess this base agent's ability to predict "already seen events" using cosine similarity between the predicted event and the actual event, as a function of the size of the "event chain" (how many past events the agent can see).

Also assess its time series forecasting ability by testing whether it correctly guesses (1) the year, (2) the next innovation, and (3) both.
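
The three checks could be scored roughly as follows. This is only a sketch: it assumes the predicted and actual events have already been embedded as plain vectors, and both the function names and the `threshold` that decides whether an innovation counts as "correct" are hypothetical choices, not values taken from this notebook.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def score_prediction(pred_year, true_year, pred_vec, true_vec, threshold=0.8):
    """Score one prediction on (1) year, (2) innovation, and (3) both."""
    year_ok = pred_year == true_year
    similarity = cosine(pred_vec, true_vec)
    # Assumption: an innovation is "correct" when its embedding is close enough.
    innovation_ok = similarity >= threshold
    return {
        "year": year_ok,
        "innovation": innovation_ok,
        "both": year_ok and innovation_ok,
        "similarity": similarity,
    }


hit = score_prediction(1919, 1919, [1.0, 0.0], [1.0, 0.0])
miss = score_prediction(1920, 1919, [1.0, 0.0], [0.0, 1.0])
print(hit)
print(miss)
```

Averaging these scores separately for each event chain size would give the chain-size comparison described above.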

Also assess "event chain skips": randomly omit a few events from a long chain and measure how that affects the scores above.
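
One way to build the skipped chains is sketched below. It assumes a chain is an ordered list of event strings; `skip_events` and its `seed` parameter are hypothetical names introduced for illustration, with the seed there only to make an experiment repeatable.

```python
import random


def skip_events(chain: list[str], n_skips: int, seed: int = 0) -> list[str]:
    """Return a copy of the chain with n_skips randomly chosen events removed.

    The surviving events keep their original chronological order.
    """
    if not 0 <= n_skips <= len(chain):
        raise ValueError("n_skips must be between 0 and len(chain)")
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(chain)), n_skips))
    return [event for i, event in enumerate(chain) if i not in dropped]


chain = ["e1", "e2", "e3", "e4", "e5", "e6"]
shortened = skip_events(chain, n_skips=2, seed=1)
print(shortened)
```

Scoring the agent on `chain` and on `shortened` with the same target event would isolate the effect of the omitted events.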

## Step 4: Fine-tune the LLM on event chains from the RAG database.
Fine-tune the LLM using the above metric as a loss function: given a series of events in a chain, predict the next event. Note that we do not treat the date separately, because the LLM demonstrates temporal reasoning when the date and the content are concatenated, as in the format described above.

Go through all the permutations of a chain. In a chain of 6 events, for example, we can start with the first event and go all the way down, and so on. A formal definition of this enumeration still needs to be worked out.
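
One possible formal reading, offered as an assumption rather than as the settled definition: pair every contiguous sub-chain with the event that immediately follows it. For a chain e_1, ..., e_n, the training examples are then (e_i, ..., e_{j-1}) -> e_j for all 1 <= i < j <= n, which gives n(n-1)/2 examples, or 15 for a chain of 6. The sketch below enumerates them; `chain_examples` and `min_context` are hypothetical names.

```python
def chain_examples(events: list[str], min_context: int = 1) -> list[tuple[list[str], str]]:
    """Enumerate (context, target) pairs from an ordered event chain.

    Each context is a contiguous run of at least min_context events,
    and the target is the event immediately after that run.
    """
    examples = []
    n = len(events)
    for start in range(n):
        for end in range(start + min_context, n):
            examples.append((events[start:end], events[end]))
    return examples


chain = ["e1", "e2", "e3", "e4", "e5", "e6"]
examples = chain_examples(chain)
print(len(examples))
print(examples[0])
print(examples[4])
```

Raising `min_context` drops the shortest contexts, which may be useful if one-event contexts turn out to carry too little signal.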



## Step 5: Create a refined agent that combines the fine-tuned LLM with the vector database. 

Assess this agent's ability to predict "already seen events" using cosine similarity between the predicted event and the actual event, as a function of the size of the "event chain" (how many past events the agent can see).

Also assess its time series forecasting ability by testing whether it correctly guesses (1) the year, (2) the next innovation, and (3) both.

Also assess "event chain skips": randomly omit a few events from a long chain and measure how that affects the scores above.

In [3]:
from langchain_community.utils.math import cosine_similarity

def temporalLoss(prediction, skip, topk):
    # cosine_similarity expects 2-D inputs (one embedding per row) and returns a similarity matrix.
    return -topk * cosine_similarity(prediction, skip)