In [5]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv(override=True)

# Retrieve API keys from environment variables
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENV = os.getenv("PINECONE_ENV")
GOOGLE_CSE_ID = os.getenv("GOOGLE_CSE_ID")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
APIFY_API_TOKEN = os.getenv("APIFY_API_TOKEN")
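
`os.getenv` returns `None` silently when a variable is missing, which tends to surface later as a confusing authentication error. Below is a minimal sketch of an early check. The helper name `missing_env_vars` is hypothetical (it is not part of `dotenv` or LangChain), and the choice of which keys count as required is an assumption based on the cells that follow.

```python
import os

def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]

# Assumption: only the OpenAI and Apify credentials are needed below.
required = ["OPENAI_API_KEY", "APIFY_API_TOKEN"]
missing = missing_env_vars(required)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dictionary as `environ` makes the helper easy to try without touching the real environment.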

Libraries added for this project:
- matplotlib
- apify_client
- chromadb (https://docs.trychroma.com/getting-started)
- pypdf
- arxiv

In [3]:
# visualise Mermaid diagram

import base64
from IPython.display import Image, display
import matplotlib.pyplot as plt

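def visualize(graph):
    # Base64-encode the diagram source; UTF-8 avoids errors on non-ASCII labels
    graphbytes = graph.encode('utf-8')
    base64_bytes = base64.b64encode(graphbytes)
    base64_string = base64_bytes.decode('ascii')
    # mermaid.ink renders the encoded diagram and returns it as an image
    display(Image(url='https://mermaid.ink/img/' + base64_string))
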
visualize("""
graph LR;
    A--> B & C & D;
    B--> A & E;
    C--> A & E;
    D--> A & E;
    E--> B & C & D;
""")

### Scrape Mermaid documentation using LangChain

### langchain.document_loaders

In [7]:
from langchain.document_loaders.base import Document
from langchain.utilities import ApifyWrapper
from langchain.indexes import VectorstoreIndexCreator

apify = ApifyWrapper()

url = 'https://mermaid.js.org/'

loader = apify.call_actor(
    actor_id='apify/website-content-crawler',
    run_input={'startUrls': [{'url': url}]},
    dataset_mapping_function=lambda item: Document(
        page_content=item['text'] or '', 
        metadata={'source': item['url']}
    ),
)

In [10]:
loader.schema()

{'title': 'ApifyDatasetLoader',
 'description': 'Logic for loading documents from Apify datasets.',
 'type': 'object',
 'properties': {'apify_client': {'title': 'Apify Client'},
  'dataset_id': {'title': 'Dataset Id', 'type': 'string'}},
 'required': ['dataset_id']}

### Use loader to embed data and store in index store

In [11]:
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator().from_loaders([loader])  # defaults: OpenAI embeddings stored in Chroma

### Query ChromaDB

In [12]:
query = 'What is the syntax for flowcharts?'
result = index.query_with_sources(query)
result

{'question': 'What is the syntax for flowcharts?',
 'answer': ' The syntax for flowcharts includes normal, thick, and dotted arrows, as well as special characters and subgraphs. It is also possible to declare multiple links and nodes in the same line, and to use new arrow types and multi-directional arrows.\n',
 'sources': 'https://mermaid.js.org/syntax/flowchart.html?id=flowcharts-basic-syntax, https://mermaid.js.org/syntax/flowchart.html?id=special-characters-that-break-syntax'}

### Let's turn this index into a Retriever

In [13]:
retriever = index.vectorstore.as_retriever()
# Change the number of documents returned each time the index store is queried
retriever.search_kwargs['k'] = 10
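
To make the effect of `k` concrete, here is a toy sketch of top-k retrieval by cosine similarity over hand-written vectors. It only illustrates the idea: the vector store's real index, distance metric, and embeddings will differ, and `top_k` and the sample data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, docs, k):
    """Return the k (name, vector) pairs most similar to query_vec."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return ranked[:k]

# Made-up "embeddings" for three documentation pages
docs = [
    ("flowchart", [1.0, 0.0, 0.1]),
    ("sequence", [0.0, 1.0, 0.0]),
    ("gantt", [0.2, 0.1, 1.0]),
]
names = [name for name, _ in top_k([0.9, 0.1, 0.0], docs, k=2)]
print(names)
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt.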

### Create a chain to augment ChatGPT with this tool

In [14]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

mermaid_qa = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=retriever,
)

In [15]:
text = """
Machine learning (ML) is a field devoted to understanding and building methods that let machines "learn" – that is, methods that leverage data to improve computer performance on some set of tasks.[1] Machine learning algorithms build a model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to do so.[2] Machine learning algorithms are used in a wide variety of applications, such as in medicine, email filtering, speech recognition, agriculture, and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.[3][4] A subset of machine learning is closely related to computational statistics, which focuses on making predictions using computers, but not all machine learning is statistical learning. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a related field of study, focusing on exploratory data analysis through unsupervised learning.[6][7] Some implementations of machine learning use data and neural networks in a way that mimics the working of a biological brain.[8][9] In its application across business problems, machine learning is also referred to as predictive analytics.
"""

In [16]:
query = """
Your job is to write the code to generate a colorful mermaid diagram describing the following text. 
Return only the code and make sure it has multiple colors

TEXT: {text}
"""
result = mermaid_qa.run(query.format(text=text))
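
Even when asked to return only code, chat models often wrap the diagram in a Markdown fence with a sentence of preamble. Below is a minimal sketch for pulling out the diagram source before passing it to `visualize`. The helper `extract_mermaid` is hypothetical and the sample reply is made up for illustration.

```python
import re

FENCE = "`" * 3  # built programmatically so this example holds no literal fence

# Match an opening fence (optionally tagged "mermaid"), then capture lazily
# up to the closing fence, or to the end of the text if the reply was cut off.
_PATTERN = re.compile(
    FENCE + r"(?:mermaid)?[ \t]*\n(.*?)(?:" + FENCE + r"|\Z)", re.DOTALL
)

def extract_mermaid(reply: str) -> str:
    """Return the body of the first fenced block, or the whole reply if unfenced."""
    match = _PATTERN.search(reply)
    return (match.group(1) if match else reply).strip()

sample = "Here is the code:\n\n" + FENCE + "mermaid\ngraph TD\nA --> B\n" + FENCE
print(extract_mermaid(sample))
```

With this in place, `visualize(extract_mermaid(result))` would avoid copying the code by hand, though the extracted text may still need syntax fixes before it renders.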

In [18]:
print(result)

Here is the code to generate a colorful mermaid diagram for the given text:

```mermaid
graph TD
A[Machine learning (ML)] --> B(Understanding and building methods)
A --> C(Leveraging data)
B --> D(Improving computer performance)
C --> D
D --> E(Machine learning algorithms)
E --> F(Building a model)
F --> G(Training data)
G --> H(Making predictions or decisions)
H --> I(Without explicit programming)
A --> J(Wide variety of applications)
J --> K(Medicine)
J --> L(Email filtering)
J --> M(Speech recognition)
J --> N(Agriculture)
J --> O(Computer vision)
J --> P(Difficult or unfeasible to develop conventional algorithms)
K --> Q(Performing medical tasks)
L --> Q
M --> Q
N --> Q
O --> Q
P --> Q
Q --> R(Subfield of machine learning)
R --> S(Computational statistics)
R --> T(Mathematical optimization)
R --> U(Application domains)
T --> V(Predictions using computers)
U --> V
V --> W(Data mining)
W --> X(Exploratory data analysis)
X --> Y(Unsupervised learning)
E --> Z(Neural networks)
Z --> AA
```

In [20]:
visualize("""
graph TD;
A["Machine learning (ML)"] --> B(Understanding and building methods);
A --> C(Leveraging data);
B --> D(Improving computer performance);
C --> D;
D --> E(Machine learning algorithms);
E --> F(Building a model);
F --> G(Training data);
G --> H(Making predictions or decisions);
H --> I(Without explicit programming);
A --> J(Wide variety of applications);
J --> K(Medicine);
J --> L(Email filtering);
J --> M(Speech recognition);
J --> N(Agriculture);
J --> O(Computer vision);
J --> P(Difficult or unfeasible to develop conventional algorithms);
K --> Q(Performing medical tasks);
L --> Q;
M --> Q;
N --> Q;
O --> Q;
P --> Q;
Q --> R(Subfield of machine learning);
R --> S(Computational statistics);
R --> T(Mathematical optimization);
R --> U(Application domains);
T --> V(Predictions using computers);
U --> V;
V --> W(Data mining);
W --> X(Exploratory data analysis);
X --> Y(Unsupervised learning);
E --> Z(Neural networks);
Z --> AA(Biological brain);
AA --> AB(Machine learning implementations);
AB --> AC(Business problems);
AC --> AD(Predictive analytics);
""")

In [21]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

prompt = PromptTemplate(
    input_variables=['text'],
    template='''Your job is to write the code to generate a colorful mermaid diagram describing the following text: {text}. 
Return only the code and make sure it has multiple colors'''
)

result = mermaid_qa.run(prompt.format(text=text))

In [22]:
print(result)

To generate a colorful mermaid diagram describing the given text, you can use the following code:

```mermaid
graph LR;
A(Machine learning) --> B(Understanding and building methods);
A --> C(Leveraging data);
A --> D(Improving computer performance);
B --> E(Methods that let machines "learn");
C --> F(Improving computer performance);
D --> F(Improving computer performance);
E --> G(Building a model based on sample data);
G --> H(Making predictions or decisions);
H --> I(Without explicit programming);
F --> J(Used in various applications);
J --> K(Medicine, email filtering, speech recognition, agriculture, computer vision);
K --> L(Difficult or unfeasible to develop conventional algorithms);
M(Machine learning subset);
M --> N(Computational statistics);
N --> O(Making predictions using computers);
P(Mathematical optimization);
P --> Q(Methods, theory, application domains);
R(Data mining);
R --> S(Exploratory data analysis);
S --> T(Unsupervised learning);
U(Biological brain mimicking);
U
```

In [1]:
import arxiv

paper = next(arxiv.Search(id_list=['1706.03762']).results())
paper.download_pdf(filename='attention.pdf')

'./attention.pdf'

In [2]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('attention.pdf')
pages = loader.load_and_split()

print(pages[0])

page_content='Attention Is All You Need\nAshish Vaswani\x03\nGoogle Brain\navaswani@google.comNoam Shazeer\x03\nGoogle Brain\nnoam@google.comNiki Parmar\x03\nGoogle Research\nnikip@google.comJakob Uszkoreit\x03\nGoogle Research\nusz@google.com\nLlion Jones\x03\nGoogle Research\nllion@google.comAidan N. Gomez\x03y\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser\x03\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin\x03z\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer,\nbased solely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to\nbe superior in quality while being more parallelizable

In [4]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI()

In [5]:
from langchain.chains.question_answering import load_qa_chain

concept_chain = load_qa_chain(
    llm=llm, 
    chain_type='map_reduce'
)

In [6]:
query = """
The text provided as context is an extract of an academic paper in the field of machine learning.
Your job is to find the main concepts that are explained in this paper and return those concepts as a list, all on one line separated by the characters '||'. Order that list from most essential concept for the paper to least essential. Limit the list to the 5 most important concepts

Example: concept 1||concept 2||concept 3 
"""

result = concept_chain({'input_documents': pages, 'question': query})

result['output_text']

'Transformer||Attention mechanisms||Encoder-decoder architecture||Self-attention||Multi-Head Attention'
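
The `map_reduce` chain type is useful when the documents are too long for a single prompt. Schematically, the question is asked against each chunk on its own, and the partial answers are then combined in one final call. The sketch below shows that control flow with a stand-in for the model. It is not LangChain's implementation, and `map_reduce_answer` and `fake_llm` are hypothetical names.

```python
def map_reduce_answer(chunks, question, llm):
    """Answer a question over many chunks with one call per chunk plus one final call."""
    # Map step: ask the question against each chunk separately.
    partial = [llm(f"Context: {chunk}\nQuestion: {question}") for chunk in chunks]
    # Reduce step: ask once more, using the partial answers as the context.
    combined = "\n".join(partial)
    return llm(f"Context: {combined}\nQuestion: {question}")

calls = []

def fake_llm(prompt):
    """Stand-in model that records each prompt and returns a numbered answer."""
    calls.append(prompt)
    return f"answer {len(calls)}"

result_text = map_reduce_answer(
    ["chunk one", "chunk two", "chunk three"], "What is this about?", fake_llm
)
print(len(calls), result_text)
```

The number of model calls grows with the number of chunks, which is worth keeping in mind for long papers.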

In [8]:
concept_list = result['output_text'].split('||')
concept_list

['Transformer',
 'Attention mechanisms',
 'Encoder-decoder architecture',
 'Self-attention',
 'Multi-Head Attention']
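
A bare `split('||')` keeps stray whitespace and empty entries if the model adds spaces around the separator or ends with a trailing `||`. Below is a slightly more defensive sketch; `parse_concepts` is a hypothetical helper and the sample string is made up.

```python
def parse_concepts(output_text, limit=5):
    """Split on '||', trim whitespace, drop empties and case-insensitive repeats."""
    seen = set()
    concepts = []
    for part in output_text.split("||"):
        part = part.strip()
        key = part.lower()
        if part and key not in seen:
            seen.add(key)
            concepts.append(part)
    # Enforce the limit in code rather than relying on the prompt alone
    return concepts[:limit]

parsed = parse_concepts("Transformer || Attention mechanisms||Self-attention||")
print(parsed)
```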

In [9]:
explain_chain = load_qa_chain(
    llm=llm, 
    chain_type='map_reduce'
)        

In [10]:
query = """
The text provided as context is an extract from an academic paper in machine learning.

Your job is to use that context to explain the following concept. 
The answer must be self-contained so you cannot refer to the article in the answer.
Imagine you are a Computer Science Professor teaching at the university.
Respond only with an explanation of the concept:


CONCEPT: {concept} 
"""

In [11]:
explanations = []

for concept in concept_list:
    
    result = explain_chain({
        'input_documents': pages, 
        'question': query.format(concept=concept)
    })
    
    explanations.append({
        'concept': concept,
        'explanation': result['output_text']  
    })

explanations[0]['explanation']

'The concept of a Transformer refers to a type of neural network architecture that has revolutionized the field of natural language processing (NLP). It is a deep learning model specifically designed for handling sequential data, such as sentences or documents. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers rely on a self-attention mechanism to capture the relationships between different words or tokens in a sequence.\n\nIn a Transformer, the input sequence is first encoded into a set of embeddings. These embeddings are then processed through multiple layers of self-attention and feed-forward neural networks. The self-attention mechanism allows each word in the sequence to attend to all other words, capturing the importance and relevance of each word in the context of the entire sequence.\n\nBy leveraging self-attention, Transformers can model dependencies between words regardless of their position in the sequence, making them 

In [12]:
print(explanations[0]['explanation'])

The concept of a Transformer refers to a type of neural network architecture that has revolutionized the field of natural language processing (NLP). It is a deep learning model specifically designed for handling sequential data, such as sentences or documents. Unlike traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers rely on a self-attention mechanism to capture the relationships between different words or tokens in a sequence.

In a Transformer, the input sequence is first encoded into a set of embeddings. These embeddings are then processed through multiple layers of self-attention and feed-forward neural networks. The self-attention mechanism allows each word in the sequence to attend to all other words, capturing the importance and relevance of each word in the context of the entire sequence.

By leveraging self-attention, Transformers can model dependencies between words regardless of their position in the sequence, making them highl