#pip show openai

In [None]:
#!pip install -r ./requirements.txt -q
# !pip runs pip in a shell, which may target a different environment (e.g. base) than the running kernel
# pip (the %pip magic) installs into the environment of the running kernel

In [1]:
pip show langchain

Name: langchain
Version: 0.0.338
Summary: Building applications with LLMs through composability
Home-page: https://github.com/langchain-ai/langchain
Author: 
Author-email: 
License: MIT
Location: c:\users\hp\anaconda3\lib\site-packages
Requires: pydantic, jsonpatch, tenacity, requests, async-timeout, numpy, PyYAML, aiohttp, langsmith, dataclasses-json, SQLAlchemy, anyio
Required-by: langchain-experimental
Note: you may need to restart the kernel to use updated packages.


### Python-dotenv

In [3]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [4]:
#os.environ.get('PINECONE_API_KEY')
os.environ.get('PINECONE_ENV')
#os.environ.get('OPENAI_API_KEY')
# Check that the Pinecone environment name was loaded from the .env file

'gcp-starter'
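
A missing key usually surfaces later as an authentication error, so it can help to check up front that every expected variable was loaded. The sketch below is a minimal, hypothetical helper (`missing_env_vars` is not part of python-dotenv); the variable names are the ones this notebook reads, and the example uses a fake mapping so that no real keys are needed.

```python
import os

def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]

# Hypothetical environment: one key absent, one present, one empty
fake_env = {'PINECONE_ENV': 'gcp-starter', 'OPENAI_API_KEY': ''}
print(missing_env_vars(['PINECONE_API_KEY', 'PINECONE_ENV', 'OPENAI_API_KEY'], fake_env))
```

Calling `missing_env_vars([...])` without a second argument checks the real `os.environ` after `load_dotenv` has run.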

### LLM Models (Wrappers): GPT-3

In [4]:
from langchain.llms import OpenAI
llm = OpenAI(model_name='text-davinci-003', temperature=0.7, max_tokens=512)
print(llm)

OpenAI
Params: {'model_name': 'text-davinci-003', 'temperature': 0.7, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'logit_bias': {}, 'max_tokens': 512}


The LLM wrapper class provides a common interface to large language models from many providers; OpenAI is one of them.

In [5]:
output= llm("explain quantum mechanics in one sentence")
# similar to calling the OpenAI API directly
print(output)



Quantum mechanics is a theory that explains the behavior of matter and energy on the atomic and subatomic level.


In [6]:
print(llm.get_num_tokens("explain quantum mechanics in one sentence"))

7


In [7]:
output = llm.generate(['... is the capital of France.', 
                   'What is the formula for the area of a circle'])

In [8]:
print(output.generations)

[[Generation(text='\n\nParis.', generation_info={'finish_reason': 'stop', 'logprobs': None})], [Generation(text='\n\nThe formula for the area of a circle is A = πr2, where r is the radius of the circle.', generation_info={'finish_reason': 'stop', 'logprobs': None})]]


In [9]:
print(output.generations[1][0].text)
# This prints the answer to the second question



The formula for the area of a circle is A = πr2, where r is the radius of the circle.


In [10]:
print(output.generations[0][0].text)
# This prints the answer to the first question



Paris.


In [11]:
#to check how many generations it gives
len(output.generations)

2
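
The shape of `output.generations` is a list of lists: one inner list per prompt, holding one item per completion. The sketch below mimics that shape with stand-in objects (`SimpleNamespace` here is only a placeholder for LangChain's `Generation`, and `first_texts` is a hypothetical helper) and strips the leading newlines visible in the raw output above.

```python
from types import SimpleNamespace

# Stand-in for output.generations: one inner list per prompt, one item per completion
generations = [
    [SimpleNamespace(text='\n\nParis.')],
    [SimpleNamespace(text='\n\nThe formula for the area of a circle is A = πr2, where r is the radius of the circle.')],
]

def first_texts(generations):
    """Return the first completion for each prompt, with surrounding whitespace removed."""
    return [candidates[0].text.strip() for candidates in generations]

for text in first_texts(generations):
    print(text)
```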

In [12]:
output = llm.generate(['Write an original tagline for burger restaurant'] * 3)

In [13]:
for o in output.generations:
    print(o[0].text, end='')



"Taste the Burger Revolution!"

"Satisfy your craving with our juicy Burgers!"

"Try our Burgers and Taste the Difference!"

### ChatModels: GPT-3.5-Turbo and GPT-4

In [14]:
from langchain.schema import(
    AIMessage, 
    HumanMessage, 
    SystemMessage
)
from langchain.chat_models import ChatOpenAI

In [15]:
chat = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.5, max_tokens=1024)
messages = [
    SystemMessage(content='You are a physicist and respond only in German'),  # sets the assistant's behaviour
    HumanMessage(content='explain quantum mechanics in one sentence')
]
output = chat(messages)

In [16]:
print(output)

content='Quantenmechanik beschreibt das Verhalten von Teilchen auf atomarer und subatomarer Ebene.'


In [17]:
print(output.content) ## to view only the text

Quantenmechanik beschreibt das Verhalten von Teilchen auf atomarer und subatomarer Ebene.


## Prompt Templates

In [18]:
from langchain import PromptTemplate

In [19]:
template= ''' You are an experienced virologist
Write a few sentences about following {virus} in {language}.'''

prompt = PromptTemplate(
    input_variables=['virus', 'language'],
    template=template
)
print(prompt)

input_variables=['language', 'virus'] template=' You are an experienced virologist\nWrite a few sentences about following {virus} in {language}.'
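
With the default settings, filling in a prompt template behaves much like Python's own `str.format`. The sketch below uses only the standard library to show the idea: it extracts the placeholder names from the template and substitutes values. `template_variables` is a hypothetical helper, not a LangChain function, and this is an approximation of the behaviour rather than the library's implementation.

```python
from string import Formatter

template = ''' You are an experienced virologist
Write a few sentences about following {virus} in {language}.'''

def template_variables(template):
    """Return the sorted placeholder names found in a str.format-style template."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})

print(template_variables(template))
print(template.format(virus='HIV', language='French'))
```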


In [20]:
from langchain.llms import OpenAI
llm = OpenAI(model_name='text-davinci-003', temperature=0.7)
output = llm(prompt.format(virus='HIV', language='yoruba'))
print(output)



Hivi ni olufẹ wa lati ṣawari aworan HIV ni Yoruba. Bi o ṣe le gba awọn oṣiṣẹ ti HIV, iṣẹ ti o yoo pese ni gbogbo iru awọn aaye ti o ṣe itọju fun aye ti o n ṣe. Awọn aini ti o n ṣe ni imọran, iṣẹ iṣẹ, iṣọrọ, ṣiṣe ṣe ti a fi ṣe pupọ ati ṣe igbese fun awọn aaye ti o ṣe.


## Simple Chains

In [21]:
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.5)

template= ''' You are an experienced virologist
Write a few sentences about following {virus} in {language}.'''

prompt = PromptTemplate(
    input_variables=['virus', 'language'],
    template=template
)

chain =LLMChain(llm=llm, prompt=prompt)
output = chain.run({'virus' :'HIV', 'language' : 'french'})

In [22]:
print(output)

Le VIH (Virus de l'Immunodéficience Humaine) est un virus qui attaque le système immunitaire de l'organisme humain. Il se transmet principalement par voie sexuelle, par le sang contaminé ou de la mère à l'enfant pendant la grossesse, l'accouchement ou l'allaitement. Une fois infectée, une personne peut développer le SIDA (Syndrome de l'Immunodéficience Acquise), une maladie qui affaiblit progressivement le système immunitaire et rend l'organisme vulnérable à diverses infections opportunistes. Le VIH peut être contrôlé grâce à des traitements antirétroviraux, mais il n'existe pas encore de vaccin ou de remède définitif contre cette maladie.


In [23]:
### removing the second variable, default language will be English
from langchain.chat_models import ChatOpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0.5)

template= ''' You are an experienced virologist
Write a few sentences about following {virus}.'''

prompt = PromptTemplate(
    input_variables=['virus'],
    template=template
)

chain =LLMChain(llm=llm, prompt=prompt)
output = chain.run('HSV')

In [24]:
print(output)

HSV, or herpes simplex virus, is a highly prevalent virus that infects humans. There are two types of HSV: HSV-1 and HSV-2. HSV-1 primarily causes oral herpes, typically characterized by cold sores or fever blisters around the mouth and on the face. On the other hand, HSV-2 commonly causes genital herpes, which manifests as painful blisters or sores in the genital area. Both types of HSV can be transmitted through direct contact with an infected person or through contact with their bodily fluids. Although there is no cure for HSV, antiviral medications can help manage and reduce the frequency and severity of outbreaks.


## Sequential Chains

In [25]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

## define the first llm
llm1 = OpenAI(model_name='text-davinci-003', temperature=0.7, max_tokens=1024)
prompt1 = PromptTemplate(
    input_variables=['concept'],
    template = '''You are an experienced scientist and Python programmer. 
    Write a function that implements the concept of {concept}'''
)
chain1 = LLMChain(llm=llm1, prompt=prompt1)

llm2 = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=1.2)
prompt2 = PromptTemplate(
    input_variables=['function'],
    template='Given the Python {function}, describe it as detailed as possible'
)
chain2 = LLMChain(llm=llm2, prompt=prompt2)

overall_chain = SimpleSequentialChain(chains=[chain1, chain2], verbose=True)
output = overall_chain.run('linear regression')



> Entering new SimpleSequentialChain chain...
.

def linearRegression(x, y):
    """
    Implements the concept of linear regression
    Parameters:
        x (array): Array of x-values
        y (array): Array of y-values
    Returns:
        m (float): The slope of the regression line
        b (float): The y-intercept of the regression line
    """
    # Calculate the sums
    x_sum = sum(x)
    y_sum = sum(y)
    xy_sum = sum([x[i] * y[i] for i in range(len(x))])
    x_squared_sum = sum([x[i] ** 2 for i in range(len(x))])
    
    # Calculate the slope and intercept
    m = (len(x) * xy_sum - x_sum * y_sum) / (len(x) * x_squared_sum - x_sum ** 2)
    b = (y_sum - m * x_sum) / len(x)
    
    return m, b[0m
[33;1m[1;3mThis Python code is implementing the concept of linear regression. It takes two arrays as input - x (which represents the independent variable) and y (which represents the dependent variable). The function calculates the slope (m) and y-interc

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain import PromptTemplate
from langchain.chains import LLMChain, SimpleSequentialChain

## define the first llm
llm1 = OpenAI(model_name='text-davinci-003', temperature=0.7, max_tokens=1024)
prompt1 = PromptTemplate(
    input_variables=['concept'],
    template = '''You are an experienced scientist and Python programmer. 
    Write a function that implements the concept of {concept}'''
)
chain1 = LLMChain(llm=llm1, prompt=prompt1)

llm2 = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=1.2)
prompt2 = PromptTemplate(
    input_variables=['function'],
    template='Given the Python {function}, describe it as detailed as possible'
)
chain2 = LLMChain(llm=llm2, prompt=prompt2)

overall_chain = SimpleSequentialChain(chains=[chain1, chain2], verbose=True)
output = overall_chain.run('softmax')
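
Conceptually, a simple sequential chain passes a single string along: the output of each step becomes the input of the next. The sketch below shows that data flow with plain functions. `run_sequential`, `write_function` and `describe_function` are hypothetical stand-ins for the two LLM-backed chains, so no API call is made.

```python
def run_sequential(steps, first_input):
    """Feed first_input to the first step and each result to the next step."""
    value = first_input
    for step in steps:
        value = step(value)
    return value

# Hypothetical stand-ins for chain1 (writes code) and chain2 (describes it)
def write_function(concept):
    return f'def {concept.replace(" ", "_")}(): ...'

def describe_function(function):
    return f'Description of: {function}'

print(run_sequential([write_function, describe_function], 'linear regression'))
```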

## LangChain Agents

In [45]:
#pip install langchain_experimental -q

Note: you may need to restart the kernel to use updated packages.


In [26]:
from langchain_experimental.agents.agent_toolkits import create_python_agent
from langchain_experimental.tools import PythonREPLTool
from langchain.llms import OpenAI

In [48]:
llm = OpenAI(temperature = 0)
agent_executor = create_python_agent(
    llm=llm,
    tool=PythonREPLTool(),  # the tool the agent uses to execute Python code
    verbose=True  # show all intermediate steps
)
agent_executor.run('Calculate the square root of the factorial of 20 and display it with 4 decimal points')




> Entering new AgentExecutor chain...


Python REPL can execute arbitrary code. Use with caution.


I need to calculate the factorial of 20 and then take the square root
Action: Python_REPL
Action Input: from math import factorial; print(round(factorial(20)**0.5, 4))
Observation: 1559776268.6285

Thought: I now know the final answer
Final Answer: 1559776268.6285

> Finished chain.


'1559776268.6285'

In [27]:
llm = OpenAI(temperature = 0)
agent_executor = create_python_agent(
    llm=llm,
    tool=PythonREPLTool(),  # the tool the agent uses to execute Python code
    verbose=True  # show all intermediate steps
)
agent_executor.run('What is the answer to 5.1 ** 7.3')




> Entering new AgentExecutor chain...


Retrying langchain.llms.openai.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised APIConnectionError: Error communicating with OpenAI: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')).
Python REPL can execute arbitrary code. Use with caution.


I need to use the Python REPL to calculate this
Action: Python_REPL
Action Input: print(5.1 ** 7.3)
Observation: 146306.05007233328

Thought: I now know the final answer
Final Answer: 146306.05007233328

> Finished chain.


'146306.05007233328'
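
Because the agent only writes code and runs it in a Python REPL, its answers can be checked without calling an LLM. The sketch below re-evaluates the two expressions shown in the `Action Input` lines above; the variable names are illustrative only.

```python
import math

# Same expressions the agent submitted to the Python REPL tool
sqrt_of_factorial = round(math.factorial(20) ** 0.5, 4)
power = 5.1 ** 7.3

print(sqrt_of_factorial)
print(power)
```

If the printed values differ from the agent's final answers, the agent's reasoning steps are the first place to look.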

## Diving into Pinecone

In [2]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True) 

True

In [5]:
import pinecone
pinecone.init(
    api_key=os.environ.get('PINECONE_API_KEY'),
    environment= os.environ.get('PINECONE_ENV')
)



In [6]:
pinecone.info.version()

VersionResponse(server='2.0.11', client='2.2.4')

## Pinecone Indexes

In [7]:
#list indexes, you can create 1 index on free account
pinecone.list_indexes()

['langchain-pinecone']

In [8]:
# create the index if it doesn't exist
index_name = 'langchain-pinecone'
if index_name not in pinecone.list_indexes():
    print(f'Creating index {index_name} ....')
    pinecone.create_index(index_name, dimension=1536, metric='cosine', pods=1, pod_type='p1.x2')
    print('Done')
else:
    print(f'Index {index_name} already exists')

Index langchain-pinecone already exists


In [9]:
pinecone.describe_index(index_name)

IndexDescription(name='langchain-pinecone', metric='cosine', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='starter', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

In [6]:
# deleting index
index_name = 'langchain-pinecone'
if index_name in pinecone.list_indexes():
    print(f'Deleting index {index_name} ... ')
    pinecone.delete_index(index_name)
    print('Done')
else:
    print(f'Index {index_name} does not exist!')

Deleting index langchain-pinecone ... 
Done


In [9]:
# working with index
index_name = 'langchain-pinecone'
index = pinecone.Index(index_name)
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

In [10]:
import random
vectors = [random.random() for _ in range(1536)] #this is 1 vector
vectors

[0.18227244104866958,
 0.9095994382934515,
 0.1313781455459252,
 0.5895443428084353,
 0.6296881882315492,
 0.5784768414407037,
 0.5633955222516666,
 0.6710923297098202,
 0.7650903684922951,
 0.5784270010624492,
 0.9313805908588364,
 0.5560664741294808,
 0.21458806188997048,
 0.9428481021743366,
 0.9347232410032748,
 0.718843665277429,
 0.7016284447200154,
 0.7221223858406913,
 0.39275657041509626,
 0.5317549613723535,
 0.30262737929783423,
 0.035674920127148235,
 0.1571385637166678,
 0.05817632920050109,
 0.6965728522037838,
 0.8232593976609641,
 0.9549156072920805,
 0.8886080342215323,
 0.44054743533298824,
 0.3705958927783132,
 0.20005793394539717,
 0.5023107655587692,
 0.09131154286482235,
 0.05307623519827542,
 0.8231308291512247,
 0.00021254915156654342,
 0.4252229601280668,
 0.15471358444848382,
 0.6186957984657191,
 0.9331346674518439,
 0.18257917832421378,
 0.6583671525248844,
 0.21445266111820394,
 0.8733451438886304,
 0.6339132934359181,
 0.896429741532564,
 0.976796475476596

In [37]:
#inserting into a Pinecone index
# import random
vectors = [[random.random() for _ in range(1536)] for v in range(5)] #a list with 5 vectors
#vectors
ids = list('abcde')

In [38]:
index_name = 'langchain-pinecone'
index = pinecone.Index(index_name)
index.upsert(vectors=zip(ids, vectors))  # upsert inserts new ids and overwrites ids that already exist

{'upserted_count': 5}

In [15]:
# update vector 'c' so that every component is 0.3
index.upsert(vectors = [('c', [0.3] * 1536)])

{'upserted_count': 1}
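
The upsert semantics can be pictured with an ordinary dictionary: new ids are inserted and existing ids are overwritten, so the total count only grows for ids not seen before. The sketch below is an in-memory analogy under that assumption, not Pinecone's implementation; the `upsert` function and the 3-dimensional vectors are illustrative.

```python
def upsert(store, items):
    """Insert new ids and overwrite existing ones; return the number of items written."""
    count = 0
    for vector_id, values in items:
        store[vector_id] = values
        count += 1
    return {'upserted_count': count}

store = {}
print(upsert(store, zip('abcde', [[0.1] * 3 for _ in range(5)])))  # five new ids
print(upsert(store, [('c', [0.3] * 3)]))                           # overwrites 'c'
print(len(store), store['c'])
```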

In [16]:
# fetching vectors by id
index =  pinecone.Index('langchain-pinecone')
index.fetch(ids = ['c', 'd'])

{'namespace': '',
 'vectors': {'c': {'id': 'c',
                   'metadata': {},
                   'values': [0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
                              0.3,
       

In [17]:
#deleting vectors
index.delete(ids=['b', 'c'])

{}

In [42]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 5e-05,
 'namespaces': {'': {'vector_count': 5}},
 'total_vector_count': 5}

In [19]:
index.fetch(ids=['b'])  # b was deleted, so no vectors are returned

{'namespace': '', 'vectors': {}}

In [23]:
#index.delete(delete_all=True)  # would delete every vector in the index; this did not work on the free account

#deleting vectors
index.delete(ids=['a', 'd', 'e'])

{}

In [41]:
#querying

# a single random query vector with the same dimension as the index
query_vector = [random.random() for _ in range(1536)]

In [36]:
# The older queries=[...] argument took a list of vectors; this call passes a single
# vector through the vector argument instead (https://docs.pinecone.io/reference/query)
index.query(
    vector=query_vector,
    top_k=3,
    include_values=False
)

{'matches': [], 'namespace': ''}
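
The query returns no matches here because every vector was deleted from the index beforehand. A brute-force, in-memory version of a cosine `top_k` query makes the behaviour easy to see. This is a sketch under simplifying assumptions: the `cosine` and `query` functions and the 2-dimensional vectors are illustrative, and a real vector database uses approximate search rather than scoring every vector.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def query(store, vector, top_k=3):
    """Return up to top_k (id, score) pairs, best match first."""
    scored = ((vector_id, cosine(vector, values)) for vector_id, values in store.items())
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])

store = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
print(query(store, [1.0, 0.1], top_k=2))
print(query({}, [1.0, 0.1]))  # an empty store yields no matches
```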

## Splitting and Embedding Text Using LangChain

In [43]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True) 

True

In [46]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
with open('files/churcill_speech.txt') as f:
    churchill_speech = f.read()
    
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # maximum number of characters per chunk
    chunk_overlap=20,  # maximum number of characters shared by consecutive chunks
    length_function=len
    )

In [54]:
chunks = text_splitter.create_documents([churchill_speech])
#print(chunks[2])  # each chunk is a Document holding at most chunk_size characters
#print(chunks[10].page_content)
print(f'Now you have {len(chunks)} chunks')

Now you have 326 chunks
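
The roles of `chunk_size` and `chunk_overlap` are easiest to see in a fixed-window splitter. The sketch below is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines and spaces, so its chunks are often shorter than `chunk_size`, whereas the hypothetical `chunk_text` function here simply slides a window over the characters.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Split text into windows of at most chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = 'we shall fight on the beaches, ' * 10
pieces = chunk_text(sample, chunk_size=100, chunk_overlap=20)
print(len(pieces), [len(p) for p in pieces])
print(pieces[0][-20:] == pieces[1][:20])  # the tail of one chunk starts the next
```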


### Embedding Cost

In [56]:
# calculating the embedding cost ahead of time avoids surprises
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens/ 1000 * 0.0004:.6f}')

print_embedding_cost(chunks)

Total Tokens: 4902
Embedding Cost in USD: 0.001961
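
The cost figure is plain arithmetic: tokens divided by 1,000, multiplied by the price per 1,000 tokens. The sketch below separates the price into a parameter so it can be updated. The default of 0.0004 USD is the value hard-coded in this notebook and should be treated as an assumption, since provider pricing changes; `embedding_cost_usd` is a hypothetical helper.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens=0.0004):
    """Cost of embedding total_tokens at a flat price per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

# Token count reported for the speech chunks above
print(f'Embedding Cost in USD: {embedding_cost_usd(4902):.6f}')
```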


In [71]:
#Instantiate OpenAI embedding
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [60]:
# The OpenAIEmbeddings class can be used to embed text into vectors
#vector = embeddings.embed_query('abc')  # quick test: embed the string 'abc'
vector = embeddings.embed_query(chunks[0].page_content)
print(vector)


[-0.03798534034355896, -0.034565106564583586, -0.0012712523070555657, 0.007617796888102064, 0.01153681631204899, 0.02679184306539319, -0.02517241464432262, -0.019031536126059623, 0.0064097024552436715, 0.004446953603306, 0.013551387212462341, 0.031248514088591813, 0.006795126583371242, -0.005402417005036996, 0.00910767181779796, -0.008330346212936992, 0.024058245491539083, 0.0070995796629743195, 0.005389461842164046, -0.021402381688890436, -0.008466377984240079, -0.02285339159562878, 0.03016025805552194, -0.005988650581477578, -0.014108470857531523, -0.017995102607126723, 0.017645305293315084, -0.022011287848096594, -0.00010141678077233529, -0.017956234790201402, 0.0031530294119768024, -0.00750767590820617, -0.016323849809274003, -0.004955454805525003, -0.014484178497089058, -0.02702504065371922, -0.021868776865542502, -0.004702823774397605, 0.013629120052345215, -0.014095515228997279, 0.022529505538885632, 0.001333600279846363, 0.00026923016825241907, 0.012158677168466801, -0.03197401

## Inserting Embedding into Pinecone Index

In [63]:
import os
import pinecone
from langchain.vectorstores import Pinecone

pinecone.init(api_key=os.environ.get('PINECONE_API_KEY'), environment=os.environ.get('PINECONE_ENV'))

In [67]:
# delete every existing index before creating a new one (the free account allows only one index)
index = pinecone.list_indexes()
for i in index:
    print(f'Deleting {i} all', end = '')
    pinecone.delete_index(i)
    print('Done')

Deleting langchain-pinecone allDone


In [69]:
# the dimension of OpenAI embeddings is 1536
index_name = 'churchill-speech'
if index_name not in pinecone.list_indexes():
    print(f'Creating index {index_name}...')
    pinecone.create_index(index_name, dimension=1536, metric='cosine')
    print('Done')

Creating index churchill-speech...
Done


In [72]:
vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)  # takes 3 arguments
# chunks is the list of documents produced by the recursive text splitter
# embeddings is the OpenAIEmbeddings instance; it converts each chunk's text into a vector used for similarity search in Pinecone
# index_name is the name of the Pinecone index that will store the vectors

# In a nutshell, the call above embeds the chunks, stores the vectors in the index and returns a vector store that supports similarity search


### Asking Questions (Similarity Search)

In [73]:
query = 'Where should we fight?'
result = vector_store.similarity_search(query)
print(result)

[Document(page_content='shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and'), Document(page_content='front, now on that, fighting'), Document(page_content='end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing'), Document(page_content='Winston Churchill Speech - We Shall Fight on the Beaches \nWe Shall Fight on the Beaches')]


In [74]:
for r in result:
    print (r.page_content)
    print('-'* 50)# for readability

shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and
--------------------------------------------------
front, now on that, fighting
--------------------------------------------------
end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing
--------------------------------------------------
Winston Churchill Speech - We Shall Fight on the Beaches 
We Shall Fight on the Beaches
--------------------------------------------------


In [76]:
# use an LLM to turn the most relevant chunks into a natural-language answer

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model ='gpt-3.5-turbo', temperature=1)

# The retriever is a generic interface that makes it easy to combine documents with large language models
# k = 3 means that it will return the 3 most similar chunks

retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k' : 3})
# create a chain to answer questions

chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever = retriever)

In [77]:
query = 'Where should we fight?'
answer = chain.run(query)
print(answer)

We should fight on the beaches, the landing grounds, in the fields, in France, on the seas and oceans.
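
The `"stuff"` chain type places all retrieved chunks into a single prompt and sends it to the model in one call. The sketch below shows that assembly step with plain strings. The prompt wording and the `build_stuff_prompt` function are hypothetical, not LangChain's actual template, and the chunks are taken from the similarity search output earlier in this notebook.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = '\n\n'.join(chunks)
    return (
        'Use the following context to answer the question.\n\n'
        f'{context}\n\n'
        f'Question: {question}\nAnswer:'
    )

retrieved = [
    'shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields and',
    'front, now on that, fighting',
    'end, we shall fight in France, we shall fight on the seas and oceans, we shall fight with growing',
]
print(build_stuff_prompt('Where should we fight?', retrieved))
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason to keep `k` small.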


In [78]:
query = 'Who was the king of Belgium at that time?'
answer = chain.run(query)
print(answer)

The king of Belgium at that time was King Leopold.


In [80]:
query = 'What about the French Armies?'
answer = chain.run(query)
print(answer)

The context suggests that the French Armies were involved in the situation being discussed. Specifically, there is mention of French troops holding a certain area and the creation of a French Army that was supposed to advance across the Somme. However, the details provided do not give further information about the current status or specific actions of the French Armies.
