#### Databricks' Free Dolly with LangChain

- To use the pipeline with LangChain, set `return_full_text=True`. LangChain expects the full text (prompt plus completion) to be returned, whereas the pipeline's default is to return only the newly generated text.
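
Why the flag matters can be sketched with a toy example. It uses no model: `toy_generate` and `strip_prompt` are hypothetical stand-ins, and the slicing in `strip_prompt` is an assumption about how a wrapper that expects full text might recover the completion, not a copy of LangChain's code.

```python
def toy_generate(prompt: str, return_full_text: bool = False) -> str:
    """Toy stand-in for a text-generation pipeline; no model is involved."""
    new_text = " Paris."  # pretend the model generated this continuation
    return prompt + new_text if return_full_text else new_text


def strip_prompt(prompt: str, generated: str) -> str:
    """Assumed wrapper behaviour: drop the first len(prompt) characters."""
    return generated[len(prompt):]


prompt = "The capital of France is"

# Full text in, prompt sliced off: the continuation survives intact.
with_full_text = strip_prompt(prompt, toy_generate(prompt, return_full_text=True))

# New text only: the same slice cuts into (here, removes all of) the answer.
with_new_text_only = strip_prompt(prompt, toy_generate(prompt))

print(repr(with_full_text))
print(repr(with_new_text_only))
```

Under that assumption, a pipeline that returns only the new text loses part or all of its answer once the prompt-length prefix is removed.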

##### Main Use Cases of LangChain

- Summarization - Express the most important facts about a body of text or chat interaction

- Question Answering Over Documents - Use information held within documents to answer questions or queries

- Extraction - Pull structured data from a body of text or a user query

- Evaluation - Understand the quality of output from your application

- Querying Tabular Data - Pull data from databases or other tabular sources

- Code Understanding - Reason about and digest code

- Interacting with APIs - Query APIs and interact with the outside world

- Chatbots - A framework for back-and-forth interaction with a user, combined with memory, in a chat interface

- Agents - Use LLMs to make decisions about what to do next. Enable these decisions with tools.



In [3]:
!pip install --upgrade pip

!pip install "accelerate>=0.16.0,<1" "transformers[torch]>=4.28.1,<5" "torch>=1.13.1,<2"


In [4]:
!pip install langchain

Collecting langchain
  Downloading langchain-0.0.341-py3-none-any.whl.metadata (16 kB)
Collecting SQLAlchemy<3,>=1.4 (from langchain)
  Downloading SQLAlchemy-2.0.23-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.1-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting anyio<4.0 (from langchain)
  Downloading anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Downloading async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.3-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-core<0.0.7,>=0.0.6 (from langchain)
  Downloading langchain_core-0.0.6-py3-none-any.whl.metadata (750 bytes)
Collect

In [5]:
!pip install unstructured
!pip install "unstructured[pdf]"


Collecting unstructured
  Downloading unstructured-0.11.0-py3-none-any.whl.metadata (25 kB)
Collecting chardet (from unstructured)
  Downloading chardet-5.2.0-py3-none-any.whl.metadata (3.4 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl (19 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl (13 kB)
Collecting lxml (from unstructured)
  Downloading lxml-4.9.3-cp39-cp39-manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Collecting nltk (from unstructured)
  Downloading nltk-3.8.1-py3-none-any.whl (1.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.5/1.5 MB[0m [31m11.7 MB/s[0m eta [36m0:00:00[0m00:01[0m
Collecting beautifulsoup4 (from unstructured)
  Downloading beautifulsoup4-4.12.2-py3-none-any.whl (142 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.0/143.0 kB[0m [31m1.6 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting emoji (

In [6]:
#!pip install "langchain>=0.0.139"

In [8]:
import torch
from transformers import pipeline

generate_text = pipeline(model="databricks/dolly-v2-12b", torch_dtype=torch.bfloat16,
                         trust_remote_code=True, device_map="auto", return_full_text=True)


In [9]:
from langchain import PromptTemplate, LLMChain
from langchain.llms import HuggingFacePipeline
import unstructured
from langchain.document_loaders import S3FileLoader

In [14]:
from unstructured.partition.auto import partition
#elements = partition(filename="example-docs/eml/fake-email.eml")



In [15]:
# template for an instruction with no input
prompt = PromptTemplate(
    input_variables=["instruction"],
    template="{instruction}")

# template for an instruction with input
prompt_with_context = PromptTemplate(
    input_variables=["instruction", "context"],
    template="{instruction}\n\nInput:\n{context}")

hf_pipeline = HuggingFacePipeline(pipeline=generate_text)

llm_chain = LLMChain(llm=hf_pipeline, prompt=prompt)
llm_context_chain = LLMChain(llm=hf_pipeline, prompt=prompt_with_context)
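
To see what text the model actually receives, the two templates can be rendered by hand. This is a minimal sketch that assumes `PromptTemplate`'s default f-string formatting behaves like `str.format` for plain named fields. The instruction and context strings are illustrative placeholders, not data from the notebook.

```python
# The same template strings used for the two PromptTemplate objects.
instruction_only = "{instruction}"
instruction_with_context = "{instruction}\n\nInput:\n{context}"

# Placeholder values chosen for illustration only.
rendered = instruction_with_context.format(
    instruction="List the certifications mentioned in this CV.",
    context="<text extracted from the PDF>",
)

# The instruction comes first, then a blank line, then the context under "Input:".
print(rendered)
```

The chain sends this whole rendered string to the model, so a long document in `context` counts toward the model's input length.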

In [16]:
# Load a PDF file from S3 to use as context for the LangChain chain.
# S3FileLoader takes (bucket, key).
loader = S3FileLoader("webage-genaidata", "Private-Data/CV2.pdf")
data = loader.load()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [17]:

context = data[0].page_content
print(llm_context_chain.predict(instruction="Give the carrier summary of CHRISTOPHOER MORGAN who is senior web developer?", context=context).lstrip())

[2023-11-28 21:29:17.832: W smdistributed/modelparallel/torch/nn/predefined_hooks.py:78] Found unsupported HuggingFace version 4.35.2 for automated tensor parallelism. HuggingFace modules will not be automatically distributed. You can use smp.tp_register_with_module API to register desired modules for tensor parallelism, or directly instantiate an smp.nn.DistributedModule. Supported HuggingFace transformers versions for automated tensor parallelism: ['4.17.0', '4.20.1', '4.21.0']


INFO:root:Using NamedTuple = typing._NamedTuple instead.


[2023-11-28 21:29:18.004 pytorch-1-13-gpu-py-ml-g4dn-xlarge-f4059b0a6fcc8375ce85d12387ea:60 INFO utils.py:28] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-11-28 21:29:18.070 pytorch-1-13-gpu-py-ml-g4dn-xlarge-f4059b0a6fcc8375ce85d12387ea:60 INFO profiler_config_parser.py:111] Unable to find config at /opt/ml/input/config/profilerconfig.json. Profiler is disabled.
CHRISTOPHOER MORGAN is senior web developer at Luxury Car Center. Her responsibility includes creating and modifying employee schedules with service levels in mind. She is also responsible for helping out in sales and repair areas as needed and maintains comprehensive current knowledge of operations.


Alternatively, download the files locally with the AWS CLI:

In [19]:
#! aws s3 cp s3://webage-genai-data/Private-Data/ Private-Data --recursive
! aws s3 cp s3://webage-genaidata/Private-Data/ Private-Data --recursive

download: s3://webage-genaidata/Private-Data/CV1.pdf to Private-Data/CV1.pdf
download: s3://webage-genaidata/Private-Data/CV11.pdf to Private-Data/CV11.pdf
download: s3://webage-genaidata/Private-Data/CV10.pdf to Private-Data/CV10.pdf
download: s3://webage-genaidata/Private-Data/CV7.pdf to Private-Data/CV7.pdf
download: s3://webage-genaidata/Private-Data/CV9.pdf to Private-Data/CV9.pdf
download: s3://webage-genaidata/Private-Data/CV5.pdf to Private-Data/CV5.pdf
download: s3://webage-genaidata/Private-Data/CV8.pdf to Private-Data/CV8.pdf
download: s3://webage-genaidata/Private-Data/CV6.pdf to Private-Data/CV6.pdf
download: s3://webage-genaidata/Private-Data/CV2.pdf to Private-Data/CV2.pdf
download: s3://webage-genaidata/Private-Data/CV13.pdf to Private-Data/CV13.pdf
download: s3://webage-genaidata/Private-Data/CV4.pdf to Private-Data/CV4.pdf
download: s3://webage-genaidata/Private-Data/CV3.pdf to Private-Data/CV3.pdf
download: s3://webage-genaidata/Private-Data/CV12.pdf to Private-Data/

In [1]:
# ...and open pdf to load text into "context" variable
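
One way to do this step is sketched below, assuming the `unstructured` package installed earlier and that the download placed `CV2.pdf` under `Private-Data/`. The helper names `elements_to_context` and `load_pdf_context` are hypothetical and were not part of the original notebook.

```python
from pathlib import Path


def elements_to_context(elements) -> str:
    """Join the text of parsed document elements into one context string."""
    texts = (str(element).strip() for element in elements)
    return "\n\n".join(text for text in texts if text)


def load_pdf_context(pdf_path: str) -> str:
    """Parse a local PDF with unstructured and return its text."""
    # Imported lazily so the helper above works without unstructured installed.
    from unstructured.partition.auto import partition

    return elements_to_context(partition(filename=pdf_path))


# Hypothetical usage, assuming the download above placed CV2.pdf here.
pdf_path = Path("Private-Data") / "CV2.pdf"
if pdf_path.exists():
    context = load_pdf_context(str(pdf_path))
```

The resulting `context` string can be passed to `llm_context_chain.predict` in the same way as the text loaded from S3.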

In [20]:
print(llm_context_chain.predict(instruction="Name of certification of CHRISTOPHOER MORGAN who is senior web developer?", context=context).lstrip())

CHRISTOPHOER MORGAN


In [21]:
print(llm_context_chain.predict(instruction="What are name of Certification's completed by CHRISTOPHOER MORGAN who is senior web developer?", context=context).lstrip())

CHRISTOPHOER MORGAN received the following certifications:
- Store Manager LUXURY CAR CENTER, New York
- Master Technician (level 2)


In [23]:
print(llm_context_chain.predict(instruction="Give the carrier summary of CHRISTOPHOER MORGAN who is senior web developer?", context=context).lstrip())

CHRISTOPHOER MORGAN is senior web developer. His last job title is Store Manager. He is based out of LUXURY CAR CENTER, New York. His graduation year is 2019. His hobbies are playing chess and keeping inventory at optimal levels.
