## Introduction

In this tutorial, you learn how to use Google Cloud AI tools to quickly bring the power of Large Language Models to enterprise systems.  

This tutorial covers the following -

*   What are embeddings - what business challenges do they help solve ?
*   Understanding Text with Vertex AI Text Embeddings
*   Find Embeddings fast with Vertex AI Vector Search
*   Grounding LLM outputs with Vector Search

This tutorial is based on [the blog post](https://cloud.google.com/blog/products/ai-machine-learning/how-to-use-grounding-for-your-llms-with-text-embeddings), combined with sample code.


### Prerequisites

This tutorial is designed for developers who has basic knowledge and experience with Python programming and machine learning.

If you are not reading this tutorial in Qwiklab, then you need to have a Google Cloud project that is linked to a billing account to run this. Please go through [this document](https://cloud.google.com/vertex-ai/docs/start/cloud-environment) to create a project and setup a billing account for it.

### Choose the runtime environment

The notebook can be run on either Google Colab or [Vertex AI Workbench](https://cloud.google.com/vertex-ai-workbench).

- To use Colab: Click [this link](https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/main/embeddings/intro-textemb-vectorsearch.ipynb) to open the tutorial in Colab.

- To use Workbench: If it is the first time to use Workbench in your Google Cloud project, open [the Workbench console](https://console.cloud.google.com/vertex-ai/workbench) and click ENABLE button to enable Notebooks API. Then click [this link](https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/generative-ai/main/embeddings/intro-textemb-vectorsearch.ipynb),  and select an existing notebook or create a new notebook.


### How much will this cost?

In case you are using your own Cloud project, not a temporary project on Qwiklab, you need to spend roughly a few US dollars to finish this tutorial.

The pricing of the Cloud services we will use in this tutorial are avilable in the following pages:

- [Vertex AI Embeddings for Text](https://cloud.google.com/vertex-ai/pricing#generative_ai_models)
- [Vertex AI Vector Search](https://cloud.google.com/vertex-ai/pricing#matchingengine)
- [BigQuery](https://cloud.google.com/bigquery/pricing)
- [Cloud Storage](https://cloud.google.com/storage/pricing)
- [Vertex AI Workbench](https://cloud.google.com/vertex-ai/pricing#notebooks) if you use one

You can use the [Pricing Calculator](https://cloud.google.com/products/calculator) to generate a cost estimate based on your projected usage. The following is an example of rough cost estimation with the calculator, assuming you will go through this tutorial a couple of time.

<img src="https://storage.googleapis.com/github-repo/img/embeddings/vs-quickstart/pricing.png" width="50%"/>

### **Warning: delete your objects after the tutorial**

In case you are using your own Cloud project, please make sure to delete all the Indexes, Index Endpoints and Cloud Storage buckets (and the Workbench instance if you use one) after finishing this tutorial. Otherwise the remaining assets would incur unexpected costs.


# Bringing Gen AI and LLMs to production services

Many people are now starting to think about how to bring Gen AI and LLMs to production services, and facing with several challenges.

- "How to integrate LLMs or AI chatbots with existing IT systems, databases and business data?"
- "We have thousands of products. How can I let LLM memorize them all precisely?"
- "How to handle the hallucination issues in AI chatbots to build a reliable service?"

Here is a quick solution: **grounding** with **embeddings** and **vector search**.

What is grounding? What are embedding and vector search? In this tutorial, we will learn these crucial concepts to build reliable Gen AI services for enterprise use. But before we dive deeper, let's try the demo below.

# Vertex AI Embeddings for Text

With the [Vertex AI Embeddings for Text](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings), you can easily create a text embedding with LLM. The product is also available on [Vertex AI Model Garden](https://cloud.google.com/model-garden)

![](https://storage.googleapis.com/github-repo/img/embeddings/textemb-vs-notebook/7.png)

This API is designed to extract embeddings from texts. It can take text input up to 3,072 input tokens, and outputs 768 dimensional text embeddings.

# Text Embeddings in Action
## Setup

Before get started with the Vertex AI services, we need to setup the following.

* Install Python SDK
* Environment variables
* Authentication (Colab only)
* Enable APIs
* Set IAM permissions

### Install Python SDK

In [11]:
# Install Vertex AI LLM SDK
! pip install --user --upgrade google-cloud-aiplatform==1.47.0 langchain==0.1.14 langchain-google-vertexai==0.1.3 typing_extensions==4.9.0

# Dependencies required by Unstructured PDF loader
! sudo apt -y -qq install tesseract-ocr libtesseract-dev
! sudo apt-get -y -qq install poppler-utils
! pip install --user --upgrade unstructured==0.12.4 pdf2image==1.17.0 pytesseract==0.3.10 pdfminer.six==20221105
! pip install --user --upgrade pillow-heif==0.15.0 opencv-python==4.9.0.80 unstructured-inference==0.7.24 pikepdf==8.13.0 pypdf==4.0.1

# For Matching Engine integration dependencies (default embeddings)
! pip install --user --upgrade tensorflow_hub==0.16.1 tensorflow_text==2.15.0
! pip install sentence-transformers
! pip install -U langchain-community faiss-gpu
! pip install --upgrade --quiet  sentence_transformers > /dev/null
! pip install langchain_community
! pip install openai

libtesseract-dev is already the newest version (4.1.1-2.1build1).
tesseract-ocr is already the newest version (4.1.1-2.1build1).
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.
Collecting pdfminer.six==20221105
  Using cached pdfminer.six-20221105-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
  Attempting uninstall: pdfminer.six
    Found existing installation: pdfminer.six 20231228
    Uninstalling pdfminer.six-20231228:
      Successfully uninstalled pdfminer.six-20231228
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.11.0 requires pdfminer.six==20231228, but you have pdfminer-six 20221105 which is incompatible.[0m[31m
[0mSuccessfully installed pdfminer.six-20221105


Collecting pdfminer.six==20231228 (from pdfplumber->layoutparser[layoutmodels,tesseract]->unstructured-inference==0.7.24)
  Using cached pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
  Attempting uninstall: pdfminer.six
    Found existing installation: pdfminer.six 20221105
    Uninstalling pdfminer.six-20221105:
      Successfully uninstalled pdfminer.six-20221105
Successfully installed pdfminer.six-20231228




# Download custom Python modules and utilities
The cell below will download some helper functions needed for using Vertex AI Matching Engine in this notebook. These helper functions were created to keep this notebook more tidy and concise, and you can also view them directly on Github.

In [1]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

In [9]:
#Authenticating your notebook environment
import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [2]:
PROJECT_ID = "iisccapstone-420805"

# Enable APIs
Run the following to enable APIs for Compute Engine, Vertex AI, Cloud Storage and BigQuery with this Google Cloud project.

In [3]:
! gcloud services enable compute.googleapis.com aiplatform.googleapis.com storage.googleapis.com bigquery.googleapis.com --project {PROJECT_ID}

Operation "operations/acat.p2-739467896105-5516a9cf-c1fe-4e91-8a15-6877e4d9b31d" finished successfully.


### Set IAM permissions

Also, we need to add access permissions to the default service account for using those services.

- Go to [the IAM page](https://console.cloud.google.com/iam-admin/) in the Console
- Look for the principal for default compute service account. It should look like: `<project-number>-compute@developer.gserviceaccount.com`
- Click the edit button at right and click `ADD ANOTHER ROLE` to add `Vertex AI User`, `BigQuery User` and `Storage Admin` to the account.

This will look like this:

![](https://storage.googleapis.com/github-repo/img/embeddings/vs-quickstart/iam-setting.png)

# Environment variables

In [4]:
# get project ID
PROJECT_ID = ! gcloud config get project
PROJECT_ID = "iisccapstone-420805"
LOCATION = "us-central1"
if PROJECT_ID == "(unset)":
    print(f"Please set the project ID manually below")
    # define project information
if PROJECT_ID == "(unset)":
    PROJECT_ID = "iisccapstone-420805'"  # @param {type:"string"}

# generate an unique id for this session
from datetime import datetime

UID = datetime.now().strftime("%m%d%H%M")

In [5]:
PROJECT_ID

'iisccapstone-420805'

# Import libraries

In [6]:
import vertexai

#PROJECT_ID = PROJECT_ID # @param {type:"string"}
REGION = "us-central1"

vertexai.init(project={PROJECT_ID}, location=REGION)

In [7]:
import json
import textwrap

# Utils
import time
import uuid
from typing import List

import numpy as np
import vertexai

# Vertex AI
from google.cloud import aiplatform

print(f"Vertex AI SDK version: {aiplatform.__version__}")

# LangChain
import langchain

print(f"LangChain version: {langchain.__version__}")

from langchain.chains import RetrievalQA
from langchain.document_loaders import GCSDirectoryLoader
from langchain.embeddings import VertexAIEmbeddings
from langchain.llms import VertexAI
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Import custom Matching Engine packages
# from utils.matching_engine import MatchingEngine
# from utils.matching_engine_utils import MatchingEngineUtils
# Import custom Matching Engine packages
from langchain_google_vertexai import VertexAI , VertexAIEmbeddings , VectorSearchVectorStore
import faiss
from faiss import IndexFlatL2
import numpy as np
import spacy
from langchain_community.embeddings import HuggingFaceEmbeddings
import json
import pdfplumber
from langchain_community.vectorstores import FAISS
import os
from google.colab import files
import zipfile
from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from transformers import AutoTokenizer, pipeline
from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA

Vertex AI SDK version: 1.47.0
LangChain version: 0.1.14


# connecting to bigquery to extract the text data

In [None]:
# # load the BQ Table into a Pandas Dataframe
# import pandas as pd
# from google.cloud import bigquery


# bq_client = bigquery.Client(project=PROJECT_ID)
# QUERY_TEMPLATE = """
#         SELECT * from iisccapstone-420805.Pubmed.pubmed where content !='';
#         """
# # query_params=[
# #         bigquery.ArrayQueryParameter("q1","DATE", q1),
# #         bigquery.ArrayQueryParameter("q2","DATE", q2),
# #         bigquery.ArrayQueryParameter("q3","DATE", q3),
# #         bigquery.ArrayQueryParameter("q4","DATE", q4),
# #         bigquery.ArrayQueryParameter("rule_name","STRING", rule_name),
# #         bigquery.ArrayQueryParameter("insert_timestamp","DATE", insert_timestamp),
# #         bigquery.ArrayQueryParameter("Manufacturer","STRING", Manufacturer),
# #         bigquery.ArrayQueryParameter("partner_code","STRING", partner_code),
#     # ]
# try:
#   pubmed = bq_client.query(QUERY_TEMPLATE)  # Make an API request.
#   pubmed_data = pubmed.to_dataframe()
# except Exception as e:
#   print('Error',e,'Data_not_found')

# # examine the data
# pubmed_data.head()

Unnamed: 0,Title,content
0,"Impact of Alcoholism – Kerala.pdf,page:119",Problems Experienced While Tried to Cut Down /...
1,Impact of Alcohol Consumption on Young People....,Level of Grade of Details Year Country Cited e...
2,The Impact of Alcoholic Beverages on Human Hea...,"Nutrients2021,13,3938 1.4.Conclusion Insummary..."
3,REGIONAL STATUS REPORT ON ALCOHOL AND HEALTH I...,Regional status report on alcohol and health i...
4,therapy for multisystem inflammatory syndrome ...,"Articles significant comorbidities (eg, immune..."


# Load the text embeddings model
from vertexai.preview.language_models import TextEmbeddingModel

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

# Load the biquery where data for llm is stored

In [None]:
# from langchain.document_loaders import BigQueryLoader #Class for storing a piece of text and associated metadata.
# BASE_QUERY = """SELECT * FROM `iisccapstone-420805.Pubmed.pubmed` where content !=''"""
# loader = BigQueryLoader(BASE_QUERY,project="iisccapstone-420805")
# documents = loader.load()

  warn_deprecated(


In [None]:
# # Add document metadata to all the document of documents and formatting page_content to only contain contents of BQ table pubmed as it contained both title and content
# for document in documents:
#   document.metadata={'source':document.page_content.split('\n')[0]}
#   document.page_content=document.page_content.split('\n')[1]


In [None]:
# documents[0].metadata

{'source': 'Title: Impact of Alcoholism – Kerala.pdf,page:119'}

# Chunk documents
Split the documents to smaller chunks. When splitting the document, ensure a few chunks can fit within the context length of LLM.

In [None]:
# # split the documents into chunks
# text_splitter = RecursiveCharacterTextSplitter(
#     chunk_size=1000,
#     chunk_overlap=50,
#     separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
# )
# doc_splits = text_splitter.split_documents(documents)

# # Add chunk number to metadata
# for idx, split in enumerate(doc_splits):
#     split.metadata["chunk"] = idx

# print(f"# of documents = {len(doc_splits)}")

# of documents = 4405


In [None]:
# doc_splits[0]

Document(page_content='content: Problems Experienced While Tried to Cut Down / Stop Drinking The present study had a probe into the problems faced by the respondents, while they had tried to stop/cut down drinking. The query was posed only to the Alcohol Users (Adult & Adolescents) and not to the Spouses of Drinkers. Obviously, it was very pathetic to see that 63.4% of the Adults and 58.4% of the Adolescents have faced problems, while they tried to stop drinking. Further, 37.7% of the Adults had faced multiple problems. Multiple withdrawal problems were found to be comparatively less (15.9%) among the Adolescent Drinkers and headache and fidgety/restless was the major difficulties they faced when they cut down/stopped drinking. Further, 12.4% reported that they had a problem of „Unable to sleep‟. (Refer to table 2.6.7) Category-wise, the figure 2.6.2 showed that withdrawal problems were more (83.5%) among the Harmful Drinkers (Adults) compared to the Less-Harmful Drinkers (54.6%). Tabl

# Load the text embeddings model

In [None]:
!pip install -q diffusers transformers accelerate peft

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.0/2.0 MB[0m [31m14.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m297.6/297.6 kB[0m [31m23.0 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m199.1/199.1 kB[0m [31m17.9 MB/s[0m eta [36m0:00:00[0m
[?25h

In [10]:
model_name = 'all-mpnet-base-v2'
embeddings = HuggingFaceEmbeddings(model_name=model_name)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

# creating faiss db to store embeddings in offline mode

In [None]:
#db = FAISS.from_documents(doc_splits , embeddings)

In [None]:
#print(db.index.ntotal)

4405


# Querying

In [None]:
# query = 'prolong alcohol intake impact human health?'
# docs = db.similarity_search(query)

In [None]:
#db.similarity_search_with_score(query)

# As a Retriever

In [None]:
# retriever = db.as_retriever()
# docs = retriever.invoke('prolong alcohol intake impact human health?')

In [None]:
# print(docs[0].page_content)

content: The Impact of Alcoholic Beverages on Human Health


# Similarity Search with score

In [None]:
# docs_and_scores = db.similarity_search_with_score(query)

In [None]:
# docs_and_scores[0]

(Document(page_content='content: The Impact of Alcoholic Beverages on Human Health', metadata={'source': 'Title: The Impact of Alcoholic Beverages on Human Health  .pdf,page:2', 'chunk': 972}),
 0.58215624)

It is also possible to do a search for documents similar to a given embedding vector using similarity_search_by_vector which accepts an embedding vector as a parameter instead of a string

In [None]:
# embedding_vector = embeddings.embed_query('prolong alcohol intake impact human health?')
# docs_and_scores = db.similarity_search_by_vector(embedding_vector)

In [None]:
# docs_and_scores

[Document(page_content='content: The Impact of Alcoholic Beverages on Human Health', metadata={'source': 'Title: The Impact of Alcoholic Beverages on Human Health  .pdf,page:2', 'chunk': 972}),
 Document(page_content='content: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the o

# Saving and loading

In [11]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#db.save_local("/content/drive/MyDrive/Capstone_project/faiss_index")

In [12]:
new_db = FAISS.load_local("/content/drive/MyDrive/Capstone_project/faiss_index", embeddings,allow_dangerous_deserialization=True)
query = 'prolong alcohol intake impact human health?'

docs = new_db.similarity_search(query)

In [13]:
docs

[Document(page_content='content: The Impact of Alcoholic Beverages on Human Health', metadata={'source': 'Title: The Impact of Alcoholic Beverages on Human Health  .pdf,page:2', 'chunk': 972}),
 Document(page_content='content: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the o

In [14]:
new_db.similarity_search_with_score(query)

[(Document(page_content='content: The Impact of Alcoholic Beverages on Human Health', metadata={'source': 'Title: The Impact of Alcoholic Beverages on Human Health  .pdf,page:2', 'chunk': 972}),
  0.58215624),
 (Document(page_content='content: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences o

# Designing custom retreiver to generate_summary

# Model gpt2

In [15]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [16]:
# converting it into list of string to generate summary
docs1=format_docs(docs)

In [17]:
docs1

"content: The Impact of Alcoholic Beverages on Human Health\n\ncontent: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the of Diseases, 10th edition) diseases, drinker, the chronic use of alcohol is responsible including 40 diseases that would not for a significant societal impa

In [18]:
from transformers import pipeline
model_name="gpt2"
chat_pipeline=pipeline("text-generation",model=model_name)

from transformers import GPT2LMHeadModel, GPT2Tokenizer
from langchain.prompts import PromptTemplate

model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)


config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [19]:

chat_pipeline(docs1,max_length=1000)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "content: The Impact of Alcoholic Beverages on Human Health\n\ncontent: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the of Diseases, 10th edition) diseases, drinker, the chronic use of alcohol is responsible including 40 diseases that would not for a signi

# Text generator

In [20]:
def generate_text(prompt, max_length=1000):
  input_ids = tokenizer.encode(prompt, return_tensors="pt")
  output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, temperature=0.7)
  generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
  return generated_text

In [21]:
generate_text(docs1)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


"content: The Impact of Alcoholic Beverages on Human Health\n\ncontent: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the of Diseases, 10th edition) diseases, drinker, the chronic use of alcohol is responsible including 40 diseases that would not for a significant societal impa

# Text summarizer

In [22]:

def summarize_text(text, max_length=1000):
  input_ids = tokenizer.encode(text, return_tensors="pt", max_length=, 1024truncation=True)
  output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, temperature=0.2, early_stopping=True)
  summarized_text = tokenizer.decode(output[0], skip_special_tokens=True)
  return summarize_text

In [23]:
len(docs)

4

In [24]:
from transformers import pipeline

def summarize_text(text, max_length=1000):
    """Summarize input text using a pre-trained GPT-2 model."""
    summarization_pipeline = pipeline("summarization", model="gpt2")
    summary = summarization_pipeline(text, max_length=max_length, min_length=50, do_sample=True)[0]['summary_text']
    return summary

# Example usage:
input_text = docs1
summary = summarize_text(input_text)
print("Summary:", summary)

The model 'GPT2LMHeadModel' is not supported for summarization. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SeamlessM4TForTextToText', 'SeamlessM4Tv2ForTextToText', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].
Your max_length is set to 1000, but your input_length is only 580

Summary: content: The Impact of Alcoholic Beverages on Human Health

content: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the of Diseases, 10th edition) diseases, drinker, the chronic use of alcohol is responsible including 40 diseases that would not for a significant societa

In [25]:
# Create a retriever object from the 'db' using the 'as_retriever' method.
# This retriever is likely used for retrieving data or documents from the database.
retriever = new_db.as_retriever()
docs = retriever.invoke('prolong alcohol intake impact human health?')
docs

[Document(page_content='content: The Impact of Alcoholic Beverages on Human Health', metadata={'source': 'Title: The Impact of Alcoholic Beverages on Human Health  .pdf,page:2', 'chunk': 972}),
 Document(page_content='content: Al cohol affects human physiology been linked with an increase in risk of breast decreased health care utilisation, and increased either through years of cancer(9). Women are less likely to consume HIV risk behaviours due to lack of sobriety(14). alcohol than men; however, the use of alcohol Some countries have also found a four time consumption, acute intoxication, may have more implications for women than increased risk of multimorbidity in individuals or dependence(5). It has been men with respect to physical illnesses and more who drink alcohol(15). linked with approximately 230 severe cognitive and motor impairment with ICD-10 (International Classification a much lower alcohol exposure as compared to men(10). Beyond the direct consequences on health of the o

In [75]:
docs = retriever.get_relevant_documents("What is prolong alcohol intake impact human health?")

context_summ=' '.join([str(i.page_content) for i in docs])


In [90]:
context_summ



In [77]:
docs = retriever.get_relevant_documents("What is e-cigarrette?")

context_summ=' '.join([str(i.page_content) for i in docs])


In [91]:
context_summ

