In [1]:
!pip install -q google-generativeai langchain_google_genai pypdf pymupdf chromadb sentence-transformers langchain

# Gemini using LangChain

In [9]:
from langchain_google_genai import ChatGoogleGenerativeAI

# Replace the placeholder with your own Gemini API key.
llm = ChatGoogleGenerativeAI(
    model="gemini-pro",
    google_api_key="YOUR-GEMINI-API-KEY",
    temperature=0.2,
    convert_system_message_to_human=True,
)


In [13]:
result = llm.invoke("what is data science")

In [16]:
print(result.content)

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It is a multidisciplinary field that draws on mathematics, statistics, computer science, and domain knowledge to analyze data and solve problems.

Data science is used in a wide variety of industries, including healthcare, finance, retail, manufacturing, and government. It is used to improve decision-making, optimize processes, and create new products and services.

Some of the common tasks that data scientists perform include:

* Collecting and cleaning data
* Exploratory data analysis
* Building and training machine learning models
* Evaluating and interpreting results
* Communicating findings to stakeholders

Data science is a rapidly growing field, and there is a high demand for data scientists. According to the U.S. Bureau of Labor Statistics, the median annual salary for data scientists was $100,910 in

# Prompts and Prompt Templates

In [17]:
from langchain_core.prompts import PromptTemplate

In [18]:
template = """
<s>[INST] <<SYS>>
You are my financial advisor. Provide financial advice based on my queries. If you don't know about the query, provide a generic response based on your knowledge.
<</SYS>>

{text} [/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

In [19]:
text = "2023 indian health budget"
print(prompt.format(text=text))


<s>[INST] <<SYS>>
You are my financial advisor. Provide financial advice based on my queries. If you don't know about the query, provide a generic response based on your knowledge.
<</SYS>>

2023 indian health budget [/INST]
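With the default f-string template format, `PromptTemplate.format` behaves much like Python's own `str.format`. The sketch below uses only the standard library to show that substitution and how placeholder names can be discovered. `SIMPLE_TEMPLATE` and `find_variables` are hypothetical names used for illustration, not LangChain APIs.

```python
from string import Formatter

# Hypothetical template for illustration; {text} is its only placeholder.
SIMPLE_TEMPLATE = "You are my financial advisor.\n\n{text}"


def find_variables(template: str) -> list[str]:
    """Return the placeholder names found in an f-string style template."""
    return [name for _, name, _, _ in Formatter().parse(template) if name]


# Fill the placeholder the same way str.format would.
filled = SIMPLE_TEMPLATE.format(text="2023 indian health budget")

print(find_variables(SIMPLE_TEMPLATE))
print(filled)
```

Listing the variables first is a cheap way to catch a mismatch between the template and the `input_variables` you pass.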



In [22]:
result = llm.invoke(prompt.format(text=text))
print(result.content)

<s>[INST] <<SYS>>
The 2023 Indian health budget is ₹86,200 crore (US$11 billion), a 10% increase from the previous year. The budget includes ₹45,000 crore (US$5.6 billion) for health infrastructure, ₹20,000 crore (US$2.5 billion) for health services, and ₹10,000 crore (US$1.2 billion) for health research. The budget also includes ₹5,000 crore (US$625 million) for the National Health Protection Scheme, which provides health insurance to low-income families.

The budget has been welcomed by health experts, who say that it will help to improve the health of the Indian population. However, some experts have also expressed concern that the budget is not enough to address the growing burden of non-communicable diseases, such as heart disease, stroke, and cancer.

Overall, the 2023 Indian health budget is a positive step towards improving the health of the Indian population. However, more needs to be done to address the growing burden of non-communicable diseases.
<</SYS>>


# Create a Chain

In [24]:
from langchain.chains import LLMChain

In [25]:
chain = LLMChain(llm=llm, prompt=prompt)
result = chain.run(text)


In [27]:
print(result)

The Indian government allocated a total of ₹86,200 crore (US$11 billion) for the health sector in the 2023 Union Budget, an increase of 13.2% over the previous year's allocation. This includes ₹69,000 crore (US$8.6 billion) for the National Health Mission (NHM), which is the government's flagship healthcare program. The NHM provides financial assistance to states and union territories for a range of health services, including maternal and child health, immunization, and communicable disease control.

The budget also includes ₹10,000 crore (US$1.2 billion) for the Pradhan Mantri Jan Arogya Yojana (PMJAY), which is the government's health insurance scheme for the poor. The PMJAY provides cashless hospitalization coverage of up to ₹5 lakh (US$6,250) per family per year for a range of medical procedures.

In addition, the budget includes ₹5,000 crore (US$625 million) for the Ayushman Bharat Digital Mission (ABDM), which is the government's initiative to create a digital health infrastructu

# Sequential chains

In [33]:
template = "<s>[INST] Use the data {summary} and give a short summary [/INST]"

examples_prompt = PromptTemplate(
    input_variables=["summary"],
    template=template,
)

examples_chain = LLMChain(llm=llm, prompt=examples_prompt)

In [34]:
from langchain.chains import SimpleSequentialChain

multi_chain = SimpleSequentialChain(chains=[chain, examples_chain], verbose=True)
result = multi_chain.run(text)



> Entering new SimpleSequentialChain chain...
<s>[INST] <<SYS>>
The 2023 Indian health budget is ₹86,200 crore (US$11 billion), a 10% increase from the previous year. The budget includes ₹35,000 crore (US$4.4 billion) for the National Health Mission, which provides comprehensive primary healthcare services to all Indians. The budget also includes ₹20,000 crore (US$2.5 billion) for the Pradhan Mantri Jan Arogya Yojana (PMJAY), which provides health insurance to over 100 million poor and vulnerable Indians.

The 2023 health budget is a significant increase from previous years, and it reflects the government's commitment to improving the health of all Indians. The budget will help to expand access to healthcare services, improve the quality of care, and reduce the financial burden of healthcare on families.

Here are some specific ways that the 2023 health budget will benefit Indians:

* **Expanded access to healthcare services:** The budget will help to expand acce

In [38]:
result

'The 2023 Indian health budget is ₹86,200 crore (US$11 billion), a 10% increase from the previous year. The budget will help to expand access to healthcare services, improve the quality of care, and reduce the financial burden of healthcare on families.'
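Conceptually, a simple sequential chain passes the single output of each step to the next step as its single input. The sketch below shows that control flow with plain functions standing in for the two LLM chains. `run_sequential`, `answer`, and `summarize` are hypothetical names, and no model is called.

```python
from typing import Callable, List


def run_sequential(steps: List[Callable[[str], str]], first_input: str) -> str:
    """Feed the output of each step into the next one and return the last output."""
    value = first_input
    for step in steps:
        value = step(value)
    return value


# Stand-in steps (hypothetical); in the notebook each step is an LLM chain.
def answer(query: str) -> str:
    return f"Long answer about: {query}"


def summarize(answer_text: str) -> str:
    return f"Summary of [{answer_text}]"


final = run_sequential([answer, summarize], "2023 indian health budget")
print(final)
```

Only the last step's output is returned, which matches the short summary string shown in the cell above.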

# Using RAG

In [41]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader('data/CTGAN.pdf')
docs = loader.load()
len(docs)

15

In [42]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)

49
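To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that slices fixed-size windows sharing a few characters with the previous window. It is not how `RecursiveCharacterTextSplitter` works internally, since that splitter prefers to break on separators such as paragraph and line breaks. `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbouring chunks, which helps retrieval when the relevant text straddles a boundary.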

In [43]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

query_result = embeddings.embed_query(texts[0].page_content)
print(len(query_result))

768
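Setting `normalize_embeddings` to `True` asks the encoder for unit-length vectors. One consequence, sketched below with the standard library only, is that the dot product of two such vectors equals their cosine similarity. The helper names and the toy 2-dimensional vectors are made up for illustration; the embeddings in this notebook have 768 dimensions.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
unit_a, unit_b = l2_normalize(a), l2_normalize(b)

# For unit vectors the plain dot product already is the cosine similarity.
print(cosine(a, b), dot(unit_a, unit_b))
```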


In [44]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(texts, embeddings, persist_directory="db")

In [45]:
results = db.similarity_search("what is ctgan", k=2)
print(results[0].page_content)

Algorithm 1: Train CTGAN on step.
Input: Training data T_train, Conditional generator and Critic parameters Φ_G and Φ_C respectively, batch size m, pac size pac.
Result: Conditional generator and Critic parameters Φ_G, Φ_C updated.
1. Create masks {m_1, ..., m_{i*}, ..., m_{N_d}}_j, for 1 ≤ j ≤ m
2. Create condition vectors cond_j, for 1 ≤ j ≤ m, from masks  ▷ Create m conditional vectors
3. Sample {z_j} ∼ MVN(0, I), for 1 ≤ j ≤ m
4. r̂_j ← Generator(z_j, cond_j), for 1 ≤ j ≤ m  ▷ Generate fake data
5. Sample r_j ∼ Uniform(T_train | cond_j), for 1 ≤ j ≤ m  ▷ Get real data
6. cond_k^(pac) ← cond_{k×pac+1} ⊕ ... ⊕ cond_{k×pac+pac}, for 1 ≤ k ≤ m/pac  ▷ Conditional vector pacs
7. r̂_k^(pac) ← r̂_{k×pac+1} ⊕ ... ⊕ r̂_{k×pac+pac}, for 1 ≤ k ≤ m/pac  ▷ Fake data pacs
8. r_k^(pac) ← r_{k×pac+1} ⊕ ... ⊕ r_{k×pac+pac}, for 1 ≤ k ≤ m/pac  ▷ Real data pacs
9. L_C ← (1/(m/pac)) Σ_{k=1}^{m/pac} Critic(r̂_k^(pac), cond_k^(pac)) − (1/(m/pac)) Σ_{k=1}^{m/pac} Critic(r_k^(pac), cond_k^(pac))
10. Sample ρ_1, ..., ρ_{m/pac} ∼ Uniform(0, 1)
11. r̃_k^(pac) ← ρ_k r̂_k^(pac) + (1 − ρ_k) r_k^(pac), for 1 ≤ k ≤ m/pac
12. L_GP ← (1/(m/pac)) Σ_{k=1}^{m/pac} (‖∇_{r̃_k^(pac)} Critic(r̃_k^(pac), cond_k^(pac))‖_2 − 1)^2  ▷ Gradient Penalty
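A similarity search with `k=2` returns the two stored chunks whose embeddings are closest to the query embedding. The brute-force sketch below ranks a handful of toy vectors by cosine similarity. It only illustrates the idea: a real vector store such as Chroma relies on its own index and distance settings rather than scanning every vector like this. The document ids, vectors, and `top_k` helper are hypothetical.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k documents most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


# Toy 2-d "embeddings" (made up for illustration).
doc_vecs = {
    "ctgan_intro": [0.9, 0.1],
    "training_algo": [0.7, 0.7],
    "references": [0.0, 1.0],
}
hits = top_k([1.0, 0.0], doc_vecs, k=2)
print(hits)
```

The top hit is only as good as the extracted chunk text, so it is worth printing a few results, as the cell above does, before wiring the retriever into a chain.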


In [46]:
from langchain.chains import RetrievalQA

template = """
<s>[INST] <<SYS>>
You are an expert data scientist. Provide detailed insights on my query.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])

In [47]:
text = "what is ctgan"
print(prompt.format(context=text, question=text))


<s>[INST] <<SYS>>
You are an expert data scientist. Provide detailed insights on my query.
<</SYS>>

what is ctgan

what is ctgan [/INST]



In [48]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
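With `chain_type="stuff"`, the retrieved chunks are placed together into the `{context}` slot of a single prompt, followed by the question. The sketch below reproduces that assembly step with plain strings. `RAG_TEMPLATE`, `stuff_prompt`, the separator, and the sample chunks are assumptions made for illustration, and no retriever or model is involved.

```python
# Hypothetical template with the same two placeholders as the notebook prompt.
RAG_TEMPLATE = "Answer using the context below.\n\n{context}\n\n{question}"


def stuff_prompt(template: str, chunks: list[str], question: str,
                 separator: str = "\n\n") -> str:
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


retrieved = ["CTGAN models tabular data.", "It uses a conditional generator."]
prompt_text = stuff_prompt(RAG_TEMPLATE, retrieved, "what is ctgan")
print(prompt_text)
```

Because every chunk lands in one prompt, the combined length of `k` chunks plus the question has to fit in the model's context window, which is one reason to keep `k` and `chunk_size` modest.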

In [51]:
question = "what is ctgan"
result = qa_chain.invoke({"query": question})
print(result["result"])

CTGAN (Conditional Table GAN) is a generative adversarial network (GAN) model designed specifically for generating synthetic tabular data. It is an extension of the original GAN model, which was developed for generating images. CTGAN is able to generate realistic synthetic data that preserves the statistical properties of the original data, including the relationships between different variables.

CTGAN consists of two main components: a generator and a discriminator. The generator is responsible for generating synthetic data, while the discriminator is responsible for distinguishing between synthetic and real data. The generator and discriminator are trained together in an adversarial manner, where the generator tries to fool the discriminator into believing that its synthetic data is real, while the discriminator tries to correctly classify synthetic and real data.

One of the key features of CTGAN is its ability to generate data conditioned on a set of categorical variables. This me
