## Transitioning from the Old Chain Classes to the New Pipe-Based Operator

## 1. Understanding `Runnables`
- `Runnables` are self-contained units of work.
- They can be executed in isolation or combined for complex operations.
- They provide flexibility in execution (sync, async, parallel).
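
To make the idea of combinable units concrete, here is a minimal pure-Python sketch of how a pipe operator can chain steps. `MiniRunnable` is a hypothetical class written only for illustration; it is not LangChain's implementation, just the same composition idea expressed through `__or__`.

```python
class MiniRunnable:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        # Run this unit of work on a single input.
        return self.func(value)

    def __or__(self, other):
        # `a | b` builds a new unit that feeds a's output into b.
        return MiniRunnable(lambda value: other.invoke(self.invoke(value)))


strip = MiniRunnable(str.strip)
upper = MiniRunnable(str.upper)
pipeline = strip | upper

print(pipeline.invoke("  hello lcel  "))  # HELLO LCEL
```

Each step stays usable on its own (`strip.invoke(...)`), while the composed `pipeline` behaves like a single unit.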

## 2. `RunnableParallel`
- Executes tasks concurrently.
- Useful for performance enhancement in scenarios where tasks can run independently.
- Syntax example:
    ```python
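    from langchain_core.runnables import RunnableParallel, RunnablePassthrough
    parallel = RunnableParallel({"x": RunnablePassthrough(), "y": RunnablePassthrough()})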
    ```

## 3. `RunnablePassthrough`
- A simple `Runnable` that passes inputs directly to outputs without modification.
- Helpful for debugging or chaining in pipelines.
- Example use case:
    ```python
    from langchain_core.runnables import RunnablePassthrough
    passthrough = RunnablePassthrough()
    # Runnables are executed with invoke(), not run()
    result = passthrough.invoke("hello")  # returns "hello" unchanged
    ```

## 4. `RunnableLambda`
- Allows quick, inline definitions of small, custom functions.
- Example:
    ```python
    from langchain_core.runnables import RunnableLambda
    lambda_op = RunnableLambda(lambda x: x * 2)
    result = lambda_op.invoke(5)  # Output: 10
    ```

## 5. Assign Functions
- `assign` adds new keys to the dictionary flowing through the chain, computed from the current input.
- Useful in data pipelines to add or update intermediate values.
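
The behavior can be pictured with a plain-Python sketch. The `assign` helper below is hypothetical and only mirrors the idea (keep the existing keys, add computed ones); it is not the LangChain method itself.

```python
def assign(inputs, **extra):
    """Return a copy of `inputs` with new keys computed from the original dict."""
    computed = {key: func(inputs) for key, func in extra.items()}
    return {**inputs, **computed}


state = {"x": "Hello"}
state = assign(state, length=lambda d: len(d["x"]))

print(state)  # {'x': 'Hello', 'length': 5}
```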

## 6. Performance Improvement (Inference Speed)
- Focus on optimizing the inference speed by leveraging parallel execution.
- Use `RunnableParallel` or batching techniques.
- Consider optimizing data pipelines by removing unnecessary steps.

## 7. Async Invoke
- Executes operations asynchronously, improving the overall throughput of the system.
- Syntax example:
    ```python
    async def async_operation(chain, input_data):
        # ainvoke() is the awaitable counterpart of invoke()
        result = await chain.ainvoke(input_data)
        return result
    ```

## 8. Batch Support
- Handles multiple inputs at once to improve performance.
- Can be combined with `RunnableParallel` for parallel batch execution.
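
As far as the LangChain documentation describes it, the default `batch` runs the individual calls in a thread pool, which pays off when each call mostly waits on the network. The sketch below reproduces that idea with the standard library only; `slow_call` and `batch` are hypothetical stand-ins, not LangChain code.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_call(question):
    # Hypothetical stand-in for a network-bound model call.
    time.sleep(0.2)
    return f"answer to: {question}"


def batch(inputs, max_workers=4):
    # map() preserves input order in the returned results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_call, inputs))


start = time.perf_counter()
answers = batch(["q1", "q2", "q3"])
elapsed = time.perf_counter() - start

print(answers)
print(f"elapsed: {elapsed:.2f}s")  # close to one call's latency, not three
```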

## 9. Async Batch Execution
- Combines asynchronous execution with batch processing for high-performance tasks.
- Reduces overall execution time for larger datasets.
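
A rough standard-library sketch of the same idea: several awaitable calls are started together and collected with `asyncio.gather`. The names `fake_model_call` and `abatch` are assumptions made for this example and do not come from LangChain.

```python
import asyncio


async def fake_model_call(question):
    # Hypothetical stand-in for an awaitable model call.
    await asyncio.sleep(0.1)
    return question.upper()


async def abatch(questions):
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*(fake_model_call(q) for q in questions))


results = asyncio.run(abatch(["applications of ai?", "goals of ai?"]))
print(results)  # ['APPLICATIONS OF AI?', 'GOALS OF AI?']
```

Inside a notebook an event loop is already running, so `await abatch([...])` would be used in place of `asyncio.run(...)`.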

## 10. Using `Itemgetter` with `LCEL`
- `itemgetter` (from Python's `operator` module) extracts specific items from collections, such as one key from an input dictionary.
- When combined with `LCEL` (LangChain Expression Language), it can streamline complex operations.
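
For reference, this is how `operator.itemgetter` behaves on its own. The `payload` dictionary is made up for the example; in a chain, such a getter would typically sit in front of a step that needs only one field of the input.

```python
from operator import itemgetter

payload = {"question": "What is LCEL?", "language": "English"}

get_question = itemgetter("question")
get_both = itemgetter("question", "language")

print(get_question(payload))  # What is LCEL?
print(get_both(payload))      # ('What is LCEL?', 'English')
```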

## 11. Bind Tools
- `bind` attaches fixed arguments, such as tool definitions, to a `Runnable` so they are sent with every call.
- With tools bound, the model can respond with a structured tool call instead of plain text.

## 12. Stream Support
- Keep your pipelines more responsive by incorporating stream support for data.
- This allows continuous data processing and near real-time outputs.
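
Runnables expose a `stream` method that yields output in chunks. The consuming side looks roughly like the loop below, where `stream_words` is a hypothetical generator standing in for a streaming model.

```python
def stream_words(text):
    # Hypothetical stand-in for a streaming model: yields one chunk at a time.
    for word in text.split():
        yield word + " "


chunks = []
for chunk in stream_words("streams arrive piece by piece"):
    chunks.append(chunk)  # a UI would display each chunk as soon as it arrives
    print(chunk, end="", flush=True)
print()
```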
  


In [1]:
!pip install langchain_google_genai
!pip install langchain_community
!pip install langchain
!pip install langchain_huggingface
!pip install langchain_groq

In [2]:
from google.colab import userdata
GROQ_API_KEY=userdata.get('GROQ_API_KEY')
import os
os.environ["GROQ_API_KEY"]=GROQ_API_KEY

In [3]:
from google.colab import userdata
GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
import os
os.environ["GOOGLE_API_KEY"]=GOOGLE_API_KEY

In [4]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
from langchain_google_genai import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

In [None]:
'''from langchain_huggingface import HuggingFaceEmbeddings
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
from langchain_groq import ChatGroq
import os
llm=ChatGroq(model_name="Gemma2-9b-It")'''


# A simple chain (the old chaining concept)

In [5]:
template= 'Hi! I am learning {skill}. Can you suggest me top 5 things to learn?\n'

In [6]:
from langchain_core.prompts import PromptTemplate

In [7]:
prompt = PromptTemplate(template=template,input_variables=["skill"])

In [8]:
print(prompt)

input_variables=['skill'] input_types={} partial_variables={} template='Hi! I am learning {skill}. Can you suggest me top 5 things to learn?\n'


In [9]:
from langchain.chains import LLMChain  # legacy class, kept here to show the old style

In [13]:
llm_chain = LLMChain(prompt=prompt,llm=llm)

In [14]:
print(llm_chain.run('Data Science'))


Okay, that's great! Data Science is a fascinating and rapidly evolving field. Here are my top 5 essential areas to focus on as you're learning, along with why they're important and some starting points for each:

**1. Python Programming (with a focus on Data Science Libraries)**

*   **Why it's essential:** Python is the workhorse language of data science. It's versatile, has a huge community, and boasts powerful libraries specifically designed for data manipulation, analysis, and visualization.

*   **Key areas to learn:**
    *   **Basic Python:**  Variables, data types, control flow (if/else, loops), functions, object-oriented programming (classes, objects).
    *   **NumPy:**  For efficient numerical computation, working with arrays and matrices.  Understanding array operations, broadcasting, and linear algebra fundamentals is crucial.
    *   **Pandas:**  For data manipulation and analysis.  Learn to create, clean, transform, and analyze data using DataFrames and Series.  Masterin

In [None]:
print(llm_chain.run({'skill':'Data Science'}))

**Top 5 Things to Learn for Data Science:**

1. **Programming Language:** Proficiency in a programming language is essential for data manipulation, analysis, and visualization. Python and R are the most popular choices for data science.

2. **Data Manipulation and Analysis:** Understand how to clean, transform, and explore data using tools like Pandas (Python) or dplyr (R). Learn statistical concepts like descriptive statistics, hypothesis testing, and regression analysis.

3. **Machine Learning:** Study supervised and unsupervised learning algorithms, such as linear regression, logistic regression, decision trees, and clustering. Understand the principles of model selection, evaluation, and deployment.

4. **Data Visualization:** Learn to create informative and visually appealing visualizations using libraries like Matplotlib (Python) or ggplot2 (R). This enables effective communication of insights and facilitates data exploration.

5. **Cloud Computing:** Familiarize yourself with cl

# An implementation using LCEL

In [15]:
llm

ChatGoogleGenerativeAI(model='models/gemini-2.0-flash', google_api_key=SecretStr('**********'), client=<google.ai.generativelanguage_v1beta.services.generative_service.client.GenerativeServiceClient object at 0x7a9ffe55fa10>, default_metadata=(), model_kwargs={})

In [16]:
prompt

PromptTemplate(input_variables=['skill'], input_types={}, partial_variables={}, template='Hi! I am learning {skill}. Can you suggest me top 5 things to learn?\n')

In [17]:
chain = prompt | llm

In [18]:
print(chain.invoke({'skill':'Big Data'}))

content="Okay, great! Diving into Big Data is exciting. Here are my top 5 suggestions for things to learn, focusing on a balance of core concepts and practical skills, along with explanations of why they're important:\n\n**1. Distributed Computing Fundamentals (Hadoop & Spark Core Concepts):**\n\n*   **Why it's important:**  Big Data *requires* processing data across multiple machines. Understanding the principles of distributing workloads, managing data across a cluster, and dealing with fault tolerance is absolutely fundamental.\n*   **What to learn:**\n    *   **Hadoop:**\n        *   **HDFS (Hadoop Distributed File System):** How data is stored in a distributed manner, replication for fault tolerance, data locality.\n        *   **MapReduce (the original processing model):** Understand the Map and Reduce phases, how jobs are submitted and executed.  While Spark has largely replaced MapReduce for many use cases, understanding the underlying concepts is still valuable.\n        *   *

In [51]:
from langchain_core.output_parsers import StrOutputParser

In [52]:
parser = StrOutputParser()

In [None]:
chain = prompt | llm | parser

In [None]:
print(chain.invoke({'skill':'Machine Learning'}))

**Top 5 Essential Concepts to Learn in Machine Learning:**

1. **Supervised Learning:** Understanding how algorithms learn from labeled data and predict outcomes. This includes concepts like linear regression, logistic regression, and decision trees.

2. **Unsupervised Learning:** Exploring algorithms that learn from unlabeled data to find patterns and structures. This encompasses methods like clustering, dimensionality reduction, and anomaly detection.

3. **Model Evaluation:** Mastering techniques to assess the performance of machine learning models, such as accuracy, precision, recall, and confusion matrices.

4. **Feature Engineering:** Learning how to transform and select the most informative features from raw data to improve model performance. This involves data cleaning, normalization, and feature selection techniques.

5. **Model Tuning and Optimization:** Understanding how to adjust model parameters and hyperparameters to enhance its predictive capabilities. This includes meth

# Let's discuss the runnables

In [19]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough , RunnableLambda

In [20]:
chain = RunnablePassthrough()

In [21]:
chain.invoke('You are welcome')

'You are welcome'

In [22]:
chain = RunnablePassthrough() | RunnablePassthrough() | RunnablePassthrough()

In [23]:
def string_upper(text):
    # `text` avoids shadowing the built-in input()
    return text.upper()

In [24]:
chain = RunnablePassthrough() | RunnableLambda(string_upper)

In [25]:
chain.invoke('Welcome to this city')

'WELCOME TO THIS CITY'

In [27]:
chain = RunnableParallel({'x':RunnablePassthrough(),'y':RunnablePassthrough()})

In [28]:
chain.invoke("DNB")

{'x': 'DNB', 'y': 'DNB'}

In [30]:
def fetch_website(input: dict):
    output = input.get('Website','Not found')
    return output

In [31]:
mydict={'DS': 'AI','Blog': "ML"}

In [32]:
mydict.get("website","Not found")

'Not found'

In [34]:
def extra_func(input):
    return 'Happy Learning'

In [35]:
chain = RunnableParallel({'x' : RunnablePassthrough()}).assign(extra=RunnableLambda(extra_func))

In [36]:
chain = RunnableParallel({'x' : RunnablePassthrough()}).assign(y=RunnableLambda(extra_func))

In [37]:
chain.invoke('Hello')

{'x': 'Hello', 'y': 'Happy Learning'}

In [29]:
!pip install chromadb


In [46]:
import wikipediaapi

wiki_wiki = wikipediaapi.Wikipedia(
    language='vi',
    user_agent='MyApp/1.0 (https://example.com)'
)

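page = wiki_wiki.page("Trí tuệ nhân tạo")  # Vietnamese article title: "Artificial intelligence"

if page.exists():
    with open("wiki_ai.txt", "w", encoding="utf-8") as f:
        f.write(page.text)
    print("Đã lưu nội dung vào wiki_ai.txt")  # "Saved the content to wiki_ai.txt"
else:
    print("Trang không tồn tại")  # "The page does not exist"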

Đã lưu nội dung vào wiki_ai.txt


In [48]:
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

### Reading the .txt files from ./source (wiki_ai.txt is assumed to have been copied there)

loader = DirectoryLoader('./source', glob="./*.txt", loader_cls=TextLoader)
docs = loader.load()

### Creating Chunks using RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10,
    length_function=len
)
new_docs = text_splitter.split_documents(documents=docs)
doc_strings = [doc.page_content for doc in new_docs]

### BGE Embeddings

'''from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs,
)
'''

### Creating Retriever using Vector DB

db = Chroma.from_documents(new_docs, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 4})

In [49]:
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""
prompt = PromptTemplate.from_template(template)


In [53]:
retrieval_chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | llm
    | StrOutputParser()
    )

In [54]:
question = "Trí tuệ nhân tạo là gì ?"  # "What is artificial intelligence?"

In [55]:
retrieval_chain.invoke(question)

'Dựa vào ngữ cảnh được cung cấp, không thể đưa ra định nghĩa đầy đủ về "Trí tuệ nhân tạo". Các đoạn trích chỉ đề cập đến "Trí tuệ nhân tạo mạnh hay Trí tuệ nhân tạo yếu", "Trí tuệ nhân tạo có thể là dấu chấm hết cho", "đầu tiên mà bây giờ được công nhận là trí tuệ", và "trí tuệ nhân tạo tạo sinh tiên tiến".'

In [56]:
import time

start_time = time.time()

result = retrieval_chain.invoke(question)

print('Time taken:',time.time() - start_time)

Time taken: 1.1959948539733887


In [57]:
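start_time = time.time()

# ainvoke() returns a coroutine and must be awaited; without `await` the call
# returns in about 0.0001 s because the chain never actually runs.
result = await retrieval_chain.ainvoke(question)

print('Time taken:',time.time() - start_time)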
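
A near-zero timing for an async call usually means the coroutine was created but never awaited. The standard-library sketch below separates the two steps; `slow_chain` is a hypothetical stand-in for the chain, not a LangChain object.

```python
import asyncio
import time


async def slow_chain(question):
    # Hypothetical stand-in for an async chain call.
    await asyncio.sleep(0.2)
    return f"answer to: {question}"


async def main():
    start = time.perf_counter()
    pending = slow_chain("q")  # no await: only creates a coroutine object
    created_in = time.perf_counter() - start

    start = time.perf_counter()
    result = await pending  # the work actually happens here
    awaited_in = time.perf_counter() - start
    return created_in, awaited_in, result


created_in, awaited_in, result = asyncio.run(main())
print(f"create: {created_in:.4f}s, await: {awaited_in:.4f}s, result: {result!r}")
```

In a notebook cell, `await main()` would replace `asyncio.run(main())`, since an event loop is already running there.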


In [58]:
start_time = time.time()

batch_output = retrieval_chain.batch([
                        "Ứng dụng của AI?",  # "Applications of AI?"
                        "Mục tiêu của AI?"   # "Goals of AI?"
                       ])

print('Time taken:',time.time() - start_time)

Time taken: 1.1168150901794434


In [59]:
batch_output

['Dựa trên ngữ cảnh được cung cấp, AI có nhiều ứng dụng và một số ứng dụng AI không được nhận diện là AI.',
 'Dựa trên ngữ cảnh được cung cấp, mục tiêu của AI là đạt được những mục tiêu đó.']

In [62]:
import json
from langchain_core.messages import ToolMessage
from langchain_core.tools import tool
from langchain_core.utils.function_calling import convert_to_openai_tool

@tool
def multiply(first_number: int, second_number: int):
    """Multiplies two numbers together."""
    return first_number * second_number

model_with_tools = llm.bind(tools=[convert_to_openai_tool(multiply)])

In [63]:
response = model_with_tools.invoke('What is 35 * 46?')

In [64]:
response

AIMessage(content='', additional_kwargs={'function_call': {'name': 'multiply', 'arguments': '{"second_number": 46.0, "first_number": 35.0}'}}, response_metadata={'prompt_feedback': {'block_reason': 0, 'safety_ratings': []}, 'finish_reason': 'STOP', 'model_name': 'gemini-2.0-flash', 'safety_ratings': []}, id='run--2e9f3cc4-ee6f-4f23-bd28-f6ec010db576-0', tool_calls=[{'name': 'multiply', 'args': {'second_number': 46.0, 'first_number': 35.0}, 'id': '91cb6231-929f-40b6-9ea7-fc245a23b373', 'type': 'tool_call'}], usage_metadata={'input_tokens': 32, 'output_tokens': 9, 'total_tokens': 41, 'input_token_details': {'cache_read': 0}})
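
Note that `content` is empty in the response above: the model only requests the tool call, and running it is left to the caller. A minimal sketch of that dispatch step follows, using a plain function and a dictionary shaped like the `tool_calls` entry shown above. The `TOOLS` registry is a hypothetical helper, not part of LangChain.

```python
def multiply(first_number, second_number):
    """Multiplies two numbers together."""
    return first_number * second_number


# Registry mapping tool names to plain Python callables (hypothetical helper).
TOOLS = {"multiply": multiply}

# Same shape as the `tool_calls` entry in the response above.
tool_calls = [
    {"name": "multiply", "args": {"second_number": 46.0, "first_number": 35.0}},
]

for call in tool_calls:
    result = TOOLS[call["name"]](**call["args"])
    print(call["name"], "->", result)  # multiply -> 1610.0
```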