Ingesting PDF

In [2]:
!pip install --quiet unstructured langchain
!pip install --quiet "unstructured[all-docs]"   # extras for loading PDFs, text files, and other document types

In [3]:
!pip install -U langchain-community





In [43]:
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_community.document_loaders import OnlinePDFLoader      # used to load PDFs from a URL

local_path = "ML_unit4_ensemble learning (1).pdf"

# Load a local PDF file
if local_path:
    loader = UnstructuredPDFLoader(file_path=local_path)
    data = loader.load()
else:
    print("Upload a PDF file")

Vector Embeddings

In [5]:

!ollama pull nomic-embed-text

In [6]:
!ollama list

NAME                   	ID          	SIZE  	MODIFIED               
nomic-embed-text:latest	0a109f422b47	274 MB	Less than a second ago	


In [7]:
!pip install --quiet chromadb
!pip install --quiet langchain-text-splitters




In [8]:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

In [45]:
# Split the loaded document into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=7500, chunk_overlap=100)
chunks = text_splitter.split_documents(data)
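
The effect of chunk_size and chunk_overlap can be illustrated with a simplified fixed-width splitter. This is only a sketch: RecursiveCharacterTextSplitter prefers to break on separators such as paragraphs and newlines rather than at fixed offsets, and the helper name sliding_window_chunks is hypothetical. It does show why a large chunk_size such as 7500 yields very few chunks for a short document.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    Simplified illustration only, not the LangChain algorithm.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Under this simplified scheme, a 12,000-character text split with chunk_size=7500 and chunk_overlap=100 produces only two chunks.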

In [46]:
# Add the chunks to the vector database
vector_db = Chroma.from_documents(
    documents=chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text", show_progress=True),
    collection_name="local-rag"
)

OllamaEmbeddings: 100%|██████████| 2/2 [00:09<00:00,  4.69s/it]


Retrieval

In [11]:
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.chat_models import ChatOllama
from langchain_core.runnables import RunnablePassthrough
from langchain.retrievers.multi_query import MultiQueryRetriever

In [None]:
!ollama pull mistral


In [21]:
!ollama list


NAME                   	ID          	SIZE  	MODIFIED       
mistral:latest         	2ae6f6dd7a3d	4.1 GB	2 minutes ago 	
nomic-embed-text:latest	0a109f422b47	274 MB	58 minutes ago	


In [47]:
# LLM from Ollama
local_model = "mistral"
llm = ChatOllama(model=local_model)

In [48]:
QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five
    different versions of the given user question to retrieve relevant documents from
    a vector database. By generating multiple perspectives on the user question, your
    goal is to help the user overcome some of the limitations of the distance-based
    similarity search. Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
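
The prompt above asks the model for alternative questions separated by newlines. A multi-query retriever then, roughly, runs one search per generated question and keeps the unique union of the results. The sketch below imitates that flow in plain Python. The names parse_queries, unique_union, fake_llm_output, and fake_search are hypothetical stand-ins, and the strings are made-up placeholders rather than real model output or real chunks.

```python
def parse_queries(llm_output: str) -> list[str]:
    """Turn newline-separated model output into a list of non-empty queries."""
    return [line.strip() for line in llm_output.splitlines() if line.strip()]


def unique_union(result_lists) -> list[str]:
    """Merge several result lists, keeping first-seen order and dropping repeats."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


# Placeholder for what the LLM might return (assumption, not real output).
fake_llm_output = (
    "How does AdaBoost work?\n"
    "\n"
    "What steps make up AdaBoost?\n"
    "  Explain the AdaBoost procedure.\n"
)

# Placeholder search results per query (assumption, stands in for the vector store).
fake_search = {
    "How does AdaBoost work?": ["chunk-0", "chunk-1"],
    "What steps make up AdaBoost?": ["chunk-1"],
    "Explain the AdaBoost procedure.": ["chunk-1", "chunk-0"],
}

queries = parse_queries(fake_llm_output)
merged = unique_union(fake_search[q] for q in queries)
print(queries)
print(merged)
```

Because duplicates are dropped, asking several rephrased questions widens recall without passing the same chunk to the model more than once.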

In [49]:
retriever = MultiQueryRetriever.from_llm(
    vector_db.as_retriever(), 
    llm,
    prompt=QUERY_PROMPT
)

# RAG prompt
template = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [50]:
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
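
The first step of the chain maps one input string to two prompt variables: the retriever turns it into context, and RunnablePassthrough forwards it unchanged as the question. The plain-Python sketch below mimics that data flow without LangChain. fake_retriever and build_prompt are hypothetical names, the returned chunk texts are placeholders, and joining the chunks with blank lines is an assumption made for readability rather than what the notebook's chain does with the retrieved documents.

```python
TEMPLATE = """Answer the question based ONLY on the following context:
{context}
Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns placeholder chunk texts."""
    return ["placeholder chunk A", "placeholder chunk B"]


def build_prompt(question: str) -> str:
    """Fill the RAG template from a single input string."""
    inputs = {
        "context": fake_retriever(question),  # retrieved from the input
        "question": question,                 # passed through unchanged
    }
    context_text = "\n\n".join(inputs["context"])
    return TEMPLATE.format(context=context_text, question=inputs["question"])


rendered = build_prompt("What is bagging?")
print(rendered)
```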

In [51]:
chain.invoke("What are the steps for Adaboosting")

OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.26s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.07s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.17s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.28s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


' The steps for AdaBoosting, or Adaptive Boosting, can be summarized as follows:\n\n1. Initialize the weak learners: Start with a collection of weak learners (usually decision trees or simple classifiers). Each weak learner produces a classification function that maps input features to outputs.\n\n2. Assign weights: Initially, all the data points are assigned equal weights. However, AdaBoosting dynamically adjusts these weights during training based on the errors made by the previous weak learners. After each round of training, the misclassified examples get more weight.\n\n3. Train each weak learner: For each weak learner, train it using the data points with their updated weights from step 2. The goal is to minimize the weighted error rate on the given dataset.\n\n4. Combine weak learners: Combine the outputs of all the weak learners into a single ensemble by taking weighted sums or averages (depending on whether the weak learners are linear or non-linear).\n\n5. Normalize weights: Af

In [52]:
chain.invoke("What are the differences between boosting and bagging")

OllamaEmbeddings: 100%|██████████| 1/1 [00:04<00:00,  4.28s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.10s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.18s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2
OllamaEmbeddings: 100%|██████████| 1/1 [00:02<00:00,  2.25s/it]
Number of requested results 4 is greater than number of elements in index 2, updating n_results = 2


" Boosting and Bagging are two popular ensemble learning techniques used to improve the performance of machine learning models. Here are the main differences between them:\n\n1. Aim: The goal of Boosting is to correct the weaknesses of a base learner by giving it more weight on instances that it has misclassified, while Bagging aims to reduce variance and overfitting by combining multiple models trained on different subsets of the data.\n\n2. Training process: In Boosting, each new model (called a weak learner) is trained to correct the mistakes made by its predecessor. This results in stronger models being produced sequentially. Bagging, on the other hand, trains all the models simultaneously and combines their predictions.\n\n3. Voting scheme: When it comes to making predictions, Boosting uses a weighted majority vote where more weight is given to the stronger models, while Bagging uses a simple averaging or voting (majority) of the outputs from individual models.\n\n4. Example: A po

In [44]:
# Delete the "local-rag" collection backing this vector store
vector_db.delete_collection()