**Import packages**

In [1]:
!pip install langchain-community "unstructured[pdf]" langchain-text-splitters chromadb sentence-transformers


In [1]:
# Import the DirectoryLoader class from the langchain_community.document_loaders module
# (the older langchain.document_loaders import path is deprecated)
# This class is used to load documents from a specified directory
from langchain_community.document_loaders import DirectoryLoader

# Define the path to the directory where the documents are stored
# In this case, the current directory ("./") is used
DATA_PATH = "./"

# Define a function to load documents from the specified directory
def load_documents():
    # Create an instance of DirectoryLoader, specifying the directory path and the file pattern to match
    # Here, we are loading all PDF files ("*.pdf") in the directory
    loader = DirectoryLoader(DATA_PATH, glob="*.pdf")

    # Load the documents using the loader
    documents = loader.load()

    # Return the loaded documents
    return documents

# The function load_documents() can now be called to load all PDF documents from the specified directory
# Example usage:
# documents = load_documents()
# This will load all PDF files in the current directory and return them as a list of document objects

**Explanation:**

-   **DirectoryLoader**: This is a utility that loads documents from a
    specified directory. It can filter files based on a pattern, such
    as \*.pdf for PDF files.

-   **DATA_PATH**: This variable holds the path to the directory where
    the documents are located. In this case, it's set to the current
    directory (./).

-   **load_documents()**: This function initializes
    the DirectoryLoader with the specified path and file pattern, loads
    the documents, and returns them.

**Usage:**

-   You can call load_documents() to load all PDF files from the
    directory specified by DATA_PATH.

-   The returned documents object can be used for further processing,
    such as text extraction, analysis, or feeding into a language model.
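
Before calling load_documents(), it can be useful to preview which files a pattern such as \*.pdf would select. The sketch below uses only the standard library; list_matching_files is a hypothetical helper (not part of LangChain), and the file names are made up for the demo. It approximates non-recursive glob matching and is not a description of DirectoryLoader's internals.

```python
import tempfile
from pathlib import Path


def list_matching_files(data_path, pattern="*.pdf"):
    """Return the sorted names of files directly inside data_path that match pattern."""
    return sorted(p.name for p in Path(data_path).glob(pattern) if p.is_file())


# Demo on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ("syllabus.pdf", "notes.txt", "exam.pdf"):
        (Path(tmp) / name).touch()
    matches = list_matching_files(tmp)
    print(matches)
```

Calling list_matching_files(DATA_PATH) on the real directory gives a quick sanity check on how many documents to expect from the loader.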

**Example:**

In [2]:
# Load documents from the current directory
documents = load_documents()

# Print the number of documents loaded
print(f"Loaded {len(documents)} documents.")

Loaded 1 documents.


**Inspect the Loaded Documents**

The documents object returned by the DirectoryLoader's load() method is a list of Document objects. Each Document object represents a single loaded file (e.g., a PDF) and contains both its text content and its associated metadata.

In [3]:
# Print documents datatype
# The documents object is a list, so you can iterate over it or access individual documents by index.
print(type(documents))

<class 'list'>


In [4]:
# Print the datatype of a single document
print(type(documents[0]))

<class 'langchain_core.documents.base.Document'>


**Structure of a Document Object**

Each Document object has two main attributes:

1.   **page_content**: A string containing the text content of the document. This is the primary data you’ll use for NLP tasks like text analysis, summarization, or question answering.

2.   **metadata**: A dictionary containing metadata about the document (e.g., file path, source, etc.).
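
To make the two attributes concrete without loading a PDF, here is a minimal stand-in. SimpleDocument is a hypothetical class written for illustration (it is not LangChain's Document), and "example.pdf" is a made-up source name.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in mirroring the two attributes described above."""
    page_content: str                              # the document text
    metadata: dict = field(default_factory=dict)   # e.g. {"source": "<file path>"}


doc = SimpleDocument(
    page_content="Course Title: Design and Analysis of Algorithms",
    metadata={"source": "example.pdf"},  # made-up file name
)

preview = doc.page_content[:12]   # slice the text like any other string
source = doc.metadata["source"]   # look up metadata like any other dict
print(preview)
print(source)
```

The real Document objects are accessed the same way: page_content behaves as a string and metadata as a dictionary.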



In [5]:
# Inspect the first document
if documents:
    first_doc = documents[0]
    print("Structure of a Document object:")
    print(type(first_doc))  # <class 'langchain_core.documents.base.Document'>

    print("\nContent of the first document:")
    print(first_doc.page_content[0:100])  # Text content of the document

    print("\nMetadata of the first document:")
    print(first_doc.metadata)  # Metadata dictionary

Structure of a Document object:
<class 'langchain_core.documents.base.Document'>

Content of the first document:
T-104 2022

Course Specification

T-104 2022

Course Specification

Course Title: Design and Analysi

Metadata of the first document:
{'source': '02_ 432CCS-3_CS.pdf'}


**Preprocess the Documents**

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [8]:
# Initialize a text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Split text into chunks of at most 1000 characters
    chunk_overlap=200,  # Overlap consecutive chunks by up to 200 characters for context
    length_function=len,  # Measure chunk length in characters
    add_start_index=True,  # Record each chunk's start offset in its metadata
)

In [9]:
# Split documents into smaller chunks
chunks = text_splitter.split_documents(documents)

In [10]:
# Print the number of chunks created
print(f"Number of text chunks created: {len(chunks)}")

Number of text chunks created: 15
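
To see how chunk_size and chunk_overlap interact, the sketch below implements a deliberately simplified fixed-window splitter. It is not the RecursiveCharacterTextSplitter algorithm (which also tries to break on paragraph and sentence separators); split_with_overlap is a hypothetical helper that only illustrates the sliding-window idea and the start offset recorded per chunk.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration: separators such as newlines are ignored.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return pieces


# Synthetic 2500-character sample text
sample = "".join(str(i % 10) for i in range(2500))
pieces = split_with_overlap(sample)
summary = [(p["start_index"], len(p["text"])) for p in pieces]
print(summary)
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.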


In [11]:
# Inspect the first chunk
if chunks:
    print("First chunk content:")
    print(chunks[0].page_content)

First chunk content:
T-104 2022

Course Specification

T-104 2022

Course Specification

Course Title: Design and Analysis of Algorithms

Course Code: 432CCS- 3

Program: Bachelor in Computer Science

Department: Computer Science.

College: Collage of Computer Science

Institution: King Khalid University.

Version: 1

Last Revision Date: 25 September 2023

1

Table of Contents:

Content

A. General Information about the course

1. Teaching mode 2. Contact Hours

B. Course Learning Outcomes (CLOs), Teaching Strategies and Assessment Methods

Engaging in discussions and collaborative activities with peers to explore alternative problem- and solving embracing for collective learning and growth. C. Course Content

approaches opportunities

V1

Tutorial Activities

D. Student Assessment Activities

E. Learning Resources and Facilities

1. References and Learning Resources

2. Required Facilities and Equipment

F. Assessment of Course Quality

G. Specification Approval Data

2

Page

3

4

4

In [12]:
# Each chunk is a Document object
print(type(chunks[0]))

<class 'langchain_core.documents.base.Document'>


In [13]:
# Each chunk has the two attributes: page_content and metadata
print(chunks[0])

page_content='T-104 2022

[... remainder identical to the first chunk content printed above ...]

5' me

**Embed the Documents for Vector Search**

If you want to perform semantic search or similarity matching, you can embed the documents into vector representations.

In [14]:
from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model (e.g., all-MiniLM-L6-v2)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Embed the text of every chunk
chunk_embeddings = model.encode([chunk.page_content for chunk in chunks])

# Print the first 5 dimensions of the first chunk's embedding
print(f"Embedding of the first chunk (first 5 dimensions): {chunk_embeddings[0][:5]}")

Embedding of the first chunk (first 5 dimensions): [-0.0123356   0.07022995 -0.05620731 -0.0328389  -0.02182647]
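
Similarity matching on embeddings usually comes down to comparing vectors. A common choice is cosine similarity, sketched below in plain Python. The vectors here are tiny made-up examples, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors: one pair pointing the same way, one pair at right angles
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(round(same_direction, 6), orthogonal)
```

With real data, the same function could be applied to two rows of chunk_embeddings, or to a chunk embedding and the embedding of a query string.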


In [15]:
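# These classes now live in langchain_community / langchain_core;
# the older langchain.* import paths are deprecated
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.documents import Document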


# Step 1: Initialize the embedding model using HuggingFaceEmbeddings
model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=model_name)



In [21]:
# List of strings to encode
texts = ["This is the first string.", "This is the second string."]

# Encode the list of strings
embeddings_list = embeddings.embed_documents(texts)

# Print the embeddings
for i, embedding in enumerate(embeddings_list):
    print(f"Embedding of string {i+1} (first 5 dimensions): {embedding[:5]}")

Embedding of string 1 (first 5 dimensions): [-0.010816674679517746, 0.02960001863539219, -0.022867651656270027, -0.019406117498874664, -0.09649969637393951]
Embedding of string 2 (first 5 dimensions): [-0.0017478391528129578, 0.03199101611971855, -0.027795197442173958, -0.023361021652817726, -0.08704768866300583]


In [16]:
import os
import shutil

# Step 2: Create a Chroma vector store from the documents
persist_directory = "./chroma_db"
# First delete the persist directory if it exists, so the store is rebuilt from scratch
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)
    print(f"Chroma database at '{persist_directory}' has been deleted.")
else:
    print(f"No Chroma database found at '{persist_directory}'.")



No Chroma database found at './chroma_db'.


In [17]:
# Step 2 (continued): Create a new Chroma vector store
vector_store = Chroma.from_documents(
    documents=chunks,  # List of Document objects
    embedding=embeddings,  # Embedding model
    persist_directory="./chroma_db"  # Directory to persist the database
)


In [18]:
# Step 3: Save the vector store (optional: Chroma 0.4+ persists automatically,
# so this call is deprecated and only matters for older Chroma versions)
vector_store.persist()

# Step 4: Load the vector store later (if needed)
loaded_vector_store = Chroma(
    persist_directory="./chroma_db",  # Directory where the database is stored
    embedding_function=embeddings  # Embedding model
)
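
Before querying the store, it may help to see what a top-k similarity search does conceptually. The sketch below is a brute-force version over a toy bag-of-words embedding. Everything in it is hypothetical: toy_embed is not the MiniLM model, the sample texts are made up, and a real vector store typically uses an approximate nearest-neighbour index rather than scanning every vector. Squared Euclidean distance is used here as an assumption about a typical default metric.

```python
from collections import Counter


def toy_embed(text, vocabulary):
    """Hypothetical toy embedding: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query, texts, k=2):
    """Return the k texts whose toy embeddings lie closest to the query's."""
    vocabulary = sorted({w for t in texts + [query] for w in t.lower().split()})
    query_vec = toy_embed(query, vocabulary)
    scored = [(squared_l2(query_vec, toy_embed(t, vocabulary)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0])  # smaller distance means more similar
    return [t for _, t in scored[:k]]


# Made-up stand-ins for chunk texts
texts = [
    "course learning outcomes and assessment methods",
    "required facilities and equipment",
    "course content and tutorial activities",
]
results = top_k("course learning outcomes", texts, k=2)
print(results)
```

The k argument plays the same role as in the vector store's similarity_search: it caps how many of the closest entries are returned.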


In [20]:
# Step 5: Perform a similarity search
query = "what are the course CLO?"
similar_docs = loaded_vector_store.similarity_search(query, k=2)  # Retrieve top 2 similar documents

# Print the results
print("Top 2 similar documents:")
for doc in similar_docs:
    print("--------------------")
    print(doc.page_content)

Top 2 similar documents:
--------------------
3 (2+1)

7th Level /4th Year

1. Teaching mode No 1. 2.

Mode of Instruction

Traditional classroom E-learning Hybrid

Contact Hours 60

Percentage 100

3.

4.

Traditional classroom  E-learning Distance learning

2. Contact Hours

No 1. 2. 3. 4. 5.

Lectures Laboratory/Studio Field Tutorial Others (specify)

Activity

Contact Hours

30 0 0 30 0 60

Total

B. Course Learning Outcomes (CLOs), Teaching Strategies and Assessment Methods

Code

Course Learning Outcomes

Code of CLOs aligned with program

Teaching Strategies

Assessment Methods

1.0

1.1

2.0

2.1

2.2
--------------------
Code

Course Learning Outcomes

Code of CLOs aligned with program

Teaching Strategies

Assessment Methods

1.0

1.1

2.0

2.1

2.2

Knowledge and understanding Utilize scientific methods and reasoning the algorithms, of efficiency considering factors such as time complexity. Skills Analyze and compare different algorithmic and the most methods, selecting for