## Connect to Google Drive Mock Data

In [2]:
%pip install -q -U langchain-google-genai langchain langchain-community pypdf python-dotenv


[notice] A new release of pip is available: 25.3 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [3]:
!pip install pypdf






In [4]:
!pip install chromadb






In [5]:
%pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.





In [6]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


In [7]:
import os


# Path to the mock syllabus PDF (update this to the location of your own copy)
mock_data_path = r"G:\.shortcut-targets-by-id\1IE0kGxUTOOmWfsPivDfCtr3CiHJJR9TO\Agentic_AI_Team_Folder\mock_data\syllabus\Syllabus.pdf"

# Verify that the agent can "see" the file before loading it.
# mock_data_path is a single file, so check it directly instead of listing it as a folder.
# print(f"File exists: {os.path.isfile(mock_data_path)}")
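Because `mock_data_path` points at a single PDF rather than a folder, a directory listing would fail on it. The following is a minimal sketch of a check that handles either case. The helper name `describe_path` is hypothetical and not part of the notebook.

```python
from pathlib import Path


def describe_path(path_str):
    """Return a short description of what exists at path_str."""
    path = Path(path_str)
    if path.is_file():
        return f"Found file: {path.name} ({path.stat().st_size} bytes)"
    if path.is_dir():
        # Only list PDFs, since the loaders in this notebook expect PDF input.
        pdfs = sorted(p.name for p in path.glob("*.pdf"))
        return f"Found {len(pdfs)} PDF documents: {pdfs}"
    return f"Nothing found at: {path}"


# Example with a path that does not exist (hypothetical name).
print(describe_path("no_such_folder/Syllabus.pdf"))
```

Calling `describe_path(mock_data_path)` before constructing a loader gives a readable message instead of a loader traceback when the Drive shortcut is not mounted.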

## Basic AI Agent Setup

In [8]:
from dotenv import load_dotenv

load_dotenv()


# Read the API key from the environment (a local .env file here, or Colab "Secrets" when running in Colab)
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-preview-09-2025")
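If the `.env` file is missing, `os.getenv` silently returns `None` and the failure only surfaces later, when the model is first called. A small guard can fail earlier with a clearer message. This is a sketch; `require_env` and the demo variable name are hypothetical.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or environment."
        )
    return value


# Demonstration with a throwaway variable name (hypothetical).
os.environ["DEMO_API_KEY"] = "abc123"
print(require_env("DEMO_API_KEY"))
```

In the notebook this would be called as `require_env("GOOGLE_API_KEY")` right after `load_dotenv()`.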

In [9]:

from langchain_community.document_loaders import PyPDFLoader

# Initialize the loader with the path to your PDF file
loader = PyPDFLoader(mock_data_path)

# Load the documents (each page is a separate Document object)
pages = loader.load()

# You can now work with the loaded pages, for example, print the number of pages
print(f"Number of pages loaded: {len(pages)}")

# Or access the content of a specific page
if pages:
    print(pages[0].page_content[:200]) # Print the first 200 characters of the first page


Number of pages loaded: 7
4780/6780: Fundamentals of Data Science
Kiril Kuzmin
Spring, 2025
E-mail: kkuzmin1@gsu.edu
Office Hours: M 2:00–4:00pm Class Hours: M,W 5:30–7:15pm
Office: 25 Park Place 750 Class Room: Langdale Hall 


In [10]:
full_text = "\n\n".join([page.page_content for page in pages])
print(f"Loaded {len(pages)} pages, {len(full_text)} total characters")

Loaded 7 pages, 11227 total characters


In [11]:
# create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=500,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_text(full_text)
print(f"Created {len(chunks)} chunks")

Created 12 chunks
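To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below uses a plain fixed-stride sliding window. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on the listed separators, so its chunk count can differ from this estimate. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=1500, chunk_overlap=500):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# Placeholder text with the same length as the syllabus text in this notebook.
sample = "x" * 11227
windows = sliding_window_chunks(sample)
print(len(windows))
```

A larger overlap shrinks the stride, which raises the chunk count and the embedding cost but lowers the chance that a deadline is cut off from its heading.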


In [12]:
# embedding
embeddings = GoogleGenerativeAIEmbeddings(
    model = "models/gemini-embedding-001"
)

# Create a Chroma vector store from the chunks
vectorstore = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
    collection_name="TaskBuddy_Phase1"
)

print(f"Stored {vectorstore._collection.count()} chunks in the vector store")

Stored 12 chunks in the vector store


In [13]:
retrieval = vectorstore.similarity_search("extract every assignment, project, or exam mentioned", k=5)
for doc in retrieval:
    print(doc, '\n')

page_content='✓ Project Implementation (35 points): Provide Jupyter notebooks for data preprocessing and
modeling.
✓ Presentation (30 points): Deliver a clear and professional group presentation, followed by a
Q&A.
✓ Final Report (25 points): Submit a detailed PDF report explaining your process, findings,
and recommendations.
Deadlines
• February 18 th: Proposal submission (via email to kkuzmin1@gsu.edu with CC to all members).
• April 16 th, April 21th, April 23th: Project presentations.
• April 25 th: Final report submission.
Further details and guidelines will be provided on iCollege.
4/7' 

page_content='4780/6780: Fundamentals of Data Science: Spring 2025
• Encoding techniques and feature creation/extraction
• Forward and backward feature selection
Week 10: Model Evaluation and Hyperparameter T uning
• Metrics: Accuracy, precision, recall (sensitivity), F1-score, specificity
• ROC (Receiver Operating Characteristic) curve and thresholding
• Cross-validation

In [14]:
# --- Prompt and Generation ---
template = """You are an academic assistant. Below is the text from several syllabus documents.
    Please extract every assignment, project, or exam mentioned.
    Format the output as a clean table with two columns: 'Assignment Name' and 'Due Date'.
    If a due date is missing, write 'Not Specified'

Context: {context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatGoogleGenerativeAI(model="gemini-flash-latest")

def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": vectorstore.as_retriever(search_kwargs={"k": 5}) | format_docs,
    "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("RAG chain built successfully.")

RAG chain built successfully.


In [15]:
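from IPython.display import Markdown, display

# The string passed to invoke() is used both as the retrieval query and as the
# {question} slot in the prompt, so a descriptive query is likely to retrieve
# more relevant chunks than a single word.
response = rag_chain.invoke('Generate')
display(Markdown(response))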

Based on the syllabus documents provided, here are the assignments, projects, and exams mentioned:

| Assignment Name | Due Date |
| :--- | :--- |
| Homework Assignments (5 total) | Not Specified |
| Project Proposal | February 18th |
| Midterm Exam | February 26th (tentative) |
| Project Presentations | April 16th, April 21st, or April 23rd |
| Final Project Report | April 25th |
| Final Exam | April 30th (4:15–6:00pm) |
| Project Implementation (Jupyter notebooks) | Not Specified |
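
The model returns the table as Markdown text, so later steps need it as structured rows. One possible way to convert it is sketched below with the standard library only. `parse_markdown_table` is a hypothetical helper, and the sample rows are copied from the response above.

```python
def parse_markdown_table(markdown):
    """Convert a Markdown pipe table into a list of dicts keyed by header."""
    rows = []
    header = None
    for line in markdown.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # ignore prose around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the alignment row such as | :--- | :--- |
        if all(cell and set(cell) <= set(":- ") for cell in cells):
            continue
        if header is None:
            header = cells
            continue
        rows.append(dict(zip(header, cells)))
    return rows


sample = """
Based on the syllabus documents provided:

| Assignment Name | Due Date |
| :--- | :--- |
| Project Proposal | February 18th |
| Final Project Report | April 25th |
"""
for row in parse_markdown_table(sample):
    print(row["Assignment Name"], "->", row["Due Date"])
```

Asking the model for JSON instead of a table would avoid this parsing step, at the cost of a less readable notebook output.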

TODO: Rewrite the function below and combine it with the retrieval step above.

In [16]:
# def extract_assignments():
#     raw_content = pages[0].page_content

#     prompt = f"""
#     You are an academic assistant. Below is the text from several syllabus documents.
#     Please extract every assignment, project, or exam mentioned.
#     Format the output as a clean table with two columns: 'Assignment Name' and 'Due Date'.
#     If a due date is missing, write 'Not Specified'.

#     SYLLABUS TEXT:
#     {raw_content}
#     """

#     response = llm.invoke(prompt)
#     return response.content

# # Run the agent
# assignments_list = extract_assignments()
# print(assignments_list)

## Google Calendar

In [17]:
%pip install -qU "langchain-google-community[calendar]"

Note: you may need to restart the kernel to use updated packages.


In [18]:
%pip install google-api-python-client google-auth-httplib2 google-auth-oauthlib

Note: you may need to restart the kernel to use updated packages.





In [19]:
import datetime
import zoneinfo
import os.path

from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

SCOPES = ["https://www.googleapis.com/auth/calendar.events"]


def event_exists(service, title, start_time, end_time):
    """Check if an event with same title exists in the time window"""
    events_result = service.events().list(
        calendarId="primary",
        timeMin=start_time,
        timeMax=end_time,
        singleEvents=True,
        orderBy="startTime"
    ).execute()

    events = events_result.get("items", [])

    for e in events:
        if e.get("summary") == title:
            return True

    return False



def main():
    creds = None

    if os.path.exists("token.json"):
        creds = Credentials.from_authorized_user_file("token.json", SCOPES)

    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                "credentials.json", SCOPES
            )
            creds = flow.run_local_server(port=0)

        with open("token.json", "w") as token:
            token.write(creds.to_json())

    try:
        service = build("calendar", "v3", credentials=creds)

        title = "Cloud Study Session"

        tz = zoneinfo.ZoneInfo("America/Chicago")

        start_dt = datetime.datetime(2026, 2, 28, 15, 0, 0, tzinfo=tz)
        end_dt = datetime.datetime(2026, 2, 28, 17, 0, 0, tzinfo=tz)

        start_iso = start_dt.isoformat()
        end_iso = end_dt.isoformat()
        
        if event_exists(service, title, start_iso, end_iso):
            print("event already exists!")
            return
        
        # 🔥 Event details
        event = {
            "summary": title,
            "location": "TCU Library",
            "description": "Preparing for cloud + cybersecurity internship",
            "start": {
                "dateTime": start_iso,
                "timeZone": "America/Chicago",
            },
            "end": {
                "dateTime": end_iso,
                "timeZone": "America/Chicago",
            },
            "reminders": {
                "useDefault": False,
                "overrides": [
                    {"method": "popup", "minutes": 30},
                    {"method": "popup", "minutes": 10},
                ],
            }
        }

        created_event = service.events().insert(
            calendarId="primary",
            body=event
        ).execute()

        print("Event created:", created_event.get("htmlLink"))

    except HttpError as error:
        print(f"An error occurred: {error}")

if __name__ == "__main__":
    main()

event already exists!
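
To connect the extracted deadlines to the calendar step, each assignment row has to become an event body. The sketch below assumes all-day events and a caller-supplied year, since the syllabus dates omit it. `parse_due_date` and `build_deadline_event` are hypothetical helpers. The Calendar API represents all-day events with `date` fields and an exclusive end date.

```python
import datetime
import re


def parse_due_date(text, year):
    """Parse strings like 'February 18th' into a date; return None if unparseable."""
    match = re.search(r"([A-Z][a-z]+)\s+(\d{1,2})", text)
    if not match:
        return None
    try:
        parsed = datetime.datetime.strptime(
            f"{match.group(1)} {match.group(2)} {year}", "%B %d %Y"
        )
    except ValueError:
        return None  # for example, the first word is not a month name
    return parsed.date()


def build_deadline_event(name, due_text, year):
    """Build an all-day Calendar event body, or None when no date is found."""
    due = parse_due_date(due_text, year)
    if due is None:
        return None
    return {
        "summary": name,
        "start": {"date": due.isoformat()},
        # The end date of an all-day event is exclusive, so use the next day.
        "end": {"date": (due + datetime.timedelta(days=1)).isoformat()},
    }


print(build_deadline_event("Project Proposal", "February 18th", 2025))
print(build_deadline_event("Homework Assignments", "Not Specified", 2025))
```

Only the first date in a string is used, so an entry listing several presentation days would need to be split into separate rows first.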


## Docling - doc parser

In [33]:
%pip install docling torch torchvision

Note: you may need to restart the kernel to use updated packages.





In [34]:
%pip install -qU langchain-docling

Note: you may need to restart the kernel to use updated packages.





In [58]:
from langchain_docling.loader import DoclingLoader

file = r"G:\.shortcut-targets-by-id\1IE0kGxUTOOmWfsPivDfCtr3CiHJJR9TO\Agentic_AI_Team_Folder\mock_data\syllabus\Syllabus2.pdf"

loader = DoclingLoader(file_path=file)

docs = loader.load()
print(f"Loaded {len(docs)} documents from DoclingLoader")


Loaded 22 documents from DoclingLoader


In [59]:
for doc in docs:
    print(doc.page_content, "\n---\n")

Course
Course Title, Prefix, Number, Section: Web Technologies, CITE 30363, 074
Semester and Year: Spring 2026
Number of Credits: 3
Course Component Type: Lecture
Class Location: RJH 113
Class Meeting Day(s) & Time(s): MW 4 PM - 5:20 PM 
---

Instructor
Instructor Name: Bingyang Wei
Office Location: Tucker Technology Center 341D
Office Hours: Tuesday: 9 AM - 11 AM, Wednesday: 2 PM - 4 PM
Preferred Method of Contact: Email
Email: b.wei@tcu.edu 
---

FINAL EXAM: 2 PM - 4:30 PM, MAY 4, 2026
Note for students: The syllabus is your first course reading. It provides an orientation to, overview of the flow, and expectations of the course. You should turn to the syllabus for details on assignments and course policies. 
---

Student Resources & Policy Information
Scan QR code for resources to support you as a TCU student. Please note section on Student Access and Accommodation, Academic Conduct & Course Materials Policies, and Emergency Response & TCU Alert. 
---

SYLLABUS: WEB TECHNOLOGIES
Las

In [60]:
from docling.chunking import HybridChunker

loader = DoclingLoader(
    file_path=mock_data_path,
    chunker=HybridChunker()
)

docs = loader.load()
print(f"Loaded {len(docs)} documents with HybridChunker")


Loaded 31 documents with HybridChunker


In [61]:
for doc in docs:
    print(doc.page_content, "\n---\n")

Kiril Kuzmin
Spring, 2025
Office: 25 Park Place 750, Class Hours: M,W 5:30-7:15pm = Class Room: Langdale Hall 215 
---

Course Description
This course provides students with essential concepts, principles, and tools to extract and generalize knowledge from data. Students will develop a comprehensive skill set, including data processing, statistics, and machine learning, and gain a solid understanding of how these skills integrate to solve real-world problems. The course systematically introduces key topics in data science, including:
1. Principles of data processing and representation;
2. Theoretical foundations and recent advancements in data science;
3. Modeling and algorithms;
4. Evaluation techniques.
The focus will be on the breadth of these topics rather than on in-depth exploration. Realworld engineering problems and datasets will be used to illustrate the advantages and limitations of various algorithms, allowing students to compare their effectiveness and efficiency and to dis

In [62]:
full_text = "\n\n".join([doc.page_content for doc in docs])
print(f"Combined full text length: {len(full_text)} characters")

Combined full text length: 10703 characters


In [63]:
# create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=500,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_text(full_text)
print(f"Created {len(chunks)} chunks")

Created 11 chunks


In [64]:
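# Note: re-running this cell in the same kernel appears to append to the existing
# in-memory collection, which would explain a stored count larger than len(chunks).
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001",
)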
vectorstore = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
    collection_name="taskbuddy_with_docling"
)

print(f"Stored {vectorstore._collection.count()} chunks in the vector store")

Stored 33 chunks in the vector store
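
One way to make the cell above safe to re-run is to derive an ID from each chunk's content and pass the IDs to the vector store, so that the same text maps to the same entry. The sketch below covers only the ID step, using the standard library. `stable_chunk_ids` is a hypothetical helper, and the `ids=` usage with `Chroma.from_texts` is shown as a comment because it is an assumption about how the notebook would adopt it.

```python
import hashlib


def stable_chunk_ids(chunks):
    """Return (unique_chunks, ids) where each ID is a hash of the chunk text."""
    unique_chunks, ids, seen = [], [], set()
    for chunk in chunks:
        chunk_id = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        if chunk_id in seen:
            continue  # drop exact duplicates within the same batch
        seen.add(chunk_id)
        unique_chunks.append(chunk)
        ids.append(chunk_id)
    return unique_chunks, ids


demo_chunks = ["Deadlines: April 25", "Grading policy", "Deadlines: April 25"]
unique_chunks, ids = stable_chunk_ids(demo_chunks)
print(len(unique_chunks), len(ids))

# Assumed usage with the notebook's objects (not executed here):
# vectorstore = Chroma.from_texts(
#     texts=unique_chunks,
#     embedding=embeddings,
#     ids=ids,
#     collection_name="taskbuddy_with_docling",
# )
```

An alternative is to delete the collection before rebuilding it, which is simpler but discards anything stored by earlier cells.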


In [65]:
retrieval = vectorstore.similarity_search("extract every assignment, project, or exam mentioned", k=5)
for doc in retrieval:
    print(doc, '\n')

page_content='Deliverables and Grading
- Processed Datasets (10 points): Submit raw and preprocessed datasets.
- Project Implementation (35 points): Provide Jupyter notebooks for data preprocessing and modeling.
- Presentation (30 points): Deliver a clear and professional group presentation, followed by a Q&A.
- Final Report (25 points): Submit a detailed PDF report explaining your process, findings, and recommendations.

Deadlines
- February 18 th : Proposal submission (via email to kkuzmin1@gsu.edu with CC to all members).
- April 16 th , April 21 th , April 23 th : Project presentations.
- April 25 th : Final report submission.
Further details and guidelines will be provided on iCollege.' 

page_content='Deliverables and Grading
- Processed Datasets (10 points): Submit raw and preprocessed datasets.
- Project Implementation (35 points): Provide Jupyter notebooks for data preprocessing and modeling.
- Presentation (30 points): Deliver a clear and professional group presentation, foll

In [66]:
retrieval = vectorstore.similarity_search_with_score("extract every assignment, project, or exam mentioned", k=5)
for doc, score in retrieval:
    print(f"Score: {score:.4f}\n{doc.page_content}\n")

Score: 0.6402
Deliverables and Grading
- Processed Datasets (10 points): Submit raw and preprocessed datasets.
- Project Implementation (35 points): Provide Jupyter notebooks for data preprocessing and modeling.
- Presentation (30 points): Deliver a clear and professional group presentation, followed by a Q&A.
- Final Report (25 points): Submit a detailed PDF report explaining your process, findings, and recommendations.

Deadlines
- February 18 th : Proposal submission (via email to kkuzmin1@gsu.edu with CC to all members).
- April 16 th , April 21 th , April 23 th : Project presentations.
- April 25 th : Final report submission.
Further details and guidelines will be provided on iCollege.

Score: 0.6402
Deliverables and Grading
- Processed Datasets (10 points): Submit raw and preprocessed datasets.
- Project Implementation (35 points): Provide Jupyter notebooks for data preprocessing and modeling.
- Presentation (30 points): Deliver a clear and professional group presentation, follow
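
With this vector store the score is a distance, so lower values indicate closer matches. Identical scores attached to identical text, as in the output above, point to duplicate entries in the collection. Until the collection is rebuilt, duplicates can be filtered at query time. This is a sketch on plain `(text, score)` tuples; `dedupe_results` is a hypothetical helper and the sample values are illustrative.

```python
def dedupe_results(results):
    """Keep the first (best-ranked) result for each distinct text."""
    seen = set()
    unique = []
    for text, score in results:
        key = " ".join(text.split())  # normalize whitespace before comparing
        if key in seen:
            continue
        seen.add(key)
        unique.append((text, score))
    return unique


# Illustrative values only; real input would come from the retrieval call.
sample_results = [
    ("Deliverables and Grading", 0.6402),
    ("Deliverables and Grading", 0.6402),
    ("Course Description", 0.71),
]
for text, score in dedupe_results(sample_results):
    print(f"{score:.4f}  {text}")

# Assumed usage with the notebook's objects (not executed here):
# pairs = [(doc.page_content, score) for doc, score in retrieval]
# unique_pairs = dedupe_results(pairs)
```

Requesting a larger `k` before deduplicating helps ensure that five distinct chunks remain afterwards.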