# Chunking Experiments

In [4]:
import os
from dotenv import load_dotenv

In [2]:
_ = load_dotenv()
CHUNKR_API_KEY = os.getenv("CHUNKR_API_KEY")
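
If `.env` is missing or the variable is unset, `os.getenv` returns `None`, and the problem may only surface later when the client is used. A minimal sketch of an early check, assuming a hypothetical helper named `require_env` (it is not part of `python-dotenv` or `chunkr_ai`):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value


# Intended use in this notebook, after load_dotenv() has run:
# CHUNKR_API_KEY = require_env("CHUNKR_API_KEY")
```

Failing at this point keeps a configuration error separate from any error raised by the upload itself.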

In [None]:
pdf_dir = "pdf_collection"
pdf_list = [
    os.path.join(pdf_dir, file) for file in os.listdir(pdf_dir) if file.lower().endswith(".pdf")
]
pdf_list

['pdf_collection/empty_graph.pdf',
 'pdf_collection/screenshot_text_and_image.pdf',
 'pdf_collection/complex_graph.pdf',
 'pdf_collection/syllabus.pdf',
 'pdf_collection/table.pdf']
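
The order returned by `os.listdir` is arbitrary, so indexing into `pdf_list` (for example `pdf_list[3]`) can pick a different file on another machine. A small sketch of a sorted listing, assuming a hypothetical helper named `list_pdfs`:

```python
from pathlib import Path


def list_pdfs(directory: str) -> list[str]:
    """Return PDF paths in a directory, sorted by name for a stable order."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Intended use in this notebook:
# pdf_list = list_pdfs("pdf_collection")
```

With a sorted list, positions stay the same between runs as long as the directory contents do not change.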

## Chunkr Experiment

In [12]:
# !pip install chunkr-ai  # these experiments were run with chunkr-ai 0.0.41


In [6]:
from chunkr_ai import Chunkr
from chunkr_ai.models import (
    ChunkProcessing,
    Configuration,
    GenerationConfig,
    GenerationStrategy,
    SegmentationStrategy,
    SegmentProcessing,
)
from IPython.display import HTML, Markdown, display

In [9]:
chunkr = Chunkr(api_key=CHUNKR_API_KEY)

### Config Experiment 1 (highest performance according to the docs)

In [33]:
config = Configuration(
    high_resolution=True, # Use high resolution for all segments
    segmentation_strategy=SegmentationStrategy.LAYOUT_ANALYSIS,
    segment_processing=SegmentProcessing(
        Caption=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Footnote=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        ListItem=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Page=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        PageFooter=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        PageHeader=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Picture=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        SectionHeader=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Text=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Title=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        )
    )
)
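
The configuration above repeats the same `GenerationConfig` for every segment type. One way to cut the repetition is to build the keyword arguments in a loop. The sketch below is plain Python and does not import `chunkr_ai`; `SEGMENT_TYPES` and `build_segment_kwargs` are hypothetical names, and the segment type names are copied from the cell above.

```python
SEGMENT_TYPES = [
    "Caption", "Footnote", "ListItem", "Page", "PageFooter",
    "PageHeader", "Picture", "SectionHeader", "Text", "Title",
]


def build_segment_kwargs(make_config, segment_types=SEGMENT_TYPES):
    """Map each segment type name to a fresh config object from make_config()."""
    return {name: make_config() for name in segment_types}


# Intended use with the classes imported earlier (not executed here):
# segment_processing = SegmentProcessing(
#     **build_segment_kwargs(
#         lambda: GenerationConfig(
#             html=GenerationStrategy.LLM,
#             markdown=GenerationStrategy.LLM,
#         )
#     )
# )
```

Passing a factory rather than a single shared object gives each segment type its own config instance, so a later per-type override does not leak into the others.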

In [41]:
# Build the path by name: os.listdir order is not guaranteed, so pdf_list[3] is fragile
path = os.path.join(pdf_dir, "syllabus.pdf")
task = await chunkr.upload(path, config)

In [63]:
# Display the markdown content with images
markdown_content = task.markdown()
for chunk in task.output.chunks:
    if hasattr(chunk, 'image_base64') and chunk.image_base64:
        image_tag = f"![Image](data:image/png;base64,{chunk.image_base64})"
        markdown_content = markdown_content.replace(f"[Image: {chunk.id}]", image_tag)
display(Markdown(markdown_content))

The information contained on this page is designed to give students a representative example of material covered in the
course. Any information related to course assignments, dates, or course materials is illustrative only.

The image shows the logo of New York University (NYU). On the left is a purple square containing a white torch. To the right of the square are the letters "NYU" in purple. A thin black vertical line is on the right edge of the image.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus

Computer Science and Engineering
Principles of Database Systems

**Course Information**

Course Prerequisites

Graduate student status.

Course Description

This course broadly introduces database systems, including the
relational data model, query languages, database design, index and
file structures, query processing and optimization, concurrency and
recovery, transaction management and database design. Students
acquire hands-on experience in working with database systems and in
building web-accessible database applications.

Course Objectives

This course will provide students with the opportunity to:

- Apply queries in relational algebra to retrieve data.

• Apply queries in SQL to create, read, update and delete data in a
database.

• Apply the concepts of entity integrity constraint and referential
integrity constraint (including definition of the concept of a
foreign key).

- Describe the normal forms (1NF, 2NF, 3NF, BCNF, and 4NF) of a
relation.

- Apply normalization to a relation to create a set of BCNF
relations and denormalize a relational schema.

NYU

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

• Describe functional dependency between two or more attributes
that are a subset of a relation.

- Understand multi-valued dependency and identify examples in relational schemas.

# 1. Introduction

## 1.1. Purpose

This document describes the requirements for the development of a database system. It outlines the functionalities, data structure, and user interface aspects of the system.

## 1.2. Scope

The scope of this document includes the detailed description of the database system's features, data entities, relationships, and user interactions. It serves as a guide for developers and stakeholders involved in the project.

## 1.3. Intended Audience

This document is intended for:

-   Software Developers
-   Database Administrators
-   Project Managers
-   Stakeholders

## 1.4. Definitions, Acronyms, and Abbreviations

-   **ER**: Entity-Relationship
-   **UI**: User Interface
-   **DBMS**: Database Management System

## 1.5. References

-   IEEE Standard 830-1998, Recommended Practice for Software Requirements Specifications

## 1.6. Overview

The following sections provide a detailed description of the database system's requirements, including functional requirements, data requirements, and user interface requirements.

# 2. Overall Description

## 2.1. Product Perspective

The database system will be a standalone application, designed to manage and store data efficiently. It will provide a user-friendly interface for data entry, retrieval, and manipulation.

## 2.2. Product Functions

The system will provide the following functions:

-   Data Entry: Allows users to input new data into the system.
-   Data Retrieval: Enables users to search and retrieve existing data.
-   Data Modification: Permits users to update and modify data.
-   Data Deletion: Allows users to remove data from the system.
-   Reporting: Generates reports based on the stored data.

## 2.3. User Classes and Characteristics

-   **Administrators**: Users with full access to the system, responsible for managing users and system settings.
-   **Regular Users**: Users with limited access, able to perform data entry, retrieval, and modification.

## 2.4. Operating Environment

The system will operate on Windows, macOS, and Linux operating systems. It will require a modern web browser for the user interface.

## 2.5. Design and Implementation Constraints

-   The system must be implemented using open-source technologies.
-   The database must be scalable to handle large volumes of data.
-   The system must be secure to protect sensitive data.

## 2.6. User Documentation

User documentation will be provided in the form of a user manual and online help.

## 2.7. Assumptions and Dependencies

-   It is assumed that users have basic computer literacy.
-   The system depends on the availability of a stable network connection.

# 3. Specific Requirements

## 3.1. Functional Requirements

### 3.1.1. Data Entry

-   The system shall allow users to enter new data into the system.
-   The system shall validate data to ensure accuracy.
-   The system shall provide feedback to the user upon successful data entry.

### 3.1.2. Data Retrieval

-   The system shall allow users to search for data based on multiple criteria.
-   The system shall display search results in a clear and organized manner.
-   The system shall allow users to export search results to various formats (e.g., CSV, Excel).

### 3.1.3. Data Modification

-   The system shall allow users to modify existing data.
-   The system shall track changes made to the data.
-   The system shall provide an audit trail of data modifications.

### 3.1.4. Data Deletion

-   The system shall allow users to delete data.
-   The system shall require confirmation before deleting data.
-   The system shall archive deleted data for future reference.

### 3.1.5. Reporting

-   The system shall generate reports based on the stored data.
-   The system shall allow users to customize reports.
-   The system shall provide various reporting options (e.g., summary reports, detailed reports).

## 3.2. Data Requirements

### 3.2.1. ER Diagram

- Apply SQL to create a relational database schema based on
conceptual and relational models.

• Apply stored procedures, functions, and triggers using a
commercial relational DBMS.

- Describe concurrency control and how it is affected by isolation
levels in the database.

- Analyze Current Research in Database Systems.

Course Structure

This course is conducted entirely online, which means you do not have
to be on campus to complete any portion of it. You will participate in
the course using NYU Classes located at https://newclasses.nyu.edu
Your final grade will be computed as a combination of the components
shown below.

- Quizzes: 30%

- Labs: 40%

- Project: 30%

Weekly Structure

Week 1: Introduction to the Relational Model

- Introduce class and overview of course topics.

Weeks 2-4: SQL Language

The image shows the logo of New York University (NYU). On the left is a purple square containing a white torch. To the right of the square are the letters "NYU" in purple. A thin black vertical line is on the far right.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

- Introduction to SQL

- Intermediate SQL

- Advanced SQL

Week 5: Formal Relational Query Languages

- Relational Algebra

- Tuple Relational Calculus

- Domain Relational Calculus

Week 6: Database Design: The Entity-Relationship Approach

- ER Design

- Reduction to Relational Model

Week 7: Relational Database Design

- Functional Dependency

- Multivalued Dependency

- Normal Forms.

Week 8: Application Design

- Web Architectures

- Application Security

Week 9: Storage and File Structure

- Physical Storage

- Record Organization

Week 10: Indexing and Hashing

The image shows the logo of New York University (NYU). On the left is a purple square containing a white torch. To the right of the square, also in purple, are the letters "NYU". A thin black vertical line is on the right edge of the image.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

- Ordered Indices

- Hashed Indices

- Bitmap Indices

Weeks 11-12: Query Processing & Optimization

- Query Processing

- Query Optimization

Week 13: Transactions & Concurrency Control

- ACID Properties

- Transaction Management

Week 14: Recovery System & Database System Architectures

• Locks

- Deadlocks

- Snapshot Isolation

Week 15: Student Presentations

- Presentations and reviews

Learning Time Rubric

Please modify the below table to represent the breakdown of learning time in
each week of your course.

The image shows the logo of New York University (NYU). On the left is a purple square containing a white torch. To the right of the square, also in purple, are the letters "NYU". A thin black vertical line is on the right edge of the image.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

| Learning Time Element | Asynchronous* / Synchronous** | Time on Task for Students (weekly) | Notes |
|---|---|---|---|
| Reading Assignments / Recorded Lecture | Asynchronous | 2.5 hours | Video format. Expect quizzes throughout the module or weekly chapter readings |
| Weekly Discussion Board & Peer Review | Asynchronous | 1.5 hours | Students are expected to post responses to weekly topic questions. See Interaction Policy. |
| Assessment (Labs and Programming assignments) | Asynchronous | 2 hours | Students submit their assignment by [the end of the week] |
| Reading Assignment | Asynchronous | 2 hours | Reading assigned textbook chapters and journal articles. |
| Live webinars | Synchronous | 2 hours | Group discussion in class, live, overly weekly chapter |

NYU

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

Course Communication

Interaction Policy

Please follow the interaction guidelines stated below for this course.

- I will be holding online virtual classroom sessions every week.
This virtual classroom will be held via NYU Classes on Thursdays
from 8am to 9am.

- The course will involve regular discussions via the Discussion
Forums within NYU Classes and students are encouraged to
participate.

- If you have a technical or course content related question,
please send me an email. If I think that your question can
benefit the class, I might post it on the discussion forum.

- If you have a question related to grading, please send an email
to the TA and cc on the email thread. The TA will be responsible
for examining your answers and providing a grade as per my
guidelines.

- If any other questions need to be answered that are not
addressed via email or the live classroom, I can hold virtual
office hours on an appointment basis.

Announcements

Announcements will be posted on NYU Classes on a regular basis. You
can locate all class announcements under the Announcements tab of
our class. Be sure to check the class announcements regularly as they
will contain important information about class assignments and other
class matters.

Email

You are encouraged to post your questions about the course in the
Forums discussions on NYU Classes. This is an open forum in which
you and your classmates are encouraged to answer each other's
questions. But, if you need to contact me directly, please email me. All

NYU Tandon School of Engineering Logo

The image is a logo for the NYU Tandon School of Engineering. It consists of three distinct elements arranged horizontally:

1.  **Torch Symbol:** On the left, there is a square filled with a deep purple color. Inside this square is a white stylized image of a torch with a flame.

2.  **"NYU" Text:** To the right of the square, the letters "NYU" are displayed in a bold, sans-serif font, also in the same deep purple color as the square.

3.  **"TANDON SCHOOL OF ENGINEERING" Text:** To the right of the "NYU" text, separated by a vertical black line, is the text "TANDON SCHOOL OF ENGINEERING" in a bold, sans-serif font. The text is black.

Course Syllabus - CS GY 6083 Principles of Database System

homework, labs or programming assignments related questions must
be researched first on own time, then posted on forums, then
discussed with TAs during weekly reviews, and then can be forwarded
to me. Typically, you can expect a response within 48 hours.

Readings

Avi Silberschatz, Henry F. Korth, S. Sudarshan, Database System
Concepts, Sixth Edition, McGraw Hill

You can access NYU's central library here: http://library.nyu.edu/
You can access NYU Tandon's Bern Dibner Library here:
http://library.poly.edu/

RECOMMENDED READINGS are online journal articles provided in each
lecture You can access NYU's central library here: http://library.nyu.edu/

You can access NYU Tandon's Bern Dibner Library here:
http://library.poly.edu/

Assignments and Exams

Exams Administered and Proctored Online

Exams in this course are administered through NYU Classes. You are required
to arrange an online proctor for your exams via ProctorU. More information
on ProctorU and scheduling proctoring sessions can be found on Tandon
Online's website.

<image>

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

Exams Administered On Paper and Proctored Remotely

Exams in this course are administered via paper and pencil. If you are not
able to attend an exam session on-campus, you are required to secure
in-person proctoring arrangements near your location. Tandon Online's
website.

University Policies

Moses Center Statement of Disability

Academic accommodations are available for students with disabilities. Please
contact the Moses Center for Students with Disabilities (212-998-4980 or
mosescsd@nyu.edu) for further information. Students who are requesting
academic accommodations are advised to reach out to the Moses Center as
early as possible in the semester for assistance.

NYU Tandon School of Engineering Policies and
Procedures on Academic Misconduct¹

A. Introduction: The School of Engineering encourages academic
excellence in an environment that promotes honesty, integrity, and
fairness, and students at the School of Engineering are expected to
exhibit those qualities in their academic work. It is through the process
of submitting their own work and receiving honest feedback on that
work that students may progress academically. Any act of academic
dishonesty is seen as an attack upon the School and will not be
tolerated. Furthermore, those who breach the School's rules on
academic integrity will be sanctioned under this Policy. Students are
responsible for familiarizing themselves with the School's Policy on
Academic Misconduct.

B. Definition: Academic dishonesty may include misrepresentation,
deception, dishonesty, or any act of falsification committed by a
student to influence a grade or other academic evaluation. Academic
dishonesty also includes intentionally damaging the academic work of
others or assisting other students in acts of dishonesty. Common

¹ Excerpted from the Tandon School of Engineering Student Code of Conduct

The image shows the logo of New York University (NYU). On the left is a purple square containing a white torch. To the right of the square is the text "NYU" in purple. A thin black vertical line is on the right edge of the image.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

examples of academically dishonest behavior include, but are not
limited to, the following:

a. Cheating: intentionally using or attempting to use unauthorized
notes, books, electronic media, or electronic communications in
an exam; talking with fellow students or looking at another
person's work during an exam; submitting work prepared in
advance for an in-class examination; having someone take an
exam for you or taking an exam for someone else; violating
other rules governing the administration of examinations.

b. Fabrication: including but not limited to, falsifying experimental
data and/or citations.

c. Plagiarism: intentionally or knowingly representing the words or
ideas of another as one's own in any academic exercise; failure
to attribute direct quotations, paraphrases, or borrowed facts or
information.

d. Unauthorized collaboration: working together on work that was
meant to be done individually.

e. Duplicating work: presenting for grading the same work for
more than one project or in more than one class, unless express
and prior permission has been received from the course
instructor(s) or research adviser involved.

f. Forgery: altering any academic document, including, but not
limited to, academic records, admissions materials, or medical
excuses.

In [68]:
task.html()

'The information contained on this page is designed to give students a representative example of material covered in the\ncourse. Any information related to course assignments, dates, or course materials is illustrative only.\n<div>\n  <img src="

In [55]:
chunks = task.output.chunks
for chunk in chunks:
    print(chunk)

chunk_id='56621853-7b1e-49c0-90c2-626ccd0c9f63' chunk_length=83 segments=[Segment(bbox=BoundingBox(left=83.88, top=14.9472, width=1095.5376, height=51.9984), content='The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.', page_height=1584.0, llm=None, html='The information contained on this page is designed to give students a representative example of material covered in the\ncourse. Any information related to course assignments, dates, or course materials is illustrative only.', image='https://storage.googleapis.com/chunkr-prod-bucket/f10106d6-c411-4427-b19b-8b6b808084f3/ccf7ad86-a263-4a20-bfc3-a28c4e447cea/images/d270e738-108f-4ff7-87ff-6a71cf857b2b.jpg?x-id=GetObject&response-content-disposition=inline&response-content-encoding=utf-8&response-content-type=image%2Fjpeg&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credenti
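
Printing whole chunk objects is hard to read because each segment carries long image URLs. A compact summary can be easier to scan. This is a sketch only: it relies on the `chunk_id`, `chunk_length`, `segments` and `content` attributes visible in the printed output above, `summarize_chunks` is a hypothetical helper, and the demo uses stand-in objects rather than a real task.

```python
from types import SimpleNamespace


def summarize_chunks(chunks, preview_chars=60):
    """Return one small dict per chunk: id, length, segment count, text preview."""
    rows = []
    for chunk in chunks:
        # Use the first segment's text as a preview; tolerate empty or missing content
        first_text = (chunk.segments[0].content or "") if chunk.segments else ""
        rows.append({
            "chunk_id": chunk.chunk_id,
            "chunk_length": chunk.chunk_length,
            "n_segments": len(chunk.segments),
            "preview": first_text[:preview_chars],
        })
    return rows


# Stand-in objects that mimic the attributes used above (not real Chunkr output)
demo_chunks = [
    SimpleNamespace(
        chunk_id="demo-1",
        chunk_length=3,
        segments=[SimpleNamespace(content="Course Syllabus")],
    )
]
for row in summarize_chunks(demo_chunks):
    print(row)
```

In the notebook the call would be `summarize_chunks(task.output.chunks)`.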

### Config Experiment 2 (ignore headers and footers, image summarization)

In [10]:
config = Configuration(
    high_resolution=True, # Use high resolution for all segments
    segmentation_strategy=SegmentationStrategy.LAYOUT_ANALYSIS,
    chunk_processing=ChunkProcessing(
        ignore_headers_and_footers=True,
        target_length=1024,
    ),
    segment_processing=SegmentProcessing(
        Caption=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Footnote=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        ListItem=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Page=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Picture=GenerationConfig(
            llm="summarize key information in the image",
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        SectionHeader=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Text=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        Title=GenerationConfig(
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        )
    )
)

In [11]:
task_2 = await chunkr.upload(os.path.join(pdf_dir, "syllabus.pdf"), config)

In [14]:
markdown_content = task_2.markdown()
for chunk in task_2.output.chunks:
    if hasattr(chunk, 'image_base64') and chunk.image_base64:
        image_tag = f"![Image](data:image/png;base64,{chunk.image_base64})"
        markdown_content = markdown_content.replace(f"[Image: {chunk.id}]", image_tag)
display(Markdown(markdown_content))

The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.

The image shows the logo for New York University (NYU).  It features a stylized white torch with flames on a square purple background to the left of the letters "NYU" in a bold, purple sans-serif font. A thin vertical purple line appears to the right of the letters, perhaps the edge of something out of frame.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus

Computer Science and Engineering
Principles of Database Systems

Course Information

Course Prerequisites

Graduate student status.

Course Description

This course broadly introduces database systems, including the relational data model, query languages, database design, index and file structures, query processing and optimization, concurrency and recovery, transaction management and database design. Students acquire hands-on experience in working with database systems and in building web-accessible database applications.

Course Objectives

This course will provide students with the opportunity to:

• Apply queries in relational algebra to retrieve data.

* Apply queries in SQL to create, read, update and delete data in a database.

- Apply the concepts of entity integrity constraint and referential integrity constraint (including definition of the concept of a foreign key).

* Describe the normal forms (1NF, 2NF, 3NF, BCNF, and 4NF) of a relation.

* Apply normalization to a relation to create a set of BCNF relations and denormalize a relational schema.

<img src="NYU_logo.svg.png" alt="NYU logo">

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

* Describe functional dependency between two or more attributes that are a subset of a relation.

- Understand multi-valued dependency and identify examples in relational schemas.

- Sketch conceptual data models (including ER) to describe a database structure.

* Apply SQL to create a relational database schema based on conceptual and relational models.

- Apply stored procedures, functions, and triggers using a commercial relational DBMS.

* Describe concurrency control and how it is affected by isolation levels in the database.

• Analyze Current Research in Database Systems.

Course Structure

This course is conducted entirely online, which means you do not have to be on campus to complete any portion of it. You will participate in the course using NYU Classes located at [https://newclasses.nyu.edu](https://newclasses.nyu.edu)
Your final grade will be computed as a combination of the components shown below.

• Quizzes: 30%

* Labs: 40%

* Project: 30%

Weekly Structure

Week 1: Introduction to the Relational Model

* Introduce class and overview of course topics.

Weeks 2-4: SQL Language

The image shows the New York University (NYU) logo. It consists of a purple square on the left side containing a white torch with flames, positioned vertically.  To the right of the square, the letters "NYU" are written in a bold, purple, sans-serif font. A thin vertical line separates the logo from another segment assumed to be from the original image, now cropped.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

* Introduction to SQL

* Intermediate SQL

* Advanced SQL

Week 5: Formal Relational Query Languages

* Relational Algebra

* Tuple Relational Calculus

* Domain Relational Calculus

Week 6: Database Design: The Entity-Relationship Approach

• ER Design

• Reduction to Relational Model

Week 7: Relational Database Design

* Functional Dependency

* Multivalued Dependency

* Normal Forms.

Week 8: Application Design

* Web Architectures

• Application Security

Week 9: Storage and File Structure

* Physical Storage

* Record Organization

Week 10: Indexing and Hashing

The image is a logo for New York University (NYU). It features a stylized white torch on a purple square to the left of the letters "NYU" in purple, also on a white background. A thin vertical purple line is visible on the far right edge of the image, seemingly cut off. The torch is depicted with a simple flame and a straight handle. The letters "NYU" are in a bold, sans-serif font. The overall impression is clean, modern, and academic.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

• Ordered Indices

• Hashed Indices

* Bitmap Indices

Weeks 11-12: Query Processing & Optimization

- Query Processing

* Query Optimization

Week 13: Transactions & Concurrency Control

* ACID Properties

* Transaction Management

Week 14: Recovery System & Database System Architectures

• Locks

* Deadlocks

* Snapshot Isolation

Week 15: Student Presentations

● Presentations and reviews

Learning Time Rubric

Please modify the below table to represent the breakdown of learning time in
each week of your course.

The image displays the New York University (NYU) logo. It consists of a purple square containing a white stylized torch on the left, and the letters "NYU" in purple on a white background to the right. The torch has a flame with three distinct upward points and a straight handle. The "NYU" acronym is written in a bold, sans-serif font. A thin black vertical line forms the rightmost boundary of the image, perhaps indicating a crop or edge.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

| Learning Time Element | Asynchronous* / Synchronous** | Time on Task for Students (weekly) | Notes |
|---|---|---|---|
| Reading Assignments / Recorded Lecture | Asynchronous | 2.5 hours | Video format. Expect quizzes throughout the module or weekly chapter readings |
| Weekly Discussion Board & Peer Review | Asynchronous | 1.5 hours | Students are expected to post responses to weekly topic questions. See Interaction Policy. |
| Assessment (Labs and Programming assignments) | As

<img src="NYU_logo.png" alt="NYU logo">

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

Course Communication

Interaction Policy

Please follow the interaction guidelines stated below for this course.

* I will be holding online virtual classroom sessions every week.
This virtual classroom will be held via NYU Classes on Thursdays
from 8am to 9am.

- The course will involve regular discussions via the Discussion Forums within NYU Classes and students are encouraged to participate.

- If you have a technical or course content related question, please send me an email. If I think that your question can benefit the class, I might post it on the discussion forum.

- If you have a question related to grading, please send an email to the TA and cc on the email thread. The TA will be responsible for examining your answers and providing a grade as per my guidelines.

* If any other questions need to be answered that are not addressed via email or the live classroom, I can hold virtual office hours on an appointment basis.

Announcements

Announcements will be posted on NYU Classes on a regular basis. You can locate all class announcements under the Announcements tab of our class. Be sure to check the class announcements regularly as they will contain important information about class assignments and other class matters.

Email

You are encouraged to post your questions about the course in the Forums discussions on NYU Classes. This is an open forum in which you and your classmates are encouraged to answer each other's questions. But, if you need to contact me directly, please email me. All

The image shows the logo for the NYU Tandon School of Engineering. It consists of two parts:

* **Left side:** A purple square containing a white stylized torch. The torch has a flame at the top and a straight handle extending down. 
* **Right side:** The text "NYU" in purple is placed to the left of a vertical black line. To the right of the line, the text "TANDON SCHOOL OF ENGINEERING"  in black is stacked in two lines. "TANDON SCHOOL" is above "OF ENGINEERING".

Course Syllabus - CGY 6083 Principles of Database System

homework, labs or programming assignments related questions must be researched first on own time, then posted on forums, then discussed with TAs during weekly reviews, and then can be forwarded to me. Typically, you can expect a response within 48 hours.

Readings

Avi Silberschatz, Henry F. Korth, S. Sudarshan, Database System Concepts, Sixth Edition, McGraw Hill

You can access NYU's central library here: [http://library.nyu.edu/](http://library.nyu.edu/)
You can access NYU Tandon's Bern Dibner Library here:
[http://library.poly.edu/](http://library.poly.edu/)

RECOMMENDED READINGS are online journal articles provided in each lecture. You can access NYU's central library here: [http://library.nyu.edu/](http://library.nyu.edu/)

You can access NYU Tandon's Bern Dibner Library here: http://library.poly.edu/

Assignments and Exams

Exams Administered and Proctored Online

Exams in this course are administered through NYU Classes. You are required to arrange an online proctor for your exams via ProctorU. More information on ProctorU and scheduling proctoring sessions can be found on [Tandon Online's website](Tandon Online's website).

<img src="NYU_logo.svg.png" alt="NYU Logo">

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CGY 6083 Principles of Database System

Exams Administered On Paper and Proctored Remotely

Exams in this course are administered via paper and pencil. If you are not able to attend an exam session on-campus, you are required to secure in-person proctoring arrangements near your location. Tandon Online's website.

University Policies

Moses Center Statement of Disability

Academic accommodations are available for students with disabilities. Please contact the Moses Center for Students with Disabilities (212-998-4980 or mosescsd@nyu.edu) for further information. Students who are requesting academic accommodations are advised to reach out to the Moses Center as early as possible in the semester for assistance.

NYU Tandon School of Engineering Policies and Procedures on Academic Misconduct¹

A. Introduction: The School of Engineering encourages academic excellence in an environment that promotes honesty, integrity, and fairness, and students at the School of Engineering are expected to exhibit those qualities in their academic work. It is through the process of submitting their own work and receiving honest feedback on that work that students may progress academically. Any act of academic dishonesty is seen as an attack upon the School and will not be tolerated. Furthermore, those who breach the School's rules on academic integrity will be sanctioned under this Policy. Students are responsible for familiarizing themselves with the School's Policy on Academic Misconduct.

B. Definition: Academic dishonesty may include misrepresentation, deception, dishonesty, or any act of falsification committed by a student to influence a grade or other academic evaluation. Academic dishonesty also includes intentionally damaging the academic work of others or assisting other students in acts of dishonesty. Common

1. Excerpted from the Tandon School of Engineering Student Code of Conduct.

The image shows the logo for New York University (NYU). It features the letters "NYU" in a bold, sans-serif, purple font.  To the left of the letters, separated by some blank space, is a purple square containing a stylized white torch. The torch has a flame with three upward-pointing sections, and a straight, elongated base. The entire logo appears against a white background. A thin vertical black line is present at the far right edge of the image, likely an artifact of the image capture and not part of the actual logo.

TANDON SCHOOL
OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

examples of academically dishonest behavior include, but are not limited to, the following:

a. Cheating: intentionally using or attempting to use unauthorized notes, books, electronic media, or electronic communications in an exam; talking with fellow students or looking at another person's work during an exam; submitting work prepared in advance for an in-class examination; having someone take an exam for you or taking an exam for someone else; violating other rules governing the administration of examinations.

b. Fabrication: including but not limited to, falsifying experimental data and/or citations.

c. Plagiarism: intentionally or knowingly representing the words or ideas of another as one's own in any academic exercise; failure to attribute direct quotations, paraphrases, or borrowed facts or information.

d. Unauthorized collaboration: working together on work that was meant to be done individually.

e. Duplicating work: presenting for grading the same work for
more than one project or in more than one class, unless express
and prior permission has been received from the course
instructor(s) or research adviser involved.

f. Forgery: altering any academic document, including, but not limited to, academic records, admissions materials, or medical excuses.

### Config Experiment 3
### (ignore headers and footers, image summarization, faster output with heuristics (AUTO))

In [17]:
config = Configuration(
    high_resolution=True, # Use high resolution for all segments
    segmentation_strategy=SegmentationStrategy.LAYOUT_ANALYSIS,
    chunk_processing=ChunkProcessing(
        ignore_headers_and_footers=True,
        target_length=1024,
    ),
    segment_processing=SegmentProcessing(
        Caption=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        ),
        Footnote=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        ),
        ListItem=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        ),
        Page=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        ),
        Picture=GenerationConfig(
            llm="summarize key information in the image",
            html=GenerationStrategy.LLM,
            markdown=GenerationStrategy.LLM,
        ),
        SectionHeader=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        ),
        Text=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        ),
        Title=GenerationConfig(
            html=GenerationStrategy.AUTO,
            markdown=GenerationStrategy.AUTO,
        )
    )
)

In [18]:
task_3 = await chunkr.upload("/Users/henryhu1607/Documents/Professional_Development/Projects/RAG_framework_experiments/pdf_collection/syllabus.pdf", config)

In [19]:
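markdown_content = task_3.markdown()
# Swap "[Image: <id>]" placeholders for inline base64 images when a chunk carries image data.
# Chunks without an `image_base64` attribute are skipped, so this loop may be a no-op.
for chunk in task_3.output.chunks:
    image_b64 = getattr(chunk, "image_base64", None)
    if image_b64:
        image_tag = f"![Image](data:image/png;base64,{image_b64})"
        # `chunk_id` matches the field name shown in the chunk repr printed further down.
        markdown_content = markdown_content.replace(f"[Image: {chunk.chunk_id}]", image_tag)
display(Markdown(markdown_content))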

The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.

The image is a logo for New York University (NYU). It features a stylized white torch on a purple square to the left of the letters "NYU" in a bold, purple, sans-serif font. A thin vertical purple line appears to the right of the letters, likely the edge of a larger design. The entire image is on a white background.

TANDON SCHOOL OF ENGINEERING

# Course Syllabus

Computer Science and Engineering Principles of Database Systems

## Course Information

## Course Prerequisites

Graduate student status.

## Course Description

This course broadly introduces database systems, including the relational data model, query languages, database design, index and file structures, query processing and optimization, concurrency and recovery, transaction management and database design. Students acquire hands-on experience in working with database systems and in building web-accessible database applications.

## Course Objectives

This course will provide students with the opportunity to:

. Apply queries in relational algebra to retrieve data.

· Apply queries in SQL to create, read, update and delete data in a database.

. Apply the concepts of entity integrity constraint and referential integrity constraint (including definition of the concept of a foreign key).

· Describe the normal forms (1NF, 2NF, 3NF, BCNF, and 4NF) of a relation.

. Apply normalization to a relation to create a set of BCNF relations and denormalize a relational schema.

NYU

TANDON SCHOOL OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

. Describe functional dependency between two or more attributes that are a subset of a relation.

· Understand multi-valued dependency and identify examples in relational schemas.

· Sketch conceptual data models (including ER) to describe a database structure.

. Apply SQL to create a relational database schema based on conceptual and relational models.

. Apply stored procedures, functions, and triggers using a commercial relational DBMS.

. Describe concurrency control and how it is affected by isolation levels in the database.

· Analyze Current Research in Database Systems.

## Course Structure

This course is conducted entirely online, which means you do not have to be on campus to complete any portion of it. You will participate in the course using NYU Classes located at https://newclasses.nyu.edu Your final grade will be computed as a combination of the components shown below.

· Quizzes: 30%

· Labs: 40%

· Project: 30%

## Weekly Structure

Week 1: Introduction to the Relational Model

. Introduce class and overview of course topics.

Weeks 2-4: SQL Language

The image shows the logo for New York University (NYU). It consists of a purple square on the left side containing a white torch. To the right of the square, the letters "NYU" are printed in a large, bold, purple font. A thin, vertical black line is visible at the extreme right edge of the image, likely an artifact from the image cropping process.

# TANDON SCHOOL OF ENGINEERING

# Course Syllabus - CS GY 6083 Principles of Database System

· Introduction to SQL

· Intermediate SQL

· Advanced SQL

Week 5: Formal Relational Query Languages

· Relational Algebra

· Tuple Relational Calculus

· Domain Relational Calculus

Week 6: Database Design: The Entity-Relationship Approach

· ER Design

· Reduction to Relational Model

Week 7: Relational Database Design

· Functional Dependency

· Multivalued Dependency

· Normal Forms.

Week 8: Application Design

· Web Architectures

· Application Security

## Week 9: Storage and File Structure

· Physical Storage

· Record Organization

Week 10: Indexing and Hashing

The image is the logo for New York University (NYU). It features the letters "NYU" in a bold, sans-serif, purple font. To the left of the letters is a purple square containing a white stylized torch. The torch has a flame with three distinct peaks at the top and a straight, slender handle. The overall design is simple, clean, and modern. A thin black vertical line is also visible on the far right of the image.

# TANDON SCHOOL OF ENGINEERING

# Course Syllabus - CS GY 6083 Principles of Database System

· Ordered Indices

· Hashed Indices

· Bitmap Indices

Weeks 11-12: Query Processing & Optimization

· Query Processing

· Query Optimization

## Week 13: Transactions & Concurrency Control

· ACID Properties

· Transaction Management

## Week 14: Recovery System & Database System Architectures

· Locks

· Deadlocks

· Snapshot Isolation

## Week 15: Student Presentations

· Presentations and reviews

## Learning Time Rubric

Please modify the below table to represent the breakdown of learning time in each week of your course.

The image shows the logo for New York University (NYU). It features a stylized white torch on a purple square background, placed to the left of the letters "NYU" also in purple. The letters and the square with the torch are separated by white space. A thin, vertical black line runs along the very right edge of the image, likely a cropping artifact or edge of a container where the logo is displayed.

# TANDON SCHOOL OF ENGINEERING

# Course Syllabus - CS GY 6083 Principles of Database System

| Learning Time Element | Asynchronous* / Synchronous** | Time on Task for Students (weekly) | Notes |
|---|---|---|---|
| Reading Assignments / Recorded Lecture | Asynchronous | 2.5 hours | Video format. Expect quizzes throughout the module or weekly chapter readings |
| Weekly Discussion Board & Peer Review | Asynchronous | 1.5 hours | Students are expected to post responses to weekly topic questions. See Interaction Policy. |
| Assessment (Labs and Programming assignments) | As

NYU

TANDON SCHOOL OF ENGINEERING

# Course Syllabus - CS GY 6083 Principles of Database System

# Course Communication

## Interaction Policy

Please follow the interaction guidelines stated below for this course.

. I will be holding online virtual classroom sessions every week. This virtual classroom will be held via NYU Classes on Thursdays from 8am to 9am.

. The course will involve regular discussions via the Discussion Forums within NYU Classes and students are encouraged to participate.

· If you have a technical or course content related question, please send me an email. If I think that your question can benefit the class, I might post it on the discussion forum.

· If you have a question related to grading, please send an email to the TA and cc on the email thread. The TA will be responsible for examining your answers and providing a grade as per my guidelines.

. If any other questions need to be answered that are not addressed via email or the live classroom, I can hold virtual office hours on an appointment basis.

## Announcements

Announcements will be posted on NYU Classes on a regular basis. You can locate all class announcements under the Announcements tab of our class. Be sure to check the class announcements regularly as they will contain important information about class assignments and other class matters.

## Email

You are encouraged to post your questions about the course in the Forums discussions on NYU Classes. This is an open forum in which you and your classmates are encouraged to answer each other's questions. But, if you need to contact me directly, please email me. All

The image shows the logo for the NYU Tandon School of Engineering. It consists of two parts:

* **Left part:** A purple square containing a white stylized torch. The torch has a flame with three distinct tips and a straight handle.
* **Right part,** Separated by a vertical black line: The text "NYU" in purple, followed by the text "TANDON SCHOOL OF ENGINEERING" in black. The text is in a sans-serif font, with "TANDON SCHOOL" above "OF ENGINEERING".

# Course Syllabus - CS GY 6083 Principles of Database System

homework, labs or programming assignments related questions must be researched first on own time, then posted on forums, then discussed with TAs during weekly reviews, and then can be forwarded to me. Typically, you can expect a response within 48 hours.

## Readings

Avi Silberschatz, Henry F. Korth,S. Sudarshan, Database System Concepts, Sixth Edition, McGraw Hill

You can access NYU's central library here: http://library.nyu.edu/ You can access NYU Tandon's Bern Dibner Library here: http://library.poly.edu/

RECOMMENDED READINGS are online journal articles provided in each lecture You can access NYU's central library here: http://library.nyu.edu/

You can access NYU Tandon's Bern Dibner Library here: http://library.poly.edu/

## Assignments and Exams

## Exams Administered and Proctored Online

Exams in this course are administered through NYU Classes. You are required to arrange an online proctor for your exams via ProctorU. More information on ProctorU and scheduling proctoring sessions can be found on Tandon Online's website.

NYU

TANDON SCHOOL OF ENGINEERING

Course Syllabus - CS GY 6083 Principles of Database System

Exams Administered On Paper and Proctored Remotely

Exams in this course are administered via paper and pencil. If you are not able to attend an exam session on-campus, you are required to secure in-person proctoring arrangements near your location. Tandon Online's website.

## University Policies

## Moses Center Statement of Disability

Academic accommodations are available for students with disabilities. Please contact the Moses Center for Students with Disabilities (212-998-4980 or mosescsd@nyu.edu) for further information. Students who are requesting academic accommodations are advised to reach out to the Moses Center as early as possible in the semester for assistance.

## NYU Tandon School of Engineering Policies and Procedures on Academic Misconduct1

A. Introduction: The School of Engineering encourages academic excellence in an environment that promotes honesty, integrity, and fairness, and students at the School of Engineering are expected to exhibit those qualities in their academic work. It is through the process of submitting their own work and receiving honest feedback on that work that students may progress academically. Any act of academic dishonesty is seen as an attack upon the School and will not be tolerated. Furthermore, those who breach the School's rules on academic integrity will be sanctioned under this Policy. Students are responsible for familiarizing themselves with the School's Policy on Academic Misconduct.

B. Definition: Academic dishonesty may include misrepresentation, deception, dishonesty, or any act of falsification committed by a student to influence a grade or other academic evaluation. Academic dishonesty also includes intentionally damaging the academic work of others or assisting other students in acts of dishonesty. Common

1 Excerpted from the Tandon School of Engineering Student Code of Conduct

The image displays the logo for New York University (NYU). It features the letters "NYU" in a bold, sans-serif, purple font. To the left of the letters, there's a separate purple square containing a stylized white torch with flames at the top and a short handle at the bottom. The torch and the "NYU" text are aligned horizontally.

# TANDON SCHOOL OF ENGINEERING

# Course Syllabus - CS GY 6083 Principles of Database System

examples of academically dishonest behavior include, but are not limited to, the following:

a. Cheating: intentionally using or attempting to use unauthorized notes, books, electronic media, or electronic communications in an exam; talking with fellow students or looking at another person's work during an exam; submitting work prepared in advance for an in-class examination; having someone take an exam for you or taking an exam for someone else; violating other rules governing the administration of examinations.

b. Fabrication: including but not limited to, falsifying experimental data and/or citations.

c. Plagiarism: intentionally or knowingly representing the words or ideas of another as one's own in any academic exercise; failure to attribute direct quotations, paraphrases, or borrowed facts or information.

d. Unauthorized collaboration: working together on work that was meant to be done individually.

e. Duplicating work: presenting for grading the same work for more than one project or in more than one class, unless express and prior permission has been received from the course instructor(s) or research adviser involved.

f. Forgery: altering any academic document, including, but not limited to, academic records, admissions materials, or medical excuses.
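
The rendered markdown above still shows lines such as "TANDON SCHOOL OF ENGINEERING" more than once. A minimal sketch for counting such repeats is below; `repeated_lines` and the `sample` string are hypothetical helpers written for illustration, not part of `chunkr_ai`, and the normalization (dropping leading `#` markers) is an assumption about what should count as "the same line".

```python
from collections import Counter


def repeated_lines(markdown_text, min_count=2):
    """Return {line: count} for non-empty lines occurring at least min_count times."""
    # Drop leading '#' markers and surrounding whitespace so that
    # '# TITLE' and 'TITLE' are counted as the same line.
    normalized = (line.lstrip("#").strip() for line in markdown_text.splitlines())
    counts = Counter(line for line in normalized if line)
    return {line: n for line, n in counts.items() if n >= min_count}


# Small hand-written sample standing in for the rendered markdown.
sample = "\n".join([
    "NYU",
    "TANDON SCHOOL OF ENGINEERING",
    "## Course Structure",
    "# TANDON SCHOOL OF ENGINEERING",
    "· Quizzes: 30%",
])
print(repeated_lines(sample))
```

In the notebook, calling `repeated_lines(markdown_content)` would apply the same count to the actual output of this experiment.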

In [21]:
chunks = task_3.output.chunks
print(len(chunks))
for chunk in chunks:
    print(chunk)


24
chunk_id='530e05e9-dace-4f90-9cd4-d532c4df049f' chunk_length=97 segments=[Segment(bbox=BoundingBox(left=83.88, top=14.9472, width=1095.5376, height=51.9984), content='The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.', page_height=1584.0, llm=None, html='<p>The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.</p>', image=None, markdown='The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.', ocr=[OCRResult(bbox=BoundingBox(left=0.014404297, top=0.84959984, width=34.8768, height
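
Printing each chunk dumps the full repr, including OCR bounding boxes, which is hard to scan. A rough sketch of a more compact summary follows. It relies only on the attributes visible in the repr above (`chunk_length`, `segments`, `content`); `summarize_chunks` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins, not real `chunkr_ai` models.

```python
from types import SimpleNamespace


def summarize_chunks(chunks, preview_chars=60):
    """Return one short line per chunk: index, chunk_length, segment count, text preview."""
    lines = []
    for i, chunk in enumerate(chunks):
        # Preview the first segment's text, or an empty string for a chunk with no segments.
        first = chunk.segments[0].content if chunk.segments else ""
        preview = first[:preview_chars].replace("\n", " ")
        lines.append(
            f"{i:>2} | length={chunk.chunk_length} | segments={len(chunk.segments)} | {preview}"
        )
    return lines


# Stand-in objects that mimic only the attributes used above.
demo = [
    SimpleNamespace(chunk_length=3, segments=[SimpleNamespace(content="Graduate student status.")]),
    SimpleNamespace(chunk_length=0, segments=[]),
]
for line in summarize_chunks(demo):
    print(line)
```

With the real task, `summarize_chunks(task_3.output.chunks)` should give one line per chunk, assuming every segment's `content` is a string.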

In [25]:
for chunk in chunks:
    for segment in chunk.segments:
        print(segment.content)
    print("-"*100)

The information contained on this page is designed to give students a representative example of material covered in the course. Any information related to course assignments, dates, or course materials is illustrative only.

TANDON SCHOOL OF ENGINEERING
----------------------------------------------------------------------------------------------------
Course Syllabus
Computer Science and Engineering Principles of Database Systems
----------------------------------------------------------------------------------------------------
Course Information
Course Prerequisites
Graduate student status.
----------------------------------------------------------------------------------------------------
Course Description
This course broadly introduces database systems, including the relational data model, query languages, database design, index and file structures, query processing and optimization, concurrency and recovery, transaction management and database design. Students acquire hands-on e