## Dataset Validation and Quality Checks

This notebook validates the correctness, structure, and retrieval readiness of the academic advisor dataset. The tests are meant to ensure that:

- The dataset contains meaningful academic and policy information.
- Program, university, and category metadata are correctly assigned.
- The data supports real academic advising questions such as GPA rules, graduation requirements, registration, and probation.
- Records from more than one university can be retrieved with the same kinds of queries.

Each test below focuses on a specific academic use case and verifies that relevant records can be retrieved successfully.

In [19]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 1: Dataset Loading and Initial Inspection

This step loads the final processed academic advisor dataset from the JSON file and performs a basic inspection to ensure that the data is readable, properly structured, and ready for validation. The goal is to confirm that no corruption or formatting issues exist before running deeper tests.

In [20]:
import json

with open("/content/drive/MyDrive/DAB_RAG_ZakyProject/data/processed/academic_advisor_rag_dataset.json", "r", encoding="utf-8") as f:
    data = json.load(f)

print("JSON loaded successfully")
print("Total records:", len(data))
print("First record keys:", data[0].keys())

JSON loaded successfully
Total records: 5051
First record keys: dict_keys(['id', 'source_file', 'university', 'catalog_label', 'section', 'section_chunk_index', 'category', 'program', 'college', 'degree', 'level', 'text'])
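
The inspection above only looks at the first record. A minimal sketch of a whole-dataset schema check follows, assuming every record should carry the twelve keys printed above and a non-empty `text` string. The helper name `find_schema_problems` and the two-record sample are hypothetical; in the notebook, `data` would be passed instead.

```python
# Field names copied from the "First record keys" output above.
EXPECTED_KEYS = {
    "id", "source_file", "university", "catalog_label", "section",
    "section_chunk_index", "category", "program", "college",
    "degree", "level", "text",
}


def find_schema_problems(records):
    """Return (index, message) pairs for records that break the assumed schema."""
    problems = []
    for i, row in enumerate(records):
        missing = EXPECTED_KEYS - set(row)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        text = row.get("text")
        if not isinstance(text, str) or not text.strip():
            problems.append((i, "empty or non-string text"))
    return problems


# Hypothetical sample: one well-formed record and one broken record.
sample = [
    {key: "x" for key in EXPECTED_KEYS},
    {"id": "2", "text": "   "},
]
problems = find_schema_problems(sample)
for index, message in problems:
    print(index, message)
```

An empty list from `find_schema_problems(data)` would suggest that all records share the expected structure.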


## Step 2: Category Distribution Analysis

This test analyzes the distribution of records across different academic and policy categories (such as course descriptions, graduation requirements, admissions, financial information, and academic policies). The goal is to verify that the dataset covers a wide range of advising-related topics and is not biased toward a single category.

In [21]:
from collections import Counter

category_counts = Counter([row["category"] for row in data])

for cat, count in category_counts.most_common():
    print(f"{cat:30s} -> {count}")

general_academic               -> 2784
course_description             -> 1165
graduation_requirements        -> 256
program_learning_outcomes      -> 238
financial_fees                 -> 228
admissions_general             -> 136
program_overview               -> 127
esl_language                   -> 38
curriculum_structure           -> 25
conduct_policy                 -> 18
sap_probation_policy           -> 15
academic_calendar              -> 14
grievance_policy               -> 6
admissions_program_specific    -> 1
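
Raw counts make it hard to judge how balanced the categories are. The sketch below rebuilds the counter from the output above and prints each category's share of the total; `general_academic` alone accounts for roughly 55% of the records. The `MIN_RECORDS` threshold of 20 is an arbitrary, hypothetical cut-off for flagging thinly covered categories.

```python
from collections import Counter

# Counts copied from the Step 2 output above.
category_counts = Counter({
    "general_academic": 2784,
    "course_description": 1165,
    "graduation_requirements": 256,
    "program_learning_outcomes": 238,
    "financial_fees": 228,
    "admissions_general": 136,
    "program_overview": 127,
    "esl_language": 38,
    "curriculum_structure": 25,
    "conduct_policy": 18,
    "sap_probation_policy": 15,
    "academic_calendar": 14,
    "grievance_policy": 6,
    "admissions_program_specific": 1,
})

total = sum(category_counts.values())
for cat, count in category_counts.most_common():
    print(f"{cat:30s} -> {count:5d} ({count / total:6.1%})")

# Hypothetical threshold: categories below it may be too sparse to answer
# advising questions reliably.
MIN_RECORDS = 20
sparse = sorted(c for c, n in category_counts.items() if n < MIN_RECORDS)
print("Sparse categories:", sparse)
```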


## Step 3: Program Coverage Validation

This test extracts all unique program names from the dataset to verify that multiple academic disciplines are represented. The purpose is to ensure that the dataset is not limited to only one or two majors and that it supports advising across a wide variety of programs such as Accounting, Computer Science, Biology, Engineering, and others.

In [22]:
programs = sorted(set(row["program"] for row in data if row["program"]))
print("Programs found:")
for p in programs:
    print("-", p)

Programs found:
- Accounting
- Africainterdisciplinarystudies
- Africanastudies
- Americanindianstudies
- Anthropology
- Artanddesign
- Asianamericanstudies
- Asianstudies
- Assistivetechnologyengineering
- Assistivetechnologystudiesandhumanservices
- Biology
- Businessadministrationgraduatelevel
- Businesshonors
- Businesslaw
- Californiastudies
- Centralamericanandtransborderstudies
- Chemistryandbiochemistry
- Chicanaandchicanostudies
- Childandadolescentdevelopment
- Cinemaandtelevisionarts
- Civicandcommunityengagement
- Civilengineeringandconstructionmanagement
- Collegeofengineeringandcomputerscience
- Collegeofhealthandhumandevelopment
- Collegeofhumanities
- Collegeofscienceandmathematics
- Collegeofsocialandbehavioralsciences
- Collegesdepartmentsandprograms
- Communicationdisordersandsciences
- Communicationstudies
- Computerscience
- Coursepolicies
- Credentialoffice
- Criminologyandjusticestudies
- Csun
- Davidnazariancollegeofbusinessandeconomics
- Deafstudies
- Disabilit
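
The program names above appear to be stored with spaces and punctuation removed and only the first letter capitalized (for example `Computerscience`). Assuming that convention holds for the whole dataset, a small helper can turn a human-readable program name into the stored form before filtering. The function name `to_program_slug` is hypothetical, and the normalization rule is inferred from the listing rather than taken from the preprocessing code.

```python
def to_program_slug(name):
    """Collapse a readable program name into the style seen in the listing.

    Assumed rule: keep letters and digits only, then capitalize the first
    character and lowercase the rest.
    """
    letters = "".join(ch for ch in name if ch.isalnum())
    return letters.capitalize()


print(to_program_slug("Computer Science"))
print(to_program_slug("Chicana and Chicano Studies"))
```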

## Step 4: Academic Level Distribution Analysis

This test counts the number of undergraduate and graduate records in the dataset. The purpose is to determine the primary academic focus of the current dataset and to confirm whether the system is mainly supporting undergraduate advising, graduate advising, or a combination of both.

In [23]:
levels = Counter([row["level"] for row in data if row["level"]])
print(levels)

Counter({'undergraduate': 2460, 'graduate': 64})
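
The counter above skips records whose `level` is empty, so it does not show how many records lack a level. The sketch below adds that view. `field_coverage` is a hypothetical helper demonstrated on a four-record sample, followed by the same arithmetic applied to the counts printed in Steps 1 and 4.

```python
from collections import Counter


def field_coverage(records, field):
    """Return (filled, missing) counts for one metadata field."""
    filled = sum(1 for row in records if row.get(field))
    return filled, len(records) - filled


# Hypothetical sample; in the notebook, pass `data` instead.
sample = [
    {"level": "undergraduate"},
    {"level": "graduate"},
    {"level": None},
    {"level": ""},
]
print(field_coverage(sample, "level"))

# Arithmetic on the numbers printed above: 5051 records in total,
# 2460 undergraduate and 64 graduate.
levels = Counter({"undergraduate": 2460, "graduate": 64})
total_records = 5051
with_level = sum(levels.values())
without_level = total_records - with_level
print("With level:", with_level, "Without level:", without_level)
```

By this arithmetic, 2,527 of the 5,051 records carry no level value, so about half of the dataset cannot be filtered by academic level.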


## Step 5: GPA Policy Retrieval Test

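This test searches the dataset for GPA-related records to verify that the system can retrieve academic standing rules such as GPA requirements, probation, dismissal, and progression policies. The filter is a case-insensitive substring match on "gpa" or "probation", so dismissal and progression policies are found only when they appear in records that also contain one of those two terms. This simulates real advising questions such as: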
"What is the minimum GPA requirement?" and
"What happens if a student’s GPA drops below the required level?"

In [24]:
results = [
    r for r in data
    if "gpa" in r["text"].lower()
    or "probation" in r["text"].lower()
]

print("Found", len(results), "GPA-related records\n")

for r in results[:3]:
    print("-" * 80)
    print("Category:", r["category"])
    print("Program:", r["program"])
    print(r["text"][:2000])

Found 369 GPA-related records

--------------------------------------------------------------------------------
Category: admissions_general
Program: Accounting
Identify and analyze problems and devise appropriate solutions using qualitative and quantitative techniques.
Identify ethical dilemmas, analyze them from multiple perspectives, develop solutions and support their decisions.
Recognize and evaluate the role of diversity, inclusion and multiculturalism in the global business environment.
Demonstrate proficiency in the functional areas of business, as well as the ability to synthesize and apply this knowledge across disciplines.
Requirements Business Majors
A Business major is any student majoring in Accountancy; Information Systems; or Business Administration with an option in either Business Analytics, Business Law, Financial Analysis, Financial Planning, Global Supply Chain Management, Management, Marketing, Real Estate, Risk Management and Insurance, or Systems and Operations 
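
The retrieval tests in this notebook all follow the same pattern: a case-insensitive keyword match on `text`, optionally combined with metadata conditions. A minimal sketch of a reusable helper is shown below. The name `find_records`, its parameters, and the placeholder sample records are assumptions made for illustration and are not part of the original notebook.

```python
def find_records(records, keywords, university_prefix=None, category=None):
    """Return records whose text contains any keyword (case-insensitive).

    Optional filters narrow the match by university name prefix and by
    exact category.
    """
    wanted = [k.lower() for k in keywords]
    matches = []
    for row in records:
        text = (row.get("text") or "").lower()
        if not any(k in text for k in wanted):
            continue
        university = (row.get("university") or "").lower()
        if university_prefix and not university.startswith(university_prefix.lower()):
            continue
        if category and row.get("category") != category:
            continue
        matches.append(row)
    return matches


# Hypothetical placeholder records; in the notebook, pass `data` instead.
sample = [
    {"university": "University of the People", "category": "general_academic",
     "text": "Placeholder text that mentions probation."},
    {"university": "California State University, Northridge",
     "category": "admissions_general",
     "text": "Placeholder text that mentions GPA."},
    {"university": "California State University, Northridge",
     "category": "course_description",
     "text": "Placeholder text with no matching term."},
]

all_hits = find_records(sample, ["gpa", "probation"])
uopeople_hits = find_records(sample, ["gpa", "probation"],
                             university_prefix="university of the people")
print(len(all_hits), len(uopeople_hits))
```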

## Step 6: Course Withdrawal Policy Retrieval Test

This test retrieves records related to course withdrawal to verify that the dataset contains policies regarding administrative withdrawals, course drops, enrollment conditions, and student responsibilities. This simulates questions such as:
"Can I withdraw from a course?" and
"What happens if I withdraw late?"

In [25]:
results = [
    r for r in data
    if "withdraw" in r["text"].lower()
]

print("Found", len(results), "withdrawal-related records\n")

for r in results[:3]:
    print("-" * 80)
    print("Category:", r["category"])
    print(r["text"][:2000])

Found 56 withdrawal-related records

--------------------------------------------------------------------------------
Category: general_academic
A component course is a graded lecture class that has a required, non-graded, 0-unit lab or discussion. To enroll in component classes, students enter the class number of the lab or discussion and the system will automatically enroll them in the lecture class.
Preparatory
A course/condition* that is recommended to be completed/met prior to enrollment in another course. Enrollment in preparatory course/condition* groupings is not enforced by SOLAR.
*Examples of prerequisite &#8220;conditions&#8221; include class level, a specific examination score, a specified passing grade, etc.
Enrolling in Courses with Prerequisites
Students must fulfill the prerequisite(s) for a course prior to enrollment in the course. For further information, see Course Requisites in this Catalog or Registration FAQ: Course Requisites . Graduate students are not held to p
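
Because results are printed in dataset order, the first record shown may mention the keyword only in passing. One possible refinement, sketched here as an assumption rather than as part of the original test, is to order matches by how often the keyword occurs. `rank_by_keyword_hits` and the sample records are hypothetical.

```python
def rank_by_keyword_hits(records, keyword):
    """Return (hits, record) pairs, most keyword occurrences first.

    Records without the keyword are dropped. Ties keep dataset order
    because Python's sort is stable.
    """
    needle = keyword.lower()
    scored = []
    for row in records:
        hits = (row.get("text") or "").lower().count(needle)
        if hits:
            scored.append((hits, row))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical placeholder records; in the notebook, pass `data` instead.
sample = [
    {"category": "general_academic",
     "text": "Placeholder text that mentions withdraw once."},
    {"category": "general_academic",
     "text": "Withdrawal rules: a late withdrawal needs approval to withdraw."},
    {"category": "course_description",
     "text": "Placeholder text with no match."},
]

ranked = rank_by_keyword_hits(sample, "withdraw")
for hits, row in ranked:
    print(hits, row["text"][:60])
```

Occurrence counting is a crude relevance signal; it favors long records and says nothing about whether the passage states a policy.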

## Step 7: Computer Science Graduation Requirements at CSUN

This test retrieves Computer Science graduation-related records specifically for California State University, Northridge (CSUN). The goal is to verify that the dataset correctly supports detailed degree structure queries such as total required units, general education requirements, electives, and graduate program requirements.

In [26]:
results = [
    r for r in data
    if r["program"] in ["Computerscience", "Computer Science"]  # handles both cases
    and r["university"].startswith("California")
    and r["category"] == "graduation_requirements"
]

print("Found", len(results), "CSUN Computer Science graduation records\n")

for r in results:
    print("-" * 80)
    print(r["text"][:2000])

Found 6 CSUN Computer Science graduation records

--------------------------------------------------------------------------------
Additional Units: 0-3
Total Units Required for the B.S. Degree: 120-124
Computer Science, B.S.
Overview
The B.S. degree in Computer Science provides a broad knowledge of computing and is designed for students who desire: (a) to pursue graduate work in computer science and (b) to work on the development and support of software projects in a diverse range of specialized areas. The Computer Science degree consists of a set of core courses and a 15-unit senior electives package. The core of the program covers programming languages, computer system organization, operating systems, data structures, software engineering, computation theory and societal implications in computing. The senior electives package allows students to specialize in such fields as artificial intelligence, embedded applications, networking, gaming, graphics, software engineering and security
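
The record above contains the line "Total Units Required for the B.S. Degree: 120-124". To check such values programmatically rather than by eye, the unit range can be pulled out with a regular expression. This is a sketch that assumes other degree records use the same phrasing, which has not been verified here; `extract_total_units` is a hypothetical helper.

```python
import re

# Pattern modeled on the single line visible in the output above.
UNITS_PATTERN = re.compile(
    r"Total Units Required for the (?P<degree>[^:]+?) Degree:\s*"
    r"(?P<low>\d+)(?:-(?P<high>\d+))?"
)


def extract_total_units(text):
    """Return (degree, low, high) or None when the line is absent."""
    match = UNITS_PATTERN.search(text)
    if match is None:
        return None
    low = int(match.group("low"))
    # A single number such as "120" is treated as the range 120-120.
    high = int(match.group("high") or low)
    return match.group("degree"), low, high


# First lines of the record printed above.
text = (
    "Additional Units: 0-3\n"
    "Total Units Required for the B.S. Degree: 120-124\n"
    "Computer Science, B.S."
)
units = extract_total_units(text)
print(units)
```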

## Step 8: University of the People Computer Science Content Retrieval

This test searches for Computer Science-related content within the University of the People records. Since the UoPeople source document is policy-oriented rather than program-structured, this test verifies that Computer Science content can still be retrieved by matching the word "computer" in the record text, even when explicit program metadata is missing. The match is a plain substring check, not a semantic search.

In [27]:
results = [
    r for r in data
    if r["university"].lower().startswith("university of the people")
    and "computer" in r["text"].lower()
]

print("Found", len(results), "UoPeople Computer Science related records\n")

for r in results[:5]:
    print("-" * 80)
    print("Category:", r["category"])
    print("Program:", r["program"])
    print(r["text"][:2000])

Found 89 UoPeople Computer Science related records

--------------------------------------------------------------------------------
Category: general_academic
Program: None
4 Dr. Barbara Kahn, The Wharton School
Mr. Aref Lahham, Orion Capital Managers
Mr. Ken Marlin, Marlin & Associates
Mr. Daniel Weinberg Kenetic
Dr. Russell S. Winer, New York University
Computer Science
Dr. Alexander Tuzhilin, New York University, Chair
Dr. Vijay Atluri, Rutgers University
Prof. Justine Cassell, Carnegie Mellon University
Dr. Shay David, Retrain.ai
Dr. Shawndra Hill, Facebook
Dr. H.V. Jagadish, University of Michigan
Dr. Vincent Oria, New Jersey Institute of Technology
Dr. Avi Silberschatz, Yale University
Dr. Albert Wenger, Union Square Ventures
Ms. Gabriele Zedlmayer, Hypo Vereinsbank UniCredit
Health Science
Dr. Dalton Conley, Princeton University, Chair
Mr. Stanley Bergman, Henry Schein
Dr. Mark R. Cullen, Stanford University School of Medicine
Professor Patricia M. Davidson, University of Wollo
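
The output above shows `Program: None`, so a filter on `program` alone would miss this record. One way to handle both cases is to trust the program metadata when it exists and fall back to a text match only when it is missing. The sketch below illustrates that idea; `matches_program` and the sample records are hypothetical, and the fallback policy is a design assumption rather than the notebook's method.

```python
def matches_program(row, program_slug, phrase):
    """Match on program metadata when present, otherwise on the text."""
    program = row.get("program")
    if program:
        return program == program_slug
    return phrase.lower() in (row.get("text") or "").lower()


# Hypothetical placeholder records; in the notebook, pass `data` instead.
sample = [
    {"id": "a", "program": "Computerscience",
     "text": "Placeholder degree overview."},
    {"id": "b", "program": None,
     "text": "Placeholder text that mentions Computer Science."},
    {"id": "c", "program": "Accounting",
     "text": "Placeholder text that mentions computer science."},
    {"id": "d", "program": None,
     "text": "Placeholder text about something else."},
]

matched_ids = [
    row["id"] for row in sample
    if matches_program(row, "Computerscience", "computer science")
]
print(matched_ids)
```

Record `c` is excluded because its metadata names a different program, even though its text mentions the phrase.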

## Step 9: UoPeople GPA and Academic Standing Policy Validation

This test retrieves University of the People records that mention GPA or probation, which is where dismissal, scholarship, and registration restriction policies are expected to appear. The goal is to verify that the dataset supports real policy advising questions such as:
"What happens if my CGPA drops below 2.0?" and
"How many courses can I take while on probation?"

In [28]:
results = [
    r for r in data
    if r["university"].lower().startswith("university of the people")
    and (
        "gpa" in r["text"].lower()
        or "probation" in r["text"].lower()
    )
]

print("Found", len(results), "UoPeople GPA / probation records\n")

for r in results[:5]:
    print("-" * 80)
    print(r["text"][:2000])

Found 22 UoPeople GPA / probation records

--------------------------------------------------------------------------------
31
All petitions should be sent to the student’s Program Advisor, who will forward it to the Office of
Student Services at student.services@uopeople.edu and will then be directed to the appropriate
Department Chair who will decide if the petition is valid and has merit. If so, he/she will forward it to
the Student Affairs Committee. Once the appeal is submitted, students will receive a confirmation
email within one week from the Office of Student Services and a final decision about the appeal within
six wee ks of the submission of their petition. Decisions rendered by the Committee are final and
binding. If the petition is granted, the Office of Student Services will process the appropriate action.
Course Repeats
Students whose CGPA is not high enough to graduate may request an academic waiver in order to
repeat a course. The request must be made in accordance wit
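
Printing the first 2,000 characters of each record can bury the sentence that actually matched. A keyword-in-context snippet makes manual review faster. The helper below is a minimal sketch; `keyword_snippet` and its `width` parameter are hypothetical, and the example sentence is taken from the record printed above with its line break joined.

```python
def keyword_snippet(text, keyword, width=80):
    """Return up to `width` characters on each side of the first match.

    Assumes lowercasing does not change the string length, which holds
    for ASCII text.
    """
    position = text.lower().find(keyword.lower())
    if position == -1:
        return None
    start = max(0, position - width)
    end = min(len(text), position + len(keyword) + width)
    return text[start:end]


text = (
    "Students whose CGPA is not high enough to graduate may request "
    "an academic waiver in order to repeat a course."
)
snippet = keyword_snippet(text, "cgpa", width=10)
print(repr(snippet))
```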

## Step 10: Accounting Program Overview and Academic Support at CSUN

This test retrieves general academic and program overview information related to the Accounting major at California State University, Northridge (CSUN). The filter matches on program and category only and does not restrict results by university. The goal is to verify that the dataset includes high-level descriptive information such as:

- Department structure and contact information  
- Degree program descriptions  
- Academic advising services  
- Career support resources  

This simulates real student questions such as:
- "What is the Accounting program about?"
- "Where can I get advising for Accounting at CSUN?"
- "What career services are available for Accounting students?"

In [29]:
results = [
    r for r in data
    if r["program"] == "Accounting"
    and r["category"] in ["program_overview", "general_academic"]
]

for r in results[:3]:
    print("-" * 80)
    print("Category:", r["category"])
    print(r["text"][:2000])

--------------------------------------------------------------------------------
Category: general_academic
Accounting
Accounting
David Nazarian College of Business and Economics
Department of Accounting
Chair: Rishma Vedd
Bookstein Hall (BB) 3123
(818) 677-2461
Master of Professional Accountancy
Director: Rafael Efrat
Bookstein Hall (BB) 3123
(818) 677-2461
Master of Science in Taxation
Bookstein Chair in Taxation: Rafael Efrat
Bookstein Hall (BB) 3123
(818) 677-5488
EY Center for Careers in Accounting
Director: Gladys Polio
Bookstein Hall (BB) 2224
(818) 677-2979
Faculty
Katie Boylen, Keji Chen, Manuela Dantas, Michael E. Doron, Kiren Dosanjh-Zucker, Rafi Efrat, Monica Gianni, Young-Won Her, Yuan Yuan Lu, Joon Seok Moon, Rishma Vedd, Dongyi Wang, Sung Wook Yoon, Jun Zhan
Emeritus Faculty
Dhia D. Alhashim, Shahid L. Ansari, Robert L. Barker, Janice Bell, James C. Bennett, Dwight Call, Raymond S. Chen, James S. Chiu, Donna Driscoll, Glen L. Gray, Catherine Jeppson, Robert J. Kiddoo, Da
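
Since the filter above has no university condition, it is worth confirming which universities the matched records come from. The sketch below counts results by `university`; `universities_in` is a hypothetical helper, and the university strings in the sample are assumed values, not copied from the dataset.

```python
from collections import Counter


def universities_in(records):
    """Count records per university, labeling empty values explicitly."""
    return Counter(row.get("university") or "(missing)" for row in records)


# Hypothetical sample; in the notebook, pass `results` instead.
sample = [
    {"university": "California State University, Northridge"},
    {"university": "California State University, Northridge"},
    {"university": None},
]

counts = universities_in(sample)
for name, count in counts.most_common():
    print(f"{name} -> {count}")
```

If more than one university appears in the counts for the real results, adding a university condition to the filter would keep the test specific to CSUN.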

## Step 11: Accounting Graduation Requirements at CSUN

This test retrieves graduation-related records for the Accounting program at California State University, Northridge (CSUN). The objective is to verify that the dataset contains detailed degree completion rules, including:

- General Education requirements  
- Campus requirements  
- Total required units  
- Major-specific course and specialization requirements  
- Culminating experience rules  

This simulates real academic advising questions such as:
- "What are the graduation requirements for Accounting at CSUN?"
- "How many units are required to complete the Accounting degree?"
- "What are the specialization and capstone requirements?"

In [30]:
results = [
    r for r in data
    if r["program"] == "Accounting"
    and r["university"].startswith("California")
    and r["category"] == "graduation_requirements"
]

for r in results:
    print("-" * 80)
    print(r["text"][:2000])

--------------------------------------------------------------------------------
Undergraduate students must complete 43 units of General Education as described in this Catalog, including 3 units of coursework meeting the Ethnic Studies (E.S.) requirement.
12 units are satisfied by the following courses in the major: MATH 103 satisfies Basic Skills Area 2 Mathematical Concepts and Quantitative Reasoning; FIN 303 satisfies Area 2 Mathematical Concepts and Quantitative Reasoning, Upper Division; and ECON 160 and ECON 161 satisfy Area 4 Social and Behavioral Sciences. In addition, IS 212 fulfills the Information Competence requirement.
The following courses are strongly recommended to satisfy the upper division GE requirement:
RS 361 Contemporary Ethical Issues (3) (satisfies 3 units of Area 3A Humanities)
COMS 356 Intercultural Communication (3) (satisfies 3 units of F Comparative Cultural Studies)
JS 318 Applied Jewish Ethics (3)
or PHIL 305 Business Ethics and Public Policy (3)
or BLAW
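
The cell above prints matching records but does not report a count, so an empty result would pass silently. A validation notebook can make each test fail loudly instead. The sketch below shows one way to do that; `require_results` is a hypothetical helper and the minimum of one record is an assumed threshold.

```python
def require_results(results, label, minimum=1):
    """Raise AssertionError when a retrieval test returns too few records."""
    if len(results) < minimum:
        raise AssertionError(
            f"{label}: expected at least {minimum} record(s), found {len(results)}"
        )
    print(f"{label}: OK ({len(results)} record(s))")
    return results


# Usage with a placeholder list standing in for `results`.
checked = require_results(["placeholder record"], "Accounting graduation requirements")

# An empty result raises instead of printing nothing.
try:
    require_results([], "Accounting graduation requirements")
    failed = False
except AssertionError as error:
    failed = True
    print(error)
```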

## Step 12: Computer Science Content at University of the People

This test repeats the Step 8 search with the narrower phrase "computer science" instead of "computer". The goal is to confirm that the dataset contains enough information to answer general questions about the Computer Science program at UoPeople, such as what it is about or where it appears in the catalog.

In [31]:
results = [
    r for r in data
    if r["university"].lower().startswith("university of the people")
    and "computer science" in r["text"].lower()
]

print("Found", len(results), "UoPeople Computer Science related records\n")

for r in results[:5]:
    print("-" * 80)
    print("Category:", r["category"])
    print("Program:", r["program"])
    print(r["text"][:2000])

Found 71 UoPeople Computer Science related records

--------------------------------------------------------------------------------
Category: general_academic
Program: None
4 Dr. Barbara Kahn, The Wharton School
Mr. Aref Lahham, Orion Capital Managers
Mr. Ken Marlin, Marlin & Associates
Mr. Daniel Weinberg Kenetic
Dr. Russell S. Winer, New York University
Computer Science
Dr. Alexander Tuzhilin, New York University, Chair
Dr. Vijay Atluri, Rutgers University
Prof. Justine Cassell, Carnegie Mellon University
Dr. Shay David, Retrain.ai
Dr. Shawndra Hill, Facebook
Dr. H.V. Jagadish, University of Michigan
Dr. Vincent Oria, New Jersey Institute of Technology
Dr. Avi Silberschatz, Yale University
Dr. Albert Wenger, Union Square Ventures
Ms. Gabriele Zedlmayer, Hypo Vereinsbank UniCredit
Health Science
Dr. Dalton Conley, Princeton University, Chair
Mr. Stanley Bergman, Henry Schein
Dr. Mark R. Cullen, Stanford University School of Medicine
Professor Patricia M. Davidson, University of Wollo
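
The broader "computer" search in Step 8 found 89 records and the narrower "computer science" search found 71. Every record containing "computer science" also contains "computer", so 18 records match only the broad term. The sketch below shows how those records could be isolated by comparing `id` sets; `ids_matching` and the sample records are hypothetical.

```python
def ids_matching(records, phrase):
    """Return the ids of records whose text contains the phrase."""
    needle = phrase.lower()
    return {
        row["id"] for row in records
        if needle in (row.get("text") or "").lower()
    }


# Hypothetical placeholder records; in the notebook, pass the UoPeople
# subset of `data` instead.
sample = [
    {"id": "r1", "text": "Placeholder text about Computer Science."},
    {"id": "r2", "text": "Placeholder text about computer requirements."},
    {"id": "r3", "text": "Placeholder text about something else."},
]

broad = ids_matching(sample, "computer")
narrow = ids_matching(sample, "computer science")
only_broad = broad - narrow
print(sorted(only_broad))

# Difference between the two counts printed in the notebook outputs.
print(89 - 71)
```

Reviewing the broad-only records would show whether they are useful (for example, technology requirements) or noise for program-level questions.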