# Initial Discovery

## Context

I created this notebook because I realized that my initial `test.py` script documented the code in a way better suited to a Python notebook. With that being said, I will be rearranging some of the code to fit this more narrative style. Hopefully this context is helpful for anyone who was wondering.

With that out of the way, let's dive in.

## Import required libraries

In [1]:
from pypdf import PdfReader as pr
import numpy as np

## Read in the data

We will be working with the Fall 2025 course catalog for City College of San Francisco (CCSF), found at this [link]().

In [3]:
# Create an instance of the `PdfReader` class.
# Forward slashes work on every platform and avoid invalid escape sequences such as "\c".
reader = pr("pdfs/ccsf_fall-2025-credit-classes.pdf")


## Examine the first page

In [9]:
first_page = reader.pages[0].extract_text().split('\n')

### "Meta" data

I figured it would be nice to get some metadata about the courses.

In [8]:
# Contains the information about the context of the courses in this file
meta = first_page[0].split()
print(meta)

['CREDIT', 'FALL', '2025CCSF', 'SCHEDULE', 'OF', 'CLASSES']


In [5]:
# A middle ground between hard-coding and parsing: fields are picked by position in the title line
meta = {
    # The year and school name arrive fused as '2025CCSF'; the school is everything after the 4-digit year
    'school': meta[2][4:],
    'are_credit_courses': meta[0] == 'CREDIT',
    'term': meta[1],
    # The first four characters of the same token are the year
    'year': meta[2][:4]
}

meta

{'school': 'CCSF', 'are_credit_courses': True, 'term': 'FALL', 'year': '2025'}
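
The fixed slices above only work if the year is exactly four digits and is fused to the school name. As a minimal sketch, assuming other title lines follow the same layout as the one printed above, a regular expression makes that assumption explicit and fails loudly when it does not hold. `parse_title_line` and `TITLE_PATTERN` are hypothetical names introduced for illustration; they are not part of `pypdf`.

```python
import re

# Assumed layout (inferred from the single title line shown above):
# "<KIND> <TERM> <YYYY><SCHOOL> ..."
TITLE_PATTERN = re.compile(
    r"^(?P<kind>\S+)\s+(?P<term>\S+)\s+(?P<year>\d{4})(?P<school>[A-Za-z]+)\b"
)

def parse_title_line(line):
    """Split the title line into metadata, or raise if the layout differs."""
    match = TITLE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Unexpected title line: {line!r}")
    return {
        "school": match["school"],
        "are_credit_courses": match["kind"] == "CREDIT",
        "term": match["term"],
        "year": match["year"],
    }

print(parse_title_line("CREDIT FALL 2025CCSF SCHEDULE OF CLASSES"))
```

Raising on a mismatch is a deliberate choice here: a silent wrong slice would be harder to notice than an error.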

### Column Headers

The document, which you can view [here](), organizes the information for each course into several columns. Let's check what the `reader` object returns for the header row.

In [7]:
first_page[1]

'CRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'

Because several of the column titles come back with odd spacing, I felt it was easier to just hard-code them:

In [None]:
column_headers = [
    'CRN',
    'SEC',
    'TYPE',
    'DAYS',
    'TIMES',
    'DATES',
    'LOCATION',
    'CAMPUS',
    'INSTRUCTOR'
]
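
One way to confirm that a hard-coded list still agrees with what the PDF contains is to compare the two with all whitespace removed. This is only a sketch: it uses the header line printed above as a literal so that it runs without the PDF, and `squash` is a hypothetical helper name.

```python
def squash(text):
    """Remove all whitespace so that 'D AYS' and 'DAYS' compare equal."""
    return "".join(text.split())

# Header line exactly as returned by `extract_text()` above
extracted_header = 'CRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'

# The hard-coded column titles
expected_headers = [
    'CRN', 'SEC', 'TYPE', 'DAYS', 'TIMES',
    'DATES', 'LOCATION', 'CAMPUS', 'INSTRUCTOR',
]

# True when the extracted line contains the same titles in the same order
headers_match = squash(extracted_header) == "".join(expected_headers)
print(headers_match)
```

The comparison ignores where the spaces fall, so it would not catch two adjacent titles swapping characters across their boundary, but it does catch a renamed, missing, or reordered column.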

### Parsing the Data

It quickly became clear that the extracted text would not be easy to interpret, so I needed a way to parse the data while retaining its hierarchy.

We have `.extract_text()`, which we used for the work above, but it offers little insight into hierarchy. Printing out the text below, we can see that the footer (e.g., "REGISTER ONLINE TODAY!") is returned before any of the actual department or course information.

Additionally, the odd spacing shows up not only in the column titles but also in the course values under each column (e.g., "L ec" for the "TYPE" column, or "T R" for "DAYS", etc.).

In [None]:
# Feel free to change how much you want printed out
first_page[:10]

['CREDIT FALL 2025CCSF SCHEDULE OF CLASSES',
 'CRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR',
 'REGISTER ONLINE TODAY!  1',
 'LAST UPDATED: 6/27/2025, 4:30PM',
 'Academic Achievement & Personal Success',
 'AAPS 103: Orientation to College Transfer 3 .0',
 'PREREQ: Completion of or concurrent enrollment in: ENGL C1000.  ',
 '70482 0 08  L ec  T R  0 9:40-10:55AM  0 8/19-12/19  M IC 254  M ission  R ivera',
 ' ',
 'Accounting']
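
The section row in the output above suggests a pattern: fields are separated by two spaces, and a stray space follows the first character of most fields. Below is a minimal sketch of a row parser built on that observation. It is based only on the single row shown here, so the separator rule, the shared CRN/SEC chunk, and the "starts with digits" test are all assumptions that may not hold on other pages. `fix_leading_gap` and `parse_section_row` are hypothetical helper names.

```python
import re

COLUMN_HEADERS = ['CRN', 'SEC', 'TYPE', 'DAYS', 'TIMES',
                  'DATES', 'LOCATION', 'CAMPUS', 'INSTRUCTOR']

def fix_leading_gap(field):
    """Drop the stray space after the first character, e.g. 'L ec' -> 'Lec'."""
    return re.sub(r"^(\S) (?=\S)", r"\1", field)

def parse_section_row(line):
    """Return a dict for a section row, or None for any other kind of line."""
    # Assumption: only section rows start with a numeric CRN followed by a space
    if not re.match(r"\d+ ", line):
        return None
    # Assumption: fields are separated by runs of two or more spaces
    chunks = re.split(r"\s{2,}", line.strip())
    # In the sample row, CRN and SEC share the first chunk ('70482 0 08')
    fields = chunks[0].split(" ", 1) + chunks[1:]
    if len(fields) != len(COLUMN_HEADERS):
        return None
    return dict(zip(COLUMN_HEADERS, (fix_leading_gap(f) for f in fields)))

sample = '70482 0 08  L ec  T R  0 9:40-10:55AM  0 8/19-12/19  M IC 254  M ission  R ivera'
print(parse_section_row(sample))
```

Returning `None` for lines that do not fit, such as department names, course titles, and the footer, leaves room to handle those line types separately while keeping track of which department and course each section row belongs to.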