# Initial Discovery

## Context

I created this notebook because I realized that my initial `test.py` script documented the code in a way better suited to a Python notebook. With that being said, I will be rearranging some of the code to fit this more narrative style. Hopefully this context is helpful for anyone who was wondering.

With that out of the way, let's dive in.

## Import required libraries

In [1]:
from pypdf import PdfReader as pr
import pandas as pd
import numpy as np
import re

## Read in the data

In this case we will be working with the course catalog found at this [link]() for the 2025 Fall Term courses offered at the City College of San Francisco (CCSF).

In [2]:
# Create an instance of the `PdfReader` class
# A forward slash avoids the invalid "\c" escape sequence and works on Windows, macOS, and Linux
reader = pr("pdfs/ccsf_fall-2025-credit-classes.pdf")


## Examine the first page

In [3]:
first_page = reader.pages[0].extract_text().split('\n')

### "Meta" data

I figured it would be nice to get some metadata about the courses.

In [4]:
# Contains the information about the context of the courses in this file
meta = first_page[0].split()
print(meta)

['CREDIT', 'FALL', '2025CCSF', 'SCHEDULE', 'OF', 'CLASSES']


In [5]:
# The header tokens are indexed by position, so this is only semi-hardcoded
meta = {
    # The year and the college name are fused into one token ('2025CCSF')
    'school': meta[2][4:],
    'are_credit_courses': meta[0] == 'CREDIT',
    'term': meta[1],
    'year': meta[2][:4]
}

meta

{'school': 'CCSF', 'are_credit_courses': True, 'term': 'FALL', 'year': '2025'}
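The slicing above assumes the header always reads `CREDIT FALL 2025CCSF ...`. The full-document output later in this notebook shows that alternating pages read `FALL 2025 CREDITCCSF ...` instead, so a pattern-based parse may be more robust. Here is a minimal sketch of that idea. The name `parse_header` is hypothetical, and the list of terms other than `FALL` is an assumption, since only the Fall schedule appears in this document.

```python
import re

def parse_header(header):
    """Pull school, credit flag, term, and year out of a page header line."""
    year = re.search(r'\d{4}', header)
    # Assumption: terms other than FALL follow the same naming style
    term = re.search(r'\b(FALL|SPRING|SUMMER)\b', header)
    # The school name is fused onto the token right before "SCHEDULE"
    # (e.g. "2025CCSF" or "CREDITCCSF"), so strip the known prefix off it
    fused = re.search(r'(\S+)\s+SCHEDULE', header)
    school = re.sub(r'^(?:\d{4}|CREDIT)', '', fused.group(1)) if fused else None
    return {
        'school': school,
        'are_credit_courses': re.search(r'\bCREDIT', header) is not None,
        'term': term.group(1) if term else None,
        'year': year.group() if year else None,
    }

print(parse_header('CREDIT FALL 2025CCSF SCHEDULE OF CLASSES'))
print(parse_header('FALL 2025 CREDITCCSF SCHEDULE OF CLASSES'))
```

Both header variants should yield the same dictionary as the manual slicing above.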

### Column Headers

In this document, which you can view [here](), you can see that there are several columns used to organize information pertaining to each course. Let's check out what the `reader` object is returning.

In [6]:
first_page[1]

'CRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'

Because several of the column titles have odd spacing, I felt that it was easier to just hardcode the column titles:

In [7]:
column_headers = [
    'CRN',
    'SEC',
    'TYPE',
    'DAYS',
    'TIMES',
    'DATES',
    'LOCATION',
    'CAMPUS',
    'INSTRUCTOR'
]
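As a quick sanity check on the hardcoded list, the extracted header line can be compared against it once all whitespace is removed. This is only a sketch, and `headers_match` is a hypothetical helper name.

```python
import re

column_headers = [
    'CRN', 'SEC', 'TYPE', 'DAYS', 'TIMES',
    'DATES', 'LOCATION', 'CAMPUS', 'INSTRUCTOR',
]

def headers_match(extracted_line, expected):
    """Return True if the line equals the expected headers, ignoring all whitespace."""
    squashed = re.sub(r'\s+', '', extracted_line)
    return squashed == ''.join(expected)

# The line returned by `first_page[1]` in plain extraction mode
extracted = 'CRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'
print(headers_match(extracted, column_headers))
```

If a future version of the PDF adds or renames a column, this check would fail instead of silently mislabeling the data.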

### Parsing the Data

It quickly became clear that the extracted text would be hard to interpret as-is, so I needed a way to parse the data while retaining its hierarchy.

#### Troubleshooting

We have `.extract_text()`, which we used for the work above, but that didn't provide much insight on hierarchy. Printing out the text below, we can see that the footer (e.g., "REGISTER ONLINE TODAY!") is returned before any of the actual department or course information.

Additionally, we still get the weird spacing between some of the column titles and the actual course information pertaining to each column (e.g., "L ec" for the "TYPE" column, or "T R" for "DAYS", etc.).

In [8]:
# Feel free to change how much you want printed out
first_page[:10]

['CREDIT FALL 2025CCSF SCHEDULE OF CLASSES',
 'CRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR',
 'REGISTER ONLINE TODAY!  1',
 'LAST UPDATED: 6/27/2025, 4:30PM',
 'Academic Achievement & Personal Success',
 'AAPS 103: Orientation to College Transfer 3 .0',
 'PREREQ: Completion of or concurrent enrollment in: ENGL C1000.  ',
 '70482 0 08  L ec  T R  0 9:40-10:55AM  0 8/19-12/19  M IC 254  M ission  R ivera',
 ' ',
 'Accounting']

With some perusing, you can find that there are different "extraction modes" (`plain` vs `layout`). `plain` is the default and legacy option, so let's try `layout`.

In [9]:
# Converting to a list so we can print in a for loop
first_page = reader.pages[0].extract_text(extraction_mode='layout').split('\n')

lines = 10
for line in range(lines):
    print(first_page[line])

 Academic Achievement & Personal Success
 AAPS 103: Orientation to College Transfer                                                                                                                   3.0
          PREREQ: Completion of or concurrent enrollment in: ENGL C1000.
     70482      008       Lec        TR         09:40 -10:55AM          08/19 -12/19     MIC 254                Mission           Rivera



 Accounting
 ACC T 1: Financial Accounting                                                                                                                               5.0
          Recommended Prep: (Readiness for college -level English or ESL 188) and BSMA 68.


Something important to note is that this approach seems to drop the header information (at least within the first 10 lines), but it preserves the overall layout of the document well **and** it creates far fewer splits in the course information than the `plain` method did (although some remain, e.g. `ACC T 1`).

*However*, there is still the issue of understanding the hierarchy. We could use regular expressions and conditionals to try and map out a predictable pattern for the text, but that *feels* inefficient, and the documentation also notes that it is "very hard to guarantee correct whitespaces."

So I looked into this method some more and found that you can use a 'visitor function' which, for the purposes of this project, can provide more information about the text being read by the extractor. This does present another issue, as 'visitor functions' cannot be used with the `layout` "extraction mode".

Nevertheless, let's inspect this new approach:

###### Method Arguments

If you would like to know more about this please follow this [link](https://pypdf.readthedocs.io/en/stable/user/extract-text.html). But I will paste their explanation of the visitor function's arguments below:

- text: the current text (as long as possible, can be up to a full line)
- user_matrix: current matrix to move from user coordinate space (also known as CTM)
- tm_matrix: current matrix from text coordinate space
- font_dictionary: full font dictionary
- font_size: the size (in text coordinate space)


The documentation shows that we pass the visitor function through our `extract_text` method, which allows us to obtain some extra information about the text that is parsed by the extractor - namely `font_size`. 

Given how the document is structured, it could be useful to use the `font_size` to distinguish between certain sections on each page. To obtain this, we need to accumulate the font sizes identified by the extractor.

In [10]:
# Define an accumulator to collect the font sizes and the actual text processed
font_sizes = []
processed_text = []

# Define a visitor function
def visitor_func(text, cm, tm, font_dict, font_size):
    # Since we don't know how many items will be processed we will have to append new items
    font_sizes.append(font_size)
    processed_text.append(text)

# Extract/process the text on the first page
first_page = reader.pages[0].extract_text(visitor_text=visitor_func)

# Debug the information gathered
for i, (font_size, text) in enumerate(zip(font_sizes, processed_text)):
    if i >= 10:
        break
    print(f"Item {i}: ({font_size}) - {text}")

Item 0: (12.0) - 
Item 1: (12.0) - 
Item 2: (1.0) - CREDIT FALL 2025CCSF SCHEDULE OF CLASSES
Item 3: (1.0) - 
Item 4: (1.0) - 

Item 5: (1.0) - CRN
Item 6: (1.0) - 
Item 7: (1.0) -   
Item 8: (1.0) - 
Item 9: (1.0) - SEC


So, the result was pretty confusing. How is it that the largest text in the document (Item 2) has a font size of 1? I know that we already manually parsed the header, but if you increase the number of lines the previous cell prints, you will see similarly unexpected outputs.

Fortunately, there are other arguments that I passed over that can still be useful. The argument `tm` is defined as the "current matrix from text coordinate space", and the documentation later goes on to say the following:

"*The matrix stores six parameters. The first four provide the rotation/scaling matrix and the last two provide the translation (horizontal/vertical). It is recommended to use the user_matrix as it takes into account all transformations.*"

Let's take a look at what the `tm` argument returns:

In [11]:
# Define an accumulator to collect the font sizes and the actual text processed
font_sizes = []
processed_text = []
tms = []

# Define a visitor function
def visitor_func(text, cm, tm, font_dict, font_size):
    # Since we don't know how many items will be processed we will have to append new items
    font_sizes.append(font_size)
    processed_text.append(text)
    # * Added in the tm accumulator
    tms.append(tm)

# Extract/process the text on the first page
first_page = reader.pages[0].extract_text(visitor_text=visitor_func)

# Debug the information gathered
for i, (font_size, tm, text) in enumerate(zip(font_sizes, tms, processed_text)):
    if i >= 10:
        break
    print(f"Item {i}: ({font_size}) - {tm} -{text}")

Item 0: (12.0) - [1.0, 0.0, 0.0, 1.0, 0.0, 0.0] -
Item 1: (12.0) - [1.0, 0.0, 0.0, 1.0, 0.0, 0.0] -
Item 2: (1.0) - [16.0, 0.0, 0.0, 16.0, 36.0798, 750.1841] -CREDIT FALL 2025CCSF SCHEDULE OF CLASSES
Item 3: (1.0) - [18.0, 0.0, 0.0, 18.0, 179.0648, 749.4571] -
Item 4: (1.0) - [1.0, 0.0, 0.0, 1.0, 0.0, 0.0] -

Item 5: (1.0) - [9.0, 0.0, 0.0, 9.0, 55.19, 732.15] -CRN
Item 6: (1.0) - [9.0, 0.0, 0.0, 9.0, 55.19, 732.15] -
Item 7: (1.0) - [1.0, 0.0, 0.0, 1.0, 0.0, 0.0] -  
Item 8: (1.0) - [9.0, 0.0, 0.0, 9.0, 72.677, 732.15] -
Item 9: (1.0) - [9.0, 0.0, 0.0, 9.0, 92.306, 732.15] -SEC


Given that the first four elements are said to pertain to rotation and scaling, and only the 1st and 4th of those elements are filled (you can print out more to confirm), I will opt to see if the 4th element would be a suitable marker for font size:
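Reading the scale straight from the 4th element works here because the 2nd and 3rd elements are zero, i.e. the text is not rotated or skewed. As a hedged sketch of a slightly more general version, the vertical scale of a six-element matrix `[a, b, c, d, e, f]` can be computed as the length of the vector `(c, d)` and multiplied by the reported `font_size`. The helper name `effective_font_size` is hypothetical, and the sketch ignores the `cm` matrix, which could apply additional scaling in other documents.

```python
import math

def effective_font_size(font_size, tm):
    """Estimate the rendered text height from the font size and text matrix."""
    a, b, c, d, e, f = tm
    # The unit vector (0, 1) maps to (c, d), so its length is the vertical scale
    vertical_scale = math.hypot(c, d)
    return font_size * vertical_scale

# Values taken from Item 2 and Item 5 in the output above
print(effective_font_size(1.0, [16.0, 0.0, 0.0, 16.0, 36.0798, 750.1841]))
print(effective_font_size(1.0, [9.0, 0.0, 0.0, 9.0, 55.19, 732.15]))
```

Note that for the empty items with an identity matrix and a `font_size` of 12.0, this estimate gives 12.0 rather than the 1.0 that `tm[3]` alone returns, so the two approaches only agree on the items that actually carry text here.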

In [12]:
# Define an accumulator to collect the font sizes and the actual text processed
font_sizes = []
processed_text = []

# Define a visitor function
def visitor_func(text, cm, tm, font_dict, font_size):
    # Since we don't know how many items will be processed we will have to append new items
    font_sizes.append(tm[3]) # I'm now using the 4th element of tm for the font size
    processed_text.append(text)

# Extract/process the text on the first page
first_page = reader.pages[0].extract_text(visitor_text=visitor_func)

# Debug the information gathered
for i, (font_size, text) in enumerate(zip(font_sizes, processed_text)):
    if i >= 10:
        break
    print(f"Item {i}: ({font_size}) - {text}")

Item 0: (1.0) - 
Item 1: (1.0) - 
Item 2: (16.0) - CREDIT FALL 2025CCSF SCHEDULE OF CLASSES
Item 3: (18.0) - 
Item 4: (1.0) - 

Item 5: (9.0) - CRN
Item 6: (9.0) - 
Item 7: (1.0) -   
Item 8: (9.0) - 
Item 9: (9.0) - SEC


This is very promising!

I'm interested to see what the distribution of font sizes, along with examples of their corresponding text, looks like across the first page (as a sample of the entire document).

Here's how I did this:

In [13]:
# Since we will be looping, I want to store the page information to be more efficient
first_page = reader.pages[0]

# This is useful for the loop below, but needs to be created before the method
font_limit = 0
# Since the visitor function can collect lines of text
lines = []        
# This will store the information for each font size throughout the first page
_dict = {}

def visitor_func(text, cm, tm, font_dict, font_size):
    # I just want to collect font and text information for a specific size one at a time
    if tm[3] == font_limit: 
        # I don't want to sift through clutter when looking at printed examples
        if text not in  ('', ' ', "'  '", '\n'):
            # I use repr just to get more insight into the characters included in each string
            lines.append(repr(text))

# Arbitrarily chose 20 because I saw that most examples were below 18
for i in range(20):
    # The index controls the font size
    font_limit = i + 1
    
    # Do the extraction
    first_page.extract_text(visitor_text=visitor_func, extraction_mode="plain")
    # _list = lines.copy()
    _dict[font_limit] = (len(lines), lines)
    lines = []
    
for k, v in _dict.items():
    if v[0] > 0:
        print(k, v)

1 (10, ["'  '", "'  '", "'  '", "'  '", "'  '", "'  '", "'CREDIT FALL 2025CCSF SCHEDULE OF CLASSES\\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'", "'  '", "'  '", "'  '"])
9 (175, ["'CRN'", "'SEC'", "'TYPE'", "'D AYS'", "'TIMES'", "'D ATES'", "'L OCATION'", "'C AMPUS'", "'INSTR UCTOR'", "'PREREQ: Completion of or concurrent enrollment in: ENGL C1000.  '", "'70482'", "'0 08'", "'  '", "'L ec'", "'  '", "'T R'", "'  '", "'0 9:40-10:55AM'", "'  '", "'0 8/19-12/19'", "'  '", "'M IC 254'", "'  '", "'M ission'", "'  '", "'R ivera'", "'Recommended Prep: (Readiness for college-level English or ESL 188) and BSMA 68.  '", "'70243'", "'0 01'", "'  '", "'L ec'", "'  '", "'M TWRF'", "'  '", "'1 0:10-11:00AM'", "'  '", "'0 8/18-12/19'", "'  '", "'C LOU 229'", "'  '", "'O cean'", "'  '", "'Y run'", "'70244'", "'0 02'", "'  '", "'L ec'", "'  '", "'M TWRF'", "'  '", "'1 1:10-12:00PM'", "'  '", "'0 8/18-12/19'", "'  '", "'C LOU 229'", "'  '", "'O cean'", "'  '", "'Y run'", "'702

Eureka! This output tells me the following:
 - Font 9: Paragraph text, specific course information (CRN, SEC, etc.)
 - Font 11: Footer data
 - Font 14: Department, course title, number of units
 - Font 16: Main title of the document

So, we still see that there is weird separation of words, but I can do either of the following:

1. I can just remove the whitespace for each of the words because they were correctly included as one string.
2. I can ignore it and just use this approach to identify departments, course titles, units, and the overall page title, **and** I can use the `layout` extraction mode separately for the other text.

At this point, I'm opting for option 2.
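For reference, option 1 could look something like the sketch below. In most of the examples printed above, the stray space sits right after the first character (`'L ec'`, `'0 08'`, `'M IC 254'`), so the hypothetical helper `rejoin_first_char` only removes a single space in that position. This is a heuristic based on the first page only: it leaves `'INSTR UCTOR'` untouched, and it would wrongly join a legitimate value that starts with a one-letter word.

```python
import re

def rejoin_first_char(token):
    """Remove a single stray space that directly follows the first character."""
    return re.sub(r'^(\S) (?=\S)', r'\1', token)

# Examples taken from the font size 9 output above
for raw in ['L ec', 'T R', '0 9:40-10:55AM', 'M IC 254', '70482', 'INSTR UCTOR']:
    print(repr(raw), '->', repr(rejoin_first_char(raw)))
```

The space inside `'MIC 254'` is preserved because only the space after the first character is targeted.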

Before I do that, however, I want to extend this inspection to the entire document. Since this is likely to be a slow process, I will define a flag that controls whether the operation executes; it should be set to `False` once the information has been printed.

In [14]:
# Apologies for the long name, but I wanted to make it descriptive
execute_document_font_distribution_extraction = True

# Throwing error to ensure the next cell isn't run if I forget and do run all
# raise Exception("Do you really want to re-run the next cell?")

In [15]:
# We don't need this to run all the time, so I added a safeguard
if execute_document_font_distribution_extraction:
    # Importing a module to do some cleaner printing
    from IPython.display import clear_output
    
    # Importing defaultdict to make accumulation easier
    from collections import defaultdict
    
    # Doing this here instead of every loop
    num_pages = len(reader.pages)
         
    # This will store the information for each font size throughout the entire document
    document_font_size_dict = defaultdict(lambda: [0, []]) # Needs to be outside of the loop to not erase previous page data
    
    # Iterate through each page and extract and process text/font size
    for page_number, page in enumerate(reader.pages):
        # This is useful for the loop below, but needs to be created before the method
        font_limit = 0
        # Since the visitor function can collect lines of text
        lines = []

        def visitor_func(text, cm, tm, font_dict, font_size):
            # I just want to collect font and text information for a specific size one at a time
            if tm[3] == font_limit: 
                # I don't want to sift through clutter when looking at printed examples
                if re.search(r'\w', text):
                    lines.append(text)

        # Arbitrarily chose 20 because I saw that most examples were below 18
        for i in range(20):
            # The index controls the font size
            font_limit = i + 1
            
            # Do the extraction
            page.extract_text(visitor_text=visitor_func, extraction_mode="plain")
            
            # Assign the information
            document_font_size_dict[font_limit][0] += len(lines)
            # This lets us see a breakdown of what was extracted on each page
            document_font_size_dict[font_limit][1].append(lines)
            
            # Reset this accumulator to not duplicate information across key-value pairs
            lines = []
            
        # Print an update
        clear_output(wait=True)
        print(f"Completed page {page_number + 1} out of {num_pages}")
        
    # Debug distribution collected
    for k, v in document_font_size_dict.items():
        if v[0] > 0:
            print(k, v)

Completed page 280 out of 280
1 [280, [['CREDIT FALL 2025CCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['FALL 2025 CREDITCCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['CREDIT FALL 2025CCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['FALL 2025 CREDITCCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['CREDIT FALL 2025CCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['FALL 2025 CREDITCCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['CREDIT FALL 2025CCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['FALL 2025 CREDITCCSF SCHEDULE OF CLASSES\nCRN  SEC  TYPE  D AYS TIMES  D ATES L OCATION  C AMPUS  INSTR UCTOR'], ['CREDIT FALL 2025CCSF SCHEDULE OF CLASSE

Now that we have run this for the entire document, we can be extra sure that the formatting is consistently identified within the first 20 font sizes. Let's march on.
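As an aside, the loop above extracts every page 20 times, once per candidate font size. A possible alternative is to group the text by `tm[3]` in a single pass per page. The sketch below is one way to do that under the same visitor signature used throughout this notebook. The name `make_font_size_collector` is hypothetical, and the demo calls the visitor by hand with values copied from earlier output rather than reading the PDF.

```python
from collections import defaultdict
import re

def make_font_size_collector():
    """Return a visitor function and the dict it fills, keyed by tm[3]."""
    by_size = defaultdict(list)

    def visitor(text, cm, tm, font_dict, font_size):
        # Skip whitespace-only items, as in the cell above
        if re.search(r'\w', text):
            by_size[tm[3]].append(text)

    return visitor, by_size

# Demo with hand-written arguments copied from the earlier output
visitor, by_size = make_font_size_collector()
visitor('CREDIT FALL 2025CCSF SCHEDULE OF CLASSES', None,
        [16.0, 0.0, 0.0, 16.0, 36.0798, 750.1841], None, 1.0)
visitor('CRN', None, [9.0, 0.0, 0.0, 9.0, 55.19, 732.15], None, 1.0)
visitor('  ', None, [1.0, 0.0, 0.0, 1.0, 0.0, 0.0], None, 1.0)
print({size: len(texts) for size, texts in by_size.items()})

# With pypdf, assuming the `reader` object from earlier in this notebook:
# for page in reader.pages:
#     page.extract_text(visitor_text=visitor)
```

Because the keys are whatever values `tm[3]` takes, this would also surface any fractional sizes that a loop over the whole numbers 1 to 20 cannot match.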

#### Implementing Plan

My plan going forward is to use the font sizes that correlate with headers, and the `layout` extraction method for everything else. To do that, I need to be able to accurately classify these lines of text. Let's take a look at that data now.

In [16]:
# Get unique list of strings for each relevant font size

# * Font size = 16
text_16 = []
for text_list in document_font_size_dict[16][1]:
    text_16 += text_list

# Generate a unique set of lines
text_16 = set(text_16)
print(text_16)

# * Font size = 14
text_14 = []
for text_list in document_font_size_dict[14][1]:
    text_14 += text_list

# We don't want to generate a unique set of lines here actually
print(text_14)


{'FALL 2025 CREDITCCSF SCHEDULE OF CLASSES', 'CREDIT FALL 2025CCSF SCHEDULE OF CLASSES'}
['Academic Achievement & Personal Success', 'AAPS 103: Orientation to College Transfer', '3 .0', 'Accounting', 'ACCT 1: Financial Accounting', '5 .0', 'ACCT 2: Managerial Accounting', '5 .0', 'ACCT 51: Intermediate Accounting', '5 .0', 'ACCT 59: Federal Income Tax', '3 .0', 'Administration of Justice', 'ADMJ 51: Juvenile Procedures', '3 .0', 'ADMJ 52: Concepts of Criminal Law', '3 .0', 'ADMJ 53: Legal Aspects of Evidence', '3 .0', 'ADMJ 54: Principles and Procedures of the Justice System', '3 .0', 'ADMJ 57: Introduction to Administration of Justice', '3 .0', 'ADMJ 62: Criminal Investigation', '3 .0', 'ADMJ 64: Progressive Policing in the 21st Century', '3 .0', 'ADMJ 68: Criminal Justice Report Writing', '3 .0', 'ADMJ 69: Crime Scene Documentation', '3 .0', 'ADMJ 70A: Patrol Procedures', '3 .0', 'ADMJ 72: Police Work Experience', '8 .0', 'ADMJ 80: Community Corrections', '3 .0', 'ADMJ 89: Continuing

We can skip the text of font size 16 and just focus on the font size of 14. Looking at the examples it seems like it might be a safe bet to parse on strings that contain a colon (course titles vs departments), and separating out those that are numbers (units). Let's see what that yields.

In [17]:
# Storing the departments, their courses, and their courses' data 
departments = np.array([])
courses = []

current_department = ''
current_course = ''
# Min, Max
current_units = []

# Total number of items in text_14
n_items = len(text_14)

# Generate a dictionary --> json file?
for i, text in enumerate(text_14):
    # print(text)
    
    # try to get the next item to handle the instance where the "Mandarin" course does not match the rest of the course patterns
    try:
        next_item = text_14[i + 1]
    except IndexError:
        next_item = ''
        
    # If the string represents a course title
    if (':' in text) or ('.' in next_item):
        # print("course")
        # Assigning to a variable will be helpful for when we assign courses to a department's course list
        current_course = text
    else:
        # If the text represents the units for the course
        if re.search('[0-9]', text):
            # print("units")
            # Clean the data
            text = re.sub("[^0-9.-]", '', text) # Instead of removing characters that I couldn't identify, I just removed everything I didn't want
            # Cast as float in case there are courses with half units??
            text = [float(course_units) for course_units in text.split('-')]
            
            # Find the index of the department
            department_index = np.where(departments == current_department)[0][0]
            
            # Update the course criteria
            courses.append([current_course, text, department_index])
        # Department title
        else:
            # print("Department")
            # Similar to above, assigning the department to a variable will be helpful when indexing its courses
            current_department = text
            
            # Add a new department to the array
            departments = np.append(departments, text)

In [18]:
# This approach is more efficient than appending to the end of a dataframe
department = pd.DataFrame(departments, columns=['department'])
department.head()

Unnamed: 0,department
0,Academic Achievement & Personal Success
1,Accounting
2,Administration of Justice
3,African American Studies
4,American Sign Language


In [144]:
# This approach is more efficient than appending to the end of a dataframe
course = pd.DataFrame(courses, columns=['course', 'units', 'department'])
print(f"Number of courses identified: {len(course)}")
course.head()

Number of courses identified: 1025


Unnamed: 0,course,units,department
0,AAPS 103: Orientation to College Transfer,[3.0],0
1,ACCT 1: Financial Accounting,[5.0],1
2,ACCT 2: Managerial Accounting,[5.0],1
3,ACCT 51: Intermediate Accounting,[5.0],1
4,ACCT 59: Federal Income Tax,[3.0],1


This worked well! We have some clean data that we can use to reference and join as needed. Next we need to actually go through the pages to get the information pertaining to each course offering. To do this, we will iterate through the strings returned in the `layout` extraction mode.

I found that the layout mode keeps the units on the same line as the course title, which makes checking for equivalence a bit difficult. I could use the length of the line to determine whether it's a course title (because they actually seem to have a consistent length of 160 characters) and then just increment through the course table. So let's try that and see how well it works.

Amendment: Some course title lines have a length of 164 characters.
Amendment: Line length turned out not to be unique to course titles, so I had to come up with a regex pattern and strip the excess in order to match each course title to the values in the `course` dataframe.
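The matching idea from the second amendment can be isolated into a small helper: drop the trailing units, then remove every whitespace character so that `ACC T 1` and `ACCT 1` compare equal. This is a sketch with a hypothetical name, `normalize_title`. Unlike the `\d\.0$` pattern used in the next cell, the units pattern here also accepts fractional values and ranges such as `1.0 -3.0`, which is an assumption about how such units would be printed.

```python
import re

# Trailing units: a decimal number, optionally followed by "-" and another one
UNITS_SUFFIX = re.compile(r'\s+\d+\.\d+(?:\s*-\s*\d+\.\d+)?$')

def normalize_title(line):
    """Strip trailing units and all whitespace from a course title line."""
    without_units = UNITS_SUFFIX.sub('', line.strip())
    return re.sub(r'\s', '', without_units)

layout_line = ' ACC T 1: Financial Accounting' + ' ' * 100 + '5.0'
print(normalize_title(layout_line))
print(normalize_title('ACCT 1: Financial Accounting'))
```

Applying the same helper to both the layout lines and the `course` dataframe values gives keys that can be compared directly.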

In [145]:
# Defining the method that will generate a dictionary to store course related information
def course_criteria_information(crn = '', sec = '', type = '', days = '', times = '', dates = '', location = '', campus = '', instructor = '', description = '', course_index = None):
    return {
        'crn': crn,
        'sec': sec,
        'type': type,
        'days': days,
        'times': times,
        'dates': dates,
        'location': location,
        'campus': campus,
        'instructor': instructor,
        'description': description,
        'course': course_index
    }
    
# Initialize an empty course info dictionary
course_info = course_criteria_information()
    
# Course info list
course_info_list = []

# For iterating through the course dataframe
curr_course_index = -1

# # Tracking the current description
curr_description = ''

# Tracking advisory notice
course_advisory_notice_pairs = []

# Advisory notices  
advisory_notices = [
    'PREREQ',
    'Recommended Prep'
]

stop_loop = False

num_missed_records = 0

identified_courses = []

course_regex_pattern = re.compile(r'[\w\s:]+\s+\d\.0$')

dense_course_titles = course.course.apply(lambda x: re.sub(r'\s', '', x)).values

# Iterate through each page of the document
for page_number, page in enumerate(reader.pages):
    # Clear output each page for easier debugging
    # clear_output(wait=True)
    
    if stop_loop:
        break
    
    # Return the extracted page data as a list
    lines = page.extract_text(extraction_mode='layout').split('\n')
    
    # The bulk of the operations will be in this loop block
    for line in lines:
        # Strip away leading and trailing whitespace
        stripped_line = line.strip()
        
        
        # Skip if on an empty line; Some advisory notice lines contain 'No prerequisites.' which is unnecessary, and throws off my code as it looks like a description with the current criteria, so we can skip entirely
        if stripped_line in ('', 'No prerequisites.'):
            continue
        
        # print(len(line))
        # Define what this line represents and determine what department and courses we're working on
        if stripped_line in departments:
            # ? I don't think we need to track this
            # print("Is a department")
            pass
        elif re.match(course_regex_pattern, stripped_line):
            # print("Is a course")
            
            # Remove the units from the line
            course_title = re.sub(r'\d\.0$', '', stripped_line)
            
            
            # Instead of handling the weird spacing that is inconsistent, I can simply remove all whitespace characters and compare titles that way
            course_title = re.sub(r'\s', '', course_title).strip()
            
                
            if course_title in dense_course_titles:
                pass
            else:
                print("Not in dataframe")
                print(course_title)
                print(dense_course_titles)
                
            
            
            
            curr_course_index += 1
            identified_courses.append(line)
            
        # If the line starts with a word, and isn't a course or a department, then it's an advisory or a description?
        elif re.match(r'^[a-zA-Z(]', stripped_line) or re.search(r'[.,()]', stripped_line):
            # I don't care about the footer text; not including the last character because it's the page number which will change every page
            if re.sub(r'\s{1}', '', stripped_line[:-1]) == 'REGISTERONLINETODAY!':
                continue
            elif stripped_line[:stripped_line.find(':')] in advisory_notices:
                # print("Advisory notice")
                # Save the notice
                course_advisory_notice_pairs.append([curr_course_index, stripped_line])
            else:
                # print("Description")
                curr_description += stripped_line
        elif re.match(r'^\d+', stripped_line) and not re.search(r'[.,()]', stripped_line):
            # print("Course Criteria")
            # This feels really bulky, but I just want to get a working solution
            # I needed to use more than one space to split on, to account for the weird spacing between characters
            # print(re.split('\s{2}', stripped_line))
            # I am opting to not remove spaces from string elements here, as I can do that with pandas for specific columns
            course_criteria = [criteria for criteria in re.split(r'\s{2}', stripped_line) if criteria != '']
            
            # Sometimes there was only one space character between the first two criteria; sometimes the crn has a space in it, so I needed to add the re.sub() to make sure that wasn't captured in this if block
            if len(re.sub(r'\s', '', course_criteria[0])) > 5:
                # When this happens, we just split the elements erroneously combined at index 0
                course_criteria = [course_criteria[0][:5], course_criteria[0][5:]] + course_criteria[1:]
                
            
            
                
            # * Once we hit this line, it marks a new record for the course criteria list
            
            # So we first need to append it to a list to save any information collected so far
            
            # Add the description
            course_info['description'] = curr_description
            course_info_list.append(course_info)
            
            # Then we need to create a new set of information
            curr_description = ''
            # When a course offering is asynchronous, then it won't have a "DAYS" or a "LOCATION" field
            if len(course_criteria) == 7:                
                course_info = course_criteria_information(
                    crn = course_criteria[0],
                    sec = course_criteria[1],
                    type = course_criteria[2],
                    days = None,
                    times = course_criteria[3],
                    dates = course_criteria[4],
                    location = None,
                    campus = course_criteria[5],
                    instructor = course_criteria[6],
                    description = curr_description,
                    course_index = curr_course_index
                )
            else:
                try:
                    course_info = course_criteria_information(
                        crn = course_criteria[0],
                        sec = course_criteria[1],
                        type = course_criteria[2],
                        days = course_criteria[3],
                        times = course_criteria[4],
                        dates = course_criteria[5],
                        location = course_criteria[6],
                        campus = course_criteria[7],
                        instructor = course_criteria[8],
                        description = curr_description,
                        course_index = curr_course_index
                    )
                except IndexError as e:
                    print("Reached exception!")
                    print("On page:", page_number + 1)
                    # print(f"Result of conditions 're.match('^\d+', stripped_line) and not re.match('[./(,)]', stripped_line) is {re.match('^\d+', stripped_line)} and {re.match('[./(,)]', stripped_line)}")
                    print(len(stripped_line))
                    print(stripped_line)
                    print(course_criteria)
                    # stop_loop = True
                    print(e)
                    num_missed_records += 1
                    
            
            
            # print(course_criteria)
        else:
            print("Other?")
            print(stripped_line)
        
        # print(stripped_line)
        
    
    # if page_number >= 2:
    #     break

print(num_missed_records)

Reached exception!
On page: 4
283
73995                     001                  Wrk                                                                                           Hours Arr 09/12-11/26                                                                            Other Sites                            Guzman
['73995', ' 001', 'Wrk', ' Hours Arr 09/12-11/26', 'Other Sites', 'Guzman']
list index out of range
Reached exception!
On page: 4
283
73996                     002                  Wrk                                                                                           Hours Arr 12/26 - 03/11                                                                          Other Sites                            Guzman
['73996', ' 002', 'Wrk', ' Hours Arr 12/26 - 03/11', 'Other Sites', 'Guzman']
list index out of range
Not in dataframe
AFAM55:FromFunktoHipHop
['AAPS103:OrientationtoCollegeTransfer' 'ACCT1:FinancialAccounting'
 'ACCT2:ManagerialAccounting' ... 'PSYC25:Psychologyo

KeyboardInterrupt: 
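
Both failing rows above split into six fields instead of nine, apparently because `Hours Arr` and the date range are fused into one field while the days and location columns are blank. A minimal sketch of one way to expand such rows follows. The helper name `expand_arranged_row` and the regular expression are hypothetical, and it assumes that every six-field row follows this exact pattern, which I have only seen on these two rows.

```python
import re

# Assumption: the fused field looks like "Hours Arr 09/12-11/26" or "Hours Arr 12/26 - 03/11"
ARRANGED = re.compile(r"^\s*(Hours Arr)\s+(\d{2}/\d{2}\s*-\s*\d{2}/\d{2})\s*$")


def expand_arranged_row(fields):
    """Return a nine-field row for an 'Hours Arr' section, or None if the row does not fit."""
    if len(fields) != 6:
        return None
    match = ARRANGED.match(fields[3])
    if match is None:
        return None
    crn, sec, kind, _, campus, instructor = (field.strip() for field in fields)
    times, dates = match.group(1), match.group(2)
    # Days and location are left empty because the PDF shows nothing in those columns
    return [crn, sec, kind, "", times, dates, "", campus, instructor]


row = ['73995', ' 001', 'Wrk', ' Hours Arr 09/12-11/26', 'Other Sites', 'Guzman']
print(expand_arranged_row(row))
```

Returning `None` for anything unexpected keeps the existing `IndexError` path as the fallback, so rows that do not fit the pattern are still counted in `num_missed_records`.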

In [149]:
course[course.course.str.startswith('AFAM')]

Unnamed: 0,course,units,department
18,AFAM 30: African American Consciousness,[3.0],3
19,AFAM 40: The Black Experience in California,[3.0],3
20,AFAM 42: The Origins and History of Race Theor...,[3.0],3
21,AFAM 60: African American Women in the U.S.,[3.0],3
1003,AFAM 60: African American Women in the U.S.,[3.0],102


In [123]:
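# Sample line as extracted from the PDF; note the stray space inside the subject code ("ACC T")
sample_line = "ACC T 1: Financial Accounting                                                                                                                               5.0"
# Match the leading run of word characters and whitespace (it stops at the colon)
pattern = r'[\w\s]+'
match = re.match(pattern, sample_line)
if match:
    course_code = match.group()
    print(course_code)
    # Count the whitespace characters inside the matched course code
    num_spaces = len(re.findall(r'\s', course_code))
    print(num_spaces)
else:
    print("Not a match")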

ACC T 1
2
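
The match above shows that the extracted subject code can contain a stray space (`ACC T 1` rather than `ACCT 1`). One possible way to normalize such lines is sketched below. The function name `normalize_course_title` is hypothetical, and the sketch assumes that the last token before the colon is always the course number and that the title is separated from the units by at least two spaces.

```python
import re


def normalize_course_title(raw: str) -> str:
    """Collapse stray spaces in the subject code and drop the trailing units column."""
    head, sep, rest = raw.strip().partition(":")
    tokens = head.split()
    if not sep or len(tokens) < 2:
        # Not in the expected "SUBJ NUM: Title" shape, so return it untouched
        return raw.strip()
    subject = "".join(tokens[:-1])  # "ACC", "T" -> "ACCT"
    number = tokens[-1]             # assumed to be the course number
    # The units are separated from the title by a long run of spaces
    title = re.split(r"\s{2,}", rest.strip())[0]
    return f"{subject} {number}: {title}"


print(normalize_course_title(" ACC T 1: Financial Accounting          5.0"))
print(normalize_course_title(" ADM J 52: Concepts of Criminal Law      3.0"))
```

This would break for any subject code that legitimately contains a space, so it is worth checking the results against the `course` dataframe before relying on it.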


In [99]:
print(len(identified_courses))
for identified_course in identified_courses:
    print(identified_course)

2862
 AAPS 103: Orientation to College Transfer                                                                                                                   3.0
 ACC T 1: Financial Accounting                                                                                                                               5.0
 ACC T 2: Managerial Accounting                                                                                                                              5.0
 ACC T 51: Intermediate Accounting                                                                                                                           5.0
 ACC T 59: Federal Income Tax                                                                                                                                3.0
 ADMJ 51: Juvenile Procedures                                                                                                                                3.0
 ADM J 52: Concepts of Crimin

In [96]:
# ! There is something wrong with the code: the CRNs are inconsistent, the descriptions don't match the courses, and the course index seems to be incorrect (especially for the rows with course index 8)
course_offering_info = pd.DataFrame(course_info_list[1:])
print(len(course_offering_info))
course_offering_info

983


Unnamed: 0,crn,sec,type,days,times,dates,location,campus,instructor,description,course
0,70482,008,Lec,TR,09:40 -10:55AM,08/19 -12/19,MIC 254,Mission,Rivera,,0
1,70243,001,Lec,MT WRF,10:10 -11:00AM,08/18 -12/19,CLOU 229,Ocean,Yrun,,1
2,7024 4,002,Lec,MT WRF,11:10 -12:00PM,08/18 -12/19,CLOU 229,Ocean,Yrun,,1
3,70246,004,Lec,MW,02:10 - 04:25PM,08/18 -12/19,CLOU 230,Ocean,Mullen,,1
4,70247,931,Onl,,Asynchronous,09/02-12/19,,Online,Carballo,This class requires use of Canvas with no requ...,1
...,...,...,...,...,...,...,...,...,...,...,...
978,73042,001,Lec,MW,11:10 -12:25PM,08/18 -12/19,JDVL 810,Ocean,Lockman,This class is taught in person at the Ocean ca...,2836
979,73826,S01,Lec,MT WR,09:10 -10:15AM,08/18 -12/19,HC 213,Ocean,Bravewoman,IMPORTANT: This section of 110A is linked to 1...,2837
980,73832,S01,Lec,TR,10:40 -11:30AM,08/19 -12/19,HC 213,Ocean,Bravewoman,This is the co-requisite suppor t class for Ma...,2838
981,73409,931,Onl,,Asynchronous,09/29-11/09,,Online,Lau,This course is designed for general education ...,2839
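
One of the inconsistencies is visible in row 2 above, where the CRN was extracted as `7024 4`. A small sketch for cleaning and checking the column is shown below, using a toy dataframe rather than the real `course_offering_info`. It assumes that every CRN is exactly five digits, which holds for the rows displayed here but is not something I have verified for the whole catalog.

```python
import pandas as pd

# Toy stand-in for a few rows of the real dataframe
offerings = pd.DataFrame(
    {
        "crn": ["70243", "7024 4", "70246"],
        "sec": ["001", "002", "004"],
    }
)

# Remove any whitespace that the PDF extraction left inside the CRN
offerings["crn"] = offerings["crn"].str.replace(r"\s+", "", regex=True)

# Flag anything that still does not look like a five-digit CRN (assumed format)
suspicious = offerings[~offerings["crn"].str.fullmatch(r"\d{5}")]

print(offerings)
print(len(suspicious))
```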


In [22]:
# pd.merge only accepts a Series that has a name; the name becomes the new column's label
advisory_notices_courses = [val[0] for val in course_advisory_notice_pairs]
advisory_notices_notices = [val[1] for val in course_advisory_notice_pairs]
course_advisory_notice_pairs = pd.Series(advisory_notices_notices, name="advisory_notice", index=advisory_notices_courses)

# Add the data to the dataframe
course = pd.merge(
    course,
    course_advisory_notice_pairs,
    left_index=True,
    right_index=True
)

course

Unnamed: 0,course,units,department,advisory_notice
0,AAPS 103: Orientation to College Transfer,[3.0],0,PREREQ: Completion of or concurrent enrollment...
1,ACCT 1: Financial Accounting,[5.0],1,Recommended Prep: (Readiness for college -leve...
2,ACCT 2: Managerial Accounting,[5.0],1,PREREQ: ACC T 1.
3,ACCT 51: Intermediate Accounting,[5.0],1,Recommended Prep: ACC T 2.
6,ADMJ 52: Concepts of Criminal Law,[3.0],2,Recommended Prep: Readiness for college -level...
7,ADMJ 53: Legal Aspects of Evidence,[3.0],2,Recommended Prep: Readiness for college -level...
8,ADMJ 54: Principles and Procedures of the Just...,[3.0],2,Recommended Prep: Readiness for college -level...
8,ADMJ 54: Principles and Procedures of the Just...,[3.0],2,Recommended Prep: Readiness for college -level...
8,ADMJ 54: Principles and Procedures of the Just...,[3.0],2,Recommended Prep: Readiness for college -level...
8,ADMJ 54: Principles and Procedures of the Just...,[3.0],2,Recommended Prep: Completion of or concurrent ...
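
In the output above, index 8 is repeated once per advisory notice, and indices 4 and 5 are missing, which is consistent with `pd.merge` defaulting to an inner join. If one row per course is the goal, a possible approach is to combine the notices per index first and then do a left join. The sketch below uses toy data with made-up notice text, and joining the notices with `"; "` is just one choice.

```python
import pandas as pd

# Toy stand-ins: course 4 has no notice, course 8 has two
course_demo = pd.DataFrame(
    {"course": ["ACCT 1", "ADMJ 50", "ADMJ 54"], "units": [5.0, 3.0, 3.0]},
    index=[1, 4, 8],
)
notices = pd.Series(
    ["Recommended Prep: A", "Recommended Prep: B", "PREREQ: C"],
    index=[1, 8, 8],
    name="advisory_notice",
)

print(notices.index.is_unique)  # duplicated labels are what multiply the rows

# Collapse multiple notices for the same course into a single string
combined = notices.groupby(level=0).agg("; ".join)

merged = pd.merge(
    course_demo,
    combined,
    left_index=True,
    right_index=True,
    how="left",              # keep courses that have no notice
    validate="one_to_one",   # raise if either side still has duplicate keys
)
print(merged)
```

With `validate="one_to_one"`, pandas raises a `MergeError` instead of silently duplicating rows, which makes this kind of problem easier to spot.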
