# Preprocessing PDF - TXT - BAML - JSON

This notebook is used to perform the extraction from the module catalogs.

**How to use this notebook?**
- Set up the lists `splits` and `splits_modules`. The first holds the (starting) page numbers at which to split the PDF. The second holds, for each split, a 0, 1 or 2 indicating what the section starting at that page contains:
    - 0 --> module overview
    - 1 --> module details
    - 2 --> anything else (not used at the moment)
- run `folder = process_pdf(splits, splits_modules, PATH_TO_FILE, CATALOG_ABBREVIATION)`
- run all the following cells as they are defined 
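
The cells below build both lists from a single `split_dict` via `dict_to_lists`, which is imported from `scripts` and not shown in this notebook. As a rough illustration only, the conversion could look like the sketch below; `dict_to_lists_sketch` is a hypothetical stand-in, and the validation it performs is an assumption rather than the behaviour of the real helper.

```python
def dict_to_lists_sketch(split_dict):
    """Hypothetical stand-in for scripts.dict_to_lists (real implementation not shown here).

    Keys are 1-based starting pages, values are the section codes:
    0 = module overview, 1 = module details, 2 = anything else.
    """
    splits = sorted(split_dict)  # starting pages in ascending order
    if splits and splits[0] < 1:
        raise ValueError("page numbers are 1-based; the first page is 1, not 0")

    splits_modules = [split_dict[page] for page in splits]
    unknown_codes = set(splits_modules) - {0, 1, 2}
    if unknown_codes:
        raise ValueError(f"unknown section codes: {sorted(unknown_codes)}")

    return splits, splits_modules


splits_demo, splits_modules_demo = dict_to_lists_sketch({1: 0, 3: 2, 4: 0, 6: 1})
print(splits_demo, splits_modules_demo)
```

Keeping page and code together in one dict makes it harder for the two lists to drift out of sync when a split is added or removed.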


**The progress so far:**

- MMDS (wima_wifo/MK_MMDS_2024_2025_14.06.2024) DONE
- MMM (bwl/Module_Catalog_Mannheim_Master_in_Management_en) DONE
- WIFO (wima_wifo/MK_MSc_Wifo__2024_25_16.10.2024.pdf) DONE
- WIMA (Modulkatalog_Master_Wima_Mathe_2022_23) NEXT

In [1]:
%load_ext autoreload
%autoreload 2

import sys
import os
import dotenv
import json
import numpy as np

sys.path.append(os.path.abspath("./scripts"))
from scripts import process_pdf, merge_additional_files, process_catalog_overview, process_module_list, merge_json_files, process_module_files, dict_to_lists, fix_module_file

import baml_client as client
from baml_client import reset_baml_env_vars

dotenv.load_dotenv()
reset_baml_env_vars(dict(os.environ))

In [None]:
# EXAMPLE USAGE: extract text with splits
# List where to split the PDF. First page is 1, not 0!
# MMDS
split_dict = {
    1: 0,
    3: 2,
    4: 0,
    6: 1,
    14: 0,
    15: 1,
    30: 1,
    44: 0,
    45: 2,
    46: 1,
    61: 1,
    77: 0,
    78: 1,
    86: 0,
    87: 1,
    102: 0,
    103: 1,
    120: 1,
    138: 1,
    140: 2
}
splits, splits_modules = dict_to_lists(split_dict)

# splits = [1, 6, 14, 15, 44, 46, 77, 78, 86, 87, 102, 103, 138, 140]
# splits_modules = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0]

folder = process_pdf(splits, splits_modules, './downloads/wima_wifo/MK_MMDS_2024_2025_20.08.2024.pdf', "MMDS")
print(folder)

# next run the other predefined cells to perform the BAML extraction

In [None]:
# MMM (BWL)
split_dict = {
    1: 2,
    5: 0,
    7: 2, 
    9: 0, 
    20: 1,
    26: 1,
    41: 1,
    64: 1,
    79: 1,
    95: 1,
    115: 1,
    128: 1,
    145: 1,
    166: 1,
    183: 1,
    197: 1,
    210: 1,
    225: 1,
    248: 2,
    252: 2,
    291: 1
}

splits, splits_modules = dict_to_lists(split_dict)
folder = process_pdf(splits, splits_modules, "./downloads/bwl/Module_Catalog_Mannheim_Master_in_Management_en.pdf", "MMM")

In [10]:
# WIFO
split_dict = {
    1: 0,
    3: 2,
    4: 0,
    6: 1,
    21: 0,
    25: 1,
    37: 1,
    51: 1,
    65: 1,
    75: 2,
    76: 1,
    81: 0,
    83: 1,
    102: 1,
    120: 0,
    121: 1,
    123: 2
}
splits, splits_modules = dict_to_lists(split_dict)

# splits = [1, 6, 21, 25, 51, 75, 76, 81, 83, 120, 121, 123]
# splits_modules = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0]
folder = process_pdf(splits, splits_modules, './downloads/wima_wifo/MK_MSc_Wifo__2024_25_16.10.2024.pdf', 'WIFO')

In [None]:
# WIMA
splits = []
splits_modules = []
folder = process_pdf(splits, splits_modules, './downloads/wima_wifo/Modulkatalog_Master_Wima_Mathe_2022_23.pdf', 'WIMA')

In [9]:
# check split sizes: number of pages from each split to the next one
# (the last split has no successor, so it does not appear in the result)
split_pages = list(split_dict.keys())
split_dict_filled = {start: nxt - start for start, nxt in zip(split_pages, split_pages[1:])}
split_dict_filled

{1: 2,
 3: 1,
 4: 2,
 6: 15,
 21: 4,
 25: 12,
 37: 14,
 51: 14,
 65: 10,
 75: 1,
 76: 5,
 81: 2,
 83: 19,
 102: 18,
 120: 1,
 121: 2}
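
The size check above leaves out the last split, because it has no following split page to subtract from. A minimal sketch that turns the split pages into inclusive page ranges, including the final one, is given below. It assumes the total page count of the PDF is known; `split_page_ranges` and `last_page` are illustrative names and not part of `scripts`.

```python
def split_page_ranges(splits, last_page):
    """Return inclusive (first_page, last_page) ranges for each split.

    `splits` are 1-based starting pages; `last_page` is the assumed total
    number of pages in the PDF.
    """
    if any(b <= a for a, b in zip(splits, splits[1:])):
        raise ValueError("split pages must be strictly increasing")
    if splits and last_page < splits[-1]:
        raise ValueError("last_page lies before the final split page")

    # Each section ends one page before the next section starts;
    # the final section runs to the end of the document.
    ends = [nxt - 1 for nxt in splits[1:]] + [last_page]
    return list(zip(splits, ends))


for first, last in split_page_ranges([1, 3, 4, 6], last_page=20):
    print(f"pages {first}-{last}: {last - first + 1} page(s)")
```

A very short or very long range is a quick hint that a split page was mistyped.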

In [11]:
# Merge all files that are not related to modules
merge_additional_files(folder)

All '_additional_other' files have been merged into output/20241120_212121_WIFO/merged_additional_other.txt
All '_additional_overview' files have been merged into output/20241120_212121_WIFO/merged_additional_overview.txt


In [20]:
# Extract study program overview from 'additional' file using BAML
process_catalog_overview(folder)

In [13]:
# Extract content from module files using BAML
process_module_files(folder)

['split_7_modules.txt', 'split_14_modules.txt', 'split_8_modules.txt', 'split_11_modules.txt', 'split_4_modules.txt', 'split_16_modules.txt', 'split_13_modules.txt', 'split_6_modules.txt', 'split_9_modules.txt']


Processing module files:   0%|          | 0/9 [00:00<?, ?it/s]

Now processing file: output/20241120_212121_WIFO/split_7_modules.txt


Processing module files:  11%|█         | 1/9 [00:25<03:24, 25.54s/it]

Now processing file: output/20241120_212121_WIFO/split_14_modules.txt


Processing module files:  22%|██▏       | 2/9 [00:58<03:29, 29.96s/it]

Now processing file: output/20241120_212121_WIFO/split_8_modules.txt


Processing module files:  33%|███▎      | 3/9 [01:29<03:01, 30.26s/it]

Now processing file: output/20241120_212121_WIFO/split_11_modules.txt


Processing module files:  44%|████▍     | 4/9 [01:42<01:56, 23.38s/it]

Now processing file: output/20241120_212121_WIFO/split_4_modules.txt


Processing module files:  56%|█████▌    | 5/9 [02:06<01:35, 23.88s/it]

Now processing file: output/20241120_212121_WIFO/split_16_modules.txt


Processing module files:  67%|██████▋   | 6/9 [02:10<00:51, 17.11s/it]

Now processing file: output/20241120_212121_WIFO/split_13_modules.txt


Processing module files:  78%|███████▊  | 7/9 [02:37<00:40, 20.25s/it]

Now processing file: output/20241120_212121_WIFO/split_6_modules.txt


Processing module files:  89%|████████▉ | 8/9 [03:00<00:21, 21.12s/it]

Now processing file: output/20241120_212121_WIFO/split_9_modules.txt


Processing module files: 100%|██████████| 9/9 [03:27<00:00, 23.05s/it]


In [14]:
# Merge extracted modules to one file
merge_json_files(folder)

Merged JSON file created at: output/20241120_212121_WIFO/merged_modules.json


## Other code for manual fixes etc.

In [None]:
# folder = './output/20241118_222726_MMDS'
# file = str(folder) + '/split_2_modules.txt'
# file

In [None]:
# process_module_list(file)

In [22]:
# REVIEW THE IDS IN CATALOG VS IN MODULE LIST

folder_review = folder 
# folder_review = 'output/20241120_125015_MMM'


# Get the file names from the folder
merged_modules_file = os.path.join(folder_review, 'merged_modules.json')
catalog_overview_file = os.path.join(folder_review, 'catalog_overview.json')

# print number of modules in detailed list
with open(merged_modules_file, 'r') as f:
    data_modules = json.load(f)

modules_len = len(data_modules['modules'])
print("number of modules in modules list:", modules_len)

# print number of modules in overview
with open(catalog_overview_file, 'r') as f:
    data_overview = json.load(f)
combined_length = sum(len(area['modules']) for area in data_overview['studyArea'])
print("number of modules in overview:", combined_length)


# create a list of the ids in the overview
overview_ids = [module['id'] for area in data_overview['studyArea'] for module in area['modules']]
# print(overview_ids)

# create a list of the ids in the data_modules
modules_ids = [module['id'] for module in data_modules['modules']]
# print(modules_ids)

# Compare the lists and print the ids that are in one list but not in the other
# (loop variable renamed so it does not shadow the built-in `id`)
overview_ids_set = set(module_id.upper() for module_id in overview_ids)
modules_ids_set = set(module_id.upper() for module_id in modules_ids)

# IDs in overview but not in modules
ids_in_overview_not_in_modules = overview_ids_set - modules_ids_set
print(f"{len(ids_in_overview_not_in_modules)} IDs in overview but not in modules:", ids_in_overview_not_in_modules)

# IDs in modules but not in overview
ids_in_modules_not_in_overview = modules_ids_set - overview_ids_set
print(f"{len(ids_in_modules_not_in_overview)} IDs in modules but not in overview:", ids_in_modules_not_in_overview)

with open(os.path.join(folder_review, 'information.txt'), 'w') as info_file:
    info_file.write(f"Number of modules in modules list: {modules_len}\n")
    info_file.write(f"Number of modules in overview: {combined_length}\n")
    info_file.write(f"{len(ids_in_overview_not_in_modules)} IDs in overview but not in modules:\n{ids_in_overview_not_in_modules}\n")
    info_file.write(f"{len(ids_in_modules_not_in_overview)} IDs in modules but not in overview:\n{ids_in_modules_not_in_overview}\n")

number of modules in modules list: 53
number of modules in overview: 54
4 IDs in overview but not in modules: {'IS_752', 'IS_742', 'IS_751', 'IS_722'}
3 IDs in modules but not in overview: {'BI_656', 'DS_203', 'MAC_570'}


In [None]:
# openai pls help fix this mess
from openai import OpenAI

# OpenAI() reads OPENAI_API_KEY from the environment (loaded by dotenv in the first cell).
# The client is named openai_client so that it does not shadow `baml_client`,
# which the first cell imports as `client`.
openai_client = OpenAI()

def fix_with_openai(ids_overview, ids_modules):
    """Ask the model to map each incorrect id (ids_modules) to a correct one (ids_overview)."""
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant for data processing. You respond with dicts or lists but do not use explanatory text."},
            {
                "role": "user",
                "content": f"""
                    It is your task to compare two lists of ids. There is one correct list and one incorrect one. Some of the ids in the incorrect list might match an id of the correct list.
                    It is your task to fix the incorrect list. To do so, return a dict (with "") where the key is the incorrect id and the value is a matching corrected id that is present in the correct list.
                    Note that it can be possible that there is no option to correct the id; in this case just write None (without "") in the value.
                    This is the list with the correct ids: {ids_overview}
                    This is the list with the incorrect ids: {ids_modules}
                    """
            }
        ]
    )

    # The reply is plain text that still has to be parsed into a dict
    return completion.choices[0].message.content

# print(completion.choices[0].message.content)

**Fix IDs in modules (assumes that the overview is correct)**

In [None]:
# Fix ids in modules: map the ids that appear in the modules but not in the overview
# to their counterparts in the overview. The overview is treated as the source of truth.
fix_dict = fix_with_openai(ids_in_overview_not_in_modules, ids_in_modules_not_in_overview)
print(fix_dict)
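
Since the model's answer is not deterministic, it can help to cross-check it against a purely local suggestion. The sketch below uses string similarity from the standard library's `difflib` instead of an LLM. This is a swapped-in technique and not what the notebook does; `suggest_id_fixes` and the `cutoff` value are assumptions chosen for illustration.

```python
import difflib


def suggest_id_fixes(incorrect_ids, correct_ids, cutoff=0.8):
    """Map each incorrect id to its closest correct id, or None if nothing is close enough.

    Note the argument order: incorrect ids first, then the trusted ids.
    """
    candidates = sorted(correct_ids)
    fixes = {}
    for bad_id in sorted(incorrect_ids):
        # get_close_matches returns up to n candidates whose similarity ratio is >= cutoff
        matches = difflib.get_close_matches(bad_id, candidates, n=1, cutoff=cutoff)
        fixes[bad_id] = matches[0] if matches else None
    return fixes


# Illustrative ids only, not taken from a real catalog
print(suggest_id_fixes({"CS 500", "XYZ"}, {"CS_500", "IS_722"}))
```

Where the two suggestions disagree, the id is a good candidate for a manual look at the PDF.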

In [None]:
import ast

# Reset first so that a failed parse never leaves a stale mapping from an earlier run
parsed_dict = {}
try:
    parsed_dict = ast.literal_eval(fix_dict)
    # Replace string "None" with actual Python None
    parsed_dict = {k: (None if v == "None" else v) for k, v in parsed_dict.items()}
    print(parsed_dict)
except Exception as e:
    parsed_dict = {}
    print(f"Error parsing string to dictionary: {e}")

len(parsed_dict)
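
Chat models sometimes wrap their answer in a Markdown code fence or return JSON (`null`) instead of a Python literal (`None`), and `ast.literal_eval` alone rejects both. A slightly more tolerant parser could look like the sketch below; `parse_fix_dict` is a hypothetical helper, and the reply formats it handles are assumptions about what the model might send.

```python
import ast
import json
import re


def parse_fix_dict(raw):
    """Parse a model reply into {incorrect_id: corrected_id or None}."""
    text = raw.strip()

    # Drop a surrounding Markdown code fence (with optional language tag) if present
    fenced = re.search(r"```(?:\w+)?\s*(.*?)```", text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()

    try:
        parsed = json.loads(text)        # handles {"A": null}
    except json.JSONDecodeError:
        parsed = ast.literal_eval(text)  # handles {"A": None}

    if not isinstance(parsed, dict):
        raise ValueError(f"expected a dict, got {type(parsed).__name__}")

    # Normalise the different spellings of "no match" to a real None
    return {
        str(key): (None if value in (None, "None", "null") else str(value))
        for key, value in parsed.items()
    }


reply = '```python\n{"BI_656": "IS_752", "DS_203": None}\n```'
print(parse_fix_dict(reply))
```

Raising on a non-dict reply, instead of printing and continuing, keeps a bad answer from being applied to the JSON files further down.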

In [None]:
# Open the merged_modules.json file
with open(merged_modules_file, 'r') as f:
    merged_modules_data = json.load(f)

# Iterate through the list of dictionaries under the key 'modules'.
# The keys of parsed_dict come from the upper-cased id sets built in the review cell,
# so the lookup uses the upper-cased module id as well.
for module in merged_modules_data['modules']:
    new_id = parsed_dict.get(module['id'].upper())
    if new_id is not None:
        module['id'] = new_id

# Save the updated data back to the file
with open(merged_modules_file, 'w') as f:
    json.dump(merged_modules_data, f, indent=4)

In [None]:
# # Open the merged_modules.json file
# with open(merged_modules_file, 'r') as f:
#     merged_modules_data = json.load(f)

# # Iterate through the list of dictionaries under the key 'modules'
# for module in merged_modules_data['modules']:
#     module_id = module['id']
#     if module_id in parsed_dict:
#         new_id = parsed_dict[module_id]
#         if new_id is not None:
#             module['id'] = new_id

#     # Check and update ids in requiredPrerequisiteModules
#     updated_required_prerequisites = []
#     for prereq_id in module['requiredPrerequisiteModules']:
#         if prereq_id in parsed_dict:
#             new_prereq_id = parsed_dict[prereq_id]
#             if new_prereq_id is not None:
#                 updated_required_prerequisites.append(new_prereq_id)
#             else:
#                 updated_required_prerequisites.append(prereq_id)
#         else:
#             updated_required_prerequisites.append(prereq_id)
#     module['requiredPrerequisiteModules'] = updated_required_prerequisites

#     # Check and update ids in optionalPrerequisiteModules
#     updated_optional_prerequisites = []
#     for prereq_id in module['optionalPrerequisiteModules']:
#         if prereq_id in parsed_dict:
#             new_prereq_id = parsed_dict[prereq_id]
#             if new_prereq_id is not None:
#                 updated_optional_prerequisites.append(new_prereq_id)
#             else:
#                 updated_optional_prerequisites.append(prereq_id)
#         else:
#             updated_optional_prerequisites.append(prereq_id)
#     module['optionalPrerequisiteModules'] = updated_optional_prerequisites

#     # Check and update ids in furtherModules
#     updated_further_modules = []
#     for further_module_id in module['furtherModules']:
#         if further_module_id in parsed_dict:
#             new_further_module_id = parsed_dict[further_module_id]
#             if new_further_module_id is not None:
#                 updated_further_modules.append(new_further_module_id)
#             else:
#                 updated_further_modules.append(further_module_id)
#         else:
#             updated_further_modules.append(further_module_id)
#     module['furtherModules'] = updated_further_modules

# # Save the updated data back to the file
# with open(merged_modules_file, 'w') as f:
#     json.dump(merged_modules_data, f, indent=4)
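
The commented-out cell above repeats the same replacement loop for three fields. If it is ever revived, the repetition could be folded into a small helper, as in the sketch below. The field names are taken from that cell; `remap_ids`, `apply_id_fixes` and the sample data are illustrative assumptions.

```python
LINK_FIELDS = (
    "requiredPrerequisiteModules",
    "optionalPrerequisiteModules",
    "furtherModules",
)


def remap_ids(ids, mapping):
    """Replace each id that has a non-empty correction in `mapping`; keep all others."""
    return [mapping.get(module_id) or module_id for module_id in ids]


def apply_id_fixes(modules, mapping):
    """Update the module ids and every cross-reference list in place."""
    for module in modules:
        module["id"] = mapping.get(module["id"]) or module["id"]
        for field in LINK_FIELDS:
            if module.get(field):  # skip missing or empty lists
                module[field] = remap_ids(module[field], mapping)
    return modules


# Made-up sample data for illustration
sample = [{
    "id": "A",
    "requiredPrerequisiteModules": ["B", "C"],
    "optionalPrerequisiteModules": [],
    "furtherModules": ["D"],
}]
print(apply_id_fixes(sample, {"A": "A1", "B": None, "D": "D1"}))
```

A correction of `None` means "no match found", so the original id is kept, which mirrors the behaviour of the loops above.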

**Fix IDs in overview (assumes that the modules list is correct)**

In [None]:
fix_dict = fix_with_openai(ids_in_modules_not_in_overview, ids_in_overview_not_in_modules)
print(fix_dict)

In [None]:
import ast

# Reset first so that a failed parse never leaves a stale mapping from an earlier run
parsed_dict = {}
try:
    parsed_dict = ast.literal_eval(fix_dict)
    # Replace string "None" with actual Python None
    parsed_dict = {k: (None if v == "None" else v) for k, v in parsed_dict.items()}
    print(parsed_dict)
except Exception as e:
    parsed_dict = {}
    print(f"Error parsing string to dictionary: {e}")

len(parsed_dict)

In [None]:
# Open the catalog overview file
with open(catalog_overview_file, 'r') as f:
    catalog_overview_data = json.load(f)

# Go through all the modules and check if the id is in the keys of the parsed_dict.
# The keys come from the upper-cased id sets built in the review cell,
# so the lookup uses the upper-cased module id as well.
for study_area in catalog_overview_data['studyArea']:
    for module in study_area['modules']:
        new_id = parsed_dict.get(module['id'].upper())
        if new_id is not None:
            module['id'] = new_id

# Save the updated data back to the file
with open(catalog_overview_file, 'w') as f:
    json.dump(catalog_overview_data, f, indent=4)

**Fix a module file where not all modules were extracted correctly**

In [None]:
fix_module_file('output/20241120_125015_MMM/split_13_modules.txt')