## Requirements

First, the GF markdown files must be present in the folder referenced by `GF_FOLDER`. They can be downloaded from their GitLab repository if you have access to it.

### Pip packages

* markdown
* beautifulsoup4

## Workflow

* Reads the top-level chapters of the GF from the structure file
* Walks each of these chapters and adds every subdirectory as a `Chapter` object in `CHAPTER_TREE`
* Collects the markdown files in each directory and stores them as `File` objects
* Saves the root `Chapter` tree to a `.pkl` file that holds the whole structure of the GF
* The last cells show how to use the tree
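
The `Chapter` and `File` classes come from a local `utils` module that is not shown here. The sketch below is a guess at a minimal version of them, based only on the attributes and methods the notebook calls (`add_child`, `print_tree`, `get_random_chapter`, `get_random_file`, `dir`, `name`, `children`, `files`, `parent`). The helper methods `tree_lines` and `all_chapters` are hypothetical, and the sketch leaves out the parsed `text` attribute that the real `File` class provides.

```python
import random


class File:
    """A markdown file belonging to a chapter (hypothetical stand-in)."""

    def __init__(self, name, dir):
        self.name = name  # file name, e.g. "index.md"
        self.dir = dir    # full path to the file


class Chapter:
    """A directory node in the chapter tree (hypothetical stand-in)."""

    def __init__(self, name, dir):
        self.name = name
        self.dir = dir
        self.parent = None
        self.children = []
        self.files = []

    def add_child(self, child):
        # Link the child to this node in both directions.
        child.parent = self
        self.children.append(child)

    def tree_lines(self, level=0):
        # One "- name" line per node, indented two spaces per level.
        lines = ["  " * level + "- " + self.name]
        for child in self.children:
            lines.extend(child.tree_lines(level + 1))
        return lines

    def print_tree(self):
        print("\n".join(self.tree_lines()))

    def all_chapters(self):
        # This node followed by every descendant, depth first.
        result = [self]
        for child in self.children:
            result.extend(child.all_chapters())
        return result

    def get_random_chapter(self):
        return random.choice(self.all_chapters())

    def get_random_file(self):
        # Assumption: returns None when the chapter has no files.
        return random.choice(self.files) if self.files else None
```

Keeping a `parent` reference on each node makes it possible to walk back up to the root from any chapter, which the directory traversal in the notebook relies on.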

In [1]:
import os
import markdown
from bs4 import BeautifulSoup
import pickle

from utils import Chapter, File

In [2]:
GF_FOLDER = os.path.join("C:", os.sep, "Users", "manuv", "OneDrive", "Documenten", "Code", "BAP", "BP_gitrepo", "BP_Info_Support", "docs")
STRUCTURE_FILE = "structure.md"
CHAPTER_TREE = Chapter("root",GF_FOLDER)

SAVE_TREE_DIR = os.path.join("C:", os.sep, "Users", "manuv", "OneDrive", "Documenten", "Code", "BAP", "BP_gitrepo", "BP_Info_Support", "data")

In [2]:
# E:\HOGENT\2022_2023\BA\docs
GF_FOLDER = os.path.join("E:", os.sep, "HOGENT", "2022_2023", "BA","docs")
STRUCTURE_FILE = "structure.md"
CHAPTER_TREE = Chapter("root",GF_FOLDER)

SAVE_TREE_DIR = os.path.join("E:", os.sep, "HOGENT", "2022_2023", "BA","docs")

In [None]:
# Read the structure file (STRUCTURE_FILE) from GF_FOLDER
structure_path = os.path.join(GF_FOLDER, STRUCTURE_FILE)
with open(structure_path, 'r', encoding='utf-8') as f:
    structure_md = f.read()

# Convert the markdown text to HTML
structure_file_html = markdown.markdown(structure_md)

In [30]:
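# Assumes that each chapter name is in an h2 element that contains a link
soup_structure = BeautifulSoup(structure_file_html, 'html.parser')
raw_chapters = soup_structure.find_all('h2')

for chapter in raw_chapters:
    link = chapter.find('a')
    if link is None:
        continue  # skip headings without a link
    # Strip the file name so only the chapter directory remains
    chapter_href = link.get('href').replace('index.md', '')
    # Drop the trailing slash and keep the last path component as the name
    name = os.path.basename(chapter_href[:-1])
    # Drop the first two characters of the href (assumed to be a leading "./")
    rel_dir = chapter_href[2:]
    CHAPTER_TREE.add_child(Chapter(name, os.path.join(CHAPTER_TREE.dir, rel_dir)))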
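
The string slicing above depends on every link having the same shape. As a minimal sketch, assuming the links look like `./<chapter>/index.md` (an assumption inferred from the slicing, not confirmed by the structure file), the same split can be done with `posixpath`, which always treats `/` as the separator regardless of the operating system. The function name `chapter_from_href` is hypothetical.

```python
import posixpath


def chapter_from_href(href):
    """Split a structure-file link into (chapter name, relative directory)."""
    # Drop the file name, so "./01-customer-needs/index.md" becomes
    # "./01-customer-needs".
    directory = posixpath.dirname(href)
    # normpath removes the leading "./" and any duplicate slashes.
    relative_dir = posixpath.normpath(directory)
    name = posixpath.basename(relative_dir)
    return name, relative_dir


print(chapter_from_href("./01-customer-needs/index.md"))
# → ('01-customer-needs', '01-customer-needs')
```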

In [31]:
CHAPTER_TREE.print_tree()

- root
  - 05-essence-model
  - 01-customer-needs
  - 02-delivery-guidance
  - 03-architecture-guidance
  - 04-technology-guidance


In [32]:
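# Adds every directory that is not in the blacklist to the CHAPTER_TREE.
# Note: os.walk already descends into all nested directories, so combined with
# the recursive call below, nested directories are added more than once.
blacklist = ["images", "documents"]
def visit_directory(chapter):
    start_dir = os.path.join(chapter.parent.dir, chapter.dir)
    for root, dirs, files in os.walk(start_dir):
        for dir_name in dirs:
            if dir_name not in blacklist:
                child = Chapter(dir_name, os.path.join(root, dir_name))
                chapter.add_child(child)
                visit_directory(child)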

In [33]:
for chapter in CHAPTER_TREE.children:
    visit_directory(chapter)

In [34]:
CHAPTER_TREE.print_tree()

- root
  - 05-essence-model
    - principles
  - 01-customer-needs
    - 01-business-value-and-scoping
  - 02-delivery-guidance
    - 01-delivery-approaches
    - 02-practices
      - adr-examples
      - specification-by-example
      - team-kickoff
        - 0-preparation
        - 1-meet
        - 2-team-building
        - 3-group-roles
        - 4-professional-roles
        - 5-project-goal
        - 6-personal-goals
        - 7-working-arrangements
      - team-topologies
      - 0-preparation
      - 1-meet
      - 2-team-building
      - 3-group-roles
      - 4-professional-roles
      - 5-project-goal
      - 6-personal-goals
      - 7-working-arrangements
    - 03-activities
      - 01-customer
      - 02-solution
      - 03-delivery
    - 04-phases
    - 05-roles
    - 06-remote-working
      - 02-tools-practices
      - 03-meetings
    - adr-examples
    - specification-by-example
    - team-kickoff
      - 0-preparation
      - 1-meet
      - 2-team-building
      - 3-group
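
The printed tree lists some directories more than once: `0-preparation` through `7-working-arrangements` appear under `team-kickoff` and again directly under `02-practices`. This happens because `os.walk` already descends into every nested directory while `visit_directory` also recurses. A minimal sketch of a traversal that visits each directory exactly once is shown below. It uses nested dictionaries instead of `Chapter` objects so that it runs on its own, and the names `list_subdirectories` and `build_tree` are hypothetical.

```python
import os

BLACKLIST = {"images", "documents"}


def list_subdirectories(directory, blacklist=BLACKLIST):
    """Return the sorted names of the immediate subdirectories of `directory`."""
    with os.scandir(directory) as entries:
        return sorted(
            entry.name
            for entry in entries
            if entry.is_dir() and entry.name not in blacklist
        )


def build_tree(directory, blacklist=BLACKLIST):
    """Map each subdirectory name to its own nested subtree, visiting each once."""
    return {
        name: build_tree(os.path.join(directory, name), blacklist)
        for name in list_subdirectories(directory, blacklist)
    }
```

Inside `visit_directory`, the same idea would mean replacing the `os.walk` loop with a loop over the immediate subdirectories only and keeping the recursive call.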

In [35]:
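# File names to skip when collecting markdown files (none at the moment).
# Note: this rebinds the directory blacklist defined in the earlier cell.
blacklist = []
def get_markdown_files(directory):
    """Return a File object for every markdown file directly inside `directory`."""
    markdown_files = []
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)
        if os.path.isfile(file_path) and file_name.endswith(".md") and file_name not in blacklist:
            markdown_files.append(File(file_name, file_path))
    return markdown_files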

In [36]:
def add_markdown_files(chapter):
    chapter.files = get_markdown_files(chapter.dir)
    for child in chapter.children:
        add_markdown_files(child)

In [37]:
# Adds markdown files to every Chapter in the tree. Each Chapter gets the markdown files found directly in its own directory.
add_markdown_files(CHAPTER_TREE)

In [38]:
# Save structure tree
# with open(os.path.join(SAVE_TREE_DIR,"gf_structure.pkl"), "wb") as f:
#     pickle.dump(CHAPTER_TREE, f)

# Load structure tree
with open(os.path.join(SAVE_TREE_DIR,"gf_structure.pkl"), "rb") as f:
    root = pickle.load(f)
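
Saving and loading can be wrapped in two small helpers, sketched below with the hypothetical names `save_tree` and `load_tree`. Pickle stores classes by reference to their module, so `utils.Chapter` and `utils.File` must be importable in the session that loads the file.

```python
import pickle


def save_tree(tree, path):
    """Serialize `tree` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(tree, f)


def load_tree(path):
    """Load a tree previously written by save_tree.

    Only unpickle files you created yourself: pickle can execute
    arbitrary code while loading.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```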

In [18]:
def find_chapter_by_dir(root, dir):
    if root.dir == dir:
        return root
    for child in root.children:
        found_chapter = find_chapter_by_dir(child, dir)
        if found_chapter:
            return found_chapter
    return None
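
`find_chapter_by_dir` compares paths as exact strings, so a trailing slash or a doubled separator in the argument makes the lookup fail. The sketch below is one possible variant that normalizes both sides first. It is not the notebook's method: the names `normalize` and `find_chapter_by_normalized_dir` are hypothetical, and `SimpleNamespace` objects stand in for `Chapter` nodes so the example runs on its own. On Windows, `os.path.normpath` also converts `/` to `\`, which helps when paths mix both separators.

```python
import os
from types import SimpleNamespace


def normalize(path):
    """Collapse redundant separators and apply platform case rules."""
    return os.path.normcase(os.path.normpath(path))


def find_chapter_by_normalized_dir(root, target_dir):
    """Depth-first search for the node whose dir matches `target_dir`."""
    target = normalize(target_dir)
    stack = [root]
    while stack:
        node = stack.pop()
        if normalize(node.dir) == target:
            return node
        # Reverse so children are visited in their original order.
        stack.extend(reversed(node.children))
    return None


# Stand-in nodes for illustration only.
leaf = SimpleNamespace(dir="docs/02-delivery-guidance/02-practices", children=[])
top = SimpleNamespace(dir="docs", children=[leaf])
match = find_chapter_by_normalized_dir(top, "docs/02-delivery-guidance/02-practices/")
print(match is leaf)
```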

In [20]:
rnd_chapter = CHAPTER_TREE.get_random_chapter()
print(f'Chapter Name: {rnd_chapter.name}')
rnd_file = rnd_chapter.get_random_file()
print(f'File Name: {rnd_file.name}')
print(f'File Dir: {rnd_file.dir}')
print('Structured Text in file')
rnd_file.text

Chapter Name: 02-java
File Name: index.md
File Dir: C:\Users\manuv\OneDrive\Documenten\Code\BAP\BP_gitrepo\BP_Info_Support\docs\04-technology-guidance/02-application-development\03-frameworks\02-java\index.md
Structured Text in file


{'General': ["Oracle's supported JDK for customers is the OracleJDK, but this is just one of the many JDKs available for use. In this article the most common JDKs are listed with their characteristics, as well as their release schedules.Note:Don't mix different versions and/or flavors of JDK, although they all implement the same requirements there might be small differences which could lead to unexpected behavior."],
 'JDKs': ["Oracle's supported JDK for customersis the OracleJDK, but there are alternatives. Look at the chapters for more information, but see the next guideline about versioning, which is true for multiple JDKs.Guideline:If the development team is in charge of the operations, and is able to keep up with newer versions, use the latest version of the JDK.Why:Using the latest version makes updating to newer versions easier.",
  'Newer versions often contain bug fixes, security fixes, optimizations and features to the platform.Guideline:If the application is handed over to a