### Author: Guilherme Resende

This notebook builds an alternative dataset that aggregates all subsections of each documentation into one file per documentation, following the structure of its `index.md`.

---

### Concatenation Strategy

The concatenation strategy assumes that every piece of documentation is listed in an index file (named `index.md`) for its subject. Before concatenating the subsections into their macro topic, we need to verify that every documentation directory has an `index.md` file, or at least some Markdown file.
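As a minimal sketch of that check, the helper below reports both whether a directory has an `index.md` and how many Markdown files it holds. The name `summarize_doc_dir` is hypothetical and not part of the original notebook, and the helper looks only at files directly inside the directory, not in subdirectories.

```python
import os


def summarize_doc_dir(path):
    """Return (has_index, n_markdown) for one documentation directory.

    Hypothetical helper: only regular files directly inside `path` are
    considered, so Markdown files in subdirectories are not counted.
    """
    files = [
        name for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    ]
    markdown_files = [name for name in files if name.endswith(".md")]
    return "index.md" in markdown_files, len(markdown_files)
```

A directory that returns `(False, 0)` has nothing to merge, while `(False, n)` with `n > 0` may be worth fixing by hand.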

In [None]:
import os
import re

from bs4 import BeautifulSoup
from markdown import markdown
from tqdm import tqdm

In [None]:
count = 0
no_index = []
for documentation in os.listdir("./awsdocs"):
    documents = os.listdir(f"./awsdocs/{documentation}")
    if "index.md" in documents:
        count += 1
    else:
        no_index.append(documentation)

In [None]:
print(f"There are {len(no_index)} directories/subjects without an `index.md` file")

Let's look at these directories, each with a short note on what was found:

In [None]:
explanations = [
    "[FIXED] There were version subdirectories. Inside those there is an index file.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
    "[LEFT-OUT] The documentation was moved to another directory.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
    "[LEFT-OUT] Empty directory.",
    "[FIXED] There was a subdirectory with all files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] Empty directory.",
    "[FIXED] There was a subdirectory with all files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] Empty directory.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] Empty directory.",
    "[FIXED] There was a subdirectory with all files.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
    "[LEFT-OUT] Empty directory.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[NEED-FIXING] No index, but many `.md` files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
    "[FIXED] There was a subdirectory with all files.",
    "[FIXED] There was a subdirectory with all files.",
    "[LEFT-OUT] There were no `.md` files.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
    "[FIXED] There was a subdirectory with all files.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
    "[FIXED] There was a subdirectory with all files.",
    "[LEFT-OUT] Empty directory.",
    "[LEFT-OUT] There were subdirectories often without `.md` files.",
]

for directory, explanation in zip(no_index, explanations):
    print("---------------------------------------------")
    print(directory)
    print(f"- {explanation}")

--- 
**Processing Decision**

Since there were only a few problematic cases, I fixed those that needed little effort and ignored the remaining pieces of documentation, which consist of either code examples or other file formats.

In [None]:
documentations_to_ignore = [
    "aws-example-apps",
    "amazon-connect-user-guide",
    "aws-doc-sdk-examples",
    "amazon-chime-user-guide",
    "aws-xamarin-developer-guide",
    "aws-freertos-docs",
    "aws-cpp-developer-guide",
    "GuidesPatternsSolutions",
    "aws-mobile-developer-guide",
    "aws-android-developer-guide",
    "aws-unity-developer-guide",
    "amazon-comprehend-developer-guide",
    "elastic-beanstalk-samples",
    "amazon-elasticache-docs",
    "aws-go-developer-guide",
    "aws-php-developers-guide",
    "amazon-lightsail-developer-guide",
    "aws-java-developer-guide",
    "aws-java-developer-guide-v2",
    "aws-doc-shared-content",
    "aws-ruby-developer-guide",
    "amazon-machinelearning-developer-guide",
    "aws-ios-developer-guide"
]

#### Build the Alternative Dataset

In [None]:
INPUT_DIR = "./awsdocs_toy/"
OUTPUT_DIR = "./awsdocs_merged/"

# Create the output directory; do not fail if the cell is re-run
os.makedirs(OUTPUT_DIR, exist_ok=True)

for documentation in tqdm(os.listdir(INPUT_DIR)):
    if documentation in documentations_to_ignore:
        continue
    
    documents = os.listdir(os.path.join(INPUT_DIR, documentation))

    name_to_content = {}
    for document in documents:
        if not os.path.isfile(os.path.join(INPUT_DIR, documentation, document)) or not document.endswith(".md"):
            continue
        
        with open(os.path.join(INPUT_DIR, documentation, document), 'r') as f:
            subsection_text = f.read()
            html = markdown(subsection_text)

        # Create a mapping from document name to its content
        name_to_content[document] = ''.join(BeautifulSoup(html, "html.parser").find_all(string=True))
    
    # Read the index file and find every occurrence of a [text](target) Markdown link.
    # Each link marks a place to insert the content of the referenced file
    with open(os.path.join(INPUT_DIR, documentation, "index.md"), 'r') as f:
        base_text = f.read()
        # Stop at the first "]" and ")" so that two links on one line are matched separately
        matches = re.findall(r"(\[[^\]]*\]\(([^)]*)\))", base_text)

    # Replace each matched link with the content of the file it references
    for match, filename in matches:
        if filename in name_to_content:
            base_text = base_text.replace(match, name_to_content[filename])

    document = documentation.replace('-', '_')
    # Converts the index file to plain text and writes the final version to the new directory
    with open(os.path.join(OUTPUT_DIR, f"{document}.txt"), 'w') as f:
        html = markdown(base_text)
        base_text = ''.join(BeautifulSoup(html, "html.parser").find_all(string=True))
        f.write(base_text)
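
To make the link-replacement step easier to follow, here is a minimal, self-contained sketch that runs it on an in-memory toy index instead of real files. The names `LINK_PATTERN`, `inline_links`, `toy_index` and `toy_files` are hypothetical and used only for illustration. The pattern stops at the first closing bracket and parenthesis, so each link on a line is matched on its own.

```python
import re

# Group 1 is the whole Markdown link, group 2 is its target (the file name).
LINK_PATTERN = re.compile(r"(\[[^\]]*\]\(([^)]*)\))")


def inline_links(index_text, name_to_content):
    """Replace each link whose target is a known file with that file's text."""
    for link, target in LINK_PATTERN.findall(index_text):
        if target in name_to_content:
            index_text = index_text.replace(link, name_to_content[target])
    return index_text


# Toy data: the index references three files, but only two are available.
toy_index = "# Guide\n+ [Intro](intro.md)\n+ [Setup](setup.md)\n+ [FAQ](faq.md)\n"
toy_files = {"intro.md": "Welcome.", "setup.md": "Install it."}

merged = inline_links(toy_index, toy_files)
print(merged)
```

Links whose target is not in the mapping, such as `faq.md` here, are left untouched, which mirrors the `if filename in name_to_content` check in the cell above.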