### Index Zoomcamp FAQ documents
- `DE Zoomcamp`: https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit
- `ML Zoomcamp`: https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit
- `MLOps Zoomcamp`: https://docs.google.com/document/d/12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0/edit
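
The file IDs used later in this notebook are the path segment that follows `/document/d/` in each link above. Below is a minimal sketch of pulling them out with a regular expression, assuming that URL shape. `extract_file_id` and `faq_urls` are hypothetical names introduced here, not part of the notebook.

```python
import re

# Assumed URL shape: https://docs.google.com/document/d/<file_id>/...
DOC_ID_PATTERN = re.compile(r'/document/d/([A-Za-z0-9_-]+)')


def extract_file_id(url):
    """Return the file ID embedded in a Google Docs URL (hypothetical helper)."""
    match = DOC_ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f'No Google Docs file ID found in: {url}')
    return match.group(1)


# The three FAQ links listed above, keyed by course name
faq_urls = {
    'data-engineering-zoomcamp': 'https://docs.google.com/document/d/19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw/edit',
    'machine-learning-zoomcamp': 'https://docs.google.com/document/d/1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8/edit',
    'mlops-zoomcamp': 'https://docs.google.com/document/d/12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0/edit',
}

file_ids = {course: extract_file_id(url) for course, url in faq_urls.items()}
```

The character class includes `-` and `_` because Google Docs IDs can contain them, as the ML Zoomcamp ID does.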

### Installing Necessary Packages

In [7]:
!python3 -m pip install --upgrade pip
!pip install tqdm notebook==7.1.2 openai elasticsearch==8.13.0 pandas python-docx==1.1.2 scikit-learn ipywidgets --quiet



### Importing Necessary Libraries

In [11]:
import io
import docx
import json
import requests

### High-Level Overview
These two functions work together to **extract FAQ entries** from a **Google Docs document**, assuming the document marks sections and questions with specific heading styles:

- `clean_line(line)`: Trims surrounding whitespace and removes the Unicode byte order mark (BOM, `\uFEFF`) from a line of text.
- `read_faq(file_id)`: Given a Google Docs file ID, downloads the document, parses it as a `.docx` file, and returns a list of FAQ items. Each item has a section (taken from a "Heading 1" paragraph), a question (taken from a "Heading 2" paragraph), and an answer (the text that follows the question).

In [9]:
def clean_line(line):
    """
    Cleans a line of text by stripping leading/trailing whitespace and 
    removing the Unicode Byte Order Mark (BOM) character if present.
    
    Args:
        line (str): A single line of text.
    
    Returns:
        str: The cleaned line of text.
    """
    # Strip surrounding whitespace, then the BOM character (commonly found at
    # the start of UTF-8 encoded files), then any whitespace the BOM was hiding
    line = line.strip().strip('\uFEFF').strip()
    
    return line


def read_faq(file_id):
    """
    Downloads and parses a Google Docs document (in .docx format) using the provided file ID.
    Extracts FAQs from the document where:
      - Heading 1 denotes a section title.
      - Heading 2 denotes a question.
      - Normal text following a question is the answer.
    
    Args:
        file_id (str): The unique identifier of the Google Docs file.
    
    Returns:
        list[dict]: A list of dictionaries where each dictionary contains:
                    - 'section': The section the question belongs to.
                    - 'question': The question text.
                    - 'text': The corresponding answer text.
    """
    # Construct URL to export the Google Doc as a .docx file
    url = f'https://docs.google.com/document/d/{file_id}/export?format=docx'
    
    # Download the document content; the timeout keeps a stalled request from hanging
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise an exception if the download failed

    # Load the document into memory and parse with python-docx
    with io.BytesIO(response.content) as f_in:
        doc = docx.Document(f_in)

    questions = []  # List to hold extracted FAQ entries

    # Define the expected styles for section and question headings
    question_heading_style = 'heading 2'
    section_heading_style = 'heading 1'
    
    # Track current context while iterating through paragraphs
    section_title = ''
    question_title = ''
    answer_text_so_far = ''
     
    for p in doc.paragraphs:
        style = p.style.name.lower()  # Get paragraph style name
        p_text = clean_line(p.text)   # Clean paragraph text
    
        if len(p_text) == 0:
            continue  # Skip empty lines
    
        if style == section_heading_style:
            # Update current section title
            section_title = p_text
            continue
    
        if style == question_heading_style:
            # Save previous question-answer pair before starting a new one
            answer_text_so_far = answer_text_so_far.strip()
            if answer_text_so_far != '' and section_title != '' and question_title != '':
                questions.append({
                    'text': answer_text_so_far,
                    'section': section_title,
                    'question': question_title,
                })
                answer_text_so_far = ''  # Reset answer buffer
    
            # Start a new question
            question_title = p_text
            continue
        
        # Accumulate text for the current answer
        answer_text_so_far += '\n' + p_text
    
    # Handle the last question-answer pair after the loop
    answer_text_so_far = answer_text_so_far.strip()
    if answer_text_so_far != '' and section_title != '' and question_title != '':
        questions.append({
            'text': answer_text_so_far,
            'section': section_title,
            'question': question_title,
        })

    return questions
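
The grouping logic in the loop above can be tried without downloading anything. As written, `read_faq` clears the answer buffer only when it stores a record, so text that precedes the first question is carried into the first answer, and the last question of a section is stored under the next section's title. The sketch below is a variant, not the notebook's method: it flushes on every heading and ignores text that appears before any question. `group_faq` and the sample paragraphs are made up for illustration.

```python
def group_faq(paragraphs):
    """Group (style, text) pairs into FAQ records (hypothetical variant of read_faq's loop)."""
    records = []
    section, question, answer_lines = '', '', []

    def flush():
        # Store the pending question only if it has a section, a title, and an answer
        answer = '\n'.join(answer_lines).strip()
        if section and question and answer:
            records.append({'text': answer, 'section': section, 'question': question})
        answer_lines.clear()

    for style, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # Skip empty paragraphs
        if style == 'heading 1':
            flush()  # Close the open question under the section it belongs to
            section, question = text, ''
        elif style == 'heading 2':
            flush()
            question = text
        elif question:
            answer_lines.append(text)  # Body text counts only once a question is open

    flush()  # Store the final question
    return records


# Made-up paragraphs standing in for (p.style.name.lower(), p.text)
sample_paragraphs = [
    ('title', 'Course FAQ'),
    ('normal', 'Editing guidelines go here.'),
    ('heading 1', 'General'),
    ('heading 2', 'When does the course start?'),
    ('normal', 'See the course page.'),
    ('normal', 'Registration is open.'),
    ('heading 1', 'Module 1'),
    ('heading 2', 'Docker fails to start'),
    ('normal', 'Restart the daemon.'),
]

records = group_faq(sample_paragraphs)
```

Working on plain `(style, text)` tuples keeps the grouping step separate from the download and `.docx` parsing, which makes it easy to check on small inputs.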


The code block below **fetches and parses the FAQ documents for several courses** using their Google Docs file IDs. It collects each course's FAQ entries into a single list and prints each course name as it goes.

In [10]:
# Map each course name to the Google Docs file ID of its FAQ document.
faq_documents = {
    'data-engineering-zoomcamp': '19bnYs80DwuUimHM65UV3sylsCn2j1vziPOwzBwQrebw',
    'machine-learning-zoomcamp': '1LpPanc33QJJ6BSsyxVg-pWNMplal84TdZtq10naIhD8',
    'mlops-zoomcamp': '12TlBfhIiKtyBv8RnsoJR6F72bkPDGEvPOItJIxaEzE0',
}

# Collect the parsed FAQ data for all courses here.
documents = []

for course, file_id in faq_documents.items():
    print(course)
    # read_faq() (defined above) returns the list of FAQ entries for this document.
    course_documents = read_faq(file_id)
    documents.append({'course': course, 'documents': course_documents})

data-engineering-zoomcamp
machine-learning-zoomcamp
mlops-zoomcamp


In [12]:
# Write the extracted FAQ data to documents.json.

with open('documents.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)
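
The saved structure is nested: one object per course, each holding its own `documents` list. Consumers that want one record per question may find it handy to flatten it. Below is a minimal sketch of that step on made-up data of the same shape; `flatten` and the sample entries are hypothetical, and the JSON round trip stands in for reading `documents.json` back.

```python
import json

# Made-up sample with the same shape as the `documents` list written above
sample = [
    {
        'course': 'course-a',
        'documents': [
            {'text': 'Answer 1', 'section': 'General', 'question': 'Question 1'},
            {'text': 'Answer 2', 'section': 'General', 'question': 'Question 2'},
        ],
    },
    {
        'course': 'course-b',
        'documents': [
            {'text': 'Answer 3', 'section': 'Module 1', 'question': 'Question 3'},
        ],
    },
]


def flatten(courses):
    """Return one dict per FAQ entry, each tagged with its course name."""
    flat = []
    for course in courses:
        for doc in course['documents']:
            flat.append({'course': course['course'], **doc})
    return flat


# Round-trip through JSON text to mimic loading the file from disk
loaded = json.loads(json.dumps(sample, indent=2))
flat_documents = flatten(loaded)
```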

In [13]:
!head documents.json

[
  {
    "course": "data-engineering-zoomcamp",
    "documents": [
      {
        "text": "Data Engineering Zoomcamp FAQ\nData Engineering Zoomcamp FAQ\nThe purpose of this document is to capture Frequently asked technical questions\nEditing guidelines:\nWhen adding a new FAQ entry, make sure the question is \u201cHeading 2\u201d\nFeel free to improve if you see something is off\nDon\u2019t change the formatting in the Data document or add any visual \u201cimprovements\u201d (make a copy for yourself first if you need to do it for whatever reason)\nDon\u2019t change the pages format (it should be \u201cpageless\u201d)\nAdd name and date for reference, if possible\nThe next cohort starts January 13th 2025. More info at DTC.\nRegister before the course starts using this link.\nJoint the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
        "section": "General course-related questions",
        "question": "C