# Generating the JetBrains Academy Course

This notebook showcases the second part of the pipeline, which generates a JetBrains Academy course from a closed and solved OSS GitHub issue. For the sake of simplicity and completeness, this part of the pipeline creates the lessons and tasks sequentially. However, as the tasks do not depend on each other during file creation, several steps could be parallelized to improve performance.
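
A minimal sketch of that parallelization idea, assuming each task can be written to disk independently. `write_task`, `write_tasks_in_parallel` and the demo task list are hypothetical stand-ins, not part of the pipeline in this notebook; only the standard library is used.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_task(lesson_path, task_id, description):
    """Hypothetical stand-in for the per-task file creation done later in this notebook."""
    task_path = os.path.join(lesson_path, f"task{task_id}")
    os.makedirs(task_path, exist_ok=True)
    with open(os.path.join(task_path, "task.md"), "w", encoding="utf-8") as f:
        f.write(description)
    return task_path


def write_tasks_in_parallel(lesson_path, tasks, max_workers=4):
    """Write all tasks concurrently; results come back in the same order as `tasks`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(write_task, lesson_path, task_id, text) for task_id, text in tasks]
        return [future.result() for future in futures]


# Example run in a throwaway directory
demo_root = tempfile.mkdtemp()
demo_tasks = [(0, "# Welcome"), (1, "# Background"), (2, "# Wrap-up")]
created = write_tasks_in_parallel(demo_root, demo_tasks)
print([os.path.basename(path) for path in created])
```

Threads are a reasonable fit here because the work is I/O-bound (file writes and, for generated lessons, network calls).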

## Processing the Input Data

The first step of this part of the pipeline is to properly read the input given and pass it along to the appropriate functions. The given input is of the following form:

```python
class GuidedExercise(BaseModel):
    id : int
    title : str
    git_info : GitInfo
    exercise : str
    steps : list[Step]
    tags : list[str]
```
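
As a rough sketch of how raw input in this shape could be validated before it is passed along: the models below are trimmed stand-ins (`GuidedExerciseSketch` types `steps` loosely to keep the example short), and the URLs are placeholders rather than real repositories.

```python
from pydantic import BaseModel, ValidationError


class GitInfo(BaseModel):
    repo: str
    issue: str
    pr: str


class GuidedExerciseSketch(BaseModel):
    # Trimmed stand-in: `steps` is typed loosely here so the example stays short
    id: int
    title: str
    git_info: GitInfo
    exercise: str
    steps: list
    tags: list[str]


raw_input = {
    "id": 0,
    "title": "Example course",
    "git_info": {
        "repo": "https://example.com/repo",    # placeholder URL
        "issue": "https://example.com/issue",  # placeholder URL
        "pr": "https://example.com/pr",        # placeholder URL
    },
    "exercise": "",
    "steps": [],
    "tags": ["javascript"],
}

# Nested dictionaries are converted into the nested model automatically
exercise = GuidedExerciseSketch(**raw_input)
print(exercise.git_info.repo)

# Malformed input raises a ValidationError instead of failing later in the pipeline
try:
    GuidedExerciseSketch(**{**raw_input, "id": "not-a-number"})
except ValidationError as error:
    print("Rejected invalid input:", len(error.errors()), "error(s)")
```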

All the necessary types are defined in the code cells below.

In [16]:
# Dependencies of the pipeline
import os
import yaml
from openai import OpenAI
from github import Github
from github import Auth
from dotenv import load_dotenv
from pydantic import BaseModel

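load_dotenv()  # Load GITHUB_TOKEN and OPENAI_API_KEY from a local .env file, if present
github_token = os.getenv('GITHUB_TOKEN')
# Authenticate when a token is available; otherwise fall back to anonymous access
github_obj = Github(auth=Auth.Token(github_token)) if github_token else Github()
client = OpenAI()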

# Important types of the pipeline
class GitInfo(BaseModel):
    repo : str
    issue : str
    pr : str

class CodeRange(BaseModel):
    start: int
    end: int

class CodeStep(BaseModel):
    id : int
    summary : str
    code : str
    path : str
    range : CodeRange

class Step(BaseModel):
    id : str
    summary : str
    code_steps : list[CodeStep]
    
class GuidedExercise(BaseModel):
    id : int
    title : str
    git_info : GitInfo
    exercise : str
    steps : list[Step]
    tags : list[str]
    

In [15]:
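# Scratch cell: quick manual test of generateLessonFolder.
# It only works once the cells defining `generateLessonFolder` and `git_info` have been run.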

generateLessonFolder("", "", git_info,"theory", [], "test_data")

### Task Content

To create the task folder, I expect the input to be passed in the following format:

```python
class Task(BaseModel):
    id: int             # Id of the task, used to name the task folder (task{id})
    title: str          # Title of the task
    description: str    # Text which will be displayed in the task description, already formatted
    category: str       # Category of the task that will be specified in the config file
    files: dict         # Dictionary mapping each file name (key) to the content of that file (value)
    lesson_path: str    # Path of the lesson directory

```
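
To make the resulting layout concrete, here is a small sketch that only computes the paths a task folder is expected to contain, without writing anything. `planned_task_paths` is a hypothetical helper, and it deliberately ignores the special handling of a file named `test`.

```python
from pathlib import PurePosixPath


def planned_task_paths(lesson_path, task_id, files):
    """Return the paths a task folder is expected to contain, without touching the disk."""
    task_path = PurePosixPath(lesson_path) / f"task{task_id}"
    paths = [task_path / "task.md", task_path / "task-info.yaml"]
    for name, content in files.items():
        if content:  # empty or missing file contents are skipped
            paths.append(task_path / name)
    return [str(path) for path in paths]


for path in planned_task_paths("course/theory", 0, {"task.js": "// placeholder", "notes.txt": ""}):
    print(path)
```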

In [11]:
def generateTaskFolder(id, title, description, category, files, lesson_path):
    """
    This function generates a task folder, which is part of the content of a lesson folder.
    The inputs of the function are the task's id, title, description and category as well as its files,
    plus the path of the lesson folder in which the task folder is created.
    """

    task_path = os.path.join(lesson_path, f"task{id}")
    os.makedirs(task_path, exist_ok=True)
    file_paths = {}

    # TODO: Files to create: -task.md -task.js -task-info.yaml
    taskMD_path = os.path.join(task_path, "task.md")
    with open(taskMD_path, "w", encoding="utf-8") as f:
        f.write(description)
    
    for file in files:
        if files[file] is None or files[file] == "":
            continue
        file_path = os.path.join(task_path, file)
        if file == "test":
            test_folder = os.path.join(task_path, "test")
            os.makedirs(test_folder, exist_ok=True)
            file_path = os.path.join(test_folder, file)
        with open(file_path, "w", encoding="utf-8") as f:
            f.write(files[file])
        file_paths[file] = file_path

    taskYML_path = os.path.join(task_path, "task-info.yaml")
    with open(taskYML_path, "w", encoding="utf-8") as f:
        file_content = {
            "type": category,
            "custom_name": title,
            "files": []
        }
        for file in file_paths:
            file_content["files"].append({"name": file, "visible": True})
        # Dump once, after all files are listed (also covers tasks without any files)
        yaml.dump(file_content, f, default_flow_style=False)

    return task_path


### Lesson Content

Each lesson will be generated using GenAI to create its content. Depending on the type of the lesson, a different prompt will be used:
- `introduction` lesson -> `generateIntroductionContent()` will be called, where the repository as well as the issue will be briefly introduced
- `theory` lesson -> `generateTheoryContent()` will be called. The content will be short summaries about the necessary theoretical background to solve the issue
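- `coding` lesson -> No function will be called, as all the content will have been already generated by a previous part of the pipeline
- `debrief` lesson -> `generateDebriefContent()` will be called to produce a single task summarizing what the student has learned. It is still unclear whether this lesson needs a more specific prompt.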
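
The mapping in the list above could also be expressed as a lookup table. The sketch below uses stub generators (the `*_stub` functions and `generate_content` are hypothetical names) so that it runs without calling any model.

```python
def introduction_stub(issue_url, pr_url):
    return f"introduction for {issue_url}"


def theory_stub(issue_url, pr_url):
    return f"theory for {issue_url}"


def debrief_stub(issue_url, pr_url):
    return f"debrief for {issue_url}"


# Lesson types whose content is generated here; coding lessons are absent on purpose
CONTENT_GENERATORS = {
    "introduction": introduction_stub,
    "theory": theory_stub,
    "debrief": debrief_stub,
}


def generate_content(category, issue_url, pr_url):
    """Return generated content, or None for lesson types that need no generation."""
    if category == "coding":
        return None
    try:
        generator = CONTENT_GENERATORS[category]
    except KeyError:
        raise ValueError(f"Unknown lesson type: {category!r}") from None
    return generator(issue_url, pr_url)


print(generate_content("theory", "ISSUE_URL", "PR_URL"))
print(generate_content("coding", "ISSUE_URL", "PR_URL"))
```

With a table like this, adding a lesson type means adding one entry rather than another branch.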

In [12]:
class NonCodingTaskContent(BaseModel):
    id: int
    title: str
    description: str
    additional_readings: list[str]  # Matches the "additional_readings" field requested in the prompts

class NonCodingLessonContent(BaseModel):
    title: str
    tasks: list[NonCodingTaskContent]

In [18]:
def generateIntroductionContent(issue_url, pr_url):
    """
    This function takes as input the issue and pr url to generate the content of the 
    Introduction lesson, such that the introduction to the course is tailored to the 
    issue.
    """
    prompt = """You are an instructor creating programming exercises from closed GitHub issues.

            Input:
            $ISSUE, a github url linking to the GitHub Issue
            $PR, a link to the pull request solving the said issue.

            Output: An introductory lesson which greets and introduces the student to the issue
            The introductory lesson should mention the repository
            The format of the output should be as follows:
            {
                "title": the title of the lesson
                "tasks": a list of each task
            }
            The elements of the "tasks" field should be in the following format:
            {
                "id": the id of the task, starting at 0, and incrementing by 1 for each additional task
                "title": the title of the task
                "description": the content of the task, i.e. the text that introduces the student
                "additional_readings": the list of urls of the additional readings as a list of strings
            }
            The description field of the task should use markdown syntax.

            You will now be given a pair of ($ISSUE, $PR). Generate the output following the instructions as closely as possible.
    """
    model_input = "$ISSUE = " + issue_url + ", $PR = " + pr_url
    
    generated_content = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        response_format=NonCodingLessonContent,
        messages = [{"role": "system", "content": prompt}, {"role": "user", "content": model_input}]
    )

    return generated_content.choices[0].message.parsed
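
The `parsed` attribute returned above can be `None`, for example when the model declines to answer; as far as I am aware, the SDK reports this through a `refusal` field on the message. A minimal sketch of a guard for that case, where `extract_parsed` is a hypothetical helper and the fake completion objects mimic only the attributes it reads:

```python
from types import SimpleNamespace


def extract_parsed(completion):
    """Return the parsed lesson, raising a clear error if the model produced none."""
    message = completion.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        raise RuntimeError(f"Model refused the request: {refusal}")
    if message.parsed is None:
        raise RuntimeError("Model returned no parsable lesson content")
    return message.parsed


# Fake completion objects that mimic only the attributes used above
ok = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(parsed={"title": "Intro"}, refusal=None))]
)
refused = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(parsed=None, refusal="cannot help"))]
)

print(extract_parsed(ok))
try:
    extract_parsed(refused)
except RuntimeError as error:
    print(error)
```

The same guard would apply to all three content generators, since they return the parsed message in the same way.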

In [14]:
def generateTheoryContent(issue_url, pr_url):
    """
    This function takes as input the issue and pr of the exercise. Its output is the content used
    to create the Theory Lesson for the course, that presents the necessary theoretical knowledge to 
    the student taking the course. 
    """
    # TODO: Add md formatting for the task description in the prompt
    prompt = """You are an instructor creating programming exercises from closed GitHub issues.

            Input:
            $ISSUE, a github url linking to the GitHub Issue
            $PR, a link to the pull request solving the said issue.

            Output: A lesson teaching the student the necessary theoretical background to solve the issue.
            Each task of the lesson must be concise and relevant.
            Each task should only explain ONE concept.
            For each task, provide extra reading for the student in form of links.
            Make sure that the links work.
            Each task should NOT divulge any concrete solutions to solve the issue.
            The format of the output should be as follows:
            {
                "title": the title of the lesson
                "tasks": a list of each task
            }
            The elements of the "tasks" field should be in the following format:
            {
                "id": the id of the task, starting at 0, and incrementing by 1 for each additional task
                "title": the title of the task
                "description": the content of the task, explaining the theoretical concept
                "additional_readings": the list of urls of the additional readings as a list of strings
            }
            The description field of the task should use markdown syntax.

            You will now be given a pair of ($ISSUE, $PR). Generate the output following the instructions as closely as possible.
    """
    model_input = "$ISSUE = " + issue_url + ", $PR = " + pr_url

    generated_content = client.beta.chat.completions.parse(
        model = "gpt-4o-mini",
        response_format = NonCodingLessonContent,
        messages = [{"role": "system", "content": prompt}, {"role": "user", "content": model_input}]
    )

    return generated_content.choices[0].message.parsed
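
The theory prompt asks the model to make sure that the links work, but a plain chat completion has no way of opening them. A small post-processing step could at least drop malformed URLs. The sketch below assumes a hypothetical helper `keep_wellformed_urls`; it checks syntax only and does not confirm that a page is reachable.

```python
from urllib.parse import urlparse


def keep_wellformed_urls(urls):
    """Keep only absolute http(s) URLs that have a host; reachability is NOT checked."""
    kept = []
    for url in urls:
        cleaned = url.strip()
        parsed = urlparse(cleaned)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            kept.append(cleaned)
    return kept


candidates = [
    "https://example.com/docs",  # placeholder URL, well-formed
    "example.com/no-scheme",     # missing scheme
    "ftp://example.com/file",    # unsupported scheme
    "https://",                  # missing host
]
print(keep_wellformed_urls(candidates))
```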

In [20]:
def generateDebriefContent(issue_url, pr_url):
    """
    This function generates a debrief lesson by taking as an input the issue's url and the pr's url that solves the issue.
    The debrief lesson consists only of one task, concisely explaining what the student has learned.
    """

    prompt = """You are an instructor creating programming exercises from closed GitHub issues.

            Input:
            $ISSUE, a github url linking to the GitHub Issue
            $PR, a link to the pull request solving the said issue.

            Output: A debrief lesson consisting of exactly one task that concisely explains what the student has learned.
            The format of the output should be as follows:
            {
                "title": the title of the lesson
                "tasks": a list of each task
            }
            The elements of the "tasks" field should be in the following format:
            {
                "id": the id of the task, starting at 0, and incrementing by 1 for each additional task
                "title": the title of the task
                "description": the content of the task, summarizing what the student has learned
                "additional_readings": the list of urls of the additional readings as a list of strings
            }
            The description field of the task should use markdown syntax.
            
            You will now be given a pair of ($ISSUE, $PR). Generate the output following the instructions as closely as possible.
    """
    model_input = "$ISSUE = " + issue_url + ", $PR = " + pr_url

    generated_content = client.beta.chat.completions.parse(
        model = "gpt-4o-mini",
        response_format = NonCodingLessonContent,
        messages = [{"role": "system", "content": prompt}, {"role": "user", "content": model_input}]
    )

    return generated_content.choices[0].message.parsed

To create a lesson folder, I expect the function to receive the following input:

```python
class Lesson(BaseModel):
    title: str          # Title of the lesson
    description: str    # General Description of the lesson
    git_info: GitInfo   # All the necessary GitHub information to generate the config files
    category: str       # The type of the lesson, either introduction, theory, coding, or debrief
    steps: list[Step]   # A list of the steps, i.e. tasks to be created inside the lesson
    course_path: str    # Path of the course folder in which the lesson folder is created

class Step(BaseModel):
    id : str                    # ID of the step
    summary : str               # Short summary describing what needs to be done, i.e. the task description
    code_steps : list[CodeStep]  # The list of all the hints for that specific step/task
```

For theory tasks, the `code_steps` field of the `Step` class is left empty, as no hints are needed; the task description is stored in the `summary` field. The `steps` input is also empty for every lesson type except *coding*: coding steps are not generated by this part of the pipeline but are provided by the previous one.
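
The convention described above can be sketched as follows. `StepSketch` is a dependency-free dataclass stand-in for `Step`, and `theory_task_to_step` is a hypothetical helper, so neither is part of the pipeline itself.

```python
from dataclasses import dataclass, field


@dataclass
class StepSketch:
    """Dependency-free stand-in for the `Step` model described above."""
    id: str
    summary: str
    code_steps: list = field(default_factory=list)


def theory_task_to_step(task_id, description):
    """Store the task description in `summary` and leave `code_steps` empty (no hints)."""
    return StepSketch(id=str(task_id), summary=description)


step = theory_task_to_step(0, "## Event loop\nA short explanation of the concept.")
print(step.id, step.code_steps)
```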

In [27]:
def generateLessonFolder(title, description, git_info, category, steps, course_path):
    """
    This function generates a lesson folder necessary for a JetBrains Academy course. Its inputs
    are the content of each task it has as well as its title and description.
    The lesson types are: "introduction", "theory", "coding", and "debrief".
    """
    
    lesson_path = os.path.join(course_path, category)
    os.makedirs(lesson_path, exist_ok=True)

    # Pick the content generator that matches the lesson type
    if category == "introduction":
        # TODO: Don't forget to generate the config.js file to properly set-up the course for the plugin
        # TODO: Generate the task descriptions properly according to the github issue
        lesson = generateIntroductionContent(git_info.issue, git_info.pr)
    elif category == "theory":
        lesson = generateTheoryContent(git_info.issue, git_info.pr)
    elif category == "debrief":
        lesson = generateDebriefContent(git_info.issue, git_info.pr)
    elif category == "coding":
        # Coding tasks come from the previous part of the pipeline; nothing is generated here yet
        return lesson_path
    else:
        raise ValueError(f"Wrong lesson type given to the function: {category!r}")

    # The three generated lesson types share the same folder layout.
    # The task.js content could later be replaced by something generated.
    file_contents = {
        "task.js": "// It's a coding file, you don't need that in this task"
    }
    yaml_content = {
        "custom_name": lesson.title,
        "content": []
    }
    for task in lesson.tasks:
        task_path = generateTaskFolder(task.id, task.title, task.description, "theory", file_contents, lesson_path)
        # Assumption: lesson-info.yaml lists tasks by folder name, not by full path
        yaml_content["content"].append(os.path.basename(task_path))
    lessonYML_path = os.path.join(lesson_path, "lesson-info.yaml")
    with open(lessonYML_path, "w", encoding="utf-8") as f:
        yaml.dump(yaml_content, f, default_flow_style=False)
    return lesson_path


In [28]:
def generateCourseFolder(input_data: GuidedExercise):
    """
    This function takes as input a specific json format generated by the previous part of the pipeline.
    The format is equivalent to the GuidedExercise class described in the beginning.
    It outputs an entire folder which possesses the structure needed to be considered a course by 
    the JetBrains Academy plugin
    """
    # TODO: Check for the proper format of the input
    try:
        course_path = input_data.title
        os.makedirs(course_path, exist_ok=True)
        # TODO: Make the yaml content not so hard_coded
        yaml_content = {
            "type": "marketplace",
            "title": input_data.title,
            "language": "English",
            "programming_language": "JavaScript",
            "content": [],
            "additional_files": [],
            "yaml_version": 3
        }

        # Hardcoded course structure
        yaml_content["content"].append(generateLessonFolder("", "", input_data.git_info,"introduction", [], course_path))
        yaml_content["content"].append(generateLessonFolder("", "", input_data.git_info,"theory", [], course_path))
        # yaml_content["content"].append(generateLessonFolder("Step by Step Coding", input_data.exercise, input_data.git_info,"coding", input_data.steps, course_path))
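        yaml_content["content"].append(generateLessonFolder("", "", input_data.git_info,"debrief", [], course_path))

        # Save the course configuration next to the lesson folders.
        # Assumption: course-info.yaml lists lessons by folder name, not by full path.
        yaml_content["content"] = [os.path.basename(path) for path in yaml_content["content"]]
        courseYML_path = os.path.join(course_path, "course-info.yaml")
        with open(courseYML_path, "w", encoding="utf-8") as f:
            yaml.dump(yaml_content, f, default_flow_style=False)
    except Exception as error:
        # Report the real cause (invalid input, API failure, I/O error) instead of masking it
        print(f"Course generation failed: {error}")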

    return None

## Running the Pipeline

Now we are ready to run the entire pipeline. Feel free to run the last cell to see how a course is generated! 

*Don't forget to run all the other cells first ;)*

In [29]:
git_info_dict = {
    "repo" : "https://github.com/mattermost/mattermost", 
    "issue" : "https://github.com/mattermost/mattermost/issues/28355", 
    "pr" : "https://github.com/mattermost/mattermost/pull/28357"

}
git_info = GitInfo(**git_info_dict)
placeholder_input = {
    "id": 0,
    "title": "Testing Course",
    "git_info": git_info,
    "exercise": "",
    "steps": [],
    "tags": []
}
test_input = GuidedExercise(**placeholder_input)
generateCourseFolder(test_input)
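
After the cell above has run, the generated folder can be inspected to check its structure. The sketch below assumes a hypothetical helper `list_course_tree` and demonstrates it on a throwaway folder that mimics one lesson with one task, so it runs without generating a course first.

```python
import os
import tempfile


def list_course_tree(course_path):
    """Return the relative paths of all files below `course_path`, sorted for stable output."""
    found = []
    for root, _dirs, files in os.walk(course_path):
        for name in files:
            full_path = os.path.join(root, name)
            found.append(os.path.relpath(full_path, course_path).replace(os.sep, "/"))
    return sorted(found)


# Demo on a throwaway folder that mimics one lesson with one task
demo_course = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_course, "theory", "task0"))
for relative in ("theory/lesson-info.yaml", "theory/task0/task.md", "theory/task0/task-info.yaml"):
    with open(os.path.join(demo_course, *relative.split("/")), "w", encoding="utf-8") as f:
        f.write("")

print(list_course_tree(demo_course))
```

For the course generated in this notebook, the call would be `list_course_tree("Testing Course")`.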
