# Inspecting and Preprocessing the Dataset

The ***target dataset*** for this ***study*** is a collection of ***LeetCode*** problem-solution entries covering ***all three difficulty levels*** (Easy, Medium, and Hard). Some of the problems are well known to seasoned ***Software Engineers*** and ***Computer Science*** students. Each problem is implemented in the following ***programming languages***:

1) Python
2) C++
3) Java
4) JavaScript

These are, thus far, among the ***most requested*** programming languages for Software Engineers across the world.

- The dataset can be found [here](https://huggingface.co/datasets/greengerong/leetcode/viewer/default/train?p=1&views%5B%5D=train)

---

In [1]:
import pandas as pd

dataset_df = pd.read_json("../data/leetcode_dataset.jsonl", lines=True)

In [2]:
dataset_df

Unnamed: 0,id,slug,title,difficulty,content,java,c++,python,javascript
0,1,two-sum,Two Sum,Easy,Given an array of integers `nums` and an integ...,\n ```java\nimport java.util.HashMap;\nimpo...,\n ```cpp\n#include <vector>\n#include <uno...,"\n ```python\ndef twoSum(nums, target):\n ...","\n ```javascript\nfunction twoSum(nums, tar..."
1,2,add-two-numbers,Add Two Numbers,Medium,You are given two **non-empty** linked lists r...,\n ```java\npublic class ListNode {\n in...,\n ```cpp\nstruct ListNode {\n int val;\...,\n ```python\nclass ListNode:\n def __in...,"\n ```javascript\nfunction ListNode(val, ne..."
2,3,longest-substring-without-repeating-characters,Longest Substring Without Repeating Characters,Medium,"Given a string `s`, find the length of the **l...",\n ```java\nimport java.util.HashSet;\nimpo...,\n ```cpp\n#include <string>\n#include <uno...,\n ```python\ndef length_of_longest_substri...,\n ```javascript\nfunction lengthOfLongestS...
3,4,median-of-two-sorted-arrays,Median of Two Sorted Arrays,Hard,Given two sorted arrays `nums1` and `nums2` of...,\n ```java\npublic double findMedianSortedA...,\n ```cpp\ndouble findMedianSortedArrays(ve...,\n ```python\ndef findMedianSortedArrays(nu...,\n ```javascript\nfunction findMedianSorted...
4,5,longest-palindromic-substring,Longest Palindromic Substring,Medium,"Given a string `s`, return _the longest_ _pali...",\n ```java\npublic String longestPalindromi...,\n ```cpp\n#include <string>\n\nstd::string...,\n ```python\ndef longest_palindromic_subst...,\n ```javascript\nfunction longestPalindrom...
...,...,...,...,...,...,...,...,...,...
2355,2608,shortest-cycle-in-a-graph,Shortest Cycle in a Graph,Hard,There is a **bi-directional** graph with `n` v...,\n ```java\nimport java.util.ArrayList;\nim...,\n ```cpp\n#include <vector>\n#include <que...,\n ```python\nfrom collections import deque...,\n ```javascript\nfunction shortestCycleLen...
2356,2609,find-the-longest-balanced-substring-of-a-binar...,Find the Longest Balanced Substring of a Binar...,Easy,You are given a binary string `s` consisting o...,\n ```java\npublic int longestBalancedSubst...,\n ```cpp\nint longestBalancedSubstring(str...,\n ```python\ndef longestBalancedSubstring(...,\n ```javascript\nfunction longestBalancedS...
2357,2610,convert-an-array-into-a-2d-array-with-conditions,Convert an Array Into a 2D Array With Conditions,Medium,You are given an integer array `nums`. You nee...,\n ```java\nimport java.util.ArrayList;\nim...,\n ```cpp\n#include <vector>\n#include <uno...,\n ```python\ndef distinct_rows_from(nums):...,\n ```javascript\nfunction distinctRowsFrom...
2358,2611,mice-and-cheese,Mice and Cheese,Medium,There are two mice and `n` different types of ...,\n ```java\nimport java.util.ArrayList;\nim...,\n ```cpp\n#include <vector>\n#include <alg...,"\n ```python\ndef maxPoints(reward1, reward...",\n ```javascript\nfunction maxPoints(reward...
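
Before any preprocessing, a few quick checks help confirm that the frame matches the preview above. The following is a minimal sketch, assuming the column names shown in the output (`id`, `difficulty`, `java`, `c++`, `python`, `javascript`). The helper name `summarize_dataset` and the three-row `sample_df` are hypothetical stand-ins so that the snippet runs without the JSONL file.

```python
import pandas as pd

# Solution columns as they appear in the preview above.
LANGUAGE_COLUMNS = ["java", "c++", "python", "javascript"]


def summarize_dataset(df: pd.DataFrame) -> dict:
    """Return basic sanity-check statistics for a problem-solution frame."""
    return {
        "n_rows": len(df),
        # Number of problems per difficulty level.
        "difficulty_counts": df["difficulty"].value_counts().to_dict(),
        # Missing (NaN/None) solutions per language column.
        "missing_solutions": df[LANGUAGE_COLUMNS].isna().sum().to_dict(),
        # Number of rows whose problem id repeats an earlier row.
        "duplicate_ids": int(df["id"].duplicated().sum()),
    }


# Hypothetical sample mirroring the real columns; swap in `dataset_df`.
sample_df = pd.DataFrame(
    {
        "id": [1, 2, 4],
        "difficulty": ["Easy", "Medium", "Hard"],
        "java": ["class A {}", None, "class C {}"],
        "c++": ["int a;", "int b;", "int c;"],
        "python": ["a = 1", "b = 2", "c = 3"],
        "javascript": ["let a;", "let b;", "let c;"],
    }
)

summary = summarize_dataset(sample_df)
print(summary)
```

On the real data, `summarize_dataset(dataset_df)` would report how many problems fall into each difficulty level and whether any language column has gaps.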


---

## Preprocessing the Programming Language Documentation

To best assist an ***aspiring programmer***, it is not enough for the solution to be understood and interpreted appropriately. The guidance must also be ***appropriate*** and ***up to date*** with respect to ***official programming language standards***, so official documentation *must* be used.

Normally, in a professional setting, ***web scraping*** would be used to extract information from the *web-hosted* documentation of the programming language. For the sake of these tests, however, the documentation will be downloaded and *processed* locally for this specific application.

The documentation downloaded and used in the knowledge base concerns only the following programming languages:

1) **Java**
2) **Python**

---

In [7]:
# Downloading the Python documentation
!wget https://docs.python.org/3/archives/python-3.13-docs-html.zip -O ../data/python_docs.zip
!unzip ../data/python_docs.zip -d ../data/python_docs

--2025-07-08 17:42:27--  https://docs.python.org/3/archives/python-3.13-docs-html.zip
Resolving docs.python.org (docs.python.org)... 151.101.192.223, 151.101.64.223, 151.101.128.223, ...
Connecting to docs.python.org (docs.python.org)|151.101.192.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15305784 (15M) [application/zip]
Saving to: ‘../data/python_docs.zip’


2025-07-08 17:42:29 (8.47 MB/s) - ‘../data/python_docs.zip’ saved [15305784/15305784]

Archive:  ../data/python_docs.zip
replace ../data/python_docs/python-3.13-docs-html/_sources/copyright.rst.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
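
The `replace ...?` prompt above appears because files from an earlier extraction already existed in the target directory, and the cell had to be interrupted. One way to make this step re-runnable is to extract with Python's standard `zipfile` module, which overwrites existing files without prompting. The sketch below is an assumption about how the step could be written, not the notebook's original cell. The helper `extract_archive` is a hypothetical name, and the demo builds a tiny throwaway archive in a temporary directory instead of touching `../data`.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path: Path, target_dir: Path) -> int:
    """Extract every member of zip_path into target_dir; return member count.

    ZipFile.extractall overwrites existing files silently, so re-running
    the step never blocks on an interactive prompt.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return len(archive.namelist())


# Demo on a throwaway archive (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    demo_zip = tmp_path / "demo_docs.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("docs/index.html", "<h1>Demo</h1>")

    out_dir = tmp_path / "demo_docs"
    first_run = extract_archive(demo_zip, out_dir)
    second_run = extract_archive(demo_zip, out_dir)  # re-run: no prompt
    extracted_text = (out_dir / "docs" / "index.html").read_text(encoding="utf-8")
```

For the real archive, the equivalent call would be `extract_archive(Path("../data/python_docs.zip"), Path("../data/python_docs"))`.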


In [6]:
# Downloading the Java documentation
!wget https://download.oracle.com/otn_software/java/jdk/21.0.7+8/8fe202bfe6c4465583b1dc9710c4fade/jdk-21.0.7_doc-all.zip -O ../data/java_docs.zip
!unzip ../data/java_docs.zip -d ../data/java_docs

--2025-07-08 17:33:33--  https://download.oracle.com/otn_software/java/jdk/21.0.7+8/8fe202bfe6c4465583b1dc9710c4fade/jdk-21.0.7_doc-all.zip
Resolving download.oracle.com (download.oracle.com)... 2.21.144.138
Connecting to download.oracle.com (download.oracle.com)|2.21.144.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 52724951 (50M) [application/zip]
Saving to: ‘../data/java_docs.zip’


2025-07-08 17:33:39 (10.8 MB/s) - ‘../data/java_docs.zip’ saved [52724951/52724951]

Archive:  ../data/java_docs.zip
   creating: ../data/java_docs/docs/
   creating: ../data/java_docs/docs/api/
  inflating: ../data/java_docs/docs/api/allclasses-index.html  
  inflating: ../data/java_docs/docs/api/allpackages-index.html  
  inflating: ../data/java_docs/docs/api/constant-values.html  
  inflating: ../data/java_docs/docs/api/copy.svg  
  inflating: ../data/java_docs/docs/api/deprecated-list.html  
  inflating: ../data/java_docs/docs/api/element-list  
  inflating: ../data/ja
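
Both archives unpack into HTML pages, so some text extraction is needed before the documentation can be added to a knowledge base. The notebook does not show how that processing was done, so the following is only a rough sketch of one option: the standard-library `html.parser` module can strip tags and skip `<script>` and `<style>` content. The class `TextExtractor` and the function `html_to_text` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page fragment standing in for a documentation file.
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>str.split</h1><p>Return a list of the words.</p>"
    "<script>console.log('x');</script></body></html>"
)
plain_text = html_to_text(sample_html)
print(plain_text)
```

A dedicated HTML library would handle malformed pages more gracefully; the standard-library parser is used here only to keep the sketch dependency-free.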

---

Only the solutions in the ***Python*** and ***Java*** programming languages will be stored. This keeps the experiments consistent, since these are the only languages for which ***official documentation*** has been downloaded.

In [13]:
# Preprocessing the LeetCode Solution Dataset for Knowledge Base Integration
import os
import pandas as pd

df = pd.read_json("../data/leetcode_dataset.jsonl", lines=True)  # or reuse dataset_df loaded in the first cell

base_path = "../data/leetcode"

for _, row in df.iterrows():
    problem_id = str(row['id'])
    title_slug = row['slug']

    path = os.path.join(base_path, f"{problem_id}_{title_slug}")
    os.makedirs(path, exist_ok=True)

    # Save description
    with open(os.path.join(path, "problem.txt"), "w", encoding="utf-8") as f:
        f.write(f"Title: {row['title']}\n")
        f.write(f"Difficulty: {row['difficulty']}\n\n")
        f.write(row['content'])

    # Save solutions (only the languages with downloaded documentation)
    for lang in ['java', 'python']:
        code = row[lang]
        # Missing solutions (NaN/None) are written as empty files
        with open(os.path.join(path, f"solution_{lang}.txt"), "w", encoding="utf-8") as f:
            f.write(code if isinstance(code, str) else "")
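
The preview of the dataset shows that each solution string starts with a Markdown code fence followed by a language tag, and the cell above writes those strings verbatim. If only the source code is needed later, the fenced part can be pulled out with a regular expression. This is a minimal sketch under that assumption, and `extract_code` is a hypothetical helper, not part of the original pipeline. The fence marker is built programmatically so that the snippet itself contains no literal fence.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically

# Opening fence with optional language tag, the code, then the closing fence.
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"[\w+#-]*[ \t]*\n(.*?)" + re.escape(FENCE),
    re.DOTALL,
)


def extract_code(solution: str) -> str:
    """Return the code inside the first fenced block, or the stripped input."""
    match = FENCED_BLOCK.search(solution)
    return match.group(1).strip() if match else solution.strip()


# Hypothetical solution string shaped like the entries in the preview.
sample_solution = (
    "\n    " + FENCE + "python\n"
    "def two_sum(nums, target):\n"
    "    return []\n"
    "    " + FENCE + "\n"
)
code_only = extract_code(sample_solution)
print(code_only)
```

Strings without a fence are returned unchanged apart from surrounding whitespace, so the helper is safe to apply to every row.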