*First attempt* at converting the **.txt** file into a **.json** file

In [9]:
# Create a .json copy of the o200k_base vocab .txt file using pandas
import pandas as pd

file = "https://github.com/BGSTA9/OptiTok_64k/blob/main/mis_files/OpenAI/o200k_base_vocab_list.txt"
# 1. Read the text file
# 'sep' defines what separates your data. For comma-separated, use ','
df = pd.read_csv(file, sep=',')
# 2. Convert to JSON
# 'orient' determines the JSON structure. 'records' is the most common (list of objects).
# 'indent' makes the file easy for humans to read.
df.to_json('o200k_base_vocab.json', orient='records', indent=4)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 38, saw 2
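
A likely contributor to this first failure is the URL itself: a `github.com/.../blob/...` address serves GitHub's HTML page for the file, not the raw text, which is why the later attempts switch to `raw.githubusercontent.com`. As a minimal sketch, the conversion can be done mechanically. The helper name `to_raw_url` is hypothetical, and it only handles the simple `owner/repo/blob/branch/path` shape:

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(blob_url: str) -> str:
    """Turn a github.com 'blob' URL into its raw.githubusercontent.com form.

    Hypothetical helper for illustration; URLs of any other shape are
    returned unchanged.
    """
    parts = urlsplit(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return blob_url
    owner, repo, _blob, *rest = segments
    raw_path = "/" + "/".join([owner, repo, *rest])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


blob = "https://github.com/BGSTA9/OptiTok_64k/blob/main/mis_files/OpenAI/o200k_base_vocab_list.txt"
raw = to_raw_url(blob)
print(raw)
```

The result uses the short `/main/` form of the branch rather than the `/refs/heads/main/` form seen in the later cells; both address the same branch.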


*Second attempt* at converting the **.txt** file into a **.json** file

In [8]:
import pandas as pd

input_file_url = "https://raw.githubusercontent.com/BGSTA9/OptiTok_64k/refs/heads/main/mis_files/OpenAI/o200k_base_vocab_list.txt"
output_file = 'o200k_base_vocab.json'

try:
    # READ: Adjust 'sep' based on your file (use '\t' for tabs)
    df = pd.read_csv(input_file_url, sep=',')

    # WRITE: Save as JSON
    df.to_json(output_file, orient='records', indent=4)

    print(f"Success! {input_file_url} has been converted to {output_file}")
    print(f"Total rows converted: {len(df)}")

except FileNotFoundError:
    print("Error: The file was not found. Check the file path.")
except Exception as e:
    print(f"An error occurred: {e}")

An error occurred: Error tokenizing data. C error: Expected 1 fields in line 12, saw 2



*Third attempt* at converting the **.txt** file into a **.json** file

**Meaning of the error above, as explained by Gemini:**

- "This error happens because your vocabulary file likely contains tokens that are commas (,) or quotes ("). When you use sep=',' (the default), pandas tries to split lines based on commas. If a line in your vocabulary list contains a comma (e.g., the token for a comma itself), pandas gets confused about how many columns should be in that row."
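
The behaviour described in this explanation can be reproduced on a tiny in-memory sample. This is only a sketch: the three sample lines are made up, and `io.StringIO` stands in for the downloaded vocabulary file.

```python
import io

import pandas as pd

# Made-up sample: the first two lines have one field each,
# the third line is a quoted comma token and so contains the separator.
sample = "'!'\n'#'\n','\n"

try:
    pd.read_csv(io.StringIO(sample), sep=",")
    error_message = None
except pd.errors.ParserError as exc:
    # pandas inferred one column from the first line, then met two fields.
    error_message = str(exc)

print(error_message)
```

The first line fixes the expected column count at one, so the line holding the comma token is what triggers the `ParserError`.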

**Solution**

- Vocabulary lists usually hold one item per line, so tell pandas to ignore commas and quotes and read the file line by line.

In [10]:
import pandas as pd
import csv

input_file_url = "https://raw.githubusercontent.com/BGSTA9/OptiTok_64k/refs/heads/main/mis_files/OpenAI/o200k_base_vocab_list.txt"
output_file = 'o200k_base_vocab.json'

try:
    # FIX:
    # 1. sep='\n' -> Read the file line-by-line (ignore commas as separators)
    # 2. header=None -> The first line is data, not a column name
    # 3. engine='python' -> Use the Python parser engine, which supports separators the C engine cannot handle
    # 4. quoting=3 -> (csv.QUOTE_NONE) Treats quote marks as normal text, not special formatting
    df = pd.read_csv(
        input_file_url, 
        sep='\n', 
        header=None, 
        engine='python', 
        quoting=csv.QUOTE_NONE
    )

    # Optional: Rename the single column to something meaningful
    df.columns = ["token"]

    # WRITE: Save as JSON
    df.to_json(output_file, orient='records', indent=4)

    print(f"Success! {input_file_url} has been converted to {output_file}")
    print(f"Total rows converted: {len(df)}")

except Exception as e:
    print(f"An error occurred: {e}")

An error occurred: Specified \n as separator or delimiter. This forces the python engine which does not accept a line terminator. Hence it is not allowed to use the line terminator as separator.


*Fourth attempt* at converting the **.txt** file into a **.json** file

The suggestion to use sep='\n' was incorrect: pandas reserves \n as the line terminator that separates rows, so it cannot also serve as the column separator.

To fix this, we need to use a "Fake Separator" trick.

We will tell pandas to use a separator that definitely does not exist in your file (like a purely random string). When pandas can't find the separator, it reads the entire line as a single column, which is exactly what we want for a vocabulary list.

Here is the corrected, working script:

**Why this works**

- sep='delimiter_fake': We made up a separator. Pandas looks for this text to split columns. Since it never finds it, it puts the entire line of text into one column.

- quoting=csv.QUOTE_NONE: Vocabulary lists often contain the " character as a token. By default, Pandas thinks " is a wrapper for text. This setting forces Pandas to treat " as just another character in the string.
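
The effect of these two settings can be checked on a small in-memory sample before running against the full vocabulary file. The sketch below is illustrative only: the three sample lines are made up, and `io.StringIO` stands in for the downloaded file.

```python
import csv
import io

import pandas as pd

# Made-up sample: three quoted tokens, including a double quote and a comma.
sample = "'!'\n'\"'\n','\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="delimiter_fake",    # never occurs in the data, so no line is split
    header=None,             # the first line is data, not a column name
    engine="python",         # needed for separators longer than one character
    quoting=csv.QUOTE_NONE,  # treat " as an ordinary character
)
df.columns = ["token"]

print(df.shape)
print(df["token"].tolist())
```

One caveat worth knowing: with the Python engine, a separator longer than one character is interpreted as a regular expression. `delimiter_fake` contains no regex metacharacters, so it is matched literally.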

In [11]:
import pandas as pd
import csv

input_file_url = "https://raw.githubusercontent.com/BGSTA9/OptiTok_64k/refs/heads/main/mis_files/OpenAI/o200k_base_vocab_list.txt"
output_file = 'o200k_base_vocab.json'

try:
    # THE FIX:
    # 1. sep='delimiter_fake' -> A separator that doesn't exist, so it reads the whole line.
    # 2. engine='python' -> Allows us to use multi-character separators.
    # 3. quoting=csv.QUOTE_NONE -> Ignores quote marks (") so they don't break the parser.
    df = pd.read_csv(
        input_file_url, 
        sep='delimiter_fake', 
        header=None, 
        engine='python', 
        quoting=csv.QUOTE_NONE
    )

    # Name the column 'token' so the JSON looks nice
    df.columns = ["token"]

    # WRITE: Save as JSON
    df.to_json(output_file, orient='records', indent=4)

    print(f"Success! Converted to {output_file}")
    print(f"Total tokens found: {len(df)}")
    print(df.head()) # Print first 5 items to verify

except Exception as e:
    print(f"An error occurred: {e}")

Success! Converted to o200k_base_vocab.json
Total tokens found: 199998
  token
0   '!'
1   '"'
2   '#'
3   '$'
4   '%'
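
As a cross-check that does not depend on a separator trick, the same list-of-records JSON can be built with the standard library alone. This is an alternative sketch, not the method used in the cells above. The helper name `lines_to_records` is hypothetical, and the sample text is made up.

```python
import json


def lines_to_records(text: str) -> list[dict[str, str]]:
    """Turn one-token-per-line text into [{"token": ...}, ...] records.

    Hypothetical helper: splits on '\n' only, drops a trailing '\r',
    and skips empty lines.
    """
    records = []
    for line in text.split("\n"):
        line = line.rstrip("\r")
        if line:
            records.append({"token": line})
    return records


# Made-up sample standing in for the downloaded vocabulary text.
sample_text = "'!'\n'\"'\n','\n"
records = lines_to_records(sample_text)

# Same shape as DataFrame.to_json(orient='records', indent=4).
json_text = json.dumps(records, indent=4, ensure_ascii=False)
print(json_text)
```

For the real file, the text could be fetched with `urllib.request.urlopen(url).read().decode("utf-8")` and the record count compared with the total reported above.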


Printing where the converted file was saved

In [16]:
import pandas as pd
import csv
import os  # <--- Added to help find the file

input_file_url = "https://raw.githubusercontent.com/BGSTA9/OptiTok_64k/refs/heads/main/mis_files/OpenAI/o200k_base_vocab_list.txt"
output_file = 'o200k_base_vocab.json'

try:
    # 1. READ
    # Using the "Fake Separator" trick to read the whole line as one token
    df = pd.read_csv(
        input_file_url, 
        sep='delimiter_fake', 
        header=None, 
        engine='python', 
        quoting=csv.QUOTE_NONE
    )

    df.columns = ["token"]

    # 2. WRITE
    df.to_json(output_file, orient='records', indent=4)

    # 3. LOCATE
    # Get the current working directory
    current_directory = os.getcwd()
    # Combine it with the filename to get the full path
    full_path = os.path.join(current_directory, output_file)

    print("-" * 30)
    print(f"File conversion {output_file} successfully fulfilled!")
    print(f"You may access the new file from -> {full_path}")

except Exception as e:
    print(f"An error occurred: {e}")

------------------------------
File conversion o200k_base_vocab.json successfully fulfilled!
You may access the new file from -> /Users/soheilsanati/Downloads/OptiTok_64k/mis_files/OpenAI/o200k_base_vocab.json
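
Once the path is known, it can be useful to confirm that the file really holds a list of `{"token": ...}` objects, which is the structure `orient='records'` produces. The following is a minimal sketch using only the standard library. The helper name `check_vocab_json` is hypothetical, and the demonstration runs on a tiny stand-in file, not on the real output.

```python
import json
import os
import tempfile


def check_vocab_json(path):
    """Return the number of records, raising ValueError on an unexpected structure."""
    with open(path, encoding="utf-8") as handle:
        records = json.load(handle)
    if not isinstance(records, list):
        raise ValueError("expected a JSON list of records")
    for record in records:
        if not (isinstance(record, dict) and isinstance(record.get("token"), str)):
            raise ValueError(f"unexpected record: {record!r}")
    return len(records)


# Demonstration on a tiny stand-in file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vocab.json")
    with open(demo_path, "w", encoding="utf-8") as handle:
        json.dump([{"token": "'!'"}, {"token": "'#'"}], handle, indent=4)
    record_count = check_vocab_json(demo_path)

print(record_count)
```

Run against `o200k_base_vocab.json`, the returned count would be expected to match the total printed by the conversion cell (199998 in the run recorded above).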
