## 1_spilt_json.ipynb

This notebook provides a script to split a large JSON file into smaller chunks. The main tasks include:
1. Reading a JSON file whose top level is an object (a dictionary of key-value pairs).
2. Splitting the JSON data into a specified number of smaller files.
3. Saving each chunk as a separate JSON file with a unique name.

### Instructions
- Update the `file_path` variable with the path to your large JSON file.
- Set the number of chunks with the `num_chunks` argument of `split_json_file` (the default is 5).
- Run the notebook to generate smaller JSON files.

### Output
- Multiple smaller JSON files (e.g., `chunk_doc_512_1.json`, `chunk_doc_512_2.json`, etc.), each containing approximately equal portions of the original data.

### Notes
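- Every chunk except the last holds `len(items) // num_chunks` records. The last chunk also takes any remainder when the total number of records is not divisible by `num_chunks`, so it can be up to `num_chunks - 1` records larger than the others.
- The generated files are saved in the current working directory as `chunk_doc_512_1.json`, `chunk_doc_512_2.json`, and so on. The `chunk_doc_512_` prefix is hard-coded in the function.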
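
The index arithmetic behind these notes can be checked in isolation. The sketch below reuses the same start/end calculation as `split_json_file`, without touching any files. The helper name `chunk_bounds` is hypothetical and exists only for illustration.

```python
def chunk_bounds(num_items, num_chunks):
    """Return (start, end) index pairs using the same arithmetic as split_json_file.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    chunk_size = num_items // num_chunks
    bounds = []
    for i in range(num_chunks):
        start_idx = i * chunk_size
        # The last chunk runs to the end of the list and absorbs the remainder.
        end_idx = start_idx + chunk_size if i < num_chunks - 1 else num_items
        bounds.append((start_idx, end_idx))
    return bounds


# 23 records split 5 ways: four chunks of 4 records and a final chunk of 7.
for start, end in chunk_bounds(23, 5):
    print(start, end, end - start)
```

Note that when `num_chunks` exceeds the number of records, `chunk_size` is 0, so every chunk except the last is empty and the last one holds all the data.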

In [None]:
import json

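def split_json_file(file_path, num_chunks=5):
    """Split a JSON object file into num_chunks smaller JSON files."""
    if num_chunks < 1:
        raise ValueError("num_chunks must be at least 1")

    # Read the JSON file (the top level must be an object, i.e. a dict)
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)

    # Convert the dictionary into a list of (key, value) pairs
    items = list(data.items())
    # Integer division: every chunk except the last gets this many records
    chunk_size = len(items) // num_chunks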

    for i in range(num_chunks):
        # Calculate the start and end indexes of the current chunk;
        # the last chunk runs to the end of the list to absorb any remainder
        start_idx = i * chunk_size
        end_idx = start_idx + chunk_size if i < num_chunks - 1 else len(items)

        # Records for the current chunk
        chunk_data = dict(items[start_idx:end_idx])

        # Write the chunk to its own file
        chunk_file_name = f"chunk_doc_512_{i + 1}.json"
        with open(chunk_file_name, 'w', encoding='utf-8') as chunk_file:
            json.dump(chunk_data, chunk_file, ensure_ascii=False, indent=4)

        print(f"Saved {chunk_file_name} with {len(chunk_data)} records.")

In [None]:
# File path
file_path = './chunk_512/chunk_doc_512.json'
split_json_file(file_path, num_chunks=5)
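
After splitting, it can be useful to confirm that no records were lost or duplicated. The following is a minimal sketch of such a check, not part of the original notebook: the helper `merge_chunk_files` is hypothetical, and the demo runs on throwaway data in a temporary directory rather than on the real chunk files.

```python
import json
import os
import tempfile


def merge_chunk_files(chunk_paths):
    """Merge chunk files back into one dict (hypothetical helper, not in the notebook)."""
    merged = {}
    for path in chunk_paths:
        with open(path, 'r', encoding='utf-8') as chunk_file:
            merged.update(json.load(chunk_file))
    return merged


# Self-contained demo: write two chunks of made-up data, then merge them back.
original = {f"doc_{i}": f"text {i}" for i in range(7)}
items = list(original.items())

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for n, part in enumerate([items[:3], items[3:]], start=1):
        path = os.path.join(tmp_dir, f"chunk_doc_512_{n}.json")
        with open(path, 'w', encoding='utf-8') as chunk_file:
            json.dump(dict(part), chunk_file, ensure_ascii=False)
        paths.append(path)
    merged = merge_chunk_files(paths)

# The merged dict should equal the data the chunks were cut from.
print(merged == original)
```

To apply the same check to the real output, pass `[f"chunk_doc_512_{n}.json" for n in range(1, 6)]` to the helper and compare the result with the dictionary loaded from `file_path`.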