# Step 1: Optimized Data Preparation

**Problem:** Large JSON datasets (1GB+) can exhaust memory and crash the process when loaded all at once, and the raw data often contains useless records and fields that must be removed or reformatted to get better results.

**Solution:** Chunk processing + cleaning + formatting  

**Objective:** Produce a clean dataset that is ready for fine-tuning, without memory crashes along the way.
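
The chunked approach described above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: it assumes the input is a JSON Lines file (one JSON object per line), the helper name `iter_chunks` is hypothetical, and the default chunk size of 200 is only a placeholder.

```python
import json
from itertools import islice
from typing import Iterator


def iter_chunks(path: str, chunk_size: int = 200) -> Iterator[list]:
    """Yield lists of parsed records, reading at most chunk_size lines at a time.

    Hypothetical helper: assumes a JSON Lines file (one JSON object per line).
    """
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # Only chunk_size lines are held in memory at once
            lines = list(islice(f, chunk_size))
            if not lines:
                break
            chunk = []
            for line in lines:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                try:
                    chunk.append(json.loads(line))
                except json.JSONDecodeError:
                    continue  # skip malformed lines instead of aborting
            if chunk:
                yield chunk
```

Because the function is a generator, a caller can clean and write each chunk before the next one is read, so peak memory depends on the chunk size rather than on the size of the file.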


## Installation and Imports of Required Libraries

In [2]:
!pip install ijson tqdm psutil gdown



In [3]:
import pandas as pd
import numpy as np
import json
import ijson
import os
import psutil
from tqdm import tqdm
import re
import gdown

# Check if running in Google Colab
try:
    from google.colab import drive
    IN_COLAB = True
    print("Running in Google Colab environment")
except ImportError:
    IN_COLAB = False
    print("Running in local environment")

Running in local environment


## Download Dataset or Mount Google Drive and Define Paths

In [4]:
# Google Drive location of the dataset (the download itself happens in the next cell)
DATASET_URL = "https://drive.google.com/file/d/12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK/view"
FILE_ID = "12zH4mL2RX8iSvH0VCNnd3QxO4DzuHWnK"

# Define paths based on environment
if IN_COLAB:
    # Google Colab: Mount drive and use drive path
    drive.mount('/content/drive')
    BASE_PATH = "/content/drive/MyDrive/Fiap"
    print("Google Drive mounted. Using Colab environment.")
else:
    # Local: Use data folder
    BASE_PATH = "./data"
    # Create data directory if it doesn't exist
    os.makedirs(BASE_PATH, exist_ok=True)
    print("Using local environment. Data will be stored in ./data/")


# Processing pipeline files
RAW_DATA_PATH = f"{BASE_PATH}/trn.json" # Original dataset
CLEAN_DATA_PATH = f"{BASE_PATH}/trn_cleaned.jsonl" # Cleaned dataset
FINAL_DATA_PATH = f"{BASE_PATH}/trn_finetune.jsonl" # Final dataset (Alpaca format)
STATS_PATH = f"{BASE_PATH}/processing_stats.json" # Processing statistics


Using local environment. Data will be stored in ./data/


In [5]:
from dataset_downloader import download_dataset, DatasetDownloader

success = download_dataset(BASE_PATH, RAW_DATA_PATH, FILE_ID, DATASET_URL)
if success:
    downloader = DatasetDownloader(BASE_PATH, FILE_ID, DATASET_URL)
    file_info = downloader.get_file_info(RAW_DATA_PATH)
    print("\nDataset ready for processing:")
    print(f"  Size: {file_info['size_mb']:.1f} MB")
    print(f"  Total lines: {file_info['total_lines']:,}")
    TOTAL_LINES = file_info['total_lines']
    DATASET_SIZE_MB = file_info['size_mb']
else:
    print("ERROR: Dataset download failed!")

Dataset already exists: ./data/trn.json
Verifying JSON file format...
Format: JSON Lines (JSONL) - each line is a JSON object

Dataset ready for processing:
  Size: 179.8 MB
  Total lines: 1,305,265
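
The `dataset_downloader` module is not shown in this notebook. As a rough sketch of what the cell above relies on, the code below skips the download when the file already exists and computes the size and line count without loading the file into memory. The function names `ensure_dataset` and `file_info` are hypothetical, and the real module may work differently; `gdown.download` is called with its `id` and `output` parameters.

```python
import os


def ensure_dataset(raw_path: str, file_id: str) -> bool:
    """Return True if raw_path is available, downloading it if necessary.

    Hypothetical stand-in for download_dataset; not the module's real code.
    """
    if os.path.exists(raw_path):
        print(f"Dataset already exists: {raw_path}")
        return True
    # Imported lazily so the existence check works even without gdown installed
    import gdown
    # Depending on the gdown version, a failed download may return None or raise
    result = gdown.download(id=file_id, output=raw_path, quiet=False)
    return result is not None and os.path.exists(raw_path)


def file_info(path: str, block_size: int = 1024 * 1024) -> dict:
    """Return size in MB and line count, reading the file in fixed-size blocks."""
    size_bytes = os.path.getsize(path)
    total_lines = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total_lines += block.count(b"\n")
            last_byte = block[-1:]
    # Count a final line that has no trailing newline
    if size_bytes > 0 and last_byte != b"\n":
        total_lines += 1
    return {"size_mb": size_bytes / (1024 * 1024), "total_lines": total_lines}
```

Counting newline bytes in binary blocks avoids decoding the text and keeps memory use bounded by `block_size`.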


## Data Structure Analysis

In [6]:
# Import our dataset analyzer
from dataset_analyzer import analyze_dataset, DatasetAnalyzer

# Run complete dataset analysis
print("Running comprehensive dataset analysis...")
analyzer = analyze_dataset(RAW_DATA_PATH, sample_size=50)

# Get recommended settings for processing
recommended_chunk_size = analyzer.get_recommended_chunk_size()
print(f"\nRecommended chunk size for processing: {recommended_chunk_size}")

# Store results for next steps
RECOMMENDED_CHUNK_SIZE = recommended_chunk_size

Running comprehensive dataset analysis...
Memory: 60.6% used (18.4GB/30.5GB)
Counting lines in ./data/trn.json...


Total lines: 1,305,265
Analyzing dataset structure (50 samples)...
Format: JSON Lines (JSONL) - Compatible

=== DATASET ANALYSIS SUMMARY ===
Sample size: 50
Parse errors: 0
Fields found: 5

Field frequency:
  - uid: 50/50 (100.0%)
  - title: 50/50 (100.0%)
  - content: 50/50 (100.0%)
  - target_ind: 50/50 (100.0%)
  - target_rel: 50/50 (100.0%)

String field lengths:
  - uid: avg=10, min=10, max=10
  - title: avg=46, min=21, max=114
  - content: avg=0, min=0, max=0

Example record structure:
  uid: 0000032050
  title: Adult Ballet Tutu Purple
  content: 
  target_ind: []
  target_rel: []
Counting lines in ./data/trn.json...
Total lines: 1,305,265
Memory: 60.8% used (18.5GB/30.5GB)
Recommended chunk size: 200

Recommended chunk size for processing: 200
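
The field frequency and string length figures in the summary above can be reproduced with a short routine like the one below. This is a sketch under assumptions: `dataset_analyzer` is not shown, the name `sample_field_stats` is hypothetical, and it samples the first `sample_size` lines, which may not be how the real analyzer picks its sample.

```python
import json
from collections import Counter
from itertools import islice


def sample_field_stats(path: str, sample_size: int = 50) -> dict:
    """Summarize field presence and string lengths over the first lines of a JSONL file.

    Hypothetical helper; assumes one JSON object per line.
    """
    field_counts = Counter()
    lengths = {}
    parsed = 0
    errors = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in islice(f, sample_size):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                errors += 1
                continue
            if not isinstance(record, dict):
                errors += 1  # valid JSON, but not an object
                continue
            parsed += 1
            for key, value in record.items():
                field_counts[key] += 1
                if isinstance(value, str):
                    lengths.setdefault(key, []).append(len(value))
    string_lengths = {
        key: {"avg": sum(vals) / len(vals), "min": min(vals), "max": max(vals)}
        for key, vals in lengths.items()
    }
    return {
        "sample_size": parsed,
        "parse_errors": errors,
        "field_counts": dict(field_counts),
        "string_lengths": string_lengths,
    }
```

In the output above, `content` has length 0 in all 50 sampled records, so running a check like this with a larger `sample_size` may help confirm whether the field is empty throughout the file or only in the first records.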
