# Creating a Translated Dataset

This notebook demonstrates how to create a function-calling dataset in which a chosen fraction of the queries is translated from English to Polish.

## Setup

First, let's set up our environment and import necessary libraries.

In [None]:
import sys
sys.path.append('..')  # Add parent directory to path

import json
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.auth import login_to_huggingface
from src.dataset import load_function_calling_dataset, parse_json_entry
from src.translation_dataset import (
    create_translated_dataset,
    analyze_translated_dataset
)

# Visual settings
plt.rcParams['figure.figsize'] = (12, 8)
sns.set_style('whitegrid')

## Creating a Small Test Dataset

Before processing the full dataset, let's create a small test dataset with a few translated queries to verify that everything works as expected.

In [None]:
# Path for the test dataset
test_output_path = Path('../data/test_translated_dataset.json')

# Create a small dataset with 10 samples, 50% translated
test_dataset_path = create_translated_dataset(
    output_path=test_output_path,
    sample_size=10,
    translation_percentage=0.5,
    random_seed=42
)

print(f"Test dataset created at: {test_dataset_path}")
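
The source of `create_translated_dataset` is not shown in this notebook, so the following is only a sketch of how a seeded selection of queries to translate might work. `pick_translation_indices` is a hypothetical helper, not part of `src.translation_dataset`, and rounding the fraction to the nearest whole number of samples is an assumption.

```python
import random

def pick_translation_indices(n_samples, translation_percentage, random_seed=42):
    """Hypothetical helper: choose which sample indices would be translated."""
    if not 0.0 <= translation_percentage <= 1.0:
        raise ValueError("translation_percentage must be between 0 and 1")
    # Assumption: the number of translated samples is rounded to the nearest integer.
    n_translate = round(n_samples * translation_percentage)
    rng = random.Random(random_seed)  # local RNG, so global random state is untouched
    return sorted(rng.sample(range(n_samples), n_translate))

# With 10 samples and 50% translation, five distinct indices are selected,
# and the same seed always yields the same selection.
indices = pick_translation_indices(10, 0.5, random_seed=42)
print(len(indices), indices)
```

Fixing the seed in this way would make the test run reproducible, which is presumably why `random_seed=42` is passed above.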

## Analyzing the Test Dataset

Let's analyze the test dataset to verify the translations.

In [None]:
# Load and explore the test dataset
with open(test_output_path, 'r', encoding='utf-8') as f:
    test_data = json.load(f)

# Display the query from every sample in the test dataset
print("Queries in the test dataset:")
for i, sample in enumerate(test_data):
    print(f"Sample {i+1}: {sample['query']}\n")

In [None]:
# Analyze the language distribution
test_summary = analyze_translated_dataset(test_output_path)
print("Language distribution in the test dataset:")
test_summary
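
`analyze_translated_dataset` is treated as a black box here. Judging by how its result is used later in this notebook, it returns a table with `Language`, `Count`, and `Percentage` columns. A minimal, pandas-free sketch of an equivalent summary is shown below. `summarize_languages` is a hypothetical name, the diacritic check mirrors the simplified heuristic used later in this notebook, and the 0 to 100 percentage scale is an assumption.

```python
from collections import Counter

POLISH_DIACRITICS = set('ąćęłńóśźżĄĆĘŁŃÓŚŹŻ')

def summarize_languages(queries):
    """Hypothetical helper: count queries per language and compute percentages."""
    counts = Counter(
        'Polish' if any(c in POLISH_DIACRITICS for c in q) else 'English'
        for q in queries
    )
    total = sum(counts.values())
    # Assumption: percentages are reported on a 0-100 scale.
    return [
        {'Language': lang, 'Count': n, 'Percentage': 100.0 * n / total}
        for lang, n in sorted(counts.items())
    ]

# Illustrative queries only; they are not taken from the dataset.
summary = summarize_languages([
    'What is the weather in Warsaw?',
    'Jaka jest pogoda w Łodzi?',
    'Book a table for two.',
    'Znajdź najbliższą aptekę.',
])
for row in summary:
    print(row)
```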

## Creating the Full Translated Dataset

Now, let's create the full dataset with 40% of queries translated to Polish. Depending on the size of the dataset, this may take some time to complete.

In [None]:
# Path for the full dataset
full_output_path = Path('../data/xlam_function_calling_pl.json')

# Option 1: Creating a full dataset with all samples (this will take a long time)
# full_dataset_path = create_translated_dataset(
#     output_path=full_output_path,
#     translation_percentage=0.4,
#     random_seed=42
# )

# Option 2: Creating a more manageable dataset (e.g., 1000 samples)
sample_size = 1000  # Adjust based on your needs and resources
full_dataset_path = create_translated_dataset(
    output_path=full_output_path,
    sample_size=sample_size,
    translation_percentage=0.4,
    random_seed=42
)

print(f"Full dataset created at: {full_dataset_path}")

## Analyzing the Full Dataset

Now, let's analyze the full dataset to verify that approximately 40% of the queries have been translated.

In [None]:
# Analyze the language distribution
full_summary = analyze_translated_dataset(full_output_path)
print("Language distribution in the full dataset:")
full_summary
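
Because the heuristic language check can miss some Polish queries, the observed share may differ slightly from the 40% target. The sketch below shows one way to turn "approximately 40%" into an explicit check. `share_within_tolerance` is a hypothetical helper, and the tolerance of 2 percentage points is an arbitrary assumption to adjust as needed.

```python
def share_within_tolerance(polish_count, total_count, target=0.4, tolerance=0.02):
    """Hypothetical check: is the observed Polish share close to the target?"""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    observed = polish_count / total_count
    return abs(observed - target) <= tolerance

# Made-up counts for illustration: 395 of 1000 is 39.5%, within
# 2 percentage points of the 40% target; 300 of 1000 is not.
print(share_within_tolerance(395, 1000))
print(share_within_tolerance(300, 1000))
```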

In [None]:
# Visualize the language distribution
plt.figure(figsize=(10, 6))
sns.barplot(x='Language', y='Percentage', data=full_summary)
plt.title('Language Distribution in the Translated Dataset')
plt.xlabel('Language')
plt.ylabel('Percentage of Queries')
plt.tight_layout()
plt.savefig('../data/language_distribution.png')
plt.show()

## Examining Examples of Translated Queries

Let's print up to five English and five Polish queries to spot-check the translation quality.

In [None]:
# Load the full dataset
with open(full_output_path, 'r', encoding='utf-8') as f:
    full_data = json.load(f)

# Simplified language check: a query counts as Polish if it contains any
# Polish diacritic, so Polish text written without diacritics is treated as English
def is_polish(text):
    return any(c in text for c in 'ąćęłńóśźżĄĆĘŁŃÓŚŹŻ')

# Get examples of English and Polish queries
english_examples = []
polish_examples = []

for sample in full_data:
    query = sample['query']
    if len(english_examples) < 5 and not is_polish(query):
        english_examples.append(query)
    elif len(polish_examples) < 5 and is_polish(query):
        polish_examples.append(query)
        
    if len(english_examples) >= 5 and len(polish_examples) >= 5:
        break

# Display examples
print("Examples of English Queries:")
for i, query in enumerate(english_examples):
    print(f"{i+1}. {query}\n")

print("\nExamples of Polish Queries:")
for i, query in enumerate(polish_examples):
    print(f"{i+1}. {query}\n")
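
The diacritic check above is deliberately simple and will miss Polish queries that happen to contain no diacritics. As a sketch of one possible refinement, the variant below adds a fallback on a few common Polish words. `looks_polish` and its word list are hypothetical and illustrative, not part of the project code, and the list is far from exhaustive.

```python
import re

POLISH_DIACRITICS = set('ąćęłńóśźżĄĆĘŁŃÓŚŹŻ')
# Small, illustrative list of common Polish words; an assumption, not exhaustive.
POLISH_HINT_WORDS = {'jest', 'nie', 'czy', 'dla', 'oraz', 'jaka', 'jakie', 'ile', 'gdzie'}

def looks_polish(text):
    """Hypothetical variant of is_polish with a word-based fallback."""
    if any(c in POLISH_DIACRITICS for c in text):
        return True
    words = set(re.findall(r'\w+', text.lower()))
    return bool(words & POLISH_HINT_WORDS)

print(looks_polish('Jaka jest pogoda w Warszawie?'))   # no diacritics, still detected
print(looks_polish('What is the weather in Warsaw?'))
```

For anything beyond a spot check, a dedicated language-identification library would likely be more reliable than either heuristic.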

## Verifying the Dataset Structure

Let's verify that the dataset structure is maintained, with only the queries translated.

In [None]:
# Select a sample with a Polish query
polish_sample = None
for sample in full_data:
    if is_polish(sample['query']):
        polish_sample = sample
        break

if polish_sample:
    # Parse the sample to examine its structure
    parsed_sample = parse_json_entry(polish_sample)
    
    print("Structure of a sample with a Polish query:")
    print(f"Query: {parsed_sample['query']}\n")
    
    print("Tools:")
    for tool in parsed_sample['tools'][:2]:  # Show first two tools
        print(f"- Name: {tool['name']}")
        print(f"  Description: {tool['description']}")
        print(f"  Parameters: {list(tool['parameters'].keys())}\n")
    
    print("Answers:")
    for answer in parsed_sample['answers'][:2]:  # Show first two answers
        print(f"- Tool: {answer['name']}")
        print(f"  Arguments: {answer['arguments']}\n")
else:
    print("No sample with a Polish query found.")
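
Printing one sample shows that the expected fields are present, but it does not prove that they are unchanged. If each translated sample can be paired with its English original (which this notebook does not load), a stricter check could compare every field except `query`. The sketch below uses a hypothetical helper, `non_query_fields_match`, and a made-up sample pair; storing `tools` and `answers` as JSON strings is an assumption.

```python
def non_query_fields_match(original, translated):
    """Hypothetical check: every field except 'query' is identical."""
    if set(original) != set(translated):
        return False
    return all(original[k] == translated[k] for k in original if k != 'query')

# Made-up sample pair for illustration only.
original = {
    'query': 'What is the weather in Warsaw?',
    'tools': '[{"name": "get_weather"}]',
    'answers': '[{"name": "get_weather", "arguments": {"city": "Warsaw"}}]',
}
translated = dict(original, query='Jaka jest pogoda w Warszawie?')

print(non_query_fields_match(original, translated))
```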

## Saving Dataset Statistics

Finally, let's save some statistics about the dataset.

In [None]:
# Prepare dataset statistics
def summary_value(language, column):
    """Return one cell of full_summary, or 0 if the language is absent."""
    rows = full_summary[full_summary['Language'] == language]
    return rows[column].iloc[0] if not rows.empty else 0

stats = {
    "total_samples": len(full_data),
    "english_samples": int(summary_value('English', 'Count')),
    "polish_samples": int(summary_value('Polish', 'Count')),
    "polish_percentage": float(summary_value('Polish', 'Percentage')),
}

# Display statistics
for key, value in stats.items():
    print(f"{key}: {value}")

# Save statistics to file
stats_path = Path('../data/dataset_stats.json')
with open(stats_path, 'w', encoding='utf-8') as f:
    json.dump(stats, f, indent=2)

print(f"\nStatistics saved to {stats_path}")

## Conclusion

In this notebook, we have:

1. Created a small test dataset to verify the translation functionality
2. Created a larger dataset with approximately 40% of queries translated to Polish
3. Analyzed and visualized the language distribution in the dataset
4. Verified that the dataset structure is maintained, with only the queries translated
5. Saved statistics about the dataset

The resulting dataset can now be used to fine-tune the PLLUM model for function-calling tasks in both English and Polish.