# Download & Process HuggingFace Datasets for Keyboard Training

This notebook downloads and processes datasets for training a keyboard suggestion model:

1. **HuggingFaceTB/everyday-conversations-llama3.1-2k**: Extracts completion text
2. **allenai/prosocial-dialog**: Extracts context + response

**Features:**
- Automatic split selection (`train` if present, otherwise the first available split)
- Sentence splitting at punctuation (., !, ?)
- Text cleaning and normalization
- Deduplication
- Quality filtering

**Output:** Clean text files ready for model training

## Step 1: Install Dependencies

In [None]:
# Install required libraries
!pip install datasets huggingface_hub pandas tqdm

## Step 2: Import Libraries

In [None]:
from datasets import load_dataset
import re
from tqdm import tqdm
import pandas as pd

print("✓ Libraries imported successfully!")

## Step 3: Define Text Processing Functions

In [None]:
def split_into_sentences(text):
    """
    Split text into sentences at punctuation marks (., !, ?).
    Splits wherever punctuation is followed by whitespace, so abbreviations
    such as "e.g. " or "Dr. " are also treated as sentence boundaries.
    """
    if not text or pd.isna(text):
        return []
    
    # Clean text
    text = str(text).strip()
    
    # Split at sentence boundaries (., !, ?) followed by whitespace.
    # The capture group keeps the punctuation, so the result alternates
    # [text, separator, text, separator, ..., text] and always has odd length.
    sentences = re.split(r'([.!?]+\s+)', text)
    
    # Recombine each sentence with the punctuation that followed it
    result = []
    for i in range(0, len(sentences)-1, 2):
        sentence = (sentences[i] + sentences[i+1]).strip()
        if sentence:
            result.append(sentence)
    
    # Add the trailing piece (the text after the last split point), if any.
    # It may or may not end with punctuation.
    if sentences[-1].strip():
        result.append(sentences[-1].strip())
    
    return result

def clean_text(text):
    """
    Clean and normalize text for keyboard training.
    """
    if not text or pd.isna(text):
        return ""
    
    text = str(text)
    
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text)
    
    # Remove leading/trailing whitespace
    text = text.strip()
    
    return text

def is_valid_sentence(sentence, min_length=10, max_length=500):
    """
    Check if a sentence is valid for training.
    
    Filters out:
    - Too short sentences (< min_length chars, default 10)
    - Too long sentences (> max_length chars, default 500)
    - Sentences with excessive special characters
    - URLs and email addresses
    """
    if not sentence:
        return False
    
    # Length check
    if len(sentence) < min_length or len(sentence) > max_length:
        return False
    
    # Filter URLs
    if re.search(r'http[s]?://|www\.', sentence):
        return False
    
    # Filter email addresses
    if re.search(r'\S+@\S+\.\S+', sentence):
        return False
    
    # Check for excessive special characters
    special_char_ratio = len(re.findall(r'[^a-zA-Z0-9\s.,!?\'\"\-]', sentence)) / len(sentence)
    if special_char_ratio > 0.3:
        return False
    
    # Must contain at least some letters
    if not re.search(r'[a-zA-Z]', sentence):
        return False
    
    return True

def get_dataset_split(dataset):
    """
    Get the first available split from a dataset.
    Returns the split name and the data.
    """
    available_splits = list(dataset.keys())
    if not available_splits:
        raise ValueError("Dataset has no splits!")
    
    # Prefer 'train' if available, otherwise use first split
    if 'train' in available_splits:
        split_name = 'train'
    else:
        split_name = available_splits[0]
    
    return split_name, dataset[split_name]

print("✓ Text processing functions defined!")

## Step 4: Download Dataset 1 - Everyday Conversations

In [None]:
print("Downloading HuggingFaceTB/everyday-conversations-llama3.1-2k...\n")

# Load dataset
dataset1 = load_dataset("HuggingFaceTB/everyday-conversations-llama3.1-2k")

print(f"✓ Dataset loaded!")
print(f"\nDataset structure:")
print(dataset1)

# Get available split
split1_name, split1_data = get_dataset_split(dataset1)
print(f"\nUsing split: '{split1_name}'")
print(f"Number of entries: {len(split1_data):,}")

# Show sample
print(f"\nSample entry:")
print(split1_data[0])

# Show available fields
print(f"\nAvailable fields: {list(split1_data[0].keys())}")

## Step 5: Process Dataset 1 - Extract Completions

In [None]:
print("Processing everyday-conversations dataset...\n")

sentences_dataset1 = []

# Extract completion field from each entry
for entry in tqdm(split1_data, desc="Processing entries"):
    # Get completion text (try different field names)
    completion = entry.get('completion', entry.get('text', entry.get('response', '')))
    
    if completion:
        # Clean text
        cleaned = clean_text(completion)
        
        # Split into sentences
        sentences = split_into_sentences(cleaned)
        
        # Filter valid sentences
        for sentence in sentences:
            if is_valid_sentence(sentence):
                sentences_dataset1.append(sentence)

print(f"\n✓ Extracted {len(sentences_dataset1):,} sentences from everyday-conversations")
print(f"\nSample sentences:")
for i, sent in enumerate(sentences_dataset1[:5]):
    print(f"{i+1}. {sent}")

## Step 6: Download Dataset 2 - Prosocial Dialog

In [None]:
print("Downloading allenai/prosocial-dialog...\n")

# Load dataset
dataset2 = load_dataset("allenai/prosocial-dialog")

print(f"✓ Dataset loaded!")
print(f"\nDataset structure:")
print(dataset2)

# Get available split
split2_name, split2_data = get_dataset_split(dataset2)
print(f"\nUsing split: '{split2_name}'")
print(f"Number of entries: {len(split2_data):,}")

# Show sample
print(f"\nSample entry:")
print(split2_data[0])

# Show available fields
print(f"\nAvailable fields: {list(split2_data[0].keys())}")

## Step 7: Process Dataset 2 - Extract Context + Response

In [None]:
print("Processing prosocial-dialog dataset...\n")

sentences_dataset2 = []

# Extract context and response from each entry
for entry in tqdm(split2_data, desc="Processing entries"):
    # Get context and response (try different field names)
    context = entry.get('context', entry.get('prompt', ''))
    response = entry.get('response', entry.get('completion', entry.get('text', '')))
    
    # Process context
    if context:
        cleaned = clean_text(context)
        sentences = split_into_sentences(cleaned)
        for sentence in sentences:
            if is_valid_sentence(sentence):
                sentences_dataset2.append(sentence)
    
    # Process response
    if response:
        cleaned = clean_text(response)
        sentences = split_into_sentences(cleaned)
        for sentence in sentences:
            if is_valid_sentence(sentence):
                sentences_dataset2.append(sentence)

print(f"\n✓ Extracted {len(sentences_dataset2):,} sentences from prosocial-dialog")
print(f"\nSample sentences:")
for i, sent in enumerate(sentences_dataset2[:5]):
    print(f"{i+1}. {sent}")

## Step 8: Combine and Deduplicate

In [None]:
print("Combining and deduplicating sentences...\n")

# Combine all sentences
all_sentences = sentences_dataset1 + sentences_dataset2

print(f"Total sentences before deduplication: {len(all_sentences):,}")

# Deduplicate (case-sensitive)
unique_sentences = list(set(all_sentences))

print(f"Unique sentences after deduplication: {len(unique_sentences):,}")
print(f"Duplicates removed: {len(all_sentences) - len(unique_sentences):,}")

# Sort by length (shorter sentences first), breaking ties alphabetically.
# set() has no stable order, so the tie-break keeps the output reproducible.
unique_sentences.sort(key=lambda s: (len(s), s))

print(f"\n✓ Sentences sorted by length")

## Step 9: Save to Text Files

In [None]:
print("Saving sentences to text files...\n")

# Save all sentences to one file
with open('keyboard_training_data.txt', 'w', encoding='utf-8') as f:
    for sentence in unique_sentences:
        f.write(sentence + '\n')

print(f"✓ Saved {len(unique_sentences):,} sentences to 'keyboard_training_data.txt'")

# Also save separate files for each dataset (optional)
with open('everyday_conversations.txt', 'w', encoding='utf-8') as f:
    for sentence in sentences_dataset1:
        f.write(sentence + '\n')

print(f"✓ Saved {len(sentences_dataset1):,} sentences to 'everyday_conversations.txt'")

with open('prosocial_dialog.txt', 'w', encoding='utf-8') as f:
    for sentence in sentences_dataset2:
        f.write(sentence + '\n')

print(f"✓ Saved {len(sentences_dataset2):,} sentences to 'prosocial_dialog.txt'")

print("\n" + "="*60)
print("PROCESSING COMPLETE!")
print("="*60)
print("\nDownload files from the Files panel on the left.")

## Step 10: Dataset Statistics & Analysis

In [None]:
import statistics

print("\n" + "="*60)
print("DATASET STATISTICS")
print("="*60)

# Length statistics
lengths = [len(s) for s in unique_sentences]

print(f"\nSentence count: {len(unique_sentences):,}")
print(f"\nLength statistics:")
print(f"  Shortest: {min(lengths)} characters")
print(f"  Longest: {max(lengths)} characters")
print(f"  Average: {sum(lengths)/len(lengths):.1f} characters")
# statistics.median averages the two middle values for even-sized lists
print(f"  Median: {statistics.median(lengths)} characters")

# Word count statistics
word_counts = [len(s.split()) for s in unique_sentences]
print(f"\nWord count statistics:")
print(f"  Average words per sentence: {sum(word_counts)/len(word_counts):.1f}")
print(f"  Total words: {sum(word_counts):,}")

# Sample sentences by length
print(f"\nSample short sentences (10-30 chars):")
short_sentences = [s for s in unique_sentences if 10 <= len(s) <= 30]
for sent in short_sentences[:5]:
    print(f"  - {sent}")

print(f"\nSample medium sentences (50-100 chars):")
medium_sentences = [s for s in unique_sentences if 50 <= len(s) <= 100]
for sent in medium_sentences[:5]:
    print(f"  - {sent}")

print(f"\nSample long sentences (150-200 chars):")
long_sentences = [s for s in unique_sentences if 150 <= len(s) <= 200]
for sent in long_sentences[:5]:
    print(f"  - {sent}")

## Summary & Recommendations

### What This Notebook Does:

1. **Downloads** two HuggingFace datasets
2. **Selects** a split automatically (`train` if present, otherwise the first split listed)
3. **Extracts** relevant text fields (completion, context, response)
4. **Splits** long text into sentences at punctuation (., !, ?)
5. **Cleans** and normalizes text
6. **Filters** invalid sentences (too short/long, URLs, etc.)
7. **Deduplicates** to remove redundant data
8. **Saves** to text files ready for training

### Output Files:

- **`keyboard_training_data.txt`**: All unique sentences (recommended for training)
- **`everyday_conversations.txt`**: Only everyday conversations
- **`prosocial_dialog.txt`**: Only prosocial dialog

### Recommendations for Better Training:

#### 1. **Add More Diverse Datasets**
Consider adding:
- **SMS/Text messages**: More casual, short-form text
- **Social media**: Twitter, Reddit comments
- **Chat logs**: Discord, Slack conversations
- **Email datasets**: Professional communication

#### 2. **Sentence Length Optimization**
- **Focus on 10-100 character sentences** (most common in typing)
- **Weight shorter sentences more** (people type short messages more often)
- Consider creating separate training sets by length
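
The length-based ideas above could be sketched roughly as follows. The bucket edges, the `pivot` value, and the helper names `bucket_by_length` and `length_weight` are illustrative assumptions, not part of this notebook; the weighting formula is just one simple option.

```python
from collections import defaultdict

# Hypothetical bucket edges in characters; tune these for your own data.
LENGTH_BUCKETS = [(10, 30), (31, 100), (101, 500)]

def bucket_by_length(sentences, buckets=LENGTH_BUCKETS):
    """Group sentences into (low, high) character-length buckets."""
    grouped = defaultdict(list)
    for sentence in sentences:
        for low, high in buckets:
            if low <= len(sentence) <= high:
                grouped[(low, high)].append(sentence)
                break
    return dict(grouped)

def length_weight(sentence, pivot=100):
    """Assumed sampling weight: 1.0 up to `pivot` chars, then decaying as pivot/len."""
    return min(1.0, pivot / max(len(sentence), 1))

demo = ["How are you?", "See you at the station around five o'clock tomorrow."]
demo_buckets = bucket_by_length(demo)
demo_weights = [length_weight(s) for s in demo]
```

Each bucket can be written to its own file, and the weights can be passed to a sampler such as `random.choices(sentences, weights=...)`.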

#### 3. **Context-Aware Training**
- Keep **conversation pairs** together (context → response)
- Train on **sentence transitions** (what typically follows what)
- Include **common phrases** and **idioms**
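
One way to keep conversation pairs together is to store each context and response on a single line instead of splitting them into independent sentences. The sketch below assumes entries expose `context` and `response` fields, as the prosocial-dialog processing step does; `extract_pairs`, `format_pair`, and the tab separator are hypothetical choices.

```python
def extract_pairs(entries, context_key="context", response_key="response"):
    """Return (context, response) tuples, skipping entries missing either field."""
    pairs = []
    for entry in entries:
        # Collapse all whitespace runs so the text contains no tabs or newlines
        context = " ".join(str(entry.get(context_key) or "").split())
        response = " ".join(str(entry.get(response_key) or "").split())
        if context and response:
            pairs.append((context, response))
    return pairs

def format_pair(context, response, separator="\t"):
    """Serialize one pair as a single line: context, separator, response."""
    return f"{context}{separator}{response}"

demo_entries = [
    {"context": "I  lost my keys.", "response": "Have you checked your bag?"},
    {"context": "", "response": "This entry has no context and is skipped."},
]
demo_pairs = extract_pairs(demo_entries)
```

Because whitespace is normalized first, the tab separator cannot appear inside either field, so each line can be split back into its two parts unambiguously.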

#### 4. **Domain-Specific Data**
Add datasets for specific use cases:
- **Technical writing** (if users are developers)
- **Customer service** (if for support chat)
- **Casual chat** (if for messaging apps)

#### 5. **Data Augmentation**
- **Typo injection**: Add common typos to learn corrections
- **Abbreviation expansion**: "u" → "you", "ur" → "your"
- **Case variations**: Train on mixed case
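
As a rough illustration of typo injection and abbreviation expansion, the sketch below swaps some characters for a neighbouring key and expands a small abbreviation table. The `NEIGHBOURS` map is a deliberately tiny, hypothetical excerpt of a QWERTY layout, and `inject_typos` and `expand_abbreviations` are illustrative helpers; a realistic setup would need a fuller key map and a larger abbreviation list.

```python
import random

# Hypothetical, incomplete map from a key to some of its QWERTY neighbours.
NEIGHBOURS = {"a": "qsz", "e": "wrd", "i": "uok", "o": "ipl",
              "t": "ryg", "n": "bm", "s": "adw"}

# The two abbreviations mentioned in the list above.
ABBREVIATIONS = {"u": "you", "ur": "your"}

def inject_typos(sentence, rate=0.1, rng=None):
    """Replace each mapped character with a neighbouring key with probability `rate`."""
    rng = rng or random.Random()
    chars = []
    for ch in sentence:
        options = NEIGHBOURS.get(ch.lower())
        if options and rng.random() < rate:
            chars.append(rng.choice(options))
        else:
            chars.append(ch)
    return "".join(chars)

def expand_abbreviations(sentence, table=ABBREVIATIONS):
    """Expand whole-word abbreviations; words with attached punctuation are left as-is."""
    return " ".join(table.get(word.lower(), word) for word in sentence.split())

noisy = inject_typos("see you soon", rate=0.2, rng=random.Random(0))
expanded = expand_abbreviations("is this ur bag")
```

Pairing each noisy sentence with its clean original gives (typo, correction) examples; passing a seeded `random.Random` keeps the augmentation reproducible.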

#### 6. **Quality Improvements**
- **Remove repetitive patterns** ("I think I think I think")
- **Filter profanity** (if needed)
- **Balance dataset** (equal representation of question/statement/exclamation)
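
The repetition filter and the balancing step might look something like the sketch below. The regular expression only catches a phrase of one to three words repeated at least three times in a row, and sentence type is guessed from the final punctuation mark alone; both are simplifying assumptions, and the function names are hypothetical.

```python
import random
import re
from collections import defaultdict

# A 1-3 word phrase followed by at least two more consecutive copies of itself.
REPEAT_PATTERN = re.compile(r"\b(\w+(?:\s+\w+){0,2})(?:\s+\1\b){2,}")

def has_repetition(sentence):
    """True if a short phrase is repeated three or more times in a row."""
    return REPEAT_PATTERN.search(sentence.lower()) is not None

def sentence_type(sentence):
    """Classify by final punctuation: question, exclamation, or statement."""
    stripped = sentence.rstrip()
    if stripped.endswith("?"):
        return "question"
    if stripped.endswith("!"):
        return "exclamation"
    return "statement"

def balance_by_type(sentences, seed=0):
    """Downsample every sentence type to the size of the smallest type present."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[sentence_type(sentence)].append(sentence)
    if not groups:
        return []
    target = min(len(group) for group in groups.values())
    rng = random.Random(seed)
    balanced = []
    for name in sorted(groups):
        balanced.extend(rng.sample(groups[name], target))
    return balanced

demo = ["Are you home?", "Is it late?", "Watch out!", "I am on my way."]
demo_balanced = balance_by_type(demo)
```

Downsampling discards data, so on a small corpus it may be preferable to oversample the rarer types instead.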

#### 7. **Tokenization Strategy**
For a tiny keyboard model:
- Use **word-level** or **subword tokenization** (BPE)
- Keep vocabulary **small** (10k-30k tokens)
- Focus on **most frequent words** from your `single_word_freq.csv`
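
For the word-level option, a small frequency-capped vocabulary can be built directly from the sentences. This is a minimal sketch rather than a BPE implementation; the token pattern, the special tokens `<pad>` and `<unk>`, and the helpers `build_vocab` and `encode` are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed token pattern: lowercase letters, digits, and apostrophes.
WORD_PATTERN = re.compile(r"[a-z0-9']+")

def build_vocab(sentences, max_size=10_000, specials=("<pad>", "<unk>")):
    """Map the most frequent words to integer ids, reserving ids for special tokens."""
    counts = Counter()
    for sentence in sentences:
        counts.update(WORD_PATTERN.findall(sentence.lower()))
    # Sort by descending count, then alphabetically, so ids are reproducible
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    words = list(specials) + [word for word, _ in ranked[: max_size - len(specials)]]
    return {word: index for index, word in enumerate(words)}

def encode(sentence, vocab, unk="<unk>"):
    """Convert a sentence to ids, mapping out-of-vocabulary words to `unk`."""
    return [vocab.get(word, vocab[unk]) for word in WORD_PATTERN.findall(sentence.lower())]

demo_vocab = build_vocab(["how are you", "how is it"], max_size=4)
demo_ids = encode("How are they", demo_vocab)
```

With `max_size=4` only the two highest-ranked words survive next to the special tokens, so "they" falls back to the `<unk>` id.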

#### 8. **Training Format**
Consider creating:
- **Prefix → completion** pairs ("How are" → "you")
- **N-gram sequences** (trigrams, 4-grams)
- **Masked language modeling** ("I ___ going to the store")
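
The three formats above can be generated from the cleaned sentences with a few lines each. The sketch below uses plain whitespace tokenization, and the helper names and the `___` mask token are illustrative assumptions.

```python
def prefix_completion_pairs(sentence, max_prefix_words=None):
    """Return (prefix, next_word) pairs for every word after the first."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        # Optionally keep only the last `max_prefix_words` words of the prefix
        start = 0 if max_prefix_words is None else max(0, i - max_prefix_words)
        pairs.append((" ".join(words[start:i]), words[i]))
    return pairs

def ngrams(sentence, n=3):
    """Return all consecutive word n-grams as tuples (empty if the sentence is too short)."""
    words = sentence.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def mask_word(sentence, index, mask_token="___"):
    """Return (masked_sentence, hidden_word) with the word at `index` replaced."""
    words = sentence.split()
    hidden = words[index]
    words[index] = mask_token
    return " ".join(words), hidden

demo_pairs = prefix_completion_pairs("How are you")
demo_trigrams = ngrams("I am going to the store", n=3)
demo_masked = mask_word("I am going to the store", 1)
```

Limiting `max_prefix_words` keeps prefixes short, which may suit a small model that only sees the last few typed words.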

### Next Steps:

1. Download and review the generated text files
2. Combine with your existing `single_word_freq.csv` for vocabulary
3. Create training pairs (input → output)
4. Train a small transformer model (e.g., GPT-2 tiny, BERT tiny)
5. Evaluate on real typing scenarios