# N-gram Character Prediction with Kneser-Ney Smoothing


## 1. Setup and Installation

In [None]:
# Install required dependencies
!pip install -q datasets

## 2. Upload Code Files

In [None]:
# Create necessary directories
!mkdir -p src work output

# Upload your NGram_Model.py file
from google.colab import files

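print("Please upload your NGram_Model.py file:")
uploaded = files.upload()  # dict mapping each uploaded filename to its bytes

# Fail early if the expected file was not uploaded (e.g. it has a different name)
if "NGram_Model.py" not in uploaded:
    raise FileNotFoundError("Expected a file named NGram_Model.py")

# Move to src directory
!mv NGram_Model.py src/NGram_Model.py
print("File uploaded to src/NGram_Model.py")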

Verify the file is in place:

In [None]:
!ls -lh src/NGram_Model.py

## 3. Train the Model

Train the N-gram model on the WikiText-103 dataset. Training takes roughly 10-20 minutes.

In [None]:
# Train with default parameters (n=4, vocab_size=1000)
!python src/NGram_Model.py train \
    --work_dir work \
    --n 4 \
    --vocab_size 1000 \
    --discount 0.75

### Alternative: Quick Training (for testing)

For a faster run, use a lower n-gram order and a smaller vocabulary:

In [None]:
# Faster training with smaller model (uncomment to use)
# !python src/NGram_Model.py train \
#     --work_dir work \
#     --n 3 \
#     --vocab_size 500 \
#     --discount 0.75

Check the trained model checkpoint:

In [None]:
!ls -lh work/model.checkpoint

## 4. Quick Evaluation (1000 samples)

First, let's do a quick evaluation on 1000 samples from the validation set:

In [None]:
!python src/NGram_Model.py evaluate \
    --work_dir work \
    --split validation \
    --max_samples 1000 \
    --verbose
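
The metric reported by `evaluate` is not shown in this notebook. Assuming it is top-3 next-character accuracy (a guess based on the three candidates printed in section 9), the computation could be sketched as follows. The function name `top_k_accuracy` is hypothetical and is not taken from `NGram_Model.py`.

```python
def top_k_accuracy(predictions, targets, k=3):
    """Return the fraction of targets found among the first k guesses.

    predictions: one sequence of candidate characters per sample
    targets:     the true next character for each sample
    """
    if not targets:
        return 0.0
    hits = sum(
        1
        for guesses, target in zip(predictions, targets)
        if target in guesses[:k]
    )
    return hits / len(targets)


# Example: the first sample is a hit ('a' is among "eao"), the second is a miss.
score = top_k_accuracy(["eao", "xyz"], ["a", "q"])
print(f"Top-3 accuracy: {score:.2f}")
```

A metric of this kind rewards the model whenever the true character appears anywhere in its short list, which is why the interactive cells print three candidates rather than one.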

## 5. Full Validation Evaluation

Evaluate on the complete validation set:

In [None]:
!python src/NGram_Model.py evaluate \
    --work_dir work \
    --split validation \
    --verbose

## 6. Test Set Evaluation

Final evaluation on the test set:

In [None]:
!python src/NGram_Model.py evaluate \
    --work_dir work \
    --split test \
    --verbose

## 7. Generate Predictions File

Generate predictions on the test set and save them to a file:

In [None]:
!python src/NGram_Model.py test \
    --work_dir work \
    --split test \
    --test_output output/predictions.txt

# Show first 10 predictions
!head -10 output/predictions.txt

## 8. Download Results

Download the trained model and predictions:

In [None]:
# Create a zip file with model and predictions
!zip -r results.zip work/ output/

# Download the zip file
from google.colab import files
files.download('results.zip')

print("✓ Downloaded results.zip containing:")
print("  - work/model.checkpoint (trained model)")
print("  - output/predictions.txt (test predictions)")

## 9. Interactive Testing

Test the model interactively with custom input:

In [None]:
# Load the model and test with custom inputs
import sys
sys.path.append('src')

from NGram_Model import MyModel

# Load trained model
print("Loading model...")
model = MyModel.load('work')
print(f"Model loaded: n={model.n}, vocab_size={len(model.vocab) if model.vocab else 'unlimited'}")

# Test with some example inputs
test_inputs = [
    "Hello, my name is",
    "The quick brown",
    "Machine learning is",
    "I love",
    "Python is a programming",
]

print("\n" + "="*60)
print("Testing predictions:")
print("="*60)

for inp in test_inputs:
    preds = model._get_top_candidates(inp)
    print(f"\nInput: '{inp}'")
    # Slice instead of indexing so a short candidate list cannot raise IndexError
    print("Top 3 predictions: " + ", ".join(repr(p) for p in preds[:3]))

### Test with Your Own Input

Try the model with your custom text:

In [None]:
# Enter your own text to test
custom_text = "Once upon a time"  # Change this!

predictions = model._get_top_candidates(custom_text)
print(f"Input: '{custom_text}'")
print(f"Top 3 next character predictions: {predictions}")
# Slice instead of indexing so a short candidate list cannot raise IndexError
print(f"As string: '{''.join(map(str, predictions[:3]))}'")

## 10. Model Analysis

Analyze the trained model:

In [None]:
# Show model statistics
print("Model Statistics:")
print("="*60)
print(f"N-gram order: {model.n}")
print(f"Vocabulary size: {len(model.vocab) if model.vocab else 'unlimited'}")
print(f"Discount parameter: {model.discount}")
print(f"Total unique contexts: {len(model.ngram_counts)}")
print(f"Total unigram count: {sum(model.unigram_counts.values())}")
print(f"Total continuation count: {model.total_continuation}")

# Show most common characters
print("\nTop 20 most common characters:")
print("="*60)
for char, count in model.unigram_counts.most_common(20):
    char_display = repr(char) if char != ' ' else "'space'"
    print(f"{char_display:>10s}: {count:>10,}")

## Notes

### Model Parameters:
- **n=4**: 4-gram model (uses up to 3 characters of context)
- **vocab_size=1000**: Keep only the 1000 most frequent characters
- **discount=0.75**: Kneser-Ney discount parameter
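
To make the role of the discount concrete, here is a minimal sketch of interpolated Kneser-Ney smoothing for the simplest case, a character bigram model. It is not the code in `NGram_Model.py`, which is not shown here and handles higher orders. The names `train_bigram_kn` and `kn_prob` are hypothetical. The idea: subtract `discount` from every observed count, then redistribute the freed probability mass according to how many distinct contexts each character follows.

```python
from collections import Counter, defaultdict


def train_bigram_kn(text):
    """Collect the counts needed for interpolated Kneser-Ney (bigram case)."""
    bigram_counts = defaultdict(Counter)  # previous char -> Counter of next chars
    for prev, nxt in zip(text, text[1:]):
        bigram_counts[prev][nxt] += 1

    # Continuation count: in how many distinct contexts does each char appear?
    continuation = Counter()
    for followers in bigram_counts.values():
        for ch in followers:
            continuation[ch] += 1
    return bigram_counts, continuation


def kn_prob(ch, prev, bigram_counts, continuation, discount=0.75):
    """Estimate P(ch | prev) with interpolated Kneser-Ney smoothing."""
    # Lower-order distribution based on context diversity, not raw frequency.
    p_cont = continuation[ch] / sum(continuation.values())

    followers = bigram_counts.get(prev)
    if not followers:
        return p_cont  # unseen context: fall back entirely

    context_total = sum(followers.values())
    discounted = max(followers[ch] - discount, 0.0) / context_total
    # Mass removed by discounting, handed to the lower-order distribution.
    backoff_weight = discount * len(followers) / context_total
    return discounted + backoff_weight * p_cont


counts, cont = train_bigram_kn("abracadabra")
print(kn_prob("b", "a", counts, cont))  # seen bigram, high probability
print(kn_prob("r", "a", counts, cont))  # unseen bigram, still non-zero
```

With a discount of 0.75, each observed bigram gives up three quarters of a count, so characters never seen after a given context still receive some probability. A larger discount shifts more weight to the lower-order estimate.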

### Performance Tips:
- **Faster training**: Use `--n 3 --vocab_size 500`
- **Better accuracy**: Use `--n 4 --vocab_size 2000`
- **Quick testing**: Use `--max_samples 1000` for evaluation
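
The `--n` and `--vocab_size` flags affect speed because they determine how many distinct contexts and characters the model has to track. The sketch below illustrates what such settings typically mean. It assumes that characters outside the vocabulary are mapped to a single placeholder, which may differ from what `NGram_Model.py` actually does. `UNK`, `build_vocab`, and `context_for` are hypothetical names.

```python
from collections import Counter

UNK = "\x00"  # hypothetical placeholder for out-of-vocabulary characters


def build_vocab(text, vocab_size):
    """Keep only the vocab_size most frequent characters."""
    return {ch for ch, _ in Counter(text).most_common(vocab_size)}


def context_for(text, n, vocab):
    """Return the last n-1 characters of text, with rare characters replaced."""
    tail = text[-(n - 1):] if n > 1 else ""
    return "".join(ch if ch in vocab else UNK for ch in tail)


vocab = build_vocab("aaabbc", vocab_size=2)   # keeps 'a' and 'b', drops 'c'
print(repr(context_for("xcab", n=4, vocab=vocab)))  # 3 chars of context
print(repr(context_for("xcab", n=3, vocab=vocab)))  # 2 chars of context
```

Lowering `n` shortens the context, and lowering `vocab_size` collapses more characters into the placeholder. Both reduce the number of distinct contexts to count, at some cost in prediction quality.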

### Colab Tips:
- **Save your work**: Download results.zip before closing the session; files on the Colab runtime are lost when it disconnects
