# Filtering Perplexity: A Practical Example

This notebook walks through computing a perplexity score for each row of a dataset and then filtering the rows against a threshold. The same steps can be adapted to your own dataset.



## Set up environment and models

In [None]:
!git clone https://github.com/Oztobuzz/Vista.git
# git clone creates a directory named after the repository
%cd Vista

In [None]:
!pip install -r requirements.txt
!bash scripts/download_sentencepiece_kenlm_model.sh

In [3]:
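import os
import sys

current_pwd = os.getcwd()
# PYTHONPATH may be unset, in which case "+=" raises KeyError
os.environ["PYTHONPATH"] = os.environ.get("PYTHONPATH", "") + f"{os.pathsep}{current_pwd}"
sys.path.append(current_pwd)  # make `src` importable in this kernel too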

from datasets import load_dataset
from src.perplexity.filtering import FilteringPerplexity

## Load Dataset

For this example, we use the `vi_llava_conversation` subset of the Vista dataset. To load your own dataset, change the name and subset below. The filtering step needs a column of perplexity values; after the compute step below, the dataset has one named `perplexity`.

In [4]:
# Replace with the name or path of your dataset
dataset_name = "Vi-VLM/Vista"
subset = "vi_llava_conversation"  # Set to None if your dataset has no subsets

dataset = load_dataset(dataset_name, name=subset, split="train")

## Compute and filter perplexity

### Set up the perplexity filter

In [5]:
perplexity_filtering = FilteringPerplexity(
    sentencepiece_model_path=os.path.join(current_pwd, "models/vi.sp.model"),
    kenlm_model_path=os.path.join(current_pwd, "models/vi.arpa.bin"),
    num_proc=2
)
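
If the download script did not finish, the constructor above may fail on a missing model file. A small check like the following can confirm both files exist first. This is only a sketch: `find_missing_files` is a hypothetical helper, not part of the repository, and the paths are the ones used in the cell above.

```python
import os
from typing import Iterable, List


def find_missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]


# Paths taken from the cell above, relative to the current working directory
model_paths = [
    os.path.join(os.getcwd(), "models/vi.sp.model"),
    os.path.join(os.getcwd(), "models/vi.arpa.bin"),
]
missing = find_missing_files(model_paths)
if missing:
    print("Missing model files (re-run the download script?):", missing)
else:
    print("Both model files are present.")
```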

### Compute perplexity

In [6]:
data_contains_perplex = perplexity_filtering.compute(dataset)

In [7]:
data_contains_perplex

Dataset({
    features: ['height', 'coco_url', 'date_capture', 'id', 'width', 'conversation', 'captions', 'file_name', 'flickr_url', 'perplexity'],
    num_rows: 107052
})
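
For intuition about the `perplexity` column, the sketch below shows the usual way a perplexity is derived from an n-gram model's log10 sentence score, which is the convention KenLM's Python bindings follow. It is not taken from `FilteringPerplexity`: the function name and the sample numbers are hypothetical, and the repository may tokenize or normalize text differently.

```python
def perplexity_from_log10_score(log10_score: float, num_tokens: int) -> float:
    """Convert a total log10 probability into a per-token perplexity.

    Assumption: `log10_score` is the sum of log10 token probabilities
    (including the end-of-sentence token) and `num_tokens` counts those
    same tokens. This is a sketch, not the code used by FilteringPerplexity.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 10.0 ** (-log10_score / num_tokens)


# Hypothetical scores: a fluent sentence and an unlikely one, both 5 tokens long
fluent = perplexity_from_log10_score(-10.0, 5)    # exponent is 2
unlikely = perplexity_from_log10_score(-15.0, 5)  # exponent is 3
print(fluent, unlikely)
```

Lower values mean the language model finds the text more predictable, which is why a threshold on this score can serve as a rough quality filter.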

### Filter by perplexity

In [8]:
data_filtered = perplexity_filtering.filter(data_contains_perplex, threshold=100)

In [9]:
data_filtered

Dataset({
    features: ['height', 'coco_url', 'date_capture', 'id', 'width', 'conversation', 'captions', 'file_name', 'flickr_url', 'perplexity'],
    num_rows: 71498
})
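
In the run above, `threshold=100` keeps 71,498 of 107,052 rows (about 67%). Before settling on a threshold, it can help to see how many rows several candidate values would keep. The sketch below does that on a plain list of scores. It assumes a row is kept when its perplexity is at or below the threshold, which this notebook does not confirm for `FilteringPerplexity.filter`; the helper name and the sample values are made up for illustration.

```python
from typing import Dict, Iterable, List


def retention_by_threshold(
    perplexities: Iterable[float], thresholds: Iterable[float]
) -> Dict[float, float]:
    """Return the fraction of scores kept for each candidate threshold.

    Assumption: a row is kept when its perplexity is <= threshold.
    """
    scores: List[float] = list(perplexities)
    if not scores:
        raise ValueError("perplexities must not be empty")
    return {t: sum(s <= t for s in scores) / len(scores) for t in thresholds}


# Made-up scores; with a real dataset, pass data_contains_perplex["perplexity"]
sample_scores = [12.5, 48.0, 73.2, 99.9, 150.0, 310.4, 87.1, 1020.0]
retention = retention_by_threshold(sample_scores, [50, 100, 500])
for threshold, kept in retention.items():
    print(f"threshold={threshold}: keeps {kept:.1%} of rows")
```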