# DVC Exploration for Text Classification

This notebook explores how to use Data Version Control (DVC) to version text classification datasets. It demonstrates how to:

1. Initialize a DVC repository
2. Track and version datasets
3. Create and switch between versions
4. Work with remote storage
5. Integrate with data preprocessing

## Setup and Installation

First, make sure DVC is installed and the required dependencies are available.

In [None]:
# Install DVC into the active kernel's environment if it is not already installed
%pip install dvc

In [None]:
# Import required libraries
import os
import sys
import subprocess
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Add project root to path to import modules
sys.path.append('..')

# Import custom modules
from src.config import Config
from dvc_helpers.dvc_setup import DVCHandler
from dvc_helpers.remote_storage import RemoteStorageManager

## 1. Create a Sample Dataset

Let's create a simple text classification dataset to use throughout this notebook.

In [None]:
# Create a sample text classification dataset
data = {
    'text': [
        "This movie was amazing! I loved every minute of it.",
        "Worst film I've ever seen. Complete waste of time.",
        "The acting was good but the plot was confusing.",
        "Great performances by the entire cast. Highly recommended!",
        "I fell asleep halfway through. Very boring.",
        "Not bad, not great. Just an average film overall.",
        "The special effects were incredible! Must see in 3D.",
        "Terrible dialogue and poor character development.",
        "One of the best films of the year. A true masterpiece.",
        "I was disappointed by the ending. Expected more."
    ],
    'label': ['positive', 'negative', 'neutral', 'positive', 'negative', 
              'neutral', 'positive', 'negative', 'positive', 'negative']
}

df = pd.DataFrame(data)
df

In [None]:
# Create directories for data if they don't exist
os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../data/processed', exist_ok=True)

# Save to CSV
csv_path = '../data/raw/movie_reviews_v1.csv'
df.to_csv(csv_path, index=False)
print(f"Dataset saved to {csv_path}")

## 2. Initialize and Configure DVC

In [None]:
# Load configuration
config = Config()
print(f"Raw data path: {config.raw_data_path}")
print(f"Processed data path: {config.processed_data_path}")
print(f"DVC repo path: {config.dvc_repo_path}")

In [None]:
# Initialize DVC handler
dvc_handler = DVCHandler(config.dvc_repo_path)

# Check if DVC is initialized
if not dvc_handler.is_dvc_initialized():
    print("Initializing DVC...")
    dvc_handler.initialize_dvc()
else:
    print("DVC is already initialized")
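
The check above relies on the project's own `DVCHandler`, whose implementation is not shown here. As a rough sketch of what such a check could look like, the hypothetical helper below (`looks_like_dvc_repo` is not part of the project) tests for the `.dvc` directory that `dvc init` creates at the repository root.

```python
from pathlib import Path
import tempfile


def looks_like_dvc_repo(repo_path):
    """Return True if repo_path contains the .dvc directory created by `dvc init`."""
    return (Path(repo_path) / ".dvc").is_dir()


# Usage: a fresh directory only counts as a DVC repo once `.dvc/` exists.
with tempfile.TemporaryDirectory() as tmp:
    before = looks_like_dvc_repo(tmp)
    (Path(tmp) / ".dvc").mkdir()  # stand-in for what `dvc init` would create
    after = looks_like_dvc_repo(tmp)

print(before, after)
```

This only inspects the file system; it does not verify that the DVC configuration inside `.dvc/` is valid.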

In [None]:
# Get DVC information
dvc_info = dvc_handler.get_dvc_info()
print("DVC Information:")
print(f"Initialized: {dvc_info['initialized']}")
print(f"Repository Path: {dvc_info['repo_path']}")
print(f"Remotes: {dvc_info['remotes']}")

## 3. Version Control the Dataset with DVC

In [None]:
# Add the dataset to DVC and commit
dataset_path = csv_path
success = dvc_handler.add_and_commit_dataset(dataset_path, "Added initial movie reviews dataset (version 1)")

if success:
    print(f"Dataset {dataset_path} added to DVC and committed successfully")
else:
    print(f"Failed to add dataset {dataset_path} to DVC")
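
`add_and_commit_dataset` is a project helper, so its exact behavior is an assumption here. A plausible minimal version wraps the usual CLI sequence: `dvc add` writes a small `<file>.dvc` pointer file, and that pointer is what gets committed to Git. The function names below are hypothetical.

```python
import subprocess


def build_add_and_commit_commands(dataset_path, message):
    """Return the CLI calls that track a file with DVC and commit its pointer to Git."""
    dvc_file = dataset_path + ".dvc"  # pointer file written by `dvc add`
    return [
        ["dvc", "add", dataset_path],
        ["git", "add", dvc_file],
        ["git", "commit", "-m", message],
    ]


def run_commands(commands, cwd="."):
    """Run each command in order, raising CalledProcessError on the first failure."""
    for cmd in commands:
        subprocess.run(cmd, cwd=cwd, check=True)


# Only print the commands here; call run_commands(commands) inside a real repository.
commands = build_add_and_commit_commands(
    "data/raw/movie_reviews_v1.csv",
    "Added initial movie reviews dataset (version 1)",
)
for cmd in commands:
    print(" ".join(cmd))
```

`dvc add` also updates a `.gitignore` next to the data file so that Git ignores the data itself; that file is normally committed along with the pointer.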

## 4. Create and Track a Second Version of the Dataset

In [None]:
# Add more data to create a new version
new_data = {
    'text': [
        "The cinematography was breathtaking throughout the film.",
        "Too many plot holes to be enjoyable.",
        "I'm not sure what to think about this one.",
        "A perfect example of modern storytelling.",
        "The script was lazy and unimaginative."
    ],
    'label': ['positive', 'negative', 'neutral', 'positive', 'negative']
}

# Combine with the original dataset
df_new = pd.DataFrame(new_data)
df_combined = pd.concat([df, df_new], ignore_index=True)
df_combined

In [None]:
# Save the updated dataset
csv_path_v2 = '../data/raw/movie_reviews_v2.csv'
df_combined.to_csv(csv_path_v2, index=False)
print(f"Updated dataset saved to {csv_path_v2}")

In [None]:
# Add and commit the new version
success = dvc_handler.add_and_commit_dataset(csv_path_v2, "Added more reviews (version 2)")

if success:
    print(f"Dataset {csv_path_v2} added to DVC and committed successfully")
else:
    print(f"Failed to add dataset {csv_path_v2} to DVC")
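
DVC tells file versions apart by content hash, which it records in the `.dvc` pointer file. The sketch below illustrates the idea with Python's `hashlib` on two small in-memory CSV payloads; it is a simplified illustration, not DVC's internal code path, and `content_md5` is a hypothetical helper.

```python
import hashlib


def content_md5(data: bytes) -> str:
    """Return the hex MD5 digest of raw file content."""
    return hashlib.md5(data).hexdigest()


# Two tiny CSV payloads standing in for version 1 and version 2 of a dataset.
v1 = b"text,label\nI fell asleep halfway through. Very boring.,negative\n"
v2 = v1 + b"Too many plot holes to be enjoyable.,negative\n"

hash_v1 = content_md5(v1)
hash_v2 = content_md5(v2)

print(hash_v1 != hash_v2)          # appending rows changes the hash
print(hash_v1 == content_md5(v1))  # identical content always hashes the same
```

Note that this notebook stores the two versions under different file names, so each file gets its own pointer. Overwriting a single path and re-running `dvc add` would instead produce successive versions of one pointer file.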

## 5. List All Versioned Datasets

In [None]:
# List all datasets tracked by DVC
datasets = dvc_handler.list_datasets()

print(f"Found {len(datasets)} versioned datasets:")
for dataset in datasets:
    print(f"- {dataset['path']} ({dataset['size_mb']} MB, last updated: {dataset['last_date']})")
    print(f"  Last commit: {dataset['last_message']}")
    print()

## 6. Get Version History for a Dataset

In [None]:
# Get all versions of a specific dataset
dataset_path = csv_path  # Using first dataset
versions = dvc_handler.get_dataset_versions(dataset_path)

print(f"Version history for dataset {os.path.basename(dataset_path)}:")
for version in versions:
    print(f"- {version['date']}: {version['message']}")
    print(f"  Author: {version['author']}")
    print(f"  Commit: {version['hash']}")
    print()
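
One way a helper like `get_dataset_versions` could gather this history is to read the Git log of the dataset's `.dvc` pointer file. The sketch below is an assumption about the approach, not the project's code: it builds a `git log` command with a custom format and parses its output. The sample log text, including the hashes, authors, dates, and the second commit message, is made up for illustration.

```python
FIELD_SEP = "\x1f"  # ASCII unit separator, unlikely to appear in commit messages
GIT_LOG_FORMAT = "%H%x1f%an%x1f%ad%x1f%s"  # hash, author, date, subject


def build_log_command(dvc_file):
    """Return a `git log` call restricted to one pointer file."""
    return ["git", "log", f"--format={GIT_LOG_FORMAT}", "--date=iso", "--", dvc_file]


def parse_git_log(output):
    """Turn the formatted log output into a list of dicts, newest commit first."""
    versions = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        commit_hash, author, date, message = line.split(FIELD_SEP, 3)
        versions.append(
            {"hash": commit_hash, "author": author, "date": date, "message": message}
        )
    return versions


# Made-up sample output; `git log` lists the newest commit first.
sample_output = "\n".join(
    [
        FIELD_SEP.join(["bbbb222", "Example Author", "2024-01-02 09:00:00 +0000", "Update reviews"]),
        FIELD_SEP.join(["aaaa111", "Example Author", "2024-01-01 09:00:00 +0000", "Added initial movie reviews dataset (version 1)"]),
    ]
)

versions_sketch = parse_git_log(sample_output)
print(versions_sketch[-1]["hash"], versions_sketch[-1]["message"])
```

Because `git log` returns the newest commit first, the oldest version is the last element of the list, which is the ordering the checkout step in the next section assumes.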

## 7. Checkout a Specific Version

In [None]:
# Get the commit hash of the first version
if versions:
    first_version_hash = versions[-1]['hash']  # Last in the list is the oldest
    
    # Checkout the first version
    success = dvc_handler.checkout_version(dataset_path, first_version_hash)
    
    if success:
        print(f"Successfully checked out version {first_version_hash} of dataset")
        
        # Verify by loading the data
        df_checkout = pd.read_csv(dataset_path)
        print(f"Loaded dataset has {len(df_checkout)} records")
        display(df_checkout.head())
    else:
        print(f"Failed to checkout version {first_version_hash}")
else:
    print("No versions found for the dataset")
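
Restoring an older version usually takes two steps on the command line: check out the pointer file from the chosen Git commit, then let DVC sync the data file to match it. The sketch below only builds those commands; `build_checkout_commands` is a hypothetical name, and whether `checkout_version` works this way is an assumption.

```python
def build_checkout_commands(dataset_path, commit_hash):
    """Return the CLI calls that restore one dataset to the state at commit_hash."""
    dvc_file = dataset_path + ".dvc"
    return [
        # Restore the pointer file as it was at the given commit.
        ["git", "checkout", commit_hash, "--", dvc_file],
        # Make the workspace copy of the data match the restored pointer.
        ["dvc", "checkout", dvc_file],
    ]


# "aaaa111" is a placeholder revision, not a real commit.
checkout_commands = build_checkout_commands("data/raw/movie_reviews_v1.csv", "aaaa111")
for cmd in checkout_commands:
    print(" ".join(cmd))
```

The second step can only succeed if the requested data is present in the local DVC cache or has been pulled from a remote.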

## 8. Set Up Remote Storage

In [None]:
# For demonstration, we'll use a local directory as remote storage
remote_dir = '../remote_storage'
os.makedirs(remote_dir, exist_ok=True)

# Add remote to DVC
remote_url = f"file://{os.path.abspath(remote_dir)}"
remote_name = "local-remote"

# Update config with remote info
config.dvc_remote_url = remote_url
config.dvc_remote_name = remote_name

# Setup remote storage manager
remote_storage = RemoteStorageManager(config)
result = remote_storage.setup_remote()

if result:
    print(f"Remote storage '{remote_name}' set up successfully with URL: {remote_url}")
else:
    print("Failed to set up remote storage")

In [None]:
# Push data to remote storage
result = remote_storage.push_to_remote()

if result['success']:
    print(result['message'])
else:
    print(f"Failed to push to remote: {result['message']}")
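
For reference, the equivalent plain DVC commands are `dvc remote add` and `dvc push`. The sketch below builds them for a local directory remote; `build_remote_commands` is a hypothetical helper, and how `RemoteStorageManager` issues these calls internally is an assumption.

```python
import os


def build_remote_commands(name, location):
    """Return the CLI calls that register a default remote and push cached data to it."""
    return [
        ["dvc", "remote", "add", "-d", name, location],  # -d marks it as the default remote
        ["dvc", "push", "-r", name],
    ]


# A plain absolute path is the usual way to point DVC at a local directory remote.
local_remote = os.path.abspath("remote_storage")
remote_commands = build_remote_commands("local-remote", local_remote)
for cmd in remote_commands:
    print(" ".join(cmd))
```

If the `file://` URL used above is not accepted by the installed DVC version, passing the absolute directory path directly, as in this sketch, is a reasonable fallback.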

## 9. Using Memory-Efficient Data Loading

In [None]:
# Import data loader
from src.data.data_loader import DataLoader

# Initialize data loader with the dataset
loader = DataLoader(csv_path, config)

# Load data in batches
batch_size = 3
print(f"Loading data in batches of {batch_size}:")

for i, batch in enumerate(loader.load_batch_generator(batch_size=batch_size)):
    print(f"Batch {i+1}:")
    for record in batch:
        print(f"  - {record['text'][:50]}... [{record['label']}]")
    print()
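
The memory saving comes from reading rows lazily instead of loading the whole file at once. The project's `DataLoader` is not shown, so the following is only a sketch of the general pattern, using `csv.DictReader` and `itertools.islice`; `iter_batches` is a hypothetical name.

```python
import csv
import io
from itertools import islice


def iter_batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable, one batch at a time."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# An in-memory CSV stands in for a file opened with open(path, newline="").
sample_csv = io.StringIO(
    "text,label\n"
    "Too many plot holes to be enjoyable.,negative\n"
    "A perfect example of modern storytelling.,positive\n"
    "The script was lazy and unimaginative.,negative\n"
    "I fell asleep halfway through. Very boring.,negative\n"
)

reader = csv.DictReader(sample_csv)  # yields one dict per row, lazily
batch_sizes = [len(batch) for batch in iter_batches(reader, batch_size=3)]
print(batch_sizes)  # → [3, 1]
```

Only one batch is held in memory at a time, so the same loop works for files far larger than available RAM.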

## 10. Data Preprocessing and Versioning

In [None]:
# Import preprocessor
from src.data.preprocessing import TextPreprocessor

# Initialize preprocessor with custom options
preprocessor = TextPreprocessor({
    'lowercase': True,
    'remove_punctuation': True,
    'remove_stopwords': True,
    'stemming': False,
    'lemmatization': True
})

# Process a sample text
sample_text = "This movie was AMAZING! I loved every minute of it."
processed_text = preprocessor.preprocess_text(sample_text)

print(f"Original: {sample_text}")
print(f"Processed: {processed_text}")
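
To make the options concrete, here is a simplified stand-in for the first three of them: lowercasing, punctuation removal, and stopword removal. It is not the project's `TextPreprocessor`; the tiny `STOPWORDS` set is an illustrative assumption, and lemmatization is left out because it needs an external NLP library.

```python
import string

# Deliberately tiny stopword list for illustration only.
STOPWORDS = {"this", "was", "i", "of", "it", "the", "a", "and"}


def preprocess_text_sketch(text, lowercase=True, remove_punctuation=True, remove_stopwords=True):
    """Apply the selected cleaning steps and return the remaining tokens joined by spaces."""
    if lowercase:
        text = text.lower()
    if remove_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if remove_stopwords:
        tokens = [token for token in tokens if token.lower() not in STOPWORDS]
    return " ".join(tokens)


cleaned = preprocess_text_sketch("This movie was AMAZING! I loved every minute of it.")
print(cleaned)  # → movie amazing loved every minute
```

The output of the real preprocessor will differ, since it uses a full stopword list and lemmatizes the remaining tokens.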

In [None]:
# Process the entire dataset and save
processed_path = os.path.join(config.processed_data_path, 'processed_movie_reviews.csv')
preprocessor.preprocess_and_save(loader, processed_path)

print(f"Processed dataset saved to {processed_path}")

In [None]:
# Load the processed dataset
processed_df = pd.read_csv(processed_path)
processed_df.head()

In [None]:
# Version the processed dataset
success = dvc_handler.add_and_commit_dataset(processed_path, "Added processed movie reviews dataset")

if success:
    print(f"Processed dataset {processed_path} added to DVC and committed successfully")
else:
    print(f"Failed to add processed dataset {processed_path} to DVC")

## 11. Data Statistics and Validation

In [None]:
# Get statistics about the dataset
stats = loader.get_statistics()

print("Dataset Statistics:")
print(f"Number of records: {stats['record_count']}")
print(f"Available fields: {stats['fields']}")
print(f"Average text length: {stats['avg_text_length']:.2f} characters")
print(f"Min text length: {stats['min_text_length']} characters")
print(f"Max text length: {stats['max_text_length']} characters")
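
The length statistics printed above can be computed in a single pass over the records. The sketch below shows one way to do so; it reuses the key names from the cell above, but `text_length_stats` itself is hypothetical and not the project's `get_statistics` implementation.

```python
def text_length_stats(records, text_field="text"):
    """Compute record count and character-length statistics for one text field."""
    lengths = [len(record[text_field]) for record in records]
    if not lengths:
        return {
            "record_count": 0,
            "avg_text_length": 0.0,
            "min_text_length": 0,
            "max_text_length": 0,
        }
    return {
        "record_count": len(lengths),
        "avg_text_length": sum(lengths) / len(lengths),
        "min_text_length": min(lengths),
        "max_text_length": max(lengths),
    }


sample_records = [
    {"text": "Too many plot holes to be enjoyable.", "label": "negative"},
    {"text": "Very boring.", "label": "negative"},
]
sample_stats = text_length_stats(sample_records)
print(sample_stats)
```

For very large files, the list of lengths could be replaced by running totals so that memory use stays constant.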

In [None]:
# Import data validator
from src.data.data_validator import DataValidator

# Initialize validator
validator = DataValidator(min_text_length=10, require_fields=['text', 'label'])

# Validate the dataset
is_valid, message = validator.validate_dataset(csv_path)

print(f"Dataset validation result: {message}")
print(f"Is valid: {is_valid}")
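
The two rules configured above, required fields and a minimum text length, can be expressed in a few lines. The sketch below works on a list of record dicts rather than a file path and returns the same `(is_valid, message)` pair; `validate_records` is a hypothetical function, not the project's `DataValidator`.

```python
def validate_records(records, required_fields=("text", "label"), min_text_length=10):
    """Return (is_valid, message), stopping at the first record that breaks a rule."""
    for index, record in enumerate(records):
        # A field counts as missing if it is absent or empty.
        missing = [field for field in required_fields if not record.get(field)]
        if missing:
            return False, f"Record {index} is missing fields: {missing}"
        if len(record.get("text", "")) < min_text_length:
            return False, f"Record {index} text is shorter than {min_text_length} characters"
    return True, f"All {len(records)} records passed validation"


good = [{"text": "Too many plot holes to be enjoyable.", "label": "negative"}]
bad = good + [{"text": "Meh.", "label": "neutral"}]

print(validate_records(good))
print(validate_records(bad))
```

Stopping at the first failure keeps the message short; collecting all failures into a list would be a natural extension for larger datasets.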

## 12. Using DVC Pipeline Configuration

In [None]:
# View the content of the dvc.yaml file (it exists only once a pipeline stage has been defined)
!cat ../dvc.yaml

In [None]:
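# Define the preprocessing stage in dvc.yaml, then run it with `dvc repro`.
# `dvc run` is deprecated and no longer available in recent DVC releases; `dvc stage add` replaces it.
# The commands run from the project root, so paths are made relative to it.
# Note: an output already tracked via `dvc add` must be untracked first (`dvc remove <file>.dvc`).
csv_rel = os.path.relpath(csv_path, '..')
processed_rel = os.path.relpath(processed_path, '..')
!cd .. && dvc stage add -n preprocess -d {csv_rel} -d src/data/preprocessing.py -d src/data/data_loader.py -o {processed_rel} python -c "from src.data.preprocessing import TextPreprocessor; from src.data.data_loader import DataLoader; from src.config import Config; config = Config(); preprocessor = TextPreprocessor(config.text_preprocessing); loader = DataLoader('{csv_rel}', config); preprocessor.preprocess_and_save(loader, '{processed_rel}')"
!cd .. && dvc repro preprocess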

## 13. Summary and Conclusion

In this notebook, we explored how to use DVC for versioning text classification datasets. We learned how to:

1. Initialize a DVC repository
2. Add and version datasets
3. Track different versions of datasets
4. Checkout specific versions
5. Set up remote storage for collaboration and backup
6. Use memory-efficient data loading with generators
7. Preprocess text data and version the processed datasets
8. Analyze dataset statistics and validate data quality
9. Create DVC pipelines for reproducible data processing

Together, these techniques provide a foundation for an MLOps workflow for text classification in which every dataset version is reproducible and traceable, and large files are handled efficiently.