## 1. Verify GPU & Install Dependencies

**VS Code + Colab Kernel Integration**

This notebook runs on a Colab T4 GPU kernel connected through VS Code. Input data is read from Google Drive (see Section 2).

In [5]:
# Install required packages (Colab kernel)
!pip install -q transformers torch pandas pyarrow fastparquet tqdm scikit-learn

print("✓ Dependencies installed")

✓ Dependencies installed


In [6]:
# Verify GPU
import torch

if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    print(f"✓ GPU available: {gpu_name}")
    print(f"  CUDA version: {torch.version.cuda}")
    print(f"  Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️  NO GPU DETECTED!")
    print("   Go to: Runtime → Change runtime type → Hardware accelerator → GPU")
    raise RuntimeError("GPU required for fast processing")

✓ GPU available: Tesla T4
  CUDA version: 12.6
  Memory: 14.7 GB


## 2. Mount Google Drive & Setup Paths

**Google Colab Web Interface**: You can mount your Google Drive directly to access files.

Make sure these files are in your Google Drive:
- `MyDrive/02_Master/11_Thesis/Data_Experiment/data/reddit/topics/comments_expanded_with_topics.parquet`
- `MyDrive/02_Master/11_Thesis/Data_Experiment/data/reddit/topics/submissions_expanded_with_topics.parquet`
- `MyDrive/02_Master/11_Thesis/Data_Experiment/data/reddit/polarisation/test_set_annotation.parquet`

In [None]:
# Mount Google Drive
from google.colab import drive
from pathlib import Path

print("=" * 80)
print("MOUNTING GOOGLE DRIVE")
print("=" * 80)
print("\n📂 Mounting your Google Drive...")
print("   You'll be asked to authorize access.")
print("=" * 80)

drive.mount('/content/drive')

print("\n✅ Google Drive mounted successfully!")
print(f"   Access your files at: /content/drive/MyDrive/")

In [None]:
# Configure paths from Google Drive
from pathlib import Path
import pandas as pd
import numpy as np
from datetime import datetime
import json

TIME_PERIOD = '2016-09_2016-10'

# Google Drive paths
DRIVE_ROOT = Path('/content/drive/MyDrive/02_Master/11_Thesis/Data_Experiment/data/reddit')

comments_path = DRIVE_ROOT / 'topics' / 'comments_expanded_with_topics.parquet'
submissions_path = DRIVE_ROOT / 'topics' / 'submissions_expanded_with_topics.parquet'
annotations_path = DRIVE_ROOT / 'polarisation' / 'test_set_annotation.parquet'

# Output directory (in Google Drive for automatic sync)
OUTPUT_DIR = Path('/content/drive/MyDrive/02_Master/11_Thesis/Data_Experiment/data/reddit/polarisation') / TIME_PERIOD
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("\n" + "=" * 80)
print("PATHS CONFIGURED")
print("=" * 80)
print(f"  Comments: {comments_path}")
print(f"  Submissions: {submissions_path}")
print(f"  Annotations: {annotations_path}")
print(f"  Output: {OUTPUT_DIR}")
print()

# Verify files exist
if not comments_path.exists():
    raise FileNotFoundError(f"❌ Comments file not found at: {comments_path}")
if not submissions_path.exists():
    raise FileNotFoundError(f"❌ Submissions file not found at: {submissions_path}")
if not annotations_path.exists():
    raise FileNotFoundError(f"❌ Annotations file not found at: {annotations_path}")

print("✓ All input files verified")
print(f"  Comments: {comments_path.stat().st_size / 1024**2:.1f} MB")
print(f"  Submissions: {submissions_path.stat().st_size / 1024**2:.1f} MB")
print(f"  Annotations: {annotations_path.stat().st_size / 1024**2:.1f} MB")
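The three existence checks above could also be collected into one pass that reports every missing input at once. The following is a minimal sketch under that assumption; `find_missing` and the demo file names are hypothetical and not part of the notebook, and a temporary directory stands in for the Drive folder.

```python
from pathlib import Path
from typing import Dict, List
import tempfile


def find_missing(paths: Dict[str, Path]) -> List[str]:
    """Return the names of inputs whose files do not exist (hypothetical helper)."""
    return [name for name, path in paths.items() if not path.exists()]


# Demo: a temporary directory stands in for the Google Drive data folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "comments.parquet").touch()  # this input is present
    demo_inputs = {
        "comments": root / "comments.parquet",
        "annotations": root / "annotations.parquet",  # deliberately absent
    }
    missing = find_missing(demo_inputs)

print(missing)  # ['annotations']
```

In the notebook, the dictionary would hold `comments_path`, `submissions_path`, and `annotations_path`, and a non-empty result would raise a single `FileNotFoundError` listing all missing files.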

In [None]:
# Load data
print("=" * 80)
print("LOADING DATA")
print("=" * 80)

comments_df = pd.read_parquet(comments_path)
print(f"✓ Comments: {len(comments_df):,} rows")

submissions_df = pd.read_parquet(submissions_path)
print(f"✓ Submissions: {len(submissions_df):,} rows")

test_annotations = pd.read_parquet(annotations_path)
print(f"✓ Annotations: {len(test_annotations):,} samples")

print(f"\nTotal texts to process: {len(comments_df) + len(submissions_df):,}")

## 3. Define Model Class

In [None]:
# Emotion-based model (best performer from model comparison)
from transformers import pipeline
from typing import List

class EmotionBasedModel:
    """Map emotions (anger, disgust, etc.) to affective polarization levels."""
    
    def __init__(self, model_name: str, num_classes: int = 3):
        self.model_name = model_name
        self.num_classes = num_classes
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        
        print(f"Loading model: {model_name}")
        print(f"Device: {self.device}")
        
        # Configure pipeline with GPU and large batch size
        self.pipe = pipeline(
            "text-classification",
            model=model_name,
            device=0 if torch.cuda.is_available() else -1,
            batch_size=512  # Large batches for GPU
        )
        print("✓ Model loaded")
    
    def predict(self, texts: List[str]) -> np.ndarray:
        """
        Map emotions to polarization:
        - anger, disgust → 1-2 (adversarial/intolerant)
        - fear → 1 (adversarial)
        - joy, love, surprise, sadness, neutral → 0 (none)
        """
        results = self.pipe(texts, truncation=True, max_length=512)
        labels = []
        
        for result in results:
            emotion = result['label'].lower()
            score = result['score']
            
            if emotion in ['anger', 'disgust']:
                # High confidence anger/disgust → intolerant
                if score > 0.7:
                    labels.append(2)
                else:
                    labels.append(1)
            elif emotion == 'fear':
                labels.append(1)  # Adversarial
            else:
                labels.append(0)  # None
        
        return np.array(labels)
    
    def predict_proba(self, texts: List[str]) -> np.ndarray:
        """Get probability distribution (simple one-hot based on predictions)."""
        predictions = self.predict(texts)
        probs = np.zeros((len(predictions), self.num_classes))
        for i, pred in enumerate(predictions):
            probs[i, pred] = 1.0
        return probs

print("✓ Model class defined")
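The mapping rule inside `EmotionBasedModel.predict` can be tried in isolation, without loading the model. The sketch below mirrors that rule as a standalone function; `emotion_to_level` and the sample pipeline outputs are hypothetical illustrations, and the 0.7 threshold is the one used in the class above.

```python
from typing import Dict, List, Union


def emotion_to_level(label: str, score: float, threshold: float = 0.7) -> int:
    """Map one (emotion, score) pair to a polarization level (hypothetical helper)."""
    emotion = label.lower()
    if emotion in ("anger", "disgust"):
        # High-confidence anger/disgust counts as intolerant, otherwise adversarial.
        return 2 if score > threshold else 1
    if emotion == "fear":
        return 1  # Adversarial
    return 0  # None


# Made-up outputs in the shape returned by a text-classification pipeline.
sample_outputs: List[Dict[str, Union[str, float]]] = [
    {"label": "anger", "score": 0.91},
    {"label": "disgust", "score": 0.55},
    {"label": "fear", "score": 0.99},
    {"label": "joy", "score": 0.88},
]

levels = [emotion_to_level(r["label"], r["score"]) for r in sample_outputs]
print(levels)  # [2, 1, 1, 0]
```

Note that this rule never produces level 3, so a 4-class configuration would leave the Belligerent class empty in the predictions.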

## 4. Determine Number of Classes from Annotations

In [None]:
# Check annotation distribution
test_annotated = test_annotations[test_annotations['affective_polarization_label'].notna()].copy()

print("Annotation Statistics:")
print(f"  Total annotated: {len(test_annotated)}")
print(f"\nLabel distribution:")
label_counts = test_annotated['affective_polarization_label'].value_counts().sort_index()
print(label_counts)

# Determine number of classes
if 3 not in label_counts.index or label_counts.get(3, 0) == 0:
    print("\n⚠️  No Level 3 (Belligerent) examples found.")
    print("    Using 3-class classification: 0=None, 1=Adversarial, 2=Intolerant")
    num_classes = 3
    class_labels = [0, 1, 2]
    class_names = ['None', 'Adversarial', 'Intolerant']
else:
    print("\n✓ Using 4-class classification")
    num_classes = 4
    class_labels = [0, 1, 2, 3]
    class_names = ['None', 'Adversarial', 'Intolerant', 'Belligerent']

print(f"\n✓ Configuration: {num_classes} classes")
print(f"  Labels: {class_labels}")
print(f"  Names: {class_names}")

## 5. Initialize Model

In [None]:
# Initialize the best model (Emotion-English from local testing)
MODEL_NAME = 'j-hartmann/emotion-english-distilroberta-base'

print("=" * 80)
print("INITIALIZING MODEL")
print("=" * 80)

model = EmotionBasedModel(
    model_name=MODEL_NAME,
    num_classes=num_classes
)

print("\n✓ Model ready for processing")

## 6. Process Comments (Main Processing)

In [None]:
# Process comments with GPU optimization
import time
from tqdm.auto import tqdm

print("=" * 80)
print("PROCESSING COMMENTS")
print("=" * 80)
print(f"Total comments: {len(comments_df):,}")
print(f"Device: {model.device}")
print()

# Extract texts
comment_texts = comments_df['text'].tolist()

# GPU-optimized batch size (T4 can handle large batches)
batch_size = 1024  # Increased for GPU

all_predictions = []
all_probs = []

print(f"Processing in batches of {batch_size}...\n")

start_time = time.time()
first_batch = True

for i in tqdm(range(0, len(comment_texts), batch_size), desc="Processing comments"):
    batch = comment_texts[i:i+batch_size]
    
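    # Get predictions, then build the one-hot probabilities from them directly.
    # (Calling model.predict_proba(batch) would run inference on the batch a second time.)
    preds = model.predict(batch)
    probs = np.eye(num_classes)[preds]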
    
    all_predictions.extend(preds)
    all_probs.extend(probs)
    
    # Time estimation after first batch
    if first_batch and i == 0:
        first_batch = False
        elapsed = time.time() - start_time
        total_batches = (len(comment_texts) + batch_size - 1) // batch_size
        estimated_time = elapsed * total_batches
        
        print(f"\n⏱️  First batch: {elapsed:.1f}s for {len(batch)} texts")
        print(f"📊 Estimated total: {estimated_time/60:.1f} minutes ({estimated_time/3600:.1f} hours)")
        print(f"💡 Speed: ~{len(batch)/elapsed:.0f} texts/second\n")

# Add results to dataframe
comments_with_polarization = comments_df.copy()
comments_with_polarization['affective_polarization_label'] = all_predictions
comments_with_polarization['affective_polarization_score'] = [
    pred / (num_classes - 1) for pred in all_predictions
]

# Add probability columns
for i in range(num_classes):
    comments_with_polarization[f'polarization_prob_{i}'] = [probs[i] for probs in all_probs]

comments_with_polarization['polarization_confidence'] = [max(probs) for probs in all_probs]

# Summary
elapsed_total = time.time() - start_time
print(f"\n{'='*80}")
print(f"✓ COMMENTS PROCESSING COMPLETE")
print(f"{'='*80}")
print(f"⏱️  Total time: {elapsed_total/60:.1f} minutes ({elapsed_total/3600:.2f} hours)")
print(f"💡 Average speed: {len(comment_texts)/elapsed_total:.0f} texts/second")
print(f"\nAffective Polarization Distribution:")
print(comments_with_polarization['affective_polarization_label'].value_counts().sort_index())
print(f"\nMean score: {comments_with_polarization['affective_polarization_score'].mean():.3f}")
print(f"Mean confidence: {comments_with_polarization['polarization_confidence'].mean():.3f}")
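The time estimate in the loop above multiplies the duration of the first batch by the number of batches, obtained with a ceiling division. A small worked sketch of that arithmetic follows; the function name and the 4.0-second first-batch time are assumptions for illustration, while 700,000 texts and a batch size of 1024 echo the figures used in this notebook.

```python
from typing import Tuple


def estimate_total_seconds(
    n_texts: int, batch_size: int, first_batch_seconds: float
) -> Tuple[int, float]:
    """Extrapolate total runtime from the first batch (hypothetical helper)."""
    # Ceiling division: a final partial batch still counts as one batch.
    total_batches = (n_texts + batch_size - 1) // batch_size
    return total_batches, total_batches * first_batch_seconds


batches, seconds = estimate_total_seconds(700_000, 1024, 4.0)
print(batches)                      # 684
print(f"{seconds / 60:.1f} min")    # 45.6 min
```

The first batch may include one-off start-up costs, so the extrapolated figure is best read as a rough guide rather than a measurement.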

## 7. Process Submissions

In [None]:
# Process submissions
print("=" * 80)
print("PROCESSING SUBMISSIONS")
print("=" * 80)
print(f"Total submissions: {len(submissions_df):,}")
print()

submission_texts = submissions_df['text'].tolist()

all_predictions = []
all_probs = []

start_time = time.time()

for i in tqdm(range(0, len(submission_texts), batch_size), desc="Processing submissions"):
    batch = submission_texts[i:i+batch_size]
    
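    # Build one-hot probabilities from the predictions instead of calling
    # model.predict_proba(batch), which would run inference a second time.
    preds = model.predict(batch)
    probs = np.eye(num_classes)[preds]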
    
    all_predictions.extend(preds)
    all_probs.extend(probs)

# Add results
submissions_with_polarization = submissions_df.copy()
submissions_with_polarization['affective_polarization_label'] = all_predictions
submissions_with_polarization['affective_polarization_score'] = [
    pred / (num_classes - 1) for pred in all_predictions
]

for i in range(num_classes):
    submissions_with_polarization[f'polarization_prob_{i}'] = [probs[i] for probs in all_probs]

submissions_with_polarization['polarization_confidence'] = [max(probs) for probs in all_probs]

elapsed_total = time.time() - start_time
print(f"\n{'='*80}")
print(f"✓ SUBMISSIONS PROCESSING COMPLETE")
print(f"{'='*80}")
print(f"⏱️  Time: {elapsed_total/60:.1f} minutes")
print(f"\nAffective Polarization Distribution:")
print(submissions_with_polarization['affective_polarization_label'].value_counts().sort_index())
print(f"\nMean score: {submissions_with_polarization['affective_polarization_score'].mean():.3f}")
print(f"Mean confidence: {submissions_with_polarization['polarization_confidence'].mean():.3f}")
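The derived columns follow directly from the predicted label: the score is the label divided by `num_classes - 1`, and the probability columns are a one-hot encoding of the label. The sketch below reproduces both in plain Python; `one_hot`, `normalized_score`, and the three sample predictions are hypothetical and only illustrate the arithmetic.

```python
from typing import List


def one_hot(pred: int, num_classes: int) -> List[float]:
    """One-hot row for a predicted label (hypothetical helper)."""
    return [1.0 if i == pred else 0.0 for i in range(num_classes)]


def normalized_score(pred: int, num_classes: int) -> float:
    """Scale a label in 0..num_classes-1 to the range 0..1."""
    return pred / (num_classes - 1)


preds = [0, 2, 1]  # made-up predictions
num_classes = 3

probs = [one_hot(p, num_classes) for p in preds]
scores = [normalized_score(p, num_classes) for p in preds]
confidence = [max(row) for row in probs]

print(scores)      # [0.0, 1.0, 0.5]
print(confidence)  # [1.0, 1.0, 1.0]
```

Because the probabilities are one-hot, `polarization_confidence` is 1.0 for every row, so the reported mean confidence carries no information about the underlying emotion scores.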

## 8. Save & Download Results

In [None]:
# List the files written to Google Drive (run this cell after the save cell below)
print("=" * 80)
print("RESULTS SAVED TO GOOGLE DRIVE")
print("=" * 80)
print(f"\n✅ All files saved to Google Drive!")
print(f"\n📂 Output location:")
print(f"   {OUTPUT_DIR}")
print(f"\n📁 Files saved:")
for filepath in OUTPUT_DIR.glob('*.parquet'):
    file_size_mb = filepath.stat().st_size / 1024**2
    print(f"   ✓ {filepath.name} ({file_size_mb:.1f} MB)")

metadata_file = OUTPUT_DIR / 'metadata.json'
if metadata_file.exists():
    file_size_kb = metadata_file.stat().st_size / 1024
    print(f"   ✓ {metadata_file.name} ({file_size_kb:.1f} KB)")

print(f"\n💾 Files will sync to your local machine automatically via Google Drive sync.")
print(f"\n🔄 Or access directly from:")
print(f"   drive.google.com/drive/my-drive/02_Master/11_Thesis/Data_Experiment/")

In [None]:
# Save processed datasets to Google Drive (OUTPUT_DIR)
print("=" * 80)
print("SAVING RESULTS")
print("=" * 80)
print(f"Output directory: {OUTPUT_DIR}\n")

# Save comments
comments_output = OUTPUT_DIR / 'comments_with_polarization.parquet'
comments_with_polarization.to_parquet(comments_output, index=False)
file_size_mb = comments_output.stat().st_size / 1024**2
print(f"✓ Comments saved: {comments_output.name}")
print(f"  Shape: {comments_with_polarization.shape}")
print(f"  Size: {file_size_mb:.1f} MB\n")

# Save submissions
submissions_output = OUTPUT_DIR / 'submissions_with_polarization.parquet'
submissions_with_polarization.to_parquet(submissions_output, index=False)
file_size_mb = submissions_output.stat().st_size / 1024**2
print(f"✓ Submissions saved: {submissions_output.name}")
print(f"  Shape: {submissions_with_polarization.shape}")
print(f"  Size: {file_size_mb:.1f} MB\n")

# Save metadata
metadata = {
    'created_at': datetime.now().isoformat(),
    'time_period': TIME_PERIOD,
    'model_used': MODEL_NAME,
    'processing_environment': 'Google Colab (VS Code kernel)',
    'device': str(model.device),
    'batch_size': batch_size,
    'num_classes': num_classes,
    'class_labels': class_labels,
    'class_names': class_names,
    'comments': {
        'total': len(comments_with_polarization),
        'distribution': comments_with_polarization['affective_polarization_label'].value_counts().to_dict(),
        'mean_score': float(comments_with_polarization['affective_polarization_score'].mean()),
        'mean_confidence': float(comments_with_polarization['polarization_confidence'].mean())
    },
    'submissions': {
        'total': len(submissions_with_polarization),
        'distribution': submissions_with_polarization['affective_polarization_label'].value_counts().to_dict(),
        'mean_score': float(submissions_with_polarization['affective_polarization_score'].mean()),
        'mean_confidence': float(submissions_with_polarization['polarization_confidence'].mean())
    }
}

metadata_path = OUTPUT_DIR / 'metadata.json'
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"✓ Metadata saved: {metadata_path.name}\n")

print("=" * 80)
print("✅ FILES SAVED TO GOOGLE DRIVE")
print("=" * 80)
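One detail to keep in mind when reading `metadata.json` back: JSON object keys are always strings, so the integer labels in the `distribution` dictionaries come back as `'0'`, `'1'`, and so on. The sketch below shows the round trip with made-up counts and one way to restore integer keys.

```python
import json

# Hypothetical label counts, shaped like value_counts().to_dict().
distribution = {0: 5, 1: 3, 2: 1}

encoded = json.dumps({"distribution": distribution})
loaded = json.loads(encoded)["distribution"]
print(loaded)  # {'0': 5, '1': 3, '2': 1}

# Convert the keys back to integers before comparing with class_labels.
restored = {int(k): v for k, v in loaded.items()}
print(restored == distribution)  # True
```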

## 9. Quick Summary Statistics

In [None]:
# Display summary
print("=" * 80)
print("PROCESSING SUMMARY")
print("=" * 80)

total_texts = len(comments_with_polarization) + len(submissions_with_polarization)

print(f"\n📊 Dataset Statistics:")
print(f"  Total texts processed: {total_texts:,}")
print(f"  Comments: {len(comments_with_polarization):,}")
print(f"  Submissions: {len(submissions_with_polarization):,}")

# Combined distribution
combined_labels = list(comments_with_polarization['affective_polarization_label']) + \
                  list(submissions_with_polarization['affective_polarization_label'])

print(f"\n📈 Overall Affective Polarization Distribution:")
for label, name in zip(class_labels, class_names):
    count = combined_labels.count(label)
    pct = count / len(combined_labels) * 100
    print(f"  Level {label} ({name:12s}): {count:6,} ({pct:5.1f}%)")

# Overall scores
combined_scores = list(comments_with_polarization['affective_polarization_score']) + \
                  list(submissions_with_polarization['affective_polarization_score'])
combined_confidence = list(comments_with_polarization['polarization_confidence']) + \
                      list(submissions_with_polarization['polarization_confidence'])

print(f"\n📊 Score Statistics:")
print(f"  Mean polarization score: {np.mean(combined_scores):.3f}")
print(f"  Median polarization score: {np.median(combined_scores):.3f}")
print(f"  Mean confidence: {np.mean(combined_confidence):.3f}")

print(f"\n✅ Processing complete! Results saved to Google Drive.")
print(f"   Files will sync to your local machine if Google Drive desktop sync is enabled.")

---

## ✅ Done!

### Workflow Summary

1. ✅ Mounted Google Drive (Section 2)
2. ✅ Processed ~700k texts with T4 GPU (~30-60 minutes)
3. ✅ Saved results directly to Google Drive (Section 8)

### Output Files Location

Files saved to Google Drive:
```
MyDrive/02_Master/11_Thesis/Data_Experiment/
└── data/reddit/polarisation/2016-09_2016-10/
    ├── comments_with_polarization.parquet
    ├── submissions_with_polarization.parquet
    └── metadata.json
```

Files will automatically sync to your local machine if you have Google Drive desktop sync enabled.

### Next Steps

1. **Wait for Google Drive sync** (or download files manually from Google Drive)
2. **Load results** in your local notebook (18_sentiment_detetction.ipynb) 
3. **Create validation sample** for quality checking
4. **Run analysis** on the processed data

### Performance Notes

- **T4 GPU**: ~150-300 texts/second (30-60 min for full dataset)
- **Upgrade to H100**: If T4 is slow, change runtime type to A100/H100 for 3-5x speedup
- **Batch size**: 1024 (optimal for T4). Increase to 2048+ for H100

### How to Run in Google Colab

1. Upload this notebook to Google Drive
2. Right-click → Open with → Google Colaboratory
3. Runtime → Change runtime type → T4 GPU
4. Run all cells (Runtime → Run all)
