# Reddit Post Cleaning for BERT Uncased Model

This notebook demonstrates how to clean Reddit posts for use with BERT uncased models. The preprocessing steps are designed to prepare text data for generating high-quality embeddings.

In [None]:
# Import necessary libraries
import re
import pandas as pd
import numpy as np

# For demonstration purposes, we'll create sample Reddit posts
sample_posts = [
    "Check out this cool link: https://www.reddit.com/r/MachineLearning and subscribe r/datascience",
    "I've been using BERT for NLP tasks - it's amazing! #NLP #MachineLearning",
    "Question about PyTorch vs TensorFlow? Which one should I use for my project?",
    "Just released my new project on GitHub: github.com/user/project",
    "This post has some *formatting* and **bold text** with [links](https://example.com)"
]

df = pd.DataFrame({'post_text': sample_posts})
print("Original sample posts:")
for post in sample_posts:
    print(f"- {post}")

## Cleaning Functions for BERT Uncased Model

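The tokenizer for BERT uncased models lowercases text on its own, so explicit lowercasing is mainly a safeguard that keeps the cleaned column consistent. Our cleaning process needs to: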
1. Remove URLs and links
2. Remove special characters that don't add meaning
3. Remove Reddit-specific formatting (markdown, etc.)
4. Convert to lowercase (for uncased models)
5. Handle whitespace appropriately

In [None]:
def clean_for_bert_uncased(text):
    """
    Clean text data specifically for BERT uncased model

    Args:
        text (str): Input text string

    Returns:
        str: Cleaned text suitable for BERT uncased model
    """
    # Check if text is NaN or None
    if pd.isna(text) or text is None:
        return ""

    # Convert to string if not already
    if not isinstance(text, str):
        text = str(text)

    # Remove URLs that start with http://, https://, or www.
    # (bare domains such as github.com/user/project are not matched)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Remove Reddit-style subreddit references (r/subreddit)
    # \b keeps the pattern from matching inside paths like "user/project"
    text = re.sub(r'\br/\w+', '', text)

    # Remove Reddit-style username mentions (u/username)
    text = re.sub(r'\bu/\w+', '', text)

    # Remove markdown formatting (*, **, [], etc.)
    text = re.sub(r'\*\*|\*|\[|\]|\(|\)|\_\_|\_', '', text)

    # Remove hashtags symbol but keep the text
    text = re.sub(r'#(\w+)', r'\1', text)

    # Remove special characters; keep basic punctuation for sentence structure
    # and apostrophes so contractions such as "i've" stay intact
    text = re.sub(r"[^\w\s.,!?'\-]", '', text)

    # Convert to lowercase (for uncased model)
    text = text.lower()

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to our sample data
df['cleaned_text'] = df['post_text'].apply(clean_for_bert_uncased)

# Display the cleaned results
print("\nCleaned posts for BERT uncased model:")
for original, cleaned in zip(df['post_text'], df['cleaned_text']):
    print(f"Original: {original}")
    print(f"Cleaned:  {cleaned}")
    print("-" * 80)
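
The subreddit pattern is worth checking on its own, because `r/` also occurs inside ordinary paths. The sketch below compares an unanchored pattern with a word-boundary-anchored one on two made-up strings. The names `unanchored`, `anchored`, and `samples` are illustrative only and are not part of the cleaning function.

```python
import re

# Two candidate patterns for subreddit references (illustrative names).
unanchored = re.compile(r'r/\w+')
anchored = re.compile(r'\br/\w+')  # \b requires a word boundary before "r"

samples = [
    "subscribe to r/datascience today",
    "my repo: github.com/user/project",
]

for s in samples:
    # Left: unanchored result. Right: anchored result.
    print(repr(unanchored.sub('', s)), '|', repr(anchored.sub('', s)))
```

On the second string the unanchored pattern strips `r/project` out of `user/project`, while the anchored pattern leaves the path untouched. Both patterns remove `r/datascience` from the first string.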

## Creating a Processing Pipeline for Larger Datasets

For larger datasets, you might want to create a more structured pipeline:

In [None]:
def process_reddit_posts_for_bert(posts_df, text_column='post_text', batch_size=1000):
    """
    Process a large DataFrame of Reddit posts for BERT uncased model

    Args:
        posts_df (DataFrame): DataFrame containing Reddit posts
        text_column (str): Column name containing the text to clean
        batch_size (int): Number of posts to process at once

    Returns:
        DataFrame: DataFrame with original and cleaned text
    """
    if text_column not in posts_df.columns:
        raise ValueError(f"Column '{text_column}' not found in DataFrame")

    # Create a copy to avoid modifying the original
    result_df = posts_df.copy()

    # Add a column for the cleaned text
    result_df['bert_ready_text'] = ''

    # Process in batches to handle large datasets
    total_rows = len(result_df)
    for i in range(0, total_rows, batch_size):
        end_idx = min(i + batch_size, total_rows)
        batch = result_df.iloc[i:end_idx]

        # Apply cleaning function
        result_df.loc[batch.index, 'bert_ready_text'] = batch[text_column].apply(clean_for_bert_uncased)

        # Print progress
        print(f"Processed {end_idx}/{total_rows} posts ({end_idx/total_rows*100:.1f}%)")

    return result_df

# For demonstration, let's create a slightly larger dataset
larger_sample = sample_posts * 2  # Just duplicate our samples for demonstration
larger_df = pd.DataFrame({'post_text': larger_sample})

# Process the larger dataset
processed_df = process_reddit_posts_for_bert(larger_df, batch_size=5)
print(f"\nProcessed {len(processed_df)} posts.")

# Show a sample of the processed data
print("\nSample of processed data:")
print(processed_df.head(3))
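
As an alternative to looping over batches, some of the same substitutions can be expressed with pandas' vectorized string methods. The following is a minimal sketch under stated assumptions, not a drop-in replacement: `clean_series_vectorized` is a hypothetical helper, and it covers only the URL, subreddit/username, hashtag, lowercase, and whitespace steps.

```python
import pandas as pd

def clean_series_vectorized(texts: pd.Series) -> pd.Series:
    """Hypothetical vectorized variant covering a subset of the cleaning steps."""
    return (
        texts.fillna('').astype(str)                             # missing values -> ''
        .str.replace(r'https?://\S+|www\.\S+', '', regex=True)   # URLs
        .str.replace(r'\b[ru]/\w+', '', regex=True)              # r/sub and u/user
        .str.replace(r'#(\w+)', r'\1', regex=True)               # keep hashtag text
        .str.lower()
        .str.replace(r'\s+', ' ', regex=True)                    # collapse whitespace
        .str.strip()
    )

demo = pd.Series(["Visit https://example.com and r/python  #NLP", None])
cleaned = clean_series_vectorized(demo)
print(cleaned.tolist())
```

Whether this is faster than `apply` depends on the data and the pandas version, so it is worth timing both on a representative sample before switching.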

## Preparing for BERT Embedding Generation

After cleaning the text, the next step would be to tokenize it using BERT's tokenizer and generate embeddings. Here's how you would typically do that using the `transformers` library:

```python
# This code requires the transformers library
# pip install transformers

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a text
def get_bert_embedding(text, model, tokenizer):
    # The tokenizer adds [CLS] and [SEP] itself, so the text is passed unmodified.
    # A single text needs no padding; truncation caps it at BERT's 512-token limit.
    encoded = tokenizer(
        text,
        add_special_tokens=True,
        max_length=512,
        truncation=True,
        return_tensors='pt'
    )

    # Run the model without tracking gradients (inference only)
    with torch.no_grad():
        outputs = model(**encoded)

    # outputs[0] is the last hidden state with shape (batch, seq_len, hidden);
    # position 0 holds the [CLS] token, used here as the sentence embedding
    sentence_embedding = outputs[0][:, 0, :].numpy()

    return sentence_embedding
```
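
To make the indexing in the last step concrete, the sketch below applies the same `[:, 0, :]` slice to a random NumPy array that stands in for the model output. The array and its small dimensions are assumptions for illustration; `bert-base-uncased` produces a hidden size of 768.

```python
import numpy as np

# Stand-in for the model's last hidden state: (batch, seq_len, hidden).
batch_size, seq_len, hidden_size = 2, 6, 8  # toy sizes, not BERT's real ones
rng = np.random.default_rng(0)
last_hidden_state = rng.normal(size=(batch_size, seq_len, hidden_size))

# Position 0 on the sequence axis is where [CLS] sits after tokenization,
# so this keeps one vector per input text.
cls_embeddings = last_hidden_state[:, 0, :]

print(cls_embeddings.shape)  # (2, 8)
```

The result has one row per text in the batch, which is the shape expected by most downstream clustering or similarity code.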

Note: This code requires the `transformers` and `torch` libraries, so it is shown as a code example rather than as an executable cell.