# Offensive Words Collection from Urban Dictionary

This notebook samples random entries from the Urban Dictionary API, keeps the words that pass a keyword-based offensiveness score and a WordNet check, and groups them by the year of their earliest definition (2005-2025).

**Features:**
- Chronological organization by first definition date
- WordNet validation (requires multiple synsets)
- Offensive content scoring system
- Configurable total words and words per year
- JSON export functionality


In [None]:
import requests
import time
from datetime import datetime
from collections import defaultdict
import json
from nltk.corpus import wordnet

## Configuration

Define keywords and patterns used for offensive content detection.

In [None]:
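# Substrings searched for in lowercased definitions and examples.
# Stems such as 'discriminat' and 'harass' match every inflected form.
OFFENSIVE_KEYWORDS = [
    'insult', 'offensive', 'derogatory', 'rude', 'hateful', 'abusive',
    'contemptuous', 'pejorative', 'slur', 'racist', 'sexist', 'bigot',
    'discriminat', 'harass', 'mock', 'demean', 'stupid', 'idiot', 'loser'
]

# Also matched as substrings, so short entries ('ass', 'hell') can fire inside longer words.
EXPLICIT_WORDS = ['fuck', 'shit', 'ass', 'hell', 'dick', 'bitch', 'bastard']
# Phrases indicating that a definition describes a type of person.
PERSON_CONTEXTS = ['someone who', 'person who', 'people who', 'anyone who']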

✓ Configuration loaded
  - 19 offensive keywords
  - 7 explicit words
  - 4 person contexts
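
Since these lists are matched as plain substrings, short entries such as 'ass' or 'hell' also match inside unrelated words ('class', 'hello'). The sketch below contrasts the two behaviours. `contains_whole_word` is a hypothetical helper that is not part of this notebook, and it would stop stems like 'discriminat' from matching, so it is at most a possible refinement for the explicit-word list.

```python
import re

EXPLICIT_WORDS = ['fuck', 'shit', 'ass', 'hell', 'dick', 'bitch', 'bastard']

def contains_substring(text, words):
    """Plain substring test on lowercased text."""
    text = text.lower()
    return any(w in text for w in words)

def contains_whole_word(text, words):
    """Hypothetical alternative: match list entries only as whole words."""
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return re.search(pattern, text.lower()) is not None

sample = "Hello class, please pass the glass."
print(contains_substring(sample, EXPLICIT_WORDS))   # True: 'hell' and 'ass' occur as substrings
print(contains_whole_word(sample, EXPLICIT_WORDS))  # False: no list entry occurs as a whole word
```

The trade-off is recall: whole-word matching no longer catches inflected or compound forms, so the choice depends on which kind of error matters more for the collection.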


## API Functions

Functions to interact with Urban Dictionary API.


In [None]:
def get_random_urban_words():
    """Fetch random words from Urban Dictionary API"""
    url = "https://api.urbandictionary.com/v0/random"
    try:
        response = requests.get(url, timeout=5)
        response.raise_for_status()
        return response.json().get('list', [])
    except (requests.RequestException, ValueError) as e:
        print(f"Error fetching random words: {e}")
        return []

def get_word_definitions(word):
    """Fetch all definitions for a specific word"""
    url = "https://api.urbandictionary.com/v0/define"
    try:
        response = requests.get(url, params={'term': word}, timeout=5)
        response.raise_for_status()
        return response.json().get('list', [])
    except (requests.RequestException, ValueError) as e:
        print(f"Error fetching '{word}': {e}")
        return []


✓ API connection successful - Retrieved 10 random entries
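
Both fetchers return an empty list on any failure, so a transient network error looks the same as "no results". One way to soften that is a small retry wrapper with exponential backoff. The sketch below is an assumption, not part of the notebook: `with_retries`, its default attempt count and its delays are hypothetical, and it works on any zero-argument callable that returns a list.

```python
import time

def with_retries(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it returns a non-empty list or attempts run out.

    Hypothetical helper: `fetch` is any zero-argument callable that returns
    a list (empty on failure), e.g. lambda: get_word_definitions("word").
    """
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5 s, then 1 s, then 2 s, ...
    return []

# Demo with a fake fetcher that fails twice before succeeding (no network needed).
calls = {"n": 0}

def flaky_fetch():
    calls["n"] += 1
    return [] if calls["n"] < 3 else [{"word": "example"}]

delays = []  # record the requested pauses instead of actually sleeping
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

A word that genuinely has no definitions also returns an empty list and would be retried, so the attempt count should stay small.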


## WordNet Validation

Validate words using WordNet to ensure they have multiple synsets.


In [None]:
def has_multiple_synsets(word):
    """
    Check if a word has more than one synset in WordNet.
    Words with several synsets are polysemous, so they may have shifted
    in meaning over time.
    """
    word_clean = word.lower().replace(' ', '_')
    synsets = wordnet.synsets(word_clean)
    return len(synsets) > 1


WordNet validation test:
  dog        → 8 synsets → ✓ Valid
  run        → 57 synsets → ✓ Valid
  asdfgh     → 0 synsets → ✗ Invalid
  test       → 13 synsets → ✓ Valid
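
WordNet stores multi-word lemmas with underscores (for example `hot_dog`), which is why the lookup replaces spaces before querying. The sketch below isolates that normalization step. `count_synsets` and its `lookup` parameter are hypothetical, and the demo uses a made-up index so it runs without the WordNet corpus installed.

```python
def count_synsets(term, lookup=None):
    """Count synsets for a term after normalizing it to WordNet lemma form.

    `lookup` is a hypothetical injection point; when omitted, the real
    nltk.corpus.wordnet.synsets function is used (requires the corpus).
    """
    if lookup is None:
        from nltk.corpus import wordnet
        lookup = wordnet.synsets
    lemma = term.lower().strip().replace(' ', '_')
    return len(lookup(lemma))

# Made-up index standing in for WordNet; the counts here are not real data.
fake_index = {"hot_dog": ["s1", "s2", "s3"], "dog": ["s1"]}

def fake_lookup(lemma):
    return fake_index.get(lemma, [])

print(count_synsets("Hot Dog", lookup=fake_lookup))  # 3 with the made-up index
print(count_synsets("unknown phrase", lookup=fake_lookup))  # 0
```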


## Offensive Content Analysis

Analyze text content to identify and score offensive language.

In [None]:
def is_offensive_text(text):
    """Check if text contains offensive language"""
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in OFFENSIVE_KEYWORDS)

def calculate_offensive_score(definitions):
    """
    Calculate offensive score based on definitions and examples.
    Higher score = more likely to be offensive content.
    
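    Scoring, applied to every entry and summed over all entries:
    - Offensive keyword in the definition: +3
    - Offensive keyword in the example: +2
    - Explicit word in the definition or example: +2
    - Person context in the definition: +1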
    """
    score = 0
    
    for entry in definitions:
        definition = entry.get('definition', '').lower()
        example = entry.get('example', '').lower()
        
        # Check for offensive keywords
        if is_offensive_text(definition):
            score += 3
        if is_offensive_text(example):
            score += 2
        
        # Check for explicit words
        if any(word in definition or word in example for word in EXPLICIT_WORDS):
            score += 2
        
        # Check for person contexts (definitions phrased as "someone who ...")
        if any(ctx in definition for ctx in PERSON_CONTEXTS):
            score += 1
    
    return score


Offensive scoring test:
  Definition 1: score = 1
  Definition 2: score = 5


## Data Processing

Extract and process word data from Urban Dictionary entries.


In [None]:
def extract_year(date_string):
    """Extract year from ISO date string"""
    try:
        date_obj = datetime.fromisoformat(date_string.replace('Z', '+00:00'))
        return date_obj.year
    except (AttributeError, TypeError, ValueError):
        return None

def clean_text(text):
    """Strip the square brackets Urban Dictionary uses to mark linked terms"""
    return text.replace('[', '').replace(']', '').strip()

def process_word(word_entries, min_score=4):
    """
    Process word entries and return aggregated data.
    
    Returns:
        dict with keys: word, year, definition, examples
        None if word doesn't meet criteria
    """
    if not word_entries:
        return None
    
    # Sort by date (oldest first)
    sorted_entries = sorted(word_entries, key=lambda x: x.get('written_on', ''))
    
    # Find first valid year
    first_year = None
    for entry in sorted_entries:
        year = extract_year(entry.get('written_on', ''))
        if year and 2005 <= year <= 2025:
            first_year = year
            break
    
    if not first_year:
        return None
    
    # Check offensive score
    score = calculate_offensive_score(word_entries)
    if score < min_score:
        return None
    
    # Validate with WordNet
    word = word_entries[0]['word']
    if not has_multiple_synsets(word):
        return None
    
    # Aggregate definitions and examples (max 5)
    definitions = []
    examples = []
    
    for entry in word_entries[:5]:
        definition = clean_text(entry.get('definition', ''))
        example = clean_text(entry.get('example', ''))
        
        if definition and definition not in definitions:
            definitions.append(definition)
        if example and example not in examples:
            examples.append(example)
    
    return {
        'word': word,
        'year': first_year,
        'definition': ' | '.join(definitions),
        'examples': ' | '.join(examples)
    }


✓ Data processing functions loaded
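
The year assigned to a word is the earliest valid year among its entries. The sketch below shows that step in isolation on made-up entries. The timestamp format (ISO 8601 with a trailing `Z`) is an assumption about what the API returns, and entries with a missing or malformed date are simply skipped.

```python
from datetime import datetime

def extract_year(date_string):
    """Return the year of an ISO date string, or None if it cannot be parsed."""
    try:
        return datetime.fromisoformat(date_string.replace('Z', '+00:00')).year
    except (AttributeError, TypeError, ValueError):
        return None

# Made-up entries; the third one has no usable date.
entries = [
    {'word': 'sample', 'written_on': '2012-07-01T10:15:00.000Z'},
    {'word': 'sample', 'written_on': '2006-03-14T08:00:00.000Z'},
    {'word': 'sample', 'written_on': ''},
]

years = [
    year
    for year in (extract_year(e.get('written_on', '')) for e in entries)
    if year and 2005 <= year <= 2025
]
first_year = min(years) if years else None
print(first_year)  # 2006
```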


## Main Collection Function

Core function to collect offensive words organized by year.


In [None]:
from tqdm import tqdm

def collect_offensive_words(total_words=100, words_per_year=5):
    """
    Collect offensive words organized by year.
    
    Args:
        total_words (int): Total number of words to collect
        words_per_year (int | None): Maximum words kept per year (2005-2025);
            if None, defaults to total_words // 21
    
    Returns:
        dict: Words organized by year
    """
    years = list(range(2005, 2026))  # 2005-2025
    
    if words_per_year is None:
        words_per_year = total_words // len(years)
    
    words_by_year = defaultdict(list)
    seen_words = set()
    attempts = 0
    max_attempts = total_words * 50  # Safety limit
    
    print(f"🎯 Target: {total_words} total words ({words_per_year} per year)")
    print("📅 Period: 2005-2025")
    print("🔍 Filters: WordNet (>1 synset) + offensive score")
    print("-" * 70)
    
    # Progress bar
    pbar = tqdm(total=total_words, desc="Collecting words", unit="word")
    
    while attempts < max_attempts:
        attempts += 1
        
        # Get batch of random words
        entries = get_random_urban_words()
        if not entries:
            time.sleep(1)
            continue
        
        for entry in entries:
            word = entry.get('word', '').lower().strip()
            
            # Skip if already processed; mark every candidate as seen so
            # rejected words are not fetched again
            if not word or word in seen_words:
                continue
            seen_words.add(word)
            
            # Get all definitions
            all_defs = get_word_definitions(word)
            time.sleep(0.3)  # Rate limiting
            
            if not all_defs:
                continue
            
            # Process the word
            word_data = process_word(all_defs)
            
            if word_data:
                year = word_data['year']
                
                # Add if still needed for that year
                if len(words_by_year[year]) < words_per_year:
                    words_by_year[year].append(word_data)
                    seen_words.add(word)
                    
                    total_collected = sum(len(v) for v in words_by_year.values())
                    pbar.update(1)
                    pbar.set_postfix({'word': word[:15], 'year': year})
                    
                    # Check completion
                    if total_collected >= total_words:
                        pbar.close()
                        print("\n✅ Collection completed!")
                        return words_by_year
        
        time.sleep(0.5)
    
    pbar.close()
    print(f"\n⚠️ Reached limit of {max_attempts} attempts")
    return words_by_year

print("✓ Collection function ready")


✓ Collection function ready
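
The stopping logic combines a per-year cap with a global target. The sketch below reproduces just that bookkeeping on made-up candidates, without any network calls. `fill_quotas` is a hypothetical name used only for this illustration.

```python
from collections import defaultdict

def fill_quotas(candidates, total_words, words_per_year):
    """Keep candidates until total_words is reached, capping each year."""
    words_by_year = defaultdict(list)
    collected = 0
    for item in candidates:
        year = item['year']
        if len(words_by_year[year]) >= words_per_year:
            continue  # this year is already full, so the candidate is dropped
        words_by_year[year].append(item)
        collected += 1
        if collected >= total_words:
            break
    return dict(words_by_year)

# Made-up candidates in the order they might arrive from random sampling.
candidates = [
    {'word': 'a', 'year': 2010},
    {'word': 'b', 'year': 2010},
    {'word': 'c', 'year': 2011},
    {'word': 'd', 'year': 2010},
    {'word': 'e', 'year': 2012},
]
kept = fill_quotas(candidates, total_words=3, words_per_year=1)
print({year: [w['word'] for w in items] for year, items in kept.items()})
# {2010: ['a'], 2011: ['c'], 2012: ['e']}
```

With a cap of one word per year, two of the five candidates are discarded, which illustrates why a tight cap makes the run slower: accepted words for an already full year are wasted work.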


## Output and Export Functions

Save results and display summaries.


In [None]:
def save_results(words_by_year, filename='offensive_words.json'):
    """Save results to JSON file"""
    results = []
    for year in sorted(words_by_year.keys()):
        results.extend(words_by_year[year])
    
    with open(filename, 'w', encoding='utf-8') as f:
        json.dump(results, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Saved: {filename}")
    return results

def print_summary(words_by_year):
    """Print collection summary"""
    print("\n" + "=" * 70)
    print("📋 SUMMARY")
    print("=" * 70)
    
    for year in sorted(words_by_year.keys()):
        words = words_by_year[year]
        print(f"\n{year} → {len(words)} words:")
        for w in words:
            defn = (w['definition'][:50] + "...") if len(w['definition']) > 50 else w['definition']
            print(f"  • {w['word']:20} {defn}")
    
    total = sum(len(v) for v in words_by_year.values())
    print(f"\n{'=' * 70}")
    print(f"TOTAL: {total} words across {len(words_by_year)} years")
    print(f"{'=' * 70}\n")

print("✓ Output functions ready")


✓ Output functions ready
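
The exported file is a flat list of records, so a later analysis has to regroup it by year. A minimal sketch of that round trip follows. `load_results` is a hypothetical helper, and the records are made up and written to a temporary directory so nothing in the working directory is touched.

```python
import json
import os
import tempfile
from collections import defaultdict

def load_results(filename):
    """Read the exported JSON list and regroup the records by year."""
    with open(filename, encoding='utf-8') as f:
        records = json.load(f)
    words_by_year = defaultdict(list)
    for record in records:
        words_by_year[record['year']].append(record)
    return dict(words_by_year)

# Made-up records with the same keys as the exported data.
records = [
    {'word': 'alpha', 'year': 2007, 'definition': 'd1', 'examples': 'e1'},
    {'word': 'beta', 'year': 2007, 'definition': 'd2', 'examples': 'e2'},
    {'word': 'gamma', 'year': 2019, 'definition': 'd3', 'examples': 'e3'},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'offensive_words.json')
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    grouped = load_results(path)

print({year: [r['word'] for r in items] for year, items in grouped.items()})
# {2007: ['alpha', 'beta'], 2019: ['gamma']}
```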


## Run Collection

Execute the collection process with configurable parameters.

**Parameters:**
- `TOTAL_WORDS`: Total number of offensive words to collect
- `WORDS_PER_YEAR`: Target number of words per year (2005-2025)

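⚠️ **Warning:** This process may take 30-60 minutes depending on the target number of words: every candidate word costs one extra API request plus a 0.3 s pause, and only words that pass both filters and still fit their year's quota are kept.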
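
Before starting a long run it may be worth checking that the two parameters are compatible: with 21 years in the range, at most `WORDS_PER_YEAR * 21` words can ever be collected. The sketch below is a hypothetical pre-flight check (`check_targets` is not part of the notebook).

```python
YEARS = range(2005, 2026)  # 21 years, 2005-2025 inclusive

def check_targets(total_words, words_per_year):
    """Report whether the total target fits under the per-year cap."""
    if words_per_year < 1:
        raise ValueError("words_per_year must be at least 1")
    capacity = words_per_year * len(YEARS)
    return {
        'capacity': capacity,
        'reachable': total_words <= capacity,
        # Ceiling division: distinct years that must receive at least one word.
        'min_years_needed': -(-total_words // words_per_year),
    }

print(check_targets(20, 1))
# {'capacity': 21, 'reachable': True, 'min_years_needed': 20}
print(check_targets(100, 4))
# {'capacity': 84, 'reachable': False, 'min_years_needed': 25}
```

A target that is reachable but needs almost every year filled, such as 20 words at 1 per year, is likely to run much longer than one with slack, because late in the run most accepted words fall into years that are already full.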

In [None]:
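# CONFIGURABLE PARAMETERS
# Note: TOTAL_WORDS must not exceed WORDS_PER_YEAR * 21 (years 2005-2025),
# otherwise the run can only end at the attempt limit.
TOTAL_WORDS = 20      # Total number of words to collect
WORDS_PER_YEAR = 1     # Maximum number of words kept per year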

# Run collection
words_by_year = collect_offensive_words(
    total_words=TOTAL_WORDS,
    words_per_year=WORDS_PER_YEAR
)

# Save results
results = save_results(words_by_year, filename='offensive_words.json')

# Display summary
print_summary(words_by_year)
