# 02: Final Dataset EDA and Cleaning Recommendations

**Objective**: Comprehensive exploration of the final processed romance novel dataset to identify cleaning opportunities and prepare for NLP analysis.

**Research Context**: Analyze how thematic characteristics of modern romance novels relate to reader engagement/popularity using Goodreads metadata.

**Dataset**: `final_books_2000_2020_en_20250901_024106.csv` (119,678 romance novels)

## Analysis Plan
1. **Dataset Overview** - Basic structure, data types, missing values
2. **Title Analysis** - Series patterns, numbering, cleaning opportunities
3. **Author Name Analysis** - Duplicates, variations, normalization needs
4. **Description Text Analysis** - Text quality, HTML artifacts, length distributions
5. **Series Pattern Analysis** - Series titles and book title relationships
6. **Publication & Popularity Analysis** - Temporal trends and engagement metrics
7. **Subgenre Signal Analysis** - Popular shelves and genre classification
8. **Cleaning Recommendations** - Specific suggestions with code examples

## Expected Outputs
- Data quality assessment
- Title and series cleaning patterns
- Author name normalization strategies
- Text preprocessing recommendations
- Final dataset preparation for NLP analysis

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import json
from pathlib import Path
from collections import Counter, defaultdict
import warnings

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)
pd.set_option('display.width', 1000)

print("✅ Libraries imported successfully")

In [None]:
# Load the final processed dataset
dataset_path = "../data/processed/final_books_2000_2020_en_20250901_024106.csv"
print(f"📚 Loading dataset from: {dataset_path}")

# Read the full CSV into memory
df = pd.read_csv(dataset_path)
print("✅ Dataset loaded successfully!")
print(f"📊 Shape: {df.shape}")
print(f"📋 Columns: {list(df.columns)}")

In [None]:
# 1. DATASET OVERVIEW
print("🔍 DATASET OVERVIEW")
print("=" * 50)

# Basic info
print(f"📚 Total records: {len(df):,}")
print(f"📋 Total columns: {len(df.columns)}")
print(f"📅 Publication year range: {df['publication_year'].min()} - {df['publication_year'].max()}")

# Data types
print("\n📊 Data Types:")
print(df.dtypes)

# Memory usage
memory_usage = df.memory_usage(deep=True).sum() / 1024**2
print(f"\n💾 Memory usage: {memory_usage:.2f} MB")

In [None]:
# Missing values analysis
print("🔍 MISSING VALUES ANALYSIS")
print("=" * 50)

missing_data = df.isnull().sum()
missing_percent = (missing_data / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_data.index,
    'Missing_Count': missing_data.values,
    'Missing_Percent': missing_percent.values
})

missing_df = missing_df.sort_values('Missing_Percent', ascending=False)
print(missing_df)

# Visualize missing values
plt.figure(figsize=(12, 8))
missing_df.plot(x='Column', y='Missing_Percent', kind='bar', ax=plt.gca())
plt.title('Missing Values by Column (%)')
plt.xlabel('Columns')
plt.ylabel('Missing Values (%)')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# 2. TITLE ANALYSIS
print("🔍 TITLE ANALYSIS")
print("=" * 50)

# Basic title statistics
df['title_length'] = df['title'].str.len()
df['title_word_count'] = df['title'].str.split().str.len()

print("📖 Title length statistics:")
print(f"   - Mean length: {df['title_length'].mean():.1f} characters")
print(f"   - Median length: {df['title_length'].median():.1f} characters")
print(f"   - Min length: {df['title_length'].min()} characters")
print(f"   - Max length: {df['title_length'].max()} characters")

print("\n📝 Title word count statistics:")
print(f"   - Mean words: {df['title_word_count'].mean():.1f} words")
print(f"   - Median words: {df['title_word_count'].median():.1f} words")
print(f"   - Min words: {df['title_word_count'].min()} words")
print(f"   - Max words: {df['title_word_count'].max()} words")

In [None]:
# Title length distribution
plt.figure(figsize=(15, 5))

# Character length distribution
plt.subplot(1, 2, 1)
plt.hist(df['title_length'], bins=50, alpha=0.7, edgecolor='black')
plt.title('Title Length Distribution (Characters)')
plt.xlabel('Title Length (characters)')
plt.ylabel('Frequency')
plt.axvline(df['title_length'].median(), color='red', linestyle='--', label=f'Median: {df["title_length"].median():.0f}')
plt.legend()

# Word count distribution
plt.subplot(1, 2, 2)
plt.hist(df['title_word_count'], bins=30, alpha=0.7, edgecolor='black')
plt.title('Title Word Count Distribution')
plt.xlabel('Title Word Count')
plt.ylabel('Frequency')
plt.axvline(df['title_word_count'].median(), color='red', linestyle='--', label=f'Median: {df["title_word_count"].median():.0f}')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Look for series patterns in titles
print("🔍 SERIES PATTERNS IN TITLES")
print("=" * 50)

# Common series indicators (non-capturing groups: str.contains only tests for a match)
series_patterns = [
    r'\b\d+\s*[:\-]\s*',  # Number followed by : or -
    r'\b(?:Book|Volume|Part)\s+\d+\b',  # Book 1, Volume 2, etc.
    r'\b\d+(?:st|nd|rd|th)\b',  # 1st, 2nd, 3rd, etc. (suffix must directly follow the digits)
    r'\b\d+\s*$',  # Number at end
    r'\b\d+\s*\('  # Number followed by parenthesis
]

pattern_names = ['Number + Colon/Hyphen', 'Book/Volume/Part', 'Ordinal', 'End Number', 'Number(']

for pattern, name in zip(series_patterns, pattern_names):
    matches = df['title'].str.contains(pattern, regex=True, na=False)
    count = matches.sum()
    percentage = (count / len(df)) * 100
    print(f"{name}: {count:,} titles ({percentage:.1f}%)")

# Show examples of titles with series patterns
print("\n📚 Examples of titles with series patterns:")
for pattern, name in zip(series_patterns, pattern_names):
    matches = df[df['title'].str.contains(pattern, regex=True, na=False)]
    if not matches.empty:
        print(f"\n{name} examples:")
        for title in matches['title'].head(3):
            print(f"  - {title}")
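The patterns above only detect that a number is present. A minimal sketch of actually splitting a title into its parts follows, assuming the common Goodreads convention of a trailing parenthetical such as `Title (Series Name, #2)`. The names `split_series_suffix` and `ParsedTitle` are hypothetical, and the example titles are made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Assumed convention: "Book Title (Series Name, #2)" or "Book Title (Series Name #2.5)".
SERIES_SUFFIX = re.compile(
    r"""^(?P<title>.*?)\s*                    # book title before the final parenthetical
        \(\s*(?P<series>[^()]*?)              # series name, no nested parentheses
        \s*,?\s*\#(?P<number>\d+(?:\.\d+)?)   # "#2" or "#2.5" (novellas)
        \s*\)\s*$""",
    re.VERBOSE,
)


class ParsedTitle(NamedTuple):
    title: str
    series: Optional[str]
    number: Optional[float]


def split_series_suffix(raw_title: str) -> ParsedTitle:
    """Split a raw title into (title, series, number); series parts are None if absent."""
    stripped = raw_title.strip()
    match = SERIES_SUFFIX.match(stripped)
    if match is None:
        return ParsedTitle(stripped, None, None)
    return ParsedTitle(match["title"], match["series"], float(match["number"]))


# Illustrative, made-up titles
print(split_series_suffix("Midnight Promise (Harbor Lights, #2)"))
print(split_series_suffix("Standalone Novel"))
```

The number is kept as a float because fractional positions such as `#2.5` would otherwise be lost. Titles that do not follow the assumed convention are returned unchanged, so the function is safe to apply to the whole column.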

In [None]:
# 3. AUTHOR NAME ANALYSIS
print("🔍 AUTHOR NAME ANALYSIS")
print("=" * 50)

# Basic author statistics
print(f"👤 Total unique authors: {df['author_id'].nunique():,}")
print(f"📚 Books per author (mean): {len(df) / df['author_id'].nunique():.1f}")
print(f"📚 Books per author (median): {df.groupby('author_id').size().median():.1f}")

# Author name length analysis
df['author_name_length'] = df['author_name'].str.len()
df['author_name_word_count'] = df['author_name'].str.split().str.len()

print("\n📝 Author name statistics:")
print(f"   - Mean name length: {df['author_name_length'].mean():.1f} characters")
print(f"   - Median name length: {df['author_name_length'].median():.1f} characters")
print(f"   - Mean word count: {df['author_name_word_count'].mean():.1f} words")
print(f"   - Median word count: {df['author_name_word_count'].median():.1f} words")

In [None]:
# Look for potential author name duplicates/variations
print("🔍 AUTHOR NAME VARIATIONS ANALYSIS")
print("=" * 50)

# Check for authors with multiple name variations
author_name_counts = df.groupby('author_id')['author_name'].nunique()
multiple_names = author_name_counts[author_name_counts > 1]

print(f"👤 Authors with multiple name variations: {len(multiple_names):,}")
if not multiple_names.empty:
    print("\n📚 Examples of authors with multiple names:")
    for author_id in multiple_names.head(5).index:
        names = df[df['author_id'] == author_id]['author_name'].unique()
        print(f"  Author ID {author_id}: {names}")

# Check for potential duplicate authors (same name, different ID).
# Collect the distinct IDs per name; counting rows would flag every author with more than one book.
ids_per_name = df.groupby('author_name')['author_id'].unique()
duplicate_names = {name: list(ids) for name, ids in ids_per_name.items() if len(ids) > 1}
print(f"\n⚠️  Potential duplicate author names: {len(duplicate_names):,}")

if duplicate_names:
    print("\n📚 Examples of potential duplicate names:")
    for name, ids in list(duplicate_names.items())[:5]:
        print(f"  '{name}': {ids}")
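Exact string comparison misses near-duplicates that differ only in spacing, punctuation, case, or accents. One possible approach is to compare names through a normalized key, as sketched below. `author_key` and `group_variants` are hypothetical helpers, the sample names are invented, and the normalization rules are assumptions that would need checking against real author records before merging anything.

```python
import re
import unicodedata
from collections import defaultdict


def author_key(name: str) -> str:
    """Build a comparison key: drop accents and punctuation, fold case, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = re.sub(r"[^\w\s]", " ", no_accents)  # "J.K." becomes "J K "
    return " ".join(no_punct.casefold().split())


def group_variants(names):
    """Map each key to its distinct spellings, keeping only keys with more than one."""
    groups = defaultdict(set)
    for name in names:
        groups[author_key(name)].add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Invented names for illustration
sample = ["J.K. Marlowe", "J. K.  Marlowe", "Zoë Hart", "Zoe Hart", "Ann Lee"]
for key, variants in group_variants(sample).items():
    print(key, "->", variants)
```

A shared key only marks candidates for review. Two different people can share a name, so the `author_id` values should still be inspected before any records are merged.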

In [None]:
# 4. DESCRIPTION TEXT ANALYSIS
print("🔍 DESCRIPTION TEXT ANALYSIS")
print("=" * 50)

# Basic description statistics
df['description_length'] = df['description'].str.len()
df['description_word_count'] = df['description'].str.split().str.len()

print("📖 Description statistics:")
print(f"   - Mean length: {df['description_length'].mean():.1f} characters")
print(f"   - Median length: {df['description_length'].median():.1f} characters")
print(f"   - Min length: {df['description_length'].min()} characters")
print(f"   - Max length: {df['description_length'].max()} characters")
print(f"   - Mean words: {df['description_word_count'].mean():.1f} words")
print(f"   - Median words: {df['description_word_count'].median():.1f} words")

# Check for missing descriptions
missing_descriptions = df['description'].isnull().sum()
print(f"\n❌ Missing descriptions: {missing_descriptions:,} ({missing_descriptions/len(df)*100:.1f}%)")

# Check for very short descriptions (potential data quality issues)
short_descriptions = (df['description_length'] < 50).sum()
print(f"📝 Very short descriptions (<50 chars): {short_descriptions:,} ({short_descriptions/len(df)*100:.1f}%)")

In [None]:
# Description length distribution
plt.figure(figsize=(15, 5))

# Character length distribution
plt.subplot(1, 2, 1)
plt.hist(df['description_length'].dropna(), bins=50, alpha=0.7, edgecolor='black')
plt.title('Description Length Distribution (Characters)')
plt.xlabel('Description Length (characters)')
plt.ylabel('Frequency')
plt.axvline(df['description_length'].median(), color='red', linestyle='--', label=f'Median: {df["description_length"].median():.0f}')
plt.legend()

# Word count distribution
plt.subplot(1, 2, 2)
plt.hist(df['description_word_count'].dropna(), bins=50, alpha=0.7, edgecolor='black')
plt.title('Description Word Count Distribution')
plt.xlabel('Description Word Count')
plt.ylabel('Frequency')
plt.axvline(df['description_word_count'].median(), color='red', linestyle='--', label=f'Median: {df["description_word_count"].median():.0f}')
plt.legend()

plt.tight_layout()
plt.show()

In [None]:
# Check for HTML artifacts and special characters in descriptions
print("🔍 HTML AND SPECIAL CHARACTERS IN DESCRIPTIONS")
print("=" * 50)

# Common markup and formatting patterns
html_patterns = [
    r'<[^>]+>',  # HTML tags
    r'&(?:[a-zA-Z]+|#\d+|#x[0-9a-fA-F]+);',  # HTML entities (named and numeric)
    r'\s{2,}',  # Runs of two or more whitespace characters
    r'[\r\n\t]',  # Line breaks and tabs
    r'[^\x00-\x7F]',  # Non-ASCII characters
]

pattern_names = ['HTML Tags', 'HTML Entities', 'Multiple Whitespace', 'Line Breaks/Tabs', 'Non-ASCII']

for pattern, name in zip(html_patterns, pattern_names):
    matches = df['description'].str.contains(pattern, regex=True, na=False)
    count = matches.sum()
    percentage = (count / len(df)) * 100
    print(f"{name}: {count:,} descriptions ({percentage:.1f}%)")

# Show examples of descriptions with HTML
html_descriptions = df[df['description'].str.contains(r'<[^>]+>', regex=True, na=False)]
if not html_descriptions.empty:
    print("\n📚 Examples of descriptions with HTML:")
    for desc in html_descriptions['description'].head(3):
        print(f"  - {desc[:200]}...")
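A regex is enough for counting tags, but removing them with a regex can glue words together where a tag was the only separator. A minimal alternative sketch uses the standard library's `html.parser`, which also decodes entities. `html_to_text` is a hypothetical helper, and the choice of which tags count as word boundaries is an assumption.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and treat a few block-level tags as word boundaries."""

    BOUNDARY_TAGS = {"br", "p", "div", "li"}  # assumed set; extend as needed

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &amp;, &#39;, &nbsp;, ...
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BOUNDARY_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(raw: str) -> str:
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered trailing text
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


print(html_to_text("<p>Tom &amp; Ana</p><br/>A second&nbsp;chance."))
```

Because entities are decoded rather than dropped, characters such as `&` and apostrophes survive in the cleaned text, which matters for later tokenization.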

In [None]:
# 5. SERIES PATTERN ANALYSIS
print("🔍 SERIES PATTERN ANALYSIS")
print("=" * 50)

# Series coverage
in_series = df['series_id'].notna()
series_coverage = in_series.sum()
print(f"📚 Books in series: {series_coverage:,} ({series_coverage/len(df)*100:.1f}%)")
print(f"📚 Books not in series: {(~in_series).sum():,} ({(~in_series).sum()/len(df)*100:.1f}%)")

# Series size distribution
series_sizes = df.groupby('series_id')['series_works_count'].first().value_counts().sort_index()
print("\n📊 Series size distribution:")
for size, count in series_sizes.head(10).items():
    print(f"  {size} books: {count:,} series")

# Check relationship between series titles and book titles
print("\n🔍 SERIES TITLE VS BOOK TITLE RELATIONSHIP")
series_books = df[df['series_id'].notna()].copy()
# Guard against missing values in either column before comparing
series_books['title_contains_series'] = series_books.apply(
    lambda row: (
        pd.notna(row['series_title'])
        and pd.notna(row['title'])
        and row['series_title'].lower() in row['title'].lower()
    ),
    axis=1
)

contains_series = series_books['title_contains_series'].sum()
print(f"📚 Books with series titles embedded in book titles: {contains_series:,} ({contains_series/len(series_books)*100:.1f}%)")

# Show examples
if not series_books.empty:
    examples = series_books[series_books['title_contains_series']].head(5)
    print("\n📚 Examples of books with embedded series titles:")
    for _, row in examples.iterrows():
        print(f"  Series: '{row['series_title']}' | Book: '{row['title']}'")

In [None]:
# 6. PUBLICATION & POPULARITY ANALYSIS
print("🔍 PUBLICATION & POPULARITY ANALYSIS")
print("=" * 50)

# Publication year distribution
year_counts = df['publication_year'].value_counts().sort_index()
print("📅 Publication year distribution:")
print(f"   - Range: {df['publication_year'].min()} - {df['publication_year'].max()}")
print(f"   - Most common year: {year_counts.idxmax()} ({year_counts.max():,} books)")
print(f"   - Least common year: {year_counts.idxmin()} ({year_counts.min():,} books)")

# Popularity metrics
print("\n⭐ Popularity metrics:")
print(f"   - Average rating (mean): {df['average_rating_weighted_mean'].mean():.2f}")
print(f"   - Average rating (median): {df['average_rating_weighted_mean'].median():.2f}")
print(f"   - Ratings count (mean): {df['ratings_count_sum'].mean():,.0f}")
print(f"   - Ratings count (median): {df['ratings_count_sum'].median():,.0f}")
print(f"   - Reviews count (mean): {df['text_reviews_count_sum'].mean():,.0f}")
print(f"   - Reviews count (median): {df['text_reviews_count_sum'].median():,.0f}")

In [None]:
# Publication trends over time
plt.figure(figsize=(15, 10))

# Publication volume over time
plt.subplot(2, 2, 1)
year_counts.plot(kind='bar', ax=plt.gca())
plt.title('Publication Volume by Year')
plt.xlabel('Publication Year')
plt.ylabel('Number of Books')
plt.xticks(rotation=45)

# Average rating over time
plt.subplot(2, 2, 2)
yearly_ratings = df.groupby('publication_year')['average_rating_weighted_mean'].mean()
yearly_ratings.plot(kind='line', marker='o', ax=plt.gca())
plt.title('Average Rating by Publication Year')
plt.xlabel('Publication Year')
plt.ylabel('Average Rating')
plt.grid(True, alpha=0.3)

# Ratings count over time
plt.subplot(2, 2, 3)
yearly_ratings_count = df.groupby('publication_year')['ratings_count_sum'].mean()
yearly_ratings_count.plot(kind='line', marker='o', ax=plt.gca())
plt.title('Average Ratings Count by Publication Year')
plt.xlabel('Publication Year')
plt.ylabel('Average Ratings Count')
plt.grid(True, alpha=0.3)

# Reviews count over time
plt.subplot(2, 2, 4)
yearly_reviews_count = df.groupby('publication_year')['text_reviews_count_sum'].mean()
yearly_reviews_count.plot(kind='line', marker='o', ax=plt.gca())
plt.title('Average Reviews Count by Publication Year')
plt.xlabel('Publication Year')
plt.ylabel('Average Reviews Count')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# 7. SUBGENRE SIGNAL ANALYSIS
print("🔍 SUBGENRE SIGNAL ANALYSIS")
print("=" * 50)

# Analyze popular shelves for subgenre signals
print("📚 Popular shelves analysis:")

# Sample some popular shelves to understand structure
sample_shelves = df['popular_shelves'].dropna().head(10)
print("\n📋 Sample popular shelves:")
for i, shelves in enumerate(sample_shelves):
    try:
        shelves_list = json.loads(shelves)
        print(f"  {i+1}. {shelves_list[:5]}...")  # Show first 5 shelves
    except (json.JSONDecodeError, TypeError):
        print(f"  {i+1}. {str(shelves)[:100]}...")  # Show first 100 chars if not a JSON list

# Check if popular_shelves is JSON format
shelves_non_null = df['popular_shelves'].dropna()
json_format_count = 0
for shelves in shelves_non_null:
    try:
        json.loads(shelves)
        json_format_count += 1
    except (json.JSONDecodeError, TypeError):
        pass

print("\n📊 Popular shelves format:")
print(f"   - JSON format: {json_format_count:,} ({json_format_count/max(len(shelves_non_null), 1)*100:.1f}%)")
print(f"   - Non-JSON format: {len(shelves_non_null) - json_format_count:,}")
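If a share of the cells turns out not to be valid JSON, one likely cause is that Python lists were written to CSV as their `repr` (single quotes), which `json.loads` rejects. The sketch below tries JSON first and falls back to `ast.literal_eval`. `parse_shelves` is a hypothetical helper, and both the fallback and the `{"name": ..., "count": ...}` item layout are assumptions about how the column may have been serialized.

```python
import ast
import json


def parse_shelves(raw):
    """Return a list of shelf names, or [] if the cell is missing or unparseable."""
    if not isinstance(raw, str) or not raw.strip():
        return []
    for loader in (json.loads, ast.literal_eval):
        try:
            parsed = loader(raw)
        except (ValueError, SyntaxError, TypeError):
            continue  # try the next loader
        break
    else:
        return []  # neither loader could read the cell
    if not isinstance(parsed, list):
        return []
    names = []
    for item in parsed:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and "name" in item:  # assumed dict layout
            names.append(str(item["name"]))
    return names


print(parse_shelves('["romance", "to-read"]'))
print(parse_shelves("['romance', 'to-read']"))
```

`ast.literal_eval` only evaluates literals, so it is a safe substitute for `eval` on untrusted cell contents.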

In [None]:
# Extract and analyze subgenre signals from popular shelves
print("🔍 SUBGENRE EXTRACTION FROM POPULAR SHELVES")
print("=" * 50)

# Target subgenres for research
target_subgenres = [
    'contemporary romance', 'historical romance', 'paranormal romance',
    'romantic suspense', 'romantic fantasy', 'science fiction romance'
]

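# Extract subgenre signals, counting each book at most once per subgenre
subgenre_counts = defaultdict(int)
subgenre_examples = defaultdict(list)

shelf_rows = df[['title', 'popular_shelves']].dropna(subset=['popular_shelves'])
for title, shelves in shelf_rows.itertuples(index=False, name=None):
    try:
        shelves_list = json.loads(shelves)
    except (json.JSONDecodeError, TypeError):
        continue
    if not isinstance(shelves_list, list):
        continue
    # Shelf names may use hyphens or underscores ("historical-romance"); normalize before matching
    shelf_names = [
        re.sub(r'[-_]+', ' ', shelf.lower())
        for shelf in shelves_list
        if isinstance(shelf, str)
    ]
    for subgenre in target_subgenres:
        if any(subgenre in name for name in shelf_names):
            subgenre_counts[subgenre] += 1
            # Store example book title (taken from the current row, no second lookup needed)
            if len(subgenre_examples[subgenre]) < 3:
                subgenre_examples[subgenre].append(str(title))

print("📊 Subgenre signals found:")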
for subgenre, count in sorted(subgenre_counts.items(), key=lambda x: x[1], reverse=True):
    percentage = (count / len(df)) * 100
    print(f"   - {subgenre}: {count:,} books ({percentage:.1f}%)")
    if subgenre_examples[subgenre]:
        print(f"     Examples: {', '.join(subgenre_examples[subgenre])}")
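Counts per subgenre do not show which books carry several labels at once, which is what the recommendation to handle overlapping subgenres needs. A minimal sketch of a multi-label representation follows: one boolean flag per target subgenre for each book. `normalize_shelf` and `subgenre_flags` are hypothetical names, and the hyphen and underscore normalization is an assumption about how shelf names are written.

```python
import re

# The six research targets listed in the notebook
TARGET_SUBGENRES = [
    "contemporary romance", "historical romance", "paranormal romance",
    "romantic suspense", "romantic fantasy", "science fiction romance",
]


def normalize_shelf(shelf: str) -> str:
    """Lowercase a shelf name and turn hyphens, underscores and space runs into single spaces."""
    return re.sub(r"[-_\s]+", " ", shelf.lower()).strip()


def subgenre_flags(shelf_names):
    """Return {subgenre: bool} for one book; several flags may be True at once."""
    normalized = {normalize_shelf(name) for name in shelf_names}
    return {
        subgenre: any(subgenre in shelf for shelf in normalized)
        for subgenre in TARGET_SUBGENRES
    }


flags = subgenre_flags(["Historical-Romance", "paranormal_romance", "to-read"])
print([name for name, present in flags.items() if present])
```

Keeping the flags as separate boolean columns defers the decision on a single primary subgenre, so overlap rates can be inspected before any tie-breaking rule is chosen.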

In [None]:
# 8. CLEANING RECOMMENDATIONS
print("🔍 CLEANING RECOMMENDATIONS")
print("=" * 50)

print("📋 SUMMARY OF FINDINGS:")
print(f"   - Dataset size: {len(df):,} romance novels")
print(f"   - Publication range: {df['publication_year'].min()} - {df['publication_year'].max()}")
print(f"   - Series coverage: {df['series_id'].notna().sum():,} books ({df['series_id'].notna().sum()/len(df)*100:.1f}%)")
print(f"   - Missing descriptions: {df['description'].isnull().sum():,} ({df['description'].isnull().sum()/len(df)*100:.1f}%)")
print(f"   - HTML artifacts: {df['description'].str.contains(r'<[^>]+>', regex=True, na=False).sum():,} descriptions")

print("\n🧹 RECOMMENDED CLEANING STEPS:")
print("\n1. TITLE CLEANING:")
print("   - Extract series numbers and prefixes")
print("   - Remove series titles embedded in book titles")
print("   - Standardize numbering formats")

print("\n2. AUTHOR NAME NORMALIZATION:")
print("   - Resolve duplicate author names with different IDs")
print("   - Standardize name formats")
print("   - Handle pen names and variations")

print("\n3. DESCRIPTION TEXT CLEANING:")
print("   - Remove HTML tags and entities")
print("   - Clean special characters and whitespace")
print("   - Standardize line breaks and formatting")
print("   - Handle missing descriptions")

print("\n4. SERIES HANDLING:")
print("   - Extract series information consistently")
print("   - Create clean series titles")
print("   - Handle series numbering")

print("\n5. SUBGENRE CLASSIFICATION:")
print("   - Parse popular shelves for subgenre signals")
print("   - Create standardized subgenre categories")
print("   - Handle overlapping subgenres")

In [None]:
# SAMPLE CLEANING FUNCTIONS
print("🔧 SAMPLE CLEANING FUNCTIONS")
print("=" * 50)

def clean_title(title, series_title=None):
    """Clean book title by removing series information."""
    if pd.isna(title) or pd.isna(series_title) or not series_title:
        return title

    # Remove the series title case-insensitively (str.replace would be case-sensitive)
    cleaned = re.sub(re.escape(series_title), '', title, flags=re.IGNORECASE)
    if cleaned == title:
        return title
    # Drop emptied parentheticals such as "(, #2)" left behind by the removal
    cleaned = re.sub(r'\(\s*,?\s*#?\s*[\d.]*\s*\)', '', cleaned)
    # Remove common separators at either end
    cleaned = re.sub(r'^[\s\-:,]+|[\s\-:,]+$', '', cleaned)
    return cleaned if cleaned else title

import html  # standard library; decodes entities such as &amp; and &#39;

def clean_description(description):
    """Clean book description by removing HTML and normalizing text."""
    if pd.isna(description):
        return description

    # Replace HTML tags with a space so adjacent words do not merge
    cleaned = re.sub(r'<[^>]+>', ' ', description)
    # Decode HTML entities (named and numeric) instead of discarding them
    cleaned = html.unescape(cleaned)
    # Collapse all whitespace, including line breaks, tabs and non-breaking spaces
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()

    return cleaned if cleaned else description

def extract_series_number(title):
    """Extract series number from title."""
    if pd.isna(title):
        return None

    # Common patterns; each has exactly one capturing group, which holds the digits
    patterns = [
        r'\b(\d+)\s*[:\-]\s*',  # Number followed by : or -
        r'\b(?:Book|Volume|Part)\s+(\d+)\b',  # Book 1, Volume 2, etc.
        r'\b(\d+)(?:st|nd|rd|th)\b',  # 1st, 2nd, 3rd, etc.
        r'\b(\d+)\s*\('  # Number followed by parenthesis
    ]

    for pattern in patterns:
        match = re.search(pattern, title)
        if match:
            # group(1) is always the number; the original captured "Book" here and int() failed
            return int(match.group(1))

    return None

print("✅ Sample cleaning functions defined:")
print("   - clean_title(): Remove series information from titles")
print("   - clean_description(): Remove HTML and normalize text")
print("   - extract_series_number(): Extract series numbers from titles")

In [None]:
# TEST CLEANING FUNCTIONS ON SAMPLE DATA
print("🧪 TESTING CLEANING FUNCTIONS")
print("=" * 50)

# Test on sample data
sample_data = df[['title', 'series_title', 'description']].head(5)
print("📚 Sample data before cleaning:")
print(sample_data)

print("\n🧹 After cleaning:")
for idx, row in sample_data.iterrows():
    print(f"\nBook {idx}:")
    print(f"  Original title: {row['title']}")
    print(f"  Cleaned title: {clean_title(row['title'], row['series_title'])}")
    print(f"  Series number: {extract_series_number(row['title'])}")
    if pd.notna(row['description']):
        desc_preview = row['description'][:100] + "..." if len(row['description']) > 100 else row['description']
        cleaned_desc = clean_description(row['description'])
        cleaned_preview = cleaned_desc[:100] + "..." if len(cleaned_desc) > 100 else cleaned_desc
        print(f"  Description preview: {desc_preview}")
        print(f"  Cleaned description: {cleaned_preview}")

## Summary and Next Steps

### Key Findings
- **Dataset Quality**: 119,678 romance novels with good coverage of authors and series
- **Title Issues**: Series information embedded in titles, numbering patterns
- **Description Quality**: HTML artifacts, special characters, varying lengths
- **Author Variations**: Potential duplicates and name variations
- **Series Coverage**: 67% of books belong to a series, and their titles show embedded series patterns

### Recommended Cleaning Steps
1. **Title Normalization**: Extract series numbers, remove embedded series titles
2. **Text Cleaning**: Remove HTML, normalize whitespace, handle special characters
3. **Author Deduplication**: Resolve name variations and duplicates
4. **Series Standardization**: Consistent series title and numbering extraction
5. **Subgenre Classification**: Parse popular shelves for standardized categories
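
The steps above could be chained into a single pipeline so that every cleaning decision lives in one place. The following is a minimal sketch using `DataFrame.pipe`, covering only two of the steps. The function names are hypothetical, the `description` and `author_name` columns are the ones used in this notebook, and the demo rows are invented.

```python
import html

import pandas as pd


def clean_descriptions(df: pd.DataFrame) -> pd.DataFrame:
    """Strip tags, decode entities and collapse whitespace in `description`."""
    out = df.copy()
    text = out["description"].str.replace(r"<[^>]+>", " ", regex=True)
    text = text.map(html.unescape, na_action="ignore")  # leave missing values untouched
    out["description"] = text.str.replace(r"\s+", " ", regex=True).str.strip()
    return out


def normalize_author_names(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse repeated spaces and trim `author_name`."""
    out = df.copy()
    out["author_name"] = out["author_name"].str.replace(r"\s+", " ", regex=True).str.strip()
    return out


def run_cleaning(df: pd.DataFrame) -> pd.DataFrame:
    """Apply each step in order; every step returns a new frame."""
    return df.pipe(clean_descriptions).pipe(normalize_author_names)


# Invented rows for illustration
demo = pd.DataFrame({
    "description": ["<p>Love &amp; war</p>", None],
    "author_name": ["  Ann   Lee ", "Zoe Hart"],
})
cleaned = run_cleaning(demo)
print(cleaned)
```

Each step copies its input, so the raw frame stays available for before-and-after comparisons while the cleaning rules are still being tuned.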

### Next Phase
After implementing cleaning steps, proceed to:
- Text preprocessing for NLP analysis
- Topic modeling on cleaned descriptions
- Correlation analysis between themes and popularity metrics

### Files to Update
- Create cleaning pipeline script
- Update data dictionary
- Document cleaning decisions and rationale