# IT2311 Assignment - Task 1a: Data Preparation

This notebook performs data understanding and data cleaning on the World Bank project documents dataset.

**Sub-tasks:**
1. **Load Data**: Load the dataset
2. **Data Understanding**: Examine the dataset
3. **Data Cleaning**: Clean the data and perform all necessary pre-processing
4. **Save Data**: Save the cleaned data for the next task

**Dataset**: `Task_1_TM_world_bank_projects_subset.json`

**Citation**: Jordan, Luke S. (2021). World Bank Project Documents [Dataset]. Hugging Face. Available at: https://huggingface.co/datasets/lukesjordan/worldbank-project-documents

**Note**: This analysis uses a modified subset of the original dataset. Any changes were made by the author of this notebook and are not endorsed by the original dataset creator or the World Bank.

**Done by: \<Enter your name and admin number here\>**

## 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import json
import warnings
warnings.filterwarnings('ignore')

# For text pre-processing
import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

print('Libraries imported successfully.')

## 2. Load Data

Load the World Bank project documents dataset from the JSON file. The dataset contains documents related to World Bank development projects from 1947 to 2020.

In [None]:
# Load the dataset from JSON
df = pd.read_json('Task_1_TM_world_bank_projects_subset.json')

print(f'Dataset loaded successfully.')
print(f'Shape: {df.shape}')
print(f'Number of records: {df.shape[0]}')
print(f'Number of features: {df.shape[1]}')

## 3. Data Understanding

In this section, we thoroughly examine the dataset to understand its structure, quality, and characteristics before proceeding with cleaning.

### 3.1 Basic Dataset Exploration

In [None]:
# Display first few rows
print('=== First 5 Rows ===')
df.head()

In [None]:
# Display last few rows
print('=== Last 5 Rows ===')
df.tail()

In [None]:
# Dataset info - shows data types, non-null counts, memory usage
print('=== Dataset Info ===')
df.info()

In [None]:
# Statistical summary for all columns (include='all' also covers the text columns)
print('=== Descriptive Statistics ===')
df.describe(include='all')

In [None]:
# Check column names and data types
print('=== Column Names and Data Types ===')
for col in df.columns:
    print(f'{col}: {df[col].dtype}')

### 3.2 Missing Values Analysis

Checking for missing or null values is crucial to ensure data quality. Missing text data could affect topic modelling results.

In [None]:
# Check for missing values
print('=== Missing Values ===')
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
print(missing_df)
print(f'\nTotal rows with any missing value: {df.isnull().any(axis=1).sum()}')

In [None]:
# Check for empty strings in text fields
print('=== Empty Strings ===')
for col in df.columns:
    if df[col].dtype == 'object':
        empty_count = (df[col].str.strip() == '').sum()
        print(f'{col}: {empty_count} empty strings')

### 3.3 Duplicate Analysis

Duplicate records can bias topic modelling results by overrepresenting certain themes.

In [None]:
# Check for duplicate rows
print(f'Number of exact duplicate rows: {df.duplicated().sum()}')
print(f'Number of duplicate project_ids: {df.duplicated(subset=["project_id"]).sum()}')
print(f'Number of unique project_ids: {df["project_id"].nunique()}')
print(f'Number of duplicate document_text: {df.duplicated(subset=["document_text"]).sum()}')

### 3.4 Document Type Distribution

Understanding the distribution of document types (APPROVAL vs REVIEW) helps us know if the dataset is balanced.
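
As a minimal sketch of what "balanced" means here, the class shares and a simple imbalance ratio can be computed from the labels alone. The label list below is hypothetical toy data standing in for `df['document_type']`, and the ratio is just one possible summary, not a standard the notebook relies on.

```python
from collections import Counter

# Hypothetical labels standing in for df['document_type']
doc_types = ["APPROVAL", "REVIEW", "APPROVAL", "APPROVAL"]

counts = Counter(doc_types)
total = sum(counts.values())

# Percentage share of each document type
shares = {label: 100 * n / total for label, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)
print(shares)
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A ratio close to 1 suggests both document types contribute similarly to the topic model; a large ratio suggests the topics may be dominated by one type.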

In [None]:
# Distribution of document types
print('=== Document Type Distribution ===')
print(df['document_type'].value_counts())
print(f'\nPercentage Distribution:')
print(df['document_type'].value_counts(normalize=True) * 100)

# Visualize document type distribution
fig, ax = plt.subplots(1, 1, figsize=(8, 5))
df['document_type'].value_counts().plot(kind='bar', color=['steelblue', 'coral'], ax=ax)
ax.set_title('Distribution of Document Types', fontsize=14)
ax.set_xlabel('Document Type')
ax.set_ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

### 3.5 Text Length Analysis

Analyzing the length of document texts helps identify potential outliers: very short texts may lack meaningful content, while extremely long texts may need special handling.
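
The idea can be sketched on a few toy strings, independent of the DataFrame. The documents and the "half the median" rule below are assumptions made purely for illustration; the notebook itself uses a fixed cutoff later on.

```python
import statistics

# Hypothetical documents standing in for df['document_text']
docs = [
    "Rural roads rehabilitation project appraisal",
    "N/A",
    "The project financed water supply and sanitation works in three provinces",
    "Implementation completion report for the education sector reform programme",
]

char_lengths = [len(d) for d in docs]
word_counts = [len(d.split()) for d in docs]  # whitespace-delimited words

median_words = statistics.median(word_counts)

# Flag documents far below the median; the factor 0.5 is an arbitrary illustration
short_idx = [i for i, n in enumerate(word_counts) if n < 0.5 * median_words]

print(word_counts, median_words, short_idx)
```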

In [None]:
# Analyze text lengths
df['text_length'] = df['document_text'].str.len()
df['word_count'] = df['document_text'].str.split().str.len()

print('=== Text Length Statistics (characters) ===')
print(df['text_length'].describe())
print(f'\n=== Word Count Statistics ===')
print(df['word_count'].describe())

In [None]:
# Visualize text length distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Character length distribution
axes[0].hist(df['text_length'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Distribution of Text Length (Characters)', fontsize=12)
axes[0].set_xlabel('Number of Characters')
axes[0].set_ylabel('Frequency')

# Word count distribution
axes[1].hist(df['word_count'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1].set_title('Distribution of Word Count', fontsize=12)
axes[1].set_xlabel('Number of Words')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
# Text length by document type
# One panel per document type actually present (avoids an IndexError if there are not exactly two)
doc_types = df['document_type'].dropna().unique()
fig, axes = plt.subplots(1, len(doc_types), figsize=(7 * len(doc_types), 5), squeeze=False)
palette = ['steelblue', 'coral']

for i, doc_type in enumerate(doc_types):
    ax = axes[0, i]
    subset = df[df['document_type'] == doc_type]
    ax.hist(subset['word_count'].dropna(), bins=40, color=palette[i % len(palette)],
            edgecolor='black', alpha=0.7)
    ax.set_title(f'Word Count Distribution - {doc_type}', fontsize=12)
    ax.set_xlabel('Number of Words')
    ax.set_ylabel('Frequency')

plt.tight_layout()
plt.show()

print('=== Word Count by Document Type ===')
print(df.groupby('document_type')['word_count'].describe())

### 3.6 Sample Document Inspection

Inspecting individual documents helps us identify text quality issues such as special characters, encoding problems, or irrelevant content.

In [None]:
# Inspect sample documents
print('=== Sample APPROVAL Document ===')
approval_docs = df[df['document_type'] == 'APPROVAL']
if len(approval_docs) > 0:
    sample = approval_docs.iloc[0]
    print(f'Project ID: {sample["project_id"]}')
    print(f'Text (first 500 chars): {sample["document_text"][:500]}...')

print('\n=== Sample REVIEW Document ===')
review_docs = df[df['document_type'] == 'REVIEW']
if len(review_docs) > 0:
    sample = review_docs.iloc[0]
    print(f'Project ID: {sample["project_id"]}')
    print(f'Text (first 500 chars): {sample["document_text"][:500]}...')

In [None]:
# Check for very short documents that may lack meaningful content
short_docs = df[df['word_count'] < 10]
print(f'Documents with fewer than 10 words: {len(short_docs)}')
if len(short_docs) > 0:
    print('\nSample short documents:')
    for _, row in short_docs.head().iterrows():
        print(f'  Project: {row["project_id"]}, Type: {row["document_type"]}, Text: "{row["document_text"]}"')

### 3.7 Key Findings from Data Understanding

**Summary of findings:**
- Dataset structure: The dataset contains three fields - project_id, document_text, and document_type
- Missing values: Null values and empty strings were counted per column; affected rows are handled in the cleaning phase
- Duplicates: Exact duplicate rows, repeated project_ids, and repeated document texts were counted; duplicates are removed during cleaning to avoid bias
- Text quality: Document texts may contain special characters, HTML tags, numbers, and other noise that needs cleaning
- Distribution: The counts and percentages of each document type (APPROVAL vs REVIEW) were examined to check class balance
- Text length: Text length varies significantly; very short documents may not contain meaningful content for topic modelling

## 4. Data Cleaning

Based on the findings from data understanding, we perform the following cleaning steps:
1. Remove duplicate records
2. Handle missing values
3. Remove very short documents that lack meaningful content
4. Clean text: convert to lowercase, then remove HTML tags, URLs, email addresses, numbers, special characters, and extra whitespace
5. Tokenize, remove stopwords, and lemmatize

**Rationale**: Each step is justified by findings from the data understanding phase. Clean, consistent text is essential for reliable topic modelling.

### 4.1 Remove Duplicates

**Rationale**: Duplicate documents would overrepresent certain topics and bias the topic model. We first remove exact duplicate rows, then rows that repeat the same document_text under a different project_id, keeping the first occurrence so that each unique document is represented once.
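
The "keep the first occurrence" behaviour can be sketched in plain Python. The records below are hypothetical and only mimic the shape of the dataset's rows.

```python
# Hypothetical records standing in for rows of the DataFrame
records = [
    {"project_id": "P001", "document_text": "Road upgrade appraisal"},
    {"project_id": "P002", "document_text": "Health sector review"},
    {"project_id": "P003", "document_text": "Road upgrade appraisal"},  # same text, new ID
]

seen_texts = set()
unique_records = []
for rec in records:
    text = rec["document_text"]
    if text not in seen_texts:  # keep only the first occurrence of each text
        seen_texts.add(text)
        unique_records.append(rec)

kept_ids = [rec["project_id"] for rec in unique_records]
print(kept_ids)
```

Note that the third record is dropped even though its project_id is unique, which is the intended effect of deduplicating on the text rather than on the ID.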

In [None]:
print(f'Shape before removing duplicates: {df.shape}')

# Remove exact duplicate rows
df_clean = df.drop_duplicates()
print(f'After removing exact duplicates: {df_clean.shape}')

# Remove duplicates based on document_text (same text, possibly different IDs)
df_clean = df_clean.drop_duplicates(subset=['document_text'], keep='first')
print(f'After removing duplicate texts: {df_clean.shape}')

### 4.2 Handle Missing Values

**Rationale**: Missing text data cannot be used for topic modelling. Missing document_type makes it impossible to categorize the document. We drop rows with missing critical fields rather than imputing, as imputation of text data would introduce noise.
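
A minimal sketch of the drop rule, assuming toy records in place of the real rows: a record is kept only when both critical fields are present and the text is not blank. The helper name `is_usable` is hypothetical.

```python
# Hypothetical records covering the cases the rule has to handle
records = [
    {"document_text": "Irrigation project appraisal", "document_type": "APPROVAL"},
    {"document_text": None, "document_type": "REVIEW"},             # missing text
    {"document_text": "   ", "document_type": "REVIEW"},            # whitespace only
    {"document_text": "Urban transport review", "document_type": None},  # missing type
]

def is_usable(rec):
    """True when both critical fields are present and the text is not blank."""
    text = rec.get("document_text")
    doc_type = rec.get("document_type")
    return isinstance(text, str) and text.strip() != "" and doc_type is not None

usable = [rec for rec in records if is_usable(rec)]
dropped = len(records) - len(usable)
print(f"Kept {len(usable)} record(s), dropped {dropped}")
```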

In [None]:
print(f'Missing values before cleaning:')
print(df_clean.isnull().sum())

# Drop rows with missing document_text or document_type
df_clean = df_clean.dropna(subset=['document_text', 'document_type'])

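# Also remove rows where document_text is an empty string.
# .copy() makes df_clean an independent DataFrame so that later column
# assignments do not trigger pandas' SettingWithCopyWarning.
df_clean = df_clean[df_clean['document_text'].str.strip() != ''].copy()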

print(f'\nShape after handling missing values: {df_clean.shape}')

### 4.3 Remove Very Short Documents

**Rationale**: Documents with very few words (fewer than 10) are unlikely to contain meaningful content for topic modelling. They may be metadata artifacts, headers, or incomplete entries that would introduce noise into the model.
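
Before fixing a cutoff, it can help to see how many documents nearby thresholds would remove. The sketch below uses hypothetical word counts and a hypothetical helper `count_dropped`; it is a sanity check, not part of the cleaning itself.

```python
# Hypothetical word counts standing in for df_clean['word_count']
word_counts = [3, 8, 9, 10, 12, 250, 1800]

def count_dropped(counts, min_words):
    """Number of documents that a `>= min_words` filter would remove."""
    return sum(1 for n in counts if n < min_words)

# Compare a few candidate thresholds around the chosen value of 10
for threshold in (5, 10, 20):
    print(f"min_words={threshold}: {count_dropped(word_counts, threshold)} dropped")
```

If the number of dropped documents changes sharply between nearby thresholds, the cutoff deserves a closer look at the affected texts.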

In [None]:
# Recalculate word count after cleaning
df_clean['word_count'] = df_clean['document_text'].str.split().str.len()

print(f'Shape before removing short documents: {df_clean.shape}')
df_clean = df_clean[df_clean['word_count'] >= 10].copy()  # copy so new columns can be added safely
print(f'Shape after removing documents with < 10 words: {df_clean.shape}')

### 4.4 Text Cleaning

**Rationale**: Raw text contains noise such as HTML tags, URLs, special characters, numbers, and extra whitespace. These elements do not contribute to understanding document topics and could confuse the topic model. We apply a systematic cleaning pipeline to standardize the text.
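
To make the pipeline concrete, the sketch below traces a single made-up string through the same kinds of substitutions, printing the intermediate result after each step. The input string is hypothetical, and the separate digit-removal step is folded into the non-letter step here because that pattern already removes digits.

```python
import re

# Hypothetical raw snippet; the patterns mirror the ones used in clean_text
raw = "<p>Loan No. 4521: see http://example.org or mail info@example.org</p>"

steps = [
    ("lowercase", lambda t: t.lower()),
    ("html tags", lambda t: re.sub(r"<[^>]+>", " ", t)),
    ("urls", lambda t: re.sub(r"http\S+|www\S+", " ", t)),
    ("emails", lambda t: re.sub(r"\S+@\S+", " ", t)),
    ("non-letters", lambda t: re.sub(r"[^a-z\s]", " ", t)),  # also removes digits
    ("whitespace", lambda t: re.sub(r"\s+", " ", t).strip()),
]

text = raw
for name, step in steps:
    text = step(text)
    print(f"{name:>12}: {text!r}")
```

Replacing matches with a space rather than an empty string keeps neighbouring words from being glued together; the final whitespace step then collapses the extra spaces.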

In [None]:
def clean_text(text):
    """Clean text by removing noise and standardizing format."""
    if not isinstance(text, str):
        return ''
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', ' ', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', ' ', text)
    
    # Remove numbers
    text = re.sub(r'\d+', ' ', text)
    
    # Remove special characters and punctuation, keep only letters and spaces
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply text cleaning
df_clean['cleaned_text'] = df_clean['document_text'].apply(clean_text)

# Show a sample before and after cleaning
print('=== Sample Text Before Cleaning ===')
print(df_clean['document_text'].iloc[0][:300])
print('\n=== Sample Text After Cleaning ===')
print(df_clean['cleaned_text'].iloc[0][:300])

### 4.5 Tokenization, Stopword Removal, and Lemmatization

**Rationale**: 
- **Tokenization** breaks text into individual words for analysis.
- **Stopword removal** eliminates common words (e.g., 'the', 'is', 'and') that don't carry topic-specific meaning, reducing noise.
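- **Lemmatization** reduces words to their dictionary form, grouping related words together for more coherent topics. `WordNetLemmatizer` treats every token as a noun unless a part-of-speech tag is passed, so here it mainly collapses plurals (e.g., 'countries' → 'country'); a verb form such as 'running' becomes 'run' only when `pos='v'` is supplied.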
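
One detail worth checking in a pipeline like this is the order of stopword filtering and normalization. The sketch below uses a deliberately crude stand-in for a lemmatizer (`toy_lemmatize`, which only strips a plural "s") and a tiny hypothetical stopword list; it is not WordNet and only illustrates the interaction.

```python
# Toy stand-in for a lemmatizer: strips a plural "s". This is NOT WordNet;
# it only illustrates how filtering order interacts with normalization.
def toy_lemmatize(token):
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

stop_words = {"the", "and", "project"}  # hypothetical, much smaller than NLTK's list
tokens = "the projects and the project improved roads".split()

# Order A: filter stopwords, then normalize
filtered_then_lemmatized = [toy_lemmatize(t) for t in tokens if t not in stop_words]

# Order B: normalize, then filter stopwords
lemmatized_then_filtered = [
    lemma for lemma in (toy_lemmatize(t) for t in tokens) if lemma not in stop_words
]

print(filtered_then_lemmatized)
print(lemmatized_then_filtered)
```

In order A the plural "projects" passes the filter and then turns into the stopword "project", so it survives; in order B it is removed. Filtering after normalization (or both before and after) avoids this leak.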

In [None]:
# Initialize tools
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Add domain-specific stopwords that appear frequently but don't carry topic meaning
custom_stopwords = {'project', 'bank', 'world', 'document', 'page', 'report', 
                    'would', 'also', 'may', 'one', 'two', 'three', 'could',
                    'million', 'percent', 'year', 'years'}
stop_words = stop_words.union(custom_stopwords)

def preprocess_text(text):
    """Tokenize, remove stopwords, and lemmatize text."""
    if not isinstance(text, str) or text.strip() == '':
        return ''
    
    # Tokenize
    tokens = word_tokenize(text)
    
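    # Remove stopwords and short tokens (length < 3)
    tokens = [t for t in tokens if t not in stop_words and len(t) >= 3]
    
    # Lemmatize, then filter again: plurals such as 'projects' or 'reports'
    # only match the custom stopword list after they are reduced to their singular form
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if t not in stop_words and len(t) >= 3]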
    
    return ' '.join(tokens)

# Apply preprocessing
print('Preprocessing text (this may take a few minutes)...')
df_clean['processed_text'] = df_clean['cleaned_text'].apply(preprocess_text)
print('Text preprocessing complete.')

# Show sample
print('\n=== Sample Processed Text ===')
print(df_clean['processed_text'].iloc[0][:300])

In [None]:
# Remove any rows where processed text is empty after cleaning
before_count = len(df_clean)
df_clean = df_clean[df_clean['processed_text'].str.strip() != '']
after_count = len(df_clean)
print(f'Removed {before_count - after_count} rows with empty processed text')
print(f'Final dataset shape: {df_clean.shape}')

### 4.6 Post-Cleaning Verification

Verify the quality of cleaned data to ensure cleaning was effective.
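
The same checks can also be expressed as a small function that returns a list of problems, which makes a failed check easy to spot. This is a sketch on hypothetical records; `find_problems` and the field list are illustrative names, not part of the notebook.

```python
def find_problems(records, required_fields):
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    for i, rec in enumerate(records):
        for field in required_fields:
            value = rec.get(field)
            if not isinstance(value, str) or value.strip() == "":
                problems.append(f"row {i}: missing or empty {field}")
    texts = [rec.get("document_text") for rec in records]
    if len(set(texts)) != len(texts):
        problems.append("duplicate document_text values found")
    return problems

REQUIRED = ("project_id", "document_text", "processed_text")

good = [
    {"project_id": "P001", "document_text": "Road upgrade appraisal",
     "processed_text": "road upgrade appraisal"},
    {"project_id": "P002", "document_text": "Health sector review",
     "processed_text": "health sector review"},
]
# Add a row that repeats an earlier text and has empty processed text
bad = good + [{"project_id": "P003", "document_text": "Road upgrade appraisal",
               "processed_text": ""}]

good_problems = find_problems(good, REQUIRED)
bad_problems = find_problems(bad, REQUIRED)
print(good_problems)
print(bad_problems)
```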

In [None]:
# Verify cleaning results
print('=== Post-Cleaning Summary ===')
print(f'Final number of records: {len(df_clean)}')
print(f'Missing values: {df_clean[["project_id", "document_text", "document_type", "processed_text"]].isnull().sum().sum()}')
print(f'Duplicate records: {df_clean.duplicated().sum()}')
print(f'\nDocument type distribution:')
print(df_clean['document_type'].value_counts())

# Word count statistics after cleaning
df_clean['processed_word_count'] = df_clean['processed_text'].str.split().str.len()
print(f'\nProcessed text word count statistics:')
print(df_clean['processed_word_count'].describe())

In [None]:
# Visualize word count distribution after cleaning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df_clean['word_count'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_title('Original Word Count Distribution (After Filtering)', fontsize=12)
axes[0].set_xlabel('Number of Words')
axes[0].set_ylabel('Frequency')

axes[1].hist(df_clean['processed_word_count'], bins=50, color='green', edgecolor='black', alpha=0.7)
axes[1].set_title('Processed Word Count Distribution', fontsize=12)
axes[1].set_xlabel('Number of Words')
axes[1].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

## 5. Save Cleaned Data

Save the cleaned dataset for use in Task 1b (Topic Modelling). We save the original, cleaned, and processed text so that the topic modelling task has flexibility in choosing the text representation.
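
With `orient='records'`, pandas writes a JSON array containing one object per row. The sketch below reproduces that layout with the standard `json` module and reads it back, as a quick way to see what the next task would load. The rows and the file name are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import json
import os
import tempfile

# Hypothetical rows in the "records" layout: one JSON object per document
rows = [
    {"project_id": "P001", "document_type": "APPROVAL", "processed_text": "road upgrade"},
    {"project_id": "P002", "document_type": "REVIEW", "processed_text": "health sector"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_example.json")  # illustrative file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

# The reloaded data should match what was written
round_trip_ok = reloaded == rows
print(f"Round trip OK: {round_trip_ok}, records: {len(reloaded)}")
```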

In [None]:
# Select columns to save
columns_to_save = ['project_id', 'document_text', 'document_type', 'cleaned_text', 'processed_text']
df_save = df_clean[columns_to_save].copy()

# Save to JSON
df_save.to_json('Task_1_cleaned_data.json', orient='records', indent=2)
print(f'Cleaned data saved to Task_1_cleaned_data.json')
print(f'Number of records saved: {len(df_save)}')
print(f'Columns saved: {list(df_save.columns)}')

## Summary

### Data Understanding Findings:
- The dataset contains World Bank project documents with three fields: project_id, document_text, and document_type
- Document types include APPROVAL (project launch documents) and REVIEW (end-of-project review documents)
- Text lengths vary significantly across documents
- Some documents may be very short and lack meaningful content

### Data Cleaning Steps Performed:
1. **Removed duplicates**: Eliminated exact duplicate rows and duplicate document texts to avoid topic bias
2. **Handled missing values**: Dropped rows with missing text or document type (imputation not suitable for text)
3. **Removed short documents**: Filtered out documents with fewer than 10 words as they lack meaningful content
4. **Text cleaning**: Converted to lowercase, removed HTML tags, URLs, emails, numbers, special characters, and extra whitespace
5. **Text preprocessing**: Tokenized, removed stopwords (including domain-specific ones), and lemmatized to standardize vocabulary
6. **Post-cleaning verification**: Confirmed data quality after all cleaning steps

The cleaned dataset is saved and ready for Topic Modelling in Task 1b.