# NYP IT2311 Assignment - Task 1a: Data Preparation

**Done by:** [Your Name] [Your Admin Number]

---

## Overview

This notebook performs **data preparation** on the World Bank Project Documents dataset. The goal is to load, explore, understand, and clean the data so it is ready for downstream NLP tasks such as sentiment classification or text analysis.

Data preparation is a critical first step in any data science pipeline. Poorly prepared data leads to unreliable models and misleading conclusions. In this notebook, we take a methodical approach:

1. **Load** the raw JSON dataset
2. **Understand** the structure, distributions, and quality of the data through exploration and visualization
3. **Clean** the data by handling missing values, duplicates, and noisy text
4. **Save** the cleaned dataset for use in subsequent tasks

Each step includes detailed rationale explaining *why* specific decisions were made.

---
## Import Libraries

We begin by importing the necessary Python libraries:

- **pandas** and **numpy**: Core data manipulation and numerical computing libraries. Pandas provides DataFrame structures ideal for tabular data, while NumPy supports efficient numerical operations.
- **matplotlib** and **seaborn**: Visualization libraries. Seaborn builds on matplotlib and provides a higher-level interface for statistical graphics, making it easier to produce informative plots.
- **json**: For handling JSON file operations beyond what pandas provides natively.
- **re**: Python's regular expression module, essential for text cleaning operations such as removing HTML tags, URLs, and special characters.
- **warnings**: We suppress warnings to keep the notebook output clean and focused on our analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
import re
import warnings
warnings.filterwarnings('ignore')

# Set visual style for all plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

print('All libraries imported successfully.')

---
## 1. Load Data

The dataset is stored in JSON format (`Task_1_TM_world_bank_projects_subset.json`). JSON (JavaScript Object Notation) is a lightweight, human-readable data interchange format commonly used for web APIs and document storage.

We use `pd.read_json()` to load it directly into a pandas DataFrame. This is the most straightforward approach for structured JSON files. If the JSON structure were deeply nested, we might need `json.load()` followed by `pd.json_normalize()`, but for this dataset the direct approach is appropriate.
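For a nested file, the fallback mentioned above could look like the following minimal sketch. The nested layout (a `document` object holding `text` and `type`) is invented for illustration and is not the structure of the assignment dataset. The sketch uses `json.loads()` on an inline string, where a real file would use `json.load()` on an open file handle.

```python
import json

import pandas as pd

# Hypothetical nested JSON, used only to illustrate the fallback path.
raw_json = """
[
  {"project_id": "P001", "document": {"text": "Loan approved.", "type": "APPROVAL"}},
  {"project_id": "P002", "document": {"text": "Mid-term review.", "type": "REVIEW"}}
]
"""

records = json.loads(raw_json)          # parse into a list of dicts
nested_df = pd.json_normalize(records)  # flatten nested keys into dotted column names

print(nested_df.columns.tolist())
```

Nested keys become dotted column names such as `document.text`, which can then be renamed to match the flat schema used in the rest of the notebook.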

After loading, we immediately inspect the data to confirm it loaded correctly and to get an initial sense of its structure.

In [None]:
# Load the JSON dataset
df = pd.read_json('Task_1_TM_world_bank_projects_subset.json')

print('Dataset loaded successfully.')
print(f'Shape: {df.shape[0]} rows x {df.shape[1]} columns')
print(f'Columns: {list(df.columns)}')

In [None]:
# Display the first few rows to get a visual overview of the data
df.head()

In [None]:
# Display detailed information about the DataFrame
# This tells us column names, non-null counts, and data types
df.info()

**Observation:** The dataset has been loaded into a DataFrame. We can see three columns as expected: `project_id`, `document_text`, and `document_type`. The `.info()` output reveals the data types and whether there are any null values at a glance. This initial inspection is crucial — it tells us whether the data loaded correctly and gives us a roadmap for the exploration ahead.

---
## 2. Data Understanding

Before any cleaning or transformation, it is essential to deeply understand the data. This phase answers key questions:

- **What does the data look like?** (types, shape, distributions)
- **What quality issues exist?** (missing values, duplicates, anomalies)
- **What patterns or insights can we discover?** (class balance, text characteristics)

A thorough understanding phase prevents us from making uninformed decisions during cleaning and ensures we preserve important information while removing genuine noise. It also helps us anticipate challenges for downstream tasks.

### 2.1 Basic Statistics and Data Types

In [None]:
# Check data types for each column
print('=== Data Types ===')
print(df.dtypes)
print()

# Descriptive statistics for all columns
print('=== Descriptive Statistics ===')
df.describe(include='all')

**Reflection:** The `describe(include='all')` output gives us summary statistics for all columns regardless of type. For text columns, it shows count, unique values, top (most frequent) value, and frequency. This is our first quantitative look at the data's characteristics. Notably, we can see how many unique project IDs and document types exist.

### 2.2 Missing Values Analysis

Missing values are one of the most common data quality issues. They can arise from data collection errors, system failures, or intentional omissions. Understanding the pattern of missingness is crucial because:

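- **Missing Completely at Random (MCAR):** Missingness is unrelated to any variable, so dropping the affected rows reduces sample size but does not introduce bias
- **Missing at Random (MAR):** Missingness depends on other observed variables (for example, `document_type`), so dropping rows can bias results unless those variables are taken into account
- **Missing Not at Random (MNAR):** Missingness depends on the missing value itself, which is the hardest case to detect and correct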

We visualize missing values using a heatmap, which makes it easy to spot patterns of missingness across columns.

In [None]:
# Check for missing values in each column
missing_counts = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing Percentage (%)': missing_pct.round(2)
})
print('=== Missing Values Summary ===')
print(missing_df)
print(f'\nTotal missing values: {df.isnull().sum().sum()}')

In [None]:
# Visualize missing values with a heatmap
fig, ax = plt.subplots(figsize=(8, 4))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis', ax=ax)
ax.set_title('Missing Values Heatmap', fontsize=14, fontweight='bold')
ax.set_xlabel('Columns')
ax.set_ylabel('Rows')
plt.tight_layout()
plt.show()

print('Yellow regions indicate missing values. A completely dark heatmap means no missing data.')

**Insight:** The heatmap provides a visual summary of data completeness. If we see yellow streaks or blocks, that would indicate systematic missing data patterns worth investigating further. A uniformly dark heatmap is the ideal scenario, indicating no missing values.
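If the heatmap does show gaps, one way to investigate further is to check whether missingness in one column varies with an observed column. The sketch below uses a small invented frame (`toy`), not the real dataset. A large difference between groups would suggest the data is not missing completely at random.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    'document_type': ['APPROVAL', 'APPROVAL', 'REVIEW', 'REVIEW'],
    'document_text': ['text a', None, 'text b', 'text c'],
})

# Share of rows with missing text, computed separately for each document type.
missing_rate_by_type = (
    toy['document_text'].isnull().groupby(toy['document_type']).mean()
)
print(missing_rate_by_type)
```

The same expression should work on `df` directly, assuming the column names used elsewhere in this notebook.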

### 2.3 Duplicate Records Analysis

Duplicates can inflate our dataset artificially and bias any model trained on it. In NLP tasks, duplicate texts are especially problematic because they can cause data leakage between training and test sets. We check for:

1. **Exact duplicates** — rows where every column matches
2. **Duplicate document texts** — same text appearing with different project IDs or types

In [None]:
# Check for exact duplicate rows
exact_dupes = df.duplicated().sum()
print(f'Exact duplicate rows: {exact_dupes}')
print(f'Percentage of duplicates: {(exact_dupes / len(df) * 100):.2f}%')

# Check for duplicate document_text values
text_dupes = df['document_text'].duplicated().sum()
print(f'\nDuplicate document_text values: {text_dupes}')

# Check for duplicate project_id values
id_dupes = df['project_id'].duplicated().sum()
print(f'Duplicate project_id values: {id_dupes}')

if exact_dupes > 0:
    print(f'\nSample duplicate rows:')
    display(df[df.duplicated(keep=False)].head(10))

**Reflection:** Understanding the nature of duplicates is important. A project may legitimately have multiple documents (e.g., both an APPROVAL and a REVIEW document), so duplicate `project_id` values are expected. However, exact duplicate rows (identical across all columns) are likely data entry errors and should be removed.
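The second kind of duplicate listed above (the same text under different IDs or types) can be separated from exact duplicates with a grouped check. This is a minimal sketch on invented rows. The names `toy` and `conflicting_texts` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy rows: rows 0 and 1 are exact duplicates; row 2 reuses the text with another label.
toy = pd.DataFrame({
    'project_id': ['P1', 'P1', 'P2', 'P3'],
    'document_text': ['same text', 'same text', 'same text', 'other text'],
    'document_type': ['APPROVAL', 'APPROVAL', 'REVIEW', 'REVIEW'],
})

# Exact duplicates: every column matches an earlier row.
exact_duplicates = int(toy.duplicated().sum())

# Texts that occur more than once, then the number of distinct labels per text.
repeated = toy[toy.duplicated(subset='document_text', keep=False)]
labels_per_text = repeated.groupby('document_text')['document_type'].nunique()

# Texts carrying more than one label are the ambiguous cases worth reviewing by hand.
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()
print(exact_duplicates, conflicting_texts)
```

Texts that appear under conflicting labels are arguably more harmful than exact duplicates, because they give a classifier contradictory targets.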

### 2.4 Document Type Distribution

Since `document_type` is our categorical variable (with values "APPROVAL" and "REVIEW"), understanding its distribution is critical. Class imbalance can significantly affect downstream classification tasks — a model trained on an imbalanced dataset may develop a bias toward the majority class.

In [None]:
# Examine unique values and their counts
print('=== Document Type Value Counts ===')
type_counts = df['document_type'].value_counts()
print(type_counts)
print(f'\nNumber of unique document types: {df["document_type"].nunique()}')
print(f'\nProportions:')
# Multiply before rounding so the percentages print without floating-point noise
print((df['document_type'].value_counts(normalize=True) * 100).round(2))

In [None]:
# Pie and bar charts showing document type distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pie chart
colors = ['#2ecc71', '#e74c3c']
type_counts.plot.pie(
    ax=axes[0],
    autopct='%1.1f%%',
    colors=colors,
    startangle=90,
    explode=[0.03] * len(type_counts),
    shadow=True
)
axes[0].set_title('Document Type Distribution (Pie)', fontsize=14, fontweight='bold')
axes[0].set_ylabel('')

# Bar chart
type_counts.plot.bar(ax=axes[1], color=colors, edgecolor='black')
axes[1].set_title('Document Type Distribution (Bar)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Document Type')
axes[1].set_ylabel('Count')
axes[1].set_xticklabels(type_counts.index, rotation=0)

# Annotate bars with counts
for i, val in enumerate(type_counts.values):
    axes[1].text(i, val + 0.5, str(val), ha='center', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

**Discovery:** The pie and bar charts reveal the balance (or imbalance) between APPROVAL and REVIEW documents. If one class significantly outnumbers the other, we may need to consider techniques like oversampling, undersampling, or class weights during modeling. Even in data preparation, this awareness guides our cleaning decisions — we should be cautious not to disproportionately remove records from the minority class.
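As a rough illustration of the class-weight option mentioned above, the sketch below computes weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The label counts are invented, and whether weighting is needed at all depends on the actual distribution observed in this dataset.

```python
import pandas as pd

# Invented label distribution: 6 APPROVAL vs 2 REVIEW documents.
labels = pd.Series(['APPROVAL'] * 6 + ['REVIEW'] * 2)

counts = labels.value_counts()

# Balanced heuristic: rarer classes receive proportionally larger weights.
class_weights = len(labels) / (counts.size * counts)
print(class_weights.round(3))
```

With these toy counts the minority class gets a weight three times that of the majority class, which offsets the 3:1 imbalance.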

### 2.5 Text Length and Word Count Analysis

Analyzing the length distribution of `document_text` provides valuable insights:

- **Very short texts** may lack meaningful content (e.g., headers or placeholders)
- **Very long texts** may include boilerplate or repeated sections
- **Distribution shape** tells us whether most documents are similar in length or if there's wide variation

We examine both character length and word count, as they capture different aspects of text complexity.

In [None]:
# Calculate text length (characters) and word count for each document
df['text_length'] = df['document_text'].astype(str).apply(len)
df['word_count'] = df['document_text'].astype(str).apply(lambda x: len(x.split()))

print('=== Text Length Statistics (Characters) ===')
print(df['text_length'].describe().round(2))
print()
print('=== Word Count Statistics ===')
print(df['word_count'].describe().round(2))

In [None]:
# Visualize text length and word count distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Text length distribution
axes[0].hist(df['text_length'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(df['text_length'].mean(), color='red', linestyle='--', label=f"Mean: {df['text_length'].mean():.0f}")
axes[0].axvline(df['text_length'].median(), color='orange', linestyle='--', label=f"Median: {df['text_length'].median():.0f}")
axes[0].set_title('Distribution of Text Length (Characters)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Text Length (characters)')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Word count distribution
axes[1].hist(df['word_count'], bins=50, color='coral', edgecolor='black', alpha=0.7)
axes[1].axvline(df['word_count'].mean(), color='red', linestyle='--', label=f"Mean: {df['word_count'].mean():.0f}")
axes[1].axvline(df['word_count'].median(), color='orange', linestyle='--', label=f"Median: {df['word_count'].median():.0f}")
axes[1].set_title('Distribution of Word Count', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()

**Insight:** The histograms reveal whether the text lengths follow a normal, skewed, or multimodal distribution. A right-skewed distribution (long tail to the right) would indicate that most documents are relatively short, with a few very long outliers. The mean and median lines help us assess skewness — a large gap between them indicates significant skew.
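The mean-versus-median comparison can be backed by a number. The following sketch, on invented lengths, uses pandas' built-in sample skewness. Positive values indicate a right tail.

```python
import pandas as pd

# Invented character counts with one very long document.
lengths = pd.Series([100, 120, 130, 150, 2000])

mean_len = lengths.mean()      # pulled upward by the long document
median_len = lengths.median()  # robust to the long document
skewness = lengths.skew()      # > 0 means a longer right tail

print(f'mean={mean_len:.0f}, median={median_len:.0f}, skew={skewness:.2f}')
```

On the real data the same three calls could be applied to `df['text_length']` and `df['word_count']`.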

### 2.6 Text Length Comparison: APPROVAL vs REVIEW

A key question is whether APPROVAL and REVIEW documents have systematically different text characteristics. If one type tends to be longer or shorter, this could be an informative feature for classification. It also affects our cleaning thresholds — we need to ensure our cleaning doesn't inadvertently bias toward one document type.

In [None]:
# Compare text length by document type
print('=== Text Length by Document Type ===')
print(df.groupby('document_type')['text_length'].describe().round(2))
print()
print('=== Word Count by Document Type ===')
print(df.groupby('document_type')['word_count'].describe().round(2))

In [None]:
# Boxplot comparing text length across document types
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Text length boxplot
sns.boxplot(x='document_type', y='text_length', data=df, ax=axes[0], palette='Set2')
axes[0].set_title('Text Length by Document Type', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Document Type')
axes[0].set_ylabel('Text Length (characters)')

# Word count boxplot
sns.boxplot(x='document_type', y='word_count', data=df, ax=axes[1], palette='Set2')
axes[1].set_title('Word Count by Document Type', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Document Type')
axes[1].set_ylabel('Word Count')

plt.tight_layout()
plt.show()

**Discovery:** The boxplots allow us to compare the central tendency, spread, and outliers of text lengths between document types. Key observations to look for:
- **Median differences:** Do APPROVAL and REVIEW documents tend to differ in length?
- **Spread (IQR):** Is one type more variable in length than the other?
- **Outliers:** Are there extreme values that might need special handling?

These differences, if they exist, could serve as useful features for classification and also inform our cleaning thresholds.
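The points a boxplot draws as outliers follow the usual 1.5 × IQR rule, and they can be counted per document type. This is a sketch on invented lengths. `count_iqr_outliers` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
import pandas as pd

# Invented lengths: one APPROVAL document is far longer than the rest.
toy = pd.DataFrame({
    'document_type': ['APPROVAL'] * 5 + ['REVIEW'] * 5,
    'text_length': [100, 110, 120, 130, 900, 200, 210, 220, 230, 240],
})


def count_iqr_outliers(lengths):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (hypothetical helper)."""
    q1, q3 = lengths.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)
    return int(outside.sum())


outliers_by_type = toy.groupby('document_type')['text_length'].apply(count_iqr_outliers)
print(outliers_by_type)
```

Counting outliers per type makes it easier to see whether a length-based filter would fall more heavily on one class.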

### 2.7 Check for Empty or Whitespace-Only Texts

Empty or whitespace-only texts provide no value for analysis and should be identified. These can arise from data extraction errors where the document metadata was captured but the actual text content was not.

In [None]:
# Check for empty strings
empty_texts = df[df['document_text'].astype(str).str.strip() == ''].shape[0]
print(f'Empty or whitespace-only texts: {empty_texts}')

# Check for very short texts (less than 50 characters)
very_short = df[df['text_length'] < 50].shape[0]
print(f'Very short texts (< 50 chars): {very_short}')

# Check for NaN converted to string "nan"
nan_strings = df[df['document_text'].astype(str).str.lower() == 'nan'].shape[0]
print(f'Texts that are literal "nan": {nan_strings}')

if very_short > 0:
    print(f'\n--- Sample very short texts ---')
    display(df[df['text_length'] < 50][['project_id', 'document_type', 'document_text', 'text_length']].head(10))

### 2.8 Sample Texts from Each Document Type

Reading actual sample texts gives us qualitative insight that statistics alone cannot provide. By examining real examples, we can identify:
- Common vocabulary and phrasing patterns
- Presence of HTML tags, URLs, or other artifacts
- Structural differences between document types
- Potential noise that needs to be cleaned

In [None]:
# Display sample texts from each document type
for doc_type in df['document_type'].unique():
    print(f'\n{"=" * 80}')
    print(f'Sample {doc_type} document:')
    print(f'{"=" * 80}')
    sample_text = df[df['document_type'] == doc_type]['document_text'].iloc[0]
    # Show first 500 characters to keep output manageable
    print(str(sample_text)[:500])
    print('...' if len(str(sample_text)) > 500 else '')

**Reflection on Data Understanding Phase:**

Through this comprehensive exploration, we have built a solid understanding of our dataset:
- We know the data types, shape, and structure
- We have identified the presence (or absence) of missing values and duplicates
- We understand the class distribution of document types
- We have characterized the text length distributions and compared them across document types
- We have examined actual text samples to identify potential noise and artifacts

This understanding now informs our cleaning strategy in the next section. Every cleaning decision will be grounded in what we discovered here.

---
## 3. Data Cleaning

Data cleaning transforms raw, noisy data into a form suitable for analysis. Our cleaning strategy is guided by the findings from Section 2. The key principle is: **clean aggressively enough to remove genuine noise, but conservatively enough to preserve meaningful information.**

Our cleaning pipeline consists of:
1. Handle missing values
2. Remove duplicate records
3. Clean text content (HTML, URLs, special characters, whitespace)
4. Remove very short documents
5. Validate results

### 3.1 Handle Missing Values

Missing values need to be addressed first because they can cause errors in subsequent cleaning steps. Our strategy depends on the column:
- **document_text:** Rows with missing text have no value for NLP tasks — we drop them
- **document_type:** This is our label; rows without it cannot be used for classification — we drop them
- **project_id:** Missing IDs are less critical but still dropped for data integrity

In [None]:
# Store original shape for comparison
original_shape = df.shape
print(f'Original dataset shape: {original_shape}')

# Drop rows that are missing a value in any of the three essential columns
df_cleaned = df.dropna(subset=['document_text', 'document_type', 'project_id']).copy()

rows_dropped_missing = original_shape[0] - df_cleaned.shape[0]
print(f'Rows dropped due to missing values: {rows_dropped_missing}')
print(f'Shape after handling missing values: {df_cleaned.shape}')

**Rationale:** We chose to drop rows with missing values rather than impute them because:
- Text data cannot be meaningfully imputed — generating synthetic document text would introduce artificial data
- Document type labels must be accurate for classification; guessing would introduce noise
- The cost of dropping a few rows is minimal compared to the risk of introducing bad data

### 3.2 Remove Duplicate Records

Duplicate records artificially inflate dataset size and can cause data leakage in train/test splits. We remove exact duplicates (all columns identical) to ensure each record is unique.

In [None]:
# Remove exact duplicate rows
before_dedup = df_cleaned.shape[0]
df_cleaned = df_cleaned.drop_duplicates().reset_index(drop=True)
after_dedup = df_cleaned.shape[0]

rows_dropped_dupes = before_dedup - after_dedup
print(f'Rows dropped due to duplicates: {rows_dropped_dupes}')
print(f'Shape after deduplication: {df_cleaned.shape}')

**Rationale:** We use `drop_duplicates()`, which by default compares all columns and keeps the first occurrence of each duplicated row. This is the safest approach because it only removes rows that are identical in every respect. The `reset_index(drop=True)` call restores a clean, sequential index after removal.

### 3.3 Clean Document Text

Text cleaning is the most critical step for NLP tasks. Raw text often contains artifacts that add noise without contributing meaning. Our cleaning pipeline addresses each type of noise systematically:

1. **HTML tags** (e.g., `<p>`, `<br>`, `<div>`): These are markup artifacts from web scraping, not part of the actual document content
2. **URLs** (e.g., `http://...`, `www...`): Web addresses don't contribute to document meaning in this context
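3. **Special characters**: Anything other than ASCII letters, digits, whitespace, and basic punctuation is replaced with a space, since such characters can interfere with tokenization. This also strips accented letters (e.g., "é"), a trade-off to keep in mind for non-English names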
4. **Case normalization**: Converting to lowercase ensures "Project" and "project" are treated as the same word
5. **Extra whitespace**: Multiple spaces, tabs, and newlines are collapsed for consistency

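Each transformation is applied sequentially, and the order matters. HTML tags and URLs are removed first because the special-character step would otherwise strip `<`, `>`, and `/` and leave tag names and URL fragments behind as ordinary-looking words.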
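The effect of the ordering can be seen on a single string. The sketch below applies the HTML and special-character patterns in both orders. The sample text and helper name are invented for illustration.

```python
import re

HTML_TAG = re.compile(r'<[^>]+>')
SPECIAL = re.compile(r'[^a-zA-Z0-9\s.,;:!?\'\"()-]')
WHITESPACE = re.compile(r'\s+')


def squeeze(text):
    """Collapse runs of whitespace and trim the ends (hypothetical helper)."""
    return WHITESPACE.sub(' ', text).strip()


sample = '<p>Loan approved</p>'

# Intended order: tags are removed while they are still recognisable as tags.
html_first = squeeze(SPECIAL.sub(' ', HTML_TAG.sub(' ', sample)))

# Reversed order: the brackets are stripped first, so the tag name "p" survives as a word.
special_first = squeeze(HTML_TAG.sub(' ', SPECIAL.sub(' ', sample)))

print(repr(html_first))
print(repr(special_first))
```

In the reversed order the stray `p` tokens would end up in the vocabulary of any downstream model.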

In [None]:
def clean_text(text):
    """
    Comprehensive text cleaning pipeline.
    Applies a sequence of regex-based transformations to remove noise from text.
    """
    if pd.isna(text) or not isinstance(text, str):
        return ''
    
    # Step 1: Remove HTML tags
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # Step 2: Remove URLs (require "://" so ordinary words starting with "http" are kept)
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    
    # Step 3: Remove special characters (keep letters, numbers, basic punctuation)
    text = re.sub(r'[^a-zA-Z0-9\s.,;:!?\'\"()-]', ' ', text)
    
    # Step 4: Convert to lowercase
    text = text.lower()
    
    # Step 5: Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

print('Text cleaning function defined successfully.')
print('Pipeline: HTML removal -> URL removal -> Special char removal -> Lowercase -> Whitespace normalization')

In [None]:
# Save original text for before/after comparison
df_cleaned['original_text'] = df_cleaned['document_text'].copy()

# Apply the cleaning function to all document texts
df_cleaned['document_text'] = df_cleaned['document_text'].astype(str).apply(clean_text)

print('Text cleaning applied to all documents.')
print(f'Dataset shape: {df_cleaned.shape}')

### 3.4 Before/After Comparison

It is essential to visually verify that our cleaning pipeline is working as intended. By comparing original and cleaned text side by side, we can confirm that:
- Noise has been removed
- Meaningful content has been preserved
- No unintended modifications occurred
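Alongside reading samples by eye, the first check can be approximated programmatically by counting texts that still match an artifact pattern. This is a sketch on invented strings. The patterns are simplified assumptions and will not catch every artifact.

```python
import pandas as pd

# Invented "cleaned" texts; the last two deliberately still contain artifacts.
cleaned = pd.Series([
    'loan approved for the project.',
    'see <b>annex',
    'visit http://example.org',
])

# One boolean column per residual-artifact pattern.
residual = pd.DataFrame({
    'html_tag': cleaned.str.contains(r'<[^>]+>', regex=True),
    'url': cleaned.str.contains(r'https?://|www\.', regex=True),
    'uppercase': cleaned.str.contains(r'[A-Z]', regex=True),
})

residual_counts = residual.sum()
print(residual_counts)
```

After a successful cleaning pass all counts would be expected to be zero. Non-zero counts point to documents worth inspecting manually.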

In [None]:
# Show before/after comparison for sample texts
print('=== Before vs After Cleaning Comparison ===')
for i in range(min(3, len(df_cleaned))):
    print(f'\n--- Document {i + 1} (Type: {df_cleaned.iloc[i]["document_type"]}) ---')
    print(f'BEFORE (first 300 chars):')
    print(str(df_cleaned.iloc[i]['original_text'])[:300])
    print(f'\nAFTER (first 300 chars):')
    print(str(df_cleaned.iloc[i]['document_text'])[:300])
    print()

### 3.5 Remove Very Short Documents

Documents that are extremely short after cleaning likely contain no meaningful content for analysis. These could be:
- Documents that were mostly HTML/URLs and had little actual text
- Placeholder or error documents
- Metadata fragments

We set a minimum threshold of **50 characters** after cleaning. This threshold is chosen because:
- At roughly five characters per English word plus a space, 50 characters is about 8 words, which is around the length of one short sentence
- It removes genuinely empty/useless records without being overly aggressive
- It preserves the vast majority of legitimate documents

In [None]:
# Recalculate text length after cleaning
df_cleaned['cleaned_text_length'] = df_cleaned['document_text'].apply(len)
df_cleaned['cleaned_word_count'] = df_cleaned['document_text'].apply(lambda x: len(x.split()))

# Identify very short documents
min_length_threshold = 50
short_docs = df_cleaned[df_cleaned['cleaned_text_length'] < min_length_threshold]
print(f'Documents shorter than {min_length_threshold} characters after cleaning: {len(short_docs)}')

if len(short_docs) > 0:
    print(f'\nShort document type distribution:')
    print(short_docs['document_type'].value_counts())
    print(f'\nSample short documents:')
    display(short_docs[['project_id', 'document_type', 'document_text', 'cleaned_text_length']].head(5))

# Remove very short documents
before_short_removal = df_cleaned.shape[0]
df_cleaned = df_cleaned[df_cleaned['cleaned_text_length'] >= min_length_threshold].reset_index(drop=True)
after_short_removal = df_cleaned.shape[0]

print(f'\nRows dropped (too short): {before_short_removal - after_short_removal}')
print(f'Shape after removing short documents: {df_cleaned.shape}')

**Rationale for 50-character threshold:** We chose 50 characters as the minimum because it strikes a balance between removing genuinely meaningless records and preserving data. A shorter threshold (e.g., 10 characters) would leave noise in the dataset, while a longer one (e.g., 200 characters) might remove legitimate short but meaningful documents. The threshold was informed by Sections 2.5 and 2.7, where we examined the text length distribution and counted the texts shorter than 50 characters.
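The trade-off between thresholds can be made concrete by counting how many rows each candidate value would remove. The lengths below are invented. On the real data the same comprehension could be run against `cleaned_text_length` before the filter is applied.

```python
import pandas as pd

# Invented post-cleaning lengths for five documents.
toy = pd.DataFrame({
    'document_type': ['APPROVAL', 'REVIEW', 'APPROVAL', 'REVIEW', 'APPROVAL'],
    'cleaned_text_length': [5, 30, 60, 150, 400],
})

candidate_thresholds = [10, 50, 200]

# Number of rows that would be dropped at each candidate threshold.
rows_dropped = {
    threshold: int((toy['cleaned_text_length'] < threshold).sum())
    for threshold in candidate_thresholds
}
print(rows_dropped)
```

A sharp jump in dropped rows between two candidates suggests the larger one is starting to remove ordinary documents rather than noise.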

### 3.6 Validate Cleaning Results

After all cleaning steps, we perform a comprehensive validation to ensure data quality. This includes checking that:
- No missing values remain
- No duplicates remain
- Text lengths are reasonable
- Class distribution hasn't been severely distorted
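The first three checks can also be expressed as assertions, so that a failure stops the notebook instead of relying on someone reading the printed output. This is a sketch under assumptions. `validate_cleaned` is a hypothetical helper and the sample rows are invented.

```python
import pandas as pd

ESSENTIAL_COLUMNS = ['project_id', 'document_text', 'document_type']


def validate_cleaned(frame, min_length=50):
    """Raise AssertionError if a basic quality check fails (hypothetical helper)."""
    core = frame[ESSENTIAL_COLUMNS]
    assert not core.isnull().any().any(), 'missing values remain'
    assert not core.duplicated().any(), 'duplicate rows remain'
    assert (core['document_text'].str.len() >= min_length).all(), 'short texts remain'
    return True


# Invented rows that satisfy all three checks.
toy = pd.DataFrame({
    'project_id': ['P1', 'P2'],
    'document_text': [
        'the board approved the loan for the rural roads project in full.',
        'the review found that most project targets were met on schedule.',
    ],
    'document_type': ['APPROVAL', 'REVIEW'],
})

validation_passed = validate_cleaned(toy)
print(validation_passed)
```

The class-distribution check is left out on purpose, since what counts as "severely distorted" is a judgment call and not a hard rule.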

In [None]:
print('=== Post-Cleaning Validation ===')
print(f'Final dataset shape: {df_cleaned.shape}')
print(f'\nMissing values:')
print(df_cleaned[['project_id', 'document_text', 'document_type']].isnull().sum())
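# Check duplicates on the essential columns only; helper columns such as original_text can hide them
print(f'\nDuplicate rows: {df_cleaned.duplicated(subset=["project_id", "document_text", "document_type"]).sum()}')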
print(f'\nDocument type distribution after cleaning:')
print(df_cleaned['document_type'].value_counts())
print(f'\nText length statistics after cleaning:')
print(df_cleaned['cleaned_text_length'].describe().round(2))

In [None]:
# Visualize distribution changes after cleaning
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Cleaned text length distribution
axes[0].hist(df_cleaned['cleaned_text_length'], bins=50, color='seagreen', edgecolor='black', alpha=0.7)
axes[0].axvline(df_cleaned['cleaned_text_length'].mean(), color='red', linestyle='--',
                label=f"Mean: {df_cleaned['cleaned_text_length'].mean():.0f}")
axes[0].axvline(df_cleaned['cleaned_text_length'].median(), color='orange', linestyle='--',
                label=f"Median: {df_cleaned['cleaned_text_length'].median():.0f}")
axes[0].set_title('Text Length Distribution (After Cleaning)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Text Length (characters)')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Cleaned word count distribution
axes[1].hist(df_cleaned['cleaned_word_count'], bins=50, color='mediumpurple', edgecolor='black', alpha=0.7)
axes[1].axvline(df_cleaned['cleaned_word_count'].mean(), color='red', linestyle='--',
                label=f"Mean: {df_cleaned['cleaned_word_count'].mean():.0f}")
axes[1].axvline(df_cleaned['cleaned_word_count'].median(), color='orange', linestyle='--',
                label=f"Median: {df_cleaned['cleaned_word_count'].median():.0f}")
axes[1].set_title('Word Count Distribution (After Cleaning)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# Summary of all cleaning operations
print('=' * 60)
print('DATA CLEANING SUMMARY')
print('=' * 60)
print(f'Original records:          {original_shape[0]}')
print(f'Dropped (missing values):  {rows_dropped_missing}')
print(f'Dropped (duplicates):      {rows_dropped_dupes}')
print(f'Dropped (too short):       {before_short_removal - after_short_removal}')
print(f'Final records:             {df_cleaned.shape[0]}')
print(f'Total records removed:     {original_shape[0] - df_cleaned.shape[0]}')
print(f'Retention rate:            {(df_cleaned.shape[0] / original_shape[0] * 100):.1f}%')
print('=' * 60)

**Reflection on Data Cleaning:**

Our cleaning pipeline was designed to be thorough yet conservative. The key decisions and their rationale:

1. **Dropping missing values** rather than imputing — text data cannot be meaningfully imputed
2. **Removing exact duplicates** — prevents model bias and data leakage
3. **HTML and URL removal** — these are web artifacts, not meaningful document content
4. **Special character removal** — reduces vocabulary noise for NLP while preserving standard punctuation
5. **Lowercase conversion** — normalizes text to reduce vocabulary size without losing meaning
6. **Minimum length threshold** — removes documents too short to carry meaningful information

The cleaning summary above reports how many records each step removed and the overall retention rate; a high retention rate indicates that quality improved without discarding much of the original data.

---
## 4. Save Cleaned Data

We save the cleaned dataset in two formats:
- **CSV**: Universal, human-readable, easy to import into any tool
- **JSON**: Keeps the record structure and basic types (strings, numbers) and is convenient for web-based tools and APIs

We only save the essential columns (`project_id`, `document_text`, `document_type`) to keep the output files clean and focused. The helper columns (`text_length`, `word_count`, `original_text`, etc.) were used for analysis only.

In [None]:
# Select only the essential columns for saving
columns_to_save = ['project_id', 'document_text', 'document_type']
df_final = df_cleaned[columns_to_save].copy()

# Save to CSV
csv_path = 'Task_1_cleaned_data.csv'
df_final.to_csv(csv_path, index=False)
print(f'Cleaned data saved to CSV: {csv_path}')

# Save to JSON
json_path = 'Task_1_cleaned_data.json'
df_final.to_json(json_path, orient='records', indent=2)
print(f'Cleaned data saved to JSON: {json_path}')

# Confirm save
print(f'\nFinal dataset shape: {df_final.shape}')
print(f'Columns saved: {list(df_final.columns)}')
print(f'\nFirst 3 rows of saved data:')
df_final.head(3)

**Verification:** The cleaned data has been saved successfully. The CSV format ensures compatibility with a wide range of tools (Excel, R, SQL), while the JSON format preserves the original data structure. Both files contain only the essential columns and can be loaded directly for Task 1b and beyond.
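A stronger form of verification is a round trip: write the data, read it back, and compare it with what was saved. The sketch below does this in a temporary directory with invented rows. The file names are placeholders and not the notebook's output paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    'project_id': ['P001', 'P002'],
    'document_text': ['loan approved, subject to review.', 'targets were met.'],
    'document_type': ['APPROVAL', 'REVIEW'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_csv = Path(tmp_dir) / 'toy_cleaned.csv'
    tmp_json = Path(tmp_dir) / 'toy_cleaned.json'

    # Write with the same options used for the real files.
    toy.to_csv(tmp_csv, index=False)
    toy.to_json(tmp_json, orient='records', indent=2)

    # Read both files back while the temporary directory still exists.
    from_csv = pd.read_csv(tmp_csv)
    from_json = pd.read_json(tmp_json, orient='records')

# Compare cell values row by row, ignoring dtype differences.
csv_matches = from_csv.values.tolist() == toy.values.tolist()
json_matches = from_json.values.tolist() == toy.values.tolist()
print(csv_matches, json_matches)
```

Comparing values and not dtypes keeps the check tolerant of the type inference that `read_csv` and `read_json` perform on load.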

---
## Citation

The dataset used in this notebook is sourced from:

> Jordan, Luke S. (2021). *World Bank Project Documents* [Dataset]. Hugging Face. 

This dataset contains a subset of World Bank project documents categorized by document type (APPROVAL and REVIEW), and is used here for educational purposes as part of the NYP IT2311 assignment.

---

*End of Task 1a: Data Preparation*