# 🧹 CCUS Projects Data Cleaning Notebook

This notebook supports the development of the **CCUS Projects Dashboard** by preparing a clean and analysis-ready dataset from the original IEA CCUS Projects Database.

## 📌 Objectives

- Load and inspect the raw dataset
- Identify and handle missing or inconsistent values
- Standardize column names and data formats
- Parse key fields (e.g., dates, MtCO₂ capacity)
- Filter or recode categorical fields for dashboard use
- Export a cleaned `.csv` for Tableau visualization
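
As a rough illustration of the standardizing and parsing objectives above, here is a minimal sketch on toy data. The column names, values, and the `standardize_columns` helper are hypothetical and are not taken from the IEA file; the real database may need different rules.

```python
import pandas as pd


def standardize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with snake_case column names (hypothetical helper)."""
    out = frame.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)  # collapse punctuation and spaces
        .str.strip("_")
    )
    return out


# Toy data: names and values are made up for illustration only
toy = pd.DataFrame({
    "Project Status ": ["Planned", "operational", None],
    "Capacity (Mt CO2/yr)": ["1.5", "0.4-0.8", "n/a"],
    "Operation": [2027, None, 2030],
})

clean = standardize_columns(toy)

# Anything that is not a plain number becomes NaN instead of raising an error
clean["capacity_mt_co2_yr"] = pd.to_numeric(clean["capacity_mt_co2_yr"], errors="coerce")

# Nullable integer dtype keeps years as integers while allowing missing values
clean["operation"] = clean["operation"].astype("Int64")

# Normalize whitespace and casing in a categorical field
clean["project_status"] = clean["project_status"].str.strip().str.title()

print(clean)
```

Coercing with `errors="coerce"` silently turns ranges such as `0.4-0.8` into missing values, so it is worth counting the resulting nulls before deciding how to treat them.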

## 🛠️ Tools Used

- `pandas` for data manipulation
- `numpy` for handling nulls and type conversion
- `openpyxl` for loading Excel files (if `.xlsx`)
- (Optional) `matplotlib` or `seaborn` for quick exploratory charts

---

> Dataset Source: [IEA CCUS Projects Database](https://www.iea.org/data-and-statistics/data-product/ccus-projects-database)  
> Last updated: April 30, 2025  
> License: [IEA Terms of Use](https://www.iea.org/terms)

In [2]:
# 📦 Import standard libraries
import pandas as pd
import numpy as np

# 📊 (Optional) For quick visual checks
import matplotlib.pyplot as plt
import seaborn as sns

# 📁 For handling Excel files (if needed)
# pandas uses openpyxl as its engine for .xlsx files; it must be installed,
# and this explicit import simply fails early if it is missing
import openpyxl

# 🔧 Display options for better readability
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

In [3]:
# Read in Excel file

df = pd.read_excel('../data/ccus_projects_raw.xlsx', sheet_name="CCUS Projects Database")
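
A quick inspection step usually follows the load. The sketch below is one possible way to summarize dtypes, missing values, and distinct values per column. The `summarize` helper and the toy frame are hypothetical; in this notebook the helper would be called on `df` instead.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, number of nulls, number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Toy stand-in for the loaded dataset (values are made up)
toy = pd.DataFrame({
    "Country": ["Norway", "Norway", None],
    "Operation": [2024, None, 2030],
})

summary = summarize(toy)
print(summary)
```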


In [4]:
# Drop columns that are not needed for the dashboard

df = df.drop(columns=[
    'Project name', 'ID', 'Partners', 'Announcement', 'FID',
    'Part of CCUS hub', 'Ref 1', 'Ref 2', 'Ref 3', 'Ref 4', 'Ref 5',
    'Ref 6', 'Ref 7', 'Link 1', 'Link 2', 'Link 3', 'Link 4', 'Link 5',
    'Link 6', 'Link 7'
])
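
`DataFrame.drop` raises a `KeyError` if any listed column is absent, which can happen if a later release of the database renames or removes a column. A more defensive variant might look like the sketch below; the toy frame and the `cols_to_drop` list are illustrative only.

```python
import pandas as pd

# Toy frame with only some of the columns we intend to drop
toy = pd.DataFrame({
    "ID": [1, 2],
    "Country": ["Norway", "Canada"],
    "Ref 1": ["a", "b"],
})

cols_to_drop = ["ID", "Ref 1", "Ref 2"]  # "Ref 2" is deliberately absent here

# Report anything we expected to drop but could not find
missing = [col for col in cols_to_drop if col not in toy.columns]
if missing:
    print(f"Columns not found, skipped: {missing}")

# errors="ignore" skips absent labels instead of raising KeyError
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")
print(trimmed.columns.tolist())
```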


In [16]:
# Saving the dataframe as Excel

df.to_excel('../data/ccus_projects_raw.x', index=False)
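
Since the stated goal is a cleaned `.csv` for Tableau, the export could also be sketched as below. The file name `ccus_projects_clean.csv` and the temporary output directory are assumptions made so the example runs anywhere; they are not paths used by this project.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the cleaned dataset (values are made up)
toy = pd.DataFrame({
    "country": ["Norway", "Canada"],
    "capacity_mt_co2_yr": [1.5, None],
})

# Hypothetical output location; a temp directory keeps the sketch self-contained
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "ccus_projects_clean.csv"

# index=False avoids writing the row index as an extra unnamed column
toy.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm it round-trips as expected
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Writing to a separate file name keeps the raw download untouched, so the cleaning steps can be rerun from the original at any time.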