# Global CO₂ Emissions Analysis
## Notebook 1: Data Loading and Cleaning


**Datasets used:**

- **Annual CO₂ emissions by country** - Global Carbon Budget (2025) – with major processing by Our World in Data. “Annual CO₂ emissions” [dataset]. Global Carbon Project, “Global Carbon Budget v15” [original data].

- **CO₂ emissions per capita** - Global Carbon Budget (2025); Population based on various sources (2024) – with major processing by Our World in Data. “CO₂ emissions per capita” [dataset]. Global Carbon Project, “Global Carbon Budget v15”; Various sources, “Population” [original data].

- **CO₂ emissions by fuel or industry type** - Global Carbon Budget (2025) – with major processing by Our World in Data. “Other industry” [dataset]. Global Carbon Project, “Global Carbon Budget v15” [original data].

- **Population (historical + projections to 2100)** - UN, World Population Prospects (2024) – processed by Our World in Data. “Population, medium projection – UN WPP” [dataset]. United Nations, “World Population Prospects” [original data].

- **GDP per capita** - Eurostat, OECD, IMF, and World Bank (2025) – with minor processing by Our World in Data. “GDP per capita – World Bank – In constant international-$” [dataset]. Eurostat, OECD, IMF, and World Bank, “World Development Indicators 122” [original data].

**Data source:** [Our World in Data](https://ourworldindata.org/)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df1 = pd.read_csv("/content/annual-co2-emissions-per-country.csv")
df2 = pd.read_csv("/content/co-emissions-per-capita.csv")
df3 = pd.read_csv("/content/co2-by-source.csv")
df4 = pd.read_csv("/content/population.csv")
df4.columns = ["Entity", "Code", "Year", "Population"]  # rename population columns for clarity

# Outer-merge all four tables on the shared entity/code/year keys
keys = ["Entity", "Code", "Year"]
df = (
    df1.merge(df2, on=keys, how="outer")
    .merge(df3, on=keys, how="outer")
    .merge(df4, on=keys, how="outer")
)
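An outer merge silently multiplies rows if any table repeats an `Entity`/`Code`/`Year` combination, so it can be worth checking the keys first. The following is a minimal sketch on made-up toy frames (the names `emissions`, `population`, and `KEYS`, and all values, are illustrative assumptions, not the notebook's data). It counts duplicated keys, asks pandas to enforce a one-to-one merge with `validate`, and uses `indicator=True` to show which table each row came from.

```python
import pandas as pd

KEYS = ["Entity", "Code", "Year"]

# Toy stand-ins for two of the source tables (values are made up)
emissions = pd.DataFrame({
    "Entity": ["France", "France", "Kenya"],
    "Code": ["FRA", "FRA", "KEN"],
    "Year": [2000, 2001, 2000],
    "Annual CO₂ emissions": [1.0, 2.0, 3.0],
})
population = pd.DataFrame({
    "Entity": ["France", "Kenya", "Kenya"],
    "Code": ["FRA", "KEN", "KEN"],
    "Year": [2000, 2000, 2001],
    "Population": [10, 20, 21],
})

# Count rows whose key combination already appeared earlier in the same table
tables = {"emissions": emissions, "population": population}
dup_counts = {
    name: int(frame.duplicated(subset=KEYS).sum())
    for name, frame in tables.items()
}
print(dup_counts)

# validate="one_to_one" raises pandas.errors.MergeError if keys repeat on either side;
# indicator=True adds a "_merge" column: "both", "left_only", or "right_only"
merged = emissions.merge(
    population, on=KEYS, how="outer", validate="one_to_one", indicator=True
)
origin_counts = merged["_merge"].astype(str).value_counts().to_dict()
print(origin_counts)
```

Rows tagged `left_only` or `right_only` are the ones that will carry missing values for the other table's columns after the outer merge.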



## Data cleaning
1. Remove regional aggregates (rows with no country code)

2. Remove the World entity

3. Keep only the modern period (Year >= 1960)

4. Check for missing values

In [2]:
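df = df.dropna(subset=["Code"])   # step 1: rows without a code are regional aggregates
df = df[df["Entity"] != "World"]  # step 2: World is dropped by name
df = df[df["Year"] >= 1960]       # step 3: keep 1960 onward
missing = df.isna().sum()         # step 4: missing values per column
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)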

Annual CO₂ emissions from other industry    31859
Annual CO₂ emissions from gas               26409
Annual CO₂ emissions from coal              24478
Annual CO₂ emissions from flaring           24461
Annual CO₂ emissions from cement            20365
Annual CO₂ emissions from oil               20141
Annual CO₂ emissions (per capita)           20126
Annual CO₂ emissions                        20043
Population                                    120
dtype: int64
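Raw counts are hard to judge without the table size, so a share of missing rows per column may be easier to read. This is a small sketch on a made-up frame (the frame `toy` and its values are assumptions for illustration only); the same two expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up example frame; NaN marks a missing value
toy = pd.DataFrame({
    "Year": [1960, 1961, 1962, 1963],
    "Population": [10.0, 11.0, 12.0, np.nan],
    "Annual CO₂ emissions": [np.nan, np.nan, 1.5, 2.0],
})

missing_count = toy.isna().sum()                    # missing rows per column
missing_pct = toy.isna().mean().mul(100).round(1)   # same, as a percentage of all rows

summary = pd.DataFrame({"missing": missing_count, "missing_pct": missing_pct})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```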
