# Preliminary Data Analysis

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Read in Data

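This dataset from Climate Watch reports CO2 emissions per year, per sector, and per country from 1960 to 2023. It covers 196 countries and six sector categories: `Cement`, `Coal`, `Gas`, `Gas flaring`, `Oil`, and `Total fossil fuels and cement`. The magnitudes (for example, roughly 7 for Afghanistan's oil emissions in 2023) suggest the values are in megatonnes of CO2 rather than tons; this should be confirmed against the Climate Watch metadata.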

In [4]:
climate_watch_GCP = pd.read_csv("ClimateWatch_HistoricalEmissions/CW_HistoricalEmissions_GCP.csv")
climate_watch_GCP.head()

,Country name,Country,Sector,Source,Gas,1960,1961,1962,1963,1964,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Afghanistan,AFG,Cement,GCP,CO2,0.018012,0.021806,0.029074,0.05088,0.061783,...,0.028644,0.041189,0.076126,0.044785,0.05688,0.038329,0.060674,0.016345,0.016345,0.016345
1,Afghanistan,AFG,Coal,GCP,CO2,0.12712,0.17587,0.29678,0.26381,0.30045,...,3.6215,2.7224,2.7176,3.2573,3.6334,3.7006,4.116,3.3965,3.5313,3.8431
2,Afghanistan,AFG,Gas,GCP,CO2,0.0,0.0,0.0,0.0,0.0,...,0.27125,0.28213,0.31864,0.30045,0.29302,0.24549,0.15394,0.15755,0.14828,0.14971
3,Afghanistan,AFG,Gas flaring,GCP,CO2,,,,,,...,,,,,,,,,,
4,Afghanistan,AFG,Oil,GCP,CO2,0.26876,0.29312,0.36274,0.39205,0.47632,...,5.1647,6.6245,5.7941,6.0749,6.6186,6.8407,7.2753,6.7015,6.8619,7.0111


## Data Cleaning

In [17]:
climate_watch_GCP.shape

(1013, 69)

In [7]:
# List of columns
climate_watch_GCP.columns

Index(['Country name', 'Country', 'Sector', 'Source', 'Gas', '1960', '1961',
       '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970',
       '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979',
       '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988',
       '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
       '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',
       '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023'],
      dtype='object')

In [25]:
# Count the unique values in the identifier columns (`Country name`, `Sector`, `Source`, `Gas`)

def print_unique_values(df):
    columns = ['Country name', 'Sector', 'Source', 'Gas']

    for col in columns:
        if col in df.columns:
            unique_vals = df[col].unique()
            print(f"\n{col}:")
            print(f"  Number of unique values: {len(unique_vals)}")
            # print(f"  Values: {unique_vals}")
        else:
            print(f"\n{col}: Column not found in dataframe")

print_unique_values(climate_watch_GCP)


Country name:
  Number of unique values: 196

Sector:
  Number of unique values: 6

Source:
  Number of unique values: 1

Gas:
  Number of unique values: 1


In [26]:
# Unique `Sector` values
climate_watch_GCP['Sector'].unique()

array(['Cement', 'Coal', 'Gas', 'Gas flaring', 'Oil',
       'Total fossil fuels and cement'], dtype=object)
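
With 196 country names and 6 sectors, a complete grid would have 196 × 6 = 1176 rows, but the frame has 1013, so some country-sector pairs appear to be absent altogether (assuming no pair is duplicated). A minimal sketch for listing the absent pairs is shown below. The helper `missing_combinations` and the toy frame are illustrative and not part of the notebook; on the real data the call would be `missing_combinations(climate_watch_GCP)`.

```python
import pandas as pd

def missing_combinations(df, country_col="Country name", sector_col="Sector"):
    """Return the (country, sector) pairs that have no row in df."""
    # Every pair that would exist if each country reported each sector
    full = pd.MultiIndex.from_product(
        [df[country_col].unique(), df[sector_col].unique()],
        names=[country_col, sector_col],
    )
    # Pairs that actually appear in the frame
    present = pd.MultiIndex.from_frame(df[[country_col, sector_col]].drop_duplicates())
    return full.difference(present)

# Toy frame standing in for climate_watch_GCP (hypothetical values)
toy = pd.DataFrame({
    "Country name": ["A", "A", "B"],
    "Sector": ["Coal", "Oil", "Coal"],
})
absent = missing_combinations(toy)
print(list(absent))  # [('B', 'Oil')]
```

Pairs that are absent from the file are a different kind of gap than NaN cells inside an existing row, so it may be worth tracking the two separately during cleaning.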

In [19]:
# Proportion of NaN values in each column

nan_cols = climate_watch_GCP.isna().sum() / len(climate_watch_GCP)
nan_cols.sort_values(ascending = False)

1960            0.227048
1962            0.224087
1961            0.222113
1963            0.220138
1964            0.213228
                  ...   
Country         0.000000
Gas             0.000000
Source          0.000000
Sector          0.000000
Country name    0.000000
Length: 69, dtype: float64

In [21]:
# Proportion of NaN values in each row (the denominator includes the five identifier columns)

nan_rows = climate_watch_GCP.isna().sum(axis = 1) / len(climate_watch_GCP.columns)
nan_rows.sort_values(ascending = False)

780     0.913043
120     0.913043
152     0.913043
718     0.898551
288     0.898551
          ...   
405     0.000000
406     0.000000
407     0.000000
408     0.000000
1012    0.000000
Length: 1013, dtype: float64
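
The output above identifies rows only by index, and its denominator is all 69 columns. Since the five identifier columns contain no NaN values, the top value 0.913043 corresponds to 63/69, i.e. 63 of the 64 year columns are empty (63/64 ≈ 0.984 when only year columns are counted). The sketch below attaches country and sector labels and uses a year-only denominator. The helper `label_row_missingness` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def label_row_missingness(df, id_cols=("Country name", "Sector")):
    """Attach identifiers to the per-row count and share of missing years."""
    # Assumption: year columns are exactly the columns whose names are all digits
    year_cols = [c for c in df.columns if str(c).isdigit()]
    out = df[list(id_cols)].copy()
    out["missing_years"] = df[year_cols].isna().sum(axis=1)
    out["missing_share"] = out["missing_years"] / len(year_cols)
    return out.sort_values("missing_share", ascending=False)

# Toy frame standing in for climate_watch_GCP (hypothetical values)
toy = pd.DataFrame({
    "Country name": ["A", "A", "B"],
    "Sector": ["Coal", "Gas flaring", "Coal"],
    "1960": [0.1, np.nan, np.nan],
    "1961": [0.2, np.nan, 0.3],
})
labeled = label_row_missingness(toy)
print(labeled)
```

On the real data, `label_row_missingness(climate_watch_GCP).head(10)` would show which country-sector rows sit at the top of the ranking.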

**Brief note before proceeding**

Before cleaning the data, it is important to understand what kind of data is missing. Given the granularity of this dataset, understanding why certain rows are missing more data than others can reveal which emissions-intensive activities occur in which geographies, which in turn helps map effective policy instruments to each region in later analysis.
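
One way to start contextualizing the gaps is to look at where they concentrate: which sectors are emptiest, and in which year each row first reports a value. The following is a rough sketch under stated assumptions, not a definitive cleaning step. The helper `summarize_missingness`, the `ID_COLS` list, and the toy frame are illustrative; the identifier column names are taken from the column listing above.

```python
import numpy as np
import pandas as pd

# Identifier columns as listed in the notebook; everything else is treated as a year
ID_COLS = ["Country name", "Country", "Sector", "Source", "Gas"]

def summarize_missingness(df):
    """Per-row share of missing years and first year with a reported value."""
    year_cols = [c for c in df.columns if c not in ID_COLS]
    summary = df[["Country name", "Sector"]].copy()
    summary["missing_share"] = df[year_cols].isna().mean(axis=1)
    # first_valid_index returns None for rows with no data at all
    summary["first_year"] = df[year_cols].apply(
        lambda row: row.first_valid_index(), axis=1
    )
    return summary

# Toy frame standing in for climate_watch_GCP (hypothetical values)
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "B"],
    "Country": ["AAA", "AAA", "BBB", "BBB"],
    "Sector": ["Coal", "Gas flaring", "Coal", "Gas flaring"],
    "Source": ["GCP"] * 4,
    "Gas": ["CO2"] * 4,
    "1960": [0.1, np.nan, np.nan, np.nan],
    "1961": [0.2, np.nan, 0.3, 0.5],
})
summary = summarize_missingness(toy)

# Average missing share per sector, emptiest first
by_sector = (
    summary.groupby("Sector")["missing_share"].mean().sort_values(ascending=False)
)
print(by_sector)
```

A row whose `first_year` is late may reflect a reporting series that started after 1960, while a row that is empty throughout (such as Afghanistan's `Gas flaring` row in the preview) may simply indicate that no such activity was recorded. These interpretations are hypotheses to check against the source documentation.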