# **Compulsory Assignment 1**
## *Machine Learning and Deep Learning (CDSCO2041C)*
Group: MLS26_CA01

Student IDs: 185912, 160363, xxx & xxx

Dataset: greenhouse-gas-emissions.xlsx (UK_by_source)

## Question 1
### Exploratory Data Analysis (EDA)
The provided dataset contains UK territorial greenhouse gas emissions by source and activity, covering the period from 1990 onwards. Emissions are attributed to the sector that emits them directly and include indicators related to UK territorial totals, international aviation and shipping, and Paris Agreement coverage.

Perform Exploratory Data Analysis (EDA) to investigate the key factors driving changes in UK greenhouse gas emissions over time.

1. Write a Python program to perform a covariance- and correlation-based analysis to examine
relationships between emissions, sources, and activities across years. Do not use any built-in
covariance or correlation functions. You must implement your calculations.
2. Write another Python program to visualise your findings from the previous step and briefly
explain the observed emission patterns. Hints: You may consider using histograms, boxplots,
and scatterplots.

In [20]:
# All imports used in the notebook
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt

In [21]:
# Load dataset
file_path = "Data/greenhouse-gas-emissions.xlsx"

# Load sheets
sheet2 = pd.read_excel(file_path, sheet_name=1)  # Variable descriptions
sheet3 = pd.read_excel(file_path, sheet_name=2)  # Emission data

#### Initial Dataset Overview

In [None]:
# Inspect variable descriptions
# Clean column names
sheet2.columns = sheet2.columns.str.strip()

# Relevant columns only
variables_table = sheet2.iloc[:, :2]
variables_table.columns = ["Variable", "Description"]
variables_table

| | Variable | Description |
|---|---|---|
| 0 | Included in UK territorial total | Identifies emissions included in the UK territ... |
| 1 | Included in UK UNFCCC total | Identifies emissions included in the territori... |
| 2 | Included in UK Paris Agreement total | Identifies emissions included in the territori... |
| 3 | GHG | The greenhouse gas emitted, with different spe... |
| 4 | GHG grouped | The greenhouse gas emitted, with F-gases group... |
| 5 | CRT category | Categories defined in international guidelines... |
| 6 | CRT category description | Text description for each CRT category. |
| 7 | Year | The calendar year in which the emissions occur... |
| 8 | Territory name | The territory where the emissions occurred, ei... |
| 9 | Territorial Emissions Statistics sector | A grouping of the TES subsectors and categories. |


In [24]:
# Basic shape of dataset
print("Rows, columns:", sheet3.shape)

# Year coverage
print("Years:", sheet3["Year"].min(),
      "→", sheet3["Year"].max(),
      "| n =", sheet3["Year"].nunique())

# Structural overview
print("Unique sectors:",
      sheet3["Territorial Emissions Statistics sector"].nunique())

print("Unique subsectors:",
      sheet3["Territorial Emissions Statistics subsector"].nunique())

print("Unique sources:",
      sheet3["Source"].nunique())

print("Unique fuel groups:",
      sheet3["Fuel group"].nunique())

# Missing values check
print("Missing emission values:",
      sheet3["Emissions (MtCO2e)"].isna().sum())

sheet3.describe()

Rows, columns: (78022, 15)
Years: 1990 → 2024 | n = 35
Unique sectors: 9
Unique subsectors: 28
Unique sources: 700
Unique fuel groups: 5
Missing emission values: 0


| | Year | Emissions (MtCO2e) |
|---|---|---|
| count | 78022.0 | 78022.0 |
| mean | 2007.771295 | 0.293069 |
| std | 9.982078 | 3.036902 |
| min | 1990.0 | -18.002028 |
| 25% | 1999.0 | 0.000124 |
| 50% | 2008.0 | 0.002087 |
| 75% | 2016.0 | 0.026358 |
| max | 2024.0 | 184.014756 |


#### Restricting the Analysis to Policy-Relevant Emissions

UK Carbon Budgets and official progress assessments are based on **territorial emissions**, meaning emissions that occur within the UK’s geographic boundaries. To ensure that the statistical analysis is aligned with the policy framework and carbon accounting methodology, we restrict the dataset to observations included in the *UK territorial total*.

This step ensures that subsequent covariance and correlation results reflect structurally relevant emission dynamics rather than memo items or internationally reported components that are treated separately in climate policy.


In [None]:
territorial = sheet3[
    sheet3["Included in UK territorial total"].str.lower() == "yes"
].copy()

print("Rows after territorial filter:", territorial.shape)

#### Aggregating Emissions by Year and Sector

Rather than analysing hundreds of highly granular emission sources, we aggregate emissions to the top-level *Territorial Emissions Statistics sectors*. This aggregation improves interpretability, reduces dimensionality, and ensures that the covariance structure captures meaningful sector-level relationships rather than noise from very small individual sources. The sector-level aggregation reflects structured combinations of underlying sources and activities. Each sector is composed of multiple emission sources and fuel or activity categories, meaning that sectoral time series capture broader structural emission dynamics rather than isolated micro-level processes.

In [None]:
SECTOR_COL = "Territorial Emissions Statistics sector"

sector_year = (
    territorial
    .groupby(["Year", SECTOR_COL], as_index=False)["Emissions (MtCO2e)"]
    .sum()
)

sector_year.head()

#### Reshaping the Data to Wide Format (Years × Sectors)

To perform covariance and correlation analysis, the data must be structured so that each sector represents a separate variable observed over time. We therefore pivot the dataset into wide format, where rows represent years, columns represent sectors, and the values correspond to total annual emissions.

In this structure, each column becomes a sector-specific emission time series, allowing us to examine how sectors co-move and evolve across years.

In [None]:
pivot_sec = (
    sector_year
    .pivot(index="Year",
           columns=SECTOR_COL,
           values="Emissions (MtCO2e)")
    .fillna(0.0)
    .sort_index()
)

print("Shape of sector-level dataset:", pivot_sec.shape)
pivot_sec.head()

#### Custom Covariance and Correlation (No Built-in Functions)

In this section, we manually implement the sample covariance and correlation formulas to examine the relationships between emission sources across years. This approach ensures full transparency of the statistical calculations and demonstrates a clear understanding of the underlying theoretical foundations from Lecture 02.

#### Sample Mean

For a variable $x$ with $n$ observations:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
$$

The mean represents the central tendency of emissions across years.

#### Sample Covariance

The sample covariance between two variables $x$ and $y$ is defined as:

$$
\text{cov}(x,y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

We divide by **(n−1)** rather than **n** because the deviations are measured from sample means that are estimated from the same data. This adjustment (Bessel's correction) makes the *sample covariance* an unbiased estimator of the population covariance.

* **Positive covariance** → Two emission sources increase or decrease together over time.
* **Negative covariance** → When one source increases, the other tends to decrease.
* **Covariance close to zero** → No clear linear co-movement.

However, covariance depends on the scale of the variables. Large emission sectors will naturally produce larger covariance values.
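
The scale dependence is easy to see on a toy example. The sketch below uses two made-up series (not values from the dataset) and a compact helper, `sample_cov`, which is a hypothetical name that simply restates the formula above. Expressing one series in ktCO2e instead of MtCO2e multiplies the covariance by the conversion factor, although the relationship between the two series is unchanged.

```python
def sample_cov(xs, ys):
    """Sample covariance with the n-1 denominator (the formula above)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


# Made-up illustrative series in MtCO2e (assumption: not taken from the dataset).
sector_a = [10.0, 8.0, 6.0, 4.0]
sector_b = [5.0, 4.5, 3.5, 3.0]

cov_mt = sample_cov(sector_a, sector_b)

# The same series, with sector_a expressed in ktCO2e (1 Mt = 1000 kt).
sector_a_kt = [1000.0 * v for v in sector_a]
cov_kt = sample_cov(sector_a_kt, sector_b)

print("covariance with sector_a in Mt:", cov_mt)
print("covariance with sector_a in kt:", cov_kt)
print("ratio:", cov_kt / cov_mt)
```

The ratio equals the unit conversion factor, which is why covariance values should only be compared between pairs measured on the same scale.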

#### Sample Standard Deviation

Standard deviation is derived from variance:

$$
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n}(x_i - \bar{x})^2}
$$

It measures dispersion of emissions over time.

#### Sample Correlation

Correlation standardizes covariance:

$$
\text{corr}(x,y) = \frac{\text{cov}(x,y)}{s_x s_y}
$$

Unlike covariance, correlation is **scale-independent** and bounded between -1 and 1.

* +1: Perfect positive linear relationship
* -1: Perfect negative linear relationship
* 0: No linear relationship
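
The scale-independence follows directly from the definitions above. As a short derivation, consider the rescaled variables $u_i = a x_i + b$ and $v_i = c y_i + d$ with constants $a, c > 0$ (the symbols $a, b, c, d, u, v$ are introduced here for illustration only). Since $\bar{u} = a\bar{x} + b$ and $\bar{v} = c\bar{y} + d$:

$$
\text{cov}(u,v) = \frac{1}{n-1} \sum_{i=1}^{n} a(x_i - \bar{x})\, c(y_i - \bar{y}) = ac\,\text{cov}(x,y), \qquad s_u = a\,s_x, \qquad s_v = c\,s_y
$$

$$
\text{corr}(u,v) = \frac{ac\,\text{cov}(x,y)}{a\,s_x \cdot c\,s_y} = \text{corr}(x,y)
$$

A change of units (for example from MtCO2e to ktCO2e) therefore rescales the covariance but leaves the correlation unchanged.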

In our greenhouse gas dataset, correlation helps identify:

* Which sectors structurally decline together
* Whether reductions are economy-wide or sector-specific
* Whether certain sources behave independently

#### Application for our Emission Analysis

Understanding covariance and correlation allows us to:

* Detect structural transformation in the UK economy
* Identify co-movement between industrial sectors
* Evaluate whether emission reductions occur uniformly
* Support later clustering and policy interpretation

Covariance gives the direction and the scale-dependent magnitude of joint variation.
Correlation gives the direction and the standardized strength of the linear relationship.


In [None]:
# We implement the sample mean, variance, standard deviation,
# covariance, and correlation manually (ddof = 1).

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)


def variance(xs, ddof=1):
    xs = list(xs)
    n = len(xs)
    if n <= ddof:
        return float("nan")
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - ddof)


def std(xs, ddof=1):
    v = variance(xs, ddof=ddof)
    return math.sqrt(v) if not math.isnan(v) else float("nan")


def covariance(xs, ys, ddof=1):
    xs, ys = list(xs), list(ys)
    n = len(xs)
    if n != len(ys) or n <= ddof:
        return float("nan")
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)


def correlation(xs, ys):
    cov = covariance(xs, ys, ddof=1)
    sx = std(xs, ddof=1)
    sy = std(ys, ddof=1)
    if sx == 0 or sy == 0 or math.isnan(cov):
        return float("nan")
    return cov / (sx * sy)

In [None]:
# Covariance and correlation matrices (sector-level)

sectors = pivot_sec.columns.tolist()

cov_matrix = pd.DataFrame(index=sectors, columns=sectors, dtype=float)
corr_matrix = pd.DataFrame(index=sectors, columns=sectors, dtype=float)

for s1 in sectors:
    for s2 in sectors:
        cov_matrix.loc[s1, s2] = covariance(
            pivot_sec[s1].values,
            pivot_sec[s2].values
        )
        corr_matrix.loc[s1, s2] = correlation(
            pivot_sec[s1].values,
            pivot_sec[s2].values
        )

print("Covariance matrix:")
display(cov_matrix.round(2))

In [None]:
print("Correlation matrix:")
display(corr_matrix.round(3))

In [None]:
# Year-over-year differences

diff_sec = pivot_sec.diff().dropna()

diff_corr_matrix = pd.DataFrame(index=sectors, columns=sectors, dtype=float)

for s1 in sectors:
    for s2 in sectors:
        diff_corr_matrix.loc[s1, s2] = correlation(
            diff_sec[s1].values,
            diff_sec[s2].values
        )

display(diff_corr_matrix.round(3))


#### Spurious Trend Issue Acknowledgement
Because sectoral emissions exhibit strong long-run trends, correlations in levels partly reflect shared time dependence. To assess whether sectors co-move beyond common trends, we also analyse year-over-year changes.
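
The effect can be illustrated with a minimal sketch on made-up data (not dataset values). Both toy series decline over time, but their fluctuations around the trend are constructed to be unrelated. The helper names `corr` and `diff` are hypothetical and restate the sample correlation and the year-over-year difference in plain Python.

```python
import math


def corr(xs, ys):
    """Sample correlation; the n-1 denominators cancel, so plain sums suffice."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def diff(xs):
    """Year-over-year changes: x[t] - x[t-1]."""
    return [b - a for a, b in zip(xs, xs[1:])]


# Made-up fluctuations around the trend, unrelated to each other by construction.
noise_x = [0.5, -0.5] * 5
noise_y = [0.5, 0.5, -0.5, -0.5] * 2 + [0.5, 0.5]

# Two declining toy series that share nothing except a downward trend.
x = [20.0 - t + e for t, e in enumerate(noise_x)]
y = [40.0 - 2.0 * t + e for t, e in enumerate(noise_y)]

r_levels = corr(x, y)               # dominated by the shared trend
r_changes = corr(diff(x), diff(y))  # trend removed by differencing

print("correlation in levels: ", round(r_levels, 3))
print("correlation in changes:", round(r_changes, 3))
```

In this toy case the levels correlation is close to 1 while the correlation of changes is approximately zero, which is the pattern the differenced analysis is meant to detect.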

In [None]:
# Correlation with Year (trend strength)

years = pivot_sec.index.values

corr_with_year = pd.Series({
    sector: correlation(years, pivot_sec[sector].values)
    for sector in sectors
}).sort_values()

display(corr_with_year.to_frame("corr(Year, sector_emissions)").round(3))


Because covariance is scale-dependent, sectors with large emission levels and strong absolute changes over time, such as Electricity supply and Industry, exhibit substantially larger covariance values. In contrast, smaller sectors such as Agriculture generate lower covariance magnitudes, even when their correlations with other sectors remain high. This highlights that covariance reflects joint variability in absolute terms, whereas correlation provides a scale-independent measure of linear association.

#### Fuel Group Trajectories
Fuel groups represent a small number of energy categories, such as solid fuels, liquid fuels, and gaseous fuels. Analysing fuel-level emission trajectories provides a complementary perspective to sector-level analysis and helps identify structural shifts in the UK energy mix over time.

In [None]:
FUEL_COL = "Fuel group"

fuel_year = (
    territorial
    .groupby(["Year", FUEL_COL], as_index=False)["Emissions (MtCO2e)"]
    .sum()
)

pivot_fuel = (
    fuel_year
    .pivot(index="Year", columns=FUEL_COL, values="Emissions (MtCO2e)")
    .fillna(0.0)
    .sort_index()
)

fuel_cols = list(pivot_fuel.columns)
fuel_corr = pd.DataFrame(index=fuel_cols, columns=fuel_cols, dtype=float)

for a in fuel_cols:
    for b in fuel_cols:
        fuel_corr.loc[a, b] = correlation(pivot_fuel[a].values, pivot_fuel[b].values)

display(pivot_fuel.head())
display(fuel_corr.round(3))

corr_fuel_year = pd.Series({c: correlation(pivot_fuel.index.values, pivot_fuel[c].values) for c in fuel_cols}).sort_values()
display(corr_fuel_year.to_frame("corr(Year, fuel_group_emissions)").round(3))

#### Visualisation of Emission Patterns

To complement the covariance and correlation analysis, we now visualise the emission trajectories across sectors and fuel groups. Visualisation allows us to examine time trends, variability, and distributional patterns more directly, and to assess whether the statistical relationships identified in the previous section are reflected in observable emission dynamics.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

COL_RED   = "#fa4d56"
COL_GREEN = "#6fdc8c"
COL_BLUE  = "#33b1ff"
COL_GRAY  = "#6b7280"
COL_DARK  = "#111827"

SECTOR_PALETTE = [COL_RED, COL_BLUE, COL_GREEN, "#ffb000", "#a56eff", "#ff7eb6", "#8a8f98", "#00bcd4"]

def set_clean_theme():
    sns.set_theme(
        style="white",  # crucial: not whitegrid
        context="notebook",
        rc={
            "figure.facecolor": "white",
            "axes.facecolor": "white",
            "axes.edgecolor": COL_DARK,
            "axes.labelcolor": COL_DARK,
            "text.color": COL_DARK,
            "xtick.color": COL_GRAY,
            "ytick.color": COL_GRAY,
            "axes.grid": False,   # hard off
            "grid.alpha": 0.0,
            "font.size": 10,
            "axes.titlesize": 13,
            "axes.labelsize": 11,
            "figure.dpi": 160,
            "savefig.dpi": 300,
        }
    )

def clean_ax(ax):
    ax.grid(False)
    sns.despine(ax=ax, top=True, right=True)
    ax.margins(x=0.01)
    ax.tick_params(labelsize=9)

set_clean_theme()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap

def sparse_xticks(ax, xvals, step=5):
    xvals = np.asarray(xvals)
    if len(xvals) == 0:
        return
    idx = np.arange(0, len(xvals), step)
    ax.set_xticks(xvals[idx])
    ax.set_xticklabels([str(int(v)) for v in xvals[idx]])

set_clean_theme()

BASE_YEAR = 1990

if "Year_idx" not in territorial.columns:
    territorial = territorial.copy()
    territorial = territorial[territorial["Year"] >= BASE_YEAR]
    territorial["Year_idx"] = territorial["Year"] - BASE_YEAR + 1

sector_col = "Territorial Emissions Statistics sector"
value_col = "Emissions (MtCO2e)"

if "pivot_sec" not in globals():
    pivot_sec = (
        territorial.groupby(["Year", "Year_idx", sector_col])[value_col]
        .sum()
        .reset_index()
        .pivot(index="Year", columns=sector_col, values=value_col)
        .sort_index()
    )

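if "corr_matrix" not in globals():
    # Fallback built with the custom correlation() defined earlier,
    # so that no built-in correlation routine is used.
    corr_matrix = pd.DataFrame(
        [[correlation(pivot_sec[a].values, pivot_sec[b].values)
          for b in pivot_sec.columns]
         for a in pivot_sec.columns],
        index=pivot_sec.columns,
        columns=pivot_sec.columns,
    )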

t = (pivot_sec.index.values - pivot_sec.index.min()) + 1
year0 = int(pivot_sec.index.min())

# Plot 1: Total territorial emissions over time
tot = (
    territorial.groupby(["Year", "Year_idx"])[value_col]
    .sum()
    .reset_index()
    .sort_values("Year")
)
fig, ax = plt.subplots(figsize=(10.5, 3.6))
sns.lineplot(data=tot, x="Year_idx", y=value_col, ax=ax, color=COL_BLUE, linewidth=2.6)
ax.set_title("UK Territorial Greenhouse Gas Emissions Over Time", pad=10)
ax.set_xlabel(f"Year index ({BASE_YEAR} = 1)")
ax.set_ylabel("Emissions (MtCO2e)")
sparse_xticks(ax, tot["Year_idx"].values, step=5)
clean_ax(ax)
plt.tight_layout()
plt.show()

# Plot 2: Sector emissions trajectories over time
sec_long = (
    territorial.groupby([sector_col, "Year", "Year_idx"])[value_col]
    .sum()
    .reset_index()
    .sort_values([sector_col, "Year"])
)
fig, ax = plt.subplots(figsize=(11.5, 5.2))
sns.lineplot(
    data=sec_long,
    x="Year_idx",
    y=value_col,
    hue=sector_col,
    palette=SECTOR_PALETTE,
    linewidth=2.0,
    ax=ax
)
ax.set_title("Sector Emissions Over Time (Territorial Total)", pad=10)
ax.set_xlabel(f"Year index ({BASE_YEAR} = 1)")
ax.set_ylabel("Emissions (MtCO2e)")
sparse_xticks(ax, np.sort(sec_long["Year_idx"].unique()), step=5)
clean_ax(ax)
ax.legend(frameon=False, ncol=2, bbox_to_anchor=(1.02, 1.0), loc="upper left", fontsize=9, title=None)
plt.tight_layout()
plt.show()

# Plot 3: Correlation heatmap across sectors
cmap = LinearSegmentedColormap.from_list("rgb_custom", [COL_RED, "white", COL_BLUE], N=256)
fig, ax = plt.subplots(figsize=(7.8, 6.8))
sns.heatmap(
    corr_matrix.astype(float),
    vmin=-1, vmax=1,
    cmap=cmap,
    annot=True, fmt=".2f",
    annot_kws={"size": 8, "color": COL_DARK},
    cbar_kws={"label": "Correlation"},
    linewidths=0.0,
    ax=ax
)
ax.set_title("Correlation Matrix (Sectors)", pad=10)
clean_ax(ax)
plt.tight_layout()
plt.show()

yoy = pivot_sec.diff().dropna()

# Plot 4: Histogram of YoY changes for two sectors
candidates = ["Electricity supply", "Domestic transport", "Industry", "Buildings and product uses"]
chosen = [c for c in candidates if c in yoy.columns][:2]
if len(chosen) < 2:
    chosen = yoy.columns[:2].tolist()

fig, ax = plt.subplots(figsize=(10.5, 3.6))
sns.histplot(yoy[chosen[0]], bins=20, kde=False, color=COL_RED, alpha=0.35, edgecolor=None, ax=ax, label=chosen[0])
if len(chosen) > 1:
    sns.histplot(yoy[chosen[1]], bins=20, kde=False, color=COL_GREEN, alpha=0.35, edgecolor=None, ax=ax, label=chosen[1])
ax.set_title("Distribution of Year-over-Year Emission Changes", pad=10)
ax.set_xlabel("YoY change (MtCO2e)")
ax.set_ylabel("Count")
ax.legend(frameon=False, fontsize=9)
clean_ax(ax)
plt.tight_layout()
plt.show()

# Plot 5: Boxplot of YoY changes across sectors
yoy_long = yoy.reset_index().melt(id_vars=["Year"], var_name="Sector", value_name="YoY_change")
sector_order = list(yoy.columns)

fig, ax = plt.subplots(figsize=(11.5, 4.6))
sns.boxplot(
    data=yoy_long,
    x="Sector",
    y="YoY_change",
    order=sector_order,
    hue="Sector",
    palette=SECTOR_PALETTE,
    dodge=False,
    width=0.6,
    fliersize=2.5,
    linewidth=1.0,
    ax=ax
)
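# Remove the redundant legend only if seaborn created one;
# depending on the seaborn version it may be absent.
legend = ax.get_legend()
if legend is not None:
    legend.remove()

# Soften the box fill. Depending on the matplotlib/seaborn versions the boxes
# are stored in ax.patches or ax.artists, so cover both.
for artist in list(ax.patches) + list(ax.artists):
    artist.set_alpha(0.40)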

ax.set_title("Boxplot: YoY Emission Changes by Sector", pad=10)
ax.set_xlabel("")
ax.set_ylabel("YoY change (MtCO2e)")
ax.tick_params(axis="x", rotation=90, labelsize=9)

clean_ax(ax)
plt.tight_layout()
plt.show()

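# Plot 6: Scatter with regression line for strongest correlated sector pairs
# top_pairs maps (sector_1, sector_2) -> r. If it has not been defined earlier,
# build it from the three largest |r| among distinct sector pairs
# (the cut-off of three is our choice).
if "top_pairs" not in globals():
    pair_r = {
        (a, b): float(corr_matrix.loc[a, b])
        for i, a in enumerate(corr_matrix.index)
        for b in corr_matrix.columns[i + 1:]
        if pd.notna(corr_matrix.loc[a, b])
    }
    top_pairs = dict(
        sorted(pair_r.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
    )

for (s1, s2), r in top_pairs.items():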
    fig, ax = plt.subplots(figsize=(5.6, 4.8))
    sns.regplot(
        x=pivot_sec[s1].values,
        y=pivot_sec[s2].values,
        scatter_kws={"s": 42, "alpha": 0.75, "color": COL_BLUE, "edgecolor": "none"},
        line_kws={"linewidth": 2.2, "color": COL_RED},
        ax=ax
    )
    ax.set_title(f"{s1} vs {s2} (r = {r:.3f})", pad=10)
    ax.set_xlabel(f"{s1} (MtCO2e)")
    ax.set_ylabel(f"{s2} (MtCO2e)")
    clean_ax(ax)
    plt.tight_layout()
    plt.show()

# Plot 7: Stacked area showing sector composition over time
fig, ax = plt.subplots(figsize=(11.5, 5.2))
ax.stackplot(
    t,
    np.vstack([pivot_sec[c].values for c in pivot_sec.columns]),
    labels=pivot_sec.columns,
    alpha=0.75,
    colors=[SECTOR_PALETTE[i % len(SECTOR_PALETTE)] for i in range(len(pivot_sec.columns))]
)
ax.set_title("Sector Composition of Territorial Emissions Over Time", pad=10)
ax.set_xlabel(f"Year index ({year0} = 1)")
ax.set_ylabel("Emissions (MtCO2e)")
sparse_xticks(ax, t, step=5)
clean_ax(ax)
ax.legend(frameon=False, ncol=2, bbox_to_anchor=(1.02, 1.0), loc="upper left", fontsize=9, title=None)
plt.tight_layout()
plt.show()

## Question 2
### Cluster Analysis
Cluster analysis is used to group data points based on similarity in their attributes.

1. Choose one clustering algorithm discussed in the lectures and apply it to group emission sources or activities based on their emission trajectories over time.
2. Clearly justify the variables used for clustering (e.g., emission levels, rate of change, fuel group). Interpret the resulting clusters and explain what they reveal about structural changes in UK greenhouse gas emissions.
3. Relate your clustering results to UK climate policy by identifying which clusters align with sectors targeted under UK Carbon Budgets and which sectors appear more resistant to emission reductions.
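
As a starting point for part 1, the following is a minimal sketch rather than a finished solution. It assumes k-means is among the algorithms covered in the lectures (this notebook does not confirm that), uses four invented trajectories instead of dataset values, and describes each source by two features: its mean emission level and its relative change between the first and last year. All names (`trajectories`, `features`, `standardise`, `kmeans`) are hypothetical.

```python
import math

# Invented trajectories (MtCO2e) for four hypothetical sources; not dataset values.
trajectories = {
    "source_A": [100.0, 80.0, 60.0, 40.0],  # large, falling fast
    "source_B": [90.0, 75.0, 55.0, 35.0],   # large, falling fast
    "source_C": [20.0, 20.0, 19.0, 19.0],   # small, nearly flat
    "source_D": [25.0, 24.0, 24.0, 23.0],   # small, nearly flat
}


def features(series):
    """Mean level and relative change between the first and last year."""
    return [sum(series) / len(series), (series[-1] - series[0]) / series[0]]


def standardise(rows):
    """Z-score each feature column so that no feature dominates the distance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm: assign points to the nearest centroid, then update."""
    for _ in range(n_iter):
        labels = [min(range(len(centroids)),
                      key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        updated = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Keep the previous centroid if a cluster ends up empty.
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else centroids[k])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


names = list(trajectories)
X = standardise([features(trajectories[name]) for name in names])

# Deterministic start: first and last point as initial centroids (arbitrary choice).
labels, _ = kmeans(X, [X[0], X[-1]])
clusters = dict(zip(names, labels))
print(clusters)
```

On this toy input the two fast-declining sources end up in one cluster and the two nearly flat sources in the other. On the real data the number of clusters, the feature set, and the initialisation would all need to be justified.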

## Question 3
### Policy Interpretation and Critical Analysis
1. Using the column Included in UK territorial total, compare emission trends with and without international aviation and shipping. Discuss how this distinction affects progress assessment against UK Carbon Budgets.
2. Using the Included in UK Paris Agreement total indicator, identify which emission sources are covered under the UK’s Paris Agreement reporting. Explain the implications of this coverage for interpreting national emission reduction performance.
3. Based on your data-driven findings, critically assess whether historical emission trends suggest that the UK is structurally aligned with its long-term climate targets. Support your answer with quantitative evidence from your analysis.
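
For part 1, one possible way to set up the comparison is sketched below. The records are invented for illustration and say nothing about the actual size or direction of the effect. The sketch also assumes that rows flagged as not included in the territorial total correspond to international aviation and shipping, which should be checked against the variable descriptions. The names `records`, `yearly_totals`, and `pct_change` are hypothetical.

```python
from collections import defaultdict

# Invented records for illustration only; the real rows come from the emissions sheet.
records = [
    {"Year": 1990, "included": "Yes", "emissions": 800.0},
    {"Year": 1990, "included": "No", "emissions": 20.0},
    {"Year": 2024, "included": "Yes", "emissions": 400.0},
    {"Year": 2024, "included": "No", "emissions": 40.0},
]


def yearly_totals(rows, territorial_only):
    """Sum emissions per year, optionally keeping only territorial-total rows."""
    totals = defaultdict(float)
    for row in rows:
        if territorial_only and row["included"].lower() != "yes":
            continue
        totals[row["Year"]] += row["emissions"]
    return dict(totals)


def pct_change(totals, start, end):
    """Percentage change in the yearly total between two years."""
    return 100.0 * (totals[end] - totals[start]) / totals[start]


territorial_change = pct_change(yearly_totals(records, True), 1990, 2024)
all_rows_change = pct_change(yearly_totals(records, False), 1990, 2024)

print(f"Territorial total only: {territorial_change:.1f}%")
print(f"Including excluded rows: {all_rows_change:.1f}%")
```

Comparing the two percentages on the real data shows how much of the measured progress depends on where the accounting boundary is drawn.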