
## Exercise 3 in Data-Oriented Programming Paradigms - Group 42
# On the Evolution of Nuclear Energy Use


Structure:

* Overview of version + modules required to run
* Which questions are we trying to answer
* Datasets, what is in them, why we chose them
* Data processing & exploration
* q1
* q2
* q3
* q4
* conclusions
* discussion on problems with data & biases, tools & techniques learned, work division

Everything should run in this notebook, using the folder structure within this directory


In [None]:
# installing necessary modules
!pip install pyreadstat



In [None]:
# imports

import pandas as pd
import numpy as np
import pyreadstat
import matplotlib.pyplot as plt

## Data


### Our World in Data - Energy Dataset

Our World in Data is a fantastic resource for datasets of this kind. The data is thoroughly documented, provided in several common formats including simple CSV files, and updated regularly. As such, it is a natural fit to form the backbone of this project. The source data can be found here: https://github.com/owid/energy-data

It should be noted that it is unclear how complete this data is for countries with unreliable reporting, such as China. For the most part, however, the data seems fairly complete for the major countries of interest with regard to nuclear energy use.

In [None]:
df_energy = pd.read_csv("data/OWID_energy/owid-energy-data.csv")


In [None]:
# further preprocessing as needed by other parts

### Our World in Data - CO2 Dataset
(by Dario Giovannini)

Just as with the Energy dataset, the CO2 dataset is well documented and easy to use. The source data can be found here: https://github.com/owid/co2-data

In [None]:
# dataset is loaded
df_co2 = pd.read_csv("data/OWID_CO2/owid-co2-data.csv")

In [None]:
# A joined dataframe combining the Energy and CO2 datasets is created
relevant_energy = df_energy[["iso_code", "country", "year",
                             "gdp", "population",
                             "primary_energy_consumption", "electricity_generation",
                             "energy_per_gdp", "energy_per_capita",
                             "nuclear_electricity", "nuclear_share_elec",
                             "nuclear_consumption", "nuclear_share_energy"]]
relevant_co2 = df_co2[["iso_code", "country", "year",
                       "population", "gdp", "primary_energy_consumption",
                       "co2", "methane", "total_ghg"]]

joined = relevant_energy.join(
    relevant_co2.set_index(["iso_code", "country", "year"]),
    on = ["iso_code", "country", "year"], rsuffix = "_co2", lsuffix = "_energy"
).set_index(["iso_code", "country", "year"])

In [None]:
# the combined dataframe is investigated, specifically for differences in shared variables & missing values
value_exists = joined[joined["co2"].notna()].groupby(["iso_code", "year"]).count()
(value_exists == 0).sum()

comparevals = ["gdp", "population", "primary_energy_consumption"]
energy_vals = [x+"_energy" for x in comparevals]
co2_vals = [x+"_co2" for x in comparevals]

val_diffs = (joined[energy_vals] - joined[co2_vals].values)[joined["primary_energy_consumption_energy"].notna()]
mean_diff = val_diffs.groupby("country").mean()
max_vals = joined[joined["primary_energy_consumption_energy"].notna()][energy_vals].groupby("country").max() * 100

relative_diff = mean_diff / max_vals.loc[mean_diff.index] * 100
print(relative_diff["gdp_energy"][relative_diff["gdp_energy"].isna()].index)

print(relative_diff["gdp_energy"].dropna().abs().sort_values())

While the values for GDP as well as population differ somewhat between the two datasets, these differences are relatively small. For lack of a better option, the differences will remain unreconciled, and the data from the Energy dataset is considered authoritative, as it also forms the basis for all other considerations in this project.

This combined energy and CO2 dataset will form the basis specifically for the analysis of the CO2 impact of nuclear energy.
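The comparison of shared columns between the two sources can be illustrated on a toy example. This is a minimal sketch with made-up numbers; the frames `a` and `b` and the helper `relative_mean_diff` are hypothetical names that do not appear in the notebook, and the result is expressed as a fraction rather than a percentage.

```python
import pandas as pd


def relative_mean_diff(left, right, by="country"):
    """Mean difference between two aligned frames per group,
    relative to the group's maximum value in `left`."""
    diff = (left - right).groupby(level=by).mean()
    scale = left.groupby(level=by).max()
    return diff / scale


# Made-up values for two fictional countries, purely for illustration.
idx = pd.MultiIndex.from_tuples(
    [("A", 2000), ("A", 2001), ("B", 2000), ("B", 2001)],
    names=["country", "year"],
)
a = pd.DataFrame({"gdp": [100.0, 110.0, 50.0, 40.0]}, index=idx)
b = pd.DataFrame({"gdp": [99.0, 110.0, 50.0, 44.0]}, index=idx)

rel = relative_mean_diff(a, b)
print(rel)
```

A value close to zero for a country suggests that the two sources agree on that variable; larger absolute values flag countries where the choice of source matters.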

In [None]:
df_co2_and_energy = joined[energy_vals +
            [x for x in relevant_energy.columns if x not in comparevals and x not in joined.index.names] +
            [x for x in relevant_co2.columns if x not in comparevals and x not in joined.index.names]]
df_co2_and_energy.columns = comparevals + list(df_co2_and_energy.columns[3:])

### Integrated Values Survey
(by: Dario Giovannini)

The World Values Survey and European Values Survey are two long-running projects that collect similar survey data, the former from across the entire world and the latter specifically from Europe. This data is made available in various formats, but separately, as it is aggregated by different institutions. The combination of these two datasets is known as the Integrated Values Survey. This dataset was chosen to provide insights into public opinion, specifically regarding nuclear energy, the environment, and trust in public institutions.

In order to enable users to perform this combination in a consistent manner, a Merge Syntax is provided. Original data, the merge syntax files, and instructions on how to use them can be found here: https://www.worldvaluessurvey.org/WVSEVStrend.jsp

Actually applying this merge syntax turned out to be quite tricky, as I was not familiar with the STATA and SPSS data types, and the merge syntax only exists for these. A considerable amount of trial and error as well as googling led me to a proprietary software package by IBM that deals with SPSS data, found here: https://www.ibm.com/products/spss-statistics

Luckily, this software exists as a trial version, which was used to successfully apply the merge syntax, creating the integrated values survey datafile. This can be downloaded at the place indicated in the install_data.txt file. 

This data was then explored somewhat, to get a first feel for the overall structure of the data and especially the missing data:

In [None]:
# IVS is loaded

ivs_data, ivs_meta = pyreadstat.read_sav("data/WVS/Integrated_values_surveys_1981-2021.sav", encoding="cp850")

In [None]:
# looking at rough distribution of data in time intervals

(ivs_data["S020"].astype(int)//5 * 5).value_counts().sort_index(ascending=False).plot.barh()
plt.title("Number of Survey Responses per 5-Year Interval")

In [None]:
# country names are converted to ISO-standard 3-letter-codes, which didn't exist in the original data in a unified way

iso_codes = pd.read_csv("data/WVS/iso country codes/iso3166.tsv", sep="\t")
iso_codes["Numeric"] = iso_codes["Numeric"].fillna(0).astype(int)
alpha2_to_alpha3 = iso_codes.set_index("Alpha-2 code")["Alpha-3 code"].to_dict()

def map_codes(alpha2val):
    if alpha2val in alpha2_to_alpha3:
        return alpha2_to_alpha3[alpha2val]
    else:
        return "invalid"

ivs_data["country"] = ivs_data["S009"].apply(map_codes)

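# `df` is used as the working name for the IVS data in the following cells
# (assumption: it is meant to be an alias of ivs_data, as it is not defined elsewhere).
df = ivs_data

# since the 1985 interval has so few responses, it is combined with the 1980 one.
df["year"] = (df["S020"].astype(int)//5 * 5)
# .loc avoids chained assignment, which may silently fail to modify df
df.loc[df["year"] == 1985, "year"] = 1980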

In [None]:
responses_per_country_per_interval = df["country"].groupby(df["year"]).apply(lambda x: x.value_counts().sort_values()).unstack(level=0).fillna(0).astype(int)

share_of_invalid_responses = responses_per_country_per_interval.loc["invalid"] / responses_per_country_per_interval.iloc[:-1].sum()
print(share_of_invalid_responses)

responses_per_country_per_interval.loc[["AUT", "DEU", "SWE", "FRA", "RUS", "TUR", "CHN", "IND", "JPN", "IRN", "USA", "BRA", "CAN", "MEX"]]


The share of responses from unrecognized (as per ISO-3166) countries per time interval is fairly small and, as might be expected, more often found in the older parts of the dataset. Looking at a small sample of potentially interesting countries, none are present in all time intervals, which indicates potential issues with continuity in the data.
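The binning and mapping steps above can be sketched on a tiny made-up sample. The frame `toy` and its values are purely illustrative, and the two-entry dictionary stands in for the full ISO table; only the column names `S009` and `S020` are taken from the survey data.

```python
import pandas as pd

# Tiny illustrative subset of the alpha-2 to alpha-3 mapping.
alpha2_to_alpha3 = {"AT": "AUT", "DE": "DEU"}

# Made-up responses: country code (S009) and survey year (S020).
toy = pd.DataFrame({
    "S009": ["AT", "DE", "XX", "AT"],
    "S020": [1981, 1986, 1999, 2003],
})

# Unknown codes become "invalid" instead of raising an error.
toy["country"] = toy["S009"].map(alpha2_to_alpha3).fillna("invalid")

# Floor each year to the start of its 5-year interval,
# then fold the sparse 1985 interval into 1980.
toy["year"] = toy["S020"] // 5 * 5
toy.loc[toy["year"] == 1985, "year"] = 1980

# Share of unmapped responses per interval.
invalid_share = (toy["country"] == "invalid").groupby(toy["year"]).mean()
print(toy)
print(invalid_share)
```

Using `Series.map` with a dictionary followed by `fillna` is one possible alternative to a row-wise `apply`; both approaches should give the same labels.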

Next, a look is taken at the share of missing data for various questions deemed interesting (Please refer to the EVS_WVS_Dictionary_IVS file for details on these questions beyond the short comment given here):

In [None]:
interesting_questions = ["A001", # family
                         "A002", # friends
                         "A003", # leisure time
                         "A004", # politics
                         "A005", # work
                         "A006", # religion
                         "A010", # happiness
                         "A165", # most people can be trusted
                         "B008", # protecting environment vs econ growth
                         "D059", "D060", # sexism
                         "E069_04", # confidence in press
                         "E069_11", # confidence in government
                         "E069_14", # confidence in environmental protection movement
                         "E235", # importance of democracy
                         "F034", # religious person (maybe redundant with A006)
                         "G006", # proud of nationality
                         ]
# share of non-responses - all these questions have responses on a scale from 1-x, where 0 or negative values are considered non-responses of various descriptions.
by_interval = df[interesting_questions].applymap(lambda x: x if x > 0 else np.nan).isna().groupby(df["year"])
non_responses = (by_interval.sum() / by_interval.count().max()).T
non_responses

Most of these questions have fairly high rates of non-response in the 1980 interval; notable exceptions are A165 (general trust) and F034 (religious person). Other questions undergo large fluctuations. Overall, A001-A006, A165, E069_04, F034 and G006 seem like the most reliable values in terms of the share of missing values.
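How the share of non-responses is derived can be shown with a small made-up example. This is only a sketch: the frame `answers` and its values are invented, and the group mean of a boolean mask is used here as an equivalent way of computing the share.

```python
import pandas as pd

# Made-up answers: values <= 0 stand for the various kinds of non-response.
answers = pd.DataFrame({
    "year": [1980, 1980, 1990, 1990],
    "A165": [1, -2, 2, 1],
    "B008": [-4, -1, 1, 0],
})
questions = ["A165", "B008"]

# Keep valid answers, turn everything else into NaN, then flag the NaNs.
is_missing = answers[questions].where(answers[questions] > 0).isna()

# The mean of a boolean column per interval is the share of non-responses.
non_response_share = is_missing.groupby(answers["year"]).mean().T
print(non_response_share)
```

Each row of the result is a question and each column an interval, which matches the layout of the table discussed above.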

[ Power Plant DB Stuff ]

## Questions Asked
The original questions on the topic provided the baseline for our investigation into the data. We further refined the questions into these specific forms:


1. How has the use of nuclear energy evolved over time? 
  1. Has the “focus” of nuclear energy shifted in terms of nations or regions heavily employing it?
  1. Are there trends in the types of generators used?
  1. Are there observable impacts of events / disasters (e.g. Chernobyl, Fukushima) on the use of nuclear energy?
1. How well does the use of nuclear energy correlate with changes in carbon emissions? 
1. Are there characteristics of a country that correlate with increases or decreases in the use of nuclear energy?

We chose these questions to gain insights into the evolution of the use of nuclear energy over time as well as regionally, and to investigate the impact it has had as an alternative to fossil fuels in reducing greenhouse gas emissions, specifically CO2.

### Has the "focus" of nuclear energy shifted in terms of nations or regions heavily employing it?

[ Map ]

### Are there trends in the types of generators used? 

not sure if we actually did this / if the PowerplantDB data has this but if not we should still mention it because it's on the workplan


### Are there observable impacts of events / disasters (e.g. Chernobyl, Fukushima) on the use of nuclear energy?

[ events analysis]

### How well does the use of nuclear energy correlate with changes in carbon emissions?
( by Dario Giovannini )

This question will be answered by looking at the yearly difference in CO2 emissions and nuclear energy generation, as well as a look at the overall trend of energy use, per country.

In [None]:
# A dataframe containing all relevant absolute as well as scaled data is created.
# Not all of these values ended up being used, as some turned out to show the same
# thing better, and were thus preferred.

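# .copy() makes this an independent dataframe, so that adding columns below
# does not trigger a SettingWithCopyWarning.
data_carboncorr = df_co2_and_energy[["population", "gdp", "co2", "total_ghg",
           "primary_energy_consumption", "electricity_generation",
           "nuclear_consumption", "nuclear_electricity"]].copy()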

data_carboncorr["co2_per_capita"] = df_co2_and_energy["co2"] / df_co2_and_energy["population"] * 1e6
data_carboncorr["co2_per_gdp"] = df_co2_and_energy["co2"] / df_co2_and_energy["gdp"] * 1e6
data_carboncorr["co2_per_kwh"] = df_co2_and_energy["co2"] / df_co2_and_energy["primary_energy_consumption"]
data_carboncorr["total_ghg_per_capita"] = df_co2_and_energy["total_ghg"] / df_co2_and_energy["population"]
data_carboncorr["total_ghg_per_gdp"] = df_co2_and_energy["total_ghg"] / df_co2_and_energy["gdp"]
data_carboncorr["energy_per_capita"] = df_co2_and_energy["primary_energy_consumption"] / df_co2_and_energy["population"]
data_carboncorr["energy_per_gdp"] = df_co2_and_energy["primary_energy_consumption"] / df_co2_and_energy["gdp"]
data_carboncorr["energy_normalized"] = df_co2_and_energy["primary_energy_consumption"] / df_co2_and_energy["primary_energy_consumption"].unstack(level=-1).T.max()
data_carboncorr["electricity_per_capita"] = df_co2_and_energy["electricity_generation"] / df_co2_and_energy["population"]
data_carboncorr["electricity_per_gdp"] = df_co2_and_energy["electricity_generation"] / df_co2_and_energy["gdp"]

data_carboncorr["nuclear_consumption_per_capita"] = df_co2_and_energy["nuclear_consumption"] / df_co2_and_energy["population"]
data_carboncorr["nuclear_consumption_per_gdp"] = df_co2_and_energy["nuclear_consumption"] / df_co2_and_energy["gdp"]
data_carboncorr["nuclear_electricity_per_capita"] = df_co2_and_energy["nuclear_electricity"] / df_co2_and_energy["population"]
data_carboncorr["nuclear_electricity_per_gdp"] = df_co2_and_energy["nuclear_electricity"] / df_co2_and_energy["gdp"]

data_carboncorr["nuclear_energy_share"] = df_co2_and_energy["nuclear_consumption"] / df_co2_and_energy["primary_energy_consumption"]
data_carboncorr["nuclear_energy_share_pct"] = df_co2_and_energy["nuclear_consumption"] / df_co2_and_energy["primary_energy_consumption"] * 100
data_carboncorr["nuclear_electricity_share"] = df_co2_and_energy["nuclear_electricity"] / df_co2_and_energy["electricity_generation"]
data_carboncorr["nuclear_electricity_share_pct"] = df_co2_and_energy["nuclear_electricity"] / df_co2_and_energy["electricity_generation"] * 100

iso_name_dict = {iso:country for iso,country,year in data_carboncorr.index}

The most useful of these values turned out to be the CO2 per kWh metric. In the original data, the CO2 column is given in million tonnes and the primary energy consumption in terawatt-hours. Since both carry a factor of 10^9 relative to kilograms and kilowatt-hours respectively, the ratio comes out directly in kg CO2 per kWh. This is a great metric to show the overall "dirtiness" of the energy used.
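The unit cancellation can be checked with a short calculation. The function name and the input numbers below are made up for illustration; only the two conversion factors are fixed.

```python
MT_TO_KG = 1e9    # 1 million tonnes = 1e6 t = 1e9 kg
TWH_TO_KWH = 1e9  # 1 TWh = 1e9 kWh


def co2_intensity_kg_per_kwh(co2_mt, energy_twh):
    """Convert both quantities to base units and take the ratio."""
    return (co2_mt * MT_TO_KG) / (energy_twh * TWH_TO_KWH)


# Made-up example: 300 Mt of CO2 for 1500 TWh of primary energy.
explicit = co2_intensity_kg_per_kwh(300.0, 1500.0)
shortcut = 300.0 / 1500.0  # the plain ratio used in the notebook
print(explicit, shortcut)
```

Both values agree, which is why the notebook can divide the two columns directly without any conversion factor.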

Another value worth explaining is the "normalized" energy, which is simply a country's primary energy consumption divided by that country's maximum value in the data. This brings the energy use to a scale of 0 to 1, which makes it easily comparable with other ratio data such as the nuclear energy share.
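A per-country normalization of this kind can be sketched as follows. The numbers are made up, and `groupby(...).transform("max")` is just one possible way to broadcast each country's maximum back onto its rows; it is not necessarily how the notebook cell does it.

```python
import pandas as pd

# Made-up primary energy consumption for two countries.
idx = pd.MultiIndex.from_tuples(
    [("FRA", 2000), ("FRA", 2001), ("LTU", 2000), ("LTU", 2001)],
    names=["iso_code", "year"],
)
energy = pd.Series([2000.0, 2500.0, 100.0, 80.0], index=idx,
                   name="primary_energy_consumption")

# Divide every value by the maximum of its own country.
energy_normalized = energy / energy.groupby(level="iso_code").transform("max")
print(energy_normalized)
```

After this step, each country peaks at exactly 1 regardless of its absolute size, so only the shape of the trajectory remains.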

#### Nuclear Energy "Leaderboard"
The first step in this analysis was to figure out which countries use nuclear energy at all. To do this, an aggregate dataframe was created containing information such as the total energy from nuclear sources per country, as well as the share of total energy consumption from nuclear sources. It should be noted here that in general this analysis considered the total energy consumption rather than electricity, as this presents a more complete picture of the environmental impacts of energy use.

In [None]:
nuclear_countries = data_carboncorr[data_carboncorr["nuclear_consumption"] > 0].groupby(["iso_code", "country"]).any().index
nuclear_country_info = pd.DataFrame(index=nuclear_countries)
total_nuclear_energy = data_carboncorr["nuclear_consumption"].reset_index().groupby(["iso_code","country"]).sum().drop("year", axis=1)
nuclear_country_info["total_nuclear_energy"] = total_nuclear_energy.loc[nuclear_country_info.index]
total_energy = data_carboncorr["primary_energy_consumption"].reset_index().groupby(["iso_code","country"]).sum().drop("year", axis=1)
nuclear_country_info["total_energy"] = total_energy.loc[nuclear_country_info.index]
nuclear_country_info["lifetime_nuclear_share"] = nuclear_country_info["total_nuclear_energy"] / nuclear_country_info["total_energy"]

nuclear_country_info.sort_values(by="lifetime_nuclear_share")

Perhaps unsurprisingly, France turns out to be the country with the overall largest share of energy use from nuclear sources. Lithuania came as a surprise to me in featuring so high on this list, but the reason for this shall become apparent later. 

#### Correlation in Nuclear Energy Use Change and CO2 Emission Change

The core of this question is investigating a potential correlation between a country's CO2 emissions and its use of nuclear energy. This is done primarily by looking at the yearly change in the share of a country's energy that is produced by nuclear sources, and the yearly change in the CO2 per kWh metric, which measures the "dirtiness" of the country's energy - that is, how much CO2 (in kilograms) is produced per kWh of primary energy use from all sources.

Incidentally, while only one of these values is a ratio and thus expected to be in the range from 0 to 1, the CO2 per kWh metric happens to fall in a similar range, making the comparison convenient to visualize.
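The idea of correlating yearly changes rather than absolute levels can be illustrated for a single fictional country. All values below are invented; the sketch only shows the mechanics of `diff` followed by a Pearson correlation.

```python
import pandas as pd

# Made-up trajectory of one fictional country.
toy = pd.DataFrame(
    {
        "nuclear_energy_share": [0.10, 0.15, 0.25, 0.25, 0.20],
        "co2_per_kwh": [0.30, 0.27, 0.22, 0.22, 0.25],
    },
    index=pd.Index(range(2000, 2005), name="year"),
)

# Year-over-year changes; the first row has no predecessor and is dropped.
changes = toy.diff().dropna()

# Pearson correlation between the two series of changes.
corr = changes["nuclear_energy_share"].corr(changes["co2_per_kwh"])
print(round(corr, 3))
```

Working with differences removes long-term trends that both series may share for unrelated reasons, so a negative value here means that years with a rising nuclear share tend to be years with falling CO2 per kWh.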

In [None]:
# A small helper function extracts the correlation between the two target values.
def get_corr(df):
    df.columns = df.columns.get_level_values(0)
    c = df.corr()
    return c.iloc[0,1]

yearly_change = data_carboncorr.unstack(level=[0,1]).diff()[["nuclear_energy_share", "co2_per_kwh"]]
yearly_change_corr = yearly_change.groupby(level=[1,2], axis=1).apply(get_corr).dropna()
yearly_change_corr.sort_values().plot.barh()

This plot shows that for most countries that use any amount of nuclear energy, the correlation between the share of nuclear energy and the CO2 per kWh metric is negative, meaning that as the nuclear energy share increases, the amount of CO2 per kWh can be expected to decrease. This matches well with expectations, given that nuclear energy is not derived from fossil fuels and produces no direct carbon emissions in operation, although building the power plants is of course a significant effort, and issues around storing the nuclear waste are not considered at all here.

The correlation, while technically "there", is for the most part not very strong, with only a few countries reaching values of even +/- 0.4. This indicates that there are other effects at play as well, presumably a general increase in the efficiency of fossil fuel-based power plants, as well as the general rise of renewable energies such as solar and wind.

Very notably, Lithuania has a very high negative correlation evident from this yearly change data. To investigate this and some other interesting cases, some individual countries are looked at in more detail.

In [None]:
# A convenient function for quickly plotting a single country's trajectory through the years.
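def make_country_plots(iso):
    # select all years for one country from the prepared dataframe
    d = data_carboncorr.loc[(iso, iso_name_dict[iso])]
    # time series of the three main metrics
    fig1 = d[["co2_per_kwh", "nuclear_energy_share", "energy_normalized"]].plot()
    # scatter plot of the yearly changes in the two correlated metrics
    fig2 = d.diff().plot.scatter("co2_per_kwh", "nuclear_energy_share")
    plt.show()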

In [None]:
# case study: Lithuania
make_country_plots("LTU")

As it turns out, Lithuania shut down their last nuclear reactors in 2004 and 2009. This is clearly visible in this plot through the drop in nuclear energy share in those years, very clearly in 2009 when it drops to 0. At the same time this coincides with a significant rise in the CO2 per kWh metric, showing that the energy that had previously come from nuclear sources was most likely made up for by energy from fossil sources, which of course emit significant amounts of CO2. As a result, the correlation - as can be seen from the initial results as well as the scatter plot shown here - is a very strong one, and even without the outlier year of 2009 would be significant.

Another interesting feature of this country's development is the large drop in energy use around 1991. This coincides with the dissolution of the Soviet Union, which is unlikely to be a coincidence, but further investigation of this effect is outside the scope of this project.

In [None]:
# case study: France
make_country_plots("FRA")

France, being the overall leader in nuclear energy share worldwide, is an obvious candidate for further examination. It also shows the second-strongest correlation, although only barely beyond -0.6 and thus not very significant. Nonetheless, studying these plots tells a compelling story - the massive expansion in nuclear energy use in the 1980s was accompanied by a significant reduction in CO2 emissions per kWh used. Subsequent stagnation in nuclear energy share led to a similar stagnation in energy "dirtiness", although it is still slightly decreasing - likely due to the aforementioned factors of increased efficiency of power plants and the general spread of renewable energies.

In [None]:
# case study: Ukraine
make_country_plots("UKR")

Ukraine is an interesting case in that it is an example of a positive correlation, which is counter to the intuitive expectation. This correlation is very weak and thus shouldn't be considered significant, but still bears investigation.

The first plot shows a similarly steep decline in overall energy use as Lithuania, which again is likely connected to the dissolution of the Soviet Union. Overall while the share of nuclear energy is trending upwards in time, the CO2 per kWh metric also trends upwards, which leads to the observed slightly positive correlation. The explanation is likely that the upwards trending nuclear energy share is more due to decreased overall energy consumption rather than expansion of nuclear energy use however.

In [None]:
# case study: Germany
make_country_plots("DEU")

Germany is an interesting country to look at simply due to its geographical and cultural proximity to Austria (which of course does not have nuclear energy itself, except what might be imported, which is difficult to track). Germany has an overall weak correlation between nuclear energy share and CO2 per kWh, which is partially explained by its overall low nuclear energy share, but more significantly by the reduction in nuclear energy use starting in the mid-2000s, which is accompanied not by an increase in CO2 per kWh, but by a further steady decline, accelerating in recent years. This is likely due to a comparatively large expansion of renewable energy, as well as potentially more efficient fossil fuel-based power plants.

In [None]:
# case study: USA
make_country_plots("USA")

A slow increase in nuclear energy share corresponds to an almost constant kg CO2 per kWh value. Simultaneously, the slight decline in energy "dirtiness" in the 2010s corresponds to a roughly constant nuclear energy share. Thus, there is very little correlation to be found here, and reductions in CO2 per kWh cannot solely be attributed to nuclear energy.

In [None]:
# case study: Japan
make_country_plots("JPN")

Immediately obvious is the large drop in nuclear energy share after 2010, which is almost certainly due to the Fukushima disaster following a tsunami in March 2011. Japan has a comparatively strong correlation between nuclear energy share and CO2 per kWh, although it is still weak with a value of ~ -0.4. Following the drop, the nuclear energy share is now increasing again, accompanied by a renewed reduction in CO2 per kWh. Very likely this is due to short-term solutions for replacing the missing nuclear energy being phased out in favour of renewable energies, as well as new nuclear sources as public fear and scepticism wane.

In [None]:
# case study: World
make_country_plots("OWID_WRL")

As the final "case study", a look was taken at the overall data for the entire world. Given that only a few countries use nuclear energy to a significant degree, the overall low share of nuclear energy use is expected. Overall, no significant correlation can be detected from this data, but it is quite interesting to see the rather steep increase in worldwide energy use, especially when compared to the graphs for the same value for some of the countries looked at here, many of which are showing a decrease in total energy use in recent years, which indicates a shift in where growth happens worldwide.

#### Conclusions for Correlation between Nuclear Energy Use and CO2 Emissions

As has been shown, different countries show very different profiles with regard to this question. Overall, a slight trend can be observed that widespread use of nuclear energy (which is to say, the country has a high nuclear energy share) is accompanied by a reduction in CO2 emissions per kWh of energy used in total. This trend holds especially strongly for cases where there is a rapid decline in nuclear energy use - such as in the cases of Lithuania and Japan - when carbon-based fossil fuels are used as immediately available replacements. For countries such as France, for which nuclear energy constitutes a large portion of their overall energy use, this correlation can also be observed in the stages of expansion of nuclear energy. For most countries, however, neither extreme holds, and thus no statistically significant correlation can be identified.

### Are there characteristics of a country that correlate with increases or decreases in the use of nuclear energy? 

[ stuff ]

## Conclusions


## Key Takeaways, Techniques Learned, Problems & Biases, Work Division 

I'm sure writing this won't be annoying at all