
## Exercise 3 in Data-Oriented Programming Paradigms - Group 42
# On the Evolution of Nuclear Energy Use


Structure:

* Overview of version + modules required to run
* Which questions are we trying to answer
* Datasets, what is in them, why we chose them
* Data processing & exploration
* q1
* q2
* q3
* q4
* conclusions
* discussion on problems with data & biases, tools & techniques learned, work division

Everything should run in this notebook, using the folder structure within this directory


In [15]:
# installing necessary modules
!pip install pyreadstat

Collecting pyreadstat
  Downloading pyreadstat-1.1.4-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Collecting pandas>=1.2.0
  Downloading pandas-1.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
Installing collected packages: pandas, pyreadstat
  Attempting uninstall: pandas
    Found existing installation: pandas 1.1.5
    Uninstalling pandas-1.1.5:
      Successfully uninstalled pandas-1.1.5
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-colab 1.0.0 requires pandas~=1.1.0; python_version >= "3.0", but you have pandas 1.3.5 which is incompatible.
Successfully installed pandas-1.3.5 pyreadstat-1.1.4


In [1]:
# imports

import pandas as pd
import numpy as np
import pyreadstat
import matplotlib.pyplot as plt

## Integrated Values Survey
(by: Dario Giovannini)

The World Values Survey and the European Values Survey are two long-running projects that collect similar survey data, the former worldwide and the latter across Europe. The data is made available in various formats, but separately, because the two surveys are maintained by different institutions. The combination of these two datasets is known as the Integrated Values Survey (IVS).

In order to enable users to perform this combination in a consistent manner, a Merge Syntax is provided. Original data, the merge syntax files, and instructions on how to use them can be found here: https://www.worldvaluessurvey.org/WVSEVStrend.jsp

Actually applying this merge syntax turned out to be quite tricky: it only exists for STATA and SPSS, and I was not familiar with either format. A considerable amount of trial and error, as well as googling, led me to IBM's proprietary software for working with SPSS data, found here: https://www.ibm.com/products/spss-statistics

Luckily, a trial version of this software exists, which was used to apply the merge syntax successfully and create the Integrated Values Survey data file. The file can be downloaded from the location indicated in the install_data.txt file.

This data was then explored briefly to get a first feel for its overall structure and, in particular, for missing data:

In [None]:
# IVS is loaded

ivs_data, ivs_meta = pyreadstat.read_sav("Integrated_values_surveys_1981-2021.sav", encoding="cp850")

In [None]:
# looking at rough distribution of data in time intervals

(ivs_data["S020"].astype(int)//5 * 5).value_counts().sort_index(ascending=False).plot.barh()
plt.title("Number of Survey Responses per 5-Year Interval")
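
The interval binning used for this plot relies on integer floor division: `year // 5 * 5` rounds each survey year down to the start of its 5-year interval. A minimal sketch of that arithmetic on a handful of made-up years (the helper name `five_year_interval` and the sample values are illustrative assumptions, not part of the IVS data):

```python
from collections import Counter


def five_year_interval(year: int) -> int:
    """Round a year down to the start of its 5-year interval."""
    return year // 5 * 5


# Hypothetical survey years, for illustration only.
sample_years = [1981, 1984, 1985, 1990, 1994, 1999, 2004, 2021]

# Count how many sample years fall into each interval.
interval_counts = Counter(five_year_interval(y) for y in sample_years)

for start in sorted(interval_counts):
    print(f"{start}-{start + 4}: {interval_counts[start]} responses")
```

Here 1981 and 1984 both land in the 1980 interval, while 1985 starts an interval of its own.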

In [None]:
# country names are converted to ISO-standard 3-letter-codes, which didn't exist in the original data in a unified way

iso_codes = pd.read_csv("iso country codes/iso3166.tsv", sep="\t")
iso_codes["Numeric"] = iso_codes["Numeric"].fillna(0).astype(int)
alpha2_to_alpha3 = iso_codes.set_index("Alpha-2 code")["Alpha-3 code"].to_dict()

def map_codes(alpha2val):
    if alpha2val in alpha2_to_alpha3:
        return alpha2_to_alpha3[alpha2val]
    else:
        return "invalid"

ivs_data["country"] = ivs_data["S009"].apply(map_codes)

# the following cells refer to the survey data as df
df = ivs_data

# since the 1985 interval has so few responses, it is combined with the 1980 one
df["year"] = df["S020"].astype(int) // 5 * 5
df.loc[df["year"] == 1985, "year"] = 1980
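
The two steps of this cell, mapping 2-letter codes to 3-letter codes and folding the sparse 1985 interval into 1980, can be tried out on toy data. The sketch below is one possible vectorised formulation using `Series.map` and `.loc`; the lookup table `alpha2_to_alpha3_demo` and the frame `demo` are made-up stand-ins, not the real ISO table or survey data.

```python
import pandas as pd

# Hypothetical lookup table and responses, for illustration only.
alpha2_to_alpha3_demo = {"AT": "AUT", "DE": "DEU", "SE": "SWE"}
demo = pd.DataFrame({
    "S009": ["AT", "DE", "XX", "SE", "AT"],    # 2-letter country codes
    "S020": [1982, 1986, 1990, 1999, 2018],    # survey years
})

# Series.map leaves unmatched codes as NaN; fillna then labels them "invalid".
demo["country"] = demo["S009"].map(alpha2_to_alpha3_demo).fillna("invalid")

# Bin into 5-year intervals, then fold the 1985 interval into 1980.
demo["year"] = demo["S020"] // 5 * 5
demo.loc[demo["year"] == 1985, "year"] = 1980

print(demo)
```

Assigning through `.loc` with a boolean mask writes to the frame itself, which avoids pandas' `SettingWithCopyWarning`.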

In [None]:
# number of responses per country (rows) and 5-year interval (columns)
responses_per_country_per_interval = pd.crosstab(df["country"], df["year"])

# share of all responses per interval whose country code could not be mapped to ISO 3166
share_of_invalid_responses = responses_per_country_per_interval.loc["invalid"] / responses_per_country_per_interval.sum()
print(share_of_invalid_responses)

responses_per_country_per_interval.loc[["AUT", "DEU", "SWE", "FRA", "RUS", "TUR", "CHN", "IND", "JPN", "IRN", "USA", "BRA", "CAN", "MEX"]]


The share of responses from countries not recognized by ISO 3166 is fairly small in every time interval and, as might be expected, is larger in the older parts of the dataset. In a small sample of potentially interesting countries, none is present in all time intervals, which points to potential issues with continuity in the data.
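
The continuity check described above can also be done programmatically rather than by reading the table. The following is a minimal sketch on a made-up counts table with the same layout (countries as rows, interval starts as columns); the country labels and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical response counts (rows: country, columns: interval start).
counts_demo = pd.DataFrame(
    {1980: [1200, 0, 900], 1990: [1500, 1000, 0], 2000: [1400, 1100, 1000]},
    index=["AAA", "BBB", "CCC"],
)

# A country counts as covered in an interval if it has at least one response.
covered = counts_demo > 0

# Number of intervals in which each country appears.
intervals_covered = covered.sum(axis=1)

# Countries that appear in every interval.
all_covered = covered.all(axis=1)
complete_countries = all_covered[all_covered].index.tolist()

print(intervals_covered)
print("Present in all intervals:", complete_countries)
```

Applied to the real counts table, an empty list of complete countries would confirm the continuity problem for the selected sample.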

Next, the share of missing data is examined for a number of questions deemed interesting (please refer to the EVS_WVS_Dictionary_IVS file for details on these questions beyond the short comments given here):

In [None]:
interesting_questions = ["A001", # family
                         "A002", # friends
                         "A003", # leisure time
                         "A004", # politics
                         "A005", # work
                         "A006", # religion
                         "A010", # happiness
                         "A165", # most people can be trusted
                         "B008", # protecting environment vs econ growth
                         "D059", "D060", # sexism
                         "E069_04", # confidence in press
                         "E069_11", # confidence in government
                         "E069_14", # confidence in environmental protection movement
                         "E235", # importance of democracy
                         "F034", # religious person (maybe redundant with A006)
                         "G006", # proud of nationality
                         ]
# share of non-responses per interval - all these questions have responses on a scale from 1 to x; 0 and negative values encode the various kinds of non-response
is_non_response = ~(df[interesting_questions] > 0)
# the mean of a boolean column is the share of True values, so each interval is normalised by its own number of respondents
non_responses = is_non_response.groupby(df["year"]).mean().T
non_responses

Most of these questions have fairly high rates of non-response in the 1980 interval; notable exceptions are A165 (general trust) and F034 (religious person). Other questions fluctuate considerably between intervals. Overall, A001-A006, A165, E069_04, F034 and G006 seem to be the most reliable in terms of the share of missing values.
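
To make the non-response rate concrete, here is a small worked example on made-up answers. It assumes, as stated in the cell above, that values of 0 or below encode non-responses; the frame `demo` and its values are invented for illustration.

```python
import pandas as pd

# Hypothetical answers: positive values are valid, 0 and negatives are non-responses.
demo = pd.DataFrame({
    "year": [1980, 1980, 1980, 1990, 1990],
    "A001": [1, -2, 0, 3, 4],
    "A165": [1, 2, -1, 2, -5],
})
questions = ["A001", "A165"]

# True wherever the answer is not a valid (positive) response.
is_non_response = ~(demo[questions] > 0)

# The mean of a boolean column is the share of True values within each interval.
non_response_share = is_non_response.groupby(demo["year"]).mean().T

print(non_response_share)
```

In this toy data, two of the three 1980 answers to A001 are non-responses, giving a share of 2/3, while the 1990 share for A001 is 0.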