Author: Jianji Chen

email: jianjichen001@gmail.com

Please feel free to reach out with any questions or comments. Thank you!

Note: If you would like to replicate the results with the following codes, please make sure that you get access to the SHARE datasets.

All the datasets used here are publicly accessible via application on the official website of the SHARE: https://share-eric.eu/data/data-access

The datasets used here are:
1.	SHARE-ERIC (2024). easySHARE. Release version: 9.0.0. SHARE-ERIC. Data set DOI: 10.6103/SHARE.easy.900
2.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 1. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w1.900
3.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 2. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w2.900
4.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 3 – SHARELIFE. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w3.900
5.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 4. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w4.900
6.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 5. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w5.900
7.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 6. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w6.900
8.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 7. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w7.900
9.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 8. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w8.900
10.	SHARE-ERIC (2024). Survey of Health, Ageing and Retirement in Europe (SHARE) Wave 9. Release version: 9.0.0. SHARE-ERIC. Data set. DOI: 10.6103/SHARE.w9.900

In [None]:
import pandas as pd
from functools import reduce
import numpy as np

In [None]:
# Import harmanized demographic information from the easySHARE dataset

demographics_harmonized = pd.read_stata('GH_SHARE_g_rel9-0-0_ALL_datasets_stata/GH_SHARE_g_subset.dta')
## the dta file "GH_SHARE_g_subset" was subseted with only a couple of variables from the original oficial dta file "GH_SHARE_g" due to too large file size
## in stata: "keep mergeid country ragender rabyear raeducl rabcountry rabplace racitizen radadeducl ramomeducl"

demographics_harmonized = demographics_harmonized.rename(
    columns = {"ragender": "gender",
               "rabyear": "birth_year",
               "raeducl": "education_level",
               "rabcountry": "born_same_country",
               "rabplace": "born_country",
               "racitizen": "own_citizenship",
               "radadeducl": "father_education_level",
               "ramomeducl": "mother_education_level"               
               })
demographics_vars1 = list(demographics_harmonized)

In [3]:
demographics_vars1

['mergeid',
 'country',
 'birth_year',
 'gender',
 'education_level',
 'born_country',
 'born_same_country',
 'own_citizenship',
 'mother_education_level',
 'father_education_level']

In [22]:
# Modify the variables' values in harmanized demographic data the for easier understanding

# "case_when" in pandas only execute the conditions mentioned while igore all others that are not mentioned
# "where" in numpy is if_else and does not ignore NaN values, so will be complicated when the variable contains missing values

# demographics_harmonized["gender"] = np.where(demographics_harmonized["gender"] == "1.man", "man", "woman")
demographics_harmonized["gender"] = demographics_harmonized["gender"].str.strip("12.").str.capitalize()

demographics_harmonized["education_level"] = demographics_harmonized["education_level"].str.strip("123.").str.capitalize()

demographics_harmonized["born_country"] = demographics_harmonized["born_country"].str.strip("0123456789.")

demographics_harmonized["born_same_country"] = demographics_harmonized["born_same_country"].case_when(
    [
        (demographics_harmonized["born_same_country"] == "1.in country", "Yes"),
        (demographics_harmonized["born_same_country"] == "0.out of country", "No")
    ]
)

demographics_harmonized["own_citizenship"] = demographics_harmonized["own_citizenship"].str.strip("01.")

demographics_harmonized["mother_education_level"] = demographics_harmonized["mother_education_level"].str.strip("0123456789.")

demographics_harmonized["father_education_level"] = demographics_harmonized["father_education_level"].str.strip("0123456789.")

In [23]:
demographics_harmonized["education_level"]

0         Upper secondary and vocational training
1         Upper secondary and vocational training
2                              Tertiary education
3                              Tertiary education
4                              Tertiary education
                           ...                   
158759    Upper secondary and vocational training
158760    Upper secondary and vocational training
158761    Upper secondary and vocational training
158762    Upper secondary and vocational training
158763    Upper secondary and vocational training
Name: education_level, Length: 158764, dtype: object

In [25]:
# Import common demographics data from wave 1 to up to wave 9

# compile the list of dataframes to merge
demographics_frames = []

# Common demographic information was not asked in the wave 3
waves = list(range(1, 3)) + list(range(4, 10))

for wave in reversed(waves):
    table_demographics = pd.read_stata(f"sharew{wave}_rel9-0-0_ALL_datasets_stata/sharew{wave}_rel9-0-0_dn.dta")

    # mark the participation
    table_demographics[f"participate_w{wave}"] = 1

    # add suffixes to differentiate information of the same variable asked in different waves
    table_demographics = table_demographics.rename(columns = {c: c + f"w{wave}" for c in table_demographics.columns if not c.startswith(("mergeid", "hhid", "coupleid", "participate"))})
    
    globals()[f"demographics_w{wave}"] = table_demographics
    
    print(f"Table demographics_w{wave} read and modified successfully!")

    # compile the list of dataframes to merge
    demographics_frames.append(globals()[f"demographics_w{wave}"])

# merge all the 6 waves of demographics information
demographics_merged = reduce(lambda  left, right: pd.merge(left, right, on = ['mergeid'], how = 'outer'), demographics_frames)
print("8 waves of demographics information merged!")

Table demographics_w9 read and modified successfully!
Table demographics_w8 read and modified successfully!
Table demographics_w7 read and modified successfully!
Table demographics_w6 read and modified successfully!
Table demographics_w5 read and modified successfully!
Table demographics_w4 read and modified successfully!
Table demographics_w2 read and modified successfully!
Table demographics_w1 read and modified successfully!
8 waves of demographics information merged!


In [26]:
## Create demographic information variables by taking the first most recent non-null value across waves (priority order)

####### prepare for the relavant variables
#gender = [col for col in demographics_merged if col.startswith("dn042_")]
#birth_year = [col for col in demographics_merged if col.startswith("dn003_")]
#education_degree = [col for col in demographics_merged if col.startswith("dn010_")]
#education_year = [col for col in demographics_merged if col.startswith("dn041")]
#born_native = [col for col in demographics_merged if col.startswith("dn004_")]
#born_country = [col for col in demographics_merged if col.startswith("dn005c")]
#own_citizenship = [col for col in demographics_merged if col.startswith("dn007_")]
migrant_year = [col for col in demographics_merged if col.startswith("dn006_")]
other_citizenship = [col for col in demographics_merged if col.startswith("dn008c")]
citizenship_year = [col for col in demographics_merged if col.startswith("dn502_")] ## only available from w5
born_citizen = [col for col in demographics_merged if col.startswith("dn503_")] ## only available from w5
born_country_mother = [col for col in demographics_merged if col.startswith("dn504c")] ## only available from w5
born_country_father = [col for col in demographics_merged if col.startswith("dn505c")] ## only available from w5

demographics_vars_cols = [#gender, birth_year, education_degree, education_year,
                          #born_native, born_country, own_citizenship,
                          migrant_year, other_citizenship, citizenship_year, born_citizen,
                          born_country_mother, born_country_father]

demographics_vars2 = [#"gender", "birth_year", "education_degree", "education_year",
                      #"born_native", "born_country", "own_citizenship",
                      "migrant_year", "other_citizenship", "citizenship_year", "born_citizen",
                      "born_country_mother", "born_country_father"]

# Loop over variable names and their corresponding list of wave columns
for var, cols in zip(demographics_vars2, demographics_vars_cols):
    # Only keep columns that actually exist in the merged dataframe
    cols = [c for c in cols if c in demographics_merged.columns]

    with pd.option_context("future.no_silent_downcasting", True):
    # together with "infer_objects(copy = False)" to silence warning of future version of function
        demographics_merged[var] = demographics_merged[cols].bfill(axis = 1).iloc[:, 0].infer_objects(copy = False)

In [27]:
## check if every respondent was always interviewed in the same country
demographics_merged["n_country"] = demographics_merged[[col for col in demographics_merged if col.startswith("country")]].nunique(axis = 1)
demographics_merged["n_country"].value_counts(dropna = False)

n_country
1    158764
Name: count, dtype: int64

In [28]:
# merge the harmanized data and the merged data from common waves
demographics = pd.merge(demographics_merged, demographics_harmonized, on = "mergeid", how = "outer")

In [None]:
# categorize respondents into natives, European immigrants and non-European immigrants

In [29]:
## A list of European countries/regions from 1900 until 2025, suited for the distinct values in the original dataset

european_countries = [
    # Core EU & EFTA modern states
    "Austria", "Belgium", "Bulgaria", "Croatia", "Cyprus", "Czech Republic", 
    "Denmark", "Estonia", "Finland", "France", "Germany", "Greece", "Hungary",
    "Ireland", "Italy", "Latvia", "Lithuania", "Luxembourg", "Malta", 
    "Netherlands", "Poland", "Portugal", "Romania", "Slovakia", "Slovenia", 
    "Spain", "Sweden", "Switzerland", "Norway", "Iceland", "Liechtenstein",
    
    # United Kingdom variants
    "United Kingdom", "United Kingdom of Great Britain and Northern Ireland", "UK", 
    "England", "Scotland", "Wales", "Northern Ireland",
    
    # Historical dissolved states
    "Czechoslovakia",
    "Socialist Federal Republic of Yugoslavia", "Yugoslavia",
    "The former Yugoslav Republic of Macedonia", "Macedonia, The former Yugoslav Republic of",
    "Former Austria-Hungary", "Austria-Hungary",
    "U.S.S.R", "U.S.S.R.", "Soviet Union",
    "Russian Federation", "Russia",
    "Former Eastern Terr. of German Reich", "Former Territories of German Reich",
    "Galicia (Central Europe)",  # historical Habsburg region
    
    # Successor states of USSR in Europe
    "Ukraine", "Belarus", "Moldova", "Republic of Moldova", 
    "Estonia", "Latvia", "Lithuania", "Georgia", "Armenia", "Azerbaijan",
    
    # Successor states of Yugoslavia
    "Slovenia", "Croatia", "Bosnia and Herzegovina", "Bosnia and Herzegowina", 
    "Serbia", "Montenegro", "Kosovo", "North Macedonia",
    
    # Microstates
    "Andorra", "Monaco", "San Marino", "Vatican City", "Holy See",
    
    # Overseas/territorial but European-coded in surveys
    "Faroe Islands", "Greenland", "Aaland Islands",
    
    # Special cases in the dataset
    "Chechnya", "Minor Asia", "Caucasus", "Kurdistan (Region)", "Galicia (Central Europe)",
    "EU-Citizenship", "European E.U."
]

european_countries = [s.lower().replace(".", "").strip(" ") for s in european_countries]

In [30]:
## A list of European citizenships/nationalties from 1900 until 2025, suited for the distinct values in the original dataset

european_adjectives = [
    # Major
    "Austrian", "Belgian", "Bulgarian", "Croatian", "Cypriot", "Czech", "Danish", 
    "Estonian", "Finnish", "French", "German", "Greek", "Hungarian", "Irish", "Italian", 
    "Latvian", "Lithuanian", "Luxembourgian", "Maltese", "Dutch", "Polish", "Portuguese", 
    "Romanian", "Slovak", "Slovakian", "Slovenian", "Spanish", "Swedish", "Swiss", 
    "Norwegian", "Icelandic", "Liechtensteiner", "British", "English", "Scottish", "Welsh",
    
    # Dissolved / historical
    "Yugoslav", "Yugoslavian", "Macedonian",
    "Soviet", "USSR", "Czechoslovak", "Austro-Hungarian",
    
    # Post-Soviet European
    "Ukrainian", "Belarusian", "Moldovan", "Georgian", "Armenian", "Azerbaijani",
    
    # Post-Yugoslav
    "Slovenian", "Croatian", "Bosnian", "Bosnian and Herzegowinan", 
    "Serbian", "Montenegrin", "Kosovar", "Kosovarian", "North Macedonian",
    
    # Microstates
    "Andorran", "Monégasque", "San Marinese", "Vatican", "Holy See",
    
    # Mixed/adjectival combos found in your data
    "German-Spanish", "Italian-Austrian", "Dutch-Czech", "German-Italian", 
    "American-Irish", "Finnish-Greek", "French-German", "British-Estonian", 
    "English-Irish", "Argentinian-Italian", 
    
    # EU generic
    "European", "European E.U.", "EU-Citizenship"
]

european_adjectives = [s.lower().replace(".", "").strip(" ") for s in european_adjectives]

In [31]:
# create a new column of "born_country" that considers the historical regions of Germany, Austria and Hungary
# In theorey, the Eastern European countries (other than Hungary) also need to be reclassified,
# however, here I will exclud the eastern european coungtries of interview from my analysis due to
# the complicated geopolitical and migrantion histories during the past dacades (eg., Soviet Union, Former Yugoslavia)

demographics["born_country_new"] = np.select(
    condlist = [
        # for identifyinhg the natives, historical Western Germany and Eastern Germany are both considered Germany
        np.logical_and(demographics["country"] == "Germany", demographics["born_country"].str.lower().str.contains("german", na = False)),
        # for identifyinhg the natives, historical former Austria-Hungary is treated as Austria or Hungary depending on the country of interview
        np.logical_and(demographics["country"] == "Austria", demographics["born_country"].str.lower().str.contains("austria", na = False)),
        np.logical_and(demographics["country"] == "Hungary", demographics["born_country"].str.lower().str.contains("hungary", na = False)),
        # for those with unclear replies, replace the values as NaN
        demographics["born_country"].isin(["Don't know", "Refusal", "Not codable", "Not yet coded"])
    ],
    
    choicelist = [
        "germany", "austria", "hungary", ""
    ],
    default = demographics["born_country"].str.lower().replace(".", "").str.strip(" ")
)

demographics["born_country_new"] = demographics["born_country_new"].replace('', np.nan, regex = True)

In [32]:
demographics["native_or_immigrant"] = np.select(
    condlist = [
        # indentify native for those born in the same country as the country of interview
        np.logical_or(demographics["born_same_country"] == "Yes",
                      demographics["born_country_new"] == demographics["country"].str.lower().replace(".", "").str.strip(" ")),
        
        # identify those born in another European country different from the country of interview
        demographics["born_country_new"].isin(european_countries + european_adjectives),

        # for those with NaN values, treat them with missing values as well
        demographics["born_country_new"].isna()                 
    ],
    choicelist = [
        "Native", "European immigrant", ""
    ],

    default = "Non-European immigrant"
)

demographics["native_or_immigrant"] = demographics["native_or_immigrant"].replace('', np.nan, regex = True)

In [33]:
pd.crosstab(demographics.country, demographics.native_or_immigrant, dropna=False)

native_or_immigrant,European immigrant,Native,Non-European immigrant
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Austria,559,6829,133
Germany,880,9849,300
Sweden,500,6244,129
Netherlands,131,6061,299
Spain,169,8683,343
Italy,73,8386,56
France,453,8164,706
Denmark,167,5885,108
Greece,94,6444,108
Switzerland,784,3965,131


In [34]:
pd.crosstab(demographics.native_or_immigrant, demographics.own_citizenship, dropna=False)

own_citizenship,No,Yes,NaN
native_or_immigrant,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
European immigrant,4039,7522,5
Native,618,141110,53
Non-European immigrant,924,3477,4
,41,56,915


In [35]:
demographics.own_citizenship.value_counts(dropna=False)

own_citizenship
Yes    152165
No       5622
NaN       977
Name: count, dtype: int64

In [36]:
# subset variables that are interested in
demographics_vars = demographics_vars1 + demographics_vars2 + ["born_country_new", "native_or_immigrant"]
demographics = demographics[demographics_vars]

In [37]:
list(demographics)

['mergeid',
 'country',
 'birth_year',
 'gender',
 'education_level',
 'born_country',
 'born_same_country',
 'own_citizenship',
 'mother_education_level',
 'father_education_level',
 'migrant_year',
 'other_citizenship',
 'citizenship_year',
 'born_citizen',
 'born_country_mother',
 'born_country_father',
 'born_country_new',
 'native_or_immigrant']

In [38]:
## Check the categories/ values of these variables
for var in demographics_vars1:
    print(f"{demographics[var].value_counts(dropna = False)}")

mergeid
SK-999958-02    1
AT-000327-01    1
AT-000327-02    1
AT-000674-01    1
AT-000787-01    1
               ..
AT-002136-03    1
AT-002136-01    1
AT-002132-01    1
AT-001937-01    1
AT-001881-02    1
Name: count, Length: 158764, dtype: int64
country
Germany           11091
Belgium           10900
Czech Republic     9685
France             9404
Spain              9326
Estonia            8611
Italy              8562
Poland             8219
Austria            7575
Slovenia           7031
Sweden             6941
Greece             6653
Netherlands        6651
Denmark            6185
Croatia            5906
Switzerland        4923
Israel             4576
Hungary            4029
Portugal           2775
Finland            2702
Latvia             2663
Luxembourg         2200
Romania            2198
Lithuania          2152
Slovakia           2105
Bulgaria           2021
Cyprus             1331
Malta              1314
Ireland            1035
Name: count, dtype: int64
birth_year
1950.0    5

In [39]:
## save the data
demographics.to_csv('data_out/demographics.csv', index = False)