## Health Condition using 'hcond' and 'hcondnew'
### By Gavin Qu - May 23rd 2024
#### Data Extraction 

-	Encode the hcond and new-condition (hcondn, hcondnew) variables correctly.
-	Individuals are asked about pre-existing health conditions at their first UKHLS interview (the hcond(i) variables, where i codes the condition). In subsequent interviews they are asked whether they have developed new conditions: the hcondn(i) variables in waves 2-9 and the hcondnew(i) variables from wave 10 onwards.
-	In short: hcond covers wave 1 and new entrants in later waves, hcondn covers waves 2-9, and hcondnew covers waves 10-13.

For example, hcond1-19 appear in waves 1 and 3-13 and are asked of new interviewees only. hcondn1-19 appear in waves 2-9 and ask existing interviewees about newly developed conditions, and hcondnew1-19 ask the same question in waves 10-13.
hcond21, hcondnew21 and hcondnew22 exist only in waves 10-13.
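
The wave coverage described above can be written down as a small lookup, which makes it easy to check which prefixed column names to expect in each file. This is only a sketch: the names `STEM_WAVES`, `wave_prefix` and `expected_columns` are hypothetical, and the ranges assume hcondn runs over waves 2-9 and hcondnew starts at wave 10.

```python
from string import ascii_lowercase

# Wave coverage per variable stem, taken from the notes above.
# Assumption: hcondn covers waves 2-9 and hcondnew covers waves 10-13.
STEM_WAVES = {
    "hcond": [1] + list(range(3, 14)),
    "hcondn": list(range(2, 10)),
    "hcondnew": list(range(10, 14)),
}


def wave_prefix(wave_number):
    """Map a wave number (1-13) to its UKHLS letter prefix ('a'-'m')."""
    return ascii_lowercase[wave_number - 1]


def expected_columns(stem, codes):
    """List the prefixed column names expected for one stem and a set of condition codes."""
    return [f"{wave_prefix(w)}_{stem}{code}" for w in STEM_WAVES[stem] for code in codes]


print(expected_columns("hcondn", [1, 2])[:4])
```

Comparing this list against the columns actually found in each file is a quick way to spot waves where a variable is unexpectedly missing.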

'dcsedfl_dv' holds death data, but it is only 50% accurate when it comes to health mortality.

In [13]:
import pandas as pd
import os

# Base directory containing the data files
base_dir = '/Users/gavinqu/Desktop/School/Dissertation/UKDA-6614-stata/stata/stata13_se/ukhls'

# List of base variable names to extract, including 'pidp'
base_variables = [
    'pidp', 'hcond1', 'hcond2', 'hcond3', 'hcond4', 'hcond5', 'hcond6', 'hcond7',
    'hcond8', 'hcond9', 'hcond10', 'hcond11', 'hcond12', 'hcond13', 'hcond14',
    'hcond15', 'hcond16', 'hcond17', 'hcond18', 'hcond19', 'hcond21', 'hcond22',
    'hcondn1', 'hcondn2', 'hcondn3', 'hcondn4', 'hcondn5', 'hcondn6', 'hcondn7',
    'hcondn8', 'hcondn9', 'hcondn10', 'hcondn11', 'hcondn12', 'hcondn13', 'hcondn14',
    'hcondn15', 'hcondn16', 'hcondn17', 'hcondn18', 'hcondn19', 'hcondnew1', 'hcondnew2',
    'hcondnew3', 'hcondnew4', 'hcondnew5', 'hcondnew6', 'hcondnew7', 'hcondnew8',
    'hcondnew10', 'hcondnew11', 'hcondnew12', 'hcondnew13', 'hcondnew14', 'hcondnew15',
    'hcondnew16', 'hcondnew19', 'hcondnew21', 'hcondnew22'
]

# Wave prefixes from 'a' to 'm'
wave_prefixes = [chr(i) for i in range(ord('a'), ord('n'))]

# Function to load and filter wave data
def load_wave_data(wave_prefix, base_dir, base_variables):
    file_path = os.path.join(base_dir, f'{wave_prefix}_indresp.dta')
    if os.path.exists(file_path):
        print(f"Loading data from {file_path}")
        wave_data = pd.read_stata(file_path, convert_categoricals=False)
        
        # Construct the actual variable names for the current wave
        wave_variables = [f'{wave_prefix}_{var}' if var != 'pidp' else var for var in base_variables]
        
        # Keep the desired variables that exist in this wave, preserving their listed order
        # (a set intersection would return the columns in arbitrary order)
        available_columns = [col for col in wave_variables if col in wave_data.columns]
        print(f"Available columns in {wave_prefix}: {available_columns}")
        
        # Select only the available columns
        if available_columns:
            selected_data = wave_data[list(available_columns)].copy()
            selected_data['wave'] = wave_prefix
            return selected_data
    return None

# List to store data from each wave
all_waves_data = []

# Loop through wave prefixes
for prefix in wave_prefixes:
    wave_data = load_wave_data(prefix, base_dir, base_variables)
    if wave_data is not None:
        all_waves_data.append(wave_data)

# Check if we have any data
if all_waves_data:
    # Combine all waves into a single DataFrame
    combined_data = pd.concat(all_waves_data, ignore_index=True)

    # Save the combined data to a CSV file
    combined_data.to_csv('combined_ukhls_data.csv', index=False)

    # Display the first few rows of the combined data
    print(combined_data.head())
else:
    print("No data was loaded. Please check the file paths and variable names.")

Loading data from /Users/gavinqu/Desktop/School/Dissertation/UKDA-6614-stata/stata/stata13_se/ukhls/a_indresp.dta
Available columns in a: {'a_hcond6', 'a_hcond12', 'a_hcond14', 'a_hcond10', 'a_hcond8', 'a_hcond4', 'a_hcond2', 'a_hcond7', 'a_hcond9', 'a_hcond11', 'a_hcond13', 'a_hcond15', 'a_hcond16', 'a_hcond5', 'a_hcond17', 'a_hcond3', 'pidp', 'a_hcond1'}
Loading data from /Users/gavinqu/Desktop/School/Dissertation/UKDA-6614-stata/stata/stata13_se/ukhls/b_indresp.dta
Available columns in b: {'b_hcondn11', 'b_hcondn8', 'b_hcondn10', 'b_hcondn5', 'b_hcondn13', 'b_hcondn2', 'b_hcondn4', 'b_hcondn16', 'b_hcondn7', 'b_hcondn9', 'b_hcondn15', 'b_hcondn3', 'b_hcondn14', 'b_hcondn12', 'b_hcondn17', 'b_hcondn1', 'pidp', 'b_hcondn6'}
Loading data from /Users/gavinqu/Desktop/School/Dissertation/UKDA-6614-stata/stata/stata13_se/ukhls/c_indresp.dta


One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used.  This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
  wave_data = pd.read_stata(file_path, convert_categoricals=False)


Available columns in c: {'c_hcond9', 'c_hcondn9', 'c_hcondn13', 'c_hcondn2', 'c_hcond17', 'c_hcondn5', 'c_hcond11', 'c_hcondn3', 'c_hcond2', 'c_hcond8', 'c_hcondn14', 'c_hcond12', 'c_hcond3', 'c_hcond16', 'c_hcond5', 'c_hcondn6', 'c_hcondn15', 'c_hcondn1', 'c_hcond4', 'c_hcond6', 'c_hcond13', 'c_hcondn4', 'c_hcondn11', 'c_hcondn16', 'c_hcond15', 'c_hcondn12', 'c_hcondn17', 'c_hcondn8', 'c_hcond14', 'c_hcondn10', 'c_hcond10', 'c_hcondn7', 'pidp', 'c_hcond1', 'c_hcond7'}
Loading data from /Users/gavinqu/Desktop/School/Dissertation/UKDA-6614-stata/stata/stata13_se/ukhls/d_indresp.dta
Available columns in d: {'d_hcond8', 'd_hcond3', 'd_hcond10', 'd_hcond1', 'd_hcond13', 'd_hcond14', 'd_hcondn12', 'd_hcondn13', 'd_hcond16', 'd_hcond12', 'd_hcond11', 'pidp', 'd_hcond15', 'd_hcondn11', 'd_hcondn5', 'd_hcond9', 'd_hcondn16', 'd_hcond6', 'd_hcondn9', 'd_hcondn1', 'd_hcondn6', 'd_hcond4', 'd_hcondn7', 'd_hcondn15', 'd_hcondn2', 'd_hcondn17', 'd_hcondn10', 'd_hcondn8', 'd_hcondn3', 'd_hcondn4', '

One or more strings in the dta file could not be decoded using utf-8, and
so the fallback encoding of latin-1 is being used.  This can happen when a file
has been incorrectly encoded by Stata or some other software. You should verify
the string values returned are correct.
  wave_data = pd.read_stata(file_path, convert_categoricals=False)


Available columns in e: {'e_hcond6', 'e_hcond7', 'e_hcondn12', 'e_hcond5', 'e_hcondn1', 'e_hcond8', 'e_hcond2', 'e_hcondn15', 'e_hcondn3', 'e_hcond9', 'e_hcond11', 'e_hcondn4', 'e_hcondn10', 'e_hcond17', 'e_hcond1', 'e_hcondn7', 'e_hcondn16', 'e_hcond16', 'e_hcond3', 'e_hcondn5', 'e_hcondn11', 'e_hcondn14', 'e_hcondn9', 'e_hcond15', 'e_hcond10', 'e_hcond12', 'e_hcond4', 'e_hcond14', 'e_hcondn17', 'e_hcondn13', 'e_hcond13', 'e_hcondn6', 'e_hcondn2', 'e_hcondn8', 'pidp'}
Loading data from /Users/gavinqu/Desktop/School/Dissertation/UKDA-6614-stata/stata/stata13_se/ukhls/f_indresp.dta
Available columns in f: {'f_hcond17', 'f_hcondn11', 'f_hcondn13', 'f_hcondn15', 'f_hcond18', 'f_hcond4', 'f_hcondn10', 'f_hcondn14', 'f_hcond14', 'f_hcondn6', 'f_hcondn1', 'f_hcond13', 'f_hcond11', 'f_hcond7', 'f_hcond16', 'f_hcondn7', 'f_hcond1', 'f_hcondn18', 'f_hcondn16', 'f_hcond12', 'f_hcondn3', 'f_hcondn4', 'f_hcondn5', 'f_hcond8', 'f_hcond15', 'f_hcond2', 'f_hcondn9', 'f_hcond3', 'f_hcond6', 'f_hcondn1
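
The loader above reads every column of each indresp file before filtering, which is slow for files this wide. One possible alternative, sketched here under assumptions, is to inspect the variable names first and then pass only the matching ones to the `columns` argument of `pd.read_stata`. The helper name `load_selected` and the tiny synthetic file are hypothetical and stand in for the real UKHLS data.

```python
import os
import tempfile

import pandas as pd


def load_selected(file_path, wanted):
    """Read only those wanted columns that actually exist in a Stata file."""
    # Open the file lazily to list its variable names without loading the data
    with pd.read_stata(file_path, iterator=True) as reader:
        present = set(reader.variable_labels())
    keep = [col for col in wanted if col in present]
    if not keep:
        return None
    return pd.read_stata(file_path, columns=keep, convert_categoricals=False)


# Demonstration on a small synthetic file (not real UKHLS data)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "a_indresp.dta")
    toy = pd.DataFrame({"pidp": [1, 2], "a_hcond1": [0, 1], "a_other": [5, 6]})
    toy.to_stata(path, write_index=False)
    subset = load_selected(path, ["pidp", "a_hcond1", "a_hcond99"])

print(sorted(subset.columns))
```

The request for `a_hcond99` is dropped silently because the variable is absent, which mirrors the intersection step in the original loader.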

Next, we save the data in long panel format, keyed on pidp and wave.
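
As a rough illustration of what the long format involves, the sketch below strips the wave prefix from each column so that the same variable lines up across waves, then stacks the waves and indexes the result by pidp and wave. The helper `strip_wave_prefix` and the toy frames are hypothetical; the real data would come from the per-wave frames loaded earlier.

```python
import pandas as pd


def strip_wave_prefix(frame, prefix):
    """Return a copy with the leading '<prefix>_' removed from column names."""
    lead = f"{prefix}_"
    return frame.rename(columns=lambda c: c[len(lead):] if c.startswith(lead) else c)


# Toy stand-ins for two waves (not real UKHLS values)
wave_a = pd.DataFrame({"pidp": [1, 2], "a_hcond1": [0, 1]})
wave_b = pd.DataFrame({"pidp": [1, 2], "b_hcondn1": [1, 0]})

frames = []
for prefix, frame in [("a", wave_a), ("b", wave_b)]:
    tidy = strip_wave_prefix(frame, prefix)
    tidy["wave"] = prefix
    frames.append(tidy)

# One row per person per wave; variables not asked in a wave become NaN
long_panel = (
    pd.concat(frames, ignore_index=True)
    .sort_values(["pidp", "wave"])
    .set_index(["pidp", "wave"])
)
print(long_panel)
```

Sorting on the letter prefix keeps waves in chronological order because the prefixes run alphabetically from 'a' to 'm'.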