# **Medical Expenditure Dataset**

The Medical Expenditure Panel Survey (MEPS) dataset is a large-scale survey dataset that collects comprehensive information on the healthcare services used by families and individuals, their associated costs, frequencies of services, demographics, and more. The dataset serves as a critical resource for studying healthcare utilization, costs, and disparities across different population groups.
The data files are available from the [MEPS HC-181 download page](https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_detail.jsp?cboPufNumber=HC-181).


## **Key Features**
The dataset includes the following key features:

### **1. Protected Attributes**
- **`RACE`**: Encoded as:
  - `1.0` = White (Privileged group)
  - `0.0` = Non-White (Non-Privileged group)
- **`SEX`**: Encoded as:
  - `1.0` = Male (Privileged group)
  - `0.0` = Female (Non-Privileged group)

### **2. Health Scores**
- **`PCS42`**: Physical Component Summary (PCS) score from the SF-12v2 survey.
  - Reflects physical health status.
  - Range: 0 to 100 (higher scores indicate better physical health).
- **`MCS42`**: Mental Component Summary (MCS) score from the SF-12v2 survey.
  - Reflects mental health status.
  - Range: 0 to 100 (higher scores indicate better mental health).



### **3. Socioeconomic Status**
- **Poverty Categories (`POVCAT=X`)**:
  - `POVCAT=1`: Poor.
  - `POVCAT=2`: Near poor.
  - `POVCAT=3`: Low income. 
  - `POVCAT=4`: Middle income. 
  - `POVCAT=5`: High income. 

- **Insurance Coverage (`INSCOV=X`)**:
  - `INSCOV=1`: Any Private (private health insurance).
  - `INSCOV=2`: Public Only (public health insurance, no private coverage).
  - `INSCOV=3`: Uninsured (no health insurance).

### **4. Target Variable**
- **`UTILIZATION`**: Composite feature measuring the total number of medical care visits. It is created by summing the following:
  - `OBTOTV15(16)`: Number of office-based visits.
  - `OPTOTV15(16)`: Number of outpatient visits.
  - `ERTOT15(16)`: Number of emergency room visits.
  - `IPNGTD15(16)`: Number of inpatient nights.
  - `HHTOTD16`: Number of home health visits.

  **Classification Task**:
  - Predict whether a person would have **high utilization**:
    - `High utilization` = `UTILIZATION >= 10`.
    - `Low utilization` = `UTILIZATION < 10`.
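
As a minimal sketch of how such a target could be derived, the snippet below sums the five count columns for a few made-up respondents and applies the `>= 10` threshold. The column names follow the list above (using the `15` variants where both years are given); the toy values and the helper name `add_utilization_label` are illustrative assumptions, not part of the MEPS files or the AIF360 loader.

```python
import pandas as pd

# Count columns that make up the composite score, as named in the list above.
VISIT_COLUMNS = ["OBTOTV15", "OPTOTV15", "ERTOT15", "IPNGTD15", "HHTOTD16"]

def add_utilization_label(df: pd.DataFrame, threshold: int = 10) -> pd.DataFrame:
    """Return a copy with the summed score and a binary high-utilization label."""
    out = df.copy()
    out["UTILIZATION_RAW"] = out[VISIT_COLUMNS].sum(axis=1)
    out["UTILIZATION"] = (out["UTILIZATION_RAW"] >= threshold).astype(float)
    return out

# Toy respondents (made-up numbers, for illustration only).
toy = pd.DataFrame({
    "OBTOTV15": [2, 8, 0],
    "OPTOTV15": [0, 1, 0],
    "ERTOT15":  [1, 0, 0],
    "IPNGTD15": [0, 3, 0],
    "HHTOTD16": [0, 0, 9],
})
labelled = add_utilization_label(toy)
print(labelled[["UTILIZATION_RAW", "UTILIZATION"]])
```

Only the second toy respondent (12 total visits) crosses the threshold; the third, with 9, stays in the low-utilization class.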

---

## **Dataset Preprocessing**
1. **Invalid Values**:
   - Negative values (`-1`, `-9`) in `PCS42` and `MCS42` were replaced with `NaN`.
   - Missing values were imputed using the median of the respective columns.

2. **Age Encoding**:
   - **`Age (decade)_X`**: One-hot encoded columns representing age groups in decades.
     - Example: `Age (decade)_20` = 1 for individuals aged 20–29.
     - Covers age groups from `0` to `70+`.

3. **Feature Selection**:
   - Retained only essential columns: `RACE`, `SEX`, `PCS42`, `MCS42`, age group columns, `POVCAT`, `INSCOV`, and `UTILIZATION`.
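
The cleaning and age-binning steps can be illustrated on a small made-up frame. This is a sketch under the assumptions stated in the list above (negative scores are invalid codes, and ages of 70 and over share one bin); the sample values are invented and only `PCS42` is shown.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "AGE":   [3, 27, 45, 82],
    "PCS42": [-1.0, 50.0, 40.0, -9.0],
})

# 1. Treat negative scores as missing, then impute with the column median.
toy["PCS42"] = toy["PCS42"].mask(toy["PCS42"] < 0, np.nan)
toy["PCS42"] = toy["PCS42"].fillna(toy["PCS42"].median())

# 2. Floor age to its decade and cap at 70 so that ages 70+ share one bin.
toy["Age (decade)"] = (toy["AGE"] // 10 * 10).clip(upper=70)

# 3. One-hot encode the decade; dtype=float keeps 0.0/1.0 columns.
encoded = pd.get_dummies(toy, columns=["Age (decade)"], dtype=float)
print(encoded)
```

Note that the median is computed from the valid scores only (here 40 and 50), so both invalid entries are filled with 45.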

---




In [1]:
import sys
import numpy as np
import pandas as pd
import copy
from IPython.display import display
from aif360.datasets import MEPSDataset19
from aif360.datasets import StandardDataset

# Ensure reproducibility
np.random.seed(1)

# Append a path if needed
sys.path.append("../")

In [2]:
original_dataset = MEPSDataset19()
df = pd.DataFrame(original_dataset.features, columns=original_dataset.feature_names)
df

Unnamed: 0,AGE,RACE,PCS42,MCS42,K6SUM42,REGION=1,REGION=2,REGION=3,REGION=4,SEX=1,...,EMPST=3,EMPST=4,POVCAT=1,POVCAT=2,POVCAT=3,POVCAT=4,POVCAT=5,INSCOV=1,INSCOV=2,INSCOV=3
0,53.0,1.0,25.93,58.47,3.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,56.0,1.0,20.42,26.57,17.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
2,23.0,1.0,53.12,50.33,7.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,3.0,1.0,-1.00,-1.00,-1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,27.0,0.0,-1.00,-1.00,-1.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15825,25.0,0.0,56.71,62.39,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15826,25.0,0.0,56.71,62.39,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15827,2.0,1.0,-1.00,-1.00,-1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
15828,54.0,0.0,43.97,42.45,24.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [10]:
original_dataset.feature_names

['AGE',
 'RACE',
 'PCS42',
 'MCS42',
 'K6SUM42',
 'REGION=1',
 'REGION=2',
 'REGION=3',
 'REGION=4',
 'SEX=1',
 'SEX=2',
 'MARRY=1',
 'MARRY=2',
 'MARRY=3',
 'MARRY=4',
 'MARRY=5',
 'MARRY=6',
 'MARRY=7',
 'MARRY=8',
 'MARRY=9',
 'MARRY=10',
 'FTSTU=-1',
 'FTSTU=1',
 'FTSTU=2',
 'FTSTU=3',
 'ACTDTY=1',
 'ACTDTY=2',
 'ACTDTY=3',
 'ACTDTY=4',
 'HONRDC=1',
 'HONRDC=2',
 'HONRDC=3',
 'HONRDC=4',
 'RTHLTH=-1',
 'RTHLTH=1',
 'RTHLTH=2',
 'RTHLTH=3',
 'RTHLTH=4',
 'RTHLTH=5',
 'MNHLTH=-1',
 'MNHLTH=1',
 'MNHLTH=2',
 'MNHLTH=3',
 'MNHLTH=4',
 'MNHLTH=5',
 'HIBPDX=-1',
 'HIBPDX=1',
 'HIBPDX=2',
 'CHDDX=-1',
 'CHDDX=1',
 'CHDDX=2',
 'ANGIDX=-1',
 'ANGIDX=1',
 'ANGIDX=2',
 'MIDX=-1',
 'MIDX=1',
 'MIDX=2',
 'OHRTDX=-1',
 'OHRTDX=1',
 'OHRTDX=2',
 'STRKDX=-1',
 'STRKDX=1',
 'STRKDX=2',
 'EMPHDX=-1',
 'EMPHDX=1',
 'EMPHDX=2',
 'CHBRON=-1',
 'CHBRON=1',
 'CHBRON=2',
 'CHOLDX=-1',
 'CHOLDX=1',
 'CHOLDX=2',
 'CANCERDX=-1',
 'CANCERDX=1',
 'CANCERDX=2',
 'DIABDX=-1',
 'DIABDX=1',
 'DIABDX=2',
 'JTPAIN=-

In [11]:
def further_preprocessing_aif360(meps_dataset):
    """
    Further preprocess the MEPS dataset to prepare it for AIF360 analysis.
    """
    # Convert MEPSDataset19 to a DataFrame
    df, metadata = meps_dataset.convert_to_dataframe()

    # Clean invalid values for PCS42 and MCS42
    for col in ['PCS42', 'MCS42']:
        if col in df.columns:
            # Replace negative codes (e.g. -1, -9) with NaN; mask() keeps the column float
            df[col] = df[col].mask(df[col] < 0)
            # Fill NaN values with the column median (NaN is skipped by default)
            df[col] = df[col].fillna(df[col].median())

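    # Group age into decades, capping at 70 so that ages 70+ share one bin
    df['Age (decade)'] = (df['AGE'] // 10 * 10).clip(upper=70)

    # One-hot encode categorical columns (dtype=float keeps 0.0/1.0 values
    # instead of booleans on recent pandas versions)
    categorical_columns = ['Age (decade)']
    df = pd.get_dummies(df, columns=categorical_columns, dtype=float)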

    # Rename one-hot encoded columns to remove `.0` suffix
    df.rename(columns=lambda col: col.replace('.0', '') if 'Age (decade)' in col else col, inplace=True)

    # Dynamically retrieve column names for encoded categories
    age_decade_columns = [col for col in df.columns if 'Age (decade)_' in col]

    # Rename the one-hot 'SEX=1' (male) indicator to 'SEX'; RACE is already 0/1
    df.rename(columns={'SEX=1': 'SEX'}, inplace=True)

    # Include additional features (POVCAT and INSCOV)
    additional_features = [
        'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5',
        'INSCOV=1', 'INSCOV=2', 'INSCOV=3'
    ]

    # Ensure these additional features are present in the dataset
    for feature in additional_features:
        if feature not in df.columns:
            raise ValueError(f"Feature {feature} not found in the dataset.")

    # Retain only necessary columns
    selected_columns = (
        ['RACE', 'SEX', 'PCS42', 'MCS42'] +  # Include numerical and protected attributes
        age_decade_columns +
        additional_features +
        ['UTILIZATION']  # Include target variable
    )
    df = df[selected_columns]

    # Create the processed AIF360 dataset
    processed_dataset = StandardDataset(
        df,
        label_name='UTILIZATION',
        favorable_classes=[1.0],
        protected_attribute_names=['RACE', 'SEX'],
        privileged_classes=[[1.0], [1.0]],  # Privileged groups: White and Male
    )

    return processed_dataset


In [12]:
# Apply further preprocessing
processed_meps = further_preprocessing_aif360(original_dataset)

# Inspect the AIF360 dataset
print(processed_meps.feature_names)

['RACE', 'SEX', 'PCS42', 'MCS42', 'Age (decade)_0', 'Age (decade)_10', 'Age (decade)_20', 'Age (decade)_30', 'Age (decade)_40', 'Age (decade)_50', 'Age (decade)_60', 'Age (decade)_70', 'POVCAT=1', 'POVCAT=2', 'POVCAT=3', 'POVCAT=4', 'POVCAT=5', 'INSCOV=1', 'INSCOV=2', 'INSCOV=3']


In [13]:
df_proc, metadata = processed_meps.convert_to_dataframe()
df_proc

Unnamed: 0,RACE,SEX,PCS42,MCS42,Age (decade)_0,Age (decade)_10,Age (decade)_20,Age (decade)_30,Age (decade)_40,Age (decade)_50,...,Age (decade)_70,POVCAT=1,POVCAT=2,POVCAT=3,POVCAT=4,POVCAT=5,INSCOV=1,INSCOV=2,INSCOV=3,UTILIZATION
0,1.0,1.0,25.930,58.47,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,1.0,0.0,20.420,26.57,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
3,1.0,0.0,53.120,50.33,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,1.0,1.0,53.435,54.37,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
5,0.0,1.0,53.435,54.37,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16573,0.0,1.0,56.710,62.39,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
16574,0.0,0.0,56.710,62.39,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
16575,1.0,0.0,53.435,54.37,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
16576,0.0,0.0,43.970,42.45,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0


In [14]:
# Find the maximum scores for PCS42 and MCS42
max_pcs42 = df_proc['PCS42'].max()
max_mcs42 = df_proc['MCS42'].max()

# Print the results
print(f"Maximum PCS42 Score: {max_pcs42}")
print(f"Maximum MCS42 Score: {max_mcs42}")


Maximum PCS42 Score: 72.07
Maximum MCS42 Score: 75.51


In [6]:
# Check data types and null values
print(df_proc.info())

<class 'pandas.core.frame.DataFrame'>
Index: 15830 entries, 0 to 16577
Data columns (total 21 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RACE             15830 non-null  float64
 1   SEX              15830 non-null  float64
 2   PCS42            15830 non-null  float64
 3   MCS42            15830 non-null  float64
 4   Age (decade)_0   15830 non-null  float64
 5   Age (decade)_10  15830 non-null  float64
 6   Age (decade)_20  15830 non-null  float64
 7   Age (decade)_30  15830 non-null  float64
 8   Age (decade)_40  15830 non-null  float64
 9   Age (decade)_50  15830 non-null  float64
 10  Age (decade)_60  15830 non-null  float64
 11  Age (decade)_70  15830 non-null  float64
 12  POVCAT=1         15830 non-null  float64
 13  POVCAT=2         15830 non-null  float64
 14  POVCAT=3         15830 non-null  float64
 15  POVCAT=4         15830 non-null  float64
 16  POVCAT=5         15830 non-null  float64
 17  INSCOV=1         

In [7]:
# Summary statistics for numerical columns
print(df_proc.describe())

               RACE           SEX         PCS42         MCS42  Age (decade)_0  \
count  15830.000000  15830.000000  15830.000000  15830.000000    15830.000000   
mean       0.357296      0.478838     51.000047     52.840373        0.152369   
std        0.479218      0.499568      8.358387      7.684650        0.359389   
min        0.000000      0.000000      7.330000      0.050000        0.000000   
25%        0.000000      0.000000     51.570000     52.000000        0.000000   
50%        0.000000      0.000000     53.435000     54.370000        0.000000   
75%        1.000000      1.000000     54.800000     56.580000        0.000000   
max        1.000000      1.000000     72.070000     75.510000        1.000000   

       Age (decade)_10  Age (decade)_20  Age (decade)_30  Age (decade)_40  \
count     15830.000000     15830.000000     15830.000000     15830.000000   
mean          0.160328         0.137587         0.134997         0.120720   
std           0.366922         0.344477

In [9]:
labels = (processed_meps.labels).flatten()
label_counts = pd.Series(labels).value_counts()
print(label_counts)

0.0    13112
1.0     2718
Name: count, dtype: int64




## Feature Analysis Based on Statistics

### RACE
- **Mean**: 0.357 → 35.7% White, 64.3% Non-White.
- **Conclusion**: The White group looks underrepresented, but the Non-White group aggregates a diverse set of racial and ethnic subgroups (Black, Hispanic, Asian, Native American, and Multiracial individuals), so the comparison is between one group and a combination of many.

### SEX
- **Mean**: 0.479 → 47.9% Male, 52.1% Female.
- **Conclusion**: Gender distribution is nearly balanced with no significant underrepresentation.

### PCS42 (Physical Health Score)
- **Range**: 7.33 to 72.07, Mean: 51.0.
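- **Conclusion**: Physical health scores cluster around 51 (median: 53.4), indicating average physical health. The interquartile range is narrow (51.6 to 54.8), likely in part because missing scores were imputed with the median.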

### MCS42 (Mental Health Score)
- **Range**: 0.05 to 75.51, Mean: 52.84.
- **Conclusion**: Mental health scores follow a similar pattern to physical health, clustering around 53 (median: 54.4).

### Age Groups (Age (decade)_X)
- **Mean** values are below 0.20 for all age decades.
- **Conclusion**: The sample skews young: the five decades under 50 (`Age (decade)_0` to `Age (decade)_40`) together account for about 70.6% of individuals.

### POVCAT (Poverty Categories)
- **POVCAT=1 (Poor)**: 22%.
- **POVCAT=2 (Near Poor)**: 6.4%.
- **POVCAT=3 (Low Income)**: 17.4%.
- **POVCAT=4 (Middle Income)**: 28.1%.
- **POVCAT=5 (High Income)**: 26.1%.
- **Conclusion**: The dataset is socioeconomically diverse, with just over half of individuals (54.2%) in the `Middle` and `High Income` groups.

### INSCOV (Insurance Coverage)
- **INSCOV=1 (Private Insurance)**: 53%.
- **INSCOV=2 (Public Only)**: 35.4%.
- **INSCOV=3 (Uninsured)**: 11.6%.
- **Conclusion**: Most individuals have private insurance, with a smaller percentage uninsured.

### UTILIZATION
- **Mean**: 0.172 → 17.2% have high utilization (>=10 visits).
- **Conclusion**: Most individuals have low utilization (fewer than 10 visits).
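
The percentages in this section rest on one fact: the mean of a 0/1 indicator column equals the share of rows where it is 1. As a quick check, the sketch below rebuilds a label vector from the counts printed by `value_counts()` (13112 low, 2718 high) and recovers the 17.2% figure. Only the two counts come from the notebook output; the reconstruction itself is illustrative.

```python
import numpy as np

n_low, n_high = 13112, 2718  # label counts from the value_counts() output

# Rebuild a 0/1 label vector with the same class counts.
labels = np.concatenate([np.zeros(n_low), np.ones(n_high)])

share_high = labels.mean()   # mean of an indicator = proportion of ones
print(len(labels))           # 15830 rows in total
print(f"{share_high:.3f}")   # 0.172
```

The same reading applies to every one-hot column in `describe()`: for example, a mean of 0.357 for `RACE` means 35.7% of rows have `RACE = 1`.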



In [24]:
high_utilization_df = df_proc[df_proc['UTILIZATION'] == 1]


In [None]:
pd.set_option('display.max_columns', None)  
pd.set_option('display.width', None)       
print(high_utilization_df.describe())


              RACE          SEX        PCS42        MCS42  Age (decade)_0  \
count  2718.000000  2718.000000  2718.000000  2718.000000     2718.000000   
mean      0.531273     0.370125    43.689671    49.870430        0.048197   
std       0.499113     0.482927    12.354174    10.582272        0.214222   
min       0.000000     0.000000     7.460000     2.580000        0.000000   
25%       0.000000     0.000000    34.167500    43.715000        0.000000   
50%       1.000000     0.000000    47.580000    54.335000        0.000000   
75%       1.000000     1.000000    53.435000    56.750000        0.000000   
max       1.000000     1.000000    66.860000    74.980000        1.000000   

       Age (decade)_10  Age (decade)_20  Age (decade)_30  Age (decade)_40  \
count      2718.000000      2718.000000      2718.000000      2718.000000   
mean          0.062546         0.078366         0.100074         0.101545   
std           0.242189         0.268797         0.300153         0.302105  

## Characteristics of High Utilization Individuals (UTILIZATION = 1)

### **Key Demographic Characteristics**
1. **RACE**:
   - **Mean**: 0.531 → 53.1% White, 46.9% Non-White.
   - **Conclusion**: White individuals are a slight majority of high utilizers even though they are only 35.7% of the full dataset, so they are overrepresented in this group. The Non-White share (46.9%) combines a diverse set of racial and ethnic subgroups, including Black, Hispanic, Asian, Native American, and Multiracial individuals, whose healthcare needs may differ widely.

2. **SEX**:
   - **Mean**: 0.370 → 37.0% Male, 63.0% Female.
   - **Conclusion**: Females make up 63.0% of high utilizers, well above their 52.1% share of the overall dataset.

3. **Age Groups**:
   - **Age (decade)_50**: 18.6%.
   - **Age (decade)_60**: 18.9%.
   - **Age (decade)_70**: 23.4%.
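   - **Conclusion**: Individuals aged 50 and over make up 60.9% of high utilizers, compared with roughly 29% of the full dataset, so older individuals are much more likely to have high healthcare utilization.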

---

### **Health Scores**
1. **PCS42 (Physical Health)**:
   - **Mean**: 43.69 (lower than the overall dataset mean of 51.0).
   - **Conclusion**: High utilizers have poorer physical health on average.

2. **MCS42 (Mental Health)**:
   - **Mean**: 49.87 (lower than the overall dataset mean of 52.84).
   - **Conclusion**: High utilizers also have slightly poorer mental health.

---

### **Socioeconomic Status (POVCAT)**
1. **POVCAT=1 (Poor)**: 19.9%.
2. **POVCAT=4 (Middle Income)**: 25.5%.
3. **POVCAT=5 (High Income)**: 33.0%.
4. **Conclusion**: High utilizers are most concentrated in the `High Income` group (33.0% versus 26.1% overall), but the `Poor` category is still well represented (19.9% versus 22% overall).

---

### **Insurance Coverage (INSCOV)**
1. **INSCOV=1 (Private Insurance)**: 55.5%.
2. **INSCOV=2 (Public Insurance Only)**: 42.1%.
3. **INSCOV=3 (Uninsured)**: 2.4%.
4. **Conclusion**: Almost all high utilizers have private or public insurance; only 2.4% are uninsured, compared with 11.6% of the full dataset.
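
The shares above describe who the high utilizers are, that is, P(group | high). For a fairness analysis the reverse quantity, P(high | group), is often more informative, and Bayes' rule converts one into the other using only the means already reported. The sketch below does this for `RACE` and `SEX`. The function name `rate_given_group` is hypothetical, and the inputs are the rounded means from the `describe()` outputs, so the results are approximate.

```python
def rate_given_group(share_in_high: float, share_overall: float, p_high: float) -> float:
    """P(high | group) = P(group | high) * P(high) / P(group)."""
    return share_in_high * p_high / share_overall

p_high = 2718 / 15830  # overall share of high utilizers

# (share among high utilizers, share in the full dataset), from describe().
white = (0.531273, 0.357296)
male = (0.370125, 0.478838)

rates = {
    "White":     rate_given_group(white[0], white[1], p_high),
    "Non-White": rate_given_group(1 - white[0], 1 - white[1], p_high),
    "Male":      rate_given_group(male[0], male[1], p_high),
    "Female":    rate_given_group(1 - male[0], 1 - male[1], p_high),
}
for group, rate in rates.items():
    print(f"{group:10s} {rate:.3f}")
```

Under these inputs, roughly a quarter of White respondents fall in the high-utilization class versus about one in eight Non-White respondents, and about 21% of women versus about 13% of men.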



In [28]:
# POVCAT=1 is the "Poor" category (POVCAT=3 is "Low income"), despite the variable name
low_income_df = df_proc[df_proc['POVCAT=1'] == 1]
pd.set_option('display.max_columns', None)  
pd.set_option('display.width', None)       
print(low_income_df.describe())

              RACE          SEX        PCS42        MCS42  Age (decade)_0  \
count  3486.000000  3486.000000  3486.000000  3486.000000     3486.000000   
mean      0.220310     0.433161    50.142925    51.738178        0.254733   
std       0.414515     0.495584     8.944188     8.420590        0.435774   
min       0.000000     0.000000    10.020000     0.050000        0.000000   
25%       0.000000     0.000000    51.555000    51.832500        0.000000   
50%       0.000000     0.000000    53.435000    54.370000        0.000000   
75%       0.000000     1.000000    53.435000    54.370000        1.000000   
max       1.000000     1.000000    70.350000    70.310000        1.000000   

       Age (decade)_10  Age (decade)_20  Age (decade)_30  Age (decade)_40  \
count      3486.000000      3486.000000      3486.000000      3486.000000   
mean          0.214573         0.137980         0.119334         0.080608   
std           0.410584         0.344929         0.324228         0.272271  


### **Race Disparity**
- The majority (**78.0%**) of the Poor group (`POVCAT=1`) are **Non-White**, compared with 64.3% of the full dataset, highlighting socioeconomic disparities tied to racial and ethnic backgrounds.
- This imbalance likely contributes to bias in models, as **Non-White individuals living in poverty** may face compounded challenges in accessing healthcare.

### **Insurance Gaps**
- The reliance on **public insurance** and the higher rate of **uninsured individuals** in this group may limit access to quality healthcare services, affecting health outcomes and utilization patterns.
