# Data Preprocessing Pipeline

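This notebook sets up an initial preprocessing pipeline: it loads the raw CSV files, aggregates claim-level features to one row per member, and saves the results for later modeling.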

## Step 1: Load Necessary Libraries
We start by importing the necessary libraries for data processing.

In [2]:

# Step 1: Load Necessary Libraries
import pandas as pd
import numpy as np


## Step 2: Load Dataset
In this step, we load the six raw CSV files that make up the dataset: members, claims, drug counts, lab counts, and the days-in-hospital tables for Y2 and Y3.

In [3]:

# Step 2: Load Dataset
# Assumes the raw datasets are CSV files stored in '../data/raw', relative to this notebook

members_df = pd.read_csv('../data/raw/Members.csv')
claims_df = pd.read_csv('../data/raw/Claims.csv')
drugcount_df = pd.read_csv('../data/raw/DrugCount.csv')
labcount_df = pd.read_csv('../data/raw/LabCount.csv')
daysinhospital_y2_df = pd.read_csv('../data/raw/DaysInHospital_Y2.csv')
daysinhospital_y3_df = pd.read_csv('../data/raw/DaysInHospital_Y3.csv')


In [5]:
# display the data frame you want here:
claims_df

| Unnamed: 0 | MemberID | ProviderID | Vendor | PCP | Year | Specialty | PlaceSvc | PayDelay | LengthOfStay | DSFS | PrimaryConditionGroup | CharlsonIndex | ProcedureGroup | SupLOS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 42286978 | 8013252.0 | 172193.0 | 37796.0 | Y1 | Surgery | Office | 28 |  | 8- 9 months | NEUMENT | 0 | MED | 0 |
| 1 | 97903248 | 3316066.0 | 726296.0 | 5300.0 | Y3 | Internal | Office | 50 |  | 7- 8 months | NEUMENT | 1-2 | EM | 0 |
| 2 | 2759427 | 2997752.0 | 140343.0 | 91972.0 | Y3 | Internal | Office | 14 |  | 0- 1 month | METAB3 | 0 | EM | 0 |
| 3 | 73570559 | 7053364.0 | 240043.0 | 70119.0 | Y3 | Laboratory | Independent Lab | 24 |  | 5- 6 months | METAB3 | 1-2 | SCS | 0 |
| 4 | 11837054 | 7557061.0 | 496247.0 | 68968.0 | Y2 | Surgery | Outpatient Hospital | 27 |  | 4- 5 months | FXDISLC | 1-2 | EM | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2668985 | 14932948 | 6641119.0 | 693028.0 | 22193.0 | Y2 | Obstetrics and Gynecology | Inpatient Hospital | 58 |  | 0- 1 month | GYNEC1 | 0 | EM | 0 |
| 2668986 | 31248189 | 6932712.0 | 223304.0 | 70748.0 | Y3 | Internal | Inpatient Hospital | 23 |  | 0- 1 month | GIBLEED | 1-2 | EM | 0 |
| 2668987 | 43767339 | 1483429.0 | 35565.0 | 5278.0 | Y3 | Diagnostic Imaging | Office | 122 |  | 4- 5 months | ODaBNCA | 0 | SIS | 0 |
| 2668988 | 96393713 | 7094351.0 | 347045.0 | 93075.0 | Y3 | Internal | Office | 151 |  | 1- 2 months | METAB3 | 1-2 | EM | 0 |
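
The output above shows an `Unnamed: 0` column and ID columns such as `ProviderID` printed as floats (`8013252.0`). A minimal cleanup sketch follows, assuming (the source does not confirm this) that `Unnamed: 0` is a leftover row index from an earlier save and that the IDs are floats only because some values are missing. The toy frame and its last row are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for claims_df. The first two rows reuse values shown above;
# the last row is hypothetical and only exists to show a missing ID.
toy_claims = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "MemberID": [42286978, 97903248, 12345],
    "ProviderID": [8013252.0, 3316066.0, np.nan],
})

# Drop the leftover index column; errors="ignore" keeps the cell safe to re-run.
cleaned = toy_claims.drop(columns=["Unnamed: 0"], errors="ignore")

# Cast float-typed ID columns to pandas' nullable integer dtype so IDs print
# without a trailing ".0" and missing values are kept as <NA>.
id_cols = ["ProviderID"]
cleaned = cleaned.astype({col: "Int64" for col in id_cols})

print(cleaned.dtypes)
```

On the real data the same two steps would presumably apply to `claims_df`, with `Vendor` and `PCP` added to `id_cols` if they are also identifiers.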


## Step 3: Preprocessing Functions
We define several helper functions to perform data aggregation and one-hot encoding.

In [None]:

# Step 3: Preprocessing Functions

# One-hot encode a categorical column, then sum the indicators per member.
# The result has one count column per category; rows with a missing
# category contribute to no column.
def one_hot_encode_and_sum(df, member_col, feature_col):
    encoded_df = pd.get_dummies(df[feature_col], dtype=int)
    encoded_df[member_col] = df[member_col]
    return encoded_df.groupby(member_col).sum().reset_index()

# Sum a numeric column per member (missing values are skipped by default)
def sum_numeric_feature(df, member_col, numeric_col):
    return df.groupby(member_col)[numeric_col].sum().reset_index()

# Count distinct non-missing IDs per member (nunique ignores NaN by default)
def count_unique(df, member_col, id_col):
    return df.groupby(member_col)[id_col].nunique().reset_index()
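
To make the behavior of the three helpers concrete, here is a small sketch on a made-up four-row frame. The toy values are hypothetical, and the functions are repeated so the snippet runs on its own.

```python
import pandas as pd

def one_hot_encode_and_sum(df, member_col, feature_col):
    encoded_df = pd.get_dummies(df[feature_col])
    encoded_df[member_col] = df[member_col]
    return encoded_df.groupby(member_col).sum().reset_index()

def sum_numeric_feature(df, member_col, numeric_col):
    return df.groupby(member_col)[numeric_col].sum().reset_index()

def count_unique(df, member_col, id_col):
    return df.groupby(member_col)[id_col].nunique().reset_index()

# Hypothetical toy claims: member 1 has three claims, member 2 has one
# claim with a missing provider.
toy = pd.DataFrame({
    "MemberID": [1, 1, 1, 2],
    "PlaceSvc": ["Office", "Office", "Independent Lab", "Office"],
    "PayDelay": [28, 50, 14, 24],
    "ProviderID": [10.0, 10.0, 11.0, None],
})

# One count column per PlaceSvc category: member 1 has 2 Office claims
# and 1 Independent Lab claim.
place_counts = one_hot_encode_and_sum(toy, "MemberID", "PlaceSvc")

# Total PayDelay per member: 28 + 50 + 14 = 92 for member 1, 24 for member 2.
delay_totals = sum_numeric_feature(toy, "MemberID", "PayDelay")

# Distinct providers per member: 2 for member 1, 0 for member 2 (NaN is ignored).
provider_counts = count_unique(toy, "MemberID", "ProviderID")

print(place_counts, delay_totals, provider_counts, sep="\n\n")
```

Note that each helper keeps the original column name in its output (for example, the unique-provider count is still called `ProviderID`), so renaming the result columns may be worthwhile before combining them.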


## Step 4: Data Aggregation
We aggregate data at the member level by counting or summing relevant features.
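
Summing only yields a numeric total when the column has a numeric dtype; on a string column, a grouped `sum` may concatenate values or raise instead. The sketch below assumes, without confirmation from the source, that a column like `PayDelay` could contain a top-coded text entry such as `"162+"`. The toy data and the `PayDelayNum` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "162+" stands in for a top-coded, non-numeric entry.
toy = pd.DataFrame({
    "MemberID": [1, 1, 2],
    "PayDelay": ["28", "162+", "14"],
})

# Strip a trailing "+" and convert to numbers; anything that still cannot
# be parsed becomes NaN rather than raising an error.
toy["PayDelayNum"] = pd.to_numeric(
    toy["PayDelay"].astype(str).str.rstrip("+"), errors="coerce"
)

# Member 1: 28 + 162 = 190; member 2: 14.
totals = toy.groupby("MemberID")["PayDelayNum"].sum().reset_index()
print(totals)
```

Treating `"162+"` as 162 understates the true delay, so this is one possible convention rather than the only reasonable choice. Checking `claims_df['PayDelay'].dtype` first shows whether any coercion is needed at all.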

In [None]:

# Step 4: Data Aggregation

# Example of applying these functions
# Summing PayDelay by member
paydelay_sum_df = sum_numeric_feature(claims_df, 'MemberID', 'PayDelay')

# One-Hot Encoding for categorical features like PlaceSvc
place_svc_df = one_hot_encode_and_sum(claims_df, 'MemberID', 'PlaceSvc')

# Counting unique providers per member
unique_provider_count_df = count_unique(claims_df, 'MemberID', 'ProviderID')
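
Because all three results are keyed by `MemberID`, they can be combined into a single member-level feature table if that is convenient for modeling. The following is only a sketch of one way to do it: the small frames are hypothetical stand-ins for the aggregated results, and the renamed column names are assumptions, not names used elsewhere in this notebook.

```python
from functools import reduce

import pandas as pd

# Hypothetical stand-ins for the three aggregated frames built above.
paydelay = pd.DataFrame({"MemberID": [1, 2], "PayDelay": [92, 24]})
place_svc = pd.DataFrame({"MemberID": [1, 2], "Office": [2, 1]})
providers = pd.DataFrame({"MemberID": [1, 2], "ProviderID": [2, 0]})

# Rename aggregated columns so their meaning survives the merge.
paydelay = paydelay.rename(columns={"PayDelay": "PayDelay_Sum"})
providers = providers.rename(columns={"ProviderID": "Provider_Unique_Count"})

# Outer-join on MemberID so a member missing from one frame is still kept.
features = reduce(
    lambda left, right: left.merge(right, on="MemberID", how="outer"),
    [paydelay, place_svc, providers],
)
print(features)
```

An outer join is used here so that no member is silently dropped; members absent from a frame would get NaN in that frame's columns.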


## Step 5: Save Processed Data
Finally, we save the processed data into CSV files for later use in modeling.

In [None]:

# Step 5: Save Processed Data

# Save the processed datasets as CSV
paydelay_sum_df.to_csv('../data/processed/PayDelay_Sum.csv', index=False)
place_svc_df.to_csv('../data/processed/PlaceSvc_Encoded.csv', index=False)
unique_provider_count_df.to_csv('../data/processed/Provider_Unique_Count.csv', index=False)
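
`to_csv` does not create missing folders, so the calls above fail if `../data/processed` does not exist yet. A minimal sketch of creating the folder first is shown below; a temporary directory stands in for the real path and the toy frame is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical aggregated result to save.
toy = pd.DataFrame({"MemberID": [1, 2], "PayDelay": [92, 24]})

# A temporary directory stands in for '../data/processed' in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "processed"

    # Create the folder (and any missing parents); do nothing if it exists.
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / "PayDelay_Sum.csv"
    # index=False keeps the row index out of the file, so reloading it
    # does not produce an extra 'Unnamed: 0' column.
    toy.to_csv(out_path, index=False)

    reloaded = pd.read_csv(out_path)

print(reloaded)
```

In the notebook itself, the equivalent would presumably be `Path('../data/processed').mkdir(parents=True, exist_ok=True)` before the first `to_csv` call.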
