# 3. Pre-processing for DrugCount & LabCount data
This notebook loads the DrugCount and LabCount data, converts the count columns to integers, splits the records by year, and sums the counts per member ID.

## Part a: Data import, feature type transformation, and split by years
First, load DrugCount & LabCount data from CSV files.

In [1]:
import pandas as pd

DrugCount_df = pd.read_csv('../data/raw/DrugCount.csv')
LabCount_df = pd.read_csv('../data/raw/LabCount.csv')

Replace the capped string values (`7+` and `10+`) with integers, convert the remaining string values to integers, and then split each table by year.

In [2]:
# Replace the capped strings '7+' and '10+' with their numeric lower bound.
# Assign the result back instead of using inplace=True on a column selection,
# which raises a FutureWarning and will stop working in pandas 3.0.
DrugCount_df['DrugCount'] = DrugCount_df['DrugCount'].replace({'7+': '7'})
LabCount_df['LabCount'] = LabCount_df['LabCount'].replace({'10+': '10'})

# Convert the counts, which are stored as strings, to integers
DrugCount_df['DrugCount'] = pd.to_numeric(DrugCount_df['DrugCount'], downcast='integer')
LabCount_df['LabCount'] = pd.to_numeric(LabCount_df['LabCount'], downcast='integer')

# Split by year
DrugCount_Y1 = DrugCount_df[DrugCount_df['Year'] == 'Y1']
DrugCount_Y2 = DrugCount_df[DrugCount_df['Year'] == 'Y2']
DrugCount_Y3 = DrugCount_df[DrugCount_df['Year'] == 'Y3']

LabCount_Y1 = LabCount_df[LabCount_df['Year'] == 'Y1']
LabCount_Y2 = LabCount_df[LabCount_df['Year'] == 'Y2']
LabCount_Y3 = LabCount_df[LabCount_df['Year'] == 'Y3']
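
As a side note, the conversion of capped values can be checked on a small example. The sketch below is not part of the original pipeline: the series `raw` is made-up data, and stripping a trailing `+` is an assumed alternative to listing each capped string explicitly.

```python
import pandas as pd

# Toy values mimicking a raw count column (hypothetical data)
raw = pd.Series(['1', '7+', '3', '10+'])

# Strip a trailing '+' so capped values such as '7+' become '7',
# then convert the strings to the smallest integer dtype that fits
cleaned = pd.to_numeric(raw.str.rstrip('+'), downcast='integer')

print(cleaned.tolist())  # [1, 7, 3, 10]
print(cleaned.dtype)     # int8
```

Because `7+` and `10+` are caps, the converted values are lower bounds, so any sums built from them may understate the true counts.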


In [3]:
# Display data set
DrugCount_Y1

Unnamed: 0,MemberID,Year,DSFS,DrugCount
2,61221204,Y1,2- 3 months,1
8,30786520,Y1,1- 2 months,1
11,28420460,Y1,10-11 months,1
12,11861003,Y1,4- 5 months,1
14,66905595,Y1,6- 7 months,4
...,...,...,...,...
818232,61260763,Y1,11-12 months,1
818234,26018322,Y1,3- 4 months,1
818236,23642577,Y1,5- 6 months,1
818237,99212686,Y1,1- 2 months,2


In [4]:
LabCount_Y1

Unnamed: 0,MemberID,Year,DSFS,LabCount
1,10143167,Y1,0- 1 month,2
2,1054357,Y1,0- 1 month,6
26,43367747,Y1,0- 1 month,10
27,4317556,Y1,0- 1 month,3
28,12608801,Y1,7- 8 months,2
...,...,...,...,...
361476,22029558,Y1,8- 9 months,1
361477,4677404,Y1,2- 3 months,4
361478,11564814,Y1,3- 4 months,4
361479,43366611,Y1,5- 6 months,2
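
The per-year split could also be written with a single `groupby` instead of one filter per year. This is only a sketch on made-up data: `toy` and `by_year` are hypothetical names, and the column names are assumed to match `DrugCount_df`.

```python
import pandas as pd

# Toy frame with the same column names as DrugCount_df (values are made up)
toy = pd.DataFrame({
    'MemberID': [1, 1, 2, 3],
    'Year': ['Y1', 'Y2', 'Y1', 'Y3'],
    'DrugCount': [2, 1, 4, 7],
})

# Build one DataFrame per year, keyed by the year label
by_year = {year: frame for year, frame in toy.groupby('Year')}

print(sorted(by_year))                     # ['Y1', 'Y2', 'Y3']
print(by_year['Y1']['DrugCount'].tolist())  # [2, 4]
```

A dictionary keyed by year avoids repeating the same filter expression, at the cost of slightly less explicit variable names.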


## Part b: Summing Numeric Feature: `DrugCount` & `LabCount`

For each year, sum `DrugCount` and `LabCount` per member ID. All other columns are dropped.

In [5]:
# Function to sum a numeric feature per member
def sum_numeric_feature(df, feature):
    """Return one row per MemberID with the sum of `feature`."""
    numeric_sum = df.groupby('MemberID')[feature].sum().reset_index()
    return numeric_sum

# Calculate drug counts and lab test counts for each member in each year 
DrugCount_summary_Y1 = sum_numeric_feature(DrugCount_Y1, 'DrugCount')
DrugCount_summary_Y2 = sum_numeric_feature(DrugCount_Y2, 'DrugCount')
DrugCount_summary_Y3 = sum_numeric_feature(DrugCount_Y3, 'DrugCount')

LabCount_summary_Y1 = sum_numeric_feature(LabCount_Y1, 'LabCount')
LabCount_summary_Y2 = sum_numeric_feature(LabCount_Y2, 'LabCount')
LabCount_summary_Y3 = sum_numeric_feature(LabCount_Y3, 'LabCount')



In [6]:
# Display data set
DrugCount_summary_Y1

Unnamed: 0,MemberID,DrugCount
0,210,5
1,3197,5
2,3889,30
3,4187,61
4,9063,2
...,...,...
49833,99988469,24
49834,99992565,2
49835,99994536,2
49836,99995554,8


In [7]:
LabCount_summary_Y1

Unnamed: 0,MemberID,LabCount
0,210,2
1,3889,10
2,11951,3
3,14661,2
4,14778,2
...,...,...
53222,99992565,6
53223,99994536,11
53224,99995554,11
53225,99997895,10
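
To make the aggregation concrete, here is a minimal sketch on made-up rows. The frame `toy_Y1` is hypothetical, and `as_index=False` is used as an assumed equivalent of calling `.reset_index()` afterwards.

```python
import pandas as pd

# Several rows per member, as in the raw yearly tables (values are made up)
toy_Y1 = pd.DataFrame({
    'MemberID': [210, 210, 3197, 210],
    'DSFS': ['0- 1 month', '1- 2 months', '0- 1 month', '2- 3 months'],
    'DrugCount': [1, 3, 5, 1],
})

# Sum the counts per member; the DSFS column is dropped by the selection
summary = toy_Y1.groupby('MemberID', as_index=False)['DrugCount'].sum()

print(summary['MemberID'].tolist())   # [210, 3197]
print(summary['DrugCount'].tolist())  # [5, 5]
```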


## Part c: Merge all Features by ID and Save in Years

In [8]:
def merge_annual_data(drug_data, lab_data, year):
    """Merge the yearly drug and lab summaries by MemberID and save them as CSV."""
    # Use an outer join so members present in only one table are kept
    merged_data = pd.merge(drug_data, lab_data, on='MemberID', how='outer', suffixes=('_Drug', '_Lab'))
    # Members missing from one table get NaN in its count column; fill with 0
    merged_data.fillna(0, inplace=True)
    merged_data.to_csv(f'../data/processed/processed_lab&drug_Y{year}.csv', index=False)

# Call function to merge data
merge_annual_data(DrugCount_summary_Y1, LabCount_summary_Y1, 1)
merge_annual_data(DrugCount_summary_Y2, LabCount_summary_Y2, 2)
merge_annual_data(DrugCount_summary_Y3, LabCount_summary_Y3, 3)

In [9]:
# Display data set
y1_df = pd.read_csv('../data/processed/processed_lab&drug_Y1.csv')
y1_df

Unnamed: 0,MemberID,DrugCount,LabCount
0,210,5.0,2.0
1,3197,5.0,0.0
2,3889,30.0,10.0
3,4187,61.0,0.0
4,9063,2.0,0.0
...,...,...,...
64850,99992565,2.0,6.0
64851,99994536,2.0,11.0
64852,99995554,8.0,11.0
64853,99997895,0.0,10.0
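
The counts appear as floats (`5.0`, `0.0`) because the outer join introduces `NaN` for members missing from one table, which forces the column to `float64`, and `fillna(0)` does not convert it back. If integer counts are preferred, one possible approach is to cast after filling. The sketch below uses made-up frames `drug` and `lab` and is an assumption about a possible follow-up, not part of the pipeline above.

```python
import pandas as pd

# Toy yearly summaries (hypothetical values)
drug = pd.DataFrame({'MemberID': [210, 3197], 'DrugCount': [5, 5]})
lab = pd.DataFrame({'MemberID': [210, 99997895], 'LabCount': [2, 10]})

# The outer join keeps members that appear in only one table
merged = pd.merge(drug, lab, on='MemberID', how='outer')
print(merged['DrugCount'].dtype)  # float64, because of the NaN

# Fill the gaps with 0 and cast the count columns back to integers
merged = merged.fillna(0).astype({'DrugCount': 'int64', 'LabCount': 'int64'})
print(merged.dtypes)
```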
