# 1. Member Data Import and Information Processing

This notebook loads the member data and preprocesses it.

## Part a: Data import
First, load the member information data from a CSV file.

In [1]:

import pandas as pd
import numpy as np

# Step 1: Data Loading
# Load the members' information from the CSV file
members_df = pd.read_csv('../data/raw/Members.csv')


In [2]:
# Display the dataset
members_df

| | MemberID | AgeAtFirstClaim | Sex |
|---|---|---|---|
| 0 | 14723353 | 70-79 | M |
| 1 | 75706636 | 70-79 | M |
| 2 | 17320609 | 70-79 | M |
| 3 | 69690888 | 40-49 | M |
| 4 | 33004608 | 0-9 | M |
| ... | ... | ... | ... |
| 112995 | 99711514 | 40-49 | F |
| 112996 | 31690877 | 50-59 | F |
| 112997 | 9519985 | 30-39 | F |
| 112998 | 92806272 | 50-59 | F |
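Before cleaning, it can help to count the missing values in each column, since Parts b and c both deal with them. The sketch below is only illustrative: it runs on a small hypothetical sample (the rows and the names `sample_csv`, `sample_df`, and `missing_counts` are made up here), not on the real `Members.csv`.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the column layout of Members.csv
# (the values are illustrative only, not taken from the real file)
sample_csv = io.StringIO(
    "MemberID,AgeAtFirstClaim,Sex\n"
    "1,70-79,M\n"
    "2,,F\n"
    "3,40-49,\n"
)
sample_df = pd.read_csv(sample_csv)

# Count missing values per column to see what the later steps must handle
missing_counts = sample_df.isna().sum()
print(missing_counts)
```

The same `isna().sum()` call could be applied to `members_df` directly after loading.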


## Part b: Members' gender information processing
We process the `Sex` column:
- Fill in missing values with `Unknown`.
- Perform one-hot encoding.

In [3]:
# Members' gender information processing

# Fill missing values in the 'Sex' column with 'Unknown'.
# The result is assigned back to the column instead of calling
# fillna(..., inplace=True) on it: pandas warns that the chained
# in-place form acts on a copy and will stop working in pandas 3.0.
members_df['Sex'] = members_df['Sex'].fillna('Unknown')

# One-hot encode the 'Sex' column into 'Male', 'Female', and 'Unknown'
members_df['Male'] = (members_df['Sex'] == 'M').astype(int)
members_df['Female'] = (members_df['Sex'] == 'F').astype(int)
members_df['Unknown'] = (members_df['Sex'] == 'Unknown').astype(int)

# Drop the original 'Sex' column
members_df.drop(columns=['Sex'], inplace=True)
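As an alternative to building the three indicator columns by hand, the same encoding can be sketched with `pd.get_dummies`. This is not the notebook's method, only a possible variant. It assumes the only expected codes are `M`, `F`, and the filled-in `Unknown`, and it uses a small hypothetical frame (`sample_df`) rather than the real data.

```python
import pandas as pd

# Hypothetical three-row sample; the real notebook uses members_df
sample_df = pd.DataFrame({"MemberID": [1, 2, 3], "Sex": ["M", "F", None]})

# Fill missing values, then fix the category set so that all three
# indicator columns are produced even if one value never occurs
sex_dtype = pd.CategoricalDtype(["M", "F", "Unknown"])
sex = sample_df["Sex"].fillna("Unknown").astype(sex_dtype)

# One indicator column per category, in the order given above
dummies = pd.get_dummies(sex, dtype=int)
dummies.columns = ["Male", "Female", "Unknown"]

# Replace the original 'Sex' column with the indicator columns
encoded_df = pd.concat([sample_df.drop(columns=["Sex"]), dummies], axis=1)
print(encoded_df)
```

Fixing the categories up front keeps the output schema stable across files, at the cost of silently turning any unexpected code into a row of all zeros.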


## Part c: Processing members' age at first claim
We process the `AgeAtFirstClaim` column:
- Map each age range to a representative numeric value (the approximate midpoint; `80+` is mapped to 80).
- Remove rows with missing values.

In [4]:
# Map age ranges to representative values (approximate midpoints; '80+' is mapped to 80)
age_mapping = {
    '0-9': 5,
    '10-19': 15,
    '20-29': 25,
    '30-39': 35,
    '40-49': 45,
    '50-59': 55,
    '60-69': 65,
    '70-79': 75,
    '80+': 80
}
members_df['AgeAtFirstClaim'] = members_df['AgeAtFirstClaim'].map(age_mapping)

# Remove rows where the 'AgeAtFirstClaim' column is missing
members_df = members_df[members_df['AgeAtFirstClaim'].notna()]
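Because `Series.map` returns `NaN` for any value that is not a key of the dictionary, the final filter drops rows whose age label is unrecognized as well as rows that were missing to begin with. A minimal sketch of how the two cases could be told apart, using an abbreviated mapping and a hypothetical series (`ages`, including the made-up label `'unknown-label'`):

```python
import pandas as pd

# Abbreviated version of the mapping, for illustration only
age_mapping = {"0-9": 5, "70-79": 75, "80+": 80}

# Hypothetical input: two valid labels, one missing value, one unexpected label
ages = pd.Series(["0-9", "80+", None, "unknown-label"])

# Labels absent from the mapping (and missing values) become NaN
mapped = ages.map(age_mapping)

# Labels that were present in the input but not covered by the mapping
unmapped_labels = ages[mapped.isna() & ages.notna()].unique()

# Total number of rows the notna() filter would remove
rows_dropped = int(mapped.isna().sum())

print(list(unmapped_labels), rows_dropped)
```

The `NaN` values introduced by the mapping also force the column to a floating-point dtype, which is why the processed ages appear as `5.0`, `35.0`, and so on.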


## Part d: Save the data after processing
Finally, sort the data by `MemberID` and save it as a CSV file.

In [5]:
# Sort the DataFrame by MemberID
members_df = members_df.sort_values(by='MemberID')

# Save the processed members' information to a CSV file
members_df.to_csv('../data/processed/MemberInfo_df.csv', index=False)

print("Member information processing is completed and saved to MemberInfo_df.csv")

Member information processing is completed and saved to MemberInfo_df.csv


In [6]:
# Display the data after processing
new_members_df = pd.read_csv('../data/processed/MemberInfo_df.csv')
new_members_df

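| | MemberID | AgeAtFirstClaim | Male | Female | Unknown |
|---|---|---|---|---|---|
| 0 | 4 | 5.0 | 1 | 0 | 0 |
| 1 | 210 | 35.0 | 0 | 0 | 1 |
| 2 | 3197 | 5.0 | 0 | 1 | 0 |
| 3 | 3457 | 5.0 | 1 | 0 | 0 |
| 4 | 3713 | 45.0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 107242 | 99996214 | 45.0 | 1 | 0 | 0 |
| 107243 | 99997485 | 15.0 | 1 | 0 | 0 |
| 107244 | 99997895 | 45.0 | 1 | 0 | 0 |
| 107245 | 99998627 | 35.0 | 0 | 1 | 0 |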
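A few quick checks can confirm that the saved file has the expected shape: sorted by `MemberID`, exactly one gender indicator set per row, and no missing ages. The sketch below rebuilds the first five rows of the output above by hand so that it runs on its own; the variable names (`processed`, `is_sorted`, `one_hot_ok`, `no_missing_age`) are illustrative, and in the notebook the checks could be applied to `new_members_df` instead.

```python
import pandas as pd

# First five rows of the processed output, re-entered by hand
processed = pd.DataFrame({
    "MemberID": [4, 210, 3197, 3457, 3713],
    "AgeAtFirstClaim": [5.0, 35.0, 5.0, 5.0, 45.0],
    "Male": [1, 0, 0, 1, 0],
    "Female": [0, 0, 1, 0, 1],
    "Unknown": [0, 1, 0, 0, 0],
})

# Rows should be in ascending MemberID order
is_sorted = bool(processed["MemberID"].is_monotonic_increasing)

# Exactly one of the three indicator columns should be 1 in every row
indicator_sum = processed[["Male", "Female", "Unknown"]].sum(axis=1)
one_hot_ok = bool((indicator_sum == 1).all())

# No missing ages should remain after the filtering step
no_missing_age = bool(processed["AgeAtFirstClaim"].notna().all())

print(is_sorted, one_hot_ok, no_missing_age)
```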
