This notebook performs basic preprocessing and cleaning of an insurance claims dataset. It inspects the existing dataframe, applies a set of changes, and saves the result to a new file.

In [4]:
import pandas as pd

def load_and_print_data(file_path):

    # Load the dataset
    df = pd.read_csv(file_path)

    # Print the dataframe before pre-processing
    print("Dataframe before pre-processing:")
    print(df.head())

    return df


In [None]:
def preprocess_insurance_data(df):

    # Work on a copy so the caller's dataframe is left unchanged
    df = df.copy()

    # Replace '?' with a missing-value marker so pandas treats it as missing
    df.replace('?', pd.NA, inplace=True)

    # Convert categorical target variable to binary (Y -> 1, N -> 0)
    df['fraud_reported'] = df['fraud_reported'].map({'Y': 1, 'N': 0})

    # Convert date columns to datetime format
    df['policy_bind_date'] = pd.to_datetime(df['policy_bind_date'], errors='coerce')
    df['incident_date'] = pd.to_datetime(df['incident_date'], errors='coerce')

    # Drop unnecessary columns
    drop_columns = ['policy_number', 'insured_zip', 'incident_location', 'auto_make', 'auto_model']
    df.drop(columns=drop_columns, inplace=True, errors='ignore')

    # Handle missing values by replacing categorical NaNs with 'Unknown'
    categorical_cols = df.select_dtypes(include=['object']).columns
    df[categorical_cols] = df[categorical_cols].fillna('Unknown')

    # Encode categorical variables using one-hot encoding
    df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

    return df
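
The missing-value, target-mapping, and encoding steps in this function can be tried on a tiny frame first. The sketch below is illustrative only: the column names follow the notebook, but the three rows and the name `toy` are made up for the example and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; column names follow the notebook, values are made up
toy = pd.DataFrame({
    "collision_type": ["Rear Collision", "?", "Side Collision"],
    "fraud_reported": ["Y", "N", "N"],
})

# '?' becomes a real missing value, so isna() can count it
toy = toy.replace("?", pd.NA)
missing_per_column = toy.isna().sum()

# Map the target to 1/0, then fill and one-hot encode the remaining text column
toy["fraud_reported"] = toy["fraud_reported"].map({"Y": 1, "N": 0})
toy["collision_type"] = toy["collision_type"].fillna("Unknown")
encoded = pd.get_dummies(toy, columns=["collision_type"], drop_first=True)

print(missing_per_column)
print(encoded.columns.tolist())
```

With `drop_first=True`, the alphabetically first category (`Rear Collision` here) has no column of its own; a row belongs to it when all of the remaining dummy columns are false.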


In [5]:
# Example usage
file_path = "/content/insurance_claims.csv"
df_old = load_and_print_data(file_path)
# Pass the loaded dataframe, not the file path
df_cleaned = preprocess_insurance_data(df_old)

# Print the dataframe after pre-processing
print("\nDataframe after pre-processing:")
print(df_cleaned.head())

# Save the preprocessed dataset
df_cleaned.to_csv("insurance_claims_preprocessed.csv", index=False)


Dataframe before pre-processing:
   months_as_customer  age  policy_number policy_bind_date policy_state  \
0                 328   48         521585       2014-10-17           OH   
1                 228   42         342868       2006-06-27           IN   
2                 134   29         687698       2000-09-06           OH   
3                 256   41         227811       1990-05-25           IL   
4                 228   44         367455       2014-06-06           IL   

  policy_csl  policy_deductable  policy_annual_premium  umbrella_limit  \
0    250/500               1000                1406.91               0   
1    250/500               2000                1197.22         5000000   
2    100/300               2000                1413.14         5000000   
3    250/500               2000                1415.74         6000000   
4   500/1000               1000                1583.91         6000000   

   insured_zip  ... witnesses police_report_available total_claim_amoun

Changes made to the dataset:

1. **Handling Missing Values:**  
   - Replaced `?` with `NaN` for proper identification of missing data.  
   - Filled missing categorical values with `'Unknown'`.  

2. **Target Variable Conversion:**  
   - Converted `fraud_reported` from categorical (`Y/N`) to binary (`1/0`).  

3. **Date Format Conversion:**  
   - Converted `policy_bind_date` and `incident_date` to `datetime` format; values that cannot be parsed become `NaT` because `errors='coerce'` is used.  

4. **Dropped Unnecessary Columns:**  
   - Removed `policy_number`, `insured_zip`, `incident_location`, `auto_make`, and `auto_model` as they were unlikely to contribute to fraud detection.  

5. **Encoding Categorical Variables:**  
   - Applied one-hot encoding to the categorical columns, dropping the first category of each to avoid the dummy variable trap.  

6. **Final Dataset Preparation:**  
   - The processed dataset was saved as `insurance_claims_preprocessed.csv` for later model training.  
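
One caveat when loading the saved file later: CSV stores no dtype information, so the converted date columns come back as plain strings unless they are parsed again. The sketch below shows this on a hypothetical two-row frame written to an in-memory buffer rather than to the real file; the values are made up and only the column names follow the notebook.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the preprocessed data
df = pd.DataFrame({
    "policy_bind_date": pd.to_datetime(["2014-10-17", "2006-06-27"]),
    "fraud_reported": [1, 0],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Plain read: the date column comes back as strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype on load
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["policy_bind_date"])

print(plain.dtypes)
print(parsed.dtypes)
```

The same `parse_dates` argument could be passed when reading `insurance_claims_preprocessed.csv`, listing whichever date columns the model training step needs.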
