In [39]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split



In [43]:
# Read the CSV file using the pandas library
df = pd.read_csv('MNP_12_10.csv')



## Data Cleaning:

- Before imputing, check why values are missing and whether the missingness follows a pattern. If the gaps are confined to a particular subset of the data, they are probably missing for a reason, and mode imputation may not be the best approach.

- Beyond imputation, feature engineering can give the autoencoder more informative inputs to learn from.

- After imputing and adding features, visualize the data to understand its distribution and spot potential outliers. This helps flag data points that might be anomalies and informs the choice of scaling and normalization techniques.
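
One way to check whether missingness is concentrated in a subset is to compute the missing rate per column within each group of a categorical column. The sketch below uses a small made-up frame; the values are illustrative, and `missing_rate_by_group` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are illustrative only.
toy = pd.DataFrame({
    "PayeeCategory": ["Clinic", "Clinic", "Pharmacy", "Pharmacy", "Pharmacy", "Clinic"],
    "ICDCode": ["S13", "M54", np.nan, np.nan, np.nan, "S93"],
    "Units": ["1", np.nan, "2", "1", np.nan, "3"],
})


def missing_rate_by_group(frame, group_col):
    """Return the fraction of missing values per column within each group."""
    value_cols = [c for c in frame.columns if c != group_col]
    # isnull() yields booleans; the group-wise mean is the share of missing rows.
    return frame[value_cols].isnull().groupby(frame[group_col]).mean()


rates = missing_rate_by_group(toy, "PayeeCategory")
print(rates)
```

In this toy data `ICDCode` is missing only for the `Pharmacy` group, which suggests the gap is structural rather than random, whereas `Units` is missing at the same rate in both groups.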

In [44]:
# drop all the columns except those with the object datatype
df = df.select_dtypes(include=['object'])

In [46]:
# delete the following columns V42 to V107
df = df.drop(df.columns[39:105], axis=1)

# list the remaining columns with their positional index
for i, col in enumerate(df.columns):
    print(i, col)

0 BoardNotifiedDate
1 BusinessDate
2 ServiceFromDate
3 ServiceToDate
4 Cost
5 ExpenseSubgroup
6 ExpenseItem
7 PayeeCategory
8 PayeeSubcategory
9 Payee
10 PayeeKey
11 ClinicName
12 ProviderNo
13 TariffCode
14 TariffDescription
15 TariffFee
16 WcbServiceCode
17 WcbServiceCategory
18 WcbService
19 Units
20 DrugQuantity
21 AdjudicationStatus
22 WageLossClaimType
23 InjuryType
24 SpecialInvestigationsFlag
25 ICDCode
26 InjuryBodyPartGroup
27 InjuryBodyPart
28 InjuryEventGroup
29 InjuryEvent
30 InjuryNatureGroup
31 InjuryNature
32 MSI
33 PsychologicalCondition
34 InjurySourceGroup
35 InjurySource
36 stickmanGroup
37 stickmanSubGroup
38 stickman


In [47]:
# What fraction of each column is missing? (multiply by 100 for a percentage)
def get_missing_percentage(df):
    # isnull().mean() gives the share of missing rows per column
    return df.isnull().mean().to_dict()


# get the fraction of missing values for each column in the dataset.
missing_percentage = get_missing_percentage(df)
missing_percentage

{'BoardNotifiedDate': 0.0,
 'BusinessDate': 0.0,
 'ServiceFromDate': 0.0,
 'ServiceToDate': 0.0,
 'Cost': 0.0,
 'ExpenseSubgroup': 0.0,
 'ExpenseItem': 0.0,
 'PayeeCategory': 0.0,
 'PayeeSubcategory': 0.0,
 'Payee': 0.0,
 'PayeeKey': 0.0,
 'ClinicName': 0.0,
 'ProviderNo': 0.0,
 'TariffCode': 0.0,
 'TariffDescription': 0.0,
 'TariffFee': 0.0,
 'WcbServiceCode': 0.002088275128679083,
 'WcbServiceCategory': 0.005908410888978454,
 'WcbService': 0.012405466098256528,
 'Units': 0.023241363449088094,
 'DrugQuantity': 0.06571969501678734,
 'AdjudicationStatus': 0.12498178699504023,
 'WageLossClaimType': 0.20187003289324737,
 'InjuryType': 0.2707902852616105,
 'SpecialInvestigationsFlag': 0.37073922159970485,
 'ICDCode': 0.42783368578766934,
 'InjuryBodyPartGroup': 0.47839702243707716,
 'InjuryBodyPart': 0.42797938982734746,
 'InjuryEventGroup': 0.3520509973105276,
 'InjuryEvent': 0.27950159355387394,
 'InjuryNatureGroup': 0.22945113512413312,
 'InjuryNature': 0.20304687321372453,
 'MSI': 0.19

In [54]:
# impute the missing values in each column with that column's mode (most frequent value).
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
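
Since mode imputation hides where values were originally missing, one option is to record that information in indicator columns before filling. The following is a minimal sketch on made-up data; the helper `impute_mode_with_flags` and the `_was_missing` suffix are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def impute_mode_with_flags(frame):
    """Mode-impute each column, adding a '<col>_was_missing' flag for filled columns."""
    out = frame.copy()
    for col in frame.columns:
        mask = frame[col].isnull()
        if not mask.any():
            continue  # nothing to impute, so no flag column is needed
        out[col + "_was_missing"] = mask.astype(int)
        modes = frame[col].mode()
        if not modes.empty:  # an all-NaN column has no mode; leave it unfilled
            out[col] = frame[col].fillna(modes.iloc[0])
    return out


# Illustrative values only.
toy = pd.DataFrame({
    "InjuryType": ["Strain", np.nan, "Strain", "Fracture"],
    "Payee": ["A", "B", "C", "D"],
})
filled = impute_mode_with_flags(toy)
print(filled)
```

The guard on `modes.empty` matters because `mode()[0]` raises an error for a column that is entirely missing.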

## Feature Engineering

Some potential feature engineering approaches that you could try are:

- Derive new features from the date columns (BoardNotifiedDate, BusinessDate, ServiceFromDate, ServiceToDate), such as the difference between ServiceFromDate and ServiceToDate to get the duration of the service, or the difference between BoardNotifiedDate and BusinessDate to get the time lag between the two dates.

- Create new features based on the string columns such as the length of the string in the Payee column, or the number of unique words in the TariffDescription column.

- Extract numerical features from the string columns such as the numbers present in the Cost or TariffFee columns.

- Group the columns based on their categories and create new features that represent the relationships between the different categories. For example, you could create a new feature that represents the relationship between the InjuryType and InjuryBodyPart columns, or the relationship between the InjuryNatureGroup and InjuryNature columns.

- Use some advanced feature engineering techniques such as text embeddings to represent the string columns in a more compact and meaningful way. This could be useful for columns such as TariffDescription or InjuryEvent that contain a large number of unique words.
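
Several of the ideas above can be sketched with plain pandas string and datetime operations. The example below runs on a small made-up frame; the date and currency formats, the new feature names, and the helper `add_features` are assumptions for illustration and may not match the real data. Text embeddings are not covered here.

```python
import pandas as pd

# Toy rows; the formats and values are assumed, not taken from the dataset.
toy = pd.DataFrame({
    "ServiceFromDate": ["2022-01-03", "2022-02-10"],
    "ServiceToDate": ["2022-01-05", "2022-02-10"],
    "Cost": ["$1,250.50", "80"],
    "Payee": ["Acme Physio", "Dr. Lee"],
    "TariffDescription": ["office visit follow up visit", "x-ray chest"],
    "InjuryType": ["Strain", "Fracture"],
    "InjuryBodyPart": ["Back", "Wrist"],
})


def add_features(frame):
    """Return a copy of the frame with a few derived features appended."""
    out = frame.copy()

    # Service duration in days; unparseable dates become NaT.
    start = pd.to_datetime(out["ServiceFromDate"], errors="coerce")
    end = pd.to_datetime(out["ServiceToDate"], errors="coerce")
    out["service_duration_days"] = (end - start).dt.days

    # Simple string-derived features.
    out["payee_name_length"] = out["Payee"].str.len()
    words = out["TariffDescription"].str.lower().str.split()
    out["tariff_unique_words"] = words.apply(
        lambda w: len(set(w)) if isinstance(w, list) else 0
    )

    # Strip currency symbols and thousands separators, then parse as a number.
    cleaned = out["Cost"].str.replace(r"[^0-9.\-]", "", regex=True)
    out["cost_numeric"] = pd.to_numeric(cleaned, errors="coerce")

    # Combined category capturing the InjuryType / InjuryBodyPart pairing.
    out["injury_type_x_body_part"] = out["InjuryType"] + "|" + out["InjuryBodyPart"]
    return out


feats = add_features(toy)
print(feats.iloc[:, -5:])
```

Using `errors="coerce"` keeps the pipeline running when a value cannot be parsed, at the cost of introducing new missing values that need handling afterwards.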


## Autoencoder

In [None]:
# Define the encoder network
inputs = tf.keras.Input(shape=(data.shape[1],))
encoded = tf.keras.layers.Dense(128, activation='relu')(inputs)
encoded = tf.keras.layers.Dense(64, activation='relu')(encoded)
encoded = tf.keras.layers.Dense(32, activation='relu')(encoded)

# Define the decoder network
decoded = tf.keras