# Preprocess Medical Transcriptions Dataset for Analysis with IBM Project Debater

### Introduction
This notebook preprocesses a dataset of sample medical transcriptions covering various medical specialties. The data was downloaded from [Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions) on 03/09/2021. The output of this notebook is the dataset used in the notebook `Key_Point_Analysis`, which uses Project Debater to extract key points from sentences in medical transcriptions.

For more information about IBM Project Debater, please refer to the Readme file of this repo and the above-mentioned notebook.

In [None]:
# Imports
import pandas as pd
import geopandas as gpd

In [None]:
# Read dataframe
df = pd.read_csv('./mtsamples.csv').drop(columns=['Unnamed: 0'])
print('Shape:',df.shape)
df.head()

### Duplicates

In [None]:
# Sample names
print('Unique sample names:', df['sample_name'].nunique()) 

There are fewer unique sample names than observations, so some observations may relate to the same patient. I need to check for duplicates.

In [None]:
# Mark duplicates
df = df.sort_values(by = ['transcription', 'description'])
df['dup'] = df.duplicated(subset = ['transcription', 'description'])
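
As a side note on what the `dup` flag means: with its default `keep='first'`, `duplicated` flags only the second and later copies in each group, so the first copy stays unflagged. The sketch below illustrates this on a small toy dataframe whose values are made up for illustration and are not taken from `mtsamples.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, not from mtsamples.csv)
toy = pd.DataFrame({
    'transcription': ['note A', 'note A', 'note B', 'note A'],
    'description':   ['desc A', 'desc A', 'desc B', 'desc A'],
    'medical_specialty': [' Surgery', ' Urology', ' Radiology', ' Discharge Summary'],
})

cols = ['transcription', 'description']

# keep='first' (the default): only the second and later copies are flagged
later_copies = toy.duplicated(subset=cols)

# keep=False: every member of a duplicated group is flagged, including the first
all_copies = toy.duplicated(subset=cols, keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
```

Passing `keep=False` can be handy when inspecting the data, because it shows complete groups of duplicates rather than only the later copies.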

In [None]:
# View a slice of the dataframe
df.iloc[100:120]

There are quite a few duplicates. It seems that the same case was classified in several categories, probably at different times during the treatment. Note that one of the observations for a patient is sometimes (but not always) placed in a generic category such as "Discharge Summary" or "Consult - History and Phy.". "Surgery" is also so prevalent that it is almost uninformative. To get rid of these categories, I replace a generic category with the more informative one from its duplicate pair, if one exists. Then I drop duplicates with respect to the subset `description` and `transcription`, keeping the first observation of each.

In [None]:
# Replace uninformative medical specialties
df.reset_index(inplace=True, drop = True)
df['medical_specialty_new'] = df['medical_specialty']
generic_cat = [' Surgery', ' Consult - History and Phy.', ' Discharge Summary', ' SOAP / Chart / Progress Notes', ' Office Notes', ' Letters']
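# Rows are sorted by transcription and description, so duplicates are adjacent
for i in range(1, len(df)):
    same_case = (df.loc[i, 'transcription'] == df.loc[i-1, 'transcription']) and (df.loc[i, 'description'] == df.loc[i-1, 'description'])
    if same_case and (df.loc[i, 'medical_specialty_new'] not in generic_cat) and (df.loc[i-1, 'medical_specialty_new'] in generic_cat):
        # Use .loc to avoid chained assignment, which may not write back to df
        df.loc[i-1, 'medical_specialty_new'] = df.loc[i, 'medical_specialty_new']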

df.head(10)
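
The loop above compares each row only with its immediate neighbour. As a possible alternative (not the method used in this notebook), the same idea can be sketched with a `groupby`: blank out the generic labels, then give every row in a duplicate group the first specific label found in that group. The toy dataframe and the shortened `generic_cat` list below are assumptions made for illustration.

```python
import pandas as pd

# Shortened list of generic categories, for illustration only
generic_cat = [' Surgery', ' Consult - History and Phy.', ' Discharge Summary']

# Toy data (hypothetical values, not from mtsamples.csv)
toy = pd.DataFrame({
    'transcription': ['note A', 'note A', 'note A', 'note B'],
    'description':   ['desc A', 'desc A', 'desc A', 'desc B'],
    'medical_specialty': [' Surgery', ' Discharge Summary', ' Urology', ' Surgery'],
})

keys = ['transcription', 'description']

# Replace generic labels with NaN so that only specific labels remain
specific = toy['medical_specialty'].where(~toy['medical_specialty'].isin(generic_cat))

# 'first' skips missing values, so each group gets its first specific label
first_specific = specific.groupby([toy[k] for k in keys]).transform('first')

# Fall back to the original label when a group has no specific label at all
toy['medical_specialty_new'] = first_specific.fillna(toy['medical_specialty'])

print(toy)
```

Unlike the adjacent-row loop, this sketch updates every row of a group, so its results can differ when a case appears more than twice. Rows with a missing `transcription` or `description` are excluded from groups by default and simply keep their original label.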


In [None]:
# Drop duplicates
df_nd = df.drop_duplicates(subset = ['description', 'transcription'], keep = 'first').drop(columns='dup')
df_nd.reset_index(inplace = True, drop = True)
print('Shape new df:', df_nd.shape)
df_nd.head()

In [None]:
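# Drop observations with missing values; note that dropna() without `subset`
# removes rows with a NaN in any column, not only `transcription`
df_nd.dropna(inplace=True)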

# View data
df_nd.tail(20)
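
For reference, `dropna()` called without arguments drops a row if any of its columns is missing, whereas `subset` restricts the check to the named columns. The sketch below shows the difference on toy data; the second column, called `keywords` here, is a stand-in for any other column that may contain missing values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); `keywords` stands in for any other column
toy = pd.DataFrame({
    'transcription': ['note A', np.nan, 'note C'],
    'keywords':      ['kw A', 'kw B', np.nan],
})

# Drops every row that has a missing value in any column
any_missing = toy.dropna()

# Drops only the rows whose transcription is missing
no_transcription = toy.dropna(subset=['transcription'])

print(len(any_missing), len(no_transcription))
```

Which variant is appropriate depends on whether rows with an otherwise complete transcription should be kept for the analysis.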

In [None]:
# Value counts of original medical specialty
df_nd['medical_specialty'].value_counts(normalize=True).head(20)

In [None]:
# Value counts of modified specialty (on the deduplicated data, for comparison with the cell above)
df_nd['medical_specialty_new'].value_counts(normalize=True).head(20)

It seems there are still a lot of surgeries left, but it looks better.

### Generate random dates and locations

In [None]:
from random import choices, seed
seed(10)

## DATES
df_nd['year'] = choices([2010, 2013, 2016], k = len(df_nd)) 

## LOCATIONS
boroughs = gpd.read_file("https://skgrange.github.io/www/data/london_boroughs.json")
# Sort the unique names so the seeded draw is reproducible (set order can vary between runs)
df_nd['borough'] = choices(sorted(set(boroughs.name)), k = len(df_nd))

df_nd.head(10)
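
A note on reproducibility: Python randomizes string hashes per process by default, so the iteration order of a set of strings can change from one run to the next, and a seeded draw from `list(set(...))` may then return different labels. A minimal sketch of a draw that does not depend on set order is shown below; the helper `sample_labels` and the placeholder names are hypothetical and not part of this notebook.

```python
from random import Random

# Placeholder names; the notebook reads the real ones from the boroughs GeoJSON
names = ['Camden', 'Hackney', 'Camden', 'Islington']

def sample_labels(labels, k, seed_value=10):
    """Draw k labels with replacement from the unique, sorted labels."""
    rng = Random(seed_value)    # local generator, leaves the global seed untouched
    pool = sorted(set(labels))  # sorting fixes the order before sampling
    return rng.choices(pool, k=k)

first = sample_labels(names, k=5)
second = sample_labels(names, k=5)

print(first == second)
```

Using a local `Random` instance also keeps the draw independent of any other code that calls the module-level `random` functions.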

In [None]:
print('Frequency of random years')
df_nd['year'].value_counts(normalize=True)

In [None]:
print('Frequency of random locations')
df_nd['borough'].value_counts(normalize=True)

### Export clean dataset to CSV

In [None]:
# Save the whole modified dataframe
df_nd.to_csv('./mtsamples_clean.csv')

In [None]:
# Split description column into sentences and save for analysis
df_nd_select = df_nd[['description', 'medical_specialty_new', 'year', 'borough']].reset_index().values.tolist() # Add month if required by analysis

sentence_list = []
for line in df_nd_select:
    for sent in line[1].split('. '):
        if len(sent) >= 4: # drop empty or very short fragments (fewer than 4 characters)
            sentence_list.append([line[0], line[2], sent, line[3], line[4]]) # Append line[5] for month if required by analysis

sentences = pd.DataFrame(data = sentence_list, columns=['id_description','medical_specialty_new','text', 'year', 'borough']) # Add month if required by analysis

sentences.to_csv('./mtsamples_descriptions_clean.csv', index_label = 'id')
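
The sentence split above can also be sketched without an explicit loop, using `str.split` and `explode`. This is an alternative illustration on toy data, not the code used to produce the exported file. It assumes pandas 1.4 or later, where `str.split` accepts `regex=False`; without that argument, a multi-character pattern such as `'. '` is treated as a regular expression in which `.` matches any character.

```python
import pandas as pd

# Toy data (hypothetical values, not from mtsamples.csv)
toy = pd.DataFrame({
    'description': ['Chest pain. Rule out MI.', 'Knee arthroscopy'],
    'medical_specialty_new': [' Cardiovascular / Pulmonary', ' Orthopedic'],
    'year': [2010, 2016],
    'borough': ['Camden', 'Hackney'],
})

cols = ['id_description', 'medical_specialty_new', 'text', 'year', 'borough']

# Keep the original row index as the description id
base = toy.rename_axis('id_description').reset_index()

# Split on the literal string '. ' (regex=False), then put one sentence per row
base['text'] = base['description'].str.split('. ', regex=False)
exploded = base.explode('text', ignore_index=True)

# Drop empty or very short fragments, as in the loop above
keep = exploded['text'].str.len() >= 4
sentences_toy = exploded.loc[keep, cols].reset_index(drop=True)

print(sentences_toy)
```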