# Write a csv file for all the reports (no distinction between train and test data; supervised case only)

Author: Geeticka Chauhan

Before using this notebook, make sure that the data conversion for language modeling has been run, so that the csv file built from all the processed reports already exists.

In [2]:
%load_ext autoreload

In [3]:
%autoreload

import os
import sys
sys.path.append('../..')
import pandas as pd
import joint_img_txt.data.text.utils as utils
import joint_img_txt.data.text.lm_utils as lm_utils
import re
import joint_img_txt.data.text.preprocess as preprocess

# keep below as the directory where the following files are located: 1) class{0,1,2,3}.txt files 2) split.csv files
class_info_dir = '/data/vision/polina/projects/chestxray/work_space_v2/report_processing/edema_labels-12-03-2019/'
# keep below as the directory where gold standard labels are present; likely your processing 
# will be different from mine because my csv format was different from what you will have
seth_info_dir = '/data/vision/polina/projects/chestxray/geeticka/class_information/' # class information and reports are located here
# below is the location of the original MIMIC-CXR txt files
raw_reports_dir = '/data/vision/polina/projects/chestxray/data_v2/reports/'
# below is where you would like to keep the .tsv files
out_dir = '/data/vision/polina/projects/chestxray/geeticka/pre-processed/supervised/'
# below is the latest .csv split file; this should have the most up to date version of the labels as the model 
# will end up using the labels present in this file
latest_csv = '/data/vision/polina/projects/chestxray/work_space_v2/report_processing/edema_labels-12-03-2019/mimic-cxr-sub-img-edema-split.csv'
# this is where the outputted csv files will be located inside the directory reports_list/

def class_info_res(filename):
    return os.path.join(class_info_dir, filename)
def seth_labels_res(filename):
    return os.path.join(seth_info_dir, filename)
def raw_reports_res(filename):
    return os.path.join(raw_reports_dir, filename)
def out_res(filename): return os.path.join(out_dir, filename)
def lm_res(filename): return os.path.join('/data/vision/polina/projects/chestxray/geeticka/pre-processed/', filename)

if not os.path.exists(out_dir):
    os.mkdir(out_dir)

### Following are the different classes with their keywords:
class0 None: 'no pulmonary edema', 'no vascular congestion', 'no fluid overload', 'no acute cardiopulmonary process'

class1 Vascular congestion: 'mild pulmonary vascular congestion', 'cephalization', 'mild hilar engorgement', 'mild vascular plethora'

class2 Interstitial edema: 'kerley', 'interstitial edema', 'interstitial thickening', 'interstitial opacities'

class3 Alveolar edema: 'perihilar infiltrates', 'peri-hilar infiltrates', 'hilar infiltrates', 'alveolar infiltrates', 'severe pulmonary edema'
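
The keyword lists above suggest a simple labeling rule. The following is a minimal sketch, assuming (the notebook does not confirm this) that a report receives the most severe class whose keywords appear in its text. `CLASS_KEYWORDS` and `label_report` are hypothetical names and are not part of `joint_img_txt`.

```python
# Hypothetical sketch: keyword lists copied from the class descriptions above.
CLASS_KEYWORDS = {
    0: ['no pulmonary edema', 'no vascular congestion', 'no fluid overload',
        'no acute cardiopulmonary process'],
    1: ['mild pulmonary vascular congestion', 'cephalization',
        'mild hilar engorgement', 'mild vascular plethora'],
    2: ['kerley', 'interstitial edema', 'interstitial thickening',
        'interstitial opacities'],
    3: ['perihilar infiltrates', 'peri-hilar infiltrates', 'hilar infiltrates',
        'alveolar infiltrates', 'severe pulmonary edema'],
}


def label_report(text):
    """Return (severity, keywords_found), or (None, []) if nothing matches.

    Assumption: the most severe matching class wins.
    """
    # Lowercase and collapse whitespace so keywords split across lines still match.
    lowered = ' '.join(text.lower().split())
    for severity in sorted(CLASS_KEYWORDS, reverse=True):
        found = [kw for kw in CLASS_KEYWORDS[severity] if kw in lowered]
        if found:
            return severity, found
    return None, []


print(label_report('Kerley B lines are seen. No pleural effusion.'))
print(label_report('No acute cardiopulmonary\nprocess.'))
```

Plain substring matching is the simplest choice here; it ignores negation outside the class 0 phrases, so it is only a rough approximation of any real labeling pipeline.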

In [43]:
edema_pred_dict = {'none': 0 ,'vascular congestion': 1, 'interstitial edema': 2, 'alveolar edema': 3}
# might be interesting to consider the ranking loss function for the classification task in the future
# this needs to be used when generating the test data csv

In [44]:
data = pd.read_csv(latest_csv)
def convert_to_filename(row):
    # turn a numeric study id (e.g. 1232) into its report filename (e.g. s1232.txt)
    return 's' + str(row['study_id']) + '.txt'
data['study_id'] = data.apply(convert_to_filename, axis=1)
data_edema = data.loc[data['edeme_severity'] >= 0].reset_index(drop=True)

In [45]:
data_edema = data_edema.drop_duplicates(subset=['study_id', 'edeme_severity'])

In [46]:
len(data_edema)

6660

`data_edema` is smaller because the lateral views of some images are not available, so some reports were lost.

train.csv contains duplicates from test.csv; these need to be removed.

In [47]:
# this data is incomplete in terms of the keywords that I have
train_df_filename = utils.train_filename_df(class_info_res)

In [48]:
len(train_df_filename)

6710

In [49]:
# data_edema.rename(columns={'study_id':'filename',
#                           'Edema severity':'edema_severity'}, 
#                  inplace=True)

In [50]:
# data_edema = data_edema[['filename', 'edema_severity']]

In [51]:
# data_edema['metadata'] = '{}'
# train_df_filename = data_edema

Next todo: write the test file, then double-check that all the reports in the folder can be located. Also check whether any reports overlap between train and test. While writing the test file, make sure that filenames appear as s1232.txt rather than 1232.txt.
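
One way to guard against the `1232.txt` versus `s1232.txt` mismatch is to normalize every identifier through a single helper. This is a minimal sketch; `to_report_filename` is a hypothetical name, and it assumes study ids are purely numeric once the `s` prefix and `.txt` suffix are stripped.

```python
def to_report_filename(study_id):
    """Map 1232, '1232', '1232.txt' or 's1232.txt' to 's1232.txt'."""
    name = str(study_id).strip()
    if name.endswith('.txt'):
        name = name[:-len('.txt')]
    if name.startswith('s'):
        name = name[1:]
    # Assumption: what remains must be a numeric study id.
    if not name.isdigit():
        raise ValueError('unexpected study id: %r' % (study_id,))
    return 's' + name + '.txt'


print([to_report_filename(x) for x in (1232, '1232.txt', 's1232.txt')])
```

Applying the same helper to both the train and test filename columns makes the later overlap check compare like with like.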

## Write the train and test files

In [9]:
test_df_filename, duplicated_filenames = utils.test_filename_df(seth_labels_res, edema_pred_dict)

Radiologist (Seth) did not label report: s52375169.txt
Original test data frame is 199 long but there are 21 duplicated rows
Removed duplicates to return 178 length dataframe


In [10]:
test_filenames = test_df_filename['filename'].unique().tolist()

In [11]:
len(test_filenames)

178

In [12]:
train_filenames_overlap = train_df_filename[train_df_filename['filename'].isin(test_filenames)]['filename'].unique().tolist()
print("%d filenames from the test data are present in the train_filename.csv file. These will need to be removed"%(
len(train_filenames_overlap)))

178 filenames from the test data are present in the train_filename.csv file. These will need to be removed


In [13]:
set(test_filenames) - set(train_filenames_overlap)

set()

The set above should be empty.

#### Removing test filenames from the train data

In [14]:
train_df_filename = train_df_filename[~train_df_filename['filename'].isin(test_filenames)].reset_index(drop=True)

In [15]:
all_data_filename = pd.concat([train_df_filename, test_df_filename], axis=0, ignore_index=True)

Therefore, `all_data_filename` contains keyword-matching labels for the train files and Seth's labels for the test files. This is necessary for the testing stage.
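
The removal and merge steps can be checked on toy data. The sketch below assumes two columns, `filename` and `edema_severity`; the real dataframes returned by `utils.train_filename_df` and `utils.test_filename_df` may carry additional columns, and the toy frames here are stand-ins only.

```python
import pandas as pd

# Toy stand-ins for train_df_filename (keyword labels) and
# test_df_filename (radiologist labels); column names are assumptions.
train_df = pd.DataFrame({'filename': ['s1.txt', 's2.txt', 's3.txt'],
                         'edema_severity': [0, 2, 1]})
test_df = pd.DataFrame({'filename': ['s2.txt'], 'edema_severity': [3]})

test_names = test_df['filename'].unique().tolist()
# Drop every training row whose report also appears in the test set.
train_only = train_df[~train_df['filename'].isin(test_names)].reset_index(drop=True)
combined = pd.concat([train_only, test_df], axis=0, ignore_index=True)

# Each report should now appear exactly once, with s2.txt carrying the test label.
assert not combined['filename'].duplicated().any()
print(combined)
```

Removing the overlap before concatenating, rather than calling `drop_duplicates` afterwards, ensures that the radiologist label is the one that survives for test reports.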

## Write the train and test filename csv

In [16]:
if not(os.path.exists(out_res('reports_list'))):
    os.mkdir(out_res('reports_list'))

In [52]:
# utils.write_dataframe(all_data_filename, out_res('reports_list/all_data_list.csv'))
# # only write the above dataframe when the duplicates from the test data have been removed. 

# Now let's grab the actual text of these reports and tokenize it. We will use scispacy to do so.

In [53]:
all_filenames = all_data_filename['filename'].unique().tolist()

In [54]:
len(all_filenames)

6710

In [55]:
# # take in a line in a report and do some pre-processing like number normalization, sentence segmentation etc
# def pre_process_report(line_report):
#     # do this later

In [56]:
original_reports_df = lm_utils.read_dataframe(lm_res('lm_reports/original_reports.csv'))

In [57]:
all_data_reports = utils.write_report_into_df(all_data_filename, original_reports_df)

### Write the train and test files with original text

In [24]:
if not os.path.exists(out_res('reports')):
    os.mkdir(out_res('reports'))

In [58]:
# utils.write_dataframe(all_data_reports, out_res('reports/all_data.csv'))
# # # only write the above dataframe when the duplicates from the test data have been removed. 

In [59]:
# indexes = [2,10,18,34,40,48,55,61,63,70,72,73,74,76,88,101,106,119,147,148,155,159,165,169,170]
# for metadata in test_df_reports.iloc[indexes]['metadata']:
#     print(metadata['keywords_found'])
# #     print(report)
# #     print(report['metadata']['keywords_found'], '\n')

Now check whether any filename in the original csv that Ray is using is missing from `all_data_reports`; that should not be the case.
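
A set difference is enough for this check. The sketch below uses toy lists; `missing_filenames` is a hypothetical helper, and in the notebook the two inputs would presumably be `data_edema['study_id']` and the `filename` column of `all_data_reports` (that column name is an assumption).

```python
def missing_filenames(expected, available):
    """Return the sorted filenames in `expected` that are absent from `available`."""
    return sorted(set(expected) - set(available))


# Toy stand-ins: filenames from the split csv vs. filenames that have report text.
split_csv_filenames = ['s1.txt', 's2.txt', 's3.txt']
report_filenames = ['s1.txt', 's2.txt', 's3.txt', 's4.txt']

missing = missing_filenames(split_csv_filenames, report_filenames)
print('%d filenames from the split csv have no report' % len(missing))
```

Returning the sorted list rather than only its length makes it easy to inspect which reports are missing if the count is not zero.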

### Tokenize and pre-process the data

In [60]:
import spacy
import scispacy
nlp = spacy.load('en_core_sci_md')
# from scispacy.umls_linking import UmlsEntityLinker

In [61]:
# linker = UmlsEntityLinker()

In [62]:
# nlp.add_pipe(linker)

In [63]:
# punctuations = ['.', ',', ':', '?', '!', ';', '-', '(', ')', '{', '}', '"', "'"]
# def is_punct(char):
#     if char in punctuations:
#         return True
#     else:
#         return False

In [65]:
all_data_reports['normalized_report'] = all_data_reports.apply(preprocess.normalize_report, axis=1)

## Write the reports with normalized data

In [32]:
if not os.path.exists(out_res('reports_normalized')):
    os.mkdir(out_res('reports_normalized'))

In [None]:
# utils.write_dataframe(all_data_reports, out_res('reports_normalized/all_data_original.csv'))
# # only write the above dataframe when the duplicates from the test data have been removed. 

In [None]:
# utils.write_dataframe(test_df_reports, out_res('reports_normalized/test_original.csv'))
# # only write the above dataframe when the duplicates from the test data have been removed. 

## Now, the next step is to use the script located in chestxray_joint/data/text/data_splitting.py to generate the tsv files

This is the data conversion step
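
For orientation, here is a minimal sketch of producing a headerless, tab-separated file of the kind this step writes. The four-column layout (id, label, placeholder letter, text) is an assumption borrowed from common BERT fine-tuning scripts; the actual layout is whatever `utils.get_df_bert_multilabel` returns. `rows_to_tsv` is a hypothetical helper.

```python
import csv
import io


def rows_to_tsv(rows):
    """Serialize (id, label, text) rows into headerless TSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    for row_id, label, text in rows:
        # Collapse tabs and newlines so each report stays on one TSV line.
        clean_text = ' '.join(text.split())
        # 'a' is a placeholder column (assumed layout, see the note above).
        writer.writerow([row_id, label, 'a', clean_text])
    return buffer.getvalue()


tsv_text = rows_to_tsv([(0, 2, 'Kerley B lines\nare seen.'),
                        (1, 0, 'No pulmonary edema.')])
print(tsv_text)
```

Collapsing whitespace inside the report text matters because a stray tab or newline would otherwise shift columns or split one report across several rows.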

In [79]:
output_channel_encoding = 'multiclass' # or multilabel
training_mode = 'supervised' # or semisupervised

In [80]:
# TODO: fill this in to generate the tsv files and make the code more modular
tsv_in_dir = '/data/vision/polina/projects/chestxray/geeticka/pre-processed/' + training_mode + '/reports_normalized'
tsv_out_dir = '/data/vision/polina/projects/chestxray/geeticka/bert/converted_data/' + output_channel_encoding + \
'/' + training_mode

def tsv_in_res(filename): return os.path.join(tsv_in_dir, filename)
def tsv_out_res(filename): return os.path.join(tsv_out_dir, filename)

# development_or_test = 'development'

In [81]:
all_data_df = utils.read_dataframe(tsv_in_res('all_data_original.csv'))
# test_df = utils.read_dataframe(tsv_in_res('test_original.csv'))

In [82]:
if not os.path.exists(tsv_out_res('full')):
    os.makedirs(tsv_out_res('full'))

### Write the newly formed train and dev files (that are taken from original train data and are only to be used for tuning)

In [85]:
new_all_data_df_bert = utils.get_df_bert_multilabel(all_data_df, output_channel_encoding)
new_all_data_df_bert.to_csv(tsv_out_res('full/all_data.tsv'), sep='\t', index=False, header=False)
# # don't need to keep rewriting these so can comment above out

In [41]:
# new_dev_df_bert = utils.get_df_bert_multilabel(new_dev_df)
# new_dev_df_bert.to_csv(tsv_out_res('development/dev.tsv'), sep='\t', index=False, header=False)
# # don't need to keep rewriting these so can comment above out

Let's now check if Ray's splits csv file contains any filenames I don't have in the train and test

In [146]:
# directory = '/data/vision/polina/projects/chestxray/work_space_v2/report_processing/edema_labels-7-11-2019/mimic-cxr-sub-img-edema-finding-split_arranged.csv'
# data = pd.read_csv(directory)

In [147]:
# def convert_to_filename(row):
#     filename = row['study_id']
#     study_id = 's'+ str(filename) + '.txt'
#     return study_id
# data['study_id'] = data.apply(convert_to_filename, axis=1)

In [148]:
# all_filenames_by_img_model = data['study_id'].unique().tolist()

In [150]:
# all_filenames_by_txt_model = all_data_df['filename'].unique().tolist()

In [151]:
# len(all_filenames_by_img_model) # these are all the reports; they don't all have edema severity labels

15837

In [152]:
# len(all_filenames_by_txt_model)

3045

In [153]:
# data_edema = data.loc[data['Edema severity'] >= 0]

In [154]:
# edema_all_filenames_by_img_model = data_edema['study_id'].unique().tolist()

In [155]:
# len(edema_all_filenames_by_img_model)

3022

In [156]:
# len(set(edema_all_filenames_by_img_model) - set(all_filenames_by_txt_model))

0

### Write the original train and test files (that are to be used for reporting)

In [79]:
# train_df_bert = utils.get_df_bert_multilabel(train_df)
# train_df_bert.to_csv(tsv_out_res('testing/train.tsv'), sep='\t', index=False, header=False)

In [80]:
# test_df_bert = utils.get_df_bert_multilabel(test_df)
# test_df_bert.to_csv(tsv_out_res('testing/dev.tsv'), sep='\t', index=False, header=False)
# # we must call it dev for the purposes of the evaluation - that is just the name that the algorithm expects
# # this can probably be changed in the future

In [84]:
a = {1:'sd', 2:'sdas'}
list(a.keys())

[1, 2]