## this notebook is for labelling the WOHS Lung-RADS dataset

#### first install the labelling library

In [2]:
!pip install superintendent





#### now import needed libraries

In [3]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
from collections import Counter, OrderedDict
import re
from IPython import display
from superintendent import ClassLabeller

### import and check the data

this cell loads the data from the csv file into a pandas dataframe

In [5]:
data = pd.read_csv('lung_rads_data.csv')

this cell shows the first 5 rows of the dataframe to give a sense of what's in it

In [6]:
data.head()

Unnamed: 0,Exam,Procedure Date,Referral/Recall,Lung Rads,Rad Recommendation,Exam Result,Lung Rads Split,lr,mod,rec,extracted lung rads,lr num,needs f/u,report,labels
0,HIGH RISK LUNG - ANNUAL,2019-04-02,unknown,Lung-RADS:? Category: 3\r\n\r\nModifiers: S\r\...,Recommendation: 6 month follow-up has been ini...,CLINICAL INFORMATION: High Risk Screening.\r\n...,"['Lung-RADS: Category: 3\r\n\r\n', 'Modifiers:...",Lung-RADS: Category: 3\r\n\r\n,Modifiers: S\r\n\r\n,RECOMMENDATION: 6 month follow-up has been ini...,3,3,Y,CLINICAL INFORMATION: High Risk Screening.\r\n...,3
1,HIGH RISK LUNG - ANNUAL,2019-04-02,Book Annual,Lung-RADS:? Category: 2\r\n\r\nModifiers: None...,Recommendation: LDCT in 12 months,CLINICAL INFORMATION: High Risk Screening.\r\n...,"['Lung-RADS: Category: 2\r\n\r\n', 'Modifiers:...",Lung-RADS: Category: 2\r\n\r\n,Modifiers: None\r\n\r\n,RECOMMENDATION: LDCT in 12 months,2,2,N,CLINICAL INFORMATION: High Risk Screening.\r\n...,2
2,HIGH RISK LUNG - ANNUAL,2019-04-02,Book Annual,Lung-RADS:? Category: 1\r\n\r\nModifiers: None...,Recommendation: LDCT in 12 months,CLINICAL INFORMATION: High Risk Screening.\r\n...,"['Lung-RADS: Category: 1\r\n\r\n', 'Modifiers:...",Lung-RADS: Category: 1\r\n\r\n,Modifiers: None\r\n\r\n,RECOMMENDATION: LDCT in 12 months,1,1,N,CLINICAL INFORMATION: High Risk Screening.\r\n...,1
3,HIGH RISK LUNG - ANNUAL,2019-04-02,Book Annual,Lung-RADS:? Category: 2\r\n\r\nModifiers: None...,Recommendation: LDCT in 12 months,CLINICAL INFORMATION: High Risk Screening.\r\n...,"['Lung-RADS: Category: 2\r\n\r\n', 'Modifiers:...",Lung-RADS: Category: 2\r\n\r\n,Modifiers: None\r\n\r\n,RECOMMENDATION: LDCT in 12 months,2,2,N,CLINICAL INFORMATION: High Risk Screening.\r\n...,2
4,HIGH RISK LUNG - ANNUAL,2019-04-02,Book Annual,Lung-RADS:? Category: 2\r\n\r\nModifiers: None...,Recommendation: LDCT in 12 months,CLINICAL INFORMATION: High Risk Screening.\r\n...,"['Lung-RADS: Category: 2\r\n\r\n', 'Modifiers:...",Lung-RADS: Category: 2\r\n\r\n,Modifiers: None\r\n\r\n,RECOMMENDATION: LDCT in 12 months,2,2,N,CLINICAL INFORMATION: High Risk Screening.\r\n...,2


this cell prints the shape of the dataframe: in this case 6101 rows and 15 columns

In [23]:
data.shape

(6101, 15)

this cell shows the names of the columns

In [8]:
data.columns

Index(['Exam', 'Procedure Date', 'Referral/Recall', 'Lung Rads',
       'Rad Recommendation', 'Exam Result', 'Lung Rads Split', 'lr', 'mod',
       'rec', 'extracted lung rads', 'lr num', 'needs f/u', 'report',
       'labels'],
      dtype='object')
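
optional sanity check, as a rough sketch: the labelling step further down reads the `report` and `labels` columns, so it can help to confirm both are present before going on. the column list here is copied from the output above; the `required` set and the error message are illustrative choices only.

```python
# column names copied from the output above
columns = ['Exam', 'Procedure Date', 'Referral/Recall', 'Lung Rads',
           'Rad Recommendation', 'Exam Result', 'Lung Rads Split', 'lr', 'mod',
           'rec', 'extracted lung rads', 'lr num', 'needs f/u', 'report',
           'labels']

# the two columns the labelling step reads from (illustrative name)
required = {'report', 'labels'}

# anything in `required` that is not in the dataframe
missing = required - set(columns)
if missing:
    raise KeyError(f"missing columns: {sorted(missing)}")
print(f"{len(columns)} columns, all required ones present")
```

in the notebook itself, `set(data.columns)` would take the place of the hard-coded list.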

## labelling

this cell counts how many reports have each label  
   
the one to watch is the missing-label entry (`nan`, or `None` once the next cell has been run), which means the report has no label yet

In [24]:
Counter(data.labels)

Counter({'3': 271,
         '2': 3118,
         '1': 2218,
         '0': 3,
         '4': 187,
         '4a': 2,
         '4x': 3,
         '4b': 3,
         None: 296})
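
to put the unlabelled count in context, here is a small sketch that works out its share of the whole dataset. the counts are copied from the output above; the variable names are just for illustration.

```python
from collections import Counter

# counts copied from the output above
label_counts = Counter({'3': 271, '2': 3118, '1': 2218, '0': 3,
                        '4': 187, '4a': 2, '4x': 3, '4b': 3, None: 296})

total = sum(label_counts.values())   # every report, labelled or not
unlabelled = label_counts[None]      # reports that still need a label

print(f"{unlabelled} of {total} reports unlabelled ({unlabelled / total:.1%})")
```

the total should match the number of rows reported by `data.shape`.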

missing labels (`nan`) need to be replaced with `None` for the labeller

In [25]:
data['labels'] = data['labels'].replace({np.nan: None})

Counter(data['labels'])

Counter({'3': 271,
         '2': 3118,
         '1': 2218,
         '0': 3,
         '4': 187,
         '4a': 2,
         '4x': 3,
         '4b': 3,
         None: 296})
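
a side note, sketched on a toy series rather than the real data: pandas counts both `nan` and `None` as missing, so `isna()` gives the number of unlabelled rows whichever of the two is in the column. the `toy` series below is made up for the example.

```python
import numpy as np
import pandas as pd

# made-up labels: one missing value stored as nan, one stored as None
toy = pd.Series(['1', np.nan, '2', None], dtype=object)

# isna() flags both kinds of missing value
n_missing = int(toy.isna().sum())
print(n_missing)
```

on the real dataframe the equivalent would be `data['labels'].isna().sum()`.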

run the next cell by pressing Shift + Enter

In [26]:
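labelling_widget = ClassLabeller(
    features=data['report'],  # the report text shown for each item
    labels=data['labels'],  # existing labels, with None for rows not yet labelled
    display_func=lambda x: display.display(display.Markdown(x)),  # render each report as Markdown
    options=['0', '1', '2', '3', '4', '4a', '4b', '4x', 'uncertain'],  # the labels to choose from
)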

labelling_widget

ClassLabeller(children=(HBox(children=(FloatProgress(value=0.0, description='Progress:', max=1.0),)), Box(chil…
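
another optional check, as a sketch only: every label already in the data should be one of the options offered by the widget, otherwise the existing labels and the new ones would not line up. the label values are copied from the counts shown earlier; `unexpected` is an illustrative name.

```python
# options passed to the labelling widget above
options = ['0', '1', '2', '3', '4', '4a', '4b', '4x', 'uncertain']

# distinct label values seen in the counts earlier (None = not yet labelled)
existing = ['3', '2', '1', '0', '4', '4a', '4x', '4b', None]

# labels present in the data that the widget does not offer
unexpected = sorted(
    label for label in set(existing)
    if label is not None and label not in options
)
print(unexpected)
```

an empty list means the existing labels are all covered by the options.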

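when done, run the next cell to check the counts. the None count won't drop yet, because the new labels stay in the widget until they are copied into the dataframe in the cell after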

In [27]:
Counter(data['labels'])

Counter({'3': 271,
         '2': 3118,
         '1': 2218,
         '0': 3,
         '4': 187,
         '4a': 2,
         '4x': 3,
         '4b': 3,
         None: 296})

run the next cell to update the labels in the dataframe with the new labels you just created

In [28]:
data['labels'] = labelling_widget.new_labels

In [29]:
Counter(data['labels'])

Counter({'3': 278,
         '2': 3300,
         '1': 2306,
         '0': 4,
         '4': 191,
         '4a': 9,
         '4x': 8,
         '4b': 3,
         None: 2})
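
to see what the labelling session actually added, the two sets of counts can be compared. this is a sketch using the numbers copied from the outputs above; subtracting one `Counter` from another keeps only the positive differences.

```python
from collections import Counter

# counts before labelling, copied from the earlier output
before = Counter({'3': 271, '2': 3118, '1': 2218, '0': 3,
                  '4': 187, '4a': 2, '4x': 3, '4b': 3, None: 296})

# counts after labelling, copied from the output above
after = Counter({'3': 278, '2': 3300, '1': 2306, '0': 4,
                 '4': 191, '4a': 9, '4x': 8, '4b': 3, None: 2})

# labels gained per category (zero and negative differences are dropped)
added = after - before

# how many previously unlabelled reports now have a label
newly_labelled = before[None] - after[None]

print(added)
print(newly_labelled)
```

the gains per category should add up to the drop in the None count.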

save the updated dataframe back to the csv file

In [30]:
data.to_csv('lung_rads_data.csv', index=False)
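
one thing to keep in mind when the file is loaded again: a `None` label is written to the csv as an empty field and comes back as `nan`, which is why the replacement step is needed on every run. the sketch below shows this on a made-up two-row-plus-one dataframe, writing to an in-memory buffer instead of a real file.

```python
import io

import pandas as pd

# made-up data: the middle report has no label yet
toy = pd.DataFrame({
    'report': ['report a', 'report b', 'report c'],
    'labels': ['1', None, '4a'],
})

# write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# the None label comes back as a missing value
missing_mask = reloaded['labels'].isna().tolist()
print(missing_mask)
print(list(reloaded.columns))
```

`index=False` also keeps the row index out of the file, so no extra index column appears on reload.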