
SmokeShine/Convolutional-Attention-forMultiLabel-classification-CAML


Steps

  1. Download the MIMIC-III data from physionet.org
  2. Unzip the files and extract the CSVs
  3. Start the Hadoop cluster
    3.1 Switch to the hdoop user: su - hdoop
    3.2 ssh localhost
    3.3 Start the services:
    cd hadoop-3.2.1/sbin/
    ./start-dfs.sh
    ./start-yarn.sh
    3.4 Check jps for the running services (an optional Python check follows the output below):
    jps
hdoop@pop-os:~/hadoop-3.2.1/sbin$ jps
54400 DataNode
54289 NameNode
54599 SecondaryNameNode
55287 Jps
54907 NodeManager
54812 ResourceManager
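
If jps shows all five daemons, an optional extra check (not part of this repository) is to poll the NameNode web UI, which Hadoop 3.x serves on port 9870 by default:

# Optional sanity check: the NameNode web UI should answer once HDFS is up.
# Assumes Hadoop 3.x's default NameNode port 9870 on localhost.
import urllib.request

with urllib.request.urlopen("http://localhost:9870/") as resp:
    print(resp.status)  # 200 once the NameNode is serving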
  4. Move the files to HDFS (a PySpark read-back sketch follows the listing below)
    4.1 (hw5) hdoop@pop-os:~$ cd /home/ubuntu/Documents/CAML_GroupProject/physionet.org/files/mimiciii/1.4
    4.2 hdfs dfs -put NOTEEVENTS.csv /input/raw_data/
    4.3 hdfs dfs -put DIAGNOSES_ICD.csv /input/raw_data/
    4.4 hdfs dfs -put PROCEDURES_ICD.csv /input/raw_data/
    4.5 Check the content on disk: hdfs dfs -ls /input/raw_data/
(hw5) hdoop@pop-os:/home/ubuntu/Documents/CAML_GroupProject/physionet.org/files/mimiciii/1.4$ hdfs dfs -ls /input/raw_data/
Found 2 items
-rw-r--r--   1 hdoop supergroup   12548562 2021-04-09 16:19 /input/raw_data/NOTEEVENTS.csv
-rw-r--r--   1 hdoop supergroup   19137527 2021-04-09 16:30 /input/raw_data/DIAGNOSES_ICD.csv
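
As a hedged sketch (assuming PySpark is available in the hw5 environment), the uploads can be read back through the same hdfs://localhost:9000 address that main.py uses by default:

# Sketch: read the uploaded CSVs back from HDFS with PySpark to verify them.
# multiLine/escape handle the quoted free-text notes in NOTEEVENTS.csv.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caml-check").getOrCreate()
server = "hdfs://localhost:9000"  # main.py's default --server address

notes = spark.read.csv(server + "/input/raw_data/NOTEEVENTS.csv",
                       header=True, multiLine=True, escape='"')
codes = spark.read.csv(server + "/input/raw_data/DIAGNOSES_ICD.csv",
                       header=True)
print(notes.count(), codes.count())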
  5. Activate the conda environment: source anaconda3/bin/activate
  6. Activate the pyspark environment: source activate hw5
  7. From the host, give full permissions on the Python source code folder:
chmod 777 src
  8. Make main.py executable: sudo chmod 777 main.py
    8.1 ./main.py -h
  9. Usage (an argparse sketch follows the help output below)
usage: main.py [-h] [--server [SERVER]] [--data_processing] [--gpu] [--train]
               [--embedding_size [EMBEDDING_SIZE]] [--batch_size [BATCH_SIZE]]
               [--num_workers [NUM_WORKERS]] [--num_epochs [NUM_EPOCHS]]
               [--patience [PATIENCE]]
               [--max_length_of_sentence [MAX_LENGTH_OF_SENTENCE]]
               [--top_k [TOP_K]] [--prediction_samples [PREDICTION_SAMPLES]]

Final Group Project

optional arguments:
  -h, --help            show this help message and exit
  --server [SERVER]     Provide address of server (default:
                        hdfs://localhost:9000)
  --data_processing     Data Preprocessing (default: False)
  --gpu                 Use GPU for training (default: False)
  --train               Train Model (default: False)
  --embedding_size [EMBEDDING_SIZE]
                        Size of embedding layer (default: 100)
  --batch_size [BATCH_SIZE]
                        Batch size for training the model (default: 32)
  --num_workers [NUM_WORKERS]
                        Number of Available CPUs (default: 5)
  --num_epochs [NUM_EPOCHS]
                        Number of Epochs for training the model (default: 10)
  --patience [PATIENCE]
                        Number of epochs Early Stopping (default: 2)
  --max_length_of_sentence [MAX_LENGTH_OF_SENTENCE]
                        Maximum length of sentence for Spark Word2Vec Model
                        (default: 1000)
  --top_k [TOP_K]       Top k predictions (default: 5)
  --prediction_samples [PREDICTION_SAMPLES]
                        Number of prediction samples (default: 5)
  10. Run the full pipeline: ./main.py --data_processing --gpu --train
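
The --embedding_size and --max_length_of_sentence defaults line up with Spark ML's Word2Vec parameters (vectorSize and maxSentenceLength). As an illustration only, assuming pyspark.ml is used for the embedding step:

# Illustrative only: how the two flags could map onto Spark's Word2Vec.
# minCount is lowered to 1 so this toy one-row example builds a vocabulary.
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caml-w2v").getOrCreate()
df = spark.createDataFrame([("chest pain and shortness of breath",)], ["TEXT"])
tokens = Tokenizer(inputCol="TEXT", outputCol="tokens").transform(df)

w2v = Word2Vec(vectorSize=100,          # --embedding_size default
               maxSentenceLength=1000,  # --max_length_of_sentence default
               minCount=1, inputCol="tokens", outputCol="embedding")
model = w2v.fit(tokens)
model.transform(tokens).select("embedding").show(truncate=False)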

About

Explainable Prediction of Medical Codes from Clinical Text https://arxiv.org/abs/1802.05695
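
For orientation, the core idea of the paper is a convolution over word embeddings followed by a separate attention vector per ICD code. The PyTorch sketch below restates that idea with illustrative dimensions; it is not the code in this repository:

# Hedged sketch of CAML-style per-label attention (dimensions illustrative).
import torch
import torch.nn as nn

class CAML(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=50,
                 kernel_size=10, num_labels=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size,
                              padding=kernel_size // 2)
        self.U = nn.Linear(num_filters, num_labels)      # per-label attention
        self.final = nn.Linear(num_filters, num_labels)  # per-label classifier

    def forward(self, x):  # x: (batch, seq_len) token ids
        h = torch.tanh(self.conv(self.embed(x).transpose(1, 2)))  # (B, F, L)
        alpha = torch.softmax(self.U.weight @ h, dim=2)           # (B, C, L)
        m = alpha @ h.transpose(1, 2)                             # (B, C, F)
        return self.final.weight.mul(m).sum(2) + self.final.bias  # (B, C)

logits = CAML(vocab_size=5000)(torch.randint(1, 5000, (2, 200)))
print(torch.topk(torch.sigmoid(logits), k=5, dim=1).indices)

Training such a model with nn.BCEWithLogitsLoss over the per-code logits and taking torch.topk at inference corresponds to the --top_k flag above.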
