# Document classification project
The goal of this project is to classify (categorize) medical articles based on their content. Each article discusses one <b>(or more)</b> cardiovascular-related diseases. Each article includes:
- An identification number. 
- A "title" and/or an "abstract": the title (if present) is always the first sentence of the document and is not marked up differently from the rest of the text.
- One (or more) assigned labels. 

| Class description                                     |Code|
|-------------------------------------------------------|----|
| Bacterial Infections and Mycoses                      | C01|
| Virus Diseases                                        | C02|
| Parasitic Diseases                                    | C03|
| Neoplasms                                             | C04|
| Musculoskeletal Diseases                              | C05|
| Digestive System Diseases                             | C06|
| Stomatognathic Diseases                               | C07|
| Respiratory Tract Diseases                            | C08|
| Otorhinolaryngologic Diseases                         | C09|
| Nervous System Diseases                               | C10|
| Eye Diseases                                          | C11|
| Urologic and Male Genital Diseases                    | C12|
| Female Genital Diseases and Pregnancy Complications   | C13|
| Cardiovascular Diseases                               | C14|
| Hemic and Lymphatic Diseases                          | C15|
| Neonatal Diseases and Abnormalities                   | C16|
| Skin and Connective Tissue Diseases                   | C17|
| Nutritional and Metabolic Diseases                    | C18|
| Endocrine Diseases                                    | C19|
| Immunologic Diseases                                  | C20|
| Disorders of Environmental Origin                     | C21|
| Animal Diseases                                       | C22|
| Pathological Conditions, Signs and Symptoms           | C23|

This project (and its data) is divided into two main phases:
- Phase-1: Multi-class classification (Milestones 1 and 2)
- Phase-2: Multi-label classification (Milestone 3)

### Phase-1: Multi-class classification (Milestones 1 and 2)
Training set examples in this phase are each associated with <b>only one</b> of the above-mentioned diseases. Hence, your job is to assign <b>only one</b> disease label to each document. The corpus is partitioned into a training set (70%) and a test set (30%) and is stored in the file "data/project_dc_multiclass.gz", which can be read with the following function:

In [1]:
import gzip
import json


def read_gziped_json(file_address):
    """Load a gzip-compressed, UTF-8 encoded JSON file and return the parsed object."""
    # Text mode lets gzip handle decompression and decoding in one step.
    with gzip.open(file_address, "rt", encoding="utf-8") as f:
        return json.load(f)

multiclass_corpus = read_gziped_json("data/project_dc_multiclass.gz")
print (type(multiclass_corpus))
print (multiclass_corpus.keys())
print ("number of training set examples:" , len (multiclass_corpus["training"]))
print ("number of test set     examples:" , len (multiclass_corpus["test"]))

<class 'dict'>
dict_keys(['test', 'training'])
number of training set examples: 12802
number of test set     examples: 5500


In [2]:
for example in multiclass_corpus["training"][0:3]:
    print (example , "\n")

['0025164', 'The level of tryptase in human tears. An indicator of activation of conjunctival mast cells.\n Tryptase, a neutral endoprotease, is secreted by activated mast cells in human tissues.\n Tryptase levels in various body fluids have been used as an indicator of mast cell activation.\n The authors determined tryptase levels in unstimulated tears collected from the following groups of patients: (1) normal control, (2) nonallergic ocular inflammation, (3) asymptomatic seasonal allergic conjunctivitis, (4) symptomatic seasonal allergic conjunctivitis, (5) vernal conjunctivitis, and (6) contact lens-associated giant papillary conjunctivitis.\n They also assessed the release of tryptase into the tear fluid after provoking the conjunctiva with (7) allergens, (8) compound 48/80, and (9) rubbing.\n Tryptase levels were elevated in tears of patients with active ocular allergy and also increased after provoking the conjunctiva with allergens in atopic subjects and with compound 48/80 and

In [3]:
for example in multiclass_corpus["test"][0:3]:
    print (example , "\n")

['0017649', 'Ampicillin-resistant enterococcal species in an acute-care hospital.\n A prospective review of all enterococcal isolates for 13 months showed that 9.0% were resistant to ampicillin (MIC, greater than or equal to 16 micrograms/ml; zone diameter, less than 15 mm), as determined by the Vitek system, disk diffusion, microdilution MIC testing, and macrodilution MIC testing.\n All were beta-lactamase negative.\n A total of 19 and 3 resistant isolates were from urine and intravascular sites, respectively.\n Ampicillin-resistant enterococci appear to be a growing clinical problem.\n', 'C01'] 

['0025450', 'Pharmacokinetics of ceftazidime in serum and suction blister fluid during continuous and intermittent infusions in healthy volunteers.\n The pharmacokinetics of ceftazidime were investigated during intermittent (II) and continuous (CI) infusion in eight healthy male volunteers in a crossover fashion.\n The total daily dose was 75 mg/kg of body weight per 24 h in both regimens, g

### Phase-2: Multi-label classification (Milestone 3)
Examples in this phase are associated with <b>at least one</b> of the above-mentioned diseases. Hence, your job is to assign <b>one or more</b> disease labels to each document. The corpus is stored in the file "data/project_dc_multilabel.gz" and can be read with the same `read_gziped_json` function:

In [4]:
multilabel_corpus = read_gziped_json("data/project_dc_multilabel.gz")
print ("number of examples:" , len (multilabel_corpus))
for example in multilabel_corpus[0:3]:
    print (example , "\n")


number of examples: 34389
['0036407', 'Inhalation of foreign bodies by children: review of experience with 74 cases from Dubai.\n Seventy four out of 94 cases of bronchoscopy carried out over a five year period are reviewed.\n The clinical history of choking followed by recurrent spasmodic cough were found to be the most important element in making the diagnosis and proceeding to diagnostic and therapeutic bronchoscopy.\n Radiology was inferior as a diagnostic aid although radioactive scanning may be helpful in difficult cases.\n', ['C08', 'C21']] 

['0035922', 'Mechanism of cyclosporine-induced hypertension.\n Cyclosporine is a common immunosuppressive agent used in solid organ and bone marrow transplants and the treatment of some immunological diseases.\n It has been established that treatment with cyclosporine can cause a patient to develop hypertension within a few weeks of treatment.\n This review will examine this effect and effective ways to treat it.\n', ['C14', 'C23']] 

['004

Note that the corpus still contains documents with a single label, which is normal in a multi-label setup. The following code shows some of them:

In [5]:
x = [example for example in multilabel_corpus if len(example[2])==1]
for example in x[0:3]:
    print (example, "\n")

['0013473', "Patients in a persistent vegetative state attitudes and reactions of family members.\n Patients in a persistent vegetative state (PVS) constituted approximately 3% of the population in four Milwaukee nursing homes.\n In order to understand family members' attitudes and reactions toward such patients, 33 (92%) of 36 family members of patients in PVS contacted were studied.\n The age of the patients ranged from 19 to 95 with a mean age of 73.4 +/- 17.2 years, and family members' ages ranged from 41 to 89 with a mean age of 61.8 +/- 3.3 years.\n The etiology of the PVS varied from dementia to cerebral trauma.\n The mean duration of the PVS was 54 +/- 8.4 months (range 12 to 204).\n Family members reported that they visited patients 260 times during the first year following the onset of the PVS and were still visiting at a rate of 209 visits yearly at the time of the interview.\n There was no significant correlation between the frequency of the family members visits and the du

In [6]:
x = [example for example in multilabel_corpus if len(example[2])>5]
for example in x[0:3]:
    print (example, "\n")

['0030176', 'Lower limb problems in diabetic patients. What are the causes? What are the remedies?\n Peripheral neuropathy, infection, and peripheral vascular disease can produce serious problems in diabetic patients, particularly in the lower limbs.\n Ulceration of the foot may progress to gangrene and ultimately necessitate amputation.\n Distal symmetric polyneuropathy causes sensory loss.\n Such loss in patients with peripheral vascular disease creates a high risk for foot ulcers, which are vulnerable to infection.\n Treatment includes relief of neuropathic pain and antibiotic therapy for infection.\n Pentoxifylline (Trental) improves microvascular flow and appears to be effective against peripheral vascular disease.\n Aldose reductase inhibitors are being investigated as therapy for diabetic neuropathy.\n Prevention is the mainstay of management in these patients.\n Patient education is essential to help maintain health and prevent the potential adverse effects of diabetes.\n', ['C

# Milestones

## Phase 1: Multi-class classification (Milestones 1 and 2) 
In this phase, you will work with the given multi-class training and test sets. 
You should use micro-averaged F1-score as the evaluation/optimization metric. 
See: http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html 
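
To make the metric concrete, here is a minimal pure-Python sketch of what micro-averaging computes: true positives, false positives, and false negatives are pooled over all classes before precision and recall are taken. The helper name `micro_f1` and the toy labels are illustrative assumptions, not part of the project code. In practice the linked scikit-learn function gives the same quantity via `f1_score(y_true, y_pred, average="micro")`.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for single-label predictions.

    Counts are pooled over all classes, then F1 is computed once
    from the pooled precision and recall.
    """
    tp = fp = fn = 0
    for true_label, predicted_label in zip(y_true, y_pred):
        if true_label == predicted_label:
            tp += 1
        else:
            fp += 1  # the wrongly predicted class gains a false positive
            fn += 1  # the true class gains a false negative
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, invented for illustration only.
y_true = ["C01", "C04", "C14", "C14", "C23"]
y_pred = ["C01", "C14", "C14", "C14", "C04"]
score = micro_f1(y_true, y_pred)
print(round(score, 3))
```

Because every document gets exactly one true and one predicted label in Phase 1, each error adds one false positive and one false negative, so the micro-averaged F1 equals plain accuracy in this setting.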


### 1.1 Train a bag-of-words ANN as the baseline classifier
Train a bag-of-words ANN classifier on the multi-class training data, and then predict (assign) a label for each given document in the multi-class test set. Report your results with different network hyperparameters.
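
As a rough illustration of the input representation, the sketch below turns documents into bag-of-words count vectors using only the standard library. The function names, the whitespace tokenizer, and the `max_size` parameter are assumptions made for this example; a real pipeline would likely use a proper tokenizer and a library vectorizer.

```python
from collections import Counter


def build_vocabulary(texts, max_size):
    """Map the max_size most frequent tokens to column indices."""
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_size)]
    return {token: index for index, token in enumerate(most_common)}


def bag_of_words(text, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for token in text.lower().split():
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Toy documents, invented for illustration only.
toy_texts = ["tryptase levels in tears", "ampicillin resistant isolates in urine"]
vocabulary = build_vocabulary(toy_texts, max_size=1000)
vectors = [bag_of_words(text, vocabulary) for text in toy_texts]
print(len(vocabulary), vectors[0])
```

Each vector has one entry per vocabulary word and can be fed directly to the input layer of a feed-forward network. Word order is discarded, which is the main limitation the CNN experiments try to address.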

You should also try using the pre-trained word embeddings: http://evexdb.org/pmresources/vec-space-models/PubMed-and-PMC-w2v.bin (see: http://bio.nlplab.org/pdf/pyysalo13literature.pdf). Try using the top 100K, 200K, 500K, and 1M most frequent words in the model.


### 1.2 Improve classification performance using CNNs
In this step, you have to improve the classification performance by training different ConvNets. This means you should try to achieve a higher micro-averaged F1-score on the test set than you achieved with the bag-of-words ANN.

- Try different CNN architectures, such as multiple convolutional layers with different window sizes.
- Try different activation functions.
- Try adding one or more fully connected hidden layers after the convolutional/pooling layers, before the decision layer.
- Try using the given pre-trained word-embeddings.
- Try various network hyperparameters. 

Analyze different neural network architectures and hyperparameters. 
Report the results you achieve using different architectures/hyperparameters. 

### 2.1 Analyze confusion-matrix 
For your best CNN architecture, print (or plot) the confusion matrix for the test set
(see: https://en.wikipedia.org/wiki/Confusion_matrix) and analyze it. For example, report which labels are easiest and hardest to predict (i.e., for which labels your classifier works best and worst).
Which labels are very hard to distinguish (separate) from each other?
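
One possible way to approach this analysis is sketched below with the standard library only. It derives per-class recall (how often each true label is predicted correctly) and the most frequent (true, predicted) error pairs from the confusion counts. The helper names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of documents of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def most_confused_pairs(y_true, y_pred, top_n=3):
    """Most frequent off-diagonal cells as ((true, predicted), count)."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top_n)


# Toy labels, invented for illustration only.
y_true = ["C14", "C14", "C14", "C23", "C23", "C04"]
y_pred = ["C14", "C23", "C23", "C23", "C14", "C04"]
recall = per_class_recall(y_true, y_pred)
confused = most_confused_pairs(y_true, y_pred)
print(recall)
print(confused)
```

Classes with low recall are candidates for the "hardest" labels, and the largest off-diagonal cells point to label pairs the classifier struggles to separate.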

### 2.2 Analyze the convolutional kernels
Using the example code shown during the lectures, analyze where your convolutional kernels activate. Try to explain what each convolutional kernel learns.
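
The lecture code is not reproduced here. As a generic sketch of the idea, the snippet below slides a single 1D kernel over a sequence of token embeddings (no padding, ReLU activation) and reports the n-gram at which the kernel fires most strongly. The function names, the two-dimensional embeddings, and the kernel weights are all invented for illustration; with a trained model you would read the weights from the network instead.

```python
def kernel_activations(embedded_tokens, kernel, bias=0.0):
    """Apply one convolutional kernel at every window position (valid padding, ReLU)."""
    width = len(kernel)
    activations = []
    for start in range(len(embedded_tokens) - width + 1):
        window = embedded_tokens[start:start + width]
        total = bias + sum(
            weight * value
            for kernel_row, token_vector in zip(kernel, window)
            for weight, value in zip(kernel_row, token_vector)
        )
        activations.append(max(0.0, total))
    return activations


def strongest_ngram(tokens, activations, width):
    """Return the token window with the highest activation and its value."""
    best = max(range(len(activations)), key=activations.__getitem__)
    return tokens[best:best + width], activations[best]


# Toy tokens, embeddings and kernel weights, invented for illustration only.
tokens = ["ampicillin", "resistant", "enterococci", "in", "urine"]
embedded = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 0.0], [0.2, 0.1]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # window size 2, embedding size 2

activations = kernel_activations(embedded, kernel)
ngram, value = strongest_ngram(tokens, activations, width=len(kernel))
print(ngram, value)
```

Collecting the strongest n-grams of one kernel over many documents gives a rough picture of the pattern that kernel responds to.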

## Phase 2: Multi-label classification (Milestone 3) 
In this phase, you have to work with the given multi-label dataset. 
Because there is no separate train/test set, you should first randomly partition the corpus into a 65% training set and a 35% test set.

- Note that a single document may be associated with more than one label! 
- Use <b>the same</b> training/test set split for all of your experiments! 
- Use F1-score or accuracy for optimization/evaluation.
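
A simple way to keep the split identical across experiments is to shuffle with a fixed random seed. The sketch below assumes the corpus is a list of examples, as returned for the multi-label file; the function name `split_corpus`, the seed value, and the toy corpus are illustrative assumptions.

```python
import random


def split_corpus(examples, train_fraction=0.65, seed=13):
    """Shuffle a copy of the corpus with a fixed seed and cut it into two parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy corpus in the same [id, text, labels] layout, invented for illustration only.
toy_corpus = [[f"{i:07d}", f"document {i}", ["C14"]] for i in range(20)]
training, test = split_corpus(toy_corpus)
print(len(training), len(test))
```

Calling the function again with the same seed on the same corpus returns the same partition, so results from different architectures remain comparable.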

### 3. CNNs for multi-label dataset
- Modify the CNN architecture/learning parameters so that it can be used for multi-label document classification. 
- Explain the main differences between your multi-class and multi-label setup. 
- Train the network on your training set and then predict labels for each of the documents in your test set. 
- Try to optimize your CNN to achieve a higher score on the test set.
- Comprehensively discuss how your CNN performs on the test set for multi-label classification. 
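
As a starting point for the multi-label setup, the sketch below shows one common way to encode and decode label sets: targets become multi-hot vectors over the 23 codes, and predictions are decoded by thresholding an independent sigmoid per label rather than taking a single argmax. This is an assumption about a typical design, not a prescribed solution; the helper names, the 0.5 threshold, and the toy logits are invented for illustration.

```python
import math

LABELS = [f"C{i:02d}" for i in range(1, 24)]  # C01 ... C23, as in the class table


def to_multi_hot(example_labels):
    """Encode a list of label codes as a 0/1 vector over all 23 classes."""
    return [1 if label in example_labels else 0 for label in LABELS]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_multilabel(logits, threshold=0.5):
    """Keep every label whose independent sigmoid probability reaches the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


def micro_f1_multilabel(true_sets, predicted_sets):
    """Micro-averaged F1 over label sets, pooling counts across documents."""
    tp = sum(len(set(t) & set(p)) for t, p in zip(true_sets, predicted_sets))
    fp = sum(len(set(p) - set(t)) for t, p in zip(true_sets, predicted_sets))
    fn = sum(len(set(t) - set(p)) for t, p in zip(true_sets, predicted_sets))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy target and network outputs, invented for illustration only.
target = to_multi_hot(["C08", "C21"])
logits = [-3.0] * len(LABELS)
logits[LABELS.index("C08")] = 2.0
logits[LABELS.index("C14")] = 0.5
predicted = decode_multilabel(logits)
score = micro_f1_multilabel([["C08", "C21"]], [predicted])
print(predicted, score)
```

With this encoding, a document can receive zero, one, or several labels, and the threshold becomes an additional hyperparameter to tune.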
