# Tutorial 10 in week 11: Convert text into sparse form
The aim of this tutorial is to learn how to convert unstructured text into a sparse, structured form. The text data that you are going to process consists of 16 clinical reports, each of which is stored in the following format:
 <img src="pic_1.png" height="500" width="500">
where sentences are separated by an empty line, and a sequence of 10 asterisks ("*") is used to indicate segment boundaries. Your task is to convert the text files into a proper format so that text segmentation algorithms can take the preprocessed text as input.

The preprocessing task is to generate the following files:
* <b>clinical_voc.txt</b> contains the vocabulary generated from the 16 clinical reports. It should look like 
<img src="pic_2.png" height="200" width="300">
* <b>clinical.txt</b> contains the preprocessed clinical reports. Each line corresponds to one sentence in a report. The order of sentences in each clinical report should be kept. The file you are going to generate should look like 
<img src="pic_3.png" height="300" width="500">
where each line starts with the name of the report (i.e., the file name), followed by "word_index:count". "word_index" is the index of the word in the vocabulary, and "count" is the frequency of the word in the sentence.
* <b>clinical_boundaries.txt</b> contains the segmentation of each clinical report. 
<img src="pic_4.png" height="300" width="500">
where each line starts with the name of the clinical report, followed by a binary vector. "1" indicates there is a boundary after the current sentence, "0" otherwise.

Sample output files are provided:
* <b>clinical_voc_sample.txt</b>
* <b>clinical_sample.txt</b>
* <b>clinical_boundaries_sample.txt</b>

Your task is to write your own Python code to produce exactly the same files.

In order to generate the three files, you need to think about how to combine some of the techniques covered in week 10. For example, we need
* Word tokenization
* Case Normalization
* Stopword removal
and other techniques from the lecture materials (in particular, chapter 2 "Exploring Pre-Processed text and Generating Features") in week 11. Please finish the following tasks in this tutorial.

Note that, for demonstration purposes, we have split the code into different chunks in different cells. In practice, most of the processing can be done within a couple of loops.

In [1]:
import os
import nltk

In [2]:
text_file_folder = "./data"

## Task 1: Read the data files 

The first task is to read the 16 clinical reports stored in the folder <font color="red">"./data"</font>. The file extension used is ".ref". For each file, you should:
1. make sure the file's extension is ".ref" 
2. read all the sentences, and store them in a list. The sentence order matters. The blank lines between sentences should be removed.
3. read the segmentation boundaries. You should record where the boundaries are in terms of sentence indexes. For example, if there is a boundary after the 10th, 20th, 30th and 40th sentences, you should generate a list that looks like [9, 19, 29, 39].

In terms of data structure, you can use two dictionaries to store sentences and boundaries separately. For both dictionaries, the key must be the same, e.g., the name of each clinical report so that we can easily match the pre-processed report with the corresponding boundaries. 
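
The two-dictionary layout can be sketched as follows. This is only an illustration on toy data: the report contents and the names `parsed`, `sentences_by_report` and `boundaries_by_report` are hypothetical and are not used elsewhere in this tutorial.

```python
# Toy data standing in for parsed reports (hypothetical content):
# each value is a pair (list of sentences, list of boundary indexes).
parsed = {
    "000.ref": (["first sentence", "second sentence", "third sentence"], [1, 2]),
    "001.ref": (["only sentence"], [0]),
}

sentences_by_report = {}
boundaries_by_report = {}
for report_name, (sents, bds) in parsed.items():
    sentences_by_report[report_name] = sents
    boundaries_by_report[report_name] = bds

# Both dictionaries share the same keys, so a report's sentences and
# boundaries can always be matched by file name.
print(sentences_by_report.keys() == boundaries_by_report.keys())
```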

You can also use one dictionary, where the keys are the report file names and each value is a pair of a sentence list and a boundary list. The following code implements the idea of using one dictionary. Please fill in the missing code.

In [3]:
sents_dict = {}
for root, subFolders, files in os.walk(text_file_folder):
    for file_name in files: # for each file in ./data
        file_path  = os.path.join(root, file_name)
        # make sure the file extension is ".ref". We try to exclude files such as
        # ".DS_Store" in MacOS.
        if file_path.endswith('.ref'):
            i = 0 # count the total number of sentences in each file
            sents = []
            bds = []
            # the with-statement closes the file once all lines have been read
            with open(file_path) as file_reader:
                for line in file_reader:
                #########please fill in the missing code below#######
                    line = line.strip()
                    if line != '': # exclude the empty lines
                        if line != "**********":
                            sents.append(line)
                            i = i + 1
                        # Lines that only contain "**********" indicate segment boundaries.
                        # For the text segmentation task, we need to record where the
                        # boundaries are located in the text
                        else:
                            bds.append(i-1)
                ######################################################
            sents_dict[file_name] = (sents, bds)

It is always a good idea to check the output of your code, i.e., to do a sanity check. For example, print the list of sentences of "000.ref" and the list of boundaries generated above, and manually check them against the original text file "000.ref".

In [4]:
sents_dict['000.ref'][0]

['Physical diagnosis had its origins in Grecian medicine',
 'Clinical medicine flourished before the Greeks  especially in Egypt  Crete  and Babylonia  and undoubtedly the Greeks were influenced by these earlier physicians',
 'But writings from these countries did not become part of the mainstream of Western civilization  as did those of the Greeks',
 'Table contains two quotations that illustrate the level of medicine practiced by the Greeks',
 'They took a careful history and practiced direct auscultation',
 'They were masters of observation  their descriptions of patients could fit modern texts without much change',
 'Greek medicine flourished early',
 'Homer in the Iliad ca  b c',
 'described  wounds and used  anatomic terms',
 'Hippocrates ca  b c',
 'lived during the Golden Age of Greece',
 'His contemporaries included Plato  Socrates  Aeschylus  Sophocles  Euripides  Aristophanes  and Pericles',
 'Medicine became in his hands an art  a science  a profession Major',
 'The Hippocr

In [5]:
sents_dict['000.ref'][1]

[36, 90, 149, 213, 286, 355, 420, 463, 523, 559, 593, 623, 719, 798, 801]
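
Part of this sanity check can be automated. The sketch below is one possible set of checks, not a required part of the tutorial; the function name `check_boundaries` is hypothetical. It flags boundary indexes that fall outside the sentence list. An index of -1 would arise if a file started with a boundary line before any sentence.

```python
def check_boundaries(sents, bds):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if any(b < 0 or b >= len(sents) for b in bds):
        problems.append("boundary index out of range")
    if any(later <= earlier for earlier, later in zip(bds, bds[1:])):
        problems.append("boundaries are not strictly increasing")
    if any(s.strip() in ("", "**********") for s in sents):
        problems.append("blank line or boundary marker stored as a sentence")
    return problems

# Toy input: three sentences with boundaries after the 2nd and 3rd.
print(check_boundaries(["a", "b", "c"], [1, 2]))   # []
# Toy input with indexes that cannot be valid for three sentences.
print(check_boundaries(["a", "b", "c"], [-1, 5]))  # ['boundary index out of range']
```

With the dictionary built above, the check could be run as `check_boundaries(*sents_dict['000.ref'])`.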

## Task 2 Tokenize the text for each sentence in each document

In this task, you are going to tokenize all the sentences for each clinical report. The tokenization function is provided as follows:

In [6]:
def tokenize_sent(sent):
    """
    The function tokenizes a sentence, and returns a list of lower-cased words that
    contain only alphabetic characters.
    """
    return [word for word in nltk.word_tokenize(sent.lower()) if word.isalpha()]

The above function uses NLTK's built-in tokenizer to tokenize a given sentence. All the words are converted to lower case, and only words that consist entirely of alphabetic characters are kept. Now, you should write your code to tokenize all the sentences stored in <font color="orange">sents_dict</font>

In [7]:
tokenized_sents = {} #The key is the document name, the value is a list of tokenized sentences
#########please fill in the missing code below#######
for key, value in sents_dict.items():
    tokenized_sents[key] = [tokenize_sent(sent) for sent in value[0]]
######################################################

Similarly, you should check the output of your code. Here, we print the first three tokenized sentences of "000.ref", and compare them with the original sentences in "000.ref".

In [8]:
print(tokenized_sents['000.ref'][0])
print(tokenized_sents['000.ref'][1])
print(tokenized_sents['000.ref'][2])

['physical', 'diagnosis', 'had', 'its', 'origins', 'in', 'grecian', 'medicine']
['clinical', 'medicine', 'flourished', 'before', 'the', 'greeks', 'especially', 'in', 'egypt', 'crete', 'and', 'babylonia', 'and', 'undoubtedly', 'the', 'greeks', 'were', 'influenced', 'by', 'these', 'earlier', 'physicians']
['but', 'writings', 'from', 'these', 'countries', 'did', 'not', 'become', 'part', 'of', 'the', 'mainstream', 'of', 'western', 'civilization', 'as', 'did', 'those', 'of', 'the', 'greeks']


## Task 3 Remove stop words

As we discussed in the lectures, stop words do not contribute much to the lexical content. In most text analysis tasks (e.g., IR, text classification, topic modeling), we choose to remove all the stop words. This task starts with counting the frequency of each unique word in the 16 reports. The NLTK class that we are going to use is <b><a href="http://www.nltk.org/api/nltk.html#nltk.probability.FreqDist">FreqDist</a></b> in <font color="blue">nltk.probability</font>. To use <b>FreqDist</b>, we need to concatenate the words in all the sentences of all the reports to form one long list of tokens. The function for concatenating all the words is provided as follows:

In [9]:
from nltk.probability import FreqDist  # import only the class that is used

def word_concat(dsd):
    """
    concatenate all the words stored in the values of a given dictionary. Each value is a list
    of tokenized sentences.
    """
    all_words = []
    for value in dsd.values():
        for sent in value:
            all_words += sent
    print("tokens:", len(all_words))
    print("types:", len(set(all_words)))
    return all_words

Now, write your code to find the 20 most common words in the code cell below.

In [10]:
freq_dist = FreqDist(word_concat(tokenized_sents))
freq_dist.most_common(20)

tokens: 61425
types: 6935


[('the', 4497),
 ('of', 2590),
 ('and', 1674),
 ('a', 1436),
 ('to', 1416),
 ('in', 1397),
 ('is', 1023),
 ('with', 637),
 ('or', 636),
 ('be', 631),
 ('that', 535),
 ('patient', 529),
 ('for', 491),
 ('as', 479),
 ('by', 450),
 ('may', 361),
 ('are', 348),
 ('this', 335),
 ('disease', 325),
 ('it', 303)]

You should find that nearly all of the top 20 most common words are function words, except for "patient" and "disease". However, these two words appear in every clinical report. We will consider removing this type of word later. Here, your task is to remove all the stop words in the following list:

In [11]:
stopwords = []
with open('./stopwords_en.txt') as f:
    stopwords = f.read().splitlines()
stopwords = set(stopwords)

In [12]:
def remove_words(words, stops):
    """
    This function excludes all the words appearing in a given list.
    Here the list is named "stops"
    """
    return [word for word in words if word not in stops]

Write your code below. Your code should use the <font color="blue">remove_words</font> function defined above.

In [13]:
tokenized_sents_stop = {}
#########please fill in the missing code below#######
for key, value in tokenized_sents.items():
    tokenized_sents_stop[key] = [remove_words(sent, stopwords) for sent in value]
######################################################

After removing the stop words, you can compare the most common words before and after the removal.

In [14]:
freq_dist = FreqDist(word_concat(tokenized_sents_stop))
freq_dist.most_common(20)

tokens: 30408
types: 6516


[('patient', 529),
 ('disease', 325),
 ('patients', 255),
 ('pain', 225),
 ('history', 208),
 ('test', 202),
 ('symptoms', 155),
 ('physician', 141),
 ('heart', 141),
 ('pressure', 126),
 ('physical', 126),
 ('syncope', 124),
 ('chest', 124),
 ('clinical', 121),
 ('blood', 118),
 ('cardiac', 113),
 ('examination', 113),
 ('diagnostic', 111),
 ('medical', 110),
 ('exercise', 106)]

## Task 4 Remove most/less frequent words based on document frequency

In text analysis, we often remove the most and least frequent words in the text based on either raw word frequency or document frequency. The former counts the total number of occurrences of a word in a corpus; the latter counts the number of documents containing the word. Here, your task is to remove the most and least frequent words based on each word's document frequency. It is not hard to imagine that if a word appears in every document, it will not help us distinguish between two documents. Here, we choose to remove words appearing in either more than 14 reports or fewer than 2 reports. We usually use two steps:
1. generate a list that contains both the most and the least frequent words
2. remove those words in a similar way as we remove stop words.
Now, write your code below.

In [15]:
all_words = []
for key, value in tokenized_sents_stop.items():
    words = []
    for sent in value:
        words += sent
    # keep each word once per report, so that FreqDist counts document frequency
    words = list(set(words))
    all_words += words

In [16]:
freq_dist = FreqDist(all_words)
print(freq_dist.most_common()) # print the most common words
print(freq_dist.hapaxes()) # print the words that appear just in one clinical report

[('patients', 16), ('patient', 16), ('disease', 16), ('important', 16), ('diagnosis', 15), ('history', 15), ('significant', 15), ('symptoms', 15), ('heart', 15), ('physical', 15), ('clinical', 14), ('physician', 14), ('common', 14), ('examination', 14), ('state', 14), ('diagnostic', 14), ('occur', 14), ('pain', 13), ('general', 13), ('presence', 13), ('major', 13), ('time', 13), ('information', 13), ('normal', 13), ('rate', 12), ('result', 12), ('myocardial', 12), ('chest', 12), ('medical', 12), ('age', 12), ('related', 12), ('problem', 12), ('states', 12), ('hypertension', 12), ('system', 12), ('specific', 12), ('factors', 12), ('similar', 12), ('obtained', 11), ('infarction', 11), ('found', 11), ('include', 11), ('require', 11), ('left', 11), ('risk', 11), ('evidence', 11), ('pressure', 11), ('commonly', 11), ('provide', 11), ('rest', 11), ('cardiac', 11), ('results', 11), ('diseases', 11), ('conditions', 11), ('made', 11), ('blood', 11), ('artery', 11), ('terms', 10), ('question', 1

In [17]:
# a set makes the membership test in remove_words fast
words2remove = set(word for word, freq in freq_dist.items() if freq > 14 or freq < 2)

final_sent ={}
for key, value in tokenized_sents_stop.items():
    final_sent[key] = [remove_words(sent, words2remove) for sent in value]
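
The difference between word frequency and document frequency can be seen on a toy corpus. The sketch below uses `collections.Counter` in place of `FreqDist` so that it runs without NLTK. The documents, the variable names and the thresholds are made up for illustration and differ from the thresholds used above.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = {
    "d1": ["pain", "chest", "pain", "patient"],
    "d2": ["patient", "blood", "pressure"],
    "d3": ["patient", "pain", "syncope"],
}

# Word frequency: every occurrence counts.
term_freq = Counter(tok for tokens in docs.values() for tok in tokens)
# Document frequency: each word counts at most once per document.
doc_freq = Counter(tok for tokens in docs.values() for tok in set(tokens))

print(term_freq["pain"], doc_freq["pain"])        # 3 2
print(term_freq["patient"], doc_freq["patient"])  # 3 3

# Remove words found in every document or in fewer than 2 documents
# (thresholds chosen for this toy corpus only).
to_remove = {w for w, df in doc_freq.items() if df == len(docs) or df < 2}
print(sorted(to_remove))  # ['blood', 'chest', 'patient', 'pressure', 'syncope']
```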

## Task 5 Generate the final vocabulary 
So far we have tokenized all the sentences, and removed the stop words and the most and least frequent words. We have carried out most of the steps needed to pre-process the clinical reports for the text segmentation task. In this task, you should generate the final vocabulary and save it in a file. You might need to use the <font color="orange">word_concat</font> function. 

In [18]:
voc = list(set(word_concat(final_sent)))

tokens: 22126
types: 2430


In [19]:
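# the with-statement closes the file even if an error occurs while writing
with open("clinical_voc.txt", "w") as v_writer:
    for word_type in voc:  # avoid shadowing the built-in "type"
        v_writer.write(word_type + "\n")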
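
Note that the order of a list built from a set of strings can differ from one interpreter run to the next, because Python randomizes string hashing by default. The word indexes may therefore change between runs. One way to obtain stable indexes is to sort the vocabulary, as sketched below on toy tokens. Whether the sample vocabulary file is in alphabetical order is not stated here, so treat the sorting as an assumption; `voc_sorted` and `word_to_index` are hypothetical names.

```python
tokens = ["pressure", "blood", "pressure", "artery", "blood"]  # toy token list

# Sorting gives the same order, and hence the same indexes, on every run.
voc_sorted = sorted(set(tokens))
word_to_index = {word: idx for idx, word in enumerate(voc_sorted)}

print(voc_sorted)     # ['artery', 'blood', 'pressure']
print(word_to_index)  # {'artery': 0, 'blood': 1, 'pressure': 2}
```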

## Task 6 Save the preprocessed text in a sparse format
Given the vocabulary generated in Task 5, you need to store the tokenized text in a sparse format, i.e., "word_index:word_count". Please first refer to the sample output for what the sparse format exactly is, and then write your code in the following cell:

In [20]:
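# map each word to its position in the vocabulary once, instead of calling
# voc.index(word) for every word
voc_index = {word: idx for idx, word in enumerate(voc)}
with open("./clinical.txt", "w") as w_writer:
    for doc, sents in final_sent.items():
        print(doc)
        for sent in sents:
            w_writer.write(doc)
            fd = FreqDist(sent)
            for word, count in fd.items():
                w_writer.write(",{0}:{1}".format(voc_index[word], count))
            w_writer.write("\n")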

015.ref
009.ref
000.ref
011.ref
004.ref
014.ref
013.ref
001.ref
008.ref
006.ref
007.ref
005.ref
002.ref
012.ref
003.ref
010.ref
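
To compare the generated file with the sample output, it can help to read a line back into a structured form. The sketch below assumes the comma-separated layout written by the cell above; the sample files may use a different separator. The function name `parse_sparse_line` is hypothetical.

```python
def parse_sparse_line(line):
    """Split 'name,idx:count,idx:count' into (name, {idx: count})."""
    name, *pairs = line.strip().split(",")
    counts = {}
    for pair in pairs:
        idx, count = pair.split(":")
        counts[int(idx)] = int(count)
    return name, counts

print(parse_sparse_line("000.ref,12:1,7:2\n"))  # ('000.ref', {12: 1, 7: 2})
# A sentence whose words were all removed leaves only the report name.
print(parse_sparse_line("000.ref\n"))           # ('000.ref', {})
```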


## Task 7 Save the segmentation boundaries
The last task is to save the boundary information. You are going to use one binary vector to represent the segmentation of a clinical document. Refer to the sample output <b>clinical_boundaries_sample.txt</b>, and reproduce the sample output in the following cell.

In [21]:
with open("clinical_boundaries.txt", 'w') as s_writer:
    for key, value in sents_dict.items():
        b_list = ['0']*len(value[0])
        for idx in value[1]:
            b_list[idx] = '1'
        s_writer.write(key+",")
        s_writer.write(",".join(b_list))
        s_writer.write("\n")
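
As a quick check on the boundary file, a line can be converted back into the list of boundary indexes and compared with the list recorded in Task 1. This is a sketch that assumes the comma-separated layout written above; `vector_to_boundaries` is a hypothetical name.

```python
def vector_to_boundaries(line):
    """Turn 'name,0,1,0,...' back into (name, list of boundary indexes)."""
    name, *bits = line.strip().split(",")
    return name, [i for i, bit in enumerate(bits) if bit == "1"]

# Toy line: five sentences, boundaries after the 2nd and the 5th.
print(vector_to_boundaries("000.ref,0,1,0,0,1\n"))  # ('000.ref', [1, 4])
```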

## Task 8 Generate TF-IDF vector for each sentence
Now assume that we are also interested in computing the similarity between any two sentences in a vector space. Instead of using the count vectors generated in the previous tasks, we can also consider using TF-IDF vectors. In this task, you are going to use the <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer">TfidfVectorizer</a> in the <a href="http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text">sklearn.feature_extraction.text</a> package. It is also good to read the <a href="http://scikit-learn.org/stable/modules/feature_extraction.html"> Feature Extraction</a> tutorial on the sklearn website.

In [22]:
sents_ids = []
sents_words =[]
for key, value in final_sent.items():
    # enumerate gives the position of each sentence within its report
    for i, sent in enumerate(value):
        sents_ids.append("{0},{1}".format(key, i))
        sents_words.append(' '.join(sent))
sents_words

['blood pressure pressure measured mercury major arterial system body',
 'separated systolic diastolic determinations',
 'systolic pressure maximum blood pressure contraction diastolic pressure minimum pressure recorded prior contraction',
 'blood pressure written systolic pressure diastolic pressure mm hg',
 'minimum acceptable blood pressure determined adequate perfusion vital organs hypotension',
 'mm hg systolic mm hg diastolic great variation',
 'report joint committee detection evaluation treatment high blood pressure recommended scheme categorizing arterial pressure individuals age years',
 'scheme table',
 'accurate measurement arterial blood pressure obtained direct methods involve expensive equipment artery',
 'methods settings measurements easier safer accurate clinical situations',
 'standard blood pressure cuff proper size minimize errors blood pressure determinations',
 'bladder ideally circumference limb tested',
 'standard bladder length',
 'length recommended limb circ

Write your code below to generate the TF-IDF vectors.

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(input = 'content', analyzer = 'word')
tfidf_vectors = tfidf_vectorizer.fit_transform(sents_words)
tfidf_vectors.shape

(3281, 2430)
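
The shape tells us that there are 3281 sentences and 2430 vocabulary terms. For reference, the weighting that scikit-learn documents for `TfidfVectorizer` with its default parameters (smoothed IDF, raw term counts, L2 normalization) can be summarized as below. Here each sentence $s$ plays the role of a document, $n$ is the number of sentences, and $\mathrm{df}(t)$ is the number of sentences containing term $t$. Other parameter settings give different weights.

$$
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w(t, s) = \mathrm{tf}(t, s)\cdot\mathrm{idf}(t),
\qquad
\mathrm{tfidf}(t, s) = \frac{w(t, s)}{\sqrt{\sum_{t'} w(t', s)^2}}
$$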

Now you should think about how to save the TF-IDF vector for each sentence in each report. Let's print out the sparse vector for the first sentence in "002.ref".

In [24]:
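# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
vocab = tfidf_vectorizer.get_feature_names_out()
# convert only the first row to a dense array rather than the whole matrix
for word, weight in zip(vocab, tfidf_vectors[0].toarray()[0]):
    if weight > 0:
        print(word, ":", weight)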

arterial : 0.286087534952
blood : 0.246358374296
body : 0.301793721164
major : 0.296474271432
measured : 0.407675741483
mercury : 0.41785146834
pressure : 0.487665028734
system : 0.316019761988
