# FIT5196 Assessment 1
#### Student Name: Ed Farrell
#### Student ID: 28629396

Date: 30/08/2017

Version: 1.0

Environment: Python 2.7.13 and Jupyter notebook

Libraries used:
* re (for regular expressions; included with Python 2.7)
* os (for directory management and creation)
* ElementTree (for parsing XML files)
* nltk (for text processing of abstracts; *probability*, *collocations*, *corpus* and *util* modules used)
* collections (for the Counter class)
* copy (for creating deep copies of variables)

## 1. Introduction

This assignment aims to extract information from a corpus of 2500 patents stored as XML files. From the outset, the format of the original 'patents.xml' file raised parsing errors, as it is a collection of XML documents concatenated into a single file rather than one well-formed XML document. Following a visual inspection of the original file in Notepad++, the patents were split into individual XML files prior to parsing. Given the complex structure of the XML files, ElementTree's ability to parse XML data was key to extracting the relevant information required by the assignment description.

ElementTree was chosen for its simplicity, while regular expressions were used for their ability to manipulate strings quickly and effectively.
Dictionaries with list values were used throughout the assignment because they pair a single unique key (in this case, a patent ID) with an effectively unlimited and, more importantly, mutable value. A single list can store a string, the tokenised version of that string, a list of lists, and so on.
Because of this use of dictionaries, dedicated array and dataframe packages such as NumPy or pandas were not used, as they were judged to provide the same service without the benefit of being a built-in Python datatype.
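
The dictionary-of-lists layout described above can be sketched as follows. The patent IDs and values are hypothetical and exist only for illustration; the snippet is written to run on both Python 2.7 and Python 3.

```python
# Hypothetical patent IDs and values, used only to illustrate the layout.
records = {}

# Each key is a patent ID; each value is a list that can hold mixed content.
records["US00000001"] = ["A method for testing."]                # a single string
records["US00000001"].append(["a", "method", "for", "testing"])  # its tokenised form
records["US00000002"] = []                                       # nothing extracted yet

# Lists are mutable, so a value can grow after its key has been created.
records["US00000002"].append("A device.")

for patent_id in sorted(records):
    print(patent_id + ": " + str(len(records[patent_id])) + " item(s)")
```

Because every value is a list, the same loop can process each patent regardless of how many items were extracted for it.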

__PLEASE NOTE: THIS PROGRAM WAS BUILT ON A WINDOWS OS, AND ASSUMES THAT IT WILL BE RUN / MARKED ON ONE. THIS PROGRAM CREATES SUBDIRECTORIES; IF FOR WHATEVER REASON YOU DO NOT HAVE PERMISSION / THE ABILITY TO PERFORM THIS ON YOUR DEVICE, PLEASE ATTEMPT ON ONE THAT DOES.__

## 2.  Import libraries 

In [1]:
# Key initial imports
import re
import os

# Parsing tools
import xml.etree.ElementTree as etree

# Text processing
import nltk
from nltk.probability import FreqDist
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import stopwords as Stopwords
from nltk.util import ngrams

# Miscellaneous libraries
from collections import Counter
import copy


## 3. Declare top-level variables & create subdirectories

This includes variables that determine the patent file being read, which stopwords store is used, and which data extraction paths are to be used.

In [3]:
# Assigns the patent file to the variable patent
patent = 'patents.xml'

# Path variables for data extraction
patent_cites_path = 'us-bibliographic-data-grant/references-cited/citation/patcit/document-id'
IPC_path = 'us-bibliographic-data-grant/classifications-ipcr/classification-ipcr'
abstract_path = 'abstract'

# Target variables for data extraction
target_IPC = ['section', 'class', 'subclass', 'main-group', 'subgroup']
target_cites = ['doc-number']
target_patents = ['p']

# Stopword declaration; this list is sourced from NLTK
stopwords_NLTK = Stopwords.words('english')


In [4]:
# Checks to see whether folders named 'Patents' and 'Text-files' exist, and if not then the necessary folders are created.
# This will be used to store individual patent xml files once they have been split from the patents.xml file, and the required
# .txt files storing extracted data records as outlined in the Assignment description.
if os.path.exists("Patents"):
    print "Patent sub-directory found."
else:
    os.makedirs("Patents")
    print "Patent sub-directory created."
if os.path.exists("Text-files"):
    print "Text-files sub-directory found."
else:
    os.makedirs("Text-files")
    print "Text-files sub-directory created."

# Creates variables storing paths for the Patents and Text-files subdirectories
xml_subdir = os.path.join(os.getcwd(), "Patents")
txt_subdir = os.path.join(os.getcwd(), "Text-files")

# Variable containing the names of each .txt file that will store extracted data
text_file_names = ['classifications.txt', 'citations.txt', 'cited.txt', 'count_vectors.txt', 'vocab.txt']

Patent sub-directory created.
Text-files sub-directory created.


## 4. Define program-wide functions

In [5]:
# A function used to extract information from each patent in a list of patent files. This function assumes the following:
# 1. Each patent is an individual .xml file that conforms to a pre-defined path structure (as per the path argument).
# 2. Each extraction is for information on a single level; information across multiple levels will require separate runs of
# dataTrawler.
# 3. The global list patent_names holds one patent name per file, in the same order as the patents argument.

def dataTrawler(patents, path, targets):
    # Variable declaration - complete_data collects one list of extracted values per patent.
    complete_data = []
    
    # Iterates over all patents in 'patents', setting a root and then descending on the path nominated in the path argument.
    # Once at the final level, the function pulls all text data associated with the targets stated in the targets argument,
    # appending each occurrence to patent_data, which is reset for every patent.
    for patent in patents:
        patent_data = []
        tree = etree.parse(patent)
        root = tree.getroot()
        for element in root.findall(path):
            for data in element:
                if data.tag in targets:
                    # For paragraph elements, only the first paragraph (id 'p-0001') is kept. get() is used so that a
                    # paragraph without an id attribute is skipped rather than raising a KeyError.
                    if data.tag != 'p' or data.get('id') == 'p-0001':
                        patent_data.append(data.text)
        complete_data.append(patent_data)
    
    # Creates a dictionary variable, where each patent name is a key and the value of each key is the list of data
    # extracted from that patent.
    data_in_dict = dict(zip(patent_names, complete_data))
    return data_in_dict
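
The path-and-target extraction that dataTrawler performs can be illustrated on a small in-memory document. This is a minimal sketch: the XML below is made up and heavily simplified rather than taken from 'patents.xml', and it uses `etree.fromstring` instead of reading a file.

```python
import xml.etree.ElementTree as etree

# A made-up, heavily simplified patent document (not taken from patents.xml).
sample_xml = """<us-patent-grant>
  <us-bibliographic-data-grant>
    <classifications-ipcr>
      <classification-ipcr>
        <section>A</section>
        <class>01</class>
        <subclass>B</subclass>
        <ipc-version-indicator>ignored</ipc-version-indicator>
      </classification-ipcr>
    </classifications-ipcr>
  </us-bibliographic-data-grant>
</us-patent-grant>"""

path = 'us-bibliographic-data-grant/classifications-ipcr/classification-ipcr'
targets = ['section', 'class', 'subclass']

root = etree.fromstring(sample_xml)
extracted = []
# findall() follows the path relative to the root; each matching element is
# then scanned for child elements whose tag is one of the targets.
for element in root.findall(path):
    for child in element:
        if child.tag in targets:
            extracted.append(child.text)

print(extracted)
```

Children whose tags are not listed in `targets` (here, `ipc-version-indicator`) are skipped, which is why one function can serve the IPC, citation and abstract extractions by changing only the path and the target list.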

## 5. Extracting data and writing to individual files

As ElementTree is being used to parse the XML data, the concatenated series of XML files in __'patents.xml'__ are split into individual patent files that are then iterated through by ElementTree & the function __dataTrawler__ (defined above).
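
The splitting step can be sketched on a small in-memory string. The two documents below are invented stand-ins for the contents of 'patents.xml', and the only assumption carried over from the notebook is that every document begins with an `<?xml version=` declaration on its own line.

```python
# Two made-up XML documents concatenated into one string, standing in for patents.xml.
combined = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<doc id="1"><p>first</p></doc>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<doc id="2"><p>second</p></doc>\n'
)

documents = []
current = []
for line in combined.splitlines(True):  # True keeps the line endings
    # A new XML declaration marks the start of the next document.
    if line.startswith('<?xml version=') and current:
        documents.append(''.join(current))
        current = []
    current.append(line)
# The final document has no declaration after it, so it is stored here.
if current:
    documents.append(''.join(current))

print(len(documents))
```

Each entry in `documents` is now a well-formed XML document that can be written to its own file and parsed individually.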

In [6]:
# Pulls data from patents.xml into a list of lists, with each sublist containing an individual patent where each line
# is an element of the sublist.

# Variable declarations
single_patent = []
all_patents = []

with open(patent) as patents:
    for line in patents:
        # Each occurrence of '<?xml version=' after the first marks the start of a new patent, so the patent collected so
        # far is transferred to all_patents and single_patent is reset before the line is stored.
        if line.startswith("<?xml version=") and len(single_patent) > 0:
            all_patents.append(single_patent)
            single_patent = []
        single_patent.append(line)
    # Writes the last patent in the xml file to all_patents, as this would not have been done otherwise.
    all_patents.append(single_patent)

In [7]:
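# Creates a list containing the patent name for each patent in patents.xml. The XML document name recorded in each patent
# (of the form <name>-<date>.XML) is matched first, and its first 10 characters are then kept as the patent name.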
patent_names = []
for item in all_patents:
    for i in item:
        patent_filename = re.findall(r"\w{10}\-\w{8}\.XML", i)
        if len(patent_filename) > 0:
            patent_names.append(patent_filename)
patent_names = [i[:10] for j in patent_names for i in j] # Flattens the list of lists, and cleans the patent name information.


In [8]:
# Creates an individual xml file for each patent stored in patents.xml.
# A list comprehension is used to generate the filenames (including directory path) for all patent names stored in the
# patent_names variable. os.path.join is used so that the path separator matches the operating system.
list_of_xmls = [os.path.join(xml_subdir, name+".XML") for name in patent_names]
count = 0
for document, contents in zip(list_of_xmls, all_patents):
    with open(document, 'w') as newfile:
        newfile.write("".join(contents))
    count += 1

print count, "patent files separated."

2500 patent files separated.


## 5. Parse the XML files and create text files containing extracted data

Four text files are created below. These files are detailed as follows:
1. __'classifications.txt'__ - contains each patent and its related hierarchical IPC code, in the order Section, Class, Subclass, Main Group and Subgroup
2. __'citations.txt'__ - contains each patent, and each existing patent cited by that patent
3. __'citation_count.txt'__ - contains each patent, and the number of patents cited by that patent. This is Mohsen Laali's interpretation of the 'cited.txt' requirements.
4. __'cited.txt'__ - contains each patent cited by the patents in the original 'patents.xml' file, as well as a count of the number of times that patent is cited within the whole body of patents.
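
The difference between the two counting files can be sketched on a toy citation dictionary. The patent IDs below are hypothetical; the first count is per citing patent and the second is per cited patent across the whole corpus.

```python
from collections import Counter

# Hypothetical citation data: patent -> list of patents it cites.
citations = {
    "P1": ["A", "B"],
    "P2": ["B"],
    "P3": [],
}

# Per-patent view: how many patents each patent cites.
citing_counts = {patent: len(cited) for patent, cited in citations.items()}

# Corpus-wide view: how often each cited patent appears overall.
cited_counts = Counter(c for cited in citations.values() for c in cited)

print(citing_counts["P1"])
print(cited_counts["B"])
```

The first dictionary always has one entry per patent in the corpus, whereas the second has one entry per distinct cited patent, so the two files generally differ in length.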

In [9]:
# Variable used to store each patent's IPC code records as a dictionary, where each key is a patent and each value is a list
# containing the IPC codes in the order section, class, subclass, main-group, and subgroup.
IPC_data = dataTrawler(list_of_xmls, IPC_path, target_IPC)

# Writes the contents of IPC_data to the text file classifications.txt
count = 0
with open(os.path.join(txt_subdir, "classifications.txt"), 'w') as classification_file:
    for i,j in IPC_data.iteritems():
        count += 1
        classification_file.write(i+':'+str(j).replace('\'', '').strip('[]')+'\n')

print "'classifications.txt' successfully written to file with", count, "lines written."

'classifications.txt' successfully written to file with 2500 lines written.


In [10]:
# Variable used to store the patents referenced by each patent as a dictionary, where each key is a patent and each value is a
# list of the patents referenced by the key patent.
patents_cited_data = dataTrawler(list_of_xmls, patent_cites_path, target_cites)

# Writes the contents of patents_cited_data to the text file citations.txt
count = 0
with open(os.path.join(txt_subdir, "citations.txt"), 'w') as patents_cited_file:
    for i,j in patents_cited_data.iteritems():
        count += 1
        patents_cited_file.write(i+':'+str(j).replace('\'', '').strip('[]')+'\n')

print "'citations.txt' successfully written to file with", count, "lines written."

'citations.txt' successfully written to file with 2500 lines written.


In [11]:
# Counts the number of patents that each patent cites, storing the result in a patent:count dictionary.
citations_count_dict = {}
for i, j in patents_cited_data.iteritems():
    citations_count_dict[i] = len(j)

# Writes the contents of citations_count_dict to the text file citation_count.txt
count = 0
with open(os.path.join(txt_subdir, "citation_count.txt"), 'w') as patent_citations_file:
    for i,j in citations_count_dict.iteritems():
        count += 1
        patent_citations_file.write(i+':'+str(j)+'\n')
print "'citation_count.txt' successfully written to file with", count, "lines written."

'citation_count.txt' successfully written to file with 2500 lines written.


In [12]:
# Collects every patent cited by the <patent>.xml files into one flat list, i.e. the values from the key:value relationships
# in 'citations.txt'.
patent_citations = [cited for cited_list in patents_cited_data.itervalues() for cited in cited_list]

# Counts the number of times each cited patent appears across all <patent>.xml files. Counter produces one key per distinct
# cited patent in a single pass over the list, whereas calling list.count() once per distinct patent rescans the whole list
# each time.
all_patent_count = Counter(patent_citations)

# Writes the details from all_patent_count to a text file named 'cited.txt', with the count printed to the terminal providing
# the number of distinct patents referenced by the 2500 <patent>.xml files
count = 0
with open(os.path.join(txt_subdir, "cited.txt"), 'w') as citation_count_file:
    for i,j in all_patent_count.iteritems():
        count += 1
        citation_count_file.write(i+':'+str(j)+'\n')

print "'cited.txt' successfully written to file with", count, "lines written."

'cited.txt' successfully written to file with 40728 lines written.


## 6. Abstract extraction

This section extracts the abstract data from each individual patent, and produces the following variables:
1. A list of lists __(all_token_list)__, where each sub-list contains a tokenised abstract for each patent
2. A list of lists __(all_uniq_list)__, where each sub-list contains a tokenised abstract for each patent where each token is unique
3. A list __(unique_tokens)__ containing tokens that only appear in a single patent (regardless of the number of times it appears in that patent)
4. A list __(flat_token_list)__ that is a flattened variant of the list of lists all_token_list
5. A dictionary __(pat_tokens_cleaned)__, where each key-value pair is a patent matched to a list containing the patent's abstract once it has been tokenised, and has been stripped of stopwords and unique tokens.
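
The idea behind identifying tokens that appear in only one patent can be sketched as a document-frequency count. The abstracts below are made up; the regular expression is the same letters-and-hyphens pattern used in this notebook.

```python
import re
from collections import Counter

# Made-up abstracts keyed by hypothetical patent IDs.
abstracts = {
    "P1": "A widget for mounting a sensor.",
    "P2": "A sensor housing and a bracket.",
}

# Tokenise each abstract into lower-case runs of letters and hyphens.
tokenised = {pid: re.findall(r"[a-zA-Z-]+", text.lower())
             for pid, text in abstracts.items()}

# Document frequency: set() ensures a token is counted once per patent,
# however many times it occurs within that patent.
doc_freq = Counter(tok for toks in tokenised.values() for tok in set(toks))

single_patent_tokens = sorted(tok for tok, n in doc_freq.items() if n == 1)
print(single_patent_tokens)
```

Here "a" and "sensor" occur in both abstracts and are kept, while every other token has a document frequency of 1 and would be removed from the vocabulary.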

In [13]:
# Tokenise abstract for each patent file; each patent's abstract is contained as a key:value pair in abstract_data, with each
# patent's unique tokens being stored as elements of all_uniq_list. This list will be used to establish whether a word appears
# in only one patent.
abstract_data = dataTrawler(list_of_xmls, abstract_path, target_patents)
all_uniq_list = []
all_token_list = []

for i,j in abstract_data.iteritems():
    # Holds all tokens found in this patent's abstract. Resetting it for every patent ensures that a patent without a usable
    # abstract contributes an empty list rather than inheriting the previous patent's tokens.
    tokens = []

    # Iterates through the abstract strings, extracting each run of letters and hyphens as a lower-case token. Punctuation
    # marks such as '.' and ',' are not captured by this pattern.
    for string in j:
        if isinstance(string, (str, unicode)):
            tokens.extend(re.findall(r"[a-zA-Z-]+", string.lower().encode('utf-8')))
        else:
            print "A non-text token has been caught. Details are as follows; \n type:", type(string), i, string
    
    # Appends each distinct token in this abstract to all_uniq_list. A count run on this variable will return how often each
    # term in the list occurs, i.e. how many patents it occurs in. Given this information, if the count returns as 1 then the
    # word only appears in one patent, and can therefore be removed.
    all_uniq_list.extend(set(tokens))
    
    # Appends the list of tokens for each patent to the list of lists all_token_list.
    all_token_list.append(tokens)

# Flattens all_token_list into a single list of tokens.
flat_token_list = [i for j in all_token_list for i in j]

# Creates a list of tokens that only appear in a single patent; these tokens can therefore be removed from the count vector
# vocabulary list. Note that Counter returns a single dictionary-like object for all_uniq_list, where each key is an element
# and each value is the number of occurrences of that element in all_uniq_list.
unique_tokens = []
for key, value in Counter(all_uniq_list).iteritems():
    if value == 1:
        unique_tokens.append(key)

print "There are",len(unique_tokens),"unique tokens in the corpus."

There are 5427 unique tokens in the corpus.


In [14]:
# Creates a dictionary where each key is a patent and the value is a list of the patent abstract once it has been tokenised. The 
# variable pat_tokens_raw contains just the tokenised abstracts as elements of the list, while pat_tokens_cleaned has had any
# stopwords and tokens present in only one patent removed.

pat_tokens_raw = dataTrawler(list_of_xmls, abstract_path, target_patents)
pat_tokens_cleaned = {}

# Sets give fast membership tests for the filtering below.
stopword_set = set(stopwords_NLTK)
unique_token_set = set(unique_tokens)

for patent, abstract in pat_tokens_raw.iteritems():
    tokens = []
    for i in abstract:
        string = i.encode('utf-8') # Encodes text to type str to deal with Python 2.7's inherent issues with Unicode characters.
        tokens.extend(re.findall(r"[a-zA-Z-]+", string.lower()))
    pat_tokens_raw[patent] = tokens

    # Cleans this patent's tokens of stopwords, and words that appear in only one patent.
    pat_tokens_cleaned[patent] = [token for token in tokens
                                  if token not in unique_token_set and token not in stopword_set]

# Cleans flat_token_list of stopwords, and words that appear in only one patent. While pat_tokens_cleaned contains abstract
# tokens for each patent, cleaned_flat_token_list contains the contents of all patent abstracts in the corpus.
cleaned_flat_token_list = [token for token in flat_token_list
                           if token not in stopword_set and token not in unique_token_set]

## 7. Abstract processing
This section processes the tokenised abstracts as follows:
1. Bigram creation and collocation using likelihood ratios
2. Integration of bigrams into the abstracts, replacing unigram pairs with relevant bigrams where necessary
3. Removal of unique tokens (those appearing in only one patent) and of stopwords, as determined by the NLTK stopwords list.
4. Removal of the top-20 most frequent words
5. Transferral of this final collection of terms to the vocabulary file __'vocab.txt'__
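
Step 2 can be sketched with a small helper that joins adjacent tokens when they form a selected bigram. This is a variation for illustration only: `merge_bigrams` is a hypothetical helper that keeps each merged bigram at its original position, whereas the notebook's own cell appends merged bigrams to the end of the abstract, which gives the same counts. The bigrams and tokens are made up.

```python
def merge_bigrams(tokens, bigrams):
    """Return a new token list in which adjacent pairs found in `bigrams`
    are joined with an underscore, preserving the original order."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            merged.append(pair[0] + '_' + pair[1])
            i += 2  # skip both halves of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical selected bigrams and a made-up tokenised abstract.
chosen = {("fuel", "cell"), ("heat", "exchanger")}
tokens = ["a", "fuel", "cell", "with", "a", "heat", "exchanger", "cell"]
print(merge_bigrams(tokens, chosen))
```

The final "cell" is left as a unigram because it is not preceded by "fuel", so only genuine adjacent pairs are merged.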

In [15]:
# Creates bigrams of the patent abstract corpus, filtering out any bigrams where one of the tokens is shorter than 3 letters or
# is a stopword in the NLTK stopwords list.

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(flat_token_list)
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in stopwords_NLTK)
finder.apply_freq_filter(25) # Ignores all bigrams that occur fewer than 25 times

# Identifies the top 150 bigrams as identified by likelihood ratio. PMI was not chosen as it is sensitive to rare bigrams.
# See Boswell, D. (2004) - http://dustwell.com/PastWork/Collocations.pdf
bigram_150 = finder.nbest(bigram_measures.likelihood_ratio, 150)

In [16]:
# Creates the variable bigrammed_abstracts, which once this block is run will contain a dictionary where the key refers to a
# specific patent, while the value is a list containing the tokenised abstract, with the selected top bigrams (bigram_150)
# replacing the relevant unigrams.
bigrammed_abstracts = copy.deepcopy(pat_tokens_raw)

# Iterates through each key:value pair in bigrammed_abstracts, checking if each pair of adjacent tokens in the abstract is one
# of the selected bigrams. If so, the joined bigram is appended to the end of the abstract, and each of the two tokens is
# replaced with a nonsense placeholder that is removed through a list comprehension at the end of the process.
for key, abstract in bigrammed_abstracts.iteritems():
    for token in range(len(bigrammed_abstracts[key])-1):
        if (bigrammed_abstracts[key][token], bigrammed_abstracts[key][token+1]) in bigram_150:
            append_bigram = (bigrammed_abstracts[key][token]+'_'+bigrammed_abstracts[key][token+1])
            bigrammed_abstracts[key].append(append_bigram)
            bigrammed_abstracts[key][token] = '*@*'
            bigrammed_abstracts[key][token+1] = '*@*'
    bigrammed_abstracts[key] = [token for token in bigrammed_abstracts[key] if token != '*@*']

In [18]:
# Appends the tokens of each bigram-edited abstract to a single flat list, from which a frequency distribution can be run to
# establish the top 20 tokens that will be removed from the vocabulary list. Note that stopwords and tokens occurring in only
# one patent are removed in the next cell, prior to calculating the frequency distribution.
bigrammed_flat_list = []
for key, value in bigrammed_abstracts.iteritems():
    for token in value:
        bigrammed_flat_list.append(token)

In [19]:
# Creates the variable common_unibigrams, containing the 20 most common tokens (either unigrams or bigrams) in the entirety
# of the abstracts. Note that stopwords and tokens appearing in only one patent are dropped before the frequency distribution
# is calculated.
unibigram_FD = FreqDist([i for i in bigrammed_flat_list if i not in stopwords_NLTK if i not in unique_tokens])
common_unibigrams = unibigram_FD.most_common(20)

In [20]:
# Remove the 20 most common tokens (unigrams and bigrams) from each patent's abstract, along with stopwords and words that
# appear in only one abstract.
# This results in a dictionary where each key is a patent id and each value a list containing the abstract tokens for that
# patent, with bigrams replacing unigram pairs where relevant, and with the abstracts stripped of i. stopwords, ii. the most
# common unigrams and bigrams, and iii. the tokens that appear in only one abstract.

# A set is used so that each membership test below is fast.
cleanout_list = set(unique_tokens)
cleanout_list.update(i.encode('utf-8') for i in stopwords_NLTK)
cleanout_list.update(i[0] for i in common_unibigrams)

for key, vector in bigrammed_abstracts.iteritems():
    bigrammed_abstracts[key] = [i for i in vector if i not in cleanout_list]

In [21]:
# Takes the finalised abstracts from bigrammed_abstracts, flattens them into a single sequence of tokens, and converts that
# sequence to a set to remove duplicate tokens. The set is then converted back to a list, which will be used to create the
# file 'vocab.txt'.
vocab = list(set(i for value in bigrammed_abstracts.itervalues() for i in value))

In [22]:
# Creates a dictionary where each token in the vocab variable becomes a value, with its key being a unique identifier number
# (starting from 1) that will be referenced in 'count_vectors.txt'
vocab_dict = {}
for count, i in enumerate(vocab, 1):
    vocab_dict[count] = i

In [23]:
# Writes the vocab_dict variable to file.
count = 0
with open(os.path.join(txt_subdir, "vocab.txt"), 'w') as vocab_file:
    for i, j in vocab_dict.iteritems():
        count += 1
        vocab_file.write(str(i)+':'+j+'\n')
        
print "'vocab.txt' successfully written to file with", count, "lines written."

'vocab.txt' successfully written to file with 5733 lines written.


## 8. Generate sparse vector count for each patent

This section marks the final stage of the assignment, and involves creating a sparse count vector for each patent. Each vector takes the form $<$patent$>$, $<$vocab_id1$>$:$<$count1$>$, $<$vocab_id2$>$:$<$count2$>$, ... , $<$vocab_id_n$>$:$<$count_n$>$, where each vocab_id refers to an index in the 'vocab.txt' file and the corresponding count is the number of times that token appears in the patent's bigrammed and cleaned abstract.
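
A single line in this format can be sketched as follows. The vocabulary IDs, patent ID and tokens are hypothetical, and the pairs are sorted by vocabulary ID purely to make the example deterministic; the assignment format itself does not require a particular order.

```python
from collections import Counter

# Hypothetical token -> vocabulary ID mapping and a made-up cleaned abstract.
vocab_ids = {"sensor": 1, "bracket": 2, "fuel_cell": 3}
abstract_tokens = ["sensor", "bracket", "sensor", "unknown"]

# Count vocabulary IDs, ignoring tokens that are not in the vocabulary.
counts = Counter(vocab_ids[t] for t in abstract_tokens if t in vocab_ids)

# Only IDs that actually occur are written, which is what makes the vector sparse.
line = "P1, " + ", ".join("%d:%d" % (vid, n) for vid, n in sorted(counts.items()))
print(line)
```

Vocabulary entries that do not occur in the abstract (here, ID 3) are simply absent from the line rather than being written with a count of zero.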

In [24]:
# Swaps the key:value order of vocab_dict, so that the token id can be more easily accessed.
swap_vocab = {j:i for i, j in vocab_dict.iteritems()}

In [25]:
# Creates an empty list variable named count_vector, which will store the final count vector for the assignment. This count
# vector is generated by iterating through the bigrammed_abstracts dictionary (the abstracts post-cleaning/post-bigramming), and
# looking up the vocab identifier attached to each word in the tokenised abstracts. The identifiers are run through
# Counter to obtain the vocab token counts for each patent, which are formatted as <vocab_id>:<count> pairs after the patent
# number. The result is appended to count_vector, which will be written to the 'count_vectors.txt' file below.
count_vector = []
for patent, abstract in bigrammed_abstracts.iteritems():
    # A direct dictionary lookup replaces a scan of the whole vocabulary for every token.
    id_counts = Counter(swap_vocab[element] for element in abstract if element in swap_vocab)
    pairs = ', '.join(str(identifier)+':'+str(n) for identifier, n in id_counts.most_common())
    count_vector.append([patent+', '+pairs])


In [26]:
# Writing 'count_vectors.txt' to file, using count_vector variable defined above.
count = 0
with open(os.path.join(txt_subdir, "count_vectors.txt"), 'w') as vector_file:
    for i in count_vector:
        count += 1
        vector_file.write(str(i).strip('\'[]\'')+'\n')
        
print "'count_vectors.txt' successfully written to file with", count, "lines written."

'count_vectors.txt' successfully written to file with 2500 lines written.


## 9. Summary

This assessment presented a number of challenges to students. While some of them (e.g. the complexity of dealing with a concatenated 'mega XML file') were a great chance for students to develop their skills, there were a number of issues that should be addressed for next semester's assessments.

A discussion amongst the teaching staff prior to the release of assignments would ensure that all tutors are on top of the assignment requirements. Mohsen's repeated defence of what Dickson has stated to be an incorrect interpretation of the assignment description highlights the lack of communication between staff about the assignment beforehand. Furthermore, the confusion over whether Python 3.6 would be accepted, and the mixed responses from Rasika and Dickson, further highlight the lack of coordination by staff. These issues are minor in the wider scheme of things, and only small changes will be needed to deal with them in future.

In terms of utilising the material covered during the semester's tutorials, I feel that this assignment presents students with a number of ways, outside those taught to us, to reach the end of the assignment. The count vector above, for example, makes no use of the scikit-learn packages. Given that we were offered ElementTree, lxml, and BeautifulSoup, the lack of resources, especially for lxml, would have steered a number of students away from these offerings. Integrating the use of modules and packages such as these will force students to develop a better grasp of what is being taught during the semester.

Future developments for this class may also benefit from having students place more importance on the collocation method that they choose. While the vast majority of the available resources default to PMI, alternatives such as the likelihood ratio (used here) should not be disregarded, especially given the relatively small body of text being examined.
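
The sensitivity of PMI to rare bigrams can be illustrated with a short calculation. This is a sketch under stated assumptions: `pmi` is a hypothetical helper implementing the standard definition log2(P(x,y) / (P(x)P(y))) from raw counts, and all of the counts are invented rather than taken from the patent corpus.

```python
import math

def pmi(pair_count, first_count, second_count, total):
    """Pointwise mutual information (base 2) of a bigram from raw counts."""
    p_pair = pair_count / float(total)
    p_first = first_count / float(total)
    p_second = second_count / float(total)
    return math.log(p_pair / (p_first * p_second), 2)

total_bigrams = 100000  # invented corpus size

# A pair seen once, made of two words that each occur only once.
rare = pmi(1, 1, 1, total_bigrams)
# A pair seen 500 times, made of words that each occur 1000 times.
frequent = pmi(500, 1000, 1000, total_bigrams)

print(rare > frequent)
```

The pair that occurs only once receives the higher score even though there is far less evidence for it, which is the behaviour that motivated using a frequency filter and the likelihood ratio in this notebook.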