# FIT5196 Task 1 in Assessment 1

#### Group number: 17
#### Student Names: Roopesh Kumar Ramesh, Nikita Mary John
#### Student ID: 30344565, 30142776

Date: 25/08/2019

Version: 1.0

Environment: Python 3 and Jupyter notebook

Libraries used: 
* pandas (for dataframe, included in Anaconda Python 3) 
* re (for regular expressions, included in Anaconda Python 3) 

## Introduction

The aim of this assignment is to parse through a text file in xml format, extract data and write this data to a csv and a json file. The given text file contains information about various grants given for IP patent claims. We are required to extract grant_id, patent_kind, patent_title, number_of_claims, citations_examiner_count, citations_applicant_count, inventors, claims_text and abstract. In the text file provided to us, there are a total of 150 grant entries. 

NOTE: In order for the following code to run, the input text file must be stored in the same folder as this file.

## 1.  Import libraries 

Over the course of this assignment, we used the pandas and re libraries extensively. The pandas library was used to create the dataframe which we then wrote to a csv file. The re library was used to extract the required data. 

In [1]:
import pandas as pd
import re

## 2. Examining and loading data

First, we opened the text file and stored its contents as a string in "text". We then used a regular expression and the re library to extract each block that represented one patent grant.

In [2]:
with open('Group017.txt','r') as file: #the with block closes the file once it has been read
    text = file.read() #storing the entire file as a string

We found that each entry began with '<?xml version="1.0" encoding="UTF-8"?>' and ended with '</us-patent-grant>'. We used these two markers to extract each entry and store it as an element of the list 'patents'.

In [3]:
regex = r'<\?xml version="1.0" encoding="UTF-8"\?>.*?</us-patent-grant>' 
patents = re.findall(regex, text, flags=re.DOTALL) #findall returns a list of everything matching the pattern; re.DOTALL (numeric value 16) lets "." match newlines
len(patents)

150
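The effect of the flag can be seen on a small example. The sketch below uses an invented two-record string (the file names in it are made up for illustration) and compares the same pattern with and without re.DOTALL.

```python
import re

# Toy input: two invented records, each spanning several lines.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US00000001-20190723.XML">\n'
    '</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US00000002-20190723.XML">\n'
    '</us-patent-grant>\n'
)

pattern = r'<\?xml version="1\.0" encoding="UTF-8"\?>.*?</us-patent-grant>'

# Without DOTALL, "." stops at a newline, so no multi-line record can match.
without_flag = re.findall(pattern, sample)
# With DOTALL, the lazy ".*?" runs up to the first closing tag of each record.
with_flag = re.findall(pattern, sample, flags=re.DOTALL)

print(len(without_flag), len(with_flag))
```

The lazy quantifier matters as much as the flag: a greedy ".*" with re.DOTALL would swallow everything from the first declaration to the last closing tag as one match.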

We then checked the length of the list obtained above and concluded that the text file contains 150 data entries. We finally printed the first entry and examined it in order to find patterns for extracting the required information.

In [4]:
print(patents[0])

<?xml version="1.0" encoding="UTF-8"?>
<us-patent-grant lang="EN" dtd-version="v4.5 2014-04-03" file="US10359846-20190723.XML" status="PRODUCTION" id="us-patent-grant" country="US" date-produced="20190709" date-publ="20190723">
<us-bibliographic-data-grant>
<publication-reference>
<document-id>
<country>US</country>
<doc-number>10359846</doc-number>
<kind>B2</kind>
<date>20190723</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>16005056</doc-number>
<date>20180611</date>
</document-id>
</application-reference>
<us-application-series-code>16</us-application-series-code>
<classifications-ipcr>
<classification-ipcr>
<ipc-version-indicator><date>20060101</date></ipc-version-indicator>
<classification-level>A</classification-level>
<section>G</section>
<class>06</class>
<subclass>F</subclass>
<main-group>3</main-group>
<subgroup>01</subgroup>
<symbol-position>F</symbol-position>
<classification-value>I

## 3. Parse Group017.txt

The first task was to extract some data. This was done as follows:

1. Grant ID- a unique ID for a patent grant made of alphanumeric characters. 
To do this, we used a regular expression with the re findall function. We found that this ID was located between 'file="' and '-' in numerous places within one patent record. We therefore processed one patent at a time and used the all() function to check whether every occurrence is the same; if so, that ID is added to the list 'grant_id'.

In [5]:
grant_id=[]
for item in patents:
    gid=re.findall(r'file="([A-Z]+[0-9]+)-', item)
    if all(x == gid[0] for x in gid): #checks to see if all items are the same, if yes then the item is added to the list
        grant_id.append(gid[0])   
    else:
        print("ERROR!")
len(grant_id)

150
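One edge case is worth noting: all() returns True for an empty list, so a record with no file="..." attribute would reach gid[0] and raise an IndexError. A minimal sketch of a guarded version follows; the helper name extract_grant_id and the sample record (including its second file attribute) are invented for illustration.

```python
import re

def extract_grant_id(record):
    """Return the grant ID if every file="..." attribute agrees, else None."""
    ids = re.findall(r'file="([A-Z]+[0-9]+)-', record)
    # set() collapses duplicates; exactly one distinct value means agreement.
    # An empty list gives an empty set, so a missing ID also returns None.
    if len(set(ids)) == 1:
        return ids[0]
    return None

# Invented sample with two file attributes that carry the same ID.
record = (
    '<us-patent-grant file="US10359846-20190723.XML">\n'
    '<img file="US10359846-20190723-D00000.TIF"/>'
)
print(extract_grant_id(record))
```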

2. Patent title- a title given by the inventor. We noticed that the patent title occurred only once in each patent instance and therefore used a regular expression and the re findall function to go through the entire text file and extract the title of each.

In [6]:
patent_title=re.findall(r'\>(.*?)\</invention-title>', text)
len(patent_title)

150

3. Patent kind- a category to which the patent grant belongs. In order to do this, we had to derive the meaning of each Id used. 

    - B1: Utility Patent Grant (no published application) issued on or after January 2, 2001.
    - B2: Utility Patent Grant (with a published application) issued on or after January 2, 2001.
    - S1: Design Patent
    - E1: Reissue Patent
    - P2: Plant Patent Grant (no published application) issued on or after January 2, 2001
    - P3: Plant Patent Grant (with a published application) issued on or after January 2, 2001

We first noticed that each xml block had numerous "kind" elements, so we used a regular expression to isolate the first document-id block inside us-bibliographic-data-grant. We then extracted the kind code from it and appended the matching description to the list "patent_kind".

In [7]:
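patent_kind=[]
for item in patents:
    k="NA" #default, so that an unrecognised kind code does not reuse the previous patent's description
    line=re.findall(r'<us-bibliographic-data-grant>(.*?)</document-id>', item, flags=re.DOTALL)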
    for item1 in line:
        kind=re.findall(r'<kind>(.*?)</kind>', item1)
        for item2 in kind:
            if item2=="B1":
                k="Utility Patent Grant (no published application) issued on or after January 2, 2001."
            elif item2=="B2":
                k="Utility Patent Grant (with a published application) issued on or after January 2, 2001."
            elif item2=="S1":
                k="Design Patent"
            elif item2=="E1":
                k="Reissue Patent"
            elif item2=="P2":
                k="Plant Patent Grant (no published application) issued on or after January 2, 2001"
            elif item2=="P3":
                k="Plant Patent Grant (with a published application) issued on or after January 2, 2001"           
    patent_kind.append(k)
len(patent_kind)

150
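The same code-to-description mapping can also be written as a dictionary lookup. This is a minimal sketch, not the code used above; the names KIND_DESCRIPTIONS and describe_kind, and the "NA" fallback for unknown codes, are assumptions made for illustration.

```python
# Kind codes and descriptions copied from the list above.
KIND_DESCRIPTIONS = {
    "B1": "Utility Patent Grant (no published application) issued on or after January 2, 2001.",
    "B2": "Utility Patent Grant (with a published application) issued on or after January 2, 2001.",
    "S1": "Design Patent",
    "E1": "Reissue Patent",
    "P2": "Plant Patent Grant (no published application) issued on or after January 2, 2001",
    "P3": "Plant Patent Grant (with a published application) issued on or after January 2, 2001",
}

def describe_kind(code):
    """Look up a kind code; unknown codes fall back to "NA" (an assumed default)."""
    return KIND_DESCRIPTIONS.get(code, "NA")

print(describe_kind("S1"))
```

A lookup table keeps the codes and their descriptions in one place, so supporting a new code means adding one entry rather than another elif branch.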

4. Number of claims- integer denoting the number of claims for a given grant. We noticed that the number of claims occurred only once in each patent instance and therefore used a regular expression and the re findall function to go through the entire text file and extract the integer for each.

In [8]:
number_of_claims=re.findall(r'\>(.*?)\</number-of-claims>', text)
len(number_of_claims)

150

5. Citations examiner count- integer that denotes the number of citations made by an examiner. We counted the number of occurrences of the text "cited by examiner" in each patent entry and stored the number in a list "citations_examiner_count".

In [9]:
citations_examiner_count=[]
for item in patents:
    citations_examiner_count.append(len(re.findall(r'cited by examiner', item)))
len(citations_examiner_count)

150

6. Citations applicant count- integer that denotes the number of citations made by the applicant of the grant. We counted the number of occurrences of the text "cited by applicant" in each patent entry and stored the number in a list "citations_applicant_count".

In [10]:
citations_applicant_count=[]
for item in patents:
    citations_applicant_count.append(len(re.findall(r'cited by applicant', item)))
len(citations_applicant_count)

150
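Both counts can be collected in a single pass over a record. The sketch below assumes that the two phrases appear inside <category> elements; that tag name is an assumption about the input and should be checked against the file. The helper name count_citations and the sample string are invented for illustration.

```python
import re
from collections import Counter

def count_citations(record):
    """Return (examiner_count, applicant_count) for one patent record."""
    # Assumption: each citation is labelled by a <category> element.
    categories = re.findall(
        r'<category>(cited by (?:examiner|applicant))</category>', record
    )
    counts = Counter(categories)
    # A Counter returns 0 for a key that was never seen.
    return counts["cited by examiner"], counts["cited by applicant"]

# Invented sample with two examiner citations and one applicant citation.
sample = (
    '<category>cited by examiner</category>\n'
    '<category>cited by applicant</category>\n'
    '<category>cited by examiner</category>'
)
print(count_citations(sample))
```

Anchoring the pattern to the surrounding tags avoids counting the phrase if it ever occurs in free text such as a claim or an abstract.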

7. Inventors- a list of the patent inventors. We noticed that the names of the inventors were split into first name and last name. We therefore extracted every first name and every last name in each patent and joined them pairwise, in order, to create one list per patent. We then joined each of these lists into a single string, so that each element in the overall list is a string of names.

In [11]:
inventors=[]
for item in patents:
    inventor=[]
    reg = r'<inventors>.*?</inventors>' 
    blocks=re.findall(reg, item, flags=re.DOTALL) #named "blocks" so that the file contents stored in "text" are not overwritten
    if len(blocks)==0:
        inventor=["NA"]
    else:
        for block in blocks:
            Lname=re.findall(r'<last-name>(.*?)</last-name>', block)
            Fname=re.findall(r'<first-name>(.*?)</first-name>', block)
            for i in range(len(Fname)):
                inventor.append(Fname[i]+" "+Lname[i])
    inventors.append(inventor)

for i in range(len(inventors)):
        inventors[i] = "["+",".join(inventors[i])+"]"
len(inventors)   

150
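Names taken straight from the XML can still contain character references (for example, "Ren&#xe9;"). A minimal sketch of one way to pair the names and decode such references with the standard-library html.unescape function is given below; the helper name extract_inventors and the sample record, including its nesting, are invented for illustration.

```python
import html
import re

def extract_inventors(record):
    """Return a list of "First Last" names, or ["NA"] if there is no inventors block."""
    block = re.search(r'<inventors>.*?</inventors>', record, flags=re.DOTALL)
    if block is None:
        return ["NA"]
    first = re.findall(r'<first-name>(.*?)</first-name>', block.group(0))
    last = re.findall(r'<last-name>(.*?)</last-name>', block.group(0))
    # zip pairs the names in document order; html.unescape turns
    # references such as &#xe9; into the characters they stand for.
    return [html.unescape(f + " " + l) for f, l in zip(first, last)]

# Invented sample record.
sample = (
    '<inventors>\n'
    '<inventor><addressbook><last-name>Chelle</last-name>'
    '<first-name>Ren&#xe9;</first-name></addressbook></inventor>\n'
    '<inventor><addressbook><last-name>Nguyen</last-name>'
    '<first-name>David</first-name></addressbook></inventor>\n'
    '</inventors>'
)
print(extract_inventors(sample))
```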

8. Claims text- list of claims text for the various patent claims. We used regular expressions and the re library's findall and sub functions to find the relevant data and clean it. The claims text for the various patents is stored in the list "claims_text".

In [12]:
claims_text=[]
for item in patents:
    claims=[]
    reg=r'<claim-text>.*?</claim>'
    te=re.findall(reg, item, flags=re.DOTALL)
    if len(te)==0:
        claims1="[NA]"
    else:
        for item in te:
            item=re.sub(r'<.+?>', '', item)
            item=re.sub(r'\n', '', item)
            claims.append(item.strip())
            claims1="["+",".join(claims)+"]"
    claims_text.append(claims1)       
len(claims_text)

150
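The cleaning step can be illustrated on a single claim. The sketch below is a variation rather than the exact code above: instead of deleting newlines, it collapses every run of whitespace into one space, so that words on adjacent lines are not glued together. The helper name clean_claim and the sample claim are invented for illustration.

```python
import re

def clean_claim(raw):
    """Strip XML tags from one claim and normalise its whitespace."""
    no_tags = re.sub(r'<.+?>', '', raw)           # drop every XML tag
    return re.sub(r'\s+', ' ', no_tags).strip()   # collapse runs of whitespace

# Invented sample claim with nested claim-text elements.
raw = (
    '<claim-text>1. A device comprising:\n'
    '<claim-text>a sensor; and</claim-text>\n'
    '<claim-text>a processor.</claim-text>\n'
    '</claim-text>'
)
print(clean_claim(raw))
```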

9. Abstract- the patent abstract text. This is extracted in a similar manner to the claims text, using the re library and regular expressions.

In [13]:
abstract=[]
for item in patents:
    reg=r'<abstract id="abstract">.*?</abstract>'
    t=re.findall(reg, item, flags=re.DOTALL)
    if len(t)==0:
        abst="NA"
    else:
        for item1 in t:
            t2=re.findall(r'>(.*?)</p>', item1)
            for item2 in t2:
                item2=re.sub(r'<.+?>', '', item2)
                item2=re.sub(r'\n', '', item2)
                abst=item2
    abstract.append(abst)
len(abstract)

150
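In the loop above, abst is overwritten on each iteration, so an abstract made up of several paragraphs would keep only the last one. A minimal sketch that joins all paragraphs instead is given below; the helper name extract_abstract and the sample record are invented for illustration, and whether any abstract in the input actually has more than one paragraph has not been checked.

```python
import re

def extract_abstract(record):
    """Return the abstract text with tags removed, or "NA" if it is missing."""
    block = re.search(
        r'<abstract id="abstract">(.*?)</abstract>', record, flags=re.DOTALL
    )
    if block is None:
        return "NA"
    # Collect the body of every <p ...> element inside the abstract.
    paragraphs = re.findall(r'<p[^>]*>(.*?)</p>', block.group(1), flags=re.DOTALL)
    cleaned = [re.sub(r'<.+?>', '', p).replace('\n', ' ').strip() for p in paragraphs]
    return " ".join(cleaned)

# Invented sample with two paragraphs and one inline tag.
sample = (
    '<abstract id="abstract">\n'
    '<p id="p-0001" num="0000">First part with <b>bold</b> text.</p>\n'
    '<p id="p-0002" num="0000">Second part.</p>\n'
    '</abstract>'
)
print(extract_abstract(sample))
```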

We finally combined the various lists to create a pandas dataframe containing all the extracted data in tabular form.

In [14]:
df = pd.DataFrame({

'grant_id' : grant_id, 
'patent_title' : patent_title,
'kind' : patent_kind, 
'number_of_claims' : number_of_claims, 
'inventors' : inventors,
'citations_applicant_count': citations_applicant_count,
'citations_examiner_count': citations_examiner_count,
'claims_text': claims_text,
'abstract': abstract
})

df

| | grant_id | patent_title | kind | number_of_claims | inventors | citations_applicant_count | citations_examiner_count | claims_text | abstract |
|---|---|---|---|---|---|---|---|---|---|
| 0 | US10359846 | Wearable device gesture detection | Utility Patent Grant (with a published applica... | 20 | [Nissanka Arachchige Bodhi Priyantha,Jie Liu] | 0 | 2 | [1. A wearable device comprising:at least one ... | The description relates to smart rings. One ex... |
| 1 | US10360147 | Data storage layout | Utility Patent Grant (with a published applica... | 20 | [Kyle B. Wheeler,Timothy P. Finkbeiner] | 296 | 12 | [1. A system for storing data elements compris... | Examples of the present disclosure provide app... |
| 2 | US10360793 | Preventing vehicle accident caused by intentio... | Utility Patent Grant (no published application... | 20 | [Eliseba Costantini,Alice Guidotti,Daniele Mor... | 9 | 5 | [1. A method for detecting and managing a vehi... | A method, computer system, and a computer prog... |
| 3 | US10358535 | Thermal interface material | Utility Patent Grant (with a published applica... | 11 | [Matthew Collins Weisenberger,John Davis Cradd... | 12 | 13 | [1. A thermal interface material, comprising:a... | A flexible sheet of aligned carbon nanotubes i... |
| 4 | US10361019 | Moisture resistant layered sleeve heater and m... | Utility Patent Grant (with a published applica... | 20 | [Elias Russegger,Gerhard Schefbanker,Gernot An... | 0 | 9 | [1. A method of forming a heater assembly comp... | A method of forming a layered heater assembly ... |
| 5 | US10360916 | Enhanced voiceprint authentication | Utility Patent Grant (with a published applica... | 12 | [Erik Keil Perotti] | 3 | 29 | [1. A method, comprising:receiving a first utt... | The invention relates to a method for enhanced... |
| 6 | US10361627 | Reduction of low frequency noise in a discrete... | Utility Patent Grant (no published application... | 20 | [Joerg Erik Goller] | 2 | 5 | [1. An integrated circuit, comprising:a timeba... | An integrated circuit. The integrated circuit ... |
| 7 | US10360694 | Methods and devices for image loading and meth... | Utility Patent Grant (with a published applica... | 14 | [Binghui Chen,Xiaoming Li] | 30 | 5 | [1. A method for video playback, the method co... | The present disclosure provides methods and de... |
| 8 | US10359381 | Methods and systems for determining an interna... | Utility Patent Grant (with a published applica... | 19 | [Steven Lewis,Matthew Biermann,Suman Pattnaik] | 14 | 9 | [1. A system for determining internal properti... | Systems and methods are provided to determine ... |
| 9 | US10358532 | Method for producing a non-porous composite ma... | Utility Patent Grant (with a published applica... | 17 | [Ren&#xe9; Chelle,David Nguyen,Arnaud Vilbert] | 3 | 1 | [1. A method of making a non-porous biodegrada... | The subject matter of the present invention is... |


## 4. Creating CSV file

We use the pandas to_csv() function to convert the above dataframe to a csv file. We specify that index is False so as to not include the default index column.

In [15]:
df.to_csv('Group017.csv', index = False)

## 5. Creating json file

In order to convert to json, we just used a for loop and concatenated the required data in the right format.

In [16]:
sample=open("Group017.json", "w")
sample.write("{")
for i in range(len(patents)-1):
    sample.write('"'+grant_id[i]+'"' + ":{" + '"patent_title"'+":"+ '"'+ patent_title[i]+'"'+","+'"kind"'+":"+'"'+ patent_kind[i]+'"'+","+'"number_of_claims"'+":"+ str(number_of_claims[i])+","+ '"inventors"' +":"+'"'+ inventors[i]+'"'+","+'"citations_applicant_count"'+":"+ str(citations_applicant_count[i])+","+'"citations_examiner_count"'+":"+ str(citations_examiner_count[i])+","+'"claims_text"'+":"+'"'+ claims_text[i]+'"'+","+'"abstract"'+":"+ '"'+abstract[i]+'"'+"},")
sample.write('"'+grant_id[len(patents)-1]+'"' + ":{" + '"patent_title"'+":"+ '"'+ patent_title[len(patents)-1]+'"'+","+'"kind"'+":"+'"'+ patent_kind[len(patents)-1]+'"'+","+'"number_of_claims"'+":"+ str(number_of_claims[len(patents)-1])+","+ '"inventors"' +":"+'"'+ inventors[len(patents)-1]+'"'+","+'"citations_applicant_count"'+":"+ str(citations_applicant_count[len(patents)-1])+","+'"citations_examiner_count"'+":"+ str(citations_examiner_count[len(patents)-1])+","+'"claims_text"'+":"+'"'+ claims_text[len(patents)-1]+'"'+","+'"abstract"'+":"+ '"'+abstract[len(patents)-1]+'"'+"}}")
sample.close()
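Concatenating the JSON text by hand works only while no field contains a double quote or a backslash, because either would produce invalid JSON. One possible alternative, sketched here with the standard-library json module, is to build a dictionary and let the module handle quoting. The three lists below are invented stand-ins for the lists built in section 3, and only two of the fields are shown.

```python
import json

# Invented stand-ins for the lists built earlier; note the quotes in the first title.
grant_id = ["US00000001", "US00000002"]
patent_title = ['A "smart" ring', "Data storage layout"]
number_of_claims = ["20", "11"]

records = {}
for i, gid in enumerate(grant_id):
    records[gid] = {
        "patent_title": patent_title[i],
        "number_of_claims": int(number_of_claims[i]),  # stored as a JSON number
    }

# json.dumps escapes the embedded quotes; json.dump(records, f) would
# write the same text to an open file object f.
json_text = json.dumps(records, indent=2)
print(json_text)
```

Parsing the result back with json.loads is a quick way to confirm that the output is valid JSON.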

## 6. Summary

This assessment measured the understanding of text file parsing techniques using Python and helped us to understand the uses of the pandas and re libraries. We were also able to improve our understanding of regular expressions, since they were used extensively in this assignment. 
Apart from this, we were able to learn and understand the formats of xml, csv and json files. 

The process of writing the program was long but thoroughly interesting. 
The code runs with no errors and gives the desired output. While working on this assignment we were able to implement concepts that were covered in the lectures and tutorials. 

## 7. References

- CSEstack. (2019). 3 Ways to Check if all Elements in List are Same [Python Code]. [online] Available at: https://www.csestack.org/python-check-if-all-elements-in-list-are-same/ [Accessed 20 Aug. 2019].

- Stack Overflow. (2019). How would you make a comma-separated string from a list of strings?. [online] Available at: https://stackoverflow.com/questions/44778/how-would-you-make-a-comma-separated-string-from-a-list-of-strings/44781#44781 [Accessed 20 Aug. 2019].

- Stack Overflow. (2019). What do 'lazy' and 'greedy' mean in the context of regular expressions?. [online] Available at: https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions [Accessed 20 Aug. 2019].

- Squarespace. (2019). What is JSON?. [online] Available at: https://developers.squarespace.com/what-is-json [Accessed 20 Aug. 2019].