# FIT5196 Assessment 1

###  Group 103:

##### Student Name: Alan Gewerc
##### Student ID: 29961246
##### Student Name: Cristiana Garcia Gewerc
##### Student ID: 30088887


Date: 24/08/2019

Environment: Python 3.7.1 and Jupyter Notebook 5.7.4 (64-bit)

Libraries used:
* pandas 0.23.4 (for data frames, included in Anaconda Python 3.7.1) 
* re (for regular expressions, part of the Python standard library, so it has no separate version number) 



## 1. Introduction
This assessment covers the very first step of analyzing textual data: extracting data from semi-structured text files.
For our group, there are a total of 150 patent grants in one 16.13 KB file named `patents.txt`. The task is to extract the data and convert it to CSV and JSON format with the following elements:

1. grant_id: a unique ID for a patent grant consisting of alphanumeric characters.
2. patent_kind: a category to which the patent grant belongs.
3. patent_title: a title given by the inventor to the patent claim.
4. number_of_claims: an integer denoting the number of claims for a given grant.
5. citations_examiner_count: an integer denoting the number of citations made by the examiner for a given patent grant (0 if None)
6. citations_applicant_count: an integer denoting the number of citations made by the applicant for a given patent grant (0 if None)
7. inventors: a list of the patent inventors’ names ([NA] if the value is Null).
8. claims_text: a list of claim texts for the different patent claims ([NA] if the value is Null).
9. abstract: the patent abstract text (‘NA’ if the value is Null)

More details for each task will be given in the following sections.

## 2. Import libraries

In [1]:
import pandas as pd
import re

## 3. Examining and loading data

Each patent-grant starts with the `<us-patent-grant ...>` tag, and consequently ends with `</us-patent-grant>`. 

To prepare the data for further analysis, we split the file contents into strings so that each string represents one patent-grant observation (the information that will end up as one row of our future dataframe). 

A regex is defined so that each string starting with the opening tag `<us-patent-grant ...>` and ending with the closing tag `</us-patent-grant>` is captured individually. The non-greedy quantifier `*?` is necessary so that the whole file is not matched at once. The regex also uses the character class `[\s\S]` (whitespace or non-whitespace characters), which matches any character, including line breaks, between the opening and the closing tag. We use the `re.findall()` function to return all non-overlapping matches of the pattern in the string as a list of strings, as described in the module's [documentation](https://docs.python.org/2/library/re.html). 

In [2]:
# read the whole file
with open("data/patents.txt","r") as patent_file:
    patent_string = patent_file.read()
    
# matches everything between the us-patent-grant declaration and the root closing tag
regex = r'<us-patent-grant[\s\S]*?</us-patent-grant>' 
patent_list = re.findall(regex, patent_string)
print(len(patent_list))

150
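
The effect of the non-greedy quantifier can be checked on a tiny made-up input. The sketch below uses a hypothetical two-grant string (not taken from `patents.txt`) and compares the lazy pattern with a greedy variant of it.

```python
import re

# Hypothetical miniature input with two patent grants (not real data)
sample = (
    '<us-patent-grant file="A-1.XML">\nfirst\n</us-patent-grant>\n'
    '<us-patent-grant file="B-2.XML">\nsecond\n</us-patent-grant>\n'
)

# Lazy quantifier: stops at the first closing tag, so each grant is separate
lazy_matches = re.findall(r'<us-patent-grant[\s\S]*?</us-patent-grant>', sample)

# Greedy quantifier: runs to the last closing tag, merging both grants
greedy_matches = re.findall(r'<us-patent-grant[\s\S]*</us-patent-grant>', sample)

print(len(lazy_matches), len(greedy_matches))
```

With the lazy version every grant becomes its own list element, which is what the rest of the notebook relies on.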


## 4. Parsing XML and data extraction 
We need to extract some specific information from the `patent_list`. To perform these extractions, we use regular expressions. So, first we define the regex patterns we will need later.
### 4.1. Regex Design:
#### grant_id:
For each `patent-grant`, we see as part of the first tag the following kind of pattern:

`file="US10357937-20190723.XML"`

For this example, the `grant_id` is `US10357937`: the letters and numbers that follow `file="` and come before a `-`, some more characters and `.XML`. To extract them we use the pattern `(\w+)`, which matches word characters (a-z, A-Z, 0-9 and the underscore). We use it greedily because we want every `\w` character that comes before the `-`. The backslash before the double quote is needed so that the quote does not end the Python string literal.


In [3]:
# raw string, with the dot escaped so that it only matches a literal "."
pattern_grant_id = r"file=\"(\w+)-\w+\.XML"
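
As a quick check, the pattern can be tried on a single opening tag. The tag below is a simplified, assumed example built around the `file` attribute shown above; the other attribute is invented for illustration.

```python
import re

# Simplified opening tag; only the file attribute comes from the example above
header = '<us-patent-grant lang="EN" file="US10357937-20190723.XML">'

# Raw-string version of the grant_id pattern with an escaped dot
match = re.search(r'file="(\w+)-\w+\.XML"', header)
grant_id = match.group(1)
print(grant_id)
```

The first `\w+` stops at the hyphen because `-` is not a word character, so only the identifier is captured.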

#### title:
For each `patent-grant`, we see an `invention-title` tag with the following kind of pattern:

`<invention-title id="d2e71">`Fiber laminate, method for manufacturing fiber laminate, and fiber reinforced composite`</invention-title>`

For this example, the `patent_title` is `Fiber laminate, method for manufacturing fiber laminate, and fiber reinforced composite`: the characters that follow `<invention-title id=...">` and precede the closing tag `</invention-title>`. We extract any character inside the tags, so `.+` is used; the greedy strategy is safe in this case because every patent has only one title. More about greedy and lazy strategies [here](https://javascript.info/regexp-greedy-and-lazy). Again, the backslash is needed before each double quote inside the pattern.

In [4]:
pattern_title = r"<invention-title id=\"\w+\">(.+)</invention-title>"

#### kind:
For each `patent-grant`, we see some `kind` tags with the following kind of pattern:

`<kind>`U1`</kind>`

We are interested in the first one. To keep it simple, the pattern matches every `kind` tag of a patent, and later we keep only the first match returned by the search function.

What we extract here is a code, which is not our final goal but can be translated to its actual meaning after extraction. So, we first extract the code, which is always one capital letter (which is why we use `[A-Z]` and not `[a-z]`) optionally followed by one digit (which is why there is a `?` after `\d`), located between `<kind>` and `</kind>`.

In [5]:
pattern_kind = r"<kind>([A-Z]\d?)</kind>" 
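
To illustrate why a pattern that matches every `kind` tag is still enough, the sketch below runs it on a made-up fragment with two `kind` tags: `re.search` returns only the first match, while `re.findall` would return all of them.

```python
import re

pattern_kind = r"<kind>([A-Z]\d?)</kind>"

# Made-up fragment: the first kind belongs to the grant, the second to a citation
fragment = "<kind>B2</kind><citation><kind>A1</kind></citation>"

first_kind = re.search(pattern_kind, fragment).group(1)
all_kinds = re.findall(pattern_kind, fragment)
print(first_kind, all_kinds)
```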

#### number_of_claims:
For each `patent-grant`, we see a `number-of-claims` tag with the following kind of pattern:

`<number-of-claims>`5`</number-of-claims>`

It's very straightforward, the digits we want to extract are located between `<number-of-claims>` and `</number-of-claims>`.

In [6]:
pattern_number_claims = r"<number-of-claims>(\d+)</number-of-claims>" 

#### inventors (first and last names):
This extraction is a little bit more complex. For each `patent-grant`, we see a variable number of inventors. Also, to get their full names, we must extract their first names and their last names. 

The first step here is defining the regex pattern that extracts the text block where the inventors' names can be found, which lies between the `<inventors>` and `</inventors>` tags. That way, when extracting the names, we can be sure they belong to inventors and not to examiners, for instance. We are again being greedy, using `(.+)`, since we are confident we will find only one pair of inventors tags.

In [7]:
pattern_inventors = r"<inventors>(.+)</inventors>"

Later, we will extract the names from the new string and it will be quite simple, the first names are the characters in between `<first-name>` and `</first-name>` - and the last names in between `<last-name>` and `</last-name>`. The non-greedy pattern `+?` is necessary in case there is more than 1 inventor for a given patent.

In [8]:
pattern_first_name = r"<first-name>(.+?)</first-name>"
pattern_last_name = r"<last-name>(.+?)</last-name>"
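
The need for the lazy quantifier shows up as soon as a block holds two inventors. The block below is a simplified, assumed structure (the real XML has more nested tags), using the two inventor names of the first patent.

```python
import re

# Simplified inventors block; the real XML contains additional tags
block = (
    "<inventor><first-name>Genki</first-name><last-name>Yoshikawa</last-name></inventor>"
    "<inventor><first-name>Ryuta</first-name><last-name>Kamiya</last-name></inventor>"
)

lazy_first = re.findall(r"<first-name>(.+?)</first-name>", block)
greedy_first = re.findall(r"<first-name>(.+)</first-name>", block)

print(lazy_first)         # one entry per inventor
print(len(greedy_first))  # the greedy version merges everything into one match
```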

#### citations_applicant_count:
For each `patent-grant`, there is a variable number of citations by the applicant. We must count how many times they appear to get the desired number. Therefore, we first need the pattern that identifies each occurrence.

They can be identified by the following string: `<category>cited by applicant</category>`

In [9]:
pattern_cited_applicant = r"<category>cited by applicant</category>"

#### citations_examiner_count:
It's the exact same case as `citations_applicant_count`, except that the pattern to identify them is now: `<category>cited by examiner</category>`

In [10]:
pattern_cited_examiner = r"<category>cited by examiner</category>"

#### abstract:
It's everything located within the `abstract` tags. They start with `<abstract id="abstract">` and finish with `</abstract>`. We are again being greedy, using `(.+)`, since we are confident we will find only one pair of abstract tags. Note that every double quote inside a pattern is preceded by a backslash that removes its special meaning.

In [11]:
pattern_abstract = r"<abstract id=\"abstract\">(.+)</abstract>"

#### claims_text:
It's everything located within the `claims` tags. They start with `<claims id="claims">` and finish with `</claims>`. We are again using a greedy strategy.

In [12]:
pattern_claim_text = r"<claims id=\"claims\">(.+)</claims>"

Furthermore, we will create support lists to record the information extracted with the patterns defined above. 

### 4.2. Extracting Data

Then, we will create and populate supporting lists with elements extracted from the text according to our regex patterns. Each element of the lists will refer to one aspect of one patent. 

In [13]:
# We are creating 10 lists in total
list_grant_id, list_patent_title, list_number_claims, list_inventors, list_cited_applicant = [], [], [], [], []
list_cited_examiner, list_abstract, list_claims_text, list_kind, list_inventors_names = [], [], [], [], []

Now we will populate the lists, appending items that are matched by our regex patterns. We use `re.search` because for this stage, there is only one match per patent, for instance, `grant_id` and `patent_title` of one patent must be unique. In all cases below, we are looking for a pattern match anywhere in the given string. See [re documentation](https://docs.python.org/2/library/re.html).

For the `abstract`, we did observe that it's possible not to have it present for some patents. In these cases, the instructions are to give it `NA` values. Therefore, first we search if there is a match with the corresponding pattern. If there is, we extract it and append to the list. If there isn't, we just append an `NA` to the `list_abstract` instead.

Some searches need the `DOTALL` flag so that the `.` special character matches any character at all, including a newline. This is the case for `abstract`, `claims_text` and `inventors`, whose content spans multiple lines.

In [14]:
for element in patent_list: # each element is a string with all the information regarding one single patent
    
    list_grant_id.append(re.search(pattern_grant_id, element).group(1))
    list_patent_title.append(re.search(pattern_title, element).group(1))
    list_number_claims.append(re.search(pattern_number_claims, element).group(1))
    list_claims_text.append(re.search(pattern_claim_text, element, re.DOTALL).group(1))
    list_kind.append(re.search(pattern_kind, element).group(1)) # re.search returns the first "kind" match, which is the one we are interested in
    list_inventors.append(re.search(pattern_inventors, element, re.DOTALL).group(1))
    
    if bool(re.search(pattern_abstract, element, re.DOTALL)): # if there is an abstract match:
        list_abstract.append(re.search(pattern_abstract, element, re.DOTALL).group(1))
    else: # if there's no abstract for the patent:
        list_abstract.append('NA')    

    # in the following cases we are looking for the number of citations, not the citation itself
    list_cited_applicant.append(element.count(pattern_cited_applicant)) 
    list_cited_examiner.append(element.count(pattern_cited_examiner))
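
The abstract handling and the role of `DOTALL` can be isolated in a small helper. This is only a sketch: `extract_abstract` is a hypothetical name, and the two input strings are invented.

```python
import re

pattern_abstract = r"<abstract id=\"abstract\">(.+)</abstract>"

def extract_abstract(patent_text):
    """Return the abstract block, or 'NA' when the patent has none."""
    match = re.search(pattern_abstract, patent_text, re.DOTALL)
    return match.group(1) if match else "NA"

# Invented inputs: one abstract spanning several lines, one patent without abstract
with_abstract = '<abstract id="abstract">\n<p>A fiber laminate.</p>\n</abstract>'
without_abstract = '<claims id="claims">...</claims>'

print(repr(extract_abstract(with_abstract)))
print(extract_abstract(without_abstract))
# Without DOTALL the dot cannot cross the newline, so nothing is found
print(re.search(pattern_abstract, with_abstract))
```

Storing the match object once also avoids running the same search twice per patent.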

#### inventors
As stated before, the inventors' names are a little trickier to extract than the other fields, because they cannot be extracted with a single regex pattern.

What we have so far is a list with all the data regarding the inventors of each patent. We need to extract only the names (first name / last name) and put them in our target format, i.e. [Name1 Lastname1, Name2 Lastname2, ...]. We iterate over `list_inventors` to first populate the `inventors_first_names` and `inventors_last_names` lists and then combine their values to form the `list_inventors_names`.

We can have multiple inventors for each patent. The `findall` function is able to capture all, but we need to iterate over its output in order to extract the first name(s), the last name(s) and combine them one by one when needed. 


In [15]:
for patent in list_inventors:
    inventors_first_names = re.findall(pattern_first_name,patent)
    inventors_last_names = re.findall(pattern_last_name,patent)
    inventor_names = ""
    # for each patent it's possible to have multiple inventors, that's why we need a second for loop. 
    for inventor in range(len(inventors_first_names)):
        # the same inventor occupies the same index position in inventors_first_names and inventors_last_names
        inventor_names = inventor_names + inventors_first_names[inventor] + " " + inventors_last_names[inventor] + ","
    inventor_names = "[" + inventor_names[:-1] + "]" # inventor_names[:-1] to eliminate the last undesired ","
    list_inventors_names.append(inventor_names) 
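
The same pairing of first and last names can also be written with `zip` and `str.join`, which removes the need to trim the trailing comma. This is an alternative sketch, not the code used above; `format_inventors` is a hypothetical helper name.

```python
def format_inventors(first_names, last_names):
    """Join paired first/last names into the '[A B,C D]' target format."""
    full_names = [f"{first} {last}" for first, last in zip(first_names, last_names)]
    return "[" + ",".join(full_names) + "]"

print(format_inventors(["Genki", "Ryuta"], ["Yoshikawa", "Kamiya"]))
print(format_inventors([], []))  # a patent with no inventors gives empty brackets
```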

Let's check if we already got the desired output by printing the final 15 observations.

In [16]:
print(*list_inventors_names[-15:], sep = "\n")

[Duminda Dewasurendra,Vivek Rajendran,Daniel Jared Sinder]
[G. Kyle Lobisser]
[Julien Hitce,Maria Dalko,Marie-C&#xe9;line Frantz]
[Luis Emilio San Martin,Ahmed E. Fouda,Burkay Donderici]
[Xiaofeng Zou,Rengui Zhou,Zhijin Lin]
[Alex W. MacKay,Choudhury A. Al Sayeed,David C. Bownass]
[Masahiro Miwa,Kazuma Ando,Yasutaka Ohta]
[KRISTAPHER MIXELL,Paul Phillips]
[Jinping Zhou,Lizhu Yu,Hongtao Ma,Jingxiao Lan,Guodong Wang]
[Ravi V. Ika,Yogendra K. Jain,Anand M. Tati,Paul Ducey,James Lee,Prakash Tallabattula,Kamal Patel,Satish Kumar Cheepurpalli]
[Richard J. Cartwright,David S. McGrath,Glenn N. Dickins]
[Curtis E. Quady]
[Joseph Maalouf,Sumukh Shevde,Curtis Gong,Christian Sporck,Cheong Kun]
[Joel Alexander Booth,Justyn Howard,Aaron Rankin]
[Inseok Hwang,Jente B Kuang,Janani Mukundan]


Now, the inventors text is already formatted in a way similar to our target format, i.e. [Name1 Lastname1, Name2 Lastname2, ...]

However, there are further adjustments to be made, for instance:
- The names must start with an uppercase letter followed by lowercase letters;
- Some character entities, such as `&#xe9;`, must be converted.

### 4.3. Further Parsing
Now we have extracted all necessary information from the text file, but we still don't have all the lists in the proper format. 

We will reshape some of the lists through further extraction and start cleaning/formatting them until they are in the proper form to be displayed in the columns of our final dataframe. 

#### inventors

For now, we adjust the uppercase/lowercase letters, and leave the character conversion to be done together with the other lists that need the same treatment. We didn't find any built-in function that deals correctly with apostrophes and `#`s in the middle of a name, so we followed the approach suggested in the [python documentation](https://docs.python.org/3/library/stdtypes.html), adapting the function written there.

In [17]:
# Define a function that makes each name start with an upper case letter followed by lower case letters
def titlecase(s):
    return re.sub(r"[A-Za-z]+('?[\w&#;]+)?",
                  lambda mo: mo.group(0)[0].upper() +
                             mo.group(0)[1:].lower(), s)

# applying the titlecase function to all elements of list_inventors_names:

list_inventors_names = list(map(lambda name: titlecase(name), list_inventors_names))
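
To see what the custom function buys us, the sketch below applies it to two names from the output above and compares it with the built-in `str.title()`, which treats the characters inside the `&#xe9;` entity as word boundaries.

```python
import re

def titlecase(s):
    # Same function as above: capitalize the first letter of each word,
    # lower-case the rest, and keep apostrophes and entities inside the word
    return re.sub(r"[A-Za-z]+('?[\w&#;]+)?",
                  lambda mo: mo.group(0)[0].upper() + mo.group(0)[1:].lower(), s)

fixed_case = titlecase("[KRISTAPHER MIXELL,Paul Phillips]")
entity_kept = titlecase("Marie-C&#xe9;line")
builtin_result = "Marie-C&#xe9;line".title()

print(fixed_case)
print(entity_kept)
print(builtin_result)  # the built-in alters the entity, so it no longer matches the lookup keys
```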

#### claims_text
The `list_claims_text` has some newlines that must be replaced by "," in order to generate the desired output.
The following code performs these steps:

1. Replace the pattern `<\/claim-text>\n<\/claim>\n<claim id=.+?>\n<claim-text>` by `,` using the [re.sub](https://lzone.de/examples/Python%20re.sub) function. This is the pattern which identifies the end of one claim and the start of the next, which is exactly the point where we need a comma.
2. Eliminate the remaining `\n`.

The [map function](https://www.geeksforgeeks.org/python-map-function/) is used to apply the defined lambda function to every element of the `list_claims_text` and return a list with the results.

In [18]:
list_claims_text = list(
    map(lambda x:
         re.sub(r'<\/claim-text>\n<\/claim>\n<claim id=.+?>\n<claim-text>',',',x)
         .replace('\n','')
        , 
        list_claims_text))

There is still one last adjustment to be done, that is to put the list_claims_text elements within [] such as in the sample output.  

In [19]:
list_claims_text = list(map(lambda x: '[' + x + ']',list_claims_text))
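
The whole claims transformation can be followed on a made-up block with two claims. The `id` and `num` attributes below are assumptions about the layout, and the tag removal uses the same tag pattern that is defined later in section 5.3.

```python
import re

# Made-up claims block with two claims (attribute names are assumed)
raw_claims = (
    '\n<claim id="CLM-00001" num="00001">\n<claim-text>1. A widget.</claim-text>\n</claim>\n'
    '<claim id="CLM-00002" num="00002">\n<claim-text>2. The widget of claim 1.</claim-text>\n</claim>\n'
)

# Step 1: the boundary between two claims becomes a comma
joined = re.sub(r'</claim-text>\n</claim>\n<claim id=.+?>\n<claim-text>', ',', raw_claims)
# Step 2: drop the remaining newlines
joined = joined.replace('\n', '')
# Later steps: strip the leftover tags and wrap the text in brackets
claims_text = '[' + re.sub(r'<[^<]*?>', '', joined) + ']'
print(claims_text)
```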

### 4.4. Generating the "kind" dictionary:
The "kind" column needs a proper translation of the codes in `list_kind`. To find the correct replacement for each kind code, we generate a dictionary based on what we find in the sample input and output files.

To find the key-value pairs of our dictionary, we must capture the codes (keys) in the `Sample_input.txt` and the equivalent descriptions (values) in the `Sample_output.csv` (or JSON, but we chose to use the CSV). 

The pattern to identify the relevant kind tags here includes `<us-patent-.+?` because the kinds we are interested in are the ones that show up at the beginning of each patent. Before, when we captured `kind`, we used the `re.search` function and kept only its first match because we were dealing with one patent at a time. Now it's different: we are handling the whole document and using `re.findall` to get a list of all matches.

In [20]:
dictionary_pattern = r'<us-patent-.+?<kind>(\w\d?)</kind>'

So, we read the `Sample_input.txt`, use the `rstrip` method to strip all trailing whitespace from each line, as explained by [Markus Jarderot](https://stackoverflow.com/questions/275018/how-can-i-remove-a-trailing-newline), join the lines into a single string `input_data`, look for all occurrences of the defined pattern and, finally, save the result as a list called `kind_dic_keys`.

In [21]:
with open('Sample_input.txt', 'r') as sample_input:
    input_data="".join(line.rstrip() for line in sample_input)
    kind_dic_keys = re.findall(dictionary_pattern,input_data)

To generate the `kind_dic_values` list, it's as simple as reading the kind column of the `Sample_output.csv`. 

In [22]:
kind_dic_values = pd.read_csv('Sample_output.csv', usecols=['kind'])

The last step is to actually create the dictionary by using the zipped keys/values previously obtained.

In [23]:
kind_dictionary = dict(zip(kind_dic_keys,kind_dic_values.kind.tolist()))

We can check the dictionary, and, to make sure it's making sense, you can check an online [patent kind dictionary](https://www.finnegan.com/en/insights/blogs/prosecution-first/the-abcs-of-patent-kind-codes.html).

In [24]:
print(kind_dictionary)

{'B2': 'Utility Patent Grant (with a published application) issued on or after January 2, 2001.', 'S1': 'Design Patent', 'E1': 'Reissue Patent', 'B1': 'Utility Patent Grant (no published application) issued on or after January 2, 2001.', 'P3': 'Plant Patent Grant (with a published application) issued on or after January 2, 2001', 'P2': 'Plant Patent Grant (no published application) issued on or after January 2, 2001'}


Now, with the dictionary generated and looking good, we can use it to rewrite the `list_kind` in the translated format:

In [25]:
list_kind = list(map(lambda x: kind_dictionary[x], list_kind))
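
The `dict(zip(...))` construction keeps one entry per code even when the sample lists repeat it, since later pairs simply overwrite identical earlier ones. The sketch below shows this with two of the descriptions printed above, plus a hypothetical `translate_kind` helper that falls back to a placeholder instead of raising `KeyError` for a code that never appeared in the sample files (the code `X9` is invented).

```python
# Codes and descriptions as they might appear, row by row, in the sample files
sample_codes = ["S1", "E1", "S1"]
sample_descriptions = ["Design Patent", "Reissue Patent", "Design Patent"]

kind_lookup = dict(zip(sample_codes, sample_descriptions))

def translate_kind(code, lookup):
    """Translate a kind code, keeping unknown codes visible instead of failing."""
    return lookup.get(code, "Unknown kind code: " + code)

print(kind_lookup)
print([translate_kind(code, kind_lookup) for code in ["E1", "X9"]])
```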

## 5. Preparing and Cleaning the DataFrame
There is some cleaning to be done in the columns `patent_title`, `inventors`, `claims_text` and `abstract`.

### 5.1. Identify all HTML special characters

They are the ones that start with `&#` and end with `;`. We look for this pattern in our target lists and append all matches to `list_char`.

In [26]:
list_char = []
pattern_char = r"&#.*?;"

for i in range(len(list_patent_title)):
    
    list_char.append(re.findall(pattern_char, list_patent_title[i]))
    list_char.append(re.findall(pattern_char, list_inventors_names[i]))
    list_char.append(re.findall(pattern_char, list_abstract[i]))
    list_char.append(re.findall(pattern_char, list_claims_text[i]))    

Generating a set with all the non-empty, non-repeating elements to clean:

In [27]:
# eliminate empty elements
list_char2 = [item for item in list_char if item != []]
# transforming it in a "flat" list
list_char3 = [item for sublist in list_char2 for item in sublist]
# visualise the unique values
print(set(list_char3))

{'&#x3c;', '&#x394;', '&#xd7;', '&#x3be;', '&#xbc;', '&#x2208;', '&#x201c;', '&#x2211;', '&#x3f5;', '&#x2550;', '&#x2212;', '&#x26;', '&#x2014;', '&#x2062;', '&#x3e;', '&#x201d;', '&#x2003;', '&#xe9;', '&#x3b8;', '&#x2264;', '&#x2061;', '&#x2159;', '&#x3b1;', '&#xb7;', '&#x3bc;', '&#xf1;', '&#x2032;', '&#xf6;', '&#xb0;', '&#x2033;', '&#x3bb;'}


### 5.2. Creating the Dictionaries

Now we will create a dictionary to replace the HTML entity (hex) characters. 

It is based on the special characters found above and on the table found at the following [link](http://www.howtocreate.co.uk/sidehtmlentity.html).

In [28]:
clean_dict = {'&#x3c;':'<', '&#x3bc;':'μ', '&#xf6;':'ö', '&#x2211;':'∑', '&#x2014;':'—', '&#x2033;':'″', '&#x26;':'&',
              '&#x3b8;': 'θ', '&#x2061;':'', '&#x3f5;':'ϵ' , '&#xb0;':'°', '&#x2062;':'' , '&#x2550;':'═' , '&#x3bb;': 'λ',
              '&#xd7;': '×', '&#x2208;':'∈', '&#x2159;':'⅙', '&#x2264;':'≤', '&#x201c;': '“', '&#x201d;': '”', '&#x3e;': '>',
              '&#xb7;': '·', '&#x2032;': '′', '&#x3be;': 'ξ', '&#x394;': 'Δ', '&#x3b1;': 'α', '&#xbc;': '¼',
              '&#x2212;':'−', '&#x2003;':' ', '&#xf1;': 'ñ', '&#xe9;': 'é',
              '&#xe7;':'ç', '&#x2018;':'‘', '&#x2019;':'’', '&#x2261;':'≡'}
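
For comparison only: the standard library function `html.unescape` decodes numeric character references without a hand-written table. It is not used in this notebook, and it would differ from `clean_dict` for the entities that the dictionary deliberately maps to an empty string (such as `&#x2061;`), so the sketch below is just a way to cross-check individual dictionary values.

```python
import html

# Cross-check a few dictionary values against the standard library decoder
decoded_name = html.unescape("Marie-C&#xe9;line")
decoded_symbol = html.unescape("&#x3c;")
print(decoded_name, decoded_symbol)
```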

#### unicode dictionary

For the JSON file, the special characters must be converted to unicode escape sequences. We use the `clean_dict` (which converts from hex entities) as the base for our new dictionary, which uses the same keys and maps them to the unicode escape versions. To form the escapes, we take the hex digits from the `clean_dict` keys and reshape them to look like `\uXXXX`. To do so, we use the `re.sub` function to replace the `&#x` by `\u` and drop the trailing `;`. Zeros are then inserted after the `\u` in case there are fewer than 4 hexadecimal digits.

In [29]:
unicode_values = []

for char in clean_dict.keys():
    # we replace the whole HTML pattern by \u followed by the capturing group \w+, which is the hex code.
    unicode_aux = re.sub(r'&#x(\w+);',r'\\u\1',char)
    # then, add the zeros where needed
    if len(unicode_aux) == 5:
        unicode_aux = unicode_aux[0:2] + "0" + unicode_aux[2:]
    if len(unicode_aux) == 4:
        unicode_aux = unicode_aux[0:2] + "00" + unicode_aux[2:]
    unicode_values.append(unicode_aux)
# the keys are the same ones as in the original dictionary, and the values are obtained through the transformation above    
unicode_keys = clean_dict.keys()
unicode_dict = dict(zip(unicode_keys,unicode_values))
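
The zero padding can also be expressed with `str.zfill`, which pads the hex digits to four characters in one step. This is an alternative sketch with a hypothetical function name; like the loop above, it assumes code points of at most four hex digits (larger ones would need surrogate pairs in JSON).

```python
import re

def entity_to_json_escape(entity):
    """Turn a hex entity such as '&#xe9;' into the text of a JSON escape."""
    hex_digits = re.fullmatch(r"&#x([0-9a-fA-F]+);", entity).group(1)
    return "\\u" + hex_digits.zfill(4)  # pad to 4 hex digits

for entity in ["&#xe9;", "&#x394;", "&#x2014;"]:
    print(entity, "->", entity_to_json_escape(entity))
```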

### 5.3. Cleaning Functions
We define a function that removes the tags (replacing them with nothing). The pattern of a tag is `<[^<]*?>`, which means anything in between `<` and `>` except another "<": if there is a "<" in between, the first "<" was not the opening of a tag. 

In [30]:
def remove_tags(element):
    return re.sub(r'<[^<]*?>','', element)    

Now we will create a function that replaces the HTML characters according to a given dictionary (which might be the `clean_dict`, to generate the CSV file, or the `unicode_dict`, to generate the JSON).

In [31]:
def replace_char(element,dictionary):
    for k,v in dictionary.items():
        element = element.replace(k, v)
    return element
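
The order in which the two functions are applied matters: tags should be removed while `<` and `>` in the text are still encoded as entities. The sketch below uses an invented sentence and a two-entry dictionary to show what happens when the order is reversed.

```python
import re

def remove_tags(element):
    return re.sub(r'<[^<]*?>', '', element)

def replace_char(element, dictionary):
    for k, v in dictionary.items():
        element = element.replace(k, v)
    return element

small_dict = {'&#x3c;': '<', '&#x3e;': '>'}  # two entries taken from clean_dict
text = '<p>If x &#x3c; y and y &#x3e; z</p>'

tags_first = replace_char(remove_tags(text), small_dict)
entities_first = remove_tags(replace_char(text, small_dict))

print(tags_first)      # the comparison signs survive
print(entities_first)  # "< y and y >" is mistaken for a tag and deleted
```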

Finally, we apply the functions to clean our lists. The map function is used to apply the lambda function to all elements of the lists and return a transformed list. We first remove the tags, then replace the characters, then eliminate the newlines. 

In [32]:
list_patent_title_csv = list(map(lambda x: replace_char(remove_tags(x), clean_dict).replace('\n',''), list_patent_title))
list_inventors_names_csv = list(map(lambda x: replace_char(remove_tags(x), clean_dict).replace('\n',''), list_inventors_names))
list_claims_text_csv = list(map(lambda x: replace_char(remove_tags(x), clean_dict).replace('\n',''), list_claims_text))
list_abstract_csv = list(map(lambda x: replace_char(remove_tags(x), clean_dict).replace('\n',''), list_abstract))

### 5.4. DataFrame
Now, we are going to make a dataframe based on the lists we have generated. We construct it using the [dictionary approach](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame).

In [33]:
patent_df = pd.DataFrame({
    'grant_id' : list_grant_id,
    'patent_title' : list_patent_title_csv,
    'kind' : list_kind,
    'number_of_claims' : list_number_claims,
    'inventors' : list_inventors_names_csv,
    'citations_applicant_count' : list_cited_applicant,
    'citations_examiner_count' : list_cited_examiner,
    'claims_text' : list_claims_text_csv,
    'abstract' : list_abstract_csv
})

Checking the output:

In [34]:
patent_df

Unnamed: 0,grant_id,patent_title,kind,number_of_claims,inventors,citations_applicant_count,citations_examiner_count,claims_text,abstract
0,US10357937,"Fiber laminate, method for manufacturing fiber...",Utility Patent Grant (with a published applica...,5,"[Genki Yoshikawa,Ryuta Kamiya]",4,0,"[1. A fiber laminate comprising, at least in a...",A fiber laminate W is configured by laminating...
1,US10357070,Articles of apparel utilizing targeted venting...,Utility Patent Grant (with a published applica...,19,[Edward Louis Harber],165,15,[1. A method of producing a garment comprising...,Garments may include targeted vent or heat ret...
2,US10357780,Magnetic capture of a target from a fluid,Utility Patent Grant (with a published applica...,12,"[Joo Hun Kang,Donald E. Ingber,Michael Super]",24,3,[1. A method of capturing at least one target ...,Disclosed herein is an improved method for mag...
3,US10358124,Driving force control method during engine clu...,Utility Patent Grant (with a published applica...,10,[Sang Joon Kim],7,2,[1. A driving force control method during engi...,A driving force control method is provided for...
4,US10361362,Magnetic random access memory with ultrathin r...,Utility Patent Grant (with a published applica...,18,"[Huadong Gan,Yiming Huai,Yuchen Zhou,Zihui Wang]",4,6,[1. A magnetic memory element including:a magn...,The present invention is directed to a magneti...
5,US10360345,Systems and methods of notifying a patient to ...,Utility Patent Grant (with a published applica...,25,"[Scott W. Ramsdell,Susan A. Waxenberg]",0,21,[1. A method comprising:at a computerized sche...,Systems and methods of scheduling and sending ...
6,US10360303,Learning document embeddings with convolutiona...,Utility Patent Grant (with a published applica...,18,"[Maksims Volkovs,Tomi Johan Poutanen]",46,12,[1. A method of performing one or more tasks w...,A document analysis system trains a document e...
7,US10359457,"Method of scanning, analyzing and identifying ...",Utility Patent Grant (with a published applica...,9,[Mathieu Audet],16,4,[1. A method of informing persons of potential...,A method of determining the energy level of an...
8,US10357240,Hernia surgery method and system,Utility Patent Grant (no published application...,14,[Robert M. Tomas],4,9,[1. A suturing method for repairing hernia wit...,The present invention discloses a suturing met...
9,US10362132,System and method for diverting established co...,Utility Patent Grant (with a published applica...,12,"[Don Bowman,David Dolson]",19,9,[1. A system for diverting an established comm...,The present invention is related to a system a...


### 5.5. CSV 
Finally, we export our `patent_df` as a CSV file, which matches all the required aspects of the assignment.

In [35]:
patent_df.to_csv('output/patents.csv', index=False)

## 6. JSON
### 6.1. Formating the lists
First, we convert the special characters to unicode escapes and take care of the escape characters in the base lists, to then organize the information in a JSON-like string. This follows [The JSON Data Interchange Syntax](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf) documentation.

#### Dealing with escape characters:
We need to escape with "`\`" all the "`/`", "`\`" and "`"`". We should find them in the lists and replace by themselves preceded by the escape character. 

In [36]:
# Raw strings are used for the replacements: in a normal string literal '\"' is just '"',
# so the double quotes would otherwise be left unescaped.
list_patent_title_json = list(map(lambda x: x.replace("\\",r"\\").replace('"',r'\"').replace("/",r"\/"), list_patent_title))
list_inventors_names_json = list(map(lambda x: x.replace("\\",r"\\").replace('"',r'\"').replace("/",r"\/"), list_inventors_names))
list_claims_text_json = list(map(lambda x: x.replace("\\",r"\\").replace('"',r'\"').replace("/",r"\/"), list_claims_text))
list_abstract_json = list(map(lambda x: x.replace("\\",r"\\").replace('"',r'\"').replace("/",r"\/"), list_abstract))
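
The escaping rule can be wrapped in a small helper and checked with a round trip. In this sketch, `escape_json_text` is a hypothetical name, the sample sentence is invented, and the standard `json` module is used only to verify the result, not to build the output.

```python
import json

def escape_json_text(text):
    """Escape backslashes first, then double quotes and forward slashes."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("/", "\\/")

original = 'He said "a/b" and wrote c\\d'
escaped = escape_json_text(original)

# Round trip: wrapping the escaped text in quotes must decode to the original
decoded = json.loads('"' + escaped + '"')
print(escaped)
print(decoded == original)
```

Backslashes have to be handled first; otherwise the backslashes added for quotes and slashes would themselves be doubled.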

#### Treating the unicode characters:
We have already built the `unicode_dict` to translate the HTML hex entities to the unicode format. 

The map function is used to apply the lambda function to all elements of the lists and return a transformed list. We first remove the tags, then replace the characters, then eliminate the newlines. 

In [37]:
list_patent_title_json = list(map(lambda x: replace_char(remove_tags(x), unicode_dict)
                                  .replace('\n',''), list_patent_title_json))
list_inventors_names_json = list(map(lambda x: replace_char(remove_tags(x), unicode_dict)
                                     .replace('\n',''), list_inventors_names_json))
list_claims_text_json = list(map(lambda x: replace_char(remove_tags(x), unicode_dict)
                                 .replace('\n',''), list_claims_text_json))
list_abstract_json = list(map(lambda x: replace_char(remove_tags(x), unicode_dict)
                              .replace('\n',''), list_abstract_json))

### 6.2. Building the JSON structure

We follow the structure described in the [document](http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf) cited above and the `sample_output.json`. To build the JSON without using any third-party libraries, we build a string `str_json` based on our existing dataframe. We use the new lists instead of the dataframe columns for the fields that were potentially altered by the special character translation (`patent_title`, `inventors`, `claims_text` and `abstract`).

The JSON structure is very similar to a Python dictionary, usually a nested structure of dictionaries. However, simply writing out a Python dictionary does not produce valid JSON, mainly because Python's string representation uses single quotes while JSON requires double quotes. Therefore, we create a string that has the JSON format.

Using a for loop, we iterate over the lists and the dataframe where all the data is stored, adding the patents to the string one by one. Every list and the dataframe used to build the final string have a length of 150, which is the number of patents we are working on. In every list, and in the dataframe, the item at a given position refers to the same patent.

If we make an analogy between the final JSON and a dictionary, the grant_id code is the key to access another dictionary, where the remaining relevant data of the patent is stored. We will see something like:

`{"grant_id code":{"information1 code": "information1 content", "information2 code": "information2 content"}}`

It was constantly necessary to add brackets and commas to the elements to get the final structure. To make the code more readable we broke it into many lines, which also makes it easier to find typos. 


In [38]:
json_output = "output/patents.json"
str_json = ""  # the string that will be written out as the JSON file

for i in range(len(patent_df)):
    # one '"grant_id":{...},' entry per patent; numeric fields are left unquoted
    str_json += '"' + str(patent_df.iloc[i,0]) + '":{'
    str_json += '"patent_title":' + '"' + str(list_patent_title_json[i]) + '",'
    str_json += '"kind":' + '"' + str(patent_df.iloc[i,2]) + '",'
    str_json += '"number_of_claims":' + str(patent_df.iloc[i,3]) + ','
    str_json += '"inventors":' + '"' + str(list_inventors_names_json[i]) + '",'
    str_json += '"citations_applicant_count":' + str(patent_df.iloc[i,5]) + ','
    str_json += '"citations_examiner_count":' + str(patent_df.iloc[i,6]) + ','
    str_json += '"claims_text":' + '"' + str(list_claims_text_json[i]) + '",'
    str_json += '"abstract":' + '"' + str(list_abstract_json[i]) + '"},'

str_json = str_json[:-1]  # remove the trailing comma left after the last patent
str_json = "{" + str_json + "}"  # add the opening and closing braces

# the context manager closes the file even if the write fails
with open(json_output, mode="w", encoding="utf-8") as file:
    file.write(str_json)

### 6.3. Consistency Test
The JSON file and the CSV file must be consistent with each other, so we check that both hold the same data.

First, we read the files with [io pandas](https://pandas.pydata.org/pandas-docs/stable/reference/io.html) functions, checking that they open properly and storing them in the `teste_json` and `teste_csv` dataframes. 

In [40]:
teste_json = pd.read_json("output/patents.json", orient='columns')
# we set the first column as row index to be in the same shape as the JSON, and keep the "NA" as "NA"
teste_csv = pd.read_csv("output/patents.csv", index_col = 0, na_filter = False) 

Now we must reshape the dataframes to be able to compare them. The JSON-based dataframe came with the `grant_id`s as columns and their attributes as rows, because the grant IDs are the top-level keys of the JSON. To match the CSV-based dataframe, we transpose it. The column order is also different, so we reorder the columns of `teste_csv` to put them in the same order as `teste_json`.

In [41]:
teste_json = pd.DataFrame.transpose(teste_json)
teste_csv = teste_csv.reindex(columns=teste_json.columns)

Now the dataframes are ready for the comparison. We run an element-wise equality test and then check how many inconsistencies we have per column.

In [42]:
teste = pd.DataFrame.eq(teste_json,teste_csv) # element-wise equality test between the two dataframes
teste.apply(pd.Series.describe) # summary of how many True/False values each column has (describe applied to every column)

| | abstract | citations_applicant_count | citations_examiner_count | claims_text | inventors | kind | number_of_claims | patent_title |
|---|---|---|---|---|---|---|---|---|
| count | 150 | 150 | 150 | 150 | 150 | 150 | 150 | 150 |
| unique | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| top | True | True | True | True | True | True | True | True |
| freq | 150 | 150 | 150 | 148 | 150 | 150 | 150 | 150 |


We can see that the only column where a non-True value appears is `claims_text` (148 of 150 rows match), which is exactly the column with the most special-character translations. Let's look at the two rows that differ:

In [43]:
teste_csv.loc[teste.loc[:,'claims_text']==False,'claims_text']

grant_id
US10360320    [1. A computer implemented method comprising:r...
US10359750    [1. A computer-implemented frequency control m...
Name: claims_text, dtype: object

Further checking of the 2 patents above revealed that, for `claims_text`, the only 2 slightly different outputs occurred because of the special-character translations, so the two files are consistent.
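
The whole consistency check can also be sketched end to end on toy data. The two frames below are hypothetical stand-ins for `teste_json` and `teste_csv` (with one deliberately different title), and counting mismatches with `(~equal).sum()` is an alternative to the `describe` summary used above, shown here only as an illustration.

```python
import pandas as pd

# Hypothetical JSON-like frame: grant ids as columns, attributes as rows.
toy_json = pd.DataFrame({
    "US0000001": {"patent_title": "Widget", "number_of_claims": 3},
    "US0000002": {"patent_title": "Gadget", "number_of_claims": 12},
})

# Hypothetical CSV-like frame: one row per grant, different column order,
# and one title that differs on purpose.
toy_csv = pd.DataFrame(
    {"number_of_claims": [3, 12], "patent_title": ["Widget", "Gadget & Co"]},
    index=["US0000001", "US0000002"],
)

# Same reshaping as in the notebook: transpose, then align the column order.
toy_json = toy_json.T
toy_csv = toy_csv.reindex(columns=toy_json.columns)

# Element-wise comparison; True where both frames agree.
equal = toy_json.eq(toy_csv)

# Number of mismatching rows per column.
mismatches_per_column = (~equal).sum()

# Grant ids whose title differs between the two frames.
mismatched_ids = equal.loc[~equal["patent_title"]].index.tolist()
```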

## 7. Summary
The present assignment tested our text file processing abilities in the Python programming language. The main outcomes achieved while applying these techniques were:

- **Text Parsing and Data Extraction**. By using the `re` module, we were able to develop simple and complex regular expression patterns that were then applied to extract relevant data from the XML document. We used many regex features, such as backreferences and the DOTALL modifier. Also, functions such as `search`, `sub` and `findall` were of great use to access the data inside the document. 
- **Data Exploration**. Understanding our goals in this assessment required several data exploration tasks. One example was identifying all the special HTML characters that had to be replaced in the document. Another was our test of the consistency between the `.json` and `.csv` files, for which we used `pandas` dataframe methods such as `DataFrame.eq`.
- **Data manipulation**. In this assessment, almost all major data types were used: lists, strings, sets, dictionaries, dataframes, etc. Firstly, all the data from the file was broken into lists. Secondly, sets and dictionaries were used to help clean the data in those lists. Finally, when all the data was ready, it was transformed into a dataframe and exported to CSV.
- **Exporting data to specific format**.  Native file operations like `open` and `write` were required where data had to be processed line by line or where we couldn't use special libraries to write the JSON. Luckily, Python functions like `join()` and in-line iterators made such tasks easier and more readable. By using the `pandas` function `DataFrame.to_csv()` it was possible to export data frames into a `.csv` file without additional formatting and transformations.
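
As a small illustration of the regex features named in the first bullet, the sketch below applies `search` with DOTALL, `findall` with a backreference, and `sub` to a hypothetical fragment. The tag and attribute names in `sample` are assumptions for the example and may not match `patents.txt`; the patterns are not the ones used in the notebook.

```python
import re

# Hypothetical fragment; real tag and attribute names may differ.
sample = """<us-patent-grant file="US0000001-20190723.XML">
<invention-title id="d1">Widget
holder</invention-title>
<b>bold</b> and <i>italic</i>
</us-patent-grant>"""

# DOTALL lets '.' match newlines, so the lazy '.*?' can span a multi-line title.
title = re.search(
    r"<invention-title[^>]*>(.*?)</invention-title>", sample, re.DOTALL
).group(1)

# The backreference \1 requires the closing tag to repeat the opening tag's name.
inline_tags = re.findall(r"<([bi])>.*?</\1>", sample)

# sub collapses the embedded newline into a single space.
clean_title = re.sub(r"\s+", " ", title)
```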

## 8. References

- ECMA International. (December 2017). *The JSON Data Interchange Syntax, 2nd Edition*. Retrieved from http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

- Florio, E. (2016, May 31). *The ABCs of Patent Kind Codes*. Retrieved from https://www.finnegan.com/en/insights/blogs/prosecution-first/the-abcs-of-patent-kind-codes.html

- Geeks for Geeks. (visited in 10/08/2019). *Python map() function*. Retrieved from https://www.geeksforgeeks.org/python-map-function/

- How to Create. (visited in 11/08/2019). *HTML entity reference*. Retrieved from http://www.howtocreate.co.uk/sidehtmlentity.html

- Jarderot M. (2017, May 17). *How can I remove a trailing newline?* [Response to]. Retrieved from https://stackoverflow.com/questions/275018/how-can-i-remove-a-trailing-newline

- Javascript.Info. (2019, August 11). *Greedy and lazy quantifiers*. Retrieved from https://javascript.info/regexp-greedy-and-lazy

- Python Software Foundation. (visited in 18/08/2019). *Built-in Types*. Retrieved from https://docs.python.org/3/library/stdtypes.html

- Python Software Foundation. (visited in 19/08/2019). *Regular expression operations*. Retrieved from https://docs.python.org/2/library/re.html

- The `pandas` Project. (visited in 11/08/2019). *pandas 0.25.1 documentation: DataFrame*. Retrieved from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame

- The `pandas` Project. (visited in 12/08/2019). *pandas 0.25.1 documentation: input/output*. Retrieved from https://pandas.pydata.org/pandas-docs/stable/reference/io.html
