# Parsing a Raw Text File

Environment: Python 3.7.1 and Jupyter notebook

Libraries used:
* pandas
* re 

## 1. Introduction
The objective of this project is to parse a raw text file and export its content into two file formats: CSV and JSON.

The project starts by exploring the text file, then designs regular expression patterns to extract the raw text. Finally, the cleaned output is exported in `csv` and `json` formats.

## 1.  Importing libraries 

In [4]:
#import required libraries
import re #regular expression
import pandas as pd

## 2. Examining and loading dataset

Before parsing the dataset, an analysis is carried out to determine the kind of data provided. Each tag is examined and located to match with the desired output variables.

In [8]:
#open the text file 
with open('Patents.txt', 'r') as rawfile:
    text = rawfile.read()

In [9]:
#match the text between the XML declaration and the root closing tag

#create the pattern 
regex = r'<\?xml[\s\S]*?</us-patent-grant>' 

#search the pattern in the text file 
patents = re.findall(regex, text)

#select one text block (index 57, chosen arbitrarily) to see the outcome
block1 = patents[57]

#print(block1)

The examination above shows that the dataset is a concatenation of several XML documents, each beginning with `<?xml...?>` and ending with `</us-patent-grant>`. Splitting the dataset into one block per patent therefore makes it easier to extract the desired variables.
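To illustrate why the pattern uses a non-greedy quantifier, here is a minimal sketch on a made-up string holding two concatenated documents. The sample content and the names `sample`, `docs` and `greedy` are illustrative assumptions and are not taken from `Patents.txt`.

```python
import re

# Hypothetical miniature input: two XML documents concatenated in one string.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US00000001-20190101.XML">\n<a>first</a>\n</us-patent-grant>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-grant file="US00000002-20190101.XML">\n<a>second</a>\n</us-patent-grant>\n'
)

# [\s\S] matches any character, newlines included; *? stops at the first closing tag.
docs = re.findall(r'<\?xml[\s\S]*?</us-patent-grant>', sample)

# The greedy variant runs on to the last closing tag and merges both documents.
greedy = re.findall(r'<\?xml[\s\S]*</us-patent-grant>', sample)

print(len(docs), len(greedy))
```

With the non-greedy quantifier each document becomes its own match, whereas the greedy form returns a single match spanning the whole string.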

In [10]:
#copy every matched patent into the block list (one entry per patent)
block = list(patents)

print('Total number of patents in the text file is:', len(block), 'patents')

Total number of patents in the text file is: 150 patents


## 3. Parsing data and extraction of variables

Now that a sample block has been examined and successfully extracted, the desired variables are located for extraction, as listed below:

1. Grant ID
2. Patent Title
3. Number of claims
4. Names of inventors
5. Number of citations made by the examiner
6. Number of citations made by the applicant
7. Claims text
8. Abstract

### Grant ID 

The grant ID consists of the two letters `US`, sometimes followed by a third letter, and then a series of digits.

In [11]:
#code for grant_id

#define an empty list to hold the values of grant_id for each patent
grant_idlst=[] 

#iterate through each patent to extract the grant_id 
for i in block:
    #--------------grant_id-----------#
    #findall returns every match of the pattern; the first match is the grant_id
    grant_id=re.findall(r'(?<=file=")([A-Z0-9]+)', i)[0]
    #each extracted item (grant_id) is appended to the list grant_idlst
    grant_idlst.append(grant_id)

    
#print a sample of grant_idlst 
print(grant_idlst[0:3])

#total number of grant_id
print(len(grant_idlst))

['US10362519', 'US10359672', 'US10358609']
150
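As a sketch of how the lookbehind isolates the grant ID, the example below applies the same pattern to made-up header lines. The attribute layout, the date suffixes, the design-patent ID and the helper name `extract_grant_id` are assumptions for illustration; only the pattern itself comes from the cell above. Unlike indexing with `[0]`, the helper returns `None` when no match is found instead of raising an error.

```python
import re

# Same pattern as in the notebook, compiled once for reuse.
GRANT_ID = re.compile(r'(?<=file=")([A-Z0-9]+)')

def extract_grant_id(block):
    """Return the grant ID found in one patent block, or None if absent."""
    match = GRANT_ID.search(block)
    return match.group(1) if match else None

# Hypothetical header lines; the match stops at the first character outside [A-Z0-9].
print(extract_grant_id('<us-patent-grant lang="EN" file="US10362519-20190723.XML">'))
print(extract_grant_id('<us-patent-grant lang="EN" file="USD0845575-20190416.XML">'))
print(extract_grant_id('<us-patent-grant lang="EN">'))
```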


### Patent title

The patent title is a single string of text.

In [12]:
#code for patent title
#define an empty list to hold the values of patent_title for each block
patent_titlelst=[] 

#iterate through each block to extract its patent_title
for i in block:
    #----------patent_title------------#
    #the search function uses the regex to extract the given pattern from each block
    patent_title=re.search(r'(?<=">)\w+.*?(?=</invention-title>)',i)[0]
    #remove the '&#x;' and '&#x' markers of hexadecimal character references
    patent_title = re.sub('&#x;', '', patent_title)
    patent_title = re.sub('&#x', '', patent_title)
    #each extracted item (patent_title) is appended to the list patent_titlelst
    patent_titlelst.append(patent_title)

#print a sample of patent_titlelst 
print(patent_titlelst[0:3])

#total number of patent_title
print(len(patent_titlelst))

['Handover apparatus and method', 'Display device and manufacturing method thereof', 'Process for removing metal naphthenate from crude hydrocarbon mixtures']
150


### Number of claims

The number of claims is the count of claims made in each patent. The regex captures it as a string of digits.

In [13]:
#code for number of claims
#define an empty list to hold the values of number_of_claims for each block
claimno_textlst=[]

    
#iterate through each block to extract its number_of_claims
for i in block:
    #-------------number of claims------------#
    #the search function uses the regex to extract the given pattern from each block
    number_of_claims=re.search(r'(?<=number-of-claims>)(\d+)',i)[0]
    #each extracted item (number_of_claims) is appended to the list claimno_textlst
    claimno_textlst.append(number_of_claims)

#print a sample of claimno_textlst 
print(claimno_textlst[0:3])

#total number of claimno_textlst
print(len(claimno_textlst))

['12', '22', '22']
150


### Name of inventors

A patent can list one or more inventors, and each inventor's name is split into a first name and a last name.

In [14]:
#code for names of inventors
#define a list to hold the values of name for each block
inventor_name=[]


for i in block:
    #--------------inventors------------#
    #the findall function extracts the <inventors> section of each patent
    inventor = re.findall(r'<inventors>[\s\S]*?</inventors>', i)
    for section in inventor:
        inventor_last_regex = r'(?<=last-name>)\w+.*?(?=</last-name>)'
        inventor_first_regex = r'(?<=first-name>)\w+.*?(?=</first-name>)'
        inventor_last = re.findall(inventor_last_regex, section)
        inventor_first = re.findall(inventor_first_regex, section)
        #pair each first name with its last name
        name = [first + " " + last for first, last in zip(inventor_first, inventor_last)]
        #replace hexadecimal character references with the characters they encode
        name = re.sub('&#xe9;', 'é', str(name))
        name = re.sub('&#xf8;', 'ø', name)
        name = re.sub('&#xf4;', 'ô', name)
        name = re.sub('&#xf6;', 'ö', name)
        name = re.sub('&#xe4;', 'ä', name)
        #remove the quote characters introduced by str() on the list
        name = re.sub('[\'"]', "", name)
        #each extracted item (name) is appended to the list inventor_name
        inventor_name.append(name)

#print a sample of inventor_name
print(inventor_name[0:3])

#total number of inventor_name
print(len(inventor_name))

['[Mingzeng Dai, Jing Liu, Yi Guo, Qinghai Zeng]', '[Seonggyu Kwon, Sangil Kim, Wontae Kim, Haksun Kim, Namseok Roh, Jaecheol Park, Youyoung Jin]', '[Knut Grande, Hege Kummernes, Kim Reidar Høvik, Jens Emil Vindstad, Heidi Mediaas, Jorunn Steinsland Rosvoll, Ingvild Johanne Haug]']
150
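Hand-written replacement tables for character references are easy to get wrong and only cover the references that were anticipated. As a possible alternative, the sketch below decodes any reference with the standard library's `html.unescape`. The XML snippet (including the `addressbook` nesting and the second inventor) and the names `snippet` and `names` are made up for illustration.

```python
import html
import re

# Hypothetical <inventors> section with two inventors on one line.
snippet = (
    '<inventors>'
    '<inventor><addressbook><last-name>H&#xf8;vik</last-name>'
    '<first-name>Kim Reidar</first-name></addressbook></inventor>'
    '<inventor><addressbook><last-name>Andr&#xe9;</last-name>'
    '<first-name>Zo&#xeb;</first-name></addressbook></inventor>'
    '</inventors>'
)

last = re.findall(r'(?<=last-name>)\w+.*?(?=</last-name>)', snippet)
first = re.findall(r'(?<=first-name>)\w+.*?(?=</first-name>)', snippet)

# html.unescape turns &#xf8; into ø, &#xe9; into é, &#xeb; into ë, and so on.
names = [html.unescape(f + ' ' + l) for f, l in zip(first, last)]
print(names)
```

Because the names stay in a real list, they can be joined with `', '.join(names)` when a single string is needed, which avoids stripping quote characters that may legitimately occur in a name.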


### Number of citations by examiner

Each patent lists the references cited by the examiner, and the number of these citations differs from patent to patent.

In [15]:
#code for number of citations by examiner 
#define a list to hold the values of citations_examiner
#for each block
examiner_count=[]

#iterate through each block to extract its citations_examiner
for i in block:
    #----------CitationsCount-----------#
    #the findall function uses the regex to extract the given pattern from each block
    citations_examiner=re.findall(r'(?<=category>)(\w.*examiner)(?=</category>)',i)
    #append the number of occurrences in each block to the examiner_count list
    examiner_count.append(len(citations_examiner))

    
#The count of occurrences of the pattern within each patent in the text file
print(examiner_count[0:3]) 

#total number of examiner_count
print(len(examiner_count))  

[2, 8, 1]
150


### Number of citations by applicant

Each patent also lists the references cited by the applicant, and the number of these citations differs from patent to patent.

In [16]:
#code for number of citations by applicant
#define a list to hold the values of citations_applicant for each block
applicant_count=[]

#the for loop iterates through each block to extract its citations_applicant
for i in block:
    #----------CitationsCount-----------#
    #the findall function uses the regex to extract the given pattern from each block
    citations_applicant=re.findall(r'(?<=category>)(\w.*applicant)(?=</category>)',i)
    #append the number of occurrences in each block to the applicant_count list
    applicant_count.append(len(citations_applicant))

#The count of occurrences of the pattern within each patent in the text file
print(applicant_count[0:3]) 

#total number of applicant_count
print(len(applicant_count))  

[46, 4, 47]
150
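Both citation counts can also be obtained in a single pass. The sketch below is one possible variant, assuming the category values are exactly `cited by examiner` and `cited by applicant`; the sample snippet, the stricter pattern and the helper name `count_citations` are illustrative and not part of the original notebook.

```python
import re
from collections import Counter

# Stricter variant of the notebook patterns: the citing party is captured as a group.
CATEGORY = re.compile(r'(?<=<category>)cited by (examiner|applicant)(?=</category>)')

def count_citations(block):
    """Count citations per citing party within one patent block."""
    return Counter(CATEGORY.findall(block))

# Hypothetical citation section of a single patent.
sample = (
    '<us-citation><category>cited by examiner</category></us-citation>\n'
    '<us-citation><category>cited by applicant</category></us-citation>\n'
    '<us-citation><category>cited by examiner</category></us-citation>\n'
)

counts = count_citations(sample)
print(counts['examiner'], counts['applicant'])
```

A `Counter` returns 0 for a party that never appears, so patents without applicant citations need no special handling.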


### Claims text


The claims text consists of one or more numbered claims.

In [17]:
#code for claims text - needs refining
#define a list to hold the values of claims_text for each block
claim_textlst=[]

#compile the tag-stripping pattern once, outside the loop
remove_tags = re.compile(r'<[^>]+>')

#the for loop iterates through each block to extract its claims_text
for i in block:
    #------------ClaimText---------------#
    #extract the initial claims component from the dataset
    claims_text = re.findall(r'(?<=claim-text>)(\w.*)', i)
    #str() turns the list into one string; remove the leftover XML tags from it
    claims_text = remove_tags.sub("", str(claims_text))
    #remove the quote characters introduced by str() on the list
    claims_text = claims_text.replace("'", "")
    #str() escapes some apostrophes as \'; turn the remaining backslash back into an apostrophe
    claims_text = claims_text.replace('\\', "'")
    #remove the '&#x' marker of hexadecimal character references
    claims_text = claims_text.replace('&#x', '')
    #remove colons and semicolons
    claims_text = claims_text.replace(':', '').replace(';', '')
    #each extracted item (claims_text) is appended to the list claim_textlst
    claim_textlst.append(claims_text)


#print a sample of claim_textlst
#print(claim_textlst[0:3])

#total number of claim_textlst
print(len(claim_textlst))

150


### Abstract

The abstract is a short text that summarises each patent.

In [18]:
#code for abstract
#define a list to hold the values of abstract for each block
abstractlst=[]

#compile the tag-stripping pattern once, outside the loop
remove_tags = re.compile(r'<[^>]+>')

for i in block:
    #------------abstract---------------#
    #the findall function uses the regex to extract the given pattern from each block
    abstract = re.findall(r'(?<=<p id="p-0001" num="0000">)(\w.*)', i)
    #a patent without an abstract yields an empty list, which is recorded as NA
    abstract = re.sub(r'\[]', 'NA', str(abstract))
    #remove the list brackets, the XML tags and the quote characters
    abstract = re.sub(r'[\[\]]', "", abstract)
    abstract = remove_tags.sub("", abstract)
    abstract = re.sub('[\'"]', "", abstract)
    #remove '&#xb;' before its bare prefix '&#xb' so that no stray ';' is left behind
    abstract = re.sub('&#xb;', '', abstract)
    abstract = re.sub('&#xb', '', abstract)
    #replace the remaining character references with the characters they encode
    abstract = re.sub('&#x3bc;', 'μ', abstract)
    abstract = re.sub('&#x2014;', '—', abstract)
    abstract = re.sub('&#x201c;', '“', abstract)
    abstract = re.sub('&#x201d;', '”', abstract)
    #each extracted item (abstract) is appended to the list abstractlst
    abstractlst.append(abstract)


#print a sample of abstractlst
#print(abstractlst[0:3])

#total number of abstractlst
print(len(abstractlst))

150
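The tag removal and character-reference handling used for the claims and abstract text can be sketched more generally as follows. This is an illustration under assumptions rather than the notebook's method: it relies on the standard library's `html.unescape` instead of per-reference substitutions, and the sample string and the helper name `clean_text` are made up.

```python
import html
import re

TAG = re.compile(r'<[^>]+>')

def clean_text(raw):
    """Strip XML tags first, then decode character references."""
    # Decoding after tag removal keeps an escaped '&lt;' from being mistaken for a tag.
    return html.unescape(TAG.sub('', raw))

# Hypothetical claim line containing inline markup and two character references.
raw = '<claim-text>1. A film of 5 &#x3bc;m <b>thickness</b> &#x2014; coated.</claim-text>'
print(clean_text(raw))
```

This keeps punctuation such as colons and semicolons intact, which may or may not be wanted depending on the required output format.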


Each variable has now been extracted, parsed and appended to its own list. These lists are the foundation for building a pandas dataframe in the next step.

## 4. Consolidating all lists into dataframe

In [19]:
#code for lists to dataframe
#build the dataframe from a dictionary that maps each column name to its list

final_output=pd.DataFrame({'grant_id':grant_idlst,
                           'patent_title':patent_titlelst,
                           'number_of_claims':claimno_textlst,
                           'inventors':inventor_name,
                           'citations_applicant_count':applicant_count,
                           'citations_examiner_count':examiner_count,
                           'claims_text':claim_textlst,
                           'abstract':abstractlst})

#check that the 8 variables are extracted for the 150 patents
print('The structure of the dataframe is:',len(final_output.columns),'columns and',
      len(final_output),'rows')

The structure of the dataframe is: 8 columns and 150 rows
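Building a dataframe from a dictionary of lists only works when every list has the same length; otherwise pandas raises a `ValueError`. A small check before the conversion can name the offending column, as in the sketch below. The helper `check_equal_lengths` and the toy data are hypothetical and use only the standard library.

```python
def check_equal_lengths(columns):
    """Return the common length of all lists, or raise ValueError listing the lengths."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'column lengths differ: {lengths}')
    return next(iter(lengths.values()), 0)

# Toy stand-ins for the extracted lists.
columns = {
    'grant_id': ['US1', 'US2'],
    'patent_title': ['A', 'B'],
    'inventors': ['[X Y]', '[Z W]'],
}
print(check_equal_lengths(columns))
```

Such a check is most useful for the inventors list, which is filled once per `<inventors>` section rather than once per patent.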


## 5. Exporting dataframe to CSV

Export the final dataframe to a CSV file.

In [20]:
#code for writing to csv

#write the dataframe to a csv file (to_csv returns None when a path is given)
final_output.to_csv('final_output.csv',index=False)

## 6. Converting the extracted data to JSON

In [21]:
import json

json_dict={} #dictionary that maps each grant_id to its record

#for loop that iterates once per patent (150 times)
for i in range(len(grant_idlst)):
    #build a fresh dictionary holding the 7 remaining variables of this patent
    var_dict={
        "patent_title":patent_titlelst[i],
        "number_of_claims":claimno_textlst[i],
        "inventors":inventor_name[i],
        "citations_applicant_count":applicant_count[i],
        "citations_examiner_count":examiner_count[i],
        "claims_text":claim_textlst[i],
        "abstract":abstractlst[i],
    }
    json_dict[grant_idlst[i]]=var_dict #assign each grant_id as the key for the corresponding record

#to check the format before saving the file
#print(json_dict)

#json.dump takes care of quoting and escaping, so the file is valid JSON
with open('json_output.json','w',encoding='utf-8') as conversion:
    json.dump(json_dict, conversion, ensure_ascii=False)
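Producing JSON by swapping quote characters in the string form of a dictionary breaks as soon as a value contains a quote or an apostrophe. The sketch below, using a made-up record, compares that approach with the standard `json` module; the names `record`, `naive` and `round_trip` are illustrative.

```python
import json

# Hypothetical record whose values contain both kinds of quote character.
record = {
    'US0000001': {
        'patent_title': 'Method for "smart" mixing',
        'claims_text': "the user's device",
    }
}

# Quote swapping on str(dict) leaves unescaped quotes inside the values.
naive = str(record).replace("'", '"')
try:
    json.loads(naive)
    naive_ok = True
except json.JSONDecodeError:
    naive_ok = False

# json.dumps escapes the inner quotes, so the text can be parsed back.
round_trip = json.loads(json.dumps(record, ensure_ascii=False))

print(naive_ok, round_trip == record)
```

Reading the exported file back with `json.load` is a quick way to confirm that it is well formed.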