# Miniscule sample of production code (training pipeline)
---
This notebook is mainly used as a sandbox and to visualize results of code fragments before adding them to the production pipeline

<br>
<hr>
<br>

<a id='toc'></a>
### Table of Contents
[1. Create Dataframe from USPTO bulk data](#section-1)<br>
[2. Text Preprocessing](#section-2)<br>
<br>
<hr>

<a id='section-1'></a>
### - Create Dataframe from USPTO bulk data

In [1]:
# Load dependencies
import xml.etree.ElementTree as ET
import xmltodict
import pandas as pd
import glob

In [166]:
# Define the pandas dataframe
cols = ['id', 'invention_title', 'abstract', 'claims', 'description', 'drawings_description', 'drawings_file_paths']
patents_df = pd.DataFrame(columns=cols)

# File counter
i = 0

# Loop through all folders and grab xml files
for folder in glob.glob('../Dataset/*'):
    
    # Select only main xml file (folder[11:35]) and ignore supplementary ones
    # that have different name pattern
    for _file in glob.glob(folder + '/' + folder[11:35] + '.XML'):
        
        # Taking a subgroup of only 20 files for experimentation purposes
        if i <= 20:
            # Parse xml tree
            tree = ET.parse(_file)
            root = tree.getroot()

            # Placeholder for text content
            abstract_text = ''
            claims_text = ''
            description_text = ''
            drawings_description_text = ''
            drawings_file_paths = []

            # Traverse XML tree and extract data we need
            if (root[0].tag == 'us-bibliographic-data-application'):

                # Extract document number as id
                _id = root[0].find('publication-reference').find('document-id').find('doc-number').text
                
                # Extract invention title
                invention_title = root[0].find('invention-title').text
                
                # Extract abstract
                abstract = root.find('abstract')
                
                # Extract claims
                claims = root.find('claims')
                
                # Extract all description
                description = root.find('description')
                
                # Extract drawings description (if present)
                if root.find('drawings') != None:
                    drawings_description = root.find('description').find('description-of-drawings')
                    
                # Extract drawings paths (if present)
                if root.find('drawings') != None:
                    drawings = root.find('drawings')

                # Store all paragraphs in the abstract section
                for child in abstract:
                    if (child.text != None):
                        abstract_text += child.text + '\n'

                # Store all paragraphs in the claims section
                for child in claims:
                    claims_text += ''.join(child.itertext()).replace('\n', ' ')
                    
                # Store all paragraphs in the description section
                for child in description:
                    description_text += ''.join(child.itertext()) + ' '
                    
                # Store all paragraphs in the drawings description section
                if drawings_description:
                    for child in drawings_description:
                        drawings_description_text += ''.join(child.itertext()) + ' '
                        
                # Store all drawings file paths
                if drawings:
                    for child in drawings:
                        drawings_file_paths.append(child[0].get('file'))

                # Write extracted content to dataframe
                patents_df = patents_df.append(pd.Series([_id, invention_title, abstract_text, claims_text, \
                                                          description_text, drawings_description_text, \
                                                         drawings_file_paths], index=cols), ignore_index=True)
                
        # Process only 20 files and break out of the loop
        else:
            break
    
        # File counter increment
        i += 1
    
# Show dataframe    
patents_df

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths
0,20190151362,COMPOSITIONS AND METHODS OF CELLULAR IMMUNOTHE...,Disclosed herein are methods of treating a sub...,1. A method of treating a subject exhibiting ...,CROSS-REFERENCE This application is a continua...,BRIEF DESCRIPTION OF THE DRAWINGS The novel fe...,"[US20190151362A1-20190523-D00000.TIF, US201901..."
1,20190134132,NUTRITION BLEND FOR HEALTH BENEFITS IN ANIMALS,A method of minimizing fat accumulation in a g...,1. A method of minimizing fat accumulation in...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS The novel fe...,"[US20190151362A1-20190523-D00000.TIF, US201901..."
2,20190224135,USE OF PROCALCITONIN (PCT) IN RISK STRATIFICAT...,Subject of the present invention are assays an...,1. An in vitro method for prognosis for a pat...,Subject of the present invention is the in vit...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190224135A1-20190725-D00001.TIF, US201902..."
3,20190153419,THROMBIN-THROMBOMODULIN FUSION PROTEINS AS A P...,Compositions and methods for regulating the bl...,1. A thrombin-thrombomodulin fusion protein c...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS The patent o...,"[US20190153419A1-20190523-D00000.TIF, US201901..."
4,20190169293,HANP-FC-CONTAINING MOLECULAR CONJUGATE,The present invention provides a conjugate com...,1. A conjugate comprising a hANP peptide bond...,TECHNICAL FIELD The present invention relates ...,BRIEF DESCRIPTION OF DRAWINGS FIG. 1 schematic...,"[US20190169293A1-20190606-D00000.TIF, US201901..."
5,20190211060,CYCLIC PEPTIDE ANALOGS AND CONJUGATES THEREOF,"Provided are cyclic peptide analogs, conjugate...",1. A compound of Formula (I): or a salt t...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 is a 1...,"[US20190211060A1-20190711-D00001.TIF, US201902..."
6,20190201310,ORAL CARE COMPOSITION,An aqueous composition with a higher-than-neut...,1. An oral care composition useful for treati...,CROSS-REFERENCE TO RELATED APPLICATIONS This i...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 is a 1...,"[US20190211060A1-20190711-D00001.TIF, US201902..."
7,20190194292,Enveloped Virus Resistant to Complement Inacti...,A recombinant fusion protein is disclosed. The...,1. A fusion protein comprising: (a) a CD55 pe...,REFERENCE TO SEQUENCE LISTING The Sequence Lis...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1. Mamma...,"[US20190194292A1-20190627-D00000.TIF, US201901..."
8,20190194323,ANTI-SIGLEC-7 ANTIBODIES FOR THE TREATMENT OF ...,The invention provides methods and composition...,1. A method of inhibiting proliferation of tu...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 provi...,"[US20190194323A1-20190627-D00000.TIF, US201901..."
9,20190153042,ANTIMICROBIAL PEPTIDES DERIVED FROM HEPATITIS ...,A pharmaceutical composition comprising: (a) a...,1. A pharmaceutical composition comprising: (...,CROSS REFERENCE TO RELATED APPLICATIONS This i...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190153042A1-20190523-D00001.TIF, US201901..."


<hr>
<br>

<a id='section-2'></a>

### - Text Preprocessing