# Miniscule sample of production code (training pipeline)
---
This notebook is mainly used as a sandbox and to visualize results of code fragments before adding them to the production pipeline

<br>
<hr>
<br>

<a id='toc'></a>
### Table of Contents
[1. Create Dataframe from USPTO bulk data](#section-1)<br>
[2. Text Segmentation for Description Section](#section-2)<br>
[3. Append information about Chemical Structures in the patents](#section-3)<br>
[4. Extract chemical names from text](#section-4)<br>
[5. Text Preprocessing & Tokenization](#section-5)<br>
[6. Train word/document embeddings model](#section-6)<br>
[7. Train LSTM using document embeddings](#section-7)<br>
<br>
<hr>

<a id='section-1'></a>
### - Create Dataframe from USPTO bulk data

In [1]:
# Load dependencies
import xml.etree.ElementTree as ET
import xmltodict
import pandas as pd
import glob

In [2]:
# Define the pandas dataframe
cols = ['id', 'invention_title', 'abstract', 'claims', 'description', 'drawings_description', 'drawings_file_paths']
patents_df = pd.DataFrame(columns=cols)

# File counter
m = 0

# Loop through all folders and grab xml files
for folder in glob.glob('../Dataset/*'):
    
    # Select only main xml file (folder[11:35]) and ignore supplementary ones
    # that have different name pattern
    for _file in glob.glob(folder + '/' + folder[11:35] + '.XML'):
        
        # Taking a subgroup of only 20 files for experimentation purposes
        if m <= 20:
            # Parse xml tree
            tree = ET.parse(_file)
            root = tree.getroot()

            # Placeholder for text content
            abstract_text = ''
            claims_text = ''
            description_text = ''
            drawings_description_text = ''
            drawings_file_paths = []

            # Traverse XML tree and extract data we need
            if (root[0].tag == 'us-bibliographic-data-application'):

                # Extract document number as id
                _id = root[0].find('publication-reference').find('document-id').find('doc-number').text
                
                # Extract invention title
                invention_title = root[0].find('invention-title').text
                
                # Extract abstract
                abstract = root.find('abstract')
                
                # Extract claims
                claims = root.find('claims')
                
                # Extract all description
                description = root.find('description')
                
                # Extract drawings description (if present)
                if root.find('drawings') != None:
                    drawings_description = root.find('description').find('description-of-drawings')
                    
                # Extract drawings paths (if present)
                if root.find('drawings') != None:
                    drawings = root.find('drawings')

                # Store all paragraphs in the abstract section
                for child in abstract:
                    if (child.text != None):
                        abstract_text += child.text + '\n'

                # Store all paragraphs in the claims section
                for child in claims:
                    claims_text += ''.join(child.itertext()).replace('\n', ' ')
                    
                # Store all paragraphs in the description section
                for child in description:
                    description_text += ''.join(child.itertext()) + ' '
                    
                # Store all paragraphs in the drawings description section
                if drawings_description:
                    for child in drawings_description:
                        drawings_description_text += ''.join(child.itertext()) + ' '
                        
                # Store all drawings file paths
                if drawings:
                    for child in drawings:
                        drawings_file_paths.append(child[0].get('file'))

                # Write extracted content to dataframe
                patents_df = patents_df.append(pd.Series([_id, invention_title, abstract_text, claims_text, \
                                                          description_text, drawings_description_text, \
                                                         drawings_file_paths], index=cols), ignore_index=True)
                
        # Process only 20 files and break out of the loop
        else:
            break
    
        # File counter increment
        m += 1
    
# Show dataframe    
patents_df

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901..."
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901..."
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901..."
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901..."
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901..."
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902..."
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902..."
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902..."
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901..."
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901..."


<hr>
<br>

<a id='section-2'></a>

### - Text Segmentation for Description Section

In [3]:
# Load dependencies
import pandas as pd
import re

In [4]:
def find_inbetween_text(text, sub1, sub2):
    """
    This function accepts a piece of text and extracts what's
    between any two given strings after finding their positions
    """
    
    # Get positions for the two strings
    pos1 = sub1
    pos2 = sub2
    
    if pos1 > pos2 and pos2 > 0:
        return text[pos2:pos1]
    elif pos2 > pos1 and pos1 > 0:
        return text[pos1:pos2]

In [5]:
# Try an example on first document description
# extract the background section text
boundary1 = re.search('([A-Z])*BACKGROUND(([A-Z])*(\s))*', patents_df['description'][0], re.MULTILINE)
boundary2 = re.search('([A-Z])*SUMMARY(([A-Z])*(\s))*', patents_df['description'][0], re.MULTILINE)

print(find_inbetween_text(patents_df['description'][0], boundary1.end(), boundary2.start()))

Gene therapy can be used to genetically engineer a cell to have one or more inactivated genes and/or to cause that cell to express a product not previously being produced in that cell (e.g., via transgene insertion and/or via correction of an endogenous sequence). Examples of uses of transgene insertion include the insertion of one or more genes encoding one or more novel therapeutic proteins, insertion of a coding sequence encoding a protein that is lacking in the cell or in the individual, insertion of a wild type gene in a cell containing a mutated gene sequence, and/or insertion of a sequence that encodes a structural nucleic acid such as a microRNA or siRNA. Examples of useful applications of ‘correction’ of an endogenous gene sequence include alterations of disease-associated gene mutations, alterations in sequences encoding splice sites, alterations in regulatory sequences and/or targeted alterations of sequences encoding structural characteristics of a protein. Hepatic gene tra

In [6]:
background_boundary_start = 0
background_boundary_end = 0
invention_col = []

for index, row in patents_df.iterrows():
    background_boundary_start = re.search('([A-Z])*BACKGROUND(([A-Z])*(\s))*', \
                                          str(row['description']), re.MULTILINE)
    background_boundary_end = re.search('([A-Z])*SUMMARY(([A-Z])*(\s))*', \
                                        str(row['description']), re.MULTILINE)
    
    if background_boundary_start and background_boundary_end:
        invention_col.append(find_inbetween_text(str(row['description']), background_boundary_start.end(), \
                                             background_boundary_end.start()))
    else:
        invention_col.append(' ')
        
print(invention_col[2])

Formulations of bioactive compounds have been developed for human and animal consumption and therapeutic use since the beginning of recorded history. Since ancient times flowers and herbs were gathered and consumed therapeutically, often brewed into tea and cooked in food or compressed into a poultice and applied to the skin to treat wounds, pain and other maladies. Animals also self-medicate by eating plants. Transmucosal and transdermal methodologies often have an advantage over orally deliverable formulations because the human digestive tract can reduce the efficacy and bioavailability of any particular remedy. In particular, first pass metabolism may modify the therapeutic components of orally deliverable bioactive compounds. Additionally stomach acids can inhibit the effectiveness of oral delivery. Presently, whole-plant botanical mixtures, extracts, decoctions, distillations, essential oils and other various forms have been developed for treating various ailments. Some of these i

In [7]:
inventions_df = pd.DataFrame({'invention_background': invention_col})
inventions_df

Unnamed: 0,invention_background
0,Gene therapy can be used to genetically engine...
1,
2,Formulations of bioactive compounds have been ...
3,Cells communicate with each other either by di...
4,Human adipocyte lipid-binding protein (aP2) be...
5,Throughout this application various publicatio...
6,Antibiotics are the main tools to treat infect...
7,Mesenchymal stem cells (MSCs) hold much promis...
8,Breast cancer is a highly significant cause of...
9,Field of Invention The invention relates to me...


In [8]:
patents_df2 = pd.concat([patents_df, inventions_df], axis=1)
patents_df2

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...


In [9]:
# Repeat process for cross-reference section
ref_boundary_start = 0
ref_boundary_end = 0
ref_col = []

for index, row in patents_df2.iterrows():
    ref_boundary_start = re.search('CROSS-REFERENCE(([A-Z])*(\s))*', \
                                          str(row['description']), re.MULTILINE)
    ref_boundary_end = re.search('in their entirety', \
                                        str(row['description']), re.MULTILINE)
    
    if ref_boundary_start and ref_boundary_end:
        ref_col.append(find_inbetween_text(str(row['description']), ref_boundary_start.end(), \
                                             ref_boundary_end.start()))
    else:
        ref_col.append(' ')
        
print(ref_col[5])

This application claims the benefit of U.S. Provisional Patent Application No. 62/005,013, filed on May 30, 2014, the contents of which are herein incorporated by reference 


In [10]:
ref_df = pd.DataFrame({'cross_reference': ref_col})
ref_df

Unnamed: 0,cross_reference
0,The present application is a divisional of U.S...
1,
2,
3,The present application claims the benefit of ...
4,
5,This application claims the benefit of U.S. Pr...
6,
7,
8,The present application is a divisional of U.S...
9,


In [11]:
patents_df3 = pd.concat([patents_df2, ref_df], axis=1)
patents_df3

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,


In [12]:
# Extract summary section
summary_boundary_start = 0
summary_boundary_end = 0
summary_col = []

for index, row in patents_df3.iterrows():
    summary_boundary_start = re.search('([A-Z])*SUMMARY(([A-Z])*(\s))*', \
                                          str(row['description']), re.MULTILINE)
    summary_boundary_end = re.search('([A-Z])*DESCRIPTION(([A-Z])*(\s))*', \
                                        str(row['description']), re.MULTILINE)
    
    if summary_boundary_start and summary_boundary_end:
        summary_col.append(find_inbetween_text(str(row['description']), summary_boundary_start.end(), \
                                             summary_boundary_end.start()))
    else:
        summary_col.append(' ')
        
print(summary_col[2])

The present invention includes a cannabinoid composition including a cannabinoid and an excipient including palmitoleic acid in a therapeutic ratio. The cannabinoid composition includes at least one cannabinoid. In a preferred embodiment, the excipient includes palmitoleic acid, oleic acid and palmitic acid. The ratio of palmitoleic acid to the at least one cannabinoid is between 1:19 to 1000:1. In another embodiment of the invention, the ratio of palmitic acid to the at least one cannabinoid is between 1:100 to 1000:1. In yet another embodiment of the invention, the ratio of the oleic acid to the at least one cannabinoid is between 1:100 to 1000:1. In another embodiment of the invention, the ratio of the palmitic acid and oleic acid combined, to the at least one cannabinoid, is between 1:50 to 2000:1. Some research indicates that oleic acid included in the excipient for oral delivery can increase the plasma concentration of bioactive compounds in vivo, as compared to an aqueous excipi

In [13]:
summary_df = pd.DataFrame({'summary': summary_col})
summary_df

Unnamed: 0,summary
0,The present invention describes compositions a...
1,
2,The present invention includes a cannabinoid c...
3,The present invention describes compositions a...
4,Anti-aP2 monoclonal antibodies and antigen bin...
5,The present invention discloses assays for ide...
6,Density-dependent drug failure is a form of an...
7,"Despite the promise of these methods, there ex..."
8,The invention provides anti-HER2 antibodies an...
9,"In one embodiment, the present invention provi..."


In [14]:
patents_df4 = pd.concat([patents_df3, summary_df], axis=1)
patents_df4

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference,summary
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...,The present invention describes compositions a...
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,,
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,,The present invention includes a cannabinoid c...
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...,The present invention describes compositions a...
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,,Anti-aP2 monoclonal antibodies and antigen bin...
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...,The present invention discloses assays for ide...
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,,Density-dependent drug failure is a form of an...
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,,"Despite the promise of these methods, there ex..."
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...,The invention provides anti-HER2 antibodies an...
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,,"In one embodiment, the present invention provi..."


In [15]:
# Extract detailed description section
desc_boundary_start = 0
desc_boundary_end = 0
desc_col = []

for index, row in patents_df4.iterrows():
    desc_boundary_start = re.search('([A-Z])*DETAILED DESCRIPTION(([A-Z])*(\s))*', \
                                          str(row['description']), re.MULTILINE)
    desc_boundary_end = len(str(row['description']))
    
    if desc_boundary_start:
        desc_col.append(find_inbetween_text(str(row['description']), desc_boundary_start.end(), \
                                             desc_boundary_end))
    else:
        desc_col.append(' ')

print(desc_col[0])

Disclosed herein are expression cassettes for expression of a transgene, particularly in liver cells. The constructs can be used to deliver any transgene(s) to liver cells, in vivo or in vitro and can be used for the treatment and/or prevention of any disease or disorder which can be ameliorated by the provision of one or more of the transgenes. Unlike currently used hepatic-targeted constructs, the constructs described herein include modified enhancer and/or intronic sequences and, in addition, express the transgene at high levels even without the use of an MVM intron. These constructs are also small, allowing for successful use with transgenes delivered by small vector systems such as AAV. The constructs described herein can be used to express hFVIII BDD in the liver of non-human primates. Depending on the initial dose of the AAV hF8 cDNA expression cassette, circulating plasma levels of hFVIII were upwards of 800% of normal circulating hFVIII. Following these initial high doses, man

In [16]:
detailed_desc_df = pd.DataFrame({'detailed_description': desc_col})
detailed_desc_df

Unnamed: 0,detailed_description
0,Disclosed herein are expression cassettes for ...
1,
2,The present invention includes a composition o...
3,Disclosed herein are compositions and methods ...
4,Anti-aP2 monoclonal antibodies and antigen bin...
5,method is provided for identifying an agent as...
6,Pharmaceutical Compositions The present disclo...
7,"In some embodiments, the present invention pro..."
8,Reference will now be made in detail to certai...
9,The present invention sets forth a novel metho...


In [17]:
patents_df5 = pd.concat([patents_df4, detailed_desc_df], axis=1)
patents_df5

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference,summary,detailed_description
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...,The present invention describes compositions a...,Disclosed herein are expression cassettes for ...
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,,,
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,,The present invention includes a cannabinoid c...,The present invention includes a composition o...
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...,The present invention describes compositions a...,Disclosed herein are compositions and methods ...
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,,Anti-aP2 monoclonal antibodies and antigen bin...,Anti-aP2 monoclonal antibodies and antigen bin...
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...,The present invention discloses assays for ide...,method is provided for identifying an agent as...
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,,Density-dependent drug failure is a form of an...,Pharmaceutical Compositions The present disclo...
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,,"Despite the promise of these methods, there ex...","In some embodiments, the present invention pro..."
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...,The invention provides anti-HER2 antibodies an...,Reference will now be made in detail to certai...
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,,"In one embodiment, the present invention provi...",The present invention sets forth a novel metho...


<hr>
<br>

<a id='section-3'></a>

### - Append information about Chemical Structures in the patents

In [18]:
# Loading dependencies
import subprocess
import glob
import pandas as pd
import os
import numpy as np

In [None]:
# Extract SMILES notation

# Loop through all folders and grab .tif image files
for index, row in patents_df.iterrows():
    
    print('Recognizing structures in file US{}'.format(str(row['id'])))

    # Select only main tiff image file and ignore supplementary ones
    for _file in glob.glob('../Dataset/US{}*/*.TIF'.format(row['id'])):

        # Check if folder with the current document name exists
        if not os.path.exists('../temp/chemical-names-smiles/' + _file[11:35]):

            # If folder does not exist, create it
            os.makedirs('../temp/chemical-names-smiles/' + _file[11:35])

        subprocess.check_call(['osra', '-r', '300', _file, \
                               '-w', '../temp/chemical-names-smiles/{}/{}.txt'.format(str(_file[11:35]), \
                                                                                      str(_file[11:35]))])

In [19]:
patents_df5['chemical_compounds_smiles'] = np.nan
patents_df5

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference,summary,detailed_description,chemical_compounds_smiles
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...,The present invention describes compositions a...,Disclosed herein are expression cassettes for ...,
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,,,,
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,,The present invention includes a cannabinoid c...,The present invention includes a composition o...,
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...,The present invention describes compositions a...,Disclosed herein are compositions and methods ...,
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,,Anti-aP2 monoclonal antibodies and antigen bin...,Anti-aP2 monoclonal antibodies and antigen bin...,
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...,The present invention discloses assays for ide...,method is provided for identifying an agent as...,
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,,Density-dependent drug failure is a form of an...,Pharmaceutical Compositions The present disclo...,
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,,"Despite the promise of these methods, there ex...","In some embodiments, the present invention pro...",
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...,The invention provides anti-HER2 antibodies an...,Reference will now be made in detail to certai...,
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,,"In one embodiment, the present invention provi...",The present invention sets forth a novel metho...,


In [20]:
# Loop through all text files
for folder in glob.glob('../temp/chemical-names-smiles/*'): 
    
    print('Reading SMILES in file {}'.format(str(folder[30:-11])))
    smiles_for_one = ''
    
    # Checking integrity, writing to same file
    if ([folder[32:-11] == item[1]['id'] for item in patents_df.iterrows()]):
        
        for _file in glob.glob('{}/*.txt'.format(folder)):
            
            # Read each file and append each line that represents a compound
            # to an array
            one = ''

            with open(_file, 'r') as smiles_file:
                one = smiles_file.read()

            if one != '':
                smiles_for_one += str(one.split())
        
        # Append to dataframe row-by-row to ensure alignment
        patents_df5.loc[patents_df5['id'] == folder[32:-11], ['chemical_compounds_smiles']] = \
        smiles_for_one

    else:
        print('No structures recognized in this file..')
        patents_df5.loc[patents_df5['id'] == folder[32:-11], ['chemical_compounds_smiles']] = \
        ' '

Reading SMILES in file US20190151472
Reading SMILES in file US20190135863
Reading SMILES in file US20190133992
Reading SMILES in file US20190136261
Reading SMILES in file US20190161536
Reading SMILES in file US20190202865
Reading SMILES in file US20190201532
Reading SMILES in file US20190201444
Reading SMILES in file US20190177429
Reading SMILES in file US20190184190
Reading SMILES in file US20190159474
Reading SMILES in file US20190125691
Reading SMILES in file US20190169260
Reading SMILES in file US20190194282
Reading SMILES in file US20190127440
Reading SMILES in file US20190177383
Reading SMILES in file US20190194689
Reading SMILES in file US20190144517
Reading SMILES in file US20190177709
Reading SMILES in file US20190169583
Reading SMILES in file US20190202879
Reading SMILES in file US20190167778
Reading SMILES in file US20190125672
Reading SMILES in file US20190153497
Reading SMILES in file US20190161548
Reading SMILES in file US20190167567
Reading SMILES in file US20190191675
R

In [21]:
print(patents_df5['chemical_compounds_smiles'][17])

['C1*CCCC1', 'C1CCCCC1', 'C*CCCC1*CC*21CCCCC2']


In [22]:
patents_df5

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference,summary,detailed_description,chemical_compounds_smiles
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...,The present invention describes compositions a...,Disclosed herein are expression cassettes for ...,
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,,,,
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,,The present invention includes a cannabinoid c...,The present invention includes a composition o...,['CCCCCC/C=C\\CCCCCCCC(=*1CCCCC1)C1*C*(C1)*']
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...,The present invention describes compositions a...,Disclosed herein are compositions and methods ...,
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,,Anti-aP2 monoclonal antibodies and antigen bin...,Anti-aP2 monoclonal antibodies and antigen bin...,
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...,The present invention discloses assays for ide...,method is provided for identifying an agent as...,
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,,Density-dependent drug failure is a form of an...,Pharmaceutical Compositions The present disclo...,
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,,"Despite the promise of these methods, there ex...","In some embodiments, the present invention pro...",
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...,The invention provides anti-HER2 antibodies an...,Reference will now be made in detail to certai...,['CC[C@@H]([C@H](N(C(=O)[C@H](C(C)C)NC(=O)[C@@...
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,,"In one embodiment, the present invention provi...",The present invention sets forth a novel metho...,


<hr>
<br>

<a id='section-4'></a>

### - Extract chemical names from text

In [23]:
# Loading dependencies
import subprocess
import glob
import re

In [None]:
# Chemspot accepts only text files as input, so we write the description
# extracted previously to file
for index, row in patents_df.iterrows():
    with open('../temp/patent_text/US{}.txt'.format(row['id']), 'w') as description_file:
        description_file.write(row['description'])

In [None]:
# Extract from full patent description
print('Chemical NER extraction')

for _file in glob.glob('../temp/patent_text/US*.txt'):
    
    # Run ChemSpot, the Chemical named entity recognition library (in shell)
    subprocess.check_call(['java', '-Xmx4G', '-jar', '../chemspot-2.0/chemspot.jar', \
        '-t', _file, '-o', '../temp/chemical-names-inchi/{}.txt'.format(_file[20:-4])]) 

In [24]:
patents_df5['chemical_compounds_inchi'] = np.nan
patents_df5

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference,summary,detailed_description,chemical_compounds_smiles,chemical_compounds_inchi
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...,The present invention describes compositions a...,Disclosed herein are expression cassettes for ...,,
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,,,,,
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,,The present invention includes a cannabinoid c...,The present invention includes a composition o...,['CCCCCC/C=C\\CCCCCCCC(=*1CCCCC1)C1*C*(C1)*'],
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...,The present invention describes compositions a...,Disclosed herein are compositions and methods ...,,
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,,Anti-aP2 monoclonal antibodies and antigen bin...,Anti-aP2 monoclonal antibodies and antigen bin...,,
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...,The present invention discloses assays for ide...,method is provided for identifying an agent as...,,
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,,Density-dependent drug failure is a form of an...,Pharmaceutical Compositions The present disclo...,,
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,,"Despite the promise of these methods, there ex...","In some embodiments, the present invention pro...",,
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...,The invention provides anti-HER2 antibodies an...,Reference will now be made in detail to certai...,['CC[C@@H]([C@H](N(C(=O)[C@H](C(C)C)NC(=O)[C@@...,
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,,"In one embodiment, the present invention provi...",The present invention sets forth a novel metho...,,


In [25]:
# Loop through all files and grab text files
for _file in glob.glob('../temp/chemical-names-inchi/US*.txt'):
    
    print('Reading InCHhI structures for file {}'.format(str(_file[29:-4])))
    
    # Checking integrity, writing to same document
    if ([_file[31:-4] == item for item in patents_df['id']]):
        
        inchi_for_one = ''

        # Read each file 
        with open(_file, 'r') as f:
            for line in f:
                inchi_for_one += line.replace('\t\t\t\t\t\t\t\t\t\t\t', '')

        # Append to dataframe row-by-row to ensure alignment
        patents_df5.loc[patents_df5['id'] == _file[31:-4], ['chemical_compounds_inchi']] = \
        inchi_for_one

    else:
        print('No structures recognized in this file..')
        patents_df5.loc[patents_df5['id'] == folder[31:-4], ['chemical_compounds_inchi']] = \
        ' '

Reading InCHhI structures for file US20190177383
Reading InCHhI structures for file US20190161536
Reading InCHhI structures for file US20190192693
Reading InCHhI structures for file US20190201532
Reading InCHhI structures for file US20190169583
Reading InCHhI structures for file US20190184190
Reading InCHhI structures for file US20190201444
Reading InCHhI structures for file US20190133992
Reading InCHhI structures for file US20190144517
Reading InCHhI structures for file US20190177709
Reading InCHhI structures for file US20190194282
Reading InCHhI structures for file US20190127440
Reading InCHhI structures for file US20190125691
Reading InCHhI structures for file US20190169260
Reading InCHhI structures for file US20190194689
Reading InCHhI structures for file US20190202865
Reading InCHhI structures for file US20190135863
Reading InCHhI structures for file US20190151472
Reading InCHhI structures for file US20190136261
Reading InCHhI structures for file US20190159474
Reading InCHhI struc

In [26]:
patents_df5

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,drawings_file_paths,invention_background,cross_reference,summary,detailed_description,chemical_compounds_smiles,chemical_compounds_inchi
0,20190151472,"LIVER-SPECIFIC CONSTRUCTS, FACTOR VIII EXPRESS...",Described herein are constructs used for liver...,1-10. (canceled) 11. A method of providing a...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190151472A1-20190523-D00001.TIF, US201901...",Gene therapy can be used to genetically engine...,The present application is a divisional of U.S...,The present invention describes compositions a...,Disclosed herein are expression cassettes for ...,,\t19571\t19601\tglucose-6-phosphate transporte...
1,20190135863,RETRO-INVERSO PEPTIDE INHIBITORS OF CELL MIGRA...,The present invention discloses retro-inverso ...,1. A compound of formula (I): X1-(D)-Arg-X2...,The present invention concerns retro-inverso p...,DESCRIPTION OF THE FIGURES FIG. 1 High degree ...,"[US20190135863A1-20190509-D00001.TIF, US201901...",,,,,,\t2711\t2740\tglycosyl-phosphatidyl-inositol\t...
2,20190133992,CANNABINOID COMPOSITION HAVING AN OPTIMIZED FA...,A composition of matter for enhancing delivery...,1. A cannabinoid composition comprising: at l...,FIELD OF THE INVENTION The present invention r...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a ...,"[US20190133992A1-20190509-D00001.TIF, US201901...",Formulations of bioactive compounds have been ...,,The present invention includes a cannabinoid c...,The present invention includes a composition o...,['CCCCCC/C=C\\CCCCCCCC(=*1CCCCC1)C1*C*(C1)*'],"\t12506\t12524\t1,5-diarylpyrazoles\tSYSTEMATI..."
3,20190136261,GENETIC MODIFICATION OF CYTOKINE INDUCIBLE SH2...,The present disclosure is in the field of geno...,1. A genetically modified cell comprising a g...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 shows...,"[US20190136261A1-20190509-D00001.TIF, US201901...",Cells communicate with each other either by di...,The present application claims the benefit of ...,The present invention describes compositions a...,Disclosed herein are compositions and methods ...,,\t15305\t15313\tB2M-HLA-E\tABBREVIATION\thuman...
4,20190161536,ANTI-AP2 ANTIBODIES AND ANTIGEN BINDING AGENTS...,This invention is in the area of improved anti...,1. A method of attenuating an aP2-mediated di...,RELATED APPLICATIONS This application is a con...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1-8 ar...,"[US20190161536A1-20190530-D00001.TIF, US201901...",Human adipocyte lipid-binding protein (aP2) be...,,Anti-aP2 monoclonal antibodies and antigen bin...,Anti-aP2 monoclonal antibodies and antigen bin...,,\t31413\t31415\t10K\tFORMULA\t\t\n\t31423\t314...
5,20190202865,TARGETING DIMERIZATION OF BAX TO MODULATE BAX ...,Methods are provided for identifying an agent ...,1-19. (canceled) 20. A method of inhibiting ...,CROSS-REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-1D. ...,"[US20190202865A1-20190704-D00001.TIF, US201902...",Throughout this application various publicatio...,This application claims the benefit of U.S. Pr...,The present invention discloses assays for ide...,method is provided for identifying an agent as...,,\t15825\t15844\tco-immunoprecipitate\tSYSTEMAT...
6,20190201532,SENSITIZATION OF BACTERIAL CELLS TO QUINOLONE ...,Provided herein are pharmaceutical composition...,1. A pharmaceutical composition comprising a ...,RELATED APPLICATION The present application cl...,BRIEF DESCRIPTION OF THE DRAWINGS FIGS. 1A-1F....,"[US20190201532A1-20190704-D00001.TIF, US201902...",Antibiotics are the main tools to treat infect...,,Density-dependent drug failure is a form of an...,Pharmaceutical Compositions The present disclo...,,\t16983\t16998\tGlucose-fumarate\tSYSTEMATIC\t...
7,20190201444,GLYCOENGINEERING OF E-SELECTIN LIGANDS,The present invention provides methods of enfo...,1. A method of enforcing expression of an E-s...,CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1A-FIG....,"[US20190201444A1-20190704-D00000.TIF, US201902...",Mesenchymal stem cells (MSCs) hold much promis...,,"Despite the promise of these methods, there ex...","In some embodiments, the present invention pro...",,"\t2697\t2714\talpha-(1,3)-fucose\tSYSTEMATIC\t..."
8,20190177429,Anti-HER2 Antibodies and Immunoconjugates,The invention provides anti-HER2 antibodies an...,"1. An isolated antibody that binds to HER2, w...",CROSS-REFERENCE TO RELATED APPLICATIONS The pr...,BRIEF DESCRIPTION OF THE FIGURES FIG. 1 shows ...,"[US20190177429A1-20190613-D00001.TIF, US201901...",Breast cancer is a highly significant cause of...,The present application is a divisional of U.S...,The invention provides anti-HER2 antibodies an...,Reference will now be made in detail to certai...,['CC[C@@H]([C@H](N(C(=O)[C@H](C(C)C)NC(=O)[C@@...,\t4914\t4916\t2C4\tFORMULA\t\t\n\t4926\t4928\t...
9,20190184190,PHOSPHOR-CONTAINING DRUG ACTIVATOR ACTIVATABLE...,A phosphor-containing drug activator activatab...,1. A phosphor-containing drug activator activ...,CROSS REFERENCE TO RELATED APPLICATIONS This a...,BRIEF DESCRIPTION OF THE FIGURES A more comple...,"[US20190184190A1-20190620-D00000.TIF, US201901...",Field of Invention The invention relates to me...,,"In one embodiment, the present invention provi...",The present invention sets forth a novel metho...,,\t10807\t10818\tZn2SiO4:Mn2+\tFORMULA\t\t\n\t1...


<hr>
<br>

<a id='section-5'></a>

### - Text Preprocessing and Tokenization

In [27]:
# Loading dependencies
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.models.fasttext import FastText
import multiprocessing
from tqdm import tqdm
from gensim.utils import simple_preprocess
from sklearn.model_selection import train_test_split

In [28]:
# Tokenize text into sentences using nltk
patents_df_tokenized = pd.concat([patents_df5.iloc[:, 0], patents_df5.iloc[:, 1:6].applymap(lambda x: simple_preprocess(x)), \
patents_df5.iloc[:, 7:11].applymap(lambda x: simple_preprocess(x))], axis=1)
patents_df_tokenized

Unnamed: 0,id,invention_title,abstract,claims,description,drawings_description,invention_background,cross_reference,summary,detailed_description
0,20190151472,"[liver, specific, constructs, factor, viii, ex...","[described, herein, are, constructs, used, for...","[canceled, method, of, providing, protein, to,...","[cross, reference, to, related, applications, ...","[brief, description, of, the, drawings, fig, i...","[gene, therapy, can, be, used, to, genetically...","[the, present, application, is, divisional, of...","[the, present, invention, describes, compositi...","[disclosed, herein, are, expression, cassettes..."
1,20190135863,"[retro, inverso, peptide, inhibitors, of, cell...","[the, present, invention, discloses, retro, in...","[compound, of, formula, arg, arg, nh, wherein,...","[the, present, invention, concerns, retro, inv...","[description, of, the, figures, fig, high, deg...",[],[],[],[]
2,20190133992,"[cannabinoid, composition, having, an, optimiz...","[composition, of, matter, for, enhancing, deli...","[cannabinoid, composition, comprising, at, lea...","[field, of, the, invention, the, present, inve...","[brief, description, of, the, drawings, fig, i...","[formulations, of, bioactive, compounds, have,...",[],"[the, present, invention, includes, cannabinoi...","[the, present, invention, includes, compositio..."
3,20190136261,"[genetic, modification, of, cytokine, inducibl...","[the, present, disclosure, is, in, the, field,...","[genetically, modified, cell, comprising, geno...","[cross, reference, to, related, applications, ...","[brief, description, of, the, drawings, fig, s...","[cells, communicate, with, each, other, either...","[the, present, application, claims, the, benef...","[the, present, invention, describes, compositi...","[disclosed, herein, are, compositions, and, me..."
4,20190161536,"[anti, ap, antibodies, and, antigen, binding, ...","[this, invention, is, in, the, area, of, impro...","[method, of, attenuating, an, ap, mediated, di...","[related, applications, this, application, is,...","[brief, description, of, the, drawings, figs, ...","[human, adipocyte, lipid, binding, protein, ap...",[],"[anti, ap, monoclonal, antibodies, and, antige...","[anti, ap, monoclonal, antibodies, and, antige..."
5,20190202865,"[targeting, dimerization, of, bax, to, modulat...","[methods, are, provided, for, identifying, an,...","[canceled, method, of, inhibiting, bcl, associ...","[cross, reference, to, related, applications, ...","[brief, description, of, the, drawings, fig, b...","[throughout, this, application, various, publi...","[this, application, claims, the, benefit, of, ...","[the, present, invention, discloses, assays, f...","[method, is, provided, for, identifying, an, a..."
6,20190201532,"[sensitization, of, bacterial, cells, to, quin...","[provided, herein, are, pharmaceutical, compos...","[pharmaceutical, composition, comprising, carb...","[related, application, the, present, applicati...","[brief, description, of, the, drawings, figs, ...","[antibiotics, are, the, main, tools, to, treat...",[],"[density, dependent, drug, failure, is, form, ...","[pharmaceutical, compositions, the, present, d..."
7,20190201444,"[of, selectin, ligands]","[the, present, invention, provides, methods, o...","[method, of, enforcing, expression, of, an, se...","[cross, reference, to, related, applications, ...","[brief, description, of, the, drawings, fig, f...","[mesenchymal, stem, cells, mscs, hold, much, p...",[],"[despite, the, promise, of, these, methods, th...","[in, some, embodiments, the, present, inventio..."
8,20190177429,"[anti, her, antibodies, and]","[the, invention, provides, anti, her, antibodi...","[an, isolated, antibody, that, binds, to, her,...","[cross, reference, to, related, applications, ...","[brief, description, of, the, figures, fig, sh...","[breast, cancer, is, highly, significant, caus...","[the, present, application, is, divisional, of...","[the, invention, provides, anti, her, antibodi...","[reference, will, now, be, made, in, detail, t..."
9,20190184190,"[phosphor, containing, drug, activator, activa...","[phosphor, containing, drug, activator, activa...","[phosphor, containing, drug, activator, activa...","[cross, reference, to, related, applications, ...","[brief, description, of, the, figures, more, c...","[field, of, invention, the, invention, relates...",[],"[in, one, embodiment, the, present, invention,...","[the, present, invention, sets, forth, novel, ..."


In [29]:
class TaggedPatentDocument(object):
    """
    This class tags documents to convert them into a suitable
    format for Doc2Vec
    """

    def __init__(self, training_df, col):
        self.training_df = training_df
        self.col = col
        self.section = []

    def __iter__(self):
        for idx, row in self.training_df.iterrows():
            self.section += row[self.col]
            yield TaggedDocument(self.section, [self.training_df['id'][idx]])

<hr>
<br>

<a id='section-6'></a>

### - Train word/document embeddings model

In [30]:
# Apply tokenization by sentences to selected fields in the dataframe
# as some features will be used as is in other comparison methods
# and construct a new dataframe for lstm (without the full description column)
tagged_title = TaggedPatentDocument(patents_df_tokenized, 'invention_title')

In [31]:
for p in tagged_title:
    print(p)

TaggedDocument(['liver', 'specific', 'constructs', 'factor', 'viii', 'expression', 'cassettes', 'and', 'methods', 'of', 'use', 'thereof'], ['20190151472'])
TaggedDocument(['liver', 'specific', 'constructs', 'factor', 'viii', 'expression', 'cassettes', 'and', 'methods', 'of', 'use', 'thereof', 'retro', 'inverso', 'peptide', 'inhibitors', 'of', 'cell', 'migration', 'extracellular', 'matrix', 'and', 'endothelial', 'invasion', 'by', 'tumor', 'cells'], ['20190135863'])
TaggedDocument(['liver', 'specific', 'constructs', 'factor', 'viii', 'expression', 'cassettes', 'and', 'methods', 'of', 'use', 'thereof', 'retro', 'inverso', 'peptide', 'inhibitors', 'of', 'cell', 'migration', 'extracellular', 'matrix', 'and', 'endothelial', 'invasion', 'by', 'tumor', 'cells', 'cannabinoid', 'composition', 'having', 'an', 'optimized', 'fatty', 'acid', 'excipient', 'profile'], ['20190133992'])
TaggedDocument(['liver', 'specific', 'constructs', 'factor', 'viii', 'expression', 'cassettes', 'and', 'methods', 'of'

In [32]:
# Create model definition
epochs = 50
patent_title_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=4, epochs=epochs)

# Build vocabulary from documents
patent_title_model.build_vocab(tagged_title)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_title_model.train(tagged_title,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_title_model.epochs)
    patent_title_model.alpha = patent_title_model.alpha - 0.002
    patent_title_model.min_alpha = patent_title_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [33]:
# Simple test for the model to get the most similar words
# to a given term
print(patent_title_model.wv.most_similar("drug"))

[('aav', 0.15866905450820923), ('by', 0.10744364559650421), ('regulation', 0.10386599600315094), ('and', 0.08959125727415085), ('monte', 0.0835028812289238), ('gamma', 0.08255290985107422), ('exposure', 0.08163495361804962), ('cassettes', 0.08007222414016724), ('fruit', 0.07726302742958069), ('agents', 0.07491530478000641)]


In [34]:
# Save model to use later
patent_title_model.save('../models/patent_title_model.w2v')
print('Model saved..')

Model saved..


In [31]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_title_model = Doc2Vec.load('../models/patent_title_model.w2v')

In [32]:
# Repeat the process for all other columns
tagged_abstract = TaggedPatentDocument(patents_df_tokenized, 'abstract')

In [36]:
# Create model definition
epochs = 50
patent_abstract_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=4, epochs=epochs)

# Build vocabulary from documents
patent_abstract_model.build_vocab(tagged_abstract)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_abstract_model.train(tagged_abstract,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_abstract_model.epochs)
    patent_abstract_model.alpha = patent_abstract_model.alpha - 0.002
    patent_abstract_model.min_alpha = patent_abstract_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [37]:
# Save model to use later
patent_abstract_model.save('../models/patent_abstract_model.w2v')
print('Model saved..')

Model saved..


In [33]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_abstract_model = Doc2Vec.load('../models/patent_abstract_model.w2v')

In [34]:
# Repeat the process for all other columns
tagged_claims = TaggedPatentDocument(patents_df_tokenized, 'claims')

In [None]:
# Create model definition
epochs = 50
patent_claims_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=4, epochs=epochs)

# Build vocabulary from documents
patent_claims_model.build_vocab(tagged_claims)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_claims_model.train(tagged_claims,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_claims_model.epochs)
    patent_claims_model.alpha = patent_claims_model.alpha - 0.002
    patent_claims_model.min_alpha = patent_claims_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15


In [40]:
# Save model to use later
patent_claims_model.save('../models/patent_claims_model.w2v')
print('Model saved..')

Model saved..


In [35]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_claims_model = Doc2Vec.load('../models/patent_claims_model.w2v')

In [36]:
# Repeat the process for all other columns
tagged_description = TaggedPatentDocument(patents_df_tokenized, 'description')

In [37]:
# Create model definition
epochs = 50
patent_description_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=4, epochs=epochs)

# Build vocabulary from documents
patent_description_model.build_vocab(tagged_description)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_description_model.train(tagged_description,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_description_model.epochs)
    patent_description_model.alpha = patent_description_model.alpha - 0.002
    patent_description_model.min_alpha = patent_description_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [38]:
# Save model to use later
patent_description_model.save('../models/patent_description_model.w2v')
print('Model saved..')

Model saved..


In [73]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_description_model = Doc2Vec.load('../models/patent_description_model.w2v')

In [38]:
# Repeat the process for all other columns
tagged_drawings_description = TaggedPatentDocument(patents_df_tokenized, 'drawings_description')

In [40]:
# Create model definition
epochs = 50
patent_drawings_description_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=3, epochs=epochs)

# Build vocabulary from documents
patent_drawings_description_model.build_vocab(tagged_drawings_description)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_drawings_description_model.train(tagged_drawings_description,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_drawings_description_model.epochs)
    patent_drawings_description_model.alpha = patent_drawings_description_model.alpha - 0.002
    patent_drawings_description_model.min_alpha = patent_drawings_description_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [41]:
# Save model to use later
patent_drawings_description_model.save('../models/patent_drawings_description_model.w2v')
print('Model saved..')

Model saved..


In [39]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_drawings_description_model = Doc2Vec.load('../models/patent_drawings_description_model.w2v')

In [40]:
# Repeat the process for all other columns
tagged_invention_background = TaggedPatentDocument(patents_df_tokenized, 'invention_background')

In [43]:
# Create model definition
epochs = 50
patent_invention_background_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=3, epochs=epochs)

# Build vocabulary from documents
patent_invention_background_model.build_vocab(tagged_invention_background)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_invention_background_model.train(tagged_invention_background,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_invention_background_model.epochs)
    patent_invention_background_model.alpha = patent_invention_background_model.alpha - 0.002
    patent_invention_background_model.min_alpha = patent_invention_background_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [44]:
# Save model to use later
patent_invention_background_model.save('../models/patent_invention_background_model.w2v')
print('Model saved..')

Model saved..


In [41]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_invention_background_model = Doc2Vec.load('../models/patent_invention_background_model.w2v')

In [42]:
# Repeat the process for all other columns
tagged_cross_reference = TaggedPatentDocument(patents_df_tokenized, 'cross_reference')

In [46]:
# Create model definition
epochs = 50
patent_cross_reference_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=3, epochs=epochs)

# Build vocabulary from documents
patent_cross_reference_model.build_vocab(tagged_cross_reference)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_cross_reference_model.train(tagged_cross_reference,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_cross_reference_model.epochs)
    patent_cross_reference_model.alpha = patent_cross_reference_model.alpha - 0.002
    patent_cross_reference_model.min_alpha = patent_cross_reference_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [47]:
# Save model to use later
patent_cross_reference_model.save('../models/patent_cross_reference_model.w2v')
print('Model saved..')

Model saved..


In [43]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_cross_reference_model = Doc2Vec.load('../models/patent_cross_reference_model.w2v')

In [44]:
# Repeat the process for all other columns
tagged_summary = TaggedPatentDocument(patents_df_tokenized, 'summary')

In [49]:
# Create model definition
epochs = 50
patent_summary_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=3, epochs=epochs)

# Build vocabulary from documents
patent_summary_model.build_vocab(tagged_summary)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_summary_model.train(tagged_summary,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_summary_model.epochs)
    patent_summary_model.alpha = patent_summary_model.alpha - 0.002
    patent_summary_model.min_alpha = patent_summary_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [50]:
# Save model to use later
patent_summary_model.save('../models/patent_summary_model.w2v')
print('Model saved..')

Model saved..


In [45]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_summary_model = Doc2Vec.load('../models/patent_summary_model.w2v')

In [46]:
# Repeat the process for all other columns
tagged_detailed_description = TaggedPatentDocument(patents_df_tokenized, 'detailed_description')

In [47]:
# Create model definition
epochs = 50
patent_detailed_description_model = Doc2Vec(dm=0, hs=1, vector_size=500, min_count=3, \
                workers=2, epochs=epochs)

# Build vocabulary from documents
patent_detailed_description_model.build_vocab(tagged_detailed_description)

# Start training the model
for epoch in range(epochs):
    print("Training epoch {}".format(epoch + 1))
    patent_detailed_description_model.train(tagged_detailed_description,
                total_examples=len(patents_df_tokenized['id']), epochs=patent_detailed_description_model.epochs)
    patent_detailed_description_model.alpha = patent_detailed_description_model.alpha - 0.002
    patent_detailed_description_model.min_alpha = patent_detailed_description_model.alpha

Training epoch 1
Training epoch 2
Training epoch 3
Training epoch 4
Training epoch 5
Training epoch 6
Training epoch 7
Training epoch 8
Training epoch 9
Training epoch 10
Training epoch 11
Training epoch 12
Training epoch 13
Training epoch 14
Training epoch 15
Training epoch 16
Training epoch 17
Training epoch 18
Training epoch 19
Training epoch 20
Training epoch 21
Training epoch 22
Training epoch 23
Training epoch 24
Training epoch 25
Training epoch 26
Training epoch 27
Training epoch 28
Training epoch 29
Training epoch 30
Training epoch 31
Training epoch 32
Training epoch 33
Training epoch 34
Training epoch 35
Training epoch 36
Training epoch 37
Training epoch 38
Training epoch 39
Training epoch 40
Training epoch 41
Training epoch 42
Training epoch 43
Training epoch 44
Training epoch 45
Training epoch 46
Training epoch 47
Training epoch 48
Training epoch 49
Training epoch 50


In [48]:
# Save model to use later
patent_detailed_description_model.save('../models/patent_detailed_description_model.w2v')
print('Model saved..')

Model saved..


In [31]:
# Load already saved model - in case the notebook kernel dies and restarts
# to save model training time
patent_detailed_description_model = Doc2Vec.load('../models/patent_detailed_description_model.w2v')

In [74]:
# Create dataframe of document vectors so that can be the input for LSTM and CNN
patents_vectors = pd.DataFrame({'invention_title_vectors': list(patent_title_model.docvecs.vectors_docs), \
            'invention_abstract_vectors': list(patent_abstract_model.docvecs.vectors_docs), \
            'invention_claims_vectors': list(patent_claims_model.docvecs.vectors_docs), \
            'invention_description_vectors': list(patent_description_model.docvecs.vectors_docs), \
            'invention_drawings_description_vectors': list(patent_drawings_description_model.docvecs.vectors_docs), \
            'invention_invention_background_vectors': list(patent_invention_background_model.docvecs.vectors_docs), \
            'invention_cross_reference_vectors': list(patent_cross_reference_model.docvecs.vectors_docs), \
            'invention_summary_vectors': list(patent_summary_model.docvecs.vectors_docs), \
            'invention_detailed_description_vectors': list(patent_detailed_description_model.docvecs.vectors_docs)})

In [75]:
patents_vectors

Unnamed: 0,invention_title_vectors,invention_abstract_vectors,invention_claims_vectors,invention_description_vectors,invention_drawings_description_vectors,invention_invention_background_vectors,invention_cross_reference_vectors,invention_summary_vectors,invention_detailed_description_vectors
0,"[0.79648715, 0.49905267, 0.1990846, -0.6958045...","[-0.40922263, 0.7555281, -0.2939332, -0.824812...","[-0.32757396, -0.72097844, -0.6184255, 0.16919...","[-0.0877834, 0.14760958, 0.25188735, 0.2655999...","[0.18880707, 0.038463693, -0.090229504, -0.566...","[-1.0521476, -0.4201015, -0.49964833, 1.33142,...","[-0.8975088, 0.18549693, -0.9888742, 1.5618571...","[-0.25849417, -1.1690854, -0.036479406, -0.210...","[0.08795054, -0.6987753, -0.37846655, -1.37994..."
1,"[-0.7099177, -0.45519656, -0.16896322, 0.63520...","[-0.42359048, 0.77543813, -0.3168563, -0.81115...","[-0.33781555, -0.74758077, -0.65743595, 0.1657...","[-0.10550693, 0.15406513, 0.26139632, 0.290942...","[0.21356149, 0.012393088, -0.111655094, -0.707...","[-1.077294, -0.44164196, -0.5072208, 1.3556875...","[-0.83632326, 0.1730581, -0.923424, 1.4212142,...","[-0.18138191, -1.0368152, -0.064771295, -0.301...","[0.08152865, -0.7884193, -0.36260352, -1.41967..."
2,"[-0.7235869, -0.46847942, -0.1725856, 0.647002...","[-0.41187352, 0.73172605, -0.29745263, -0.7805...","[-0.33308947, -0.7066641, -0.62118995, 0.14689...","[-0.102144755, 0.14728326, 0.27713192, 0.29220...","[0.22136478, 0.01893254, -0.12228994, -0.76449...","[-1.0811398, -0.44532654, -0.5067718, 1.358797...","[-0.8759457, 0.17999512, -0.95992583, 1.477594...","[-0.17508577, -1.0230386, -0.06634464, -0.3075...","[0.07402404, -0.8209438, -0.3304778, -1.377606..."
3,"[-0.69865316, -0.4536933, -0.1615031, 0.633394...","[-0.30996147, 0.6678213, -0.22810836, -0.70015...","[-0.34275448, -0.7370787, -0.63825417, 0.16050...","[-0.09711998, 0.13907178, 0.2622618, 0.2733011...","[0.22860701, 0.038189214, -0.13485856, -0.8222...","[-1.0559562, -0.37349042, -0.49916127, 1.36386...","[-0.80676436, 0.16835235, -0.8846372, 1.334677...","[-0.16409132, -1.004466, -0.08327, -0.32637736...","[0.072413385, -0.82365495, -0.3156028, -1.3700..."
4,"[-0.69984543, -0.4555733, -0.16490342, 0.62932...","[-0.28795904, 0.6245526, -0.21497698, -0.64426...","[-0.33816746, -0.70621485, -0.603847, 0.160634...","[-0.10777167, 0.15237792, 0.33398283, 0.323142...","[0.23186213, 0.038659163, -0.13768524, -0.8303...","[-1.0407743, -0.36915192, -0.49463847, 1.34683...","[-0.8001403, 0.16775666, -0.8744451, 1.3131502...","[-0.14968641, -0.98594, -0.0915328, -0.3462096...","[0.071079336, -0.8088835, -0.30082428, -1.3383..."
5,"[-0.5821559, -0.37541214, -0.13808706, 0.52178...","[-0.30232346, 0.63609827, -0.2249654, -0.65665...","[-0.34404105, -0.6670663, -0.563262, 0.1617206...","[-0.11638489, 0.16753761, 0.29881465, 0.317200...","[0.23166063, 0.038750764, -0.1396112, -0.84613...","[-1.0331254, -0.35680932, -0.48794186, 1.34843...","[-0.740817, 0.1635279, -0.79812187, 1.2022171,...","[-0.09514956, -0.8742499, -0.087382704, -0.389...","[0.07032651, -0.7992063, -0.28869677, -1.32294..."
6,"[-0.59079534, -0.38521087, -0.1373858, 0.53435...","[-0.2382821, 0.5303434, -0.1864873, -0.5217362...","[-0.37029237, -0.60657454, -0.53663075, 0.1365...","[-0.11095362, 0.16591549, 0.29979724, 0.311120...","[0.2330211, 0.04080629, -0.13827029, -0.852073...","[-1.0231079, -0.3377742, -0.48051396, 1.367493...","[-0.733089, 0.15869096, -0.79562837, 1.1841578...","[-0.12985912, -0.934526, -0.0986678, -0.355308...","[0.06815048, -0.7940266, -0.28151748, -1.30173..."
7,"[-0.50633824, -0.32620156, -0.11922513, 0.4552...","[-0.2784767, 0.59579116, -0.20884693, -0.60822...","[-0.3634243, -0.620504, -0.55767995, 0.1276821...","[-0.11473989, 0.16910565, 0.3037188, 0.3184888...","[0.23343514, 0.031101035, -0.14220843, -0.8663...","[-1.0195814, -0.33844554, -0.48286924, 1.35123...","[-0.73059857, 0.15637653, -0.79444903, 1.17351...","[-0.10586243, -0.87620986, -0.100350425, -0.37...","[0.06589129, -0.7960281, -0.27838722, -1.30570..."
8,"[-0.5171465, -0.33310094, -0.12491555, 0.45968...","[-0.26925802, 0.5799887, -0.20516671, -0.58111...","[-0.33830756, -0.6813884, -0.614299, 0.1198018...","[-0.11506818, 0.17000505, 0.29748264, 0.313941...","[0.23067635, 0.04269409, -0.1396422, -0.855726...","[-1.0036359, -0.31219706, -0.4730591, 1.346242...","[-0.71547306, 0.16046862, -0.7717041, 1.147818...","[-0.08917042, -0.85957724, -0.097008586, -0.39...","[0.06310678, -0.78631765, -0.26396194, -1.2766..."
9,"[-0.5227302, -0.33839604, -0.12566571, 0.46570...","[-0.25106287, 0.5484887, -0.19419573, -0.54430...","[-0.35075808, -0.65480304, -0.5921591, 0.11914...","[-0.10932141, 0.16534041, 0.2918858, 0.3062248...","[0.22992158, 0.032765526, -0.13919176, -0.8622...","[-1.005509, -0.3160468, -0.47284564, 1.3554734...","[-0.7016552, 0.1551208, -0.7547043, 1.1171947,...","[-0.07706193, -0.8197851, -0.09828377, -0.3953...","[0.06280867, -0.79217076, -0.27129033, -1.2867..."


In [76]:
# Write file to csv for easy loading if notebook kernel dies
patents_vectors.to_csv('../Dataset/patents_vectors.csv')

<hr>
<br>

<a id='section-7'></a>

### - Train LSTM using document embeddings