<a href="https://colab.research.google.com/github/GabeAspir/Patent-Prior-Art-Finder/blob/main/Prior_Art_Finder_Front_End.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use this notebook to find pre-existing similar patents.
Before you start, you will need to:
1.   Run this notebook locally, or upload the metadata files to Colab.
2.   Sign in to an existing Google Cloud account to run two small queries against Google's patent database.



In [None]:
#@title Click the play button to begin, then continue to option 1 or 2
import pandas as pd
import gdown
import requests
import nltk
nltk.download('punkt')

# From https://changhsinlee.com/colab-import-python/
# If you are using GitHub, make sure you get the "Raw" version of the code
url = 'https://raw.githubusercontent.com/GabeAspir/Patent-Prior-Art-Finder/main/_DevFilesPatentPriorArtFinder.py'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
 
# The filename must match the module name used in the import below
with open('_DevFilesPatentPriorArtFinder.py', 'w') as f:
    f.write(r.text)
 
# now we can import
from _DevFilesPatentPriorArtFinder import _DevFilesPatentPriorArtFinder as paf

print("You may now proceed to option 1 or 2 below")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\mocka\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


You may now proceed to option 1 or 2 below


# Option 1: Compare an existing patent  

In [None]:
#@title Enter a patent number for comparison
patent_Number = 'US-2018356996-A1' #@param {type:"string"}
print(patent_Number)
#pat= {"patent_Number": patent_Number}
# Need to validate the patent number
#@markdown Please enter a Google BigQuery project ID to use this option
project_id = "semiotic-garden-315802" #@param {type:"string"}

pt = pd.io.gbq.read_gbq(f'''
  SELECT
    pub.publication_number as Publication_Number,
    ab.text as Abstract,
    STRING_AGG(citations.publication_number) as Citations
  FROM
    patents-public-data.patents.publications as pub,
    UNNEST (abstract_localized) AS ab,
    UNNEST (citation) as citations
  WHERE
    pub.publication_number IN ("{patent_Number}")
  GROUP BY pub.publication_number, ab.text
  LIMIT
      1
''', project_id=project_id)
pt= pt.iloc[0]




US-2018356996-A1
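The cell above leaves patent-number validation as a to-do. A minimal sketch of one way to do it follows, assuming publication numbers always look like the ones shown in this notebook (`US-2018356996-A1`, `US-5260173-A`): a two-letter country code, a run of digits, and a kind code made of a letter plus an optional digit. The pattern and the helper name `is_valid_publication_number` are illustrative assumptions, not part of the project's code, and real publication numbers may follow other shapes.

```python
import re

# Assumed shape: country code, digits, kind code (letter + optional digit).
# This is inferred from the examples in this notebook, not from a specification.
PUBLICATION_NUMBER = re.compile(r"[A-Z]{2}-\d+-[A-Z]\d?")

def is_valid_publication_number(value):
    """Return True if value matches the assumed publication-number shape."""
    return PUBLICATION_NUMBER.fullmatch(value.strip()) is not None

for candidate in ["US-2018356996-A1", "US-5260173-A", "US5260173", ""]:
    print(repr(candidate), is_valid_publication_number(candidate))
```

Checking the value before it is interpolated into the query also keeps stray quotes or other unexpected characters out of the SQL string.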


# Option 2: Compare a new text

You can enter the text of a potential patent to find similar pre-existing patents. The `new_cit` field is optional; if your potential patent cites other patents, entering them here can increase the accuracy of your search.
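Option 1 builds its `Citations` value with BigQuery's `STRING_AGG`, which joins values with commas by default. Assuming `new_cit` should use the same comma-separated form (the notebook does not confirm this), a hypothetical helper such as `join_citations` could turn a list of publication numbers into that string. The numbers below are placeholders taken from the sample results in this notebook.

```python
def join_citations(citations):
    """Join publication numbers into one comma-separated string.

    Hypothetical helper: assumes new_cit takes the same comma-separated
    format that STRING_AGG produces in Option 1.
    """
    cleaned = []
    for number in citations:
        number = number.strip()
        if number and number not in cleaned:  # skip blanks and duplicates
            cleaned.append(number)
    return ",".join(cleaned)

example_cit = join_citations(["US-5073431-A", " US-5093164-A ", "", "US-5073431-A"])
print(example_cit)
```

The resulting string can be pasted into the `new_cit` field in the form below.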

In [None]:
import pandas as pd
new_text = 'A manufacturing process to produce a controllable integral membrane (28) in sheet-like photosensitive laminates (20), said photosensitive laminate being adapted to adhere to the surface to be etched (36). By the use of the present invention the revealed image (33) washes out without detail roots leaving the substrate (22), transfers easily, even if it is very fine, and can be etched on said surface to be etched very nicely.' #@param {type:"string"}
print(new_text)
new_cit = ''#@param {type:"string"}
print(new_cit)
pt= pd.Series([new_text,new_cit],index=["Abstract","Citations"])
print(pt)

A manufacturing process to produce a controllable integral membrane (28) in sheet-like photosensitive laminates (20), said photosensitive laminate being adapted to adhere to the surface to be etched (36). By the use of the present invention the revealed image (33) washes out without detail roots leaving the substrate (22), transfers easily, even if it is very fine, and can be etched on said surface to be etched very nicely.

Abstract     A manufacturing process to produce a controlla...
Citations                                                     
dtype: object



# Both options:
This will use a prepared set of 4 million patents. If you wish to load and train a new set for your comparison, click [here](https://colab.research.google.com/drive/1DrKOIpQIOMqTJ-jcAd0amhtBOXY9wQv3?usp=sharing).

# For the next step 
You must save the folders contained [here](https://drive.google.com/drive/folders/1hSOHh7_8P1J_SWQgEMgkibtqjAsMr1D_?usp=sharing) in a folder on your computer, then copy the path of that folder into the box below. Alternatively, they can be downloaded directly to Colab with the next cell; this will take some time. 
* If you trained your own model using the above link, use the directory you used there instead

You can also choose settings for the similarity check. `threshold` sets the minimum similarity score a patent must reach to be included in the results. `use_tfidf` lets you search with or without TF-IDF, a method of weighting words based on how frequently they appear across the documents, so that rarer words count for more.
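To make the two settings concrete, here is a minimal sketch of plain TF-IDF weighting with a cosine-similarity cut-off. It is not the code this notebook runs: the function names are hypothetical, it works on raw word counts rather than the word embeddings the project loads, and the three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def above_threshold(query, vectors, threshold):
    """Return (index, score) pairs scoring at least threshold, best first."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(vectors)]
    kept = [(i, score) for i, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

docs = ["the laminate film", "the laminate veneer", "the engine piston"]
vectors = tfidf_vectors(docs)
print(vectors[0])  # "the" appears in every document, so its weight is zero
print(above_threshold(vectors[0], vectors, 0.1))
```

Raising the threshold shrinks the result list, and because a word shared by every document gets zero weight, only the rarer shared words contribute to the score.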

# NOTE: This cell is pending the final gdrive file links

In [None]:
#@title Click to download the metadata files directly
import gdown
import ssl
# Note: this disables HTTPS certificate verification for the downloads below
ssl._create_default_https_context = ssl._create_unverified_context

# Load w2v Files
embed_files= ["1IOv-XTS5Q-FL7Wju09cE8nU65X6MmhKU",
              "1AUwg7UeBlCX0DN9AZM1URXl2cQXUmxeb"]
# TODO: Add dict and actual files
!mkdir -p "ppaf_files/emb"
directory= "ppaf_files"
# Download each Google Drive file ID into ppaf_files/emb
for index, file_id in enumerate(embed_files):
  url = 'https://drive.google.com/uc?id=' + file_id
  output = f'ppaf_files/emb/file{index}.json.gz'
  gdown.download(url, output, quiet=False)

In [None]:
#@title Enter a directory that contains the metadata files
directory = 'C:\\Users\\eph\\PycharmProjects\\Patent-Prior-Art-Finder\\Patent Queries\\sampleZipSet' #@param {type:"string"}
print(directory)
#@markdown Enter settings
threshold = .9 #@param {type:"number"}
#@markdown (Optional)
use_tfidf = True #@param {type:"boolean"}
artFinder = paf(directory)
# Pass the threshold chosen above instead of a hard-coded value
print(artFinder.compareNewPatent(pt, directory, threshold, use_tfidf, use_citations=False))
#@markdown run to get matches

C:\Users\mocka\PycharmProjects\Patent-Prior-Art-Finder\Patent Queries\sampleZipSet
Initialization complete T=1392.4706384
Using citations: False
Using tfidf: True
reading <DirEntry 'bq-results-20210716-122855-k3ohqdlyn8nc (1).json.gz'>
reading <DirEntry 'results-20210716-123228.json.gz'>
61 Matches found
    similarity     Patent Number
0     1.000000      US-5260173-A
1     0.969610      US-5073431-A
2     0.962141  US-2013157046-A1
3     0.950503      US-5093164-A
4     0.945876  US-2010100065-A1
..         ...               ...
56    0.901909      US-6045728-A
57    0.901423  US-2012148815-A1
58    0.901336      US-5285964-A
59    0.900093      US-5904889-A
60    0.900011      US-4781118-A

[61 rows x 2 columns]


Use the `use_tfidf` check box to run the comparison with and without TF-IDF, which gives less frequent words more importance.

In [None]:
#@title Inspect a patent match
patent_Number = 'US-5073431-A' #@param {type:"string"}
print(patent_Number)
#@markdown Please enter a Google BigQuery project ID to use this option
project_id = "semiotic-garden-315802" #@param {type:"string"}

text = pd.io.gbq.read_gbq(f'''
  SELECT
    pub.publication_number as Publication_Number,
    ab.text as abstract_en,
    STRING_AGG(citations.publication_number) as Citations
  FROM
    patents-public-data.patents.publications as pub,
    UNNEST (abstract_localized) AS ab,
    UNNEST (citation) as citations
  WHERE
    pub.publication_number IN ("{patent_Number}")
  GROUP BY pub.publication_number, ab.text
  LIMIT
      1
''', project_id=project_id)
print(text["abstract_en"][0])

US-5073431-A
The method is one of producing multi-ply laminates incorporating an exposed veneer in quality wood or cork, and involved bonding the veneer (1) to a thin flexible thermoplastic film (2) by way of a layer of hot melt adhesive (3). The same laminate can be reinforced to enable its use in manufacturing sewn goods, such as bags and acessories, by ading a tough, close-woven backing fabric (20), bonded to the back of the film (2) in similar fashion via a further layer of adhesive (3), which provides the strength necessary to take a heavy stitch when sheets are sewn together.
