<a href="https://colab.research.google.com/github/GabeAspir/Patent-Prior-Art-Finder/blob/main/Prior_Art_Finder_Front_End.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use this notebook to find pre-existing similar patents.
Before you start, you will need to:
1.   Run this notebook locally or upload the metadata files to Colab.
2.   Sign in to your (existing) Google Cloud account to run two small queries against Google's patent database.



In [2]:
#@title Click the play button to begin, then continue to option 1 or 2
import pandas as pd
import gdown
import requests
import nltk
nltk.download('punkt')

# Pattern from https://changhsinlee.com/colab-import-python/
# When fetching from GitHub, use the "Raw" URL of the file
url = 'https://raw.githubusercontent.com/GabeAspir/Patent-Prior-Art-Finder/main/_DevFilesPatentPriorArtFinder.py'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed, rather than saving an error page

# The local filename must match the module name used in the import below
with open('_DevFilesPatentPriorArtFinder.py', 'w') as f:
    f.write(r.text)

# The module is now on disk and can be imported
from _DevFilesPatentPriorArtFinder import _DevFilesPatentPriorArtFinder as paf

print("You may now proceed to option 1 or 2 below")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
You may now proceed to option 1 or 2 below


# Option 1: Compare an existing patent  

In [36]:
#@title Enter a patent number for comparison
patent_Number = 'US-2018356996-A1' #@param {type:"string"}
print(patent_Number)
# TODO: validate the patent number before querying
project_id = "semiotic-garden-315802"

# Look up the abstract and the comma-separated citations of this publication
query = '''
  SELECT
    pub.publication_number as Publication_Number,
    ab.text as Abstract,
    STRING_AGG(citations.publication_number) as Citations
  FROM
    patents-public-data.patents.publications as pub,
    UNNEST (abstract_localized) AS ab,
    UNNEST (citation) as citations
  WHERE
    pub.publication_number IN ("{}")
  GROUP BY pub.publication_number, ab.text
  LIMIT
      1
'''.format(patent_Number)
pt = pd.io.gbq.read_gbq(query, project_id=project_id)
# Fail with a clear message instead of an IndexError when nothing matches
if pt.empty:
    raise ValueError("No publication found for " + patent_Number)
pt = pt.iloc[0]




US-2018356996-A1
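
The Option 1 cell notes that the patent number still needs to be validated. A minimal sketch of such a check follows, assuming publication numbers always look like the examples in this notebook: a two-letter country code, a number, and a kind code such as A or A1, joined by hyphens. The function name `is_valid_publication_number` and the pattern are illustrative assumptions, not part of the Prior Art Finder code, and real publication numbers may use formats this pattern rejects.

```python
import re

# Assumed shape, based only on the examples in this notebook:
# country code - digits - kind code, e.g. US-2018356996-A1 or US-3916050-A
_PUBLICATION_NUMBER = re.compile(r"[A-Z]{2}-\d+-[A-Z]\d?")


def is_valid_publication_number(number: str) -> bool:
    """Return True if the string matches the assumed publication number shape."""
    return _PUBLICATION_NUMBER.fullmatch(number.strip()) is not None


print(is_valid_publication_number("US-2018356996-A1"))
print(is_valid_publication_number("2018356996"))
```

Running a check like this before the query would also reject input containing quotes or other characters that have no place in a publication number.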


# Option 2: Compare a new text

You can enter the text of a potential patent to find similar pre-existing patents. The new_cit field is optional: if your potential patent cites other patents, enter their publication numbers there, separated by commas. Doing so can increase the accuracy of your search.

In [24]:
import pandas as pd
new_text = 'A manufacturing process to produce a controllable integral membrane (28) in sheet-like photosensitive laminates (20), said photosensitive laminate being adapted to adhere to the surface to be etched (36). By the use of the present invention the revealed image (33) washes out without detail roots leaving the substrate (22), transfers easily, even if it is very fine, and can be etched on said surface to be etched very nicely.' #@param {type:"string"}
print(new_text)
new_cit = 'US-3916050-A,US-4371602-A,US-4430416-A,US-4511640-A,US-4587186-A,US-4716096-A,US-4764449-A,US-4801490-A'#@param {type:"string"}
print(new_cit)
pt= pd.Series([new_text,new_cit],index=["Abstract","Citations"])
print(pt)

A manufacturing process to produce a controllable integral membrane (28) in sheet-like photosensitive laminates (20), said photosensitive laminate being adapted to adhere to the surface to be etched (36). By the use of the present invention the revealed image (33) washes out without detail roots leaving the substrate (22), transfers easily, even if it is very fine, and can be etched on said surface to be etched very nicely.
US-3916050-A,US-4371602-A,US-4430416-A,US-4511640-A,US-4587186-A,US-4716096-A,US-4764449-A,US-4801490-A
Abstract     A manufacturing process to produce a controlla...
Citations    US-3916050-A,US-4371602-A,US-4430416-A,US-4511...
dtype: object
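
Because new_cit is typed by hand, stray spaces, empty entries, or duplicates are easy to introduce. The sketch below tidies such a string before it goes into the Series. `normalize_citations` is a hypothetical helper, and it assumes that citations are expected as one comma-separated string of upper-case publication numbers with no spaces, as in the example above.

```python
def normalize_citations(raw: str) -> str:
    """Clean a hand-typed, comma-separated list of publication numbers."""
    seen = []
    for part in raw.split(","):
        number = part.strip().upper()
        # Skip empty entries and repeats while keeping the original order
        if number and number not in seen:
            seen.append(number)
    return ",".join(seen)


print(normalize_citations(" US-3916050-A, us-4371602-a ,,US-3916050-A"))
# prints: US-3916050-A,US-4371602-A
```

The result could then replace new_cit when building the Series, for example `pd.Series([new_text, normalize_citations(new_cit)], index=["Abstract", "Citations"])`.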



# Both options:
This will use a prepared set of 4 million patents. If you wish to load and train a new set for your comparison, click [here](https://colab.research.google.com/drive/1DrKOIpQIOMqTJ-jcAd0amhtBOXY9wQv3?usp=sharing).

# For the next step 
You must save the folders contained [here](https://drive.google.com/drive/folders/1TSjyNgCdvAIX92WF4BAMNeY5Ri4haANM?usp=sharing) in a folder on your computer, then copy the path of that folder into the box below. Alternatively, they can be downloaded directly to Colab with the next cell. This will take some time.

You can also choose settings for the similarity check. threshold sets the cut-off for how similar a patent must be to be included in the results. use_tfidf lets you search with or without TF-IDF, a method that weights words by how frequently they appear across the documents, so that less frequent words count for more.
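
To make the two settings concrete, here is a toy sketch of TF-IDF weighting followed by a cosine-similarity cut-off. It is not the Prior Art Finder implementation, which works from the prepared metadata files. The three tiny documents, the helper names, the threshold value, and the particular IDF formula log(N / df) are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each word appears in
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    "a photosensitive laminate for etching glass",
    "a photosensitive laminate for etching metal",
    "a method for storing data on a disk",
]
vectors = tfidf_vectors(docs)
threshold = 0.2  # illustrative cut-off, not the notebook's default

# Compare the first document with each of the others
for i in (1, 2):
    score = cosine(vectors[0], vectors[i])
    print(i, round(score, 3), "kept" if score >= threshold else "dropped")
```

In this toy corpus the words "a" and "for" appear in every document, so their weight is zero and they contribute nothing to the score. The second document is kept because it shares the rarer words with the first, while the third is dropped.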

# NOTE: This cell is pending the final gdrive file links

In [None]:
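#@title Download the metadata files directly
import os
import gdown
import ssl
# Workaround: disables HTTPS certificate verification for the standard library's default context
ssl._create_default_https_context = ssl._create_unverified_context

# Google Drive file IDs of the w2v embedding files
embed_files = ["1IOv-XTS5Q-FL7Wju09cE8nU65X6MmhKU",
               "1AUwg7UeBlCX0DN9AZM1URXl2cQXUmxeb"]
# TODO: Add dict and actual files
directory = "ppaf_files"
os.makedirs(os.path.join(directory, "emb"), exist_ok=True)
# Download every listed file into ppaf_files/emb as file0.json.gz, file1.json.gz, ...
for index, file_id in enumerate(embed_files):
  url = 'https://drive.google.com/uc?id=' + file_id
  output = os.path.join(directory, 'emb', 'file' + str(index) + '.json.gz')
  gdown.download(url, output, quiet=False)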

In [1]:
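from google.colab import drive
drive.mount('/content/drive')  # only needed when the files are stored on Google Drive
#@title Enter a directory that contains the metadata files
directory = '' #@param {type:"string"}
print(directory)
threshold = '0.9' #@param {type:"string"}
use_tfidf = True #@param {type:"boolean"}
artFinder = paf(directory)
# Use the threshold from the form, falling back to 0.9 if the field is left empty
# NOTE: use_tfidf is collected above but is not yet passed to compareNewPatent
print(artFinder.compareNewPatent(pt, directory, float(threshold or 0.9)))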

Mounted at /content/drive
/content/drive/MyDrive/ppaf_files


Use the use_tfidf check box to run the comparison with or without TF-IDF, which gives less frequent words more importance.

In [None]:
#@title Inspect a patent match
patent_Number = 'US-2018356996-A1' #@param {type:"string"}
print(patent_Number)
project_id = "semiotic-garden-315802"
# Fetch the abstract of the matched patent so it can be read in full
query = '''
  SELECT
    pub.publication_number as Publication_Number,
    ab.text as abstract_en,
    STRING_AGG(citations.publication_number) as Citations
  FROM
    patents-public-data.patents.publications as pub,
    UNNEST (abstract_localized) AS ab,
    UNNEST (citation) as citations
  WHERE
    pub.publication_number IN ("{}")
  GROUP BY pub.publication_number, ab.text
  LIMIT
      1
'''.format(patent_Number)
text = pd.io.gbq.read_gbq(query, project_id=project_id)
print(text["abstract_en"][0])
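
Both queries in this notebook paste the patent number directly into the SQL text. BigQuery also supports named query parameters, which keep user input out of the SQL string. The sketch below builds such a query together with a matching job configuration. `build_patent_query` is a hypothetical helper, and passing the dictionary through the `configuration` argument of `read_gbq` is an assumption based on the BigQuery job configuration format, so check it against the pandas-gbq documentation for your version before relying on it.

```python
def build_patent_query(patent_number: str):
    """Return (sql, configuration) for a parameterized publication lookup."""
    sql = '''
      SELECT
        pub.publication_number AS Publication_Number,
        ab.text AS Abstract,
        STRING_AGG(citations.publication_number) AS Citations
      FROM
        `patents-public-data.patents.publications` AS pub,
        UNNEST (abstract_localized) AS ab,
        UNNEST (citation) AS citations
      WHERE
        pub.publication_number = @patent_number
      GROUP BY pub.publication_number, ab.text
      LIMIT 1
    '''
    # The value travels separately from the SQL text as a named parameter
    configuration = {
        "query": {
            "parameterMode": "NAMED",
            "queryParameters": [
                {
                    "name": "patent_number",
                    "parameterType": {"type": "STRING"},
                    "parameterValue": {"value": patent_number},
                }
            ],
        }
    }
    return sql, configuration


sql, configuration = build_patent_query("US-2018356996-A1")
print(configuration["query"]["queryParameters"][0]["parameterValue"])

# Assumed usage (needs Google Cloud credentials, so it is left commented out):
# pt = pd.io.gbq.read_gbq(sql, project_id=project_id, configuration=configuration)
```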