This project implements a search engine for USPTO design patents based on various criteria. Users can search for design patents by patent title, patent number, inventor(s) name, assignee (owner) name, application date, issue date, and design class (if available).
- Introduction
- Search Engine Architecture
- Features
- Getting Started
- Usage
- Database
- Search Functionality
- Performance Optimization
- Postman Documentation
- License
The United States Patent and Trademark Office (USPTO) provides a dataset of design patents, including information about various design patents granted by the USPTO. This project aims to create a search engine that enables users to search for design patents based on specific criteria.
- Search design patents by patent title, patent number, inventor(s) name, assignee (owner) name, application date, issue date, and design class.
- Efficiently parse and store USPTO design patent data.
- Optimize search engine performance for large datasets.
type Patent struct {
	PatentNumber    string         `json:"PatentNumber" gorm:"primaryKey"`
	PatentTitle     string         `json:"PatentTitle"`
	Authors         pq.StringArray `json:"Authors" gorm:"type:text[]"`
	Assignee        string         `json:"Assignee"`
	ApplicationDate string         `json:"ApplicationDate"`
	IssueDate       string         `json:"IssueDate"`
	DesignClass     string         `json:"DesignClass"`
	ReferencesCited pq.StringArray `json:"ReferencesCited" gorm:"type:text[]"`
	Description     pq.StringArray `json:"Description" gorm:"type:text[]"`
}
The fields are designed to hold lists of data where needed: Authors, ReferencesCited, and Description are stored as Postgres text arrays. The extraction follows the DTD published on the USPTO page for design patents.
For extraction, I wrote a script that first unzips all the data and extracts the XML files to a folder called all_xml. The second step uses encoding/xml and encoding/json to derive all the extracted fields by specifying model structs.
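The unzip step described above could look roughly like the following. This is a minimal sketch using only the standard library, not the project's actual script: the `data` input folder and the function names `extractXML` and `copyEntry` are assumptions made for illustration; only the `all_xml` output folder comes from the description.

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// extractXML copies every .xml entry of the archive at zipPath into destDir
// and returns the number of files written.
func extractXML(zipPath, destDir string) (int, error) {
	r, err := zip.OpenReader(zipPath)
	if err != nil {
		return 0, err
	}
	defer r.Close()

	if err := os.MkdirAll(destDir, 0o755); err != nil {
		return 0, err
	}

	count := 0
	for _, f := range r.File {
		if f.FileInfo().IsDir() || !strings.EqualFold(filepath.Ext(f.Name), ".xml") {
			continue
		}
		// filepath.Base drops the directory part of the entry name, so an
		// entry such as "../../x.xml" cannot be written outside destDir.
		if err := copyEntry(f, filepath.Join(destDir, filepath.Base(f.Name))); err != nil {
			return count, err
		}
		count++
	}
	return count, nil
}

// copyEntry writes the contents of one archive entry to target.
func copyEntry(f *zip.File, target string) error {
	src, err := f.Open()
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(target)
	if err != nil {
		return err
	}
	if _, err := io.Copy(dst, src); err != nil {
		dst.Close()
		return err
	}
	return dst.Close()
}

func main() {
	// "data" is a hypothetical folder holding the downloaded archives.
	archives, err := filepath.Glob(filepath.Join("data", "*.zip"))
	if err != nil {
		fmt.Println("glob:", err)
		return
	}
	for _, a := range archives {
		n, err := extractXML(a, "all_xml")
		if err != nil {
			fmt.Println("extract:", a, err)
			continue
		}
		fmt.Printf("%s: %d XML files\n", a, n)
	}
}
```

Flattening entries with `filepath.Base` keeps the output in a single folder, at the cost of overwriting files that share a name across archives; a real script may need to disambiguate them.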
type Inventor struct {
	LastName  string `xml:"addressbook>last-name"`
	FirstName string `xml:"addressbook>first-name"`
}

type UsPatentGrant struct {
	PatentTitle     string      `xml:"us-bibliographic-data-grant>invention-title"`
	PatentNumber    string      `xml:"us-bibliographic-data-grant>publication-reference>document-id>doc-number"`
	Authors         []Inventor  `xml:"us-bibliographic-data-grant>us-parties>inventors>inventor"`
	Assignee        string      `xml:"us-bibliographic-data-grant>us-parties>us-applicants>us-applicant>addressbook>orgname"`
	ApplicationDate CustomTime  `xml:"us-bibliographic-data-grant>application-reference>document-id>date"`
	IssueDate       CustomTime  `xml:"us-bibliographic-data-grant>publication-reference>document-id>date"`
	DesignClass     string      `xml:"us-bibliographic-data-grant>classification-national>main-classification"`
	ReferencesCited []Reference `xml:"us-bibliographic-data-grant>us-references-cited>us-citation,omitempty"`
	Description     Description `xml:"description"`
}

type Reference struct {
	Name string `xml:"patcit>document-id>name"`
}

type Description struct {
	DescriptionDrawings []string `xml:"description-of-drawings>p"`
}

type CustomTime struct {
	Time string `xml:",chardata"`
}
The structs above define the XML path mappings (struct tags) used to extract the data from the XML and map it to the corresponding JSON attribute. The extracted records were encoded using the NewEncoder method and appended to a combined JSON file.
Please refer to json_generator.go and xml_file_extractor.go.
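As an illustration of the XML-to-JSON step, here is a small sketch that decodes one grant document with encoding/xml and appends it to a writer with `json.NewEncoder`. It is not the code from json_generator.go: the trimmed `grant` and `record` types, the `convert` and `convertString` helpers, and the sample document are all made up for this example.

```go
package main

import (
	"encoding/json"
	"encoding/xml"
	"fmt"
	"io"
	"os"
	"strings"
)

// grant is a trimmed-down stand-in for UsPatentGrant (hypothetical subset).
type grant struct {
	PatentTitle  string     `xml:"us-bibliographic-data-grant>invention-title"`
	PatentNumber string     `xml:"us-bibliographic-data-grant>publication-reference>document-id>doc-number"`
	Authors      []inventor `xml:"us-bibliographic-data-grant>us-parties>inventors>inventor"`
}

type inventor struct {
	LastName  string `xml:"addressbook>last-name"`
	FirstName string `xml:"addressbook>first-name"`
}

// record is the flattened JSON shape, a subset of the Patent model.
type record struct {
	PatentNumber string   `json:"PatentNumber"`
	PatentTitle  string   `json:"PatentTitle"`
	Authors      []string `json:"Authors"`
}

// convert decodes one grant document from r and appends it to w as one JSON line.
func convert(r io.Reader, w io.Writer) error {
	var g grant
	if err := xml.NewDecoder(r).Decode(&g); err != nil {
		return err
	}
	rec := record{PatentNumber: g.PatentNumber, PatentTitle: g.PatentTitle}
	for _, a := range g.Authors {
		rec.Authors = append(rec.Authors, strings.TrimSpace(a.FirstName+" "+a.LastName))
	}
	// Encode writes the JSON value followed by a newline.
	return json.NewEncoder(w).Encode(rec)
}

// convertString is a convenience wrapper for quick checks.
func convertString(doc string) (string, error) {
	var sb strings.Builder
	err := convert(strings.NewReader(doc), &sb)
	return sb.String(), err
}

// sample is an invented document that only mimics the element layout.
const sample = `<us-patent-grant>
  <us-bibliographic-data-grant>
    <publication-reference><document-id><doc-number>D0000001</doc-number></document-id></publication-reference>
    <invention-title>Chair</invention-title>
    <us-parties><inventors><inventor><addressbook>
      <last-name>Doe</last-name><first-name>Jane</first-name>
    </addressbook></inventor></inventors></us-parties>
  </us-bibliographic-data-grant>
</us-patent-grant>`

func main() {
	if err := convert(strings.NewReader(sample), os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, "convert:", err)
	}
}
```

Writing one JSON object per line keeps appending to the combined file cheap, since no surrounding array has to be rewritten.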
Bulk insertion was done in two places, both reading the combined JSON generated by the file extraction with all the metadata:
- db_bulk_insertion: handles the bulk insert into Postgres. The code is modular and inserts data according to the schema defined in models.
- es_bulk_insertion: handles chunking of the JSON data and efficiently inserting it into Elasticsearch (ES_INDEX = design_patents).
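The chunking idea can be sketched as follows, assuming the documents are sent to the Elasticsearch `_bulk` endpoint, which expects newline-delimited JSON: one action line followed by one document line. The `doc` type and the `chunk` and `bulkBody` helpers are hypothetical and do not come from es_bulk_insertion; using the patent number as the document `_id` is likewise an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

const esIndex = "design_patents"

// doc is a hypothetical subset of the fields indexed in Elasticsearch.
type doc struct {
	PatentNumber string `json:"PatentNumber"`
	PatentTitle  string `json:"PatentTitle"`
}

// chunk splits docs into consecutive slices of at most size elements.
func chunk(docs []doc, size int) [][]doc {
	if size <= 0 {
		return nil
	}
	var out [][]doc
	for start := 0; start < len(docs); start += size {
		end := start + size
		if end > len(docs) {
			end = len(docs)
		}
		out = append(out, docs[start:end])
	}
	return out
}

// bulkBody renders one chunk as newline-delimited JSON: an "index" action
// line followed by the document itself, each terminated by a newline.
func bulkBody(index string, docs []doc) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends the trailing newline
	for _, d := range docs {
		action := map[string]map[string]string{
			"index": {"_index": index, "_id": d.PatentNumber},
		}
		if err := enc.Encode(action); err != nil {
			return "", err
		}
		if err := enc.Encode(d); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	docs := []doc{
		{PatentNumber: "D1", PatentTitle: "Chair"},
		{PatentNumber: "D2", PatentTitle: "Lamp"},
		{PatentNumber: "D3", PatentTitle: "Bottle"},
	}
	for i, c := range chunk(docs, 2) {
		body, err := bulkBody(esIndex, c)
		if err != nil {
			fmt.Println("bulk body:", err)
			return
		}
		fmt.Printf("chunk %d:\n%s", i, body)
	}
}
```

Each rendered body would then be sent as a POST to `/_bulk` with the `Content-Type: application/x-ndjson` header. Keeping chunks to a bounded size avoids building one very large request in memory.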
This repository demonstrates a performance-optimized search functionality using Elasticsearch (ES) for Postgres data. The optimization involves a two-step process:
- Indexing:
  - Elasticsearch is used to index the searchable fields, optimizing search performance.
  - Only specific searchable fields are stored in Elasticsearch, enhancing efficiency.
- Search Process:
  - The search process queries Elasticsearch for relevant results based on the search query.
- Data Retrieval:
  - Once search results are obtained, a second query is made to the original data source (e.g., Postgres) using the retrieved unique identifiers (e.g., Patent Number).
This two-step approach minimizes the load on the original data source, enhancing response speed and efficiency. By implementing pagination within the Elasticsearch query and selectively indexing necessary fields, we achieve an efficient search mechanism. Additionally, leveraging Elasticsearch for primary search operations optimizes the overall system's performance.
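The hand-off between the two steps can be sketched like this: translate a page number into Elasticsearch's `from`/`size`, pull the document IDs out of the search response, and build a parameterised Postgres lookup for them. This is an illustrative sketch only. It assumes the Elasticsearch `_id` is the patent number, and the table name `patents` and column name `patent_number` are guesses; the helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// esResponse models only the part of a search response needed here.
type esResponse struct {
	Hits struct {
		Hits []struct {
			ID string `json:"_id"`
		} `json:"hits"`
	} `json:"hits"`
}

// pageParams converts a 1-based page number into from/size values.
func pageParams(page, pageSize int) (from, size int) {
	if page < 1 {
		page = 1
	}
	return (page - 1) * pageSize, pageSize
}

// hitIDs extracts document IDs from a raw search response, keeping the
// order in which Elasticsearch ranked them.
func hitIDs(raw []byte) ([]string, error) {
	var resp esResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	ids := make([]string, 0, len(resp.Hits.Hits))
	for _, h := range resp.Hits.Hits {
		ids = append(ids, h.ID)
	}
	return ids, nil
}

// lookupSQL builds a query with n positional placeholders ($1..$n).
// It returns "" when n is 0, because "IN ()" is not valid SQL; callers
// should skip the database round trip in that case.
func lookupSQL(n int) string {
	if n <= 0 {
		return ""
	}
	ph := make([]string, n)
	for i := range ph {
		ph[i] = fmt.Sprintf("$%d", i+1)
	}
	// Table and column names are assumptions for this sketch.
	return "SELECT * FROM patents WHERE patent_number IN (" + strings.Join(ph, ", ") + ")"
}

func main() {
	// Invented response body with the shape of a search result.
	raw := []byte(`{"hits":{"hits":[{"_id":"D0000002"},{"_id":"D0000001"}]}}`)

	from, size := pageParams(3, 10)
	fmt.Println("from:", from, "size:", size)

	ids, err := hitIDs(raw)
	if err != nil {
		fmt.Println("parse:", err)
		return
	}
	fmt.Println(ids)
	fmt.Println(lookupSQL(len(ids)))
}
```

Rows returned by an `IN` query come back in no guaranteed order, so the caller would typically re-sort them to match the ID order from Elasticsearch before responding.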
The search engine uses fuzzy matching in Elasticsearch, with the index built from the data held in the Postgres database. It allows users to search for design patents based on various criteria, including patent title, patent number, inventor(s) name, assignee (owner) name, application date, issue date, and design class (if available).
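One way to express a fuzzy search across several fields is an Elasticsearch `multi_match` query with `fuzziness` set to `AUTO`. The sketch below only builds the request body. Whether the project uses this exact query type is not stated, so treat it as an assumption; the `fuzzyQuery` helper and the indexed field names are hypothetical as well.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fuzzyQuery builds a search request body that matches term against the
// given fields while tolerating small typos. Setting "_source" to false
// asks Elasticsearch to return only hit metadata such as the document ID.
func fuzzyQuery(term string, fields []string, from, size int) ([]byte, error) {
	body := map[string]any{
		"from":    from,
		"size":    size,
		"_source": false,
		"query": map[string]any{
			"multi_match": map[string]any{
				"query":     term,
				"fields":    fields,
				"fuzziness": "AUTO",
			},
		},
	}
	return json.Marshal(body)
}

func main() {
	// Field names are assumed to mirror the JSON keys of the Patent model.
	b, err := fuzzyQuery("chiar", []string{"PatentTitle", "Assignee"}, 0, 10)
	if err != nil {
		fmt.Println("marshal:", err)
		return
	}
	fmt.Println(string(b))
}
```

With `AUTO`, the allowed edit distance grows with the length of the term, so a misspelling such as "chiar" can still match "chair". Fuzziness is meant for text fields; dates and exact identifiers are usually better served by range or term queries.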
To run this project, you need the following prerequisites:
- GoLang (v1.20)
- PostgreSQL (v12+)
- Elasticsearch (v7.17)
- Clone the repository:
git clone https://github.com/yourusername/patent_designs.git
cd patent_designs
- Install the dependencies:
go mod download
- Run the application:
go run main.go
- Postman Documentation: Please refer to the API documentation over here.