This is a dataset of Armenian patents parsed from the AIPO website, together with the code that generates it. This project is dedicated to the @opendataam team, specifically this task. It is not in active development right now, and there might be some bugs.
A virtual environment setup is planned but not supported yet, so you will need an environment that provides the following dependencies, with Python 3.10 or newer (3.12 is recommended):
import itertools
import json
import datetime
from pathlib import Path
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
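Of the imports above, requests, bs4 and tqdm are third-party (on PyPI the bs4 module ships in the beautifulsoup4 package); the rest come with Python. As a rough sketch, not part of this repo, the following script checks whether an environment meets these requirements. The function names are made up for illustration.

```python
import sys
from importlib.util import find_spec

# Third-party modules taken from the import list above; the rest are stdlib.
THIRD_PARTY = ("requests", "bs4", "tqdm")


def missing_modules(names=THIRD_PARTY):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


def python_is_supported(version=sys.version_info, minimum=(3, 10)):
    """True if `version` meets the minimum Python version stated above."""
    return tuple(version[:2]) >= minimum


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    print("Missing modules:", missing_modules() or "none")
```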
Other than that, you should have these 2 files in the same directory:
- main.py
- aipoparser.py
aipoparser can be used as a module on its own.
You also need the file ICID codes.json inside a data directory, as listed here. It is compiled manually and contains every ICID code from this website, which greatly helps parsing efficiency.
- data/
  - ICID codes.json
Finally, just run main.py; the user interface will guide you from there.
Currently, all code is organized in the aipoparser module. The code does not contain comments yet, so here is a quick overview:
This AIPO webpage allows anyone to search for registered Armenian patents using filters. There seems to be no way of getting every entry at once, so we need another way to extract them. The author of this task recommends using ICID codes (International Classification of Industrial Designs).
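The idea of covering the whole registry by searching one ICID code at a time can be sketched roughly as below. This is an illustration only, not the code in aipoparser: the 'class-subclass' code format, the shape of the sample data, and the names iter_codes, collect_by_code and fake_fetch are all assumptions, and the fetch callable stands in for a real HTTP request.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def iter_codes(classes: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Yield 'class-subclass' strings such as '01-02' (assumed format)."""
    for cls, subclasses in classes.items():
        for sub in subclasses:
            yield f"{cls}-{sub}"


def collect_by_code(
    codes: Iterable[str], fetch: Callable[[str], List[dict]]
) -> Dict[str, List[dict]]:
    """Call `fetch` once per code and group the results by code."""
    return {code: fetch(code) for code in codes}


if __name__ == "__main__":
    # Made-up sample data; the real list lives in data/ICID codes.json
    # and may be structured differently.
    sample = {"01": ["01", "02"], "02": ["01"]}

    def fake_fetch(code: str) -> List[dict]:
        # Stands in for a search request filtered by one ICID code.
        return [{"code": code}]

    print(collect_by_code(iter_codes(sample), fake_fetch))
```

Because the same patent can appear under more than one code, results collected this way may contain duplicates and still miss entries, which is why a later clean-up step is needed.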
- aipo_request(...) makes an HTTP request for patents exactly as the AIPO webpage does.
- get_patent_by_id(...) parses a request for an individual certificate id into a Python dict object.
- get_group_by_icid_code(...) parses a request for a specific ICID code into a nested Python dict object.
- generate_icid_codes() returns a generator for looping over ICID codes.
- get_ICID_json(...) requests patents for every ICID code and stores them in the corresponding ICID.json file.
- fix_patents_list(...) turns an unordered list of patents with duplicates and missing entries into a sorted list of unique entries without gaps. In theory it should download every patent one by one if an empty array is passed, but this has not been tested.
- get_all_patents(...) gets a list of patents from ICID.json, "fixes" it, and stores it in the corresponding patents.json file.
- get_all_info_for_patent(...) requests detailed information about a specific patent from AIPO and returns a parsed dictionary with all the info.
- get_all_info(...) requests detailed information for every patent in patents.json and stores it in the corresponding all_info.json.
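The clean-up described for fix_patents_list(...) (remove duplicates, sort, fill the gaps) could look roughly like this. It is a minimal sketch under assumptions, not the actual implementation: it assumes each patent is a dict with an integer "id" key, and the name dedupe_sort_fill and the fetch callable are hypothetical. Unlike the behaviour described above, it simply returns an empty list when given an empty list.

```python
from typing import Callable, Dict, Iterable, List


def dedupe_sort_fill(
    patents: Iterable[dict], fetch: Callable[[int], dict], key: str = "id"
) -> List[dict]:
    """Return patents sorted by id, without duplicates and without gaps."""
    by_id: Dict[int, dict] = {}
    for patent in patents:
        # Keep the first occurrence of each id and drop later duplicates.
        by_id.setdefault(patent[key], patent)
    if not by_id:
        return []
    result = []
    # Walk every id between the smallest and largest one seen.
    for patent_id in range(min(by_id), max(by_id) + 1):
        if patent_id in by_id:
            result.append(by_id[patent_id])
        else:
            # A gap: ask the caller-supplied function for the missing entry.
            result.append(fetch(patent_id))
    return result


if __name__ == "__main__":
    raw = [{"id": 3}, {"id": 1}, {"id": 3}]
    print(dedupe_sort_fill(raw, lambda i: {"id": i, "fetched": True}))
```

Passing the download function in as a parameter keeps the sorting and gap-filling logic testable without network access.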
If anybody for some reason wants to contribute to this mess, open an issue and fork this repo. This project is under the MIT license, so use it as you wish.