# WebScraping Patent Abstracts

This notebook is an example of applying webscraping and API tools to collect patents data from google patent search page.

First, start from Google Patent Search Page:
+ Set search parameter
+ Search
+ download csv file from top-right corner
+ The csv file should include the following information: patent id, assignee, inventors, date, and patent web page links

Using "renewable energy" as search keyword to define the topic of patents. Set other searching parameters as follow:

|search parameters|setting|description|
|----|----|----|
|sort by|Relevance||
|group by|Classification|set this to True will will group searching results with CPC code.|
|deduplicate|family|deduplicate same family patents, uses family as research unit. a patent family usually defines an invention, which could be break into many patents.|
|date before|'filing:20240630'|we will collect all patents that have filing date before 2024.06.30|
|date after|no selection||
|inventor|no selection||
|assignee|no selection||
|patent office|'US'|focus on patents in 'US'|
|language|'EN'|focus on patents in English language|
|status|'Grant'|another choice is 'application'|
|type|'patent'|another choice is 'design'|
|litigation|no selection||

## Load and Clean Patent List .csv File

In [1]:
import pandas as pd
df = pd.read_csv(r"C:\Users\user\Documents\RenewableEnergy_patents.csv", skiprows = 1)

In [2]:
df.head()

Unnamed: 0,id,title,assignee,inventor/author,priority date,filing/creation date,publication date,grant date,result link,representative figure link
0,US-10340693-B2,Systems and methods for generating energy usin...,"Lawrence D. Lansing, JR., Lawrence D. Lansing","Lawrence D. Lansing, JR., Lawrence D. Lansing",2013-03-13,2018-02-27,2019-07-02,2019-07-02,https://patents.google.com/patent/US10340693B2/en,https://patentimages.storage.googleapis.com/4c...
1,US-10526056-B1,Generation of electric power using wave motion...,"Physician Electronic Network, LLC","A-Hamid Hakki, Edin Dervishalidovic, Belmina H...",2019-04-29,2019-04-29,2020-01-07,2020-01-07,https://patents.google.com/patent/US10526056B1/en,https://patentimages.storage.googleapis.com/6f...
2,US-11228182-B2,Converter disabling photovoltaic electrical en...,"Ampt, Llc","Robert Porter, Anatoli Ledenev",2007-10-15,2019-06-12,2022-01-18,2022-01-18,https://patents.google.com/patent/US11228182B2/en,https://patentimages.storage.googleapis.com/f5...
3,US-11060742-B2,PVT heat pump system capable of achieving day-...,Dalian University Of Technology,"Jili ZHANG, Ruobing LIANG, Chao Zhou, Shixiang...",2017-08-03,2017-08-03,2021-07-13,2021-07-13,https://patents.google.com/patent/US11060742B2/en,https://patentimages.storage.googleapis.com/f8...
4,US-10808685-B2,Dispatchable combined cycle power plant,William M. Conlon,William M. Conlon,2014-06-04,2018-10-19,2020-10-20,2020-10-20,https://patents.google.com/patent/US10808685B2/en,https://patentimages.storage.googleapis.com/d5...


In [5]:
# change column name
df.columns = ['id','title','assignee','inventor','priority_date','filing_date','publication_date','grant_date','result_link','representative_figure_link']

## cleaning id column
# split id column to generate 'country' and 'kind_code' columns
df[['id2']] = df[['id']]# copy id column
df[['country','id_code','kind_code']] = df['id2'].str.split('-',expand=True)# split id column
df = df.drop(['id_code','id2'], axis = 1)# drop id2 and id_code

# remove '-' signs from id
df['id'] = df['id'].str.replace('-', '')

# remove '-' signs from date columns
for i in ['priority_date','filing_date','publication_date','grant_date']:
    df[i] = df[i].str.replace('-', '')

# change inventor column from string to name list
inventor_list = df['inventor'].str.split(',').to_frame()
df[['inventor']] = inventor_list

In [7]:
# save cleaned patents dateframe
df.to_csv(r"C:\Users\user\Documents\RenewableEnergy_patents_clean.csv")

## WebScraping: Patent Abstracts

In [1]:
# reload cleaned patent list
import pandas as pd
df = pd.read_csv(r"C:\Users\user\Documents\RenewableEnergy_patents_clean.csv", index_col = 0)

In [13]:
df.head()

Unnamed: 0,id,title,assignee,inventor,priority_date,filing_date,publication_date,grant_date,result_link,representative_figure_link,country,kind_code
0,US10340693B2,Systems and methods for generating energy usin...,"Lawrence D. Lansing, JR., Lawrence D. Lansing","['Lawrence D. Lansing', ' JR.', ' Lawrence D. ...",20130313.0,20180227,20190702,20190702,https://patents.google.com/patent/US10340693B2/en,https://patentimages.storage.googleapis.com/4c...,US,B2
1,US10526056B1,Generation of electric power using wave motion...,"Physician Electronic Network, LLC","['A-Hamid Hakki', ' Edin Dervishalidovic', ' B...",20190429.0,20190429,20200107,20200107,https://patents.google.com/patent/US10526056B1/en,https://patentimages.storage.googleapis.com/6f...,US,B1
2,US11228182B2,Converter disabling photovoltaic electrical en...,"Ampt, Llc","['Robert Porter', ' Anatoli Ledenev']",20071015.0,20190612,20220118,20220118,https://patents.google.com/patent/US11228182B2/en,https://patentimages.storage.googleapis.com/f5...,US,B2
3,US11060742B2,PVT heat pump system capable of achieving day-...,Dalian University Of Technology,"['Jili ZHANG', ' Ruobing LIANG', ' Chao Zhou',...",20170803.0,20170803,20210713,20210713,https://patents.google.com/patent/US11060742B2/en,https://patentimages.storage.googleapis.com/f8...,US,B2
4,US10808685B2,Dispatchable combined cycle power plant,William M. Conlon,['William M. Conlon'],20140604.0,20181019,20201020,20201020,https://patents.google.com/patent/US10808685B2/en,https://patentimages.storage.googleapis.com/d5...,US,B2


In [2]:
import requests
from bs4 import BeautifulSoup

In [15]:
# Take the first patent as an example:
url = df.loc[0,"result_link"]
response = requests.get(url) 
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser') 

In [117]:
patent_abstract = soup.find('h2', string = 'Abstract').find_next_sibling().find('abstract').find('div').contents[0]

In [168]:
patent_abstract

'Systems and methods for continuously generating electric power using a renewable energy power source to continuously generate electrical energy are disclosed. An illustrative embodiment includes transmitting electrical power from the renewable energy power source to an electrolyzer to produce hydrogen gas, storing the hydrogen gas in a storage facility until production of power from the renewable energy power source drops below a predetermined threshold, and activating a secondary power generation system that converts the stored hydrogen to electrical energy. The stored hydrogen may be converted to electrical energy using a gas turbine generator or a fuel cell. The system further includes a reverse osmosis subsystem for purifying water for use in the electrolyzer and optional systems for providing the purified water to a community and for using the produced electricity to treat waste water to generate treated water that may be purified and supplied to the electrolyzer.'

In [6]:
# view first 5 patents abstract scraping results
for i in range(5):
    url = df.loc[i,"result_link"]
    response = requests.get(url) 
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    print(f'{i+1} patent abstract:')
    print(soup.find('h2', string = 'Abstract').find_next_sibling().find('abstract').find('div').contents[0])
    print('')

1 patent abstract:
Systems and methods for continuously generating electric power using a renewable energy power source to continuously generate electrical energy are disclosed. An illustrative embodiment includes transmitting electrical power from the renewable energy power source to an electrolyzer to produce hydrogen gas, storing the hydrogen gas in a storage facility until production of power from the renewable energy power source drops below a predetermined threshold, and activating a secondary power generation system that converts the stored hydrogen to electrical energy. The stored hydrogen may be converted to electrical energy using a gas turbine generator or a fuel cell. The system further includes a reverse osmosis subsystem for purifying water for use in the electrolyzer and optional systems for providing the purified water to a community and for using the produced electricity to treat waste water to generate treated water that may be purified and supplied to the electrolyze

### Problem occured in the third example, let's see what is going wrong

In [125]:
# inspect third patent
url = df.loc[2,"result_link"]
response = requests.get(url) 
html_content = response.text
soup = BeautifulSoup(html_content, 'html.parser') 

In [132]:
abstract_contents_list = soup.find('h2', string = 'Abstract').find_next_sibling().find('abstract').find('div').contents

In [165]:
abstract_contents_list

['Renewable electrical energy is provided with aspects and circuitry that can harvest maximum power from an alternative electrical energy source (',
 <b>1</b>,
 ') such as a string of solar panels (',
 <b>11</b>,
 ') for a power grid (',
 <b>10</b>,
 '). Aspects include: i) controlling electrical power creation from photovoltaic DC-AC inverter (',
 <b>5</b>,
 '), ii) operating photovoltaic DC-AC inverter (',
 <b>5</b>,
 ') at maximal efficiency even when MPP would not be, iii) protecting DC-AC inverter (',
 <b>5</b>,
 ') so input can vary over a range of insolation and temperature, and iv) providing dynamically reactive capability to react and assure operation, to permit differing components, to achieve code compliant dynamically reactive photovoltaic power control circuitry (',
 <b>41</b>,
 '). With previously explained converters, inverter control circuitry (',
 <b>38</b>,
 ') or photovoltaic power converter functionality control circuitry (',
 <b>8</b>,
 ') configured as inverter sw

The contents may be a list of strings, so using ".contents[0]" is not a good practice. We should extract contents as a list and join all the string elements in the list.

There also comes another problem in this third example, there are Tags in the contents list, which we would like to exclude from our abstract.

So, we need to change our workflow. We use the contents list, only keep strings, and join all string elements to form one string, that is the whole abstract of this patent.

In [166]:
abstract_contents = [element for element in abstract_contents_list if isinstance(element, str)]

In [167]:
abstract_contents

['Renewable electrical energy is provided with aspects and circuitry that can harvest maximum power from an alternative electrical energy source (',
 ') such as a string of solar panels (',
 ') for a power grid (',
 '). Aspects include: i) controlling electrical power creation from photovoltaic DC-AC inverter (',
 '), ii) operating photovoltaic DC-AC inverter (',
 ') at maximal efficiency even when MPP would not be, iii) protecting DC-AC inverter (',
 ') so input can vary over a range of insolation and temperature, and iv) providing dynamically reactive capability to react and assure operation, to permit differing components, to achieve code compliant dynamically reactive photovoltaic power control circuitry (',
 '). With previously explained converters, inverter control circuitry (',
 ') or photovoltaic power converter functionality control circuitry (',
 ') configured as inverter sweet spot converter control circuitry (',
 ') can achieve extraordinary efficiencies with substantially 

In [164]:
abstract = " ".join(abstract_contents)
abstract

'Renewable electrical energy is provided with aspects and circuitry that can harvest maximum power from an alternative electrical energy source ( ) such as a string of solar panels ( ) for a power grid ( ). Aspects include: i) controlling electrical power creation from photovoltaic DC-AC inverter ( ), ii) operating photovoltaic DC-AC inverter ( ) at maximal efficiency even when MPP would not be, iii) protecting DC-AC inverter ( ) so input can vary over a range of insolation and temperature, and iv) providing dynamically reactive capability to react and assure operation, to permit differing components, to achieve code compliant dynamically reactive photovoltaic power control circuitry ( ). With previously explained converters, inverter control circuitry ( ) or photovoltaic power converter functionality control circuitry ( ) configured as inverter sweet spot converter control circuitry ( ) can achieve extraordinary efficiencies with substantially power isomorphic photovoltaic capability 

### Establish patent abstract scarping workflow

In [169]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import random

# reload cleaned patent list
df = pd.read_csv(r"C:\Users\user\Documents\RenewableEnergy_patents_clean.csv", index_col = 0)

In [18]:
# randomly sample 5 patents abstract scraping results
# run several times to make sure there is no problem with current scraping method
for i in random.sample(range(len(df)), 5):
    url = df.loc[i,"result_link"]
    response = requests.get(url) 
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    abstract_contents_list = soup.find('h2', string = 'Abstract').find_next_sibling().find('abstract').find('div').contents
    abstract_contents = [element for element in abstract_contents_list if isinstance(element, str)]
    abstract = " ".join(abstract_contents)
    print(f'{i+1} patent abstract:')
    print(abstract)
    print('')

15815 patent abstract:
A power generation transport includes a gas turbine, an inlet plenum coupled to an intake of the gas turbine, a generator driven by the gas turbine, and an air intake and exhaust module including an air inlet filter housing, an intake air duct coupled to the housing at a first end and to the inlet plenum at a second end, and an exhaust collector coupled to an exhaust of the gas turbine. The transport further includes at least one base frame. The frame mounts and aligns the gas turbine, the inlet plenum, the generator, and the air intake and exhaust module. The intake air duct is mounted on the base frame so as to be disposed underneath the gas turbine, and extend along the base frame from an exhaust end side of the gas turbine to an intake end side, in a longitudinal direction of the power generation transport.

13451 patent abstract:
An example device includes a housing including a first surface facing a first direction and a second surface facing a second direc

## Scrape Abstracts for All Patents Filed in 2023

In [26]:
# extract patents filed in 2023 
df_2023 = df.loc[df['filing_date'] > 20230000,]
len(df_2023)

360

In [38]:
df_2023 = df_2023.reset_index(drop = True)
df_2023.head()

Unnamed: 0,id,title,assignee,inventor,priority_date,filing_date,publication_date,grant_date,result_link,representative_figure_link,country,kind_code,abstract
0,US11967653B2,Phased solar power supply system,"Ampt, Llc",['Anatoli Ledenev'],20130315.0,20230905,20240423,20240423,https://patents.google.com/patent/US11967653B2/en,https://patentimages.storage.googleapis.com/c0...,US,B2,
1,US11867096B2,Calcination system with thermal energy storage...,"Rondo Energy, Inc.","[""John Setel O'Donnell"", ' Peter Emery von Beh...",20201130.0,20230220,20240109,20240109,https://patents.google.com/patent/US11867096B2/en,https://patentimages.storage.googleapis.com/35...,US,B2,
2,US11757404B2,Coordinated control of renewable electric gene...,"8Me Nova, Llc","['Lukas Hansen', ' Philippe Garneau-Halliday',...",20190208.0,20230310,20230912,20230912,https://patents.google.com/patent/US11757404B2/en,https://patentimages.storage.googleapis.com/f6...,US,B2,
3,US11973345B2,Building energy system with predictive control...,Johnson Controls Tyco IP Holdings LLP,"['Robert D. Turney', ' Nishith R. Patel']",20170427.0,20230602,20240430,20240430,https://patents.google.com/patent/US11973345B2/en,https://patentimages.storage.googleapis.com/0f...,US,B2,
4,US11780305B2,Tonneau system for use with a pickup truck,Worksport Ltd.,"['Steven Rossi', ' Jonathan Loudon']",20151030.0,20230110,20231010,20231010,https://patents.google.com/patent/US11780305B2/en,https://patentimages.storage.googleapis.com/36...,US,B2,


In [28]:
from tqdm import tqdm
import time

In [39]:
df_2023['abstract'] = ''

for i in tqdm(range(len(df_2023))):
    url = df_2023.loc[i,"result_link"]
    response = requests.get(url) 
    html_content = response.text
    soup = BeautifulSoup(html_content, 'html.parser')
    abstract_contents_list = soup.find('h2', string = 'Abstract').find_next_sibling().find('abstract').find('div').contents
    abstract_contents = [element for element in abstract_contents_list if isinstance(element, str)]
    abstract = " ".join(abstract_contents)
    df_2023.loc[i, 'abstract'] = abstract
    time.sleep(0.1)

100%|████████████████████████████████████████████████████████████████████████████████| 360/360 [14:41<00:00,  2.45s/it]


It takes about 15 mins to scape 360 patents' abstracts.

In [46]:
# save df_2023
df_2023.to_csv(r"C:\Users\user\Documents\RenewableEnergy_patents_2023_abstract.csv")

In [50]:
vect = CountVectorizer()
vect.fit(ab)
bof = vect.transform(ab)

In [60]:
print(len(vect.vocabulary_))
print(vect.vocabulary_.keys())

3981


In [61]:
print(repr(bof))

<360x3981 sparse matrix of type '<class 'numpy.int64'>'
	with 18113 stored elements in Compressed Sparse Row format>


As we can see, there are many uninformative words in the dictionary, like  'of', 'can', 'be' etc.

We will use multiple methods to filter out uninformative words.

In [64]:
vect = CountVectorizer(min_df = 5, stop_words = "english")
vect.fit(ab)
bof = vect.transform(ab)
print(len(vect.vocabulary_))
print(vect.vocabulary_.keys())

574
dict_keys(['high', 'efficiency', 'solar', 'power', 'sources', 'base', 'phase', 'converter', 'low', 'energy', 'storage', 'circuitry', 'controlled', 'control', 'operates', 'provide', 'conversion', 'output', 'provided', 'individual', 'source', 'operational', 'modes', 'presenting', 'panel', 'variable', 'continuous', 'heat', 'electrical', 'solid', 'medium', 'continuously', 'array', 'internal', 'radiation', 'directly', 'thermal', 'facilitate', 'heating', 'delivery', 'temperature', 'used', 'processes', 'including', 'generation', 'based', 'supply', 'using', 'current', 'advance', 'information', 'voltage', 'distribution', 'transfer', 'method', 'includes', 'generating', 'time', 'signal', 'device', 'comprises', 'identifying', 'simultaneously', 'operating', 'mode', 'additional', 'comprising', 'plurality', 'values', 'according', 'determining', 'predictive', 'controller', 'processing', 'circuits', 'configured', 'obtain', 'defines', 'electric', 'load', 'period', 'multiple', 'specific', 'components

In [65]:
print(repr(bof))

<360x574 sparse matrix of type '<class 'numpy.int64'>'
	with 7417 stored elements in Compressed Sparse Row format>


We could see some words should be together, like 'high efficiency', 'solar power' and 'low energy' etc.
Let's use n-grams to allow multi-word tokens.

In [89]:
vect = CountVectorizer(min_df = 5, stop_words = "english", ngram_range = (1,3))
vect.fit(ab)
bof = vect.transform(ab)
print(len(vect.vocabulary_))
print(vect.vocabulary_.keys())

653
dict_keys(['high', 'efficiency', 'solar', 'power', 'sources', 'base', 'phase', 'converter', 'low', 'energy', 'storage', 'circuitry', 'controlled', 'control', 'operates', 'provide', 'conversion', 'output', 'provided', 'individual', 'source', 'operational', 'modes', 'presenting', 'panel', 'energy storage', 'variable', 'continuous', 'heat', 'electrical', 'solid', 'medium', 'continuously', 'array', 'internal', 'radiation', 'directly', 'thermal', 'facilitate', 'heating', 'delivery', 'temperature', 'used', 'processes', 'including', 'generation', 'based', 'supply', 'using', 'current', 'advance', 'information', 'voltage', 'distribution', 'transfer', 'method', 'includes', 'generating', 'time', 'signal', 'device', 'comprises', 'identifying', 'simultaneously', 'operating', 'mode', 'additional', 'comprising', 'plurality', 'values', 'according', 'determining', 'method includes', 'storage device', 'predictive', 'controller', 'processing', 'circuits', 'configured', 'obtain', 'defines', 'electric'

In [90]:
print(bof)

  (0, 44)	1
  (0, 69)	1
  (0, 123)	1
  (0, 124)	1
  (0, 127)	2
  (0, 128)	3
  (0, 193)	1
  (0, 220)	2
  (0, 221)	2
  (0, 268)	1
  (0, 303)	2
  (0, 339)	2
  (0, 376)	1
  (0, 405)	1
  (0, 408)	1
  (0, 413)	1
  (0, 417)	1
  (0, 433)	3
  (0, 453)	2
  (0, 466)	1
  (0, 482)	1
  (0, 483)	1
  (0, 559)	1
  (0, 561)	1
  (0, 562)	1
  :	:
  (358, 551)	2
  (358, 591)	3
  (358, 616)	3
  (358, 632)	3
  (358, 643)	1
  (359, 113)	3
  (359, 155)	1
  (359, 160)	1
  (359, 178)	5
  (359, 179)	1
  (359, 183)	3
  (359, 247)	4
  (359, 291)	1
  (359, 297)	2
  (359, 298)	1
  (359, 299)	1
  (359, 327)	4
  (359, 382)	3
  (359, 417)	4
  (359, 442)	5
  (359, 443)	2
  (359, 447)	2
  (359, 487)	1
  (359, 562)	3
  (359, 585)	4


In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [91]:
tfidf_vect = TfidfVectorizer(min_df = 5, stop_words = "english", ngram_range = (1,3))
tfidf_vect.fit(ab)
tfidf_vect_bof = tfidf_vect.transform(ab)
print(len(tfidf_vect.vocabulary_))
print(tfidf_vect.vocabulary_.keys())

653
dict_keys(['high', 'efficiency', 'solar', 'power', 'sources', 'base', 'phase', 'converter', 'low', 'energy', 'storage', 'circuitry', 'controlled', 'control', 'operates', 'provide', 'conversion', 'output', 'provided', 'individual', 'source', 'operational', 'modes', 'presenting', 'panel', 'energy storage', 'variable', 'continuous', 'heat', 'electrical', 'solid', 'medium', 'continuously', 'array', 'internal', 'radiation', 'directly', 'thermal', 'facilitate', 'heating', 'delivery', 'temperature', 'used', 'processes', 'including', 'generation', 'based', 'supply', 'using', 'current', 'advance', 'information', 'voltage', 'distribution', 'transfer', 'method', 'includes', 'generating', 'time', 'signal', 'device', 'comprises', 'identifying', 'simultaneously', 'operating', 'mode', 'additional', 'comprising', 'plurality', 'values', 'according', 'determining', 'method includes', 'storage device', 'predictive', 'controller', 'processing', 'circuits', 'configured', 'obtain', 'defines', 'electric'

In [100]:
words_sortedby_tfidf = tfidf_vect_bof.toarray().max(axis = 0).argsort()
feature_names = tfidf_vect.vocabulary_.keys()

In [120]:
print('feautures with lowest tfidf:')
for i in list(words_sortedby_tfidf[:20]):
    print(f'{i}: ',list(feature_names)[i])

feautures with lowest tfidf:
611:  external
322:  area
108:  methods
298:  accordance
595:  configuration
414:  includes light
12:  controlled
258:  target
259:  tissue
204:  cells
286:  model
15:  provide
370:  volume
40:  delivery
159:  length
273:  connected
615:  audio
405:  range
292:  safety
642:  size


In [122]:
print('feautures with highest tfidf:')
for i in list(words_sortedby_tfidf[-20:]):
    print(f'{i}: ',list(feature_names)[i])

feautures with highest tfidf:
120:  apparatus
547:  image data
551:  results
385:  resources
477:  wavelength
499:  difference
68:  plurality
638:  trained
516:  directed
572:  software
117:  monitor
262:  greater
36:  directly
646:  servers
589:  implemented
411:  distal
74:  predictive
363:  actuator
618:  identification
381:  emitting
