# Webscraping and API: Collecting Patent Citation Data

This notebook shows how to use web scraping and API tools to collect patent and citation data from Google Patents search pages.

First, start from the Google Patents search page:
+ Set the search parameters
+ Run the search
+ Download the CSV file from the top-right corner
+ The CSV file should include the following information: patent id, assignee, inventors, dates, and patent web page links

We use "Gate-All-Around" as the search keyword to define the patent topic and set the other search parameters as follows:

|Search parameter|Setting|Description|
|----|----|----|
|sort by|no selection||
|group by|'True' (Classification)|Setting this to True returns the CPC code in the response. We are interested in the CPC codes of these patents, so we enable it.|
|deduplicate|no selection|The default is family; the other choice is publication. We keep the default.|
|date before|'publication:20231231'|Collect all patents with a publication date before 2023-12-31.|
|date after|no selection||
|inventor|no selection|We are not searching for a specific inventor, so this is left blank.|
|assignee|no selection|Same as above.|
|patent office|'US'|We focus on US patents first.|
|language|'EN'|We focus on English-language patents first.|
|status|'Grant'|The other choice is 'application'. Granted patents are the ones with real impact, so we choose 'Grant'.|
|type|'patent'|The other choice is 'design'. We focus on patents.|
|litigation|no selection||

## Patent list

In [24]:
import pandas as pd
df = pd.read_csv(r"C:\Users\user\Documents\GAA_patents.csv", skiprows = 1)

In [55]:
df.head()

Unnamed: 0,id,title,assignee,inventor,priority_date,filing_date,publication_date,grant_date,result_link,representative_figure_link,country,kind_code
0,US11756960B2,Multi-threshold voltage gate-all-around transi...,International Business Machines Corporation,"Jingyun Zhang, Takashi Ando, ChoongHyun Lee",20190523,20210924,20230912,20230912,https://patents.google.com/patent/US11756960B2/en,https://patentimages.storage.googleapis.com/4d...,US,B2
1,US10700064B1,Multi-threshold voltage gate-all-around field-...,International Business Machines Corporation,"Jingyun Zhang, Takashi Ando, ChoongHyun Lee",20190215,20190215,20200630,20200630,https://patents.google.com/patent/US10700064B1/en,https://patentimages.storage.googleapis.com/9b...,US,B1
2,US11177258B2,Stacked nanosheet CFET with gate all around st...,International Business Machines Corporation,"Ruilong Xie, Alexander Reznicek, Heng Wu, Lan Yu",20200222,20200222,20211116,20211116,https://patents.google.com/patent/US11177258B2/en,https://patentimages.storage.googleapis.com/28...,US,B2
3,US10483085B2,Use of ion beam etching to generate gate-all-a...,Lam Research Corporation,"Ivan L. Berry, III, Thorsten Lill",20141021,20161115,20191119,20191119,https://patents.google.com/patent/US10483085B2/en,https://patentimages.storage.googleapis.com/9c...,US,B2
4,US10332803B1,Hybrid gate-all-around (GAA) field effect tran...,Globalfoundaries Inc.,"Ruilong Xie, Edward J. Nowak, Bipul C. Paul, S...",20180508,20180508,20190625,20190625,https://patents.google.com/patent/US10332803B1/en,https://patentimages.storage.googleapis.com/5f...,US,B1


In [26]:
# change column name
df.columns = ['id','title','assignee','inventor','priority_date','filing_date','publication_date','grant_date','result_link','representative_figure_link']

In [54]:
# check data type
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23426 entries, 0 to 23425
Data columns (total 12 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   id                          23426 non-null  object
 1   title                       23426 non-null  object
 2   assignee                    23426 non-null  object
 3   inventor                    23426 non-null  object
 4   priority_date               23426 non-null  object
 5   filing_date                 23426 non-null  object
 6   publication_date            23426 non-null  object
 7   grant_date                  23426 non-null  object
 8   result_link                 23426 non-null  object
 9   representative_figure_link  23426 non-null  object
 10  country                     23426 non-null  object
 11  kind_code                   23426 non-null  object
dtypes: object(12)
memory usage: 2.1+ MB


In [57]:
# check missing value
df.isnull().sum()

id                            0
title                         0
assignee                      0
inventor                      0
priority_date                 0
filing_date                   0
publication_date              0
grant_date                    0
result_link                   0
representative_figure_link    0
country                       0
kind_code                     0
dtype: int64

In [None]:
## cleaning id column
# split the raw id (e.g. 'US-11756960-B2') to generate 'country' and 'kind_code' columns
id_parts = df['id'].str.split('-', expand=True)
df['country'] = id_parts[0]
df['kind_code'] = id_parts[2]

# remove '-' signs from id
df['id'] = df['id'].str.replace('-', '')

In [None]:
# remove '-' signs from date columns
for i in ['priority_date','filing_date','publication_date','grant_date']:
    df[i] = df[i].str.replace('-', '')
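Stripping the dashes keeps the dates as compact strings. If date arithmetic is needed later, one alternative is to parse the original dashed strings into datetimes instead. The following is a minimal sketch on a two-row toy frame; the `days_to_grant` column is a hypothetical name used only for illustration and is not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking two date columns of the raw export (dashed strings).
dates = pd.DataFrame({
    "priority_date": ["2019-05-23", "2019-02-15"],
    "grant_date": ["2023-09-12", "2020-06-30"],
})

# Parse every date column with an explicit format so malformed values raise.
date_cols = ["priority_date", "grant_date"]
parsed = dates[date_cols].apply(pd.to_datetime, format="%Y-%m-%d")

# Hypothetical derived column: days between priority date and grant date.
parsed["days_to_grant"] = (parsed["grant_date"] - parsed["priority_date"]).dt.days
print(parsed)
```

Datetime columns can still be written back as `YYYYMMDD` strings with `.dt.strftime('%Y%m%d')` if the compact form is preferred for storage.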

In [70]:
# change inventor column from a comma-separated string to a list of names
# note: splitting on ',' keeps leading spaces and treats suffixes such as 'III' as separate names
df['inventor'] = df['inventor'].str.split(',')
df.head()

Unnamed: 0,id,title,assignee,inventor,priority_date,filing_date,publication_date,grant_date,result_link,representative_figure_link,country,kind_code
0,US11756960B2,Multi-threshold voltage gate-all-around transi...,International Business Machines Corporation,"[Jingyun Zhang, Takashi Ando, ChoongHyun Lee]",20190523,20210924,20230912,20230912,https://patents.google.com/patent/US11756960B2/en,https://patentimages.storage.googleapis.com/4d...,US,B2
1,US10700064B1,Multi-threshold voltage gate-all-around field-...,International Business Machines Corporation,"[Jingyun Zhang, Takashi Ando, ChoongHyun Lee]",20190215,20190215,20200630,20200630,https://patents.google.com/patent/US10700064B1/en,https://patentimages.storage.googleapis.com/9b...,US,B1
2,US11177258B2,Stacked nanosheet CFET with gate all around st...,International Business Machines Corporation,"[Ruilong Xie, Alexander Reznicek, Heng Wu, ...",20200222,20200222,20211116,20211116,https://patents.google.com/patent/US11177258B2/en,https://patentimages.storage.googleapis.com/28...,US,B2
3,US10483085B2,Use of ion beam etching to generate gate-all-a...,Lam Research Corporation,"[Ivan L. Berry, III, Thorsten Lill]",20141021,20161115,20191119,20191119,https://patents.google.com/patent/US10483085B2/en,https://patentimages.storage.googleapis.com/9c...,US,B2
4,US10332803B1,Hybrid gate-all-around (GAA) field effect tran...,Globalfoundaries Inc.,"[Ruilong Xie, Edward J. Nowak, Bipul C. Paul...",20180508,20180508,20190625,20190625,https://patents.google.com/patent/US10332803B1/en,https://patentimages.storage.googleapis.com/5f...,US,B1
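A plain comma split treats a name suffix such as the "III" in "Ivan L. Berry, III" as a separate inventor, which inflates the inventor count for that patent. The sketch below shows one possible suffix-aware split. The function name `split_inventors` and the suffix list are assumptions made for illustration, and the list would likely need extending for real data.

```python
import re

# Hypothetical list of name suffixes that should stay attached to the previous name.
_SUFFIX = re.compile(r"^(Jr\.?|Sr\.?|II|III|IV)$", re.IGNORECASE)


def split_inventors(raw: str) -> list[str]:
    """Split a comma-separated inventor string, re-attaching name suffixes."""
    names: list[str] = []
    for token in (part.strip() for part in raw.split(",")):
        if not token:
            continue  # skip empty fragments such as a trailing comma
        if names and _SUFFIX.match(token):
            # The token is a suffix: glue it back onto the preceding name.
            names[-1] = f"{names[-1]}, {token}"
        else:
            names.append(token)
    return names


print(split_inventors("Ivan L. Berry, III, Thorsten Lill"))
# → ['Ivan L. Berry, III', 'Thorsten Lill']
```

It could be applied to the raw column with `df['inventor'].apply(split_inventors)`, which also removes the leading spaces left by the plain split.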


In [None]:
## inspect 'inventor' column for more details
# create 'inventor_num' column with the number of inventors listed for each patent
df['inventor_num'] = df['inventor'].str.len()

In [98]:
# summary statistics
df['inventor_num'].describe()
# there is an outlier: one patent lists 68 inventors

count    23426.000000
mean         3.695424
std          2.620646
min          1.000000
25%          2.000000
50%          3.000000
75%          5.000000
max         68.000000
Name: inventor_num, dtype: float64

In [101]:
# the outlier's assignee is Intel Corporation
# and the patent is about wireless communication technology
df.loc[df['inventor_num'] == 68]

Unnamed: 0,id,title,assignee,inventor,priority_date,filing_date,publication_date,grant_date,result_link,representative_figure_link,country,kind_code,inventor_num
7865,US11955732B2,"Wireless communication technology, apparatuses...",Intel Corporation,"[Erkan Alpman, Arnaud Lucres Amadjikpe, Omer...",20161221,20221227,20240409,20240409,https://patents.google.com/patent/US11955732B2/en,https://patentimages.storage.googleapis.com/91...,US,B2,68


In [104]:
# save cleaned patents DataFrame
df.to_csv(r"C:\Users\user\Documents\GAA_patents_clean.csv")

## Citation list

#### Using Pandas to scrape tables

In [1]:
# load patent list
import pandas as pd
df = pd.read_csv(r"C:\Users\user\Documents\GAA_patents_clean.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23426 entries, 0 to 23425
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Unnamed: 0                  23426 non-null  int64 
 1   id                          23426 non-null  object
 2   title                       23426 non-null  object
 3   assignee                    23426 non-null  object
 4   inventor                    23426 non-null  object
 5   priority_date               23426 non-null  int64 
 6   filing_date                 23426 non-null  int64 
 7   publication_date            23426 non-null  int64 
 8   grant_date                  23426 non-null  int64 
 9   result_link                 23426 non-null  object
 10  representative_figure_link  23387 non-null  object
 11  country                     23426 non-null  object
 12  kind_code                   23426 non-null  object
 13  inventor_num                23426 non-null  in

In [2]:
df = df.drop('Unnamed: 0', axis = 1)
df.head()

Unnamed: 0,id,title,assignee,inventor,priority_date,filing_date,publication_date,grant_date,result_link,representative_figure_link,country,kind_code,inventor_num
0,US11756960B2,Multi-threshold voltage gate-all-around transi...,International Business Machines Corporation,"['Jingyun Zhang', ' Takashi Ando', ' ChoongHyu...",20190523,20210924,20230912,20230912,https://patents.google.com/patent/US11756960B2/en,https://patentimages.storage.googleapis.com/4d...,US,B2,3
1,US10700064B1,Multi-threshold voltage gate-all-around field-...,International Business Machines Corporation,"['Jingyun Zhang', ' Takashi Ando', ' ChoongHyu...",20190215,20190215,20200630,20200630,https://patents.google.com/patent/US10700064B1/en,https://patentimages.storage.googleapis.com/9b...,US,B1,3
2,US11177258B2,Stacked nanosheet CFET with gate all around st...,International Business Machines Corporation,"['Ruilong Xie', ' Alexander Reznicek', ' Heng ...",20200222,20200222,20211116,20211116,https://patents.google.com/patent/US11177258B2/en,https://patentimages.storage.googleapis.com/28...,US,B2,4
3,US10483085B2,Use of ion beam etching to generate gate-all-a...,Lam Research Corporation,"['Ivan L. Berry', ' III', ' Thorsten Lill']",20141021,20161115,20191119,20191119,https://patents.google.com/patent/US10483085B2/en,https://patentimages.storage.googleapis.com/9c...,US,B2,3
4,US10332803B1,Hybrid gate-all-around (GAA) field effect tran...,Globalfoundaries Inc.,"['Ruilong Xie', ' Edward J. Nowak', ' Bipul C....",20180508,20180508,20190625,20190625,https://patents.google.com/patent/US10332803B1/en,https://patentimages.storage.googleapis.com/5f...,US,B1,7
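After the round trip through CSV, the `inventor` column holds the string representation of each list rather than an actual list, as the quoted brackets in the output above show. One way to restore real lists is `ast.literal_eval`. The sketch below assumes every cell is a valid Python list literal like the ones displayed; `parse_inventor_cell` is a hypothetical helper name.

```python
import ast


def parse_inventor_cell(cell: str) -> list[str]:
    """Turn a stringified list such as "['A', ' B']" back into a list of names."""
    names = ast.literal_eval(cell)  # safely evaluates literals only, not arbitrary code
    return [name.strip() for name in names]


cell = "['Jingyun Zhang', ' Takashi Ando', ' ChoongHyun Lee']"
restored = parse_inventor_cell(cell)
print(restored)
# → ['Jingyun Zhang', 'Takashi Ando', 'ChoongHyun Lee']
```

The same function can be passed to `pd.read_csv` through its `converters` argument, for example `converters={'inventor': parse_inventor_cell}`, so that the column is restored while loading.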


In [20]:
# Take the first patent as example:
# extract patent webpage link
firstlink = df.loc[0,"result_link"]
firstlink

'https://patents.google.com/patent/US11756960B2/en'

In [43]:
## use pandas to parse table content
# parse tables out of html contents
url = firstlink
df_list = pd.read_html(url)

In [52]:
# show all scraped tables
for i, table in enumerate(df_list):
    print(i)
    display(table)

0


Unnamed: 0,Application Number,Priority Date,Filing Date,Title
0,"US17/483,981 US11756960B2 (en)",2019-05-23,2021-09-24,Multi-threshold voltage gate-all-around transi...


1


Unnamed: 0,Application Number,Priority Date,Filing Date,Title
0,"US16/420,753 US11133309B2 (en)",2019-05-23,2019-05-23,Multi-threshold voltage gate-all-around transi...
1,"US17/483,981 US11756960B2 (en)",2019-05-23,2021-09-24,Multi-threshold voltage gate-all-around transi...


2


Unnamed: 0,Application Number,Title,Priority Date,Filing Date
0,"US16/420,753 Division US11133309B2 (en)",2019-05-23,2019-05-23,Multi-threshold voltage gate-all-around transi...


3


Unnamed: 0,Publication Number,Publication Date
0,US20220085014A1 US20220085014A1 (en),2022-03-17
1,US11756960B2 true US11756960B2 (en),2023-09-12


4


Unnamed: 0,Application Number,Title,Priority Date,Filing Date
0,"US16/420,753 Active 2039-11-01 US11133309B2...",2019-05-23,2019-05-23,Multi-threshold voltage gate-all-around transi...
1,"US17/483,981 Active 2039-10-23 US11756960B2...",2019-05-23,2021-09-24,Multi-threshold voltage gate-all-around transi...


5


Unnamed: 0,Application Number,Title,Priority Date,Filing Date
0,"US16/420,753 Active 2039-11-01 US11133309B2...",2019-05-23,2019-05-23,Multi-threshold voltage gate-all-around transi...


6


Unnamed: 0,Country,Link
0,US (2),US11133309B2 (en)


7


Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
0,US11133309B2 (en) *,2019-05-23,2021-09-28,International Business Machines Corporation,Multi-threshold voltage gate-all-around transi...
1,TWI812751B (en) *,2019-07-08,2023-08-21,聯華電子股份有限公司,Semiconductor device and manufacturing method ...
2,US11152488B2 (en) *,2019-08-21,2021-10-19,"Taiwan Semiconductor Manufacturing Co., Ltd.",Gate-all-around structure with dummy pattern t...
3,US11049934B2 (en) *,2019-09-18,2021-06-29,Globalfoundries U.S. Inc.,Transistor comprising a matrix of nanowires an...
4,US20210126018A1 (en) *,2019-10-24,2021-04-29,International Business Machines Corporation,Gate stack quality for gate-all-around field-e...
5,US11502168B2 (en) *,2019-10-30,2022-11-15,"Taiwan Semiconductor Manufacturing Company, Ltd.",Tuning threshold voltage in nanosheet transito...
6,US11264503B2 (en) *,2019-12-18,2022-03-01,"Taiwan Semiconductor Manufacturing Co., Ltd.",Metal gate structures of semiconductor devices
7,US11664420B2 (en) *,2019-12-26,2023-05-30,"Taiwan Semiconductor Manufacturing Company, Ltd.",Semiconductor device and method
8,US11410889B2 (en) *,2019-12-31,2022-08-09,"Taiwan Semiconductor Manufacturing Co., Ltd.",Semiconductor device and manufacturing method ...
9,US11152264B2 (en) *,2020-01-08,2021-10-19,International Business Machines Corporation,Multi-Vt scheme with same dipole thickness for...


8


Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
0,US9613870B2 (en),2015-06-30,2017-04-04,International Business Machines Corporation,Gate stack formed with interrupted deposition ...
1,US9812449B2 (en),2015-11-20,2017-11-07,"Samsung Electronics Co., Ltd.",Multi-VT gate stack for III-V nanosheet device...
2,US20180090326A1 (en),2016-09-26,2018-03-29,International Business Machines Corporation,Controlling threshold voltage in nanosheet tra...
3,US9997519B1 (en),2017-05-03,2018-06-12,International Business Machines Corporation,Dual channel structures with multiple threshol...
4,US10026652B2 (en),2016-08-17,2018-07-17,"Samsung Electronics Co., Ltd.",Horizontal nanosheet FETs and method of manufa...
5,US10056454B2 (en),2016-03-02,2018-08-21,"Samsung Electronics Co., Ltd.",Semiconductor device and method of manufacturi...
6,US11133309B2 (en) *,2019-05-23,2021-09-28,International Business Machines Corporation,Multi-threshold voltage gate-all-around transi...


9


Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
0,US10825736B1 (en) *,2019-07-22,2020-11-03,International Business Machines Corporation,Nanosheet with selective dipole diffusion into...


10


Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
0,US9613870B2 (en),2015-06-30,2017-04-04,International Business Machines Corporation,Gate stack formed with interrupted deposition ...
1,US9812449B2 (en),2015-11-20,2017-11-07,"Samsung Electronics Co., Ltd.",Multi-VT gate stack for III-V nanosheet device...
2,US10056454B2 (en),2016-03-02,2018-08-21,"Samsung Electronics Co., Ltd.",Semiconductor device and method of manufacturi...
3,US10026652B2 (en),2016-08-17,2018-07-17,"Samsung Electronics Co., Ltd.",Horizontal nanosheet FETs and method of manufa...
4,US20180090326A1 (en),2016-09-26,2018-03-29,International Business Machines Corporation,Controlling threshold voltage in nanosheet tra...
5,US9997519B1 (en),2017-05-03,2018-06-12,International Business Machines Corporation,Dual channel structures with multiple threshol...
6,US11133309B2 (en) *,2019-05-23,2021-09-28,International Business Machines Corporation,Multi-threshold voltage gate-all-around transi...


11


Unnamed: 0,Title
0,"Barry P. Linder et al., Process optimizations ..."
1,"Disclosed Anonymously, ""Method and Structure t..."
2,"Disclosed Anonymously, ""Multiple VT for Gate-A..."
3,"J.W. Park et al., ""Reflective High-Energy Elec..."
4,"Jingyun Zhang et al.;""High-k metal gate fundam..."


12


Unnamed: 0,Publication number,Publication date
0,US11133309B2 (en),2021-09-28
1,US20220085014A1 (en),2022-03-17
2,US20200373300A1 (en),2020-11-26


13


Unnamed: 0,Publication,Publication Date,Title
0,US11756960B2 (en),2023-09-12,Multi-threshold voltage gate-all-around transi...
1,US10964601B2 (en),2021-03-30,Fabrication of a pair of vertical fin field ef...
2,US11158544B2 (en),2021-10-26,Vertical stacked nanosheet CMOS transistors wi...
3,US10361301B2 (en),2019-07-23,Fabrication of vertical fin transistor with mu...
4,US11756957B2 (en),2023-09-12,Reducing gate resistance in stacked vertical t...
5,US10903369B2 (en),2021-01-26,Transistor channel having vertically stacked n...
6,US11282961B2 (en),2022-03-22,Enhanced bottom dielectric isolation in gate-a...
7,US10332883B2 (en),2019-06-25,Integrated metal gate CMOS devices
8,US10957799B2 (en),2021-03-23,Transistor channel having vertically stacked n...
9,US11251267B2 (en),2022-02-15,Vertical transistors with multiple gate lengths


14


Unnamed: 0,Date,Code,Title,Description
0,2021-09-24,AS,Assignment,Owner name: INTERNATIONAL BUSINESS MACHINES C...
1,2021-09-24,FEPP,Fee payment procedure,Free format text: ENTITY STATUS SET TO UNDISC...
2,2021-12-07,STPP,Information on status: patent application and ...,Free format text: DOCKETED NEW CASE - READY F...
3,2023-04-24,STPP,Information on status: patent application and ...,Free format text: NOTICE OF ALLOWANCE MAILED ...
4,2023-07-25,STPP,Information on status: patent application and ...,Free format text: PUBLICATIONS -- ISSUE FEE P...
5,2023-08-23,STCF,Information on status: patent grant,Free format text: PATENTED CASE


In [54]:
# pick the tables that hold cited patents (index 8, 'Citations') and citing patents (index 7, 'Families Citing this family')
citation = df_list[8]
cited_by = df_list[7]
display(citation)
display(cited_by)

Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
0,US9613870B2 (en),2015-06-30,2017-04-04,International Business Machines Corporation,Gate stack formed with interrupted deposition ...
1,US9812449B2 (en),2015-11-20,2017-11-07,"Samsung Electronics Co., Ltd.",Multi-VT gate stack for III-V nanosheet device...
2,US20180090326A1 (en),2016-09-26,2018-03-29,International Business Machines Corporation,Controlling threshold voltage in nanosheet tra...
3,US9997519B1 (en),2017-05-03,2018-06-12,International Business Machines Corporation,Dual channel structures with multiple threshol...
4,US10026652B2 (en),2016-08-17,2018-07-17,"Samsung Electronics Co., Ltd.",Horizontal nanosheet FETs and method of manufa...
5,US10056454B2 (en),2016-03-02,2018-08-21,"Samsung Electronics Co., Ltd.",Semiconductor device and method of manufacturi...
6,US11133309B2 (en) *,2019-05-23,2021-09-28,International Business Machines Corporation,Multi-threshold voltage gate-all-around transi...


Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
0,US11133309B2 (en) *,2019-05-23,2021-09-28,International Business Machines Corporation,Multi-threshold voltage gate-all-around transi...
1,TWI812751B (en) *,2019-07-08,2023-08-21,聯華電子股份有限公司,Semiconductor device and manufacturing method ...
2,US11152488B2 (en) *,2019-08-21,2021-10-19,"Taiwan Semiconductor Manufacturing Co., Ltd.",Gate-all-around structure with dummy pattern t...
3,US11049934B2 (en) *,2019-09-18,2021-06-29,Globalfoundries U.S. Inc.,Transistor comprising a matrix of nanowires an...
4,US20210126018A1 (en) *,2019-10-24,2021-04-29,International Business Machines Corporation,Gate stack quality for gate-all-around field-e...
5,US11502168B2 (en) *,2019-10-30,2022-11-15,"Taiwan Semiconductor Manufacturing Company, Ltd.",Tuning threshold voltage in nanosheet transito...
6,US11264503B2 (en) *,2019-12-18,2022-03-01,"Taiwan Semiconductor Manufacturing Co., Ltd.",Metal gate structures of semiconductor devices
7,US11664420B2 (en) *,2019-12-26,2023-05-30,"Taiwan Semiconductor Manufacturing Company, Ltd.",Semiconductor device and method
8,US11410889B2 (en) *,2019-12-31,2022-08-09,"Taiwan Semiconductor Manufacturing Co., Ltd.",Semiconductor device and manufacturing method ...
9,US11152264B2 (en) *,2020-01-08,2021-10-19,International Business Machines Corporation,Multi-Vt scheme with same dipole thickness for...
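The `Publication number` cells mix three pieces of information: the publication number, the language code in parentheses, and an optional marker (on Google Patents, `*` means cited by examiner and `†` means cited by third party). The sketch below separates them with a regular expression. The pattern and the name `parse_publication_cell` are assumptions based only on the cell formats visible in the tables above, so other formats may need a looser pattern.

```python
import re

# Assumed cell layout: <number> (<language>) <optional marker>, with arbitrary whitespace.
PUB_CELL = re.compile(
    r"^\s*(?P<number>[A-Z]{2}[0-9A-Z]+)\s*"
    r"\(\s*(?P<lang>[a-z]{2})\s*\)\s*"
    r"(?P<flag>[*†]?)\s*$"
)


def parse_publication_cell(cell: str) -> dict:
    """Split a 'Publication number' cell into number, language, and citation markers."""
    match = PUB_CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return {
        "number": match["number"],
        "language": match["lang"],
        "cited_by_examiner": match["flag"] == "*",
        "cited_by_third_party": match["flag"] == "†",
    }


print(parse_publication_cell("US11133309B2 (en) *"))
print(parse_publication_cell("US20180090326A1 (en)"))
```

Because the pattern tolerates newlines and repeated spaces, it should also work on the raw cell text extracted with BeautifulSoup.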


#### Using BeautifulSoup

In [21]:
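import requests
from bs4 import BeautifulSoup
url = firstlink
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors
# decode as UTF-8 so non-ASCII text (e.g. '†', Chinese assignee names) is not garbled
response.encoding = 'utf-8'
html_content = response.text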

In [22]:
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser') 

In [23]:
# inspecting the full HTML shows that <h2> tags are useful anchors for locating the citation tables
h2_list = soup.find_all('h2')
print(len(h2_list))
h2_list

24


[<h2>Info</h2>,
 <h2>Links</h2>,
 <h2>Images</h2>,
 <h2>Classifications</h2>,
 <h2>Definitions</h2>,
 <h2>Abstract</h2>,
 <h2>Description</h2>,
 <h2>Claims (<span itemprop="count">20</span>)</h2>,
 <h2>Priority Applications (1)</h2>,
 <h2>Applications Claiming Priority (2)</h2>,
 <h2>Related Parent Applications (1)</h2>,
 <h2>Publications (2)</h2>,
 <h2>ID=73456197</h2>,
 <h2>Family Applications (2)</h2>,
 <h2>Family Applications Before (1)</h2>,
 <h2>Country Status (1)</h2>,
 <h2>Families Citing this family (24)</h2>,
 <h2>Citations (7)</h2>,
 <h2>Family Cites Families (1)</h2>,
 <h2>Patent Citations (7)</h2>,
 <h2>Non-Patent Citations (5)</h2>,
 <h2>Also Published As</h2>,
 <h2>Similar Documents</h2>,
 <h2>Legal Events</h2>]

In [35]:
import re

In [39]:
citation_h2 = soup.find('h2', string = re.compile('Family Cites Families'))

In [40]:
citation_h2

<h2>Family Cites Families (1)</h2>

In [41]:
# find the table that follows <h2>Family Cites Families</h2>
table = citation_h2.find_next_siblings('table')
table

[<table>
 <caption>* Cited by examiner, † Cited by third party</caption>
 <thead>
 <tr>
 <th>Publication number</th>
 <th>Priority date</th>
 <th>Publication date</th>
 <th>Assignee</th>
 <th>Title</th>
 </tr>
 </thead>
 <tbody>
 <tr itemprop="backwardReferencesFamily" itemscope="" repeat="">
 <td>
 <a href="/patent/US10825736B1/en">
 <span itemprop="publicationNumber">US10825736B1</span>
               (<span itemprop="primaryLanguage">en</span>)
             </a>
 <span itemprop="examinerCited">*</span>
 </td>
 <td itemprop="priorityDate">2019-07-22</td>
 <td itemprop="publicationDate">2020-11-03</td>
 <td><span itemprop="assigneeOriginal">International Business Machines Corporation</span></td>
 <td itemprop="title">Nanosheet with selective dipole diffusion into high-k 
        </td>
 </tr>
 </tbody>
 </table>]

In [43]:
# Find all <tr> tags in the table
rows = soup.find('h2', string = re.compile('Family Cites Families')).find_next_sibling('table').find_all('tr')

In [44]:
# parsing the header
headers_tag = soup.find('h2', string = re.compile('Family Cites Families')).find_next_sibling('table').find_all('th')
headers = [ele.text.strip() for ele in headers_tag]
headers

['Publication number',
 'Priority date',
 'Publication date',
 'Assignee',
 'Title']

In [45]:
# parsing the table
citation_table = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    citation_table.append([ele for ele in cols if ele])

In [46]:
# combine header and table
citation_table[0] = headers
citation_table

[['Publication number',
  'Priority date',
  'Publication date',
  'Assignee',
  'Title'],
 ['US10825736B1\n              (en)\n            \n*',
  '2019-07-22',
  '2020-11-03',
  'International Business Machines Corporation',
  'Nanosheet with selective dipole diffusion into high-k']]

In [47]:
# turn it into pandas dataframe
citation_df = pd.DataFrame(citation_table)
citation_df.columns = citation_df.iloc[0]
citation_df = citation_df.iloc[1:]
citation_df

Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
1,US10825736B1\n (en)\n \n*,2019-07-22,2020-11-03,International Business Machines Corporation,Nanosheet with selective dipole diffusion into...


In [48]:
## do the same with <h2>Families Citing this family (24)</h2>
# prepare rows
rows = soup.find('h2', string = re.compile('Families Citing this family')).find_next_sibling('table').find_all('tr')
cited_by_table = []
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    cited_by_table.append([ele for ele in cols if ele])

# prepare header
headers_tag = soup.find('h2', string = re.compile('Families Citing this family')).find_next_sibling('table').find_all('th')
headers = [ele.text.strip() for ele in headers_tag]

# combine rows and headers
cited_by_table[0] = headers

# turn parsed table into pandas dataframe
cited_by_df = pd.DataFrame(cited_by_table)
cited_by_df.columns = cited_by_df.iloc[0]
cited_by_df = cited_by_df.iloc[1:]
cited_by_df

Unnamed: 0,Publication number,Priority date,Publication date,Assignee,Title
1,US11133309B2\n (en)\n \n*,2019-05-23,2021-09-28,International Business Machines Corporation,Multi-threshold voltage gate-all-around transi...
2,TWI812751B\n (en)\n \n*,2019-07-08,2023-08-21,聯華電子股份有限公司,Semiconductor device and manufacturing method ...
3,US11152488B2\n (en)\n \n*,2019-08-21,2021-10-19,"Taiwan Semiconductor Manufacturing Co., Ltd.",Gate-all-around structure with dummy pattern t...
4,US11049934B2\n (en)\n \n*,2019-09-18,2021-06-29,Globalfoundries U.S. Inc.,Transistor comprising a matrix of nanowires an...
5,US20210126018A1\n (en)\n ...,2019-10-24,2021-04-29,International Business Machines Corporation,Gate stack quality for gate-all-around field-e...
6,US11502168B2\n (en)\n \n*,2019-10-30,2022-11-15,"Taiwan Semiconductor Manufacturing Company, Ltd.",Tuning threshold voltage in nanosheet transito...
7,US11264503B2\n (en)\n \n*,2019-12-18,2022-03-01,"Taiwan Semiconductor Manufacturing Co., Ltd.",Metal gate structures of semiconductor devices
8,US11664420B2\n (en)\n \n*,2019-12-26,2023-05-30,"Taiwan Semiconductor Manufacturing Company, Ltd.",Semiconductor device and method
9,US11410889B2\n (en)\n \n*,2019-12-31,2022-08-09,"Taiwan Semiconductor Manufacturing Co., Ltd.",Semiconductor device and manufacturing method ...
10,US11152264B2\n (en)\n \n*,2020-01-08,2021-10-19,International Business Machines Corporation,Multi-Vt scheme with same dipole thickness for...
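The two cells above repeat the same steps for different headings: find the `<h2>`, take the next sibling `<table>`, read the header cells, then read each row. These steps could be wrapped in one helper, sketched below. The function name `table_after_heading` is hypothetical, and the inline HTML is a shortened stand-in for the real page so the example runs without network access.

```python
import re
from bs4 import BeautifulSoup


def table_after_heading(soup, heading_pattern):
    """Return the rows of the table following a matching <h2> as a list of dicts."""
    heading = soup.find("h2", string=re.compile(heading_pattern))
    if heading is None:
        return []
    table = heading.find_next_sibling("table")
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find_all("th")]
    records = []
    for tr in table.find_all("tr"):
        # Collapse internal whitespace and newlines in every cell.
        cells = [" ".join(td.get_text().split()) for td in tr.find_all("td")]
        if cells:  # the header row has no <td> cells and is skipped
            records.append(dict(zip(headers, cells)))
    return records


# Shortened stand-in for the patent page.
html = """
<h2>Family Cites Families (1)</h2>
<table>
<thead><tr><th>Publication number</th><th>Priority date</th></tr></thead>
<tbody><tr>
<td><a href="/patent/US10825736B1/en"><span>US10825736B1</span>
    (<span>en</span>)</a> <span>*</span></td>
<td>2019-07-22</td>
</tr></tbody>
</table>
"""
demo_soup = BeautifulSoup(html, "html.parser")
records = table_after_heading(demo_soup, "Family Cites Families")
print(records)
```

The returned list of dicts can be passed directly to `pd.DataFrame(records)`. Unlike the cells above, empty cells are kept rather than dropped, so the columns stay aligned with the headers.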


#### Using google-patent-scraper API

In [3]:
from google_patent_scraper import scraper_class
import json

In [10]:
## take first 2 patents from patent list as example
# Initialize scraper class
scraper=scraper_class()

# first 2 patents from patent list
plist = df.loc[0:1,'id'].to_list()

# Add patents to list
scraper.add_patents(plist[0])
scraper.add_patents(plist[1])

# Scrape all patents
scraper.scrape_all_patents()

# Get results of scrape
patent_1_parsed = scraper.parsed_patents[plist[0]]
patent_2_parsed = scraper.parsed_patents[plist[1]]

# Print inventors of first patent
for inventor in json.loads(patent_1_parsed['inventor_name']):
    print('Patent inventor : {0}'.format(inventor['inventor_name']))

https://patents.google.com/patent/US11756960B2
https://patents.google.com/patent/US10700064B1
Patent inventor : Jingyun Zhang
Patent inventor : Takashi Ando
Patent inventor : ChoongHyun Lee


In [18]:
patent_1_parsed.keys()

dict_keys(['inventor_name', 'assignee_name_orig', 'assignee_name_current', 'pub_date', 'priority_date', 'grant_date', 'filing_date', 'forward_cite_no_family', 'forward_cite_yes_family', 'backward_cite_no_family', 'backward_cite_yes_family', 'abstract_text', 'url', 'patent'])

In [53]:
patent_1_cited_by_df = pd.DataFrame(json.loads(patent_1_parsed['forward_cite_yes_family'])).iloc[:,0].to_frame()
patent_1_citation_df = pd.DataFrame(json.loads(patent_1_parsed['backward_cite_yes_family'])).iloc[:,0].to_frame()
display(patent_1_cited_by_df)
display(patent_1_citation_df)

Unnamed: 0,patent_number
0,US11133309B2
1,TWI812751B
2,US11152488B2
3,US11049934B2
4,US20210126018A1
5,US11502168B2
6,US11264503B2
7,US11664420B2
8,US11410889B2
9,US11152264B2


Unnamed: 0,patent_number
0,US10825736B1


In [57]:
patent_1_citation_df.columns = ['citation']
patent_1_citation_df['patent'] = plist[0]
patent_1_citation_df['direction'] = 'cite'
patent_1_citation_df

Unnamed: 0,citation,patent,direction
0,US10825736B1,US11756960B2,cite


In [56]:
patent_1_cited_by_df.columns = ['citation']
patent_1_cited_by_df['patent'] = plist[0]
patent_1_cited_by_df['direction'] = 'cited_by'
patent_1_cited_by_df

Unnamed: 0,citation,patent,direction
0,US11133309B2,US11756960B2,cited_by
1,TWI812751B,US11756960B2,cited_by
2,US11152488B2,US11756960B2,cited_by
3,US11049934B2,US11756960B2,cited_by
4,US20210126018A1,US11756960B2,cited_by
5,US11502168B2,US11756960B2,cited_by
6,US11264503B2,US11756960B2,cited_by
7,US11664420B2,US11756960B2,cited_by
8,US11410889B2,US11756960B2,cited_by
9,US11152264B2,US11756960B2,cited_by


In [58]:
citations_1_df = pd.concat([patent_1_citation_df, patent_1_cited_by_df], axis = 0)
citations_1_df

Unnamed: 0,citation,patent,direction
0,US10825736B1,US11756960B2,cite
0,US11133309B2,US11756960B2,cited_by
1,TWI812751B,US11756960B2,cited_by
2,US11152488B2,US11756960B2,cited_by
3,US11049934B2,US11756960B2,cited_by
4,US20210126018A1,US11756960B2,cited_by
5,US11502168B2,US11756960B2,cited_by
6,US11264503B2,US11756960B2,cited_by
7,US11664420B2,US11756960B2,cited_by
8,US11410889B2,US11756960B2,cited_by
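The steps above build the citation table for one patent by hand. To repeat them for many patents, the logic could be collected into a small function, sketched below. It assumes that each parsed record stores `backward_cite_yes_family` and `forward_cite_yes_family` as JSON strings holding a list of objects with a `patent_number` key, as the outputs above suggest. The name `citation_edges` and the sample record are made up for illustration.

```python
import json


def citation_edges(patent_id: str, parsed: dict) -> list[dict]:
    """Flatten one parsed patent record into citation / patent / direction rows."""
    fields = {
        "backward_cite_yes_family": "cite",      # patents this patent cites
        "forward_cite_yes_family": "cited_by",   # patents citing this patent
    }
    edges = []
    for field, direction in fields.items():
        # Treat a missing or empty field as "no citations".
        for entry in json.loads(parsed.get(field) or "[]"):
            edges.append({
                "citation": entry["patent_number"],
                "patent": patent_id,
                "direction": direction,
            })
    return edges


# Hand-written sample record in the assumed format (not scraped data).
sample = {
    "backward_cite_yes_family": json.dumps([{"patent_number": "US10825736B1"}]),
    "forward_cite_yes_family": json.dumps([
        {"patent_number": "US11133309B2"},
        {"patent_number": "TWI812751B"},
    ]),
}
edges = citation_edges("US11756960B2", sample)
print(edges)
```

Collecting the rows from every entry of `scraper.parsed_patents` into one list and passing it to `pd.DataFrame` would give a single table with a clean 0..n-1 index, avoiding the repeated index values that `pd.concat` produces above.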
