# Gathering Citation class information

In [1]:
import pandas as pd
import numpy as np

In [3]:
citations = pd.read_csv('citations_patents_level.csv')
citations.head()

Unnamed: 0,cited_patent_number,patent_number
0,,D257752
1,4162014.0,D257752
2,,D257924
3,4162014.0,D257924
4,,D258382


In [4]:
citations.shape

(6777757, 2)

In [7]:
citations.groupby('patent_number').ngroups

525512

The version of this file that filters out non-design patents was never saved, so let's redo that filter quickly.

In [8]:
def remove_non_design(df):
    # Keep only design patents, i.e. rows whose patent_number contains 'D'
    return df[df.patent_number.str.contains('D')]

In [9]:
citations = remove_non_design(citations)

In [10]:
citations.groupby('patent_number').ngroups

525490
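
The filter above keeps any row whose `patent_number` contains a `D` anywhere in the string. A minimal sketch on made-up data, assuming design patent numbers always begin with `D`, shows a stricter prefix check. The helper `keep_design_patents` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd

def keep_design_patents(df):
    """Keep rows whose patent_number starts with 'D' (assumed design-patent prefix)."""
    return df[df["patent_number"].str.startswith("D")]

# Made-up rows: two design patents and one utility patent
toy = pd.DataFrame({
    "cited_patent_number": ["4162014", None, "RE28673"],
    "patent_number": ["D257752", "4000001", "D257924"],
})

design_only = keep_design_patents(toy)
print(design_only["patent_number"].tolist())
```

Whether the prefix check and the substring check give the same result on the real file depends on the numbering formats present, so it is worth comparing group counts before switching.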

In [11]:
citations.to_csv('citations_patents_level.csv', index=False)

In [12]:
citations.isnull().sum()

cited_patent_number    289567
patent_number               0
dtype: int64

In [15]:
citations.dropna(inplace=True)
citations.shape

(6488089, 2)

Each cited patent only needs to be looked up once, so let's drop the rows that repeat a cited patent already referenced by another patent.

In [16]:
citations.drop_duplicates(subset=['cited_patent_number'],inplace=True)
citations.shape

(941034, 2)
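
As a small illustration of the two cleaning steps above, the sketch below runs `dropna` and `drop_duplicates` on made-up data. The toy values are hypothetical. Note that `drop_duplicates(subset=...)` keeps the first row for each cited patent by default, so after this step only `cited_patent_number` carries meaning and the surviving `patent_number` is incidental.

```python
import pandas as pd

# Made-up rows: one missing citation and one cited patent that appears twice
toy = pd.DataFrame({
    "cited_patent_number": ["4162014", None, "4162014", "RE28673"],
    "patent_number": ["D257752", "D257752", "D257924", "D258382"],
})

# Drop rows with no cited patent, then keep one row per cited patent
cleaned = toy.dropna()
unique_cited = cleaned.drop_duplicates(subset=["cited_patent_number"])
print(unique_cited)
```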

In [18]:
941034/6777757

0.13884150759609706

In [24]:
needed_patents = citations['cited_patent_number'].sort_values()

In [25]:
needed_patents.head()

87412      3930271
5357083    3930272
23827      3930273
1137858    3930280
3133688    3930286
Name: cited_patent_number, dtype: object

In [41]:
needed_patents[needed_patents.str.contains('R')]

215569     RE28673
6407442    RE28687
2767       RE28720
1355927    RE28746
1160406    RE28752
132230     RE28793
50839      RE28797
5073187    RE28834
2745033    RE28855
449612     RE28874
5997353    RE28876
478363     RE28879
2895675    RE28880
1683791    RE28889
2153334    RE28898
4224475    RE28910
139959     RE28915
836440     RE28916
253686     RE28936
125277     RE28948
249352     RE28969
588161     RE28987
1043019    RE28994
2712421    RE29002
2820247    RE29034
19104      RE29036
27373      RE29041
55107      RE29047
68150      RE29050
2179126    RE29052
            ...   
5901145    RE45560
6412769    RE45585
5508558    RE45611
5901165    RE45622
6309234    RE45623
6347756    RE45624
6193298    RE45657
6200711    RE45674
6131766    RE45679
6352299    RE45712
6221335    RE45715
6018866    RE45741
6107284    RE45787
6217445    RE45836
6647303    RE45837
6045515    RE45843
6059342    RE45863
6645132    RE45864
6404465    RE45914
6610508    RE45915
6610509    RE45961
6138495    R

Generate the query URLs for all the cited patents we need to look up.

In [26]:
urls = []
base_url = "http://www.patentsview.org/api/patents/query?"
# Fields to return for each cited patent
field_list = "&f=[\"patent_number\",\"uspc_subclass_id\",\"uspc_mainclass_id\"]"
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for i, value in needed_patents.items():
    query = "q={{\"patent_number\":\"{}\"}}".format(value)
    urls.append(base_url + query + field_list)


In [37]:
pd.DataFrame(urls).to_csv('citation_urls.csv', index=False, header=True)
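
The query strings above are assembled with hand-escaped quotes. As a possible alternative, the sketch below builds the same URL with `json.dumps`, which takes care of the quoting. The helper `build_url` and the constants `BASE_URL` and `FIELDS` are hypothetical names introduced here; the base URL and field names are copied from the cell above, and the result is left unencoded to match the original strings.

```python
import json

BASE_URL = "http://www.patentsview.org/api/patents/query?"
FIELDS = ["patent_number", "uspc_subclass_id", "uspc_mainclass_id"]

def build_url(patent_number):
    """Build a query URL for one patent number (hypothetical helper)."""
    # Compact separators reproduce the hand-written JSON, which has no spaces
    q = json.dumps({"patent_number": str(patent_number)}, separators=(",", ":"))
    f = json.dumps(FIELDS, separators=(",", ":"))
    return "{}q={}&f={}".format(BASE_URL, q, f)

example_url = build_url("3930271")
print(example_url)
```

If the HTTP client used later does not percent-encode the query string itself, passing `q` and `f` through `urllib.parse.quote` before joining them may be needed.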