This notebook explores the robustness of the eight measures at the patent and industrial level.

The eight measures are listed below

# At the patent value

Binary main class D-D

Binary sub class D-D

Continuous main class D-D

Continuous sub class D-D

Binary main class D-U

Binary sub class D-U

Continuous main class D-U

Continuous sub class D-U

# At the industry level
Binary main class (%) D-D

Binary sub class (%) D-D

Continuous main class (mean of max scores) D-D

Continuous sub class (mean of max scores) D-D

Binary main class (%) D-U

Binary sub class (%) D-U

Continuous main class (mean of max scores) D-U

Continuous sub class (mean of max scores) D-U

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import design_class_dictionary

%matplotlib inline

In [2]:
#reading in data
master = pd.read_csv('data/master_503107.csv')
master.head()

Unnamed: 0,patent_number,grant_year,app_year,num_inventors,us_inventor,cite_foreign_patent,is_missing,num_design_cited,num_utility_cited,non-pat_refs,num_figures,num_assignees,missing_data,family_size,country_transformed,priority_year
0,D257752,1981,1980,1,1,0,1,1,2,1,6,1,0,1,US,1980.0
1,D257924,1981,1980,1,1,0,1,1,2,1,6,1,0,1,US,1980.0
2,D258382,1981,1980,2,1,0,1,2,1,0,5,1,0,1,US,1980.0
3,D258383,1981,1980,2,1,0,1,2,1,0,5,1,0,1,US,1980.0
4,D258678,1981,1980,2,1,0,1,2,1,0,5,1,0,1,US,1980.0


In [3]:
binary_main_d = pd.read_csv('data/d2d_main_binary.csv')
binary_sub_d = pd.read_csv('data/d2d_sub_binary.csv')
cont_main_d = pd.read_csv('data/d2d_main_cont.csv')
cont_sub_d = pd.read_csv('data/d2d_sub_cont.csv')

binary_main_u = pd.read_csv('data/d2u_main_binary.csv')
binary_sub_u = pd.read_csv('data/d2u_sub_binary.csv')
cont_main_u = pd.read_csv('data/d2u_main_cont.csv')
cont_sub_u = pd.read_csv('data/d2u_sub_cont.csv')

In [4]:
binary_main_d.head()

Unnamed: 0,patent_number,is novel,novelty count
0,D258382,0,0
1,D258383,0,0
2,D258678,0,0
3,D258755,0,0
4,D258990,1,2


In [5]:
cont_main_d.head()

Unnamed: 0,patent_number,max_novelty
0,D258382,-0.797657
1,D258383,-0.797657
2,D258678,-0.797657
3,D258755,-0.797657
4,D258990,0.809274


# Explore at the patent level first

In [6]:
def add_column(df, column, column_name):
    df = pd.merge(df, column, on='patent_number', how='left')
    return df
    

In [7]:
# Merge the binary measures into master dataframe

master = pd.merge(master, binary_main_d, on='patent_number', how='left')
master = master.rename(index=str, columns={'is novel':'is novel binary main: D-D',\
                                  'novelty count':'novelty count binary main: D-D'})

master = pd.merge(master, binary_sub_d, on='patent_number', how='left' )
master = master.rename(index=str, columns={'is novel':'is novel binary sub: D-D',\
                                          'novelty count':'novelty count binary sub: D-D'})

master = pd.merge(master, binary_sub_u, on='patent_number', how='left')
master = master.rename(index=str, columns={'is novel':'is novel binary main: D-U',\
                                          'novelty count':'novelty count binary main: D-U'})

master = pd.merge(master, binary_sub_u, on='patent_number', how='left')
master = master.rename(index=str, columns={'is novel':'is novel binary sub: D-U',\
                                          'novelty count':'novelty count binary sub: D-U'})


In [8]:
# merge the continuous into master dataframe

master = pd.merge(master, cont_main_d, on='patent_number', how='left')
master = master.rename(index=str, columns={'max_novelty':'max novelty main: D-D'})

master = pd.merge(master, cont_sub_d, on='patent_number', how='left' )
master = master.rename(index=str, columns={'max_novelty':'max novelty sub: D-D'})

master = pd.merge(master, cont_sub_u, on='patent_number', how='left')
master = master.rename(index=str, columns={'max_novelty':'max novelty main: D-U'})

master = pd.merge(master, cont_sub_u, on='patent_number', how='left')
master = master.rename(index=str, columns={'max_novelty':'max novelty sub: D-U'})


In [9]:
#select the columns with just the measures
columns = master[['is novel binary main: D-D','is novel binary sub: D-D',\
        'is novel binary main: D-U','is novel binary sub: D-U','max novelty main: D-D',\
        'max novelty sub: D-D','max novelty main: D-U','max novelty sub: D-U']]
columns.head()

Unnamed: 0,is novel binary main: D-D,is novel binary sub: D-D,is novel binary main: D-U,is novel binary sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U
0,,,1.0,1.0,,,-4.951017,-4.951017
1,,,1.0,1.0,,,-5.724207,-5.724207
2,0.0,1.0,,,-0.797657,-5.392617,,
3,0.0,1.0,,,-0.797657,-5.392617,,
4,0.0,1.0,,,-0.797657,-5.392617,,


Examing the correlation of design citing design measures for binary and continous measures, at mainclass and subclass level

In [10]:
matrix = columns[['is novel binary main: D-D','is novel binary sub: D-D','max novelty main: D-D',\
                  'max novelty sub: D-D']].dropna()

# count the number of observations
matrix.shape

(475267, 4)

There are 475,267 non-null observations for these measures

In [11]:
#correlation matrix
matrix.corr().round(3)

Unnamed: 0,is novel binary main: D-D,is novel binary sub: D-D,max novelty main: D-D,max novelty sub: D-D
is novel binary main: D-D,1.0,0.066,0.038,-0.012
is novel binary sub: D-D,0.066,1.0,0.478,0.255
max novelty main: D-D,0.038,0.478,1.0,0.475
max novelty sub: D-D,-0.012,0.255,0.475,1.0


Examing the correlation of design citing utility measures for binary and continous measures, at mainclass and subclass level

In [12]:
matrix = columns[['is novel binary main: D-U','is novel binary sub: D-U','max novelty main: D-U',\
                  'max novelty sub: D-U']].dropna()

#count the number of observations
matrix.shape

(308725, 4)

There are 308,725 non-null observations for these measures

In [13]:
matrix.corr().round(3)

Unnamed: 0,is novel binary main: D-U,is novel binary sub: D-U,max novelty main: D-U,max novelty sub: D-U
is novel binary main: D-U,1.0,1.0,0.168,0.168
is novel binary sub: D-U,1.0,1.0,0.168,0.168
max novelty main: D-U,0.168,0.168,1.0,1.0
max novelty sub: D-U,0.168,0.168,1.0,1.0


Comparing all 8 measures

In [14]:
matrix = columns.dropna()

#count the number of observations
matrix.shape

(280885, 8)

There are 280,885 observations

In [15]:
matrix.corr().round(3)

Unnamed: 0,is novel binary main: D-D,is novel binary sub: D-D,is novel binary main: D-U,is novel binary sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U
is novel binary main: D-D,1.0,0.049,0.024,0.024,0.028,-0.009,-0.01,-0.01
is novel binary sub: D-D,0.049,1.0,0.328,0.328,0.483,0.282,0.105,0.105
is novel binary main: D-U,0.024,0.328,1.0,1.0,0.181,0.088,0.192,0.192
is novel binary sub: D-U,0.024,0.328,1.0,1.0,0.181,0.088,0.192,0.192
max novelty main: D-D,0.028,0.483,0.181,0.181,1.0,0.487,0.235,0.235
max novelty sub: D-D,-0.009,0.282,0.088,0.088,0.487,1.0,0.591,0.591
max novelty main: D-U,-0.01,0.105,0.192,0.192,0.235,0.591,1.0,1.0
max novelty sub: D-U,-0.01,0.105,0.192,0.192,0.235,0.591,1.0,1.0


In [16]:
#save the table
matrix.to_csv('data/design_cite_design_design_cite_utility_corr.csv')

# Examing at the industrial level

We need to attribute each patent to an industry. I have taken the first listed class, known as primary class, of a patent as the industry

In [17]:
# read in the data file from PatStat that lists all patent classes and their sequences
pat_class = pd.read_csv('data/uspc_current.tsv', delimiter='\t', usecols=['patent_id','mainclass_id','sequence'],\
                        dtype={'patent_id':str})
pat_class.head()

Unnamed: 0,patent_id,mainclass_id,sequence
0,3930271,2,0
1,3930271,2,1
2,3930271,2,2
3,3930271,473,3
4,3930272,5,0


In [18]:
#select first listed class
pat_class_first = pat_class.loc[pat_class.sequence == 0]
master = pd.merge(master, pat_class_first, left_on='patent_number',right_on='patent_id', how='left')

In [19]:
master.head()

Unnamed: 0,patent_number,grant_year,app_year,num_inventors,us_inventor,cite_foreign_patent,is_missing,num_design_cited,num_utility_cited,non-pat_refs,...,novelty count binary main: D-U,is novel binary sub: D-U,novelty count binary sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U,patent_id,mainclass_id,sequence
0,D257752,1981,1980,1,1,0,1,1,2,1,...,3.0,1.0,3.0,,,-4.951017,-4.951017,D257752,D19,0.0
1,D257924,1981,1980,1,1,0,1,1,2,1,...,3.0,1.0,3.0,,,-5.724207,-5.724207,D257924,D06,0.0
2,D258382,1981,1980,2,1,0,1,2,1,0,...,,,,-0.797657,-5.392617,,,D258382,D23,0.0
3,D258383,1981,1980,2,1,0,1,2,1,0,...,,,,-0.797657,-5.392617,,,D258383,D23,0.0
4,D258678,1981,1980,2,1,0,1,2,1,0,...,,,,-0.797657,-5.392617,,,D258678,D23,0.0


In [20]:
#Checking the primary classes of these patents for uniformity or unexpected values
master.mainclass_id.unique()

array(['D19', 'D06', 'D23', 'D25', 'D09', 'D24', 'D11', 'D12', 'D21',
       'D08', 'D10', 'D07', 'D22', 'D02', 'D13', 'D34', 'D17', 'D18',
       'D30', 'D14', 'D32', 'D28', 'D03', 'D27', 'D05', 'D15', 'D26',
       'D16', 'D20', 'D99', 'D04', 'D01', 'D29', 'D8', 'D9', 'D6', 'D7',
       '1', '70', '24', nan], dtype=object)

In [21]:
# harmonize the class names. ie D8 -> D08
master['mainclass_id'] = master['mainclass_id'].replace(design_class_dictionary.class_dictionary)

In [22]:
# Dropping the null value, it's useless
master = master.dropna(subset=['mainclass_id'])

#Selecting only valid primary design classes
master = master.loc[master.mainclass_id.str.contains('D')]

In [23]:
#selecting the columns of the measures and the primary classes
columns = master[['is novel binary main: D-D','is novel binary sub: D-D',\
        'is novel binary main: D-U','is novel binary sub: D-U','max novelty main: D-D',\
        'max novelty sub: D-D','max novelty main: D-U','max novelty sub: D-U', 'mainclass_id']]
columns.head()

Unnamed: 0,is novel binary main: D-D,is novel binary sub: D-D,is novel binary main: D-U,is novel binary sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U,mainclass_id
0,,,1.0,1.0,,,-4.951017,-4.951017,D19
1,,,1.0,1.0,,,-5.724207,-5.724207,D06
2,0.0,1.0,,,-0.797657,-5.392617,,,D23
3,0.0,1.0,,,-0.797657,-5.392617,,,D23
4,0.0,1.0,,,-0.797657,-5.392617,,,D23


In [24]:
'''
Creating the table of measures consolidated at the industry level
The mean is taken for the max novelty and binary measures
Since the binary measure only takes on the value of 1 or 0, the mean is equivalent
calculating the percentage of positive occurences
'''
table = columns.groupby('mainclass_id').mean().round(3).reset_index()
table['class name'] = table['mainclass_id'].map(design_class_dictionary.class_labels)
table

Unnamed: 0,mainclass_id,is novel binary main: D-D,is novel binary sub: D-D,is novel binary main: D-U,is novel binary sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U,class name
0,D01,0.007,0.64,0.806,0.806,0.506,-2.838,-4.785,-4.785,Edible Products
1,D02,0.002,0.288,0.575,0.575,-0.895,-2.476,-4.426,-4.426,Apparel and Haberdashery
2,D03,0.003,0.456,0.774,0.774,0.871,-2.42,-3.88,-3.88,"Travel Goods, Personal Belongings, and Storage..."
3,D04,0.007,0.316,0.663,0.663,-0.728,-3.26,-4.247,-4.247,Brushware
4,D05,0.009,0.516,0.818,0.818,0.018,-2.961,-3.922,-3.922,Textile or Paper Yard Goods; Sheet Material
5,D06,0.002,0.387,0.741,0.741,0.62,-2.961,-4.738,-4.738,Furnishings
6,D07,0.002,0.414,0.706,0.706,0.655,-2.539,-4.115,-4.115,Equipment for Preparing or Serving Food or Dri...
7,D08,0.001,0.312,0.71,0.71,0.653,-3.101,-4.505,-4.505,Tools and Hardware
8,D09,0.002,0.499,0.763,0.763,0.754,-0.651,-2.995,-2.995,Packages and Containers for Goods
9,D10,0.002,0.29,0.74,0.74,0.241,-3.221,-5.214,-5.214,"Measuring, Testing or Signaling Instruments"


In [25]:
#save this table
table.to_csv('data/table.csv')

In [26]:
#correlation of measures at the industry level
table.corr().round(3)

Unnamed: 0,is novel binary main: D-D,is novel binary sub: D-D,is novel binary main: D-U,is novel binary sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U
is novel binary main: D-D,1.0,0.476,0.134,0.134,-0.129,-0.148,-0.333,-0.333
is novel binary sub: D-D,0.476,1.0,0.664,0.664,0.629,0.473,0.231,0.231
is novel binary main: D-U,0.134,0.664,1.0,1.0,0.62,0.339,0.338,0.338
is novel binary sub: D-U,0.134,0.664,1.0,1.0,0.62,0.339,0.338,0.338
max novelty main: D-D,-0.129,0.629,0.62,0.62,1.0,0.62,0.47,0.47
max novelty sub: D-D,-0.148,0.473,0.339,0.339,0.62,1.0,0.816,0.816
max novelty main: D-U,-0.333,0.231,0.338,0.338,0.47,0.816,1.0,1.0
max novelty sub: D-U,-0.333,0.231,0.338,0.338,0.47,0.816,1.0,1.0


In [27]:
#Save the correlation table
table.corr().round(3).to_csv('data/corr_table.csv')

Examin if there is a correlation between the above measures and if a patent experiences multiple binary pairs

In [28]:
# If the binary count is larger than 1, then the patent is marked as having multiple binary novelty
master['multiple novelty main: D-D'] = np.where(master['novelty count binary main: D-D'] > 1,1,0)
master['multiple novelty main: D-U'] = np.where(master['novelty count binary main: D-U'] > 1,1,0)
master['multiple novelty sub: D-D'] = np.where(master['novelty count binary sub: D-D'] > 1,1,0)
master['multiple novelty sub: D-U'] = np.where(master['novelty count binary sub: D-U'] > 1,1,0)

In [29]:
#Creating the measures table at the industry level that includes multiple binary
columns = master[['is novel binary main: D-D',\
                 'multiple novelty main: D-D',\
                 'is novel binary sub: D-D',\
                 'multiple novelty sub: D-D',\
                 'is novel binary main: D-U',\
                 'multiple novelty main: D-U',\
                 'is novel binary sub: D-U',\
                 'multiple novelty sub: D-U',\
                 'max novelty main: D-D',\
                 'max novelty sub: D-D',\
                 'max novelty main: D-U',\
                 'max novelty sub: D-U', \
                 'mainclass_id']]
table = columns.groupby('mainclass_id').mean().round(3).reset_index()
table['class name'] = table['mainclass_id'].map(design_class_dictionary.class_labels)
table

Unnamed: 0,mainclass_id,is novel binary main: D-D,multiple novelty main: D-D,is novel binary sub: D-D,multiple novelty sub: D-D,is novel binary main: D-U,multiple novelty main: D-U,is novel binary sub: D-U,multiple novelty sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U,class name
0,D01,0.007,0.001,0.64,0.464,0.806,0.494,0.806,0.494,0.506,-2.838,-4.785,-4.785,Edible Products
1,D02,0.002,0.0,0.288,0.177,0.575,0.289,0.575,0.289,-0.895,-2.476,-4.426,-4.426,Apparel and Haberdashery
2,D03,0.003,0.001,0.456,0.309,0.774,0.459,0.774,0.459,0.871,-2.42,-3.88,-3.88,"Travel Goods, Personal Belongings, and Storage..."
3,D04,0.007,0.001,0.316,0.18,0.663,0.4,0.663,0.4,-0.728,-3.26,-4.247,-4.247,Brushware
4,D05,0.009,0.001,0.516,0.361,0.818,0.543,0.818,0.543,0.018,-2.961,-3.922,-3.922,Textile or Paper Yard Goods; Sheet Material
5,D06,0.002,0.0,0.387,0.251,0.741,0.327,0.741,0.327,0.62,-2.961,-4.738,-4.738,Furnishings
6,D07,0.002,0.001,0.414,0.286,0.706,0.38,0.706,0.38,0.655,-2.539,-4.115,-4.115,Equipment for Preparing or Serving Food or Dri...
7,D08,0.001,0.0,0.312,0.172,0.71,0.473,0.71,0.473,0.653,-3.101,-4.505,-4.505,Tools and Hardware
8,D09,0.002,0.001,0.499,0.357,0.763,0.427,0.763,0.427,0.754,-0.651,-2.995,-2.995,Packages and Containers for Goods
9,D10,0.002,0.0,0.29,0.167,0.74,0.367,0.74,0.367,0.241,-3.221,-5.214,-5.214,"Measuring, Testing or Signaling Instruments"


In [30]:
#save this table
table.to_csv('data/table.csv')

In [31]:
#correlation
table.corr().round(3)

Unnamed: 0,is novel binary main: D-D,multiple novelty main: D-D,is novel binary sub: D-D,multiple novelty sub: D-D,is novel binary main: D-U,multiple novelty main: D-U,is novel binary sub: D-U,multiple novelty sub: D-U,max novelty main: D-D,max novelty sub: D-D,max novelty main: D-U,max novelty sub: D-U
is novel binary main: D-D,1.0,0.707,0.476,0.463,0.134,0.218,0.134,0.218,-0.129,-0.148,-0.333,-0.333
multiple novelty main: D-D,0.707,1.0,0.568,0.545,0.241,0.406,0.241,0.406,0.326,0.102,-0.071,-0.071
is novel binary sub: D-D,0.476,0.568,1.0,0.991,0.664,0.545,0.664,0.545,0.629,0.473,0.231,0.231
multiple novelty sub: D-D,0.463,0.545,0.991,1.0,0.639,0.481,0.639,0.481,0.613,0.523,0.256,0.256
is novel binary main: D-U,0.134,0.241,0.664,0.639,1.0,0.578,1.0,0.578,0.62,0.339,0.338,0.338
multiple novelty main: D-U,0.218,0.406,0.545,0.481,0.578,1.0,0.578,1.0,0.444,0.181,0.31,0.31
is novel binary sub: D-U,0.134,0.241,0.664,0.639,1.0,0.578,1.0,0.578,0.62,0.339,0.338,0.338
multiple novelty sub: D-U,0.218,0.406,0.545,0.481,0.578,1.0,0.578,1.0,0.444,0.181,0.31,0.31
max novelty main: D-D,-0.129,0.326,0.629,0.613,0.62,0.444,0.62,0.444,1.0,0.62,0.47,0.47
max novelty sub: D-D,-0.148,0.102,0.473,0.523,0.339,0.181,0.339,0.181,0.62,1.0,0.816,0.816


In [32]:
#Save the correlation table
table.corr().round(3).to_csv('data/corr_table.csv')

Below are a bit of descriptive stats for the above measures

In [33]:
master['is novel binary main: D-D'].value_counts()

0.0    474238
1.0      1026
Name: is novel binary main: D-D, dtype: int64

In [34]:
master['is novel binary sub: D-D'].value_counts()

0.0    317256
1.0    158008
Name: is novel binary sub: D-D, dtype: int64

In [35]:
master['is novel binary main: D-U'].value_counts()

1.0    212074
0.0     96649
Name: is novel binary main: D-U, dtype: int64

In [36]:
master['is novel binary sub: D-U'].value_counts()

1.0    212074
0.0     96649
Name: is novel binary sub: D-U, dtype: int64

In [37]:
master['max novelty main: D-D'].describe()

count    475264.000000
mean          0.222976
std           1.688454
min          -5.216022
25%          -1.211405
50%          -0.368211
75%           1.747975
max           5.728431
Name: max novelty main: D-D, dtype: float64

In [38]:
master['max novelty sub: D-D'].describe()

count    475264.000000
mean         -3.075206
std           2.098393
min         -11.636991
25%          -4.541145
50%          -3.012485
75%          -1.555243
max           4.032047
Name: max novelty sub: D-D, dtype: float64

In [39]:
master['max novelty main: D-U'].describe()

count    308723.000000
mean         -4.541657
std           1.839085
min         -13.114638
25%          -5.715160
50%          -4.442751
75%          -3.245814
max           2.409061
Name: max novelty main: D-U, dtype: float64

In [40]:
master['max novelty sub: D-U'].describe()

count    308723.000000
mean         -4.541657
std           1.839085
min         -13.114638
25%          -5.715160
50%          -4.442751
75%          -3.245814
max           2.409061
Name: max novelty sub: D-U, dtype: float64