# Merging PCT patents with location data

This workbook attempts to merge the PCT patent data with the location data.

The patent data comes from `PCT_cited&citing_noself.csv`, which contains PCT publication numbers and application IDs for both cited and citing patents. Cited patents are patents located in the UK that have subsequently been cited in the EPO dataset; citing patents are those that cite them. The `app_nbr` columns for both citing and cited patents hold the PCT publication number, which can be merged with the `first_and_subsequent` files.

To prepare for the merge with the location data, the patent data is merged with the `first and subsequent` dataset on the publication number, first for the citing patents and then for the cited patents. The citing merge reduces the 78,258 citing-cited pairs to 59,154, and the cited merge reduces them further to 28,315, all before any merge is done with the `geoc_inv` dataset, so no further analysis is carried out.
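
The two-step merge and the resulting loss of pairs can be sketched on toy data. This is only an illustration: the frames `pairs` and `lookup`, the `pair_id` column and all values are made up, standing in for the PCT data and the first and subsequent data.

```python
import pandas as pd

# Toy stand-ins for the real files (all names and values are illustrative)
pairs = pd.DataFrame({
    "pair_id": [0, 1, 2, 3],
    "Citing_app_nbr": ["WO1", "WO2", "WO3", "WO4"],
    "Cited_App_nbr": ["WO9", "WO9", "WO8", "WO7"],
})
lookup = pd.DataFrame({
    "pub_nbr": ["WO1", "WO2", "WO3", "WO9"],
    "appln_id": [11, 12, 13, 19],
})

# Inner merge on the citing key, then on the cited key.
# An inner merge silently drops every pair whose key is missing from lookup.
step1 = pairs.merge(lookup, left_on="Citing_app_nbr", right_on="pub_nbr")
step2 = step1.merge(lookup, left_on="Cited_App_nbr", right_on="pub_nbr",
                    suffixes=("_citing", "_cited"))

# Count the surviving pairs at each stage
funnel = {
    "start": pairs["pair_id"].nunique(),
    "after_citing_merge": step1["pair_id"].nunique(),
    "after_cited_merge": step2["pair_id"].nunique(),
}
print(funnel)
```

Counting unique pair identifiers after each step, rather than rows, keeps the count meaningful even if a merge duplicates rows.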

In [1]:
#import the necessary libraries
import pandas as pd
import numpy as np

In [2]:
#read in the PCT data
#this contains publication numbers for both the cited and citing applications
PCT_treatment = pd.read_csv("Patents data/PCT_cited&citing_noself.csv")
#drop any unnecessary columns
PCT_treatment.drop(["Citing_IPC", "IPC_subclass", "Cited_firm_id",
                    "citing_firm_id", "Cited_ind_id", "Citing_ind_id"], axis =1, inplace = True)

In [3]:
#check the resulting dataframe to make sure it is read in correctly
PCT_treatment
#"Unnamed: 0" acts as an identifier for each citing-cited pair and can be used to check whether pairs are lost in a merge

Unnamed: 0.1,Unnamed: 0,Citing_app_nbr,Citing_appln_id,Cited_App_nbr,Cited_Appln_id,prio_year
0,0,WO1979000092,15648948,WO1979000002,11473341.0,1977
1,1,WO1983000236,15649723,WO1979000002,11473341.0,1981
2,2,WO1993005833,47214466,WO1979000002,46912661.0,1991
3,3,WO1994011576,47250451,WO1979000002,46912661.0,1992
4,4,WO1997005434,47345574,WO1979000002,22670158.0,1995
5,5,WO1998007448,47388255,WO1979000002,22670158.0,1996
6,6,WO1980000202,43452175,WO1980000019,43451779.0,1978
7,7,WO1988001113,15653284,WO1980000019,43451779.0,1986
8,8,WO1990003927,47132716,WO1980000019,6509723.0,1988
9,9,WO1992000151,47179753,WO1980000019,22700929.0,1990


In [4]:
#count the unique citing-cited pairs
#initially there are 78,258 pairs
len(list(PCT_treatment["Unnamed: 0"].unique()))

78258

In [5]:
#read in the first and subsequent publications data
FS_WO = pd.read_csv("Patents data/first_and_subsequent_WO.csv")


In [6]:
FS_WO.count()

Unnamed: 0           5457928
appln_id             5457928
is_first             5457928
publn_auth           5457928
publn_nr             5457928
publn_nr_original    1729503
publn_kind           5457928
dtype: int64

In [7]:
#following the advice given in the paper, publn_auth and publn_nr are merged to form the pub_nbr column
FS_WO["pub_nbr"] = FS_WO["publn_auth"] + FS_WO["publn_nr"].astype(str) 
#FS_WO["og_pub_nbr"] = FS_WO["publn_auth"] + FS_WO["publn_nr_original"].astype(str)
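
One thing worth checking before merging is whether the concatenated key has a consistent format. The sketch below assumes, without having verified it on the real file, that `publn_nr` could contain a mix of integers, floats and strings; in that case `astype(str)` would produce keys such as `WO1979000002.0` that can never match. The helper `normalise_number` and the toy values are hypothetical.

```python
import pandas as pd

# Toy column with mixed types, standing in for publn_nr (values are illustrative)
toy = pd.DataFrame({
    "publn_auth": ["WO", "WO", "WO"],
    "publn_nr": [1979000002, 1979000002.0, " 1980000019"],
})

def normalise_number(value):
    """Return the number as a string without surrounding spaces or a trailing '.0'."""
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]
    return text

# Naive concatenation keeps the float suffix and the stray space
naive = toy["publn_auth"] + toy["publn_nr"].astype(str)
# Normalised concatenation gives one consistent key format
clean = toy["publn_auth"] + toy["publn_nr"].map(normalise_number)

print(naive.tolist())
print(clean.tolist())
```

If the real column turns out to be uniformly typed, this step changes nothing and the cause of the lost pairs lies elsewhere.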

In [8]:
#any unnecessary columns are removed
FS_WO.drop(["Unnamed: 0", "publn_auth", "publn_nr", "publn_nr_original", "publn_kind"],
          axis =1, inplace = True)

In [9]:
#the PCT treatment data is merged on the basis of the citing publication number with the first and subsequent data
PCT_treatment_merged = PCT_treatment.merge(FS_WO, left_on ="Citing_app_nbr", right_on = "pub_nbr")

In [10]:
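#the appln_id brought in from FS_WO is renamed
#a distinct name is used because PCT_treatment already has a Citing_appln_id column and a duplicate label would be ambiguous
PCT_treatment_merged.rename(columns = {"appln_id":"Citing_appln_id_FS"}, inplace = True)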

In [11]:
#only 59,154 of the 78,258 pairs remain, so for roughly 19,000 pairs the citing publication number is not contained in the first and subsequent dataset
len(list(PCT_treatment_merged["Unnamed: 0"].unique()))

59154

In [12]:
#check which pairs are not contained in the merged file
PCT_treatment_merged_list = list(PCT_treatment_merged["Unnamed: 0"].unique())
PCT_treatment_not_merged = PCT_treatment[~PCT_treatment["Unnamed: 0"].isin(PCT_treatment_merged_list)]
PCT_treatment_not_merged
#the unmatched pairs do not appear to be limited to years outside the range covered by the first_and_subsequent files
#the reason for this is not yet clear

Unnamed: 0.1,Unnamed: 0,Citing_app_nbr,Citing_appln_id,Cited_App_nbr,Cited_Appln_id,prio_year
0,0,WO1979000092,15648948,WO1979000002,11473341.0,1977
1,1,WO1983000236,15649723,WO1979000002,11473341.0,1981
2,2,WO1993005833,47214466,WO1979000002,46912661.0,1991
3,3,WO1994011576,47250451,WO1979000002,46912661.0,1992
4,4,WO1997005434,47345574,WO1979000002,22670158.0,1995
5,5,WO1998007448,47388255,WO1979000002,22670158.0,1996
6,6,WO1980000202,43452175,WO1980000019,43451779.0,1978
7,7,WO1988001113,15653284,WO1980000019,43451779.0,1986
8,8,WO1990003927,47132716,WO1980000019,6509723.0,1988
9,9,WO1992000151,47179753,WO1980000019,22700929.0,1990
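
A left merge with `indicator=True` is one way to look at the unmatched pairs more systematically, for example by priority year or by key length. The sketch below uses toy frames; `WO2005012345` and the `appln_id` value are invented for illustration and do not come from the data.

```python
import pandas as pd

# Toy stand-ins (illustrative values only)
pairs = pd.DataFrame({
    "Citing_app_nbr": ["WO1979000092", "WO1993005833", "WO2005012345"],
    "prio_year": [1977, 1991, 2004],
})
lookup = pd.DataFrame({"pub_nbr": ["WO2005012345"], "appln_id": [1]})

# A left merge keeps every pair; the _merge column records whether a match was found
flagged = pairs.merge(lookup, left_on="Citing_app_nbr", right_on="pub_nbr",
                      how="left", indicator=True)
unmatched = flagged[flagged["_merge"] == "left_only"]

# How the unmatched pairs are spread over priority years
by_year = unmatched.groupby("prio_year").size()
# Whether the unmatched keys share an unusual length, which would hint at a format mismatch
key_lengths = unmatched["Citing_app_nbr"].str.len().value_counts()

print(by_year)
print(key_lengths)
```

If the unmatched pairs cluster in particular years, coverage of the lookup file is the likely cause; if they are spread evenly, a key format difference is worth ruling out.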


In [13]:
#this is then repeated for the cited publication numbers
PCT_treatment_merged_cited = PCT_treatment_merged.merge(FS_WO, left_on ="Cited_App_nbr", right_on = "pub_nbr")

In [14]:
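#the appln_id brought in by the cited merge is renamed
#note that this differs only in capitalisation from the existing Cited_Appln_id column
PCT_treatment_merged_cited.rename(columns = {"appln_id":"Cited_appln_id"}, inplace = True)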

In [15]:
#the total loss after both merges is then checked
#roughly 50,000 pairs are lost (78,258 - 28,315 = 49,943)
len(list(PCT_treatment_merged_cited["Unnamed: 0"].unique()))

28315
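
Inner merges can also multiply rows when the lookup holds more than one row per key, which is why the checks above count unique pair identifiers rather than rows. Whether `pub_nbr` is unique in the first and subsequent data has not been checked here, so the sketch below is an assumption-driven illustration on toy data, with `pair_id` as a made-up stand-in for the pair identifier.

```python
import pandas as pd

# Toy lookup with a repeated key (illustrative values only)
lookup = pd.DataFrame({
    "pub_nbr": ["WO1", "WO1", "WO2"],
    "appln_id": [11, 11, 12],
    "is_first": [1, 0, 1],
})
pairs = pd.DataFrame({"pair_id": [0, 1], "Citing_app_nbr": ["WO1", "WO2"]})

# Count repeated keys in the lookup
n_dupes = int(lookup["pub_nbr"].duplicated().sum())

# Keep one row per key; which row to keep is a design choice that depends on the analysis
lookup_unique = lookup.drop_duplicates(subset="pub_nbr", keep="first")

# validate raises a MergeError if the right-hand keys are not unique
merged = pairs.merge(lookup_unique, left_on="Citing_app_nbr", right_on="pub_nbr",
                     validate="many_to_one")
print(n_dupes, len(merged))
```

Passing `validate="many_to_one"` makes the merge fail loudly instead of silently inflating the row count.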