# Patent Impact Calculation according to Kelly et al. (2021)

- **Novel** patents are distinct from their predecessors

- **Impactful** patents influence future scientific advances, manifested as high similarity with subsequent innovations/patents

- An **important** patent is both novel **and** impactful

-> An **important** patent is **dissimilar** to previous patents and **similar** to future patents

### Novelty: 

Novel patents should be distinct from earlier patents, i.e., a patent is novel if its similarity to prior patents is low. The authors suggest measuring novelty with the Backward Similarity (BS), which is the cumulative similarity between a focal patent $j$ and all patents filed in the $5$ years prior to $j$. 
A novel patent should have a low BS score, as it is dissimilar to previous patents. 

Kelly et al. calculate the BS as follows:
$$
BS_{j}^{\tau} = \sum_{i \in \mathcal{B}_{j, \tau}} p_{j, i}
$$
where $\mathcal{B}_{j, \tau}$ denotes the set of prior patents filed in the $\tau = 5$ years before patent $j$.

### Impact

Impactful patents should be similar to future patents, i.e., a patent is impactful if its similarity to future patents is high. This is calculated as the Forward Similarity (FS), which is the cumulative similarity between a focal patent $j$ and all patents filed in the $10$ years after $j$.

FS is calculated as follows:

$$
FS_{j}^{\tau} = \sum_{i \in \mathcal{F}_{j, \tau}} p_{j, i}
$$

where $\mathcal{F}_{j, \tau}$ denotes the set of patents filed in the $\tau = 10$ years after patent $j$. 

### Importance
An important patent should be novel and impactful, i.e., the BS score should be low and the FS score should be high. This can be expressed as the ratio of FS to BS:

$$
q_{j}^{\tau} = \frac{FS_{j}^{\tau}}{BS_{j}^{\tau}}
$$

The value will be high if BS is low and FS is high.
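
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook's implementation: `shift_years`, `raw_importance`, the toy ids, dates, and similarity values are all hypothetical, and whether the window boundaries are inclusive is an assumption.

```python
from datetime import datetime

def shift_years(d, years):
    """Move a date by whole years (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def raw_importance(focal_id, dates, similarity, backward_years=5, forward_years=10):
    """Unnormalized BS, FS and q = FS / BS for one focal patent."""
    focal_date = dates[focal_id]
    window_start = shift_years(focal_date, -backward_years)
    window_end = shift_years(focal_date, forward_years)
    bs, fs = 0.0, 0.0
    for other_id, other_date in dates.items():
        if other_id == focal_id:
            continue
        if window_start <= other_date < focal_date:
            bs += similarity(focal_id, other_id)   # member of the backward set
        elif focal_date < other_date <= window_end:
            fs += similarity(focal_id, other_id)   # member of the forward set
    q = fs / bs if bs > 0 else float("nan")
    return bs, fs, q

# Toy data for illustration only (hypothetical ids, dates and similarities).
toy_dates = {
    "a": datetime(1890, 1, 1),
    "b": datetime(1893, 6, 1),
    "focal": datetime(1895, 1, 1),
    "c": datetime(1899, 1, 1),
    "d": datetime(1904, 12, 31),
    "e": datetime(1906, 1, 1),   # outside the 10-year forward window
}
toy_sim = {"a": 0.1, "b": 0.3, "c": 0.4, "d": 0.2, "e": 0.9}

def toy_similarity(i, j):
    """Stand-in for the pairwise similarity p_{j,i}."""
    return toy_sim[j]

bs, fs, q = raw_importance("focal", toy_dates, toy_similarity)
print(bs, fs, q)
```

In this toy example, patent `e` is highly similar to the focal patent but does not contribute to FS because it lies outside the forward window.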

**And here is the issue**: 

The number of patents in the sets $\mathcal{B}_{j, \tau}$ and $\mathcal{F}_{j, \tau}$ varies from patent to patent. For a focal patent $j$, the number of patents in $\mathcal{B}_{j, \tau}$ or $\mathcal{F}_{j, \tau}$ may be large, which leads to a high BS or FS even though each individual similarity between $j$ and the patents in these sets is low (and vice versa for small sets). The number of patents in each set therefore heavily influences the BS and FS scores, which in turn determine the importance score. The individual similarity scores used to calculate FS and BS, on the other hand, do not influence the importance score as they are supposed to. 

See the [figure](#num_patents_per_year) below: a patent from 1900 will have a lower BS than a patent from 1915 simply because fewer patents are available in the $5$ years before 1900 than in the $5$ years before 1915. 
A patent from 1900 will have a high importance score because the number of previous patents is low (resulting in a low BS) relative to the number of patents after 1900 (resulting in a larger FS).

<a name="num_patents_per_year"></a>
![num_patents_per_year](plots/num_patents_per_year.png)

### Solution:

An easy way to prevent this is to normalize the FS and BS scores by the sizes of the sets $\mathcal{B}_{j, \tau}$ and $\mathcal{F}_{j, \tau}$, i.e., the number of patents in the $\tau = 5$ years before and the $\tau = 10$ years after the focal patent. This corresponds to taking the average similarity. The resulting formulas are as follows:


$$
BS_{j}^{\tau} = \frac{1}{| \mathcal{B}_{j, \tau} |} \cdot \sum_{i \in \mathcal{B}_{j, \tau}} p_{j, i}
$$

and

$$
FS_{j}^{\tau} = \frac{1}{| \mathcal{F}_{j, \tau} |} \cdot \sum_{i \in \mathcal{F}_{j, \tau}} p_{j, i}
$$

That way, the importance scores between patents are comparable and the actual similarity scores between patents influence the importance value of a patent. 
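
The effect of the normalization can be illustrated with a small sketch. The data is hypothetical: every pairwise similarity is fixed at 0.02 and only the set sizes differ, so any difference in the importance score comes from patent counts alone. The helper names `raw_score` and `normalized_score` are illustrative, and returning 0.0 for an empty set is an assumption.

```python
def raw_score(similarities):
    """Cumulative similarity, as in the original BS/FS definition."""
    return sum(similarities)

def normalized_score(similarities):
    """Average similarity; returns 0.0 for an empty set (an assumption)."""
    return sum(similarities) / len(similarities) if similarities else 0.0

# Hypothetical data: identical pairwise similarities, different set sizes.
early = {"backward": [0.02] * 100, "forward": [0.02] * 1000}
late = {"backward": [0.02] * 1000, "forward": [0.02] * 1000}

for name, patent in (("early", early), ("late", late)):
    raw_q = raw_score(patent["forward"]) / raw_score(patent["backward"])
    norm_q = normalized_score(patent["forward"]) / normalized_score(patent["backward"])
    print(f"{name}: raw q = {raw_q:.2f}, normalized q = {norm_q:.2f}")
```

With the summed scores, the early patent appears roughly ten times as important as the late one. With the averaged scores, both have an importance of about 1.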

### Patent Similarity
The patent similarity is calculated based on the TFBIDF score. 
For a patent $p$ and a term $w$, the TFBIDF score is calculated as follows:
$$
    TFBIDF_{w, p} = TF_{w, p} \cdot BIDF_{w, p}
$$
where 
$$
    TF_{w, p} = \frac{c_{p,w}}{\sum_{k}{}c_{p,k}}
$$
and
$$
    BIDF_{w, p} = \log \left( \frac{\# \text{ patents prior to } p}{1 + \# \text{ documents prior to } p \text{ that include term } w} \right)
$$
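
The TF and BIDF formulas can be sketched directly in Python. The function names and toy documents below are hypothetical and separate from the notebook's `tfbidf_optimized` module. Using the natural logarithm and returning 0.0 when there are no prior patents are both assumptions.

```python
import math
from collections import Counter

def tf_scores(tokens):
    """TF_{w,p}: count of term w in the patent divided by the total term count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

def bidf_score(term, prior_docs):
    """BIDF_{w,p}: log(# prior patents / (1 + # prior patents containing w))."""
    n_prior = len(prior_docs)
    if n_prior == 0:
        return 0.0  # assumption: no prior patents means no BIDF weight
    n_with_term = sum(1 for doc in prior_docs if term in doc)
    return math.log(n_prior / (1 + n_with_term))

def tfbidf_scores(tokens, prior_docs):
    """TFBIDF_{w,p} = TF_{w,p} * BIDF_{w,p} for every term in the patent."""
    prior_sets = [set(doc) for doc in prior_docs]
    return {term: tf * bidf_score(term, prior_sets)
            for term, tf in tf_scores(tokens).items()}

# Toy corpus for illustration only.
prior_docs = [["stål", "ugn"], ["stål", "form"], ["metall", "form"], ["apparat"]]
tokens = ["stål", "stål", "ugn", "gjutskänk"]
scores = tfbidf_scores(tokens, prior_docs)
print(scores)
```

A term that appears in no prior patent ("gjutskänk" here) receives the largest BIDF weight, which is what makes the score sensitive to new vocabulary.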

To calculate the similarity $p_{i, j}$ between patents $i$ and $j$, create two vectors $V_i$ and $V_j$ whose length is the size of the union of terms in patents $i$ and $j$. In these vectors, store the TFBIDF scores for each term $w$ in $i$ and $j$, respectively. Finally, normalize both vectors to unit length; their dot product is then the cosine similarity: 

$$
    p_{i, j} = V_{i} \cdot V_{j}
$$
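
A minimal sketch of this similarity step, assuming each patent's TFBIDF scores are available as a dictionary from term to score. The function name `cosine_similarity` and the toy scores are illustrative, and returning 0.0 for an all-zero vector is an assumption.

```python
import math

def cosine_similarity(scores_i, scores_j):
    """Dot product of the unit-normalized TFBIDF vectors of two patents."""
    # Build both vectors over the union of terms; missing terms get 0.
    vocabulary = sorted(set(scores_i) | set(scores_j))
    v_i = [scores_i.get(term, 0.0) for term in vocabulary]
    v_j = [scores_j.get(term, 0.0) for term in vocabulary]
    norm_i = math.sqrt(sum(x * x for x in v_i))
    norm_j = math.sqrt(sum(x * x for x in v_j))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # assumption: an empty or all-zero vector is similar to nothing
    return sum((a / norm_i) * (b / norm_j) for a, b in zip(v_i, v_j))

# Toy TFBIDF scores for illustration only.
a = {"stål": 3.0, "ugn": 4.0}
b = {"stål": 3.0, "form": 4.0}
print(cosine_similarity(a, b))
print(cosine_similarity(a, a))
```

Because of floating-point rounding, the similarity of a patent with itself can come out marginally above or below 1.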

# Code demonstration

In [1]:
from tfbidf_optimized import *

generating patent mapping
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
generating term count dictionary
generating tf mapping
generating sorted patent list


### Pre-Computations

**generate_json** creates a mapping from a patent id to meta-information such as its grant date and file path, and also stores the preprocessed text of the patent.

In [6]:
textfolder_path = "../data_preprocessed/"
jsonfolder_path = "../json/"
PATENT_MAPPING = generate_json(textfolder_path, jsonfolder_path, start_year=1885, end_year=1900)

Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)
Expecting value: line 1 column 1 (char 0)


In [9]:
PATENT_MAPPING["1000"]["date"]

datetime.datetime(1887, 8, 19, 0, 0)

In [10]:
PATENT_MAPPING["1000"]["text"]

['patent',
 'beskrifning',
 'offentliggjord',
 'ap',
 'kongl',
 'patentbyrån',
 'caspersson',
 'forsbacka',
 'margretehill',
 'sätt',
 'tillverkning',
 'af',
 'jer',
 'stål',
 'bessemer',
 'metod',
 'gjuta',
 'metall',
 'direkt',
 'ugn',
 'uti',
 'formarne',
 'jemte',
 'härför',
 'afsedd',
 'gjutskänk',
 'patent',
 'Sverige',
 'maj',
 'fråga',
 'varande',
 'gjutningsmetod',
 'af',
 'sedd',
 'apparat',
 'benämna',
 'konverterskänk',
 'kunna',
 'form',
 'variera',
 'vidfogad',
 'ritning',
 'visad',
 'form',
 'af',
 'cylind',
 'fig',
 'af',
 'jernplåt',
 'invändig',
 'infodra',
 'eldfast',
 'material',
 'fäsa',
 'sätt',
 'fig',
 'utvisa',
 'bessemerugn',
 'skorst',
 'blåsning',
 'fullbordad',
 'fig',
 'visa',
 'läge',
 'bessemerusn',
 'slutad',
 'blasning',
 'böra',
 'hafva',
 'apparat',
 'aptera',
 'me',
 'dels',
 'åtdragning',
 'af',
 'tvenne',
 'kila',
 'stadig',
 'fästa',
 'densamwma',
 'uttappning',
 'formarnea',
 'hvilken',
 'ske',
 'vanlig',
 'sätt',
 'medelst',
 'häfstång',
 'spe'

**generate_tf_mapping** computes TF scores for all terms in the patents of a given time range. This is highly memory-intensive, but speeds up later calculations.

In [11]:
TF_MAPPING = generate_tf_mapping(start_date=1885, end_date=1900)

In [16]:
TF_MAPPING["10000"]["stockholm"]

0.0019157088122605363

**calculate_term_frequencies_per_patent** precomputes the counts needed to speed up the computationally expensive BIDF calculation.

$$
    BIDF_{w, p} = \log \left( \frac{\# \text{ patents prior to } p}{1 + \# \text{ documents prior to } p \text{ that include term } w} \right)
$$

For every patent $p$, it keeps track of the $\# \text{ of documents prior to } p$ and the $\# \text{ documents prior to } p \text{ that include term } w$. 

While keeping track of the number of documents prior to $p$ does not require much memory, keeping track of the number of occurrences of each term up to a certain patent does. Here is how it is done: from a given *total_start_year* (1885 in our case) until *start_year*, a dictionary *global_term_count* stores the number of patents each term occurs in. This can be seen as a pre-computation.
From *start_year* to *end_year* (this is the memory-intensive part), the number of documents that contain each term seen so far is stored for each patent id. This trades memory for computation, which in turn heavily reduces the runtime of the algorithm.
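
The idea of storing one snapshot of running document counts per patent can be sketched as follows. This is not the code behind `calculate_term_frequencies_per_patent`. The function name is hypothetical, the dictionary keys only mirror the shape of the outputs shown in this notebook, and taking the snapshot before the current patent is counted is an assumption based on the "prior to $p$" wording.

```python
from collections import Counter

def document_frequency_snapshots(sorted_patents):
    """sorted_patents: list of (patent_id, tokens) in chronological order."""
    running_df = Counter()   # term -> number of patents seen so far that contain it
    snapshots = {}
    for num_prior, (patent_id, tokens) in enumerate(sorted_patents):
        snapshots[patent_id] = {
            "num_prior_patents": num_prior,
            "counts": dict(running_df),   # full copy: this is the memory cost
        }
        running_df.update(set(tokens))    # each patent counts once per term
    return snapshots

# Toy chronological corpus for illustration only.
toy_patents = [
    ("1", ["stål", "ugn", "stål"]),
    ("2", ["stål", "form"]),
    ("3", ["ugn"]),
]
snapshots = document_frequency_snapshots(toy_patents)
print(snapshots["3"])
```

Each snapshot makes a later BIDF lookup a constant-time dictionary access, at the price of copying the growing count dictionary once per patent.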

In [None]:
TERM_COUNT_PER_PATENT = calculate_term_frequencies_per_patent(1885, 1885, 1900)

In [22]:
TERM_COUNT_PER_PATENT["10000"]["num_prior_patents"]

8569

In [40]:
list(TERM_COUNT_PER_PATENT["10000"]["counts"].items())[:20]

[('väfs', 2),
 ('oljor', 78),
 ('hvarefta', 376),
 ('behandling', 488),
 ('patentanspråk', 8435),
 ('skodon', 62),
 ('fernissa', 51),
 ('ottentliggjord', 16),
 ('torka', 528),
 ('läd', 212),
 ('bereda', 243),
 ('olja', 333),
 ('retziusekvall', 2),
 ('sätt', 6425),
 ('använda', 4498),
 ('andsifna', 2),
 ('hvarje', 2572),
 ('patentbyrån', 5496),
 ('stockholm', 7703),
 ('kongl', 5600)]
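
As a worked example, the values printed above can be plugged into the BIDF formula by hand for the term "stockholm" in patent 10000. This assumes that the stored count is the number of prior documents containing the term and that the logarithm is the natural one. The variable names are illustrative.

```python
import math

num_prior_patents = 8569        # TERM_COUNT_PER_PATENT["10000"]["num_prior_patents"]
docs_with_term = 7703           # count listed for 'stockholm' above
tf = 0.0019157088122605363      # TF_MAPPING["10000"]["stockholm"]

# BIDF = log(# prior patents / (1 + # prior patents containing the term))
bidf = math.log(num_prior_patents / (1 + docs_with_term))
tfbidf = tf * bidf
print(f"BIDF: {bidf:.4f}, TFBIDF: {tfbidf:.6f}")
```

Since "stockholm" occurs in most prior patents, its BIDF weight is small, so it contributes little to the similarity between patents.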

### Term Frequency

In [24]:
patent_id = "10000"
term = "patent"
calculate_tf(patent_id, term)

0.007662835249042145

In [25]:
patent_id = "10000"
term = "thisistrash"
calculate_tf(patent_id, term)

0

### Backward Inverse Document Frequency (BIDF)

In [26]:
earlier_patent_id = "10000"
term = "patent"
calculate_bidf(term, earlier_patent_id)

0.00011670654153411112

### Patent Similarity

Calculated according to the formula above.

In [30]:
patent_id_i = "10000"
patent_id_j = "10000"
calculate_patent_similarity(patent_id_i, patent_id_j)

1.0000000000000002

In [32]:
patent_id_i = "10000"
patent_id_j = "11000"
calculate_patent_similarity(patent_id_i, patent_id_j)

0.011435583355308634

### Backward Similarity

In [33]:
patent_id = "10000"
calculate_backward_similarity(patent_id, backward_years=5)

backward similarity for patent 10000: 0.022385680427972097


0.022385680427972097

### Forward Similarity

In [34]:
calculate_forward_similarity(patent_id, forward_years=10)

forward similarity for patent 10000: 0.02317778213777927


0.02317778213777927

### Patent Importance

In [36]:
novelty, impact, importance = calculate_patent_importance(patent_id, backward_years=5, forward_years=10)
print(f"novelty: {novelty}, impact: {impact}, importance: {importance}")

backward similarity for patent 10000: 0.022385680427972097
forward similarity for patent 10000: 0.02317778213777927
novelty: 0.022385680427972097, impact: 0.02317778213777927, importance: 1.0353843034772086


In [37]:
# also get an array of the *10* terms in this patent with highest tfbidf score
top_tfbidf_terms_ls = get_top_n_tfbidf_scores(patent_id, n=10)
top_tfbidf_terms_ls

array(['smergelskifvan', 'slipningskon', 'kon', 'slipskifvan',
       'slipningsskifva', 'smergel', 'slipningsrumm', 'slipningsorga',
       'gjutstålskula', 'slipningsrum'], dtype='<U18')

These calculations are repeated for a given range of patents. Results are stored in a pandas dataframe.