# U.S. Patent Phrase-to-Phrase Matching - A Simple EDA 📜

In this competition, we are given pairs of phrases and asked to score how similar the two phrases are. The phrases are taken from patent archives. The idea is that measuring the similarity of phrases found in patents can help connect related documents and find the prior art needed when reviewing a patent.


I'll provide a quick and simple EDA to help you get started with this interesting competition!

## Imports

Let's start out by setting up our environment by importing the required modules:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path
from collections import Counter
from operator import itemgetter
import os

## A look at the provided data

Let's check what data is available to us:

In [None]:
data_path = Path('../input/us-patent-phrase-to-phrase-matching')
os.listdir(data_path)

We can see three CSVs: `train.csv`, `test.csv`, and `sample_submission.csv`. Let's look at `train.csv` more closely.

In [None]:
train_df = pd.read_csv(data_path/'train.csv')
train_df.head(10)

There are five columns:

`id` - a unique identifier for a pair of phrases

`anchor` - the first phrase

`target` - the second phrase

`context` - the CPC classification, which indicates the subject within which the similarity is to be scored

`score` - the similarity. This is sourced from a combination of **one or more manual expert ratings**.

Let's get some more information:

In [None]:
print(f'There are {len(train_df)} entries')

In [None]:
print(f'There are {len(train_df["anchor"].unique())} unique anchor phrases')

In [None]:
anchor_count = Counter(train_df["anchor"])

In [None]:
# Each value is the number of rows that share one anchor phrase
plt.hist(list(anchor_count.values()))
plt.title('Histogram of the number of entries per anchor phrase')

We can find out which anchor phrases have the fewest and the most entries:

In [None]:
min(anchor_count.items(), key=itemgetter(1))

In [None]:
max(anchor_count.items(), key=itemgetter(1))
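
`min` and `max` each return a single anchor, even when several are tied. As a small sketch, `Counter.most_common` lists several of the most and least frequent anchors at once. The phrases below are made-up placeholders, not values from `train.csv`; in the notebook you would call it on `anchor_count` directly.

```python
from collections import Counter

# Hypothetical stand-in for train_df["anchor"]; these are not real dataset values.
anchors = ["phrase a", "phrase b", "phrase a", "phrase c", "phrase a", "phrase b"]
anchor_count = Counter(anchors)

# most_common() returns (phrase, count) pairs sorted by count, highest first.
ranked = anchor_count.most_common()
top_2 = ranked[:2]      # the two most frequent anchors
bottom_2 = ranked[-2:]  # the two least frequent anchors

print(top_2)
print(bottom_2)
```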

What is the distribution of scores?

In [None]:
train_df['score'].hist()
plt.title('Histogram of scores')

Note that there are only five possible scores:

1.0 - Very close match. This is typically an exact match except possibly for differences in conjugation, quantity (e.g. singular vs. plural), and addition or removal of stopwords (e.g. “the”, “and”, “or”).

0.75 - Close synonym, e.g. “mobile phone” vs. “cellphone”. This also includes abbreviations, e.g. "TCP" -> "transmission control protocol".

0.5 - Synonyms which don’t have the same meaning (same function, same properties). This includes broad-narrow (hyponym) and narrow-broad (hypernym) matches.

0.25 - Somewhat related, e.g. the two phrases are in the same high level domain but are not synonyms. This also includes antonyms.

0.0 - Unrelated.
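
To make the score counts easier to read, the five values can be mapped to short labels. This is only a sketch: the label strings are my own shorthand for the descriptions above, and the score list is a made-up sample rather than the real `score` column.

```python
from collections import Counter

# Short labels paraphrasing the five score levels described above (my own wording).
SCORE_LABELS = {
    1.0: "very close match",
    0.75: "close synonym",
    0.5: "synonym with different meaning",
    0.25: "somewhat related",
    0.0: "unrelated",
}

# Hypothetical sample of scores; in the notebook, use train_df["score"] instead.
scores = [0.5, 0.25, 0.5, 0.0, 1.0, 0.75, 0.25, 0.5]

# Count how many pairs fall under each label.
label_count = Counter(SCORE_LABELS[s] for s in scores)

# Print one line per score level, from highest to lowest.
for score in sorted(SCORE_LABELS, reverse=True):
    label = SCORE_LABELS[score]
    print(f"{score:>4}: {label:<32} {label_count[label]}")
```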

In [None]:
train_df['score'].mean()

The context refers to the [Cooperative Patent Classification (CPC)](https://en.wikipedia.org/wiki/Cooperative_Patent_Classification) of the patent from which the phrases are taken.

The Wikipedia page provides useful information on this system:

> Patent publications are each assigned at least one classification term indicating the subject to which the invention relates and may also be assigned further classification and indexing terms to give further details of the contents. 
>
> ...
>
> The first letter is the "section symbol" consisting of a letter from "A" ("Human Necessities") to "H" ("Electricity") or "Y" for emerging cross-sectional technologies. This is followed by a two-digit number to give a "class symbol" ("A01" represents "Agriculture; forestry; animal husbandry; trapping; fishing"). 
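
Following the description quoted above, a context code can be split into its section letter and two-digit class number. The sketch below assumes every context is a three-character code such as `A01`. The helper name `split_context` is hypothetical, and only the two section titles mentioned in the quote are filled in.

```python
# Section titles mentioned in the quoted passage; other sections are deliberately left out.
SECTION_TITLES = {"A": "Human Necessities", "H": "Electricity"}


def split_context(code):
    """Split a context code such as 'A01' into (section letter, class digits)."""
    code = code.strip().upper()
    # Assumption: one section letter followed by a two-digit class number.
    if len(code) < 3 or not code[0].isalpha() or not code[1:3].isdigit():
        raise ValueError(f"unexpected context code: {code!r}")
    return code[0], code[1:3]


for code in ["A01", "H01", "F26"]:
    section, class_digits = split_context(code)
    title = SECTION_TITLES.get(section, "(title not listed here)")
    print(code, section, class_digits, title)
```

Grouping the contexts by section letter in this way gives a coarser view of the subject areas than the full class codes.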

Let's see how many unique tags we have:

In [None]:
context_count = Counter(train_df['context'])

In [None]:
print(f'There are {len(context_count)} unique contexts in the dataset')

In [None]:
# Each value is the number of rows that share one CPC tag
plt.hist(list(context_count.values()))
plt.title('Histogram of the number of entries per CPC tag')

Here are the tags with the lowest and highest count in the dataset:

In [None]:
min(context_count.items(), key=itemgetter(1))

In [None]:
max(context_count.items(), key=itemgetter(1))

Interestingly, there is a BigQuery [dataset](https://www.kaggle.com/datasets/bigquery/cpc) on Kaggle where we can get more information:

In [None]:
# Import the bq_helper module and create a helper for the BigQuery dataset
# by passing its active_project and dataset_name.
# https://www.kaggle.com/sohier/introduction-to-the-bq-helper-package
import bq_helper

cpc = bq_helper.BigQueryHelper(active_project="patents-public-data",
                               dataset_name="cpc")

In [None]:
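def get_cpc_title(cpc_code):
    """Return the full title of a CPC symbol such as 'F26', or None if not found."""
    # cpc_code is interpolated directly into the SQL, so only pass trusted codes.
    query = f"""
    SELECT titleFull
    FROM
    `patents-public-data.cpc.definition`
    WHERE
    symbol="{cpc_code}";
    """
    response = cpc.query_to_pandas_safe(query)
    # The safe query returns None if the estimated scan size exceeds its limit.
    if response is None or response.empty:
        return None
    return response.titleFull.values[0]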

Hopefully this small helper function is useful to you. We can now get more information about the actual CPC tags, which may be useful for augmenting the dataset.

In [None]:
get_cpc_title('F26')

In [None]:
get_cpc_title('H01')
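
One way to use these titles for augmentation is to look up each distinct context once and attach the title to every row. The sketch below shows the idea without touching BigQuery: `fake_title_lookup` is a hypothetical stand-in for `get_cpc_title`, the rows are invented, and the `[SEP]`-joined text format is just one possible choice, not something prescribed by the competition.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fake_title_lookup(cpc_code):
    # Hypothetical stand-in for get_cpc_title; returns a placeholder, not a real CPC title.
    return f"<title of {cpc_code}>"


# Invented rows with the same fields as train.csv (minus id and score).
rows = [
    {"anchor": "phrase a", "target": "phrase b", "context": "F26"},
    {"anchor": "phrase a", "target": "phrase c", "context": "F26"},
    {"anchor": "phrase d", "target": "phrase e", "context": "H01"},
]

for row in rows:
    # The cache means each distinct code triggers only one real lookup.
    title = fake_title_lookup(row["context"])
    row["text"] = f'{row["anchor"]} [SEP] {row["target"]} [SEP] {title}'

print(rows[0]["text"])
print(fake_title_lookup.cache_info())
```

Caching matters here because the same context code appears on many rows, and each uncached call would otherwise run a separate query.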

## Evaluation metric

The evaluation metric for this competition is the Pearson correlation coefficient between the predicted and actual scores. It is the ratio of the covariance of two variables to the product of their standard deviations. It is therefore essentially a normalized covariance, and its value always lies between −1 and 1.

$$ \rho = \frac{ \text{cov}(pred, target)}{\sigma_{pred}\sigma_{target}} $$

You can use `scipy.stats.pearsonr` ([docs](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)) to calculate it.
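
To see what the formula does, here is a from-scratch sketch in plain Python. It is meant for illustration only, not as a replacement for `scipy.stats.pearsonr`; the function name `pearson` and the example numbers are my own.

```python
import math


def pearson(pred, target):
    """Pearson correlation of two equal-length sequences; nan if either is constant."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    # The 1/n factors of the covariance and the standard deviations cancel out,
    # so plain sums of products and squares are enough.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        return math.nan  # zero standard deviation: the ratio is undefined
    return cov / math.sqrt(var_p * var_t)


target = [0.0, 0.25, 0.5, 0.75, 1.0]
print(pearson([0.1, 0.2, 0.6, 0.7, 0.9], target))    # predictions that track the target
print(pearson([1.0, 0.75, 0.5, 0.25, 0.0], target))  # predictions in reversed order
print(pearson([0.5] * 5, target))                     # constant prediction
```

Note the last case: a constant prediction has zero standard deviation, so the ratio is undefined and the sketch returns `nan`.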

## Sample submission

For a sample submission, I will simply predict the mean of the training scores for every row:

In [None]:
sample_df = pd.read_csv(data_path/'sample_submission.csv')
sample_df['score'] = train_df['score'].mean()
sample_df.to_csv('submission.csv', index=False)

Now, **WE ARE DONE!**

If you enjoyed this notebook, **please give it an upvote**. If you have any questions or suggestions, please leave a comment!

Good luck, fellow participants!