<a href="https://colab.research.google.com/github/Seb-klay/google-patents-using-topic-modeling/blob/main/google_patents_using_document_clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

## Goal of this project

This project aims to answer one question:
**In the wake of third-party cookie deprecation, does Google implement alternative, publicly disclosed user tracking or targeting practices that may raise privacy concerns?**

To answer this question, two hypotheses are tested:
- H1: *Google's privacy-related patents filed after the announcement of third-party cookie phase-out reflect ongoing development of alternative targeting systems.*
- H2: *Google has deployed new forms of user tracking that replicate or exceed the privacy intrusiveness of third-party cookies.*

To test these hypotheses, we explore Google's patents from 1995 to 2024, a span of 29 years, using topic modeling techniques: we cluster the patents and identify the clusters that relate to tracking topics.

The research question is then answered by validating or rejecting H1 and H2.

# Installation

In [1]:
# general purpose
! pip install bertopic==0.17.0
! pip install -U sentence-transformers
! pip install datamapplot==0.5.1
! pip install pandas==2.2.2
! pip install numpy==2.0.0
#! pip install octis==1.12.1
#! pip install contextualized-topic-models
#! pip install gensim==3.5.0
#! pip install scipy==1.1.0

# data
from google.colab import drive, files
import pandas as pd
pd.set_option('display.max_colwidth', 2000)
import seaborn as sns
from scipy.sparse import csr_matrix
import numpy as np

# clustering
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa") #all-MiniLM-L6-v2
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
# Fine-tune your topic representations
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance

# graphs & plots
import plotly.express as px
import plotly.io as pio
import datamapplot
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.colors import TwoSlopeNorm

# Distance computing
from sklearn.metrics.pairwise import cosine_similarity

# Classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Tests
from itertools import combinations


# Data

This project uses 107'603 Google patents. The goal is to use a machine learning model to group these patents and analyse the work Google has done in recent years regarding internet cookies.

The data were retrieved from Google Patents with BigQuery and cover Google patents from before 2025.

## Data source

--> How did we get these data? What was the query? etc.

## Import data

In [2]:
# Mount Google Drive here, or mount it through the Colab interface
#drive.mount('/content/drive')

In [3]:
# Load the data into a pandas DataFrame
patents = pd.read_excel('/content/drive/MyDrive/bq-results-20250211-170449-1739293514946.xlsx')

## Quick analysis

In [4]:
# visualization
patents.head(5)

Unnamed: 0,publication_number,title,filing_date,publication_date,grant_date,assignee_name,abstract
0,DE-112022005472-T5,,20221117,20240905,0,GOOGLE LLC,
1,EP-3762908-B1,Baby monitoring with intelligent audio cueing based on an analyzed video stream,20190304,20240904,20240904,GOOGLE LLC,
2,EP-3799705-B1,Cooling electronic devices in a data center,20191025,20240904,20240904,GOOGLE LLC,
3,EP-3619654-B1,Continuous parametrizations of neural network layer weights,20190723,20240904,20240904,GOOGLE LLC,
4,EP-4423673-A1,Self-improving llms through consistency-based self-generated demonstrations,20231031,20240904,0,GOOGLE LLC,"Aspects of the disclosure are directed to automatically selecting examples in a prompt for an ELM to demonstrate how to perform tasks. Aspects of the disclosure can select and build a set of examples from ELM zero-shot outputs via predetermined criteria that can combine consistency, diversity, and repetition. In the zero-shot setting for three different LLMs, using only ELM predictions, aspects of the disclosure can improve performance up to 15% compared to zero-shot baselines and can match or exceed few-shot base-lines for a range of reasoning tasks."


In [5]:
# Checking size
patents.shape

(107603, 7)

In [6]:
patents.dtypes

Unnamed: 0,0
publication_number,object
title,object
filing_date,int64
publication_date,int64
grant_date,int64
assignee_name,object
abstract,object


In [7]:
# Number of NaN and Null
patents.isnull().sum()

Unnamed: 0,0
publication_number,0
title,4802
filing_date,0
publication_date,0
grant_date,0
assignee_name,0
abstract,15635


The table shows that 4'802 titles and 15'635 abstracts are missing.

A row with only one of the two missing is still usable, because the remaining field provides text. A row with both missing carries no text at all, so it is useless for grouping patents by title, abstract, or both.

We therefore count the rows where both fields are missing, so that they can be deleted.

In [8]:
# Count rows where both are null
both_null = patents[(patents["title"].isnull()) & (patents["abstract"].isnull())]
len(both_null)

4668

Only 4'668 rows have neither a title nor an abstract, so we delete those rather than all 15'635 rows with a missing abstract.

Now, let's make sure that all of the patents belong to Google (Alphabet Inc.).

In [9]:
patents['assignee_name'].unique()

array(['GOOGLE LLC', 'Google Technology Holdings LLC', 'Google PLLC',
       'Google LLC', 'c/o Google LLC', 'GOOGLE INC', 'GOOGLE LLE',
       'GOOGLE ELLC', 'Google LLC 1600', 'GOOGLES LLC',
       'GOOGLE TECH HOLDINGS', 'GOOGLE INT LLC',
       'C/O GOOGLE TECH HOLDINGS', 'GOOGLE LTD RESPONSIBILITY COMPANY',
       'GOOGLE', 'GOOGLE LIFE SCIENCES LLC', 'GOOGLE COMPANY',
       'PEARL HAI GOOGLE ELECTRONIC CO LTD', 'GOOGLE INC GOOGLE LLC',
       'GOOGLE INCORPORATED', 'X DEV LLC GOOGLE INC', 'Google LLLC',
       'GOOGLE TECH HOLDING LLC', 'SHENZHEN GOOGLE WEIXUN TECH CO LTD',
       'Google LCC', 'GOOGLE TECH HODLINGS LLC', 'GOOGLE LLC GOOGLE INC',
       'GOOGLE LLC (N D GES D STAATES DELAWARE)', 'GoogleLLC',
       'Google LLP', 'GOOGLE INC [', 'GOOGLE TECH HOLDING CO LTD',
       'GOOGLE TECH CONTROL CO LTD', 'GOOGLE TECH HOLDINGS INC',
       'Google Technoogy Holdings LLC', 'GOOGLE TECHNOLOBY HOLDINGS LLC',
       'GOOGLE TECH HOLDINGS CO LTD', 'Google Technologies Holdings L

In [10]:
companies_not_related = [
    'GOOGLE LIFE SCIENCES LLC', # Now called Verily, a separate Alphabet Inc. subsidiary; not related to our topics
    'PEARL HAI GOOGLE ELECTRONIC CO LTD', # Chinese company with Google in its name
    'SHENZHEN GOOGLE WEIXUN TECH CO LTD', # Another unrelated Chinese company with Google in its name
    'JURONG GOOGLE MANOR MODERN AGRICULTURAL TECHNOLOGY DEV CO LTD', # Another Chinese company with Google in its name
    'GOOGLE SWEDEN TECH AB', # A Swedish company
    'GOOGLE SWEDEN TECHNIQUE AB', # The same Swedish company
    'REAL ESTATE GOOGLE CO LTD' # Google Real Estate is part of Alphabet Inc. but not related to our topics
]

We now check for duplicates. With more than 100'000 patents, we want to make sure that each of them is unique.

In [11]:
patents.duplicated(subset='abstract').sum()

np.int64(68835)

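This shows that 68'835 rows repeat an abstract that already appears in an earlier row. Note that pandas treats missing values as equal to each other here, so rows with a missing abstract are included in this count. The duplicates are deleted later in the pipeline.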

In [12]:
print(len(patents['abstract'].unique()))

38768


## Data treatment

In [13]:
# Delete lines that are empty in both title and abstract columns
patents_nn = patents.drop(both_null.index)

# Replace missing values with an empty string
patents_nn["title"] = patents_nn["title"].fillna("")
patents_nn["abstract"] = patents_nn["abstract"].fillna("")

len(patents_nn) # 102'935

102935

In [14]:
# Format date columns
patents_dformated = patents_nn.copy(deep=True)

patents_dformated['filing_date'] = pd.to_datetime(patents_nn["filing_date"].astype(str), format='%Y%m%d')
patents_dformated['publication_date'] = pd.to_datetime(patents_nn["publication_date"].astype(str), format='%Y%m%d')
patents_dformated['grant_date'] = pd.to_datetime(patents_nn["grant_date"].astype(str), format='%Y%m%d', errors='coerce')
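
The `errors='coerce'` argument matters for `grant_date`: in the preview above, some rows carry the value `0` in that column, which is not a valid date. As a rough illustration that uses only the standard library rather than pandas, the sketch below parses the same `%Y%m%d` format and maps unparseable values to `None`, which plays the role of pandas' `NaT`. The helper name `parse_yyyymmdd` is hypothetical.

```python
from datetime import datetime

def parse_yyyymmdd(value):
    """Parse an integer such as 20240904; return None when it is not a valid date."""
    try:
        return datetime.strptime(str(value), "%Y%m%d").date()
    except ValueError:
        # Mirrors the idea of errors='coerce': invalid input becomes a missing value
        return None

grant_dates = [20240904, 0]  # toy values in the same format as the grant_date column
parsed = [parse_yyyymmdd(v) for v in grant_dates]
print(parsed)
```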

In [15]:
# Adding a mix column used for pre-processing later
patents_dformated['processing'] = patents_dformated["title"] + " " + patents_dformated["abstract"]

In [20]:
# Delete some unwanted characters
# unwanted_chars = ['&#39;', '&#34;', '-']

# for char in unwanted_chars:
#     patents_dformated['processing'] = patents_dformated['processing'].str.replace(char, '', regex=False)

# patents_dformated.duplicated(subset='processing').sum()

np.int64(100)

In [16]:
# Delete the unwanted companies
patents_dformated = patents_dformated[~patents_dformated['assignee_name'].isin(companies_not_related)] # 107574

In [17]:
# Delete duplicates
patents_dformated = patents_dformated.drop_duplicates(subset='abstract').reset_index(drop=True)

# Clustering

This section covers the clustering techniques. The main tool used in this project is BERTopic. This tool allows us to ... .

BERTopic does not work alone: it relies on several other components, in particular UMAP and HDBSCAN.

UMAP is ...

HDBSCAN is ...

## Training of the model

In [None]:
# Create a copy
processing = patents_dformated[['processing']].copy(deep=True)

# Put it in a list
processing_list = processing.iloc[:, 0].to_list()

In [None]:
# # Prepare embeddings
# embeddings = sentence_model.encode(processing_list, show_progress_bar=True, batch_size=64)

# # Save embeddings to a .npy file
# np.save('/content/drive/My Drive/embeddings_patents_google.npy', embeddings)

# Load embeddings from .npy file
embeddings = np.load('/content/drive/My Drive/embeddings_patents_google.npy')

In [None]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
umap_model = UMAP(
    n_neighbors=200,
    n_components=3,
    min_dist=0.5,
    metric='euclidean',
    random_state=42)

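UMAP uses the following parameters:

- n_neighbors: the number of neighboring points used to approximate the local structure; larger values favor the global structure
- n_components: the number of dimensions of the reduced space
- min_dist: the minimum distance allowed between points in the reduced space
- metric: the distance metric applied to the input data
- random_state: the seed that makes the result reproducible

--> Why these values?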

In [None]:
# Added because advised to control number of topics through the cluster model (hdbscan by default)
hdbscan_model = HDBSCAN(
    min_cluster_size=100,
    max_cluster_size=5000,
    metric='euclidean',
    cluster_selection_method='eom')

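HDBSCAN uses the following parameters:

- min_cluster_size: the smallest group of points that is treated as a cluster
- max_cluster_size: an upper limit on the size of the clusters returned by the `eom` selection method
- metric: the distance metric used between points
- cluster_selection_method: how clusters are selected from the cluster tree (`eom`, excess of mass, or `leaf`)

--> Why these values?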

In [None]:
# Removing the stop-words because BERTopic does not do it by default
vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 3), # Extract unigrams (1-word), bigrams (2-word), and trigrams (3-word phrases) from the text
    min_df=10 # Ignore terms that appear in fewer than 10 documents
    )
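
To make `ngram_range` and `min_df` more concrete, here is a minimal standard-library sketch of the idea. It is a simplified stand-in, not scikit-learn's implementation: it splits on whitespace, skips stop-word removal, and the names `ngrams` and `build_vocabulary` are hypothetical. The toy documents are invented for illustration.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """Return all word n-grams with n_min <= n <= n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(docs, min_df=2):
    """Keep only the n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # A set counts each n-gram once per document (document frequency)
        doc_freq.update(set(ngrams(doc.lower().split())))
    return {term for term, freq in doc_freq.items() if freq >= min_df}

docs = ["third party cookie", "third party tracking", "neural network"]
vocab = build_vocabulary(docs, min_df=2)
print(sorted(vocab))
```

Only "third", "party" and "third party" occur in two documents, so every other n-gram is dropped. With `min_df=10` on the real corpus, the same filter removes rare phrases that would otherwise clutter the topic representations.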

In [None]:
# Use KeyBERTInspired and MaximalMarginalRelevance for our representation model to (1) keep useful words and (2) produce cleaner topic words
representation_model=[KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]

In [None]:
# Train BERTopic
# topic_model = BERTopic(
#     embedding_model=sentence_model,
#     umap_model=umap_model,
#     hdbscan_model=hdbscan_model,
#     vectorizer_model=vectorizer_model,
#     representation_model=representation_model
#     )

# topics, probs = topic_model.fit_transform(
#     processing_list,
#     embeddings
#     )
# topic_model.save("/content/drive/My Drive/google_patents_model")

# Or load the previously saved model (BERTopic v0.9.2 or higher):
topic_model = BERTopic.load("/content/drive/My Drive/google_patents_model")
topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
probs = topic_model.hdbscan_model.probabilities_

In [None]:
# Inspect the topics: number of documents per topic, topic names, and representative words
topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

However, there are too many outliers. We reduce them before going further.

In [None]:
# Reduce outliers using the `embeddings` strategy
reduced_topics = topic_model.reduce_outliers(
    processing_list,
    topics,
    strategy="embeddings",
    embeddings=embeddings,
    threshold=0.5 # The threshold for assigning topics to outlier documents
    )

# Update topics
topic_model.update_topics(
    processing_list,
    topics=reduced_topics,
    vectorizer_model=vectorizer_model
    )
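
Conceptually, the `embeddings` strategy compares each outlier document with every topic and reassigns it to the most similar topic when the cosine similarity reaches the threshold. The sketch below illustrates that idea with plain Python lists. It is an assumption-laden toy version, not BERTopic's code: the function names, the two-dimensional vectors and the topic embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reassign_outliers(doc_embeddings, labels, topic_embeddings, threshold=0.5):
    """Move outliers (label -1) to their most similar topic if similarity >= threshold."""
    new_labels = list(labels)
    for i, (embedding, label) in enumerate(zip(doc_embeddings, labels)):
        if label != -1:
            continue  # already assigned to a topic
        best_topic, best_sim = max(
            ((topic, cosine(embedding, center)) for topic, center in topic_embeddings.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            new_labels[i] = best_topic
    return new_labels

topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}           # toy topic vectors
doc_embeddings = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [-1.0, -1.0]]
labels = [0, 1, -1, -1]                                      # two outliers
new_labels = reassign_outliers(doc_embeddings, labels, topic_embeddings)
print(new_labels)
```

The third document is close to topic 0 and gets reassigned, while the fourth is dissimilar to both topics and stays an outlier. A higher threshold keeps more documents in the outlier group.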

In [None]:
# Let's check a second time
topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

## Assessing the model

Before going further, we assess the model to check that it was trained properly and that we can be confident in the following results. For that, we will...

List of topic model evaluation resources: https://github.com/jonaschn/awesome-topic-models?tab=readme-ov-file#models

### Topic model

"There are two main aspects to evaluate topic models:
- coherence
- relevance.

Coherence measures how well the words in a topic are related to each other, based on their semantic similarity or frequency.
Relevance measures how well the topics capture the main themes or aspects of the documents, based on their importance or specificity.
There are various metrics and tools to calculate coherence and relevance, such as C_V, U_Mass, topic coherence pipeline, etc. You can also use human judgment or feedback to assess the interpretability and usefulness of your topics."

This LinkedIn post, "How do you evaluate the quality and relevance of your topic models and clusters?", describes methods for this task. We first focus on the topic model and then on the clusters.

In [None]:
# Get list of words that are used for the topic modeling assessment
topics_to_evaluate = topic_model.get_topic_info()['Representation'].tolist()

#### Coherence

In [None]:
# # Add assessment method
# from octis.evaluation.metrics.coherence import Coherence
# coherence = Coherence(texts=tokenized_docs, topk=10, measure='c_v')
# score = coherence.score(model_output['topics'])

In [None]:
# from contextualized_topic_models.models.ctm import CombinedTM
# from contextualized_topic_models.datasets.dataset import CTMDataset

# # Preprocess
# bow_matrix = vectorizer_model.fit_transform(texts)  # texts is your list of documents
# vocab = vectorizer_model.get_feature_names_out()

# # Create CTM dataset
# ctm_dataset = CTMDataset(bow=bow_matrix.toarray(), contextual_embeddings=embeddings)

# # Train the model
# ctm = CombinedTM(bow_size=len(vocab),
#                  contextual_size=embeddings.shape[1],
#                  n_components=10,  # number of topics
#                  num_epochs=20)

# ctm.fit(ctm_dataset)

# # Loop over topics
# topics_ctm = ctm.get_topic_lists(10)  # top 10 words per topic
# for idx, topic in enumerate(topics_ctm):
#     print(f"Topic {idx}: {topic}")

In [None]:
# from gensim.models.coherencemodel import CoherenceModel
# from gensim.corpora.dictionary import Dictionary
# from gensim import corpora

# def calculate_coherence_score(topic_model, docs):
#     # Preprocess documents
#     cleaned_docs = topic_model._preprocess_text(docs)

#     # Extract vectorizer and tokenizer from BERTopic
#     vectorizer = topic_model.vectorizer_model
#     tokenizer = vectorizer.build_tokenizer()

#     # Extract features for Topic Coherence evaluation
#     words = vectorizer.get_feature_names_out()
#     # depending on the version and if you get an error use commented out code below:
#     # words = vectorizer.get_feature_names()
#     tokens = [tokenizer(doc) for doc in cleaned_docs]
#     dictionary = corpora.Dictionary(tokens)
#     corpus = [dictionary.doc2bow(token) for token in tokens]
#     # Create topic words
#     topic_words = [[dictionary.token2id[w] for w in words if w in dictionary.token2id]
#     for _ in range(topic_model.nr_topics)]

#     # this creates a list of the token ids (in the format of integers) of the words in words that are also present in the
#     # dictionary created from the preprocessed text. The topic_words list contains list of token ids for each
#     # topic.

#     coherence_model = CoherenceModel(topics=topic_words,
#                                     texts=tokens,
#                                     corpus=corpus,
#                                     dictionary=dictionary,
#                                     coherence='c_v')
#     coherence = coherence_model.get_coherence()

#     return coherence
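
Since the coherence cells above are commented out, the following minimal sketch shows what a co-occurrence based coherence score measures. It implements a simplified UMass-style score (averaged over word pairs) with the standard library only. It is not the C_V measure referenced above, and the function name and toy documents are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, tokenized_docs):
    """Average log((D(w_i, w_j) + 1) / D(w_i)) over ordered word pairs of a topic."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    scores = []
    for w_i, w_j in combinations(topic_words, 2):
        df_i = doc_freq(w_i)
        if df_i == 0:
            continue  # word absent from the corpus, pair cannot be scored
        scores.append(math.log((doc_freq(w_i, w_j) + 1) / df_i))
    return sum(scores) / len(scores) if scores else float("nan")

docs = [
    ["cookie", "browser", "tracking"],
    ["cookie", "browser", "user"],
    ["neural", "network", "training"],
    ["neural", "network", "layer"],
]
coherent = umass_coherence(["cookie", "browser"], docs)
mixed = umass_coherence(["cookie", "network"], docs)
print(coherent, mixed)
```

Words that appear together in the same documents yield a higher score than words that never co-occur, which is the intuition behind coherence measures.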

#### Relevance

In [None]:
# How many of the top-N words in each topic are unique across topics (non-overlapping).
def proportion_unique_words(topics, topk=10):
    """
    compute the proportion of unique words

    Parameters
    ----------
    topics: a list of lists of words
    topk: top k words on which the topic diversity will be computed
    """
    if topk > len(topics[0]):
        raise Exception('Words in topics are less than '+str(topk))
    else:
        unique_words = set()
        for topic in topics:
            unique_words = unique_words.union(set(topic[:topk]))
        puw = len(unique_words) / (topk * len(topics))
        return puw

# A result of 1.0 would mean that all words are unique across topics.
proportion_unique_words(topics_to_evaluate, topk=10)

A proportion_unique_words score of ~0.75 means that around 25% of the top topic words are reused across topics, which suggests moderate redundancy. Reducing the number of topics would be one way to lower it.
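
As a worked example of how such a score arises, the sketch below applies the same formula to two invented topics of four words each. The function is a compact re-implementation under the hypothetical name `puw_sketch`, so that the block runs on its own.

```python
def puw_sketch(topics, topk=10):
    """Proportion of unique words among the top-k words of all topics."""
    unique_words = set()
    for topic in topics:
        unique_words.update(topic[:topk])
    return len(unique_words) / (topk * len(topics))

toy_topics = [
    ["cookie", "browser", "tracking", "user"],
    ["cookie", "advertisement", "user", "auction"],
]
# 6 distinct words out of 2 * 4 = 8 word slots
puw = puw_sketch(toy_topics, topk=4)
print(puw)
```

Here "cookie" and "user" appear in both topics, so only 6 of the 8 slots hold distinct words and the score is 0.75.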

### Clusters

"There are two main aspects to evaluate clusters :
- validity
- stability

Validity measures how well the clusters reflect the true structure or similarity of the data, based on their compactness, separation, or silhouette.
Stability measures how consistent the clusters are across different runs or samples of the data, based on their robustness, sensitivity, or agreement. There are various metrics and tools to calculate validity and stability, such as Davies-Bouldin index, Rand index, cluster validation toolbox, etc.

You can also use domain knowledge or business goals to assess the relevance and value of your clusters."
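
As an illustration of the Rand index mentioned in the quote, the sketch below compares the cluster labels of two runs pair by pair: a pair of documents counts as an agreement when both runs put the two documents together, or both keep them apart. This is a minimal standard-library version with invented labels. It is quadratic in the number of documents, so for the full corpus a library implementation such as scikit-learn's `rand_score` or `adjusted_rand_score` would likely be more practical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of document pairs on which two clusterings agree."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # True counts as 1
        total += 1
    return agree / total

run_1 = [0, 0, 1, 1]
run_2 = [1, 1, 0, 0]   # same grouping, different label names
run_3 = [0, 1, 0, 1]   # different grouping
stable = rand_index(run_1, run_2)
unstable = rand_index(run_1, run_3)
print(stable, unstable)
```

The index ignores the label names and only looks at which documents end up together, so `run_1` and `run_2` are considered identical clusterings.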

#### Validity

In [None]:
# Jaccard distance (1 - Jaccard similarity) across the top-N words for each pair of topics.
def pairwise_jaccard_diversity(topics, topk=10):
    '''
    compute the average pairwise jaccard distance between the topics

    Parameters
    ----------
    topics: a list of lists of words
    topk: top k words on which the topic diversity
          will be computed

    Returns
    -------
    pjd: average pairwise jaccard distance
    '''
    dist = 0
    count = 0
    for list1, list2 in combinations(topics, 2):
        # Restrict the comparison to the top-k words of each topic
        set1, set2 = set(list1[:topk]), set(list2[:topk])
        js = 1 - len(set1.intersection(set2))/len(set1.union(set2))
        dist = dist + js
        count = count + 1
    return dist/count

# Result 1.0 would mean that there are no shared words between any topic pairs
pairwise_jaccard_diversity(topics_to_evaluate, topk=10)

A score of 0.9437 indicates that the topics are distinct and well separated. According to this measure, there is no need to reduce them.
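
To show the arithmetic behind such a score, the sketch below computes the average pairwise Jaccard distance for three invented topics. The function is a compact re-implementation under the hypothetical name `jaccard_diversity_sketch`, so that the block runs on its own.

```python
from itertools import combinations

def jaccard_diversity_sketch(topics, topk=10):
    """Average pairwise Jaccard distance between the top-k words of the topics."""
    distances = []
    for list1, list2 in combinations(topics, 2):
        set1, set2 = set(list1[:topk]), set(list2[:topk])
        distances.append(1 - len(set1 & set2) / len(set1 | set2))
    return sum(distances) / len(distances)

toy_topics = [
    ["cookie", "browser", "user"],
    ["cookie", "server", "cache"],
    ["neural", "network", "layer"],
]
# Pair distances: 1 - 1/5 = 0.8, then 1.0 and 1.0 for the pairs without shared words
diversity = jaccard_diversity_sketch(toy_topics, topk=3)
print(diversity)
```

Only the first two topics share a word, so two of the three pairs reach the maximum distance of 1.0 and the average stays high.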

#### Stability

The following code is copied inline because the package could not be installed properly. It comes directly from https://github.com/silviatti/topic-model-diversity/blob/master/rbo.py
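
Before the full module, here is a minimal sketch of the core idea of rank-biased overlap for two ranked lists without ties: at each depth d, the share of common items among the first d entries is weighted by p**(d - 1), so agreement near the top of the lists counts more. The function name `rbo_truncated` is hypothetical, and this truncated sum is a simplification of the module below, which also handles ties and extrapolates beyond the evaluated depth.

```python
def rbo_truncated(list1, list2, p=0.9, depth=None):
    """(1 - p) * sum over d of p**(d - 1) * |top-d overlap| / d, up to the given depth."""
    depth = min(len(list1), len(list2)) if depth is None else depth
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(list1[:d]) & set(list2[:d]))
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score

same = rbo_truncated(["cookie", "browser", "user"], ["cookie", "browser", "user"])
swapped = rbo_truncated(["cookie", "browser", "user"], ["browser", "cookie", "user"])
disjoint = rbo_truncated(["cookie", "browser", "user"], ["neural", "network", "layer"])
print(same, swapped, disjoint)
```

Even identical lists do not reach 1.0 with the truncated sum (the result is 1 - p**3 here), which is why the module below also provides an extrapolated estimate, `rbo_ext`.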

In [None]:
"""Rank-biased overlap, a ragged sorted list similarity measure.
See http://doi.acm.org/10.1145/1852102.1852106 for details. All functions
directly corresponding to concepts from the paper are named so that they can be
clearly cross-identified.
The definition of overlap has been modified to account for ties. Without this,
results for lists with tied items were being inflated. The modification itself
is not mentioned in the paper but seems to be reasonable, see function
``overlap()``. Places in the code which diverge from the spec in the paper
because of this are highlighted with comments.
The two main functions for performing an RBO analysis are ``rbo()`` and
``rbo_dict()``; see their respective docstrings for how to use them.
The following doctest just checks that equivalent specifications of a
problem yield the same result using both functions:
    >>> lst1 = [{"c", "a"}, "b", "d"]
    >>> lst2 = ["a", {"c", "b"}, "d"]
    >>> ans_rbo = _round(rbo(lst1, lst2, p=.9))
    >>> dct1 = dict(a=1, b=2, c=1, d=3)
    >>> dct2 = dict(a=1, b=2, c=2, d=3)
    >>> ans_rbo_dict = _round(rbo_dict(dct1, dct2, p=.9, sort_ascending=True))
    >>> ans_rbo == ans_rbo_dict
    True
"""

from __future__ import division

import math
from bisect import bisect_left
from collections import namedtuple


RBO = namedtuple("RBO", "min res ext")
RBO.__doc__ += ": Result of full RBO analysis"
RBO.min.__doc__ = "Lower bound estimate"
RBO.res.__doc__ = "Residual corresponding to min; min + res is an upper bound estimate"
RBO.ext.__doc__ = "Extrapolated point estimate"


def _round(obj):
    if isinstance(obj, RBO):
        return RBO(_round(obj.min), _round(obj.res), _round(obj.ext))
    else:
        return round(obj, 3)


def set_at_depth(lst, depth):
    ans = set()
    for v in lst[:depth]:
        if isinstance(v, set):
            ans.update(v)
        else:
            ans.add(v)
    return ans


def raw_overlap(list1, list2, depth):
    """Overlap as defined in the article.
    """
    set1, set2 = set_at_depth(list1, depth), set_at_depth(list2, depth)
    return len(set1.intersection(set2)), len(set1), len(set2)


def overlap(list1, list2, depth):
    """Overlap which accounts for possible ties.
    This isn't mentioned in the paper but should be used in the ``rbo*()``
    functions below, otherwise overlap at a given depth might be > depth which
    inflates the result.
    There are no guidelines in the paper as to what's a good way to calculate
    this, but a good guess is agreement scaled by the minimum between the
    requested depth and the lengths of the considered lists (overlap shouldn't
    be larger than the number of ranks in the shorter list, otherwise results
    are conspicuously wrong when the lists are of unequal lengths -- rbo_ext is
    not between rbo_min and rbo_min + rbo_res.
    >>> overlap("abcd", "abcd", 3)
    3.0
    >>> overlap("abcd", "abcd", 5)
    4.0
    >>> overlap(["a", {"b", "c"}, "d"], ["a", {"b", "c"}, "d"], 2)
    2.0
    >>> overlap(["a", {"b", "c"}, "d"], ["a", {"b", "c"}, "d"], 3)
    3.0
    """
    ov = agreement(list1, list2, depth) * min(depth, len(list1), len(list2))
    return ov
    # NOTE: comment the preceding and uncomment the following line if you want
    # to stick to the algorithm as defined by the paper
    # return raw_overlap(list1, list2, depth)[0]


def agreement(list1, list2, depth):
    """Proportion of shared values between two sorted lists at given depth.
    >>> _round(agreement("abcde", "abdcf", 1))
    1.0
    >>> _round(agreement("abcde", "abdcf", 3))
    0.667
    >>> _round(agreement("abcde", "abdcf", 4))
    1.0
    >>> _round(agreement("abcde", "abdcf", 5))
    0.8
    >>> _round(agreement([{1, 2}, 3], [1, {2, 3}], 1))
    0.667
    >>> _round(agreement([{1, 2}, 3], [1, {2, 3}], 2))
    1.0
    """
    len_intersection, len_set1, len_set2 = raw_overlap(list1, list2, depth)
    return 2 * len_intersection / (len_set1 + len_set2)


def cumulative_agreement(list1, list2, depth):
    return (agreement(list1, list2, d) for d in range(1, depth + 1))


def average_overlap(list1, list2, depth=None):
    """Calculate average overlap between ``list1`` and ``list2``.
    >>> _round(average_overlap("abcdefg", "zcavwxy", 1))
    0.0
    >>> _round(average_overlap("abcdefg", "zcavwxy", 2))
    0.0
    >>> _round(average_overlap("abcdefg", "zcavwxy", 3))
    0.222
    >>> _round(average_overlap("abcdefg", "zcavwxy", 4))
    0.292
    >>> _round(average_overlap("abcdefg", "zcavwxy", 5))
    0.313
    >>> _round(average_overlap("abcdefg", "zcavwxy", 6))
    0.317
    >>> _round(average_overlap("abcdefg", "zcavwxy", 7))
    0.312
    """
    depth = min(len(list1), len(list2)) if depth is None else depth
    return sum(cumulative_agreement(list1, list2, depth)) / depth


def rbo_at_k(list1, list2, p, depth=None):
    # ``p**d`` here instead of ``p**(d - 1)`` because enumerate starts at
    # 0
    depth = min(len(list1), len(list2)) if depth is None else depth
    d_a = enumerate(cumulative_agreement(list1, list2, depth))
    return (1 - p) * sum(p ** d * a for (d, a) in d_a)


def rbo_min(list1, list2, p, depth=None):
    """Tight lower bound on RBO.
    See equation (11) in paper.
    >>> _round(rbo_min("abcdefg", "abcdefg", .9))
    0.767
    >>> _round(rbo_min("abcdefgh", "abcdefg", .9))
    0.767
    """
    depth = min(len(list1), len(list2)) if depth is None else depth
    x_k = overlap(list1, list2, depth)
    log_term = x_k * math.log(1 - p)
    sum_term = sum(
        p ** d / d * (overlap(list1, list2, d) - x_k) for d in range(1, depth + 1)
    )
    return (1 - p) / p * (sum_term - log_term)


def rbo_res(list1, list2, p):
    """Upper bound on residual overlap beyond evaluated depth.
    See equation (30) in paper.
    NOTE: The doctests weren't verified against manual computations but seem
    plausible. In particular, for identical lists, ``rbo_min()`` and
    ``rbo_res()`` should add up to 1, which is the case.
    >>> _round(rbo_res("abcdefg", "abcdefg", .9))
    0.233
    >>> _round(rbo_res("abcdefg", "abcdefghijklmnopqrstuvwxyz", .9))
    0.239
    """
    S, L = sorted((list1, list2), key=len)
    s, l = len(S), len(L)
    x_l = overlap(list1, list2, l)
    # since overlap(...) can be fractional in the general case of ties and f
    # must be an integer --> math.ceil()
    f = int(math.ceil(l + s - x_l))
    # upper bound of range() is non-inclusive, therefore + 1 is needed
    term1 = s * sum(p ** d / d for d in range(s + 1, f + 1))
    term2 = l * sum(p ** d / d for d in range(l + 1, f + 1))
    term3 = x_l * (math.log(1 / (1 - p)) - sum(p ** d / d for d in range(1, f + 1)))
    return p ** s + p ** l - p ** f - (1 - p) / p * (term1 + term2 + term3)


def rbo_ext(list1, list2, p):
    """RBO point estimate based on extrapolating observed overlap.
    See equation (32) in paper.
    NOTE: The doctests weren't verified against manual computations but seem
    plausible.
    >>> _round(rbo_ext("abcdefg", "abcdefg", .9))
    1.0
    >>> _round(rbo_ext("abcdefg", "bacdefg", .9))
    0.9
    """
    S, L = sorted((list1, list2), key=len)
    s, l = len(S), len(L)
    x_l = overlap(list1, list2, l)
    x_s = overlap(list1, list2, s)
    # the paper says overlap(..., d) / d, but it should be replaced by
    # agreement(..., d) defined as per equation (28) so that ties are handled
    # properly (otherwise values > 1 will be returned)
    # sum1 = sum(p**d * overlap(list1, list2, d)[0] / d for d in range(1, l + 1))
    sum1 = sum(p ** d * agreement(list1, list2, d) for d in range(1, l + 1))
    sum2 = sum(p ** d * x_s * (d - s) / s / d for d in range(s + 1, l + 1))
    term1 = (1 - p) / p * (sum1 + sum2)
    term2 = p ** l * ((x_l - x_s) / l + x_s / s)
    return term1 + term2


def rbo(list1, list2, p):
    """Complete RBO analysis (lower bound, residual, point estimate).
    ``list`` arguments should be already correctly sorted iterables and each
    item should either be an atomic value or a set of values tied for that
    rank. ``p`` is the probability of looking for overlap at rank k + 1 after
    having examined rank k.
    >>> lst1 = [{"c", "a"}, "b", "d"]
    >>> lst2 = ["a", {"c", "b"}, "d"]
    >>> _round(rbo(lst1, lst2, p=.9))
    RBO(min=0.489, res=0.477, ext=0.967)
    """
    if not 0 <= p <= 1:
        raise ValueError("The ``p`` parameter must be between 0 and 1.")
    args = (list1, list2, p)
    return RBO(rbo_min(*args), rbo_res(*args), rbo_ext(*args))


def sort_dict(dct, *, ascending=False):
    """Sort keys in ``dct`` according to their corresponding values.
    Sorts in descending order by default, because the values are
    typically scores, i.e. the higher the better. Specify
    ``ascending=True`` if the values are ranks, or some sort of score
    where lower values are better.
    Ties are handled by creating sets of tied keys at the given position
    in the sorted list.
    >>> dct = dict(a=1, b=2, c=1, d=3)
    >>> list(sort_dict(dct)) == ['d', 'b', {'a', 'c'}]
    True
    >>> list(sort_dict(dct, ascending=True)) == [{'a', 'c'}, 'b', 'd']
    True
    """
    scores = []
    items = []
    # items should be unique, scores don't have to
    for item, score in dct.items():
        if not ascending:
            score *= -1
        i = bisect_left(scores, score)
        if i == len(scores):
            scores.append(score)
            items.append(item)
        elif scores[i] == score:
            existing_item = items[i]
            if isinstance(existing_item, set):
                existing_item.add(item)
            else:
                items[i] = {existing_item, item}
        else:
            scores.insert(i, score)
            items.insert(i, item)
    return items


def rbo_dict(dict1, dict2, p, *, sort_ascending=False):
    """Wrapper around ``rbo()`` for dict input.
    Each dict maps items to be sorted to the score according to which
    they should be sorted. The RBO analysis is then performed on the
    resulting sorted lists.
    The sort is descending by default, because scores are typically the
    higher the better, but this can be overridden by specifying
    ``sort_ascending=True``.
    >>> dct1 = dict(a=1, b=2, c=1, d=3)
    >>> dct2 = dict(a=1, b=2, c=2, d=3)
    >>> _round(rbo_dict(dct1, dct2, p=.9, sort_ascending=True))
    RBO(min=0.489, res=0.477, ext=0.967)
    """
    list1, list2 = (
        sort_dict(dict1, ascending=sort_ascending),
        sort_dict(dict2, ascending=sort_ascending),
    )
    return rbo(list1, list2, p)


if __name__ in ("__main__", "__console__"):
    import doctest

    doctest.testmod()

In [None]:
# A ranking-aware similarity metric between two ranked lists (usually top-N words in topics).
def irbo(topics, weight=0.9, topk=10):
    """
    compute the inverted rank-biased overlap

    Parameters
    ----------
    topics: a list of lists of words
    weight: p (float), default 0.9: weight of each
        agreement at depth d: p**(d-1). When set
        to 1.0, there is no weighting and the RBO
        reduces to average overlap.
    topk: top k words on which the topic diversity
          will be computed

    Returns
    -------
    irbo : score of the rank biased overlap over the topics
    """
    if topk > len(topics[0]):
        raise Exception('Words in topics are less than topk')
    else:
        collect = []
        for list1, list2 in combinations(topics, 2):
            word2index = get_word2index(list1, list2)
            indexed_list1 = [word2index[word] for word in list1]
            indexed_list2 = [word2index[word] for word in list2]
            rbo_val = rbo(indexed_list1[:topk], indexed_list2[:topk], p=weight)[2]
            collect.append(rbo_val)
        return 1 - np.mean(collect)

def get_word2index(list1, list2):
    words = set(list1)
    words = words.union(set(list2))
    word2index = {w: i for i, w in enumerate(words)}
    return word2index

# Result 1.0 would mean that there is no overlap even considering rank — perfect diversity.
print("irbo p=0.5:",irbo(topics_to_evaluate, weight=0.5, topk=10))

An IRBO score of 0.98 (Inverted Rank-Biased Overlap) means the topics are highly diverse — their top words barely overlap in ranked order. This is another measure saying that the topics are well separated.
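To make the metric more concrete, here is a minimal sketch of the idea behind IRBO on toy data. It uses a simplified, truncated rank-biased overlap (no extrapolation and no tie handling), so it is not the `rbo_ext` variant used by `irbo()` above and its values will differ slightly. The function names `rbo_truncated` and `irbo_truncated` and the toy word lists are assumptions made for illustration only.

```python
from itertools import combinations


def rbo_truncated(list1, list2, p=0.9, k=10):
    """Truncated rank-biased overlap of two ranked lists (simplified sketch)."""
    seen1, seen2 = set(), set()
    score = 0.0
    depth = min(k, len(list1), len(list2))
    for d in range(1, depth + 1):
        seen1.add(list1[d - 1])
        seen2.add(list2[d - 1])
        overlap = len(seen1 & seen2)
        # Agreement at depth d, weighted by p**(d - 1)
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score


def irbo_truncated(topics, p=0.9, k=10):
    """1 - mean pairwise truncated RBO: higher means more diverse topics."""
    pairs = list(combinations(topics, 2))
    mean_rbo = sum(rbo_truncated(a, b, p, k) for a, b in pairs) / len(pairs)
    return 1 - mean_rbo


# Toy topics (made-up words, not taken from the patent model)
toy_topics = [
    ["ad", "user", "click", "bid"],
    ["image", "pixel", "camera", "lens"],
    ["ad", "user", "query", "rank"],
]
toy_score = irbo_truncated(toy_topics, p=0.5, k=4)
print(round(toy_score, 3))
```

Two of the three toy topics share their two top-ranked words, which pulls the score well below 1.0. Topics with no shared words at all would score exactly 1.0.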

BERTopic also offers other ways to measure how similar clusters are. We can use them to form a final opinion on our topic model before moving on.

#### Final thoughts

In [None]:
topic_model.visualize_heatmap()

The heatmap confirms that a few clusters are very similar to each other and should be merged. We will reduce them before moving on.
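The same reading of the heatmap can be done programmatically. Below is a minimal sketch, assuming the heatmap reflects the cosine similarity between topic vectors. It lists the topic pairs whose similarity exceeds a threshold, which makes them candidates for merging. The helper `merge_candidates`, the threshold of 0.9 and the toy vectors are hypothetical and only serve the illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def merge_candidates(topic_vectors, threshold=0.9):
    """Return (topic_a, topic_b, similarity) for pairs at or above threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(topic_vectors.items()), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            pairs.append((id_a, id_b, round(sim, 3)))
    # Most similar pairs first
    return sorted(pairs, key=lambda t: -t[2])


# Toy topic vectors (made up); real ones would come from the fitted model
toy_vectors = {
    0: [1.0, 0.0, 0.0],
    1: [0.9, 0.1, 0.0],
    2: [0.0, 1.0, 0.0],
}
candidates = merge_candidates(toy_vectors, threshold=0.9)
print(candidates)
```

Here only topics 0 and 1 point in nearly the same direction, so they are the only pair reported.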

Before that, let's also look at the distance between topics. This shows us the biggest topics, how far apart they are, and whether they are well spread out or all clustered together.

In [None]:
# show how topics overlap each other and the need to merge them
topic_model.visualize_topics()

This graph shows that ...

In [None]:
# Further reduce topics
topic_model.reduce_topics(processing_list, nr_topics=12)

Let's visualize the heatmap one more time and we are good to go

In [None]:
topic_model.visualize_heatmap()

# Clustering with BERTopic

## Most important topics

We can now look at the biggest topics (up to 15) and see what they are about. Our goal is to see the main focus of Google R&D over the years, especially regarding tracking and profiling techniques. This will show us whether advertisement topics are among the main ones.

In [None]:
topic_model.visualize_barchart(top_n_topics=15, n_words=7)

Now, we can view all the documents together and see how they are clustered. This gives us, at a glance, a representation of the documents and the topics they are linked to.

By doing this, we can see how much weight the advertisement topics (or related topics) carry in the whole dataset and dig into that later.

## Clustering

Now, let's take a look at all the documents together. By visualizing the documents and aggregating them into topics, we can see the overall structure of all the topics.

...

In [None]:
umap_settings = [
    {"n_neighbors": 15, "min_dist": 0.1}, # test with a new one
    #{"n_neighbors": 50, "min_dist": 0.3},
    #{"n_neighbors": 100, "min_dist": 0.6},
    {"n_neighbors": 200, "min_dist": 0.99}
]

# Reduce dimensionality and build one visualization per UMAP setting
visualization_figures = []
for param in umap_settings:
    # Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
    reduced_embeddings = UMAP(
        n_neighbors=param["n_neighbors"],
        min_dist=param["min_dist"],
        n_components=2,
        metric='euclidean',
        random_state=42
        ).fit_transform(embeddings)

    # Build the document visualization for this setting
    fig_param = topic_model.visualize_documents(
        processing_list,
        reduced_embeddings=reduced_embeddings,
        title="Patent clusters from Google"
        )
    visualization_figures.append(fig_param)

In [None]:
visualization_figures[0]

In [None]:
visualization_figures[1]

We can also generate another type of maps in order to zoom in and out more freely ...

In [None]:
# Process the topic model to visualize the documents
fig_with_datamapplot = topic_model.visualize_document_datamap(
    processing_list,
    reduced_embeddings=reduced_embeddings,
    interactive=True)

# Save the second visualization with datamapplot
fig_with_datamapplot.save("/content/drive/My Drive/patents_from_google_with_datamapplot.html")

# Show plot
fig_with_datamapplot

What this map shows is that...

Let's dig deeper in our topics and start analyzing the advertisement related topics.

## Overall evolution of the patents

We are now going to look at the evolution of the patents overall. This will give us an idea of the proportion of each topic in Google's focus.

Google is primarily a company that sells ads. However, it is also known to ...

In [None]:
# Attach the topic id (int) assigned to each document
patents_dformated["topics"] = topic_model.topics_

### Overall evolution

In [None]:
# Add publication_year
patents_dformated["publication_year"] = patents_dformated["publication_date"].dt.year

# Group by year and topic
count_df = patents_dformated.groupby(
    ["publication_year", "topics"]
).size().reset_index(name="count")

# Plot with color = topics
plot_topic_by_year = px.bar(
    count_df,
    x="publication_year",
    y="count",
    color="topics",
    title="Number of patents per year by Topic",
    barmode="stack"
)

# Save as HTML
plot_topic_by_year.write_html("/content/drive/My Drive/patents_by_topic_chart.html")

# Show the chart
plot_topic_by_year.show()

In [None]:
# Get the topic list to get an idea of the bar chart on top
topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

### Topics over time

This graph helps us better see which topic grew the most over time...
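Raw counts can be misleading when the total number of patents grows from year to year, so it can help to also look at each topic's share per year. The following is a minimal sketch on toy data, using only the standard library. The helper `topic_share_by_year` and the `(year, topic_id)` records are hypothetical and not part of the notebook's pipeline.

```python
from collections import Counter, defaultdict


def topic_share_by_year(records):
    """records: iterable of (year, topic_id). Returns {year: {topic_id: share}}."""
    counts = defaultdict(Counter)
    for year, topic_id in records:
        counts[year][topic_id] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        # Share of each topic among all patents published that year
        shares[year] = {topic: n / total for topic, n in counter.items()}
    return shares


# Toy records (made up): one (publication_year, topic) pair per patent
toy_records = [
    (2019, 3), (2019, 3), (2019, 7),
    (2020, 3), (2020, 7), (2020, 7), (2020, 7),
]
toy_shares = topic_share_by_year(toy_records)
for year in sorted(toy_shares):
    print(year, toy_shares[year])
```

In this toy example topic 7 goes from a third of the patents in 2019 to three quarters in 2020, a shift that is easier to spot in shares than in raw counts.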

In [None]:
# Compute topic frequencies over time
topics_over_time = topic_model.topics_over_time(
    patents_dformated.iloc[:, 0].tolist(),
    patents_dformated['publication_date'].tolist(),
    nr_bins=20
    )

# Build the plot, then save it (DataFrame.to_html would only save the raw table)
fig_topics_over_time = topic_model.visualize_topics_over_time(
    topics_over_time
    )
fig_topics_over_time.write_html("/content/drive/My Drive/topics_over_time.html")

# Show topics over time
fig_topics_over_time

## Digging inside the clusters

We first want to find our privacy and tracking topics. Using them as anchors, we can calculate the distance between these topics and every other topic, then check which topics are the most related to privacy and surveillance.

Let's find where the surveillance and privacy topics are located, if there are any. We can then use them to compute these distances.

In [None]:
similar_topics_surveillance, similarity_surveillance = topic_model.find_topics("tracking", top_n=1)
similar_topics_privacy, similarity_privacy = topic_model.find_topics("data privacy", top_n=1)

print(f'surveillance topic : {similar_topics_surveillance} %:{similarity_surveillance[0]}')
print(f'privacy topic : {similar_topics_privacy} %:{similarity_privacy[0]}')

For the following part, the idea was to calculate the cosine similarity between opposite documents or topics such as "privacy" and "tracking".

It turns out that models based on the distributional hypothesis are not very good at this, because antonyms often appear in similar contexts. So if we calculate the cosine similarity to opposite words like the ones we suggested, we arrive at more or less the same result for both, for example in a correlation bar chart.

For that reason, we are going to compute a polarity score between a privacy topic (positive) and a tracking topic (negative). Let's first get these topics.

### Visualization of the subtopics

Our goal is to answer one question: does Google implement alternative user tracking or targeting practices? For this purpose, we have our list of patents, which we modeled by topic. We now want to dig into the topics that are most related to these practices, which are:
- ...

However, not all of the documents in these topics are genuinely related to tracking techniques. Thus, we'll find the clusters that are related to tracking and to its opposite, privacy, extract the documents from these topics and calculate their distance to them.

This should give us a good idea of which documents are related to tracking techniques and which are not. By doing this, we want to establish whether, over the past few years, Google was mostly focused on tracking techniques or, on the contrary, on privacy techniques that preserve users' digital life.

Let's take a look at each topic individually.

#### Model parameters

In [None]:
# Removing the stop-words because BERTopic does not do it by default
vectorizer_submodel_indiv = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 3) # Extract unigrams (1-word), bigrams (2-word), and trigrams (3-word phrases) from the text
    )

In [None]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
umap_submodel = UMAP(
    n_neighbors=30,
    n_components=3,
    min_dist=0.1,
    metric='cosine',
    random_state=42)

In [None]:
# Added because advised to control number of topics through the cluster model (hdbscan by default)
hdbscan_submodel = HDBSCAN(
    min_cluster_size=20,
    max_cluster_size=500,
    metric='euclidean',
    cluster_selection_method='leaf')

In [None]:
# Get all embeddings (float) and add them to the dataframe
patents_dformated["embeddings"] = [vec.tolist() for vec in embeddings]

#### Privacy and Tracking topics

In [None]:
sub_topics = []
sub_topics.extend(topic_model.find_topics('privacy', top_n=1)[0])
sub_topics.extend(topic_model.find_topics('tracking', top_n=1)[0])
# Delete duplicates
sub_topics = list(set(sub_topics))

# Filter the embeddings on the topics mentioned above
# (reprocessing the embeddings again would take several hours)
# .copy() avoids a SettingWithCopyWarning when a column is added to this subset later
filtered_embeddings_sub_topics_df = patents_dformated.loc[patents_dformated["topics"].isin(sub_topics)].copy()
filtered_sub_topics_df = filtered_embeddings_sub_topics_df['processing']
# convert the embeddings
filtered_sub_embeddings = filtered_embeddings_sub_topics_df["embeddings"].tolist()
filtered_sub_embeddings_mat = csr_matrix(filtered_sub_embeddings)
filtered_sub_embeddings = filtered_sub_embeddings_mat.toarray()

In [None]:
# Train the new subtopics model
# subtopic_model = BERTopic(
#     embedding_model=sentence_model,
#     umap_model=umap_submodel,
#     hdbscan_model=hdbscan_submodel,
#     vectorizer_model=vectorizer_submodel_indiv,
#     representation_model=representation_model
#     )

# sub_model_topics, sub_model_probs = subtopic_model.fit_transform(filtered_sub_topics_df, filtered_sub_embeddings)
# subtopic_model.save("/content/drive/My Drive/google_patents_model_subtopics")

# Or load BERTopic in BERTopic v0.9.2 or higher:
subtopic_model = BERTopic.load("/content/drive/My Drive/google_patents_model_subtopics")
sub_model_topics = subtopic_model._map_predictions(subtopic_model.hdbscan_model.labels_)
sub_model_probs = subtopic_model.hdbscan_model.probabilities_

In [None]:
# Reduce outliers using the `embeddings` strategy
reduced_subtopics = subtopic_model.reduce_outliers(
    documents=filtered_sub_topics_df.to_list(),
    topics=sub_model_topics,
    strategy="embeddings",
    embeddings=filtered_sub_embeddings,
    threshold=0.80
    )

# Update topics
subtopic_model.update_topics(
    filtered_sub_topics_df.to_list(),
    topics=reduced_subtopics,
    vectorizer_model=vectorizer_submodel_indiv
    )

In [None]:
reduced_filtered_submodel_embeddings = UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='cosine',
    random_state=42
    ).fit_transform(filtered_sub_embeddings)

In [None]:
# Put it in the new document visualisation
# Process the subtopic model to visualize the documents
fig_with_datamapplot_subtopics = subtopic_model.visualize_document_datamap(
    docs=filtered_sub_topics_df.tolist(),
    reduced_embeddings=reduced_filtered_submodel_embeddings,
    interactive=True
    )

# Save the second visualization with datamapplot
fig_with_datamapplot_subtopics.save("/content/drive/My Drive/patents_from_google_with_datamapplot_subtopics.html")

# Show plot
fig_with_datamapplot_subtopics

Let's make an assessment of our model before going further.

In [None]:
# Get list of words that are used for the topic modeling assessment
subtopics_to_evaluate = subtopic_model.get_topic_info()['Representation']

In [None]:
# Result 1.0 would mean that all words are unique across topics.
print("proportion of unique words, topk=10:",proportion_unique_words(subtopics_to_evaluate, topk=10))

# Result 1.0 would mean that there are no shared words between any topic pairs
print("pairwise jaccard diversity, topk=10:",pairwise_jaccard_diversity(subtopics_to_evaluate, topk=10))

# Result 1.0 would mean that there is no overlap even considering rank — perfect diversity.
print("irbo, p=0.5:",irbo(subtopics_to_evaluate, weight=0.5, topk=10))

Now that we have a more granular look at the big topic that contains both privacy and tracking subtopics, we can dig deeper and retrieve them to use as anchors.

In [None]:
similar_subtopics_tracking, similarity_sub_tracking = subtopic_model.find_topics("tracking", top_n=1)
similar_subtopics_privacy, similarity_sub_privacy = subtopic_model.find_topics("data privacy", top_n=1)

print(f'tracking topic : {similar_subtopics_tracking} %:{similarity_sub_tracking[0]}')
print(f'privacy topic : {similar_subtopics_privacy} %:{similarity_sub_privacy[0]}')

This shows what we were expecting. As we said before, models based on the distributional hypothesis are not very good at this because antonyms often appear in similar contexts. So the submodel reports the same topic as the most similar to both of the opposite words.

To counter that, the topic that best matches "tracking" was picked manually. We now have our privacy topic (173 docs) and our tracking topic (750 docs). The latter was chosen because its documents focus mainly on network analysis, which, in my view, corresponds the most to the definition of tracking techniques.

In [None]:
# Attach the subtopic id (int) assigned to each document
filtered_embeddings_sub_topics_df["topics"] = subtopic_model.topics_

In [None]:
subtopic_model.get_topic_info(8)

In [None]:
topic_model.get_topic_info(10)

### Merging models

In [None]:
# Combine all models into one
merged_model = BERTopic.merge_models([topic_model, subtopic_model], min_similarity=0.6)

In [None]:
merged_model.get_topic_info()

### Get topic embeddings from merged model

In [None]:
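# Assumption: row 0 of topic_embeddings_ holds the outlier topic (-1),
# so topic k sits at row k + 1 in both models
# Get privacy topic embeddings (subtopic 8)
privacy_centroid = subtopic_model.topic_embeddings_[9]

# Get tracking topic embeddings (main topic 10)
tracking_centroid = topic_model.topic_embeddings_[11]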

### Compute polarity score between topics

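How does a polarity score work? It is calculated as (positive - negative) / (positive + negative), where positive is the cosine similarity to the privacy anchor and negative is the cosine similarity to the tracking anchor. As long as both similarities are non-negative, the score ranges from -1 to +1. This gives us a better understanding of which topics are close to the privacy topic and which are close to the tracking topic.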
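A small worked example may help build intuition for this score. The sketch below uses made-up 2-D vectors rather than real topic embeddings, and the helper names `cosine` and `polarity` are assumptions for illustration. It shows that the score spans the full range when the two anchors are far apart and collapses towards 0 when they are close to each other.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def polarity(vec, privacy_anchor, tracking_anchor):
    """(sim_privacy - sim_tracking) / (sim_privacy + sim_tracking)."""
    sim_privacy = cosine(vec, privacy_anchor)
    sim_tracking = cosine(vec, tracking_anchor)
    denom = sim_privacy + sim_tracking
    if denom == 0:
        # Equally unrelated to both anchors: treat as neutral
        return 0.0
    return (sim_privacy - sim_tracking) / denom


# Case 1: orthogonal anchors (toy values), the score uses the full range
far_privacy, far_tracking = [1.0, 0.0], [0.0, 1.0]
score_aligned = polarity([1.0, 0.0], far_privacy, far_tracking)
score_neutral = polarity([1.0, 1.0], far_privacy, far_tracking)

# Case 2: anchors that almost coincide, as happens with antonyms in embedding space
near_privacy, near_tracking = [1.0, 0.1], [1.0, 0.2]
score_tight = polarity([1.0, 0.0], near_privacy, near_tracking)

print(score_aligned, score_neutral, round(score_tight, 4))
```

When the two anchors are nearly identical, every topic gets almost the same similarity to both, so the numerator is tiny and all scores cluster around 0.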

In [None]:
# Get topic embeddings
topic_vectors = topic_model.topic_embeddings_  # Shape: (n_topics, embedding_dim)
topic_ids = topic_model.get_topic_info().Topic.tolist()  # Topic numbers

# Get subtopic embeddings
subtopic_vectors = subtopic_model.topic_embeddings_  # Shape: (n_topics, embedding_dim)
subtopic_ids = subtopic_model.get_topic_info().Topic.tolist()  # Topic numbers

In [None]:
# Calculate the polarity score (using (positive - negative) / (positive + negative))
def polarity_score_normalized(topic_vector, priv_centroid, track_centroid):
  sim_privacy = cosine_similarity([topic_vector], [priv_centroid])[0][0]
  sim_tracking = cosine_similarity([topic_vector], [track_centroid])[0][0]

  return (sim_privacy - sim_tracking) / (sim_privacy + sim_tracking)

In [None]:
scores = {}
for topic_id, topic_vector in zip(topic_ids, topic_vectors):
  score = polarity_score_normalized(
      topic_vector,
      privacy_centroid,
      tracking_centroid
      )
  scores[topic_id] = score

In [None]:
# Convert to Series and sort
corrplot = pd.Series(scores).sort_values()

# Plot
fig, ax = plt.subplots(figsize=(17, 12))

# TwoSlopeNorm: midpoint = 0 (neutral), red = negative, green = positive
vmin = min(scores.values())
vmax = max(scores.values())
norm = TwoSlopeNorm(vmin=vmin, vcenter=0, vmax=vmax)
colors = [plt.cm.RdYlGn(norm(c)) for c in corrplot.values]

# Horizontal bar plot
corrplot.plot.barh(color=colors, ax=ax)

# Labeling
ax.set_xlabel("Polarity Score (Privacy vs Tracking)")
ax.set_ylabel("Topic ID")
ax.set_title("Topic Alignment with Privacy (Red = Tracking-Aligned, Green = Privacy-Aligned)")

plt.grid(True)
plt.show()

This plot shows the problem that we have with embeddings. For two opposite words, we end up with a very narrow range of polarity scores, indicating ...

### Try with standard techniques

Since we have some trouble with the embeddings, we are going to use a technique that does not require them. We are going with traditional machine learning models such as logistic regression, SVM and others.

By doing this, we are going to classify our patents from 0 to 2 depending on whether they are tracking-related (0), neutral (1) or privacy-related (2).

It is worth noting that other tools were tried, such as VADER, a lexicon-based sentiment analyzer, and an NLI (Natural Language Inference) model based on embeddings, which led to the same problem.

#### Define privacy topic from 2nd model

In [None]:
# Isolate the privacy topic from the second model (subtopic 8)
# .copy() avoids a SettingWithCopyWarning on the assignment below
privacy_topic = filtered_embeddings_sub_topics_df[filtered_embeddings_sub_topics_df["topics"] == 8].copy()

# Renumber it so it doesn't clash with the topic ids of the main model
privacy_topic['topics'] = 11

# Update main dataframe with values from second dataframe
patents_dformated.update(privacy_topic)

#### Create train dataframe

Here is the method for splitting the dataset and obtaining a few samples to classify manually. The focus is on the privacy-related topic and the tracking-related topic (the latter defined arbitrarily).

One interesting thing to note is how the classification was made, that is, why some patents were (manually) classified as privacy-related or tracking-related.

- If the purpose of the patent is privacy or tracking, it was classified according to that purpose.
- If it is about data processing but unrelated to either of them, it was classified as neutral.
- Some patents may include tracking while their purpose is the development of a privacy tool; these were categorized as privacy, which is also the reason why the two classes are so close together.
- Some patents contain words like "location", which may suggest tracking, but actually relate to the location of fingers on a screen.
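The sampling logic can be sketched on toy identifiers. This is a minimal illustration, assuming every patent has a stable identifier that survives shuffling. The function `split_for_labeling` and the toy IDs are hypothetical. The point is that the prediction set is derived from the identifiers of the sampled rows, not from their position after shuffling.

```python
import random


def split_for_labeling(ids_by_topic, privacy_topic, tracking_topic,
                       n_privacy, n_tracking, n_other, seed=42):
    """Return (train_ids, predict_ids) as two disjoint sets of document ids."""
    rng = random.Random(seed)
    privacy_ids = rng.sample(sorted(ids_by_topic[privacy_topic]), n_privacy)
    tracking_ids = rng.sample(sorted(ids_by_topic[tracking_topic]), n_tracking)
    chosen = set(privacy_ids) | set(tracking_ids)
    # Everything not already chosen, whatever its topic
    remaining = sorted(
        doc_id
        for ids in ids_by_topic.values()
        for doc_id in ids
        if doc_id not in chosen
    )
    other_ids = rng.sample(remaining, n_other)
    train_ids = chosen | set(other_ids)
    predict_ids = set(remaining) - set(other_ids)
    return train_ids, predict_ids


# Toy corpus (made-up ids): topic 10 = tracking, topic 11 = privacy, topic 0 = other
toy_ids_by_topic = {
    10: [f"T{i}" for i in range(20)],
    11: [f"P{i}" for i in range(15)],
    0: [f"N{i}" for i in range(50)],
}
train_ids, predict_ids = split_for_labeling(toy_ids_by_topic, 11, 10, 5, 8, 12)
print(len(train_ids), len(predict_ids))
```

Working with identifiers makes it easy to check that no document ends up in both the training set and the prediction set.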

In [None]:
# Sample documents to label manually as privacy-related, tracking-related or neutral
# Select privacy and tracking documents
privacy_df = patents_dformated[patents_dformated['topics'] == 11].sample(n=150, random_state=42)
tracking_df = patents_dformated[patents_dformated['topics'] == 10].sample(n=200, random_state=42)

# Drop selected privacy/tracking docs from full dataset to avoid duplicates
remaining_df = patents_dformated.drop(privacy_df.index).drop(tracking_df.index)

# Sample 750 general documents (neutral or unlabeled)
neutral_df = remaining_df.sample(n=750, random_state=42)

# Combine all into one shuffled training dataset, keeping the original index for now
train_df = pd.concat([privacy_df, tracking_df, neutral_df]).sample(frac=1, random_state=42)

# Get the patents_dformated without the rows used for training
# (must happen before reset_index, otherwise rows 0..1099 would be dropped instead)
predicting_df = patents_dformated.drop(train_df.index)
train_df = train_df.reset_index(drop=True)

# Save the files in excel format
train_df.to_excel('/content/drive/My Drive/train_df.xlsx')
predicting_df.to_excel('/content/drive/My Drive/predicting_df.xlsx')

# Or read the file
# train_df = pd.read_excel('/content/drive/My Drive/train_df.xlsx')
# predicting_df = pd.read_excel('/content/drive/My Drive/predicting_df.xlsx')

#### Define labels

To delete later: train_df just has to be replaced with the actual manually labeled training dataset.

In [None]:
# Create labels: 0 for tracking (topic 10), 2 for privacy (topic 11), 1 for neutral,
# as the Hugging Face classifier expects non-negative integers
label_conditions = [
    train_df['topics'] == 11,  # privacy
    train_df['topics'] == 10   # tracking
]

label_values = [2, 0]

train_df['label'] = np.select(label_conditions, label_values, default=1)

#### Create tfidf on processing column

In [None]:
# Vectorize documents with tfidf
documents = train_df['processing']

vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(documents)  # sparse matrix (n_samples, n_features)

# Convert sparse matrix to dense, then to list of vectors
train_df['processing_tfidf'] = list(X_tfidf.toarray())
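To see what the TF-IDF step above computes, here is a minimal hand-written sketch on three toy sentences. It follows the smoothed IDF formula and L2 row normalization that scikit-learn's `TfidfVectorizer` uses by default, but the whitespace tokenization is a simplifying assumption, so the vocabulary can differ from the real vectorizer. The function `tfidf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows) with L2-normalized TF-IDF weights per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])
    return vocab, rows


# Toy documents (made up)
toy_docs = ["user privacy data", "user tracking data", "camera lens"]
toy_vocab, toy_rows = tfidf(toy_docs)
for tok, weight in zip(toy_vocab, toy_rows[0]):
    print(f"{tok:10s} {weight:.3f}")
```

In the first toy document, "privacy" gets a higher weight than "user" or "data" because it appears in only one document. This is the property that lets a linear classifier pick up on distinctive vocabulary.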

#### Split the train dataset

In [None]:
# Reuse the TF-IDF matrix fitted above, then split it into training and testing sets
X = X_tfidf
y = train_df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [None]:
print(X)
print(y)
print(X_train.shape)
print(X_test.shape)

In [None]:
# Pick one privacy document and one tracking document for a quick sanity test
privacy_doc_test = 'System and method for detecting sensitive information leakage while preserving privacy Systems and methods for privacy preserving data loss detection include: performing a scan of online information for candidate data leaks to generate an online data set; performing an analysis of the online data set to determine that online information is a candidate data leak; the host encrypts the data communication and provides the host encrypted data communication to a software agent at the enterprise; in response to receiving the host encrypted data communication, the software agent encrypts the enterprise information database and re-encrypts and provides the host encrypted data communication to the host; the host decrypting the host encrypted aspect of the re-encrypted data communication to generate a software agent encrypted data communication; determining whether a match exists between the encrypted information database and the software agent encrypted data communication; and depending on whether a match exists, the software agent takes a first action or the host takes a second action., Systems and methods for detecting sensitive information leakage while preserving privacy Privacy-preserving data loss detection, including a sweep of online information for a candidate data leakage to generate an online data set; analyzing the online data set to determine that the online information is candidate data leakage; the host encrypting the data communication and providing the host-encrypted data communication to a software agent at the enterprise; in response to receiving the host-encrypted data communication, the software agent encrypting a database of enterprise information and re-encrypting the host-encrypted data communication, and providing the same to the host; the host decrypting a host-encrypted aspect of the re-encrypted data communication to generate a software agent-encrypted data communication'
tracking_doc_test = 'Social circle in social networks The application is related to the social circle in social networks.For the figured contact data that description transmission is used to show contact person for being shown to user, contact person is contact person of the user in computer-implemented social networking service；The first social circle of user is generated, the first social circle includes first contact subset of the user in social networking service and defines the first distribution for digital content；The second social circle of user is generated, the second social circle includes second contact subset of the user in social networking service and defines the second distribution for digital content；And inputted in response to user, there is provided for being selected by user to define the distribution of digital content, distribution includes at least one distribution in the first distribution and the second distribution for the first social circle and the second social circle., Social circles in social networks Described is transmitting contact data for displaying graphical representations of contacts for display to a user, the contacts being contacts of the user within a computer-implemented social networking service, generating a first social circle of the user, the first social circle comprising a first subset of contacts of the user within the social networking service and defining a first distribution for digital content, generating a second social circle of the user, the second social circle comprising a second subset of contacts of the user within the social networking service and defining a second distribution for digital content, and, in response to user input, providing the first social circle and the second social circle for selection by the user to define a distribution of digital content, the distribution comprising at least one of the first distribution and the second distribution.'

privacy_doc_test_vectorized = vectorizer.transform([privacy_doc_test])
tracking_doc_test_vectorized = vectorizer.transform([tracking_doc_test])

In [None]:
def fit_best_model(model, params):

    scoring = {
        'f1': make_scorer(f1_score, average='macro'),
        'precision': make_scorer(precision_score, average='macro'),
        'recall': make_scorer(recall_score, average='macro')
    }

    # Fit using GridSearchCV for cross-validation
    model_cv = GridSearchCV(
      model,
      params,
      cv=5,
      n_jobs=-1, # Number of CPU cores used during the cv loop, -1 means all
      return_train_score=True,
      verbose=0, # 0 = silent
      # Precision: share of the predictions for a class that are actually correct
      # Recall: share of the actual instances of a class that the model finds
      # f1: harmonic mean of the two, a single value for overall performance
      scoring=scoring,
      refit='f1' # Refit the best estimator according to macro f1
    )

    # Fit on the training split only so that X_test / y_test stay held out
    model_cv.fit(X_train, y_train)

    # Return the best parameters, the estimator refitted with them,
    # and the full cross-validation results
    return model_cv.best_params_, model_cv.best_estimator_, model_cv.cv_results_

In [None]:
# Logistic regression
params_lr = {
    'fit_intercept' : [True],
    'dual' : [False], # Dual or primal formulation
    'penalty' : ['l2'], # l1 and elasticnet are not supported by every solver
    'solver' : ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag', 'saga'], # Solvers that can handle multiclass problems
    'tol' : [0.0001], # Tolerance for stopping criteria
    'max_iter' : [100], # Max number of iterations during training
    'class_weight' : [None] # Give more weight to some classes, if None all classes have weight 1
}

lr = LogisticRegression()

lr_best_params_, lr_best_estimator_, lr_results_ = fit_best_model(lr, params_lr)

In [None]:
# Support Vector Machine (SVM)
params_lsvc = {
    'penalty' : ['l2'], # Norm used in penalization
    'loss' : ['hinge'], # Specifies loss function
    'dual' : [True], # Solve the dual problem, required for hinge loss with l2 penalty
    'tol' : [1e-3], # Tolerance for stopping criteria
    'C' : [0.01, 0.1, 1], # Regularization parameter
    'multi_class' : ["ovr"], # Multi-class strategy. Crammer-Singer rarely leads to better accuracy so ovr is chosen
    'fit_intercept' : [True], # Whether or not to fit an intercept
    'intercept_scaling' : [1],
    'class_weight' : [None], # Give more weight to some classes, if None all classes have weight 1
    'random_state' : [42], # Seed for reproducibility
    'max_iter' : [100], # Max number of iterations during training
}

lsvc = LinearSVC()

lsvc_best_params, lsvc_best_estimator, lsvc_results_ = fit_best_model(lsvc, params_lsvc)

In [None]:
# Random forest
params_rfc = {
    'n_estimators' : [100], # Number of trees in the forest
    'criterion' : ['gini', 'entropy', 'log_loss'], # Measure of the quality of a split
    'max_depth' : [None], # Max depth of the tree
    'min_samples_split' : [2], # Min number of samples required to split an internal node
    'min_samples_leaf' : [1], # Min number of samples required to be at a leaf node
    'min_weight_fraction_leaf' : [0.0], # Min weighted fraction of the sum total of weights required to be at a leaf node
    'max_features' : [100], # Number of features to consider when looking for the best split
    'max_leaf_nodes' : [None], # If set, grow trees in best-first fashion up to this many leaves
    'min_impurity_decrease' : [0.0], # A node will be split if this split induces a decrease of the impurity >= this value
    'bootstrap' : [False], # Whether to draw bootstrap samples per tree; False means each tree sees the whole dataset
    'oob_score' : [False], # Whether to use out-of-bag samples to estimate the generalization score
    'random_state' : [42], # Seed for reproducibility
    'warm_start' : [False], # Reuse solution of previous call to fit and add more estimators to the ensemble
    'class_weight' : [None],  # Give more weight to some classes, if None all classes have weight 1
    'ccp_alpha' : [0.0], # Complexity parameter used for Minimal Cost-Complexity Pruning
    'max_samples' : [None], # Only used when bootstrap=True, so this parameter is left as None
    'monotonic_cst' : [None] # Monotonicity constraints: as one input feature increases, the output either always increases or always decreases
}

rfc = RandomForestClassifier()

rfc_best_params, rfc_best_estimator, rfc_results_ = fit_best_model(rfc, params_rfc)

In [None]:
# XGBoost
params_xgb = {
    'objective': ['multi:softmax', 'multi:softprob'], # Defining multi-classification with softmax and softprob
    'num_class': [3], # Number of classes
    'max_depth': [4], # Max depth of a tree 4-6
    'subsample': [0.5], # Subsample ratio of the training instances and helps prevent overfitting. Between 0-1
    'learning_rate': [0.1],
    'sampling_method': ['uniform'], # Method to use to sample the training instance, uniform means each training instance has an equal prob of being selected
    'gamma': [0], # Min loss reduction required to make a further partition on a leaf node of the tree. Larger gamma, the more conservative the algo is
    'n_estimators': [100],
    'scale_pos_weight': [1], # Control balance of pos and neg weights, useful for unbalanced classes
    'eval_metric': ['mlogloss'] # Metric used for model assessment (multi-class log loss, suited to classification)
}

xgbc = xgb.XGBClassifier()

xgbc_best_params, xgbc_best_estimator, xgbc_results_ = fit_best_model(xgbc, params_xgb)

### Assess the models

Now that all of our models have been tested, we can find the one that gives the "best result". This also means that we take one that ... (talk about false positives and true negatives, and why we chose the one that makes more true negatives or false positives)
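
To make the false-positive and false-negative trade-off concrete, here is a minimal sketch that builds a confusion matrix by hand and counts the errors per class. The class names (`privacy`, `neutral`, `tracking`) and the toy labels are assumptions made for illustration only; they are not results from this notebook. In practice, `sklearn.metrics.confusion_matrix` computes the same matrix.

```python
from collections import Counter

# Hypothetical class names; the notebook's actual label encoding may differ.
LABELS = ["privacy", "neutral", "tracking"]


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix with rows = true class and columns = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def per_class_errors(matrix, labels):
    """Count true positives, false positives and false negatives per class."""
    errors = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # true class i, predicted as another class
        fp = sum(row[i] for row in matrix) - tp   # another class, predicted as class i
        errors[label] = {"tp": tp, "fp": fp, "fn": fn}
    return errors


# Toy labels for illustration only (not taken from the patent dataset).
y_true = ["tracking", "tracking", "neutral", "privacy", "neutral", "tracking"]
y_pred = ["tracking", "neutral", "neutral", "privacy", "tracking", "tracking"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
errors = per_class_errors(matrix, LABELS)

for label in LABELS:
    print(label, errors[label])
```

Read this way, a false negative on `tracking` is a tracking-related patent that the model misses, while a false positive is a patent wrongly flagged as tracking-related. Which of the two is more costly depends on how the flagged documents are reviewed afterwards.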

In [None]:
# Logistic regression assessment
print("Mean precision:", lr_results_['mean_test_precision'])
print("Mean recall:", lr_results_['mean_test_recall'])
print("Mean f1:", lr_results_['mean_test_f1'])

In [None]:
# Support Vector Machine (SVM) assessment
print("Mean precision:", lsvc_results_['mean_test_precision'])
print("Mean recall:", lsvc_results_['mean_test_recall'])
print("Mean f1:", lsvc_results_['mean_test_f1'])

In [None]:
# Random forest assessment
print("Mean precision:", rfc_results_['mean_test_precision'])
print("Mean recall:", rfc_results_['mean_test_recall'])
print("Mean f1:", rfc_results_['mean_test_f1'])

In [None]:
# XGBoost assessment
print("Mean precision:", xgbc_results_['mean_test_precision'])
print("Mean recall:", xgbc_results_['mean_test_recall'])
print("Mean f1:", xgbc_results_['mean_test_f1'])
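
The four cells above print each model's scores separately. A side-by-side ranking can be sketched as follows, assuming each `*_results_` object behaves like scikit-learn's `cv_results_` dictionary (one score per parameter combination under keys such as `mean_test_f1`). The helper names `best_row` and `compare_models` are hypothetical, and the numbers are placeholders, not results from this notebook.

```python
METRICS = ("mean_test_precision", "mean_test_recall", "mean_test_f1")


def best_row(cv_results, rank_by="mean_test_f1"):
    """Return the metrics of the parameter combination with the highest rank_by score."""
    scores = list(cv_results[rank_by])
    best = max(range(len(scores)), key=scores.__getitem__)
    return {metric: float(cv_results[metric][best]) for metric in METRICS}


def compare_models(results_by_model, rank_by="mean_test_f1"):
    """Rank models from best to worst by their best rank_by score."""
    rows = {name: best_row(res, rank_by) for name, res in results_by_model.items()}
    return sorted(rows.items(), key=lambda item: item[1][rank_by], reverse=True)


# Placeholder scores for illustration only; in the notebook these would be
# lr_results_, lsvc_results_, rfc_results_ and xgbc_results_.
demo_results = {
    "logistic_regression": {
        "mean_test_precision": [0.70, 0.72],
        "mean_test_recall": [0.68, 0.71],
        "mean_test_f1": [0.69, 0.71],
    },
    "xgboost": {
        "mean_test_precision": [0.80, 0.79],
        "mean_test_recall": [0.75, 0.77],
        "mean_test_f1": [0.77, 0.78],
    },
}

ranking = compare_models(demo_results)
for name, row in ranking:
    print(name, row)
```

Ranking by F1 is only one possible choice; passing `rank_by="mean_test_recall"` would instead favor the model that misses the fewest documents.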

### Predict the whole dataset

In [None]:
# Make the prediction over the dataset


## Get privacy/tracking related documents

In [None]:
# Closest to privacy


In [None]:
# Furthest from privacy


### Display evolution of privacy, neutral and tracking related topics

In [None]:
# Plot the evolution


### Select our documents

We can now select our documents...

In [None]:
# Take those of the last 10 years


# See the evolution throughout the years and the kinds of patents involved, then do the same for privacy-focused content


# Manually dig into the documents and read what they are about, OR use a sentiment-based algorithm to help identify
# patents that may raise privacy concerns and those that do not


# Make a comparison


# Conclusion

What is it about?