# Introduction

## Goal of this project

This project aims to answer one question: in the wake of third-party cookie deprecation, does Google implement alternative user tracking or targeting practices that may raise privacy concerns, judging from what is disclosed to the public? To answer this question, the following hypotheses are tested:

-	H₀ (Null Hypothesis): The number of Google’s privacy-related patents filed after the announcement of the third-party cookie phase-out does not reflect ongoing development of alternative targeting systems.
-	H₁ (Alternative Hypothesis): The number of Google’s privacy-related patents filed after the announcement of the third-party cookie phase-out reflects ongoing development of alternative targeting systems.

To answer this question, Google's patents from 1995 to 2024, covering the past 29 years, are explored using topic modeling techniques. The patents are grouped into clusters, and the clusters related to tracking topics are identified.

We answer the question by testing H₀ against H₁. The chosen significance level is α = 0.05, which corresponds to a 5% risk of rejecting the null hypothesis when it is true. A statistical test (chi-squared test of independence) will be used to compare the number of patents filed before and after the announcement. Finally, we will compare the p-value with α: if the p-value is below α, we reject H₀; otherwise, we fail to reject it.
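
As a rough illustration of the decision rule described above, the sketch below computes a Pearson chi-squared statistic for a 2x2 table using only the standard library. The counts are hypothetical and the helper name `chi2_independence_2x2` is an assumption made for this example. The notebook itself imports `chi2_contingency` from SciPy, which applies Yates' continuity correction to 2x2 tables by default and would therefore return a slightly different statistic.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts, for illustration only (not results from this project):
# rows = period (before / after the announcement), columns = (tracking-related, other)
illustrative_table = [[120, 880], [180, 820]]
chi2_stat, p_value = chi2_independence_2x2(illustrative_table)

alpha = 0.05
reject_h0 = p_value < alpha
print(f"chi2 = {chi2_stat:.3f}, p = {p_value:.5f}, reject H0: {reject_h0}")
```

With these made-up counts the p-value falls below α, so H₀ would be rejected. The real decision depends entirely on the counts obtained from the classified patents.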

# Installation

In [None]:
# general purpose
! pip install bertopic==0.17.0
! pip install transformers==4.52.4
! pip install sentence-transformers==4.1.0
! pip install datamapplot==0.5.1
! pip install pandas==2.2.2

# data
from google.colab import drive, files
import pandas as pd
pd.set_option('display.max_colwidth', 2000)
import seaborn as sns
from scipy.sparse import csr_matrix
import numpy as np

# clustering

from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer("AI-Growth-Lab/PatentSBERTa") #all-MiniLM-L6-v2
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
# Fine-tune your topic representations
from bertopic.representation import KeyBERTInspired
from bertopic.representation import MaximalMarginalRelevance

# graphs & plots
import plotly.express as px
import datamapplot
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
from matplotlib.colors import TwoSlopeNorm

# Distance computing
from sklearn.metrics.pairwise import cosine_similarity

# Classification
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

# Hypothesis validation
from scipy.stats import chi2_contingency

# Tests
from itertools import combinations
# TODO: add test_rbo.py

# Data

For this project, we use 38'752 Google patents. The purpose is to use an ML model to classify these patents and analyse the work Google has done over the past few years regarding privacy and tracking. More than 100'000 patents were gathered, but the dataset contained many duplicates.

## Data source

For this work, the data were collected from Google Patents using BigQuery, retrieving Google's patents up to December 2024.

The following query was used to retrieve the patents:

```sql
SELECT
    p.publication_number,
    (SELECT text FROM UNNEST(p.title_localized) WHERE language = 'en' LIMIT 1) AS title,
    p.filing_date,
    p.publication_date,
    p.grant_date,
    a.name AS assignee_name,
    (SELECT text FROM UNNEST(p.abstract_localized) WHERE language = 'en' LIMIT 1) AS abstract
FROM
    `patents-public-data.patents.publications` AS p
LEFT JOIN
    UNNEST(p.assignee_harmonized) AS a
WHERE
    LOWER(a.name) LIKE '%google%'
ORDER BY
    p.publication_date DESC
```

## Import data

In [None]:
# Import data from Google Drive (mount it here or through the Colab interface)
#drive.mount('/content/drive')

In [None]:
# Load the patents into a pandas DataFrame
patents = pd.read_excel('/content/drive/MyDrive/bq-results-20250211-170449-1739293514946.xlsx')

## Quick analysis

In [None]:
# visualization
patents.head(5)

In [None]:
# Checking size
patents.shape

In [None]:
patents.dtypes

In [None]:
# Number of NaN and Null
patents.isnull().sum()

This table shows that 4'802 titles and 15'635 abstracts are missing.

If only one of the two is missing, we can work with the other. But if both are missing, the row is useless, since the purpose is to categorize text based on titles, abstracts, or both.

We are going to check which rows are missing both fields. Those rows can be deleted, because a row with no text at all is of no use for text classification.

In [None]:
# Count rows where both are null
both_null = patents[(patents["title"].isnull()) & (patents["abstract"].isnull())]
len(both_null)

After checking both columns, we are not going to delete all 15'635 rows with a missing abstract, because only 4'668 rows are missing both the title and the abstract.

Now, let's make sure that all of the patents are from Google, part of Alphabet Inc.

In [None]:
patents['assignee_name'].unique()

In [None]:
companies_not_related = [
    'GOOGLE LIFE SCIENCES LLC', # Now called Verily, a separate Alphabet Inc. business not related to our topics
    'PEARL HAI GOOGLE ELECTRONIC CO LTD', # Chinese company with Google in its name
    'SHENZHEN GOOGLE WEIXUN TECH CO LTD', # Another Chinese company with Google in its name, unrelated to Google
    'JURONG GOOGLE MANOR MODERN AGRICULTURAL TECHNOLOGY DEV CO LTD', # Another Chinese company with Google in its name
    'GOOGLE SWEDEN TECH AB', # A Swedish company
    'GOOGLE SWEDEN TECHNIQUE AB', # The same Swedish company
    'REAL ESTATE GOOGLE CO LTD' # Google Real Estate is part of Alphabet Inc. but not related to our topics
]

We are now going to check for duplicates. Since we have more than 100'000 patents, we want to make sure that all of them are unique.

In [None]:
patents.duplicated(subset='abstract').sum()

This shows that 68'835 rows are duplicates of an earlier row. We count them here and delete them later in the pipeline.

In [None]:
print(len(patents['abstract'].unique()))

## Data treatment

In [None]:
# Delete lines that are empty in both title and abstract columns
patents_nn = patents.drop(both_null.index)

# Replace null values with an empty string
patents_nn["title"] = patents_nn["title"].fillna("")
patents_nn["abstract"] = patents_nn["abstract"].fillna("")

len(patents_nn) # 102'935

In [None]:
# Format date columns
patents_dformated = patents_nn.copy(deep=True)

patents_dformated['filing_date'] = pd.to_datetime(patents_nn["filing_date"].astype(str), format='%Y%m%d')
patents_dformated['publication_date'] = pd.to_datetime(patents_nn["publication_date"].astype(str), format='%Y%m%d')
patents_dformated['grant_date'] = pd.to_datetime(patents_nn["grant_date"].astype(str), format='%Y%m%d', errors='coerce')

In [None]:
# Delete duplicates (rows whose abstract is empty all share the value "" and collapse into a single row)
patents_dformated = patents_dformated.drop_duplicates(subset='abstract').reset_index(drop=True)

len(patents_dformated) # 38768

In [None]:
# Add a combined title + abstract column used for pre-processing later
patents_dformated['processing'] = patents_dformated["title"] + " " + patents_dformated["abstract"]

In [None]:
# Delete some unwanted characters
unwanted_chars = ["&#39;", "&#34;", "&amp;", "-", "/", "\\", "“", "”", "~"]

for char in unwanted_chars:
    patents_dformated['processing'] = patents_dformated['processing'].str.replace(char, '', regex=False)
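
To make the effect of the cleaning step above concrete, here is a minimal sketch that applies the same plain (non-regex) replacements to a single string. The sample title is hypothetical and the helper name `strip_unwanted` is an assumption made for this example.

```python
# Same character list as in the cleaning step above
unwanted_chars = ["&#39;", "&#34;", "&amp;", "-", "/", "\\", "“", "”", "~"]

def strip_unwanted(text, chars=unwanted_chars):
    """Remove every listed substring from text (plain, non-regex replacement)."""
    for char in chars:
        text = text.replace(char, "")
    return text

# Hypothetical title, for illustration only
sample = "Third-party cookie &amp; user&#39;s client/server data"
cleaned = strip_unwanted(sample)
print(cleaned)
```

Because each match is replaced by an empty string, hyphenated and slash-separated words are merged ("Third-party" becomes "Thirdparty", "client/server" becomes "clientserver"). Replacing with a space instead would be one option if those parts should stay separate tokens.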

In [None]:
# Delete the unwanted companies
patents_dformated = patents_dformated[~patents_dformated['assignee_name'].isin(companies_not_related)] # 38752

# Clustering

This is where the clustering techniques come in. For this project, the main tool used is BERTopic. This tool allows us to extract coherent topics from textual data by combining embeddings with clustering algorithms and generating interpretable topics. However, BERTopic does not work alone; it relies on several components.

UMAP is a dimensionality reduction technique that preserves the structure of high-dimensional embeddings while making them easier to visualize and cluster.

HDBSCAN is a density-based clustering algorithm that identifies groups of similar documents and filters out noise or unrelated data points.

### Training of the model

In [None]:
# Create a copy
processing = patents_dformated[['processing']].copy(deep=True)

# Put it in a list
processing_list = processing.iloc[:, 0].to_list()

In [None]:
# # Prepare embeddings
# embeddings = sentence_model.encode(processing_list, show_progress_bar=True, batch_size=64)

# # Save embeddings to a .npy file
# np.save('/content/drive/My Drive/embeddings_patents_google.npy', embeddings)

# Load embeddings from .npy file
embeddings = np.load('/content/drive/My Drive/embeddings_patents_google.npy')

In [None]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
umap_model = UMAP(
    n_neighbors=100,
    n_components=3,
    min_dist=0.5,
    metric='cosine',
    random_state=42)

UMAP is used to reduce the dimensions of the embeddings created by the BERT-based sentence model. Because we want to see the global clusters of the dataset, we chose larger parameter values.

An n_neighbors of 100 has been chosen for that purpose. An n_components of 3 is the final dimensionality, which makes the embeddings more suitable for clustering and visualization. A min_dist of 0.5 has been chosen because it sits in the middle of the 0 to 1 range for this parameter; a larger value makes the clusters too big and harder to interpret.

In [None]:
# Defined explicitly because it is advised to control the number of topics through the cluster model (HDBSCAN by default)
hdbscan_model = HDBSCAN(
    min_cluster_size=100,
    max_cluster_size=5000,
    metric='euclidean',
    cluster_selection_method='eom')

HDBSCAN is used for clustering. Because we need to see the global clusters, we set a minimum cluster size of 100 and a maximum cluster size of 5000 (even though this overlaps with the dimensionality reduction done within BERTopic).

In [None]:
# Removing the stop-words because BERTopic does not do it by default
vectorizer_model = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 3), # Extract unigrams (1-word), bigrams (2-word), and trigrams (3-word phrases) from the text
    min_df=10 # Ignore terms that appear in fewer than 10 documents
    )
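
To show what `ngram_range=(1, 3)` combined with stop-word removal produces, here is a small standard-library sketch. It only mimics the idea: the function name `word_ngrams` and the toy stop-word list are assumptions made for this example, and the `min_df` filter is not modelled.

```python
import re

def word_ngrams(text, ngram_range=(1, 3), stop_words=frozenset()):
    """Return word n-grams after lowercasing and dropping stop words."""
    # Tokens of two or more word characters, similar in spirit to CountVectorizer's default pattern
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop_words]
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens over the filtered token list
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Tiny hypothetical stop-word list; scikit-learn's "english" list is much longer
toy_stop_words = frozenset({"a", "for", "of", "the"})
grams = word_ngrams("A method for serving targeted content", stop_words=toy_stop_words)
print(grams)
```

Stop words are dropped before the n-grams are built, so "method serving" appears as a bigram even though "for" sat between the two words in the original sentence.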

In [None]:
# Use KeyBERTInspired and MaximalMarginalRelevance for our representation model to (1) keep useful words and (2) produce cleaner topic words
representation_model=[KeyBERTInspired(top_n_words=30), MaximalMarginalRelevance(diversity=.5)]

In [None]:
# # Train BERTopic
# topic_model = BERTopic(
#     embedding_model=sentence_model,
#     umap_model=umap_model,
#     hdbscan_model=hdbscan_model,
#     vectorizer_model=vectorizer_model,
#     representation_model=representation_model
#     )

# topics, probs = topic_model.fit_transform(
#     processing_list,
#     embeddings
#     )
# topic_model.save("/content/drive/My Drive/google_patents_model_cosine_bis")

# Or load BERTopic in BERTopic v0.9.2 or higher:
topic_model = BERTopic.load("/content/drive/My Drive/google_patents_model_cosine_bis")
topics = topic_model._map_predictions(topic_model.hdbscan_model.labels_)
probs = topic_model.hdbscan_model.probabilities_

In [None]:
# Let's look at the topic information: the number of documents per topic, the topic names, their representative words, etc.
topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

However, we have too many outliers. We want to reduce them before going further.

In [None]:
# Reduce outliers using the `embeddings` strategy
reduced_topics = topic_model.reduce_outliers(
    processing_list,
    topics,
    strategy="embeddings",
    embeddings=embeddings,
    threshold=0.5 # The threshold for assigning topics to outlier documents
    )

# Update topics
topic_model.update_topics(
    processing_list,
    topics=reduced_topics,
    vectorizer_model=vectorizer_model
    )
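
The idea behind an embedding-based reassignment with a similarity threshold can be sketched as follows. This is a simplified illustration and not BERTopic's actual implementation: the function names, the 2-D vectors, and the tie-handling are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reassign_outlier(doc_embedding, topic_embeddings, threshold=0.5):
    """Return the most similar topic id, or -1 if no similarity reaches the threshold."""
    best_topic, best_sim = -1, threshold
    for topic_id, topic_vec in topic_embeddings.items():
        sim = cosine(doc_embedding, topic_vec)
        if sim >= best_sim:
            best_topic, best_sim = topic_id, sim
    return best_topic

# Hypothetical 2-D embeddings, for illustration only
toy_topic_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
close_doc = reassign_outlier([0.9, 0.1], toy_topic_embeddings)   # close to topic 0
far_doc = reassign_outlier([-1.0, -1.0], toy_topic_embeddings)   # close to neither topic
print(close_doc, far_doc)
```

In this sketch a higher threshold leaves more documents in the outlier topic (-1), while a lower one assigns more of them to their nearest topic.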

In [None]:
# Let's check a second time
topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

## Assessing the model

Before going further, it is worth assessing our model to check that it was trained properly and to have good confidence in the results that follow. For that, we will...

List of topic model evaluation resources: https://github.com/jonaschn/awesome-topic-models?tab=readme-ov-file#models

### Topic model

"There are two main aspects to evaluate topic models:
- coherence
- relevance.

Coherence measures how well the words in a topic are related to each other, based on their semantic similarity or frequency.
Relevance measures how well the topics capture the main themes or aspects of the documents, based on their importance or specificity.
There are various metrics and tools to calculate coherence and relevance, such as C_V, U_Mass, topic coherence pipeline, etc. You can also use human judgment or feedback to assess the interpretability and usefulness of your topics."

This LinkedIn post, "How do you evaluate the quality and relevance of your topic models and clusters?", describes methods to accomplish this task. We will first focus on the topic model and then on the clusters.

In [None]:
# Get list of words that are used for the topic modeling assessment
topics_to_evaluate = topic_model.get_topic_info()['Representation']

#### Coherence

The coherence assessment should be run with a tool like OCTIS. However, versioning problems were encountered, so it was decided to keep only the three other metrics.

#### Relevance

In [None]:
# How many of the top-N words in each topic are unique across topics (non-overlapping).
def proportion_unique_words(topics, topk=10):
    """
    compute the proportion of unique words

    Parameters
    ----------
    topics: a list of lists of words
    topk: top k words on which the topic diversity will be computed
    """
    if topk > len(topics[0]):
        raise Exception('Words in topics are less than '+str(topk))
    else:
        unique_words = set()
        for topic in topics:
            unique_words = unique_words.union(set(topic[:topk]))
        puw = len(unique_words) / (topk * len(topics))
        return puw

# A result of 1.0 would mean that all words are unique across topics.
proportion_unique_words(topics_to_evaluate, topk=10)

A proportion_unique_words score of ~0.73 means that around 27% of the top topic words are reused across topics, suggesting moderate redundancy. Reducing the number of topics would therefore be an option to limit this redundancy.
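
A small worked example may help read this score. The sketch below recomputes the same ratio on three hypothetical topics of four words each; the topic words and the helper name `proportion_unique` are made up for this illustration.

```python
def proportion_unique(topics, topk):
    """Share of distinct words among the top-k words of all topics."""
    unique_words = set()
    for topic in topics:
        unique_words.update(topic[:topk])
    return len(unique_words) / (topk * len(topics))

# Hypothetical topics, for illustration only
toy_topics = [
    ["ad", "user", "query", "content"],
    ["audio", "speech", "user", "voice"],
    ["network", "neural", "training", "content"],
]
puw = proportion_unique(toy_topics, topk=4)
print(round(puw, 3))
```

Here 12 word slots hold 10 distinct words ("user" and "content" each appear twice), so the score is 10 / 12 ≈ 0.833.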

### Clusters

"There are two main aspects to evaluate clusters :
- validity
- stability

Validity measures how well the clusters reflect the true structure or similarity of the data, based on their compactness, separation, or silhouette.
Stability measures how consistent the clusters are across different runs or samples of the data, based on their robustness, sensitivity, or agreement. There are various metrics and tools to calculate validity and stability, such as Davies-Bouldin index, Rand index, cluster validation toolbox, etc.

You can also use domain knowledge or business goals to assess the relevance and value of your clusters."

#### Validity

In [None]:
# Average Jaccard distance (1 - Jaccard similarity) across the top-N words of each pair of topics.
def pairwise_jaccard_diversity(topics, topk=10):
    '''
    compute the average pairwise jaccard distance between the topics

    Parameters
    ----------
    topics: a list of lists of words
    topk: top k words on which the topic diversity
          will be computed

    Returns
    -------
    pjd: average pairwise jaccard distance
    '''
    dist = 0
    count = 0
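    for list1, list2 in combinations(topics, 2):
        # Restrict each topic to its top-k words so that `topk` is actually applied
        set1, set2 = set(list1[:topk]), set(list2[:topk])
        js = 1 - len(set1.intersection(set2)) / len(set1.union(set2))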
        dist = dist + js
        count = count + 1
    return dist/count

# Result 1.0 would mean that there are no shared words between any topic pairs
pairwise_jaccard_diversity(topics_to_evaluate, topk=10)

A validity score of 0.962 indicates that the topics are quite distinctive and well separated. According to this measure, there is no need to reduce them.
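
To make the measure easier to interpret, the sketch below computes the pairwise Jaccard distance on three hypothetical topics. The topic words and the helper name `jaccard_distance` are assumptions made for this illustration.

```python
from itertools import combinations

def jaccard_distance(words_a, words_b):
    """1 - |A ∩ B| / |A ∪ B| for two word lists."""
    set_a, set_b = set(words_a), set(words_b)
    return 1 - len(set_a & set_b) / len(set_a | set_b)

# Hypothetical topics, for illustration only
toy_topics = [
    ["ad", "user", "query"],
    ["ad", "user", "voice"],
    ["neural", "network", "training"],
]
distances = [jaccard_distance(a, b) for a, b in combinations(toy_topics, 2)]
mean_distance = sum(distances) / len(distances)
print(distances, round(mean_distance, 3))
```

The first two topics share two of their four distinct words, giving a distance of 0.5, while the third topic shares nothing with the others, giving a distance of 1.0. The average over the three pairs is about 0.833.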

#### Stability

The following code has been copied inline because of problems installing the file. It comes directly from https://github.com/silviatti/topic-model-diversity/blob/master/rbo.py

In [None]:
"""Rank-biased overlap, a ragged sorted list similarity measure.
See http://doi.acm.org/10.1145/1852102.1852106 for details. All functions
directly corresponding to concepts from the paper are named so that they can be
clearly cross-identified.
The definition of overlap has been modified to account for ties. Without this,
results for lists with tied items were being inflated. The modification itself
is not mentioned in the paper but seems to be reasonable, see function
``overlap()``. Places in the code which diverge from the spec in the paper
because of this are highlighted with comments.
The two main functions for performing an RBO analysis are ``rbo()`` and
``rbo_dict()``; see their respective docstrings for how to use them.
The following doctest just checks that equivalent specifications of a
problem yield the same result using both functions:
    >>> lst1 = [{"c", "a"}, "b", "d"]
    >>> lst2 = ["a", {"c", "b"}, "d"]
    >>> ans_rbo = _round(rbo(lst1, lst2, p=.9))
    >>> dct1 = dict(a=1, b=2, c=1, d=3)
    >>> dct2 = dict(a=1, b=2, c=2, d=3)
    >>> ans_rbo_dict = _round(rbo_dict(dct1, dct2, p=.9, sort_ascending=True))
    >>> ans_rbo == ans_rbo_dict
    True
"""

from __future__ import division

import math
from bisect import bisect_left
from collections import namedtuple


RBO = namedtuple("RBO", "min res ext")
RBO.__doc__ += ": Result of full RBO analysis"
RBO.min.__doc__ = "Lower bound estimate"
RBO.res.__doc__ = "Residual corresponding to min; min + res is an upper bound estimate"
RBO.ext.__doc__ = "Extrapolated point estimate"


def _round(obj):
    if isinstance(obj, RBO):
        return RBO(_round(obj.min), _round(obj.res), _round(obj.ext))
    else:
        return round(obj, 3)


def set_at_depth(lst, depth):
    ans = set()
    for v in lst[:depth]:
        if isinstance(v, set):
            ans.update(v)
        else:
            ans.add(v)
    return ans


def raw_overlap(list1, list2, depth):
    """Overlap as defined in the article.
    """
    set1, set2 = set_at_depth(list1, depth), set_at_depth(list2, depth)
    return len(set1.intersection(set2)), len(set1), len(set2)


def overlap(list1, list2, depth):
    """Overlap which accounts for possible ties.
    This isn't mentioned in the paper but should be used in the ``rbo*()``
    functions below, otherwise overlap at a given depth might be > depth which
    inflates the result.
    There are no guidelines in the paper as to what's a good way to calculate
    this, but a good guess is agreement scaled by the minimum between the
    requested depth and the lengths of the considered lists (overlap shouldn't
    be larger than the number of ranks in the shorter list, otherwise results
    are conspicuously wrong when the lists are of unequal lengths -- rbo_ext is
    not between rbo_min and rbo_min + rbo_res.
    >>> overlap("abcd", "abcd", 3)
    3.0
    >>> overlap("abcd", "abcd", 5)
    4.0
    >>> overlap(["a", {"b", "c"}, "d"], ["a", {"b", "c"}, "d"], 2)
    2.0
    >>> overlap(["a", {"b", "c"}, "d"], ["a", {"b", "c"}, "d"], 3)
    3.0
    """
    ov = agreement(list1, list2, depth) * min(depth, len(list1), len(list2))
    return ov
    # NOTE: comment the preceding and uncomment the following line if you want
    # to stick to the algorithm as defined by the paper
    # return raw_overlap(list1, list2, depth)[0]


def agreement(list1, list2, depth):
    """Proportion of shared values between two sorted lists at given depth.
    >>> _round(agreement("abcde", "abdcf", 1))
    1.0
    >>> _round(agreement("abcde", "abdcf", 3))
    0.667
    >>> _round(agreement("abcde", "abdcf", 4))
    1.0
    >>> _round(agreement("abcde", "abdcf", 5))
    0.8
    >>> _round(agreement([{1, 2}, 3], [1, {2, 3}], 1))
    0.667
    >>> _round(agreement([{1, 2}, 3], [1, {2, 3}], 2))
    1.0
    """
    len_intersection, len_set1, len_set2 = raw_overlap(list1, list2, depth)
    return 2 * len_intersection / (len_set1 + len_set2)


def cumulative_agreement(list1, list2, depth):
    return (agreement(list1, list2, d) for d in range(1, depth + 1))


def average_overlap(list1, list2, depth=None):
    """Calculate average overlap between ``list1`` and ``list2``.
    >>> _round(average_overlap("abcdefg", "zcavwxy", 1))
    0.0
    >>> _round(average_overlap("abcdefg", "zcavwxy", 2))
    0.0
    >>> _round(average_overlap("abcdefg", "zcavwxy", 3))
    0.222
    >>> _round(average_overlap("abcdefg", "zcavwxy", 4))
    0.292
    >>> _round(average_overlap("abcdefg", "zcavwxy", 5))
    0.313
    >>> _round(average_overlap("abcdefg", "zcavwxy", 6))
    0.317
    >>> _round(average_overlap("abcdefg", "zcavwxy", 7))
    0.312
    """
    depth = min(len(list1), len(list2)) if depth is None else depth
    return sum(cumulative_agreement(list1, list2, depth)) / depth


def rbo_at_k(list1, list2, p, depth=None):
    # ``p**d`` here instead of ``p**(d - 1)`` because enumerate starts at
    # 0
    depth = min(len(list1), len(list2)) if depth is None else depth
    d_a = enumerate(cumulative_agreement(list1, list2, depth))
    return (1 - p) * sum(p ** d * a for (d, a) in d_a)


def rbo_min(list1, list2, p, depth=None):
    """Tight lower bound on RBO.
    See equation (11) in paper.
    >>> _round(rbo_min("abcdefg", "abcdefg", .9))
    0.767
    >>> _round(rbo_min("abcdefgh", "abcdefg", .9))
    0.767
    """
    depth = min(len(list1), len(list2)) if depth is None else depth
    x_k = overlap(list1, list2, depth)
    log_term = x_k * math.log(1 - p)
    sum_term = sum(
        p ** d / d * (overlap(list1, list2, d) - x_k) for d in range(1, depth + 1)
    )
    return (1 - p) / p * (sum_term - log_term)


def rbo_res(list1, list2, p):
    """Upper bound on residual overlap beyond evaluated depth.
    See equation (30) in paper.
    NOTE: The doctests weren't verified against manual computations but seem
    plausible. In particular, for identical lists, ``rbo_min()`` and
    ``rbo_res()`` should add up to 1, which is the case.
    >>> _round(rbo_res("abcdefg", "abcdefg", .9))
    0.233
    >>> _round(rbo_res("abcdefg", "abcdefghijklmnopqrstuvwxyz", .9))
    0.239
    """
    S, L = sorted((list1, list2), key=len)
    s, l = len(S), len(L)
    x_l = overlap(list1, list2, l)
    # since overlap(...) can be fractional in the general case of ties and f
    # must be an integer --> math.ceil()
    f = int(math.ceil(l + s - x_l))
    # upper bound of range() is non-inclusive, therefore + 1 is needed
    term1 = s * sum(p ** d / d for d in range(s + 1, f + 1))
    term2 = l * sum(p ** d / d for d in range(l + 1, f + 1))
    term3 = x_l * (math.log(1 / (1 - p)) - sum(p ** d / d for d in range(1, f + 1)))
    return p ** s + p ** l - p ** f - (1 - p) / p * (term1 + term2 + term3)


def rbo_ext(list1, list2, p):
    """RBO point estimate based on extrapolating observed overlap.
    See equation (32) in paper.
    NOTE: The doctests weren't verified against manual computations but seem
    plausible.
    >>> _round(rbo_ext("abcdefg", "abcdefg", .9))
    1.0
    >>> _round(rbo_ext("abcdefg", "bacdefg", .9))
    0.9
    """
    S, L = sorted((list1, list2), key=len)
    s, l = len(S), len(L)
    x_l = overlap(list1, list2, l)
    x_s = overlap(list1, list2, s)
    # the paper says overlap(..., d) / d, but it should be replaced by
    # agreement(..., d) defined as per equation (28) so that ties are handled
    # properly (otherwise values > 1 will be returned)
    # sum1 = sum(p**d * overlap(list1, list2, d)[0] / d for d in range(1, l + 1))
    sum1 = sum(p ** d * agreement(list1, list2, d) for d in range(1, l + 1))
    sum2 = sum(p ** d * x_s * (d - s) / s / d for d in range(s + 1, l + 1))
    term1 = (1 - p) / p * (sum1 + sum2)
    term2 = p ** l * ((x_l - x_s) / l + x_s / s)
    return term1 + term2


def rbo(list1, list2, p):
    """Complete RBO analysis (lower bound, residual, point estimate).
    ``list`` arguments should be already correctly sorted iterables and each
    item should either be an atomic value or a set of values tied for that
    rank. ``p`` is the probability of looking for overlap at rank k + 1 after
    having examined rank k.
    >>> lst1 = [{"c", "a"}, "b", "d"]
    >>> lst2 = ["a", {"c", "b"}, "d"]
    >>> _round(rbo(lst1, lst2, p=.9))
    RBO(min=0.489, res=0.477, ext=0.967)
    """
    if not 0 <= p <= 1:
        raise ValueError("The ``p`` parameter must be between 0 and 1.")
    args = (list1, list2, p)
    return RBO(rbo_min(*args), rbo_res(*args), rbo_ext(*args))


def sort_dict(dct, *, ascending=False):
    """Sort keys in ``dct`` according to their corresponding values.
    Sorts in descending order by default, because the values are
    typically scores, i.e. the higher the better. Specify
    ``ascending=True`` if the values are ranks, or some sort of score
    where lower values are better.
    Ties are handled by creating sets of tied keys at the given position
    in the sorted list.
    >>> dct = dict(a=1, b=2, c=1, d=3)
    >>> list(sort_dict(dct)) == ['d', 'b', {'a', 'c'}]
    True
    >>> list(sort_dict(dct, ascending=True)) == [{'a', 'c'}, 'b', 'd']
    True
    """
    scores = []
    items = []
    # items should be unique, scores don't have to
    for item, score in dct.items():
        if not ascending:
            score *= -1
        i = bisect_left(scores, score)
        if i == len(scores):
            scores.append(score)
            items.append(item)
        elif scores[i] == score:
            existing_item = items[i]
            if isinstance(existing_item, set):
                existing_item.add(item)
            else:
                items[i] = {existing_item, item}
        else:
            scores.insert(i, score)
            items.insert(i, item)
    return items


def rbo_dict(dict1, dict2, p, *, sort_ascending=False):
    """Wrapper around ``rbo()`` for dict input.
    Each dict maps items to be sorted to the score according to which
    they should be sorted. The RBO analysis is then performed on the
    resulting sorted lists.
    The sort is descending by default, because scores are typically the
    higher the better, but this can be overridden by specifying
    ``sort_ascending=True``.
    >>> dct1 = dict(a=1, b=2, c=1, d=3)
    >>> dct2 = dict(a=1, b=2, c=2, d=3)
    >>> _round(rbo_dict(dct1, dct2, p=.9, sort_ascending=True))
    RBO(min=0.489, res=0.477, ext=0.967)
    """
    list1, list2 = (
        sort_dict(dict1, ascending=sort_ascending),
        sort_dict(dict2, ascending=sort_ascending),
    )
    return rbo(list1, list2, p)


if __name__ in ("__main__", "__console__"):
    import doctest

    doctest.testmod()

In [None]:
# A ranking-aware similarity metric between two ranked lists (usually top-N words in topics).
def irbo(topics, weight=0.9, topk=10):
    """
    compute the inverted rank-biased overlap

    Parameters
    ----------
    topics: a list of lists of words
    weight: p (float), default 0.9: Weight of each
        agreement at depth d: p**(d-1). When set
        to 1.0, there is no weight and the rbo reduces
        to average overlap.
    topk: top k words on which the topic diversity
          will be computed

    Returns
    -------
    irbo : score of the rank biased overlap over the topics
    """
    if topk > len(topics[0]):
        raise Exception('Words in topics are less than topk')
    else:
        collect = []
        for list1, list2 in combinations(topics, 2):
            word2index = get_word2index(list1, list2)
            indexed_list1 = [word2index[word] for word in list1]
            indexed_list2 = [word2index[word] for word in list2]
            rbo_val = rbo(indexed_list1[:topk], indexed_list2[:topk], p=weight)[2]
            collect.append(rbo_val)
        return 1 - np.mean(collect)

def get_word2index(list1, list2):
    words = set(list1)
    words = words.union(set(list2))
    word2index = {w: i for i, w in enumerate(words)}
    return word2index

# A result of 1.0 would mean that there is no overlap even when rank is considered, i.e. perfect diversity.
print("irbo p=0.5:",irbo(topics_to_evaluate, weight=0.5, topk=10))

An IRBO (Inverted Rank-Biased Overlap) score of 0.99 means the topics are highly diverse: their top words barely overlap in ranked order. This is another measure indicating that the topics are well separated.
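
To illustrate why rank matters in this measure, here is a simplified rank-biased overlap truncated at the length of the shorter list. It is only a sketch: the notebook's `irbo` relies on the extrapolated estimate returned by `rbo(...)` and reports one minus the mean over topic pairs, so the values below are not comparable with the 0.99 above. The function name `truncated_rbo` and the word lists are assumptions made for this example.

```python
def truncated_rbo(list1, list2, p=0.9):
    """Rank-biased overlap truncated at the shorter list length (no extrapolation)."""
    depth = min(len(list1), len(list2))
    score = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: share of items common to both top-d prefixes
        agreement = len(set(list1[:d]) & set(list2[:d])) / d
        # Deeper ranks receive geometrically smaller weights
        score += p ** (d - 1) * agreement
    return (1 - p) * score

reference = ["ad", "user", "query"]
same_order = truncated_rbo(reference, ["ad", "user", "query"], p=0.5)
swapped_top = truncated_rbo(reference, ["user", "ad", "query"], p=0.5)
swapped_tail = truncated_rbo(reference, ["ad", "query", "user"], p=0.5)
disjoint = truncated_rbo(reference, ["neural", "network", "training"], p=0.5)
print(same_order, swapped_top, swapped_tail, disjoint)
```

All three non-disjoint lists contain the same words, yet a swap at the top of the ranking lowers the score more than a swap at the bottom. A plain set-based measure such as Jaccard cannot capture this.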

However, BERTopic offers other ways to measure how similar clusters are. We can use them to form a final opinion on our topic modeling before going further.

#### Final thoughts

In [None]:
topic_model.visualize_heatmap()

The heatmap confirms that a few clusters are similar to each other and should be merged. We can reduce them before going further.

Before that, let's also look at the distance between topics. This shows the biggest topics, how far they are from each other, and whether they are well spread out or all clustered together.

In [None]:
# Show how topics overlap each other and the need to merge them
topic_model.visualize_topics()

This graph shows that a few clusters are close to each other. This often means they discuss very similar themes, with minor variations.

In [None]:
# Further reduce topics
topic_model.reduce_topics(processing_list, nr_topics=15)

Let's visualize the heatmap one more time and we are good to go.

In [None]:
topic_model.visualize_heatmap()

# Clustering with BERTopic

## Most important topics

We can now see the biggest topics and what they are about. Our goal is to see the main focus of Google's R&D over the years, especially regarding tracking and profiling techniques. This will allow us to see if advertisement topics are among the main ones.

In [None]:
topic_model.visualize_barchart(top_n_topics=15, n_words=7)

Now, we can see all the documents together and how they are clustered. This gives us a representation of the documents and the topics they are linked to in the blink of an eye.

By doing this, we can see how much weight the advertisement topics (or related topics) have in the whole dataset and dig into that later.

## Clustering

Now, let's take a look at all the documents together. By visualizing the documents and aggregating them into topics, we can see the overall structure of the topics.

...

In [None]:
umap_settings = [
    {"n_neighbors": 15, "min_dist": 0.99},
    {"n_neighbors": 200, "min_dist": 0.1},
    {"n_neighbors": 15, "min_dist": 0.1},
    {"n_neighbors": 200, "min_dist": 0.99}
]

# Reduce dimensionality and build a visualization for every parameter setting
visualization_figures = []
for param in umap_settings:
    # Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
    reduced_embeddings = UMAP(
        n_neighbors=param["n_neighbors"],
        min_dist=param["min_dist"],
        n_components=2,
        metric='cosine',
        random_state=42
        ).fit_transform(embeddings)

    # Build the document visualization for this setting
    fig_param = topic_model.visualize_documents(
        processing_list,
        reduced_embeddings=reduced_embeddings,
        title="Patent clusters from Google"
        )
    visualization_figures.append(fig_param)

In [None]:
visualization_figures[0]

In [None]:
visualization_figures[1]

In [None]:
visualization_figures[2]

In [None]:
visualization_figures[3]

We can also generate another type of map in order to zoom in and out more freely ...

In [None]:
# Process the topic model to visualize the documents
fig_with_datamapplot = topic_model.visualize_document_datamap(
    processing_list,
    reduced_embeddings=reduced_embeddings,
    interactive=True)

# Save the second visualization with datamapplot
fig_with_datamapplot.save("/content/drive/My Drive/patents_from_google_with_datamapplot.html")

# Show plot
fig_with_datamapplot

This map shows the global clusters of Google's patents from the beginning to the end of 2024. We can see a few main areas of focus: advertisement and search query (for ads and the search engine), audio speech voice (which should be related to assistants like Google Home), neural network (for the AI products needed since the beginning of the decade), etc.

## Overall evolution of the patents

We are now going to look at the evolution of the patents overall. This will give us an idea of the proportion of each topic in Google's focus.

Google is primarily a company that sells ads. However, it is also known to run other businesses such as AI assistants (Google Home), self-driving cars, streaming platforms, etc.

In [None]:
# Get the topic (int) assigned to each document and add it to the dataframe
patents_dformated["topics"] = topic_model.topics_

### Overall evolution

In [None]:
# Add publication_year
patents_dformated["publication_year"] = patents_dformated["publication_date"].dt.year

# Group by year and topic
count_df = patents_dformated.groupby(
    ["publication_year", "topics"]
).size().reset_index(name="count")

# Plot with color = topics
plot_topic_by_year = px.bar(
    count_df,
    x="publication_year",
    y="count",
    color="topics",
    title="Number of patents per year by Topic",
    barmode="stack"
)

# Save as HTML
plot_topic_by_year.write_html("/content/drive/My Drive/patents_by_topic_chart.html")

# Show the chart
plot_topic_by_year.show()
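
If the number of patents varies from year to year, raw counts can hide how the mix of topics changes. A minimal sketch, assuming the same `publication_year`, `topics` and `count` columns as `count_df`, converts counts into yearly shares. The `toy_counts` frame and its values are hypothetical and only serve as an illustration:

```python
import pandas as pd

# Hypothetical counts, same layout as count_df (values are illustrative only)
toy_counts = pd.DataFrame({
    "publication_year": [2019, 2019, 2020, 2020],
    "topics": [0, 1, 0, 1],
    "count": [30, 10, 20, 20],
})

# Divide each count by the total number of patents published that year
yearly_total = toy_counts.groupby("publication_year")["count"].transform("sum")
toy_counts["share"] = toy_counts["count"] / yearly_total

print(toy_counts)
```

The resulting `share` column could be passed as `y` to the same bar chart to compare years of very different size.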

In [None]:
# Get the topic list to get an idea of the bar chart on top
topic_model.get_topic_info()[['Topic', 'Count', 'Name', 'Representation']]

### Topics over time

In [None]:
# Compute topic frequencies over time
topics_over_time = topic_model.topics_over_time(
    patents_dformated.iloc[:, 0].tolist(),
    patents_dformated['publication_date'].tolist(),
    nr_bins=20
    )

# Save the topics-over-time table (a DataFrame) as HTML
topics_over_time.to_html("/content/drive/My Drive/topics_over_time.html")

# Show the topics over time
topic_model.visualize_topics_over_time(
    topics_over_time,
    topics=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13] # do not show topic 0 because we dig into it later
    )

This graph helps us see which topics grew the most over time. It is interesting to note some peaks in 2014-2016 for search query, ads, geographic location, social graphs, etc. Others, like image display, audio speech voice data, station base wireless and neural networks, have been growing for a few years.

## Digging inside the clusters

We first want to find our privacy and tracking topics. Using them as anchors, we can calculate the distance between these topics and every other topic. Then, we can check which topics are the most related to privacy and surveillance.

Let's find where the surveillance and privacy topics are located, if there are any. We can then use them to compute the cosine distance between documents.

In [None]:
similar_topics_surveillance, similarity_surveillance = topic_model.find_topics("tracking", top_n=1)
similar_topics_privacy, similarity_privacy = topic_model.find_topics("data privacy", top_n=1)

print(f'surveillance topic : {similar_topics_surveillance} %:{similarity_surveillance[0]}')
print(f'privacy topic : {similar_topics_privacy} %:{similarity_privacy[0]}')

For the following part, the idea was to calculate the cosine similarity between opposite documents or topics such as "privacy" and "tracking".

The result is that models based on the distributional hypothesis are not very good at this, because antonyms often appear in similar contexts. So if we calculate the cosine similarity with opposite words like the ones we suggested, we arrive at more or less the same result for both when using, let's say, a correlation bar chart.

For that reason, we are going to compute a polarity score between a privacy topic (positive) and a tracking topic (negative). Let's first get these topics.
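
To make the issue concrete, here is a minimal sketch of cosine similarity on made-up 3-dimensional vectors. Real sentence embeddings have hundreds of dimensions, and `privacy_vec`, `tracking_vec` and `unrelated_vec` are hypothetical values, not taken from the model. Two texts that share most of their context point in nearly the same direction, so their similarity stays close to 1 even if their meanings are opposed:

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors: same "user data" context, slightly different direction
privacy_vec = [0.9, 0.4, 0.1]
tracking_vec = [0.9, 0.3, 0.2]
unrelated_vec = [0.0, 0.1, 0.9]

print(round(cosine(privacy_vec, tracking_vec), 3))   # close to 1
print(round(cosine(privacy_vec, unrelated_vec), 3))  # much lower
```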

### Visualization of the subtopics

Our goal is to answer one question: does Google implement alternative user tracking or targeting practices? For this purpose, we have our list of patents, which we modeled by topic. We now want to dig into the topics that are most related to these practices.

According to BERTopic's `find_topics()` method, they are located in one big aggregated cluster (0). Let's dig into this topic and see if they fit our definition.

#### Model parameters

In [None]:
# Removing the stop-words because BERTopic does not do it by default
vectorizer_submodel_indiv = CountVectorizer(
    stop_words="english",
    ngram_range=(1, 3) # Extract unigrams (1-word), bigrams (2-word), and trigrams (3-word phrases) from the text
    )

In [None]:
# Reduce dimensionality of embeddings, this step is optional but much faster to perform iteratively:
umap_submodel = UMAP(
    n_neighbors=30,
    n_components=3,
    min_dist=0.1,
    metric='cosine',
    random_state=42)

In [None]:
# Added because advised to control number of topics through the cluster model (hdbscan by default)
hdbscan_submodel = HDBSCAN(
    min_cluster_size=20,
    max_cluster_size=500,
    metric='euclidean',
    cluster_selection_method='leaf')

In [None]:
# Get all embeddings (float) and add them to the dataframe
patents_dformated["embeddings"] = [vec.tolist() for vec in embeddings]

#### Privacy and Tracking topics

In [None]:
sub_topics = []
sub_topics.extend(topic_model.find_topics('privacy', top_n=1)[0])
sub_topics.extend(topic_model.find_topics('tracking', top_n=1)[0])
# Delete duplicates
sub_topics = list(set(sub_topics))

# Filter the embeddings on the topics mentioned above
# (reprocessing the embeddings again would take several hours)
# .copy() avoids pandas' SettingWithCopyWarning when a column is added later
filtered_embeddings_sub_topics_df = patents_dformated.loc[patents_dformated["topics"].isin(sub_topics)].copy()
filtered_sub_topics_df = filtered_embeddings_sub_topics_df['processing']
# Convert the list of embedding lists into a 2D NumPy array
filtered_sub_embeddings = np.array(filtered_embeddings_sub_topics_df["embeddings"].tolist())

In [None]:
# Train the new subtopics model
# subtopic_model = BERTopic(
#     embedding_model=sentence_model,
#     umap_model=umap_submodel,
#     hdbscan_model=hdbscan_submodel,
#     vectorizer_model=vectorizer_submodel_indiv,
#     representation_model=representation_model
#     )

# sub_model_topics, sub_model_probs = subtopic_model.fit_transform(filtered_sub_topics_df, filtered_sub_embeddings)
# subtopic_model.save("/content/drive/My Drive/google_patents_model_subtopics_bis")

# Or load BERTopic in BERTopic v0.9.2 or higher:
subtopic_model = BERTopic.load("/content/drive/My Drive/google_patents_model_subtopics_bis")
sub_model_topics = subtopic_model._map_predictions(subtopic_model.hdbscan_model.labels_)
sub_model_probs = subtopic_model.hdbscan_model.probabilities_

In [None]:
# Reduce outliers using the `embeddings` strategy
reduced_subtopics = subtopic_model.reduce_outliers(
    documents=filtered_sub_topics_df.to_list(),
    topics=sub_model_topics,
    strategy="embeddings",
    embeddings=filtered_sub_embeddings,
    threshold=0.80
    )

# Update topics
subtopic_model.update_topics(
    filtered_sub_topics_df.to_list(),
    topics=reduced_subtopics,
    vectorizer_model=vectorizer_submodel_indiv
    )

In [None]:
reduced_filtered_submodel_embeddings = UMAP(
    n_neighbors=15,
    min_dist=0.1,
    n_components=2,
    metric='cosine',
    random_state=42
    ).fit_transform(filtered_sub_embeddings)

In [None]:
# Put it in the new document visualisation
# Process the subtopic model to visualize the documents
fig_with_datamapplot_subtopics = subtopic_model.visualize_document_datamap(
    docs=filtered_sub_topics_df.tolist(),
    reduced_embeddings=reduced_filtered_submodel_embeddings,
    interactive=True
    )

# Save the second visualization with datamapplot
fig_with_datamapplot_subtopics.save("/content/drive/My Drive/patents_from_google_with_datamapplot_subtopics_bis.html")

# Show plot
fig_with_datamapplot_subtopics

Let's make an assessment of our model before going further.

In [None]:
# Get list of words that are used for the topic modeling assessment
subtopics_to_evaluate = subtopic_model.get_topic_info()['Representation']

In [None]:
# Result 1.0 would mean that all words are unique across topics.
print("proportion of unique words, topk=10:",proportion_unique_words(subtopics_to_evaluate, topk=10))

# Result 1.0 would mean that there are no shared words between any topic pairs
print("pairwise jaccard diversity, topk=10:",pairwise_jaccard_diversity(subtopics_to_evaluate, topk=10))

# Result 1.0 would mean that there is no overlap even considering rank — perfect diversity.
print("irbo, p=0.5:",irbo(subtopics_to_evaluate, weight=0.5, topk=10))
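
For reference, the first two scores can be reproduced with a few lines of plain Python. This is a sketch under the assumption that each topic is a list of its top words. The helper names `unique_word_share` and `jaccard_diversity` are hypothetical, and they are not the implementations called in the cell above:

```python
from itertools import combinations

def unique_word_share(topics, topk=10):
    """Share of distinct words among the top-k words of all topics."""
    words = [w for topic in topics for w in topic[:topk]]
    return len(set(words)) / len(words)

def jaccard_diversity(topics, topk=10):
    """1 minus the average Jaccard similarity over all pairs of topics."""
    sims = []
    for a, b in combinations(topics, 2):
        set_a, set_b = set(a[:topk]), set(b[:topk])
        sims.append(len(set_a & set_b) / len(set_a | set_b))
    return 1 - sum(sims) / len(sims)

# Hypothetical topics: the first two share the word "data"
toy_topics = [
    ["privacy", "data", "consent"],
    ["tracking", "data", "location"],
    ["audio", "speech", "voice"],
]

print(unique_word_share(toy_topics, topk=3))
print(jaccard_diversity(toy_topics, topk=3))
```

Both values fall below 1.0 here only because of the shared word, which is the kind of overlap these metrics are meant to expose.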

Before moving on, let's take a look at the evolution of the topics over time.

In [None]:
# Compute subtopic frequencies over time
subtopics_over_time = subtopic_model.topics_over_time(
    filtered_sub_topics_df.tolist(),
    filtered_embeddings_sub_topics_df["publication_date"].tolist(),
    nr_bins=20
    )

# Save the subtopics-over-time table (a DataFrame) as HTML
subtopics_over_time.to_html("/content/drive/My Drive/subtopics_over_time.html")

# Show the subtopics over time
subtopic_model.visualize_topics_over_time(
    subtopics_over_time
    )

Now that we have a more granular view of the big topic that contains both privacy and tracking, we can dig deeper and retrieve the subtopics that will serve as our anchors.

In [None]:
similar_subtopics_tracking, similarity_sub_tracking = subtopic_model.find_topics("tracking", top_n=1)
similar_subtopics_privacy, similarity_sub_privacy = subtopic_model.find_topics("data privacy", top_n=1)

print(f'tracking topic : {similar_subtopics_tracking} %:{similarity_sub_tracking[0]}')
print(f'privacy topic : {similar_subtopics_privacy} %:{similarity_sub_privacy[0]}')

This shows what we were expecting. As we said before, models based on the distributional hypothesis are not very good at this because antonyms often appear in similar contexts. So the submodel returns the same topic as the most similar one for both opposite words.

To counter that, a topic matching our "tracking" definition was selected manually. We now have our privacy topic (173 docs) and our tracking topic (750 docs). The latter was chosen because its documents mainly focus on network analysis, which, in my view, corresponds the most to the definition of tracking techniques.

In [None]:
# Get the subtopic (int) assigned to each document and add it to the dataframe
filtered_embeddings_sub_topics_df["topics"] = subtopic_model.topics_

In [None]:
subtopic_model.get_topic_info(12)

In [None]:
topic_model.get_topic_info(11)

### Compute polarity score between topics

How does a polarity score work? It is calculated as (positive - negative) / (positive + negative). When both similarities are positive, this results in a score ranging from -1 to +1. We can then better understand which topics are closer to the privacy topic and which are closer to the tracking topic.
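
As a quick worked example with hypothetical similarity values (not measured on the patents), a topic that is almost equally close to both anchors gets a score near 0, while a topic close to only one anchor moves toward -1 or +1:

```python
def polarity(sim_privacy, sim_tracking):
    """(positive - negative) / (positive + negative)."""
    return (sim_privacy - sim_tracking) / (sim_privacy + sim_tracking)

# Hypothetical cosine similarities to the privacy and tracking anchors
print(round(polarity(0.62, 0.58), 3))  # nearly equidistant topic
print(round(polarity(0.80, 0.10), 3))  # privacy-leaning topic
print(round(polarity(0.10, 0.80), 3))  # tracking-leaning topic
```

When the two anchors are themselves very similar, most topics end up in the first situation, which compresses all scores around 0.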

In [None]:
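# Get privacy topic embedding (sub-model topic 12)
# Note: topic_embeddings_ follows the order of get_topic_info(), so if an
# outlier topic (-1) exists, position 12 is shifted by one relative to topic 12
privacy_centroid = subtopic_model.topic_embeddings_[12]

# Get tracking topic embedding (main model topic 11), same caveat as above
tracking_centroid = topic_model.topic_embeddings_[11]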

In [None]:
# Get topic embeddings
topic_vectors = topic_model.topic_embeddings_  # Shape: (n_topics, embedding_dim)
topic_ids = topic_model.get_topic_info().Topic.tolist()  # Topic numbers

In [None]:
# Calculate the polarity score (using (positive - negative) / (positive + negative))
def polarity_score_normalized(topic_vector, priv_centroid, track_centroid):
  sim_privacy = cosine_similarity([topic_vector], [priv_centroid])[0][0]
  sim_tracking = cosine_similarity([topic_vector], [track_centroid])[0][0]

  return (sim_privacy - sim_tracking) / (sim_privacy + sim_tracking)

In [None]:
scores = {}
for topic_id, topic_vector in zip(topic_ids, topic_vectors):
  score = polarity_score_normalized(
      topic_vector,
      privacy_centroid,
      tracking_centroid
      )
  scores[topic_id] = score

In [None]:
# Convert to Series and sort
corrplot = pd.Series(scores).sort_values()

# Plot
fig, ax = plt.subplots(figsize=(17, 12))

# TwoSlopeNorm: midpoint = 0 (neutral), red = negative, green = positive
vmin = min(scores.values())
vmax = max(scores.values())
norm = TwoSlopeNorm(vmin=vmin, vcenter=0, vmax=vmax)
colors = [plt.cm.RdYlGn(norm(c)) for c in corrplot.values]

# Horizontal bar plot
corrplot.plot.barh(color=colors, ax=ax)

# Labeling
ax.set_xlabel("Polarity Score (Privacy vs Tracking)")
ax.set_ylabel("Topic ID")
ax.set_title("Topic Alignment with Privacy (Red = Tracking-Aligned, Green = Privacy-Aligned)")

plt.grid(True)
plt.show()

This plot shows the problem that we have with embeddings. For two opposite words, we end up with very tight polarity scores, indicating only a small difference between the two topics.

### Try with standard techniques

Since we have some trouble with the embeddings, we are going to use a technique that does not require them. We are going with traditional machine learning models such as logistic regression, SVM and others.

By doing this, we are going to classify our patents from 0 to 2 depending on whether they are tracking-related (0), neutral (1) or privacy-related (2).

It is interesting to note that other tools were tried, such as VADER, a lexicon-based sentiment analyzer, and an NLI (Natural Language Inference) model based on embeddings, which led to the same problem.

#### Create train dataframe

Here is the method for splitting the dataset and obtaining a few samples to classify manually. The focus was on the privacy-related topic and the tracking-related topic (the latter defined arbitrarily).

It is worth noting how the classification was made, that is, why some patents were (manually) classified as privacy-related or tracking-related.

- If the purpose of the patent is privacy or tracking, it was classified according to that purpose.
- If it is about data processing but not related to either of them, it was classified as neutral.
- Some patents may include tracking while their purpose is the development of a privacy tool; they were categorized as privacy, which is also the reason why the two classes are so close together.
- Some patents contain words like "location", which may suggest tracking, but actually relate to the location of fingers on a screen.
- If a patent shows techniques to visualize behaviors, such as trend data, it is also categorized as tracking.

This training dataset has been manually labelled and consists of:

| Label | Name     | Amount |
| :---- | :------: | -----: |
| 0     | Tracking | 210    |
| 1     | Neutral  | 655    |
| 2     | Privacy  | 166    |
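
The classes are imbalanced, with neutral patents making up most of the sample. As an illustration, the weights that scikit-learn's `class_weight='balanced'` option derives from such counts follow the formula n_samples / (n_classes * class_count). A small sketch using the counts from the table (the variable names are hypothetical):

```python
# Label counts taken from the table above
counts = {"tracking": 210, "neutral": 655, "privacy": 166}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * class_count)
weights = {name: n_samples / (n_classes * c) for name, c in counts.items()}

for name, w in weights.items():
    print(f"{name}: {w:.3f}")
```

The rarest class receives the largest weight, so errors on privacy and tracking patents cost more during training than errors on neutral ones.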

In [None]:
# Import training set
train_df = pd.read_excel('/content/drive/My Drive/train_df.xlsx')

#### Split the train dataset

In [None]:
# Instantiate vectorizer
vectorizer = TfidfVectorizer(max_features=5000)

# Build the TF-IDF feature matrix and the label vector
X = vectorizer.fit_transform(train_df['processing'])
y = train_df['labels']

#### Train the models

In [None]:
def fit_best_model(model, params):

    scoring = {
        'f1': make_scorer(f1_score, average='macro'),
        'precision': make_scorer(precision_score, average='macro'),
        'recall': make_scorer(recall_score, average='macro')
    }

    # Fit using gridSearchCV for cross-validation
    model_cv = GridSearchCV(
      model,
      params,
      cv=5,
      n_jobs=-1, # The number of CPU cores used during cv loop. -1 means all
      return_train_score=True,
      verbose = 0, # Enable verbose output, 0 = silent
      # P quantifies how many of the positive predictions made by the model were actually correct
      # R measures how well the model identifies all the actual positive instances
      # f1 balance the 2 providing a single value that represents the overall performance
      scoring = scoring,
      refit='f1'
    )

    model_cv.fit(X, y)

    # Return the best parameters found by the search,
    # the best estimator refitted with them on the whole data,
    # and the full cross-validation results (mean scores for every parameter combination).
    return model_cv.best_params_, model_cv.best_estimator_, model_cv.cv_results_

In [None]:
# Logistic regression
params_lr = {
    'fit_intercept' : [True, False],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'dual' : [False], # Dual or primal formulation
    'penalty' : ['l2'], # l1 and elasticnet not always supported with multiclass
    'solver' : ['lbfgs', 'newton-cg', 'newton-cholesky', 'sag', 'saga'], # only solvers that support multiclass
    'tol' : [1e-4, 1e-3, 1e-2], # Tolerance for stopping criteria
    'max_iter' : [100, 500, 1000] , # Max iteration during training
    'class_weight' : ['balanced'] # Give more weight to some classes, if None all classes are 1
}

lr = LogisticRegression()

# Train the model
lr_best_params_, lr_best_estimator_, lr_results_ = fit_best_model(lr, params_lr)
# Save the model
joblib.dump(lr_best_estimator_, '/content/drive/My Drive/lr_best_estimator_.pkl')

In [None]:
# Support Vector Machine (SVM)
params_lsvc = {
    'penalty' : ['l2'], # Norm used in penalization
    'loss' : ['hinge', 'squared_hinge'], # Specifies loss function
    'dual' : [True, False], # Select algo to either solve dual or primal optimization problem
    'tol' : [1e-4, 1e-3, 1e-2], # Tolerance for stopping criteria
    'C' : [0.001, 0.01, 0.1, 1, 10, 100], # Regularization parameter
    'multi_class' : ['ovr', 'crammer_singer'], # Multi-class strategy; both are searched, although Crammer-Singer rarely leads to better accuracy than ovr
    'fit_intercept' : [True, False], # Whether or not to fit an intercept
    'intercept_scaling' : [0.1, 1, 10],
    'class_weight' : ['balanced'], # Give more weight to some classes, if None all classes are 1
    'random_state' : [42], # Fixed seed for reproducible results
    'max_iter' : [100, 500, 1000] , # Max iteration during training
}

lsvc = LinearSVC()

# Train the model
lsvc_best_params, lsvc_best_estimator, lsvc_results_ = fit_best_model(lsvc, params_lsvc)
# Save the model
joblib.dump(lsvc_best_estimator, '/content/drive/My Drive/lsvc_best_estimator_.pkl')

In [None]:
# Random forest
params_rfc = {
    'n_estimators' : [100, 300, 500], # Number of trees in the forest
    'criterion' : ['gini'], # Measure to split a node
    'max_depth' : [None, 30, 70], # Max depth of the tree
    'min_samples_split' : [2, 5, 10], # Min number of samples required to split an internal node
    'min_samples_leaf' : [1, 2, 4], # Min number of samples required to be at a leaf node
    'max_features' : ['sqrt', 'log2', None], # Number of features to consider when looking for the best split
    'random_state' : [42], # Fixed seed for reproducible results
    'class_weight' : ['balanced'],  # Give more weight to some classes, if None all classes are 1
}

rfc = RandomForestClassifier()

# Train the model
rfc_best_params, rfc_best_estimator, rfc_results_ = fit_best_model(rfc, params_rfc)
# Save the model
joblib.dump(rfc_best_estimator, '/content/drive/My Drive/rfc_best_estimator_.pkl')

In [None]:
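# XGBoost
# Compute scale_pos_weight as ratio of negative to positive instances in train set
# according to https://xgboosting.com/xgboost-scale_pos_weight-vs-sample_weight-for-imbalanced-classification/
# Note: scale_pos_weight is designed for binary objectives; for multiclass
# problems, per-sample weights (sample_weight in fit) are the usual alternative
scale_pos_weight = 197 / 166  # ≈ 1.187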

params_xgb = {
    'objective': ['multi:softmax', 'multi:softprob'], # Defining multi-classification with softmax and softprob
    'num_class': [3], # Number of classes
    'max_depth': [3, 5, 10], # Max depth of a tree 4-6
    'gamma': [0, 0.5, 1], # Min loss reduction required to make a further partition on a leaf node of the tree. Larger gamma, the more conservative the algo is
    'n_estimators': [100, 300, 500],
    'scale_pos_weight': [scale_pos_weight], # Control balance of pos and neg weights, useful for unbalanced classes
    'eval_metric': ['mlogloss'] # Metric used for model assessment
}

xgbc = xgb.XGBClassifier()

# Train the model
xgbc_best_params, xgbc_best_estimator, xgbc_results_ = fit_best_model(xgbc, params_xgb)
# Save the model
joblib.dump(xgbc_best_estimator, '/content/drive/My Drive/xgbc_best_estimator_.pkl')

In [None]:
# Load the model (if already trained)
model_lr = joblib.load('/content/drive/My Drive/lr_best_estimator_.pkl')
model_lsvc = joblib.load('/content/drive/My Drive/lsvc_best_estimator_.pkl')
model_rfc = joblib.load('/content/drive/My Drive/rfc_best_estimator_.pkl')
model_xgbc = joblib.load('/content/drive/My Drive/xgbc_best_estimator_.pkl')

### Assess the models

Now that all of our models are tested, we can find the one that gives the "best result". This also means that we take one that ... (talk about false positive and true negative and why we chose the one that makes more true negative or false positive)

In [None]:
# Logistic regression assessment
print("Mean precision:", lr_results_['mean_test_precision'])
print("Mean recall:", lr_results_['mean_test_recall'])
print("Mean f1:", lr_results_['mean_test_f1'])

In [None]:
# Support Vector Machine (SVM) assessment
print("Mean precision:", lsvc_results_['mean_test_precision'])
print("Mean recall:", lsvc_results_['mean_test_recall'])
print("Mean f1:", lsvc_results_['mean_test_f1'])

In [None]:
# Random forest assessment
print("Mean precision:", rfc_results_['mean_test_precision'])
print("Mean recall:", rfc_results_['mean_test_recall'])
print("Mean f1:", rfc_results_['mean_test_f1'])

In [None]:
# XGBoost assessment
print("Mean precision:", xgbc_results_['mean_test_precision'])
print("Mean recall:", xgbc_results_['mean_test_recall'])
print("Mean f1:", xgbc_results_['mean_test_f1'])
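
Each `mean_test_*` entry printed above is an array with one value per parameter combination, so it helps to read the three metrics at the combination that was actually selected. A minimal sketch, assuming a `cv_results_`-like dictionary (the `toy_results` values are made up, and `best_scores` is a hypothetical helper):

```python
def best_scores(cv_results, refit="f1"):
    """Return the mean test metrics of the combination with the best refit score."""
    refit_scores = list(cv_results[f"mean_test_{refit}"])
    best_index = refit_scores.index(max(refit_scores))
    return {
        metric: float(cv_results[f"mean_test_{metric}"][best_index])
        for metric in ("precision", "recall", "f1")
    }

# Hypothetical results for three parameter combinations
toy_results = {
    "mean_test_precision": [0.70, 0.76, 0.81],
    "mean_test_recall": [0.68, 0.75, 0.60],
    "mean_test_f1": [0.69, 0.75, 0.66],
}

print(best_scores(toy_results))
```

With real results, combinations whose fit failed produce NaN scores, so selecting the row where `rank_test_f1` equals 1 is a safer way to find the same index.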

After training these 4 algorithms and looking at the results, it was decided to continue with Logistic Regression. This choice was made because of its high and stable scores: a mean precision around 0.75 and a mean recall around 0.75, which gives a well-balanced mean F1 score (around 0.75). It does not show the gap observed with the tree-based models, which sometimes reached a mean precision above 0.80 but a mean recall of only 0.60.

### Predict the whole dataset

In [None]:
# Load the patents that were not manually labelled
predict_df = pd.read_excel('/content/drive/My Drive/predict_df.xlsx')

In [None]:
# Get lines to predict
documents = predict_df['processing']

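# Vectorize documents with the TF-IDF vectorizer already fitted on the training set
# (transform, not fit_transform, so the features match those the model was trained on)
X_tfidf = vectorizer.transform(documents)  # sparse matrix (n_samples, n_features)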
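
A classifier trained on TF-IDF features expects the same columns at prediction time, which means the vectorizer must be fitted once on the training texts and then only applied (`transform`) to new texts. A minimal sketch with made-up sentences, where `toy_train`, `toy_labels` and `toy_new` are hypothetical stand-ins for the patent texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training texts and labels (0 = tracking, 2 = privacy)
toy_train = [
    "track user location across devices",
    "log browsing history for ad targeting",
    "encrypt user data to protect privacy",
    "anonymize identifiers before storage",
]
toy_labels = [0, 0, 2, 2]
toy_new = ["protect privacy with encrypted identifiers"]

# Fit the vocabulary and IDF weights on the training texts only
toy_vectorizer = TfidfVectorizer()
X_train = toy_vectorizer.fit_transform(toy_train)
toy_model = LogisticRegression().fit(X_train, toy_labels)

# Reuse the fitted vectorizer so columns match what the model was trained on
X_new = toy_vectorizer.transform(toy_new)
print(X_train.shape[1] == X_new.shape[1])
print(toy_model.predict(X_new))
```

Refitting the vectorizer on the new texts would build a different vocabulary, so a given column would no longer refer to the word the model learned a weight for.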

In [None]:
# Make the prediction over the dataset
predicted_results = model_lr.predict(X_tfidf)

# Assign predicted results made by the model to new column
predict_df['predicted_labels'] = predicted_results

In [None]:
# Assign same column to train_df
train_df["predicted_labels"] = train_df["labels"]

# Merge 2 datasets
patents_df = pd.concat([train_df, predict_df], ignore_index=True)

### Display evolution of privacy, neutral and tracking related topics

Let's start by looking at the overall evolution of the privacy and non privacy related patents.

In [None]:
# Add publication_year
patents_df["publication_year"] = patents_df["publication_date"].dt.year

# Group by year and topic
count_df = patents_df.groupby(
    ["publication_year", "predicted_labels"]
).size().reset_index(name="count")

# Plot with color = predicted labels
plot_topic_by_year = px.bar(
    count_df,
    x="publication_year",
    y="count",
    color="predicted_labels",
    title="Number of patents per year by predicted label",
    barmode="stack"
)

# Save as HTML
plot_topic_by_year.write_html("/content/drive/My Drive/patents_by_privacy_chart.html")

# Show the chart
plot_topic_by_year.show()

#### Plot the evolution

In [None]:
# Count total patents per year
total_per_year = patents_df.groupby('publication_year').size().rename('total')
# Count tracking and privacy patents per year
tracking_counts = patents_df[patents_df['predicted_labels'] == 0].groupby('publication_year').size().rename('tracking')
privacy_counts = patents_df[patents_df['predicted_labels'] == 2].groupby('publication_year').size().rename('privacy')

# Merge into one DataFrame
yearly_stats = pd.concat([total_per_year, tracking_counts, privacy_counts], axis=1).fillna(0)
yearly_stats['tracking_prop'] = yearly_stats['tracking'] / yearly_stats['total']
yearly_stats['privacy_prop'] = yearly_stats['privacy'] / yearly_stats['total']

In [None]:
plt.figure(figsize=(10, 6))

plt.plot(yearly_stats.index, yearly_stats['tracking_prop'], marker='o', label='Tracking')
plt.plot(yearly_stats.index, yearly_stats['privacy_prop'], marker='s', label='Privacy')

plt.axvline(x=2020, color='red', linestyle='--', label='Cookie Phase-Out (2020)')
plt.title('Proportion of Tracking and Privacy-Related Patents Over Time')
plt.xlabel('Year')
plt.ylabel('Proportion')
plt.legend()
plt.grid(True)
plt.tight_layout()

# Save the figure
plt.savefig('/content/drive/My Drive/patent_proportions_plot.png')

# Display plot
plt.show()

First, we see that privacy- and tracking-related patents represent a small share of all patents. As said before, other patents relate to technological improvements in self-driving cars, machine learning models and so on.

However, between 2014 and 2024, there is a marked drop in the proportion of tracking-related patents. On the other hand, over the same period, privacy-related patents grew.

Finally, the inflection point, around 2019, may indicate a shift in Google's strategy in response to rising privacy scrutiny, regulatory changes (GDPR, etc.), and so on.

## Statistical test

In [None]:
# Statistical test (chi squared test of independence)
# Create a column for period
patents_df['period'] = patents_df['publication_year'].apply(lambda x: 'pre_2020' if x < 2020 else 'post_2020')

# Create contingency table
contingency = pd.crosstab(patents_df['period'], patents_df['predicted_labels'])

In [None]:
chi2, p, dof, expected = chi2_contingency(contingency)

print(f"Chi-squared statistic: {chi2:.3f}")
print(f"Degrees of freedom: {dof}")
print(f"P-value: {p:.4f}")

if p < 0.05:
    print("Significant change in distribution after 2020")
else:
    print("No statistically significant change in distribution")
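
To show what the test computes, the sketch below rebuilds the chi-squared statistic by hand on a made-up 2x3 table (the counts in `observed` are hypothetical, not the notebook's results). Expected counts come from the row and column totals, and a 2x3 table has (2-1)*(3-1) = 2 degrees of freedom, in which case the p-value has the closed form exp(-χ²/2):

```python
import math

# Hypothetical counts: rows = period, columns = tracking / neutral / privacy
observed = [
    [120, 900, 80],   # pre_2020
    [40, 700, 110],   # post_2020
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells
chi2_stat = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2_stat += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = math.exp(-chi2_stat / 2)  # survival function, valid for dof == 2 only

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.2e}")
```

`scipy.stats.chi2_contingency` should return the same statistic for such a table, since its continuity correction only applies when there is one degree of freedom.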

A chi-squared statistic of 60.1 is a significant result: it confirms a non-random change in patent distribution after 2020, which aligns well with known external pressures on data practices.

# Conclusion

This study analyzed Google’s patent activity from 2010 to 2025 to investigate shifts in technological focus in response to the phasing out of third-party cookies. By applying topic modeling (BERTopic) combined with document embeddings and statistical analysis, we identified a significant transition in the nature of Google’s patents related to user data handling.

The results show a steady decline in tracking-related patents, dropping from 3% to 0.5% over 15 years, with a marked inflection point around 2018–2019, where privacy-related patents began to outnumber tracking ones. This shift coincides with increasing regulatory pressure (e.g., GDPR, CCPA) and the public announcement of Google's plans to eliminate third-party cookies.

A chi-squared test confirmed that this trend is not random, revealing a statistically significant change in patent distribution after 2020 (χ² = 60.993, p < 0.001). This suggests a deliberate strategic pivot.

These findings support the rejection of the null hypothesis (H₀). There is strong evidence that Google’s increased activity in privacy-related patent filings reflects the development of alternative targeting systems, aligning with the company’s stated privacy initiatives — while possibly preserving, or even enhancing, its capabilities in user profiling through new technical paradigms.

# Limitations and considerations

While this study offers compelling evidence of a thematic shift in Google’s patent filings toward privacy-related technologies, several constraints limit the scope and interpretation of the findings:

- Patents ≠ Products: Not all patents lead to actual deployment. Many are defensive, strategic, or speculative in nature. Therefore, the presence of privacy-related patents does not guarantee implementation in Google’s real-world advertising or tracking systems.

- Interpretation of Patent Language: Patent abstracts and claims often use generalized, technical, or vague language. This introduces ambiguity when classifying them under themes like "privacy" or "tracking," even with semantic clustering models.

- Algorithmic Limitations: Techniques such as BERTopic rely on vector embeddings that can misrepresent nuanced legal or contextual meanings, especially in fields with evolving terminologies like privacy and surveillance.