<center> <h1> Topological Data Analysis n°2 </h1> </center>
<center> <h2> Exploratory Analysis using Artificially made data</h2> </center>

In this notebook, I explore new ways of representing session data. Sessions are not necessarily linear (a user may open several windows to explore different topics), and the previous pipeline did not use contextual information about the documents. The new pipeline addresses both issues by treating a session as a collection of words, drawn from the themes and the titles of its documents. We then apply Singular Value Decomposition (SVD) to condense each session into a single vector, and finally run the Mapper algorithm on the resulting list of session vectors.

# 1. Data

We start by generating data with the same code as in the previous notebook.

In [1]:
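from tda_utils import *  # project helpers used in the cells below
import json
import numpy as np  # imported explicitly because `np` is used below

# Load the small dataset
with open("toy_dataset.json", "r") as f:
    data = json.load(f)

print("The contained themes:", *data.keys())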

The contained themes: Droit Musique Histoire Science Technologie Art Cuisine Sport Mode Environnement Éducation Santé Voyages Philosophie Politique


[nltk_data] Downloading package stopwords to /home/maabid/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/maabid/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
SMALL_VARIABILITY = 0.2 

MEDIUM_VARIABILITY = 1

HIGH_VARIABILITY = 10

SMALL_ALPHA = 0.1

MEDIUM_ALPHA = 1

HIGH_ALPHA = 100


BEHAVIOURS_SPECIFY = [
    (SMALL_VARIABILITY, HIGH_ALPHA, np.random.randint(1_000, 2_001), "small"),
    (SMALL_VARIABILITY, MEDIUM_ALPHA, np.random.randint(1_000, 2_001), "small"),
    (MEDIUM_VARIABILITY, SMALL_ALPHA, np.random.randint(1_000, 2_001), "medium"),
    (HIGH_VARIABILITY, MEDIUM_ALPHA, np.random.randint(1_000, 2_001), "long"),
    (HIGH_VARIABILITY, HIGH_ALPHA, np.random.randint(1_000, 2_001), "long"),
    (MEDIUM_VARIABILITY, HIGH_ALPHA, np.random.randint(1_000, 2_001), "medium"),
    (MEDIUM_VARIABILITY, MEDIUM_ALPHA, np.random.randint(1_000, 2_001), "medium"),
]

In [3]:
sessions = generate_dataset(BEHAVIOURS_SPECIFY, data)

  0%|          | 0/7 [00:00<?, ?it/s]

We also keep the label of each generated session (the index of the behaviour that produced it) so that the sessions can be traced back to their ground truth later on.

In [12]:
ground_truth_labels = [np.ones(b[2])*i for i, b in enumerate(BEHAVIOURS_SPECIFY)]

ground_truth_labels = np.concatenate(ground_truth_labels)

# 2. New Pipeline

Document titles carry information beyond their themes, so including them enriches the context available to the analysis. To embed the words, I use a compact pretrained Word2Vec model.

## 2.1 Get session representations

The first step is to represent each session as a set of vectors, using a pretrained word2vec model. I chose a model by Fauconnier, available at [fauconnier.github.io](https://fauconnier.github.io/). The vectors of a session form the columns of a matrix. I compute the Singular Value Decomposition (SVD) of this matrix and keep the first left singular vector as the representation of the session.
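To make the SVD step concrete, here is a minimal sketch on random toy vectors. It is not the code of `get_sessions_representation`; the function name `first_left_singular_vector` is hypothetical, and the sign convention (orienting the vector toward the mean word vector) is an assumption of this sketch, added only because the sign of a singular vector is otherwise arbitrary.

```python
import numpy as np


def first_left_singular_vector(word_vectors):
    """Summarize a set of word vectors by the first left singular vector.

    Hypothetical helper for illustration only.
    """
    # Stack the word vectors as columns: shape (embedding_dim, n_words)
    M = np.column_stack(word_vectors)
    # Thin SVD; the columns of U are the left singular vectors,
    # ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    u = U[:, 0]
    # Assumption of this sketch: fix the arbitrary sign of u by
    # orienting it toward the mean word vector
    if np.dot(u, M.mean(axis=1)) < 0:
        u = -u
    return u


# Toy "session": 8 words embedded in 5 dimensions
rng = np.random.default_rng(0)
toy_session = [rng.normal(size=5) for _ in range(8)]
vec = first_left_singular_vector(toy_session)
print(vec.shape)
```

The result has the dimension of the word embeddings and unit norm, whatever the number of words in the session, which is what allows sessions of different lengths to be compared in the same space.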

In [4]:
from gensim.models import KeyedVectors

#Load the pretrained model
path_to_word2vec_model = 'frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin'
word2vec_model = KeyedVectors.load_word2vec_format(path_to_word2vec_model, binary=True)

In [5]:
sessions_vec = get_sessions_representation(sessions, word2vec_model)

Vectorize sessions...


  0%|          | 0/10815 [00:00<?, ?it/s]

Done!
Compute SVD sessions


  0%|          | 0/10815 [00:00<?, ?it/s]

Done!


In [6]:
sessions_vec = np.array(sessions_vec)

Once the SVD has been computed and each session has its representative vector, the Mapper algorithm can be used to get a global view of how the sessions are distributed in space.

The Mapper algorithm, a cornerstone of Topological Data Analysis (TDA), summarizes a high-dimensional dataset as a graph that reflects its topological structure. It works as follows:

1. **Projection (lens)**: The data is first projected into a low-dimensional space by a filter function, also called a lens. Techniques like Multidimensional Scaling (MDS) or Principal Component Analysis (PCA) are often employed for this purpose; in this notebook the lens is Isomap followed by UMAP.

2. **Cover Construction**: The image of the lens is covered with overlapping intervals (hypercubes in more than one dimension).

3. **Local Clustering**: For each element of the cover, the points whose projection falls inside it are clustered in the original space, here with DBSCAN and the cosine metric.

4. **Graph Construction and Visualization**: Each cluster becomes a node, and two nodes are connected whenever their clusters share at least one point, which can only happen where cover elements overlap. The resulting graph is then drawn for visual interpretation of the data structure.

By applying the Mapper algorithm to the representative vectors obtained from the SVD, one gains insight into the overall organization and distribution of the session data in the feature space. This facilitates a deeper understanding of the relationships and patterns present within the dataset.
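As an illustration of the idea behind Mapper, here is a toy sketch that projects points through a one-dimensional lens, covers the lens range with overlapping intervals, clusters the points of each interval, and links clusters that share points. It is not how `kmapper` is implemented: the names `toy_mapper` and `_cluster` are hypothetical, the lens is a single coordinate rather than Isomap followed by UMAP, and a simple single-linkage clustering stands in for DBSCAN to keep the example dependent on NumPy only.

```python
import numpy as np


def _cluster(X, idx, eps):
    """Single-linkage clusters: connected components of the eps-neighbour graph."""
    remaining = set(idx.tolist())
    clusters = []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            p = stack.pop()
            near = [q for q in remaining if np.linalg.norm(X[p] - X[q]) <= eps]
            for q in near:
                remaining.discard(q)
                comp.add(q)
                stack.append(q)
        clusters.append(tuple(sorted(comp)))
    return clusters


def toy_mapper(X, lens, n_intervals=4, overlap=0.3, eps=0.3):
    """Toy Mapper on a 1-D lens. Returns (nodes, edges).

    nodes: list of tuples of point indices, one tuple per cluster
    edges: set of (i, j) node-index pairs whose clusters share a point
    """
    lo, hi = lens.min(), lens.max()
    length = (hi - lo) / n_intervals
    pad = overlap * length / 2  # widen each interval on both sides
    nodes = []
    for k in range(n_intervals):
        a = lo + k * length - pad
        b = lo + (k + 1) * length + pad
        # Points whose lens value falls in the k-th interval of the cover
        idx = np.flatnonzero((lens >= a) & (lens <= b))
        nodes.extend(_cluster(X, idx, eps))
    edges = {
        (i, j)
        for i in range(len(nodes))
        for j in range(i + 1, len(nodes))
        if set(nodes[i]) & set(nodes[j])
    }
    return nodes, edges


# Toy data: 60 points on a circle, with the x-coordinate as lens
theta = np.linspace(0, 2 * np.pi, 60, endpoint=False)
X = np.column_stack([np.cos(theta), np.sin(theta)])
nodes, edges = toy_mapper(X, lens=X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```

On this toy circle every node ends up with exactly two neighbours, so the graph is a closed loop: the shape of the data survives the summary. The interval count, the overlap, and the clustering radius all influence the result, which is also true of the corresponding parameters in a real Mapper run.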

In [7]:
import kmapper as km
from kmapper.jupyter import display
from sklearn.manifold import Isomap
from umap import UMAP
from sklearn.cluster import DBSCAN

## 2.2 Mapper Algorithm

In [31]:
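mapper = km.KeplerMapper(verbose=0)

# Lens: Isomap down to 175 dimensions, then UMAP down to 2 (applied in sequence)
projected_data = mapper.fit_transform(sessions_vec, projection=[Isomap(n_components=175, n_jobs=1), UMAP(n_components=2)])

# Cluster the original session vectors within each cover element, using cosine distance
G = mapper.map(projected_data, sessions_vec, clusterer=DBSCAN(metric="cosine"))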

In [None]:
# Create a graph to visualize how the data is distributed.
mapper.visualize(G, 
                 path_html="data_structure.html", 
                 title='Data Structure',
                 color_values=ground_truth_labels,
                 color_function_name=["Ground Truth labels"],
                 node_color_function= np.array(['average', 'std', 'sum', 'min', 'max']))

# 3. Discussion
(coming soon)