S+t-SNE

S+t-SNE - Bringing dimensionality reduction to data streams

We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle infinite data streams. The core idea behind S+t-SNE is to update the t-SNE embedding incrementally as new data arrives, ensuring scalability and adaptability to handle streaming scenarios. By selecting the most important points at each step, the algorithm ensures scalability while keeping informative visualisations. Employing a blind method for drift management adjusts the embedding space, facilitating continuous visualisation of evolving data dynamics. Our experimental evaluations demonstrate the effectiveness and efficiency of S+t-SNE. The results highlight its ability to capture patterns in a streaming scenario. We hope our approach offers researchers and practitioners a real-time tool for understanding and interpreting high-dimensional data.

Contributors: Pedro C. Vieira¹ pedrocvieira@fc.up.pt, João P. Montrezol joao.antunes@fc.up.pt, João T. Vieira up201905419@fc.up.pt, João Gama jgama@fep.up.pt

Organization

images/paper_plus_extra/: images used in the paper
images/paper_plus_extra/extra/: images not used in the paper
Drift_Simulator/: scripts to create the dataset with drifted used in the paper
Drift_Simulator/Images/: 3d images of dataset with drift used in the paper
Demo/: scripts used for demonstration of results from S+t-SNE, the scripts include classes for testing time and memory complexity as well as KL divergence. The Run.py file is a compilation of parameters and methodologies of testing for measuring the algorithm and saving CSVs results.
ecs/: scripts used to run Exponential Cobweb Slicing (ECS)
STSNE.py: Main class of the S+t-SNE addapted for the River API structure.

Running S+t-SNE

Using S+t-SNE

The provided S+t-SNE implementation is based on River API, to run the python class one need to initialize it with a sample of the data on its constructor alongside the parameters for the t-SNE operations. Afterwards one can use one of two options to embed examples:

learn_one(x), to passively learn an example, adds an example to the current batch, waiting till the points threshold is met and the batch is flushed;
predict_one(x), to actively learn an example, adds an example to the current batch and forces the batch to be flushed;
flush_batch(), to force a batch to be flushed;

As for ECS, the hyperparameters used correspond to how the exponential decay behaves. Said parameters can be readily changed in subregion.py. The number of concentric regions can also be changed by modifying the layers_distance_ratio in convex_polygon.py. Each value in the list will be translated into a distance from the centroid and a concetric region will be constructed based on said distance. Any other change like how points are counted involve, for now, changing multiple functions.

Tests

Firstly, go to this drive folder and download the folder Demo/. Copy all folders and files to the Demo/ folder obtained from the repository. In the case of ProcessedAnalysisCSV/ directory move the files inside it to the corresponding directory in the repository. The same applies to the files inside Demo/ in the drive. As for the directory Results/ in the drive, its enough to just copy it to the Demo/ in the repository.

After completing that, one can simply explore the functions on the avilable jupyter notebooks to generate the plots with the available CSVs on the Results/ folder. To generate the CSVs by yourself, check if the file driftingPoints_shuffle.csv is available on the Demo/ folder and SKLearn library is available, afterwards one can personalize the test parameters on the file Demo_Params.py, and test routines on the file Run.py. Finally run the Run.py scripts to get new versions of the CSVs at the Results/ folder.

Notes

In the version of the paper freely available in ArXiv, the fourth paragraph of section 3.3 - Handling Drift is poorly worded. A better description would be:

Our solution involves using the convex hulls obtained from clustering and dividing them into parts. Each partition will employ blind drift detection by exponential decay based on the number of iterations in S+t-SNE. This allows parts without new points during a period (given by exponential decay) to disappear, ensuring consistency. We parameterise the exponential decay with three parameters, $\alpha = 0.88$, $\beta = 1.6$ and $\eta = 0.01$, yielding $N(t) = \alpha e^{−t\eta+\beta}$ where $t$ is the number of iterations. This expression encapsulates our deﬁnition of drift. In this conﬁguration, a polygon in iteration 200 will have section x cut if said section does not receive points for more than $N(200)$ iterations.

The version published in the conference should have everything corrected. We will probably update the file in ArXiv soon.

Better documentation for ECS will be added soon

Contributions

Feel free to improve the code!
If you have any questions do not hesitate to contact us (preferably the corresponding author) either by email or using GitHub functionalities! :)

Corresponding Author ↩

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
Demo		Demo
Drift_Simulator		Drift_Simulator
ecs		ecs
images/paper_plus_extra		images/paper_plus_extra
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
STSNE.py		STSNE.py
requirementsDemo.txt		requirementsDemo.txt
requirementsSTSNEFinal.txt		requirementsSTSNEFinal.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

S+t-SNE

S+t-SNE - Bringing dimensionality reduction to data streams

Organization

Running S+t-SNE

Using S+t-SNE

Tests

Notes

Contributions

About

Releases

Packages

Languages

License

MrXester/WIP-StSNE

Folders and files

Latest commit

History

Repository files navigation

S+t-SNE

S+t-SNE - Bringing dimensionality reduction to data streams

Organization

Running S+t-SNE

Using S+t-SNE

Tests

Notes

Contributions

Footnotes

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages