Skip to content

AntheSevenants/DistributionalSemanticsR

Repository files navigation

DistributionalSemanticsR

Tools for using distributional semantics in R

Tutorial on how to use this library

What is DistributionalSemanticsR?

DistributionalSemanticsR is a library with which you can easily use distributional semantics datasets. The main idea is that you have a dataset which contains distributional vectors for words, which you can then use to draw your words of choice on a plot. This is done by running dimension reduction on the words' vector values.

DistributionalSemanticsR handles the following aspects of the dataset conversion process:

  • retrieving the correct vectors from a distributional semantics dataset
  • computing the distance matrix for the words of your choice
  • applying dimension reduction on the distance matrix using the following techniques:
    • MDS
    • TSNE
    • UMAP
  • computing data point clusters using the following techniques:
    • k means
    • dbscan

The output of these regression fits can be attached to the output of ElasticToolsR.

Installing DistributionalSemanticsR

DistributionalSemanticsR is not available on the CRAN package manager (yet). To use it, simply copy the scripts from this repository to your R project's directory (preferably using git clone). From there, you can simply include the scripts you need. More information on what scripts to import is given below.

Using DistributionalSemanticsR

Importing the required scripts

source("VectorSpace.R")
source("DimensionReductionFactory.R")
source("Clustering.R")

⚠ "VectorSpace.R", "DimensionReductionFactory.R" and "Clustering.R" should be present in your R script's directory.

Vector space

Defining a vector space

If you want to retrieve distributional vectors from a distributional dataset, you first have to define a VectorSpace instance. The constructor for the VectorSpace class takes the following arguments:

parameter type description example
space array a matrix of size $rows \times vector dimensions$ containing all distributional semantics vectors; rows must be named! /

E.g., to load a snaut dataset:

vector_space_ <- vector_space(space=as.matrix(read.table(gzfile('dutch.gz'),
                                                         sep=' ',
                                                         row.names=1,
                                                         skip = 3)))

Building your own distributional matrix

Now, we want to create our own matrix with dimensions $number of words \times vector dimensions$. Use the get_distributional_values method. This will return an R matrix. There is one argument:

parameter type description example
words list(character) the list of words you want to get distributional values for list("aap", "huis")
distributional_coords <- vector_space_$get_distributional_values(list("aap", "huis"))

Dimension reduction

Defining a dimension reduction factory

With your distributional coordinates collected, you can now define a DimensionReductionFactory instance. This factory will produce all your dimension reduced coordinates! The constructor for the DimensionReductionFactory class takes one argument:

parameter type description example
distributional_coords matrix the matrix containing the vectors for your words /
dim_reduce <- dimension_reduction_factory(distributional_coords)

Running MDS

To run MDS, simply run $do_mds() on your dimension reduction factory. The method will return a dataframe with two columns: mds_x and mds_y.

coords.mds <- dim_reduce$do_mds()

Running TSNE

To run TSNE, simply run $do_tsne() on your dimension reduction factory. The method will return a dataframe with two columns: tsne_x and tsne_y.

coords.tsne <- dim_reduce$do_tsne()

Running UMAP

To run UMAP, simply run $do_umap() on your dimension reduction factory. The method will return a dataframe with two columns: umap_x and umap_y.

coords.umap <- dim_reduce$do_umap()

Clustering

Defining a Clustering instance

Once your distributional coordinates have undergone dimension reduction, you can apply a clustering algorithm to sort the data points into clusters. You can do this by defining a Clustering instance. The constructor for the Clustering class takes two arguments:

parameter type description example
dimension_reduction_technique character the name of the dimension reduction technique you are using (will be used for column names) "mds"
coords data.frame a dataframe containing your coordinates; columns naming scheme: "mds.x", "mds.y" /
clustering_ <- clustering("mds", coords.mds)

Running k means

To run k means on your coordinates, run $do_k_means on the clustering instance. It will return a list of indicies corresponding to the respective clusters of your data points. The method has one argument:

parameter type description example
k number the number of desired clusters 5
clusters.mds <- clustering_$do_k_means(5)
Running k means in batch

You can also try a number of different k values in one go. To do this, run $do_k_means_batch with a vector of k values you want to try:

parameter type description example
k_range vector a vector of k values you want to try 1:10
clusters.batch.mds <- clustering_$do_k_means_batch(1:10)

The function will return a named list with the following values:

parameter type description example
k_range vector a vector of k values tried 1:10
k_names vector a vector of column names for the k values tried c("cluster.mds.kmeans_k_2", ...)
k_results vector a vector of vectors containing the clustering information output from $do_k_means /

Running dbscan

To run dbscan on your coordinates, run $do_dbscan on the clustering instance. It will return a list of indicies corresponding to the respective clusters of your data points. The method has one argument:

parameter type description example
min_neighbourhood number the minimum neighbourhood for the dbscan procedure 5
clusters.mds <- clustering_$do_dbscan(5)
Running dbscan in batch

You can also try a number of different neighbourhood values in one go. To do this, run $do_dbscan_batch with a vector of neighbourhood values you want to try:

parameter type description example
neighbourhood_range vector a vector of neighbourhood values you want to try 2:10
clusters.batch.mds <- clustering_$do_dbscan_batch(2:10)

The function will return a named list with the following values:

parameter type description example
k_range vector a vector of neighbourhood values tried 2:10
k_names vector a vector of column names for the neighbourhood values tried c("cluster.mds.dbscan_k_2", ...)
k_results vector a vector of vectors containing the clustering information output from $do_dbscan /

⚠ The names of the named list are taken from the k means method in order to guarantee compatibility across clustering mechanisms.

About

Tools for using distributional semantics in R

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages