GitHub - Stepka/telegram_clustering_contest: CLI client for Telegram Data Clustering contest. Made using Metric, Boost, Lapack and C++17.

Overview

CLI client for the Telegram Data Clustering contest.

Made using Metric, Boost, Lapack and C++17.

The core idea behind my solution is:

Take pre-trained Word2Vec vocabularies for English and Russian
Cluster each vocabulary into a large number of classes using the Metric framework
For each text, calculate an embedding:
- create a zero vector whose size equals the number of clusters in the Word2Vec vocabulary
- for each word in the text, look up the index of its cluster
- increment the vector element at that index
- at the end we have a single vector for each text
Once we have embedding vectors for the texts, we can cluster them, compare them with categories (which are converted to embeddings in the same way as texts), etc.
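The embedding steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the names `WordClusters` and `embed_text` are hypothetical, the word-to-cluster map is assumed to be loaded already, and skipping words that are missing from the vocabulary is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: maps each vocabulary word to the index of its Word2Vec cluster.
using WordClusters = std::unordered_map<std::string, std::size_t>;

// Builds a "bag of clusters" embedding: one counter per cluster,
// incremented for every word of the text that belongs to that cluster.
// Words missing from the vocabulary are skipped (an assumption of this sketch).
std::vector<double> embed_text(const std::vector<std::string>& words,
                               const WordClusters& word_to_cluster,
                               std::size_t num_clusters)
{
    std::vector<double> embedding(num_clusters, 0.0);
    for (const auto& word : words) {
        const auto it = word_to_cluster.find(word);
        if (it == word_to_cluster.end()) {
            continue;  // unknown word: contributes nothing
        }
        assert(it->second < num_clusters);  // cluster index must fit the vector
        embedding[it->second] += 1.0;
    }
    return embedding;
}
```

Because every text maps to a vector of the same length, texts and category descriptions can be compared with any vector distance.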

I chose technologies according to the following criteria (in order of priority):

speed
accuracy
recall

Further improvements:

Tune hyperparameters using supervised learning
Use multithreading and batch sampling
Parse HTML to omit tags etc.

Tasks

languages

I detect a language by counting the relative number of a text's words that appear in the vocabulary of that language's most frequent words.

I take num_language_samples random samples from the text and look them up among the top 100 words of each language. If the relative number of found words is greater than language_score_min_level, the language is considered detected.

See assets/vocabs/top_english_words.voc and assets/vocabs/top_russian_words.voc
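A rough sketch of this scoring is shown below. It assumes that the samples are individual words drawn without replacement and that the top-words vocabulary is already loaded into a set; the function name `language_score` is hypothetical, while `num_language_samples` and `language_score_min_level` mirror the params named above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Returns the fraction of sampled words that occur in the language's
// top-words vocabulary. Sampling is without replacement; if the text has
// fewer words than num_language_samples, all words are used.
double language_score(const std::vector<std::string>& words,
                      const std::unordered_set<std::string>& top_words,
                      std::size_t num_language_samples,
                      std::mt19937& rng)
{
    if (words.empty() || num_language_samples == 0) {
        return 0.0;
    }
    std::vector<std::string> samples;
    std::sample(words.begin(), words.end(), std::back_inserter(samples),
                num_language_samples, rng);
    assert(!samples.empty());

    std::size_t found = 0;
    for (const auto& word : samples) {
        found += top_words.count(word);  // count() is 0 or 1 for a set
    }
    return static_cast<double>(found) / static_cast<double>(samples.size());
}
```

A text would then be assigned to a language when `language_score(...) > language_score_min_level` for that language's vocabulary.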

Further improvements:

Tune the params num_language_samples and language_score_min_level

news

News is a "What? Where? When?" text, so this task can be solved with named entity recognition and date extraction. In this section I extract dates, calculate their average, and if the average date is fresh, the text is considered news.
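The freshness check could look roughly like the sketch below. It assumes the dates have already been extracted from the text and that "fresh" means the average date is at most `freshness_days` days before a reference time; the function name `is_fresh` and the reference time argument are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <ratio>
#include <vector>

using Clock = std::chrono::system_clock;
using Days = std::chrono::duration<double, std::ratio<86400>>;

// Returns true if the average of the extracted dates is no older than
// freshness_days relative to `now`. Averaging the ages is equivalent to
// taking the age of the average date. A text without dates is treated
// as not fresh (an assumption of this sketch).
bool is_fresh(const std::vector<Clock::time_point>& dates,
              Clock::time_point now,
              double freshness_days)
{
    assert(freshness_days >= 0.0);
    if (dates.empty()) {
        return false;
    }
    double sum_age_days = 0.0;
    for (const auto& date : dates) {
        sum_age_days += std::chrono::duration_cast<Days>(now - date).count();
    }
    const double mean_age_days = sum_age_days / static_cast<double>(dates.size());
    return mean_age_days <= freshness_days;
}
```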

Further improvements:

Use NER to find the answer to the "What?" question.
Calculate entropy for dates using the Metric framework.
Tune the freshness_days param

threads

As above, we calculate embeddings for the texts and then run clustering over the calculated embeddings. The title is extracted from the HTML tag. Relevance is calculated as the distance from the text embedding to the cluster's centroid, closest first.
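The relevance ordering inside one cluster can be sketched as below. Euclidean distance and the helper names (`centroid`, `euclidean`, `rank_by_relevance`) are assumptions made for illustration; the clustering itself is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

using Embedding = std::vector<double>;

// Component-wise mean of all embeddings in a non-empty cluster.
Embedding centroid(const std::vector<Embedding>& cluster)
{
    assert(!cluster.empty());
    Embedding c(cluster.front().size(), 0.0);
    for (const auto& e : cluster) {
        assert(e.size() == c.size());
        for (std::size_t i = 0; i < c.size(); ++i) c[i] += e[i];
    }
    for (auto& v : c) v /= static_cast<double>(cluster.size());
    return c;
}

// Euclidean distance (assumed here; other metrics could be swapped in).
double euclidean(const Embedding& a, const Embedding& b)
{
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Returns indices of the cluster's texts, closest to the centroid first.
std::vector<std::size_t> rank_by_relevance(const std::vector<Embedding>& cluster)
{
    const Embedding c = centroid(cluster);
    std::vector<std::size_t> order(cluster.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t l, std::size_t r) {
                         return euclidean(cluster[l], c) < euclidean(cluster[r], c);
                     });
    return order;
}
```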

Further improvements:

Tune the params eps and minpts, i.e. the distance within a cluster and the minimum number of points in a cluster.
Try Affinity Propagation
Try cosine distance for clustering

top

Here I take the result of clustering the texts and sort it by importance. I assume that a thread is important if it is fresh and contains a lot of publications, i.e. a lot of articles inside. The combination of these two params is the sort criterion.
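One possible way to combine the two params is sketched below. The exact formula is not given here, so the score (article count divided by one plus the thread's age in days) is purely an illustrative assumption, as are the `Thread` struct and function names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical thread summary used only for this sketch.
struct Thread {
    std::string title;
    std::size_t article_count;  // number of articles in the thread
    double age_days;            // how old the thread is, in days
};

// Illustrative score: more articles and a smaller age both raise importance.
double importance(const Thread& t)
{
    assert(t.age_days >= 0.0);
    return static_cast<double>(t.article_count) / (1.0 + t.age_days);
}

// Sorts threads in place, most important first.
void sort_by_importance(std::vector<Thread>& threads)
{
    std::sort(threads.begin(), threads.end(),
              [](const Thread& l, const Thread& r) {
                  return importance(l) > importance(r);
              });
}
```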

Tools

The client uses predefined and pretrained vocabularies.

First of all, these are:

pretrained Word2Vec on English Google News (can be found here: https://code.google.com/archive/p/word2vec/, License: Apache License 2.0)
pretrained Word2Vec on Russian news (can be found here: https://rusvectores.org/ru/models/, License: CC-BY)
crowd-annotated vocabulary for Russian morphology (can be found here: http://opencorpora.org/?page=downloads, License: Unknown)

The vocabs can then be cut and processed for use with the current client with these tools:

cut_word2vec - converts the original Word2Vec vocab to a short one that is ready for use in the current client. Takes two arguments: the path to the original Word2Vec vocab and the number of words to keep in the resulting vocab.
cluster_word2vec - clusters the cut Word2Vec vocab and saves the result. Takes two arguments: the path to the cut Word2Vec vocab and the number of clusters.
convert_tags_corpora - converts morphology tags from the vocab format to Universal POS. Takes two arguments: the path to the morphology vocab and the number of words to keep in the resulting vocab.

Compile using CMake

You need Metric, Boost, Lapack and C++17 support to compile.

Windows

mkdir build
cd build
cmake .. -T llvm -A x64  -DBoost_NAMESPACE="libboost" -DBoost_COMPILER="-vc141"  -DSTATIC_LINKING=false

Then open the solution in Microsoft Visual Studio

Linux

Just run cmake

mkdir build
cd build
cmake ..  -DSTATIC_LINKING=true
make
