Kirili4ik/NL2ML-corpus

NL2ML corpus

Natural Language to Machine Learning (NL2ML) corpus, built as a coursework project.

Workflow

[Figure: workflow diagram]

Presentation with explanation:

https://github.com/Kirili4ik/pres-n-articles/blob/master/corpus_NL2ML_Presentation.pdf

Data collected and labeled by experts:

https://docs.google.com/spreadsheets/d/1gDhVdq2GktuWXh7hDyt_js335Xbvsw57iSNh_wEaUxE/

Data parsed from Kaggle:

https://yadi.sk/d/kvnqRG6ngt8emw

markup_complete - Python 3 code snippets with 6 binary columns that stand for KG (knowledge graph) nodes (> 100 000 snippets).

chunks_30_final - Python 3 code split into chunks of 30 lines each (2 574k rows).

code_blocks_final - Python 3 code split into blocks (from .ipynb notebooks) in which the authors left comments (2 211k rows).
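As a rough illustration of how files like chunks_30_final and code_blocks_final could be produced, here is a minimal sketch using only the Python standard library. The function names `split_into_chunks` and `commented_code_cells` are hypothetical and not taken from this repository, and the rule for deciding that a block "has a comment" (any line starting with `#`) is an assumption; the author's actual parsing code may differ.

```python
import json
from typing import List

CHUNK_SIZE = 30  # lines per chunk, matching the "30" in chunks_30_final


def split_into_chunks(source: str, size: int = CHUNK_SIZE) -> List[str]:
    """Split source code into consecutive chunks of at most `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def commented_code_cells(notebook_json: str) -> List[str]:
    """Return the code cells of an .ipynb document that contain a comment.

    Assumption: a cell counts as commented if any of its lines starts
    with '#' (after leading whitespace).
    """
    notebook = json.loads(notebook_json)
    blocks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        src = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        code = "".join(src) if isinstance(src, list) else src
        if any(line.lstrip().startswith("#") for line in code.splitlines()):
            blocks.append(code)
    return blocks
```

With this sketch, a 65-line script yields three chunks (30, 30, and 5 lines), and a notebook is reduced to the code cells that carry at least one comment line.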

Work files

first_attempt_baseline - Naive Bayes classifier solution.

pre-preprocessing - basic preprocessing for finding the most popular comments and trying out stemming/lemmatization.

regular_expressions+LogReg_code2vec - generating tags from the KG and regular expressions, plus examining the F1 scores of code2vec and logistic regression.
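The regular-expression tagging and F1 evaluation mentioned above can be sketched roughly as follows. The tag names and patterns in `TAG_PATTERNS` are hypothetical, chosen only for illustration; the six KG nodes actually used in the corpus and the expressions behind them are not listed in this README. `tag_snippet` and `f1_score` are likewise illustrative names.

```python
import re
from typing import Dict, List

# Hypothetical KG node names and patterns (illustration only; the real
# nodes and expressions used for the corpus are not given here).
TAG_PATTERNS = {
    "import": re.compile(r"^\s*(import|from)\s+\w+", re.MULTILINE),
    "data_load": re.compile(r"\bread_(csv|json|excel)\s*\("),
    "model_fit": re.compile(r"\.fit\s*\("),
}


def tag_snippet(code: str) -> Dict[str, int]:
    """Return one binary (0/1) column per tag for a code snippet."""
    return {tag: int(bool(p.search(code))) for tag, p in TAG_PATTERNS.items()}


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

Each snippet gets a row of 0/1 values, one per tag, which is the same shape as the binary columns described for markup_complete. The `f1_score` helper only spells out the metric's definition; it does not reproduce any score from the notebook.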

Code2vec implementation:

https://github.com/Kirili4ik/code2vec

Related work:

https://github.com/whatevernevermindbro/source_code_classification

https://github.com/ramazyant/nl2ml
