Natural Language to Machine Learning (NL2ML) corpus, built as part of my coursework.
https://github.com/Kirili4ik/pres-n-articles/blob/master/corpus_NL2ML_Presentation.pdf
https://docs.google.com/spreadsheets/d/1gDhVdq2GktuWXh7hDyt_js335Xbvsw57iSNh_wEaUxE/
https://yadi.sk/d/kvnqRG6ngt8emw
markup_complete - Python 3 code snippets labeled with 6 binary columns, one per knowledge graph (KG) node (more than 100,000 snippets).
chunks_30_final - Python 3 code split into chunks of 30 rows each (2,574k rows).
code_blocks_final - Python 3 code from .ipynb files, divided into blocks where the authors left comments (2,211k rows).
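
How the two splitting strategies above could work is sketched below. This is only an illustration, not the code used to build the corpus: the function names are hypothetical, and "where authors left comments" is assumed here to mean code cells that contain at least one line starting with "#".

```python
import json

CHUNK_SIZE = 30  # rows per chunk, matching the chunks_30_final description


def split_into_chunks(source, size=CHUNK_SIZE):
    """Split source code into consecutive chunks of at most `size` lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def commented_code_cells(notebook_json):
    """Return the code cells of an .ipynb document that contain a '#' comment line.

    Assumption: a block "with comments" is a code cell with at least one
    line whose first non-blank character is '#'.
    """
    notebook = json.loads(notebook_json)
    blocks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        src = cell.get("source", "")
        # The notebook format stores source as a string or a list of strings.
        text = "".join(src) if isinstance(src, list) else src
        if any(line.lstrip().startswith("#") for line in text.splitlines()):
            blocks.append(text)
    return blocks
```

For example, a 65-line file passed to split_into_chunks yields three chunks of 30, 30 and 5 lines.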
first_attempt_baseline - baseline solution using a Naive Bayes classifier.
pre-preprocessing - basic preprocessing to find the most popular comments and to try stemming and lemmatization.
regular_expressions+LogReg_code2vec - generating tags from the KG with regular expressions, plus a look at the F1 scores of code2vec and logistic regression.
https://github.com/Kirili4ik/code2vec
https://github.com/whatevernevermindbro/source_code_classification
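
The regular-expression tagging and F1 evaluation mentioned for regular_expressions+LogReg_code2vec can be illustrated with a minimal sketch. The tag names and patterns below are hypothetical (the six KG nodes are not listed in this README), and the F1 helper is a plain-Python stand-in for whatever metric implementation the notebooks actually use.

```python
import re

# Hypothetical tags and patterns, for illustration only; the real corpus
# uses six binary columns whose names are not given in this README.
TAG_PATTERNS = {
    "data_import": re.compile(r"\bread_csv\s*\(|\bopen\s*\("),
    "model_train": re.compile(r"\.fit\s*\("),
    "visualization": re.compile(r"\bplt\.\w+\s*\(|\.plot\s*\("),
}


def tag_snippet(code):
    """Map a code snippet to one binary (0/1) value per tag."""
    return {tag: int(bool(pattern.search(code)))
            for tag, pattern in TAG_PATTERNS.items()}


def binary_f1(y_true, y_pred):
    """F1 score for one binary column: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Tags produced this way can serve as labels for a classifier such as logistic regression, with binary_f1 computed separately for each column.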