NL2ML corpus

Natural Language to Machine Learning (NL2ML) corpus. A coursework project of mine.

Workflow

markup_complete - Python 3 code snippets with 6 binary columns that stand for knowledge graph (KG) nodes (> 100,000 snippets)

chunks_30_final - Python 3 code, split every 30 rows (2,574k rows)

code_blocks_final - Python 3 code from .ipynb files, divided into blocks where authors left comments (2,211k rows)
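
A minimal sketch of the two splitting strategies described above. It assumes "rows" means lines of source code and that a new block starts wherever a run of `#` comment lines begins. The function names `chunk_lines` and `split_at_comments` are hypothetical and are not taken from the notebooks in this repository.

```python
from typing import List


def chunk_lines(source: str, size: int = 30) -> List[str]:
    """Split source code into consecutive chunks of at most `size` lines."""
    if size <= 0:
        raise ValueError("size must be positive")
    lines = source.splitlines()
    return ["\n".join(lines[i:i + size]) for i in range(0, len(lines), size)]


def split_at_comments(source: str) -> List[str]:
    """Start a new block at the first line of every run of '#' comment lines.

    This splitting rule is an assumption; the actual rule used to build
    code_blocks_final may differ.
    """
    blocks: List[List[str]] = []
    prev_is_comment = False
    for line in source.splitlines():
        is_comment = line.lstrip().startswith("#")
        # Open a new block at the very first line, or when a comment
        # follows a non-comment line.
        if not blocks or (is_comment and not prev_is_comment):
            blocks.append([])
        blocks[-1].append(line)
        prev_is_comment = is_comment
    return ["\n".join(block) for block in blocks]


if __name__ == "__main__":
    sample = "\n".join(f"x{i} = {i}" for i in range(65))
    print([len(c.splitlines()) for c in chunk_lines(sample)])
```

With a 65-line input, `chunk_lines` yields chunks of 30, 30 and 5 lines, so the last chunk may be shorter than the others.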

Work files

first_attempt_baseline - Naive Bayes classifier solution.

pre-preprocessing - basic preprocessing to find the most popular comments and to try out stemming/lemmatization.

regular_expressions+LogReg_code2vec - making tags from the KG with regular expressions, then looking at the F1 scores of code2vec and logistic regression
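
A rough sketch of regex-based tagging and F1 scoring, to illustrate the idea behind the last notebook. The tag names and patterns in `TAG_PATTERNS` are hypothetical placeholders, since the actual KG nodes are not listed in this README. The `f1_score` helper is a plain-Python stand-in for whatever metric implementation the notebook uses.

```python
import re
from typing import Dict, List

# Hypothetical tags and patterns, for illustration only; the real KG nodes
# and regular expressions live in the notebooks and are not reproduced here.
TAG_PATTERNS = {
    "import": re.compile(r"^\s*(?:import|from)\s+\w+", re.MULTILINE),
    "data_load": re.compile(r"\bread_csv\s*\("),
    "model_fit": re.compile(r"\.fit\s*\("),
}


def tag_snippet(code: str) -> Dict[str, int]:
    """Return one binary column per tag: 1 if its pattern matches, else 0."""
    return {tag: int(bool(p.search(code))) for tag, p in TAG_PATTERNS.items()}


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    snippet = "import pandas as pd\ndf = pd.read_csv('a.csv')"
    print(tag_snippet(snippet))
    print(f1_score([1, 0, 1, 1], [1, 1, 0, 1]))
```

Computing F1 per tag column, rather than one score over all tags, makes it easier to see which tags a classifier handles poorly.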

Code2vec implementation:

https://github.com/Kirili4ik/code2vec

Related works:

https://github.com/whatevernevermindbro/source_code_classification

https://github.com/ramazyant/nl2ml

