This is a helper project for InterviewInsightMine that tags the questions in its scraped data.
InterviewInsightMine has collected more than 9000 records, but analyzing them is difficult because they are not tagged.
The approach is to train a model on data from the StackOverflow and StackExchange websites, which publish the Stack Exchange Data Dumps.
In this project, we are interested in the Posts file which contains the question and the tags.
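
As a rough illustration of reading the Posts file, the sketch below streams `<row>` elements and keeps only questions. It assumes the usual data dump layout, where questions have `PostTypeId="1"` and the `Tags` attribute looks like `<r><regression>` (some dumps use `|r|regression|` instead). The names `iter_questions` and `split_tags` are hypothetical and not taken from this project.

```python
import io
import re
import xml.etree.ElementTree as ET

TAG_PATTERN = re.compile(r"<([^<>]+)>")


def split_tags(raw):
    """Split a Tags attribute into a list of tag names."""
    if "<" in raw:
        # Assumed format: "<r><regression>"
        return TAG_PATTERN.findall(raw)
    # Assumed alternative format: "|r|regression|"
    return [tag for tag in raw.split("|") if tag]


def iter_questions(source):
    """Yield (title, tags) for every question row in a Posts XML file.

    `source` may be a file path or a binary file-like object.
    """
    for _event, row in ET.iterparse(source, events=("end",)):
        if row.tag == "row" and row.get("PostTypeId") == "1":
            yield row.get("Title", ""), split_tags(row.get("Tags", ""))
        row.clear()  # release the element once it has been processed


# Tiny in-memory stand-in for Posts.xml (made-up rows, for illustration only)
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="How to fit a model?" Tags="&lt;r&gt;&lt;regression&gt;" />
  <row Id="2" PostTypeId="2" Body="An answer" />
</posts>"""

questions = list(iter_questions(io.BytesIO(SAMPLE)))
print(questions)
```

Streaming with `iterparse` avoids loading the whole file into memory, which matters for the larger dumps.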
The first iteration of this project is done on stats.meta.stackexchange.com.7z.
Because GPU power is limited and not all tags are needed, I extracted only the top 50 tags.
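
One possible way to pick the top 50 tags is sketched below. It assumes "top" means most frequent, and the function names `top_k_tags` and `filter_to_vocabulary` are hypothetical.

```python
from collections import Counter


def top_k_tags(tag_lists, k=50):
    """Return the k most frequent tags across all questions."""
    counts = Counter(tag for tags in tag_lists for tag in tags)
    return [tag for tag, _count in counts.most_common(k)]


def filter_to_vocabulary(tag_lists, vocabulary):
    """Drop every tag that is not in the chosen vocabulary."""
    keep = set(vocabulary)
    return [[tag for tag in tags if tag in keep] for tags in tag_lists]


# Made-up tag lists, for illustration only
tag_lists = [
    ["r", "regression"],
    ["r", "python"],
    ["r", "regression", "anova"],
]

vocabulary = top_k_tags(tag_lists, k=2)
filtered = filter_to_vocabulary(tag_lists, vocabulary)
print(vocabulary)
print(filtered)
```

Questions left with no tags after filtering could be dropped or kept as all-zero targets; which one this project does is not stated here.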
The preprocessing is basic:
- Removing stop words
- Lowercasing all strings
- Stemming the words
- Removing slashes and other symbols
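
The steps above can be sketched in plain Python as follows. This is a simplified stand-in: the stop-word set is a tiny hand-picked sample, `naive_stem` is a crude suffix stripper rather than a real stemmer such as Porter, and the order of the steps is an assumption.

```python
import re

# Tiny illustrative stop-word set (assumption; a real list would be much longer)
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "in",
              "how", "do", "i", "and", "for", "on", "with"}

# Suffixes tried in order by the naive stemmer
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def preprocess(text):
    """Lowercase, remove symbols, drop stop words, and stem."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # slashes and other symbols become spaces
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return [naive_stem(tok) for tok in tokens]


tokens = preprocess("How do I plot the fitted/residual values?")
print(tokens)  # ['plot', 'fitt', 'residual', 'valu']
```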
The data is then vectorized with a TF-IDF vectorizer and fed into the convolutional model.
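
To make the TF-IDF step concrete, here is a minimal plain-Python sketch. It uses one common smoothed IDF variant with L2 normalization; the project's actual vectorizer and its settings may differ, and `fit_idf` and `tfidf_vector` are hypothetical names.

```python
import math
from collections import Counter


def fit_idf(documents):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}


def tfidf_vector(doc, idf, vocabulary):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    weights = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


# Made-up tokenized questions, for illustration only
documents = [["plot", "residual"], ["plot", "regress"]]

idf = fit_idf(documents)
vocabulary = sorted(idf)  # ['plot', 'regress', 'residual']
vector = tfidf_vector(documents[0], idf, vocabulary)
print(vocabulary)
print(vector)
```

Terms that appear in fewer documents get a larger weight, so "residual" outweighs "plot" in the first vector.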
The prediction is a vector of 50 elements, each between 0 and 1, giving the probability that the corresponding tag is associated with the question.
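
A small sketch of how such vectors relate to tag names is given below, using 3 tags instead of 50 to keep it readable. The 0.5 threshold and the names `multi_hot` and `decode_tags` are assumptions made for illustration.

```python
def multi_hot(tags, vocabulary):
    """Encode a question's tags as a 0/1 target vector over the vocabulary."""
    present = set(tags)
    return [1.0 if tag in present else 0.0 for tag in vocabulary]


def decode_tags(probabilities, vocabulary, threshold=0.5):
    """Return the tags whose predicted probability reaches the threshold."""
    return [tag for tag, p in zip(vocabulary, probabilities) if p >= threshold]


# Made-up vocabulary and prediction, for illustration only
vocabulary = ["r", "regression", "python"]

target = multi_hot(["regression", "r"], vocabulary)
predicted_tags = decode_tags([0.91, 0.48, 0.07], vocabulary)
print(target)          # [1.0, 1.0, 0.0]
print(predicted_tags)  # ['r']
```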
The loss is the MSE between the actual and predicted vectors. Current testing loss: 0.0268
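
For reference, the MSE between one target vector and one prediction can be computed as in this sketch. The vectors are made up and shortened to 4 elements; the reported testing loss would presumably be this quantity averaged over the test set.

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length vectors."""
    if len(actual) != len(predicted):
        raise ValueError("vectors must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up target and prediction, for illustration only
actual = [1.0, 0.0, 0.0, 1.0]
predicted = [0.9, 0.2, 0.1, 0.6]

loss = mse(actual, predicted)
print(round(loss, 4))  # 0.055
```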