The base code in 'bt.py', other than the 'naive_bayes_classify()' and 'markov_model_classify()' functions, was provided by the professor of the course.
The models are trained on data from www.kaggle.com/c/sentiment-analysis-on-movie-reviews. They predict the most likely sentiment of a movie review phrase using a unigram model (Naive Bayes) and a bigram model (Markov model).
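As a rough sketch of the unigram approach (the function names and data layout below are hypothetical, not the course's actual code), a Naive Bayes classifier picks the class maximizing the log prior plus the summed log word likelihoods. Maximum-likelihood estimates without smoothing are used here, which is consistent with the sample log probabilities shown later in this README:

```python
import math
from collections import Counter, defaultdict

def train_unigram(rows):
    """rows: (phrase, sentiment) pairs; returns per-class word and class counts."""
    word_counts = defaultdict(Counter)   # sentiment -> word frequency
    class_counts = Counter()             # sentiment -> number of phrases
    for phrase, sentiment in rows:
        class_counts[sentiment] += 1
        word_counts[sentiment].update(phrase.split())
    return word_counts, class_counts

def naive_bayes_classify(phrase, word_counts, class_counts):
    """Return (sentiment, log probability) maximizing the unigram model."""
    total = sum(class_counts.values())
    best = (None, float('-inf'))
    for c in class_counts:
        n = sum(word_counts[c].values())
        logp = math.log(class_counts[c] / total)   # log prior
        for w in phrase.split():
            count = word_counts[c][w]
            # unsmoothed MLE: an unseen word rules the class out entirely
            logp += math.log(count / n) if count else float('-inf')
        if logp > best[1]:
            best = (c, logp)
    return best
```

On the toy data below, this sketch reproduces the sample Bayes outputs (class 4 with log probability ln(1/40) ≈ -3.689 for "happy happy happy").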
1 1 happy happy joy joy 4
2 2 happy meh joy meh 3
3 3 meh meh meh meh 2
4 4 blah meh blah meh 1
5 5 blah ugh blah ugh 0
---
happy happy happy
happy meh
Lines before the '---' separator are the training sentences with the columns 'PhraseId', 'SentenceId', 'Phrase', and 'Sentiment'; lines after it are the sentences to be classified.
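Splitting such a file at '---' could be done with a small helper like the one below (the name 'read_input' and the tab-separated column layout, as in train.tsv, are assumptions for illustration):

```python
def read_input(path):
    """Split the file into training rows and test sentences at the '---' line."""
    train, test = [], []
    seen_sep = False
    with open(path) as f:
        for line in f:
            line = line.rstrip('\n')
            if line == '---':
                seen_sep = True
            elif seen_sep:
                test.append(line)
            else:
                # columns: PhraseId, SentenceId, Phrase, Sentiment
                _, _, phrase, sentiment = line.split('\t')
                train.append((phrase, int(sentiment)))
    return train, test
```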
4
-3.6888794541139363
4
-3.6888794541139363
3
-3.6888794541139363
3
-2.995732273553991
For each sentence to be classified in the input file, the output contains four lines: the predicted sentiment class and its log probability from the Bayesian (unigram) model, followed by the predicted class and its log probability from the Markovian (bigram) model.
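A bigram (Markov) classifier consistent with the sample log probabilities above scores a sentence with the class prior, the unigram probability of the first word, and maximum-likelihood bigram transitions. The names below are a hypothetical sketch, not the course's 'markov_model_classify()' itself:

```python
import math
from collections import Counter, defaultdict

def train_bigram(rows):
    """Per-class unigram counts (for the first word) and bigram transition counts."""
    unigrams = defaultdict(Counter)   # sentiment -> word frequency
    trans = defaultdict(Counter)      # (sentiment, prev word) -> next-word counts
    class_counts = Counter()
    for phrase, sentiment in rows:
        words = phrase.split()
        class_counts[sentiment] += 1
        unigrams[sentiment].update(words)
        for prev, nxt in zip(words, words[1:]):
            trans[(sentiment, prev)][nxt] += 1
    return unigrams, trans, class_counts

def markov_model_classify(phrase, unigrams, trans, class_counts):
    """Return (sentiment, log probability) maximizing the bigram model."""
    total = sum(class_counts.values())
    words = phrase.split()
    best = (None, float('-inf'))
    for c in class_counts:
        logp = math.log(class_counts[c] / total)            # log prior
        n = sum(unigrams[c].values())
        first = unigrams[c][words[0]]
        logp += math.log(first / n) if first else float('-inf')
        for prev, nxt in zip(words, words[1:]):             # transition terms
            cnt = trans[(c, prev)][nxt]
            out = sum(trans[(c, prev)].values())
            logp += math.log(cnt / out) if cnt else float('-inf')
        if logp > best[1]:
            best = (c, logp)
    return best
```

On the toy data above, this gives "happy happy happy" class 4 at ln(1/40) ≈ -3.689 and "happy meh" class 3 at ln(1/20) ≈ -2.996, matching the sample output.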
- by.py - Python code for the Bayesian and Markovian models
- Text files 'bayesTest.txt', 'smallTest.txt', 'train.tsv' - Training data followed by the sentences to be classified
- Output files 'bayesTest.out', 'smallTest.out', 'train.out' - Output of the sentiment classification.