Emoetion

INTRODUCTION

Anime is a popular form of media in Japan that is gaining popularity in the United States. Since most of the shows have Japanese audio, most commentary on anime hinges on bilingual elements from both English and Japanese using the Roman alphabet. This poses several challenges in analyzing text critiquing anime. We are hoping to build a system that given an anime can predict the common sentiment and its rating on myanimelist.com by using Twitter comments mentioning the specified show.

STRATEGY FOR SOLVING THE PROBLEM

Building the classifier We plan on making two classifiers to get a clear comparison of which works better with our answering scheme. The first will be a Boolean Naive Baye’s Classifier that uses Laplace Smoothing (AddOne Smoothing) to improve statistical analysis. Including Laplace Smoothing adds another layer of inclusion for outofvocabulary words that appear; normally, the Naive Bayes calculation for a particular sentence with an outofvocabulary word would have a probability of 0 which contradicts to the very existence of the sentence and its according nonzero probability. The other classifier is a Maximum Entropy classifier that has no assumptions and aims for a uniform distribution despite that; it will handle the outofvocabulary supposedly better than the Boolean Naive Baye’s classifier. For both classifiers, we plan on using the Sentiment140 corpus of tweets as our training data set to establish our vocabulary..
Scraping and Tokenizing Tweets We will leveraging the Reddit API to gather posts. We will search query them by post and topic. We will query by discussion with topics of animerelated and consider each thread of the discuss as an opinion that we will analyze. Then we have to normalize them to accommodate emoticons, emojis, misspellings, abbreviations, usernames, colloquialism (“loooooool” is essentially “lol”) to name a few.
Evaluation We plan on using the rankings and ratings on imdb.com as a comparison to check our system. We will use Mean Average Precision that represents the average precision up to some cutoff (in terms of some ranking number, some recall number, etc). With each classifiers results, we can analyze and infer which system is better and why.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
data		data
scale_data		scale_data
scale_whole_review		scale_whole_review
Bayes.py		Bayes.py
Powerpoint.pptx		Powerpoint.pptx
Powerpoint1.pptx		Powerpoint1.pptx
README.md		README.md
Reddit.py		Reddit.py
classes_count.txt		classes_count.txt
evaluation.txt		evaluation.txt
nbc.py		nbc.py
scrape.py		scrape.py
training.py		training.py
word_count 1.txt		word_count 1.txt
word_count.txt		word_count.txt

Havaldar/emoetion

Folders and files

Latest commit

History

Repository files navigation

Emoetion

INTRODUCTION

STRATEGY FOR SOLVING THE PROBLEM

About

Resources

Stars

Watchers

Forks

Languages