# Predicting Sarcasm Using The Onion

## Project Motivation

As companies increasingly rely on tools such as sentiment analysis and chatbots, the need to correctly discern the genuine attitude of the public also increases. Imagine the following interaction:

Customer: 'You've been a big help! Thanks for nothing'

Bot: 'You're welcome!'

Or, this review:

'Ah, Bulls. What fond, fond memories I have of my one and only visit to your "establishment."' 


Misunderstanding the true meaning in either could lead to decreased customer satisfaction or a misinterpretation of the general sentiment towards the company or product by the public. 

While overt feelings of anger, frustration, or satisfaction are becoming easier for ML models to identify, sarcasm adds an extra level of complexity.  The goal of this project is to explore the viability of building models that can correctly classify sarcasm in text. 


## The Data

Finding data for this project posed an interesting problem. We needed accurately labeled and similarly formatted sets of text. While there are data sets of posts from Reddit and Twitter that users have self-reported as sarcastic, they have significant formatting issues and would need a large amount of preprocessing.

While the long-term goal of sarcasm detection is to work on such forums, for our initial trials we were looking for data with higher label accuracy that was predominantly free of spelling and grammar errors.

We found the perfect data set in the "News Headlines Dataset for Sarcasm Detection" on Kaggle (https://rishabhmisra.github.io/publications/). We are forever grateful to Prahal Arora and Rishabh Misra for their work.

This set is composed of headlines taken from mainstream news sources and The Onion, a satirical news website. We can be reasonably confident that the headlines from The Onion are sarcastic, while still following established headline formatting conventions. 

## Natural Language Processing And Feature Extraction

While we experimented with many different protocols for text preprocessing, the final process was the following:

 - data cleaning
  
 - feature extraction and tokenization of the text: we employed sequenced tokenization, as
   maintaining the context of the words was vital to judging the writer's true meaning 
  
 - we then padded the sequences to have a uniform length
  
 - with our final models, we also used pre-trained embeddings from GloVe, which is an 'unsupervised
   learning algorithm for obtaining vector representations for words'
   
 - for our Machine Learning Models, we created a W2Vecorizer class with GloVe
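
The tokenization and padding steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the README does not say which library was used, the function names are hypothetical, and the headlines are made-up toy examples. The point is that each headline becomes an ordered list of integer word indices, so word order is preserved, and every list is then padded to the same length.

```python
from collections import Counter


def fit_vocab(texts, num_words=None):
    """Map each word to an integer index, most frequent first.

    Index 0 is reserved for padding, so real words start at 1.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    return {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(num_words), start=1)
    }


def texts_to_sequences(texts, vocab):
    """Replace each known word with its index, keeping the original word order."""
    return [
        [vocab[word] for word in text.lower().split() if word in vocab]
        for text in texts
    ]


def pad_sequences(sequences, maxlen):
    """Left-pad with zeros (or keep only the last maxlen items) for a uniform length."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


# Toy headlines, invented for this sketch.
headlines = [
    "area man wins argument",
    "area man loses keys again",
]
vocab = fit_vocab(headlines)
sequences = texts_to_sequences(headlines, vocab)
padded = pad_sequences(sequences, maxlen=6)

for row in padded:
    print(row)
```

Padding on the left is one common choice for recurrent models, since it keeps the real words closest to the final time step; padding on the right would work with the same code shape.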

## Machine Learning  

Using a Pipeline, we tested several Machine Learning methods. Below are their final cross-validated accuracy scores:

- Random Forest: 0.734
- Support Vector Machine: 0.72
- Logistic Regression: 0.70

We also ran a stand-alone XGBoost classifier:

 - XGB: 0.734
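
For the classical models, each headline has to become a single fixed-length feature vector. The sketch below shows one common way to do that with word embeddings: average the vectors of the words in the headline. It is a hypothetical stand-in, not the project's own vectorizer class, whose code is not shown here. The class name `MeanEmbeddingVectorizer` and the three-dimensional toy vectors are assumptions made for illustration.

```python
class MeanEmbeddingVectorizer:
    """Turn each text into the average of its word vectors (illustrative stand-in)."""

    def __init__(self, word_vectors):
        self.word_vectors = word_vectors
        # Infer the embedding dimensionality from any one vector.
        self.dim = len(next(iter(word_vectors.values())))

    def fit(self, texts, labels=None):
        # Nothing to learn: the embeddings are pre-trained.
        return self

    def transform(self, texts):
        features = []
        for text in texts:
            vectors = [
                self.word_vectors[word]
                for word in text.lower().split()
                if word in self.word_vectors
            ]
            if not vectors:
                # No known words: fall back to an all-zero vector.
                features.append([0.0] * self.dim)
                continue
            # Average each dimension across the words that were found.
            features.append([sum(column) / len(vectors) for column in zip(*vectors)])
        return features


# Toy 3-dimensional vectors, made up for this sketch; real GloVe vectors are much wider.
toy_vectors = {
    "area": [1.0, 0.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "wins": [0.0, 0.0, 1.0],
}
vectorizer = MeanEmbeddingVectorizer(toy_vectors).fit([])
features = vectorizer.transform(["Area man wins", "unknown words only"])
print(features)
```

Because the class exposes `fit` and `transform`, an object like this could plausibly sit in front of a classifier as the first step of a scikit-learn Pipeline. Averaging discards word order, which may be one reason these models trail the sequence models.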



## Deep Learning

Not overly impressed with these results, we ran several Deep Learning algorithms. They are listed below with their accuracy:

 - LSTM: .85
 - GRU: .84
 - CNN: .80
 

In addition to those, we also attempted many self-created sequential models. By far the best result we achieved was with a Bidirectional/LSTM:

 - BiD/LSTM: .87
 
We tried multiple hyperparameter settings, but we were not able to improve on this result.
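
As a rough illustration of what a Bidirectional/LSTM classifier can look like, here is a minimal Keras sketch. It is not the project's actual architecture. The vocabulary size, embedding width, sequence length, and number of LSTM units are all assumed values, and the model is left untrained, so the printed probabilities are meaningless. It only shows the layer stack and the input and output shapes.

```python
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Embedding

VOCAB_SIZE = 10_000  # assumed vocabulary size
EMBED_DIM = 100      # assumed embedding width
MAX_LEN = 30         # assumed padded headline length

model = Sequential([
    # Look up a dense vector for every word index in the padded sequence.
    Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
    # Read the sequence left-to-right and right-to-left, then concatenate.
    Bidirectional(LSTM(64)),
    # One sigmoid unit: estimated probability that the headline is sarcastic.
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# A dummy batch of 4 padded sequences, only to check shapes.
dummy_batch = np.random.randint(1, VOCAB_SIZE, size=(4, MAX_LEN))
probabilities = model.predict(dummy_batch, verbose=0)
print(probabilities.shape)
```

In a setup like this, the embedding layer could also be initialized from pre-trained GloVe vectors instead of being learned from scratch.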
    

## Toward Production 

With the Bidirectional/LSTM model as our engine, we created a locally hosted Flask app that allows users to enter headlines 'from the wild' and get a percentage indicating how likely the headline is to come from The Onion rather than a mainstream news source.

If you would like to try it out, in your terminal, go to the directory for this project and enter the command 'python app.py'.

Then open http://localhost:5000/ in your browser. There will be a text box where you can enter a headline. Enjoy.
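
For readers curious how such an app hangs together, here is a minimal Flask sketch. It is not the project's `app.py`. The `/predict` route, the JSON payload, and the `score_headline` stub are all hypothetical, and the stub returns a fixed value where the real app would call the trained model. The real app serves a text box rather than a JSON endpoint.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def score_headline(headline):
    """Hypothetical stand-in for the trained model; returns a probability in [0, 1]."""
    # A real app would tokenize, pad, and call the model here.
    # This stub always returns 0.5.
    return 0.5


@app.route("/predict", methods=["POST"])
def predict():
    # Accept a JSON body such as {"headline": "..."}.
    payload = request.get_json(silent=True) or {}
    headline = payload.get("headline", "").strip()
    if not headline:
        return jsonify(error="Please provide a headline."), 400
    probability = score_headline(headline)
    return jsonify(
        headline=headline,
        onion_likelihood_percent=round(probability * 100, 1),
    )

# To serve this sketch locally on port 5000, call: app.run(port=5000)
```

Keeping the scoring logic in its own function makes it easy to swap the stub for a real model call without touching the route.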

## Conclusions
While the project is amusing in nature, its overarching goal was to test the feasibility of building production models that could reasonably detect sarcasm within text.

Even taking into consideration the type of data used and an underwhelming final accuracy rate of 89%, we think we have shown that it is possible with current Deep Learning methods.

With that said, much more work would be needed.  

## Future Work

More training. There are two versions of the headlines data set. The next logical step is to train the current model on the second set to see if we can improve accuracy.

Continuous training. It might be interesting to scrape headlines daily from both The Onion and news sources and update the training on a weekly or monthly basis.

Take it to the real world. In the end, our goal is to gauge the sentiment of real people, not headlines. Once we have a reasonably accurate model, we should apply it to real-world situations. 

With that in mind, we will (as always) need new data. Paramount to the above is gathering accurately labeled examples of sarcasm used by regular people. The more realistic the data, the better the result.