# Progress Report: [Blog-Sentiment Analysis](https://github.com/Data-Science-for-Linguists-2019/Blog-Sentiment-Analysis)

To learn more about the data and see my previous analysis, refer to [Progress Report #2](https://github.com/Data-Science-for-Linguists-2019/Blog-Sentiment-Analysis/blob/master/progress_report_part2.ipynb).

# Loading the data

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
%pprint

Pretty printing has been turned OFF


In [2]:
# Project root; named data_dir so it does not shadow Python's built-in dir()
data_dir = '/users/eva/Documents/Data_Science/Blog-Sentiment-Analysis/'

In [3]:
blogdata = pd.read_csv(data_dir + 'data/blogtext.csv')

In [4]:
newcolumns = ['id', 'gender', 'age', 'industry', 'sign', 'date', 'text']
blogdata.columns = newcolumns
blogdata.head(3)

|   | id | gender | age | industry | sign | date | text |
|---|---|---|---|---|---|---|---|
| 0 | 2059027 | male | 15 | Student | Leo | 14,May,2004 | Info has been found (+/- 100 pages,... |
| 1 | 2059027 | male | 15 | Student | Leo | 13,May,2004 | These are the team members: Drewe... |
| 2 | 2059027 | male | 15 | Student | Leo | 12,May,2004 | In het kader van kernfusie op aarde... |


# Data analysis
### Goals of my analysis
There are three things I would like to investigate in this data. 
1. Word frequencies
2. Blog topics
3. Blog sentiment

I already looked at [word frequencies](https://github.com/Data-Science-for-Linguists-2019/Blog-Sentiment-Analysis/blob/master/progress_report_part2.ipynb) in Progress Report #2, though I'm not fully happy with my results. In the same report, I also experimented with [topic clustering using LDA (Latent Dirichlet Allocation) and scikit-learn](https://github.com/Data-Science-for-Linguists-2019/Blog-Sentiment-Analysis/blob/master/progress_report_part2.ipynb).

For Progress Report #3, I am going to begin by exploring blog sentiment. If I have time, I will revisit the previous two goals of my analysis.

## Sentiment analysis

It appears that the vast majority of sentiment analysis work groups texts into "positive" and "negative" classes. I would really like to go deeper than that if possible, but I'll try positive and negative classification first.

Since I am not interested in hand-labeling even a portion of the 681,284 blogs in this dataset for sentiment, I am going to try using already-trained models or wordlists that map words to sentiment values. Many sentiment models were trained on traditional corpus data, using text from news and books. These models don't incorporate things like emoticons :-) and other internet-specific language that would be found in the blog dataset. To get the best sentiment judgments, I specifically looked for models trained on more modern corpora.

### First try: VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) is "a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media". It is also part of NLTK, which is convenient. It looks like it can be used on a whole text without tokenization.

Credit:
> Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
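
Since the first goal is a positive/negative split, here is a minimal sketch of how a VADER result could be turned into a label. VADER's `polarity_scores` returns a dictionary with `neg`, `neu`, `pos`, and `compound` keys, and the VADER documentation suggests cutoffs of ±0.05 on `compound`. The function name `label_from_compound` and the example scores below are hypothetical and are not results from the blog data.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score (between -1 and 1) to a coarse label.

    The +/-0.05 cutoff is the conventional one from the VADER documentation;
    treat it as a tunable assumption rather than a fixed rule.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores in the shape VADER returns (not real output)
example_scores = {"neg": 0.30, "neu": 0.60, "pos": 0.10, "compound": -0.62}
print(label_from_compound(example_scores["compound"]))
```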

In [20]:
blogdata.text[10000]

'        transported through the air into the mall across town through the throngs of girls in pants too low and boys with gold necklaces  into the shop beside the bank near the bathrooms  the colour of love is present  not being over-powered by the sound of elevator music and rap mixed together          '

In [23]:
# After doing some poking around, I am going to try this blog post.
# It seems clearly negative, focusing on the blogger's health issues; however,
# there are a lot of misspellings and sarcasm.
text = blogdata.text[30000]

In [27]:
def demo_vader_instance(text):
    """
    Output polarity scores for a text using Vader approach.

    :param text: a text whose polarity has to be evaluated.
    """
    from nltk.sentiment import SentimentIntensityAnalyzer

    vader_analyzer = SentimentIntensityAnalyzer()
    print(vader_analyzer.polarity_scores(text))

In [28]:
demo_vader_instance(text)

LookupError: 
**********************************************************************
  Resource vader_lexicon not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('vader_lexicon')

  Attempted to load sentiment/vader_lexicon.zip/vader_lexicon/vader_lexicon.txt

  Searched in:
    - '/Users/eva/nltk_data'
    - '/anaconda3/nltk_data'
    - '/anaconda3/share/nltk_data'
    - '/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


In [24]:
nltk.sentiment.util.demo_vader_instance(text)

AttributeError: module 'nltk' has no attribute 'sentiment'
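
Both errors above look fixable. The `LookupError` says the VADER lexicon has not been downloaded yet, and the `AttributeError` appears to come from the fact that a bare `import nltk` does not load the `nltk.sentiment` subpackage, so it has to be imported explicitly. The sketch below is one way to handle both, assuming NLTK is installed and a network connection is available for the one-time download. The helper name `vader_scores` is hypothetical, and it returns `None` instead of raising when NLTK or the lexicon cannot be obtained.

```python
import importlib.util


def vader_scores(text):
    """Return VADER polarity scores for text, or None if VADER is unavailable."""
    if importlib.util.find_spec("nltk") is None:
        return None  # NLTK is not installed in this environment

    import nltk
    # Import the submodule explicitly; `import nltk` alone does not expose it
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    try:
        analyzer = SentimentIntensityAnalyzer()
    except LookupError:
        # Lexicon missing: try the one-time download suggested by the error message
        try:
            downloaded = nltk.download("vader_lexicon", quiet=True)
            analyzer = SentimentIntensityAnalyzer() if downloaded else None
        except Exception:
            analyzer = None  # e.g. no network access

    if analyzer is None:
        return None
    return analyzer.polarity_scores(text)


print(vader_scores("I love this, it is great! :-)"))
```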

### Second try: AFINN
[AFINN](https://github.com/fnielsen/afinn) is described as a "Wordlist-based approach for sentiment analysis". The wordlist was created using data from Twitter. It covers many emoticons as well as internet-specific spellings like "wowow" (+4, where the most positive score is +5) and "wtfff" (-4, where the most negative score is -5). I read the paper accompanying the wordlist and thought it sounded like a potential fit for the blog data.

> Finn Årup Nielsen, "A new ANEW: evaluation of a word list for sentiment analysis in microblogs", Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages. Volume 718 in CEUR Workshop Proceedings: 93-98. 2011 May. Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, Mariann Hardey (editors)
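
To make the wordlist idea concrete, here is a rough sketch of how a wordlist-based scorer can work: look up each token and add up the valences. This is a simplification for illustration and not the `afinn` package itself. `TOY_LEXICON` and `wordlist_score` are hypothetical names, and the lexicon contains only the two entries mentioned above rather than the real AFINN list.

```python
import re

# Toy stand-in for a sentiment wordlist; only the two entries quoted in the
# text are included, and the real AFINN list is much larger.
TOY_LEXICON = {"wowow": 4, "wtfff": -4}


def wordlist_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word in text; unknown words count as 0."""
    # Simple lowercase word tokenizer (an assumption). It drops punctuation,
    # so emoticons would need their own matching step.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


print(wordlist_score("Wowow, that concert was fun"))
print(wordlist_score("wtfff is going on with my knee"))
```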