# Basics of NLP

In this notebook, I will review some of the most fundamental ideas in Natural Language Processing (NLP). We will start with a very naive kind of sentiment analysis and this will force us to also consider some of the most important techniques for preprocessing natural language data.

## Sentiment analysis

One of the most popular kinds of NLP analyses used extensively in computational social science is sentiment analysis. The notion of sentiment refers to the emotional valence of a given text or utterance. Valence is typically defined in this context as a one-dimensional, bipolar construct: the extent to which something is infused with negative or positive emotions or attitudes. However, there are also more elaborate kinds of sentiment analysis that define sentiment in terms of multiple dimensions, usually referring to some model of basic (primitive) emotions (e.g. anger, disgust, fear, happiness, sadness, and surprise).

Sentiment analysis is often used in analyses of social media, speeches of public figures, the press, and many other contexts.

In technical terms, sentiment analysis may be conducted in many different ways (the most sophisticated approaches are based on complex deep learning models). Nonetheless, all of these methods rely on some prior linguistic datasets (usually called lexicons or corpora) that assign scores, representing emotional valence, to given words or phrases.

Here I will not bore you with the most basic and naive (and frankly not very useful) kind of sentiment analysis, which is based on the so-called [AFINN](http://corpustext.com/reference/sentiment_afinn.html) lexicon. The lexicon assigns sentiment scores ranging from -5 (very negative) to +5 (very positive) to individual English words (and it includes only about 2500 of them). Sentiment is then computed in a very simple way: individual words from a larger text are matched with their scores in the AFINN lexicon, and different kinds of average scores are computed from them.

Instead, I will use a more nuanced type of sentiment analysis tailored for web data (such as blog posts, etc.) called [VADER](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/viewPaper/8109), which is implemented in a powerful Python package for NLP called [NLTK](https://www.nltk.org/).

## Data preprocessing

One of the biggest difficulties of NLP analysis stems from the fact that natural language is very contextual and messy. Many words may mean different things depending on context. Others might be spelled or written differently, for instance depending on their position in a sentence, while still being semantically equivalent. Still other words may not have any intrinsic meaning, as they play only a grammatical role. Good examples are the English articles (a, an, the).

### Stop words

In our simple sentiment analysis, we will be concerned mostly with average sentiment scores over all words in a given text. Therefore, one of our concerns will be to first get rid of words with no clear semantics, such as articles. In the context of NLP, such words are usually called **stop words**. We will want to remove all the stop words from our texts because they carry no sentiment of their own, so they would dilute the averages and pull our sentiment scores toward zero.
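To make the effect of stop words on an average score concrete, here is a minimal sketch. The word scores and the stop word list are made-up toy values (they are not taken from AFINN, VADER, or NLTK), and `average_score` is a hypothetical helper used only for this illustration.

```python
# Toy word-level scores (hypothetical values, not from any real lexicon).
toy_scores = {"great": 3.0, "love": 3.0}
# A tiny hand-made stop word list; real lists are much longer.
toy_stop_words = {"the", "a", "an", "is", "and", "i", "it"}

tokens = ["i", "love", "the", "movie", "and", "it", "is", "great"]


def average_score(words):
    """Average the scores of the given words; unknown words count as 0."""
    if not words:
        return 0.0
    return sum(toy_scores.get(w, 0.0) for w in words) / len(words)


# Keep only the words that are not on the stop word list.
content_tokens = [w for w in tokens if w not in toy_stop_words]

avg_all = average_score(tokens)              # stop words included
avg_content = average_score(content_tokens)  # stop words removed
print(avg_all, avg_content)
```

With the stop words included, the same total score is spread over more words, so the average of this positive sentence is much smaller. For a negative sentence the average would move up toward zero in the same way.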

### Tokenization

First, however, we have to notice that our approach will be based on the analysis of individual words, while initially our texts will be single strings. Thus, we will first have to decompose texts into single words. Such a process is usually called **tokenization**, and it refers to the decomposition of a text into lower-order elements such as words or sentences. The naive way to do that would be to split a text on any kind of whitespace, but in practice this is too simplistic. Luckily, people have already studied this problem quite extensively and figured out better solutions, so we will not have to reinvent the wheel. Instead, we will use one of the methods offered by the NLTK package.
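The sketch below shows why splitting on whitespace is too simplistic. The regular expression is only a stand-in written for this illustration (it is not how NLTK tokenizes text), and `TOKEN_PATTERN` is a hypothetical name. NLTK's own tokenizers, such as `word_tokenize` and `sent_tokenize`, handle many more cases.

```python
import re

text = "Science works, and if it doesn't it changes. Isn't that great?"

# Naive approach: split on whitespace. Punctuation stays glued to the words.
naive_tokens = text.split()

# Slightly better: a regular expression that keeps words (with an inner
# apostrophe, as in "doesn't") and punctuation marks as separate tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\sA-Za-z]")
regex_tokens = TOKEN_PATTERN.findall(text)

print(naive_tokens)
print(regex_tokens)
```

In the naive version, tokens such as `works,` and `great?` would never match an entry in a word list, while the second version separates the word from the punctuation mark.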

### Lemmatization

After tokenization, we would be able to remove stop words and look up sentiment scores for the rest of the words. However, there is one more problem that we should address before that. Many words with the same meaning may be written differently in different contexts, for instance depending on whether they occur in singular or plural form, or depending on tense.

One way to deal with that is to convert words to their lemmas in a process called **lemmatization**. A lemma of a word is its core (dictionary) form. Below we provide some examples:

* houses $\rightarrow$ house
* are $\rightarrow$ be
* mice $\rightarrow$ mouse
* becoming $\rightarrow$ become

Very accurate lemmatization is possible, but in general, it is rather hard and requires some additional work to be done that is beyond the scope of this course. Here we will do only a very simple kind of lemmatization that will allow us to simplify all plural forms into singular forms.
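A very rough version of plural-to-singular conversion can be sketched with a few suffix rules. This is only an illustration under simplifying assumptions: `singularize` and `IRREGULAR_PLURALS` are hypothetical names, the rules cover only common English patterns, and the sketch is not the method used by NLTK (which offers, for example, `WordNetLemmatizer`, backed by the WordNet dataset).

```python
# Irregular plurals have to be listed explicitly (a tiny hypothetical sample).
IRREGULAR_PLURALS = {"mice": "mouse", "men": "man", "children": "child"}


def singularize(word):
    """Very rough plural-to-singular conversion for English nouns."""
    word = word.lower()
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"   # studies -> study
    if word.endswith(("ches", "shes", "sses", "xes")):
        return word[:-2]         # boxes -> box
    if word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]         # houses -> house
    return word                  # class, is, mouse: left unchanged


examples = ["houses", "mice", "studies", "boxes", "class", "is"]
singular = [singularize(w) for w in examples]
print(singular)
```

Rules like these inevitably make mistakes on words they were not written for, which is one reason why accurate lemmatization needs dictionaries and part-of-speech information.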

### Pipeline

Summing up, the data processing pipeline that we will use here will be the following:

$$\text{text} \rightarrow \text{tokenization} \rightarrow \text{lemmatization} \rightarrow \text{stop words removal} \rightarrow \text{sentiment analysis}$$
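The whole chain can be sketched end to end in a few lines. Everything here is a toy stand-in chosen for illustration: the stop word list, the lexicon scores, and the function names are hypothetical, and the final step is a plain lexicon average rather than VADER.

```python
import re

# Both resources below are tiny hypothetical stand-ins for real ones.
TOY_STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "this"}
TOY_LEXICON = {"good": 2.0, "great": 3.0, "bad": -2.0, "problem": -2.0}


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def lemmatize(token):
    """Crude plural stripping: 'problems' -> 'problem'."""
    if token.endswith("s") and not token.endswith(("ss", "us", "is")):
        return token[:-1]
    return token


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in TOY_STOP_WORDS]


def score(tokens):
    """Mean lexicon score; words missing from the lexicon count as 0."""
    if not tokens:
        return 0.0
    return sum(TOY_LEXICON.get(t, 0.0) for t in tokens) / len(tokens)


def pipeline(text):
    """text -> tokens -> lemmas -> content words -> average score."""
    tokens = tokenize(text)
    lemmas = [lemmatize(t) for t in tokens]
    content = remove_stop_words(lemmas)
    return content, score(content)


content, value = pipeline("The problems are bad, the solutions are great!")
print(content, value)
```

Lemmatization runs before the lexicon lookup so that `problems` can match the entry for `problem`. Without that step the word would be scored as 0.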

## Natural Language Toolkit (NLTK)

NLTK is one of the most important and popular Python packages for Natural Language Processing. It is a very powerful but also complex package and we will not discuss any details of how it works. Instead, we will only use a few tools it provides. However, what we will show is enough to conduct simple sentiment analysis. Thus, the techniques presented here will constitute the last element that together with the things we learned previously will allow you to conduct a simple computational study starting with data extraction and ending with simple natural language analysis.

As we already mentioned, one of the characteristic features of NLP is that it is usually based (one way or another) on some preexisting datasets called lexicons and/or corpora, compiled by linguists and other people who study natural languages. As a result, working with NLTK quite often starts with downloading additional datasets that are needed to perform particular analyses. Luckily, this is very easy with NLTK, as it provides a very simple API for downloading missing datasets.

## Read the data

Before we move to computing the sentiment we need some text to do so. Therefore, we will use the data you have already collected. We will use submissions because they have more text to process. 

**But how are we going to do that?**

We will use exactly the same script that I put under the Hint in the last part of the previous notebook. However, I will unpack in more detail what happens here to give you a better understanding of it.

In [8]:
# Load the required module
import json

# Open the file in read mode
with open('../scripts/comments.jl', 'r') as file:
    # Each line holds one JSON object: convert every non-empty line
    # into a dict and store all of them in a list
    data = [json.loads(line) for line in file if line.strip()]
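To see what the list comprehension does without touching the real file, here is a small sketch that reads from an in-memory file. The two records and the name `fake_file` are made up for the illustration; only the `body` field mirrors the key used in this notebook.

```python
import io
import json

# Two hypothetical records in JSON Lines format: one JSON object per line.
fake_file = io.StringIO(
    '{"id": "c1", "body": "First comment"}\n'
    '{"id": "c2", "body": "Second, longer comment"}\n'
)

# Iterating over a file object yields one line at a time;
# json.loads turns each line into a regular Python dict.
records = [json.loads(line) for line in fake_file if line.strip()]

print(type(records[0]))
print(records[1]["body"])
```

Each element of the resulting list is an ordinary dict, so fields can be accessed by key.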

To illustrate how VADER works, we will need some text. First we will focus on one comment only, and later I will show you how to do it for all comments. So let's extract the longest comment from our data.

In [9]:
longest_comment = ""
for line in data:
    if len(line['body']) > len(longest_comment):
        longest_comment = line['body']

In [10]:
longest_comment

'He draws a false equivalence between science and religion. Science is based on the scientific method. It works, and if it doesn\'t it changes until it does.\n\nReligion is faith based, with the "truths" passed down from the past, and are not tested or changed if they are shown to be wrong.\n\nSo now that we know that we can use science and that we should use science lets look at his actual arguments:\n\n&gt; We sense a growing number of skeptics who accept that the Earth may be warming, but question whether nature is the driving force rather than a vain attempt by man to accept blame as a form of industrialization penance\n\nWe may sense this, but we would be sensing uninformed opinions.\n\nThe scientific community is pretty much 97-3 on the side of "most of the warming since the middle of last century is likely to be anthropogenic".\n\nLook perhaps at the wiki page on [the scientific opinion on climate change](http://en.wikipedia.org/wiki/Scientific_opinion_on_climate_change). The co

## Sentiment with VADER

At first, I thought about also showing you a very naive, dictionary-based way of computing sentiment, but then I realised that there was no point in that. It is enough that I told you about it, because it is more or less intuitive what happens.

VADER is a slightly more complex approach that takes into account features such as exclamation marks, negations and adjectival modifiers (e.g. words such as "very"). Its implementation is much more complex than the dictionary-based approach described earlier, and we will not discuss it here. The good thing is that it is implemented in NLTK and extremely easy to use.

In [3]:
# Import NLTK and VADER, download the VADER lexicon and initialize a sentiment analyzer
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# Import the sentence tokenizer (it needs NLTK's Punkt tokenizer data to be installed)
from nltk.tokenize import sent_tokenize

nltk.download('vader_lexicon')
vader = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/mikolaj/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [11]:
sentiment1 = vader.polarity_scores(longest_comment)


In [12]:
sentiment1

{'neg': 0.073, 'neu': 0.84, 'pos': 0.087, 'compound': 0.8432}
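A short note on reading this output: `neg`, `neu` and `pos` are the proportions of the text that fall into each category, while `compound` is a single normalized score between -1 and +1. The sketch below turns the compound score into a coarse label. The cut-off of 0.05 is the convention commonly suggested for VADER, not something this notebook relies on, and `label_from_compound` is a hypothetical helper name.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label (assumed cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# The scores printed above for the longest comment.
scores = {'neg': 0.073, 'neu': 0.84, 'pos': 0.087, 'compound': 0.8432}

label = label_from_compound(scores['compound'])
# neg, neu and pos are proportions, so they add up to (about) 1.
proportions_total = round(scores['neg'] + scores['neu'] + scores['pos'], 3)
print(label, proportions_total)
```

Note that a text can be mostly neutral word by word (here `neu` is 0.84) and still receive a clearly positive compound score.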

In [13]:
# Sentiment of sentences (more fine-grained picture)
sent_sentiment1 = [vader.polarity_scores(sent) for sent in sent_tokenize(longest_comment)]
sent_sentiment1[:5]

[{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0},
 {'neg': 0.107, 'neu': 0.796, 'pos': 0.097, 'compound': -0.0772},
 {'neg': 0.123, 'neu': 0.764, 'pos': 0.113, 'compound': -0.431}]

In [None]:
# For instance, we can look at average compound scores over sentences
sent_sentiment1_compound = sum(s['compound'] for s in sent_sentiment1) / len(sent_sentiment1)
print("Average compound score over sentences (text I):", sent_sentiment1_compound)

In [None]:
# Store the share of negative sentiment for every comment
for line in data:
    line['neg'] = vader.polarity_scores(line['body'])['neg']
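Once every record carries a score, the whole collection can be summarised with ordinary Python tools. The sketch below uses three made-up records with a hypothetical `compound` field, so it runs without NLTK. With real data, the values would come from `vader.polarity_scores`.

```python
from statistics import mean

# Hypothetical records: each one already carries a VADER-style compound score.
scored = [
    {"body": "Great point, thanks!", "compound": 0.8},
    {"body": "This is simply wrong.", "compound": -0.5},
    {"body": "See the linked page.", "compound": 0.0},
]

# Average sentiment over all records.
average_compound = mean(item["compound"] for item in scored)

# Records at both ends of the scale.
most_negative = min(scored, key=lambda item: item["compound"])
most_positive = max(scored, key=lambda item: item["compound"])

print(average_compound)
print(most_negative["body"])
print(most_positive["body"])
```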