# Predicting *sentimental impact*

For this project, we set out to predict the *sentimental impact* that certain conflicts have in other countries. Bear in mind that this notebook is a follow-up to the [project's root README](https://github.com/nunomota/ada2017-hw/blob/master/project/README.md), so information already given there is not repeated here.

Although it may seem quite complicated, this end goal can be broken down into smaller steps:

1. [Language Recognition](https://github.com/nunomota/ada2017-hw/blob/master/project/scripts/language_recognition.py): To identify and keep only English Tweets

2. [Named Entity Recognition](https://github.com/nunomota/ada2017-hw/blob/master/project/scripts/name_entity_recognition.py): To find which country a Tweet might be referring to

3. [Sentiment Analysis](https://github.com/nunomota/ada2017-hw/blob/master/project/scripts/sentiment_analysis.py): To measure the sentiment associated with a Tweet

4. Location and textual data: Taken from UCDP and Twitter datasets

**Note:** To keep the code clear and concise, we moved the code related to these steps into separate scripts. For the most part, this notebook covers a higher level of the implementation - if you would like to know more about the specifics, everything is properly commented within each script.

In [1]:
# Add scripts folder to path
import sys
sys.path.insert(0, 'scripts')

# Get Tweets
import pandas as pd
TWEET_SAMPLE_FILE_PATH = 'data/parsed/markov_text_files/tweet_sentiment_sample.csv'
tweets_df = pd.read_csv(TWEET_SAMPLE_FILE_PATH, index_col=False, compression='gzip', encoding='utf-8')

Tweet sample:

In [2]:
tweets_df.head()

Unnamed: 0,Text
0,This Kutuzov who before the battle of Borodino...
1,"Djibouti The economy is based on agriculture, ..."
2,The general looked the captain up and down as ...
3,Crude oil and natural gas and holds more than ...
4,The little man stood glancing from one to the ...


## Language Recognition (LR)

The [language recognition script](https://github.com/nunomota/ada2017-hw/blob/master/project/scripts/language_recognition.py) makes use of the [langdetect library](https://pypi.python.org/pypi/langdetect) to output whether or not a certain Tweet is written in English. Although this library's reliability increases with the text's length, which makes it somewhat limited for Tweets, it tends to give more *false negatives* than *false positives*. This is not a problem: we have a large amount of data, and we would rather discard some English Tweets than keep Tweets written in other languages.
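
To illustrate why a strict check leans towards false negatives, the sketch below accepts a Tweet only when English is the top-ranked language with high confidence. The function name `looks_english`, the `(language, probability)` pair layout and the 0.9 threshold are assumptions made for illustration; they are not taken from `language_recognition.py`. With langdetect, such pairs could be built from the objects returned by `detect_langs(text)`.

```python
def looks_english(candidates, min_prob=0.9):
    """Return True only if English is the most likely language and confident enough.

    candidates: list of (language_code, probability) pairs, for example
    [(c.lang, c.prob) for c in langdetect.detect_langs(text)].
    The 0.9 threshold is a hypothetical value, not the project's setting.
    """
    if not candidates:
        # No guess at all: discard the Tweet rather than risk keeping it
        return False
    lang, prob = max(candidates, key=lambda pair: pair[1])
    return lang == 'en' and prob >= min_prob


print(looks_english([('en', 0.99)]))               # True
print(looks_english([('en', 0.60), ('nl', 0.40)]))  # False
```

Raising `min_prob` discards more genuine English Tweets but lets fewer foreign-language ones through, which matches the trade-off described above.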

In [3]:
import language_recognition as lr

Samples:

In [4]:
lr.is_tweet_english('Portuguese people always arrive late')

True

### Applying to Tweets
We can now use this module to filter out all the Tweets that are not written in English.

In [5]:
tweets_df = tweets_df[tweets_df['Text'].apply(lr.is_tweet_english)]

Sample:

In [6]:
tweets_df.head()

Unnamed: 0,Text
0,This Kutuzov who before the battle of Borodino...
1,"Djibouti The economy is based on agriculture, ..."
2,The general looked the captain up and down as ...
3,Crude oil and natural gas and holds more than ...
4,The little man stood glancing from one to the ...


## Named Entity Recognition (NER)

This is the trickiest part of the project. The main goal is to find whether or not a Tweet is talking about a country. For this part, we opted to use a natural language processing library called [spaCy](https://spacy.io/), for a few reasons:

1. Simple text comparison tends to over-identify entities in a text (loose words might match but, contextually, they might not be referring to a place)
2. Other alternatives, like [NLTK](http://www.nltk.org/), tended to be slower
3. spaCy comes with built-in models trained on [OntoNotes 5](https://catalog.ldc.upenn.edu/ldc2013t19), giving us a wide range of entity labels to work with

Although it seems to work quite well, we noticed many cases where whether or not it recognizes an entity depends on capitalization alone. To minimize these errors, we created a [Tweet Generator](https://github.com/nunomota/ada2017-hw/blob/master/project/analyzers/Auxiliary/tweets_generator.ipynb). This notebook builds Markov chain models by analysing several corpora we provide (stories, geopolitical or religious texts, news, etc.) and, through the [markovify library](https://github.com/jsvine/markovify), generates several texts of a specified length (in this case 140 characters, which is a Tweet's maximum length). As a next step, we aim to go through its interactive generator and manually label the entities we find in the generated text. From the overall output of that process, we expect to learn which text-formatting techniques give our NER model a higher accuracy (e.g. lower-case text for cities, upper-case for countries or title-case for religions).
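
The idea behind the generator can be sketched with a small word-level Markov chain in plain Python. This is only an illustration of the technique: the project relies on markovify, and the helper names `build_chain` and `generate_text` below are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(corpus):
    """Map each word to the list of words that follow it in the corpus."""
    words = corpus.split()
    chain = defaultdict(list)
    for current_word, next_word in zip(words, words[1:]):
        chain[current_word].append(next_word)
    return chain


def generate_text(chain, max_length=140, seed=None):
    """Random-walk the chain until a dead end or the length limit is reached."""
    if not chain:
        return ''
    rng = random.Random(seed)
    word = rng.choice(sorted(chain))  # sorted for reproducibility with a seed
    text = word
    while chain.get(word):
        word = rng.choice(chain[word])
        if len(text) + 1 + len(word) > max_length:
            break
        text += ' ' + word
    return text


corpus = 'the general looked at the captain and the captain looked at the general'
print(generate_text(build_chain(corpus), max_length=140, seed=1))
```

Because every transition is sampled from word pairs seen in the corpus, the output reads like the source material without copying it, which is what makes it useful as labelled test input for the NER step.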

We also enriched our data with several other datasets (also specified in the [project's root README](https://github.com/nunomota/ada2017-hw/blob/master/project/README.md)), creating bilateral associations between:

* Countries
* Cities
* Nationalities
* Religions
* Religious affiliations
* Currencies

We also don't keep only the Tweets that refer to a country directly. For *religions*, *cities* and *currencies*, a single word can relate to many countries (e.g. several countries might have a city with the same name). In these cases, we take into account the probability that each entity refers to each country. We then combine the probabilities of all the entities to find the *most likely* countries. If a country is within that set and its probability is above a certain threshold, we assume the Tweet is referring to it.
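
A minimal sketch of this selection step is shown below. It assumes each entity has already been mapped to a dictionary of country probabilities; averaging across entities, the `top_n` cut-off and the 0.5 threshold are illustrative assumptions, and `likely_countries` is a hypothetical helper rather than a function from `name_entity_recognition.py`.

```python
from collections import defaultdict


def likely_countries(entity_probs, threshold=0.5, top_n=3):
    """Combine per-entity country probabilities and keep the most likely ones.

    entity_probs: one dict per entity found in the Tweet, mapping a
    country code to the probability that the entity refers to it.
    """
    if not entity_probs:
        return {}
    totals = defaultdict(float)
    for probs in entity_probs:
        for country, probability in probs.items():
            totals[country] += probability
    # Average over all entities so that each entity carries the same weight
    averaged = {c: total / len(entity_probs) for c, total in totals.items()}
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return {c: p for c, p in ranked[:top_n] if p >= threshold}


# "Portuguese" points to PT only; a currency shared by two countries is split
entities = [{'PT': 1.0}, {'PT': 0.5, 'BR': 0.5}]
print(likely_countries(entities))  # {'PT': 0.75}
```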

In [7]:
import name_entity_recognition as ner

# Load the necessary data for the module
ner.load_data()

Samples:

In [8]:
ner.is_tweet_about_country('Portuguese people always arrive late', 'PT')

True

### Applying to Tweets
We can now find out which countries the Tweets are talking about.

In [9]:
tweets_df['Countries'] = tweets_df['Text'].apply(lambda x: ner.get_countries_from_content(x).keys())

Sample:

In [10]:
tweets_df.head()

Unnamed: 0,Text,Countries
0,This Kutuzov who before the battle of Borodino...,()
1,"Djibouti The economy is based on agriculture, ...","(DJ, IN)"
2,The general looked the captain up and down as ...,()
3,Crude oil and natural gas and holds more than ...,()
4,The little man stood glancing from one to the ...,()


## Sentiment Analysis (SA)

The [sentiment analysis script](https://github.com/nunomota/ada2017-hw/blob/master/project/scripts/sentiment_analysis.py) depends on [NLTK's Vader](http://www.nltk.org/_modules/nltk/sentiment/vader.html) sentiment analyzer. From the final result of this library, we return the *compound* value, which already takes into account the *positive*, *negative* and *neutral* sentiments in the text and lies within the range [-1.0, 1.0].
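
To see why the *compound* value stays within that range, the sketch below reproduces, to the best of our understanding, the normalization Vader applies to the summed word valences: `x / sqrt(x^2 + alpha)` with `alpha = 15`. It is a standalone re-implementation for illustration, not a call into NLTK.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded valence sum into the range [-1.0, 1.0]."""
    norm_score = score / math.sqrt(score * score + alpha)
    # Clamp to guard against rounding just outside the range
    return max(-1.0, min(1.0, norm_score))


print(normalize(0.0))  # 0.0
print(normalize(1.0))  # 0.25
```

A text with no sentiment-bearing words has a valence sum of zero, which would explain a compound value of exactly 0.0 for a neutral sentence.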

In [11]:
import sentiment_analysis as sa

Samples:

In [12]:
sa.get_sentiment_intensity('Portuguese people always arrive late')

0.0

### Applying to Tweets
We will now analyze the sentiment of the Tweets in our dataset.

In [13]:
tweets_df['Sentiment'] = tweets_df['Text'].apply(sa.get_sentiment_intensity)

Sample:

In [14]:
tweets_df.head()

Unnamed: 0,Text,Countries,Sentiment
0,This Kutuzov who before the battle of Borodino...,(),-0.7964
1,"Djibouti The economy is based on agriculture, ...","(DJ, IN)",0.0
2,The general looked the captain up and down as ...,(),0.0
3,Crude oil and natural gas and holds more than ...,(),-0.296
4,The little man stood glancing from one to the ...,(),-0.296


## Sentimental impact

Our purpose in this notebook is to group the Tweets we found (based on their *Sentiment Strength*) and simulate a chronological order over a time-frame. The next step is to calculate the daily *Sentiment Strength* for those time-frames, making sure we penalize outliers so that random over-emotional Tweets do not skew the result.

Finally, to define *sentimental impact*, we analyse the contrast between the daily sentiment strengths *before* and *after* a conflict (without penalizing outliers, since a one-day conflict might cause an emotional peak on a single day and still be as *emotionally impactful* as a week-long conflict).
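
One possible reading of these two steps is sketched below. Using the median as the outlier-resistant daily measure and a difference of means as the contrast are assumptions made for illustration, and both function names are hypothetical; the final definition may differ.

```python
from statistics import mean, median


def daily_strength(sentiments):
    """Median of one day's Tweet sentiments, damping over-emotional outliers."""
    return median(sentiments)


def sentimental_impact(before_days, after_days):
    """Contrast between the daily strengths after and before a conflict.

    Each argument is a list of days, and each day is a list of Tweet
    sentiment values. Days are averaged without any outlier penalty,
    so a single-day emotional peak still counts.
    """
    before = [daily_strength(day) for day in before_days]
    after = [daily_strength(day) for day in after_days]
    return mean(after) - mean(before)


before = [[0.0, 0.1, 0.2], [0.1, 0.1, 0.9]]
after = [[-0.5, -0.4, 0.9], [-0.3, -0.3, -0.3]]
print(sentimental_impact(before, after))
```

A negative result means the daily sentiment dropped after the conflict started; the isolated 0.9 values barely affect it because they are filtered out at the daily level.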

## Next steps

Although we have already extracted the information we need from the UCDP dataset, we have not yet done so for the Twitter one. Since we didn't have access to it from the start, we decided to focus on the above topics in the meantime, using a smaller, local dataset. Because of this, our next steps will be:

1. Categorize conflicts (e.g. according to casualties)
2. Filter Tweets that are within 3 days (before and after) each of the conflicts' start dates
3. Filter Tweets with *LR* and *NER* 
4. Apply *SA* on the remaining Tweets and calculate *sentimental impact*
5. Train a machine learning model that uses *conflict category* and *conflict country* as features and tries to predict *sentimental impact*
6. Show predictions across countries for same-category conflicts
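
As a rough sketch of step 2, the time-window filter could look like the following. The `(posted_on, text)` pair layout and the helper name `split_around_conflict` are hypothetical, and counting Tweets posted on the start date itself as "after" is an assumption.

```python
from datetime import date, timedelta


def split_around_conflict(tweets, conflict_start, window_days=3):
    """Return (before, after) lists of Tweets within the window around a start date.

    tweets: iterable of (posted_on, text) pairs, where posted_on is a date.
    """
    window = timedelta(days=window_days)
    before, after = [], []
    for posted_on, text in tweets:
        if conflict_start - window <= posted_on < conflict_start:
            before.append(text)
        elif conflict_start <= posted_on <= conflict_start + window:
            after.append(text)
    return before, after


tweets = [
    (date(2017, 1, 1), 'too early'),
    (date(2017, 1, 8), 'shortly before'),
    (date(2017, 1, 10), 'same day'),
    (date(2017, 1, 13), 'shortly after'),
    (date(2017, 1, 14), 'too late'),
]
print(split_around_conflict(tweets, date(2017, 1, 10)))
# (['shortly before'], ['same day', 'shortly after'])
```

The two lists map directly onto the *before* and *after* groups needed to calculate *sentimental impact* in step 4.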