In [None]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='f50cd9d7-15ac-4ff4-8ca0-66ac443fd868', project_access_token='p-9fb965e91b44d582783cfe3da9a3cc430495311f')
pc = project.project_context


# Load and Visualize IBM Debater® Sentiment Composition Lexicons
This notebook relates to the IBM Debater® Sentiment Composition Lexicons dataset. The dataset includes sentiment composition lexicons and sentiment lexicons:
1. Sentiment composition lexicons containing 2,783 words.
2. Sentiment lexicons containing 66,058 unigrams and 262,555 bigrams.

This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/sentiment-composition-lexicons/).

In this notebook, we load, explore, clean and visualize the dataset.

This dataset addresses sentiment composition – predicting the sentiment of a phrase from the interaction between its constituents. For example, in the phrases “reduced bureaucracy” and “fresh injury”, both “reduced” and “fresh” are followed by a negative word. However, “reduced” flips the negative polarity, resulting in a positive phrase, while “fresh” propagates the negative polarity to the phrase level, resulting in a negative phrase. Accordingly, “reduced” is part of our “reversers” lexicon, and “fresh” is part of the “propagators” lexicon.
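
To make the reverser and propagator distinction concrete, here is a minimal sketch built only from the two example phrases above. The small sets, the `WORD_POLARITY` dictionary, and the `compose` helper are hypothetical illustrations; they are not the dataset's lexicons or the authors' method.

```python
# Toy illustration of sentiment composition (hypothetical helper, not the dataset's API).
REVERSERS = {"reduced"}    # flip the polarity of the word that follows
PROPAGATORS = {"fresh"}    # pass the polarity of the following word through

# Assumed polarities of the second word: -1 is negative, +1 is positive.
WORD_POLARITY = {"bureaucracy": -1, "injury": -1}


def compose(first, second):
    """Return the phrase polarity (+1 or -1), or None if no rule applies."""
    polarity = WORD_POLARITY.get(second)
    if polarity is None:
        return None
    if first in REVERSERS:
        return -polarity
    if first in PROPAGATORS:
        return polarity
    return None


print(compose("reduced", "bureaucracy"))  # 1
print(compose("fresh", "injury"))         # -1
```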

### Table of Contents

* [0. Prerequisites](#prerequisite)
* [1. Load Data](#1)   
    * [1.1 About](#abstract)
    * [1.2 Download and Extract](#download)
* [2. Data Visualization](#2)
* [3. Save the Cleaned Data](#3)
* [Authors](#authors)


<a class="anchor" id="prerequisite"></a>
### 0. Prerequisites

Before you run this notebook, complete the following steps:
- Insert a project token
- Import required modules

#### Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

```python
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
```

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

* Click on `More -> Insert project token` in the top-right menu section

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

* This should insert a cell at the top of this notebook similar to the example given above.

  > If an error is displayed indicating that no project token is defined, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data).

* Run the newly inserted cell before proceeding with the notebook execution below

#### Import required modules

Import and configure the required modules.

In [None]:
# Define required imports
import pandas as pd
from pandas import read_excel
import numpy as np
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import seaborn as sns
!pip install cufflinks
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import clear_output
clear_output()

## 1. Load Data <a class="anchor" id="1"></a>
### 1.1 About <a class="anchor" id="abstract"></a>

The goal of these notebooks is to use the [IBM Debater® Sentiment Composition Lexicons](https://developer.ibm.com/exchanges/data/all/sentiment-composition-lexicons/) dataset to categorize text by sentiment, either at the sentence level or as a whole. This could be used, for example, in an application that collects customer comments and feedback to determine which customers are more satisfied with a company.

Let's first define a few terms.

Sentiment analysis: using natural language processing to determine the sentiment of a piece of text, e.g. whether the text has a positive, negative, or neutral connotation.

N-gram: a sequence of N terms

Unigram: an N-gram with one term, e.g. “hello”

Bigram: an N-gram with two terms, e.g. “hello world”

POS (part of speech) tag: a word's part of speech, e.g. the POS tag for "dog" would be noun (or NN in the NLTK Python library).
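
As a quick illustration of the N-gram definitions above, the following sketch splits a short sentence into unigrams and bigrams. The `ngrams` helper and the example sentence are assumptions made for illustration only; the lexicon files already contain precomputed unigrams and bigrams.

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Whitespace tokenization is a simplification chosen for this example.
tokens = "hello world again".split()

unigrams_example = ngrams(tokens, 1)
bigrams_example = ngrams(tokens, 2)

print(unigrams_example)  # [('hello',), ('world',), ('again',)]
print(bigrams_example)   # [('hello', 'world'), ('world', 'again')]
```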

### 1.2 Download and Extract <a class="anchor" id="download"></a>

First, we load the LEXICON_UG.txt and LEXICON_BG.txt datasets. Later, in the visualization section, we add a `sentiment` column to each dataset that is derived from the `SENTIMENT_SCORE` column: 1 for positive sentiment and 0 for negative sentiment.
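
The mapping from score to binary label can be sketched in plain Python as follows. The scores are made-up values, and `to_binary_sentiment` is a hypothetical helper; the notebook itself applies the same `> 0` test with `np.where`. Note that under this rule a score of exactly 0 is labeled 0.

```python
def to_binary_sentiment(score):
    """Map a sentiment score to 1 (positive) or 0 (not positive)."""
    return 1 if score > 0 else 0


# Made-up scores for illustration; real values come from SENTIMENT_SCORE.
example_scores = [0.73, -0.41, 0.0]
labels = [to_binary_sentiment(s) for s in example_scores]

print(labels)  # [1, 0, 0]
```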

In [None]:
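# Return a file-like handle to a data file stored in the project
def get_file_handle(fname):
    # Fetch the raw data file from the project's storage
    file_handle = project.get_file(fname)
    # Rewind so that readers start at the beginning of the file
    file_handle.seek(0)
    return file_handle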

#### LEXICON_UG.txt: 
A list of 66,058 unigrams and their predicted sentiment score. Note that in the paper, for unigrams that have sentiment in the HL lexicon (the publicly-available sentiment lexicon of Hu and Liu (2004)), we used the original sentiment from the HL lexicon (+1 or -1) and not the predicted score. This step is not reflected in the released lexicon. 

In [None]:
# define filename
DATA_PATH = 'LEXICON_UG.txt'

# Use pandas to read the space-separated file
data_path = get_file_handle(DATA_PATH)
unigrams = pd.read_csv(data_path, sep=" ")
unigrams.head()

#### LEXICON_BG.txt: 
A list of 262,555 selected bigrams in the following format:
- Column 1: the bigram
- Column 2: the OpenNLP POS tags of its unigrams
- Column 3: the predicted sentiment score

In [None]:
# define filename
DATA_PATH = 'LEXICON_BG.txt'

# Use pandas to read the space-separated file
data_path = get_file_handle(DATA_PATH)
bigrams = pd.read_csv(data_path, sep=" ")
bigrams.head()

### Composition and Adjective Class

In Part 2, we follow the rules from Table 1 of [this paper](https://www.aclweb.org/anthology/C18-1189.pdf) (by the creators of this dataset) to build a sentiment analysis model that produces sentiment scores. In that second notebook, we match bigrams against rules that each produce a predicted polarity (positive or negative). There are two groups of rules: composition classes and adjective classes. Adjective classes focus on the adjective pairs (high, low) and (fast, slow).

To support this, the dataset includes two files: 1) ADJECTIVES.xlsx and 2) SEMANTIC_CLASSES.xlsx. The adjectives file contains 5 sheets. The first sheet gives a list of words similar to each of high, low, fast, and slow. The next four sheets list the words associated with each specific case.

The semantic classes file has 6 sheets, one for each of the composition classes defined in the paper. In each sheet, there is a list of words that corresponds to that composition class.

#### SEMANTIC_CLASSES.xlsx: 
This file contains the word lists for each semantic class. For each semantic class (reversers, propagators, and dominators) there are two tabs in the Excel file, one for positive composition (POS) and one for negative composition (NEG). Overall there are 6 tabs: `DOMINATOR_NEG`, `DOMINATOR_POS`, `PROPAGETOR_POS`, `PROPAGETOR_NEG`, `REVERSER_POS`, `REVERSER_NEG`.
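
Because each tab name encodes both the semantic class and the composition polarity, the two parts can be separated with a simple string split. This is only a sketch: `split_sheet_name` and `sheet_index` are hypothetical names, and the tab names are copied exactly as listed above (including the `PROPAGETOR` spelling).

```python
# Tab names as listed in the description of SEMANTIC_CLASSES.xlsx.
SHEET_NAMES = [
    "DOMINATOR_NEG", "DOMINATOR_POS",
    "PROPAGETOR_POS", "PROPAGETOR_NEG",
    "REVERSER_POS", "REVERSER_NEG",
]


def split_sheet_name(name):
    """Split a tab name such as 'REVERSER_POS' into (class, polarity)."""
    semantic_class, polarity = name.rsplit("_", 1)
    return semantic_class, polarity


# Map each tab name to its (semantic class, polarity) pair.
sheet_index = {name: split_sheet_name(name) for name in SHEET_NAMES}

print(sheet_index["REVERSER_POS"])  # ('REVERSER', 'POS')
```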

#### ADJECTIVES.xlsx
This file contains the semantic class word lists for the gradable adjective pairs.

- `(HIGH,LOW)_POS_NEG`, `(HIGH,LOW)_NEG_POS`: the lists of words for ADJ high/low.
- `(FAST,SLOW)_POS_NEG`, `(FAST,SLOW)_NEG_POS`: the lists of words for ADJ fast/slow.
- `ADJECTIVE_EXPANSION`: the list of adjective expansions for high, low, fast, slow.

### 2. Data Visualization <a class="anchor" id="2"></a>
#### 2.2 Unigrams <a class="anchor" id="2-2"></a>

In [None]:
# Add sentiment column
unigrams['sentiment'] = np.where(unigrams['SENTIMENT_SCORE'] > 0, 1, 0)  # 1 is positive, 0 is negative
unigrams.head()

In [None]:
# Plot the distribution of the unigram sentiment scores
unigrams['SENTIMENT_SCORE'].iplot(
    kind='hist',
    bins=50,
    xTitle='polarity',
    linecolor='black',
    yTitle='count',
    title='Sentiment Polarity Distribution')
# The sentiment scores roughly follow a bell curve centered at 0, which suggests that about half are positive and half are negative.

In [None]:
# Get the string length of each unigram word
unigrams['uni_len'] = [len(str(i)) for i in unigrams['UNIGRAM']]
# Plot the unigram length distribution
unigrams['uni_len'].iplot(
    kind='hist',
    xTitle='unigram length',
    linecolor='black',
    yTitle='count',
    title='Unigram Text Length Distribution')

In [None]:
unigrams['first_letter'] = unigrams.UNIGRAM.str[0]
unigrams.head()

In [None]:
# Count the unigrams for each first letter and sentiment
group_data = unigrams.groupby(['first_letter', 'sentiment'])
group_data.count()

In [None]:
plt.figure(figsize=(20,20))
sns.set(style="darkgrid")
ax = sns.countplot(x="first_letter", data=unigrams)

plt.title('Unigram Count by First Letter')

# Label each bar with its count
for p in ax.patches:
    total_count = str(int(p.get_height()))
    x = p.get_x() + p.get_width() - 0.75
    y = p.get_y() + p.get_height()
    ax.annotate(total_count, (x, y))

In the unigrams dataset, the most common first letters are s and c, while the least common are x, y, and z.
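
The same kind of first-letter tally can be reproduced without a plot. The sketch below uses `collections.Counter` on a made-up word list, so the counts are illustrative only and say nothing about the actual dataset.

```python
from collections import Counter

# Made-up words for illustration; the notebook uses the UNIGRAM column instead.
example_words = ["sad", "superb", "sour", "calm", "cruel", "zesty"]

# Tally how many words start with each letter.
first_letter_counts = Counter(word[0] for word in example_words)

print(first_letter_counts.most_common(2))  # [('s', 3), ('c', 2)]
```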

#### 2.3 Bigrams <a class="anchor" id="2-3"></a>

In [None]:
# Add sentiment column
bigrams['sentiment'] = np.where(bigrams['SENTIMENT_SCORE'] > 0, 1, 0)  # 1 is positive, 0 is negative
bigrams.head()

In [None]:
# Plot the distribution of the bigram sentiment scores
bigrams['SENTIMENT_SCORE'].iplot(
    kind='hist',
    bins=50,
    xTitle='polarity',
    linecolor='black',
    yTitle='count',
    title='Sentiment Polarity Distribution')

In [None]:
# Count the bigrams for each combination of POS tags and sentiment
df = bigrams.groupby(['POS_TAGS', 'sentiment']).size().reset_index(name='count')
df.head()

In [None]:
import plotly.express as px

fig = px.bar(df, x="POS_TAGS", y="count", color="sentiment", title="Bigram Count by POS Tags and Sentiment")
fig.show()

In [None]:
# Plot the number of bigrams for each POS tag pair, most frequent first
bigrams.groupby('POS_TAGS').count()['SENTIMENT_SCORE'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', opacity=0.8,
    title='POS Tags Count', xTitle='POS Tag')

In [None]:
# Count the bigrams for each POS tag pair and sentiment
group_data = bigrams.groupby(['POS_TAGS','sentiment'])
group_data.count()

In [None]:
# Add a first-letter column
bigrams['first_letter'] = bigrams.BIGRAM.str[0]
# Count the bigrams for each first letter and sentiment
group_data = bigrams.groupby(['first_letter','sentiment'])
group_data.count()

In [None]:
plt.figure(figsize=(20,20))
sns.set(style="darkgrid")
ax = sns.countplot(x="first_letter", data=bigrams)

plt.title('Bigram Count by First Letter')

# Label each bar with its count
for p in ax.patches:
    total_count = str(int(p.get_height()))
    x = p.get_x() + p.get_width() - 0.75
    y = p.get_y() + p.get_height()
    ax.annotate(total_count, (x, y))

### 3. Save the Cleaned Data <a class="anchor" id="3"></a>

Finally, we save the cleaned datasets as project assets for later reuse. If the save succeeds, you should see output like the following:

```
{'file_name': 'bigrams.csv',
 'message': 'File saved to project storage.',
 'bucket_name': 'ibmdebatersentimentcompositionlex-donotdelete-pr-jhjwrb2ah5iwb0',
 'asset_id': '644d1e6c-757e-401c-9ff8-f6090e5ac998'}
```

**Note**: For this step to work, your project token (see the first cell of this notebook) must have the `Editor` role. By default, this will overwrite any existing file.
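
The save cell passes `float_format='%g'` to `to_csv`, which formats floating-point values with Python's `%g` specifier. As a small sketch of what that means, `%g` keeps at most six significant digits by default, so saved scores may be rounded. The values below are made up for illustration.

```python
# Made-up values showing how the '%g' specifier renders floats.
example_values = [0.5, 0.123456789, 1e-07]
formatted = ['%g' % value for value in example_values]

print(formatted)  # ['0.5', '0.123457', '1e-07']
```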

In [None]:
project.save_data("unigrams.csv", unigrams.to_csv(float_format='%g'), overwrite=True)
project.save_data("bigrams.csv", bigrams.to_csv(float_format='%g'), overwrite=True)


#### Next steps

- Close this notebook.
- Open the `Part 2 - Model Development` notebook to explore the cleaned dataset.

<a id="authors"></a> 
### Authors
This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
