In [1]:
# The code was removed by Watson Studio for sharing.

# Part 1 - Data Exploration & Visualization
# Load and Visualize IBM Debater® Thematic Clustering of Sentences
This notebook relates to the IBM Debater® Thematic Clustering of Sentences dataset. The dataset contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.

This dataset can be obtained for free from the IBM Developer [Data Asset Exchange](https://developer.ibm.com/exchanges/data/all/thematic-clustering-of-sentences/).

In this notebook, we load, explore, clean and visualize the dataset.

Text clustering can be applied to texts at different levels, from single words to full documents, and can vary with respect to the clustering goal. In thematic clustering, the aim is to cluster texts based on thematic similarity, namely grouping together texts that discuss the same theme. In this dataset, “Thematic Clustering of Sentences”, sentences are annotated with their thematic clusters.

### Table of Contents

* [0. Prerequisites](#prerequisite)
* [1. Load Data](#1)   
    * [1.1 About](#abstract)
    * [1.2 Download and Extract](#download)
    * [1.3 Preprocessing Data](#preprocess)
* [2. Data Visualization](#2)
* [3. Save the Cleaned Data](#3)
* [Authors](#authors)


<a class="anchor" id="prerequisite"></a>
### 0. Prerequisites

Before you run this notebook, complete the following steps:
- Insert a project token
- Import required modules

#### Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

```python
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
```

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

* Click on `More -> Insert project token` in the top-right menu section

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

* This should insert a cell at the top of this notebook similar to the example given above.

  > If an error is displayed indicating that no project token is defined, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data).

* Run the newly inserted cell before proceeding with the notebook execution below

#### Import required modules

Import and configure the required modules.

In [2]:
# Define required imports
import pandas as pd
from pandas import read_excel
import numpy as np
import matplotlib.pyplot as plt
# !pip install plotly==4.8.2 
import plotly.graph_objs as go
import seaborn as sns
!pip install cufflinks
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
from sklearn.feature_extraction.text import CountVectorizer
from IPython.display import clear_output
clear_output()

### 1. Load Data <a class="anchor" id="1"></a>

### 1.1 About <a class="anchor" id="abstract"></a>
The goal of these notebooks is to use the [IBM's Debater - Thematic Clustering of Sentences](https://developer.ibm.com/exchanges/data/all/thematic-clustering-of-sentences/) dataset to group sentences by their main topics and themes. This could be used, for example, in an application that collects comments and feedback from a company's customers, to help organize the comments.

This first notebook focuses on exploring the dataset and modifying it so that it can be used to evaluate our model in the second notebook.

### 1.2 Download and Extract <a class="anchor" id="download"></a>

This notebook requires only one file, `dataset.csv`, from the [IBM's Debater - Thematic Clustering of Sentences](https://developer.ibm.com/exchanges/data/all/thematic-clustering-of-sentences/) dataset. We use the function below to load this file from the Watson Studio project.

In [3]:
# Define get data file function
def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path

#### dataset.csv: 
This file contains 692 articles from Wikipedia, where the number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.

In [4]:
# define filename
DATA_PATH = 'dataset.csv'

# Use pandas to read the data
data_path = get_file_handle(DATA_PATH)
clustering_df = pd.read_csv(data_path, sep=" ")
clustering_df.head()

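Before moving on, it can help to confirm that the loaded frame has the columns the rest of this notebook relies on (`Article Title`, `SectionTitle`, and `Sentence`). The sketch below is a minimal check on a made-up toy frame; the helper name `missing_columns` is hypothetical and not part of pandas or the dataset tooling.

```python
import pandas as pd

# Columns used later in this notebook.
REQUIRED_COLUMNS = ["Article Title", "SectionTitle", "Sentence"]

def missing_columns(frame, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `frame` (hypothetical helper)."""
    return [name for name in required if name not in frame.columns]

# Toy frame standing in for the loaded dataset (made-up values).
toy_df = pd.DataFrame({
    "Article Title": ["Jazz"],
    "SectionTitle": ["Early history"],
    "Sentence": ["An example sentence."],
})

print(missing_columns(toy_df))  # prints []
```

If the check reports missing columns on the real data, one thing to look at is whether the `sep` argument passed to `pd.read_csv` matches the file's actual delimiter.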


### 1.3 Preprocessing Data <a class="anchor" id="preprocess"></a>

To use this data to evaluate a clustering model, we need to assign each sentence to a cluster. According to the dataset's readme file, each section title within an article defines one cluster. We can therefore combine the `Article Title` and `SectionTitle` columns to identify each group.

We will add columns in the dataset to more easily show the cluster. `label` is the unique string while `label_id` is a unique number.

In [None]:
clustering_df['label'] = clustering_df.apply(lambda row: row['Article Title'].replace(" ", "_") + ":" + row['SectionTitle'].replace(" ", "_"), axis=1)
clustering_df['label_id'] = clustering_df.label.astype('category').cat.codes
clustering_df.head()
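To make the labeling step concrete, here is a small sketch on a made-up toy frame. It builds the same `Article_Title:Section_Title` strings with vectorized string methods instead of a row-wise `apply`; the rows and the name `toy_df` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy rows standing in for the real dataset (made-up values).
toy_df = pd.DataFrame({
    "Article Title": ["Solar System", "Solar System", "Jazz"],
    "SectionTitle": ["Formation and evolution", "Formation and evolution", "Early history"],
    "Sentence": ["s1", "s2", "s3"],
})

# Join article and section titles, replacing spaces with underscores.
toy_df["label"] = (
    toy_df["Article Title"].str.replace(" ", "_", regex=False)
    + ":"
    + toy_df["SectionTitle"].str.replace(" ", "_", regex=False)
)

# Categorical codes number the sorted unique labels from 0 upward.
toy_df["label_id"] = toy_df["label"].astype("category").cat.codes

print(toy_df[["label", "label_id"]])
```

Because the category codes follow the sorted order of the labels, `Jazz:Early_history` receives ID 0 here and both `Solar_System` rows share ID 1.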

We can create a dictionary mapping the label ID to the label name.

In [None]:
id_to_category = dict( enumerate(clustering_df.label.astype('category').cat.categories) )

If we take a look at the number of sentences that correspond to each cluster (label), we see that one cluster has a lot more sentences.

In [None]:
# One group has a lot more sentences. 
clustering_df.label_id.value_counts()

In [None]:
id_to_category[32]

We remove this cluster from the dataset. When we test, we want to keep each group's sentences together, and one very large group may not be representative of real data.

In [None]:
# Remove rows in that top category
top_id = clustering_df.label_id.value_counts().index[0] 
df = clustering_df.loc[(clustering_df.label != id_to_category[top_id])]

In [None]:
df.head()
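The removal step can be illustrated on a toy frame. This sketch assumes a made-up `label_id` column and filters on the ID directly, using `idxmax()` to find the most frequent cluster; it is one possible way to express the step, not necessarily the form used above.

```python
import pandas as pd

# Toy data: cluster 0 is much larger than the others (made-up values).
toy_df = pd.DataFrame({
    "Sentence": ["a", "b", "c", "d", "e", "f"],
    "label_id": [0, 0, 0, 0, 1, 2],
})

# Count sentences per cluster and find the largest one.
counts = toy_df["label_id"].value_counts()
top_id = counts.idxmax()

# Keep every row that does not belong to the largest cluster.
filtered_df = toy_df.loc[toy_df["label_id"] != top_id]

print(top_id, len(filtered_df))
```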

Next, we use the `Sentence` column as the feature, since it holds all the text we are interested in, and `label_id` as the target we are predicting. The output below shows that 5554 clusters remain (after removing one), with about 8 sentences per cluster on average.

In [None]:
X = df.Sentence
y = df.label_id

print('Total data rows: ', len(X))
print('Unique groups: ', len(y.unique()))
print('Average number of rows per group: ', clustering_df.label_id.value_counts().mean())

To test a model, we break this dataset into smaller datasets, because in the real world we likely would not want to have 5000 unique clusters at once. We randomly take 5000 of the 5554 clusters and split them into 1000 sets of 5 clusters each. This gives us 1000 sets to test on (`list_of_groups`).

In [None]:
np.random.seed(42)  # get reproducible results
number_of_groups = 1000
# replace=False so that no cluster is drawn more than once
sampled_categories = np.random.choice(y.unique(), size=5000, replace=False)
list_of_groups = np.split(sampled_categories, number_of_groups)  # 5 categories in each group
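As a small illustration of the split, the sketch below samples cluster IDs from a made-up toy frame, divides them into equally sized groups, and selects the rows that belong to the first group. It passes `replace=False` so that no cluster is drawn twice; the sizes (12 clusters, 2 groups of 5) and variable names are assumptions chosen to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data: 12 clusters with 2 sentences each (made-up values).
toy_df = pd.DataFrame({
    "Sentence": [f"sentence {i}" for i in range(24)],
    "label_id": np.repeat(np.arange(12), 2),
})

np.random.seed(42)  # reproducible sampling

# Draw 10 distinct cluster IDs, then split them into 2 groups of 5.
unique_ids = toy_df["label_id"].unique()
sampled = np.random.choice(unique_ids, size=10, replace=False)
groups = np.split(sampled, 2)

# Select the sentences whose cluster falls in the first group.
first_subset = toy_df[toy_df["label_id"].isin(groups[0])]

print(len(groups), first_subset["label_id"].nunique(), len(first_subset))
```

Note that `np.split` requires the array length to be evenly divisible by the number of groups, which is why the sample size is a multiple of the group count.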

### 2. Data Visualization <a class="anchor" id="2"></a>

### 3. Save the Cleaned Data <a class="anchor" id="3"></a>

Finally, we save the cleaned dataset as a Project asset for later reuse. If the save succeeds, the call returns a dictionary like the one below (the bucket name and asset ID are specific to your project):

```
{'file_name': 'dataset_cleaned.csv',
 'message': 'File saved to project storage.',
 'bucket_name': '<your-project-bucket>',
 'asset_id': '<generated-asset-id>'}
```

**Note**: For this step to work, your project token (see the first cell of this notebook) must have the `Editor` role. Because `overwrite=True` is set, any existing file with the same name is overwritten.

In [25]:
# Save the cleaned DataFrame (`df`) to project storage.
# NOTE: the file name is an assumption; use the name the Part 2 notebook expects.
project.save_data("dataset_cleaned.csv", df.to_csv(index=False), overwrite=True)


#### Next steps

- Close this notebook.
- Open the `Part 2 - Model Development` notebook to explore the cleaned dataset.



<a id="authors"></a> 
### Authors
This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
