# Part 2 - Model Development
# Modeling Using IBM Debater® Thematic Clustering of Sentences

### Table of Contents

* [0. Prerequisites](#prerequisite)
* [1. Load Data](#1)
* [2. Modeling](#2)   
* [3. Testing](#3)
* [4. Example](#4)
* [Authors](#authors)

<a class="anchor" id="prerequisite"></a>
### 0. Prerequisites

Before you run this notebook, complete the following steps:
- Insert a project token
- Import required modules

#### Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

```python
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context
```

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

* Click on `More -> Insert project token` in the top-right menu section

![ws-project.mov](https://media.giphy.com/media/jSVxX2spqwWF9unYrs/giphy.gif)

* This should insert a cell at the top of this notebook similar to the example given above.

  > If an error is displayed indicating that no project token is defined, follow [these instructions](https://dataplatform.cloud.ibm.com/docs/content/wsj/analyze-data/token.html?audience=wdp&context=data).

* Run the newly inserted cell before proceeding with the notebook execution below

#### Import required modules

Import and configure the required modules.


In [2]:
import pandas as pd
import numpy as np
from ast import literal_eval
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.cluster import homogeneity_score, completeness_score, v_measure_score

### 1. Load Data <a class="anchor" id="1"></a>

Recall that in the first notebook (Part 1 - Data Exploration & Visualization), we modified the original dataset and saved two files to our Watson Studio project. We now need to load these files.

The function below helps load the files saved in our Watson Studio project.

In [3]:
# Define get data file function
def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path

Now we can get the datasets.

In [4]:
# Define filenames
DATA_PATHS = ['themes.csv', 'groups_of_themes.csv']

# Use pandas to read the data 
df, groups_of_themes = [pd.read_csv(get_file_handle(data_path)) for data_path in DATA_PATHS]

In [5]:
# Check loaded dataframe
df.head()

| Unnamed: 0 | Article Title | Sentence | SectionTitle | Article Link | label | label_id |
|---|---|---|---|---|---|---|
| 0 | Moeller High School | Moeller's student-run newspaper, The Crusader,... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
| 1 | Moeller High School | In 2008, The Crusader won First Place, the sec... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
| 2 | Moeller High School | The Squire is a student literary journal that ... | School publications | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:School_publications | 3414 |
| 3 | Moeller High School | Paul Keels - play-by-play announcer for Ohio S... | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |
| 4 | Moeller High School | Joe Uecker - Ohio State Senator (R-66) . | Notable alumni | https://en.wikipedia.org/wiki/Moeller_High_School | Moeller_High_School:Notable_alumni | 3413 |


In [6]:
# Check loaded dataframe
groups_of_themes.head()

| Unnamed: 0 | group |
|---|---|
| 0 | [2822, 1492, 2014, 4508, 4393] |
| 1 | [535, 2896, 3550, 1670, 2837] |
| 2 | [739, 659, 1015, 1362, 3938] |
| 3 | [4167, 4753, 1516, 1386, 1705] |
| 4 | [3029, 3826, 3057, 3969, 5299] |


We convert `groups_of_themes` to a list of groups (`list_of_groups`) so that it is easier to use in section 3.

In [7]:
# Convert groups_of_themes to list of lists
groups_of_themes['group'] = groups_of_themes['group'].apply(lambda x: literal_eval(x))
list_of_groups = groups_of_themes.group.values.tolist()
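
To see why the conversion is needed, here is a minimal sketch on a made-up frame (`toy_groups` and its values are hypothetical, not the project data). After `pd.read_csv`, each list arrives as a plain string, and `literal_eval` parses it back into a Python list without executing arbitrary code.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical frame mimicking groups_of_themes right after pd.read_csv:
# every cell in 'group' is a string that merely looks like a list.
toy_groups = pd.DataFrame({'group': ['[2822, 1492, 2014]', '[535, 2896, 3550]']})
print(type(toy_groups['group'][0]))

# literal_eval only parses Python literals (lists, numbers, strings, ...).
toy_groups['group'] = toy_groups['group'].apply(literal_eval)
toy_list_of_groups = toy_groups['group'].tolist()
print(toy_list_of_groups)
```

Passing `literal_eval` directly to `apply` is equivalent to wrapping it in a `lambda`, as the cell above does.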

Now we are ready to create a model.

###  2. Modeling <a class="anchor" id="2"></a>

In this section, we will create the clustering model. The model we create will be evaluated using the processed data we just loaded in section 1.

To understand the model, it helps to know these definitions:
* __TF-IDF__ (term frequency-inverse document frequency) is a statistic that indicates how important a word is to a document in a corpus.
* __document__ is an input text. So if we wanted to cluster a list of sentences, each sentence would be a document.
* __corpus__ is a collection of texts, so in the above example it would be the list of sentences.
* __stopwords__ are common words in the English language that are removed to ensure that they are not the most influential words in an NLP model. For example, you would not want the word "the" to be the most important feature.
* __unigram__ is a single term (e.g. "how"), __bigram__ is a string of two terms (e.g. "how are"), __trigram__ is a string of three terms (e.g. "how are you")
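
As a rough illustration of these definitions, the sketch below builds a TF-IDF matrix over a three-sentence toy corpus (`toy_corpus` is made up for this example). Stop-word removal is deliberately left off here, an assumption that differs from the model below, because "how", "are", and "you" would otherwise be filtered out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: each sentence is one document.
toy_corpus = [
    'how are you',
    'how are they',
    'where are you',
]

# ngram_range=(1, 3) extracts unigrams, bigrams, and trigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 3))
matrix = vectorizer.fit_transform(toy_corpus)

# vocabulary_ maps each extracted n-gram to its column in the matrix.
ngrams = sorted(vectorizer.vocabulary_)
print(ngrams)
print(matrix.shape)  # (number of documents, number of n-grams)

# Compare two weights within the first document.
row = matrix[0].toarray().ravel()
weight_common = row[vectorizer.vocabulary_['are']]
weight_rare = row[vectorizer.vocabulary_['how are you']]
print(weight_common < weight_rare)
```

Because "are" occurs in every document, it receives a lower weight than the trigram "how are you", which occurs in only one. That down-weighting of ubiquitous terms is the point of the IDF factor.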

To create the clustering model, we use the following steps:
1. Create a TF-IDF matrix from the input text. [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is configured with these parameters:
    * `max_df=0.75` means that terms appearing in more than 75% of the documents (here, the input texts) are ignored
    * `min_df=0.1` means that terms appearing in fewer than 10% of the documents are ignored
    * `stop_words='english'` means that scikit-learn's built-in English stop words are removed from the input text
    * `ngram_range=(1,3)` means that unigrams, bigrams, and trigrams are used
2. This matrix is then used to fit a KMeans model.
    * The number of clusters is an input.
    * If no number of clusters is given, we determine the best number of clusters using [silhouette scores](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)
3. Additionally, a function is written to extract the top terms used to cluster the text.
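
The effect of the two document-frequency thresholds in step 1 can be sketched on a made-up corpus (`toy_docs` is hypothetical and sized so that both thresholds come into play). This is only an illustration of the parameters, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus of 20 documents:
#   'clothes' appears in all 20, 'price' and 'service' in 10 each, 'refund' in only 1.
toy_docs = ['clothes price'] * 10 + ['clothes service'] * 9 + ['clothes service refund']

vectorizer = TfidfVectorizer(max_df=0.75, min_df=0.1)
vectorizer.fit(toy_docs)

# With 20 documents, max_df=0.75 drops terms found in more than 15 documents
# and min_df=0.1 drops terms found in fewer than 2.
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
```

Only "price" and "service" survive: "clothes" is too common to distinguish clusters, and "refund" is too rare to define one.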


The following function, `get_top_n_terms_per_cluster()`, performs step 3 and is called in `run_model()`.
Its purpose is to get the top terms that were used to cluster the text, so that someone can view them and better understand any patterns.

In [8]:
def get_top_n_terms_per_cluster(km_model, terms, n=5):
    """
    Gets the top terms used to cluster text
    
    :param km_model: KMeans model
    :param terms: list of terms from a TfidfVectorizer object
    :param n: number of top terms to return per cluster
    :return: dictionary mapping cluster number to top n terms
             {cluster_number: [term1, term2,..., termn]}
    """
    cluster_terms = defaultdict(list)
    
    order_centroids = km_model.cluster_centers_.argsort()[:, ::-1]
    for i in range(len(order_centroids)):
        cluster = order_centroids[i]
        for term_idx in cluster[:n]:
            cluster_terms[i].append(terms[term_idx])
            
    return cluster_terms
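
The core of this function is `argsort()[:, ::-1]`, which orders the term indices of each centroid from highest to lowest weight. A small sketch with made-up centroids (`toy_terms` and `toy_centers` are hypothetical stand-ins for the vectorizer terms and `km_model.cluster_centers_`):

```python
import numpy as np

toy_terms = ['price', 'service', 'socks', 'refund']

# Hypothetical centroids: one row per cluster, one column per term weight.
toy_centers = np.array([
    [0.9, 0.1, 0.5, 0.0],
    [0.2, 0.8, 0.0, 0.6],
])

# argsort sorts ascending; reversing each row puts the heaviest terms first.
order = toy_centers.argsort()[:, ::-1]
top_2 = {i: [toy_terms[j] for j in row[:2]] for i, row in enumerate(order)}
print(top_2)  # {0: ['price', 'socks'], 1: ['service', 'refund']}
```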

The next function is `run_kmeans()`, which uses the Python library `sklearn` to create a [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) model (a type of clustering model). KMeans essentially uses the distance between points (here, each comment or text) to find the best clusters. The distances are calculated from the rows of a TF-IDF matrix, which is created in `run_model()`.

In [9]:
def run_kmeans(number_of_clusters, tfidf_matrix):
    """
    :param number_of_clusters: int
    :param tfidf_matrix: matrix from TfidfVectorizer object
    :return: KMeans model, list of cluster labels
    """
    km_model = KMeans(n_clusters=number_of_clusters, init='k-means++')
    km_model.fit(tfidf_matrix.toarray())
    clusters = km_model.labels_.tolist()
    return km_model, clusters
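
KMeans starts from randomly chosen centroids, so repeated runs of `run_kmeans()` can produce different labelings. The sketch below, on made-up 2-D points, shows one way to make a run repeatable. The `random_state` and `n_init` arguments are assumptions added for this illustration and are not set in the function above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points standing in for rows of a TF-IDF matrix.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

# random_state and n_init are assumptions added here for reproducibility;
# run_kmeans() above relies on the library defaults for both.
toy_model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0)
toy_labels = toy_model.fit_predict(toy_points).tolist()

# Label numbers are arbitrary, so compare memberships rather than exact values.
print(toy_labels[0] == toy_labels[1], toy_labels[2] == toy_labels[3])
```

Because cluster numbers carry no meaning on their own, any evaluation should compare which points end up together rather than the label values themselves.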

Finally, `run_model()` puts the entire process together by 1) creating a TF-IDF matrix from the input text, 2) running the KMeans model, and 3) getting the top terms used to create the clusters.

This function returns:
1. `best_clusters`, a list of the cluster labels for each text
    * e.g. `[0, 1, 1, 2, 0]` means that there were 5 input texts and the model created 3 clusters
2. `cluster_terms`, a dictionary mapping each cluster label to the top terms used to create that cluster
    * e.g. `{0: ['money', 'price'], 1: ['customer', 'service'], 2: ['online', 'web']}`

In [10]:
def run_model(X, number_of_clusters=None, number_of_terms=5, max_number_of_groups=5):
    """
    Runs the entire modeling process
    1. create TFIDF matrix
    2. run KMeans with TFIDF matrix
    3. get top terms used
    
    :return: (list of cluster assignments, 
              dictionary mapping cluster number to top terms)
    """
    # First find TFIDF matrix
    tfidf_vectorizer = TfidfVectorizer(max_df=0.75 if len(X)>1 else 1, 
                                       min_df=0.1 if len(X)>1 else 1,
                                       stop_words='english',
                                       use_idf=True, 
                                       ngram_range=(1,3),
                                      )

    tfidf_matrix = tfidf_vectorizer.fit_transform(X)
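    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() (available since 1.0) replaces it.
    terms = tfidf_vectorizer.get_feature_names_out()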
    
    # silhouette_score needs at least 2 clusters and fewer clusters than texts,
    # so it cannot be used when there are 2 or fewer texts.
    # In that case, just set the number of clusters to the number of texts.
    number_of_clusters = len(X) if len(X) <= 2 else number_of_clusters
    
    if number_of_clusters:
        # If there's a specific number of clusters specified, then run with that number.
        km_model, best_clusters = run_kmeans(number_of_clusters, tfidf_matrix)
        cluster_terms = get_top_n_terms_per_cluster(km_model, terms, number_of_terms)
    else:
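        # Automatically find the number of clusters with silhouette_score.
        # range() excludes its end, so k runs from 2 to min(max_number_of_groups, len(X)) - 1.
        # Silhouette scores lie in [-1, 1]; start below that range so that
        # the first k is always recorded and the return values are always defined.
        max_silhouette_score = float('-inf')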
        for k in range(2, min(max_number_of_groups, len(X))):
            km_model, clusters = run_kmeans(k, tfidf_matrix)
            current_silhouette_score = silhouette_score(tfidf_matrix, clusters)
            if current_silhouette_score > max_silhouette_score:
                max_silhouette_score = current_silhouette_score
                cluster_terms = get_top_n_terms_per_cluster(km_model, terms, number_of_terms)
                best_clusters = clusters
            
    return best_clusters, cluster_terms
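
The automatic branch of `run_model()` picks the number of clusters with the highest silhouette score. A minimal sketch of that selection on made-up 2-D points (the data, `random_state`, and `n_init` are assumptions for illustration, not the TF-IDF matrix used above):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three tight, well-separated groups of four points each.
corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
toy_points = np.vstack([corners, corners + [10, 10], corners + [20, 0]])

# Score each candidate k; a higher silhouette score means tighter, better-separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy_points)
    scores[k] = silhouette_score(toy_points, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On data this cleanly separated, the score should peak at k = 3. Real sentence data is far noisier, so the peak is usually less pronounced.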

### 3. Testing Model <a class="anchor" id="3"></a>

Next, we can test the clustering model with the dataset that we preprocessed in the first notebook and loaded in section 1. We evaluate the model using [V-measure](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.v_measure_score.html), which is the harmonic mean of [homogeneity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_score.html#sklearn.metrics.homogeneity_score) and [completeness](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.completeness_score.html#sklearn.metrics.completeness_score). Values range from 0 to 1, and higher is better.
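
To make the relationship between the three metrics concrete, here is a small sketch with made-up labels (`toy_truth` and `toy_pred` are hypothetical). The prediction merges two true themes into one cluster, which keeps completeness perfect but lowers homogeneity.

```python
from sklearn.metrics.cluster import (
    completeness_score,
    homogeneity_score,
    v_measure_score,
)

toy_truth = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 1, 1, 1, 1]  # themes 1 and 2 are merged into one cluster

h = homogeneity_score(toy_truth, toy_pred)   # does each cluster contain a single theme?
c = completeness_score(toy_truth, toy_pred)  # is each theme kept within a single cluster?
v = v_measure_score(toy_truth, toy_pred)

# With the default beta=1, V-measure is the harmonic mean of the two.
harmonic_mean = 2 * h * c / (h + c)
print(h, c, v, harmonic_mean)

# The scores ignore how clusters are numbered: a relabeled perfect clustering still scores 1.
v_relabeled = v_measure_score(toy_truth, [5, 5, 7, 7, 9, 9])
print(v_relabeled)
```

This label invariance is what allows the KMeans cluster numbers to be compared directly against `label_id` values in the test below.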

Additionally, we create a baseline model that predicts random clusters for each input. We want our clustering model defined in section 2 to do better than this baseline.

In [11]:
def run_model_testing(list_of_groups, df, test_size=50):
    average_homogeneity_score, average_completeness, average_v_measure = 0, 0, 0
    avg_baseline_score = 0

    for group in list_of_groups[:test_size]:
        X_test = list(df[df.label_id.isin(list(group))].Sentence.values)
        y_test = list(df[df.label_id.isin(list(group))].label_id.values)
        n = len(df[df.label_id.isin(list(group))].label_id.unique())
        
        clusters, cluster_terms = run_model(X_test)
        
        average_homogeneity_score += homogeneity_score(y_test, clusters)
        average_completeness += completeness_score(y_test, clusters)
        average_v_measure += v_measure_score(y_test, clusters)
        
        # baseline
        baseline_predictions = np.random.choice(np.arange(1, 5), len(X_test))
        avg_baseline_score += v_measure_score(y_test, baseline_predictions)

    
    print('Test V Measure:     ', average_v_measure/ test_size)
    print('Baseline V Measure: ', avg_baseline_score / test_size)


Our model does much better than randomly picking clusters: in the run shown below, the Test V Measure is about 0.51, compared with about 0.15 for the Baseline V Measure.

In [12]:
run_model_testing(list_of_groups, df)



Test V Measure:      0.5087237379338818
Baseline V Measure:  0.15418958938016988


### 4. Example  <a class="anchor" id="4"></a>

Finally, we run the model on sample comments that a retail company might receive.
To test out your own comments:
1. Define your comments, e.g. `comments = ['comment1', 'comment2', ...]`.
2. Call `run_model(comments)`, which returns `best_labels` and `top_terms`, e.g. `best_labels, top_terms = run_model(comments)`.
3. Follow the example below to use `print_clustering_result` to print out the groups in a more human-readable way.


In [13]:
comments_5 = [
    'Customer service was polite.',
    'The socks are a pretty color but expensive.',
    'The shirt I bought was green and service was great.',
    'I think the sweater and socks were perfect.',
    'I do not like the shoes, so ugly and expensive.',
]

comments_20 = [
    'Out of all the products I bought, the shirt was my favorite because it is comfortable. However, the sweater and socks really missed the mark and were not worth it.',
    'My order arrived several days late. But when I contacted customer service they were very helpful and refunded me.',
    'Horrible customer service, I have never met such rude people. Would not recommend at all.',
    'Everything I ordered arrived perfectly on time and looked exactly like in the pictures! This company has high quality products.',
    'The company is okay.',
    'Prices are ridiculous',
    'Way too overpriced.',
    'Can never find anything that fits right',
    'Nice clothes for your teenager.',
    'They were really well organized and made the experience way less stressful than i thought it would be',
    'Friendly staff, good range of clothes.',
    'Fashionable place',
    'Decent quality product! Friendly customer service',
    'I really love all the clothes, beautiful. Just a little too expensive',
    'Great quality of the items, cashier and stocker were very friendly.',
    'Ugly and overpriced.',
    'This place was okay and I did find a couple plain shirts for cheap. Overall disappointed with their selection of basics and prices.',
    'Rude employees. Horrible customer service and limited clothing. Only good thing is cheap clothing',
    'Service was good and I got a lot for my money',
    'My favorite brand',
]

The following function prints the clustering results in a more interpretable way.

In [14]:
def print_clustering_result(sentences, labels, top_terms):
    label_to_sentences = defaultdict(list)
    for i in range(len(labels)):
        label_to_sentences[labels[i]].append(sentences[i])
    
    number_of_groups = len(label_to_sentences.keys())
    for i in range(number_of_groups):
        print('---------------------------------------')
        print('Group {}. Top Terms: {}'.format(i, top_terms[i]))
        for j in range(len(label_to_sentences[i])):
            print('- ' + label_to_sentences[i][j])

Using the example comments, we can see how the model decided to cluster the sentences.

In [15]:
# Automatically determine the number of clusters.
# Alternatively, pass it explicitly, e.g. run_model(comments, number_of_clusters=5)
for comments in [comments_5, comments_20]:
    best_labels, top_terms = run_model(comments)
    print_clustering_result(comments, best_labels, top_terms)
    print('\n\n')

---------------------------------------
Group 0. Top Terms: ['expensive', 'socks', 'pretty color expensive', 'color', 'color expensive']
- The socks are a pretty color but expensive.
- I think the sweater and socks were perfect.
- I do not like the shoes, so ugly and expensive.
---------------------------------------
Group 1. Top Terms: ['service', 'polite', 'service polite', 'customer', 'customer service']
- Customer service was polite.
- The shirt I bought was green and service was great.



---------------------------------------
Group 0. Top Terms: ['favorite', 'company', 'really', 'products', 'way']
- Out of all the products I bought, the shirt was my favorite because it is comfortable. However, the sweater and socks really missed the mark and were not worth it.
- Everything I ordered arrived perfectly on time and looked exactly like in the pictures! This company has high quality products.
- The company is okay.
- They were really well organized and made the experience way less st

### Authors
This notebook was created by the [Center for Open-Source Data & AI Technologies](http://codait.org).

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
