# Biclustering

In this exercise, we explore biclustering on an SMS spam dataset. Perform the following tasks:

1. Create a feature matrix from the raw text using a TF-IDF vectorizer
    - you can use the one defined in the lab
2. Remove empty rows/columns from the TF-IDF matrix
3. Apply spectral co-clustering to this matrix and extract two biclusters
4. Compare the results against k-means
5. Identify the mix of spam and ham, as well as the important words, in each bicluster
    - do these biclusters make sense in terms of topic mixture and words?
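
Before working with the SMS data, it may help to see what spectral co-clustering returns on a matrix whose block structure is obvious. The sketch below uses a hypothetical 4x4 matrix (`toy` and `toy_model` are illustrative names, not part of the exercise) and prints one cluster label per row and one per column.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

# Hypothetical matrix with two obvious blocks:
# rows 0-1 mostly use columns 0-1, rows 2-3 mostly use columns 2-3.
# The small 0.1 entries keep the two blocks weakly connected.
toy = np.array([
    [5.0, 4.0, 0.1, 0.1],
    [4.0, 5.0, 0.1, 0.1],
    [0.1, 0.1, 6.0, 5.0],
    [0.1, 0.1, 5.0, 6.0],
])

toy_model = SpectralCoclustering(n_clusters=2, random_state=0)
toy_model.fit(toy)

# Each row and each column is assigned to exactly one bicluster.
print("row labels:   ", toy_model.row_labels_)
print("column labels:", toy_model.column_labels_)
```

Rows and columns that share a label form one bicluster, which is how documents and words are grouped together in the tasks below.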


In [1]:
from collections import defaultdict
import operator
from time import time

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans, SpectralBiclustering, SpectralCoclustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.cluster import v_measure_score

*Original Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset/home*

For this exercise, we use the first 100 SMS messages, each of which has been parsed and labeled as "spam" or "ham". The dataframe also contains the original message text.


### Load the dataset

In [2]:
df = pd.read_csv("/dsa/data/DSA-8410/spam.csv", encoding='latin1')
df.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


In [3]:
df_sub = df[['v1', 'v2']][:100]
df_sub.columns = ['class', 'text']
df_sub.head()

Unnamed: 0,class,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [4]:
y_true = df_sub['class']
categories = ['spam', 'ham']

### T1. Create a TF-IDF matrix from the raw text. You can use `NumberNormalizingVectorizer` defined in the lab. Use `min_df=2` and `max_features=100`.

In [5]:
def number_normalizer(tokens):
    """ Map all numeric tokens to a placeholder.

    For many applications, tokens that begin with a number are not directly
    useful, but the fact that such a token exists can be relevant.  By applying
    this form of dimensionality reduction, some methods may perform better.
    """
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)

class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))
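
To illustrate what the number normalization does, here is a minimal sketch on three made-up messages (`toy_docs`, `toy_vectorizer`, and `toy_vocab` are hypothetical names). The class definition is repeated so that the snippet runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def number_normalizer(tokens):
    # Replace every token that starts with a digit by a shared placeholder.
    return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


class NumberNormalizingVectorizer(TfidfVectorizer):
    def build_tokenizer(self):
        tokenize = super().build_tokenizer()
        return lambda doc: list(number_normalizer(tokenize(doc)))


# Made-up messages, two of which contain numeric tokens.
toy_docs = ["call 0800 now", "win 1000 cash", "call me later"]

toy_vectorizer = NumberNormalizingVectorizer()
toy_vectorizer.fit(toy_docs)

# The learned vocabulary contains the placeholder instead of the raw numbers.
toy_vocab = sorted(toy_vectorizer.vocabulary_)
print(toy_vocab)
```

With a plain `TfidfVectorizer`, each distinct number would instead become its own feature.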

In [6]:
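# A plain TfidfVectorizer is used here, so numeric tokens such as "1000"
# are kept as separate features rather than mapped to "#NUMBER".
vectorizer = TfidfVectorizer(stop_words='english', min_df=2, max_features=100)
X = vectorizer.fit_transform(df_sub['text'])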

### T2. Remove empty rows (i.e. rows whose values are all 0) from the feature matrix. Also, update `y_true`.

In [7]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
df_clean = pd.DataFrame.sparse.from_spmatrix(X, columns=vectorizer.get_feature_names_out())
df_clean.head()

Unnamed: 0,000,100,1000,16,20,50,87121,afternoon,anymore,apply,...,watching,way,week,wif,win,won,word,work,yeah,yes
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.638962,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.404579,0.0,0.0,0.404579,...,0.0,0.0,0.0,0.0,0.404579,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
# Keep only the documents that contain at least one vocabulary term.
non_empty_rows = df_clean.sum(axis=1) != 0
df_clean = df_clean[non_empty_rows]
df_clean.shape

(92, 100)

In [9]:
y_true = y_true[non_empty_rows]
y_true.shape

(92,)
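
The same filtering can also be done directly on the sparse matrix, without going through a DataFrame. The sketch below is one possible way, shown on a hypothetical 3x3 matrix (`toy_X`, `toy_y`, and `keep` are illustrative names); it assumes the matrix stores no explicit zeros, since `getnnz` counts stored entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical TF-IDF-like matrix; the second row has no non-zero entry.
toy_X = csr_matrix(np.array([
    [0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5],
]))
toy_y = np.array(["ham", "spam", "ham"])

# Boolean mask: True for rows with at least one stored (non-zero) value.
keep = toy_X.getnnz(axis=1) > 0

# Apply the same mask to the features and to the labels so they stay aligned.
toy_X_clean = toy_X[keep]
toy_y_clean = toy_y[keep]

print(toy_X_clean.shape, toy_y_clean)
```

Filtering the labels with the same mask is what keeps `y_true` consistent with the rows that remain.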

In [10]:
X = df_clean.values

### T3. Define and fit a spectral coclustering model; use n_clusters = 2

In [11]:
cocluster = SpectralCoclustering(n_clusters=len(categories),
                                 svd_method='arpack', random_state=0)
cocluster.fit(X)

SpectralCoclustering(n_clusters=2, random_state=0, svd_method='arpack')

### T4. Compare the performance against k-means using V-measure

In [12]:
y_cocluster = cocluster.row_labels_

In [13]:
# Signature is v_measure_score(labels_true, labels_pred).
v_score = v_measure_score(y_true, y_cocluster)
print(f"V-measure: {v_score:.4f}")

V-measure: 0.0218


In [14]:
kmeans = KMeans(n_clusters=len(categories), random_state=0)

In [15]:
y_kmeans = kmeans.fit_predict(X)
# Signature is v_measure_score(labels_true, labels_pred).
v_score = v_measure_score(y_true, y_kmeans)
print(f"V-measure: {v_score:.4f}")


V-measure: 0.0081
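
To put these numbers in context: V-measure is the harmonic mean of homogeneity and completeness, ranges from 0 to 1, and does not depend on which integer is assigned to which cluster. The following sketch on made-up labels (`toy_true`, `pred_a`, `pred_b`, `pred_c` are hypothetical) shows a perfect score, its invariance to swapped cluster ids, and the score of a clustering that puts everything in one cluster.

```python
from sklearn.metrics.cluster import v_measure_score

# Made-up ground truth and three candidate clusterings.
toy_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
pred_a = [0, 0, 1, 1, 0, 1]   # matches the classes exactly
pred_b = [1, 1, 0, 0, 1, 0]   # same partition as pred_a, ids swapped
pred_c = [0, 0, 0, 0, 0, 0]   # everything in a single cluster

score_a = v_measure_score(toy_true, pred_a)
score_b = v_measure_score(toy_true, pred_b)
score_c = v_measure_score(toy_true, pred_c)

# The metric is symmetric in its two arguments.
score_a_swapped = v_measure_score(pred_a, toy_true)

print(f"exact match:      {score_a:.4f}")
print(f"ids swapped:      {score_b:.4f}")
print(f"single cluster:   {score_c:.4f}")
print(f"arguments swapped: {score_a_swapped:.4f}")
```

Both scores reported above are close to 0, which suggests that neither method's two clusters line up with the spam/ham labels.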


### T5. Identify the combination of spam and ham as well as important words in each bicluster. Do these biclusters make sense in terms of topic mixture and words?

In [16]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
feature_names = vectorizer.get_feature_names_out()
document_names = y_true.values

In [17]:
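def bicluster_ncut(i):
    """Normalized cut of bicluster i: weight crossing its boundary / weight inside."""
    rows, cols = cocluster.get_indices(i)
    # An empty bicluster gets the worst possible score. Test the lengths:
    # np.any(rows) would wrongly treat the index array [0] as empty.
    if len(rows) == 0 or len(cols) == 0:
        return np.finfo(float).max
    row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]
    col_complement = np.nonzero(np.logical_not(cocluster.columns_[i]))[0]
    # Weight inside the bicluster (its rows restricted to its columns).
    weight = X[rows][:, cols].sum()
    # Weight leaving the bicluster: other rows in its columns,
    # plus its rows in other columns.
    cut = (X[row_complement][:, cols].sum() +
           X[rows][:, col_complement].sum())
    return cut / weight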


def most_common(d):
    """Items of a defaultdict(int) with the highest values.

    Like Counter.most_common in Python >=2.7.
    """
    return sorted(d.items(), key=operator.itemgetter(1), reverse=True)

In [18]:
bicluster_ncuts = [bicluster_ncut(i) for i in range(len(categories))]
best_idx = np.argsort(bicluster_ncuts)[:5]
best_idx

array([0, 1])
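
The quantity computed by `bicluster_ncut` can be checked by hand on a small case. The sketch below uses a hypothetical 3x3 matrix and a hand-picked bicluster (rows 0-1, columns 0-1); all names are illustrative. A lower value means the bicluster is better separated from the rest of the matrix.

```python
import numpy as np

# Hypothetical document-term weights.
toy = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 0.0],
    [0.0, 1.0, 6.0],
])

# Hand-picked bicluster: rows 0-1 together with columns 0-1.
rows = np.array([0, 1])
cols = np.array([0, 1])
row_complement = np.setdiff1d(np.arange(toy.shape[0]), rows)
col_complement = np.setdiff1d(np.arange(toy.shape[1]), cols)

# Weight inside the bicluster: 5 + 4 + 4 + 5 = 18.
weight = toy[np.ix_(rows, cols)].sum()

# Weight crossing the boundary: (0 + 1) from the other row in these columns,
# plus (1 + 0) from these rows in the other column, so 2 in total.
cut = (toy[np.ix_(row_complement, cols)].sum() +
       toy[np.ix_(rows, col_complement)].sum())

toy_ncut = cut / weight
print(f"normalized cut: {toy_ncut:.4f}")
```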

In [19]:


print()
print("Best biclusters:")
print("----------------")

for idx, cluster in enumerate(best_idx):
    n_rows, n_cols = cocluster.get_shape(cluster)
    cluster_docs, cluster_words = cocluster.get_indices(cluster)
    if not len(cluster_docs) or not len(cluster_words):
        continue

    # categories
    counter = defaultdict(int)
    for i in cluster_docs:
        counter[document_names[i]] += 1
    cat_string = ", ".join("{:.0f}% {}".format(float(c) / n_rows * 100, name)
                           for name, c in most_common(counter)[:3])

    # words
    out_of_cluster_docs = cocluster.row_labels_ != cluster
    out_of_cluster_docs = np.where(out_of_cluster_docs)[0]
    word_col = X[:, cluster_words]
    word_scores = np.array(word_col[cluster_docs, :].sum(axis=0) -
                           word_col[out_of_cluster_docs, :].sum(axis=0))
    word_scores = word_scores.ravel()
    important_words = list(feature_names[cluster_words[i]]
                           for i in word_scores.argsort()[:-11:-1])

    print("bicluster {} : {} documents, {} words".format(
        idx, n_rows, n_cols))
    print("categories   : {}".format(cat_string))
    print("words        : {}\n".format(', '.join(important_words)))


Best biclusters:
----------------
bicluster 0 : 89 documents, 98 words
categories   : 81% ham, 19% spam
words        : did, like, free, don, just, sorry, ok, way, ll, yes

bicluster 1 : 3 documents, 2 words
categories   : 100% ham
words        : watching, wat
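
A cross-tabulation is another way to read the class mix of each bicluster. The sketch below uses made-up labels (`toy_true` and `toy_cluster` are hypothetical) to show the counts and the per-cluster shares.

```python
import pandas as pd

# Made-up class labels and cluster assignments for six documents.
toy_true = pd.Series(["ham", "ham", "spam", "ham", "spam", "ham"], name="class")
toy_cluster = pd.Series([0, 0, 0, 1, 0, 1], name="bicluster")

# Number of documents per (bicluster, class) pair.
counts = pd.crosstab(toy_cluster, toy_true)

# Same table with each row normalized to sum to 1.
shares = pd.crosstab(toy_cluster, toy_true, normalize="index")

print(counts)
print(shares.round(2))
```

In this notebook, the analogous call would presumably be `pd.crosstab(cocluster.row_labels_, y_true.values)`, which should reproduce the percentages printed above.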

