<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Clustering-based" data-toc-modified-id="Clustering-based-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Clustering based</a></span><ul class="toc-item"><li><span><a href="#modeling" data-toc-modified-id="modeling-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>modeling</a></span></li><li><span><a href="#prediction" data-toc-modified-id="prediction-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>prediction</a></span></li><li><span><a href="#evaluation" data-toc-modified-id="evaluation-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>evaluation</a></span></li></ul></li></ul></div>

In [3]:
import numpy as np 
seeds = 1234
np.random.seed(seeds)

In [4]:
import pandas as pd
train = pd.read_json('../data/structured_train.json')
test = pd.read_json('../data/structured_test.json')

In [5]:
train = train.groupby('label').sample(50, random_state=seeds)
test = test.groupby('label').sample(50, random_state=seeds)

In [6]:
select_cols = ["global_index", "doc_path", "label",
               "reply", "reference_one", "reference_two", "tag_reply", "tag_reference_one", "tag_reference_two",
               "Subject", "From", "Lines", "Organization", "contained_emails", "long_string", "text", "error_message"
               ]
print("\nmay use cols: \n", select_cols)
train = train[select_cols]
test = test[select_cols]


may use cols: 
 ['global_index', 'doc_path', 'label', 'reply', 'reference_one', 'reference_two', 'tag_reply', 'tag_reference_one', 'tag_reference_two', 'Subject', 'From', 'Lines', 'Organization', 'contained_emails', 'long_string', 'text', 'error_message']


# Clustering based
- Steps:
    1. Transform the training text into a TF-IDF matrix
    2. Reduce the dimensionality with truncated SVD (200 components planned; the run below uses `out_dim=2`)
    3. Cluster in cosine-similarity space, which suits word-based vectors
    4. Assign a label to each cluster by majority vote over the training-set labels
    5. Prediction
        1. Transform the test set into a TF-IDF matrix
        2. Reduce the dimensionality with the mapper fitted on the training set
        3. Predict using the clusters and the cluster-to-label mapping from the training set
    6. Evaluation
        1. Based on the classification report
        
- Time complexity 
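    - O(mnkd), where m is the number of iterations (it depends on the distribution of the dataset and on the initial positions of the centers), n is the number of samples in the dataset, k is the number of clusters, and d is the dimensionality of the data.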
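
To make the O(mnkd) cost concrete, here is a minimal NumPy sketch of k-means under cosine distance. It is not the implementation behind `fit_clustering_model`, which is not shown in this notebook; the name `cosine_kmeans` and its parameters are assumptions made for illustration. Each of the m iterations builds an n-by-k similarity matrix from d-dimensional rows, which is where the n·k·d factor comes from.

```python
import numpy as np

def cosine_kmeans(X, k, max_iter=100, seed=0):
    """Hypothetical helper: k-means with cosine distance (spherical k-means)."""
    rng = np.random.default_rng(seed)
    # Normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X = X / np.where(norms == 0, 1.0, norms)
    # Initialise centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    assign = np.full(len(X), -1)
    for _ in range(max_iter):                    # at most m iterations
        sims = X @ centroids.T                   # n x k similarities, each costing O(d)
        new_assign = sims.argmax(axis=1)
        if np.array_equal(new_assign, assign):   # assignments stable: converged
            break
        assign = new_assign
        for j in range(k):
            members = X[assign == j]
            if len(members):                     # an empty cluster keeps its old centroid
                center = members.mean(axis=0)
                length = np.linalg.norm(center)
                if length > 0:
                    centroids[j] = center / length
    return assign, centroids

# Toy data: two rows near the x-axis, two rows near the y-axis.
X_demo = np.array([[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]])
assign, centroids = cosine_kmeans(X_demo, k=2)
```

Because the rows are normalised first, vector length is ignored and only direction matters, which is the usual motivation for cosine distance on TF-IDF features.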

## modeling

In [7]:
train_text = train['tag_reply'] + ' ' + train['tag_reference_one']
# train_text = train['reply'] + ' ' + train['reference_one']
train_label = train['label']
test_text  = test['tag_reply'] + ' ' + test['tag_reference_one']
# test_text  = test['reply'] + ' ' + test['reference_one']
test_label = test['label']

In [9]:
from clustering_utils import *
# used below: tfidf_vectorizer, dimension_reduction, fit_clustering_model, pred_clustering_model

In [10]:
dtm_train, dtm_test, word_to_idx, tfidf_vect = tfidf_vectorizer(train_text, test_text, min_df=2)
dtm_train, transform_mapper = dimension_reduction(dtm_train, out_dim=2)
dtm_test = transform_mapper.transform(dtm_test)

print('dtm_train.shape', dtm_train.shape)
print('dtm_test.shape', dtm_test.shape)
print(word_to_idx)

num of words: 3685
Dimension reduction with truncate SVD:
   input columns with  3685
   output columns with  2
dtm_train.shape (1000, 2)
dtm_test.shape (1000, 2)
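
The cell above fits the vectorizer and the SVD mapper on the training text and then reuses both on the test text. The following is a rough NumPy-only sketch of that fit-then-transform pattern. The helpers `fit_tfidf`, `tfidf_transform` and `fit_svd` are hypothetical stand-ins and not the contents of `clustering_utils`; the smoothed IDF formula and the whitespace tokenisation are assumptions.

```python
import numpy as np
from collections import Counter

def fit_tfidf(docs, min_df=1):
    """Hypothetical helper: build a vocabulary and smoothed IDF weights from training docs."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(tok for toks in tokenised for tok in set(toks))
    kept = sorted(w for w, c in df.items() if c >= min_df)   # drop rare words
    vocab = {w: i for i, w in enumerate(kept)}
    n_docs = len(docs)
    idf = np.array([np.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab])
    return vocab, idf

def tfidf_transform(docs, vocab, idf):
    """Map docs onto the fixed training vocabulary; unseen words are ignored."""
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for tok, count in Counter(doc.lower().split()).items():
            if tok in vocab:
                X[row, vocab[tok]] = count * idf[vocab[tok]]
    return X

def fit_svd(X, out_dim):
    """Return the top `out_dim` right singular vectors as a projection matrix."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:out_dim].T                      # shape: (n_features, out_dim)

train_docs = ["cpu board memory", "memory board disk",
              "faith church prayer", "church prayer faith hope"]
test_docs = ["disk memory cpu", "prayer hope unknownword"]

vocab, idf = fit_tfidf(train_docs, min_df=2)
X_train = tfidf_transform(train_docs, vocab, idf)
X_test = tfidf_transform(test_docs, vocab, idf)

projection = fit_svd(X_train, out_dim=2)       # fitted on the training data only
train_2d = X_train @ projection
test_2d = X_test @ projection                  # same projection reused for the test set
```

Reusing the training vocabulary and projection keeps train and test rows in the same coordinate system, so distances to the training centroids remain meaningful.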


In [35]:
clusterer, clusters_to_labels = fit_clustering_model(dtm_train, train_label, num_clusters=3, repeats=2)

  return 1 - (numpy.dot(u, v) / (sqrt(numpy.dot(u, u)) * sqrt(numpy.dot(v, v))))


Cluster to label mapping: 
Cluster 0 <-> label soc.religion.christian
Cluster 1 <-> label sci.electronics
Cluster 2 <-> label comp.sys.ibm.pc.hardware
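
A cluster-to-label mapping like the one printed above can be obtained by majority vote. The sketch below shows one way to do it, assuming that each training row has a cluster id and a label; `majority_vote_mapping` is a hypothetical name, and the logic inside `fit_clustering_model` may differ.

```python
from collections import Counter, defaultdict

def majority_vote_mapping(cluster_ids, labels):
    """Hypothetical helper: map each cluster id to its most frequent training label."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        votes[cluster_id][label] += 1
    # most_common(1) returns [(label, count)] for the top label of the cluster.
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}

example_clusters = [0, 0, 0, 1, 1, 2]
example_labels = ["sci.electronics", "sci.electronics", "soc.religion.christian",
                  "soc.religion.christian", "soc.religion.christian",
                  "comp.sys.ibm.pc.hardware"]
mapping = majority_vote_mapping(example_clusters, example_labels)
```

A mapping of this kind can emit at most one label per cluster, so the number of clusters bounds the number of distinct labels that can be predicted.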




## prediction

In [None]:
pred = pred_clustering_model(dtm_test, clusterer, clusters_to_labels)
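
Conceptually, prediction assigns each test row to its nearest cluster and then looks up that cluster's label. The sketch below illustrates the idea with NumPy, assuming the centroids are available as an array; `predict_labels` is a hypothetical helper and not the body of `pred_clustering_model`.

```python
import numpy as np

def predict_labels(X, centroids, clusters_to_labels):
    """Hypothetical helper: label each row by its nearest centroid under cosine distance."""
    def unit(a):
        norms = np.linalg.norm(a, axis=1, keepdims=True)
        return a / np.where(norms == 0, 1.0, norms)
    sims = unit(X) @ unit(centroids).T     # cosine similarity, shape (n_rows, n_clusters)
    nearest = sims.argmax(axis=1)          # highest similarity = smallest cosine distance
    return [clusters_to_labels[int(c)] for c in nearest]

example_centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
example_mapping = {0: "sci.electronics", 1: "soc.religion.christian"}
X_new = np.array([[3.0, 0.5], [0.2, 2.0]])
example_pred = predict_labels(X_new, example_centroids, example_mapping)
```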

## evaluation

In [None]:
from sklearn import metrics, preprocessing
# Optional: encode the string labels as integers before scoring.
# le = preprocessing.LabelEncoder()
# encoded_test_label = le.fit_transform(test_label)
# print(metrics.classification_report(y_true=encoded_test_label, y_pred=pred, target_names=le.classes_))
print(metrics.classification_report(y_true=test_label, y_pred=pred))
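
For readers who want to see what the report summarises, here is a small pure-Python sketch that computes per-label precision, recall and F1 from two label lists. `per_label_scores` is a hypothetical helper written for illustration; it does not reproduce the report's formatting or its averaged rows.

```python
from collections import Counter

def per_label_scores(y_true, y_pred):
    """Hypothetical helper: precision, recall and F1 for every label in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1     # guessed label gains a false positive
            fn[truth] += 1     # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

y_true_demo = ["a", "a", "b", "b", "c"]
y_pred_demo = ["a", "b", "b", "b", "a"]
scores = per_label_scores(y_true_demo, y_pred_demo)
```

In this sketch a label that is never predicted, such as `"c"` in the toy data, receives a precision of 0.0 instead of raising a division error.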