### 1.Paper Discussion
#### (a).
SegPhrase:
- Build a candidate phrases set by frequent pattern mining
- Extract phrase features based on concordance and informativeness creteria
- Train a classifier on limited labeled data to predict a quality score
- Partition a sequence of words by maximizing the likelihood and filter out low rectified frequency phrases
- Using rectified frequency, re-compute phrase features and re-train the classfier

AutoPhrase:
- Build a positive pool:mining high-quality phrases from public knowledge bases
- Build a noisy negative pool: obtaining candidate phrases from the given corpus but not matching any high-quality phrases in the positive poll
- Constructing a good ensemble of base classifiers for predicting phrase quality scores in order to reduce the impact of the false negative phrases
- Apply a POS-guided phrasal segmentation algorithm to find the best segmentation for each sentence by utilizing its shallow, language-specific knowledge
- Re-compute statistical features based on the recitified frequency of phrases and re-estimate phrase qualities


#### (b).
SegPhrase is not a competely unsupervised framework. When applying to a new corpus, the high-quality phrases under this new context are subject to change. Therefore, the phrase features extracted in this new corpus are very likely to have a different distribution from the original training corpus, thus requiring a classifier re-training. Human effort is still needed to label some high-quality and low-quality phrases in this new corpus.

#### (c).
Motivation:
- SegPhrase requires a labeled phrase dataset annotated by domain experts, which can be expensive, especially in specialized domains.
- The prior linguistic information hasn't been fully utilized in SegPhrase

Novelty:
- Apply distant supervision to avoid the need of labeled data
- Utilize shallow linguistic knowledge to increase phrasal segmentation accuracy

#### (d).
To save human effort. If polling strategy is not used, then all the phrases generated by every method need to be labeled manually in order to compute precision and recall of each method, which can be very expensive.

#### (e).
One thing I noticed is that both SegPhrase and AutoPhrase follow a frequent pattern mining manner in the candidate phrase generation stage. The minimum support threshold are fixed to be 30 in these two methods, which makes it competely impossible to extract those high-quality but infrequent phrases. However, it's very likey to have some high-quality phrases that only appear a few times because they are only relevant to some certain niche fields.

#### (f).
Increase the recall of candidate phrase generation stage by taking in more infrequent but potentially high-quality phrases. Pretrained POS tagger or sequence labeling model (BIOES) may be utilized to help identify these phrases.

### 2.Phrase Mining experiments

#### (b).
- I noticed that for some phrases, if one of their words change from plural form to singular form, their quality scores will change greatly:
  - robotic tasks(0.87) => robotic task(0.34)
  - machine learning classifiers(0.76) => machine learning classifier(0.41)
  - ppm models(0.57) => ppm model(0.38)
- Treating the singular form and the plural form of one phrase as two different will definitely underestimate the real quality score of this phrase. I think singularizing plural nouns should be performed before tokenization to solve this problem.
<hr>
- Some non-English low-quality phrases are assigned with very high scores, like le_problème(score: 0.919, French, means the_problem) and zur_behandlung(score: 0.832, German, means for_treatment). These phrases contain non-English stop words that were not taken into account when estimating quality scores. I think the stop words set should be updated and includes stop words of other languages.
<hr>
- I also found some phrases with very high quality scores but I can't recognize, such as "cl sr" (0.869) and "vc pm" (0.869). I'm not sure where these phrases come from and what they stand for, and I also don't understant what is the cause.


#### (c). Apply Word2Vec on segmented corpus

(iv)

In [1]:
import re
import numpy as np
from gensim import utils, models
from sklearn.cluster import KMeans

In [7]:
def remove_punc(text):
    punc = '!"#$%&\'()*,-./:;<=>?@[\\]^`{|}~'
    table = str.maketrans(punc, ' '*len(punc))
    return text.translate(table)

def merge_phrase(text):
    start_inds = [m.span() for m in re.finditer('<phrase>', text)]
    end_inds = [m.span() for m in re.finditer('</phrase>', text)]

    res = ''
    for i, start_pos in enumerate(start_inds):
        end_pos = end_inds[i]
        if i == 0:
            res += text[:start_pos[0]]
        
        tmp_phrase = text[start_pos[1]: end_pos[0]]
        phrase = '_'.join(tmp_phrase.split())
        res += phrase
        if i == len(start_inds) - 1:
            res += text[end_pos[1]:]
        else:
            res += text[end_pos[1]: start_inds[i+1][0]]
    return remove_punc(res).strip().lower()

with open('segmentation.txt', 'r') as f:
    segs = f.readlines()

processed = list()
for text in segs:
    text = text.strip()
    if text == '.':
        continue
    processed_text = merge_phrase(text)
    processed.append(processed_text.split())




(v)

In [8]:
dblp_model = models.Word2Vec(processed, size=100, window=5, min_count=3, workers=4)

#### (d). Run KMeans on quality phrases

In [10]:
with open('AutoPhrase_multi-words.txt') as f:
    multi_words = f.readlines()

phrase_vec = list()
phr2ind = dict()
ind2phr = list()
for line in multi_words:
    score = float(line.strip().split()[0])
    phrase = '_'.join(line.strip().split()[1:])
    # filter out low-quality phrases
    if score < 0.5:
        continue
    if phrase in dblp_model.wv:
        phrase_vec.append(dblp_model.wv[phrase])
        ind2phr.append(phrase)

for ind, phrase in enumerate(ind2phr):
    phr2ind[phrase] = ind

In [11]:
phrase_vec = np.array(phrase_vec)
kmeans = KMeans(n_clusters=6, random_state=0).fit(phrase_vec)
phrase_distance = kmeans.transform(phrase_vec)

In [12]:
# return the closest 20 phrases in each cluster
def get_closest_phrases(phrase_distance, ind2phr, center):
    sort_inds = np.argsort(phrase_distance[:, center])
    res = list()
    for ind in sort_inds[:20]:
        res.append(ind2phr[ind])
    return res

In [13]:
res = list()
for i in range(6):
    res.append(get_closest_phrases(phrase_distance, ind2phr, i))

#### An explanation of the result
I found one of the clusters very interesting (first column below). It contains lots of non-English phrases, like système_pour (French, means system_for), un_proceso(Spanish, means a_process), von_modellen(German, means of_models) and so on. They are actually not high-quality phrases and don't belong to one single topic. I think this is because the non-English documents in the DBLP dataset only account for a small proportion, so our Word2Vec model cannot learn a good representation of these non-English phrases based on their context. However, our model can somehow realize that these phrases are not in English thus making their vector representations very close to each other.

In [21]:
from prettytable import PrettyTable

table = PrettyTable()
table.add_column('Other Languages', res[0])
table.add_column('Image Processing and Pattern Recognition', res[1])
table.add_column('Computer Networks', res[2])
print (table)

+----------------------+------------------------------------------+---------------------------------+
|   Other Languages    | Image Processing and Pattern Recognition |        Computer Networks        |
+----------------------+------------------------------------------+---------------------------------+
|       der_emv        |          texture_representation          |     mobile_cellular_networks    |
|     système_pour     |            spectral_unmixing             |   dynamic_resource_management   |
|      un_proceso      |            wavelet_denoising             | optical_burst_switched_networks |
|     le_problème      |             image_clustering             |    cellular_wireless_networks   |
|     von_modellen     |        color_and_texture_features        |    optical_transport_networks   |
|    zur_behandlung    |                kernel_pca                |       capacity_enhancement      |
|     und_aufgaben     |           texture_recognition            |          soft_

In [22]:
table = PrettyTable()
table.add_column('Computer-supported Collaborative Learning', res[3])
table.add_column('Valuation Metrics', res[4])
table.add_column('Mathematics', res[5])
print (table)

+-------------------------------------------+-----------------------+---------------------------------------------+
| Computer-supported Collaborative Learning |   Valuation Metrics   |                 Mathematics                 |
+-------------------------------------------+-----------------------+---------------------------------------------+
|            collaboration_tools            |    throughput_gain    |              rayleigh_quotient              |
|              service_science              |     search_effort     |              circulant_matrices             |
|           collaboration_support           |      query_delay      |               type_inequality               |
|       healthcare_information_systems      |     delivery_rate     |           proximal_point_algorithm          |
|            learning_technology            |  aggregate_throughput |            equivalence_relations            |
|           collaboration_systems           |   bandwidth_overhead  |   