# zhsegment: 

In [2]:
from zhsegment import *
import os
os.chdir("..")

## Run the baseline (only on unigram counts)

In [5]:
Pw = Pdist(data=datafile("data/count_1w.txt"), missingfn=penalize_long_unknown_baseline)


segmenter = Segment(Pw)
output_full = []
with open("data/input/dev.txt") as f: 
    for line in f:
        output = " ".join(segmenter.segment(line.strip()))
        output_full.append(output)


print("\n".join(output_full[:3])) # print out the first three lines of output as a sanity check

中 美 在 沪 签订 高 科技 合作 协议
新华社 上海 八月 三十一日 电 （ 记者 白 国 良 、 夏儒阁 ）
“ 中 美 合作 高 科技 项目 签字 仪式 ” 今天 在 上海 举行 。


## Some details on the baseline

The baseline model follows the pseudo-code. For unknown words, we found that the penalty had to be larger, because the minimum unigram probability was very low and not much greater than 1/N. We therefore used a penalty of:
(1/N) * 1000000**-(len(key)-1)

This caused problems when computing log10 for very long unknown words, presumably because the penalty underflows to zero. Whenever this happened, we replaced the probability with 1e-300, i.e. we computed log10(1e-300).
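The penalty and the log10 workaround described above can be sketched as follows. This is a minimal illustration, not the code in `zhsegment`: the body of `penalize_long_unknown_baseline` is reconstructed from the formula above, and `safe_log10` is a hypothetical helper name introduced only for this example.

```python
import math

def penalize_long_unknown_baseline(key, N):
    # Penalty from the text: (1/N) * 1000000 ** -(len(key) - 1).
    # For very long keys the power underflows to 0.0 in floating point.
    return (1.0 / N) * 1000000 ** -(len(key) - 1)

def safe_log10(p, floor=1e-300):
    # Hypothetical helper: math.log10(0.0) raises ValueError, so a
    # non-positive probability is replaced by the floor value first.
    return math.log10(p if p > 0.0 else floor)
```

With this sketch, a 60-character unknown key gets probability 0.0, and `safe_log10` maps it to roughly -300 instead of raising an error.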

## Checking output for baseline model

In [6]:
import sys  # explicit import for sys.stderr below
from zhsegment_check import fscore
with open('data/reference/dev.out', 'r') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
    tally = fscore(ref_data, output_full)
    print("score: {:.2f}".format(tally), file=sys.stderr)


score: 0.86


## Bigram Model


In [7]:
Pw = Pdist(data=datafile("data/count_1w.txt"), missingfn=penalize_long_unknown) 
P2w = Pdist(data=datafile("data/count_2w.txt"), missingfn=penalize_long_unknown)


segmenter2 = Segment(Pw,P2w)
output_bigram = []
with open("data/input/dev.txt") as f: 
    for line in f:
        output = " ".join(segmenter2.segment2(line.strip()))
        output_bigram.append(output)
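# NOTE: this prints the baseline output (output_full), so the lines shown below match the baseline.
# Use output_bigram[:3] to inspect the bigram segmentation instead.
print("\n".join(output_full[:3]))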

中 美 在 沪 签订 高 科技 合作 协议
新华社 上海 八月 三十一日 电 （ 记者 白 国 良 、 夏儒阁 ）
“ 中 美 合作 高 科技 项目 签字 仪式 ” 今天 在 上海 举行 。


## Checking output for bigram model



In [9]:
from zhsegment_check import fscore
with open('data/reference/dev.out', 'r') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
    tally = fscore(ref_data, output_bigram)
    print("score: {:.2f}".format(tally),file=sys.stderr)


score: 0.93


## Analysis


### Penalty for long unknown words

* We iteratively searched for the best length-based penalty function for unknown words.
* We found that (1/N) * 12500**-(len(key)-1) maximizes our dev score.
* The scores below were obtained after implementing linearly interpolated bigrams.
* Compared to English (hw0), a larger penalty works better. This suggests that the average Chinese word has fewer characters than the average English word. A quick Google search seems to agree with this.

In [1]:
# Candidate penalties; the trailing comment on each line is the resulting dev score.
# def penalize_long_unknown(key, N):
    # return (1/N) * 100**-(len(key)-1) # 0.79
    # return (1/N) * 1000000**-(len(key)-1) # 0.87
    # return (1/N) * 100000**-(len(key)-1) # 0.92
    # return (1/N) * 10000**-(len(key)-1) # 0.9345
    # return (1/N) * 12500**-(len(key)-1) # 0.9349 <-- Selected
    # return (1/N) * 1000**-(len(key)-1) # 0.92
    # return (1/N) * 5000**-(len(key)-1) # 0.93
    # return (1/N) * 7500**-(len(key)-1) # 0.9334
    # return (1/N) * 15000**-(len(key)-1) # 0.9341

#### What ideas worked:

We tuned the lambda value for Jelinek-Mercer (JM) interpolation by iteratively updating it with a step size of 0.05. The dev score was not particularly sensitive to this parameter; the best-scoring value was 0.8.

In [3]:
# Lambda values (the trailing comment on each line is the dev score)
# JM = .7 # 0.9338
# JM = .8 # 0.9349 <-- Selected
# JM = .75 # 0.9337
# JM = .9 # 0.9323
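For reference, linear (JM) interpolation of a bigram estimate with a unigram estimate can be sketched as below. This is an illustration under stated assumptions: the notebook does not show which term lambda weights, so the choice here (lambda on the bigram term) is a guess, and the counts and the names `jm_interpolate` and `cond_prob` are made up for the example.

```python
# Toy counts, invented for illustration only.
unigram_counts = {"高": 3, "科技": 2, "合作": 5}
bigram_counts = {("高", "科技"): 2}
N = sum(unigram_counts.values())

def jm_interpolate(p_bigram, p_unigram, lam=0.8):
    # Assumed form: lam * P(word | prev) + (1 - lam) * P(word).
    return lam * p_bigram + (1.0 - lam) * p_unigram

def cond_prob(prev, word, lam=0.8):
    # Unigram maximum-likelihood estimate.
    p_uni = unigram_counts.get(word, 0) / N
    # Bigram maximum-likelihood estimate; 0.0 when prev was never seen.
    prev_count = unigram_counts.get(prev, 0)
    p_bi = bigram_counts.get((prev, word), 0) / prev_count if prev_count else 0.0
    return jm_interpolate(p_bi, p_uni, lam)
```

An unseen bigram such as ("高", "合作") still receives probability mass from the unigram term, which is the point of the interpolation.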

#### What ideas didn't work:

We also experimented with replacing numerical characters with <NUM> flags in the distribution. The score decreased very slightly, by ~1 point, so we decided not to adopt this technique. The replacement was done with the following experimental code in Pdist; runtime was not a consideration, since we were only testing the functionality and the resulting score. The numerical characters in the data were not standard ASCII digits, which is why they had to be mapped individually: the digits in the text were full-width forms with UTF-8 encodings such as \xEF\xBC\x91 for the number 1, rather than the usual \x31.

In [4]:
# import re
#
# class Pdist(dict):
#     "A probability distribution estimated from counts in datafile."
#     def __init__(self, data=[], N=None, missingfn=None):
#         for key, count in data:
#             # Map every full-width digit to the <NUM> flag before counting.
#             key = re.sub('[１２３４５６７８９０]', '<NUM>', key)
#             self[key] = self.get(key, 0) + int(count)
#         self.N = float(N or sum(self.values()))
#         self.missingfn = missingfn or (lambda k, N: 1./N)
#     def __call__(self, key):
#         # Apply the same mapping at lookup time so queries match the stored keys.
#         key = re.sub('[１２３４５６７８９０]', '<NUM>', key)
#         if key in self: return self[key]/self.N
#         else: return self.missingfn(key, self.N)
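The encoding difference mentioned above can be checked directly in Python. The snippet below is a small sketch; `to_ascii_digits` is a hypothetical helper that maps full-width digits to their ASCII counterparts with a translation table, which is an alternative to the `<NUM>` flag and not something the experiment used.

```python
# Translation table from full-width digits (U+FF10..U+FF19) to ASCII digits.
FULLWIDTH_TO_ASCII = str.maketrans("０１２３４５６７８９", "0123456789")

def to_ascii_digits(text):
    # Hypothetical helper: only the ten full-width digits are changed;
    # all other characters pass through untouched.
    return text.translate(FULLWIDTH_TO_ASCII)

# The full-width digit one is a three-byte sequence in UTF-8,
# while the ASCII digit one is the single byte 0x31.
fullwidth_one_bytes = "１".encode("utf-8")   # b'\xef\xbc\x91'
ascii_one_bytes = "1".encode("utf-8")        # b'1', i.e. b'\x31'
```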