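# Task: Rewrite an article while preserving local context, for example with trigrams
Description: From an article we can extract every trigram and, for each pair of a preceding word and a following word, randomly pick a middle word in proportion to how often it occurred in that context, then use it to replace the original middle word. This is the most basic way to modify an article with trigrams. Once the dictionary data structure is built, we replace words in the original text with some probability (for example 20%) and print both the spun text and the original.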
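
The idea can be sketched on a toy corpus. This is a minimal illustration rather than the notebook's implementation: it splits on whitespace instead of calling `nltk.tokenize.word_tokenize`, and the names `build_middle_word_table` and `sample_middle` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_middle_word_table(sentences):
    """Map (previous word, next word) -> {middle word: probability}."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumption: whitespace tokenization
        for i in range(len(tokens) - 2):
            counts[(tokens[i], tokens[i + 2])][tokens[i + 1]] += 1
    table = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        table[key] = {w: c / total for w, c in counter.items()}
    return table

def sample_middle(distribution, rng=random):
    """Draw one middle word in proportion to its probability."""
    words = list(distribution)
    weights = [distribution[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

toy_corpus = [
    "the speakers work great",
    "the cables work great",
    "the speakers sound great",
]
table = build_middle_word_table(toy_corpus)
print(table[("the", "work")])                 # two candidates, 0.5 each
print(sample_middle(table[("the", "work")]))  # "speakers" or "cables"
```

Here the context `("the", "work")` was seen with two different middle words, so either one may be drawn when that context appears in the text being spun.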

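Extensions: use 5-grams or 7-grams to replace the middle word; use the first two words of a trigram to replace the third; require the replacement to share the original word's part of speech (POS) to improve readability; or even use Word2Vec, GloVe, or an RNN's [redacted]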
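
As a rough sketch of the second extension (using the first two words of a trigram to choose the third), the table can be keyed on `(w1, w2)` instead of `(w1, w3)`. The function names `build_forward_table` and `spin_forward`, the `replace_prob` parameter, and the toy corpus are assumptions made for illustration only.

```python
import random
from collections import Counter, defaultdict

def build_forward_table(token_lists):
    """Map (w1, w2) -> Counter of the words observed right after them."""
    table = defaultdict(Counter)
    for tokens in token_lists:
        for i in range(len(tokens) - 2):
            table[(tokens[i], tokens[i + 1])][tokens[i + 2]] += 1
    return table

def spin_forward(tokens, table, replace_prob=0.2, rng=random):
    """Return a copy of tokens where some third words are resampled."""
    out = list(tokens)
    for i in range(len(out) - 2):
        candidates = table.get((out[i], out[i + 1]))
        if candidates and rng.random() < replace_prob:
            words = list(candidates)
            counts = [candidates[w] for w in words]
            # raw counts work as weights; no normalization is needed
            out[i + 2] = rng.choices(words, weights=counts, k=1)[0]
    return out

corpus = [
    "i use them every day".split(),
    "i use them every week".split(),
]
table = build_forward_table(corpus)
print(spin_forward(corpus[0], table, replace_prob=1.0, rng=random.Random(0)))
```

Because replacements are written into `out` as the loop advances, later lookups see the already replaced words, so the output can drift further from the original than a middle-word spinner would.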

Sample program filename: article_modifier_自動文件修改器.py

Modules: sklearn, random, numpy, nltk, bs4

Input file: ./electronics/positive.review

Grading: plausibility and readability of the spun text

In [None]:
from __future__ import print_function, division
from future.utils import iteritems
from builtins import range

import nltk
import random
import numpy as np
from bs4 import BeautifulSoup

nltk.download(['punkt','wordnet'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [None]:
# load the reviews
# adjust the path if the file lives under ./electronics/
with open('positive.review', encoding='ISO-8859-1') as f:
    positive_reviews = BeautifulSoup(f.read(), "lxml")
positive_reviews = positive_reviews.find_all('review_text')


# extract the trigrams and store them in a dictionary
# (w1, w3) is the key, the list of observed middle words [w2] is the value
trigrams = {}
for review in positive_reviews:
    s = review.text.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        k = (tokens[i], tokens[i+2])
        if k not in trigrams:
            trigrams[k] = []
        trigrams[k].append(tokens[i+1])

# convert each list of middle words into a probability vector
trigram_probabilities = {}
for k, words in iteritems(trigrams):
    # build a word -> count dictionary, then normalize it to probabilities
    if len(set(words)) > 1:
        # only keep contexts with more than one candidate middle word
        d = {}
        n = 0
        for w in words:
            if w not in d:
                d[w] = 0
            d[w] += 1
            n += 1
        for w, c in iteritems(d):
            d[w] = float(c) / n
        trigram_probabilities[k] = d


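def random_sample(d):
    # draw a word from d (word -> probability):
    # return the first word whose cumulative probability exceeds r
    r = random.random()
    cumulative = 0
    for w, p in iteritems(d):
        cumulative += p
        if r < cumulative:
            return w
    # floating-point rounding can leave the total just below 1; fall back to the last word
    return w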


def test_spinner():
    review = random.choice(positive_reviews)
    s = review.text.lower()
    print("Original:", s)
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        if random.random() < 0.2: # 20% chance of replacement
            k = (tokens[i], tokens[i+2])
            if k in trigram_probabilities:
                w = random_sample(trigram_probabilities[k])
                tokens[i+1] = w
    print("Spun:")
    print(" ".join(tokens).replace(" .", ".").replace(" '", "'").replace(" ,", ",").replace("$ ", "$").replace(" !", "!"))

test_spinner()

Original: 
i liked these speakers they are small compact they work great with my mp3 player and you can't really beat the price.  my only complaint is that they don't come with an adapter but i just bought rechargeable batteries and the system work great i use them everyda

Spun:
i liked these speakers they're small compact they work great on the mp3 player and you ca n't really beat the price. my only complaint is because they do n't come with an adapter but i ever bought rechargeable batteries and the smallest work. i use them everyda
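
The spun output above still contains split contractions such as "ca n't", because the chained `.replace(" '", "'")` call does not cover tokens that begin with `n't`. One possible way to rejoin them is a small regex-based helper. The name `detokenize` and the set of handled suffixes are assumptions for this sketch and do not cover every tokenizer edge case.

```python
import re

def detokenize(tokens):
    """Join tokens and undo the spacing that word_tokenize-style splitting adds."""
    text = " ".join(tokens)
    # Re-attach contraction pieces: "ca n't" -> "can't", "they 're" -> "they're".
    text = re.sub(r" (n't|'(?:s|re|ve|ll|d|m))\b", r"\1", text)
    # Remove the space before closing punctuation.
    text = re.sub(r" ([.,!?;:])", r"\1", text)
    # Remove the space after a currency sign.
    text = text.replace("$ ", "$")
    return text

print(detokenize(["you", "ca", "n't", "beat", "the", "price", "."]))
# prints: you can't beat the price.
```

A helper like this could stand in for the chained `.replace(...)` calls in `test_spinner`.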


In [14]:
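# extract five-word windows (5-grams) and store them in a dictionary
# (w1, w5) is the key, the list of observed middle words [w3] is the value
# note: the spinner functions in this cell still look up trigram_probabilities
fgrams = {}
for review in positive_reviews:
    s = review.text.lower()
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 4):
        k = (tokens[i], tokens[i+4])
        if k not in fgrams:
            fgrams[k] = []
        fgrams[k].append(tokens[i+2])  # middle word of the five-word window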

# convert each list of middle words into a probability vector
fgram_probabilities = {}
for k, words in iteritems(fgrams):
    # build a word -> count dictionary, then normalize it to probabilities
    if len(set(words)) > 1:
        # only keep contexts with more than one candidate middle word
        d = {}
        n = 0
        for w in words:
            if w not in d:
                d[w] = 0
            d[w] += 1
            n += 1
        for w, c in iteritems(d):
            d[w] = float(c) / n
        fgram_probabilities[k] = d


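def random_sample(d):
    # draw a word from d (word -> probability):
    # return the first word whose cumulative probability exceeds r
    r = random.random()
    cumulative = 0
    for w, p in iteritems(d):
        cumulative += p
        if r < cumulative:
            return w
    # floating-point rounding can leave the total just below 1; fall back to the last word
    return w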

def test_spinner():
    review = random.choice(positive_reviews)
    s = review.text.lower()
    print("Original:", s)
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        if random.random() < 0.2: # 20% chance of replacement
            k = (tokens[i], tokens[i+2])
            if k in trigram_probabilities:
                w = random_sample(trigram_probabilities[k])
                tokens[i+1] = w
    print("Spun:")
    print(" ".join(tokens).replace(" .", ".").replace(" '", "'").replace(" ,", ",").replace("$ ", "$").replace(" !", "!"))

def test_spinner_no_random():
    review = random.choice(positive_reviews)
    s = review.text.lower()
    print("Original:", s)
    tokens = nltk.tokenize.word_tokenize(s)
    for i in range(len(tokens) - 2):
        k = (tokens[i], tokens[i+2])
        if k in trigram_probabilities:
            w = random_sample(trigram_probabilities[k])
            tokens[i+1] = w
    print("Spun:")
    print(" ".join(tokens).replace(" .", ".").replace(" '", "'").replace(" ,", ",").replace("$ ", "$").replace(" !", "!"))

print("===========spinner with probabilistic replacement==============")
test_spinner()
print("\n==================spinner without probabilistic replacement=================")
test_spinner_no_random()

Original: 
these speakers are amazing, i am truly shocked to receive something so sweet at an unbelievable price, amazon thank you for the hook up, my speaker system rocks.  my delivery experience with amazon was great.  can't thank you enough, keep up the great service, truly

Spun:
these two are amazing, i am truly shocked to receive something so sweet at an unbelievable price, amazon thank you for the hook up, my speaker system rocks. my delivery experience with amazon was great. ca n't have you enough, keep up the great deal, truly

Original: 
great quality, excellent construction and strong rj45 plugs.  i have worked with a decent share of cat5 and i have never had to cut  and terminate a belkin cable due to regular wear and tear

Spun:
great set, excellent with and strong rj45 plugs. you have not solid a decent sound of your satellite i have ever had to carry and even a voice cable but to regular wear and tear
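
Both runs above were produced from `trigram_probabilities`. A spinner that uses the wider five-word context could look roughly like the sketch below. It assumes a table shaped like `fgram_probabilities`, that is `(first word, fifth word) -> {middle word: probability}`. The function name `spin_with_window`, its parameters, and the toy table are hypothetical.

```python
import random

def spin_with_window(tokens, probabilities, span=4, replace_prob=0.2, rng=random):
    """Resample the center word of each window whose outer words form a known key.

    probabilities maps (first word, last word) -> {center word: probability}.
    span is the distance between the two outer words (4 for a 5-gram).
    """
    out = list(tokens)
    mid = span // 2
    for i in range(len(out) - span):
        key = (out[i], out[i + span])
        dist = probabilities.get(key)
        if dist and rng.random() < replace_prob:
            words = list(dist)
            weights = [dist[w] for w in words]
            out[i + mid] = rng.choices(words, weights=weights, k=1)[0]
    return out

# Hypothetical toy table standing in for fgram_probabilities.
toy_probabilities = {("my", "rocks"): {"speaker": 0.5, "stereo": 0.5}}
tokens = "my new speaker system rocks".split()
print(spin_with_window(tokens, toy_probabilities, replace_prob=1.0))
```

Keying only on the two outer words keeps the table dense but ignores the second and fourth words. A stricter key such as `(w1, w2, w4, w5)` would likely give more plausible replacements at the cost of far fewer matches.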
