## 0. Introduction
   * **Structure:**
       + Data Loading
       + Balanced Corpus
       + Text Processing with Modules
           - Jieba
           - pkuseg
           - THULAC
       
       
   * **Data Source:**
       A Food Delivery Comments Dataset from an Anonymous Author.

## 1. Data Loading

In [1]:
import pandas as pd
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [2]:
data = pd.read_csv('data/Food_Delivery.csv')

print('Overall size：%d' % data.shape[0])
print('Positive comments：%d' % data[data.label==1].shape[0])
print('Negative comments：%d' % data[data.label==0].shape[0])

Overall size：11987
Positive comments：4000
Negative comments：7987


In [3]:
data.sample(5)

Unnamed: 0,label,review
6346,0,"薯条是软的……不会再买了,拌饭忘放酱了，打电话说给送也没送……自己去超市买了老干妈……,哎"
1503,1,非常美味，下次还会再来！
6901,0,"菜码太小了吧，能不能传图呀,32元的菜，半盒洋葱，肉呢！是店里顾客吃剩下的吧！"
78,1,最近一直订这家的外卖，相当不错，够味儿，快递小哥也是棒棒的！
10661,0,这次虾饺里面居然全是淀粉都没有虾！


## 2. Balanced Corpus

In [4]:
positive = data[data.label==1]
negative = data[data.label==0]

def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):
    sample_size = corpus_size // 2
    pd_corpus_balance = pd.concat([corpus_pos.sample(sample_size, replace=corpus_pos.shape[0]<sample_size), \
                                   corpus_neg.sample(sample_size, replace=corpus_neg.shape[0]<sample_size)])
    
    print('Overall size：%d' % pd_corpus_balance.shape[0])
    print('Positive comments：%d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])
    print('Negative comments：%d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])    
    
    return pd_corpus_balance

In [5]:
df = get_balance_corpus(10000, positive, negative)

Overall size：10000
Positive comments：5000
Negative comments：5000
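
The call above asks for 5,000 reviews per class, but the dataset holds only 4,000 positive reviews, so `get_balance_corpus` draws the positive half with replacement (oversampling) and the negative half without. A minimal pure-Python sketch of that rule is shown below; `sample_class` and the toy lists are hypothetical names used only for illustration and are not part of the notebook.

```python
import random

def sample_class(items, sample_size, rng):
    """Draw sample_size items, with replacement only if the class is too small."""
    if len(items) < sample_size:
        # Oversampling: the same item may be drawn more than once
        return rng.choices(items, k=sample_size)
    # Enough items available: draw distinct ones
    return rng.sample(items, sample_size)

rng = random.Random(0)
toy_pos = ["pos_%d" % i for i in range(4)]   # minority class (toy stand-in)
toy_neg = ["neg_%d" % i for i in range(8)]   # majority class (toy stand-in)

balanced_pos = sample_class(toy_pos, 5, rng)
balanced_neg = sample_class(toy_neg, 5, rng)

print(len(balanced_pos), len(set(balanced_pos)))  # 5 drawn, at most 4 distinct
print(len(balanced_neg), len(set(balanced_neg)))  # 5 drawn, all distinct
```

The positive half of the balanced corpus therefore necessarily contains repeated reviews.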


In [6]:
df.sample(5)

Unnamed: 0,label,review
8741,0,打死卖糖的油好大又甜不唧唧的难吃死
2532,1,快递小哥服务好，速度快
11462,0,这家的送餐超慢，没有很好的耐性就不要定他家了
8923,0,配送很快，味道一般
121,1,骑士很好，赞一个，很礼貌，辛苦了！


--------------------------------------------------------

## 3. Text Processing with Modules

### **1. jieba**

In [7]:
# pip install jieba --upgrade
import jieba

In [8]:
# jieba.load_userdict(r"dictionary.txt") # add a customized dictionary if needed

df['review_seg'] = df.apply(
    lambda row: jieba.lcut(row['review']), 
    axis = 1
    )

#-------------------------------------add a customized stop words list if needed----------------------------#
file_name = 'stop words/hit_stopwords.txt'

# Read one stop word per line, skipping blank lines.
# A set gives fast membership tests, and the file is closed automatically.
with open(file_name, "r", encoding='utf-8') as stop_f:
    stop_words = {line.strip() for line in stop_f if line.strip()}

def filter_key_word(seg_list):
    # Keep only the tokens that are not stop words, preserving their order
    return [word for word in seg_list if word not in stop_words]

df['review_seg_key'] = df.apply(
    lambda row: filter_key_word(row['review_seg']), 
    axis = 1
    ) 

df.head(5)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\S0048300\AppData\Local\Temp\jieba.cache
Loading model cost 0.782 seconds.
Prefix dict has been built successfully.


Unnamed: 0,label,review,review_seg,review_seg_key
109,1,态度非常好，速度快，味道好极了!,"[态度, 非常, 好, ，, 速度, 快, ，, 味道, 好极了, !]","[态度, 非常, 好, 速度, 快, 味道, 好极了, !]"
592,1,态度好，菜味佳,"[态度, 好, ，, 菜味佳]","[态度, 好, 菜味佳]"
1023,1,有点油腻～其他很好,"[有点, 油腻, ～, 其他, 很, 好]","[有点, 油腻, 很, 好]"
3158,1,黄太极里的巨无霸。。,"[黄, 太极, 里, 的, 巨无霸, 。, 。]","[黄, 太极, 里, 巨无霸]"
3308,1,第一次尝试，味道还不错,"[第一次, 尝试, ，, 味道, 还, 不错]","[第一次, 尝试, 味道, 还, 不错]"
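
To make the filtering step concrete, here is a small self-contained sketch applied to one of the segmented rows shown above. The three-entry `toy_stop_words` set is a hypothetical stand-in for `hit_stopwords.txt`, and `remove_stop_words` is an illustrative name, not a function from the notebook.

```python
# Hypothetical mini stop word set standing in for hit_stopwords.txt
toy_stop_words = {"，", "。", "的"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]

# Segmented form of the review "黄太极里的巨无霸。。" from the table above
segmented = ["黄", "太极", "里", "的", "巨无霸", "。", "。"]

print(remove_stop_words(segmented, toy_stop_words))
# ['黄', '太极', '里', '巨无霸']
```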


In [9]:
sentences = df['review_seg_key'].values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

VOCAB_SIZE = len(tokenizer.word_index) + 1

print('VOCAB_SIZE: ' + str(VOCAB_SIZE))

VOCAB_SIZE: 8315
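
The `+ 1` in `VOCAB_SIZE` is there because the Keras `Tokenizer` numbers words from 1 and never assigns index 0, which is reserved (typically for padding). The sketch below imitates that numbering with the standard library only; it is a simplified stand-in rather than Keras's implementation, and `build_word_index` is a hypothetical helper.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each distinct token to an integer starting at 1, most frequent first."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Toy pre-segmented sentences (illustrative data only)
toy_sentences = [["态度", "好", "菜味佳"], ["态度", "非常", "好"]]

word_index = build_word_index(toy_sentences)
vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned to a word

print(word_index["态度"], vocab_size)
# 1 5
```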


### **2. pkuseg**

In [10]:
# pip install pkuseg --upgrade
import pkuseg

In [11]:
df1 = data.copy()  # note: this is the full dataset, not the balanced `df` from Section 2

# file_name = 'dictionary.txt' # add a customized dictionary if needed
# stop words can also be handled through this dictionary if needed:
# list each stop word in the first column and a tag in the second column, e.g. 'trash_words'

seg = pkuseg.pkuseg(
    model_name = 'default',
    #user_dict = file_name,
    postag = True)

df1['review_seg'] = df1.apply(
    lambda row: seg.cut(row['review']), 
    axis = 1
    )

def filter_key_word(word_ls):
    # Drop tokens tagged 'w' (punctuation) or 'm' (numeral);
    # add 'trash_words' to this list if the custom tag is used
    excluded_tags = ['w', 'm']
    return [item[0] for item in word_ls if item[1] not in excluded_tags]

df1['review_seg_key'] = df1.apply(
    lambda row: filter_key_word(row['review_seg']), 
    axis = 1
    )

df1.head(5)

Unnamed: 0,label,review,review_seg,review_seg_key
0,1,很快，好吃，味道足，量大,"[(很快, d), (，, w), (好吃, a), (，, w), (味道, n), (足...","[很快, 好吃, 味道, 足, 量, 大]"
1,1,没有送水没有送水没有送水,"[(没有, d), (送水, v), (没有, v), (送水, n), (没有, d), ...","[没有, 送水, 没有, 送水, 没有, 送水]"
2,1,非常快，态度好。,"[(非常, d), (快, a), (，, w), (态度, n), (好, a), (。,...","[非常, 快, 态度, 好]"
3,1,方便，快捷，味道可口，快递给力,"[(方便, a), (，, w), (快捷, z), (，, w), (味道, n), (可...","[方便, 快捷, 味道, 可口, 快递, 给力]"
4,1,菜味道很棒！送餐很及时！,"[(菜, n), (味道, n), (很, d), (棒, a), (！, w), (送, ...","[菜, 味道, 很, 棒, 送, 餐, 很, 及时]"
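
The tag-based filter can be tried in isolation on the first few `(word, tag)` pairs from the table above. This is a minimal sketch: `keep_content_words` and `EXCLUDED_TAGS` are hypothetical names, and the input is typed in by hand so the snippet runs without pkuseg installed.

```python
EXCLUDED_TAGS = {"w", "m"}  # 'w' = punctuation, 'm' = numeral

def keep_content_words(tagged_tokens, excluded_tags=EXCLUDED_TAGS):
    """Keep the word of every (word, tag) pair whose tag is not excluded."""
    return [item[0] for item in tagged_tokens if item[1] not in excluded_tags]

# First five tagged tokens of "很快，好吃，味道足，量大" as shown in the table
tagged = [("很快", "d"), ("，", "w"), ("好吃", "a"), ("，", "w"), ("味道", "n")]

print(keep_content_words(tagged))
# ['很快', '好吃', '味道']
```

Because the function only indexes positions 0 and 1, it accepts two-element lists as well as tuples.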


In [12]:
sentences = df1['review_seg_key'].values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

VOCAB_SIZE = len(tokenizer.word_index) + 1

print('VOCAB_SIZE: ' + str(VOCAB_SIZE))

VOCAB_SIZE: 10118


### **3. thulac**

In [13]:
# pip install thulac --upgrade
import thulac

In [14]:
df2 = data.copy()  # note: this is the full dataset, not the balanced `df` from Section 2

thu1 = thulac.thulac()

df2['review_seg'] = df2.apply(
    lambda row: thu1.cut(row['review']), 
    axis = 1
    )

def filter_key_word(word_ls):
    # Drop tokens tagged 'w' (punctuation) or 'm' (numeral);
    # add 'trash_words' to this list if the custom tag is used
    excluded_tags = ['w', 'm']
    return [item[0] for item in word_ls if item[1] not in excluded_tags]

df2['review_seg_key'] = df2.apply(
    lambda row: filter_key_word(row['review_seg']), 
    axis = 1
    )

df2.head(5)

Model loaded succeed


Unnamed: 0,label,review,review_seg,review_seg_key
0,1,很快，好吃，味道足，量大,"[[很快, d], [，, w], [好吃, a], [，, w], [味道, n], [足...","[很快, 好吃, 味道, 足, 量, 大]"
1,1,没有送水没有送水没有送水,"[[没有, d], [送, v], [水, n], [没有, d], [送, v], [水,...","[没有, 送, 水, 没有, 送, 水, 没有, 送, 水]"
2,1,非常快，态度好。,"[[非常, d], [快, a], [，, w], [态度, n], [好, a], [。,...","[非常, 快, 态度, 好]"
3,1,方便，快捷，味道可口，快递给力,"[[方便, a], [，, w], [快捷, a], [，, w], [味道, n], [可...","[方便, 快捷, 味道, 可口, 快递, 给, 力]"
4,1,菜味道很棒！送餐很及时！,"[[菜味道, n], [很, d], [棒, a], [！, w], [送, v], [餐,...","[菜味道, 很, 棒, 送, 餐, 很, 及时]"


In [15]:
sentences = df2['review_seg_key'].values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

VOCAB_SIZE = len(tokenizer.word_index) + 1

print('VOCAB_SIZE: ' + str(VOCAB_SIZE))

VOCAB_SIZE: 9049
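
The three `VOCAB_SIZE` values are not computed on the same rows: the jieba run uses the balanced `df`, while the pkuseg and THULAC runs use `data.copy()`. One possible way to compare segmenters like for like is to pass the same list of reviews to each of them, as sketched below. `vocab_sizes` is a hypothetical helper, and the two toy segmenters are stand-ins; in the notebook one would pass callables that return plain token lists, such as `jieba.lcut`.

```python
def vocab_sizes(reviews, segmenters):
    """Return {name: number of distinct tokens + 1} for each segmenter."""
    sizes = {}
    for name, segment in segmenters.items():
        vocab = {tok for review in reviews for tok in segment(review)}
        sizes[name] = len(vocab) + 1  # same +1 convention as VOCAB_SIZE
    return sizes

# Toy corpus, pre-split by spaces so the example needs no segmentation library
toy_reviews = ["味道 不错", "味道 一般"]

toy_segmenters = {
    "by_space": str.split,  # stand-in for a word-level segmenter
    "by_char": lambda s: [c for c in s if not c.isspace()],  # one token per character
}

print(vocab_sizes(toy_reviews, toy_segmenters))
# {'by_space': 4, 'by_char': 7}
```

Counts from the Keras `Tokenizer` may differ slightly from a plain set count, since it normalizes tokens (for example, it lowercases by default).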
