## 0. Introduction
   * Structure:
       + Data Loading
       + Balanced Corpus
       + Text Processing with Modules
           - Jieba
           - pkuseg
           - THULAC
       
       
   * Data Source:
       A food delivery comments dataset from an anonymous author.

## 1. Data Loading

In [10]:
import pandas as pd
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [2]:
data = pd.read_csv('data/Food_Delivery.csv')

print('Overall size：%d' % data.shape[0])
print('Positive comments：%d' % data[data.label==1].shape[0])
print('Negative comments：%d' % data[data.label==0].shape[0])

Overall size：11987
Positive comments：4000
Negative comments：7987


In [3]:
data.sample(5)

Unnamed: 0,label,review
11732,0,送餐员态度很好，唯一不足是送餐时间较长
4941,0,因为楼层很多所以让人去校门口自取，好懒……
4505,0,"肘子卷太油腻,可能个人不太适应这个肥肉,味道真贵还好"
7052,0,送餐的是个SB，吃的全都撒出来了，而且这人只送到电梯，一步都不多走，还得自己走一段去拿。饺子全凉的
9870,0,超级差，都快两个小时了没送来，我定的是晚餐吗？？电话也打不通，定他家的饭能等出胃病来！已经打...


## 2. Balanced Corpus

In [4]:
positive = data[data.label==1]
negative = data[data.label==0]

def get_balance_corpus(corpus_size, corpus_pos, corpus_neg):
    sample_size = corpus_size // 2
    # Sample with replacement only when a class has fewer rows than sample_size.
    pd_corpus_balance = pd.concat([
        corpus_pos.sample(sample_size, replace=corpus_pos.shape[0] < sample_size),
        corpus_neg.sample(sample_size, replace=corpus_neg.shape[0] < sample_size),
    ])
    
    print('Overall size：%d' % pd_corpus_balance.shape[0])
    print('Positive comments：%d' % pd_corpus_balance[pd_corpus_balance.label==1].shape[0])
    print('Negative comments：%d' % pd_corpus_balance[pd_corpus_balance.label==0].shape[0])    
    
    return pd_corpus_balance

In [5]:
df = get_balance_corpus(10000, positive, negative)

Overall size：10000
Positive comments：5000
Negative comments：5000
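
The output above shows 5000 positive comments although the raw data holds only 4000, so the positive class must have been drawn with replacement. The following is a minimal, pandas-free sketch of that sampling rule. The helper `balanced_sample` and the integer IDs are hypothetical stand-ins for illustration and are not part of the notebook.

```python
import random

def balanced_sample(pos, neg, corpus_size, seed=0):
    """Draw corpus_size // 2 items per class, oversampling only a class that is too small."""
    rng = random.Random(seed)
    half = corpus_size // 2

    def draw(items):
        if len(items) < half:
            # Too few items: sample with replacement, so duplicates appear.
            return rng.choices(items, k=half)
        # Enough items: sample without replacement, so every item is unique.
        return rng.sample(items, half)

    return draw(pos), draw(neg)

# Hypothetical IDs standing in for the 4000 positive and 7987 negative reviews.
pos_ids = list(range(4000))
neg_ids = list(range(4000, 11987))

pos_draw, neg_draw = balanced_sample(pos_ids, neg_ids, 10000)
print(len(pos_draw), len(set(pos_draw)))  # 5000 drawn, at most 4000 distinct
print(len(neg_draw), len(set(neg_draw)))  # 5000 drawn, all distinct
```

One consequence worth keeping in mind: the balanced corpus contains repeated positive reviews, so a later train/test split may place copies of the same review on both sides.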


In [6]:
df.sample(5)

Unnamed: 0,label,review
9100,0,订了一份牛肉水饺，不知道是送错了，还是什么原因，一点牛肉味都没有，跟猪肉大葱一样味，26元，...
9835,0,芝士没有看到芝士……实在太慢了，不打电话一直催，根本不取餐，一个半小时都过去了。我等了这么久...
131,1,"大风天,不容易,非常好"
1381,1,一如既往，好！非常满意！谢谢商家哦！
7063,0,"不知道为什么晚了三十分钟,后定的东西都到了和合谷还没到,已经饿到胃疼了,菜还可以,到的时候也..."


--------------------------------------------------------

## 3. Text Processing with Modules

### **1. jieba**

In [7]:
# pip install jieba --upgrade
import jieba

In [8]:
# jieba.load_userdict(r"dictionary.txt") # add a customized dictionary if needed

df['review_seg'] = df.apply(
    lambda row: jieba.lcut(row['review']), 
    axis = 1
    )

#-------------------------------------add a customized stop words list if needed----------------------------#
file_name = 'stop words/hit_stopwords.txt'

# Read one stop word per line, skipping blank lines; the file is closed automatically.
with open(file_name, "r", encoding='utf-8') as stop_f:
    stop_words = [line.strip() for line in stop_f if line.strip()]

def filter_key_word(seg_list):
    # Keep only the tokens that are not stop words.
    return [word for word in seg_list if word not in stop_words]

df['review_seg_key'] = df.apply(
    lambda row: filter_key_word(row['review_seg']), 
    axis = 1
    ) 

df.head(5)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\S0048300\AppData\Local\Temp\jieba.cache
Loading model cost 0.766 seconds.
Prefix dict has been built successfully.


Unnamed: 0,label,review,review_seg,review_seg_key
1536,1,大份的超级大，一个人吃不了。。,"[大份, 的, 超级, 大, ，, 一个, 人, 吃, 不了, 。, 。]","[大份, 超级, 大, 人, 吃, 不了]"
2626,1,非常的正宗，就是羊杂汤太咸啦,"[非常, 的, 正宗, ，, 就是, 羊杂, 汤太咸, 啦]","[非常, 正宗, 羊杂, 汤太咸]"
1530,1,好吃，棒棒的,"[好吃, ，, 棒棒, 的]","[好吃, 棒棒]"
1824,1,价格合适，味道鲜美。我们公司每天订他们家饭,"[价格, 合适, ，, 味道鲜美, 。, 我们, 公司, 每天, 订, 他们, 家饭]","[价格, 合适, 味道鲜美, 公司, 每天, 订, 家饭]"
1580,1,"好吃,送的很快","[好吃, ,, 送, 的, 很快]","[好吃, 送, 很快]"
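
The filtering step can be reproduced on the first row of the table without jieba or the stop-word file. This is a sketch under stated assumptions: the four entries in `sample_stop_lines` are a hypothetical excerpt chosen to match the output shown, not the actual contents of `hit_stopwords.txt`, and the stop words are stored in a `set` (a swap from the list used in the cell above) so that each membership test takes constant time on average.

```python
def load_stop_words(lines):
    """Build a set of stop words from an iterable of lines, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}

def filter_tokens(seg_list, stop_words):
    """Return the tokens of seg_list that are not stop words, in order."""
    return [word for word in seg_list if word not in stop_words]

# Hypothetical excerpt of a stop-word file (assumption, for illustration only).
sample_stop_lines = ["的\n", "，\n", "\n", "一个\n", "。\n"]
sample_stop_words = load_stop_words(sample_stop_lines)

# Segmentation of the first review in the table above, as produced by jieba.lcut.
tokens = ["大份", "的", "超级", "大", "，", "一个", "人", "吃", "不了", "。", "。"]

filtered = filter_tokens(tokens, sample_stop_words)
print(filtered)
```

With a stop-word list of several hundred entries and about 10000 reviews, a set avoids scanning the whole list for every token.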


In [12]:
sentences = df['review_seg_key'].values

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

VOCAB_SIZE = len(tokenizer.word_index) + 1

print('VOCAB_SIZE: ' + str(VOCAB_SIZE))

VOCAB_SIZE: 8503
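
`VOCAB_SIZE` adds 1 to `len(tokenizer.word_index)` because the Keras `Tokenizer` numbers words from 1 and leaves index 0 free, which is conventionally used for padding. The sketch below imitates that indexing with the standard library only. `build_word_index` is a hypothetical helper and an approximation: it ranks words by descending frequency as the Keras tokenizer does, but skips the lowercasing and character filtering that the real class also applies.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to a 1-based rank, with the most frequent word first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # most_common() sorts by descending count; index 0 is deliberately unused.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

# Two filtered reviews taken from the table above.
toy_sentences = [["好吃", "棒棒"], ["好吃", "送", "很快"]]

word_index = build_word_index(toy_sentences)
vocab_size = len(word_index) + 1  # +1 for the reserved index 0

print(word_index["好吃"])  # 1, the most frequent token
print(vocab_size)          # 5: four distinct tokens plus the reserved index
```

An embedding layer sized with this value has one row per word plus one row for the reserved index.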


### **2. pkuseg**