This project uses NLP techniques to separate reviews written by the 'water army' (paid posters) from ordinary reviews. It needs the following packages.

In [34]:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import jieba.posseg
import jieba
import math
from collections import defaultdict, Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
jieba.enable_parallel(8)

The first step is to load the review data and the user data. For the convenience of the following steps, we use 
Jieba to cut each user's self-intro into words and store the result back in the same dataframe column.

In [11]:
reviews = pd.read_csv('/Users/mchen/Documents/18FallCourses/Independent Study/Codes/Data/ReviewsData_update.csv')
Users = pd.read_csv('/Users/mchen/Documents/18FallCourses/Independent Study/Codes/Data/userdata_update.csv')
ReviewsTrain = reviews.iloc[:int(reviews.shape[0]*0.8),:]
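Users = Users[Users['intro'].notna()].copy() # drop users without a self-intro; copy so the next assignment is safe
Users['intro'] = Users['intro'].apply(lambda x: list(jieba.cut(x))) # cut each intro into a list of words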
Users.head() # show the first 5 rows of dataframe

       user_id  status   user_name  \
0       tjz230     1.0          影志   
1      7542909     1.0      翻滚吧！蛋堡   
2  lingrui1995     1.0          凌睿   
3      metiche     1.0  杨欢喜Metiche   
4       kar200     1.0           觉   

                                               intro  followers  follow  \
0  [豆瓣, 商务, 合作,  , 请, 微信, 联系,  , YOGA,  , YANG,  ...   101317.0   789.0   
1  [撕扯, 吧,  , 趁, 青春, 未, 退役, 前,  , 谁, 在, 哭,  , 他, ...     4706.0   378.0   
2  [我, 在, 此, 发毒, 誓,  , 我, 在, 豆瓣, 上, 从, 没收, 钱, 写, ...    23759.0    74.0   
3  [新浪, 微博,  ,  , 杨, 欢喜, 的, 杨, 任性, 更新, 的, 公众, 号, ...     2196.0  1662.0   
4                                        [Squeak,  ]      646.0   192.0   

   watched_movies  
0          4039.0  
1          2346.0  
2          2592.0  
3          1805.0  
4          2562.0  


To determine whether a review comes from an ordinary person, we first distinguish 'water army' accounts from 
ordinary users. A user is treated as 'water army' if either of the following criteria holds:
1. Their self-intro contains terms such as '合作', '专栏', '淘宝', etc.
2. They have an unusually large number of followers or watched movies, i.e. more than 3000 followers or more than 2000 watched movies.

The result is saved in a dictionary whose key is the user name and whose value is 0 or 1: 0 means 'water army', 1 means 'ordinary user'.

In [16]:
def DetermineUser(Users):
    tags = ['合作', '专栏', '淘宝', '影评人', '公众', '号', '公众号','电邮','商务','微信','豆瓣','影评','短评','账号','营销','字幕']
    UserDict = {}
    for index, row in Users.iterrows():
        if any(x in row['intro'] for x in tags) or row['followers']> 3000 or row['watched_movies']> 2000:
            UserDict[row['user_name']] = 0
        else:
            UserDict[row['user_name']] = 1
    return UserDict

In [15]:
UserDict = DetermineUser(Users)
list(UserDict.items())[0:5] # show the first 5 items in dictionary

[('影志', 0), ('翻滚吧！蛋堡', 0), ('凌睿', 0), ('杨欢喜Metiche', 0), ('觉', 0)]

The reviews written by the 'water army' are what we care about, so labelling reviews is a prerequisite for 
building the training data. Reviews whose author has no label are of little help in this scenario, so we 
keep only the reviews that can be matched to a labelled user; each review inherits its author's label.

In [32]:
def DetermineReview(ReviewsTrain, UserDict):
    ReviewsLabel = {}
    for index, row in ReviewsTrain.iterrows():
        try:
            ReviewsLabel[row['cid']] = UserDict[row['user_name']]
        except KeyError:
            pass
    return ReviewsLabel

In [30]:
ReviewsLabel = DetermineReview(ReviewsTrain, UserDict)
list(ReviewsLabel.items())[0:5] # show the first 5 items in dictionary

[(1348293405, 0),
 (1345018216, 0),
 (1348402447, 0),
 (1346991805, 0),
 (1349303778, 0)]
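
The matching step above can also be written without an explicit row loop. The following is a minimal sketch on hypothetical toy data (the names `toy_reviews`, `toy_user_dict` and `toy_review_labels` are made up for illustration; only the column names 'cid' and 'user_name' come from the notebook). It relies on `Series.map` returning NaN for keys that are missing from the dictionary.

```python
import pandas as pd

# Hypothetical toy data; column names follow the notebook ('cid', 'user_name').
toy_reviews = pd.DataFrame({
    'cid': [101, 102, 103],
    'user_name': ['alice', 'bob', 'carol'],
})
toy_user_dict = {'alice': 0, 'bob': 1}  # 'carol' has no label

# Series.map yields NaN for user names missing from the dictionary,
# so dropna() discards reviews whose author is unlabelled.
labels = (
    toy_reviews.set_index('cid')['user_name']
    .map(toy_user_dict)
    .dropna()
    .astype(int)
)
toy_review_labels = labels.to_dict()
print(toy_review_labels)
```

Review 103 is dropped because its author is not in the dictionary, which mirrors the `except KeyError: pass` branch of DetermineReview.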

For the following steps, we also need to cut the sentences in the review dataset into words, and we use 
each word's part of speech to filter out words that carry little meaning.

In [43]:
ReviewsTrain = ReviewsTrain[ReviewsTrain['cid'].isin(list(ReviewsLabel.keys()))].copy() # keep labelled reviews only; copy so the column assignment below is safe
pos = ['n', 'nz', 'v', 'vd', 'vn', 'l', 'a', 'd', 'nrt', 'r', 'nr']
ReviewsTrain['content'] = ReviewsTrain['content'].apply(lambda x: [y for y,f in jieba.posseg.cut(x) if f in pos])
Articles = dict(zip(ReviewsTrain['cid'],ReviewsTrain['content']))

After labelling the review data, we translate words into numbers. Here we focus on which words, when they 
occur, indicate that a review was written by the 'water army'. We therefore simply use tf-idf to weight 
the terms: a term's count in a review is multiplied by the log of the inverse fraction of reviews containing it.
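
As a small worked example of this weighting, the sketch below computes the same quantities on three hypothetical tokenised reviews. The names `toy_articles`, `toy_idf` and `toy_tfidf` are invented for illustration. The formula follows the notebook's cells: IDF is log(N / (df + 1)) and the term frequency is the raw count.

```python
import math
from collections import Counter

# Hypothetical tokenised reviews, keyed by review id.
toy_articles = {
    1: ['电影', '好看', '电影'],
    2: ['电影', '无聊'],
    3: ['好看', '推荐'],
}

n_docs = len(toy_articles)

# Document frequency: number of reviews containing each term at least once.
doc_freq = Counter(
    term for tokens in toy_articles.values() for term in set(tokens)
)

# IDF with add-one smoothing in the denominator: log(N / (df + 1)).
toy_idf = {term: math.log(n_docs / (df + 1)) for term, df in doc_freq.items()}

# TF-IDF with the raw count as the term frequency.
toy_tfidf = {
    cid: {term: count * toy_idf[term] for term, count in Counter(tokens).items()}
    for cid, tokens in toy_articles.items()
}
```

With this smoothing, a term that occurs in N - 1 of the N reviews gets an IDF of exactly 0 (here '电影', which appears in 2 of 3 reviews), and a term that occurs in every review gets a negative IDF.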

In [36]:
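def IDF(Articles):
    # Vocabulary: every distinct term that appears in any review
    Terms = set([y for key, value in Articles.items() for y in value])
    N = len(Articles)
    # Convert each review to a set once so membership tests are fast
    ArticleSets = [set(article) for article in Articles.values()]
    idf = {}
    for term in Terms:
        # Document frequency: number of reviews that contain the term
        df = sum(term in article for article in ArticleSets)
        # Inverse document frequency with add-one smoothing in the denominator
        idf[term] = np.log(N / (df + 1))
    return idf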

In [38]:
idf = IDF(Articles)
list(idf.items())[:3]

[('太爽', 7.0066952268370404),
 ('缓慢', 6.6012301187288767),
 ('划上', 7.0066952268370404)]

In [40]:
def TFIDF(IDF, Articles):
    Frequency = {}
    for cid, article in Articles.items():
        Frequency[cid] = Counter(article)

    TfIdf = {}
    for cid, freqs in Frequency.items():
        tfidf = {}
        for key, freq in freqs.items():
            tfidf[key] = IDF[key] * freq
        TfIdf[cid] = tfidf
    return TfIdf

In [42]:
TfIdf = TFIDF(idf, Articles)
list(TfIdf.items())[0]

(1348293405,
 {'斯皮尔伯格': 3.525455137501349,
  '他': 7.727635283981181,
  '电影': 3.2444003280959031,
  '梦想': 5.3019471345986151,
  '热心': 7.0066952268370404,
  '爱': 3.3431335807073941,
  '情怀': 3.9621727891136174,
  '浓缩': 6.0904044949628853,
  '到': 2.3016797058792333,
  '这部': 3.0651134191673504,
  '片子': 3.5101876653705602,
  '极具': 6.313548046277095,
  '经典电影': 6.6012301187288767,
  '角色': 3.2000327370667208,
  '又': 2.5640439703467242,
  '并茂': 7.0066952268370404,
  '高科技': 5.7539322583416723,
  '游戏': 5.5597229631377223,
  '闯关': 6.0904044949628853,
  '拿手': 6.6012301187288767,
  '专注': 6.6012301187288767,
  '想': 2.8246450841958342,
  '诉说': 6.6012301187288767,
  '都': 1.5013636909046779,
  '这里': 4.6553199696735632,
  '影迷': 4.2658552029118395,
  '情': 7.0066952268370404,
  '倾盆': 7.0066952268370404,
  '呈现': 4.7554034282305455,
  '谢谢': 9.8545073703144102,
  '你': 5.0249132031124617,
  '玩': 3.4513471653476269,
  '我': 3.0272675669929843,
  '推向': 5.9080829381689313,
  '高潮': 4.2986450257348308,
  '就': 1.91601

The next step is to reduce the dimension of the tf-idf matrix above with SVD, a technique known as 
latent semantic analysis (LSA).

In [None]:
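def LSA(TfIdf):
    # Rows are terms, columns are review ids; terms absent from a review get weight 0
    OccurenceMatrix = pd.DataFrame(TfIdf).fillna(0)
    # Thin SVD: matrix = U @ diag(S) @ Vt
    U, S, Vt = np.linalg.svd(OccurenceMatrix.values, full_matrices=False)
    k = S.size  # keeps every singular value; pick a smaller k to actually reduce the dimension
    Uk = U[:, :k]
    Sk = np.diag(S[:k])
    # Maps a term vector into the latent space (needs all kept singular values to be nonzero)
    translate = np.linalg.inv(Sk) @ np.transpose(Uk)

    return Vt[:k, :].T, list(OccurenceMatrix.columns), translate, Sk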
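
To illustrate what the truncation looks like when k is smaller than the number of singular values, here is a minimal sketch on a hypothetical 4-term by 3-review matrix. The matrix values, the choice k = 2 and the names `doc_vectors`, `fold_in` and `q_hat` are assumptions made for the example and are not taken from the notebook.

```python
import numpy as np

# Hypothetical term-by-document matrix: 4 terms (rows) x 3 reviews (columns).
X = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Thin SVD: X = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent dimensions to keep (an arbitrary choice here)
Uk, Sk, Vtk = U[:, :k], S[:k], Vt[:k, :]

# Each review as a k-dimensional vector (one row per review).
doc_vectors = Vtk.T

# Fold a term vector q into the same space: q_hat = diag(1 / S_k) @ Uk.T @ q
fold_in = np.diag(1.0 / Sk) @ Uk.T
q = X[:, 0]          # reuse the first review as the "new" document
q_hat = fold_in @ q
```

Folding a column of X back in reproduces that review's row in `doc_vectors`, which is a quick sanity check that the mapping (the role played by `translate` in LSA) is consistent with the decomposition.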

The last step is prediction. Using an SVM and a decision tree, we get accuracies of 
0.8506787330316742 and 0.5927601809954751 respectively.
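
The notebook does not show the prediction cell, so the following is only a sketch of how such a comparison could be run, not the code that produced the numbers above. It assumes 5-fold cross-validated accuracy, scikit-learn's `SVC` with an RBF kernel (not imported in the notebook), and synthetic data standing in for the reduced review vectors and their labels.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the reduced review vectors: two Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 5)),
    rng.normal(2.0, 1.0, size=(50, 5)),
])
y = np.array([0] * 50 + [1] * 50)  # 0 = 'water army', 1 = ordinary user

models = {
    'SVM': SVC(kernel='rbf'),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each classifier.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    for name, model in models.items()
}
print(mean_accuracy)
```

With the real data, X would be the document vectors returned by LSA and y the labels from ReviewsLabel, aligned by review id.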