## 0. Dataset : Naver sentiment movie corpus v1.0

- 200,000 reviews in total, each shorter than 140 characters
(ratings_train.txt = 150K reviews, ratings_test.txt = 50K reviews)
- positive and negative reviews are balanced 1:1
- url : http://github.com/e9t/nsmc/

## 1. Data Preprocessing (feat. KoNLPy)
![image.png](attachment:image.png)

In [1]:
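# read a tab-separated ratings file and drop the header row
def read_data(filename):
    # set the encoding explicitly so the platform default is not used for Korean text
    with open(filename, 'r', encoding='utf-8') as f:
        data = [line.split('\t') for line in f.read().splitlines()]
    return data[1:]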

In [2]:
train_data = read_data('ratings_train.txt')
test_data = read_data('ratings_test.txt')

In [3]:
train_data

[['9976970', '아 더빙.. 진짜 짜증나네요 목소리', '0'],
 ['3819312', '흠...포스터보고 초딩영화줄....오버연기조차 가볍지 않구나', '1'],
 ['10265843', '너무재밓었다그래서보는것을추천한다', '0'],
 ['9045019', '교도소 이야기구먼 ..솔직히 재미는 없다..평점 조정', '0'],
 ['6483659',
  '사이몬페그의 익살스런 연기가 돋보였던 영화!스파이더맨에서 늙어보이기만 했던 커스틴 던스트가 너무나도 이뻐보였다',
  '1'],
 ['5403919', '막 걸음마 뗀 3세부터 초등학교 1학년생인 8살용영화.ㅋㅋㅋ...별반개도 아까움.', '0'],
 ['7797314', '원작의 긴장감을 제대로 살려내지못했다.', '0'],
 ['9443947',
  '별 반개도 아깝다 욕나온다 이응경 길용우 연기생활이몇년인지..정말 발로해도 그것보단 낫겟다 납치.감금만반복반복..이드라마는 가족도없다 연기못하는사람만모엿네',
  '0'],
 ['7156791', '액션이 없는데도 재미 있는 몇안되는 영화', '1'],
 ['5912145', '왜케 평점이 낮은건데? 꽤 볼만한데.. 헐리우드식 화려함에만 너무 길들여져 있나?', '1'],
 ['9008700', '걍인피니트가짱이다.진짜짱이다♥', '1'],
 ['10217543', '볼때마다 눈물나서 죽겠다90년대의 향수자극!!허진호는 감성절제멜로의 달인이다~', '1'],
 ['5957425', '울면서 손들고 횡단보도 건널때 뛰쳐나올뻔 이범수 연기 드럽게못해', '0'],
 ['8628627', '담백하고 깔끔해서 좋다. 신문기사로만 보다 보면 자꾸 잊어버린다. 그들도 사람이었다는 것을.', '1'],
 ['9864035', '취향은 존중한다지만 진짜 내생에 극장에서 본 영화중 가장 노잼 노감동임 스토리도 어거지고 감동도 어거지', '0'],
 ['6852435', 'ㄱ냥 매번 긴장되고 재밋음ㅠㅠ', '1'],
 ['9143163',
  '참 사람들 웃긴게 

In [5]:
# check the number of rows and columns
print(len(train_data))
print(len(train_data[0]))

print(len(test_data))
print(len(test_data[0]))

150000
3
50000
3
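
The 1:1 label balance stated in section 0 can be checked with a simple count over the last column. The sketch below is a minimal illustration on hypothetical rows (`sample_rows` and `label_counts` are not part of the original notebook); on the real data, the same function could be applied to `train_data`.

```python
from collections import Counter

# Hypothetical rows in the same [id, document, label] layout that read_data() returns.
sample_rows = [
    ['1', '재미있다', '1'],
    ['2', '지루하다', '0'],
    ['3', '최고', '1'],
    ['4', '별로', '0'],
]

def label_counts(rows):
    """Count how many rows carry each label (the third column)."""
    return Counter(row[2] for row in rows)

counts = label_counts(sample_rows)
print(counts)
```

A balanced corpus should report roughly the same count for `'0'` and `'1'`.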


------------------------------------------------------------------------------------------------
### [KoNLPy](http://konlpy.org/en/latest/)
KoNLPy is a Korean natural language processing package that bundles several morphological analyzers, including Kkma, Twitter and Mecab.

To compare how different packages work, take a look at the chart below.
![image.png](attachment:image.png)

As you can see, **Mecab is the most effective package** (Twitter works fine as well).
However, using Mecab on macOS requires installing many additional packages, which I would rather avoid. Hence, I decided to use Twitter.

In [6]:
# example
# update: "Twitter" has changed to "Okt" since KoNLPy v0.4.5
from konlpy.tag import Okt

text = '아버지가방에들어가신다'
pos_tagger = Okt()
# pos_tagger.morphs(text) would return the morphemes without POS tags
pos_tagger.pos(text)

[('아버지', 'Noun'), ('가방', 'Noun'), ('에', 'Josa'), ('들어가신다', 'Verb')]

In [7]:
# tokenizing corpus
def tokenize(doc):
    return ['/'.join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)]

train_docs = [(tokenize(row[1]), row[2]) for row in train_data]
test_docs = [(tokenize(row[1]), row[2]) for row in test_data]

In [8]:
train_docs[0]

(['아/Exclamation',
  '더빙/Noun',
  '../Punctuation',
  '진짜/Noun',
  '짜증나다/Adjective',
  '목소리/Noun'],
 '0')

In [9]:
test_docs[0]

(['굳다/Adjective', 'ㅋ/KoreanParticle'], '1')
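
Each token above is a `word/POS` string produced by `'/'.join(t)`. If the word and tag need to be separated again later, splitting from the right is safer, because the word itself may contain a slash. The helper names below (`join_token`, `split_token`) are hypothetical and only illustrate the format.

```python
def join_token(word, tag):
    """Build the 'word/POS' string used for tokens in this notebook."""
    return '/'.join((word, tag))

def split_token(token):
    """Split 'word/POS' back into (word, tag).

    rsplit with maxsplit=1 splits at the last slash only,
    so a slash inside the word itself is preserved.
    """
    word, tag = token.rsplit('/', 1)
    return word, tag

token = join_token('더빙', 'Noun')
print(token)
print(split_token(token))
print(split_token('1/2/Number'))
```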

## 2. Data Exploration (feat. NLTK)
Let's examine what kinds of features our corpus has!

In [10]:
tokens = [t for d in train_docs for t in d[0]]
print(len(tokens))

2159921


[NLTK](https://www.nltk.org) is used to examine the token data.

In [11]:
import nltk

text = nltk.Text(tokens, name='NSMC')
print(text)

<Text: NSMC>


In [12]:
#number of tokens
print(len(text.tokens))

2159921


In [13]:
# number of unique tokens
print(len(set(text.tokens)))

49895


In [14]:
# 10 most common tokens
print(text.vocab().most_common(10))

[('./Punctuation', 67778), ('영화/Noun', 50818), ('하다/Verb', 41209), ('이/Josa', 38540), ('보다/Verb', 38538), ('의/Josa', 30188), ('../Punctuation', 29055), ('가/Josa', 26627), ('에/Josa', 26468), ('을/Josa', 23118)]
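
`text.vocab()` returns an NLTK `FreqDist`, which is a subclass of `collections.Counter`, so the counts above can be reproduced with the standard library alone. The sketch below uses a hypothetical token list and, as an extra step that is not in the original notebook, filters out punctuation and particles (`Josa`), which dominate the top 10 above.

```python
from collections import Counter

# Hypothetical tokens in the same 'word/POS' format as train_docs.
sample_tokens = [
    '영화/Noun', './Punctuation', '보다/Verb', '영화/Noun',
    '이/Josa', './Punctuation', '영화/Noun',
]

freq = Counter(sample_tokens)
total = sum(freq.values())   # number of tokens
unique = len(freq)           # number of unique tokens
top = freq.most_common(2)    # most frequent tokens with their counts

# Assumption: tags treated as uninformative for a quick look at content words.
IGNORED_TAGS = {'Punctuation', 'Josa'}
content_freq = Counter(
    t for t in sample_tokens if t.rsplit('/', 1)[1] not in IGNORED_TAGS
)

print(total, unique, top)
print(content_freq.most_common(2))
```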


## 3. Sentiment classification with term-existence
- each review is represented by boolean features that record whether each frequent term occurs in it

In [15]:
# use the 2,000 most frequent tokens as features
selected_words = [f[0] for f in text.vocab().most_common(2000)]

def term_exists(doc):
    # build the set once per document instead of once per word
    doc_tokens = set(doc)
    return {'exists({})'.format(word): (word in doc_tokens) for word in selected_words}

train_xy = [(term_exists(d), c) for d, c in train_docs]
test_xy = [(term_exists(d), c) for d, c in test_docs]

In [16]:
# review: 아 더빙.. 진짜 짜증나네요 목소리
# tokens that occur in the review, such as 아, are marked True
train_xy[0]

({'exists(./Punctuation)': False,
  'exists(영화/Noun)': False,
  'exists(하다/Verb)': False,
  'exists(이/Josa)': False,
  'exists(보다/Verb)': False,
  'exists(의/Josa)': False,
  'exists(../Punctuation)': True,
  'exists(가/Josa)': False,
  'exists(에/Josa)': False,
  'exists(을/Josa)': False,
  'exists(.../Punctuation)': False,
  'exists(도/Josa)': False,
  'exists(은/Josa)': False,
  'exists(들/Suffix)': False,
  'exists(,/Punctuation)': False,
  'exists(는/Josa)': False,
  'exists(없다/Adjective)': False,
  'exists(를/Josa)': False,
  'exists(있다/Adjective)': False,
  'exists(좋다/Adjective)': False,
  'exists(너무/Adverb)': False,
  'exists(?/Punctuation)': False,
  'exists(이/Determiner)': False,
  'exists(재밌다/Adjective)': False,
  'exists(정말/Noun)': False,
  'exists(것/Noun)': False,
  'exists(되다/Verb)': False,
  'exists(!/Punctuation)': False,
  'exists(진짜/Noun)': True,
  'exists(같다/Adjective)': False,
  'exists(적/Suffix)': False,
  'exists(으로/Josa)': False,
  'exists(이/Noun)': False,
  'exists(점/Nou
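
The term-existence features can be reproduced on a small scale to make the mapping easier to follow. This is a minimal sketch with a hypothetical three-word vocabulary; `make_term_exists` is an illustrative helper, not part of the original notebook.

```python
def make_term_exists(vocabulary):
    """Return a feature extractor bound to a fixed vocabulary."""
    def term_exists(doc):
        doc_tokens = set(doc)  # set membership is O(1) per lookup
        return {'exists({})'.format(word): (word in doc_tokens)
                for word in vocabulary}
    return term_exists

# Hypothetical vocabulary; the notebook uses the 2,000 most frequent tokens.
vocabulary = ['영화/Noun', '진짜/Noun', '최악/Noun']
extract = make_term_exists(vocabulary)

features = extract(['진짜/Noun', '짜증나다/Adjective'])
print(features)
```

Note that tokens outside the vocabulary (here `짜증나다/Adjective`) produce no feature at all, so any signal they carry is lost to the classifier.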

### Naive Bayes Classifier
- [Naive Bayes algorithm (in Korean)](https://gomguard.tistory.com/69)

In [17]:
classifier = nltk.NaiveBayesClassifier.train(train_xy)
print(nltk.classify.accuracy(classifier, test_xy))

0.81668


In [18]:
classifier.show_most_informative_features(5)

Most Informative Features
  exists(낭비하다/Adjective) = True                0 : 1      =     85.3 : 1.0
         exists(최악/Noun) = True                0 : 1      =     67.9 : 1.0
         exists(반개/Noun) = True                0 : 1      =     59.0 : 1.0
       exists(♥/Foreign) = True                1 : 0      =     56.9 : 1.0
         exists(펑펑/Noun) = True                1 : 0      =     46.1 : 1.0
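
The ratios in this table compare how likely a feature value is under one label versus the other. The sketch below estimates such a ratio from raw counts on hypothetical documents. The additive smoothing with `gamma=0.5` is an assumption chosen for illustration, and NLTK's own estimator may differ in detail, so this is not a reimplementation of `show_most_informative_features`.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature is present | label) with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(labeled_docs, feature):
    """Return (more likely label, less likely label, ratio) for feature = True."""
    stats = {}  # label -> (docs containing the feature, docs in total)
    for tokens, label in labeled_docs:
        present, total = stats.get(label, (0, 0))
        stats[label] = (present + (feature in tokens), total + 1)
    probs = {label: smoothed_prob(p, n) for label, (p, n) in stats.items()}
    high = max(probs, key=probs.get)
    low = min(probs, key=probs.get)
    return high, low, probs[high] / probs[low]

# Hypothetical labeled documents: (set of tokens, label).
toy_docs = [
    ({'최악/Noun', '영화/Noun'}, '0'),
    ({'최악/Noun'}, '0'),
    ({'최악/Noun', '연기/Noun'}, '0'),
    ({'영화/Noun'}, '0'),
    ({'영화/Noun'}, '1'),
    ({'좋다/Adjective'}, '1'),
    ({'연기/Noun'}, '1'),
    ({'영화/Noun', '좋다/Adjective'}, '1'),
]

high, low, ratio = likelihood_ratio(toy_docs, '최악/Noun')
print('{} : {} = {:.1f} : 1.0'.format(high, low, ratio))
```

Smoothing matters here: without it, a feature that never occurs under one label would give a zero probability and an undefined ratio.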


### Decision Tree Classifier
- [Decision tree algorithm (in Korean)](https://gomguard.tistory.com/86)

In [None]:
classifier = nltk.DecisionTreeClassifier.train(train_xy)
print(nltk.classify.accuracy(classifier, test_xy))
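
A decision tree is built by repeatedly choosing a feature to split on. One simple criterion, sketched below on hypothetical examples, is to pick the boolean feature whose one-level split (a "stump") misclassifies the fewest training examples. This is an illustration of the idea under that assumption, not NLTK's implementation, and `stump_error` and `best_stump_feature` are hypothetical names.

```python
from collections import Counter

def stump_error(examples, feature):
    """Error rate of a one-level tree that splits on one boolean feature."""
    branches = {True: Counter(), False: Counter()}
    for features, label in examples:
        branches[features[feature]][label] += 1
    # Each branch predicts its majority label; the rest are errors.
    errors = sum(sum(c.values()) - max(c.values())
                 for c in branches.values() if c)
    return errors / len(examples)

def best_stump_feature(examples):
    """Pick the feature whose stump has the lowest training error."""
    feature_names = examples[0][0].keys()
    return min(feature_names, key=lambda f: stump_error(examples, f))

# Hypothetical (features, label) pairs in the same shape as train_xy.
examples = [
    ({'exists(최악/Noun)': True, 'exists(영화/Noun)': True}, '0'),
    ({'exists(최악/Noun)': True, 'exists(영화/Noun)': False}, '0'),
    ({'exists(최악/Noun)': False, 'exists(영화/Noun)': True}, '1'),
    ({'exists(최악/Noun)': False, 'exists(영화/Noun)': False}, '1'),
]

print(best_stump_feature(examples))
```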

## 4. Sentiment classification with doc2vec (feat. Gensim)
- [Gensim](https://radimrehurek.com/gensim/intro.html)
- data transformation needed

In [18]:
from collections import namedtuple

TaggedDocument = namedtuple('TaggedDocument', 'words tags')

tagged_train_docs = [TaggedDocument(d, [c]) for d, c in train_docs]
taaged_test_docs = [TaggedDocument(d, [c]) for d, c in test_docs]

In [25]:
from gensim.models import doc2vec

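# `size` was renamed to `vector_size`; gensim 4 no longer accepts `size`
model = doc2vec.Doc2Vec(vector_size=300, alpha=0.025, min_alpha=0.025, seed=1234)
model.build_vocab(tagged_train_docs)

# train document vectors
# each call runs model.epochs passes, so the loop performs 10 * model.epochs passes in total
for epoch in range(10):
    model.train(tagged_train_docs, epochs=model.epochs, total_examples=model.corpus_count)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # keep the rate constant within each call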

# To save
# model.save('doc2vec.model')
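
Because `min_alpha` is set equal to `alpha`, the learning rate stays constant inside each `train()` call and drops by 0.002 between calls. The schedule can be traced by hand; `alpha_schedule` below is a hypothetical helper that only reproduces the arithmetic of the loop and does not use gensim.

```python
def alpha_schedule(start=0.025, step=0.002, iterations=10):
    """Return the alpha used in each iteration and the value left afterwards."""
    alphas = []
    alpha = start
    for _ in range(iterations):
        alphas.append(alpha)  # rate used by this train() call
        alpha -= step         # manual decay applied after the call
    return alphas, alpha

alphas, final_alpha = alpha_schedule()
print([round(a, 3) for a in alphas])
print(round(final_alpha, 3))
```

The last call therefore trains at roughly 0.007, and the model is left with an alpha of about 0.005.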

In [26]:
print(model.wv.most_similar('공포/Noun'))

[('공포영화/Noun', 0.4368985891342163), ('서스펜스/Noun', 0.3643965721130371), ('호러/Noun', 0.3576434850692749), ('긴박/Noun', 0.34094157814979553), ('스릴러/Noun', 0.3137616217136383), ('미스터리/Noun', 0.31117933988571167), ('멜로영화/Noun', 0.2903384864330292), ('스릴/Noun', 0.2883269190788269), ('괴기/Noun', 0.2824685275554657), ('박진/Noun', 0.2799838185310364)]


In [28]:
print(model.wv.most_similar('ㅋㅋ/KoreanParticle'))

[('ㅋㅋㅋ/KoreanParticle', 0.3308281898498535), ('ㅎ/KoreanParticle', 0.2720175087451935), ('ㅋ/KoreanParticle', 0.26531916856765747), ('ㅎㅎㅎ/KoreanParticle', 0.2587934136390686), ('-_-;/Punctuation', 0.25734734535217285), ('^^^/Punctuation', 0.23995190858840942), ('살찌다/Adjective', 0.23321013152599335), ('ㅎㅎ/KoreanParticle', 0.23285427689552307), ('ㅋㅋㅋㄱ/KoreanParticle', 0.2318296581506729), ('!!!/Punctuation', 0.23062261939048767)]
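
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below computes the same quantity on hypothetical two-dimensional vectors (the real model uses 300 dimensions); `toy_vectors` and the local `most_similar` function are illustrative stand-ins, not the gensim API.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=2):
    """Rank all other words by cosine similarity to the query word."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical vectors chosen so that two words point in similar directions.
toy_vectors = {
    '공포/Noun': [1.0, 0.0],
    '호러/Noun': [0.9, 0.1],
    '멜로/Noun': [0.0, 1.0],
}

ranking = most_similar('공포/Noun', toy_vectors)
print(ranking)
```

Values close to 1 mean the vectors point in nearly the same direction, so top scores of 0.3 to 0.4, as in the outputs above, indicate only a loose association.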


In [30]:
#generate features
train_x = [model.infer_vector(doc.words) for doc in tagged_train_docs]
train_y = [doc.tags[0] for doc in tagged_train_docs]
test_x = [model.infer_vector(doc.words) for doc in taaged_test_docs]
test_y = [doc.tags[0] for doc in taaged_test_docs]

In [31]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=1234)
classifier.fit(train_x, train_y)
classifier.score(test_x, test_y)



0.66416

## Word Embedding : TF-IDF
### Why don't we try TF-IDF instead?
- Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a specific word is to a document within a corpus
- The goal of TF-IDF is to give a high score to tokens that appear often in the given document but rarely in other documents
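
The two points above can be made concrete with the textbook formula: the term frequency within a document multiplied by `log(N / df)`, where `N` is the number of documents and `df` is the number of documents containing the term. The sketch below uses hypothetical documents and this plain formula. scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (plain textbook TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Hypothetical tokenized reviews in the same 'word/POS' format.
docs = [
    ['영화/Noun', '재밌다/Adjective'],
    ['영화/Noun', '지루하다/Adjective'],
    ['영화/Noun', '영화/Noun', '최악/Noun'],
]

scores = tf_idf(docs)
for doc_scores in scores:
    print({term: round(value, 3) for term, value in doc_scores.items()})
```

A term that occurs in every document, such as `영화/Noun` here, gets an IDF of `log(1) = 0` and so scores zero no matter how often it appears, while a term unique to one document scores highest.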

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [None]:
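# `tokenizer` must be a callable, so pass the tokenize() function from section 1 rather than a word list
# restricting the vocabulary to selected_words is an assumption about the original intent
tfidf = TfidfVectorizer(tokenizer=tokenize, vocabulary=selected_words)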