**Dataset** 
* https://www.kaggle.com/hacker-news/hacker-news-posts
* This data set is Hacker News posts from the last 12 months (up to September 26 2016).

**Reference notebook**
* https://www.kaggle.com/alexklibisz/simple-word-vectors-with-co-occurrence-pmi-and-svd

#Data download 

In [None]:
!pip install kaggle



In [None]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"jungyoonchoi","key":"<redacted>"}'}

In [None]:
ls -lha kaggle.json

-rw-r--r-- 1 root root 68 Jan 13 07:05 kaggle.json


In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/kaggle.json

# avoid the permission warning
!chmod 600 ~/.kaggle/kaggle.json

In [None]:
!kaggle datasets download -d hacker-news/hacker-news-posts

Downloading hacker-news-posts.zip to /content
 42% 8.00M/19.0M [00:00<00:00, 82.6MB/s]
100% 19.0M/19.0M [00:00<00:00, 93.2MB/s]


In [None]:
!unzip /content/hacker-news-posts.zip

Archive:  /content/hacker-news-posts.zip
  inflating: HN_posts_year_to_Sep_26_2016.csv  


#Data Preparation

In [None]:
from __future__ import print_function, division
from collections import Counter
from itertools import combinations
from math import log
from pprint import pformat
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import svds
from string import punctuation
from time import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
print('Ready')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Ready


In [None]:
df = pd.read_csv('/content/HN_posts_year_to_Sep_26_2016.csv', usecols=['title']) # use only the post titles
df.head()

Unnamed: 0,title
0,You have two days to comment if you want stem ...
1,SQLAR the SQLite Archiver
2,What if we just printed a flatscreen televisio...
3,algorithmic music
4,How the Data Vault Enables the Next-Gen Data W...


#Preprocessing 

* Lowercase
* (Remove accented characters)
* Remove special characters
* Remove stopwords

In [None]:
import string
# import nltk
# nltk.download('stopwords')
# from nltk.corpus import stopwords

def preprocess(x):
  sent = x.lower() # lowercase
  sent = sent.encode('ascii', 'ignore').decode() # drop accented (non-ASCII) characters; keep ASCII only
  sent = ''.join([c for c in sent if c not in string.punctuation]) # strip punctuation and special characters
  sent = [word for word in sent.split() if word not in stopwords_set] # split into words and remove stopwords
  return sent

stopwords_set = set(stopwords.words('english'))
preprocessed = df['title'].apply(preprocess) # Series.apply runs the function on every title
print(preprocessed[:10])

0    [two, days, comment, want, stem, cells, classi...
1                            [sqlar, sqlite, archiver]
2       [printed, flatscreen, television, side, boxes]
3                                 [algorithmic, music]
4    [data, vault, enables, nextgen, data, warehous...
5                           [saving, hassle, shopping]
6    [macalifa, new, opensource, music, app, uwp, w...
7    [github, theweavrsmacalifa, music, player, wri...
8                    [google, allo, first, impression]
9         [advanced, multimedia, linux, command, line]
Name: title, dtype: object


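#N-gram count (word frequency count)

I tried building a full co-occurrence matrix, but the runtime kept crashing, probably because the matrix is huge and extremely sparse.

=> Instead, count co-occurrences as bigrams and compute PMI from those counts.

Unigram and Bigram Count
* n-gram = a sequence of n consecutive words
* unigrams : an, adorable, little, boy, is, spreading, smiles
* bigrams :  an adorable, adorable little, little boy, boy is, is spreading, spreading smiles
* Note: the counting code below pairs every two words that appear in the same title (via `itertools.combinations`), not only adjacent words, so its "bigrams" are really co-occurrence pairs.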
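
The counting cell in this notebook uses `itertools.combinations`, so it pairs every two words in a title instead of only neighbouring ones. A minimal sketch of the difference, using the example sentence from the list above (the variable names here are illustrative and do not appear in the notebook):

```python
from itertools import combinations

tokens = "an adorable little boy is spreading smiles".split()

# Adjacent bigrams: each word paired with the word that follows it.
adjacent_bigrams = list(zip(tokens, tokens[1:]))

# Co-occurrence pairs: every unordered pair of words in the same sentence,
# sorted alphabetically so (a, b) and (b, a) share one key.
cooccurrence_pairs = [tuple(sorted(pair)) for pair in combinations(tokens, 2)]

print(len(adjacent_bigrams))    # 6
print(len(cooccurrence_pairs))  # 21
```

For short texts such as titles, treating the whole title as one context window is a reasonable simplification, but the number of pairs grows quadratically with title length.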

In [None]:
# from collections import Counter
# from itertools import combinations -> yields every pair of elements in a list as tuples

unigram = Counter() # frequency of each single word
bigram = Counter() # frequency of each word pair

for sent in preprocessed:

  for word in sent: # unigram count
    unigram[word] += 1

  for w1, w2 in map(sorted, combinations(sent, 2)): # pair count over all word pairs in a title (includes a word paired with its own repeat)
    bigram[(w1, w2)] += 1

Remove high-frequency and low-frequency words

In [None]:
len(unigram) # vocabulary size (number of distinct words)

99044

In [None]:
len(df)

293119

In [None]:
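min_count = (1/1000)*len(df) # drop low-frequency words: fewer than 0.1% of the title count (293119 * 0.001, about 293 occurrences)
max_count = (1/50)*len(df) # drop high-frequency words: more than 2% of the title count (about 5862 occurrences)

for word in list(unigram.keys()): # copy the keys so entries can be deleted while iterating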
  if unigram[word] < min_count or unigram[word] > max_count:
    del unigram[word]

print(len(unigram))
print('Most common : ', unigram.most_common()[:25])

972
Most common :  [('google', 5532), ('app', 5124), ('using', 4613), ('us', 4189), ('web', 4134), ('startup', 3849), ('open', 3828), ('first', 3730), ('code', 3705), ('apple', 3695), ('pdf', 3659), ('software', 3558), ('video', 3462), ('tech', 3410), ('free', 3180), ('2016', 3108), ('use', 3100), ('learning', 3092), ('time', 2934), ('facebook', 2853), ('go', 2836), ('get', 2766), ('one', 2764), ('make', 2751), ('world', 2712)]


In [None]:
# Remove the same words from the bigram counts
for w1, w2 in list(bigram.keys()):
  if w1 not in unigram or w2 not in unigram: 
    del bigram[(w1, w2)]

#Word association - building the PMI matrix

In [None]:
word_to_id, id_to_word = {}, {}

for id, word in enumerate(unigram.keys()):
  word_to_id[word] = id
  id_to_word[id] = word

PMI matrix (a co-occurrence matrix adjusted for individual word frequencies, i.e. a word-to-word association matrix)
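
As a worked illustration, PMI compares how often a pair occurs with how often it would occur if the two words were independent: log2(p(w1, w2) / (p(w1) p(w2))), and PPMI clips negative values to zero. The sketch below uses made-up counts and a hypothetical helper name `ppmi`; it is not the notebook's code, which also adds a small `eps` inside the logarithm.

```python
from math import log2

def ppmi(pair_count, w1_count, w2_count, pair_total, word_total):
    """Positive PMI: max(0, log2(p(w1, w2) / (p(w1) * p(w2)))).

    Assumes pair_count > 0, as is the case for any pair that was observed.
    """
    p_pair = pair_count / pair_total
    p_w1 = w1_count / word_total
    p_w2 = w2_count / word_total
    return max(0.0, log2(p_pair / (p_w1 * p_w2)))

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
# Two rare words that almost always appear together: strongly positive (log2(80), about 6.32).
print(ppmi(pair_count=8, w1_count=10, w2_count=10, pair_total=1000, word_total=1000))

# Two common words that rarely appear together: negative PMI, clipped to 0.0.
print(ppmi(pair_count=1, w1_count=500, w2_count=500, pair_total=1000, word_total=1000))
```

This is why pairs of rare words that nearly always occur together tend to score highest.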

In [None]:
uni_sum = sum(unigram.values()) # total count of individual word occurrences
bi_sum = sum(bigram.values()) # total count of word-pair occurrences

In [None]:
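# There is no co-occurrence matrix, but the counts can be read from bigram instead
# from scipy.sparse import csc_matrix (compressed sparse column matrix)

ppmi_rates = Counter()
rates, rows, cols = [], [], []
eps = 1e-8
for (w1, w2), n in bigram.items(): # n is the number of times the pair occurred
  rows.append(word_to_id[w1]) # word index
  cols.append(word_to_id[w2])
  rates.append(max(0,np.log2( (n/bi_sum) / (unigram[w1]/uni_sum) / (unigram[w2]/uni_sum) + eps))) # eps guards against log2(0)
  ppmi_rates[(w1,w2)] = rates[-1]

# Build the PPMI matrix (a dense array runs out of memory, so use a scipy sparse matrix)
# Passing shape explicitly keeps one row and one column per vocabulary word
PPMI = csc_matrix((rates, (rows, cols)), shape=(len(word_to_id), len(word_to_id)))
ppmi_rates.most_common()[:25] # show the 25 most strongly associated word pairs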

[(('elon', 'musk'), 9.995189957188554),
 (('pi', 'raspberry'), 9.822414021016007),
 (('street', 'wall'), 9.703687724353829),
 (('francisco', 'san'), 9.437920571045314),
 (('capital', 'venture'), 9.361458001336121),
 (('basic', 'income'), 9.195722541304471),
 (('card', 'credit'), 9.092029507670537),
 (('studio', 'visual'), 9.068871786683712),
 (('star', 'wars'), 8.975033805686056),
 (('command', 'line'), 8.900331150559108),
 (('continuous', 'delivery'), 8.76849187993323),
 (('encrypt', 'lets'), 8.766993476697026),
 (('artificial', 'intelligence'), 8.757913052294004),
 (('rocket', 'spacex'), 8.68016831258142),
 (('silicon', 'valley'), 8.579054334409285),
 (('jobs', 'steve'), 8.553065530351653),
 (('continuous', 'integration'), 8.545001656362246),
 (('os', 'x'), 8.528515115525519),
 (('networks', 'neural'), 8.459635767152436),
 (('aws', 'lambda'), 8.448799481515607),
 (('house', 'white'), 8.44586825948013),
 (('poor', 'rich'), 8.369202243280403),
 (('chrome', 'extension'), 8.3518721573981

elon musk, raspberry pi, wall street...

**SVD of the PMI matrix**

In [None]:
# SVD - is the scipy implementation a bit faster?
from scipy.sparse.linalg import svds

u, s, v = svds(PPMI, k=20)

In [None]:
# Truncated SVD
from sklearn.utils.extmath import randomized_svd

U, S, V = randomized_svd(PPMI, n_components=20)
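
Both calls above compute a rank-20 truncated SVD, and the rows of the left factor are what the later cells use as word vectors. The idea can be sketched on a small dense matrix with plain NumPy; the 4x4 matrix `M` below is made up and merely stands in for `PPMI`.

```python
import numpy as np

# Toy symmetric "association" matrix standing in for PPMI (values are made up).
M = np.array([
    [0.0, 3.0, 0.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [0.0, 0.0, 2.0, 0.0],
])

# Full SVD; NumPy returns singular values in descending order.
U_full, S_full, Vt_full = np.linalg.svd(M)

# Truncate to the k strongest components.
k = 2
U_k, S_k, Vt_k = U_full[:, :k], S_full[:k], Vt_full[:k, :]

# Each row of U_k is a k-dimensional vector for one "word".
print(U_k.shape)

# The rank-k reconstruction keeps only the structure carried by those components.
M_k = U_k @ np.diag(S_k) @ Vt_k
print(np.round(M_k, 3))
```

On the real matrix a full decomposition would be wasteful, which is why the sparse and randomized solvers are used instead.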

**Similarity between word vectors (cosine similarity)**

$$\mathrm{similarity}(x,y) = \frac{x \cdot y}{\lVert x \rVert \lVert y \rVert}$$
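
A minimal NumPy sketch of this formula, using made-up 2-D vectors and a hypothetical helper name `cosine_similarity`. It also shows why the notebook normalizes the rows of `U` first: once every vector has unit length, a plain dot product gives the cosine directly.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

# Made-up 2-D vectors for illustration.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(round(cosine_similarity(a, b), 3))  # 0.707

# After dividing each row by its norm, a plain dot product gives the same value.
vectors = np.stack([a, b])
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
print(round(float(unit[0] @ unit[1]), 3))  # 0.707
```

The sketch assumes non-zero vectors; the notebook guards against zero norms with `np.maximum(norms, 1e-7)`.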

In [None]:
# compute each row's norm, then scale every word vector to unit length
norms = np.sqrt(np.sum(np.square(U), axis=1, keepdims=True))
U /= np.maximum(norms, 1e-7)

In [None]:
k = 5
for word in ['facebook', 'twitter', 'instagram', 'messenger', 'hack', 'security', 
          'deep', 'encryption', 'cli', 'venture', 'paris']:
    cs = np.dot(U, U[word_to_id[word]]) # cosine similarity between this word's vector and every other word vector
    s = ''
    # Compile the list of nearest neighbor descriptions.
    # Argpartition is faster than argsort and meets our needs,
    # but the k + 1 candidates it returns are not sorted by similarity.
    for id in np.argpartition(-1 * cs, k + 1)[:k + 1]:
        if id_to_word[id] == word: continue
        s += '(%s, %.3f) ' % (id_to_word[id], cs[id]) # append (neighbor word, similarity)
    print('%s, %d\n %s' % (word, unigram[word], s)) # print the target word, its frequency, and its neighbors
    print('-' * 10)

facebook, 2853
 (facebooks, 0.717) (friends, 0.702) (ads, 0.794) (instagram, 0.772) (google, 0.701) 
----------
twitter, 1641
 (traffic, 0.905) (tracking, 0.884) (time, 0.893) (tv, 0.869) (understand, 0.852) 
----------
instagram, 391
 (links, 0.762) (facebook, 0.772) (ads, 0.710) (images, 0.699) (ad, 0.692) 
----------
messenger, 374
 (messaging, 0.818) (lets, 0.725) (chat, 0.739) (bots, 0.650) (media, 0.621) 
----------
hack, 881
 (hackers, 0.904) (hacked, 0.786) (hacking, 0.762) (malware, 0.723) (attack, 0.680) 
----------
security, 2425
 (remote, 0.895) (secure, 0.825) (software, 0.760) (servers, 0.731) (running, 0.728) 
----------
deep, 1375
 (learning, 0.841) (networks, 0.770) (algorithms, 0.794) (neural, 0.816) (machine, 0.726) 
----------
encryption, 968
 (government, 0.783) (crypto, 0.923) (fbi, 0.830) (nsa, 0.719) (attacks, 0.716) 
----------
cli, 311
 (custom, 0.799) (easy, 0.749) (command, 0.907) (browser, 0.746) (client, 0.724) 
----------
venture, 393
 (vc, 0.961) (valley

In [None]:
def check_similar_words(word, k=5):
    cs = np.dot(U, U[word_to_id[word]]) # cosine similarity between this word's vector and every other word vector
    s = ''
    # Compile the list of nearest neighbor descriptions.
    # Argpartition is faster than argsort and meets our needs,
    # but the k + 1 candidates it returns are not sorted by similarity.
    for id in np.argpartition(-1 * cs, k + 1)[:k + 1]:
        if id_to_word[id] == word: continue
        s += '(%s, %.3f) ' % (id_to_word[id], cs[id]) # append (neighbor word, similarity)
    print('%s, %d\n %s' % (word, unigram[word], s)) # print the target word, its frequency, and its neighbors

In [None]:
check_similar_words('car')

car, 1125
 (autonomous, 0.886) (cars, 0.878) (electric, 0.821) (battery, 0.735) (drones, 0.673) 
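
The neighbor lists printed by these cells are not always ordered by similarity (the `facebook` list reads 0.717, 0.702, 0.794, ...), because `np.argpartition` only guarantees which indices land in the first slots, not their order. One way to get a ranked list is to sort just the selected candidates. The helper name `top_k_sorted` and the scores below are made up for illustration.

```python
import numpy as np

def top_k_sorted(scores, k):
    """Indices of the k largest scores, ordered from highest to lowest.

    Assumes 1 <= k <= len(scores).
    """
    # argpartition puts the k largest scores in the first k slots, in arbitrary order.
    candidates = np.argpartition(-scores, k - 1)[:k]
    # Sorting only those k candidates restores descending order cheaply.
    return candidates[np.argsort(-scores[candidates])]

# Made-up similarity scores for illustration.
scores = np.array([0.10, 0.90, 0.30, 0.70, 0.50])
print(top_k_sorted(scores, 3))  # [1 3 4]
```

This keeps the speed advantage of `argpartition` over a full `argsort`, since only k items are sorted.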


Proper tokenization is still needed.
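
One place where this shows up is in words such as `facebooks` and `lets` in the outputs above, which look like the result of stripping the apostrophe from `facebook's` and `let's`. The sketch below is one possible regex-based tokenizer, not the notebook's method. The pattern, the helper name `tokenize`, and the choice to drop a possessive `'s` are all assumptions, and a library tokenizer (for example from nltk) would be an alternative.

```python
import re

# Runs of letters/digits, optionally followed by an apostrophe and more letters.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")

def tokenize(title):
    """Lowercase a title, split it into tokens, and drop a trailing 's."""
    # Normalize the typographic apostrophe so both forms are handled alike.
    text = title.lower().replace("\u2019", "'")
    tokens = TOKEN_RE.findall(text)
    return [t[:-2] if t.endswith("'s") else t for t in tokens]

print(tokenize("Facebook's new open-source CLI"))
# ['facebook', 'new', 'open', 'source', 'cli']
```

Unlike the punctuation-stripping step in `preprocess`, this splits hyphenated words into two tokens instead of merging them (`open`, `source` instead of `opensource`); whether that is preferable depends on the vocabulary you want.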