
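- language modeling: predict the next token so that the model's probability distribution matches the real distribution of the language
- SLM: statistical language model; the probability of a sentence is decomposed with the chain rule into conditional probabilities estimated from counts
- sparsity problem: limitation of the count-based approach (if a word sequence never appears in the corpus, its count is zero, so the probability is zero or the denominator becomes zero)
   - remedies: n-gram, generalization (smoothing, backoff)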
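
Written out in standard notation, the chain rule and the count-based estimate look as follows. This is a sketch: the symbols $w_i$ and $\mathrm{count}(\cdot)$ are introduced here for illustration and are not from the notes.

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}), \qquad
P(w_i \mid w_1, \dots, w_{i-1}) \approx \frac{\mathrm{count}(w_1, \dots, w_i)}{\mathrm{count}(w_1, \dots, w_{i-1})}
$$

If the history $w_1, \dots, w_{i-1}$ never occurs in the corpus, the denominator is zero and the estimate is undefined.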

## n-gram
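- does not condition on all preceding words, only on a chosen, limited number of the most recent ones
- an n-gram is a sequence of "n" consecutive words
- predict the next word using the previous "n-1" words
- trade-off: a larger n gives more context but worsens the sparsity problem and grows the model; a smaller n keeps the model small but loses context
- n is usually kept at n <= 5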
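
A minimal sketch of a count-based bigram model (n = 2), assuming a toy whitespace-tokenized corpus invented for illustration. It shows how an unseen context makes the denominator zero, and how add-k smoothing (one form of the generalization mentioned above) avoids that. The names `corpus` and `bigram_prob` are hypothetical.

```python
from collections import Counter

# Toy corpus (made up for illustration), tokenized by whitespace.
corpus = [
    "the boy is spreading smiles",
    "the boy is reading a book",
    "the girl is reading a book",
]

context_counts = Counter()  # how often a word appears as the previous word
bigram_counts = Counter()   # how often a (previous word, word) pair appears
vocab = set()
for sentence in corpus:
    tokens = sentence.split()
    vocab.update(tokens)
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(prev, word, smoothing=0.0):
    """P(word | prev) = count(prev, word) / count(prev), with optional add-k smoothing."""
    numerator = bigram_counts[(prev, word)] + smoothing
    denominator = context_counts[prev] + smoothing * len(vocab)
    if denominator == 0:
        # Unseen context: the unsmoothed estimate is undefined (sparsity problem).
        return None
    return numerator / denominator


print(bigram_prob("is", "reading"))             # seen bigram
print(bigram_prob("is", "smiles"))              # unseen bigram: probability 0
print(bigram_prob("dog", "is"))                 # unseen context: undefined (None)
print(bigram_prob("dog", "is", smoothing=1.0))  # smoothed: small but non-zero
```

With a larger n the context becomes a longer tuple, so even more contexts are unseen. That is the sparsity side of the trade-off.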

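## Why Korean NLP is hard
- Word order matters little for conveying meaning.
- Korean is an agglutinative language -> it has particles (josa) attached to words.
- Spacing rules are often not followed properly.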

## Perplexity (PPL)
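- the multiplicative inverse of the probability of the test data, normalized by the number of words N (the N-th root)
- can be read as a branching factor: the average number of candidates the model is choosing among at each step
- the lower, the better (a lower PPL means the model assigns a higher probability to the test sentences)
- depends on the test data (domain, quantity), so compare models only on the same test data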
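
A minimal sketch of the computation, assuming the per-token conditional probabilities of the test sequence are already available. The function name `perplexity` and the example probabilities are hypothetical.

```python
import math


def perplexity(token_probs):
    """Perplexity = P(w_1..w_N) ** (-1/N), computed from per-token probabilities."""
    n = len(token_probs)
    # Sum log-probabilities instead of multiplying, to avoid numeric underflow.
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)


# Hypothetical model that assigns probability 0.1 to the correct token at every step.
uniform = [0.1] * 20
print(perplexity(uniform))  # about 10: as if choosing among 10 branches each time

# A more confident model gets a lower perplexity on the same number of tokens.
print(perplexity([0.5] * 20))
```

The first case illustrates the branching-factor reading: a model that is uniformly unsure among 10 options at each step has a perplexity of about 10.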

# {Count, Prediction} based Word Representation

- local (discrete) representation, e.g. Bag of Words, DTM
- distributed (continuous) representation, e.g. Word2Vec (but not distributed by Tomas Mikolov), LSA, LDA



In [4]:
%%HTML
<img src="https://wikidocs.net/images/page/31767/wordrepresentation.PNG" width=200 height=200 />


In [2]:
# Bag of Words: count word frequencies -> used for classification and document similarity
# word order is ignored; the index assigned to each word is arbitrary
from konlpy.tag import Okt
import re

okt = Okt()
# remove the period, then split the sentence into morphemes
token = re.sub(r"\.", "", "정부의 물가상승률과 소비자가 느끼는 물가상승률은 다르다.")
token = okt.morphs(token)

word2index = {}  # word -> index
bow = []         # bow[index] = frequency of that word
for voca in token:
    if voca not in word2index:
        # new word: assign the next index and start its count at 1
        word2index[voca] = len(word2index)
        bow.append(1)
    else:
        index = word2index[voca]
        bow[index] += 1
word2index, bow


({'정부': 0,
  '의': 1,
  '물가상승률': 2,
  '과': 3,
  '소비자': 4,
  '가': 5,
  '느끼는': 6,
  '은': 7,
  '다르다': 8},
 [1, 1, 2, 1, 1, 1, 1, 1, 1])

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['you know I want your love, because I love you']
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray())
print(vector.vocabulary_)


[[1 1 2 1 2 1]]
{'you': 4, 'know': 1, 'want': 3, 'your': 5, 'love': 2, 'because': 0}


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

text = ["Family is not an important thing. It's everything"]
vect = CountVectorizer(stop_words=["the","a","an","is","not"])

vect.fit_transform(text).toarray()

array([[1, 1, 1, 1, 1]])

In [7]:
vect.vocabulary  # constructor argument (None here, so nothing is displayed); the fitted mapping is vocabulary_

In [8]:
vect.vocabulary_

{'family': 1, 'important': 2, 'thing': 4, 'it': 3, 'everything': 0}

In [9]:
vect.fixed_vocabulary_

False

In [10]:
from sklearn.feature_extraction.text import CountVectorizer 
text = ["Family is not an important thing. It's everything"]
vect = CountVectorizer(stop_words="english")
vect.fit_transform(text).toarray()

array([[1, 1, 1]])

In [11]:
vect.vocabulary_

{'family': 0, 'important': 1, 'thing': 2}

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
text=["Family is not an important thing. It's everything."]
sw = stopwords.words("english")
vect = CountVectorizer(stop_words=sw)
sw

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [13]:
vect.fit_transform(text).toarray()

array([[1, 1, 1, 1]])

In [14]:
vect.vocabulary_

{'family': 1, 'important': 2, 'thing': 3, 'everything': 0}

# Document-Term matrix, DTM 
- stack the BoW vectors of multiple documents into one matrix (rows: documents, columns: terms)

- most entries are zero -> sparse vectors & sparse matrix
- so it is important to reduce the size of the vocabulary
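
A minimal sketch of building a DTM by hand, assuming whitespace tokenization and using the same four toy documents as the TF-IDF cells in these notes. The names `tokenized` and `dtm` are hypothetical.

```python
from collections import Counter

docs = [
    "먹고 싶은 사과",
    "먹고 싶은 바나나",
    "길고 노란 바나나 바나나",
    "저는 과일이 좋아요",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word.
dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    dtm.append([counts[word] for word in vocab])

print(vocab)
for row in dtm:
    print(row)

# Fraction of zero entries: most of the matrix is empty even for this tiny corpus.
zeros = sum(value == 0 for row in dtm for value in row)
print(zeros / (len(dtm) * len(vocab)))
```

Each added document usually brings new columns that are zero for every other row. This is why reducing the vocabulary helps.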

## TF-IDF : how to tell stopwords apart from important words


In [None]:
# TF-IDF (Term Frequency-Inverse Document Frequency) : weights applied on top of the DTM so that important words stand out
- tf(d,t) : frequency of word t in document d
- df(t) : number of documents in which word t appears
- idf(t) : inversely related to df(t); idf(t) = log(n / (1 + df(t))), where n is the total number of documents



In [16]:
import pandas as pd
from math import log

docs = [
  '먹고 싶은 사과',
  '먹고 싶은 바나나',
  '길고 노란 바나나 바나나',
  '저는 과일이 좋아요'
] 
vocab = list(set(w for doc in docs for w in doc.split()))
vocab.sort()
vocab


['과일이', '길고', '노란', '먹고', '바나나', '사과', '싶은', '저는', '좋아요']

In [18]:
vocab = list(set(w for doc in docs for w in doc.split()))
vocab

['노란', '길고', '과일이', '바나나', '사과', '싶은', '저는', '먹고', '좋아요']

In [20]:
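# caution: there is no split() here, so vocab now holds the 4 whole documents;
# the tables in the cells below are therefore indexed by documents, not by words
vocab = list(set(doc for doc in docs))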

In [23]:
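# fails: the for clauses are in the wrong order; `for doc in docs` must come first
# so that doc is bound before doc.split() is evaluated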
vocab = list(set(w for w in doc.split() for doc in docs))
vocab

NameError: name 'doc' is not defined

In [24]:
N = len(docs) #n

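def tf(t,d):
    # occurrences of t in document d (d is a string, so this is substring matching)
    return d.count(t)

def idf(t):
    df = 0
    for doc in docs:
        df += t in doc  # True counts as 1: number of documents containing t
    return log(N/(df+1)) # +1 in the denominator avoids division by zero

def tfidf(t,d):
    return tf(t,d)*idf(t)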

In [25]:
result = []
for i in range(N): # run the steps below for each document
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]        
        result[-1].append(tf(t, d))

tf_ = pd.DataFrame(result, columns = vocab)
tf_


| | 먹고 싶은 사과 | 저는 과일이 좋아요 | 길고 노란 바나나 바나나 | 먹고 싶은 바나나 |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 |


In [31]:
docs[0].count('사과')

1

In [28]:
result = []
for j in range(len(vocab)):
    t = vocab[j]
    result.append(idf(t))

idf_ = pd.DataFrame(result, index = vocab, columns = ["IDF"])
idf_

| | IDF |
|---|---|
| 먹고 싶은 사과 | 0.693147 |
| 저는 과일이 좋아요 | 0.693147 |
| 길고 노란 바나나 바나나 | 0.693147 |
| 먹고 싶은 바나나 | 0.693147 |


In [34]:
t = '바나나'
df = 0
for doc in docs:
    df += t in doc
df

2

In [35]:
result = []
for i in range(N):
    result.append([])
    d = docs[i]
    for j in range(len(vocab)):
        t = vocab[j]

        result[-1].append(tfidf(t,d))

tfidf_ = pd.DataFrame(result, columns = vocab)
tfidf_

| | 먹고 싶은 사과 | 저는 과일이 좋아요 | 길고 노란 바나나 바나나 | 먹고 싶은 바나나 |
|---|---|---|---|---|
| 0 | 0.693147 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.693147 |
| 2 | 0.0 | 0.0 | 0.693147 | 0.0 |
| 3 | 0.0 | 0.693147 | 0.0 | 0.0 |
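
The TF, IDF, and TF-IDF tables above are indexed by whole documents because `vocab` was overwritten in cell In [20]. Below is a minimal sketch of the same computation with a word-level vocabulary. It assumes whitespace tokenization and token-level (not substring) matching. The names `tokenized`, `n_docs`, and `table` are local to this sketch.

```python
from math import log

docs = [
    "먹고 싶은 사과",
    "먹고 싶은 바나나",
    "길고 노란 바나나 바나나",
    "저는 과일이 좋아요",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(docs)


def tf(term, tokens):
    # number of times the term occurs among the document's tokens
    return tokens.count(term)


def idf(term):
    # df: number of documents whose token list contains the term
    df = sum(term in tokens for tokens in tokenized)
    return log(n_docs / (1 + df))


def tfidf(term, tokens):
    return tf(term, tokens) * idf(term)


# One row per document, one column per vocabulary word.
table = [[tfidf(term, tokens) for term in vocab] for tokens in tokenized]
print(vocab)
for doc, row in zip(docs, table):
    print(doc, [round(value, 4) for value in row])
```

With this vocabulary, "바나나" appears in two documents, so its idf is log(4/3), about 0.2877. Words that appear in only one document get log(4/2), about 0.6931.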


In [36]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]
vector = CountVectorizer()
print(vector.fit_transform(corpus).toarray()) # record the frequency of each word in the corpus
print(vector.vocabulary_) # show which index was assigned to each word

[[0 1 0 1 0 1 0 1 1]
 [0 0 1 0 0 0 0 1 0]
 [1 0 0 0 1 0 1 0 0]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}


In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]
tfidfv = TfidfVectorizer().fit(corpus)
print(tfidfv.transform(corpus).toarray())
print(tfidfv.vocabulary_)

[[0.         0.46735098 0.         0.46735098 0.         0.46735098
  0.         0.35543247 0.46735098]
 [0.         0.         0.79596054 0.         0.         0.
  0.         0.60534851 0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.
  0.57735027 0.         0.        ]]
{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}
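
The values above differ from the hand-written `log(n / (1 + df))` formula. With default settings, `TfidfVectorizer` uses a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and then L2-normalizes each row. The sketch below reproduces that in plain Python. It assumes the default tokenization (lowercasing, tokens of two or more word characters, so "I" is dropped). The helper names `smooth_idf` and `tfidf_row` are hypothetical.

```python
import math
import re

corpus = [
    "you know I want your love",
    "I like you",
    "what should I do ",
]

# Assumption: mimic the default tokenizer (lowercase, tokens of 2+ word characters).
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
vocab = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(corpus)


def smooth_idf(term):
    # df: number of documents containing the term
    df = sum(term in tokens for tokens in tokenized)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(tokens):
    # raw tf * idf per vocabulary term, then scale the row to unit L2 norm
    raw = [tokens.count(term) * smooth_idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


print(vocab)
for tokens in tokenized:
    print([round(value, 8) for value in tfidf_row(tokens)])
```

Because of the `+ 1` term, a word that appears in every document still gets a non-zero weight. The normalization makes rows comparable across documents of different lengths.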
