# MVD Exercise 5

## Part 1 - TF-IDF with word embeddings

The previous exercise asked you to implement the TF-IDF algorithm over a dataset from Kaggle. Today's exercise extends that task with word embeddings. You can use the pretrained GloVe embeddings from exercise 3 or, if you are interested, try working with Google's Word2Vec (available [here](https://code.google.com/archive/p/word2vec/)).

The exercise should contain the following parts:
- Loading the articles and embeddings
- Computing document vectors using TF-IDF and word embeddings
    - Use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from the sklearn library to compute TF-IDF
    - Weighted average of GloVe / Word2Vec vectors

<center>
$
doc\_vector = \frac{1}{|d|} \sum\limits_{w \in d} TF\_IDF(w) glove(w)
$
</center>

- The query is transformed in the same way as a document

- Relevance is computed using cosine similarity
<center>
$
score(q,d) = cos\_sim(query\_vector, doc\_vector)
$
</center>

### Loading the articles

In [1]:
import spacy
import pandas
import numpy as np

lemmatizer = spacy.load('en_core_web_sm', disable=['parser', 'ner'])  # spaCy pipeline, used here only for lemmatization


def normalize_data(db):
    # set all to lowercase
    db['text'] = db['text'].str.lower()
    db['title'] = db['title'].str.lower()

    # delete all special characters
    db['text'] = db['text'].str.replace(r'\W', ' ', regex=True)
    db['title'] = db['title'].str.replace(r'\W', ' ', regex=True)
    # delete multiple spaces
    db['text'] = db['text'].str.replace(r'\s+', ' ', regex=True)
    db['title'] = db['title'].str.replace(r'\s+', ' ', regex=True)
    return db


#data load and normalize
df = pandas.read_csv('articles.csv')
df = df[['title', 'text']]
df_norm = normalize_data(df)
df



Unnamed: 0,title,text
0,chatbots were the next big thing what happened...,oh how the headlines blared chatbots were the ...
1,python for data science 8 concepts you may hav...,if you ve ever found yourself looking up the s...
2,automated feature engineering in python toward...,machine learning is increasingly moving from h...
3,machine learning how to go from zero to hero f...,if your understanding of a i and machine learn...
4,reinforcement learning from scratch insight data,want to learn about applied artificial intelli...
...,...,...
332,you can build a neural network in javascript e...,click here to share this article on linkedin s...
333,artificial intelligence ai in 2018 and beyond ...,these are my opinions on where deep neural net...
334,spiking neural networks the next generation of...,everyone who has been remotely tuned in to rec...
335,surprise neurons are now more complex than we ...,one of the biggest misconceptions around is th...


### Loading the embeddings

In [2]:
# Each line of the GloVe file is "<word> <v1> ... <v50>", separated by single spaces.
with open('glove.6B/glove.6B.50D.txt', encoding='utf8') as f:
    data = f.readlines()

word = []        # row index -> word
word2idx = {}    # word -> row index in `vec`
vec = np.zeros((len(data), len(data[0].split(' ')) - 1))
for i, item in enumerate(data):
    splited = item.rstrip('\n').split(' ')
    word.append(splited[0])
    vec[i, :] = np.asarray(splited[1:], dtype=float)  # explicit string -> float conversion
    word2idx[splited[0]] = i

print(vec[0])
print(word[0])
print(word2idx['the'])

[ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
the
0
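
As a small illustration of the file format the loader above expects, the following sketch parses two made-up lines from an in-memory buffer instead of the real GloVe file. The sample words and values are invented, and the streaming style (parsing each line as it is read rather than keeping all raw lines) is only one possible variant.

```python
import io
import numpy as np

# Two fake lines in GloVe text format: "<word> <v1> <v2> <v3>" (values are made up).
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")

words, rows = [], []
for line in sample:
    parts = line.rstrip("\n").split(" ")
    words.append(parts[0])
    rows.append([float(x) for x in parts[1:]])

vec_small = np.array(rows)                              # shape (n_words, dim)
word2idx_small = {w: i for i, w in enumerate(words)}    # word -> row index

print(vec_small.shape, word2idx_small["cat"])
```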


In [30]:
def cossim(a,b):
    return (a @ b) / ((np.linalg.norm(a)*np.linalg.norm(b)) + 0.0000001)
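
A quick check of how `cossim` behaves on simple inputs may help when debugging scores. The function is repeated here so the snippet runs on its own, and the test vectors are arbitrary.

```python
import numpy as np


def cossim(a, b):
    return (a @ b) / ((np.linalg.norm(a) * np.linalg.norm(b)) + 0.0000001)


parallel = cossim(np.array([1.0, 2.0]), np.array([2.0, 4.0]))    # same direction, close to 1
orthogonal = cossim(np.array([1.0, 0.0]), np.array([0.0, 3.0]))  # no shared direction, 0
zero = cossim(np.zeros(2), np.array([1.0, 1.0]))                 # epsilon prevents a NaN

print(parallel, orthogonal, zero)
```

The epsilon in the denominator matters for titles in which no word is found in the embedding vocabulary: their document vector stays all zeros and the score becomes 0 instead of NaN.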

In [16]:
def tf(words):
    out = {}
    for word in words:
        if word not in out:
            out[word] = 1
        else:
            out[word] += 1
    return out

def tf_idf(q,d,ii):
    """Score every document in d against the query q, using the inverted index ii for document frequencies."""
    M = len(d)  # number of documents
    q_lem = " ".join([token.lemma_ for token in lemmatizer(q)]).split(' ')
    freq = tf(q_lem)  # term counts in the query
    out = []
    for doc in d:
        scores = 0
        # term counts in this document (after lemmatization)
        wd = tf(" ".join([token.lemma_ for token in lemmatizer(doc)]).split(' '))
        for word in q_lem:
            if word in wd:
                # query tf * document tf * idf, with df read from the inverted index
                scores += freq[word] * wd[word] * np.log((M+1) / len(set(ii[word])))
        out.append(scores)
    return out

In [5]:
def inverted_index(texts):
    text_dic = {}
    for i,line in enumerate(texts):
        for word in " ".join([token.lemma_ for token in lemmatizer(line)]).split(' '):
            if word not in text_dic:
                text_dic[word] = [i]
            else:
                text_dic[word].append(i)

    return text_dic

In [6]:
text_ii = inverted_index(df_norm['text'])
title_ii = inverted_index(df_norm['title'])
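
To see how the inverted index feeds the IDF term `np.log((M+1) / len(set(ii[word])))` used in `tf_idf`, here is a toy version that splits on spaces instead of lemmatizing with spaCy. The helper `toy_inverted_index` and the three titles are illustrative assumptions.

```python
import numpy as np


def toy_inverted_index(texts):
    """Map each word to the list of document ids it occurs in (with repeats)."""
    index = {}
    for i, line in enumerate(texts):
        for w in line.split(" "):
            index.setdefault(w, []).append(i)
    return index


titles = ["deep learning basics", "deep deep networks", "cooking basics"]
ii_toy = toy_inverted_index(titles)
M = len(titles)

# set(...) collapses repeated occurrences in one document, giving the document frequency.
idf = {w: np.log((M + 1) / len(set(postings))) for w, postings in ii_toy.items()}

print(ii_toy["deep"], idf["deep"], idf["cooking"])
```

Words that appear in fewer documents get a larger IDF, so `cooking` (one document) outweighs `deep` (two documents).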

### TF-IDF + GloVe and building the document vectors

In [43]:
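def doc_vec(d,ii):
    doc_v = np.zeros((len(d),1,50))
    for i,dv in enumerate(d):
        d1_vec = np.zeros((1,50))
        # TF-IDF weight of this document: its own text is used as the query and the scores are summed.
        # Note: di is a single scalar per document, so it rescales the vector but does not change cosine similarity.
        di = np.array(tf_idf(dv,d,ii)).sum(axis = 0)
        for w in dv.split(" "):
            if w in word2idx:
                d1_vec += vec[word2idx[w]] * di
        d1_vec /= len(dv.split(" "))  # average over the document length |d|
        doc_v[i] = d1_vec
        print(i)  # progress indicator; every iteration re-lemmatizes the whole corpus
    return doc_v

d_vec = doc_vec(df_norm['title'],title_ii)
print(d_vec)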

0
1
2
...

In [42]:


def query_vec(q,d,ii):
    q_v = np.zeros((1,50))
    for w in q.split(" "):
        if w in word2idx:
            # weight each query word by its TF-IDF score summed over all documents
            q_v += vec[word2idx[w]] * np.array(tf_idf(w,d,ii)).sum(axis = 0)
    q_v /= len(q.split(" "))  # average over the query length
    return q_v


[[ -48.28642132 -114.87699405  263.30144729  -34.85819443   28.15154816
   128.53535078  -28.05336303 -232.95894108  -37.97448448   95.27181213
    57.94754734 -117.47802273  -92.28301899  -58.50592591 -214.26793548
   -44.75310992 -209.86734907  470.05982582 -104.89031137 -223.32267358
   235.95310282  195.40060415  -76.35850314   75.64537188  312.5953846
  -420.06441488 -304.61407369 -264.36193981  331.12764364 -175.93353586
  1253.88225721  -96.86190128 -268.51948023 -190.30090023   49.89546052
   290.31803348  122.44014715  208.55126311   99.49805982  109.69392858
   272.63552347 -164.95314345 -257.00323717  331.12675617  154.58771327
  -107.67416018  327.45644999  -58.91725682    4.20906798  165.75764451]]


### Query transformation and relevance computation

In [44]:
alpha = 0.7
qs = 'coursera vs udacity machine learning'
sim_title = []
q_vec = query_vec(qs,df_norm['title'],title_ii)
print(q_vec)
for i in range(d_vec.shape[0]):
    sim_title.append(cossim(np.squeeze(q_vec),np.squeeze(d_vec[i,:])))

#np.array(tf_idf(qs, df_norm['text'], text_ii))
test = list(alpha * np.array(sim_title))

df_norm['score'] = test
sorted_df = df_norm.sort_values(by='score', ascending=False)
sorted_df

[[ -6.10533147  -8.42828629  21.7225685   -0.87193982  -0.35134624
   10.12982589  -0.72145852 -19.7847615   -3.2313815    8.45122462
    5.56506736 -10.68801083  -7.8287766   -6.65396507 -18.50859228
   -3.76011018 -17.80316582  39.06653678 -10.25425096 -16.25390679
   19.45546041  14.74946449  -6.51509618   8.49808727  26.40632496
  -34.95873162 -22.79594686 -23.42215637  26.72050866 -14.91365807
  106.69365229  -5.87175197 -22.3921459  -14.57018479   3.6158006
   24.21315467  10.37955324  17.99955362   7.50820361   6.74671778
   25.11893168 -15.08138985 -20.59913838  25.57670735  14.56643153
   -6.72564207  29.55293695  -5.83870106   2.33114702  13.55679168]]


Unnamed: 0,title,text,score
90,an intro to machine learning for designers ux ...,there is an ongoing debate about whether or no...,0.620963
6,an intro to machine learning for designers ux ...,there is an ongoing debate about whether or no...,0.620963
192,ultimate guide to leveraging nlp machine learn...,code snippets and github included over the pas...,0.612662
202,cheat sheets for ai neural networks machine le...,over the past few months i have been collectin...,0.609367
292,cheat sheets for ai neural networks machine le...,over the past few months i have been collectin...,0.609367
...,...,...,...
234,de la coope ration entre les hommes et les mac...,originally published at www cuberevue com on n...,0.298082
286,multi stream rnn concat rnn internal conv rnn ...,for the last two week i have been dying to imp...,0.286621
167,o grupo de estudo em deep learning de brasi li...,o grupo de estudo em deep learning de brasi li...,0.268719
307,sema ntica desde informacio n desestructurada ...,detectar patrones es un nu cleo importante en ...,0.223020


## Bonus - Query autocompletion

The bonus for today's exercise is query autocompletion using recurrent neural networks. The task is to build a simple recurrent neural network that generates text (character-level approach).

It is best to start after finishing the exercise for the ANS course, where this task is covered.

A dataset for training your neural network is available from [Yahoo research](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1); you can also use, for example, the larger [Kaggle dataset](https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data) or look for another dataset on [Google DatasetSearch](https://datasetsearch.research.google.com/).

The input is a partially typed query and the output should be at least 3 completed queries.
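
One way to turn a character-level model into several completions is beam search over its next-character probabilities. The sketch below shows only that decoding step. It does not train an RNN: `make_prefix_model` is a hypothetical count-based stand-in whose returned function plays the role a trained network would play, and the query list, `END` marker, and parameter names are all assumptions.

```python
import math
from collections import Counter

END = "\n"  # end-of-query marker (assumed never to occur inside a query)


def make_prefix_model(queries):
    """Stand-in for a trained model: log P(next char | text so far), from prefix counts."""
    def next_char_log_probs(text):
        matches = [q + END for q in queries if (q + END).startswith(text)]
        counts = Counter(m[len(text)] for m in matches if len(m) > len(text))
        total = sum(counts.values())
        return {c: math.log(n / total) for c, n in counts.items()}
    return next_char_log_probs


def complete(next_char_log_probs, prefix, k=3, beam_width=10, max_len=40):
    """Beam search: keep the beam_width most probable partial queries at each step."""
    beams = [(0.0, prefix)]
    finished = {}  # completed query -> best log-probability seen
    for _ in range(max_len):
        candidates = []
        for score, text in beams:
            for c, lp in next_char_log_probs(text).items():
                if c == END:
                    finished[text] = max(score + lp, finished.get(text, -math.inf))
                else:
                    candidates.append((score + lp, text + c))
        if not candidates:
            break
        beams = sorted(candidates, reverse=True)[:beam_width]
    # Rank by log-probability, breaking ties alphabetically for a stable order.
    ranked = sorted(finished.items(), key=lambda kv: (-kv[1], kv[0]))
    return [text for text, _ in ranked[:k]]


training_queries = [
    "machine learning", "machine learning",
    "machine translation", "machine vision", "deep learning",
]
model = make_prefix_model(training_queries)
suggestions = complete(model, "mach")
print(suggestions)
```

With an RNN, `next_char_log_probs` would instead run the network on the text typed so far and return the log-softmax over its output characters; the beam search itself would stay the same.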