# MVD Exercise 5

## Part 1 - TF-IDF with word embeddings

The previous exercise asked you to implement the TF-IDF algorithm on a dataset from Kaggle. Today's exercise extends that task with word embeddings. You can use the pretrained GloVe embeddings from exercise 3 or, if you are interested, try working with Google's Word2Vec (available [here](https://code.google.com/archive/p/word2vec/)).

The exercise should contain the following parts:
- Loading the articles and embeddings
- Computing document vectors with TF-IDF and word embeddings
    - To compute TF-IDF, use [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) from the sklearn library
    - Weighted average of the GloVe / Word2Vec vectors

<center>
$
doc\_vector = \frac{1}{|d|} \sum\limits_{w \in d} TF\_IDF(w) glove(w)
$
</center>

- The query is transformed in the same way as a document

- Computing relevance with cosine similarity
<center>
$
score(q,d) = cos\_sim(query\_vector, doc\_vector)
$
</center>

### Loading the articles

In [483]:
import pandas as pd
import spacy
import numpy as np
import math
import warnings
from numpy.linalg import norm
from numpy import dot
warnings.filterwarnings('ignore')

In [484]:
lemmatizer = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

In [485]:
df = pd.read_csv('articles.csv', usecols=["title", "text"])

In [486]:
def lemmatize_text(text):
    return " ".join([token.lemma_ for token in lemmatizer(text)])

In [487]:
df['title'] = df['title'].str.replace(r'[^\w\s]', '', regex=True).str.lower()
df['text'] = df['text'].str.replace(r'[^\w\s]', '', regex=True).str.lower()
df['text'] = df['text'].str.replace(r'\s\s+', ' ', regex=True)
df['title'] = df['title'].str.replace(r'\s\s+', ' ', regex=True)

df['title'] = df['title'].apply(lemmatize_text)
df['text'] = df['text'].apply(lemmatize_text)
display(df)

Unnamed: 0,title,text
0,chatbots be the next big thing what happen the...,oh how the headline blare \n chatbot be the ne...
1,python for data science 8 concept you may have...,if you ve ever find yourself look up the same ...
2,automate feature engineering in python towards...,machine learning be increasingly move from han...
3,machine learn how to go from zero to hero free...,if your understanding of ai and machine learni...
4,reinforcement learning from scratch insight datum,want to learn about apply artificial intellige...
...,...,...
332,you can build a neural network in javascript e...,click here to share this article on linkedin s...
333,artificial intelligence ai in 2018 and beyond ...,these be my opinion on where deep neural netwo...
334,spike neural network the next generation of ma...,everyone who have be remotely tune in to recen...
335,surprise neuron be now more complex than we think,one of the big misconception around be the ide...
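
To make the effect of the two regular expressions concrete, here is a small standalone sketch that applies the same patterns with the standard `re` module. The helper name `clean` and the first sample string are made up for illustration.

```python
import re

def clean(text):
    """Drop punctuation, collapse runs of whitespace, and lowercase."""
    text = re.sub(r"[^\w\s]", "", text)   # remove everything that is not a word character or whitespace
    text = re.sub(r"\s\s+", " ", text)    # squeeze two or more whitespace characters into one space
    return text.lower()

cleaned = clean("Chatbots: the NEXT  big thing?!")
url_like = clean("www.cuberevue.com")     # dots are removed, so the parts run together
```

The second call shows why URLs in the cleaned articles appear as single tokens such as `wwwcuberevuecom`.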


In [488]:
def inverted_index(document):
    dic = {}
    for id_doc, row in enumerate(document):
        words = row.split(" ")
        for idx_word, word in enumerate(words):
            if word in dic.keys():
                if id_doc in dic[word]:
                    continue
                dic[word].append(id_doc)
            else:
                dic[word] = [id_doc]
    return dic
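
The structure this function returns is easiest to see on a few toy documents. The sketch below is a compact equivalent built on sets, not the function above; `build_inverted_index` and `toy_docs` are hypothetical names used only here.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in text.split(" "):
            postings[word].add(doc_id)    # a set ignores repeated words within one document
    return {word: sorted(ids) for word, ids in postings.items()}

toy_docs = ["machine learning be fun", "deep learning", "machine translation"]
index = build_inverted_index(toy_docs)
learning_df = len(index["learning"])      # document frequency of "learning"
```

The length of each posting list is the document frequency that the IDF term needs later.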

### Loading the embeddings

In [489]:
glove_words = []
glove_vectors = []
glove_word2idx = {}
file = ['glove/glove.6B.50d.txt','glove/glove.6B.100d.txt','glove/glove.6B.200d.txt','glove/glove.6B.300d.txt']
with open(file[0], encoding='utf-8') as f:
    for idx, line in enumerate(f):
        # Each line holds a token followed by its space-separated vector components
        row = line.rstrip('\n').split(" ")
        glove_words.append(row[0])
        glove_vectors.append([float(num) for num in row[1:]])
        glove_word2idx[row[0]]=idx
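
The GloVe text files store one token per line followed by its vector components. The following sketch parses two made-up lines from an in-memory buffer, so it runs without the real files; the sample values and the variable names are illustrative only.

```python
import io
import numpy as np

# Two invented lines in the GloVe text layout: token, then space-separated floats.
sample_file = io.StringIO("the 0.1 -0.2 0.3\ncat 0.4 0.5 -0.6\n")

words, rows = [], []
for line in sample_file:
    token, *values = line.rstrip("\n").split(" ")
    words.append(token)
    rows.append([float(v) for v in values])

vectors = np.array(rows)                           # shape (n_words, dim)
word2idx = {w: i for i, w in enumerate(words)}     # token -> row index
cat_vector = vectors[word2idx["cat"]]
```

Keeping the vectors in one NumPy array makes later lookups a plain row index rather than a list-to-array conversion per word.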

### TF-IDF + Word2Vec and building the document vectors

In [490]:
def tf_idf(word, count, inv_idx, M):
    return count * math.log((M+1)/len(inv_idx[word]))
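
A few numbers help to see how this weight behaves. The sketch below restates the same formula with explicit arguments; `tf_idf_weight` is a hypothetical helper, and the corpus size of 336 is an assumption read off the row indices 0 to 335 in the output above. The notebook passes the length of whichever series is being transformed as `M`, so for the one-row query frame `M` is 1, and the last line shows what the formula gives in that case.

```python
import math

def tf_idf_weight(count, doc_freq, n_docs):
    """Raw term count times log((M + 1) / df), the formula used by tf_idf."""
    return count * math.log((n_docs + 1) / doc_freq)

n_docs = 336                                                   # assumed corpus size
rare = tf_idf_weight(count=2, doc_freq=3, n_docs=n_docs)       # word found in 3 documents
common = tf_idf_weight(count=2, doc_freq=300, n_docs=n_docs)   # word found in 300 documents
everywhere = tf_idf_weight(count=2, doc_freq=n_docs, n_docs=n_docs)

# With M = 1 and a word that occurs in 3 corpus documents the logarithm is negative.
query_case = tf_idf_weight(count=1, doc_freq=3, n_docs=1)
```

Rare words get the largest weight, and the `+1` keeps the weight slightly above zero even for a word present in every document.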

In [491]:
def tf_idf_word2vec(df, glove_word2idx, glove_vectors, inv_idx):
    # One row per document, one column per embedding dimension
    vectors = np.zeros((len(df),len(glove_vectors[0])))
    for idx_d,text in enumerate(df.tolist()):
        words = text.split(" ")
        for word in words:
            # Skip words without an embedding or without an entry in the inverted index
            if word in glove_word2idx and word in inv_idx:
                vectors[idx_d,:] += (np.array(glove_vectors[glove_word2idx[word]]) * tf_idf(word, words.count(word), inv_idx, len(df)))
        # Divide by the number of tokens |d|
        vectors[idx_d,:] /= len(words)
    return vectors
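
In the loop above, a word that occurs k times in a document is visited k times, and each visit is weighted by its count k, so its vector ends up scaled by k^2 times the IDF. If a weight linear in the term count is intended, one possible variant iterates over distinct words only. This is a sketch under that assumption; `weighted_average`, `toy_embeddings`, and `toy_doc_freq` are hypothetical names, and document frequencies are passed in directly instead of an inverted index.

```python
from collections import Counter
import math
import numpy as np

def weighted_average(text, embeddings, doc_freq, n_docs):
    """TF-IDF weighted average over distinct words; each word is added once."""
    words = text.split(" ")
    dim = len(next(iter(embeddings.values())))
    total = np.zeros(dim)
    for word, count in Counter(words).items():
        if word in embeddings and word in doc_freq:
            weight = count * math.log((n_docs + 1) / doc_freq[word])
            total += weight * embeddings[word]
    return total / len(words)                 # divide by the number of tokens |d|

toy_embeddings = {"deep": np.array([1.0, 0.0]), "learning": np.array([0.0, 1.0])}
toy_doc_freq = {"deep": 1, "learning": 2}
vec = weighted_average("deep deep learning", toy_embeddings, toy_doc_freq, n_docs=3)
```

Using `Counter` also avoids calling `words.count(word)` for every token, which matters for long article texts.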

### Transforming the query and computing relevance

In [492]:
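def cos_sim(vec_querry, vec2):
    vec_querry = np.ravel(vec_querry)  # the query arrives with shape (1, D)
    score = np.zeros(vec2.shape[0])
    for i in range(vec2.shape[0]):
        # Absolute value of the cosine between the query and document i
        score[i] = np.abs(dot(vec_querry, vec2[i,:]))/(norm(vec_querry) * norm(vec2[i,:]))
    return score
    #return dot(vec_querry, np.transpose(vec2))/(norm(vec_querry)*norm(vec2,axis=1))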
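
The commented-out line hints at a vectorized form. A minimal sketch of that idea follows; `cosine_scores` and the `eps` guard are assumptions added here. Unlike `cos_sim`, it does not take the absolute value, so opposite vectors score -1, and all-zero document rows score 0 instead of producing a division by zero.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix, eps=1e-12):
    """Cosine similarity between one query vector and every row of doc_matrix."""
    query_vec = np.ravel(query_vec)                      # accept shape (D,) or (1, D)
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec)
    return (doc_matrix @ query_vec) / np.maximum(norms, eps)

docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [-1.0, 0.0]])
scores = cosine_scores(np.array([[2.0, 0.0]]), docs)
```

The four rows cover a parallel, a diagonal, an all-zero, and an opposite document vector.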

In [494]:
inv_title = inverted_index(df['title'].tolist())
inv_text = inverted_index(df['text'].tolist())


vectors_title = tf_idf_word2vec(df['title'], glove_word2idx,glove_vectors, inv_title)
vectors_text = tf_idf_word2vec(df['text'], glove_word2idx,glove_vectors, inv_text)
querry = "coursera vs udacity machine learning"

df_querry = pd.DataFrame([querry], columns=['querry'])
df_querry['querry'] = df_querry['querry'].apply(lemmatize_text)

querry_vec_title = tf_idf_word2vec(df_querry['querry'], glove_word2idx, glove_vectors, inv_title)
querry_vec_text = tf_idf_word2vec(df_querry['querry'], glove_word2idx, glove_vectors, inv_text)
alpha = 0.7
df['score'] = np.squeeze(alpha*cos_sim(querry_vec_title,vectors_title) + (1-alpha)*cos_sim(querry_vec_text,vectors_text))
df = df.sort_values(by=['score'], ascending=False)
display(df)

Unnamed: 0,title,text,score
144,a beginner guide to aiml machine learning for ...,part 1 why machine learning matter the big pic...,0.834309
68,a beginner guide to aiml machine learning for ...,part 1 why machine learning matter the big pic...,0.834309
196,a beginner guide to aiml machine learning for ...,part 1 why machine learning matter the big pic...,0.834309
312,learn how to code neural network learn new stu...,this be the second post in a series of I try t...,0.824249
169,machine learning be fun part 3 deep learning a...,update this article be part of a series check ...,0.822888
...,...,...,...
242,announce poncho the weatherbot renderfrombetawork,you can now get personal weather forecast in s...,0.274610
167,o grupo de estudo em deep learning de brasilia...,o grupo de estudo em deep learning de brasilia...,0.203857
234,de la cooperation entre les homme et les machi...,originally publish at wwwcuberevuecom on novem...,0.186547
307,semantica desde informacion desestructurada be...,detectar patrone es un nucleo importante en el...,0.154085
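
The final score is a weighted sum of the title and text similarities, and the top of the ranking above contains the same article three times. The sketch below shows the same combination on made-up scores and one possible way to drop repeated titles before ranking; the toy frame, its column names, and the use of `drop_duplicates` are illustrative assumptions rather than part of the original solution.

```python
import pandas as pd

alpha = 0.7                                   # same title weight as in the notebook
toy = pd.DataFrame({
    "title": ["guide to ml", "guide to ml", "weather bot", "neural nets"],
    "title_score": [0.9, 0.9, 0.2, 0.6],      # made-up cosine similarities
    "text_score": [0.5, 0.5, 0.8, 0.6],
})
toy["score"] = alpha * toy["title_score"] + (1 - alpha) * toy["text_score"]

ranked = (
    toy.drop_duplicates(subset="title")       # keep the first copy of each title
       .sort_values(by="score", ascending=False)
       .reset_index(drop=True)
)
```

With `alpha = 0.7`, a strong title match outweighs a strong text match, which is why "weather bot" ends up last here despite its higher text score.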
