# MVD 5. cvičení

## 1. část - TF-IDF s word embeddingy

V minulém cvičení bylo za úkol implementovat TF-IDF algoritmus nad datasetem z Kagglu. Dnešní cvičení je rozšířením této úlohy s použitím word embeddingů. Lze použít předtrénované GloVe embeddingy ze 3. cvičení, nebo si v případě zájmu můžete vyzkoušet práci s Word2Vec od Googlu (najdete [zde](https://code.google.com/archive/p/word2vec/)).

Cvičení by mělo obsahovat následující části:
- Načtení článků a embeddingů
- Výpočet document vektorů pomocí TF-IDF a word embeddingů 
    - Pro výpočet TF-IDF využijte [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) z knihovny sklearn
    - Vážený průměr GloVe / Word2Vec vektorů

<center>
$
doc\_vector = \frac{1}{|d|} \sum\limits_{w \in d} TF\_IDF(w) glove(w)
$
</center>

- Dotaz bude transformován stejně jako dokument

- Výpočet relevance pomocí kosinové podobnosti
<center>
$
score(q,d) = cos\_sim(query\_vector, doc\_vector)
$
</center>

### Načtení článků

In [None]:
import csv
def load(inf):
    cl = list()
    titleId = dict()
    textId = dict()
    titlesList = list()
    textsList = list()
    count = 0
    with open(inf,"r",encoding="utf-8") as file:
        reader = csv.reader(file)
        next(reader)
        for i,line in enumerate(reader):
            cl.append(i)
            count = i
            titlesList.append(line[4])
            textsList.append(line[5])
            for word in line[4].split():
                if not word in titleId:
                    titleId[word] = [i]
                else:
                    titleId[word].append(i)
            for word in line[5].split():
                if not word in textId:
                    textId[word] = [i]
                else:
                    textId[word].append(i)
    return titleId,textId,count,titlesList,textsList
_,_,count,titlesList,textsList = load("articlesLEMMA.csv")
#print(titleId['a'][:5],textId['a'][:100],count)


### Načtení embeddingů

In [2]:
import numpy as np
VecFile = "glove.6B.300d.txt"
vecSize = 300
wordVec = list()
vecVec = list()
wordDict = dict()
read = 0

with open(VecFile,"r", encoding = "utf-8") as file:
    if(read>0):
        for i in range(read):
            line  = file.readline()
            line = line.strip().split(" ")
            wordVec.append(line[0])
            wordDict[line[0]] = i
            vecVec.append(np.array(line[1:]).astype(float))
    else:
        i = 0
        for iline in file:
            line = iline.strip().split(" ")
            wordVec.append(line[0])
            wordDict[line[0]] = i
            vecVec.append(np.array(line[1:]).astype(float))
            i += 1

In [3]:
import numpy as np
def dist(vecl,vecr):
    vv = np.dot(vecl,vecr)
    a = np.sqrt(np.sum(np.square(vecl)))
    b = np.sqrt(np.sum(np.square(vecr)))
    return (vv/(a*b))

### TF-IDF + Word2Vec a vytvoření doc vektorů

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
titlesVectorizer = TfidfVectorizer()
titlesFeatures = titlesVectorizer.fit_transform(titlesList)
print(titlesVectorizer.get_feature_names_out()[:20])
print(titlesFeatures.shape)
textVectorizer = TfidfVectorizer()
textsFeatures = textVectorizer.fit_transform(textsList)
print('a' in textVectorizer.get_feature_names_out())
print(textsFeatures.shape)



['019' '10' '1000x' '101' '1700' '18' '2012' '2017' '2018' '30' '37' '73'
 '90' 'a3c' 'abdulla' 'about' 'achievement' 'activation' 'actor' 'adam']
(337, 816)
False
(337, 16324)


In [5]:
from ipywidgets import IntProgress
from IPython.display import display
from os import path
import csv
lst = list(textVectorizer.get_feature_names_out())
def doc_vector_text(docIndex):
    vec = np.zeros((vecSize))
    wordlist = textsList[docIndex].split(' ')
    wordHist = dict()
    for w in wordlist:
        if w in wordHist:
            wordHist[w]+=1
        else:
            wordHist[w] = 1
    dicwordCount = len(wordlist)
    for w in wordHist:
        if w in lst:
            indx = list(textVectorizer.get_feature_names_out()).index(w)
            if w in wordDict:
                vec+= vecVec[wordDict[w]] * textsFeatures[docIndex,indx]*wordHist[w]
    return vec/dicwordCount
titlelst = list(titlesVectorizer.get_feature_names_out())
def doc_vector_title(docIndex):
    vec = np.zeros((vecSize))
    wordlist = titlesList[docIndex].split(' ')
    wordHist = dict()
    for w in wordlist:
        if w in wordHist:
            wordHist[w]+=1
        else:
            wordHist[w] = 1
    dicwordCount = len(wordlist)
    for w in wordHist:
        if w in titlelst:
            indx = list(titlesVectorizer.get_feature_names_out()).index(w)
            if w in wordDict:
                vec+= vecVec[wordDict[w]] * titlesFeatures[docIndex,indx]*wordHist[w]
    return vec/dicwordCount


In [None]:
textDocVec = list()
filestr = "textsVec.csv"
if not path.exists(filestr):
    bar = IntProgress(min=0, max=count)
    display(bar)
    with open(filestr,"w",newline="") as file:
        writer = csv.writer(file)
        for i in range(count):
            docvec_text = doc_vector_text(i)
            textDocVec.append(docvec_text)
            line = list(docvec_text)
            writer.writerow(line)
            bar.value+=1
else:
    print("file \"textsVec.csv\" already exists")
    with open(filestr,"r") as file:
        reader = csv.reader(file)
        for line in reader:
            linef = [float(x) for x in line]
            linenp = np.asarray(linef)
            textDocVec.append(linenp)
        if len(textDocVec) == count:
            print("file \"textsVec.csv\" OK")
        else:
            print("file \"textsVec.csv\" NOK!!!")
            print("len",len(textDocVec))
            bar = IntProgress(min=0, max=count)
            bar.value=len(textDocVec)
            display(bar)
            with open(filestr,"a",newline="") as file:
                writer = csv.writer(file)
                for i in range(len(textDocVec),count):
                    docvec_text = doc_vector_text(i)
                    textDocVec.append(docvec_text)
                    line = list(docvec_text)
                    writer.writerow(line)
                    bar.value+=1

In [None]:
titleDocVec = list()
if not path.exists("titlesVec.csv"):
    bar = IntProgress(min=0, max=count)
    display(bar)
    with open("titlesVec.csv","w",newline="") as file2:
        titlesWriter = csv.writer(file2)
        for i in range(count):
            docvec_titles = doc_vector_title(i)
            titleDocVec.append(docvec_titles)
            line_title = list(docvec_titles)
            titlesWriter.writerow(line_title)
            bar.value+=1
else:
    print("file \"titlesVec.csv\" already exists")
    with open("titlesVec.csv","r") as file:
        reader = csv.reader(file)
        for line in reader:
            linef = [float(x) for x in line]
            linenp = np.asarray(linef)
            titleDocVec.append(linenp)
        if len(titleDocVec) == count:
            print("file \"titlesVec.csv\" OK")
        else:
            print("file \"titlesVec.csv\" NOK!!!")
            print("len",len(titleDocVec))
            bar = IntProgress(min=0, max=count)
            display(bar)
            with open("titlesVec.csv","w",newline="") as file2:
                titlesWriter = csv.writer(file2)
                for i in range(len(titleDocVec),count):
                    docvec_titles = doc_vector_title(i)
                    titleDocVec.append(docvec_titles)
                    line_title = list(docvec_titles)
                    titlesWriter.writerow(line_title)
                    bar.value+=1

file "titlesVec.csv" already exists
file "titlesVec.csv" OK


### Transformace dotazu a výpočet relevance

In [None]:
def scoreText(q:str, d:int)->float:
    if(d>=len(textDocVec)):
        return 0
    qvec = np.zeros((vecSize))
    wordlist = q.split(' ')
    dicwordCount = len(wordlist)
    for w in wordlist:
        if w in lst:
            indx = list(textVectorizer.get_feature_names_out()).index(w)
            if w in wordDict:
                qvec+=vecVec[wordDict[w]] * textsFeatures[d,indx]
    qvec = qvec/dicwordCount
    return dist(textDocVec[d],qvec)

def scoreTitle(q:str, d:int)->float:
    if(d>=len(titleDocVec)):
        return 0
    qvec = np.zeros((vecSize))
    wordlist = q.split(' ')
    dicwordCount = len(wordlist)
    for w in wordlist:
        if w in titlelst:
            indx = list(titlesVectorizer.get_feature_names_out()).index(w)
            if w in wordDict:
                qvec+=vecVec[wordDict[w]] * titlesFeatures[d,indx]
    qvec = qvec/dicwordCount
    return dist(titleDocVec[d],qvec)



In [None]:
def scoreDocTitle(q:str,d:int,alpha:float = 0.7):
    docScore = alpha * scoreTitle(q,d)+ (1-alpha) * scoreText(q,d)
    if docScore != docScore:
        return 0
    else:
        return docScore
tts = list()
text = "coursera vs udacity machine learning"
for i in range(count):
    tts.append((i,scoreDocTitle(text,i)))
tts.sort(key = lambda x: x[1],reverse=True)
titles = list()
texts = list()
with open("articles.csv","r",encoding="utf-8") as file:
    reader = csv.reader(file)
    next(reader)
    for line in reader:
        titles.append(line[4])
        texts.append(line[5])

for i in range(10):
    print(tts[i][0],titles[tts[i][0]][:40],'... :',texts[tts[i][0]][:40],'... :',tts[i][1])


  return (vv/(a*b))


0 Chatbots were the next big thing: what h ... : Oh, how the headlines blared:
Chatbots w ... : nan
1 Python for Data Science: 8 Concepts You  ... : If you’ve ever found yourself looking up ... : nan
2 Automated Feature Engineering in Python  ... : Machine learning is increasingly moving  ... : nan
6 An intro to Machine Learning for designe ... : There is an ongoing debate about whether ... : 0.5526526190915289
7 The Big List of DS/ML Interview Resource ... : Data science interviews certainly aren’t ... : nan
9 What I learned from interviewing at mult ... : Over the past 8 months, I’ve been interv ... : nan
10 From Ballerina to AI Researcher: Part I  ... : Last year, I published the article “From ... : nan
11 3 Ways to Apply Latent Semantic Analysis ... : Latent semantic analysis works on large- ... : nan
14 Machine Learning is Fun! Part 3: Deep Le ... : Update: This article is part of a series ... : 0.7000479615271177
15 Machine Learning is Fun! Part 4: Modern  ... : Update: This arti

## Bonus - Našeptávání

Bonusem dnešního cvičení je našeptávání pomocí rekurentních neuronových sítí. Úkolem je vytvořit jednoduchou rekurentní neuronovou síť, která bude generovat text (character-level přístup). 

Optimální je začít po dokončení cvičení k předmětu ANS, kde se tato úloha řeší. 

Dataset pro učení vaší neuronové sítě naleznete na stránkách [Yahoo research](https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&guccounter=1), lze využít např. i větší [Kaggle dataset](https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data) nebo vyhledat další dataset na [Google DatasetSearch](https://datasetsearch.research.google.com/).

Vstupem bude rozepsaný dotaz a výstupem by měly být alespoň 3 dokončené dotazy.