# Laboratory 4 - Singular Value Decomposition

##### Aleksandra Mazur

## Task 1: Search engine

#### 1. 
Prepare a large (> 1000 items) collection of text documents in English
(e.g. a chosen text corpus, a subset of Wikipedia articles, a set of HTML
documents gathered with a web crawler, or a set of chapters cut from
different books).

A collection of 1100 English text files was downloaded from https://ebible.org/find/details.php?id=eng-web&all=1.

The first two files are shown below.

In [1]:
import numpy as np
import os
import io

files = os.listdir('documents/')

In [2]:
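def show_two_files(files):
    # Print the contents of the first two files in the collection.
    for i, file in enumerate(files[:2]):
        # The context manager closes the file once it has been read.
        with io.open('documents/' + file, encoding="utf8") as f:
            text_from_file = f.read()
        print("File number: ", i)
        print(text_from_file)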

In [3]:
show_two_files(files)

File number:  0
﻿This set of files contains a script of canonical text, chapter by chapter,
for the purpose of reading to make an audio recording.
All footnotes, introductions, and verse numbers have been stripped out.


File number:  1
﻿The First Book of Moses, Commonly Called Genesis.
Chapter 1.
In the beginning, God created the heavens and the earth. 
The earth was formless and empty. Darkness was on the surface of the deep and God’s Spirit was hovering over the surface of the waters. 
God said, “Let there be light,” and there was light. 
God saw the light, and saw that it was good. God divided the light from the darkness. 
God called the light “day”, and the darkness he called “night”. There was evening and there was morning, the first day. 
God said, “Let there be an expanse in the middle of the waters, and let it divide the waters from the waters.” 
God made the expanse, and divided the waters which were under the expanse from the waters which were above the expanse; and it was s

#### 2. 
Define the dictionary of keywords (terms) needed to build the bag-of-words
feature vectors (indexing). For example, this set can be the union of all
words occurring in all texts.

In [4]:
import re
import nltk
nltk.download('punkt')

The keyword dictionary was defined as the union of all words occurring in all texts.

In [5]:
def create_dictionary(files):
    words_with_quantity = {}
    
    for file in files:
        f = io.open('documents/' + file, encoding="utf8")
        text_from_file = f.read()
        f.close()
        sentences = text_from_file.split('\n')
        
        for sentence in sentences:
            sentence = sentence.lower()
            sentence = re.sub(r'[^\w\s]', '', sentence)
            sentence = re.sub('[0-9]', '', sentence)
            words = nltk.word_tokenize(sentence)
            
            for word in words:
                if word in words_with_quantity.keys():
                    words_with_quantity[word] += 1
                else:
                    words_with_quantity[word] = 1
                    
    return words_with_quantity

In [6]:
words_with_quantity = create_dictionary(files)
dictionary = words_with_quantity.keys()
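
The same counting can be sketched with `collections.Counter`. This is a minimal, self-contained illustration rather than the notebook's code: it tokenizes with a plain whitespace split instead of `nltk.word_tokenize`, and the helper name `count_words` and the sample texts are assumptions made up for the example.

```python
import re
from collections import Counter

def count_words(texts):
    """Count word occurrences over an iterable of strings (hypothetical helper)."""
    counts = Counter()
    for text in texts:
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)   # drop punctuation
        text = re.sub(r'[0-9]', '', text)     # drop digits
        counts.update(text.split())           # whitespace tokenization (assumption)
    return counts

sample_texts = ["In the beginning, God created the heavens.", "The earth was empty."]
sample_counts = count_words(sample_texts)
print(sample_counts.most_common(1))
```

The keys of the resulting counter play the role of `dictionary`, and `most_common(n)` replaces the manual sort used to list the most frequent words.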

The 10 most frequent words in the text files are listed below together with their occurrence counts.

In [7]:
def show_most_popular_words(quantity, words_with_quantity):
    words_list = sorted(words_with_quantity.items(), key = lambda x: x[1], reverse=True)
    print("Number of words in the dictionary: ", len(words_with_quantity))
    print(quantity, " most popular words:")
    for i, element in enumerate(words_list):
        print(element[0], " -> ", element[1])
        if i == quantity - 1:
            break

In [8]:
show_most_popular_words(10, words_with_quantity)

Number of words in the dictionary:  13675
10  most popular words:
the  ->  53774
of  ->  31534
and  ->  30339
to  ->  19633
you  ->  12157
in  ->  12045
he  ->  9289
will  ->  9088
a  ->  8511
for  ->  8391


#### 3.
For each document j, compute the bag-of-words feature vector $d_j$ containing
the frequencies of the individual words (terms) in the text.

In [9]:
def create_vector(file, dictionary):
    bow = {}
    for word in dictionary:
        bow[word] = 0
    f = io.open('documents/' + file, encoding="utf8")
    text_from_file = f.read()
    f.close()
    sentences = text_from_file.split('\n')

    for sentence in sentences:
        sentence = sentence.lower()
        sentence = re.sub(r'[^\w\s]', '', sentence)
        sentence = re.sub('[0-9]', '', sentence)
        words = nltk.word_tokenize(sentence)

        for word in words:
            bow[word] += 1
    return bow

In [10]:
def show_words (dictionary, without_zero = True):
    sorted_dictionary = sorted(dictionary.items(), key = lambda x: x[1], reverse=True)
    
    for element in sorted_dictionary:
        if (element[1] == 0 and without_zero):
            break
        print(element[0], " -> ", element[1])

In [11]:
def bag_of_words(quantity, files, dictionary):
    for i, file in enumerate(files):
        print ("\nFile number: ", i)
        bow = create_vector(file, dictionary)
        show_words(bow)

        if i == quantity -1:
            break

The bag-of-words feature vector for the first file is shown below. Words that do not occur in the text are omitted from the output.

In [12]:
bag_of_words(1, files, dictionary)


File number:  0
of  ->  3
chapter  ->  2
this  ->  1
set  ->  1
files  ->  1
contains  ->  1
a  ->  1
script  ->  1
canonical  ->  1
text  ->  1
by  ->  1
for  ->  1
the  ->  1
purpose  ->  1
reading  ->  1
to  ->  1
make  ->  1
an  ->  1
audio  ->  1
recording  ->  1
all  ->  1
footnotes  ->  1
introductions  ->  1
and  ->  1
verse  ->  1
numbers  ->  1
have  ->  1
been  ->  1
stripped  ->  1
out  ->  1


#### 4.
Build a sparse term-by-document matrix of feature vectors in which the feature
vectors are arranged column-wise, $A_{m \times n} = [d_1 | d_2 | \dots | d_n]$ (m is the number of terms in the
dictionary and n is the number of documents).

In [13]:
def create_matrix(files, dictionary):
    matrix = []
    for file in files:
        vector = create_vector(file, dictionary)
        matrix.append(vector)
    return matrix

In [14]:
matrix = create_matrix(files, dictionary)

In [15]:
print("Number of rows: ", len(matrix))
print("Number of columns: ", len(matrix[0]))

Number of rows:  1100
Number of columns:  13675


As shown, the number of rows equals the number of text files and the number of columns equals the number of words in the dictionary defined earlier. The matrix is therefore stored with documents as rows, i.e. as the transpose of the $A$ described in the task.
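
The list of dictionaries built above is dense: every document stores all 13675 keys, most of them with value 0. A sparse layout keeps only the non-zero entries. The sketch below uses plain coordinate triplets (term index, document index, count). The function name `to_coo` and the toy data are assumptions for illustration only; in practice a library sparse-matrix format would normally take this role.

```python
def to_coo(doc_counts, vocabulary):
    """Return (rows, cols, values) for the m x n term-by-document matrix.

    doc_counts: list of {word: count} dicts, one per document (column).
    vocabulary: list of words; the position of a word is its row index.
    """
    term_index = {word: i for i, word in enumerate(vocabulary)}
    rows, cols, values = [], [], []
    for j, counts in enumerate(doc_counts):
        for word, count in counts.items():
            if count != 0:              # store non-zero entries only
                rows.append(term_index[word])
                cols.append(j)
                values.append(count)
    return rows, cols, values

# Toy data (made up): 3 terms, 2 documents.
vocabulary = ["god", "light", "earth"]
doc_counts = [{"god": 2, "light": 1, "earth": 0},
              {"god": 0, "light": 0, "earth": 3}]
rows, cols, values = to_coo(doc_counts, vocabulary)
print(list(zip(rows, cols, values)))
```

Here terms index the rows and documents index the columns, matching the orientation of $A$ in the task statement.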

In [16]:
show_words(matrix[0], False)

of  ->  3
chapter  ->  2
this  ->  1
set  ->  1
files  ->  1
contains  ->  1
a  ->  1
script  ->  1
canonical  ->  1
text  ->  1
by  ->  1
for  ->  1
the  ->  1
purpose  ->  1
reading  ->  1
to  ->  1
make  ->  1
an  ->  1
audio  ->  1
recording  ->  1
all  ->  1
footnotes  ->  1
introductions  ->  1
and  ->  1
verse  ->  1
numbers  ->  1
have  ->  1
been  ->  1
stripped  ->  1
out  ->  1
first  ->  0
book  ->  0
moses  ->  0
commonly  ->  0
called  ->  0
genesis  ->  0
in  ->  0
beginning  ->  0
god  ->  0
created  ->  0
heavens  ->  0
earth  ->  0
was  ->  0
formless  ->  0
empty  ->  0
darkness  ->  0
on  ->  0
surface  ->  0
deep  ->  0
gods  ->  0
spirit  ->  0
hovering  ->  0
over  ->  0
waters  ->  0
said  ->  0
let  ->  0
there  ->  0
be  ->  0
light  ->  0
saw  ->  0
that  ->  0
it  ->  0
good  ->  0
divided  ->  0
from  ->  0
day  ->  0
he  ->  0
night  ->  0
evening  ->  0
morning  ->  0
expanse  ->  0
middle  ->  0
divide  ->  0
made  ->  0
which  ->  0
were  ->  0
... (remaining dictionary words omitted; every one of them has count 0)


The first row of the matrix is shown above; it correctly holds the word counts for the first document.

#### 5.
Preprocess the resulting data set by multiplying the bag-of-words elements by the
inverse document frequency. This operation reduces the weight of frequently occurring
words.

In [17]:
def count_documents_with_word(word):
    counter = 0
    for row in matrix:
        if row.get(word) != 0:
            counter += 1
    return counter

In [18]:
import math

def tf_idf(dictionary):
    for word in dictionary:
        documents_with_word = count_documents_with_word(word)
        if documents_with_word == 0:
            idf = 0
            print(word)
        else:
            idf = float(math.log(len(matrix)/documents_with_word))
        for row in matrix:
            row[word] = row.get(word) * idf

In [19]:
print(count_documents_with_word('chapter'))
print(len(matrix))

1100
1100


The word 'chapter' occurs in every text file, so after running tf_idf its weight should be set to 0.

In [20]:
show_words(matrix[0])
matrix[0].get('chapter')

of  ->  3
chapter  ->  2
this  ->  1
set  ->  1
files  ->  1
contains  ->  1
a  ->  1
script  ->  1
canonical  ->  1
text  ->  1
by  ->  1
for  ->  1
the  ->  1
purpose  ->  1
reading  ->  1
to  ->  1
make  ->  1
an  ->  1
audio  ->  1
recording  ->  1
all  ->  1
footnotes  ->  1
introductions  ->  1
and  ->  1
verse  ->  1
numbers  ->  1
have  ->  1
been  ->  1
stripped  ->  1
out  ->  1


2

In [21]:
tf_idf(dictionary)

In [22]:
show_words(matrix[0])
matrix[0].get('chapter')

files  ->  7.003065458786462
script  ->  7.003065458786462
canonical  ->  7.003065458786462
text  ->  7.003065458786462
audio  ->  7.003065458786462
recording  ->  7.003065458786462
footnotes  ->  7.003065458786462
introductions  ->  7.003065458786462
verse  ->  7.003065458786462
contains  ->  5.9044531701183525
reading  ->  5.616771097666572
stripped  ->  4.112693700890297
purpose  ->  3.6018680771243066
numbers  ->  3.2188758248682006
been  ->  1.3366387706740297
set  ->  0.8982722263714767
make  ->  0.8418581370913855
an  ->  0.7301884522402944
this  ->  0.4378004887511008
by  ->  0.2649129641905048
out  ->  0.23227603487748236
have  ->  0.2208734027796706
all  ->  0.12370965432602289
a  ->  0.04938124791592503
for  ->  0.028586547761416694
to  ->  0.015575211785471372
of  ->  0.0081929955336952
and  ->  0.004555816535860661
the  ->  0.0018198367169858993


0.0

As expected, the weight of the word that occurs in every text was reduced to 0, so 'chapter' does not help distinguish between documents.
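
The weighting applied here is $\mathrm{idf}(w) = \ln(N / n_w)$, where $N$ is the number of documents and $n_w$ is the number of documents containing $w$. The following is a small vectorized sketch with NumPy on a toy document-by-term count matrix; the array `counts` is made-up data, not the corpus, and the variable names are assumptions.

```python
import numpy as np

# Rows are documents, columns are terms (toy data).
counts = np.array([
    [2, 1, 0],
    [1, 0, 3],
    [4, 0, 0],
], dtype=float)

n_docs = counts.shape[0]
docs_with_term = np.count_nonzero(counts, axis=0)   # n_w for each term
# Terms that occur in no document get weight 0 instead of a division by zero.
idf = np.where(docs_with_term > 0,
               np.log(n_docs / np.maximum(docs_with_term, 1)),
               0.0)
weighted = counts * idf                              # idf is broadcast over rows
print(weighted)
```

The first term occurs in all three toy documents, so its column becomes zero, mirroring what happens to 'chapter' in the corpus.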

#### 6.
Write a program that accepts a query (as a sequence of words), which is then
converted to a vector representation q (bag-of-words).
The program should return the k documents most similar to the given query
q.

In [23]:
def create_vector_from_text(text):
    bow = {}
    for word in dictionary:
        bow[word] = 0
        
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub('[0-9]', '', text)
    words = nltk.word_tokenize(text)

    for word in words:
        if word in bow:
            bow[word] += 1
    return list(bow.values())

In [24]:
def get_similar_documents(text, k):
    text_vector = create_vector_from_text(text)
    result = []
    for i, row in enumerate(matrix):
        cos = np.matmul(np.array(text_vector).T, np.array(list(row.values()))) / (np.linalg.norm(text_vector) * np.linalg.norm(list(row.values())))
        
        if (len(result) >= k):
            if cos > result[0][1] or math.isnan(result[0][1]):
                result[0] = (i, cos)
        else:
            result.append((i, cos))
            
        result = sorted(result, key=lambda x: x[1])
    return result
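
The ranking can also be sketched in vectorized form. The version below is an illustrative alternative, not the notebook's implementation: it assumes the documents are already available as a dense 2-D array with one row per document, assigns similarity 0 to zero-norm vectors instead of producing NaN, and uses the hypothetical name `top_k_cosine`.

```python
import numpy as np

def top_k_cosine(doc_matrix, query, k):
    """Return [(doc_index, cosine)] for the k most similar rows, best first."""
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query)
    dots = doc_matrix @ query
    # Rows (or a query) with zero norm get similarity 0 instead of NaN.
    cos = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    best = np.argsort(-cos)[:k]          # indices sorted by descending cosine
    return [(int(i), float(cos[i])) for i in best]

# Toy data (made up): four documents over three terms.
docs = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 1]]
result_demo = top_k_cosine(docs, [1, 0, 0], 2)
print(result_demo)
```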

In [25]:
def write_files_at_index(result):
    result = sorted(result, key=lambda x: x[0])
    index = 0
    for i, file in enumerate(files):
        if len(result) == index:
            break
        if i == result[index][0]:
            f = io.open('documents/' + file, encoding="utf8")
            print("\nFile number: ", i)
            print(f.read())
            index += 1  

In [26]:
text = "When Abram was ninety-nine years old, Yahweh appeared to Abram and said to him, “I am God Almighty. Walk before me and be blameless. I will make my covenant between me and you, and will multiply you exceedingly.” Abram fell on his face. God talked with him, saying, ."
result = get_similar_documents(text, 3)

In [27]:
print(result)
write_files_at_index(result)

[(17, 0.28634335213682716), (12, 0.31396209612601805), (15, 0.3461022720388943)]

File number:  12
﻿Genesis.
Chapter 12.
Now Yahweh said to Abram, “Leave your country, and your relatives, and your father’s house, and go to the land that I will show you. 
I will make of you a great nation. I will bless you and make your name great. You will be a blessing. 
I will bless those who bless you, and I will curse him who treats you with contempt. All the families of the earth will be blessed through you.” 
So Abram went, as Yahweh had told him. Lot went with him. Abram was seventy-five years old when he departed from Haran. 
Abram took Sarai his wife, Lot his brother’s son, all their possessions that they had gathered, and the people whom they had acquired in Haran, and they went to go into the land of Canaan. They entered into the land of Canaan. 
Abram passed through the land to the place of Shechem, to the oak of Moreh. At that time, Canaanites were in the land. 
Yahweh appeared to Abram an

#### 7.
Normalize the feature vectors $d_j$ and the vector q so that they have length 1.

In [28]:
from sklearn.preprocessing import normalize

In [29]:
def normalize_matrix(matrix):
    result = []
    for i, row in enumerate(matrix):
        result.append(normalize([list(row.values())], norm="l1"))
    return result

In [30]:
normalized_matrix = normalize_matrix(matrix)

In [31]:
def get_normalize_vectors(text):
    text_vector_normalized = normalize([create_vector_from_text(text)],norm="l1")
    result = []
    for i, row in enumerate(normalized_matrix):
        cos = np.matmul(np.array(text_vector_normalized[0]).T,np.array(row[0]))/(np.linalg.norm(text_vector_normalized)*np.linalg.norm(row))
        result.append((i, cos))
    return result

In [32]:
def get_similar_documents_normalized(text, k):
    vectors = get_normalize_vectors(text)
    vectors = sorted(vectors, key=lambda x: x[1], reverse=True)
    return vectors[:k]

In [33]:
normalized_result = get_similar_documents_normalized(text, 3)

In [34]:
print(normalized_result)
write_files_at_index(normalized_result)

[(15, 0.3461022720388944), (12, 0.31396209612601805), (17, 0.28634335213682716)]

File number:  12
﻿Genesis.
Chapter 12.
Now Yahweh said to Abram, “Leave your country, and your relatives, and your father’s house, and go to the land that I will show you. 
I will make of you a great nation. I will bless you and make your name great. You will be a blessing. 
I will bless those who bless you, and I will curse him who treats you with contempt. All the families of the earth will be blessed through you.” 
So Abram went, as Yahweh had told him. Lot went with him. Abram was seventy-five years old when he departed from Haran. 
Abram took Sarai his wife, Lot his brother’s son, all their possessions that they had gathered, and the people whom they had acquired in Haran, and they went to go into the land of Canaan. They entered into the land of Canaan. 
Abram passed through the land to the place of Shechem, to the oak of Moreh. At that time, Canaanites were in the land. 
Yahweh appeared to Abram an

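As shown above, the files found are the same in both cases (with and without normalization). This is expected, because cosine similarity does not depend on the scale of the vectors. Note that norm="l1" scales each vector so that the absolute values of its entries sum to 1; unit Euclidean length would require norm="l2".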
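
For vectors of unit Euclidean length, the cosine reduces to a plain dot product. A minimal sketch of that normalization with NumPy follows; the helper name `l2_normalize` and the toy vectors are assumptions made for the example.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)

# Toy data (made up).
docs_unit = l2_normalize([[3, 4, 0], [1, 1, 0], [0, 0, 0]])
query_unit = l2_normalize([[3, 0, 0]])[0]
cosines = docs_unit @ query_unit   # for unit vectors the dot product is the cosine
print(cosines)
```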

#### 8.
To remove noise from the matrix A (normalized_matrix), apply SVD and low rank approximation.

In [35]:
import scipy.sparse.linalg  # import the submodule explicitly so that svds is available

In [36]:
def svd_and_lra(k):
    A = []
    for row in normalized_matrix:
        A.append(row[0])
    u, s, vt = scipy.sparse.linalg.svds(np.array(A), k=k)
    return u @ np.diag(s) @ vt

In [37]:
def get_normalize_vectors_approx(text, k):
    res = svd_and_lra(k)
    text_vector_normalized = normalize([create_vector_from_text(text)],norm="l1")
    result = []
    for i, row in enumerate(res):
        cos = np.matmul(np.array(text_vector_normalized[0]).T,np.array(row))/(np.linalg.norm(text_vector_normalized)*np.linalg.norm(row))
        result.append((i, cos))
    return result

In [38]:
def get_similar_documents_approx(text, k, approx_k):
    vectors = get_normalize_vectors_approx(text, approx_k)
    vectors = sorted(vectors, key=lambda x: x[1], reverse=True)
    return vectors[:k]

In [40]:
approx_result = get_similar_documents_approx(text, 3, 150)

In [41]:
print(approx_result)
write_files_at_index(approx_result)

[(15, 0.37849064276400085), (12, 0.3437077217291979), (13, 0.33155858160665236)]

File number:  12
﻿Genesis.
Chapter 12.
Now Yahweh said to Abram, “Leave your country, and your relatives, and your father’s house, and go to the land that I will show you. 
I will make of you a great nation. I will bless you and make your name great. You will be a blessing. 
I will bless those who bless you, and I will curse him who treats you with contempt. All the families of the earth will be blessed through you.” 
So Abram went, as Yahweh had told him. Lot went with him. Abram was seventy-five years old when he departed from Haran. 
Abram took Sarai his wife, Lot his brother’s son, all their possessions that they had gathered, and the people whom they had acquired in Haran, and they went to go into the land of Canaan. They entered into the land of Canaan. 
Abram passed through the land to the place of Shechem, to the oak of Moreh. At that time, Canaanites were in the land. 
Yahweh appeared to Abram an

After applying SVD and low rank approximation (k = 150) to remove noise, two of the three returned files are the same as in the two previous tasks, while the third differs (document 13 instead of 17).
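
The rank-k approximation keeps only the k largest singular values: $A_k = U_k \Sigma_k V_k^T$. The sketch below illustrates this on a small random matrix using `np.linalg.svd`, which computes the full decomposition. That is an assumption made to keep the example short; for the real 1100 × 13675 matrix, a truncated solver such as the `svds` call used above avoids computing all singular values. The name `low_rank_approximation` is hypothetical.

```python
import numpy as np

def low_rank_approximation(A, k):
    """Best rank-k approximation of A in the Frobenius norm (via full SVD)."""
    u, s, vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    return u[:, :k] @ np.diag(s[:k]) @ vt[:k, :]

rng = np.random.default_rng(0)
A_demo = rng.random((6, 4))          # toy matrix (made up)
A_rank2 = low_rank_approximation(A_demo, 2)

# The Frobenius error equals the norm of the discarded singular values.
discarded = np.linalg.svd(A_demo, compute_uv=False)[2:]
print(np.linalg.norm(A_demo - A_rank2), np.sqrt(np.sum(discarded ** 2)))
```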

#### 9.
Compare the program's behavior with and without noise removal. For which value of k are the search results best (subjectively)? Examine the effect of the IDF transformation on the search results.

In [46]:
approx_result50 = get_similar_documents_approx(text, 3, 50)
print("For k = 50: ")
print(approx_result50)
approx_result200 = get_similar_documents_approx(text, 3, 200)
print("For k = 200: ")
print(approx_result200)
approx_result400 = get_similar_documents_approx(text, 3, 400)
print("For k = 400: ")
print(approx_result400)
approx_result1000 = get_similar_documents_approx(text, 3, 1000)
print("For k = 1000: ")
print(approx_result1000)

For k = 50: 
[(15, 0.37463985883148804), (12, 0.3594304697495078), (13, 0.3518695287766173)]
For k = 200: 
[(15, 0.3810069539281877), (12, 0.336898193700019), (13, 0.3215925862419337)]
For k = 400: 
[(15, 0.3672243856604417), (12, 0.32986513759813824), (17, 0.28313708673923976)]
For k = 1000: 
[(15, 0.3466462619225087), (12, 0.31488088535888265), (17, 0.2860633122632338)]


The search results depend on k. For larger k (about 400 or more) the top three documents are the same as without noise removal, while for smaller k one of them differs.
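
One way to quantify this comparison is the overlap between the sets of returned document indices. The sketch below computes the Jaccard index between the top-3 result without noise removal and the top-3 results reported above for each k. The choice of the Jaccard index and the helper name `jaccard` are assumptions made for illustration; the document indices are copied from the outputs above.

```python
baseline = {15, 12, 17}          # top-3 without noise removal
by_rank = {
    50: {15, 12, 13},
    200: {15, 12, 13},
    400: {15, 12, 17},
    1000: {15, 12, 17},
}

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

overlap = {k: jaccard(baseline, found) for k, found in by_rank.items()}
for k, value in overlap.items():
    print(k, value)
```

An overlap of 1.0 means the same three documents were returned; this measure ignores the order of the results and their similarity scores.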