# List 1. Big Data Algorithms Laboratory

### 1. Read about the TF-IDF coefficient (and understand the idea of this measure)

TF-IDF (Term Frequency times Inverse Document Frequency) is a measure of how important a specific word is to a document within a set of documents.
<i>"To downgrade the relative importance of words that occur all too frequently, an inverse weighting is introduced to scale down the words that occur too frequently. This inverse weighting is referred to as Inverse Document Frequency. Together, TF-IDF captures the relative importance of words in a set of documents or a collection of texts."</i>
To calculate it, the following must be defined:

###### Definition 1.
$TF(i,j)=\frac{f_{i,j}}{\max_k f_{k,j}}$, where $f_{i,j}$ is the number of times the $i$-th word appears in the $j$-th document and the maximum is taken over all words $k$ in document $j$
###### Definition 2.
$IDF(i)=\log_2(N/n_i)$, where $N$ is the total number of documents in the set and $n_i$ is the number of documents from the set in which the word (= element $i$) appears
###### Definition 3. TF.IDF for word $i$ and document $j$
$TF.IDF(i,j)=TF(i,j) \times IDF(i)$
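
As a small worked example of the three definitions (the numbers are invented for illustration and are not taken from the books): assume $N=4$ documents, a word $i$ that appears $f_{i,j}=2$ times in document $j$, a most frequent word in document $j$ that appears $4$ times, and $n_i=2$ documents containing word $i$. Then:

$$TF(i,j)=\frac{2}{4}=0.5, \qquad IDF(i)=\log_2\frac{4}{2}=1, \qquad TF.IDF(i,j)=0.5 \times 1=0.5$$

A word that appears in every document has $IDF(i)=\log_2(N/N)=0$, so its TF.IDF is $0$ no matter how often it occurs.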

### 2. Find sources of 10 different books (plaintext files with UTF-8 encoding). You can use e.g. the Project Gutenberg website. The set of selected books should contain at least 2 books by the same author and at least 2 books of the same genre (e.g. fantasy, sci-fi, romance,...)
My books:
* Odyssey
* Dracula
* Alice's Adventures in Wonderland (fantasy)
* Frankenstein (fantasy)
* Moby Dick
* Pride and Prejudice
* The Happy Prince and Other Tales (Oscar Wilde)
* The Masque of the Red Death
* The Picture of Dorian Gray (Oscar Wilde)
* The Yellow Wallpaper

I removed the Project Gutenberg preamble and the notes after the end of each book, so that only the original text remained. I combined Scala and Python to perform the tasks.

### 3. Read the books from the input files, divide each book into words and remove the stop-words (also remove special characters such as ".,:;{}()" and convert the text to lowercase). For each book calculate the total number of distinct words used by the author.

In [1]:
// Load the stop-word list (a downloaded list, one word per line) and name the input files
import scala.io.Source._
val stopwords = fromFile("stopwords.txt").getLines.toArray
val my_books=Array("moby_dick.txt","alice's_adventures.txt","dracula.txt","frankenstein.txt","pride_and_prejudice.txt",
                   "the_happy_prince_and_other_tales.txt","the_masque_of_red_death.txt","the_oddyssey.txt",
                   "the_picture_of_dorian_grey.txt","the_yellow_wallpaper.txt"
                  )

Intitializing Scala interpreter ...

Spark Web UI available at http://LAPTOP-AR3KTOMI:4045
SparkContext available as 'sc' (version = 3.0.1, master = local[*], app id = local-1603895178731)
SparkSession available as 'spark'


import scala.io.Source._
stopwords: Array[String] = Array(able, about, above, abroad, according, accordingly, across, actually, adj, after, afterwards, again, against, ago, ahead, ain't, all, allow, allows, almost, alone, along, alongside, already, also, although, always, am, amid, amidst, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, aren't, around, as, a's, aside, ask, asking, associated, at, available, away, awfully, back, backward, backwards, be, became, because, become, becomes, becoming, been, before, beforehand, begin, behind, being, believe, below, beside, besides, best, better, between, beyond, both, brief, but, by, came, can, cannot, cant, can't, caption, cause, causes, certain...


In [2]:
def perform_calculation(textFile: org.apache.spark.sql.Dataset[String]): Unit = {
    // Convert everything to lowercase and split into words on "\\W+" (drops colons, semicolons etc.),
    // then split on underscores, which "\\W+" treats as word characters,
    // and filter out stop-words and empty tokens
    val my_new = textFile.flatMap{_.toLowerCase.split("\\W+")}
                         .flatMap{_.split("\\_").filter(word => !stopwords.contains(word) && word!="")}
    // Create the list of distinct words
    val distinct = my_new.distinct.collect.toList
    // Print the number of distinct words
    print(distinct.length)
}

perform_calculation: (textFile: org.apache.spark.sql.Dataset[String])Unit


In [3]:
for (book <- my_books){
    val textFile = spark.read.textFile(book) // current text file
    print("book name: "+book+" number of distinct words: ")
    perform_calculation(textFile)
    println()
}

book name: moby_dick.txt number of distinct words: 16510
book name: alice's_adventures.txt number of distinct words: 2146
book name: dracula.txt number of distinct words: 8717
book name: frankenstein.txt number of distinct words: 6498
book name: pride_and_prejudice.txt number of distinct words: 5772
book name: the_happy_prince_and_other_tales.txt number of distinct words: 1787
book name: the_masque_of_red_death.txt number of distinct words: 646
book name: the_oddyssey.txt number of distinct words: 7023
book name: the_picture_of_dorian_grey.txt number of distinct words: 5023
book name: the_yellow_wallpaper.txt number of distinct words: 914


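Moby Dick has the highest number of distinct words (16510). The two books by the same author (Oscar Wilde) do not have similar counts: 1787 for The Happy Prince and Other Tales versus 5023 for The Picture of Dorian Gray.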
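
For comparison, the Scala tokenisation above can be mirrored in Python. This is only a sketch: the `STOPWORDS` set and the `sample` sentence are made up for illustration, whereas the notebook loads its list from `stopwords.txt`. The `re.ASCII` flag is used because Java's `\W` is ASCII-only by default, while Python's is Unicode-aware.

```python
import re

# Hypothetical, tiny stop-word list (the notebook reads its list from stopwords.txt)
STOPWORDS = {"the", "a", "and", "of", "to", "in", "was"}

def distinct_words(text, stopwords=STOPWORDS):
    """Lowercase, split on non-word characters and underscores,
    then drop stop-words and empty tokens."""
    # re.ASCII: only [a-zA-Z0-9_] count as word characters, as in Java's default \W
    tokens = re.split(r"[\W_]+", text.lower(), flags=re.ASCII)
    return {t for t in tokens if t and t not in stopwords}

sample = "The whale, the _white_ whale: it was the end of Ahab's ship."
words = distinct_words(sample)
print(sorted(words))  # ['ahab', 'end', 'it', 's', 'ship', 'whale', 'white']
print(len(words))     # 7
```

Splitting on the apostrophe turns `Ahab's` into `ahab` and `s`. The same effect explains fragments such as `don`, `ll` and `ve` in the dictionaries printed later in this notebook.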

### 4. For all words in all books calculate the TF-IDF coefficient. For each book identify 20 words with highest TF-IDF coefficient. Write the results to a text file (together with the number of distinct words calculated in 3.).

Prepare tuples (word, number of appearances)

In [4]:
import java.nio.file._ // library for writing files

import java.nio.file._


In [7]:
def prepare_counts(book_name: String): Array[(String, Int)] = {
    // Prepare the current book: lowercase, split into words, remove stop-words and empty tokens
    val textFile = spark.read.textFile(book_name)
    val before_counts = textFile.flatMap{_.toLowerCase.split("\\W+")}
                                .flatMap{_.split("\\_").filter(word => !stopwords.contains(word) && word!="")}
    // Count the appearances of each word, f(i,j), sorted by descending count.
    // The Dataset is converted with .rdd, so there is no need to collect and re-parallelize it.
    val counts = before_counts.rdd.map(word => (word, 1)).reduceByKey( (x,y) => x+y).sortBy(_._2,false)
    counts.collect
}

prepare_counts: (book_name: String)Array[(String, Int)]


In [8]:
val my_books=Array("moby_dick","alice's_adventures","dracula","frankenstein","pride_and_prejudice",
                   "the_happy_prince_and_other_tales","the_masque_of_red_death","the_oddyssey",
                   "the_picture_of_dorian_grey","the_yellow_wallpaper"
                  )

for (book_name <- my_books){
    // Save the counts for each book to <book_name>_counts.txt, one "word count" pair per line.
    // The whole file is written in one call, so re-running the cell overwrites it
    // instead of appending duplicate lines.
    val counts = prepare_counts(book_name+".txt")
    val lines = counts.map(d => d._1 + " " + d._2).mkString("", "\n", "\n")
    Files.write(Paths.get(book_name+"_counts.txt"), lines.getBytes("UTF-8"))
}

my_books: Array[String] = Array(moby_dick, alice's_adventures, dracula, frankenstein, pride_and_prejudice, the_happy_prince_and_other_tales, the_masque_of_red_death, the_oddyssey, the_picture_of_dorian_grey, the_yellow_wallpaper)


The cells above use Scala; the cells below use Python.

#### For each book a Python dictionary that represents it is created. Format: word, number of appearances, TF.IDF
First, read the counts into the dictionaries

In [1]:
import pandas as pd
import os
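# sorted() makes the order deterministic; endswith() skips other files that merely contain "counts"
counts_files = sorted(f for f in os.listdir('.') if f.endswith("_counts.txt"))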

In [2]:
counts_files

["alice's_adventures_counts.txt",
 'dracula_counts.txt',
 'frankenstein_counts.txt',
 'moby_dick_counts.txt',
 'pride_and_prejudice_counts.txt',
 'the_happy_prince_and_other_tales_counts.txt',
 'the_masque_of_red_death_counts.txt',
 'the_oddyssey_counts.txt',
 'the_picture_of_dorian_grey_counts.txt',
 'the_yellow_wallpaper_counts.txt']

In [3]:
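books_counts_dictionaries=[]
for file_name in counts_files:
    d = {}
    with open(file_name, encoding="utf-8") as f:
        for line in f:
            # each line holds "word count"; the value is a list so that TF.IDF can be appended later
            (key, val) = line.split()
            d[key] = [int(val)]
    books_counts_dictionaries.append(d)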

Second, add the TF.IDF calculation results to the dictionaries

In [4]:
import math
N=10 # number of books
book_nr=0
for book_counts in books_counts_dictionaries:

    # book_counts maps each distinct non-stop word of the book to [count]
    for word in book_counts:

        # TF.IDF calculation
        # Highest count of this word in any single book.
        # Note: definition 1 divides by the count of the most frequent word within the same book;
        # this cell normalises each word by its maximum count across the books instead.
        max_appearance = max(d[word][0] for d in books_counts_dictionaries if word in d)
        TF = book_counts[word][0]/max_appearance
        # n_i: number of books that contain the word
        n_i = sum(1 for d in books_counts_dictionaries if word in d)
        IDF = math.log(N/n_i,2)

        book_counts[word].append(TF*IDF)

    # Disabled: sort the dictionary by TF.IDF and save it to a file
    #sorted_counts = sorted(book_counts.items(), key=lambda x: x[1][1], reverse=True)
    #f = open(counts_files[book_nr]+"_tfidf.txt","w")
    #f.write(str(sorted_counts))
    #f.close()

    #book_nr+=1
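
The cell above divides each count by the word's highest count across the books, while definition 1 divides by the count of the most frequent word within the same book. A minimal sketch of the definition-based variant is shown here for comparison. The function name `tfidf_by_definition` and the `toy` counts are hypothetical, and the values it produces differ from the results printed in this notebook.

```python
import math

def tfidf_by_definition(books):
    """books: list of dicts mapping word -> count, one dict per book.
    Returns a list of dicts mapping word -> TF.IDF, with TF as in definition 1."""
    n_books = len(books)
    # n_i for every word: the number of books that contain it (computed once)
    doc_freq = {}
    for counts in books:
        for word in counts:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    result = []
    for counts in books:
        max_count = max(counts.values())  # count of the most frequent word in this book
        result.append({
            word: (count / max_count) * math.log2(n_books / doc_freq[word])
            for word, count in counts.items()
        })
    return result

# Hypothetical counts, not taken from the books
toy = [{"alice": 8, "queen": 2}, {"queen": 4, "whale": 4}, {"whale": 1}, {"ship": 3}]
scores = tfidf_by_definition(toy)
print(scores[0])  # {'alice': 2.0, 'queen': 0.25}
```

With the notebook's structure, the input could be built as `[{w: v[0] for w, v in d.items()} for d in books_counts_dictionaries]`.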

Example excerpts of the dictionaries for the first and second book

In [5]:
print("Dict for Alice in Wonderland (first book): \n")
print(list(books_counts_dictionaries[0].items())[0:40])
print("...\n")
print("Dict for Dracula (second book): \n")
print(list(books_counts_dictionaries[1].items())[0:40])
print("...\n")

Dict for Alice in Wonderland (first book): 

[('alice', [399, 3.3219280948873626]), ('queen', [76, 0.7369655941662062]), ('thought', [74, 0.0]), ('time', [71, 0.0]), ('king', [63, 0.23583104625469567]), ('don', [61, 0.10281473187502148]), ('turtle', [60, 2.321928094887362]), ('began', [58, 0.07801928690099914]), ('ll', [57, 0.3652785988475979]), ('mock', [57, 0.7369655941662062]), ('hatter', [56, 3.3219280948873626]), ('gryphon', [55, 3.3219280948873626]), ('rabbit', [53, 2.321928094887362]), ('head', [50, 0.0]), ('voice', [48, 0.0]), ('looked', [45, 0.0]), ('mouse', [44, 1.7369655941662063]), ('ve', [44, 0.7369655941662062]), ('duchess', [42, 1.7369655941662063]), ('tone', [40, 0.5145731728297582]), ('dormouse', [40, 3.3219280948873626]), ('great', [39, 0.0]), ('cat', [37, 1.7369655941662063]), ('march', [34, 0.7369655941662062]), ('large', [33, 0.06349496308464117]), ('long', [33, 0.0]), ('moment', [31, 0.0]), ('hare', [31, 1.7369655941662063]), ('white', [30, 0.0]), ('heard', [30, 0

In [6]:
for i in range(len(books_counts_dictionaries)):
    print("Book name: ",counts_files[i])
    print("20 words with the highest TF.IDF: ")
    top_items=sorted(books_counts_dictionaries[i].items(), key=lambda x: x[1][1], reverse=True)[0:20]
    keys=[item[0] for item in top_items]
    print(keys)

Book name:  alice's_adventures_counts.txt
20 words with the highest TF.IDF: 
['alice', 'hatter', 'gryphon', 'dormouse', 'dinah', 'dodo', 'pigeon', 'croquet', 'timidly', 'knave', 'lobster', 'whiting', 'oop', 'treacle', 'lory', 'soo', 'puppy', 'lobsters', 'cheshire', 'wow']
Book name:  dracula_counts.txt
20 words with the highest TF.IDF: 
['helsing', 'lucy', 'mina', 'harker', 'seward', 'godalming', 'quincey', 'renfield', 'westenra', 'whitby', 'dracula', 'hawkins', 'carfax', 'varna', 'garlic', 'holmwood', 'galatz', 'czarina', 'spiders', 'exeter']
Book name:  frankenstein_counts.txt
20 words with the highest TF.IDF: 
['clerval', 'justine', 'felix', 'frankenstein', 'safie', 'creator', 'ingolstadt', 'hovel', 'kirwin', 'protectors', 'ernest', 'mont', 'occupations', 'contemplated', 'waldman', 'lacey', 'krempe', 'walton', 'agrippa', 'extinguish']
Book name:  moby_dick_counts.txt
20 words with the highest TF.IDF: 
['ahab', 'whales', 'stubb', 'queequeg', 'starbuck', 'pequod', 'whaling', 'nant
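
The task also asks for the results to be written to a text file, which the notebook leaves disabled. One possible way to do it is sketched below. The function `save_top_words`, the output file name and the `example_scores` values are all hypothetical. The number of distinct words is taken as the size of the dictionary, which assumes one entry per distinct word.

```python
import os
import tempfile

def save_top_words(path, book_name, scores, top_n=20):
    """Write the number of distinct words and the top_n words by TF.IDF to a text file.

    scores: dict mapping word -> TF.IDF, one entry per distinct word of the book.
    """
    top = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]
    with open(path, "w", encoding="utf-8") as f:
        f.write(f"book: {book_name}\n")
        f.write(f"distinct words: {len(scores)}\n")
        for word, value in top:
            f.write(f"{word} {value:.6f}\n")

# Hypothetical scores, not taken from the notebook's output
example_scores = {"ahab": 3.3, "whale": 0.7, "sea": 0.0}
out_path = os.path.join(tempfile.gettempdir(), "example_tfidf.txt")
save_top_words(out_path, "example_book", example_scores, top_n=2)
with open(out_path, encoding="utf-8") as f:
    print(f.read())
```

For book `i` of this notebook, the `scores` argument could be built as `{w: v[1] for w, v in books_counts_dictionaries[i].items()}`.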

### 5. Write a function that takes as input a word w and returns the list of the most matching books. I.e., the output should be the list of books sorted according to the TF-IDF coefficient of w (in descending order). Show the results for several different input words.

In [61]:
def search_matches(books_names,books_dictionaries,word):
    # books_dictionaries[i][word][0] - count of word in book i
    # books_dictionaries[i][word][1] - TF.IDF of word in book i
    matching={}
    for i in range(0,len(books_dictionaries)):
        if word in books_dictionaries[i]:
            matching[books_names[i]]=books_dictionaries[i][word][1]
        else:
            matching[books_names[i]]=0 # word not present

    # display the books sorted by descending TF.IDF of the word
    p=pd.DataFrame(list(matching.items()), columns=["book name", "TF.IDF value"])
    p=p.sort_values(by="TF.IDF value",ascending=False)
    print("Book matching for "+word+":")
    display(p)

    # the returned dictionary (book name -> TF.IDF) is unsorted
    return matching
                                                 

In [62]:
matching=search_matches(counts_files,books_counts_dictionaries,"queen")

Book matching for queen:


| | book name | TF.IDF value |
|---|---|---|
| 0 | alice's_adventures_counts.txt | 0.736966 |
| 7 | the_oddyssey_counts.txt | 0.25212 |
| 3 | moby_dick_counts.txt | 0.184241 |
| 8 | the_picture_of_dorian_grey_counts.txt | 0.058181 |
| 2 | frankenstein_counts.txt | 0.009697 |
| 5 | the_happy_prince_and_other_tales_counts.txt | 0.009697 |
| 1 | dracula_counts.txt | 0.0 |
| 4 | pride_and_prejudice_counts.txt | 0.0 |
| 6 | the_masque_of_red_death_counts.txt | 0.0 |
| 9 | the_yellow_wallpaper_counts.txt | 0.0 |


In [63]:
matching=search_matches(counts_files,books_counts_dictionaries,"row")

Book matching for row:


| | book name | TF.IDF value |
|---|---|---|
| 3 | moby_dick_counts.txt | 0.736966 |
| 7 | the_oddyssey_counts.txt | 0.47686 |
| 1 | dracula_counts.txt | 0.130053 |
| 0 | alice's_adventures_counts.txt | 0.086702 |
| 2 | frankenstein_counts.txt | 0.043351 |
| 8 | the_picture_of_dorian_grey_counts.txt | 0.043351 |
| 4 | pride_and_prejudice_counts.txt | 0.0 |
| 5 | the_happy_prince_and_other_tales_counts.txt | 0.0 |
| 6 | the_masque_of_red_death_counts.txt | 0.0 |
| 9 | the_yellow_wallpaper_counts.txt | 0.0 |


In [64]:
matching=search_matches(counts_files,books_counts_dictionaries,"flow")

Book matching for flow:


| | book name | TF.IDF value |
|---|---|---|
| 3 | moby_dick_counts.txt | 0.736966 |
| 4 | pride_and_prejudice_counts.txt | 0.49131 |
| 2 | frankenstein_counts.txt | 0.368483 |
| 7 | the_oddyssey_counts.txt | 0.368483 |
| 1 | dracula_counts.txt | 0.122828 |
| 5 | the_happy_prince_and_other_tales_counts.txt | 0.122828 |
| 0 | alice's_adventures_counts.txt | 0.0 |
| 6 | the_masque_of_red_death_counts.txt | 0.0 |
| 8 | the_picture_of_dorian_grey_counts.txt | 0.0 |
| 9 | the_yellow_wallpaper_counts.txt | 0.0 |
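
`search_matches` displays the sorted table but returns an unsorted dictionary. If a sorted list is wanted as the return value, a pandas-free variant could look like the sketch below. The name `rank_books` and the example data are hypothetical. It expects one dictionary per book mapping each word directly to its TF.IDF value.

```python
def rank_books(book_names, book_scores, word):
    """Return (book name, TF.IDF) pairs sorted by descending TF.IDF of `word`.

    book_scores: list of dicts mapping word -> TF.IDF; a missing word counts as 0.
    """
    pairs = [(name, scores.get(word, 0.0))
             for name, scores in zip(book_names, book_scores)]
    # sorted() is stable, so books with equal scores keep their input order
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

# Hypothetical names and scores, not taken from the notebook's output
names = ["book_a", "book_b", "book_c"]
scores = [{"queen": 0.2}, {"queen": 0.7, "row": 0.1}, {"row": 0.4}]
print(rank_books(names, scores, "queen"))
# [('book_b', 0.7), ('book_a', 0.2), ('book_c', 0.0)]
```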
