# Practice Exercise

__<font color='red'>NOTE</font>__: This is just an exercise to practise data structures in R. It is not a real problem. We will apply the previously defined __data science methodology__ to tackle the proposed problem. But both solution and problem are just a excuse to use basics and data structures of R

### 1. Business Understanding

As a university, we need to compare texts to identify key words and topics. Especially in the case of ancient texts, we need a tool that analyses different texts and compares them to get statistics about topics of the texts, genders...

### 2. Analytical Approach

We can build a tool where we analyse texts and provide statistics that can be saved for studies. Those statistics can be used to make changes in the texts, books, ... that are taught and identify possible biases

### 3. Data requirements

We need a collection of public domain texts to test the application and start the statistics.
We need pdfs of these texts to analyse them and get statistics
We need to install the package: __pdftools__

In [1]:
install.packages("pdftools")

Installing package into 'C:/Users/ingov/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)



package 'pdftools' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
	C:\Users\ingov\AppData\Local\Temp\RtmpyWEq7H\downloaded_packages


### 4. Data collection

We can get those public domain texts and books from the library of the university.
<font color='red'>For learning purposes </font> we will pretend that the url of the library of university is [wikisource](https://en.wikisource.org/wiki/Main_Page)

Those pdf files are located in the folder _books_

### 5. Data Understanding

The pdf files comtain texts from which we can extract information about the topics using most common words. We can extract the following information:
- We can extract all the words contained in each text fragment
- We would have to remove common words like "of", "the"...
- We can get the frequency of each word in the text
- We can compare two texts getting the words that are contained in both
- Later, we can compare the frequency of the words in the two texts
- We can get a set of all words contained in all our texts and count the number of different words (to check how rich vocabulary is)

### 6. Data Preparation

We will keep a shared directory called books in this project structure where we will drop all the texts in pdf format that we want to explore and get statistics from.

To handle character variable, we need to import the library __string__ which includes methods to _transform and handle strings_:

In [2]:
library(stringr)

In [3]:
var <- "./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf"

In order to open the files, we will need to import the librart __pdftools__, below you have an example of _importing_ the library and _opening a pdf file_ with it:

In [4]:
library(pdftools)
text <- pdftools::pdf_text(pdf = "./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf")
typeof(text)

Using poppler version 23.08.0



In [5]:
print(text)

[1] "Le Morte d'Arthur — Volume I, Book\n            I, Chapter I\n               Thomas Malory\n\n\n\n\n       Exported from Wikisource on September 14, 2023\n\n\n\n\n                             1\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

As you can check every __paragraph__ is an _element_ in the result. So we need to join our data:

In [6]:
join_paragraphs <- function(filepath){
    # Given a pdf file path, we will open the file. The result is a vector that contains the different paragraphs in each element. 
    # We will join each paragraph to the same text variable"
    text_array <- pdftools::pdf_text(pdf = filepath)
    text <- "\n "
    for(paragraph in text_array){
        text <- paste(text, paragraph)
    }
    return(text)
}

In [7]:
print(join_paragraphs("./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf"))



The strategy will be __splitting our text by " "__ as words are separated by spaces, that will transform our data _from text to a vector where each word is an element_

In order to use that strategy successfully, we need to __clean our data__ from other possible simbols or special characters that could change the result (for instance "." can be appended to a word making it a new register later

In [8]:
remove_spaces_returns <- function(text_to_clean){
    # Given a text, it removes special characters like '\n' or '.' and replaces them 
    # by an empty space (so we can split the text) later
    upper_text <- toupper(text_to_clean)
    clean_text <- stringr::str_replace_all(upper_text, "\n", " ")
    clean_text <- stringr::str_replace_all(clean_text, "\\n", " ")
    clean_text <- stringr::str_replace_all(clean_text, "\\.", "")
    clean_text <- stringr::str_replace_all(clean_text, ",", "")
    clean_text <- stringr::str_replace_all(clean_text, ":", "")
    clean_text <- stringr::str_replace_all(clean_text, ";", "")
    clean_text <- stringr::str_replace_all(clean_text, "\\[[0-9]*\\]", "")
    clean_text <- stringr::str_replace_all(clean_text, "\\(", "")
    clean_text <- stringr::str_replace_all(clean_text, "\\)", "")
    clean_text <- stringr::str_replace_all(clean_text, "'s", "")
    clean_text <- stringr::str_replace_all(clean_text, "-", "")
    clean_text <- stringr::str_replace_all(clean_text, "—", "")
    return(clean_text)
}

So now we have to go __from a filepath to a vector that contains in its elements all the words__ contained within the text. We have to:
1. Open the file and join all paragraphs -> __join_paragraphs__
2. Clean the text of special characters and symbol so it can be split by empty spaces to get the words <- __remove_spaces_returns__
3. Get the words in the paragraph <- _Function that will described below_ (will contain the other two steps too)

In [9]:
get_words_in_text <- function(filepath){
    # Given a pdf file path, it returns all the words contained in the text as elements in a vector
    text <- join_paragraphs(filepath)
    cleaned_text <- remove_spaces_returns(text)
    words_in_text <- unlist(strsplit(cleaned_text, " "))
    # We need to remove "" elements in the result
    words_vector <- unlist(words_in_text[words_in_text != ""])
    return(words_vector)
}

In [10]:
get_words_in_text("./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf")

Finally, as words can be repeated, it would be useful to get a vector with each words just once (not repeated)

In [11]:
get_unique_words_in_pdf <- function(filepath){
    # Given a pdf file path, it returns a vector that contains the words in the text without repetitions
    words_in_text <- get_words_in_text(filepath)
    unique_words <- unique(words_in_text)
    return(unique_words)
}

In [12]:
print(get_unique_words_in_pdf("./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf"))

  [1] "LE"                                               
  [2] "MORTE"                                            
  [3] "D'ARTHUR"                                         
  [4] "VOLUME"                                           
  [5] "I"                                                
  [6] "BOOK"                                             
  [7] "CHAPTER"                                          
  [8] "THOMAS"                                           
  [9] "MALORY"                                           
 [10] "EXPORTED"                                         
 [11] "FROM"                                             
 [12] "WIKISOURCE"                                       
 [13] "ON"                                               
 [14] "SEPTEMBER"                                        
 [15] "14"                                               
 [16] "2023"                                             
 [17] "1"                                                
 [18] "HOW"   

##### Now, we define several functions to get statistics on our texts

1. We define a function that returns the __frequency__ of each word in text

In [13]:
get_frequency_words_in_text <- function(filepath){
    # Given a pdf file path, it returns the frequency of each word (number of times that a word appears in the text)
    
    # First, we need to get the words that appear in the text without repetitions
    unique_words <- get_unique_words_in_pdf(filepath)
    # Now, we get the text in the pdf file (first we open it, then we join the different parts in a string)
    full_text <- join_paragraphs(filepath)
    # We clean our data and turn the string to upper case
    full_text <- remove_spaces_returns(full_text)
    full_text <- toupper(full_text)
    
    # We create an empty vector that will contain the word as name and how many times appears in the text as value
    frequencies <- c()
    i <- 1
    for(word_vector in unique_words){
        freq <- length(unlist(gregexpr(word_vector, full_text)))
        frequencies <- append(frequencies, freq)
        names(frequencies)[i] <- word_vector
        i <- i + 1
    }
    # We order the vector by value from higher to lower frequency
    frequencies <- frequencies[order(-frequencies)]
    return(frequencies)
}

In [14]:
get_frequency_words_in_text("./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf")

2. A function that return a list of words that are __contained in every text in the list__:

In [15]:
get_words_in_every_text_list_of_files <- function(list_texts){
    # Given a list of pdf file paths, returns all the words that are contained in every text and without repetitions
    # The word has to be contained in every text in the list
    unique_words_in_files <- c()
    i <- 1
    # We iterate through the file paths
    for(file_text in list_texts){
        if(typeof(file_text) == 'character'){
            # We get the unique words contained in that file
            unique_words_text <- get_unique_words_in_pdf(file_text)
            # If this is the first time we are performing this operation, we add the vector with the unique words in this file
            # to the vector that will contain the result
            if(i == 1){
                unique_words_in_files <- append(unique_words_in_files, unique_words_text)
                i <- i + 1
            }else{
                # Otherwise, we use intersect to return only words that are contained in both texts
                unique_words_in_files <- intersect(unique_words_in_files, unique_words_text)
            }
        }else{
            stop("Input parameter has to be a list/vector of character variables")
        }
    }
    return(unique_words_in_files)
}

In [16]:
var1 <- "./books/Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf"
var2 <- "./books/An_Account_of_the_Battle_of_Megiddo.pdf"
list_files <- list(var1, var2)
words_in_texts <- get_words_in_every_text_list_of_files(list_files)

In [17]:
print(words_in_texts)

  [1] "I"                                                
  [2] "BOOK"                                             
  [3] "EXPORTED"                                         
  [4] "FROM"                                             
  [5] "WIKISOURCE"                                       
  [6] "ON"                                               
  [7] "SEPTEMBER"                                        
  [8] "14"                                               
  [9] "2023"                                             
 [10] "1"                                                
 [11] "FOR"                                              
 [12] "THE"                                              
 [13] "OF"                                               
 [14] "AND"                                              
 [15] "HIS"                                              
 [16] "THEIR"                                            
 [17] "IT"                                               
 [18] "IN"    

In [18]:
var_text <- get_unique_words_in_pdf("./books/An_Account_of_the_Battle_of_Megiddo.pdf")

In [19]:
length(var_text)

In [20]:
var_text

In [21]:
get_frequency_words_in_text("./books/An_Account_of_the_Battle_of_Megiddo.pdf")

3. Function that will return __words that are in text 1 but not in text 2__

In [22]:
words_in_text_a_not_in_text_b <- function(filepath_text1, filepath_text2){
    # Given a two pdf file paths, this function will return the words contain in the first file but not in the second
    unique_words_text1 <- get_unique_words_in_pdf(filepath_text1)
    unique_words_text2 <- get_unique_words_in_pdf(filepath_text2)
    words_in_text1_not_in_text2 <- setdiff(unique_words_text1, unique_words_text2)
    return(words_in_text1_not_in_text2)
}

In [23]:
words_in_text_a_not_in_text_b(var1, var2)

4. A function to find the __frequency of a word in a text__

In [24]:
find_frequency_of_word_in_text <- function(filepath, word){
    # Given a pdf file path and a word, this function returns how many times this word appears in the text
    text_in_file <- pdftools::pdf_text(pdf = filepath)
    found_times <- 0
    for(paragraph in text_in_file){
        list_found_indexes = stringr::str_locate_all(paragraph, word)
        found_times <- found_times + length(unlist(list_found_indexes))
    }
    return(found_times)
}

In [25]:
print(find_frequency_of_word_in_text(var1, "I"))

[1] 58


5. A function to __compare the frequency of words__ between two texts:

In [26]:
compare_frequency_of_common_words <- function(filepath1, filepath2){
    # Given two pdf file paths, it will return a vector with the common words in both text as names. If the value is True,
    # it means the words is more frequent in the first text and if it is false, it means the word is more frequent in 
    # the second text
    freq_words_text1 <- get_frequency_words_in_text(filepath1)
    freq_words_text2 <- get_frequency_words_in_text(filepath2)
    common_words <- intersect(names(freq_words_text1), names(freq_words_text2))
    more_freq_in_1 <- c()
    for(word in common_words){
        freq_word_1 <- as.integer(freq_words_text1[word])
        freq_word_2 <- as.integer(freq_words_text2[word])
        freq_word_diff <- freq_word_1 - freq_word_2
        if(freq_word_diff > 0){
            freq_word_diff <- as.logical(freq_word_diff)
        }else{
            freq_word_diff <- FALSE
        }
        more_freq_in_1 <- append(more_freq_in_1, freq_word_diff)
    }
    names(more_freq_in_1) <- common_words
    more_freq_in_1 <- more_freq_in_1[sort(names(more_freq_in_1))]
    return(more_freq_in_1)
}

In [27]:
compare_frequency_of_common_words(var1, var2)

6. Function to return the words that __appear more frequently than averate__:

In [28]:
return_words_more_common_than_mean <- function(filepath){
    # Given a pdf file path, it returns a vector with only that words that appearç
    # more times than the average times a word appears in the text in the file
    freq_words_text <- get_frequency_words_in_text(filepath)
    mean_freq <- mean(freq_words_text)
    print(sprintf("The average/mean frequency of a word in this text is %f", mean_freq))
    result_words <- which(freq_words_text > mean_freq)
    return <- result_words
}

In [29]:
freq <- return_words_more_common_than_mean(var1)
names(freq)

[1] "The average/mean frequency of a word in this text is 6.937126"


In [30]:
print(return_words_more_common_than_mean(var2))

[1] "The average/mean frequency of a word in this text is 8.092000"
       A        I       HE      THE       IN       OF       IS       ON 
       1        2        3        4        5        6        7        8 
      RE       OR      HIS       AT       AN       TO       IT       BE 
       9       10       11       12       13       14       15       16 
      AS      AND       ME  MAJESTY       WE       US     THIS       MY 
      17       18       19       20       21       22       23       24 
     FOR       SO      WAS     THAT      ALL     WITH        2       GO 
      25       26       27       28       29       30       31       32 
       1     ARMY       DO    THEIR      OUR      OUT        3     THEY 
      33       34       35       36       37       38       39       40 
     FOE       UP     AMON     CITY       YE    WHICH     THEM     GOLD 
      41       42       43       44       45       46       47       48 
      IF      DAY      HAD      MAN   BEHOLD  MEGIDDO   

7. Function that returns the __most frequent and least frequent words in a text__:

In [31]:
get_most_and_least_frequent_word_in_text <- function(filepath){
    # Given a pdf file path, it returns the words that appears the least and the most number of times
    # (we say words because several words might appear the same number times and those could be the most
    # and least frequent in the text)
    freq_words_text <- get_frequency_words_in_text(filepath)
    sorted_freq <- freq_words_text[order(freq_words_text)]
    sorted_freq <- sorted_freq[-(length(sorted_freq))]
    result_list <- list()
    least_freq_list <- list(frequency = unname(sorted_freq[1]), words = names(sorted_freq[sorted_freq == sorted_freq[1]]))
    most_freq_list <- list(frequency = unname(sorted_freq[length(sorted_freq)]), words = names(sorted_freq[sorted_freq == sorted_freq[length(sorted_freq)]]))
    result_list <- append(result_list, least_freq_list)
    result_list <- append(result_list, most_freq_list)
    least_frequent_message <- sprintf("The lowest frequency of a word in the text is %d by words: %s", least_freq_list$frequency, paste(least_freq_list$word, collapse = ", "))
    print(least_frequent_message)
    most_frequent_message <- sprintf("The highest frequency of a word in the text is %d by words: %s", most_freq_list$frequency, paste(most_freq_list$word, collapse = ", "))
    print(most_frequent_message)
    return(result_list)
}

In [32]:
print(get_most_and_least_frequent_word_in_text(var1))

[1] "The highest frequency of a word in the text is 276 by words: I"
$frequency
[1] 1

$words
  [1] "MORTE"                                            
  [2] "D'ARTHUR"                                         
  [3] "VOLUME"                                           
  [4] "MALORY"                                           
  [5] "EXPORTED"                                         
  [6] "SEPTEMBER"                                        
  [7] "14"                                               
  [8] "2023"                                             
  [9] "HOW"                                              
 [10] "BEFELL"                                           
 [11] "ENGLAND"                                          
 [12] "REIGNED"                                          
 [13] "HELD"                                             
 [14] "AGAINST"                                          
 [15] "TIME"                                             
 [16] "CHARGING"                    

### 7. Modelling

We will design an interactive program where we will list the files in the directory __books__ that contains the pdf and then offer the different methods

In [33]:
print("Welcome to BOOKS DATA:")
print("We can perform the following operations: ")
print("1. Get frequency of the words a the text")
print("2. Get every word in the files in our current directory")
print("3. Get words in one text but not in another one")
print("4. Find how many times a word appears in a text")
print("5. Compare the frequency of common words in two files")
print("6. Return the words that are more common than average in a certain text")
print("7. Get the most and the least frequent word in a text")
function_picked <- readline("Give me the number of the function you want to perform")

[1] "Welcome to BOOKS DATA:"
[1] "We can perform the following operations: "
[1] "1. Get frequency of the words a the text"
[1] "2. Get every word in the files in our current directory"
[1] "3. Get words in one text but not in another one"
[1] "4. Find how many times a word appears in a text"
[1] "5. Compare the frequency of common words in two files"
[1] "6. Return the words that are more common than average in a certain text"
[1] "7. Get the most and the least frequent word in a text"


Give me the number of the function you want to perform 7


In [34]:
print("Welcome to BOOKS DATA:")
print("This are the books we can analyze in our directory:")

[1] "Welcome to BOOKS DATA:"
[1] "This are the books we can analyze in our directory:"


In [35]:
files_list <- unlist(list.files(path=paste0(as.character(getwd()),"/books")))
files_index <- c()
i <- 1
for(filepath in files_list){
    print(paste0(i, ": ", filepath))
    i <- i + 1
}

[1] "1: An_Account_of_the_Battle_of_Megiddo.pdf"
[1] "2: Le_Morte_d'Arthur_Volume_I_Book_I_Chapter_I.pdf"


In [36]:
file_choice <- as.numeric(readline("Give me the index of the file you want to explore: "))
chosen_file <- files_list[file_choice]
print(sprintf("The file chosen was '%s'", chosen_file))

Give me the index of the file you want to explore:  1


[1] "The file chosen was 'An_Account_of_the_Battle_of_Megiddo.pdf'"


In [37]:
if(function_picked == "1"){
    print("Frequency of words in the text")
    print(get_frequency_words_in_text(paste0(as.character(getwd()),"/books/", chosen_file)))
}else if(function_picked == "2"){
    chosen_files <- c()
    chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
    chosen_files <- append(chosen_files, chosen_file)
    file_choice <- -1
    while(file_choice != 0){
        file_choice <- as.numeric(readline("Pick another file, give me the index of the file you want to explore or select 0: "))
        if(file_choice == 0){
            break
        }
        chosen_file <- files_list[file_choice]
        chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
        chosen_files <- append(chosen_files, chosen_file)
        print(sprintf("The file chosen was '%s'", chosen_file))
    }
    print("Let's perform the operation them")
    print("The words in every file in the list are:")
    print(get_words_in_every_text_list_of_files(chosen_files))
}else if(function_picked == "3"){
    file_choice <- as.numeric(readline("Pick another file, give me the index of the file you want to explore: "))
    chosen_file2 <- files_list[file_choice]
    chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
    chosen_file2 <- paste0(as.character(getwd()),"/books/",chosen_file2)
    print("The words in first file we selected but not in the second are:")
    print(words_in_text_a_not_in_text_b(chosen_file, chosen_file2))
}else if(function_picked == "4"){
    chosen_word <- as.character(readline("Give me a word whose frequency we can search for: "))
    chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
    print("The frequency of that word in the text is: ")
    print(find_frequency_of_word_in_text(chosen_file, chosen_word))
}else if(function_picked == "5"){
    file_choice <- as.numeric(readline("Pick another file, give me the index of the file you want to explore: "))
    chosen_file2 <- files_list[file_choice]
    chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
    chosen_file2 <- paste0(as.character(getwd()),"/books/",chosen_file2)
    print("The words in first file we selected that are more frequent than in the second are:")
    print(compare_frequency_of_common_words(chosen_file, chosen_file2))
}else if(function_picked == "6"){
    chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
    print("Returning words that appear more than the mean in the text: ")
    print(return_words_more_common_than_mean(chosen_file))
}else if(function_picked == "7"){
    chosen_file <- paste0(as.character(getwd()),"/books/",chosen_file)
    print("Most and least frequent words in the selected text are: ")
    print(get_most_and_least_frequent_word_in_text(chosen_file))
}

[1] "Most and least frequent words in the selected text are: "
[1] "The lowest frequency of a word in the text is 1 by words: ACCOUNT, TJANENI, EXPORTED, SEPTEMBER, 2023, MIGHTY, BULL, SHINING, UPPER, LANDS, TABLET, SETTING, CARRIED, AWAY, THEREIN, DONE, 22, SECOND, THARU, PERIOD, FALLEN, DISAGREEMENT, NEIGHBOR, HAPPENED, TRIBES, SHARUHEN, YERAZA, MARSHES, EARTH, BEGUN, KING'S, POSSESSION, RULER, GAZA, DEPARTURE, TRIUMPH, OVERTHROW, SEIZE, SIXTEENTH, YEHEM, ORDERED, CONSULTATION, VALIANT, FOLLOWS, \"THAT, GATHERED, WATER, FAR, NAHARIN, THUS, SPEAKS, 'I, ARISEN, MEGIDDO'\", \"HOW, SHOULD, THREATENS, WAITING, HOLDING, MULTITUDE, ADVANCEGUARD, REARGUARD, STANDING, YONDER, FOUGHT?, CARRY, ZEFTI, DESIRES, DIFFICULT, ROAD\", MESSENGERS, CONCERNING, DESIGN, UTTERED, VIEW, WHAT, BEEN, COURT, \"I, SWEAR, LOVES, FAVORS, REJUVENATED, SATISFYING, MENTIONED, ENEMIES, DETESTS, 'DOES, ANOTHER, BEGINS, FEARFUL, US', THINK\", \"MAY, PRESIDER, KARNAK, THEE, WHITHER, PROCEEDETH, SERVANT, MASTER\", MARCH,

### 8. Evaluation
In this case, this should be sent to the final customers for UAT in a demo environment to get some feedback from them

### 9. Deployment
Once we get approval this will be deployed to a server where it will be used by the university

### 10. Feedback
After some time, feedback was:
1. Improve the UI (make it more user friendly)
2. Add new function that can search for words related with a topic
3. Include data visualization