## Application Examples

Both scripts can be imported as modules to access their functions, or run from the command line. When run from the command line, use the following commands:
1. `text_stats.py <txt_file> <output_file>`, where the last argument is optional; if it is provided, the output is written to that file instead of being printed.
2. `generate_text.py <txt_file> <starting_word> <max_words>`, where the second argument specifies the word to start with and the third argument the maximum number of words to generate.
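
The argument layout described above could be parsed roughly as follows. This is only a sketch using `argparse`; the helper names `build_stats_parser` and `build_generate_parser` are hypothetical, and the scripts themselves may read their arguments differently (for example directly from `sys.argv`).

```python
import argparse


def build_stats_parser():
    """Hypothetical parser mirroring the text_stats.py arguments listed above."""
    parser = argparse.ArgumentParser(prog="text_stats.py")
    parser.add_argument("txt_file", help="path to the input text file")
    # nargs="?" makes the positional argument optional; it is None when omitted
    parser.add_argument("output_file", nargs="?", default=None,
                        help="if given, write the output here instead of printing it")
    return parser


def build_generate_parser():
    """Hypothetical parser mirroring the generate_text.py arguments listed above."""
    parser = argparse.ArgumentParser(prog="generate_text.py")
    parser.add_argument("txt_file", help="path to the input text file")
    parser.add_argument("starting_word", help="word the generated text starts with")
    parser.add_argument("max_words", type=int, help="maximum number of words to generate")
    return parser


# Parse explicit argument lists instead of sys.argv so the sketch runs anywhere
stats_args = build_stats_parser().parse_args(["shakespeare.txt"])
gen_args = build_generate_parser().parse_args(["shakespeare.txt", "king", "50"])
print(stats_args.output_file, gen_args.starting_word, gen_args.max_words)
```

With this layout, leaving out `<output_file>` simply yields `None`, which a script could treat as "print to the console".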

### Generation of Shakespeare-like text:

In [1]:
import numpy as np
from text_stats import get_word_list, get_string_from_txt
from generate_text import get_neighbour_matrix, get_random_sentence

In [2]:
word_list = get_word_list(get_string_from_txt("shakespeare.txt"))

In [3]:
get_random_sentence(word_list, "I", 50)

"i like rotten policy respecting you now this were there be taken not had a paper this will answer prince your husband's lands to answer your highness loss of the lady lady percy my venison master o the jew and building how now on caesar have told him an outward part"

In [4]:
get_random_sentence(word_list, "This", 300)

"this night methought he found by all struck first citizen why is that you when nobles bearing should wear their accusers and exeter bedford and vanquished these labours which ten you help me i took me pistol come from thy miseries on sweetest flowers i say do wink being wasted in thy part of philippi and you for a good news by thy doom this very good gentlemen let’s carve break promise of them the name ursula sure exeunt all will i warrant you _exit_ scene iv duke humphrey being blown lies in the morning no reason says she's like an end i do seem night great siz’d monster fly from such a speak cordelia time but tell thee meet us all the ground inhabitable where douglas arm arm as they have the strongest in their eyes of quiet course what says very true i am above the red and do nothing is my crown of your word of syracuse i'll buy it grieves me next is master corporate bardolph be not how then to joy pleasance revel it pistol me would say to fly would kill coriolanuswho falls are we m

### Generation of Michael Jackson-like lyrics:

In [11]:
word_list = get_word_list(get_string_from_txt("michael_jackson.txt"))

In [21]:
get_random_sentence(word_list, "What", 50)

"what you gotta hear it my soul it's the entire human everyone's taking you remember the wings of mehe really had enough keep playing chorus so relaxed mingle jingle in the window i love you down these soldiers come to give in the boogie alright keep it feed it no one"

In [23]:
get_random_sentence(word_list, "I", 300)

"i want to get me workin workin day but wait mother's preaching abraham brothers create your love's a man's supposed to be showing you i don't be grievin this particular monster heâ s a cold man the lines divinity in love in blood is it rap performed before you move ya always missin maybe you scream in my hand tahitian tanned in my every page you feelin insane swift and shout it so easy so dangerous i feel real tired now in the time i wanna hit the border it's just what about us we can't take your alibis and i'll smile what's goin down turn us just to me keep on baby i would lie for you can't believe the way how you to talk ya ooooooh you know it to be trippin why do giving me and girl so tired of us to me be a better place too far cause when this is saying i am in the needs me it for you and then you warm you love me he say she says that makes the color that life and everything make a compramise ya masquerade the fallin in my feet my father proud to read sheâ ll be complanin ooh go of

### Further thoughts:

We decided to store all neighbours, used to compute the next-word probabilities, in an NxN NumPy neighbour matrix. Even with the matrix precomputed, generating a string still took very long, although each step only had to look up the probabilities in the matrix and make a random choice. We then discovered that `np.random.choice()` is far slower (~14x, 8.4 s vs. 0.59 s) when choosing from a list of strings than from a numeric array of indices. The following profiles show both methods generating 2000 words after computing the `next_matr` neighbour matrix:

In [24]:
word_list = get_word_list(get_string_from_txt("shakespeare.txt"))
idx_map, unique_words, next_matr = get_neighbour_matrix(word_list)
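
For readers without the source at hand, the idea of a neighbour matrix can be illustrated on a toy word list. The function `toy_neighbour_matrix` below is a hypothetical stand-in, not the project's `get_neighbour_matrix`; it only assumes the same three return values (a word-to-index map, the list of unique words, and a matrix whose row `i` holds the probabilities of each word following word `i`).

```python
import numpy as np


def toy_neighbour_matrix(words):
    """Build a row-normalised bigram matrix (illustrative sketch only)."""
    unique_words = sorted(set(words))
    idx_map = {w: i for i, w in enumerate(unique_words)}
    n = len(unique_words)

    # counts[i, j] = how often word j directly follows word i
    counts = np.zeros((n, n), dtype=float)
    for prev, nxt in zip(words, words[1:]):
        counts[idx_map[prev], idx_map[nxt]] += 1

    # Turn counts into probabilities; rows without any successor stay all-zero
    row_sums = counts.sum(axis=1, keepdims=True)
    probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    return idx_map, unique_words, probs


words = "the king is the king of the land".split()
toy_idx_map, toy_unique_words, toy_matr = toy_neighbour_matrix(words)
print(toy_unique_words)
print(toy_matr[toy_idx_map["the"]])
```

A dense matrix like this needs memory quadratic in the vocabulary size, which is the price paid for the simple row lookup used in the profiled loops.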

In [30]:
import cProfile 

with cProfile.Profile() as prof:
    curr_word = "king"
    sent = curr_word
    for i in range(2000):
        idx_row = idx_map[curr_word]
        probs = next_matr[idx_row, ]

        if np.sum(probs) == 0: # Check if there are any next words, otherwise break
            print("No next words found. - Stop!")
            break

        choice = np.random.choice(unique_words, p=probs)

        curr_word = choice
        sent = sent + " " + curr_word
prof.print_stats()

         40002 function calls in 8.433 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    0.003    0.000    0.075    0.000 <__array_function__ internals>:2(sum)
        1    0.000    0.000    0.000    0.000 cProfile.py:133(__exit__)
     2000    0.000    0.000    0.000    0.000 fromnumeric.py:2100(_sum_dispatcher)
     2000    0.004    0.000    0.069    0.000 fromnumeric.py:2105(sum)
     2000    0.006    0.000    0.064    0.000 fromnumeric.py:70(_wrapreduction)
     2000    0.002    0.000    0.002    0.000 fromnumeric.py:71(<dictcomp>)
     4000    0.009    0.000    0.010    0.000 getlimits.py:366(__new__)
     4000    0.005    0.000    0.007    0.000 numerictypes.py:286(issubclass_)
     2000    0.004    0.000    0.011    0.000 numerictypes.py:360(issubdtype)
     2000    0.001    0.000    0.001    0.000 {built-in method builtins.isinstance}
     6000    0.003    0.000    0.003    0.000 {built-in method builtins

In [31]:
with cProfile.Profile() as prof:
    curr_word = "king"
    idx_curr_word = idx_map[curr_word]
    rev_idx_map = {v:k for k, v in idx_map.items()}
    all_idx = np.arange(len(unique_words))
    chosen = [idx_curr_word]
    #sent = curr_word

    for i in range(2000):
        probs = next_matr[idx_curr_word, ]

        if np.sum(probs) == 0: # Check if there are any next words, otherwise break
            print("No next words found. - Stop!")
            break

        choice = np.random.choice(all_idx, p=probs)

        chosen.append(choice) 
        idx_curr_word = choice

    sent = " ".join([rev_idx_map[wrd_id] for wrd_id in chosen])
prof.print_stats()

         42008 function calls in 0.591 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     2000    0.001    0.000    0.046    0.000 <__array_function__ internals>:2(sum)
        1    0.001    0.001    0.001    0.001 <ipython-input-31-b17e0b5f081b>:21(<listcomp>)
        1    0.001    0.001    0.001    0.001 <ipython-input-31-b17e0b5f081b>:4(<dictcomp>)
        1    0.000    0.000    0.000    0.000 cProfile.py:133(__exit__)
     2000    0.000    0.000    0.000    0.000 fromnumeric.py:2100(_sum_dispatcher)
     2000    0.001    0.000    0.044    0.000 fromnumeric.py:2105(sum)
     2000    0.002    0.000    0.042    0.000 fromnumeric.py:70(_wrapreduction)
     2000    0.001    0.000    0.001    0.000 fromnumeric.py:71(<dictcomp>)
     4000    0.002    0.000    0.002    0.000 getlimits.py:366(__new__)
     4000    0.001    0.000    0.002    0.000 numerictypes.py:286(issubclass_)
     2000    0.001    0.000    0.003    0.000 nu
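
The comparison above can be reproduced without the Shakespeare corpus. The sketch below uses a synthetic vocabulary and uniform probabilities, so all names (`vocab`, `draw_strings`, `draw_indices`) are assumptions for illustration, and the absolute timings will differ from the profiles and from machine to machine. One likely reason for the gap, stated with some caution, is that a Python list of strings has to be converted to a NumPy array on every call, whereas an integer array can be used as it is.

```python
import timeit

import numpy as np

vocab = [f"w{i}" for i in range(5000)]      # synthetic vocabulary (list of str)
all_idx = np.arange(len(vocab))             # matching integer indices
probs = np.full(len(vocab), 1.0 / len(vocab))


def draw_strings(n):
    """Sample n words by choosing directly from the list of strings."""
    return [np.random.choice(vocab, p=probs) for _ in range(n)]


def draw_indices(n):
    """Sample n indices first, then map them back to words afterwards."""
    chosen = [np.random.choice(all_idx, p=probs) for _ in range(n)]
    return [vocab[i] for i in chosen]


t_str = timeit.timeit(lambda: draw_strings(200), number=1)
t_idx = timeit.timeit(lambda: draw_indices(200), number=1)
print(f"strings: {t_str:.3f} s, indices: {t_idx:.3f} s")
```

Sampling indices and translating them to words only once at the end mirrors the second profiled loop, where the join over `rev_idx_map` happens after the loop has finished.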