# PS 18 Sep - Dictionaries, arrays

### Working with words
* Use a dictionary to find all the anagrams of a word (using the word list "words.txt").
* Also write a function that returns the list of words with the most anagrams.
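
One possible starting point is sketched below; it is not the only way to solve the exercise. It assumes that two words are anagrams exactly when their sorted letters match, and it uses a short in-memory sample instead of "words.txt". The names `signature`, `anagram_groups` and `most_anagrams` are hypothetical helpers introduced for this illustration.

```Julia
# Anagrams share the same letters, so the sorted letters work as a dictionary key.
signature(word) = join(sort(collect(lowercase(word))))

# Map each signature to the list of words that have it.
function anagram_groups(words)
    groups = Dict{String, Vector{String}}()
    for word in words
        push!(get!(groups, signature(word), String[]), word)
    end
    return groups
end

# Return every group that has the maximal number of members.
function most_anagrams(groups)
    longest = maximum(length, values(groups))
    return [group for group in values(groups) if length(group) == longest]
end

# Made-up sample; for the real exercise use e.g. readlines("words.txt").
sample = ["listen", "silent", "enlist", "google", "stone", "notes"]
groups = anagram_groups(sample)
println(groups[signature("listen")])
println(most_anagrams(groups))
```

Looking up the anagrams of a word is then a single dictionary access with the word's signature as key.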


### Working with text
Write a function that reads a text file and returns all the words it contains as an array.

Create a histogram of the words in the book. What are the ten most frequent words?
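
The counting step could look like the sketch below. It assumes the words are already available as an array of strings; `histogram` and `most_frequent` are hypothetical names, and the sentence used as input is made up.

```Julia
# Count how often each word occurs.
function histogram(words)
    counts = Dict{String, Int}()
    for word in words
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

# Return the n most frequent words as word => count pairs, most frequent first.
function most_frequent(counts, n)
    ranked = sort(collect(counts), by = last, rev = true)
    return ranked[1:min(n, length(ranked))]
end

words = String.(split("the cat and the dog and the bird"))
top = most_frequent(histogram(words), 2)
println(top)
```

Words with equal counts may appear in any order, because a `Dict` does not keep its keys in a fixed order.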

### Zipf's law

The “rank” of a word is its position in an array of words sorted by frequency: the most common word has rank 1, the second most common has rank 2, etc.

Zipf’s law describes a relationship between the ranks and frequencies of words in natural languages [(cf. Wikipedia)](https://en.wikipedia.org/wiki/Zipf's_law). Specifically, it predicts that the frequency $f$ of the word with rank $r$ is:
\begin{equation}
f=cr^{-s}
\end{equation}
where $s$ and $c$ are parameters that depend on the language and the text. If you take the logarithm of both sides of this equation, you get:
\begin{equation}
\log{f}=\log{c}-s\log{r}
\end{equation}

So if you plot $\log{f}$ versus $\log{r}$, you should get a straight line with slope $-s$ and intercept $\log{c}$.

Write a program that reads a text from a file and counts word frequencies. Determine $c$ and $s$ experimentally for texts in different languages.

For the linear regression you can use the following:
```Julia
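"""
Simple linear regression implementation.
Returns `[a, b]`, the least-squares coefficients of `y ≈ a + b*x`.
"""
function linreg(x, y)
    # The column of ones fits the intercept; `\\` solves the least-squares problem.
    hcat(fill!(similar(x), 1), x) \ y
end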
```
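
To see how the regression output maps back to $c$ and $s$, the sketch below fits synthetic data that follows Zipf's law exactly. The values 1000 and 1.2 are made up for illustration and do not come from any real text; the helper is repeated so that the block runs on its own.

```Julia
using LinearAlgebra  # provides the least-squares `\` for matrices

# Same helper as above, repeated so that this block is self-contained.
linreg(x, y) = hcat(fill!(similar(x), 1), x) \ y

# Synthetic frequencies that follow f = c * r^(-s) exactly (made-up parameters).
c_true, s_true = 1000.0, 1.2
ranks = collect(1.0:50.0)
freqs = c_true .* ranks .^ (-s_true)

# Fit log(f) = log(c) - s * log(r): the intercept is log(c), the slope is -s.
intercept, slope = linreg(log.(ranks), log.(freqs))
c_est = exp(intercept)
s_est = -slope
println("c ≈ ", round(c_est, digits = 2), ", s ≈ ", round(s_est, digits = 2))
```

With real texts the points only roughly follow a line, so the estimates depend on which ranks are included in the fit.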
Try your program on several texts: "3mousquetaires.txt", "Dracula.txt", "deoogst.txt"

### Markov analysis
Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the words, and converts them to lowercase. Hint: the function `isletter` tests whether a character is alphabetic.
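
A minimal sketch of this cleaning step follows. It assumes that "punctuation" means any non-letter character at the start or end of a word, so characters inside a word (such as an apostrophe) are kept. `clean_words` and `read_words` are hypothetical names, and `strip` with a predicate requires Julia 1.2 or later.

```Julia
# Split a line on whitespace, strip non-letters from both ends of each token,
# and convert what remains to lowercase. Tokens without any letters are dropped.
function clean_words(line)
    words = String[]
    for token in split(line)
        word = lowercase(strip(!isletter, token))
        isempty(word) || push!(words, word)
    end
    return words
end

# Collect the cleaned words of every line in a file.
function read_words(path)
    words = String[]
    for line in eachline(path)
        append!(words, clean_words(line))
    end
    return words
end

println(clean_words("Half a bee, philosophically,"))
```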


A series of random words seldom makes sense because there is no relationship between successive words. For example, in a real sentence you would expect an article like “the” to be followed by an adjective or a noun, and probably not a verb or adverb.

One way to measure these kinds of relationships is Markov analysis, which characterizes, for a given sequence of words, the probability of the words that might come next. For example, the song “Eric, the Half a Bee” begins:

```
Half a bee, philosophically,
Must, ipso facto, half not be.
But half the bee has got to be
Vis a vis, its entity. D’you see?

But can a bee be said to be
Or not to be an entire bee
When half the bee is not a bee
Due to some ancient injury?
```

In this text, the phrase “half the” is always followed by the word “bee”, but the phrase “the bee” might be followed by either “has” or “is”.

The result of Markov analysis is a mapping from each prefix (like “half the” and “the bee”) to all possible suffixes (like “has” and “is”).

Given this mapping, you can generate a random text by starting with any prefix and choosing at random from the possible suffixes. Next, you can combine the end of the prefix and the new suffix to form the next prefix, and repeat.

For example, if you start with the prefix “Half a”, then the next word has to be “bee”, because the prefix only appears once in the text. The next prefix is “a bee”, so the next suffix might be “philosophically”, “be” or “due”.
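
The mapping and the generation step described above can be sketched as follows. This is one possible design, not a reference solution: it assumes prefixes of two words stored as a single space-separated string, and it hard-codes the already cleaned lyrics instead of reading a file. `markov_map` and `generate` are hypothetical names.

```Julia
# Map every prefix of `order` consecutive words to the words that follow it.
# A suffix that follows a prefix several times is stored several times,
# so frequent continuations are proportionally more likely to be drawn.
function markov_map(words; order = 2)
    mapping = Dict{String, Vector{String}}()
    for i in 1:(length(words) - order)
        prefix = join(words[i:(i + order - 1)], " ")
        push!(get!(mapping, prefix, String[]), words[i + order])
    end
    return mapping
end

# Start from the words in `start`, then repeatedly draw a random suffix and
# slide the prefix window forward by one word.
function generate(mapping, start, n)
    output = collect(start)
    order = length(start)
    for _ in 1:n
        prefix = join(output[(end - order + 1):end], " ")
        haskey(mapping, prefix) || break  # dead end: prefix never seen with a suffix
        push!(output, rand(mapping[prefix]))
    end
    return join(output, " ")
end

# The lyrics quoted above, already lowercased and stripped of punctuation.
words = String.(split("""
    half a bee philosophically must ipso facto half not be
    but half the bee has got to be vis a vis its entity d’you see
    but can a bee be said to be or not to be an entire bee
    when half the bee is not a bee due to some ancient injury
    """))

mapping = markov_map(words)
println(mapping["the bee"])
println(generate(mapping, ["half", "a"], 8))
```

Storing the prefix as a joined string keeps the dictionary keys simple; a tuple of words would work as well and avoids rebuilding the string at every step.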