# `lab04`—Probabilistic Language Prediction

**Objectives**

-   Work with online data sources (using the `requests` library).
-   Learn the standard pipeline of data analysis:  data cleaning and preparation, data processing, and output.

### Language Letter Frequency

A random sampling of English text produces approximately the following letter frequency distribution:

<img src="./img/freq-eng.png" width="80%;"/>

whereas Latin has the letter frequency distribution:

<img src="./img/freq-lat.png" width="80%;"/>

and Welsh has the letter frequency distribution:

<img src="./img/freq-cym.png" width="80%;"/>

Each language tends to have a unique "fingerprint" because of the relative frequency of letters and sounds.  Such letter frequency information could be used, for instance, to determine how much type should be ordered for a letterpress, or how many tiles should be included in a country-specific version of Scrabble.

Today you will use this fingerprint to assign rough probabilities to the likely language of a given text sample in an unknown language.  (This is similar to what [Google Translate](https://translate.google.com/) does when it auto-detects the language of a text sample, except that it uses whole words instead of letter frequencies to make its guess.)

There are three steps in the data processing pipeline for you to complete today:

1.  Count the number of times each letter occurs in the text sample.  Then divide each count by the total number of letters.  (That is, *normalize* the frequencies.)
1.  Load the reference language frequencies.
1.  Predict the most likely language based on comparing the text letter frequency with each of the reference frequencies.

Our intent is to reproduce this toolset:

![](./img/flowchart.png)

You will use and compose these functions to extract information from the data sources (listed here as "language files") and use it to study text samples.

<br/>
<div class="alert alert-info">
We will restrict ourselves to the 26 letters of the basic Latin alphabet, disallowing diacritics ('naïve'→'naive'), accents ('recherché'→'recherche'), and nonbasic letters ('Skjærvø'→'Skjarvo').  (We regret this rank ASCIIcentrism.)
</div>
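
One way to perform this reduction in code is sketched below.  It is only an illustration and is not required for the lab: the helper name `to_basic_latin` and the small `EXTRA_MAP` table are assumptions introduced here.  Accented letters are decomposed with the standard `unicodedata` module and their combining marks are dropped; letters such as 'æ' and 'ø' have no such decomposition, so they are mapped by hand.

```python
import unicodedata

# Hypothetical hand-written table for letters that do not decompose into
# a basic letter plus an accent mark (an assumption for this sketch).
EXTRA_MAP = str.maketrans({'æ': 'a', 'Æ': 'A', 'ø': 'o', 'Ø': 'O'})

def to_basic_latin(text):
    """Return `text` with diacritics removed and a few nonbasic letters mapped."""
    text = text.translate(EXTRA_MAP)
    # NFKD splits an accented letter into the base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep everything except the combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ['naïve', 'recherché', 'Skjærvø']:
    print(word, '->', to_basic_latin(word))
```

The text samples in this lab have mostly been reduced already, so you will not need this helper to complete the exercises.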

### 1.  Calculate the normalized letter frequencies.

In order to calculate letter frequencies, you need a list of the letters to count and a copy of the text in all upper-case letters.  Python's `string` module provides the upper-case letters as the built-in string `ascii_uppercase`.  To avoid confusion, we will rename it `alphabet` when we `import` it.

In [3]:
#grade
from string import ascii_uppercase as alphabet
# use `alphabet` as ascii_uppercase from now on
print(alphabet)

ABCDEFGHIJKLMNOPQRSTUVWXYZ


In [4]:
#grade
# Our example text.
text = 'Jackdaws love my big Sphinx of Quartz.'
text = text.upper()
print(text)

JACKDAWS LOVE MY BIG SPHINX OF QUARTZ.


Next we create an empty frequency dictionary `letter_freq`.  Loop over each letter of the `alphabet` and `count` the number of times each letter occurs in `text`.  Add this count to `letter_freq`.

In [5]:
#grade
letter_freq = {}  # a blank dictionary

# Loop over the alphabet.
for letter in alphabet:
    # For each letter, get the number of times it occurs in the string `text`.
    letter_count = text.count(letter)
    letter_freq[letter] = letter_count

letter_freq

{'A': 3,
 'B': 1,
 'C': 1,
 'D': 1,
 'E': 1,
 'F': 1,
 'G': 1,
 'H': 1,
 'I': 2,
 'J': 1,
 'K': 1,
 'L': 1,
 'M': 1,
 'N': 1,
 'O': 2,
 'P': 1,
 'Q': 1,
 'R': 1,
 'S': 2,
 'T': 1,
 'U': 1,
 'V': 1,
 'W': 1,
 'X': 1,
 'Y': 1,
 'Z': 1}
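
The same counts can also be produced with `collections.Counter` from the standard library.  This is a sketch of an alternative, not the approach the lab asks for; the names `sample`, `counts`, and `letter_counts` are introduced here for illustration.

```python
from collections import Counter
from string import ascii_uppercase as alphabet

sample = 'JACKDAWS LOVE MY BIG SPHINX OF QUARTZ.'

# Count only the characters that are upper-case letters.
counts = Counter(ch for ch in sample if ch in alphabet)

# A Counter returns 0 for a missing key, so every letter gets an entry.
letter_counts = {letter: counts[letter] for letter in alphabet}
print(letter_counts)
```

The explicit loop with `str.count` is easier to follow when you are starting out; `Counter` makes a single pass over the text, which matters more for long samples.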

The final step is to normalize the values.  To do this, we need to calculate the total number of letters in `text` (letters, NOT whitespace or punctuation).  Since this is a bit involved, the following lines of code will give us a copy of `text` without whitespace or punctuation:

In [7]:
#grade
# These are built-in collections of characters, useful for just this sort of filtering.
from string import whitespace, punctuation, digits
print(whitespace, punctuation, digits)
for character in whitespace+punctuation+digits:
    text = text.replace(character, '')

 	
 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 0123456789


Now set each frequency value in the dictionary to its normalized value.

In [8]:
#grade
for key in letter_freq.keys():
    letter_freq[key] = letter_freq[key] / len(text)
letter_freq

{'A': 0.0967741935483871,
 'B': 0.03225806451612903,
 'C': 0.03225806451612903,
 'D': 0.03225806451612903,
 'E': 0.03225806451612903,
 'F': 0.03225806451612903,
 'G': 0.03225806451612903,
 'H': 0.03225806451612903,
 'I': 0.06451612903225806,
 'J': 0.03225806451612903,
 'K': 0.03225806451612903,
 'L': 0.03225806451612903,
 'M': 0.03225806451612903,
 'N': 0.03225806451612903,
 'O': 0.06451612903225806,
 'P': 0.03225806451612903,
 'Q': 0.03225806451612903,
 'R': 0.03225806451612903,
 'S': 0.06451612903225806,
 'T': 0.03225806451612903,
 'U': 0.03225806451612903,
 'V': 0.03225806451612903,
 'W': 0.03225806451612903,
 'X': 0.03225806451612903,
 'Y': 0.03225806451612903,
 'Z': 0.03225806451612903}
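
A useful sanity check at this point is that normalized frequencies sum to 1.  The sketch below repeats the computation on the same sentence in a self-contained form; the names `sample`, `n_letters`, `freqs`, and `total` are introduced here for illustration.  Because floating-point sums are rarely exact, the comparison uses `math.isclose`.

```python
from math import isclose
from string import ascii_uppercase as alphabet

sample = 'JACKDAWS LOVE MY BIG SPHINX OF QUARTZ.'

# Total number of letters, ignoring spaces and punctuation.
n_letters = sum(sample.count(letter) for letter in alphabet)

# Normalized frequency of each letter.
freqs = {letter: sample.count(letter) / n_letters for letter in alphabet}

total = sum(freqs.values())
print(n_letters, isclose(total, 1.0))
```

If the sum is noticeably different from 1, the most likely cause is that the count used for normalization includes characters that are not letters.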

Now we will turn the above process into a general function to process a string into its letter frequency.

### <span style="color:#345995">Exercise 1: Calculate the Normalized Letter Frequencies</span>

Compose a function `calc_freq` which accepts a string `text`.  `calc_freq` should `return` a dictionary containing the normalized frequency by letter.
    
You should use the process outlined above to write this function.
    
<div class="alert alert-warning">
When diagnosing the behavior of your code, we encourage you to use `print` statements freely.
</div>

In [21]:
#grade
# define your function here
from string import whitespace, punctuation, digits
from string import ascii_uppercase as alphabet

def calc_freq(text):
    '''
    Calculate the frequency in the text of each letter
    Args:
        string: a piece of text 
    Returns:
        dict: the frequency of each letter in a dictionary (e.g. letter_freq['A'] gives 0.06)
    '''
    # Create an empty frequency dictionary letter_freq.
    letter_freq_dict = {}
    
    # Initialize values in the letter_freq_dict
    ## YOUR CODE HERE
    
    # Make text upper-case.
    text = text.upper()
    
    # Loop over each letter of the alphabet:
    for letter in alphabet:
        # Count the number of times each letter occurs in text.
        letter_count = text.count(letter)
        # Add this count to letter_freq_dict.
        letter_freq_dict[letter] = letter_count
    
    # Make a copy of text without non-alphabet characters.
    for character in whitespace+punctuation+digits:
        text = text.replace(character, '')
    
    # Normalize the frequencies and put the results back into letter_freq_dict.
    for key in letter_freq_dict.keys():
        letter_freq_dict[key] = letter_freq_dict[key] / len(text)
    
    # Finally, return the dict letter_freq.
    return letter_freq_dict

In [22]:
# test your code here.  You may edit this cell, and you may use any sample text, but the following is provided for convenience.
text = """Neither the naked hand nor the understanding left to itself can effect much. It is by instruments and helps that the work is done,
which are as much wanted for the understanding as for the hand. And as the instruments of the hand either give motion or guide it, so the
instruments of the mind supply either suggestions for the understanding or cautions.  (Francis Bacon, Novum Organon, Aphorism II)"""
calc_freq(text)

{'A': 0.065625,
 'B': 0.00625,
 'C': 0.025,
 'D': 0.05,
 'E': 0.10625,
 'F': 0.03125,
 'G': 0.025,
 'H': 0.071875,
 'I': 0.078125,
 'J': 0.0,
 'K': 0.00625,
 'L': 0.0125,
 'M': 0.028125,
 'N': 0.109375,
 'O': 0.065625,
 'P': 0.0125,
 'Q': 0.0,
 'R': 0.0625,
 'S': 0.075,
 'T': 0.10625,
 'U': 0.040625,
 'V': 0.00625,
 'W': 0.009375,
 'X': 0.0,
 'Y': 0.00625,
 'Z': 0.0}

In [23]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_text1 = """The study of nature with a view to works is engaged in by the mechanic, the mathematician, the physician, the alchemist, and
the magician; but by all (as things now are) with slight endeavor and scanty success.  (Francis Bacon, Novum Organon, Aphorism V)"""
result_text1 = calc_freq(test_text1)
assert isclose(result_text1['T'], 0.09045226130653267) and \
       isclose(result_text1['Q'], 0.0) and \
       isclose(result_text1['Y'], 0.0251256281407035)
print('Success!')

Success!


In [24]:
# it should pass this test---do NOT edit this cell
test_text2 = """In order to penetrate into the inner and further recesses of nature, it is necessary that both notions and axioms be derived
from things by a more sure and guarded way, and that a method of intellectual operation be introduced altogether better and more certain.
(Francis Bacon, Novum Organon, Aphorism XVIII)"""
result_text2 = calc_freq(test_text2)
assert isclose(result_text2['K'], 0.0) and \
       isclose(result_text2['N'], 0.09523809523809523) and \
       isclose(result_text2['L'], 0.015873015873015872)
print('Success!')

Success!


### 2.  Load the reference language frequencies.

Each language has a characteristic pattern of letter frequencies.  In previous labs, we stored data like these on the disk as files.  This time, we will use the `requests` library to acquire data available on the Web.  Reference frequencies for the following languages are available.  (These frequencies are derived from the work of Stefan Trost<sup>[[Trost2015](http://www.sttmedia.com/characterfrequencies)]</sup> and used with his permission.)

- [Afrikaans](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/afrikaans)
- [Catalan](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/catalan)
- [Danish](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/danish)
- [English](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/english)
- [Finnish](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/finnish)
- [French](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/french)
- [German](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/german)
- [Latin](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/latin)
- [Polish](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/polish)
- [Portuguese](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/portuguese)
- [Spanish](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/spanish)
- [Welsh](https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/welsh)

In [65]:
#grade
import requests
# example_data = requests.get( 'https://raw.githubusercontent.com/UI-CS101/cs101-wiki/master/lab07/danish' )
# print( example_data.text )

In order to obtain the reference language frequencies, you will write a function `load_ref` to load a given language reference URL.  You will then write a function `load_languages`, which uses `load_ref` with a list of available languages to create a `dict` of all of the available language frequencies.

Take a look at the format of `example_data` (Danish):
    
    A,8.27%
    B,1.42%
    C,0.45%
    ...

If you wanted to read this into a dictionary, you could take each line and split it by the comma.
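
As a small illustration of that split (the variable names `line`, `letter`, and `percent` are chosen here for the example):

```python
line = 'A,8.27%'

# str.split(',') returns a list of the pieces between commas.
letter, percent = line.split(',')

print(letter)
print(percent)
```

Both pieces are still strings at this point, including the percentage.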

Since you want to include the second part as a `float`, you need to convert it.  Try this out directly (but it will fail):

In [26]:
testDict = {}
testDict['A'] = float('8.27%')

ValueError: could not convert string to float: '8.27%'

<div class="alert alert-danger">
The problem is that `float` only accepts a string that looks like a number (such as `'8.27'`).  The percent sign is not part of a valid number, so Python cannot parse this string into a `float` and raises a `ValueError`.
</div>
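
The short sketch below shows which strings `float` accepts.  The helper `try_float` is hypothetical and exists only for this demonstration; it returns `None` when the conversion fails instead of stopping with an error.

```python
def try_float(s):
    """Return float(s), or None if s is not a valid numeric string."""
    try:
        return float(s)
    except ValueError:
        return None

print(try_float('8.27'))              # a plain number converts
print(try_float('8.27%'))             # the percent sign makes it fail
print(try_float('8.27%'.rstrip('%'))) # removing the sign fixes it
```

Note that the converted value is still a percentage (8.27), not a fraction (0.0827).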

### <span style="color:#345995">Exercise 2: Convert Percentage Strings to `float`s</span>

In order to convert a string of a percent value into a float, compose a function `p2f` (short for `percentToFloat`) which accepts a string `value`.  `p2f` `strip`s the percent sign off of the string `value`, converts this to a `float`, and then divides by `100` and `return`s the result.  (Python provides a function `round` which you may elect to use here to simplify the result, but this is not required.)

In [27]:
# str.strip() example
a = ' hi hello.     '
print(a.strip())

hi hello.


In [28]:
# str.replace() example
a = ' hi hello.     '
print(a.replace(" ", ""))

hihello.


In [45]:
#grade
# define your function here
def p2f(value):
    '''
    Take a string in the format of '8.27%', 
    and convert it to a number (0.0827 in this case).
    Args:
        string: n%
    Returns:
        float: a number between 0 and 1
    '''
    # Strip any whitespace and then strip the percent sign off of value.
    value = value.strip().replace("%" ,"")
    
    # Convert the result to a float and divide by 100.
    
    result = float(value) / 100
    
    # Finally, return the result.
    return result

In [46]:
# test your code here.  You may edit this cell, and you may use any sample value, but the following is provided for convenience.
value = "5.6%"
p2f(value)

0.055999999999999994

In [47]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
assert isclose(p2f('1.79%'), 0.0179)
print('Success!')

Success!


Now try to add to the dictionary:

In [62]:
testDict = {}
testDict['A'] = p2f('8.27%')

### <span style="color:#345995">Exercise 3: Open Web Data</span>

Compose a function `open_url` which accepts a string `language`.  `open_url` should `return` a `str` containing the reference language letter frequencies stored at the URL of the form given below.

In [57]:
# test your code here.  You may edit this cell, and you may use any language listed above, but the following is provided for convenience.
language = 'polish'
open_url(language)

'A,9.16%\nB,1.93%\nC,4.49%\nD,3.35%\nE,9.81%\nF,0.26%\nG,1.46%\nH,1.25%\nI,8.83%\nJ,2.28%\nK,3.01%\nL,4.62%\nM,2.81%\nN,5.85%\nO,8.32%\nP,2.87%\nQ,0.00%\nR,4.15%\nS,4.85%\nT,3.85%\nU,2.06%\nV,0.00%\nW,4.11%\nX,0.00%\nY,4.03%\nZ,5.50%\n'

In [55]:
#grade
languageNames = [ 'afrikaans','catalan','danish','english','finnish','french',
                  'german','latin','polish','portuguese','spanish','welsh' ]
Lang = {}
Lang['afrikaans'] = """A,7.94%
B,1.60%
C,0.30%
D,5.76%
E,18.12%
F,0.80%
G,3.58%
H,1.69%
I,8.47%
J,0.32%
K,3.07%
L,4.03%
M,2.18%
N,8.07%
O,5.98%
P,1.48%
Q,0.01%
R,6.52%
S,6.89%
T,5.74%
U,2.27%
V,2.27%
W,1.87%
X,0.02%
Y,0.99%
Z,0.04%
"""
Lang['catalan']="""A,13.35%
B,1.48%
C,3.36%
D,3.36%
E,14.21%
F,0.82%
G,1.24%
H,0.82%
I,6.50%
J,0.39%
K,0.11%
L,6.41%
M,3.49%
N,6.20%
O,5.38%
P,2.93%
Q,1.27%
R,7.12%
S,7.94%
T,6.31%
U,4.27%
V,2.17%
W,0.04%
X,0.51%
Y,0.15%
Z,0.10%
"""
Lang['danish']="""A,8.27%
B,1.42%
C,0.45%
D,6.65%
E,16.09%
F,2.20%
G,4.88%
H,2.40%
I,5.73%
J,0.94%
K,3.26%
L,4.90%
M,3.29%
N,7.32%
O,5.32%
P,1.31%
Q,0.01%
R,7.63%
S,5.18%
T,7.19%
U,1.88%
V,2.90%
W,0.08%
X,0.05%
Y,0.51%
Z,0.04%
"""
Lang['english'] = """A,8.34%
B,1.54%
C,2.73%
D,4.14%
E,12.60%
F,2.03%
G,1.92%
H,6.11%
I,6.71%
J,0.23%
K,0.87%
L,4.24%
M,2.53%
N,6.80%
O,7.70%
P,1.66%
Q,0.09%
R,5.68%
S,6.11%
T,9.37%
U,2.85%
V,1.06%
W,2.34%
X,0.20%
Y,2.04%
Z,0.06%
"""
Lang['finnish']="""A,16.66%
B,0.12%
C,0.27%
D,0.91%
E,8.42%
F,0.09%
G,0.30%
H,2.49%
I,10.46%
J,2.07%
K,4.92%
L,5.87%
M,3.18%
N,9.14%
O,5.89%
P,1.77%
Q,0.01%
R,2.15%
S,6.59%
T,9.68%
U,4.67%
V,2.45%
W,0.06%
X,0.03%
Y,1.71%
Z,0.04%
"""
Lang['french']="""A,8.70%
B,0.93%
C,3.15%
D,3.55%
E,17.83%
F,0.96%
G,0.97%
H,1.08%
I,6.97%
J,0.71%
K,0.16%
L,5.68%
M,3.23%
N,6.42%
O,5.35%
P,3.03%
Q,0.89%
R,6.43%
S,7.91%
T,7.11%
U,6.14%
V,1.83%
W,0.04%
X,0.42%
Y,0.19%
Z,0.21%
"""
Lang['german']="""A,6.12%
B,1.96%
C,3.16%
D,4.98%
E,16.93%
F,1.49%
G,3.02%
H,4.98%
I,8.02%
J,0.24%
K,1.32%
L,3.60%
M,2.55%
N,10.53%
O,2.54%
P,0.67%
Q,0.02%
R,6.89%
S,7.16%
T,5.79%
U,4.48%
V,0.84%
W,1.78%
X,0.05%
Y,0.05%
Z,1.21%
"""
Lang['latin']="""A,8.89%
B,1.58%
C,3.99%
D,2.77%
E,11.38%
F,0.93%
G,1.21%
H,0.69%
I,11.44%
J,0.00%
K,0.00%
L,3.15%
M,5.38%
N,6.28%
O,5.40%
P,3.03%
Q,1.51%
R,6.67%
S,7.60%
T,8.00%
U,8.46%
V,0.96%
W,0.00%
X,0.60%
Y,0.07%
Z,0.01%
"""
Lang['polish']="""A,9.16%
B,1.93%
C,4.49%
D,3.35%
E,9.81%
F,0.26%
G,1.46%
H,1.25%
I,8.83%
J,2.28%
K,3.01%
L,4.62%
M,2.81%
N,5.85%
O,8.32%
P,2.87%
Q,0.00%
R,4.15%
S,4.85%
T,3.85%
U,2.06%
V,0.00%
W,4.11%
X,0.00%
Y,4.03%
Z,5.50%
"""
Lang['portuguese']="""A,13.52%
B,1.01%
C,3.75%
D,4.21%
E,14.07%
F,1.07%
G,1.08%
H,1.22%
I,5.67%
J,0.30%
K,0.13%
L,3.00%
M,5.07%
N,5.02%
O,10.44%
P,3.01%
Q,1.10%
R,6.73%
S,7.35%
T,5.07%
U,4.57%
V,1.72%
W,0.05%
X,0.28%
Y,0.04%
Z,0.45%
"""
Lang['spanish']="""A,12.16%
B,1.49%
C,3.87%
D,4.67%
E,14.08%
F,0.69%
G,1.00%
H,1.18%
I,5.98%
J,0.52%
K,0.11%
L,5.24%
M,3.08%
N,7.00%
O,9.20%
P,2.89%
Q,1.11%
R,6.41%
S,7.20%
T,4.60%
U,4.69%
V,1.05%
W,0.04%
X,0.14%
Y,1.09%
Z,0.47%
"""
Lang['welsh']="""A,9.36%
B,1.82%
C,2.89%
D,9.88%
E,8.31%
F,3.12%
G,3.41%
H,3.87%
I,6.98%
J,0.13%
K,0.00%
L,5.03%
M,2.48%
N,8.12%
O,5.59%
P,0.91%
Q,0.00%
R,6.52%
S,2.91%
T,2.84%
U,2.58%
V,0.00%
W,3.98%
X,0.00%
Y,8.49%
Z,0.00%
"""
def open_url( language ):
    '''
    Open the language URL using `requests` and read the data out via the text attribute.
    The URL you need is of the form
       https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/language
    where you'll replace language with the string passed in (think of string format operators).
    For instance, if language == 'english', then the URL should be
       https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/english
    Args:
        string: language name
    Returns:
        The text found at the url
    '''
    
#     url_prefix = 'https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/'
#     url_suffix = ## YOUR CODE HERE 
    
#     # Construct url here from `url_prefix` and `language`
#     url = url_prefix + url_suffix
#     # Read the data from the url, using the requests library's .get() function
#     language_data = requests.get( url ).text
    language_data = Lang[language]

    # Finally, return the string `language_data`.
    return language_data
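
The cell above reads from the local `Lang` dictionary rather than the Web.  For reference, a network-based version along the lines of the docstring might look like the sketch below.  The names `URL_PREFIX`, `build_url`, and `fetch_language` are assumptions introduced here, and whether the course URLs still resolve has not been checked.

```python
URL_PREFIX = 'https://raw.githubusercontent.com/UI-CS101/cs101-fa16/master/lab07/'

def build_url(language):
    """Join the course URL prefix and a language name such as 'english'."""
    return URL_PREFIX + language

def fetch_language(language, timeout=10):
    """Download the frequency table for `language` and return it as a string."""
    import requests  # third-party; imported here so build_url works without it
    response = requests.get(build_url(language), timeout=timeout)
    response.raise_for_status()  # raise an HTTPError for a 4xx/5xx response
    return response.text

print(build_url('english'))
```

Calling `raise_for_status` makes a missing page fail loudly instead of silently returning the text of an error page, which would otherwise break the parsing step later on.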

In [56]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_ref = open_url('english')
assert test_ref.split('\n')[0] == 'A,8.34%'
assert len(test_ref) == 209
print('Success!')

Success!


### <span style="color:#345995">Exercise 4: Parse Web Data</span>

Compose a function `load_ref` which accepts a string `language`.  `load_ref` should `return` a `dict` containing the reference language letter frequencies stored at the correct URL (using `open_url`).

In [73]:
#grade
def load_ref( language ):
    '''
    Get the text of the language reference using `open_url`.
    Then, parse the text line by line, and store the values in a dictionary.
    Args:
        string: language name
    Returns:
        dict: The predicted distribution of each character in the given language
              in the format of {letter: frequency}, e.g. {'A': 0.08, 'B': 0.065, ...}
    '''

    # Create an empty dictionary called `languages`.
    languages = {}
    
    # Open the language URL.
    data = open_url( language ) # data is a string
    data = data.strip().split( "\n" ) #convert it from a string to a list
    
    # Loop over each line in the data.
    for line in data:
        # Split each line at the comma.
        parts = line.split(",")
        # The first part should be assigned to a variable `letter`
        letter = parts[0]
        # the second part to a variable `frequency`.
        frequency = parts[1]
        
        # Add the second part (the frequency) to the dictionary as the value (converted to a float)
        # with the first part (the letter) as the key, converted to upper case.
        languages[ letter.upper() ] = p2f( frequency )
    
    # Finally, return the dict `languages`.
    return languages

In [74]:
# test your code here.  You may edit this cell, and you may use any language listed above, but the following is provided for convenience.
language = 'polish'
load_ref(language)

{'A': 0.0916,
 'B': 0.019299999999999998,
 'C': 0.0449,
 'D': 0.0335,
 'E': 0.0981,
 'F': 0.0026,
 'G': 0.0146,
 'H': 0.0125,
 'I': 0.0883,
 'J': 0.022799999999999997,
 'K': 0.0301,
 'L': 0.0462,
 'M': 0.0281,
 'N': 0.058499999999999996,
 'O': 0.0832,
 'P': 0.0287,
 'Q': 0.0,
 'R': 0.0415,
 'S': 0.048499999999999995,
 'T': 0.0385,
 'U': 0.0206,
 'V': 0.0,
 'W': 0.041100000000000005,
 'X': 0.0,
 'Y': 0.0403,
 'Z': 0.055}

In [75]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
language = 'english'
test_ref = load_ref(language)
assert isclose(test_ref['A'], 0.0834)
print('Success!')

Success!


Next, you need to write a function `load_languages` which accepts a list of languages and creates a dictionary for each using `load_ref`.  Then all of these dictionaries will be added to an overall dictionary, by language.  That is, `master` will look something like this:

        `master` is a dictionary with keys:
            'afrikaans' -> (a dictionary with keys:
                                  letter -> frequency)
            'catalan'   -> (a dictionary with keys:
                                  letter -> frequency)
            'danish'    -> (a dictionary with keys:
                                  letter -> frequency)

Specifically,
    
    master['afrikaans']  # returns a dict containing the reference language frequencies for Afrikaans
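
A cut-down version of this nested structure is sketched below.  The name `toy_master` is introduced here for illustration, and it holds only two letters for two languages, with values taken from the reference tables above.

```python
toy_master = {
    'afrikaans': {'E': 0.1812, 'Q': 0.0001},
    'catalan':   {'E': 0.1421, 'Q': 0.0127},
}

# The first key selects a language; the result is itself a dictionary.
print(toy_master['afrikaans'])

# A second key selects a letter within that language.
print(toy_master['catalan']['E'])

# Looping over items() gives each language with its inner dictionary.
for language, freqs in toy_master.items():
    print(language, freqs['Q'])
```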

You need a list of available languages.  You can then open each of them, reading them into a dictionary using `load_ref`.

In [76]:
#grade
# You don't need to edit this cell
languageNames = [ 'afrikaans','catalan','danish','english','finnish','french',
                  'german','latin','polish','portuguese','spanish','welsh' ]

### <span style="color:#345995">Exercise 5: Make a Language Dictionary Function</span>

Now we can loop over the list `languageNames`, and for each language we can 1) create a dictionary using `load_ref` and 2) add this dictionary to the master dictionary `master` with the language as the key.  Do this in the function `load_languages` (which need have no parameters) and `return` `master`.

In [79]:
#grade
def load_languages():
    '''
    For each language, construct a dictionary and store the dictionary as the value
    of the master dictionary.
    Args:
        None
    Returns:
        dict: The predicted distribution of each language 
              in the format of {language: distribution} 
              e.g. {'english': {'A': 0.08, 'B': 0.065, ...}, 'welsh': ...}
    '''

    # Create an empty dictionary `master`.
    master = {}
    # We expect
    # master: language(str) -> dict
    # dict: character(str) -> frequency(float)
    
    # Get a list of languages available.
    languageNames = [ 'afrikaans','catalan','danish','english','finnish','french',
                  'german','latin','polish','portuguese','spanish','welsh' ]
    
    # Call `load_ref` on each of the languages and add the resulting dictionary as a value to `master` with key `language`.
    for language in languageNames:
        master[language] = load_ref(language)
    
    # Finally, return the dict `master`.
    return master

In [80]:
# test your code here.  You may edit this cell.
master = load_languages()
print(master.keys())
print(master['welsh'])

dict_keys(['afrikaans', 'catalan', 'danish', 'english', 'finnish', 'french', 'german', 'latin', 'polish', 'portuguese', 'spanish', 'welsh'])
{'A': 0.09359999999999999, 'B': 0.0182, 'C': 0.028900000000000002, 'D': 0.09880000000000001, 'E': 0.08310000000000001, 'F': 0.031200000000000002, 'G': 0.0341, 'H': 0.0387, 'I': 0.0698, 'J': 0.0013, 'K': 0.0, 'L': 0.050300000000000004, 'M': 0.0248, 'N': 0.0812, 'O': 0.0559, 'P': 0.0091, 'Q': 0.0, 'R': 0.0652, 'S': 0.0291, 'T': 0.028399999999999998, 'U': 0.0258, 'V': 0.0, 'W': 0.0398, 'X': 0.0, 'Y': 0.0849, 'Z': 0.0}


In [81]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_master = load_languages()
assert isclose(test_master['english']['A'], 0.0834)
print('Success!')

Success!


In [82]:
# it should pass this test---do NOT edit this cell
from numpy import isclose
test_master = load_languages()
assert isclose(test_master['catalan']['Z'], 0.001),'Check the URL.  You may be writing the open_url() function with the language given in the test case, not the one given in the function parameters.'
print('Success!')

Success!


### 3.  Predict the most likely language.

With `load_languages` and `calc_freq`, you are now prepared to assess the similarity of a text to a reference language.  This last step is the most mathematically involved.

We will define a frequency metric $f$ to assess the closeness of the match between two sets of frequencies.  In plain language, you will calculate the difference between the two lists $L_{\text{text}}$ and $L_{\text{ref}}$, which yields a third list of the differences.  To make every entry of this list nonnegative, take the absolute value of each difference.  (This keeps equal but opposite errors from canceling each other out.)  To provide a single value to compare, let $f$ be equal to the sum of these absolute values.  Thus a low value of $f$ means a small difference and a better fit between two frequency distributions than a high value of $f$.  As an equation,

$$
f \left( L_{\text{text}}, L_{\text{ref}} \right) =
\sum_{\text{letters}} \left| L_{\text{text}} - L_{\text{ref}}\right| \text{.}
$$

To be clear, the metric we are calculating, $f$, is a metric for how *different* two letter frequency distributions are.
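
To make the formula concrete, here is a worked example on a made-up three-letter alphabet.  The two toy distributions and their names are assumptions chosen for easy arithmetic, not real language data.

```python
from math import isclose

# Toy distributions over a three-letter alphabet; each sums to 1.
toy_text = {'A': 0.5, 'B': 0.3, 'C': 0.2}
toy_ref  = {'A': 0.4, 'B': 0.4, 'C': 0.2}

# |0.5 - 0.4| + |0.3 - 0.4| + |0.2 - 0.2| = 0.1 + 0.1 + 0.0
f = sum(abs(toy_text[letter] - toy_ref[letter]) for letter in toy_ref)

print(f)
print(isclose(f, 0.2))
```

The result is approximately 0.2 (floating-point arithmetic may leave a tiny rounding error).  For two distributions that each sum to 1, $f$ always lies between 0 (identical) and 2 (no letters in common).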

### <span style="color:#345995">Exercise 6: Measure Goodness-of-Fit</span>

Compose a function `calc_match` which accepts two dictionaries `L_text` and `L_ref`.  `calc_match` should return the calculated metric `f` according to the formula above.

In [None]:
# example dictionary
a = {'firstname': 'lastname',
     'Professor': 'Davis',
     'Mickey': 'Mouse'}
print(a['Mickey'])

In [None]:
# example dictionary continued
print(a.keys())

In [None]:
# example dictionary continued
print(a.items())

In [None]:
#grade
def calc_match(L_text, L_ref):
    '''
    Compute the difference of two dictionaries.
    Args:
        L_text: The distribution of letter frequency of the analyzed text
        L_ref: The distribution of letter frequency of one language
    Returns:
        f: float, a calculated metric showing the difference between two dicts
    '''

    # Create an empty dictionary `L_diff`.
    L_diff = {}
    
    # Loop through the keys of the dictionaries (either by loading `alphabet` as above or by using `L_ref.keys()`).
    # Calculate the absolute value of the difference between each dictionary value for each letter
    #     L_diff['A'] = abs(L_text['A'] - L_ref['A'])  # for each letter (or key in L_ref)
    for key in L_ref.keys():
        L_diff[key] = abs(L_text[key] - L_ref[key])
        
    
    # Next, loop through `L_diff` and sum all of the differences into the variable `f`.
    f = 0.0
    for letter in L_diff:
        f += L_diff[letter]
    
    # Finally, return the metric `f`.
    return f

In [73]:
# it should pass this test---do NOT edit this cell
# test self-similarity and similarity across languages
from numpy import isclose
master = load_languages()
assert isclose(calc_match(master['danish'], master['danish']), 0.0)
assert isclose(calc_match(master['english'], master['finnish']), 0.5338)
print('Success!')

Success!


In [74]:
# it should pass this test---do NOT edit this cell
# test the metric on a text sample
from numpy import isclose
text   = '''The conclusions of human reason as ordinarily applied in matters of nature, I call for the sake of distinction Anticipations of
Nature (as a thing rash or premature). That reason which is elicited from facts by a just and methodical process, I call Interpretation of
Nature.  (Francis Bacon, Novum Organon, Aphorism XXVI)'''
L_text = calc_freq(text)
master = load_languages()
L_ref  = master['english']
f = calc_match(L_text, L_ref)
print('welsh, %f'%f)

welsh, 0.348949


<div class="alert alert-danger">
Nothing needs to be written in the next cell.  It only shows how the functions you wrote can be used together.
</div>

Finally, we will capture the above logic in a function `find_best_fit` which will accept a string `text` and a dictionary of reference language dictionaries `master`.  `find_best_fit` compares `text` against all languages in `master`.  `find_best_fit` will return a tuple containing the language with the lowest value of `f` across the available reference languages, together with that value of `f`.

**This is a freebie, so you can see the fruits of your labor in action.**

In [75]:
#grade -- Don't modify this cell
# This code already works---you don't need to write anything here.
def find_best_fit(text, master):
    # Create an empty dictionary `fs`.
    fs = {}
    
    L_text = calc_freq(text)
    
    # Loop through the keys of `master` (by using `master.keys()`).
    for language in master.keys():
        # Calculate `f` for each using `calc_match` and store the result in `fs` with the key of the language.
        L_ref = master[language]
        fs[language] = calc_match(L_text, L_ref)
    
    # Finally, return the language corresponding to the minimum `f` in `fs` and the value of `f` in a tuple.
    best_language = min(fs, key=fs.get)  # the key (language) in `fs` whose value of `f` is smallest
    best_f = fs[best_language]
    return (best_language, best_f)

In [76]:
# it should pass this test---do NOT edit this cell
# test language prediction on a Danish text sample
text = '''
    Soren Kierkegaard ("Frygt og baven:  Dialektisk lyrik", 1843)
    Er det virkelig saa, er al den Spidsborgerlighed, jeg seer i Livet, som jeg ikke lader mit Ord men min Gjerning domme, er den virkelig
    ikke hvad den synes, er den Vidunderet? Det lod sig jo tanke; thi hiin Troens Helt havde jo en paafaldende Lighed dermed; thi hiin Troens
    Helt var end ikke Ironiker og Humorist, men noget endnu Hoiere. Der tales i vor Tid meget om Ironi og Humor, Lsær af Folk, som aldrig have
    formaaet at praktisere deri, men som desuagtet vide at forklare Alt. Jeg er ikke ganske ubekjendt med disse tvende Lidenskaber, jeg veed
    lidt mere om dem end hvad der staaer i tydske og tydsk-danske Compendier. Jeg veed derfor, at disse tvende Lidenskaber ere vasentlig
    forskjellige fra Troens Lidenskab. Ironi og Humor reflektere ogsaa paa sig selv og hore derfor hjemme i den uendelige Resignations
    Sphare, de have deres Elasticitet i, at Individet er incommensurabelt for Virkeligheden.
    '''
master = load_languages()
language, f = find_best_fit(text, master)
print('The best fit for the text is %s with a metric of %f.'%(language,f))

The best fit for the text is danish with a metric of 0.174584.


In [77]:
# it should pass this test---do NOT edit this cell
# test language prediction on an English text sample
from numpy import isclose
text = '''
    Below the thunders of the upper deep;
    Far, far beneath in the abysmal sea, 
    His ancient, dreamless, uninvaded sleep
    The Kraken sleepeth: faintest sunlights flee
    About his shadowy sides: above him swell
    Huge sponges of millennial growth and height; 
    And far away into the sickly light, 
    From many a wondrous grot and secret cell
    Unnumbered and enormous polypi
    Winnow with giant arms the slumbering green.
    There hath he lain for ages and will lie
    Battening upon huge sea-worms in his sleep,
    Until the latter fire shall heat the deep;
    Then once by man and angels to be seen,
    In roaring he shall rise and on the surface die.
    (Alfred Lord Tennyson)
    '''
master = load_languages()
language, f = find_best_fit(text, master)
assert isclose(f, 0.198151072125)
print('Success!')

Success!


This is the most complex program you've yet written.  Let's review its overall logic:

![](./img/flowchart.png)

It's easy to get lost, but charting out your program's logic can help you navigate and think about coding challenges.

---

The lab is now complete, but you may find it interesting to use `find_best_fit` to predict the language of the following text samples, or to find your own samples online and try them out.

In [78]:
#test-cell (do not edit)
#grade
master = load_languages()

Consider the text:

In [79]:
text = '''Onder hierdie hoof wil ek u kortliks op grondige teëstelling wys en ook op verbinding. Die satiere, immers, is algemeen opgevat as
spottende uiting van tenminste ontevredenheid of misnoeë ten opsigte van slegtheid en dwaasheid, bestaande wantoestande in die werklikheid,
met die doel om daarteen gedagte, wil en gevoel op te wek. Hierby wil ek vooropstel die verskillende grade ven gevoel in satieriese spot,
variërende tussen die uiterstes van hoon en sarkasme aan die een kant en gemoedelikheid van komiek en mildheid van humor aan die ander. 'n
Definiesie van satiere wat enkel op hoon en bitterheid wys, skyn my egter nie ruim genoeg vir hierdie begrip nie. Hierteen kan miskien
ingebring word dat ons dan die satiere nie langer in sy essensieelste vorm kry nie.  (F.E.J. Malherbe, Humor in die algemeen en sy uiting in
die Afrikaanse letterkunde)
    '''

-   Which language is the best match, and what is its value of $f$?

In [80]:
language, f = find_best_fit(text, master)
print('The best fit for the text is %s with a metric of %f.'%(language,f))

The best fit for the text is afrikaans with a metric of 0.150750.


-   Which language is the worst match, and what is its value of $f$?

In [81]:
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # the key (language) in `fs` with the largest metric, i.e. the worst fit
f = fs[language]
print('The worst fit for the text is %s with a metric of %f.'%(language,f))

The worst fit for the text is welsh with a metric of 0.573430.
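
Judging from the outputs above, $f$ behaves like a distance: the best fit has the smallest value and the worst fit the largest. Sorting the scores therefore answers both questions at once. The sketch below assumes a dictionary shaped like `fs`; the name `fs_demo`, its keys, and its values are placeholders for illustration, not results from the lab.

```python
# Placeholder scores for illustration only; in the lab, `fs` is filled by calc_match.
fs_demo = {'lang_a': 0.42, 'lang_b': 0.15, 'lang_c': 0.57}

# Sort (language, score) pairs from smallest distance (best) to largest (worst).
ranking = sorted(fs_demo.items(), key=lambda pair: pair[1])

best_language, best_f = ranking[0]
worst_language, worst_f = ranking[-1]

# Print the full ranking, best fit first.
for language, f in ranking:
    print('%-8s %f' % (language, f))
```

Looking at the whole ranking, rather than only the extremes, also shows how close the runner-up is to the winner.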


---

Consider the text:

In [82]:
text = '''Tots els essers humans neixen lliures i iguals en dignitat i en drets. Son dotats de rao i de consciencia, i han de comportar-se
    fraternalment els uns amb els altres.'''

-   Which language is the best match, and what is its value of $f$?

In [83]:
language, f = find_best_fit(text, master)
print('The best fit for the text is %s with a metric of %f.'%(language,f))

The best fit for the text is french with a metric of 0.277559.


-   Which language is the worst match, and what is its value of $f$?

In [84]:
#test-cell
#grade
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # the key (language) in `fs` with the largest metric, i.e. the worst fit
f = fs[language]
print('The worst fit for the text is %s with a metric of %f.'%(language,f))

The worst fit for the text is polish with a metric of 0.560747.


(You will note that, unsurprisingly, short text samples are harder to analyze statistically in this manner.  The foregoing sample is written in Catalan, but this method reports French, a related Romance language, as the best fit.)
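
One way to see the effect of sample length is to compare the letter profile of a short prefix with the profile of the whole sample. The sketch below is only illustrative: `letter_profile` and `profile_distance` are hypothetical helpers (not the lab's `calc_freq` and `calc_match`), and the sum of absolute differences is an assumed distance.

```python
import string

def letter_profile(text):
    """Normalized a..z frequencies; assumes `text` contains at least one letter."""
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    return [letters.count(c) / len(letters) for c in string.ascii_lowercase]

def profile_distance(p, q):
    """Assumed distance: sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

sample = ('Tots els essers humans neixen lliures i iguals en dignitat i en drets. '
          'Son dotats de rao i de consciencia, i han de comportar-se '
          'fraternalment els uns amb els altres.')
full = letter_profile(sample)

# Distance between each prefix's profile and the full-sample profile.
distances = {}
for n in (20, 40, 80, len(sample)):
    distances[n] = profile_distance(letter_profile(sample[:n]), full)
    print(n, round(distances[n], 3))
```

A prefix that is missing letters present in the full sample cannot match its profile exactly, so very short texts give a noisier fingerprint.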

---

Consider the text:

In [85]:
text = '''Quoi que puisse dire Aristote, et toute la philosophie, il n'est rien d'egal
au tabac ; c'est la passion des honnetes gens ; et qui vit sans tabac n'est pas digne
de vivre. Non seulement il rejouit et purge les cerveaux humains, mais encore il
instruit les ames a la vertu, et l'on apprend avec lui a devenir honnete homme. Ne
voyez-vous pas bien, des qu'on en prend, de quelle maniere obligeante on en use avec
tout le monde, et comme on est ravi d'en donner a droite et a gauche, partout ou l'on
se trouve ? On n'attend pas meme qu'on en demande, et l'on court au-devant du souhait
des gens ; tant il est vrai que le tabac inspire des sentiments d'honneur et de vertu
a tous ceux qui en prennent. Mais c'est assez de cette matiere, reprenons un peu notre
discours. Si bien donc, cher Gusman, que done Elvire, ta maitresse, surprise de notre
depart, s'est mise en campagne apres nous ; et son coeur, que mon Maitre a su toucher
trop fortement, n'a pu vivre, dis-tu, sans le venir chercher ici. Veux-tu qu'entre-nous
je te dise ma pensee ? J'ai peur qu'elle ne soit mal payee de son amour, que son voyage
en cette ville produise peu de fruit, et que vous eussiez autant gagne a ne bouger de la.

(Moliere, Don Juan ou le Festin de pierre)
'''

-   Which language is the best match, and what is its value of $f$?

In [86]:
language, f = find_best_fit(text, master)
print('The best fit for the text is %s with a metric of %f.'%(language,f))

The best fit for the text is french with a metric of 0.162854.


-   Which language is the worst match, and what is its value of $f$?

In [87]:
#test-cell
#grade
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # the key (language) in `fs` with the largest metric, i.e. the worst fit
f = fs[language]
print('The worst fit for the text is %s with a metric of %f.'%(language,f))

The worst fit for the text is welsh with a metric of 0.653765.


---

Consider the text:

In [88]:
text = '''
En un lugar de la Mancha, de cuyo nombre no quiero acordarme, no ha mucho
tiempo que vivia un hidalgo de los de lanza en astillero, adarga antigua,
rocin flaco y galgo corredor. Una olla de algo mas vaca que carnero,
salpicon las mas noches, duelos y quebrantos los sabados, lantejas los
viernes, algun palomino de anadidura los domingos, consumían las tres
partes de su hacienda. El resto della concluian sayo de velarte, calzas de
velludo para las fiestas, con sus pantuflos de lo mesmo, y los dias de
entresemana se honraba con su vellori de lo mas fino. Tenia en su casa una
ama que pasaba de los cuarenta, y una sobrina que no llegaba a los veinte,
y un mozo de campo y plaza, que asi ensillaba el rocin como tomaba la
podadera. Frisaba la edad de nuestro hidalgo con los cincuenta anos; era de
complexion recia, seco de carnes, enjuto de rostro, gran madrugador y amigo
de la caza. Quieren decir que tenia el sobrenombre de Quijada, o Quesada,
que en esto hay alguna diferencia en los autores que deste caso escriben;
aunque, por conjeturas verosimiles, se deja entender que se llamaba
Quejana. Pero esto importa poco a nuestro cuento; basta que en la narracion
del no se salga un punto de la verdad.
(Miguel de Cervantes Saavedra, Don Quixote)
'''

-   Which language is the best match, and what is its value of $f$?

In [89]:
#test-cell
#grade
language, f = find_best_fit(text, master)
print('The best fit for the text is %s with a metric of %f.'%(language,f))

The best fit for the text is spanish with a metric of 0.145797.


-   Which language is the worst match, and what is its value of $f$?

In [90]:
#test-cell
#grade
fs = {}
L_text = calc_freq(text)
for language in master.keys():
    fs[language] = calc_match(L_text, master[language])

language = max(fs, key=fs.get)  # the key (language) in `fs` with the largest metric, i.e. the worst fit
f = fs[language]
print('The worst fit for the text is %s with a metric of %f.'%(language,f))

The worst fit for the text is welsh with a metric of 0.548518.
