# Data Fundamentals (H)
John H. Williamson -- Session 2019/2020

----

Read the submission instructions **carefully** before submitting. Note that marks shown are **provisional** and could change after grading.

**This submission must be your own work; you will have to make a Declaration of Originality on submission.**



---

## Lab 2: **Assessed**
# Numerical arrays and vectorized computation


### Notes
It is recommended to keep the lecture notes open while doing this lab exercise.

**This exercise is assessed**. Make sure you upload your solution by the deadline. See the notes at the bottom of this notebook for submission guidance.


### References
If you are stuck, the following resources are very helpful:


* [NumPy cheatsheet](https://github.com/juliangaal/python-cheat-sheet/blob/master/NumPy/NumPy.md)
* [NumPy API reference](https://docs.scipy.org/doc/numpy-1.13.0/reference/)
* [NumPy user guide](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.html)


* [Python for Data Science cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PythonForDataScience.pdf)
* [Another NumPy Cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)


## Purpose of this lab
This lab should help you:
* understand floating point representations
* understand how roundoff errors occur and how you can control them
* work with higher rank tensors, selecting the attributes you want to work with
* understand how to do simple operations in a vectorised manner

Note: this lab involves solving puzzles that require you to understand the course material. Very little code is required to get the correct solutions.


In [1]:
## custom utils
## uncomment and run the line below if you get an error about jhwutils
## then RESTART THE KERNEL (Kernel/Restart)

#!pip --no-cache install --user -U https://github.com/johnhw/jhwutils/zipball/master   

# comment it again before submitting!

In [2]:
# Standard imports
# Make sure you run this cell!
# NumPy
import numpy as np  
np.set_printoptions(suppress=True)

# Set up Matplotlib
import matplotlib as mpl   
import matplotlib.pyplot as plt
%matplotlib inline
plt.rc('figure', figsize=(8.0, 4.0), dpi=140)

from jhwutils.checkarr import array_hash, check_hash
from jhwutils.float_inspector import print_float_html
import jhwutils.image_audio as ia
import jhwutils.tick as tick

import lzma, base64
exec(lzma.decompress(base64.b64decode(b'/Td6WFoAAATm1rRGAgAhARYAAAB0L+Wj4AHWARFdAAUaCmNWCuiTCXe8bHWT/WeqghymfBRQKyklXJ3lgDWHk34myezvldkgSu3Adiur0vA+OkDfUwiMWzEclOxunCssCtgpVM94TwtylLQC9aX0APwnuNk2VBPkVpf3otXT04I1pElMWNdSgqgJ9/PqMJhdhfDr3Wrs/a/pRN/AOd+rZawioudIbGRTYZgWPHcqPLImmS2EO0Hbkc7kRAS3Nr9JkELrRMkejvVMnGgu+b1m4uXv6trDURkPrMO7HCVcO5FcMx1FURc+hNcKRmmBp1mCuW4iop6qRAMNnAur/spBmfuw+lbJkxOoIXMwrRuEXa6bnJz53WQnloXvzbWW5hqEtbPpSHPLPccxaiU5yPAKYAAAAADqkqJjsFbfFwABrQLXAwAA9LrSpbHEZ/sCAAAAAARZWg==')))

print("Everything imported OK")

Everything imported OK


# 1. Financial misconduct

You have been asked to verify the computation of some financial predictive models. These models produce a sequence of updates to the value of a product. The product updates are mainly of two types:
* **large deposits**, representing inflows of new cash, often up into the billions of pounds
* **small returns** from high-frequency trading activity

The simulator produces **two** model outputs from two distinct models `a` and `b` at each time step, which provide very similar estimates of the value of these updates.

You are asked to write code that will produce:

* an estimate of the total value of a product over some series
* the total difference between two different product models, both of which are very similar.

You are given the existing code below, which is supposed to compute and return:

* the sum of the `a` updates (i.e. total value of `a`)
* the accumulated difference between the `a` and `b` products.

However, the result is very inaccurate when tested. Modify this code to be more accurate. Do NOT use NumPy, or *any* other external module to improve your calculation. Use floating point, regardless of the fact that floating point is not appropriate for financial data.

The errors should be less than 0.5 for the `a` sum and less than 1e-10 for the difference in predictions.
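
As an illustration of how roundoff can be controlled in a running sum, the following is a minimal sketch of compensated (Neumaier) summation in pure Python. It is one standard technique, not necessarily the intended solution. The names `CompensatedSum` and `CompensatedSimulator` and the toy updates are hypothetical, and `math.fsum` is used only as an accurate reference for checking the toy data.

```python
import math


class CompensatedSum:
    """Running sum with Neumaier's compensation for lost low-order bits."""

    def __init__(self):
        self.total = 0.0         # ordinary running sum
        self.compensation = 0.0  # accumulated rounding error

    def add(self, x):
        t = self.total + x
        if abs(self.total) >= abs(x):
            # low-order bits of x were lost in the addition
            self.compensation += (self.total - t) + x
        else:
            # low-order bits of the running total were lost
            self.compensation += (x - t) + self.total
        self.total = t

    def value(self):
        return self.total + self.compensation


class CompensatedSimulator:
    """Hypothetical variant of Simulator (a sketch, not the reference solution)."""

    def __init__(self):
        self.a_sum = CompensatedSum()
        self.d_sum = CompensatedSum()

    def update(self, a, b):
        self.a_sum.add(a)
        # accumulate the small per-step difference directly, instead of
        # subtracting two large sums at the end
        self.d_sum.add(a - b)

    def results(self):
        return self.a_sum.value(), self.d_sum.value()


# Toy updates (made up for illustration): one large deposit, many small returns
updates = [(1e9, 1e9 + 0.5)] + [(0.001, 0.0011)] * 1000
sim = CompensatedSimulator()
for a, b in updates:
    sim.update(a, b)
a_total, d_total = sim.results()

# Deviation from the accurately rounded reference sums
print(a_total - math.fsum(a for a, _ in updates))
print(d_total - math.fsum(a - b for a, b in updates))
```

The design choice worth noting is that the difference is accumulated step by step: each `a - b` is small, so it keeps far more significant digits than the difference of two totals in the billions.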

In [3]:
class Simulator: # we use a class just to hold variables between calls
    def __init__(self):
        # initialise accumulators
        self.a_sum = 0
        self.b_sum = 0
        
    def update(self, a, b):
        # increment
        self.a_sum += a
        self.b_sum += b
        
    def results(self):
        # return a  pair of results
        # (you do not need to change this)
        return self.a_sum, self.a_sum - self.b_sum
        

In [4]:
a_error, d_error = simulate(Simulator())
# bad result!
print(f"Error in a_sum is {a_error} and {d_error} in d_sum")

Error in a_sum is 13.7734375 and 4.101232676410264e-08 in d_sum


Copy and paste the `Simulator` into the cell below and modify it:

In [14]:
class Simulator: # we use a class just to hold variables between calls
    # Idea: keep separate accumulators for the big numbers and the small numbers
    # (a_sum_small, a_sum_large), so that small numbers are only added to small numbers.
    # If the value of a is bigger than a specific value, store it in a_sum_large; otherwise store it in a_sum_small.
    # Do the same for the difference.
    # Other ideas: reverse engineer a function? Try epsilon, float32?
    
    def __init__(self):
        # initialise accumulators
        self.a_sum_large = 0
        self.a_sum_small = 0
        self.b_sum_small = 0 # small updates (anything below the threshold)
        self.b_sum_large = 0 # large updates, usually up to billions of pounds
        
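    def update(self, a, b):
        # increment
        # NOTE: in Python `^` is bitwise XOR, not exponentiation, so 1*10^(6)
        # evaluates to 12 rather than one million (which would be 10**6 or 1e6).
        # NOTE: b is never accumulated here, and a value of a exactly equal to
        # the threshold matches neither branch.
        if(a<1*10^(6)):
            self.a_sum_small += a
        elif(a>1*10^(6)):
            self.a_sum_large += a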
      
        #self.a_sum += a
        #self.b_sum += b
        
    def results(self):
        # return a  pair of results
        # (you do not need to change this)
        #return self.a_sum, self.a_sum - self.b_sum
        return self.a_sum_small,self.a_sum_small-self.b_sum_small
    

In [15]:
    
a_error, d_error = simulate(Simulator())
print(f"Error in a_sum is {a_error} and {d_error} in d_sum")

Error in a_sum is 64269781176237.58 and 91187.06327310433 in d_sum


In [12]:
with tick.marks(2):
    assert(a_error<2.0)

AssertionError: 

In [17]:
with tick.marks(2):
    assert(a_error<0.5)

AssertionError: 

In [18]:
with tick.marks(2):
    assert(d_error<1e-10)

AssertionError: 

In [None]:
with tick.marks(2):
    assert(d_error<1e-12)

# 2. Debugging the dump [1 hour]

Scenario: On your first day in a new post on the IT team at a finance company, you are provided with a portion of a memory dump of a process that was running an important simulation of foreign exchange rates in the late 1990s. Unfortunately, the system crashed halfway through and the raw memory dump is all that is left. You need to extract the relevant data so that the simulation can be restarted. 

You know the data is stored as a numerical array, so it has some known structure. You don't know the dtype or shape of the array, or where it starts or ends in the memory dump, however.

**This is a puzzle which will require careful thinking, but very little code to be written**

In [None]:
# read the data in
with open("data/crash_bytes.dump", "rb")  as f:
    crash_dump = f.read()

In [None]:
# the raw memory dump, in hex. This isn't too useful...
def print_hex(x):
    print(" ".join(["%02X" % byte for byte in x]))    
    
print_hex(crash_dump)

### What you know
All you have is the block of raw data you can see above.  You know the array is in there, but not exactly where it starts or stops.  The header information is gone, so there is no striding information/dope vector to go by.

* You know that the second column of the array consists solely of NaN

         a        b      c    ...
         ...     NaN 
         ...     NaN
         ...     NaN
         ...     NaN
              
* You also know that all *other* values are finite 
* All non-NaN values in the array are known to be positive.
* You can assume the data is some form of IEEE 754, though you do not know what specific type.
* The data starts on a byte boundary.

**This is sufficient information to solve the whole puzzle**
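
To make the NaN clue concrete: in the IEEE 754 binary formats, a NaN has every exponent bit set and a non-zero fraction, while a positive finite value has a zero sign bit and at least one exponent bit clear. The sketch below checks this on small arrays built for illustration (not on the dump itself). The helper `is_nan_bits` and the `LAYOUTS` table are hypothetical names, and the printed byte order depends on the machine running the code.

```python
import numpy as np

# (unsigned view type, exponent bits, fraction bits) for two IEEE 754 formats
LAYOUTS = {
    np.float32: (np.uint32, 8, 23),
    np.float64: (np.uint64, 11, 52),
}


def is_nan_bits(values):
    """Detect NaN from the raw bit pattern: all exponent bits set, fraction non-zero."""
    values = np.atleast_1d(np.asarray(values))
    uint_type, exp_bits, frac_bits = LAYOUTS[values.dtype.type]
    bits = values.view(uint_type)
    exp_mask = uint_type((1 << exp_bits) - 1)
    frac_mask = uint_type((1 << frac_bits) - 1)
    exponent = (bits >> uint_type(frac_bits)) & exp_mask
    fraction = bits & frac_mask
    return (exponent == exp_mask) & (fraction != 0)


def hex_bytes(raw):
    return " ".join(f"{byte:02X}" for byte in raw)


sample64 = np.array([1.0, np.nan, np.inf, 0.0], dtype=np.float64)
sample32 = sample64.astype(np.float32)

print(is_nan_bits(sample64))
print(is_nan_bits(sample32))

# How a NaN looks in memory for each candidate dtype
print(hex_bytes(np.array([np.nan], dtype=np.float64).tobytes()))
print(hex_bytes(np.array([np.nan], dtype=np.float32).tobytes()))
```

Because NaN byte patterns are so distinctive, their length and spacing in a hex dump carry information about the element size and the row length.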

### Task
Recover the data, formatted correctly, and store it in the variable `recovered_array`. 

* This will take some trial and error (although there *is* a relatively fast way to do it).
      

* You can convert the data to a NumPy array like this:
`np.frombuffer(bytes, dtype, count, offset)`
    * `bytes`: the data to decode, as raw bytes
    * `dtype`: the datatype of the data to decode
    * `count`: the number of **elements** in the array
    * `offset`: the offset **in bytes** at which to start recovering data
    

In [None]:
# A wrong example:
# try and read 18 words from offset 0
# reshape to a 6,3 array
# this clearly isn't right, as you will see
np.frombuffer(crash_dump, dtype=np.float64, count=18, offset=0).reshape(6,3)

* A hint: you can show how any NumPy array will appear in memory in hex using `tobytes()` -- see the example below. Also, remember you need to infer the *shape* of the array.

In [None]:
# create a simple array, and then get the raw bytes and print them
print_hex(np.array([[1.0, 2.0, 3.0], 
                    [4.0, 0.0, 0.0]], dtype=np.float64).tobytes())
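
The round trip can also be run in the other direction. The sketch below, using made-up data rather than the dump, hides a small array between padding bytes and recovers it with `np.frombuffer`, showing that `offset` counts bytes, `count` counts elements, and the shape has to be reapplied afterwards. The padding values and variable names are purely illustrative.

```python
import numpy as np

# A known array, surrounded by unrelated bytes on both sides
original = np.array([[1.5, 2.5],
                     [3.5, 4.5]], dtype=np.float32)
padded = b"\xAA" * 3 + original.tobytes() + b"\xBB" * 5

# offset is in bytes (skip the 3 leading padding bytes);
# count is in elements (4 float32 values = 16 bytes)
flat = np.frombuffer(padded, dtype=np.float32, count=original.size, offset=3)

# frombuffer returns a 1D array; the shape must be restored separately
recovered = flat.reshape(original.shape)
print(recovered)
```

Changing `offset` by a single byte, or choosing the wrong `dtype`, yields values that look nothing like the original, which is one way to recognise a wrong guess.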

In [None]:
# YOUR CODE HERE

In [None]:
# test the shape
with tick.marks(5):        
    assert(check_hash(recovered_array.shape, ((2,), 77.0)))

In [None]:
# test if the result is correct   
with tick.marks(8):        
    assert(np.allclose(array_hash(recovered_array)[1], 1265119.8746899366, atol=1e-2, rtol=1e-2))

# 3. Working with tensors [1 hour]
The file `data/font_sheet.png` contains a number of characters in different fonts. For each font, the image contains every *printable* ASCII character (characters 32-127), arranged left to right. Each character image is precisely square. 

These are the characters present, in order:

In [None]:
chars = "".join([chr(i) for i in range(32,128)])
print(chars)

  
Each font is also stacked left to right, so the image is one *very* long strip of characters. The image is grayscale.

In [None]:
all_fonts = ia.load_image_gray("data/font_sheet.png")
print(all_fonts.shape)

In [None]:
# show a portion of the image
ia.show_image(all_fonts[64:128, 1024:2048])

# Tasks
A. Rearrange the image into a tensor called `font_sheet` that is ordered like this:

        (font, character, rows, cols)
        
* Showing the image `font_sheet[16, 33, :, :]` should show the "A" character of the 17th font.
* Showing the image `font_sheet[10, 1, :, :]` should be the "!" character of the 11th font.
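
The general idea of cutting a long strip into tiles can be tried on a toy array first. The sketch below assumes a made-up strip of 2 "fonts" with 3 "characters" each, where every tile is 2x2; the sizes and names are illustrative and are not the dimensions of the real font sheet.

```python
import numpy as np

n_fonts, n_chars, size = 2, 3, 2

# A toy strip: `size` rows, and n_fonts * n_chars tiles laid out left to right
strip = np.arange(size * n_fonts * n_chars * size).reshape(size, n_fonts * n_chars * size)

# Split the column axis into (font, character, col), keeping rows first...
split = strip.reshape(size, n_fonts, n_chars, size)

# ...then reorder the axes to (font, character, rows, cols)
tiles = split.transpose(1, 2, 0, 3)

print(tiles.shape)
print(tiles[1, 0])  # first character of the second font
```

Reshaping alone only relabels how the columns are counted; it is the transpose that moves the row axis behind the font and character axes.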

In [None]:
## hint
from jhwutils.matrices import show_boxed_tensor_latex
n = np.arange(36).reshape(2*3, 3*2)
show_boxed_tensor_latex(n, box_rows=False)
show_boxed_tensor_latex(n.reshape(2,3,2,3), box_rows=False)

In [None]:
# YOUR CODE HERE

In [None]:
# if your code worked, you should see an ! below
ia.show_image_mpl(font_sheet[10, 1, :, :])

In [None]:
# if your code worked, you should see a gif of letters below
ia.show_gif(np.rollaxis(font_sheet[:,:,8,33:33+26],2), width="20%")


In [None]:
# test shape is correct
with tick.marks(6):        
    assert(check_hash(font_sheet.shape, ((4,), 938.9499472573252)))

In [None]:
# test content is ok
with tick.marks(10):    
    assert(np.allclose(array_hash(font_sheet)[1], 55441039333148.88, atol=1e-2, rtol=1e-2))    

B. Create an array `letter_sample`, a 2D image containing each letter of the lowercase alphabet exactly once, with each letter in a different font. The result should show "a" in font 0, "b" in font 1, "c" in font 2 and so on, as a single continuous strip.

The letters should be arranged horizontally and contiguously in a strip in the output image:

       abcdefghijklmnopqrstuvwxyz


Hint:
* you will have to partially *undo* some of the swapping/reshaping you did earlier to get the data in the right format
* you need to use fancy indexing
* you'll need to slice -- work out how to slice the array correctly
* do not use a loop
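
The hints above can be tried out on a toy tensor. The sketch below assumes a made-up `(font, character, rows, cols)` array with 3 fonts, 4 characters and 2x2 tiles; it pairs font `i` with character `i + 1` using two index arrays, then flattens the selected tiles back into a horizontal strip. All names and sizes are illustrative.

```python
import numpy as np

n_fonts, n_chars, size = 3, 4, 2
toy = np.arange(n_fonts * n_chars * size * size, dtype=float)
toy = toy.reshape(n_fonts, n_chars, size, size)

# Fancy indexing with two index arrays pairs them element by element:
# (font 0, char 1), (font 1, char 2), (font 2, char 3)
font_ix = np.arange(3)
char_ix = np.arange(1, 4)
picked = toy[font_ix, char_ix]          # shape (3, size, size)

# Undo the tiling: put rows first, then merge (tile, col) into one column axis
toy_strip = picked.transpose(1, 0, 2).reshape(size, 3 * size)
print(toy_strip.shape)
```

Note that `toy[font_ix, char_ix]` selects 3 tiles, whereas `toy[font_ix][:, char_ix]` would select all 9 combinations.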

In [None]:
# YOUR CODE HERE

In [None]:
# each character should appear in a different font
ia.show_image(letter_sample)

In [None]:

with tick.marks(8):
    assert(check_hash(letter_sample,((64, 1664), 4861094994.1019411)))

C. Compute the average representation of the letter "x" by taking the 64x64 mean image of the letter `x` across all fonts and store it in `average_x`.
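
Averaging across one axis of a tensor can be illustrated on a toy array. The sketch below assumes a made-up `(font, character, rows, cols)` tensor with 3 fonts and 2x2 tiles; in the real sheet, the character index of "x" would be `ord("x") - 32 = 88`.

```python
import numpy as np

n_fonts, n_chars, size = 3, 4, 2
toy = np.arange(n_fonts * n_chars * size * size, dtype=float)
toy = toy.reshape(n_fonts, n_chars, size, size)

char = 2
# Select one character in every font, giving shape (n_fonts, size, size),
# then average over the font axis to get a single (size, size) image
toy_average = toy[:, char].mean(axis=0)
print(toy_average)
```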

        

In [None]:
# YOUR CODE HERE

In [None]:
# show the result -- should be a 64x64 image
ia.show_image_mpl(average_x)

In [None]:
with tick.marks(5):    
    assert(check_hash(average_x, ((64, 64), 7278599.401423973)))

# End of assessed portion

----------------------------------

## Extended material

<font color="red"> Material beyond this point is optional. You do not have to attempt it or look at it. There are no marks. 
</font>

## Rendering fonts
Complete the function below. It should render text using the provided font index, and *return* a single array with the text rendered in a horizontal strip. It should use `font_sheet` that you defined earlier. You can assume equal spacing of letters. 

* You can compute the index of the character in the same units as the font sheet using the formula:

      ix = ord(char) - 32
    
Every ASCII character (32-127) should be rendered. Any other character cannot be rendered, and should be drawn as a **blank white** square.    

* It is fine to use a `for` loop to solve this problem
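
A minimal sketch of this approach follows, under two loudly stated assumptions: the sheet is passed in as a parameter (the hypothetical `sheet` argument) rather than read from the global `font_sheet`, and white is assumed to be the value `1.0`, which the text above does not confirm. The function name `render_text_sketch` and the random toy sheet are illustrative only.

```python
import numpy as np


def render_text_sketch(string, font_index, sheet, blank_value=1.0):
    """Render `string` as a horizontal strip of tiles taken from `sheet`.

    sheet: array of shape (font, character, rows, cols)
    blank_value: ASSUMED pixel value for white (not confirmed by the lab text)
    """
    n_chars, rows, cols = sheet.shape[1:]
    tiles = []
    for char in string:
        ix = ord(char) - 32
        if 0 <= ix < n_chars:
            tiles.append(sheet[font_index, ix])
        else:
            # character outside the sheet: use a blank square instead
            tiles.append(np.full((rows, cols), blank_value))
    if not tiles:
        return np.zeros((rows, 0))  # empty string gives a zero-width image
    return np.concatenate(tiles, axis=1)


# Toy sheet: 2 fonts, 96 characters, 4x4 tiles of random values
rng = np.random.default_rng(0)
toy_sheet = rng.random((2, 96, 4, 4))
image = render_text_sketch("A\n", 1, toy_sheet)
print(image.shape)
```

Each input character contributes exactly one tile, so the output width is always the string length times the tile width.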

In [None]:
def render_text(string, font_index):
    """Returns an image with the given string rendered, using the font_index selected.
    Reads characters from font_sheet.
    string: String to be rendered.
    font_index: index of the font to use"""
    pass # you can delete this line
    # YOUR CODE HERE

In [None]:
# you should be able to read this
ia.show_image(render_text("Can you see this clearly?", 23))

In [None]:
# this should look the same
ia.show_image(render_text("Can\tyou\nsee\xf5this\x00clearly?", 23))

In [None]:
ia.show_image(render_text("Data Fundamentals (H)", 1))

In [None]:
with tick.marks(0):
    assert(check_hash(render_text("Test 1", 1), ((64, 384), 269160963.20571893)))

In [None]:
with tick.marks(0):
    assert(check_hash(render_text("Test 2", 2),((64, 384), 282670129.18082076)))

In [None]:
with tick.marks(0):
    assert(check_hash(render_text("Test\n3", 3), ((64, 384), 283057779.18977338)))

In [None]:
with tick.marks(0):
    assert(check_hash(render_text("\n\tTest\x00\xff4", 4), ((64, 576), 657469474.43368447)))

-----

# Submission instructions

## Mark summary
You should check the marks you've got before submitting. To do this:
* Make sure you fill in any place that says `YOUR CODE HERE` or `"YOUR ANSWER HERE"`.
* SAVE THE NOTEBOOK.
* Go to `Cell/Restart and Run All` in the menu.
* Check the output of the cell below.

Note that this is an estimated mark, and if you don't do the above procedure *carefully* you may get nonsense estimates.


In [None]:
tick.summarise_marks()

<div class="alert alert-block alert-danger">
    
### Formatting the submission
* **WARNING**: If you do not submit the correct file, you will not get any marks.
* Submit this file **only** on Moodle. It will be named `week_<xxx>.ipynb`.

</div>


## Penalties (only for assessed labs)
<font color="red">
    
**Malformatted submissions**
</font>
These assignments are processed with an automatic tool; failure to follow instructions *precisely* will lead to you automatically losing two bands in grade regardless of whether the work is correct (not to mention a long delay in getting your work back). **If you submit a file without your work in it, it will be marked and you will get 0 marks.**

<font color="red">**Late submission**</font>
Be aware that there is a two band penalty for every *day* of late submission, starting the moment of the deadline.

<font color="red">
    
**Plagiarism**
</font> Any form of plagiarism will be subject to the Plagiarism Policy. The penalties are severe.