# HackOnData.com

## Exercise #0 - Decrypt a message


### Instructions
Starting from the data at http://tranquant.com/item-detail/1585053a-d457-4205-9135-69a771082dfd (you must register on the tranquant platform to access it):

   - Find the data points whose value column is more than 4 standard deviations from the population mean, and use them to recover the hidden message
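
As a rough illustration of the criterion above, here is a minimal pure-Python sketch on synthetic data. The three-column row layout (label, value, character) is an assumption modelled on how the cells below unpack each row; `find_outliers` and the sample rows are hypothetical and not part of the challenge data.

```python
import math

def find_outliers(rows, max_dev=4.0):
    """Return rows whose value is more than max_dev population
    standard deviations away from the population mean."""
    values = [v for _, v, _ in rows]
    mean = sum(values) / float(len(values))
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    stddev = math.sqrt(variance)
    return [row for row in rows if abs(row[1] - mean) > max_dev * stddev]

# Synthetic rows: 98 values close to 10.0 plus two planted outliers.
rows = [("t%d" % i, 10.0 + (i % 3 - 1) * 0.1, "x") for i in range(100)]
rows[40] = ("t40", 25.0, "H")
rows[70] = ("t70", -5.0, "I")

outliers = find_outliers(rows)
message = "".join(ch for _, _, ch in outliers)
print(message)
```

The two planted rows sit about 15 units from the mean while the standard deviation is roughly 2.1, so only they pass the 4-deviation threshold.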

Notes:
    - To solve the challenge in Databricks, upload the data to an S3 bucket, then mount the bucket to DBFS. See:
    https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#03%20Data%20Sources/2%20AWS%20S3%20-%20py.html

In [2]:
ACCESS_KEY = "*"
SECRET_KEY = "*"
ENCODED_SECRET_KEY = SECRET_KEY.replace("/", "%2F")
AWS_BUCKET_NAME = "*"
MOUNT_NAME = "E0"
FILENAME = "challenge0.csv"
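
The `replace` call above escapes only `/`. If a secret key happens to contain other reserved characters, a more general option might be to percent-encode the whole key. The sketch below assumes the standard library `quote` function is acceptable for this purpose; `encode_secret` and the sample key are hypothetical.

```python
try:
    from urllib.parse import quote  # Python 3
except ImportError:
    from urllib import quote  # Python 2

def encode_secret(secret_key):
    """Percent-encode every reserved character so the key can be
    embedded in a URI (safe="" means '/' is encoded as well)."""
    return quote(secret_key, safe="")

example_key = "ab/cd+ef"  # made-up value, not a real credential
encoded = encode_secret(example_key)
print(encoded)
```

For keys that contain no reserved character other than `/`, this produces the same string as the `replace` call.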

In [3]:
# dbutils.fs.unmount("/mnt/%s" % MOUNT_NAME) # if attached, unmount and rerun
dbutils.fs.mount("s3a://%s:%s@%s" % (ACCESS_KEY, ENCODED_SECRET_KEY, AWS_BUCKET_NAME), "/mnt/%s" % MOUNT_NAME)
myRDD = sc.textFile("/mnt/%s/%s" % (MOUNT_NAME, FILENAME))

In [4]:
import itertools
import math

def _splitter(line):
    """
    Split a CSV line and convert its second field (the value)
    from a string to a float.
    """
    row = line.split(',')
    return tuple(row[:1] + [float(row[1])] + row[2:])
  
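def _mean(iterable):
    """
    O(n)
    Formula:
        sum of all values divided by the number of elements
    """
    total = 0.0
    count = 0
    for _, x, _ in iterable:
        total += x
        count += 1
    if count == 0:
        raise ValueError("cannot compute the mean of an empty sequence")
    # A sum of zero is a valid result, so divide unconditionally
    return total / count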
  
def _standard_deviation(iterable, mean):
    """
    Population standard deviation, rounded to 3 decimal places.
    Formula:
        square root of (sum of (value minus mean), squared,
        divided by the number of elements)
    """
    squares = [math.pow(x - mean, 2) for _, x, _ in iterable]
    variance = sum(squares) / len(squares)
    return round(math.sqrt(variance), 3)
  
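def _flt_by_deviations(rows, mean, stddev, max_dev, comment=False):
    """
    Keep rows whose value lies at least max_dev standard deviations
    from the mean; return them as (t, m) pairs.
    """
    def _f(row):
        t, v, m = row  # unpack in the body; tuple parameters are Python 2 only
        dev = (v - mean) / stddev
        if not (-max_dev < dev < max_dev):
            retval = dev, t, m
            if comment:
                print('Deviation: %s: \t%s, - with \t%s' % retval)
            return retval[1:]
    # Build a list so the result can be iterated more than once
    return [r for r in map(_f, rows) if r]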

In [5]:
header = myRDD.first()
rows_raw = myRDD.filter(lambda line: line != header)
rows_ = rows_raw.map(_splitter)

rows = rows_.collect()
mean = _mean(rows)
standard_deviation = _standard_deviation(rows, mean)
hidden_message = _flt_by_deviations(
  rows, 
  mean, 
  standard_deviation, 
  max_dev=4, 
  comment=True)  # little more details regarding matches

In [6]:
val = ''.join(h for t, h in hidden_message)
# Turn each literal quoted-space token (" ") back into a plain space
print(val.replace('" "', ' '))