# ENSF 519.01 Applied Data Science 
**Assignment 1** - 100 marks

**Due:** October 4th, 4:00 pm.


**IMPORTANT NOTE: each task must be implemented as asked, even if there are other easier or better solutions.**

**How to deliver:**
Edit this file and write your solutions in the sections marked `# Your solution`. Test your code and, when you are done, submit this notebook as an `.ipynb` file to the D2L dropbox.



## Problem 1 - The Zipf mystery (50 points)

In this problem, we'd like to read the text of a book and perform some simple statistical analysis on the word counts. We have provided the text of [Lost On The Moon or, In Quest of the Field of Diamonds](https://www.goodreads.com/book/show/8636132-lost-on-the-moon-or-in-quest-of-the-field-of-diamonds) in a file named 'the book.txt'. The file has been cleaned up and contains only alphanumeric characters, i.e. no punctuation, quotation marks, etc.

Read the file and break it down into its words. (5 points)

In [4]:
def read_and_tokenize(file_name):
    words = []
    with open(file_name) as f:
        words = [word for line in f for word in line.split()]
    return words

words = read_and_tokenize('the book.txt')
words[1101:1111] # Expected: ['the', 'latter', 'picked', 'it', 'up', 'gazed', 'at', 'it', 'first', 'from']

['the', 'latter', 'picked', 'it', 'up', 'gazed', 'at', 'it', 'first', 'from']

Create a sorted list of the unique words in the book and store it in a variable called `V`. Also complete the `get_word_index` function below, which takes a word and finds its index within `V`. (5 points)

In [5]:
# Your solution goes here
V = sorted(set(words))
def get_word_index(word):
    return V.index(word)

get_word_index('about')  # Expected: 9

9
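
`V.index(word)` scans the list from the start on every call. Because `V` is sorted, a binary search is one possible alternative. The following is only a sketch on a hypothetical toy vocabulary; `toy_V` and `get_word_index_bisect` are illustrative names that are not part of the assignment.

```python
from bisect import bisect_left

# Hypothetical toy vocabulary standing in for the real sorted list V.
toy_V = sorted({"moon", "about", "the", "diamonds", "lost"})

def get_word_index_bisect(word, vocabulary):
    """Return the index of `word` in the sorted list `vocabulary` via binary search."""
    i = bisect_left(vocabulary, word)
    if i < len(vocabulary) and vocabulary[i] == word:
        return i
    raise ValueError(f"{word!r} is not in the vocabulary")

print(get_word_index_bisect("about", toy_V))  # 0
```

For a list of a few thousand words either approach is fast enough; the difference only matters when the lookup is repeated many times.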

Using no loops, and using only the built-in Python functions `map` and `filter`, traverse the `V` (vocabulary) list above to find:

* `long_words`: The list of words that have 10 letters or more 
* `no_vowels`: A list of all words but with vowels (aoeiu) removed. You can nest `map` and `filter` calls to iterate through the characters of the words.

(5+5 points)

In [12]:
# Your solution here
vowels = ["a","e","i","o","u"]
long_words = list(filter((lambda word: len(word) >= 10), V))
no_vowels = list(map((lambda word: ''.join(list(filter((lambda char: char not in vowels), word)))), V))
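
Since the cell above prints nothing, a small worked example may help to see what the two expressions produce. This sketch uses a hypothetical five-word vocabulary (`toy_V`) instead of the real `V`.

```python
# Hypothetical toy vocabulary; the real V comes from the book.
toy_V = ["about", "atmosphere", "moon", "projectile", "sky"]
vowels = "aeiou"

# filter keeps only the words whose length is at least 10.
long_words = list(filter(lambda word: len(word) >= 10, toy_V))

# The inner filter drops the vowels of one word; the outer map applies it to every word.
no_vowels = list(
    map(lambda word: "".join(filter(lambda ch: ch not in vowels, word)), toy_V)
)

print(long_words)  # ['atmosphere', 'projectile']
print(no_vowels)   # ['bt', 'tmsphr', 'mn', 'prjctl', 'sky']
```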

Create a numpy array of size `|V|` that contains only 0s. Store it in a variable named `frequencies`. Use this array to count the number of times each word appears in the book. For example, `frequencies[9]` should store how many times the word located at index 9 of `V` (the sorted list) --which is the word "about"-- has appeared in the book (165 times). (10 points)


In [6]:
import numpy as np
# Your solution
frequencies = np.zeros(len(V), dtype=float)
wordsnp = np.array(words)
uniques, occurCount = np.unique(wordsnp, return_counts=True)
frequencies = np.add(frequencies, occurCount)
frequencies, frequencies[9] # Expected: array([ 1.,  1.,  1., ..., 11.,  1.,  1.]), 165.0

(array([ 1.,  1.,  1., ..., 11.,  1.,  1.]), 165.0)
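
The solution above relies on `np.unique` returning its unique values in sorted order, so that the counts line up with `V`. A minimal sketch on a hypothetical six-word text makes that assumption explicit:

```python
import numpy as np

# Hypothetical toy text; the real words come from the book.
toy_words = ["the", "moon", "the", "about", "moon", "the"]
toy_V = sorted(set(toy_words))  # ['about', 'moon', 'the']

uniques, counts = np.unique(np.array(toy_words), return_counts=True)

# The unique values come back sorted, so position i matches toy_V[i].
assert list(uniques) == toy_V

frequencies = np.zeros(len(toy_V)) + counts
print(frequencies)  # [1. 2. 3.]
```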

Find the word that appeared most frequently in the book. Find the word itself as well as the number of times it was repeated in the book. Use numpy functions, i.e. do not iterate over the `frequencies` array manually using a `for` loop. (5 points)

In [7]:
# Your solution 
most_common_word = V[np.where(frequencies == np.amax(frequencies))[0][0]]
max_frequency = np.amax(frequencies)

print(f'"{most_common_word}" is the most common word which has appeared {max_frequency} times in the book.')
# Expected: "the" is the most common word which has appeared 3237 times in this book.

"the" is the most common word which has appeared 3237.0 times in the book.

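As an alternative to combining `np.where` with `np.amax`, `np.argmax` returns the index of the first maximum directly. The sketch below uses hypothetical toy data and casts the count to `int` so that it prints without a trailing `.0`.

```python
import numpy as np

# Hypothetical toy vocabulary and counts, aligned by index.
toy_V = ["about", "moon", "the"]
toy_frequencies = np.array([1.0, 2.0, 3.0])

best = np.argmax(toy_frequencies)  # index of the first maximum
print(toy_V[best], int(toy_frequencies[best]))  # the 3
```
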

Normalize all frequency values by dividing them by the maximum frequency value (using vectorized operators). After this, the most common word in the book should get a normalized frequency of `1` and all other words get some value between `1/MAX` and `1`. (2.5 points)

In [8]:
# Your solution
normalized_frequencies = frequencies / max_frequency
normalized_frequencies

array([0.00030893, 0.00030893, 0.00030893, ..., 0.00339821, 0.00030893,
       0.00030893])

We want to check whether the normalized frequencies have any correlation with their ranks. If such a correlation exists, Zipf's law states that it is linear in log-log space. Take the logarithm of the normalized frequencies (as y values) and create a numpy array of the same size containing the rank of each word (as x values). For example, if the frequencies array is `[0.1, 1, 0.01, 0.0001]`, the x and y values will be `X = [2, 1, 3, 4]` and `Y = [-1, 0, -2, -4]` (base-10 logarithms, with rank 1 for the most frequent word).

You might want to sort the normalized frequencies first to make the task easier. (2.5 points)

In [11]:
# Your solution 
# argsort applied twice yields 0-based ranks; add 1 so the most frequent word has rank 1
x = ((-normalized_frequencies).argsort()).argsort() + 1
y = np.log(normalized_frequencies)
x, y

(array([2973, 3417, 3418, ...,  493, 4167, 4320]),
 array([-8.08240225, -8.08240225, -8.08240225, ..., -5.68450698,
        -8.08240225, -8.08240225]))

Calculate the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) on this data. The result is expected to be close to -1. Define appropriate functions for the statistical calculations as necessary. Additionally, you can use the `pearsonr` function from the `scipy` package to verify the calculated value. However, if you get a value close enough to -1, you can be almost sure that your implementation is correct and this step won't be necessary. (10 points)

In [13]:
# Your solution goes here
from scipy import stats
def pcc(x, y):
    meanx = np.mean(x)
    meany = np.mean(y)
    top = np.sum((x-meanx)*(y-meany))
    btm = np.sqrt(np.sum(np.square(x-meanx))*np.sum(np.square(y-meany)))
    return top/btm

pcc(x, y)

-0.8631989095267898
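
The value above correlates raw ranks with log frequencies. To illustrate the log-log statement of Zipf's law, the sketch below uses synthetic, idealized data (frequency = 1 / rank, an assumption and not the book's data) and compares both choices of x. It also cross-checks the hand-written formula against `np.corrcoef`; the helper name `pearson` is illustrative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equally sized 1-D arrays."""
    dx, dy = x - x.mean(), y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Synthetic, idealized Zipf data (not the book): frequency = 1 / rank.
ranks = np.arange(1, 1001)
freqs = 1.0 / ranks

r_loglog = pearson(np.log(ranks), np.log(freqs))  # log(rank) against log(frequency)
r_semilog = pearson(ranks, np.log(freqs))         # raw rank against log(frequency)

print(round(r_loglog, 6))     # -1.0
print(r_semilog > r_loglog)   # True: the linear fit is weaker with raw ranks
print(np.isclose(r_loglog, np.corrcoef(np.log(ranks), np.log(freqs))[0, 1]))  # True
```

On this synthetic data the coefficient reaches -1 only when both axes are logarithmic, which suggests that taking the logarithm of the ranks is worth trying on the book data as well.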

## Problem 2 - Log processing (50 points)

In this part of the assignment we are going to use regular expressions to mine data out of some web server log files. Although these problems can be solved without regular expressions, for this assignment you need to use them.

A sample web server log file is provided along with this problem. Each line of the file records one event. For simplicity, all of the events in this file have the same format and are of the same type. Each event contains an IP address, the date and time of the event, the HTTP method (`GET` or `POST`), a URL, the HTTP version, the HTTP response code (usually 200), the response size in bytes, and the device's user agent, which contains information about the device such as the brand and the operating system.

Since these logs have such a well-defined format, regular expressions are the perfect tool for breaking them down into parts and performing different analyses on them.

**Please make sure that when you are asked to write a function that _return_s something, you are _return_ing that value, not just _print_ing it**

We start off with a random log line and write Python functions that use regular expressions to break it into pieces.

In [1]:
import re

l = '5.106.145.204 - - [04/Sep/2019:13:51:39 +0430] "POST /v1/crash-report/incident/report/ HTTP/1.1" 200 65 "-" "Dalvik/1.6.0 (Linux; U; Android 4.2.2; GT-S7272 Build/JDQ39)"'
print(l)

5.106.145.204 - - [04/Sep/2019:13:51:39 +0430] "POST /v1/crash-report/incident/report/ HTTP/1.1" 200 65 "-" "Dalvik/1.6.0 (Linux; U; Android 4.2.2; GT-S7272 Build/JDQ39)"


Make a function that extracts the ip address part of the log line using regular expressions. (5 points)

In [2]:
def get_ip_address(l):
    # Your solution here
    # Raw string so that \. reaches the regex engine without an invalid-escape warning
    return re.match(r'^([0-9]{1,3}\.){3}[0-9]{1,3}', l).group(0)


get_ip_address(l)  # Expected: '5.106.145.204'

'5.106.145.204'

Make a function that extracts the HTTP method, URL, response code, and response size and returns them as a tuple. Use regular expressions. The HTTP method is either `POST` or `GET` and the response code is always a three-digit integer. (10 points)

In [3]:
def get_http_info(l):
    # Your solution here
    # Raw string avoids invalid-escape warnings for \s and \d; the pattern itself is unchanged
    (method, path, res, size) = re.match(r'^.*(POST|GET)\s(/.*\s).*\s(\d{3})\s(\d+).*$', l).groups()
    return (method, path.strip(), int(res), int(size))

get_http_info(l)  # Expected: ('POST', '/v1/crash-report/incident/report/', 200, 65)
# Please note that the last two numbers are converted to integers

('POST', '/v1/crash-report/incident/report/', 200, 65)
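
Named groups are one way to make such a pattern easier to read and debug. The following is only a sketch: `LOG_PATTERN` and its group names are hypothetical, and the pattern assumes every line follows the layout of the sample line shown earlier.

```python
import re

# Hypothetical pattern with named groups; assumes the layout of the sample log line.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>GET|POST) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

line = ('5.106.145.204 - - [04/Sep/2019:13:51:39 +0430] '
        '"POST /v1/crash-report/incident/report/ HTTP/1.1" 200 65 "-" '
        '"Dalvik/1.6.0 (Linux; U; Android 4.2.2; GT-S7272 Build/JDQ39)"')

fields = LOG_PATTERN.match(line).groupdict()
print(fields["ip"], fields["method"], fields["url"],
      int(fields["status"]), int(fields["size"]))
# 5.106.145.204 POST /v1/crash-report/incident/report/ 200 65
```

Anchoring each field to its delimiters (brackets, quotes, single spaces) avoids relying on greedy `.*` backtracking to find the right split.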

Use regular expressions to break the date and time section apart and create a Python datetime object based on that. Mind the time zone: convert the datetimes to MDT. Using `strptime` is a better solution in general, but for this assignment please stick to writing RegExes so you become more comfortable writing and debugging them. (20 points)


In [8]:
from datetime import datetime, timedelta, timezone
from time import strptime

MDT = timezone(timedelta(hours=-6))

def get_datetime(l):
    # Your solution here
    # Capture day, month name, year, hour, minute, second, and UTC offset from the bracketed timestamp
    (day, month, year, hr, mins, sec, tmz) = re.match(r'.*\[(\d{2})/(\S{3})/(\d{4}):(\d{2}):(\d{2}):(\d{2})\s([+-]\d{4}).*$', l).groups()
    calculatedTime = (int(tmz[1:3])*60 + int(tmz[3:])) * int(tmz[0]+'1')
    tmzn = timezone(timedelta(minutes= calculatedTime))
    dt = datetime(int(year), strptime(month,'%b').tm_mon, int(day), int(hr), int(mins), int(sec), tzinfo=tmzn)
    return dt.astimezone(MDT)

get_datetime(l)  # Expected: datetime.datetime(2019, 9, 4, 3, 21, 39, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=64800)))




datetime.datetime(2019, 9, 4, 3, 21, 39, tzinfo=datetime.timezone(datetime.timedelta(days=-1, seconds=64800)))
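
The offset handling can be isolated into a small helper, which makes the sign logic easier to test on its own. This is a sketch; `parse_utc_offset` is a hypothetical name, and it assumes offsets always have the `+HHMM` or `-HHMM` form seen in the log.

```python
import re
from datetime import datetime, timedelta, timezone

def parse_utc_offset(text):
    """Turn a string such as '+0430' or '-0600' into a datetime.timezone."""
    sign, hours, minutes = re.fullmatch(r'([+-])(\d{2})(\d{2})', text).groups()
    total = int(hours) * 60 + int(minutes)
    return timezone(timedelta(minutes=-total if sign == '-' else total))

log_tz = parse_utc_offset('+0430')  # offset of the sample log line
mdt = parse_utc_offset('-0600')     # MDT is six hours behind UTC

local = datetime(2019, 9, 4, 13, 51, 39, tzinfo=log_tz)
print(local.astimezone(mdt).isoformat())  # 2019-09-04T03:21:39-06:00
```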

Read the log file line by line and use the `get_datetime` and `get_http_info` functions above to calculate the used bandwidth of the server (the sum of all the response sizes) per hour. Use a `dict` or a `defaultdict`. (15 points)

For example if there are 4 logs like:

    Sep 4 14:20 .... 65bytes
    Sep 4 14:35 .... 80bytes
    Sep 4 15:01 .... 44bytes
    Sep 5 18:20 .... 40bytes

The result will be like:

    Sep 4 14:00  145
    Sep 4 15:00  44
    Sep 5 18:00  40
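
The four example events above can be aggregated with a `defaultdict` as follows. This is only a sketch: the events are written out by hand as `(datetime, size)` pairs, the year 2019 is an assumption, and `bandwidth_per_hour` is an illustrative name.

```python
from collections import defaultdict
from datetime import datetime

# The four example events as (timestamp, response size) pairs; the year is assumed.
events = [
    (datetime(2019, 9, 4, 14, 20), 65),
    (datetime(2019, 9, 4, 14, 35), 80),
    (datetime(2019, 9, 4, 15, 1), 44),
    (datetime(2019, 9, 5, 18, 20), 40),
]

bandwidth_per_hour = defaultdict(int)
for when, size in events:
    # Truncate the timestamp to the start of its hour to form the bucket key.
    bucket = when.replace(minute=0, second=0, microsecond=0)
    bandwidth_per_hour[bucket] += size

for bucket, total in sorted(bandwidth_per_hour.items()):
    print(bucket.strftime('%Y-%m-%d %H:00'), total)
```

Using the truncated `datetime` itself as the key keeps the buckets sortable; formatting is left to the point where the result is printed.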

In [57]:
# Your solution here
import pprint

bandwith_logs = {}

with open('log.txt', 'r') as f:
    for line in f:
        timeInfo = get_datetime(line)
        httpInfo = get_http_info(line)
        # Bucket each event by the hour it falls in and add its response size
        key, value = timeInfo.strftime('%Y, %m, %d %H:00'), httpInfo[-1]
        bandwith_logs[key] = bandwith_logs.get(key, 0) + value

pprint.pprint(bandwith_logs)
# No specific format for the output is expected
# However the data will be something like:
#  2019, 7, 20 07:00    49130 bytes
#  2019, 7, 20 08:00    40469 bytes
#  2019, 7, 20 09:00    43556 bytes
#  2019, 7, 20 10:00    82526 bytes .... 

{'2019, 07, 20 07:00': 49130,
 '2019, 07, 20 08:00': 40469,
 '2019, 07, 20 09:00': 43556,
 '2019, 07, 20 10:00': 82526,
 '2019, 07, 20 11:00': 56328,
 '2019, 07, 20 12:00': 98862,
 '2019, 07, 20 13:00': 119679,
 '2019, 07, 20 14:00': 57126,
 '2019, 07, 20 15:00': 135680,
 '2019, 07, 20 16:00': 48710,
 '2019, 07, 20 17:00': 45631,
 '2019, 07, 20 18:00': 9805,
 '2019, 07, 20 19:00': 3569,
 '2019, 07, 20 20:00': 36087,
 '2019, 07, 20 21:00': 55406,
 '2019, 07, 20 22:00': 68764,
 '2019, 07, 20 23:00': 30101,
 '2019, 07, 21 00:00': 83251,
 '2019, 07, 21 01:00': 77896,
 '2019, 07, 21 02:00': 65166,
 '2019, 07, 21 03:00': 66326,
 '2019, 07, 21 04:00': 86495,
 '2019, 07, 21 05:00': 76888,
 '2019, 07, 21 06:00': 65378,
 '2019, 07, 21 07:00': 116337,
 '2019, 07, 21 08:00': 55348,
 '2019, 07, 21 09:00': 40975,
 '2019, 07, 21 10:00': 51204,
 '2019, 07, 21 11:00': 56527,
 '2019, 07, 21 12:00': 50933,
 '2019, 07, 21 13:00': 29773,
 '2019, 07, 21 14:00': 80279,
 '2019, 07, 21 15:00': 68270,
 '2019, 0