# HW 1: Web log data wrangling

Please also refer to the HW1 [README](https://github.com/berkeley-cs186/course/tree/master/hw1) for the full assignment details.

--------------------------------------------

## Introduction

### Jupyter Notebooks w/ IPython

Jupyter Notebook is a web-based interactive computing system that allows you to mix code and rich text in one document. A notebook consists of a sequence of cells, which can be run using the "Play" button in the toolbar or by hitting Shift-Enter on the keyboard.

In HW1, you will primarily use code cells containing IPython code. You can find a tour and pointers to more documentation in the `Help` menu above.


### The dataset

Let's take a look at the data. These web logs were produced by an Apache web server. Each line represents a request to the server that originally hosted an early viral video from 2002.

In [None]:
import os
DATA_DIR = os.environ['MASTERDIR'] + '/sp16/hw1/'

In [None]:
with open(DATA_DIR + "web_log_small.log") as log_file:
    sample_line = log_file.readline()

print sample_line

This format is called "Combined Log Format", and you can find a description of each of the fields [here](https://httpd.apache.org/docs/1.3/logs.html#common).
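
To make the field layout concrete, here is a minimal sketch of splitting one Combined Log Format line into named fields with a regular expression. The names `COMBINED_LOG_RE` and `parse_combined_line` and the sample line are made up for illustration (the sample is not taken from the dataset), and the pattern assumes that quoted fields contain no embedded double quotes, which real logs may violate.

```python
import re

# Field order: host ident authuser [date] "request" status bytes "referer" "user-agent"
# Assumption: quoted fields contain no embedded double quotes.
COMBINED_LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined_line(line):
    """Return a dict of field name -> string, or None if the line does not match."""
    match = COMBINED_LOG_RE.match(line)
    return match.groupdict() if match else None


# Made-up sample line, for illustration only.
sample = ('192.0.2.1 - - [02/Jan/2003:02:06:41 -0700] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/start.html" "Mozilla/4.08 [en] (Win98)"')
fields = parse_combined_line(sample)
```

Every value comes back as a string, so `fields['status']` is `'200'` rather than the integer `200`. Lines that do not fit the pattern yield `None` and can be skipped or logged.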

Here's another way to view the first line of the dataset. We can run a shell command using the [`!` operator](https://ipython.org/ipython-doc/3/interactive/reference.html#system-shell-access) (a feature of IPython).

In [None]:
!head -1 {DATA_DIR}web_log_small.log

-----------

## Your Assignment

Fill in the `process_logs` function below to complete the specification in the README. You can add any helper functions you need. You may use any of Python 2's standard libraries available on the instructional machines. You cannot use (and shouldn't need) any external libraries.

Remember, you need to ensure that your code will scale to datasets that are bigger than memory -- no matter how large or skewed the dataset or how much memory is on your test machine. Avoid keeping data structures of unbounded size in memory, since they **won't** scale, e.g.:

- having a list of every line in the dataset
- having a dictionary with a key for every IP address
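
As a contrast to the two patterns above, here is a minimal sketch of a streaming aggregation that keeps only one key and one counter in memory at a time. It assumes the keys already arrive grouped together (for example, after a sort). That is an assumption of this sketch, not something the assignment guarantees, and `count_runs` is a hypothetical name.

```python
from itertools import groupby


def count_runs(keys):
    """
    Yield (key, count) for each run of consecutive equal keys.
    Only the current key and its running count are held in memory.
    """
    for key, run in groupby(keys):
        yield key, sum(1 for _ in run)


# Hypothetical input whose equal keys are already adjacent.
example = iter(["a", "a", "b", "c", "c", "c"])
result = list(count_runs(example))  # list() is only for this tiny demo
```

On real data you would consume the generator one item at a time (for example, writing each row to a file) rather than calling `list()` on it.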

Finally, to ensure proper grading, please make sure all of your log processing code (including `import` statements) is between the **BEGIN/END STUDENT CODE** cells. Do not modify or remove either of these cells.

### * BEGIN STUDENT CODE *

In [None]:
import apachetime
import time

def apache_ts_to_unixtime(ts):
    """
    @param ts - an Apache timestamp string, e.g. '[02/Jan/2003:02:06:41 -0700]'
    @returns int - a Unix timestamp in seconds
    """
    dt = apachetime.apachetime(ts)
    unixtime = time.mktime(dt.timetuple())
    return int(unixtime)
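
For reference, here is a sketch of what such a conversion involves, using only the standard library. It is not the course's `apachetime` helper, and the name `apache_ts_to_utc_seconds` is hypothetical. It assumes English month abbreviations (the default C locale) and a `+HHMM`/`-HHMM` offset. It applies the offset explicitly and returns UTC-based seconds, so its result may differ from the helper above, which goes through `time.mktime` and therefore depends on the machine's local timezone.

```python
import calendar
from datetime import datetime, timedelta


def apache_ts_to_utc_seconds(ts):
    """
    Convert an Apache timestamp such as '[02/Jan/2003:02:06:41 -0700]'
    to seconds since the Unix epoch, measured in UTC.
    """
    stamp, offset = ts.strip().strip('[]').split(' ')
    local = datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S')
    # The offset has the form +HHMM or -HHMM; subtracting it gives UTC.
    sign = -1 if offset[0] == '-' else 1
    delta = timedelta(hours=int(offset[1:3]), minutes=int(offset[3:5]))
    utc = local - sign * delta
    return calendar.timegm(utc.timetuple())
```

For the example timestamp, 02:06:41 at offset -0700 corresponds to 09:06:41 UTC on the same day.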

In [None]:
def process_logs(dataset_iter):
    """
    Processes the input stream, and outputs the CSV files described in the README.    
    This is the main entry point for your assignment.
    
    @param dataset_iter - an iterator of Apache log lines.
    """
    # FIX ME
    with open("hits.csv", "w+") as hits_file:        
        for i, line in enumerate(dataset_iter):            
            if i % 1e5 == 0:
                print i,
        
        print "Done."

### * END STUDENT CODE *

------------------------


In [None]:
def process_logs_small():
    """
    Runs the process_logs function with the small dataset (186 MB).
    """        
    with open(DATA_DIR + "web_log_small.log") as log_file:
        process_logs(log_file)

In [None]:
%time process_logs_small()

In [None]:
import zipfile

def process_logs_large():
    """
    Runs the process_logs function on the full dataset. The code below
    performs a streaming unzip of the compressed dataset (158 MB), which
    saves the 1.6 GB of disk space needed to unzip the file onto disk.
    """
    with zipfile.ZipFile(DATA_DIR + "web_log_large.zip") as z:
        fname = z.filelist[0].filename
        f = z.open(fname)
        process_logs(f)
        f.close()

In [None]:
%time process_logs_large()

---------------

## Testing

As mentioned in the README, we provide reference output only for the small dataset. `diff_against_reference()` produces a `.diff` file if there's a difference between your output and the reference output.

If you're unfamiliar with the format of `diff`'s output, you can read about it [here](https://en.wikipedia.org/wiki/Diff_utility#Usage).
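
If you'd like to experiment with diff output from Python, here is a small sketch using the standard library's `difflib`. It is not how the notebook's helper works (that one shells out to `diff`). The function name `diff_lines` and the sample rows are made up, and because it loads both inputs into memory it is only suitable for small files.

```python
import difflib


def diff_lines(actual, expected, unordered=False):
    """
    Return the unified-diff lines between two lists of lines.
    An empty list means the inputs match.
    """
    if unordered:
        # Ignore line order by sorting both sides first.
        actual, expected = sorted(actual), sorted(expected)
    return list(difflib.unified_diff(actual, expected,
                                     fromfile='actual', tofile='expected',
                                     lineterm=''))


# Hypothetical rows, for illustration only.
mine = ['a,1', 'b,2']
ref = ['b,2', 'a,1']
ordered_diff = diff_lines(mine, ref)
unordered_diff = diff_lines(mine, ref, unordered=True)
```

Here `unordered_diff` is empty because the two lists hold the same rows, while `ordered_diff` is not, because the rows appear in a different order.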

There are other diff utilities which produce colored/side-by-side output, making it easier to see differences. If you're interested, try:

```
$ vimdiff hits.csv ~cs186/sp16/hw1/ref_output_small/hits.csv
OR
$ git diff hits.csv ~cs186/sp16/hw1/ref_output_small/hits.csv
```

In [None]:
import os

ref_output_dir = DATA_DIR + "ref_output_small/"

def _diff_helper(f, unordered=False):
    """
    @param f (str) - filename to diff with reference output
    @param unordered (bool) - if True, ignore line ordering (both files are sorted before diffing)
    """
    if not os.path.isfile(f):
        print "FAIL - {} does not exist.".format(f)
        return
    
    if unordered:
        tmp1 = !mktemp
        tmp1 = tmp1[0]
        !sort {f} > {tmp1}
        !sort {ref_output_dir + f} | diff {tmp1} - > {f}.diff
    else:
        !diff {f} {ref_output_dir + f} > {f}.diff
    
    success = _exit_code == 0
    if success:
        !rm {f}.diff
        print "PASS - {} matched reference output.".format(f)
    else:
        print "FAIL - {} did not match reference output. See {}.diff.".format(f, f)
        

def diff_against_reference():
    """
    Compares the output files in the current directory with the reference output.
    If there is a difference, writes a ".diff" file, e.g. hits.csv.diff.
    """ 
    _diff_helper("hits.csv")
    _diff_helper("sessions.csv", unordered=True)
    _diff_helper("session_length_plot.csv")

In [None]:
process_logs_small()
diff_against_reference()


### Testing Memory Usage

For additional testing, we've included a script which:
 - (1) makes sure all of your log processing code is between the BEGIN/END STUDENT CODE cells above, so it will work with our autograder
 - (2) runs your code with a memory cap of 1MB. If you see a `MemoryError`, it's a sign your code is not doing appropriate streaming and/or divide-and-conquer!
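
To illustrate what "divide-and-conquer" can look like for data that doesn't fit in memory, here is a generic external merge sort sketch. It sorts bounded chunks, spills them to temporary files, then lazily merges them. It is one possible technique, not the required approach for this assignment. The names `external_sort` and `chunk_size` are hypothetical, and the sketch assumes every input line ends with a newline. The number of chunk files grows with the input, so a very large dataset would need multiple merge passes, which this sketch omits.

```python
import heapq
import tempfile


def _write_sorted_chunk(lines):
    """Sort one bounded chunk and spill it to a temporary file."""
    chunk_file = tempfile.TemporaryFile(mode='w+')
    chunk_file.writelines(sorted(lines))
    chunk_file.seek(0)  # rewind so the merge step can read it back
    return chunk_file


def external_sort(lines, chunk_size=10000):
    """
    Yield newline-terminated lines in sorted order while buffering at most
    chunk_size input lines in memory at a time.
    """
    chunk_files = []
    chunk = []
    for line in lines:
        chunk.append(line)
        if len(chunk) >= chunk_size:
            chunk_files.append(_write_sorted_chunk(chunk))
            chunk = []
    if chunk:
        chunk_files.append(_write_sorted_chunk(chunk))
    try:
        # heapq.merge reads one line at a time from each sorted chunk file.
        for line in heapq.merge(*chunk_files):
            yield line
    finally:
        for chunk_file in chunk_files:
            chunk_file.close()
```

For example, `list(external_sort(iter(['c\n', 'a\n', 'b\n']), chunk_size=2))` sorts three lines using two chunk files. In practice you would iterate over the generator instead of materializing it with `list()`.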
 
Make sure to save your notebook (`File > Save and Checkpoint`) before running the next cell.

In [None]:
!bash test_memory_usage.sh