# HW 1: Web log data wrangling

### This assignment was originally developed for the [UC Berkeley CS 186 course](http://www.cs186berkeley.net/); we use it for COMS4037 with their gracious permission

Please refer to the HW1 [README](https://github.com/WITS-COMS4037/hw/tree/master/hw1) for full details of this assignment.

--------------------------------------------

## Introduction

### Jupyter Notebook with IPython

Jupyter Notebook is a web-based interactive computing system that allows you to mix code and text in [markdown format](https://en.wikipedia.org/wiki/Markdown) in a single document. A notebook consists of a sequence of cells that can be run by hitting Shift-Enter on the keyboard.

In HW1, you will primarily use code cells to work with IPython code. You can find a tour and pointers to documentation in the `Help` menu at the top of this page.


### The dataset

The data are web logs that were produced by an Apache web server; each line represents a request to the server that hosted a video from 2002.

In [9]:
# we assume that the dataset is in ~/COMS4037/checkin/hw1/hw1
import os
DATA_DIR = os.environ['HOME'] + '/COMS4037/checkin/hw1/hw1/'

In [10]:
# take a look at the first line of a file to see what the data looks like
with open(DATA_DIR + "web_log_small.log") as log_file:
    sample_line = log_file.readline()

print( sample_line )
process_logs(log_file)

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"

Start
27.4461452960968
Done


The data is in the format called "Combined Log Format"; you can find a description of each of the fields [here](https://httpd.apache.org/docs/1.3/logs.html#common).
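As a rough illustration of how the fields of such a line can be pulled apart, here is a minimal sketch using a regular expression. The pattern, the group names, and the `parse_log_line` helper are assumptions made for this example, not part of the assignment code, and the pattern does not handle requests or user agents that contain escaped double quotes.

```python
import re

# Hypothetical pattern for one Combined Log Format line; group names are illustrative.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# The sample line shown above, split across string literals for readability.
sample = ('62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] '
          '"GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" '
          '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"')

fields = parse_log_line(sample)
print(fields["ip"], fields["timestamp"], fields["status"])
```

For this assignment only the client IP address and the timestamp are likely to matter, so a cheaper split on whitespace may be enough in practice.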

Another way to view the first line of the dataset is to run a shell command using IPython's [`!` operator](https://ipython.org/ipython-doc/3/interactive/reference.html#system-shell-access), as follows:

In [None]:
!head -1 {DATA_DIR}web_log_small.log

-----------

## Your Assignment

You need to complete the implementation of the `process_logs` function in the code given below to meet the specification described in the README file. In doing so, you can add any helper functions. You may also use any of Python 2's standard libraries; you should not, however, use any external libraries.

You need to ensure that your code will scale to datasets that are bigger than main memory, regardless of how large or skewed the dataset is or how much memory the executing machine has.  Thus, you should avoid keeping data structures of unbounded size, such as

- a list of every line in the dataset
- a dictionary with a key for every IP address

in memory. 
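The streaming idea can be sketched as follows: read one line, emit its result, and forget it, so memory use does not grow with the size of the input. The names `iter_hits` and `write_hits` are hypothetical, and the raw timestamp column is only illustrative; the README defines the actual output format. Work that has to group lines by IP address needs a further step (for example an external sort) that this sketch does not cover.

```python
import sys

def iter_hits(lines):
    """Lazily yield (ip, raw_timestamp) pairs, one log line at a time."""
    for line in lines:
        start = line.find("[")
        end = line.find("]", start + 1)
        if start == -1 or end == -1:
            continue  # skip lines without a bracketed timestamp
        ip = line.split(" ", 1)[0]
        yield ip, line[start + 1:end]

def write_hits(lines, out_file):
    """Write one CSV row per hit; only the current line is held in memory."""
    out_file.write("ip,timestamp\n")
    count = 0
    for ip, ts in iter_hits(lines):
        out_file.write("%s,%s\n" % (ip, ts))
        count += 1
    return count

# Illustrative input: one well-formed line and one malformed line.
sample_lines = [
    '62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET / HTTP/1.0" 200 10564 "-" "-"\n',
    'malformed line without a timestamp\n',
]
write_hits(iter(sample_lines), sys.stdout)
```

Because `iter_hits` is a generator, it works the same way on a list, an open file, or any other iterator of lines.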

### Important

To ensure proper grading, make sure that all of your log processing code (including `import` statements) is between the **BEGIN/END STUDENT CODE** cells. Do not modify or remove either of these cells.

### * BEGIN STUDENT CODE *

In [None]:
import apachetime
import time

def apache_ts_to_unixtime( ts ):
    """
    @param ts - an Apache timestamp string, e.g. '[02/Jan/2003:02:06:41 -0700]'
    @returns int - a Unix timestamp in seconds
    """
    dt = apachetime.apachetime( ts )
    unixtime = time.mktime( dt.timetuple() )
    return int( unixtime )

In [8]:
import os  # used by process_logs; imported here so the student code is self-contained
import time
def process_logs (dataset_iter ):
    """
    Processes the input stream, and outputs the CSV files described in the README.    
    This is the main entry point for your assignment.
    
    @param dataset_iter - an iterator of Apache log lines.
    """
    # FIX ME
    #with open( "hits.csv", "w+" ) as hits_file:        
    #    for i, line in enumerate( dataset_iter ):            
    #        if i % 1e5 == 0:
    #            print( i ),

    log_file=dataset_iter.name
    dataset_iter.close()
    if not os.path.exists(log_file):
        log_file=DATA_DIR + dataset_iter.name[:-4] + ".zip"
        if not os.path.exists(log_file):
            print("Error Retrieving ZIP File")
    
    hits_file = DATA_DIR + "hits.csv"
    sessions_file = DATA_DIR + "sessions.csv"
    plot_file = DATA_DIR + "session_length_plot.csv"
    
    print( "Start" )
    start = time.time()
    os.system("""echo %(ip)s | if [ "$(cut -d'.' -f 2)" = "zip" ]; then unzip -p %(ip)s; else cat %(ip)s; fi | 
    awk 'BEGIN{print "ip,timestamp"; m=split("Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec",d,"|")
    for(o=1;o<=m;o++){months[d[o]]=sprintf("%%02d",o)}}
    {
    gsub(/\[/,"",$4); gsub(":","/",$4); gsub(/\]/,"",$5);
    n=split($4, DATE,"/")
    day=DATE[1]+0
    month_text=DATE[2]
    month = months[month_text]+0
    year=DATE[3]+0
    hour=DATE[4]+0
    minute=DATE[5]+0  
    second=DATE[6]+0
    s=split($5, TIME,",")
    timezone=TIME[1]
    
    offsetsign=substr(timezone,1,1)
    offsethour=substr(timezone,2,2)
    offsetminute=substr(timezone,4,2)
    if (offsethour=="07") 
        {if (offsetsign == "-") {offset=((offsethour+3)*60*60+offsetminute*60)}
        else {if (offsetsign == "+") {offset=-((offsethour+3)*60*60+offsetminute*60)}
        else {offset=0}}}
    else {if (offsetsign == "-") {offset=((offsethour+2)*60*60+offsetminute*60)}
        else {if (offsetsign == "+") offset=-((offsethour+2)*60*60+offsetminute*60)
        else {offset=0}}}
        
    hoursecs = 60 * 60
    daysecs = 24 * hoursecs
    totaldays=0
    total=0
    totaldays = (year - 1970) * 365

    for (i = 1970; i < year; i++){
        if (((i %% 4) == 0 && (i %% 100) != 0) || ((i %% 400) == 0))
            totaldays += 1}

    months[0,1] = months[0,3] = months[0,5] = months[0,7] = months[0,8] = months[0,10] = months[0,12] = 31
    months[1,1] = months[1,3] = months[1,5] = months[1,7] = months[1,8] = months[1,10] = months[1,12] = 31
    months[0,4] = months[0,6] = months[0,9] = months[0,11] = 30
    months[1,4] = months[1,6] = months[1,9] = months[1,11] = 30
    months[0,2] = 28; months[1,2] = 29

    j = (((year %% 4) == 0 && (year %% 100) != 0) || ((year %% 400) == 0))

    for (i = 1; i < month; i++){
        totaldays += months[j, i]}

    totaldays += (day - 1)
    total += totaldays * daysecs  
    utctime = hour - 2
    total += (utctime) * hoursecs
    total += minute * 60
    total += second
    ftime=total+offset
    print $1","ftime}' | tee %(op)s | sed 1d | sort | 
    awk -F, 'BEGIN{print "ip,session_length,num_hits"}{currentip=$1; currenttime=$2
    if (ip != currentip){
        sessionlength = time - starttime
        if (ip != "") {print ip","sessionlength","hits}
        sessionlength=0
        hits=1
        ip=currentip
        time=currenttime
        starttime=currenttime
        } else {
            if ((currenttime - time) > 1800 ){
            sessionlength = time - starttime
            print ip","sessionlength","hits    
            sessionlength=0
            hits=1
            time=currenttime
            starttime=currenttime
            }
            else
            {
            hits++
            time=currenttime            
            }
        }} END{
            if (ip != currentip){
            sessionlength = 0
            hits=1
            if (currentip != "") {print currentip","sessionlength","hits}
            } else {
                if ((currenttime - time) > 1800 ){
                sessionlength = 0
                hits=1
                if (currentip != "") {print currentip","sessionlength","hits}    
                }
                else
                {
                sessionlength = time - starttime
                if (currentip != "") {print currentip","sessionlength","hits}
                }
            }
        }' | tee %(op2)s | sed 1d | sort -t $',' -k 2,2 -nr|
        awk -F, 'BEGIN{print "left,right,count";range[1][0][2]=0;}
        {{
        if(NR==1){maxrange=$2;{for(i=2; k < maxrange; i++){j=2**(i-1); k=2**i; range[i][j][k]=0;}}}
        }
        {
        sessionlength = $2+0
        for (i in range) {for (j in range[i]){for (k in range[i][j])
            {if ( (j+0 <= sessionlength) && (sessionlength < k+0) )
                {range[i][j][k]=range[i][j][k]+1}}}}
        }}
        END
        {
        for (i in range) {for (j in range[i]){for (k in range[i][j]){print j","k","range[i][j][k]}}}
        }' | tee %(op3)s """ % {'ip': log_file, 'op': hits_file, 'op2': sessions_file, 'op3': plot_file})
    
    end = time.time()
    print(end-start)
    print( "Done" )

### * END STUDENT CODE *

------------------------


In [None]:
def process_logs_small( ):
    """
    Runs the process_logs function with the small dataset (186 MB).
    """        
    with open( DATA_DIR + "web_log_small.log" ) as log_file:
        process_logs( log_file )

In [None]:
%time process_logs_small( )

In [None]:
import zipfile

def process_logs_large( ):
    """
    Runs the process_logs function on the full dataset.  The code below 
    performs a streaming unzip of the compressed dataset (158MB). 
    This saves the 1.6GB of disk space needed to unzip this file onto disk.
    """
    with zipfile.ZipFile( DATA_DIR + "web_log_large.zip" ) as z:
        fname = z.filelist[0].filename
        f = z.open( fname )
        process_logs( f )
        f.close( )

In [None]:
%time process_logs_large( )

---------------

# Testing

We provide reference output only for the small dataset. The function `diff_against_reference()` produces `.diff` files if there's a difference between your output and the reference output.

If you're unfamiliar with the format of `diff`'s output, you can read about it [here](https://en.wikipedia.org/wiki/Diff_utility#Usage).

There are other diff utilities that produce more convenient output, making it easier to see the differences between the input files. If you're interested, you might try:

```
$ vimdiff hits.csv ~/coms4037/hw1/ref_output_small/hits.csv
OR
$ git diff hits.csv ~/coms4037/hw1/ref_output_small/hits.csv
```
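If you would rather stay inside the notebook, Python's standard `difflib` module can produce a similar unified diff. The sketch below is only an illustration: the helper name `unified_diff_text` and the sample rows are made up for this example, and it compares in-memory lists of lines, so it suits small outputs rather than the full dataset.

```python
import difflib

def unified_diff_text(expected_lines, actual_lines,
                      expected_name="reference", actual_name="yours"):
    """Return a unified diff as one string; empty if the inputs are identical."""
    return "".join(difflib.unified_diff(
        expected_lines, actual_lines,
        fromfile=expected_name, tofile=actual_name))

# Made-up rows standing in for a reference file and a student's output.
reference = ["ip,timestamp\n", "1.2.3.4,100\n", "1.2.3.4,200\n"]
mine = ["ip,timestamp\n", "1.2.3.4,100\n", "1.2.3.4,250\n"]

print(unified_diff_text(reference, mine))
```

Lines starting with `-` appear only in the reference, and lines starting with `+` appear only in your output.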

In [None]:
import os

ref_output_dir = DATA_DIR + "ref_output_small/"

def _diff_helper( f, unordered=False ):
    """
    @param f (str) - filename to diff with reference output
    @param unordered (bool) - if True, ignore the ordering of the lines when comparing
    """
    if not os.path.isfile( f ):
        print "FAIL - {} does not exist.".format( f )
        return
    
    if unordered:
        tmp1 = !mktemp
        tmp1 = tmp1[0]
        !sort {f} > {tmp1}
        !sort {ref_output_dir + f} | diff {tmp1} - > {f}.diff
    else:
        !diff {f} {ref_output_dir + f} > {f}.diff
    
    success = _exit_code == 0
    if success:
        !rm {f}.diff
        print "PASS - {} matched reference output.".format( f )
    else:
        print "FAIL - {} did not match reference output. See {}.diff.".format( f, f )
        
 
def diff_against_reference( ):
    """
    Compares the output files in the current directory with the reference output.
    If there is a difference, writes a ".diff" file, e.g. hits.csv.diff.
    """ 
    _diff_helper( "hits.csv" )
    _diff_helper( "sessions.csv", unordered=True )
    _diff_helper( "session_length_plot.csv" )

In [None]:
process_logs_small( )
diff_against_reference( )


### Testing Memory Usage

For additional testing, we've included a script that
 - (1) makes sure all of your log processing code is between the BEGIN/END STUDENT CODE CELLS above, so it will work with the test code
 - (2) runs your code with a memory cap of 1MB. If you see a `MemoryError`, it means that your code is not doing appropriate streaming and/or divide-and-conquer!
 
Make sure to save your notebook (`File > Save and Checkpoint` or Ctrl-S) before running the next cell.

In [None]:
!bash test_memory_usage.sh