# HW 1: Web log data wrangling

### This assignment was originally developed for the [UC Berkeley CS 186 course](http://www.cs186berkeley.net/); we use it for COMS4037 with their gracious permission

Please refer to the HW1 [README](https://github.com/WITS-COMS4037/hw/tree/master/hw1) for the full details of this assignment.

--------------------------------------------

## Introduction

### Jupyter Notebook with IPython

Jupyter Notebook is a web-based interactive computing system that allows you to mix code and text in [markdown format](https://en.wikipedia.org/wiki/Markdown) in a single document. A notebook consists of a sequence of cells that can be run by hitting Shift-Enter on the keyboard.

In HW1, you will primarily use code cells to work with IPython code. You can find a tour and pointers to documentation in the `Help` menu at the top of this page.


### The dataset

The data are web logs that were produced by an Apache web server; each line represents a request to the server that hosted a video from 2002.

In [15]:
# we assume that the dataset is in ~/coms4037/hw1 
import os
DATA_DIR = os.environ['HOME'] + '/coms4037/hw1/'
#DATA_DIR = os.environ['HOME'] + '/coms4037/hw1/datasets/'
#DATA_DIR = os.environ['HOME'] + '/COMS7043/hw/hw1/Datasets/'
DATA_DIR

'/home/calvin/coms4037/hw1/'

In [16]:
# take a look at the first line of a file to see what the data looks like
with open(DATA_DIR + "web_log_small.log") as log_file:
    sample_line = log_file.readline()

print( sample_line )

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"



The data is in the format called "Combined Log Format"; you can find a description of each of the fields [here](https://httpd.apache.org/docs/1.3/logs.html#common).
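
To make the field layout concrete, here is a minimal sketch that splits the sample line above into named fields with a regular expression. The `LOG_PATTERN` pattern and the `parse_line` helper are illustrative assumptions, not part of the provided assignment code. The pattern only handles well-formed lines (for example, it does not cope with escaped quotes inside a field). It uses only the standard library.

```python
import re

# Hypothetical pattern for one Combined Log Format line (well-formed input only).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] '
          '"GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" '
          '"Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"')

fields = parse_line(sample)
print(fields['ip'])      # 62.172.72.131
print(fields['time'])    # 02/Jan/2003:02:06:41 -0700
print(fields['status'])  # 200
```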

Another way to view the first line of the dataset is to run a shell command using IPython's [`! operator`](https://ipython.org/ipython-doc/3/interactive/reference.html#system-shell-access), as follows: 

In [17]:
!head -1 {DATA_DIR}web_log_small.log

62.172.72.131 - - [02/Jan/2003:02:06:41 -0700] "GET /random/html/riaa_hacked/ HTTP/1.0" 200 10564 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0; WWP 17 August 2001)"


-----------

## Your Assignment

You need to complete the implementation of the `process_logs` function in the code given below to meet the specification described in the README file. In doing so, you can add any helper functions. You may also use any of Python 2's standard libraries; you should not, however, use any external libraries.

You need to ensure that your code will scale to datasets that are bigger than main memory, regardless of how large or skewed the dataset is or how much memory the executing machine has.  Thus, you should avoid keeping data structures of unbounded size, such as

- a list of every line in the dataset
- a dictionary with a key for every IP address

in memory. 
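
One way to stay within a fixed memory budget is to consume the input one item at a time and keep only a constant number of running values. The sketch below is illustrative, not the required solution. `count_hits_per_ip_sorted` is a hypothetical helper that assumes its input is already sorted by IP address, so it only ever holds the current IP and its running count.

```python
def count_hits_per_ip_sorted(sorted_ips):
    """
    Yield (ip, count) pairs from an iterator of IPs that is already
    sorted, holding only the current IP and its running count in memory.
    """
    current_ip = None
    count = 0
    for ip in sorted_ips:
        if ip != current_ip:
            # The previous group is complete, so emit it and start a new one.
            if current_ip is not None:
                yield current_ip, count
            current_ip = ip
            count = 0
        count += 1
    if current_ip is not None:
        yield current_ip, count

# Any iterator works as input, including a file that is read lazily.
ips = iter(['1.1.1.1', '1.1.1.1', '2.2.2.2', '3.3.3.3', '3.3.3.3'])
for ip, count in count_hits_per_ip_sorted(ips):
    print('{0},{1}'.format(ip, count))
```

By contrast, a dictionary keyed by IP address would grow with the number of distinct addresses in the log, which is exactly the unbounded structure the list above warns against.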

### Important

To ensure proper grading, make sure that all of your log processing code (including `import` statements) is between the **BEGIN/END STUDENT CODE** cells. Do not modify or remove either of these cells.

### * BEGIN STUDENT CODE *

In [18]:
import apachetime
import time
# new imports
import csv
import re

def apache_ts_to_unixtime( ts ):
    """
    @param ts - an Apache timestamp string, e.g. '[02/Jan/2003:02:06:41 -0700]'
    @returns int - a Unix timestamp in seconds
    """
    dt = apachetime.apachetime( ts )
    unixtime = time.mktime( dt.timetuple() )    
    
    return int( unixtime )

In [19]:
def process_logs( dataset_iter ):
    """
    Processes the input stream, and outputs the CSV files described in the README.    
    This is the main entry point for your assignment.
    
    @param dataset_iter - an iterator of Apache log lines.
    """
    # FIX ME
    with open( "hits.csv", "w+" ) as hits_file:
        header = "ip,timestamp\n" #new line for writing header line to the csv file
        hits_file.write(header) #new line
        for i, line in enumerate( dataset_iter ):
            #regex = '([(\d\.)]+) (.*) (.*) (\[.*\]) "(.*)" (.*) (.*) "(.*)" "(.*)"'
            #logs = re.match(regex, line).groups()
            #timestamp = apache_ts_to_unixtime((logs[3]))
            
            logs = line.split() # split the line's content on whitespace
            timestamp = logs[3] + logs[4] # rejoin the date/time and time-zone parts of the timestamp
            unix_time = apache_ts_to_unixtime(timestamp) # convert the timestamp to Unix time
            
            hits_file.write(str(logs[0])+ "," + str(unix_time) + "\n")            
            if i % 1e5 == 0:
                print( i ),
        
        print( "Done" )
        print ("hits.csv done")
    
    #creation of sessions.csv file
    
    # Sort the hit rows and write the sorted content of hits.csv to a new CSV file
    !tail -n +2 hits.csv | sort -o hits_sorted.csv # tail -n +2 reads everything after the header line
    with open("sessions.csv", "w+") as sessions_file: 
        sessions_file.write("ip,session_length,num_hits\n")
        with open("hits_sorted.csv", "r") as hits_sorted_file: #reading from the sorted hits file
            #Initialize parameters
            started = 0 #time started
            ended = 0 #ending time for the session
            num_hits = 0 # number of hits
            prev_ip = None #initialized to none as no row has been read yet

            for row in hits_sorted_file:
                row_data = row.split(",") #splits each line/row based on presence of a comma
                ip = row_data[0] #picks out or reads IP address
                t_stamp = int(row_data[1]) #picks out unix time value

                if ip != prev_ip:
                    if num_hits >= 1:
                        sessions_file.write(prev_ip +","+ str(ended - started) +","+ str(num_hits)+ "\n")
                    #re-evaluate initialized variables
                    prev_ip = ip
                    started = t_stamp
                    ended = t_stamp
                    num_hits = 1
                else:
                    if t_stamp - ended <= 1800: #1800 seconds represents 30 minutes for session determination
                        num_hits +=1
                        ended = t_stamp
                    else: #If difference in time > 1800 seconds, then a new session is started
                        sessions_file.write(prev_ip +","+ str(ended - started) +","+ str(num_hits)+ "\n")
                        started = t_stamp
                        ended = t_stamp
                        num_hits = 1
            sessions_file.write(prev_ip +","+ str(ended - started) +","+ str(num_hits)+ "\n")

    print "sessions.csv done"
    
    
    # creation of session_length_plot.csv
    !tail -n +2 sessions.csv | sort -t "," -k2,2n > sessions_sorted.csv #sorting using session_length column

    with open("session_length_plot.csv", "w+") as histogram_file:
        histogram_file.write("left,right,count\n")
        with open("sessions_sorted.csv", "rb") as sessions_data:
            #Initialization of params
            left = 0
            right = 2
            num = 0

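            for session_row in sessions_data:
                sess_row_data = session_row.split(",")
                session_length = int(sess_row_data[1])
                # advance through the doubling bins until this session fits,
                # so a length that jumps past one or more bins is not miscounted
                while session_length >= right:
                    if num >= 1:
                        histogram_file.write(str(left) +","+ str(right) +","+ str(num) +"\n")
                    left = right
                    right = right * 2
                    num = 0
                num += 1
            if num >= 1: # write the last bin only if it contains at least one session
                histogram_file.write(str(left) +","+ str(right) +","+ str(num) +"\n")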

    
    print "session_length_plot.csv done"

### * END STUDENT CODE *

------------------------
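
The session rule used in the code above (a gap of more than 1800 seconds between consecutive hits from the same IP starts a new session) can also be written as a small generator. This is a minimal sketch for illustration only. `sessionize` and `SESSION_GAP` are hypothetical names, and the input is assumed to be `(ip, unix_time)` pairs already sorted by IP and then by time.

```python
SESSION_GAP = 1800  # seconds; the 30-minute threshold used in the code above

def sessionize(sorted_hits):
    """
    Yield (ip, session_length, num_hits) from (ip, unix_time) pairs that
    are sorted by IP and then by time. Only one open session is kept.
    """
    prev_ip = None
    started = ended = num_hits = 0
    for ip, ts in sorted_hits:
        if ip == prev_ip and ts - ended <= SESSION_GAP:
            # Same IP and close enough in time: extend the open session.
            ended = ts
            num_hits += 1
        else:
            # New IP or too large a gap: close the open session, start another.
            if prev_ip is not None:
                yield prev_ip, ended - started, num_hits
            prev_ip, started, ended, num_hits = ip, ts, ts, 1
    if prev_ip is not None:
        yield prev_ip, ended - started, num_hits

hits = [
    ('1.1.1.1', 1000),
    ('1.1.1.1', 1600),   # 600 s after the previous hit: same session
    ('1.1.1.1', 5000),   # 3400 s later: new session
    ('2.2.2.2', 1200),
]
for session in sessionize(hits):
    print(session)
```

Because the generator keeps only the currently open session, its memory use does not depend on the number of hits or of distinct IP addresses.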


In [24]:
def process_logs_small( ):
    """
    Runs the process_logs function with the small dataset (186 MB).
    """        
    with open( DATA_DIR + "web_log_small.log" ) as log_file:
        process_logs( log_file )

In [25]:
%time process_logs_small( )

0 100000 200000 300000 400000 500000 600000 700000 800000 900000 Done
hits.csv done
sessions.csv done
session_length_plot.csv done
CPU times: user 15.4 s, sys: 864 ms, total: 16.2 s
Wall time: 20.8 s


In [26]:
import zipfile

def process_logs_large( ):
    """
    Runs the process_logs function on the full dataset.  The code below 
    performs a streaming unzip of the compressed dataset (158MB). 
    This saves the 1.6GB of disk space needed to unzip this file onto disk.
    """
    with zipfile.ZipFile( DATA_DIR + "web_log_large.zip" ) as z:
        fname = z.filelist[0].filename
        f = z.open( fname )
        process_logs( f )
        f.close( )

In [27]:
%time process_logs_large( )

0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000 1100000 1200000 1300000 1400000 1500000 1600000 1700000 1800000 1900000 2000000 2100000 2200000 2300000 2400000 2500000 2600000 2700000 2800000 2900000 3000000 3100000 3200000 3300000 3400000 3500000 3600000 3700000 3800000 3900000 4000000 4100000 4200000 4300000 4400000 4500000 4600000 4700000 4800000 4900000 5000000 5100000 5200000 5300000 5400000 5500000 5600000 5700000 5800000 5900000 6000000 6100000 6200000 6300000 6400000 6500000 6600000 6700000 6800000 6900000 7000000 7100000 7200000 7300000 7400000 7500000 7600000 7700000 7800000 7900000 8000000 8100000 8200000 Done
hits.csv done
sessions.csv done
session_length_plot.csv done
CPU times: user 2min 37s, sys: 6.19 s, total: 2min 43s
Wall time: 3min 26s


---------------

# Testing

We provide reference output only for the small dataset. The function `diff_against_reference()` produces `.diff` files if there's a difference between your output and the reference output.

If you're unfamiliar with the format of `diff`'s output, you can read about it [here](https://en.wikipedia.org/wiki/Diff_utility#Usage).
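
To get a feel for diff output without leaving Python, the standard library's `difflib` module can compare two lists of lines. Note that the sketch below prints the *unified* format (the style `git diff` uses), not the default format of the `diff` command. The file contents and the `reference/hits.csv` label are made-up examples.

```python
import difflib

# Made-up contents standing in for our output and the reference output.
ours = ['ip,timestamp\n', '1.1.1.1,1000\n', '2.2.2.2,1200\n']
reference = ['ip,timestamp\n', '1.1.1.1,1000\n', '2.2.2.2,1300\n']

diff_lines = list(difflib.unified_diff(
    ours, reference, fromfile='hits.csv', tofile='reference/hits.csv'))

# Lines starting with '-' appear only in our output,
# lines starting with '+' appear only in the reference.
for line in diff_lines:
    print(line.rstrip('\n'))
```

An empty result means the two inputs are identical.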

There are other diff utilities that produce more convenient output, making it easier to see the differences between the input files. If you're interested, you might try:

```
$ vimdiff hits.csv ~/coms4037/hw1/ref_output_small/hits.csv
OR
$ git diff hits.csv ~/coms4037/hw1/ref_output_small/hits.csv
```

In [28]:
import os

#ref_output_dir = DATA_DIR + "ref_output_small/"
ref_output_dir = DATA_DIR + "reference_output_web_log_small/"

def _diff_helper( f, unordered=False ):
    """
    @param f (str) - filename to diff with reference output
    @param unordered (bool) - whether the ordering of the lines matters
    """
    if not os.path.isfile( f ):
        print "FAIL - {} does not exist.".format( f )
        return
    
    if unordered:
        tmp1 = !mktemp
        tmp1 = tmp1[0]
        !sort {f} > {tmp1}
        !sort {ref_output_dir + f} | diff {tmp1} - > {f}.diff
    else:
        !diff {f} {ref_output_dir + f} > {f}.diff
    
    success = _exit_code == 0
    if success:
        !rm {f}.diff
        print "PASS - {} matched reference output.".format( f )
    else:
        print "FAIL - {} did not match reference output. See {}.diff.".format( f, f )
        
 
def diff_against_reference( ):
    """
    Compares the output files in the current directory with the reference output.
    If there is a difference, writes a ".diff" file, e.g. hits.csv.diff.
    """ 
    _diff_helper( "hits.csv" )
    _diff_helper( "sessions.csv", unordered=True )
    _diff_helper( "session_length_plot.csv" )

In [29]:
process_logs_small( )
diff_against_reference( )

0 100000 200000 300000 400000 500000 600000 700000 800000 900000 Done
hits.csv done
sessions.csv done
session_length_plot.csv done
PASS - hits.csv matched reference output.
PASS - sessions.csv matched reference output.
PASS - session_length_plot.csv matched reference output.



### Testing Memory Usage

For additional testing, we've included a script that
 - (1) makes sure all of your log processing code is between the BEGIN/END STUDENT CODE CELLS above, so it will work with the test code
 - (2) runs your code with a memory cap of 1MB. If you see a `MemoryError`, it means that your code is not doing appropriate streaming and/or divide-and-conquer!
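
As an illustration of the divide-and-conquer idea, the following is a minimal sketch of an external merge sort that uses only the standard library. It sorts fixed-size chunks in memory, writes each sorted chunk to a temporary file, and then merges the files lazily with `heapq.merge`. `external_sort` and its `chunk_size` parameter are hypothetical names, not part of the assignment code. The sketch assumes every line ends with a newline and that the number of chunk files stays below the system's open-file limit.

```python
import heapq
import itertools
import os
import tempfile

def external_sort(lines, chunk_size=100000):
    """
    Yield the given newline-terminated lines in sorted order while holding
    at most chunk_size lines in memory at a time (plus one line per run
    during the merge).
    """
    run_paths = []
    iterator = iter(lines)
    try:
        # Phase 1: sort one chunk at a time and spill each to its own file.
        while True:
            chunk = sorted(itertools.islice(iterator, chunk_size))
            if not chunk:
                break
            fd, path = tempfile.mkstemp(suffix='.run')
            with os.fdopen(fd, 'w') as run_file:
                run_file.writelines(chunk)
            run_paths.append(path)

        # Phase 2: merge the sorted runs, reading each file lazily.
        run_files = [open(path) for path in run_paths]
        try:
            for line in heapq.merge(*run_files):
                yield line
        finally:
            for run_file in run_files:
                run_file.close()
    finally:
        # Remove the temporary run files once the merge is finished.
        for path in run_paths:
            os.remove(path)

lines = ['2.2.2.2,1200\n', '1.1.1.1,1600\n', '1.1.1.1,1000\n']
for line in external_sort(lines, chunk_size=2):
    print(line.rstrip('\n'))
```

Shelling out to the Unix `sort` command, as the solution above does, is another way to get a disk-backed sort without writing this by hand.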
 
Make sure to save your notebook (`File > Save and Checkpoint` or Ctrl-S) before running the next cell.

In [23]:
!bash test_memory_usage.sh

[NbConvertApp] Converting notebook hw1.ipynb to python
Running process_logs_large()
0 100000 200000 300000 400000 500000 600000 700000 800000 900000 1000000 1100000 1200000 1300000 1400000 1500000 1600000 1700000 1800000 1900000 2000000 2100000 2200000 2300000 2400000 2500000 2600000 2700000 2800000 2900000 3000000 3100000 3200000 3300000 3400000 3500000 3600000 3700000 3800000 3900000 4000000 4100000 4200000 4300000 4400000 4500000 4600000 4700000 4800000 4900000 5000000 5100000 5200000 5300000 5400000 5500000 5600000 5700000 5800000 5900000 6000000 6100000 6200000 6300000 6400000 6500000 6600000 6700000 6800000 6900000 7000000 7100000 7200000 7300000 7400000 7500000 7600000 7700000 7800000 7900000 8000000 8100000 8200000 Done
hits.csv done
sessions.csv done
session_length_plot.csv done
Memory Test Done.
