## Parsing logs for error messages

### Standard Python version

In [2]:
import gzip

line_count = 0
error_count = 0

# Open in binary mode; the context manager closes the file when the loop ends
with gzip.open('server.log.gz', 'rb') as f:
    for line in f:
        log_line = line.decode("utf-8")
        # 'HTTP/1.0" 4' marks a request line followed by a 4xx status code
        if 'HTTP/1.0" 4' in log_line:
            error_count += 1
        line_count += 1

print('Total count:{:d}, Error count:{:d}, Error percent:{:f}%'
      .format(line_count, error_count,
              error_count*100/line_count))

Total count:8281188, Error count:23949, Error percent:0.289198%
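
The substring test above can be checked on a handful of lines before running it over the full file. The snippet below is a minimal sketch: the sample lines are made up in Common Log Format (an assumption about the layout of `server.log`, not taken from it), and `is_client_error` is a hypothetical helper name.

```python
# Hypothetical sample lines (assumed Common Log Format, not real data)
sample_lines = [
    'host1 - - [01/Jan/2000:00:00:01 +0000] "GET /index.html HTTP/1.0" 200 7074',
    'host2 - - [01/Jan/2000:00:00:06 +0000] "GET /missing.html HTTP/1.0" 404 -',
    'host3 - - [01/Jan/2000:00:00:09 +0000] "GET /private/ HTTP/1.0" 403 -',
    'host4 - - [01/Jan/2000:00:00:11 +0000] "GET /4.html HTTP/1.0" 200 1204',
]

# The closing quote of the request line followed by a status starting with 4
PATTERN = 'HTTP/1.0" 4'

def is_client_error(log_line):
    """Return True when the line contains the 4xx marker."""
    return PATTERN in log_line

flags = [is_client_error(line) for line in sample_lines]
error_count = sum(flags)
print(flags, error_count)
```

Note that the marker only matches requests made with `HTTP/1.0`; a log containing other protocol versions would need a broader pattern.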


#### Copy files to the local HDFS running in the Docker container (this may take a few minutes)

In [1]:
!hdfs dfs -copyFromLocal tweets.json tweets.json
!hdfs dfs -copyFromLocal server.log server.log
!hdfs dfs -ls .

Found 3 items
drwxr-xr-x   - root supergroup          0 2015-01-15 04:05 input
-rw-r--r--   1 root supergroup 1627794620 2015-10-29 15:19 server.log
-rw-r--r--   1 root supergroup  470808452 2015-10-29 15:17 tweets.json


### PySpark version

Let's define a function that we will call from the filter step. One of Spark's most convenient features is that any custom or built-in Python function can be called as part of a Spark job.

In [4]:
def find_http_error(line):
    # Index of the first 4xx marker in the line, or -1 if there is none
    return line.find('HTTP/1.0" 4')

In [5]:
rdd = sc.textFile("server.log")
# Transformation (lazy): keep only lines that contain a 4xx marker
rdd_filtered = rdd.filter(lambda line:
                          find_http_error(line) >= 0)
# Actions: each count() triggers a job over the data
line_count = rdd.count()
error_count = rdd_filtered.count()

print('Total count:{:d}, Error count:{:d}, Error percent:{:f}%'
      .format(line_count, error_count, error_count*100/line_count))

Total count:8281188, Error count:23949, Error percent:0.289198%
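
The two `count()` calls above each run a separate job over the file. As a minimal sketch of one alternative, both totals can be collected in a single pass with `RDD.aggregate(zeroValue, seqOp, combOp)`. The helper names `seq_op` and `comb_op` are hypothetical, and the partitions below are made-up sample data used to simulate the aggregation locally without a SparkContext.

```python
from functools import reduce

def seq_op(acc, line):
    """Fold one log line into a (total_lines, error_lines) pair."""
    total, errors = acc
    return (total + 1, errors + (1 if 'HTTP/1.0" 4' in line else 0))

def comb_op(a, b):
    """Merge the partial (total_lines, error_lines) pairs of two partitions."""
    return (a[0] + b[0], a[1] + b[1])

# With a live SparkContext `sc`, the same helpers would be used as:
# line_count, error_count = sc.textFile("server.log").aggregate((0, 0), seq_op, comb_op)

# Local simulation: two hypothetical partitions of made-up log lines
partitions = [
    ['a "GET / HTTP/1.0" 200 10', 'b "GET /x HTTP/1.0" 404 -'],
    ['c "GET /y HTTP/1.0" 403 -', 'd "GET /z HTTP/1.0" 304 0',
     'e "GET /w HTTP/1.0" 404 -'],
]
partials = [reduce(seq_op, part, (0, 0)) for part in partitions]
line_count, error_count = reduce(comb_op, partials, (0, 0))

print('Total count:{:d}, Error count:{:d}, Error percent:{:f}%'
      .format(line_count, error_count, error_count*100/line_count))
```

Whether this is faster than two `count()` calls depends on the cluster and on whether the RDD is cached, so it is worth measuring on the real data.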
