# Exercise 12: MapReduce

Welcome to exercise 12! This exercise is another free-form challenge, like lab 4, which focused on the London 2012 athletes dataset. This time, I want you to see if you can answer some questions about a dataset using only the MapReduce programming model.

<img src="images/mapreduce.jpg"/>

First, run the cell below to view a sample of 10 rows from the text file `nasa_access_log_aug95_sample.txt`. 

In [1]:
# The islice function takes the lines returned by the 
# open( ... ) command and returns a slice of only the first 10 of those lines.

from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        print(line)

159.142.165.138 - - [15/Aug/1995:11:03:22 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

134.131.38.18 - - [22/Aug/1995:13:43:38 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

os2c14.aca.ilstu.edu - - [31/Aug/1995:21:47:11 -0400] "GET /shuttle/missions/sts-69/sts-69-patch-small.gif HTTP/1.0" 200 8083

suba01.suba.com - - [24/Aug/1995:04:48:23 -0400] "GET /htbin/wais.pl?TISP HTTP/1.0" 200 1349

146.138.145.170 - - [08/Aug/1995:16:30:51 -0400] "GET /shuttle/missions/sts-62/sts-62-patch-small.gif HTTP/1.0" 200 14385

pizza.innet.net - - [24/Aug/1995:18:22:52 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

205.129.171.133 - - [16/Aug/1995:14:13:00 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

icenet.blackice.com.au - - [16/Aug/1995:07:52:55 -0400] "GET /history/apollo/images/apollo.gif HTTP/1.0

## The Challenge (part A)

Unlike the previous exercises, I have not provided you with a CSV file. Instead, this file contains lines of text in the format output by the Apache HTTP Server -- one of the most popular Web servers on the Internet. The lines are in a standardised format (see the [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format) for details), but they are not comma-separated.
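As an alternative to chained string replacements, the fields of a Common Log Format line can also be pulled out with a regular expression. The following is a minimal sketch under stated assumptions, not the method used in this exercise: the pattern `CLF_PATTERN`, the helper `parse_clf_line`, and the field names are hypothetical names chosen for illustration, and lines that do not fit the pattern are simply reported as `None`.

```python
import re

# Assumed pattern for one Common Log Format line:
# host, identity, user, [timestamp], "request", status code, size in bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_clf_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None

# One line taken from the sample output above
sample = ('159.142.165.138 - - [15/Aug/1995:11:03:22 -0400] '
          '"GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179')
fields = parse_clf_line(sample)
print(fields['host'], fields['status'], fields['size'])
```

Because every field is captured by name, a comma inside a requested path would not disturb the split, which is the problem the semicolon replacement in the string-based approach works around.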

The first part of the challenge is to create a CSV file from this log file. As an example, I have written a few lines below that replace ` - - ` with a comma `,`. You can use what you have already learned about string replacement to create what you think is a sensible set of columns, separated by commas.

In [2]:
from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        # The following line simply takes the line read and does a string replacement
        print(line.replace(' - - ', ','))

159.142.165.138,[15/Aug/1995:11:03:22 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

134.131.38.18,[22/Aug/1995:13:43:38 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

os2c14.aca.ilstu.edu,[31/Aug/1995:21:47:11 -0400] "GET /shuttle/missions/sts-69/sts-69-patch-small.gif HTTP/1.0" 200 8083

suba01.suba.com,[24/Aug/1995:04:48:23 -0400] "GET /htbin/wais.pl?TISP HTTP/1.0" 200 1349

146.138.145.170,[08/Aug/1995:16:30:51 -0400] "GET /shuttle/missions/sts-62/sts-62-patch-small.gif HTTP/1.0" 200 14385

pizza.innet.net,[24/Aug/1995:18:22:52 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

uplherc.upl.com,[01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

205.129.171.133,[16/Aug/1995:14:13:00 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

icenet.blackice.com.au,[16/Aug/1995:07:52:55 -0400] "GET /history/apollo/images/apollo.gif HTTP/1.0" 200 28847

qa2.silverplatter.com,[

In [3]:
def replace_right(source, target, replacement, replacements=-1):
    # Replace occurrences of target starting from the right; -1 (the default) replaces all of them
    return replacement.join(source.rsplit(target, replacements))

from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        strng = line.replace(',', ';') # replace commas with semicolons, otherwise we have a problem!
        strng = strng.replace(' - - ', ',')
        strng = strng.replace(' "GET ', ',')
        strng = strng.replace(' HTTP', ',HTTP')
        strng = strng.replace('" ', ',')
        strng = replace_right(strng, ' ', ',', replacements=1) # just replace the last space
        print(strng)

159.142.165.138,[15/Aug/1995:11:03:22 -0400],/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179

134.131.38.18,[22/Aug/1995:13:43:38 -0400],/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179

os2c14.aca.ilstu.edu,[31/Aug/1995:21:47:11 -0400],/shuttle/missions/sts-69/sts-69-patch-small.gif,HTTP/1.0,200,8083

suba01.suba.com,[24/Aug/1995:04:48:23 -0400],/htbin/wais.pl?TISP,HTTP/1.0,200,1349

146.138.145.170,[08/Aug/1995:16:30:51 -0400],/shuttle/missions/sts-62/sts-62-patch-small.gif,HTTP/1.0,200,14385

pizza.innet.net,[24/Aug/1995:18:22:52 -0400],/history/apollo/images/apollo-logo1.gif,HTTP/1.0,200,1173

uplherc.upl.com,[01/Aug/1995:00:00:10 -0400],/images/WORLD-logosmall.gif,HTTP/1.0,304,0

205.129.171.133,[16/Aug/1995:14:13:00 -0400],/images/launch-logo.gif,HTTP/1.0,200,1713

icenet.blackice.com.au,[16/Aug/1995:07:52:55 -0400],/history/apollo/images/apollo.gif,HTTP/1.0,200,28847

qa2.silverplatter.com,[23/Aug/1995:12:55:43 -0400],/history/mercury/mr-3/mr-3

Modify the code below to write out the `nasa_access_log_aug95_sample.csv` file with your string replacements to turn the input into a CSV file that can be read using `pandas`.

In [4]:
from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as input_file_pointer:
    with open('nasa_access_log_aug95_sample.csv', 'w') as output_file_pointer:
        for line in input_file_pointer:
            output_file_pointer.write("{line}".format(line=line.replace(' - - ', ',')))

In [5]:
from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as input_file_pointer:
    with open('nasa_access_log_aug95_sample.csv', 'w') as output_file_pointer:
        for line in input_file_pointer:
            strng = line.replace(',', ';') # replace commas with semicolons, otherwise we have a problem!
            strng = strng.replace(' - - ', ',')
            strng = strng.replace(' "GET ', ',')
            strng = strng.replace(' HTTP', ',HTTP')
            strng = strng.replace('" ', ',')
            strng = replace_right(strng, ' ', ',', replacements=1) # just replace the last space
            output_file_pointer.write("{line}".format(line=strng))

In [6]:
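import pandas as pd
# Note: the file has no header row, so read_csv treats the first log line as the header
# (it is then overwritten by df.columns below); passing header=None would keep that row.
# In pandas 1.3 and later, on_bad_lines='skip' replaces error_bad_lines/warn_bad_lines.
df = pd.read_csv('nasa_access_log_aug95_sample.csv', error_bad_lines=False, warn_bad_lines=False)#.sample(25)
df.columns = ['Address', 'Timestamp', 'File', 'Protocol', 'StatusCode', 'NumBytes']
df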

Unnamed: 0,Address,Timestamp,File,Protocol,StatusCode,NumBytes
0,134.131.38.18,[22/Aug/1995:13:43:38 -0400],/shuttle/missions/sts-73/sts-73-patch-small.gif,HTTP/1.0,200,4179
1,os2c14.aca.ilstu.edu,[31/Aug/1995:21:47:11 -0400],/shuttle/missions/sts-69/sts-69-patch-small.gif,HTTP/1.0,200,8083
2,suba01.suba.com,[24/Aug/1995:04:48:23 -0400],/htbin/wais.pl?TISP,HTTP/1.0,200,1349
3,146.138.145.170,[08/Aug/1995:16:30:51 -0400],/shuttle/missions/sts-62/sts-62-patch-small.gif,HTTP/1.0,200,14385
4,pizza.innet.net,[24/Aug/1995:18:22:52 -0400],/history/apollo/images/apollo-logo1.gif,HTTP/1.0,200,1173
5,uplherc.upl.com,[01/Aug/1995:00:00:10 -0400],/images/WORLD-logosmall.gif,HTTP/1.0,304,0
6,205.129.171.133,[16/Aug/1995:14:13:00 -0400],/images/launch-logo.gif,HTTP/1.0,200,1713
7,icenet.blackice.com.au,[16/Aug/1995:07:52:55 -0400],/history/apollo/images/apollo.gif,HTTP/1.0,200,28847
8,qa2.silverplatter.com,[23/Aug/1995:12:55:43 -0400],/history/mercury/mr-3/mr-3-patch-small.gif,HTTP/1.0,200,19084
9,199.108.1.97,[05/Aug/1995:19:52:58 -0400],/shuttle/missions/sts-70/movies/woodpecker.mpg,HTTP/1.0,200,49152


## The Challenge (part B)

By adding your own code in your own Jupyter Notebook cells below (you can add a cell by pressing the + button in the toolbar), try and answer some of the following questions about this data set:

- Which files were most popular in terms of `GET` requests?
- What day were the most HTTP requests made to the server?
- How many HTTP 200 (OK) responses were made?
- How many other HTTP code responses were made? Hint: here is a [list of HTTP response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- What were the biggest, smallest and average file sizes served?

**Important: I want you to try and complete this exercise using the MapReduce programming model. If you find this too difficult, go ahead and use `pandas` anyway, as this is still a very challenging lab.**

If you comfortably work out answers for all of these, feel free to add your own analyses!

In [7]:
#Which files were most popular in terms of GET requests?
df['File'].value_counts().head(5)

/images/NASA-logosmall.gif      6198
/images/KSC-logosmall.gif       4919
/images/MOSAIC-logosmall.gif    4222
/images/WORLD-logosmall.gif     4217
/images/USA-logosmall.gif       4216
Name: File, dtype: int64

In [8]:
# Try the same using MapReduce:

# map: emit a (filename, 1) pair for every valid file name
mapped_list = []
for filename in df['File']:
    if isinstance(filename, str):
        mapped = (filename, 1)
        mapped_list.append(mapped)

# reduce: sum the counts of identical, adjacent keys
wordcounts = pd.DataFrame(columns=['filename','count'])
current_word = ""
current_count = 0
for mapped in sorted(mapped_list, key=lambda x: x[0]):
    # parse the input we got from mapper
    word = mapped[0]
    count = mapped[1]

    # this IF-switch only works because we sorted map output by key
    if current_word == word:
        current_count += count
    else:
        if current_word:
            #print(current_word, current_count)
            wordcounts.loc[len(wordcounts)] = [current_word, current_count]
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking current_word alone also works when mapped_list is empty)
if current_word:
    #print(current_word, current_count)
    wordcounts.loc[len(wordcounts)] = [current_word, current_count]

wordcounts.sort_values(by='count', ascending=False).head(5)

Unnamed: 0,filename,count
2049,/images/NASA-logosmall.gif,6198
2046,/images/KSC-logosmall.gif,4919
2047,/images/MOSAIC-logosmall.gif,4222
2058,/images/WORLD-logosmall.gif,4217
2057,/images/USA-logosmall.gif,4216
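The same map, sort, and reduce steps can be written more compactly with `itertools.groupby`, which merges runs of adjacent items that share a key. This is only a sketch of one possible shape: the list `requested_files` is a small hypothetical sample standing in for `df['File']`, so the counts are not those of the real log.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample of requested files, standing in for df['File']
requested_files = [
    '/images/NASA-logosmall.gif',
    '/images/KSC-logosmall.gif',
    '/images/NASA-logosmall.gif',
    '/htbin/wais.pl?TISP',
    '/images/NASA-logosmall.gif',
]

# Map: emit a (key, 1) pair for every request
mapped = [(filename, 1) for filename in requested_files]

# Shuffle/sort: groupby only merges adjacent items, so sort by key first
mapped.sort(key=itemgetter(0))

# Reduce: sum the counts within each group of identical keys
file_counts = [
    (filename, sum(count for _, count in group))
    for filename, group in groupby(mapped, key=itemgetter(0))
]

# Most requested files first
file_counts.sort(key=itemgetter(1), reverse=True)
print(file_counts[0])
```

As in the longer version, the sort is essential: without it, `groupby` would emit a separate group each time a file name reappears.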


In [9]:
#What day were the most HTTP requests made to the server?
df['Day'] = df['Timestamp'].str[1:12]
df['Day'].value_counts().head(5)

31/Aug/1995    5717
30/Aug/1995    5130
29/Aug/1995    4435
10/Aug/1995    3913
14/Aug/1995    3895
Name: Day, dtype: int64
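The per-day count can also be phrased in MapReduce terms. The sketch below is one possible approach, using Python's built-in `map` and `functools.reduce`: each timestamp is mapped to a one-entry dictionary, and the reducer merges partial dictionaries. The `timestamps` list and the helper names `map_day` and `merge_counts` are hypothetical, standing in for `df['Timestamp']`.

```python
from functools import reduce

# Hypothetical sample timestamps in the log's format, standing in for df['Timestamp']
timestamps = [
    '[15/Aug/1995:11:03:22 -0400]',
    '[31/Aug/1995:21:47:11 -0400]',
    '[31/Aug/1995:23:59:59 -0400]',
    '[01/Aug/1995:00:00:10 -0400]',
]

def map_day(timestamp):
    """Map: turn one timestamp into a single-entry {day: 1} count."""
    return {timestamp[1:12]: 1}

def merge_counts(left, right):
    """Reduce: merge two partial count dictionaries into a new one."""
    merged = dict(left)
    for day, count in right.items():
        merged[day] = merged.get(day, 0) + count
    return merged

# The empty dict is the starting value, so an empty input gives an empty result
day_counts = reduce(merge_counts, map(map_day, timestamps), {})
busiest_day = max(day_counts, key=day_counts.get)
print(busiest_day, day_counts[busiest_day])
```

Because `merge_counts` combines any two partial results, the same reducer could in principle merge counts computed on separate chunks of the log.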

In [10]:
#How many HTTP 200 (OK) responses were made?
sum(df['StatusCode']=='200')

88638

In [11]:
#How many other HTTP code responses were made? 
df.shape[0] - sum(df['StatusCode']=='200')

11264
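Both status-code questions can be answered from a single MapReduce pass that counts every code. The sketch below makes the shuffle step explicit by grouping the mapped pairs in a `defaultdict`; the `status_codes` list is a hypothetical sample standing in for `df['StatusCode']`, with the codes kept as strings as in the comparison above.

```python
from collections import defaultdict

# Hypothetical status codes as strings, standing in for df['StatusCode']
status_codes = ['200', '200', '304', '404', '200', '304']

# Map: emit a (code, 1) pair for every response
mapped = [(code, 1) for code in status_codes]

# Shuffle: collect all values that share the same key
grouped = defaultdict(list)
for code, count in mapped:
    grouped[code].append(count)

# Reduce: sum the values collected for each key
status_counts = {code: sum(counts) for code, counts in grouped.items()}

ok_responses = status_counts.get('200', 0)
other_responses = sum(n for code, n in status_counts.items() if code != '200')
print(status_counts, ok_responses, other_responses)
```

Keeping the full `status_counts` dictionary also shows how the non-200 responses break down by code, which a single subtraction does not.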

In [12]:
#What were the biggest, smallest and average file sizes served?
df['NumBytes'].describe()
# some dodgy characters in NumBytes, so need to do some cleaning/filtering first...

count     99534
unique     2607
top           0
freq       8744
Name: NumBytes, dtype: object
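The `object` dtype above indicates that `NumBytes` is not being treated as a numeric column. One way to clean it and compute the statistics in MapReduce style is sketched below. It assumes the invalid entries are non-digit values such as `-` (which the Common Log Format uses when no bytes were sent) or missing values; what actually appears in this file has not been checked here. The list `raw_sizes` and the helpers `map_size` and `combine` are hypothetical.

```python
from functools import reduce

# Hypothetical raw values standing in for df['NumBytes'];
# '-' and None are assumed examples of entries that are not numbers
raw_sizes = ['4179', '0', '-', '14385', None, '1173']

def map_size(value):
    """Map: emit (min, max, total, count) for one valid size, else None."""
    text = str(value).strip()
    if not text.isdigit():
        return None  # filter out anything that is not a plain number
    size = int(text)
    return (size, size, size, 1)

def combine(left, right):
    """Reduce: merge two partial (min, max, total, count) tuples."""
    return (
        min(left[0], right[0]),
        max(left[1], right[1]),
        left[2] + right[2],
        left[3] + right[3],
    )

partials = [p for p in map(map_size, raw_sizes) if p is not None]
# Note: reduce raises TypeError on an empty list, so at least one valid size is assumed
smallest, biggest, total, count = reduce(combine, partials)
average = total / count
print(smallest, biggest, average)
```

Carrying the total and the count through the reduce step, rather than an average, is what lets partial results be merged correctly.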

## Bonus Challenge

If you are feeling *really* adventurous, you can try using a Python library to do geographical-IP lookups to do some analyses. You will need to open up the command line and install the library called `geolite2`. To do this, open **Git Bash** and type the following:

```
$  pip install maxminddb-geolite2
```

Once `pip` has installed `geolite2` and you have restarted Jupyter Notebook, you should be able to use it as follows:

In [15]:
!pip install maxminddb-geolite2

Collecting maxminddb-geolite2
  Downloading https://files.pythonhosted.org/packages/51/01/d12231a190659c269fec87a1144c9e243aadfccb0b18f7aca329661d9308/maxminddb-geolite2-2018.703.tar.gz (26.1MB)
Collecting maxminddb (from maxminddb-geolite2)
  Downloading https://files.pythonhosted.org/packages/83/35/6dc423e0ff354c326849d6d878d104b44be7eec491dcf26787ab3593cd81/maxminddb-1.4.1.tar.gz (264kB)
Building wheels for collected packages: maxminddb-geolite2, maxminddb
  Building wheel for maxminddb-geolite2 (setup.py) ... done
  Stored in directory: /Users/davidjohnson/Library/Caches/pip/wheels/94/69/0a/4453d83e882e2c55aa8c8b5b37342e0b4acddb92e808fa9664
  Building wheel for maxminddb (setup.py) ... done
  Stored in directory: /Users/davidjohnson/Library/Caches/pip/wheels/58/60/71/9d07e2c0999b13b1f3ca3e21

In [16]:
from geolite2 import geolite2
reader = geolite2.reader()
reader.get('1.1.1.1')

{'city': {'geoname_id': 2151718, 'names': {'en': 'Research'}},
 'continent': {'code': 'OC',
  'geoname_id': 6255151,
  'names': {'de': 'Ozeanien',
   'en': 'Oceania',
   'es': 'Oceanía',
   'fr': 'Océanie',
   'ja': 'オセアニア',
   'pt-BR': 'Oceania',
   'ru': 'Океания',
   'zh-CN': '大洋洲'}},
 'country': {'geoname_id': 2077456,
  'iso_code': 'AU',
  'names': {'de': 'Australien',
   'en': 'Australia',
   'es': 'Australia',
   'fr': 'Australie',
   'ja': 'オーストラリア',
   'pt-BR': 'Austrália',
   'ru': 'Австралия',
   'zh-CN': '澳大利亚'}},
 'location': {'accuracy_radius': 1000,
  'latitude': -37.7,
  'longitude': 145.1833,
  'time_zone': 'Australia/Melbourne'},
 'postal': {'code': '3095'},
 'registered_country': {'geoname_id': 2077456,
  'iso_code': 'AU',
  'names': {'de': 'Australien',
   'en': 'Australia',
   'es': 'Australia',
   'fr': 'Australie',
   'ja': 'オーストラリア',
   'pt-BR': 'Austrália',
   'ru': 'Австралия',
   'zh-CN': '澳大利亚'}},
 'subdivisions': [{'geoname_id': 2145234,
   'iso_code': 'VIC

The `reader.get( ... )` line takes an IP address, looks up the geographical information about it, and returns a Python dictionary. You can then select specific geographical information about the IP address. For example:

In [17]:
# Get the country, in particular the English name
geo_dict = reader.get('1.1.1.1')
geo_dict['country']['names']['en']

'Australia'

In [18]:
# Get the continent, in particular the English name
geo_dict = reader.get('1.1.1.1')
geo_dict['continent']['names']['en']

'Oceania'

I have not tested this, so I will leave it to you to work out for yourselves if you take on this Bonus Challenge!
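As one possible starting point, the sketch below counts requests per country. It has not been run against the real log or the real database. Two things are assumptions: the `Address` column mixes IP addresses with host names, so host names are skipped using the standard `ipaddress` module, and the `lookup` argument is a hypothetical stand-in for `reader.get`, replaced here by a small fake dictionary so that the example runs without `geolite2` installed.

```python
import ipaddress
from collections import Counter

def is_ip_address(address):
    """Return True if the string parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(address)
        return True
    except ValueError:
        return False

def count_countries(addresses, lookup):
    """Count English country names for the addresses that can be looked up."""
    counts = Counter()
    for address in addresses:
        if not is_ip_address(address):
            continue  # host names such as 'pizza.innet.net' are skipped
        record = lookup(address)
        # Assumption: the lookup may return None or omit 'country' for some IPs
        if record and 'country' in record:
            counts[record['country']['names']['en']] += 1
    return counts

# Hypothetical stand-in for reader.get, so the sketch runs without the database
fake_records = {'1.1.1.1': {'country': {'names': {'en': 'Australia'}}}}
country_counts = count_countries(
    ['1.1.1.1', 'pizza.innet.net', '1.1.1.1', '10.0.0.1'],
    fake_records.get,
)
print(country_counts)
```

With the library installed, passing `reader.get` in place of `fake_records.get` would be the natural substitution, though that combination is untested here.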