# Lab 3: MapReduce in Python

This exercise is a free-form challenge. I want you to see if you can answer some analytical questions on a dataset, but only by using the MapReduce programming model.

<img src="https://www.ibm.com/content/dam/connectedassets-adobe-cms/worldwide-content/creative-assets/s-migr/ul/g/1c/98/using-mapreduce-to-determine-high-temperatures-per-city.component.xl-retina.ts=1763402375437.png/content/adobe-cms/us/en/think/topics/mapreduce/jcr:content/root/table_of_contents/body-article-8/image"/>

(Image from https://www.ibm.com/think/topics/mapreduce, also have a read of the article)

First, grab the data file from this URL: https://drive.google.com/file/d/1ZiyXLVDyirV_2OivNVdTqeUPQ5ez7M2a/view?usp=sharing

Then upload it to your Colab instance and run the cell below to view a sample of 10 rows from the text file `nasa_access_log_aug95_sample.txt`.

In [1]:
# The function islice thats the list of lines returned by the
#Â open( ... ) command and returns a slice of only 10 of those lines.

from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        print(line)

159.142.165.138 - - [15/Aug/1995:11:03:22 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

134.131.38.18 - - [22/Aug/1995:13:43:38 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

os2c14.aca.ilstu.edu - - [31/Aug/1995:21:47:11 -0400] "GET /shuttle/missions/sts-69/sts-69-patch-small.gif HTTP/1.0" 200 8083

suba01.suba.com - - [24/Aug/1995:04:48:23 -0400] "GET /htbin/wais.pl?TISP HTTP/1.0" 200 1349

146.138.145.170 - - [08/Aug/1995:16:30:51 -0400] "GET /shuttle/missions/sts-62/sts-62-patch-small.gif HTTP/1.0" 200 14385

pizza.innet.net - - [24/Aug/1995:18:22:52 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

uplherc.upl.com - - [01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

205.129.171.133 - - [16/Aug/1995:14:13:00 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

icenet.blackice.com.au - - [16/Aug/1995:07:52:55 -0400] "GET /history/apollo/images/apollo.gif HTTP/1.0

## The Challenge (part A)

I have not provided you with a CSV file. This is a file that contains lines of text that is the format output by the Apache HTTP Server -- one of the most popular Web servers on the Internet -- where the lines are in a standardised format (see the [Common Log Format](https://en.wikipedia.org/wiki/Common_Log_Format) for details), but not comma-separated.

The first part of the challenge is to create a CSV file from this log file. As an example, I have written a few lines below that work on replacing ` - - ` with a comma `,`. You can try and create what you think is a sensible split of columns by replacing a string sequence with commas that indicate columns of a CSV.

In [5]:
from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as file_pointer:
    for line in list(islice(file_pointer, 10)):
        # The following line simply takes the line read and does a string replacement
        print(line.replace(' - - ', ','))

159.142.165.138,[15/Aug/1995:11:03:22 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

134.131.38.18,[22/Aug/1995:13:43:38 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

os2c14.aca.ilstu.edu,[31/Aug/1995:21:47:11 -0400] "GET /shuttle/missions/sts-69/sts-69-patch-small.gif HTTP/1.0" 200 8083

suba01.suba.com,[24/Aug/1995:04:48:23 -0400] "GET /htbin/wais.pl?TISP HTTP/1.0" 200 1349

146.138.145.170,[08/Aug/1995:16:30:51 -0400] "GET /shuttle/missions/sts-62/sts-62-patch-small.gif HTTP/1.0" 200 14385

pizza.innet.net,[24/Aug/1995:18:22:52 -0400] "GET /history/apollo/images/apollo-logo1.gif HTTP/1.0" 200 1173

uplherc.upl.com,[01/Aug/1995:00:00:10 -0400] "GET /images/WORLD-logosmall.gif HTTP/1.0" 304 0

205.129.171.133,[16/Aug/1995:14:13:00 -0400] "GET /images/launch-logo.gif HTTP/1.0" 200 1713

icenet.blackice.com.au,[16/Aug/1995:07:52:55 -0400] "GET /history/apollo/images/apollo.gif HTTP/1.0" 200 28847

qa2.silverplatter.com,[

In [6]:
with open('nasa_access_log_aug95_sample.txt') as input_file_pointer:
    with open('nasa_access_log_aug95_sample.csv', 'w') as output_file_pointer:
        for line in input_file_pointer:
            new_line = line.replace(' - - ', ',')
            output_file_pointer.write(new_line)


Modify the code below to write out the `nasa_access_log_aug95_sample.csv` file with your string replacements to turn the input into a CSV file that can be read using `pandas`.

In [9]:
from itertools import islice
with open('nasa_access_log_aug95_sample.txt') as input_file_pointer:
    with open('nasa_access_log_aug95_sample.csv', 'w') as output_file_pointer:
        for line in input_file_pointer:
           cleaned_line = line.replace(' - - ', ',')
           cleaned_line = cleaned_line.replace('[', '')
           cleaned_line = cleaned_line.replace(']', '')
           cleaned_line = cleaned_line.replace('"', '')

        output_file_pointer.write("{line}".format(line=line.replace(' - - ', ',')))

In [11]:
import pandas as pd
df = pd.read_csv("nasa_access_log_aug95_sample.csv")
df.head()

Unnamed: 0,wpbfl2-22.gate.net,[30/Aug/19


## The Challenge (part B)

By adding your own code in your own Jupyter Notebook cells below (you can add a cell by pressing the + button in the toolbar), try and answer some of the following questions about this data set:

- Which files were most popular in terms of `GET` requests?
- What day were the most HTTP requests made to the server?
- How many HTTP 200 (OK) responses were made?
- How many other HTTP code responses were made? Hint: here is a [list of HTTP response codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) but hopefully you remember some of this stuff from the Digital Infrastructure course!
- What were the biggest, smallest and average file sizes served?

**Important: I want you to try and complete this exercise using the MapReduce programming model. If you find this too difficult, go ahead an use `pandas` anyway as this is still a very challenging lab.**

If you comfortably work out answers for all of these, feel free to add your own analyses!

When you're finished with the lab (or had completed what you can), choose **Save and Checkpoint** from the **File** menu, then choose **Download .ipynb** and save it to your computer. You can then submit via Studium.

In [12]:
# Read the log file into a list of lines
with open("nasa_access_log_aug95_sample.txt") as f:
    lines = f.readlines()

1. Which files were most popular (GET requests)

In [15]:
# MAP: Extract requested files from GET requests
def map_files(lines):
    results = []
    for line in lines:
        parts = line.split('"')
        if len(parts) > 1:
            request = parts[1]
            if request.startswith("GET"):
                file = request.split()[1]
                results.append((file, 1))
    return results


In [16]:
# REDUCE: Count how many times each file was requested
from collections import defaultdict

def reduce_counts(mapped):
    counts = defaultdict(int)
    for key, val in mapped:
        counts[key] += val
    return counts


In [17]:
# RUN: Apply Map and Reduce
mapped_files = map_files(lines)
reduced_files = reduce_counts(mapped_files)

# Get Top 10 most requested files
top_files = sorted(reduced_files.items(), key=lambda x: x[1], reverse=True)[:10]
top_files  # Display result

[('/images/NASA-logosmall.gif', 6198),
 ('/images/KSC-logosmall.gif', 4919),
 ('/images/MOSAIC-logosmall.gif', 4222),
 ('/images/WORLD-logosmall.gif', 4217),
 ('/images/USA-logosmall.gif', 4216),
 ('/images/ksclogo-medium.gif', 3951),
 ('/ksc.html', 2786),
 ('/history/apollo/images/apollo-logo1.gif', 2379),
 ('/images/launch-logo.gif', 2196),
 ('/images/ksclogosmall.gif', 1847)]

2. Day with most HTTP requests

In [18]:
import re
from collections import defaultdict

In [19]:

# MAP: Extract the day from each line
def map_days(lines):
    pairs = []
    for line in lines:
        match = re.search(r'\d{2}/[A-Za-z]{3}/\d{4}', line)
        if match:
            day = match.group(0)
            pairs.append((day, 1))
    return pairs

# REDUCE: Count occurrences
def reduce_counts(mapped):
    counts = defaultdict(int)
    for key, val in mapped:
        counts[key] += val
    return counts




In [20]:
# RUN
mapped_days = map_days(lines)
reduced_days = reduce_counts(mapped_days)

# Most active day
most_active_day = max(reduced_days.items(), key=lambda x: x[1])
most_active_day

('31/Aug/1995', 5717)

3. Number of HTTP 200(OK) responses made

In [21]:
import re
from collections import defaultdict


In [22]:

# MAP: Extract HTTP status codes
def map_status(lines):
    pairs = []
    for line in lines:
        match = re.search(r' (\d{3}) ', line)
        if match:
            code = match.group(1)
            pairs.append((code, 1))
    return pairs

# REDUCE: Count occurrences
def reduce_counts(mapped):
    counts = defaultdict(int)
    for key, val in mapped:
        counts[key] += val
    return counts




In [23]:
# RUN
mapped_status = map_status(lines)
reduced_status = reduce_counts(mapped_status)

# Count HTTP 200 responses
reduced_status.get("200", 0)

88996

4. Number of all other HTTP codes

In [24]:
# Use the reduced_status from previous cell
reduced_status  # Shows count for each HTTP code (200, 404, 301, etc.)


defaultdict(int,
            {'200': 88996,
             '304': 8596,
             '302': 1644,
             '404': 652,
             '403': 12,
             '501': 2})

5. Biggest, smallest, and average file sizes

In [25]:
from collections import defaultdict

In [26]:
# MAP: Extract file sizes
def map_sizes(lines):
    sizes = []
    for line in lines:
        parts = line.split()
        if parts and parts[-1].isdigit():
            size = int(parts[-1])
            sizes.append(("size", size))
    return sizes

# REDUCE: Compute min, max, average
def reduce_sizes(mapped):
    size_list = [v for k,v in mapped]
    return min(size_list), max(size_list), sum(size_list)/len(size_list)



In [27]:
# RUN
mapped_sizes = map_sizes(lines)
reduce_sizes(mapped_sizes)

(0, 3155499, 17458.96599690881)

# Bonus Challenge

If you are feeling *really* adventurous, you can try using a Python library to do geographical-IP lookups to do some analyses.