# Coursework 2: Data Processing

## Task 1
This coursework will assess your understanding of using NoSQL to store and retrieve data.  You will perform operations on data from the Enron email dataset in a MongoDB database, and write a report detailing the suitability of different types of databases for data science applications.  You will be required to run code to answer the given questions in the Jupyter notebook provided, and write a report describing alternative approaches to using MongoDB.

Download the JSON version of the Enron data from http://edshare.soton.ac.uk/19548/ (use the “Download as zip” option; the file is about 380MB) and import it into a collection called messages in a database called enron.  You do not need to set up any authentication.  In the Jupyter notebook provided, perform the following tasks using the Python PyMongo library.

Answers should be efficient in terms of speed.  Answers which are less efficient will not get full marks.

### Importing the dataset

The JSON version of the dataset has been downloaded from [this link](http://edshare.soton.ac.uk/19548/)

The dataset has been imported into the database **enron_short**

The name of the collection is **messages**

**100000** documents have been imported

In [8]:
%%bash

# mongoimport is the MongoDB command-line tool for importing data.
# The flags below specify the database, the collection and the import file
# --drop means it's going to drop any collection with the same name which already exists
mongoimport --db enron_short --collection messages --drop --file ./messages_short.json
# Delete the JSON file we just downloaded
rm ./messages_short.json

2018-12-08T19:40:19.811+0000	connected to: localhost
2018-12-08T19:40:19.811+0000	dropping: enron_short.messages
2018-12-08T19:40:22.810+0000	[######..................] enron_short.messages	94.5MB/354MB (26.7%)
2018-12-08T19:40:25.811+0000	[#############...........] enron_short.messages	201MB/354MB (56.8%)
2018-12-08T19:40:28.810+0000	[###################.....] enron_short.messages	290MB/354MB (82.0%)
2018-12-08T19:40:31.810+0000	[######################..] enron_short.messages	329MB/354MB (92.9%)
2018-12-08T19:40:34.478+0000	[########################] enron_short.messages	354MB/354MB (100.0%)
2018-12-08T19:40:34.478+0000	imported 100000 documents


In [1]:
import re
from datetime import datetime
from pprint import pprint

import pymongo
from pymongo import MongoClient

In [3]:
client = MongoClient('mongodb://localhost:27017')

client.list_database_names()

['admin', 'config', 'enron_short', 'local']

### 1)
Write a function which returns a MongoDB connection object to the "messages" collection. [4 points] 

In [4]:
db_name = 'enron_short'
coll_name = 'messages'

def get_collection():
    """
    Connects to the server, and returns a collection object
    of the `coll_name` collection in the `db_name` database
    """
    # YOUR CODE HERE

    client = MongoClient('mongodb://localhost:27017')

    # Fail loudly instead of returning a tuple of strings, which callers
    # would otherwise mistake for a collection object
    if db_name not in client.list_database_names():
        raise ValueError("Database not found: " + db_name)

    db = client[db_name]
    # check if collection is present
    if coll_name not in db.list_collection_names():
        raise ValueError("Collection not found: " + coll_name)

    return db[coll_name]
            
        
messages_collection = get_collection()

Verify that the collection object can read all documents:

In [5]:
messages_collection.count_documents({})

100000

### 2)

Write a function which returns the total number of emails in the messages collection. [4 points] 

In [6]:
def get_amount_of_messages(collection):
    """
    :param collection A PyMongo collection object
    :return the amount of documents in the collection
    """    
    # YOUR CODE HERE
    # Use the collection passed in, not the global messages_collection
    return collection.count_documents({})


number_of_emails = get_amount_of_messages(messages_collection)

print(number_of_emails)

100000


### 3) 

Write a function which returns each person who was BCCed on an email.  Include each person only once, and display only their name according to the X-To header. [4 points] 



In [7]:
bcc_list = []

bcc_list1 = []
bcc_list2 = []
bcc_list3 = []

final_list = []

# find docs where bcc field exists and is not empty

for doc in messages_collection.find({ 'headers.X-bcc': {'$exists': True, '$ne': ''} }):
    bcc_list.append(doc['headers']['X-bcc'])

# split each header value into one entry per recipient
for bcc_value in bcc_list:
    bcc_list1.append(bcc_value.split('>,'))

# keep only the text before '</O' in each entry
for bcc_value in bcc_list1:
    for value in bcc_value:
        bcc_list2.append(value.split('</O')[0])

# keep only the text before ' <' (the display name)
for bcc_value in bcc_list2:
    bcc_list3.append(bcc_value.split(' <')[0])

# keep each name once, skipping entries that are bare email addresses
for bcc_value in bcc_list3:
    if '@' not in bcc_value:
        if bcc_value.strip() not in final_list:
            final_list.append(bcc_value.strip())

final_list

['Villarreal, Alex',
 'Vuittonet, Laura',
 'Wood, Kim',
 'Choate, Heather',
 'Rangel, Ina',
 'Hogan, Irena D.',
 'Westbrook, Cherylene R.',
 'De La Paz, Janet',
 'Beck, Sally',
 'Denny, Jennifer',
 'Piper, Greg',
 'Patti Thompson',
 'Robert Superty',
 'Beth Apollo',
 'Greg Piper',
 'Apollo, Beth',
 'Barry, Patrick',
 'Blair, Jean',
 'Bryan, Randy',
 'Callans, Nancy',
 'Carr, James',
 'Clapper, Karen',
 'Perry, Renee',
 'Porter, Diana',
 'Walden, Shirley',
 'Washington, Kathy',
 'Causey, Richard',
 'Dyson, Fernley',
 'Jordan, Mike',
 'Lokey, Teb',
 'Ratliff, Dale',
 'Shapiro, Richard',
 'Dernehl, Ginger',
 'Guerrero, Janel',
 'Steffes, James D.',
 'Richard Shapiro',
 'James D Steffes',
 'Susan J Mara',
 'Harry Kingerski',
 'hgovenar',
 'Scott Govenar',
 'Davis, Dana',
 'Ogenyi, Gloria',
 'Collins, Harry',
 'Cooley, Jan',
 'Johnson, Jan',
 'McBride, Jane',
 'Wilson, Jane',
 'Butler, Janet',
 'Place, Janet',
 'Moore, Janice',
 'Desrochers, Jim',
 'Lamb, John',
 'Novak, John',
 'Schwartzen

In [8]:
def get_bcced_people(collection):
    """
    :param collection A PyMongo collection object
    :return the names of the people who have received an email by BCC
    """    
    # YOUR CODE HERE

    pass
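
One possible way to fill in this stub is sketched below. It rests on assumptions about the `X-bcc` header format: entries are separated by `>,`, a name may be followed by a `</O...` path or a ` <address` part, and entries containing `@` are bare addresses to skip. The helper name `extract_bcc_names` is hypothetical, and empty fragments are dropped as a design choice.

```python
def extract_bcc_names(x_bcc):
    """Split one X-bcc header value into display names (hypothetical helper)."""
    names = []
    for entry in x_bcc.split('>,'):
        # Keep only the text before the '</O' path or the ' <address' part
        name = entry.split('</O')[0].split(' <')[0].strip()
        # Skip empty fragments and entries that are bare email addresses
        if name and '@' not in name:
            names.append(name)
    return names


def get_bcced_people(collection):
    """Return each BCCed name once, in first-seen order."""
    query = {'headers.X-bcc': {'$exists': True, '$ne': ''}}
    # Project only the header we need so less data is sent to the client
    projection = {'headers.X-bcc': 1, '_id': 0}
    seen = {}
    for doc in collection.find(query, projection):
        for name in extract_bcc_names(doc['headers']['X-bcc']):
            seen.setdefault(name, None)  # dicts preserve insertion order
    return list(seen)
```

Using a dict for deduplication keeps each membership check constant-time, which matters more as the number of distinct names grows.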
    

### 4)

Write a function with parameter subject, which gets all emails in a thread with that subject, and orders them by date (ascending). “An email thread is an email message that includes a running list of all the succeeding replies starting with the original email.” For a detailed description, see https://www.techopedia.com/definition/1503/email-thread [4 points]

In [9]:
def get_emails_in_thread(collection, subject):
    """
    :param collection A PyMongo collection object
    :return All emails in the thread with that subject
    """    
    # YOUR CODE HERE    
    
    pass
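
A minimal sketch of one way to approach this stub follows. It assumes that a thread is every email whose subject equals the given subject once reply and forward prefixes (`Re:`, `Fw:`, `Fwd:`) are removed, and that sorting on the first 25 characters of `headers.Date` (ignoring the timezone offset) is acceptable. The helper name `thread_subject_regex` is hypothetical.

```python
import re


def thread_subject_regex(subject):
    """Build a case-insensitive pattern for `subject` with optional reply prefixes."""
    # Strip any prefixes the caller passed in, so 'Re: X' and 'X' behave alike
    base = re.sub(r'^\s*((re|fw|fwd):\s*)+', '', subject, flags=re.IGNORECASE)
    return re.compile(r'^\s*((re|fw|fwd):\s*)*' + re.escape(base) + r'\s*$',
                      re.IGNORECASE)


def get_emails_in_thread(collection, subject):
    """Return the thread's emails ordered by date, oldest first."""
    pipeline = [
        # Filter first so later stages only touch the matching documents
        {'$match': {'headers.Subject': thread_subject_regex(subject)}},
        # headers.Date is a string; parse its first 25 characters into a date
        {'$addFields': {'DateOfMessage': {'$dateFromString': {'dateString': {
            '$rtrim': {'input': {'$substr': ['$headers.Date', 0, 25]}}}}}}},
        {'$sort': {'DateOfMessage': 1}},
    ]
    return list(collection.aggregate(pipeline))
```

Placing `$match` ahead of the date conversion means only the documents in the thread are parsed and sorted.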

In [10]:
for doc in messages_collection.find({}).limit(10):
    pprint(doc['headers'])

{'Content-Transfer-Encoding': '7bit',
 'Content-Type': 'text/plain; charset=us-ascii',
 'Date': 'Tue, 14 Nov 2000 08:22:00 -0800 (PST)',
 'From': 'michael.simmons@enron.com',
 'Message-ID': '<6884142.1075854677416.JavaMail.evans@thyme>',
 'Mime-Version': '1.0',
 'Subject': 'Re: Plays and other information',
 'To': 'eric.bass@enron.com',
 'X-FileName': 'ebass.nsf',
 'X-Folder': '\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox',
 'X-From': 'Michael Simmons',
 'X-Origin': 'Bass-E',
 'X-To': 'Eric Bass',
 'X-bcc': '',
 'X-cc': ''}
{'Content-Transfer-Encoding': '7bit',
 'Content-Type': 'text/plain; charset=us-ascii',
 'Date': 'Tue, 14 Nov 2000 07:37:00 -0800 (PST)',
 'From': 'michael.simmons@enron.com',
 'Message-ID': '<6098626.1075854677438.JavaMail.evans@thyme>',
 'Mime-Version': '1.0',
 'Subject': 'Re: Plays and other information',
 'To': 'eric.bass@enron.com',
 'X-FileName': 'ebass.nsf',
 'X-Folder': '\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox',
 'X-From': 'Michael Simmons',
 'X-Or

In [11]:
datetime.strptime('Tue, 14 Nov 2000 08:22:00', '%a, %d %b %Y %H:%M:%S')

datetime.datetime(2000, 11, 14, 8, 22)
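
The stored header values carry a timezone suffix such as ` -0800 (PST)`, which this format string does not cover. A small sketch of a client-side helper that drops the suffix before parsing is shown here; the name `parse_header_date` is hypothetical, and the offset is deliberately ignored, so the result is the sender's local wall-clock time.

```python
from datetime import datetime

HEADER_DATE_FORMAT = '%a, %d %b %Y %H:%M:%S'


def parse_header_date(value):
    """Parse an email Date header, ignoring any timezone suffix."""
    # Weekday, day, month, year and time are the first five
    # whitespace-separated fields; anything after them is dropped
    fields = value.split()[:5]
    return datetime.strptime(' '.join(fields), HEADER_DATE_FORMAT)
```

Splitting on whitespace rather than slicing a fixed width also copes with single-digit days such as `Wed, 8 Nov 2000`.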

In [12]:
sub1 = "Plays and other information"

sub2 = "Re: " + sub1

limit_stage = {
    '$limit': 100
}

match_stage = {
    '$match': { 'headers.Subject': { '$in': [sub1, sub2] } }
}

project_stage = {
     '$project': { 'DateOfMessage': 
                      {'$dateFromString': { 'dateString' : 
                                           { '$rtrim': {'input': {'$substr': ['$headers.Date', 0, 25] }}}
                                          
                                          }
                      },
                        
                        'filename': 1,
                        'headers.Subject': 1,
                        'headers.Date': 1
                 } 
}

sort_stage = {
    '$sort': {'DateOfMessage': 1}
}


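# NOTE: match_stage is defined above but is not part of this pipeline,
# so the output below covers the first 100 documents regardless of subject
pipeline = [limit_stage, project_stage, sort_stage]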

for doc in messages_collection.aggregate(pipeline):
    #pprint(doc['headers']['Date'][:-12])
    #pprint(datetime.strptime(doc['headers']['Date'][:-12], '%a, %d %b %Y %H:%M:%S'))
    
    pprint(doc)

{'DateOfMessage': datetime.datetime(2000, 11, 8, 9, 45),
 '_id': ObjectId('4f16fc97d1e2d32371003e70'),
 'filename': '516.',
 'headers': {'Date': 'Wed, 8 Nov 2000 09:45:00 -0800 (PST)',
             'Subject': 'Las Vegas from $39.95/nt! Plus Super Hot Hotel '
                        'Bargains throughout\r\n'
                        ' the U.S., Europe and the Caribbean!'}}
{'DateOfMessage': datetime.datetime(2000, 11, 10, 0, 19),
 '_id': ObjectId('4f16fc97d1e2d32371003e88'),
 'filename': '538.',
 'headers': {'Date': 'Fri, 10 Nov 2000 00:19:00 -0800 (PST)',
             'Subject': 'If you have 5 minutes'}}
{'DateOfMessage': datetime.datetime(2000, 11, 10, 0, 19),
 '_id': ObjectId('4f16fc97d1e2d32371003e89'),
 'filename': '539.',
 'headers': {'Date': 'Fri, 10 Nov 2000 00:19:00 -0800 (PST)',
             'Subject': 'Outage for Unify Gas Users'}}
{'DateOfMessage': datetime.datetime(2000, 11, 10, 0, 21),
 '_id': ObjectId('4f16fc97d1e2d32371003e87'),
 'filename': '537.',
 'headers': {'Date': '

### 5)

Write a function which returns the percentage of emails sent on a weekend (i.e., Saturday and Sunday) as a `float` between 0 and 1. [6 points]

In [13]:
def get_percentage_sent_on_weekend(collection):
    """
    :param collection A PyMongo collection object
    :return A float between 0 and 1
    """    
    # YOUR CODE HERE
    
    pass
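
As a hedged sketch, the stub could be filled in with a single aggregation that counts weekend emails and all emails in one pass. It assumes the weekday should be taken from the local time written in `headers.Date` (the timezone offset is ignored) and relies on `$dayOfWeek` numbering days from 1 (Sunday) to 7 (Saturday).

```python
def get_percentage_sent_on_weekend(collection):
    """Return the fraction of emails whose Date header falls on Sat or Sun."""
    pipeline = [
        # Parse the first 25 characters of the Date header (offset ignored)
        {'$project': {'_id': 0, 'DateOfMessage': {'$dateFromString': {
            'dateString': {'$rtrim': {'input': {
                '$substr': ['$headers.Date', 0, 25]}}}}}}},
        # $dayOfWeek numbers days from 1 (Sunday) to 7 (Saturday)
        {'$project': {'is_weekend': {'$cond': [
            {'$in': [{'$dayOfWeek': '$DateOfMessage'}, [1, 7]]}, 1, 0]}}},
        # One group yields both the weekend count and the total
        {'$group': {'_id': None,
                    'weekend': {'$sum': '$is_weekend'},
                    'total': {'$sum': 1}}},
    ]
    result = next(iter(collection.aggregate(pipeline)), None)
    if result is None or result['total'] == 0:
        return 0.0
    return result['weekend'] / result['total']
```

Computing the total inside the same `$group` avoids a separate `count_documents` round trip.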

In [14]:
total_documents = messages_collection.count_documents({})

In [15]:
limit_stage = {
    '$limit': 100
}



project_stage = {
     '$project': { 'DateOfMessage': 
                      {'$dateFromString': { 'dateString' : 
                                           { '$rtrim': {'input': {'$substr': ['$headers.Date', 0, 25] }}}
                                          
                                          }
                      },
                        
                        'filename': 1,
                        'headers.Date': 1,
                 } 
}

project_stage2 = {
     '$project': { 'DateOfMessage': 1,
                    'filename': 1,
                    'headers.Date': 1,
                    'DayOfWeek': {
                        '$dayOfWeek': '$DateOfMessage'
                    }
                 } 
}

project_stage3 = {
     '$project': { 'DayOfWeek': 1,
                    'filename': 1,
                    'DayType': {
                        '$cond': { 'if': { '$in': [ "$DayOfWeek", [1, 7] ] }, "then": "weekend", 'else': 'weekday' }
                    }
                 } 
}

group_stage1 = {
    '$group': {
        '_id': '$DayType', 'count': {'$sum': 1}
    }
}

match_stage1 = {
    '$match': {
        '_id': 'weekend'
    }
}

project_stage4 = {
    '$project': {
        'percentage_weekend': { '$divide': ['$count', total_documents] }
    }
}




pipeline = [project_stage, project_stage2, project_stage3, group_stage1, match_stage1, project_stage4]

for doc in messages_collection.aggregate(pipeline):
    #pprint(doc['headers']['Date'][:-12])
    #pprint(datetime.strptime(doc['headers']['Date'][:-12], '%a, %d %b %Y %H:%M:%S'))
    
    pprint(doc)

{'_id': 'weekend', 'percentage_weekend': 0.0393}


In [92]:
limit_stage = {
    '$limit': 100
}

project_stage1 = {
     '$project': {
                     '_id': 0,
                     'Sent': '$headers.From',
                     'Received': ['$headers.To', '$headers.Cc'],
                     'To': '$headers.To',
                     'From': '$headers.From',
                     'Cc': '$headers.Cc'
         
                 } 
}

pipeline = [limit_stage, project_stage1]

for doc in messages_collection.aggregate(pipeline):
    pprint(doc)

{'From': 'michael.simmons@enron.com',
 'Received': ['eric.bass@enron.com', None],
 'Sent': 'michael.simmons@enron.com',
 'To': 'eric.bass@enron.com'}
{'From': 'michael.simmons@enron.com',
 'Received': ['eric.bass@enron.com', None],
 'Sent': 'michael.simmons@enron.com',
 'To': 'eric.bass@enron.com'}
{'Cc': 'jason.bass2@compaq.com',
 'From': 'daphneco64@bigplanet.com',
 'Received': ['eric.bass@enron.com', 'jason.bass2@compaq.com'],
 'Sent': 'daphneco64@bigplanet.com',
 'To': 'eric.bass@enron.com'}
{'From': 'bryant@cheatsheets.net',
 'Received': ['cheatsheets@egroups.com', None],
 'Sent': 'bryant@cheatsheets.net',
 'To': 'cheatsheets@egroups.com'}
{'From': 'daphneco64@bigplanet.com',
 'Received': ['lwbthemarine@bigplanet.com, jason.bass2@compaq.com, '
              'eric.bass@enron.com, \r\n'
              '\tchebert108@aol.com',
              None],
 'Sent': 'daphneco64@bigplanet.com',
 'To': 'lwbthemarine@bigplanet.com, jason.bass2@compaq.com, '
       'eric.bass@enron.com, \r\n'
      

 'Sent': 'michael.walters@enron.com',
 'To': 'david.baumbach@enron.com, julie.meyers@enron.com, '
       'william.kelly@enron.com, \r\n'
       '\tpatrick.ryder@enron.com, denver.plachy@enron.com, \r\n'
       '\teddie.janzen@enron.com, mark.mccoy@enron.com, jody.crook@enron.com'}
{'From': 'bushnews@georgewbush.com',
 'Received': ['ebass@enron.com', None],
 'Sent': 'bushnews@georgewbush.com',
 'To': 'ebass@enron.com'}
{'From': 'announce-list@sfx.com',
 'Received': ['ebass@enron.com', None],
 'Sent': 'announce-list@sfx.com',
 'To': 'ebass@enron.com'}
{'From': 'enron.announcement@enron.com',
 'Received': ['ect.frankfurt@enron.com, ect.helsinki@enron.com, '
              'ect.houston@enron.com, \r\n'
              '\tect.london@enron.com, ect.madrid@enron.com, '
              'ect.moscow@enron.com, \r\n'
              '\tect.oslo@enron.com, ect.singapore@enron.com, '
              'ect.stockholm@enron.com, \r\n'
              '\tect.zurich@enron.com',
              None],
 'Sent': 'enron.

In [136]:
from_data = []
to_data = []

from_emails = []
to_emails = []

limit_stage = {
    '$limit': 10
}

project_stage1 = {
     '$project': {
                     '_id': 0,
                     'Received_by': ['$headers.To', '$headers.Cc'],
                     'Sent_by': '$headers.From',
                 } 
}

unwind_stage1 = {
    '$unwind': {'path': '$Received_by', 'preserveNullAndEmptyArrays': True}
}

project_stage2 = {
    '$project': {
                     'Sent_by': 1,
                     'Received_by': {'$split': ['$Received_by', ', ']}
                     
                 } 
}

unwind_stage2 = {
    '$unwind': {'path': '$Received_by', 'preserveNullAndEmptyArrays': True}
}



project_stage3 = {
    '$project': {
        'Received_by': {'$trim': {'input': '$Received_by'}},
        'Sent_by': 1
        
    }
}

match_stage1 = {
    '$match': {
        'Received_by': {'$ne': None}
    }
}

# grouping users : To

group_stage1 = {
    '$group': { 
        '_id': '$Received_by',
        'count_to': {'$sum': 1}          
              }
}



# grouping users: From

group_stage2 = {
    '$group': { 
        '_id': '$Sent_by',
        'count_from': {'$sum': 1}          
              }
}



pipeline = [limit_stage, project_stage1, unwind_stage1, project_stage2, unwind_stage2, project_stage3, 
            match_stage1, group_stage1]

pipeline2 = [limit_stage, project_stage1, unwind_stage1, project_stage2, unwind_stage2, project_stage3, 
             match_stage1, group_stage2]


for doc in messages_collection.aggregate(pipeline):
    to_data.append(doc)

for doc in messages_collection.aggregate(pipeline2):
    from_data.append(doc)
        

for user_data in to_data:
    email = user_data['_id']
    to_emails.append(email)
    
    
for user_data in from_data:
    email = user_data['_id']
    from_emails.append(email)
    
print (from_emails)


['shanna.husser@enron.com', 'matthew.lenhart@enron.com', 'luis.mena@enron.com', 'customers@travelnow.com', 'daphneco64@bigplanet.com', 'michael.simmons@enron.com', 'bryant@cheatsheets.net', 'steve.venturatos@enron.com']


### 6)

Write a function with parameter limit. The function should return, for each email account, the number of emails sent, the number of emails received, and the total number of emails (sent and received). Use the following format: [{"contact": "michael.simmons@enron.com", "from": 42, "to": 92, "total": 134}] and the information contained in the To, From, and Cc headers. Sort the output in descending order by the total number of emails. Use the parameter limit to specify the number of results to be returned. If limit is null, the function should return all results; otherwise it should return the number of results specified by limit. limit cannot take negative values. [10 points]

In [16]:
def get_emails_between_contacts(collection, limit):
    """
    Shows the communications between contacts
    Sort by the descending order of total emails using the To, From, and Cc headers.
    :param `collection` A PyMongo collection object    
    :param `limit` An integer specifying the amount to display, or
    if null will display all outputs
    :return A list of objects of the form:
    [{
        'contact': <<Another email address>>
        'from': 
        'to': 
        'total': 
    },{.....}]
    """    
    # YOUR CODE HERE
    
    pass
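
The counting rules can be illustrated with a client-side reference implementation, which is handy for checking a server-side aggregation on a small sample but is not the fastest approach on the full collection. The names `split_addresses` and `tally_contacts` are hypothetical, and the sketch assumes that a `limit` of `None` or `0` means "return everything" and that an address listed in both To and Cc of one email counts twice.

```python
import re
from collections import Counter


def split_addresses(value):
    """Split a To/Cc header value into individual addresses."""
    # Long lists are wrapped with '\r\n\t', so split on commas and whitespace
    return [a for a in re.split(r'[,\s]+', value or '') if a]


def tally_contacts(headers_iter, limit=None):
    """Count sent/received emails per address from an iterable of header dicts."""
    sent, received = Counter(), Counter()
    for headers in headers_iter:
        sender = (headers.get('From') or '').strip()
        if sender:
            sent[sender] += 1
        for field in ('To', 'Cc'):
            for address in split_addresses(headers.get(field)):
                received[address] += 1
    rows = [{'contact': c, 'from': sent[c], 'to': received[c],
             'total': sent[c] + received[c]}
            for c in set(sent) | set(received)]
    # Sort by total (descending); ties are broken by address for stable output
    rows.sort(key=lambda r: (-r['total'], r['contact']))
    return rows[:limit] if limit else rows
```

It could be fed from the collection with a projection, for example `tally_contacts(d['headers'] for d in collection.find({}, {'headers.From': 1, 'headers.To': 1, 'headers.Cc': 1, '_id': 0}))`.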

### 7)
Write a function to find out the number of senders who were also direct receivers. Direct receiver means the email is sent to the person directly, not via cc or bcc. [4 points]

In [17]:
def get_from_to_people(collection):
    """
    :param collection A PyMongo collection object
    :return the NUMBER of the people who have sent emails and received emails as direct receivers.
    """    
    # YOUR CODE HERE

    pass
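
One way this stub might be completed is sketched below, assuming that addresses in `headers.To` are separated by commas and whitespace and that a plain string comparison of addresses is good enough (no case folding). Sender addresses are deduplicated on the server with `distinct`, and Python sets give constant-time membership checks for the intersection.

```python
import re


def get_from_to_people(collection):
    """Return how many distinct senders also appear in a To header."""
    # distinct() lets the server deduplicate the sender addresses
    senders = {s.strip() for s in collection.distinct('headers.From') if s}
    receivers = set()
    query = {'headers.To': {'$exists': True, '$ne': ''}}
    for doc in collection.find(query, {'headers.To': 1, '_id': 0}):
        # A To header may hold many addresses, wrapped with '\r\n\t'
        for address in re.split(r'[,\s]+', doc['headers']['To']):
            if address:
                receivers.add(address)
    return len(senders & receivers)
```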

In [18]:
direct_receivers = []

direct_receivers1 = []

limit_stage = {
    '$limit': 100
}

project_stage1 = {
    '$project': {
        'To': '$headers.To',
        '_id': 0
    }
}


for doc in messages_collection.find({ 'headers.To': {'$exists': True, '$ne': ''} }).limit(10):
    direct_receivers.append(doc['headers']['To'])

for receivers in direct_receivers:
    for receiver in receivers.split(', '):
        if receiver.strip(' \t\n\r') not in direct_receivers1:
            direct_receivers1.append(receiver.strip(' \t\n\r'))
        
direct_receivers1

['eric.bass@enron.com',
 'cheatsheets@egroups.com',
 'lwbthemarine@bigplanet.com',
 'jason.bass2@compaq.com',
 'chebert108@aol.com',
 'jody.crook@enron.com',
 'david.baumbach@enron.com',
 'kelly.lombardi@enron.com',
 'bryan.hull@enron.com',
 'patrick.ryder@enron.com',
 'denver.plachy@enron.com',
 'yvette.connevey@enron.com',
 'pat.clynes@enron.com',
 'daren.farmer@enron.com',
 'ebass@enron.com',
 'chad.landry@enron.com',
 'timothy.blanchard@enron.com',
 'phillip.love@enron.com',
 'kenneth.shulklapper@enron.com',
 'jay.reitmeyer@enron.com',
 'tori.kuykendall@enron.com',
 'lisa.gillette@enron.com',
 'christa.winfrey@enron.com']

In [19]:
direct_senders = []

direct_senders1 = []


for doc in messages_collection.find({ 'headers.From': {'$exists': True, '$ne': ''} }).limit(100):
    direct_senders.append(doc['headers']['From'])

direct_senders

print (len(direct_senders))

for sender in direct_senders:
    if sender not in direct_senders1:
        direct_senders1.append(sender)
        
        
direct_senders1

print (len(set(direct_receivers1).intersection(direct_senders1)))

100
6


### 8)
Write a function with parameters start_date and end_date, which returns the number of email messages that have been sent between those specified dates, including start_date and end_date [4 points] 

In [30]:
len('Tue, 14 Nov 2000 08:22:00')

25

In [48]:
def get_emails_between_dates(collection, start_date, end_date):
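    """
    :param collection A PyMongo collection object
    :param start_date Start of the range, e.g. 'Tue, 14 Nov 2000 08:22:00'
    :param end_date End of the range, in the same form
    Prints the number of emails dated between start_date and end_date (inclusive)
    """    
    # YOUR CODE HERE 

    # start date and end date to be in form: Tue, 14 Nov 2000 08:22:00 -0800 (PST)
    # or in form: Tue, 14 Nov 2000 08:22:00
    # Only the first 25 characters are parsed, so both forms are accepted
    date_format = '%a, %d %b %Y %H:%M:%S'
    start_date_as_date = datetime.strptime(start_date[:25].strip(), date_format)
    end_date_as_date = datetime.strptime(end_date[:25].strip(), date_format)

    # NOTE: this stage restricts the count to the first 100 documents, which
    # keeps exploration fast; leave it out of the pipeline for the full count
    limit_stage = {
        '$limit': 100
    }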



    project_stage = {
         '$project': { 'DateOfMessage': 
                          {'$dateFromString': { 'dateString' : 
                                               { '$rtrim': {'input': {'$substr': ['$headers.Date', 0, 25] }}}

                                              }
                          },

                            'filename': 1,
                            'headers.Date': 1,
                     } 
    }
    
    match_stage = {
        '$match': {
            '$and': [ {'DateOfMessage' : { '$gte': start_date_as_date } },  
                      {'DateOfMessage' : { '$lte': end_date_as_date } } ]
        }
    }
    
    
    group_stage1 = {
        '$group': {
            '_id': None, 'count': {'$sum': 1}
        }
    }
    
    project_stage2 = {
        '$project': {
            '_id': 0,
            'count_of_emails_sent': '$count',
            'start_date': start_date_as_date,
            'end_date': end_date_as_date
        }
    }
    

    pipeline = [limit_stage, project_stage, match_stage, group_stage1, project_stage2]
    
    for doc in collection.aggregate(pipeline):
        pprint(doc)
    
    
get_emails_between_dates(messages_collection, "Sat, 11 Nov 2000 16:30:00", "Mon, 13 Nov 2000 16:28:00")

{'count_of_emails_sent': 40,
 'end_date': datetime.datetime(2000, 11, 13, 16, 28),
 'start_date': datetime.datetime(2000, 11, 11, 16, 30)}
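
Since the task asks for the number to be returned, a variant that returns the count over the whole collection is sketched here. The function name `count_emails_between_dates` is hypothetical, timezone offsets in both the arguments and the stored headers are ignored, and the `$count` stage is assumed to be available on the server in use.

```python
from datetime import datetime


def count_emails_between_dates(collection, start_date, end_date):
    """Return the number of emails dated within [start_date, end_date]."""
    fmt = '%a, %d %b %Y %H:%M:%S'
    # Keep the first five fields so a trailing '-0800 (PST)' is ignored
    start = datetime.strptime(' '.join(start_date.split()[:5]), fmt)
    end = datetime.strptime(' '.join(end_date.split()[:5]), fmt)
    pipeline = [
        {'$project': {'_id': 0, 'DateOfMessage': {'$dateFromString': {
            'dateString': {'$rtrim': {'input': {
                '$substr': ['$headers.Date', 0, 25]}}}}}}},
        # Both bounds are inclusive, as the task requires
        {'$match': {'DateOfMessage': {'$gte': start, '$lte': end}}},
        # $count emits a single document of the form {'n': <count>}
        {'$count': 'n'},
    ]
    result = next(iter(collection.aggregate(pipeline)), None)
    # $count emits nothing when no document matched
    return result['n'] if result else 0
```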


## Task 2
This task will assess your ability to use the Hadoop Streaming API and MapReduce to process data. For each of the questions below, you are expected to write two Python scripts, one for the Map phase and one for the Reduce phase. You are also expected to provide the correct parameters to the `hadoop` command to run the MapReduce process. Write down your answers in the specified cells below.

To get started, you need to download and unzip the YouTube dataset (available at http://edshare.soton.ac.uk/19547/) onto the machine where you have Hadoop installed (this should be the virtual machine provided).

To help you, `%%writefile` has been added to the top of the cells, automatically writing them to "mapper.py" and "reducer.py" respectively when the cells are run.

### 1) 
Using Youtube01-Psy.csv, find the hourly interval in which most spam was sent. The output should be in the form of a single key-value pair, where the value is a datetime at the start of the hour with the highest number of spam comments. [9 points]

In [21]:
%%writefile mapper.py
#!/usr/bin/env python
#Answer for mapper.py


Overwriting mapper.py


In [22]:
%%writefile reducer.py
#!/usr/bin/env python
#Answer for reducer.py

Overwriting reducer.py


In [23]:
%%bash
#Hadoop command to run the map reduce.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files    \
-input    \
-mapper   \
-reducer  \
-output output

bash: line 3: hadoop: command not found


In [24]:
#Expected key-value output format:
#hour_with_most_spam	"2013-11-10T10:00:00"

#Additional key-value pairs are acceptable, as long as the hour_with_most_spam pair is correct.
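
The map and reduce logic for this question can be sketched as two plain functions, which makes it easy to test locally before wiring them to `sys.stdin` in mapper.py and reducer.py. The column layout (COMMENT_ID, AUTHOR, DATE, CONTENT, CLASS, with CLASS `1` meaning spam and DATE in ISO form) is an assumption about the CSV file, and the function names are hypothetical.

```python
import csv
from collections import Counter


def map_spam_hours(lines):
    """Mapper logic: yield (hour, 1) for every spam row."""
    for row in csv.reader(lines):
        # Skip the header row, short rows, and rows not labelled as spam
        if len(row) < 5 or row[0] == 'COMMENT_ID' or row[4].strip() != '1':
            continue
        date = row[2].strip()
        if len(date) >= 13:
            # '2013-11-07T06:20:48' -> '2013-11-07T06:00:00'
            yield date[:13] + ':00:00', 1


def reduce_spam_hours(pairs):
    """Reducer logic: sum counts per hour and return the busiest hour."""
    totals = Counter()
    for hour, count in pairs:
        totals[hour] += int(count)
    if not totals:
        return None
    # Break ties by the earliest hour so the result is deterministic
    best = min(totals, key=lambda h: (-totals[h], h))
    return 'hour_with_most_spam', best
```

In the streaming scripts, the mapper would print each pair as `key<TAB>value`, and the reducer would parse those lines back before summing. Comments that span several lines in the CSV would need extra handling, since Hadoop Streaming feeds the mapper one line at a time.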

### 2) 
Find all comments associated with a username (the AUTHOR field). Return a JSON array of all comments associated with that username. (This should use the data from all 5 data files: Psy, KatyPerry, LMFAO, Eminem, Shakira) [11 points]

In [25]:
%%writefile mapper.py
#!/usr/bin/env python
#Answer for mapper.py

Overwriting mapper.py


In [26]:
%%writefile reducer.py
#!/usr/bin/env python
#Answer for reducer.py

Overwriting reducer.py


In [27]:
%%bash
#Hadoop command to run the map reduce.

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-files    \
-input    \
-mapper   \
-reducer  \
-output output

bash: line 3: hadoop: command not found


In [28]:
#Expected key-value output format:
#John Smith	["Comment 1", "Comment 2", "Comment 3", "etc."]
#Jane Doe	["Comment 1", "Comment 2", "Comment 3", "etc."]
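
A rough sketch of the grouping logic for this question is given below, again as plain functions that can be tested locally. It assumes the same column layout in all five files (COMMENT_ID, AUTHOR, DATE, CONTENT, CLASS) and that author names contain no tab characters; the function names are hypothetical.

```python
import csv
import json
from itertools import groupby


def map_author_comments(lines):
    """Mapper logic: yield (author, comment) for every data row."""
    for row in csv.reader(lines):
        if len(row) < 5 or row[0] == 'COMMENT_ID':
            continue  # header or malformed row
        yield row[1], row[3]


def reduce_author_comments(pairs):
    """Reducer logic: yield (author, JSON array of comments)."""
    # Hadoop sorts by key between the phases; sorting here mimics that step
    ordered = sorted(pairs, key=lambda p: p[0])
    for author, group in groupby(ordered, key=lambda p: p[0]):
        yield author, json.dumps([comment for _, comment in group])
```

Serialising the list with `json.dumps` takes care of quoting and escaping inside the comments, so each output value is a valid JSON array.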