Each line of "reviews_electronics.16.json" is a JSON object with the following fields:
- reviewerID - ID of the reviewer, e.g. A2SUAM1J3GNN3B
- asin - ID of the product, e.g. 0000013714
- reviewerName - name of the reviewer
- helpful - helpfulness votes as [helpful votes, total votes], e.g. [2, 3] (i.e. 2/3)
- reviewText - text of the review
- overall - rating of the product, e.g. 5.0
- summary - summary of the review
- unixReviewTime - time of the review as a Unix timestamp, e.g. 1365811200
- reviewTime - time of the review as a raw string, e.g. 04 13, 2013
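
To make the field layout concrete, here is a minimal sketch that parses one record and pulls out the fields used later in this notebook. The sample line and the helper name `parse_review` are assumptions made up for illustration; they are not taken from the data file.

```python
import json
from datetime import datetime, timezone

# Hypothetical one-line record in the same shape as the rows of the file
# (the values are made up for illustration).
sample_line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"helpful": [2, 3], "overall": 4.0, "unixReviewTime": 1365811200}'
)


def parse_review(line):
    """Parse one JSON line and extract the fields used later on."""
    doc = json.loads(line)
    helpful_votes, total_votes = doc["helpful"]  # [helpful votes, total votes]
    return {
        "asin": doc["asin"],
        "overall": doc["overall"],
        "helpful_votes": helpful_votes,
        "total_votes": total_votes,
        # unixReviewTime is in seconds since the epoch
        "reviewed_at": datetime.fromtimestamp(doc["unixReviewTime"], tz=timezone.utc),
    }


review = parse_review(sample_line)
print(review["asin"], review["overall"], review["total_votes"], review["reviewed_at"].date())
```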

In [14]:
from pymongo import MongoClient
from bson.code import Code
from bson.son import SON

import json

### 1. Create a MongoDB DB called "amazon"

In [15]:
# Read the first 3 lines of the JSON file
with open("reviews_electronics.16.json", "r", encoding="utf-8") as file:
    for line_number, line in enumerate(file, start=1):
        print(line)
        if line_number >= 3:
            break

{"reviewerID": "AKM1MP6P0OYPR", "asin": "0132793040", "reviewerName": "Vicki Gibson \"momo4\"", "helpful": [1, 1], "reviewText": "Corey Barker does a great job of explaining Blend Modes in this DVD. All of the Kelby training videos are great but pricey to buy individually. If you really want bang for your buck just subscribe to Kelby Training online.", "overall": 5.0, "summary": "Very thorough", "unixReviewTime": 1365811200, "reviewTime": "04 13, 2013"}

{"reviewerID": "A2X8VX4DPMQFQQ", "asin": "B00E4KP4W6", "reviewerName": "lily68", "helpful": [1, 1], "reviewText": "I can't believe I waited to long to switch to a glass screen protector.  I love this.  It feels and looks like there is no protector on.  It does show fingerprints, which I think is inevitable unless you use a matte finish screen protector, but they wipe right away. I would definitely recommend this! Easier to apply than the films too!", "overall": 5.0, "summary": "LOVE this screen protector!!", "unixReviewTime": 139345920

In [16]:
# Connect to the server
client = MongoClient('localhost', 27017)

# Select the "amazon" database (MongoDB creates it lazily on the first write)
db = client["amazon"]

# Access the "reviews" collection
collection_reviews = db["reviews"]

### 2. Read "reviews_electronics.16.json" and upload each review as a separate document to the collection "reviews" in the DB "amazon".

In [17]:
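# Insert each review (one JSON object per line) into the "reviews" collection.
# Note: re-running this cell inserts every review again as a new document.
with open("reviews_electronics.16.json", "r", encoding="utf-8") as file:
    for line in file:
        doc = json.loads(line)
        collection_reviews.insert_one(doc)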
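
Inserting one document per round trip can be slow for a large file. As a possible alternative, the sketch below groups parsed lines into batches that could be passed to pymongo's `insert_many`. The helper name `iter_batches`, the batch size, and the tiny in-memory stand-in file are assumptions for illustration only.

```python
import io
import json


def iter_batches(lines, batch_size=1000):
    """Yield lists of parsed JSON documents, at most batch_size per list."""
    batch = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # last, possibly smaller, batch


# Stand-in for the real file: three tiny made-up records and one blank line.
fake_file = io.StringIO('{"asin": "A"}\n{"asin": "B"}\n\n{"asin": "C"}\n')
batches = list(iter_batches(fake_file, batch_size=2))
print([len(b) for b in batches])  # [2, 1]
```

With the real file, each batch would be uploaded with `collection_reviews.insert_many(batch)` inside a `for batch in iter_batches(file):` loop.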

In [18]:
# Check that the reviews were inserted by printing the first 3 documents
for doc in collection_reviews.find({}).limit(3):
    print(doc)

{'_id': ObjectId('641b70b50a7a5cb87e7ba395'), 'reviewerID': 'AKM1MP6P0OYPR', 'asin': '0132793040', 'reviewerName': 'Vicki Gibson "momo4"', 'helpful': [1, 1], 'reviewText': 'Corey Barker does a great job of explaining Blend Modes in this DVD. All of the Kelby training videos are great but pricey to buy individually. If you really want bang for your buck just subscribe to Kelby Training online.', 'overall': 5.0, 'summary': 'Very thorough', 'unixReviewTime': 1365811200, 'reviewTime': '04 13, 2013'}
{'_id': ObjectId('641b70b50a7a5cb87e7ba396'), 'reviewerID': 'A2X8VX4DPMQFQQ', 'asin': 'B00E4KP4W6', 'reviewerName': 'lily68', 'helpful': [1, 1], 'reviewText': "I can't believe I waited to long to switch to a glass screen protector.  I love this.  It feels and looks like there is no protector on.  It does show fingerprints, which I think is inevitable unless you use a matte finish screen protector, but they wipe right away. I would definitely recommend this! Easier to apply than the films too!",

### 3. Use MongoDB's map reduce function to build a new collection "avg_scores" that averages review scores by product ("asin")

In [19]:
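# Create the new collection "avg_scores" that averages review scores by product ("asin").
# MongoDB may call reduce more than once per key, so reduce only adds up partial
# sums and counts; the division happens once, in finalize.
mapf = Code('''function() { emit(this.asin, {sum: this.overall, count: 1}); }''')
reducef = Code('''
function(key, values) {
    var total = {sum: 0, count: 0};
    values.forEach(function(v) {
        total.sum += v.sum;
        total.count += v.count;
    });
    return total;
}
''')
finalizef = Code('''function(key, total) { return total.sum / total.count; }''')

cmd = {
    'mapreduce': "reviews",
    'map': mapf,
    'reduce': reducef,
    'finalize': finalizef,
    'out': "avg_scores"
}

result = db.command(SON(cmd))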

In [20]:
# Print the first 10 entries of "avg_scores" to screen.
collection_avg_scores = db['avg_scores']

for doc in collection_avg_scores.find({}).limit(10):
    print(doc)

{'_id': 'B00F5GWYDK', 'value': 3.5}
{'_id': 'B00H5BW43S', 'value': 3.0}
{'_id': 'B00HJAEO1U', 'value': 3.5454545454545454}
{'_id': 'B00GGG9VAM', 'value': 5.0}
{'_id': 'B00EB70OV8', 'value': 4.12}
{'_id': 'B00F4MU0KO', 'value': 3.9166666666666665}
{'_id': 'B00ISFROFS', 'value': 5.0}
{'_id': 'B00FCZZDG4', 'value': 4.0}
{'_id': 'B00J8916C0', 'value': 5.0}
{'_id': 'B00GPG6YBW', 'value': 4.0}
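
MongoDB can call a reduce function more than once for the same key, feeding an earlier reduce result back in as one of the values. A reduce step that returns a plain average is therefore only safe when each key is reduced exactly once. The sketch below, using made-up ratings and hypothetical helper names, shows a plain-Python reference computation for cross-checking "avg_scores" and illustrates how an average of partial averages can differ from the true average.

```python
from collections import defaultdict

# Made-up (asin, overall) pairs standing in for the "reviews" collection.
reviews = [("P1", 5.0), ("P1", 4.0), ("P1", 1.0), ("P2", 3.0)]


def average_by_product(pairs):
    """Accumulate a running sum and count per product, then divide once."""
    totals = defaultdict(lambda: [0.0, 0])
    for asin, overall in pairs:
        totals[asin][0] += overall
        totals[asin][1] += 1
    return {asin: total / count for asin, (total, count) in totals.items()}


def mean(values):
    return sum(values) / len(values)


averages = average_by_product(reviews)
print(averages)

# Simulated re-reduce for P1: [5.0, 4.0] is averaged first, and that partial
# result is then averaged again together with the remaining value 1.0.
avg_of_avgs = mean([mean([5.0, 4.0]), 1.0])
print(avg_of_avgs)  # 2.75, not the true average of 10/3
```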


### 4. Use MongoDB's map reduce function to build a new collection "weighted_avg_scores" that averages review scores by product ("asin"), weighted by the number of votes + 1 (the second number in "helpful", plus 1).

In [None]:
# Build a new collection "weighted_avg_scores" that averages review scores
# by product ("asin"), weighted by the number of votes + 1 (helpful[1] + 1).
# map and reduce share the same {sum, weight} shape, so a key with a single
# review and a key that is reduced several times are handled the same way;
# finalize turns the totals into the weighted average.
mapf_w = Code('''
function() {
    var w = this.helpful[1] + 1;
    emit(this.asin, {sum: this.overall * w, weight: w});
}
''')
reducef_w = Code('''
function(key, values) {
    var total = {sum: 0, weight: 0};
    values.forEach(function(v) {
        total.sum += v.sum;
        total.weight += v.weight;
    });
    return total;
}
''')
finalizef_w = Code('''function(key, total) { return total.sum / total.weight; }''')

cmd_w = {
    'mapreduce': "reviews",
    'map': mapf_w,
    'reduce': reducef_w,
    'finalize': finalizef_w,
    'out': "weighted_avg_scores"
}

result_w = db.command(SON(cmd_w))

In [22]:
# Print the first 10 entries of "weighted_avg_scores" to screen.
collection_weighted_avg_scores = db['weighted_avg_scores']

for doc in collection_weighted_avg_scores.find({}).limit(10):
    print(doc)

{'_id': 'B00FYPHTGY', 'value': 4.0}
{'_id': 'B00HWSXYRY', 'value': 5.0}
{'_id': 'B00EV6WLNI', 'value': 4.826086956521739}
{'_id': 'B00E4UZP3Y', 'value': 4.433333333333334}
{'_id': 'B00F1CLD4O', 'value': 2.0}
{'_id': 'B00GNSYD0G', 'value': 2.0}
{'_id': 'B00FN9HWZE', 'value': 5.0}
{'_id': 'B00HIW7HE0', 'value': 4.0}
{'_id': 'B00GP3TZ3O', 'value': 2.888888888888889}
{'_id': 'B00II5UKBI', 'value': 5.0}
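
To sanity-check values like the ones above, the weighted average can also be computed in plain Python. This is only a sketch on made-up reviews; the helper name `weighted_average_by_product` is hypothetical. Each review contributes its rating with weight `helpful[1] + 1`, and the weighted sum is divided by the total weight once per product.

```python
from collections import defaultdict

# Made-up reviews with the same "asin", "overall" and "helpful" fields as the data.
reviews = [
    {"asin": "P1", "overall": 5.0, "helpful": [3, 3]},  # weight 3 + 1 = 4
    {"asin": "P1", "overall": 1.0, "helpful": [0, 0]},  # weight 0 + 1 = 1
    {"asin": "P2", "overall": 4.0, "helpful": [1, 2]},  # weight 2 + 1 = 3
]


def weighted_average_by_product(docs):
    """Weighted mean of "overall" per product, with weight = total votes + 1."""
    totals = defaultdict(lambda: {"sum": 0.0, "weight": 0})
    for doc in docs:
        weight = doc["helpful"][1] + 1
        totals[doc["asin"]]["sum"] += doc["overall"] * weight
        totals[doc["asin"]]["weight"] += weight
    return {asin: t["sum"] / t["weight"] for asin, t in totals.items()}


weighted = weighted_average_by_product(reviews)
print(weighted)
```

For P1 this gives (5.0 * 4 + 1.0 * 1) / 5 = 4.2, so the review with more votes dominates the score.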
