Austin Wood – Challenge Week 12 #26

Open: wants to merge 2 commits into master
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
reddit.json
79 changes: 47 additions & 32 deletions README.md
@@ -4,19 +4,32 @@

## Challenge 1

![challenge 1](img/reddit_ch01.png)

## Challenge 2

It's interesting that most of the posts here have no more than one upvote, if
any at all. Out of 12,817,418 total instances, only 129 have more than one
upvote; just 4 posts have 4 upvotes, and a single post has 6 (the highest).

![challenge 2](img/reddit_ch02a.png)

The five posts with 4+ upvotes came from only two subreddits, nfl and CFB,
which are the 8th and 12th most popular subreddits, respectively.

![challenge 2](img/reddit_ch02b.png)
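
A minimal sketch of how these upvote counts could be tallied, assuming each line of reddit.json is one JSON comment and that the upvote field is named `ups` (the field name is an assumption):

```python
import json
from collections import Counter

# Count how many comments sit at each upvote level.
# Assumes reddit.json holds one JSON object per line with an 'ups' field.
upvote_counts = Counter()

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        upvote_counts[comment.get('ups', 0)] += 1

# Print the distribution from lowest to highest upvote count.
for ups, count in sorted(upvote_counts.items()):
    print("%d upvotes: %d comments" % (ups, count))
```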

## Challenge 3

It would be fun to see how many authors stick to clusters of related
subreddits and how many post all over the board.
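
A rough sketch of how this could be checked, reusing the `author` and `subreddit` fields that the example script already reads:

```python
import json
from collections import defaultdict

# Map each author to the set of subreddits they comment in.
# Uses the same 'author' and 'subreddit' fields as the example script.
author_subreddits = defaultdict(set)

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        author_subreddits[comment['author']].add(comment['subreddit'])

# How many authors stay in a single subreddit versus roam around?
single = sum(1 for subs in author_subreddits.values() if len(subs) == 1)
print("%d of %d authors comment in only one subreddit"
      % (single, len(author_subreddits)))
```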

## Challenge 4

Even with a smaller dataset like this, I think you could still identify the
most popular subreddits, the most active users, and how those users interact
across different posts.
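
As a sketch of how those popularity and activity counts could be pulled from the same file, assuming the fields used above:

```python
import json
from collections import Counter

# Rank subreddits and authors by comment volume in this 15-day sample.
subreddit_activity = Counter()
author_activity = Counter()

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        subreddit_activity[comment['subreddit']] += 1
        author_activity[comment['author']] += 1

print("Most commented subreddits: %s" % subreddit_activity.most_common(10))
print("Most active authors: %s" % author_activity.most_common(10))
```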

## Challenge 5

@@ -25,59 +38,61 @@

## Challenge 6

For one, it would be much harder to find subreddits with overlapping
commenters, since anything under 10 upvotes would not be taken into account.
On top of that, less popular subreddits would be overshadowed, because many of
their comments, however relevant, would never reach 10 upvotes.
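
A small sketch of what that cutoff does to the data, again assuming an `ups` field per comment:

```python
import json

# Apply the 10-upvote cutoff described above and see how much survives.
# The 'ups' field name is an assumption.
kept = 0
dropped = 0

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        if comment.get('ups', 0) >= 10:
            kept += 1
        else:
            dropped += 1

print("kept %d comments, dropped %d with fewer than 10 upvotes" % (kept, dropped))
```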

## Challenge 7

The conclusions would definitely change. The answers would favor more popular
subreddits, and subreddits might end up being compared solely on which ones
collect more upvotes for a topic.

## Challenge 8


Even though the file is large, the dataset covers only a 15-day period, which
is not a large sample considering how active a site like Reddit has been for
years. This may produce a biased list of subreddits, though it is unlikely to
stray too far from the overall data.

## Challenge 9

Simply comparing the number of comments also does not prove correlation, since
"trolling" comments could have exploded on one post while the other is
completely legitimate.

## Challenge 10

I noticed that on the reddit.json download page, someone posted a list of the
top 50 most frequently used words. It would be fun to try to build a natural
language processor to weed out troll comments (i.e., lots of cussing).
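
A toy sketch of that idea; the word list and the `body` field name are assumptions, and a real filter would need to be far more robust:

```python
import json

# Flag comments whose text contains words from a small blocklist.
# The blocklist below is a placeholder, and 'body' is assumed to be
# the comment-text field.
BLOCKLIST = {"spam", "stupid", "trash"}

def looks_like_trolling(text):
    words = set(text.lower().split())
    return bool(words & BLOCKLIST)

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        if looks_like_trolling(comment.get('body', '')):
            print("possible troll comment in r/%s" % comment['subreddit'])
```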

# Yelp and Weather

## Challenge 1

![challenge 1](img/weather_ch01.png)

## Challenge 2

> db.normals.aggregate([{$match: {"DATE": {$regex: /20100425.*/}, "STATION_NAME": {$regex: /LAS VEGAS.*/}}}, {$group: {_id: "$STATION_NAME", wind: {$avg: "$HLY-WIND-AVGSPD"}}}])

Answer: { "_id" : "LAS VEGAS MCCARRAN INTERNATIONAL AIRPORT NV US", "wind" : 110.08333333333333 }

## Challenge 3

> db.businesses.aggregate([{ $match: { city: "Madison" }}, { $group: { _id: null, total_reviews: { $sum: "$review_count" }}}])

Answer: { "_id" : null, "total_reviews" : 34410 }

## Challenge 4

> db.businesses.aggregate([{ $match: { city: "Las Vegas" }}, { $group: { _id: null, total_reviews: { $sum: "$review_count" }}}])

Answer: { "_id" : null, "total_reviews" : 577550 }

## Challenge 5

> db.businesses.aggregate([{ $match: { city: "Las Vegas" }}, { $group: { _id: null, total_reviews: { $sum: "$review_count" }}}])

Answer: { "_id" : null, "total_reviews" : 200089 }
2 changes: 1 addition & 1 deletion examples/reddit/subreddits_by_commenters.py
@@ -24,7 +24,7 @@
    data = json.loads(item)
    if data['subreddit'] in top_50:
        subreddits[data['subreddit']].add(data['author'])

subreddits_list = subreddits.items()
similarity = Counter()
print "Calculating similarity"
Binary file added img/reddit_ch01.png
Binary file added img/reddit_ch02a.png
Binary file added img/reddit_ch02b.png
Binary file added img/weather_ch01.png