Austin Wood – Challenge Week 12 #26

Open: wants to merge 2 commits into master
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
reddit.json
79 changes: 47 additions & 32 deletions README.md
@@ -4,19 +4,32 @@

## Challenge 1

![challenge 1](img/reddit_ch01.png)

## Challenge 2

It's interesting that most of the posts here have no more than one upvote, if
any at all. Out of 12,817,418 total instances, only 129 have more than one
upvote; just 4 posts have 4 upvotes, and a single post has 6 (the highest).

![challenge 2](img/reddit_ch02a.png)

The five posts with 4+ upvotes came from only two subreddits, nfl and CFB,
which are the 8th and 12th most popular subreddits, respectively.

![challenge 2](img/reddit_ch02b.png)
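
A minimal sketch of how these upvote counts could be tallied, assuming each line of reddit.json is one JSON comment and that the upvote field is named `ups` (the field name is an assumption):

```python
import json
from collections import Counter

# Count how many comments sit at each upvote level.
# Assumes reddit.json holds one JSON object per line with an 'ups' field.
upvote_counts = Counter()

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        upvote_counts[comment.get('ups', 0)] += 1

# Print the distribution from lowest to highest upvote count.
for ups, count in sorted(upvote_counts.items()):
    print("%d upvotes: %d comments" % (ups, count))
```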

## Challenge 3

It would be fun to see how many authors stick to clusters of related
subreddits and how many post all over the board.
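
A rough sketch of how this could be checked, reusing the `author` and `subreddit` fields that the example script already reads:

```python
import json
from collections import defaultdict

# Map each author to the set of subreddits they comment in.
# Uses the same 'author' and 'subreddit' fields as the example script.
author_subreddits = defaultdict(set)

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        author_subreddits[comment['author']].add(comment['subreddit'])

# How many authors stay in a single subreddit versus roam around?
single = sum(1 for subs in author_subreddits.values() if len(subs) == 1)
print("%d of %d authors comment in only one subreddit"
      % (single, len(author_subreddits)))
```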

## Challenge 4

Even with a smaller dataset like this, I think you could still identify the
most popular subreddits, the most active users, and how those users interact
across different posts.
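
As a sketch of how those popularity and activity counts could be pulled from the same file, assuming the fields used above:

```python
import json
from collections import Counter

# Rank subreddits and authors by comment volume in this 15-day sample.
subreddit_activity = Counter()
author_activity = Counter()

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        subreddit_activity[comment['subreddit']] += 1
        author_activity[comment['author']] += 1

print("Most commented subreddits: %s" % subreddit_activity.most_common(10))
print("Most active authors: %s" % author_activity.most_common(10))
```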

## Challenge 5

@@ -25,59 +38,61 @@

## Challenge 6

For one, it would be much harder to find subreddits with overlapping
commenters, since anything under 10 upvotes would not be taken into account.
On top of that, less popular subreddits would be overshadowed, because many of
their comments, however relevant, would never reach 10 upvotes.
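
A small sketch of what that cutoff does to the data, again assuming an `ups` field per comment:

```python
import json

# Apply the 10-upvote cutoff described above and see how much survives.
# The 'ups' field name is an assumption.
kept = 0
dropped = 0

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        if comment.get('ups', 0) >= 10:
            kept += 1
        else:
            dropped += 1

print("kept %d comments, dropped %d with fewer than 10 upvotes" % (kept, dropped))
```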

## Challenge 7

The conclusions would definitely change. The answers would favor more popular
subreddits, and subreddits might end up being compared solely on which ones
collect more upvotes for a topic.

## Challenge 8


Even though the file is large, the dataset covers only a 15-day period, which
is not a large sample considering how active a site like Reddit has been for
years. This may produce a biased list of subreddits, though it is unlikely to
stray too far from the overall data.

## Challenge 9

Simply comparing the number of comments also does not prove correlation, since
"trolling" comments could have exploded on one post while the other is
completely legitimate.

## Challenge 10

I noticed that on the reddit.json download page, someone posted a list of the
top 50 most frequently used words. It would be fun to try to build a natural
language processor to weed out troll comments (i.e., lots of cussing).
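
A toy sketch of that idea; the word list and the `body` field name are assumptions, and a real filter would need to be far more robust:

```python
import json

# Flag comments whose text contains words from a small blocklist.
# The blocklist below is a placeholder, and 'body' is assumed to be
# the comment-text field.
BLOCKLIST = {"spam", "stupid", "trash"}

def looks_like_trolling(text):
    words = set(text.lower().split())
    return bool(words & BLOCKLIST)

with open('reddit.json') as f:
    for line in f:
        comment = json.loads(line)
        if looks_like_trolling(comment.get('body', '')):
            print("possible troll comment in r/%s" % comment['subreddit'])
```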

# Yelp and Weather

## Challenge 1

![challenge 1](img/weather_ch01.png)

## Challenge 2

> db.normals.aggregate([{$match: {"DATE": {$regex: /20100425.*/}, "STATION_NAME": {$regex: /LAS VEGAS.*/}}}, {$group: {_id: "$STATION_NAME", wind: {$avg: "$HLY-WIND-AVGSPD"}}}])

Answer: { "_id" : "LAS VEGAS MCCARRAN INTERNATIONAL AIRPORT NV US", "wind" : 110.08333333333333 }

## Challenge 3

> db.businesses.aggregate([{ $match: { city: "Madison" }}, { $group: { _id: null, total_reviews: { $sum: "$review_count" }}}])

Answer: { "_id" : null, "total_reviews" : 34410 }

## Challenge 4

> db.businesses.aggregate([{ $match: { city: "Las Vegas" }}, { $group: { _id: null, total_reviews: { $sum: "$review_count" }}}])

Answer: { "_id" : null, "total_reviews" : 577550 }

## Challenge 5

> db.businesses.aggregate([{ $match: { city: "Las Vegas" }}, { $group: { _id: null, total_reviews: { $sum: "$review_count" }}}])

Answer: { "_id" : null, "total_reviews" : 200089 }
2 changes: 1 addition & 1 deletion examples/reddit/subreddits_by_commenters.py
@@ -24,7 +24,7 @@
    data = json.loads(item)
    if data['subreddit'] in top_50:
        subreddits[data['subreddit']].add(data['author'])

subreddits_list = subreddits.items()
similarity = Counter()
print "Calculating similarity"
Binary file added img/reddit_ch01.png
Binary file added img/reddit_ch02a.png
Binary file added img/reddit_ch02b.png
Binary file added img/weather_ch01.png