## EXPLORING HACKER NEWS POSTS
### Programming objectives:
- Work with String data
- Observe Object-Oriented programming
- Understand date and time data
### Data objectives:
In this project we work with a dataset of posts from the Hacker News website.
We first look at posts whose titles begin with "Ask HN" or "Show HN". In an "Ask HN" post a user submits a specific question, while in a "Show HN" post a user shows something interesting (products, projects, etc.).
Our task is to compare these two kinds of posts and determine:
    - Do "Show HN" or "Ask HN" posts receive more comments on average?
    - Do posts created at a certain time receive more comments on average?

#### Read the data and remove the row containing column headers

In [1]:
# Read the CSV file into a list of lists and assign it to "hn"
from csv import reader

def open_file(file):
    # The with block closes the file once all rows have been read
    with open(file) as o_file:
        l_file = list(reader(o_file))
    return l_file

hn = open_file("hacker_news.csv")
# Separate the header row from the data rows before analyzing
hn_header = hn[0]
print(hn_header)
hn = hn[1:]
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
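The header/data split above can also be sketched on an in-memory sample, which makes it easy to check without the CSV file on disk. This is only an illustration: `read_rows` is a hypothetical helper name, and the single data row is copied from the output above.

```python
import csv
import io

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    all_rows = list(csv.reader(file_obj))
    return all_rows[0], all_rows[1:]

# In-memory stand-in for hacker_news.csv (same columns, one row from the output above)
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,"
    "386,52,ne0phyte,8/4/2016 11:52\n"
)

header, rows = read_rows(sample)
print(header)
print(len(rows), "data row(s)")
```

With the real file, the same helper could be called inside `with open("hacker_news.csv", newline="") as f:`, which also closes the file when reading finishes.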


#### Extracting Ask HN and Show HN posts

In [2]:
# Split posts into three lists by title prefix: "Ask HN", "Show HN", and everything else
title_index = hn_header.index("title")  # look up the column position once
ask_posts, show_posts, other_posts = [], [], []
for post in hn:
    title = post[title_index].lower()
    if title.startswith("show hn"):
        show_posts.append(post)
    elif title.startswith("ask hn"):
        ask_posts.append(post)
    else:
        other_posts.append(post)

print("Number of ask posts: ", len(ask_posts))
print("Number of show posts: ", len(show_posts))
print("Number of other posts: ", len(other_posts))

Number of ask posts:  1744
Number of show posts:  1162
Number of other posts:  17194
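The prefix test used above can be isolated in a small function, which makes the case-insensitive matching easy to verify. This is a sketch, not part of the original notebook: `classify_title` is a hypothetical name, and the first two sample titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

sample_titles = [
    "Ask HN: How do you learn a new language?",  # made-up title
    "SHOW HN: My weekend project",               # made-up title, upper-case prefix
    "Interactive Dynamic Video",                 # title from the dataset output above
]

labels = [classify_title(t) for t in sample_titles]
print(labels)
```

Lower-casing the title before calling `startswith` is what lets "SHOW HN" and "Show HN" land in the same group.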


#### Calculate the average number of comments for Ask HN and Show HN posts
This determines whether ask posts or show posts receive more comments on average.

In [3]:
# Average ask post comments
total_ask_comments = 0
for post in ask_posts:
    comments_num = int(post[hn_header.index("num_comments")])
    total_ask_comments += comments_num

print("Average comments in each ask post: ", total_ask_comments / len(ask_posts))

total_show_comments = 0
for post in show_posts:
    comments_num = int(post[hn_header.index("num_comments")])
    total_show_comments += comments_num
print("Average comments in each show post: ", total_show_comments / len(show_posts))
print("Conclusion: ask posts receive more comments on average.")

Average comments in each ask post:  14.038417431192661
Average comments in each show post:  10.31669535283993
Conclusion: ask posts receive more comments on average.
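Because the two loops above differ only in the list they read, the calculation could be factored into one function. The sketch below assumes the same column layout as the dataset (`num_comments` at index 4); `average_comments` is a hypothetical helper and the sample rows are invented.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Invented rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_posts = [
    ["1", "Ask HN: first question", "", "3", "10", "user_a", "8/4/2016 11:52"],
    ["2", "Ask HN: second question", "", "5", "4", "user_b", "8/4/2016 12:10"],
]

print(average_comments(sample_posts))
```

The empty-list guard avoids a `ZeroDivisionError` if a group happens to contain no posts.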


#### Finding the Number of Ask Posts and Comments by Hour Created
Focus only on ask posts and check whether posts created at certain hours receive more comments.

In [21]:
import datetime as dt
# Collect the creation time (created_at) and number of comments (num_comments) of each ask post
result_list = []
for post in ask_posts:
    result_list.append([post[hn_header.index("created_at")], int(post[hn_header.index("num_comments")])])

# Tally comments and posts for each hour of the day
comment_by_hour = {} # total number of comments per hour
count_by_hour = {} # number of posts per hour
date_format = "%m/%d/%Y %H:%M"

for post in result_list:
    date, comment = post[0], post[1] # declare date and number of comments
    time = dt.datetime.strptime(date, date_format).strftime("%H") # extract hour from date
    if time in count_by_hour:
        comment_by_hour[time] += comment
        count_by_hour[time] += 1
    else:
        comment_by_hour[time] = comment
        count_by_hour[time] = 1

comment_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}
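The dictionary keys above are zero-padded strings because the hour is produced with `strftime("%H")`. A short sketch of that parsing step, using one `created_at` value from the dataset output shown earlier:

```python
import datetime as dt

date_format = "%m/%d/%Y %H:%M"
created_at = "9/30/2015 4:12"  # created_at value from the dataset output shown earlier

parsed = dt.datetime.strptime(created_at, date_format)
hour_label = parsed.strftime("%H")  # zero-padded string, as used for the dictionary keys
hour_number = parsed.hour           # plain integer alternative

print(repr(hour_label), hour_number)
```

Using `parsed.hour` as the key would work equally well; the string form is kept here because it matches the keys in the output above.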

#### Average number of comments for Ask HN posts by hour

In [30]:
# Calculate the average number of comments per ask post for each hour
avg_by_hour = []
for hour in comment_by_hour:
    # total comments in that hour / number of posts in that hour
    avg_by_hour.append([hour, comment_by_hour[hour] / count_by_hour[hour]])

avg_by_hour[:5]

# Swap the columns so the average comes first, then sort in descending order
avg_by_hour_1 = [[row[1], row[0]] for row in avg_by_hour]
avg_by_hour_1 = sorted(avg_by_hour_1, reverse=True)
avg_by_hour_1
print("Top 5 hours for Ask HN post comments")
for avg, time in avg_by_hour_1[:5]:
    hour = dt.datetime.strptime(time, "%H").strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, avg))

Top 5 hours for Ask HN post comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
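Swapping the columns before sorting works, but `sorted` can also order the original `[hour, average]` pairs directly through its `key` argument. The sketch below uses the five rounded averages printed above, listed in a shuffled order; `avg_sample` and `top_three` are illustrative names.

```python
# [hour, average] pairs; averages are the rounded values from the output above
avg_sample = [
    ["16", 16.80],
    ["15", 38.59],
    ["21", 16.01],
    ["02", 23.81],
    ["20", 21.52],
]

# Sort by the second element (the average), highest first, and keep the top three
top_three = sorted(avg_sample, key=lambda row: row[1], reverse=True)[:3]

for hour, avg in top_three:
    print("{}:00 -> {:.2f} average comments per post".format(hour, avg))
```

Sorting with `key` leaves the pairs in their original column order, so no second list is needed.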
