
## Hacker News post analysis
Hacker News is a site where users submit technology-related stories, which other users can upvote and comment on (similar to Reddit). The dataset we are working with was obtained from [Kaggle](https://www.kaggle.com/hacker-news/hacker-news-posts) and modified so that it only includes posts that have at least one comment. We are particularly interested in two types of posts:
- Posts that start with **Ask HN**, which ask the Hacker News community a question related to projects and technology.
- Posts that start with **Show HN**, which showcase a user's project to the community.

Our main task is to determine which of these two post types receives more comments on average, and whether the time of day a post is created affects how many comments it receives.


In [0]:
from csv import reader

# We are not using pandas or NumPy here, so reading and parsing takes a few more lines.
# The with block closes the file once the rows have been read.
with open("hacker_news.csv", newline="") as file:
    hn = list(reader(file))
header = hn[0]
hn = hn[1:]
print(header, '\n')
print(hn[0], '\n')
print(hn[1], '\n')

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 



The dataset contains 7 columns. The first one, `id`, is only an identifier, so the remaining 6 columns are the ones we are interested in.

In [0]:
hn[:3] # header is removed

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20']]

In [0]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print("The number of ask posts:", len(ask_posts))
print("The number of show posts:", len(show_posts))
print("The number of other posts:", len(other_posts))

The number of ask posts: 1744
The number of show posts: 1162
The number of other posts: 17194
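
The prefix check used above can be isolated in a small helper, which makes it easy to see that lowercasing the title first makes the match case-insensitive. This is a minimal sketch: the function name `classify_title` and the sample titles are made up for illustration and are not part of the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # lowercase once so the match ignores case
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Made-up titles, for illustration only.
sample_titles = [
    "Ask HN: How do you learn a new language?",
    "SHOW HN: My weekend project",
    "Interactive Dynamic Video",
]
labels = [classify_title(title) for title in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```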


In [0]:
total_ask_comments = 0
for post in ask_posts:
    comments = int(post[4])  # num_comments is stored as a string
    total_ask_comments += comments

avg_ask_comments = total_ask_comments / len(ask_posts)
print('ask:', avg_ask_comments)

total_show_comments = 0
for post in show_posts:
    comments = int(post[4])
    total_show_comments += comments

avg_show_comments = total_show_comments / len(show_posts)
print('show:', avg_show_comments)


ask: 14.038417431192661
show: 10.31669535283993
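
Since the two loops above differ only in the list they iterate over, the averaging step could be factored into one function. The sketch below is one possible way to do that; the name `average_comments`, its `comments_index` parameter, and the sample rows are assumptions made for illustration.

```python
def average_comments(posts, comments_index=4):
    """Mean comment count of rows whose counts are strings at comments_index."""
    if not posts:
        return 0.0  # avoid dividing by zero on an empty list
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Made-up rows in the dataset's column layout, for illustration only.
sample_rows = [
    ["1", "Ask HN: A?", "", "5", "10", "user_a", "1/1/2016 10:00"],
    ["2", "Ask HN: B?", "", "3", "4", "user_b", "1/1/2016 11:00"],
]
sample_avg = average_comments(sample_rows)
print(sample_avg)  # 7.0
```

With the real data, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two loops.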


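As we can see, ask posts receive more comments on average (about 14.04 per post, versus 10.32 for show posts), so we will focus on ask posts from here on. Next, let us count how many ask posts were created in each hour of the day and how many comments those posts received.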

In [0]:
import datetime as dt

result_list = []
for row in ask_posts:  # collect a [created_at, num_comments] pair for each ask post
    created_at = row[6]
    comments = int(row[4])
    result_list.append([created_at, comments])

counts_by_hour = {}    # hour -> number of ask posts created in that hour
comments_by_hour = {}  # hour -> total comments received by those posts

for created_at, comments in result_list:
    # Keep only the hour ("00" to "23"); the date and the minutes are discarded.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comments

In [0]:
avg_comments_per_post = []
for hour in sorted(comments_by_hour):
    avg_comments_per_post.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 2)])
avg_comments_per_post

[['00', 8.13],
 ['01', 11.38],
 ['02', 23.81],
 ['03', 7.8],
 ['04', 7.17],
 ['05', 10.09],
 ['06', 9.02],
 ['07', 7.85],
 ['08', 10.25],
 ['09', 5.58],
 ['10', 13.44],
 ['11', 11.05],
 ['12', 9.41],
 ['13', 14.74],
 ['14', 13.23],
 ['15', 38.59],
 ['16', 16.8],
 ['17', 11.46],
 ['18', 13.2],
 ['19', 10.8],
 ['20', 21.52],
 ['21', 16.01],
 ['22', 6.75],
 ['23', 7.99]]
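
The hourly aggregation can also be written more compactly with `collections.defaultdict`, which removes the need to check whether an hour has been seen before. This is only a sketch of an alternative, not the approach used in the cells above; the function name `average_by_hour` and the sample pairs are illustrative.

```python
import datetime as dt
from collections import defaultdict


def average_by_hour(rows):
    """Map each hour ('00' to '23') to the mean comment count of its rows.

    Each row is a [created_at, num_comments] pair.
    """
    counts = defaultdict(int)    # hour -> number of posts
    comments = defaultdict(int)  # hour -> total comments
    for created_at, num_comments in rows:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return {hour: comments[hour] / counts[hour] for hour in sorted(counts)}


# Illustrative pairs in the dataset's timestamp format.
sample = [
    ["8/4/2016 11:52", 52],
    ["1/26/2016 11:05", 10],
    ["6/23/2016 22:20", 1],
]
hourly = average_by_hour(sample)
print(hourly)  # {'11': 31.0, '22': 1.0}
```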

In [0]:
swap_avg_by_hour = []
for item in avg_comments_per_post:
    swap_avg_by_hour.append([item[1], item[0]])
print(swap_avg_by_hour)

[[8.13, '00'], [11.38, '01'], [23.81, '02'], [7.8, '03'], [7.17, '04'], [10.09, '05'], [9.02, '06'], [7.85, '07'], [10.25, '08'], [5.58, '09'], [13.44, '10'], [11.05, '11'], [9.41, '12'], [14.74, '13'], [13.23, '14'], [38.59, '15'], [16.8, '16'], [11.46, '17'], [13.2, '18'], [10.8, '19'], [21.52, '20'], [16.01, '21'], [6.75, '22'], [7.99, '23']]


In [0]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

In [0]:
print("Top 5 Hours for Ask Posts Comments")
for avg, hour in sorted_swap[:5]:
    print("{}:00: {} average comments per post".format(hour, avg))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.8 average comments per post
21:00: 16.01 average comments per post
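
Swapping the columns works because `sorted` compares lists element by element, so the average ends up being the sort key. An alternative, sketched below, is to leave the `[hour, average]` pairs as they are and pass a `key` function to `sorted`. The four pairs are copied from the hourly averages computed earlier; the variable names are illustrative.

```python
# A few [hour, average] pairs taken from the hourly averages computed earlier.
avg_sample = [["02", 23.81], ["15", 38.59], ["20", 21.52], ["16", 16.8]]

# Sort by the second element (the average), highest first, without swapping.
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```

The `{:.2f}` format also keeps two decimal places, so 16.8 would print as 16.80.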


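According to these findings, ask posts created during the 15:00 hour receive the most comments on average (38.59 per post), so that is the best hour to post. The timestamps in the dataset are in Eastern Time (EST). Pacific Time (San Francisco) is 3 hours behind Eastern Time, so someone there should post at around 12:00 local time to have a better chance of receiving comments.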
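
The time zone conversion can be checked with the `datetime` module. The sketch below assumes the dataset's timestamps are US Eastern Time and uses fixed standard-time offsets (UTC-5 for Eastern, UTC-8 for Pacific); the date is arbitrary and only the hour matters. Both zones normally shift together for daylight saving, so the 3-hour gap holds in summer as well.

```python
import datetime as dt

# Fixed standard-time offsets; an assumption that ignores daylight saving.
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5), "EST")
PACIFIC_STANDARD = dt.timezone(dt.timedelta(hours=-8), "PST")

# 15:00 Eastern on an arbitrary winter date.
eastern_time = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EASTERN_STANDARD)
pacific_time = eastern_time.astimezone(PACIFIC_STANDARD)

print(pacific_time.strftime("%H:%M"))  # 12:00
```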