**Hacker News Posts Data Exploration**

*This notebook explores Hacker News posts to identify the different types of posts and the number of comments each type receives.*

*It also estimates the best time of day to post on Hacker News to get a response from the community.*

In [1]:
from csv import reader
from datetime import datetime


**Functions**

In [12]:
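def open_dataset(file_name):
    # Read the whole CSV into a list of rows; the with-block closes the file.
    # newline='' is what the csv module documentation recommends for reader().
    with open(file_name, newline='') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file)

    return data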

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')  # prints two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

**Reading the file into the notebook and exploring it:**

In [16]:
hn = open_dataset("/content/HN_posts_year_to_Sep_26_2016.csv")
hn_header = hn[0]
hn = hn[1:]

print(hn_header)
explore_data(hn, 0, 5, True)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']


Number of ro

**Segmenting posts with comments from those with none:**

In [18]:
hn_no_comments = []
hn_comments = []
for post in hn:
  comment = post[4]
  if comment == "0":
    hn_no_comments.append(post)
  else:
    hn_comments.append(post)
explore_data(hn_no_comments, 0, 4, True)
explore_data(hn_comments, 0, 4, True)

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows: 212718
Number of columns: 7
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']


['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/

*Of the 293,119 posts, 212,718 have no comments, which leaves 80,401 posts with at least one comment.*

**Segmenting the posts with comments into three buckets: Ask HN (titles starting with "Ask HN"), Show HN (titles starting with "Show HN") and other posts.**

In [20]:
ask_posts = []
show_posts = []
other_posts = []

for post in hn_comments:
  title = post[1]
  if title.lower().startswith('ask hn'):
    ask_posts.append(post)
  elif title.lower().startswith('show hn'):
    show_posts.append(post)
  else:
    other_posts.append(post)

explore_data(ask_posts, 0, 3, True)
explore_data(show_posts, 0, 3, True)
explore_data(other_posts, 0, 3, True)

['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']


['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']


['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']


Number of rows: 6911
Number of columns: 7
['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06']


['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06']


['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']


Number of rows: 5059
Number of columns: 7
['12578975', 'Saving the Hassle of Shopping', 'https://blog.m

We have **6,911 posts** asking the Hacker News community a specific question.

We have **5,059 posts** showing the Hacker News community a project, product, or just generally something interesting.

The remaining **68,431 posts** fall into neither category.
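
The bucketing rule used above can be isolated into a small helper, which makes it easy to check on individual titles. This is only a sketch: `classify_post` is a hypothetical name that does not appear in the notebook, and the sample titles are copied from the rows printed earlier.

```python
def classify_post(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'


# Titles taken from the sample rows printed above
sample_titles = [
    'Ask HN: What TLD do you use for local development?',
    'Show HN: Learn Japanese Vocab via multiple choice questions',
    'Saving the Hassle of Shopping',
]
labels = [classify_post(title) for title in sample_titles]
print(labels)
```

Keeping the rule in one function means the same test is applied wherever posts are classified.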

**Average number of Comments for Ask HN and Show HN**

In [27]:
total_ask_comments = 0
for post in ask_posts:
  num_comments = int(post[4])
  total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(total_ask_comments)
print(len(ask_posts))
print(avg_ask_comments)

94986
6911
13.744175951381855


In [28]:
total_show_comments = 0
for post in show_posts:
  num_comments = int(post[4])
  total_show_comments += num_comments
avg_show_comments = total_show_comments / len(show_posts)
print(total_show_comments)
print(len(show_posts))
print(avg_show_comments)

49633
5059
9.810832180272781


*Ask HN posts average approximately 14 comments (13.74), which is higher than the average for Show HN posts, approximately 10 comments (9.81).*
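
Since the two averaging cells differ only in the list they loop over, the calculation could be factored into one function. The sketch below assumes the same row layout as the dataset (comment count as a string at index 4); `average_comments` and `sample_ask` are hypothetical names, and the three rows are copied from the Ask HN output above.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Three Ask HN rows from the output above (7, 3 and 3 comments)
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]
sample_avg = average_comments(sample_ask)
print(round(sample_avg, 2))  # 4.33
```

With the full lists, `average_comments(ask_posts)` and `average_comments(show_posts)` would replace the two separate loops.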

**Volume of Ask HN posts and comments by the hour they were created**

In [29]:
result_list = []
for post in ask_posts:
  time_created = post[6]
  num_comments = int(post[4])
  result_list.append([time_created, num_comments])
print(result_list[:4])

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2]]


In [55]:
counts_by_hour = {}
comments_by_hour = {}
for item in result_list:
  date_time = item[0]
  comment_num = item[1]
  # Parse the timestamp, then extract the hour as a zero-padded string
  parsed_date_time = datetime.strptime(date_time, "%m/%d/%Y %H:%M")
  hour = parsed_date_time.strftime("%H")
  if hour not in counts_by_hour:
    counts_by_hour[hour] = 1
    comments_by_hour[hour] = comment_num
  else:
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += comment_num
print(counts_by_hour)
print(comments_by_hour)

{'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


**Average Number of Comments per Ask HN Post by Hour**

In [62]:
avg_num_comments_by_hour = []
for hour in comments_by_hour:
  # Average comments per post for this hour, rounded to the nearest integer
  avg_count = round(comments_by_hour[hour] / counts_by_hour[hour])
  avg_num_comments_by_hour.append([hour, avg_count])
print(avg_num_comments_by_hour)

[['02', 13], ['01', 9], ['22', 12], ['21', 11], ['19', 9], ['17', 14], ['15', 40], ['14', 13], ['13', 22], ['11', 11], ['10', 14], ['09', 8], ['07', 10], ['03', 10], ['16', 11], ['08', 12], ['00', 10], ['23', 8], ['20', 11], ['18', 11], ['12', 15], ['04', 13], ['06', 9], ['05', 11]]


*For example, Ask HN posts created at **2am** receive an average of **13 comments**, while those created at **3pm** receive an average of **40 comments**.*
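
The counting and averaging steps above can also be written more compactly with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. The following is a sketch rather than the notebook's own code: `sample_results` holds the four entries of `result_list` printed earlier, and the other variable names are assumptions.

```python
from collections import defaultdict
from datetime import datetime

# The first four [created_at, num_comments] pairs printed from result_list
sample_results = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:48', 3],
    ['9/25/2016 21:50', 2],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # total comments per hour
for created_at, num_comments in sample_results:
    hour = datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Unrounded average comments per post for each hour seen in the sample
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```

Leaving the averages unrounded at this stage keeps ties visible; rounding can be applied when the values are displayed.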

In [63]:
avg_hr_swap = []
for item in avg_num_comments_by_hour:
  new_hr_swap = [item[1], item[0]]
  avg_hr_swap.append(new_hr_swap)
print(avg_hr_swap)

[[13, '02'], [9, '01'], [12, '22'], [11, '21'], [9, '19'], [14, '17'], [40, '15'], [13, '14'], [22, '13'], [11, '11'], [14, '10'], [8, '09'], [10, '07'], [10, '03'], [11, '16'], [12, '08'], [10, '00'], [8, '23'], [11, '20'], [11, '18'], [15, '12'], [13, '04'], [9, '06'], [11, '05']]


In [66]:
sorted_hr_swap = sorted(avg_hr_swap, reverse=True)
print(sorted_hr_swap)


[[40, '15'], [22, '13'], [15, '12'], [14, '17'], [14, '10'], [13, '14'], [13, '04'], [13, '02'], [12, '22'], [12, '08'], [11, '21'], [11, '20'], [11, '18'], [11, '16'], [11, '11'], [11, '05'], [10, '07'], [10, '03'], [10, '00'], [9, '19'], [9, '06'], [9, '01'], [8, '23'], [8, '09']]


**Conclusion**

In conclusion, among Ask HN posts that received at least one comment, those created during the following hours (as recorded in the dataset's `created_at` column) had the highest average number of comments:

1. 3pm
2. 1pm
3. 12pm
4. 5pm
5. 10am
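
The ranking above can be reproduced without swapping the columns by passing a `key` function to `sorted`. This sketch reuses the rounded averages printed earlier; `avg_by_hour` and `top_five` are hypothetical names introduced here for illustration.

```python
from datetime import datetime

# Rounded averages copied from the avg_num_comments_by_hour output above
avg_by_hour = [
    ['02', 13], ['01', 9], ['22', 12], ['21', 11], ['19', 9], ['17', 14],
    ['15', 40], ['14', 13], ['13', 22], ['11', 11], ['10', 14], ['09', 8],
    ['07', 10], ['03', 10], ['16', 11], ['08', 12], ['00', 10], ['23', 8],
    ['20', 11], ['18', 11], ['12', 15], ['04', 13], ['06', 9], ['05', 11],
]

# Sort by the average (second element), highest first, and keep the top five
top_five = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, avg in top_five:
    label = datetime.strptime(hour, "%H").strftime("%H:%M")
    print(f"{label}: {avg} comments per post")
```

Because hours 17 and 10 share the same rounded average of 14, their relative order depends only on how ties are broken, not on a real difference in the data.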