# Exploring 'Hacker News' Posts - “What time should you post to receive the highest average number of comments?”
In this project, we'll work with a dataset of submissions to the popular technology site Hacker News.

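A reduced version of this data exists for convenience: it goes from almost 300,000 rows to approximately 20,000 rows by removing all submissions that didn't receive any comments and then randomly sampling the remaining submissions. The file loaded in this notebook contains 293,119 rows and includes submissions with no comments, so it corresponds to the full data set.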

The full data set is available right here.

In our analysis we are specifically interested in posts whose titles begin with 'Ask HN' (used to ask the Hacker News community about specific topics) or 'Show HN' (used to show the community a project, product, or something interesting).

We will compare these two types of posts, Ask HN and Show HN, to determine which one receives more comments on average and whether posts created at a certain time of day receive more comments.

In [9]:
from csv import reader
import datetime as dt

In [10]:
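# Read the CSV into a list of lists, then split off the header row
with open("hacker_news.csv", "r", encoding="utf-8") as file:
    hn = list(reader(file))
headers = hn[:1]  # kept as a one-row list, so it displays as [[...]]
hn = hn[1:]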
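
As a minimal sketch of the same header/data split, the snippet below reads a two-line, in-memory stand-in for `hacker_news.csv` (one header line plus one row copied from the data). The names `sample`, `rows`, `header` and `data` are illustrative only, and here the header is kept as a flat list (`rows[0]`) rather than a one-row list.

```python
import io
from csv import reader

# Two-line stand-in for hacker_news.csv: the header plus one row from the data
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12578989,algorithmic music,"
    "http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext,"
    "1,0,poindontcare,9/26/2016 3:16\n"
)

rows = list(reader(sample))
header, data = rows[0], rows[1:]  # flat header list, data rows without the header

print(header)
print(len(data))
```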

In [11]:
file = open("hacker_news.csv","r",encoding='utf-8')


In [12]:
a = reader(file)
a

<_csv.reader at 0x2ab3ec4ed40>

In [13]:
hn[:5]

[['12579008',
  'You have two days to comment if you want stem cells to be classified as your own',
  'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018',
  '1',
  '0',
  'altstar',
  '9/26/2016 3:26'],
 ['12579005',
  'SQLAR  the SQLite Archiver',
  'https://www.sqlite.org/sqlar/doc/trunk/README.md',
  '1',
  '0',
  'blacksqr',
  '9/26/2016 3:24'],
 ['12578997',
  'What if we just printed a flatscreen television on the side of our boxes?',
  'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43',
  '1',
  '0',
  'pavel_lishin',
  '9/26/2016 3:19'],
 ['12578989',
  'algorithmic music',
  'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext',
  '1',
  '0',
  'poindontcare',
  '9/26/2016 3:16'],
 ['12578979',
  'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake',
  'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94',
  '1',
  '0',
  'markgainor1',
  '9/26/2016 3:14']]

In [14]:
len(hn)

293119

In [15]:
headers

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]

In [None]:
#Displaying header and first 5 rows of the dataset

In [16]:
print(headers,"\n")
for i in hn[:5]:
    print(i)
    print("\n")

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']] 

['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']




# Extracting Ask HN and Show HN posts
To find the posts that begin with either Ask HN or Show HN, we'll use the string method startswith. Because startswith is case sensitive, we first convert each title to lowercase with the lower method, so the prefix matches regardless of capitalization.
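
The effect of lowercasing before calling startswith can be seen in a small sketch. The list `titles` is illustrative: the first and third titles appear in the data, while the second (all lowercase) is a made-up title added only to show the difference.

```python
# Sample titles; the lowercase one is hypothetical, added for illustration
titles = [
    "Ask HN: What TLD do you use for local development?",
    "ask hn: lowercase prefix",
    "Show HN: Finding puns computationally",
]

# startswith is case sensitive, so the lowercase title is missed
case_sensitive = [t for t in titles if t.startswith("Ask HN")]

# lowering the title first catches every capitalization of the prefix
case_insensitive = [t for t in titles if t.lower().startswith("ask hn")]

print(len(case_sensitive), len(case_insensitive))
```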

In [17]:
ask_posts=[]
show_posts=[]
other_posts = []

for row in hn:
    title = row[1]
    
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    
    else:
        other_posts.append(row)
    
print("number of all posts",len(hn))
print("number of ask_posts",len(ask_posts))
print("number of show_posts",len(show_posts))
print("number of other_posts",len(other_posts))

number of all posts 293119
number of ask_posts 9139
number of show_posts 10158
number of other_posts 273822


In [38]:
ask_posts

[['12578908',
  'Ask HN: What TLD do you use for local development?',
  '',
  '4',
  '7',
  'Sevrene',
  '9/26/2016 2:53'],
 ['12578522',
  'Ask HN: How do you pass on your work when you die?',
  '',
  '6',
  '3',
  'PascLeRasc',
  '9/26/2016 1:17'],
 ['12577908',
  'Ask HN: How a DNS problem can be limited to a geographic region?',
  '',
  '1',
  '0',
  'kuon',
  '9/25/2016 22:57'],
 ['12577870',
  'Ask HN: Why join a fund when you can be an angel?',
  '',
  '1',
  '3',
  'anthony_james',
  '9/25/2016 22:48'],
 ['12577647',
  'Ask HN: Someone uses stock trading as passive income?',
  '',
  '5',
  '2',
  '00taffe',
  '9/25/2016 21:50'],
 ['12576946',
  'Ask HN: How hard would it be to make a cheap, hackable phone?',
  '',
  '2',
  '1',
  'hkt',
  '9/25/2016 19:30'],
 ['12576899',
  'Ask HN: What is that one deciding factor that makes a website successful?',
  '',
  '22',
  '22',
  'ziggystardust',
  '9/25/2016 19:22'],
 ['12576398',
  'Ask HN: Is the world really short of software deve

In [18]:
show_posts

[['12578335',
  'Show HN: Finding puns computationally',
  'http://puns.samueltaylor.org/',
  '2',
  '0',
  'saamm',
  '9/26/2016 0:36'],
 ['12578182',
  'Show HN: A simple library for complicated animations',
  'https://christinecha.github.io/choreographer-js/',
  '1',
  '0',
  'christinecha',
  '9/26/2016 0:01'],
 ['12578098',
  'Show HN: WebGL visualization of DNA sequences',
  'http://grondilu.github.io/dna.html',
  '1',
  '0',
  'grondilu',
  '9/25/2016 23:44'],
 ['12577991',
  'Show HN: Pomodoro-centric, heirarchical project management with ES6 modules',
  'https://github.com/jakebian/zeal',
  '2',
  '0',
  'dbranes',
  '9/25/2016 23:17'],
 ['12577142',
  'Show HN: Jumble  Essays on the go #PaulInYourPocket',
  'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8',
  '1',
  '1',
  'ryderj',
  '9/25/2016 20:06'],
 ['12576813',
  'Show HN: Learn Japanese Vocab via multiple choice questions',
  'http://japanese.vul.io/',
  '1',
  '1',
  'soulchild37',
  

In [19]:
show_posts[-3:]

[['10177511',
  'Show HN: MockTheClock  A tiny JavaScript library for spoofing time in browser',
  'https://github.com/zb3/MockTheClock',
  '18',
  '6',
  'zb3',
  '9/6/2015 13:02'],
 ['10177459',
  'Show HN: AppyPaper  Gift wrap with app icons printed on it',
  'http://www.appypaper.com/',
  '6',
  '4',
  'submitstartup',
  '9/6/2015 12:38'],
 ['10177421',
  'Show HN: Popularity scoring for arXiv publications',
  'https://gist.github.com/nebw/5504697c118744677c2d',
  '17',
  '1',
  'nebw',
  '9/6/2015 12:16']]

# To verify that the lists were built correctly, we look at the first two and last two rows of each list.

In [20]:
print(ask_posts[0][1])
print(ask_posts[1][1])
print(ask_posts[-2][1])
print(ask_posts[-1][1])
print("\n")
print(show_posts[0][1])
print(show_posts[1][1])
print(show_posts[-2][1])
print(show_posts[-1][1])


Ask HN: What TLD do you use for local development?
Ask HN: How do you pass on your work when you die?
Ask HN: Where do you look for work if you need experience?
Ask HN: What is/are your favorite quote(s)?


Show HN: Finding puns computationally
Show HN: A simple library for complicated animations
Show HN: AppyPaper  Gift wrap with app icons printed on it
Show HN: Popularity scoring for arXiv publications


# Calculating the average number of comments for Ask HN and Show HN posts.

In [None]:
# Ask HN: average number of comments per post

In [None]:
# First we need total_ask_comments (computed in the next cell); the average is then
# avg_ask_hn = total_ask_comments / len(ask_posts)

In [22]:
total_ask_comments = 0

for posts in ask_posts:
    comments = int(posts[4])
    total_ask_comments +=comments

In [24]:
avg_ask_hn = total_ask_comments / len (ask_posts)

In [25]:
print("{:.2f}".format(avg_ask_hn))

10.39

In [None]:
# Show HN: average number of comments per post

In [36]:
# we first find total_show_comments (next cell), then divide by the number of Show HN posts
avg_show_hn = total_show_comments / len(show_posts)

In [26]:
total_show_comments = 0

for posts in show_posts:
    comments = int(posts[4])
    total_show_comments += comments
    

In [27]:
avg_show_hn = total_show_comments / len(show_posts)

In [28]:
print("{:.2f}".format(avg_show_hn))

4.89

In [29]:
# Difference between the two averages, each rounded to two decimals
difference = round(avg_ask_hn, 2) - round(avg_show_hn, 2)
"{:.2f}".format(difference)


'5.50'
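
The two averages follow the same pattern, so they could also be computed with one small helper. This is only a sketch: `average_comments` and `sample_ask` are hypothetical names, and the three rows are Ask HN rows copied from the output shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(row[comments_index]) for row in posts) / len(posts)

# Three Ask HN rows copied from the ask_posts output
sample_ask = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]

print("{:.2f}".format(average_comments(sample_ask)))
```

With the full lists, the same helper would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.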

# Finding the number of Ask posts and comments by hour created.

In [84]:
result_list = []
# we iterate through each post in ask_posts, then assign
# the timestamp to dates and the number of comments
# to n_comm. At the end we append a list with that
# data to result_list

for posts in ask_posts:
    dates = posts[-1]
    n_comm = int(posts[4])
    result_list.append([dates, n_comm])

counts_by_hour = {} # contains the number of ask posts created during each hour of the day
comments_by_hour = {} # contains the corresponding number of comments ask posts created at each hour received

for row in result_list:
    date_str = row[0]
    num_comments = row[1]
    dt_object = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M') # we convert the date string to a datetime object
    hour = dt_object.strftime('%H') # we extract the hour (%H) from the datetime object
    
    # now we build two frequency tables: counts_by_hour counts the posts
    # created in each hour, and comments_by_hour sums the comments
    # those posts received.
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
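
The same tally can be wrapped in a reusable function, which avoids repeating the loop for each list of posts. This is a sketch, not the notebook's own method: `tally_by_hour` is a hypothetical helper, it uses `collections.defaultdict` in place of the explicit `if`/`else`, and the sample pairs are values that appear in `result_list`.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, fmt='%m/%d/%Y %H:%M'):
    """Return (counts, comments) dicts keyed by zero-padded hour string."""
    counts = defaultdict(int)    # number of posts created in each hour
    comments = defaultdict(int)  # total comments on posts created in each hour
    for date_str, num_comments in pairs:
        hour = dt.datetime.strptime(date_str, fmt).strftime('%H')
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

# A few [timestamp, comments] pairs that appear in result_list
sample = [['9/26/2016 2:53', 7], ['9/25/2016 15:48', 0],
          ['9/25/2016 15:35', 13], ['9/25/2016 15:28', 0]]

counts, comments = tally_by_hour(sample)
print(counts)
print(comments)
```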
        

In [85]:
result_list[:]

[['9/26/2016 2:53', 7],
 ['9/26/2016 1:17', 3],
 ['9/25/2016 22:57', 0],
 ['9/25/2016 22:48', 3],
 ['9/25/2016 21:50', 2],
 ['9/25/2016 19:30', 1],
 ['9/25/2016 19:22', 22],
 ['9/25/2016 17:55', 3],
 ['9/25/2016 15:48', 0],
 ['9/25/2016 15:35', 13],
 ['9/25/2016 15:28', 0],
 ['9/25/2016 14:43', 0],
 ['9/25/2016 14:17', 3],
 ['9/25/2016 13:08', 2],
 ['9/25/2016 11:27', 2],
 ['9/25/2016 10:51', 0],
 ['9/25/2016 10:47', 6],
 ['9/25/2016 9:04', 97],
 ['9/25/2016 7:09', 4],
 ['9/25/2016 3:00', 1],
 ['9/24/2016 23:04', 0],
 ['9/24/2016 22:02', 7],
 ['9/24/2016 21:18', 2],
 ['9/24/2016 20:58', 0],
 ['9/24/2016 19:57', 1],
 ['9/24/2016 19:02', 0],
 ['9/24/2016 17:55', 0],
 ['9/24/2016 17:27', 1],
 ['9/24/2016 16:50', 0],
 ['9/24/2016 16:03', 5],
 ['9/24/2016 15:29', 66],
 ['9/24/2016 14:03', 1],
 ['9/24/2016 10:10', 11],
 ['9/24/2016 8:46', 7],
 ['9/24/2016 8:39', 1],
 ['9/24/2016 8:38', 1],
 ['9/24/2016 8:28', 1],
 ['9/24/2016 3:36', 3],
 ['9/24/2016 0:21', 2],
 ['9/23/2016 23:38', 6],
 ['9/2

In [86]:
len(result_list)

9139

In [87]:
result_list[:3]

[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:57', 0]]

In [88]:
comments_by_hour

{'02': 2996,
 '01': 2089,
 '22': 3372,
 '21': 4500,
 '19': 3954,
 '17': 5547,
 '15': 18525,
 '14': 4972,
 '13': 7245,
 '11': 2797,
 '10': 3013,
 '09': 1477,
 '07': 1585,
 '03': 2154,
 '23': 2297,
 '20': 4462,
 '16': 4466,
 '08': 2362,
 '00': 2277,
 '18': 4877,
 '12': 4234,
 '04': 2360,
 '06': 1587,
 '05': 1838}

In [89]:
counts_by_hour

{'02': 269,
 '01': 282,
 '22': 383,
 '21': 518,
 '19': 552,
 '17': 587,
 '15': 646,
 '14': 513,
 '13': 444,
 '11': 312,
 '10': 282,
 '09': 222,
 '07': 226,
 '03': 271,
 '23': 343,
 '20': 510,
 '16': 579,
 '08': 257,
 '00': 301,
 '18': 614,
 '12': 342,
 '04': 243,
 '06': 234,
 '05': 209}

In [None]:
# hour with the maximum number of posts in counts_by_hour

In [53]:
max_by_hour = 0
max_hour = []

for hour in counts_by_hour:
    if counts_by_hour[hour] > max_by_hour:
        max_by_hour = counts_by_hour[hour] # keep the running maximum up to date
        max_hour = [hour, max_by_hour]
        
print('With', max_hour[1], 'posts,', max_hour[0] + ':00', 'has the highest number of posts.')



With 646 posts, 15:00 has the highest number of posts.
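
As an alternative to the explicit loop, the built-in `max` with a `key` function finds the busiest hour directly. A minimal sketch, using a hypothetical `counts_sample` that holds four of the entries from `counts_by_hour`:

```python
# Subset of counts_by_hour, copied from the output above
counts_sample = {'02': 269, '15': 646, '18': 614, '05': 209}

# max iterates over the keys; key=counts_sample.get compares them by their counts
busiest_hour = max(counts_sample, key=counts_sample.get)

print(busiest_hour, counts_sample[busiest_hour])
```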


In [None]:
# calculating average number of comments for posts during each hour

In [64]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])



In [72]:
for row in sorted(avg_by_hour):
    print(row)

['00', 7.5647840531561465]
['01', 7.407801418439717]
['02', 11.137546468401487]
['03', 7.948339483394834]
['04', 9.7119341563786]
['05', 8.794258373205741]
['06', 6.782051282051282]
['07', 7.013274336283186]
['08', 9.190661478599221]
['09', 6.653153153153153]
['10', 10.684397163120567]
['11', 8.96474358974359]
['12', 12.380116959064328]
['13', 16.31756756756757]
['14', 9.692007797270955]
['15', 28.676470588235293]
['16', 7.713298791018998]
['17', 9.449744463373083]
['18', 7.94299674267101]
['19', 7.163043478260869]
['20', 8.749019607843136]
['21', 8.687258687258687]
['22', 8.804177545691905]
['23', 6.696793002915452]


In [73]:
# Swap the columns of avg_by_hour so that the average comes first,
# which lets sorted() order the rows by average
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print('Hours with the highest average number of comments per Ask HN post:')
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
    
for average, hr in sorted_swap[:5]: 
    hour = dt.datetime.strptime(hr,'%H') # convert the hour string to a datetime object
    time = hour.strftime('%H:%M') # format the datetime object as HH:MM
    print('{time}: {average:.2f} comments per post'.format(time = time, average = average))
    

Hours with the highest average number of comments per Ask HN post:
15:00: 28.68 comments per post
13:00: 16.32 comments per post
12:00: 12.38 comments per post
02:00: 11.14 comments per post
10:00: 10.68 comments per post
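
Swapping the columns is one way to sort by average. Another option, sketched below, is to leave the pairs as they are and pass a `key` function to `sorted`. The list `avg_sample` is illustrative and holds five of the `[hour, average]` pairs printed above.

```python
# A few [hour, average] pairs copied from avg_by_hour
avg_sample = [['02', 11.137546468401487], ['12', 12.380116959064328],
              ['13', 16.31756756756757], ['15', 28.676470588235293],
              ['23', 6.696793002915452]]

# Sort on the second element (the average), highest first
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top[:3]:
    print('{}:00: {:.2f} comments per post'.format(hour, average))
```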


In [74]:
print('Top 5 hours in IST for Ask Posts comments:')

for average, hr in sorted_swap[:5]: 
    hour = dt.datetime.strptime(hr,'%H')
    # IST is UTC+5:30; an offset of 10.5 hours assumes the timestamps are in UTC-5
    ist = hour + dt.timedelta(hours = 10.5)
    time = ist.strftime('%H:%M')
    print('{time}: {average:.2f} comments per post'.format(time = time, average = average))
    

Top 5 hours in IST for Ask Posts comments:
01:30: 28.68 comments per post
23:30: 16.32 comments per post
22:30: 12.38 comments per post
12:30: 11.14 comments per post
20:30: 10.68 comments per post
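
The 10.5-hour shift can also be expressed with timezone-aware datetimes, which makes the assumed offsets explicit. This is a sketch under a stated assumption: the data itself does not say which time zone `created_at` uses, so the fixed UTC-5 source offset below is assumed in order to match the 10.5 hours used above. The names `source_tz`, `ist`, `posted` and `converted` are illustrative.

```python
import datetime as dt

# Assumption: source timestamps are at a fixed UTC-5 offset (not stated in the data)
source_tz = dt.timezone(dt.timedelta(hours=-5))
ist = dt.timezone(dt.timedelta(hours=5, minutes=30))  # IST is UTC+5:30

# Arbitrary date; only the hour matters for this illustration
posted = dt.datetime(2016, 9, 26, 15, 0, tzinfo=source_tz)
converted = posted.astimezone(ist)

print(converted.strftime('%H:%M'))
```

Fixed offsets ignore daylight saving time, so this reproduces the table above but would be off by an hour for any period in which the source zone observes a different offset.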


In [78]:
result_list = []
# we iterate through each post in show_posts, then assign
# the timestamp to dates and the number of comments
# to n_comments. At the end we append a list with that
# data to result_list

for row in show_posts:
    dates = row[-1]
    n_comments = int(row[4]) # index 4 is num_comments (num_points is index 3)
    result_list.append([dates, n_comments])

posts_by_hour = {} # contains the number of show posts created during each hour of the day
points_by_hour = {} # despite its name, contains the number of comments show posts created at each hour received

for row in result_list:
    date_str = row[0]
    num_comments = row[1]
    time = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M') # we convert the date string to a datetime object
    hour = time.strftime('%H') # we extract the hour (%H) from the datetime object
    
    # now we build two frequency tables: posts_by_hour counts the posts
    # created in each hour, and points_by_hour sums the comments
    # those posts received.
    
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        points_by_hour[hour] = num_comments # start from this post's comments, not from 1
    else:
        posts_by_hour[hour] += 1
        points_by_hour[hour] += num_comments
        

In [79]:
result_list

[['9/26/2016 0:36', 0],
 ['9/26/2016 0:01', 0],
 ['9/25/2016 23:44', 0],
 ['9/25/2016 23:17', 0],
 ['9/25/2016 20:06', 1],
 ['9/25/2016 19:06', 1],
 ['9/25/2016 18:32', 0],
 ['9/25/2016 16:50', 1],
 ['9/25/2016 16:43', 0],
 ['9/25/2016 14:30', 1],
 ['9/25/2016 10:50', 3],
 ['9/25/2016 10:00', 0],
 ['9/25/2016 9:19', 1],
 ['9/25/2016 9:19', 0],
 ['9/25/2016 8:49', 0],
 ['9/25/2016 6:48', 1],
 ['9/25/2016 3:06', 1],
 ['9/24/2016 23:18', 0],
 ['9/24/2016 21:36', 0],
 ['9/24/2016 20:07', 26],
 ['9/24/2016 19:35', 0],
 ['9/24/2016 18:42', 2],
 ['9/24/2016 18:40', 0],
 ['9/24/2016 18:36', 0],
 ['9/24/2016 18:35', 1],
 ['9/24/2016 17:34', 1],
 ['9/24/2016 15:20', 0],
 ['9/24/2016 15:06', 102],
 ['9/24/2016 15:03', 0],
 ['9/24/2016 11:24', 0],
 ['9/24/2016 11:23', 1],
 ['9/24/2016 11:23', 5],
 ['9/24/2016 7:24', 0],
 ['9/24/2016 4:16', 0],
 ['9/24/2016 0:41', 0],
 ['9/23/2016 21:18', 0],
 ['9/23/2016 19:38', 1],
 ['9/23/2016 18:55', 1],
 ['9/23/2016 18:31', 8],
 ['9/23/2016 17:52', 0],
 ['9/23

In [80]:
len(result_list)

10158

In [81]:
result_list[:3]

[['9/26/2016 0:36', 0], ['9/26/2016 0:01', 0], ['9/25/2016 23:44', 0]]