# Analyzing Hacker News Posts

For this project, we use the hacker_news.csv dataset, a modified version of the original data that has been reduced from almost 300,000 rows to approximately 20,000 rows by:<br>
1] Removing all submissions that did not receive any comments.<br>
2] Randomly sampling from the remaining submissions.<br>
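
The two reduction steps above can be sketched as follows. This is only an illustration on made-up rows: the names `reduce_dataset`, `sample_size`, and `seed` are hypothetical, and the exact procedure used to build hacker_news.csv is not described in this notebook.

```python
import random

def reduce_dataset(rows, sample_size, seed=0):
    """Drop rows with zero comments, then randomly sample the rest.

    Assumes each row is a list whose index 4 holds num_comments as a string.
    """
    with_comments = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    k = min(sample_size, len(with_comments))
    return rng.sample(with_comments, k)

# Made-up rows: id, title, url, num_points, num_comments, author, created_at
toy_rows = [
    ['1', 'A', '', '3', '0', 'u1', '1/1/2016 10:00'],
    ['2', 'B', '', '5', '2', 'u2', '1/1/2016 11:00'],
    ['3', 'C', '', '1', '7', 'u3', '1/1/2016 12:00'],
    ['4', 'D', '', '9', '0', 'u4', '1/1/2016 13:00'],
]

reduced = reduce_dataset(toy_rows, sample_size=2)
print(len(reduced))
```

Filtering before sampling matters here: sampling first and filtering afterwards would give a smaller result of unpredictable size.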

**Descriptions of the columns:**<br>
id: The unique identifier from Hacker News for the post<br>
title: The title of the post<br>
url: The URL that the post links to, if the post has a URL<br>
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes<br>
num_comments: The number of comments that were made on the post<br>
author: The username of the person who submitted the post<br>
created_at: The date and time at which the post was submitted<br>
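
As a small aside, the columns described above can also be read by name rather than by position. The sketch below uses `csv.DictReader` from the standard library on two rows copied from the dataset and held in memory; the variable names `sample_csv` and `records` are only for illustration, and the rest of this notebook uses positional indexes instead.

```python
import csv
import io

# Header plus two data rows from the dataset, kept in memory for the example
sample_csv = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "11919867,Technology ventures: From Idea to Enterprise,"
    "https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429,"
    "3,1,hswarna,6/17/2016 0:01\n"
)

# DictReader uses the first row as keys, so each record is a dict
records = list(csv.DictReader(io.StringIO(sample_csv)))
first = records[0]

# Every value is still a string, so numeric columns need an explicit int()
print(first['title'], int(first['num_comments']))
```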

## Importing, Opening, and Reading Data

In [12]:
from csv import reader

# Open the file in a with block so it is closed once the rows are read
with open("hacker_news.csv") as opened_file:
    read_file=reader(opened_file)
    hacker_news=list(read_file)  # list of lists, one inner list per row

In [13]:
print(hacker_news[:3])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']]


### Preparing Data

Remove the column header row so that only data rows are analyzed, and store it in a separate list for later reference.

In [14]:
column_headers=hacker_news[:1] 
hacker_news=hacker_news[1:]

In [15]:
print(column_headers)
print(hacker_news[:4])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Now separate the posts whose titles begin with Ask HN or Show HN into their own lists using the startswith() method; all remaining posts go into a third list.

In [20]:
# Create separate lists to store Ask HN, Show HN, and other posts
ask_posts=[]
show_posts=[]
other_posts=[]

for row in hacker_news:
    title=row[1]
    # Lowercase the title so the prefix check is case-insensitive
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)


In [21]:
print('Number of ask posts:',len(ask_posts))
print('Number of show posts:',len(show_posts))
print('Number of other posts:',len(other_posts))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194
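
The prefix check used for this split can also be isolated in a small helper, which makes it easy to try on individual titles. This is a minimal sketch; the function name `classify_title` is hypothetical and is not part of the notebook's code.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # makes the check case-insensitive
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

print(classify_title('Ask HN: How to improve my personal website?'))
print(classify_title('SHOW HN: Something pointless I made'))
print(classify_title('Interactive Dynamic Video'))
```

Because the title is lowercased first, variants such as `ASK HN` or `show hn` land in the same group as the usual capitalization.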


In [7]:
print(ask_posts[:5])
print(show_posts[:5])

[['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]
[['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'], ['11590768', 'Show HN: Shanhu.io, a programming playground powered by e8vm', 'https://shanhu.io', '1', '1', 'h

## Analyzing Comments - Ask HN vs Show HN Posts

In [29]:
total_ask_comments=0
for val in ask_posts:
    total_ask_comments+=int(val[4])
    
avg_ask_comments=total_ask_comments/len(ask_posts)

print('Total number of ask comments:' ,total_ask_comments)

Total number of ask comments: 24483


In [30]:
total_show_comments=0
for val in show_posts:
    total_show_comments+=int(val[4])
    
avg_show_comments=total_show_comments/len(show_posts)

print('Total number of Show  comments:' ,total_show_comments)

Total number of Show  comments: 11988


In [31]:
print('Average number of ask comments:' ,avg_ask_comments)
print('Average number of Show comments:' ,avg_show_comments)

Average number of ask comments: 14.038417431192661
Average number of Show comments: 10.31669535283993


**From the results above, we determined that, on average, ask posts receive more comments than show posts.**
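
The averaging step is the same for both groups, so it could be expressed once as a function. The sketch below assumes rows shaped like the ones in this dataset, with num_comments at index 4; the name `average_comments` and the `comments_index` parameter are hypothetical. It is run on the first three ask posts shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Mean number of comments across posts; 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# First three ask posts from the output above
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
    ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '',
     '1', '1', 'polskibus', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # (6 + 29 + 1) / 3 = 12.0
```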

### Analyzing how the creation time of Ask posts affects the number of comments

Now we calculate the number of ask posts created in each hour of the day, along with the number of comments received.<br>

In [34]:
import datetime as dt
result_list=[] # Create an empty list to store each post's creation time and comment count
for val in ask_posts:
    result_list.append([val[6],val[4]])

print(result_list[:5])

[['8/16/2016 9:55', '6'], ['11/22/2015 13:43', '29'], ['5/2/2016 10:14', '1'], ['8/2/2016 14:20', '3'], ['10/15/2015 16:38', '17']]
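
To get the hour out of these strings, each timestamp has to be parsed with a format that matches month/day/year followed by hour:minute. A minimal check on the first timestamp above, using only `datetime.strptime` from the standard library (the variable name `parsed` is just for illustration):

```python
import datetime as dt

# %m/%d/%Y matches month/day/four-digit year, %H:%M matches 24-hour time
parsed = dt.datetime.strptime('8/16/2016 9:55', '%m/%d/%Y %H:%M')

print(parsed.hour)    # 9
print(parsed.minute)  # 55
```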


In [36]:
counts_by_hour={}    #number of ask_posts created every hour
comments_by_hour={}  #number of comments obtained by the ask_posts

for row in result_list:
    date_time = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    time=date_time.hour
    comment=int(row[1])
    if time not in counts_by_hour:
        counts_by_hour[time]=1
        comments_by_hour[time]=comment
    else:
        counts_by_hour[time]+=1
        comments_by_hour[time]+=comment

In [37]:
print(counts_by_hour)
print("\n")
print(comments_by_hour)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}


{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
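
The same per-hour tallies can be built without the explicit "is the key already there" check by using `collections.defaultdict`. This is an alternative sketch, not the code that produced the output above. The first five pairs come from `result_list`; the last pair is made up so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

pairs = [
    ['8/16/2016 9:55', '6'],
    ['11/22/2015 13:43', '29'],
    ['5/2/2016 10:14', '1'],
    ['8/2/2016 14:20', '3'],
    ['10/15/2015 16:38', '17'],
    ['8/17/2016 9:10', '4'],  # made-up row to show accumulation at hour 9
]

counts = defaultdict(int)    # posts created in each hour
comments = defaultdict(int)  # comments received by those posts

for created_at, num_comments in pairs:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += int(num_comments)

print(dict(counts))
print(dict(comments))
```

Missing keys start at 0 with `defaultdict(int)`, so both dictionaries can be updated with `+=` from the first occurrence of an hour.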


**Next, we calculate the average number of comments per post for each hour.**

In [38]:
avg_comments_by_hour=[]
for val in comments_by_hour:
    avg_comments_by_hour.append([val,comments_by_hour[val]/counts_by_hour[val]])

print(avg_comments_by_hour)

[[9, 5.5777777777777775], [13, 14.741176470588234], [10, 13.440677966101696], [14, 13.233644859813085], [16, 16.796296296296298], [23, 7.985294117647059], [12, 9.41095890410959], [17, 11.46], [15, 38.5948275862069], [21, 16.009174311926607], [20, 21.525], [2, 23.810344827586206], [18, 13.20183486238532], [3, 7.796296296296297], [5, 10.08695652173913], [19, 10.8], [1, 11.383333333333333], [22, 6.746478873239437], [8, 10.25], [4, 7.170212765957447], [0, 8.127272727272727], [6, 9.022727272727273], [7, 7.852941176470588], [11, 11.051724137931034]]
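
Each average is simply the hour's comment total divided by its post count. As a worked check on three of the hours, using the values printed in the two dictionaries above (the names `counts_sample`, `comments_sample`, and `avg_sample` are only for this sketch):

```python
# Values copied from counts_by_hour and comments_by_hour for three hours
counts_sample = {15: 116, 2: 58, 20: 80}
comments_sample = {15: 4477, 2: 1381, 20: 1722}

# Divide total comments by number of posts, hour by hour
avg_sample = {
    hour: comments_sample[hour] / counts_sample[hour]
    for hour in comments_sample
}

for hour, avg in avg_sample.items():
    print(hour, round(avg, 3))
```

For hour 15, 4477 comments across 116 posts gives roughly 38.59 comments per post, which agrees with the list above.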


**This format makes it difficult to identify the hours with the most comments. Next, we sort the data so that the hours with the highest averages are easy to see.**

In [40]:
swap_avg_by_hour=[]
for row in avg_comments_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])
print(swap_avg_by_hour)

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]


In [41]:
sorted_swap=sorted(swap_avg_by_hour,reverse=True)  # Sort by average comments, highest first
print(sorted_swap)

[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [13.20183486238532, 18], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.127272727272727, 0], [7.985294117647059, 23], [7.852941176470588, 7], [7.796296296296297, 3], [7.170212765957447, 4], [6.746478873239437, 22], [5.5777777777777775, 9]]
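
Swapping the columns works because lists are compared element by element, so the average is compared first. An alternative that skips the swap is to pass a `key` function to `sorted`. The sketch below applies it to four `[hour, average]` pairs taken from the earlier output; `avg_pairs` and `ranked` are illustrative names.

```python
# [hour, average comments] pairs taken from avg_comments_by_hour
avg_pairs = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort on the second element (the average), highest first
ranked = sorted(avg_pairs, key=lambda pair: pair[1], reverse=True)

for hour, avg in ranked:
    print(f'{hour}:00 {avg:.2f} average comments per post')
```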


In [58]:
print("Top 5 Hours for Ask Posts Comments")
for val in sorted_swap[0:5]:
    print(f'{val[1]}:00 {val[0]:.2f} average comments per post')  # US Eastern time (EST, UTC-05:00)

Top 5 Hours for Ask Posts Comments
15:00 38.59 average comments per post
2:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post


In [59]:
# Our time zone (IST, UTC+05:30) is 10 hours 30 minutes ahead of EST (UTC-05:00)
print("Top 5 Hours for Ask Posts Comments")
for val in sorted_swap[0:5]:
    ist_dt = dt.datetime.strptime(str(val[1]), '%H') + dt.timedelta(hours=10,minutes=30)
    ist_str = ist_dt.strftime('%H:%M')
    print(f'{ist_str}:00 {val[0]:.2f} average comments per post')

Top 5 Hours for Ask Posts Comments
01:30:00 38.59 average comments per post
12:30:00 23.81 average comments per post
06:30:00 21.52 average comments per post
02:30:00 16.80 average comments per post
07:30:00 16.01 average comments per post


**The results above show that ask posts created at 15:00 EST receive the most comments on average (38.59 per post).<br>
For us in the Indian time zone (IST), this corresponds to submitting a post at 01:30.**
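
The EST to IST conversion can also be done with timezone-aware datetimes instead of adding a fixed timedelta by hand. The sketch below assumes, as this notebook does, that the timestamps are in US Eastern standard time (UTC-05:00). It uses fixed offsets and therefore ignores daylight saving time. The names `EST`, `IST`, and `est_hour_to_ist` are hypothetical, and the date used is arbitrary.

```python
import datetime as dt

# Fixed-offset time zones; daylight saving time is deliberately ignored
EST = dt.timezone(dt.timedelta(hours=-5), 'EST')
IST = dt.timezone(dt.timedelta(hours=5, minutes=30), 'IST')

def est_hour_to_ist(hour):
    """Convert a whole hour in EST to the matching IST datetime."""
    est_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EST)  # arbitrary date
    return est_time.astimezone(IST)

for hour in [15, 2, 20, 16, 21]:
    ist_label = est_hour_to_ist(hour).strftime('%H:%M')
    print(f'{hour:02d}:00 EST is {ist_label} IST')
```

With aware datetimes the date rolls over automatically, so 15:00 EST shows up as 01:30 IST on the following day.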

## Conclusion
After analyzing the Hacker News dataset:<br>
First, ask posts receive more comments on average than show posts (about 14.04 versus 10.32 per post).<br>
Second, ask posts created at 15:00 EST receive the most comments on average.<br>
One possible explanation is that users in both North America and Europe are active at that time, although this dataset does not include user locations.

