# Analysing Hacker News Post Popularity

In this project, we analyse [Hacker News](https://news.ycombinator.com/) posts to find out which types of posts receive the most attention and points.

Specifically, we look at posts whose titles begin with Ask HN (posts where users ask the community a question) and Show HN (posts where users share a project or, more generally, something interesting).

We will also check whether posts created at a certain time of day receive more comments and points on average.

The dataset we are working with has been reduced: posts that had no comments were removed, and the remaining submissions were then randomly sampled.

Dataset used: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts/home)

In [3]:
from csv import reader

#Opening the file and reading every row into a list of lists
#The with block closes the file once the rows have been read
with open("hacker_news.csv") as opf:
    hn = list(reader(opf))

#Displaying first five rows, using a for loop for a cleaner output
limit = 5
for l in range(0,limit):
    print(hn[l],'\n')
    
    
    

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 



# Cleaning the Data

We see that the first row seems to be the column header. As such, we will separate this row from our list.

In [4]:
hn_header = hn[0] #Setting the header in another list
print ("Header is \n \n",hn_header)
hn=hn[1:]
print("\n Table is: \n")
for l in range(0,limit):
     print (hn[l],'\n')

Header is 
 
 ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

 Table is: 

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'] 

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'] 

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'] 

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'] 

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/3
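
The header can also be split off while the file is being read, instead of slicing the list afterwards. The following is a minimal sketch of that idea, not the notebook's own method: it uses a small in-memory CSV (`csv_text`, with made-up rows) in place of `hacker_news.csv` so that it runs on its own.

```python
import csv
import io

# Hypothetical two-row CSV text standing in for hacker_news.csv.
csv_text = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example A,,2,6,user_a,8/16/2016 9:55\n"
    "2,Show HN: Example B,http://example.com,28,29,user_b,11/22/2015 13:43\n"
)

with io.StringIO(csv_text) as handle:
    rows = csv.reader(handle)
    header = next(rows)  # the first row holds the column names
    posts = list(rows)   # every remaining row is a post

print(header)
print(len(posts))
```

With the real file, `io.StringIO(csv_text)` would be replaced by `open("hacker_news.csv")`.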

# Separating the Required Posts

Since we are only interested in analysing posts that begin with Show HN or Ask HN, we will separate these posts accordingly.

In [95]:
#Posts that start with AskHN
aposts = []
#Posts that start with ShowHN
sposts = []
#All other posts
oposts = []
for row in hn:
    title = row[1]
    title = title.lower() #Setting capitalization to lower case
    if(title.startswith("ask hn")):
        #Add to ask post lists if title starts with ask hn
        aposts.append(row) 
    elif(title.startswith("show hn")):
        #Add to show post list if title starts with show hn
        sposts.append(row)
    else:
        #Add all other posts to this list
        oposts.append(row)
print("\n The number of ask hn posts are",len(aposts),'\n')
print("\n Few of them are:\n",aposts[:5])
print("\n The number of show hn posts are",len(sposts),'\n')
print("\n Few of them are: \n",sposts[:5])
print("\n The number of remaining posts are",len(oposts),'\n')
print("\n Few of them are: \n",oposts[:5])


 The number of ask hn posts are 1744 


 Few of them are:
 [['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55'], ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'], ['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14'], ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20'], ['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']]

 The number of show hn posts are 1162 


 Few of them are: 
 [['10627194', 'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform', 'https://iot.seeed.cc', '26', '22', 'kfihihc', '11/25/2015 14:03'], ['10646440', 'Show HN: Something pointless I made', 'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/20

# Determining the Average Number of Comments for ShowHN and AskHN Posts

Now that we have separated the lists, we can determine which type of posts receive more comments on average.

In [111]:
totcom_show = 0 #Total comments for ShowHN type post
totcom_ask = 0 #Total comments for AskHN type post
for row in aposts:
    comments = int(row[4])
    totcom_ask += comments
avgaskcom = int(totcom_ask/len(aposts)) #Average comments for ask type post
for row in sposts:
    comments = int(row[4])
    totcom_show += comments
avgshowcom = int(totcom_show/len(sposts)) #Average comments for show type post

print("The average number of comments for an ask post is :",avgaskcom)
print("The average number of comments for a show post is :",avgshowcom)

The average number of comments for an ask post is : 14
The average number of comments for a show post is : 10


We see that the average number of comments for an ask post is higher, which could potentially mean that user engagement is higher when they can contribute something to the conversation or provide some help.

In the case of a show post, one reason for the lower number of comments could be that users simply view the project, give it a point and move on.
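
The averages above are truncated with `int()`, which hides any fractional difference between the two groups. A small helper that keeps the fraction could look like the sketch below. The name `average_column` and the sample rows are hypothetical and only illustrate the idea; the rows follow the dataset's column layout.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored as strings in column `index`."""
    if not rows:
        raise ValueError("cannot average an empty list of rows")
    total = sum(int(row[index]) for row in rows)
    return total / len(rows)


# Hypothetical sample rows in the same column layout as the dataset:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "Ask HN: Example A", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "28", "29", "user_b", "11/22/2015 13:43"],
]

avg_comments = average_column(sample_rows, 4)  # num_comments is column 4
print(round(avg_comments, 2))
```

Calling the same helper with index `3` would average the `num_points` column instead.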

# Determining the Number of AskHN Posts and Comments Each Hour

Since AskHN posts receive more comments on average, we will restrict this part of the study to the AskHN list.

We will create frequency tables to store the posts and comments per hour.

In [121]:
import datetime as dt
countposts_hr = {} #store posts per hour
countcom_hr = {} #store comments per hour
date_format = "%m/%d/%Y %H:%M"
for row in aposts:
    dtime = row[6]
#    comments are stored as strings, so we convert them to int
    comhr = int(row[4])
#    parsing time according to the stored format and then extracting only the hour
    time = dt.datetime.strptime(dtime,date_format).strftime("%H")
    if time in countposts_hr:
        countposts_hr[time] += 1
    else:
        countposts_hr[time] = 1
        
    if time in countcom_hr:
        countcom_hr[time] += comhr
    else:
        countcom_hr[time] = comhr

print ("The number of comments per hour for AskHN posts: \n",countcom_hr,'\n')
print ("The number of AskHN posts per hour : \n",countposts_hr,'\n')

The number of comments per hour for AskHN posts: 
 {'21': 1745, '00': 447, '23': 543, '13': 1253, '14': 1416, '04': 337, '11': 641, '10': 793, '12': 687, '02': 1381, '22': 479, '03': 421, '06': 397, '18': 1439, '09': 251, '07': 267, '08': 492, '15': 4477, '05': 464, '16': 1814, '01': 683, '19': 1188, '20': 1722, '17': 1146} 

The number of AskHN posts per hour : 
 {'21': 109, '00': 55, '23': 68, '13': 85, '14': 107, '04': 47, '11': 58, '10': 59, '12': 73, '02': 58, '22': 71, '03': 54, '06': 44, '18': 109, '09': 45, '07': 34, '08': 48, '15': 116, '05': 46, '16': 108, '01': 60, '19': 110, '20': 80, '17': 100} 



Now that we have the comments and posts per hour, we can calculate the average number of comments per AskHN post for each hour.

In [114]:
avg= []
for hr in countcom_hr:
    # Append the hour and average comments to a list avg
    avg.append([hr,int(countcom_hr[hr]/countposts_hr[hr])])
print(avg)

[['21', 16], ['00', 8], ['23', 7], ['13', 14], ['14', 13], ['04', 7], ['11', 11], ['10', 13], ['12', 9], ['02', 23], ['22', 6], ['03', 7], ['06', 9], ['18', 13], ['09', 5], ['07', 7], ['08', 10], ['15', 38], ['05', 10], ['16', 16], ['01', 11], ['19', 10], ['20', 21], ['17', 11]]
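
The two frequency tables and the per-hour average can also be built in a single pass with `collections.defaultdict`, which removes the `if ... in ... else` branches. This is a sketch under stated assumptions rather than the notebook's own code: `average_by_hour` and the sample rows are hypothetical, and the result keeps the fractional part instead of truncating it.

```python
import datetime as dt
from collections import defaultdict

DATE_FORMAT = "%m/%d/%Y %H:%M"


def average_by_hour(rows, value_index, date_index=6):
    """Return {hour: mean value per post} for posts grouped by creation hour."""
    post_counts = defaultdict(int)   # number of posts created in each hour
    value_totals = defaultdict(int)  # summed values (e.g. comments) in each hour
    for row in rows:
        hour = dt.datetime.strptime(row[date_index], DATE_FORMAT).strftime("%H")
        post_counts[hour] += 1
        value_totals[hour] += int(row[value_index])
    return {hour: value_totals[hour] / post_counts[hour] for hour in post_counts}


# Hypothetical sample rows in the dataset's column layout.
sample_rows = [
    ["1", "Ask HN: Example A", "", "2", "6", "user_a", "8/16/2016 9:55"],
    ["2", "Ask HN: Example B", "", "28", "10", "user_b", "11/22/2015 9:10"],
    ["3", "Ask HN: Example C", "", "1", "3", "user_c", "8/2/2016 14:20"],
]

hourly_avg = average_by_hour(sample_rows, value_index=4)
print(hourly_avg)
```

Passing `value_index=3` would give the average points per post for each hour instead of comments.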


# Correcting and Sorting the List by Swapping the Hours and Average Comments

In [73]:
#Swapping hours and average comments
swapped = []
for num in avg:
    swapped.append([num[1],num[0]])
#Sorting the list in descending order
swapped=sorted(swapped,reverse = True)
swapped

[[38, '15'],
 [23, '02'],
 [21, '20'],
 [16, '21'],
 [16, '16'],
 [14, '13'],
 [13, '18'],
 [13, '14'],
 [13, '10'],
 [11, '17'],
 [11, '11'],
 [11, '01'],
 [10, '19'],
 [10, '08'],
 [10, '05'],
 [9, '12'],
 [9, '06'],
 [8, '00'],
 [7, '23'],
 [7, '07'],
 [7, '04'],
 [7, '03'],
 [6, '22'],
 [5, '09']]
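
Swapping the columns is one way to sort by the average. As an alternative sketch, `sorted` accepts a `key` function, so the `[hour, average]` pairs can be ranked without building a second list. The list `avg_by_hour` below is a hypothetical subset of the values shown above.

```python
# A few [hour, average] pairs in the same shape as the avg list.
avg_by_hour = [["21", 16], ["00", 8], ["15", 38], ["02", 23], ["16", 16]]

# Sort by the average (second element), largest first, without swapping columns.
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, average in ranked[:3]:
    print(f"{hour}:00 has {average} comments on average")
```

One difference to keep in mind: with a `key`, hours that share the same average keep their original relative order, whereas the swapped list breaks ties by comparing the hour strings.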

We now calculate the top 5 hours with the highest comments on average.

In [115]:
print("Top 5 Hours with Highest comments on average for AskHN posts:")
#Using avg_com as the loop variable so that the avg list is not overwritten
for avg_com,hr in swapped[:5]:
    print(dt.datetime.strptime(hr,"%H").strftime("%H:%M"),"has",avg_com,"comments on average")

Top 5 Hours with Highest comments on average for AskHN posts:
15:00 has 38 comments on average
02:00 has 23 comments on average
20:00 has 21 comments on average
21:00 has 16 comments on average
16:00 has 16 comments on average


The hour that receives the most comments per post on average is 15:00 EST, with 38 comments on average.
According to the documentation, this time is Eastern Time. UK time is normally 5 hours ahead of Eastern Time (GMT is 5 hours ahead of EST, and BST, British Summer Time, is 5 hours ahead of EDT).
Example: 15:00 Eastern Time would be 20:00 in the UK.
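
The hour conversion can be checked with the standard `datetime` module. The sketch below assumes fixed offsets (EST as UTC-5 and GMT as UTC+0) and deliberately ignores daylight saving time; the function name `eastern_hour_to_uk` is hypothetical.

```python
import datetime as dt

# Assumption: fixed offsets, ignoring daylight saving time on both sides.
EASTERN = dt.timezone(dt.timedelta(hours=-5), "EST")
UK = dt.timezone(dt.timedelta(hours=0), "GMT")


def eastern_hour_to_uk(hour):
    """Convert an hour of the day (0-23) from EST to GMT, returned as 'HH:MM'."""
    # The date is arbitrary; only the time of day matters here.
    eastern_time = dt.datetime(2016, 1, 1, hour, 0, tzinfo=EASTERN)
    return eastern_time.astimezone(UK).strftime("%H:%M")


print(eastern_hour_to_uk(15))  # 20:00
print(eastern_hour_to_uk(23))  # 04:00 (on the following day)
```

Because both regions shift their clocks by one hour in summer, the 5-hour difference holds for most of the year, apart from the short periods when only one of them has changed.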

# Determining the Average Number of Points for ShowHN and AskHN Posts

In [123]:
totp_show = 0 #Total points for ShowHN posts
totp_ask = 0 #Total points for AskHN posts
for row in aposts:
    points = int(row[3])
    totp_ask += points
avgaskp = int(totp_ask/len(aposts))
for row in sposts:
    points = int(row[3])
    totp_show += points
avgshowp = int(totp_show/len(sposts))

print("The average number of points for an AskHN post is :",avgaskp)
print("The average number of points for a ShowHN post is :",avgshowp)

The average number of points for an AskHN post is : 15
The average number of points for a ShowHN post is : 27


As we had assumed earlier, in the case of points a ShowHN post does better: it receives 27 points on average, 12 more than the 15 points of an AskHN post.

Hence we will continue with ShowHN posts in this case.

# Determining the Number of ShowHN Posts and Points Each Hour

As before, we now create frequency tables for ShowHN posts and points per hour.

In [122]:
import datetime as dt
countposts_hr1 = {} #store posts per hour
countp_hr = {} #store points per hour
date_format = "%m/%d/%Y %H:%M"
for row in sposts:
    dtime = row[6]
#    points are stored as strings, so we convert it to int
    phr = int (row[3])
#    parsing time according to the format and then extracting only hour
    time1 = dt.datetime.strptime(dtime,date_format).strftime("%H")
    if time1 in countposts_hr1:
        countposts_hr1[time1] += 1
    else:
        countposts_hr1[time1] = 1
        
    if time1 in countp_hr:
        countp_hr[time1] += phr
    else:
        countp_hr[time1] = phr

print ("The number of points per hour for ShowHN posts: \n",countp_hr,'\n')
print ("The number of ShowHN posts per hour : \n",countposts_hr1,'\n')

The number of points per hour for ShowHN posts: 
 {'21': 866, '00': 1173, '23': 1526, '13': 2438, '14': 2187, '04': 386, '15': 2228, '10': 681, '12': 2543, '02': 340, '22': 1856, '03': 679, '06': 375, '18': 2215, '09': 553, '07': 494, '08': 519, '11': 1480, '05': 104, '16': 2634, '01': 700, '19': 1702, '20': 1819, '17': 2521} 

The number of ShowHN posts per hour : 
 {'21': 47, '00': 31, '23': 36, '13': 99, '14': 86, '04': 26, '15': 78, '10': 36, '12': 61, '02': 30, '22': 46, '03': 27, '06': 16, '18': 61, '09': 30, '07': 26, '08': 34, '11': 44, '05': 19, '16': 93, '01': 28, '19': 55, '20': 60, '17': 93} 



Now that we have the points and posts per hour, we can calculate the average number of points per ShowHN post for each hour.

In [124]:
avg1= []
for hr in countp_hr:
    #Calculating average points per hour and appending the same to a list avg1
    avg1.append([hr,int(countp_hr[hr]/countposts_hr1[hr])])
print(avg1)

[['21', 18], ['00', 37], ['23', 42], ['13', 24], ['14', 25], ['04', 14], ['15', 28], ['10', 18], ['12', 41], ['02', 11], ['22', 40], ['03', 25], ['06', 23], ['18', 36], ['09', 18], ['07', 19], ['08', 15], ['11', 33], ['05', 5], ['16', 28], ['01', 25], ['19', 30], ['20', 30], ['17', 27]]


# Correcting and Sorting the List by Swapping the Hours and Average Points

In [125]:
#Swapping hours and average points
swapped1 = []
for num in avg1:
    swapped1.append([num[1],num[0]])
#Sorting in descending order of points
swapped1=sorted(swapped1,reverse = True)
swapped1

[[42, '23'],
 [41, '12'],
 [40, '22'],
 [37, '00'],
 [36, '18'],
 [33, '11'],
 [30, '20'],
 [30, '19'],
 [28, '16'],
 [28, '15'],
 [27, '17'],
 [25, '14'],
 [25, '03'],
 [25, '01'],
 [24, '13'],
 [23, '06'],
 [19, '07'],
 [18, '21'],
 [18, '10'],
 [18, '09'],
 [15, '08'],
 [14, '04'],
 [11, '02'],
 [5, '05']]

We now calculate the top 5 hours with the highest points on average.

In [109]:
print("Top 5 Hours with Highest points on average for ShowHN posts:")
#Using avg_p as the loop variable so that the avg1 list is not overwritten
for avg_p,hr in swapped1[:5]:
    print(dt.datetime.strptime(hr,"%H").strftime("%H:%M"),"has",avg_p,"points on average")

Top 5 Hours with Highest points on average for ShowHN posts:
23:00 has 42 points on average
12:00 has 41 points on average
22:00 has 40 points on average
00:00 has 37 points on average
18:00 has 36 points on average


From this we can conclude that 23:00 Eastern Time (4:00 the next day in UK time) is the hour in which ShowHN posts receive the most points on average.

# Conclusion

From our analysis, we can conclude the following:

1. AskHN posts receive more comments than ShowHN posts.

2. ShowHN posts receive more points than AskHN posts.

3. In order to maximize the comments received, we should write the post as an AskHN post and publish it at 15:00 Eastern Time (20:00 UK time). However, it should be kept in mind that our dataset excluded posts without comments and is only a random sample of the remaining submissions.

4. In order to maximize the points received, we should write the post as a ShowHN post and publish it at 23:00 Eastern Time (4:00 the next day in UK time).