# Hacker News Analysis

This analysis uses a dataset of posts from the popular website Hacker News. Each post has the following attributes:
<ul>
    <li>id: The unique identifier from Hacker News for the post</li>
    <li>title: The title of the post</li>
    <li>url: The URL that the post links to, if the post has a URL</li>
    <li>num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes</li>
    <li>num_comments: The number of comments that were made on the post</li>
    <li>author: The username of the person who submitted the post</li>
    <li>created_at: The date and time at which the post was submitted</li>
</ul>
<br>
We compare two types of posts, Ask HN and Show HN, which deal with questions and showcases, respectively.

### Reading Data

In [1]:
import csv

In [5]:
# Reading the .csv file and converting the data into a list of lists.
# The with-block closes the file once the rows have been read.
with open("hacker_news.csv", "r", newline="") as f:
    hn = list(csv.reader(f))

# Displaying the first 5 rows
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

We can see that the first row contains the column headers for the table. We need to separate this header row from the data rows.

### Separating Header Row

In [6]:
# Storing the header row
headers = hn[0]

# Removing the header row from the dataset
hn = hn[1:]
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
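
The later cells index rows by position (`row[1]`, `post[4]`, `post[6]`). As a minimal sketch of where those numbers come from, the positions can be looked up from the header row by name. The names `col_index`, `TITLE`, `NUM_COMMENTS`, and `CREATED_AT` are illustrative and not part of the original notebook.

```python
# Header row as shown in the notebook output above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position (col_index is a hypothetical helper name).
col_index = {name: position for position, name in enumerate(headers)}

TITLE = col_index['title']
NUM_COMMENTS = col_index['num_comments']
CREATED_AT = col_index['created_at']

# First data row from the dataset preview.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

print(row[TITLE], int(row[NUM_COMMENTS]), row[CREATED_AT])
```

Named positions make it easier to see which column each later step reads, at the cost of a few extra lines.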

### Displaying the First 5 rows

In [7]:
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

The data no longer contains the column headers.

### Splitting into separate lists of lists

Looking at the dataset, we can see that there are three types of posts:
<ul>
    <li>Ask Posts</li>
    <li>Show Posts</li>
    <li>Other Posts</li>
</ul>
We split the rows into a separate list for each of these categories.

In [22]:
# Initializing lists to store the different kinds of posts

ask_posts = []
show_posts = []
other_posts = []

In [23]:
# Iterating over each row in the dataset
for row in hn:
    
    # Extracting the post title, lowercased for case-insensitive matching
    title = row[1].lower()
    
    # Posts are added to their respective lists
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

Checking the number of posts of each type.

In [24]:
len(ask_posts), len(show_posts), len(other_posts)

(1744, 1162, 17194)
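
The prefix test used for the split can be isolated in a small function, which makes the case-insensitive matching easy to check on its own. This is only a sketch: `classify_title` is a hypothetical helper, and the first two titles are made up for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# The first two titles are invented; the last appears in the dataset preview.
for title in ['Ask HN: Example question',
              'SHOW HN: Example project',
              'Interactive Dynamic Video']:
    print(title, '->', classify_title(title))
```

Lowercasing before calling `startswith` means that titles written as "Ask HN", "ask hn", or "ASK HN" all land in the same category.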

### Calculating average comments for Ask HN posts

In [26]:
# Finding the total number of comments

total_ask_comments = 0
for post in ask_posts:
    total_ask_comments += int(post[4])
total_ask_comments

24483

In [27]:
# Dividing the total comments by the length of the list

avg_ask_comments = total_ask_comments/len(ask_posts)
print(avg_ask_comments)

14.038417431192661


### Calculating average comments for Show HN posts

In [28]:
# Finding the total number of comments

total_show_comments = 0
for post in show_posts:
    total_show_comments += int(post[4])
total_show_comments

11988

In [29]:
# Dividing the total comments by the length of the list

avg_show_comments = total_show_comments/len(show_posts)
print(avg_show_comments)

10.31669535283993


On average, Ask HN posts receive more comments (about 14.04 per post) than Show HN posts (about 10.32 per post).
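
The two average calculations follow the same steps, so they could share one function. The sketch below assumes the comment count sits at index 4, as in this dataset. `average_comments` is a hypothetical helper, and the sample rows are invented for illustration.

```python
def average_comments(posts, comments_index=4):
    """Return the mean of the comments column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)

# Made-up rows in the dataset's column order, for illustration only.
sample_posts = [
    ['1', 'Ask HN: Example question one', '', '5', '6', 'user_a', '8/16/2016 9:55'],
    ['2', 'Ask HN: Example question two', '', '3', '10', 'user_b', '5/2/2016 10:14'],
]

print(average_comments(sample_posts))  # 8.0
```

In the notebook, the same function could be called once with `ask_posts` and once with `show_posts`.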

### Finding the number of Ask HN posts and comments by hour

Next, we try to determine whether Ask HN posts created at a certain hour are more likely to attract comments.

In [30]:
import datetime as dt

In [36]:
# Extracting the "created at" and "number of comments" information from each row

result_list = []
for post in ask_posts:
    result_list.append([post[6], int(post[4])])
result_list[:5]

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

In [37]:
counts_by_hour = {}
comments_by_hour = {}

In [39]:
# Aggregating data based on hour

for result in result_list:
    date = dt.datetime.strptime(result[0], "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = result[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += result[1]
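
As an alternative sketch of the same aggregation, `collections.defaultdict` removes the need for the first-seen check, and the `hour` attribute of the parsed `datetime` gives the hour directly. Note one difference from the cell above: the keys here are integers (for example `9`) rather than zero-padded strings (`'09'`). The names `sample_results`, `counts`, `comments`, and `averages` are illustrative.

```python
import datetime as dt
from collections import defaultdict

# First rows of result_list from the notebook output above.
sample_results = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
    ['10/15/2015 16:38', 17],
]

# Missing keys start at 0, so no membership check is needed.
counts = defaultdict(int)
comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for each hour seen in the sample.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```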

In [41]:
# Calculating the average comments by hour

avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour]/counts_by_hour[hour]])
avg_by_hour

[['10', 13.440677966101696],
 ['02', 23.810344827586206],
 ['05', 10.08695652173913],
 ['17', 11.46],
 ['01', 11.383333333333333],
 ['03', 7.796296296296297],
 ['07', 7.852941176470588],
 ['12', 9.41095890410959],
 ['11', 11.051724137931034],
 ['19', 10.8],
 ['15', 38.5948275862069],
 ['04', 7.170212765957447],
 ['20', 21.525],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['23', 7.985294117647059],
 ['21', 16.009174311926607],
 ['18', 13.20183486238532],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['16', 16.796296296296298],
 ['14', 13.233644859813085],
 ['13', 14.741176470588234],
 ['09', 5.5777777777777775]]

The list above holds the right information, but sorting it as is would order it by hour, because Python compares lists element by element, starting with the first. To sort by average comments instead, we swap the order of the elements in each list.

In [45]:
# Swapping the columns in the previously calculated list

swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
swap_avg_by_hour

[[13.440677966101696, '10'],
 [23.810344827586206, '02'],
 [10.08695652173913, '05'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [7.796296296296297, '03'],
 [7.852941176470588, '07'],
 [9.41095890410959, '12'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [38.5948275862069, '15'],
 [7.170212765957447, '04'],
 [21.525, '20'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.985294117647059, '23'],
 [16.009174311926607, '21'],
 [13.20183486238532, '18'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [16.796296296296298, '16'],
 [13.233644859813085, '14'],
 [14.741176470588234, '13'],
 [5.5777777777777775, '09']]

Now, we can sort the list.

In [46]:
# Sort the new list

sorted_swap = sorted(swap_avg_by_hour, reverse=True)
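
Swapping the columns is one way to sort by the average. Another, sketched here on a few pairs copied from the `avg_by_hour` output, is to pass a `key` function to `sorted` so that the original `[hour, average]` order is kept. The names `avg_sample` and `by_average` are illustrative.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_sample = [
    ['10', 13.440677966101696],
    ['02', 23.810344827586206],
    ['15', 38.5948275862069],
    ['20', 21.525],
    ['09', 5.5777777777777775],
]

# Sort by the second element of each pair, largest average first.
by_average = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in by_average[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```

Both approaches give the same ranking here; the `key` version simply skips the intermediate swapped list.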

The sorted list now ranks the hours by average comments per Ask HN post. Let's display the top five.

In [50]:
# Displaying the best time to post a question

print("Top 5 Hours for Ask Posts Comments")
for row in sorted_swap[:5]:
    hour = dt.datetime.strptime(row[1], "%H")
    hour = hour.strftime("%H:%M")
    print("{}: {:.2f} average comments per post".format(hour, row[0]))

Top 5 Hours for Ask Posts Comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


In this dataset, the best time to submit an Ask HN post is 3 PM Eastern Standard Time: posts created during that hour have the highest average, at 38.59 comments per post.