# Project Hacker News

Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted on and commented on, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question.

Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just generally something interesting.

We'll compare these two types of posts to determine the following:

* Do Ask HN or Show HN posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

In [1]:
from csv import reader
import pandas as pd

In [2]:
!wget http://bioinf-mw.bihz.upwr.edu.pl/students-data/Data-Cleaning-Advanced/hacker_news.csv

--2023-04-16 13:11:01--  http://bioinf-mw.bihz.upwr.edu.pl/students-data/Data-Cleaning-Advanced/hacker_news.csv
Resolving bioinf-mw.bihz.upwr.edu.pl (bioinf-mw.bihz.upwr.edu.pl)... 156.17.187.238
Connecting to bioinf-mw.bihz.upwr.edu.pl (bioinf-mw.bihz.upwr.edu.pl)|156.17.187.238|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3104774 (3.0M) [text/csv]
Saving to: ‘hacker_news.csv’


2023-04-16 13:11:06 (718 KB/s) - ‘hacker_news.csv’ saved [3104774/3104774]



Opening the data as a list of lists and as a DataFrame

In [3]:
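# Read the whole file into a list of rows; the with block closes the file afterwards
with open('hacker_news.csv') as open_file:
  hn = list(reader(open_file))
hn_main = hn[1:]    # data rows
hn_header = hn[:1]  # header row, kept as a one-row list of lists
df = pd.DataFrame(hn_main)
df_header = pd.DataFrame(hn_header)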

In [4]:
hn_header

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']]
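Because the header names every column, rows can also be read by name instead of by position. The following is only a minimal sketch, assuming the same column layout as above: the two sample rows are copied from the previews in this notebook and inlined (the `sample_csv` name is hypothetical) so the snippet runs without `hacker_news.csv`.

```python
import csv
import io

# Two rows copied from the dataset previews, inlined so the sketch runs standalone.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
    "12296411,Ask HN: How to improve my personal website?,,2,6,ahmedbaracat,8/16/2016 9:55\n"
)

# DictReader takes the first line as field names, so no manual header split is needed.
rows = list(csv.DictReader(sample_csv))

for row in rows:
    # Values are still strings, so numeric columns need an explicit int().
    print(row["title"], int(row["num_comments"]))
```

Reading by name makes later code such as `row["num_comments"]` easier to follow than `row[4]`, at the cost of slightly more memory per row.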

In [5]:
df[:5]

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 2 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 3 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
| 4 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |


A helper function that creates a DataFrame from the given input

In [6]:
def dt(rows):
  """Wrap a list of rows in a DataFrame so it displays as a table."""
  return pd.DataFrame(rows)

Creating empty lists for separating the post types

In [7]:
ask_posts = []
show_posts = []
other_posts = []

Checking whether the title of each post starts with 'ask hn' or 'show hn' (case-insensitive) and appending the post to the matching list; everything else goes to the other list

In [8]:
for row in hn_main:
  title = row[1]
  title_lower = title.lower()

  if (title_lower.startswith('ask hn')):
    ask_posts.append(row)
  elif (title_lower.startswith('show hn')):
    show_posts.append(row)
  else:
    other_posts.append(row)
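The prefix check can be isolated in a small function so it is easy to try on individual titles. This is a sketch only: `classify_title` is a hypothetical helper that is not part of the notebook, and the second example title is made up to show that the match ignores case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    title_lower = title.lower()
    if title_lower.startswith('ask hn'):
        return 'ask'
    if title_lower.startswith('show hn'):
        return 'show'
    return 'other'


# First and third titles come from the dataset previews; the second is made up.
examples = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: A hypothetical project',
    'Interactive Dynamic Video',
]
labels = [classify_title(t) for t in examples]
print(labels)
```

Note that `startswith` only looks at the beginning of the title, so a post that mentions "Ask HN" later in its title is still counted as an other post.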


In [9]:
dt(other_posts[:5])

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 12224879 | Interactive Dynamic Video | http://www.interactivedynamicvideo.com/ | 386 | 52 | ne0phyte | 8/4/2016 11:52 |
| 1 | 11964716 | Florida DJs May Face Felony for April Fools' W... | http://www.thewire.com/entertainment/2013/04/f... | 2 | 1 | vezycash | 6/23/2016 22:20 |
| 2 | 11919867 | Technology ventures: From Idea to Enterprise | https://www.amazon.com/Technology-Ventures-Ent... | 3 | 1 | hswarna | 6/17/2016 0:01 |
| 3 | 10301696 | Note by Note: The Making of Steinway L1037 (2007) | http://www.nytimes.com/2007/11/07/movies/07ste... | 8 | 2 | walterbell | 9/30/2015 4:12 |
| 4 | 10482257 | Title II kills investment? Comcast and other I... | http://arstechnica.com/business/2015/10/comcas... | 53 | 22 | Deinos | 10/31/2015 9:48 |


Checking the number of each post type

In [10]:
print('Number of asking posts: ' + str(len(ask_posts)))
print('Number of showing posts: ' + str(len(show_posts)))
print('Number of other posts: ' + str(len(other_posts)))

Number of asking posts: 1744
Number of showing posts: 1162
Number of other posts: 17193


Counting the total number of comments on ask posts.

In [11]:
total_ask_comm = 0

for row in ask_posts:
  comms = int(row[4])
  total_ask_comm += comms

print('Total number of ask type posts comments: ' + str(total_ask_comm))

Total number of ask type posts comments: 24483
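The running total can also be written with `sum()` and a generator expression. A minimal sketch, assuming rows in the same seven-column shape as `hn_main`; the two sample rows are taken from the `dt(ask_posts[:5])` preview later in this notebook, and `sample_ask_posts` is a hypothetical name.

```python
# Two ask posts in the same column order as hn_main (num_comments is index 4).
sample_ask_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3',
     'sph130', '8/2/2016 14:20'],
]

# Convert each comment count to int and add them up in one pass.
total_comments = sum(int(row[4]) for row in sample_ask_posts)
print('Total number of comments: ' + str(total_comments))
```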


Computing the average number of comments per ask post

In [12]:
avg_ask_comm = round(total_ask_comm/len(ask_posts))
print('Average number of ask type posts comments: ' + str(avg_ask_comm))

Average number of ask type posts comments: 14


Counting the total number of comments on show posts.

In [13]:
total_show_comms = 0

for row in show_posts:
  comms = int(row[4])
  total_show_comms += comms

print('Total number of show type posts comments: ' + str(total_show_comms))

Total number of show type posts comments: 11988


Computing the average number of comments per show post

In [14]:
avg_show_comms = round(total_show_comms/(len(show_posts)))
print('Average number of show type posts comments: ' + str(avg_show_comms))

Average number of show type posts comments: 10


Counting the total number of comments on other posts.

In [15]:
total_other_comm = 0

for row in other_posts:
    comms = int(row[4])
    total_other_comm = total_other_comm + comms
    
print('Total number of other type posts comments: ' + str(total_other_comm))

Total number of other type posts comments: 462045


Computing the average number of comments per other post

In [16]:
avg_other_comm = round(total_other_comm/(len(other_posts)))
print('Average number of other type posts comments: ' + str(avg_other_comm))

Average number of other type posts comments: 27
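Rounding to whole numbers hides how close or far apart the three averages are. As a quick check, the sketch below recomputes them to two decimals from the totals and post counts printed above; the dictionary names are hypothetical and the numbers are copied from the outputs, not recomputed from the file.

```python
# Totals and post counts as printed by the cells above.
totals = {'ask': 24483, 'show': 11988, 'other': 462045}
counts = {'ask': 1744, 'show': 1162, 'other': 17193}

# Divide total comments by number of posts for each post type.
averages = {kind: totals[kind] / counts[kind] for kind in totals}

for kind, avg in averages.items():
    print(f"{kind}: {avg:.2f} average comments per post")
```

To two decimals the averages are 14.04 for ask posts, 10.32 for show posts and 26.87 for other posts.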


In [17]:
dt(ask_posts[:5])

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 12296411 | Ask HN: How to improve my personal website? |  | 2 | 6 | ahmedbaracat | 8/16/2016 9:55 |
| 1 | 10610020 | Ask HN: Am I the only one outraged by Twitter ... |  | 28 | 29 | tkfx | 11/22/2015 13:43 |
| 2 | 11610310 | Ask HN: Aby recent changes to CSS that broke m... |  | 1 | 1 | polskibus | 5/2/2016 10:14 |
| 3 | 12210105 | Ask HN: Looking for Employee #3 How do I do it? |  | 1 | 3 | sph130 | 8/2/2016 14:20 |
| 4 | 10394168 | Ask HN: Someone offered to buy my browser exte... |  | 28 | 17 | roykolak | 10/15/2015 16:38 |


In [18]:
import datetime as dtime

Creating a list that pairs each ask post's creation time with its number of comments

In [19]:
result_ask_list = []
for row in ask_posts:
  created_at = row[6]
  comms = int(row[4])
  result_ask_list.append([created_at, comms])

Creating two dictionaries: one with the number of ask posts created during each hour of the day, and one with the total number of comments those posts received.


In [20]:
ask_counts_by_hour = {}
ask_comms_by_hour = {}

for row in result_ask_list:
  date = row[0]
  comms = row[1]
  hour = dtime.datetime.strptime(date, '%m/%d/%Y %H:%M').strftime('%H')

  if hour not in ask_counts_by_hour:
    ask_comms_by_hour[hour] = comms
    ask_counts_by_hour[hour] = 1
  else:
    ask_comms_by_hour[hour] += comms
    ask_counts_by_hour[hour] += 1
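The hour extraction and the two accumulators can be tried on a handful of values. This is a sketch, not the notebook's method: it swaps the `if hour not in ...` branch for `collections.defaultdict`, which starts every missing key at 0. The `[created_at, comments]` pairs are taken from the ask-post preview above; `sample`, `counts_by_hour` and `comms_by_hour` are hypothetical names.

```python
import datetime as dtime
from collections import defaultdict

# [created_at, num_comments] pairs taken from the ask-post preview.
sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['8/2/2016 14:20', 3],
]

counts_by_hour = defaultdict(int)  # posts created in each hour
comms_by_hour = defaultdict(int)   # comments received by those posts

for created_at, comms in sample:
    # strptime accepts months, days and hours without a leading zero;
    # strftime('%H') then returns the hour as a zero-padded string.
    hour = dtime.datetime.strptime(created_at, '%m/%d/%Y %H:%M').strftime('%H')
    counts_by_hour[hour] += 1
    comms_by_hour[hour] += comms

print(dict(counts_by_hour))
print(dict(comms_by_hour))
```

Because the hour is kept as a zero-padded string, '09' and '13' sort in the expected order when used as dictionary keys.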

Creating a list that contains the average number of comments for each hour

In [21]:
ask_avg_by_hour = []
for row in ask_comms_by_hour:
  ask_avg_by_hour.append([row, ask_comms_by_hour[row]/ask_counts_by_hour[row]])

Swapping the columns so we can sort the list by average comments in descending order

In [22]:
swap_ask_avg_by_hour = []
for row in ask_avg_by_hour:
  swap_ask_avg_by_hour.append([row[1], row[0]])
  
sorted_swap_ask_avg_by_hour = sorted(swap_ask_avg_by_hour, reverse = True)
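Swapping the columns is one way to sort by the average; another is to pass a `key` function to `sorted`. The sketch below uses made-up `[hour, average]` pairs in the same shape as `ask_avg_by_hour`, purely to illustrate the call.

```python
# Hypothetical [hour, average] pairs, in the same shape as ask_avg_by_hour.
avg_by_hour = [['09', 6.0], ['15', 38.5], ['02', 23.75]]

# key picks the average (index 1), so the pairs do not need to be swapped first.
sorted_avg_by_hour = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in sorted_avg_by_hour:
    print(hour, avg)
```

The two approaches differ only when averages tie: the swapped lists fall back to comparing the hour strings, while the `key` version keeps tied entries in their original order.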

Printing the five hours with the highest average number of comments, formatted with f-strings

In [23]:
for row in sorted_swap_ask_avg_by_hour[:5]:
    hour = dtime.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print(f"{hour}: {row[0]:.2f} average comments per post.")

15:00: 38.59 average comments per post.
02:00: 23.81 average comments per post.
20:00: 21.52 average comments per post.
16:00: 16.80 average comments per post.
21:00: 16.01 average comments per post.


In this dataset, ask posts created at 15:00 received the most comments on average (38.59 per post), so a question posted around that hour has the best chance of getting many comments.

Creating a list that pairs each show post's creation time with its number of comments

In [45]:
result_show_list = []
for row in show_posts:
    created_at = row[6]
    comms = int(row[4])
    result_show_list.append([created_at, comms])

Creating two dictionaries: one with the number of show posts created during each hour of the day, and one with the total number of comments those posts received.

In [46]:
show_counts_by_hour = {}
show_comms_by_hour = {}

for row in result_show_list:
    date = row[0]
    comms = row[1]
    hour = dtime.datetime.strptime(date, '%m/%d/%Y %H:%M').strftime('%H')
    
    if hour not in show_counts_by_hour:
        show_comms_by_hour[hour] = comms
        show_counts_by_hour[hour] = 1
    else:
        show_comms_by_hour[hour] += comms
        show_counts_by_hour[hour] += 1

Creating a list that contains the average number of comments for each hour

In [47]:
show_avg_by_hour = []
for row in show_comms_by_hour:
    show_avg_by_hour.append([row, show_comms_by_hour[row]/ show_counts_by_hour[row]])

Swapping the columns so we can sort the list by average comments in descending order

In [48]:
swap_show_avg_by_hour = []
for row in show_avg_by_hour:
    swap_show_avg_by_hour.append([row[1],row[0]])

sorted_swap_show_avg_by_hour = sorted(swap_show_avg_by_hour, reverse = True)

Printing the five hours with the highest average number of comments, formatted with f-strings

In [49]:
for row in sorted_swap_show_avg_by_hour[:5]:
    hour = dtime.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print(f"{hour}: {row[0]:.2f} average comments per post.")

18:00: 15.77 average comments per post.
00:00: 15.71 average comments per post.
14:00: 13.44 average comments per post.
23:00: 12.42 average comments per post.
22:00: 12.39 average comments per post.


In this dataset, show posts created at 18:00 received the most comments on average (15.77 per post), with 00:00 very close behind (15.71).

Creating a list that pairs each other post's creation time with its number of comments

In [52]:
result_other_list = []
for row in other_posts:
    created_at = row[6]
    comms = int(row[4])
    result_other_list.append([created_at, comms])

Creating two dictionaries: one with the number of other posts created during each hour of the day, and one with the total number of comments those posts received.

In [55]:
other_counts_by_hour = {}
other_comms_by_hour = {}

for row in result_other_list:
    date = row[0]
    comms = row[1]
    hour = dtime.datetime.strptime(date, '%m/%d/%Y %H:%M').strftime('%H')
    
    if hour not in other_counts_by_hour:
        other_comms_by_hour[hour] = comms
        other_counts_by_hour[hour] = 1
    else:
        other_comms_by_hour[hour] += comms
        other_counts_by_hour[hour] += 1

Creating a list that contains the average number of comments for each hour

In [56]:
other_avg_by_hour = []

for row in other_comms_by_hour:
  other_avg_by_hour.append([row, other_comms_by_hour[row]/other_counts_by_hour[row]])

Swapping the columns so we can sort the list by average comments in descending order

In [58]:
swap_other_avg_by_hour = []

for row in other_avg_by_hour:
  swap_other_avg_by_hour.append([row[1],row[0]])

sorted_swap_other_avg_by_hour = sorted(swap_other_avg_by_hour, reverse = True)
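The same four steps (pairing, counting, averaging, sorting) are repeated for ask, show and other posts, so they could be folded into one function. The following is a sketch under that assumption: `avg_comments_by_hour` is a hypothetical helper, the first two sample rows come from the ask-post preview, and the third row is made up so that one hour has two posts.

```python
import datetime as dtime


def avg_comments_by_hour(posts):
    """Return [average, hour] pairs sorted by average comments, highest first."""
    counts, comms = {}, {}
    for row in posts:
        # created_at is column 6 and num_comments is column 4, as in hn_main.
        hour = dtime.datetime.strptime(row[6], '%m/%d/%Y %H:%M').strftime('%H')
        counts[hour] = counts.get(hour, 0) + 1
        comms[hour] = comms.get(hour, 0) + int(row[4])
    return sorted(([comms[h] / counts[h], h] for h in counts), reverse=True)


sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3',
     'sph130', '8/2/2016 14:20'],
    # Made-up row so that hour 09 contains two posts.
    ['0', 'Ask HN: A hypothetical question', '', '1', '10', 'someone', '1/1/2016 9:05'],
]

result = avg_comments_by_hour(sample_posts)
for avg, hour in result:
    print(f"{hour}:00: {avg:.2f} average comments per post.")
```

With such a helper, the per-type analysis would reduce to calling it once each on `ask_posts`, `show_posts` and `other_posts`.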

Printing the five hours with the highest average number of comments, formatted with f-strings

In [60]:
for row in sorted_swap_other_avg_by_hour[:5]:
  hour = dtime.datetime.strptime(row[1], '%H').strftime('%H:%M')
  print(f"{hour}: {row[0]:.2f} average comments per post.")

14:00: 32.33 average comments per post.
13:00: 30.90 average comments per post.
12:00: 30.35 average comments per post.
11:00: 29.59 average comments per post.
15:00: 29.52 average comments per post.


In this dataset, other posts created at 14:00 received the most comments on average (32.33 per post), and the top five hours all fall between 11:00 and 15:00.

**Summary**

The data contains:
* 1744 ask posts
* 1162 show posts
* 17193 other posts

 - Ask posts received 24483 comments in total, an average of 14 per post. Ask posts created at 15:00 had the highest average number of comments.

 - Show posts received 11988 comments in total, an average of 10 per post. Show posts created at 18:00 had the highest average number of comments.

 - Other posts received 462045 comments in total, an average of 27 per post. Other posts created at 14:00 had the highest average number of comments.

On average, Ask HN posts receive more comments than Show HN posts (14 vs. 10), and both receive fewer than other posts (27).