# Hacker News Posts Analysis

Hacker News is a social news site that is extremely popular in technology and startup circles. Posts that reach the top of its listings can get hundreds of thousands of visitors as a result.

We will clean the data and analyze its text and date fields using Python.

**Below are descriptions of the columns:**

1. id: The unique identifier from Hacker News for the post
2. title: The title of the post
3. url: The URL that the post links to, if the post has a URL
4. num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
5. num_comments: The number of comments that were made on the post
6. author: The username of the person who submitted the post
7. created_at: The date and time at which the post was submitted

In [1]:
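from csv import reader
# Close the file once it has been read; the first row of hn is the header.
with open('hacker_news.csv', newline='') as f:
    hn = list(reader(f))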

In [2]:
hn[1:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [3]:
headers = hn[0]

In [4]:
hn = hn[1:]

In [5]:
headers

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
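
The loops later in this notebook address columns by position (for example index 4 for `num_comments`). As a minimal sketch, the header row can be turned into a name-to-index lookup so that the positions do not have to be memorized. The name `col` is an assumption introduced here for illustration; the row is copied from the sample output above.

```python
# Header row as printed above.
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Map each column name to its position in a row (hypothetical helper, not in the original notebook).
col = {name: index for index, name in enumerate(headers)}

# First data row from the sample output above.
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Every value is read as a string, so numeric columns need an explicit conversion.
num_points = int(row[col['num_points']])
num_comments = int(row[col['num_comments']])
print(num_points, num_comments)
```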

In [6]:
hn[5]

['10482257',
 'Title II kills investment? Comcast and other ISPs are now spending more',
 'http://arstechnica.com/business/2015/10/comcast-and-other-isps-boost-network-investment-despite-net-neutrality/',
 '53',
 '22',
 'Deinos',
 '10/31/2015 9:48']

### Data Analysis

We are interested in posts whose titles begin with 'Ask HN' or 'Show HN', so we will split the dataset into those two groups plus everything else. We will then compare the comments and points of 'ask' and 'show' posts to see which type attracts more engagement.

In [33]:
ask_posts = []
show_posts = []
other_posts = []

In [34]:
for post in hn:
    title = post[1].lower()  # lowercase once so the prefix check is case-insensitive
    if title.startswith('ask hn'):
        ask_posts.append(post)
    elif title.startswith('show hn'):
        show_posts.append(post)
    else:
        other_posts.append(post)

In [35]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194
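
The split above can also be wrapped in a function, which makes it easy to check that every row lands in exactly one group. This is only a sketch: `split_posts` is a hypothetical name, and the sample rows are made up for illustration and are not taken from the dataset.

```python
def split_posts(rows, title_index=1):
    """Split rows into ask, show and other posts by case-insensitive title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[title_index].lower()
        if title.startswith('ask hn'):
            ask.append(row)
        elif title.startswith('show hn'):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Made-up rows in the same column order as the dataset.
sample_rows = [
    ['1', 'Ask HN: How do you learn Python?', '', '5', '3', 'user_a', '1/1/2016 9:00'],
    ['2', 'Show HN: My weekend project', 'http://example.com', '10', '2', 'user_b', '1/1/2016 10:00'],
    ['3', 'A post about something else', 'http://example.com/x', '1', '0', 'user_c', '1/1/2016 11:00'],
]

ask, show, other = split_posts(sample_rows)

# The three groups should account for every input row.
assert len(ask) + len(show) + len(other) == len(sample_rows)
```

The same sanity check applies to the real data: the three counts printed above add up to 20100 rows.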


##### 1. Calculate the average number of comments and points for 'ask' and 'show' posts

In [10]:
total_ask_comments = 0
total_ask_points = 0

In [11]:
for post in ask_posts:
    total_ask_comments += int(post[4])
    total_ask_points += int(post[3])

In [12]:
avg_ask_comments = total_ask_comments/len(ask_posts)
avg_ask_points = total_ask_points/len(ask_posts)

In [13]:
avg_ask_comments

14.038417431192661

In [14]:
avg_ask_points

15.061926605504587

In [15]:
total_show_comments = 0
total_show_points = 0

In [16]:
for post in show_posts:
    total_show_comments += int(post[4])
    total_show_points += int(post[3])

In [17]:
avg_show_comments = total_show_comments/len(show_posts)
avg_show_points = total_show_points/len(show_posts)

In [18]:
avg_show_comments

10.31669535283993

In [19]:
avg_show_points

27.555077452667813

##### Finding: On average, 'show' posts receive more points than 'ask' posts (27.56 vs. 15.06), while 'ask' posts receive more comments than 'show' posts (14.04 vs. 10.32). This matches what we would expect, since 'ask' posts explicitly invite responses.
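
The two averaging loops above differ only in the list they iterate over, so they could be folded into one helper. The sketch below shows one way to do that. `average_column` is a hypothetical name, and the sample rows are made up for illustration.

```python
def average_column(posts, index):
    """Return the mean of an integer-valued column, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    return sum(int(post[index]) for post in posts) / len(posts)


# Made-up rows: num_points is at index 3 and num_comments at index 4.
sample_posts = [
    ['1', 'Ask HN: first question', '', '4', '2', 'user_a', '1/1/2016 9:00'],
    ['2', 'Ask HN: second question', '', '8', '6', 'user_b', '1/1/2016 10:00'],
]

avg_points = average_column(sample_posts, 3)
avg_comments = average_column(sample_posts, 4)
print(avg_points, avg_comments)
```

Returning 0.0 for an empty list is a design choice made here to avoid a `ZeroDivisionError`; raising an error instead would be equally reasonable.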

##### 2. Now we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

    a. Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
    b. Calculate the average number of comments ask posts receive by hour created.
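
Steps a and b can be tried out on a few made-up rows before running them on the full dataset. The sketch below uses `collections.defaultdict` so that the first post seen for an hour needs no special case. The names `counts`, `comments` and `averages` and the sample values are assumptions for illustration only.

```python
from collections import defaultdict
import datetime as dt

# Made-up (created_at, num_comments) pairs in the dataset's timestamp format.
sample = [
    ('8/4/2016 11:52', 4),
    ('1/26/2016 11:05', 6),
    ('6/23/2016 22:20', 1),
]

counts = defaultdict(int)    # step a: posts created in each hour
comments = defaultdict(int)  # step a: comments received by those posts

for created_at, num_comments in sample:
    # Parse the timestamp and keep only the zero-padded hour, e.g. '11'.
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Step b: average number of comments per post for each hour.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```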


In [36]:
import datetime as dt

In [37]:
result_list = []

In [38]:
for post in ask_posts:
    # Keep only the creation timestamp and the comment count.
    result_list.append([post[6], int(post[4])])

In [39]:
counts_by_hour = {}
comments_by_hour = {}

In [24]:
for created_at, num_comments in result_list:
    date = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = date.strftime("%H")  # zero-padded hour of creation, e.g. '09'
    if hour in counts_by_hour:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comments
    else:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comments

In [25]:
avg_by_hour = []
for k in counts_by_hour:
    avg_by_hour.append([k,comments_by_hour[k]/counts_by_hour[k]])

In [26]:
avg_by_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]

In [27]:
swap_avg_by_hour = []
for r in avg_by_hour:
    swap_avg_by_hour.append([r[1],r[0]])

In [28]:
sorted_swap = sorted(swap_avg_by_hour,reverse=True)

In [29]:
for x in sorted_swap[:5]:
    hour = dt.datetime.strptime(str(x[1]),"%H")
    hour = hour.strftime("%H:%M")
    fmt = "{}: {:.2f} average comments per post"
    print(fmt.format(hour,x[0]))

15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
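
Swapping the columns before sorting works, but `sorted` can also order the original `[hour, average]` pairs directly through its `key` argument. A minimal sketch, using a few pairs copied from the `avg_by_hour` output above; the names `pairs` and `top` are introduced here for illustration.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
pairs = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# Sort by the average (second element), largest first, without swapping columns.
top = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for hour, average in top[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```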


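##### Findings: Ask posts created during the 15:00 hour receive the most comments on average (38.59 per post), well ahead of the next-best hour, 02:00 (23.81). Note that this measures when the posts were created, not when the comments were written.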