# **GUIDED PROJECT 2:  Exploring Hacker News Posts**

**Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.**


## **Opening and cleaning the data**

Let's start by importing the libraries we need and reading the dataset into a list of lists.

In [1]:
from csv import reader

# Read the whole CSV into a list of lists; the with block closes the file afterwards.
with open("hacker_news.csv") as opened_file:
    hn = list(reader(opened_file))
print("First five rows: \n\n", hn[:5])

First five rows: 

 [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


We need to remove the first row (the column names) from our dataset:

In [2]:
header = hn[0]
hn = hn[1:]
print(header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]


## **Extracting Ask HN and Show HN Posts**

We're only concerned with post titles beginning with Ask HN or Show HN, so we'll create new lists of lists containing just the data for those titles.

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    # elif (not a second if) keeps ask posts out of other_posts
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("The number of ask posts is: ", len(ask_posts))
print("The number of show posts is: ", len(show_posts))
print("The number of other posts is: ", len(other_posts))

The number of ask posts is:  1744
The number of show posts is:  1162
The number of other posts is:  17194
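
The same split can also be written with a small helper that returns exactly one label per title. This is only a minimal sketch: `classify_title`, `sample_rows`, and `groups` are hypothetical names and data introduced for illustration, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Hypothetical sample rows in the same column order as the dataset.
sample_rows = [
    ["1", "Ask HN: How do you learn Python?", "", "5", "3", "alice", "8/4/2016 11:52"],
    ["2", "Show HN: My weekend project", "http://example.com", "10", "2", "bob", "8/4/2016 12:10"],
    ["3", "Interactive Dynamic Video", "http://example.com/v", "386", "52", "carol", "8/4/2016 13:00"],
]

# Each row is appended to exactly one of the three groups.
groups = {"ask": [], "show": [], "other": []}
for row in sample_rows:
    groups[classify_title(row[1])].append(row)

for label, rows in groups.items():
    print(label, len(rows))
```

Because every title gets a single label, a post cannot end up in two groups at once.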


## **Calculating the Average Number of Comments for Ask HN and Show HN**

Next, let's determine if ask posts or show posts receive more comments on average:

Firstly, we find the total number of comments in ask posts:

In [4]:
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
avg_ask_comments = total_ask_comments / len(ask_posts)

print("The average number of comments in ask_posts is: ", avg_ask_comments)

The average number of comments in ask_posts is:  14.038417431192661


Secondly, we find the total number of comments in show posts:

In [5]:
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments


avg_show_comments = total_show_comments / len(show_posts)

print("The average number of comments in show_posts is: ", avg_show_comments)

The average number of comments in show_posts is:  10.31669535283993


*Ask posts receive more comments on average (about 14.04 per post versus 10.32 for show posts). A likely reason is that an ask post is a question, so many readers reply to help the author.*
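
Both averages follow the same pattern, so the calculation could be wrapped in one function. The sketch below assumes the notebook's row layout (points at index 3, comments at index 4); `average_column` and `sample_posts` are hypothetical names and data used only for illustration.

```python
def average_column(rows, index):
    """Return the mean of the integer values stored at `index` in each row."""
    if not rows:
        return 0.0  # avoid dividing by zero on an empty list
    return sum(int(row[index]) for row in rows) / len(rows)


# Hypothetical sample rows in the same column order as the dataset.
sample_posts = [
    ["1", "Ask HN: A", "", "5", "3", "alice", "8/4/2016 11:52"],
    ["2", "Ask HN: B", "", "8", "9", "bob", "8/4/2016 12:10"],
]

print(average_column(sample_posts, 4))  # average of the num_comments column
print(average_column(sample_posts, 3))  # average of the num_points column
```

With the real data, the calls would be `average_column(ask_posts, 4)` and `average_column(show_posts, 4)`.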

## **Finding the Number of Ask Posts and Comments by Hour Created**

**We want to determine if ask posts created at a certain time are more likely to attract comments.**

Let's calculate the number of ask posts created in each hour of the day, along with the number of comments received:

In [6]:
import datetime as dt
result_list = []
for row in ask_posts:
    list_2_elements = []
    list_2_elements.append(row[6])
    number_comments = int(row[4])
    list_2_elements.append(number_comments)
    result_list.append(list_2_elements)

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_data = dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    hour_data = date_data.strftime("%H")
    if hour_data not in counts_by_hour:
        counts_by_hour[hour_data] = 1
        comments_by_hour[hour_data] = row[1]
    else:
        counts_by_hour[hour_data] += 1
        comments_by_hour[hour_data] += row[1]
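
As an alternative, the membership check can be dropped by using `collections.defaultdict`, which starts every missing key at zero. This is a sketch under that assumption, not the approach used in this notebook; `sample`, `counts`, and `comments` are hypothetical names and data.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the dataset's date format.
sample = [("8/4/2016 11:52", 3), ("9/30/2015 4:12", 2), ("6/23/2016 11:20", 10)]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, num_comments in sample:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "04"
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```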


Let's show our result dictionaries:

In [7]:
print("The numbers of posting per certain hour: \n")
for key in counts_by_hour:
    print(key, "hour have", counts_by_hour[key], "posts")

The numbers of posting per certain hour: 

09 hour have 45 posts
13 hour have 85 posts
10 hour have 59 posts
14 hour have 107 posts
16 hour have 108 posts
23 hour have 68 posts
12 hour have 73 posts
17 hour have 100 posts
15 hour have 116 posts
21 hour have 109 posts
20 hour have 80 posts
02 hour have 58 posts
18 hour have 109 posts
03 hour have 54 posts
05 hour have 46 posts
19 hour have 110 posts
01 hour have 60 posts
22 hour have 71 posts
08 hour have 48 posts
04 hour have 47 posts
00 hour have 55 posts
06 hour have 44 posts
07 hour have 34 posts
11 hour have 58 posts


In [8]:
print("The numbers of comments in posts in certain hour: \n")
for key in counts_by_hour:
    print(key, "hour have", comments_by_hour[key], "comments")

The numbers of comments in posts in certain hour: 

09 hour have 251 comments
13 hour have 1253 comments
10 hour have 793 comments
14 hour have 1416 comments
16 hour have 1814 comments
23 hour have 543 comments
12 hour have 687 comments
17 hour have 1146 comments
15 hour have 4477 comments
21 hour have 1745 comments
20 hour have 1722 comments
02 hour have 1381 comments
18 hour have 1439 comments
03 hour have 421 comments
05 hour have 464 comments
19 hour have 1188 comments
01 hour have 683 comments
22 hour have 479 comments
08 hour have 492 comments
04 hour have 337 comments
00 hour have 447 comments
06 hour have 397 comments
07 hour have 267 comments
11 hour have 641 comments


**We can see that the most popular hour for creating ask posts is 15:00 (116 posts), and posts created in that hour also receive the largest total number of comments (4,477).**

## **Calculating the Average Number of Comments for Ask HN Posts by Hour**

Let's calculate the average number of comments per post for posts created during each hour of the day:

In [9]:
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

for row in avg_by_hour:
    hour = row[0]
    average_comments = row[1]
    print(hour, "hour have in average", average_comments, "comments")

09 hour have in average 5.5777777777777775 comments
13 hour have in average 14.741176470588234 comments
10 hour have in average 13.440677966101696 comments
14 hour have in average 13.233644859813085 comments
16 hour have in average 16.796296296296298 comments
23 hour have in average 7.985294117647059 comments
12 hour have in average 9.41095890410959 comments
17 hour have in average 11.46 comments
15 hour have in average 38.5948275862069 comments
21 hour have in average 16.009174311926607 comments
20 hour have in average 21.525 comments
02 hour have in average 23.810344827586206 comments
18 hour have in average 13.20183486238532 comments
03 hour have in average 7.796296296296297 comments
05 hour have in average 10.08695652173913 comments
19 hour have in average 10.8 comments
01 hour have in average 11.383333333333333 comments
22 hour have in average 6.746478873239437 comments
08 hour have in average 10.25 comments
04 hour have in average 7.170212765957447 comments
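00 hour have in average 8.127272727272727 comments
06 hour have in average 9.022727272727273 comments
07 hour have in average 7.852941176470588 comments
11 hour have in average 11.051724137931034 comments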
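
The averages are easier to scan when printed in chronological order and rounded. The sketch below uses three of the hourly totals shown earlier; `sample_counts`, `sample_comments`, and `avg` are hypothetical names introduced for illustration.

```python
# Three hours taken from the totals printed above.
sample_counts = {"15": 116, "02": 58, "09": 45}
sample_comments = {"15": 4477, "02": 1381, "09": 251}

# Average comments per post for each hour.
avg = {hour: sample_comments[hour] / sample_counts[hour] for hour in sample_counts}

# Zero-padded hour strings sort chronologically.
for hour in sorted(avg):
    print(f"{hour}:00  {avg[hour]:.2f} average comments per post")
```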

## **Sorting and Printing Values from a List of Lists**

Although we now have the results we need, this format makes it difficult to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read:

In [10]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


Let's sort the swapped list in descending order to see the results:

In [11]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
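
Swapping the columns is one way to sort by the average. Another option, sketched here, is to pass a `key` function to `sorted` and leave each pair in its original `[hour, average]` order. `sample_avg_by_hour` and `top_five` are hypothetical names, and the values are rounded figures from this notebook.

```python
# [hour, average comments per post] pairs, rounded for readability.
sample_avg_by_hour = [
    ["09", 5.58], ["15", 38.59], ["02", 23.81],
    ["20", 21.52], ["16", 16.80], ["21", 16.01],
]

# Sort on the second element of each pair, highest average first.
top_five = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)[:5]

for hour, average in top_five:
    print(f"{hour}:00: {average:.2f} average comments per post")
```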

**Let's print our final results:**

In [12]:
print("Top 5 Hours for Ask Posts comments")
for row in sorted_swap[:5]:
    hour_datetime = dt.datetime.strptime(str(row[1]), "%H")
    print("{hour}: {average:.2f} average comments per post".format(hour = hour_datetime.strftime("%H:%M"), average = row[0]))
          

Top 5 Hours for Ask Posts comments
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


**We can see that an ask post created at 3 p.m. receives the most comments on average (38.59 per post), about 60% more than the second-ranked hour. Posts created at 2 a.m., 8 p.m., 4 p.m., and 9 p.m. also have a higher chance of receiving comments.**

**Keep in mind that we live in the GMT+2 time zone, which is six hours ahead of the dataset's timestamps, so 3 p.m. in the data is 9 p.m. for us. The other hours shift forward by six hours in the same way.**
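
The conversion can be sketched as simple modular arithmetic. The six-hour offset is the assumption stated above, and `shift_hour` is a hypothetical helper introduced for illustration; a real conversion would also need to account for daylight saving time.

```python
def shift_hour(hour, offset_hours):
    """Shift an hour of the day by an offset, wrapping around midnight."""
    return (hour + offset_hours) % 24


# Top five hours from the ranking, shifted by the assumed six-hour offset.
for hour in [15, 2, 20, 16, 21]:
    print(f"{hour:02d}:00 -> {shift_hour(hour, 6):02d}:00")
```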

## **Determining Whether Show or Ask Posts Receive More Points on Average**

In [13]:
show_posts_points = []
ask_posts_points = []

for row in ask_posts:
    points = int(row[3])
    ask_posts_points.append(points)

for row in show_posts:
    points = int(row[3])
    show_posts_points.append(points)

show_posts_average_points = sum(show_posts_points) / len(show_posts_points)
ask_posts_average_points = sum(ask_posts_points) / len(ask_posts_points)

print("The average points of ask posts is ", ask_posts_average_points)
print("The average points of show posts is ", show_posts_average_points)


The average points of ask posts is  15.061926605504587
The average points of show posts is  27.555077452667813


**Show posts receive more points on average than ask posts (about 27.56 versus 15.06), even though they receive fewer comments on average.**

# **Conclusions**

**In this project we explored the Hacker News Posts dataset and determined the top five hours for creating an ask post, ranked by average comments per post:**

| Rank | Hour | Average comments per post |
|------|------|---------------------------|
| 1 | 3 p.m. | 38.59 |
| 2 | 2 a.m. | 23.81 |
| 3 | 8 p.m. | 21.52 |
| 4 | 4 p.m. | 16.80 |
| 5 | 9 p.m. | 16.01 |

**We also determined that show posts receive more points on average, even though ask posts collect more comments.**