# About the project and dataset

This project practices working with datetime data using a Hacker News dataset. Hacker News is a site where users submit posts, and posts that receive many votes or comments attract a high volume of visitors. You can download the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but keep in mind that our analysis used a shorter version of the dataset, with all posts without any comments removed.

Description of data set:

> * id: The unique identifier for the post
> * title: The title of the post
> * url: The URL that the post links to
> * num_points: The number of points that the post acquired
> * num_comments: The number of comments on the post
> * author: The username of the author
> * created_at: The date and time the post was submitted

Our analysis will focus only on posts whose titles begin with either **Ask HN** (ask Hacker News a question) or **Show HN** (show a new project, product, etc.). Our aim is to answer two basic questions:

> 1) Which of **Ask HN** and **Show HN** received more attention?

> 2) Does the submission time have an impact on the number of comments?

### Import libraries and open data set

In [1]:
from csv import reader
import datetime as dt


In [2]:
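# Read the whole CSV into a list of rows; the with-block closes the file afterwards
with open(r"Desktop\New folder\project2\HN_posts_year_to_Sep_26_2016.csv", encoding="utf8") as opened_file:
    hn = list(reader(opened_file))
hn_header = hn[0]  # first row holds the column names
hn = hn[1:]        # remaining rows are the posts
print(hn[:5])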

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]


In [3]:
# Our analysis focuses only on posts whose titles start with "Ask HN" or "Show HN"
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title_low = row[1].lower()  # lowercase so the match is case-insensitive
    if title_low.startswith("ask hn"):
        ask_posts.append(row)
    elif title_low.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print("Number of ask posts: ", len(ask_posts))
print("Number of show posts: ", len(show_posts))

Number of ask posts:  9139
Number of show posts:  10158


In [4]:
print(hn_header)

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
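
The cells that follow refer to columns by position (4 for `num_comments`, 6 for `created_at`). As a small sketch, those positions can be looked up from the header row instead of being hard-coded. The names `comments_idx` and `created_idx` are illustrative and are not part of the original notebook.

```python
# Header row as printed above
hn_header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical names: look up column positions by column name
comments_idx = hn_header.index('num_comments')
created_idx = hn_header.index('created_at')

print(comments_idx, created_idx)  # 4 6
```

With these, a call such as `get_avg_comment(ask_posts, comments_idx)` would read the same column as the literal `4`.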


In [5]:
def get_avg_comment(dataset, index):
    """Return the mean of the integer values in column `index` of `dataset`."""
    total_comments = 0
    for row in dataset:
        total_comments += int(row[index])
    return total_comments / len(dataset)
    

In [6]:
print("Average comments of ask posts: ",get_avg_comment(ask_posts, 4))
print('\n')
print("Average comments of show_posts: ", get_avg_comment(show_posts, 4))

Average comments of ask posts:  10.393478498741656


Average comments of show_posts:  4.886099625910612


We notice that **users pay more attention to ask posts**: their average number of comments (about 10.4) is more than double that of show posts (about 4.9). This seems reasonable, since ask posts invite people to give answers, while show posts appeal only to people who are interested in the information.
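
As a quick check of the "double" comparison, the ratio can be computed directly from the two averages printed above. This is only a sketch; the variable names are illustrative.

```python
# Averages copied from the output of the previous cell
avg_ask = 10.393478498741656
avg_show = 4.886099625910612

# How many times more comments an ask post receives on average
ratio = avg_ask / avg_show
print(round(ratio, 2))  # 2.13
```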

We will now investigate whether posts created at a certain time attract more comments. To answer that, we use the "created_at" column and calculate the average number of comments by hour of creation.
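
Before grouping, each `created_at` string has to be turned into an hour. A minimal sketch of that step, using one timestamp taken from the sample rows printed earlier:

```python
import datetime as dt

# A timestamp in the same format as the created_at column
created_at = "9/26/2016 3:26"

# When parsing, %m, %d and %H also accept values without leading zeros
parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")

# strftime("%H") returns the zero-padded hour as a string, used as the grouping key
hour = parsed.strftime("%H")
print(hour)  # 03
```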

In [7]:
def avg_by_hour(dataset, index1, index2):
    """Print the average comments per post for each creation hour, highest first.

    index1 is the column of the creation timestamp, index2 the column of the comment count.
    """
    counts_by_hour = {}    # hour -> number of posts
    comments_by_hour = {}  # hour -> total comments
    for row in dataset:
        created = dt.datetime.strptime(row[index1], "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        comment_number = int(row[index2])
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = comment_number
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += comment_number
    # Named averages (not avg_by_hour) to avoid shadowing the function itself
    averages = []
    for hour in comments_by_hour:
        averages.append([comments_by_hour[hour] / counts_by_hour[hour], hour])
    # Sort by average (the first element), largest first
    for average, hour in sorted(averages, reverse=True):
        print("{0}: {1:.2f} average comments per post".format(hour, average))
 

In [8]:
avg_by_hour(ask_posts, 6, 4)

15: 28.68 average comments per post
13: 16.32 average comments per post
12: 12.38 average comments per post
02: 11.14 average comments per post
10: 10.68 average comments per post
04: 9.71 average comments per post
14: 9.69 average comments per post
17: 9.45 average comments per post
08: 9.19 average comments per post
11: 8.96 average comments per post
22: 8.80 average comments per post
05: 8.79 average comments per post
20: 8.75 average comments per post
21: 8.69 average comments per post
03: 7.95 average comments per post
18: 7.94 average comments per post
16: 7.71 average comments per post
00: 7.56 average comments per post
01: 7.41 average comments per post
19: 7.16 average comments per post
07: 7.01 average comments per post
06: 6.78 average comments per post
23: 6.70 average comments per post
09: 6.65 average comments per post


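The result shows that ask posts created in the 15:00 hour (around 3 PM) attracted by far the most comments on average (28.68 per post), followed by 13:00 and 12:00. The lowest averages fall at 09:00, 23:00 and 06:00, although some overnight hours such as 02:00 still rank high, so the pattern is not strictly tied to the time of day.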

In [9]:
avg_by_hour(show_posts, 6, 4)

12: 6.99 average comments per post
07: 6.68 average comments per post
11: 6.00 average comments per post
08: 5.60 average comments per post
14: 5.52 average comments per post
13: 5.43 average comments per post
02: 5.15 average comments per post
04: 5.04 average comments per post
19: 5.02 average comments per post
18: 4.94 average comments per post
06: 4.71 average comments per post
16: 4.71 average comments per post
09: 4.67 average comments per post
00: 4.65 average comments per post
15: 4.57 average comments per post
03: 4.53 average comments per post
23: 4.53 average comments per post
17: 4.25 average comments per post
20: 4.16 average comments per post
21: 4.09 average comments per post
01: 4.07 average comments per post
22: 3.85 average comments per post
10: 3.80 average comments per post
05: 3.44 average comments per post


With show posts, the hours with the highest average number of comments were 12:00 (noon) and 07:00. However, the difference in average comments between hours is much smaller than for ask posts.
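
To put a rough number on that comparison, the gap between the best and worst hour can be computed from the averages printed above. This is only a sketch; the variable names are illustrative.

```python
# Highest and lowest hourly averages, copied from the two outputs above
ask_max, ask_min = 28.68, 6.65
show_max, show_min = 6.99, 3.44

# Spread between the best and worst hour for each post type
ask_spread = round(ask_max - ask_min, 2)
show_spread = round(show_max - show_min, 2)
print(ask_spread, show_spread)  # 22.03 3.55
```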