# Finding the Best Time to Post to Get More Comments
In this project, we'll work with a dataset of submissions to the popular technology site [Hacker News](https://news.ycombinator.com/).
Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com/), where user-submitted stories (known as "posts") receive votes and comments, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.

You can find the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts). It contains almost 300,000 rows. Below are descriptions of the columns:

- **id**: the unique identifier from Hacker News for the post
- **title**: the title of the post
- **url**: the URL that the post links to, if the post has a URL
- **num_points**: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
- **num_comments**: the number of comments on the post
- **author**: the username of the person who submitted the post
- **created_at**: the date and time of the post's submission

We're specifically interested in posts with titles that begin with either **Ask HN** or **Show HN**. Users submit Ask HN posts to ask the Hacker News community a specific question. Likewise, users submit Show HN posts to show the Hacker News community a project, product, or just something interesting. 
We'll compare these two types of posts to determine the following:

- Do **Ask HN** or **Show HN** posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

Let's start by importing the library we need, reading the dataset into a list of lists named `hn_data`, and displaying the first five rows.

## Cleaning data

In [2]:
from csv import reader
# reading .csv file and transforming data into list of lists
opened_file = open('hacker_news.csv', encoding="utf8")
read_file = reader(opened_file)
hn_data = list(read_file)
# demonstrating first 5 rows
for row in hn_data[:5]:
    print(row)
    print('\n')
print('Number of rows in dataset:', len(hn_data))

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']


['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24']


['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19']


['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16']


Number of rows in dataset: 293120
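
The cell above leaves the file handle open. As a minimal sketch (not the original notebook code), the same read can be done inside a `with` block, which closes the file automatically. The names `SAMPLE_CSV` and `read_rows` are hypothetical, and the in-memory sample only stands in for `hacker_news.csv` so that the example runs on its own.

```python
import csv
import io

# Hypothetical two-line sample standing in for hacker_news.csv
SAMPLE_CSV = (
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12579005,SQLAR  the SQLite Archiver,"
    "https://www.sqlite.org/sqlar/doc/trunk/README.md,1,0,blacksqr,9/26/2016 3:24\n"
)

def read_rows(file_obj):
    """Return (header, data_rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# With the real file this would be:
#   with open('hacker_news.csv', encoding='utf8', newline='') as f:
with io.StringIO(SAMPLE_CSV) as f:
    header, data = read_rows(f)

print(header)
print('Number of data rows:', len(data))
```

Returning the header separately from the data rows means later steps never have to treat the header as a post.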


We can see that the posts shown above have 0 (zero) comments. Since our goal is to examine posts that attract comments, we will remove posts without any comments from the dataset.

In [3]:
# collecting rows with comments in separate list 'hn'
hn = []
for row in hn_data:
    if row[4] != '0':
        hn.append(row)

# checking if there are rows with '0' points
number_points_0 = 0
for row in hn:
    if row[3] == '0':
        number_points_0 += 1
print("Number of rows with '0' points:", number_points_0)        

print('Number of rows in dataset:', len(hn))  

print('First 5 rows:')

for row in hn[:5]:
    print(row)

Number of rows with '0' points: 0
Number of rows in dataset: 80402
First 5 rows:
['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3:13']
['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53']
['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26']
['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54']


We reduced our dataset to 80,402 rows, a count that still includes the header row.

Let's extract the header row and assign it to the variable *headers*. Next, we remove the header row from *hn* and display the first five rows to check that the header row was removed.

In [5]:
headers = hn[0]
hn = hn[1:]
print(hn[:5], '\n')
print('Title values:', headers)

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-dont-find-you-the-best-deals/', '1', '1', 'yarapavan', '9/26/2016 2:26'], ['12578694', 'Emergency dose of epinephrine that does not cost an arm and a leg', 'http://m.imgur.com/gallery/th6Ua', '2', '1', 'dredmorbius', '9/26/2016 1:54'], ['12578624', 'Phone Makers Could Cut Off Drivers. So Why Dont They?', 'http://www.nytimes.com/2016/09/25/technology/phone-makers-could-cut-off-drivers-so-why-dont-they.html', '4', '1', 'danso', '9/26/2016 1:37'], ['12578556', 'OpenMW, Open Source Elderscrolls III: Morrowind Reimplementation', 'https://openmw.org/en/', '32', '3', 'rocky1138', '9/26/2016 1:24']] 

Title values: ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-with-your-style-feed/', '1', '1', 'bdoux', '9/26/2016 3
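
Because `hn = hn[1:]` overwrites `hn`, running that cell more than once slices off one more row each time. A sketch of one way to make the step safe to re-run is to always derive the result from the untouched `hn_data`. The function name `split_and_filter` and the miniature dataset below are hypothetical and only illustrate the idea.

```python
# Hypothetical miniature of hn_data: the header plus three posts
hn_data = [
    ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
    ['1', 'First post', '', '1', '0', 'user_a', '9/26/2016 3:26'],
    ['2', 'Second post', '', '4', '7', 'user_b', '9/26/2016 2:53'],
    ['3', 'Third post', '', '2', '1', 'user_c', '9/26/2016 1:54'],
]

def split_and_filter(rows):
    """Return the header and the data rows that have at least one comment."""
    header, data = rows[0], rows[1:]
    commented = [row for row in data if int(row[4]) > 0]
    return header, commented

headers, hn = split_and_filter(hn_data)
# A second run gives the same result because hn_data is never modified
headers_again, hn_again = split_and_filter(hn_data)

print('Header:', headers)
print('Rows with comments:', len(hn))
```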

## Distributing posts by titles (topics)

Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create new lists of lists containing just the data for those titles: `ask_posts` for rows whose titles start with Ask HN, `show_posts` for rows whose titles start with Show HN, and `other_posts` for the remaining rows.

To do this, we use the `startswith()` method. To make sure the matching is case-insensitive, we first convert each title to lowercase with the `lower()` method.

In [6]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    # checking if the title starts with 'ask hn' (case-insensitive)
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    # otherwise, checking if the title starts with 'show hn' (case-insensitive)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    # all remaining rows go to the 'other_posts' list of lists
    else:
        other_posts.append(row)
print(len(ask_posts)) 
print(ask_posts[:3])
print('\n')
print(len(show_posts))
print(show_posts[:3])
print('\n')
print(len(other_posts)) 
print(other_posts[:3])
print('\n')

6911
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48']]


5059
[['12577142', 'Show HN: Jumble  Essays on the go #PaulInYourPocket', 'https://itunes.apple.com/us/app/jumble-find-startup-essay/id1150939197?ls=1&mt=8', '1', '1', 'ryderj', '9/25/2016 20:06'], ['12576813', 'Show HN: Learn Japanese Vocab via multiple choice questions', 'http://japanese.vul.io/', '1', '1', 'soulchild37', '9/25/2016 19:06'], ['12576090', 'Show HN: Markov chain Twitter bot. Trained on comments left on Pornhub', 'https://twitter.com/botsonasty', '3', '1', 'keepingscore', '9/25/2016 16:50']]


68430
[['12578822', 'Amazons Algorithms Dont Find You the Best Deals', 'https://www.technologyreview.com/s/602442/amazons-algorithms-do
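
The prefix check can also be isolated in a small helper, which makes the rule easy to test on its own. This is only a sketch; `classify_title` is a hypothetical name, and the second sample title has been lowercased on purpose to show that the match ignores case.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

samples = [
    'Ask HN: What TLD do you use for local development?',
    'show hn: Learn Japanese Vocab via multiple choice questions',
    'algorithmic music',
]
labels = [classify_title(title) for title in samples]
print(labels)
```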

## Calculating average number of comments

Now let's determine whether ask posts or show posts receive more comments on average.

In [7]:
# creating variable total_ask_comments to count total amount of comments for ask posts
total_ask_comments = 0
for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments
# calculating average number of comments for ask_post
avg_ask_comments = total_ask_comments / len(ask_posts)
print(round(avg_ask_comments,3))

13.744


In [8]:
# creating variable total_show_comments to count total amount of comments for show posts
total_show_comments = 0
for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments
# calculating average number of comments for show_post    
avg_show_comments = total_show_comments / len(show_posts)
print(round(avg_show_comments,3))

9.811


Let's check what is going on in the category of *other posts*: how many comments on average do people leave?

In [10]:
# creating variable total_other_comments to count total amount of comments for other posts
total_other_comments = 0
for row in other_posts:
    num_comments = int(row[4])
    total_other_comments += num_comments
# calculating average number of comments for other_post
avg_other_comments = total_other_comments / len(other_posts)
print(round(avg_other_comments,3))

25.839
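
The three cells above repeat the same loop with different lists. A sketch of how the calculation could be factored into one helper follows. `average_comments` is a hypothetical name, and the sample rows are the first three Ask HN rows printed earlier, so the result is the mean of 7, 3 and 3 rather than the dataset-wide average.

```python
def average_comments(posts, comments_index=4):
    """Mean of the num_comments column; returns 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

avg = average_comments(sample_posts)
print(round(avg, 3))
```

With the full lists, the same helper would be called once each for `ask_posts`, `show_posts` and `other_posts`.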


We can see that, on average, Ask posts get more responses than Show posts. Maybe this is because people prefer giving advice to giving feedback on a project.

We can also see that posts with other titles have the highest average number of comments. This may be because they cover many different topics, some of which are very popular or controversial, which is why people discuss them a lot.

# Analysis of Ask posts

## Distributing number of posts and comments by hour created

Since **Ask posts** receive more comments on average than Show posts, we'll focus our remaining analysis just on these posts.

Next, we'll determine if ask posts created at a certain time are more likely to attract comments. We'll use the following steps to perform this analysis:

- Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
- Calculate the average number of comments ask posts receive by hour created.

First, we'll work on calculating the number of ask posts and comments by hour created. We'll use the datetime module to work with the data in the created_at column.

Now let's create an empty list, `result_list`. We will iterate over the `ask_posts` list of lists and append to `result_list` a two-element list for each post: its `created_at` value and its number of comments.

In [14]:
result_list = []
for row in ask_posts:
    result_list.append([row[6], int(row[4])])
print(result_list[:5])


[['9/26/2016 2:53', 7], ['9/26/2016 1:17', 3], ['9/25/2016 22:48', 3], ['9/25/2016 21:50', 2], ['9/25/2016 19:30', 1]]


Next, we create two empty dictionaries: `counts_by_hour`, to collect the number of posts created in each hour, and `comments_by_hour`, to collect the number of comments on posts created in each hour. To do that, we parse each date string into a datetime object using `datetime.strptime()`.

In [15]:
# importing datetime module using alias 'dt' 
import datetime as dt

counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    comments = row[1]
    date_str = row[0]
    # creating datetime object from the string 'date_str'
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M") 
    # extracting hour from the datetime object and assigning to variable hour_created
    hour_created = date_dt.strftime('%H')
     
    if hour_created not in counts_by_hour:
        counts_by_hour[hour_created] = 1
        comments_by_hour[hour_created] = comments
    else:
        counts_by_hour[hour_created] += 1
        comments_by_hour[hour_created] += comments
print('Posts created by hour:', counts_by_hour) 
print('Comments left by hour:', comments_by_hour) 

Posts created by hour: {'02': 227, '01': 223, '22': 287, '21': 407, '19': 420, '17': 404, '15': 467, '14': 378, '13': 326, '11': 251, '10': 219, '09': 176, '07': 157, '03': 212, '16': 415, '08': 190, '00': 231, '23': 276, '20': 392, '18': 452, '12': 274, '04': 186, '06': 176, '05': 165}
Comments left by hour: {'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '16': 4466, '08': 2362, '00': 2277, '23': 2297, '20': 4462, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}
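
The `if`/`else` branch above can be avoided with `collections.defaultdict`, which starts every missing key at zero. The sketch below is an alternative, not the notebook's own code. The names `sample`, `sample_counts` and `sample_comments` are hypothetical, and the last sample entry is made up so that one hour appears twice.

```python
import datetime as dt
from collections import defaultdict

# First three entries come from result_list above; the last one is invented
sample = [
    ['9/26/2016 2:53', 7],
    ['9/26/2016 1:17', 3],
    ['9/25/2016 22:48', 3],
    ['9/26/2016 2:05', 2],
]

sample_counts = defaultdict(int)
sample_comments = defaultdict(int)

for date_str, comments in sample:
    # parse the string, then format the hour as a zero-padded two-digit string
    hour = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += comments

print(dict(sample_counts))
print(dict(sample_comments))
```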


## Calculating average number of comments for each hour

Now we will create a list of lists `avg_by_hour` containing the hours during which posts were created and the average number of comments those posts received.

In [16]:
avg_by_hour = []
for key in comments_by_hour:
    # calculating the average number of comments for each hour
    # for better readability we round the avg value to 3 decimal places
    avg_comments = round(comments_by_hour[key] / counts_by_hour[key], 3)
    avg_by_hour.append([key, avg_comments]) 
for row in avg_by_hour:
    print(row)

['02', 13.198]
['01', 9.368]
['22', 11.749]
['21', 11.057]
['19', 9.414]
['17', 13.73]
['15', 39.668]
['14', 13.153]
['13', 22.224]
['11', 11.143]
['10', 13.758]
['09', 8.392]
['07', 10.096]
['03', 10.16]
['16', 10.761]
['08', 12.432]
['00', 9.857]
['23', 8.322]
['20', 11.383]
['18', 10.79]
['12', 15.453]
['04', 12.688]
['06', 9.017]
['05', 11.139]
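
As a quick sanity check of the arithmetic, the averages for three of the hours can be recomputed directly from the totals printed earlier. The sketch uses a dictionary comprehension instead of a loop. The variable names `comments_subset`, `counts_subset` and `avg_subset` are hypothetical, and the values are copied from the outputs above.

```python
# Totals copied from the printed dictionaries for hours 15, 13 and 12
comments_subset = {'15': 18525, '13': 7245, '12': 4234}
counts_subset = {'15': 467, '13': 326, '12': 274}

# average = total comments / number of posts, rounded to 3 decimal places
avg_subset = {
    hour: round(comments_subset[hour] / counts_subset[hour], 3)
    for hour in comments_subset
}
print(avg_subset)
```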


## Formatting the output in more readable way

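In order to make it easier to sort our data, let's swap the columns so that the average comes first. `sorted()` compares lists element by element, so putting the average first makes it the sort key.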

In [19]:
# creating empty list of lists to place there swapped columns
swap_avg_by_hour = []
for row in avg_by_hour:
    x = row[0]
    y = row[1]
    swap_avg_by_hour.append([y, x])

# sorting our data in descending order
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
print(sorted_swap)


[[39.668, '15'], [22.224, '13'], [15.453, '12'], [13.758, '10'], [13.73, '17'], [13.198, '02'], [13.153, '14'], [12.688, '04'], [12.432, '08'], [11.749, '22'], [11.383, '20'], [11.143, '11'], [11.139, '05'], [11.057, '21'], [10.79, '18'], [10.761, '16'], [10.16, '03'], [10.096, '07'], [9.857, '00'], [9.414, '19'], [9.368, '01'], [9.017, '06'], [8.392, '09'], [8.322, '23']]


In [20]:
# demonstrating top 5 commented hours
print("Top 5 Hours for Ask Posts Comments:")
for hour in sorted_swap[:5]:
    print(hour)

Top 5 Hours for Ask Posts Comments:
[39.668, '15']
[22.224, '13']
[15.453, '12']
[13.758, '10']
[13.73, '17']
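
Swapping the columns is one way to sort by the average. A sketch of an alternative is to pass a `key` function to `sorted()`, which leaves each pair in its original `[hour, average]` order. The list below is a hypothetical five-row subset of `avg_by_hour`, with values taken from the output above.

```python
# Five [hour, average] pairs copied from the avg_by_hour output
avg_subset = [
    ['02', 13.198],
    ['01', 9.368],
    ['15', 39.668],
    ['13', 22.224],
    ['12', 15.453],
]

# sort by the second element (the average), largest first
top = sorted(avg_subset, key=lambda pair: pair[1], reverse=True)
for hour, average in top[:3]:
    print(hour, average)
```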


Let's present our findings in a more readable way, using string formatting.

In [21]:
for hour in sorted_swap[:5]:
    time_str = hour[1]
    # creating datetime object from the string
    time_dt = dt.datetime.strptime(time_str, '%H')
    # setting the format of the string - transforming from 'hour' format to 'hour:minute' format
    post_time = time_dt.strftime('%H:%M')
    average = hour[0]
    print(f'{post_time}: {average:.2f} average comments per post')

15:00: 39.67 average comments per post
13:00: 22.22 average comments per post
12:00: 15.45 average comments per post
10:00: 13.76 average comments per post
17:00: 13.73 average comments per post


# Analysis of Other posts
## Distributing number of posts and comments by hour created

Our main goal was to compare Ask HN and Show HN posts.

However, since the 'other posts' category has the highest average number of comments, it is worth analysing this data too and checking whether it follows the same commenting pattern as Ask posts.

Let's repeat the analysis we performed for Ask posts on the other posts.

In [23]:
other_posts_result = []
for row in other_posts:
    other_posts_result.append([int(row[4]), row[6]])
print(other_posts_result[:5])    

[[1, '9/26/2016 2:26'], [1, '9/26/2016 1:54'], [1, '9/26/2016 1:37'], [3, '9/26/2016 1:24'], [1, '9/26/2016 0:31']]


Next, we create two empty dictionaries: `other_posts_byhour`, to collect the number of posts created in each hour, and `other_comments_byhour`, to collect the number of comments on posts created in each hour. To do that, we parse each date string into a datetime object using `datetime.strptime()`.

In [33]:
other_posts_byhour = {}
other_comments_byhour = {}
for row in other_posts_result:
    comments = row[0]
    date_str = row[1]
    # creating datetime object from the string 'date_str'
    date_dt = dt.datetime.strptime(date_str, "%m/%d/%Y %H:%M")
    # extracting hour from the datetime object and assigning to variable hour_created
    hour_created = date_dt.strftime("%H")
    if hour_created not in other_posts_byhour:
        other_posts_byhour[hour_created] = 1
        other_comments_byhour[hour_created] = comments
    else:
        other_posts_byhour[hour_created] += 1
        other_comments_byhour[hour_created] += comments
print('Posts created by hour:', other_posts_byhour) 
print('Comments left by hour:', other_comments_byhour)        

Posts created by hour: {'02': 1870, '01': 2031, '00': 2271, '23': 2556, '22': 2995, '21': 3470, '20': 3730, '19': 3986, '18': 4314, '17': 4392, '16': 4335, '15': 4122, '14': 3854, '13': 3619, '12': 3085, '11': 2620, '10': 2298, '09': 2149, '08': 1919, '07': 1826, '06': 1789, '05': 1598, '04': 1861, '03': 1740}
Comments left by hour: {'02': 50100, '01': 47756, '00': 55491, '23': 58378, '22': 68059, '21': 79996, '20': 88320, '19': 101127, '18': 112502, '17': 118217, '16': 116322, '15': 115286, '14': 108277, '13': 106302, '12': 90082, '11': 71072, '10': 59147, '09': 56141, '08': 49804, '07': 44424, '06': 43050, '05': 41773, '04': 43753, '03': 42762}


## Calculating average number of comments for each hour

In [34]:
#creating a list of lists containing the hours during which posts were created 
#and the average number of comments those posts received.
other_avg_byhour = []
for key in other_comments_byhour:
    # calculating the average number of comments for each hour
    # for better readability we round the avg value to 2 decimal places
    average_comment = round(other_comments_byhour[key] / other_posts_byhour[key], 2)
    other_avg_byhour.append([average_comment, key])
for row in other_avg_byhour:
    print(row)

[26.79, '02']
[23.51, '01']
[24.43, '00']
[22.84, '23']
[22.72, '22']
[23.05, '21']
[23.68, '20']
[25.37, '19']
[26.08, '18']
[26.92, '17']
[26.83, '16']
[27.97, '15']
[28.09, '14']
[29.37, '13']
[29.2, '12']
[27.13, '11']
[25.74, '10']
[26.12, '09']
[25.95, '08']
[24.33, '07']
[24.06, '06']
[26.14, '05']
[23.51, '04']
[24.58, '03']


## Formatting the output in more readable way

In [35]:
# sorting our list of lists in descending order
sorted_other_avg = sorted(other_avg_byhour, reverse=True)
# presenting our findings in a more readable way, using string formatting
for hour in sorted_other_avg[:5]:
    date_str = hour[1]
    date_dt = dt.datetime.strptime(date_str, "%H")
    hour_str = date_dt.strftime("%H:%M")
    average_com = hour[0]
    print(f'{hour_str}: {average_com} average comments per post')

13:00: 29.37 average comments per post
12:00: 29.2 average comments per post
14:00: 28.09 average comments per post
15:00: 27.97 average comments per post
11:00: 27.13 average comments per post
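
Since the same steps were applied to both Ask posts and other posts, the whole pipeline could be wrapped in one function. The sketch below is a hypothetical refactoring, not the notebook's code. `top_hours` and its parameters are assumed names, the hour is kept as an integer via the `.hour` attribute, and the sample rows are three Ask HN rows shown earlier.

```python
import datetime as dt
from collections import defaultdict

def top_hours(posts, n=5, date_index=6, comments_index=4):
    """Return the n (hour, average comments) pairs with the highest average."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in posts:
        hour = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M").hour
        totals[hour] += int(row[comments_index])
        counts[hour] += 1
    averages = [(hour, totals[hour] / counts[hour]) for hour in totals]
    averages.sort(key=lambda pair: pair[1], reverse=True)
    return averages[:n]

sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'],
]

result = top_hours(sample_posts, n=2)
for hour, average in result:
    print(f'{hour:02d}:00: {average:.2f} average comments per post')
```

Using `:.2f` in the f-string keeps every average at two decimal places, so a value such as 29.2 would print as 29.20.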


# Conclusion

Our main goal was to compare two types of posts to determine the following:

- Do **Ask HN** or **Show HN** posts receive more comments on average?
- Do posts created at a certain time receive more comments on average?

We found that, on average, Ask posts receive more comments than Show posts (13.744 versus 9.811).
One possible explanation is that people prefer giving advice to giving feedback on a project.

We also checked the remaining posts (`other_posts`) and found that they have the highest average number of comments (25.839). This may be because they cover many different topics, some of which are very popular or controversial and therefore attract a lot of discussion.

Regarding the second question, the analysis of Ask posts showed that the hours with the highest average number of comments are daytime hours:

- 15:00
- 13:00
- 12:00
- 10:00
- 17:00

The analysis of Other posts showed that the hourly averages don't differ much: they range from 22.72 to 29.37 comments per post. The top five hours are:

- 13:00
- 12:00
- 14:00
- 15:00
- 11:00

So if you are deciding when to post in order to receive as many comments or as much feedback as possible, the answer is: post between 11:00 and 15:00.