# Hacker News Analysis

In this project we will analyze posts from Hacker News, a popular site where technology-related stories are voted and commented on, much like Reddit.

We want to analyze two types of posts on the site, "Ask HN" and "Show HN", to see which type generates more interest and engagement from users.

We will specifically compare the two types of posts to determine the following:
    Do Ask HN or Show HN posts receive more comments on average?
    Do posts created at a certain time receive more comments?
    
Note: The data set that we are using has been filtered down from 300,000 rows to approximately 20,000 rows. This was done by removing all posts that received no comments, and then randomly sampling from the remaining submissions.
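The filtering and sampling described in the note can be sketched roughly as follows. This is not the code that produced our file: the helper name sample_with_comments, the toy rows, and the seed are all made up for illustration, and the sketch assumes num_comments sits at index 4 as a string, as in our CSV.

```python
import random

def sample_with_comments(rows, k, seed=0):
    """Drop rows with zero comments, then draw up to k rows at random.

    Hypothetical helper: assumes each row is a list whose index 4
    holds num_comments as a string.
    """
    commented = [row for row in rows if int(row[4]) > 0]
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(commented, min(k, len(commented)))

# Toy rows in the same column order as the dataset (values are invented).
toy_rows = [
    ["1", "Ask HN: A?", "", "3", "0", "alice", "1/1/2016 10:00"],
    ["2", "Show HN: B", "http://example.com", "5", "2", "bob", "1/2/2016 11:00"],
    ["3", "Some story", "http://example.org", "9", "7", "carol", "1/3/2016 12:00"],
]

sample = sample_with_comments(toy_rows, k=2)
print(len(sample))
```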

The dataset contains the following columns:
    id: the unique identifier from Hacker News for the post
    title: the title of the post
    url: the URL that the post links to, if the post has a URL
    num_points: the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    num_comments: the number of comments on the post
    author: the username of the person who submitted the post
    created_at: the date and time of the post's submission


## Getting the Data

In [4]:
import datetime as dt
from csv import reader

# Read the whole CSV into a list of rows; the with block closes the file for us.
with open(r"C:\Users\MalikSami\Desktop\Data_Analytics\HackerNews_Analysis\hacker_news.csv",
          encoding="utf8") as opened_file:
    hn = list(reader(opened_file))

header = hn[0]
print(header)
hn = hn[1:]  # Remove the header row from the list


['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [2]:
print(hn[1])

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
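Because each row is a plain list, it can help to pair the values with the column names when inspecting a single post. The sketch below does this with zip for one row taken from the dataset; the name post is only illustrative.

```python
header = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
       'ahmedbaracat', '8/16/2016 9:55']

# Pair each column name with the value at the same position.
post = dict(zip(header, row))

print(post["title"])
print(int(post["num_comments"]))  # values are strings until converted
```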


## Explore Data Function

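This function will allow us to view data slices with ease. The example call below uses the "ask_posts" list, which is built in the Further Filtering section, so that section's cell has to be run before it.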

In [5]:
def explore_data(dataSet, start, end, rows_and_columns = False):
    dataSet_slice = dataSet[start:end]
    for row in dataSet_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataSet))
        print('Number of columns: ', len(dataSet[0]))

In [31]:
explore_data(ask_posts,0,5,True)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


['10394168', 'Ask HN: Someone offered to buy my browser extension from me. What now?', '', '28', '17', 'roykolak', '10/15/2015 16:38']


Number of rows:  1744
Number of columns:  7


## Removing the data with 0 comments

Here we create a new list and populate it with the rows that have at least one comment. We do this with a for loop that iterates through the data and appends a row to the new list as long as the value in the fifth column (index 4, "num_comments") is not "0" (note that this is a string, not an integer).

In [6]:
hn_with_comments = []
for row in hn:
    comments = row[4]
    if comments != "0":
        hn_with_comments.append(row)

In [7]:
print(hn_with_comments[1])
print(len(hn_with_comments))

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']
20100
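The reason the comparison has to use the string "0" can be seen in a small sketch with invented rows. Since every value read from the CSV is a string, comparing against the integer 0 never matches and nothing would be filtered out.

```python
# Toy rows in the dataset's column order (values are invented).
toy_rows = [
    ["1", "Post A", "", "3", "0", "alice", "1/1/2016 10:00"],
    ["2", "Post B", "", "5", "2", "bob", "1/2/2016 11:00"],
    ["3", "Post C", "", "9", "7", "carol", "1/3/2016 12:00"],
]

# Comparing against the string "0" drops the post without comments.
kept = [row for row in toy_rows if row[4] != "0"]
print(len(kept))  # 2

# Comparing against the integer 0 never equals a string, so nothing is dropped.
kept_wrong = [row for row in toy_rows if row[4] != 0]
print(len(kept_wrong))  # 3
```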


## Further Filtering

We want to split the data into separate lists of:
    ask_posts[]
    show_posts[]
    other_posts[]

This will make the analysis easier.
We iterate through the filtered list "hn_with_comments" and assign the second column, "title", to a variable named title.
Then we use the lower() method to make the title lowercase, so that variants such as "ASK HN" or "ask hn" are matched as well.
Next we write a simple if statement and use the startswith() method to append each row to the proper list.


In [8]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn_with_comments:
    title = row[1]
    title = title.lower()
    if title.startswith("ask hn"):
        ask_posts.append(row)
    elif title.startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [9]:
explore_data(ask_posts, 0, 4, True)

['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6', 'ahmedbaracat', '8/16/2016 9:55']


['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43']


['11610310', 'Ask HN: Aby recent changes to CSS that broke mobile?', '', '1', '1', 'polskibus', '5/2/2016 10:14']


['12210105', 'Ask HN: Looking for Employee #3 How do I do it?', '', '1', '3', 'sph130', '8/2/2016 14:20']


Number of rows:  1744
Number of columns:  7


In [7]:
print("The number of ask posts are: ", len(ask_posts))
print("The number of show posts are: ", len(show_posts))
print("The number of other posts are: ", len(other_posts))

The number of ask posts are:  1744
The number of show posts are:  1162
The number of other posts are:  17194
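The same title check can also be written as a small function, which makes the effect of lower() easy to test on its own. This is only a sketch; classify_title is a hypothetical name that does not appear in the notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on how the title starts."""
    lowered = title.lower()  # makes the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

print(classify_title("Ask HN: How to improve my personal website?"))  # ask
print(classify_title("SHOW HN: a tiny tool"))                         # show
print(classify_title("A story that mentions Ask HN later"))           # other
```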


## Which posts receive more comments? Ask or Show?

Since we split the data into separate lists, including "ask_posts" and "show_posts", we can run a for loop on each of these two lists to total the comments and then divide by the number of posts to find the average number of comments for each type of post.

We can see from the output that "ask_posts" receive about 14 comments per post on average, whereas "show_posts" receive only about 10.

We can focus the rest of our analysis on ask posts, since they are the ones that receive more comments.


In [11]:
total_ask_comments = 0

for posts in ask_posts:
    total_ask_comments += int(posts[4])
    
    
print("Total number of ask comments is: ",  total_ask_comments)
average_ask_comments = total_ask_comments / len(ask_posts)
print("The average number of comments on ask posts are: ", average_ask_comments)
    


Total number of ask comments is:  24483
The average number of comments on ask posts are:  14.038417431192661


In [14]:
total_show_comments = 0

for posts in show_posts:

    total_show_comments += int(posts[4])
    
    
print("Total number of show comments is: ",  total_show_comments)
average_show_comments = total_show_comments / len(show_posts)
print("The average number of comments on show posts are: ", average_show_comments)

Total number of show comments is:  11988
The average number of comments on show posts are:  10.31669535283993


## Finding Comments by Hour Created

We can determine whether we can maximize the number of comments an ask post receives by creating it at a certain hour. We can do this by finding the number of ask posts created during each hour of the day, along with the number of comments those posts received.

We store these in two dictionaries keyed by hour: "counts_by_hour" for the number of posts and "comments_by_hour" for the number of comments. We can then create a new list of lists, "avg_by_hour", to store the average number of comments per post for each hour. We do this by running a for loop on "comments_by_hour" and appending the calculated average to "avg_by_hour".

In [30]:
result_list = []

for post in ask_posts:
    created_at = post[6]
    total_comments = int(post[4])
    result_list.append([created_at,total_comments])
    
counts_by_hour = {}
comments_by_hour = {}
date_format = "%m/%d/%Y %H:%M"

for row in result_list:
    date = row[0]
    comment = row[1]
    time = dt.datetime.strptime(date, date_format).strftime("%H")
    if time in counts_by_hour:
        comments_by_hour[time] += comment
        counts_by_hour[time] += 1
    else:
        comments_by_hour[time] = comment
        counts_by_hour[time] = 1
        
counts_by_hour


{'09': 45,
 '13': 85,
 '10': 59,
 '14': 107,
 '16': 108,
 '23': 68,
 '12': 73,
 '17': 100,
 '15': 116,
 '21': 109,
 '20': 80,
 '02': 58,
 '18': 109,
 '03': 54,
 '05': 46,
 '19': 110,
 '01': 60,
 '22': 71,
 '08': 48,
 '04': 47,
 '00': 55,
 '06': 44,
 '07': 34,
 '11': 58}

In [28]:
avg_by_hour = []

for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
    
avg_by_hour
                        


[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
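As an alternative to the if/else branches, the same per-hour bookkeeping could be written with collections.defaultdict, which starts every missing key at zero. This is a sketch on three made-up [created_at, comments] pairs rather than the real "result_list", and the names counts, comments and avg are illustrative.

```python
import datetime as dt
from collections import defaultdict

date_format = "%m/%d/%Y %H:%M"

# Invented sample in the same shape as result_list: [created_at, comments].
toy_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 1],
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n_comments in toy_results:
    hour = dt.datetime.strptime(created_at, date_format).strftime("%H")
    counts[hour] += 1
    comments[hour] += n_comments

# Average comments per post for each hour seen in the sample.
avg = {hour: comments[hour] / counts[hour] for hour in counts}
print(avg)  # {'09': 3.5, '13': 29.0}
```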

## Sorting and Displaying Values 

Now that we have the "avg_by_hour" list, we can swap the average and the hour in each entry so that sorting orders the entries by average, and then sort the list in descending order to make it easier to read. We do this by appending the swapped values to a new list, "swap_avg_by_hour", and then calling the sorted() function with reverse = True.

Then we simply display the results.


In [34]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print(swap_avg_by_hour)

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

sorted_swap

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]


[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]
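The swap is one way to make sorted() order by the average. Another option, sketched here on four entries copied from "avg_by_hour", is to pass a key function that picks out the average, which leaves the [hour, average] order untouched. The names sample and by_avg are illustrative.

```python
# Four [hour, average] entries copied from avg_by_hour.
sample = [
    ['09', 5.5777777777777775],
    ['15', 38.5948275862069],
    ['02', 23.810344827586206],
    ['20', 21.525],
]

# key tells sorted() to compare entries by their second element (the average).
by_avg = sorted(sample, key=lambda pair: pair[1], reverse=True)

for hour, avg in by_avg:
    print("{}:00: {:.2f} average comments per post".format(hour, avg))
```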

In [35]:
print("Top 5 hours for Ask Posts Comments.")

for avg, hr in sorted_swap[:5]:
    print(
    "{}: {:.2f} average comments per post".format(dt.datetime.strptime(hr, "%H").strftime("%H:%M"), avg)
    )
    

Top 5 hours for Ask Posts Comments.
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post


The hour that receives the most comments is 15:00, with an average of 38.59 comments per post. That is roughly a 62% increase over the hour with the second highest average, 02:00 (23.81 comments per post).
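The relative increase between the top two hours can be checked directly from the averages in the sorted list. The variable names below are illustrative.

```python
highest = 38.5948275862069   # average for 15:00
second = 23.810344827586206  # average for 02:00

# Percentage increase of the highest average over the second highest.
pct_increase = (highest - second) / second * 100
print(round(pct_increase))  # 62
```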

# Conclusion

In this project we took the Hacker News data set and cleaned and filtered the data to determine the hours that produce the most comments on average. Based on this analysis, to maximize the number of comments a post receives, we'd recommend that the post be categorized as an 'Ask HN' post and be created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST).