*This notebook is a guided project from the Dataquest course (Data Analyst path).*

# Exploring Hacker News Posts
[Hacker News](https://news.ycombinator.com/) is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.





You can find the data set [here](./hacker_news.csv), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

    id: The unique identifier from Hacker News for the post
    title: The title of the post
    url: The URL that the post links to, if the post has a URL
    num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    num_comments: The number of comments that were made on the post
    author: The username of the person who submitted the post
    created_at: The date and time at which the post was submitted


>We're specifically interested in posts whose titles begin with either **Ask HN** (submitted when a user wants to ask the Hacker News community a specific question) or **Show HN** (submitted when a user wants to show the Hacker News community a project).
We'll compare these two types of posts to determine the following:

    Do Ask HN or Show HN receive more comments on average?
    Do posts created at a certain time receive more comments on average?

Let's start by importing the libraries we need and reading the data set into a list of lists.

In [2]:
from csv import reader

# Read the CSV file into a list of lists; the with block closes the file afterwards
with open("hacker_news.csv") as file:
    hn=list(reader(file))
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


Notice that the first row printed above is different: it's the header row.
This row doesn't take part in the analysis and should therefore be separated from the data.
**--> We'll save the first row in a list named `header`.**

In [3]:
header=hn[0]
hn=hn[1:]
print(header)
print(hn[:5])

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
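Later cells access columns by position (for example `row[4]` for `num_comments`). As an optional sketch, the header row can be turned into a name-to-index lookup so those positions don't have to be memorized. The names `header_row`, `col_index` and `sample_row` are only illustrative; the values are copied from the output above.

```python
# Same values as the header list printed above
header_row = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
col_index = {name: position for position, name in enumerate(header_row)}

# Sample row copied from the output above
sample_row = ['12224879', 'Interactive Dynamic Video',
              'http://www.interactivedynamicvideo.com/', '386', '52',
              'ne0phyte', '8/4/2016 11:52']

# Look the column up by name instead of by a hard-coded number
sample_num_comments = int(sample_row[col_index['num_comments']])
print(col_index['num_comments'], sample_num_comments)
```

With this lookup, `row[col_index['created_at']]` reads the same value as `row[6]`.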


Now that we've set the header row aside, we're ready to filter our data. Since we're only concerned with post titles beginning with Ask HN or Show HN, we'll create two new lists of lists (one for each type) containing just the data for those titles.

In [5]:
ask_posts=[]
show_posts=[]
other_posts=[]
for row in hn:
    title=row[1]
    title=title.lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
# Print the number of posts of each type
print("Number of ask posts:", len(ask_posts))
print("Number of show posts:", len(show_posts))
print("Number of other posts:", len(other_posts))

Number of ask posts: 1744
Number of show posts: 1162
Number of other posts: 17194


Now we are going to determine if ask posts or show posts receive more comments on average.

In [6]:
total_ask_comments=0
for row in ask_posts:
    num_comments=int(row[4])
    total_ask_comments+=num_comments
avg_ask_comments=total_ask_comments/len(ask_posts)
print('The average of the ask post comments : ' ,avg_ask_comments)

The average of the ask post comments :  14.038417431192661


In [7]:
total_show_comments=0
for row in show_posts:
    num_comments=int(row[4])
    total_show_comments+=num_comments
avg_show_comments=total_show_comments/len(show_posts)
print('The average of the show post comments : ' ,avg_show_comments)

The average of the show post comments :  10.31669535283993


The two cells above show that ask posts receive more comments on average (about 14 per post) than show posts (about 10.3 per post).
This makes sense, because users tend to respond to questions in the comments.
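The two averaging cells differ only in the list they loop over, so the calculation could be factored into a small helper. This is a minimal sketch, not part of the guided project: `average_of_column` is a hypothetical name and the sample rows are made up, following the data set's column order.

```python
def average_of_column(rows, column):
    """Return the mean of an integer column, or 0.0 for an empty list of rows."""
    if not rows:
        return 0.0
    total = sum(int(row[column]) for row in rows)
    return total / len(rows)

# Made-up rows in the data set's column order (not real posts)
sample_rows = [
    ['1', 'Ask HN: example one', '', '5', '10', 'user_a', '1/1/2016 10:00'],
    ['2', 'Ask HN: example two', '', '7', '20', 'user_b', '1/1/2016 11:00'],
]

# Column 4 is num_comments
sample_avg_comments = average_of_column(sample_rows, 4)
print(sample_avg_comments)
```

Calling `average_of_column(ask_posts, 4)` and `average_of_column(show_posts, 4)` would then replace the two loops.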

Since ask posts are more likely to receive comments, **we'll focus our remaining analysis just on these posts.** 

Next, we'll determine if ask posts created at a **certain time** are **more likely** to attract comments. We'll use the following steps to perform this analysis:

    Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
    Calculate the average number of comments ask posts receive by hour created.


For the first step (counting the ask posts created in each hour of the day along with their comments), we are going to use two dictionaries:
* The first will contain the number of posts per hour

* The second will contain the number of comments per hour


In [8]:
import datetime as dt

# Collect the creation time and the number of comments of each ask post
result_list=[]
for row in ask_posts:
    created_at=row[6]
    nb_comments=int(row[4])
    result_list.append([created_at,nb_comments])

counts_by_hour={}
comments_by_hour={}
for row in result_list:
    date_1=row[0]
    nb_comments=row[1]
    # The created_at column stores the date as a string, so we parse it with
    # strptime and then extract the hour as a two-digit string with strftime
    date_2=dt.datetime.strptime(date_1,"%m/%d/%Y %H:%M")
    hour=date_2.strftime("%H")
    if hour not in counts_by_hour:
        counts_by_hour[hour]=1
        comments_by_hour[hour]=nb_comments
    else:
        counts_by_hour[hour]+=1
        comments_by_hour[hour]+=nb_comments
print('Number of post per hour : \n' ,counts_by_hour ,'\n')  
print('Number of comments per hour : \n' ,comments_by_hour )

Number of post per hour : 
 {'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58} 

Number of comments per hour : 
 {'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
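The same tally can be written a little more compactly with `collections.defaultdict`, which removes the `if hour not in ...` branch. This is only a sketch on made-up `(created_at, num_comments)` pairs; the `sample_*` names are hypothetical and chosen so they don't overwrite the notebook's dictionaries.

```python
import datetime as dt
from collections import defaultdict

# Made-up (created_at, num_comments) pairs in the data set's date format
sample_pairs = [
    ('8/4/2016 11:52', 6),
    ('1/26/2016 19:30', 10),
    ('6/23/2016 11:05', 4),
]

sample_counts = defaultdict(int)    # missing hours start at 0
sample_comments = defaultdict(int)

for created_at, nb_comments in sample_pairs:
    # Parse the string, then keep the hour as a two-digit string such as '11'
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    sample_counts[hour] += 1
    sample_comments[hour] += nb_comments

print(dict(sample_counts))
print(dict(sample_comments))
```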


In the previous cell, we created two dictionaries:

    counts_by_hour: contains the number of ask posts created during each hour of the day.
    comments_by_hour: contains the total number of comments received by ask posts created during each hour.

Next, we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. 

In [9]:
# We'll use a list of lists to store the average number of comments per `Ask HN` post
# for each hour of the day. Each sublist contains the hour and the average
# number of comments per post created during that hour.
avg_by_hour = []

for hr in comments_by_hour:
    avg_by_hour.append([hr, comments_by_hour[hr] / counts_by_hour[hr]])

print('Average number of comments per post by hour: \n' ,avg_by_hour)

Average number of comments per post by hour: 
 [['09', 5.5777777777777775], ['13', 14.741176470588234], ['10', 13.440677966101696], ['14', 13.233644859813085], ['16', 16.796296296296298], ['23', 7.985294117647059], ['12', 9.41095890410959], ['17', 11.46], ['15', 38.5948275862069], ['21', 16.009174311926607], ['20', 21.525], ['02', 23.810344827586206], ['18', 13.20183486238532], ['03', 7.796296296296297], ['05', 10.08695652173913], ['19', 10.8], ['01', 11.383333333333333], ['22', 6.746478873239437], ['08', 10.25], ['04', 7.170212765957447], ['00', 8.127272727272727], ['06', 9.022727272727273], ['07', 7.852941176470588], ['11', 11.051724137931034]]


In the previous cell, we calculated the average number of comments for posts created during each hour of the day, and stored the results in a list of lists named avg_by_hour.

Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [10]:
#we need to swap the columns of avg_by_hour because Python's sorted function sorts a list of lists
#according to the values in the first column, which should be the average in our case
swap_avg_by_hour=[]
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
print(swap_avg_by_hour)  
sorted_swap=sorted(swap_avg_by_hour,reverse=True)
print(" \nTop 5 Hours for Ask Posts Comments \n" ,sorted_swap[:5])

[[5.5777777777777775, '09'], [14.741176470588234, '13'], [13.440677966101696, '10'], [13.233644859813085, '14'], [16.796296296296298, '16'], [7.985294117647059, '23'], [9.41095890410959, '12'], [11.46, '17'], [38.5948275862069, '15'], [16.009174311926607, '21'], [21.525, '20'], [23.810344827586206, '02'], [13.20183486238532, '18'], [7.796296296296297, '03'], [10.08695652173913, '05'], [10.8, '19'], [11.383333333333333, '01'], [6.746478873239437, '22'], [10.25, '08'], [7.170212765957447, '04'], [8.127272727272727, '00'], [9.022727272727273, '06'], [7.852941176470588, '07'], [11.051724137931034, '11']]
 
Top 5 Hours for Ask Posts Comments 
 [[38.5948275862069, '15'], [23.810344827586206, '02'], [21.525, '20'], [16.796296296296298, '16'], [16.009174311926607, '21']]
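As an alternative to swapping the columns, `sorted` accepts a `key` function that picks the value to sort by. The sketch below uses a few `[hour, average]` pairs copied from the output above; `sample_avg_by_hour` and `top_hours` are illustrative names.

```python
# A few [hour, average] pairs copied from avg_by_hour above
sample_avg_by_hour = [['09', 5.5777777777777775], ['15', 38.5948275862069],
                      ['02', 23.810344827586206], ['20', 21.525]]

# Sort by the average (index 1), largest first, without swapping the columns
top_hours = sorted(sample_avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hour, avg in top_hours[:3]:
    print(hour, avg)
```

One difference to keep in mind: with the swapped lists, two equal averages are ordered by the hour string, while with `key` they keep their original relative order.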


In [11]:
# Print the top 5 hours by average number of comments per post
template="{hour} {avg:.2f} average comments per post"
print('Top 5 hours by average number of comments per post: \n')
for row in sorted_swap[:5]:
    h=dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print(template.format(hour=h, avg=row[0]))

Top 5 hours by average number of comments per post: 

15:00 38.59 average comments per post
02:00 23.81 average comments per post
20:00 21.52 average comments per post
16:00 16.80 average comments per post
21:00 16.01 average comments per post


# When users have a higher chance of receiving answers on their ask posts
According to the results above, an ask post has the best chance of receiving comments when it is created at **15:00**. The data set's time zone is US Eastern Time, so this corresponds to **21:00** in Algiers, Algeria (20:00 while US daylight saving time is in effect).
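As a rough check of the time-zone conversion, here is a minimal sketch using fixed UTC offsets. The offsets are assumptions stated in the comments (US Eastern at UTC-5 or UTC-4 depending on daylight saving, Algiers at UTC+1 all year), and `eastern_hour_to_algiers` is a hypothetical helper.

```python
import datetime as dt

# Assumed fixed offsets: US Eastern is UTC-5 (standard time) or UTC-4
# (daylight saving time); Algiers is UTC+1 all year
EASTERN_STANDARD = dt.timezone(dt.timedelta(hours=-5))
EASTERN_DAYLIGHT = dt.timezone(dt.timedelta(hours=-4))
ALGIERS = dt.timezone(dt.timedelta(hours=1))

def eastern_hour_to_algiers(hour, eastern_tz):
    """Convert an hour of the day in US Eastern Time to the time in Algiers."""
    # The date is arbitrary; with fixed offsets only the hour matters
    eastern_time = dt.datetime(2016, 1, 15, hour, 0, tzinfo=eastern_tz)
    return eastern_time.astimezone(ALGIERS).strftime("%H:%M")

print(eastern_hour_to_algiers(15, EASTERN_STANDARD))
print(eastern_hour_to_algiers(15, EASTERN_DAYLIGHT))
```

Fixed offsets keep the sketch self-contained; a time-zone database would be needed to pick the right offset for a specific date automatically.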

**The following parts are not included in the guided project.**

# Part 2:
Next, we are going to determine which type of post receives more points on average.


In [13]:
#the ask posts
total_ask_point=0
for row in ask_posts:
    num_point=int(row[3])
    total_ask_point+=num_point
avg_ask_point=total_ask_point/len(ask_posts)
print('The average of the ask post point : ' ,avg_ask_point)


The average of the ask post point :  15.061926605504587


In [16]:
#the show posts
total_show_point=0
for row in show_posts:
    num_point=int(row[3])
    total_show_point+=num_point
avg_show_point=total_show_point/len(show_posts)
print('The average of the show post point : ' ,avg_show_point)


The average of the show post point :  27.555077452667813


From the results above, we conclude that show posts receive more points on average than ask posts (about 27.6 versus 15.1 points per post).

# Part 3:
Now let's determine if posts created at a certain time are more likely to receive more points.

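Since show posts receive more points on average, **we'll focus our remaining analysis just on these posts.**

First, we'll create a dictionary keyed by the hour of creation for the show posts. Note that the cell below adds 1 for each post, so its values are the number of show posts created in each hour rather than their total points.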

In [18]:
import datetime as dt

# Collect the creation time and the number of points of each show post
result_list2=[]
for row in show_posts:
    created_at=row[6]
    point=int(row[3])
    result_list2.append([created_at,point])

point_per_hour={}
for row in result_list2:
    date_1=row[0]
    nb_points=row[1]
    # The created_at column stores the date as a string, so we parse it with
    # strptime and then extract the hour as a two-digit string with strftime
    date_2=dt.datetime.strptime(date_1,"%m/%d/%Y %H:%M")
    hour=date_2.strftime("%H")
    # Each post adds 1 here, so the values are the number of show posts
    # created in each hour; nb_points is not added to the total
    if hour not in point_per_hour:
        point_per_hour[hour]=1
    else:
        point_per_hour[hour]+=1
print('Number of show posts created per hour:' ,point_per_hour)

Number of show posts created per hour: {'14': 86, '22': 46, '18': 61, '07': 26, '20': 60, '05': 19, '16': 93, '19': 55, '15': 78, '03': 27, '17': 93, '06': 16, '02': 30, '13': 99, '08': 34, '21': 47, '04': 26, '11': 44, '12': 61, '23': 36, '09': 30, '01': 28, '10': 36, '00': 31}
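The loop above adds 1 for every post, so `point_per_hour` ends up holding the number of show posts per hour. If the goal is the average number of points per post, one possible approach is to accumulate the points and the post counts side by side and divide at the end. The sketch below uses made-up rows and hypothetical `sample_*` names; it has not been run on the notebook's data.

```python
import datetime as dt

# Made-up (created_at, num_points) pairs, not real posts
sample_posts = [
    ('8/4/2016 13:10', 30),
    ('1/26/2016 13:45', 10),
    ('6/23/2016 17:20', 8),
]

sample_points_by_hour = {}   # hour -> total points
sample_posts_by_hour = {}    # hour -> number of posts

for created_at, points in sample_posts:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    # dict.get supplies 0 the first time an hour is seen
    sample_points_by_hour[hour] = sample_points_by_hour.get(hour, 0) + points
    sample_posts_by_hour[hour] = sample_posts_by_hour.get(hour, 0) + 1

# Average points per post for each hour
sample_avg_points = {
    hour: sample_points_by_hour[hour] / sample_posts_by_hour[hour]
    for hour in sample_points_by_hour
}
print(sample_avg_points)
```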


Next, we'll put these values in a list of lists so that they can be sorted in descending order, and then print the top entries.

In [22]:
point_per_hour_list=[]
for hour in point_per_hour:
    # The per-hour value goes in the first column because sorted()
    # compares lists by their first element
    point_per_hour_list.append([point_per_hour[hour], hour])
print(point_per_hour_list)    

[[86, '14'], [46, '22'], [61, '18'], [26, '07'], [60, '20'], [19, '05'], [93, '16'], [55, '19'], [78, '15'], [27, '03'], [93, '17'], [16, '06'], [30, '02'], [99, '13'], [34, '08'], [47, '21'], [26, '04'], [44, '11'], [61, '12'], [36, '23'], [30, '09'], [28, '01'], [36, '10'], [31, '00']]


Sorting and printing the highest values

In [24]:
sorted_point_per_hour=sorted(point_per_hour_list,reverse=True)
print(" \nTop 5 hours by number of show posts \n" ,sorted_point_per_hour[:5])

 
Top 5 hours by number of show posts 
 [[99, '13'], [93, '17'], [93, '16'], [86, '14'], [78, '15']]


In [28]:
template2="{hour} : {nb} show posts"
print('Top 5 hours by number of show posts : \n')
for row in sorted_point_per_hour[:5]:
    h=dt.datetime.strptime(row[1], "%H").strftime("%H:%M")
    print(template2.format(hour=h, nb=row[0]))

Top 5 hours by number of show posts : 

13:00 : 99 show posts
17:00 : 93 show posts
16:00 : 93 show posts
14:00 : 86 show posts
15:00 : 78 show posts


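# The time users have a higher chance of receiving points on their posts

The results above show that **13:00** is the hour in which the most show posts were created. Because the Part 3 cell adds 1 per post rather than summing `num_points`, these figures measure how often show posts are submitted in each hour, not how many points they earn; the average number of points per post by hour would have to be computed to identify the best hour for receiving points. The data set's time zone is US Eastern Time, so 13:00 corresponds to 19:00 in Algiers, Algeria (18:00 while US daylight saving time is in effect).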



## Conclusion :
