# Dataset of submissions to the popular technology site Hacker News
Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

The analysis uses the dataset from: [Hacker News Posts](https://www.kaggle.com/hacker-news/hacker-news-posts)

**Descriptions of the columns:**

>id: *The unique identifier from Hacker News for the post*

>title: *The title of the post*

>url: *The URL that the post links to, if the post has a URL*

>num_points: *The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes*

>num_comments: *The number of comments that were made on the post*

>author: *The username of the person who submitted the post*

>created_at: *The date and time at which the post was submitted*

# What interests us in the dataset?
We're specifically interested in posts whose **titles** begin with either **Ask HN** or **Show HN**. Users submit **Ask HN** posts to ask the Hacker News community a **specific question**. Likewise, users submit **Show HN** posts to show the Hacker News community a project, product, or **just generally something interesting**.

# Study questions:
1) Do Ask HN or Show HN posts receive more comments on average?

2) Do posts created at a certain time receive more comments on average?


**Open and read the file "hacker_news.csv":**

In [None]:
from csv import reader
import datetime as dt

# Read every row of the CSV file; the with block closes the file afterwards
with open("hacker_news.csv") as f:
    hn=list(reader(f))
hn_head=hn[0]
hn=hn[1:]

print(hn[0:5])
print("\nTotal rows without head row: ",len(hn))

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]

Total rows without head row:  20100


**Separate posts beginning with Ask HN and Show HN (including case variations) into different lists:**

In [None]:
ask_post=[]
show_post=[]
other_post=[]

for row in hn:
    title=row[1]
    title=title.capitalize()
    row[1]=title
    if title.startswith("Ask hn"):
        ask_post.append(row)
    elif title.startswith("Show hn"):
        show_post.append(row)
    else:
        other_post.append(row)
        
print("Number of ask post: ",len(ask_post)," this corresponds to ",round(100*(len(ask_post)/len(hn)),2),"% of the total rows")
print("Number of show post: ",len(show_post)," this corresponds to ",round(100*(len(show_post)/len(hn)),2),"% of the total rows")
print("Number of other post: ",len(other_post)," this corresponds to ",round(100*(len(other_post)/len(hn)),2),"% of the total rows")


Number of ask post:  1744  this corresponds to  8.68 % of the total rows
Number of show post:  1162  this corresponds to  5.78 % of the total rows
Number of other post:  17194  this corresponds to  85.54 % of the total rows


I decided to use the method str.capitalize() so that every title has the same format. This avoids leaving out rows whose titles use a different capitalization of the prefix (for example "ASK HN" or "ask hn") and would otherwise fail the conditionals.
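
The same case-insensitive split can also be sketched without modifying the stored titles, by lowercasing a copy of each title only for the comparison. The function name `classify_title` and the sample titles below are hypothetical and serve only as an illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix, ignoring case."""
    lowered = title.lower()  # compare on a lowercase copy; the original title is untouched
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical sample titles, not taken from the dataset.
sample_titles = ["Ask HN: How do you learn?", "SHOW HN: My side project", "A post about Python"]
labels = [classify_title(t) for t in sample_titles]
print(labels)  # ['ask', 'show', 'other']
```

Compared with str.capitalize(), this keeps the original capitalization of the titles available for later display.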

# Total and average number of comments for ask posts and show posts

In [None]:
#num_comments corresponding to index 4
total_ask_comments=0
total_show_comments=0

for row in ask_post:
    num_comments=int(row[4])
    total_ask_comments+=num_comments
    
for row in show_post:
    num_comments=int(row[4])
    total_show_comments+=num_comments

# Each average divides a list's total by the length of that same list
avg_ask_comments=round(total_ask_comments/len(ask_post),2)
avg_show_comments=round(total_show_comments/len(show_post),2)

print("Average comments in ask post: ",avg_ask_comments)
print("Average comments in show post: ",avg_show_comments)



Average comments in ask post:  14.04
Average comments in show post:  10.32


Of the 20,100 posts in total, 8.68% are ask posts and 5.78% are show posts; the rest are other posts.

Ask posts receive approximately **14 comments on average** and show posts receive approximately **10 comments on average**.

**Ask posts** are more frequent than **show posts** and receive more comments on average.
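
Because each average must divide a list's total by the length of that same list, one way to keep the two in step is a small helper that receives the list only once. This is a minimal sketch; the name `average_comments` and the sample rows are hypothetical.

```python
def average_comments(rows, comments_index=4):
    """Mean of the num_comments column over rows; 0.0 for an empty list."""
    if not rows:
        return 0.0
    total = sum(int(row[comments_index]) for row in rows)
    return round(total / len(rows), 2)

# Hypothetical rows in the dataset's column order:
# id, title, url, num_points, num_comments, author, created_at
sample_rows = [
    ["1", "Ask hn: a", "", "5", "10", "user_a", "1/1/2016 10:00"],
    ["2", "Ask hn: b", "", "3", "4", "user_b", "1/2/2016 11:30"],
]
print(average_comments(sample_rows))  # 7.0
```

With the lists from this notebook, the calls would be `average_comments(ask_post)` and `average_comments(show_post)`.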

# Are ask posts and show posts created at a certain time more likely to attract comments?
**Time format: m/d/Y hour:minute (24-hour clock)**

## Ask_post
**counts_by_hour:** contains the number of ask posts created at each time of day (hour and minute).

**comments_by_hour:** contains the total number of comments received by the ask posts created at each of those times.

In [None]:
# created_at index -1 and num_comments index 4 
result_list=[]

counts_by_hour={}
comments_by_hour={}

for row in ask_post:
    created_num=[row[-1],int(row[4])]
    result_list.append(created_num)

for row in result_list:
    obj_datetime=dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    row[0]=obj_datetime
    obj_time=obj_datetime.time()
    
    if obj_time not in counts_by_hour:
        counts_by_hour[obj_time]=1
        comments_by_hour[obj_time]=row[1]
    elif obj_time in counts_by_hour:
        counts_by_hour[obj_time]+=1
        comments_by_hour[obj_time]+=row[1]   
        
avg_by_hour=[]

for row in counts_by_hour:
    avg_by_hour.append([row,round(comments_by_hour[row]/counts_by_hour[row],2)])

print(avg_by_hour[0:2]) 
print("\nTotal Ask_post classified by hour: ",len(avg_by_hour))

[[datetime.time(4, 42), 2.33], [datetime.time(5, 31), 1.0]]

Total Ask_post classified by hour:  967
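
The keys above are full `datetime.time` values (hour and minute), which is why there are 967 of them rather than at most 24. If a coarser view by hour of the day is wanted, a minimal sketch is to key the dictionaries on the `hour` attribute instead. The function name and the sample pairs are hypothetical, and this is not the grouping used for the results in this notebook.

```python
import datetime as dt

def average_comments_by_hour(pairs):
    """pairs: iterable of (created_at string, comment count). Returns {hour: average}."""
    counts = {}
    comments = {}
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour  # 0..23
        counts[hour] = counts.get(hour, 0) + 1
        comments[hour] = comments.get(hour, 0) + num_comments
    return {hour: round(comments[hour] / counts[hour], 2) for hour in counts}

# Hypothetical (created_at, num_comments) pairs, not taken from the dataset.
sample_pairs = [("8/4/2016 11:52", 52), ("1/26/2016 11:05", 10), ("6/23/2016 22:20", 1)]
print(average_comments_by_hour(sample_pairs))  # {11: 31.0, 22: 1.0}
```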


## Show_post
**scounts_by_hour:** contains the number of show posts created at each time of day (hour and minute).

**scomments_by_hour:** contains the total number of comments received by the show posts created at each of those times.

In [None]:
# created_at index -1 and num_comments index 4 
sresult_list=[]

scounts_by_hour={}
scomments_by_hour={}

for row in show_post:
    created_num=[row[-1],int(row[4])]
    sresult_list.append(created_num)

for row in sresult_list:
    obj_datetime=dt.datetime.strptime(row[0],"%m/%d/%Y %H:%M")
    row[0]=obj_datetime
    obj_time=obj_datetime.time()
    
    if obj_time not in scounts_by_hour:
        scounts_by_hour[obj_time]=1
        scomments_by_hour[obj_time]=row[1]
    elif obj_time in scounts_by_hour:
        scounts_by_hour[obj_time]+=1
        scomments_by_hour[obj_time]+=row[1]   
        
savg_by_hour=[]

for row in scounts_by_hour:
    savg_by_hour.append([row,round(scomments_by_hour[row]/scounts_by_hour[row],2)])

print(savg_by_hour[0:2])  
print("\nTotal show_post classified by hour: ",len(savg_by_hour))

[[datetime.time(14, 12), 117.0], [datetime.time(22, 12), 3.0]]

Total show_post classified by hour:  745


**Swap the columns and sort in descending order for ask_post**

In [None]:
swap_avg_by_hour=[]

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1],row[0]])

sorted_swap=sorted(swap_avg_by_hour,reverse=True)

for row in sorted_swap[0:5]:
    format_str="{} {}  average {} per post"
    if row[0]>1:
        print(format_str.format(row[1].strftime("%H:%M"),row[0],"comments"))
    else:
        print(format_str.format(row[1].strftime("%H:%M"),row[0],"comment"))

02:07 434.5  average comments per post
16:43 350.0  average comments per post
11:52 266.0  average comments per post
20:57 258.0  average comments per post
20:18 239.5  average comments per post


**Swap the columns and sort in descending order for show_post**

In [None]:
sswap_avg_by_hour=[]

for row in savg_by_hour:
    sswap_avg_by_hour.append([row[1],row[0]])

ssorted_swap=sorted(sswap_avg_by_hour,reverse=True)

for row in ssorted_swap[0:5]:
    format_str="{} {}  average {} per post"
    if row[0]>1:
        print(format_str.format(row[1].strftime("%H:%M"),row[0],"comments"))
    else:
        print(format_str.format(row[1].strftime("%H:%M"),row[0],"comment"))

18:55 206.0  average comments per post
12:02 197.0  average comments per post
14:39 160.5  average comments per post
14:12 117.0  average comments per post
00:28 92.0  average comments per post
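
As an alternative to building a swapped list first, `sorted` accepts a `key` function, so the `[time, average]` pairs can be ordered by their second element directly. This is a sketch with hypothetical sample data in the same shape as `avg_by_hour`.

```python
import datetime as dt

# Hypothetical [time, average] pairs in the same shape as avg_by_hour.
sample_avg = [
    [dt.time(4, 42), 2.33],
    [dt.time(14, 12), 117.0],
    [dt.time(5, 31), 1.0],
]

# Sort by the average (index 1), highest first, without building a swapped list.
top = sorted(sample_avg, key=lambda pair: pair[1], reverse=True)

for time_of_day, average in top[:2]:
    print("{} {:.2f} average comments per post".format(time_of_day.strftime("%H:%M"), average))
```

Sorting on the average alone also means that two entries with the same average are never compared by their time values.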


The **time zone** of the creation times of each ask post and show post is **Eastern Time in the US**; you can verify this [here](https://www.kaggle.com/hacker-news/hacker-news-posts).

For **ask posts**, the time of day with the highest average is **02:07**, with **434.5 average comments per post**. For me this would be 01:07 (Colombia time, UTC-5, while US Eastern Daylight Time is in effect).

For **show posts**, the time of day with the highest average is **18:55**, with **206 average comments per post**. For me this would be 17:55 (Colombia time, UTC-5, while US Eastern Daylight Time is in effect).
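
The conversion to Colombia time can be sketched with fixed UTC offsets from the standard library. The sketch assumes that US Eastern Daylight Time (UTC-4) is in effect; during Eastern Standard Time (UTC-5) both zones show the same clock time. The function name is hypothetical.

```python
import datetime as dt

# Assumed fixed offsets: Eastern Daylight Time (UTC-4) and Colombia (UTC-5, no DST).
EDT = dt.timezone(dt.timedelta(hours=-4), "EDT")
COT = dt.timezone(dt.timedelta(hours=-5), "COT")

def eastern_daylight_to_colombia(time_str):
    """Convert an 'HH:MM' time on an arbitrary date from EDT to Colombia time."""
    naive = dt.datetime.strptime(time_str, "%H:%M")  # date defaults to 1900-01-01
    return naive.replace(tzinfo=EDT).astimezone(COT).strftime("%H:%M")

print(eastern_daylight_to_colombia("02:07"))  # 01:07
print(eastern_daylight_to_colombia("18:55"))  # 17:55
```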


In [None]:
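# Count the times of day whose average reaches 10 comments and
# collect the times of day whose average reaches 100 comments
aTen=0
a_Hundred=[]

sTen=0
s_Hundred=[]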

for row in sorted_swap:
    if row[0] >= 10:
        aTen+=1

    if row[0] >= 100:
        a_Hundred.append(row[1])
        
        
for row in ssorted_swap:
    if row[0] >= 10:
        sTen+=1
    
    if row[0] >= 100:
        s_Hundred.append(row[1])

print("Times of day averaging at least 10 comments in Ask post: ",aTen)
print("Times of day averaging at least 10 comments in Show post: ",sTen)
print("Times of day averaging at least 100 comments in Ask post: ",len(a_Hundred))
print("Times of day averaging at least 100 comments in Show post: ",len(s_Hundred))

Times of day averaging at least 10 comments in Ask post:  235
Times of day averaging at least 10 comments in Show post:  185
Times of day averaging at least 100 comments in Ask post:  18
Times of day averaging at least 100 comments in Show post:  4
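
The threshold checks above repeat the same pattern for each list and each limit. A minimal sketch of a reusable helper follows; the name `times_at_or_above` and the sample pairs are hypothetical.

```python
def times_at_or_above(avg_pairs, threshold):
    """Return the times whose average is at least threshold; avg_pairs holds [average, time]."""
    return [time_of_day for average, time_of_day in avg_pairs if average >= threshold]

# Hypothetical [average, time] pairs in the same shape as sorted_swap.
sample_pairs = [[434.5, "02:07"], [12.0, "09:30"], [3.0, "05:31"]]
print(len(times_at_or_above(sample_pairs, 10)))  # 2
print(times_at_or_above(sample_pairs, 100))      # ['02:07']
```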


# Conclusion
>**Ask posts** get greater participation from the Hacker News community than **show posts**, with approximately 14 comments per post on average.

>**Show posts** come second, with approximately 10 comments per post on average (the total number of comments on show posts divided by the 1,162 show posts).

>Of the 20,100 posts, 8.68% are **ask posts** and 5.78% are **show posts**. This does not mean that these kinds of posts are the least popular on Hacker News; to determine that, we would need to study the composition of the remaining 85.54% (other posts).

>There are 235 times of day at which **ask posts** receive at least 10 comments per post on average, and 185 such times for **show posts**.

>There are 18 times of day at which **ask posts** receive at least 100 comments per post on average, and 4 such times for **show posts**.




**Times at which to publish a post to get at least 100 average comments per post:**

```
Ask post Show post
-------- --------
02:07:00 18:55:00
16:43:00 12:02:00
11:52:00 14:39:00
20:57:00 14:12:00
20:18:00
13:04:00
15:00:00
15:01:00
18:35:00
17:02:00
21:03:00
15:52:00
15:40:00
19:37:00
16:13:00
15:07:00
10:25:00
02:22:00
```