## Hacker News Dataset Analysis

> Hacker News works much like Reddit: users submit posts related to startups, science, and technology, and those posts may then receive votes and comments. This analysis focuses on post titles and creation times, with the following goals:

#### Project Goals
> 1. Determining which posts receive more comments: those whose titles start with *Ask HN* or those whose titles start with *Show HN*
> 2. Examining whether posts created at certain times attract more responses


This analysis uses only the Python standard library; it does not rely on NumPy or pandas.

In [44]:
from csv import reader

#Read every row into a list; the with-block closes the file afterwards
with open('hacker_news.csv') as f:
    hn=list(reader(f))
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]

In [45]:
#Removing the header row from the dataset
header=hn[0]
hn=hn[1:]

In [46]:
#Print the title (column index 1) of the first three posts
for row in hn[0:3]:
    title=row[1]
    print(title)

Interactive Dynamic Video
How to Use Open Source and Shut the Fuck Up at the Same Time
Florida DJs May Face Felony for April Fools' Water Joke


In [47]:
#Separating the dataset by whether the title starts with 'Ask HN' or 'Show HN'
#(case-insensitive); every other post goes into other_posts

ask_posts=[]
show_posts=[]
other_posts=[]

for row in hn:
    title=row[1].lower()
    if title.startswith('ask hn'):
        ask_posts.append(row)
    elif title.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
        
ttl_ask_posts=len(ask_posts)
ttl_show_posts=len(show_posts)
ttl_other_posts=len(other_posts)

print(f"The total number of 'Ask HN' posts is {ttl_ask_posts}")
print(f"The total number of 'Show HN' posts is {ttl_show_posts}")
print(f"The total number of other posts is {ttl_other_posts}")

The total number of 'Ask HN' posts is 1744
The total number of 'Show HN' posts is 1162
The total number of other posts is 17194


In [48]:
#Function that returns the total number of comments in a dataset
#and the average number of comments per post

def comment_count(dataset,index,ttl_post_count):
    total_num_comments=0
    for row in dataset:
        comments=float(row[index])
        total_num_comments+=comments
    average_comm_per_post=total_num_comments/ttl_post_count
    return total_num_comments, average_comm_per_post


In [49]:
#Determining the total number of comments for the ask posts and
#the respective average number of comments per post

total_comments, avg_comments_per_post = comment_count(ask_posts,4,ttl_ask_posts)
print(total_comments)
print('\n')
print(avg_comments_per_post)

24483.0


14.038417431192661


In [50]:
#Determining the total number of comments for the show posts and
#the respective average number of comments per post

total_comments, avg_comments_per_post = comment_count(show_posts,4,ttl_show_posts)
print(total_comments)
print('\n')
print(avg_comments_per_post)

11988.0


10.31669535283993


In [51]:
#Determining the total number of comments for the other posts and
#the respective average number of comments per post

total_comments, avg_comments_per_post = comment_count(other_posts,4,ttl_other_posts)
print(total_comments)
print('\n')
print(avg_comments_per_post)

462055.0


26.8730371059672


> The above analysis indicates that there are more *Ask HN* posts (1,744) than *Show HN* posts (1,162). Ask HN posts also drew more comments in total: 24,483, compared with 11,988 for Show HN.

> On average, *Ask HN* posts receive more comments (about 14 per post) than *Show HN* posts (about 10 per post). A plausible explanation is that Ask HN posts inherently invite a response. Posts in neither category average about 27 comments each, more than either group.
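
Each per-post average is a category's total comment count divided by its number of posts. A minimal sketch of that arithmetic, using the totals and post counts printed by the cells above; the helper name `average_comments` is hypothetical and not part of the original notebook:

```python
def average_comments(total_comments, post_count):
    """Return the mean number of comments per post."""
    if post_count <= 0:
        raise ValueError("post_count must be positive")
    return total_comments / post_count

# Totals and post counts copied from the notebook outputs above
ask_avg = average_comments(24483, 1744)
show_avg = average_comments(11988, 1162)
other_avg = average_comments(462055, 17194)

print(round(ask_avg, 2), round(show_avg, 2), round(other_avg, 2))
```

The guard against a zero post count is an addition for illustration; `comment_count` above assumes a non-empty category.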

In [53]:
ask_posts[0][-1]

'8/16/2016 9:55'

In [54]:
#Function that converts the date column from strings to datetime objects

def date_conv(dataset,index):
    import datetime as dt
    for row in dataset:
        date=row[index]
        dated=dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
        row[index]=dated
    return dataset
        

In [55]:
#Convert date column from string to datetime
ask_posts=date_conv(ask_posts,-1)

In [56]:
show_posts=date_conv(show_posts,-1)

In [57]:
#Function that totals the number of comments by hour of day within each month

def comm_hour(dataset,col_index,num_comments_index):
    import datetime as dt
    monthly_hour_freq={}
    for row in dataset:
        comm_num=int(row[num_comments_index])
        time=row[col_index]
        time_day=dt.datetime.strftime(time, "%m %H")
        month,hour=time_day.split(' ')
        if month in monthly_hour_freq:
            if hour in monthly_hour_freq[month]:
                monthly_hour_freq[month][hour]+=comm_num
            else:
                monthly_hour_freq[month][hour]=comm_num
        else:
            monthly_hour_freq[month]={}
            monthly_hour_freq[month][hour]=comm_num
    return monthly_hour_freq

In [58]:
#Create a nested dictionary of comment totals keyed by month, then by hour
#Plotting these would show how commenting varies
#by hour within a given month
#That is outside the scope of the current objective, though
month_and_hour_ask=comm_hour(ask_posts,-1,-3)
month_and_hour_show=comm_hour(show_posts,-1,-3)
print(month_and_hour_ask['08'])
print('\n')
print(month_and_hour_show['08'])

{'09': 48, '14': 216, '18': 77, '20': 556, '17': 188, '23': 16, '01': 98, '00': 11, '13': 140, '19': 40, '11': 38, '21': 38, '05': 173, '10': 128, '08': 163, '12': 116, '15': 1093, '22': 31, '02': 16, '07': 37, '04': 4, '16': 113, '06': 30, '03': 44}


{'16': 149, '07': 6, '15': 33, '02': 1, '18': 12, '06': 2, '14': 29, '13': 329, '23': 64, '19': 37, '21': 4, '17': 184, '05': 7, '11': 84, '12': 93, '03': 35, '08': 29, '20': 34, '10': 22, '00': 1, '01': 3, '22': 5}


In [59]:
#Function for determining total num of comments by month

def tot_monthly_com(data_dict):
    monthly_freq={}
    for month in data_dict:
        l_dict=data_dict[month]
        tot=sum(l_dict.values())
        monthly_freq[month]=tot
    return monthly_freq
            

In [60]:
#Applying the function on the earlier monthly and hourly 
#nested dictionary on the ask posts

tot_monthly_com(month_and_hour_ask)

{'08': 3414,
 '11': 1657,
 '05': 2560,
 '10': 1089,
 '09': 3466,
 '04': 1596,
 '02': 940,
 '06': 2144,
 '01': 2178,
 '03': 2655,
 '07': 1662,
 '12': 1122}

In [61]:
#Applying the function on the earlier monthly and hourly 
#nested dictionary on the show posts

tot_monthly_com(month_and_hour_show)

{'11': 805,
 '04': 876,
 '07': 687,
 '01': 1092,
 '03': 1280,
 '09': 1559,
 '08': 1163,
 '06': 868,
 '02': 1041,
 '10': 947,
 '12': 533,
 '05': 1137}

> The above analysis indicates that, for Ask HN posts, September and August are the months with the most comments (3,466 and 3,414 respectively).

> A more in-depth analysis could follow the same approach as the month-and-hour tables above, building frequency tables of comments by year and month to check whether this monthly pattern holds from year to year.
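
One way to build such a year-and-month table is sketched below. The function name `comments_by_year_month` and the sample rows are hypothetical; the rows only mimic the dataset's column order, and the function accepts the date either as a raw string or as a `datetime` already converted by `date_conv`.

```python
import datetime as dt

def comments_by_year_month(dataset, time_index, comm_index):
    """Sum comment counts per year-month key such as '2016-08'."""
    totals = {}
    for row in dataset:
        created = row[time_index]
        # Parse the raw string if the column has not been converted yet
        if isinstance(created, str):
            created = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")
        key = created.strftime("%Y-%m")
        totals[key] = totals.get(key, 0) + int(row[comm_index])
    return totals

# Hypothetical sample rows: id, title, url, points, comments, author, created_at
sample = [
    ['1', 'Ask HN: a', '', '5', '10', 'u1', '8/16/2016 9:55'],
    ['2', 'Ask HN: b', '', '3', '4', 'u2', '8/2/2016 15:10'],
    ['3', 'Ask HN: c', '', '1', '7', 'u3', '9/30/2015 23:01'],
]
year_month_totals = comments_by_year_month(sample, -1, -3)
print(year_month_totals)
```

Keying on year and month together keeps, say, September 2015 and September 2016 apart, which the month-only tables cannot do.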

> Plotting the data would give a clearer picture of how Ask HN and Show HN compare. Even from the tables, though, the peaks differ: August and September have the most comments for Ask HN posts, whereas September and March have the most for Show HN posts.
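
As a text-only stand-in for a proper plot, and in keeping with the standard-library constraint, the monthly totals can be drawn as rows of `#` characters. This is a rough sketch: `text_bar_chart` is a hypothetical helper, and the dictionary is copied from the Ask HN output above.

```python
def text_bar_chart(freq, width=40):
    """Return one bar line per key, scaled so the largest value spans `width`."""
    largest = max(freq.values()) or 1  # avoid dividing by zero if all values are 0
    lines = []
    for key in sorted(freq):
        bar = '#' * round(freq[key] / largest * width)
        lines.append(f"{key} {bar} {freq[key]}")
    return lines

# Monthly comment totals for Ask HN posts, copied from the output above
ask_monthly = {'08': 3414, '11': 1657, '05': 2560, '10': 1089,
               '09': 3466, '04': 1596, '02': 940, '06': 2144,
               '01': 2178, '03': 2655, '07': 1662, '12': 1122}

for line in text_bar_chart(ask_monthly):
    print(line)
```

Sorting the keys puts the months in calendar order, so neighbouring bars can be compared directly.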

In [63]:
#Overall examination of time vs commenting rate
#Function for generating a frequency table of
#time and total number of comments

def comm_time(dataset,time_index,comm_index):
    import datetime as dt
    time_comm_agg={}
    for row in dataset:
        comm_num=int(row[comm_index])
        time=row[time_index]
        time_day=dt.datetime.strftime(time, "%H")
        if time_day in time_comm_agg:
            time_comm_agg[time_day]+=comm_num                                       
        else:
            time_comm_agg[time_day]=comm_num
    return time_comm_agg


In [64]:
#Function that converts hourly comment totals into percentages of the overall total

def timewise_com(data):
    total=sum(data.values())   #computed once rather than on every iteration
    time_freq={}
    for item in data:
        time_freq[item]=round(data[item]/total*100,2)
    return time_freq


In [65]:
#Application of the functions on the Ask HN posts
timewise_com_ask=comm_time(ask_posts,-1,-3)
timewise_com(timewise_com_ask)

{'09': 1.03,
 '13': 5.12,
 '10': 3.24,
 '14': 5.78,
 '16': 7.41,
 '23': 2.22,
 '12': 2.81,
 '17': 4.68,
 '15': 18.29,
 '21': 7.13,
 '20': 7.03,
 '02': 5.64,
 '18': 5.88,
 '03': 1.72,
 '05': 1.9,
 '19': 4.85,
 '01': 2.79,
 '22': 1.96,
 '08': 2.01,
 '04': 1.38,
 '00': 1.83,
 '06': 1.62,
 '07': 1.09,
 '11': 2.62}

In [66]:
#Application of the functions on the Show HN posts
timewise_com_show=comm_time(show_posts,-1,-3)
timewise_com(timewise_com_show)

{'14': 9.64,
 '22': 4.75,
 '18': 8.02,
 '07': 2.49,
 '20': 5.11,
 '05': 0.48,
 '16': 9.04,
 '19': 4.5,
 '15': 5.27,
 '03': 2.39,
 '17': 7.6,
 '06': 1.18,
 '02': 1.06,
 '13': 7.89,
 '08': 1.38,
 '21': 2.27,
 '04': 2.06,
 '11': 4.1,
 '12': 6.01,
 '23': 3.73,
 '09': 2.43,
 '01': 2.05,
 '10': 2.48,
 '00': 4.06}

> The analysis above indicates that Ask HN posts created at 3 pm attract by far the largest share of comments (18.29%), followed by those created at 4 pm (7.41%).
> In contrast, Show HN comments are concentrated on posts created at 2 pm (9.64%) and 4 pm (9.04%).
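
The peak hours can be read off programmatically by sorting each table by its share. A small sketch, assuming the percentage dictionaries printed above; `top_hours` is a hypothetical helper name:

```python
def top_hours(shares, n=2):
    """Return the n (hour, share) pairs with the largest share, highest first."""
    return sorted(shares.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Percentage of comments by hour of posting, copied from the outputs above
ask_shares = {'09': 1.03, '13': 5.12, '10': 3.24, '14': 5.78, '16': 7.41,
              '23': 2.22, '12': 2.81, '17': 4.68, '15': 18.29, '21': 7.13,
              '20': 7.03, '02': 5.64, '18': 5.88, '03': 1.72, '05': 1.9,
              '19': 4.85, '01': 2.79, '22': 1.96, '08': 2.01, '04': 1.38,
              '00': 1.83, '06': 1.62, '07': 1.09, '11': 2.62}
show_shares = {'14': 9.64, '22': 4.75, '18': 8.02, '07': 2.49, '20': 5.11,
               '05': 0.48, '16': 9.04, '19': 4.5, '15': 5.27, '03': 2.39,
               '17': 7.6, '06': 1.18, '02': 1.06, '13': 7.89, '08': 1.38,
               '21': 2.27, '04': 2.06, '11': 4.1, '12': 6.01, '23': 3.73,
               '09': 2.43, '01': 2.05, '10': 2.48, '00': 4.06}

print(top_hours(ask_shares))   # [('15', 18.29), ('16', 7.41)]
print(top_hours(show_shares))  # [('14', 9.64), ('16', 9.04)]
```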

> This is a relatively superficial analysis. A more informative approach would examine the average number of comments per post for each hour. That would help determine whether the shares above simply reflect more posts being made at those hours, rather than each post attracting more responses.
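
A sketch of that per-post view: for each hour, divide the total comments by the number of posts created in that hour. The function name `avg_comments_per_hour` and the sample rows are hypothetical; the rows only follow the dataset's column order.

```python
import datetime as dt

def avg_comments_per_hour(dataset, time_index, comm_index):
    """Return {hour: average comments per post} for posts created in that hour."""
    totals = {}
    counts = {}
    for row in dataset:
        created = row[time_index]
        # Parse the raw string if the column has not been converted yet
        if isinstance(created, str):
            created = dt.datetime.strptime(created, "%m/%d/%Y %H:%M")
        hour = created.strftime("%H")
        totals[hour] = totals.get(hour, 0) + int(row[comm_index])
        counts[hour] = counts.get(hour, 0) + 1
    return {hour: totals[hour] / counts[hour] for hour in totals}

# Hypothetical sample rows: id, title, url, points, comments, author, created_at
sample = [
    ['1', 'Ask HN: a', '', '5', '10', 'u1', '8/16/2016 15:05'],
    ['2', 'Ask HN: b', '', '3', '20', 'u2', '8/17/2016 15:40'],
    ['3', 'Ask HN: c', '', '1', '6', 'u3', '8/18/2016 9:55'],
]
hourly_avg = avg_comments_per_hour(sample, -1, -3)
print(hourly_avg)
```

An hour with many posts and a modest average would point to posting volume, not per-post engagement, as the source of its large share.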


### Conclusion

The analysis notes that:
> 1. There are more Ask HN posts (1,744) than Show HN posts (1,162), and Ask HN posts drew more comments in total (24,483 versus 11,988)
> 2. Ask HN posts receive more comments per post (about 14) than Show HN posts (about 10)
> 3. For Ask HN posts, those created at 3 pm and 4 pm receive the most comments