# Exploring Hacker News Posts


### Analysing Hacker News data to determine how likely a post is to get solid engagement from its audience

I'm doing a series of things here:
* Open the `HNposts.csv` file.
* Read the file and convert it into a list of lists.
* Store the header row separately, then remove it from the data.
* Assign the remaining rows to a variable named `hndata`.

In [1]:
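from csv import reader

# Open the file in a `with` block so it is closed automatically
with open("/storage/emulated/0/Python Worksheet/HNposts.csv") as opened_file:
    hndata = list(reader(opened_file))

header = hndata[0]   # keep the header row for reference
hndata = hndata[1:]  # data rows only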
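If you want to reuse these loading steps on other files, here is a minimal sketch of how they could be wrapped in a function. The name `load_csv`, the `encoding` default, and the tiny sample file are my own assumptions for illustration; they are not part of the original notebook.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for file objects
    with open(path, newline="", encoding=encoding) as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Made-up two-row file, written to a temporary location just for the demo
sample = "id,title\n1,Ask HN: example question\n2,Show HN: example project\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)

sample_header, sample_rows = load_csv(tmp.name)
os.remove(tmp.name)  # clean up the temporary file

print(sample_header)
print(len(sample_rows))
```

With the real data, the call would simply take the path to `HNposts.csv` instead of the temporary file.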

Now that I have the file opened and converted to a list of lists, I can begin by performing some basic operations on it. How about exploring the data for now 💁🏽‍♂️?

This is what I'm going to do👌🏽:
* I will write a function that can explore not just the `HNposts` dataset, but any dataset within defined limits (provided the data has been opened and converted into a list of lists, as in the cell above).
* In addition to exploring the data between any defined limits, the function will let us know how many rows and columns we have in the dataset. This is important as it makes us aware of exactly how big the data we are dealing with is👌🏽.

Get it🤷🏽‍♂️? Cute then👊🏽. Now let's go💪🏽! 

In [2]:
def explore_data(dataset, start, end, rows_and_columns = False) :
    display = dataset[start:end]
    for row in display:
        print (row)
        print ('\n') 
    
    if rows_and_columns:
        print ('There are {0} number of rows'.format(len(dataset))) 
        print ('There are {0} number of columns'.format(len(dataset[0]))) 

Let's try out our function to see if it works. We have to make sure there ain't no bugs and glitches here people 😉

In [3]:
#explore_data(hndata, 0, 3, True)

See? It works😄. Now let's focus. Notice that among the parameters defined in the function, there is a particular one: `rows_and_columns = False`. That is what I mentioned earlier at play. You see, its default value is `False`, which means we don't have to display the number of rows and columns if we don't want to.

However, if we have performed some cleaning on our data and want to know how many rows we are left with, that argument comes in handy. All we have to do then is set it to `True` when we call the function with the updated dataset.
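To make the default argument concrete, here is a small sketch that calls the function both ways. The function body is repeated so the snippet runs on its own, and the `toy` rows are made up purely for illustration.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    # Print the requested slice, one row at a time
    for row in dataset[start:end]:
        print(row)
        print('\n')

    # Only report the size when the caller asks for it
    if rows_and_columns:
        print('There are {0} number of rows'.format(len(dataset)))
        print('There are {0} number of columns'.format(len(dataset[0])))

# Made-up data: three rows, three columns
toy = [['1', 'a', 'x'], ['2', 'b', 'y'], ['3', 'c', 'z']]

explore_data(toy, 0, 2)            # default: rows only, no size report

cleaned = toy[:2]                  # pretend we dropped a row while cleaning
explore_data(cleaned, 0, 1, True)  # also reports 2 rows and 3 columns
```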

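I'm specifically interested in posts whose titles begin with either `Ask HN` or `Show HN`. Users submit `Ask HN` posts to ask the Hacker News community a specific question, and `Show HN` posts to show the community a project, product, or something else they find interesting.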


Then we are going to compare these two types of posts to determine the following:
* Do `Ask HN` or `Show HN` posts receive more comments on average?
* Do posts created at a certain time receive more comments on average?

Now I'm going to use a particular string method, `startswith`, to separate posts beginning with `Ask HN` and `Show HN` into two different lists.

Follow along 😉

In [4]:
ask_posts = [] 
show_posts = [] 
other_posts = [] 
for row in hndata:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)
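The same check can also be pulled out into a small function, which makes it easy to try on a few titles by hand. This is only a sketch: `classify_title` and the sample titles are hypothetical names and data I made up, not something from the dataset.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' for a post title (hypothetical helper)."""
    lowered = title.lower()  # lower-case first so 'ASK HN' and 'Ask HN' both match
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

# Made-up titles for illustration
sample_titles = [
    'Ask HN: example question',
    'SHOW HN: example project',
    'A story that mentions Ask HN later in the title',
]

buckets = {'ask': [], 'show': [], 'other': []}
for title in sample_titles:
    buckets[classify_title(title)].append(title)

for name, titles in buckets.items():
    print(name, len(titles))
```

Note that `startswith` only looks at the beginning of the string, so a title that merely mentions `Ask HN` somewhere in the middle lands in the `other` bucket.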

Let's now check if we have some values in our created lists.

In [5]:
print (len(ask_posts)) 
print (len(show_posts)) 


9139
10158


From the output, it is clear we already have some values in our lists. We can further crosscheck this by printing out a few rows from each list. You could do that if you have the time, but since it isn't pertinent to what I want to achieve here, you will sleep there doing it alone 😄.

So, we locomote 💪🏽. 

In [6]:
print (header) 

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


In [7]:
ask_comments = []
for row in ask_posts:
    comments_num = int(row[4])
    ask_comments.append(comments_num)

The code in the cell above basically does one job:
* Collect the comment counts (the `num_comments` column, index `4`) from the `Ask HN` posts and store them in a separate list.

Don't forget that our aim is to calculate the average of these comments. Yet, we locomote 🚙. 


Now let's find the average for these figures. 

In [8]:
average_ask_comments = sum(ask_comments)/len(ask_posts)
print (average_ask_comments) 

10.393478498741656


Let's use the same procedure to find the average of the `Show HN` posts. 


In [9]:
show_comments = []
for row in show_posts:
    comments_num = int(row[4])
    show_comments.append(comments_num)

average_show_comments = sum(show_comments)/len(show_posts)
print (average_show_comments) 

4.886099625910612


In our last two code cells, we have outputs `10.39` and `4.89` for the average number of comments on `Ask HN` and `Show HN` posts respectively.

This means that, on average, `Ask HN` posts received more comments (and so more engagement) than `Show HN` posts.
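Since the two averaging cells repeat the same steps, they could be folded into one function. Here is a minimal sketch, assuming the comment count sits at index `4` as in the header printed earlier. The name `average_comments` and the empty-list fallback are my own choices; the two sample rows are the `Ask HN` rows shown in this notebook.

```python
def average_comments(posts, comments_index=4):
    """Average number of comments over a list of post rows (hypothetical helper)."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Two Ask HN rows taken from the dataset
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
]

print(average_comments(sample_posts))  # (7 + 3) / 2 = 5.0
```

On the full lists, the calls would be `average_comments(ask_posts)` and `average_comments(show_posts)`.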

Now we want to determine if `Ask HN` posts created at a certain time are more likely to attract comments.

We can do this by building a frequency table of `Ask HN` posts by hour, along with the total number of comments on the posts created in each hour.


Let's attempt this. 

In [10]:
print (ask_posts[:2]) 

[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17']]


In [11]:
import datetime as dt

posts_by_hour = {}     # hour ("00" to "23") -> number of Ask HN posts
comments_by_hour = {}  # hour ("00" to "23") -> total comments on those posts
for row in ask_posts:
    comments = int(row[4])
    # Parse the created_at column, e.g. "9/26/2016 2:53"
    created_at = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = created_at.strftime("%H")
    if hour not in posts_by_hour:
        posts_by_hour[hour] = 1
        comments_by_hour[hour] = comments
    else:
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += comments
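The `if`/`else` bookkeeping in the cell above can also be written with `collections.defaultdict`, which starts every missing key at zero. This is just an alternative sketch, not a replacement for the cell: the function name `tally_by_hour` is hypothetical, the first two sample rows come from the dataset, and the third row is made up so that one hour has more than one post.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(posts, date_index=6, comments_index=4):
    """Return (posts_by_hour, comments_by_hour) as plain dicts (hypothetical helper)."""
    posts_by_hour = defaultdict(int)     # missing hours start at 0
    comments_by_hour = defaultdict(int)
    for row in posts:
        created_at = dt.datetime.strptime(row[date_index], "%m/%d/%Y %H:%M")
        hour = created_at.strftime("%H")  # zero-padded hour, e.g. "02"
        posts_by_hour[hour] += 1
        comments_by_hour[hour] += int(row[comments_index])
    return dict(posts_by_hour), dict(comments_by_hour)

sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['0', 'Ask HN: made-up example row', '', '1', '5', 'nobody', '9/26/2016 2:10'],  # made up
]

sample_counts, sample_totals = tally_by_hour(sample_posts)
print(sample_counts)
print(sample_totals)
```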

In [12]:
print (posts_by_hour) 

{'02': 269, '01': 282, '22': 383, '21': 518, '19': 552, '17': 587, '15': 646, '14': 513, '13': 444, '11': 312, '10': 282, '09': 222, '07': 226, '03': 271, '23': 343, '20': 510, '16': 579, '08': 257, '00': 301, '18': 614, '12': 342, '04': 243, '06': 234, '05': 209}


In [13]:
print (comments_by_hour) 

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


With the two outputs we have above, we can further calculate the average number of comments on ask posts made at a particular hour. 

In [14]:
avg_comments_by_hour = []
for hour in posts_by_hour:
    if hour in comments_by_hour:
        average = comments_by_hour[hour]/posts_by_hour[hour]
        # Append inside the `if` so each hour is only stored with its own average
        avg_comments_by_hour.append([hour, average]) 

In [15]:
print (avg_comments_by_hour) 

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
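As a quick sanity check on one entry, here is the arithmetic for hour `15` worked out by hand. The two numbers are copied from the `posts_by_hour` and `comments_by_hour` outputs printed above; the variable names are just for this sketch.

```python
# Values for hour '15', copied from the two dictionaries printed earlier
posts_at_15 = 646
comments_at_15 = 18525

average_at_15 = comments_at_15 / posts_at_15
print(round(average_at_15, 2))  # 28.68, matching the '15' entry in the list
```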


Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.

In [16]:
swap_avg_comments_by_hour = []
for row in avg_comments_by_hour:
    swap_avg_comments_by_hour.append([row[1], row[0]])
print (swap_avg_comments_by_hour) 

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


In [18]:
sorted_swap = sorted(swap_avg_comments_by_hour, reverse = True)
print (sorted_swap)

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10'], [9.7119341563786, '04'], [9.692007797270955, '14'], [9.449744463373083, '17'], [9.190661478599221, '08'], [8.96474358974359, '11'], [8.804177545691905, '22'], [8.794258373205741, '05'], [8.749019607843136, '20'], [8.687258687258687, '21'], [7.948339483394834, '03'], [7.94299674267101, '18'], [7.713298791018998, '16'], [7.5647840531561465, '00'], [7.407801418439717, '01'], [7.163043478260869, '19'], [7.013274336283186, '07'], [6.782051282051282, '06'], [6.696793002915452, '23'], [6.653153153153153, '09']]
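The swap-then-sort steps work fine, but `sorted` can also be told which element to sort on through its `key` argument, so the swap is not strictly needed. A minimal sketch, using a handful of `[hour, average]` pairs rounded from the output above (the name `avg_sample` is made up):

```python
# A few [hour, average] pairs, rounded from the notebook's output
avg_sample = [['02', 11.14], ['15', 28.68], ['09', 6.65], ['13', 16.32]]

# Sort on the second element (the average), highest first
by_average = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in by_average[:3]:
    print(hour, average)
```

One small difference: with `key`, rows that tie on the average keep their original relative order, while the swapped version falls back to comparing the hour strings.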


In [21]:
top_5 = (sorted_swap[:5])
print (top_5)

[[28.676470588235293, '15'], [16.31756756756757, '13'], [12.380116959064328, '12'], [11.137546468401487, '02'], [10.684397163120567, '10']]


In [25]:
hour = "15"
hour = dt.datetime.strptime(hour, "%H")
hour = hour.strftime("%H:%M")
print (hour) 

15:00


In [32]:
for row in top_5:
    hour = row[1]
    hour = dt.datetime.strptime(hour, "%H") 
    hour = hour.strftime("%H:%M") 
    print ("{0}: {1:.2f} comments per post".format(hour, row[0]))
    print ("\n") 

15:00: 28.68 comments per post


13:00: 16.32 comments per post


12:00: 12.38 comments per post


02:00: 11.14 comments per post


10:00: 10.68 comments per post




From the output, we can infer that, in this dataset, an `Ask HN` post has the best chance of attracting comments if it is created around `15:00` (3 p.m.).