# Hacker News Posts With Most Comments

### Introduction

In this project, we'll work with a data set of submissions to the popular technology site [Hacker News](https://news.ycombinator.com).
Hacker News is a site started by the startup incubator [Y Combinator](https://www.ycombinator.com), where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of Hacker News' listings can get hundreds of thousands of visitors as a result.

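This data set consists of Hacker News posts from the 12 months up to September 26, 2016 and can be found [here](https://news.ycombinator.com). The full data set contains almost 300,000 rows; the copy analyzed below is evidently a smaller subset, since the counts printed later in this notebook sum to 20,099 posts. Below are descriptions of the columns: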

* **id**: The unique identifier from Hacker News for the post
* **title**: The title of the post
* **url**: The URL that the post links to, if the post has a URL
* **num_points**: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* **num_comments**: The number of comments that were made on the post
* **author**: The username of the person who submitted the post
* **created_at**: The date and time at which the post was submitted

On Hacker News, users can ask questions, mostly related to tech and startups, and can also share projects, products or anything else of interest to the tech community. The norm is to prefix a title with "Ask HN" to ask a question and with "Show HN" to show the community something interesting.

The focus of this analysis is to compare the two types of posts (Ask HN and Show HN) in order to determine:
* which of the two types receives more comments on average
* the hours of the day in which posts are likely to receive the most comments on average

We'll begin our analysis by opening the data set (a CSV file) and removing the header row.

In [1]:
opened_file = open("../Dataset/hacker_news.csv", encoding = 'utf8')

In [2]:
# Importing the reader function from the csv module
from csv import reader
read_file = reader(opened_file)
hn = list(read_file) # Parsing the file as a list of lists

In [3]:
# Displaying the first five rows of the data set
hn[0:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

In [4]:
# Removing the header row
headers = hn[0]
hn = hn[1:]
hn[0] # The first row now excludes the header row

['12224879',
 'Interactive Dynamic Video',
 'http://www.interactivedynamicvideo.com/',
 '386',
 '52',
 'ne0phyte',
 '8/4/2016 11:52']
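
The loading cells above leave the file handle open. A minimal sketch of the same steps, assuming the same CSV layout, reads the rows through a helper so that the file can be opened in a `with` block and closed automatically. The function name `load_rows` and the in-memory sample are hypothetical and exist only so that the example runs on its own.

```python
import csv
import io

def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory stand-in for hacker_news.csv (illustrative data only)
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,6,someone,8/16/2016 9:55\n"
)
headers, rows = load_rows(sample)
print(headers)
print(rows)

# With the real file, the call would look like this:
# with open("../Dataset/hacker_news.csv", encoding="utf8", newline="") as f:
#     headers, hn = load_rows(f)
```

Every value is still read as a string, so numeric columns such as `num_comments` need an explicit `int()` conversion later, exactly as in the cells that follow.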

### Analyzing "Ask HN" and "Show HN" Posts

#### Isolating "Ask HN", "Show HN" and Other Posts in Separate Lists

The cell below takes the following steps:

* Creates three empty lists, `ask_posts`, `show_posts` and `other_posts`, to hold the posts whose titles start with "Ask HN", the posts whose titles start with "Show HN", and all remaining posts, respectively.
* Loops through the `hn` list of lists and, for each iteration:
  - assigns the second column (index 1), which holds the title of the post, to a variable called `title`
  - checks whether the lowercased `title` starts with `ask hn` and, if so, appends the row to the `ask_posts` list
  - otherwise, checks whether the lowercased `title` starts with `show hn` and, if so, appends the row to the `show_posts` list
  - otherwise, appends the row to the `other_posts` list
* Prints the length of the `ask_posts` list
* Prints the length of the `show_posts` list
* Prints the length of the `other_posts` list

_Note: The `lower` method returns a lowercase copy of a string. On inspection, titles that begin with `Ask HN` or `Show HN` are not capitalized consistently. Without the conversion to lowercase the code would not raise an error, but it would silently miss any title whose capitalization differs from the prefix we test for._

_Also, the string method `startswith` checks whether a string starts with a particular substring._

In [5]:
# Splitting the rows into Ask HN, Show HN and other posts by title prefix
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]
    
    if (title.lower().startswith('ask hn')):
        ask_posts.append(row)
        
    elif (title.lower().startswith('show hn')):
        show_posts.append(row)
        
    else:
        other_posts.append(row)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17193
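
The prefix test in the cell above can also be isolated in a small function, which makes it easy to check on individual titles. This is only a sketch: the name `classify_title` and the sample titles are hypothetical, not part of the original notebook.

```python
def classify_title(title):
    """Return 'ask', 'show' or 'other' based on the title prefix (case-insensitive)."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Sample titles chosen to exercise each branch, including odd capitalization
for t in ["ASK HN: Anyone here?", "Show HN: My project", "Interactive Dynamic Video"]:
    print(t, "->", classify_title(t))
```

Because the title is lowercased before the comparison, `ASK HN`, `Ask HN` and `ask hn` all land in the same group.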


The output above shows that the majority of posts use neither the `Ask HN` nor the `Show HN` prefix.
There are 17,193 other posts, 1,744 posts that begin with "Ask HN" and 1,162 posts that begin with "Show HN".

_Note: This work focuses on analyzing `Ask Posts` and `Show Posts`, so we won't analyze `Other posts` any further._

In [6]:
ask_posts[0:5] # Showing the first five rows of Ask Posts

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

In [7]:
show_posts[0:5] # Showing the first five rows of Show Posts

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

### Determining Which Posts Received More Comments: `Ask Posts` or `Show Posts`

The code cell below takes the following steps:

* Creates a variable called `total_ask_comments` with an initial value of zero
* Loops through the `ask_posts` list and, for each iteration:
  - converts the number of comments on the post (index 4, the `num_comments` column) to an integer and assigns it to the variable `ask_comments`
  - adds `ask_comments` to the running total in `total_ask_comments`
* Calculates the average number of comments across all `ask_posts` by dividing `total_ask_comments` by the number of posts in `ask_posts`
* Prints the total number of comments for all posts in `ask_posts`
* Prints the average number of comments for all posts in `ask_posts`

In [8]:
# computing the total and average number of comments in 'Ask Posts' 
total_ask_comments = 0

for row in ask_posts:
    ask_comments = int(row[4])
    total_ask_comments += ask_comments

avr_ask_comments = total_ask_comments / len(ask_posts)      
print(total_ask_comments)
print(avr_ask_comments)

24483
14.038417431192661


The output above shows that the posts in `ask_posts` received 24,483 comments in total, an average of about 14.04 comments per post.

In other words, a post that starts with `Ask HN` receives about 14 comments on average.

The code cell below takes the following steps:

* Creates a variable called `total_show_comments` with an initial value of zero
* Loops through the `show_posts` list and, for each iteration:
  - converts the number of comments on the post (index 4, the `num_comments` column) to an integer and assigns it to the variable `show_comments`
  - adds `show_comments` to the running total in `total_show_comments`
* Calculates the average number of comments across all `show_posts` by dividing `total_show_comments` by the number of posts in `show_posts`
* Prints the total number of comments for all posts in `show_posts`
* Prints the average number of comments for all posts in `show_posts`

In [9]:
# computing the total and average number of comments in 'Show Posts' 

total_show_comments = 0

for row in show_posts:
    show_comments = int(row[4])
    total_show_comments += show_comments

# The average is computed once, after the loop has finished summing
avr_show_comments = total_show_comments / len(show_posts)
print(total_show_comments)
print(avr_show_comments)

11988
10.31669535283993


The output above shows that the posts in `show_posts` received 11,988 comments in total, an average of about 10.32 comments per post.

In other words, a post that starts with `Show HN` receives about 10 comments on average, compared with about 14 for a post that starts with `Ask HN`.
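
The two cells above repeat the same total-and-average logic for each list. One way to avoid the duplication is sketched below; the helper name `average_comments` and its `comments_index` parameter are assumptions made for illustration, and the two sample rows are copied from the `ask_posts` preview shown earlier.

```python
def average_comments(posts, comments_index=4):
    """Return (total, average) of the comment counts stored as strings in each row."""
    total = sum(int(row[comments_index]) for row in posts)
    average = total / len(posts) if posts else 0.0  # guard against an empty list
    return total, average

# Two rows taken from the ask_posts preview above
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '', '2', '6',
     'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
     '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
total, average = average_comments(sample_posts)
print(total, average)  # 35 17.5
```

Calling the same function on `ask_posts` and `show_posts` would then produce both pairs of figures with one piece of code.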

#### Going Deeper in Analyzing `Ask Posts`

Next, we'll determine whether ask posts created at a certain time of day are more likely to attract comments. We'll use the following steps to perform this analysis:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive per hour created.

The next few code cells implement the first step. The first cell:

* imports the `datetime` module as `dt`
* creates an empty list, `result_list`, that will hold one tuple per post with two elements: the timestamp at which the post was created and the number of comments it received
* loops through the `ask_posts` list and, for each iteration:
  - isolates the datetime column (index 6) and assigns it to a variable called `created_at`
  - isolates the number of comments (index 4), converts it to an integer and assigns it to a variable called `num_comments`
  - appends the tuple `(created_at, num_comments)` to `result_list`
* displays the first ten rows of `result_list`

The hour itself is extracted from the timestamp in a later cell.

In [10]:
# The focus will be on the ask_posts because they have more comments on the average

import datetime as dt

result_list = []

for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    # append takes a single argument, so the two values are wrapped in one tuple
    result_list.append((created_at, num_comments))
    
    
result_list[0:10]

[('8/16/2016 9:55', 6),
 ('11/22/2015 13:43', 29),
 ('5/2/2016 10:14', 1),
 ('8/2/2016 14:20', 3),
 ('10/15/2015 16:38', 17),
 ('9/26/2015 23:23', 1),
 ('4/22/2016 12:24', 4),
 ('11/16/2015 9:22', 1),
 ('2/24/2016 17:57', 1),
 ('6/4/2016 17:17', 2)]

In [11]:
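# result_list has only 1,744 rows, so slicing from index 9129 returns an empty list.
# To display the last ten rows regardless of length, use result_list[-10:] instead.

result_list[9129:]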

[]

Below, we create two empty dictionaries, `counts_by_hour` and `comments_by_hour`, and then loop through `result_list` (which holds the creation timestamp and the number of comments for each post). For each iteration, the cell:

* isolates the timestamp string, the first element of the row, in a variable called `date_and_hour`
* parses that string with `dt.datetime.strptime`, that is, the `strptime` method of the `datetime` class inside the `datetime` module (imported here as `dt`), using the format `%m/%d/%Y %H:%M`, and assigns the resulting `datetime` object to `date`
* extracts the hour from `date` as a two-digit string with `strftime('%H')` and stores it in a variable called `hour`
* if `hour` is not yet a key in the `counts_by_hour` dictionary, creates the key in both dictionaries: `counts_by_hour[hour]` is set to 1 and `comments_by_hour[hour]` is set to the number of comments on the post
* if `hour` already exists as a key in `counts_by_hour`, increments `counts_by_hour[hour]` by one and increments `comments_by_hour[hour]` by the number of comments on the post. In other words, `counts_by_hour` ends up holding the number of posts created in each hour and `comments_by_hour` the total number of comments those posts received.

In [12]:
counts_by_hour = {}
comments_by_hour = {}

for row in result_list:
    date_and_hour = row[0]
    date = dt.datetime.strptime(date_and_hour, "%m/%d/%Y %H:%M")
    hour = date.strftime('%H')
    
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
        
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

print(comments_by_hour)
print(counts_by_hour)

{'09': 251, '13': 1253, '10': 793, '14': 1416, '16': 1814, '23': 543, '12': 687, '17': 1146, '15': 4477, '21': 1745, '20': 1722, '02': 1381, '18': 1439, '03': 421, '05': 464, '19': 1188, '01': 683, '22': 479, '08': 492, '04': 337, '00': 447, '06': 397, '07': 267, '11': 641}
{'09': 45, '13': 85, '10': 59, '14': 107, '16': 108, '23': 68, '12': 73, '17': 100, '15': 116, '21': 109, '20': 80, '02': 58, '18': 109, '03': 54, '05': 46, '19': 110, '01': 60, '22': 71, '08': 48, '04': 47, '00': 55, '06': 44, '07': 34, '11': 58}
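
The tallying logic above can be written more compactly with `collections.defaultdict`, which removes the need for the explicit `if hour not in counts_by_hour` branch. The sketch below is an alternative, not the notebook's own method; the function name `tally_by_hour` is hypothetical, and the three sample pairs are taken from the `result_list` preview shown earlier.

```python
import datetime as dt
from collections import defaultdict

def tally_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Return (counts, comments) dictionaries keyed by two-digit hour strings."""
    counts = defaultdict(int)    # missing keys start at 0
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).strftime("%H")
        counts[hour] += 1
        comments[hour] += num_comments
    return dict(counts), dict(comments)

sample = [("8/16/2016 9:55", 6), ("5/2/2016 10:14", 1), ("11/16/2015 9:22", 1)]
counts, comments = tally_by_hour(sample)
print(counts)    # {'09': 2, '10': 1}
print(comments)  # {'09': 7, '10': 1}
```

Note that `strptime` accepts the non-padded values in the data (`8/16/2016 9:55`), while `strftime('%H')` always returns a zero-padded hour such as `'09'`.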


### Finding the Average Number of Comments in Each Hour

The code cell below:

* creates an empty list and assigns it to a variable called `avr_comments_per_hour`
* loops through the keys of the `comments_by_hour` dictionary and, for each iteration, appends a two-element list to `avr_comments_per_hour`:
  - the first element is the key itself (the hour)
  - the second element is the value for that key in `comments_by_hour` divided by the value for the same key in `counts_by_hour`

The second element of each inner list is therefore the average number of comments per post for that hour.

In [13]:
# Average number of comments per hour
avr_comments_per_hour = []

for row in comments_by_hour:
    avr_comments_per_hour.append([row, comments_by_hour[row] / counts_by_hour[row]])
    
avr_comments_per_hour

[['09', 5.5777777777777775],
 ['13', 14.741176470588234],
 ['10', 13.440677966101696],
 ['14', 13.233644859813085],
 ['16', 16.796296296296298],
 ['23', 7.985294117647059],
 ['12', 9.41095890410959],
 ['17', 11.46],
 ['15', 38.5948275862069],
 ['21', 16.009174311926607],
 ['20', 21.525],
 ['02', 23.810344827586206],
 ['18', 13.20183486238532],
 ['03', 7.796296296296297],
 ['05', 10.08695652173913],
 ['19', 10.8],
 ['01', 11.383333333333333],
 ['22', 6.746478873239437],
 ['08', 10.25],
 ['04', 7.170212765957447],
 ['00', 8.127272727272727],
 ['06', 9.022727272727273],
 ['07', 7.852941176470588],
 ['11', 11.051724137931034]]
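
The same averages can be built in a single expression with a dictionary comprehension. The sketch below uses two hours from the printed dictionaries above as sample input; the name `avg_by_hour` is an assumption introduced for illustration.

```python
# Two entries copied from the dictionaries printed above
comments_by_hour = {"15": 4477, "02": 1381}
counts_by_hour = {"15": 116, "02": 58}

# Average comments per post for each hour: total comments / number of posts
avg_by_hour = {
    hour: comments_by_hour[hour] / counts_by_hour[hour]
    for hour in comments_by_hour
}

for hour, avg in avg_by_hour.items():
    print(hour, round(avg, 2))
```

A dictionary keeps each hour paired with its average, whereas the list-of-lists form used in the notebook is convenient for the sorting step that follows.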

### Swapping the Columns 

The code cell below puts the average number of comments for each hour in the first column and the hour itself in the second column, so that the list can later be sorted by average. This is done in the following steps:

* Create an empty list and assign it to a variable called `swap_avg_by_hour`
* Loop through the `avr_comments_per_hour` list and, for each iteration, append a two-element list with the average number of comments first and the hour second

In [14]:
swap_avg_by_hour = []

for row in avr_comments_per_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
swap_avg_by_hour

[[5.5777777777777775, '09'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [16.796296296296298, '16'],
 [7.985294117647059, '23'],
 [9.41095890410959, '12'],
 [11.46, '17'],
 [38.5948275862069, '15'],
 [16.009174311926607, '21'],
 [21.525, '20'],
 [23.810344827586206, '02'],
 [13.20183486238532, '18'],
 [7.796296296296297, '03'],
 [10.08695652173913, '05'],
 [10.8, '19'],
 [11.383333333333333, '01'],
 [6.746478873239437, '22'],
 [10.25, '08'],
 [7.170212765957447, '04'],
 [8.127272727272727, '00'],
 [9.022727272727273, '06'],
 [7.852941176470588, '07'],
 [11.051724137931034, '11']]

### Sorting the List in Descending order

The `swap_avg_by_hour` list is sorted in descending order using the built-in `sorted` function. The function takes the list to be sorted as its first argument and an optional `reverse` keyword argument, which is either `True` or `False`. When `reverse` is `True`, the result is returned in descending order. Because lists are compared element by element, the rows are ordered by their first element, the average number of comments.

In [15]:
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
sorted_swap

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21'],
 [14.741176470588234, '13'],
 [13.440677966101696, '10'],
 [13.233644859813085, '14'],
 [13.20183486238532, '18'],
 [11.46, '17'],
 [11.383333333333333, '01'],
 [11.051724137931034, '11'],
 [10.8, '19'],
 [10.25, '08'],
 [10.08695652173913, '05'],
 [9.41095890410959, '12'],
 [9.022727272727273, '06'],
 [8.127272727272727, '00'],
 [7.985294117647059, '23'],
 [7.852941176470588, '07'],
 [7.796296296296297, '03'],
 [7.170212765957447, '04'],
 [6.746478873239437, '22'],
 [5.5777777777777775, '09']]

In [16]:
# Printing the top five hours with highest comments on the average
top_five = sorted_swap[0:5]
top_five

[[38.5948275862069, '15'],
 [23.810344827586206, '02'],
 [21.525, '20'],
 [16.796296296296298, '16'],
 [16.009174311926607, '21']]

The results above show that, on average, `ask_posts` created during the hours starting at 3pm, 2am, 8pm, 4pm and 9pm receive the most comments, with 3pm well ahead at about 38.6 comments per post.
Users who want more comments on their ask posts may therefore want to submit them around these times.
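
The swap-then-sort steps above can be collapsed into one call by passing a `key` function to `sorted`, which tells it which element to sort on. This is a sketch of that alternative rather than the notebook's method; the four sample rows are copied from the `avr_comments_per_hour` output and the name `top_two` is hypothetical.

```python
# Sample rows in [hour, average] order, copied from the output above
avr_sample = [
    ["09", 5.5777777777777775],
    ["15", 38.5948275862069],
    ["02", 23.810344827586206],
    ["20", 21.525],
]

# Sort on the second element (the average) without swapping the columns first
top_two = sorted(avr_sample, key=lambda pair: pair[1], reverse=True)[:2]
print(top_two)
```

With a `key` function the rows keep their original `[hour, average]` order, so no intermediate swapped list is needed.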

### Displaying the Top Five Hours for `ask_posts` in a More Readable Format

The code cell below loops through the `top_five` list of lists, unpacking each row into `avr` and `hour`, and for each iteration:
* parses the hour string into a `datetime` object using the `strptime` method and assigns it to the variable `hour`
* formats `hour` to display both the hour and the minutes
* creates a template string for the readable output and assigns it to a variable called `output`
* inserts the hour and the average number of comments into the `output` string using the `format` method
* prints `output`

In [17]:
for avr, hour in top_five:
    hour = dt.datetime.strptime(hour,"%H")
    hour = hour.strftime("%H:%M")
    output = "At {}, there's an average of {:.2f} comments on ask_posts"
    output = output.format(hour, avr)
    
    print(output)

At 15:00, there's an average of 38.59 comments on ask_posts
At 02:00, there's an average of 23.81 comments on ask_posts
At 20:00, there's an average of 21.52 comments on ask_posts
At 16:00, there's an average of 16.80 comments on ask_posts
At 21:00, there's an average of 16.01 comments on ask_posts
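
Since the hour strings are already zero-padded, the same lines can be produced with an f-string and no round trip through `strptime`. The sketch below assumes that padding holds for every hour; the names `top_sample` and `lines` are hypothetical, and the two rows are copied from `top_five`.

```python
# Two rows copied from top_five, in [average, hour] order
top_sample = [[38.5948275862069, "15"], [23.810344827586206, "02"]]

lines = [
    f"At {hour}:00, there's an average of {avg:.2f} comments on ask_posts"
    for avg, hour in top_sample
]
for line in lines:
    print(line)
```

The `strptime` and `strftime` version in the notebook is the safer choice if the hour strings might ever arrive without padding.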
