# Exploring Hacker News Posts

## Table of Contents
### &ensp;&ensp; 1. [Opening and Exploring the Dataset](#1.)
### &ensp;&ensp; 2. [Extracting Ask HN & Show HN Posts](#2.)
### &ensp;&ensp; 3. [Analysing the Dataset](#3.)
### &ensp;&ensp; 4. [Conclusion](#4.)

## Abstract
This project analyses Hacker News posts using a dataset from [Kaggle](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts) to answer two questions:
- Do `Ask HN` posts or `Show HN` posts receive more comments on average?
  - An `Ask HN` post asks the Hacker News community a specific question, such as "How to improve my personal website?"
  - A `Show HN` post shows the Hacker News community a project, a product, or something generally interesting.
- Do posts created at a certain time receive more comments on average?

  
  
This project also gives me the opportunity to practise the following basic skills:
1. Using the `datetime` module and its `datetime` and `time` classes.
2. Using the `sorted(iterable, key=None, reverse=False)` function to sort a list.
3. Using dictionaries to build frequency tables and calculate averages.
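
As a minimal sketch of these three skills together, the snippet below parses one timestamp in the dataset's format, builds a small frequency table, and sorts it. The `labels` list is made-up sample data, not taken from the dataset.

```python
import datetime as dt

# Skill 1: parse a timestamp string and read its hour.
stamp = dt.datetime.strptime('8/4/2016 11:52', '%m/%d/%Y %H:%M')
hour = stamp.hour

# Skill 3: build a frequency table with a dictionary.
labels = ['ask', 'show', 'ask', 'other', 'ask']  # made-up sample values
freq = {}
for label in labels:
    freq[label] = freq.get(label, 0) + 1

# Skill 2: sort the (label, count) pairs by count, largest first.
ranked = sorted(freq.items(), key=lambda pair: pair[1], reverse=True)

print(hour)
print(ranked)
```

Passing `key` tells `sorted()` which part of each pair to compare, and `reverse=True` puts the largest count first.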

## 1. Opening and Exploring the Dataset
In this section, the dataset `hacker_news.csv` is opened and read into a list of lists for preliminary exploration.

In [1]:
# Read the CSV file into a list of lists
from csv import reader
# The with statement closes the file once it has been read
with open('hacker_news.csv', encoding='utf8', newline='') as opened_file:
    hn = list(reader(opened_file))

# Check the first 5 rows
hn[:5]

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'],
 ['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01']]
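
To illustrate why `csv.reader` is used rather than splitting each line on commas, here is a small sketch that parses an in-memory sample. The two-row CSV text is invented for illustration and is not part of `hacker_news.csv`.

```python
import csv
import io

# Invented sample: the title contains a comma, so it is quoted in the CSV text.
sample_csv = io.StringIO(
    'id,title,num_comments\n'
    '1,"Hello, world",3\n'
)

rows = list(csv.reader(sample_csv))
print(rows)

# A naive split breaks the quoted title into two fields.
naive = '1,"Hello, world",3'.split(',')
print(len(rows[1]), len(naive))
```

Every value comes back as a string, which is why the later cells call `int()` on the comment counts.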

In [2]:
# Remove the header
headers = hn[0]
hn = hn[1:]
print('Headers:\n')
print(headers)
print()
print('First 5 rows:\n')
for i in range(5):
    print(hn[i])
    print()

Headers:

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

First 5 rows:

['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52']

['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30']

['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20']

['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']

['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015

The table below shows the name and description of each column in the dataset.

| Column      | Description |
| :-----------: | :----------- |
| id | The unique identifier from Hacker News for the post |
| title | The title of the post |
| url | The URL that the post links to (a post might not contain a URL) |
| num_points | The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes |
| num_comments | The number of comments that were made on the post |
| author | The username of the person who submitted the post |
| created_at | The date and time at which the post was submitted (Eastern Time in the US, GMT -5 hours) |
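
One way to see how the columns line up with a row is to pair the header names with the values. The sketch below does this for the first data row shown above; the `record` name is only illustrative.

```python
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Pair each column name with the value at the same position.
record = dict(zip(headers, row))

# Values are strings, so numeric columns need an explicit conversion.
num_comments = int(record['num_comments'])
print(record['title'], num_comments)
```

The rest of this project keeps the rows as plain lists and uses positional indexes instead, such as `row[4]` for `num_comments` and `row[6]` for `created_at`.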


## 2. Extracting Ask HN & Show HN Posts
Since we are only concerned with posts whose titles begin with `Ask HN` or `Show HN`, the following step uses the `str.lower()` and `str.startswith()` methods to find these posts and store them in their corresponding lists.

In [3]:
# Create 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# Loop through each row in `hn`
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('The number of posts that begin with "Ask HN" is:', len(ask_posts))
print()
print('The number of posts that begin with "Show HN" is:', len(show_posts))
print()
print('The number of posts that begin with neither "Ask HN" nor "Show HN" is:', len(other_posts))

The number of posts that begin with "Ask HN" is: 1744

The number of posts that begin with "Show HN" is: 1162

The number of posts that begin with neither "Ask HN" nor "Show HN" is: 17194
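
The same prefix test can be wrapped in a small function, which makes the case-insensitive matching easy to check on a few titles. This is only a sketch: `classify_title` is a hypothetical helper, and the capitalisation of the second title has been altered on purpose.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix (hypothetical helper)."""
    lowered = title.lower()
    if lowered.startswith('ask hn'):
        return 'ask'
    if lowered.startswith('show hn'):
        return 'show'
    return 'other'

sample_titles = [
    'Ask HN: How to improve my personal website?',
    'SHOW HN: Something pointless I made',  # capitalisation altered on purpose
    'Interactive Dynamic Video',
]

labels = [classify_title(title) for title in sample_titles]
print(labels)
```

Lowercasing the title first means that variants such as `ASK HN` or `ask hn` are counted as well.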


### 2.1. Ask HN Posts

In [4]:
# Check the first 5 rows in ask_posts
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

### 2.2. Show HN Posts

In [5]:
# Check the first 5 rows in show_posts
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

## 3. Analysing the Dataset

### 3.1. Calculating Average Number of Comments for Ask HN & Show HN Posts

In [6]:
# Accumulate the total number of comments on Ask HN posts in a loop
total_ask_comments = 0

for row in ask_posts:
    num_comments = int(row[4])
    total_ask_comments += num_comments

avg_ask_comments = total_ask_comments / len(ask_posts)    

print('The average number of comments for each Ask HN post is:', avg_ask_comments)

The average number of comments for each Ask HN post is: 14.038417431192661


In [7]:
# Accumulate the total number of comments on Show HN posts in a loop
total_show_comments = 0

for row in show_posts:
    num_comments = int(row[4])
    total_show_comments += num_comments

avg_show_comments = total_show_comments / len(show_posts)    

print('The average number of comments for each Show HN post is:', avg_show_comments)

The average number of comments for each Show HN post is: 10.31669535283993


## Based on the above result:
- On average, the Ask HN posts (14.04) received more comments than the Show HN posts (10.32) in the dataset.
- Accordingly, the rest of the analysis focuses on the Ask HN posts.
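
The two loops above differ only in the list they read, so the calculation could also be expressed as one reusable function. The sketch below assumes a hypothetical helper named `average_comments` and runs it on two Ask HN rows from the preview in section 2.1.

```python
def average_comments(posts, comments_index=4):
    """Mean comment count of the rows in posts; 0.0 for an empty list (hypothetical helper)."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

sample_ask = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting down share counts?', '',
     '28', '29', 'tkfx', '11/22/2015 13:43'],
]

print(average_comments(sample_ask))
```

The guard for an empty list avoids a `ZeroDivisionError` if no post matches a category.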

### 3.2. Calculating Number of Ask HN Posts and Comments Created by Hour
In this section, we examine whether Ask HN posts created at a certain time were more likely to receive comments. The analysis follows the two steps below:
1. Count the Ask HN posts created in each hour of the day, along with the number of comments they received.
2. Calculate the average number of comments per Ask HN post for each hour of creation.
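
The two steps can be sketched on a handful of rows as follows. This version uses `collections.defaultdict` instead of the plain dictionaries used in the cells of this section, and the last sample row is invented so that one hour contains more than one post.

```python
import datetime as dt
from collections import defaultdict

sample = [
    ['8/16/2016 9:55', 6],
    ['11/22/2015 13:43', 29],
    ['5/2/2016 10:14', 1],
    ['1/1/2016 9:05', 4],  # invented row, added so that hour 9 has two posts
]

# Step 1: count posts and sum comments for each hour of creation.
posts_per_hour = defaultdict(int)
comments_per_hour = defaultdict(int)
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    posts_per_hour[hour] += 1
    comments_per_hour[hour] += num_comments

# Step 2: divide total comments by number of posts for each hour.
average_per_hour = {
    hour: comments_per_hour[hour] / posts_per_hour[hour]
    for hour in posts_per_hour
}
print(average_per_hour)
```

With `defaultdict(int)`, a missing hour starts at 0, so no `if`/`else` branch is needed for the first post seen in each hour.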

### 3.2.1. Obtain the number of Ask HN Posts created by Hour and the number of its corresponding comments

In [8]:
import datetime as dt
result_list = []

# Loop through ask_posts and append the created time for each post and the number of comments it received to result_list.
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

result_list[:5] 

[['8/16/2016 9:55', 6],
 ['11/22/2015 13:43', 29],
 ['5/2/2016 10:14', 1],
 ['8/2/2016 14:20', 3],
 ['10/15/2015 16:38', 17]]

The following cell builds two dictionaries:
- `counts_hour`, which holds the number of Ask HN posts created during each hour of the day.
- `counts_comment`, which holds the total number of comments received by the Ask HN posts created in each hour.

In [9]:
# Create two empty dictionaries
counts_hour = {}
counts_comment = {}

# Loop through result_list
for row in result_list:
    date_str = row[0]
    comment_count = row[1]
    # Use the strptime() method to parse the string into a datetime object
    date_dt = dt.datetime.strptime(date_str, '%m/%d/%Y %H:%M')
    # Read the hour as an int from the datetime object; this avoids the
    # platform-specific '%-H' format code and the str-to-int conversion
    time_object = date_dt.hour
    if time_object in counts_hour:
        counts_hour[time_object] += 1
        counts_comment[time_object] += comment_count
    else:
        counts_hour[time_object] = 1
        counts_comment[time_object] = comment_count

print(counts_hour)
print()
print(counts_comment)

{9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}


In [10]:
# Sort the dictionary by hour
sorted_counts_hour = sorted(counts_hour.items(), key=lambda x:x[0])
sorted_counts_comment = sorted(counts_comment.items(), key=lambda x:x[0])

# The number of posts created by hour
sorted_counts_hour

[(0, 55),
 (1, 60),
 (2, 58),
 (3, 54),
 (4, 47),
 (5, 46),
 (6, 44),
 (7, 34),
 (8, 48),
 (9, 45),
 (10, 59),
 (11, 58),
 (12, 73),
 (13, 85),
 (14, 107),
 (15, 116),
 (16, 108),
 (17, 100),
 (18, 109),
 (19, 110),
 (20, 80),
 (21, 109),
 (22, 71),
 (23, 68)]

In [11]:
# The number of comments received by posts created in each hour
sorted_counts_comment

[(0, 447),
 (1, 683),
 (2, 1381),
 (3, 421),
 (4, 337),
 (5, 464),
 (6, 397),
 (7, 267),
 (8, 492),
 (9, 251),
 (10, 793),
 (11, 641),
 (12, 687),
 (13, 1253),
 (14, 1416),
 (15, 4477),
 (16, 1814),
 (17, 1146),
 (18, 1439),
 (19, 1188),
 (20, 1722),
 (21, 1745),
 (22, 479),
 (23, 543)]

### 3.2.2. Calculate the Average number of Comments for Ask HN Post by Hour

In [12]:
avg_by_hour = []
for hour in counts_hour:
    avg_by_hour.append([hour, counts_comment[hour]/counts_hour[hour]])
# Sort the list in descending order based on the average number of comments    
avg_by_hour = sorted(avg_by_hour, key=lambda x:x[1], reverse=True)
avg_by_hour

[[15, 38.5948275862069],
 [2, 23.810344827586206],
 [20, 21.525],
 [16, 16.796296296296298],
 [21, 16.009174311926607],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [18, 13.20183486238532],
 [17, 11.46],
 [1, 11.383333333333333],
 [11, 11.051724137931034],
 [19, 10.8],
 [8, 10.25],
 [5, 10.08695652173913],
 [12, 9.41095890410959],
 [6, 9.022727272727273],
 [0, 8.127272727272727],
 [23, 7.985294117647059],
 [7, 7.852941176470588],
 [3, 7.796296296296297],
 [4, 7.170212765957447],
 [22, 6.746478873239437],
 [9, 5.5777777777777775]]

Since the raw list above is hard to read, the following cell formats the result using:
- The `datetime.time()` constructor to create a time object from the hour
- The `time.strftime()` method to format the hour
- The `str.format()` method with `{:.2f}` to round the average to two decimal places

In [13]:
print('The Top 5 Hours for Ask HN Comments:')
for hr, avg in avg_by_hour[:5]:
    hour = dt.time(hr)
    hour_str = hour.strftime('%H:%M')
    message = 'At {}, {:.2f} comments per post'
    print(message.format(hour_str, avg))

The Top 5 Hours for Ask HN Comments:
At 15:00, 38.59 comments per post
At 02:00, 23.81 comments per post
At 20:00, 21.52 comments per post
At 16:00, 16.80 comments per post
At 21:00, 16.01 comments per post


## Based on the above result:
- On average, Ask HN posts created in the `15:00 hour (3 p.m. Eastern Time in the US, GMT -5 hours)` received the most comments, with an average of `38.59 comments per post`.
- That is about 62% more than posts created in the 02:00 hour (2 a.m.), which had the second-highest average at 23.81 comments per post.
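
The size of the gap between the top two hours can be checked with a short calculation. The two averages are copied from the output of cell 12.

```python
top_avg = 38.5948275862069       # average for the 15:00 hour
second_avg = 23.810344827586206  # average for the 02:00 hour

# Relative increase of the top hour over the runner-up.
relative_increase = (top_avg - second_avg) / second_avg
print('{:.1%}'.format(relative_increase))
```

This prints `62.1%`, so the 15:00 hour received roughly 62% more comments per post than the 02:00 hour.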

## 4. Conclusion
In this project, we analysed the `Ask HN` and `Show HN` posts on Hacker News to find out which type of post received more comments on average, and whether posts created at a certain time of day received more comments than others.

The results suggest that `Ask HN` posts receive more comments on average than `Show HN` posts (14.04 versus 10.32), and that an `Ask HN` post created at around `15:00 (3 p.m. Eastern Time in the US, GMT -5 hours)` is more likely to receive comments.
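
Readers in other time zones may want to convert the peak hour to their local time. The sketch below assumes a fixed GMT -5 offset, as stated in the column description, and converts 15:00 to UTC. The date is arbitrary, and during daylight saving time the actual offset in the US Eastern zone is one hour smaller.

```python
import datetime as dt

# Assumption: a fixed UTC-5 offset, taken from the column description.
eastern = dt.timezone(dt.timedelta(hours=-5))

# The date is arbitrary; only the hour matters here.
peak = dt.datetime(2016, 1, 15, 15, 0, tzinfo=eastern)
peak_utc = peak.astimezone(dt.timezone.utc)

print(peak_utc.strftime('%H:%M'))
```

Under this assumption, the peak hour corresponds to 20:00 UTC.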