# Project: Meaningful Insights from Engaging Posts
People visit several online communities throughout the week, and posts within these communities receive engagement from other users. The posts with the most engagement usually attract many comments and are shared widely across social media. The goal of this project is to examine which posts receive the most comments, and which factors within our control influence that. We'll be using data from Hacker News (news.ycombinator.com).

## Project Outline
#### 1. Introduction
Introduction to the project and our why.
#### 2. Project Goals
The goals that help us answer our why.
#### 3. Data Analysis
Working with the data to uncover insights and analyzing patterns.
#### 4. Conclusion
What conclusions our project has revealed.

## 1. Introduction
### 1.1 What is Hacker News, Who Visits Hacker News, and Why Evaluate Hacker News?
#### 1.1.1 What is Hacker News?
<a href="https://news.ycombinator.com/">Hacker News</a> is a social news site focused on entrepreneurship and technology, and is owned by <a href="https://www.ycombinator.com/Ycombinator">Y Combinator</a>. Users can comment on, share, and upvote or downvote posts, much as they can on <a href="https://reddit.com">Reddit</a>.

#### 1.1.2 Who Visits Hacker News?
People who are interested in technology and entrepreneurship are the main users of Hacker News. The site receives over half a million visitors monthly (<a href="https://app.neilpatel.com/en/traffic_analyzer/overview?lang=en&locId=2840&domain=news.ycombinator.com">source</a>).

#### 1.1.3 Why Evaluate Hacker News?
Since Hacker News is an established site with an easy-to-use voting system, we can examine what kinds of posts gain the most visibility through comments or upvotes. We can also examine which day(s) and time(s) are best for posting in order to receive more comments on the platform.

## 2. Project Goals 

Hacker News is a popular news outlet, similar in scope to Reddit, for tech and tech-related news. Users often create threads whose titles begin with "Ask HN" or "Show HN" (written here as #AskHN and #ShowHN). #AskHN posts ask the community a specific question, such as "What is the best tech stack for an online store?". Similarly, #ShowHN posts show the community a project, a product, or something interesting and eye-catching.

Because of the popularity of these posts, we want to analyze which type creates more engagement on average. We'd also like to analyze other factors that may lead to engagement, such as the time of posting and the framing of the title.

We'll compare these two types of posts to determine the following:
- <b>Do #AskHN or #ShowHN posts receive more comments on average?</b>
- <b>Does posting at a certain time generate more comments on average?</b>

## 3. Data Analysis Section
### 3.1 Introduction to the Data
The data set we're working with was reduced from close to 300,000 rows down to around 20,000 rows. This was done by removing posts that did not receive any comments, and then randomly sampling from the remaining posts. 
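The reduction described above can be sketched as follows. This is only an illustration under stated assumptions: the rows in `full_rows` are made-up stand-ins, `sample_size` is chosen for the toy data, and the actual procedure used to build `hacker_news.csv` is not shown in this project.

```python
import random

# Hypothetical stand-in rows in the same column order as the data set:
# [id, title, url, num_points, num_comments, author, created_at]
full_rows = [
    ["1", "Ask HN: Example question?", "", "5", "3", "alice", "1/1/2016 10:00"],
    ["2", "A link nobody discussed", "http://example.com", "1", "0", "bob", "1/2/2016 11:00"],
    ["3", "Show HN: Example project", "http://example.org", "9", "7", "carol", "1/3/2016 12:00"],
    ["4", "Another discussed link", "http://example.net", "4", "2", "dave", "1/4/2016 13:00"],
]

# Step 1: keep only posts with at least one comment (num_comments is column 4)
commented = [row for row in full_rows if int(row[4]) > 0]

# Step 2: draw a random sample without replacement
random.seed(0)   # fixed seed so the sketch is repeatable
sample_size = 2  # assumed value for the toy data
sample = random.sample(commented, sample_size)

print(len(commented), len(sample))
```

With real data, `sample_size` would be set to the number of rows wanted in the reduced file.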

In [16]:
from csv import reader

# Reading in the Hacker News data set as a list of lists
# (the with-block closes the file once it has been read)
with open("hacker_news.csv", encoding="utf-8") as opened_file:
    read_file = reader(opened_file)
    hn_raw = list(read_file)
headers = hn_raw[0] # Headers of the dataset

# Removing headers from dataset
hn = hn_raw[1:]

In [2]:
#Exploring some data points from dataset with headers
print(headers)
print("\n")
hn[0:5]

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']




[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]

### 3.2 Summary of Data
Among other columns, the data set contains the title of each post, the number of comments on the post, and the date and time the post was created. Now let's explore the number of comments for each type of post, AskHN and ShowHN.
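Since the rows are plain lists, each column is reached by position. As a small sketch, the header row printed earlier can be turned into a name-to-index lookup; the name `col` is introduced here purely for illustration and is not used elsewhere in the project.

```python
# Header row and first data row, copied from the output shown earlier
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
row = ['12224879', 'Interactive Dynamic Video',
       'http://www.interactivedynamicvideo.com/',
       '386', '52', 'ne0phyte', '8/4/2016 11:52']

# Map each column name to its position in a row
col = {name: index for index, name in enumerate(headers)}

title = row[col['title']]                      # column 1
num_comments = int(row[col['num_comments']])   # column 4, stored as text in the CSV
created_at = row[col['created_at']]            # column 6

print(title, num_comments, created_at)
```

This is why the later cells read `row[1]`, `row[4]`, and `row[6]`.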

### 3.3 Separating AskHN posts from ShowHN posts
The first thing that needs to be done is to identify posts whose titles begin with "Ask HN" or "Show HN". After identifying them, we can place each data point into a separate list for analysis. We also need to account for posts that are neither. We should have three separate lists:
- AskHN posts
- ShowHN posts
- Other posts

In [6]:
# Create three empty lists
ask_posts =[]
show_posts = []
other_posts = []

# Categorizing posts by titles 
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [7]:
print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

1744
1162
17194


### 3.4 Which posts receive a higher average number of comments?

To find out which posts receive more comments, we can compare the average number of comments for each sub-category, "ask hn" and "show hn".
#### 3.4.1 Finding Average Number of Comments for AskHN and ShowHN Posts
All the posts have been separated into different lists. Now we'll calculate the average number of comments received per post for the AskHN and ShowHN categories.

In [11]:
# Finding avg number of comments for AskHN posts
total_ask_comments = 0

for row in ask_posts:
    comment = int(row[4])
    total_ask_comments += comment

# Computing avg number of comments
avg_ask_comments = total_ask_comments / len(ask_posts)
print(avg_ask_comments)

14.038417431192661


In [15]:
# Finding avg number of comments for ShowHN posts
total_show_comments = 0

for row in show_posts:
    scomment = int(row[4])
    total_show_comments += scomment

# Computing avg number of comments
avg_show_comments = total_show_comments / len(show_posts)
print(avg_show_comments)


10.31669535283993


#### 3.4.2 Average Number of comments based on Category
AskHN posts from our sample receive on average 14 comments. ShowHN posts receive around 10 comments per post. Because AskHN posts receive a higher average amount of comments, we'll focus only on these now.
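The two averaging cells above repeat the same steps, so they could be folded into one helper. The sketch below is an illustration only: `average_comments` and `toy_posts` are names and data made up for this example, not part of the original analysis.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count of the given rows, or 0.0 if there are none."""
    if not posts:
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Hypothetical rows in the data set's column order
toy_posts = [
    ["1", "Ask HN: First question?", "", "1", "4", "a", "1/1/2016 10:00"],
    ["2", "Ask HN: Second question?", "", "1", "10", "b", "1/1/2016 11:00"],
]

print(average_comments(toy_posts))  # (4 + 10) / 2 = 7.0
```

Called as `average_comments(ask_posts)` or `average_comments(show_posts)`, it would reproduce the two averages computed above.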

### From here on we will shift our analysis and focus only on "Ask HN" post data

### 3.5 Determining the best time to post for engagement
We want to check and see if there's any correlation between the time a post was made and the number of comments it receives. If we find that posting at a certain time elicits more comments, we'll have an advantage of knowing when to post. 
#### We will perform this analysis using the following steps:
<ol>
<li> Calculate the number of ask posts created in each hour of the day, along with the total comments they received.</li>
<li> Calculate the average number of comments per ask post for each hour of creation.</li>
</ol>
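The two steps can be illustrated on a handful of made-up posts before running them on the full list. This sketch uses `collections.defaultdict` as one possible way to accumulate the totals; `toy_posts` and its values are hypothetical.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical (created_at, num_comments) pairs in the data set's date format
toy_posts = [
    ("8/4/2016 11:52", 52),
    ("6/17/2016 11:01", 2),
    ("1/26/2016 19:30", 10),
]

counts_by_hour = defaultdict(int)    # hour -> number of posts
comments_by_hour = defaultdict(int)  # hour -> total comments

# Step 1: tally posts and comments per hour of creation
for created_at, num_comments in toy_posts:
    parsed = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M")
    hour = parsed.strftime("%H")  # zero-padded hour, e.g. "11"
    counts_by_hour[hour] += 1
    comments_by_hour[hour] += num_comments

# Step 2: average comments per post for each hour
avg_by_hour = {hour: comments_by_hour[hour] / counts_by_hour[hour]
               for hour in counts_by_hour}

print(avg_by_hour)  # {'11': 27.0, '19': 10.0}
```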

In [19]:
# Calculating amount of ask posts created each hour and comments on the post. 
import datetime as dt

 
result_list = [] # will be a list of lists
counts_by_hour = {}
comments_by_hour = {}

for row in ask_posts:
    created_at = row[6]
    pcomment = int(row[4])
    #list for the two variables above
    ask_list = [created_at, pcomment]
    result_list.append(ask_list)
    
for row in result_list:
    date = row[0]
    dt_object = dt.datetime.strptime(date, "%m/%d/%Y %H:%M")
    hour = dt.datetime.strftime(dt_object, "%H")
    
    #if the hour isn't a key in counts_by_hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1] #the comment column created for pcomment var
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]

comments_by_hour

{'09': 251,
 '13': 1253,
 '10': 793,
 '14': 1416,
 '16': 1814,
 '23': 543,
 '12': 687,
 '17': 1146,
 '15': 4477,
 '21': 1745,
 '20': 1722,
 '02': 1381,
 '18': 1439,
 '03': 421,
 '05': 464,
 '19': 1188,
 '01': 683,
 '22': 479,
 '08': 492,
 '04': 337,
 '00': 447,
 '06': 397,
 '07': 267,
 '11': 641}

In [22]:
# Calculating avg number of comments per post for each hour of the day
avg_by_hour = []

for hour in counts_by_hour:
    avg = (comments_by_hour[hour] / counts_by_hour[hour])
    avg_by_hour.append([hour, round((avg),2)])

avg_by_hour

[['09', 5.58],
 ['13', 14.74],
 ['10', 13.44],
 ['14', 13.23],
 ['16', 16.8],
 ['23', 7.99],
 ['12', 9.41],
 ['17', 11.46],
 ['15', 38.59],
 ['21', 16.01],
 ['20', 21.52],
 ['02', 23.81],
 ['18', 13.2],
 ['03', 7.8],
 ['05', 10.09],
 ['19', 10.8],
 ['01', 11.38],
 ['22', 6.75],
 ['08', 10.25],
 ['04', 7.17],
 ['00', 8.13],
 ['06', 9.02],
 ['07', 7.85],
 ['11', 11.05]]

### 3.6 Arranging and Printing Average Number of Comments Per Hour
Now that we have a list of lists with our average amount of comments per hour, we want to arrange it in a way to output the number of comments in descending order. We only want to focus on the times where posts received the most comments.

In [23]:
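# Swapping position of time with avg comments per post,
# so that sorting the list orders it by avg comments
swap_avg_by_hour = []

for row in avg_by_hour:
    hour = row[0]
    avg_comment = row[1]
    swap_avg_by_hour.append([avg_comment, hour])

# Sorting in descending order of avg comments per post
sorted_swap = sorted(swap_avg_by_hour, reverse = True)
# Printing the swapped (not yet sorted) list as a check
print(swap_avg_by_hour)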

[[5.58, '09'], [14.74, '13'], [13.44, '10'], [13.23, '14'], [16.8, '16'], [7.99, '23'], [9.41, '12'], [11.46, '17'], [38.59, '15'], [16.01, '21'], [21.52, '20'], [23.81, '02'], [13.2, '18'], [7.8, '03'], [10.09, '05'], [10.8, '19'], [11.38, '01'], [6.75, '22'], [10.25, '08'], [7.17, '04'], [8.13, '00'], [9.02, '06'], [7.85, '07'], [11.05, '11']]
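Swapping the columns is one way to sort by average. As an alternative sketch, `sorted` accepts a `key` function, which orders the original `[hour, average]` pairs directly. The short list below reuses four of the values from the output above; `sorted_by_avg` is a name introduced only for this example.

```python
# A few [hour, average] pairs taken from the results above
avg_by_hour = [['09', 5.58], ['15', 38.59], ['02', 23.81], ['20', 21.52]]

# Sort by the second element (the average), largest first
sorted_by_avg = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

print(sorted_by_avg)
# [['15', 38.59], ['02', 23.81], ['20', 21.52], ['09', 5.58]]
```

The swap approach used in this project gives the same ordering; the `key` version just avoids building a second list.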


#### 3.6.1 Top 5 Hours for Ask Posts Comments
After finding the average number of comments per post based on the time of posting, we now want to focus on the top five times to post. 

In [25]:
# Checking the first four entries of sorted_swap
print(sorted_swap[0:4], "\n")

[[38.59, '15'], [23.81, '02'], [21.52, '20'], [16.8, '16']] 



In [28]:
# Printing the top 5 hours with the highest avg number of comments (sorted_swap is already sorted)
print("The top 5 Hours for AskHN Comments", "\n")
for row in sorted_swap[0:5]:
    hour = dt.datetime.strptime(row[1], '%H')
    hour_str = hour.strftime('%H:%M')
    avg = row[0]
    output = ("{0} {1:.2f} - average comments per post.".format(hour_str, avg))
    print(output)
    

The top 5 Hours for AskHN Comments 

15:00 38.59 - average comments per post.
02:00 23.81 - average comments per post.
20:00 21.52 - average comments per post.
16:00 16.80 - average comments per post.
21:00 16.01 - average comments per post.


A quick glimpse of our list shows that the best time to post is 15:00 (3:00 pm), where posts receive on average 38.59 comments. Surprisingly, that is about sixty percent more than the average of 23.81 comments for the second-best hour, 02:00.
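The size of that gap can be checked with a quick calculation on the two averages printed above. The variable names are introduced here for illustration.

```python
top_avg = 38.59     # average comments per post at 15:00
second_avg = 23.81  # average comments per post at 02:00

# Relative increase of the top hour over the second-best hour, in percent
pct_increase = (top_avg - second_avg) / second_avg * 100

print(round(pct_increase, 1))  # 62.1
```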

One more important point is the time zone of these times, since it determines when we should actually post. According to the documentation on the dataset, the times are in Eastern Time (EST). With this taken into account, we can now plan and develop content to be posted at 15:00 EST.
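For readers outside the Eastern time zone, the 15:00 EST slot can be converted to other zones. The sketch below makes two simplifying assumptions: EST is treated as a fixed UTC-5 offset with daylight saving ignored, and the date is arbitrary. `EST`, `PST`, and `post_time` are names introduced for this example.

```python
import datetime as dt

# Assumption: fixed offsets, daylight saving time is ignored
EST = dt.timezone(dt.timedelta(hours=-5), name="EST")
PST = dt.timezone(dt.timedelta(hours=-8), name="PST")

# Arbitrary date; only the 15:00 time of day matters here
post_time = dt.datetime(2016, 1, 26, 15, 0, tzinfo=EST)

as_utc = post_time.astimezone(dt.timezone.utc)
as_pst = post_time.astimezone(PST)

print(as_utc.strftime("%H:%M"), as_pst.strftime("%H:%M"))  # 20:00 12:00
```

During daylight saving months the real offsets shift by an hour, so a time zone database would be needed for exact scheduling.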

## 4. Conclusion
The purpose of this project was to find what type of post would receive the most engagement on Hacker News. After comparing AskHN and ShowHN posts, we found that AskHN posts receive more comments on average. Our focus then shifted to analyzing only AskHN posts, and we found that, in our sample, posts created at 15:00 EST received more comments on average than posts created at any other hour.

<b>Therefore, if one wanted to make a post on Hacker News and wanted that post to have a great probability of receiving engagement (comments, shares), they should post at 15:00 EST. </b>