# **INTRODUCTION**


- The learning objectives of this lesson have been added in the README file attached to this repository. However, to understand this project clearly, you should check out my lesson on [how to work with complex dates and times in Python](https://github.com/Tess-hacker/WORKING-WITH-COMPLEX-DATE-TIME-DATASET-IN-PYTHON). You'll find it interesting, I promise!


### **A Little Background on Hacker News**

- Hacker News is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit. Hacker News is extremely popular in technology and startup circles, and posts that make it to the top of the Hacker News listings can get hundreds of thousands of visitors as a result.


- As stated in the README, we will use the dataset [here](https://www.kaggle.com/hacker-news/hacker-news-posts/home). The columns are described below:

    - `id`: The unique identifier from Hacker News for the post
    
    - `title`: The title of the post
    
    - `url`: The URL that the post links to, if the post has a URL
    
    - `num_points`: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
    
    - `num_comments`: The number of comments that were made on the post
    
    - `author`: The username of the person who submitted the post
    
    - `created_at`: The date and time at which the post was submitted
    
    
- In this project, we are specifically interested in posts whose titles begin with `Ask HN` or `Show HN`. The former ask the Hacker News community a specific question, while the latter show the community a project, product, or something generally interesting. We will be **comparing these two types of posts to answer the following questions:**

    - Do `Ask HN` or `Show HN` posts receive more comments on average?
    
    - Do posts created at a certain time receive more comments on average?


- Let us begin by importing what we need and reading the dataset into a list of lists.


In [1]:
from csv import reader

# Read every row of the CSV file into a list of lists (one inner list per row)
with open(r'C:\Users\USER\Documents\ONLINE COURSES\DATAQUEST\DATASETS\HN_posts_year_to_Sep_26_2016.csv', 'r', encoding='utf-8') as opened_file:
    hn = list(reader(opened_file))
hn = hn[1:]  # drop the header row
print (hn[:10])

[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle of Shopping', 'https://blog.menswr.com/2016/09/07/whats-new-w
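
- Each row printed above is a list of seven strings, in the column order described earlier. As a minimal sketch (the `COLUMNS` list and the `row_to_dict` helper are names introduced here for illustration and are not part of the original notebook), a row can be paired with its column names so that the numeric indexes used later (`row[1]`, `row[4]`, `row[6]`) are easier to follow:

```python
# Column names in file order, taken from the column descriptions above.
COLUMNS = ["id", "title", "url", "num_points", "num_comments", "author", "created_at"]


def row_to_dict(row):
    """Pair each value in a row with its column name (hypothetical helper)."""
    return dict(zip(COLUMNS, row))


# Sample row copied from the printed output above.
sample_row = ['12579005', 'SQLAR  the SQLite Archiver',
              'https://www.sqlite.org/sqlar/doc/trunk/README.md',
              '1', '0', 'blacksqr', '9/26/2016 3:24']

post = row_to_dict(sample_row)
print(post["title"])         # same value as sample_row[1]
print(post["num_comments"])  # same value as sample_row[4]; still a string
print(post["created_at"])    # same value as sample_row[6]
```

- Note that every value is still a string at this point, which is why the comment counts are converted with `int` before any arithmetic.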

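- Let us assign the first row to a variable named *headers* and keep the remaining rows in `hn`, so that the two can be used separately. Note that the header line was already dropped in the previous cell, so the row captured here is the first data row rather than the column names, as the output below shows; the second slice therefore removes one post from the dataset.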

In [2]:
headers = hn[:1]
print ('The dataset headers are:')
print (headers)
print ('\n')
hn = hn[1:]
print ('The newly extracted dataset are:')
print (hn[:6])

The dataset headers are:
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26']]


The newly extracted dataset are:
[['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14'], ['12578975', 'Saving the Hassle 

- Now that we have our dataset separated from the headers, we are ready to filter the data. Recall that this project focuses on posts whose titles begin with `Ask HN` or `Show HN`. Because the dataset is large, we need a Python function to pick out those posts.


- The string method `startswith` checks whether a string begins with the prefix passed as its argument and returns a boolean, `True` or `False`. Using this method, together with `lower` to make the check case-insensitive, we will filter our dataset for the posts that fall into the two categories this lesson targets.
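
- As a small illustration of the prefix check, here is a sketch on a handful of titles. Three of them are copied from the dataset; the lowercase one is made up to show why the title is normalised first, and the variable names are illustrative:

```python
# Example titles; the second one is invented, the others come from the dataset.
titles = [
    "Ask HN: What TLD do you use for local development?",
    "ask hn: lowercase example title",
    "Show HN: Finding puns computationally",
    "algorithmic music",
]

for title in titles:
    normalized = title.lower()  # makes the check case-insensitive
    is_ask = normalized.startswith("ask hn")
    is_show = normalized.startswith("show hn")
    print(title, "->", is_ask, is_show)

# str.startswith also accepts a tuple of prefixes and matches any of them.
matches = [t for t in titles if t.lower().startswith(("ask hn", "show hn"))]
print(len(matches))
```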

In [3]:
ask_posts = []
show_posts = []
other_posts = []
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)
print ('The total number of Ask Posts are:')
print (len(ask_posts))
print ('The total number of Show Posts are:')
print (len(show_posts))
print ('The total number of Other Posts are:')
print (len(other_posts))

The total number of Ask Posts are:
9139
The total number of Show Posts are:
10158
The total number of Other Posts are:
273821


In [4]:
print ('The First Five Rows of Ask Posts are:')
print (ask_posts[:5])
print ('\n')
print ('The First Five Rows of Show Posts are:')
print (show_posts[:5])
print ('\n')
print ('The First Five Rows of Other Posts are:')
print (other_posts[:5])

The First Five Rows of Ask Posts are:
[['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'], ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'], ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'], ['12577870', 'Ask HN: Why join a fund when you can be an angel?', '', '1', '3', 'anthony_james', '9/25/2016 22:48'], ['12577647', 'Ask HN: Someone uses stock trading as passive income?', '', '5', '2', '00taffe', '9/25/2016 21:50']]


The First Five Rows of Show Posts are:
[['12578335', 'Show HN: Finding puns computationally', 'http://puns.samueltaylor.org/', '2', '0', 'saamm', '9/26/2016 0:36'], ['12578182', 'Show HN: A simple library for complicated animations', 'https://christinecha.github.io/choreographer-js/', '1', '0', 'christinecha', '9/26/2016 0:01'], ['12578098', 'Show HN: WebGL visualization of DN

- Now we would like to determine whether the ask posts or the show posts receive more comments on average.


- The following are the steps that would be taken:

    - **Determine the total number of comments for each category**
    
    - **Compute the average number of comments for each category**
    

- Afterwards, we will conclude whether the show posts or the ask posts receive more comments on average.
    

In [5]:
# Sum the comments (column index 4) over all ask posts, then average once
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
average_ask_comments = total_ask_comments / len(ask_posts)
print ('The average comments for the ask posts are:')
print (average_ask_comments)

# Repeat the same calculation for the show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
average_show_comments = total_show_comments / len(show_posts)
print ('The average comments for the show posts are:')
print (average_show_comments)

The average comments for the ask posts are:
10.393478498741656
The average comments for the show posts are:
4.886099625910612
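
- The two loops above repeat the same calculation, so it can be wrapped in a small function. This is only a sketch: `average_comments` and its `comments_index` parameter are hypothetical names, and the three sample rows are copied from the ask posts printed earlier.

```python
def average_comments(posts, comments_index=4):
    """Return the mean comment count over a list of rows.

    `comments_index` is the position of `num_comments` in each row.
    """
    if not posts:  # avoid dividing by zero on an empty list
        return 0.0
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Three rows copied from the Ask HN output shown earlier.
sample_posts = [
    ['12578908', 'Ask HN: What TLD do you use for local development?', '', '4', '7', 'Sevrene', '9/26/2016 2:53'],
    ['12578522', 'Ask HN: How do you pass on your work when you die?', '', '6', '3', 'PascLeRasc', '9/26/2016 1:17'],
    ['12577908', 'Ask HN: How a DNS problem can be limited to a geographic region?', '', '1', '0', 'kuon', '9/25/2016 22:57'],
]
print(average_comments(sample_posts))
```

- With the full lists, the same function could be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.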


## **CALCULATING THE NUMBER OF ASK POSTS AND COMMENTS BASED ON THE HOUR CREATED**


- Here, we would like to determine whether the number of ask posts and the comments they receive vary with the time at which they are created. Since the code above shows that the ask posts have the higher average number of comments, we focus on them and take the time of posting into consideration.


- To calculate this, we will follow these steps:

    - First, we will calculate the number of ask posts created in each hour of the day, along with the number of comments received.
    
    - Then, we will calculate the average number of comments that ask posts receive per hour created. 
    
    
- The `datetime` module comes in handy for these tasks. Recall that we can use `datetime.strptime` to parse dates stored as strings and return `datetime` objects.
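
- As a quick reminder of how the parsing works, here is a minimal sketch on a single timestamp copied from the dataset. The variable names `raw` and `parsed` are illustrative; the format string follows the month/day/year hour:minute layout of the `created_at` column.

```python
import datetime as dt

# Timestamp copied from the dataset; the format string mirrors its layout.
raw = "9/26/2016 2:53"
parsed = dt.datetime.strptime(raw, "%m/%d/%Y %H:%M")

print(parsed.hour)            # the hour as an integer
print(parsed.strftime("%H"))  # the hour as a zero-padded string
```

- The zero-padded string form is the one used as the dictionary key in the next cell.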

In [6]:
import datetime as dt
result_list = []
for row in ask_posts:
    created_at = row[6]
    comments = row[4]
    result_list.append([created_at, int(comments)])
    
counts_by_hour = {}
comments_by_hour = {}
ask_date_format = "%m/%d/%Y %H:%M"
for row in result_list:
    date = row[0]
    comment = row[1]
    askpost_time = dt.datetime.strptime(date, ask_date_format).strftime("%H")
    if askpost_time in counts_by_hour:
        comments_by_hour[askpost_time] += comment
        counts_by_hour[askpost_time] += 1
    else:
        comments_by_hour[askpost_time] = comment
        counts_by_hour[askpost_time] = 1
print (comments_by_hour)

{'02': 2996, '01': 2089, '22': 3372, '21': 4500, '19': 3954, '17': 5547, '15': 18525, '14': 4972, '13': 7245, '11': 2797, '10': 3013, '09': 1477, '07': 1585, '03': 2154, '23': 2297, '20': 4462, '16': 4466, '08': 2362, '00': 2277, '18': 4877, '12': 4234, '04': 2360, '06': 1587, '05': 1838}


In [7]:
# Now, we would execute the second step of our task by calculating the average number of comments that the ask posts receive per hour created.
# The two dictionaries created earlier would be used to perform this task.
avg_by_hour = []
for hour in comments_by_hour:
    avg_by_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])
print (avg_by_hour)

[['02', 11.137546468401487], ['01', 7.407801418439717], ['22', 8.804177545691905], ['21', 8.687258687258687], ['19', 7.163043478260869], ['17', 9.449744463373083], ['15', 28.676470588235293], ['14', 9.692007797270955], ['13', 16.31756756756757], ['11', 8.96474358974359], ['10', 10.684397163120567], ['09', 6.653153153153153], ['07', 7.013274336283186], ['03', 7.948339483394834], ['23', 6.696793002915452], ['20', 8.749019607843136], ['16', 7.713298791018998], ['08', 9.190661478599221], ['00', 7.5647840531561465], ['18', 7.94299674267101], ['12', 12.380116959064328], ['04', 9.7119341563786], ['06', 6.782051282051282], ['05', 8.794258373205741]]
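
- The two cells above keep the counts and the comment totals in separate dictionaries and check for missing keys by hand. As an alternative sketch (not the notebook's method), `collections.defaultdict` removes the `if`/`else` branch. The names `sample`, `counts`, `comments` and `averages` are illustrative, and the four pairs are taken from the ask posts printed earlier.

```python
import datetime as dt
from collections import defaultdict

# [created_at, num_comments] pairs taken from the Ask HN rows shown earlier.
sample = [
    ["9/26/2016 2:53", 7],
    ["9/26/2016 1:17", 3],
    ["9/25/2016 22:57", 0],
    ["9/25/2016 22:48", 3],
]

counts = defaultdict(int)    # number of posts per hour
comments = defaultdict(int)  # number of comments per hour
for created_at, num_comments in sample:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").strftime("%H")
    counts[hour] += 1
    comments[hour] += num_comments

# Average comments per post for every hour that has at least one post.
averages = {hour: comments[hour] / counts[hour] for hour in counts}
print(averages)
```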


- We have now completed the two steps, but the results are not easy to read. We need to sort this list and print the five highest values in a readable format.

In [8]:
swap_avg_by_hour = []

for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])
    
print (swap_avg_by_hour)
print ('\n')

sorted_swap = sorted(swap_avg_by_hour, reverse = True)

print ("Top 5 Hours for Ask Post Comments")
for average, hour in sorted_swap[:5]:
    print(
        "{}: {:.2f} average comments per post".format(
            dt.datetime.strptime(hour, "%H").strftime("%H:%M"),average
        )
    )

[[11.137546468401487, '02'], [7.407801418439717, '01'], [8.804177545691905, '22'], [8.687258687258687, '21'], [7.163043478260869, '19'], [9.449744463373083, '17'], [28.676470588235293, '15'], [9.692007797270955, '14'], [16.31756756756757, '13'], [8.96474358974359, '11'], [10.684397163120567, '10'], [6.653153153153153, '09'], [7.013274336283186, '07'], [7.948339483394834, '03'], [6.696793002915452, '23'], [8.749019607843136, '20'], [7.713298791018998, '16'], [9.190661478599221, '08'], [7.5647840531561465, '00'], [7.94299674267101, '18'], [12.380116959064328, '12'], [9.7119341563786, '04'], [6.782051282051282, '06'], [8.794258373205741, '05']]


Top 5 Hours for Ask Post Comments
15:00: 28.68 average comments per post
13:00: 16.32 average comments per post
12:00: 12.38 average comments per post
02:00: 11.14 average comments per post
10:00: 10.68 average comments per post
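
- Swapping the columns works because `sorted` compares lists element by element. A possible alternative, sketched here on a few pairs copied from the `avg_by_hour` output, is to pass a `key` function so the list is sorted by its second element directly. The names `avg_by_hour_sample` and `top_hours` are illustrative.

```python
# A few [hour, average] pairs copied from the avg_by_hour output above.
avg_by_hour_sample = [
    ['02', 11.137546468401487],
    ['15', 28.676470588235293],
    ['13', 16.31756756756757],
    ['09', 6.653153153153153],
]

# Sort by the second element (the average), largest first, without swapping.
top_hours = sorted(avg_by_hour_sample, key=lambda pair: pair[1], reverse=True)

for hour, average in top_hours[:3]:
    print("{}:00: {:.2f} average comments per post".format(hour, average))
```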


## **CONCLUSION**

- From the above analysis we can now deduce that:

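    - `Ask HN` posts receive more comments on average than `Show HN` posts (about 10.39 versus 4.89 comments per post).

    - The best hour of the day to create an ask post is 15:00, when ask posts receive an average of 28.68 comments per post.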


- In this lesson, we have learnt how to analyse social media posts, with a focus on **Hacker News**, and we determined which type of post on the platform gains the higher average number of comments. We then analysed the posts with the higher average by hour of creation, using the `datetime` module which we learnt about earlier. This allowed us to determine which hour of the day attracts the highest average number of comments. These results will give existing and prospective users of the platform a good idea of when to make Ask Posts in order to get a substantial number of contributions.


- Think you can do this on your own? I CHALLENGE YOU! GO FOR IT!!!!


- And remember, YOU CAN DO IT!