<a href="https://colab.research.google.com/github/KacperKaszuba0608/Projects/blob/main/Exploring_Hacker_News_Posts_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exploring Hacker News Posts

In this project I work with the Hacker News Posts dataset, which contains posts submitted to the Hacker News site. I focus on posts whose titles begin with "Ask HN" or "Show HN", because these posts are addressed to the Hacker News community. To read more about the dataset, follow this [link](https://www.kaggle.com/datasets/hacker-news/hacker-news-posts). Below are descriptions of the columns:
* `id` -  the unique identifier from Hacker News for the post
* `title` - the title of the post
* `url` - the URL that the posts links to, if the post has a URL
* `num_points` - the number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
* `num_comments` - the number of comments on the post
* `author` - the username of the person who submitted the post
* `created_at` - the date and time of the post's submission

## Importing the libraries and reading the dataset

In [None]:
import os
# Use your own Kaggle credentials here; never publish a real API key in a notebook
os.environ['KAGGLE_USERNAME'] = "kacperkaszuba"
os.environ['KAGGLE_KEY'] = "<your-kaggle-api-key>"

from kaggle.api.kaggle_api_extended import KaggleApi
api = KaggleApi()
api.authenticate()

api.dataset_download_files('hacker-news/hacker-news-posts', path=".", unzip = True)

In [None]:
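from csv import reader
import datetime as dt

# Read the whole CSV into a list of rows; the file is closed automatically
with open('HN_posts_year_to_Sep_26_2016.csv') as file:
    hn = list(reader(file))  # hn = Hacker News

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]

print(headers)
print(hn[:5])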

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']
[['12579008', 'You have two days to comment if you want stem cells to be classified as your own', 'http://www.regulations.gov/document?D=FDA-2015-D-3719-0018', '1', '0', 'altstar', '9/26/2016 3:26'], ['12579005', 'SQLAR  the SQLite Archiver', 'https://www.sqlite.org/sqlar/doc/trunk/README.md', '1', '0', 'blacksqr', '9/26/2016 3:24'], ['12578997', 'What if we just printed a flatscreen television on the side of our boxes?', 'https://medium.com/vanmoof/our-secrets-out-f21c1f03fdc8#.ietxmez43', '1', '0', 'pavel_lishin', '9/26/2016 3:19'], ['12578989', 'algorithmic music', 'http://cacm.acm.org/magazines/2011/7/109891-algorithmic-composition/fulltext', '1', '0', 'poindontcare', '9/26/2016 3:16'], ['12578979', 'How the Data Vault Enables the Next-Gen Data Warehouse and Data Lake', 'https://www.talend.com/blog/2016/05/12/talend-and-Â\x93the-data-vaultÂ\x94', '1', '0', 'markgainor1', '9/26/2016 3:14']]
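
The functions later in this notebook receive column positions as plain integers (1, 4 and 6). As a minimal sketch, those positions could instead be looked up from the header row, so the code does not depend on remembered numbers. The `column_index` dictionary and the constant names are hypothetical and not part of the original notebook.

```python
# Header row as printed above
headers = ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']

# Hypothetical helper: map each column name to its position in a row
column_index = {name: position for position, name in enumerate(headers)}

TITLE = column_index['title']
NUM_COMMENTS = column_index['num_comments']
CREATED_AT = column_index['created_at']

print(TITLE, NUM_COMMENTS, CREATED_AT)  # 1 4 6
```

With these names, a call such as `extract_posts(hn, TITLE)` reads the same as `extract_posts(hn, 1)` but documents which column is meant.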


## Extracting Ask HN and Show HN Posts

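As mentioned in the introduction, I want to extract the posts whose titles start with "Ask HN" or "Show HN". The function below splits the dataset into three lists: ask posts, show posts, and all other posts. Titles are lowercased first, so the match is case-insensitive.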

In [None]:
def extract_posts(dataset, index_of_title):
    ask_posts = []
    show_posts = []
    other_posts = []
    
    for row in dataset:
        title = row[index_of_title]
        title = title.lower()
        
        if title.startswith("ask hn"):
            ask_posts.append(row)
        elif title.startswith("show hn"):
            show_posts.append(row)
        else:
            other_posts.append(row)
            
    return ask_posts, show_posts, other_posts

In [None]:
ask_posts, show_posts, other_posts = extract_posts(hn, 1)

print(len(ask_posts))
print(len(show_posts))
print(len(other_posts))

9139
10158
273822


As the output shows, the dataset contains far more other posts (273822) than Ask HN or Show HN posts (9139 and 10158, respectively).
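
To put the three counts in proportion, the sketch below converts them to percentage shares. The numbers are copied from the printed output above; the `counts` and `shares` names are illustrative only.

```python
# Counts printed by the cell above
counts = {"Ask HN": 9139, "Show HN": 10158, "Other": 273822}
total = sum(counts.values())

# Percentage share of each group, rounded to two decimals
shares = {group: round(100 * n / total, 2) for group, n in counts.items()}

for group, share in shares.items():
    print(f"{group}: {share}%")
```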

## Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [None]:
def average_of_comments(dataset, index_of_column):
    total_comments = 0
    
    for row in dataset:
        number_of_comments = int(row[index_of_column])
        total_comments += number_of_comments
    
    avg_comments = round(total_comments / len(dataset), 4)
    
    return avg_comments

In [None]:
avg_ask_comments = average_of_comments(ask_posts, 4)
print(f' Average of ask posts comments: {avg_ask_comments}')

avg_show_comments = average_of_comments(show_posts, 4)
print(f' Average of show posts comments: {avg_show_comments}')

 Average of ask posts comments: 10.3935
 Average of show posts comments: 4.8861


The averages show that ask posts receive more comments than show posts: about 5.5 more comments per post on average (10.3935 versus 4.8861).
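
As a cross-check, the same kind of average can be computed with `statistics.mean` from the standard library. This is only a sketch: `sample_rows` is made-up data in the dataset's column order, not rows from the real file. The last lines compute the gap between the two averages reported above.

```python
from statistics import mean

# Hypothetical sample rows; only column 4 (num_comments) matters here
sample_rows = [
    ['1', 'Ask HN: example one', '', '3', '10', 'user_a', '9/26/2016 3:26'],
    ['2', 'Ask HN: example two', '', '5', '4', 'user_b', '9/26/2016 4:10'],
    ['3', 'Ask HN: example three', '', '1', '2', 'user_c', '9/26/2016 5:02'],
]

# Average number of comments across the sample rows
avg_sample = round(mean(int(row[4]) for row in sample_rows), 4)
print(avg_sample)  # 5.3333

# Gap between the ask and show averages printed above
gap = round(10.3935 - 4.8861, 4)
print(gap)  # 5.5074
```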

## Ask Posts

### Finding the Number of Ask Posts and Comments by Hour Created

In [None]:
def number_of_posts_and_comments_by_hour(dataset, 
                                         index_of_date_col,
                                         index_of_comment_col):
    result_list = []
    
    for row in dataset:
        date = row[index_of_date_col]
        comments = int(row[index_of_comment_col])
        result_list.append([date, comments])
        
    counts_by_hour = {}
    comments_by_hour = {}
    
    for row in result_list:
        date = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
        hour = date.strftime("%H")
        if hour not in counts_by_hour:
            counts_by_hour[hour] = 1
            comments_by_hour[hour] = row[1]
        else:
            counts_by_hour[hour] += 1
            comments_by_hour[hour] += row[1]
            
    return comments_by_hour, counts_by_hour

In [None]:
comments_by_hour, counts_by_hour = number_of_posts_and_comments_by_hour(ask_posts, 6, 4)
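
The two-dictionary bookkeeping in the function above can also be written with `collections.defaultdict`, which removes the `if hour not in ...` branch. The sketch below runs on a few made-up rows; `sample_rows`, `sample_counts` and `sample_comments` are hypothetical names chosen so they do not overwrite the notebook's variables.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical rows: only created_at (index 6) and num_comments (index 4) are used
sample_rows = [
    ['1', 'Ask HN: a', '', '1', '6', 'u1', '9/26/2016 3:26'],
    ['2', 'Ask HN: b', '', '1', '2', 'u2', '9/25/2016 3:50'],
    ['3', 'Ask HN: c', '', '1', '10', 'u3', '9/24/2016 15:05'],
]

sample_counts = defaultdict(int)    # posts created in each hour
sample_comments = defaultdict(int)  # comments on posts created in each hour

for row in sample_rows:
    created = dt.datetime.strptime(row[6], "%m/%d/%Y %H:%M")
    hour = created.strftime("%H")  # zero-padded hour, e.g. "03"
    sample_counts[hour] += 1
    sample_comments[hour] += int(row[4])

print(dict(sample_counts))    # {'03': 2, '15': 1}
print(dict(sample_comments))  # {'03': 8, '15': 10}
```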

### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [None]:
def average_in_lists(comment_dict, count_dict):
    result = []
    
    for hour in count_dict:
        result.append([round(comment_dict[hour] / count_dict[hour], 4), hour])
        
    return result

In [None]:
avg_by_hour = average_in_lists(comments_by_hour, counts_by_hour)

### Sorting and Printing Values from a List of Lists

In [None]:
def top5(list_of_avg):
    # Each item is [average, hour]; reverse sorting puts the highest average first
    sorted_results = sorted(list_of_avg, reverse = True)

    print("Top 5 Hours for Ask Posts Comments:")

    for counter, (avg, hour) in enumerate(sorted_results[:5], start=1):
        date = dt.datetime.strptime(hour, "%H").strftime("%H:%M")
        print(f'{counter}. {date} - {avg} average comments per post')

In [None]:
top5(avg_by_hour)

Top 5 Hours for Ask Posts Comments:
1. 15:00 - 28.6765 average comments per post
2. 13:00 - 16.3176 average comments per post
3. 12:00 - 12.3801 average comments per post
4. 02:00 - 11.1375 average comments per post
5. 10:00 - 10.6844 average comments per post


As the output shows, 15:00 (3:00 pm) is by far the best hour for comments, with 28.6765 comments per post on average. Second is 13:00 (1:00 pm) with 16.3176, and third is 12:00 (12:00 pm) with 12.3801. Fourth and fifth places are close to each other: 02:00 (2:00 am) with 11.1375 and 10:00 (10:00 am) with 10.6844.
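
The ranking works because each entry is an `[average, hour]` pair, so a plain `sorted` compares the average first. A sketch of the same ranking with an explicit `key`, which makes that intent visible; the pairs are copied from the output above and `avg_by_hour_sample` is an illustrative name.

```python
# [average comments, hour] pairs copied from the output above, in arbitrary order
avg_by_hour_sample = [
    [10.6844, "10"],
    [28.6765, "15"],
    [11.1375, "02"],
    [16.3176, "13"],
    [12.3801, "12"],
]

# Sort by the average (first element of each pair), highest first
ranked = sorted(avg_by_hour_sample, key=lambda pair: pair[0], reverse=True)

for rank, (avg, hour) in enumerate(ranked, start=1):
    print(f"{rank}. {hour}:00 - {avg}")
```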

## Show Posts

### Finding the Number of Show Posts and Comments by Hour Created

In [None]:
comments_by_hour, counts_by_hour = number_of_posts_and_comments_by_hour(show_posts, 6, 4)

### Calculating the Average Number of Comments for Show HN Posts by Hour

In [None]:
avg_by_hour = average_in_lists(comments_by_hour, counts_by_hour)

### Sorting and Printing Values from a List of Lists

In [None]:
top5(avg_by_hour)

Top 5 Hours for Ask Posts Comments:
1. 12:00 - 6.9942 average comments per post
2. 07:00 - 6.6822 average comments per post
3. 11:00 - 6.0025 average comments per post
4. 08:00 - 5.6044 average comments per post
5. 14:00 - 5.5158 average comments per post


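No single hour stands out for show posts, because the top five averages are close to each other (between roughly 5.5 and 7.0 comments per post). Still, the best hours fall in the morning and around midday: 12:00, 07:00, 11:00, 08:00 and 14:00. Note that the printed heading still says "Ask Posts" because that label is hard-coded in `top5`; the values are for show posts.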

## Conclusion

In conclusion, ask posts receive the most comments: on average about 5.5 more per post than show posts. Looking deeper, the best time to publish an ask post is the early afternoon, at 15:00 above all, followed by 13:00 and 12:00. If you want to ask something after midnight, choose 02:00. For show posts, the hours between 07:00 and 14:00 work best, although the differences between hours are small.