# Hacker News Project

* In this project, we will analyze a subset of Hacker News data using string manipulation, object-oriented programming, and date handling techniques. 

* We will compare "Ask HN" and "Show HN" posts to determine which type receives more comments on average and explore trends in post timing. 

* The goal is to practice data analysis in a real-world context. 📊

In [2]:
from csv import reader

openedFile = open("Dataset/hacker_news.csv")
readFile = reader(openedFile)
hn = list(readFile)

In [3]:
print(hn[:5])

[['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']]


In [4]:
header = hn[0]      # column names
hn_rows = hn[1:]    # data rows only; hn itself still includes the header row

In [5]:
print(hn_rows[:5])

[['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
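The loading cells above leave the file handle open and keep the header in the same list as the data. A minimal sketch of one alternative follows, assuming a hypothetical helper named `load_rows` (not part of the original notebook) and a small in-memory sample in place of the real file:

```python
import csv
import io


def load_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]


# In-memory sample standing in for Dataset/hacker_news.csv.
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)
header, rows = load_rows(sample)
print(header)
print(len(rows))

# With the real file, a `with` block closes the handle automatically
# (the utf-8 encoding is an assumption about the file):
# with open("Dataset/hacker_news.csv", encoding="utf-8", newline="") as f:
#     header, rows = load_rows(f)
```

Returning the header separately means later loops cannot accidentally treat the column names as a post.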


### Extracting Ask HN and Show HN Posts 

In [6]:
ask_posts = []
show_posts = []
other_posts = []

In [7]:
for row in hn[1:]:  # skip the header row so it is not counted as a post
    title = row[1]
    lowerTitle = title.lower()
    if lowerTitle.startswith('ask hn'):
        ask_posts.append(row)
    elif lowerTitle.startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

In [90]:
print("Ask posts contain", len(ask_posts), "articles")
print("Show posts contain", len(show_posts), "articles")
print("Other posts contain", len(other_posts), "articles")

Ask posts contain 1744 articles
Show posts contain 1162 articles
Other posts contain 17194 articles
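The prefix check used in the loop above can be isolated into a small function, which makes it easy to test on individual titles. This is only a sketch: `classify_title` is a hypothetical name, and two of the sample titles are invented for illustration.

```python
def classify_title(title):
    """Return 'ask', 'show', or 'other' based on the title prefix."""
    lowered = title.lower()  # make the prefix check case-insensitive
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"


# Only the second title comes from the dataset preview; the others are made up.
titles = [
    "Ask HN: How do you learn a new language?",
    "Interactive Dynamic Video",
    "SHOW HN: A tiny CSV viewer",
]
print([classify_title(t) for t in titles])
```

Because the title is lowercased first, variants such as `ASK HN` or `Show Hn` land in the expected group.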

## Let's determine if ask posts or show posts receive more comments on average.

### For ask posts

In [9]:
total_ask_comments = 0
for post in ask_posts:
    numComments = post[4]
    numComments = int(numComments)
    total_ask_comments += numComments
avg_ask_comments = total_ask_comments / len(ask_posts)

In [10]:
print("the average number of comments on ask posts is", int(avg_ask_comments))

the average number of comments on ask posts is 14


### For show posts

In [11]:
total_show_comments = 0

for post in show_posts:
    numComments = post[4]
    numComments = int(numComments)
    total_show_comments += numComments
avg_show_comments = total_show_comments / len(show_posts)


In [12]:
print("the average number of comments on show posts is", int(avg_show_comments))

the average number of comments on show posts is 10


* On average, ask posts receive more comments than show posts (about 14 versus 10). Since ask posts are more likely to receive comments, the remaining analysis focuses on these posts.
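The two averaging loops above differ only in the list they iterate over, so they could share one function. A minimal sketch, assuming a hypothetical helper `average_comments` and the same column layout as the dataset (comment count at index 4):

```python
def average_comments(posts, comments_index=4):
    """Return the mean number of comments per post, or 0.0 for an empty list."""
    if not posts:
        return 0.0
    total = sum(int(post[comments_index]) for post in posts)
    return total / len(posts)


# Three rows taken from the dataset preview (52, 10 and 1 comments).
sample_posts = [
    ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'],
    ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'],
    ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'],
]
print(average_comments(sample_posts))
```

The empty-list guard avoids a `ZeroDivisionError` if one of the groups turns out to contain no posts.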


### Determining Whether Ask Posts Created at a Certain Time Attract More Comments

In [14]:
import datetime as dt

result_list = []
for element in ask_posts:
    created_time = element[6]
    num_comment = int(element[4])
    result_list.append((created_time, num_comment))


In [17]:
print(result_list[:5])

[('8/16/2016 9:55', 6), ('11/22/2015 13:43', 29), ('5/2/2016 10:14', 1), ('8/2/2016 14:20', 3), ('10/15/2015 16:38', 17)]


In [73]:
counts_by_hour = {}
comments_by_hour = {}

for created_time, num_comment in result_list:
    # Parse timestamps such as "8/16/2016 9:55" and keep only the hour (0-23).
    created_dt = dt.datetime.strptime(created_time, "%m/%d/%Y %H:%M")
    hour = created_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = num_comment
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += num_comment

In [74]:
print("Number of Ask posts per hour ➡️", counts_by_hour)
print("Number of comments per hour ➡️", comments_by_hour)

Number of Ask posts per hour ➡️ {9: 45, 13: 85, 10: 59, 14: 107, 16: 108, 23: 68, 12: 73, 17: 100, 15: 116, 21: 109, 20: 80, 2: 58, 18: 109, 3: 54, 5: 46, 19: 110, 1: 60, 22: 71, 8: 48, 4: 47, 0: 55, 6: 44, 7: 34, 11: 58}
Number of comments per hour ➡️ {9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
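The per-hour tally above can also be written with `collections.defaultdict`, which removes the need for the "first time seen" branch. The sketch below is one possible variant, not the notebook's method; `tally_by_hour` is a hypothetical name and the input is the five pairs printed earlier.

```python
import datetime as dt
from collections import defaultdict


def tally_by_hour(pairs, fmt="%m/%d/%Y %H:%M"):
    """Return (posts per hour, comments per hour) for (created_at, comments) pairs."""
    counts = defaultdict(int)
    comments = defaultdict(int)
    for created_at, num_comments in pairs:
        hour = dt.datetime.strptime(created_at, fmt).hour
        counts[hour] += 1            # missing keys start at 0
        comments[hour] += num_comments
    return dict(counts), dict(comments)


# The first five (created_at, num_comments) pairs from result_list.
sample_pairs = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
    ('8/2/2016 14:20', 3),
    ('10/15/2015 16:38', 17),
]
counts, comments = tally_by_hour(sample_pairs)
print(counts)
print(comments)
```

Converting back to plain dictionaries at the end keeps later lookups from silently creating new keys.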


### Calculating the Average Number of Comments for Ask HN Posts by Hour

In [77]:

avg_by_hour = []

for hour in counts_by_hour:
    if hour in comments_by_hour:
        avg_by_hour.append([hour, round(comments_by_hour[hour] / counts_by_hour[hour], 2)])


print(avg_by_hour)

[[9, 5.58], [13, 14.74], [10, 13.44], [14, 13.23], [16, 16.8], [23, 7.99], [12, 9.41], [17, 11.46], [15, 38.59], [21, 16.01], [20, 21.52], [2, 23.81], [18, 13.2], [3, 7.8], [5, 10.09], [19, 10.8], [1, 11.38], [22, 6.75], [8, 10.25], [4, 7.17], [0, 8.13], [6, 9.02], [7, 7.85], [11, 11.05]]
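The same per-hour average can be expressed as a dictionary comprehension. The sketch below reuses the totals for two hours from the output above; the names `counts_sample`, `comments_sample`, and `avg_sample` are illustrative only.

```python
# Post counts and comment totals for hours 15 and 2, copied from the output above.
counts_sample = {15: 116, 2: 58}
comments_sample = {15: 4477, 2: 1381}

# Average comments per post for each hour, rounded to two decimals.
avg_sample = {
    hour: round(comments_sample[hour] / counts_sample[hour], 2)
    for hour in counts_sample
}
print(avg_sample)
```

A dictionary keeps each hour attached to its average, so a later lookup such as `avg_sample[15]` needs no list search.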


### Sorting and Printing Values

In [82]:
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

In [83]:
print(swap_avg_by_hour)

[[5.58, 9], [14.74, 13], [13.44, 10], [13.23, 14], [16.8, 16], [7.99, 23], [9.41, 12], [11.46, 17], [38.59, 15], [16.01, 21], [21.52, 20], [23.81, 2], [13.2, 18], [7.8, 3], [10.09, 5], [10.8, 19], [11.38, 1], [6.75, 22], [10.25, 8], [7.17, 4], [8.13, 0], [9.02, 6], [7.85, 7], [11.05, 11]]


In [84]:
sorted_swap = sorted(swap_avg_by_hour, reverse=True)

### Top 5 Hours for Ask Posts Comments

In [85]:
print(sorted_swap)
    

[[38.59, 15], [23.81, 2], [21.52, 20], [16.8, 16], [16.01, 21], [14.74, 13], [13.44, 10], [13.23, 14], [13.2, 18], [11.46, 17], [11.38, 1], [11.05, 11], [10.8, 19], [10.25, 8], [10.09, 5], [9.41, 12], [9.02, 6], [8.13, 0], [7.99, 23], [7.85, 7], [7.8, 3], [7.17, 4], [6.75, 22], [5.58, 9]]


In [89]:
for data in sorted_swap[:5]:
    print(data[1], "H:", data[0], "average comments per post")

15 H: 38.59 average comments per post
2 H: 23.81 average comments per post
20 H: 21.52 average comments per post
16 H: 16.8 average comments per post
21 H: 16.01 average comments per post


### Conclusion

Ask HN posts receive more comments on average than Show HN posts (about 14 versus 10). Among Ask HN posts, those created during hour 15 (3 PM in the dataset's timestamps) attract the most comments, with an average of 38.59 per post.