# Hacker News project

In this project we'll work with a dataset of Hacker News posts. The table below describes the columns of this dataset:

|column|description|
|:--------:|:-----------:|
|id| The unique identifier from Hacker News for the post|
|title| The title of the post|
|url| The URL that the post links to, if the post has a URL|
|num_points| The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes|
|num_comments| The number of comments that were made on the post|
|author| The username of the person who submitted the post|
|created_at| The date and time at which the post was submitted|

We might be interested in the total number of posts, the authors with the most posts, and the number of comments on any particular post. We might also look at why some posts are rated better than others.
We're specifically interested in posts whose titles begin with either Ask HN or Show HN. Users submit Ask HN posts to ask the Hacker News community a specific question, and Show HN posts to show the community a project, a product, or something else they find interesting.

## Loading the data

We'll start by importing the reader function from the csv module and opening the file with Python's built-in open function.

In [1]:
from csv import reader

In [2]:
file = open("hacker_news.csv")

Pass the opened file to the reader function to parse it as CSV.

In [3]:
file_read = reader(file)

Convert the reader object to a list of lists and assign it to the variable hn, then remove the header row so that only the data rows remain.

In [4]:
hn = list(file_read)

In [5]:
hn = hn[1:]

Display the first five rows to check that the data loaded correctly.

In [6]:
hn[:5]

[['12224879',
  'Interactive Dynamic Video',
  'http://www.interactivedynamicvideo.com/',
  '386',
  '52',
  'ne0phyte',
  '8/4/2016 11:52'],
 ['10975351',
  'How to Use Open Source and Shut the Fuck Up at the Same Time',
  'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/',
  '39',
  '10',
  'josep2',
  '1/26/2016 19:30'],
 ['11964716',
  "Florida DJs May Face Felony for April Fools' Water Joke",
  'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/',
  '2',
  '1',
  'vezycash',
  '6/23/2016 22:20'],
 ['11919867',
  'Technology ventures: From Idea to Enterprise',
  'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429',
  '3',
  '1',
  'hswarna',
  '6/17/2016 0:01'],
 ['10301696',
  'Note by Note: The Making of Steinway L1037 (2007)',
  'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0',
  '8',
  '2',
  'walterbell',
  '9/30/2015 4:12']]
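
The loading steps above leave the file open after reading. As a minimal sketch, the same steps can be wrapped in a `with` block so the file is closed automatically. The helper name `load_rows` and the in-memory sample standing in for `hacker_news.csv` are assumptions made for illustration and are not part of the notebook.

```python
import io
from csv import reader


def load_rows(file_obj):
    """Return (header, rows) parsed from an open CSV file object."""
    rows = list(reader(file_obj))
    return rows[0], rows[1:]


# Hypothetical in-memory stand-in for hacker_news.csv so the sketch runs anywhere.
sample_csv = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "12224879,Interactive Dynamic Video,"
    "http://www.interactivedynamicvideo.com/,386,52,ne0phyte,8/4/2016 11:52\n"
)

# The with block closes the file object once the rows have been read.
with sample_csv as f:
    header, hn_sample = load_rows(f)

print(header)
print(hn_sample[0])
```

With the real file, `open("hacker_news.csv", newline="")` would take the place of the `StringIO` object.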

## Filtering the data

To find the posts that begin with either Ask HN or Show HN we'll use the string method startswith. The method is case-sensitive. Here's how it works:

In [7]:
print("Missisipi".startswith("miss"))
print()
print("missisipi".startswith("miss"))

False

True


We can also call the lower method first to make the check case-insensitive. In this case we get True both times:

In [8]:
print("Missisipi".lower().startswith("miss"))
print()
print("missisipi".startswith("miss"))

True

True
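
As a side note, `str.startswith` also accepts a tuple of prefixes, which allows checking both prefixes in a single call. The sketch below is only an illustration, and the helper name `is_ask_or_show` is hypothetical. The notebook itself checks the two prefixes separately because it needs to tell them apart.

```python
def is_ask_or_show(title):
    """Case-insensitive check for either of the two prefixes of interest."""
    return title.lower().startswith(("ask hn", "show hn"))


print(is_ask_or_show("Ask HN: How to improve my personal website?"))  # True
print(is_ask_or_show("Show HN: Something pointless I made"))          # True
print(is_ask_or_show("Interactive Dynamic Video"))                    # False
```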


Now we'll create three empty lists and use them to separate our data:

In [9]:
ask_posts = []
show_posts = []
other_posts = []

for row in hn:
    title = row[1]  #  here we assign the title to the variable
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)

Now we'll check the length of each list:

In [10]:
print(f"length of ask posts is: {len(ask_posts)}")
print()
print(f"length of show posts is: {len(show_posts)}")
print()
print(f"length of other posts is: {len(other_posts)}")

length of ask posts is: 1744

length of show posts is: 1162

length of other posts is: 17194
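
Because every row goes into exactly one of the three lists, the three lengths should add up to the number of data rows (here 1744 + 1162 + 17194 = 20100). A small sketch of that sanity check follows. The function name `split_posts` is hypothetical, and the three sample rows are copied from the outputs shown earlier.

```python
def split_posts(rows):
    """Split rows into ask, show and other posts by title prefix."""
    ask, show, other = [], [], []
    for row in rows:
        title = row[1].lower()
        if title.startswith("ask hn"):
            ask.append(row)
        elif title.startswith("show hn"):
            show.append(row)
        else:
            other.append(row)
    return ask, show, other


# Three rows taken from the notebook outputs, one of each kind.
sample_rows = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10646440', 'Show HN: Something pointless I made',
     'http://dn.ht/picklecat/', '747', '102', 'dhotson', '11/29/2015 22:46'],
    ['12224879', 'Interactive Dynamic Video',
     'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte',
     '8/4/2016 11:52'],
]

ask, show, other = split_posts(sample_rows)

# No row is lost or counted twice.
print(len(ask) + len(show) + len(other) == len(sample_rows))  # True
```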


A quick look at the first five rows of the ask posts:

In [11]:
ask_posts[:5]

[['12296411',
  'Ask HN: How to improve my personal website?',
  '',
  '2',
  '6',
  'ahmedbaracat',
  '8/16/2016 9:55'],
 ['10610020',
  'Ask HN: Am I the only one outraged by Twitter shutting down share counts?',
  '',
  '28',
  '29',
  'tkfx',
  '11/22/2015 13:43'],
 ['11610310',
  'Ask HN: Aby recent changes to CSS that broke mobile?',
  '',
  '1',
  '1',
  'polskibus',
  '5/2/2016 10:14'],
 ['12210105',
  'Ask HN: Looking for Employee #3 How do I do it?',
  '',
  '1',
  '3',
  'sph130',
  '8/2/2016 14:20'],
 ['10394168',
  'Ask HN: Someone offered to buy my browser extension from me. What now?',
  '',
  '28',
  '17',
  'roykolak',
  '10/15/2015 16:38']]

And a quick look at the first five rows of the show posts:

In [12]:
show_posts[:5]

[['10627194',
  'Show HN: Wio Link  ESP8266 Based Web of Things Hardware Development Platform',
  'https://iot.seeed.cc',
  '26',
  '22',
  'kfihihc',
  '11/25/2015 14:03'],
 ['10646440',
  'Show HN: Something pointless I made',
  'http://dn.ht/picklecat/',
  '747',
  '102',
  'dhotson',
  '11/29/2015 22:46'],
 ['11590768',
  'Show HN: Shanhu.io, a programming playground powered by e8vm',
  'https://shanhu.io',
  '1',
  '1',
  'h8liu',
  '4/28/2016 18:05'],
 ['12178806',
  'Show HN: Webscope  Easy way for web developers to communicate with Clients',
  'http://webscopeapp.com',
  '3',
  '3',
  'fastbrick',
  '7/28/2016 7:11'],
 ['10872799',
  'Show HN: GeoScreenshot  Easily test Geo-IP based web pages',
  'https://www.geoscreenshot.com/',
  '1',
  '9',
  'kpsychwave',
  '1/9/2016 20:45']]

Now we'll find the total and average number of comments on the ask posts:

In [13]:
total_ask_comments = 0 

for row in ask_posts:
    comments = int(row[4])
    total_ask_comments += comments 

# check number of comments
print('Total number of ask comments:', total_ask_comments)

# calculate average number of ask comments 
avg_ask_comments = total_ask_comments / len(ask_posts)

print('Average of ask comments:', avg_ask_comments)

Total number of ask comments: 24483
Average of ask comments: 14.038417431192661


Average number of comments for show posts:

In [23]:
# show comments count
total_show_comments = 0 

for row in show_posts:
    comments = int(row[4])
    total_show_comments += comments 

# showing total number of comments
print('Total number of show comments:', total_show_comments)

# calculate average number of show comments
avg_show_comments = total_show_comments / len(show_posts)

print('Average of show comments:', avg_show_comments)

Total number of show comments: 11988
Average of show comments: 10.31669535283993


On average, ask posts receive about 14 comments and show posts about 10, so ask posts tend to get roughly 4 more comments per post. Since ask posts are more likely to receive comments, we'll focus our remaining analysis on this type of post.
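
The two cells above repeat the same loop, so the calculation could be factored into one function. The following is a minimal sketch with a hypothetical helper named `average_comments`. The totals and counts are the ones printed above, and the two sample rows are shortened copies of rows from the ask posts output.

```python
def average_comments(posts, comments_index=4):
    """Average of the num_comments column over a list of rows."""
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)


# Two ask posts from the output above, with 6 and 29 comments.
sample_posts = [
    ['12296411', 'Ask HN: How to improve my personal website?', '',
     '2', '6', 'ahmedbaracat', '8/16/2016 9:55'],
    ['10610020', 'Ask HN: Am I the only one outraged by Twitter shutting '
     'down share counts?', '', '28', '29', 'tkfx', '11/22/2015 13:43'],
]
print(average_comments(sample_posts))  # 17.5

# Difference between the two averages, from the totals printed above.
avg_ask = 24483 / 1744
avg_show = 11988 / 1162
difference = avg_ask - avg_show
print(f"{difference:.2f}")  # 3.72
```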

## Post times

Now we'll determine whether ask posts created at a certain time are more likely to attract comments.

To do that we will:

* Calculate the number of ask posts created in each hour of the day, along with the number of comments received.
* Calculate the average number of comments ask posts receive by hour created.

For this purpose we'll use Python's datetime module. As a quick reminder, the **datetime.strptime()** constructor parses dates stored as strings and returns datetime objects:

In [38]:
import datetime

my_birthday_str = 'January 9, 1999'  #  my birthday date in string format
my_birthday_dt = datetime.datetime.strptime(my_birthday_str, "%B %d, %Y")  # parsing the string into a datetime object
print(type(my_birthday_dt))

<class 'datetime.datetime'>
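
The `created_at` values in this dataset look like `8/16/2016 9:55`, so they need a different format string than the birthday example. A short sketch using one timestamp copied from the data shows how to parse it and extract the hour:

```python
import datetime

created_at = '8/16/2016 9:55'  # value copied from the ask posts output

# %m/%d/%Y matches month/day/year and %H:%M matches a 24-hour time.
parsed = datetime.datetime.strptime(created_at, '%m/%d/%Y %H:%M')

print(parsed.hour)                        # 9
print(parsed.strftime('%Y-%m-%d %H:%M'))  # 2016-08-16 09:55
```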


Now we'll apply this to the ask_posts data. First, we collect the creation date and time and the number of comments of each post:

In [44]:
result_list = []



for post in ask_posts:
    created_at = post[6]
    comments = int(post[4])
    info = (created_at, comments)  # using a tuple to store the date and the comment count
    result_list.append(info)

print(result_list[:5])

[('8/16/2016 9:55', 6), ('11/22/2015 13:43', 29), ('5/2/2016 10:14', 1), ('8/2/2016 14:20', 3), ('10/15/2015 16:38', 17)]


We now have a list of tuples, each containing a post's creation date and its number of comments. Next we'll build the frequency tables:

In [60]:
counts_by_hour = {}  # number of ask posts created in each hour of the day

comments_by_hour = {}  # number of comments on ask posts created in each hour of the day

for info in result_list:
    date = info[0]
    date = datetime.datetime.strptime(date,'%m/%d/%Y %H:%M')  # parsing the date string into a datetime object
    hour = date.hour
    comment_n = info[1]

    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = comment_n
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += comment_n

In [59]:
print(comments_by_hour)

{9: 251, 13: 1253, 10: 793, 14: 1416, 16: 1814, 23: 543, 12: 687, 17: 1146, 15: 4477, 21: 1745, 20: 1722, 2: 1381, 18: 1439, 3: 421, 5: 464, 19: 1188, 1: 683, 22: 479, 8: 492, 4: 337, 0: 447, 6: 397, 7: 267, 11: 641}
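
The `if hour not in ...` branch can be avoided with `collections.defaultdict`, which starts every missing key at 0. The sketch below is an alternative to the cell above, not the notebook's own approach. The first three tuples come from the `result_list` output, and the last one is a made-up entry added only to show two posts landing in the same hour.

```python
import datetime
from collections import defaultdict

sample_results = [
    ('8/16/2016 9:55', 6),
    ('11/22/2015 13:43', 29),
    ('5/2/2016 10:14', 1),
    ('1/1/2016 9:05', 4),  # hypothetical entry, not from the dataset
]

counts = defaultdict(int)    # posts per hour
comments = defaultdict(int)  # comments per hour

for created_at, n_comments in sample_results:
    hour = datetime.datetime.strptime(created_at, '%m/%d/%Y %H:%M').hour
    counts[hour] += 1
    comments[hour] += n_comments

print(dict(counts))    # {9: 2, 13: 1, 10: 1}
print(dict(comments))  # {9: 10, 13: 29, 10: 1}
```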


Next we'll use these two dictionaries to calculate the average number of comments for posts created during each hour of the day. To illustrate the technique, let's work with the following dictionary:

In [68]:
sample_dict =  {
        "mango": 4,
        "lemon": 1,
        "apple": 7
                }

fruits = []

# multiply each fruit's amount by ten and collect the results as a list of tuples
for fruit in sample_dict:
    fruits.append((fruit ,sample_dict[fruit] * 10))

print(fruits)

[('mango', 40), ('lemon', 10), ('apple', 70)]
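
The same transformation can also be written as a list comprehension over `dict.items()`, which yields each key together with its value. This is an equivalent sketch, not a change to the approach:

```python
sample_dict = {"mango": 4, "lemon": 1, "apple": 7}

# items() gives (key, value) pairs, so no second lookup is needed.
fruits = [(fruit, amount * 10) for fruit, amount in sample_dict.items()]

print(fruits)  # [('mango', 40), ('lemon', 10), ('apple', 70)]
```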


Now we'll do something similar with the comments by hour, dividing the number of comments in each hour by the number of posts created in that hour:

In [79]:
avg_comments_hour = [] 

for hour in counts_by_hour:
    avg_comments_hour.append([hour, comments_by_hour[hour] / counts_by_hour[hour]])

avg_comments_hour

[[9, 5.5777777777777775],
 [13, 14.741176470588234],
 [10, 13.440677966101696],
 [14, 13.233644859813085],
 [16, 16.796296296296298],
 [23, 7.985294117647059],
 [12, 9.41095890410959],
 [17, 11.46],
 [15, 38.5948275862069],
 [21, 16.009174311926607],
 [20, 21.525],
 [2, 23.810344827586206],
 [18, 13.20183486238532],
 [3, 7.796296296296297],
 [5, 10.08695652173913],
 [19, 10.8],
 [1, 11.383333333333333],
 [22, 6.746478873239437],
 [8, 10.25],
 [4, 7.170212765957447],
 [0, 8.127272727272727],
 [6, 9.022727272727273],
 [7, 7.852941176470588],
 [11, 11.051724137931034]]
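
To make the division concrete, here is a small sketch for three of the hours. The comment totals are taken from the `comments_by_hour` output. The post counts are not printed anywhere in the notebook, so the values below are inferred from the totals and the averages and should be read as assumptions.

```python
# Comment totals copied from the comments_by_hour output.
comments_by_hour_sample = {17: 1146, 20: 1722, 8: 492}

# Assumed post counts, inferred from the totals and the printed averages.
counts_by_hour_sample = {17: 100, 20: 80, 8: 48}

avg_sample = [
    [hour, comments_by_hour_sample[hour] / counts_by_hour_sample[hour]]
    for hour in counts_by_hour_sample
]

for hour, avg in avg_sample:
    print(hour, avg)
```

Under these assumed counts the averages come out as 11.46, 21.525 and 10.25, which agrees with the entries for hours 17, 20 and 8 in the output above.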

## Sorting the values

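The result above makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read. Since sorted compares lists by their first element, we first swap the two columns so that the average comes first.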



In [85]:
swap_avg_by_hour = []

for hour,avg in avg_comments_hour:
    swap_avg_by_hour.append([avg,hour])


In [86]:
print(swap_avg_by_hour)

[[5.5777777777777775, 9], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [16.796296296296298, 16], [7.985294117647059, 23], [9.41095890410959, 12], [11.46, 17], [38.5948275862069, 15], [16.009174311926607, 21], [21.525, 20], [23.810344827586206, 2], [13.20183486238532, 18], [7.796296296296297, 3], [10.08695652173913, 5], [10.8, 19], [11.383333333333333, 1], [6.746478873239437, 22], [10.25, 8], [7.170212765957447, 4], [8.127272727272727, 0], [9.022727272727273, 6], [7.852941176470588, 7], [11.051724137931034, 11]]


In [89]:
sorted_avg_comments_hour = sorted(swap_avg_by_hour, reverse=True)
print(sorted_avg_comments_hour)

[[38.5948275862069, 15], [23.810344827586206, 2], [21.525, 20], [16.796296296296298, 16], [16.009174311926607, 21], [14.741176470588234, 13], [13.440677966101696, 10], [13.233644859813085, 14], [13.20183486238532, 18], [11.46, 17], [11.383333333333333, 1], [11.051724137931034, 11], [10.8, 19], [10.25, 8], [10.08695652173913, 5], [9.41095890410959, 12], [9.022727272727273, 6], [8.127272727272727, 0], [7.985294117647059, 23], [7.852941176470588, 7], [7.796296296296297, 3], [7.170212765957447, 4], [6.746478873239437, 22], [5.5777777777777775, 9]]
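
Swapping the columns is one way to sort by the average. Another option, sketched here as an alternative, is to pass a `key` function to `sorted` so the original `[hour, average]` order is kept. The four sample pairs are copied from the `avg_comments_hour` output.

```python
avg_sample = [
    [9, 5.5777777777777775],
    [15, 38.5948275862069],
    [2, 23.810344827586206],
    [20, 21.525],
]

# Sort by the second element (the average), highest first.
top = sorted(avg_sample, key=lambda pair: pair[1], reverse=True)

print([hour for hour, _ in top])  # [15, 2, 20, 9]
```

One difference to keep in mind: with a `key` function, hours with identical averages keep their original relative order, whereas the swapped lists break ties by the hour.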


## Identifying the top 5 hours for comments

In [93]:
# display top 5 hours which got the highest average of number of comments
print("Top 5 Hours for Ask Posts Comments")
for avg, hr in sorted_avg_comments_hour[0:5]:
    hour_label = datetime.datetime.strptime(str(hr), '%H').strftime('%H')  # zero-padded hour, e.g. '02'
    print('{}: {avg:,.2f} average comments per post'.format(hour_label, avg=avg))

Top 5 Hours for Ask Posts Comments
15: 38.59 average comments per post
02: 23.81 average comments per post
20: 21.52 average comments per post
16: 16.80 average comments per post
21: 16.01 average comments per post
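
Since the hour is already an integer, the round trip through `strptime` and `strftime` is not strictly needed. As a sketch of an alternative, an f-string with the `02d` format pads the hour to two digits directly. The `:00` suffix is an addition made here for readability, and the five pairs are copied from the sorted output above.

```python
top_five = [
    [38.5948275862069, 15],
    [23.810344827586206, 2],
    [21.525, 20],
    [16.796296296296298, 16],
    [16.009174311926607, 21],
]

# 02d pads the hour with a leading zero; ,.2f rounds to two decimals.
lines = [
    f"{hr:02d}:00: {avg:,.2f} average comments per post"
    for avg, hr in top_five
]

for line in lines:
    print(line)
```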


## Conclusion

This project's purpose was to analyse post types on Hacker News and determine if there is a type of post which is more popular and attracts more comments.

Considering that we used only posts with comments, the results showed that, on average, ask posts get about 4 more comments than show posts (roughly 14 versus 10). The results also showed that ask posts created at 3 p.m. EST (15:00) receive the most comments on average, about 38.59 per post.