# Intro to the Project

In this project, we'll work with a data set of submissions to the popular technology site Hacker News.

You can find the data set [here](https://www.kaggle.com/hacker-news/hacker-news-posts), but note that it has been reduced from almost 300,000 rows to approximately 20,000 rows by removing all submissions that did not receive any comments, and then randomly sampling from the remaining submissions. Below are descriptions of the columns:

- title: title of the post (self-explanatory)

- url: the URL of the item being linked to

- num_points: the number of upvotes the post received

- num_comments: the number of comments the post received

- author: the name of the account that made the post

- created_at: the date and time the post was made (the time zone is Eastern Time in the US)

## Reading Data and Removing the Header

In [1]:
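# read the data set into a list of lists

import csv

# the with block closes the file automatically once it has been read
with open("HN_posts.csv" , encoding = "utf-8") as f:
    hn = list(csv.reader(f))

# separate the header row from the data rows
header = hn[0]
hn = hn[1:]
print(header)

# use hn[:3] to preview the first 3 rows; only the cell's last expression is displayed
len(hn)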

['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at']


293119
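
The read-and-split step above can be wrapped in a small helper. The sketch below is only an illustration: the helper name `read_rows` and the two sample rows are hypothetical (they are not taken from the data set), and `io.StringIO` stands in for the real file so the example runs without `HN_posts.csv`.

```python
import csv
import io

def read_rows(file_obj):
    """Return (header, rows) from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# Hypothetical two-row sample standing in for HN_posts.csv
sample = io.StringIO(
    "id,title,url,num_points,num_comments,author,created_at\n"
    "1,Ask HN: Example question?,,2,3,alice,9/26/2016 2:53\n"
    '2,"Show HN: A tool, with a comma",http://example.com,5,1,bob,9/26/2016 3:10\n'
)

header, rows = read_rows(sample)
print(header)
print(len(rows))
```

With the real file, the same helper could be called inside `with open("HN_posts.csv", encoding="utf-8") as f:`. Note that `csv.reader` keeps the quoted title with a comma as a single field, which a plain `str.split(",")` would not.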

We are interested in the Ask HN and Show HN posts, so we will focus on them.

We will split the data into 3 lists of lists:


1. ask_posts

2. show_posts

3. other_posts

In [2]:
# tip: str.startswith() checks whether a string begins with a given prefix (case-sensitive)

s = "fatma"
print(s.startswith("fa"))

print(s.startswith("Fa"))

# lower() returns a lowercase copy of the string, which makes the check case-insensitive
"Eslam".lower().startswith("e")


True
False


True
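
As a small extension of the tip above, `str.startswith` also accepts a tuple of prefixes and returns `True` if any of them matches. The titles below are hypothetical examples, not rows from the data set.

```python
# A tuple of prefixes: startswith returns True if any one of them matches
prefixes = ("ask hn", "show hn")

# Hypothetical titles for illustration
titles = [
    "Ask HN: What are you reading?",
    "SHOW HN: My side project",
    "Task HN is not a prefix match",
]

# Lowercase each title first so the comparison ignores case
matches = [title.lower().startswith(prefixes) for title in titles]
print(matches)
```

The third title contains "ask hn" but does not begin with it, so it does not match.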

In [3]:
# make 3 empty lists
ask_posts = []
show_posts = []
other_posts = []

# loop over hn and append each row to the list that matches its title
for row in hn:
    title = row[1]
    if title.lower().startswith("ask hn"):
        ask_posts.append(row)
    elif title.lower().startswith("show hn"):
        show_posts.append(row)
    else:
        other_posts.append(row)


In [4]:
print([len(ask_posts) , len(show_posts) , len(other_posts)])

[9139, 10158, 273822]
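
The same split can be sketched as a labelling function combined with `collections.Counter`, which is convenient when only the counts are needed. The function name `classify` and the sample titles are assumptions made for this illustration.

```python
from collections import Counter

def classify(title):
    """Label a title as 'ask', 'show', or 'other' by its prefix, ignoring case."""
    lowered = title.lower()
    if lowered.startswith("ask hn"):
        return "ask"
    if lowered.startswith("show hn"):
        return "show"
    return "other"

# Hypothetical titles for illustration
titles = [
    "Ask HN: How do you take notes?",
    "Show HN: A tiny static site generator",
    "ask hn: lowercase prefix still counts",
    "Some news story",
    "Tell HN: Something else",
]

counts = Counter(classify(t) for t in titles)
print([counts["ask"], counts["show"], counts["other"]])
```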


## Calculating the Average Number of Comments for Ask HN and Show HN Posts

In [5]:
# calculate ask_posts avg comments
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])
    
ask_avg = total_ask_comments / len(ask_posts)
print(ask_avg)

10.393478498741656


In [6]:
# calculate the avg comments for show posts
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])
    
show_avg = total_show_comments / len(show_posts)
print(show_avg)

4.886099625910612


From the two cells above, Ask HN posts receive more comments on average than Show HN posts (about 10.39 versus 4.89 per post).

We'll therefore focus the remaining analysis on the posts in the ask_posts list.
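
The two averaging cells repeat the same loop, so they could be folded into one helper. This is a minimal sketch: the name `average_comments`, its `comments_index` parameter, and the sample rows are hypothetical, and returning `0.0` for an empty list is a choice made here to avoid a division by zero.

```python
def average_comments(posts, comments_index=4):
    """Average the integer values found at comments_index across posts."""
    if not posts:
        return 0.0  # assumption: treat an empty list as an average of zero
    total = sum(int(row[comments_index]) for row in posts)
    return total / len(posts)

# Hypothetical rows in the same column order as the data set
sample_posts = [
    ["1", "Ask HN: A?", "", "2", "3", "alice", "9/26/2016 2:53"],
    ["2", "Ask HN: B?", "", "1", "5", "bob", "9/26/2016 3:10"],
]

print(average_comments(sample_posts))
print(average_comments([]))
```

In the notebook this would be called as `average_comments(ask_posts)` and `average_comments(show_posts)`.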

## Finding the Number of Ask Posts and Comments by Hour Created

In [7]:
# create a list of [created_at , num_comments] pairs
results_list = []

for row in ask_posts:
    created_at = row[6]
    comment = int(row[4])
    results_list.append([created_at , comment])

print(len(results_list)) # the same length as ask_posts

9139


#### Calculate the number of Ask HN posts created in each hour of the day, along with the number of comments received.


In [8]:
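import datetime as dt       # to parse the created_at timestamps

counts_by_hour = {}      # hour -> number of ask posts created in that hour
comments_by_hour = {}    # hour -> total comments received by those posts

for row in results_list:
    # row[0] is the timestamp string and row[1] is the comment count
    hour_dt = dt.datetime.strptime(row[0] , "%m/%d/%Y %H:%M")
    hour = hour_dt.hour
    if hour not in counts_by_hour:
        counts_by_hour[hour] = 1
        comments_by_hour[hour] = row[1]
    else:
        counts_by_hour[hour] += 1
        comments_by_hour[hour] += row[1]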

In [9]:
counts_by_hour 

{2: 269,
 1: 282,
 22: 383,
 21: 518,
 19: 552,
 17: 587,
 15: 646,
 14: 513,
 13: 444,
 11: 312,
 10: 282,
 9: 222,
 7: 226,
 3: 271,
 23: 343,
 20: 510,
 16: 579,
 8: 257,
 0: 301,
 18: 614,
 12: 342,
 4: 243,
 6: 234,
 5: 209}

In [10]:
comments_by_hour

{2: 2996,
 1: 2089,
 22: 3372,
 21: 4500,
 19: 3954,
 17: 5547,
 15: 18525,
 14: 4972,
 13: 7245,
 11: 2797,
 10: 3013,
 9: 1477,
 7: 1585,
 3: 2154,
 23: 2297,
 20: 4462,
 16: 4466,
 8: 2362,
 0: 2277,
 18: 4877,
 12: 4234,
 4: 2360,
 6: 1587,
 5: 1838}
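
The counting pattern used for the two dictionaries above can also be sketched with `collections.defaultdict`, which removes the need for the `if hour not in ...` branch. The three timestamp and comment pairs below are hypothetical and only follow the `"%m/%d/%Y %H:%M"` format used in the data set.

```python
import datetime as dt
from collections import defaultdict

# Hypothetical [created_at, num_comments] pairs
sample_results = [
    ["8/16/2016 9:55", 6],
    ["11/22/2015 13:43", 29],
    ["5/2/2016 9:14", 2],
]

counts = defaultdict(int)    # missing hours start at 0
comments = defaultdict(int)

for created_at, num_comments in sample_results:
    hour = dt.datetime.strptime(created_at, "%m/%d/%Y %H:%M").hour
    counts[hour] += 1
    comments[hour] += num_comments

print(dict(counts))
print(dict(comments))
```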


Calculate the average number of comments for posts created during each hour of the day.

In [12]:
# avg number of comments for posts per hour
avg_by_hour = []

for hr in counts_by_hour:
    avg_by_hour.append([hr , comments_by_hour[hr] / counts_by_hour[hr]])
    
avg_by_hour

[[2, 11.137546468401487],
 [1, 7.407801418439717],
 [22, 8.804177545691905],
 [21, 8.687258687258687],
 [19, 7.163043478260869],
 [17, 9.449744463373083],
 [15, 28.676470588235293],
 [14, 9.692007797270955],
 [13, 16.31756756756757],
 [11, 8.96474358974359],
 [10, 10.684397163120567],
 [9, 6.653153153153153],
 [7, 7.013274336283186],
 [3, 7.948339483394834],
 [23, 6.696793002915452],
 [20, 8.749019607843136],
 [16, 7.713298791018998],
 [8, 9.190661478599221],
 [0, 7.5647840531561465],
 [18, 7.94299674267101],
 [12, 12.380116959064328],
 [4, 9.7119341563786],
 [6, 6.782051282051282],
 [5, 8.794258373205741]]
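
The per-hour average is simply total comments divided by post count for the same hour. As a hedged alternative to the list of lists, the sketch below uses a dict comprehension on three of the hours shown above (15, 13 and 9); the variable name `avg_comments` is introduced here for illustration.

```python
# Three hours copied from the counts_by_hour and comments_by_hour outputs above
counts_by_hour = {15: 646, 13: 444, 9: 222}
comments_by_hour = {15: 18525, 13: 7245, 9: 1477}

# hour -> average comments per post for that hour
avg_comments = {
    hr: comments_by_hour[hr] / counts_by_hour[hr]
    for hr in counts_by_hour
}

for hr, avg in avg_comments.items():
    print(hr, round(avg, 2))
```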

### Sorting and Printing Values from a List of Lists

In [13]:
# create a copy of avg_by_hour with the columns swapped, so the average comes first
swaped_list = []
for row in avg_by_hour:
    swaped_list.append([row[1] , row[0]])
    
print(swaped_list)

[[11.137546468401487, 2], [7.407801418439717, 1], [8.804177545691905, 22], [8.687258687258687, 21], [7.163043478260869, 19], [9.449744463373083, 17], [28.676470588235293, 15], [9.692007797270955, 14], [16.31756756756757, 13], [8.96474358974359, 11], [10.684397163120567, 10], [6.653153153153153, 9], [7.013274336283186, 7], [7.948339483394834, 3], [6.696793002915452, 23], [8.749019607843136, 20], [7.713298791018998, 16], [9.190661478599221, 8], [7.5647840531561465, 0], [7.94299674267101, 18], [12.380116959064328, 12], [9.7119341563786, 4], [6.782051282051282, 6], [8.794258373205741, 5]]


In [14]:
# sort the swaped_list
sorted_swap = sorted(swaped_list , reverse=True)
sorted_swap

[[28.676470588235293, 15],
 [16.31756756756757, 13],
 [12.380116959064328, 12],
 [11.137546468401487, 2],
 [10.684397163120567, 10],
 [9.7119341563786, 4],
 [9.692007797270955, 14],
 [9.449744463373083, 17],
 [9.190661478599221, 8],
 [8.96474358974359, 11],
 [8.804177545691905, 22],
 [8.794258373205741, 5],
 [8.749019607843136, 20],
 [8.687258687258687, 21],
 [7.948339483394834, 3],
 [7.94299674267101, 18],
 [7.713298791018998, 16],
 [7.5647840531561465, 0],
 [7.407801418439717, 1],
 [7.163043478260869, 19],
 [7.013274336283186, 7],
 [6.782051282051282, 6],
 [6.696793002915452, 23],
 [6.653153153153153, 9]]
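
Swapping the columns works because `sorted` compares lists element by element. An alternative sketch, assuming the same `[hour, average]` layout as `avg_by_hour`, is to pass a `key` function so the original column order is kept. Only four of the rows above are reproduced here, and the name `ranked` is introduced for illustration.

```python
# Four [hour, average] rows copied from the avg_by_hour output above
avg_by_hour = [
    [2, 11.137546468401487],
    [15, 28.676470588235293],
    [13, 16.31756756756757],
    [9, 6.653153153153153],
]

# Sort on the second element (the average), highest first
ranked = sorted(avg_by_hour, key=lambda pair: pair[1], reverse=True)

for hr, avg in ranked[:3]:
    print("{:02d}:00 {:.2f} average comments per post".format(hr, avg))
```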

In [15]:
print("Top 5 Hours for Ask Posts Comments")

# no datetime parsing is needed here because the hours were already stored as integers
for row in sorted_swap[:5]:
    print("{}: {:.2f} average comments per post".format(row[1] , row[0]))

Top 5 Hours for Ask Posts Comments
15: 28.68 average comments per post
13: 16.32 average comments per post
12: 12.38 average comments per post
2: 11.14 average comments per post
10: 10.68 average comments per post



From the results above, Ask HN posts created at 15:00 have the highest average number of comments.
The time zone is Eastern Time in the US, so that is 3:00 pm EST.
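
Converting the 24-hour values to the 12-hour form used in the text is simple arithmetic. The helper below is a minimal sketch; the name `to_12_hour` and its output format are assumptions made for this example.

```python
def to_12_hour(hour):
    """Convert an hour from 0-23 into a 12-hour label such as '3 pm'."""
    suffix = "am" if hour < 12 else "pm"
    display = hour % 12 or 12   # 0 and 12 are both shown as 12
    return "{} {}".format(display, suffix)

for hr in [15, 13, 12, 2, 0]:
    print("{:02d}:00 -> {}".format(hr, to_12_hour(hr)))
```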


In [16]:
print("lowest 5 hours for ask posts comments")

for avg , hr in sorted_swap[-5:]:
    hr = str(hr)
    hr = dt.datetime.strptime(hr , "%H").strftime("%H:%M")
    print("{} :  {:.2f} average comments per post".format(hr , avg) )

lowest 5 hours for ask posts comments
19:00 :  7.16 average comments per post
07:00 :  7.01 average comments per post
06:00 :  6.78 average comments per post
23:00 :  6.70 average comments per post
09:00 :  6.65 average comments per post


From the result above, Ask HN posts created at 09:00 receive the fewest comments on average.


### Conclusion
In this project, we analyzed ask posts and show posts to determine which type of post and which time of day receive the most comments on average. Based on our analysis, to maximize the number of comments a post receives, we'd recommend that the post be an ask post created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST).

However, it should be noted that the data set we analyzed excluded posts without any comments. Given that, it's more accurate to say that, of the posts that received comments, ask posts received more comments on average, and ask posts created between 15:00 and 16:00 (3:00 pm EST - 4:00 pm EST) received the most comments on average.