<h1>Bucketing time</h1>

<h4>The file "sample_data.csv" contains start times and processing times for all complaints registered with New York City's 311 complaint hotline on 01/01/2016. Our goal is to compute the average processing time for each hourly bucket.</h4>


<h4>Let's take a quick look at the data</h4>

In [35]:
#Unfortunately, this won't work on Windows.
!head sample_data.csv

2016-01-01 00:00:09,0.0815162037037037
2016-01-01 00:00:40,0.1334837962962963
2016-01-01 00:01:09,20.388726851851853
2016-01-01 00:02:59,0.9811458333333334
2016-01-01 00:03:03,7.048576388888889
2016-01-01 00:03:03,0.1400810185185185
2016-01-01 00:03:29,0.11086805555555555
2016-01-01 00:04:06,0.016967592592592593
2016-01-01 00:04:37,0.1597222222222222
2016-01-01 00:04:56,2.996585648148148
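
Because `!head` relies on a Unix shell command, a portable alternative is to read the first few lines in Python itself. A minimal sketch, assuming a helper named `head_lines` (a hypothetical name, not part of the original notebook):

```python
from itertools import islice

def head_lines(filename, n=10):
    """Return the first n lines of a text file, without trailing newlines."""
    with open(filename, 'r') as f:
        # islice stops after n lines, so the rest of the file is never read
        return [line.rstrip('\n') for line in islice(f, n)]

# Usage (assumes sample_data.csv is in the working directory):
# for line in head_lines('sample_data.csv'):
#     print(line)
```

This should behave the same on Windows, macOS, and Linux, since it only uses the standard library.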


<h3>Step 1: Read the data</h3>

In [36]:
data_tuples = list()
with open('sample_data.csv','r') as f:
    for line in f:
        data_tuples.append(line.strip().split(','))

<h4>Let's look at the first 10 lines</h4>

In [37]:
data_tuples[0:10]

[['2016-01-01 00:00:09', '0.0815162037037037'],
 ['2016-01-01 00:00:40', '0.1334837962962963'],
 ['2016-01-01 00:01:09', '20.388726851851853'],
 ['2016-01-01 00:02:59', '0.9811458333333334'],
 ['2016-01-01 00:03:03', '7.048576388888889'],
 ['2016-01-01 00:03:03', '0.1400810185185185'],
 ['2016-01-01 00:03:29', '0.11086805555555555'],
 ['2016-01-01 00:04:06', '0.016967592592592593'],
 ['2016-01-01 00:04:37', '0.1597222222222222'],
 ['2016-01-01 00:04:56', '2.996585648148148']]

<li><b>Element 1 of each row is a date and time stored as a string</b></li>
<li>Element 2 is a float stored as a string</li>
<li>Let's convert them</li>

In [38]:
#Figure out the format string
# http://pubs.opengroup.org/onlinepubs/009695399/functions/strptime.html 
import datetime
x='2016-01-01 00:00:09'
format_str = "%Y-%m-%d %H:%M:%S"
datetime.datetime.strptime(x,format_str)

datetime.datetime(2016, 1, 1, 0, 0, 9)
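
Real-world files sometimes contain blank or malformed lines, and `strptime` raises `ValueError` when the input does not match the format string. A minimal sketch of a defensive row parser, assuming a hypothetical helper named `parse_row` and a module-level constant `FORMAT_STR` (neither is in the original notebook):

```python
import datetime

FORMAT_STR = "%Y-%m-%d %H:%M:%S"

def parse_row(line):
    """Parse 'timestamp,value' into (datetime, float); return None if malformed."""
    parts = line.strip().split(',')
    if len(parts) != 2:
        return None
    try:
        when = datetime.datetime.strptime(parts[0], FORMAT_STR)
        value = float(parts[1])
    except ValueError:
        # Raised by strptime on a bad timestamp or by float on a bad number
        return None
    return (when, value)
```

Returning `None` for bad rows is only one possible design; raising an error with the offending line number may be preferable when silent data loss is a concern.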

In [39]:
data_tuples = list()
with open('sample_data.csv','r') as f:
    for line in f:
        data_tuples.append(line.strip().split(','))
#datetime and format_str were defined in the previous cell
for row in data_tuples:
    row[0] = datetime.datetime.strptime(row[0],format_str)  #string -> datetime
    row[1] = float(row[1])  #string -> float

In [40]:
#Let's see if this worked
data_tuples[0:10]

[[datetime.datetime(2016, 1, 1, 0, 0, 9), 0.0815162037037037],
 [datetime.datetime(2016, 1, 1, 0, 0, 40), 0.1334837962962963],
 [datetime.datetime(2016, 1, 1, 0, 1, 9), 20.388726851851853],
 [datetime.datetime(2016, 1, 1, 0, 2, 59), 0.9811458333333334],
 [datetime.datetime(2016, 1, 1, 0, 3, 3), 7.048576388888889],
 [datetime.datetime(2016, 1, 1, 0, 3, 3), 0.1400810185185185],
 [datetime.datetime(2016, 1, 1, 0, 3, 29), 0.11086805555555555],
 [datetime.datetime(2016, 1, 1, 0, 4, 6), 0.016967592592592593],
 [datetime.datetime(2016, 1, 1, 0, 4, 37), 0.1597222222222222],
 [datetime.datetime(2016, 1, 1, 0, 4, 56), 2.996585648148148]]

<h4>We can replace each datetime with its hourly bucket</h4>

In [41]:
#Extract the hour from a datetime object
x=data_tuples[0][0]
x.hour

0

<h4>Use a list comprehension to bucket the data</h4>

In [42]:
data_tuples = [(x[0].hour,x[1]) for x in data_tuples]

In [43]:
data_tuples[0:10]

[(0, 0.0815162037037037),
 (0, 0.1334837962962963),
 (0, 20.388726851851853),
 (0, 0.9811458333333334),
 (0, 7.048576388888889),
 (0, 0.1400810185185185),
 (0, 0.11086805555555555),
 (0, 0.016967592592592593),
 (0, 0.1597222222222222),
 (0, 2.996585648148148)]

In [44]:
#Rebuild data_tuples with full datetime objects
#(the list comprehension above replaced them with hour buckets)
data_tuples = list()
with open('sample_data.csv','r') as f:
    for line in f:
        data_tuples.append(line.strip().split(','))
import datetime
for i in range(0,len(data_tuples)):
    data_tuples[i][0] = datetime.datetime.strptime(data_tuples[i][0],format_str)
    data_tuples[i][1] = float(data_tuples[i][1])


<h3>Create a function that returns the data</h3>

In [22]:
def get_data(filename):
    #Return a list of (hour, processing_time) tuples read from filename
    data_tuples = list()
    with open(filename,'r') as f:
        for line in f:
            data_tuples.append(line.strip().split(','))
    import datetime
    format_str = "%Y-%m-%d %H:%M:%S"
    #Replace each timestamp by its hour and convert the processing time to float
    data_tuples = [(datetime.datetime.strptime(x[0],format_str).hour,float(x[1])) for x in data_tuples]
    return data_tuples

In [23]:
get_data('sample_data.csv')

[(0, 0.0815162037037037),
 (0, 0.1334837962962963),
 (0, 20.388726851851853),
 (0, 0.9811458333333334),
 (0, 7.048576388888889),
 (0, 0.1400810185185185),
 (0, 0.11086805555555555),
 (0, 0.016967592592592593),
 (0, 0.1597222222222222),
 (0, 2.996585648148148),
 (0, 0.06299768518518518),
 (0, 0.059479166666666666),
 (0, 0.003460648148148148),
 (0, 0.22096064814814814),
 (0, 0.10398148148148148),
 (0, 0.2975231481481482),
 (0, 0.09293981481481481),
 (0, 0.016446759259259258),
 (0, 0.06824074074074074),
 (0, 0.04800925925925926),
 (0, 0.26761574074074074),
 (0, 1.4127662037037036),
 (0, 0.4363078703703704),
 (0, 0.869375),
 (0, 0.05337962962962963),
 (0, 0.6558333333333334),
 (0, 0.3119560185185185),
 (0, 3.580636574074074),
 (0, 0.1267939814814815),
 (0, 5.040613425925926),
 (0, 0.022662037037037036),
 (0, 0.31908564814814816),
 (0, 1.001412037037037),
 (0, 0.3957407407407407),
 (0, 0.01945601851851852),
 (0, 0.1460300925925926),
 (0, 0.6539351851851852),
 (0, 0.027731481481481482),
 (0,
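
The same reader can also be written with the standard library's `csv` module, which handles the splitting and tolerates empty lines. A minimal sketch under the same two-column assumption; the name `get_data_csv` is hypothetical and chosen so it does not shadow `get_data`:

```python
import csv
import datetime

def get_data_csv(filename):
    """Return a list of (hour, processing_time) tuples read from a CSV file."""
    format_str = "%Y-%m-%d %H:%M:%S"
    with open(filename, 'r', newline='') as f:
        return [
            (datetime.datetime.strptime(row[0], format_str).hour, float(row[1]))
            for row in csv.reader(f)
            if row  # csv.reader yields [] for completely empty lines
        ]
```

For this file the manual `split(',')` works just as well; `csv.reader` mainly helps if a field could ever contain a quoted comma.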

<h3>Step 2: Accumulate counts and sums for each bucket</h3>

In [24]:
#buckets maps each hour to [count, sum of processing times]
buckets = dict()
for item in get_data('sample_data.csv'):
    if item[0] in buckets:
        buckets[item[0]][0] += 1
        buckets[item[0]][1] += item[1]
    else:
        buckets[item[0]] = [1,item[1]]


In [25]:
buckets

{0: [241, 158.34932870370375],
 1: [340, 1006.8582291666668],
 2: [199, 464.6581249999997],
 3: [221, 681.5493865740739],
 4: [157, 732.1197337962964],
 5: [112, 285.60615740740764],
 6: [80, 427.54798611111124],
 7: [71, 183.4966435185184],
 8: [99, 601.1727546296297],
 9: [132, 1130.5627199074067],
 10: [137, 1735.9673726851845],
 11: [182, 1074.1009490740735],
 12: [168, 2295.5562731481473],
 13: [195, 1675.7310300925922],
 14: [185, 1498.5249999999999],
 15: [193, 2465.890451388889],
 16: [204, 2232.515092592593],
 17: [211, 1399.851180555556],
 18: [182, 1333.1421180555558],
 19: [165, 1501.3013541666667],
 20: [158, 821.5105439814813],
 21: [161, 763.653865740741],
 22: [218, 1841.9319444444443],
 23: [210, 1088.8371064814814]}
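
The if/else accumulation can be shortened with `collections.defaultdict`, which creates the `[count, sum]` entry the first time an hour is seen. A minimal sketch, assuming a hypothetical helper named `accumulate` that takes any iterable of `(hour, value)` pairs:

```python
from collections import defaultdict

def accumulate(pairs):
    """Map each hour to [count, total] over an iterable of (hour, value) pairs."""
    buckets = defaultdict(lambda: [0, 0.0])
    for hour, value in pairs:
        buckets[hour][0] += 1      # number of complaints in this bucket
        buckets[hour][1] += value  # summed processing time in this bucket
    return dict(buckets)

# Usage with the notebook's reader (assumes get_data is defined as above):
# buckets = accumulate(get_data('sample_data.csv'))
```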

<h3>Let's print them to see what sort of pattern the data shows</h3>
<h4>Bear in mind that this is just one day's data!</h4>

In [26]:
for key,value in buckets.items():
    print("Hour:",key,"\tAverage:",value[1]/value[0])

Hour: 0 	Average: 0.6570511564469035
Hour: 1 	Average: 2.9613477328431377
Hour: 2 	Average: 2.334965452261305
Hour: 3 	Average: 3.0839338759007866
Hour: 4 	Average: 4.663183017810805
Hour: 5 	Average: 2.550054976851854
Hour: 6 	Average: 5.344349826388891
Hour: 7 	Average: 2.5844597678664565
Hour: 8 	Average: 6.0724520669659565
Hour: 9 	Average: 8.564869090207626
Hour: 10 	Average: 12.671294691132733
Hour: 11 	Average: 5.901653566341063
Hour: 12 	Average: 13.66402543540564
Hour: 13 	Average: 8.593492462013293
Hour: 14 	Average: 8.100135135135135
Hour: 15 	Average: 12.776634463154863
Hour: 16 	Average: 10.943701434277418
Hour: 17 	Average: 6.634365784623489
Hour: 18 	Average: 7.324956692612944
Hour: 19 	Average: 9.098796085858586
Hour: 20 	Average: 5.199433822667603
Hour: 21 	Average: 4.74319171267541
Hour: 22 	Average: 8.449229102956167
Hour: 23 	Average: 5.184938602292768
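
In Python 3.7 and later a dict iterates in insertion order, so the hours above appear in order presumably because the file itself is sorted by time. For input that is not sorted, the hours can be ordered explicitly. A minimal sketch, assuming a hypothetical helper named `bucket_averages` and made-up example numbers:

```python
def bucket_averages(buckets):
    """Return (hour, average) pairs sorted by hour from {hour: [count, total]}."""
    return [(hour, total / count) for hour, (count, total) in sorted(buckets.items())]

# Example with made-up numbers (not taken from sample_data.csv)
example = {13: [2, 5.0], 0: [4, 2.0]}
for hour, average in bucket_averages(example):
    print(f"Hour: {hour:2d}\tAverage: {average:.2f}")
```

Rounding to two decimals in the format string only affects the display; the returned averages keep full precision.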


<h3>Put everything into a function</h3>
<h4>This way, we can easily test other similar datasets</h4>

In [27]:
def get_hour_bucket_averages(filename):
    #Return a list of (hour, average processing time) tuples for the file
    def get_data(filename):
        #Read the file into a list of (hour, processing_time) tuples
        data_tuples = list()
        with open(filename,'r') as f:
            for line in f:
                data_tuples.append(line.strip().split(','))
        import datetime
        format_str = "%Y-%m-%d %H:%M:%S"
        data_tuples = [(datetime.datetime.strptime(x[0],format_str).hour,float(x[1])) for x in data_tuples]
        return data_tuples
    #Accumulate [count, sum] for each hour bucket
    buckets = dict()
    for item in get_data(filename):
        if item[0] in buckets:
            buckets[item[0]][0] += 1
            buckets[item[0]][1] += item[1]
        else:
            buckets[item[0]] = [1,item[1]]
    #Average = sum / count
    return [(key,value[1]/value[0]) for key,value in buckets.items()]


In [28]:
get_hour_bucket_averages('sample_data.csv')

[(0, 0.6570511564469035),
 (1, 2.9613477328431377),
 (2, 2.334965452261305),
 (3, 3.0839338759007866),
 (4, 4.663183017810805),
 (5, 2.550054976851854),
 (6, 5.344349826388891),
 (7, 2.5844597678664565),
 (8, 6.0724520669659565),
 (9, 8.564869090207626),
 (10, 12.671294691132733),
 (11, 5.901653566341063),
 (12, 13.66402543540564),
 (13, 8.593492462013293),
 (14, 8.100135135135135),
 (15, 12.776634463154863),
 (16, 10.943701434277418),
 (17, 6.634365784623489),
 (18, 7.324956692612944),
 (19, 9.098796085858586),
 (20, 5.199433822667603),
 (21, 4.74319171267541),
 (22, 8.449229102956167),
 (23, 5.184938602292768)]

<h3>The file all_data.csv contains data from January to September 2016</h3>
<h4>We can test whether our one-day result holds more generally</h4>

In [29]:
get_hour_bucket_averages('all_data.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'all_data.csv'
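
A `FileNotFoundError` like this one usually means the file is not in the notebook's working directory. A minimal sketch of a check that can be run before calling the function, assuming a hypothetical helper named `find_missing`:

```python
import os

def find_missing(filenames):
    """Return the subset of filenames that do not exist as regular files."""
    return [name for name in filenames if not os.path.isfile(name)]

missing = find_missing(['sample_data.csv', 'all_data.csv'])
if missing:
    # Show where Python is looking, which is often the real surprise
    print("Missing files:", missing, "in", os.getcwd())
```

If a file is reported missing, either copy it into the directory printed above or pass its full path to `get_hour_bucket_averages`.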

In [46]:
def remove_punctuation(word):
    punctuations = ['.', '!', '?', ',', '(', ')']
    for punctuation in punctuations:
        if punctuation in word:
            print(punctuation)  #show which character is being removed
            #str.replace returns a new string, so the result must be reassigned
            word = word.replace(punctuation, '')
    return word
remove_punctuation("sis!")

!


'sis'
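
Python strings are immutable, so methods such as `str.replace` return a new string that has to be assigned or returned. An alternative that removes all of the listed characters in one pass is `str.translate`. A minimal sketch, assuming the hypothetical names `PUNCTUATION_TABLE` and `strip_punctuation`:

```python
# Translation table that deletes each listed punctuation character
PUNCTUATION_TABLE = str.maketrans('', '', '.!?,()')

def strip_punctuation(word):
    """Return word with the characters . ! ? , ( ) removed."""
    return word.translate(PUNCTUATION_TABLE)

print(strip_punctuation("sis!"))  # prints: sis
```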