# Python MapReduce TAXI rides

In the notebook, you should create a map-reduce program that count the number of occurrence of each word.

In this exercise, hadoop runs in standalone mode and reads data from the local filesystem.


### Download the dataset 

In [118]:
!wget -O Taxi_small.csv https://www.dropbox.com/s/mi1el58o88hd5u8/Taxi_Trips_151MB.csv?dl=0

--2020-11-01 17:13:21--  https://www.dropbox.com/s/mi1el58o88hd5u8/Taxi_Trips_151MB.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.68.1, 162.125.68.1, 2620:100:6024:1::a27d:4401, ...
Connecting to www.dropbox.com (www.dropbox.com)|162.125.68.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/mi1el58o88hd5u8/Taxi_Trips_151MB.csv [following]
--2020-11-01 17:13:22--  https://www.dropbox.com/s/raw/mi1el58o88hd5u8/Taxi_Trips_151MB.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc3359623093a79923bd0b7d59cb.dl.dropboxusercontent.com/cd/0/inline/BCaWEvFF-fE3YePphA4AoRZ9yJx6sMC9ffzjyCkcUpEcvKrAwR3f9Y5ZKFy_vbeqKfnNVwuPP_mm7AjEDqVT9HZDDCdJrE91T2FNA-p_hhJnzk7jVZOkcA6mQyfNks2dlYw/file# [following]
--2020-11-01 17:13:22--  https://uc3359623093a79923bd0b7d59cb.dl.dropboxusercontent.com/cd/0/inline/BCaWEvFF-fE3YePphA4AoRZ9yJx6sMC9ffzjyCkcUpEcvKrAwR3f9Y5ZKFy_vbe

## 1 - How many trips were started in each year present in the data set?


### Mapper
Complete with the code for the mapper.

In [119]:

%%file mapper_taxi1.py
#!/usr/bin/env python

# import sys
import sys
import string

for line in sys.stdin:
    # split the line into features
    
    line_len=len(line)
    print(line_len)
    
    
    data = line.split(";")
    time = data[2]
    
    if "AM" in time or "PM" in time: 
        year = time.split(" ")[0]
        year = year.split("/")[2]
    else:
        date_obj = dt.strptime(time, '%m/%d/%Y""%H:%M:%S')  #change to different time format     
        time = dt.strftime(date_obj, '%m/%d/%Y""%I:%M:%S""%p')
        year = time.split(" ")[0]
        year = year.split("/")[2]
    print('%s\t%s' % (year,"1"))

Overwriting mapper_taxi1.py


### Reducer

In [124]:

%%file reducer_taxi1.py
#!/usr/bin/python

import sys

last_year = ''
count_trips = 0
counter = 0
# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    counter = counter + 1

    # parse the input we got from mapper.py
    year, count = line.split('\t', 1)

    # convert count (currently a string) to int
    count = int(count)
    
    if last_year != year:
        if last_year != '':
            print ("%s\t%s" % (last_year, count_trips))
        last_year = year
        
        count_trips = count
    else:
        count_trips += count

        
print ("%s\t%s" % (last_year, count_trips))

print("my all rows are" +str(counter))

Overwriting reducer_taxi1.py


### Hadoop standalone mode execution


The output directory needs to be cleared...

In [125]:
rm -rf results_1

#### Submitting the job

The _hadoop_ command is used to submit the mapreduce job to the cluster...

In [126]:
%%time
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_taxi1.py,reducer_taxi1.py -mapper mapper_taxi1.py -reducer reducer_taxi1.py -input Taxi_small.csv -output results_1

2020-11-01 17:16:01,090 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-11-01 17:16:01,203 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-11-01 17:16:01,203 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-11-01 17:16:01,223 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-11-01 17:16:01,424 INFO mapred.FileInputFormat: Total input files to process : 1
2020-11-01 17:16:01,453 INFO mapreduce.JobSubmitter: number of splits:5
2020-11-01 17:16:01,659 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1247332117_0001
2020-11-01 17:16:01,659 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-11-01 17:16:02,002 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_taxi1.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1247332117_0001_597ed689-a9f7-4a0d-a7fe-ab54863d0714/mapper_taxi1.py
2020-11-01 17:16:02,024 INFO mapred.

#### Checking the results
The result is stored in directory results.

In [127]:
!cat results_1/part-*

cat: 'results_1/part-*': No such file or directory


## 2 - For each of the 24 hours of the day, how many taxi trips there were, what was their average trip miles and trip total cost?


### Mapper
Complete with the code for the mapper.

In [252]:
%%file mapper_taxi2.py
#!/usr/bin/env python

# import sys
import sys
# import string library function  
import string  

from datetime import datetime as dt

for line in sys.stdin:
    data = line.split(";")
    time = data[2]
    #print("current " + str(line))
    if "AM" in time or "PM" in time: 
        time = time[11:]
        time = time [9:]+ time[:2]
    else:
        date_obj = dt.strptime(time, '%m/%d/%Y %H:%M:%S')  #change to different time format     
        time = dt.strftime(date_obj, '%m/%d/%Y %I:%M:%S %p')
        time = time[11:]
        time = time [9:]+ time[:2]
        
    miles = data[5]
    cost = data[14]
    
    print('%s\t%s\t%s\t%s' % (time,miles,cost,"1"))


Overwriting mapper_taxi2.py


### Reducer

In [253]:
%%file reducer_taxi2.py
#!/usr/bin/python

import sys

last_time = ''
count_trips = 0
avg_trip_cost = 0
sum_words = 0
sum_miles = 0
avg_miles = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    
    #hh = len(line.split('\t')) #sanity check
    #print(line)
    #if hh < 4:
        #print(line)
        
    time, miles, cost, count = line.split('\t')


    #convert count (currently a string) to int

    count = int(count)
    
    #print("current" + str(cost))
    
    if cost !="":
        cost = float(cost.replace(",","")) #converting to float and changing comma to no comma
        
    else:
        cost = 0
        
    if miles !="":    
        miles = float(miles.replace(",",""))
       
    else:
        miles = 0
        
                      
    if last_time != time:
        if last_time != '':
            print ("%s\t%s\t%s\t%s" % (last_time, count_trips, format(avg_trip_cost,".2f"), format(avg_miles,".2f")))
        
        last_time = time
        
        count_trips = count #count trips
        
        sum_words = cost #count cost 
        avg_trip_cost = sum_words / count_trips #count cost average
        
        
        sum_miles = miles  #count miles 
        avg_miles = sum_miles/count_trips       
        
        
    else:
        
        count_trips =count_trips + count
        
        sum_words = sum_words + cost #count cost 
        avg_trip_cost = sum_words / count_trips #count cost average
        
        sum_miles = sum_miles + miles
        avg_miles = sum_miles/count_trips       
                  
        
print ("%s\t%s\t%s\t%s" % (last_time, count_trips, format(avg_trip_cost,".2f"), format(avg_miles,".2f")))

Overwriting reducer_taxi2.py


### Hadoop standalone mode execution


The output directory needs to be cleared...

In [254]:
rm -rf results_taxi2

#### Submitting the job

The _hadoop_ command is used to submit the mapreduce job to the cluster...

In [255]:
%%time
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_taxi2.py,reducer_taxi2.py -mapper mapper_taxi2.py -reducer reducer_taxi2.py -input Taxi_small.csv -output results_taxi2

2020-11-02 09:20:51,909 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-11-02 09:20:52,029 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-11-02 09:20:52,029 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-11-02 09:20:52,059 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-11-02 09:20:52,282 INFO mapred.FileInputFormat: Total input files to process : 1
2020-11-02 09:20:52,314 INFO mapreduce.JobSubmitter: number of splits:5
2020-11-02 09:20:52,521 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local440046538_0001
2020-11-02 09:20:52,522 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-11-02 09:20:52,926 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_taxi2.py as file:/tmp/hadoop-jovyan/mapred/local/job_local440046538_0001_4e6ea575-ad56-4577-8560-f6815dcb021f/mapper_taxi2.py
2020-11-02 09:20:52,955 INFO mapred.Lo

#### Checking the results
The result is stored in directory results.

In [256]:
!cat results_taxi2/part-*

AM01	11166	13.73	2.46
AM02	8832	12.71	2.35
AM03	6594	12.96	2.48
AM04	4604	15.19	3.80
AM05	4087	20.94	5.94
AM06	5629	19.93	5.57
AM07	10145	17.09	3.98
AM08	15695	13.96	2.98
AM09	18248	14.47	3.23
AM10	17777	15.34	3.30
AM11	18622	16.48	3.56
AM12	13544	14.45	2.83
PM01	20181	15.79	3.38
PM02	20039	16.10	3.56
PM03	20708	17.38	3.59
PM04	21714	16.87	3.34
PM05	23639	15.20	3.09
PM06	25446	14.62	2.94
PM07	25402	15.26	2.94
PM08	22222	15.99	3.18
PM09	19786	15.95	3.40
PM10	18492	15.29	3.11
PM11	16303	15.00	3.07
PM12	19875	16.04	3.38


## 3 For each of the 24 hours of the day, which are the (up to) 5 most popular routes (pairs pickup/dropoff regions) according to the the total number of taxi trips? Also reportand the average fare (total trip cost).



### Mapper 

In [371]:
%%file mapper_taxi3.py
#!/usr/bin/env python

# import sys
import sys

# import string library function  
import string  

from datetime import datetime as dt

for line in sys.stdin:
    
    timestamp = line.split(";")[2]
    
    if "AM" in timestamp or "PM" in timestamp: 
        tz = timestamp.split(" ")[2]
        hour = timestamp.split(" ")[1].split(":")[0] 
    else:
        date_obj = dt.strptime(timestamp, '%m/%d/%Y %H:%M:%S')  #change to different time format     
        timestamp = dt.strftime(date_obj, '%m/%d/%Y %I:%M:%S %p')
        tz = timestamp.split(" ")[2]
        hour = timestamp.split(" ")[1].split(":")[0]
        
    trip = line.split(";")
    loc_in = trip[6]
    loc_fi = trip [7]
    miles = trip[5]
    cost = trip[14]
    
    if miles != "" and cost != "" and loc_in != "" and loc_fi != "": #Filter out values that are missing
        
        print('%s\t%s\t%s\t%s\t%s\t%s' % (loc_in+":"+loc_fi+"-"+hour+tz,hour,tz,miles,cost,"1"))
       


Overwriting mapper_taxi3.py


### Reducer

In [372]:
%%file reducer_taxi3.py
#!/usr/bin/python

import sys

last_location = ''
last_time = ''
count_trips = 0
avg_trip_cost = 0
sum_words = 0
sum_miles = 0
avg_miles = 0
last_location_key = ''

line_count = 0
# input comes from STDIN
for line in sys.stdin:
    
    line_count = line_count + 1
    
    # remove leading and trailing whitespace
    line = line.strip()
    
    #hh = len(line.split('\t')) #sanity check
   # print(line)
    #if hh < 4:
        #print(line)
        
    location_in, time,tz, miles, cost, count = line.split('\t')


    #convert count (currently a string) to int

    count = int(count)
    
    #print("current" + str(cost))
    
    if cost !="":
        cost = float(cost.replace(",","")) #converting to float and changing comma to no comma
        
    else:
        cost = 0
        
    if miles !="":    
        miles = float(miles.replace(",",""))
    else:
        miles = 0
    
     
   # if(line_count == 177744 or line_count == 177745 or line_count == 177746): #sanity check
   #     print(str(line) +" " + str(line_count))
        
   # if(location_in == "17031831100:1703183110003PM"): #sanity check if sorted properly
   #     print(line)
   #     print(loc)
   #     print(line_count)
   
    if last_location_key != location_in:
        if last_location_key != '':            
            print ("%s\t%s\t%s\t%s\t%s" % (last_location, last_time, count_trips, format(avg_trip_cost,".2f"),format(avg_miles,".2f")))
          #print(last_location_key)
           # a = 1
        
       # if(location_in == "17031831100:1703183110003PM"):
        #    print(count_trips)
        location, time = location_in.split('-')
        last_location = location
        last_location_key = location_in
        last_time = time
        count_trips = count #count trips

        sum_words = cost #count cost 
        avg_trip_cost = sum_words / count_trips #count cost average


        sum_miles = miles  #count miles 
        avg_miles = sum_miles/count_trips       

        
    else:
        #if(location_in == "17031831100:1703183110003PM"):
        #    print(count_trips)
        count_trips =count_trips + count
        sum_words = sum_words + cost #count cost 
        avg_trip_cost = sum_words / count_trips #count cost average
        sum_miles = sum_miles + miles
        avg_miles = sum_miles/count_trips       

print ("%s\t%s\t%s\t%s\t%s" % (last_location, last_time, count_trips, format(avg_trip_cost,".2f"),format(avg_miles,".2f")))

Overwriting reducer_taxi3.py


In [373]:
rm -rf results_taxi3

In [374]:
%%time
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_taxi3.py,reducer_taxi3.py -mapper mapper_taxi3.py -reducer reducer_taxi3.py -input Taxi_small.csv -output results_taxi3

2020-11-02 12:01:54,590 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-11-02 12:01:54,686 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-11-02 12:01:54,686 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-11-02 12:01:54,717 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-11-02 12:01:55,056 INFO mapred.FileInputFormat: Total input files to process : 1
2020-11-02 12:01:55,085 INFO mapreduce.JobSubmitter: number of splits:5
2020-11-02 12:01:55,284 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1591750888_0001
2020-11-02 12:01:55,284 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-11-02 12:01:55,578 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_taxi3.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1591750888_0001_09122ff7-7ee0-4697-a0ec-0f9f783609d3/mapper_taxi3.py
2020-11-02 12:01:55,596 INFO mapred.

In [365]:
!cat results_taxi3/part-*

17031010100:17031010100	11AM	2	6.62	0.18
17031010100:17031980100	05AM	1	65.10	22.10
17031010201:17031010201	09AM	1	3.25	0.00
17031010300:17031010202	10AM	1	5.50	0.59
17031010300:17031010300	11AM	1	9.05	2.40
17031010300:17031010600	12AM	1	6.45	1.20
17031010300:17031281900	03PM	1	41.11	11.80
17031010400:17031010400	03PM	1	3.25	0.00
17031010400:17031010400	06PM	1	3.25	0.00
17031010501:17031010501	02PM	1	3.25	0.00
17031010501:17031809900	10AM	1	7.80	2.10
17031010501:17031839100	08PM	1	25.25	0.60
17031010502:17031839000	07AM	1	30.00	11.00
17031010503:17031030200	03AM	1	6.45	0.70
17031010503:17031061100	01AM	1	13.45	0.00
17031010503:17031063100	09PM	1	15.25	5.40
17031010503:17031070101	08PM	1	22.81	6.09
17031010503:17031071200	07PM	1	19.05	6.90
17031010503:17031838100	06PM	1	28.45	11.60
17031010503:17031980000	03PM	2	38.55	14.35
17031010600:17031030800	12AM	1	8.85	2.40
17031010600:17031060800	08PM	1	14.65	5.20
17031010600:17031980000	08AM	1	33.65	0.00
17031010702:17031

### Mapper again
Here I just switch key with a value so I can use time as a key in reducer 

In [366]:
%%file mapper_taxi4.py
#!/usr/bin/python

import sys

for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
        
    location, time, count_trips,avg_trip_cost, avg_miles = line.split('\t')
    
    
    
    print ("%s\t%s\t%s\t%s\t%s" % (time,count_trips, location,avg_trip_cost, avg_miles))

    

Overwriting mapper_taxi4.py


### Reducer again

In [367]:
%%file reducer_taxi4.py
#!/usr/bin/python

import sys
    
tops = []

for line in sys.stdin:
    
    line = line.strip()
    
    time,count,location, avg_cost, avg_miles= line.split('\t')  
    
    count = int(count)
    
    avg_miles =float(avg_miles)
    
    avg_cost = float(avg_cost) 
    
    matching = [s for s in tops if time in s]
    
    if(len(matching) < 5): #just checking for up to 5 locations based on popularity (count number)
        
        tops.append(line)
        
    else:        
        for top in matching:
            
           #print(top)
        
            top_time, top_count, top_loc, top_cost, top_miles = top.split('\t')
            
            top_count = int(top_count)
            
            if(count > top_count):
                
                tops.remove(top)
                
                tops.append(line)
                
                break

#print(sorted(tops))

for top in sorted(tops): 
    
    print(top)


Overwriting reducer_taxi4.py


In [368]:
rm -rf results_taxi4

In [369]:
%%time
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_taxi4.py,reducer_taxi4.py -mapper mapper_taxi4.py -reducer reducer_taxi4.py -input results_taxi3/part-* -output results_taxi4

2020-11-02 11:57:13,845 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-11-02 11:57:13,978 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-11-02 11:57:13,979 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-11-02 11:57:14,005 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-11-02 11:57:14,257 INFO mapred.FileInputFormat: Total input files to process : 1
2020-11-02 11:57:14,279 INFO mapreduce.JobSubmitter: number of splits:1
2020-11-02 11:57:14,511 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local764631263_0001
2020-11-02 11:57:14,511 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-11-02 11:57:14,873 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_taxi4.py as file:/tmp/hadoop-jovyan/mapred/local/job_local764631263_0001_20a7a733-4d8c-4710-a7ce-a10b04b00ae6/mapper_taxi4.py
2020-11-02 11:57:14,891 INFO mapred.Lo

In [370]:
!cat results_taxi4/part-*

01AM	37	17031081700:17031081403	7.57	0.58
01AM	41	17031081700:17031320100	7.55	0.78
01AM	41	17031081700:17031839100	8.09	0.79
01AM	73	17031081700:17031081800	6.96	0.54
01AM	96	17031081700:17031081700	6.91	0.46
01PM	146	17031839100:17031081500	7.87	0.85
01PM	173	17031839100:17031081700	7.07	0.59
01PM	264	17031320100:17031839100	7.54	0.76
01PM	271	17031839100:17031320100	7.52	0.78
01PM	433	17031839100:17031839100	6.68	0.54
02AM	34	17031081800:17031081700	7.53	0.44
02AM	43	17031081700:17031320100	7.67	0.78
02AM	43	17031081800:17031081800	7.51	0.72
02AM	75	17031081700:17031081700	6.41	0.43
02AM	75	17031081700:17031081800	7.10	0.44
02PM	170	17031839100:17031980000	48.62	14.64
02PM	185	17031081700:17031839100	7.09	0.63
02PM	198	17031081500:17031839100	8.36	1.05
02PM	239	17031839100:17031320100	7.30	0.72
02PM	441	17031839100:17031839100	7.03	0.66
03AM	19	17031081700:17031280100	7.34	0.66
03AM	21	17031081700:17031320100	7.83	0.54
03AM	25	17031081700:17031839100	7.13	0.64

## 4 Calculate average and maximum tip for each year

In [375]:
%%file mapper_taxi5.py
#!/usr/bin/env python

# import sys
import sys
# import string library function  
import string  

from datetime import datetime as dt

for line in sys.stdin:
    data = line.split(";")
    time = data[2]
    year = time.split(" ")[0]
    year = year.split("/")[2]
    cost = data[11]
    
    if year != "":
        year = year
    else:
        continue
        
    if cost != "":
        
        cost = data[11]
    else:
        continue
    
    print('%s\t%s\t%s' % (year,cost, "1"))


Overwriting mapper_taxi5.py


In [72]:
%%file reducer_taxi5.py
#!/usr/bin/python

import sys
import string


last_year = ''
count_trips = 0
avg_trip_cost = 0
sum_years = 0
sum_miles = 0
avg_miles = 0

max = None

# input comes from STDIN
for line in sys.stdin:
    
    # remove leading and trailing whitespace
    line = line.strip()
    
    # parse the input we got from mapper.py
    year, cost, count = line.split('\t')
    
    # convert time (currently a string) to float
    
    cost=float(cost)
    
    count = int(count)
      
    if last_year != year:
       
        if last_year != '':
            
            print ("%s\t%s\t%s" % (last_year,format(avg,".2f"), max))
        
        max = None #resetting our value
        
        count_trips = count
        
        last_year = year
        
        sum_years = cost
        
        avg = sum_years / count_trips
        
        
        
        if max is None or cost > max:
            max = cost
        
    else:
        
        count_trips = count_trips + count
        
       
        sum_years = sum_years + cost
        
        avg = sum_years / count_trips
        
       
        if max is None or cost > max:
            max = cost
              
print ("%s\t%s\t%s" % (last_year, format(avg,".2f"), max))


Overwriting reducer_taxi5.py


In [376]:
rm -rf results_taxi5

In [377]:
%%time
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper_taxi5.py,reducer_taxi5.py -mapper mapper_taxi5.py -reducer reducer_taxi5.py -input Taxi_small.csv -output results_taxi5

2020-11-02 12:03:55,629 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2020-11-02 12:03:55,717 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2020-11-02 12:03:55,717 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2020-11-02 12:03:55,737 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2020-11-02 12:03:55,915 INFO mapred.FileInputFormat: Total input files to process : 1
2020-11-02 12:03:55,935 INFO mapreduce.JobSubmitter: number of splits:5
2020-11-02 12:03:56,134 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1109934559_0001
2020-11-02 12:03:56,134 INFO mapreduce.JobSubmitter: Executing with tokens: []
2020-11-02 12:03:56,469 INFO mapred.LocalDistributedCacheManager: Localized file:/home/jovyan/work/mapper_taxi5.py as file:/tmp/hadoop-jovyan/mapred/local/job_local1109934559_0001_c9310dd2-4467-4337-9553-9c8932221872/mapper_taxi5.py
2020-11-02 12:03:56,499 INFO mapred.

In [75]:
!cat results_taxi5/part-*

2013	0.96	99.0
2014	1.12	120.0
2015	1.37	100.0
2016	1.49	60.0
2017	1.55	261.0
2018	1.72	90.0
2019	1.86	59.0
2020	1.50	30.0
