<a href="https://colab.research.google.com/github/NicoPatalagua/Taxis/blob/master/Taxis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TLC

## Nicolás Patalagua
### Big Data Infrastructure - Universidad Sergio Arboleda

*The New York City Taxi and Limousine Commission (TLC), created in 1971, is the agency responsible for licensing and regulating New York City's Medallion (Yellow) taxi cabs, for-hire vehicles (community-based liveries, black cars and luxury limousines), commuter vans, and paratransit vehicles. The Commission's Board consists of nine members, eight of whom are unsalaried Commissioners. The salaried Chair/Commissioner presides over regularly scheduled public commission meetings and is the head of the agency, which maintains a staff of approximately 600 TLC employees.*

*Over 200,000 TLC licensees complete approximately 1,000,000 trips each day. To operate for hire, drivers must first undergo a background check, have a safe driving record, and complete 24 hours of driver training. TLC-licensed vehicles are inspected for safety and emissions at TLC's Woodside Inspection Facility.*

More info: https://www1.nyc.gov/site/tlc/about/about-tlc.page

Download the NYC taxi dataset for June 2019: https://nyc-tlc.s3.amazonaws.com/trip+data/yellow_tripdata_2019-06.csv

Download the zone lookup file:
https://nyc-tlc.s3.amazonaws.com/misc/taxi+_zone_lookup.csv

In [86]:
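# Install Java 8 and download Spark 2.4.5 (the Spark binaries need a JVM).
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://apache.osuosl.org/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
!tar xf spark-2.4.5-bin-hadoop2.7.tgz
!pip install -q findspark
!pip install pyspark
import os
import time
# Point the environment at the Java and Spark installations from the steps above.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.5-bin-hadoop2.7"

# Mount Google Drive so files stored there are reachable under /content/gdrive.
from google.colab import drive
drive.mount('/content/gdrive')
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName("app")
# Pass conf so the application name is applied when a new context is created.
sc = SparkContext.getOrCreate(conf)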

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.master("local").getOrCreate()

In [0]:
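# header=True takes column names from the first row; without inferSchema every column is read as a string.
ObjTaxi = spark.read.csv("Taxis.csv", header=True)

In [0]:
ObjZone = spark.read.csv("Zone.csv", header=True)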

### **Average distance of an NYC taxi trip**

In [94]:
ObjTaxi.select(F.avg("trip_distance")).show()

+------------------+
|avg(trip_distance)|
+------------------+
|3.0785054986122415|
+------------------+



#### **Payment types**

In [96]:
# show() prints the table and returns None, so its result is not assigned.
ObjTaxi.select('payment_type').distinct().show()

+------------+
|payment_type|
+------------+
|           3|
|           1|
|           4|
|           2|
+------------+



#### **Taxi with the most trips**

In [97]:
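# Count trips per VendorID (the TPEP provider code; the dataset carries no per-taxi ID).
ObjTaxi30 = ObjTaxi.groupBy("VendorID").agg(F.count("VendorID").alias("Max_trips"))
# Caveat: the two F.max calls are evaluated independently, so the row pairs the largest
# VendorID with the largest count; the two need not belong to the same vendor.
ObjTaxi31 = ObjTaxi30.select("VendorID", "Max_trips").agg(F.max("VendorID").alias("VendorID"), F.max("Max_trips"))
ObjTaxi31.show()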

+--------+--------------+
|VendorID|max(Max_trips)|
+--------+--------------+
|       4|       4382892|
+--------+--------------+
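
The aggregation above takes the maximum of each column separately, so the `VendorID` and the trip count in the result row are not guaranteed to come from the same vendor. The following plain-Python sketch, which uses made-up counts rather than values from the dataset, contrasts independent maxima with an arg-max that keeps key and count paired. In PySpark a comparable effect could be obtained with `orderBy(F.desc("Max_trips")).limit(1)` on the grouped DataFrame.

```python
# Hypothetical per-vendor trip counts; the numbers are made up for illustration.
trips_per_vendor = {"1": 250, "2": 400, "4": 15}

# Independent maxima: the largest key and the largest value,
# which may come from different entries.
independent = (max(trips_per_vendor), max(trips_per_vendor.values()))

# Arg-max: the key whose value is largest, together with that value.
top_vendor = max(trips_per_vendor, key=trips_per_vendor.get)
paired = (top_vendor, trips_per_vendor[top_vendor])

print(independent)  # ('4', 400)
print(paired)       # ('2', 400)
```

With these sample counts the independent maxima report vendor "4", although vendor "2" is the one with 400 trips.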



#### **Number of trips per day in June 2019**

In [98]:
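# Caveat: grouping on the full timestamp counts trips per second, not per day;
# a per-day count would group on the date part (e.g. via F.to_date) instead.
ObjTaxi40 = ObjTaxi.groupBy("tpep_pickup_datetime").agg(F.count("tpep_pickup_datetime").alias("Max (pickup)"))
ObjTaxi41 = ObjTaxi.groupBy("tpep_dropoff_datetime").agg(F.count("tpep_dropoff_datetime").alias("Max (dropoff)"))
ObjTaxi40.show(5)
ObjTaxi41.show(5)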

+--------------------+------------+
|tpep_pickup_datetime|Max (pickup)|
+--------------------+------------+
| 2019-06-01 00:29:17|           3|
| 2019-06-01 00:07:12|           1|
| 2019-06-01 00:52:54|           5|
| 2019-06-01 00:08:46|           3|
| 2019-06-01 00:40:46|           1|
+--------------------+------------+
only showing top 5 rows

+---------------------+-------------+
|tpep_dropoff_datetime|Max (dropoff)|
+---------------------+-------------+
|  2019-06-01 00:22:34|            2|
|  2019-06-01 00:57:29|            4|
|  2019-06-01 01:03:00|            5|
|  2019-06-01 00:05:36|            1|
|  2019-06-01 00:29:17|            4|
+---------------------+-------------+
only showing top 5 rows
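
The tables above are keyed by the full timestamp, so each row counts trips that share the same second. To count trips per day, the timestamp has to be reduced to its date first. Here is a minimal plain-Python sketch of that step, using a handful of hypothetical timestamps in the same `YYYY-MM-DD HH:MM:SS` format. In PySpark the equivalent grouping key would typically be `F.to_date("tpep_pickup_datetime")`.

```python
from collections import Counter

# Hypothetical pickup timestamps in the dataset's 'YYYY-MM-DD HH:MM:SS' format.
pickups = [
    "2019-06-01 00:29:17",
    "2019-06-01 00:07:12",
    "2019-06-01 23:59:59",
    "2019-06-02 08:15:00",
]

# Keep only the date part (the first 10 characters) before counting.
trips_per_day = Counter(ts[:10] for ts in pickups)

for day, n in sorted(trips_per_day.items()):
    print(day, n)
```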



#### **Zone where the most passengers are picked up**

In [100]:
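# Caveat: this counts rows of the zone lookup table only (no trip data is joined),
# and the two F.max calls are independent, so "Zone" is the greatest zone name in string order.
ObjZone50 = ObjZone.groupBy("Zone").agg(F.count("Zone").alias("Pass"))
ObjZone51 = ObjZone50.select("Zone", "Pass").agg(F.max("Zone").alias("Zone"), F.max("Pass"))
ObjZone51.show()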

+--------------+---------+
|          Zone|max(Pass)|
+--------------+---------+
|Yorkville West|        3|
+--------------+---------+



#### **Number of trips headed to the Bronx**

In [101]:
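# Caveat: this counts Bronx entries in the zone lookup table, not trips;
# counting trips would require joining ObjTaxi on DOLocationID.
ObjZone60 = ObjZone.where("`Borough` like 'Bronx%'").select("Borough", "LocationID")
ObjZone61 = ObjZone60.groupBy("Borough").agg(F.count("Borough").alias("Trips"))
ObjZone61.show()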

+-------+-----+
|Borough|Trips|
+-------+-----+
|  Bronx|   43|
+-------+-----+



#### **Average number of passengers per trip to JFK Airport**

In [102]:
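# The '_' in LIKE matches any single character, here the space in 'JFK Airport'.
ObjZone70 = ObjZone.where("`Zone` like 'JFK_Airport%'").select("service_zone", "LocationID", "Borough", "Zone")
# Caveat: the join uses the pickup location and the average is taken over VendorID;
# trips to JFK would join on DOLocationID and average passenger_count.
ObjTaxiZone70 = ObjTaxi.join(ObjZone70, ObjTaxi.PULocationID == ObjZone70.LocationID)
ObjTaxiZone71 = ObjTaxiZone70.groupby("Zone").agg(F.avg("VendorID").alias("Avg_Pass"))
ObjTaxiZone71.show()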

+-----------+------------------+
|       Zone|          Avg_Pass|
+-----------+------------------+
|JFK Airport|1.6908959629637494|
+-----------+------------------+
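
The heading asks about passengers on trips heading to JFK, whereas the cell above averages `VendorID` over trips picked up there. The plain-Python sketch below illustrates the intended logic on hypothetical rows: select trips whose dropoff zone is JFK and average their `passenger_count`. The zone table and trip rows are invented for illustration, and the LocationID values are assumptions.

```python
from statistics import mean

# Hypothetical zone lookup (LocationID -> Zone); the IDs are assumptions.
zones = {132: "JFK Airport", 151: "Manhattan Valley"}

# Hypothetical trips; only the two columns needed here are included.
trips = [
    {"DOLocationID": 132, "passenger_count": 1},
    {"DOLocationID": 132, "passenger_count": 2},
    {"DOLocationID": 151, "passenger_count": 4},
    {"DOLocationID": 132, "passenger_count": 3},
]

# LocationIDs whose zone name is JFK Airport.
jfk_ids = {loc for loc, zone in zones.items() if zone == "JFK Airport"}

# Average passengers over trips that were dropped off at JFK.
avg_passengers = mean(
    t["passenger_count"] for t in trips if t["DOLocationID"] in jfk_ids
)
print(avg_passengers)  # 2
```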



#### **Average distance and fare for a taxi from JFK Airport to Manhattan Valley**

In [103]:
# The '_' in LIKE matches any single character, here the space in the zone name.
ObjZone80 = ObjZone.where("`Zone` like 'JFK_Airport%'").select("service_zone", "LocationID", "Borough", "Zone")
# ObjZone81 is not used below; the dropoff filter hardcodes 151, presumably Manhattan Valley's LocationID.
ObjZone81 = ObjZone.where("`Zone` like 'Manhattan_Valley%'").select("service_zone", "LocationID", "Borough", "Zone")
# Keep trips picked up at JFK, then those dropped off at location 151.
ObjTaxiZone80 = ObjTaxi.join(ObjZone80, ObjTaxi.PULocationID == ObjZone80.LocationID)
ObjTaxiZone81 = ObjTaxiZone80.where("`DOLocationID` like '151%'").select("Zone", "trip_distance", "fare_amount", "PULocationID", "DOLocationID")
ObjTaxiZone82 = ObjTaxiZone81.groupBy("Zone").agg(F.avg("Trip_distance"), F.avg("Fare_amount"))
ObjTaxiZone82.show()

+-----------+------------------+-----------------+
|       Zone|avg(Trip_distance)| avg(Fare_amount)|
+-----------+------------------+-----------------+
|JFK Airport| 20.18786912751678|52.09825503355705|
+-----------+------------------+-----------------+



#### **Most frequent route (between which zones)**

In [104]:
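# Join on the pickup location only, so "Zone" is the pickup zone name.
ObjTaxiZone90 = ObjTaxi.join(ObjZone, ObjTaxi.PULocationID == ObjZone.LocationID)
# Count trips for each (pickup, dropoff) pair.
ObjTaxiZone91 = ObjTaxiZone90.groupBy("Zone", "PULocationID", "DOLocationID").agg(F.count("trip_distance").alias("num_trips"))
ObjTaxiZone92 = ObjTaxiZone91.groupBy("Zone", "PULocationID", "DOLocationID").agg(F.max("num_trips").alias("max_trips"))
# Caveat: the two F.max calls are independent; "Trips" is the highest pair count, while
# "Zone" is the greatest zone name in string order, not necessarily that pair's zone.
ObjTaxiZone93 = ObjTaxiZone92.select('Zone', 'max_trips').agg(F.max("max_trips").alias("Trips"), F.max("zone").alias("Zone"))
ObjTaxiZone93.show()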

+-----+--------------+
|Trips|          Zone|
+-----+--------------+
|47368|Yorkville West|
+-----+--------------+
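
The result above reports a single zone, while a route has both a pickup and a dropoff zone. The plain-Python sketch below shows one way to keep the two together: count each `(pickup, dropoff)` pair, take the most common pair, and only then look up both zone names. The trips and the zone table are hypothetical sample data.

```python
from collections import Counter

# Hypothetical zone lookup (LocationID -> Zone); the IDs are assumptions.
zones = {
    132: "JFK Airport",
    236: "Upper East Side North",
    237: "Upper East Side South",
}

# Hypothetical trips as (PULocationID, DOLocationID) pairs.
trips = [(237, 236), (237, 236), (236, 237), (132, 237), (237, 236)]

# Count each pickup/dropoff pair and keep the most frequent one.
(route, num_trips), = Counter(trips).most_common(1)
pickup_id, dropoff_id = route

print(zones[pickup_id], "->", zones[dropoff_id], num_trips)
```

Because the count and the pair come from the same `Counter` entry, the reported zones always belong to the route that achieved the count.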



### **Average trip duration**

In [112]:
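# Trip duration in seconds: dropoff minus pickup, both converted to Unix timestamps.
ObjTime = ObjTaxi.select((F.unix_timestamp('tpep_dropoff_datetime') - F.unix_timestamp('tpep_pickup_datetime')).alias('Time'))
ObjTime1 = ObjTime.select(F.avg('Time').alias('avg')).first()['avg']
# Format the average number of seconds as HH:MM:SS.
ObjTime2 = time.strftime('%H:%M:%S', time.gmtime(int(ObjTime1)))
print('Avr_Time_Trip: ' + ObjTime2)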

Avr_Time_Trip: 00:18:42
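
Formatting through `time.gmtime` treats the number of seconds as a clock time, so the hour field wraps around after 24 hours. That does not affect the average above, but it would if the same formatting were reused for longer durations. The helper below is a hypothetical alternative (the name `format_duration` is not part of the notebook) that formats non-negative durations with `divmod` and does not wrap.

```python
def format_duration(seconds):
    """Format a non-negative number of seconds as HH:MM:SS without wrapping at 24 hours."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_duration(1122))   # 00:18:42
print(format_duration(90000))  # 25:00:00
```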
