<a href="https://colab.research.google.com/github/AnnLivio/Automatidata/blob/main/Automatidata_02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the Automatidata project
The New York City Taxi and Limousine Commission (TLC) is looking for a way to use the data collected in the New York City area to predict the fare amount of taxi rides. This is an early stage of the project. To get clear insights, the TLC's data must be analyzed, the key variables identified, and the dataset confirmed to be ready for analysis.


### Data Dictionary

|Column name |Description |
|---|---|
| ID | Trip identification number |
| VendorID | 1 = Creative Mobile Technologies, LLC <br>2 = VeriFone Inc.|
| tpep_pickup_datetime |The date and time when the meter was engaged. |
|tpep_dropoff_datetime | The date and time when the meter was disengaged. |
|Passenger_count |The number of passengers in the vehicle. <br>This is a driver-entered value.|
|Trip_distance |The elapsed trip distance in miles reported by the taximeter.|
|PULocationID |TLC Taxi Zone in which the taximeter was engaged|
|DOLocationID |TLC Taxi Zone in which the taximeter was disengaged|
|RateCodeID |The final rate code in effect at the end of the trip.<br><br>1 = Standard rate <br> 2 = JFK <br>3 = Newark <br>4 = Nassau or Westchester <br>5 = Negotiated fare <br>6 = Group ride|
|Store_and_fwd_flag | Y = store and forward trip <br>N = not a store and forward trip|
|Payment_type | 1 = Credit card <br>2 = Cash <br>3 = No charge <br>4 = Dispute <br>5 = Unknown <br>6 = Voided trip|
|Fare_amount |The time-and-distance fare calculated by the meter.|
|Extra | Extras and surcharges. This only includes the 0.50 and 1 rush-hour and overnight charges.|
|MTA_tax| 0.50 MTA tax that is automatically triggered based on the metered rate in use.|
|Improvement_surcharge |0.30 improvement surcharge assessed on trips at the flag drop.|
|Tip_amount | This field is automatically populated for credit card tips. Cash tips are not included.|
|Tolls_amount |Total amount of all tolls paid in trip. |
|Total_amount |The total amount charged to passengers. Does not include cash tips.|

In [1]:
# Import libraries
import pandas as pd
from scipy import stats


In [3]:
# Load the dataset
data = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Automatidata/automatidata_clean.csv")


In [4]:
data.head()

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
0,2,2017-03-25 08:55:43,2017-03-25 09:09:47,6,3.34,1,N,100,231,1,13.0,0.0,0.5,2.76,0.0,0.3,16.56
1,1,2017-04-11 14:53:28,2017-04-11 15:19:58,1,1.8,1,N,186,43,1,16.0,0.0,0.5,4.0,0.0,0.3,20.8
2,1,2017-12-15 07:26:56,2017-12-15 07:34:08,1,1.0,1,N,262,236,1,6.5,0.0,0.5,1.45,0.0,0.3,8.75
3,2,2017-05-07 13:17:59,2017-05-07 13:48:14,1,3.7,1,N,188,97,1,20.5,0.0,0.5,6.39,0.0,0.3,27.69
4,2,2017-04-15 23:32:20,2017-04-15 23:49:03,1,4.37,1,N,4,112,2,16.5,0.5,0.5,0.0,0.0,0.3,17.8


In [9]:
data.describe(include='all')

Unnamed: 0,VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount
count,22698.0,22698,22698,22698.0,22698.0,22698.0,22698,22698.0,22698.0,22698.0,22698.0,22698.0,22698.0,22698.0,22698.0,22698.0,22698.0
unique,,,,,,,2,,,,,,,,,,
top,,,,,,,N,,,,,,,,,,
freq,,,,,,,22599,,,,,,,,,,
mean,1.55626,2017-06-29 07:37:21.415234816,2017-06-29 07:54:22.286809344,1.642391,2.913441,1.039078,,162.407877,161.523482,1.336902,13.023802,0.333289,0.497445,1.835862,0.312555,0.299551,16.307784
min,1.0,2017-01-01 00:08:25,2017-01-01 00:17:20,0.0,0.0,1.0,,1.0,1.0,1.0,-120.0,-1.0,-0.5,0.0,0.0,-0.3,-120.3
25%,1.0,2017-03-30 03:07:14.750000128,2017-03-30 03:09:46.750000128,1.0,0.99,1.0,,114.0,112.0,1.0,6.5,0.0,0.5,0.0,0.0,0.3,8.75
50%,2.0,2017-06-23 12:51:06,2017-06-23 13:04:29,1.0,1.61,1.0,,162.0,162.0,1.0,9.5,0.0,0.5,1.35,0.0,0.3,11.8
75%,2.0,2017-10-02 10:43:06,2017-10-02 11:07:42.500000,2.0,3.06,1.0,,233.0,233.0,2.0,14.5,0.5,0.5,2.45,0.0,0.3,17.8
max,2.0,2017-12-31 23:45:30,2017-12-31 23:49:24,6.0,33.96,5.0,,265.0,265.0,4.0,999.99,4.5,0.5,200.0,19.1,0.3,1200.29
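
The `min` row of the summary above shows negative values in several monetary columns (for example, `fare_amount` reaches -120.0). A minimal sketch of how such rows could be counted before modeling follows. The helper name `count_negative` and the toy values are assumptions for illustration; they are not part of the original notebook or the TLC data.

```python
import pandas as pd


def count_negative(df: pd.DataFrame, columns: list) -> pd.Series:
    """Return, for each listed column, how many rows hold a negative value."""
    return (df[columns] < 0).sum()


# Toy values for illustration only (not taken from the TLC dataset).
toy = pd.DataFrame({
    "fare_amount": [13.0, -2.5, 6.5],
    "total_amount": [16.56, -3.3, 8.75],
})

negatives = count_negative(toy, ["fare_amount", "total_amount"])
print(negatives)
```

On the real data, the same call could be made as `count_negative(data, ["fare_amount", "total_amount"])`, assuming `data` is the DataFrame loaded earlier.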


## **Step 2:** Data exploration

We are interested in the relationship between `payment_type` and the `fare_amount` the customer pays. One approach is to look at the average `fare_amount` for each `payment_type`.

In [10]:
data.groupby('payment_type')['fare_amount'].mean()

payment_type,fare_amount
1,13.42557
2,12.213546
3,12.186116
4,9.913043


**Notes:**

Based on the averages shown, customers who pay by credit card appear to pay a higher average fare than customers who pay in cash. However, this difference might arise from random sampling rather than reflect a true difference in fare amount. To assess whether the difference is statistically significant, we conduct a hypothesis test.
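
Whether a gap between two means could be due to sampling depends on the group sizes and the spread within each group, not only on the means. As a rough sketch, those three quantities can be viewed side by side per payment type. The helper name `fare_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def fare_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count, mean and sample standard deviation of fare_amount per payment_type."""
    return df.groupby("payment_type")["fare_amount"].agg(["count", "mean", "std"])


# Toy values for illustration only (not taken from the TLC dataset).
toy = pd.DataFrame({
    "payment_type": [1, 1, 1, 2, 2, 2],
    "fare_amount": [10.0, 12.0, 14.0, 8.0, 9.0, 10.0],
})

summary = fare_summary(toy)
print(summary)
```

With the notebook's DataFrame, `fare_summary(data)` would give the same three columns for each payment type.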

## **Step 3:** Hypothesis testing

$H_0$: There is no difference in the average `fare_amount` between customers who use credit cards and customers who use cash.

$H_A$: There is a difference in the average `fare_amount` between customers who use credit cards and customers who use cash.

In [14]:
# Two-sample t-test (A/B test) on fare_amount: credit card vs. cash
# Significance level: 5%
clients_cc = data.query('payment_type == 1')['fare_amount']
clients_cash = data.query('payment_type == 2')['fare_amount']
# equal_var=False runs Welch's t-test, which does not assume equal variances
t_stat, pvalue = stats.ttest_ind(a=clients_cc, b=clients_cash, equal_var=False)

if pvalue < 0.05:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")

print("P-value: ", pvalue)

Reject the null hypothesis
P-value:  7.917577989336359e-12


**Conclusions**

There is a statistically significant difference in the average `fare_amount` between customers who use credit cards and customers who use cash.
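
To make clearer what the test above computes, here is a sketch that reproduces Welch's t-test by hand (the variant `scipy.stats.ttest_ind` runs when `equal_var=False`) and compares it with SciPy's result. The helper name `welch_t` and the toy samples are assumptions for illustration; they are not the notebook's data.

```python
import numpy as np
from scipy import stats


def welch_t(a, b):
    """Welch's two-sample t statistic, degrees of freedom and two-sided p-value."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each sample mean (sample variance / n)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Two-sided p-value from the Student t distribution
    p = 2 * stats.t.sf(abs(t), dof)
    return t, dof, p


# Toy samples for illustration only (not taken from the TLC dataset).
cc = [10.0, 12.0, 14.0, 16.0, 11.0]
cash = [8.0, 9.0, 10.0, 9.0, 7.0]

t_manual, dof, p_manual = welch_t(cc, cash)
res = stats.ttest_ind(a=cc, b=cash, equal_var=False)
print(t_manual, res.statistic)
print(p_manual, res.pvalue)
```

The manual values should agree with SciPy's up to floating-point error. The statistic grows with the difference in means and shrinks with the within-group variance, which is why a modest difference can still be significant when the samples are large.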

### Questions

1. The key business insight is that encouraging customers to pay with credit cards could generate more revenue for taxi drivers.

2. This dataset does not account for other likely explanations. For example, riders might not carry enough cash, so it is easier to pay for longer trips with a credit card. In other words, it is far more likely that the fare amount determines the payment type than the reverse.
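
One rough way to probe the alternative explanation in point 2 is to check whether trips paid by credit card also tend to be longer. The sketch below compares mean `trip_distance` and `fare_amount` per payment type. The helper name `mean_by_payment` and the toy values are assumptions for illustration, and a difference in distance would only be consistent with that explanation, not proof of it.

```python
import pandas as pd


def mean_by_payment(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Mean of the given columns for each payment_type."""
    return df.groupby("payment_type")[columns].mean()


# Toy values for illustration only (not taken from the TLC dataset).
toy = pd.DataFrame({
    "payment_type": [1, 1, 2, 2],
    "trip_distance": [4.0, 6.0, 1.0, 2.0],
    "fare_amount": [16.0, 22.0, 6.0, 9.0],
})

means = mean_by_payment(toy, ["trip_distance", "fare_amount"])
print(means)
```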

The project continues in Automatidata_03.ipynb.