

## Taxi Accident Probability Analysis

The goal of this project is to review the key probability concepts I have applied throughout the course, using Python, and to analyze real data in order to draw informed conclusions.

The project is divided into two parts. The first part introduces the case study, and the second part analyzes a real dataset related to the MIP taxi accident.


![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 1

**Case Study**

In a city with one hundred taxis, 1 is `blue` and 99 are `green`.

A witness observes a hit-and-run by a taxi at night and recalls that the taxi was blue, so the police arrest the blue taxi driver who was on duty that night. 

The driver proclaims his innocence and hires you to defend him in court. You hire a scientist to test the witness's ability to distinguish blue and green taxis under conditions similar to the night of the accident.

The data suggests that the witness sees blue cars as blue 99% of the time and green cars as blue 2% of the time.

---
**A)** What type of probability will you calculate to defend him?

> **Quick reminder**: You know that the witness recalls the taxi was blue. You will try to make the case that this is likely a random misidentification.

### **P(the car is blue|the witness sees blue).**

---
**B)** Using the following events:

$W_b$ = "witness sees a blue taxi" 

$W_g$ = "witness sees a green taxi"

$T_b$ = "taxi is blue" 

$T_g$ = "taxi is green"

How could you use these events to calculate the probability you mentioned?



### **P(the car is blue|the witness sees blue)=$P(T_b|W_b)=\dfrac{P(W_b|T_b)P(T_b)}{P(W_b)}$**

---
**C)** Calculate $P(W_b|T_b)$ and $P(T_b)$

$P(T_b) = 0.01$

$P(T_g) = 0.99$.

$P(W_b|T_b) = 0.99$ 

$P(W_b|T_g) = 0.02$.

---
**D)** Estimate $P(W_b)$ using your previous answer.

> Remember that we compute $P(W_b)$ using the _Law of Total Probability_.

### **$P(W_b) = P(W_b|T_b)P(T_b) + P(W_b|T_g)P(T_g) = 0.99 \times 0.01 + 0.02 \times 0.99 = 0.99 \times 0.03.$**
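As a quick numerical check, the sum above can be reproduced in Python. This is a minimal sketch: the helper name `total_probability` is illustrative and not part of the original notebook.

```python
def total_probability(likelihoods, priors):
    """Sum P(E | H_i) * P(H_i) over a partition of hypotheses H_i."""
    return sum(l * p for l, p in zip(likelihoods, priors))

# Likelihoods P(W_b | T_b), P(W_b | T_g) and priors P(T_b), P(T_g) from part C
p_wb = total_probability([0.99, 0.02], [0.01, 0.99])
print(round(p_wb, 4))  # 0.0297
```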


---
**E)** Now, what is the result of $P(T_b|W_b)$?

### $P(T_b|W_b) = \dfrac{P(W_b|T_b)P(T_b)}{P(W_b)} = \dfrac{0.99 \times 0.01}{0.99 \times 0.03} = \dfrac{1}{3}$
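The result can be confirmed exactly with Python's standard `fractions` module, which avoids floating-point rounding. The variable names below are assumptions chosen to mirror the event notation.

```python
from fractions import Fraction

# Probabilities from part C, written as exact fractions
p_tb = Fraction(1, 100)            # P(T_b)
p_tg = Fraction(99, 100)           # P(T_g)
p_wb_given_tb = Fraction(99, 100)  # P(W_b | T_b)
p_wb_given_tg = Fraction(2, 100)   # P(W_b | T_g)

# Law of Total Probability, then Bayes' theorem
p_wb = p_wb_given_tb * p_tb + p_wb_given_tg * p_tg
p_tb_given_wb = p_wb_given_tb * p_tb / p_wb
print(p_tb_given_wb)  # 1/3
```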

---
**F)** Verdict time. Is the car more likely to be blue or green?

**The probability that the taxi the witness saw as blue was actually blue is only 1/3, which raises reasonable doubt about the accuracy of the witness's identification. There is a 2/3 chance that the taxi involved in the accident was actually green, so green is the more likely color.**
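One way to build intuition for this verdict is a small Monte Carlo simulation. The sketch below is not part of the original analysis: it assumes each trial draws a taxi color with the stated priors and then a witness report with the stated accuracy, and it estimates the fraction of "blue" reports that come from truly blue taxis. The trial count and seed are arbitrary choices.

```python
import random

random.seed(0)        # arbitrary seed, only for reproducibility
n_trials = 500_000    # arbitrary number of simulated sightings

saw_blue = 0           # trials where the witness reports blue
blue_and_saw_blue = 0  # trials where the taxi is blue AND reported blue

for _ in range(n_trials):
    taxi_is_blue = random.random() < 0.01          # P(T_b) = 0.01
    p_report_blue = 0.99 if taxi_is_blue else 0.02  # P(W_b | color)
    if random.random() < p_report_blue:
        saw_blue += 1
        if taxi_is_blue:
            blue_and_saw_blue += 1

# Empirical estimate of P(T_b | W_b); it should land close to 1/3
estimate = blue_and_saw_blue / saw_blue
print(estimate)
```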
 



![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)

## Part 2

In this section you'll use taxi trip data.

The trip records include data about pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts.

In [1]:
import pandas as pd

In [2]:
taxi_df = pd.read_csv('taxi.csv')
taxi_df.head()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,5240.0,2016-12-15 23:45:00,2016-12-16 00:00:00,900.0,2.5,,307.0,8.0,32.0,10.75,2.45,0.0,1.0,14.7,Credit Card,,754.0,410.0,64.0,231.0
1,1215.0,2016-12-12 07:15:00,2016-12-12 07:15:00,240.0,0.4,,40.0,28.0,28.0,5.0,3.0,0.0,1.0,9.5,Credit Card,,395.0,408.0,395.0,408.0
2,3673.0,2016-12-16 16:30:00,2016-12-16 17:00:00,2400.0,10.7,,,32.0,,31.0,0.0,0.0,0.0,31.0,Cash,,385.0,478.0,,
3,5400.0,2016-12-16 08:45:00,2016-12-16 09:00:00,300.0,0.0,,787.0,8.0,8.0,5.25,2.0,0.0,0.0,7.25,Credit Card,101.0,688.0,206.0,161.0,649.0
4,1257.0,2016-12-3 18:45:00,2016-12-3 18:45:00,360.0,0.3,,534.0,7.0,8.0,5.0,0.0,0.0,0.0,5.0,Cash,3.0,618.0,407.0,454.0,453.0


In [3]:
taxi_df.shape

(1245712, 20)

---
**A)** Drop the following columns: 

```
['pickup_census_tract','dropoff_census_tract','pickup_community_area','dropoff_community_area','tolls','pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude']
```


In [4]:
data = taxi_df.drop(columns=['pickup_census_tract','dropoff_census_tract','pickup_community_area','dropoff_community_area','tolls','pickup_latitude','pickup_longitude','dropoff_latitude','dropoff_longitude'])
data.head()

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,fare,tips,extras,trip_total,payment_type,company
0,5240.0,2016-12-15 23:45:00,2016-12-16 00:00:00,900.0,2.5,10.75,2.45,1.0,14.7,Credit Card,
1,1215.0,2016-12-12 07:15:00,2016-12-12 07:15:00,240.0,0.4,5.0,3.0,1.0,9.5,Credit Card,
2,3673.0,2016-12-16 16:30:00,2016-12-16 17:00:00,2400.0,10.7,31.0,0.0,0.0,31.0,Cash,
3,5400.0,2016-12-16 08:45:00,2016-12-16 09:00:00,300.0,0.0,5.25,2.0,0.0,7.25,Credit Card,101.0
4,1257.0,2016-12-3 18:45:00,2016-12-3 18:45:00,360.0,0.3,5.0,0.0,0.0,5.0,Cash,3.0


---
**B)** Keep just the first 100 rows.

In [5]:
data = taxi_df.iloc[:100,:]
print(data.shape)

data.head()

(100, 20)


Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
0,5240.0,2016-12-15 23:45:00,2016-12-16 00:00:00,900.0,2.5,,307.0,8.0,32.0,10.75,2.45,0.0,1.0,14.7,Credit Card,,754.0,410.0,64.0,231.0
1,1215.0,2016-12-12 07:15:00,2016-12-12 07:15:00,240.0,0.4,,40.0,28.0,28.0,5.0,3.0,0.0,1.0,9.5,Credit Card,,395.0,408.0,395.0,408.0
2,3673.0,2016-12-16 16:30:00,2016-12-16 17:00:00,2400.0,10.7,,,32.0,,31.0,0.0,0.0,0.0,31.0,Cash,,385.0,478.0,,
3,5400.0,2016-12-16 08:45:00,2016-12-16 09:00:00,300.0,0.0,,787.0,8.0,8.0,5.25,2.0,0.0,0.0,7.25,Credit Card,101.0,688.0,206.0,161.0,649.0
4,1257.0,2016-12-3 18:45:00,2016-12-3 18:45:00,360.0,0.3,,534.0,7.0,8.0,5.0,0.0,0.0,0.0,5.0,Cash,3.0,618.0,407.0,454.0,453.0
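Note that the cell above slices `taxi_df` rather than the reduced `data`, which is why the columns dropped in A) reappear and the shape is `(100, 20)`. A minimal sketch of chaining both steps, using a hypothetical toy frame rather than `taxi.csv`:

```python
import pandas as pd

# Hypothetical toy data standing in for taxi_df
toy = pd.DataFrame({
    "taxi_id": [0, 1, 2, 3, 4],
    "fare": [5.0, 7.5, 10.0, 31.0, 43.5],
    "tolls": [0.0, 0.0, 0.0, 0.0, 0.0],
})

reduced = toy.drop(columns=["tolls"])  # step A: drop unwanted columns
subset = reduced.iloc[:3, :]           # step B: slice the reduced frame, not the original
print(subset.shape)  # (3, 2)
```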


---
**C)** We don't have information about taxi colors, but we know green taxis don't accept **cash** if the fare is higher than $40.

We know a passenger spent more than $40 and paid in cash, so that trip must belong to the blue taxi. **Can you get that trip record?**

In [6]:
blue_taxi = data[(data["payment_type"] == "Cash") & (data["fare"]>40)]
blue_taxi

Unnamed: 0,taxi_id,trip_start_timestamp,trip_end_timestamp,trip_seconds,trip_miles,pickup_census_tract,dropoff_census_tract,pickup_community_area,dropoff_community_area,fare,tips,tolls,extras,trip_total,payment_type,company,pickup_latitude,pickup_longitude,dropoff_latitude,dropoff_longitude
28,1369.0,2016-12-20 15:15:00,2016-12-20 15:45:00,2160.0,17.4,,313.0,32.0,76.0,43.5,0.0,0.0,0.0,43.5,Cash,10.0,18.0,610.0,225.0,6.0
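The later cells hard-code the ID `1369.0` read from this output. As an alternative, the ID could be derived from the filter itself, with a check that exactly one taxi matches. The sketch below uses hypothetical toy data and an illustrative variable name `blue_taxi_id`.

```python
import pandas as pd

# Hypothetical toy data, not taken from taxi.csv
toy = pd.DataFrame({
    "taxi_id": [10.0, 11.0, 12.0, 10.0],
    "payment_type": ["Cash", "Cash", "Credit Card", "Credit Card"],
    "fare": [45.0, 12.0, 50.0, 8.0],
})

# Same filter as in the notebook: cash payment and fare above 40
matches = toy[(toy["payment_type"] == "Cash") & (toy["fare"] > 40)]

# The argument only works if a single taxi satisfies the condition
ids = matches["taxi_id"].unique()
assert len(ids) == 1, "expected exactly one taxi to match"
blue_taxi_id = ids[0]
print(blue_taxi_id)  # 10.0
```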


---
**D)** What is the probability of having taken the blue taxi, given that the trip lasted more than 500 seconds?

**Let $S$ = "the trip lasted more than 500 seconds". We can apply the conditional probability formula:**

$P(T_b|S)=\dfrac{P(S\cap T_b)}{P(S)}$.

In [8]:
# Filter the data to get trips that lasted more than 500 seconds and are related to the blue taxi (taxi 1369)
blue_taxi_trips_over_500 = data[(data["trip_seconds"]>500) & (data["taxi_id"] == 1369.0)]

# Filter the data to get all trips that lasted more than 500 seconds
trips_over_500_seconds = data[(data["trip_seconds"]>500)]

# Calculate the probability
prob_blue_taxi_over_500 = blue_taxi_trips_over_500.shape[0]/trips_over_500_seconds.shape[0]
prob_blue_taxi_over_500

0.01818181818181818
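The same ratio can also be expressed with boolean masks, since the mean of a boolean Series is the proportion of `True` values. This is a sketch on hypothetical toy data, where taxi `1` stands in for the blue taxi.

```python
import pandas as pd

# Hypothetical toy data, not taken from taxi.csv
toy = pd.DataFrame({
    "taxi_id": [1, 1, 2, 2, 3, 3],
    "trip_seconds": [900, 300, 600, 1200, 200, 700],
})

is_long = toy["trip_seconds"] > 500  # event S
is_taxi_1 = toy["taxi_id"] == 1      # event T (stand-in for "blue")

# P(T | S) = P(S and T) / P(S)
p_joint = (is_long & is_taxi_1).mean()
p_s = is_long.mean()
p_t_given_s = p_joint / p_s

# Equivalent shortcut: restrict to S first, then take the proportion of T
p_shortcut = is_taxi_1[is_long].mean()
print(p_t_given_s, p_shortcut)
```

Both forms count the same rows, so they agree up to floating-point rounding.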

---
**E)** What is the probability of having taken a green taxi, given that the trip lasted more than 500 seconds?

**With $S$ = "the trip lasted more than 500 seconds", we can apply the conditional probability formula:**

$P(T_g|S)=\dfrac{P(S\cap T_g)}{P(S)}$.

In [10]:
# Filter the data to get trips that lasted more than 500 seconds and are related to green taxis
green_taxi_trips_over_500 = data[(data["trip_seconds"]>500) & (data["taxi_id"] != 1369.0)]

# Filter the data to get all trips that lasted more than 500 seconds
trips_over_500_seconds = data[(data["trip_seconds"]>500)]

# Calculate the probability
prob_green_taxi_over_500 = green_taxi_trips_over_500.shape[0]/trips_over_500_seconds.shape[0]
prob_green_taxi_over_500

0.9818181818181818

**Another way to calculate the probability of having taken the green taxi given that the trip lasted more than 500 seconds is by using the complement of the probability of taking the blue taxi:**

$P(T_g|S)=1-P(T_b|S)$.


In [11]:
1 - prob_blue_taxi_over_500

0.9818181818181818

---
**F)** What is the probability of having paid in cash, given that you took a green taxi?

**Let $C$ = "the trip was paid in cash". We can apply the conditional probability formula:**

$P(C|T_g)=\dfrac{P(C\cap T_g)}{P(T_g)}$.

In [24]:
# Filter data for green taxi trips that were paid in cash
green_taxi_cash_trips = data[(data["payment_type"]=='Cash') & (data["taxi_id"] != 1369.0)]

# Filter data to get all trips taken in green taxis
green_taxi_trips = data[(data["taxi_id"] != 1369.0)]

# Calculate the probability
prob_cash_green_taxi = green_taxi_cash_trips.shape[0]/green_taxi_trips.shape[0]
prob_cash_green_taxi

0.5353535353535354
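For comparison, `pd.crosstab` with `normalize="index"` returns the conditional distribution of payment type within each taxi color in one call. The toy data and the derived `color` column below are assumptions for illustration; the original dataset has no color column.

```python
import pandas as pd

# Hypothetical toy data; 1369 plays the role of the blue taxi
toy = pd.DataFrame({
    "taxi_id": [1369, 1, 1, 2, 2, 3],
    "payment_type": ["Cash", "Cash", "Credit Card",
                     "Cash", "Credit Card", "Credit Card"],
})

# Derive an illustrative color label from the taxi ID
toy["color"] = toy["taxi_id"].eq(1369).map({True: "blue", False: "green"})

# Each row sums to 1: P(payment_type | color)
cond = pd.crosstab(toy["color"], toy["payment_type"], normalize="index")
p_cash_given_green = cond.loc["green", "Cash"]
print(p_cash_given_green)  # 0.4
```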

---
**G)** For the last task, pick a taxi trip from this dataset and consider the following events:

- A = {the chosen trip was paid in cash}
- B = {the chosen trip lasted more than 500 seconds}
- C = {the chosen trip's total was exactly $40}

Now answer:

**G.1)** Are A and B mutually exclusive events?

**G.2)** Are A and C mutually exclusive events?

In [26]:
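# Trips that satisfy both A (paid in cash) and B (lasted more than 500 seconds)
data_AB = data[(data["payment_type"] == "Cash") & (data["trip_seconds"]>500)]
data_AB.shape
# The intersection is non-empty, so A and B are not mutually exclusive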

(25, 20)

In [27]:
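# Trips that satisfy both A (paid in cash) and C (trip total equal to 40)
data_AC = data[(data["payment_type"] == "Cash") & (data["trip_total"]==40)]
data_AC.shape
# The intersection is empty in this sample, so A and C are mutually exclusive here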

(0, 20)
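Both checks follow the same pattern: two events are mutually exclusive in the data when no row satisfies both. A small helper can make that explicit. The function name `are_mutually_exclusive` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def are_mutually_exclusive(mask_a, mask_b):
    """Return True if no row satisfies both boolean masks."""
    return not (mask_a & mask_b).any()

# Hypothetical toy data, not taken from taxi.csv
toy = pd.DataFrame({
    "payment_type": ["Cash", "Cash", "Credit Card", "Credit Card"],
    "trip_seconds": [900.0, 300.0, 1200.0, 200.0],
    "trip_total": [12.0, 7.5, 40.0, 5.0],
})

event_a = toy["payment_type"] == "Cash"
event_b = toy["trip_seconds"] > 500
event_c = toy["trip_total"] == 40

print(are_mutually_exclusive(event_a, event_b))  # False
print(are_mutually_exclusive(event_a, event_c))  # True
```

An empty intersection only shows that the events did not occur together in the rows examined. With a 100-row sample, a larger dataset could still contain a cash trip totaling exactly $40.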

![orange-divider](https://user-images.githubusercontent.com/7065401/92672455-187a5f80-f2ef-11ea-890c-40be9474f7b7.png)