# Analysis on Chicago Ride-hailing
(Please refer to `case_chicago.ipynb` for most of the figures.)

The dataset has 23 columns. They can be categorised into 4 groups:

1. Ancillary features: Trip ID, Taxi ID, Company

2. Payment features: Fare, Tips, Tolls, Extras, Trip Total, Payment Type

3. Time features: Trip Start Timestamp, Trip End Timestamp, Trip Seconds

4. Geo-distance features: Trip Miles, Pickup Census Tract, Dropoff Census Tract, Pickup Community Area, Dropoff Community Area, Pickup Centroid Latitude, Pickup Centroid Longitude, Pickup Centroid Location, Dropoff Centroid Latitude, Dropoff Centroid Longitude, Dropoff Centroid Location


## Ancillary - competitors

| Company | Market Share |
| :-- | :-- |
| Flash Cab | 0.329 |
| Taxi Affiliation Services | 0.322 |
| Medallion Leasin | 0.060 |
| Taxicab Insurance Agency, LLC | 0.051 |
| Sun Taxi | 0.041 |
| City Service | 0.036 |
| Blue Ribbon Taxi Association Inc. | 0.027 |
| Top Cab Affiliation | 0.026 |
| Star North Management LLC | 0.026 |
| Globe Taxi | 0.020 |


Flash Cab and Taxi Affiliation Services are the two main service providers in the region (about 65% of trips combined).

To break into the market, we need to pay attention to their features and strategies.
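The shares in the table above are fractions of trips per company. A minimal sketch of that computation, assuming the `Company` column has been read into a plain list of strings (the toy list below is hypothetical, not the real data):

```python
from collections import Counter


def market_share(companies):
    """Return {company: fraction of trips}, ordered from largest to smallest."""
    counts = Counter(companies)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() yields (name, count) pairs sorted by count, descending
    return {name: n / total for name, n in counts.most_common()}


# Hypothetical toy data for illustration only
toy_companies = ["Flash Cab", "Flash Cab", "Sun Taxi", "Flash Cab", "Globe Taxi"]
toy_shares = market_share(toy_companies)
```

The same function applies to the `Payment Type` column to obtain the payment shares.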

## Payment

### Payment Type

| Payment Type | Market Share |
| :-- | :-- |
| Cash | 0.428 |
| Credit Card | 0.245 |
| Prcard | 0.155 |
| Unknown | 0.130 |


Cash, Credit Card and Prcard are the three main modes of transactions (more than 80% by trips).

We need to provide/enable at least these three types of payment for our services.

### Fare

Mean: 18.2 USD

Median: 12.5 USD

Standard Deviation: 37.8 USD

As shown by the plot, the vast majority (99%) of the trips cost less than 50 USD.

The first peak appears at 10 USD. There is a second peak at around 30 USD.
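The summary figures for fare (and likewise for trip duration and trip miles) can be reproduced with a small helper. This is a sketch under assumptions: the values are already loaded into a list of floats, and the standard deviation is the sample standard deviation, which may differ from what the notebook uses. The toy fares are hypothetical.

```python
import statistics


def summarize(values, threshold):
    """Mean, median, sample standard deviation, and share of values below threshold."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample std (n - 1 in the denominator)
        "share_below": sum(v < threshold for v in values) / len(values),
    }


# Hypothetical toy fares in USD, for illustration only
toy_fares = [5.0, 10.0, 10.0, 12.5, 30.0, 60.0]
toy_stats = summarize(toy_fares, threshold=50)
```

A mean well above the median, as in the real fare data, indicates a long right tail of expensive trips.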


## Time

### Trip Seconds 

The values below are converted from seconds to minutes.

Mean: 18.1 minutes

Median: 13.0 minutes

Standard Deviation: 30.7 minutes

As shown by the plot, the vast majority (99%) of the trips are within an hour.


### Day of the week
The unit price is higher after midnight on weekends than on weekdays.



### Hour of the day
The price fluctuates across the hours of the day.
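The time-of-day and day-of-week patterns can be examined by grouping trips on those two features. The sketch below rests on two assumptions that the text does not confirm: "unit price" is taken to mean fare per mile, and timestamps are assumed to look like `01/05/2019 01:15:00 AM`. The function names and the toy trips are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

TS_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed timestamp layout


def time_features(timestamp):
    """Return (hour of day, is_weekday) for a trip start timestamp."""
    dt = datetime.strptime(timestamp, TS_FORMAT)
    return dt.hour, dt.weekday() < 5  # Monday=0 ... Sunday=6


def mean_unit_price(trips):
    """Average fare per mile for each (hour, is_weekday) group.

    trips: iterable of (timestamp, fare_usd, miles) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for timestamp, fare, miles in trips:
        if miles <= 0:
            continue  # unit price is undefined for zero-distance trips
        key = time_features(timestamp)
        sums[key][0] += fare / miles
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Hypothetical toy trips: two on a Saturday at 1 AM, one on a Monday at 1 AM
toy_trips = [
    ("01/05/2019 01:15:00 AM", 12.0, 3.0),
    ("01/05/2019 01:45:00 AM", 10.0, 2.0),
    ("01/07/2019 01:30:00 AM", 9.0, 3.0),
]
toy_prices = mean_unit_price(toy_trips)
```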


## Geo-distance

### Trip Miles

Mean: 4.91 miles

Median: 2.16 miles

Standard Deviation: 7.01 miles

As shown by the plot, the vast majority (99%) of the trips are within 20 miles.

Half of the trips are within about 2.2 miles (the median).


### Geolocations
![title](visual_1.png)
The pickup and dropoff geolocations are the centers of census tracts rather than the exact locations. This can be seen in the visual above (please check the interactive kepler map `map_chicago_visual.html`). This makes direct interpolation for an arbitrary geolocation less useful. Census tract features are therefore introduced to better capture the pricing pattern.



### Fly Distance
Since Trip Miles is the driving distance, and estimating driving distance requires a routing engine, we introduce the flying (straight-line) distance as a convenient feature for predicting fare. The underlying assumption is that the flying distance is highly correlated with the driving distance.
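One common way to compute a straight-line distance between two latitude/longitude points is the haversine (great-circle) formula. The sketch below uses it as an assumption; the notebook may compute the flying distance differently. The function name and the Earth radius constant are illustrative choices.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def fly_distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # Haversine formula
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# One degree of latitude is roughly 69 miles
one_degree = fly_distance_miles(41.0, -87.6, 42.0, -87.6)
```

Because the inputs are only the pickup and dropoff centroids, this feature is available at the moment of hailing, unlike Trip Miles.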


### Pickup and Dropoff Census
The Chicago census tract polygon data was obtained from `https://data.cityofchicago.org/Facilities-Geographic-Boundaries/Boundaries-Census-Tracts-2010/5jrd-6zik`. There are a total of 801 census tracts in Chicago.

From the pickup and dropoff geolocations, we can reverse-geocode the census tract from the polygons. As shown in `map_chicago_cencus_pickup.html` and `map_chicago_cencus_dropoff.html`, the unit price is higher in the downtown area on the eastern side of the city and generally lower at the outskirts.
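Looking up the tract for a point amounts to a point-in-polygon test against each tract boundary. The sketch below uses a plain ray-casting test and assumes each tract is a single ring of `(lon, lat)` vertices; the real boundaries may be multipolygons, and a geospatial library would be the more practical choice. The function names and the two square "tracts" are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon ring."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


def find_tract(lon, lat, tracts):
    """Return the name of the first tract containing the point, or None."""
    for name, polygon in tracts.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


# Two hypothetical unit-square tracts side by side
toy_tracts = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
```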


## Price Pattern
In summary, we capture the following features that are most relevant to the fare of a ride-hailing trip:

fly_distance, pick_up_census, drop_off_census, hour of the day and day of the week

We give up Trip Seconds and Trip Miles because we cannot estimate them accurately at the point of hailing.

The impact of each factor is roughly illustrated by its correlation.
*The non-linear impact is underestimated here.

| Factor | Correlation |
| :-- | :-- |
| fly_distance | 0.314080 |
| Trip Miles | 0.267008 |
| Trip Seconds | 0.151681 |
| pick_up_census | 0.139031 |
| drop_off_census | 0.047961 |
| hour | 0.002554 |
| Pickup Centroid Latitude | -0.005354 |
| is_weekday | -0.019408 |
| Dropoff Centroid Latitude | -0.040550 |
| Dropoff Centroid Longitude | -0.061133 |
| Pickup Centroid Longitude | -0.171556 |
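Assuming the table reports Pearson correlation with the fare (the text does not name the coefficient), the sketch below shows how such a value is computed and why it can understate a non-linear relationship: a feature that determines the target exactly through a symmetric curve still has zero linear correlation. The toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# A perfectly linear relationship gives a correlation of 1
linear = pearson([1, 2, 3], [2, 4, 6])

# y = x**2 is fully determined by x, yet the linear correlation is 0
xs = [-2, -1, 0, 1, 2]
quadratic = pearson(xs, [x ** 2 for x in xs])
```

This is why a near-zero value for `hour` in the table does not by itself mean that the hour of the day is irrelevant to the fare.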


## Fare Prediction

In the proposed method, 5 features are used to predict the fare:

1. fly_distance (numerical)

2. pickup_census_tract (categorical)

3. dropoff_census_tract (categorical)

4. hour (categorical)

5. day (boolean)

The first three are computed from the origin and destination geolocations.

A LightGBM model is trained for prediction. The best Mean Absolute Error achieved is 3.27 USD. Please refer to the code for model details.
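For reference, the Mean Absolute Error is the average absolute difference between predicted and actual fares. A minimal sketch of the metric follows; it does not reproduce the model itself, and the toy fares are hypothetical.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all trips."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical fares in USD: absolute errors are 2, 2 and 3
toy_mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

An MAE of 3.27 USD therefore means predictions are off by about 3.27 USD per trip on average, which can be read against the median fare of 12.5 USD.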
