# Ride Share Dataset

**Ride Share (Boston) Dataset Glossary**

|Variable|Description|
|---|---|
|distance|distance between source and destination, in miles|
|cab_type|Uber or Lyft|
|time_stamp|epoch time (in milliseconds) when the data was queried|
|destination|destination of the ride|
|source|the starting point of the ride|
|price|price estimated for the ride in USD|
|surge_multiplier|the multiplier by which the price was increased, default 1|
|id|unique identifier|
|product_id|Uber/Lyft identifier for the cab type|
|name|visible type of the cab, e.g. Uber Pool, UberXL|


# **Import Libraries and Dataset**

## Import Libraries

In [1]:
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Import Dataset

In [2]:
# Import the data as a pandas DataFrame.
# The path must point to cab_rides.csv (for example, inside a mounted Google Drive folder).
data = pd.read_csv('cab_rides.csv')

# **Data Check**

### View the first five and last five rows of the dataframe

### Determine the number of entries in the dataframe

### Check the data types for each entry

In [3]:
data

| |distance|cab_type|time_stamp|destination|source|price|surge_multiplier|id|product_id|name|
|---|---|---|---|---|---|---|---|---|---|---|
|0|0.440|Lyft|1544952607890|North Station|Haymarket Square|5.000|1.000|424553bb-7174-41ea-aeb4-fe06d4f4b9d7|lyft_line|Shared|
|1|0.440|Lyft|1543284023677|North Station|Haymarket Square|11.000|1.000|4bd23055-6827-41c6-b23b-3c491f24e74d|lyft_premier|Lux|
|2|0.440|Lyft|1543366822198|North Station|Haymarket Square|7.000|1.000|981a3613-77af-4620-a42a-0c0866077d1e|lyft|Lyft|
|3|0.440|Lyft|1543553582749|North Station|Haymarket Square|26.000|1.000|c2d88af2-d278-4bfd-a8d0-29ca77cc5512|lyft_luxsuv|Lux Black XL|
|4|0.440|Lyft|1543463360223|North Station|Haymarket Square|9.000|1.000|e0126e1f-8ca9-4f2e-82b3-50505a09db9a|lyft_plus|Lyft XL|
|...|...|...|...|...|...|...|...|...|...|...|
|693066|1.000|Uber|1543708385534|North End|West End|13.000|1.000|616d3611-1820-450a-9845-a9ff304a4842|6f72dfc5-27f1-42e8-84db-ccc7a75f6969|UberXL|
|693067|1.000|Uber|1543708385534|North End|West End|9.500|1.000|633a3fc3-1f86-4b9e-9d48-2b7132112341|55c66225-fbe7-4fd5-9072-eab1ece5e23e|UberX|
|693068|1.000|Uber|1543708385534|North End|West End|NaN|1.000|64d451d0-639f-47a4-9b7c-6fd92fbd264f|8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a|Taxi|
|693069|1.000|Uber|1543708385534|North End|West End|27.000|1.000|727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e|6d318bcc-22a3-4af6-bddd-b409bfce1546|Black SUV|


In [4]:
data.shape

(693071, 10)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 10 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   distance          693071 non-null  float64
 1   cab_type          693071 non-null  object 
 2   time_stamp        693071 non-null  int64  
 3   destination       693071 non-null  object 
 4   source            693071 non-null  object 
 5   price             637976 non-null  float64
 6   surge_multiplier  693071 non-null  float64
 7   id                693071 non-null  object 
 8   product_id        693071 non-null  object 
 9   name              693071 non-null  object 
dtypes: float64(3), int64(1), object(6)
memory usage: 52.9+ MB
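
The three checks listed under **Data Check** (first and last five rows, number of entries, data types) can also be requested one at a time. The snippet below is a minimal sketch on a small stand-in frame with made-up values, so it runs without `cab_rides.csv`; the name `sample` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values) so the snippet runs without cab_rides.csv.
sample = pd.DataFrame({
    "distance": [0.44, 0.44, 1.00, 1.00, 2.16, 2.92, 7.86],
    "cab_type": ["Lyft", "Lyft", "Uber", "Uber", "Lyft", "Uber", "Uber"],
    "price": [5.0, 11.0, 13.0, None, 9.0, 22.5, 27.0],
})

first_rows = sample.head(5)     # first five rows
last_rows = sample.tail(5)      # last five rows
n_rows, n_cols = sample.shape   # number of entries and number of columns
column_types = sample.dtypes    # data type of each column

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(column_types)
```

On the real data, the same calls would be made on `data` instead of `sample`.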


### Convert "time_stamp" to a datetime datatype

In [6]:
# Convert "time_stamp" (epoch milliseconds) to a datetime column named "date"
data['date'] = pd.to_datetime(data['time_stamp'], unit='ms')
data = data.drop(['time_stamp'], axis=1)
data

| |distance|cab_type|destination|source|price|surge_multiplier|id|product_id|name|date|
|---|---|---|---|---|---|---|---|---|---|---|
|0|0.440|Lyft|North Station|Haymarket Square|5.000|1.000|424553bb-7174-41ea-aeb4-fe06d4f4b9d7|lyft_line|Shared|2018-12-16 09:30:07.890|
|1|0.440|Lyft|North Station|Haymarket Square|11.000|1.000|4bd23055-6827-41c6-b23b-3c491f24e74d|lyft_premier|Lux|2018-11-27 02:00:23.677|
|2|0.440|Lyft|North Station|Haymarket Square|7.000|1.000|981a3613-77af-4620-a42a-0c0866077d1e|lyft|Lyft|2018-11-28 01:00:22.198|
|3|0.440|Lyft|North Station|Haymarket Square|26.000|1.000|c2d88af2-d278-4bfd-a8d0-29ca77cc5512|lyft_luxsuv|Lux Black XL|2018-11-30 04:53:02.749|
|4|0.440|Lyft|North Station|Haymarket Square|9.000|1.000|e0126e1f-8ca9-4f2e-82b3-50505a09db9a|lyft_plus|Lyft XL|2018-11-29 03:49:20.223|
|...|...|...|...|...|...|...|...|...|...|...|
|693066|1.000|Uber|North End|West End|13.000|1.000|616d3611-1820-450a-9845-a9ff304a4842|6f72dfc5-27f1-42e8-84db-ccc7a75f6969|UberXL|2018-12-01 23:53:05.534|
|693067|1.000|Uber|North End|West End|9.500|1.000|633a3fc3-1f86-4b9e-9d48-2b7132112341|55c66225-fbe7-4fd5-9072-eab1ece5e23e|UberX|2018-12-01 23:53:05.534|
|693068|1.000|Uber|North End|West End|NaN|1.000|64d451d0-639f-47a4-9b7c-6fd92fbd264f|8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a|Taxi|2018-12-01 23:53:05.534|
|693069|1.000|Uber|North End|West End|27.000|1.000|727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e|6d318bcc-22a3-4af6-bddd-b409bfce1546|Black SUV|2018-12-01 23:53:05.534|
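
As a quick check of the conversion, the sketch below applies `pd.to_datetime(..., unit="ms")` to two `time_stamp` values from the first table and reproduces the dates shown above. The time-zone step is an assumption: epoch times count from 1970-01-01 UTC, so if hour-of-day is meant in Boston local time, a conversion along these lines may be needed.

```python
import pandas as pd

# Two epoch-millisecond values taken from the time_stamp column shown earlier.
stamps = pd.Series([1544952607890, 1543284023677], name="time_stamp")

# unit="ms" tells pandas the integers count milliseconds since 1970-01-01 (UTC).
as_utc = pd.to_datetime(stamps, unit="ms")

# Assumption: Boston local time is wanted, so mark the values as UTC and convert.
as_boston = as_utc.dt.tz_localize("UTC").dt.tz_convert("America/New_York")

print(as_utc)
print(as_boston)
```

With this conversion, the first row moves from 09:30 UTC to 04:30 local time, which would shift every hour-of-day result by the UTC offset.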


In [7]:
# Bin datetime to hours.
# Note: the bin edges start at the earliest timestamp (03:40:46.318), not on the clock hour,
# so each row gets the hour of its bin's left edge (e.g. 09:30:07 is labelled 8:00AM).
# Rows outside the (left, right] bins (the earliest timestamp, and any after the last edge) become NaN.
data['hourly_bins'] = pd.cut(data['date'], bins=pd.date_range(start=data['date'].min(), end=data['date'].max(), freq='H'))
data['hours'] = data['hourly_bins'].apply(lambda x: x.left.hour)
data = data.drop('hourly_bins', axis=1)
# Map hour numbers 0-23 to 12-hour clock labels ("hour_labels" avoids shadowing the built-in map)
hour_labels = {float(h): f"{(h % 12) or 12}:00{'AM' if h < 12 else 'PM'}" for h in range(24)}
data["hours"] = data["hours"].replace(hour_labels)
data

| |distance|cab_type|destination|source|price|surge_multiplier|id|product_id|name|date|hours|
|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.440|Lyft|North Station|Haymarket Square|5.000|1.000|424553bb-7174-41ea-aeb4-fe06d4f4b9d7|lyft_line|Shared|2018-12-16 09:30:07.890|8:00AM|
|1|0.440|Lyft|North Station|Haymarket Square|11.000|1.000|4bd23055-6827-41c6-b23b-3c491f24e74d|lyft_premier|Lux|2018-11-27 02:00:23.677|1:00AM|
|2|0.440|Lyft|North Station|Haymarket Square|7.000|1.000|981a3613-77af-4620-a42a-0c0866077d1e|lyft|Lyft|2018-11-28 01:00:22.198|12:00AM|
|3|0.440|Lyft|North Station|Haymarket Square|26.000|1.000|c2d88af2-d278-4bfd-a8d0-29ca77cc5512|lyft_luxsuv|Lux Black XL|2018-11-30 04:53:02.749|4:00AM|
|4|0.440|Lyft|North Station|Haymarket Square|9.000|1.000|e0126e1f-8ca9-4f2e-82b3-50505a09db9a|lyft_plus|Lyft XL|2018-11-29 03:49:20.223|3:00AM|
|...|...|...|...|...|...|...|...|...|...|...|...|
|693066|1.000|Uber|North End|West End|13.000|1.000|616d3611-1820-450a-9845-a9ff304a4842|6f72dfc5-27f1-42e8-84db-ccc7a75f6969|UberXL|2018-12-01 23:53:05.534|11:00PM|
|693067|1.000|Uber|North End|West End|9.500|1.000|633a3fc3-1f86-4b9e-9d48-2b7132112341|55c66225-fbe7-4fd5-9072-eab1ece5e23e|UberX|2018-12-01 23:53:05.534|11:00PM|
|693068|1.000|Uber|North End|West End|NaN|1.000|64d451d0-639f-47a4-9b7c-6fd92fbd264f|8cf7e821-f0d3-49c6-8eba-e679c0ebcf6a|Taxi|2018-12-01 23:53:05.534|11:00PM|
|693069|1.000|Uber|North End|West End|27.000|1.000|727e5f07-a96b-4ad1-a2c7-9abc3ad55b4e|6d318bcc-22a3-4af6-bddd-b409bfce1546|Black SUV|2018-12-01 23:53:05.534|11:00PM|
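
In the table above, the ride at 09:30:07 is labelled 8:00AM and the ride at 02:00:23 is labelled 1:00AM. This is consistent with the bins being anchored at the earliest timestamp rather than at the top of each hour. An alternative that may be simpler is to read the clock hour directly with the `.dt.hour` accessor, which also leaves no rows unlabelled. The sketch below is illustrative only; `label_hour` is a hypothetical helper and the last timestamp is made up.

```python
import pandas as pd

# Three timestamps from the table above plus one made-up midnight case.
dates = pd.Series(pd.to_datetime([
    "2018-12-16 09:30:07.890",
    "2018-11-27 02:00:23.677",
    "2018-12-01 23:53:05.534",
    "2018-11-28 00:15:00.000",  # hypothetical value, not from the dataset
]))

# Clock hour (0-23) of each timestamp, with no binning involved.
hour_of_day = dates.dt.hour


def label_hour(hour: int) -> str:
    """Format an hour number 0-23 as a 12-hour clock label such as '9:00AM'."""
    suffix = "AM" if hour < 12 else "PM"
    return f"{(hour % 12) or 12}:00{suffix}"


hour_labels = hour_of_day.map(label_hour)
print(pd.DataFrame({"date": dates, "hour": hour_of_day, "label": hour_labels}))
```

Because this changes how rows are assigned to hours, counts computed from it would not match the binned outputs in this notebook.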


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 693071 entries, 0 to 693070
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   distance          693071 non-null  float64       
 1   cab_type          693071 non-null  object        
 2   destination       693071 non-null  object        
 3   source            693071 non-null  object        
 4   price             637976 non-null  float64       
 5   surge_multiplier  693071 non-null  float64       
 6   id                693071 non-null  object        
 7   product_id        693071 non-null  object        
 8   name              693071 non-null  object        
 9   date              693071 non-null  datetime64[ns]
 10  hours             691978 non-null  object        
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 58.2+ MB


### Check for missing values

In [9]:
data.isnull().sum()

distance                0
cab_type                0
destination             0
source                  0
price               55095
surge_multiplier        0
id                      0
product_id              0
name                    0
date                    0
hours                1093
dtype: int64
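
The `price` column is missing in 55095 of 693071 rows, roughly 8% of the data. In the sample rows shown earlier, the one row without a price is a `Taxi` product, so it may be worth checking whether the gaps cluster in particular products. The sketch below shows one way to do that on a small made-up frame; the values are assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with hypothetical values.
rides = pd.DataFrame({
    "name": ["UberXL", "UberX", "Taxi", "Black SUV", "Taxi", "Shared"],
    "price": [13.0, 9.5, np.nan, 27.0, np.nan, 5.0],
})

# Number of missing prices per product name.
missing_by_name = rides["price"].isna().groupby(rides["name"]).sum()

# Overall percentage of rows without a price.
missing_share = rides["price"].isna().mean() * 100

# Rows that can be used for price-based questions.
priced = rides.dropna(subset=["price"])

print(missing_by_name)
print(round(missing_share, 3))
print(len(priced))
```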

### Check for duplicate values and remove duplicate values if any exist

In [10]:
data.duplicated().sum()

0
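
No duplicates were found here, so nothing needs to be removed. Had the count been non-zero, the removal step named in the heading could look like the sketch below, which uses a small made-up frame. Checking the `id` column on its own is an extra assumption about what counts as a duplicate.

```python
import pandas as pd

# Stand-in frame with one fully repeated row (hypothetical values).
rides = pd.DataFrame({
    "id": ["a1", "b2", "b2", "c3"],
    "price": [5.0, 11.0, 11.0, 7.0],
})

# Rows that repeat an earlier row in every column.
n_duplicates = rides.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = rides.drop_duplicates().reset_index(drop=True)

# Optional: repeated identifiers, even if other columns differ.
n_repeated_ids = rides["id"].duplicated().sum()

print(n_duplicates, len(deduplicated), n_repeated_ids)
```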

### Create a statistical summary of the numerical data

In [11]:
data.describe().T

| |count|mean|min|25%|50%|75%|max|std|
|---|---|---|---|---|---|---|---|---|
|distance|693071.0|2.189|0.020|1.280|2.160|2.920|7.860|1.139|
|price|637976.0|16.545|2.500|9.000|13.500|22.500|97.500|9.324|
|surge_multiplier|693071.0|1.014|1.000|1.000|1.000|1.000|3.000|0.092|
|date|693071.0|2018-12-05 21:35:09.764357120|2018-11-26 03:40:46.318000|2018-11-28 22:26:08.356499968|2018-12-02 07:57:57.528999936|2018-12-14 22:45:08.976499968|2018-12-18 19:15:10.943000|NaN|


# **Initial Analysis**

### Question 1: During which hours of the day do the majority of individuals utilize ride-sharing services?

In [21]:
data["hours"].value_counts(normalize = True)*100

hours
11:00PM   4.665
10:00PM   4.587
9:00AM    4.504
10:00AM   4.391
11:00AM   4.391
12:00PM   4.391
1:00PM    4.391
3:00PM    4.391
4:00PM    4.391
5:00PM    4.391
2:00PM    4.391
12:00AM   4.371
3:00AM    4.250
1:00AM    4.171
9:00PM    4.096
6:00PM    3.985
8:00PM    3.940
2:00AM    3.915
5:00AM    3.881
7:00PM    3.845
8:00AM    3.764
6:00AM    3.695
4:00AM    3.685
7:00AM    3.520
Name: proportion, dtype: float64
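
In the output above the shares range from about 3.5% to 4.7%, so no single hour holds a majority; the question comes down to which hours rank highest. When the hour is kept as a number instead of a text label, the same table can be sorted in clock order and the busiest hour read off directly. The sketch below uses made-up hour values.

```python
import pandas as pd

# Hypothetical hour-of-day values (0-23), one per row.
hours = pd.Series([23, 23, 22, 9, 9, 23, 7, 22], name="hour")

# Percentage of rows in each hour, most frequent first.
share = hours.value_counts(normalize=True).mul(100)

# The same percentages in clock order, which is easier to read or plot.
by_clock = share.sort_index()

# Hour with the largest share.
busiest_hour = share.idxmax()

print(by_clock)
print(busiest_hour)
```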

### Question 2: Which ride-sharing service generated the highest total revenue (sum of estimated ride prices) within the 2018 timeframe? Additionally, what was the average ride price for each of the ride-sharing services?

In [13]:
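# Total and mean price per service for rows dated 2018 (missing prices are skipped by sum and mean)
data[data['date'].dt.year == 2018].groupby(['cab_type'], as_index=False)['price'].agg(['sum','mean'])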

| |cab_type|sum|mean|
|---|---|---|---|
|0|Lyft|5333957.98|17.351|
|1|Uber|5221435.0|15.795|
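
Both `sum` and `mean` skip missing prices, so the totals above cover only rows that have a price. A named aggregation can make that visible by adding a count of priced rows next to the total and the average. This is a sketch on made-up values; the output column names `total`, `average`, and `priced_rides` are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Stand-in frame with hypothetical values, including one missing price.
rides = pd.DataFrame({
    "cab_type": ["Lyft", "Lyft", "Uber", "Uber", "Uber"],
    "price": [5.0, 11.0, 13.0, np.nan, 27.0],
    "date": pd.to_datetime([
        "2018-12-16", "2018-11-27", "2018-12-01", "2018-12-01", "2018-12-01",
    ]),
})

# Keep rows dated 2018, then aggregate price per service.
in_2018 = rides[rides["date"].dt.year == 2018]
summary = in_2018.groupby("cab_type")["price"].agg(
    total="sum",           # NaN prices are ignored
    average="mean",        # NaN prices are ignored
    priced_rides="count",  # number of non-missing prices
)

print(summary)
```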


### Question 3: What is the average price per mile for rides with Uber and those with Lyft in Boston?

In [14]:
# Price per mile for each row; rows with a missing price stay NaN and are skipped by mean()
data["price/mile"] = data["price"] / data["distance"]
data.groupby(['cab_type'], as_index=False)['price/mile'].agg(['mean'])

| |cab_type|mean|
|---|---|---|
|0|Lyft|9.683|
|1|Uber|9.692|
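
The figure above is the mean of each row's own price-per-mile ratio, which gives very short rides (the minimum distance is 0.02 miles) a large influence. A different reading of "average price per mile" is total price divided by total distance. The sketch below computes both on made-up values to show that they can differ; neither is presented as the intended definition.

```python
import numpy as np
import pandas as pd

# Stand-in frame with hypothetical values, including one missing price.
rides = pd.DataFrame({
    "cab_type": ["Lyft", "Lyft", "Uber", "Uber", "Uber"],
    "distance": [0.5, 4.0, 1.0, 2.0, 3.0],
    "price": [5.0, 12.0, 8.0, 10.0, np.nan],
})

# Drop unpriced rows so their distance does not enter the totals.
priced = rides.dropna(subset=["price"])

# Approach used in the notebook: average the per-row ratios.
mean_of_ratios = (priced["price"] / priced["distance"]).groupby(priced["cab_type"]).mean()

# Alternative: total price divided by total distance per service.
totals = priced.groupby("cab_type")[["price", "distance"]].sum()
ratio_of_totals = totals["price"] / totals["distance"]

print(mean_of_ratios)
print(ratio_of_totals)
```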


### Question 4: How many more miles did Uber drivers drive than Lyft drivers? What were the average distances travelled by both services?

In [15]:
# Difference in total distance (miles) between Uber rows and Lyft rows
diff = data[data['cab_type'] == 'Uber']['distance'].sum() - data[data['cab_type'] == 'Lyft']['distance'].sum()
data.groupby(['cab_type'], as_index=False)['distance'].agg(['sum','mean'])


| |cab_type|sum|mean|
|---|---|---|---|
|0|Lyft|672293.79|2.187|
|1|Uber|845136.48|2.191|


In [16]:
print('Uber drivers drove',int(diff),"more miles than Lyft drivers in total.")

Uber drivers drove 172842 more miles than Lyft drivers in total.
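
The mean distances are almost identical (2.187 vs 2.191 miles), so the gap in total miles mostly reflects Uber having more rows in the dataset. The difference can also be read from the grouped table instead of filtering twice, and adding a row count makes the reason for the gap visible. A sketch on made-up values:

```python
import pandas as pd

# Stand-in frame with hypothetical values.
rides = pd.DataFrame({
    "cab_type": ["Lyft", "Uber", "Uber", "Lyft", "Uber"],
    "distance": [0.44, 1.00, 2.16, 2.92, 7.86],
})

# Total miles, mean miles, and number of rows per service.
by_service = rides.groupby("cab_type")["distance"].agg(["sum", "mean", "count"])

# Difference in total miles, taken from the grouped table.
extra_miles = by_service.loc["Uber", "sum"] - by_service.loc["Lyft", "sum"]

print(by_service)
print(round(extra_miles, 2))
```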


### Question 5: Which departure -> destination combo is the most popular?

In [17]:
# Share of each destination within each source (the proportions sum to 1 per source)
data.groupby(['source'], as_index=False)['destination'].value_counts(normalize=True)

| |source|destination|proportion|
|---|---|---|---|
|0|Back Bay|North End|0.177|
|1|Back Bay|Haymarket Square|0.166|
|2|Back Bay|Northeastern University|0.166|
|3|Back Bay|South Station|0.164|
|4|Back Bay|Fenway|0.164|
|...|...|...|...|
|67|West End|Boston University|0.173|
|68|West End|South Station|0.166|
|69|West End|Northeastern University|0.166|
|70|West End|North End|0.160|
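
The proportions above are computed within each source, so they show the favourite destination per starting point but do not rank source and destination pairs against each other. One way to find the single most frequent pair is to count rows per pair and sort. The sketch below uses made-up rows; the pair it finds is not a result from the dataset.

```python
import pandas as pd

# Stand-in frame with hypothetical rides.
rides = pd.DataFrame({
    "source": ["Back Bay", "Back Bay", "West End", "West End",
               "West End", "Back Bay", "Back Bay"],
    "destination": ["North End", "Fenway", "North End", "North End",
                    "Fenway", "North End", "North End"],
})

# Number of rows for every (source, destination) pair, largest first.
pair_counts = (
    rides.groupby(["source", "destination"])
    .size()
    .sort_values(ascending=False)
)

# The index of the largest count is a (source, destination) tuple.
top_source, top_destination = pair_counts.idxmax()

print(pair_counts)
print(top_source, "->", top_destination)
```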
