# Ride Share Dataset

**Ride Share (Boston) Dataset Glossary**

|Variable|Description|
|---|---|
|Distance|Distance Between Source and Destination|
|cab_type|Uber or Lyft|
|time_stamp|epoch time when data was queried|
|destination|destination of the ride|
|source|the starting point of the ride|
|price|price estimated for the ride in USD|
|surge_multiplier|the multiplier by which price was increased, default 1|
|id|unique identifier|
|product_id|uber/lyft identifier for cab-type|
|name|visible type of the cab, e.g. Uber Pool, UberXL|


# **Import Libraries and Dataset**

## Import Libraries

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to restrict the float value to 3 decimal places
pd.set_option('display.float_format', lambda x: '%.3f' % x)

## Import Dataset

#import data as a pandas dataframe

# Access the .csv file in Google Drive folder. The file path must be correct
data = pd.read_csv('cab_rides.csv')

# **Data Check**

### View the first five and last five rows of the dataframe

### Determine the number of entries in the dataframe

### Check the data types for each entry
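
The three checks above can be sketched as follows. This is a minimal illustration on a small hypothetical frame named `toy` (its values are invented); in the notebook the same calls would be made on `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are invented.
toy = pd.DataFrame({
    "distance": [0.44, 1.08, 2.30, 3.10, 0.90, 1.50],
    "cab_type": ["Lyft", "Uber", "Uber", "Lyft", "Uber", "Lyft"],
    "price": [5.0, 11.0, 16.5, 22.5, 9.5, 10.5],
})

first_rows = toy.head()      # first five rows
last_rows = toy.tail()       # last five rows
n_rows, n_cols = toy.shape   # number of entries (rows) and columns
column_types = toy.dtypes    # data type of each column

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
print(column_types)
```

`toy.info()` is an alternative that reports the row count, the data types, and the non-null counts in one call.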

Convert "time_stamp" to a datetime datatype

# Convert "time_stamp" (epoch milliseconds) to a datetime datatype
data['date'] = pd.to_datetime(data['time_stamp'],unit='ms')
data = data.drop(['time_stamp'], axis=1)
data

# Bin datetime to hours: take the hour of day (0-23) directly from each timestamp
data['hours'] = data['date'].dt.hour
# Map each hour to a 12-hour clock label, e.g. 0 -> "12:00AM", 13 -> "1:00PM"
hour_labels = {h: f"{(h % 12) or 12}:00{'AM' if h < 12 else 'PM'}" for h in range(24)}
data['hours'] = data['hours'].map(hour_labels)
data

### Check for missing values
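
One possible way to run this check is to count the missing entries per column. The sketch below uses a hypothetical frame `toy` with invented values; in the notebook the calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
toy = pd.DataFrame({
    "distance": [0.44, 1.08, 2.30, 3.10],
    "price": [5.0, np.nan, 16.5, np.nan],
})

missing_counts = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```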

### Check for duplicate values and remove duplicate values if any exist
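
A minimal sketch of the duplicate check, assuming a duplicate means a row that repeats an earlier row in every column. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical.
toy = pd.DataFrame({
    "id": ["a1", "b2", "b2", "c3"],
    "price": [5.0, 11.0, 11.0, 16.5],
})

n_duplicates = toy.duplicated().sum()                   # rows repeating an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep the first occurrence only

print(n_duplicates)
print(deduped)
```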

### Create a statistical summary of the numerical data
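
The summary can be produced with `describe()`, which by default covers only the numerical columns of a mixed frame. The frame `toy` below is hypothetical; the transpose is a readability choice, not a requirement.

```python
import pandas as pd

# Hypothetical toy frame with two numerical columns and one text column.
toy = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0],
    "price": [10.0, 20.0, 30.0, 40.0],
    "cab_type": ["Uber", "Lyft", "Uber", "Lyft"],
})

# One row per numerical column: count, mean, std, min, quartiles, max
summary = toy.describe().T

print(summary)
```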

# **Initial Analysis**

### Question 1: During which hours of the day do the majority of individuals utilize ride-sharing services?
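
One way to approach this question is to count rides per hour of day and pick the hour with the highest count. The sketch assumes a datetime column named `date`, as created earlier in the notebook; the frame `toy` and its timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the timestamps are invented.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-11-26 08:15", "2018-11-26 17:05", "2018-11-27 17:40",
        "2018-11-28 17:59", "2018-11-28 23:10",
    ])
})

# Rides per hour of day (0-23), ordered by hour
rides_per_hour = toy["date"].dt.hour.value_counts().sort_index()
busiest_hour = rides_per_hour.idxmax()

print(rides_per_hour)
print(busiest_hour)
```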

### Question 2: Which ride-sharing service generated the highest total revenue (sum of ride prices) within the 2018 timeframe? Additionally, what was the average ride price for each of the ride-sharing services?
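
A possible sketch for this question groups by `cab_type` and aggregates `price`. The dataset records prices only, so the total here is the sum of ride prices per service. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the prices are invented.
toy = pd.DataFrame({
    "cab_type": ["Uber", "Uber", "Lyft", "Lyft", "Lyft"],
    "price": [10.0, 20.0, 12.0, 18.0, 30.0],
})

# Total and average ride price per service
by_service = toy.groupby("cab_type")["price"].agg(total="sum", average="mean")
top_service = by_service["total"].idxmax()

print(by_service)
print(top_service)
```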

### Question 3: What is the average price per mile for rides with Uber and those with Lyft in Boston?
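
"Average price per mile" can be read in two ways, and the sketch below shows both as an assumption rather than a prescribed answer: total price divided by total distance, or the mean of each ride's own price-to-distance ratio. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; distances are in miles, prices in USD.
toy = pd.DataFrame({
    "cab_type": ["Uber", "Uber", "Lyft", "Lyft"],
    "distance": [1.0, 3.0, 2.0, 2.0],
    "price": [8.0, 12.0, 10.0, 14.0],
})

# Reading 1: total price / total distance per service
totals = toy.groupby("cab_type")[["price", "distance"]].sum()
price_per_mile = totals["price"] / totals["distance"]

# Reading 2: mean of the per-ride price/distance ratios per service
per_ride_ratio = (toy["price"] / toy["distance"]).groupby(toy["cab_type"]).mean()

print(price_per_mile)
print(per_ride_ratio)
```

The two readings differ when short rides carry a high per-mile price, so it helps to state which one is reported.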

### Question 4: How many more miles did Uber drivers drive than Lyft drivers? What were the average distances travelled by both services?
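
A minimal sketch for this question sums and averages `distance` per service, then subtracts the totals. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; distances are invented.
toy = pd.DataFrame({
    "cab_type": ["Uber", "Uber", "Uber", "Lyft", "Lyft"],
    "distance": [1.0, 2.5, 3.0, 2.0, 1.5],
})

# Total and average distance per service
distance_by_service = toy.groupby("cab_type")["distance"].agg(total="sum", average="mean")
extra_uber_miles = (
    distance_by_service.loc["Uber", "total"] - distance_by_service.loc["Lyft", "total"]
)

print(distance_by_service)
print(extra_uber_miles)
```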

### Question 5: Which departure -> destination combo is the most popular?
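
One way to find the most popular pair is to count rows per (`source`, `destination`) combination. The frame `toy` and its place names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the place names are illustrative.
toy = pd.DataFrame({
    "source": ["North End", "North End", "Fenway", "North End"],
    "destination": ["Back Bay", "Back Bay", "Beacon Hill", "Fenway"],
})

# Number of rides per (source, destination) pair, most frequent first
pair_counts = (
    toy.groupby(["source", "destination"]).size().sort_values(ascending=False)
)
top_pair = pair_counts.idxmax()   # a (source, destination) tuple

print(pair_counts)
print(top_pair)
```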

# **Exploratory Data Analysis**

## **Univariate Analysis**

### Question 6: Plot Histograms and Box Plots For Each Column With Numerical Values
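
A small helper can draw both plots for one column, so the same call is reused for each numerical column. The function name `hist_and_box` and the frame `toy` are hypothetical, and the sketch uses matplotlib directly; seaborn's `histplot` and `boxplot` would serve the same purpose.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def hist_and_box(df, column, bins=20):
    """Draw a histogram and a box plot of one numerical column side by side."""
    values = df[column].dropna().to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(values, bins=bins)
    ax_hist.set_xlabel(column)
    ax_hist.set_ylabel("count")
    ax_box.boxplot(values)
    ax_box.set_ylabel(column)
    fig.suptitle(f"Distribution of {column}")
    return fig


# Hypothetical toy frame; the values are invented.
toy = pd.DataFrame({
    "distance": [0.5, 1.2, 2.3, 3.1, 1.8],
    "price": [7.0, 10.5, 16.5, 22.5, 13.0],
    "surge_multiplier": [1.0, 1.0, 1.25, 1.0, 1.5],
})

figures = {col: hist_and_box(toy, col, bins=5) for col in toy.columns}
```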

#### Distance

#### Price

#### Surge Multiplier

### Question 7: Plot CountPlots For Columns With Categorical Values
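
A count plot shows one bar per category with the number of rows in that category. The sketch below builds it from `value_counts()` with matplotlib; `sns.countplot` is the seaborn equivalent. The function name `count_plot` and the frame `toy` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd


def count_plot(df, column):
    """Draw one bar per category of `column`, most frequent category first."""
    counts = df[column].value_counts()
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(counts.index.astype(str), counts.to_numpy())
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    ax.tick_params(axis="x", rotation=45)
    return ax


# Hypothetical toy frame; the values are invented.
toy = pd.DataFrame({"cab_type": ["Uber", "Lyft", "Uber"]})
ax = count_plot(toy, "cab_type")
```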

#### Cab Type

#### Source

#### Destination

#### Name

#### Trip (Source to Destination)


- Create a new column called "trip"
- Assign the new column a string composed of the following concatenation: **"source"** value + **" --> "** + **"destination"** value
- Create a countplot for the new column

Create new column and view table.
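
The concatenation described in the steps above can be sketched with element-wise string addition. The frame `toy` and its place names are hypothetical; in the notebook the assignment would be made on `data`.

```python
import pandas as pd

# Hypothetical toy frame; the place names are illustrative.
toy = pd.DataFrame({
    "source": ["North End", "Fenway"],
    "destination": ["Back Bay", "Beacon Hill"],
})

# "source" value + " --> " + "destination" value, row by row
toy["trip"] = toy["source"] + " --> " + toy["destination"]

print(toy)
```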

Create countplot of the "trip" column

#### Hours



Separate the hour in which the ride took place from the date time format
- Google how to extract the hour from the date time format
- Create a new column, "hour", and assign it the hour for each trip.
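
For a column that already has a datetime dtype, the `.dt` accessor exposes the hour as an integer from 0 to 23. The sketch assumes the datetime column is named `date`, as created earlier in the notebook; the frame `toy` and its timestamps are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; the timestamps are invented.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-11-26 08:15:00", "2018-11-26 17:05:30", "2018-11-28 23:10:45",
    ])
})

# Hour of day (0-23) for each trip
toy["hour"] = toy["date"].dt.hour

print(toy)
```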

View the updated dataframe

Create a countplot, histogram, and boxplot for the new "hour" column

## **Bivariate Analysis**

### Question 8: Create A Heatmap For Columns With Numerical Values
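
A heatmap for numerical columns is usually drawn from their correlation matrix. The sketch below assumes Pearson correlation (the pandas default) and renders it with matplotlib; `sns.heatmap(corr, annot=True)` would be the seaborn route. The frame `toy` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs as a script; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame; the values are invented.
toy = pd.DataFrame({
    "distance": [1.0, 2.0, 3.0, 4.0],
    "price": [10.0, 20.0, 30.0, 40.0],
    "surge_multiplier": [1.0, 1.0, 1.5, 1.0],
    "cab_type": ["Uber", "Lyft", "Uber", "Lyft"],
})

# Pairwise correlation between the numerical columns only
corr = toy.select_dtypes(include="number").corr()

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson correlation")
```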

## **Treating Missing/Erroneous Values**

### Remove Erroneous Values
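
What counts as erroneous depends on what the exploratory analysis reveals. As an assumption for illustration only, the sketch below treats non-positive distances and non-positive prices as errors, and keeps rows with a missing price so they can still be imputed afterwards. The frame `toy` is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented.
toy = pd.DataFrame({
    "distance": [1.2, -0.5, 2.0, 3.1],
    "price": [9.0, 12.0, np.nan, 0.0],
})

# Assumed rules: distance must be positive; price must be positive or missing
valid_distance = toy["distance"] > 0
valid_price = toy["price"].isna() | (toy["price"] > 0)

cleaned = toy[valid_distance & valid_price].reset_index(drop=True)

print(cleaned)
```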

### Impute Missing Values
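
One common strategy, shown here as an assumption rather than the required method, is to fill a missing price with the median price of the same cab `name`, falling back to the overall median when a group has no observed price. The frame `toy`, its values, and the column name `price_imputed` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; "Other" has no observed price at all.
toy = pd.DataFrame({
    "name": ["UberXL", "UberXL", "UberXL", "Uber Pool", "Uber Pool", "Other"],
    "price": [8.0, np.nan, 10.0, 20.0, np.nan, np.nan],
})

# Median price within each cab name, aligned to the original rows
group_median = toy.groupby("name")["price"].transform("median")

# Fill with the group median first, then with the overall median as a fallback
toy["price_imputed"] = (
    toy["price"].fillna(group_median).fillna(toy["price"].median())
)

print(toy)
```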