Business Understanding

You may have some experience of travelling to and from the airport. Have you ever used Uber or any other cab service for this travel? Did you at any time face the problem of cancellation by the driver or non-availability of cars?
These problems do not only affect customers; they also hurt Uber's business. When drivers cancel ride requests or no cars are available, Uber loses revenue.
As an analyst, you decide to address the problem Uber is facing - driver cancellation and non-availability of cars leading to loss of potential revenue.

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

# Import the numpy, pandas, matplotlib, seaborn and datetime packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

**EXPLORATORY DATA ANALYSIS**

**STEPS:**  
1) Data Cleaning
2) Understanding the Data
3) Univariate Analysis
4) Bivariate Analysis
5) Deriving New Metrics
6) Graphical Analysis

**STEP-1: DATA CLEANING**

In [None]:
#Importing & Reading the Data
df=pd.read_csv("../input/uber-request-data/Uber Request Data.csv")
df

In [None]:
# Correcting the data types (dayfirst=True assumes the raw strings are written day/month/year)
df['Request timestamp'] = pd.to_datetime(df['Request timestamp'], dayfirst=True)
df['Drop timestamp'] = pd.to_datetime(df['Drop timestamp'], dayfirst=True)
df.head()
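
If the raw timestamp strings mix two layouts (for example `11/7/2016 11:51` next to `13-07-2016 08:33:16`), a single `pd.to_datetime` call can misread or reject some rows, depending on the pandas version. A minimal sketch of a more explicit approach, assuming exactly those two day-first layouts; the helper name `parse_mixed` and the sample values are illustrative and not part of the notebook:

```python
import pandas as pd

def parse_mixed(series):
    """Parse strings that follow one of two assumed day-first layouts."""
    # Assumed layout 1: slashes, no seconds (e.g. 11/7/2016 11:51)
    slash = pd.to_datetime(series, format="%d/%m/%Y %H:%M", errors="coerce")
    # Assumed layout 2: dashes, with seconds (e.g. 13-07-2016 08:33:16)
    dash = pd.to_datetime(series, format="%d-%m-%Y %H:%M:%S", errors="coerce")
    # Keep whichever layout parsed; rows matching neither remain NaT
    return slash.fillna(dash)

# Made-up sample values, including a missing entry
sample = pd.Series(["11/7/2016 11:51", "13-07-2016 08:33:16", None])
parsed = parse_mixed(sample)
print(parsed)
```

Parsing each layout with an explicit format avoids relying on format inference, at the cost of having to know the layouts in advance.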

In [None]:
# Removing unnecessary columns
df = df.drop(['Driver id'], axis = 1)

In [None]:
df.tail()

**STEP-2: UNDERSTANDING THE DATA**
1. How many unique pickup points are present in the dataset?
2. How many observations are present in the dataset?
3. How many null values are there?
4. Inspecting the null values

In [None]:
# Which unique pickup points are present in the dataset?
print(df['Pickup point'].unique())

In [None]:
# How many observations are present in the dataset?
df.shape

In [None]:
df.info()

In [None]:
# Inspecting the null values, column-wise
df.isnull().sum(axis=0)

In [None]:
df[(df['Drop timestamp'].isnull())].groupby('Status').size()

NOTE:
The cell above shows that Drop timestamp is empty when the Status is No Cars Available or Cancelled. Since no trip took place in those cases, there is no drop time to record, so these null values are valid.
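
The same check can be made explicit by cross-tabulating Status against a missing-value flag and confirming that no completed trip lacks a drop time. A minimal sketch on made-up rows; the frame `toy` and the names `null_by_status` and `nulls_are_valid` are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Toy data standing in for the real request table (values are made up)
toy = pd.DataFrame({
    "Status": ["Trip Completed", "Cancelled", "No Cars Available", "Trip Completed"],
    "Drop timestamp": pd.to_datetime(
        ["2016-07-11 13:00", None, None, "2016-07-11 09:58"]
    ),
})

# Rows = Status, columns = whether the drop timestamp is missing
null_by_status = pd.crosstab(toy["Status"], toy["Drop timestamp"].isnull())
print(null_by_status)

# True when no completed trip is missing its drop timestamp
completed = toy["Status"] == "Trip Completed"
nulls_are_valid = not toy.loc[completed, "Drop timestamp"].isnull().any()
print(nulls_are_valid)
```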

In [None]:
print(len(df['Request id'].unique()))
print(len(df['Pickup point'].unique()))
print(len(df['Status'].unique()))

In [None]:
# Checking if there are any duplicate values
len(df[df.duplicated()].index)

**STEP-3: UNIVARIATE ANALYSIS**

In [None]:
# Univariate analysis on Status column 
status = pd.crosstab(index = df["Status"], columns="count")     
status.plot.bar()

**INSIGHTS:** Univariate analysis of the Status column:
There are more No Cars Available requests than Cancelled requests.

In [None]:
#Univariate analysis on Pickup Point column 
pick_point = pd.crosstab(index = df["Pickup point"], columns="count")     
pick_point.plot.bar()

**INSIGHTS:** Univariate analysis of the Pickup point column:
Airport and City appear as pickup points almost equally often in the dataset.

**STEP-4: BIVARIATE ANALYSIS**

In [None]:
# grouping by Status and Pickup point.
df.groupby(['Status', 'Pickup point']).size()

In [None]:
# Visualizing the count of each Status per Pickup point
sns.countplot(x='Pickup point', hue='Status', data=df)

**INSIGHTS: Bivariate analysis of the Status and Pickup point columns:**
*     No Cars Available occurs more often for requests from Airport to City.
*     Cancellations occur more often for requests from City to Airport.
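
Raw counts can hide differences when the two pickup points receive different numbers of requests, so comparing shares within each pickup point can help. A minimal sketch using `pd.crosstab` with `normalize="index"` on made-up rows; the frame `toy` and the name `share` are illustrative assumptions:

```python
import pandas as pd

# Made-up requests, only to illustrate the calculation
toy = pd.DataFrame({
    "Pickup point": ["Airport", "Airport", "Airport", "City", "City", "City", "City"],
    "Status": ["No Cars Available", "No Cars Available", "Trip Completed",
               "Cancelled", "Cancelled", "Trip Completed", "No Cars Available"],
})

# normalize="index" turns each row into proportions that sum to 1
share = pd.crosstab(toy["Pickup point"], toy["Status"], normalize="index")
print(share.round(2))
```

Each row then reads as the fraction of requests from that pickup point that ended in each status.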

**STEP-5: DERIVING NEW METRICS**

In [None]:
# Request hour
df['Request Hour'] = df['Request timestamp'].dt.hour

In [None]:
# Time slots (hours 0-4 keep the default 'Early Morning'; between() is inclusive on both ends by default)
df['Request Time Slot'] = 'Early Morning'
df.loc[df['Request Hour'].between(5, 8), 'Request Time Slot'] = 'Morning'
df.loc[df['Request Hour'].between(9, 12), 'Request Time Slot'] = 'Late Morning'
df.loc[df['Request Hour'].between(13, 16), 'Request Time Slot'] = 'Noon'
df.loc[df['Request Hour'].between(17, 20), 'Request Time Slot'] = 'Evening'
df.loc[df['Request Hour'].between(21, 23), 'Request Time Slot'] = 'Night'
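
The same slots can also be assigned in one step with `pd.cut`, which makes the boundaries explicit and rules out overlapping ranges. A minimal sketch on made-up hour values, assuming hour 21 belongs to Night; the names `hours`, `bins`, `labels` and `slots` are illustrative:

```python
import pandas as pd

# Made-up hours covering both edges of every slot
hours = pd.Series([0, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 23])

# Right-closed bins: (-1, 4], (4, 8], (8, 12], (12, 16], (16, 20], (20, 23]
bins = [-1, 4, 8, 12, 16, 20, 23]
labels = ["Early Morning", "Morning", "Late Morning", "Noon", "Evening", "Night"]

# Each hour falls into exactly one bin, so no slot can overwrite another
slots = pd.cut(hours, bins=bins, labels=labels)
print(pd.DataFrame({"hour": hours, "slot": slots}))
```

The result is a categorical column whose category order follows `labels`, which also keeps the slots in chronological order on plots.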

In [None]:
# Demand covers every request (completed, cancelled or no cars available), so each row counts as 1
df['Demand'] = 1

In [None]:
# Supply covers only completed trips: 1 for Trip Completed, 0 otherwise
df['Supply'] = 0
df.loc[(df['Status'] == 'Trip Completed'),'Supply'] = 1

In [None]:
# The demand-supply gap is the difference between Demand and Supply (0 or 1 per request),
# mapped to a readable label in one step so the column never mixes integers and strings
df['Gap'] = (df['Demand'] - df['Supply']).map({0: 'Trip Completed', 1: 'Trip Not Completed'})
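
Demand, supply and gap can also be summarised per group as numbers, which is handy for comparing time slots directly. A minimal sketch on made-up rows, keeping the definitions used above (demand is every request, supply is completed trips only); the frame `toy` and the column names `demand`, `supply`, `gap` and `gap_share` are illustrative assumptions:

```python
import pandas as pd

# Made-up requests, only to illustrate the calculation
toy = pd.DataFrame({
    "Request Time Slot": ["Morning", "Morning", "Morning", "Evening", "Evening"],
    "Status": ["Cancelled", "Trip Completed", "Cancelled",
               "No Cars Available", "Trip Completed"],
})

# Supply flag: 1 for a completed trip, 0 otherwise
toy["Supply"] = (toy["Status"] == "Trip Completed").astype(int)

# Per slot: demand = number of requests, supply = number of completed trips
summary = toy.groupby("Request Time Slot")["Supply"].agg(demand="count", supply="sum")
summary["gap"] = summary["demand"] - summary["supply"]
summary["gap_share"] = summary["gap"] / summary["demand"]
print(summary)
```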

In [None]:
# Removing unnecessary columns
df = df.drop(['Request Hour', 'Demand', 'Supply'], axis=1)

In [None]:
df.head()

**STEP-6: GRAPHICAL ANALYSIS**

In [None]:
# Plot the count of each request status for the defined time slots
sns.countplot(x='Request Time Slot', hue='Status', data=df)

**INSIGHTS:**
* No Cars Available occurs most often in the Evening.
* Cancellations occur most often in the Morning.

In [None]:
# Plot to find the count of the status, according to both pickup point and the time slot
pickup_df = pd.DataFrame(df.groupby(['Pickup point','Request Time Slot', 'Status'])['Request id'].count().unstack(fill_value=0))
pickup_df.plot.bar()

**INSIGHTS:**
* No Cars Available occurs most often in the Evening, for requests from Airport to City.
* Cancellations occur most often in the Morning, for requests from City to Airport.
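
Reading the peak off a bar chart can be cross-checked numerically by asking which pickup point and time slot combination has the highest count for each status. A minimal sketch with `idxmax` on made-up rows; the frame `toy` and the names `counts` and `worst` are illustrative assumptions:

```python
import pandas as pd

# Made-up requests, only to illustrate the calculation
toy = pd.DataFrame({
    "Pickup point": ["Airport", "Airport", "Airport", "City", "City", "City"],
    "Request Time Slot": ["Evening", "Evening", "Morning", "Morning", "Morning", "Evening"],
    "Status": ["No Cars Available", "No Cars Available", "Trip Completed",
               "Cancelled", "Cancelled", "Trip Completed"],
})

# Rows = (pickup point, time slot), columns = Status, values = request counts
counts = (
    toy.groupby(["Pickup point", "Request Time Slot", "Status"])
       .size()
       .unstack(fill_value=0)
)

# For each Status column, the (pickup point, time slot) pair with the highest count
worst = counts.idxmax()
print(worst)
```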

In [None]:
# Plot the number of requests that were completed and not completed
sns.countplot(x='Gap', data=df)

**INSIGHTS:**
* More requests were not completed than completed.

In [None]:
# Plot the number of completed and not completed requests per time slot
gap_timeslot_df = pd.DataFrame(df.groupby(['Request Time Slot','Gap'])['Request id'].count().unstack(fill_value=0))
gap_timeslot_df.plot.bar()

In [None]:
# Plot the number of completed and not completed requests per pickup point
gap_pickup_df = pd.DataFrame(df.groupby(['Pickup point','Gap'])['Request id'].count().unstack(fill_value=0))
gap_pickup_df.plot.bar()

In [None]:
# Plot the number of completed and not completed requests per time slot and pickup point, for the final analysis
gap_main_df = pd.DataFrame(df.groupby(['Request Time Slot','Pickup point','Gap'])['Request id'].count().unstack(fill_value=0))
gap_main_df.plot.bar()

**Hypothesis:**

**Pickup Point - City:**
The analysis shows that the Morning time slot is the most problematic for City pickups, with a large number of cancelled requests. Drivers most likely cancel because of the morning office rush: seeing a distant airport destination, they may expect to earn more from shorter trips within the city.

**Pickup Point - Airport:**
For Airport pickups, the Evening time slot is the most problematic, with many requests ending in No Cars Available. The likely reason is that too few cars are at the airport at that time because most of them are busy serving trips inside the city.

**Conclusions:**
Based on the analysis above, Uber could use the following recommendations to bridge the gap between supply and demand:

* To close the gap from Airport to City, set up a permanent stand at the airport so that cabs are available at all times; this could significantly reduce the number of incomplete requests.
* Offer incentives to drivers who complete City to Airport trips in the morning, which may discourage them from cancelling these requests.
* Finally, a surer way to bring the gap down is to increase the number of cabs in the fleet.