# Context
Ridesharing is a service that arranges transportation on short notice. It is a very volatile market and its demand fluctuates wildly with time, place, weather, local events, etc. The key to being successful in this business is to be able to detect patterns in these fluctuations and cater to the demand at any given time.



# Objective
Uber Technologies, Inc. is an American multinational transportation network company based in San Francisco and has operations in over 785 metropolitan areas with over 110 million users worldwide. As a newly hired Data Scientist in Uber's New York Office, you have been given the task of extracting actionable insights from data that will help in the growth of the business.



# Key Questions
1 - What are the different variables that influence the number of pickups?

2 - Which factor affects the number of pickups the most?

3 - What could be the possible reasons for that?

4 - What are your recommendations to Uber management to capitalize on fluctuating demand?


# Guidelines
- Perform univariate analysis on the data to better understand the variables at your disposal.
- Perform bivariate analysis to better understand the correlation between different variables.
- Create visualizations to explore the data and extract insights.
- Create a presentation for Uber management detailing all the insights along with supporting data.


# Data
The data contains the details for the Uber rides across various boroughs (subdivisions) of New York City at an hourly level and attributes associated with weather conditions at that time.

- pickup_dt: Date and time of the pick-up.
- borough: NYC's borough.
- pickups: Number of pickups for the period (hourly).
- spd: Wind speed in miles/hour.
- vsb: Visibility in miles to the nearest tenth.
- temp: Temperature in Fahrenheit.
- dewp: Dew point in Fahrenheit.
- slp: Sea level pressure.
- pcp01: 1-hour liquid precipitation.
- pcp06: 6-hour liquid precipitation.
- pcp24: 24-hour liquid precipitation.
- sd: Snow depth in inches.
- hday: Being a holiday (Y) or not (N).

In [1]:
# Upload the CSV file from the local machine into the Colab session
from google.colab import files
uploaded = files.upload()

Saving Uber_Data.csv to Uber_Data.csv


In [2]:
#Import necessary libraries
import pandas as pd
import numpy as np
import io

# Libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Commands to help with graph displays
%matplotlib inline

# Display float values with 2 decimal places
pd.set_option("display.float_format", lambda x: "%.2f" % x)

In [3]:
# Get the dataset
df = pd.read_csv("Uber_Data.csv")

# Data Overview Process
The initial steps to get an overview of any dataset are to:

- Observe the first few rows of the dataset to check whether it has been loaded properly.

- Get information about the number of rows and columns in the dataset.

- Find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.

- Check the statistical summary of the dataset to get an overview of the numerical columns of the data

In [4]:
#First view of data
df.head(10)

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
0,01-01-2015 01:00,Bronx,152,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
1,01-01-2015 01:00,Brooklyn,1519,5.0,10.0,,7.0,1023.5,0.0,0.0,0.0,0.0,Y
2,01-01-2015 01:00,EWR,0,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
3,01-01-2015 01:00,Manhattan,5258,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
4,01-01-2015 01:00,Queens,405,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
5,01-01-2015 01:00,Staten Island,6,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
6,01-01-2015 01:00,,4,5.0,10.0,30.0,7.0,1023.5,0.0,0.0,0.0,0.0,Y
7,01-01-2015 02:00,Bronx,120,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y
8,01-01-2015 02:00,Brooklyn,1229,3.0,10.0,,6.0,1023.0,0.0,0.0,0.0,0.0,Y
9,01-01-2015 02:00,EWR,0,3.0,10.0,30.0,6.0,1023.0,0.0,0.0,0.0,0.0,Y


In [5]:
df.shape

(29101, 13)

# Insight
- The dataset has 29101 rows and 13 columns, which makes it a small to mid-sized dataset.

In [6]:
df.describe()

Unnamed: 0,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd
count,29101.0,29101.0,29101.0,28742.0,29101.0,29101.0,29101.0,29101.0,29101.0,29101.0
mean,490.22,5.98,8.82,47.9,30.82,1017.82,0.0,0.03,0.09,2.53
std,995.65,3.7,2.44,19.8,21.28,7.77,0.02,0.09,0.22,4.52
min,0.0,0.0,0.0,2.0,-16.0,991.4,0.0,0.0,0.0,0.0
25%,1.0,3.0,9.1,32.0,14.0,1012.5,0.0,0.0,0.0,0.0
50%,54.0,6.0,10.0,46.5,30.0,1018.2,0.0,0.0,0.0,0.0
75%,449.0,8.0,10.0,65.0,50.0,1022.9,0.0,0.0,0.05,2.96
max,7883.0,21.0,10.0,89.0,73.0,1043.4,0.28,1.24,2.1,19.0


In [7]:
df.describe(include = "all").T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
pickup_dt,29101.0,4343.0,01-01-2015 01:00,7.0,,,,,,,
borough,26058.0,6.0,Bronx,4343.0,,,,,,,
pickups,29101.0,,,,490.22,995.65,0.0,1.0,54.0,449.0,7883.0
spd,29101.0,,,,5.98,3.7,0.0,3.0,6.0,8.0,21.0
vsb,29101.0,,,,8.82,2.44,0.0,9.1,10.0,10.0,10.0
temp,28742.0,,,,47.9,19.8,2.0,32.0,46.5,65.0,89.0
dewp,29101.0,,,,30.82,21.28,-16.0,14.0,30.0,50.0,73.0
slp,29101.0,,,,1017.82,7.77,991.4,1012.5,1018.2,1022.9,1043.4
pcp01,29101.0,,,,0.0,0.02,0.0,0.0,0.0,0.0,0.28
pcp06,29101.0,,,,0.03,0.09,0.0,0.0,0.0,0.0,1.24


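# Insight

- There are 6 unique boroughs; Bronx is listed as the top value with 4343 rows.
- For pickups, the mean (490.22) is much larger than the median (54.0), so the distribution is skewed to the right.
- spd, temp, dewp and slp have similar means and medians. The precipitation and snow depth columns are mostly zero, so their means sit above their medians.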
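
The mean-versus-median comparison above can be backed by a direct skewness check. The following is a minimal sketch on hypothetical toy counts (the values in `toy_pickups` are made up for illustration and are not taken from Uber_Data.csv); the same calls should apply to `df['pickups']`.

```python
import pandas as pd

# Hypothetical toy counts: many small values and a few very large ones.
toy_pickups = pd.Series([0, 1, 2, 5, 50, 400, 3000])

mean_val = toy_pickups.mean()
median_val = toy_pickups.median()
skew_val = toy_pickups.skew()  # sample skewness; positive means a longer right tail

print(f"mean={mean_val:.2f}, median={median_val:.2f}, skew={skew_val:.2f}")
```

A mean well above the median together with a positive skewness value points to a right-skewed distribution.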

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29101 entries, 0 to 29100
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pickup_dt  29101 non-null  object 
 1   borough    26058 non-null  object 
 2   pickups    29101 non-null  int64  
 3   spd        29101 non-null  float64
 4   vsb        29101 non-null  float64
 5   temp       28742 non-null  float64
 6   dewp       29101 non-null  float64
 7   slp        29101 non-null  float64
 8   pcp01      29101 non-null  float64
 9   pcp06      29101 non-null  float64
 10  pcp24      29101 non-null  float64
 11  sd         29101 non-null  float64
 12  hday       29101 non-null  object 
dtypes: float64(9), int64(1), object(3)
memory usage: 2.9+ MB


# Insight

- pickup_dt is stored as object; it should be converted to a datetime data type.
- The borough and temp columns have missing values.
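
One possible way to do the datetime conversion is sketched below. It assumes the timestamps are day-first (`%d-%m-%Y %H:%M`), which is consistent with values such as `15-01-2015 19:00` in the rows shown in this notebook. The `sample` frame is a stand-in built from two of those values; on the full data the same call would be applied to `df['pickup_dt']`.

```python
import pandas as pd

# Two sample timestamps copied from rows displayed in this notebook.
sample = pd.DataFrame({"pickup_dt": ["01-01-2015 01:00", "15-01-2015 19:00"]})

# Day-first format: "15-01-2015" can only be read as day 15 of month 1.
sample["pickup_dt"] = pd.to_datetime(sample["pickup_dt"], format="%d-%m-%Y %H:%M")

print(sample.dtypes)
print(sample["pickup_dt"].dt.day.tolist(), sample["pickup_dt"].dt.month.tolist())
```

Passing an explicit `format` avoids pandas guessing between day-first and month-first for ambiguous dates such as `01-02-2015`.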

In [9]:
df.isnull().sum()

pickup_dt       0
borough      3043
pickups         0
spd             0
vsb             0
temp          359
dewp            0
slp             0
pcp01           0
pcp06           0
pcp24           0
sd              0
hday            0
dtype: int64

In [10]:
df.duplicated().sum()

0

# Insight

- The borough column is missing 3043 values.

- The temp column is missing 359 values.

- No duplicate rows were found.

Further analysis of the data, together with domain knowledge, will help decide how to handle the missing values.
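
Expressing the missing counts as percentages makes them easier to compare: on the counts above, 3043 of 29101 rows is about 10.5% for borough and 359 of 29101 is about 1.2% for temp. A minimal sketch of that calculation follows, using a small hypothetical frame (`toy` and its values are illustrative only); on the real data, `df.isnull().mean() * 100` would be the equivalent call.

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    "borough": ["Bronx", None, "Queens", "Brooklyn"],
    "temp": [30.0, 30.0, np.nan, np.nan],
    "pickups": [152, 4, 405, 1519],
})

# isnull() gives booleans; their column mean is the fraction of missing values.
missing_pct = toy.isnull().mean() * 100

print(missing_pct)
```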

-----------------------------------------------------------

# Missing value treatment
One of the commonly used methods for dealing with missing values is to impute them with a measure of central tendency of the column: the mean, median, or mode.

- Replacing with mean: The missing values are imputed with the mean of the column. The mean is affected by outliers, so when the column has outliers this method may lead to erroneous imputations.

- Replacing with median: The missing values are imputed with the median of the column. When the column has outliers, the median is a more appropriate measure of central tendency than the mean.

- Replacing with mode: In this method the missing values are imputed with the mode of the column. This method is generally preferred with categorical data.
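
The three strategies above can be sketched on a small hypothetical frame. The `toy` data below, including the deliberate outlier, is made up for illustration and is not part of the Uber dataset.

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: one numeric column with an outlier, one categorical column.
toy = pd.DataFrame({
    "value": [10.0, 12.0, np.nan, 11.0, 500.0],  # 500.0 is a deliberate outlier
    "category": ["a", "b", "a", None, "a"],
})

# Mean imputation: pulled upward by the outlier.
mean_filled = toy["value"].fillna(toy["value"].mean())

# Median imputation: robust to the outlier.
median_filled = toy["value"].fillna(toy["value"].median())

# Mode imputation: most frequent category (mode() returns a Series, so take the first entry).
mode_filled = toy["category"].fillna(toy["category"].mode()[0])

print(mean_filled[2], median_filled[2], mode_filled[3])
```

Here the mean-imputed value is 133.25 while the median-imputed value is 11.5, which shows how a single outlier can distort mean imputation.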

In [16]:
#Check the missing values for borough in %
df.borough.value_counts(normalize=True, dropna=False)

borough
Bronx           0.15
Brooklyn        0.15
EWR             0.15
Manhattan       0.15
Queens          0.15
Staten Island   0.15
Unkown          0.10
Name: proportion, dtype: float64

# Insight

- Each of the 6 named boroughs accounts for about 15% of the rows.
- There is no single mode, because the 6 boroughs are tied.
- Missing values account for about 10% of the rows, which is comparable to the share of each borough.
- We can therefore treat the missing values as a separate category.

In [13]:
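# Replace missing borough values with a separate category
# (plain assignment avoids chained inplace fillna, which newer pandas versions warn about)
df['borough'] = df['borough'].fillna('Unkown')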

In [14]:
#Check boroughs
df['borough'].unique()

array(['Bronx', 'Brooklyn', 'EWR', 'Manhattan', 'Queens', 'Staten Island',
       'Unkown'], dtype=object)

In [15]:
#Check missing values for boroughs
df.isnull().sum()

pickup_dt      0
borough        0
pickups        0
spd            0
vsb            0
temp         359
dewp           0
slp            0
pcp01          0
pcp06          0
pcp24          0
sd             0
hday           0
dtype: int64

# Insight

The missing values in the borough column have been treated. Next, we look at the temp variable and decide how to deal with its missing values.

In [17]:
# Show the rows where temp is missing
df.loc[df['temp'].isnull()]

Unnamed: 0,pickup_dt,borough,pickups,spd,vsb,temp,dewp,slp,pcp01,pcp06,pcp24,sd,hday
1,01-01-2015 01:00,Brooklyn,1519,5.00,10.00,,7.00,1023.50,0.00,0.00,0.00,0.00,Y
8,01-01-2015 02:00,Brooklyn,1229,3.00,10.00,,6.00,1023.00,0.00,0.00,0.00,0.00,Y
15,01-01-2015 03:00,Brooklyn,1601,5.00,10.00,,8.00,1022.30,0.00,0.00,0.00,0.00,Y
22,01-01-2015 04:00,Brooklyn,1390,5.00,10.00,,9.00,1022.00,0.00,0.00,0.00,0.00,Y
29,01-01-2015 05:00,Brooklyn,759,5.00,10.00,,9.00,1021.80,0.00,0.00,0.00,0.00,Y
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2334,15-01-2015 19:00,Brooklyn,594,5.00,10.00,,13.00,1016.20,0.00,0.00,0.00,0.00,N
2340,15-01-2015 20:00,Brooklyn,620,5.00,10.00,,13.00,1015.50,0.00,0.00,0.00,0.00,N
2347,15-01-2015 21:00,Brooklyn,607,3.00,10.00,,14.00,1015.40,0.00,0.00,0.00,0.00,N
2354,15-01-2015 22:00,Brooklyn,648,9.00,10.00,,14.00,1015.40,0.00,0.00,0.00,0.00,N


# Insight

- In the rows displayed above, every missing temp value belongs to Brooklyn.

- Next step: convert pickup_dt to datetime and review which months contain the 359 missing values.
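
The follow-up check described above could look roughly like this. It is a sketch on a hypothetical frame: the first three rows reuse timestamps and boroughs seen in this notebook, while the February row and the filled-in temperatures are invented for illustration. The day-first date format is an assumption based on values like `15-01-2015 19:00`.

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame; the "02-02-2015 10:00" row is invented for illustration.
toy = pd.DataFrame({
    "pickup_dt": ["01-01-2015 01:00", "01-01-2015 01:00",
                  "15-01-2015 19:00", "02-02-2015 10:00"],
    "borough": ["Bronx", "Brooklyn", "Brooklyn", "Brooklyn"],
    "temp": [30.0, np.nan, np.nan, 35.0],
})
toy["pickup_dt"] = pd.to_datetime(toy["pickup_dt"], format="%d-%m-%Y %H:%M")

# Keep only the rows where temp is missing.
missing = toy[toy["temp"].isnull()]

# Count missing rows per borough and per calendar month.
by_borough = missing["borough"].value_counts()
by_month = missing["pickup_dt"].dt.month.value_counts().sort_index()

print(by_borough)
print(by_month)
```

If the real data shows the same pattern, with the gaps confined to one borough and a short time window, imputing from the other boroughs' temperatures at the same timestamp may be more sensible than a global mean or median.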