In [None]:
# importing libraries
import pandas as pd
from scipy import stats

# Reading the Dataset

In [None]:
# loading the dataset
df = pd.read_csv("Uber_Drives_Clean.csv")

df.head()

# Inferential Statistics

The first test checks whether trip duration differs significantly between trips that start and end at the same location and trips whose start and stop locations differ.

An independent two-sample t-test is used for this purpose.

In [None]:
# durations of trips with the same (short_trips) vs. different (long_trips) start and stop locations
short_trips = df[df['START*'] == df['STOP*']]['Duration']
long_trips = df[df['START*'] != df['STOP*']]['Duration']

In [None]:
# performing the test
stats.ttest_ind(short_trips, long_trips)

The low p-value leads to rejection of the null hypothesis of equal mean durations. One can therefore conclude that trip duration differs depending on whether the start and end locations are the same or different.
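By default, `stats.ttest_ind` assumes that both groups have equal variances. If that assumption is doubtful, Welch's variant (`equal_var=False`) may be a safer choice. The sketch below illustrates the call on synthetic data; the `demo` frame and its values are made up for illustration and are not taken from `Uber_Drives_Clean.csv`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the trip data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "START*": ["A"] * 100,
    "STOP*": ["A"] * 40 + ["B"] * 60,
    "Duration": np.concatenate([rng.normal(10, 2, 40), rng.normal(25, 8, 60)]),
})

# Boolean mask: True where the trip starts and stops at the same location
same_loc = demo["START*"] == demo["STOP*"]
same_duration = demo.loc[same_loc, "Duration"]
diff_duration = demo.loc[~same_loc, "Duration"]

# equal_var=False selects Welch's t-test, which does not assume equal variances
welch = stats.ttest_ind(same_duration, diff_duration, equal_var=False)
print(f"t = {welch.statistic:.3f}, p = {welch.pvalue:.3g}")
```

On the real data, the same call would take `short_trips` and `long_trips` as its two arguments.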

The hour at which a trip starts may depend on whether the trip stays within the same area or goes to a different one. The rider may prefer longer trips at particular hours of the day because of work schedules or traffic conditions.

To check this, an independent two-sample t-test is performed.

In [None]:
# start hour of same-location trips (st) and different-location trips (lt)
start_hour_st = pd.to_datetime(df[df['START*'] == df['STOP*']]['Start Time']).dt.hour
start_hour_lt = pd.to_datetime(df[df['START*'] != df['STOP*']]['Start Time']).dt.hour

In [None]:
# performing the test
stats.ttest_ind(start_hour_st, start_hour_lt)

Here, the p-value is greater than 0.05, the significance level of the test. Hence, there is insufficient evidence to reject the null hypothesis: the data do not show a significant difference in starting hour between trips with the same and with different start and end locations.
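Start hours are discrete values bounded between 0 and 23, so they may not be well approximated by a normal distribution. As a possible cross-check, a rank-based test such as Mann-Whitney U could be run alongside the t-test. The following is a minimal sketch on randomly generated hours; `hours_same` and `hours_diff` are hypothetical stand-ins for `start_hour_st` and `start_hour_lt`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical start hours (0-23); NOT taken from the dataset
hours_same = rng.integers(0, 24, size=50)
hours_diff = rng.integers(0, 24, size=200)

# Two-sided Mann-Whitney U test: compares the two samples by rank
# rather than by mean, so it makes no normality assumption
mw = stats.mannwhitneyu(hours_same, hours_diff, alternative="two-sided")
print(f"U = {mw.statistic:.1f}, p = {mw.pvalue:.3f}")
```

If both tests point the same way, the conclusion is less likely to be an artifact of the t-test's assumptions.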

The duration of the trip can also depend on the purpose of the trip. This is checked using a one-way ANOVA test. 

In [None]:
# categories present in the "PURPOSE*" column
df['PURPOSE*'].value_counts()

In [None]:
# making the series
Meeting = df[df['PURPOSE*'] == 'Meeting']['Duration']
Meal = df[df['PURPOSE*'] == 'Meal/Entertain']['Duration']
Errand = df[df['PURPOSE*'] == 'Errand/Supplies']['Duration']
CV = df[df['PURPOSE*'] == 'Customer Visit']['Duration']
TS = df[df['PURPOSE*'] == 'Temporary Site']['Duration']
BO = df[df['PURPOSE*'] == 'Between Offices']['Duration']

In [None]:
# performing the test
anova_result = stats.f_oneway(Meeting, Meal, Errand, CV, TS, BO)
anova_result

The low p-value implies that at least one purpose has a mean duration that differs significantly from the others. Hence, one cannot say that the mean trip duration is the same regardless of purpose.
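A significant ANOVA result does not say which purposes differ. One possible follow-up is to run pairwise Welch t-tests with a Bonferroni correction, as sketched below. The data are synthetic with only three purposes, and the choice of Bonferroni is an assumption made for illustration rather than a step of the original analysis.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic durations for three purposes (hypothetical values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "PURPOSE*": ["Meeting"] * 30 + ["Meal/Entertain"] * 30 + ["Customer Visit"] * 30,
    "Duration": np.concatenate([
        rng.normal(20, 5, 30),
        rng.normal(12, 5, 30),
        rng.normal(35, 5, 30),
    ]),
})

# One Series of durations per purpose
groups = {name: s for name, s in demo.groupby("PURPOSE*")["Duration"]}
pairs = list(combinations(groups, 2))

# Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
adjusted = {}
for a, b in pairs:
    p_raw = stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
    adjusted[(a, b)] = min(p_raw * len(pairs), 1.0)

for (a, b), p_adj in adjusted.items():
    print(f"{a} vs {b}: adjusted p = {p_adj:.3g}")
```

Pairs whose adjusted p-value falls below 0.05 are the ones that would be flagged as differing in mean duration.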

It is also possible that certain days of the week see heavier traffic, which in turn affects trip duration. A one-way ANOVA test is therefore performed to check whether trip duration depends on the day of the week on which the trip takes place.

In [None]:
# making the series
Mon = df[df['Weekday'] == 0]['Duration']
Tue = df[df['Weekday'] == 1]['Duration']
Wed = df[df['Weekday'] == 2]['Duration']
Thu = df[df['Weekday'] == 3]['Duration']
Fri = df[df['Weekday'] == 4]['Duration']
Sat = df[df['Weekday'] == 5]['Duration']
Sun = df[df['Weekday'] == 6]['Duration']

In [None]:
# performing the test
anova_result = stats.f_oneway(Mon, Tue, Wed, Thu, Fri, Sat, Sun)
anova_result

Here, the p-value is less than 0.05, the significance level of the test. Hence, one can reject the null hypothesis and say that at least one day of the week has a mean duration that differs significantly from the others. The duration of the trip does depend on the day of the week on which the trip takes place.
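Writing one line per weekday works, but the groups can also be built with `groupby` and unpacked into `stats.f_oneway`. The sketch below uses a synthetic `demo` frame (hypothetical values) and assumes, as in the cells above, that `Weekday` is coded 0 to 6.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data: 20 trips per weekday, hypothetical durations
rng = np.random.default_rng(7)
demo = pd.DataFrame({
    "Weekday": np.tile(np.arange(7), 20),
    "Duration": rng.gamma(shape=2.0, scale=10.0, size=140),
})

# One array of durations per weekday, in sorted weekday order
duration_by_day = [s.to_numpy() for _, s in demo.groupby("Weekday")["Duration"]]

# Unpack the list so each weekday is passed as a separate sample
result = stats.f_oneway(*duration_by_day)
print(f"F = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

This form gives the same F statistic as the explicit per-day version and adapts automatically if a category is absent from the data.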

Finally, one can check whether the two features 'PURPOSE*' and 'Weekday' are independent of each other, since meetings might take place mostly on weekdays while trips for supplies or entertainment might take place on weekends, and so on.

A chi-square test of independence is used here.

In [None]:
# making the contingency table
crosstab = pd.crosstab(df['PURPOSE*'], df['Weekday'])
crosstab

In [None]:
# performing the test
stats.chi2_contingency(crosstab)

The p-value in this case is ~0.006, which is less than the significance level of 0.05. Hence, one rejects the null hypothesis and concludes that there is a dependence between the purpose of the trip and the day of the week it takes place on.
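`stats.chi2_contingency` returns the test statistic, the p-value, the degrees of freedom and the table of expected counts, which can be unpacked by name. A common rule of thumb is that the chi-square approximation becomes less reliable when many expected counts fall below 5, so it may be worth inspecting that table. The sketch below shows this on a synthetic contingency table; the `demo` frame is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic purposes and weekdays (hypothetical values, not the real dataset)
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "PURPOSE*": rng.choice(["Meeting", "Meal/Entertain", "Errand/Supplies"], size=300),
    "Weekday": rng.integers(0, 7, size=300),
})
table = pd.crosstab(demo["PURPOSE*"], demo["Weekday"])

# Unpack the four results of the test
chi2, p_value, dof, expected = stats.chi2_contingency(table)

# Fraction of cells whose expected count is below 5
share_small = (expected < 5).mean()
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
print(f"share of cells with expected count < 5: {share_small:.2%}")
```

On the real data, passing `crosstab` instead of `table` would give the same four values, and a large share of small expected counts would suggest merging sparse purpose categories before testing.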