Part A : Commuter pattern

Do average hourly rides (remember each row in the data is a ride) differ
between working days and non-working days?

In [None]:
import pandas as pd
import numpy as np 
from scipy import stats 
from statsmodels.stats.proportion import proportion_confint


In [3]:
df = pd.read_csv('/Users/Marcy_Student/Desktop/Projects /Shaping-Urban-Mobility-/data/modified_hours_dateset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 20 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   instant                  17379 non-null  int64  
 1   dteday                   17379 non-null  object 
 2   season                   17379 non-null  int64  
 3   mnth                     17379 non-null  int64  
 4   hr                       17379 non-null  int64  
 5   holiday                  17379 non-null  int64  
 6   weekday                  17379 non-null  int64  
 7   weathersit               17379 non-null  int64  
 8   temp                     17379 non-null  float64
 9   atemp                    17379 non-null  float64
 10  hum                      17379 non-null  float64
 11  windspeed                17379 non-null  float64
 12  casual                   17379 non-null  int64  
 13  registered               17379 non-null  int64  
 14  Total_amount_of_rental

In [None]:

# Filter for weekdays only (excluding holidays)
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
working_df = df[df['weekday_names'].isin(weekdays) & (df['holiday'] == 0)]

# Average hourly rides for each weekday, and across all working days combined
avg_rides_weekdays = working_df.groupby('weekday_names')['Total_amount_of_rentals'].mean().round(1)
totalavg_working_rides = working_df['Total_amount_of_rentals'].mean().round(1)
print(avg_rides_weekdays)
print('Avg_rides_weekdays : ' , totalavg_working_rides)



weekday_names
Friday       197.3
Monday       186.6
Thursday     198.7
Tuesday      192.6
Wednesday    190.0
Name: Total_amount_of_rentals, dtype: float64
Avg_rides_weekdays :  193.2


In [5]:
# Filter for non-working days only (weekends and holidays)
weekends = ['Saturday', 'Sunday']
non_working_df = df[df['weekday_names'].isin(weekends) | (df['holiday'] == 1)]

# Average hourly rides on non-working days
avg_rides_non_working = non_working_df['Total_amount_of_rentals'].mean().round(1)
print('non_working_rides_avg :',avg_rides_non_working)


non_working_rides_avg : 181.4


In [None]:
# Welch's t-test: compare mean hourly rides between working and non-working days without assuming equal variances
alpha = 0.05
a = working_df['Total_amount_of_rentals']
b = non_working_df['Total_amount_of_rentals']
tt = stats.ttest_ind(a, b, equal_var=False)
# Store each group's mean separately so the comparison is easier to read
mean_a = a.mean().round(1)
mean_b = b.mean().round(1)
print(f"t = {tt.statistic:.3f}, p = {tt.pvalue:.4f}, reject H0 @ α={alpha}? {tt.pvalue < alpha}")
print(f"Group means: working_days = {mean_a}, non_working_days = {mean_b}")


t = 4.095, p = 0.0000, reject H0 @ α=0.05? True
Group means: working_days = 193.2, non_working_days = 181.4


In [None]:
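# Next, estimate the size of the difference between the means (confidence interval).
# Note: mean_a and mean_b were already rounded to one decimal in the previous cell.
difference_between_means = mean_a - mean_b

# Welch's test uses a standard error that does not assume equal variances
standarderror = np.sqrt(a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))

# Welch-Satterthwaite degrees of freedom for the 95% confidence interval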
df_welch = ((a.var(ddof=1)/len(a) + b.var(ddof=1)/len(b))**2 /
    (((a.var(ddof=1)/len(a))**2 / (len(a)-1)) +
    ((b.var(ddof=1)/len(b))**2 /(len(b)-1))))

ci_low, ci_high = stats.t.interval(0.95, df_welch, loc=difference_between_means, scale=standarderror)

print(f"95% CI for difference in means (working - nonworking): [{ci_low:.2f}, {ci_high:.2f}]")

95% CI for difference in means (working - nonworking): [6.15, 17.45]


Null Hypothesis = There is no difference in average hourly rides between working and non-working days
Alternative Hypothesis = There is a difference in average hourly rides between working and non-working days

We reject the null hypothesis: the difference between working-day and non-working-day rides is statistically significant (t = 4.095, p < 0.0001). The 95% confidence interval indicates that working days average about 6 to 17 more rides per hour than non-working days.
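
The pieces of the confidence interval above can also be computed from the unrounded means. This is a minimal sketch in plain Python: `welch_components` is a hypothetical helper (not part of the notebook) that returns the difference in means, the Welch standard error, and the Welch-Satterthwaite degrees of freedom. The notebook's cell subtracts means that were already rounded to one decimal, so its interval may differ from an unrounded one in the second decimal place. The toy numbers are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_components(a, b):
    """Return (difference in means, standard error, Welch-Satterthwaite df)."""
    a, b = list(a), list(b)
    # Sampling variance of each group mean: s^2 / n (sample variance, ddof=1)
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    diff = mean(a) - mean(b)   # unrounded difference in means
    se = math.sqrt(va + vb)    # Welch standard error
    # Welch-Satterthwaite approximation to the degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return diff, se, dof


# Toy data for illustration only (not taken from the ride dataset)
diff, se, dof = welch_components([1, 2, 3, 4, 5], [2, 4, 6, 8, 10, 12])
print(f"diff = {diff:.2f}, se = {se:.3f}, df = {dof:.2f}")
```

With the notebook's data, the call would be `welch_components(a.tolist(), b.tolist())`, and the three values can be passed to `stats.t.interval` as `loc`, `scale`, and the degrees of freedom, in the same way as the cell above.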

Part B : Multi-group comparison

Do mean hourly rides differ across categories of multi-level categorical variables such as season or weather condition (choose one)?

In [None]:
# Check the number of rows per season to decide whether stratified sampling is needed.
print(df['season_names'].value_counts())

season_names
Summer    4496
Spring    4409
Winter    4242
Fall      4232
Name: count, dtype: int64
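
One quick way to judge whether stratified sampling is needed is to compare the largest and smallest group sizes. A minimal sketch in plain Python, using the counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Season counts copied from the value_counts() output above
season_counts = {"Summer": 4496, "Spring": 4409, "Winter": 4242, "Fall": 4232}

total = sum(season_counts.values())
# Share of all rows that falls in each season
shares = {season: count / total for season, count in season_counts.items()}
# Ratio of the largest group to the smallest; values near 1 mean balanced groups
imbalance_ratio = max(season_counts.values()) / min(season_counts.values())

for season, share in shares.items():
    print(f"{season}: {share:.1%}")
print(f"largest / smallest = {imbalance_ratio:.2f}")
```

The ratio is close to 1, so the seasons are nearly balanced and the comparison can reasonably use all rows without stratified sampling.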


In [None]:
# Collect the hourly ride counts for each season (one group per season) and run a one-way ANOVA
avg_within_seasons = [df['Total_amount_of_rentals'][df['season_names'] == s] for s in df['season_names'].unique()]
anova = stats.f_oneway(*avg_within_seasons)
print(f"ANOVA F = {anova.statistic:.3f}, p = {anova.pvalue:.4f}, reject H0? {anova.pvalue < alpha}")

ANOVA F = 409.181, p = 0.0000, reject H0? True
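
To show what `stats.f_oneway` computes, here is a minimal sketch of the one-way ANOVA F statistic in plain Python. `one_way_f` is a hypothetical helper written for illustration, and the toy groups are made up; it returns only F and the two degrees of freedom, not the p-value.

```python
from statistics import mean


def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    groups = [list(g) for g in groups]
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)

    ss_between = 0.0  # how far each group mean sits from the grand mean
    ss_within = 0.0   # spread of the values around their own group mean
    for g in groups:
        g_mean = mean(g)
        ss_between += len(g) * (g_mean - grand_mean) ** 2
        ss_within += sum((x - g_mean) ** 2 for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy groups for illustration only (not the ride data)
f_stat, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(f"F = {f_stat:.3f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the variation between the season means is large relative to the variation within each season.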


In [19]:
# Post-hoc test to pinpoint which seasons differ from each other.
# Tukey's Honestly Significant Difference (HSD) compares every pair of seasonal means.

from statsmodels.stats.multicomp import pairwise_tukeyhsd

hourly_rides = df['Total_amount_of_rentals']
seasons= df['season_names']

tukey_results = pairwise_tukeyhsd(endog= hourly_rides, groups=seasons, alpha=alpha)
print(tukey_results)


   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
group1 group2  meandiff p-adj    lower     upper   reject
---------------------------------------------------------
  Fall Spring    9.4752 0.0582   -0.2181   19.1685  False
  Fall Summer   37.1474    0.0   27.5001   46.7946   True
  Fall Winter  -87.7543    0.0  -97.5406   -77.968   True
Spring Summer   27.6722    0.0   18.1252   37.2192   True
Spring Winter  -97.2295    0.0  -106.917   -87.542   True
Summer Winter -124.9017    0.0 -134.5431 -115.2603   True
---------------------------------------------------------


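Null Hypothesis = All seasonal average hourly rides are equal
Alternative Hypothesis = At least one seasonal mean differs from the others

The ANOVA F statistic (F = 409.181, p < 0.0001) shows that mean hourly rides differ across seasons, so we reject the null hypothesis. The Tukey HSD test shows which pairs differ: every pair of seasons differs significantly except Fall and Spring (p-adj = 0.0582). Winter has the lowest mean hourly rides and Summer the highest.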
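
The Tukey table can also be read programmatically: a pair differs significantly when its confidence interval excludes zero. A minimal sketch in plain Python, with the rows copied from the output above. It assumes that `meandiff` is the mean of `group2` minus the mean of `group1`; the variable names are illustrative.

```python
# Rows copied from the Tukey HSD output: (group1, group2, meandiff, lower, upper)
tukey_rows = [
    ("Fall", "Spring", 9.4752, -0.2181, 19.1685),
    ("Fall", "Summer", 37.1474, 27.5001, 46.7946),
    ("Fall", "Winter", -87.7543, -97.5406, -77.968),
    ("Spring", "Summer", 27.6722, 18.1252, 37.2192),
    ("Spring", "Winter", -97.2295, -106.917, -87.542),
    ("Summer", "Winter", -124.9017, -134.5431, -115.2603),
]

# A pair is significant when the interval [lower, upper] does not contain zero
significant = [(g1, g2) for g1, g2, _, lo, hi in tukey_rows if lo > 0 or hi < 0]
not_significant = [(g1, g2) for g1, g2, _, lo, hi in tukey_rows if lo <= 0 <= hi]

# Rank seasons relative to Fall (assumes meandiff = mean(group2) - mean(group1))
relative_to_fall = {"Fall": 0.0}
for g1, g2, diff, _, _ in tukey_rows:
    if g1 == "Fall":
        relative_to_fall[g2] = diff
ranking = sorted(relative_to_fall, key=relative_to_fall.get)

print("Significant pairs:", significant)
print("Not significant:", not_significant)
print("Lowest to highest mean:", ranking)
```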