## Imports

In [52]:
import pandas as pd
from statsmodels.stats.weightstats import ztest
from scipy.stats import f_oneway as anova

In [8]:
bike = pd.read_csv("/content/SDS-Project-Bike-Rentals/Data_Handling.csv")
bike.head()

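| Unnamed: 0 | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | winter | 0 | January | 0 | 0 | Saturday | 0 | 1 | 9.84 | 14.395 | 81.0 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | winter | 0 | January | 1 | 0 | Saturday | 0 | 1 | 9.02 | 13.635 | 80.0 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | winter | 0 | January | 2 | 0 | Saturday | 0 | 1 | 9.02 | 13.635 | 80.0 | 0.0 | 5 | 27 | 32 |
| 3 | 4 | 2011-01-01 | winter | 0 | January | 3 | 0 | Saturday | 0 | 1 | 9.84 | 14.395 | 75.0 | 0.0 | 3 | 10 | 13 |
| 4 | 5 | 2011-01-01 | winter | 0 | January | 4 | 0 | Saturday | 0 | 1 | 9.84 | 14.395 | 75.0 | 0.0 | 0 | 1 | 1 |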


# Hypothesis Testing

All hypothesis tests below use a significance level of `0.05` (`5%`).

$\alpha = 0.05$

## Hypothesis 1

The hypothesis statement is taken from https://nacto.org/bike-share-statistics-2017/

Demand for bike rentals is higher during peak hours, i.e. 7-9 am and 4-6 pm, on working days.

Let $\mu_{peak}$ be the mean count of bike rental users on working days during peak hours.

Let $\mu_{\text{not peak}}$ be the mean count of bike rental users on working days during non-peak hours (taken here as hours 6, 10-15 and 19-21).

Null Hypothesis:
$ \mu_{peak} - \mu_{\text{not peak}} = 0 $

Alternate Hypothesis:
$ \mu_{peak} - \mu_{\text{not peak}} > 0 $

Test to be conducted: Two sample z test

In [17]:
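# Peak group: working days, hours 7-9 and 16-18
peak = bike.query('workingday == 1 and ((hr>=7 and hr<=9) or (hr>=16 and hr<=18))')
# Non-peak group: working days, hours 6, 10-15 and 19-21
not_peak = bike.query('workingday == 1 and ((hr==6) or (hr>=10 and hr<=15) or (hr>=19 and hr<=21))')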

In [19]:
ztest(peak['cnt'] , not_peak['cnt'], alternative="larger")

(52.80985574722311, 0.0)

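The p-value is reported as `0.0`: with $z \approx 52.8$ it is too small to represent in floating point and is far below $\alpha = 0.05$.

The data therefore provides strong evidence against the null hypothesis, so we reject it in favour of the alternate hypothesis: on working days, mean demand is higher during peak hours than during non-peak hours.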
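
To make the test concrete, here is a minimal standard-library sketch of a one-sided two-sample z-test with pooled variance, which is what `ztest` is understood to compute by default. The function name `two_sample_ztest_larger` and the toy samples are illustrative assumptions and are not part of the notebook. The last lines also show why a statistic as large as the one above yields a p-value of exactly `0.0`.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sample_ztest_larger(x1, x2):
    """One-sided two-sample z-test (H1: mean(x1) > mean(x2)) with pooled variance."""
    n1, n2 = len(x1), len(x2)
    # Pooled sample variance of the two groups
    pooled_var = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    # Standard error of the difference in means
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    z = (mean(x1) - mean(x2)) / std_err
    # Upper-tail probability under the standard normal distribution
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Toy samples (hypothetical, not taken from the bike data)
z, p = two_sample_ztest_larger([5, 6, 7, 8, 9], [1, 2, 3, 4, 5])
print(z, p)

# For a statistic as large as 52.8 the normal CDF rounds to 1.0 in floating
# point, so the upper-tail p-value evaluates to exactly 0.0
p_extreme = 1 - NormalDist().cdf(52.8)
print(p_extreme)
```

A reported p-value of `0.0` should therefore be read as "smaller than floating point can represent", not as a literal probability of zero.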

## Hypothesis 2

The hypothesis statement is taken from https://nacto.org/bike-share-statistics-2017/

There is more demand for bike rentals on working days than non working days

Let $\mu_{working}$ be the mean count of bike rental users on working days

Let $\mu_{\text{not working}}$ be the mean count of bike rental users on non working days

Null Hypothesis:
$ \mu_{working} - \mu_{\text{not working}} = 0 $

Alternate Hypothesis:
$ \mu_{working} - \mu_{\text{not working}} > 0$

Test to be conducted: Two sample z test

In [23]:
working = bike.query('workingday == 1')
not_working = bike.query('not workingday == 1')

In [24]:
ztest(working['cnt'] , not_working['cnt'], alternative="larger")

(3.993973309150059, 3.248759054718501e-05)

The p-value ($\approx 3.2 \times 10^{-5}$) is less than `0.05`.

The data therefore provides evidence against the null hypothesis, so we reject it in favour of the alternate hypothesis: mean demand is higher on working days than on non-working days.

## Hypothesis 3

The hypothesis statement is taken from https://nacto.org/bike-share-statistics-2017/

The number of rented bikes doubles every year

Let $\mu_{2012}$ be the mean count of bike rental users in 2012

Let $cnt_{2011}$ be the mean count of bike rental users in 2011

Null Hypothesis:
$ \mu_{2012} \leq 2 \times cnt_{2011} $

Alternate Hypothesis:
$ \mu_{2012} > 2 \times cnt_{2011} $

Test to be conducted: one sample z test

In [27]:
bike_2012 = bike.query('yr==1')
cnt_2011 = bike.query('yr==0')['cnt'].mean()

143.79444765760556

In [42]:
ztest(bike_2012['cnt'] , value = 2*cnt_2011, alternative="larger")

(-23.674784738187125, 1.0)

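The p-value is `1.0` and the test statistic is strongly negative ($z \approx -23.7$): the mean count in 2012 is well below twice the mean count in 2011. The data provides no evidence against the null hypothesis.

Therefore we fail to reject the null hypothesis; the data does not support the claim that rentals double every year.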
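
The one-sample version of the test can be sketched with the standard library as follows. This is a minimal illustration under stated assumptions: the function name `one_sample_ztest_larger` and the toy sample are hypothetical, and the reference value plays the role of $2 \times cnt_{2011}$.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def one_sample_ztest_larger(x, value):
    """One-sided one-sample z-test (H1: mean(x) > value)."""
    # Standard error of the sample mean
    std_err = stdev(x) / sqrt(len(x))
    z = (mean(x) - value) / std_err
    # Upper-tail probability under the standard normal distribution
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Toy sample (hypothetical): sample mean 11 tested against a reference value of 12
z, p = one_sample_ztest_larger([9, 10, 11, 12, 13], value=12)
print(z, p)
```

When the sample mean lies below the reference value, the statistic is negative and the upper-tail p-value exceeds `0.5`, so a "larger" alternative can never be supported by such a sample.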

## Hypothesis 4

The hypothesis statement is taken from the paper
[*Investigation on the effects of weather and calendar events on bike-sharing according to the trip patterns of bike rentals of stations*](https://www.sciencedirect.com/science/article/abs/pii/S0966692317304659)

During travel hours, i.e. 6 am to 10 pm, bike rentals drop at feeling temperatures above 30

Let $\mu_{30}$ be the mean count of bike rental users during the time `6am - 10pm` at which the feeling temperature is above or equal to `30`

Let $\mu_{29}$ be the mean count of bike rental users during the time `6am - 10pm` at which the feeling temperature is below `30`

Null Hypothesis:
$ \mu_{30} - \mu_{29} \geq 0$

Alternate Hypothesis:
$ \mu_{30} - \mu_{29} < 0$

Test to be conducted: two sample z test

In [51]:
bike_30 = bike.query('(hr>=6 and hr<=22) and atemp>=30')
bike_29 = bike.query('(hr>=6 and hr<=22) and atemp<30')


In [50]:
ztest(bike_30['cnt'] , bike_29['cnt'], alternative="smaller")

(44.7643892887034, 1.0)

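The p-value is `1.0` and the test statistic is large and positive ($z \approx 44.8$): within travel hours, mean rentals at feeling temperatures of 30 or above are higher than at feeling temperatures below 30, the opposite of the hypothesized direction.

Therefore we fail to reject the null hypothesis; the data does not support the claim that rentals drop at feeling temperatures above 30.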
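
The p-value of `1.0` follows from the sign of the statistic combined with the direction of the alternative. The sketch below, using only the standard library, shows how the tail is chosen for each `alternative` value. The helper name `z_p_value` is a hypothetical illustration and is not a statsmodels function; the statistic is the one reported above.

```python
from statistics import NormalDist


def z_p_value(z, alternative):
    """p-value of a z statistic for a 'larger', 'smaller' or 'two-sided' alternative."""
    cdf = NormalDist().cdf(z)
    if alternative == "larger":
        return 1 - cdf  # upper tail
    if alternative == "smaller":
        return cdf  # lower tail
    if alternative == "two-sided":
        return 2 * min(cdf, 1 - cdf)  # both tails
    raise ValueError(f"unknown alternative: {alternative!r}")


# Statistic reported by the test above
z = 44.7643892887034
p_smaller = z_p_value(z, "smaller")
p_larger = z_p_value(z, "larger")
print(p_smaller, p_larger)
```

For a "smaller" alternative only a strongly negative statistic gives a small p-value, which is why a large positive statistic results in a p-value of `1.0`.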

## Hypothesis 5

The hypothesis statement is taken from the paper
[*Investigation on the effects of weather and calendar events on bike-sharing according to the trip patterns of bike rentals of stations*](https://www.sciencedirect.com/science/article/abs/pii/S0966692317304659)

In light rain, the count of bike rental users is lower than in misty or clear weather, again during travel hours

Let $\mu_{lr}$ be the mean count of bike rental users during light rain

Let $\mu_{cl}$ be the mean count of bike rental users during clear and cloudy sky

Null Hypothesis:
$ \mu_{lr} = \mu_{cl}$

Alternate Hypothesis:
$ \mu_{lr} < \mu_{cl}$

Test to be conducted: one way ANOVA test

In [53]:
clear = bike.query('weathersit == 1')
cloudy = bike.query('weathersit == 2')
lr = bike.query('weathersit == 3')

In [54]:
anova(clear['cnt'] , cloudy['cnt'], lr['cnt'])

F_onewayResult(statistic=190.12547702844438, pvalue=2.0903098979675505e-82)

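The p-value ($\approx 2.1 \times 10^{-82}$) is less than `0.05`, so we reject the null hypothesis that the mean counts are equal across clear, cloudy and light-rain conditions.

Note that a one-way ANOVA only shows that at least one group mean differs from the others. It does not by itself establish the direction stated in the alternate hypothesis ($\mu_{lr} < \mu_{cl}$), which would need a follow-up comparison of the group means.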
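
For reference, the F statistic of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The following standard-library sketch computes it for toy data; the function name `one_way_anova_f` and the groups are hypothetical. The p-value is omitted because it requires the F distribution, which `f_oneway` takes care of.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy groups (hypothetical, not taken from the bike data)
f_stat = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f_stat)
```

A large F value indicates that the group means vary more than the scatter within groups would explain, but it carries no information about which group is lower or higher.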