# Skeleton of Assignment 4:
    test if the distribution of 
    
    1) trip duration of bikers that ride during the day vs night
    
    2) age of bikers for trips originating in Manhattan and in Brooklyn
    
    are different. Use 3 tests: KS, Pearson's, Spearman's. 
    
    Use the scipy.stats functions scipy.stats.ks_2samp, scipy.stats.pearsonr, scipy.stats.spearmanr. 
    
    For the KS do the test with the entire dataset and with a subset 200 times smaller
    
    Choose a single significant threshold for the whole exercise. 
    
    For each test phrase the Null Hypothesis in words.
    
    Describe the return of the scipy function you use in each case.
    
    State the result in terms of rejection of the Null.

In [1]:
# my usual imports and setups
import pylab as pl
import pandas as pd
import numpy as np

%pylab inline
import os

Populating the interactive namespace from numpy and matplotlib


# Read in data
I am reading in data from January 2015 with the function that I created getCitiBikeCSV. You are requested to use 2 months at least. It would be a good idea to use data from a colder and a warmer months, since there are more riders in the warm weather and ridership patterns may change with weather, temperature, etc. You should use data from multiple months, joining multiple datasets (thus addressing some systematic errors as well)

In [2]:
df1 = pd.read_csv('https://s3.amazonaws.com/tripdata/201611-citibike-tripdata.zip', compression = 'zip')
df2 = pd.read_csv('https://s3.amazonaws.com/tripdata/201612-citibike-tripdata.zip', compression = 'zip')

In [3]:
df = pd.concat([df1, df2])

# 1) trip duration of bikers that ride during the day vs night

**H0: The data sets of trip duration of bikers at day time vs at night are statistically similar **
$$ \alpha = 0.05 $$

In [4]:
df.loc[:,'date'] = pd.to_datetime(df.loc[:,'Start Time'])
df.loc[:,'hour'] = list(map(lambda x: x.hour, df.date))

In [5]:
df.loc[:,'day_or_night'] = list(map(lambda x: int((x >= 6) & (x <= 19)), df.hour))

In [6]:
df.loc[:,'day'] = df.day_or_night
df.loc[:,'night'] = df.day_or_night == 0
df.night = df.night.astype(int)
df = df.dropna(subset = ['Trip Duration'])

lets run the scipy KS test

In [7]:
import scipy.stats
#remember that your imports should all be at the top. I leave it here to hightlight that this package is needed at this point of the workflow

# KS tests to compare 2 samples

http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.ks_2samp.html

the KS test in scipy returns the p-value BUT make sure you understand what the NULL is! read the documentation carefully! what is the null hypothesis that you can/cannot reject?

In [8]:
df['day_td'] = df.day*df['Trip Duration']
df['night_td'] = df.night*df['Trip Duration']

Day_td = df.day_td
Night_td = df.night_td

Day_td = Day_td[Day_td != 0]
Night_td = Night_td[Night_td != 0]

D1, pval1 = scipy.stats.ks_2samp(Day_td, Night_td)
print (D1, pval1)

0.0660396159405 0.0


<span style="color:blue"> Since the maximal distance between the CDFs of the two data sets is close to 0, and the p-value is smaller than alpha, which indicate the result is statistically signifficant, Hence we reject null hypothesis, the two data sets are drawn from different distributions </span>.

The scipy.stats KS test already tells me the significance and the p-value. 

The next few cells are here just to show you how you would obtain the same result by hand, but they are **not required**. 

Remember: the Null hypothesis is rejected if 

$D_KS(n1,n2) > c(\alpha) \sqrt{\frac{(n1 + n2)}{n1n2}}$

(see class notes) where $c(\alpha$) is the inverse of the KS distribution, and you do not have to know how to get that cause there are tables that list critical values!! 

http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/kolmogorov-smirnov-test/kolmogorov-distribution/

But also this result depends in your choice of binning through, and thustheresultyou get by hand may not be exactly the same as the one the KS returns. Either way: this is how you would calculate the KS statistics by hand.

# Now retest using a test for correlation. 

That will answer a slightly different question though - formulate the NULL appropriately. The tests for correlations (generally) requires the variable to be paired, so that I can tell if x changes does y change similarly. But the datasets are of different size! You will need to reduce them to the same size. You can do that by subsampling of the data: take only 1 ride every of 200, which you can achieve "slicing and broadcasting" the array or using one of the python function (built in python numpy.random.choice() functions for example: Docstring:
choice(a, size=None, replace=True, p=None)

Generates a random sample from a given 1-D array

        .. versionadded:: 1.7.0

Parameters
...

But make sure you understand how to use it! there is an option "replace" which you should think about.

# Pearson's  test for correlation

** notice that the Pearson's is a pairwise test: the samples need to be **
 a. the same size
 b. sorted! (how??)
    
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#scipy.stats.pearsonr



In [9]:
#bin up data
bin1 = np.arange(min(df['Trip Duration']), max(df['Trip Duration']), 20000)
D = df.day_td.groupby(pd.cut(df.day_td, bin1)).agg([count_nonzero])
N = df.night_td.groupby(pd.cut(df.night_td, bin1)).agg([count_nonzero])

D = D.count_nonzero
N = N.count_nonzero

#drop 0
N = N.fillna(0)
D = D.fillna(0)

r11, pval11 = scipy.stats.pearsonr(D,N)
print(r11, pval11)

0.999999944187 0.0


<span style="color:blue"> the p-value is smaller than alpha, which indicate the result is statistically signifficant, Hence we reject null hypothesis, the two data sets are drawn from different distributions </span>.

# Spearman's  test for correlation

http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html#scipy.stats.spearmanr

In [10]:
r111, pval111= scipy.stats.spearmanr(N,D)
print(r111, pval111)

0.547941041186 3.02719275993e-18


<span style="color:blue"> the p-value is smaller than alpha, which indicate the result is statistically signifficant, Hence we reject null hypothesis, the two data sets are drawn from different distributions </span>.

# 1) The age of bikers in Manhattan and the age of bikers in Brooklyn


**H0: The data sets of age of bikers in Manhattan vs in Brooklyn are statistically similar**
$$ \alpha = 0.05 $$

### rough estimation of location ( using boxes as boundries) 

mn_range = -74.042358,40.703025,-73.927002,40.875622 #0

bk_range = -74.042358,40.568589,-73.859711,40.703025 #1

note: because of the get_mnbk function, each location would first be identified as manhattan, then brooklyn, hence, the overlaping area of the boxes (mostly small part of williamsburg will be identified as manhattan)

In [11]:
def get_mnbk(lat, lon):
    if ((40.703025 <= lat <= 40.875622) & (-74.042358 <= lon <= -73.927002)):
        return(0)
    if ((40.568589 <= lat <= 40.703025) & (-74.042358 <= lon <= -73.859711)): 
        return(1)
    else:
        return(2)

In [12]:
df['mnbk'] = 0
df.mnbk = list(map(get_mnbk, df['Start Station Latitude'], df['Start Station Longitude']))

In [13]:
df['age'] = 2017 - df['Birth Year']

In [14]:
df['mn'] = df.mnbk == 0
df.mn = df.mn.astype(int)
df['bk'] = df.mnbk

In [15]:
df['mn_age'] = df.mn*df.age 
df['bk_age'] = df.bk*df.age
df['mn_age'].dropna(inplace= True)
df['bk_age'].dropna(inplace= True)

# KS tests to compare 2 samples

In [16]:
MN_age = df['mn_age'] 
BK_age = df['bk_age']

MN_age = MN_age[MN_age != 0]
BK_age = BK_age[BK_age != 0]


D2, pval2 = scipy.stats.ks_2samp(MN_age, BK_age)
print (D2, pval2)

0.0521322062101 0.0


<span style="color:blue"> Since the maximal distance between the CDFs of the two data sets is close to 0, and the p-value is smaller than alpha, which indicate the result is statistically signifficant, Hence we reject null hypothesis, the two data sets are drawn from different distributions </span>.

# Pearson's  test for correlation

In [17]:
# citibike only allows people who are over 16 to use it
bins = np.arange(10, 90, 7)
MN = df.mn_age.groupby(pd.cut(df.mn_age, bins)).agg([count_nonzero])
BK = df.bk_age.groupby(pd.cut(df.bk_age, bins)).agg([count_nonzero])

MN = MN.count_nonzero
BK = BK.count_nonzero

r22, pval22 = scipy.stats.pearsonr(MN,BK)
print(r22, pval22)

0.973929418845 4.20534591097e-07


<span style="color:blue"> the p-value is smaller than alpha, which indicate the result is statistically signifficant, Hence we reject null hypothesis, the two data sets are drawn from different distributions </span>.

# Spearman's test for correlation

In [18]:
r222, pval222= scipy.stats.spearmanr(MN, BK)
print(r222, pval222)

0.981818181818 8.40306643396e-08


<span style="color:blue"> the p-value is smaller than alpha, which indicate the result is statistically signifficant, Hence we reject null hypothesis, the two data sets are drawn from different distributions </span>.