# Case Study: Cycle Sharing Scheme

##### Description:

-----------
    
The cycle sharing scheme gives people in the city a convenient, cheap, and green way to commute. The service has 500 bikes at 50 stations across Seattle. Each station has a dock locking system (where all bikes are parked), a kiosk (where customers can get a membership key or pay for a trip), and a helmet rental service. A rider can purchase either a membership key or a short-term pass. A membership key, obtained from a kiosk, grants an annual membership. Advantages for members include quick retrieval of bikes and unlimited 45-minute rentals. Short-term passes offer access to bikes for a 24-hour or 3-day interval. Riders can pick up and return bikes at any of the 50 stations citywide.

------------

#### Data Dictionary
![title](img/dictionary.png)

### Importing Packages

In [1]:
%matplotlib inline
import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
import numpy as np
import scipy
from scipy import stats
import seaborn

#### Reading Input File

In [2]:
data = pd.read_csv("data/trip.csv")

# EDA

#### Exploring data

##### Major types of variables
![title](img/vartype.png)

In [3]:
len(data)

286857

In [4]:
data.head()

| | trip_id | starttime | stoptime | bikeid | tripduration | from_station_name | to_station_name | from_station_id | to_station_id | usertype | gender | birthyear |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 431 | 10/13/2014 10:31 | 10/13/2014 10:48 | SEA00298 | 985.935 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Member | Male | 1960.0 |
| 1 | 432 | 10/13/2014 10:32 | 10/13/2014 10:48 | SEA00195 | 926.375 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Member | Male | 1970.0 |
| 2 | 433 | 10/13/2014 10:33 | 10/13/2014 10:48 | SEA00486 | 883.831 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Member | Female | 1988.0 |
| 3 | 434 | 10/13/2014 10:34 | 10/13/2014 10:48 | SEA00333 | 865.937 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Member | Female | 1977.0 |
| 4 | 435 | 10/13/2014 10:34 | 10/13/2014 10:49 | SEA00202 | 923.923 | 2nd Ave & Spring St | Occidental Park / Occidental Ave S & S Washing... | CBD-06 | PS-04 | Member | Male | 1971.0 |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 286857 entries, 0 to 286856
Data columns (total 12 columns):
trip_id              286857 non-null int64
starttime            286857 non-null object
stoptime             286857 non-null object
bikeid               286857 non-null object
tripduration         286857 non-null float64
from_station_name    286857 non-null object
to_station_name      286857 non-null object
from_station_id      286857 non-null object
to_station_id        286857 non-null object
usertype             286857 non-null object
gender               181557 non-null object
birthyear            181553 non-null float64
dtypes: float64(2), int64(1), object(9)
memory usage: 26.3+ MB


--------------------------
![title](img/vartype1.png)

In [6]:
data.describe()

| | trip_id | tripduration | birthyear |
|---|---|---|---|
| count | 286857.0 | 286857.0 | 181553.0 |
| mean | 112431.968012 | 1178.295675 | 1979.759062 |
| std | 76565.154943 | 2038.458947 | 10.167119 |
| min | 431.0 | 60.008 | 1931.0 |
| 25% | 43051.0 | 387.924 | 1974.0 |
| 50% | 103487.0 | 624.842 | 1983.0 |
| 75% | 179545.0 | 1118.466 | 1987.0 |
| max | 255245.0 | 28794.398 | 1999.0 |


In [8]:
# Note: starttime is still a string column here, so this sort is lexicographic, not chronological
data.sort_values(by='starttime', inplace=True)
data.reset_index(drop=True, inplace=True)

In [18]:
print ('Date range of dataset: {} - {}'.format(data.loc[1, 'starttime'],data.loc[len(data)-1, 'stoptime']))

Date range of dataset: 2015-01-01 00:24:00 - 2015-09-09 10:00:00


#### Data Transformation

In [19]:
data.starttime = pd.to_datetime(data.starttime)
data.stoptime = pd.to_datetime(data.stoptime)
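
Converting to datetimes matters for sorting, because timestamp strings sort character by character. The sketch below illustrates this on three made-up timestamps; the explicit `format` string is an assumption based on the `month/day/year hour:minute` layout visible in `data.head()`.

```python
import pandas as pd

# Toy timestamps in the same month/day/year layout as the trip file
raw = pd.Series(["9/9/2015 10:00", "10/13/2014 10:31", "1/1/2015 0:24"])

# Sorting the raw strings is lexicographic: "1/1/2015" sorts before "10/13/2014"
string_order = raw.sort_values().tolist()

# Parsing with an explicit format (an assumption here) allows chronological sorting
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")
time_order = parsed.sort_values().tolist()

print(string_order)
print(time_order)
```

Passing `format` is optional, but it avoids ambiguity between day-first and month-first dates.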

In [20]:
data.sort_values(by='starttime', inplace=True)
data.reset_index(drop=True, inplace=True)

In [None]:
print('Date range of dataset: {} - {}'.format(data.loc[0, 'starttime'], data.loc[len(data)-1, 'stoptime']))

##### Generations in the workplace
| Generation | Description |
|--------|-------------|
| The Silent Generation | Born 1928-1945 (73-90 years old) |
| Baby Boomers | Born 1946-1964 (54-72 years old) |
| Generation X | Born 1965-1980 (38-53 years old) |
| Millennials | Born 1981-1996 (22-37 years old) |
| Post-Millennials | Born 1997-Present (0-21 years old) |

###### Exercise: Create a generation column using the above criteria

In [None]:
## Your answer here:
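
One possible approach to the exercise, shown on a toy frame rather than the real `data`, is to bin birth years with `pd.cut`. The bin edges below are derived from the generation table; the `toy` frame and the `generation` column name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy birth years; NaN mimics riders with no recorded birth year
toy = pd.DataFrame({"birthyear": [1940.0, 1960.0, 1975.0, 1990.0, 1999.0, np.nan]})

# pd.cut bins are right-inclusive by default: (1927, 1945], (1945, 1964], ...
bins = [1927, 1945, 1964, 1980, 1996, np.inf]
labels = ["The Silent Generation", "Baby Boomers", "Generation X",
          "Millennials", "Post-Millennials"]

# Rows with a missing birth year stay missing in the new column
toy["generation"] = pd.cut(toy["birthyear"], bins=bins, labels=labels)
print(toy)
```

The same call should work on `data['birthyear']`, assuming the column holds numeric years as shown in `data.info()`.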


#### Plotting the distribution for the category variables

In [None]:
### Plotting the Distribution of User Types
groupby_user = data.groupby('usertype').size()
groupby_user.plot.bar(title = 'Distribution of User Types');

In [None]:
### Plotting the Distribution of User Gender
groupby_gender = data.groupby('gender').size()
groupby_gender.plot.bar(title = 'Distribution of User Gender');

In [None]:
### Plotting the Distribution of Birth Years
data = data.sort_values(by='birthyear')
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution of birth years',figsize = (15,4));

In [None]:
data_mil = data[(data['birthyear'] >= 1981) & (data['birthyear']<=1996)]
groupby_mil = data_mil.groupby('usertype').size()
groupby_mil.plot.bar(title = 'Distribution of user types among Millennials');

### Multivariate Analysis

In [None]:
data.gender.value_counts()

In [None]:
groupby_birthyear_gender = data.groupby(['birthyear', 'gender'])['birthyear'].count().unstack('gender').fillna(0)
groupby_birthyear_gender[['Male','Female','Other']].plot.bar(title =
'Distribution of birth years by Gender', stacked=True, figsize = (15,4));

##### Plotting the Distribution of Birth Years by User Types

In [None]:
groupby_birthyear_user = data.groupby(['birthyear', 'usertype'])['birthyear'].count().unstack('usertype').fillna(0)
groupby_birthyear_user[['Member']].plot.bar(title = 'Distribution of birth years by Usertype', stacked=True, figsize = (15,4));

In [None]:
data[data['usertype']=='Short-Term Pass Holder']['birthyear'].isnull().values.all()

In [None]:
data[data['usertype']=='Short-Term Pass Holder']['gender'].isnull().values.all()
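
The two checks above can be folded into a single table giving the share of missing values per user type. This is a minimal sketch on made-up rows; the `toy` frame is an assumption and only mirrors the column names of the trip data.

```python
import numpy as np
import pandas as pd

# Toy rows: members mostly have demographics, short-term pass holders have none
toy = pd.DataFrame({
    "usertype": ["Member", "Member", "Member",
                 "Short-Term Pass Holder", "Short-Term Pass Holder"],
    "gender": ["Male", "Female", np.nan, np.nan, np.nan],
    "birthyear": [1960.0, 1988.0, 1977.0, np.nan, np.nan],
})

# isnull() gives booleans; the group-wise mean of booleans is the missing share
missing_share = toy[["gender", "birthyear"]].isnull().groupby(toy["usertype"]).mean()
print(missing_share)
```

A share of 1.0 for a user type means the column is empty for every rider in that group.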

In [None]:
data['starttime_date'] = pd.DatetimeIndex(data.starttime).date

---------------------------------------------------
## NORMAL DISTRIBUTION
![title](img/nd.png)

In [None]:
print ("Mean of trip duration:{}".format(data.tripduration.mean()))
print ("Median of trip duration:{}".format(data.tripduration.median()))
print("Mode of trip duration:{}".format(data.tripduration.mode()))
print("Mode of station originating from:{}".format(data.from_station_name.mode()))

In [None]:
data.from_station_name.value_counts()

In [None]:
data.tripduration.plot.hist(bins = 50, title = 'Frequency Distribution of Trip Duration');

##### Box plot or Whisker plot

![title](img/whisker.png)

##### With Outliers

![title](img/whisker1.png)


In [None]:
data.boxplot(column=['tripduration']);

#### Outliers - Detecting using IQR

![title](img/percentile.png)

In [None]:
q75, q25 = np.percentile(data.tripduration, [75,25])

In [None]:
iqr = q75 - q25

In [None]:
upper_whisker = q75 + 1.5 * iqr
lower_whisker = q25 - 1.5 * iqr

In [None]:
data.tripduration.describe()

In [None]:
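def check(x, ul, ll):
    # Return x if it lies within [ll, ul]; otherwise return None,
    # which isnull() later counts as an outlier
    if ul>=x>=ll:
        return x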

## Percentage of outliers

In [None]:
print("Percentage of Outliers in tripduration:",len(data[data.tripduration.apply(check, args = (upper_whisker, lower_whisker)).isnull()]['tripduration'])/len(data) * 100)
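
The share of outliers can also be computed without a row-wise `apply`. The sketch below applies the same 1.5 × IQR rule with `Series.quantile` and `Series.between`; the `durations` values are made up for illustration.

```python
import pandas as pd

# Toy trip durations in seconds, with one obviously long trip
durations = pd.Series([300, 400, 450, 500, 550, 600, 700, 5000])

q25 = durations.quantile(0.25)
q75 = durations.quantile(0.75)
iqr = q75 - q25

# Whiskers of the box plot: 1.5 * IQR beyond the quartiles
lower = q25 - 1.5 * iqr
upper = q75 + 1.5 * iqr

# between() is inclusive on both ends, so ~between marks the outliers
is_outlier = ~durations.between(lower, upper)
outlier_pct = is_outlier.mean() * 100
print(outlier_pct)
```

Because the mask is boolean, its mean is directly the fraction of outliers.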

In [None]:
mean_trip_duration = data[data.tripduration.apply(check, args = (upper_whisker, lower_whisker)).notnull()]['tripduration'].mean()
print (mean_trip_duration)

### Outliers Treatment

In [None]:
def transform_tripduration(x):
    # Replace values above the upper whisker with the mean of the non-outlier trips
    if x > upper_whisker:
        return mean_trip_duration
    return x

data['tripduration_mean'] = data['tripduration'].apply(transform_tripduration)
data['tripduration_mean'].plot.hist(bins=100, title='Frequency distribution of mean-transformed trip duration');
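
The same mean replacement can be written in vectorized form with `Series.mask`. This is a sketch on made-up values; the `upper` threshold is a hypothetical whisker for this toy series, not the one computed from the trip data, and only the upper side is treated here.

```python
import pandas as pd

# Toy trip durations in seconds
durations = pd.Series([300.0, 400.0, 450.0, 500.0, 550.0, 600.0, 700.0, 5000.0])
upper = 906.25  # hypothetical upper whisker for this toy series

# Mean of the values that are not above the whisker
inlier_mean = durations[durations <= upper].mean()

# mask() replaces values where the condition is True and keeps the rest
treated = durations.mask(durations > upper, inlier_mean)
print(treated.tolist())
```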

#### Skewness vs. Symmetric distribution

!['title'](img/skew.png)
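
Skewness can also be checked numerically. In the sketch below, both series are made up: in the right-skewed one a single large value pulls the mean above the median, while the symmetric one has a skewness of about zero.

```python
import pandas as pd

# One long right tail versus an evenly spaced, symmetric series
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 20])
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Series.skew() returns the sample skewness (positive means a right tail)
print(right_skewed.mean(), right_skewed.median(), right_skewed.skew())
print(symmetric.mean(), symmetric.median(), symmetric.skew())
```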

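### Measures of Center and Spread
Mean - the arithmetic average of the data points

Median - the middle value of the sorted data

Mode - the most frequently occurring value

Variance - represents the variability of data points about the mean

Standard Deviation - the square root of the variance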
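
These measures can be computed directly with pandas. The sketch below uses a small made-up series. Note that pandas defaults to the sample variance (`ddof=1`), so `ddof=0` is passed explicitly to get the population values.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()
median = values.median()
mode = values.mode().tolist()      # mode() returns a Series; ties give several values

pop_var = values.var(ddof=0)       # population variance
pop_std = values.std(ddof=0)       # population standard deviation
sample_var = values.var()          # sample variance (ddof=1 is the pandas default)

print(mean, median, mode, pop_var, pop_std, sample_var)
```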


### Correlation

1) Pearson R


2) Kendall Rank


3) Spearman Rank
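
The three coefficients can be compared on a made-up relationship that is monotonic but not linear. Pearson measures linear association, while Kendall and Spearman depend only on ranks. In pandas, the two rank-based methods of `Series.corr` are assumed here to rely on SciPy being installed.

```python
import pandas as pd

# y grows with x, but not along a straight line
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

pearson = df["x"].corr(df["y"], method="pearson")
kendall = df["x"].corr(df["y"], method="kendall")
spearman = df["x"].corr(df["y"], method="spearman")

print(pearson, kendall, spearman)
```

Here the rank-based coefficients reach 1 because the ordering of `x` and `y` agrees perfectly, while Pearson stays slightly below 1.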

In [None]:
data['starttime_year'] = pd.DatetimeIndex(data.starttime).year

In [None]:
data['age'] = data['starttime_year'] - data['birthyear']

In [None]:
data.age.plot.hist(bins=100)

In [None]:
# dropna() removes every row with a missing value in any column (e.g. gender or birthyear)
data = data.dropna()
seaborn.pairplot(data, vars=['age', 'tripduration'], kind='reg')
plt.show()

##### Correlation Directions

---------------------

![title](img/corr1.png)

-------------------
Reference table

![title](img/corr2.png)

In [None]:
correlations = data[['tripduration','age']].corr(method='pearson')
print(correlations)

#### Central Limit Theorem - Visual Proof

![title](img/clt1.jpeg)

In [None]:
daily_tickets = list(data.groupby('starttime_date').size())
sample_tickets = []
checkpoints = [1, 10, 100, 300, 500, 700, 900, 1000]

plot_count = 1
random.shuffle(daily_tickets)
plt.figure(figsize=(15,7))
binrange=np.array(np.linspace(0,500,101))

for i in range(1000):
    if daily_tickets:
        sample_tickets.append(daily_tickets.pop())

        if i+1 in checkpoints or not daily_tickets:
            plt.subplot(2,4,plot_count)  # 2x4 grid leaves room for all 8 checkpoints
            plt.hist(sample_tickets, binrange)
            plt.title('n=%d' % (i+1),fontsize=15)
            plot_count+=1
            
        if not daily_tickets:
            break
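
The central limit theorem is a statement about the distribution of sample means. As a complementary sketch, the code below draws many samples from a synthetic, strongly skewed population and compares the spread of their means with the theoretical value σ/√n. The exponential population, sample size, and seed are illustrative assumptions, not taken from the trip data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic right-skewed "population" (assumed, not the trip data)
population = rng.exponential(scale=600.0, size=100_000)

sample_size = 50     # observations per sample
n_samples = 2000     # number of repeated samples

# Draw all samples at once (with replacement) and take the mean of each row
samples = rng.choice(population, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

# The sample means centre on the population mean...
print(population.mean(), sample_means.mean())
# ...and their spread is close to sigma / sqrt(n)
print(population.std() / np.sqrt(sample_size), sample_means.std())
```

A histogram of `sample_means` (for example with `plt.hist`) should look roughly bell-shaped even though the population itself is skewed.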

### Log Transformation to reduce skewness

In [None]:
plt.hist(data.age);

In [None]:
plt.hist(np.log10(data.age));
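
The effect of the log transformation can be quantified by comparing skewness before and after. The sketch below uses synthetic log-normal values as a stand-in for a right-skewed variable; the distribution parameters and seed are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic positive, right-skewed values (assumed, not the real ages)
ages = pd.Series(rng.lognormal(mean=3.5, sigma=0.4, size=5000))

skew_before = ages.skew()
skew_after = np.log10(ages).skew()   # log10 requires strictly positive values

print(skew_before, skew_after)
```

A skewness closer to zero after the transformation indicates a more symmetric distribution.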