# Analyzing Bike Sharing Trends

## Problem Statement
With environmental issues and health becoming trending topics, the use of bicycles as a mode of transportation has gained traction in recent years. To encourage bike usage, cities across the world have successfully rolled out bike sharing programs. Under such schemes, riders can rent bicycles for defined periods using manual or automated kiosks spread across the city. In most cases, riders can pick up bikes from one location and return them to any other designated place.

Bike sharing platforms across the world are hotspots of all sorts of data, ranging from travel time and start and end locations to rider demographics. Combined with alternate sources of information such as weather, traffic, and terrain, this data is an attractive proposition for different research areas.

The Capital Bike Sharing dataset contains information related to one such bike sharing program underway in Washington DC. Given this augmented (bike sharing details along with weather information) dataset, can we forecast bike rental demand for this program?

## Exploratory Analysis
Now that we have an overview of the business case and a formal problem statement, the very next stage is to explore and understand the data.

### Load Packages

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Preprocessing

In [None]:
df_hour = pd.read_csv('data/hour.csv')
print("Shape of dataset::{}".format(df_hour.shape))

In [None]:
df_hour.head()

In [None]:
df_hour.dtypes

As mentioned in the documentation for the dataset, there are bike sharing as well as weather attributes available. The attribute dteday would require type conversion from object (or string type) to timestamp. Attributes like season, holiday, weekday, and so on are inferred as integers by pandas, and they would require conversion to categoricals for proper understanding.

Before jumping into type casting attributes, the following snippet cleans up the attribute names to make them more understandable and pythonic.

In [None]:
df_hour.rename(columns={'instant':'rec_id',
                        'dteday':'datetime',
                        'holiday':'is_holiday',
                        'workingday':'is_workingday',
                        'weathersit':'weather_condition',
                        'hum':'humidity',
                        'mnth':'month',
                        'cnt':'total_count',
                        'hr':'hour','yr':'year'},inplace=True)

Now that we have attribute names cleaned up, we perform type-casting of attributes. The following snippet gets the attributes into the proper data types.

In [None]:
# date time conversion
df_hour['datetime'] = pd.to_datetime(df_hour.datetime)

# categorical variables
df_hour['season'] = df_hour.season.astype('category')
df_hour['is_holiday'] = df_hour.is_holiday.astype('category')
df_hour['weekday'] = df_hour.weekday.astype('category')
df_hour['weather_condition'] = df_hour.weather_condition.astype('category')
df_hour['is_workingday'] = df_hour.is_workingday.astype('category')
df_hour['month'] = df_hour.month.astype('category')
df_hour['year'] = df_hour.year.astype('category')
df_hour['hour'] = df_hour.hour.astype('category')
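
The per-column casts above can also be written as a single call. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `df_hour`), assuming a dictionary passed to `DataFrame.astype` is acceptable for this workflow.

```python
import pandas as pd

# Toy frame standing in for df_hour; the values are made up for illustration.
toy = pd.DataFrame({
    "season": [1, 2, 3, 4],
    "is_holiday": [0, 0, 1, 0],
    "hour": [0, 8, 17, 23],
    "total_count": [16, 210, 405, 39],
})

# Columns that hold category codes rather than quantities.
category_cols = ["season", "is_holiday", "hour"]

# Cast all of them in one call by mapping each column name to the target dtype.
toy = toy.astype({col: "category" for col in category_cols})

print(toy.dtypes)
```

Columns that are not listed, such as `total_count`, keep their original numeric type.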

### Distribution and Trends

In [None]:
fig,ax = plt.subplots()
sns.pointplot(data=df_hour[['hour',
                            'total_count',
                            'season']],
              x='hour', y='total_count',
              hue='season', ax=ax)
ax.set(title="Season wise hourly distribution of counts")

The plot above shows similar trends for all seasons with counts peaking in the morning between 7-9 am and in the evening between 4-6 pm, possibly due to high movement during start and end of office hours. The counts are lowest for the spring season, while fall sees highest riders across all 24 hours.
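
Peaks read off a plot can also be checked numerically. The sketch below uses a tiny made-up frame (`toy` is hypothetical, not the real dataset) to show how average counts per hour could be computed and the busiest hours listed.

```python
import pandas as pd

# Made-up hourly counts, for illustration only.
toy = pd.DataFrame({
    "hour":        [7, 8, 8, 12, 12, 17, 17, 23],
    "total_count": [120, 400, 420, 180, 200, 450, 470, 40],
})

# Average count for each hour of the day.
hourly_mean = toy.groupby("hour")["total_count"].mean()

# The two hours with the highest average count, busiest first.
top_hours = hourly_mean.nlargest(2).index.tolist()
print(top_hours)
```

Applied to `df_hour`, the same grouping could be extended to `['season', 'hour']` to compare seasons.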

In [None]:
fig,ax = plt.subplots()
sns.pointplot(data=df_hour[['hour',
                           'total_count',
                           'weekday']],
             x='hour', y='total_count',
             hue='weekday', ax=ax)
ax.set(title="Weekday wise hourly distribution of counts")

Similarly, distribution of ridership across days of the week also presents interesting trends of higher usage during afternoon hours over weekends, while weekdays see higher usage during mornings and evenings.

Having observed hourly distribution of data across different categoricals, let’s see if there are any aggregated trends.

In [None]:
fig,ax = plt.subplots()
sns.barplot(data=df_hour[['month',
                          'total_count']],
            x="month", y="total_count", ax=ax)
ax.set(title="Monthly distribution of counts")

In [None]:
sns.violinplot(data=df_hour[['year',
                             'total_count']],
               x="year", y="total_count")

The preceding snippet plots the yearly distribution as violin plots. The figure shows a multimodal distribution of ridership counts in both 2011 and 2012, with 2011 having peaks at lower values than 2012. The spread of counts is also much wider for 2012, although the maximum density for both years lies between 100 and 200 rides.

### Outliers
While exploring and learning about any dataset, it is imperative that we check for extreme and unlikely values. Though we handle missing and incorrect information while preprocessing the dataset, outliers are usually caught during EDA. Outliers can severely and adversely impact the downstream steps like modeling and the results.
We usually use boxplots to check for outliers in the data. In the following snippet, we analyze outliers for numeric attributes like total_count, casual, registered, temp, and windspeed.

In [None]:
fig,(ax1,ax2)= plt.subplots(ncols=2)
sns.boxplot(data=df_hour[['total_count','casual','registered']],ax=ax1)
sns.boxplot(data=df_hour[['temp','windspeed']],ax=ax2)

All three count-related attributes show a sizable number of outlier values, although the casual rider distribution has lower numbers overall. Of the weather attributes temp and windspeed, only windspeed shows outliers.
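
The points a boxplot draws beyond its whiskers can also be counted directly. The sketch below assumes the common 1.5 × IQR rule (which is also seaborn's default whisker length); the helper `count_iqr_outliers` and the sample values are hypothetical.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up counts with one obviously extreme value.
toy_counts = pd.Series([10, 12, 11, 13, 12, 11, 10, 500])
n_outliers = count_iqr_outliers(toy_counts)
print(n_outliers)
```

On the real data, the same helper could be applied per column, for example `df_hour[['casual', 'registered', 'total_count']].apply(count_iqr_outliers)`.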

We can similarly try to check outliers at different granularity levels like hourly, monthly, and so on. 

In [None]:
fig,ax= plt.subplots()
# one box per hour of the day, showing the spread of total_count
sns.boxplot(data=df_hour[['hour','total_count']],
            x='hour', y='total_count', ax=ax)

### Correlations

In [None]:
corrMatt = df_hour[['temp','atemp',
                   'humidity','windspeed',
                   'casual', 'registered',
                   'total_count']].corr()
# hide the upper triangle (above the diagonal) so each pair appears once
mask = np.triu(np.ones_like(corrMatt, dtype=bool), k=1)
sns.heatmap(corrMatt, mask=mask,
            vmax=.8, square=True, annot=True)

The two count variables, registered and casual, are strongly correlated with total_count, as expected, since total_count is their sum. Similarly, temp and atemp are highly correlated with each other. windspeed and humidity have a slight negative correlation. Apart from these, none of the attributes shows a high correlation.

## Regression Analysis
Regression analysis is a statistical modeling technique. It is the process of investigating relationships between dependent and independent variables. Regression itself includes a variety of techniques for modeling and analyzing relationships between variables. It is widely used for predictive analysis, forecasting, and time series analysis.
The dependent or target variable is estimated as a function of independent or predictor variables. The estimation function is called the regression function.

The height-weight relationship is a classic example to get started with regression analysis. The example states that weight of a person is dependent on his/her height. Thus, we can formulate a regression function to estimate the weight (dependent variable) given height (independent variable) of a person, provided we have enough training examples. We discuss more on this in the coming section.
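
As a rough illustration of the height-weight example, the sketch below fits a straight line with `np.polyfit`. The heights and weights are made-up numbers, and `predict_weight` is a hypothetical helper, not part of the analysis that follows.

```python
import numpy as np

# Made-up heights (cm) and weights (kg), for illustration only.
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights = np.array([52.0, 58.0, 66.0, 74.0, 80.0])

# Fit weight = slope * height + intercept by ordinary least squares.
slope, intercept = np.polyfit(heights, weights, deg=1)

def predict_weight(height_cm):
    """Estimate weight (dependent variable) from height (independent variable)."""
    return slope * height_cm + intercept

print(slope, intercept, predict_weight(175.0))
```

Here the fitted line is the regression function: it estimates the dependent variable for heights that were not in the training examples.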

Regression analysis models the relationship between dependent and independent variables. It should be kept in mind that correlation between dependent and independent variables does not imply causation!

### Assumptions
Regression analysis has a few general assumptions while specific analysis techniques have added (or reduced) assumptions as well. The following are important general assumptions for regression analysis:
* The training dataset needs to be representative of the population being modeled.
* The independent variables are linearly independent, i.e., one independent variable cannot be explained as a linear combination of others. In other words, there should be no multicollinearity.
* The errors are homoscedastic, i.e., the variance of the error is constant across the sample.
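
One common way to check the multicollinearity assumption is the variance inflation factor (VIF): each predictor is regressed on the others, and VIF = 1 / (1 - R²). The sketch below is a plain NumPy illustration on synthetic data; the function `variance_inflation_factors` is hypothetical, and the arrays only borrow the names temp, atemp, and windspeed rather than using the real columns.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows are samples)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

# Synthetic predictors: atemp is nearly a copy of temp, windspeed is unrelated.
rng = np.random.default_rng(0)
temp = rng.normal(size=200)
atemp = temp + rng.normal(scale=0.1, size=200)
windspeed = rng.normal(size=200)

vifs = variance_inflation_factors(np.column_stack([temp, atemp, windspeed]))
print(vifs)
```

A VIF close to 1 suggests a predictor is not explained by the others, while large values (thresholds of 5 or 10 are common rules of thumb) point to multicollinearity.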

### Evaluation Criteria
Evaluation of model performance is an important aspect of modeling. We should be able not just to understand the outcomes but also to evaluate how models compare to each other and whether their performance is acceptable.
Although evaluation metrics and performance guidelines are largely use case and domain specific, regression analysis often relies on a few standard metrics.

#### Residual Analysis
#### Normality Test (Q-Q Plot)
#### R-Squared: Goodness of Fit
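
R-squared is usually defined as one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, with a hypothetical helper `r_squared` and made-up values:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for the given predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
perfect_fit = r_squared(y_true, y_true)     # predictions match exactly
mean_only = r_squared(y_true, [6.0] * 4)    # always predicting the mean
print(perfect_fit, mean_only)
```

A perfect fit scores 1, while a model that always predicts the mean of the target scores 0.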
#### Cross Validation
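
Cross validation estimates how a model performs on data it was not trained on by repeatedly holding out part of the dataset. The sketch below is a plain NumPy illustration of k-fold cross validation for a straight-line fit on synthetic data; `k_fold_rmse` is a hypothetical helper, and the shuffled split is an assumption that may not suit time-ordered data.

```python
import numpy as np

def k_fold_rmse(x, y, k=5, seed=0):
    """Return the held-out RMSE of a straight-line fit for each of k folds."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Shuffle the row indices once, then cut them into k roughly equal folds.
    indices = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(indices, k)
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only, then score on the held-out fold.
        slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
        pred = slope * x[test_idx] + intercept
        scores.append(np.sqrt(np.mean((y[test_idx] - pred) ** 2)))
    return np.array(scores)

# Synthetic data: a noisy straight line.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=100)

scores = k_fold_rmse(x, y, k=5)
print(scores.mean(), scores.std())
```

The mean of the fold scores gives a performance estimate, and their spread hints at how sensitive the model is to the particular training sample.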