# bike-sharing-analysis
A data analysis project exploring bike-sharing demand in Washington, D.C., USA, and the factors that influence it, for the period from January 1, 2011, to December 31, 2012. Data source: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset#

In [1]:
# import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
# load data
hourly_data = pd.read_csv('data/hour.csv')

In [8]:
# head
hourly_data.head(3)

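| | instant | dteday | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 0 | 6 | 0 | 1 | 0.24 | 0.2879 | 0.81 | 0.0 | 3 | 13 | 16 |
| 1 | 2 | 2011-01-01 | 1 | 0 | 1 | 1 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 8 | 32 | 40 |
| 2 | 3 | 2011-01-01 | 1 | 0 | 1 | 2 | 0 | 6 | 0 | 1 | 0.22 | 0.2727 | 0.8 | 0.0 | 5 | 27 | 32 |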


In [5]:
# generic statistics
print(f"shape of data is: {hourly_data.shape[0]} rows and {hourly_data.shape[1]} columns.")
print(f"Missing values in data: {hourly_data.isnull().sum().sum()}")

shape of data is: 17379 rows and 17 columns.
Missing values in data: 0


In [7]:
# stats on the numerical columns
hourly_data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| instant | 17379.0 | 8690.0 | 5017.0295 | 1.0 | 4345.5 | 8690.0 | 13034.5 | 17379.0 |
| season | 17379.0 | 2.50164 | 1.106918 | 1.0 | 2.0 | 3.0 | 3.0 | 4.0 |
| yr | 17379.0 | 0.502561 | 0.500008 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| mnth | 17379.0 | 6.537775 | 3.438776 | 1.0 | 4.0 | 7.0 | 10.0 | 12.0 |
| hr | 17379.0 | 11.546752 | 6.914405 | 0.0 | 6.0 | 12.0 | 18.0 | 23.0 |
| holiday | 17379.0 | 0.02877 | 0.167165 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| weekday | 17379.0 | 3.003683 | 2.005771 | 0.0 | 1.0 | 3.0 | 5.0 | 6.0 |
| workingday | 17379.0 | 0.682721 | 0.465431 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| weathersit | 17379.0 | 1.425283 | 0.639357 | 1.0 | 1.0 | 1.0 | 2.0 | 4.0 |
| temp | 17379.0 | 0.496987 | 0.192556 | 0.02 | 0.34 | 0.5 | 0.66 | 1.0 |


The columns fall into three groups:
- `temporal features`: information about the time at which the
record was registered. This group contains the dteday, season, yr, mnth, hr,
holiday, weekday, and workingday columns.
- `weather-related features`: information about the weather
conditions. The weathersit, temp, atemp, hum, and windspeed columns
are included in this group.
- `record-related features`: information about the number
of rides recorded for the specific hour and date. This group includes the casual,
registered, and cnt columns.
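
The grouping above can be written down as plain column lists, which makes it easy to select one group at a time. This is only a sketch: the `feature_groups` name and its keys are illustrative, and the one-row frame is copied from the `head()` output rather than read from `data/hour.csv`.

```python
import pandas as pd

# Column groups as described above; the dictionary name and keys are illustrative.
feature_groups = {
    "temporal": ["dteday", "season", "yr", "mnth", "hr",
                 "holiday", "weekday", "workingday"],
    "weather": ["weathersit", "temp", "atemp", "hum", "windspeed"],
    "record": ["casual", "registered", "cnt"],
}

# One row copied from the head() output, used here instead of the real file.
sample = pd.DataFrame([{
    "instant": 1, "dteday": "2011-01-01", "season": 1, "yr": 0, "mnth": 1,
    "hr": 0, "holiday": 0, "weekday": 6, "workingday": 0, "weathersit": 1,
    "temp": 0.24, "atemp": 0.2879, "hum": 0.81, "windspeed": 0.0,
    "casual": 3, "registered": 13, "cnt": 16,
}])

# Select a single group of columns.
weather_view = sample[feature_groups["weather"]]

# List any column that is not assigned to a group.
grouped = [col for cols in feature_groups.values() for col in cols]
ungrouped = set(sample.columns) - set(grouped)
print(ungrouped)
```

Only `instant`, the running record index, is left outside the three groups.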

## Data Preprocessing
Goal: encode the temporal features in a more human-readable form and rescale the normalized weather columns:
- season from 1-4 to winter, spring, summer, and fall
- yr from 0 and 1 to 2011 and 2012
- weekday from 0-6 to Sunday (0), Monday (1) through Saturday (6)
- scale the hum column to 0-100, as it represents percentages
- scale the windspeed column to values between 0 (min) and 67 (max)

In [9]:
# copy of dataset
hourly_data_clean = hourly_data.copy()

In [10]:
# seasons mapping using dictionary, apply and lambda functions
seasons_map = {1: 'winter',
               2: 'spring',
               3: 'summer',
               4: 'fall'}
hourly_data_clean.season = hourly_data_clean['season'].apply(lambda x: seasons_map[x])

In [11]:
# transform yr column
yr_map = {0: 2011, 1: 2012}
hourly_data_clean.yr = hourly_data_clean['yr'].apply(lambda x: yr_map[x])

In [12]:
# transform weekdays
day_map = {0: 'Sunday', 1: 'Monday', 2: 'Tuesday', 3: 'Wednesday', 4: 'Thursday', 5: 'Friday', 6: 'Saturday'}
hourly_data_clean.weekday = hourly_data_clean['weekday'].apply(lambda x: day_map[x])
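
The dictionary lookups above can also be expressed with `Series.map`, which accepts the dictionary directly. The sketch below uses a toy frame with one deliberately invalid code (9) to show the main behavioral difference: `map` turns unknown codes into `NaN`, whereas the `apply` plus `lambda` form raises a `KeyError`. The `toy`, `mapped`, and `unmapped` names are illustrative and not part of the notebook.

```python
import pandas as pd

day_map = {0: "Sunday", 1: "Monday", 2: "Tuesday", 3: "Wednesday",
           4: "Thursday", 5: "Friday", 6: "Saturday"}

# Toy data; 9 is a deliberately invalid weekday code.
toy = pd.DataFrame({"weekday": [6, 0, 3, 9]})

# Series.map looks up each value in the dictionary.
mapped = toy["weekday"].map(day_map)

# Codes missing from the dictionary become NaN instead of raising KeyError,
# so they can be listed and inspected afterwards.
unmapped = toy.loc[mapped.isna(), "weekday"].tolist()
print(mapped.tolist())
print(unmapped)
```

Which form is preferable depends on whether an unexpected code should stop the notebook immediately or be reported after the fact.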

#### Transforming weather-related columns

The weathersit column represents the current weather conditions, where
1 stands for clear weather with a few clouds, 2 represents cloudy weather,
3 relates to light snow or rain, and 4 stands for heavy snow or rain. The hum
column holds the normalized air humidity, with values from 0 to
1, so we multiply it by 100 to obtain percentages. Finally, the windspeed
column holds the wind speed, which is likewise normalized to values between 0 and 1
by dividing by the maximum of 67 m/s; multiplying by 67 restores the original scale.
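
As a quick worked example of the rescaling arithmetic, the sketch below applies both factors to a toy frame, assuming the raw columns are stored on a 0 to 1 scale. The `hum` values are taken from the `head()` output; the `windspeed` values and the `WINDSPEED_MAX` and `rescaled` names are illustrative.

```python
import pandas as pd

WINDSPEED_MAX = 67  # scaling factor stated in the text

# Normalized toy values (hum from head(); windspeed values are illustrative).
toy = pd.DataFrame({
    "hum": [0.81, 0.80, 0.52],
    "windspeed": [0.0, 0.1343, 1.0],
})

# assign() returns a new frame and leaves the normalized input untouched.
rescaled = toy.assign(
    hum=toy["hum"] * 100,                        # fraction -> percent
    windspeed=toy["windspeed"] * WINDSPEED_MAX,  # 0-1 -> 0-67
)
print(rescaled.round(4))
```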

In [13]:
# transform weathersit
weather_map = {1: 'clear', 2: 'cloudy', 3: 'light_rain_snow', 4: 'heavy_rain_snow'}
hourly_data_clean.weathersit = hourly_data_clean['weathersit'].apply(lambda x: weather_map[x])

In [14]:
# rescale hum and windspeed columns
hourly_data_clean.hum = hourly_data_clean['hum']*100
hourly_data_clean.windspeed = hourly_data_clean['windspeed']*67

#### Test

In [16]:
# visualize the changes in the columns
cols = ['season', 'yr', 'weekday', 'weathersit', 'hum', 'windspeed']
hourly_data_clean[cols].sample(10, random_state=123)

| | season | yr | weekday | weathersit | hum | windspeed |
|---|---|---|---|---|---|---|
| 5792 | summer | 2011 | Saturday | clear | 74.0 | 8.9981 |
| 7823 | fall | 2011 | Sunday | clear | 43.0 | 31.0009 |
| 15426 | fall | 2012 | Tuesday | cloudy | 77.0 | 6.0032 |
| 15028 | fall | 2012 | Sunday | clear | 51.0 | 22.0028 |
| 12290 | spring | 2012 | Friday | cloudy | 89.0 | 12.998 |
| 3262 | spring | 2011 | Friday | clear | 64.0 | 7.0015 |
| 10763 | spring | 2012 | Thursday | clear | 42.0 | 23.9994 |
| 12384 | spring | 2012 | Tuesday | light_rain_snow | 82.0 | 11.0014 |
| 6051 | summer | 2011 | Wednesday | clear | 52.0 | 19.0012 |
| 948 | winter | 2011 | Saturday | clear | 80.0 | 0.0 |
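
Inspecting a random sample only covers the rows that happen to be drawn. A complementary approach is to assert the expected categories and ranges over the whole frame. The helper below is a hypothetical sketch (`check_cleaned` is not part of the notebook), and it is demonstrated on three rows copied from the sample rather than on the full dataset.

```python
import pandas as pd

def check_cleaned(df):
    """Return a list of problems found in a cleaned frame (empty if none)."""
    expected = {
        "season": {"winter", "spring", "summer", "fall"},
        "yr": {2011, 2012},
        "weekday": {"Sunday", "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday"},
        "weathersit": {"clear", "cloudy", "light_rain_snow", "heavy_rain_snow"},
    }
    problems = []
    # Categorical columns must contain only the mapped labels.
    for col, allowed in expected.items():
        unexpected = set(df[col].unique()) - allowed
        if unexpected:
            problems.append(f"{col}: unexpected values {sorted(unexpected, key=str)}")
    # Rescaled columns must stay inside their documented ranges (inclusive).
    if not df["hum"].between(0, 100).all():
        problems.append("hum: values outside 0-100")
    if not df["windspeed"].between(0, 67).all():
        problems.append("windspeed: values outside 0-67")
    return problems

# Three rows copied from the sample output.
rows = pd.DataFrame({
    "season": ["summer", "fall", "winter"],
    "yr": [2011, 2011, 2011],
    "weekday": ["Saturday", "Sunday", "Saturday"],
    "weathersit": ["clear", "clear", "clear"],
    "hum": [74.0, 43.0, 80.0],
    "windspeed": [8.9981, 31.0009, 0.0],
})
print(check_cleaned(rows))
```

In the notebook, the same call would presumably be `check_cleaned(hourly_data_clean)`, with an empty list indicating that all transformations behaved as intended.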


## Analysis