# <center>Data Science Training - AXA Data Innovation Lab - Hands-on</center>
<center><b>Hands-on sessions</b><br>
Nathaniel Bern, June 2016</center>

# Goals and objectives <font color='blue'> (10 min) </font>

The series of IPython notebooks will guide you through the study of NYC Bikes Data, from data collection and cleaning to visualization, feature engineering and modelling. 

We will predict whether a trip was made by a customer or a subscriber, using several machine learning algorithms.

There will be **4 IPython notebooks**, one for each step:
1. Data collection and cleaning
2. Visualization
3. Feature engineering
4. Data modeling

### <font color='red'>Steps that you will need to complete will be written in red </font>
### <font color='blue'>The time that you will be given will be written in blue </font>

# Data collection and cleaning

Data has been downloaded from https://www.citibikenyc.com/system-data, where it is published as open data by *Citi Bike*.

The dataset consists of all bike trips in NYC that happened in June 2015, with the following information:
- Trip Duration (seconds)
- Start Time and Date
- Stop Time and Date
- Start Station Name
- End Station Name
- Station ID 
- Station Lat/Long
- Bike ID
- User Type (Customer = 24-hour pass or 7-day pass user; Subscriber = Annual Member)
- Gender (Zero=unknown; 1=male; 2=female)
- Year of Birth

The **pandas** package will be used all throughout the notebooks to load data and easily perform data analysis.

## Steps
- Core dataset collection
- Weather and temperature enrichment
- Number of docks per station enrichment

# Google doc with code corrections is accessible at:
### https://docs.google.com/document/d/1q63UIVenEinpAlK_bzt4paao_g4a0xQEMQLwX0qfHyE/edit?usp=sharing

# Necessary packages to study this dataset

If you are running these notebooks from your own computer, you will be required to have the following packages installed:

- **Basic packages** from Anaconda such as *numpy*, *pandas*, *scikit-learn*, ...
- seaborn
- Basemap
- haversine
- networkx
- folium
- urllib2 (part of the Python 2 standard library, no separate installation needed)

# 0) Using IPython notebook: tricks and tips <font color='blue'> (10 min) </font>

An IPython notebook is made of boxes (also called cells), which can be run independently of one another. They can contain either *Python code* or **Markdown** text, as in this box.

- Click on the <b>LEFT SIDE</b> of a box, and press **A** to create a box **Above**, and **B** to create a box **Below**
- Click on the <b>LEFT SIDE</b> of a box, and double-press **D** to delete it
- Click on the <b>LEFT SIDE</b> of a box, and press **X** to cut it, and **V** to paste it
- Press **Shift + Enter** to run a box and go to the next one

By default, boxes accept Python code. To create a **Markdown** box, click on the left side of the box and press **m** for **markdown**; you can then fill the box with Markdown-formatted text. For instance, if you write ### Title 1 in a **Markdown box** and press **Shift + Enter** to run it, the text is rendered as a heading.

### <font color='red'>0.1) Run the following lines by pressing Shift + Enter</font>

In [1]:
print 'Hello World'

Hello World


In [2]:
user_name = "Data Scientist"  # Enter your name between the quotation marks
user_age = 34  # Enter your age here

print 'Hello, my name is {}'.format(user_name)
print 'I am {} years old'.format(user_age)

Hello, my name is Data Scientist
I am 34 years old


### <font color='red'>0.2) Create a Markdown cell by:</font>
- clicking on the left side of the cell
- typing m
- entering in the cell the following text (include the sharp signs): *### This is a Markdown cell*
- running the cell with **Shift + Enter**

### <font color='red'>0.3) Add/Delete cells by clicking on the left side of a cell, and pressing "a" (above), "b" (below) and "dd" (delete)</font>

### Importing the necessary packages

Packages are imported at the beginning of the file, and can be aliased. For instance, the **pandas** package can be aliased as **pd**, the **numpy** package as **np**, and all **pandas** functions will be called with **pd.function** instead of **pandas.function**

In [3]:
from __future__ import division  # This allows for float division

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%pylab inline 

Populating the interactive namespace from numpy and matplotlib




<b>TIP</b>: You can press <b>TAB</b> to auto-complete a command

### <font color='red'> 0.4) Start typing the user_name variable ('user_name') and press TAB to get auto-completion. Choose the right variable by pressing ENTER </font>

In [4]:
#### TYPE user.... + TAB and choose the right variable by pressing ENTER ####

### <font color='red'>0.5) Import the pandas package and give it the alias <i>pd</i></font>

In [5]:
import pandas as pd

### <font color='red'>0.6) Get help on a module by calling the <i>help()</i> function with the aliased module as the parameter</font>

### <font color='red'>0.7) Get help on module functions by calling the <i>help()</i> function on them</font>

In [6]:
help(sns.pointplot)

Help on function pointplot in module seaborn.categorical:

pointplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, estimator=<function mean>, ci=95, n_boot=1000, units=None, markers='o', linestyles='-', dodge=False, join=True, scale=1, orient=None, color=None, palette=None, ax=None, errwidth=None, capsize=None, **kwargs)
    Show point estimates and confidence intervals using scatter plot glyphs.
    
    A point plot represents an estimate of central tendency for a numeric
    variable by the position of scatter plot points and provides some
    indication of the uncertainty around that estimate using error bars.
    
    Point plots can be more useful than bar plots for focusing comparisons
    between different levels of one or more categorical variables. They are
    particularly adept at showing interactions: how the relationship between
    levels of one categorical variable changes across levels of a second
    categorical variable. The lines that join each po

## 1) Core dataset collection <font color='blue'> (30 min) </font>

### <font color='red'>1.1) Load data by using the <i>read_csv</i> function of pandas (which has been aliased as pd) ; data is located in <i>'../data/raw_data.csv'</i></font>

In [7]:
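# Path used for this run; the exercise statement refers to '../data/raw_data.csv', so adjust it to where your copy is stored
orig_data = pd.read_csv('./201506-citibike-tripdata.csv')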

### <font color='red'>Run the following line so data is copied and does not need to be reloaded in case of mistakes</font>

In [8]:
data = orig_data.copy()

### <font color='red'>1.2) Show a sample (or several samples) of data by calling the method <i>.sample()</i> on the <i>data</i> dataframe</font>

In [9]:
data.sample(3)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
159605,813,6/7/2015 12:06,6/7/2015 12:20,448,W 37 St & 10 Ave,40.756604,-73.997901,486,Broadway & W 29 St,40.746201,-73.988557,19072,Subscriber,1976.0,1
610423,2016,6/20/2015 19:47,6/20/2015 20:20,387,Centre St & Chambers St,40.712733,-74.004607,217,Old Fulton St,40.702772,-73.993836,22079,Customer,,0
450276,398,6/16/2015 7:41,6/16/2015 7:47,484,W 44 St & 5 Ave,40.755003,-73.980144,468,Broadway & W 55 St,40.765265,-73.981923,17430,Subscriber,1976.0,1


### <font color='red'>1.3) Describe your data by calling the function <i>.describe()</i> on the <i>data</i> dataframe</font>

In [10]:
data.describe()

Unnamed: 0,tripduration,start station id,start station latitude,start station longitude,end station id,end station latitude,end station longitude,bikeid,birth year,gender
count,941219.0,941219.0,941219.0,941219.0,941219.0,941219.0,941219.0,941219.0,810827.0,941219.0
mean,904.6028,459.987285,40.73475,-73.991312,458.256856,40.73442,-73.991443,18174.511302,1976.25638,1.061284
std,3446.744,384.455415,0.019363,0.01242,383.179189,0.019332,0.01243,2113.307306,11.465324,0.579512
min,60.0,72.0,40.680342,-74.017134,72.0,40.680342,-74.017134,14529.0,1885.0,0.0
25%,401.0,306.0,40.721101,-74.001497,305.0,40.720828,-74.001547,16370.0,1968.0,1.0
50%,646.0,415.0,40.736494,-73.990985,411.0,40.736245,-73.990985,18158.0,1979.0,1.0
75%,1064.0,492.0,40.7502,-73.98205,490.0,40.749156,-73.98205,19944.0,1985.0,1.0
max,1691873.0,3002.0,40.771522,-73.950048,3002.0,40.771522,-73.950048,22364.0,1999.0,2.0


### <font color='red'>1.4) Print a column using <i>data.column_name</i> or <i>data['column_name']</i> (<i>data.columns</i> lists the available names)</font>

In [11]:
data.columns

Index([u'tripduration', u'starttime', u'stoptime', u'start station id',
       u'start station name', u'start station latitude',
       u'start station longitude', u'end station id', u'end station name',
       u'end station latitude', u'end station longitude', u'bikeid',
       u'usertype', u'birth year', u'gender'],
      dtype='object')
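
As a small illustration of step 1.4, the sketch below builds a toy DataFrame (the values are made up for the example; only the two column names come from the dataset) and reads a column both ways. Attribute access only works for names that are valid Python identifiers, so a column containing a space, such as `birth year`, needs the bracket form. The sketch is written for Python 3.

```python
import pandas as pd

# Toy frame with hypothetical values, reusing two column names from the dataset
toy = pd.DataFrame({
    'tripduration': [813, 2016, 398],
    'birth year': [1976.0, None, 1976.0],
})

# Attribute access works because 'tripduration' is a valid identifier
durations = toy.tripduration

# Bracket access is required for 'birth year' because of the space
birth_years = toy['birth year']

print(durations.tolist())
print(birth_years.isnull().sum())
```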

### Filtering

Filters matter for every dataset: raw data is noisy, and filtering keeps it clean and understandable enough to train models on.
In pandas, a slicing condition for a dataframe can be expressed, for instance, as:
- **slice_condition = data.gender == 1**

Data can then be sliced with the following :
- **filtered_data = data[slice_condition]**
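
The two steps above can be sketched on a toy DataFrame (hypothetical values, Python 3). The slicing condition is a boolean Series with one entry per row, and several conditions can be combined with `&` and `|` as long as each one is wrapped in parentheses.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    'tripduration': [300, 7200, 900, 5400],
    'gender': [1, 2, 1, 0],
})

# A slicing condition is a boolean Series, one True/False per row
slice_condition = toy.gender == 1
filtered = toy[slice_condition]

# Conditions combine with & (and) and | (or); each side needs parentheses
short_male_trips = toy[(toy.gender == 1) & (toy.tripduration <= 600)]

print(len(filtered))
print(len(short_male_trips))
```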

### <font color='red'>1.5) Get rid of trips that are more than 1.5 hours long (watch out for variable units !)</font>

In [12]:
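# tripduration is expressed in seconds; convert it to hours before comparing with 1.5
conv = 0.000277778  # approximately 1 / 3600
tripduration_hour = data.tripduration * conv  # temporary Series, not stored as a column
data = data[tripduration_hour <= 1.5]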

### <font color='red'>1.6) Describe a few columns within the dataset using the following functions on the columns:</font>
- len()
- .column_name.unique()
- .column_name.mean()
- .column_name.median()

In [13]:
len(data)

935635

In [14]:
len(orig_data)

941219

### Changing datetime to datetime format

### <font color='red'>1.7) The following block uses the <i>pd.to_datetime()</i> function with the <i>format</i> keyword to create datetime objects from the string columns starttime and stoptime.</font><br>

<font color='red'>Look at the string format of the <i>starttime</i> and <i>stoptime</i> columns in the initial dataset, and notice how the <i>format</i> argument exactly matches the structure of the strings in the raw columns.</font>

In [15]:
#### WHAT IS THE FORMAT OF starttime AND stoptime COLUMNS IN THE INITIAL DATASET ? ####

In [16]:
data['starttime_formatted'] =  pd.to_datetime(data['starttime'], format="%m/%d/%Y %H:%M")
data['stoptime_formatted'] =  pd.to_datetime(data['stoptime'], format="%m/%d/%Y %H:%M")
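
To see how the format string lines up with the raw text, here is a minimal sketch (Python 3) that parses two start times copied from the raw sample shown earlier. Each directive consumes one piece of the string: `%m/%d/%Y` for month/day/year, then `%H:%M` for hours and minutes on a 24-hour clock.

```python
import pandas as pd

# Two start times copied from the raw sample shown earlier
raw = pd.Series(['6/7/2015 12:06', '6/20/2015 19:47'])

# %m/%d/%Y reads month/day/year, %H:%M reads 24-hour clock hours and minutes
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

print(parsed.dt.day.tolist())
print(parsed.dt.hour.tolist())
```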

### <font color='red'>Run the following block to overwrite the starttime and stoptime columns (WARNING: THIS WILL OVERWRITE THE INITIAL COLUMNS)</font>

In [17]:
data['starttime'] =  data['starttime_formatted']
data['stoptime'] =  data['stoptime_formatted']
del data['starttime_formatted']
del data['stoptime_formatted']

### <font color='red'>1.8) Show a sample of observations and check that the <i>starttime</i> and <i>stoptime</i> columns are now in datetime format</font>

In [18]:
data.sample(2)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
871542,310,2015-06-29 08:46:00,2015-06-29 08:51:00,383,Greenwich Ave & Charles St,40.735238,-74.000271,236,St Marks Pl & 2 Ave,40.728419,-73.98714,19870,Subscriber,1985.0,1
377344,928,2015-06-13 14:31:00,2015-06-13 14:47:00,147,Greenwich St & Warren St,40.715422,-74.01122,225,W 14 St & The High Line,40.741951,-74.00803,20082,Subscriber,1998.0,1


### Fill missing values for birth year: missing values are shown as NaN (Not a Number)

In [19]:
data[data['birth year'].isnull()].sample(3)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender
677856,571,2015-06-23 02:30:00,2015-06-23 02:39:00,312,Allen St & E Houston St,40.722055,-73.989111,401,Allen St & Rivington St,40.720196,-73.989978,16646,Customer,,0
210972,1518,2015-06-08 19:07:00,2015-06-08 19:32:00,72,W 52 St & 11 Ave,40.767272,-73.993929,405,Washington St & Gansevoort St,40.739323,-74.008119,17450,Customer,,0
777116,930,2015-06-25 17:40:00,2015-06-25 17:55:00,267,Broadway & W 36 St,40.750977,-73.987654,2012,E 27 St & 1 Ave,40.739445,-73.976806,14610,Customer,,0


### <font color='red'>1.9) Print the proportion of missing birth years, using:</font>
- data["column_name"].isnull()
- data["column_name"].sum()
- len()

In [20]:
data.isnull().sum()

tripduration                    0
starttime                       0
stoptime                        0
start station id                0
start station name              0
start station latitude          0
start station longitude         0
end station id                  0
end station name                0
end station latitude            0
end station longitude           0
bikeid                          0
usertype                        0
birth year                 126090
gender                          0
dtype: int64
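
The output above gives counts rather than a proportion. With 126090 missing birth years out of 935635 remaining trips, the proportion is roughly 13.5%. The sketch below shows one way to compute such a proportion on a toy column (hypothetical values, Python 3), using the three hints listed in step 1.9.

```python
import pandas as pd

# Toy column with hypothetical values; None becomes NaN
toy = pd.DataFrame({'birth year': [1976.0, None, 1985.0, None, 1990.0]})

missing_mask = toy['birth year'].isnull()   # True where the value is missing
n_missing = missing_mask.sum()              # each True counts as 1
proportion_missing = n_missing / float(len(toy))

print('{:.1%}'.format(proportion_missing))
```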

### <font color='red'>1.10) Print the median birth year</font>

In [21]:
median_year = data['birth year'].median()
print 'median year of birth: {}'.format(median_year)

median year of birth: 1979.0


### <font color='red'>1.11) Replace the missing birth years with the median birth year, using the <i>data[column].fillna</i> function</font>

In [22]:
data['birth year'] = data['birth year'].fillna(median_year)
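
A minimal sketch of the median imputation on a toy column (hypothetical values, Python 3). Two points are worth noticing: the median is computed on the known values only, and `fillna` returns a new Series that has to be assigned back.

```python
import pandas as pd

# Toy column with hypothetical values
toy = pd.DataFrame({'birth year': [1970.0, None, 1980.0, 1990.0, None]})

# median() skips NaN by default, so it is computed on the known years only
median_year = toy['birth year'].median()

# fillna returns a new Series; assign it back to keep the result
toy['birth year'] = toy['birth year'].fillna(median_year)

print(median_year)
print(toy['birth year'].tolist())
```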

## 2) Enrich dataset with hourly weather and temperature from weather websites <font color='blue'> (40 min) </font>

We can get historical New York City weather at : 
http://www.wunderground.com/history/

### <font color='red'>2.1) Go to the website, and check for weather and temperature in New York City on a random day in June 2015</font>

### <font color='red'>2.2) Run the following block so as to import the right packages</font>

In [23]:
import urllib2
import json
import re

### <font color='red'>2.3) Use the <i>urllib2.urlopen</i> function to open the following webpage:</font>
http://www.wunderground.com/history/airport/KNYC/2015/6/1/DailyHistory.html?req_city=New%20York&req_state=NY&req_statename=New%20York&reqdb.zip=10001&reqdb.magic=5&reqdb.wmo=99999&format=1

In [24]:
lync = 'http://www.wunderground.com/history/airport/KNYC/2015/6/1/DailyHistory.html?req_city=New%20York&req_state=NY&req_statename=New%20York&reqdb.zip=10001&reqdb.magic=5&reqdb.wmo=99999&format=1'

In [25]:
response =  urllib2.urlopen(lync)

### <font color='red'>2.3) Read the response with the <i>.read()</i> function</font>

In [26]:
html = response.read()

### <font color='red'>2.4) Print the html response here </font>

In [27]:
print html


TimeEDT,TemperatureC,Dew PointC,Humidity,Sea Level PressurehPa,VisibilityKm,Wind Direction,Wind SpeedKm/h,Gust SpeedKm/h,Precipitationmm,Events,Conditions,WindDirDegrees,DateUTC<br />
12:51 AM,13.9,12.2,89,1019.8,12.9,East,16.7,37.0,0.00,,Overcast,80,2015-06-01 04:51:00<br />
1:51 AM,13.9,12.2,89,1019.4,16.1,East,16.7,29.6,N/A,,Overcast,90,2015-06-01 05:51:00<br />
2:51 AM,13.9,12.2,89,1019.2,11.3,ENE,14.8,-,0.00,Rain,Light Rain,60,2015-06-01 06:51:00<br />
3:20 AM,14.4,12.8,90,1019.9,2.8,ENE,13.0,-,0.04,Rain,Heavy Rain,70,2015-06-01 07:20:00<br />
3:34 AM,13.9,12.8,93,1019.9,4.8,ENE,11.1,-,0.21,Rain,Light Rain,60,2015-06-01 07:34:00<br />
3:51 AM,13.9,12.2,89,1019.3,8.0,ENE,7.4,-,0.23,Rain,Rain,70,2015-06-01 07:51:00<br />
3:59 AM,13.9,12.8,93,1020.2,2.8,Variable,7.4,-,0.04,Rain,Heavy Rain,0,2015-06-01 07:59:00<br />
4:08 AM,13.9,12.2,89,1019.9,3.2,ENE,13.0,-,0.12,Rain,Rain,60,2015-06-01 08:08:00<br />
4:11 AM,13.9,12.2,89,1019.9,4.8,ENE,16.7,27.8,0.12,Rain,Rain,70,2015-06-01 08:11:0

### <font color='red'>2.5) Use the regular expression function <i>re.sub()</i> to substitute the <i>br</i> tag with an empty string:</font>

In [28]:
html_csv_file = re.sub(r'<br />','', html)

### <font color='red'>Run the following lines to load the cleaned, CSV-like text into a DataFrame:</font>

In [29]:
from StringIO import StringIO
raw_weather_data = pd.read_csv(StringIO(html_csv_file))
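
The cleaning and loading steps can be tried offline with the sketch below (Python 3, so `StringIO` comes from the `io` module rather than the Python 2 `StringIO` module used in the notebook). The sample text reuses the first rows of the response printed above, shortened to four columns for readability.

```python
import re
from io import StringIO  # Python 3 location of StringIO

import pandas as pd

# First rows of the response printed above, shortened to four columns
html = (
    'TimeEDT,TemperatureC,Conditions,DateUTC<br />\n'
    '12:51 AM,13.9,Overcast,2015-06-01 04:51:00<br />\n'
    '1:51 AM,13.9,Overcast,2015-06-01 05:51:00<br />\n'
)

# Remove the trailing <br /> tags so that each line is plain CSV
csv_text = re.sub(r'<br />', '', html)

# StringIO wraps the string in a file-like object that read_csv can consume
weather = pd.read_csv(StringIO(csv_text))

print(weather.columns.tolist())
print(len(weather))
```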

### <font color='red'>2.6) Print <i>samples</i> of this dataframe</font>

In [30]:
raw_weather_data.sample()

Unnamed: 0,TimeEDT,TemperatureC,Dew PointC,Humidity,Sea Level PressurehPa,VisibilityKm,Wind Direction,Wind SpeedKm/h,Gust SpeedKm/h,Precipitationmm,Events,Conditions,WindDirDegrees,DateUTC
19,11:10 AM,12.8,11.7,93,1020.9,3.2,East,16.7,33.3,0.06,Rain,Rain,80,2015-06-01 15:10:00


### <font color='red'>2.7) Keep the following columns in the dataframe : Temperature, Precipitation, Conditions and DateUTC</font>

In [31]:
col_to_keep = ['TemperatureC','Precipitationmm','Conditions','DateUTC']
raw_weather_data = raw_weather_data[col_to_keep]

In [32]:
raw_weather_data.head(2)

Unnamed: 0,TemperatureC,Precipitationmm,Conditions,DateUTC
0,13.9,0.0,Overcast,2015-06-01 04:51:00
1,13.9,,Overcast,2015-06-01 05:51:00


### Enrich every day in June

### <font color='red'>2.8.1) Which URL gives the weather and temperature information for June 29th, 2015? And for July 2nd?</font>

In [33]:
url_example = 'https://www.wunderground.com/history/airport/KNYC/2015/6/29/DailyHistory.html?req_city=New+York&req_state=NY&req_statename=New+York&reqdb.zip=10001&reqdb.magic=11&reqdb.wmo=99999'

### <font color='red'>2.8.2) Isolate fixed parts of the webpage address (URL)</font>

In [35]:
def get_url(url_year, url_month, url_day):
    # Fixed parts of the address: site and station path, then the query string
    url_base = 'https://www.wunderground.com/history/airport/KNYC/'
    url_query = '/DailyHistory.html?req_city=New%20York&req_state=NY&req_statename=New%20York&reqdb.zip=10001&reqdb.magic=5&reqdb.wmo=99999&format=1'
    # Variable part: year/month/day
    return '{}{}/{}/{}{}'.format(url_base, url_year, url_month, url_day, url_query)

In [36]:
get_url(2015,6,29)

'https://www.wunderground.com/history/airport/KNYC/2015/6/29/DailyHistory.html?req_city=New%20York&req_state=NY&req_statename=New%20York&reqdb.zip=10001&reqdb.magic=5&reqdb.wmo=99999&format=1'

### <font color='red'>2.9) For every day from May 30th to July 2nd :
- Scrape weather data from the website using the adequate URL
- Transform the html file into a csv-like format as we did above
- Append the result to the existing raw_weather_data dataframe</font>

<b>WARNING</b>: the urls are requested from the Internet, so you might run into connection problems. If this happens, since the data has already been uploaded on the servers, you can request it with the <i>urllib2.urlopen</i> function:
`urllib2.urlopen('file:///home/data/weather_webpages/weather_{}_{}.html'.format(month, day))`

In [37]:
daily_frames = []
# (month, days) pairs covering May 30th to July 2nd
days_by_month = [(5, [30, 31]), (6, range(1, 31)), (7, [1, 2])]

for month, day_range in days_by_month:
    for day in day_range:  # one request per day, results collected in a list
        #print 'Collecting weather for {}/{} ...'.format(month, day)
        response = urllib2.urlopen(get_url(2015, month, day))
        html = response.read()

        csv_file = re.sub(r'<br />', '', html)  # strip the HTML line breaks
        daily_frames.append(pd.read_csv(StringIO(csv_file)))

# Concatenate once at the end; each daily frame keeps its own 0..n index
raw_weather_data = pd.concat(daily_frames)
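
As an alternative way to enumerate the days of step 2.9, the sketch below (Python 3) uses `pd.date_range` to list every calendar day between the two bounds. It only builds the (month, day) pairs that a helper such as `get_url(2015, month, day)` expects, and does not send any request.

```python
import pandas as pd

# All calendar days from May 30th to July 2nd 2015, both ends included
days = pd.date_range(start='2015-05-30', end='2015-07-02', freq='D')

# (month, day) pairs, in the form expected by get_url(2015, month, day)
month_day_pairs = [(d.month, d.day) for d in days]

print(len(month_day_pairs))
print(month_day_pairs[0], month_day_pairs[-1])
```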

In [38]:
raw_weather_data.to_csv('./appel_API.csv',index=False)

### <font color='red'>2.10) Copy the data using <i>.copy()</i> and print samples</font>

In [39]:
weather_data = raw_weather_data.copy()

### <font color='red'>2.11) Transform the DateUTC column to standard datetime format, using <i>pd.to_datetime()</i>. Do not hesitate to add cells to test your code, before assigning the values to the existing dataset.</font>

In [40]:
weather_data['DateUTC'].head()

0    2015-05-30 04:51:00
1    2015-05-30 05:51:00
2    2015-05-30 06:51:00
3    2015-05-30 07:51:00
4    2015-05-30 08:51:00
Name: DateUTC, dtype: object

In [41]:
# The strings already follow the year-month-day layout, but the column dtype is object, so convert it to datetime
weather_data['DateUTC'] = pd.to_datetime(weather_data['DateUTC'], format="%Y-%m-%d %H:%M:%S")

### Rounding the timeslots

### <font color='red'> Run the following function, which rounds hours </font>

In [42]:
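import datetime  # made explicit here; round_hour relies on the datetime module

def round_hour(dt):
    # Round down before half past the hour, up otherwise
    if dt.minute < 30:
        return datetime.datetime(dt.year, dt.month, dt.day, dt.hour)
    else:
        return datetime.datetime(dt.year, dt.month, dt.day, dt.hour) + datetime.timedelta(hours=1)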
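
To check how `round_hour` behaves, the sketch below repeats the same logic so that it runs on its own (Python 3) and applies it to three timestamps chosen for illustration. The last one shows that adding a `timedelta` takes care of rolling over to the next day and month.

```python
import datetime


def round_hour(dt):
    """Round a datetime to the nearest hour (30 minutes and above round up)."""
    floored = datetime.datetime(dt.year, dt.month, dt.day, dt.hour)
    if dt.minute < 30:
        return floored
    return floored + datetime.timedelta(hours=1)


early = round_hour(datetime.datetime(2015, 6, 1, 4, 20))
late = round_hour(datetime.datetime(2015, 6, 1, 4, 51))
# Adding a timedelta handles the roll-over to the next day
midnight = round_hour(datetime.datetime(2015, 6, 30, 23, 45))

print(early, late, midnight)
```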

### <font color='red'>2.12) Round the DateUTC hours column with <i>dataframe.column.apply()</i> function</font>

In [43]:
dt =list(weather_data['DateUTC'])[0]

In [44]:
datetime.datetime(dt.year, dt.month, dt.day, dt.hour)

datetime.datetime(2015, 5, 30, 4, 0)

In [45]:
weather_data['DateUTC'] = weather_data['DateUTC'].apply(round_hour)

### <font color='red'>2.13) Show data samples once rounded</font>

In [46]:
weather_data.sample(3)

Unnamed: 0,TimeEDT,TemperatureC,Dew PointC,Humidity,Sea Level PressurehPa,VisibilityKm,Wind Direction,Wind SpeedKm/h,Gust SpeedKm/h,Precipitationmm,Events,Conditions,WindDirDegrees,DateUTC
11,9:51 AM,13.3,8.3,72,1025.6,16.1,East,11.1,-,,,Clear,90,2015-06-04 14:00:00
44,6:51 PM,25.6,19.4,68,1007.2,16.1,Variable,7.4,-,0.01,,Scattered Clouds,0,2015-06-21 23:00:00
1,1:51 AM,20.0,13.3,65,1007.7,16.1,North,-9999.0,-,,,Clear,0,2015-06-29 06:00:00


### <font color='red'>What does the following block do ?</font>

In [47]:
weather_data = weather_data.groupby('DateUTC', as_index=False).agg({
                                    'Precipitationmm': np.nanmean,
                                    'TemperatureC': np.nanmean,
                                    'Conditions': lambda x: x.value_counts().index[0]
                                    })



In [48]:
weather_data

Unnamed: 0,DateUTC,Precipitationmm,Conditions,TemperatureC
0,2015-05-30 05:00:00,,Clear,19.400000
1,2015-05-30 06:00:00,,Clear,20.000000
2,2015-05-30 07:00:00,,Clear,19.400000
3,2015-05-30 08:00:00,,Clear,20.000000
4,2015-05-30 09:00:00,,Clear,20.000000
5,2015-05-30 10:00:00,,Clear,20.000000
6,2015-05-30 11:00:00,,Clear,20.600000
7,2015-05-30 12:00:00,,Clear,21.700000
8,2015-05-30 13:00:00,,Clear,22.800000
9,2015-05-30 14:00:00,,Partly Cloudy,23.900000
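
The grouping block above can be understood on a toy table (hypothetical values, Python 3). After rounding, several readings can share the same hour; `groupby` gathers them and `agg` reduces each column with its own rule. This sketch uses the built-in `'mean'` aggregation, which skips NaN values, in place of `np.nanmean`; that substitution is an assumption made for the example.

```python
import pandas as pd

# Toy readings with hypothetical values: two observations fall in the 05:00 slot
toy = pd.DataFrame({
    'DateUTC': pd.to_datetime(['2015-06-01 05:00', '2015-06-01 05:00',
                               '2015-06-01 06:00']),
    'TemperatureC': [13.0, 15.0, 14.0],
    'Precipitationmm': [None, 0.2, None],
    'Conditions': ['Rain', 'Rain', 'Overcast'],
})

hourly = toy.groupby('DateUTC', as_index=False).agg({
    'TemperatureC': 'mean',        # the mean skips NaN values
    'Precipitationmm': 'mean',     # stays NaN when every value in the slot is missing
    'Conditions': lambda s: s.value_counts().index[0],  # most frequent label
})

print(hourly)
```

The second row keeps a NaN precipitation because its only reading is missing, which is why so many precipitation values are still missing after this step.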


### <font color='red'>What do the following blocks do ?</font>

In [49]:
print 'Missing precipitation values: {:%}'.format(weather_data.Precipitationmm.isnull().mean())

Missing precipitation values: 84.926471%


In [50]:
weather_data.sample(5)

Unnamed: 0,DateUTC,Precipitationmm,Conditions,TemperatureC
365,2015-06-14 10:00:00,,Clear,22.2
291,2015-06-11 08:00:00,,Partly Cloudy,22.8
171,2015-06-06 08:00:00,,Overcast,16.1
395,2015-06-15 16:00:00,,Haze,26.4
218,2015-06-08 07:00:00,,Overcast,17.2


### Fill NaN values with forward and backward methods

### <font color='red'>Run the following blocks; what do <i>fillna(method='pad')</i> and <i>fillna(method='bfill')</i> do ?</font>

In [51]:
weather_data[['TemperatureC','Precipitationmm']] = (
    weather_data[['TemperatureC','Precipitationmm']].apply(lambda x: np.round(x,2)))

In [52]:
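# Forward fill first, then backward fill whatever is still missing at the start
weather_data['Precipitationmm'] = weather_data['Precipitationmm'].fillna(method="pad")
weather_data['Precipitationmm'] = weather_data['Precipitationmm'].fillna(method="bfill")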
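
The effect of the two fills can be seen on a short Series (hypothetical values, Python 3). The sketch uses `.ffill()` and `.bfill()`, which are the shorthand methods for `fillna(method='pad')` and `fillna(method='bfill')`.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly precipitation with gaps at the start and in the middle
precipitation = pd.Series([np.nan, 0.2, np.nan, np.nan, 0.5])

# Forward fill ("pad"): each gap takes the last known value before it
forward = precipitation.ffill()

# Backward fill: the gaps that remain (here the first row) take the next known value
filled = forward.bfill()

print(forward.tolist())
print(filled.tolist())
```

Forward filling alone leaves the first row empty because nothing precedes it, which is why the backward fill is applied afterwards.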

### <font color='red'>2.14) Merge data and weather data</font>

In [53]:
weather_data.head()

Unnamed: 0,DateUTC,Precipitationmm,Conditions,TemperatureC
0,2015-05-30 05:00:00,0.01,Clear,19.4
1,2015-05-30 06:00:00,0.01,Clear,20.0
2,2015-05-30 07:00:00,0.01,Clear,19.4
3,2015-05-30 08:00:00,0.01,Clear,20.0
4,2015-05-30 09:00:00,0.01,Clear,20.0


In [54]:
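weather_data.rename(columns = {'DateUTC' : 'join_time'}, inplace=True)  # This renames the weather data Date column
data['join_time'] = data.starttime.apply(round_hour)  # This rounds the time slots for the start time in initial dataset

# Left merge on the shared join_time column, so that every trip is kept
# Note: join_time comes from DateUTC (UTC) on the weather side, while starttime is assumed to be local NYC time; no offset is applied here
data = data.merge(weather_data, how='left', on='join_time')

del data['join_time']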
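
The merge above can be sketched on two toy tables (hypothetical values, Python 3). Both sides carry a `join_time` column rounded to the hour; with `how='left'`, every trip is kept and trips without a matching weather row receive NaN.

```python
import pandas as pd

# Toy trips and hourly weather with hypothetical values
trips = pd.DataFrame({
    'tripduration': [248, 571, 930],
    'join_time': pd.to_datetime(['2015-06-07 14:00', '2015-06-07 14:00',
                                 '2015-06-07 20:00']),
})
weather = pd.DataFrame({
    'join_time': pd.to_datetime(['2015-06-07 14:00', '2015-06-07 15:00']),
    'TemperatureC': [18.3, 19.0],
})

# how='left' keeps every trip; trips without a matching hour get NaN weather
enriched = trips.merge(weather, how='left', on='join_time')

print(enriched)
```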

### <font color='red'>2.15) Show sample observations of <i>data</i></font>

In [55]:
data.sample()

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,Precipitationmm,Conditions,TemperatureC
162869,248,2015-06-07 13:47:00,2015-06-07 13:51:00,339,Avenue D & E 12 St,40.725806,-73.974225,445,E 10 St & Avenue A,40.727408,-73.98142,19480,Subscriber,1986.0,1,0.0,Clear,18.3


## 3) Collect number of available docks <font color='blue'> (20 min) </font>

### <font color='red'>3.1) Request the station data from the following URL, read the response and parse it as JSON:</font>
- https://www.citibikenyc.com/stations/json

In [56]:
url ='https://www.citibikenyc.com/stations/json'

In [57]:
response = urllib2.urlopen(url)
html = response.read()
stations = json.loads(html)

### <font color='red'>3.2) Understand the structure of the stations object </font>

In [58]:
stations

{u'executionTime': u'2017-03-22 10:54:09 AM',
 u'stationBeanList': [{u'altitude': u'',
   u'availableBikes': 36,
   u'availableDocks': 0,
   u'city': u'',
   u'id': 72,
   u'landMark': u'',
   u'lastCommunicationTime': u'2017-03-22 10:52:55 AM',
   u'latitude': 40.76727216,
   u'location': u'',
   u'longitude': -73.99392888,
   u'postalCode': u'',
   u'stAddress1': u'W 52 St & 11 Ave',
   u'stAddress2': u'',
   u'stationName': u'W 52 St & 11 Ave',
   u'statusKey': 1,
   u'statusValue': u'In Service',
   u'testStation': False,
   u'totalDocks': 39},
  {u'altitude': u'',
   u'availableBikes': 14,
   u'availableDocks': 17,
   u'city': u'',
   u'id': 79,
   u'landMark': u'',
   u'lastCommunicationTime': u'2017-03-22 10:54:04 AM',
   u'latitude': 40.71911552,
   u'location': u'',
   u'longitude': -74.00666661,
   u'postalCode': u'',
   u'stAddress1': u'Franklin St & W Broadway',
   u'stAddress2': u'',
   u'stationName': u'Franklin St & W Broadway',
   u'statusKey': 1,
   u'statusValue': u'I

In [59]:
stations.keys()

[u'executionTime', u'stationBeanList']

### <font color='red'>Run the following block that groups the stations ids and capacities in a dictionary </font>

In [60]:
total_docks = dict()

for x in stations['stationBeanList']:
    total_docks.update({x['id']:x['totalDocks']})

### <font color='red'>3.3) Create a pandas Dataframe from <i>total_docks</i> using pd.DataFrame.from_dict()</font>

In [61]:
stations_df =pd.DataFrame(total_docks.items())
stations_df.columns = ['start station id','total_docks_start']  # This renames the columns of the created DataFrame
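
The path from the parsed JSON to the capacity table can be tried offline with the sketch below (Python 3). The `stations` dictionary is a shortened stand-in with the same keys as the real response; the first station reuses the id and capacity shown above, and the second one is entirely hypothetical.

```python
import pandas as pd

# Shortened stand-in for the parsed JSON; the second station is hypothetical
stations = {
    'executionTime': '2017-03-22 10:54:09 AM',
    'stationBeanList': [
        {'id': 72, 'stationName': 'W 52 St & 11 Ave', 'totalDocks': 39},
        {'id': 9999, 'stationName': 'Example St', 'totalDocks': 25},
    ],
}

# Map station id -> capacity with a dict comprehension
total_docks = {s['id']: s['totalDocks'] for s in stations['stationBeanList']}

# list(...) turns the (id, capacity) pairs into rows, which also works on Python 3
stations_df = pd.DataFrame(list(total_docks.items()),
                           columns=['start station id', 'total_docks_start'])

print(stations_df)
```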

### <font color='red'>3.4) Show a sample of the created DataFrame</font>

In [62]:
stations_df.sample()

Unnamed: 0,start station id,total_docks_start
162,3258,37


### <font color='red'>3.5) Left merge data and start stations capacities using <i>data.merge()</i></font>

In [63]:
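# Note: merge() defaults to how='inner', which drops trips whose station id is absent from stations_df; pass how='left' to keep them
data = data.merge(stations_df)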

### <font color='red'>3.6) Left merge data and end stations capacities using <i>data.merge()</i></font>

In [64]:
stations_df.rename(columns = {'start station id': 'end station id',
                             'total_docks_start' : 'total_docks_end'},
                               inplace=True)  # This renames the columns 
                                            # so as to merge data with end stations capacities

In [65]:
# merge() without how= performs an inner merge, here on 'end station id'
data = data.merge(stations_df)
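
`DataFrame.merge` uses `how='inner'` when no `how` argument is given, so rows without a matching station are dropped rather than filled with NaN. This is consistent with the row count falling from 935635 after filtering to 797219 at the end of this notebook. The sketch below (hypothetical values, Python 3) contrasts the two behaviours on a toy example where one station is missing from the capacity table.

```python
import pandas as pd

trips = pd.DataFrame({'start station id': [72, 79, 3002],
                      'tripduration': [300, 450, 600]})
# Hypothetical capacities; station 3002 is deliberately absent
capacities = pd.DataFrame({'start station id': [72, 79],
                           'total_docks_start': [39, 33]})

# Default inner merge: the trip from station 3002 disappears
inner = trips.merge(capacities, on='start station id')

# Left merge: all trips are kept, the missing capacity becomes NaN
left = trips.merge(capacities, on='start station id', how='left')

print(len(inner), len(left))
print(left['total_docks_start'].isnull().sum())
```

With a left merge, the NaN capacities can then be filled with the median dock size, as the later steps of this section intend.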

### <font color='red'>3.7) Show a sample of data with start and end stations capacities</font>

In [66]:
data.sample(2)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,Precipitationmm,Conditions,TemperatureC,total_docks_start,total_docks_end
67604,1144,2015-06-12 22:49:00,2015-06-12 23:08:00,428,E 3 St & 1 Ave,40.724677,-73.987834,351,Front St & Maiden Ln,40.70531,-74.006126,20086,Subscriber,1997.0,1,0.0,Partly Cloudy,28.3,31,39
428437,1258,2015-06-08 08:48:00,2015-06-08 09:09:00,514,12 Ave & W 40 St,40.760875,-74.002777,363,West Thames St,40.708347,-74.017134,18539,Subscriber,1967.0,2,0.0,Overcast,17.8,53,49


### Fill NaN values

### <font color='red'>3.8) Print the median dock size using for instance <i>numpy</i></font>

In [67]:
median_dock_size = np.median(total_docks.values())
print median_dock_size

30.0


### <font color='red'>3.9) What does the following block do ?</font>

In [68]:
data['total_docks_start'] = data['total_docks_start'].fillna(median_dock_size)
data['total_docks_end'] = data['total_docks_end'].fillna(median_dock_size)

### <font color='red'>3.10) Check that there are no NaN dock capacities anymore in <i>data</i></font>

In [69]:
data.isnull().any()

tripduration               False
starttime                  False
stoptime                   False
start station id           False
start station name         False
start station latitude     False
start station longitude    False
end station id             False
end station name           False
end station latitude       False
end station longitude      False
bikeid                     False
usertype                   False
birth year                 False
gender                     False
Precipitationmm            False
Conditions                 False
TemperatureC               False
total_docks_start          False
total_docks_end            False
dtype: bool

# Dataset we will be working on

In [70]:
data.sample(2)

Unnamed: 0,tripduration,starttime,stoptime,start station id,start station name,start station latitude,start station longitude,end station id,end station name,end station latitude,end station longitude,bikeid,usertype,birth year,gender,Precipitationmm,Conditions,TemperatureC,total_docks_start,total_docks_end
531807,116,2015-06-21 07:49:00,2015-06-21 07:51:00,453,W 22 St & 8 Ave,40.744751,-73.999154,462,W 22 St & 10 Ave,40.74692,-74.004519,15573,Subscriber,1988.0,1,0.03,Light Rain,22.2,39,47
662066,395,2015-06-24 15:30:00,2015-06-24 15:37:00,3002,South End Ave & Liberty St,40.711512,-74.015756,249,Harrison St & Hudson St,40.71871,-74.009001,19580,Subscriber,1987.0,1,0.0,Clear,27.2,25,27


# Save dataset to csv file

In [71]:
data.to_csv('./my_data_after_collection.csv', index=False)

# Implement your own ideas for data enrichment !

In [72]:
## Use your imagination to enrich this dataset ##

In [73]:
len(data)

797219