# Introduction to Data Visualization <font color='blue'> (15 min) </font>

# Google doc with code corrections is accessible at:
### https://docs.google.com/document/d/1Ks66tyNbHA5GobgiAY_S2ApvKMJmCF57N4cwuXLpjfQ/edit?usp=sharing

# 0) Importing the right tools <font color='blue'> (5 min) </font>

### <font color='red'>0.1) Import the necessary packages: </font>

- pandas (aliased as pd)
- seaborn (aliased as sns)
- matplotlib.pyplot (aliased as plt)

In [None]:
from __future__ import division

#### IMPORT THE ABOVE PACKAGES WITH THE ADEQUATE ALIASES ####

from mpl_toolkits.basemap import Basemap  # This imports the Basemap package
%pylab inline

### <font color='red'>0.2) Import the dataset from <i>'../data/data_after_collection_cleaning.csv'</i> using the <i>pd.read_csv()</i> function</font>

In [2]:
data = #### IMPORT THE DATASET AS A CSV FILE HERE ####

### <font color='red'>0.3) Show a sample of 2 observations from the dataset, using the <i>.sample()</i> function</font>

# Visualizations

## 1) Plotting bike stations with Basemap <font color='blue'> (15 min) </font>

### <font color='red'>1.1) Go on the following websites to understand how Basemap functions </font>

- http://matplotlib.org/basemap/users/geography.html
- http://matplotlib.org/basemap/users/examples.html

### <font color='red'>Run the following block, and understand what the <i>stations</i> object is </font>

In [4]:
stations = (set(zip(data['start station latitude'], 
                    data['start station longitude']))
            .union(set(zip(data['end station latitude'], 
                           data['end station longitude']))))
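
To see what the block above builds, here is a minimal sketch on made-up coordinates. The values and the `toy_` names are illustrative assumptions, not taken from the dataset. Zipping two columns yields (latitude, longitude) pairs, `set` drops repeated pairs, and `union` merges start stations with end stations.

```python
# Toy columns standing in for the DataFrame columns (hypothetical values).
toy_start_lat = [40.71, 40.72, 40.71]
toy_start_long = [-74.00, -73.99, -74.00]
toy_end_lat = [40.72, 40.73]
toy_end_long = [-73.99, -73.98]

# zip pairs each latitude with its longitude; set() removes duplicate pairs.
toy_start = set(zip(toy_start_lat, toy_start_long))
toy_end = set(zip(toy_end_lat, toy_end_long))

# union keeps every station that appears as a start or as an end.
toy_stations = toy_start.union(toy_end)

print(len(toy_stations))     # number of distinct stations
print(sorted(toy_stations))  # (latitude, longitude) tuples
```

Each element of the result is a tuple, so `x[0]` is a latitude and `x[1]` is a longitude.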

### <font color='red'>1.2) Print the total number of stations, using the <i>len</i> function</font>

### <font color='red'>Run the following blocks, that will define the bounds for the Basemap plot</font>

In [6]:
stations_lat = [x[0] for x in stations]
stations_long = [x[1] for x in stations]

In [7]:
b = min(stations_lat)  # bottom
t = max(stations_lat)  # top
l = min(stations_long)  # left
r = max(stations_long)  # right

### <font color='red'>Run the following block, that defines the Basemap Map with the adequate bounds</font>

In [8]:
map_manhattan = Basemap(projection='merc', resolution='h',
             area_thresh=0.1, llcrnrlon=l, llcrnrlat=b,
             urcrnrlon=r, urcrnrlat=t)

### <font color='red'>1.3) Compute the map projection coordinates of the stations' longitudes and latitudes by calling the <i>map_manhattan</i> object on the <i>stations_long</i> and <i>stations_lat</i> lists </font>

In [9]:
x, y = #### CALL THE map_manhattan OBJECT TO PROJECT STATIONS LONGITUDES AND LATITUDES, GET HELP ON THE BASEMAP WEBPAGE ####
       #### MAKE SURE YOU PASS LONGITUDES AND LATITUDES IN THE RIGHT ORDER ####

### <font color='red'>1.4) Draw the map, along with: </font>
- stations longitudes and latitudes (i.e : x and y)
- coastlines
- map boundaries
- fill the continents with the 'coral' color
- Add a title to the map

In [None]:
plt.figure(figsize=(10,20))

#### DRAW THE MAP ALONG WITH COASTLINES, MAP BOUNDARIES, AND BIKE STATIONS ####

plt.show()

## 2) Plotting with a background image from OpenStreetMap <font color='blue'> (15 min) </font>

### <font color='red'>2.1) An image of Manhattan has been downloaded from https://www.openstreetmap.org. You only need to load it in order to plot the stations on top of it. Run the following block, which defines the bounding box of the image.</font>

In [85]:
l,r,b,t = (-74.0173,-73.9584,40.6990,40.7708)  # Left, right, bottom, top corners

### <font color='red'>2.2) Read the downloaded image using the <i>plt.imread()</i> function</font>

In [None]:
im = #### READ THE IMAGE LOCATED IN '../images/map.png' ####

### <font color='red'>2.3) Plot the stations on the image background, using the following:</font>
- plt.imshow() with the <i>extent=[l,r,b,t]</i> attribute
- plt.plot() on stations_long and stations_lat lists
- Add a title for the plot, as well as labels for x and y axes
- <b>Warning</b>: do not reuse the <i>x</i> and <i>y</i> computed above. They are Basemap projection coordinates from the previous section and have no meaning here. Use only <i>stations_long</i> and <i>stations_lat</i>.

In [None]:
sns.set_style("white")  # Sets a white background 
plt.figure(figsize=(10,15))  # Initialize the figure

#### FIRST PLOT STATIONS USING plt.plot ON stations_long ALONG WITH stations_lat LISTS ####

#### THEN PLOT THE BACKGROUND IMAGE USING plt.imshow() ON THE im IMAGE ####
### YOU WILL NEED TO ADD THE extent=[l,r,b,t] ATTRIBUTE TO FIX THE RIGHT BOUNDS ON THE IMAGE ####

#### ADD A TITLE, AND X AND Y LABELS ####

plt.show()  # Show the figure

### Going further
- C3.js and D3.js : http://d3js.org/ and http://c3js.org/
- Meteor : https://www.meteor.com/ used for DIL Telematics Dashboards as follows
- Angular : https://angularjs.org

### <font color='red'>Run the following block, that shows you how Meteor does the work in the Telematics Exchange project within the Data Innovation Lab</font>

In [None]:
dashboards = plt.imread('../images/dashboard.png')  # Importing dashboards images

plt.figure(figsize=(20,20))
plt.imshow(dashboards)
plt.show()

## 3) Plotting feature distributions with the seaborn package <font color='blue'> (20 min) </font>

### <font color='red'>Refer to the following pages to understand how you can leverage the <i>seaborn</i> package to understand your data:</font>

- http://stanford.edu/~mwaskom/software/seaborn/examples/distplot_options.html
- http://stanford.edu/~mwaskom/software/seaborn/examples/

### <font color='red'>3.1) Plot the distribution of trip durations (in minutes) using <i>sns.distplot(data.column_name)</i></font>

In [None]:
sns.set_style("darkgrid")  # Use seaborn's darkgrid style to make the plots easier to read

#### PLOT THE DISTRIBUTION OF TRIP DURATIONS, MAKING SURE YOU ADD X AND Y LABELS, AS WELL AS A CLEAR TITLE ####

### <font color='red'>3.2) Filter out the rows for which gender is unknown (i.e. gender = 0) and the rows for which birth year is before 1935</font>

In [90]:
#### TO ADD FILTERING CONDITIONS USE (data[column_name]<value) & (data[column_name]==value) ####

filtering_condition = #### INSERT YOUR FILTERING CONDITION (A BOOLEAN MASK) HERE ####

truncated_data = data[filtering_condition]
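
As a hedged illustration of how a boolean mask works, the sketch below filters a small made-up DataFrame. The rows are invented, and the column name `'birth year'` is an assumption about the dataset (only `gender` appears elsewhere in this notebook), so adapt the names to the actual columns.

```python
import pandas as pd

# Hypothetical rows; 'birth year' is an assumed column name.
toy = pd.DataFrame({
    'gender': [0, 1, 2, 1],
    'birth year': [1980, 1930, 1990, 1975],
})

# Each comparison yields a boolean Series; & combines them row by row.
# Parentheses are required because & binds tighter than the comparisons.
keep = (toy['gender'] != 0) & (toy['birth year'] >= 1935)

# Indexing with the mask keeps only the rows where it is True.
toy_truncated = toy[keep]
print(toy_truncated)
```

Note that the mask describes the rows to keep, which is the opposite of the rows to filter out.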

### <font color='red'>3.3) Plot the distribution of the ages of bike subscribers</font>

In [None]:
#### PLOT THE DISTRIBUTION OF SUBSCRIBERS AGES, MAKING SURE YOU ADD X AND Y LABELS, AS WELL AS A CLEAR TITLE ####

### <font color='red'>3.4) Use <i>sns.countplot</i> on <i>data.gender</i> and <i>data.usertype</i> to understand the distribution of genders and usertypes</font>

In [None]:
#### PLOT THE DISTRIBUTION OF GENDERS HERE, MAKING SURE YOU ADD X AND Y LABELS, AS WELL AS A CLEAR TITLE ####

In [None]:
#### PLOT THE DISTRIBUTION OF USERTYPES HERE, MAKING SURE YOU ADD X AND Y LABELS, AS WELL AS A CLEAR TITLE ####

## Matplotlib piecharts

### <font color='red'>Run the following blocks, which group the data by gender and count the rows in each group</font>

In [None]:
groupby_gender = data.groupby(['gender']).count()

In [None]:
groupby_gender

### <font color='red'>3.5) Use <i>plt.pie</i> with the <i>labels</i> and <i>explode</i> arguments to plot the shares of males, females, and unknown genders, pulling one slice out of the pie</font>

In [93]:
explode = (0, 0, 0.1)
labels = ['Unknown','Male','Female']

plt.figure(figsize=(7,7))

#### USE plt.pie() WITH THE RIGHT ATTRIBUTES TO PIE PLOT MALES, FEMALES, UNKNOWN GENDERS ####

plt.show()

### <font color='red'>3.6) Use <i>sns.countplot</i> to plot number of trips vs. weather conditions for <i>truncated_data</i></font>

In [None]:
#### USE sns.countplot() HERE FOR NUMBER OF TRIPS VS. WEATHER CONDITIONS ####
#### ADD A CLEAR TITLE ####

## 4) Build a graph with the <i>networkx</i> package <font color='blue'> (40 min) </font>

- NetworkX is a Python language software package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks
- Package tutorial can be found on http://networkx.readthedocs.io/en/networkx-1.11/tutorial/
- It can be used in the case of NYC Bikes Dataset : stations will be the nodes, and trips from station to station will be directed edges

### <font color='red'>4.1) Import the <i>networkx</i> package and alias it as <i>nx</i></font>

### <font color='red'>4.2) Create a new directed graph with the <i>DiGraph()</i> function of the <i>nx</i> module</font>

In [5]:
graph_stations = #### CREATE A DIRECTED GRAPH ####

##### Add directed links from station to station

### <font color='red'>Run the following blocks and understand what the <i>stations_indexes</i> and <i>stations_links</i> objects are</font>

In [None]:
stations_indexes = data.groupby(['start station id', 'end station id']).groups

In [7]:
stations_links = {(k[0],k[1],len(v)) for k,v in stations_indexes.items()}  # .items() works on both Python 2 and 3

In [None]:
#### UNDERSTAND THE STRUCTURE OF stations_links  and stations_indexes ####
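
To make the shape of `stations_links` concrete, here is a plain-Python sketch that produces the same kind of object from a handful of invented trips. It uses `collections.Counter` instead of the pandas `groupby`, which is a substitution made for illustration only; the station ids are hypothetical.

```python
from collections import Counter

# Hypothetical (start station id, end station id) pairs, one per trip.
toy_trips = [(72, 79), (72, 79), (79, 72), (72, 72), (72, 79)]

# Counting identical pairs plays the role of groupby(...).groups plus len(v).
toy_counts = Counter(toy_trips)

# Same shape as stations_links: a set of (start, end, number_of_trips) triples.
toy_links = {(start, end, n) for (start, end), n in toy_counts.items()}

print(sorted(toy_links))
```

Each triple reads as (origin node, destination node, weight), which is the format that `add_weighted_edges_from()` expects.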

### <font color='red'>4.3) Use the <i>add_weighted_edges_from()</i> function of any networkx Directed Graph to build the directed links from station to station</font>

In [8]:
#### ADD THE WEIGHTED EDGES FROM stations_links TO graph_stations ####

##### Analyze graph

### <font color='red'>4.4) Analyze the graph using the following functions on the graph:</font>
- nodes()
- edges()
- in_degree()
- nodes_with_selfloops()

### <font color='red'>Run the following block. What does it show?</font>

In [None]:
sorted(graph_stations.edges(data=True), key = lambda x: x[2]['weight'], reverse=True)[:10]

### <font color='red'>4.5) Print the id of the most central station, i.e. the one with the highest value returned by <i>nx.degree_centrality()</i></font>
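
As background, degree centrality is, to the best of our understanding of the networkx documentation, the degree of a node divided by n - 1, where n is the number of nodes; in a directed graph the degree is the in-degree plus the out-degree. The sketch below reproduces that computation in plain Python on an invented edge list, so every name and value in it is hypothetical.

```python
from collections import defaultdict

# Hypothetical weighted directed edges: (start id, end id, number of trips).
toy_edges = [(1, 2, 10), (2, 1, 4), (1, 3, 7), (3, 1, 1)]

# Degree of a node in a directed graph = outgoing edges + incoming edges.
degree = defaultdict(int)
for start, end, _weight in toy_edges:
    degree[start] += 1  # outgoing edge
    degree[end] += 1    # incoming edge

# Normalise by the number of other nodes, n - 1.
n = len(degree)
centrality = {node: d / float(n - 1) for node, d in degree.items()}

# The most central station is the key with the highest value.
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

The edge weights are not used here: this measure counts distinct connections, not the number of trips.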

### <font color='red'>4.6) Plot the graph using the <i>draw_networkx</i> function of the module. You can use the following attributes for clarity:</font>
- arrows=False
- pos = nx.spring_layout(graph_stations)
- width=0.1

<b>Note</b>: You will get different plots for different executions, as initial positions within the graph are random.

In [None]:
plt.figure(figsize=(20,20))

#### PLOT THE NETWORK HERE ####

plt.axis('off')
plt.show()

#### *Plot most common self-looping stations*

In [17]:
self_looping_stations = sorted(graph_stations.selfloop_edges(data=True),
                               key = lambda x : x[2]['weight'], reverse=True)[:10]

In [18]:
self_looping_stations

[(2006, 2006, {'weight': 1940}),
 (281, 281, {'weight': 712}),
 (499, 499, {'weight': 623}),
 (387, 387, {'weight': 540}),
 (426, 426, {'weight': 382}),
 (457, 457, {'weight': 346}),
 (514, 514, {'weight': 275}),
 (3002, 3002, {'weight': 245}),
 (217, 217, {'weight': 187}),
 (327, 327, {'weight': 187})]

### <font color='red'>4.7) Create a Directed Graph containing only the self-looping stations</font>

In [19]:
self_looping_stations_graph = #### CREATE A DIRECTED GRAPH FROM THE SELF-LOOPING STATIONS ####

### <font color='red'>Run the following blocks, they define the self-looping stations latitudes and longitudes and add them to two lists</font>

In [20]:
self_looping_stations_ids = self_looping_stations_graph.nodes()

In [21]:
# station_id is used instead of id to avoid shadowing the Python builtin
self_looping_stations_long = [
    data[data['start station id']==station_id].iloc[0]['start station longitude']
    for station_id in self_looping_stations_ids
]

self_looping_stations_lat = [
    data[data['start station id']==station_id].iloc[0]['start station latitude']
    for station_id in self_looping_stations_ids
]

### <font color='red'>4.8) Plot the self-looping stations on the Manhattan background image, as you did in section 2:</font>

In [None]:
l,r,b,t = (-74.0173,-73.9584,40.6990,40.7708)
im = plt.imread('../images/map.png')
sns.set_style('white')

plt.figure(figsize=(12,12))

#### PLOT THE SELF-LOOPING STATIONS ON THE BACKGROUND IMAGE. HOW ARE THEY DISTRIBUTED AROUND MANHATTAN ? ####

plt.show()

##### Draw most common directed trips

### <font color='red'>Run the following block, that keeps the most common trips:</font>

In [69]:
common_trips = sorted([(u,v,d) for (u,v,d) in graph_stations.edges(data=True) if u!=v],
                              key = lambda x : x[2]['weight'], reverse=True)[:5]

### <font color='red'>4.9) Understand the structure of <i>common_trips</i>, and build a directed graph from these trips:</font>

In [None]:
#### ANALYZE THE STRUCTURE OF THE OBJECT common_trips ####

In [None]:
common_trips_graph = #### BUILD A DIRECTED GRAPH FROM THESE TRIPS ####

### <font color='red'>4.10) Plot the graph of the most common trips:</font>

In [None]:
plt.figure(figsize=(7,7))


#### DRAW THE GRAPH OF THE MOST COMMON TRIPS ####


plt.show()

##### Plot these common trips

### <font color='red'>4.11) Run the following block, understand what it does, as well as the structure of <i>coordinates_common_trips</i>:</font>

In [75]:
common_trips_ids = common_trips_graph.nodes()
coordinates_common_trips = {}

for id_station in common_trips_ids:
    # The first trip starting at this station gives its coordinates
    station = data[data['start station id']==id_station].iloc[0]
    coordinates_common_trips[id_station] = (station['start station longitude'],
                                            station['start station latitude'])

In [None]:
#### UNDERSTAND THE OBJECT coordinates_common_trips ####

### <font color='red'>4.12) Plot the different start and end coordinates of the most frequent trips on the background image of Manhattan. Add arrows showing the directions of the trips. You might use the following functions:</font>
- plt.plot()
- plt.annotate('', xy=..., xytext=..., arrowprops=...)
- plt.imshow()
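
One possible way to prepare the arrows is sketched below in plain Python, without any plotting. In `plt.annotate`, the arrow runs from `xytext` (its tail) to `xy` (its head), so each trip needs the coordinates of its start and end stations. The dictionaries and ids here are invented and only mimic the shapes of `coordinates_common_trips` and `common_trips`.

```python
# Hypothetical coordinates, same shape as coordinates_common_trips:
# station id -> (longitude, latitude).
toy_coordinates = {
    100: (-74.00, 40.71),
    200: (-73.99, 40.72),
}

# Same shape as common_trips: (start id, end id, {'weight': count}).
toy_common_trips = [(100, 200, {'weight': 50}), (200, 100, {'weight': 30})]

arrows = []
for start, end, attributes in toy_common_trips:
    tail = toy_coordinates[start]  # where the arrow begins (xytext)
    head = toy_coordinates[end]    # where the arrow points (xy)
    arrows.append((tail, head, attributes['weight']))

for tail, head, weight in arrows:
    print(tail, '->', head, 'trips:', weight)
```

Because the coordinates are stored as (longitude, latitude), they can be passed directly as (x, y) positions on the background image.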

In [None]:
plt.figure(figsize=(12,12))
sns.set_style('white')

for (start,end,weight) in common_trips:  # This loops over the most frequent trips

    #### PLOT THE DIFFERENT START AND END STATIONS FOR THE MOST FREQUENT TRIPS ####
    
    #### ADD ARROWS ON THE PLOT SO WE CAN SEE THE DIRECTION OF THE TRIPS ####

plt.imshow(im, extent=[l,r,b,t])
plt.title('Most frequent trips', size=17)
plt.xlabel('Longitude')  # the x axis spans l..r, which are longitudes
plt.ylabel('Latitude')   # the y axis spans b..t, which are latitudes
plt.show()

## 5) Build interactive maps with Leaflet <font color='blue'> (10 min) </font>

Refer to the following webpage for more information:
- http://folium.readthedocs.org/en/latest/

### <font color='red'>5.1) Import the <i>folium</i> package</font>

### <font color='red'>5.2) Create a new map using the <i>.Map()</i> function, located around a given station latitude and longitude:</font>

In [None]:
map_osm = #### CREATE A FOLIUM MAP HERE, INITIATED ON A MANHATTAN COORDINATE ####

### <font color='red'>5.3) Run the following block, and study the structure of the <i>stations</i> object:</font>

In [120]:
stations = set(zip(data['start station latitude'],
                   data['start station longitude'],
                   data['start station name'],
                  data['total_docks_start']))\
            .union(set(zip(data['end station latitude'],
                           data['end station longitude'],
                           data['end station name'],
                          data['total_docks_end'])))

### <font color='red'>5.4) Add markers to the map using the <i>map_osm.simple_marker()</i> function. You can use the following attributes:</font>
- clustered_marker=True (this groups nearby markers into clusters)
- popup= ... (this will put text when you click on a specific marker. Add for instance the name of the station, and the total number of docks)

In [None]:
for latitude, longitude, station_name, total_docks in stations:
    #### ADD THE STATION ON THE MAP AS A SIMPLE MARKER ####
    #### YOU CAN ADD A POPUP ATTRIBUTE THAT SHOWS THE STATION NAME AND TOTAL DOCKS ####

### <font color='red'>Display the map using the <i>display</i> function on the map_osm object. This should show the stations interactively !</font>

# 6) Try other visualizations yourself!

# Other nice visualizations

- http://www.newyorker.com/news/news-desk/interactive-a-month-of-citi-bike
- http://bikes.oobrien.com/newyork/#zoom=14&lon=-74.0045&lat=40.7319
- https://vimeo.com/89305412
- http://www.r-bloggers.com/new-yorkers-municipal-bikes-and-the-weather/?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29