

KyleOS | cufflinks-intro | Nov 27, 2018

## Data Visualization: An Intro to Plotly's Cufflinks

##### <font color=grey>How to make great-looking, fully-interactive plots with a single line of Python</font>

Data visualization is an art as well as a science, and I am always exploring ways to make my visualizations more interesting and informative. One of the jobs of a data scientist is to tell a story with the data at their disposal, so you really want the data to jump out at the reader and your visualizations to be as understandable as possible.

Another of our main tasks is data manipulation, and today my main tool for that is pandas. What if I told you that you can build beautiful, interactive charts for the web straight from your pandas dataframes? Well, you can, with Plotly. Fortunately, this is a great time for Python plotting, and after exploring the options, a clear winner in terms of ease of use, documentation, and functionality is the [plotly Python library](https://plot.ly/python/). In this article, we’ll dive right into plotly, learning how to make better plots in less time, often with one line of code.

If you are unfamiliar with plotly itself, I drew up a brief beginner's guide a while back, available [here](https://kyso.io/KyleOS/plotly-intro). Some of the plots I generated in that post are re-created here, so you can see how cufflinks simplifies plotly's (already surprisingly simple) syntax when working with pandas.

### Plotly - Brief Overview

The plotly python package is an open-source library built on plotly.js which in turn is built on d3.js. We’ll be using a wrapper on plotly called cufflinks, which is designed to work with Pandas dataframes. All the work in this article was done in a Jupyter Notebook with plotly + cufflinks running in offline mode. Actually, the article you're reading right now is a rendered notebook. First things first, let's import the libraries we'll be needing for this post:

In [None]:
from IPython.display import display, HTML
import pandas as pd
import numpy as np
import math

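# Using plotly + cufflinks in offline mode
# (plotly.plotly, the online-mode module, is not needed here;
# from plotly 4 onwards it lives in the separate chart_studio package)

import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf

# Switch cufflinks to offline mode for this session and in its config file
cf.go_offline()
cf.set_config_file(offline=True)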

I have imported download_plotlyjs, init_notebook_mode, plot and iplot from plotly.offline and switched cufflinks to offline mode, which lets us generate interactive visualizations offline in Kyso's JupyterLab environment.

### A Demonstration - The Advantage of Cufflinks

We have a simple list of the countries with the world's best restaurants. The magazine Restaurant released its [World's Best Restaurants 2018 list](https://www.theworlds50best.com/list/1-50-winners) at the end of the year. The list is obviously super subjective, but an interesting exercise all the same.

So let's read in the dataset and plot it using plotly's cufflinks, a library for easy interactive Pandas charting with Plotly. Cufflinks binds the power of plotly with the flexibility of pandas for easy plotting. 

In [None]:
df = pd.read_csv('../input/restaurants.csv',header=0)
df = df.groupby(by="Country").count()["Name"]
df = df.sort_values(ascending=False)
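
To make the aggregation in the cell above concrete, here is a minimal sketch on a made-up stand-in for `restaurants.csv`. The restaurant and country values are invented for illustration; only the `Name` and `Country` column names come from the cell above. It also shows that `value_counts()` should give the same totals in one call, assuming `Name` has no missing values.

```python
import pandas as pd

# Hypothetical stand-in for restaurants.csv: one row per restaurant.
restaurants = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "Country": ["Spain", "Italy", "Spain", "Peru", "Spain", "Italy"],
})

# Same steps as the cell above: count names per country, largest first.
counts = (
    restaurants.groupby(by="Country").count()["Name"]
    .sort_values(ascending=False)
)

# value_counts() counts rows per country and sorts descending by default.
# It matches the groupby result only when "Name" has no missing values.
alt_counts = restaurants["Country"].value_counts()

print(counts)
```

Either result is a pandas Series indexed by country, which is the shape that `.iplot(kind='bar')` is given in the next cell.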

In [None]:
df.iplot(kind='bar')

Pretty easy, right? We can click on the elements in the legend to hide and show the corresponding data, which is pretty neat. Move the cursor to the top right of the plot to see the plot's other features, including a zoom tool for focusing on specific areas.

We simply use the `.iplot()` method and specify the kind of chart we want to generate with the dataset.

Ok, so we've generated a simple bar chart. Now let's read in other datasets and test some other cufflinks-generated plots. For this tutorial, we have 2 different datasets:

First up, it's Pokemon data from Kaggle's [Pokemon with Stats](https://www.kaggle.com/abcsds/pokemon#Pokemon.csv), read in as df1.

Second, FIFA 18 data from Kaggle's [FIFA 18 Updated Dataset](https://www.kaggle.com/piyushgandhi811/fifa-18-updated-dataset) as df2.

In [None]:
df1 = pd.read_csv('../input/Pokemon.csv')
df1 = df1.drop('#', axis=1)

In [None]:
df2 = pd.read_csv("../input/fifa18.csv")
df2 = df2.drop(['Photo', 'Flag', 'Club Logo'], axis=1)

### Bar Charts

Let's run a few more bar-chart examples to visualize the strength differences between different Pokemon types.

In [None]:
df = df1.drop(['Name','Type 2','Legendary','Generation'],axis=1)
df = df.groupby('Type 1').mean()
df['Type 1'] = df.index

df.iplot(kind='barh', y='Attack', x='Type 1', colorscale='rdylbu', title='Attack Strength')

In [None]:
# 'Name' is dropped too: recent pandas versions raise on .mean() over text columns
df = df1.drop(['Name', 'Total', 'Attack', 'Defense', 'Speed', 'HP', 'Type 2', 'Generation', 'Legendary'], axis=1)
df = df.groupby('Type 1').mean()

df.iplot(kind='bar', title='Special Attack and Defense Scores')

### Distributions - Histograms and Boxplots

The benefit of interactivity is that we can explore and subset the data as we like. There’s a lot of information in a boxplot, and without the ability to see the numbers, we’ll miss most of it! Let's generate a box plot to show the shape of the distribution of each stat:

In [None]:
df = df1[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']]
df.iplot(kind='box')
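
For readers who want the numbers behind a box plot without hovering, the sketch below computes the usual five-number summary and the 1.5 × IQR outlier fence with pandas. The values are synthetic, not taken from the Pokemon dataset, and plotly may use a slightly different quartile interpolation than pandas' default, so treat the figures as approximate counterparts of what the plot displays.

```python
import pandas as pd

# Synthetic values for illustration only.
stats = pd.DataFrame({
    "HP": [10, 20, 30, 40, 50],
    "Attack": [5, 15, 25, 35, 95],
})

# Min, Q1, median, Q3 and max for every column.
summary = stats.quantile([0.0, 0.25, 0.5, 0.75, 1.0])

# Interquartile range and the conventional upper outlier fence.
iqr = summary.loc[0.75] - summary.loc[0.25]
upper_fence = summary.loc[0.75] + 1.5 * iqr

# Values above the fence are the ones usually drawn as separate points.
attack_outliers = stats["Attack"][stats["Attack"] > upper_fence["Attack"]]

print(summary)
print(attack_outliers.tolist())
```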

Ok, time for our FIFA data! The histogram is a go-to plot for graphing a distribution. 

What's the distribution of the players' ages in the game?

In [None]:
df2['Age'].iplot(kind='hist', opacity=0.75, color='rgb(12, 128, 128)', title='Age Distribution', yTitle='Count', xTitle='Age', bargap = 0.20)
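
As a rough illustration of what a histogram counts, the sketch below bins a handful of made-up ages with NumPy. The ages and bin edges are assumptions chosen for the example; plotly picks its own bins automatically, so the bars in the chart will not necessarily line up with these edges.

```python
import numpy as np

# Made-up ages, not drawn from the FIFA 18 data.
ages = np.array([18, 19, 21, 25, 25, 30, 33])

# Bin edges: [15, 20), [20, 25), [25, 30), [30, 35]
edges = np.array([15, 20, 25, 30, 35])

# counts[i] is the number of ages that fall into bin i.
counts, _ = np.histogram(ages, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {n}")
```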

And by their overall player rating:

In [None]:
df2['Overall'].iplot(kind='hist', opacity=0.75, color='#007959', title='Overall Rating Distribution', yTitle='Count', xTitle='Overall Rating', bargap = 0.20)

Ok, let's break the data down a little to compare player stats across countries:

In [None]:
df_spain = pd.DataFrame(df2.loc[df2['Nationality'] == 'Spain']['Overall']).reset_index(drop=True)
df_spain = df_spain.rename(columns={'Overall': 'Spain'})
df_brazil = pd.DataFrame(df2.loc[df2['Nationality'] == 'Brazil']['Overall']).reset_index(drop=True)
df_brazil = df_brazil.rename(columns={'Overall': 'Brazil'})
df_england = pd.DataFrame(df2.loc[df2['Nationality'] == 'England']['Overall']).reset_index(drop=True)
df_england = df_england.rename(columns={'Overall': 'England'})
frames = [df_spain, df_brazil, df_england]
df = pd.concat(frames, sort=False)

colors = ['rgba(171, 50, 96, 0.6)', '#051e3e', 'rgba(80, 26, 80, 0.8)']

df.iplot(
    kind='hist',
    barmode='overlay',
    xTitle='Rating',
    yTitle='Count',
    title='Distribution of Overall Rating by Country',
    opacity=0.75,
    color=colors,
    theme='white')


Spanish and Brazilian players clearly dominate the upper quartile, but to be fair, FIFA 18 includes more lower-league players in England than in other countries. Disclaimer: I happen to be Irish and so took particular delight in this graph, while also acknowledging Ireland's dismal performance at international level, if and when we even qualify!😀

Let's step it up a notch and segment the dataset by top finishers, so that we can look at the attributes of the game's forwards and strikers. First, let's generate a box plot for descriptive stats on the game's finishers by country.
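
The three near-identical filter-and-rename steps used to build the histogram input could be condensed into one expression. The sketch below is one possible way to do it, shown on made-up ratings. It concatenates along the columns rather than the rows, so the frame is shorter than the one built in the cell above, but the non-missing values in each country column are the same, which is what a histogram counts.

```python
import pandas as pd

# Made-up rows standing in for the FIFA 18 data.
players = pd.DataFrame({
    "Nationality": ["Spain", "Brazil", "Spain", "England", "Brazil", "Spain"],
    "Overall": [88, 90, 75, 70, 82, 79],
})

countries = ["Spain", "Brazil", "England"]

# One column per country; the dict keys become the column names and
# shorter columns are padded with NaN.
by_country = pd.concat(
    {
        country: players.loc[players["Nationality"] == country, "Overall"]
        .reset_index(drop=True)
        for country in countries
    },
    axis=1,
)

print(by_country)
```

Adding a fourth country then only means extending the `countries` list.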

In [None]:
df_russia = pd.DataFrame(df2.loc[df2['Nationality'] == 'Russia']['Finishing']).reset_index(drop=True)
df_russia = df_russia.rename(columns={'Finishing': 'Russia'})
df_france = pd.DataFrame(df2.loc[df2['Nationality'] == 'France']['Finishing']).reset_index(drop=True)
df_france = df_france.rename(columns={'Finishing': 'France'})
df_argentina = pd.DataFrame(df2.loc[df2['Nationality'] == 'Argentina']['Finishing']).reset_index(drop=True)
df_argentina = df_argentina.rename(columns={'Finishing': 'Argentina'})
df_germany = pd.DataFrame(df2.loc[df2['Nationality'] == 'Germany']['Finishing']).reset_index(drop=True)
df_germany = df_germany.rename(columns={'Finishing': 'Germany'})

frames = [df_russia, df_france, df_argentina, df_germany]
df = pd.concat(frames, sort=False)

df.iplot(kind='box',
        yTitle='Rating',
        title='Descriptive Stats of Finishing Ability by Country')


Argentina just about steals the show when it comes to prowess in front of goal.

### Scatterplots 

The scatterplot is found at the heart of most analyses - it allows us to see the evolution of a variable over time or the relationship between two (or more) variables.


Let's look at the top 100 finishers in the game. We'll create a bubble chart in which the y-values represent the players' finishing score, the x-values their composure, and the marker size their wages.

In [None]:
df = df2.nlargest(100, 'Finishing')
df['Wage'] = df['Wage'] / 10000

df[['Composure', 'Finishing', 'Name', 'Wage']].iplot(
    y='Finishing', mode='markers', x='Composure', colorscale='rdylbu',
    xTitle='Composure', yTitle='Finishing',
    text='Name', title='Player Finishing vs Composure', size=df['Wage'])

Unsurprisingly, those two big red bubbles in the upper right-hand corner represent Ronaldo and Messi.

Now, does a player's market value and wage justify his ranking in FIFA 18's ratings? Let's find out!

In [None]:
df = df2.nlargest(100, 'Overall')
# nlargest returns the rows in descending order, so rank them 1..100
df['Rank'] = np.arange(1, len(df) + 1)

colors = ['rgba(16, 112, 2, 0.8)', 'rgba(80, 26, 80, 0.8)']

df[['Rank', 'Wage', 'Value', 'Name']].iplot(
    x='Rank', y='Wage', mode='lines+markers', secondary_y = 'Value',
    secondary_y_title='Value', xTitle='Rank', yTitle='Wage',
    text='Name', title='Wage and Value by Rank', color=colors, theme='white')
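
The ranking step relies on `nlargest` returning rows already sorted from highest to lowest, so a simple 1..n counter can serve as the rank. A minimal sketch with invented player names and ratings:

```python
import numpy as np
import pandas as pd

# Invented players for illustration.
players = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4"],
    "Overall": [81, 94, 77, 90],
})

# Keep the three highest-rated players, best first.
top = players.nlargest(3, "Overall").copy()

# Because the rows are already in descending order, 1..n is the rank.
top["Rank"] = np.arange(1, len(top) + 1)

print(top[["Rank", "Name", "Overall"]])
```

The `.copy()` is a defensive choice here so that adding the new column never touches the original frame.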

Ok, now let's get a distribution of the players' actual and potential overall rating as a function of their market value.

In [None]:
df = df2.nlargest(500, 'Value')
df = df.rename(columns={'Overall': 'Actual Rating'})
colors=['#007959', '#FFA505']

df[['Value', 'Actual Rating', 'Potential', 'Name']].iplot(
    kind='scatter',
    mode='markers',
    x='Value',
    y='Actual Rating',
    secondary_y = 'Potential',
    color=colors,
    text='Name',
    xTitle='Market Valuation',
    yTitle='Ratings',
    title='Actual vs Potential Rating by Player Value')

## More Advanced Plots

Now we’ll get into a few plots that you probably won’t use all that often, but which can be quite impressive.

### Heatmap

To visualize the correlations between numerical values, in this case the various stats of FIFA 18 players, we calculate the correlations and then make a heatmap:

In [None]:
df = df2[['Age', 'Value', 'Wage', 'Potential','Acceleration','Shot Power', 'Sprint Speed', 'Finishing', 'Stamina', 'Strength', 'Vision', 'Ball Control']]

df.corr().iplot(kind='heatmap',colorscale='ylgn')

Run `cf.colors.scales()` to see the available colorscales for cufflinks.
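
If it is unclear what `df.corr()` hands to the heatmap, the sketch below runs it on a tiny made-up frame. By default pandas computes the pairwise Pearson correlation of the columns, which gives a square, symmetric matrix with ones on the diagonal. The column names and values are invented for the example.

```python
import pandas as pd

# Toy columns with known relationships.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # exactly 2 * a, so perfectly correlated with a
    "c": [4, 3, 2, 1],  # a reversed, so perfectly anti-correlated with a
})

# Pairwise Pearson correlations between all columns.
corr = toy.corr()

print(corr.round(2))
```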

### Pie Chart

Let's find out what percentage of the game's top 100 players are playing for which clubs:

In [None]:
df = df2.nlargest(100, 'Overall')
df = pd.DataFrame(df.groupby('Club').size())
df.columns = ['Count']
df['Club'] = df.index


df.iplot(kind='pie', labels='Club', values='Count', title='Number of Players by Club', hoverinfo="label+percent+name", hole=0.3, theme='white')

All European clubs - continental poachers!
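
The pie chart works out the percentages for us, but the same shares can be computed directly when the exact figures are needed. A small sketch with hypothetical club names; the `Share (%)` column is an addition for illustration and is not part of the cell above.

```python
import pandas as pd

# Hypothetical clubs for eight made-up players.
top_players = pd.DataFrame({
    "Club": ["Club A", "Club B", "Club A", "Club C",
             "Club A", "Club B", "Club A", "Club C"],
})

# Players per club, then each club's share of the total.
club_counts = top_players.groupby("Club").size().reset_index(name="Count")
club_counts["Share (%)"] = 100 * club_counts["Count"] / club_counts["Count"].sum()

print(club_counts)
```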

### Geographic Plotting - Choropleth

With plotly you can also plot geographical data. While not as effective as geospatial plotting with [folium](https://github.com/python-visualization/folium), a leaflet.js wrapper for Python that generates interactive HTML maps, pandas and cufflinks work just fine for our purposes. Let's get a sense of the distribution of pro players by their nationality:

In [None]:
df = df2.groupby("Nationality").size().reset_index(name="Count")

df.iplot(
    kind='choropleth', locations='Nationality',  z ='Count',
    text = 'Nationality', locationmode = 'country names', theme='white',
    colorscale='oranges', title = "Nationalities of FIFA 18 Players",
    projection = dict(
            type = 'natural earth'
        ))

### 3D Plots

And finally, a cool 3D-plot of the intersection between player composure, positioning and finishing ability:

In [None]:
df = df2.nlargest(50, 'Potential')
df.iplot(x='Composure', y='Positioning', z='Finishing', kind='scatter3d', xTitle='Composure', yTitle='Positioning',
         zTitle='Finishing', theme='pearl', text= 'Name',
         categories='Club', title='Intersection between Composure, Positioning and Finishing Ability')

***

I hope you liked this short intro to cufflinks. It is a pretty awesome tool for quick-fire EDA on any dataframe. I reckon it is the best plotting library if you're working with Python, not only for its ease of use, but also in terms of bringing data to life for the reader.


For more information, check out plotly's [documentation](https://plot.ly/ipython-notebooks/cufflinks/).