# Plotly basics

## Scatter plots

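A scatter plot compares two variables across a set of observations: each point is placed at the x value of one variable and the y value of the other.

If the points follow a trend, for example rising together from left to right, we can read that as a sign of correlation between the two variables.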
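As a rough illustration of the point about correlation, the strength of a linear trend can be summarised with the Pearson correlation coefficient. The sketch below uses `np.corrcoef`; the seed, the variable names and the synthetic data are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, only for repeatability

# Two unrelated variables: the scatter cloud shows no visible trend
x = rng.randint(1, 101, 100)
y_unrelated = rng.randint(1, 101, 100)

# A variable that rises with x, plus some noise: the points drift upwards
y_trend = 2 * x + rng.normal(0, 10, 100)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r
r_unrelated = np.corrcoef(x, y_unrelated)[0, 1]
r_trend = np.corrcoef(x, y_trend)[0, 1]

print(f"unrelated: r = {r_unrelated:.2f}")
print(f"trend:     r = {r_trend:.2f}")
```

Values of r close to +1 or -1 indicate a strong linear trend and values near 0 indicate none. A correlation on its own does not show that one variable causes the other.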

In [1]:
import numpy as np
import pandas as pd
import plotly.offline as pyo
import plotly.graph_objs as go

In [5]:
np.random.seed(42)

In [6]:
random_x = np.random.randint(1,101,100)
random_y = np.random.randint(1,101,100)

In [12]:
data = [go.Scatter(x=random_x, 
                   y=random_y,
                   mode='markers',
                   marker=dict(
                       size=12,
                       color='rgb(51,204,153)',
                       symbol='pentagon',
                       line={'width':2}
                   ))]


layout = go.Layout(title='Hello first plot',
                   xaxis={'title': 'my x axis'},      # axis options as a dict literal
                   yaxis=dict(title='my y axis'),     # or via dict(); the two forms are equivalent
                   hovermode='closest'                # hover shows the single nearest point
                   )

fig = go.Figure(data=data, layout=layout)

In [13]:
pyo.plot(fig,filename='scatter.html')

'scatter.html'

## Line charts 

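Line charts are often used to visualise a trend in data over intervals of time; such data is known as a time series.

In Plotly a line chart is still a `go.Scatter` trace. The `mode` argument (`'markers'`, `'lines'` or `'lines+markers'`) decides how the points are drawn.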

In [5]:
np.random.seed(56)

x_values = np.linspace(0, 1, 100)
y_values = np.random.randn(100)


# The same y values are shifted by +5 and -5 so the three modes do not overlap
trace0 = go.Scatter(x=x_values, y=y_values+5,
                    mode='markers', name='markers')

trace1 = go.Scatter(x=x_values, y=y_values,
                    mode='lines', name='mylines')

trace2 = go.Scatter(x=x_values, y=y_values-5,
                    mode='lines+markers', name='lines+markers')

data = [trace0, trace1, trace2]

layout = go.Layout(title='Line Charts')

fig = go.Figure(data=data, layout=layout)

pyo.plot(fig)

'temp-plot.html'

In [2]:
df = pd.read_csv('nst-est2017.csv')

In [3]:
print(df.head())

   SUMLEV REGION DIVISION  STATE              NAME  CENSUS2010POP  \
0    10.0      0        0    0.0     United States    308745538.0   
1    20.0      1        0    0.0  Northeast Region     55317240.0   
2    20.0      2        0    0.0    Midwest Region     66927001.0   
3    20.0      3        0    0.0      South Region    114555744.0   
4    20.0      4        0    0.0       West Region     71945553.0   

   ESTIMATESBASE2010  POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  ...  \
0        308758105.0      309338421.0      311644280.0      313993272.0  ...   
1         55318350.0       55388349.0       55642659.0       55860261.0  ...   
2         66929794.0       66973360.0       67141501.0       67318295.0  ...   
3        114563024.0      114869241.0      116060993.0      117291728.0  ...   
4         71946937.0       72107471.0       72799127.0       73522988.0  ...   

   RDOMESTICMIG2015  RDOMESTICMIG2016  RDOMESTICMIG2017  RNETMIG2011  \
0          0.000000          0.0

In [4]:
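# DIVISION holds strings in this file; '1' selects the New England states.
# .copy() makes df2 independent of df before the in-place set_index below.
df2 = df[df['DIVISION'] == '1'].copy()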

In [6]:
df2.set_index('NAME', inplace=True)

In [7]:
list_of_pop_col = [col for col in df2.columns if col.startswith('POP')]

In [8]:
df2 = df2[list_of_pop_col]

In [9]:
print(df2)

               POPESTIMATE2010  POPESTIMATE2011  POPESTIMATE2012  \
NAME                                                               
Connecticut          3580171.0        3591927.0        3597705.0   
Maine                1327568.0        1327968.0        1328101.0   
Massachusetts        6564943.0        6612178.0        6659627.0   
New Hampshire        1316700.0        1318345.0        1320923.0   
Rhode Island         1053169.0        1052154.0        1052761.0   
Vermont               625842.0         626210.0         625606.0   

               POPESTIMATE2013  POPESTIMATE2014  POPESTIMATE2015  \
NAME                                                               
Connecticut          3602470.0        3600188.0        3593862.0   
Maine                1327975.0        1328903.0        1327787.0   
Massachusetts        6711138.0        6757925.0        6794002.0   
New Hampshire        1322622.0        1328684.0        1330134.0   
Rhode Island         1052784.0        1054782.0

In [10]:
data = [ go.Scatter(x=df2.columns,
                    y=df2.loc[name],
                    mode='lines',
                    name=name) for name in df2.index]

pyo.plot(data)

'temp-plot.html'
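In the chart above the x-axis labels are the raw column names (`POPESTIMATE2010`, ...). One possible tidy-up, sketched here on a two-state, two-year stand-in whose values are copied from the printout above, is to strip the shared prefix so the axis shows plain years. The names `demo`, `years` and `traces` are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in for df2; the numbers are copied from the printed table above
demo = pd.DataFrame(
    {'POPESTIMATE2010': [3580171.0, 1327568.0],
     'POPESTIMATE2011': [3591927.0, 1327968.0]},
    index=pd.Index(['Connecticut', 'Maine'], name='NAME'))

# Strip the common prefix so the x-axis reads 2010, 2011, ...
years = [int(col.replace('POPESTIMATE', '')) for col in demo.columns]

# One trace per state, written as plain dicts
traces = [dict(x=years, y=demo.loc[name].tolist(), mode='lines', name=name)
          for name in demo.index]
```

The resulting list of dicts can be passed as `data` to `go.Figure`, in the same way as the list of `go.Scatter` objects above.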

### Line Chart exercise

In [2]:
df = pd.read_csv('2010YumaAZ.csv')

In [4]:
df.head()

Unnamed: 0,LST_DATE,DAY,LST_TIME,T_HR_AVG
0,20100601,TUESDAY,0:00,25.2
1,20100601,TUESDAY,1:00,24.1
2,20100601,TUESDAY,2:00,24.4
3,20100601,TUESDAY,3:00,24.9
4,20100601,TUESDAY,4:00,22.8


In [3]:
days = ['TUESDAY','WEDNESDAY','THURSDAY','FRIDAY','SATURDAY','SUNDAY','MONDAY']

In [17]:
data = []

for day in days:
    # Keep only the rows for this day so x and y have the same length
    day_df = df[df['DAY'] == day]

    trace = go.Scatter(x=day_df['LST_TIME'], y=day_df['T_HR_AVG'],
                       mode='lines', name=day)
    data.append(trace)

# Build the layout and figure once, after all seven traces exist
layout = go.Layout(title='Daily temp averages')
fig = go.Figure(data=data, layout=layout)

pyo.plot(fig)

In [18]:
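# Define a data variable: one trace (as a plain dict) per day,
# with x and y both restricted to that day's rows
data = [{
    'x': df[df['DAY']==day]['LST_TIME'],
    'y': df[df['DAY']==day]['T_HR_AVG'],
    'name': day
} for day in df['DAY'].unique()]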

# Define the layout
layout = go.Layout(
    title='Daily temperatures from June 1-7, 2010 in Yuma, Arizona',
    hovermode='closest'
)

# Create a fig from data and layout, and plot the fig
fig = go.Figure(data=data, layout=layout)
pyo.plot(fig, filename='solution2b.html')

'solution2b.html'
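An alternative way to build one trace per day is to let pandas do the splitting with `groupby`, which keeps the x and y values of each trace the same length. This is only a sketch on a small stand-in frame: the TUESDAY rows come from `df.head()` above, while the WEDNESDAY rows are invented so that there is a second group.

```python
import pandas as pd

# Stand-in for the Yuma data. TUESDAY values are from df.head() above;
# the WEDNESDAY values are made up purely for illustration.
demo = pd.DataFrame({
    'DAY': ['TUESDAY', 'TUESDAY', 'TUESDAY',
            'WEDNESDAY', 'WEDNESDAY', 'WEDNESDAY'],
    'LST_TIME': ['0:00', '1:00', '2:00', '0:00', '1:00', '2:00'],
    'T_HR_AVG': [25.2, 24.1, 24.4, 26.0, 25.0, 24.5],
})

# groupby yields one sub-frame per day; sort=False keeps first-appearance order
traces = [dict(x=group['LST_TIME'].tolist(),
               y=group['T_HR_AVG'].tolist(),
               mode='lines',
               name=day)
          for day, group in demo.groupby('DAY', sort=False)]
```

As with the solution above, the list of dicts can be handed to `go.Figure(data=traces, layout=layout)`.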

## Bar charts

A bar chart presents **categorical data** using rectangular bars whose heights (or lengths) are proportional to the values they represent.

First, a quick review of **categorical data** versus **continuous data**.

In general, variables either represent measurements on some **continuous scale**, or they represent information about **categorical or discrete characteristics**.

For example, the weight, height, and age of respondents in a survey are **continuous variables**.

However, a person's gender, occupation, or marital status are **categorical or discrete variables**.

Using bar charts, we can visualise categorical data.

Typically the x-axis holds the categories and the y-axis shows the count (number of occurrences) in each category.

However, the y-axis can be any aggregation (count, sum, average, etc.).

In [2]:
df = pd.read_csv('2018WinterOlympics.csv')

In [3]:
print(df.head())

   Rank            NOC  Gold  Silver  Bronze  Total
0     1         Norway    14      14      11     39
1     2        Germany    14      10       7     31
2     3         Canada    11       8      10     29
3     4  United States     9       8       6     23
4     5    Netherlands     8       6       6     20


In [4]:
data = [go.Bar(x=df['NOC'], y=df['Total'])]
layout = go.Layout(title='Medals')
fig = go.Figure(data=data, layout=layout)

pyo.plot(fig)

'temp-plot.html'

### Nested Bar Chart

In [5]:
trace1 = go.Bar(x=df['NOC'], y=df['Gold'],
                name='Gold', marker={'color':'#FFD700'})

trace2 = go.Bar(x=df['NOC'], y=df['Silver'],
                name='Silver', marker={'color':'#9EA0A1'})

trace3 = go.Bar(x=df['NOC'], y=df['Bronze'],
                name='Bronze', marker={'color':'#CD7F32'})



data = [trace1, trace2, trace3]
layout = go.Layout(title='Medals')
fig = go.Figure(data=data, layout=layout)

pyo.plot(fig)

'temp-plot.html'

### Stacked

In [6]:
trace1 = go.Bar(x=df['NOC'], y=df['Gold'],
                name='Gold', marker={'color':'#FFD700'})

trace2 = go.Bar(x=df['NOC'], y=df['Silver'],
                name='Silver', marker={'color':'#9EA0A1'})

trace3 = go.Bar(x=df['NOC'], y=df['Bronze'],
                name='Bronze', marker={'color':'#CD7F32'})



data = [trace1, trace2, trace3]
layout = go.Layout(title='Medals', barmode='stack')    # barmode='stack' is the only change from the nested chart
fig = go.Figure(data=data, layout=layout)

pyo.plot(fig)

'temp-plot.html'

## Bubble charts

Bubble charts are very similar to scatter plots, except that a third variable is conveyed through the size of each marker.

We can also add further variable information by colouring points based on a category.

In [10]:
df = pd.read_csv('mpg.csv', na_values={'horsepower':'?'})

In [6]:
print(df)

      mpg  cylinders  displacement horsepower  weight  acceleration  \
0    18.0          8         307.0        130    3504          12.0   
1    15.0          8         350.0        165    3693          11.5   
2    18.0          8         318.0        150    3436          11.0   
3    16.0          8         304.0        150    3433          12.0   
4    17.0          8         302.0        140    3449          10.5   
..    ...        ...           ...        ...     ...           ...   
393  27.0          4         140.0         86    2790          15.6   
394  44.0          4          97.0         52    2130          24.6   
395  32.0          4         135.0         84    2295          11.6   
396  28.0          4         120.0         79    2625          18.6   
397  31.0          4         119.0         82    2720          19.4   

     model_year  origin                       name  
0            70       1  chevrolet chevelle malibu  
1            70       1          buick sk

In [4]:
print(df.columns)

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')


In [11]:
data = [go.Scatter(x=df['horsepower'],
                   y=df['mpg'],
                   text=df['name'],
                   mode='markers',
                   marker=dict(size=2*df['cylinders']))]

layout = go.Layout(title='Bubble')

fig = go.Figure(data=data, layout=layout)

pyo.plot(fig)

'temp-plot.html'