## Python Visualization with Plotly


### What is .ipynb?
An .ipynb file is a notebook document used by Jupyter Notebook, an interactive computing environment most often used with Python. It stores code together with its output, along with explanatory text, mathematics, images, and more.


### What is Google Colab?
Google Colab is a hosted Jupyter Notebook environment that lets anyone write Python code and run it on Google's cloud servers.


### What is Plotly?

Plotly is a free and open-source graphing tool. Its Python library creates interactive, publication-quality graphs, including:

    line plots
    scatter plots
    area charts
    bar charts
    maps
    error bars
    box plots
    histograms
    heatmaps
    subplots
    multiple axes
    polar charts
    bubble charts

Data source: https://data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o/9mfq-cb36

### Use Pandas to Import Data

First, let's use Pandas to import the CSV file as a DataFrame.
A DataFrame is a 2-dimensional labeled data structure whose columns can hold different types. You can think of it as a spreadsheet or an SQL table.
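
To make the "columns of different types" idea concrete, here is a minimal sketch that builds a small DataFrame by hand. The three rows are copied from the 10/25/20 snapshot shown later in this notebook; the variable name `sample` is only for illustration.

```python
import pandas as pd

# Three rows copied from the 10/25/20 snapshot that appears later in this notebook.
sample = pd.DataFrame({
    "submission_date": ["10/25/20", "10/25/20", "10/25/20"],  # text column
    "state": ["CO", "FL", "AZ"],                              # text column
    "tot_cases": [95089, 768653, 238163],                     # integer column
    "tot_death": [2223, 16429, 5874],                         # integer column
})

print(sample.dtypes)  # each column carries its own type
print(sample.shape)   # (number of rows, number of columns)
```

Each column keeps its own type, which is why text fields such as `state` can sit next to numeric fields such as `tot_cases`.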

In [2]:
# Upload the data to Google Drive, then mount Google Drive in Google Colab
from google.colab import drive
drive.mount("/content/drive")


Mounted at /content/drive


In [3]:
from google.colab import files
uploaded = files.upload()

Saving US_COVID19_Aggregated_Cases.csv to US_COVID19_Aggregated_Cases.csv


In [5]:
import pandas as pd
import io

In [6]:
df = pd.read_csv(io.StringIO(uploaded['US_COVID19_Aggregated_Cases.csv'].decode('utf-8')), sep=',')
print(df)

      submission_date state  ...  consent_cases  consent_deaths
0             1/22/20    CO  ...          Agree           Agree
1             1/23/20    CO  ...          Agree           Agree
2             1/24/20    CO  ...          Agree           Agree
3             1/25/20    CO  ...          Agree           Agree
4             1/26/20    CO  ...          Agree           Agree
...               ...   ...  ...            ...             ...
16675        10/21/20    PW  ...            NaN             NaN
16676        10/22/20    PW  ...            NaN             NaN
16677        10/23/20    PW  ...            NaN             NaN
16678        10/24/20    PW  ...            NaN             NaN
16679        10/25/20    PW  ...            NaN             NaN

[16680 rows x 15 columns]


In [7]:
# df.info() would print a fuller summary (column types and non-null counts)
print(df.columns)
print(df.index)

Index(['submission_date', 'state', 'tot_cases', 'conf_cases', 'prob_cases',
       'new_case', 'pnew_case', 'tot_death', 'conf_death', 'prob_death',
       'new_death', 'pnew_death', 'created_at', 'consent_cases',
       'consent_deaths'],
      dtype='object')
RangeIndex(start=0, stop=16680, step=1)


### Use Pandas to Format Data

When it comes to data cleaning, there are some typical steps:

    --Remove duplicate or irrelevant observations
    --Fix structural errors (empty rows, nested structure, etc.)
    --Filter unwanted outliers
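
As a rough illustration of these three steps, the sketch below runs them on a tiny made-up table. The data, the variable names, and the outlier threshold of 1000 are all assumptions chosen for the example; they are not taken from the CDC dataset.

```python
import pandas as pd

# Made-up toy data: one duplicated row, one empty row, one implausible value.
raw = pd.DataFrame({
    "state": ["CO", "CO", "FL", None, "AZ"],
    "new_case": [10, 10, 12, None, 9999],
})

cleaned = raw.drop_duplicates()                 # remove duplicate observations
cleaned = cleaned.dropna(how="all")             # drop rows where every value is missing
cleaned = cleaned[cleaned["new_case"] < 1000]   # hypothetical outlier threshold

print(cleaned)
```

What counts as an outlier depends on the data, so the threshold should come from domain knowledge rather than from this example.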

This dataset is already clean, so little cleaning is needed. Instead, we reshape the data to focus on the aspects we are studying.

1. Shrink the dataframe to the columns you are focusing on

In [8]:
df_simp = df[['submission_date', 'state', 'tot_cases', 'new_case', 'tot_death', 'new_death']]
df_simp

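| | submission_date | state | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|---|---|
| 0 | 1/22/20 | CO | 0 | 0 | 0 | 0 |
| 1 | 1/23/20 | CO | 0 | 0 | 0 | 0 |
| 2 | 1/24/20 | CO | 0 | 0 | 0 | 0 |
| 3 | 1/25/20 | CO | 0 | 0 | 0 | 0 |
| 4 | 1/26/20 | CO | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 16675 | 10/21/20 | PW | 0 | 0 | 0 | 0 |
| 16676 | 10/22/20 | PW | 0 | 0 | 0 | 0 |
| 16677 | 10/23/20 | PW | 0 | 0 | 0 | 0 |
| 16678 | 10/24/20 | PW | 0 | 0 | 0 | 0 |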


2. Shrink the dataframe by the value of a column

In [9]:
df_updated = df_simp[df_simp["submission_date"] == "10/25/20"]
df_updated.head(5)

| | submission_date | state | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|---|---|
| 277 | 10/25/20 | CO | 95089 | 1689 | 2223 | 5 |
| 555 | 10/25/20 | FL | 768653 | 2348 | 16429 | 12 |
| 833 | 10/25/20 | AZ | 238163 | 1391 | 5874 | 5 |
| 1111 | 10/25/20 | SC | 170678 | 1337 | 3802 | 9 |
| 1389 | 10/25/20 | CT | 66052 | 0 | 4577 | 0 |


### Use Plotly to create a choropleth map

In [10]:
import plotly.graph_objects as go

The general structure of setting this up is:

    theFigureToCreate = go.Figure(
                        data = how you define data, 
                        layout = how you define layout)

Data is the essential argument because the figure needs to know which values to plot. Layout is optional; we define it only when we want the figure to look a certain way. With that in mind, we can trim the structure down to:


    theFigureToCreate = go.Figure(
                        data = how you define data)
    
If later we want to add information to the layout, we can also add the following structure to our figure:

    theFigureToCreate.update_layout(...)

A choropleth map is like a heat map drawn on geography: each region is shaded or patterned in proportion to a statistical value.

In [21]:
Map = go.Figure(data=go.Choropleth(
    locations=df_updated['state'],  # two-letter state abbreviations
    z=df_updated['tot_death'],  # values to color-code
    locationmode='USA-states',  # options include 'ISO-3', 'USA-states', 'country names'
    colorscale='Greys',
    colorbar_title="COVID deaths",
))

Map.update_layout(
    title_text='COVID Deaths by State',
    geo_scope='usa',  # limit the map scope to the USA
)

Map.show()

### 3D Scatterplot

Let's create another sub-dataframe for a 3D scatterplot.

In [12]:
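# Select the numeric columns before summing so that text columns are not aggregated;
# reset_index() turns the group key "submission_date" back into a regular column
value_cols = ["tot_cases", "new_case", "tot_death", "new_death"]
df_agg = df.groupby("submission_date")[value_cols].sum().reset_index()
df_agg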

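| | submission_date | tot_cases | new_case | tot_death | new_death |
|---|---|---|---|---|---|
| 0 | 1/22/20 | 1 | 1 | 0 | 0 |
| 1 | 1/23/20 | 1 | 0 | 0 | 0 |
| 2 | 1/24/20 | 2 | 1 | 0 | 0 |
| 3 | 1/25/20 | 2 | 0 | 0 | 0 |
| 4 | 1/26/20 | 5 | 3 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 273 | 9/5/20 | 6226879 | 45405 | 188051 | 892 |
| 274 | 9/6/20 | 6261216 | 34337 | 188513 | 462 |
| 275 | 9/7/20 | 6287362 | 26146 | 188688 | 175 |
| 276 | 9/8/20 | 6310663 | 23301 | 189147 | 459 |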
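
Note that the last rows of the aggregated table show September dates rather than 10/25/20. This is most likely because `submission_date` is stored as text, so the groups are sorted character by character instead of chronologically. The sketch below shows one way to handle this, assuming the dates follow the month/day/two-digit-year pattern seen in the tables; the small `dates` frame is only for illustration.

```python
import pandas as pd

# Dates taken from the tables in this notebook. As plain strings they sort
# character by character, so "9/8/20" lands after "10/25/20".
dates = pd.DataFrame({"submission_date": ["9/8/20", "10/25/20", "1/22/20"]})
text_order = sorted(dates["submission_date"])
print(text_order)

# Parse the strings into real dates, then sort chronologically.
dates["submission_date"] = pd.to_datetime(dates["submission_date"], format="%m/%d/%y")
dates = dates.sort_values("submission_date").reset_index(drop=True)
print(dates)
```

Applying the same conversion to `df["submission_date"]` before grouping would put the aggregated rows in calendar order.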


A 3D scatter plot positions each data point using three values, one per axis, to show how the data is distributed in that space.

In [23]:
trace1 = go.Scatter3d(
    x=df_agg.tot_death,
    y=df_agg.submission_date,
    z=df_agg.new_death,
    mode='markers',
    marker=dict(
        size=4,
        color=df_agg.new_death,  # set color to an array/list of desired values
        colorscale='Gray',   # use a grayscale color scale
        opacity=0.8
    )
)
data = [trace1]
layout = go.Layout(
    title='COVID Total vs. New Death',
    scene=dict(
        xaxis_title='Total death',
        yaxis_title='Date',
        zaxis_title='New death',
    ),
    width=800,
    margin=dict(r=20, b=30, l=10, t=30),
)

Scatter3D = go.Figure(data=data, layout=layout)
Scatter3D.show()

### Save your charts on the web

https://plot.ly/export/

https://chart-studio.plotly.com/

In [26]:
!pip install chart-studio

Collecting chart-studio
  Downloading https://files.pythonhosted.org/packages/ca/ce/330794a6b6ca4b9182c38fc69dd2a9cbff60fd49421cb8648ee5fee352dc/chart_studio-1.1.0-py3-none-any.whl (64kB)
Installing collected packages: chart-studio
Successfully installed chart-studio-1.1.0


In [27]:
import chart_studio
from chart_studio import plotly as py

user = "ximinmi"  # your Chart Studio username
key = "  "  # your Chart Studio API key (left blank here)
chart_studio.tools.set_credentials_file(username=user, api_key=key)
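
Because notebooks are often shared, it can be safer to keep the API key out of the notebook itself. The sketch below reads the credentials from environment variables instead. The helper name and the variable names `CHART_STUDIO_USER` and `CHART_STUDIO_API_KEY` are hypothetical; any names would work.

```python
import os


def load_chart_studio_credentials(env=os.environ):
    """Return (username, api_key) read from environment variables.

    The variable names used here are hypothetical; pick any names you like.
    """
    user = env.get("CHART_STUDIO_USER")
    key = env.get("CHART_STUDIO_API_KEY")
    if not user or not key:
        raise RuntimeError("Set CHART_STUDIO_USER and CHART_STUDIO_API_KEY first")
    return user, key


# In the notebook, the result would feed the same call used above:
# user, key = load_chart_studio_credentials()
# chart_studio.tools.set_credentials_file(username=user, api_key=key)
```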

In [28]:
py.iplot(Map, filename='Map') 
py.iplot(Scatter3D, filename='U.S COVID-19 New vs. Total Death over Time') 