<p style="text-align: right; font-size:0.8em;"> Thomas Bury<br><a href=mailto:thomas.bury@mcgill.ca> thomas.bury@mcgill.ca </a> </p>
<h1> Notebook 2: Visualising higher dimensions </h1>
Learning objectives of this notebook:

1. Import and explore a variety of datasets
2. Create visualisations of higher dimensions including
    - 3D plots
    - Grid plots
    - Heat maps

<br>

In [1]:
import numpy as np
import pandas as pd
import datetime
import plotly.express as px
import plotly.graph_objects as go

<br><h2> New datasets </h2>
The choice is yours! Run the import cell for the data you would like to visualise. All data are in the public domain and were downloaded from Kaggle.

OR

Feel free to import your own dataset!


<br><h3> COVID-19 </h3>
<img src='images/covid19.jpg' width='60'>

COVID-19 cases, deaths, and recoveries at the provincial level across the globe. Data provided by Johns Hopkins University.

Details [here](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset?select=covid_19_data.csv).


In [11]:
df_covid = pd.read_csv('datasets/covid_19_data.csv')

# Put date of observation into 'datetime' form
df_covid['ObservationDate'] = pd.to_datetime(df_covid['ObservationDate'])

# Fill nan values in Province cells with string 'NA'
df_covid['Province/State'] = df_covid['Province/State'].fillna('NA')

In [3]:
df_covid.head()

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,2020-01-22,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,2020-01-22,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,2020-01-22,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,2020-01-22,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,2020-01-22,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


<br><h3> Heart disease UCI </h3>
<img src='images/heart.jpg' width='60'>

The Cleveland heart disease dataset. Investigate the different risk factors and outcomes among the patients.

Details [here](https://www.kaggle.com/ronitf/heart-disease-uci).


In [4]:
df_heart = pd.read_csv('datasets/heart.csv')

In [5]:
df_heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


<br><h3> World happiness report </h3>
<img src='images/happy.jpg' width='60'>

Data from the landmark survey on the state of global happiness from 2015-2019.

Details [here](https://www.kaggle.com/unsdsn/world-happiness?select=2019.csv). (Dataframes from different years have been concatenated.)


In [6]:
df_happy = pd.read_csv('datasets/world_happiness.csv')

In [7]:
df_happy.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,Year,Social support
0,Switzerland,Western Europe,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,2015,
1,Iceland,Western Europe,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,2015,
2,Denmark,Western Europe,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,2015,
3,Norway,Western Europe,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,2015,
4,Canada,North America,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,2015,


<br><h3> Breast cancer </h3>
<img src='images/cancer.png' width='40'>

The breast cancer Wisconsin diagnostic dataset. Investigate the features of the cell nuclei present in scans and how they relate to malignancy.

Details [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).


In [29]:
df_cancer = pd.read_csv('datasets/breast_cancer.csv')
# Drop unnamed column
df_cancer = df_cancer.drop(columns=['Unnamed: 32'])

<br><h3> Pokémon (?) </h3>
<img src='images/pikachu.png' width='60'>

For some lighthearted nostalgia.

Details [here](https://www.kaggle.com/abcsds/pokemon).


In [8]:
df_poke = pd.read_csv('datasets/pokemon.csv')

In [9]:
df_poke.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


<br><br><br><h2> Grid plots </h2>
Sometimes we just can't fit everything onto a single set of axes. Grid (facet) plots can help.<br>
<a href=https://plotly.com/python/facet-plots/> Plotly documentation </a>

In [10]:
df_covid.head()

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,2020-01-22,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,2020-01-22,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,2020-01-22,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,2020-01-22,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,2020-01-22,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


In [14]:
df_covid['Country/Region'].sort_values().unique()

array([' Azerbaijan', "('St. Martin',)", 'Afghanistan', 'Albania',
       'Algeria', 'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina',
       'Armenia', 'Aruba', 'Australia', 'Austria', 'Azerbaijan',
       'Bahamas', 'Bahamas, The', 'Bahrain', 'Bangladesh', 'Barbados',
       'Belarus', 'Belgium', 'Belize', 'Benin', 'Bhutan', 'Bolivia',
       'Bosnia and Herzegovina', 'Botswana', 'Brazil', 'Brunei',
       'Bulgaria', 'Burkina Faso', 'Burma', 'Burundi', 'Cabo Verde',
       'Cambodia', 'Cameroon', 'Canada', 'Cape Verde', 'Cayman Islands',
       'Central African Republic', 'Chad', 'Channel Islands', 'Chile',
       'Colombia', 'Comoros', 'Congo (Brazzaville)', 'Congo (Kinshasa)',
       'Costa Rica', 'Croatia', 'Cuba', 'Curacao', 'Cyprus',
       'Czech Republic', 'Denmark', 'Diamond Princess', 'Djibouti',
       'Dominica', 'Dominican Republic', 'East Timor', 'Ecuador', 'Egypt',
       'El Salvador', 'Equatorial Guinea', 'Eritrea', 'Estonia',
       'Eswatini', 'Ethiopia', 'F
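
The list above contains some inconsistent spellings, for example `' Azerbaijan'` (with a leading space) next to `'Azerbaijan'`, and `'Bahamas'` next to `'Bahamas, The'`. Below is a minimal sketch of one way to tidy such names before plotting. It works on a small hand-typed Series rather than on `df_covid`, and the `aliases` mapping is an assumption made for illustration, not part of the original dataset.

```python
import pandas as pd

# A few of the country names shown in the output above
names = pd.Series([' Azerbaijan', 'Azerbaijan', 'Bahamas',
                   'Bahamas, The', "('St. Martin',)"])

# Remove leading and trailing whitespace
cleaned = names.str.strip()

# Hypothetical mapping from variant spellings to a single name
aliases = {'Bahamas, The': 'Bahamas', "('St. Martin',)": 'St. Martin'}
cleaned = cleaned.replace(aliases)

print(sorted(cleaned.unique()))
```

The same two steps could be applied to `df_covid['Country/Region']` if you want each country to appear only once in a grid plot.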

In [21]:
df_plot = df_covid[
    df_covid['Country/Region'].isin(
        ['Canada','UK','US','Norway','Sweden','Kenya','Mainland China','Thailand'])
    ]

In [22]:
df_plot.head()

Unnamed: 0,SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
0,1,2020-01-22,Anhui,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
1,2,2020-01-22,Beijing,Mainland China,1/22/2020 17:00,14.0,0.0,0.0
2,3,2020-01-22,Chongqing,Mainland China,1/22/2020 17:00,6.0,0.0,0.0
3,4,2020-01-22,Fujian,Mainland China,1/22/2020 17:00,1.0,0.0,0.0
4,5,2020-01-22,Gansu,Mainland China,1/22/2020 17:00,0.0,0.0,0.0


In [23]:
fig = px.line(df_plot,
             x='ObservationDate',
             y='Confirmed',
             color='Province/State',
             facet_col='Country/Region',
             facet_col_wrap=4)

In [24]:
fig.write_html('figures/covid1.html')

<br><br><h2> 3D plots </h2>

Of course, we could add another spatial dimension. This works well when the plot is interactive and can be rotated.<br>
<a href=https://plotly.com/python/3d-scatter-plots/> Plotly documentation </a>

Let's take the World Happiness Report data.

In [25]:
df_happy.head()

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual,Year,Social support
0,Switzerland,Western Europe,1,7.587,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738,2015,
1,Iceland,Western Europe,2,7.561,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201,2015,
2,Denmark,Western Europe,3,7.527,1.32548,1.36058,0.87464,0.64938,0.48357,0.34139,2.49204,2015,
3,Norway,Western Europe,4,7.522,1.459,1.33095,0.88521,0.66973,0.36503,0.34699,2.46531,2015,
4,Canada,North America,5,7.427,1.32629,1.32261,0.90563,0.63297,0.32957,0.45811,2.45176,2015,


In [26]:
fig = px.scatter_3d(df_happy,
                    x='Economy (GDP per Capita)',
                    y='Family',
                    z='Health (Life Expectancy)',
                    color='Happiness Score',
                    hover_data=['Country'])

In [27]:
fig.write_html('figures/plot3d.html')

<br><br><h2> Heat maps </h2>
Add a dimension with colour. Heat maps are useful for visualising correlations between many variables.<br>
<a href=https://plotly.com/python/heatmaps/> Plotly documentation </a>

In [30]:
df_cancer.head()

Unnamed: 0,id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean,smoothness_mean,compactness_mean,concavity_mean,concave points_mean,...,radius_worst,texture_worst,perimeter_worst,area_worst,smoothness_worst,compactness_worst,concavity_worst,concave points_worst,symmetry_worst,fractal_dimension_worst
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [31]:
df_cancer.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [32]:
df_plot = df_cancer[df_cancer.columns[2:]].corr()

In [35]:
df_plot.index

Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst'],
      dtype='object')

In [36]:
fig = px.imshow(df_plot,
               labels=dict(x="Variable 1", y="Variable 2", color="Correlation"),
               x=df_plot.columns,
               y=df_plot.index)

In [37]:
fig.write_html('figures/heat1.html')

<br><br><h2> Practice area </h2>
Exercises:
1. Pick a dataset and familiarise yourself with the meaning of each column using the link provided.
2. Make a grid plot, a 3D plot and a heat map of the dataset.
3. Revisit the plots from Notebook 1 (particularly the scatter plot) with these data.