[Information Visualization Tutorials](https://infovis.fh-potsdam.de/tutorials/) · FH Potsdam · Summer 2020

# Tutorial 6: Many dimensions

In this tutorial we are going to analyze multidimensional datasets. In the context of computing, ‘multidimensional’ refers to data that contains multiple attributes per item. When working with such multidimensional data we face a range of challenges from data preparation to processing and presentation. Let's look at these steps one after another.

## 🛒 1. Prepare 




In [1]:
import pandas as pd
import altair as alt

First, data preparation can be more elaborate when we are trying to make sense of many dimensions. We may encounter multidimensional data by 
- importing a dataset that contains multiple dimensions (easiest case),
- working with data from multiple sources (can be challenging), or
- querying multiple data fields from one database (may be difficult due to the query language). 

We will take a look at a range of national statistics about different sets of countries and thus go through each of these cases.

### OECD data

Of course, it would be most convenient if we had a dataset that already contains all the dimensions that we are interested in and that can be loaded directly into a Pandas DataFrame. Sometimes this is the case. For example, the [OECD Better Life Index](http://www.oecdbetterlifeindex.org/) about the prosperity, health, and development of its member states is available as one file. It comprises development data about 36 countries and 24 dimensions spanning social, economic, and ecological aspects of well-being.

The dataset exists online as a CSV file, so we can simply load it via `read_csv(url)`:

In [2]:
url = "https://gist.github.com/scotthmurray/f71065a5694f22259bf9/raw/ce891b9fe7ec3c5cab3308f4cd0c8eeccc36f6c7/Better%2520Life%2520Index%2520Data.csv"
oecd = pd.read_csv(url)

# to keep things consistent with the other examples we turn column names to lowercase
oecd.columns = map(str.lower, oecd.columns)

✏️ *Examine the dataset, e.g., by using `head()`, `info()` and `describe()`:*

### Multiple sources

Having everything in one file is the easiest and maybe rarest case of data preparation. More often than not we have to work with multiple datasets, each containing different aspects of the entities of interest. Suppose we would like to work with various country statistics, ranging from population numbers and area sizes to economic performance and even Covid–19 cases.

To do this, we will load data from three sources. In order to integrate the data sources, we need to match information belonging to the same country. For this we will use three-letter country codes ([ISO 3166-1 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3)). The country codes allow us to reliably cross-reference the datasets, without confusing different ways of spelling or naming countries (such as USA, United States, US, etc.). However, note that this does not resolve territorial conflicts that, of course, do exist.

Let's start with the basics: we will use the handy `countryInfoCSV` service by GeoNames to get population numbers and area sizes:

In [3]:
geonames_full = pd.read_csv("https://www.geonames.org/countryInfoCSV", sep='\t', keep_default_na=False)
geonames_full.head()

Unnamed: 0,iso alpha2,iso alpha3,iso numeric,fips code,name,capital,areaInSqKm,population,continent,languages,currency,geonameId
0,AD,AND,20,AN,Andorra,Andorra la Vella,468.0,77006,EU,ca,EUR,3041565
1,AE,ARE,784,AE,United Arab Emirates,Abu Dhabi,82880.0,9630959,AS,"ar-AE,fa,en,hi,ur",AED,290557
2,AF,AFG,4,AF,Afghanistan,Kabul,647500.0,37172386,AS,"fa-AF,ps,uz-AF,tk",AFN,1149361
3,AG,ATG,28,AC,Antigua and Barbuda,St John's,443.0,96286,,en-AG,XCD,3576396
4,AI,AIA,660,AV,Anguilla,The Valley,102.0,13254,,en-AI,XCD,3573511


Look! Did you notice that the GeoNames dataset contains a column called `iso alpha3`? This is our friend, which will link to the other datasets! So we will keep it, together with three other columns that might come in handy.

In [4]:
geonames = geonames_full[['name', 'iso alpha3', 'areaInSqKm', 'population']]

And for simplicity's sake we rename them, after which we set the country `code` as the DataFrame's index:

In [5]:
geonames.columns = ["country", "code", "area", "population"]
geonames = geonames.set_index("code")
geonames

code,country,area,population
AND,Andorra,468.0,77006
ARE,United Arab Emirates,82880.0,9630959
AFG,Afghanistan,647500.0,37172386
ATG,Antigua and Barbuda,443.0,96286
AIA,Anguilla,102.0,13254
...,...,...,...
YEM,Yemen,527970.0,28498687
MYT,Mayotte,374.0,159042
ZAF,South Africa,1219912.0,57779622
ZMB,Zambia,752614.0,17351822


Next we download GDP (gross domestic product) statistics from the World Bank, which provides such datasets [via their open data portal](https://data.worldbank.org/indicator/NY.GDP.MKTP.CD?view=map). Here we simply load it again via `read_csv()`:

In [6]:
worldbank_full = pd.read_csv("http://infovis.fh-potsdam.de/tutorials/data/gdp.csv", header=2)

The dataset contains GDP statistics over several years; we focus on 2018, which is the most recent and comprehensively covered year. Again, we rename the columns and set the country `code` as the DataFrame's index:

In [7]:
worldbank = worldbank_full[ ["Country Code", "2018"] ]
worldbank.columns = ["code", "gdp"]
worldbank = worldbank.set_index("code")
worldbank

code,gdp
ABW,
AFG,1.936297e+10
AGO,1.057510e+11
ALB,1.510250e+10
AND,3.236544e+09
...,...
XKX,7.938991e+09
YEM,2.691440e+10
ZAF,3.682889e+11
ZMB,2.672007e+10


Last but not least, we would like to include Covid–19 cases and death statistics, which are collected for many countries by the European Centre for Disease Prevention and Control (ECDC):

In [8]:
# https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide
ecdc_full = pd.read_csv("https://opendata.ecdc.europa.eu/covid19/casedistribution/csv")

By now you know what we'll do: select and rename the columns, and turn the country code into the index:

In [9]:
ecdc = ecdc_full[ ["cases", "deaths", "countryterritoryCode"] ]
ecdc.columns = ["covid19_cases", "covid19_deaths", "code"]
ecdc = ecdc.set_index("code")

The Covid–19 statistics are provided on a daily basis. For the purpose of this tutorial, we are only interested in total numbers:

In [10]:
ecdc = ecdc.groupby("code").sum()
ecdc

code,covid19_cases,covid19_deaths
ABW,1628,7
AFG,38070,1397
AGO,2222,100
AIA,3,0
ALB,8605,254
...,...,...
XKX,12448,467
YEM,1916,555
ZAF,611450,13159
ZMB,11148,280
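
The aggregation from daily records to totals can be sketched on a toy table. This is a minimal illustration with made-up numbers; the name `daily` is hypothetical, and the code is kept as a regular column here, whereas the cell above groups by the index level of the same name.

```python
import pandas as pd

# toy daily records (made-up numbers), one row per country code and day
daily = pd.DataFrame({
    "code": ["AND", "AND", "YEM", "YEM", "YEM"],
    "covid19_cases": [3, 5, 10, 0, 7],
    "covid19_deaths": [0, 1, 2, 0, 1],
})

# group the rows by country code and add up each numeric column
totals = daily.groupby("code").sum()
print(totals)
```

The result has one row per code, with the code as its index, which is the shape needed for the join that follows.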


To integrate the three DataFrames we need to make sure that the values associated with a specific country are linked to the right country. 

We already set the country codes as the indices for the three DataFrames. By virtue of using this identifier we can now simply integrate the various data sources using the `join()` method, which by default uses the indices of the DataFrames.
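
As a minimal sketch of how this index alignment behaves, assume two toy DataFrames (`left` and `right` are hypothetical names, and the GDP values are rounded for illustration). By default `join()` performs a left join, so rows without a match are kept and filled with `NaN`:

```python
import pandas as pd

# toy stand-ins for two sources that share country codes as their index
left = pd.DataFrame(
    {"country": ["Andorra", "Yemen", "Mayotte"],
     "population": [77006, 28498687, 159042]},
    index=pd.Index(["AND", "YEM", "MYT"], name="code"),
)
right = pd.DataFrame(
    {"gdp": [3.2e9, 2.7e10, 1.9e10]},  # rounded, for illustration only
    index=pd.Index(["AND", "YEM", "AFG"], name="code"),
)

# default: left join on the index; every row of `left` is kept,
# and codes that are missing in `right` (here MYT) get NaN
joined = left.join(right)
print(joined)

# how="inner" keeps only the codes present in both DataFrames
inner = left.join(right, how="inner")
print(inner)
```

This is why the integrated table can contain missing values, even though each source on its own looked complete for the countries it covers.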


In [11]:
multiple = geonames.join(worldbank).join(ecdc)

# from now on, we do not need the country codes anymore; we can remove them:
multiple = multiple.reset_index(drop=True)
multiple

Unnamed: 0,country,area,population,gdp,covid19_cases,covid19_deaths
0,Andorra,468.0,77006,3.236544e+09,1060.0,53.0
1,United Arab Emirates,82880.0,9630959,4.141789e+11,67282.0,376.0
2,Afghanistan,647500.0,37172386,1.936297e+10,38070.0,1397.0
3,Antigua and Barbuda,443.0,96286,1.610574e+09,94.0,3.0
4,Anguilla,102.0,13254,,3.0,0.0
...,...,...,...,...,...,...
247,Yemen,527970.0,28498687,2.691440e+10,1916.0,555.0
248,Mayotte,374.0,159042,,,
249,South Africa,1219912.0,57779622,3.682889e+11,611450.0,13159.0
250,Zambia,752614.0,17351822,2.672007e+10,11148.0,280.0


Some of the later steps require values to be present for each country; with `dropna()` we can remove all rows with missing values.


✏️ *Before we remove incomplete rows, let's do a quick check and take a look at these:*
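
One possible way to approach this check, sketched on a toy table rather than on `multiple` itself (the DataFrame `toy` and its rounded values are assumptions for illustration), is to build a boolean mask with `isna()`:

```python
import numpy as np
import pandas as pd

# toy table with some missing values
toy = pd.DataFrame({
    "country": ["Andorra", "Anguilla", "Mayotte", "Zambia"],
    "gdp": [3.2e9, np.nan, np.nan, 2.7e10],
    "covid19_cases": [1060.0, 3.0, np.nan, 11148.0],
})

# boolean mask: True for rows that contain at least one missing value
incomplete = toy[toy.isna().any(axis=1)]
print(incomplete)

# dropna() removes exactly those rows; reset_index renumbers what is left
complete = toy.dropna().reset_index(drop=True)
print(complete)
```

Looking at the incomplete rows first shows which countries are about to be dropped and which column is responsible.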

Removing incomplete rows keeps the old index, which we can reset yet again.

In [12]:
multiple = multiple.dropna().reset_index(drop=True)
multiple

Unnamed: 0,country,area,population,gdp,covid19_cases,covid19_deaths
0,Andorra,468.0,77006,3.236544e+09,1060.0,53.0
1,United Arab Emirates,82880.0,9630959,4.141789e+11,67282.0,376.0
2,Afghanistan,647500.0,37172386,1.936297e+10,38070.0,1397.0
3,Antigua and Barbuda,443.0,96286,1.610574e+09,94.0,3.0
4,Albania,28748.0,2866376,1.510250e+10,8605.0,254.0
...,...,...,...,...,...,...
176,Kosovo,10908.0,1845300,7.938991e+09,12448.0,467.0
177,Yemen,527970.0,28498687,2.691440e+10,1916.0,555.0
178,South Africa,1219912.0,57779622,3.682889e+11,611450.0,13159.0
179,Zambia,752614.0,17351822,2.672007e+10,11148.0,280.0


### Wikidata query

Apart from working with specific datasets, we can also generate custom data sources by formulating queries against databases.

For various topics, there are comprehensive databases that offer multidimensional information across a wide range of domains. Wikipedia's sister project Wikidata is a particularly broad knowledge base that can be queried using the query language SPARQL. The query syntax may seem a bit daunting, because it operates on triples that represent semantic relationships between entities. In this tutorial, we will not go deeper into the syntax, but it may already help to visit the web-based interface to the [Wikidata query service](https://query.wikidata.org).

Queries against Wikidata can be issued directly from the comfort of your notebook using HTTP requests! For this we need the `requests` library again, the URL of the endpoint, and the SPARQL query. The following query requests three bits of information about the member states of the European Union: `gdp`, `area` and `population`.

In [13]:
import requests

endpoint = "https://query.wikidata.org/sparql"

# triple quotes start and end multi-line strings
sparql = """
SELECT ?countryLabel ?population ?gdp ( MAX(?areas) AS ?area )
WHERE {
  ?country wdt:P31 wd:Q3624078;
           wdt:P463 wd:Q458;
           wdt:P1082 ?population;
           p:P2046/psn:P2046/wikibase:quantityAmount ?areas.
  
  ?country p:P2131 ?gdp_statement.
  ?gdp_statement ps:P2131 ?gdp;
                pq:P585 ?gdp_date. 

  FILTER NOT EXISTS {
    ?country p:P2131/pq:P585 ?gdp_date_ .
    FILTER (?gdp_date_ > ?gdp_date)
  }
  
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}

GROUP BY ?countryLabel ?area ?population ?gdp

"""

res = requests.get(endpoint, params = {'format': 'json', 'query': sparql})
response = res.text

In [14]:
print(response[:1000])

{
  "head" : {
    "vars" : [ "countryLabel", "population", "gdp", "area" ]
  },
  "results" : {
    "bindings" : [ {
      "countryLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "Croatia"
      },
      "area" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#decimal",
        "type" : "literal",
        "value" : "56594000000"
      },
      "population" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#decimal",
        "type" : "literal",
        "value" : "4105493"
      },
      "gdp" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#decimal",
        "type" : "literal",
        "value" : "54849180228.8716"
      }
    }, {
      "countryLabel" : {
        "xml:lang" : "en",
        "type" : "literal",
        "value" : "Romania"
      },
      "area" : {
        "datatype" : "http://www.w3.org/2001/XMLSchema#decimal",
        "type" : "literal",
        "value" : "238397000000"
      },
      "population" : {
      

The query might take a bit of time to finish (a minute or so), which is due to the various attributes that are queried. Once the request is finished, we can parse the results:

In [15]:
import json

# let's define the parsing steps as a function, so that you can reuse it later
def wikidata_to_dataframe(data):

    # parse json data from response and get results:
    results = json.loads(data)["results"]["bindings"]

    # column names we draw from the first result
    cols = list(results[0])

    rows = []

    # to get the values, we loop through the results and look up each column
    # by name, so that values stay aligned even if a result lacks a field:
    for result in results:
        values = [result[col]["value"] if col in result else None for col in cols]
        rows.append(values)

    # with rows and cols we can create a DataFrame:
    return pd.DataFrame(rows, columns=cols)

wikidata = wikidata_to_dataframe(response)
wikidata = wikidata.rename(columns={'countryLabel':'country'})
wikidata

Unnamed: 0,country,area,population,gdp
0,Croatia,56594000000,4105493,54849180228.8716
1,Romania,238397000000,19586539,211803281924.738
2,Sweden,528861060000,10343403,538040458216.997
3,Finland,338424380000,5501043,251884887972.766
4,Estonia,45339000000,1324820,30312000000.0
5,Austria,83878990000,8809212,398682000000.0
6,Czech Republic,78866000000,10693939,215725534372.371
7,Hungary,93036000000,9937628,139135029758.29
8,Luxembourg,2586400000,626108,62404461274.6636
9,Slovenia,20273000000,2066880,48769655479.2388


At this point the columns are not treated as series of numeric values yet, but strings (displayed as object). See:

In [16]:
wikidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   country     27 non-null     object
 1   area        27 non-null     object
 2   population  27 non-null     object
 3   gdp         27 non-null     object
dtypes: object(4)
memory usage: 992.0+ bytes


To let the DataFrame know that three of these columns contain numbers, we apply the `pd.to_numeric` function to all columns except the first one, which contains the country names:

In [17]:
cols = wikidata.columns[1:]
wikidata[cols] = wikidata[cols].apply(pd.to_numeric)
wikidata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27 entries, 0 to 26
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     27 non-null     object 
 1   area        27 non-null     int64  
 2   population  27 non-null     int64  
 3   gdp         27 non-null     float64
dtypes: float64(1), int64(2), object(1)
memory usage: 992.0+ bytes
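
Query results are not always this clean. As a hedged sketch of what happens with a value that cannot be parsed, the toy table below (hypothetical, with a made-up `"unknown"` entry) uses the `errors="coerce"` option of `pd.to_numeric`:

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Croatia", "Estonia"],
    "population": ["4105493", "1324820"],
    "gdp": ["54849180228.8716", "unknown"],  # "unknown" is a made-up bad value
})

cols = toy.columns[1:]

# errors="coerce" turns unparseable strings into NaN instead of raising an error
toy[cols] = toy[cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
print(toy)
```

Without `errors="coerce"`, the same call would stop with an error at the first string that is not a number, which can be the safer default when silent gaps are not wanted.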


✏️ *What would you query from Wikidata? Have a look at the examples of the [Wikidata Query Service](https://query.wikidata.org) for inspiration and use the following template*

In [18]:
sparql = """

# 1. enter your query here

"""

# 2. uncomment the following lines

# res = requests.get(endpoint, params = {'format': 'json', 'query': sparql})
# your_wikidata = wikidata_to_dataframe(res.text)
# your_wikidata

In case you're curious about the text colors in the above code cell: the string content of the SPARQL query can span multiple lines, therefore it is rendered in red. The lines below are displayed in green or blue (depending on where you read the tutorials), because they are *commented out* via the hash sign: **#**. You can quickly comment out (disable) a line by adding a hash and uncomment (enable) it by removing the hash. To do the above pencil exercise, you need to remove the hash signs in the last three lines.

## 🗂 2. Process

There are several processing steps we can apply to multidimensional data before visualizing it. First, we will take a look at the correlations between dimensions; then we will try out two techniques for dimensionality reduction.

### Correlation analysis

Regardless of the source, once we have multidimensional data we can explore how the different dimensions relate to one another, i.e., how they correlate. The `corr()` method of Pandas facilitates this process by calculating the coefficients of pairwise correlations between data columns…

In [19]:
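# recent pandas versions raise an error on text columns, so we keep only the numeric ones
oecd.select_dtypes("number").corr()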

Unnamed: 0,dwellings without basic facilities,housing expenditure,rooms per person,household net adjusted disposable income,household net financial wealth,employment rate,job security,long-term unemployment rate,personal earnings,quality of support network,educational attainment,student skills,years in education,air pollution,water quality,consultation on rule-making,voter turnout,life expectancy,self-reported health,life satisfaction,assault rate,homicide rate,employees working very long hours,time devoted to leisure and personal care
dwellings without basic facilities,1.0,-0.489839,-0.567042,-0.593321,-0.388205,-0.287332,-0.123392,-0.16602,-0.62909,-0.424416,-0.124793,-0.315943,-0.436999,0.337206,-0.738332,-0.480795,0.049858,-0.743453,-0.581215,-0.481953,0.27174,0.459555,0.413667,-0.376653
housing expenditure,-0.489839,1.0,0.108116,0.038185,0.075931,-0.087133,0.173189,0.266434,0.031752,0.064197,0.002601,-0.003023,0.12853,-0.097316,0.283561,0.227886,-0.143251,0.321563,0.391433,0.090239,-0.11976,-0.252707,-0.045215,0.023785
rooms per person,-0.567042,0.108116,1.0,0.759664,0.544616,0.474966,-0.068846,-0.158453,0.785113,0.610136,0.171443,0.484448,0.304749,-0.419969,0.668145,0.317098,0.268994,0.595731,0.556736,0.602806,-0.365941,-0.341003,-0.285292,0.355559
household net adjusted disposable income,-0.593321,0.038185,0.759664,1.0,0.731852,0.488977,-0.1701,-0.219721,0.924154,0.501033,0.359272,0.474423,0.134083,-0.376871,0.610765,0.238383,0.208298,0.622414,0.486712,0.555455,-0.417265,-0.449719,-0.332307,0.266777
household net financial wealth,-0.388205,0.075931,0.544616,0.731852,1.0,0.393753,-0.213427,-0.231309,0.69924,0.340065,0.265676,0.339731,-0.014888,-0.0976,0.430827,0.078099,-0.013059,0.51527,0.32013,0.41275,-0.205766,-0.291377,-0.092281,0.089523
employment rate,-0.287332,-0.087133,0.474966,0.488977,0.393753,1.0,-0.58198,-0.634556,0.485224,0.628297,0.463427,0.328473,0.260426,-0.368623,0.623944,0.140697,0.063188,0.32978,0.209322,0.747185,-0.233353,-0.085511,-0.27912,0.195197
job security,-0.123392,0.173189,-0.068846,-0.1701,-0.213427,-0.58198,1.0,0.718938,-0.179853,-0.216039,-0.369276,-0.166032,0.119941,0.067694,-0.265321,0.048292,-0.14133,0.059452,0.19161,-0.405422,0.011376,-0.112623,-0.050162,0.15372
long-term unemployment rate,-0.16602,0.266434,-0.158453,-0.219721,-0.231309,-0.634556,0.718938,1.0,-0.165461,-0.269344,-0.198428,-0.082061,0.126603,-0.031193,-0.215873,-0.100951,-0.296382,0.02529,-0.006083,-0.588429,-0.051477,-0.204423,-0.273848,0.298353
personal earnings,-0.62909,0.031752,0.785113,0.924154,0.69924,0.485224,-0.179853,-0.165461,1.0,0.534175,0.33808,0.57096,0.257718,-0.314274,0.652255,0.322097,0.228281,0.681086,0.492508,0.568969,-0.443623,-0.522366,-0.323937,0.343777
quality of support network,-0.424416,0.064197,0.610136,0.501033,0.340065,0.628297,-0.216039,-0.269344,0.534175,1.0,0.355565,0.40584,0.304004,-0.455147,0.668132,0.146287,0.137378,0.408093,0.380586,0.588929,-0.385733,-0.343444,-0.445918,0.371897


A value just under 1 indicates a strong positive correlation, a value just above -1 indicates a strong inverse correlation, and a value close to 0 indicates little or no linear relationship. For example, in the OECD data, *rooms per person* is relatively strongly correlated with *personal earnings* (about 0.79). Can you spot another strong negative or positive correlation? Later we will find a way to visualize such a correlation matrix.
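
Spotting strong correlations by eye gets tedious with 24 dimensions. The following is a minimal sketch, on synthetic data rather than on `oecd`, of one way to rank all pairs by the strength of their correlation; the column names `a` to `d` are made up for illustration.

```python
import numpy as np
import pandas as pd

# synthetic data: b rises with a, c falls with a, d is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.3, size=200),
    "c": -a + rng.normal(scale=0.3, size=200),
    "d": rng.normal(size=200),
})

corr = toy.corr()

# keep only the upper triangle, so that each pair appears once
# and the diagonal (always 1.0) is excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# stack() turns the matrix into one value per pair; dropna() removes the masked cells
pairs = upper.stack().dropna()

# order by absolute value to see the strongest relationships first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, since both are equally informative.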

✏️ *Run correlation analyses on any of the other DataFrames we prepared above:*

You might encounter high correlations between Covid–19 numbers and GDP. Remember and repeat after me: *correlation does not mean causation!*

### PCA

Next, we are going to look at two dimensionality reduction techniques that *project* a high-dimensional dataset onto a plane.

[Principal component analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA) is a data processing technique that can be used to reduce a high-dimensional dataset to a few (typically two) main components. While a given dataset may contain many different dimensions, PCA extracts the main axes that differentiate the data points best, i.e., the directions along which the data varies the most. Let's do this with the OECD data!


In [20]:
# import a few components from the machine learning library scikit-learn:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# create a data pipeline that first scales the data to normalize it
# and then runs a PCA with the number of principal components set to 2
pipe = Pipeline([('scaling', StandardScaler()), ('pca', PCA(n_components=2))])

# run the PCA on all columns (except the first which is the country name)
principal_components = pipe.fit_transform(oecd[oecd.columns[1:]])

# let's have a look
principal_components

array([[-3.1412375 , -1.16806912],
       [-1.81372201, -1.46980565],
       [-1.81756929, -0.47223677],
       [ 4.88261036, -2.84192496],
       [-3.39055718, -0.36589308],
       [ 4.19239801, -1.96566591],
       [ 0.39268858,  0.89939301],
       [-2.92025017, -0.07138337],
       [ 2.02998178,  0.63200076],
       [-2.4734093 ,  0.4890045 ],
       [-0.95910409,  0.1214977 ],
       [-2.27854523, -0.54517468],
       [ 3.10165962,  5.46032054],
       [ 2.56219186,  1.98623689],
       [-2.09358533, -1.23164587],
       [-1.76896703,  1.82271137],
       [ 1.50777355, -1.29780572],
       [ 0.27356567,  1.15774084],
       [-0.65385875, -0.44433129],
       [ 1.39581363, -0.55863067],
       [-1.72106825, -1.35882664],
       [ 6.42749295, -3.28152975],
       [-2.57220852, -0.80199977],
       [-2.24846402,  0.01581683],
       [-2.82842126, -1.50354461],
       [ 1.72656611,  2.00831992],
       [ 1.83323291,  2.21039445],
       [ 5.41805788, -2.01338784],
       [ 1.61020243,

The PCA returns the positions in the same order as the input dataset, which means we can simply combine these coordinates with the rest of the data to visualize it in a scatterplot:

In [21]:
# first turn principal_components into a DataFrame
pca_positions = pd.DataFrame(principal_components, columns = ['x', 'y'])

# … and combine it with the source dataframe oecd
oecd_pca = pd.concat([oecd, pca_positions], axis = 1)

Last but not least, we generate a scatterplot, using the x and y values from the PCA:


In [22]:
alt.Chart(oecd_pca).mark_circle().encode(
    x=alt.X('x', axis=alt.Axis(labels=False)),
    y=alt.Y('y', axis=alt.Axis(labels=False)),    
    tooltip='country'
).properties(width=400, height=400)
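
How much of the variation in the data do the two components actually capture? A fitted scikit-learn `PCA` exposes this as `explained_variance_ratio_`. The sketch below uses a synthetic matrix `X` as a stand-in (an assumption for illustration, not the OECD data) and the same kind of pipeline as above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for a table with 36 rows and 6 numeric columns:
# the columns are driven by two hidden factors plus a little noise
rng = np.random.default_rng(0)
factors = rng.normal(size=(36, 2))
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + rng.normal(scale=0.1, size=(36, 6))

pipe = Pipeline([("scaling", StandardScaler()), ("pca", PCA(n_components=2))])
positions = pipe.fit_transform(X)

# share of the total variance captured by each of the two components
ratios = pipe.named_steps["pca"].explained_variance_ratio_
print(ratios, ratios.sum())
```

The closer the sum is to 1, the less information the two-dimensional scatterplot hides; for real data with many loosely related dimensions it can be considerably lower.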

✏️ *Run the PCA analysis on any of the other multidimensional datasets we created above:*

In [23]:
# 1. run PCA on the numeric columns


# 2. combine the positions with original data


# 3. display a scatterplot



### UMAP

One major downside of PCA is that it discards all the information that is not captured by the few (often two) principal components that are selected. A more recently developed dimensionality reduction technique is [UMAP](https://umap-learn.readthedocs.io/) (Uniform Manifold Approximation and Projection). This projection technique considers all dimensions (as the name suggests: *in approximation*) to generate a low-dimensional layout that aims to preserve both global and local structures. We keep it brief here, but [there is more to it](https://pair-code.github.io/understanding-umap/).

In [24]:
# import the umap library
import umap

# there are two main parameters, which you need to tweak
reducer = umap.UMAP(
        n_neighbors=15,  # balances local versus global structure
        min_dist=.4  # .01 for tight clumps, larger values for looser ones
)

# the datasets have a similar structure; you can replace oecd with multiple or wikidata:
data = oecd

# we use again the StandardScaler for normalization
# and add the UMAP reducer afterwards into the pipeline
pipe = Pipeline([('scaling', StandardScaler()), ('umap', reducer)])

# start the normalization and reduction steps
embedding = pipe.fit_transform(data[data.columns[1:]])

# turn the resulting embedding into dataframe, in which the positions are x and y
umap_positions = pd.DataFrame(embedding, columns=["x", "y"])

# … and merge it with the original dataset
data_umap = pd.concat([data, umap_positions], axis = 1)

# to center the scatterplot around generated positions,
# we adjust the scales according to the smallest and largest x and y values:
x_domain = [data_umap["x"].min(), data_umap["x"].max() ]
y_domain = [data_umap["y"].min(), data_umap["y"].max() ]

# display scatterplot and pass domains for x and y to scale parameter
alt.Chart(data_umap).mark_circle().encode(
    alt.X('x', scale=alt.Scale(domain=x_domain), axis=alt.Axis(labels=False)),
    alt.Y('y', scale=alt.Scale(domain=y_domain), axis=alt.Axis(labels=False)),
    tooltip="country"
).properties(width=400, height=400)

✏️ *Play with the two main parameters and observe the various arrangements for the `oecd`, `multiple`, and `wikidata` datasets!*

## 🥗 3. Present

With the above dimensionality reduction techniques (projections, for short) we have already started to visualize the data. So we skip the simple scatterplot here…

### Scatterplot matrix

… but we can have multiple scatterplots in a matrix layout, each cell containing a small scatterplot for a pair of dimensions.

Altair supports [repeated charts](https://altair-viz.github.io/user_guide/compound_charts.html?#repeat-chart), with which we can create multi-view displays in a snap:

In [25]:
data = wikidata

cols = ["area", "population", "gdp"]


alt.Chart(data).mark_circle().encode(
    # the data dimensions used for the encoding are specified below under repeat
    # they are all quantitative and we remove the axis labels to avoid clutter
    alt.X(alt.repeat("column"), type='quantitative', axis=alt.Axis(labels=False)),
    alt.Y(alt.repeat("row"), type='quantitative', axis=alt.Axis(labels=False)),
    # we add a tooltip with the country's name
    tooltip = "country"    
).properties( width=150, height=150).repeat(
    # specify which data columns are used
    column=cols,
    row=cols
)

✏️ *Create a scatterplot matrix of another dataset! Note that there are probably too many dimensions in the `oecd` to be displayed at once.*

### Correlation heatmap

To examine how the dimensions correlate across a larger dataset, we can visualize the pairwise correlation values in a compact matrix display that we might call a correlation heatmap or heat table. For this we will turn the correlation table we generated above into a DataFrame to be visualized:

In [26]:
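# keep only numeric columns, since recent pandas versions raise an error on text columns
corr = oecd.select_dtypes("number").corr()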

# first we reset the index and call it dim1
corr = corr.reset_index().rename(columns={'index': 'dim1'})

# turn correlation data into long form 
corr = pd.melt(corr, id_vars="dim1", var_name='dim2', value_name='corr')

# add a label column for rounded correlation values
corr['label'] = corr['corr'].map('{:.1f}'.format)
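
To see what the reshaping does, here is a minimal sketch of the same `reset_index()` and `melt()` steps on a tiny matrix; the DataFrame `wide` and its values are made up for illustration.

```python
import pandas as pd

# a tiny correlation-like matrix with made-up values
wide = pd.DataFrame(
    [[1.0, 0.8], [0.8, 1.0]],
    index=["income", "earnings"],
    columns=["income", "earnings"],
)

# move the (unnamed) index into a regular column and call it dim1 ...
wide = wide.reset_index().rename(columns={"index": "dim1"})

# ... then melt into long form: one row per (dim1, dim2) pair
long = pd.melt(wide, id_vars="dim1", var_name="dim2", value_name="corr")
print(long)
```

A 2×2 matrix becomes four rows. This long form is what Altair expects, because each row can then be mapped to one cell of the heatmap.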

Now we have the pairwise correlation data in a shape that we can visualize:

In [27]:
# we create a layered chart, with the base taking in the correlation data corr
# and the basic layout based on the dimensions
base = alt.Chart(corr).encode(
    x='dim1:O',
    y='dim2:O'    
).properties(width=500, height=500)

# a textual layer displaying rounded correlation values
text = base.mark_text().encode( text='label' )

# heatmap of the correlation values
plot = base.mark_rect().encode(
    color='corr:Q'
)

# both layers are combined
plot + text

### Multiple views

While we can accommodate multiple data dimensions in the various visual channels, it is also possible to separate out the data dimensions into multiple coordinated visualizations. Here we simply look at the three data dimensions retrieved from Wikidata: population, area size, and GDP.

In [28]:
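# to coordinate hover highlights we create a selection
# (in Altair 5 and later, selection_single() and add_selection() are deprecated
# in favor of selection_point() and add_params())
selection = alt.selection_single(on='mouseover', fields=['country'])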

# the definitions of the base are used by the three sub-charts
base = alt.Chart(wikidata).mark_bar().encode(
  # adjust opacity based on hover selection
  opacity=alt.condition(selection, alt.value(1), alt.value(.5)),
  x = alt.X("country:O", sort="-y", axis=None),
  tooltip=['country','population', 'area', 'gdp']
).properties(
    width=600, height=150
).add_selection(selection)

# create a chart for each dimension
pop = base.encode(y = "population")
area = base.encode(y = "area")
gdp = base.encode(y = "gdp")

# combine them with ampersands
pop & area & gdp

✏️ *Create a multi-view visualization with the `multiple` dataset:*

## Sources

Tutorials
- [Combining DataFrames with Pandas](https://datacarpentry.org/python-ecology-lesson/05-merging-data/)
- [Where do Mayors Come From: Querying Wikidata with Python and SPARQL](https://janakiev.com/blog/wikidata-mayors/)


Documentation
- [Merge, join, and concatenate — pandas 1.0.3 documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)
- [UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction](https://umap-learn.readthedocs.io)
- [Altair Repeated Charts](https://altair-viz.github.io/user_guide/compound_charts.html#repeated-charts)




