**Data Visualization course - winter semester 20/21 - FU Berlin**

*Tutorials adapted from the [Information Visualization](https://infovis.fh-potsdam.de/tutorials/) course at the FH Potsdam*

# Tutorial 1: Getting started

During the tutorials you will be reading and writing **Python** code in **Jupyter** notebooks.
Phew… Let's unpack this a bit!

* 🐍 [Python](https://www.python.org) is a programming language that has gained considerable traction in recent years in many contexts, including data science and the digital humanities. It is popular for its versatility and readability. If you have never written any Python before, it is worth familiarizing yourself with the language, its basic constructs, and its conventions. Speaking of which…

* 🪐 [Jupyter](https://jupyter.org) notebooks are hybrid documents that contain both code and markup, which makes it easy to mix programming and documentation. What you are looking at now is a text cell written in the markup language Markdown. Further below you will see code cells written in the programming language Python (note the light grey background), which contain executable code! When viewing the notebooks on Colab or in Jupyter, you can double-click on any text cell to see its source.

In this tutorial you will get a bit acquainted with Python and Jupyter, and get to know a few handy libraries for working with data.

## 🌍 Hello world 

Okay, enough words. Let's dive right into it and start with a classic:

In [1]:
print("Hello world")

Hello world


The code cell above can be executed (i.e., run) by pressing **Shift + Enter**.

Of course we can set variables and extend them. Feel free to change the message:

In [2]:
hello = "hello world"
hello = hello + " how are you!"
hello

'hello world how are you!'

Now that we have our first variable `hello` we can perform some string tricks, for example, we could change the capitalization:

In [3]:
hello.title()

'Hello World How Are You!'

In [4]:
hello.upper()

'HELLO WORLD HOW ARE YOU!'
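
A few more string methods, as a small sketch. The variable `greeting` is introduced here only for illustration. Note that each call returns a new value and leaves the original string unchanged:

```python
# `greeting` is a made-up variable for this sketch.
greeting = "hello world how are you!"

# Strings are immutable: every method returns a new value.
swapped = greeting.replace("world", "Berlin")   # substitute a substring
words = greeting.split()                        # split on whitespace into a list
n_o = greeting.count("o")                       # how often "o" occurs
starts = greeting.startswith("hello")           # True or False

print(swapped)
print(words)
print(n_o, starts)
```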

✏️ *Now it's your turn! (The pencil stands for a small hands-on activity.) Try some string manipulations yourself. For inspiration, have a look at the [string methods](https://docs.python.org/3/library/stdtypes.html?#string-methods) that Python has built in:*

## 📦 Let's get some packages

Python itself provides only limited methods for working with more complex data. One of the main reasons for Python's (and  Jupyter's) popularity is the wide availability of software packages that provide powerful means for preparing, processing, presenting, and probing data. Throughout the tutorials you will get to know a few packages, some of them highly specific tools and others more general-purpose libraries. 

The Colab platform already has many packages ready to go. To use them in a notebook, you simply `import` them and assign an abbreviation after `as` to keep your code succinct. This is how you do it:

In [6]:
import pandas as pd

Now the powerful `pandas` package is loaded and will answer to its nickname `pd`.

🐼 [Pandas](https://pandas.pydata.org) really is a data analysis workhorse with the DataFrame data structure being one of its main muscles. You will learn to love it! With pandas you can do simple and sophisticated operations over small and sizable datasets. 

Let's load a real dataset to give you a sense of how it works. The following cell reads the COVID-19 dataset published by Our World in Data directly from its URL:


In [8]:
covid_data = pd.read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv")
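
`pd.read_csv` accepts a URL, a local path, or a file-like object. The sketch below reads a tiny CSV from an in-memory string, so it runs without network access. The rows are made-up values that only mimic the shape of the real file, and `parse_dates` is an optional argument that the cell above does not use:

```python
import io
import pandas as pd

# Made-up sample rows, not real statistics.
csv_text = """location,date,new_cases
Germany,2020-10-01,2500
Germany,2020-10-02,2700
France,2020-10-01,13000
"""

# io.StringIO wraps the string so that read_csv can treat it like a file.
# parse_dates converts the `date` column from text into datetime values.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(sample.dtypes)
```

Without `parse_dates`, the `date` column is read as plain text, which pandas reports as dtype `object`.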

To check whether the DataFrame was created successfully, we can simply type the variable name `covid_data` to display its content as an output:

In [10]:
covid_data

Unnamed: 0,iso_code,continent,location,date,total_cases,new_cases,new_cases_smoothed,total_deaths,new_deaths,new_deaths_smoothed,...,gdp_per_capita,extreme_poverty,cardiovasc_death_rate,diabetes_prevalence,female_smokers,male_smokers,handwashing_facilities,hospital_beds_per_thousand,life_expectancy,human_development_index
0,AFG,Asia,Afghanistan,2019-12-31,0.0,0.0,,0.0,0.0,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
1,AFG,Asia,Afghanistan,2020-01-01,0.0,0.0,,0.0,0.0,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
2,AFG,Asia,Afghanistan,2020-01-02,0.0,0.0,,0.0,0.0,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
3,AFG,Asia,Afghanistan,2020-01-03,0.0,0.0,,0.0,0.0,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
4,AFG,Asia,Afghanistan,2020-01-04,0.0,0.0,,0.0,0.0,,...,1803.987,,597.029,9.59,,,37.746,0.5,64.83,0.498
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48589,,,International,2020-10-03,696.0,,,7.0,,,...,,,,,,,,,,
48590,,,International,2020-10-04,696.0,,,7.0,,,...,,,,,,,,,,
48591,,,International,2020-10-05,696.0,,,7.0,,,...,,,,,,,,,,
48592,,,International,2020-10-06,696.0,,,7.0,,,...,,,,,,,,,,


In [11]:
covid_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48594 entries, 0 to 48593
Data columns (total 41 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   iso_code                         48312 non-null  object 
 1   continent                        48030 non-null  object 
 2   location                         48594 non-null  object 
 3   date                             48594 non-null  object 
 4   total_cases                      47979 non-null  float64
 5   new_cases                        47761 non-null  float64
 6   new_cases_smoothed               46979 non-null  float64
 7   total_deaths                     47979 non-null  float64
 8   new_deaths                       47761 non-null  float64
 9   new_deaths_smoothed              46979 non-null  float64
 10  total_cases_per_million          47697 non-null  float64
 11  new_cases_per_million            47697 non-null  float64
 12  new_cases_smoothed

The output generated by a code cell is printed right below it. In the case of a DataFrame we get a table. By convention, the rows are the data entries and the columns are the data dimensions. The first column on the left side is the index.
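
To make the terms rows, columns, and index concrete, here is a minimal sketch on a hand-built DataFrame. The name `toy` and its values are invented for illustration:

```python
import pandas as pd

# Made-up toy values, not real statistics.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "new_cases": [10, 25, 5],
})

n_rows, n_cols = toy.shape          # (number of rows, number of columns)
column_names = list(toy.columns)    # the data dimensions
index_labels = list(toy.index)      # default index: 0, 1, 2

print(n_rows, n_cols)
print(column_names)
print(index_labels)
print(toy.head(2))                  # the first two rows
```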

Now let's do something with our newly loaded DataFrame. For example, we could get the highest number of total cases using the ```max``` method.

In [14]:
covid_data.total_cases.max()

35848254.0

✏️ *What would it take to get the highest positive rate?*

To get the entry that holds the highest number of total cases, one needs to **loc**ate it via its index: `idxmax` returns the index label of the maximum, and `loc` retrieves the row with that label:

In [15]:
covid_data.loc[ covid_data.total_cases.idxmax() ]

iso_code                              OWID_WRL
continent                                  NaN
location                                 World
date                                2020-10-07
total_cases                        3.58483e+07
new_cases                               308788
new_cases_smoothed                      302387
total_deaths                       1.04818e+06
new_deaths                                5523
new_deaths_smoothed                    5580.71
total_cases_per_million                   4599
new_cases_per_million                   39.615
new_cases_smoothed_per_million          38.793
total_deaths_per_million               134.472
new_deaths_per_million                   0.709
new_deaths_smoothed_per_million          0.716
new_tests                                  NaN
total_tests                                NaN
total_tests_per_thousand                   NaN
new_tests_per_thousand                     NaN
new_tests_smoothed                         NaN
new_tests_smo
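
The interplay of `max`, `idxmax`, and `loc` can be sketched on a small hand-built DataFrame. The name `toy` and its values are invented for illustration. The bracket notation `toy["total_cases"]` used here is equivalent to the attribute notation `toy.total_cases`:

```python
import pandas as pd

# Made-up toy values, not real statistics.
toy = pd.DataFrame({
    "location": ["A", "B", "C"],
    "total_cases": [120, 450, 300],
})

highest = toy["total_cases"].max()        # the largest value itself
row_label = toy["total_cases"].idxmax()   # the index label where it occurs
top_row = toy.loc[row_label]              # the full row, returned as a Series

print(highest, row_label, top_row["location"])
```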

We can also calculate averages for several numeric columns at once by selecting them first and then calling `mean`. With `axis=0`, one mean is computed per column:

In [16]:
covid_data[['total_cases', 'new_cases', 'new_deaths']].mean(axis=0)

total_cases    111309.939203
new_cases        1501.151735
new_deaths         43.892758
dtype: float64
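
What the `axis` argument does is easiest to see on a small example. The sketch below uses invented values and includes one missing entry, since the real dataset contains many of them. By default, `mean` skips missing values:

```python
import pandas as pd

# Made-up toy values; NaN stands for a missing entry.
toy = pd.DataFrame({
    "new_cases": [10, 20, 30],
    "new_deaths": [1.0, float("nan"), 3.0],
})

per_column = toy.mean(axis=0)   # one mean per column (down the rows)
per_row = toy.mean(axis=1)      # one mean per row (across the columns)

print(per_column)
print(per_row)
```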

There is so much more to discover, some of which you will do over the course of the tutorials. The [DataFrame page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) in the pandas reference gives a complete (i.e., long) list of all methods provided by the data structure. 

If you want to do something specific but do not know the particular method name, a well-formulated search query can work wonders. In particular, the discussions on Stack Overflow contain many helpful entries. Quite often somebody else has already had a problem similar to the one you are trying to solve. The key then is to formulate your query precisely. For this it helps to understand the basic terminology of Python, pandas, etc.

## 🌠 Let's reach for the stars

Altair is the brightest star in the Aquila constellation and it is also the name of a versatile [visualization library](https://altair-viz.github.io/)  specifically created for Python based on the popular Vega-Lite visualization grammar. 

With 📊Altair we can create charts and visualizations in little time. 

In order to put Altair to use, we first have to import it and give it a short name:


In [17]:
import altair as alt

First, let's prepare the data. By default, Altair refuses to plot DataFrames with more than 5000 rows, and ours has 48594, so we need to do a bit of work to get the data into shape! Let's start by aggregating it.

In [18]:
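# Sum only the numeric columns per continent; text columns such as `date` cannot be summed meaningfully
data = covid_data.groupby('continent').sum(numeric_only=True).reset_index()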

First we call ```groupby``` to group our data by the ```continent``` column, then we sum the values in each group. The result of this computation has the grouped-by values in its index. Since Altair can only encode DataFrame columns, not the index, we turn the index back into a regular column by calling ```reset_index``` on the resulting DataFrame.
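
The effect of `reset_index` is easier to see on a small example. The sketch below uses invented values. Selecting the numeric column before summing is a choice made for this sketch, not part of the cell above:

```python
import pandas as pd

# Made-up toy values, not real statistics.
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia"],
    "location": ["X", "Y", "Z"],
    "new_cases": [10, 20, 5],
})

# After grouping and summing, `continent` is the index, not a column.
grouped = toy.groupby("continent")[["new_cases"]].sum()

# reset_index moves `continent` back into a regular column.
flat = grouped.reset_index()

print(grouped)
print(flat)
```

In `grouped`, the continent names label the rows. In `flat`, they form an ordinary column that a chart can refer to by name.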

In [25]:
alt.Chart(data).mark_bar().encode(x='continent', y='new_cases')

✏️ *Change the chart above into a horizontal bar chart of new cases:* 

With a few more specifications, we can give this bar chart some tooltips and a square aspect ratio:

In [23]:
alt.Chart(data).mark_bar().encode(
    x='continent', 
    y='new_cases',
    tooltip=['new_deaths', 'new_cases_per_million']
).properties(
    width=200,
    height=200
)

This is admittedly still a very simple chart, but it gets the job done.

Altair can be used to create a wide range of static and interactive visualizations. Have a look at its [gallery](https://altair-viz.github.io/gallery/index.html) for some inspiration!
