**Note**: the following document includes explanations and code, but it is not designed to stand alone. These are the notes with which the instructor will facilitate the workshop, and they are therefore incomplete without further modifications, clarifications and explanations. Use them to familiarise yourself with some of the notions and with how the code looks. You will create your own document as you follow along on your own machine and type your own code on the workshop days. If there are bits that don't work, don't worry: they are there on purpose and you'll learn why.

<hr>

# Data Visualisation in Python

## Overview (what to expect today)

### Data Visualisation

<ul>
    <li>Preparing the data and introducing Matplotlib</li>
    <li>Line plotting</li>
    <li>Area plots</li>
    <li>Bar plots (horizontal and vertical)</li>
    <li>Histogram plots</li>
    <li>Annotating plots</li>
</ul>

## Preparing the Data and Introducing Matplotlib

Matplotlib allows us to produce 2D plots, and it is an essential library to know how to use. We will use ````matplotlib.pyplot```` to create different types of figures, change the x and y labels, write annotations, and more.

We will combine this with pandas' own plotting capacity, which uses ````matplotlib```` under the hood. We only need to create a dataframe and append ````.plot()```` to plot any part of the dataframe that we may need.
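
As a minimal sketch of how the two libraries fit together, the made-up dataframe below (the name ````df_demo```` and its values are illustrative, not the workshop data) is plotted with ````.plot()````. The object that comes back is an ordinary Matplotlib ````Axes````, so Matplotlib methods can be applied to it afterwards.

```python
import matplotlib.axes
import pandas as pd

# Illustrative data only: two made-up series indexed by year
df_demo = pd.DataFrame({"A": [1, 3, 2], "B": [2, 2, 4]}, index=[2009, 2010, 2011])

# pandas delegates the drawing to Matplotlib and returns the Axes it drew on
ax = df_demo.plot()

# Because ax is a Matplotlib object, the usual Matplotlib methods work on it
ax.set_title("Made-up example")
ax.set_xlabel("Years")

print(type(ax))
print(isinstance(ax, matplotlib.axes.Axes))
```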

In [None]:
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Let us read the file we will be using for our plots:

In [None]:
#Getting the data: immigration of nationals in EU28 countries, 2009-2018.

#Source: https://commonslibrary.parliament.uk/research-briefings/sn06077/

df_immigration = pd.read_excel('CBP06077-data.xlsx',
                              sheet_name = "4 (1)",
                              skiprows = range(3))

In [None]:
#Let's have a look at its head and tail

df_immigration.head()

In [None]:
df_immigration.tail()

In [None]:
#It seems the tail is saying something strange. Let's see more of it

df_immigration.tail(10)

In [None]:
#By default .drop drops rows, so here we drop the rows from 28-36

df_immigration.drop(range(28,37))

In [None]:
#To drop columns we have to tell .drop() that the axis = 1, which means columns. axis = 0 is the default
#and it means rows

df_immigration.drop(['Unnamed: 0', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], axis = 1)

Something seems wrong. We dropped the last rows because they contained no data, yet we can still see them. What happened?

Yep, we forgot ````inplace = True````. Without it, ````.drop()```` only returns (and shows) a modified copy; with it, the change is applied to the actual dataframe.
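
The difference can be seen on a small made-up dataframe. This is only a sketch: ````df_demo```` and ````shown_only```` are hypothetical names, and reassigning the result is shown as an alternative to ````inplace = True````.

```python
import pandas as pd

# Illustrative dataframe (not the workshop data)
df_demo = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# Without inplace = True, .drop() returns a new dataframe and leaves df_demo untouched
shown_only = df_demo.drop([2])
print(df_demo.shape)     # still 3 rows
print(shown_only.shape)  # 2 rows

# Option 1: ask pandas to modify the dataframe itself
df_demo.drop([2], inplace = True)

# Option 2: keep the default behaviour and reassign the result
df_demo = df_demo.drop(["y"], axis = 1)
print(df_demo.shape)
```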

Let's see how the dataframe actually looks and correct our mistakes.

In [None]:
df_immigration.head()

In [None]:
df_immigration.tail(10)

In [None]:
df_immigration.drop(range(28, 37), inplace = True)
df_immigration.tail()

In [None]:
df_immigration.drop(['Unnamed: 0', 'Unnamed: 12', 'Unnamed: 13', 'Unnamed: 14'], axis = 1, inplace = True)

df_immigration.head()

In [None]:
df_immigration.rename(columns = {'Unnamed: 1': 'Country'}, inplace = True)
df_immigration.head()

In [None]:
df_immigration.isnull().sum()

In [None]:
df_immigration.dtypes

In [None]:
df_immigration.replace(".", np.nan)

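Now that all the non-data entries have been filled with NaN, we have to convert the columns that were affected by the "." (and hence were of type "object" rather than numeric) to a numeric type. We use "float" rather than "int64" because NaN is a floating-point value and cannot be stored in an "int64" column.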

In [None]:
df_immigration[[2009, 2010, 2011]] = df_immigration[[2009, 2010, 2011]].astype("float")

#Show the types now
df_immigration.dtypes

Oops.. what's wrong?

In [None]:
# Again, inplace = True, otherwise what you see is not what the dataframe actually has

df_immigration.replace(".", np.nan, inplace = True)
df_immigration

In [None]:
#Now let us do the conversion again

df_immigration[[2009, 2010, 2011]] = df_immigration[[2009, 2010, 2011]].astype("float")

#Show the types now
df_immigration.dtypes

Perfect! Question: what do we do with the "NaNs"?
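
Before deciding, it can help to see how NaN values behave. The sketch below repeats the replacement and conversion on a small made-up dataframe (````df_demo```` is hypothetical, not the workshop data) and shows that many pandas operations, such as a row-wise sum, skip NaN by default.

```python
import numpy as np
import pandas as pd

# Illustrative data: "." marks a missing value, as in the spreadsheet
df_demo = pd.DataFrame({2009: [10, ".", 30], 2010: [5, 6, 7]})
print(df_demo.dtypes)  # the 2009 column is "object" because of the "."

# Replace the marker with NaN, then convert the affected column to float
df_demo = df_demo.replace(".", np.nan)
df_demo[2009] = df_demo[2009].astype("float")
print(df_demo.dtypes)

# Count the missing entries in the converted column
n_missing = df_demo[2009].isnull().sum()
print(n_missing)

# The row-wise sum ignores NaN instead of returning NaN for that row
row_totals = df_demo.sum(axis = 1)
print(row_totals)
```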

In [None]:
df_immigration

In [None]:
df_immigration.info()

In [None]:
df_immigration.describe()

In [None]:
df_immigration.columns.tolist()

It seems we are ready to move on to our visualisations! Note that data preparation *is a part of visualisation*, which is why we went through the process before plotting anything; it should make the visualisations easier.

Style first... sort of: we can choose the style of our <a href = "https://matplotlib.org/stable/gallery/style_sheets/style_sheets_reference.html">matplotlib figures</a>.

In [None]:
#See the styles in a list

plt.style.available

Let's use 'seaborn' as our style (because it looks cool)

In [None]:
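#Matplotlib 3.6+ renamed this style to 'seaborn-v0_8'; use whichever name is available
mpl.style.use('seaborn' if 'seaborn' in plt.style.available else 'seaborn-v0_8')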

## Line Plotting

We already plotted a simple figure yesterday using pandas. Pandas lets us plot simply by appending ````.plot()```` to a dataframe. Let us use it again to plot some of our new immigration data.

Example: Let's pick a nice and warm country to imagine ourselves there, and ask about the immigration data from 2009 - 2015.

In [None]:
#Make the 'Country' column the index

df_immigration.set_index('Country', inplace = True)

# To remove index name: df_immigration.index.name = None

df_immigration.head()

In [None]:
greece = df_immigration.loc['Greece', [2009, 2010, 2011, 2012, 2013, 2014, 2015]]
greece.head()

In [None]:
greece.plot()

In [None]:
greece.plot(kind = 'line', figsize = (12, 8))

plt.title('Immigration to Greece')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show()

We can also annotate (more on this below) the graph.

In [None]:
greece.plot(kind = 'line', figsize = (12, 8))

plt.title('Immigration to Greece')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.text(2012.9, 29500, 'Lowest Point: '+str(greece[2014]))

plt.show()
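
Rather than reading the lowest point off the graph and typing its coordinates by hand, the position can be looked up from the data. The following is a sketch on a made-up series (````demo````, ````low_year```` and ````low_value```` are illustrative names), using ````idxmin()```` to find the index label of the smallest value.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative series (made-up numbers, not the Greek data)
demo = pd.Series([50.0, 42.0, 35.0, 31.0, 38.0], index = [2011, 2012, 2013, 2014, 2015])

# Find the year with the lowest value and the value itself
low_year = demo.idxmin()
low_value = demo.min()

ax = demo.plot(kind = 'line', figsize = (12, 8))

# Place the label at the data point instead of at hand-picked coordinates
label = ax.text(low_year, low_value, 'Lowest Point: ' + str(low_value))
```

In practice a small offset is often added to the coordinates so that the text does not sit directly on top of the line.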

Let's see if we can see more than one country's immigration numbers.

In [None]:
df_gg = df_immigration.loc[['Greece', 'Germany'], [2009, 2010, 2011, 2012, 2013, 2014, 2015]]
df_gg.head()

There seems to be a mistake with "Germany". Let's see if we can figure the problem out by showing the columns.

In [None]:
df_immigration.columns.values

Since the column of "Country" is properly speaking the "index" (we made it the index above!), it doesn't appear as a column value. Let's check the index values.

In [None]:
df_immigration.index.values

Ah... if you look at "Germany ", there is a very annoying space after it, and hence the actual value of that row's index is not "Germany" but "Germany ", with the trailing space. That's why it didn't recognise the name. Let's fix it.

In [None]:
df_immigration.rename(index = {'Germany ': 'Germany'}, inplace = True)
df_immigration.index
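
If several labels might carry stray spaces, a more general option is to strip the whitespace from the whole index at once. This is a sketch on a made-up dataframe (````df_demo```` is hypothetical); it uses the string method ````.str.strip()```` and is an alternative to renaming one label at a time.

```python
import pandas as pd

# Illustrative dataframe whose index has stray spaces (made-up values)
df_demo = pd.DataFrame({2015: [1, 2, 3]}, index = ['Greece', 'Germany ', ' Spain'])

# repr() makes invisible whitespace visible
print([repr(name) for name in df_demo.index])

# Remove leading and trailing whitespace from every index label at once
df_demo.index = df_demo.index.str.strip()

print(df_demo.loc['Germany', 2015])
```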

Excellent! Now let's continue what we were doing.

In [None]:
df_gg = df_immigration.loc[['Greece', 'Germany'], [2009, 2010, 2011, 2012, 2013, 2014, 2015]]
df_gg.head()

In [None]:
#Plotting both Greece and Germany

df_gg.plot()

It doesn't seem to be working: why?

It seems that ````greece```` (on its own) plotted well because it is a different type of object from ````df_gg````.

In [None]:
type(greece)

In [None]:
type(df_gg)

In [None]:
greece.head()

We need to convert the dataframe so that the years are the indices, just like in greece.
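
The reason is that ````.plot()```` puts the index on the x-axis and draws one line per column. A small sketch with made-up numbers (````df_demo```` is a hypothetical name) shows how the number of lines changes once the dataframe is transposed.

```python
import pandas as pd

# Illustrative data: countries as rows, years as columns (made-up values)
df_demo = pd.DataFrame({2009: [10, 100], 2010: [12, 130], 2011: [9, 160]},
                       index = ['A', 'B'])

# With countries as the index, .plot() draws one line per year (three lines)
ax_rows = df_demo.plot()
print(len(ax_rows.lines))

# After transposing, the years are the index and each country is one line
ax_cols = df_demo.transpose().plot()
print(len(ax_cols.lines))
```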

In [None]:
df_gg.head()

In [None]:
df_gg.transpose()

In [None]:
df_gg = df_gg.transpose()

df_gg.plot(kind = 'line')

plt.title ('Immigration numbers to Greece and Germany')
plt.ylabel ('Numbers of Immigrants')
plt.xlabel ('Years')

plt.show()

Can this be right? If we look at the columns, immigration to Germany is in the hundreds of thousands and in 2015 it surpasses one million, whereas in Greece it is in the tens of thousands; hence the graph is not a good way of "comparing" them. Can we "compare" them in any meaningful sense?

One way is to make the values of each column relative to the maximum value of the column.

In [None]:
#Work on a copy; a plain "ndf_gg = df_gg" would make both names point to the same dataframe
ndf_gg = df_gg.copy()
ndf_gg[['Germany']] = df_gg[['Germany']]/df_gg[['Germany']].max()
ndf_gg[['Greece']] = df_gg[['Greece']]/df_gg[['Greece']].max()
ndf_gg

In [None]:
ndf_gg.plot()

plt.title ('Immigration to Greece and Germany')
plt.ylabel ('Ratio of Immigrants (Yearly Immigration / Maximum Year Immigration)')
plt.xlabel ('Years')

plt.show()
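
The same normalisation can also be written in one step for all columns. The sketch below uses made-up numbers (````df_demo```` and ````df_ratio```` are illustrative names); dividing a dataframe by its column maxima produces a new dataframe and leaves the original values untouched.

```python
import pandas as pd

# Illustrative data with years as the index and countries as columns
df_demo = pd.DataFrame({'Small': [20.0, 40.0, 30.0], 'Large': [500.0, 800.0, 1000.0]},
                       index = [2013, 2014, 2015])

# .max() gives one maximum per column; the division is aligned column by column
df_ratio = df_demo / df_demo.max()

print(df_ratio)
```

After the division every column peaks at 1, so columns of very different magnitudes can share one y-axis.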

## Area Plots

In [None]:
df_immigration.head()

In [None]:
df_immigration['Total']=df_immigration.sum(axis = 1)
df_immigration.head()

Let's plot the three countries with the highest total immigration, for the years 2009 - 2013.

In [None]:
df_immigration.sort_values(['Total'], ascending = False, axis = 0, inplace = True)

df_top3 = df_immigration.head(3)

df_top3 = df_top3[[2009, 2010, 2011, 2012, 2013]]

df_top3 = df_top3.transpose()

In [None]:
df_top3.head()

In [None]:
df_top3.plot(kind = 'area', stacked = False, figsize=(15, 10))

plt.title ('Top 3 Immigration Trends')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

In [None]:
#We can change the transparency values, the default being alpha = 0.5

df_top3.plot(kind = 'area', alpha = 0.1, stacked = False, figsize=(15, 10))

plt.title ('Top 3 Immigration Trends')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()

In [None]:
#We can make it stacked by removing the "stacked = False" option (or by setting it to "True")

df_top3.plot(kind = 'area', alpha = 0.2, figsize=(15, 10))

plt.title ('Top 3 Immigration Trends')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

plt.show()
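
To make the difference between the two modes concrete, here is a sketch on made-up data (````df_demo```` and ````running_total```` are hypothetical names). In the stacked version each area sits on top of the previous ones, so the upper edge of each area corresponds to the running total across the columns.

```python
import pandas as pd

# Illustrative data: years as the index, countries as columns (made-up values)
df_demo = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 4], 'C': [2, 1, 1]},
                       index = [2009, 2010, 2011])

# Unstacked: every area starts at zero, so the areas overlap
ax_overlap = df_demo.plot(kind = 'area', stacked = False)

# Stacked (the default for area plots): areas are piled on top of each other
ax_stacked = df_demo.plot(kind = 'area')

# The heights reached in the stacked plot are the cumulative sums across columns
running_total = df_demo.cumsum(axis = 1)
print(running_total)
```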

## Bar Plots: Vertical and Horizontal

We will use both vertical and horizontal bar plots. We'll focus on Germany given its remarkable numbers of immigration.

In [None]:
df_germany = df_immigration.loc['Germany']

df_germany.head()

In [None]:
df_germany.plot(kind = 'bar', figsize=(10, 5))

plt.xlabel('Year')
plt.ylabel('Immigration Numbers')
plt.title('Immigration to Germany 2009 - 2018')

plt.show()

We don't need that bar of "total" so we can remove it from our dataframe.

In [None]:
df_germany.drop(['Total'], inplace = True)

df_germany.plot(kind = 'bar', figsize=(10, 5))

plt.xlabel('Year')
plt.ylabel('Immigration Numbers')
plt.title('Immigration to Germany 2009 - 2018')

plt.show()

It seems that immigration to Germany had been increasing steadily since 2009, doubled from 2014 to 2015, and then returned to "normal" after that incredible surge. Let's try to annotate our graph.

In [None]:
df_germany.plot(kind = 'bar', figsize=(12, 10), rot = 45)

plt.xlabel('Year')
plt.ylabel('Immigration Numbers')
plt.title('Immigration to Germany 2009 - 2018')

plt.annotate('jump',
            xy = (5.9, 1.45*10**6),
            xytext = (4.9, 0.8*10**6),
            xycoords = 'data',
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'lightblue', lw = 2))

plt.show()

In [None]:
df_germany.plot(kind = 'bar', figsize=(12, 10), rot = 45)

plt.xlabel('Year')
plt.ylabel('Immigration Numbers')
plt.title('Immigration to Germany 2009 - 2018')

plt.annotate('jump',
            xy = (5.9, 1.45*10**6),
            xytext = (4.9, 0.8*10**6),
            xycoords = 'data',
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'lightblue', lw = 2))

plt.annotate('fall',
            xy = (7, 0.9*10**6),
            xytext = (5.9, 1.48*10**6),
            xycoords = 'data',
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'lightgreen', lw = 2))

plt.show()

In [None]:
df_germany.plot(kind = 'bar', figsize=(14, 11), rot = 45)

plt.xlabel('Year')
plt.ylabel('Immigration Numbers')
plt.title('Immigration to Germany 2009 - 2018')


# message, coordinates of the end of the arrow, coordinates of the beginning of the arrow,
# xycoords = 'data' to use the same units as the plot,
# type of arrow with its connectionstyle, colour and thickness
plt.annotate('jump',
            xy = (5.9, 1.45*10**6),
            xytext = (4.9, 0.8*10**6),
            xycoords = 'data',
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'lightblue', lw = 2))

plt.annotate('fall',
            xy = (7, 0.9*10**6),
            xytext = (5.9, 1.48*10**6),
            xycoords = 'data',
            arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', color = 'lightgreen', lw = 2))

#We can put some text as well

plt.annotate('Immigration Jump 2014 - 2015',
             xy = (5, 0.9*10**6), 
             rotation = 75.7,
             va = 'bottom',
             ha = 'left')

plt.annotate('Immigration Fall 2015 - 2016',
             xy = (6.2, 1.45*10**6), 
             rotation = -70.5,
             va = 'top',
             ha = 'left')

plt.show()
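
The x coordinates used in the annotations above (for example 5.9 and 7) are bar positions, not years: in a pandas bar plot the bars sit at positions 0, 1, 2, ... along the x-axis. The sketch below, on a made-up series (````demo````, ````peak_year````, ````peak_pos```` and ````peak_value```` are illustrative names), looks the position up from the index instead of counting bars by hand.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative series (made-up values, not the German data)
demo = pd.Series([3.0, 4.0, 9.0, 5.0], index = [2013, 2014, 2015, 2016])

ax = demo.plot(kind = 'bar', rot = 45)

# The x coordinate of a bar is its position in the index, not the year itself
peak_year = demo.idxmax()
peak_pos = demo.index.get_loc(peak_year)
peak_value = demo[peak_year]

# xy is the point the arrow points to; xytext is where the text (arrow start) goes
note = ax.annotate('peak',
                   xy = (peak_pos, peak_value),
                   xytext = (peak_pos - 1.5, peak_value),
                   xycoords = 'data',
                   arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3', lw = 2))
```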

## Histogram Plots

A histogram groups (or "bins") the values of a dataset into intervals and counts how many values fall into each interval.

In [None]:
df_immigration[2015].head()

In [None]:
count, bins = np.histogram(df_immigration[2015])

# count is the frequency count (how many values fall in each bin)

# bins holds the edges of the bins; if the number of bins is not given, it defaults to 10 (so 11 edges)

In [None]:
count

In [None]:
bins
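
To see what the two arrays contain, here is a small sketch with made-up values (````values```` is a hypothetical array) and only three bins, so that the result can be checked by hand.

```python
import numpy as np

# Illustrative values (made up)
values = np.array([1, 2, 2, 3, 5, 8, 9, 10])

# Ask for 3 bins; NumPy splits the range from 1 to 10 into 3 equal-width intervals
count, bins = np.histogram(values, bins = 3)

print(count)  # how many values fall in each bin
print(bins)   # the bin edges: always one more edge than there are bins
```

Here the edges are 1, 4, 7 and 10, and the counts are 4, 1 and 3. The last bin includes its right edge, which is why the value 10 is counted.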

In [None]:
df_immigration[2015].plot(kind = 'hist', figsize=(10, 8), xticks = bins)

plt.title('Histogram of Immigration to the EU in 2015')
plt.ylabel('Number of Countries')
plt.xlabel('Number of Immigrants')

plt.show()

Note that the units on the x-axis are in multiples of ten to the power of six (that is, millions).

Let's see the contribution to total immigration in the EU of four countries.

In [None]:
df_immigration.index.values

In [None]:
df_immigration.loc[['Romania', 'Belgium', 'Sweden', 'Austria'], [2013, 2014, 2015, 2016, 2017]].head()

In [None]:
df_four = df_immigration.loc[['Romania', 'Belgium', 'Sweden', 'Austria'], [2013, 2014, 2015, 2016, 2017]].transpose()
df_four.head()

In [None]:
df_four.plot(kind = 'hist', figsize = (10, 8))

plt.title('Romania, Belgium, Sweden, Austria: Histogram of Immigration from 2013 - 2017')
plt.ylabel('Number of Years')
plt.xlabel('Immigration Numbers')

plt.show()

Let's make it more understandable. We can change the bin number, the transparency, and the colours.

Also, note that the x "ticks" do not coincide with the bars. We'll fix that as well.

The interpretation is clear though: each bar shows us how many years the given country had immigration numbers that correspond to the x-scale. For example, we can see that Romania has had four years of being (roughly) between 1000 and 2500, and one year between 2500 and 4000. 
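
The counts behind such a plot can be reproduced by hand. The sketch below uses made-up data (````df_demo````, ````shared_bins```` and ````counts```` are hypothetical names): the bin edges are computed once from all the values pooled together, and then each country's years are counted within those shared bins.

```python
import numpy as np
import pandas as pd

# Illustrative data: years as the index, countries as columns (made-up values)
df_demo = pd.DataFrame({'A': [10, 12, 11, 30], 'B': [28, 29, 31, 12]},
                       index = [2014, 2015, 2016, 2017])

# Pool all values to get bin edges that every column shares
_, shared_bins = np.histogram(df_demo.to_numpy(), bins = 2)

# Count, for each country, how many years fall in each shared bin
counts = {}
for country in df_demo.columns:
    counts[country], _ = np.histogram(df_demo[country], bins = shared_bins)
    print(country, counts[country])
```

With these numbers, country A has three years in the lower bin and one in the upper bin, and country B has the reverse.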

In [None]:
# see colours here: https://matplotlib.org/2.0.2/examples/color/named_colors.html

count, bins = np.histogram(df_four, 15)

df_four.plot(kind = 'hist', figsize = (10, 8), bins = 15, alpha = 0.4, xticks = bins, 
             color = ['tomato', 'palegreen', 'plum', 'lightpink'])

plt.title('Romania, Belgium, Sweden, Austria: Histogram of Immigration from 2013 - 2017')
plt.ylabel('Number of Years')
plt.xlabel('Immigration Numbers')

plt.show()

In [None]:
count, bins = np.histogram(df_four, 15)

df_four.plot(kind = 'hist', figsize = (10, 8), bins = 15, alpha = 0.4, xticks = bins, 
             color = ['tomato', 'palegreen', 'plum', 'lightpink'], stacked = True)

plt.title('Romania, Belgium, Sweden, Austria: Histogram of Immigration from 2013 - 2017')
plt.ylabel('Number of Years')
plt.xlabel('Immigration Numbers')

plt.show()