In [None]:
# Import the plotting library
import matplotlib.pyplot as plt
# Import the data analysis library
import pandas as pd

## Plotting introduction

[Matplotlib](https://matplotlib.org/) is the most popular plotting library for Python. Its defaults are a bit bare-bones, though.

Values can be plotted directly, and the result is displayed in the notebook itself. If no x-axis values are given, they are assumed to be the integers from 0 to the number of values minus 1.

In [None]:
# Creating a list of values
y_values = []
for i in range(20):
    y_values.append(i*i)

plt.plot(y_values)
print(y_values)

By the way, there is a shorthand for creating a list, called a list comprehension. The following is equivalent to the loop in the previous cell.

In [None]:
y_values = [i*i for i in range(20)]
print(y_values)

Additionally, two iterables can be given to specify the x-values and the y-values of the corresponding graph.

In [None]:
x_values = [i/100-0.5 for i in range(100)]
y_values = [x**3 for x in x_values]
plt.plot(x_values, y_values)

### Changing the plotting style

By default, the Matplotlib style is not very inviting, so let us change it. Once a style is set, all subsequently generated plots will use it; you can even re-run the previous cells to see the difference.

The list of all available built-in styles is [here](https://matplotlib.org/3.1.3/gallery/style_sheets/style_sheets_reference.html).

In [None]:
plt.style.use('ggplot')

We actually recommend using the default styling provided by the wonderful [seaborn](https://seaborn.pydata.org/) library.

In [None]:
import seaborn as sns
sns.set()  # Set the complete seaborn styling

### Figure size, labels, title

To make your plots more readable, it is often useful to state what each axis represents.

The cell below changes the size of the figure and adds a title and axis labels. Comment out individual lines to see their effect.

In [None]:
# Change the figure size
plt.figure(figsize=(12, 8))

x_values = [i/100-0.5 for i in range(100)]
y_values = [x**3 for x in x_values]
plt.plot(x_values, y_values)

# Add labels
plt.title('My Title')
plt.xlabel('What is that axis?')
plt.ylabel('Another axis')

### Distribution histograms

A histogram is a powerful way of visualizing a distribution of values. It only requires the list of values and automatically aggregates them into counts per bin.

In [None]:
import random
# Generate random values from a standard Gaussian distribution (bell-shaped curve)
random_values = [random.gauss(0, 1) for _ in range(10000)]
plt.hist(random_values);

Sometimes we want a more granular view of the histogram. For this, we need to increase the number of bins used for counting (i.e. the number of bars).

In [None]:
plt.hist(random_values, bins=40);
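
Besides a number of bins, the `bins` parameter of `plt.hist` also accepts an explicit sequence of bin edges. The following is a minimal sketch on freshly generated data; the edge values and the variable names are chosen only for illustration.

```python
import random
import matplotlib.pyplot as plt

random.seed(0)  # fixed seed so the sketch is reproducible
values = [random.gauss(0, 1) for _ in range(1000)]

# Explicit bin edges: 9 edges define 8 bins of width 1 covering [-4, 4].
# Values falling outside the outermost edges are simply not counted.
edges = [-4, -3, -2, -1, 0, 1, 2, 3, 4]
counts, returned_edges, _ = plt.hist(values, bins=edges)

print(len(counts))  # 8
```

`plt.hist` returns the counts per bin and the edges it used, which is handy for checking that the bins are the ones you intended.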

### DIY

Before using the easier automatic techniques, let us do things a bit manually.

Load the complete architectura data (all the JSON files). Can you display the distribution of the years in which the treatises were issued?

For this, you will first need to load all the data as a list of dictionaries, then extract all the years into a single list and use `plt.hist` on that list.

**WARNING**: some entries might not have a defined year (the value is `None`, Python's null value). These entries must be excluded.

In [None]:
import glob
import json

architectura_data = []
for fn in glob.glob('../../data/architectura_treaties/*.json'):
    with open(fn, 'r', encoding='utf-8') as f:
        architectura_data.append(json.load(f))
len(architectura_data)
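
The filtering mentioned in the warning above can be sketched with a list comprehension that has an `if` clause. The records below are made up, and the key name `year` is an assumption about the JSON structure; check the real key names in the loaded data.

```python
# Hypothetical records standing in for the loaded JSON files
records = [
    {"title": "A", "year": 1550},
    {"title": "B", "year": None},
    {"title": "C", "year": 1623},
]

# Keep only the years that are defined; .get also returns None if the key is absent
years = [r.get("year") for r in records if r.get("year") is not None]
print(years)  # [1550, 1623]
```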

Can you specify the bins so that each one covers exactly a quarter of a century, instead of relying on the automatic binning? Have a look at the documentation of `plt.hist`, especially the `bins` parameter.

Make it pretty by giving it a title and labelling the axes.

In [None]:
# Your code here

# Pandas Introduction

[Pandas](https://pandas.pydata.org/) is a powerful and widely used data analysis library.

Its fundamental object type is the `DataFrame`, which is basically a table and can be created directly by loading a `.csv` file.

## Creating a DataFrame

Here we load a sample dataset with data about the passengers of the Titanic.

In [None]:
titanic_df = pd.read_csv('../../data/titanic.csv')
type(titanic_df)

In [None]:
# A preview can directly be visualized in the notebook
titanic_df

The `DataFrame` is made of a series of columns, each of which has a type.

In [None]:
titanic_df.dtypes

It is also possible to pass a list of dictionaries as input, which lets us create a `DataFrame` from any form of data we have. Do this with the list of dictionaries coming from the loaded architectura data.

In [None]:
# Change architectura_data to your own loaded data
architectura_df = pd.DataFrame(architectura_data)

In [None]:
architectura_df
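
To see how a list of dictionaries maps onto a table, here is a minimal sketch on made-up records (the keys `title`, `year` and `language` are hypothetical, not necessarily those of the architectura data). Each dictionary becomes a row, each key becomes a column, and absent or `None` entries become missing values.

```python
import pandas as pd

# Hypothetical records; the third one has no "year" key at all
records = [
    {"title": "A", "year": 1550, "language": "French"},
    {"title": "B", "year": None, "language": "Italian"},
    {"title": "C", "language": "French"},
]

df = pd.DataFrame(records)
print(df.shape)                # (3, 3)
print(df["year"].isna().sum())  # 2 missing years
```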

### Accessing single column of the data

`DataFrame`s are complex and powerful objects. For example, you can access a single column directly as a `pd.Series` object.

Note also the NaN (Not-a-Number) values representing missing data, which are a common occurrence in tabular data since we often have incomplete information.

In [None]:
titanic_df.age

A column name sometimes contains spaces or unusual characters and then cannot be used as an attribute of the `DataFrame`. In all cases, you can access a column with standard indexing.

In [None]:
titanic_df['age']
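
As a small illustration of when attribute access fails, consider the sketch below. The column names `embark town` and `count` are made up for the example: the first contains a space, and the second collides with the existing `DataFrame.count` method.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22.0, 38.0],
    "embark town": ["Southampton", "Cherbourg"],  # space in the name
    "count": [1, 2],                              # same name as a DataFrame method
})

ages = df.age               # attribute access works for a simple name
towns = df["embark town"]   # a name with a space needs bracket indexing
counts = df["count"]        # df.count would return the method, not the column
```

Bracket indexing is therefore the safer default when writing code that should work for any column name.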

The `Series` object has a lot of methods available (have a look at them). For instance, you can plot a histogram directly (missing values are ignored), or get the maximum value and its corresponding index in the table.

In [None]:
titanic_df.age.hist()

In [None]:
# Display the age of the oldest passenger and the corresponding index in the DataFrame
titanic_df.age.max(), titanic_df.age.idxmax()
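
The way these methods treat missing values can be checked on a tiny `Series`. This is only a sketch with made-up numbers.

```python
import pandas as pd

# A small Series with one missing value (None is stored as NaN)
ages = pd.Series([22.0, None, 80.0, 35.0])

n_defined = ages.count()   # number of non-missing values: 3
oldest = ages.max()        # NaN is skipped: 80.0
oldest_idx = ages.idxmax() # index label of the maximum: 2
average = ages.mean()      # mean over the three defined values only
```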

### Accessing single rows

A single row can directly be accessed with its row number and `iloc`.

In [None]:
titanic_df.iloc[10]

The value of a given column in a single row can then be accessed directly.

In [None]:
titanic_df.iloc[10].embark_town

In [None]:
titanic_df.iloc[10].fare
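
Note that `iloc` selects by position, whereas `loc` selects by index label. The two coincide for the default 0, 1, 2, ... index but differ otherwise, as this sketch on a made-up table with custom index labels shows.

```python
import pandas as pd

# Hypothetical table whose index labels are 10, 20, 30 instead of 0, 1, 2
df = pd.DataFrame(
    {"name": ["a", "b", "c"], "fare": [7.25, 71.28, 8.05]},
    index=[10, 20, 30],
)

first_fare = df.iloc[0].fare      # first row by position: 7.25
label_fare = df.loc[20].fare      # row whose index label is 20: 71.28
same_value = df.loc[20, "fare"]   # row label and column name in one step
```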

### DIY

Using `architectura_df`, can you plot the distribution of the years of the books again, but in a much simpler way than in the previous version?

In [None]:
# your code here

Using `architectura_df`, can you visit the webpage of the oldest book in the dataset? You will first need to find its index and then use the `loc` indexing method of the `DataFrame` to extract the correct row.

An alternative solution is to sort the `DataFrame` by year and take the first row.

Notice how we do not need to care about missing or `None` values, as they are handled automatically.
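
Both approaches can be tried out on a toy table first. The sketch below uses made-up data with hypothetical column names `title` and `year`; `idxmin` skips missing values, and `sort_values` places them last by default.

```python
import pandas as pd

# Hypothetical books; the second one has no year
books = pd.DataFrame({"title": ["A", "B", "C"], "year": [1623, None, 1550]})

# Approach 1: find the index label of the smallest year, then look up the row
oldest_idx = books.year.idxmin()
oldest = books.loc[oldest_idx]

# Approach 2: sort by year (missing years end up last) and take the first row
first_sorted = books.sort_values("year").iloc[0]

print(oldest.title, first_sorted.title)  # C C
```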

In [None]:
#Your code here

### Filtering the data

`DataFrame`s are very easy to filter. The trick is to apply conditional logic to a `Series`. For instance, selecting only the male passengers gives this boolean `Series`.

In [None]:
titanic_df.sex == 'male'

This boolean `Series` can then be used directly as an index into the `DataFrame` to get a new, filtered `DataFrame`.

In [None]:
titanic_df[titanic_df.sex == 'male']

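Conditions can be combined, but instead of `and` and `or`, one needs to use `&` and `|` respectively. Each condition must also be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators such as `==` and `<`.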

In [None]:
titanic_df[(titanic_df.sex == 'male') & (titanic_df.age < 10)]
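
Here is a small sketch of combined conditions on a made-up table, including negation with `~`. The data is invented so that the results are easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": [5.0, 30.0, 40.0, 8.0],
})

# Both conditions must hold (element-wise "and")
young_males = df[(df.sex == "male") & (df.age < 10)]

# At least one condition must hold (element-wise "or")
male_or_young = df[(df.sex == "male") | (df.age < 10)]

# Negation of a condition (element-wise "not")
not_male = df[~(df.sex == "male")]

print(len(young_males), len(male_or_young), len(not_male))  # 1 3 2
```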

### DIY

Using filtering and `len`, can you compute how many treatises in architectura were issued in French and before 1650?

In [None]:
# Your code here


# Easy advanced plotting with seaborn

Seaborn is a powerful library for advanced plotting. What can you say about the survival of the people on the Titanic from the following visualizations?

In [None]:
sns.catplot(x="alive", y="age", hue="sex", kind="swarm", data=titanic_df);

In [None]:
sns.catplot(col="adult_male", x='alive', kind='count', data=titanic_df);

## Exercise

Can you use a [box plot](https://seaborn.pydata.org/tutorial/categorical.html#boxplots) to visualize the year of publication of the architectura treatises by language?

Once you have managed to plot the complete dataset, filter it to keep only the three most used languages (use `value_counts()` to find the most common languages, and `Series.isin()` to filter on multiple values).
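
How `value_counts()` and `isin()` work together can be sketched on a toy column. The language codes below are made up, and the column name `language` is an assumption about the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "language": ["fr", "it", "fr", "la", "it", "fr", "de", "en", "it", "la"],
})

# value_counts() returns the counts sorted from most to least frequent
counts = df.language.value_counts()

# The index of the result holds the values themselves; keep the first three
top3 = counts.index[:3]

# isin() gives a boolean Series usable as a filter
filtered = df[df.language.isin(top3)]
print(len(filtered))  # 8
```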

In [None]:
# Your code here

## Explore data by yourself (open ended)

Load the `xenotheka.json` data and put it in a `DataFrame`. How would you explore its characteristics?