# Visualizing Data with Seaborn

"A picture is worth a thousand words."

An important part of data analysis and data presentation is making data visualizations. With [Seaborn](https://seaborn.pydata.org/), you can easily make simple, clear, and beautiful graphs. The only thing you need to do is present the data to Seaborn in a form it accepts. In general, Seaborn works best with a pandas DataFrame. We will make several different types of visualizations with data we extract from ancient texts.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
from tf.app import use
A = use('etcbc/bhsa', hoist=globals())

## Stripchart

https://seaborn.pydata.org/generated/seaborn.stripplot.html
https://seaborn.pydata.org/tutorial/categorical.html#categorical-tutorial

If you want to make a visualization using a function from some package, it is important to get an idea of the structure of the data that the function needs. For instance, the [documentation of the stripplot](https://seaborn.pydata.org/generated/seaborn.stripplot.html) loads a dataset called "tips". Explore this dataset first!

In [None]:
tips = sns.load_dataset("tips")

In [None]:
type(tips)

In [None]:
tips.shape

In [None]:
tips.head()

You then see immediately that the arguments x and y of the function sns.stripplot() are the names of two columns: "day" is a categorical variable, and "total_bill" is a numeric variable.

In [None]:
ax = sns.stripplot(x="day", y="total_bill", data=tips)
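
To make the expected data layout concrete, here is a minimal sketch with a hypothetical toy table (the values are made up and are not taken from "tips"). It has one row per observation and one column per variable, and it separates the non-numeric columns from the numeric ones, which is a quick way to see which column could serve as the categorical axis and which as the value axis.

```python
import pandas as pd

# Hypothetical toy data in the same "long" layout as tips:
# one row per observation, one column per variable.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat"],
    "total_bill": [12.5, 20.0, 15.75, 30.1, 8.9],
})

# Non-numeric columns are candidates for the categorical axis,
# numeric columns are candidates for the value axis.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

print(categorical_cols)
print(numeric_cols)
```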

Could we also use plain lists? Sure!

In [None]:
day = ['monday', 'monday', 'monday', 'tuesday', 'tuesday']
bill = [10, 20, 30, 50, 55]

ax = sns.stripplot(x=day, y=bill)
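
When you pass plain lists, the values are paired by position, so both lists need the same length. The following sketch uses only the standard library to show which values end up under each category; the name `values_per_day` is introduced here purely for illustration.

```python
from collections import defaultdict

day = ['monday', 'monday', 'monday', 'tuesday', 'tuesday']
bill = [10, 20, 30, 50, 55]

# The two lists are paired by position, so they must have the same length.
assert len(day) == len(bill)

# Collect the values that belong to each category.
values_per_day = defaultdict(list)
for d, b in zip(day, bill):
    values_per_day[d].append(b)

print(dict(values_per_day))
```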

Now we make such a stripplot using alternative expressions for "kingdom" in Biblical Hebrew.

In [None]:
query = """
word lex=MLKWT/|MMLKH/
"""

In [None]:
results = A.search(query)
A.table(results)

In [None]:
results

In [None]:
result_nodes = [r[0] for r in results]

In the following cell we retrieve the lexemes in Hebrew script. Note that with [::-1] we reverse the order of the letters. This is a trick: the plot draws the characters from left to right, so without the reversal the right-to-left Hebrew words would appear in reversed order. Check this!

In [None]:
lexemes = [F.lex_utf8.v(w)[::-1] for w in result_nodes]
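
The slice [::-1] can be tried on a single string first. This sketch uses a hard-coded example word instead of Text-Fabric output. It assumes a string of plain consonants; if a string contains vowel points or other combining marks, a simple reversal would separate them from their base letters.

```python
# The word for "kingdom" in consonantal Hebrew script, as an example string.
word = "מלכות"

# Slicing with a step of -1 returns a new string with the characters in reverse order.
reversed_word = word[::-1]

# The first character of the reversed string is the last character of the original.
print(reversed_word[0] == word[-1])

# Reversing twice gives back the original string.
print(reversed_word[::-1] == word)
```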

In [None]:
kingdom_dict = {'id': result_nodes, 
                'lexeme': lexemes}

kingdom_df = pd.DataFrame(kingdom_dict)
kingdom_df

In [None]:
sns.set(style="whitegrid",
         rc = {'figure.figsize':(12,8)})

In [None]:
ax = sns.stripplot(y="id",
                   x="lexeme",
                   data=kingdom_df,
                   palette="Set1",
                   edgecolor="gray",
                   alpha=.75,
                   size=5
                   )
# Set the title in a separate step, so that ax remains the Axes object.
ax.set(title='"Kingdom" in the Hebrew Bible')

## Boxplot

https://seaborn.pydata.org/generated/seaborn.boxplot.html


In [None]:
ax = sns.boxplot(x="id",
                 y="lexeme",
                 data=kingdom_df
                 )
# Set the title in a separate step, so that ax remains the Axes object.
ax.set(title='"Kingdom" in the Hebrew Bible')

## Barplot

We make a barplot with the lengths of the books of the Hebrew Bible. We measure the length of a book by counting the number of words.

In [None]:
book_names = []
book_lengths = []

for b in F.otype.s('book'):
    book_names.append(F.book.v(b))
    
    word_count = len(L.d(b, 'word'))
    book_lengths.append(word_count)
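
The same counting logic can be tried without the corpus. In this sketch, `toy_corpus` is a hypothetical stand-in in which each book name maps to a list of made-up word nodes; the loop builds two parallel lists in the same way as the cell above.

```python
# Hypothetical stand-in for the corpus: each book maps to the list of its word nodes.
toy_corpus = {
    "Book_A": [1, 2, 3, 4],
    "Book_B": [5, 6],
    "Book_C": [7, 8, 9],
}

names = []
lengths = []

for name, words in toy_corpus.items():
    names.append(name)
    # The length of a book is the number of words it contains.
    lengths.append(len(words))

print(names)
print(lengths)
```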

In [None]:
ax = sns.barplot(x=book_names, 
                 y=book_lengths, 
                 palette="deep"
                )

ax.set(title = 'Length of biblical books')
# Rotate only the book names on the x-axis, so that they do not overlap.
ax.tick_params(axis='x', labelrotation=90)

plt.savefig('book_lengths.png')

## Heatmap with clustering

Which books have a similar use of parts of speech? We will investigate this by counting and plotting the different parts of speech in each biblical book.

In [None]:
query = """
book
  word language=Hebrew
"""

In [None]:
results = A.search(query)

In [None]:
books = []
pos = []

for result in results:
    bo, wo = result
    books.append(F.book.v(bo))
    pos.append(F.sp.v(wo))

In [None]:
pos_df = pd.DataFrame(zip(books, pos), columns=['book', 'pos'])

In [None]:
# Count each part of speech per book: one row per book, one column per part of speech.
pos_counts_df = pos_df.groupby('book')['pos'].value_counts().unstack().fillna(0)

In [None]:
# Divide each row by its total, so that long and short books become comparable.
stand_df = pos_counts_df.div(pos_counts_df.sum(axis=1), axis=0)
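
The counting and the normalization are easier to follow on a small example. The sketch below applies the same two steps to a hypothetical toy table with two books and three parts of speech (the rows are made up, not taken from the BHSA). After the division, every row sums to 1.

```python
import pandas as pd

# Hypothetical toy data: one row per word, with its book and part of speech.
toy_df = pd.DataFrame({
    "book": ["A", "A", "A", "B", "B"],
    "pos": ["verb", "subs", "verb", "subs", "prep"],
})

# Count the parts of speech per book; combinations that do not occur become 0.
counts = toy_df.groupby("book")["pos"].value_counts().unstack().fillna(0)

# Divide each row by its row total to get proportions per book.
proportions = counts.div(counts.sum(axis=1), axis=0)

print(counts)
print(proportions)
```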

In [None]:
g = sns.clustermap(stand_df, 
                   center=0, 
                   cmap="vlag",
                   dendrogram_ratio=(.1, .2),
                   cbar_pos=(.02, .32, .03, .2),
                   linewidths=.75, figsize=(12, 13))
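
What does "similar" mean here? By default, sns.clustermap() compares rows using the Euclidean distance between them. The following sketch computes that distance by hand for three hypothetical books with made-up proportions (not real counts): a small distance means two books have a similar part-of-speech profile.

```python
import math

# Hypothetical part-of-speech proportions (each row sums to 1); not real data.
profiles = {
    "Book_A": [0.30, 0.50, 0.20],
    "Book_B": [0.32, 0.48, 0.20],
    "Book_C": [0.10, 0.60, 0.30],
}


def distance(book_1, book_2):
    """Euclidean distance between the profiles of two books."""
    return math.dist(profiles[book_1], profiles[book_2])


d_ab = distance("Book_A", "Book_B")
d_ac = distance("Book_A", "Book_C")

# Book_A is closer to Book_B than to Book_C, so A and B would be grouped first.
print(d_ab < d_ac)
```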