# Why do you need to visualize the data? 

Visualization condenses large datasets into compact graphs, which makes complex data more accessible and easier to understand. The insights you gain this way feed data analysis, modelling, and inference.

# Why Seaborn?

**Seaborn** is built on top of [Matplotlib](https://matplotlib.org/), the de facto standard for data visualization in Python.

The advantages of **Seaborn**:
* Simpler syntax than Matplotlib
* Data-oriented: treats the dataset as a single unit and works directly with a pandas `DataFrame`
* Helps to avoid overlapping graphs
* Beautiful themes
* Aims to display the whole dataset on a single plot

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns            # Conventional import syntax
import matplotlib.pyplot as plt

print("Seaborn version: ", sns.__version__)

# The first diagram with Seaborn

Let's load the "[World Happiness Report](https://www.kaggle.com/datasets/unsdsn/world-happiness)" dataset and draw a histogram of the *Score* column:

In [2]:
report = pd.read_csv("../input/world-happiness/2019.csv")
print(report.head())
sns.histplot(data=report, x="Score").set_title("Score counts")

## Configuring the histogram

Providing the values of bins:

In [3]:
sns.histplot(report, x="Score", bins=list(range(2, 9)));

Setting the width of the bins:

In [4]:
sns.histplot(report, x="Score", binwidth=0.5);

Setting the range for the histogram (passing `y` instead of `x` draws the bars horizontally):

In [5]:
sns.histplot(report, y="Score", binwidth=1, binrange=(2, 8));
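
To make the effect of `binwidth` and `binrange` more concrete, here is a minimal pure-Python sketch of how a range and a bin width could translate into bin edges and counts. The helper names `bin_edges` and `count_per_bin` and the sample scores are made up for illustration. The sketch follows the usual histogram convention (every bin is half-open except the last one) and is not seaborn's actual implementation.

```python
def bin_edges(binrange, binwidth):
    """Return equally spaced bin edges covering binrange."""
    lo, hi = binrange
    n_bins = round((hi - lo) / binwidth)
    return [lo + i * binwidth for i in range(n_bins + 1)]


def count_per_bin(values, edges):
    """Count values per bin; bins are [left, right) except the last, which is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v == edges[-1]:
            counts[-1] += 1  # the right-most edge belongs to the last bin
            continue
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts  # values outside the range are ignored


# Made-up happiness scores, not taken from the dataset
scores = [2.9, 3.4, 4.1, 4.6, 5.2, 5.3, 6.0, 6.9, 7.5, 7.8]

edges = bin_edges((2, 8), 1)          # same settings as binwidth=1, binrange=(2, 8)
counts = count_per_bin(scores, edges)
print(edges)
print(counts)
```

Halving the bin width doubles the number of bins, which is why a smaller `binwidth` gives a more detailed but noisier histogram.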

# Composing multiple axes-level plots into one figure

As a grid using `plt.subplots`:

In [6]:
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
sns.histplot(report, x="GDP per capita", ax=axs[0][0]);
sns.histplot(report, x="Social support", ax=axs[0][1]);
sns.histplot(report, x="Generosity", ax=axs[1][0]);
sns.histplot(report, x="Score", ax=axs[1][1]);

Alternatively, overlay several histograms on the same axes:

In [7]:
sns.histplot(report, x="GDP per capita", binrange=(0,2), binwidth=0.2, color="r");
sns.histplot(report, x="Social support", binrange=(0,2), binwidth=0.2, color="g");
sns.histplot(report, x="Generosity", binrange=(0,2), binwidth=0.2, color="b");
plt.legend(["GDP per capita", "Social support", "Generosity"]);

# Working with the whole dataset

Let's examine the distributions of the dataset's features:

In [8]:
report_features = report.drop(["Overall rank", "Country or region", "Score"], axis = 1)
sns.displot(data=report_features);

OK, that is too cluttered to read.

Let's examine the distributions of only *some* features of the dataset.

In [9]:
sns.displot(report[["Social support",
                    "Healthy life expectancy",
                    "Generosity"]], kde=True);

In [10]:
sns.displot(report[["Social support",
                    "Healthy life expectancy",
                    "Generosity"]], kind="kde");

# Playing with themes

Switching to the default Seaborn theme:

In [11]:
sns.set_theme()
sns.displot(report[["Social support",
                    "Healthy life expectancy",
                    "Generosity"]], kde=True);

Setting a specific style and palette:

In [12]:
sns.set_theme(style="dark", palette="husl")
sns.displot(report[["Social support",
                    "Healthy life expectancy",
                    "Generosity"]], kde=True);

# Plotting a bivariate distribution

Using a kernel density estimate (KDE) with the `kdeplot` function:

In [13]:
sns.kdeplot(data=report, x="GDP per capita", y="Score", fill=True, color="cyan");

Using a KDE with the `jointplot` function, which also draws the marginal distributions of `x` and `y`:

In [14]:
sns.jointplot(data=report, x="GDP per capita", y="Score", kind="kde");
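
For intuition about what a KDE is, here is a minimal one-dimensional sketch in pure Python: every observation contributes a small Gaussian bump, and the estimate is the average of these bumps. The function name `gaussian_kde_1d`, the sample values, and the hand-picked bandwidth are assumptions for illustration. Seaborn chooses the bandwidth automatically and uses its own implementation.

```python
import math


def gaussian_kde_1d(samples, bandwidth):
    """Return a function that evaluates a Gaussian KDE of the samples."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on every observation
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Made-up values, not taken from the dataset
support = [1.1, 1.3, 1.4, 1.5, 1.5, 1.6]
density = gaussian_kde_1d(support, bandwidth=0.1)

# A density integrates to 1; check it with a simple Riemann sum on [0, 3]
step = 0.01
area = sum(density(i * step) for i in range(301)) * step
print(round(area, 3))
```

A larger bandwidth smooths the curve, a smaller one follows the individual observations more closely.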

You can also do some regression analysis while visualizing the data.

Use the `lmplot` or `regplot` function:

In [15]:
sns.lmplot(data=report, x="GDP per capita", y="Score");
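
The straight line in such a plot is a linear regression fit. As a rough sketch of what that means, the following pure-Python code computes an ordinary least squares line by hand. The helper `fit_line` and the data points are hypothetical, and the shaded confidence band that the plot adds around the line is not reproduced here.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on a line, not taken from the dataset
gdp = [0.2, 0.6, 1.0, 1.4]
score = [3.5, 4.5, 5.5, 6.5]

slope, intercept = fit_line(gdp, score)
predicted = slope * 1.2 + intercept  # predicted score for GDP per capita = 1.2
print(slope, intercept, predicted)
```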

`jointplot` can combine the regression and the marginal distributions of `x` and `y` in one plot.

In [16]:
sns.jointplot(data=report, x="GDP per capita", y="Score", kind="reg");

# Plotting relations

A simple line plot:

In [17]:
sns.lineplot(data=report, x="Overall rank", y="Generosity");
sns.lineplot(data=report, x="Overall rank", y="Healthy life expectancy");
plt.legend(["Generosity", "Healthy life expectancy"]);

More interesting: a scatterplot in which the marker size and color are weighted by *Score*:

In [18]:
sns.scatterplot(data=report, x="Freedom to make life choices",
                             y="Perceptions of corruption",
                             size="Score", hue="Score", sizes=(10, 200));
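
The `sizes=(10, 200)` argument sets the smallest and the largest marker size. A minimal sketch of the idea, assuming a plain linear rescaling of the values onto that range (the helper `scale_sizes` and the sample scores are made up, and seaborn's internal mapping may differ in detail):

```python
def scale_sizes(values, sizes):
    """Linearly rescale values so that min -> sizes[0] and max -> sizes[1]."""
    lo, hi = min(values), max(values)
    size_min, size_max = sizes
    return [
        size_min + (v - lo) / (hi - lo) * (size_max - size_min)
        for v in values
    ]


# Made-up scores, not taken from the dataset
scores = [3.0, 5.0, 7.0]
marker_sizes = scale_sizes(scores, sizes=(10, 200))
print(marker_sizes)
```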

In [19]:
sns.scatterplot(data=report, x="Freedom to make life choices",
                             y="Perceptions of corruption",
                             size="Score", hue="Score", sizes=(10, 200));
sns.scatterplot(data=report.loc[report["Country or region"] == "Russia"],
                x="Freedom to make life choices",
                y="Perceptions of corruption",
                size="Score", hue="Score", palette="Set1", sizes=(100,100));

# Plotting the spread of the data

Using simple box plots, which show the median, the quartiles, and the outliers of each feature:

In [20]:
sns.boxplot(data=report_features, orient="h");
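
To read a box plot it helps to know what each element stands for. The sketch below computes these numbers in pure Python, assuming the common convention that whiskers reach to the most extreme points within 1.5 interquartile ranges of the box. The helper `box_summary` and the sample values are hypothetical, and the exact quartile method used by the plotting library may differ slightly.

```python
import statistics


def box_summary(values, whis=1.5):
    """Return the quartiles, whisker ends, and outliers of a box plot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # height of the box
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],      # most extreme points inside the fences
        "whisker_high": inside[-1],
        "outliers": outliers,          # drawn as individual points
    }


# Made-up values, not taken from the dataset
generosity = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 1.5]
summary = box_summary(generosity)
print(summary)
```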

Let's add each country's continent to the dataset so that we can use categories in our analysis.

In [21]:
continents = np.array([["Finland", "Europe"],
["Denmark", "Europe"],
["Norway", "Europe"],
["Iceland", "Europe"],
["Netherlands", "Europe"],
["Switzerland", "Europe"],
["Sweden", "Europe"],
["New Zealand", "Australia and Oceania"],
["Canada", "America"],
["Austria", "Europe"],
["Australia", "Australia and Oceania"],
["Costa Rica", "America"],
["Israel", "Asia"],
["Luxembourg", "Europe"],
["United Kingdom", "Europe"],
["Ireland", "Europe"],
["Germany", "Europe"],
["Belgium", "Europe"],
["United States", "America"],
["Czech Republic", "Europe"],
["United Arab Emirates", "Asia"],
["Malta", "Europe"],
["Mexico", "America"],
["France", "Europe"],
["Taiwan", "Asia"],
["Chile", "America"],
["Guatemala", "America"],
["Saudi Arabia", "Asia"],
["Qatar", "Asia"],
["Spain", "Europe"],
["Panama", "America"],
["Brazil", "America"],
["Uruguay", "America"],
["Singapore", "Asia"],
["El Salvador", "America"],
["Italy", "Europe"],
["Bahrain", "Asia"],
["Slovakia", "Europe"],
["Trinidad & Tobago", "America"],
["Poland", "Europe"],
["Uzbekistan", "Asia"],
["Lithuania", "Europe"],
["Colombia", "America"],
["Slovenia", "Europe"],
["Nicaragua", "America"],
["Kosovo", "Europe"],
["Argentina", "America"],
["Romania", "Europe"],
["Cyprus", "Europe"],
["Ecuador", "America"],
["Kuwait", "Asia"],
["Thailand", "Asia"],
["Latvia", "Europe"],
["South Korea", "Asia"],
["Estonia", "Europe"],
["Jamaica", "America"],
["Mauritius", "Africa"],
["Japan", "Asia"],
["Honduras", "America"],
["Kazakhstan", "Asia"],
["Bolivia", "America"],
["Hungary", "Europe"],
["Paraguay", "America"],
["Northern Cyprus", "Europe"],
["Peru", "America"],
["Portugal", "Europe"],
["Pakistan", "Asia"],
["Russia", "Eurasia"],
["Philippines", "Asia"],
["Serbia", "Europe"],
["Moldova", "Europe"],
["Libya", "Africa"],
["Montenegro", "Europe"],
["Tajikistan", "Asia"],
["Croatia", "Europe"],
["Hong Kong", "Asia"],
["Dominican Republic", "America"],
["Bosnia and Herzegovina", "Europe"],
["Turkey", "Asia"],
["Malaysia", "Asia"],
["Belarus", "Europe"],
["Greece", "Europe"],
["Mongolia", "Asia"],
["North Macedonia", "Europe"],
["Nigeria", "Africa"],
["Kyrgyzstan", "Asia"],
["Turkmenistan", "Asia"],
["Algeria", "Africa"],
["Morocco", "Africa"],
["Azerbaijan", "Asia"],
["Lebanon", "Asia"],
["Indonesia", "Asia"],
["China", "Asia"],
["Vietnam", "Asia"],
["Bhutan", "Asia"],
["Cameroon", "Africa"],
["Bulgaria", "Europe"],
["Ghana", "Africa"],
["Ivory Coast", "Africa"],
["Nepal", "Asia"],
["Jordan", "Asia"],
["Benin", "Africa"],
["Congo (Brazzaville)", "Africa"],
["Gabon", "Africa"],
["Laos", "Asia"],
["South Africa", "Africa"],
["Albania", "Europe"],
["Venezuela", "America"],
["Cambodia", "Asia"],
["Palestinian Territories", "Asia"],
["Senegal", "Africa"],
["Somalia", "Africa"],
["Namibia", "Africa"],
["Niger", "Africa"],
["Burkina Faso", "Africa"],
["Armenia", "Asia"],
["Iran", "Asia"],
["Guinea", "Africa"],
["Georgia", "Asia"],
["Gambia", "Africa"],
["Kenya", "Africa"],
["Mauritania", "Africa"],
["Mozambique", "Africa"],
["Tunisia", "Africa"],
["Bangladesh", "Asia"],
["Iraq", "Asia"],
["Congo (Kinshasa)", "Africa"],
["Mali", "Africa"],
["Sierra Leone", "Africa"],
["Sri Lanka", "Asia"],
["Myanmar", "Asia"],
["Chad", "Africa"],
["Ukraine", "Europe"],
["Ethiopia", "Africa"],
["Swaziland", "Africa"],
["Uganda", "Africa"],
["Egypt", "Africa"],
["Zambia", "Africa"],
["Togo", "Africa"],
["India", "Asia"],
["Liberia", "Africa"],
["Comoros", "Africa"],
["Madagascar", "Africa"],
["Lesotho", "Africa"],
["Burundi", "Africa"],
["Zimbabwe", "Africa"],
["Haiti", "America"],
["Botswana", "Africa"],
["Syria", "Asia"],
["Malawi", "Africa"],
["Yemen", "Asia"],
["Rwanda", "Africa"],
["Tanzania", "Africa"],
["Afghanistan", "Asia"],
["Central African Republic", "Africa"],
["South Sudan", "Africa"]])

In [22]:
report.insert(2, "Continent", continents[:,1])
print(report.head())
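
Note that the cell above assigns continents by position, so it relies on the array rows being in the same order as the rows of the report. A possible alternative is to look each country up by name. The sketch below shows the idea on a tiny made-up frame. `toy_report` and `continent_by_country` are hypothetical names, and the scores are invented.

```python
import pandas as pd

# Tiny stand-in for the report; the rows are deliberately in a different
# order than the lookup table below.
toy_report = pd.DataFrame({
    "Country or region": ["Norway", "Finland", "Canada"],
    "Score": [7.5, 7.7, 7.3],  # invented values
})

continent_by_country = {
    "Finland": "Europe",
    "Norway": "Europe",
    "Canada": "America",
}

# Look up the continent by country name instead of by row position
toy_report.insert(
    1, "Continent", toy_report["Country or region"].map(continent_by_country)
)

# Countries missing from the lookup would end up as NaN; count them
unmatched = int(toy_report["Continent"].isna().sum())
print(toy_report)
print("Unmatched countries:", unmatched)
```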

In [23]:
sns.scatterplot(data=report, x="GDP per capita",
                             y="Healthy life expectancy",
                             size="Score", hue="Continent", sizes=(10, 200));
plt.legend(bbox_to_anchor=(1, 1));

Let's use a `violinplot` to examine the distribution of the happiness score across different continents.

A violin plot is a combination of a box plot and a kernel density estimate. By default the density curve is mirrored on both sides, forming the body of the violin.

In [24]:
reduced_report=report.loc[report["Continent"].isin(["Africa", "America", "Asia", "Europe"])]

In [25]:
sns.boxplot(data=reduced_report, x="Perceptions of corruption", y="Continent", orient="h");

In [26]:
sns.violinplot(data=reduced_report, x="Continent", y="Score");

# Plotting the relations with the figure-level plots

Figure-level functions can draw a set of plots within one figure using the `col` and `row` keyword arguments.

Use the `relplot` function to visualize relations. Each plot shows the subset of the data for a specific continent (see `col="Continent"`):

In [27]:
reduced_report = report.loc[
    report["Continent"].isin(["Africa", "America", "Asia", "Europe"])
]
sns.relplot(
    data=reduced_report,
    x="Freedom to make life choices",
    y="Perceptions of corruption",
    size="Score", sizes=(10, 200),
    hue="Score", col="Continent",
    col_wrap=2);
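
With `col` and `col_wrap`, every category gets its own panel, and the panels wrap onto a new row after `col_wrap` columns. The following pure-Python sketch illustrates that layout rule. The helper `facet_layout` is made up for illustration and only assumes that panels are filled row by row.

```python
import math


def facet_layout(levels, col_wrap):
    """Return the number of rows and the (row, column) of each panel."""
    n_rows = math.ceil(len(levels) / col_wrap)
    positions = {
        level: divmod(index, col_wrap) for index, level in enumerate(levels)
    }
    return n_rows, positions


n_rows, positions = facet_layout(["Africa", "America", "Asia", "Europe"], col_wrap=2)
print(n_rows)
print(positions)
```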

# Plotting pairwise dependencies in the dataset

`pairplot` lets you examine all the pairwise dependencies in the data at once.

In [28]:
sns.pairplot(data=reduced_report[["Continent", "GDP per capita", "Social support", "Generosity"]],
             hue="Continent");