#### This tutorial introduces how to create basic visualizations of a large data set using Python. 
#### Note: You do not need to have any prior knowledge of Python to successfully complete this exercise.
#### Remember to execute code cells with Shift+Enter.

<div class="alert alert-block alert-info">A cell like this indicates a question you need to answer in the "Answers.txt" file. Please answer the question <b>before</b> continuing through the notebook. You can <b>double click</b> on "Answers.txt" in the Left Sidebar now to open it in a new tab. As you go through the notebook, navigate between the tabs to answer questions.
</div>

## Introduction

The World Happiness Report is a survey of global happiness. It contains rankings of happiness based on participants' ratings of their own lives. 

*Happiness* is based on a survey in which representative samples of participants from each country are asked to think of a ladder, with the best possible life for them being a 10, and the worst possible life being a 0. They are then asked to rate their own current lives on that 0 to 10 scale. The report discusses how the results correlate with various life factors.

In this tutorial, we will try to answer some questions by creating visualizations of data from the World Happiness Report:
- Which countries and regions of the world are happiest?
- What factors contribute to a country's/region's happiness ranking?
- How did happiness change over time?

<b>We will work through a few different visualizations (listed below). You can follow links to jump directly to a particular section.</b>

## Table of contents

* [Variables](#1)
* [Import Libraries](#2)
* [Load & Preview Data](#3)
* [Bar Graph](#4)
* [Box Plot](#5)
* [Violin Plot](#6)
* [Scatter Plot](#7)
* [Pair Plot](#8) 
* [Heat Map](#9)
* [Interactive Bubble Plot](#10)
* [Sources](#11)

<a id=1></a>
## Variables
[[ go back to the top ]](#Table-of-contents)

The original report and data sets contain a lot of data. In this tutorial, we will use only a subset of these data, and focus on the following variables:

* **Country**: Name of the country
* **Region**: Region the country belongs to
* **Happiness_Score**: Participants were asked: “Please imagine a ladder, with steps numbered from 0 at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time?”

As factors that potentially contribute to Happiness_Score we consider:

* **Economy**: GDP per capita
* **Health**: Average life expectancy at birth
* **Family**: Average of the binary responses (either 0 or 1) to the question: “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
* **Freedom**: Average of the binary responses to the question: “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
* **Trust**: Average of the binary responses to the questions: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” 
* **Generosity**: Determined from the responses to the binary question: “Have you donated money to a charity in the past month?”

Note: These variables are taken from a number of different measurements and some have been altered and standardized so that the data set can be analyzed. There are various ways in which the data sets can be interpreted and analyzed; this tutorial uses simplified data sets and analyses for the purposes of demonstrating data visualizations. For full reports and statistical analyses, please visit https://worldhappiness.report/.

<div class="alert alert-block alert-info">Pause! Answer <b>Question 1</b> in the Answers.txt file.
    
Which variables do you think relate to Happiness_Score the most? Why?
    
</div>

<a id=2></a>
## Import Libraries
[[ go back to the top ]](#Table-of-contents)

In [None]:
# import Pandas library and call it 'pd' for analyzing & visualizing data
import pandas as pd
from pandas.plotting import autocorrelation_plot

# import matplotlib.pyplot and call it 'plt' for plotting data
import matplotlib.pyplot as plt
%matplotlib inline

# import numpy and call it 'np' for scientific computing
import numpy as np

# import seaborn and call it 'sns' for visualizations
import seaborn as sns 

# import libraries for interactive visualizations
from IPython.display import HTML
from bubbly.bubbly import bubbleplot
import plotly.offline as py
import plotly.graph_objs as go

import warnings            
warnings.filterwarnings("ignore") 

<a id=3></a>
## Load & Preview Data
[[ go back to the top ]](#Table-of-contents)

In [None]:
# Load data from csv files and assign to variables "data_2016", "data_2017", "data_2018", "data_2019"
data_2016=pd.read_csv("2016.csv")
data_2017=pd.read_csv("2017.csv")
data_2018=pd.read_csv("2018.csv")
data_2019=pd.read_csv("2019.csv")

In [None]:
# Show the first 5 rows of data_2016
data_2016.head()

In [None]:
# Show the first 5 rows of data_2017
data_2017.head()

In [None]:
# Show the first 5 rows of data_2018
data_2018.head()

In [None]:
# Show the first 5 rows of data_2019
data_2019.head()

Looking at the first few rows of data, it is still difficult to draw conclusions from the numbers alone. 

This is exactly why data visualizations are so powerful. Let's get started!

<a id=4></a>
## Bar Graph
[[ go back to the top ]](#Table-of-contents)

#### Let's start with an easy question: <b>What were the happiest countries in 2016?</b>

In [None]:
# Set figure size
plt.figure(figsize=(25,10))

# Create bar graph
sns.barplot(x=data_2016['Country'], y=data_2016['Happiness_Score'], palette="BuPu")

# Set axes and title
plt.xticks(rotation= 90)
plt.xlabel('Country', size = 15)
plt.ylabel('Happiness Score', size = 15)
plt.title('Happiness Score by Country in 2016', size = 18)

plt.show()

We can see that the happiest country in 2016 was Denmark, and the least happy country was Burundi.

But this graph contains a lot of information, so let's condense it by grouping countries by region and creating a new variable `region_happiness_ratio`. This is the average `Happiness_Score` of a region: the sum of `Happiness_Score` for all the countries in the region divided by the number of countries in that region.

To do this, we will create a new data frame, which is a table in which each column contains values of one variable and each row contains one set of values from each column. 
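
The regional average described above can also be sketched on a tiny table. The example below is only an illustration: the countries, regions, and scores are made up, and it uses pandas' `groupby` as a shorter alternative to the loop used in this notebook.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the report)
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Region": ["North", "North", "South", "South", "South"],
    "Happiness_Score": [7.0, 6.0, 5.0, 4.0, 3.0],
})

# Per region: sum of scores divided by number of countries (the mean)
region_mean = toy.groupby("Region")["Happiness_Score"].mean()

# Sort from the happiest region to the least happy region
region_mean = region_mean.sort_values(ascending=False)
print(region_mean)
```

Each row of `toy` is one country and each column is one variable, which is the layout a data frame expects.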

In [None]:
# Create new data frame "sorted_data_2016" with region and region_happiness_ratio_2016
region_lists=list(data_2016['Region'].unique())
region_happiness_ratio_2016=[]
for each in region_lists:
    region=data_2016[data_2016['Region']==each]
    region_happiness_rate_2016=sum(region.Happiness_Score)/len(region)
    region_happiness_ratio_2016.append(region_happiness_rate_2016)
   
data=pd.DataFrame({'region':region_lists,'region_happiness_ratio_2016':region_happiness_ratio_2016})
new_index=(data['region_happiness_ratio_2016'].sort_values(ascending=False)).index.values
sorted_data_2016 = data.reindex(new_index)

#### Let's visualize our condensed data: <b>What were the happiest regions of the world in 2016?</b>

In [None]:
# Set figure size
plt.figure(figsize=(10,8))

# Create bar graph
sns.barplot(x=sorted_data_2016['region'], y=sorted_data_2016['region_happiness_ratio_2016'], palette="BuPu")

# Set axes and title
plt.xticks(rotation= 90)
plt.xlabel('Region', size = 15)
plt.ylabel('Region Happiness Ratio', size = 15)
plt.title('Happiest Regions in the World in 2016', size = 18)

plt.show()

<div class="alert alert-block alert-info">Pause!
Create another bar graph for the region happiness ratio in <b>2019</b> (use the code cell below) and answer <b>Question 2</b> in the Answers.txt file. 

What were the 3 happiest regions in 2019?

*Note: The code for the new data frame called "sorted_data_2019" is already entered in the code cell below. You just need to include the code for the plot from the cell above, making sure to use the correct variable "sorted_data_2019" in the assignments to x and y.*
</div>

In [None]:
# Create new data frame "sorted_data_2019" with region and region_happiness_ratio_2019
region_lists=list(data_2019['Region'].unique())
region_happiness_ratio_2019=[]
for each in region_lists:
    region=data_2019[data_2019['Region']==each]
    region_happiness_rate_2019=sum(region.Happiness_Score)/len(region)
    region_happiness_ratio_2019.append(region_happiness_rate_2019)
   
data=pd.DataFrame({'region':region_lists,'region_happiness_ratio_2019':region_happiness_ratio_2019})
new_index=(data['region_happiness_ratio_2019'].sort_values(ascending=False)).index.values
sorted_data_2019 = data.reindex(new_index)

# your code here

<a id=5></a>
## Box Plot
[[ go back to the top ]](#Table-of-contents)

A box plot displays the distribution of a data set. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be outliers.
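
To make the parts of a box plot more concrete, here is a minimal sketch that computes them by hand for a small set of made-up scores. It assumes the common rule (also the default in matplotlib and seaborn) that points farther than 1.5 times the interquartile range from the box are drawn as outliers.

```python
import numpy as np

# Made-up scores for illustration only; 0.5 is deliberately far from the rest
scores = np.array([0.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0])

# The box spans the first quartile (q1) to the third quartile (q3)
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Whiskers reach the most extreme points that lie inside these limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are plotted individually as outliers
outliers = scores[(scores < lower_limit) | (scores > upper_limit)]
print("box:", q1, median, q3)
print("outliers:", outliers)
```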

Let's look at the `Happiness_Score` across years.

#### <b>Did regional `Happiness_Score` change from 2016 to 2019?</b>

In [None]:
# Combine all 4 data sets into one "data_concat" and add variable "Year"
data_2016['Year']=2016
data_2017['Year']=2017
data_2018['Year']=2018
data_2019['Year']=2019
data_concat=pd.concat([data_2016,data_2017,data_2018,data_2019],axis=0,sort = False)

# Create box plot
f,ax = plt.subplots(figsize =(20,10))
sns.boxplot(x="Year" , y="Happiness_Score", hue="Region",data=data_concat,ax=ax)
plt.xlabel('Year', size = 15)
plt.ylabel('Happiness_Score', size = 15)
plt.title('Happiness Score 2016-2019', size = 18)

# Format legend location
plt.legend(bbox_to_anchor=(1, 0.4, 0.2, 0.2), loc='center')

plt.show()

We can see that regional happiness scores didn't change much from 2016 to 2019.

<div class="alert alert-block alert-info">Pause! Answer <b>Question 3</b> in the Answers.txt file.
    
Why is there no visualization for **2018**? 
    
*Hint: Look back at the variable names (column names) in each data set in the Load & Preview Data section.*
</div>

[[go back to Load & Preview Data ]](#Load-&-Preview-Data)

<a id=6></a>
## Violin Plot
[[ go back to the top ]](#Table-of-contents)

A violin plot is similar to a box plot in that it shows the median, quartiles, and minimum and maximum values, but in addition, it also shows the probability density of the data at different values. This can be useful if the distribution of the data set has more than one peak (i.e., the distribution is multimodal). 
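
The "probability density" drawn by a violin plot is a smoothed version of the data. The sketch below shows the general idea with a hand-written Gaussian kernel density estimate on made-up values. The helper `gaussian_kde_1d` and the bandwidth of 0.03 are assumptions chosen for this illustration; seaborn chooses its own bandwidth, so its curves will look somewhat different.

```python
import numpy as np

# Made-up sample with two clusters of values (illustration only)
sample = np.concatenate([
    np.linspace(0.25, 0.35, 20),
    np.linspace(0.55, 0.65, 20),
])

def gaussian_kde_1d(data, grid, bandwidth):
    """Hypothetical helper: average of Gaussian bumps centered on each data point."""
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Evaluate the density on an evenly spaced grid of values
grid = np.linspace(0.0, 1.0, 201)
density = gaussian_kde_1d(sample, grid, bandwidth=0.03)

# A peak is a grid point whose density is higher than both of its neighbors
is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
n_peaks = int(is_peak.sum())
print("number of peaks:", n_peaks)
```

A box plot of `sample` would show a single box, whereas the density makes the two separate clusters visible.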

<img src="Boxplot_violinplot.png" align="left"/>

#### <b>What is the distribution of the features `Economy`, `Family`, and `Freedom` in 2016?</b>

In [None]:
# Create new data frame "data_2016_violin"
data_2016_violin=pd.pivot_table(data_2016, index = 'Region', values=["Economy","Family","Freedom"])

# Set figure size
f,ax=plt.subplots(figsize=(22,8))

# Create violin plot
sns.violinplot(data=data_2016_violin, inner="box")
plt.xticks(size = 15)
plt.yticks(size = 15)
plt.xlabel('Factor', size = 15)
plt.ylabel('Factor Score', size = 15)
plt.title('Economy, Family, and Freedom in 2016', size = 18)

plt.show()

<div class="alert alert-block alert-info">Pause! Create another violin plot (use the code cell below) for "<b>data_2019</b>" and answer <b>Question 4</b> in the Answers.txt file. 
    
Save your violin plot. Why do you think Freedom might have two peaks?
    
*Note: The code for a new data frame called "data_2019_violin" is already entered in the code cell below. Uncomment the last line of code to save the plot as a separate file.*
</div>

In [None]:
# Create new data frame "data_2019_violin"
data_2019_violin=pd.pivot_table(data_2019, index = 'Region', values=["Economy","Family", "Freedom"])

# your code here


# Uncomment the next line to save your graph as a png
# f.savefig('violinplot.png')

<a id=7></a>
## Scatter Plot
[[ go back to the top ]](#Table-of-contents)

A scatter plot can be used to visualize whether there is a correlation (relationship) between two variables (i.e., whether an increase or decrease in one variable tends to go along with an increase or decrease in the other variable).

#### <b>What was the relationship between `Economy` and `Happiness_Score` in 2019?</b>

In [None]:
# Create scatter plot for economy and happiness_score for data_2019
f,ax = plt.subplots(figsize = (15,8))
sns.scatterplot(x=data_2019["Happiness_Score"], y=data_2019["Economy"])

# Format title and axes
plt.title("Relationship Between Economy and Happiness Score in 2019", size=18)
plt.xlabel('Happiness_Score', size = 15)
plt.ylabel('Economy', size = 15)

# Add line of best fit
sns.regplot(x='Happiness_Score',y='Economy', data=data_2019)

plt.show()

We have included a line of best fit (or regression line) in the scatter plot to better visualize the relationship. We can see that the slope of the regression line is positive, which indicates a positive correlation: higher Economy scores correlate with higher Happiness scores.
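
The slope of a line of best fit and the strength of a correlation can also be expressed as numbers. The following sketch uses made-up values (not the report data) and NumPy's `polyfit` and `corrcoef`; it is meant as an illustration of the idea, not as the calculation seaborn performs internally.

```python
import numpy as np

# Made-up values for illustration only
happiness = np.array([3.5, 4.4, 5.1, 5.6, 6.6, 7.0])
economy = np.array([0.2, 0.5, 0.8, 1.0, 1.3, 1.5])

# Least-squares line: economy is approximately slope * happiness + intercept
slope, intercept = np.polyfit(happiness, economy, deg=1)

# Pearson correlation coefficient: ranges from -1 to 1
r = np.corrcoef(happiness, economy)[0, 1]

print(f"slope = {slope:.2f}, r = {r:.2f}")
```

A positive slope together with a value of `r` close to 1 corresponds to points that rise from left to right and lie close to the line.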

<div class="alert alert-block alert-info">Pause! Create another scatter plot for Happiness_Score and Generosity in 2019 (use the code cell below) and answer <b>Question 5</b> in the Answers.txt file.
    
Save your scatter plot and explain what the scatter plot is showing (e.g., positive, negative, or no correlation).  
</div>

In [None]:
# your code here

# Uncomment the next line to save your graph as a png
# f.savefig('scatterplot.png')

<a id=8></a>
## Scatterplot Matrix (Pair Plot)
[[ go back to the top ]](#Table-of-contents)

A scatterplot matrix shows a matrix of scatterplots for the combination of any two attributes, and plots the distribution of each column along the diagonal. This set of plots allows us to see the relationships between any two variables and the distribution of each variable for all the variables in a data set.

Scatterplot matrices are useful for identifying trends to follow up on in large data sets with several variables.

#### <b>What are the relationships between four factors: `Economy`, `Family`, `Health`, and `Freedom`?</b>

In [None]:
# Drop columns from data sets to only include variables of interest
data_2016_reduced = data_2016.drop(['Generosity', 'Trust', 'Year'], axis=1)
data_2017_reduced = data_2017.drop(['Generosity', 'Trust', 'Year'], axis=1)
data_2018_reduced = data_2018.drop(['Generosity', 'Trust', 'Year'], axis=1)
data_2019_reduced = data_2019.drop(['Generosity', 'Trust', 'Year'], axis=1)

In [None]:
# Create a pair plot for data_2019_reduced
sns.pairplot(data_2019_reduced, hue="Region")
plt.show()

# Note: Plotting the scatterplot matrix will take a moment. 

In the scatterplots, each dot represents a country, and its color indicates the region that the country belongs to. 

<div class="alert alert-block alert-info">Pause! Answer <b>Question 6</b> in the Answers.txt file.
    
What do you notice about the pair plot? 
What might be a problem of plotting this particular data set in this way?
    
*Hint: Think about the number of countries in each region.*
    
</div>

<a id=9></a>
## Heat Map
[[ go back to the top ]](#Table-of-contents)

We can also visualize correlations between variables with a so-called heat map. A heat map shows the magnitude of a relationship as color, and is particularly well-suited for making prominent values easily recognizable in a large amount of data.

In [None]:
# Remove column Year from data frame
data_2019_noYear = data_2019.drop(['Year'], axis=1)

# Create heat map for data_2019_noYear
f,ax=plt.subplots(figsize=(10,8))
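# Correlations are only defined for numeric columns, so select those first
sns.heatmap(data_2019_noYear.select_dtypes(include='number').corr(), annot=True, cmap="BuPu")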
plt.show()

In this heat map, lighter colors represent lower correlation values and darker colors represent higher correlation values. Correlation values range from -1 to 1, so for positive values a darker color means a stronger relationship. The diagonal represents variables correlated with themselves, which always gives a value of 1.

Keep in mind that a correlation is simply an association between two variables, whether it be positive or negative, and does not indicate causality.
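
The numbers shown in a heat map come from a correlation matrix. As a minimal sketch, the example below builds such a matrix for a small made-up table using pandas' `DataFrame.corr`; the values are invented for illustration and are not taken from the report.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Happiness_Score": [7.5, 6.8, 5.9, 5.1, 4.2, 3.6],
    "Economy": [1.4, 1.3, 1.0, 0.9, 0.5, 0.3],
    "Generosity": [0.2, 0.3, 0.1, 0.3, 0.2, 0.3],
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()
print(corr.round(2))
```

Each cell holds the correlation between the variable in its row and the variable in its column, so the matrix is symmetric and its diagonal is 1.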

<div class="alert alert-block alert-info">Pause! Create another heat map for "data_2016" in the code cell below and answer <b>Question 7</b> in the Answers.txt file. Remember to first remove the column "Year" from data_2016.
    
From the two heat maps (for 2019 and 2016), which factors are most correlated with Happiness_Score? 
</div>

In [None]:
# your code here

<a id=10></a>
## Interactive Bubble Plot
[[ go back to the top ]](#Table-of-contents)

A bubble plot is a scatterplot with a third dimension, which is represented by the size of the dots. This representation conveys more information in a single plot, and it can also make the data easier to grasp and interpret visually.

#### <b>What is the relationship between `Happiness_Score`, `Trust`, and `Economy`?</b>
    
*Note: Once you create the interactive plot, you can explore the visualization by adding/removing regions and hovering over the bubbles.*

In [None]:
figure = bubbleplot(dataset = data_2019, x_column = 'Happiness_Score', y_column = 'Trust', 
    bubble_column = 'Country', size_column = 'Economy', color_column = 'Region', 
    x_title = "Happiness Score", y_title = "Trust", title = 'Happiness Score, Trust, and Economy by Region',
    x_logscale = False, scale_bubble = 0.2, height = 600)

py.iplot(figure, config={'scrollZoom': True})

#### Well done! You have completed this tutorial. Remember to save the Answers.txt file (via File > Save File) before you close the tutorial by clicking on the "Submit" button.

<a id=11></a>
## Sources 

- https://www.kaggle.com/saduman/eda-and-data-visualization-with-seaborn 
- https://www.kaggle.com/roshansharma/world-happiness-report 
- https://www.kaggle.com/unsdsn/world-happiness 
- https://en.wikipedia.org/wiki/World_Happiness_Report 
- https://towardsdatascience.com/complete-guide-to-data-visualization-with-python-2dd74df12b5e

Sources for pictures:

- Violin plot: https://towardsdatascience.com/violin-plots-explained-fb1d115e023d