# Practice Day 2
## Let’s explore again, with Plotly

Skills :
- Literacy
- Reflexivity
- Design
- Create

## Preliminary - tools info

**Plotly**

The [plotly Python library](https://plotly.com/python/getting-started/) is an interactive, open-source plotting library that supports over 40 unique chart types covering a wide range of statistical, financial, geographic, scientific, and 3-dimensional use-cases.


Built on top of the Plotly JavaScript library ([plotly.js](https://plotly.com/javascript/)), plotly enables Python users to create beautiful interactive web-based visualizations that can be displayed in Jupyter notebooks, saved to standalone HTML files, or served as part of pure Python-built web applications using Dash. The plotly Python library is sometimes referred to as "plotly.py" to differentiate it from the JavaScript library.

[Plotly Python Open Source Graphing Library](https://plotly.com/python/)


### Environment setup
If using your local environment, install Plotly with `conda install -c plotly plotly`.

Test that plotly is working properly by running the following cells.


In [1]:
# Import libraries
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
# Testing Plotly
import plotly.graph_objects as go
import plotly.express as px
fig = go.Figure(data=go.Bar(y=[2, 3, 1]))
fig.show()

# Data exploration

## Step 1 : Context and data overview

Run the following cells and have a look at the data manipulation and merging

In [3]:
# Import data 
df_fertility = pd.read_csv("data/children_per_woman_total_fertility.csv")
df_gdp_capita = pd.read_csv("data/income_per_person_gdppercapita_ppp_inflation_adjusted.csv")
df_life_expectancy = pd.read_csv("data/life_expectancy_years.csv")
df_population = pd.read_csv("data/population_total.csv")

In [9]:
df_fertility.head()

Unnamed: 0,country,1800,1801,1802,1803,1804,1805,1806,1807,1808,...,2091,2092,2093,2094,2095,2096,2097,2098,2099,2100
0,Afghanistan,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,7.0,...,1.74,1.74,1.74,1.74,1.74,1.74,1.74,1.74,1.74,1.74
1,Albania,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,4.6,...,1.78,1.78,1.78,1.79,1.79,1.79,1.79,1.79,1.79,1.79
2,Algeria,6.99,6.99,6.99,6.99,6.99,6.99,6.99,6.99,6.99,...,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86,1.86
3,Angola,6.93,6.93,6.93,6.93,6.93,6.93,6.93,6.94,6.94,...,2.54,2.52,2.5,2.48,2.47,2.45,2.43,2.42,2.4,2.4
4,Antigua and Barbuda,5.0,5.0,4.99,4.99,4.99,4.98,4.98,4.97,4.97,...,1.81,1.81,1.81,1.81,1.81,1.81,1.81,1.82,1.82,1.82


In [10]:
# Subsetting data: selecting only the years 1950 to 2019 (the end of range() is excluded)
year_range = [str(x) for x in range(1950, 2020)]

# melting the table, for easier plotting
df_melted_fertility = pd.melt(df_fertility, id_vars=['country'], value_vars=year_range, var_name='year', value_name='fertility')
df_melted_fertility.head()

Unnamed: 0,country,year,fertility
0,Afghanistan,1950,7.57
1,Albania,1950,5.87
2,Algeria,1950,7.49
3,Angola,1950,7.11
4,Antigua and Barbuda,1950,4.45
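
To make the reshaping step concrete, here is a minimal sketch of `pd.melt` on a tiny table. The dataframe `df_wide` is hypothetical: its 1950 values are taken from the output above, and its 1951 values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
# The 1950 values come from the output above; the 1951 values are made up.
df_wide = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "1950": [7.57, 5.87],
    "1951": [7.50, 5.90],
})

# id_vars stay as identifier columns; every column listed in value_vars
# is unpivoted into (year, fertility) rows
df_long = pd.melt(
    df_wide,
    id_vars=["country"],
    value_vars=["1950", "1951"],
    var_name="year",
    value_name="fertility",
)
print(df_long)
```

The long format has one row per (country, year) pair, which is the shape Plotly Express expects when a column such as `year` is mapped to an axis, a color, or a facet.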


In [11]:
# melting the other tables the same way, for easier plotting
df_melted_gdp_capita = pd.melt(df_gdp_capita, id_vars=['country'], value_vars=year_range, var_name='year', value_name='gdp_capita')
df_melted_life_expectancy = pd.melt(df_life_expectancy, id_vars=['country'], value_vars=year_range, var_name='year', value_name='life_expectancy')
df_melted_population = pd.melt(df_population, id_vars=['country'], value_vars=year_range, var_name='year', value_name='population')
df_melted_population.head()

Unnamed: 0,country,year,population
0,Afghanistan,1950,7750000
1,Albania,1950,1260000
2,Algeria,1950,8870000
3,Andorra,1950,6200
4,Angola,1950,4550000


In [14]:
from functools import reduce

# compile the list of dataframes we want to merge
df_to_merge = [df_melted_fertility, df_melted_gdp_capita, df_melted_life_expectancy, df_melted_population]

# merge the tables on country and year (an inner join keeps only the pairs present in every table)
df_data_merged = reduce(lambda  left,right: pd.merge(left,right,
                                                on=['country', 'year'],
                                                how='inner'), df_to_merge)
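
The effect of `how='inner'` can be checked on a small scale. The sketch below uses three tiny tables (the names `fert`, `pop` and `life` are illustrative, and the values are copied from the outputs in this notebook). Andorra appears only in the population table, so it should disappear from the merged result.

```python
from functools import reduce

import pandas as pd

# Tiny stand-ins for the melted tables (values copied from the notebook outputs)
fert = pd.DataFrame({
    "country": ["Albania", "Angola"],
    "year": ["1950", "1950"],
    "fertility": [5.87, 7.11],
})
pop = pd.DataFrame({
    "country": ["Albania", "Andorra", "Angola"],
    "year": ["1950", "1950", "1950"],
    "population": [1260000, 6200, 4550000],
})
life = pd.DataFrame({
    "country": ["Albania", "Angola"],
    "year": ["1950", "1950"],
    "life_expectancy": [54.1, 35.2],
})

# reduce applies pd.merge pairwise: ((fert + pop) + life)
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["country", "year"], how="inner"),
    [fert, pop, life],
)
print(merged)
```

Only Albania and Angola remain, because an inner join drops every (country, year) pair that is missing from at least one table.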

In [15]:
# Adding continent information
df_continent = pd.read_csv("data/Countries-Continents.csv")

In [23]:
df2 = pd.merge(
  df_data_merged, 
  df_continent, 
  left_on='country', 
  right_on='Country',
  how="left",
)
df_merged = df2.drop(['Country'], axis=1)
df_merged.to_csv('data/merged_data.csv')

From the dataframe **df_merged**, list the files, the variables, and their type and range (where applicable) in order to get a global picture.

Tips : df.head, df.info, df.describe
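
As a rough sketch of how these tips can answer the "type and range" question, the example below works on a small sample, `df_sample`, built from the first five rows of `df_merged` as printed by `df_merged.head()` in this notebook. The same calls should apply to the full dataframe.

```python
import pandas as pd

# First five rows of df_merged, copied from the df_merged.head() output
df_sample = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Angola", "Antigua and Barbuda"],
    "year": ["1950"] * 5,
    "fertility": [7.57, 5.87, 7.49, 7.11, 4.45],
    "gdp_capita": [2390, 1780, 4640, 3180, 3470],
    "life_expectancy": [32.5, 54.1, 47.3, 35.2, 58.5],
    "population": [7750000, 1260000, 8870000, 4550000, 45500],
    "Continent": ["Asia", "Europe", "Africa", "Africa", "North America"],
})

# Type of each variable
print(df_sample.dtypes)

# Range of each numerical variable: keep only the min and max rows of describe()
value_range = df_sample.describe().loc[["min", "max"]]
print(value_range)

# For a categorical variable, the "range" is its set of categories
print(df_sample["Continent"].value_counts())
```

Note that `year` is stored as text here, so `describe()` leaves it out of the numerical summary.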

In [18]:
# Have a look at the variables
df_merged.head()
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12880 entries, 0 to 12879
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          12880 non-null  object 
 1   year             12880 non-null  object 
 2   fertility        12880 non-null  float64
 3   gdp_capita       12880 non-null  int64  
 4   life_expectancy  12880 non-null  float64
 5   population       12880 non-null  int64  
 6   Continent        12880 non-null  object 
dtypes: float64(2), int64(2), object(3)
memory usage: 805.0+ KB


Is it the same starting point as last week? Do we have more or fewer dimensions in the data?

_my answer_

Read a bit about the [indicators explanations](https://www.gapminder.org/data/documentation/)

Given the variables available, what questions come to mind?

What kind of relationships do you anticipate or are you curious about?

_my answer_

## Step 2 : Exploring each variable

Get a better look at each individual variable alone.

Let’s do this for **one numerical variable**. 

With this data here, we have more dimensions than last week _(for each country, for each year, we have values that can be investigated)_.
So we can both look into **changes over time**, and **distribution for a specific year**.

### Distribution

Let's have a look at the distribution of population for specific years: **1950** and **2010**.

First, select only the year 1950.

Tips
- Filter a dataframe: `df[df[x] == 'some value']`
- look for **histogram chart** in the [Plotly Express gallery](https://plotly.com/python/plotly-express/#gallery)
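
One detail worth keeping in mind when filtering: after `pd.melt`, the `year` column holds text labels such as `"1950"`, so a comparison with the integer `1950` would not match them. The sketch below shows two ways to filter, on a hypothetical `df_demo` whose 2010 values are made up.

```python
import pandas as pd

# Hypothetical long table; the 2010 fertility values are made up for illustration
df_demo = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "year": ["1950", "1950", "2010", "2010"],
    "fertility": [7.57, 5.87, 6.0, 1.7],
})

# Option 1: compare with a string, since the years are stored as text
df_demo_1950 = df_demo[df_demo["year"] == "1950"]

# Option 2: convert the column to numbers first, then compare with an integer
year_num = pd.to_numeric(df_demo["year"])
df_demo_2010 = df_demo[year_num == 2010]

print(df_demo_1950)
print(df_demo_2010)
```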

In [33]:
# select one year
df_1950=df_merged[df_merged["year"]=="1950"]
df_2010=df_merged[df_merged["year"]=="2010"]

In [25]:
import plotly.express as px

In [32]:
# create a distribution plot for population in 1950
fig = px.histogram(df_1950, x="population", marginal="rug",
                   hover_data=df_1950.columns)
fig.show()

In [34]:
# create a distribution plot for population in 2010
fig = px.histogram(df_2010, x="population", marginal="rug",
                   hover_data=df_2010.columns)
fig.show()

### Changes over time

Let’s start with a numerical variable that still has a meaning when it is summed across one dimension (here, summing the countries' data for each year). Let’s look at the numerical variable **population over time**, summed for all countries.

In [40]:
# Let’s look at the numerical variable population over time summed for all countries
# With x="year" and y="population", each bar is the sum of population for that year
fig = px.histogram(df_merged, x="year", y="population")
fig.show()

Now, let’s see how we can still show the **evolution over time** of another variable that doesn’t make sense when it is summed: life expectancy.

When a `y` value is given, a histogram sums the values in each bin by default. To choose the aggregation function explicitly, we use the **go.Histogram** class from **plotly.graph_objects** and set its `histfunc` argument.

Here we want to see the evolution of the **average** of life expectancy for all countries.
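
To see what this average amounts to, the same aggregation can be sketched in pandas with `groupby`. The dataframe `df_demo` below is hypothetical: the 1950 values come from the notebook outputs, and the 1951 values are made up.

```python
import pandas as pd

# Hypothetical long table; the 1951 life expectancy values are made up
df_demo = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Afghanistan", "Albania"],
    "year": ["1950", "1950", "1951", "1951"],
    "life_expectancy": [32.5, 54.1, 33.0, 55.0],
})

# One row per year, holding the mean over all countries for that year
avg_by_year = df_demo.groupby("year", as_index=False)["life_expectancy"].mean()
print(avg_by_year)
```

This is an unweighted mean: every country counts equally, whatever its population. The resulting table could also be passed to a line or bar chart as an alternative to a histogram with `histfunc="avg"`.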

In [41]:
fig = go.Figure(
    data=go.Histogram(
        histfunc="avg", 
        x=df_merged['year'],
        y=df_merged['life_expectancy'],
    )
)

fig.show()

In [42]:
# Can you do the same for the fertility rate ?
fig = go.Figure(
    data=go.Histogram(
        histfunc="avg", 
        x=df_merged['year'],
        y=df_merged['fertility'],
    )
)

fig.show()

Then, let’s visualise the fertility rate, on both dimensions : 
- for each country and 
- for each year.

What kind of graph can show an evolution for many entries ?

In [61]:
# Plot the fertility rate on both dimensions: one line per country, across the years
fig = px.line(df_merged, x="year", y="fertility", line_group="country",
              hover_name="country")
fig.show()

In [57]:
# Average fertility per year, split by continent
fig = px.histogram(df_merged, histfunc="avg", y="fertility", x="year", color="Continent", hover_data=df_merged.columns)
fig.show()

What if you would like to see the evolution of the fertility rate over time for **only one country**, let’s say France? How would you do it?

In [52]:
# Plot the evolution of fertility rate over time for only one country
df_France=df_merged[df_merged["country"]=="France"]
fig = go.Figure(
    data=go.Histogram(
        histfunc="avg", 
        x=df_France['year'],
        y=df_France['fertility'],
    )
)

fig.show()

## Step 3 : Exploring variables relationships

This time, let's start with exploring the **relationship between a numerical variable and a categorical one**.

Let's go back to the previous chart representing the fertility rate on both dimensions for all countries.

Let's try to make more sense of it by **adding a visual variable: the color**, used to encode the **continent** of each country.

In [59]:
# Plot the fertility rate, on both dimensions for each country and for each year
# Adding color: one line per country, colored by continent
fig = px.line(df_merged, x="year", y="fertility", line_group="country", color="Continent", hover_name="country")
fig.show()

In [67]:
fig = px.density_heatmap(df_merged,y="country",x="year",z="fertility")
fig.show()

It's still very much a spaghetti chart.

So let's plot small multiples: the same chart as before, repeated for each continent present in the dataframe.

Have a look at the documentation on [faceting with Plotly](https://plotly.com/python/facet-plots/)

In [91]:
fig = px.line(df_merged, x="year", y="fertility", line_group="country", color="Continent")
fig.show()

In [70]:
# Using small multiples or "facets": one panel per continent, one line per country
fig = px.line(df_merged, x="year", y="fertility", line_group="country", facet_col="Continent", facet_col_wrap=2)
fig.show()

In [74]:
fig = px.scatter(df_merged, x="fertility", y="year", color="Continent", facet_col="Continent", facet_col_wrap=2)
fig.show()

**Discovering the parallel coordinate plot**


This chart type is not widespread, but it is very useful for getting an overview of the distribution of many numerical variables: the [Parallel Coordinates plot](https://www.data-to-viz.com/graph/parallel.html).

The best practice when there are many lines (here 1 line = 1 country) is to set the opacity below 100% so that the most overlapped areas stand out. This feature is not available in Plotly.

In [72]:
# Let's see the data in the year 1950
fig = px.parallel_coordinates(df_merged.loc[df_merged['year'] == '1950'])
fig.show()

In [73]:
# Can you do the same for 2019 ? 
fig = px.parallel_coordinates(df_merged.loc[df_merged['year'] == '2019'])
fig.show()

Do you learn something from comparing the two graphs ?

_my answer_

Let’s have a look at the relationship between **fertility rate** and **gdp per capita**. 

What would be your first idea about this?

What do you expect this relationship to look like?

_my answer_

In order to fix one dimension before charting, let’s say we want to see the relationship between **fertility rate** and **gdp per capita** first in **1950** and then in **2019**.

What types of charts can we make to investigate a correlation? (See [From data to viz](https://www.data-to-viz.com/))

Tips
- `df[df[x] == 'some value']`
- `px.scatter`

In [83]:
df_merged.head()

Unnamed: 0,country,year,fertility,gdp_capita,life_expectancy,population,Continent
0,Afghanistan,1950,7.57,2390,32.5,7750000,Asia
1,Albania,1950,5.87,1780,54.1,1260000,Europe
2,Algeria,1950,7.49,4640,47.3,8870000,Africa
3,Angola,1950,7.11,3180,35.2,4550000,Africa
4,Antigua and Barbuda,1950,4.45,3470,58.5,45500,North America


In [89]:
# Make a chart representing the relationship between those 2 variables in 1950
fig = px.scatter(df_1950, y="fertility", x="gdp_capita", color="country")
fig.show()


Let’s improve this chart a bit by showing the name of the country on hover and using a log axis for gdp per capita.

Tips
- in the chart option add a parameter : `hover_data=['variable']`
- in the chart option add a parameter : `log_x=True`

In [24]:
# Add hover, and log axis for gdp per capita.


Now, let’s add more context with the country’s **population**.

- Can you see a way to add this variable to the chart ?
- What other visual variables are available for us to encode ?
- From the ones available, which one would be easier to see ?

In [25]:
# Add the population data in the chart


In [26]:
# You have plotted this for the year 1950. Can you do it again for 2019 ?


Have a look at the situation in 1950 then 2019.

Does that match the idea you had before ?

Do you see a lot of difference between 1950 and 2019 ?

_my answer_

It’s quite a long time between 1950 and 2019. Let’s see if we can have a view every 10 years and plot those charts next to each other.

For convenience, the dataframe `df_selected_years` created below keeps only the selected years.

In [27]:
selected_years = ['1950', '1960', '1970', '1980', '1990', '2000', '2010', '2019']
df_selected_years = df_merged.loc[df_merged['year'].isin(selected_years)]


Let's plot a small multiple : the same chart we did before but for each year present in the list.

Have a look at the documentation on [faceting with Plotly](https://plotly.com/python/facet-plots/).

In [28]:
# Let's plot a small multiple

That’s interesting. Another nice feature of Plotly that could be tried here is animation.

Use the [animation feature of plotly express](https://plotly.com/python/animations/) to see the evolution over time in a single chart

In [29]:
# Use the animation feature of plotly express to see the evolution over time in a single chart


In [30]:
# Let’s make our own connected scatter plot

# Let’s say we want to plot it for France
# .copy() gives an independent dataframe, so the assignment below
# does not trigger the pandas SettingWithCopyWarning
df_country = df_merged[df_merged['country'] == 'France'].copy()
# convert the "year" column from text to numbers
df_country["year"] = pd.to_numeric(df_country["year"])



In [31]:
# Because it’s kind of a weird chart, we will use the plotly.graph_objects module 
# to build a connected scatterplot that we can customize
fig = go.Figure(data=go.Scatter(
    x=df_country['gdp_capita'],
    y=df_country['fertility'],
    mode='lines+markers'
))

fig.show()

In [32]:
# Starting from the basic chart provided, let’s add more information so that it’s more readable
# - Make the dots (called markers) bigger
# - Encode the dot color by year, to better see the passing of time and its direction
# - Show this color scale
# - Reduce the line weight and choose a less flashy color (so it does not interfere with the colors used for the years)


In [33]:
# Make it again for another country


In [34]:
# Any other things you would like to explore ... go !