# Deadly Visualizations!!!

![Image](../images/viz_types_portada.png)

## Setup

First we need to create a basic setup which includes:

- Importing the libraries.

- Reading the dataset file (source [Instituto Nacional de Estadística](https://www.ine.es/ss/Satellite?L=es_ES&c=Page&cid=1259942408928&p=1259942408928&pagename=ProductosYServicios%2FPYSLayout)).

- Creating a couple of columns and tables for the analysis.

__NOTE:__ some functions have already been written to help you through the challenge. However, feel free to write any additional code you need.

In [2]:
# imports

import sys
import re
sys.path.insert(0, "../modules")

import numpy as np
import pandas as pd

import plotly.express as px
import cufflinks as cf
cf.go_offline()

import module as mod     # helper functions are included in module.py

In [3]:
# read dataset

deaths = pd.read_csv('../data/7947.csv', sep=';', thousands='.')

deaths.info()

#deaths["Causa de muerte"].unique()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301158 entries, 0 to 301157
Data columns (total 5 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Causa de muerte  301158 non-null  object
 1   Sexo             301158 non-null  object
 2   Edad             301158 non-null  object
 3   Periodo          301158 non-null  int64 
 4   Total            301158 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.5+ MB


In [4]:
# add some columns...you'll need them later

deaths['cause_code'] = deaths['Causa de muerte'].apply(mod.cause_code)
deaths['cause_group'] = deaths['Causa de muerte'].apply(mod.cause_types)
deaths['cause_name'] = deaths['Causa de muerte'].apply(mod.cause_name)

deaths.info()

deaths.tail()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301158 entries, 0 to 301157
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Causa de muerte  301158 non-null  object
 1   Sexo             301158 non-null  object
 2   Edad             301158 non-null  object
 3   Periodo          301158 non-null  int64 
 4   Total            301158 non-null  int64 
 5   cause_code       301158 non-null  object
 6   cause_group      301158 non-null  object
 7   cause_name       301158 non-null  object
dtypes: int64(2), object(6)
memory usage: 18.4+ MB


Unnamed: 0,Causa de muerte,Sexo,Edad,Periodo,Total,cause_code,cause_group,cause_name
301153,102 Otras causas externas y sus efectos tardíos,Mujeres,95 y más años,1984,0,102,Single cause,Otras causas externas y sus efectos tardíos
301154,102 Otras causas externas y sus efectos tardíos,Mujeres,95 y más años,1983,0,102,Single cause,Otras causas externas y sus efectos tardíos
301155,102 Otras causas externas y sus efectos tardíos,Mujeres,95 y más años,1982,0,102,Single cause,Otras causas externas y sus efectos tardíos
301156,102 Otras causas externas y sus efectos tardíos,Mujeres,95 y más años,1981,0,102,Single cause,Otras causas externas y sus efectos tardíos
301157,102 Otras causas externas y sus efectos tardíos,Mujeres,95 y más años,1980,0,102,Single cause,Otras causas externas y sus efectos tardíos
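
The helpers `mod.cause_code`, `mod.cause_name` and `mod.cause_types` live in `module.py`, which is not shown in this notebook. The following is only a minimal sketch of how such helpers might behave, assuming every `Causa de muerte` label starts with a code (`NNN` or `NNN-NNN`) followed by whitespace and the name, and that a code range corresponds to the `Multiple causes` group. The names `split_cause` and `cause_group` are hypothetical stand-ins, not the module's actual functions.

```python
import re

# Hypothetical stand-ins for the helpers in module.py (the real code is not shown).
# Assumption: each label is "<code> <name>", where <code> is "NNN" or "NNN-NNN".
CODE_PATTERN = re.compile(r"^(\d{3}(?:-\d{3})?)\s+(.*)$")


def split_cause(label):
    """Return (code, name) for a 'Causa de muerte' label."""
    match = CODE_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unexpected label format: {label!r}")
    return match.group(1), match.group(2).strip()


def cause_group(label):
    """Classify a label as a group of causes (code range) or a single cause."""
    code, _ = split_cause(label)
    return "Multiple causes" if "-" in code else "Single cause"


code, name = split_cause("059 Enfermedades cerebrovasculares")
print(code, "|", name, "|", cause_group("059 Enfermedades cerebrovasculares"))
print(cause_group("053-061 IX.Enfermedades del sistema circulatorio"))
```

Keeping the code as a string preserves the leading zeros, which matters later when codes are compared or used as filter values.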


In [5]:
# let's check the categorical variables

var_list = ['Sexo', 'Edad', 'Periodo', 'cause_code', 'cause_name', 'cause_group']

categories = mod.cat_var(deaths, var_list)
categories

Unnamed: 0,categorical_variable,number_of_possible_values,values
0,cause_code,117,"[001-102, 001-008, 001, 002, 003, 004, 005, 00..."
1,cause_name,117,"[I-XXII.Todas las causas, I.Enfermedades infec..."
2,Periodo,39,"[2018, 2017, 2016, 2015, 2014, 2013, 2012, 201..."
3,Edad,22,"[Todas las edades, Menos de 1 año, De 1 a 4 añ..."
4,Sexo,3,"[Total, Hombres, Mujeres]"
5,cause_group,2,"[Multiple causes, Single cause]"


In [6]:
# we also need to create a causes table for the analysis

causes_table = deaths[['cause_code', 'cause_name']].drop_duplicates().sort_values(by='cause_code').reset_index(drop=True)

causes_table

Unnamed: 0,cause_code,cause_name
0,001,Enfermedades infecciosas intestinales
1,001-008,I.Enfermedades infecciosas y parasitarias
2,001-102,I-XXII.Todas las causas
3,002,Tuberculosis y sus efectos tardíos
4,003,Enfermedad meningocócica
...,...,...
112,098,Suicidio y lesiones autoinfligidas
113,099,Agresiones (homicidio)
114,100,Eventos de intención no determinada
115,101,Complicaciones de la atención médica y quirúrgica


In [7]:
# And some space for free-style Pandas!!! (e.g.: df['column_name'].unique())






## Let's make some transformations

Even though the dataset is pretty clean, the information is completely denormalized, as you can see. To help with that, a collection of functions is available to generate the tables you might need:

- `row_filter(df, cat_var, cat_values)` => Filter rows by any value or group of values in a categorical variable.

- `nrow_filter(df, cat_var, cat_values)` => The inverse: exclude rows by any value or group of values in a categorical variable.

- `groupby_sum(df, group_vars, agg_var='Total', sort_var='Total')` => Sum deaths grouped by one or more variables.

- `pivot_table(df, col, x_axis, value='Total')` => Make some pivot tables, you might need them...

__NOTE:__ be aware that the filtering functions apply one filter at a time. Feel free to filter in any way you need or feel comfortable with.
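
Since `module.py` is not shown here, the sketch below illustrates what helpers with these signatures might look like in plain pandas. It is an assumption-based approximation, not the module's actual code: in particular, the descending sort in `groupby_sum` and the `sum` aggregation in `pivot_table` are guesses. The toy frame reuses a few figures that appear later in this notebook.

```python
import pandas as pd


# Hypothetical re-implementations; the real functions live in module.py (not shown).
def row_filter(df, cat_var, cat_values):
    """Keep rows whose value in cat_var is one of cat_values."""
    return df[df[cat_var].isin(cat_values)].reset_index(drop=True)


def nrow_filter(df, cat_var, cat_values):
    """Drop rows whose value in cat_var is one of cat_values."""
    return df[~df[cat_var].isin(cat_values)].reset_index(drop=True)


def groupby_sum(df, group_vars, agg_var='Total', sort_var='Total'):
    """Sum agg_var within each group, sorted in descending order (assumed)."""
    grouped = df.groupby(group_vars, as_index=False)[agg_var].sum()
    return grouped.sort_values(sort_var, ascending=False).reset_index(drop=True)


def pivot_table(df, col, x_axis, value='Total'):
    """One row per x_axis value and one column per col value."""
    return df.pivot_table(index=x_axis, columns=col, values=value, aggfunc='sum')


# Toy frame with figures taken from the tables in this notebook
toy = pd.DataFrame({
    'cause_code': ['059', '059', '046', '046'],
    'Periodo': [1980, 2018, 1980, 2018],
    'Total': [47475, 26420, 397, 21669],
})

only_2018 = row_filter(toy, 'Periodo', [2018])
not_2018 = nrow_filter(toy, 'Periodo', [2018])
by_cause = groupby_sum(toy, ['cause_code'])
wide = pivot_table(toy, 'cause_code', 'Periodo')
print(wide)
```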

In [8]:
# Example 1
'''
dataset = mod.row_filter(deaths, 'Sexo', ['Total'])
dataset = mod.row_filter(dataset, 'Edad', ['Todas las edades'])
dataset.head()
'''

In [9]:
# Example 2
'''
group = ['cause_code','Periodo']
dataset = mod.groupby_sum(deaths, group)
dataset.head()
'''

In [10]:
# Example 3
'''
dataset = mod.pivot_table(dataset, 'cause_code', 'Periodo')
dataset.head()
'''

In [11]:
'''
My code:
I want to analyse the categories of death causes. For this I will take both genders, all age groups and the year 2018.
I will represent the data as a bar plot.

Steps:
1. Identify the list of death categories
2. Filter the list by death category, gender, age and year
3. Create a plot
'''

In [12]:
# 1. Identify the list of death categories: category codes are ranges such as '053-061', so their length is greater than 3

lst_cause_cat = [i for i in deaths['cause_code'].unique() if len(i) > 3]
lst_single_cause = [i for i in deaths['cause_code'].unique() if len(i) == 3]

#print(lst_cause_cat)
#print(lst_single_cause)

In [13]:
# 2.1. Filtering

# a) death category (skip the first entry, '001-102', which is the all-causes total)
deaths_filtered = mod.row_filter(deaths, 'cause_code', lst_cause_cat[1:])

# b) gender
deaths_filtered = mod.row_filter(deaths_filtered, 'Sexo', ['Total'])

# c) age
deaths_filtered = mod.row_filter(deaths_filtered, 'Edad', ['Todas las edades'])

# d) year
deaths_filtered = mod.row_filter(deaths_filtered, 'Periodo', [2018])

deaths_filtered


Unnamed: 0,Causa de muerte,Sexo,Edad,Periodo,Total,cause_code,cause_group,cause_name
0,053-061 IX.Enfermedades del sistema circulatorio,Total,Todas las edades,2018,120859,053-061,Multiple causes,IX.Enfermedades del sistema circulatorio
1,009-041 II.Tumores,Total,Todas las edades,2018,112714,009-041,Multiple causes,II.Tumores
2,062-067 X.Enfermedades del sistema respiratorio,Total,Todas las edades,2018,53687,062-067,Multiple causes,X.Enfermedades del sistema respiratorio
3,050-052 VI-VIII.Enfermedades del sistema nerv...,Total,Todas las edades,2018,26279,050-052,Multiple causes,VI-VIII.Enfermedades del sistema nervioso y de...
4,046-049 V.Trastornos mentales y del comportam...,Total,Todas las edades,2018,22376,046-049,Multiple causes,V.Trastornos mentales y del comportamiento
5,068-072 XI.Enfermedades del sistema digestivo,Total,Todas las edades,2018,21689,068-072,Multiple causes,XI.Enfermedades del sistema digestivo
6,090-102 XX.Causas externas de mortalidad,Total,Todas las edades,2018,15768,090-102,Multiple causes,XX.Causas externas de mortalidad
7,077-080 XIV.Enfermedades del sistema genitour...,Total,Todas las edades,2018,13941,077-080,Multiple causes,XIV.Enfermedades del sistema genitourinario
8,"044-045 IV.Enfermedades endocrinas, nutricion...",Total,Todas las edades,2018,13465,044-045,Multiple causes,"IV.Enfermedades endocrinas, nutricionales y me..."
9,"086-089 XVIII.Síntomas, signos y hallazgos an...",Total,Todas las edades,2018,10088,086-089,Multiple causes,"XVIII.Síntomas, signos y hallazgos anormales c..."
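
As the note on the helper functions says, any filtering approach is fine. One alternative, sketched here on a toy stand-in for `deaths` that reuses rows shown in this notebook, is to combine all conditions into a single boolean mask. The variable names `toy`, `categories` and `mask` are illustrative only.

```python
import pandas as pd

# Toy stand-in for `deaths`, built from rows that appear in this notebook
toy = pd.DataFrame({
    'cause_code': ['009-041', '053-061', '059', '059', '102'],
    'Sexo': ['Total', 'Total', 'Total', 'Total', 'Mujeres'],
    'Edad': ['Todas las edades', 'Todas las edades', 'Todas las edades',
             'Todas las edades', '95 y más años'],
    'Periodo': [2018, 2018, 2018, 1980, 1984],
    'Total': [112714, 120859, 26420, 47475, 0],
})

categories = ['053-061', '009-041']  # category codes to keep

# All four conditions combined with & into one mask
mask = (
    toy['cause_code'].isin(categories)
    & (toy['Sexo'] == 'Total')
    & (toy['Edad'] == 'Todas las edades')
    & (toy['Periodo'] == 2018)
)

filtered = toy[mask].sort_values('Total', ascending=False).reset_index(drop=True)
print(filtered)
```

A single mask avoids creating an intermediate DataFrame per condition, at the cost of a longer expression.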


In [14]:
# Let's do some transformations to create and clean the labels:

deaths_filtered['cause_id'] = deaths_filtered['cause_name'].apply(lambda row: row.split('.')[0])
deaths_filtered['cause_name'] = deaths_filtered['cause_name'].apply(lambda row: row.replace('Enfermedades', 'Enf.' ))

deaths_filtered.head()

Unnamed: 0,Causa de muerte,Sexo,Edad,Periodo,Total,cause_code,cause_group,cause_name,cause_id
0,053-061 IX.Enfermedades del sistema circulatorio,Total,Todas las edades,2018,120859,053-061,Multiple causes,IX.Enf. del sistema circulatorio,IX
1,009-041 II.Tumores,Total,Todas las edades,2018,112714,009-041,Multiple causes,II.Tumores,II
2,062-067 X.Enfermedades del sistema respiratorio,Total,Todas las edades,2018,53687,062-067,Multiple causes,X.Enf. del sistema respiratorio,X
3,050-052 VI-VIII.Enfermedades del sistema nerv...,Total,Todas las edades,2018,26279,050-052,Multiple causes,VI-VIII.Enf. del sistema nervioso y de los órg...,VI-VIII
4,046-049 V.Trastornos mentales y del comportam...,Total,Todas las edades,2018,22376,046-049,Multiple causes,V.Trastornos mentales y del comportamiento,V
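
The same label clean-up can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a small sketch on a standalone Series holding three of the category names shown above:

```python
import pandas as pd

names = pd.Series([
    'IX.Enfermedades del sistema circulatorio',
    'II.Tumores',
    'X.Enfermedades del sistema respiratorio',
], name='cause_name')

# Roman-numeral prefix: everything before the first dot
cause_id = names.str.split('.', n=1).str[0]

# Literal (non-regex) replacement to shorten the label
short_name = names.str.replace('Enfermedades', 'Enf.', regex=False)

print(pd.DataFrame({'cause_id': cause_id, 'cause_name': short_name}))
```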


In [15]:
# 3. Create a plot

deaths_filtered.iplot(kind='bar',
                x='cause_id',
                y='Total',
                xTitle='Death cause-category',
                yTitle='Number of deaths',
                title='Deaths by cause category | 2018')

In [17]:
fig = px.bar(deaths_filtered,
        x='cause_name',
        y='Total',
        labels={'cause_name': 'Death Cause Category', 'Total': 'Number of deaths'},
        color="cause_name",
        title='Deaths by cause category | 2018')

fig.update_layout(template='plotly_dark')

fig.update_layout(
    legend=dict(
        title="Categories",        # Title for the legend
        orientation='h',           # Horizontal orientation for the legend
        x=0.5,                     # Center the legend horizontally
        y=-0.2,                    # Position the legend below the plot
        xanchor='center',          # Align the legend horizontally in the center
        yanchor='top',             # Align the legend vertically from the top
    ),
    margin=dict(b=100),  # Add some bottom margin to ensure there's space for the legend
    xaxis=dict(
        showticklabels=False  # Hide x-axis labels
    )
)

# Show the updated plot
fig.show()
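
`Total` holds raw death counts (for example 120859 for circulatory diseases in 2018). If an axis is meant to read in thousands, the values need to be scaled before plotting. A minimal sketch, where `total_thousands` is a hypothetical helper column:

```python
import pandas as pd

# Three of the 2018 category totals shown above
toy = pd.DataFrame({
    'cause_id': ['IX', 'II', 'X'],
    'Total': [120859, 112714, 53687],
})

# Hypothetical helper column: deaths in thousands, rounded to one decimal
toy['total_thousands'] = (toy['Total'] / 1000).round(1)
print(toy)
```

The scaled column can then be passed as `y` to the plotting call, with the axis title stating the unit.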

In [20]:
# 2nd exercise:
# I want to show the evolution of the top 5 causes of death

In [21]:
# 1. Find the top 5 causes of death in 2018:

# a) single death cause
deaths_all = mod.row_filter(deaths, 'cause_code', lst_single_cause)

# b) gender
deaths_all = mod.row_filter(deaths_all, 'Sexo', ['Total'])

# c) age
deaths_all = mod.row_filter(deaths_all, 'Edad', ['Todas las edades'])

# d) year
deaths_all = mod.row_filter(deaths_all, 'Periodo', [2018])


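# Take the first five rows; this relies on row_filter returning rows sorted by
# 'Total' in descending order with a fresh 0..n index (see the next cell's output)
df_top5 = deaths_all.loc[:4, :].copy()
df_top5

top5_lst = df_top5['cause_code'].tolist()
top5_lst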

['059', '067', '058', '018', '046']
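
Selecting the top rows by position only works if the frame is already sorted by `Total`. A sketch that does not depend on row order uses `DataFrame.nlargest`; the toy frame below is deliberately unsorted and uses the 2018 totals shown in this notebook.

```python
import pandas as pd

# Toy stand-in for deaths_all, deliberately unsorted (2018 figures from this notebook)
toy = pd.DataFrame({
    'cause_code': ['057', '046', '059', '018', '056', '067', '058'],
    'Total': [19142, 21669, 26420, 22153, 16631, 24665, 24399],
})

# nlargest sorts by 'Total' internally and returns the five highest rows
top5 = toy.nlargest(5, 'Total')
top5_codes = top5['cause_code'].tolist()
print(top5_codes)
```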

In [22]:
deaths_all.head(10)

Unnamed: 0,Causa de muerte,Sexo,Edad,Periodo,Total,cause_code,cause_group,cause_name
0,059 Enfermedades cerebrovasculares,Total,Todas las edades,2018,26420,059,Single cause,Enfermedades cerebrovasculares
1,067 Otras enfermedades del sistema respiratorio,Total,Todas las edades,2018,24665,067,Single cause,Otras enfermedades del sistema respiratorio
2,058 Otras enfermedades del corazón,Total,Todas las edades,2018,24399,058,Single cause,Otras enfermedades del corazón
3,"018 Tumor maligno de la tráquea, de los bronq...",Total,Todas las edades,2018,22153,018,Single cause,"Tumor maligno de la tráquea, de los bronquios ..."
4,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,2018,21669,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
5,057 Insuficiencia cardíaca,Total,Todas las edades,2018,19142,057,Single cause,Insuficiencia cardíaca
6,056 Otras enfermedades isquémicas del corazón,Total,Todas las edades,2018,16631,056,Single cause,Otras enfermedades isquémicas del corazón
7,051 Enfermedad de Alzheimer,Total,Todas las edades,2018,14929,051,Single cause,Enfermedad de Alzheimer
8,055 Infarto agudo de miocardio,Total,Todas las edades,2018,14521,055,Single cause,Infarto agudo de miocardio
9,072 Otras enfermedades del sistema digestivo,Total,Todas las edades,2018,13590,072,Single cause,Otras enfermedades del sistema digestivo


In [23]:
# 2. DataFrame with the 2018 top 5 causes for all years:

# a) single death cause
deaths_years = mod.row_filter(deaths, 'cause_code', top5_lst)

# b) gender
deaths_years = mod.row_filter(deaths_years, 'Sexo', ['Total'])

# c) age
deaths_years = mod.row_filter(deaths_years, 'Edad', ['Todas las edades'])

deaths_years


Unnamed: 0,Causa de muerte,Sexo,Edad,Periodo,Total,cause_code,cause_group,cause_name
0,059 Enfermedades cerebrovasculares,Total,Todas las edades,1981,49000,059,Single cause,Enfermedades cerebrovasculares
1,059 Enfermedades cerebrovasculares,Total,Todas las edades,1983,48331,059,Single cause,Enfermedades cerebrovasculares
2,059 Enfermedades cerebrovasculares,Total,Todas las edades,1984,47699,059,Single cause,Enfermedades cerebrovasculares
3,059 Enfermedades cerebrovasculares,Total,Todas las edades,1985,47684,059,Single cause,Enfermedades cerebrovasculares
4,059 Enfermedades cerebrovasculares,Total,Todas las edades,1980,47475,059,Single cause,Enfermedades cerebrovasculares
...,...,...,...,...,...,...,...,...
190,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1984,1203,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
191,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1983,944,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
192,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1982,499,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
193,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1981,494,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"


In [32]:
# checking data type of 'Periodo'
type(deaths_years.iloc[0,3])

numpy.int64

In [34]:
# ordering the data:
order_data = deaths_years[['Periodo', 'Total', 'cause_name']].sort_values(by=['Periodo']).reset_index(drop=True)
order_data

Unnamed: 0,Periodo,Total,cause_name
0,1980,397,"Trastornos mentales orgánicos, senil y presenil"
1,1980,8375,Otras enfermedades del sistema respiratorio
2,1980,8771,"Tumor maligno de la tráquea, de los bronquios ..."
3,1980,47475,Enfermedades cerebrovasculares
4,1980,13810,Otras enfermedades del corazón
...,...,...,...
190,2018,26420,Enfermedades cerebrovasculares
191,2018,24665,Otras enfermedades del sistema respiratorio
192,2018,24399,Otras enfermedades del corazón
193,2018,22153,"Tumor maligno de la tráquea, de los bronquios ..."
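
`order_data` is in long format (one row per year and cause), which is the shape `px.line` expects when a `color` column is given. Plotting tools that draw one line per column would instead need a wide table. A minimal sketch of that reshaping with `DataFrame.pivot`, on a toy frame using figures from the table above:

```python
import pandas as pd

# Toy long-format frame with the same columns as order_data
long_df = pd.DataFrame({
    'Periodo': [1980, 1980, 2018, 2018],
    'Total': [47475, 13810, 26420, 24399],
    'cause_name': [
        'Enfermedades cerebrovasculares', 'Otras enfermedades del corazón',
        'Enfermedades cerebrovasculares', 'Otras enfermedades del corazón',
    ],
})

# One row per year, one column per cause
wide = long_df.pivot(index='Periodo', columns='cause_name', values='Total')
print(wide)
```

`pivot` raises an error if a year and cause pair appears more than once, which doubles as a check that the filters left exactly one row per pair.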


In [35]:
# 3. Create the line plot

fig = px.line(order_data, 
              x='Periodo', 
              y='Total', 
              color='cause_name', 
              title='Evolution of the top 5 causes of death (2018 ranking)',
              labels={'Periodo': 'Year', 'Total': 'Number of Deaths'},
              markers=True)

fig.update_layout(
    legend_title='Death Cause',
    legend=dict(title='Death Cause', font=dict(size=12)))

fig.show()

In [None]:
'''
Looking at the data, 4 of the 5 top causes of death in 2018 caused far fewer deaths in 1980.
It is therefore worth checking the top 5 causes of death in 1980 as well.
'''
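
To put numbers on that observation, the 1980 and 2018 totals for the 2018 top 5 can be compared directly. The sketch below copies the figures from the tables in this notebook; the column names `total_1980`, `total_2018` and `ratio_2018_vs_1980` are illustrative.

```python
import pandas as pd

# Totals for the 2018 top 5, copied from the tables in this notebook
change = pd.DataFrame({
    'cause_code': ['059', '067', '058', '018', '046'],
    'total_1980': [47475, 8375, 13810, 8771, 397],
    'total_2018': [26420, 24665, 24399, 22153, 21669],
})

# Ratio above 1 means more deaths in 2018 than in 1980
change['ratio_2018_vs_1980'] = (change['total_2018'] / change['total_1980']).round(2)
print(change.sort_values('ratio_2018_vs_1980', ascending=False))
```

Only cerebrovascular diseases (`059`) declined between the two years; the other four causes grew.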

In [38]:
# Find the top 5 causes of death in 1980:

# a) single death cause
deaths_all2 = mod.row_filter(deaths, 'cause_code', lst_single_cause)

# b) gender
deaths_all2 = mod.row_filter(deaths_all2, 'Sexo', ['Total'])

# c) age
deaths_all2 = mod.row_filter(deaths_all2, 'Edad', ['Todas las edades'])

# d) year
deaths_all2 = mod.row_filter(deaths_all2, 'Periodo', [1980])


# As before, this relies on row_filter returning rows sorted by 'Total' (descending)
df_top5_80 = deaths_all2.loc[:4, :].copy()
df_top5_80

top5_80_lst = df_top5_80['cause_code'].tolist()
top5_80_lst

['059', '055', '057', '058', '060']

In [54]:
# Some of these causes were not in the top 5 in 2018. Let's visualize the combination of both lists:

combined_list = list(set(top5_lst + top5_80_lst))

combined_list



['055', '058', '067', '018', '046', '059', '057', '060']
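
`set` removes the duplicates but does not preserve order, so the combined list may come out in a different order on another run. If a stable order matters (for example, for legend ordering), one option is `dict.fromkeys`, sketched here with the two lists printed above:

```python
top5_lst = ['059', '067', '058', '018', '046']      # 2018, from the output above
top5_80_lst = ['059', '055', '057', '058', '060']   # 1980, from the output above

# dict keys keep insertion order, so duplicates are dropped without reshuffling
combined_ordered = list(dict.fromkeys(top5_lst + top5_80_lst))
print(combined_ordered)
```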

In [55]:
# DataFrame with the combined 1980 and 2018 top 5 causes for all years:

# a) single death cause
deaths_years2 = mod.row_filter(deaths, 'cause_code', combined_list)

# b) gender
deaths_years2 = mod.row_filter(deaths_years2, 'Sexo', ['Total'])

# c) age
deaths_years2 = mod.row_filter(deaths_years2, 'Edad', ['Todas las edades'])

deaths_years2


Unnamed: 0,Causa de muerte,Sexo,Edad,Periodo,Total,cause_code,cause_group,cause_name
0,059 Enfermedades cerebrovasculares,Total,Todas las edades,1981,49000,059,Single cause,Enfermedades cerebrovasculares
1,059 Enfermedades cerebrovasculares,Total,Todas las edades,1983,48331,059,Single cause,Enfermedades cerebrovasculares
2,059 Enfermedades cerebrovasculares,Total,Todas las edades,1984,47699,059,Single cause,Enfermedades cerebrovasculares
3,059 Enfermedades cerebrovasculares,Total,Todas las edades,1985,47684,059,Single cause,Enfermedades cerebrovasculares
4,059 Enfermedades cerebrovasculares,Total,Todas las edades,1980,47475,059,Single cause,Enfermedades cerebrovasculares
...,...,...,...,...,...,...,...,...
307,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1984,1203,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
308,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1983,944,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
309,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1982,499,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"
310,"046 Trastornos mentales orgánicos, senil y pr...",Total,Todas las edades,1981,494,046,Single cause,"Trastornos mentales orgánicos, senil y presenil"


In [57]:
# ordering the data:
order_data2 = deaths_years2[['Periodo', 'Total', 'cause_name']].sort_values(by=['Periodo']).reset_index(drop=True)
order_data2

Unnamed: 0,Periodo,Total,cause_name
0,1980,397,"Trastornos mentales orgánicos, senil y presenil"
1,1980,8375,Otras enfermedades del sistema respiratorio
2,1980,8771,"Tumor maligno de la tráquea, de los bronquios ..."
3,1980,47475,Enfermedades cerebrovasculares
4,1980,13297,Aterosclerosis
...,...,...,...
307,2018,26420,Enfermedades cerebrovasculares
308,2018,14521,Infarto agudo de miocardio
309,2018,19142,Insuficiencia cardíaca
310,2018,24665,Otras enfermedades del sistema respiratorio


In [58]:
# 3. Create the line plot

fig2 = px.line(order_data2, 
              x='Periodo', 
              y='Total', 
              color='cause_name', 
              title='Evolution of the top 5 causes of death of 1980 and 2018',
              labels={'Periodo': 'Year', 'Total': 'Number of Deaths'},
              markers=True)

fig2.update_layout(
    legend_title='Death Cause',
    legend=dict(title='Death Cause', font=dict(size=12)))

fig2.show()
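
With eight lines in one chart, it helps to know which causes belong to which ranking. A small sketch, using the two code lists printed earlier, that labels each cause by the year(s) in which it was in the top 5; the `membership` function and its label strings are illustrative:

```python
top5_2018 = {'059', '067', '058', '018', '046'}
top5_1980 = {'059', '055', '057', '058', '060'}


def membership(code):
    """Describe in which year(s) a cause code was among the top 5."""
    in_1980, in_2018 = code in top5_1980, code in top5_2018
    if in_1980 and in_2018:
        return 'top 5 in both years'
    return 'top 5 in 1980 only' if in_1980 else 'top 5 in 2018 only'


labels = {code: membership(code) for code in sorted(top5_1980 | top5_2018)}
for code, label in labels.items():
    print(code, label)
```

Such a label could be mapped onto the plotting DataFrame as an extra column and used, for instance, as a line style.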

## ...and finally, show me some insights with Plotly!!!

In [None]:
# Cufflinks histogram
'''
dataset_column.iplot(kind='hist',
                     title='VIZ TITLE',
                     yTitle='AXIS TITLE',
                     xTitle='AXIS TITLE')
'''

In [None]:
# Cufflinks bar plot
'''
dataset_bar.iplot(kind='bar',
                  x='VARIABLE',
                  xTitle='AXIS TITLE',
                  yTitle='AXIS TITLE',
                  title='VIZ TITLE')
'''

In [None]:
# Cufflinks line plot
'''
dataset_line.iplot(kind='line',
                   x='VARIABLE',
                   xTitle='AXIS TITLE',
                   yTitle='AXIS TITLE',
                   title='VIZ TITLE')
'''

In [None]:
# Cufflinks scatter plot
'''
dataset_scatter.iplot(x='VARIABLE', 
                      y='VARIABLE', 
                      categories='VARIABLE',
                      xTitle='AXIS TITLE', 
                      yTitle='AXIS TITLE',
                      title='VIZ TITLE')
'''