<center><img src="https://github.com/Magallanes-at-UTDT/DataViz_shortTalk_1/blob/main/LogoTaller_viz.png?raw=true" width="1000"></center>


# **THE BASICS**

We will cover the foundations for preparing a _viz_. Let's check some basic concepts here:


# **HANDS-ON!**

## The Data:

We will use:
- Covid deaths in Perú [link](https://www.datosabiertos.gob.pe/dataset/fallecidos-por-covid-19-ministerio-de-salud-minsa)
- Covid cases in Perú [link](https://www.datosabiertos.gob.pe/dataset/casos-positivos-por-covid-19-ministerio-de-salud-minsa)

I saved the files as *fallecidos_covid* and *positivos_covid*, both in _.csv_ format.

In [None]:
# !pip install unidecode

In [None]:
dataLink1="https://short-talks.s3.sa-east-1.amazonaws.com/fallecidos_covid.csv"
dataLink2="https://short-talks.s3.sa-east-1.amazonaws.com/positivos_covid.csv"

# opening the files
import pandas as pd
covid_f=pd.read_csv(dataLink1,sep=";")
covid_p=pd.read_csv(dataLink2,sep=";")

# normalize column names: drop whitespace, then strip accents
from unidecode import unidecode
covid_f.columns=[unidecode(col) for col in covid_f.columns.str.replace(r'\s','',regex=True)]
covid_p.columns=[unidecode(col) for col in covid_p.columns.str.replace(r'\s','',regex=True)]

# see what Python understood
covid_f.info(),covid_p.info();
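
To see what the column-name normalization does, here is a minimal sketch on made-up headers. It uses the standard library's `unicodedata` as a stand-in for `unidecode` (an assumption for illustration: it only strips accents, while `unidecode` also transliterates other scripts). The header names and the helper `normalize_name` are hypothetical, not taken from the files.

```python
import re
import unicodedata

def normalize_name(name: str) -> str:
    """Drop whitespace, then strip accents (stdlib stand-in for unidecode)."""
    no_spaces = re.sub(r"\s", "", name)
    decomposed = unicodedata.normalize("NFKD", no_spaces)
    # keep base characters, drop the combining accent marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# hypothetical raw headers, not the real ones from the files
raw_columns = ["FECHA_ FALLECIMIENTO", "CLASIFICACIÓN_DEF", "EDAD_DECLARADA"]
clean_columns = [normalize_name(col) for col in raw_columns]
print(clean_columns)
```

Clean names matter here because the rest of the notebook accesses columns as attributes (for example `covid_f.EDAD_DECLARADA`).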

The **plan** is to see how covid developed over time, so we first make sure the data is suited for that.

In [None]:
# making sure key columns are complete (.copy() avoids chained-assignment warnings later)
covid_p=covid_p[~(covid_p.FECHA_RESULTADO.isnull()|covid_p.EDAD.isnull()|covid_p.PROVINCIA.isnull())].copy()
covid_f=covid_f[~(covid_f.FECHA_FALLECIMIENTO.isnull()|covid_f.EDAD_DECLARADA.isnull()|covid_f.PROVINCIA.isnull())].copy()

# format dates the right way
covid_p['FECHA_RESULTADO']=covid_p.FECHA_RESULTADO.astype(int) # YYYYMMDD as integers
covid_p['FECHA_RESULTADO']=pd.to_datetime(covid_p.FECHA_RESULTADO, format='%Y%m%d')
covid_f['FECHA_FALLECIMIENTO']=pd.to_datetime(covid_f.FECHA_FALLECIMIENTO, format='%Y%m%d')

# extract information from dates
covid_p[['month_test','year_test']]=[(day.month,day.year) for day in covid_p['FECHA_RESULTADO']]
covid_f[['month_test','year_test']]=[(day.month,day.year) for day in covid_f['FECHA_FALLECIMIENTO']]

# binning ages
TheBins=[0,40,60,200]
TheBins_Labels=[1,2,3]
covid_f['EDADgrupo']=pd.cut(covid_f.EDAD_DECLARADA,bins=TheBins,labels=TheBins_Labels)
covid_p['EDADgrupo']=pd.cut(covid_p.EDAD,bins=TheBins,labels=TheBins_Labels)

# subset the data
columnsToGet=['PROVINCIA','EDADgrupo','SEXO', 'month_test','year_test']
yearsNeeded=[2020,2021]
covid_p_monthly=covid_p.loc[:,columnsToGet]
covid_f_monthly=covid_f.loc[:,columnsToGet]
covid_p_monthly=covid_p_monthly[covid_p_monthly.year_test.isin(yearsNeeded)]
covid_f_monthly=covid_f_monthly[covid_f_monthly.year_test.isin(yearsNeeded)]

# getting rid of incomplete cases
covid_p_monthly=covid_p_monthly[~covid_p_monthly.PROVINCIA.isin(['EN INVESTIGACIÓN'])]
covid_f_monthly=covid_f_monthly[~covid_f_monthly.PROVINCIA.isin(['EN INVESTIGACIÓN'])]

covid_f_monthly['cases_f']=1 # flag for arithmetic
covid_p_monthly['cases_p']=1 # flag for arithmetic

You now have two dataframes that look like this:

In [None]:
covid_f_monthly
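
The date parsing and age binning used in the preparation can be checked on toy values. This is a small sketch, assuming dates arrive as `YYYYMMDD` text; the column names `FECHA`, `EDAD`, `grupo_default` and `grupo_lowest` are made up for the example. It also shows a detail of `pd.cut` worth knowing: with the default settings the first interval is open on the left, so an age of exactly 0 gets no group.

```python
import pandas as pd

# toy values (assumptions), dates stored as YYYYMMDD text
toy = pd.DataFrame({"FECHA": ["20200315", "20211201"], "EDAD": [0, 61]})
toy["FECHA"] = pd.to_datetime(toy["FECHA"], format="%Y%m%d")
toy["month"] = toy["FECHA"].dt.month   # vectorized alternative to a list comprehension
toy["year"] = toy["FECHA"].dt.year

bins, labels = [0, 40, 60, 200], [1, 2, 3]
# default intervals are (0, 40], (40, 60], (60, 200]: age 0 falls outside
toy["grupo_default"] = pd.cut(toy["EDAD"], bins=bins, labels=labels)
# include_lowest=True closes the first interval on the left: [0, 40]
toy["grupo_lowest"] = pd.cut(toy["EDAD"], bins=bins, labels=labels, include_lowest=True)
print(toy)
```

Whether infants should be kept is a modeling decision; the point is only that the default silently turns them into missing values.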

## <div class="alert alert-success" role="alert">Get Tidy, Decide what to Encode, Choose a Mark for the encodings</div>

Those are the three basics. Let's follow them, and play with the results:

* **Get Tidy**
  
The dataframe is in **wide shape** (each row is a case); we need a **long shape** (a **TIDY** format), with one row per combination of the grouping variables and its count. Let me sum the cases by all the possible grouping variables:

In [None]:
covid_f_monthly_tidy=covid_f_monthly.groupby(by=columnsToGet,observed=True,as_index=False)['cases_f'].agg('sum') #using flag
covid_p_monthly_tidy=covid_p_monthly.groupby(by=columnsToGet,observed=True,as_index=False)['cases_p'].agg('sum') #using flag

Let's check the tidy data frame, sorted by the number of cases:

In [None]:
covid_p_monthly_tidy.sort_values(['cases_p'], ascending=[False])

Tidy data will generally have more rows. The first row tells you the number of male cases in Lima in March 2021 for that age group.
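
The counting step can be sketched on a handful of made-up cases. The values below are assumptions for illustration; only the pattern (a flag column of ones, summed per group) mirrors the cells above.

```python
import pandas as pd

# four hypothetical individual cases, one per row
cases = pd.DataFrame({
    "PROVINCIA": ["LIMA", "LIMA", "LIMA", "CUSCO"],
    "month_test": [3, 3, 4, 3],
    "cases_p": 1,  # flag for arithmetic
})

# one row per (PROVINCIA, month_test) combination, with its count
tidy = cases.groupby(["PROVINCIA", "month_test"], as_index=False)["cases_p"].agg("sum")
tidy = tidy.sort_values("cases_p", ascending=False)
print(tidy)
```

A quick sanity check is that the counts add up to the number of original rows.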

Now, let's prepare our plot. First, install (if needed) and import the library we will use:

In [None]:
# !pip install altair -U

In [None]:
# !pip install "vegafusion-jupyter[embed]"

In [None]:
import altair as alt
alt.data_transformers.enable("vegafusion")

Now, read the data frame into the library:

In [None]:
# the data
ALT_covid_p=alt.Chart(covid_p_monthly_tidy)

* **Decide what to encode**

In [None]:
ENCO_covid_p=ALT_covid_p.encode(
                                alt.Y('cases_p')# column 'cases_p' on the vertical
)

* **Choose a mark for the encodings**

In [None]:
ENCO_covid_p.mark_point()

Altair gave us exactly what was requested, but the resulting viz is far from informative. While we keep using the same data, we will make changes to the encodings and marks.

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('cases_p') # on the horizontal
                               )
ENCO_covid_p.mark_point()

We could use a boxplot to represent the encodings.

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('cases_p',
                                      scale=alt.Scale(type='log'))) # rescale
ENCO_covid_p.mark_boxplot()

What if we sum?

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('sum(cases_p)')
                               )
ENCO_covid_p.mark_point()

As expected, we got one point that represents:

In [None]:
covid_p_monthly_tidy.cases_p.sum()

Let's encode these values on both axes:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p)'),
                                alt.X('year_test')
                               )
ENCO_covid_p.mark_point()
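
As a cross-check, the aggregate shorthand can be reproduced in pandas. This sketch assumes, as I understand Vega-Lite's behavior, that an aggregated field such as `sum(cases_p)` is computed once per value of the other encoded fields (here `year_test`). The numbers are made up.

```python
import pandas as pd

# hypothetical tidy counts
tidy = pd.DataFrame({
    "year_test": [2020, 2020, 2021],
    "month_test": [3, 4, 3],
    "cases_p": [5, 7, 11],
})

# what Y='sum(cases_p)' against X='year_test' should plot: one value per year
by_year = tidy.groupby("year_test")["cases_p"].sum()
# what 'sum(cases_p)' alone plots: a single grand total
total = tidy["cases_p"].sum()
print(by_year, total)
```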

Something basic to remember is the data types: **Q**uantitative, **O**rdinal, and **N**ominal. If we specify those in _Altair_, results may be better:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('year_test:O')
                               )
ENCO_covid_p.mark_point()

In [None]:
# or
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O')
                               )
ENCO_covid_p.mark_point()

Keep your **audience** in mind: some audiences are used to particular marks for their usual encodings. However, some encoding-mark combinations are poor choices. Points encode counts by **position**: the higher the point, the more cases. What if we use **color** instead, where the higher the count, the darker the _hue_?

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('month_test:O').title('meses'),
                               alt.Color('sum(cases_p):Q').title('Conteo'))

ENCO_covid_p.mark_rect().properties(height=150)

When color represents _density_, keep color-blindness in mind when choosing a [color map](https://vega.github.io/vega/docs/schemes/).

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('month_test:O').title('meses'),
                               alt.Color('sum(cases_p):Q',scale=alt.Scale(scheme='goldgreen')).title('Conteo'))
ENCO_covid_p.mark_rect().properties(height=150)

It is very safe to use greys!

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('month_test:O').title('meses'),
                               alt.Color('sum(cases_p):Q',scale=alt.Scale(scheme='greys')).title('Conteo'))
ENCO_covid_p.mark_rect().properties(height=150)

Bars are not better than points, especially if their width does not encode anything.

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O')
                               )
ENCO_covid_p.mark_bar()

So, thin lines whose length encodes the value (rules) are a better choice than bars, and a good alternative to points:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O')
                               )
ENCO_covid_p.mark_rule()

But for data over time, a line is in general the best choice:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O')
                               )
ENCO_covid_p.mark_line()

We have months for two years, so we could split the previous plot:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O'),
                                alt.Color('year_test')
                               )
ENCO_covid_p.mark_line()

Altair treated year as quantitative, which is the wrong data type. Let's try ordinal:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O'),
                                alt.Color('year_test:O')
                               )
ENCO_covid_p.mark_line()

In this case, the saturation did not help. If we use year as nominal, two hues will be used:

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O'),
                                alt.Color('year_test:N')
                               )
ENCO_covid_p.mark_line()

Can this heatmap be a better choice?

In [None]:
ENCO_covid_p=ALT_covid_p.encode(alt.X('month_test:O').title('meses'),
                                alt.Y('year_test:N',scale=alt.Scale(reverse=True)).title('año'),
                               alt.Color('sum(cases_p)').title('Conteo'))

ENCO_covid_p.mark_rect().properties(height=150)

<div class="alert alert-danger">
  Once we have decided the mark for the basic encodings, we can start making our viz <strong>more complex!</strong>
</div>

We could prepare this for positive cases:

In [None]:
# faceting by row and column

ENCO_covid_p=ALT_covid_p.encode(alt.Y('sum(cases_p):Q'),
                                alt.X('month_test:O'),
                                alt.Color('year_test:N'),
                                alt.Row('SEXO:N'),
                                alt.Column('EDADgrupo:N', title="POSITIVOS (por grupo etareo)")
                               )
ENCO_covid_p.mark_line()

And this one for deaths:

In [None]:
# the data
ALT_covid_f=alt.Chart(covid_f_monthly_tidy)
ENCO_covid_f=ALT_covid_f.encode(alt.Y('sum(cases_f):Q'),
                                alt.X('month_test:O'),
                                alt.Color('year_test:N'),
                                alt.Row('SEXO:N'),
                                alt.Column('EDADgrupo:N', title="FALLECIDOS (por grupo etareo)")
                               )
ENCO_covid_f.mark_line()

The previous plots suggest only a mild difference between men and women. If that is the case, we may omit sex from our facets, and proceed to concatenate cases and deaths:

In [None]:
columnsToGet=['PROVINCIA','EDADgrupo','month_test','year_test'] # no SEX

# redoing the DF
covid_p_monthly=covid_p.loc[:,columnsToGet]
covid_f_monthly=covid_f.loc[:,columnsToGet]
covid_p_monthly=covid_p_monthly[covid_p_monthly.year_test.isin([2020,2021])]
covid_f_monthly=covid_f_monthly[covid_f_monthly.year_test.isin([2020,2021])]
covid_p_monthly=covid_p_monthly[~covid_p_monthly.PROVINCIA.isin(['EN INVESTIGACIÓN'])]
covid_f_monthly=covid_f_monthly[~covid_f_monthly.PROVINCIA.isin(['EN INVESTIGACIÓN'])]

covid_f_monthly['fallecidos']=1
covid_f_monthly_tidy=covid_f_monthly.groupby(by=columnsToGet,observed=True,as_index=True)['fallecidos'].agg('sum')
covid_p_monthly['positivos']=1
covid_p_monthly_tidy=covid_p_monthly.groupby(by=columnsToGet,observed=True,as_index=True)['positivos'].agg('sum')

# a wide DF from two tidy DF
covid_p_f_monthly_wide=pd.concat([covid_p_monthly_tidy,covid_f_monthly_tidy],ignore_index=False,axis=1)
covid_p_f_monthly_wide

Notice that this concatenation created missing values (very common), because some groups have cases but no deaths. So we fill them with zeros:

In [None]:
covid_p_f_monthly_wide=pd.concat([covid_p_monthly_tidy,covid_f_monthly_tidy],ignore_index=False,axis=1).fillna(0)
covid_p_f_monthly=covid_p_f_monthly_wide.reset_index()

# this is not tidy, but it is a common layout
covid_p_f_monthly

And this is the tidiest:

In [None]:
covid_p_f_monthly_tidy=covid_p_f_monthly_wide.melt(ignore_index=False).reset_index()
covid_p_f_monthly_tidy
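
The wide-then-tidy steps above can be replayed on toy data. This is a sketch with made-up numbers: two indexed Series stand in for the grouped counts, and one province deliberately has no deaths so that the concatenation produces a missing value.

```python
import pandas as pd

idx = pd.Index(["LIMA", "CUSCO"], name="PROVINCIA")
positivos = pd.Series([50, 8], index=idx, name="positivos")
fallecidos = pd.Series([4], index=idx[:1], name="fallecidos")  # no row for CUSCO

# axis=1 aligns on the index; groups absent from one Series become NaN
wide = pd.concat([positivos, fallecidos], axis=1).rename_axis("PROVINCIA")
n_missing = int(wide["fallecidos"].isna().sum())
wide = wide.fillna(0)

# melt stacks the two count columns into 'variable' and 'value'
tidy = wide.melt(ignore_index=False).reset_index()
print(tidy)
```

Filling with zero is reasonable here because a missing group means no recorded events, not an unknown quantity.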

Let's simplify what we did earlier, now that we have the deaths and cases in one dataframe:

In [None]:
ALT_data=alt.Chart(covid_p_f_monthly_tidy)
ENC_data=ALT_data.encode(
    alt.X('month_test:O'),
    alt.Y('sum(value)',scale=alt.Scale(type="log")),
    alt.Color('year_test:N'),
    alt.Row('variable'),
    alt.Column('EDADgrupo:N',title="Casos y Muertes por grupo etareo"),
    tooltip=['sum(value)']
).properties(width=200,height=200)

ENC_data.mark_line()

We can keep comparing deaths and cases with other marks:

In [None]:
ENC_data=ALT_data.encode(alt.Y('sumValues:Q',scale=alt.Scale(type="symlog")),
                         alt.Column('year_test:N'),
                         alt.X('EDADgrupo:O'),
                         alt.Row('variable'),
                          tooltip=['PROVINCIA','sumValues:Q']
                         ).transform_aggregate(sumValues='sum(value):Q',
                                               groupby=["PROVINCIA",'EDADgrupo','year_test','variable']
                                               )

ENC_data.mark_boxplot().resolve_scale(y='independent').properties(width=200,height=200)


The wide data is useful for scatterplot-like viz:

In [None]:
ALT_data=alt.Chart(covid_p_f_monthly)

ENC_data=ALT_data.encode(alt.X('sum(positivos)'),
                         alt.Y('sum(fallecidos)'),
                         alt.Column('year_test:N'),
                         tooltip=['PROVINCIA']
                         ).properties(width=200,height=200)

ENC_data.mark_circle()

In [None]:
ENC_data=ALT_data.encode(alt.X('sum(positivos)',scale=alt.Scale(type="symlog")),
                         alt.Y('sum(fallecidos)',scale=alt.Scale(type="symlog")),
                         alt.Column('year_test:N'),
                         tooltip=['PROVINCIA']
                         ).properties(width=200,height=200)

ENC_data.mark_circle()

We can split the previous viz without adding more facets:

In [None]:
ENC_data=ALT_data.encode(alt.X('sum(positivos)'),
                         alt.Y('sum(fallecidos)'),
                         alt.Column('year_test:N'),
                         alt.Color('EDADgrupo:N'),
                         tooltip=['PROVINCIA']
                         ).properties(width=200,height=200)

ENC_data.mark_circle()

Let's save the tidy version.

In [None]:
# covid_p_f_monthly_tidy.to_csv("covid_p_f_monthly_tidy.csv",index=False)

## <div class="alert alert-success" role="alert">Basics of Mapping data</div>

Let's open a shapefile with the provinces of Peru.

In [None]:
linkMap="https://github.com/Magallanes-at-UTDT/DataViz_shortTalk_1/raw/main/map/PROVINCIAS.shp"

import geopandas as gpd

mapaProv=gpd.read_file(linkMap)

We have a **GeoDF**:

In [None]:
mapaProv.info()

We will merge our *covid_p_f_monthly_tidy* **into** *mapaProv*. This requires that both "PROVINCIA" columns hold exactly the same names.

In [None]:
# in the GeoDF, but not in the DF
onlyInGeoDF=sorted(set(mapaProv.PROVINCIA)-set(covid_p_f_monthly_tidy.PROVINCIA))
# in the DF, but not in the GeoDF
onlyInDF=sorted(set(covid_p_f_monthly_tidy.PROVINCIA)-set(mapaProv.PROVINCIA))

# these are the changes needed
# (names are paired by sorted position: check that each pair really is the same province)
changesMap={geo:df for geo,df in zip(onlyInGeoDF,onlyInDF)}
changesMap

Then, you make the changes needed:

In [None]:
mapaProv.replace({'PROVINCIA':changesMap}, inplace=True)
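
Pairing the two sorted lists by position works only when both have the same length and the spellings sort alike. As a hedged alternative, here is a sketch that pairs names by similarity using the standard library's `difflib`. The province spellings, the names `only_in_geo` and `only_in_df`, and the `cutoff` value are assumptions for illustration, not results from the real data.

```python
import difflib

# hypothetical spellings, for illustration only
only_in_geo = ["ANTONIO RAIMONDI", "NASCA"]
only_in_df = ["ANTONIO RAYMONDI", "NAZCA"]

changes = {}
for geo_name in only_in_geo:
    # best candidate above the similarity cutoff, if any
    match = difflib.get_close_matches(geo_name, only_in_df, n=1, cutoff=0.7)
    if match:  # leave unmatched names for manual review
        changes[geo_name] = match[0]
print(changes)
```

Either way, the resulting dictionary should be inspected by eye before it is applied with `replace`.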

You may need your DF in a wide format for an easy merge (and a smaller file size):

In [None]:
covid_p_f_yearly_tidy=covid_p_f_monthly_tidy.groupby(by=['PROVINCIA','year_test','variable'],as_index=False)['value'].agg('sum')
covid_p_f_yearly_tidy.head(20)

In [None]:
covid_p_f_yearly=covid_p_f_yearly_tidy.pivot(index='PROVINCIA', columns=['year_test','variable'], values='value')
covid_p_f_yearly.head()

Notice the multi-index:

In [None]:
covid_p_f_yearly.columns

In [None]:
covid_p_f_yearly.columns=['_'.join((element[1],str(element[0]))) for element  in covid_p_f_yearly.columns]
covid_p_f_yearly.reset_index(inplace=True)
covid_p_f_yearly.head()
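
The pivot and the flattening of the two-level column index can be sketched on a single made-up province. The values are assumptions; the column names follow the cells above.

```python
import pandas as pd

tidy = pd.DataFrame({
    "PROVINCIA": ["LIMA"] * 4,
    "year_test": [2020, 2020, 2021, 2021],
    "variable": ["positivos", "fallecidos"] * 2,
    "value": [10, 1, 20, 2],
})

wide = tidy.pivot(index="PROVINCIA", columns=["year_test", "variable"], values="value")
# each column label is now a (year, variable) tuple; join it into one string
wide.columns = [f"{variable}_{year}" for year, variable in wide.columns]
wide = wide.reset_index()
print(wide)
```

Flat string names such as `positivos_2021` are what allow the attribute-style access used later on the map.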

Now we can merge:

In [None]:
mapaProvCovid=mapaProv.merge(covid_p_f_yearly, on='PROVINCIA', how='inner')

mapaProvCovid
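
An inner merge silently drops any province whose name does not match on both sides. A minimal sketch of how to audit that, using plain pandas DataFrames with made-up names and counts (the same `merge` arguments are assumed to apply to a GeoDF, since it inherits from the pandas DataFrame):

```python
import pandas as pd

map_names = pd.DataFrame({"PROVINCIA": ["LIMA", "CUSCO", "NASCA"]})
covid = pd.DataFrame({"PROVINCIA": ["LIMA", "CUSCO", "NAZCA"],
                      "positivos_2021": [50, 8, 3]})

inner = map_names.merge(covid, on="PROVINCIA", how="inner")

# an outer merge with indicator=True labels where each row came from
check = map_names.merge(covid, on="PROVINCIA", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["PROVINCIA", "_merge"]]
print(len(inner), unmatched, sep="\n")
```

If `unmatched` is empty, the inner merge kept every polygon.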

We may need to make calculations with maps, such as positions, distances, and areas. In those cases, it is better to verify that your GeoDF is **projected**:

In [None]:
# current crs
mapaProvCovid.crs

In [None]:
# or simply
mapaProvCovid.crs.is_projected

We could reproject:

In [None]:
mapaProvCovid_rpj=mapaProvCovid.to_crs(24892)
mapaProvCovid_rpj.crs.is_projected

In [None]:
# details
mapaProvCovid_rpj.crs

Let's compute the area:

In [None]:
mapaProvCovid['areaKm2']=mapaProvCovid_rpj.area/10**6

In [None]:
# wanna try this?
# mapaProvCovid.area/10**6

<div class="alert alert-danger">
  We need to make sure the area of the polygons has a minimal bias in the viz.
</div>

We should **avoid** encoding counts with color, as counts are correlated with area.

In [None]:
# !pip install mapclassify

In [None]:
# poor choice...very common?
mapaProvCovid.plot(column='positivos_2021',scheme='boxplot')

Color may encode **densities**:

In [None]:
mapaProvCovid['positivos_2021_perKM2']=mapaProvCovid.positivos_2021/mapaProvCovid.areaKm2
mapaProvCovid.plot(column='positivos_2021_perKM2',scheme='boxplot')

Color may encode **proportions**:

In [None]:
mapaProvCovid['fallecidos2021_share']=mapaProvCovid.fallecidos_2021/(mapaProvCovid.fallecidos_2020+mapaProvCovid.fallecidos_2021)
mapaProvCovid.plot(column='fallecidos2021_share',scheme='boxplot')

If we need to compare, we should avoid **differences**:

In [None]:
mapaProvCovid['fallecidos_diff']=mapaProvCovid.fallecidos_2021-mapaProvCovid.fallecidos_2020
mapaProvCovid.plot(column='fallecidos_diff',scheme='boxplot')

For comparisons, we should compute ratios instead:

In [None]:
mapaProvCovid['fallecidos_rate2120']=mapaProvCovid.fallecidos_2021/mapaProvCovid.fallecidos_2020
mapaProvCovid.plot(column='fallecidos_rate2120',scheme='boxplot')
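
One caveat with ratios is a zero denominator. The sketch below, on made-up counts, shows what pandas returns in that case and one possible way to handle it (treating the ratio as missing, which is my assumption, not a step from this notebook).

```python
import numpy as np
import pandas as pd

deaths = pd.DataFrame({"fallecidos_2020": [5, 0, 0],
                       "fallecidos_2021": [10, 3, 0]})

ratio = deaths["fallecidos_2021"] / deaths["fallecidos_2020"]
# 10/5 = 2.0, 3/0 = inf, 0/0 = NaN

# one option: treat an undefined ratio as missing before classifying or plotting
safe_ratio = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio.tolist(), safe_ratio.tolist())
```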

Let me get rid of "fallecidos_diff":

In [None]:
mapaProvCovid.drop(columns='fallecidos_diff',inplace=True)

These are the new columns:

In [None]:
mapaProvCovid.loc[:,'positivos_2021_perKM2':]

The maps above grouped those variables on the fly; let's create our own groupings. For example:

In [None]:
import mapclassify as mc # needed

bp_perkm2=mc.BoxPlot(mapaProvCovid.positivos_2021_perKM2)
# details
bp_perkm2

In [None]:
# for each province
bp_perkm2.yb.astype('str')

Let's create a column with those values:

In [None]:
mapaProvCovid['positivos_2021_perKM2_group']=bp_perkm2.yb.astype('str')

Let's do the same for the rest:

In [None]:
mapaProvCovid['fallecidos_rate2120_group']=mc.BoxPlot(mapaProvCovid.fallecidos_rate2120).yb.astype('str')
mapaProvCovid['fallecidos2021_share_group']=mc.BoxPlot(mapaProvCovid.fallecidos2021_share).yb.astype('str')

Take a look at their stats:

In [None]:
mapaProvCovid.loc[:,'positivos_2021_perKM2_group':].describe()

...and data types:

In [None]:
mapaProvCovid.loc[:,'positivos_2021_perKM2_group':].info()

These are the current values:

In [None]:
mapaProvCovid.loc[:,'positivos_2021_perKM2_group':]

Let me relabel those values:

In [None]:
TheLabels=['0_Best','1_veryGood','2_notGood','3_Bad','4_veryBad','5_Worst']
TheRecoding={str(n):l for n,l in zip(range(6),TheLabels)}
TheRecoding

In [None]:
mapaProvCovid.loc[:,'positivos_2021_perKM2_group':]=mapaProvCovid.loc[:,'positivos_2021_perKM2_group':].replace(TheRecoding)
mapaProvCovid.loc[:,'positivos_2021_perKM2_group':]
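
To make the six groups less of a black box, here is a sketch of the box-plot rule as I understand it, written with NumPy only: the class edges are the lower fence, the three quartiles, and the upper fence. This is an approximation for illustration, not mapclassify's code, and its handling of edge cases may differ. The values are made up.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q2, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
upper_edges = [q1 - 1.5 * iqr, q1, q2, q3, q3 + 1.5 * iqr]

# class k holds values in (upper_edges[k-1], upper_edges[k]];
# class 0 is a low outlier, class 5 a high outlier
groups = np.digitize(values, upper_edges, right=True)

labels = ['0_Best', '1_veryGood', '2_notGood', '3_Bad', '4_veryBad', '5_Worst']
named = [labels[g] for g in groups]
print(groups.tolist(), named)
```

In this toy example no value falls below the lower fence, so class 0 is empty; the same can happen with real data.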

We can replot the map. Make sure to use the right colormap for ordinal data from [matplotlib](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (another nice alternative is [brewer colors](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3))

In [None]:
mapaProvCovid.plot(column='fallecidos_rate2120_group',categorical=True,cmap='YlOrRd_r',# notice _r
                   legend=True,legend_kwds={"loc": "center left", "bbox_to_anchor": (1, 0.5)})

This is a nice option, too:

In [None]:
mapaProvCovid.explore("fallecidos_rate2120_group", cmap="YlOrRd_r")

We should save the map now:

In [None]:
# mapaProvCovid.to_file("mapaProvCovid.geojson", driver='GeoJSON')

Let me do some extra steps, which **cannot** be saved in a geo structure. Let's create ordinal factors!

In [None]:
TheNewColumns=['positivos_2021_perKM2_ord','fallecidos_rate2120_ord','fallecidos2021_share_ord']

In [None]:
from pandas.api.types import CategoricalDtype

cat_type = CategoricalDtype(categories=TheLabels, ordered=True)
mapaProvCovid[TheNewColumns]=mapaProvCovid.loc[:,'positivos_2021_perKM2_group':].apply(lambda x:x.astype(cat_type))
mapaProvCovid.info()

In [None]:
mapaProvCovid.fallecidos2021_share_ord
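
What an ordered categorical adds over plain strings can be seen in a small sketch. The three values are made up; the labels are the ones defined above.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

labels = ['0_Best', '1_veryGood', '2_notGood', '3_Bad', '4_veryBad', '5_Worst']
ordered_type = CategoricalDtype(categories=labels, ordered=True)

groups = pd.Series(['3_Bad', '0_Best', '5_Worst']).astype(ordered_type)

worst = groups.max()                                # order-aware maximum
above_very_good = (groups > '1_veryGood').tolist()  # comparisons follow the category order
codes = groups.cat.codes.tolist()                   # integer position of each label
print(worst, above_very_good, codes)
```

The order lives in the dtype, not in the label text, so it would still hold if the labels had no numeric prefix.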

In [None]:
mapaProvCovid.plot(column='fallecidos_rate2120_ord',categorical=True,cmap='YlOrRd_r',
                   legend=True,legend_kwds={"loc": "center left", "bbox_to_anchor": (1, 0.5)})

In [None]:
mapaProvCovid.explore("fallecidos_rate2120_ord", cmap="YlOrRd_r")

Everything worked well. But...

In [None]:
# mapaProvCovid.to_file("data/mapaProvCovid.gpkg", layer='provincias', driver="GPKG")

In [None]:
# mapaProvCovid.to_file("data/mapaProvCovid.shp")

In [None]:
# mapaProvCovid.to_file("data/mapaProvCovid.geojson", driver='GeoJSON')