<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center> 


<a id='home'></a>
_____


# Transforming

Most of the time, you need to transform the data you have prepared. I will show the main mathematical transformations that data goes through, namely, aggregating and rescaling.

Let me get a data set familiar to us:

In [None]:
import os, pandas as pd
fragcia=pd.read_csv(os.path.join("data","FragilityCia_isos.csv"))
fragcia.info()

## Aggregation

Sometimes we need to reorganize the data by groups, using one of the columns as the category that defines each group. In the previous table, region is an attribute of each country, but we can turn the table of countries into a table of regions:

In [None]:
# mean of each variable by region
fragcia.groupby('region')[['fragility','co2','ForestRev_gdp']].agg('mean')

In [None]:
# more complex: several statistics at once
fragciaAGG=fragcia.groupby('region')[['fragility','co2','ForestRev_gdp']].agg(['min','max'])

fragciaAGG

Notice that we have a multi-index in the columns:

In [None]:
fragciaAGG.columns

Even though pandas works well with multi-indexes, when exporting files to another application you may prefer simple indexes. So, let me show you how to **flatten** the indexes:

In [None]:
fragciaAGG.columns.to_flat_index()

Then,

In [None]:
fragciaAGG.columns=fragciaAGG.columns.to_flat_index()
fragciaAGG

Now you have tuples as column names! We solve it like this:

In [None]:
['_'.join(col) for col in fragciaAGG.columns]

So,

In [None]:
fragciaAGG.columns=['_'.join(col) for col in fragciaAGG.columns]
fragciaAGG

The last step would be to have _region_ as a column, not as the row index:

In [None]:
fragciaAGG.reset_index(inplace=True) # region becomes a regular column (it is not dropped)
fragciaAGG
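The whole sequence above (aggregate, flatten the column names, bring the group back as a column) can be sketched in one place. This is only an illustration on a made-up data frame; `toy`, its values, and `toyAGG` are assumptions, not part of the fragility file:

```python
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "co2": [1.0, 3.0, 10.0, 20.0],
    "fragility": [5.0, 7.0, 2.0, 4.0],
})

# min and max per group: the columns come back as a MultiIndex
toyAGG = toy.groupby("region")[["co2", "fragility"]].agg(["min", "max"])

# join each (variable, statistic) pair into a single string
toyAGG.columns = ["_".join(col) for col in toyAGG.columns.to_flat_index()]

# turn the group labels back into a regular column
toyAGG = toyAGG.reset_index()
print(toyAGG)
```

Joining with `"_"` works here because both levels of the column index are strings; if a level held numbers, each piece would need to be converted with `str` first.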

## Spatial aggregation


We can do similar aggregations once the data is in a map. Let me open what we have:

In [None]:
import os, geopandas as gpd

mapWorld=gpd.read_file(os.path.join("maps","mapWorld.gpkg"),layer='countries')
mapWorldV=gpd.read_file(os.path.join("maps","mapWorld.gpkg"),layer='countries_valid')
mapWorldV_data=gpd.read_file(os.path.join("maps","mapWorld.gpkg"),layer='countries_valid_data')

In [None]:
mapWorld.plot(linewidth=0.5, edgecolor='k')

This map has invalid geometries, so the aggregation (**dissolve**) will not work (the call below is left commented out for that reason):

In [None]:
#mapWorld.dissolve().plot(linewidth=0.5, edgecolor='k')

This map has valid geometries, so the aggregation (**dissolve**) will work:

In [None]:
mapWorldV.dissolve().plot(linewidth=0.5, edgecolor='k')

In [None]:
mapWorldV.columns

The previous aggregation has only grouped geometries; let's use the map with data to get stats:

In [None]:
mapWorldV_data.columns

In [None]:
cv=lambda x:x.std()/x.mean() # custom function: coefficient of variation

someCols=['fragility','co2','ForestRev_gdp','REGION_UN','geometry']

mapAgg=mapWorldV_data.loc[:,someCols].dissolve(by='REGION_UN',aggfunc={'co2':['mean',cv]})

mapAgg

In [None]:
# column names
mapAgg.columns
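When a lambda is used as an aggregation function, pandas usually labels the resulting column `<lambda>`, which is not very readable. A possible alternative, sketched here with plain pandas on made-up data (`toy`, `regionStats`, and the values are assumptions), is to define the custom function with `def` so that its name becomes the column label:

```python
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "co2": [1.0, 3.0, 10.0, 20.0],
})

def cv(x):
    # coefficient of variation: sample standard deviation over the mean
    return x.std() / x.mean()

# same dictionary style as the aggfunc used in the dissolve call
regionStats = toy.groupby("region").agg({"co2": ["mean", cv]})
print(regionStats.columns.tolist())
print(regionStats)
```

The same named function could be passed in the `aggfunc` dictionary of `dissolve`, which is expected to label the column in the same way.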

See the map colored by a column:

In [None]:
mapAgg.plot(column =('co2', 'mean'), scheme='quantiles', cmap='YlOrRd')

Compare:

In [None]:
mapWorldV_data.plot(column ='co2', scheme='quantiles', cmap='YlOrRd')

## Rescaling

Different columns have different value ranges. That is normal. However, sometimes you need to manipulate the actual values so they have a particular scale or range. Let's see a statistical summary:

In [None]:
fragcia.describe(include='all')

By default, **describe** only shows stats for the numerical columns, so you need the parameter _include_ set to *all* to see every column. However, for our case, we should just request the range:

In [None]:
fragcia.describe().loc[['min','max']].T # notice the transposing

A boxplot may also be helpful:

In [None]:
import matplotlib.pyplot as plt

fragcia.plot(kind='box', rot=90);

In [None]:
# alternatively, with a log scale on the vertical axis

fragcia.plot(kind='box', rot=90)
plt.semilogy();

As you see above, the ranges are very different (except the years). Let's do some rescaling.

## Min-Max Scaling

In [None]:
columnsToScale=['fragility','co2','ForestRev_gdp'] 

from sklearn import preprocessing #installed?

# prepare the process
mnMx_Scaler = preprocessing.MinMaxScaler(feature_range=(0, 10))# default is 0,1

# apply process
mnMx_Result = mnMx_Scaler.fit_transform(fragcia[columnsToScale])

# result
mnMx_Result
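To see what the scaler is doing, here is a minimal sketch of the min-max formula written by hand with pandas. The values in `x` are made up, and `lo` and `hi` are names chosen here to mirror `feature_range=(0, 10)`:

```python
import pandas as pd

# hypothetical values, only for illustration
x = pd.Series([2.0, 4.0, 6.0, 10.0])

# target range, mirroring feature_range=(0, 10)
lo, hi = 0, 10

# shift so the minimum is 0, divide by the range, then stretch to [lo, hi]
x_mM = (x - x.min()) / (x.max() - x.min()) * (hi - lo) + lo
print(x_mM.tolist())
```

The smallest value always lands on `lo` and the largest on `hi`; everything else keeps its relative position in between.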

## Standard Scaling

In this case, we will make the data unitless: the mean of each column will be 0 and its standard deviation will be 1:

In [None]:
# prepare the process
std_Scaler = preprocessing.StandardScaler()

# apply process
std_Result = std_Scaler.fit_transform(fragcia[columnsToScale])

# result
std_Result

You just got:

In [None]:
type(mnMx_Result), type(std_Result)
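Before moving on, it may help to see the arithmetic behind the standard scaling applied earlier. The sketch below writes it by hand with pandas on made-up values. One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while the pandas `std` method uses `ddof=1` by default:

```python
import pandas as pd

# hypothetical values, only for illustration
x = pd.Series([2.0, 4.0, 6.0, 8.0])

# subtract the mean, divide by the population standard deviation
x_sd = (x - x.mean()) / x.std(ddof=0)

print(x_sd.tolist())
print(x_sd.mean(), x_sd.std(ddof=0))
```

If you compute the standard deviation of a scaled column with the pandas default, expect a value slightly above 1 for that reason.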

Let's prepare the new columns:

In [None]:
newNames_mM=[name+'_mM' for name in columnsToScale]
newNames_sd=[name+'_sd' for name in columnsToScale]
newNames_mM,newNames_sd

Let me use those arrays to build pandas data frames with the new column names:

In [None]:
mMDF=pd.DataFrame(mnMx_Result,columns=newNames_mM)
stDF=pd.DataFrame(std_Result,columns=newNames_sd)

In [None]:
fragcia=pd.concat([fragcia,mMDF,stDF],axis=1)

fragcia.info()
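The concatenation above matches rows by index label, which works when the original data frame has the default index (0, 1, 2, ...). A small sketch of what could happen otherwise, and one way to guard against it, using made-up data (`df`, `scaled`, `bad`, and `good` are hypothetical names):

```python
import numpy as np
import pandas as pd

# hypothetical data frame whose index is NOT 0, 1, 2 (e.g. after filtering rows)
df = pd.DataFrame({"co2": [1.0, 2.0, 3.0]}, index=[10, 11, 12])

# stand-in for the array a scaler would return
scaled = np.array([[0.0], [0.5], [1.0]])

# without an index, the new frame gets 0, 1, 2 and concat pads with NaN
bad = pd.concat([df, pd.DataFrame(scaled, columns=["co2_mM"])], axis=1)

# reusing the original index keeps every row matched
good = pd.concat([df, pd.DataFrame(scaled, columns=["co2_mM"], index=df.index)], axis=1)

print(bad.shape, good.shape)
```

Passing `index=` when building the data frame from the array is a cheap safeguard whenever rows were dropped or reordered before scaling.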

Now, these are my new data values:

In [None]:
fragcia[newNames_mM].plot(kind='box', rot=90);

In [None]:
fragcia[newNames_sd].plot(kind='box', rot=90);

The data is ready to be exported.


[Home](#home)


______

<a id='exporting'></a>


# Exporting file

The current *fragcia* data frame is clean and formatted. It is time to send it to a format that will keep all our work for future use.

#### For future use in Python:

In [None]:
fragcia.to_csv(os.path.join("data","fragcia.csv"))
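Keep in mind that `to_csv` writes the row index by default, and that index comes back as an extra `Unnamed: 0` column when the file is read again. A minimal sketch of the difference, using an in-memory buffer and made-up data instead of a real file:

```python
import io
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({"Country": ["A", "B"], "co2": [1.5, 2.5]})

# default behavior: the index is written as the first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the actual columns are written
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
cols_without = pd.read_csv(no_index).columns.tolist()

print(cols_with)
print(cols_without)
```

If the index carries no information, `index=False` keeps the exported file cleaner; otherwise, `index_col=0` in `read_csv` restores it on the way back.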

#### For future use in R:

In [None]:
#try the following before starting Python:
#export LD_LIBRARY_PATH="$(python -m rpy2.situation LD_LIBRARY_PATH)":${LD_LIBRARY_PATH}

from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(fragcia,file="fragcia.RDS")


#In R, you read it with: DF = readRDS("fragcia.RDS")
#or, if you read from the cloud: DF = readRDS(url("https://..../fragcia.RDS"))