<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center> 


<a id='home'></a>
_____


# Rescaling


In [2]:
import os, pandas as pd
fragcia=pd.read_csv(os.path.join("data","FragilityCia_isos.csv"))
fragcia.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171 entries, 0 to 170
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Country            171 non-null    object 
 1   Officialstatename  171 non-null    object 
 2   InternetccTLD      170 non-null    object 
 3   iso2               170 non-null    object 
 4   iso3               171 non-null    object 
 5   fragility_date     171 non-null    int64  
 6   fragility          171 non-null    float64
 7   co2                171 non-null    float64
 8   co2_date           171 non-null    int64  
 9   region             171 non-null    object 
 10  ForestRev_gdp      171 non-null    float64
 11  ForestRev_date     171 non-null    int64  
dtypes: float64(3), int64(3), object(6)
memory usage: 16.2+ KB


In [None]:
# mean of each variable by region
fragcia.groupby('region')[['fragility','co2','ForestRev_gdp']].agg('mean')

In [None]:
# more complex: several statistics (min and max) per variable, by region
fragciaAGG=fragcia.groupby('region')[['fragility','co2','ForestRev_gdp']].agg(['min','max'])

fragciaAGG

Notice that the columns now have a multi-index:

In [None]:
fragciaAGG.columns

Pandas works well with a multi-index, but when exporting files to another application you may prefer simple indexes. So, let me show you how to **flatten** the index:

In [None]:
fragciaAGG.columns.to_flat_index()

Then,

In [None]:
fragciaAGG.columns=fragciaAGG.columns.to_flat_index()
fragciaAGG

Now you have tuples as column names! We can join each tuple into a single string like this:

In [None]:
['_'.join(col) for col in fragciaAGG.columns]

So,

In [None]:
fragciaAGG.columns=['_'.join(col) for col in fragciaAGG.columns]
fragciaAGG

The last step would be to have _region_ as a column, not as the row index:

In [None]:
fragciaAGG.reset_index(inplace=True) # region becomes a regular column; it is not dropped
fragciaAGG
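
As an alternative to flattening after the fact, pandas also supports *named aggregation*, which produces flat column names directly. The following is a minimal sketch on a made-up toy data frame (`toy` and its values are hypothetical, not the real _fragcia_ data):

```python
import pandas as pd

# Toy stand-in for fragcia (values are invented for illustration)
toy = pd.DataFrame({
    "region": ["AFRICA", "AFRICA", "EUROPE", "EUROPE"],
    "fragility": [90.0, 70.0, 20.0, 30.0],
    "co2": [5.0, 15.0, 100.0, 300.0],
})

# Named aggregation: each keyword is the new (flat) column name,
# and each value is a (source column, statistic) pair
toyAGG = (
    toy.groupby("region")
       .agg(fragility_min=("fragility", "min"),
            fragility_max=("fragility", "max"),
            co2_min=("co2", "min"),
            co2_max=("co2", "max"))
       .reset_index()  # region becomes a regular column
)

print(toyAGG)
```

This needs more typing when there are many columns and statistics, so the flatten-and-join approach above may still be more convenient in that case.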

Different columns have different value ranges. That is normal. However, sometimes you need to transform the actual values so they have a particular scale or range. Let's see a statistical summary:

In [3]:
fragcia.describe(include='all')

| | Country | Officialstatename | InternetccTLD | iso2 | iso3 | fragility_date | fragility | co2 | co2_date | region | ForestRev_gdp | ForestRev_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 171 | 171 | 170 | 170 | 171 | 171.0 | 171.0 | 171.0 | 171.0 | 171 | 171.0 | 171.0 |
| unique | 171 | 171 | 170 | 170 | 171 | | | | | 10 | | |
| top | AFGHANISTAN | The Islamic Republic of Afghanistan | .af | AF | AFG | | | | | AFRICA | | |
| freq | 1 | 1 | 1 | 1 | 1 | | | | | 51 | | |
| mean | | | | | | 2019.0 | 66.206433 | 201225000.0 | 2019.0 | | 1.332222 | 2017.959064 |
| std | | | | | | 0.0 | 23.836492 | 940022800.0 | 0.0 | | 2.616312 | 0.294193 |
| min | | | | | | 2019.0 | 16.9 | 173000.0 | 2019.0 | | 0.0 | 2015.0 |
| 25% | | | | | | 2019.0 | 48.4 | 4331000.0 | 2019.0 | | 0.045 | 2018.0 |
| 50% | | | | | | 2019.0 | 70.4 | 16478000.0 | 2019.0 | | 0.26 | 2018.0 |
| 75% | | | | | | 2019.0 | 83.6 | 84670000.0 | 2019.0 | | 1.485 | 2018.0 |


By default, **describe** only summarizes the numerical columns, so you need to set the parameter _include_ to *all* to see the others. However, for our case, we only need the range (minimum and maximum) of the numerical columns:

In [4]:
fragcia.describe().loc[['min','max']].T # .T transposes, so each variable is a row

| | min | max |
|---|---|---|
| fragility_date | 2019.0 | 2019.0 |
| fragility | 16.9 | 113.5 |
| co2 | 173000.0 | 10773250000.0 |
| co2_date | 2019.0 | 2019.0 |
| ForestRev_gdp | 0.0 | 20.27 |
| ForestRev_date | 2015.0 | 2018.0 |
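
The same range table can also be built without computing the full summary first. This is a small sketch on a hypothetical toy data frame (`toy` is made up for illustration); it selects the numeric columns and asks only for the minimum and maximum:

```python
import pandas as pd

# Toy stand-in for fragcia (three invented rows)
toy = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "fragility": [16.9, 70.4, 113.5],
    "co2": [173000.0, 16478000.0, 10773250000.0],
})

# Keep numeric columns only, compute min and max, then transpose
ranges = toy.select_dtypes(include="number").agg(["min", "max"]).T

# An extra column with the width of each range makes the contrast explicit
ranges["range"] = ranges["max"] - ranges["min"]

print(ranges)
```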


A boxplot may also be helpful:

In [None]:
import matplotlib.pyplot as plt

fragcia.plot(kind='box', rot=90);

In [None]:
# alternatively, use a logarithmic scale on the y-axis

fragcia.plot(kind='box', rot=90)
plt.semilogy();

As you see above, the ranges are very different (except for the years). Let's do some rescaling.

## Min-Max Scaling

In [None]:
columnsToScale=['fragility','co2','ForestRev_gdp']

from sklearn import preprocessing # requires scikit-learn to be installed

# prepare the process
mnMx_Scaler = preprocessing.MinMaxScaler(feature_range=(0, 10)) # default is (0, 1)

# apply process
mnMx_Result = mnMx_Scaler.fit_transform(fragcia[columnsToScale])

# result
mnMx_Result
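
To see what min-max scaling does to each value, here is a minimal sketch that applies the usual formula by hand with pandas only, on a hypothetical toy data frame (`toy`, `lo` and `hi` are names chosen for illustration). Each column is shifted so its minimum becomes `lo` and stretched so its maximum becomes `hi`:

```python
import pandas as pd

# Toy stand-in for fragcia[columnsToScale] (invented values)
toy = pd.DataFrame({
    "fragility": [16.9, 65.2, 113.5],
    "co2": [1.0, 10.0, 100.0],
})

lo, hi = 0, 10  # target range, as in feature_range=(0, 10)

# (x - min) / (max - min) maps each column to [0, 1];
# multiplying by (hi - lo) and adding lo maps it to [lo, hi]
scaled = (toy - toy.min()) / (toy.max() - toy.min()) * (hi - lo) + lo

print(scaled)
```

Note that this hand-made version divides by zero (and returns missing values) for a column where all values are equal, so it is only meant to illustrate the arithmetic.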

## Standard Scaling

In this case, we will make the data unitless: each column will have a mean of 0 and a standard deviation of 1:

In [None]:
# prepare the process
std_Scaler = preprocessing.StandardScaler()

# apply process
std_Result = std_Scaler.fit_transform(fragcia[columnsToScale])

# result
std_Result
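
The arithmetic behind standard scaling can be sketched by hand with pandas, again on a hypothetical toy data frame (`toy` is made up for illustration). One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while the pandas `std()` method uses the sample standard deviation (`ddof=1`) by default, so the two give slightly different results unless `ddof=0` is requested:

```python
import pandas as pd

# Toy stand-in for one column of fragcia (invented values)
toy = pd.DataFrame({"fragility": [20.0, 40.0, 60.0, 80.0]})

# z-score: subtract the mean, divide by the population standard deviation
z = (toy - toy.mean()) / toy.std(ddof=0)

print(z)
print("mean:", z["fragility"].mean())
print("population sd:", z["fragility"].std(ddof=0))
print("sample sd (pandas default):", z["fragility"].std())
```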

Both results are NumPy arrays, not data frames:

In [None]:
type(mnMx_Result), type(std_Result)

Let's prepare the new columns:

In [None]:
newNames_mM=[name+'_mM' for name in columnsToScale]
newNames_sd=[name+'_sd' for name in columnsToScale]
newNames_mM,newNames_sd

Let me use those arrays to build new pandas data frames with these column names:

In [None]:
mMDF=pd.DataFrame(mnMx_Result,columns=newNames_mM)
stDF=pd.DataFrame(std_Result,columns=newNames_sd)

In [None]:
fragcia=pd.concat([fragcia,mMDF,stDF],axis=1)

fragcia.info()
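
The concatenation works here because _fragcia_ has the default row index (0 to 170), the same one the new data frames get. With `axis=1`, `pd.concat` matches rows by index label, not by position. The sketch below uses hypothetical toy objects (`base` and `arr` are invented for illustration) to show what happens with a non-default index, and how passing the original index avoids it:

```python
import numpy as np
import pandas as pd

# A data frame whose index is NOT 0, 1, 2 (e.g., after filtering rows)
base = pd.DataFrame({"fragility": [10.0, 20.0, 30.0]}, index=[5, 6, 7])

# A scaled array, as returned by a scaler (invented values)
arr = np.array([[0.0], [5.0], [10.0]])

# New frame gets index 0, 1, 2: labels do not match, so rows are not paired
misaligned = pd.concat(
    [base, pd.DataFrame(arr, columns=["fragility_mM"])], axis=1)

# Reusing the original index keeps each scaled value next to its source row
aligned = pd.concat(
    [base, pd.DataFrame(arr, columns=["fragility_mM"], index=base.index)],
    axis=1)

print(misaligned)
print(aligned)
```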

Now, these are my new data values:

In [None]:
fragcia[newNames_mM].plot(kind='box', rot=90);

In [None]:
fragcia[newNames_sd].plot(kind='box', rot=90);

The data is ready to be exported.


[Home](#home)


______

<a id='exporting'></a>


# Exporting file

The current *fragcia* data frame is clean and formatted. It is time to send it to a format that will keep all our work for future use.

#### For future use in Python:

In [None]:
fragcia.to_csv(os.path.join("data","fragcia_transformed.csv"), index=False) # index=False: do not write the row index as an extra column
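
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The following sketch illustrates the difference using an in-memory buffer instead of a real file (`toy` and the buffer names are hypothetical):

```python
import io
import pandas as pd

# Toy stand-in for fragcia (invented values)
toy = pd.DataFrame({"Country": ["A", "B"], "fragility": [16.9, 113.5]})

# Default behavior: the row index is written as an extra, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
back1 = pd.read_csv(with_index)

# index=False: only the data columns are written
no_index = io.StringIO()
toy.to_csv(no_index, index=False)
no_index.seek(0)
back2 = pd.read_csv(no_index)

print(list(back1.columns))
print(list(back2.columns))
```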

#### For future use in R:

In [None]:
#try the following before starting Python:
#export LD_LIBRARY_PATH="$(python -m rpy2.situation LD_LIBRARY_PATH)":${LD_LIBRARY_PATH}

from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(fragcia,file="fragcia_transformed.RDS")


#In R, you read it with: DF = readRDS("fragcia_transformed.RDS")
#or, if you read it from the cloud: DF = readRDS(url("https://..../fragcia_transformed.RDS"))