<img src="https://i.imgur.com/6U6q5jQ.png"/>

<a target="_blank" href="https://colab.research.google.com/github/SocialAnalytics-StrategicIntelligence/TableOperations/blob/main/index.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Operations on Data Frames


Let me get the data on dengue from [Peru](https://www.datosabiertos.gob.pe/dataset/vigilancia-epidemiol%C3%B3gica-de-dengue):

In [None]:
import pandas as pd
linkData="https://github.com/SocialAnalytics-StrategicIntelligence/TableOperations/raw/main/dataFiles/dengue_ok.pkl"
dengue = pd.read_pickle(linkData)
dengue.info()

In [None]:
# some exploration
dengue.describe().apply(lambda s: s.apply('{0:.5f}'.format))

Each row is a person:

In [None]:
dengue.head()

If we wanted to count people, creating a column of ones helps:

In [None]:
dengue=dengue.assign(case=1)
dengue.head()

Let's start creating _data from these data_!

# Dengue by Year

## Aggregation

Since each row is a person, we can count cases by some grouping variables: in this case, year (_ano_) and dengue status (_enfermedad_).


In [None]:
indexList=['ano','enfermedad']
aggregator={'edad': ['mean','median'], 'case':['sum']}
ByYear_stats=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYear_stats.head(20)

In [None]:
# notice the hierarchy: the columns are a MultiIndex
ByYear_stats.columns

For easier manipulation outside Python, we could flatten the column hierarchy:

In [None]:
# ok?
["_".join(name) for name in ByYear_stats.columns]

In [None]:
# changing
ByYear_stats.columns=["_".join(name) for name in ByYear_stats.columns]
ByYear_stats.head(20)
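
The same aggregate-then-flatten steps can be tried on a tiny table. This is only a sketch: the data frame `people` and its values are made up for illustration and are not the dengue data.

```python
import pandas as pd

# toy data (made up): one row per person
people = pd.DataFrame({
    "year":   [2020, 2020, 2020, 2021],
    "status": ["A", "A", "B", "A"],
    "age":    [10, 30, 50, 40],
    "case":   [1, 1, 1, 1],
})

stats = people.groupby(["year", "status"]).agg({"age": ["mean", "median"], "case": ["sum"]})

# the columns are a MultiIndex of (variable, statistic) pairs
print(stats.columns.to_list())

# joining each pair gives plain string names
stats.columns = ["_".join(pair) for pair in stats.columns]
flat = stats.reset_index()
print(flat)
```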

In [None]:
# final look:
ByYear_stats.reset_index(drop=False,inplace=True)
ByYear_stats.head(20)

Notice a particular data type:

In [None]:
ByYear_stats.enfermedad.dtype

Saving to CSV would lose that _dtype_, so we use the pickle file format instead:

In [None]:
ByYear_stats.to_pickle('dataFiles/ByYear_stats.pkl') # this can be read in R.
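
To see the difference between both formats, here is a minimal round-trip sketch. The column `level` and the temporary file names are hypothetical; only the behavior of the ordered category matters.

```python
import os
import tempfile
import pandas as pd

# toy ordinal column (hypothetical levels)
toy = pd.DataFrame({"level": pd.Categorical(["low", "high", "low"],
                                            categories=["low", "high"], ordered=True)})

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "toy.csv")
    pkl_path = os.path.join(folder, "toy.pkl")
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv["level"].dtype)  # plain strings, no categories
print(from_pkl["level"].dtype)  # the ordered category survives
```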

## Reshaping

Notice the variables are in three columns: **edad_mean**, **edad_median** and **case_sum**. We could reshape those columns to a long format:

In [None]:
theVarsAsIndex=['ano','enfermedad']

# stacking  and resetting index
ByYear_LongStats=ByYear_stats.set_index(theVarsAsIndex).stack().reset_index()

#result
ByYear_LongStats

In [None]:
# just renaming
ByYear_LongStats.rename(columns={'level_2':'statsName',0:'statsValue'},inplace=True)
ByYear_LongStats
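
As an alternative to stacking and renaming, `melt` can produce the same long shape and name the new columns in one call. The sketch below uses a made-up table, not the dengue data; note that `melt` orders the rows variable by variable, while `stack` keeps each original row together.

```python
import pandas as pd

# toy wide table (made-up values)
wide_stats = pd.DataFrame({
    "year": [2020, 2021],
    "status": ["A", "A"],
    "age_mean": [20.0, 40.0],
    "case_sum": [2.0, 1.0],
})

# id_vars stay as they are; every other column becomes a (name, value) pair
long_stats = wide_stats.melt(id_vars=["year", "status"],
                             var_name="statsName",
                             value_name="statsValue")
print(long_stats)
```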

In [None]:
# still ordinal?
ByYear_LongStats.enfermedad.dtype

In [None]:
ByYear_LongStats.to_pickle('dataFiles/ByYear_LongStats.pkl')

# Dengue by Location (Region vs Province)

## Aggregating

We can redo the previous process, adding _departamento_ and _provincia_:

In [None]:
indexList=['ano','departamento','provincia','enfermedad']
aggregator={'case':['sum']}
ByYearPlace=dengue.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace

Before flattening the output data frame in long format, you could create a wide shape:

## Long to wide

In [None]:
#simply
ByYearPlace.unstack()

In [None]:
# a more familiar look
ByYearPlace_wide=ByYearPlace.unstack().reset_index()
ByYearPlace_wide

In [None]:
# zero instead of missing
ByYearPlace_wide.fillna(0,inplace=True)
ByYearPlace_wide

In [None]:
# as expected, the columns are a MultiIndex
ByYearPlace_wide.columns

In [None]:
#prepare
["_".join(names) if names[1]!='' else names[0] for names in ByYearPlace_wide.columns]

In [None]:
# change
ByYearPlace_wide.columns=["_".join(names) if names[1]!='' else names[0] for names in ByYearPlace_wide.columns]
ByYearPlace_wide
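
The whole long-to-wide sequence can be followed on a small example. This is a sketch with made-up data (`cases` and its values are assumptions); it shows where the missing values come from and why the name-joining needs the condition on the empty string.

```python
import pandas as pd

# toy data (made up): one row per case
cases = pd.DataFrame({
    "year":   [2020, 2020, 2020, 2020, 2021],
    "dept":   ["X", "X", "X", "Y", "X"],
    "status": ["A", "A", "B", "B", "A"],
    "case":   [1, 1, 1, 1, 1],
})

counts = cases.groupby(["year", "dept", "status"]).agg({"case": ["sum"]})

# the innermost index level (status) becomes columns;
# combinations that never occurred turn into NaN, replaced here by zero
wide = counts.unstack().reset_index().fillna(0)

# former index columns carry empty strings in the lower levels,
# so only the other columns get their levels joined
wide.columns = ["_".join(names) if names[1] != "" else names[0] for names in wide.columns]
print(wide)
```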

What about finding the _provincia_ most affected in a _departamento_?

In [None]:
where = ByYearPlace_wide.groupby(['ano','departamento'])['case_sum_ALARMA'].idxmax()
worst_prov_year = ByYearPlace_wide.loc[where].reset_index(drop=True)
worst_prov_year
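
A small sketch of how `idxmax` works within groups, using made-up numbers. Notice that a group whose values are all zero still returns a "winner", which is why a filter on positive counts is useful afterwards; when there are ties, `idxmax` keeps the first row.

```python
import pandas as pd

# toy data (made up)
toy = pd.DataFrame({
    "year":  [2020, 2020, 2020, 2021, 2021],
    "dept":  ["X", "X", "Y", "X", "X"],
    "prov":  ["p1", "p2", "p3", "p1", "p2"],
    "cases": [5, 9, 0, 7, 3],
})

# idxmax returns, per group, the row label where 'cases' is largest
where = toy.groupby(["year", "dept"])["cases"].idxmax()

# those labels select the complete rows
worst = toy.loc[where].reset_index(drop=True)
print(worst)
```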

In [None]:
# worst provinces
len(worst_prov_year.provincia.value_counts())

In [None]:
# worst >0
len(worst_prov_year[worst_prov_year.case_sum_ALARMA>0].provincia.value_counts())

## Filtering

Let's filter some rows based on what we just computed:

In [None]:
worst_ProvYear_alarma=worst_prov_year[worst_prov_year.case_sum_ALARMA>0].loc[:,['departamento','provincia']]
worst_ProvYear_alarma.reset_index(drop=True,inplace=True)
worst_ProvYear_alarma

In [None]:
# adding a column of ones
worst_ProvYear_alarma['case']=1
worst_ProvYear_alarma

## Frequency table

With filtered data, let's create a crosstabulation:

In [None]:
indexList=['departamento','provincia']
aggregator={'case':['sum']}
worst_ProvYear_alarma_Frequency=worst_ProvYear_alarma.groupby(indexList,observed=True).agg(aggregator)
worst_ProvYear_alarma_Frequency

In [None]:
# we get a long format
worst_ProvYear_alarma_Frequency.reset_index()

In [None]:
# final look
worst_ProvYear_alarma_Frequency.columns=['case']
worst_ProvYear_alarma_Frequency.reset_index(inplace=True)
worst_ProvYear_alarma_Frequency
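
If the only goal is counting rows per group, the column of ones is optional: `size()` counts the rows directly. A minimal sketch with a made-up table (`winners` is a hypothetical name):

```python
import pandas as pd

# toy data (made up): one row per year in which a province was the worst
winners = pd.DataFrame({
    "dept": ["X", "X", "Y", "X"],
    "prov": ["p2", "p2", "p3", "p1"],
})

# size() counts rows per group; reset_index(name=...) names the count column
freq = winners.groupby(["dept", "prov"]).size().reset_index(name="case")
print(freq)
```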

Saving the frequencies in a file:

In [None]:
worst_ProvYear_alarma_Frequency.to_csv('dataFiles/worst_ProvYear_alarma_Frequency.csv',index=False)

# Dengue by Location and Year

## Aggregating

Let's check a previous data frame:

In [None]:
ByYearPlace_wide

This time, I want two variables:

In [None]:
indexList=['ano','departamento']
aggregator={'case_sum_SIN_SEÑALES':['sum'],'case_sum_ALARMA':['sum']}
ByYearPlace=ByYearPlace_wide.groupby(indexList,observed=True).agg(aggregator)
ByYearPlace.columns=['sum_SIN_SEÑALES','sum_ALARMA']
ByYearPlace.reset_index(inplace=True)
ByYearPlace

## Creating information

I will create a new variable:

In [None]:
ByYearPlace['rateAlarma']=(ByYearPlace['sum_ALARMA']/ByYearPlace['sum_SIN_SEÑALES'])
ByYearPlace['rateAlarma'].describe()

We got _inf_ values:

In [None]:
import numpy as np #identify with numpy
ByYearPlace[np.isinf(ByYearPlace.rateAlarma)]

We need to make a decision. I did this:

In [None]:
ByYearPlace.loc[186,'rateAlarma']=1
ByYearPlace.drop(columns=['sum_SIN_SEÑALES','sum_ALARMA'],inplace=True)
ByYearPlace['rateAlarma'].describe()
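
A boolean mask avoids depending on a particular row label when replacing _inf_ values. The sketch below uses made-up counts, and setting the rate to 1 is just one possible decision, used here only for illustration.

```python
import numpy as np
import pandas as pd

# toy counts (made up)
toy = pd.DataFrame({"alarm": [1, 2, 0], "no_signs": [4, 0, 5]})
toy["rate"] = toy["alarm"] / toy["no_signs"]   # 2/0 gives inf, not an error

# mask of the rows with infinite rate
is_inf = np.isinf(toy["rate"])
print(toy[is_inf])

# one possible decision: set those rows to 1
toy.loc[is_inf, "rate"] = 1
print(toy)
```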

### Discretizing

Sometimes you need a numerical variable as an ordinal variable: 

In [None]:
edges=[-1, .1, .25, .5,.75,1,2]
theLabels=["less10%","10-25%","25-50%","50-75%","75-100%","above100%"]
ByYearPlace["rateAlarma.cut"]=pd.cut(ByYearPlace['rateAlarma'], include_lowest=True,
                                     bins=edges, 
                                     labels=theLabels,
                                     ordered=True)
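
To check how the edges behave, here is a sketch with a few made-up rates. By default the intervals are closed on the right, so a value equal to an edge falls in the lower bin; `include_lowest=True` makes the first interval also include its left edge, and values outside the edges become missing.

```python
import pandas as pd

# made-up rates; 3.0 lies outside the last edge
rates = pd.Series([0.0, 0.1, 0.3, 1.5, 3.0])
edges = [-1, .1, .25, .5, .75, 1, 2]
labels = ["less10%", "10-25%", "25-50%", "50-75%", "75-100%", "above100%"]

binned = pd.cut(rates, bins=edges, labels=labels, include_lowest=True, ordered=True)
print(binned)
```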

In [None]:
# we have
ByYearPlace

We could check the yearly behavior:

In [None]:
ByYearPlace.groupby('ano').describe()

Let's do some **filtering**:

In [None]:
ByYearPlace=ByYearPlace[ByYearPlace.ano>=2012]
ByYearPlace.reset_index(drop=True,inplace=True)
ByYearPlace

In [None]:
ByYearPlace.info()

In [None]:
# the category should be exported as pickle

ByYearPlace.to_pickle("dataFiles/ByYearPlace.pkl")

# World Fragility Data

## Concatenating


Let's visit this website: https://fundforpeace.org/what-we-do/country-risk-and-fragility-data/

There you will find several Excel files with the _Fragile States Index_, one per year. Please create a folder **fragility** inside the folder **dataFiles**, and download into it the Excel files from 2006 to 2023.

In [None]:
# Import libraries
import os
import glob
import pandas as pd

path = os.path.join('dataFiles','fragility','*.xlsx') # xlsx files in the folder
excel_files_names = sorted(glob.glob(path)) # file names using Python's glob, sorted so they line up with the years

# see the file names
excel_files_names


Let's open each file (make sure you have previously installed **openpyxl**):

In [None]:
allDFs=[] # all XLSX will be here!

import pandas as pd

for fileName in excel_files_names:
    currentFile=pd.read_excel(fileName)
    allDFs.append(currentFile)

In [None]:
# amount of rows and columns:
for df,year in zip(allDFs,range(2006,2024)):
    print(df.shape,year)

In [None]:
#dropping one year
allDFs_sub=allDFs[1::]

Putting all the dataframes column names into a list:

In [None]:
allColumnNames=[] # I will write every column 
for df in allDFs_sub:
    allColumnNames.append(set(df.columns))# list of sets!

# this is what we have
allColumnNames

In [None]:
# common columns
commonColumns=set.intersection(*allColumnNames) # expanding list of sets (*)
commonColumns

In [None]:
commonColumns.symmetric_difference(set.union(*allColumnNames))

In [None]:
allDFs_sameNames=[] # list of DataFrames, each restricted to the common columns
colnamesSorted=sorted(list(commonColumns)) # columns names sorted - must turn 'set' into 'list'

# making list of DFs
for df in allDFs_sub:
    allDFs_sameNames.append(df.loc[:,colnamesSorted]) 

# here it is
allDFs_sameNames

In [None]:
# concatenating
allDFsConcat=pd.concat(allDFs_sameNames,ignore_index=True) # appending DFs using 'concat()'

#done!... see it:
allDFsConcat
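
The logic of keeping only the shared columns before concatenating can be sketched with two small data frames. The frames and their values are made up; only the column names `Country`, `Year`, `Total` and `Rank` echo the ones used here.

```python
import pandas as pd

# two toy yearly tables whose columns only partly coincide (made-up values)
df_a = pd.DataFrame({"Country": ["Peru"], "Year": [2021], "Total": [70.0], "Rank": ["90th"]})
df_b = pd.DataFrame({"Country": ["Peru"], "Year": [2022], "Total": [69.0], "Change": [-1.0]})
frames = [df_a, df_b]

# intersection of the column sets, sorted to get a stable column order
common = set.intersection(*[set(df.columns) for df in frames])
kept = sorted(common)

# same columns in every frame, then stack them vertically with a fresh index
combined = pd.concat([df.loc[:, kept] for df in frames], ignore_index=True)
print(combined)
```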

In [None]:
allDFsConcat.info()

In [None]:
# value_counts can be used in object type
allDFsConcat.Year.value_counts()

In [None]:
# keeping just the year value
yearAsNumber=[]
for y in allDFsConcat.Year:
    try:
        yearAsNumber.append(y.year)# the value from a date-time format
    except AttributeError:
        yearAsNumber.append(y) # if not a datetime

#verifying
pd.Series(yearAsNumber).value_counts()
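
A shorter variant of the same idea uses `getattr` with a default instead of `try`/`except`. This is a sketch on a made-up list that mixes timestamps and integers, as an object column may do:

```python
import pandas as pd

# made-up values mixing timestamps and plain integers
mixed_years = [pd.Timestamp("2019-01-01"), 2020, pd.Timestamp("2021-01-01"), 2022]

# use the .year attribute when it exists, otherwise keep the value as it is
years = [getattr(y, "year", y) for y in mixed_years]
print(years)
```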

In [None]:
# overwriting the year column
allDFsConcat['Year']=yearAsNumber

In [None]:
# current order
allDFsConcat.columns.to_list()

In [None]:
# this is a trick: setting columns as index
allDFsConcat.set_index(['Country','Year','Total'],inplace=True)
allDFsConcat.head()

Reordering columns:

In [None]:
# dropping unneeded column
allDFsConcat.drop(columns='Rank',inplace=True)

In [None]:
# indexes will be columns
allDFsConcat.reset_index(drop=False,inplace=True)

# see
allDFsConcat.head()

In [None]:
# better ?
allDFsConcat.columns.to_list()

In [None]:
# clean column names
allDFsConcat.columns=allDFsConcat.columns.str.replace(r':\s',"_",regex=True) # raw strings keep '\s' as a regex token
allDFsConcat.columns=allDFsConcat.columns.str.replace(r'\s',"",regex=True)
#see
allDFsConcat.columns.to_list()
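
The effect of those two replacements is easier to see on a short index. The column names below are illustrative assumptions, written in the style of the indicator names:

```python
import pandas as pd

# illustrative column names (assumed, not read from the files)
cols = pd.Index(["Country", "C1: Security Apparatus", "E3: Human Flight and Brain Drain"])

cols = cols.str.replace(r":\s", "_", regex=True)  # colon plus one whitespace -> underscore
cols = cols.str.replace(r"\s", "", regex=True)    # drop every remaining whitespace
print(cols.to_list())
```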

In [None]:
# overwriting country
allDFsConcat['Country']=allDFsConcat.Country.str.upper()
allDFsConcat["Country"]=allDFsConcat.Country.str.strip()

## Reshaping after concatenation

We can find some problems that were created during the concatenation:

In [None]:
# seeing long shape
fragileLong=allDFsConcat.iloc[:,:3]
fragileLong

In [None]:
# to wide
fragileWide=pd.pivot_table(fragileLong,
               values='Total', # values to use
               index=['Country'], # unit of analysis
               columns=['Year']) # the values for NEW column
# see wide
fragileWide.head()

In [None]:
# missing values in long format
fragileLong[fragileLong.isna().any(axis=1)]

In [None]:
# what cells have missing values?
fragileWide[fragileWide.isna().any(axis=1)]

So we have a problem: some countries appear under different names in different years.

In [None]:
# details
fragileWide[fragileWide.isna().any(axis=1)].index

In [None]:
# prepare changes as dict:
changes={"CABO VERDE": "CAPE VERDE","CÔTE D'IVOIRE":"COTE D'IVOIRE", 
"CZECHIA":"CZECH REPUBLIC",
"SWAZILAND":"ESWATINI",
"ISRAEL AND WEST BANK":"ISRAEL",
"KYRGYZSTAN":"KYRGYZ REPUBLIC",
"NORTH MACEDONIA":"MACEDONIA",
"SLOVAKIA": "SLOVAK REPUBLIC"}

In [None]:
# make changes using 'replace':
allDFsConcat['Country']=allDFsConcat.Country.replace(to_replace=changes) # assign back; inplace on a selected column is unreliable
# re create:
fragileLong=allDFsConcat.iloc[:,:3]

In [None]:
# to wide shape again
fragileWide=pd.pivot_table(fragileLong,
               values='Total',
               index=['Country'],
               columns=['Year']).\
            reset_index(drop=False).\
            rename_axis(index=None, columns=None)

# verify missing
fragileWide[fragileWide.isna().any(axis=1)] # remember you had an extra country
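
Why does the wide shape reveal naming problems? A country spelled in two ways becomes two rows, each with gaps. The sketch below reproduces that with made-up scores:

```python
import pandas as pd

# toy long table (made-up scores); one country is spelled in two ways
long_df = pd.DataFrame({
    "Country": ["PERU", "PERU", "CZECH REPUBLIC", "CZECHIA"],
    "Year":    [2021, 2022, 2021, 2022],
    "Total":   [70.0, 69.0, 41.0, 40.0],
})

wide = pd.pivot_table(long_df, values="Total", index=["Country"], columns=["Year"])
print(wide[wide.isna().any(axis=1)])   # both spellings show gaps

# after unifying the name, the gaps disappear
long_df["Country"] = long_df["Country"].replace({"CZECHIA": "CZECH REPUBLIC"})
wide_fixed = pd.pivot_table(long_df, values="Total", index=["Country"], columns=["Year"])
print(wide_fixed)
```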

In [None]:
# new subset
allDFsConcat=allDFsConcat[allDFsConcat.Year>=2012]

In [None]:
allDFsConcat

In [None]:
allDFsConcat.reset_index(drop=True, inplace=True)

In [None]:
allDFsConcat

In [None]:
# saving
allDFsConcat.to_csv(os.path.join("dataFiles","fragility","fragility2012_2023.csv"),index=False)

# Country Codes

## Merging

In [None]:
# make sure to install 'html5lib', 'beautifulsoup4' and 'lxml'

codesLink='https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes'

allTablesWiki=pd.read_html(codesLink, flavor='bs4')

In [None]:

allTablesWiki[0]

In [None]:
# keep that one
countryCodes=allTablesWiki[0].copy()

In [None]:
# check names
countryCodes.columns

In [None]:
# keeping what is needed
countryCodes=countryCodes.iloc[:,[0,3,4]].copy() # copy, so later column assignments act on an independent table

In [None]:
countryCodes.columns

In [None]:
# brute-force renaming
newNames=["Country","iso2","iso3"]
countryCodes.columns=newNames
countryCodes

In [None]:
# bye symbols
countryCodes['Country']=countryCodes['Country'].str.normalize('NFKD').\
                        str.encode('ascii', errors='ignore').str.decode('utf-8').str.upper()

In [None]:
# check missing
countryCodes[countryCodes.isna().any(axis=1)]

In [None]:
# easy fix
countryCodes.loc[countryCodes.Country=='NAMIBIA','iso2']="NA"

# something missing?
countryCodes[countryCodes.isna().any(axis=1)]

In [None]:
# are these iso2 valid values?
[x for x in countryCodes.iso2 if len(x)>2]

In [None]:
# wrong rows

badValues=[x for x in countryCodes.iso2 if len(x)>2]

countryCodes[countryCodes.iso2.isin(badValues)]

In [None]:
# dropping wrong rows
countryCodes=countryCodes[~countryCodes.iso2.isin(badValues)] # filtering

countryCodes.reset_index(drop=True,inplace=True) # needed when rows are dropped

In [None]:
#how many countries?
allDFsConcat.Country.unique().shape

In [None]:
#how many rows in the codes table?
countryCodes.Country.shape

Let's use sets to find the names that do not match:

In [None]:
# only in countryCodes.Country NOT in allDFsConcat.Country
OnlyInCodes=set(countryCodes.Country)-set(allDFsConcat.Country)
OnlyInCodes

In [None]:
# only in allDFsConcat.Country NOT in countryCodes.Country
OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)
OnlyInConcat

## Fuzzy merging

We use the previous information to look for _possible_ matches (please install **thefuzz**):

In [None]:
from thefuzz import process as fz

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)]

In [None]:
# this may be clearer:

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)
 if fz.extractOne(f, OnlyInCodes)[1]>=90]
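
If **thefuzz** is not available, the same idea can be sketched with `difflib` from the standard library. This is a different scorer, not the one used above: similarity goes from 0 to 1 instead of 0 to 100, so the thresholds are not comparable. The two name lists are made up for illustration.

```python
import difflib

# made-up name lists, standing in for the two sets of unmatched names
only_in_concat = ["BAHAMAS", "GAMBIA", "LAOS", "NETHERLANDS"]
only_in_codes = ["BAHAMAS (THE)", "GAMBIA (THE)", "NETHERLANDS (THE)"]

changes = {}
for name in only_in_concat:
    # best candidate whose similarity ratio is at least the cutoff
    best = difflib.get_close_matches(name, only_in_codes, n=1, cutoff=0.6)
    if best:                      # empty list when nothing is similar enough
        changes[best[0]] = name   # key: spelling to replace, value: spelling to keep

print(changes)
```

Names without a close candidate (here, LAOS) are left out of the dict and still need a manual decision.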

In [None]:
# prepare a dict of changes

changesInCodes1={fz.extractOne(f, OnlyInCodes)[0]:f 
                 for f in sorted(OnlyInConcat)
                 if fz.extractOne(f, OnlyInCodes)[1] >=90}
#the dict
changesInCodes1

In [None]:
countryCodes['Country']=countryCodes.Country.replace(to_replace=changesInCodes1) # assign back instead of inplace

In [None]:
# second iteration

OnlyInCodes=set(countryCodes.Country)-set(allDFsConcat.Country)
OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)]

Based on last result, we may need manual changes:

In [None]:
# see the strings in countryCodes:

countryCodes[countryCodes.Country.str.contains('LAO|KOREA|CZECH|CONGO',regex=True)]

In [None]:
# manual changes

changesInCodes2={"KOREA (THE DEMOCRATIC PEOPLE'S REPUBLIC OF) [P]":'NORTH KOREA',
                 "KOREA (THE REPUBLIC OF) [Q]":"SOUTH KOREA",
                 "LAO PEOPLE'S DEMOCRATIC REPUBLIC (THE) [R]":"LAOS",
                 "CZECHIA [J]":'CZECH REPUBLIC',
                 "CONGO (THE) [H]":'CONGO REPUBLIC'}
countryCodes['Country']=countryCodes.Country.replace(to_replace=changesInCodes2) # assign back instead of inplace

Those changes now allow for a different result:

In [None]:
OnlyInCodes=set(countryCodes.Country)-set(allDFsConcat.Country)
OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)

[(f,fz.extractOne(f, OnlyInCodes)) for f in sorted(OnlyInConcat)]

In [None]:
# we got it !
changesInCodes3={fz.extractOne(f, OnlyInCodes)[0]:f 
                 for f in sorted(OnlyInConcat)
                 if fz.extractOne(f, OnlyInCodes)[1] >=52}
#dict of matches
changesInCodes3

In [None]:
# make the changes
countryCodes['Country']=countryCodes.Country.replace(to_replace=changesInCodes3) # assign back instead of inplace

In [None]:
# confirming

OnlyInConcat=set(allDFsConcat.Country)-set(countryCodes.Country)
OnlyInConcat

Once we have recovered most of the matches, we are ready to merge:

In [None]:
fragilityCoded_2012_2023=allDFsConcat.merge(countryCodes,left_on='Country',right_on='Country') #merge on Country
fragilityCoded_2012_2023
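
By default `merge` keeps only the rows whose key appears in both tables, so unmatched countries vanish silently. A left merge with `indicator=True` is one way to audit what would be lost. The sketch uses two made-up tables, including a fictional country name:

```python
import pandas as pd

# toy tables (made up); NARNIA has no code, NAMIBIA has no score
scores = pd.DataFrame({"Country": ["PERU", "LAOS", "NARNIA"], "Total": [70.0, 75.0, 10.0]})
codes = pd.DataFrame({"Country": ["PERU", "LAOS", "NAMIBIA"], "iso3": ["PER", "LAO", "NAM"]})

inner = scores.merge(codes, on="Country")   # default: only matched rows
audit = scores.merge(codes, on="Country", how="left", indicator=True)

print(len(scores), len(inner))
print(audit[audit["_merge"] == "left_only"])   # rows an inner merge would drop
```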

In [None]:
# pickle keeps Namibia's iso2 code 'NA' as text (reading a CSV back would parse it as missing by default)
fragilityCoded_2012_2023.to_pickle(os.path.join("dataFiles","fragility","fragilityCoded_2012_2023.pkl"))