<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

# Data Formatting (categorical)


This formatting tutorial covers the categorical case: turning text labels into ordered (ordinal) categories.


Let's get [some data](https://en.wikipedia.org/wiki/List_of_freedom_indices):

In [59]:
%reset
import pandas as pd

link='https://en.wikipedia.org/wiki/List_of_freedom_indices'
freeDFs=pd.read_html(link,flavor='bs4',match='w',attrs={'class':"wikitable"})

# how many tables?
len(freeDFs)

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


2

In [60]:
# is this the one?
freeDFs[0]

Unnamed: 0_level_0,Index,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale
Unnamed: 0_level_1,Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,Freedom in the World,free,free,free,free,free,partly free,partly free,partly free,partly free,partly free,not free,not free,not free,not free,not free
1,Index of Economic Freedom,free,free,free,mostly free,mostly free,mostly free,moderately free,moderately free,moderately free,mostly unfree,mostly unfree,mostly unfree,repressed,repressed,repressed
2,World Press Freedom Index,good,good,good,satisfactory,satisfactory,satisfactory,problematic,problematic,problematic,difficult,difficult,difficult,very serious,very serious,very serious
3,The Economist Democracy Index,full democracy,full democracy,full democracy,flawed democracy,flawed democracy,flawed democracy,hybrid regime,hybrid regime,hybrid regime,—,—,—,authoritarian regime,authoritarian regime,authoritarian regime


The first table only describes the scales, so you want the second table:

In [61]:
allFree=freeDFs[1]
allFree.head()

Unnamed: 0,Country,Freedom in the World 2023[13],Score,Index of Economic Freedom 2023[14],Score.1,Press Freedom Index 2023[3],Score.2,Democracy Index 2023[9],Score.3
0,Norway,free,100,mostly free,76.9,good,95.18,full democracy,9.81
1,Ireland,free,97,free,82.0,good,89.91,full democracy,9.05
2,Sweden,free,100,mostly free,77.5,good,88.15,full democracy,9.26
3,Finland,free,100,mostly free,77.1,good,87.94,full democracy,9.2
4,Denmark,free,97,mostly free,77.6,good,89.48,full democracy,9.15


Cleaning column names:

In [62]:
allFree.columns

Index(['Country', 'Freedom in the World 2023[13]', 'Score',
       'Index of Economic Freedom 2023[14]', 'Score.1',
       'Press Freedom Index 2023[3]', 'Score.2', 'Democracy Index 2023[9]',
       'Score.3'],
      dtype='object')

Is this a good alternative? Notice that it leaves four columns with the same name, `Score`:

In [63]:
allFree.columns.str.replace(r"\W|\d","",regex=True)

Index(['Country', 'FreedomintheWorld', 'Score', 'IndexofEconomicFreedom',
       'Score', 'PressFreedomIndex', 'Score', 'DemocracyIndex', 'Score'],
      dtype='object')
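
A quick way to confirm the problem with the regex approach is to ask pandas whether the cleaned names repeat. This is a minimal sketch on a shortened copy of the column names; the variable names `raw`, `cleaned` and `has_duplicates` are made up for the illustration.

```python
import pandas as pd

# a shortened copy of the original column names
raw = pd.Index(['Country', 'Freedom in the World 2023[13]', 'Score',
                'Index of Economic Freedom 2023[14]', 'Score.1'])

# same regex as above: drop every non-word character and every digit
cleaned = raw.str.replace(r"\W|\d", "", regex=True)

# duplicated() flags the repeated names; any() tells us if there is at least one
has_duplicates = cleaned.duplicated().any()
print(list(cleaned), has_duplicates)
```

Repeated column names make later selections ambiguous, which is one reason to write the names by hand instead.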

You might prefer this:

In [64]:
NewNames=['Country', 'Freedom', 'FreedomScore', 'EconomicFreedom',
       'EconomicFreedomScore', 'PressFreedom', 'PressFreedomScore', 'Democracy', 'DemocracyScore']
allFree.columns=NewNames

Let's check data types:

In [65]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Country               197 non-null    object
 1   Freedom               196 non-null    object
 2   FreedomScore          197 non-null    object
 3   EconomicFreedom       176 non-null    object
 4   EconomicFreedomScore  197 non-null    object
 5   PressFreedom          184 non-null    object
 6   PressFreedomScore     197 non-null    object
 7   Democracy             165 non-null    object
 8   DemocracyScore        197 non-null    object
dtypes: object(9)
memory usage: 14.0+ KB


Let's remove the leading and trailing spaces in every cell:

In [66]:
# this code breaks if applied to numeric columns
allFree=allFree.apply(lambda x: x.str.strip())
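
The comment above warns that `.str.strip()` fails on numeric columns. If a data frame mixes text and numbers, one possible workaround is to strip only the non-numeric columns. This is a sketch on a small made-up data frame (`demo`), not on the table from Wikipedia.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# hypothetical data: one text column with stray spaces, one numeric column
demo = pd.DataFrame({"Country": [" Norway", "Peru "], "Score": [100, 71]})

# keep only the columns that are not numeric
text_cols = [c for c in demo.columns if not is_numeric_dtype(demo[c])]

# strip the spaces in those columns only
demo[text_cols] = demo[text_cols].apply(lambda col: col.str.strip())
print(demo)
```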

Do we have unique country names?

In [67]:
len(allFree.Country)==len(pd.unique(allFree.Country))

True
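
pandas also offers `Series.is_unique` for the same check, and `duplicated(keep=False)` to see which values repeat when the check fails. A small sketch with made-up names:

```python
import pandas as pd

# hypothetical column where one country appears twice
countries = pd.Series(["Norway", "Ireland", "Norway"])

# same question as comparing lengths, in one attribute
print(countries.is_unique)

# keep=False marks every copy of a repeated value, not just the later ones
repeated = countries[countries.duplicated(keep=False)]
print(repeated)
```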

In [68]:
allFree.head()

Unnamed: 0,Country,Freedom,FreedomScore,EconomicFreedom,EconomicFreedomScore,PressFreedom,PressFreedomScore,Democracy,DemocracyScore
0,Norway,free,100,mostly free,76.9,good,95.18,full democracy,9.81
1,Ireland,free,97,free,82.0,good,89.91,full democracy,9.05
2,Sweden,free,100,mostly free,77.5,good,88.15,full democracy,9.26
3,Finland,free,100,mostly free,77.1,good,87.94,full democracy,9.2
4,Denmark,free,97,mostly free,77.6,good,89.48,full democracy,9.15


You have categorical and numerical columns. Would you prefer to see the categorical columns first?

In [69]:
#non scores

allFree.columns[~allFree.columns.str.contains("score",case=False)]

Index(['Country', 'Freedom', 'EconomicFreedom', 'PressFreedom', 'Democracy'], dtype='object')

In [70]:
# non scores as index
allFree.set_index(allFree.columns[~allFree.columns.str.contains("score",case=False)].to_list())

Country,Freedom,EconomicFreedom,PressFreedom,Democracy,FreedomScore,EconomicFreedomScore,PressFreedomScore,DemocracyScore
Norway,free,mostly free,good,full democracy,100,76.9,95.18,9.81
Ireland,free,free,good,full democracy,97,82,89.91,9.05
Sweden,free,mostly free,good,full democracy,100,77.5,88.15,9.26
Finland,free,mostly free,good,full democracy,100,77.1,87.94,9.2
Denmark,free,mostly free,good,full democracy,97,77.6,89.48,9.15
...,...,...,...,...,...,...,...,...
Afghanistan,not free,,very serious,authoritarian regime,8,—,39.75,2.85
Yemen,not free,,very serious,authoritarian regime,9,—,32.78,1.95
Palestine,,,very serious,authoritarian regime,—,—,37.86,3.83
Syria,not free,,very serious,authoritarian regime,1,—,27.22,1.43


In [71]:
# reset index
allFree.set_index(allFree.columns[~allFree.columns.str.contains("score",case=False)].to_list()).reset_index(drop=False)

Unnamed: 0,Country,Freedom,EconomicFreedom,PressFreedom,Democracy,FreedomScore,EconomicFreedomScore,PressFreedomScore,DemocracyScore
0,Norway,free,mostly free,good,full democracy,100,76.9,95.18,9.81
1,Ireland,free,free,good,full democracy,97,82,89.91,9.05
2,Sweden,free,mostly free,good,full democracy,100,77.5,88.15,9.26
3,Finland,free,mostly free,good,full democracy,100,77.1,87.94,9.2
4,Denmark,free,mostly free,good,full democracy,97,77.6,89.48,9.15
...,...,...,...,...,...,...,...,...,...
192,Afghanistan,not free,,very serious,authoritarian regime,8,—,39.75,2.85
193,Yemen,not free,,very serious,authoritarian regime,9,—,32.78,1.95
194,Palestine,,,very serious,authoritarian regime,—,—,37.86,3.83
195,Syria,not free,,very serious,authoritarian regime,1,—,27.22,1.43


In [72]:
#Then
allFree=allFree.set_index(allFree.columns[~allFree.columns.str.contains("score",case=False)].to_list()).reset_index(drop=False)

In [73]:
allFree.head()

Unnamed: 0,Country,Freedom,EconomicFreedom,PressFreedom,Democracy,FreedomScore,EconomicFreedomScore,PressFreedomScore,DemocracyScore
0,Norway,free,mostly free,good,full democracy,100,76.9,95.18,9.81
1,Ireland,free,free,good,full democracy,97,82.0,89.91,9.05
2,Sweden,free,mostly free,good,full democracy,100,77.5,88.15,9.26
3,Finland,free,mostly free,good,full democracy,100,77.1,87.94,9.2
4,Denmark,free,mostly free,good,full democracy,97,77.6,89.48,9.15
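
The `set_index` plus `reset_index` trick works, but the same reordering can be written by building the list of column names directly. A minimal sketch on a one-row made-up data frame (`demo`, `is_score` and `ordered_cols` are names invented here):

```python
import pandas as pd

# hypothetical data with labels and scores interleaved
demo = pd.DataFrame({"Country": ["Norway"], "Freedom": ["free"],
                     "FreedomScore": ["100"], "Democracy": ["full democracy"],
                     "DemocracyScore": ["9.81"]})

# boolean mask: True where the column name contains "score"
is_score = demo.columns.str.contains("score", case=False)

# non-score columns first, score columns last
ordered_cols = demo.columns[~is_score].to_list() + demo.columns[is_score].to_list()
demo = demo[ordered_cols]
print(demo.columns.to_list())
```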


Let's pay attention to the categorical columns:

In [74]:
[list(allFree[c].sort_values().unique()) for c in allFree.columns[1:5]]

[['free', 'not free', 'partly free', nan],
 ['free', 'moderately free', 'mostly free', 'mostly unfree', 'repressed', nan],
 ['difficult', 'good', 'problematic', 'satisfactory', 'very serious', nan],
 ['authoritarian regime',
  'flawed democracy',
  'full democracy',
  'hybrid regime',
  nan]]

The goal was to catch inconsistencies such as ['free', 'not free', 'partly free', 'Free']. The cells are clean.
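
If the cells had not been clean, comparing the labels before and after normalizing them would reveal the problem. This sketch uses made-up labels, not the real column:

```python
import pandas as pd

# hypothetical labels with a capitalized variant and a stray leading space
labels = pd.Series(["free", "Free", " not free", "partly free", "not free", None])

# labels exactly as typed
raw_levels = sorted(labels.dropna().unique())

# labels after removing spaces and unifying the case
clean_levels = sorted(labels.str.strip().str.lower().dropna().unique())

# fewer levels after cleaning means some labels were variants of each other
print(len(raw_levels), len(clean_levels))
```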

Now, let's turn the values into **ordinal** categories. Remember that the worst, best and middle values should be comparable across indices, so every scale below runs from 1 (worst) to 5 (best):

In [75]:
# assign the result back to the column: an in-place replace() on
# allFree.Freedom may act on a copy in recent pandas versions.
# map() recodes each label and leaves missing values as NaN.
mapper1 = {'not free': 1, 'partly free': 3, 'free': 5}
allFree['Freedom'] = allFree['Freedom'].map(mapper1)

mapper2 = {'repressed': 1, 'mostly unfree': 2, 'moderately free': 3, 'mostly free': 4, 'free': 5}
allFree['EconomicFreedom'] = allFree['EconomicFreedom'].map(mapper2)

mapper3 = {'very serious': 1, 'difficult': 2, 'problematic': 3,
           'satisfactory': 4, 'good': 5}
allFree['PressFreedom'] = allFree['PressFreedom'].map(mapper3)

mapper4 = {'authoritarian regime': 1, 'hybrid regime': 2, 'flawed democracy': 4, 'full democracy': 5}
allFree['Democracy'] = allFree['Democracy'].map(mapper4)


In [76]:
allFree

Unnamed: 0,Country,Freedom,EconomicFreedom,PressFreedom,Democracy,FreedomScore,EconomicFreedomScore,PressFreedomScore,DemocracyScore
0,Norway,5.0,4.0,5.0,5.0,100,76.9,95.18,9.81
1,Ireland,5.0,5.0,5.0,5.0,97,82,89.91,9.05
2,Sweden,5.0,4.0,5.0,5.0,100,77.5,88.15,9.26
3,Finland,5.0,4.0,5.0,5.0,100,77.1,87.94,9.2
4,Denmark,5.0,4.0,5.0,5.0,97,77.6,89.48,9.15
...,...,...,...,...,...,...,...,...,...
192,Afghanistan,1.0,,1.0,1.0,8,—,39.75,2.85
193,Yemen,1.0,,1.0,1.0,9,—,32.78,1.95
194,Palestine,,,1.0,1.0,—,—,37.86,3.83
195,Syria,1.0,,1.0,1.0,1,—,27.22,1.43
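
Before recoding, it can help to confirm that the mapper covers every label in the column, because a label missing from the dictionary would not be recoded. A minimal sketch with a made-up column (`freedom`) that contains one misspelled label:

```python
import pandas as pd

mapper = {'not free': 1, 'partly free': 3, 'free': 5}

# hypothetical column: 'Partly free' is not a key of the mapper
freedom = pd.Series(['free', 'not free', 'Partly free', None])

# labels present in the data but absent from the mapper
unmapped = set(freedom.dropna()) - set(mapper)
print(unmapped)

# with map(), any label that is not a key becomes a missing value
recoded = freedom.map(mapper)
print(recoded.isna().sum())
```

An empty `unmapped` set means the recoding will not create new missing values.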


Let's explore:

In [77]:
#check types:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country               197 non-null    object 
 1   Freedom               196 non-null    float64
 2   EconomicFreedom       176 non-null    float64
 3   PressFreedom          184 non-null    float64
 4   Democracy             165 non-null    float64
 5   FreedomScore          197 non-null    object 
 6   EconomicFreedomScore  197 non-null    object 
 7   PressFreedomScore     197 non-null    object 
 8   DemocracyScore        197 non-null    object 
dtypes: float64(4), object(5)
memory usage: 14.0+ KB


In [78]:
# what about nullable integers?
allFree[allFree.columns[1:5]]=allFree.iloc[:,1:5].apply(lambda x: x.astype('Int64'))

In [79]:
#then
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Country               197 non-null    object
 1   Freedom               196 non-null    Int64 
 2   EconomicFreedom       176 non-null    Int64 
 3   PressFreedom          184 non-null    Int64 
 4   Democracy             165 non-null    Int64 
 5   FreedomScore          197 non-null    object
 6   EconomicFreedomScore  197 non-null    object
 7   PressFreedomScore     197 non-null    object
 8   DemocracyScore        197 non-null    object
dtypes: Int64(4), object(5)
memory usage: 14.8+ KB
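
The columns were floats only because a regular integer column cannot hold missing values. The nullable `Int64` type (capital I) can, as this small sketch with made-up values shows:

```python
import numpy as np
import pandas as pd

# hypothetical codes: the missing value forces a float column
codes = pd.Series([5.0, np.nan, 1.0])
print(codes.dtype)

# Int64 keeps whole numbers and shows the gap as <NA>
as_int = codes.astype('Int64')
print(as_int)
```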


However, integers are not yet ordinal categories. Let's convert them:

In [80]:
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)
allFree.iloc[:,1:5].apply(lambda x:x.astype(order)).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Freedom          196 non-null    category
 1   EconomicFreedom  176 non-null    category
 2   PressFreedom     184 non-null    category
 3   Democracy        165 non-null    category
dtypes: category(4)
memory usage: 1.7 KB


In [81]:
# create some new names:
newNames=allFree.columns[1:5]+'_ord'
#see
newNames

Index(['Freedom_ord', 'EconomicFreedom_ord', 'PressFreedom_ord',
       'Democracy_ord'],
      dtype='object')

In [82]:
allFree[newNames]=allFree.iloc[:,1:5].apply(lambda x:x.astype(order))

In [83]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Country               197 non-null    object  
 1   Freedom               196 non-null    Int64   
 2   EconomicFreedom       176 non-null    Int64   
 3   PressFreedom          184 non-null    Int64   
 4   Democracy             165 non-null    Int64   
 5   FreedomScore          197 non-null    object  
 6   EconomicFreedomScore  197 non-null    object  
 7   PressFreedomScore     197 non-null    object  
 8   DemocracyScore        197 non-null    object  
 9   Freedom_ord           196 non-null    category
 10  EconomicFreedom_ord   176 non-null    category
 11  PressFreedom_ord      184 non-null    category
 12  Democracy_ord         165 non-null    category
dtypes: Int64(4), category(4), object(5)
memory usage: 16.3+ KB


In [84]:
allFree.EconomicFreedom_ord

0        4
1        5
2        4
3        4
4        4
      ... 
192    NaN
193    NaN
194    NaN
195    NaN
196      1
Name: EconomicFreedom_ord, Length: 197, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
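
The `<` signs in the last line mean pandas now knows the order of the levels, so comparisons and `min`/`max` are allowed. A short sketch on made-up values (`demo_ord` is a name invented here):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# hypothetical ordinal column
demo_ord = pd.Series([4, 5, 2, 1]).astype(order)

# comparisons use the declared order of the categories
above_three = demo_ord > 3
print(above_three.tolist())

# best and worst observed levels
best = demo_ord.max()
worst = demo_ord.min()
print(best, worst)
```

With an unordered categorical, the comparison and the `max`/`min` calls would raise an error instead.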

You may want to rename the categories:

In [85]:
ordCats={1:'veryLow',2:'low',3:'medium',4:'good',5:'veryGood'}

turnToOrdinal= lambda x:x.cat.rename_categories(ordCats)

allFree.iloc[:,9:].apply(turnToOrdinal)

Unnamed: 0,Freedom_ord,EconomicFreedom_ord,PressFreedom_ord,Democracy_ord
0,veryGood,good,veryGood,veryGood
1,veryGood,veryGood,veryGood,veryGood
2,veryGood,good,veryGood,veryGood
3,veryGood,good,veryGood,veryGood
4,veryGood,good,veryGood,veryGood
...,...,...,...,...
192,veryLow,,veryLow,veryLow
193,veryLow,,veryLow,veryLow
194,,,veryLow,veryLow
195,veryLow,,veryLow,veryLow


In [86]:
allFree[newNames]=allFree.iloc[:,9:].apply(turnToOrdinal)

# see
allFree.head(10)

Unnamed: 0,Country,Freedom,EconomicFreedom,PressFreedom,Democracy,FreedomScore,EconomicFreedomScore,PressFreedomScore,DemocracyScore,Freedom_ord,EconomicFreedom_ord,PressFreedom_ord,Democracy_ord
0,Norway,5,4,5,5,100,76.9,95.18,9.81,veryGood,good,veryGood,veryGood
1,Ireland,5,5,5,5,97,82.0,89.91,9.05,veryGood,veryGood,veryGood,veryGood
2,Sweden,5,4,5,5,100,77.5,88.15,9.26,veryGood,good,veryGood,veryGood
3,Finland,5,4,5,5,100,77.1,87.94,9.2,veryGood,good,veryGood,veryGood
4,Denmark,5,4,5,5,97,77.6,89.48,9.15,veryGood,good,veryGood,veryGood
5,Switzerland,5,5,4,5,96,83.8,84.4,8.83,veryGood,veryGood,good,veryGood
6,New Zealand,5,4,4,5,99,78.9,84.23,9.25,veryGood,good,good,veryGood
7,Netherlands,5,4,5,5,97,78.0,87.0,8.96,veryGood,good,veryGood,veryGood
8,Luxembourg,5,4,4,5,97,78.4,81.98,8.68,veryGood,good,good,veryGood
9,Estonia,5,4,5,4,94,78.6,85.31,7.84,veryGood,good,veryGood,good
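
Renaming changes only the labels; the order declared earlier is preserved. This sketch checks that on a small made-up column (`demo_ord` and `renamed` are names invented here):

```python
import pandas as pd

ordCats = {1: 'veryLow', 2: 'low', 3: 'medium', 4: 'good', 5: 'veryGood'}

# hypothetical ordinal column with integer levels
demo_ord = pd.Series(pd.Categorical([4, 5, 1], categories=[1, 2, 3, 4, 5], ordered=True))

# new labels, same order
renamed = demo_ord.cat.rename_categories(ordCats)
print(renamed.cat.ordered)

# comparisons now use the text labels
print((renamed >= 'good').tolist())
```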


Let's keep this last result. This also lets me show you the use of the **pickle** format:

In [87]:
#saving

import os 

allFree.to_csv(os.path.join("data","allFree.csv"),index=False )
allFree.to_pickle(os.path.join("data","allFree.pkl") )

In [88]:
#reading

dfPickle=pd.read_pickle(os.path.join("data","allFree.pkl") )  
dfCSV=pd.read_csv(os.path.join("data","allFree.csv") )  

Now, notice the difference when you have categorical data:

In [89]:
dfPickle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Country               197 non-null    object  
 1   Freedom               196 non-null    Int64   
 2   EconomicFreedom       176 non-null    Int64   
 3   PressFreedom          184 non-null    Int64   
 4   Democracy             165 non-null    Int64   
 5   FreedomScore          197 non-null    object  
 6   EconomicFreedomScore  197 non-null    object  
 7   PressFreedomScore     197 non-null    object  
 8   DemocracyScore        197 non-null    object  
 9   Freedom_ord           196 non-null    category
 10  EconomicFreedom_ord   176 non-null    category
 11  PressFreedom_ord      184 non-null    category
 12  Democracy_ord         165 non-null    category
dtypes: Int64(4), category(4), object(5)
memory usage: 15.7+ KB


In [90]:
dfCSV.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Country               197 non-null    object 
 1   Freedom               196 non-null    float64
 2   EconomicFreedom       176 non-null    float64
 3   PressFreedom          184 non-null    float64
 4   Democracy             165 non-null    float64
 5   FreedomScore          197 non-null    object 
 6   EconomicFreedomScore  197 non-null    object 
 7   PressFreedomScore     197 non-null    object 
 8   DemocracyScore        197 non-null    object 
 9   Freedom_ord           196 non-null    object 
 10  EconomicFreedom_ord   176 non-null    object 
 11  PressFreedom_ord      184 non-null    object 
 12  Democracy_ord         165 non-null    object 
dtypes: float64(4), object(9)
memory usage: 20.1+ KB


In [91]:
# the pickle file kept the data type
dfPickle.Democracy_ord

0      veryGood
1      veryGood
2      veryGood
3      veryGood
4      veryGood
         ...   
192     veryLow
193     veryLow
194     veryLow
195     veryLow
196     veryLow
Name: Democracy_ord, Length: 197, dtype: category
Categories (5, object): ['veryLow' < 'low' < 'medium' < 'good' < 'veryGood']

In [92]:
# the CSV file did not keep the data type
dfCSV.Democracy_ord

0      veryGood
1      veryGood
2      veryGood
3      veryGood
4      veryGood
         ...   
192     veryLow
193     veryLow
194     veryLow
195     veryLow
196     veryLow
Name: Democracy_ord, Length: 197, dtype: object
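
If you must share a CSV, the ordinal type can be declared again when reading the file. This is a minimal sketch, assuming you still know the levels and their order; it writes to an in-memory buffer instead of a file, and `levels`, `ordered_levels` and `demo` are names invented for the illustration.

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ['veryLow', 'low', 'medium', 'good', 'veryGood']
ordered_levels = CategoricalDtype(categories=levels, ordered=True)

# hypothetical data with one ordinal column and one missing value
demo = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Democracy_ord": pd.Categorical(["veryGood", "low", None],
                                    categories=levels, ordered=True)})

# write the CSV to memory, then go back to the start of the buffer
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# tell read_csv which column is ordinal and what its levels are
restored = pd.read_csv(buffer, dtype={"Democracy_ord": ordered_levels})
print(restored["Democracy_ord"])
```

The pickle file avoids this extra step, but it can only be read from Python.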