<center><img src="https://i.imgur.com/zRrFdsf.png" width="700"></center>

# Data Formatting (categorical)


This formatting tutorial covers the categorical case: turning text labels into ordinal categories.


Let's get [some data](https://en.wikipedia.org/wiki/List_of_freedom_indices):

In [4]:
import pandas as pd

link='https://en.wikipedia.org/wiki/List_of_freedom_indices'
freeDFs=pd.read_html(link,flavor='bs4',match='w',attrs={'class':"wikitable"})

In [5]:
len(freeDFs)

2

In [6]:
freeDFs[0]

Unnamed: 0_level_0,Index,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale,Scale
Unnamed: 0_level_1,Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,Freedom in the World,free,free,free,free,free,partly free,partly free,partly free,partly free,partly free,not free,not free,not free,not free,not free
1,Index of Economic Freedom,free,free,free,mostly free,mostly free,mostly free,moderately free,moderately free,moderately free,mostly unfree,mostly unfree,mostly unfree,repressed,repressed,repressed
2,Press Freedom Index,good situation,good situation,good situation,satisfactory situation,satisfactory situation,satisfactory situation,noticeable problems,noticeable problems,noticeable problems,difficult situation,difficult situation,difficult situation,very serious situation,very serious situation,very serious situation
3,Democracy Index,full democracy,full democracy,full democracy,flawed democracy,flawed democracy,flawed democracy,hybrid regime,hybrid regime,hybrid regime,authoritarian regime,authoritarian regime,authoritarian regime,authoritarian regime,authoritarian regime,authoritarian regime


We need the second table:

In [8]:
allFree=freeDFs[1]
allFree.head()

Unnamed: 0,Country,Freedom in the World 2022[13],2022 Index of Economic Freedom[14],2022 Press Freedom Index[3],2021 Democracy Index[9]
0,Afghanistan,not free,,very serious situation,authoritarian regime
1,Albania,partly free,moderately free,noticeable problems,flawed democracy
2,Algeria,not free,repressed,difficult situation,authoritarian regime
3,Andorra,free,,noticeable problems,
4,Angola,not free,mostly unfree,noticeable problems,authoritarian regime


Cleaning column names:

In [9]:
allFree.columns

Index(['Country', 'Freedom in the World 2022[13]',
       '2022 Index of Economic Freedom[14]', '2022 Press Freedom Index[3]',
       '2021 Democracy Index[9]'],
      dtype='object')

A regular expression can remove every non-word character (`\W`) and every digit (`\d`) in one step:

In [13]:
allFree.columns.str.replace(r"\W|\d","",regex=True)

Index(['Country', 'FreedomintheWorld', 'IndexofEconomicFreedom',
       'PressFreedomIndex', 'DemocracyIndex'],
      dtype='object')

In [14]:
#then
allFree.columns=allFree.columns.str.replace(r"\W|\d","",regex=True)
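To see what the pattern removes, here is a minimal sketch on a few made-up headers. The `cols` index is illustrative and only imitates the scraped column names:

```python
import pandas as pd

# hypothetical headers imitating the scraped ones
cols = pd.Index(['Country',
                 'Freedom in the World 2022[13]',
                 '2021 Democracy Index[9]'])

# \W matches any non-word character (spaces, brackets);
# \d matches any digit (years, footnote numbers)
clean = cols.str.replace(r"\W|\d", "", regex=True)
print(clean.to_list())
```

Note that the pattern also removes the spaces between words, which is why the resulting names are run together.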

Let's clean all the leading/trailing space in every cell:

In [18]:
allFree=allFree.apply(lambda x: x.str.strip())

Do we have unique country names?

In [19]:
len(allFree.Country)==len(pd.unique(allFree.Country))

True
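If the check had returned `False`, we would want to know which names repeat. A minimal sketch, assuming a made-up `names` series with one repeated country:

```python
import pandas as pd

# hypothetical country column with a repeated value
names = pd.Series(['Peru', 'Chile', 'Peru'], name='Country')

# is_unique gives the same answer as comparing lengths
print(names.is_unique)

# keep=False flags every occurrence of a repeated value
dupes = names[names.duplicated(keep=False)]
print(dupes)
```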

Let's start formatting:

In [21]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Country                 197 non-null    object
 1   FreedomintheWorld       197 non-null    object
 2   IndexofEconomicFreedom  177 non-null    object
 3   PressFreedomIndex       184 non-null    object
 4   DemocracyIndex          167 non-null    object
dtypes: object(5)
memory usage: 7.8+ KB


Every column is of the object type. Let's request the frequency tables and keep only the category labels:

In [43]:
[allFree[c].value_counts(sort=False, dropna=False).index.sort_values().to_list()  for c in allFree.columns[1:]]

[['free', 'not free', 'partly free'],
 ['free', 'moderately free', 'mostly free', 'mostly unfree', 'repressed', nan],
 ['difficult situation',
  'good situation',
  'noticeable problems',
  'satisfactory situation',
  'very serious situation',
  nan],
 ['authoritarian regime',
  'flawed democracy',
  'full democracy',
  'hybrid regime',
  nan]]
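The list above only keeps the labels. As a small sketch of the frequency table itself, here is `value_counts` on a made-up `status` column (hypothetical data, not the scraped table):

```python
import pandas as pd

# hypothetical column with one missing value
status = pd.Series(['free', 'not free', 'free', None], name='status')

# dropna=False keeps the missing values as their own row
freqs = status.value_counts(dropna=False)
print(freqs)

# the index of the frequency table holds the category labels
labels = freqs.index.dropna().to_list()
print(labels)
```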

We could request unique values to shorten the code:

In [81]:
[list(allFree[c].sort_values().unique()) for c in allFree.columns[1:]]

[['free', 'not free', 'partly free'],
 ['free', 'moderately free', 'mostly free', 'mostly unfree', 'repressed', nan],
 ['difficult situation',
  'good situation',
  'noticeable problems',
  'satisfactory situation',
  'very serious situation',
  nan],
 ['authoritarian regime',
  'flawed democracy',
  'full democracy',
  'hybrid regime',
  nan]]

It is very important to verify that the strings that represent categories do not need _cleaning_ (e.g. that no 'freee' appears where 'free' was meant).
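One way to automate that verification is to compare the observed labels against the set you expect. This is only a sketch; the `expected` set and the `col` series are made up for illustration:

```python
import pandas as pd

# labels we expect to find (assumed for this example)
expected = {'free', 'partly free', 'not free'}

# hypothetical column containing a misspelled label and a missing value
col = pd.Series(['free', 'freee', 'not free', None, 'partly free'])

# any label outside the expected set needs cleaning
unexpected = set(col.dropna().unique()) - expected
print(unexpected)
```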

Let me make an explicit copy of the dataframe before modifying it:

In [13]:
allFree=allFree.copy()

Now, let's turn the values into **ordinal** categories. Remember that the worst, best and middle values should be comparable across indices, so every scale below runs from 1 (worst) to 5 (best), with 3 in the middle:

In [82]:
# assigning the result back is safer than calling replace(..., inplace=True) on a column
mapper1 = {'free':5 , 'partly free': 3, 'not free': 1}
allFree['FreedomintheWorld']=allFree.FreedomintheWorld.replace(mapper1)

mapper2 = {'moderately free':3, 'repressed':1, 'mostly unfree':2,
       'mostly free':4, 'free':5}
allFree['IndexofEconomicFreedom']=allFree.IndexofEconomicFreedom.replace(mapper2)


mapper3 = {'very serious situation':1, 'noticeable problems':3,
       'difficult situation':2, 'satisfactory situation':4,
       'good situation':5}
allFree['PressFreedomIndex']=allFree.PressFreedomIndex.replace(mapper3)

mapper4 = {'authoritarian regime':1, 'flawed democracy':3,'hybrid regime':2,
       'full democracy':5}
allFree['DemocracyIndex']=allFree.DemocracyIndex.replace(mapper4)
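A label that is missing from the mapper is left untouched by `replace`, so a typo can go unnoticed. As a sketch of one way to catch this, `Series.map` turns every unmapped label into a missing value. The `col` series here is made up for illustration:

```python
import pandas as pd

mapper = {'free': 5, 'partly free': 3, 'not free': 1}

# hypothetical column with a misspelled label and a genuine missing value
col = pd.Series(['free', 'freee', 'not free', None])

# map() returns NaN for any label that is not a key of the mapper
mapped = col.map(mapper)

# labels that were present but could not be mapped
missed = col[mapped.isna() & col.notna()]
print(missed.to_list())
```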


In [83]:
allFree

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Afghanistan,1,,1.0,1.0
1,Albania,3,3.0,3.0,3.0
2,Algeria,1,1.0,2.0,1.0
3,Andorra,5,,3.0,
4,Angola,1,2.0,3.0,1.0
...,...,...,...,...,...
192,Venezuela,1,1.0,1.0,1.0
193,Vietnam,1,3.0,1.0,1.0
194,Yemen,1,,1.0,1.0
195,Zambia,3,1.0,3.0,2.0


Let's explore:

In [84]:
#check types:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Country                 197 non-null    object 
 1   FreedomintheWorld       197 non-null    int64  
 2   IndexofEconomicFreedom  177 non-null    float64
 3   PressFreedomIndex       184 non-null    float64
 4   DemocracyIndex          167 non-null    float64
dtypes: float64(3), int64(1), object(1)
memory usage: 7.8+ KB


In [85]:
# the nullable integer type 'Int32' (capital I) accepts missing values:
allFree.iloc[:,1:].apply(lambda x: x.astype('Int32')).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   FreedomintheWorld       197 non-null    Int32
 1   IndexofEconomicFreedom  177 non-null    Int32
 2   PressFreedomIndex       184 non-null    Int32
 3   DemocracyIndex          167 non-null    Int32
dtypes: Int32(4)
memory usage: 4.0 KB


In [24]:
# this will fail, because numpy's 'int32' (lowercase i) cannot represent missing values:
#allFree.iloc[:,1:].apply(lambda x: x.astype('int32')).info()
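The difference between the two integer types can be sketched on a tiny series (made-up values, chosen only to include one missing value):

```python
import numpy as np
import pandas as pd

# hypothetical numeric column with one missing value
s = pd.Series([1.0, np.nan, 3.0])

# pandas' nullable integer type keeps the missing value as <NA>
nullable = s.astype('Int32')
print(nullable)

# numpy's int32 has no way to store a missing value, so the cast raises
try:
    s.astype('int32')
    failed = False
except ValueError:
    failed = True
print(failed)
```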

In [86]:
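# find the columns that cannot be cast to int32, and the first value that blocks the cast
for col in allFree.columns[1:]:
    try:
        allFree[col].astype('int32')
    except ValueError:
        print(col)
        for cell in allFree[col]:
            try:
                int(cell)
            except ValueError:
                print(cell)
                break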

IndexofEconomicFreedom
nan
PressFreedomIndex
nan
DemocracyIndex
nan


In [87]:
allFree.head()

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Afghanistan,1,,1.0,1.0
1,Albania,3,3.0,3.0,3.0
2,Algeria,1,1.0,2.0,1.0
3,Andorra,5,,3.0,
4,Angola,1,2.0,3.0,1.0


However, these are still plain numbers, not ordinal categories. Let's convert them:

In [88]:
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)
allFree.iloc[:,1:].apply(lambda x:x.astype(order)).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 4 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   FreedomintheWorld       197 non-null    category
 1   IndexofEconomicFreedom  177 non-null    category
 2   PressFreedomIndex       184 non-null    category
 3   DemocracyIndex          167 non-null    category
dtypes: category(4)
memory usage: 1.7 KB


In [89]:
allFree.columns[1:]+'_or'

Index(['FreedomintheWorld_or', 'IndexofEconomicFreedom_or',
       'PressFreedomIndex_or', 'DemocracyIndex_or'],
      dtype='object')

In [90]:
newNames=allFree.columns[1:]+'_or'
allFree[newNames]=allFree.iloc[:,1:].apply(lambda x:x.astype(order))

In [36]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Country                    197 non-null    object  
 1   FreedomintheWorld          197 non-null    int64   
 2   IndexofEconomicFreedom     177 non-null    float64 
 3   PressFreedomIndex          184 non-null    float64 
 4   DemocracyIndex             167 non-null    float64 
 5   FreedomintheWorld_or       197 non-null    category
 6   IndexofEconomicFreedom_or  177 non-null    category
 7   PressFreedomIndex_or       184 non-null    category
 8   DemocracyIndex_or          167 non-null    category
dtypes: category(4), float64(3), int64(1), object(1)
memory usage: 9.4+ KB


In [38]:
allFree.IndexofEconomicFreedom_or

0      NaN
1        3
2        1
3      NaN
4        2
      ... 
192      1
193      3
194    NaN
195      1
196      1
Name: IndexofEconomicFreedom_or, Length: 197, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
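Because the categories are ordered, comparisons and `min`/`max` become meaningful. A minimal sketch on a made-up series, using the same kind of ordered type as above:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# hypothetical scores with one missing value
s = pd.Series([1.0, 3.0, np.nan, 5.0]).astype(order)

# element-wise comparison against one of the categories
atLeastMiddle = s >= 3
print(atLeastMiddle.to_list())

# the highest category present (missing values are skipped)
best = s.max()
print(best)
```

With an unordered categorical, the comparison and `max` calls would raise an error instead.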

You may want to rename the categories with more meaningful labels:

In [91]:
ordCats={1:'veryLow',2:'low',3:'medium',4:'good',5:'veryGood'}

turnToOrdinal= lambda x:x.cat.rename_categories(ordCats)

allFree.iloc[:,5:].apply(turnToOrdinal)

Unnamed: 0,FreedomintheWorld_or,IndexofEconomicFreedom_or,PressFreedomIndex_or,DemocracyIndex_or
0,veryLow,,veryLow,veryLow
1,medium,medium,medium,medium
2,veryLow,veryLow,low,veryLow
3,veryGood,,medium,
4,veryLow,low,medium,veryLow
...,...,...,...,...
192,veryLow,veryLow,veryLow,veryLow
193,veryLow,medium,veryLow,veryLow
194,veryLow,,veryLow,veryLow
195,medium,veryLow,medium,low
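Renaming changes only the labels; the order of the categories is preserved. A small sketch on a made-up series shows that comparisons still work after renaming:

```python
import pandas as pd

ordCats = {1: 'veryLow', 2: 'low', 3: 'medium', 4: 'good', 5: 'veryGood'}

# hypothetical ordered column
s = pd.Series(pd.Categorical([1, 5, 3], categories=[1, 2, 3, 4, 5], ordered=True))

# the dictionary maps each old category to its new label
renamed = s.cat.rename_categories(ordCats)
print(renamed.to_list())

# the order survives the renaming
aboveLow = renamed > 'low'
print(aboveLow.to_list())
```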


In [92]:
# assigning by column name replaces the columns, so the new labels are accepted
allFree[newNames]=allFree[newNames].apply(turnToOrdinal)


Let's keep this last result. This also lets me show you the use of the **pickle** format:

In [94]:
#saving

import os 

allFree.to_csv(os.path.join("data","allFree.csv"),index=False )
allFree.to_pickle(os.path.join("data","allFree.pkl") )

In [95]:
#reading

dfPickle=pd.read_pickle(os.path.join("data","allFree.pkl") )  
dfCSV=pd.read_csv(os.path.join("data","allFree.csv") )  

Now, notice the difference when you have categorical data:

In [96]:
dfPickle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   Country                    197 non-null    object  
 1   FreedomintheWorld          197 non-null    int64   
 2   IndexofEconomicFreedom     177 non-null    float64 
 3   PressFreedomIndex          184 non-null    float64 
 4   DemocracyIndex             167 non-null    float64 
 5   FreedomintheWorld_or       197 non-null    category
 6   IndexofEconomicFreedom_or  177 non-null    category
 7   PressFreedomIndex_or       184 non-null    category
 8   DemocracyIndex_or          167 non-null    category
dtypes: category(4), float64(3), int64(1), object(1)
memory usage: 8.7+ KB


In [97]:
dfCSV.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197 entries, 0 to 196
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Country                    197 non-null    object 
 1   FreedomintheWorld          197 non-null    int64  
 2   IndexofEconomicFreedom     177 non-null    float64
 3   PressFreedomIndex          184 non-null    float64
 4   DemocracyIndex             167 non-null    float64
 5   FreedomintheWorld_or       197 non-null    object 
 6   IndexofEconomicFreedom_or  177 non-null    object 
 7   PressFreedomIndex_or       184 non-null    object 
 8   DemocracyIndex_or          167 non-null    object 
dtypes: float64(3), int64(1), object(5)
memory usage: 14.0+ KB


In [98]:
# the file kept the data type
dfPickle.DemocracyIndex_or

0      veryLow
1       medium
2      veryLow
3          NaN
4      veryLow
        ...   
192    veryLow
193    veryLow
194    veryLow
195        low
196    veryLow
Name: DemocracyIndex_or, Length: 197, dtype: category
Categories (5, object): ['veryLow' < 'low' < 'medium' < 'good' < 'veryGood']

In [99]:
# the file did not keep the data type
dfCSV.DemocracyIndex_or

0      veryLow
1       medium
2      veryLow
3          NaN
4      veryLow
        ...   
192    veryLow
193    veryLow
194    veryLow
195        low
196    veryLow
Name: DemocracyIndex_or, Length: 197, dtype: object
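If you must use CSV, the ordinal type can be rebuilt while reading by passing a `CategoricalDtype` to the `dtype` argument of `read_csv`. This is a minimal sketch; the in-memory `csvText` stands in for a real file and is made up for illustration:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ['veryLow', 'low', 'medium', 'good', 'veryGood']
ordType = CategoricalDtype(categories=levels, ordered=True)

# hypothetical CSV content; country B has a missing value
csvText = "Country,DemocracyIndex_or\nA,veryLow\nB,\nC,medium\n"

# declare the ordinal type for the column while reading
restored = pd.read_csv(io.StringIO(csvText),
                       dtype={'DemocracyIndex_or': ordType})
print(restored.DemocracyIndex_or)
```

The drawback is that the levels and their order have to be written out again by hand, whereas the pickle file carries them with the data.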