<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Formatting Categorical data in Python


This formatting tutorial covers the categorical case. Let's open a file we created earlier about [Freedom Indices](https://en.wikipedia.org/wiki/List_of_freedom_indices):

In [24]:
import pandas as pd

link='https://github.com/PythonVersusR/OperationsCleaning/raw/main/freedom_Py.csv'
allFree=pd.read_csv(link)

Let's explore:

In [25]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     157 non-null    object 
 1   FitW        157 non-null    object 
 2   FitW_score  157 non-null    int64  
 3   IoEF        157 non-null    object 
 4   IoEF_score  157 non-null    float64
 5   PFI         157 non-null    object 
 6   PFI_score   157 non-null    float64
 7   DI          157 non-null    object 
 8   DI_score    157 non-null    float64
dtypes: float64(3), int64(1), object(5)
memory usage: 11.2+ KB


Notice that the clean numeric values were recognised as numeric. That will not always be the case, so always verify. When the columns are numeric, summary statistics can be computed:

In [26]:
allFree.describe().T

,count,mean,std,min,25%,50%,75%,max
FitW_score,157.0,55.382166,30.094836,2.0,28.0,59.0,83.0,100.0
IoEF_score,157.0,59.064331,11.673919,2.9,52.2,58.5,67.3,83.9
PFI_score,157.0,58.687389,16.961262,21.72,46.21,59.25,71.06,95.18
DI_score,157.0,5.460573,2.242645,1.08,3.31,5.77,7.16,9.81


But the categories are still recognised as `object`. Let's check the levels again:

In [27]:
allFree.iloc[:,1::2]

,FitW,IoEF,PFI,DI
0,free,mostly free,good,full democracy
1,free,free,good,full democracy
2,free,mostly free,good,full democracy
3,free,mostly free,good,full democracy
4,free,mostly free,good,full democracy
...,...,...,...,...
152,not free,repressed,very serious,authoritarian regime
153,not free,repressed,very serious,authoritarian regime
154,not free,repressed,very serious,authoritarian regime
155,not free,repressed,very serious,authoritarian regime


Let's recall the levels (the data must have been cleaned beforehand):

In [28]:
[{x:set(allFree[x])} for x in allFree.iloc[:,1::2].columns]

[{'FitW': {'free', 'not free', 'partly free'}},
 {'IoEF': {'free',
   'moderately free',
   'mostly free',
   'mostly unfree',
   'repressed'}},
 {'PFI': {'difficult', 'good', 'problematic', 'satisfactory', 'very serious'}},
 {'DI': {'authoritarian regime',
   'flawed democracy',
   'full democracy',
   'hybrid regime'}}]

Now, let's turn the values into **ordinal** categories. Remember that the worst, best and middle values should be comparable across indices, so every scale runs from 1 to 5 even when an index has fewer than five levels:

In [29]:
# assign values so that the worst and best levels match across indices

mapper1 = {'not free': 1 ,'partly free': 3,'free':5}
mapper2 = {'repressed':1, 'mostly unfree':2,'moderately free':3, 'mostly free':4, 'free':5}
mapper3 = {'very serious':1, 'difficult':2,'problematic':3,'satisfactory':4,'good':5}
mapper4 = {'authoritarian regime':1,'hybrid regime':2,'flawed democracy':4, 'full democracy':5}

# map() plus assignment avoids replace(..., inplace=True) on a column,
# which recent pandas versions warn about and may not apply to the frame
allFree['FitW']=allFree['FitW'].map(mapper1)
allFree['IoEF']=allFree['IoEF'].map(mapper2)
allFree['PFI']=allFree['PFI'].map(mapper3)
allFree['DI']=allFree['DI'].map(mapper4)
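
Before recoding, it can help to confirm that a mapper covers every level present in the column. The sketch below uses a hypothetical column with a misspelled level, not the real data. It relies on `Series.map`, which returns `NaN` for any value without a matching key, whereas `replace` would leave such a value untouched:

```python
import pandas as pd

# hypothetical column: 'partially free' is a level the mapper does not know
levels = pd.Series(['free', 'not free', 'partly free', 'partially free'])
mapper = {'not free': 1, 'partly free': 3, 'free': 5}

# levels present in the data but absent from the mapper
unmapped = set(levels) - set(mapper)

# map() yields NaN where no key matches, so missing values flag the problem
recoded = levels.map(mapper)

print(unmapped)
print(recoded.isna().sum())
```

An empty `unmapped` set means the recoding will not silently skip or lose any level.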


The result:

In [30]:
allFree

,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15
...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15
155,CUBA,1,12,1,24.3,1,29.00,1,2.84


In [31]:
#check types:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     157 non-null    object 
 1   FitW        157 non-null    int64  
 2   FitW_score  157 non-null    int64  
 3   IoEF        157 non-null    int64  
 4   IoEF_score  157 non-null    float64
 5   PFI         157 non-null    int64  
 6   PFI_score   157 non-null    float64
 7   DI          157 non-null    int64  
 8   DI_score    157 non-null    float64
dtypes: float64(3), int64(5), object(1)
memory usage: 11.2+ KB


We have integers instead of categories. Let's create ordinal columns:

In [32]:
# new column names
newNames=allFree.columns[1::2]+'_or'
newNames

Index(['FitW_or', 'IoEF_or', 'PFI_or', 'DI_or'], dtype='object')

In [33]:
# copy the previous values
allFree[newNames]=allFree.iloc[:,1::2]
allFree

,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score,FitW_or,IoEF_or,PFI_or,DI_or
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81,5,4,5,5
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05,5,5,5,5
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26,5,4,5,5
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20,5,4,5,5
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15,5,4,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76,1,1,1,1
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72,1,1,1,1
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15,1,1,1,1
155,CUBA,1,12,1,24.3,1,29.00,1,2.84,1,1,1,1


In [34]:
# turn integers into ordinal levels

# create the data type info
from pandas.api.types import CategoricalDtype
myOrdinal = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)

#one column
allFree.loc[:,"FitW_or"].astype(myOrdinal)

0      5
1      5
2      5
3      5
4      5
      ..
152    1
153    1
154    1
155    1
156    1
Name: FitW_or, Length: 157, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
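
What the `ordered=True` flag provides can be seen in a small sketch. The `toy` values below are hypothetical stand-ins for one recoded column. Comparisons and `min`/`max` follow the declared order, and levels that never occur still belong to the type:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

myOrdinal = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# hypothetical values standing in for one recoded column
toy = pd.Series([5, 3, 1, 3, 5]).astype(myOrdinal)

# comparisons against a level respect the declared order
atLeastMiddle = (toy >= 3).tolist()

# min and max are defined because the categories are ordered
lowest, highest = toy.min(), toy.max()

# levels 2 and 4 never occur, yet they are counted (with zero)
counts = toy.value_counts(sort=False)

print(atLeastMiddle, lowest, highest)
print(counts)
```

With an unordered categorical, the comparison and the `min`/`max` calls would raise an error instead.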

In [35]:
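# several columns
# assign by column name: recent pandas versions try to write .loc[:, ...] assignments
# into the existing integer columns, which would keep the int64 dtype
allFree[newNames]=allFree[newNames].astype(myOrdinal)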
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Country     157 non-null    object  
 1   FitW        157 non-null    int64   
 2   FitW_score  157 non-null    int64   
 3   IoEF        157 non-null    int64   
 4   IoEF_score  157 non-null    float64 
 5   PFI         157 non-null    int64   
 6   PFI_score   157 non-null    float64 
 7   DI          157 non-null    int64   
 8   DI_score    157 non-null    float64 
 9   FitW_or     157 non-null    category
 10  IoEF_or     157 non-null    category
 11  PFI_or      157 non-null    category
 12  DI_or       157 non-null    category
dtypes: category(4), float64(3), int64(5), object(1)
memory usage: 12.6+ KB


In [36]:
# see the result

allFree

,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score,FitW_or,IoEF_or,PFI_or,DI_or
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81,5,4,5,5
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05,5,5,5,5
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26,5,4,5,5
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20,5,4,5,5
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15,5,4,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76,1,1,1,1
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72,1,1,1,1
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15,1,1,1,1
155,CUBA,1,12,1,24.3,1,29.00,1,2.84,1,1,1,1


In [37]:
# rename the levels

ordinalLevels={1:'1_veryLow',2:'2_low',3:'3_medium',4:'4_good',5:'5_veryGood'}

renameLevels= lambda x:x.cat.rename_categories(ordinalLevels)

allFree.loc[:,"FitW_or":].apply(renameLevels)

,FitW_or,IoEF_or,PFI_or,DI_or
0,5_veryGood,4_good,5_veryGood,5_veryGood
1,5_veryGood,5_veryGood,5_veryGood,5_veryGood
2,5_veryGood,4_good,5_veryGood,5_veryGood
3,5_veryGood,4_good,5_veryGood,5_veryGood
4,5_veryGood,4_good,5_veryGood,5_veryGood
...,...,...,...,...
152,1_veryLow,1_veryLow,1_veryLow,1_veryLow
153,1_veryLow,1_veryLow,1_veryLow,1_veryLow
154,1_veryLow,1_veryLow,1_veryLow,1_veryLow
155,1_veryLow,1_veryLow,1_veryLow,1_veryLow
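
Renaming changes only the labels. The sketch below, on hypothetical `toy` values, checks that the underlying integer codes and the ordering survive `rename_categories`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

myOrdinal = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
ordinalLevels = {1: '1_veryLow', 2: '2_low', 3: '3_medium', 4: '4_good', 5: '5_veryGood'}

# hypothetical values standing in for one ordinal column
toy = pd.Series([5, 4, 1]).astype(myOrdinal)
renamed = toy.cat.rename_categories(ordinalLevels)

# the integer codes behind the categories are unchanged
sameCodes = (toy.cat.codes == renamed.cat.codes).all()

# the result is still an ordered categorical
stillOrdered = renamed.cat.ordered

# so comparisons against a label keep working
aboveLow = (renamed > '2_low').tolist()

print(sameCodes, stillOrdered, aboveLow)
```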


In [38]:
# assign by column name so the renamed categorical columns replace the old ones
allFree[newNames]=allFree[newNames].apply(renameLevels)
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Country     157 non-null    object  
 1   FitW        157 non-null    int64   
 2   FitW_score  157 non-null    int64   
 3   IoEF        157 non-null    int64   
 4   IoEF_score  157 non-null    float64 
 5   PFI         157 non-null    int64   
 6   PFI_score   157 non-null    float64 
 7   DI          157 non-null    int64   
 8   DI_score    157 non-null    float64 
 9   FitW_or     157 non-null    category
 10  IoEF_or     157 non-null    category
 11  PFI_or      157 non-null    category
 12  DI_or       157 non-null    category
dtypes: category(4), float64(3), int64(5), object(1)
memory usage: 12.6+ KB


In [39]:
allFree

,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score,FitW_or,IoEF_or,PFI_or,DI_or
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81,5_veryGood,4_good,5_veryGood,5_veryGood
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05,5_veryGood,5_veryGood,5_veryGood,5_veryGood
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26,5_veryGood,4_good,5_veryGood,5_veryGood
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20,5_veryGood,4_good,5_veryGood,5_veryGood
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15,5_veryGood,4_good,5_veryGood,5_veryGood
...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76,1_veryLow,1_veryLow,1_veryLow,1_veryLow
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72,1_veryLow,1_veryLow,1_veryLow,1_veryLow
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15,1_veryLow,1_veryLow,1_veryLow,1_veryLow
155,CUBA,1,12,1,24.3,1,29.00,1,2.84,1_veryLow,1_veryLow,1_veryLow,1_veryLow


Let's keep this last result and see what the **pickle** format offers:

In [40]:
#saving

import os

# create the folder if it does not exist yet
os.makedirs("DataFiles", exist_ok=True)

allFree.to_csv(os.path.join("DataFiles","allFree_Py.csv"),index=False )
allFree.to_pickle(os.path.join("DataFiles","allFree.pkl") )

In [41]:
#reading

dfPickle=pd.read_pickle(os.path.join("DataFiles","allFree.pkl") )  
dfCSV=pd.read_csv(os.path.join("DataFiles","allFree_Py.csv") )  

Now, notice the difference when the data include categorical columns:

In [42]:
dfPickle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Country     157 non-null    object  
 1   FitW        157 non-null    int64   
 2   FitW_score  157 non-null    int64   
 3   IoEF        157 non-null    int64   
 4   IoEF_score  157 non-null    float64 
 5   PFI         157 non-null    int64   
 6   PFI_score   157 non-null    float64 
 7   DI          157 non-null    int64   
 8   DI_score    157 non-null    float64 
 9   FitW_or     157 non-null    category
 10  IoEF_or     157 non-null    category
 11  PFI_or      157 non-null    category
 12  DI_or       157 non-null    category
dtypes: category(4), float64(3), int64(5), object(1)
memory usage: 11.9+ KB


In [43]:
dfCSV.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     157 non-null    object 
 1   FitW        157 non-null    int64  
 2   FitW_score  157 non-null    int64  
 3   IoEF        157 non-null    int64  
 4   IoEF_score  157 non-null    float64
 5   PFI         157 non-null    int64  
 6   PFI_score   157 non-null    float64
 7   DI          157 non-null    int64  
 8   DI_score    157 non-null    float64
 9   FitW_or     157 non-null    object 
 10  IoEF_or     157 non-null    object 
 11  PFI_or      157 non-null    object 
 12  DI_or       157 non-null    object 
dtypes: float64(3), int64(5), object(5)
memory usage: 16.1+ KB


In [44]:
# the pickle file kept the data type
dfPickle.DI_or

0      5_veryGood
1      5_veryGood
2      5_veryGood
3      5_veryGood
4      5_veryGood
          ...    
152     1_veryLow
153     1_veryLow
154     1_veryLow
155     1_veryLow
156     1_veryLow
Name: DI_or, Length: 157, dtype: category
Categories (5, object): ['1_veryLow' < '2_low' < '3_medium' < '4_good' < '5_veryGood']

In [45]:
# the CSV file did not keep the data type
dfCSV.DI_or

0      5_veryGood
1      5_veryGood
2      5_veryGood
3      5_veryGood
4      5_veryGood
          ...    
152     1_veryLow
153     1_veryLow
154     1_veryLow
155     1_veryLow
156     1_veryLow
Name: DI_or, Length: 157, dtype: object
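
If only the CSV file is available, the ordinal type can be declared again while reading. This is a minimal sketch that assumes the labels and their order are known. It reads a small hypothetical CSV held in memory rather than the file saved above:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# the labels in their intended order (assumed to be known beforehand)
labels = ['1_veryLow', '2_low', '3_medium', '4_good', '5_veryGood']
labelDtype = CategoricalDtype(categories=labels, ordered=True)

# hypothetical two-column CSV kept in memory instead of on disk
csvText = "Country,DI_or\nA,5_veryGood\nB,1_veryLow\nC,3_medium\n"

# declare the dtype of the column while reading
restored = pd.read_csv(io.StringIO(csvText), dtype={'DI_or': labelDtype})

print(restored.DI_or.dtype)
print(restored.DI_or.max())
```

The same `dtype` dictionary could list every ordinal column, which makes the CSV route workable at the cost of keeping the level definitions in the code.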