# Data Formatting (categorical)


This formatting tutorial covers the categorical case: turning text labels into ordered (ordinal) categories.


Let's get [some data](https://en.wikipedia.org/wiki/List_of_freedom_indices):

In [98]:
import pandas as pd

link='https://github.com/PythonVersusR/OperationsCleaning/raw/main/freedom_Py.csv'
allFree=pd.read_csv(link)

Let's explore:

In [99]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     157 non-null    object 
 1   FitW        157 non-null    object 
 2   FitW_score  157 non-null    int64  
 3   IoEF        157 non-null    object 
 4   IoEF_score  157 non-null    float64
 5   PFI         157 non-null    object 
 6   PFI_score   157 non-null    float64
 7   DI          157 non-null    object 
 8   DI_score    157 non-null    float64
dtypes: float64(3), int64(1), object(5)
memory usage: 11.2+ KB


The clean numeric values were recognised as numeric:

In [100]:
allFree.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
FitW_score,157.0,55.382166,30.094836,2.0,28.0,59.0,83.0,100.0
IoEF_score,157.0,59.064331,11.673919,2.9,52.2,58.5,67.3,83.9
PFI_score,157.0,58.687389,16.961262,21.72,46.21,59.25,71.06,95.18
DI_score,157.0,5.460573,2.242645,1.08,3.31,5.77,7.16,9.81


But the categories are still recognised as object. Let's check the levels again:

In [101]:
pd.unique(allFree.FitW) 

array(['free', 'partly free', 'not free'], dtype=object)

In [102]:
pd.unique(allFree.IoEF)

array(['mostly free', 'free', 'moderately free', 'mostly unfree',
       'repressed'], dtype=object)

In [103]:
pd.unique(allFree.PFI)

array(['good', 'satisfactory', 'problematic', 'difficult', 'very serious'],
      dtype=object)

In [104]:
pd.unique(allFree.DI)

array(['full democracy', 'flawed democracy', 'hybrid regime',
       'authoritarian regime'], dtype=object)

In [105]:
# Guess what this will do before running it:
[list(pd.unique(allFree[x])) for x in allFree.columns[1:8:2]]

[['free', 'partly free', 'not free'],
 ['mostly free', 'free', 'moderately free', 'mostly unfree', 'repressed'],
 ['good', 'satisfactory', 'problematic', 'difficult', 'very serious'],
 ['full democracy',
  'flawed democracy',
  'hybrid regime',
  'authoritarian regime']]
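The slice `1:8:2` used above reads as "start at position 1, stop before position 8, step by 2". As a minimal sketch, assuming an empty toy frame (`toy` is a hypothetical name) with the same column layout as `allFree`:

```python
import pandas as pd

# Empty toy frame that only reproduces the column layout of allFree
toy = pd.DataFrame(columns=['Country', 'FitW', 'FitW_score', 'IoEF', 'IoEF_score',
                            'PFI', 'PFI_score', 'DI', 'DI_score'])

# start at position 1, stop before position 8, take every second column
labelCols = list(toy.columns[1:8:2])
print(labelCols)   # ['FitW', 'IoEF', 'PFI', 'DI']
```

Because the label columns and the score columns alternate, stepping by 2 picks only the label columns.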

It is very important to verify that the strings that represent categories do not need _cleaning_ (e.g. a misspelled 'freee' instead of 'free').
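One way to run that check is to compare the labels found in a column against the labels you expect. The sketch below uses a made-up Series (`toy`) with a misspelling and a stray space; the `expected` set is an assumption you would write yourself for each index:

```python
import pandas as pd

# Hypothetical column: one misspelled label ('freee') and one with a leading space
toy = pd.Series(['free', 'partly free', 'freee', 'not free', ' free'])

# Labels we expect to see (assumed, taken from the levels listed for FitW)
expected = {'free', 'partly free', 'not free'}

# Remove surrounding spaces, then list whatever is still outside the expected set
cleaned = toy.str.strip()
unexpected = sorted(set(cleaned.unique()) - expected)

print(unexpected)              # ['freee']
print(cleaned.value_counts())  # frequency of each label, rare ones hint at typos
```

An empty `unexpected` list suggests the labels are ready to be mapped.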

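Now, let's turn the values into **ordinal** categories. Remember that the worst, best and middle values should be comparable across indices, so each set of labels is mapped onto the same 1 to 5 scale (1 is the worst, 5 is the best):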

In [106]:
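# Assign each result back to its column: calling replace(..., inplace=True) on a
# column accessed as an attribute may not update allFree in recent pandas versions.
# Series.map returns the integers directly (a label missing from the mapper becomes NaN).
mapper1 = {'free':5 , 'partly free': 3, 'not free': 1}
allFree['FitW'] = allFree['FitW'].map(mapper1)

mapper2 = {'repressed':1, 'mostly unfree':2,'moderately free':3, 
       'mostly free':4, 'free':5}
allFree['IoEF'] = allFree['IoEF'].map(mapper2)


mapper3 = {'very serious':1, 'difficult':2,
           'problematic':3,'satisfactory':4,
           'good':5}
allFree['PFI'] = allFree['PFI'].map(mapper3)

# DI has only four levels, so the best one jumps to 5 to match the other indices
mapper4 = {'authoritarian regime':1,'hybrid regime':2,
           'flawed democracy':3, 'full democracy':5}
allFree['DI'] = allFree['DI'].map(mapper4)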
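A label that is missing from a mapper is easy to overlook. As a minimal sketch on made-up data (`toy` and `mapper` are hypothetical names), `Series.map` turns any label that is not a key of the dictionary into NaN, which makes leftovers easy to count and list:

```python
import pandas as pd

# Hypothetical labels, one of them misspelled on purpose
toy = pd.Series(['free', 'not free', 'freee'])
mapper = {'free': 5, 'partly free': 3, 'not free': 1}

mapped = toy.map(mapper)            # labels absent from the mapper become NaN

nMissing = int(mapped.isna().sum())
leftovers = toy[mapped.isna()].tolist()

print(nMissing)    # 1
print(leftovers)   # ['freee']
```

If `nMissing` is zero, every label found a value in the mapper.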


In [107]:
allFree

Unnamed: 0,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15
...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15
155,CUBA,1,12,1,24.3,1,29.00,1,2.84


Let's explore:

In [108]:
#check types:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     157 non-null    object 
 1   FitW        157 non-null    int64  
 2   FitW_score  157 non-null    int64  
 3   IoEF        157 non-null    int64  
 4   IoEF_score  157 non-null    float64
 5   PFI         157 non-null    int64  
 6   PFI_score   157 non-null    float64
 7   DI          157 non-null    int64  
 8   DI_score    157 non-null    float64
dtypes: float64(3), int64(5), object(1)
memory usage: 11.2+ KB


In [109]:
allFree.head()

Unnamed: 0,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26
3,FINLAND,5,100,4,77.1,5,87.94,5,9.2
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15


However, these are still plain integers, not yet ordinal categories. Let's convert them:

In [110]:
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)
allFree.iloc[:,1:8:2].apply(lambda x:x.astype(order)).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   FitW    157 non-null    category
 1   IoEF    157 non-null    category
 2   PFI     157 non-null    category
 3   DI      157 non-null    category
dtypes: category(4)
memory usage: 1.6 KB
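What `ordered=True` buys you can be seen on a small made-up Series (`toy` is a hypothetical name). This is only a sketch: it shows that comparisons and `max()` work on an ordered categorical, while `max()` raises a `TypeError` on an unordered one:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# Made-up values converted to an ordered categorical
toy = pd.Series([5, 3, 1, 3]).astype(order)

isOrdered = toy.cat.ordered            # True
highest = toy.max()                    # 5
atLeastMiddle = (toy >= 3).tolist()    # [True, True, False, True]

# The same values as an unordered categorical: max() is not defined
unordered = pd.Series([5, 3, 1, 3]).astype('category')
try:
    unordered.max()
    failed = False
except TypeError:
    failed = True

print(isOrdered, highest, atLeastMiddle, failed)
```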


In [111]:
allFree.columns[1:8:2]+'_or'

Index(['FitW_or', 'IoEF_or', 'PFI_or', 'DI_or'], dtype='object')

In [112]:
newNames=allFree.columns[1:8:2]+'_or'
allFree[newNames]=allFree.iloc[:,1:8:2].apply(lambda x:x.astype(order))

In [93]:
allFree.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Country     157 non-null    object  
 1   FitW        157 non-null    int64   
 2   FitW_score  157 non-null    int64   
 3   IoEF        157 non-null    int64   
 4   IoEF_score  157 non-null    float64 
 5   PFI         157 non-null    int64   
 6   PFI_score   157 non-null    float64 
 7   DI          157 non-null    int64   
 8   DI_score    157 non-null    float64 
 9   FitW_or     157 non-null    category
 10  IoEF_or     157 non-null    category
 11  PFI_or      157 non-null    category
 12  DI_or       157 non-null    category
dtypes: category(4), float64(3), int64(5), object(1)
memory usage: 12.6+ KB


In [113]:
allFree

Unnamed: 0,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score,FitW_or,IoEF_or,PFI_or,DI_or
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81,5,4,5,5
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05,5,5,5,5
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26,5,4,5,5
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20,5,4,5,5
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15,5,4,5,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76,1,1,1,1
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72,1,1,1,1
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15,1,1,1,1
155,CUBA,1,12,1,24.3,1,29.00,1,2.84,1,1,1,1


You may want to rename the categories with descriptive labels:

In [114]:
ordCats={1:'veryLow',2:'low',3:'medium',4:'good',5:'veryGood'}

turnToOrdinal= lambda x:x.cat.rename_categories(ordCats)

allFree.loc[:,"FitW_or":].apply(turnToOrdinal)

Unnamed: 0,FitW_or,IoEF_or,PFI_or,DI_or
0,veryGood,good,veryGood,veryGood
1,veryGood,veryGood,veryGood,veryGood
2,veryGood,good,veryGood,veryGood
3,veryGood,good,veryGood,veryGood
4,veryGood,good,veryGood,veryGood
...,...,...,...,...
152,veryLow,veryLow,veryLow,veryLow
153,veryLow,veryLow,veryLow,veryLow
154,veryLow,veryLow,veryLow,veryLow
155,veryLow,veryLow,veryLow,veryLow
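Renaming changes only the labels of the categories, not their order. A minimal sketch on made-up values (`toy` is a hypothetical name), reusing the same dictionary:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

order = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)
toy = pd.Series([5, 1, 3]).astype(order)    # made-up values

ordCats = {1: 'veryLow', 2: 'low', 3: 'medium', 4: 'good', 5: 'veryGood'}
renamed = toy.cat.rename_categories(ordCats)

newLevels = list(renamed.cat.categories)
stillOrdered = renamed.cat.ordered
lowest = renamed.min()

print(newLevels)      # ['veryLow', 'low', 'medium', 'good', 'veryGood']
print(stillOrdered)   # True
print(lowest)         # veryLow
```

Levels that do not appear in the data (here 2 and 4) are still kept as categories.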


In [115]:
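# assign by column names so the renamed columns replace the old ones
# (a .loc slice assignment may try to write into the old categories instead)
allFree[newNames]=allFree[newNames].apply(turnToOrdinal)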

In [116]:
allFree

Unnamed: 0,Country,FitW,FitW_score,IoEF,IoEF_score,PFI,PFI_score,DI,DI_score,FitW_or,IoEF_or,PFI_or,DI_or
0,NORWAY,5,100,4,76.9,5,95.18,5,9.81,veryGood,good,veryGood,veryGood
1,IRELAND,5,97,5,82.0,5,89.91,5,9.05,veryGood,veryGood,veryGood,veryGood
2,SWEDEN,5,100,4,77.5,5,88.15,5,9.26,veryGood,good,veryGood,veryGood
3,FINLAND,5,100,4,77.1,5,87.94,5,9.20,veryGood,good,veryGood,veryGood
4,DENMARK,5,97,4,77.6,5,89.48,5,9.15,veryGood,good,veryGood,veryGood
...,...,...,...,...,...,...,...,...,...,...,...,...,...
152,VENEZUELA,1,15,1,25.8,1,36.99,1,2.76,veryLow,veryLow,veryLow,veryLow
153,TURKMENISTAN,1,2,1,46.5,1,25.82,1,1.72,veryLow,veryLow,veryLow,veryLow
154,ERITREA,1,3,1,39.5,1,27.86,1,2.15,veryLow,veryLow,veryLow,veryLow
155,CUBA,1,12,1,24.3,1,29.00,1,2.84,veryLow,veryLow,veryLow,veryLow


Let's keep this last result. This is a good moment to show you the use of the **pickle** format:

In [117]:
#saving

import os 

# create the output folder if it does not exist yet
os.makedirs("DataFiles", exist_ok=True)

allFree.to_csv(os.path.join("DataFiles","allFree.csv"),index=False )
allFree.to_pickle(os.path.join("DataFiles","allFree.pkl") )

In [118]:
#reading

dfPickle=pd.read_pickle(os.path.join("DataFiles","allFree.pkl") )  
dfCSV=pd.read_csv(os.path.join("DataFiles","allFree.csv") )  

Now, notice the difference when you have categorical data:

In [119]:
dfPickle.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   Country     157 non-null    object  
 1   FitW        157 non-null    int64   
 2   FitW_score  157 non-null    int64   
 3   IoEF        157 non-null    int64   
 4   IoEF_score  157 non-null    float64 
 5   PFI         157 non-null    int64   
 6   PFI_score   157 non-null    float64 
 7   DI          157 non-null    int64   
 8   DI_score    157 non-null    float64 
 9   FitW_or     157 non-null    category
 10  IoEF_or     157 non-null    category
 11  PFI_or      157 non-null    category
 12  DI_or       157 non-null    category
dtypes: category(4), float64(3), int64(5), object(1)
memory usage: 11.9+ KB


In [120]:
dfCSV.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Country     157 non-null    object 
 1   FitW        157 non-null    int64  
 2   FitW_score  157 non-null    int64  
 3   IoEF        157 non-null    int64  
 4   IoEF_score  157 non-null    float64
 5   PFI         157 non-null    int64  
 6   PFI_score   157 non-null    float64
 7   DI          157 non-null    int64  
 8   DI_score    157 non-null    float64
 9   FitW_or     157 non-null    object 
 10  IoEF_or     157 non-null    object 
 11  PFI_or      157 non-null    object 
 12  DI_or       157 non-null    object 
dtypes: float64(3), int64(5), object(5)
memory usage: 16.1+ KB


In [122]:
# the pickle file kept the data type (ordered categories)
dfPickle.DI_or

0      veryGood
1      veryGood
2      veryGood
3      veryGood
4      veryGood
         ...   
152     veryLow
153     veryLow
154     veryLow
155     veryLow
156     veryLow
Name: DI_or, Length: 157, dtype: category
Categories (5, object): ['veryLow' < 'low' < 'medium' < 'good' < 'veryGood']

In [123]:
# the CSV file did not keep the data type (plain strings)
dfCSV.DI_or

0      veryGood
1      veryGood
2      veryGood
3      veryGood
4      veryGood
         ...   
152     veryLow
153     veryLow
154     veryLow
155     veryLow
156     veryLow
Name: DI_or, Length: 157, dtype: object
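If you have to work from the CSV file, the ordinal type can be rebuilt when reading. The sketch below is one possible approach, not the only one: it uses a small in-memory CSV with made-up rows (`csvText`) in place of the real file, and assumes the five labels used above:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same labels as above, from worst to best
levels = ['veryLow', 'low', 'medium', 'good', 'veryGood']
ordDtype = CategoricalDtype(categories=levels, ordered=True)

# Stand-in for the CSV file (made-up rows)
csvText = "Country,DI_or\nA,veryGood\nB,veryLow\nC,medium\n"

# Option 1: declare the dtype while reading
df1 = pd.read_csv(io.StringIO(csvText), dtype={'DI_or': ordDtype})

# Option 2: read first, convert afterwards
df2 = pd.read_csv(io.StringIO(csvText))
df2['DI_or'] = df2['DI_or'].astype(ordDtype)

print(df1['DI_or'].cat.ordered)   # True
print(df1['DI_or'].max())         # veryGood
```

Either way, the order of the levels has to be written down again, since the CSV file only stores the text.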