<img src="https://i.imgur.com/6U6q5jQ.png"/>

# Formatting Categorical data in Python


This formatting tutorial covers the categorical case. Let's open a file we created earlier about [Freedom Indices](https://en.wikipedia.org/wiki/List_of_freedom_indices):

In [None]:
import pandas as pd

link='https://github.com/PythonVersusR/OperationsCleaning/raw/main/freedom_Py.csv'
allFree=pd.read_csv(link)

Let's explore:

In [None]:
allFree.info()

Notice that the clean numeric values were recognised as numeric (that may not always be the case, so always verify). When that is the case, statistics can be obtained:

In [None]:
allFree.describe().T
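When a numeric column does arrive as text (for instance, because of a stray marker in one cell), one way to verify and repair it is `pd.to_numeric`. This is a minimal sketch on made-up data; the `toy` frame and its `score` column are hypothetical and not part of the freedom file:

```python
import pandas as pd

# hypothetical toy data: one bad cell keeps the whole column as text
toy = pd.DataFrame({"score": ["7.5", "8.1", "n/a", "6.0"]})
is_numeric_before = pd.api.types.is_numeric_dtype(toy["score"])

# errors="coerce" turns unparseable cells into NaN instead of raising
toy["score"] = pd.to_numeric(toy["score"], errors="coerce")
is_numeric_after = pd.api.types.is_numeric_dtype(toy["score"])

# count how many cells could not be parsed
n_missing = int(toy["score"].isna().sum())
```

Counting the coerced cells afterwards shows how much information was lost in the conversion.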

But the categorical columns are still stored as object. Let's check the levels again:

In [None]:
allFree.iloc[:,1::2]

Recall the levels (the data must have been cleaned beforehand):

In [None]:
[{x:set(pd.unique(allFree[x]))} for x in allFree.iloc[:,1::2].columns]

Now, let's turn the values into **ordinal** categories. Remember that the worst, best and middle values should be comparable:

In [None]:
# assign values so that the worst and best levels share the same code across indices

mapper1 = {'not free': 1 ,'partly free': 3,'free':5}
mapper2 = {'repressed':1, 'mostly unfree':2,'moderately free':3, 'mostly free':4, 'free':5}
mapper3 = {'very serious':1, 'difficult':2,'problematic':3,'satisfactory':4,'good':5}
mapper4 = {'authoritarian regime':1,'hybrid regime':2,'flawed democracy':4, 'full democracy':5}

# assign the result back instead of relying on inplace=True,
# which may not update the data frame when called on a selected column
allFree['FitW']=allFree['FitW'].replace(mapper1)
allFree['IoEF']=allFree['IoEF'].replace(mapper2)
allFree['PFI']=allFree['PFI'].replace(mapper3)
allFree['DI']=allFree['DI'].replace(mapper4)
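Note that `replace` leaves any label missing from the mapper untouched, so a misspelled level can slip through unnoticed. A quick coverage check before recoding can catch this. The sketch below uses a hypothetical `labels` series (not the real data) with one deliberately dirty value:

```python
import pandas as pd

# hypothetical labels; 'Free ' has a capital letter and a trailing space
labels = pd.Series(["free", "not free", "partly free", "Free "])
mapper = {"not free": 1, "partly free": 3, "free": 5}

# labels present in the data but absent from the mapper
uncovered = set(labels.dropna().unique()) - set(mapper)

# map() returns NaN for uncovered labels, which makes gaps easy to count
codes = labels.map(mapper)
n_unmapped = int(codes.isna().sum())
```

An empty `uncovered` set suggests the mapper handles every level that actually appears in the column.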


The result:

In [None]:
allFree

In [None]:
#check types:
allFree.info()

We have integers instead of categories. Let's create ordinal columns:

In [None]:
# new column names
newNames=allFree.columns[1::2]+'_or'
newNames

In [None]:
# copy the previous values
allFree[newNames]=allFree.iloc[:,1::2]
allFree

In [None]:
# turn integers into ordinal levels

# create the data type info
from pandas.api.types import CategoricalDtype
myOrdinal = CategoricalDtype(categories=[1,2,3,4,5], ordered=True)

#one column
allFree.loc[:,"FitW_or"].astype(myOrdinal)

In [None]:
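# several columns
# assign whole columns: in recent pandas versions, .loc[:, ...] = writes into
# the existing integer columns and keeps their dtype
allFree[newNames]=allFree[newNames].astype(myOrdinal)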
allFree.info()
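The point of an ordered categorical is that its levels can be compared and ranked. A minimal sketch, using a small hypothetical series rather than the `allFree` columns:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

ordinal = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# hypothetical codes standing in for one recoded column
codes = pd.Series([5, 1, 3, 3]).astype(ordinal)

# ordered categories support comparisons against a level...
above_middle = (codes > 3).tolist()

# ...as well as min and max
lowest, highest = codes.min(), codes.max()
```

With `ordered=False` the same comparison would raise an error, since unordered categories have no notion of greater or smaller.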

In [None]:
## see

allFree

In [None]:
# rename the levels

ordinalLevels={1:'1_veryLow',2:'2_low',3:'3_medium',4:'4_good',5:'5_veryGood'}

renameLevels= lambda x:x.cat.rename_categories(ordinalLevels)

allFree.loc[:,"FitW_or":].apply(renameLevels)

In [None]:
# assign whole columns so the renamed categories replace the old ones
allFree[newNames]=allFree[newNames].apply(renameLevels)
allFree.info()
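Renaming changes only the labels; the order of the levels should be preserved. The following sketch checks that on a hypothetical three-level series (the label names are illustrative):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

ordinal = CategoricalDtype(categories=[1, 2, 3], ordered=True)
levels = pd.Series([3, 1, 2]).astype(ordinal)

# hypothetical labels; the numeric prefix keeps alphabetical and ordinal order aligned
labels = {1: "1_low", 2: "2_medium", 3: "3_high"}
renamed = levels.cat.rename_categories(labels)

new_categories = list(renamed.cat.categories)
still_ordered = renamed.cat.ordered
top = renamed.max()
```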

In [None]:
allFree

Let's keep this last result, and use it to show how the **pickle** format works:

In [None]:
#saving

import os

# make sure the target folder exists before writing
os.makedirs("DataFiles", exist_ok=True)

allFree.to_csv(os.path.join("DataFiles","allFree_Py.csv"),index=False)
allFree.to_pickle(os.path.join("DataFiles","allFree.pkl"))

In [None]:
#reading

dfPickle=pd.read_pickle(os.path.join("DataFiles","allFree.pkl"))
dfCSV=pd.read_csv(os.path.join("DataFiles","allFree_Py.csv"))

Now, notice the difference when you have categorical data:

In [None]:
dfPickle.info()

In [None]:
dfCSV.info()

In [None]:
# the file kept the data type
dfPickle.DI_or

In [None]:
# the file did not keep the data type
dfCSV.DI_or
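If only the CSV file is available, the ordinal type can be rebuilt by declaring it while reading. This is a minimal sketch under assumptions: the one-column frame `df` is a hypothetical stand-in for `allFree`, and the CSV is kept in memory instead of being written to disk:

```python
import io
import pandas as pd
from pandas.api.types import CategoricalDtype

levels = ["1_veryLow", "2_low", "3_medium", "4_good", "5_veryGood"]
ordinal = CategoricalDtype(categories=levels, ordered=True)

# hypothetical one-column frame standing in for allFree
df = pd.DataFrame({"DI_or": pd.Series(["5_veryGood", "2_low"]).astype(ordinal)})

# round trip through CSV text held in memory
csv_text = df.to_csv(index=False)

# a plain read loses the categorical type
plain = pd.read_csv(io.StringIO(csv_text))

# declaring the dtype while reading rebuilds the ordered categories
restored = pd.read_csv(io.StringIO(csv_text), dtype={"DI_or": ordinal})
```

The trade-off is that the level names and their order have to be kept somewhere outside the CSV, whereas the pickle file carries them along.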