# Dealing with non-numerical values in a Dataframe

We often come across a dataframe whose columns have data types such as int, float or object. When a column is non-numerical (for example, when it holds strings), pandas typically stores it with the 'object' dtype.

But machine learning algorithms work on numbers, so we need a numeric representation of strings. Moreover, operations on strings can be slow compared to operations on numbers. So we convert columns of **'object'** type to **'category'**.

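In pandas, the 'category' dtype encodes categorical data numerically: each distinct value is stored once, and every row holds only a small integer code that points to it. This reduces memory use and can speed up the code.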
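To make the numeric encoding concrete, here is a minimal sketch on a small made-up Series (the values are illustrative and are not read from the Bigmart file). It uses the `.cat` accessor to show the stored labels and the integer code assigned to each row.

```python
import pandas as pd

# Made-up sample values, chosen to resemble the 'Tier' labels used later on.
s = pd.Series(['Tier 1', 'Tier 3', 'Tier 1', 'Tier 2'])
cat = s.astype('category')

# The distinct labels are stored once, in sorted order.
categories = list(cat.cat.categories)

# Each row holds only an integer code that indexes into the labels.
codes = cat.cat.codes.tolist()

print(categories)  # ['Tier 1', 'Tier 2', 'Tier 3']
print(codes)       # [0, 2, 0, 1]
```

Note that these codes are an internal storage detail. They follow the sorted order of the labels, which may or may not be a meaningful order for a model.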

Let us take a look at a dataset from Bigmart, which is a shopping mall. Let's load the dataset first.

In [80]:
#import pandas library
import pandas as pd

In [81]:
#load the dataset
df = pd.read_csv('Bigmart_Dataset.csv')

Explore the first few rows with the head() method to get an idea of which columns we have.

In [82]:
df.head()

| | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


Use the info() method to check the number of rows and the type of each column.

In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8523 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              8523 non-null object
Item_Weight                  7060 non-null float64
Item_Fat_Content             8523 non-null object
Item_Visibility              8523 non-null float64
Item_Type                    8523 non-null object
Item_MRP                     8523 non-null float64
Outlet_Identifier            8523 non-null object
Outlet_Establishment_Year    8523 non-null int64
Outlet_Size                  6113 non-null object
Outlet_Location_Type         8523 non-null object
Outlet_Type                  8523 non-null object
Item_Outlet_Sales            8523 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 799.1+ KB


As you can see, we have a total of 8523 entries, and some columns have missing values: Item_Weight has only 7060 non-null entries and Outlet_Size only 6113. For the time being we drop the rows with missing values using the dropna() method, but in general this is not advisable. Whenever possible, we should impute missing values instead.
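As a rough illustration of what imputing could look like, the sketch below fills a numeric column with its median and a text column with its most frequent value. The tiny `toy` dataframe is made up for the example, and median/mode are just one common choice assumed here, not a strategy prescribed by this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the two columns that contain missing values.
toy = pd.DataFrame({
    'Item_Weight': [9.3, np.nan, 17.5, 8.93],
    'Outlet_Size': ['Medium', 'Medium', None, 'High'],
})

imputed = toy.copy()

# Numeric column: replace missing values with the column median.
weight_median = imputed['Item_Weight'].median()
imputed['Item_Weight'] = imputed['Item_Weight'].fillna(weight_median)

# Text column: replace missing values with the most frequent label.
most_common_size = imputed['Outlet_Size'].mode()[0]
imputed['Outlet_Size'] = imputed['Outlet_Size'].fillna(most_common_size)

print(imputed)
```

Unlike dropna(), this keeps every row, at the cost of introducing values that were never actually observed.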

In [84]:
df = df.dropna()

Now check the info of the dataframe again.

In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4650 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              4650 non-null object
Item_Weight                  4650 non-null float64
Item_Fat_Content             4650 non-null object
Item_Visibility              4650 non-null float64
Item_Type                    4650 non-null object
Item_MRP                     4650 non-null float64
Outlet_Identifier            4650 non-null object
Outlet_Establishment_Year    4650 non-null int64
Outlet_Size                  4650 non-null object
Outlet_Location_Type         4650 non-null object
Outlet_Type                  4650 non-null object
Item_Outlet_Sales            4650 non-null float64
dtypes: float64(4), int64(1), object(7)
memory usage: 472.3+ KB


As you can see, no column has missing values any more, although only 4650 of the original 8523 rows remain. Now we will convert a few columns of object type to category.

Consider column 'Outlet_Location_Type'.

In [86]:
df['Outlet_Location_Type'].head()

0    Tier 1
1    Tier 3
2    Tier 1
4    Tier 3
5    Tier 3
Name: Outlet_Location_Type, dtype: object

Check unique values of that column.

In [87]:
df.Outlet_Location_Type.unique()

array(['Tier 1', 'Tier 3', 'Tier 2'], dtype=object)

As you can see, the cell output above shows 3 unique values. Let us convert this column to category.

In [88]:
df.Outlet_Location_Type = df.Outlet_Location_Type.astype('category')
df.Outlet_Location_Type.head()

0    Tier 1
1    Tier 3
2    Tier 1
4    Tier 3
5    Tier 3
Name: Outlet_Location_Type, dtype: category
Categories (3, object): [Tier 1, Tier 2, Tier 3]

In [89]:
dummies = pd.get_dummies(df[['Outlet_Location_Type']], prefix_sep='_')

In [90]:
dummies.head()

| | Outlet_Location_Type_Tier 1 | Outlet_Location_Type_Tier 2 | Outlet_Location_Type_Tier 3 |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| 5 | 0 | 0 | 1 |


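The get_dummies() function creates one new column per category, holding 1 where the row belongs to that category and 0 otherwise. This is also called a 'Binary Indicator' representation (or one-hot encoding).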
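The indicator columns are usually put back next to the remaining features before modelling. The sketch below shows one way to do that on a made-up three-row dataframe. Passing `dtype=int` is an assumption made here so that the output is 0/1, since newer pandas versions return True/False by default.

```python
import pandas as pd

# Made-up rows; only the column names follow the Bigmart dataset.
toy = pd.DataFrame({
    'Outlet_Location_Type': ['Tier 1', 'Tier 3', 'Tier 2'],
    'Item_MRP': [249.81, 48.27, 141.62],
})

# One 0/1 column per distinct label, prefixed with the source column name.
dummies = pd.get_dummies(toy[['Outlet_Location_Type']], prefix_sep='_', dtype=int)

# Swap the original text column for its indicator columns.
encoded = pd.concat([toy.drop(columns='Outlet_Location_Type'), dummies], axis=1)

print(encoded.columns.tolist())
```

Each row has exactly one indicator set to 1, so the original label can always be recovered from the indicator columns.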

In [91]:
df['Outlet_Location_Type'].head()

0    Tier 1
1    Tier 3
2    Tier 1
4    Tier 3
5    Tier 3
Name: Outlet_Location_Type, dtype: category
Categories (3, object): [Tier 1, Tier 2, Tier 3]

If you check the info of the dataframe again, you will see that Outlet_Location_Type is no longer of type object but category.

In the same way, we can convert the other 'object' type columns to category by looping over their names as follows.

In [92]:
for col in ['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size', 'Outlet_Type']:
    df[col] = df[col].astype('category')
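Instead of typing the column names by hand, the list can also be built from the dtypes. The following is a minimal sketch on a made-up dataframe. The `skip` set is a hypothetical helper introduced here to leave identifier-like columns untouched, mirroring how Item_Identifier is left out of the loop above.

```python
import pandas as pd

# Made-up rows; columns are forced to object dtype to mirror the dataset.
toy = pd.DataFrame({
    'Item_Identifier': pd.Series(['FDA15', 'DRC01', 'FDN15'], dtype=object),
    'Item_Fat_Content': pd.Series(['Low Fat', 'Regular', 'Low Fat'], dtype=object),
    'Outlet_Size': pd.Series(['Medium', 'Medium', 'High'], dtype=object),
    'Item_MRP': [249.81, 48.27, 141.62],
})

# Hypothetical exclusion list: columns to keep as plain objects.
skip = {'Item_Identifier'}

# Collect every remaining object-typed column automatically.
object_cols = [c for c in toy.select_dtypes(include=['object']).columns
               if c not in skip]

for col in object_cols:
    toy[col] = toy[col].astype('category')

print(toy.dtypes)
```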

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4650 entries, 0 to 8522
Data columns (total 12 columns):
Item_Identifier              4650 non-null object
Item_Weight                  4650 non-null float64
Item_Fat_Content             4650 non-null category
Item_Visibility              4650 non-null float64
Item_Type                    4650 non-null category
Item_MRP                     4650 non-null float64
Outlet_Identifier            4650 non-null category
Outlet_Establishment_Year    4650 non-null int64
Outlet_Size                  4650 non-null category
Outlet_Location_Type         4650 non-null category
Outlet_Type                  4650 non-null category
Item_Outlet_Sales            4650 non-null float64
dtypes: category(6), float64(4), int64(1), object(1)
memory usage: 283.0+ KB


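As seen in the info above, the columns listed in the loop are converted from 'object' to 'category' type. Only Item_Identifier, which was not included in the loop, is still of type 'object'. The reported memory usage also drops from 472.3+ KB to 283.0+ KB.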
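The '+' in the memory figures means that info() did not count the string contents of object columns, so the real saving can be larger than reported. A small sketch with made-up data shows how one might compare the full footprint using memory_usage(deep=True). The exact byte counts depend on the pandas and Python versions, so none are quoted here.

```python
import pandas as pd

# Made-up column: 3000 rows but only 3 distinct labels.
labels = pd.Series(['Tier 1', 'Tier 2', 'Tier 3'] * 1000, dtype=object)
as_category = labels.astype('category')

# deep=True also counts the Python string objects behind an object column.
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)
```

The saving is largest when a column has few distinct values relative to its number of rows, as is the case for the columns converted above.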