# Categorical Data in Pandas

Hi guys, welcome to [Tirendaz Academy](https://youtube.com/c/tirendazacademy) 😀

In this notebook, I'm going to show how to work with categorical data in Pandas.

Happy learning 🐱‍🏍

## How is a variable translated into a categorical structure?

In [12]:
import pandas as pd 
import numpy as np

In [13]:
data=pd.Series(["Tim","Tom","Sam","Sam"]*3)
data

0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
dtype: object

In [14]:
pd.unique(data)

array(['Tim', 'Tom', 'Sam'], dtype=object)

In [15]:
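# The top-level pd.value_counts(data) is deprecated in recent pandas versions;
# data is already a Series, so call the method on it directly.
data.value_counts()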

Sam    6
Tim    3
Tom    3
Name: count, dtype: int64

In [18]:
values=pd.Series([0,1,0,0]*3)
values

0     0
1     1
2     0
3     0
4     0
5     1
6     0
7     0
8     0
9     1
10    0
11    0
dtype: int64

In [17]:
names=pd.Series(["Tim","Sam"])
names.take(values)

0    Tim
1    Sam
0    Tim
0    Tim
0    Tim
1    Sam
0    Tim
0    Tim
0    Tim
1    Sam
0    Tim
0    Tim
dtype: object
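The two cells above hint at the idea behind categorical data: keep each distinct value once and refer to it by a small integer code. As a minimal sketch, `pd.factorize` can derive both pieces from the original strings. The variable names `codes`, `uniques` and `restored` are illustrative and not part of the original notebook.

```python
import pandas as pd

data = pd.Series(["Tim", "Tom", "Sam", "Sam"] * 3)

# factorize returns integer codes plus the distinct values,
# listed in order of first appearance
codes, uniques = pd.factorize(data)
print(codes)
print(uniques)

# Taking the distinct values at each code rebuilds the original values
restored = pd.Series(uniques.take(codes))
print(restored.equals(data))
```

Unlike the hand-written `names` and `values` above, this keeps all three names, so nothing is lost in the round trip.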

## Categorical Type in Pandas

In [19]:
data

0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
dtype: object

In [20]:
N=len(data)

In [21]:
df = pd.DataFrame(
    {
        "name": data,
        "num": np.arange(N),
        "score": np.random.randint(40, 100, size=N),
        "weight": np.random.uniform(50, 70, size=N),
    },
    columns=["num", "name", "score", "weight"],
)

In [23]:
df

   num name  score     weight
0    0  Tim     49  62.471251
1    1  Tom     99  55.868881
2    2  Sam     64  61.479556
3    3  Sam     60  57.606994
4    4  Tim     99  50.073494
5    5  Tom     55  59.055780
6    6  Sam     57  65.173716
7    7  Sam     72  62.896121
8    8  Tim     66  69.602391
9    9  Tom     59  68.148746


In [24]:
df["name"]

0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
Name: name, dtype: object

In [29]:
type(df["name"])

pandas.core.series.Series

In [31]:
# astype(x) converts the column's dtype to x
name_cat=df["name"].astype("category")
name_cat

0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']

In [35]:
x=name_cat.values
x

['Tim', 'Tom', 'Sam', 'Sam', 'Tim', ..., 'Sam', 'Tim', 'Tom', 'Sam', 'Sam']
Length: 12
Categories (3, object): ['Sam', 'Tim', 'Tom']

In [33]:
x.categories

Index(['Sam', 'Tim', 'Tom'], dtype='object')

In [36]:
x.codes

array([1, 2, 0, 0, 1, 2, 0, 0, 1, 2, 0, 0], dtype=int8)

In [38]:
df["name"]=df["name"].astype("category")
df.name

0     Tim
1     Tom
2     Sam
3     Sam
4     Tim
5     Tom
6     Sam
7     Sam
8     Tim
9     Tom
10    Sam
11    Sam
Name: name, dtype: category
Categories (3, object): ['Sam', 'Tim', 'Tom']

In [43]:
data_cat=pd.Categorical(list("abcde"))
data_cat

['a', 'b', 'c', 'd', 'e']
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [45]:
pd.Categorical(["banana", "apple", 
                "kiwi", "banana", "apple"])

['banana', 'apple', 'kiwi', 'banana', 'apple']
Categories (3, object): ['apple', 'banana', 'kiwi']
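By default the categories are inferred from the data and sorted. As a small sketch, the `categories` argument of `pd.Categorical` lets you fix the allowed labels and their order yourself; the name `fruit_cat` is only used for this illustration.

```python
import pandas as pd

fruits = ["banana", "apple", "kiwi", "banana", "apple"]

# Explicit categories set both the allowed labels and their order
fruit_cat = pd.Categorical(fruits, categories=["banana", "apple"])
print(fruit_cat)

# "kiwi" is not a listed category, so it becomes missing (code -1)
print(fruit_cat.codes)
print(fruit_cat.isna())
```

Values outside the listed categories are not an error; they silently turn into NaN, which is worth checking for after the conversion.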

In [47]:
people=["baby", "child", "young", "old"]
codes=[0,1,2,3,1,0,0]
people_cat=pd.Categorical.from_codes(
    codes,people)
people_cat

['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby', 'child', 'young', 'old']

In [48]:
people_cat=pd.Categorical.from_codes(
    codes,people,ordered=True)
people_cat

['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby' < 'child' < 'young' < 'old']

In [49]:
people_cat.as_ordered()

['baby', 'child', 'young', 'old', 'child', 'baby', 'baby']
Categories (4, object): ['baby' < 'child' < 'young' < 'old']
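An ordered categorical is more than a display detail: comparisons, `min`/`max` and sorting follow the declared category order instead of alphabetical order. The sketch below rebuilds `people_cat` so it runs on its own; `older_than_child` and `sorted_people` are names made up for the example.

```python
import pandas as pd

people = ["baby", "child", "young", "old"]
codes = [0, 1, 2, 3, 1, 0, 0]
people_cat = pd.Categorical.from_codes(codes, people, ordered=True)

# min and max follow the category order, not the alphabet
print(people_cat.min(), people_cat.max())

# Comparing against a category label uses the same order
older_than_child = people_cat > "child"
print(older_than_child)

# Sorting a Series of ordered categories also respects the order
sorted_people = pd.Series(people_cat).sort_values().tolist()
print(sorted_people)
```

On an unordered categorical, the `>` comparison would raise a `TypeError`, which is why the `ordered=True` flag matters here.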

## Working with Categorical

In [50]:
data=np.random.randn(1000)

In [53]:
interval=pd.qcut(data,4)
interval

[(-3.553, -0.672], (-0.672, 0.00388], (0.609, 2.89], (-3.553, -0.672], (0.00388, 0.609], ..., (-0.672, 0.00388], (0.00388, 0.609], (0.609, 2.89], (0.00388, 0.609], (-3.553, -0.672]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.553, -0.672] < (-0.672, 0.00388] < (0.00388, 0.609] < (0.609, 2.89]]

In [54]:
type(interval)

pandas.core.arrays.categorical.Categorical

In [55]:
interval = pd.qcut(data, 4, labels=["Q1", "Q2", "Q3", "Q4"])
interval

['Q1', 'Q2', 'Q4', 'Q1', 'Q3', ..., 'Q2', 'Q3', 'Q4', 'Q3', 'Q1']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [56]:
interval=pd.Series(interval,name="quarter")

In [64]:
(pd.Series(data)
   .groupby(interval, observed=False)
   .agg(["count", "min", "max"])
   .reset_index())

  quarter  count       min       max
0      Q1    250 -3.552203 -0.677975
1      Q2    250 -0.669523  0.003683
2      Q3    250  0.004086  0.608010
3      Q4    250  0.610576  2.890387
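The equal `count` values are no accident: `pd.qcut` cuts at sample quantiles, so each bin receives about the same number of observations. Here is a quick check, assuming a seeded NumPy generator (not used in the original notebook) so the run is repeatable.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made here for repeatability
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)

quarter = pd.qcut(data, 4, labels=["Q1", "Q2", "Q3", "Q4"])

# Count the rows that fall into each quartile label
counts = pd.Series(quarter).value_counts(sort=False)
print(counts)
```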


## How do categorical types perform?

In [65]:
N=10000000
num=pd.Series(np.random.randn(N))

In [75]:
label=pd.Series(["a","b","c","d"]*(N//4))

In [76]:
cat=label.astype("category")

In [77]:
label.memory_usage()

80000132

In [78]:
cat.memory_usage()

10000336
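For an `object` column, the default `memory_usage()` counts only the 8-byte references to the Python strings, not the strings themselves, so the figure for `label` above understates its real footprint. A hedged sketch with `deep=True` follows; it uses a smaller `N` than the notebook to keep it quick, and the exact numbers depend on the pandas version and string storage in use.

```python
import pandas as pd

N = 1_000_000  # smaller than the notebook's 10 million, to run quickly
label = pd.Series(["a", "b", "c", "d"] * (N // 4))
cat = label.astype("category")

# deep=True also measures the memory held by the string objects
shallow = label.memory_usage()
deep = label.memory_usage(deep=True)
cat_deep = cat.memory_usage(deep=True)

print(shallow, deep, cat_deep)
print(f"categorical is about {deep / cat_deep:.0f}x smaller")
```

The saving comes from storing one small integer code per row plus a single copy of each distinct label, so it shrinks as the number of distinct values grows.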

## What are categorical methods?

In [81]:
s=pd.Series(["a","b","c","d"]*2)
s

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: object

In [82]:
s_ct=s.astype("category")
s_ct

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [83]:
s_ct.cat.codes

0    0
1    1
2    2
3    3
4    0
5    1
6    2
7    3
dtype: int8

In [84]:
s_ct.cat.categories

Index(['a', 'b', 'c', 'd'], dtype='object')

In [85]:
new_ct=["a","b","c","d","e"]
s_ct.cat.set_categories(new_ct)

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']

In [86]:
s2_ct=s_ct[s_ct.isin(["a","b"])]
s2_ct             

0    a
1    b
4    a
5    b
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [87]:
s2_ct.cat.remove_unused_categories()

0    a
1    b
4    a
5    b
dtype: category
Categories (2, object): ['a', 'b']
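A few more methods on the `.cat` accessor are worth knowing. This is a short sketch rather than a full tour; `added`, `renamed` and `narrowed` are illustrative names. Each method returns a new Series and leaves `s_ct` unchanged.

```python
import pandas as pd

s_ct = pd.Series(["a", "b", "c", "d"] * 2).astype("category")

# add_categories appends new labels without touching existing values
added = s_ct.cat.add_categories(["e"])
print(added.cat.categories.tolist())

# rename_categories relabels categories; a dict maps old names to new ones
renamed = s_ct.cat.rename_categories({"a": "A"})
print(renamed.tolist())

# set_categories with fewer labels turns the dropped values into NaN
narrowed = s_ct.cat.set_categories(["a", "b"])
print(narrowed.isna().sum())
```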

## How to create a dummy variable?

In [88]:
s_ct

0    a
1    b
2    c
3    d
4    a
5    b
6    c
7    d
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']

In [92]:
pd.get_dummies(s_ct)

       a      b      c      d
0   True  False  False  False
1  False   True  False  False
2  False  False   True  False
3  False  False  False   True
4   True  False  False  False
5  False   True  False  False
6  False  False   True  False
7  False  False  False   True
