
A categorical variable takes on a limited, usually fixed, number of possible values. Examples are days of the week, months of the year, the number of stars in a review, and blood type. Some categorical variables have an order, such as 'agree' vs. 'strongly agree'.

`Categorical` is the pandas data type for categorical variables. Each value in categorical data is either one of the `categories` or `np.nan`.

Defining a series as `Categorical` makes its use more precise:
- Arbitrary strings or values cannot be stored in the series. (A rating cannot have 100 stars.)
- An order can be defined that does not depend on lexical order ("strongly disagree" < "disagree").
- It is useful for other tasks such as classification and graph plotting.
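The second point can be sketched as follows. This is a minimal illustration, assuming a four-level agreement scale; the names `levels`, `agreement`, and `responses` are made up for the example and do not come from the notebook.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical survey scale, listed from lowest to highest.
levels = ["strongly disagree", "disagree", "agree", "strongly agree"]
agreement = CategoricalDtype(categories=levels, ordered=True)

responses = pd.Series(["agree", "strongly disagree", "disagree"], dtype=agreement)

# Sorting and comparisons follow the declared order, not the alphabet.
sorted_responses = responses.sort_values().tolist()
lowest = responses.min()
at_least_agree = (responses >= "agree").tolist()

print(sorted_responses)
print(lowest)
print(at_least_agree)
```

With plain strings, "agree" would sort before "disagree"; with the ordered dtype, "strongly disagree" comes first.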





In [None]:
import pandas as pd
import numpy as np

In [None]:
s = pd.Series(["a", "a", "c", "b"],dtype="category")
s

0    a
1    a
2    c
3    b
dtype: category
Categories (3, object): ['a', 'b', 'c']

In [None]:
pd.Series(["a", "a", "c", "b"])

0    a
1    a
2    c
3    b
dtype: object

In [None]:
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")

df

Unnamed: 0,A,B
0,a,a
1,b,b
2,c,c
3,a,a


In [None]:
df.B

0    a
1    b
2    c
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']

In [None]:
df.A

0    a
1    b
2    c
3    a
Name: A, dtype: object

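A value that is not one of the defined categories cannot be assigned to a categorical Series. The plain `object` column `A` accepts any value, but the categorical column `B` rejects it.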

In [None]:
df.loc[2, "A"] = "d"  # .loc avoids chained assignment and its SettingWithCopyWarning
df.A

0    a
1    b
2    d
3    a
Name: A, dtype: object

In [None]:
df.loc[2, "B"] = "d"  # fails: "d" is not one of the categories of B

ValueError: ignored
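If the new value is legitimate, the category has to be registered before the assignment. The sketch below uses `Series.cat.add_categories` on a standalone series named `ratings` (a name chosen for this example). It catches both `TypeError` and `ValueError` on the assumption that the exception type differs between pandas versions.

```python
import pandas as pd

ratings = pd.Series(["a", "b", "c", "a"], dtype="category")

try:
    ratings[2] = "d"  # "d" is not a category yet, so this is rejected
    rejected = False
except (TypeError, ValueError):  # assumed: the type depends on the pandas version
    rejected = True

# Register the new category first; after that the assignment is allowed.
ratings = ratings.cat.add_categories(["d"])
ratings[2] = "d"

print(rejected)
print(ratings.tolist())
print(list(ratings.cat.categories))
```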

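A category can stay defined even when no value in the Series uses it. After the assignment below, `c` no longer appears in the data but is still listed among the categories.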

In [None]:
df.loc[2, "B"] = "b"
df.B

0    a
1    b
2    b
3    a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']
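Unused categories still show up in summaries such as `value_counts`, and they can be dropped with `Series.cat.remove_unused_categories`. A small sketch on a standalone series `s` that mirrors the state of `df.B` above:

```python
import pandas as pd

# Same values as df.B above: "c" is a category but no row uses it.
s = pd.Series(["a", "b", "b", "a"], dtype=pd.CategoricalDtype(["a", "b", "c"]))

# value_counts reports every category, including the unused one with a count of 0.
counts = s.value_counts().to_dict()

# Dropping unused categories leaves only the ones present in the data.
trimmed = s.cat.remove_unused_categories()

print(counts)
print(list(trimmed.cat.categories))
```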

In [None]:
df.loc[2, "B"] = np.nan  # np.nan is always allowed; the np.NaN alias was removed in NumPy 2.0
df.B


0      a
1      b
2    NaN
3      a
Name: B, dtype: category
Categories (3, object): ['a', 'b', 'c']
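A missing value is not itself a category. Internally, each value is stored as an integer code pointing into the category list, and a missing value gets the code `-1`. The sketch below shows this on a standalone series `s`:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")
s[2] = np.nan  # always permitted, whatever the categories are

codes = s.cat.codes.tolist()         # integer position of each value in the category list
categories = list(s.cat.categories)  # the category list itself is unchanged
missing = s.isna().tolist()

print(codes)
print(categories)
print(missing)
```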

In [None]:
df

Unnamed: 0,A,B
0,a,a
1,b,b
2,d,NaN
3,a,a


## Transforming values using a map

In [None]:
d = pd.DataFrame({"A": ["alpha", "beta", "gamma", "alpha"]})

In [None]:
my_map = {"alpha" : 1, "beta" : 2, "gamma" : 3}

In [None]:
d.A

0    alpha
1     beta
2    gamma
3    alpha
Name: A, dtype: object

In [None]:
d.A.map(my_map)

0    1
1    2
2    3
3    1
Name: A, dtype: int64
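When a dictionary is passed to `map`, any value without a matching key becomes `NaN`, which also turns the result into a float column. A minimal sketch, using a hypothetical series `letters` and an incomplete mapping `partial_map`:

```python
import pandas as pd

letters = pd.Series(["alpha", "beta", "delta"])
partial_map = {"alpha": 1, "beta": 2}  # no entry for "delta"

mapped = letters.map(partial_map)

# One way to handle the gap: fill with a default and restore an integer dtype.
filled = mapped.fillna(0).astype(int)

print(mapped.isna().tolist())
print(filled.tolist())
```

Whether to fill, drop, or raise on unmapped values depends on the data; filling with `0` here is only for illustration.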

Exercise: convert the days of the week to a categorical variable.
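One possible approach to the days-of-the-week task is an ordered `CategoricalDtype`. The sketch assumes the week starts on Monday, and the sample series `visits` is invented for the example.

```python
import pandas as pd

# Assumption: the week starts on Monday.
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
day_type = pd.CategoricalDtype(categories=day_names, ordered=True)

# Hypothetical sample data.
visits = pd.Series(["Wednesday", "Monday", "Sunday", "Monday"]).astype(day_type)

# Values sort in calendar order rather than alphabetically.
in_week_order = visits.sort_values().tolist()

# Each day is stored as its position in day_names.
codes = visits.cat.codes.tolist()

print(in_week_order)
print(codes)
```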