In [2]:
import pandas as pd

In [3]:
# create a DataFrame of letter grades in descending order; we can also set an index
df = pd.DataFrame(['A+', 'A', 'A-', 'B+', 'B', 'B-', 'C+', 'C', 'C-', 'D+', 'D'],
                  index=['excellent', 'excellent', 'excellent', 'good', 'good', 'good', 'ok', 'ok', 'ok', 'poor', 'poor'],
                  columns=["Grades"])
df

Unnamed: 0,Grades
excellent,A+
excellent,A
excellent,A-
good,B+
good,B
good,B-
ok,C+
ok,C
ok,C-
poor,D+
poor,D


In [4]:
df.dtypes  # it's just an object, since we set string values

Grades    object
dtype: object

In [5]:
# we can change the type to category using the astype() method
df['Grades'].astype('category').head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [A, A+, A-, B, ..., C+, C-, D, D+]

There are 11 categories, and pandas is aware of what those categories are. More interesting, though, is that our data isn't just categorical, it's actually ordered. That is, an A- comes after a B+, and a B comes before a B+. We can tell pandas that the data is ordered by first creating a new categorical data type with the list of categories in order and the ordered=True flag.

In [6]:
my_categories = pd.CategoricalDtype(categories=['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+'],
                                    ordered = True)
grades = df['Grades'].astype(my_categories)
grades.head()

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

Now we see that pandas is not only aware that there are 11 categories, but it's also aware of the order of those categories. Because there's an ordering, this can help with some comparisons and Boolean masking. 

In [7]:
# For instance, if we compare the plain string grades to a C, the comparison is lexicographical,
# which is the default for strings, and it returns results that we don't intend.

df[df['Grades'] > 'C']  # a boolean mask

Unnamed: 0,Grades
ok,C+
ok,C-
poor,D+
poor,D


In [9]:
# C+ is greater than C, but C- and D are not
# we can correct this by broadcasting over the data that has its type set to an ordered categorical

# Remember that grades is a Series, not a DataFrame
grades[grades > 'C']

excellent    A+
excellent     A
excellent    A-
good         B+
good          B
good         B-
ok           C+
Name: Grades, dtype: category
Categories (11, object): [D < D+ < C- < C ... B+ < A- < A < A+]

We see that the operator works as we would expect here. We can also apply a certain set of operations, such as minimum and maximum, to this ordinal data.
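As a minimal sketch of those operations, the cell below builds a small ordered categorical Series and asks for its minimum, maximum, and sorted order. The sample values and the names grade_order, grade_dtype, and sample are made up for illustration.

```python
import pandas as pd

# the same grade ordering used earlier, from lowest to highest
grade_order = ['D', 'D+', 'C-', 'C', 'C+', 'B-', 'B', 'B+', 'A-', 'A', 'A+']
grade_dtype = pd.CategoricalDtype(categories=grade_order, ordered=True)

# hypothetical sample of grades, made up for this example
sample = pd.Series(['B+', 'C-', 'A', 'D+'], dtype=grade_dtype)

# min() and max() follow the category order, not alphabetical order
lowest = sample.min()
highest = sample.max()

# sorting also respects the category order
ranked = sample.sort_values()

print(lowest, highest)
print(ranked.tolist())
```

These calls only work because the dtype is ordered. On an unordered categorical, pandas raises a TypeError for min() and max().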

Sometimes it's useful to represent categorical values as separate columns, each holding a true or false for whether that category applies. This is especially common in feature extraction, which is a topic in the data mining course. Variables with a Boolean value are typically called dummy variables, and pandas has a built-in function called get_dummies(), which converts the values of a single column into multiple columns of zeros and ones, indicating the presence of each dummy variable.
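Here is a small sketch of get_dummies() on a made-up column of grades. The sample values are an assumption for illustration, and dtype=int is passed so the result shows zeros and ones, since recent pandas versions return True/False by default.

```python
import pandas as pd

# hypothetical grades, made up for this example
letter_grades = pd.Series(['A', 'B', 'A', 'C'], name='Grades')

# one column per distinct value; a 1 marks the value present in that row
dummies = pd.get_dummies(letter_grades, dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that matches the original value.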

There's one more common scale-based operation, and that's converting a scale from something that is on the interval or ratio scale, like a numeric grade, into one that is categorical. Now, this might seem a bit counterintuitive since it loses information about the value, but it's commonly done in a couple of places.

For instance, if we're visualizing the frequencies of categories, this can be an extremely useful approach, and histograms are regularly used with converted interval or ratio data. In addition, if we're using a machine learning classification approach on data, we'll need to be using categorical data. 

So reducing dimensionality may be useful just to apply a given technique. Pandas has a function called cut(), which takes as its first argument an array-like structure, such as a column of a DataFrame or a Series. It also takes the number of bins to use, and all bins are kept at equal spacing. So let's go back to some census data as an example.

In [11]:
# we can group our census data by state, then aggregate to get the average county size by state
# if we apply cut to this with, say, 10 bins, we can see the states listed as categoricals using the average county size

import numpy as np
df = pd.read_csv('resources/week-3/datasets/census.csv')

# reducing this to county data
df = df[df['SUMLEV'] == 50]

# group by state and average the county populations; the result is a Series indexed by state name
df = df.set_index('STNAME').groupby(level=0)['CENSUS2010POP'].agg(np.average)
df.head()

STNAME
Alabama        71339.343284
Alaska         24490.724138
Arizona       426134.466667
Arkansas       38878.906667
California    642309.586207
Name: CENSUS2010POP, dtype: float64

In [12]:
# if we just want to make 'bins' of each of these, we use cut()
# each bin is an interval of values that acts as one category
pd.cut(df,10)

STNAME
Alabama                   (11706.087, 75333.413]
Alaska                    (11706.087, 75333.413]
Arizona                 (390320.176, 453317.529]
Arkansas                  (11706.087, 75333.413]
California              (579312.234, 642309.586]
Colorado                 (75333.413, 138330.766]
Connecticut             (390320.176, 453317.529]
Delaware                (264325.471, 327322.823]
District of Columbia    (579312.234, 642309.586]
Florida                 (264325.471, 327322.823]
Georgia                   (11706.087, 75333.413]
Hawaii                  (264325.471, 327322.823]
Idaho                     (11706.087, 75333.413]
Illinois                 (75333.413, 138330.766]
Indiana                   (11706.087, 75333.413]
Iowa                      (11706.087, 75333.413]
Kansas                    (11706.087, 75333.413]
Kentucky                  (11706.087, 75333.413]
Louisiana                 (11706.087, 75333.413]
Maine                    (75333.413, 138330.766]
Maryland     

Here we see that states like Alabama and Alaska fall into the same category, the lowest bin, while California and the District of Columbia land in a very different one, the highest bin.
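To see what equal spacing means in practice, here is a minimal sketch on a small made-up series of numeric scores. The values and variable names are illustrative assumptions, not part of the census data. Passing retbins=True asks cut() to also return the bin edges it computed.

```python
import numpy as np
import pandas as pd

# hypothetical numeric scores, made up for this example
scores = pd.Series([52, 61, 70, 75, 88, 92, 100])

# split the range of the data into 4 bins; retbins=True also returns the edges
binned, edges = pd.cut(scores, bins=4, retbins=True)

# bin widths: the range (100 - 52) divided by 4 gives 12 per bin;
# pandas nudges the lowest edge down slightly so the minimum value is included
widths = np.diff(edges)

print(binned.cat.categories)
print(widths)
```

The bins are defined by the range of the data alone, so nothing guarantees that each bin receives a similar number of values.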

Cutting is just one way to build categoricals from data, and there are many other methods. For instance, cut gives us interval data, where the spacing between each category is equally sized, but sometimes we want to form categories based on frequency. That is, we want the number of items in each bin to be the same, instead of the spacing between the bins. So it really depends on what our data is, and what we're planning to do with it.
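One way to get frequency-based bins in pandas is qcut(), a function the text above does not name. The sketch below contrasts it with cut() on a small skewed series. The values and variable names are made up for illustration.

```python
import pandas as pd

# hypothetical skewed values, made up for this example
values = pd.Series([1, 2, 3, 4, 10, 20, 50, 100])

# cut(): 4 bins of equal width across the range of the data
equal_width = pd.cut(values, bins=4)

# qcut(): 4 bins chosen at the quartiles, so each holds about the same number of items
equal_count = pd.qcut(values, q=4)

# count the items per bin, listed from the lowest bin to the highest
width_counts = equal_width.value_counts().sort_index().tolist()
quantile_counts = equal_count.value_counts().sort_index().tolist()

print(width_counts)     # [6, 1, 0, 1]
print(quantile_counts)  # [2, 2, 2, 2]
```

With skewed data, the equal-width bins crowd most values into the first bin, while the quantile-based bins spread them evenly at the cost of very unequal interval widths.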