Title: Applying Categories to Continuous Data
Slug: pandas/applying-categories-to-continuous-data
Category: Pandas
Tags: describe, nunique, cut, value_counts, sort_index, floor, ceil, max, min
Date: 2017-09-24
Modified: 2017-09-24

#### Import libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from bokeh.sampledata.iris import flowers

#### Inspect data

In [2]:
flowers.describe()

       sepal_length  sepal_width  petal_length  petal_width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [3]:
flowers['petal_length'].nunique()

43

Petal length has the biggest range, so we take a closer look at that feature. We might consider using `value_counts` if there were relatively few values, but here we have 43 distinct values across all observations.

Let's start with 3 equal-width bins.

In [4]:
pd.cut(flowers['petal_length'], bins=3).value_counts().sort_index()

(0.994, 2.967]    50
(2.967, 4.933]    54
(4.933, 6.9]      46
Name: petal_length, dtype: int64
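
To see where edges like 0.994 and 2.967 come from, `pd.cut` can hand back the edges it computed via `retbins=True`. The sketch below uses a small made-up Series (`sample`, not the iris data) that spans the same 1.0 to 6.9 range. Exactly how far the lowest edge is padded is an implementation detail and may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for flowers['petal_length']: same min (1.0) and max (6.9)
sample = pd.Series([1.0, 1.4, 1.5, 3.0, 4.5, 5.1, 6.9], name="petal_length")

# retbins=True returns the binned values plus the array of bin edges
binned, edges = pd.cut(sample, bins=3, retbins=True)

# Widths of the second and third bins; both equal (max - min) / 3
widths = np.diff(edges[1:])

print(edges)
print(widths)
print(binned.value_counts().sort_index())
```

The lowest edge sits slightly below the minimum so that the smallest observation still lands inside the first (left-open) bin.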

That's good and all, but our bins are a bit... ugly. If we want nice and neat bins, we can pass a list of bin edges instead of a single number. Here we round the minimum down and the maximum up to the nearest integer, then build a list of edges in steps of 2. Finally, we add some labels for extra pizzazz.

In [5]:
bins_left = np.floor(flowers['petal_length'].min())
bins_right = np.ceil(flowers['petal_length'].max())

bins = list(range(int(bins_left), int(bins_right) + 1, 2))
bins

[1, 3, 5, 7]

In [6]:
labels = ['Short', 'Medium', 'Long']
pd.cut(flowers['petal_length'], bins=bins, labels=labels).value_counts().sort_index()

Short     50
Medium    57
Long      42
Name: petal_length, dtype: int64
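
A side note on why sorting by index works sensibly here: with `labels` supplied, `pd.cut` returns an ordered categorical, so `sort_index` follows bin order rather than alphabetical order. A minimal sketch, again using a made-up Series (`sample`) in place of the iris data:

```python
import pandas as pd

# Hypothetical values, chosen so every bin receives at least one observation
sample = pd.Series([1.4, 1.5, 3.0, 4.5, 5.1, 6.9], name="petal_length")
bins = [1, 3, 5, 7]
labels = ['Short', 'Medium', 'Long']

binned = pd.cut(sample, bins=bins, labels=labels)

# The result is an ordered categorical whose categories follow the bin order
ordered = binned.cat.ordered
categories = list(binned.cat.categories)

# Sorting by index therefore gives Short, Medium, Long (not alphabetical)
counts = binned.value_counts().sort_index()
print(counts)
```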

#### Life on the edge
The eagle-eyed amongst you might have noticed a slight problem with the example above: our initial cut returns 150 results, but the second only 149.

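This is because we passed a list of edges to our second cut and, by default, `pd.cut` builds intervals that are open on the left and closed on the right. The first bin is `(1, 3]`, so the one observation exactly equal to the lowest edge (1.0) falls outside every bin and drops out of the counts. (With `bins=3`, pandas pads the range slightly, which is why the first bin started at 0.994 rather than 1.) Here's how to get around this.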

In [7]:
# Our example from above without labels
pd.cut(flowers['petal_length'], bins=bins).value_counts().sort_index()

(1, 3]    50
(3, 5]    57
(5, 7]    42
Name: petal_length, dtype: int64

In [8]:
# With include_lowest=True, the first interval also captures the lowest edge (its left side is displayed as 0.999)
pd.cut(flowers['petal_length'], bins=bins, include_lowest=True).value_counts().sort_index()

(0.999, 3.0]    51
(3.0, 5.0]      57
(5.0, 7.0]      42
Name: petal_length, dtype: int64
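
Rather than comparing totals by eye, it can be handy to check for dropped observations directly. Values that fall outside every bin come back as `NaN`, so counting missing values after the cut reveals them. A small sketch with a made-up Series (`sample`) whose minimum sits exactly on the lowest edge:

```python
import pandas as pd

# Hypothetical values; the first one equals the lowest bin edge
sample = pd.Series([1.0, 1.4, 3.0, 4.5, 5.1, 6.9], name="petal_length")
bins = [1, 3, 5, 7]

default_cut = pd.cut(sample, bins=bins)
inclusive_cut = pd.cut(sample, bins=bins, include_lowest=True)

# Observations outside every bin are returned as NaN
n_dropped_default = int(default_cut.isna().sum())
n_dropped_inclusive = int(inclusive_cut.isna().sum())

print(n_dropped_default, n_dropped_inclusive)

# dropna=False keeps the NaN group visible in the counts
print(default_cut.value_counts(dropna=False))
```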

If things still aren't clear, [take a look at the docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.cut.html) or play around with `pd.cut` yourself - it's the best way to develop your understanding!