# Cut() & QCut()


#### Inputs: 
Both functions take a 1d array (or a Series object) of numbers as the first argument. The second argument is the number of bins (also called buckets: the groups into which the data will be distributed), or a list that defines the bins explicitly (bin edges for cut(), quantiles for qcut()).
<br><br>
#### Outputs:
Both cut() and qcut() return a Categorical object if they receive a 1d array as the first
argument. If they receive a Series object, they return a Series.
<br><br>
#### Takeaways:
1. Both cut() and qcut() are great for discretizing data (making discrete, ordered categories out of continuous values).
2. If you want to create bins (groups) based on ranges of equal width, or on your own labels with specific cut-off values (like using the age of a person to call them "kid" (up to 15), "teen" (up to 20), "adult" (from then on), etc.), use cut().
3. If you want to create bins based on the distribution of the values (quantiles, a.k.a. position metrics), that is, you want each bin to hold more or less the same number of values, use qcut().
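
The difference between takeaways 2 and 3 can be seen in a minimal sketch on the same hard-coded list used in the rest of this notebook. Wrapping the result in a Series and sorting the index are choices made here only to get the counts in bin order.

```python
import pandas as pd

factors = [1, 0, 2, 3, 4, 5, 6, 7, 10, 20, 15, 40, 59, 30, 28, 29]

# cut(): 5 bins of equal width, so the counts follow the shape of the data
width_counts = pd.Series(pd.cut(factors, 5)).value_counts().sort_index()

# qcut(): 5 bins split at quantiles, so the counts are roughly equal
quantile_counts = pd.Series(pd.qcut(factors, 5)).value_counts().sort_index()

print(width_counts.tolist())
print(quantile_counts.tolist())
```

Most of the values in `factors` are small, so the equal-width bins are lopsided while the quantile bins are balanced.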

## Setup

In [1]:
# Imports
import numpy as np
import pandas as pd

# We define factors to use as our synthetic "series" (hard-coded).
factors = [1, 0, 2, 3, 4, 5, 6, 7, 10, 20, 15, 40, 59, 30, 28, 29]

# Setting up data frames
titanic_df = pd.read_csv('Data/titanic.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Cut()
When given a number of bins, cut() will create groups (bins) of equal width (range/length/amplitude). In other words, cut() concerns itself with the actual values of the structure sent as the first argument. 
<br>

Depending on the type of the input sent as first argument, cut() will return a different data type. If one sends a 1d array, it'll return a Categorical object; if one sends a Series, it'll return a Series object.

[Check cut's documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html)
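
One way to check the equal-width claim is to ask cut() for the bin edges it computed, using its `retbins=True` option. This is a sketch on the `factors` list from the setup cell. Note that pandas nudges the lowest edge slightly below the minimum so that the smallest value still falls inside the first bin, which makes the first bin marginally wider than the rest.

```python
import numpy as np
import pandas as pd

factors = [1, 0, 2, 3, 4, 5, 6, 7, 10, 20, 15, 40, 59, 30, 28, 29]

# retbins=True returns the binned values together with the array of edges used
binned, edges = pd.cut(factors, 5, retbins=True)

# Width of each bin: every width after the first one is the same
widths = np.diff(edges)
print(edges)
print(widths)
```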

In [2]:
# I: 1d array --> O: Categorical 
type(pd.cut(factors,5))

pandas.core.arrays.categorical.Categorical

In [3]:
# I: Series --> O: Series 
type(pd.cut(titanic_df.Age, 5))

pandas.core.series.Series

### Categorical

A Categorical object is an array-like structure; here each element is the interval (an Interval object) that the corresponding input value fell into.<br>
When displayed, a Categorical also shows its length and its categories.

In [4]:
# Checking a Categorical 
pd.cut(factors, 5)

[(-0.059, 11.8], (-0.059, 11.8], (-0.059, 11.8], (-0.059, 11.8], (-0.059, 11.8], ..., (35.4, 47.2], (47.2, 59.0], (23.6, 35.4], (23.6, 35.4], (23.6, 35.4]]
Length: 16
Categories (5, interval[float64]): [(-0.059, 11.8] < (11.8, 23.6] < (23.6, 35.4] < (35.4, 47.2] < (47.2, 59.0]]

In [5]:
# You can label the categories/bins
pd.cut(factors, 5, labels=["Category 1", "Category 2", "Category 3", "Category 4", "Category 5"])

['Category 1', 'Category 1', 'Category 1', 'Category 1', 'Category 1', ..., 'Category 4', 'Category 5', 'Category 3', 'Category 3', 'Category 3']
Length: 16
Categories (5, object): ['Category 1' < 'Category 2' < 'Category 3' < 'Category 4' < 'Category 5']

### Series
If you use a Series as input, cut() will return a Series with the proper categories for the values of the input Series.

In [6]:
# Note: these are 5 equal-width bins over the age range, so the labels are only illustrative
pd.cut(titanic_df.Age, 5, labels=["Kid", "Teen", "Young Adult", "Adult", "Elder"])

0             Teen
1      Young Adult
2             Teen
3      Young Adult
4      Young Adult
          ...     
886           Teen
887           Teen
888            NaN
889           Teen
890           Teen
Name: Age, Length: 891, dtype: category
Categories (5, object): ['Kid' < 'Teen' < 'Young Adult' < 'Adult' < 'Elder']
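
Row 888 in the output above is NaN because a missing input value stays missing after binning. A small sketch with a made-up Series (the ages and the two labels are hypothetical, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including one missing value
ages = pd.Series([22.0, 38.0, np.nan, 4.0])

# Two equal-width bins over the observed range (4 to 38)
groups = pd.cut(ages, 2, labels=["Younger", "Older"])

# The missing age stays missing; every other row gets a label
print(groups.tolist())
print(groups.isna().tolist())
```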

In [7]:
# You can specify the bin edges (break points) explicitly by passing a list
pd.cut(factors, [0, 5, 10, 20, 100])  # Second value in the Categorical is a NaN

[(0.0, 5.0], NaN, (0.0, 5.0], (0.0, 5.0], (0.0, 5.0], ..., (20, 100], (20, 100], (20, 100], (20, 100], (20, 100]]
Length: 16
Categories (4, interval[int64]): [(0, 5] < (5, 10] < (10, 20] < (20, 100]]

### WARNING !!!
Notice that the second element of the returned Categorical is NaN. In the original factors list, the second value was 0. Any value that falls outside every interval becomes NaN, and 0 is outside all of them: by default the intervals are open on the left and closed on the right, so the first bin (0, 5] does not contain its lower edge.<br>

This only happens when you pass explicit bin edges. When you pass a number of bins, cut() extends the lowest edge slightly below the minimum (see the -0.059 above) so that the minimum is included. qcut() computes its edges from the quantiles of the input values and likewise includes the minimum value.
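
One way to keep the lowest edge inside the first bin is cut()'s `include_lowest` parameter. A minimal sketch on the same list and edges as the cell above (the variable names are only for illustration):

```python
import pandas as pd

factors = [1, 0, 2, 3, 4, 5, 6, 7, 10, 20, 15, 40, 59, 30, 28, 29]
edges = [0, 5, 10, 20, 100]

# Default: intervals are open on the left, so the value 0 gets no bin
default_bins = pd.Series(pd.cut(factors, edges))

# include_lowest=True makes the first interval include its left edge
inclusive_bins = pd.Series(pd.cut(factors, edges, include_lowest=True))

print(default_bins.isna().sum())    # number of values left without a bin
print(inclusive_bins.isna().sum())
```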

In [8]:
# Making a new discrete column in a dataframe using a continuous value
new_age_column = pd.cut(titanic_df.Age, [0, 13, 20, 35, 60, 100], 
            labels=["Kid", "Teen", "Young Adult", "Adult", "Elder"])

titanic_df["Age Group"] = new_age_column
titanic_df.head()  # Check the new 'Age Group' column

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age Group
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,Young Adult
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Adult
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,Young Adult
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,Young Adult
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,Young Adult


In [9]:
# The "value_counts()" function shows you the number of cases that apply per bin
pd.cut(factors, 5).value_counts()  # Categoricals can use value_counts() too, just like a Series

(-0.059, 11.8]    9
(11.8, 23.6]      2
(23.6, 35.4]      3
(35.4, 47.2]      1
(47.2, 59.0]      1
dtype: int64

In [10]:
# The method value_counts() can also be used in Series (not only with Categorical objects)
pd.cut(titanic_df.Age, [0, 13, 20, 35, 60, 100], labels=['Kid', 'Teen', 'Young Adult', 'Adult', 'Elder']).value_counts()

Young Adult    318
Adult          195
Teen           108
Kid             71
Elder           22
Name: Age, dtype: int64

## QCut()


qcut() will create groups (bins) based on "sample quantiles", so it'll create bins that each hold as close to the same number of records (rows, instances, values) as possible. In other words, qcut() concerns itself with the distribution of the values (not the values themselves) of the structure sent as first argument.

The second argument that we send is used to determine the number of bins (Quartiles, Quintiles, Deciles, etc.).

You can send an integer to represent the number of bins, or send an array of floats with the quantile fractions (e.g. [0, 0.25, 0.5, 0.75, 1]): 
<br><br>
4 --> Quartiles <br>
5 --> Quintiles <br>
10 --> Deciles <br>
100 --> Percentiles 
<br><br>
You can also send as second argument a list with percentiles that don't conform to common positional metrics (Quartiles, Deciles, etc.) (e.g. [0, 0.1, 0.25, 0.5, 0.63, 0.75, 1]).
<br><br>
[Check qcut's documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html)
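
To illustrate that an integer and the matching list of quantile fractions describe the same bins, here is a small sketch using the `factors` list from the setup cell. It compares `4` against the explicit quartile list; the variable names are only for illustration.

```python
import pandas as pd

factors = [1, 0, 2, 3, 4, 5, 6, 7, 10, 20, 15, 40, 59, 30, 28, 29]

# Quartiles requested as a number of bins...
by_count = pd.qcut(factors, 4)

# ...and as explicit cumulative fractions between 0 and 1
by_list = pd.qcut(factors, [0, 0.25, 0.5, 0.75, 1])

# Both calls should produce the same intervals and the same assignment
print(by_count.categories.equals(by_list.categories))
print((by_count.codes == by_list.codes).all())
```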

In [12]:
# qcut() tries to have more or less the same number of examples in each bin
pd.qcut(factors, 5).value_counts()

(-0.001, 3.0]    4
(3.0, 6.0]       3
(6.0, 15.0]      3
(15.0, 29.0]     3
(29.0, 59.0]     3
dtype: int64

In [13]:
# Follows the same return behaviour as cut
pd.qcut(factors, 5)         # Returns a Categorical
pd.qcut(titanic_df.Age, 5)  # Returns a Series

0       (19.0, 25.0]
1       (31.8, 41.0]
2       (25.0, 31.8]
3       (31.8, 41.0]
4       (31.8, 41.0]
           ...      
886     (25.0, 31.8]
887    (0.419, 19.0]
888              NaN
889     (25.0, 31.8]
890     (31.8, 41.0]
Name: Age, Length: 891, dtype: category
Categories (5, interval[float64]): [(0.419, 19.0] < (19.0, 25.0] < (25.0, 31.8] < (31.8, 41.0] < (41.0, 80.0]]

In [14]:
# Send the quantiles as a list
pd.qcut(titanic_df.Age, [0, 0.2, 0.4, 0.6, 0.8, 1])  # Notice that the number of elements in the list is 1 more than the number of bins you want --> 2nd_arg.length = n+1 (where n is number of bins you want).

0       (19.0, 25.0]
1       (31.8, 41.0]
2       (25.0, 31.8]
3       (31.8, 41.0]
4       (31.8, 41.0]
           ...      
886     (25.0, 31.8]
887    (0.419, 19.0]
888              NaN
889     (25.0, 31.8]
890     (31.8, 41.0]
Name: Age, Length: 891, dtype: category
Categories (5, interval[float64]): [(0.419, 19.0] < (19.0, 25.0] < (25.0, 31.8] < (31.8, 41.0] < (41.0, 80.0]]

In [15]:
# Just like with cut(), you can label the bins with qcut()
pd.qcut(factors, 5, labels=["First Quintile", "Second Quintile", 
                            "Third Quintile", "Fourth Quintile", 
                            "Fifth Quintile"]).value_counts()

First Quintile     4
Second Quintile    3
Third Quintile     3
Fourth Quintile    3
Fifth Quintile     3
dtype: int64

In [16]:
# precision sets how many decimals the bin edges are stored and displayed with (qcut() accepts it too)
pd.cut(factors, 5, precision=1)
pd.cut(titanic_df.Age, 5, precision=1)

0      (16.3, 32.3]
1      (32.3, 48.2]
2      (16.3, 32.3]
3      (32.3, 48.2]
4      (32.3, 48.2]
           ...     
886    (16.3, 32.3]
887    (16.3, 32.3]
888             NaN
889    (16.3, 32.3]
890    (16.3, 32.3]
Name: Age, Length: 891, dtype: category
Categories (5, interval[float64]): [(0.3, 16.3] < (16.3, 32.3] < (32.3, 48.2] < (48.2, 64.1] < (64.1, 80.0]]