In this notebook, we will learn how to discretize continuous data into bins using the cut() and qcut() functions provided by the pandas library.

Discretization is the process of converting continuous data into discrete intervals or categories. The cut() function divides the data into bins whose edges you specify (or into a given number of equal-width bins), while the qcut() function places the bin edges at sample quantiles so that each bin holds roughly the same number of data points.
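
As a minimal sketch of that difference, the snippet below bins the same values with both functions. The data are hypothetical (five small values and one outlier) and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical data: five small values and one large outlier.
values = [1, 2, 3, 4, 5, 100]

# cut() with an integer splits the value range into that many equal-width bins.
equal_width = pd.cut(values, 2)

# qcut() with an integer places the bin edges at quantiles (here, the median).
equal_count = pd.qcut(values, 2)

# Count the members of each bin, keeping the bins in order.
width_counts = pd.Series(equal_width).value_counts(sort=False)
quantile_counts = pd.Series(equal_count).value_counts(sort=False)

print(width_counts.tolist())     # [5, 1]
print(quantile_counts.tolist())  # [3, 3]
```

The outlier stretches the equal-width bins so that almost everything lands in the first one, while the quantile-based bins stay balanced.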


**Example 1: Using cut() for Binning Heights**

Let's start with an example where we have a list of students' heights and we want to convert this data into discrete intervals (bins).

In [1]:
import pandas as pd

# Heights of students
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

# Define the bins (intervals)
bins = [118, 125, 135, 160, 200]

# Use cut() to categorize the data into the defined bins
category = pd.cut(height, bins)

# Display the result
print(category)


[(118, 125], (118, 125], (118, 125], (125, 135], (118, 125], ..., (125, 135], (160, 200], (135, 160], (135, 160], (125, 135]]
Length: 12
Categories (4, interval[int64, right]): [(118, 125] < (125, 135] < (135, 160] < (160, 200]]


Explanation:

We define the bins as the intervals (118, 125], (125, 135], (135, 160], and (160, 200]. Each interval excludes its left edge and includes its right edge.
The cut() function assigns each height to the interval that contains it.
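
One consequence worth checking is what happens to values that fall outside every bin. The sketch below uses hypothetical probe heights (not part of the student data) with the same bin edges; values outside the bins, including the open left edge 118, come back as NaN.

```python
import pandas as pd

bins = [118, 125, 135, 160, 200]

# Hypothetical heights chosen to probe the outer bin edges.
probe_heights = [110, 118, 150, 205]

probe_category = pd.cut(probe_heights, bins)
print(probe_category)

# True wherever a height did not fall into any bin.
outside = pd.isna(probe_category)
print(outside.tolist())  # [True, True, False, True]
```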

**Example 2: Changing Interval Types Using right=False**

We can adjust the interval boundary settings. By default, intervals are right-closed (right=True), meaning the right endpoint is included in the bin. We can change this by setting right=False to make the intervals left-closed.

In [2]:
# Use cut() with right=False to create left-closed intervals
category2 = pd.cut(height, bins, right=False)

# Display the result
print(category2)

[[118, 125), [118, 125), [125, 135), [125, 135), [118, 125), ..., [125, 135), [160, 200), [135, 160), [135, 160), [125, 135)]
Length: 12
Categories (4, interval[int64, left]): [[118, 125) < [125, 135) < [135, 160) < [160, 200)]


Explanation:

Here, we are using right=False, which means the intervals will be left-closed and right-open (e.g., [118, 125)).
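
The effect is easiest to see on a value that sits exactly on an interior bin edge. This small sketch bins the single height 125 both ways; the variable names are illustrative.

```python
import pandas as pd

bins = [118, 125, 135, 160, 200]

# 125 is an interior bin edge, so its bin depends on which side is closed.
right_closed = pd.cut([125], bins)              # default: right=True
left_closed = pd.cut([125], bins, right=False)

print(right_closed[0])  # (118, 125]
print(left_closed[0])   # [125, 135)
```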

**Example 3: Counting Values in Each Bin**

Next, let's count how many data points fall into each bin using the value_counts() method.

In [3]:
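# Count the number of values in each bin.
# The top-level pd.value_counts() is deprecated in recent pandas versions,
# so we call the Series method; sort=False keeps the bins in order.
value_counts = pd.Series(category).value_counts(sort=False)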

# Display the result
print(value_counts)

(118, 125]    5
(125, 135]    3
(135, 160]    3
(160, 200]    1
Name: count, dtype: int64



Explanation:

The value_counts() method will show how many height values fall within each defined bin.
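
If shares are more useful than raw counts, the same method accepts normalize=True. The following is a sketch that rebuilds the binned heights so it can run on its own; the name `shares` is illustrative.

```python
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)

# normalize=True returns each bin's share of the data instead of a raw count.
shares = pd.Series(category).value_counts(normalize=True, sort=False)
print(shares.round(3))
```

The shares add up to 1, with the first bin holding 5 of the 12 heights.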

**Example 4: Labeling the Bins**

We can also assign meaningful labels to each bin. For example, we can label the height categories as Short Height, Average Height, Good Height, and Taller.

In [4]:
# Define bin labels
bin_names = ['Short Height', 'Average Height', 'Good Height', 'Taller']

# Use cut() to categorize with custom labels
labeled_category = pd.cut(height, bins, labels=bin_names)

# Display the labeled result
print(labeled_category)

['Short Height', 'Short Height', 'Short Height', 'Average Height', 'Short Height', ..., 'Average Height', 'Taller', 'Good Height', 'Good Height', 'Average Height']
Length: 12
Categories (4, object): ['Short Height' < 'Average Height' < 'Good Height' < 'Taller']


Explanation:

By passing a list of labels to the labels parameter, we can assign a human-readable label to each bin.
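
The labels parameter also accepts False, in which case cut() returns the integer position of each value's bin rather than an interval or a name. A minimal sketch with the same heights and bins:

```python
import pandas as pd

height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]
bins = [118, 125, 135, 160, 200]

# labels=False returns the zero-based index of each value's bin.
bin_index = pd.cut(height, bins, labels=False)
print(bin_index.tolist())  # [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1]
```

Integer codes like these can be convenient when the bins feed into further numeric processing.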


**Example 5: Using qcut() for Quantile-based Binning**

Now, let's use the qcut() function, which places the bin edges at quantiles of the data so that each bin contains approximately the same number of data points.

In [5]:
import numpy as np

# Generate 2000 random numbers between 0 and 1
random_numbers = np.random.rand(2000)

# Use qcut to divide the data into 4 equal-sized quantiles (quartiles)
category3 = pd.qcut(random_numbers, 4)

# Display the result
print(category3)


[(0.259, 0.51], (-0.000602, 0.259], (0.51, 0.751], (0.51, 0.751], (-0.000602, 0.259], ..., (0.751, 0.998], (0.51, 0.751], (0.51, 0.751], (0.51, 0.751], (0.51, 0.751]]
Length: 2000
Categories (4, interval[float64, right]): [(-0.000602, 0.259] < (0.259, 0.51] < (0.51, 0.751] <
                                           (0.751, 0.998]]


Explanation:

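The qcut() function divides the data into 4 bins (quartiles) with approximately equal numbers of data points in each bin. Because the input is random and no seed is set, the bin edges shown above will vary from run to run.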
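
A reproducible variant can be sketched as follows. It assumes a seeded NumPy generator (the seed value 0 is arbitrary) and attaches hypothetical labels Q1 to Q4, neither of which appears in the original cell.

```python
import numpy as np
import pandas as pd

# A seeded generator (the seed value 0 is arbitrary) makes the run repeatable.
rng = np.random.default_rng(0)
random_numbers = rng.random(2000)

# Split into quartiles and attach hypothetical labels Q1..Q4.
quartiles = pd.qcut(random_numbers, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Count the members of each quartile, keeping the bins in order.
quartile_counts = pd.Series(quartiles).value_counts(sort=False)
print(quartile_counts)
```

With 2000 distinct values, each quartile should hold 500 points.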


**Exercise: Categorizing Exam Scores**

You are given the scores of students in a final exam. Your task is to categorize these students' scores into bins based on predefined intervals and assign appropriate labels to these categories.

Data:
The scores of the students are as follows:

```
scores = [12, 45, 67, 89, 54, 23, 90, 78, 99, 36, 60, 80]
```
Bins:

You need to discretize the scores into the following intervals:

- 0 to 40: "Poor"
- 41 to 60: "Average"
- 61 to 80: "Good"
- 81 to 100: "Excellent"

Task:

Step 1: Convert the given scores list into intervals using the pd.cut() function. The bin edges should correspond to the score intervals defined above.

Step 2: Use the labels parameter to assign appropriate labels to each of these bins: "Poor", "Average", "Good", and "Excellent".

Step 3: Display the discretized scores with their corresponding labels.

Hints:

- Use pd.cut() to categorize the scores into the bins.
- Ensure that you define the correct intervals as per the given bins.
- You can create a list of labels to assign meaningful labels to each bin.
- If you get an error or unexpected output, check whether the interval boundaries are set correctly and the labels match the number of bins.
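
As an illustration of the boundary check in the last hint, the sketch below uses hypothetical scores, bins, and labels (deliberately not the exercise data). It shows that a value equal to the lowest bin edge is left unbinned by default, and that include_lowest=True brings it into the first bin.

```python
import pandas as pd

# Hypothetical scores sitting on or next to bin edges (not the exercise data).
edge_scores = [0, 40, 41]
edge_bins = [0, 40, 60]
edge_labels = ['Low', 'High']  # one label per bin: len(edge_bins) - 1

# Default: the first interval is (0, 40], so a score of 0 is left unbinned.
default_result = pd.cut(edge_scores, edge_bins, labels=edge_labels)
print(list(default_result))  # [nan, 'Low', 'High']

# include_lowest=True closes the first interval on the left as well.
inclusive_result = pd.cut(edge_scores, edge_bins, labels=edge_labels,
                          include_lowest=True)
print(list(inclusive_result))  # ['Low', 'Low', 'High']
```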

Expected Output:

You should expect an output that looks like the following:

```
[Poor, Average, Good, Excellent, Average, Poor, Excellent, Good, Excellent, Poor, Average, Good]
Categories (4, object): [Poor < Average < Good < Excellent]
```
