---
# Data Science and Artificial Intelligence Practicum
## Module 3.2. Data Wrangling
---

## 3.2.3 - Data Binning (grouping), `pandas.cut`, `pandas.qcut`

In [1]:
import pandas as pd
import numpy as np

### Data Binning (Grouping)
**Data binning** (or bucketing) groups data into **bins** (or buckets): every value that falls within a small interval is replaced by a single representative value for that interval. Binning can sometimes improve the accuracy of predictive models, for example by reducing the effect of noise in the raw values.

*Data binning* is a form of data preprocessing, alongside handling missing values, formatting, normalization, and standardization. Binning can be used to convert numeric values to categorical ones or to quantise (sample) numeric values.
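
The two uses mentioned above can be sketched on a tiny example. The ages, bin edges, and group names below are made up for illustration and are not part of the course datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, invented purely for illustration
ages = pd.Series([3, 17, 25, 41, 68, 90])
edges = np.array([0, 18, 65, 120])

# Numeric -> categorical: each age is replaced by a named group
age_groups = pd.cut(ages, bins=edges, labels=["child", "adult", "senior"])

# Numeric -> quantised numeric: each age is replaced by the midpoint of its bin
codes = pd.cut(ages, bins=edges, labels=False)   # integer position of each bin
midpoints = (edges[:-1] + edges[1:]) / 2         # one representative value per bin
quantised = pd.Series(midpoints[codes.to_numpy()], index=ages.index)

print(age_groups.tolist())
print(quantised.tolist())
```

Using the midpoint as the representative value is only one possible choice; the bin mean or median would work just as well.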

**src:** [LINK](https://towardsdatascience.com/data-preprocessing-with-python-pandas-part-5-binning-c5bd5fd1b950)

![The-age-ranges-of-survey-respondents.png](https://www.researchgate.net/profile/Gemma-Burgess-5/publication/299603937/figure/fig1/AS:347169551863808@1459783079654/The-age-ranges-of-survey-respondents.png)

### Data Preparation

#### Data Collection

In [79]:
df = pd.read_csv('https://raw.githubusercontent.com/anvarnarz/praktikum_datasets/main/world_population_duplicates.csv',
                 usecols=['country','pop2021','area'],
                 index_col='country')
df.head()

| country | pop2021 | area |
|---|---|---|
| Macau | 658.394 | 30 |
| Monaco | 39.511 | 2 |
| Burundi | 12255.433 | 27834 |
| Macau | 658.394 | 30 |
| Monaco | 39.511 | 2 |


#### Data Cleaning

In [82]:
# dropping duplicated rows
df.drop_duplicates(inplace=True)

In [83]:
df.head()

| country | pop2021 | area |
|---|---|---|
| Macau | 658.394 | 30 |
| Monaco | 39.511 | 2 |
| Burundi | 12255.433 | 27834 |
| Singapore | 5896.686 | 710 |
| Hong Kong | 7552.81 | 1104 |


In [84]:
# sorting
df.sort_index(inplace=True)

In [85]:
df.head()

| country | pop2021 | area |
|---|---|---|
| Afghanistan | 39835.428 | 652230 |
| Albania | 2872.933 | 28748 |
| Algeria | 44616.624 | 2381741 |
| American Samoa | 55.1 | 199 |
| Andorra | 77.355 | 468 |


#### Data Validation

In [86]:
# see some samples
df.loc['Uzbekistan']

pop2021     33935.763
area       447400.000
Name: Uzbekistan, dtype: float64

In [87]:
df.loc['China']

pop2021    1444216.107
area       9706961.000
Name: China, dtype: float64

### **`pandas.cut`**
Bin values into discrete intervals.

**Parameters**:

- **x :** ***array-like*** -> The input array to be binned. Must be 1-dimensional.

- **bins :** ***int, sequence of scalars, or IntervalIndex*** -> The criteria to bin by.

  - int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

  - sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.

  - IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.

- **labels :** ***array or False, default None*** -> Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins.

- **right :** ***bool, default True*** -> Indicates whether the bins include the rightmost edge or not. If `right == True` (the default), then the bins `[1, 2, 3, 4]` indicate (1,2], (2,3], (3,4]. This argument is ignored when *bins* is an IntervalIndex.

---
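
The effect of `right` and `labels=False` can be checked on the bin edges `[1, 2, 3, 4]` from the parameter description. This is a minimal sketch; the input values are chosen only so that each one lands exactly on an edge.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])
edges = [1, 2, 3, 4]

# right=True (default): bins are (1, 2], (2, 3], (3, 4]; the value 1 is outside all of them
right_closed = pd.cut(values, bins=edges)

# right=False: bins are [1, 2), [2, 3), [3, 4); the value 4 is outside all of them
left_closed = pd.cut(values, bins=edges, right=False)

# labels=False: return the integer position of the bin instead of an interval
codes = pd.cut(values, bins=edges, labels=False)

print(right_closed.tolist())
print(left_closed.tolist())
print(codes.tolist())
```

Values that fall outside every bin come back as `NaN`, which is why the first entry of `right_closed` and the last entry of `left_closed` are missing.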

#### Example 1
Note that the population is given in thousands, so the actual figure is the listed value multiplied by `1000`. For example, the value for Uzbekistan is `33935.763`, which means a population of about `33935 x 1000`, i.e. roughly 33 million 935 thousand.

Let's divide countries into categories by population:

- up to 1 million (`0-1000`)
- 1-10 million (`1000-10000`)
- 10-30 million (`10000-30000`)
- 30-50 million (`30000-50000`)
- 50-100 million (`50000-100000`)
- 100-300 million (`100000-300000`)
- 300 million - 1.5 billion (`300000-1500000`)
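
Because the column is stored in thousands, the category boundaries have to be expressed in thousands as well. As a small sanity check, the edges can be derived from the boundaries in millions; the variable names here are illustrative only.

```python
# Category boundaries in millions of people, as listed above
edges_in_millions = [0, 1, 10, 30, 50, 100, 300, 1500]

# The dataset stores population in thousands, so 1 million corresponds to 1000
edges_in_thousands = [m * 1000 for m in edges_in_millions]

print(edges_in_thousands)
```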

In [90]:
# define the bin edges (population in thousands)
intervals = [0, 1000, 10000, 30000, 50000, 100000, 300000, 1500000]

In [91]:
# select the population column as a Series
population = df.pop2021
population.head()

country
Afghanistan       39835.428
Albania            2872.933
Algeria           44616.624
American Samoa       55.100
Andorra              77.355
Name: pop2021, dtype: float64

In [92]:
# bin the data
bins1 = pd.cut(population, bins=intervals)
bins1

country
Afghanistan          (30000, 50000]
Albania               (1000, 10000]
Algeria              (30000, 50000]
American Samoa            (0, 1000]
Andorra                   (0, 1000]
                          ...      
Wallis and Futuna         (0, 1000]
Western Sahara            (0, 1000]
Yemen                (30000, 50000]
Zambia               (10000, 30000]
Zimbabwe             (10000, 30000]
Name: pop2021, Length: 232, dtype: category
Categories (7, interval[int64, right]): [(0, 1000] < (1000, 10000] < (10000, 30000] <
                                         (30000, 50000] < (50000, 100000] < (100000, 300000] <
                                         (300000, 1500000]]

In [93]:
# count how many values fall into each interval
bins1.value_counts()

(0, 1000]            72
(1000, 10000]        68
(10000, 30000]       44
(30000, 50000]       19
(50000, 100000]      15
(100000, 300000]     11
(300000, 1500000]     3
Name: pop2021, dtype: int64

In [95]:
# we can add the bins to the DataFrame as a new column
df['popBins'] = bins1
df.head()

| country | pop2021 | area | popBins |
|---|---|---|---|
| Afghanistan | 39835.428 | 652230 | (30000, 50000] |
| Albania | 2872.933 | 28748 | (1000, 10000] |
| Algeria | 44616.624 | 2381741 | (30000, 50000] |
| American Samoa | 55.1 | 199 | (0, 1000] |
| Andorra | 77.355 | 468 | (0, 1000] |
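
Once the bins are stored as a column, they can serve as a grouping key. The sketch below rebuilds only the five rows shown above (not the full dataset) and computes the mean area per population bin; using `groupby` here is an illustration of one possible next step, not part of the original notebook.

```python
import pandas as pd

# The five rows displayed above, rebuilt by hand so the example runs offline
sample = pd.DataFrame(
    {
        "pop2021": [39835.428, 2872.933, 44616.624, 55.1, 77.355],
        "area": [652230, 28748, 2381741, 199, 468],
    },
    index=["Afghanistan", "Albania", "Algeria", "American Samoa", "Andorra"],
)

intervals = [0, 1000, 10000, 30000, 50000, 100000, 300000, 1500000]
sample["popBins"] = pd.cut(sample["pop2021"], bins=intervals)

# observed=True keeps only the bins that actually occur in the sample
mean_area = sample.groupby("popBins", observed=True)["area"].mean()
print(mean_area)
```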


#### Example 2

In [96]:
df2 = pd.read_csv("https://github.com/anvarnarz/praktikum_datasets/raw/main/automobile_data.csv",
                  index_col=0)
df2.head()

| index | company | body-style | wheel-base | length | engine-type | num-of-cylinders | horsepower | average-mileage | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | alfa-romero | convertible | 88.6 | 168.8 | dohc | four | 111 | 21 | 13495.0 |
| 1 | alfa-romero | convertible | 88.6 | 168.8 | dohc | four | 111 | 21 | 16500.0 |
| 2 | alfa-romero | hatchback | 94.5 | 171.2 | ohcv | six | 154 | 19 | 16500.0 |
| 3 | audi | sedan | 99.8 | 176.6 | ohc | four | 102 | 24 | 13950.0 |
| 4 | audi | sedan | 99.4 | 176.6 | ohc | five | 115 | 18 | 17450.0 |


Let's bin (group) the data by the *price* column.

First, describe the column to see summary values such as *min* and *max*, which help in choosing the bin edges.

In [106]:
df2.price.describe()

count       58.000000
mean     15387.000000
std      11320.259841
min       5151.000000
25%       6808.500000
50%      11095.000000
75%      18120.500000
max      45400.000000
Name: price, dtype: float64

In [117]:
# define the bin edges for price
intervals2 = [0, 10_000, 20_000, 50_000]

In [118]:
bins2 = pd.cut(df2.price, bins=intervals2)
bins2

index
0     (10000, 20000]
1     (10000, 20000]
2     (10000, 20000]
3     (10000, 20000]
4     (10000, 20000]
           ...      
81        (0, 10000]
82        (0, 10000]
86        (0, 10000]
87    (10000, 20000]
88    (10000, 20000]
Name: price, Length: 61, dtype: category
Categories (3, interval[int64, right]): [(0, 10000] < (10000, 20000] < (20000, 50000]]

In [119]:
bins2.value_counts()

(0, 10000]        28
(10000, 20000]    17
(20000, 50000]    13
Name: price, dtype: int64
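
The binned series above has length 61, while the counts add up to 58, which matches the 58 non-missing prices reported by `describe()`. Missing inputs stay missing after binning, and so do values that fall outside the outermost edges. The sketch below uses a few prices from the summary above plus one hypothetical out-of-range value (`60000.0`).

```python
import numpy as np
import pandas as pd

# 5151 and 45400 are the min and max from describe(); 60000 is a made-up outlier
prices = pd.Series([5151.0, 13495.0, np.nan, 45400.0, 60000.0])
binned = pd.cut(prices, bins=[0, 10_000, 20_000, 50_000])

# NaN stays NaN, and 60000 lies beyond the last edge, so it also becomes NaN
print(binned)
print("unbinned rows:", binned.isna().sum())

# value_counts() ignores the missing entries by default
counts = binned.value_counts().sort_index()
print(counts)
```

If out-of-range values should be kept, one option is to widen the outer edges, for example by using `float("inf")` as the last edge.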

##### `labels` parameter in `pandas.cut`

In [122]:
# we can give the bins descriptive labels instead of showing the intervals
labels = ['cheap', 'middle-priced', 'expensive']

In [126]:
bins2 = pd.cut(df2.price, bins=intervals2, labels=labels)
bins2.value_counts()

cheap            28
middle-priced    17
expensive        13
Name: price, dtype: int64

In [135]:
df2['priceBins'] = bins2
df2.sample(5)

| index | company | body-style | wheel-base | length | engine-type | num-of-cylinders | horsepower | average-mileage | price | priceBins |
|---|---|---|---|---|---|---|---|---|---|---|
| 57 | nissan | sedan | 100.4 | 184.6 | ohcv | six | 152 | 19 | 13499.0 | middle-priced |
| 79 | toyota | wagon | 104.5 | 187.8 | dohc | six | 156 | 19 | 15750.0 | middle-priced |
| 35 | jaguar | sedan | 102.0 | 191.7 | ohcv | twelve | 262 | 13 | 36000.0 | expensive |
| 43 | mazda | sedan | 104.9 | 175.0 | ohc | four | 72 | 31 | 18344.0 | middle-priced |
| 82 | volkswagen | sedan | 97.3 | 171.7 | ohc | four | 52 | 37 | 7995.0 | cheap |
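
By default, the labelled result of `pd.cut` is an *ordered* categorical, so the labels can be compared with each other. The sketch below reuses four prices from the sample rows above; the comparison against `'middle-priced'` is just one illustrative filter.

```python
import pandas as pd

# Prices taken from the sampled rows above (volkswagen, nissan, jaguar, mazda)
prices = pd.Series([7995.0, 13499.0, 36000.0, 18344.0])
labels = ["cheap", "middle-priced", "expensive"]

tiers = pd.cut(prices, bins=[0, 10_000, 20_000, 50_000], labels=labels)

# The categories keep the order of the bins: cheap < middle-priced < expensive
print(tiers.cat.ordered)

# Ordered categories support comparisons, e.g. "middle-priced or above"
at_least_middle = tiers >= "middle-priced"
print(prices[at_least_middle].tolist())
```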


We can also pass an integer to `bins` to get that many equal-width bins:

In [144]:
bins3 = pd.cut(df2.price, bins=3)
bins3.value_counts()

(5110.751, 18567.333]     44
(31983.667, 45400.0]       9
(18567.333, 31983.667]     5
Name: price, dtype: int64
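
Note that `value_counts()` sorts by count, so the bins above are not listed in interval order. To see how the edges were computed and to list the bins in order, `retbins=True` and `sort_index()` can be used. The sketch below runs on five values taken from the `describe()` output (min, quartiles, max) rather than on the full column, so the counts differ from those above.

```python
import pandas as pd

# min, 25%, 50%, 75% and max of the price column, from describe() above
prices = pd.Series([5151.0, 6808.5, 11095.0, 18120.5, 45400.0])

# retbins=True additionally returns the array of computed bin edges
binned, edges = pd.cut(prices, bins=3, retbins=True)
print(edges)

# sort_index() lists the bins in interval order instead of by count
counts = binned.value_counts().sort_index()
print(counts)
```

The lowest edge is slightly below the minimum (`5110.751` rather than `5151`), which is the small range extension described in the `bins` parameter, so the minimum itself falls inside the first bin.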

In [152]:
bins4 = pd.cut(df2.horsepower, 5)
bins4.value_counts()

(47.76, 96.0]     30
(96.0, 144.0]     17
(144.0, 192.0]    10
(192.0, 240.0]     2
(240.0, 288.0]     2
Name: horsepower, dtype: int64

### `pandas.qcut`
Quantile-based discretization function.

**Parameters:**

- **x :** ***1d ndarray or Series***

- **q :** ***int or list-like of float*** -> Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

- **labels :** ***array or False, default None*** -> Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.
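
The difference between equal-width bins (`pd.cut`) and quantile-based bins (`pd.qcut`) shows up most clearly on skewed data. The values and the labels `Q1`..`Q4` below are made up for illustration.

```python
import pandas as pd

# Hypothetical skewed data: seven small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width bins: the outlier stretches the range, so most values share one bin
equal_width = pd.cut(values, bins=4).value_counts().sort_index()

# Quantile-based bins: each bin receives (roughly) the same number of values
equal_freq = pd.qcut(values, q=4).value_counts().sort_index()

# labels works the same way as in pd.cut
labelled = pd.qcut(values, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(equal_width.tolist())
print(equal_freq.tolist())
print(labelled.tolist())
```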


In [159]:
pd.qcut(df2.price, 4).value_counts()

(5150.999, 6808.5]    15
(18120.5, 45400.0]    15
(6808.5, 11095.0]     14
(11095.0, 18120.5]    14
Name: price, dtype: int64

In [160]:
pd.qcut(df2.price, [0, .25, .5, .75, 1.]).value_counts()

(5150.999, 6808.5]    15
(18120.5, 45400.0]    15
(6808.5, 11095.0]     14
(11095.0, 18120.5]    14
Name: price, dtype: int64

In [167]:
pd.qcut(df2.horsepower, 3).value_counts()

(47.999, 70.0]    24
(111.0, 288.0]    20
(70.0, 111.0]     17
Name: horsepower, dtype: int64

In [169]:
pd.qcut(df2.horsepower, [0, .33, .66, 1.]).value_counts()

(69.8, 111.0]     21
(47.999, 69.8]    20
(111.0, 288.0]    20
Name: horsepower, dtype: int64
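
The last two results differ because `q=3` cuts at exactly 1/3 and 2/3, whereas `[0, .33, .66, 1.]` cuts at the rounded quantiles 0.33 and 0.66. With `retbins=True` the edges can be compared directly. The sketch below uses the integers 0 to 10 as stand-in data, not the horsepower column.

```python
import numpy as np
import pandas as pd

# Stand-in data, chosen so the quantiles are easy to verify by hand
values = pd.Series(range(11))

# q=3 uses the quantiles 0, 1/3, 2/3, 1
_, edges_thirds = pd.qcut(values, 3, retbins=True)

# Rounded quantiles give slightly different edges
_, edges_rounded = pd.qcut(values, [0, .33, .66, 1.], retbins=True)

# Passing exact thirds explicitly reproduces the q=3 result
_, edges_linspace = pd.qcut(values, np.linspace(0, 1, 4), retbins=True)

print(edges_thirds)
print(edges_rounded)
```

When a value in the data lies between the two candidate edges (as with 69.8 and 70.0 above), it moves to a different bin, which changes the counts.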