<font color="blue" size="6">Discretization and Binning</font>

<p>Binning in pandas is the process of grouping a continuous numerical variable into a smaller number of discrete bins or groups.</p>
<p>Binning numerical columns is a common data preprocessing technique in data analysis and machine learning.</p>

<font color="blue" size="6">qcut</font>
<p>qcut is a <b>“quantile-based discretization function.”</b></p>
<p>This means that qcut tries to divide the underlying data into bins that each hold roughly the same number of values.</p>
<p>The bin edges are derived from percentiles of the data's distribution rather than given as fixed numeric values.</p>
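<p>A minimal sketch of this behaviour, assuming a small made-up series (the values below are hypothetical and not taken from Gold.csv). It asks qcut to return its bin edges and compares them with the quartiles of the data.</p>

```python
import pandas as pd

# Hypothetical, skewed sample values (not taken from Gold.csv).
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100, 500, 1000])

# retbins=True also returns the bin edges that qcut computed.
binned, edges = pd.qcut(values, q=4, retbins=True)

# The 0%, 25%, 50%, 75% and 100% quantiles of the same data.
quantiles = values.quantile([0, 0.25, 0.5, 0.75, 1.0]).to_numpy()

# Number of values per bin, listed in bin order.
counts = binned.value_counts().sort_index()

print(edges)      # bin edges chosen by qcut
print(quantiles)  # quartiles of the data
print(counts)     # three of the twelve values land in each bin
```

<p>The edges are unevenly spaced because they follow the data, while the counts per bin stay equal.</p>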


In [1]:
import numpy as np
import pandas as pd
df = pd.read_csv('DATASETS\\Gold.csv',encoding='latin1')
df

Unnamed: 0,Year,10_G_Price
0,1964,63.25
1,1965,71.75
2,1966,83.75
3,1967,102.5
4,1968,162.0
5,1969,176.0
6,1970,184.0
7,1971,193.0
8,1972,202.0
9,1973,278.5


In [2]:
df['Year']

0      1964
1      1965
2      1966
3      1967
4      1968
5      1969
6      1970
7      1971
8      1972
9      1973
10     1974
11     1975
12     1976
13     1977
14     1978
15     1979
16     1980
17     1981
18     1982
19     1983
20     1984
21     1985
22     1986
23     1987
24     1988
25     1989
26     1990
27     1991
28     1992
29     1993
30     1994
31     1995
32     1996
33     1997
34     1998
35     1999
36     2000
37     2001
38     2002
39     2003
40     2004
41     2005
42     2007
43     2008
44     2009
45     2010
46     2011
47     2012
48     2013
49     2014
50     2015
51     2016
52     2017
53     2018
54     2019
55     2020
56     2021
57     2022
58    2023 
59     2024
Name: Year, dtype: object

In [3]:
# Year was read as object (text), so convert it to integers before binning
df['Year'] = df['Year'].astype(int)
df

Unnamed: 0,Year,10_G_Price
0,1964,63.25
1,1965,71.75
2,1966,83.75
3,1967,102.5
4,1968,162.0
5,1969,176.0
6,1970,184.0
7,1971,193.0
8,1972,202.0
9,1973,278.5


In [4]:
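# Without labels, qcut returns the interval that each year falls into
pd.qcut(df['Year'], q=4).head()
# Store the same four quantile bins under readable labels
df['Year.q'] = pd.qcut(df['Year'], q=4, labels=['Veryold', 'Old', 'Middle', 'Newera'])
df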

Unnamed: 0,Year,10_G_Price,Year.q
0,1964,63.25,Veryold
1,1965,71.75,Veryold
2,1966,83.75,Veryold
3,1967,102.5,Veryold
4,1968,162.0,Veryold
5,1969,176.0,Veryold
6,1970,184.0,Veryold
7,1971,193.0,Veryold
8,1972,202.0,Veryold
9,1973,278.5,Veryold


<font color="blue" size="6">cut</font>
<p>Use cut when you need to segment and sort data values into bins.</p>
<p>This function is also useful for going from a continuous variable to a categorical variable.</p>
<p>For example, cut could convert ages to groups of age ranges. It supports binning into a given number of equal-width bins, or into a pre-specified array of bin edges.</p>
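<p>The age example can be sketched as follows. The ages, bin edges and label names are made up for illustration; only pd.cut and its bins and labels parameters come from pandas.</p>

```python
import pandas as pd

# Hypothetical ages, used only for illustration.
ages = pd.Series([5, 17, 18, 34, 35, 64, 65, 90])

# Explicit edges. Intervals are right-closed by default, so the bins are
# (0, 17], (17, 64] and (64, 120].
groups = pd.cut(ages, bins=[0, 17, 64, 120], labels=['child', 'adult', 'senior'])

# An integer asks for that many equal-width bins across the data range.
equal_width = pd.cut(ages, bins=3)
width_counts = equal_width.value_counts().sort_index()

print(groups.tolist())
print(width_counts)
```

<p>With explicit edges the groups carry a meaning chosen by the analyst, whereas an integer only splits the observed range into equal-width intervals.</p>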

In [5]:
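# Intervals are right-closed by default: (0, 1980], (1980, 1990], (1990, 2014], (2014, 2024]
pd.cut(df['Year'], bins=[0, 1980, 1990, 2014, 2024])
df['Gold_Range'] = pd.cut(df['Year'], bins=[0, 1980, 1990, 2014, 2024], labels=[1, 2, 3, 4])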


In [6]:
df


Unnamed: 0,Year,10_G_Price,Year.q,Gold_Range
0,1964,63.25,Veryold,1
1,1965,71.75,Veryold,1
2,1966,83.75,Veryold,1
3,1967,102.5,Veryold,1
4,1968,162.0,Veryold,1
5,1969,176.0,Veryold,1
6,1970,184.0,Veryold,1
7,1971,193.0,Veryold,1
8,1972,202.0,Veryold,1
9,1973,278.5,Veryold,1


<b>Note:</b>
<p>With the cut function, we have no control over how many values fall into each bin; we can only specify the bin edges.</p>
<p>This is where the qcut function is useful.</p>
<p>It divides the values into buckets so that each bucket contains approximately the same number of values.</p>
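<p>The difference is easiest to see on skewed data. The sketch below uses a hypothetical series (not the Gold.csv prices) and counts how many values each function places in each of three bins.</p>

```python
import pandas as pd

# Hypothetical skewed values (not taken from Gold.csv).
prices = pd.Series([60, 70, 80, 100, 160, 180, 200, 280, 500, 3000, 30000, 60000])

# cut: three equal-width bins, so most values land in the first bin.
by_width = pd.cut(prices, bins=3).value_counts().sort_index()

# qcut: three quantile-based bins, so each bin holds the same number of values.
by_count = pd.qcut(prices, q=3).value_counts().sort_index()

print(by_width)  # counts per bin: 10, 1, 1
print(by_count)  # counts per bin: 4, 4, 4
```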