# Step 1: Understanding Today’s Topics
## 1. Binning / Bucketing

- What it is: Converting continuous values into categorical buckets.
- Example: Sales values → Low, Medium, High.
- Why: Some algorithms work better with categorized values; also useful for interpretation.
#### Types:
- Equal Width Binning: Divide the range into intervals of equal width (e.g., 0–100, 100–200, 200–300). Bin counts can be very uneven when the data is skewed.
- Equal Frequency Binning: Each bin has (roughly) the same number of observations.
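
The two binning types can be compared on a small example. The sketch below uses a hypothetical, deliberately skewed toy series (not the Sales data) and the Low/Medium/High labels from the example above; `pd.cut` does equal-width binning and `pd.qcut` does equal-frequency binning.

```python
import pandas as pd

# Hypothetical toy values, skewed on purpose; not taken from the Sales data.
sales = pd.Series([5, 12, 20, 35, 50, 80, 120, 300, 900])
labels = ["Low", "Medium", "High"]

# Equal width: three intervals of equal length between min and max.
by_width = pd.cut(sales, bins=3, labels=labels)

# Equal frequency: three intervals holding roughly the same number of rows.
by_freq = pd.qcut(sales, q=3, labels=labels)

print(pd.DataFrame({"sales": sales, "equal_width": by_width, "equal_freq": by_freq}))
print(by_width.value_counts().sort_index())
print(by_freq.value_counts().sort_index())
```

On this toy series the equal-width version puts eight of the nine values in Low and none in Medium, while the equal-frequency version puts three values in each bucket.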

## 2. Scaling

- What it is: Transforming numerical features so they’re on the same scale.
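- Why: Many ML algorithms are sensitive to feature magnitudes. KNN relies on distances, so large-valued features dominate, and gradient-trained models such as Logistic Regression and Neural Nets tend to converge more slowly on unscaled features.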
#### Common techniques:
- Min-Max Scaling (Normalization): Rescales values to lie between 0 and 1.
  - Formula: $x' = \dfrac{x - \min}{\max - \min}$
  - Keeps the shape of the distribution but compresses it into [0, 1].
- Standardization (Z-score): Rescales values to Mean = 0, Std Dev = 1.
  - Formula: $z = \dfrac{x - \text{mean}}{\text{std}}$
  - A good choice when the data is roughly normally distributed.
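
As a minimal sketch, both formulas can be applied by hand with NumPy and compared with the scikit-learn scalers. The array `x` is a hypothetical toy feature, not part of the dataset. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std()`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy feature; any 1-D numeric array would do.
x = np.array([10.0, 20.0, 30.0, 40.0, 100.0])

# Min-max scaling: (x - min) / (max - min)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, using the population std (ddof=0)
x_standard = (x - x.mean()) / x.std()

# The scikit-learn scalers expect a 2-D array of shape (n_samples, n_features).
X = x.reshape(-1, 1)
sk_minmax = MinMaxScaler().fit_transform(X).ravel()
sk_standard = StandardScaler().fit_transform(X).ravel()

print(x_minmax)
print(np.allclose(x_minmax, sk_minmax), np.allclose(x_standard, sk_standard))
```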

In [13]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.read_csv("data/Sales.csv")

In [14]:
print(df.head(10))

   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
1       2  CA-2017-152156  08/11/2017  11/11/2017    Second Class    CG-12520   
2       3  CA-2017-138688  12/06/2017  16/06/2017    Second Class    DV-13045   
3       4  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   
4       5  US-2016-108966  11/10/2016  18/10/2016  Standard Class    SO-20335   
5       6  CA-2015-115812  09/06/2015  14/06/2015  Standard Class    BH-11710   
6       7  CA-2015-115812  09/06/2015  14/06/2015  Standard Class    BH-11710   
7       8  CA-2015-115812  09/06/2015  14/06/2015  Standard Class    BH-11710   
8       9  CA-2015-115812  09/06/2015  14/06/2015  Standard Class    BH-11710   
9      10  CA-2015-115812  09/06/2015  14/06/2015  Standard Class    BH-11710   

     Customer Name    Segment        Country             City       State  \
0      Claire Gute   Consumer  

In [15]:
df['Sales'].describe()


count     9800.000000
mean       230.769059
std        626.651875
min          0.444000
25%         17.248000
50%         54.490000
75%        210.605000
max      22638.480000
Name: Sales, dtype: float64

In [16]:
bins = pd.cut(df['Sales'], bins=5)  # 5 equal-width bins
df['Sales_Bin_Width'] = bins
print(df["Sales_Bin_Width"])

0       (-22.194, 4528.051]
1       (-22.194, 4528.051]
2       (-22.194, 4528.051]
3       (-22.194, 4528.051]
4       (-22.194, 4528.051]
               ...         
9795    (-22.194, 4528.051]
9796    (-22.194, 4528.051]
9797    (-22.194, 4528.051]
9798    (-22.194, 4528.051]
9799    (-22.194, 4528.051]
Name: Sales_Bin_Width, Length: 9800, dtype: category
Categories (5, interval[float64, right]): [(-22.194, 4528.051] < (4528.051, 9055.658] < (9055.658, 13583.266] < (13583.266, 18110.873] < (18110.873, 22638.48]]
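
The interval edges printed above, including the negative lower edge, can be reproduced by hand. The sketch below assumes only the `min` and `max` values from the `describe()` output: equal-width edges are evenly spaced between them, and `pd.cut` (with its default `right=True`) lowers the first edge by 0.1% of the range so that the minimum value falls inside the first bin.

```python
import numpy as np

# Values copied from the df['Sales'].describe() output above.
sales_min, sales_max = 0.444, 22638.48
n_bins = 5

# Evenly spaced edges: n_bins intervals need n_bins + 1 edges.
edges = np.linspace(sales_min, sales_max, n_bins + 1)

# Mimic pd.cut: extend the lowest edge by 0.1% of the range.
edges[0] -= 0.001 * (sales_max - sales_min)

print(np.round(edges, 3))
```

Because a few very large orders stretch the range to about 22,638, each bin is roughly 4,528 wide, far wider than the 75th percentile of 210.6, so nearly every row lands in the first bin.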


In [17]:
bins = pd.qcut(df['Sales'], q=5)  # 5 equal-frequency bins
df['Sales_Bin_Freq'] = bins
print(df["Sales_Bin_Freq"])

0         (89.834, 283.92]
1       (283.92, 22638.48]
2          (13.827, 34.24]
3       (283.92, 22638.48]
4          (13.827, 34.24]
               ...        
9795       (0.443, 13.827]
9796       (0.443, 13.827]
9797      (89.834, 283.92]
9798       (13.827, 34.24]
9799       (0.443, 13.827]
Name: Sales_Bin_Freq, Length: 9800, dtype: category
Categories (5, interval[float64, right]): [(0.443, 13.827] < (13.827, 34.24] < (34.24, 89.834] < (89.834, 283.92] < (283.92, 22638.48]]


In [18]:
print(df['Sales_Bin_Width'].value_counts().sort_index())
print(df['Sales_Bin_Freq'].value_counts().sort_index())

Sales_Bin_Width
(-22.194, 4528.051]       9773
(4528.051, 9055.658]        19
(9055.658, 13583.266]        5
(13583.266, 18110.873]       2
(18110.873, 22638.48]        1
Name: count, dtype: int64
Sales_Bin_Freq
(0.443, 13.827]       1960
(13.827, 34.24]       1962
(34.24, 89.834]       1958
(89.834, 283.92]      1961
(283.92, 22638.48]    1959
Name: count, dtype: int64


In [19]:
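# Min-max scaling: rescale Sales to [0, 1].
# The scalers expect 2-D input, hence df[['Sales']]; ravel() flattens the result to one column.
scaler_minmax = MinMaxScaler()
df['Sales_MinMax'] = scaler_minmax.fit_transform(df[['Sales']]).ravel()

# Standardization: rescale Sales to mean 0 and standard deviation 1.
scaler_standard = StandardScaler()
df['Sales_Standard'] = scaler_standard.fit_transform(df[['Sales']]).ravel()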

In [None]:
df[['Sales','Sales_MinMax','Sales_Standard','Sales_Bin_Width','Sales_Bin_Freq']].head(10)

|   | Sales | Sales_MinMax | Sales_Standard | Sales_Bin_Width | Sales_Bin_Freq |
|---|---|---|---|---|---|
| 0 | 261.96 | 0.011552 | 0.049776 | (-22.194, 4528.051] | (89.834, 283.92] |
| 1 | 731.94 | 0.032313 | 0.799801 | (-22.194, 4528.051] | (283.92, 22638.48] |
| 2 | 14.62 | 0.000626 | -0.344944 | (-22.194, 4528.051] | (13.827, 34.24] |
| 3 | 957.5775 | 0.04228 | 1.159887 | (-22.194, 4528.051] | (283.92, 22638.48] |
| 4 | 22.368 | 0.000968 | -0.33258 | (-22.194, 4528.051] | (13.827, 34.24] |
| 5 | 48.86 | 0.002139 | -0.290302 | (-22.194, 4528.051] | (34.24, 89.834] |
| 6 | 7.28 | 0.000302 | -0.356658 | (-22.194, 4528.051] | (0.443, 13.827] |
| 7 | 907.152 | 0.040052 | 1.079415 | (-22.194, 4528.051] | (283.92, 22638.48] |
| 8 | 18.504 | 0.000798 | -0.338746 | (-22.194, 4528.051] | (13.827, 34.24] |
| 9 | 114.9 | 0.005056 | -0.184911 | (-22.194, 4528.051] | (89.834, 283.92] |
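
As a worked check, the scaled values for row 0 (Sales = 261.96) can be recomputed from the summary statistics alone. This is a sketch that assumes the rounded figures printed by `describe()` earlier, so agreement is only expected to about six decimals. One detail matters: `describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` uses the population standard deviation (`ddof=0`), so the printed `std` is converted first.

```python
import math

# Summary statistics copied from the df['Sales'].describe() output above.
n, mean, sample_std = 9800, 230.769059, 626.651875
sales_min, sales_max = 0.444, 22638.48

x = 261.96  # Sales value in row 0

# Min-max scaling: (x - min) / (max - min)
minmax = (x - sales_min) / (sales_max - sales_min)

# Convert the sample std (ddof=1) to the population std (ddof=0) used by StandardScaler.
population_std = sample_std * math.sqrt((n - 1) / n)

# Standardization: (x - mean) / std
z = (x - mean) / population_std

print(f"min-max: {minmax:.6f}")
print(f"z-score: {z:.6f}")
```

Both results should agree with the `Sales_MinMax` and `Sales_Standard` values shown for row 0. Using `sample_std` directly gives a z-score that differs slightly, starting at the sixth decimal place.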


In [21]:
df = pd.read_csv("data/superstore.csv")
print(df.head(10))

         Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0  CA-2019-103800  2019-01-03  2019-01-07  Standard Class    DP-13000   
1  CA-2019-112326  2019-01-04  2019-01-08  Standard Class    PO-19195   
2  CA-2019-112326  2019-01-04  2019-01-08  Standard Class    PO-19195   
3  CA-2019-112326  2019-01-04  2019-01-08  Standard Class    PO-19195   
4  CA-2019-141817  2019-01-05  2019-01-12  Standard Class    MB-18085   
5  CA-2019-167199  2019-01-06  2019-01-10  Standard Class    ME-17320   
6  CA-2019-167199  2019-01-06  2019-01-10  Standard Class    ME-17320   
7  CA-2019-167199  2019-01-06  2019-01-10  Standard Class    ME-17320   
8  CA-2019-106054  2019-01-06  2019-01-07     First Class    JO-15145   
9  CA-2019-167199  2019-01-06  2019-01-10  Standard Class    ME-17320   

   Customer Name      Segment        Country          City         State  ...  \
0  Darren Powers     Consumer  United States       Houston         Texas  ...   
1  Phillina Ober  Home Office  Uni