# **1. Binning (or Discretization)**


🔹 Meaning:

Binning means converting continuous (numeric) data into discrete bins or categories.

It’s useful when small variations in numbers don’t matter much, or when you want to make models simpler.

In [None]:
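import pandas as pd

ages = [18, 22, 30, 40, 55]
# Bin edges define the intervals (0, 20], (20, 40], (40, 100]
bins = [0, 20, 40, 100]
labels = ['Teen', 'Adult', 'Senior']
# pd.cut assigns each age to the interval that contains it
pd.cut(ages, bins=bins, labels=labels)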
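
The binning cell does not show its result. As a minimal sketch, the snippet below prints the category assigned to each age and shows how the `right` parameter of `pd.cut` decides which bin a boundary value such as 40 falls into. The names `closed_right` and `closed_left` are illustrative only.

```python
import pandas as pd

ages = [18, 22, 30, 40, 55]
bins = [0, 20, 40, 100]
labels = ['Teen', 'Adult', 'Senior']

# Default (right=True): intervals are (0, 20], (20, 40], (40, 100]
closed_right = pd.cut(ages, bins=bins, labels=labels)
print(list(closed_right))
# ['Teen', 'Adult', 'Adult', 'Adult', 'Senior']

# right=False: intervals are [0, 20), [20, 40), [40, 100)
closed_left = pd.cut(ages, bins=bins, labels=labels, right=False)
print(list(closed_left))
# ['Teen', 'Adult', 'Adult', 'Senior', 'Senior']
```

With the default setting, age 40 counts as 'Adult'. With `right=False` it moves to 'Senior'.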


# **2. Transformation**

🔹 Meaning:

Transformation changes the scale or distribution of a feature so that it is better suited to the model.

In [11]:
import pandas as pd
import numpy as np

# Sample data
data = pd.DataFrame({
    'income': [1000, 3000, 10000, 50000, 100000],
    'population': [100, 400, 1600, 2500, 4900]
})

print(data)


   income  population
0    1000         100
1    3000         400
2   10000        1600
3   50000        2500
4  100000        4900


In [12]:
#1️⃣ Log Transformation
data['income_log'] = np.log(data['income'])
print(data[['income', 'income_log']])


   income  income_log
0    1000    6.907755
1    3000    8.006368
2   10000    9.210340
3   50000   10.819778
4  100000   11.512925
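
`np.log` is only finite for strictly positive inputs, and a zero maps to `-inf`. As a hedged sketch, assuming a column that may contain zeros, `np.log1p` (which computes log(1 + x)) can be used instead, and `np.expm1` undoes it. The series `income_with_zero` is hypothetical data, not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical income column that contains a zero
income_with_zero = pd.Series([0, 1000, 3000, 10000, 50000, 100000])

# log1p computes log(1 + x), so a zero input maps to 0 instead of -inf
income_log1p = np.log1p(income_with_zero)

# expm1 computes exp(x) - 1 and recovers the original values
income_restored = np.expm1(income_log1p)

print(income_log1p.round(3).tolist())
print(income_restored.round(2).tolist())
```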


In [13]:
#2️⃣ Square Root Transformation
data['population_sqrt'] = np.sqrt(data['population'])
print(data[['population', 'population_sqrt']])


   population  population_sqrt
0         100             10.0
1         400             20.0
2        1600             40.0
3        2500             50.0
4        4900             70.0


In [14]:
#3️⃣ Box-Cox / Yeo-Johnson Transformation
from sklearn.preprocessing import PowerTransformer

# Box-Cox (strictly positive values only). The output is also standardized
# to zero mean and unit variance because standardize=True by default.
pt_boxcox = PowerTransformer(method='box-cox')
data['income_boxcox'] = pt_boxcox.fit_transform(data[['income']])

# Yeo-Johnson (also handles zero and negative values)
pt_yeo = PowerTransformer(method='yeo-johnson')
data['population_yeojohnson'] = pt_yeo.fit_transform(data[['population']])

print(data[['income_boxcox', 'population_yeojohnson']])


   income_boxcox  population_yeojohnson
0      -1.378631              -1.461072
1      -0.759936              -0.763382
2      -0.065337               0.235771
3       0.890994               0.641144
4       1.312910               1.347539
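
The transformed values are centered around zero because `PowerTransformer` standardizes its output by default. As a small sketch, the fitted transformer can be inspected through its `lambdas_` attribute and reversed with `inverse_transform`. The names `pt`, `income_bc`, and `income_back` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

income = pd.DataFrame({'income': [1000, 3000, 10000, 50000, 100000]})

pt = PowerTransformer(method='box-cox')  # standardize=True by default
income_bc = pt.fit_transform(income)

# lambdas_ holds the fitted power parameter, one value per column
print(pt.lambdas_)

# The transformed column has (approximately) zero mean and unit variance
print(income_bc.mean(), income_bc.std())

# inverse_transform maps the values back to the original scale
income_back = pt.inverse_transform(income_bc)
print(np.allclose(income_back, income.to_numpy()))
```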


In [15]:
#4️⃣ Normalization (Min-Max scaling to the range [0, 1])
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
data[['income_norm', 'population_norm']] = scaler.fit_transform(data[['income', 'population']])

print(data[['income_norm', 'population_norm']])


   income_norm  population_norm
0     0.000000           0.0000
1     0.020202           0.0625
2     0.090909           0.3125
3     0.494949           0.5000
4     1.000000           1.0000
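
The numbers above follow from the min-max formula (x - min) / (max - min), applied to each column separately. The sketch below recomputes them by hand and compares the result with `MinMaxScaler`. The names `manual_norm` and `sklearn_norm` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({
    'income': [1000, 3000, 10000, 50000, 100000],
    'population': [100, 400, 1600, 2500, 4900]
})

# Min-max formula: (x - min) / (max - min), applied column by column
manual_norm = (data - data.min()) / (data.max() - data.min())

# Same computation through scikit-learn
sklearn_norm = MinMaxScaler().fit_transform(data)

print(manual_norm)
print(np.allclose(manual_norm.to_numpy(), sklearn_norm))
```

For example, the second population value is (400 - 100) / (4900 - 100) = 0.0625, which matches the output above.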


# **3. Encoding**

🔹 Meaning:

Encoding converts categorical (text) data into numeric form, because most ML models expect numeric input.

In [5]:
import pandas as pd

data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red'],
    'Quality': ['Low', 'Medium', 'High', 'Low', 'High', 'Medium', 'High'],
    'Target': [10, 20, 15, 5, 25, 30, 20]  # We'll use this for target encoding
})

print(data)


   Color Quality  Target
0    Red     Low      10
1   Blue  Medium      20
2  Green    High      15
3    Red     Low       5
4   Blue    High      25
5  Green  Medium      30
6    Red    High      20


In [6]:
#1, Label (Ordinal) Encoding
# Quality has a natural order, so map it explicitly. sklearn's LabelEncoder
# would assign codes alphabetically (High=0, Low=1, Medium=2) and lose that order.
order = {'Low': 1, 'Medium': 2, 'High': 3}
data['Quality_LabelEncoded'] = data['Quality'].map(order)

print(data[['Quality', 'Quality_LabelEncoded']])


  Quality  Quality_LabelEncoded
0     Low                     1
1  Medium                     2
2    High                     3
3     Low                     1
4    High                     3
5  Medium                     2
6    High                     3
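
To make the difference concrete, here is a short sketch that compares `LabelEncoder`, which orders classes alphabetically, with `OrdinalEncoder` given an explicit category order. Note that `OrdinalEncoder` starts counting at 0, so its codes are one lower than the manual mapping above. The variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

quality = pd.DataFrame(
    {'Quality': ['Low', 'Medium', 'High', 'Low', 'High', 'Medium', 'High']}
)

# LabelEncoder sorts the classes alphabetically: High=0, Low=1, Medium=2
le = LabelEncoder()
alphabetical = le.fit_transform(quality['Quality'])
print(list(le.classes_), alphabetical.tolist())

# OrdinalEncoder with an explicit category order: Low=0, Medium=1, High=2
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered = oe.fit_transform(quality[['Quality']]).ravel()
print(ordered.tolist())
```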


In [7]:
#2, One-Hot Encoding

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data[['Color']])

# Convert encoded array to DataFrame
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))

# Join with main DataFrame
data_onehot = pd.concat([data, encoded_df], axis=1)
print(data_onehot)


   Color Quality  Target  Quality_LabelEncoded  Color_Blue  Color_Green  \
0    Red     Low      10                     1         0.0          0.0   
1   Blue  Medium      20                     2         1.0          0.0   
2  Green    High      15                     3         0.0          1.0   
3    Red     Low       5                     1         0.0          0.0   
4   Blue    High      25                     3         1.0          0.0   
5  Green  Medium      30                     2         0.0          1.0   
6    Red    High      20                     3         0.0          0.0   

   Color_Red  
0        1.0  
1        0.0  
2        0.0  
3        1.0  
4        0.0  
5        0.0  
6        1.0  


In [8]:
#3, Target Encoding
# Compute mean target per category
target_mean = data.groupby('Color')['Target'].mean()
data['Color_TargetEncoded'] = data['Color'].map(target_mean)

print(target_mean)
print(data[['Color', 'Color_TargetEncoded']])


Color
Blue     22.500000
Green    22.500000
Red      11.666667
Name: Target, dtype: float64
   Color  Color_TargetEncoded
0    Red            11.666667
1   Blue            22.500000
2  Green            22.500000
3    Red            11.666667
4   Blue            22.500000
5  Green            22.500000
6    Red            11.666667
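
Plain mean encoding can be noisy for categories with few rows, since each category mean rests on only two or three samples here. One common variation, sketched below under stated assumptions, blends each category mean with the global mean. The smoothing weight `m` and the column name `Color_TargetSmoothed` are hypothetical choices, not part of the original example. In practice the statistics would be computed on training data only.

```python
import pandas as pd

data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green', 'Red'],
    'Target': [10, 20, 15, 5, 25, 30, 20]
})

m = 2  # hypothetical smoothing weight; larger values pull harder toward the global mean
global_mean = data['Target'].mean()

# Per-category mean and row count
stats = data.groupby('Color')['Target'].agg(['mean', 'count'])

# Weighted average of the category mean and the global mean
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

data['Color_TargetSmoothed'] = data['Color'].map(smoothed)
print(smoothed)
```

Each smoothed value lies between the category mean and the global mean, and categories with more rows stay closer to their own mean.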


In [9]:
pip install category_encoders


Collecting category_encoders
  Downloading category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Downloading category_encoders-2.8.1-py3-none-any.whl (85 kB)
Installing collected packages: category_encoders
Successfully installed category_encoders-2.8.1


In [10]:
#4, Binary Encoding
import pandas as pd
import category_encoders as ce

# Sample data
data = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Yellow', 'Blue', 'Green', 'Red']
})

# Apply Binary Encoding
encoder = ce.BinaryEncoder(cols=['Color'])
data_encoded = encoder.fit_transform(data)

print(data_encoded)


   Color_0  Color_1  Color_2
0        0        0        1
1        0        1        0
2        0        1        1
3        1        0        0
4        0        1        0
5        0        1        1
6        0        0        1
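
The output above is consistent with a simple rule: number the categories 1, 2, 3, ... in order of first appearance, then write each number in binary, one bit per column. The sketch below reproduces the table with pandas alone. It is an approximation inferred from the printed output, not a description of the internals of `category_encoders`, and the names `codes`, `n_bits`, and `manual` are illustrative.

```python
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', 'Yellow', 'Blue', 'Green', 'Red'], name='Color')

# factorize numbers categories from 0 in order of first appearance; shift to start at 1
codes, uniques = pd.factorize(colors)
codes = codes + 1  # Red=1, Blue=2, Green=3, Yellow=4

# Number of bits needed to write the largest code in binary
n_bits = int(codes.max()).bit_length()

# Column i holds bit i of each code, most significant bit first
manual = pd.DataFrame({
    f'Color_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(manual)
```

Four categories need only three columns here, whereas one-hot encoding would need four. The saving grows with the number of categories.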


# **4. Scaling**

🔹 Meaning:

Scaling ensures all features are on a similar range so that one feature doesn’t dominate others due to larger numerical values.

In [2]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform([[1, 1000], [2, 2000], [3, 3000]])
X_scaled

array([[-1.22474487, -1.22474487],
       [ 0.        ,  0.        ],
       [ 1.22474487,  1.22474487]])
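
Both columns end up with identical values even though their raw ranges differ by a factor of 1000. `StandardScaler` computes (x - mean) / std for each column, using the population standard deviation. The sketch below checks this by hand and then scales a new sample with the statistics learned from the fitted data. `X_train` and `X_new` are hypothetical names and `X_new` is made-up data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 1000], [2, 2000], [3, 3000]], dtype=float)
X_new = np.array([[4, 4000]], dtype=float)  # hypothetical unseen sample

scaler = StandardScaler().fit(X_train)

# Manual z-score: (x - mean) / std, using the population standard deviation (ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(manual, scaler.transform(X_train)))

# New data is scaled with the mean and std learned from X_train
print(scaler.transform(X_new))
```

Calling `transform` on new data, rather than `fit_transform`, keeps the scaling consistent between the data used for fitting and later samples.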

# **5. Shuffling**

🔹 Meaning:

Shuffling means randomly rearranging the order of data samples.

🔹 Why it’s done:

To remove order bias (for example, when the data is sorted by time or class)

To help the model generalize better during training

If data isn’t shuffled, the model might learn patterns from order instead of content.

In [4]:
from sklearn.utils import shuffle
import pandas as pd

df = pd.DataFrame({'X':[1,2,3,4], 'Y':['A','B','C','D']})
df_shuffled = shuffle(df, random_state=42)
print(df_shuffled)


   X  Y
1  2  B
3  4  D
0  1  A
2  3  C
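
Two related patterns are sketched below: a pandas-only shuffle with `DataFrame.sample`, and shuffling features and labels together so that each pair stays aligned. The arrays `features` and `targets` are hypothetical stand-ins for a feature matrix and its labels.

```python
import numpy as np
import pandas as pd
from sklearn.utils import shuffle

df = pd.DataFrame({'X': [1, 2, 3, 4], 'Y': ['A', 'B', 'C', 'D']})

# pandas-only alternative: sample all rows, then rebuild a clean 0..n-1 index
df_sampled = df.sample(frac=1, random_state=42).reset_index(drop=True)
print(df_sampled)

# Shuffling features and labels in one call keeps each pair aligned
features = np.array([[1], [2], [3], [4]])
targets = np.array(['A', 'B', 'C', 'D'])
features_shuf, targets_shuf = shuffle(features, targets, random_state=42)
print(features_shuf.ravel(), targets_shuf)
```

Shuffling the two arrays in separate calls with different seeds would break the link between a sample and its label, which is why they are passed together.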
