## **SHETH L.U.J. & SIR M.V. COLLEGE**

**Aayush D. Yadav | T123**
### Practical No. 3
**Aim:** Feature Scaling and Dummification
* Apply feature-scaling techniques like standardization and normalization to numerical features.
* Perform feature dummification to convert categorical variables into numerical representations.

### **Part 1: Handling Numerical Data**

**1: Import Libraries and Load Data**

In [7]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

# Load the dataset
df = pd.read_csv('bank.csv')
print("Original Data Head:")
print(df.head())

Original Data Head:
   age         job  marital  education default  balance housing loan  contact  \
0   59      admin.  married  secondary      no     2343     yes   no  unknown   
1   56      admin.  married  secondary      no       45      no   no  unknown   
2   41  technician  married  secondary      no     1270     yes   no  unknown   
3   55    services  married  secondary      no     2476     yes   no  unknown   
4   54      admin.  married   tertiary      no      184      no   no  unknown   

   day month  duration  campaign  pdays  previous poutcome deposit  
0    5   may      1042         1     -1         0  unknown     yes  
1    5   may      1467         1     -1         0  unknown     yes  
2    5   may      1389         1     -1         0  unknown     yes  
3    5   may       579         1     -1         0  unknown     yes  
4    5   may       673         2     -1         0  unknown     yes  


**2: Rescaling a Feature (MinMax Scaling)**

In [8]:
# We will rescale 'age' to be between 0 and 1
feature_age = df[['age']].values
minmax_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
scaled_age = minmax_scaler.fit_transform(feature_age)

print("Scaled Age (First 5 values):")
print(scaled_age[:5].flatten())

Scaled Age (First 5 values):
[0.53246753 0.49350649 0.2987013  0.48051948 0.46753247]
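
Min-max scaling maps each value to `(x - min) / (max - min)`. The sketch below reproduces the five printed values by hand. The column extremes used here (18 and 95) are an assumption inferred from the printed output; the notebook itself never prints them.

```python
import numpy as np

# First five ages from the data head shown earlier.
ages = np.array([59, 56, 41, 55, 54], dtype=float)

# Assumed column extremes, inferred from the printed scaled values;
# the notebook does not print them explicitly.
age_min, age_max = 18.0, 95.0

# Min-max rescaling to [0, 1]: (x - min) / (max - min)
scaled = (ages - age_min) / (age_max - age_min)
print(scaled.round(8))
```

If the real column minimum or maximum differs, only the two assumed constants need to change.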


**3: Standardizing a Feature (Z-Score & Robust)**

In [9]:
# Standardize 'balance' so it has mean=0 and std=1
feature_balance = df[['balance']].values
scaler = preprocessing.StandardScaler()
standardized_balance = scaler.fit_transform(feature_balance)

print("Standardized Balance (Mean and Std):")
print(f"Mean: {round(standardized_balance.mean())}")
print(f"Std: {standardized_balance.std()}")

# Robust Scaler (Less sensitive to outliers)
robust_scaler = preprocessing.RobustScaler()
robust_balance = robust_scaler.fit_transform(feature_balance)
print("\nRobust Scaled Balance (First 5):")
print(robust_balance[:5].flatten())

Standardized Balance (Mean and Std):
Mean: 0
Std: 1.0

Robust Scaled Balance (First 5):
[ 1.13051702 -0.3184111   0.45397226  1.21437579 -0.23076923]
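
To make the difference between the two scalers concrete, here is a minimal by-hand sketch on a small hypothetical array (not taken from `bank.csv`) that contains one outlier. It assumes the default settings of both scalers: population standard deviation for the z-score, and median with the 25th to 75th percentile range for the robust version.

```python
import numpy as np

# Hypothetical balances with one extreme value.
balances = np.array([0.0, 100.0, 200.0, 300.0, 10000.0])

# Z-score: subtract the mean, divide by the population std (ddof=0).
z_scores = (balances - balances.mean()) / balances.std()

# Robust scaling: subtract the median, divide by the interquartile range.
q1, median, q3 = np.percentile(balances, [25, 50, 75])
robust = (balances - median) / (q3 - q1)

print("z-scores:", z_scores.round(3))
print("robust  :", robust)
```

The outlier inflates the mean and standard deviation, which squeezes the ordinary values together in the z-score output, while the median and interquartile range are unaffected by it.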


**4: Normalizing Observations**

In [10]:
# Rescale each observation (row) of 'duration' and 'campaign' to unit L2 norm (length 1)
features_norm = df[['duration', 'campaign']].values
normalizer = Normalizer(norm="l2")
normalized_features = normalizer.transform(features_norm)

print("Normalized Duration & Campaign (First 5 rows):")
print(normalized_features[:5])

Normalized Duration & Campaign (First 5 rows):
[[9.99999539e-01 9.59692456e-04]
 [9.99999768e-01 6.81663100e-04]
 [9.99999741e-01 7.19942218e-04]
 [9.99998509e-01 1.72711314e-03]
 [9.99995584e-01 2.97175508e-03]]
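
Unlike the scalers in the previous steps, `Normalizer` works row by row: each observation is divided by its own Euclidean length. A minimal NumPy sketch of the same computation, using the five `duration` and `campaign` pairs from the data head:

```python
import numpy as np

# (duration, campaign) for the first five rows of the data head.
rows = np.array([[1042, 1], [1467, 1], [1389, 1], [579, 1], [673, 2]], dtype=float)

# Euclidean (L2) length of each row, kept as a column for broadcasting.
row_norms = np.linalg.norm(rows, axis=1, keepdims=True)

# Divide every row by its own length so each row has norm 1.
unit_rows = rows / row_norms
print(unit_rows)
```

Because `duration` is far larger than `campaign` in every row, the first component ends up very close to 1 and the second very close to 0, which matches the printed output.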


**5: Grouping Observations Using Clustering**

In [11]:
# Group customers into 3 clusters based on 'age' and 'balance'
# (features are left unscaled here, so 'balance' dominates the distances)
features_cluster = df[['age', 'balance']].values
clusterer = KMeans(n_clusters=3, random_state=0)
df['cluster_group'] = clusterer.fit_predict(features_cluster)

print("Clustered Groups (First 5 rows):")
print(df[['age', 'balance', 'cluster_group']].head())

Clustered Groups (First 5 rows):
   age  balance  cluster_group
0   59     2343              1
1   56       45              1
2   41     1270              1
3   55     2476              1
4   54      184              1
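
K-means relies on Euclidean distance, so a feature with a much larger range can decide the clusters almost on its own. The sketch below is a possible variation rather than part of the original practical: it uses six hypothetical customers and compares clustering on raw values with clustering after standardization. The explicit `n_init=10` is a choice made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical (age, balance) pairs: two young, two older, two with high balance.
X = np.array([[25, 100], [30, 150], [60, 120], [65, 90],
              [40, 20000], [45, 21000]], dtype=float)

# Clustering on raw values: distances are driven almost entirely by balance.
labels_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Clustering after standardization: age and balance contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

print("raw   :", labels_raw)
print("scaled:", labels_scaled)
```

On this toy data the raw clustering lumps all four low-balance customers together regardless of age, while the scaled version separates the young and older groups.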


**6: Handling Missing Numerical Values**

In [12]:
# bank.csv is clean, so we artificially create missing values in 'pdays'
# (.loc slicing is label-based and inclusive, so rows 0 through 10 are set to NaN)
df_missing = df.copy()
df_missing.loc[0:10, 'pdays'] = np.nan

imputer_mean = SimpleImputer(strategy="mean")
imputed_pdays = imputer_mean.fit_transform(df_missing[['pdays']])

print("Imputed 'pdays' (First 5 values - originally NaNs):")
print(imputed_pdays[:5].flatten())

Imputed 'pdays' (First 5 values - originally NaNs):
[51.38202852 51.38202852 51.38202852 51.38202852 51.38202852]
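
Every imputed entry is the same number because mean imputation fills each gap with the average of the observed values in that column. A minimal pandas sketch of the same idea, on a short hypothetical series rather than the real `pdays` column:

```python
import numpy as np
import pandas as pd

# Hypothetical 'pdays' values with two missing entries.
pdays = pd.Series([np.nan, np.nan, -1.0, 91.0, 181.0])

# Series.mean() skips NaN by default, so this is the mean of observed values only.
observed_mean = pdays.mean()

# Fill every missing entry with that single value.
filled = pdays.fillna(observed_mean)
print(filled.tolist())
```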


-----

### **Part 2: Handling Categorical Data & Imbalanced Classes**

**7: Imports for Categorical Data**

In [13]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Reload dataset to ensure clean slate for Part 2
df = pd.read_csv('bank.csv')

**8: Encoding Nominal Categorical Features**

In [14]:
# Using LabelBinarizer for 'marital' status (it expects a 1-D array of labels)
feature_marital = df['marital'].values
one_hot = LabelBinarizer()
marital_encoded = one_hot.fit_transform(feature_marital)

print("One-Hot Encoded 'marital' (First 5 rows):")
print(marital_encoded[:5])
print("Classes:", one_hot.classes_)

One-Hot Encoded 'marital' (First 5 rows):
[[0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]
 [0 1 0]]
Classes: ['divorced' 'married' 'single']
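
Since the aim of this practical mentions dummification, it may help to see the pandas route as well. The sketch below uses `pd.get_dummies` on a small hypothetical series; the `marital_` prefix and the use of `drop_first` are choices made for this illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical marital values (same categories as in the notebook output).
marital = pd.Series(["married", "single", "divorced", "married"], name="marital")

# One dummy column per category, in sorted category order.
dummies = pd.get_dummies(marital, prefix="marital", dtype=int)

# Dropping the first category leaves k-1 columns, which avoids a redundant column.
dummies_k_minus_1 = pd.get_dummies(marital, prefix="marital", drop_first=True, dtype=int)

print(dummies)
print(dummies_k_minus_1)
```

The k-1 form is often preferred for linear models, where the full set of dummy columns is perfectly collinear with the intercept.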


**9: Encoding Dictionaries of Features**

In [15]:
# Convert 'job' and 'education' to a list of dictionaries (one per row) and then vectorize
data_dict = df[['job', 'education']].to_dict(orient='records')
dictvectorizer = DictVectorizer(sparse=False)
features_dict = dictvectorizer.fit_transform(data_dict)

print("Dictionary Vectorized Features (First row):")
print(features_dict[0])
print("Feature Names:", dictvectorizer.get_feature_names_out()[:5])

Dictionary Vectorized Features (First row):
[0. 1. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Feature Names: ['education=primary' 'education=secondary' 'education=tertiary'
 'education=unknown' 'job=admin.']


**10: Encoding Ordinal Categorical Features (Binning)**

In [16]:
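# Bin 'age' into logical groups: Young (0, 30], Middle (30, 55], Senior (55, 100]
# (pd.cut includes the right edge of each bin by default, so age 55 falls in "Middle")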
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 55, 100], labels=["Young", "Middle", "Senior"])

# Map labels to numeric values
scale_mapper = {"Young": 1, "Middle": 2, "Senior": 3}
df['age_group_encoded'] = df['age_group'].map(scale_mapper)

print("Binned and Encoded Age (First 5 rows):")
print(df[['age', 'age_group', 'age_group_encoded']].head())

Binned and Encoded Age (First 5 rows):
   age age_group age_group_encoded
0   59    Senior                 3
1   56    Senior                 3
2   41    Middle                 2
3   55    Middle                 2
4   54    Middle                 2
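
Row 3 in the output (age 55 labelled "Middle") comes from how `pd.cut` treats bin edges. A short sketch with the same five ages and the same bins shows what changes when `right=False` is passed instead; that second call is only for comparison and is not used in the notebook.

```python
import pandas as pd

ages = pd.Series([59, 56, 41, 55, 54])
bins = [0, 30, 55, 100]
labels = ["Young", "Middle", "Senior"]

# Default: intervals are (0, 30], (30, 55], (55, 100], so 55 is "Middle".
right_closed = pd.cut(ages, bins=bins, labels=labels)

# right=False: intervals are [0, 30), [30, 55), [55, 100), so 55 is "Senior".
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```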


**11: Imputing Missing Class Values (using KNN)**

In [17]:
# Predict missing 'loan' values using 'age' and 'balance'
X_knn = df[['age', 'balance']].values
y_knn = df['loan'].values

# Create training set (skipping first 10 rows to simulate them as "missing")
X_train = X_knn[10:]
y_train = y_knn[10:]

# Train KNN
clf_knn = KNeighborsClassifier(n_neighbors=3, weights='distance')
clf_knn.fit(X_train, y_train)

# Predict for a "missing" row (e.g., row 0)
predicted_loan = clf_knn.predict([X_knn[0]])
print(f"Predicted missing 'loan' value for row 0: {predicted_loan[0]}")

Predicted missing 'loan' value for row 0: no
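
The cell above only simulates missing labels by holding rows out. When a column really contains gaps, the same idea could be applied with a boolean mask, as in this minimal sketch. The toy frame and its values are hypothetical, chosen so the two groups are well separated.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: the last two rows have no 'loan' label.
toy = pd.DataFrame({
    "age": [25, 27, 30, 60, 62, 65, 26, 63],
    "balance": [100, 150, 120, 5000, 5200, 4800, 130, 5100],
    "loan": ["yes", "yes", "yes", "no", "no", "no", None, None],
})

known = toy["loan"].notna()
features = ["age", "balance"]

# Train only on rows where the label is present.
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(toy.loc[known, features].to_numpy(), toy.loc[known, "loan"].to_numpy())

# Fill the gaps with the predicted class.
toy.loc[~known, "loan"] = knn.predict(toy.loc[~known, features].to_numpy())
print(toy)
```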


**12: Handling Imbalanced Classes**

In [18]:
# Check balance of target 'deposit'
print("Target Distribution:")
print(df['deposit'].value_counts())

# Downsample the majority class ('no')
i_class_no = np.where(df['deposit'] == 'no')[0]
i_class_yes = np.where(df['deposit'] == 'yes')[0]

# Downsample 'no' to match the 'yes' count (seeded so the sample is reproducible)
n_yes = len(i_class_yes)
np.random.seed(0)
downsampled_no_indices = np.random.choice(i_class_no, size=n_yes, replace=False)

# Combine indices
final_indices = np.hstack((i_class_yes, downsampled_no_indices))
df_balanced = df.iloc[final_indices]

print("\nBalanced Target Distribution (After Downsampling):")
print(df_balanced['deposit'].value_counts())

Target Distribution:
deposit
no     5873
yes    5289
Name: count, dtype: int64

Balanced Target Distribution (After Downsampling):
deposit
yes    5289
no     5289
Name: count, dtype: int64
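
The same downsampling can be written so that it does not hard-code which class is the majority. This is a sketch on a hypothetical label series, using NumPy's `default_rng` with a fixed seed (an assumption made here for reproducibility).

```python
import numpy as np
import pandas as pd

# Hypothetical imbalanced target: 7 'no' and 4 'yes'.
deposit = pd.Series(["no"] * 7 + ["yes"] * 4)

rng = np.random.default_rng(0)  # fixed seed so the sample is repeatable

idx_no = np.flatnonzero((deposit == "no").to_numpy())
idx_yes = np.flatnonzero((deposit == "yes").to_numpy())

# Order the two index arrays by size: smaller one is the minority class.
minority, majority = sorted([idx_no, idx_yes], key=len)

# Sample the majority class without replacement down to the minority size.
kept_majority = rng.choice(majority, size=len(minority), replace=False)

balanced = deposit.iloc[np.concatenate([minority, kept_majority])]
print(balanced.value_counts())
```

Downsampling discards data from the larger class; with classes as close in size as 5873 and 5289, that loss is small.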
