# **SHETH L.U.J. & SIR M.V. COLLEGE**

**Shreeraj Desai | T075**
### Practical No. 3B
**Aim**: Feature Scaling and Dummification

*   Apply feature-scaling techniques like standardization and normalization to numerical features.
*   Perform feature dummification to convert categorical variables into numerical representations.
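
The two scaling techniques named in the aim can be sketched as follows. This is a minimal illustration rather than part of the recorded run: it assumes scikit-learn's `StandardScaler` and `MinMaxScaler`, and the hypothetical array `sample` holds only the `credit_amount` and `age` values of the first five rows printed in Section 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# First five credit_amount and age values from clients.csv (illustrative sample only)
sample = np.array([
    [7000, 39],
    [19000, 20],
    [29000, 23],
    [10000, 30],
    [14500, 25],
], dtype=float)

# Standardization: subtract the column mean, divide by the column standard deviation
standardized = StandardScaler().fit_transform(sample)

# Normalization (min-max): rescale each column to the [0, 1] range
normalized = MinMaxScaler().fit_transform(sample)

print(standardized.round(3))
print(normalized.round(3))
```

After standardization each column has mean 0 and unit variance; after min-max normalization the smallest value in each column maps to 0 and the largest to 1.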

## 1. Initial Setup and Data Loading

In [22]:
import pandas as pd
import numpy as np
from io import StringIO
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import resample

df = pd.read_csv('Datasets/clients.csv')

# Synthetic multi-label column for a MultiLabelBinarizer example.
# Left disabled: the 11 tuples below only fit an 11-row DataFrame, not clients.csv.
# df['hobbies'] = [
#     ("sports", "reading"), ("music",), ("sports", "cooking"), ("reading", "music"), ("travel",),
#     ("cooking",), ("sports",), ("reading", "travel"), ("music", "cooking"), ("sports",), ("travel", "reading")
# ]

print("Original DataFrame:")
print(df.head())

Original DataFrame:
   month  credit_amount  credit_term  age     sex  \
0      1           7000           12   39    male   
1      1          19000            6   20    male   
2      1          29000           12   23  female   
3      1          10000           12   30    male   
4      1          14500           12   25  female   

                     education          product_type  having_children_flg  \
0  Secondary special education           Cell phones                    0   
1  Secondary special education  Household appliances                    1   
2  Secondary special education  Household appliances                    0   
3  Secondary special education           Cell phones                    1   
4             Higher education           Cell phones                    0   

   region  income family_status  phone_operator  is_client  bad_client_target  
0       2   21000       Another               0          0                  0  
1       2   17000       Another       

## 2. Encoding Nominal Categorical Features

### Solution 1: Using scikit-learn's `LabelBinarizer`

In [23]:
# LabelBinarizer expects a 1-D array of labels, so select the column as a Series
feature_lb = df['product_type']
one_hot_encoder = LabelBinarizer()
transformed_feature = one_hot_encoder.fit_transform(feature_lb)

print("One-hot encoded 'product_type' with LabelBinarizer:")
print(transformed_feature)
print("\nFeature classes:", one_hot_encoder.classes_)
print("\nReversed transformation (first 5):", one_hot_encoder.inverse_transform(transformed_feature)[:5])

One-hot encoded 'product_type' with LabelBinarizer:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Feature classes: ['Audio & Video' 'Auto' 'Boats' 'Cell phones' "Childen's goods" 'Clothing'
 'Computers' 'Construction Materials' 'Cosmetics and beauty services'
 'Fishing and hunting supplies' 'Fitness' 'Furniture' 'Garden equipment'
 'Household appliances' 'Jewelry' 'Medical services' 'Music'
 'Repair Services' 'Sporting goods' 'Tourism' 'Training' 'Windows & Doors']

Reversed transformation (first 5): ['Cell phones' 'Household appliances' 'Household appliances' 'Cell phones'
 'Cell phones']


### Solution 2: Using pandas `get_dummies`

In [24]:
df_dummies = pd.get_dummies(df, columns=['product_type'], prefix='prod')
print("DataFrame after applying get_dummies to 'product_type':")
print(df_dummies.head())

DataFrame after applying get_dummies to 'product_type':
   month  credit_amount  credit_term  age     sex  \
0      1           7000           12   39    male   
1      1          19000            6   20    male   
2      1          29000           12   23  female   
3      1          10000           12   30    male   
4      1          14500           12   25  female   

                     education  having_children_flg  region  income  \
0  Secondary special education                    0       2   21000   
1  Secondary special education                    1       2   17000   
2  Secondary special education                    0       2   31000   
3  Secondary special education                    1       2   31000   
4             Higher education                    0       2   26000   

  family_status  ...  prod_Garden equipment  prod_Household appliances  \
0       Another  ...                  False                      False   
1       Another  ...                  False       
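
Two `get_dummies` options are often useful alongside the call above. The sketch below is an assumption-laden illustration on a hypothetical three-row frame `toy`, not the real dataset: `dtype=int` produces 0/1 columns instead of the boolean values printed above, and `drop_first=True` drops one indicator per feature.

```python
import pandas as pd

# Hypothetical toy frame standing in for clients.csv
toy = pd.DataFrame({
    "age": [39, 20, 23],
    "product_type": ["Cell phones", "Household appliances", "Cell phones"],
})

# dtype=int yields 0/1 indicator columns instead of True/False
toy_dummies = pd.get_dummies(toy, columns=["product_type"], prefix="prod", dtype=int)

# drop_first=True removes the first level of each encoded feature, which avoids
# perfectly collinear indicator columns in linear models
toy_dropped = pd.get_dummies(toy, columns=["product_type"], prefix="prod",
                             dtype=int, drop_first=True)

print(toy_dummies)
print(toy_dropped.columns.tolist())
```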

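### Solution 3: Handling Multi-Label Features with `MultiLabelBinarizer`

In [25]:
# MultiLabelBinarizer expects one iterable of labels per sample. A bare string
# would be iterated character by character, so wrap each value in a 1-tuple.
multiclass_feature = df['product_type'].apply(lambda value: (value,))
one_hot_multiclass = MultiLabelBinarizer()
transformed_multiclass = one_hot_multiclass.fit_transform(multiclass_feature)

print("One-hot encoded 'product_type' column:")
print(transformed_multiclass)
print("\nClasses found:", one_hot_multiclass.classes_)

One-hot encoded 'product_type' column:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]

Classes found: ['Audio & Video' 'Auto' 'Boats' 'Cell phones' "Childen's goods" 'Clothing'
 'Computers' 'Construction Materials' 'Cosmetics and beauty services'
 'Fishing and hunting supplies' 'Fitness' 'Furniture' 'Garden equipment'
 'Household appliances' 'Jewelry' 'Medical services' 'Music'
 'Repair Services' 'Sporting goods' 'Tourism' 'Training' 'Windows & Doors']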
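
`MultiLabelBinarizer` is designed for features where one sample can carry several labels at once, and it treats each sample as an iterable of labels (a tuple, list, or set). A minimal sketch on a hypothetical `hobbies` feature, which is not a column of `clients.csv`, shows the intended use:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multi-label feature: each sample is a tuple of hobbies
hobbies = [
    ("sports", "reading"),
    ("music",),
    ("sports", "cooking"),
    ("travel",),
]

mlb = MultiLabelBinarizer()
hobby_matrix = mlb.fit_transform(hobbies)

# One column per distinct label, in sorted order; a row may contain several 1s
print(mlb.classes_)
print(hobby_matrix)
```

Each row has a 1 in every column whose label appears in that sample's tuple, so the first row is set for both `reading` and `sports`.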


## 3. Encoding Ordinal Categorical Features

In [26]:
df_ordinal = df.copy()
education_mapper = {
    "Secondary education": 1,
    "Secondary special education": 2,
    "Higher education": 3,
    "PhD": 4
}
df_ordinal['education_encoded'] = df_ordinal['education'].replace(education_mapper)

print("Original vs. Encoded 'education' column:")
print(df_ordinal[['education', 'education_encoded']].head(10))

Original vs. Encoded 'education' column:
                     education education_encoded
0  Secondary special education                 2
1  Secondary special education                 2
2  Secondary special education                 2
3  Secondary special education                 2
4             Higher education                 3
5  Secondary special education                 2
6             Higher education                 3
7             Higher education                 3
8  Secondary special education                 2
9  Secondary special education                 2
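
One detail of the mapping step is worth checking: `Series.replace` leaves any value that is absent from the mapper untouched, so an unlisted education level would stay in `education_encoded` as a string. A hedged sketch using `Series.map`, which returns `NaN` for unmapped values and therefore makes them easy to find, is shown below. The level `"Incomplete higher education"` is hypothetical and only serves to trigger the case.

```python
import pandas as pd

education = pd.Series([
    "Secondary special education",
    "Higher education",
    "Secondary education",
    "Incomplete higher education",  # hypothetical level missing from the mapper
])

education_mapper = {
    "Secondary education": 1,
    "Secondary special education": 2,
    "Higher education": 3,
    "PhD": 4,
}

# map() returns NaN for every value the mapper does not cover
mapped = education.map(education_mapper)

# List the levels that still need a rank before the column is used in a model
unmapped = education[mapped.isna()].unique().tolist()

print(mapped.tolist())
print("Levels without a rank:", unmapped)
```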


## 4. Encoding Dictionaries of Features

In [27]:
data_dict = df[['sex', 'family_status', 'region']].to_dict(orient='records')
dictvectorizer = DictVectorizer(sparse=False)
features_dict = dictvectorizer.fit_transform(data_dict)

print("Resulting feature matrix from dictionaries (first 5 rows):")
print(features_dict[:5])
print("\nFeature names created by DictVectorizer:", dictvectorizer.get_feature_names_out())

Resulting feature matrix from dictionaries (first 5 rows):
[[1. 0. 0. 2. 0. 1.]
 [1. 0. 0. 2. 0. 1.]
 [1. 0. 0. 2. 1. 0.]
 [0. 0. 1. 2. 0. 1.]
 [0. 1. 0. 2. 1. 0.]]

Feature names created by DictVectorizer: ['family_status=Another' 'family_status=Married' 'family_status=Unmarried'
 'region' 'sex=female' 'sex=male']


## 5. Imputing Missing Class Values

### Solution 1: Using a KNN Classifier

In [28]:
# Impute missing 'having_children_flg' labels with a KNN classifier

print("### Imputing Missing Values with KNN ###")
df_impute_knn = pd.read_csv('Datasets/clients2.csv')

features_for_imputation = ['age', 'income']

# Split data into training (no missing values) and prediction (missing values) sets
train_data = df_impute_knn[df_impute_knn['having_children_flg'].notna()]
predict_data = df_impute_knn[df_impute_knn['having_children_flg'].isna()]

# Only proceed if there are missing values to predict
if not predict_data.empty:
    # Also ensure there is enough data to train the model
    if len(train_data) >= 3: # k=3 for our classifier
        clf_knn = KNeighborsClassifier(3, weights='distance')
        trained_model = clf_knn.fit(train_data[features_for_imputation], train_data['having_children_flg'])

        # Predict and fill the missing values
        imputed_values = trained_model.predict(predict_data[features_for_imputation])
        df_impute_knn.loc[df_impute_knn['having_children_flg'].isna(), 'having_children_flg'] = imputed_values
    else:
        print("Warning: Not enough data to train the KNN imputer. Skipping imputation.")
else:
    print("No missing values found in 'having_children_flg' to impute.")


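# Note: this "before" count is read from `df` (clients.csv), not from
# `df_impute_knn` (clients2.csv), so it does not describe the frame imputed above
print("Missing values in 'having_children_flg' before KNN imputation:", df['having_children_flg'].isna().sum())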
print("Missing values in 'having_children_flg' after KNN imputation:", df_impute_knn['having_children_flg'].isna().sum())
print("\n" + "="*50 + "\n")

### Imputing Missing Values with KNN ###
Missing values in 'having_children_flg' before KNN imputation: 0
Missing values in 'having_children_flg' after KNN imputation: 0
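
To make the before/after comparison meaningful, both counts should come from the frame that is being imputed, and the "before" count has to be taken prior to filling. A minimal sketch on hypothetical data (the frame `toy` and its values are invented for illustration; `clients2.csv` is not used):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: two clear clusters, with the label missing for the last two rows
toy = pd.DataFrame({
    "age":    [22, 24, 23, 51, 49, 50, 25, 48],
    "income": [15000, 16000, 15500, 42000, 40000, 41000, 15800, 41500],
    "having_children_flg": [0, 0, 0, 1, 1, 1, np.nan, np.nan],
})

target = "having_children_flg"
features = ["age", "income"]

# Record the count before anything is filled in
missing_mask = toy[target].isna()
missing_before = int(missing_mask.sum())

# Train on rows with a known label, then predict the label of the remaining rows
knn = KNeighborsClassifier(n_neighbors=3, weights="distance")
knn.fit(toy.loc[~missing_mask, features], toy.loc[~missing_mask, target])
toy.loc[missing_mask, target] = knn.predict(toy.loc[missing_mask, features])

missing_after = int(toy[target].isna().sum())
print("before:", missing_before, "after:", missing_after)
```

Because `income` is several orders of magnitude larger than `age`, it dominates the distance computation here; scaling the features first is a common precaution.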




### Solution 2: Using the Most Frequent Value

In [29]:
df_impute_freq = df.copy()
imputer = SimpleImputer(strategy='most_frequent')
# fit_transform returns an (n, 1) array; ravel() flattens it before assignment
df_impute_freq['having_children_flg'] = imputer.fit_transform(df_impute_freq[['having_children_flg']]).ravel()

print("Missing values in 'having_children_flg' before SimpleImputer:", df['having_children_flg'].isna().sum())
print("Missing values in 'having_children_flg' after SimpleImputer:", df_impute_freq['having_children_flg'].isna().sum())

Missing values in 'having_children_flg' before SimpleImputer: 0
Missing values in 'having_children_flg' after SimpleImputer: 0
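
Since `clients.csv` contains no missing values in this column, the cell above leaves the data unchanged. The effect of the `most_frequent` strategy is easier to see on a hypothetical column with gaps, as in this small sketch (the frame `toy` is invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with two missing labels; 0 is the most frequent value
toy = pd.DataFrame({"having_children_flg": [0, 1, 0, np.nan, 0, np.nan]})

imputer = SimpleImputer(strategy="most_frequent")
# fit_transform returns a 2-D array; ravel() flattens it to a single column
filled = imputer.fit_transform(toy[["having_children_flg"]]).ravel()

print("before:", int(toy["having_children_flg"].isna().sum()))
print("after:", int(np.isnan(filled).sum()))
print("fill value:", imputer.statistics_[0])
```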


## 6. Handling Imbalanced Classes

In [30]:
class_counts = df['bad_client_target'].value_counts()
print("Class distribution for 'bad_client_target':")
print(class_counts)

df_model = df.dropna().copy()
X = df_model[['credit_amount', 'credit_term', 'age', 'income']]
y = df_model['bad_client_target']

Class distribution for 'bad_client_target':
bad_client_target
0    1527
1     196
Name: count, dtype: int64


### Solution 1: Using Class Weights

In [31]:
balanced_rf = RandomForestClassifier(class_weight='balanced', random_state=42)
print("RandomForestClassifier with balanced class weights:")
print(balanced_rf)

RandomForestClassifier with balanced class weights:
RandomForestClassifier(class_weight='balanced', random_state=42)
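
With `class_weight='balanced'`, scikit-learn weights each class by `n_samples / (n_classes * count_of_class)`, so the rarer class receives the larger weight. The sketch below computes these weights for the class counts printed in Section 6; the label vector `y_counts` is rebuilt from those counts for illustration and is not the actual `y` column.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the class counts printed above (1527 zeros, 196 ones)
y_counts = np.array([0] * 1527 + [1] * 196)
classes = np.array([0, 1])

# Weights as scikit-learn derives them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)

# The same values from the formula n_samples / (n_classes * count_per_class)
manual = len(y_counts) / (len(classes) * np.bincount(y_counts))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```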


### Solution 2: Downsampling the Majority Class

In [32]:
df_majority = df_model[df_model.bad_client_target == 0]
df_minority = df_model[df_model.bad_client_target == 1]
df_majority_downsampled = resample(df_majority,
                                 replace=False,
                                 n_samples=len(df_minority),
                                 random_state=42)

df_downsampled = pd.concat([df_majority_downsampled, df_minority])
print("Class distribution after downsampling:")
print(df_downsampled.bad_client_target.value_counts())

Class distribution after downsampling:
bad_client_target
0    196
1    196
Name: count, dtype: int64


### Solution 3: Upsampling the Minority Class

In [33]:
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print("Class distribution after upsampling:")
print(df_upsampled.bad_client_target.value_counts())

Class distribution after upsampling:
bad_client_target
0    1527
1    1527
Name: count, dtype: int64
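
When a resampled frame is later used for model evaluation, a common precaution is to split first and resample only the training part, so that duplicated minority rows cannot appear on both sides of the split. The following is a sketch under that assumption, on a hypothetical 100-row frame `toy` rather than `df_model`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical imbalanced frame: 90 rows of class 0, 10 rows of class 1
toy = pd.DataFrame({
    "income": range(100),
    "bad_client_target": [0] * 90 + [1] * 10,
})

# Split first, stratifying so both parts keep the original class ratio
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["bad_client_target"], random_state=42
)

# Upsample the minority class inside the training part only
train_majority = train[train["bad_client_target"] == 0]
train_minority = train[train["bad_client_target"] == 1]
train_minority_up = resample(
    train_minority, replace=True, n_samples=len(train_majority), random_state=42
)
train_balanced = pd.concat([train_majority, train_minority_up])

print(train_balanced["bad_client_target"].value_counts())
print(test["bad_client_target"].value_counts())
```

The test part keeps its original imbalance, which is what a deployed model would face.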
