### (b) Questions:

1. Calculate the percentage of missing values in each column.
2. Replace missing values with the mean if the percentage of missing values is less than 10%.
3. Perform mode imputation for categorical data.
4. Use a KNN imputer to estimate the missing values.
5. Drop the columns with more than 10% missing values and display the size.
6. Drop the rows with an outlier Z-score value > 3 and display the size.
7. Drop the duplicate rows based on more than 50% of the columns having the same value.
8. Rescale the data using min-max normalization for a numerical feature.
9. Binarize the data using the `Binarizer` class in Python.
10. Perform one-hot encoding for a categorical feature.

In [213]:
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler, Binarizer

# Read the data
data = pd.read_csv('sample_dataset.csv')

In [214]:
numeric_cols = data.select_dtypes(exclude=['object']).columns
print(numeric_cols)

Index(['ID', 'Age', 'Income', 'Score'], dtype='object')


In [215]:
# Calculate the % of missing values in a column
missing_percent = data[numeric_cols].isnull().sum() / len(data) * 100
print("\nPercentage of missing values in each column:")
print(missing_percent)


Percentage of missing values in each column:
ID        0.0
Age       0.0
Income    0.0
Score     0.0
dtype: float64
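The cell above only checks the numeric columns, and the sample file reports 0% missing everywhere. As a minimal sketch on a hypothetical toy frame (the `toy` data and its values are assumptions, not part of `sample_dataset.csv`), the same percentage can be computed for every column, categorical ones included, with `isna().mean()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Age": [25, np.nan, 31, 40, 22],
    "Gender": ["Male", "Female", None, "Female", None],
    "Score": [88.0, 92.5, 79.0, 65.0, 70.5],
})

# isna().mean() gives the fraction of missing entries per column
missing_pct = toy.isna().mean() * 100
print(missing_pct)
```

Here `Age` has one missing entry out of five and `Gender` has two, so both would exceed a 10% threshold.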


In [216]:
# Replace missing value with mean if the % of missing value is less than 10%
missing_cols = missing_percent[missing_percent < 10].index
data[missing_cols] = data[missing_cols].fillna(data[missing_cols].mean())

In [217]:
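# Perform mode imputation for categorical data
# (assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update the DataFrame in recent pandas versions)
for col in data.select_dtypes(include=['object']):
    data[col] = data[col].fillna(data[col].mode()[0])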

In [218]:
num_cols = data.select_dtypes(exclude=['object']).columns

# Perform a KNN Imputer to estimate the missing values
imputer = KNNImputer()
data_imputed = imputer.fit_transform(data[num_cols])
data[num_cols] = data_imputed
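Since the sample data reported 0% missing values, the KNN cell above has nothing to fill. The following is a minimal sketch on a hypothetical array (the values and `n_neighbors=2` are assumptions chosen for illustration) showing how `KNNImputer` estimates a gap from the nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the second row is missing its second feature
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Use the 2 nearest rows (measured on the features that are present)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

The two rows closest to `[2.0, nan]` are the first and third, so the gap is filled with the average of their second feature.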

In [219]:
# Drop the columns with more than 10% missing values and display the size
# (missing_percent was computed before imputation and covers numeric columns only)
missing_cols_to_drop = missing_percent[missing_percent > 10].index
data.drop(columns=missing_cols_to_drop, inplace=True)
print("\nDataset size after dropping columns with more than 10% missing values:", data.shape)


Dataset size after dropping columns with more than 10% missing values: (100, 6)


In [220]:
from scipy.stats import zscore

# Drop the rows with outlier Z-score value > 3 and display the size
z_scores = zscore(data.select_dtypes(exclude=['object']))
outliers = np.abs(z_scores) > 3  # flag values more than 3 standard deviations from the mean
data = data[~outliers.any(axis=1)].copy()
print("\nDataset size after dropping rows with outlier Z-score value:", data.shape)


Dataset size after dropping rows with outlier Z-score value: (100, 6)
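No rows were removed from the sample data, so the effect of the filter is not visible in the output above. A minimal sketch on a hypothetical `Income` column (the values are assumptions, with one deliberately extreme entry) illustrates the same Z-score rule:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Hypothetical column: 19 ordinary values plus one extreme value
toy = pd.DataFrame({"Income": list(range(40, 59)) + [500]})

# Absolute Z-score of each value (population standard deviation, scipy's default)
z = np.abs(zscore(toy["Income"].to_numpy()))

# Keep only the rows within 3 standard deviations of the mean
filtered = toy[z <= 3]
print(toy.shape, "->", filtered.shape)
```

Only the row holding 500 exceeds the threshold, so the frame shrinks by one row.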


---

In [221]:
# Drop the duplicate rows based on more than 50% of column having same value

# Calculate the percentage of duplicate values in each column
duplicate_percent = (data.apply(pd.Series.duplicated, axis=0).sum() / len(data)) * 100

# Select columns where more than 50% of values are duplicates
columns_to_check = duplicate_percent[duplicate_percent > 50].index

# Drop duplicate rows based on selected columns
# (reassign instead of using inplace=True, since data is a filtered copy)
data = data.drop_duplicates(subset=columns_to_check)

# Print dataset size after dropping duplicate rows
print("\nDataset size after dropping duplicate rows:", data.shape)


Dataset size after dropping duplicate rows: (90, 6)


---

In [None]:
# Alternative: drop rows that are exact duplicates across all non-constant columns
data = data.drop_duplicates(subset=data.columns[data.nunique() > 1])
print("\nDataset size after dropping duplicate rows:", data.shape)
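The task wording is ambiguous, and both cells above interpret it at the column level. Another possible reading is that a row counts as a duplicate when it matches an earlier row on more than 50% of the columns. The sketch below is one way to express that reading, not the notebook's method; the `toy` frame and the `kept` list are hypothetical names introduced for illustration:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "Age": [25, 25, 40, 40],
    "Gender": ["F", "F", "M", "M"],
    "Income": [50, 50, 70, 90],
    "Score": [1, 2, 3, 4],
})

kept = []  # positional indices of the rows retained so far
for i in range(len(toy)):
    row = toy.iloc[i]
    # Fraction of columns on which row i equals each previously kept row
    is_near_dup = any((row == toy.iloc[j]).mean() > 0.5 for j in kept)
    if not is_near_dup:
        kept.append(i)

deduped = toy.iloc[kept]
print(deduped.shape)
```

The second row matches the first on three of four columns and is dropped; the fourth matches the third on exactly half, which does not exceed 50%, so it stays. The pairwise comparison is quadratic in the number of rows, so this form only suits small datasets.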

In [222]:
# Rescale your data using min-max normalization for a numerical feature
scaler = MinMaxScaler()
numerical_cols = data.select_dtypes(exclude=['object']).columns
data[numerical_cols] = scaler.fit_transform(data[numerical_cols])
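Min-max normalization maps each value to `(x - min) / (max - min)`, so every column ends up in the range [0, 1]. A minimal sketch on a hypothetical single-feature array (the values are assumptions) compares the formula with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single feature with three observations
ages = np.array([[20.0], [30.0], [60.0]])

# Manual min-max formula
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Same transformation via scikit-learn
scaled = MinMaxScaler().fit_transform(ages)
print(manual.ravel(), scaled.ravel())
```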

In [223]:
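# Binarize the data using scikit-learn's Binarizer class
# Note: the default threshold is 0.0, so after min-max scaling every value
# above the column minimum becomes 1
binarizer = Binarizer()
data_binarized = binarizer.fit_transform(data[numerical_cols])
data[numerical_cols] = data_binarized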
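`Binarizer` maps values strictly greater than its threshold to 1 and everything else to 0. As a minimal sketch on a hypothetical already-scaled array (the values and the 0.5 cut-off are assumptions), a mid-range threshold gives a more informative split than the default of 0.0:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical values already scaled to [0, 1]
scaled = np.array([[0.0, 0.2], [0.5, 0.9], [1.0, 0.4]])

# Default threshold 0.0: everything above zero becomes 1
default_bits = Binarizer().fit_transform(scaled)

# Threshold 0.5: only values strictly greater than 0.5 become 1
half_bits = Binarizer(threshold=0.5).fit_transform(scaled)
print(default_bits)
print(half_bits)
```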

In [224]:
# Perform the one-hot encoding for a categorical feature
data = pd.get_dummies(data)
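Calling `pd.get_dummies(data)` encodes every object-typed column at once. To encode a single categorical feature, the `columns` argument can name it explicitly. A minimal sketch on a hypothetical frame (the `toy` data is an assumption):

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Age": [30, 25, 41],
})

# Encode only the Gender column; Age is passed through unchanged
encoded = pd.get_dummies(toy, columns=["Gender"])
print(encoded)
```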

In [225]:
# Display the modified dataset
print("\nModified Dataset:")
print(data.head())


Modified Dataset:
    ID  Age  Income  Score  Gender_Female  Gender_Male  Education_Bachelor  \
0  0.0  1.0     1.0    1.0           True        False               False   
1  1.0  1.0     1.0    1.0           True        False               False   
2  1.0  0.0     1.0    1.0           True        False                True   
3  1.0  1.0     1.0    1.0           True        False               False   
4  1.0  1.0     1.0    1.0          False         True               False   

   Education_College  Education_High School  Education_Master  Education_PhD  
0              False                  False             False           True  
1              False                  False             False           True  
2              False                  False             False          False  
3              False                  False              True          False  
4              False                  False              True          False  
