This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection techniques, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply various techniques such as scaling, encoding, and feature selection methods including isolation forest and PPS score analysis.
Dataset:
The provided "Adult" dataset is used to predict whether income exceeds $50K/yr based on census data.
Tasks:
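1.	Data Exploration and Preprocessing:
●	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).
●	Handle missing values as per the best practices (imputation, removal, etc.).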
●	Apply scaling techniques to numerical features:
a.	Standard Scaling   b. Min-Max Scaling
●	Discuss the scenarios where each scaling technique is preferred and why.




In [None]:
import numpy as np
import pandas as pd

In [None]:
data=pd.read_csv('adult_with_headers (1).csv')

In [None]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
data.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
education_num,0
marital_status,0
occupation,0
relationship,0
race,0
sex,0


The data has no explicit null values.

In [None]:
data.duplicated().sum()

np.int64(24)

The data contains 24 duplicate rows, so we should remove them.

In [None]:
# Dropping duplicate rows
data.drop_duplicates(inplace=True)

In [None]:
data.duplicated().sum()

np.int64(0)

The data no longer contains any duplicate rows.

In [None]:
data.dtypes

Unnamed: 0,0
age,int64
workclass,object
fnlwgt,int64
education,object
education_num,int64
marital_status,object
occupation,object
relationship,object
race,object
sex,object


In [None]:
data


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


The data has 9 categorical columns. Let's look at their unique values before applying the encoding techniques.

In [None]:
# Unique values of the categorical columns
for i in data.columns:
  if data[i].dtype=='object':
    print(i)
    print(data[i].unique())

workclass
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']
education
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']
marital_status
[' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']
occupation
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']
relationship
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']
race
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']
sex
[' Male' ' Female']
native_country
[' United-States' ' Cuba' ' Jamaica' ' India' '

The unique values above show a question mark (?) placeholder in the workclass column (occupation contains one too). Now let's replace the question mark in workclass with NaN.

In [None]:
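# Replace the question mark with NaN. The raw values carry a leading space (' ?'),
# so strip workclass first; columns that are not stripped keep their ' ?' entries.
data['workclass']=data['workclass'].astype(str).str.strip()
data.replace('?',np.nan,inplace=True)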

In [None]:
data.workclass.unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', nan, 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)
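
Only workclass is stripped above, so a `' ?'` in any other text column would survive the replacement. A minimal sketch of applying the same clean-up to every non-numeric column, on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with the same ' ?' placeholder style as the raw file (made-up rows).
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " Tech-support", " ?"],
    "age": [39, 50, 38],
})

# Pick every non-numeric column instead of naming a single one.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

# Strip the leading space, then turn the bare '?' into NaN.
toy[text_cols] = toy[text_cols].apply(lambda s: s.str.strip())
toy[text_cols] = toy[text_cols].replace("?", np.nan)

missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

With this variant, `isnull().sum()` reports the placeholders of every text column, not only workclass.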

In [None]:
data['workclass'].value_counts()

workclass,count
Private,22673
Self-emp-not-inc,2540
Local-gov,2093
State-gov,1298
Self-emp-inc,1116
Federal-gov,960
Without-pay,14
Never-worked,7


After replacing the question mark, the workclass column contains null values.
Now I handle the null values with imputation, replacing each NaN with the most frequent value (the mode) of workclass.

In [None]:
data.isnull().sum()

Unnamed: 0,0
age,0
workclass,1836
fnlwgt,0
education,0
education_num,0
marital_status,0
occupation,0
relationship,0
race,0
sex,0


In [None]:
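# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
data['workclass']=data['workclass'].fillna(data['workclass'].mode()[0])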

In [None]:
data.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
education_num,0
marital_status,0
occupation,0
relationship,0
race,0
sex,0


In [None]:
data.workclass.unique()

array(['State-gov', 'Self-emp-not-inc', 'Private', 'Federal-gov',
       'Local-gov', 'Self-emp-inc', 'Without-pay', 'Never-worked'],
      dtype=object)

In [None]:
# Let's encode the categorical columns in the given data
cat_col = data.select_dtypes(include='object').columns
cat_col


Index(['workclass', 'education', 'marital_status', 'occupation',
       'relationship', 'race', 'sex', 'native_country', 'income'],
      dtype='object')

Here I am applying the label encoder because most of the categorical columns have more than 5 classes (sex, race and income are the exceptions).

In [None]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
for i in cat_col:
  data[i]=le.fit_transform(data[i])

In [None]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,6,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,5,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,3,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,3,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,3,338409,9,13,2,10,5,2,0,0,0,40,5,0
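
The loop above reuses a single `le`, so once it finishes the encoder only remembers the classes of the last column it saw. If the integer codes need to be mapped back to labels later, one option is to keep one encoder per column. A sketch on toy data, where the `encoders` dictionary is a hypothetical helper and not part of the notebook:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical data (illustrative values only).
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "race": ["White", "Black", "White"],
})

# One fitted encoder per column, so every mapping stays available.
encoders = {}
for col in ["sex", "race"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Codes can be turned back into the original labels.
decoded_sex = encoders["sex"].inverse_transform(toy["sex"])
print(toy)
print(decoded_sex)
```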


In [None]:
data.dtypes

Unnamed: 0,0
age,int64
workclass,int64
fnlwgt,int64
education,int64
education_num,int64
marital_status,int64
occupation,int64
relationship,int64
race,int64
sex,int64


In [None]:
# Standard scaling
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()


In [None]:
num_col=data.select_dtypes(include='number').columns

In [None]:
num_col

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')

In [None]:
for col in num_col:
  data[col]=sc.fit_transform(data[col].values.reshape(-1, 1))

In [None]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.03039,2.623449,-1.063569,-0.335266,1.134777,0.921857,-1.317629,-0.277864,0.393685,0.70302,0.148292,-0.216743,-0.035664,0.291335,-0.563377
1,0.836973,1.720541,-1.008668,-0.335266,1.134777,-0.405919,-0.608318,-0.900126,0.393685,0.70302,-0.145975,-0.216743,-2.222483,0.291335,-0.563377
2,-0.042936,-0.085276,0.24504,0.181519,-0.420679,-1.733696,-0.135444,-0.277864,0.393685,0.70302,-0.145975,-0.216743,-0.035664,0.291335,-0.563377
3,1.05695,-0.085276,0.425752,-2.402406,-1.198407,-0.405919,-0.135444,-0.900126,-1.962488,0.70302,-0.145975,-0.216743,-0.035664,0.291335,-0.563377
4,-0.776193,-0.085276,1.408066,-0.335266,1.134777,-0.405919,0.810304,2.211186,-1.962488,-1.422436,-0.145975,-0.216743,-0.035664,-4.056151,-0.563377


In [None]:
data[num_col].mean()

Unnamed: 0,0
age,-1.856229e-17
workclass,2.6642350000000002e-17
fnlwgt,1.550497e-17
education,2.838939e-17
education_num,5.677878e-18
marital_status,1.0263860000000001e-17
occupation,1.6160110000000002e-17
relationship,2.893534e-17
race,-2.052771e-17
sex,-3.996352e-17


In [None]:
data[num_col].std()

Unnamed: 0,0
age,1.000015
workclass,1.000015
fnlwgt,1.000015
education,1.000015
education_num,1.000015
marital_status,1.000015
occupation,1.000015
relationship,1.000015
race,1.000015
sex,1.000015
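
The standard deviations print as 1.000015 rather than exactly 1 because `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `std()` defaults to the sample standard deviation (ddof=1). A small sketch of the difference on five toy values (the frame `values` is made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five toy values; with so few rows the ddof effect is easy to see.
values = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
scaled = pd.Series(StandardScaler().fit_transform(values).ravel())

population_std = scaled.std(ddof=0)  # what StandardScaler normalises to 1
sample_std = scaled.std()            # pandas default, ddof=1

print(population_std, sample_std)
```

The ratio between the two is sqrt(n / (n - 1)), which is large for 5 rows and shrinks towards 1 as the number of rows grows.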


In [None]:
# Min-max scaling
from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler()


In [None]:
data[num_col]=mm.fit_transform(data[num_col])

In [None]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,0.30137,0.857143,0.044302,0.6,0.8,0.666667,0.071429,0.2,1.0,1.0,0.02174,0.0,0.397959,0.95122,0.0
1,0.452055,0.714286,0.048238,0.6,0.8,0.333333,0.285714,0.0,1.0,1.0,0.0,0.0,0.122449,0.95122,0.0
2,0.287671,0.428571,0.138113,0.733333,0.533333,0.0,0.428571,0.2,1.0,1.0,0.0,0.0,0.397959,0.95122,0.0
3,0.493151,0.428571,0.151068,0.066667,0.4,0.333333,0.428571,0.0,0.5,1.0,0.0,0.0,0.397959,0.95122,0.0
4,0.150685,0.428571,0.221488,0.6,0.8,0.333333,0.714286,1.0,0.5,0.0,0.0,0.0,0.397959,0.121951,0.0


Discuss the scenarios where each scaling technique is preferred and why.

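StandardScaler is also known as z-score standardization.
It rescales each feature to mean = 0 and standard deviation = 1.
It is preferred when the data is roughly bell shaped, and it is less distorted by outliers than min-max scaling because it does not squeeze the values into a fixed range.
It is mostly used for linear regression, logistic regression, SVM etc.


Min-max scaler is also known as normalization.
It rescales each feature to a fixed range, 0 to 1 by default.
It is preferred when the data is not bell shaped or when a bounded range is needed, but it is sensitive to outliers because the minimum and maximum set the scale.
It is mostly used for KNN and for neural networks trained with gradient descent. Decision trees do not need scaling, since their splits do not depend on the scale of a feature.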
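
As a worked example of the two formulas, z = (x - mean) / std for standard scaling and x' = (x - min) / (max - min) for min-max scaling, the sketch below scales five made-up values that include one outlier (100). The array `x` is illustrative only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four ordinary values and one outlier, as a single feature column.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x).ravel()  # mean 0, std 1
m = MinMaxScaler().fit_transform(x).ravel()    # range 0 to 1

print(z)
print(m)
```

In the min-max result the outlier takes the value 1 and the four ordinary values are all pushed below 0.05, which shows how strongly a single extreme value sets the scale.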

2. Encoding Techniques:
●	Apply One-Hot Encoding to categorical variables with less than 5 categories.
●	Use Label Encoding for categorical variables with more than 5 categories.
●	Discuss the pros and cons of One-Hot Encoding and Label Encoding.


We have already encoded the data above.
We did not use one-hot encoding because most of the categorical columns have more than 5 classes, so we used the label encoder.

Label Encoding:
It is an encoding technique that converts text or categorical data into numerical data.
It maps each unique category to an integer value.

Advantages
Simple and fast to use
Suited to ordinal data
Memory efficient, since it does not create extra columns

Disadvantages
Implies a false order between the categories of a column
Not suitable for nominal data

One-Hot Encoding:
It creates a new binary column for each category.

Advantages:
Suited to nominal data
Implies no order; all categories are treated equally

Disadvantages
Uses more memory and takes more time when a column has many categories
Can lead to multicollinearity, because the new columns of one feature always sum to 1
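
The two encodings can be compared side by side on a small example. This is a sketch on made-up rows (the frame `toy` is illustrative), using `pd.get_dummies` for one-hot encoding of a low-cardinality column and `LabelEncoder` for a column with more categories:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows: 'sex' has 2 categories, 'education' stands in for a wider column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "education": ["Bachelors", "HS-grad", "Masters", "HS-grad"],
})

# One-hot: one 0/1 column per category.
one_hot = pd.get_dummies(toy["sex"], prefix="sex", dtype=int)

# Dropping the first category removes the redundant column (avoids multicollinearity).
one_hot_dropped = pd.get_dummies(toy["sex"], prefix="sex", drop_first=True, dtype=int)

# Label encoding: one integer per category, assigned in sorted order.
label_codes = LabelEncoder().fit_transform(toy["education"])

print(one_hot)
print(one_hot_dropped)
print(label_codes)
```

The label codes follow alphabetical order, not education level, which is the false ordering mentioned under the disadvantages.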

3. Feature Engineering:
●	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
●	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.


In [None]:
data['netcapital']=data['capital_gain']-data['capital_loss']

In [None]:
data['netcapital']

Unnamed: 0,netcapital
0,0.021740
1,0.000000
2,0.000000
3,0.000000
4,0.000000
...,...
32556,0.000000
32557,0.000000
32558,0.000000
32559,0.000000


In [None]:
def age_group(age):
  if age < 25:
    return 'Young'
  elif age < 45:
    return "Adult"
  else:
    return 'Senior'
data['age_group']=data['age'].apply(age_group)

In [None]:
data['age_group']

Unnamed: 0,age_group
0,Young
1,Young
2,Young
3,Young
4,Young
...,...
32556,Young
32557,Young
32558,Young
32559,Young


In [None]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,netcapital,age_group
0,0.30137,0.857143,0.044302,0.6,0.8,0.666667,0.071429,0.2,1.0,1.0,0.02174,0.0,0.397959,0.95122,0.0,0.02174,Young
1,0.452055,0.714286,0.048238,0.6,0.8,0.333333,0.285714,0.0,1.0,1.0,0.0,0.0,0.122449,0.95122,0.0,0.0,Young
2,0.287671,0.428571,0.138113,0.733333,0.533333,0.0,0.428571,0.2,1.0,1.0,0.0,0.0,0.397959,0.95122,0.0,0.0,Young
3,0.493151,0.428571,0.151068,0.066667,0.4,0.333333,0.428571,0.0,0.5,1.0,0.0,0.0,0.397959,0.95122,0.0,0.0,Young
4,0.150685,0.428571,0.221488,0.6,0.8,0.333333,0.714286,1.0,0.5,0.0,0.0,0.0,0.397959,0.121951,0.0,0.0,Young


In [None]:
# Encoding the new feature
data['age_group']=le.fit_transform(data['age_group'])

In [None]:
data

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,netcapital,age_group
0,0.301370,0.857143,0.044302,0.600000,0.800000,0.666667,0.071429,0.2,1.0,1.0,0.021740,0.0,0.397959,0.951220,0.0,0.021740,0
1,0.452055,0.714286,0.048238,0.600000,0.800000,0.333333,0.285714,0.0,1.0,1.0,0.000000,0.0,0.122449,0.951220,0.0,0.000000,0
2,0.287671,0.428571,0.138113,0.733333,0.533333,0.000000,0.428571,0.2,1.0,1.0,0.000000,0.0,0.397959,0.951220,0.0,0.000000,0
3,0.493151,0.428571,0.151068,0.066667,0.400000,0.333333,0.428571,0.0,0.5,1.0,0.000000,0.0,0.397959,0.951220,0.0,0.000000,0
4,0.150685,0.428571,0.221488,0.600000,0.800000,0.333333,0.714286,1.0,0.5,0.0,0.000000,0.0,0.397959,0.121951,0.0,0.000000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.136986,0.428571,0.166404,0.466667,0.733333,0.333333,0.928571,1.0,1.0,0.0,0.000000,0.0,0.377551,0.951220,0.0,0.000000,0
32557,0.315068,0.428571,0.096500,0.733333,0.533333,0.333333,0.500000,0.0,1.0,1.0,0.000000,0.0,0.397959,0.951220,1.0,0.000000,0
32558,0.561644,0.428571,0.094827,0.733333,0.533333,1.000000,0.071429,0.8,1.0,0.0,0.000000,0.0,0.397959,0.951220,0.0,0.000000,0
32559,0.068493,0.428571,0.128499,0.733333,0.533333,0.666667,0.071429,0.6,1.0,1.0,0.000000,0.0,0.193878,0.951220,0.0,0.000000,0


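Two new features were added: netcapital and age_group. netcapital is the difference between capital_gain and capital_loss, so it summarises both columns in a single value. age_group bins age into the categories Young, Adult and Senior, so a model can pick up differences between age categories that are not linear in age. Note that both features were computed after scaling: age was already in the range 0 to 1, so every row falls under the age < 25 threshold and is labelled Young, and netcapital is a difference of two separately scaled columns. Both features should be computed from the raw values, before scaling.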
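
A minimal sketch of building the same two features from unscaled values, assuming a small made-up frame `raw` (the rows are illustrative) and using `pd.cut` with the same 25 and 45 thresholds as `age_group`:

```python
import pandas as pd

# Unscaled toy rows (made-up values in the original units).
raw = pd.DataFrame({
    "age": [39, 50, 22],
    "capital_gain": [2174, 0, 0],
    "capital_loss": [0, 0, 1500],
})

# Net capital in the original units.
raw["netcapital"] = raw["capital_gain"] - raw["capital_loss"]

# right=False gives the intervals [0, 25), [25, 45), [45, 200),
# matching the "age < 25" and "age < 45" checks.
raw["age_group"] = pd.cut(
    raw["age"],
    bins=[0, 25, 45, 200],
    right=False,
    labels=["Young", "Adult", "Senior"],
)

print(raw)
```

Scaling and encoding can then be applied afterwards to the new columns together with the original ones.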

In [None]:
#Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

data[num_col].skew().sort_values(ascending=False)

Unnamed: 0,0
capital_gain,11.949403
capital_loss,4.592702
fnlwgt,1.447703
income,1.211687
relationship,0.786548
age,0.557663
hours_per_week,0.228759
occupation,0.114586
workclass,0.076273
marital_status,-0.012753


capital_gain and capital_loss are highly right skewed.

In [None]:
data['has_capital_gain'] = (data['capital_gain'] > 0).astype(int)


In [None]:
upper_limit = data['capital_gain'].quantile(0.95)
data['capital_gain_capped'] = np.where(data['capital_gain'] > upper_limit, upper_limit, data['capital_gain'])
data['capital_gain_log'] = np.log1p(data['capital_gain_capped'])


In [None]:
data['capital_gain_log'].skew()


np.float64(3.300411181272353)

The skewness of capital_gain is reduced from about 11.95 to about 3.30.
In this way, capping at the 95th percentile followed by a log transformation (log1p, which also handles the zero values) reduces the skewness of the feature.
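
Since log1p(x) is close to x for small x, it changes the shape of a column very little once that column has been squeezed into the range 0 to 1, so the transformation is usually applied to raw values. A sketch on synthetic right-skewed data (a lognormal sample generated here, not the census data):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample (not the census data).
rng = np.random.default_rng(0)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# The same sample after min-max scaling to the range 0 to 1.
scaled = (raw - raw.min()) / (raw.max() - raw.min())

skew_raw = raw.skew()
skew_log_raw = np.log1p(raw).skew()        # log applied to raw values
skew_log_scaled = np.log1p(scaled).skew()  # log applied after scaling

print(skew_raw, skew_log_raw, skew_log_scaled)
```

On this sample the log of the raw values ends up much less skewed than the log of the scaled values.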