### Objective:
This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection techniques, which are crucial for building efficient machine learning models. You will work with a provided dataset to apply various techniques such as scaling, encoding, and feature selection methods including isolation forest and PPS score analysis.
Dataset:
The "Adult" dataset, which is used to predict whether a person's income exceeds $50K/yr based on census data.
Tasks:
1. Data Exploration and Preprocessing:
•	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).
•	Handle missing values as per the best practices (imputation, removal, etc.).
•	Apply scaling techniques to numerical features:
•	Standard Scaling
•	Min-Max Scaling
•	Discuss the scenarios where each scaling technique is preferred and why.
2. Encoding Techniques:
•	Apply One-Hot Encoding to categorical variables with fewer than 5 categories.
•	Use Label Encoding for categorical variables with more than 5 categories.
•	Discuss the pros and cons of One-Hot Encoding and Label Encoding.
3. Feature Engineering:
•	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.
•	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.
4. Feature Selection:
•	Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.
•	Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Data Exploration and Preprocessing

In [3]:
df=pd.read_csv('adult_with_headers.csv')

In [4]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [5]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [6]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64

In [7]:
df.duplicated()

0        False
1        False
2        False
3        False
4        False
         ...  
32556    False
32557    False
32558    False
32559    False
32560    False
Length: 32561, dtype: bool
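
The boolean Series above does not show at a glance whether any duplicate rows exist. A minimal sketch of counting and dropping duplicates follows; `demo` is a made-up toy frame, not the Adult data.

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are identical
demo = pd.DataFrame({"age": [39, 50, 50], "sex": ["Male", "Male", "Male"]})

# duplicated() marks rows that repeat an earlier row, so the sum is the duplicate count
n_duplicates = int(demo.duplicated().sum())
deduplicated = demo.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Applied to the real data, the same two calls on `df` would report the count and return a de-duplicated copy.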

In [8]:
df.describe()

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [10]:
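# Handle missing values: fill numeric NaNs with the column mean
# (numeric_only=True skips the text columns, which have no mean)
df.fillna(df.mean(numeric_only=True), inplace=True)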



In [11]:
# Statistical Summary
# Compute measures of central tendency and dispersion
summary_stats = df.describe()
print(summary_stats)

                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  


In [12]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [13]:
# Let's assume '?' represents missing values in the dataset
df.replace('?', np.nan, inplace=True)
# Mean-impute the numeric columns only; text columns keep any NaN created above
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df[numeric_cols]), columns=numeric_cols)

In [55]:
df_imputed

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,39.0,77516.0,13.0,2174.0,0.0,40.0
1,50.0,83311.0,13.0,0.0,0.0,13.0
2,38.0,215646.0,9.0,0.0,0.0,40.0
3,53.0,234721.0,7.0,0.0,0.0,40.0
4,28.0,338409.0,13.0,0.0,0.0,40.0
...,...,...,...,...,...,...
32556,27.0,257302.0,12.0,0.0,0.0,38.0
32557,40.0,154374.0,9.0,0.0,0.0,40.0
32558,58.0,151910.0,9.0,0.0,0.0,40.0
32559,22.0,201490.0,9.0,0.0,0.0,20.0
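
Mean imputation only applies to numeric columns, so any missing categorical values would need a separate strategy. One common option is filling with the most frequent value. The sketch below assumes missing entries are marked with '?' (possibly with stray whitespace); `demo` is a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the text columns of the dataset
demo = pd.DataFrame({
    "workclass": ["Private", " ?", "Private", "State-gov"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})

# Strip stray whitespace first so that values like ' ?' are also caught
demo = demo.apply(lambda col: col.str.strip()).replace("?", np.nan)

# Fill each text column with its most frequent value (the mode)
for col in demo.columns:
    demo[col] = demo[col].fillna(demo[col].mode()[0])

print(demo)
```

Whether mode imputation, a separate "Unknown" category, or row removal is most appropriate depends on how many values are missing and why.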


In [56]:
# Scaling techniques
# Standard Scaling
scaler_standard = StandardScaler()

In [58]:
df_standard_scaled = scaler_standard.fit_transform(df_imputed)
df_standard_scaled

array([[ 0.03067056, -1.06361075,  1.13473876,  0.1484529 , -0.21665953,
        -0.03542945],
       [ 0.83710898, -1.008707  ,  1.13473876, -0.14592048, -0.21665953,
        -2.22215312],
       [-0.04264203,  0.2450785 , -0.42005962, -0.14592048, -0.21665953,
        -0.03542945],
       ...,
       [ 1.42360965, -0.35877741, -0.42005962, -0.14592048, -0.21665953,
        -0.03542945],
       [-1.21564337,  0.11095988, -0.42005962, -0.14592048, -0.21665953,
        -1.65522476],
       [ 0.98373415,  0.92989258, -0.42005962,  1.88842434, -0.21665953,
        -0.03542945]])

In [15]:
# Min-Max Scaling
scaler_minmax = MinMaxScaler()
df_minmax_scaled = scaler_minmax.fit_transform(df_imputed)

In [59]:
df_minmax_scaled

array([[0.30136986, 0.0443019 , 0.8       , 0.02174022, 0.        ,
        0.39795918],
       [0.45205479, 0.0482376 , 0.8       , 0.        , 0.        ,
        0.12244898],
       [0.28767123, 0.13811345, 0.53333333, 0.        , 0.        ,
        0.39795918],
       ...,
       [0.56164384, 0.09482688, 0.53333333, 0.        , 0.        ,
        0.39795918],
       [0.06849315, 0.12849934, 0.53333333, 0.        , 0.        ,
        0.19387755],
       [0.47945205, 0.18720338, 0.53333333, 0.1502415 , 0.        ,
        0.39795918]])

#### Scenarios for each scaling technique:
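Standard Scaling:
- Rescales each feature to zero mean and unit variance: z = (x - mean) / std.
- Preferred when features are roughly Gaussian, or when the algorithm assumes zero-centered inputs (e.g. PCA, SVMs, regularized linear and logistic regression).
- Less distorted by outliers than Min-Max scaling, although extreme values still shift the mean and standard deviation.

Min-Max Scaling:
- Rescales each feature to a fixed range, [0, 1] by default: x' = (x - min) / (max - min).
- Preferred when inputs must lie in a bounded range, for example pixel intensities or neural network inputs.
- Sensitive to outliers: a single extreme value compresses all other values into a narrow band. Here capital_gain has a maximum of 99999, so a gain of 2174 maps to about 0.02. Handle outliers before applying Min-Max scaling.

Both techniques put features with different units on a comparable scale.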
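
To make the outlier sensitivity concrete, here is a small sketch on a made-up feature with one extreme value. The array `x` is hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature with one extreme value
x = np.array([[0.0], [10.0], [20.0], [30.0], [1000.0]])

standard = StandardScaler().fit_transform(x)  # (x - mean) / std
minmax = MinMaxScaler().fit_transform(x)      # (x - min) / (max - min)

print(standard.ravel().round(3))
print(minmax.ravel().round(3))
```

With Min-Max scaling the four ordinary values end up squeezed close to 0 while the extreme value sits at 1, which is the same pattern visible in the scaled `capital_gain` column.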

### Encoding Techniques

In [16]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

In [20]:
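# Note: nunique() is checked on every column, not only the text columns,
# so numeric columns such as 'age' and 'fnlwgt' also land in the second list
categorical_less_than_5 = [col for col in df.columns if df[col].nunique() < 5]
categorical_more_than_5 = [col for col in df.columns if df[col].nunique() >= 5]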

In [53]:
categorical_less_than_5

['sex', 'income']

In [54]:
categorical_more_than_5

['age',
 'workclass',
 'fnlwgt',
 'education',
 'education_num',
 'marital_status',
 'occupation',
 'relationship',
 'race',
 'capital_gain',
 'capital_loss',
 'hours_per_week',
 'native_country']
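
The second list above contains numeric columns because the split looks only at the number of unique values. A minimal sketch of restricting the split to non-numeric columns first follows; `demo` and the list names are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame mixing numeric and text columns
demo = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37],
    "sex": ["Male", "Male", "Male", "Male", "Female", "Female"],
    "education": ["Bachelors", "HS-grad", "11th", "Masters", "9th", "Doctorate"],
})

# Only non-numeric columns are candidates for categorical encoding
text_cols = demo.select_dtypes(exclude="number").columns

low_cardinality = [c for c in text_cols if demo[c].nunique() < 5]
high_cardinality = [c for c in text_cols if demo[c].nunique() >= 5]

print(low_cardinality)   # columns to one-hot encode
print(high_cardinality)  # columns to label encode
```

This only works while the text columns still have a non-numeric dtype, so the split has to run before any label encoding is applied to the frame.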

In [50]:
# Apply One-Hot Encoding to categorical variables with less than 5 categories
one_hot_encoder = OneHotEncoder()
one_hot_encoded = pd.DataFrame(one_hot_encoder.fit_transform(df[categorical_less_than_5]).toarray(), columns=one_hot_encoder.get_feature_names_out(categorical_less_than_5))
one_hot_encoded

Unnamed: 0,sex_0,sex_1,income_0,income_1
0,0.0,1.0,1.0,0.0
1,0.0,1.0,1.0,0.0
2,0.0,1.0,1.0,0.0
3,0.0,1.0,1.0,0.0
4,1.0,0.0,1.0,0.0
...,...,...,...,...
32556,1.0,0.0,1.0,0.0
32557,0.0,1.0,0.0,1.0
32558,1.0,0.0,1.0,0.0
32559,0.0,1.0,1.0,0.0


In [51]:
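# Use Label Encoding for the columns with 5 or more unique values.
# Caution: this list also contains numeric columns, whose values are replaced by
# rank-like codes (e.g. age 39 becomes 22 in the output below)
label_encoder = LabelEncoder()
label_encoded = df[categorical_more_than_5].apply(label_encoder.fit_transform)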
label_encoded

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country
0,22,7,2671,9,12,4,1,1,4,25,0,39,39
1,33,6,2926,9,12,2,4,0,4,0,0,12,39
2,21,4,14086,11,8,0,6,1,4,0,0,39,39
3,36,4,15336,1,6,2,6,0,2,0,0,39,39
4,11,4,19355,9,12,2,10,5,2,0,0,39,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,10,4,16528,7,11,2,13,5,4,0,0,37,39
32557,23,4,8080,11,8,2,7,0,4,0,0,39,39
32558,41,4,7883,11,8,6,1,4,4,0,0,39,39
32559,5,4,12881,11,8,4,1,3,4,0,0,19,39


In [52]:
# Concatenate one-hot encoded and label encoded dataframes with original dataframe
df_encoded = pd.concat([one_hot_encoded, label_encoded, df.drop(columns=categorical_less_than_5+categorical_more_than_5)], axis=1)
df_encoded

Unnamed: 0,sex_0,sex_1,income_0,income_1,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,capital_gain,capital_loss,hours_per_week,native_country,total_work_hours,capital_ratio,capital_gain_log
0,0.0,1.0,1.0,0.0,22,7,2671,9,12,4,1,1,4,25,0,39,39,520,2174.0,7.684784
1,0.0,1.0,1.0,0.0,33,6,2926,9,12,2,4,0,4,0,0,12,39,169,0.0,0.000000
2,0.0,1.0,1.0,0.0,21,4,14086,11,8,0,6,1,4,0,0,39,39,360,0.0,0.000000
3,0.0,1.0,1.0,0.0,36,4,15336,1,6,2,6,0,2,0,0,39,39,280,0.0,0.000000
4,1.0,0.0,1.0,0.0,11,4,19355,9,12,2,10,5,2,0,0,39,5,520,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,1.0,0.0,1.0,0.0,10,4,16528,7,11,2,13,5,4,0,0,37,39,456,0.0,0.000000
32557,0.0,1.0,0.0,1.0,23,4,8080,11,8,2,7,0,4,0,0,39,39,360,0.0,0.000000
32558,1.0,0.0,1.0,0.0,41,4,7883,11,8,6,1,4,4,0,0,39,39,360,0.0,0.000000
32559,0.0,1.0,1.0,0.0,5,4,12881,11,8,4,1,3,4,0,0,19,39,180,0.0,0.000000


#### Pros and Cons of One-Hot Encoding and Label Encoding:

One-Hot Encoding:
Pros:
- Keeps every category as its own binary column, so no information is lost.
- Does not impose any ordinal relationship between categories.
- Works well with models that interpret feature values numerically, such as linear models, distance-based methods, and neural networks.

Cons:
- Increases the dimensionality of the dataset, which can be problematic for high-cardinality categorical variables.
- Keeping one column per category makes the columns perfectly collinear (they always sum to 1), which can destabilize unregularized linear models. Dropping one category per variable avoids this.

Label Encoding:
Pros:
- Keeps a single column per variable, so dimensionality does not grow.
- Compact even for high-cardinality variables, and usually adequate for tree-based models, which split on thresholds.

Cons:
- Imposes an arbitrary order: LabelEncoder assigns integers by sorted category name, not by any meaningful ranking, which can mislead linear and distance-based models.
- Does not preserve a true ordinal relationship unless the category order is specified explicitly (for example with an ordinal encoder).

The choice between One-Hot Encoding and Label Encoding depends on the specific characteristics of your dataset, the nature of the categorical variables, and the requirements of the machine learning model being used.
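
As a rough illustration of the two points above (dropping one dummy column, and giving ordinal categories an explicit order), consider the sketch below. The toy frame and the education ranking are assumptions made for the example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy frame; the education ranking below is an assumption for illustration
demo = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "education": ["HS-grad", "Bachelors", "Masters"],
})

# One-hot with one category dropped per variable, avoiding perfectly collinear columns
one_hot = OneHotEncoder(drop="first")
sex_encoded = one_hot.fit_transform(demo[["sex"]]).toarray()

# Ordinal encoding with an explicit category order instead of alphabetical codes
ordinal = OrdinalEncoder(categories=[["HS-grad", "Bachelors", "Masters"]])
education_encoded = ordinal.fit_transform(demo[["education"]])

print(sex_encoded.ravel())
print(education_encoded.ravel())
```

In the Adult data, `education_num` already provides such an ordered numeric version of `education`.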

### Feature Engineering

In [27]:
from sklearn.preprocessing import FunctionTransformer

In [60]:
# Clean column names (if needed)
df.columns = df.columns.str.strip().str.lower().str.replace('-', '_')
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income', 'total_work_hours', 'capital_ratio', 'capital_gain_log'],
      dtype='object')

In [61]:
# Create at least 2 new features:
# 1. total_work_hours: product of 'hours_per_week' and 'education_num'. Despite the name, this is an
#    interaction term between hours worked and education level, not a count of hours.
df['total_work_hours'] = df['hours_per_week'] * df['education_num']
df['total_work_hours']

0        520
1        169
2        360
3        280
4        520
        ... 
32556    456
32557    360
32558    360
32559    180
32560    360
Name: total_work_hours, Length: 32561, dtype: int64

In [62]:
# 2. capital_ratio: capital gain divided by (capital loss + 1).
#    When capital_loss is 0, which is the case for most rows, this simply equals capital_gain.
df['capital_ratio'] = df['capital_gain'] / (df['capital_loss'] + 1)  # Adding 1 to avoid division by zero errors
df['capital_ratio']

0         2174.0
1            0.0
2            0.0
3            0.0
4            0.0
          ...   
32556        0.0
32557        0.0
32558        0.0
32559        0.0
32560    15024.0
Name: capital_ratio, Length: 32561, dtype: float64

In [63]:
# Apply a transformation to at least one skewed numerical feature:
# 'capital_gain' is strongly right-skewed (median 0, maximum 99999)
# A log transformation compresses the long right tail
df['capital_gain_log'] = np.log1p(df['capital_gain'])  # log1p(x) = log(1 + x), so zeros map to 0
df['capital_gain_log']

0        7.684784
1        0.000000
2        0.000000
3        0.000000
4        0.000000
           ...   
32556    0.000000
32557    0.000000
32558    0.000000
32559    0.000000
32560    9.617471
Name: capital_gain_log, Length: 32561, dtype: float64
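
One way to check whether the transformation helped is to compare skewness before and after. The sketch below uses a short, hypothetical series shaped like `capital_gain` (mostly zeros plus a few large gains), not the actual column.

```python
import numpy as np
import pandas as pd

# Hypothetical values shaped like capital_gain: mostly zeros plus a few large gains
gains = pd.Series([0, 0, 0, 0, 0, 0, 0, 2174, 15024, 99999], dtype="float64")

skew_before = gains.skew()
skew_after = np.log1p(gains).skew()

print(round(skew_before, 2), round(skew_after, 2))
```

The skewness drops but stays positive, because the many zeros are unaffected by the log. On the real data the same comparison would be `df['capital_gain'].skew()` against `df['capital_gain_log'].skew()`.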

### Justification:
1. Total work hours: Multiplying 'hours_per_week' by 'education_num' creates an interaction feature between time worked and education level. The idea is that long hours combined with higher education may relate to income more strongly than either variable alone. Note that the name is loose: the value is not an actual number of hours.
2. Capital ratio: This feature relates capital gain to capital loss, with 1 added to the denominator to avoid division by zero. Because 'capital_loss' is 0 for most rows, the ratio equals 'capital_gain' in those rows, so it adds little beyond 'capital_gain' there and differs only for rows with a non-zero loss.
3. Log transformation of 'capital_gain': The log transformation compresses the long right tail (values reach 99999 while the median is 0) and reduces right skewness. The result is still not normally distributed, because most values are exactly 0, but extreme gains no longer dominate the scale, which helps scale-sensitive models.


### Feature Selection

In [38]:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

In [39]:
# Encode categorical variables
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    label_encoder = LabelEncoder()
    df[col] = label_encoder.fit_transform(df[col])

In [64]:
# Initialize Isolation Forest
# Note: no random_state is set, so the rows flagged as outliers can vary between runs
isolation_forest = IsolationForest()
isolation_forest

In [65]:
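# Fit the model and predict outliers: fit_predict returns -1 for outliers and 1 for inliers
# Note: df still contains the target column 'income' and the label-encoded categorical codes
outliers = isolation_forest.fit_predict(df)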
outliers

array([-1,  1,  1, ...,  1,  1, -1])

In [66]:
# Remove outliers from the DataFrame
df_cleaned = df[outliers == 1]
df_cleaned

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,total_work_hours,capital_ratio,capital_gain_log
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0,169,0.0,0.0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0,360,0.0,0.0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0,280,0.0,0.0
5,37,4,284582,12,14,2,4,5,4,0,0,0,40,39,0,560,0.0,0.0
7,52,6,209642,11,9,2,4,0,4,1,0,0,45,39,1,405,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,22,4,310152,15,10,4,11,1,4,1,0,0,40,39,0,400,0.0,0.0
32556,27,4,257302,7,12,2,13,5,4,0,0,0,38,39,0,456,0.0,0.0
32557,40,4,154374,11,9,2,7,0,4,1,0,0,40,39,1,360,0.0,0.0
32558,58,4,151910,11,9,6,1,4,4,0,0,0,40,39,0,360,0.0,0.0
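
Outliers matter because extreme rows can pull fitted coefficients toward themselves and stretch the range used by scalers such as Min-Max. The sketch below shows one possible variation on the step above: fitting the forest on the features only, with a fixed seed so the result is repeatable. The synthetic data, the 5% `contamination` value, and the seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical data: 200 ordinary rows plus 3 injected extreme rows
features = pd.DataFrame(rng.normal(0, 1, size=(200, 2)), columns=["f1", "f2"])
extremes = pd.DataFrame([[12.0, 12.0], [-15.0, 10.0], [14.0, -13.0]], columns=["f1", "f2"])
X = pd.concat([features, extremes], ignore_index=True)
y = pd.Series(rng.integers(0, 2, size=len(X)), name="target")  # stand-in target

# Fit on the features only; contamination is the assumed share of outliers
forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X)  # -1 = outlier, 1 = inlier

# Keep the inlier rows in both the features and the target
mask = labels == 1
X_clean, y_clean = X[mask], y[mask]
print(f"removed {len(X) - len(X_clean)} of {len(X)} rows")
```

Leaving the target out of the fit avoids discarding rows merely because their label is rare.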


### PPS (Predictive Power Score) Analysis:

In [44]:
!pip install ppscore



In [45]:
import ppscore as pps

In [46]:
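# Calculate the Predictive Power Score matrix
# Note: df is fully label-encoded at this point, so every pair is scored as a regression task,
# and this uses df rather than the outlier-filtered df_cleaned
pps_matrix = pps.matrix(df)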

In [47]:
# Display the PPS matrix
print("Predictive Power Score Matrix:")
print(pps_matrix)

Predictive Power Score Matrix:
                    x                 y   ppscore            case  \
0                 age               age  1.000000  predict_itself   
1                 age         workclass  0.000000      regression   
2                 age            fnlwgt  0.000000      regression   
3                 age         education  0.000000      regression   
4                 age     education_num  0.000000      regression   
..                ...               ...       ...             ...   
319  capital_gain_log    native_country  0.000000      regression   
320  capital_gain_log            income  0.000000      regression   
321  capital_gain_log  total_work_hours  0.011546      regression   
322  capital_gain_log     capital_ratio  0.996114      regression   
323  capital_gain_log  capital_gain_log  1.000000  predict_itself   

     is_valid_score               metric  baseline_score   model_score  \
0              True                 None          0.0000      1.00
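
The printed PPS result is in long format (one row per x, y pair), which is hard to compare with a square correlation matrix. A minimal sketch of reshaping it into a wide table follows, using only three rows copied from the output above; on the full result the same `pivot` call would be applied to `pps_matrix`.

```python
import pandas as pd

# Three rows copied from the printed PPS output above (long format: x, y, ppscore)
pps_long = pd.DataFrame({
    "x": ["capital_gain_log"] * 3,
    "y": ["total_work_hours", "capital_ratio", "capital_gain_log"],
    "ppscore": [0.011546, 0.996114, 1.000000],
})

# Wide layout: one row per predictor x, one column per predicted feature y
pps_wide = pps_long.pivot(index="x", columns="y", values="ppscore")
print(pps_wide.round(3))
```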

In [48]:
# Compare findings with the correlation matrix:
correlation_matrix = df.corr()

In [49]:
# Display the correlation matrix
print("\nCorrelation Matrix:")
print(correlation_matrix)


Correlation Matrix:
                       age  workclass    fnlwgt  education  education_num  \
age               1.000000   0.003787 -0.076646  -0.010508       0.036527   
workclass         0.003787   1.000000 -0.016656   0.023513       0.052085   
fnlwgt           -0.076646  -0.016656  1.000000  -0.028145      -0.043195   
education        -0.010508   0.023513 -0.028145   1.000000       0.359153   
education_num     0.036527   0.052085 -0.043195   0.359153       1.000000   
marital_status   -0.266288  -0.064731  0.028153  -0.038407      -0.069304   
occupation       -0.020947   0.254892  0.001597  -0.021260       0.109697   
relationship     -0.263698  -0.090461  0.008931  -0.010876      -0.094153   
race              0.028718   0.049742 -0.021291   0.014131       0.031838   
sex               0.088832   0.095981  0.026858  -0.027356       0.012280   
capital_gain      0.077674   0.033835  0.000432   0.030046       0.122630   
capital_loss      0.057775   0.012216 -0.010252   0.016

### Discuss findings:
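1. The PPS matrix scores how well each feature x predicts each other feature y, on a scale from 0 (no better than a naive baseline) to 1 (perfect prediction). The score is asymmetric: x can predict y well without the reverse being true.

2. Unlike the Pearson correlation matrix, which measures only linear association and is symmetric, PPS can also pick up non-linear relationships.

3. Comparing the two matrices highlights feature pairs with high predictive power but low correlation, or vice versa. In the output above, for example, capital_gain_log predicts capital_ratio with a PPS of 0.996, which reflects that both are derived from capital_gain, while its PPS for income is 0.

4. Such comparisons help flag redundant features as well as potentially important ones that correlation analysis alone would miss. Because the categorical columns were label-encoded before this step, both matrices treat their integer codes as numbers, so results for those columns should be read with caution.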
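
To illustrate how a relationship can have near-zero correlation but high predictive power, here is a small sketch on synthetic data where y depends on x squared. The score computed below is a simplified, PPS-like measure written for this example (a decision tree compared against a median baseline). It is not the ppscore library's exact procedure, and all names and parameters are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = x ** 2 + rng.normal(0, 0.01, size=1000)  # symmetric, non-linear relationship

# Pearson correlation only sees the linear part of the relationship
pearson = np.corrcoef(x, y)[0, 1]

# Out-of-fold tree predictions compared against a naive "always predict the median" baseline
predictions = cross_val_predict(DecisionTreeRegressor(random_state=0), x.reshape(-1, 1), y, cv=4)
mae_model = np.mean(np.abs(y - predictions))
mae_baseline = np.mean(np.abs(y - np.median(y)))
pps_like = max(0.0, 1 - mae_model / mae_baseline)

print(f"Pearson r = {pearson:.3f}, PPS-like score = {pps_like:.3f}")
```

The correlation stays close to zero because the relationship is symmetric around x = 0, while the tree-based score is high because x still determines y almost exactly.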