**DATA PREPROCESSING AND FEATURE ENGINEERING IN MACHINE LEARNING**

Objective:
This assignment aims to equip you with practical skills in data preprocessing, feature engineering, and feature selection, which are crucial for building efficient machine learning models. You will work with a provided dataset and apply techniques such as scaling, encoding, outlier detection with Isolation Forest, and Predictive Power Score (PPS) analysis.


**1. Data Exploration and Preprocessing:**

In [1]:
import pandas as pd

df = pd.read_csv('C:\\Users\\rishi\\OneDrive\\Desktop\\DS Assigments\\adult_with_headers.csv')
print(df)

       age          workclass  fnlwgt    education  education_num  \
0       39          State-gov   77516    Bachelors             13   
1       50   Self-emp-not-inc   83311    Bachelors             13   
2       38            Private  215646      HS-grad              9   
3       53            Private  234721         11th              7   
4       28            Private  338409    Bachelors             13   
...    ...                ...     ...          ...            ...   
32556   27            Private  257302   Assoc-acdm             12   
32557   40            Private  154374      HS-grad              9   
32558   58            Private  151910      HS-grad              9   
32559   22            Private  201490      HS-grad              9   
32560   52       Self-emp-inc  287927      HS-grad              9   

            marital_status          occupation    relationship    race  \
0            Never-married        Adm-clerical   Not-in-family   White   
1       Married-civ-spo

In [2]:
print(df.head())

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country  income  
0          2174             0              40   United-States   <=50

In [3]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


In [4]:
print(df.describe())

                age        fnlwgt  education_num  capital_gain  capital_loss  \
count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000   
mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830   
std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219   
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000   
25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000   
50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000   
75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000   
max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000   

       hours_per_week  
count    32561.000000  
mean        40.437456  
std         12.347429  
min          1.000000  
25%         40.000000  
50%         40.000000  
75%         45.000000  
max         99.000000  


In [5]:
print(df.isnull().sum())

age               0
workclass         0
fnlwgt            0
education         0
education_num     0
marital_status    0
occupation        0
relationship      0
race              0
sex               0
capital_gain      0
capital_loss      0
hours_per_week    0
native_country    0
income            0
dtype: int64


**Handling Missing Values**

In [6]:
df = df.dropna()
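
Because 'df.isnull().sum()' reports zero missing values, 'dropna()' leaves the DataFrame unchanged here. The Adult dataset is commonly distributed with missing entries written as a '?' string rather than as NaN. Assuming this file follows that convention (the output above does not confirm it), a minimal sketch of exposing such placeholders on a small made-up frame might look like this:

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration; ' ?' mimics a placeholder for a missing value.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " Adm-clerical"],
    "age": [25, 40, 31],
})

# Strip surrounding whitespace from the text columns, then turn '?' into NaN.
text_cols = ["workclass", "occupation"]
toy[text_cols] = toy[text_cols].apply(lambda col: col.str.strip())
toy = toy.replace("?", np.nan)

missing_counts = toy.isnull().sum()
print(missing_counts)

# With real NaN values in place, dropna() removes the incomplete row.
toy_clean = toy.dropna()
print(len(toy_clean))
```

If the real file does contain such placeholders, the same two steps could be run on 'df' before calling 'dropna()'.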

**Standard Scaling**

In [7]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = df.copy()
# Write the scaled values back to the same three numeric columns
num_cols = ['age', 'education_num', 'hours_per_week']
df_scaled[num_cols] = scaler.fit_transform(df[num_cols])

**Observation**

'StandardScaler' from the 'sklearn.preprocessing' module standardizes the features 'age', 'education_num', and 'hours_per_week'. Each feature is transformed to have a mean of 0 and a standard deviation of 1, which puts the three features on a comparable scale. The scaled values are written into 'df_scaled', a copy of the original DataFrame, so 'df' itself keeps the unscaled values.
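
To make the transformation concrete, the sketch below compares 'StandardScaler' with the formula z = (x - mean) / std computed by hand. The ages are made-up toy values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy ages, made up for illustration.
ages = np.array([[20.0], [30.0], [40.0], [50.0]])

scaled = StandardScaler().fit_transform(ages)

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

The scaler divides by the population standard deviation, which is why 'ages.std()' with NumPy's default 'ddof=0' matches it.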

In [8]:
print(df.columns)

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')


**2. Encoding Techniques:**

**One-Hot Encoding for Categorical Variables with < 5 Categories**

In [9]:
df_encoded = pd.get_dummies(df, columns=['sex', 'race'])

**Observation**

In this step, the categorical columns 'sex' and 'race' are one-hot encoded with the 'pd.get_dummies()' function. Each unique category in these columns becomes a separate binary column, so the categorical data is represented numerically. The resulting DataFrame 'df_encoded' replaces 'sex' and 'race' with one indicator column per category, and these columns can be used directly in ML models.
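
As a small illustration of what 'pd.get_dummies()' produces, here is a sketch on a made-up three-row frame (the rows are toy values, not taken from the dataset):

```python
import pandas as pd

# Toy frame, made up for illustration.
toy = pd.DataFrame({"sex": ["Male", "Female", "Male"], "age": [39, 28, 50]})

# 'sex' is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["sex"])
print(list(encoded.columns))

# Exactly one indicator is set in every row.
row_sums = encoded[["sex_Female", "sex_Male"]].sum(axis=1)
print(row_sums.tolist())
```

Passing 'drop_first=True' would drop one indicator per encoded column, which is sometimes preferred for linear models; whether that suits this assignment is a judgment call.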

**Label Encoding**

In [10]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['occupation'] = le.fit_transform(df['occupation'])

**Observation**

Here, 'LabelEncoder' converts the categorical values in the 'occupation' column into integer codes. This matters because many ML algorithms require numerical input. The codes follow the sorted order of the category names, so they imply an ordering that occupations do not actually have, which is worth keeping in mind for linear or distance-based models. After encoding, the model can use 'occupation' as a feature in training and prediction.
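
The mapping that 'LabelEncoder' learns can be inspected through its 'classes_' attribute. A minimal sketch on a made-up list of occupations:

```python
from sklearn.preprocessing import LabelEncoder

# Toy values, made up for illustration.
occupations = ["Sales", "Adm-clerical", "Exec-managerial", "Sales"]

le = LabelEncoder()
codes = le.fit_transform(occupations)

# classes_ holds the categories in sorted order; the code is the position in it.
print(list(le.classes_))
print(list(codes))

# inverse_transform maps the integer codes back to the original labels.
print(list(le.inverse_transform(codes)))
```

Keeping the fitted encoder around makes it possible to decode predictions or apply the same mapping to new data.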

**3. Feature Engineering**

In [11]:
# Bin edges are right-inclusive by default, e.g. age 25 falls into 'Young'
df['age_bin'] = pd.cut(df['age'], bins=[0, 25, 45, 65, 100], labels=['Young', 'Adult', 'Middle-Aged', 'Senior'])
df['hours_per_week_bin'] = pd.cut(df['hours_per_week'], bins=[0, 25, 40, 60, 100], labels=['Part-time', 'Full-time', 'Over-time', 'Excessive'])

**Observation**

In this code, two new categorical features are created from the existing 'age' and 'hours_per_week' columns. The 'age_bin' column groups individuals into the age groups 'Young', 'Adult', 'Middle-Aged', and 'Senior'. Similarly, the 'hours_per_week_bin' column classifies individuals by their working hours into 'Part-time', 'Full-time', 'Over-time', and 'Excessive'. This binning simplifies the data and makes it easier to analyze how age and working hours relate to other variables.
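
How 'pd.cut()' treats values that sit exactly on a bin edge is easy to get wrong, so here is a short sketch using the same age bins on a few made-up ages:

```python
import pandas as pd

# Toy ages, chosen to sit on and around the bin edges.
ages = pd.Series([17, 25, 26, 45, 65, 66, 90])

# With the default right=True, each bin includes its right edge: (0, 25], (25, 45], ...
age_bin = pd.cut(
    ages,
    bins=[0, 25, 45, 65, 100],
    labels=["Young", "Adult", "Middle-Aged", "Senior"],
)
print(age_bin.tolist())
```

Values outside the outermost edges would become NaN, so the bins need to cover the full range of the column.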

**Applying Transformation to Skewed Features**

In [12]:
import numpy as np

df['capital_gain_log'] = np.log1p(df['capital_gain'])

**Observation**

I have applied a logarithmic transformation to the 'capital_gain' feature using the 'np.log1p()' function from NumPy, which computes log(1 + x) and is therefore defined for the many zero values in this column. The transformation compresses extreme values and so reduces the skew of the distribution. The result is stored in a new column, 'capital_gain_log', which can help models that are sensitive to the distribution of their input data.
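
A short sketch of the effect, using a handful of values that appear in the output above (0, 2174, and the maximum 99999):

```python
import numpy as np

# A few capital_gain values seen in the printed output above.
gains = np.array([0.0, 0.0, 0.0, 2174.0, 99999.0])

# log1p(x) = log(1 + x): zeros stay at zero, large values are compressed.
logged = np.log1p(gains)
print(logged)

# expm1 is the inverse, so the original values can be recovered.
restored = np.expm1(logged)
print(np.allclose(restored, gains))
```

The raw values span five orders of magnitude, while the transformed ones all fall between 0 and about 11.5.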

**4. Feature Selection**

**Isolation Forest for Outlier Detection**

In [13]:
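import warnings
from sklearn.ensemble import IsolationForest

# Silence the feature-name warning that scikit-learn can emit here
warnings.filterwarnings("ignore", message="X does not have valid feature names")

# contamination=0.01: expect roughly 1% of the rows to be flagged as outliers
iso_forest = IsolationForest(contamination=0.01)
df = df.copy()
# fit_predict returns 1 for inliers and -1 for outliers
df['outlier'] = iso_forest.fit_predict(df[['age', 'education_num', 'hours_per_week']])
# Keep only the inliers, then drop the helper column from the filtered copy
df_filtered = df.loc[df['outlier'] == 1].copy()
df_filtered = df_filtered.drop(columns='outlier')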

**Observation**

Here, the 'IsolationForest' algorithm is used to detect and remove outliers based on the features 'age', 'education_num', and 'hours_per_week'. The 'contamination' parameter is set to 0.01, indicating that we expect about 1% of the data to be outliers. After fitting the model, a new column 'outlier' is added to the DataFrame, where a value of '1' indicates a non-outlier and '-1' indicates an outlier. The data is then filtered to retain only the non-outliers in 'df_filtered', and the 'outlier' column is dropped from that filtered copy. The original 'df' still contains the 'outlier' column.
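
The sketch below runs the same steps on synthetic data so the effect of 'contamination' can be checked. The data are random numbers generated for illustration, and 'random_state=0' is an added assumption to make the run repeatable; the notebook cell above does not fix a seed.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins for age, education_num and hours_per_week (not real data).
rng = np.random.default_rng(0)
X = rng.normal(loc=[40.0, 10.0, 40.0], scale=[12.0, 2.5, 10.0], size=(1000, 3))

# random_state is set here only so that repeated runs flag the same rows.
iso = IsolationForest(contamination=0.01, random_state=0)
labels = iso.fit_predict(X)  # 1 = inlier, -1 = outlier

n_outliers = int((labels == -1).sum())
X_inliers = X[labels == 1]
print(n_outliers, X_inliers.shape)
```

With 1000 rows and 'contamination=0.01', about 10 rows should be flagged. Without a fixed 'random_state', the exact rows can differ between runs.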

**Applying PPS**

In [24]:
pip install ppscore

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [15]:
import ppscore as pps

pps_matrix = pps.matrix(df)

print(pps_matrix[['x', 'y', 'ppscore']])



           x                   y   ppscore
0        age                 age  1.000000
1        age           workclass  0.011232
2        age              fnlwgt  0.000000
3        age           education  0.052315
4        age       education_num  0.000000
..       ...                 ...       ...
356  outlier              income  0.000000
357  outlier             age_bin  0.000000
358  outlier  hours_per_week_bin  0.026380
359  outlier    capital_gain_log  0.000000
360  outlier             outlier  1.000000

[361 rows x 3 columns]


**Observation**

The 'ppscore' library is used to generate a Predictive Power Score (PPS) matrix for the DataFrame 'df'. The 'pps.matrix()' function computes a score for every ordered pair of columns, indicating how well column 'x' predicts column 'y'. The score ranges from 0 (no predictive power) to 1 (perfect prediction) and is asymmetric, so the score for 'x' predicting 'y' can differ from the reverse. The result is stored in 'pps_matrix', and the columns 'x', 'y', and 'ppscore' are printed. Because the matrix is computed on 'df', its 361 rows (19 x 19) also cover the engineered columns and the 'outlier' flag. These scores help in understanding the relationships between features, which can guide further feature selection or engineering.
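
One way to turn the matrix into a ranking is to drop the self-pairs and sort by score. The sketch below does this with pandas on the five rows visible at the top of the printed output; it is an excerpt for illustration, not the full matrix.

```python
import pandas as pd

# Rows copied from the printed output above (excerpt only).
pps_excerpt = pd.DataFrame({
    "x": ["age"] * 5,
    "y": ["age", "workclass", "fnlwgt", "education", "education_num"],
    "ppscore": [1.000000, 0.011232, 0.000000, 0.052315, 0.000000],
})

# Drop self-pairs (a column always predicts itself) and rank the rest.
ranked = (
    pps_excerpt[pps_excerpt["x"] != pps_excerpt["y"]]
    .sort_values("ppscore", ascending=False)
    .reset_index(drop=True)
)
print(ranked)
```

The same filter could be applied to the full 'pps_matrix', for example restricted to rows where 'y' equals 'income', to rank candidate predictors of the target.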