# **EDA-2 (Exploratory Data Analysis -2)**

# **1. Data Exploration and Preprocessing:**

**•	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).**

In [83]:
import pandas as pd

In [84]:
# Load the dataset
df = pd.read_csv('adult_with_headers.csv')

In [85]:
# Basic exploration of the dataset
df.head()          # First few rows


| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [86]:
df.info()          # Data types and null values


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [87]:
df.describe()      # Summary statistics

| | age | fnlwgt | education_num | capital_gain | capital_loss | hours_per_week |
|---|---|---|---|---|---|---|
| count | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 | 32561.0 |
| mean | 38.581647 | 189778.4 | 10.080679 | 1077.648844 | 87.30383 | 40.437456 |
| std | 13.640433 | 105550.0 | 2.57272 | 7385.292085 | 402.960219 | 12.347429 |
| min | 17.0 | 12285.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117827.0 | 9.0 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178356.0 | 10.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237051.0 | 12.0 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 16.0 | 99999.0 | 4356.0 | 99.0 |


**•	Handle missing values as per the best practices (imputation, removal, etc.).**

In [88]:
# Impute missing numerical columns with the "median"
df['age'] = df['age'].fillna(df['age'].median())

# Impute missing categorical columns with the "mode"
df['workclass'] = df['workclass'].fillna(df['workclass'].mode()[0])

# Optionally, drop rows where critical categorical columns have missing values
df.dropna(subset=['education'], inplace=True)
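
`df.info()` above reports 32561 non-null values in every column, so the `fillna` and `dropna` calls in this cell leave the frame unchanged. Missing entries can still hide behind placeholder strings. A minimal sketch of that check, assuming a `'?'` placeholder that may be padded with spaces (an assumption about the file, not something the output above confirms) and a small made-up frame:

```python
import pandas as pd

# Hypothetical toy frame; ' ?' stands in for a missing category (assumed placeholder)
toy = pd.DataFrame({
    "workclass": ["Private", " ?", "State-gov", "Private"],
    "age": [39, 50, 38, 53],
})

# Strip padding, then blank out the placeholder so pandas sees a real missing value
stripped = toy["workclass"].str.strip()
toy["workclass"] = stripped.mask(stripped == "?")

n_missing = int(toy["workclass"].isna().sum())

# Impute the categorical column with its most frequent value
toy["workclass"] = toy["workclass"].fillna(toy["workclass"].mode()[0])
```

Counting placeholders before imputing shows whether the mode imputation actually does any work on a given column.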


•	**Apply scaling techniques to numerical features:**

1	***Standard Scaling***




In [89]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Initialize scalers
standard_scaler = StandardScaler()


# Apply standard scaling to numerical features
df[['age', 'fnlwgt', 'education_num', 'hours_per_week']] = standard_scaler.fit_transform(df[['age', 'fnlwgt', 'education_num', 'hours_per_week']])



2	***Min-Max Scaling***

In [90]:
min_max_scaler = MinMaxScaler()

# Apply min-max scaling to the same numerical features
# (this overwrites the standardized values written by the previous cell)
df[['age', 'fnlwgt', 'education_num', 'hours_per_week']] = min_max_scaler.fit_transform(df[['age', 'fnlwgt', 'education_num', 'hours_per_week']])
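
Because both scaling cells write back to the same four columns, the min-max result replaces the standardized values. Standardization is a linear rescaling, so min-max scaling the standardized data gives the same numbers as min-max scaling the raw data. One way to keep both versions is to store them under suffixed names. A sketch on a made-up frame, using the plain formulas rather than the scikit-learn scalers; the `_std` and `_mm` suffixes are illustrative:

```python
import pandas as pd

# Made-up values, not taken from the dataset
toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0, 50.0],
    "hours_per_week": [10.0, 40.0, 40.0, 60.0],
})
num_cols = ["age", "hours_per_week"]

# z-score with the population standard deviation (ddof=0), as StandardScaler does
standardized = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std(ddof=0)

# Min-max scaling to [0, 1]
min_maxed = (toy[num_cols] - toy[num_cols].min()) / (toy[num_cols].max() - toy[num_cols].min())

# Min-max scaling applied on top of the standardized values gives the same result
mm_of_std = (standardized - standardized.min()) / (standardized.max() - standardized.min())

# Keep the raw columns and both scaled versions side by side
scaled = toy.join(standardized.add_suffix("_std")).join(min_maxed.add_suffix("_mm"))
```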



 **•	Discuss the scenarios where each scaling technique is preferred and why.**

## **Discussion of Scaling Techniques:**

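### **Standard Scaling** (zero mean, unit variance) is preferred for methods that are sensitive to feature magnitudes, such as regularized linear or logistic regression, SVMs, and PCA, particularly when features are measured in different units and are roughly bell-shaped. It does not bound the range, and because the mean and standard deviation are themselves pulled by extreme values, it is not robust to outliers.


### **Min-Max Scaling** is useful when values must fall in a fixed range such as [0, 1] (for example, inputs to neural networks or distance-based methods like k-nearest neighbors) and the feature has no strong outliers. A single extreme value sets the minimum or maximum and compresses every other value into a narrow band.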
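
Both scalers are computed from statistics that a single extreme value can shift: the mean and standard deviation in one case, the minimum and maximum in the other. The sketch below applies the plain formulas $z = (x - \mu)/\sigma$ and $x' = (x - \min)/(\max - \min)$ to a made-up list of weekly hours containing one extreme value; the data is illustrative, not taken from the dataset.

```python
from statistics import mean, pstdev

hours = [38, 40, 40, 42, 99]  # made-up values; 99 is the extreme one

# Standard scaling: subtract the mean, divide by the population standard deviation
mu, sigma = mean(hours), pstdev(hours)
z_scores = [(h - mu) / sigma for h in hours]

# Min-max scaling: map the minimum to 0 and the maximum to 1
lo, hi = min(hours), max(hours)
min_max = [(h - lo) / (hi - lo) for h in hours]

# Width of the band that the four typical values occupy after min-max scaling
typical_band = max(min_max[:4]) - min(min_max[:4])
```

Here the four typical values end up within a band of about 0.07 of the [0, 1] range, which is the compression effect described above.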

# **2. Encoding Techniques:**

**•	Apply One-Hot Encoding to categorical variables with less than 5 categories.**

In [91]:
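# One-Hot Encoding of four nominal columns.
# Note: each of these has more than 5 categories, so this list does not match the
# "less than 5 categories" rule stated above.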
df = pd.get_dummies(df, columns=['workclass', 'education', 'marital_status', 'occupation'], drop_first=True)
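
The split between encoders can also be driven by the data rather than a hand-written list. A sketch, assuming the threshold of 5 from the task statement and a made-up frame; `low_card` and `high_card` are illustrative names, and a column with exactly 5 categories is sent to the label-encoding side here as an arbitrary choice:

```python
import pandas as pd

# Made-up frame with columns of different cardinality
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "race": ["White", "Black", "White", "Other", "White", "Black"],
    "occupation": ["Sales", "Tech-support", "Adm-clerical",
                   "Craft-repair", "Exec-managerial", "Other-service"],
})

# Count distinct values per column and split at the threshold
n_unique = toy.nunique()
low_card = n_unique[n_unique < 5].index.tolist()    # candidates for one-hot encoding
high_card = n_unique[n_unique >= 5].index.tolist()  # candidates for label encoding

# One-hot encode only the low-cardinality columns
encoded = pd.get_dummies(toy, columns=low_card, drop_first=True)
```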


In [92]:
df.head()

Unnamed: 0,age,fnlwgt,education_num,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,...,occupation_ Farming-fishing,occupation_ Handlers-cleaners,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving
0,0.30137,0.044302,0.8,Not-in-family,White,Male,2174,0,0.397959,United-States,...,0,0,0,0,0,0,0,0,0,0
1,0.452055,0.048238,0.8,Husband,White,Male,0,0,0.122449,United-States,...,0,0,0,0,0,0,0,0,0,0
2,0.287671,0.138113,0.533333,Not-in-family,White,Male,0,0,0.397959,United-States,...,0,1,0,0,0,0,0,0,0,0
3,0.493151,0.151068,0.4,Husband,Black,Male,0,0,0.397959,United-States,...,0,1,0,0,0,0,0,0,0,0
4,0.150685,0.221488,0.8,Wife,Black,Female,0,0,0.397959,Cuba,...,0,0,0,0,0,1,0,0,0,0


**•	Use Label Encoding for categorical variables with more than 5 categories.**

In [93]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Label Encoding for 'native_country' as it has more than 5 categories
df['native_country'] = label_encoder.fit_transform(df['native_country'])


In [94]:
df.head()

Unnamed: 0,age,fnlwgt,education_num,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,...,occupation_ Farming-fishing,occupation_ Handlers-cleaners,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving
0,0.30137,0.044302,0.8,Not-in-family,White,Male,2174,0,0.397959,39,...,0,0,0,0,0,0,0,0,0,0
1,0.452055,0.048238,0.8,Husband,White,Male,0,0,0.122449,39,...,0,0,0,0,0,0,0,0,0,0
2,0.287671,0.138113,0.533333,Not-in-family,White,Male,0,0,0.397959,39,...,0,1,0,0,0,0,0,0,0,0
3,0.493151,0.151068,0.4,Husband,Black,Male,0,0,0.397959,39,...,0,1,0,0,0,0,0,0,0,0
4,0.150685,0.221488,0.8,Wife,Black,Female,0,0,0.397959,5,...,0,0,0,0,0,1,0,0,0,0


**•	Discuss the pros and cons of One-Hot Encoding and Label Encoding.**

### **One-Hot Encoding** creates binary columns for each category and avoids introducing ordinal relationships where none exist. It is preferred for categorical variables with no natural order.


### **Label Encoding** assigns a unique integer to each category, which can be problematic for categorical variables with no ordinal relationship, as the model might interpret the integer values as having a ranking.
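
The difference is easiest to see on a tiny example. The sketch below uses pandas category codes as a stand-in for `LabelEncoder` (both assign integers to string categories in sorted order) and `pd.get_dummies` for one-hot encoding; the four country values are made up for illustration.

```python
import pandas as pd

countries = pd.Series(["Cuba", "United-States", "India", "United-States"], name="native_country")

# Label encoding: one integer per category, assigned in sorted order
labels = countries.astype("category").cat.codes.tolist()

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(countries, prefix="native_country")
```

The label codes place `India` numerically between `Cuba` and `United-States`, an ordering with no meaning that a linear or distance-based model would still use. The one-hot version avoids that at the cost of one column per category.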

# **3. Feature Engineering:**

**•	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.**

In [95]:
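# Work Class Length
# 'workclass' is already one-hot encoded, so this sums the dummy columns (0 or 1 per row)
# and takes the string length of that sum. The result is 1 for every row, i.e. a constant feature.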
df['workclass_length'] = df[[col for col in df.columns if col.startswith('workclass_')]].sum(axis=1).apply(lambda x: len(str(x)))

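# Age Group
# Note: 'age' is min-max scaled to [0, 1] at this point, so with bins given in years
# every row with age > 0 lands in 'Young'.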
bins = [0, 30, 50, 100]
labels = ['Young', 'Middle_Aged', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
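
By this point `age` has been min-max scaled into [0, 1], so bins expressed in years put the rows in the first bin, which matches the output below where every row reads `Young`. A sketch of the difference, assuming the minimum age of 17 and maximum of 90 reported by `describe()` earlier and three made-up ages:

```python
import pandas as pd

bins = [0, 30, 50, 100]
labels = ["Young", "Middle_Aged", "Senior"]

raw_age = pd.Series([25, 40, 65])          # made-up ages in years
scaled_age = (raw_age - 17) / (90 - 17)    # min-max scaled with the dataset's min and max age

# Binning the raw ages separates the three groups
groups_from_raw = pd.cut(raw_age, bins=bins, labels=labels).astype(str).tolist()

# Binning the scaled ages with the same year-based bins collapses them into one group
groups_from_scaled = pd.cut(scaled_age, bins=bins, labels=labels).astype(str).tolist()
```

Deriving the age group from the unscaled column (for example, before the scaling cells run, or from a copy of the raw values) keeps the bins meaningful.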

In [96]:
df

Unnamed: 0,age,fnlwgt,education_num,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,...,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,workclass_length,age_group
0,0.301370,0.044302,0.800000,Not-in-family,White,Male,2174,0,0.397959,39,...,0,0,0,0,0,0,0,0,1,Young
1,0.452055,0.048238,0.800000,Husband,White,Male,0,0,0.122449,39,...,0,0,0,0,0,0,0,0,1,Young
2,0.287671,0.138113,0.533333,Not-in-family,White,Male,0,0,0.397959,39,...,0,0,0,0,0,0,0,0,1,Young
3,0.493151,0.151068,0.400000,Husband,Black,Male,0,0,0.397959,39,...,0,0,0,0,0,0,0,0,1,Young
4,0.150685,0.221488,0.800000,Wife,Black,Female,0,0,0.397959,5,...,0,0,0,1,0,0,0,0,1,Young
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.136986,0.166404,0.733333,Wife,White,Female,0,0,0.377551,39,...,0,0,0,0,0,0,1,0,1,Young
32557,0.315068,0.096500,0.533333,Husband,White,Male,0,0,0.397959,39,...,1,0,0,0,0,0,0,0,1,Young
32558,0.561644,0.094827,0.533333,Unmarried,White,Female,0,0,0.397959,39,...,0,0,0,0,0,0,0,0,1,Young
32559,0.068493,0.128499,0.533333,Own-child,White,Male,0,0,0.193878,39,...,0,0,0,0,0,0,0,0,1,Young


**•	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.**

In [97]:
import numpy as np

# Apply log1p to 'fnlwgt' to reduce skewness
# (the column is already min-max scaled to [0, 1] here, so the effect is small)
df['fnlwgt_log'] = np.log1p(df['fnlwgt'])


In [98]:
df

Unnamed: 0,age,fnlwgt,education_num,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,...,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,workclass_length,age_group,fnlwgt_log
0,0.301370,0.044302,0.800000,Not-in-family,White,Male,2174,0,0.397959,39,...,0,0,0,0,0,0,0,1,Young,0.043349
1,0.452055,0.048238,0.800000,Husband,White,Male,0,0,0.122449,39,...,0,0,0,0,0,0,0,1,Young,0.047110
2,0.287671,0.138113,0.533333,Not-in-family,White,Male,0,0,0.397959,39,...,0,0,0,0,0,0,0,1,Young,0.129372
3,0.493151,0.151068,0.400000,Husband,Black,Male,0,0,0.397959,39,...,0,0,0,0,0,0,0,1,Young,0.140690
4,0.150685,0.221488,0.800000,Wife,Black,Female,0,0,0.397959,5,...,0,0,1,0,0,0,0,1,Young,0.200070
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.136986,0.166404,0.733333,Wife,White,Female,0,0,0.377551,39,...,0,0,0,0,0,1,0,1,Young,0.153926
32557,0.315068,0.096500,0.533333,Husband,White,Male,0,0,0.397959,39,...,0,0,0,0,0,0,0,1,Young,0.092124
32558,0.561644,0.094827,0.533333,Unmarried,White,Female,0,0,0.397959,39,...,0,0,0,0,0,0,0,1,Young,0.090596
32559,0.068493,0.128499,0.533333,Own-child,White,Male,0,0,0.193878,39,...,0,0,0,0,0,0,0,1,Young,0.120889


**Justification:**


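**Log transformation suits features with a long right tail, such as income or population counts. It compresses large values, which reduces skew and limits the leverage of extreme observations in models that are sensitive to them. `np.log1p` is used rather than `np.log` so that zeros map to zero instead of negative infinity.**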
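
How much a log transform helps depends on the scale it is applied to. The sketch below measures skewness before and after `log1p` on a made-up right-skewed list, and again on the same list squeezed into [0, 1] first. The `skewness` helper is written out here for illustration and is not part of the notebook.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mu, sigma = mean(values), pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)

raw = [2 ** k for k in range(10)]   # made-up right-skewed values: 1, 2, 4, ..., 512

# log1p on the raw scale removes most of the skew
skew_raw = skewness(raw)
skew_logged = skewness([math.log1p(v) for v in raw])

# log1p after scaling to [0, 1] leaves the distribution strongly skewed
scaled = [v / max(raw) for v in raw]
skew_scaled_logged = skewness([math.log1p(v) for v in scaled])
```

On the raw scale the skewness drops from well above 1 to close to 0, while on the [0, 1] scale it stays above 1. This is consistent with the correlation of 0.99788 between `fnlwgt` and `fnlwgt_log` in the correlation matrix further down.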

# **4. Feature Selection:**

**•	Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.**

In [99]:
from sklearn.ensemble import IsolationForest

# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.05)

# Fit the model to the data and get the predictions (-1 indicates outliers)
outliers = iso_forest.fit_predict(df[['age', 'fnlwgt', 'education_num', 'hours_per_week']])

# Store the flags in a new column and keep only the inliers (rows not flagged -1)
df['outliers'] = outliers
df_cleaned = df[df['outliers'] != -1]


In [100]:
display(outliers)

array([1, 1, 1, ..., 1, 1, 1])

In [101]:
df[['outliers']].head(10)  # View the first 10 values

| | outliers |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |


In [102]:
df_cleaned

Unnamed: 0,age,fnlwgt,education_num,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,...,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,workclass_length,age_group,fnlwgt_log,outliers
0,0.301370,0.044302,0.800000,Not-in-family,White,Male,2174,0,0.397959,39,...,0,0,0,0,0,0,1,Young,0.043349,1
1,0.452055,0.048238,0.800000,Husband,White,Male,0,0,0.122449,39,...,0,0,0,0,0,0,1,Young,0.047110,1
2,0.287671,0.138113,0.533333,Not-in-family,White,Male,0,0,0.397959,39,...,0,0,0,0,0,0,1,Young,0.129372,1
3,0.493151,0.151068,0.400000,Husband,Black,Male,0,0,0.397959,39,...,0,0,0,0,0,0,1,Young,0.140690,1
4,0.150685,0.221488,0.800000,Wife,Black,Female,0,0,0.397959,5,...,0,1,0,0,0,0,1,Young,0.200070,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,0.136986,0.166404,0.733333,Wife,White,Female,0,0,0.377551,39,...,0,0,0,0,1,0,1,Young,0.153926,1
32557,0.315068,0.096500,0.533333,Husband,White,Male,0,0,0.397959,39,...,0,0,0,0,0,0,1,Young,0.092124,1
32558,0.561644,0.094827,0.533333,Unmarried,White,Female,0,0,0.397959,39,...,0,0,0,0,0,0,1,Young,0.090596,1
32559,0.068493,0.128499,0.533333,Own-child,White,Male,0,0,0.193878,39,...,0,0,0,0,0,0,1,Young,0.120889,1


In [103]:
#Filter out only outliers

df[df['outliers'] == -1]  # Show rows where outlier is -1

Unnamed: 0,age,fnlwgt,education_num,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,...,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,workclass_length,age_group,fnlwgt_log,outliers
28,0.301370,0.241083,0.533333,Not-in-family,White,Male,0,0,0.806122,39,...,0,0,0,0,0,0,1,Young,0.215984,-1
37,0.027397,0.361178,0.533333,Wife,White,Female,0,0,0.244898,39,...,0,0,0,0,0,0,1,Young,0.308351,-1
40,0.191781,0.336582,0.266667,Husband,White,Male,0,0,0.428571,39,...,0,0,0,0,0,0,1,Young,0.290116,-1
74,0.849315,0.076377,0.600000,Other-relative,White,Male,0,0,0.193878,39,...,0,1,0,0,0,0,1,Young,0.073601,-1
77,0.684932,0.136153,0.333333,Husband,White,Male,0,0,0.010204,39,...,0,0,0,0,0,0,1,Young,0.127648,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32496,0.013699,0.287878,0.400000,Own-child,White,Male,0,0,0.193878,39,...,0,1,0,0,0,0,1,Young,0.252996,-1
32525,0.876712,0.073480,0.666667,Unmarried,White,Female,0,0,0.000000,0,...,0,0,0,0,0,0,1,Young,0.070905,-1
32531,0.178082,0.014619,0.800000,Not-in-family,Asian-Pac-Islander,Female,0,0,1.000000,39,...,0,0,0,0,0,0,1,Young,0.014514,-1
32539,0.739726,0.186826,1.000000,Husband,White,Male,0,0,0.091837,39,...,0,0,0,0,0,0,1,Young,0.171283,-1


**Impact of Outliers:**

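**Outliers can distort statistical measures such as the mean and variance, pull fitted parameters toward a few extreme rows, and reduce accuracy, especially for distance-based and least-squares models. Removing them can improve robustness, provided the flagged rows are errors or irrelevant extremes rather than valid rare cases.**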
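
A small numeric illustration of that distortion, using a made-up sample of capital gains and one extreme value (99999, the column maximum reported by `describe()` earlier):

```python
from statistics import mean, median, pstdev

capital_gain = [0, 0, 0, 0, 2174, 0, 0, 0, 0, 0]   # made-up sample
with_outlier = capital_gain + [99999]               # add one extreme value

# The mean and standard deviation move a lot; the median does not move at all
mean_shift = mean(with_outlier) - mean(capital_gain)
median_shift = median(with_outlier) - median(capital_gain)
std_ratio = pstdev(with_outlier) / pstdev(capital_gain)
```

One added row shifts the mean by roughly 9000 and inflates the standard deviation many times over, while the median stays at 0. Statistics and models built on means and squared errors inherit that sensitivity.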

**•	Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.**

In [104]:
!pip install ppscore
import ppscore as pps



In [105]:

# Compute PPS scores
pps_matrix = pps.matrix(df)




In [106]:
pps_matrix

| | x | y | ppscore | case | is_valid_score | metric | baseline_score | model_score | model |
|---|---|---|---|---|---|---|---|---|---|
| 0 | age | age | 1.000000e+00 | predict_itself | True | | 0.000000 | 1.000000 | |
| 1 | age | fnlwgt | 0.000000e+00 | regression | True | mean absolute error | 0.051529 | 0.052658 | DecisionTreeRegressor() |
| 2 | age | education_num | 0.000000e+00 | regression | True | mean absolute error | 0.123533 | 0.126554 | DecisionTreeRegressor() |
| 3 | age | relationship | 1.992051e-01 | classification | True | weighted F1 | 0.268000 | 0.413818 | DecisionTreeClassifier() |
| 4 | age | race | 1.747134e-07 | classification | True | weighted F1 | 0.783630 | 0.783630 | DecisionTreeClassifier() |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3359 | outliers | occupation_ Transport-moving | 0.000000e+00 | regression | True | mean absolute error | 0.047000 | 0.089657 | DecisionTreeRegressor() |
| 3360 | outliers | workclass_length | 0.000000e+00 | target_is_constant | True | | 1.000000 | 1.000000 | |
| 3361 | outliers | age_group | | classification | True | weighted F1 | 1.000000 | 1.000000 | DecisionTreeClassifier() |
| 3362 | outliers | fnlwgt_log | 0.000000e+00 | regression | True | mean absolute error | 0.045567 | 0.045942 | DecisionTreeRegressor() |


**Correlation Matrix**

In [107]:
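# Correlation matrix over the numeric columns only
# (non-numeric columns such as 'race' and 'age_group' are skipped explicitly)
correlation_matrix = df.corr(numeric_only=True)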




In [108]:
correlation_matrix

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,native_country,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,...,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,workclass_length,fnlwgt_log,outliers
age,1.0,-0.076646,0.036527,0.077674,0.057775,0.068756,-0.001151,0.051227,0.060901,-0.019362,...,-0.089346,0.015624,0.05417,0.003891,-0.03198,-0.019576,0.026909,,-0.076454,-0.210163
fnlwgt,-0.076646,1.0,-0.043195,0.000432,-0.010252,-0.018768,-0.051966,-0.007525,-0.002828,0.005031,...,-0.003719,0.007278,-0.016206,0.016567,0.003728,0.003765,0.001265,,0.99788,-0.135353
education_num,0.036527,-0.043195,1.0,0.12263,0.079923,0.148123,0.05084,0.060518,0.097941,-0.015117,...,-0.169684,-0.071638,0.419006,0.005777,0.030253,0.060703,-0.11596,,-0.043411,0.117948
capital_gain,0.077674,0.000432,0.12263,1.0,-0.031615,0.078409,-0.001982,-0.005768,-0.007007,-0.00214,...,-0.040271,-0.007324,0.085222,-0.007136,0.011652,-0.009372,-0.018061,,0.000669,-0.016677
capital_loss,0.057775,-0.010252,0.079923,-0.031615,1.0,0.054256,0.000419,0.010798,0.014668,-0.003177,...,-0.040847,-0.011081,0.046255,-0.003174,0.009697,0.00483,-0.003282,,-0.009711,-0.004341
hours_per_week,0.068756,-0.018768,0.148123,0.078409,0.054256,1.0,-0.002671,0.013293,0.011576,-0.014262,...,-0.155872,-0.041467,0.060253,0.028102,0.009889,-0.013946,0.077596,,-0.019342,0.013104
native_country,-0.001151,-0.051966,0.05084,-0.001982,0.000419,-0.002671,1.0,0.011879,0.025936,0.004276,...,-0.046919,-0.060088,-0.019903,0.015207,0.020856,0.008467,0.018303,,-0.053615,0.037506
workclass_ Federal-gov,0.051227,-0.007525,0.060518,-0.005768,0.010798,0.013293,0.011879,1.0,-0.045682,-0.002556,...,-0.037413,-0.011817,0.028852,0.011516,-0.053873,0.044342,-0.018566,,-0.009229,0.024157
workclass_ Local-gov,0.060901,-0.002828,0.097941,-0.007007,0.014668,0.011576,0.025936,-0.045682,1.0,-0.003843,...,-0.007806,-0.017771,0.164976,0.234997,-0.090349,-0.016294,0.007159,,-0.002194,0.017609
workclass_ Never-worked,-0.019362,0.005031,-0.015117,-0.00214,-0.003177,-0.014262,0.004276,-0.002556,-0.003843,1.0,...,-0.00492,-0.000994,-0.005597,-0.002091,-0.00521,-0.002512,-0.00333,,0.005224,0.003364


**Discussion of PPS vs. Correlation:**

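The correlation matrix (Pearson) measures only linear association between numeric columns and is symmetric. PPS is asymmetric (x predicting y can score differently from y predicting x), handles categorical columns, and can detect non-linear dependencies because it scores a decision tree against a naive baseline. For example, `age` predicts `relationship` with a PPS of about 0.20, a pair the correlation matrix cannot represent because `relationship` is categorical. Both views agree on the constant `workclass_length` column, which appears as `target_is_constant` in the PPS output and as NaN in the correlation matrix.
PPS can identify important features for prediction that might not be immediately apparent in a correlation matrix.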
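
The gap between the two measures can be reproduced on a toy non-linear relationship. The sketch below computes Pearson correlation for y = x² and a PPS-style score built from mean absolute error against a median baseline, in the spirit of the `baseline_score` and `model_score` columns above. The lookup-table "model" and the exact normalization are assumptions for illustration; `ppscore` itself fits a cross-validated decision tree.

```python
from statistics import mean, median

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Symmetric, purely non-linear relationship: y = x^2
train_x = list(range(-5, 6))
train_y = [x * x for x in train_x]
test_x = [-4, -2, 0, 1, 3]
test_y = [x * x for x in test_x]

r = pearson(train_x, train_y)

# Stand-in model: look up the y seen for each x during training
lookup = dict(zip(train_x, train_y))
mae_model = mean(abs(lookup[x] - y) for x, y in zip(test_x, test_y))

# Naive baseline: always predict the median of the training target
baseline = median(train_y)
mae_baseline = mean(abs(baseline - y) for y in test_y)

# PPS-style score: share of the baseline error that the model removes, floored at 0
pps_like = max(0.0, 1 - mae_model / mae_baseline)
```

Here the Pearson correlation is exactly 0 even though x fully determines y, while the PPS-style score is 1.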