# Data Exploration and Preprocessing:
•	Load the dataset and conduct basic data exploration (summary statistics, missing values, data types).                                                     
•	Handle missing values as per the best practices (imputation, removal, etc.).   
•	Apply scaling techniques to numerical features:                                
•	Standard Scaling                                                                    
•	Min-Max Scaling                                                                
•	Discuss the scenarios where each scaling technique is preferred and why.


In [4]:
import pandas as pd   # Import the library.

df = pd.read_csv('https://raw.githubusercontent.com/Shrikrishna-jadhavar/Data-Science-Material/main/Dataset/adult_with_headers.csv')  # Load the Dataset.

df

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education_num   32561 non-null  int64 
 5   marital_status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital_gain    32561 non-null  int64 
 11  capital_loss    32561 non-null  int64 
 12  hours_per_week  32561 non-null  int64 
 13  native_country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
df.describe()   #Summary statistics.

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


In [7]:
df.dtypes   # data types

Unnamed: 0,0
age,int64
workclass,object
fnlwgt,int64
education,object
education_num,int64
marital_status,object
occupation,object
relationship,object
race,object
sex,object


In [8]:
df.isnull().sum()   #missing values.

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
education_num,0
marital_status,0
occupation,0
relationship,0
race,0
sex,0


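df.isnull().sum() reports no missing values in this dataset, so I can skip the step of handling missing values (imputation and removal). Note that isnull() only detects NaN/None; unknowns stored as placeholder strings such as ? are not counted and need a separate check.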
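
A zero count from isnull() does not rule out unknowns that are stored as ordinary strings. The sketch below uses a small made-up frame to show one way to check for that. The ' ?' placeholder, the column names, and the values are assumptions for illustration, not read from the file loaded above.

```python
import pandas as pd

# Hypothetical toy frame; ' ?' stands in for an unknown category.
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "age": [39, 50, 38, 53],
})

# isnull() finds nothing because ' ?' is an ordinary string.
nulls_before = int(toy.isnull().sum().sum())

# Strip surrounding whitespace, then turn the placeholder into NaN.
cleaned = toy.copy()
stripped = cleaned["workclass"].str.strip()
cleaned["workclass"] = stripped.mask(stripped == "?")

nulls_after = int(cleaned.isnull().sum().sum())
print(nulls_before, nulls_after)
```

Once placeholders are NaN, the usual choices apply: drop the rows, impute the most frequent category, or keep "unknown" as its own category.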

In [9]:
import numpy as np                                    # Identify numerical columns

df.select_dtypes(include=np.number).columns.tolist()  # Quickly list the names of the numeric columns without displaying their values.

['age',
 'fnlwgt',
 'education_num',
 'capital_gain',
 'capital_loss',
 'hours_per_week']

In [10]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

numerical_features = df[['age','fnlwgt','education_num','capital_gain','capital_loss','hours_per_week']]

standard_scaler = StandardScaler()    # Apply Standard Scaling.
standard_scaled = standard_scaler.fit_transform(numerical_features)
standard_scaled_df = pd.DataFrame(standard_scaled, columns=numerical_features.columns)

min_max_scaler = MinMaxScaler()   # Apply Min-Max Scaling.
min_max_scaled = min_max_scaler.fit_transform(numerical_features)
min_max_scaled_df = pd.DataFrame(min_max_scaled, columns=numerical_features.columns)

In [11]:
standard_scaled_df  # The data is centered around a mean of 0 with a standard deviation of 1.
#Standard Scaling is useful when the features have different scales but are expected to follow a normal distribution. It centers the data to have a mean of 0 and a standard deviation of 1.

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429
1,0.837109,-1.008707,1.134739,-0.145920,-0.21666,-2.222153
2,-0.042642,0.245079,-0.420060,-0.145920,-0.21666,-0.035429
3,1.057047,0.425801,-1.197459,-0.145920,-0.21666,-0.035429
4,-0.775768,1.408176,1.134739,-0.145920,-0.21666,-0.035429
...,...,...,...,...,...,...
32556,-0.849080,0.639741,0.746039,-0.145920,-0.21666,-0.197409
32557,0.103983,-0.335433,-0.420060,-0.145920,-0.21666,-0.035429
32558,1.423610,-0.358777,-0.420060,-0.145920,-0.21666,-0.035429
32559,-1.215643,0.110960,-0.420060,-0.145920,-0.21666,-1.655225


In [12]:
min_max_scaled_df   # The data is normalized to a range between 0 and 1.
#Min-Max Scaling is preferred when you need to normalize the data to a specific range, typically [0, 1].

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week
0,0.301370,0.044302,0.800000,0.021740,0.0,0.397959
1,0.452055,0.048238,0.800000,0.000000,0.0,0.122449
2,0.287671,0.138113,0.533333,0.000000,0.0,0.397959
3,0.493151,0.151068,0.400000,0.000000,0.0,0.397959
4,0.150685,0.221488,0.800000,0.000000,0.0,0.397959
...,...,...,...,...,...,...
32556,0.136986,0.166404,0.733333,0.000000,0.0,0.377551
32557,0.315068,0.096500,0.533333,0.000000,0.0,0.397959
32558,0.561644,0.094827,0.533333,0.000000,0.0,0.397959
32559,0.068493,0.128499,0.533333,0.000000,0.0,0.193878


**Scenarios for Scaling Techniques -**

Standard Scaling:  
Standard Scaling is useful when the features have different scales and are roughly bell-shaped. It centers each feature to a mean of 0 and rescales it to a standard deviation of 1; the result is not bounded to a fixed range.

Why: This suits algorithms that are sensitive to feature magnitude or that work with distances and variances, such as regularized linear and logistic regression, SVMs, PCA, and k-means. Outliers still shift the mean and standard deviation, but they compress the remaining values less than they do under Min-Max Scaling.

Min-Max Scaling:  
Min-Max Scaling is preferred when you need to normalize the data to a specific range, typically [0, 1]. This is especially useful for algorithms that expect bounded inputs, such as neural networks, or when all features must share the same fixed range.

Why: It is a linear rescaling, so it preserves the shape of the distribution and the relative spacing between data points, and it makes no assumption about the distribution. Because the minimum and maximum define the range, a single extreme value (for example, capital_gain = 99999 here) pushes most other values toward 0.
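
To make the two formulas concrete, here is a minimal sketch that applies them by hand with NumPy to a toy array containing one extreme value. The array is made up for illustration; the population standard deviation (ddof=0) is used because that is what scikit-learn's StandardScaler computes.

```python
import numpy as np

# Toy feature with one extreme value (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Standard scaling: z = (x - mean) / std, with the population std (ddof=0).
z = (x - x.mean()) / x.std()

# Min-max scaling: m = (x - min) / (max - min), mapping onto [0, 1].
m = (x - x.min()) / (x.max() - x.min())

print(np.round(z, 3))
print(np.round(m, 3))
```

In the min-max result the four ordinary values end up crowded near 0 because the extreme value alone defines the top of the range.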



# Encoding Techniques:
•	Apply One-Hot Encoding to categorical variables with less than 5 categories.   
•	Use Label Encoding for categorical variables with more than 5 categories.      
•	Discuss the pros and cons of One-Hot Encoding and Label Encoding


In [13]:
categorical_columns = df.select_dtypes(include=['object']).columns   # Identify categorical columns
#df.select_dtypes(include=['object']).columns.tolist()

In [14]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

one_hot_columns = [col for col in categorical_columns if df[col].nunique() <= 5]    # Columns with at most 5 unique values. The task says "less than 5"; <= 5 also picks up race, which has exactly 5 categories.

# Apply One-Hot Encoding.

one_hot_encoder = OneHotEncoder(sparse_output = False, drop= 'first')    # drop='first' to avoid multicollinearity.
# sparse_output=False returns a dense NumPy array; the default (sparse_output=True) returns a scipy.sparse matrix in Compressed Sparse Row (CSR) format.
# drop accepts 'first', 'if_binary', or an array-like of shape (n_features,); the default is None.
one_hot_encoded = one_hot_encoder.fit_transform(df[one_hot_columns])
one_hot_encoded_df = pd.DataFrame(one_hot_encoded, columns = one_hot_encoder.get_feature_names_out(one_hot_columns) )

In [15]:
one_hot_encoded_df.head(10)

Unnamed: 0,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Male,income_ >50K
0,0.0,0.0,0.0,1.0,1.0,0.0
1,0.0,0.0,0.0,1.0,1.0,0.0
2,0.0,0.0,0.0,1.0,1.0,0.0
3,0.0,1.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0
6,0.0,1.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,1.0,1.0,1.0
8,0.0,0.0,0.0,1.0,0.0,1.0
9,0.0,0.0,0.0,1.0,1.0,1.0


In [16]:
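label_encoding_columns = [col for col in categorical_columns if df[col].nunique() >= 5]   # Columns with 5 or more unique values. Note: race (exactly 5) also passes the <= 5 test above, so it is encoded both ways; use > 5 here to avoid the duplicate.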

label_encoder = LabelEncoder()    # Apply Label Encoding.
label_encoded_df = df[label_encoding_columns].apply(label_encoder.fit_transform)

In [17]:
label_encoded_df

Unnamed: 0,workclass,education,marital_status,occupation,relationship,race,native_country
0,7,9,4,1,1,4,39
1,6,9,2,4,0,4,39
2,4,11,0,6,1,4,39
3,4,1,2,6,0,2,39
4,4,9,2,10,5,2,5
...,...,...,...,...,...,...,...
32556,4,7,2,13,5,4,39
32557,4,11,2,7,0,4,39
32558,4,11,6,1,4,4,39
32559,4,11,4,1,3,4,39


In [18]:
# Combine the encoded data with the numerical features
encoded_data = pd.concat([standard_scaled_df, one_hot_encoded_df, label_encoded_df], axis=1)

encoded_data.head(10)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Male,income_ >50K,workclass,education,marital_status,occupation,relationship,race,native_country
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,0.0,0.0,0.0,1.0,1.0,0.0,7,9,4,1,1,4,39
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,0.0,0.0,0.0,1.0,1.0,0.0,6,9,2,4,0,4,39
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,1.0,1.0,0.0,4,11,0,6,1,4,39
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,0.0,1.0,0.0,0.0,1.0,0.0,4,1,2,6,0,2,39
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,0.0,1.0,0.0,0.0,0.0,0.0,4,9,2,10,5,2,5
5,-0.115955,0.898201,1.523438,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,1.0,0.0,0.0,4,12,2,4,5,4,39
6,0.763796,-0.280358,-1.974858,-0.14592,-0.21666,-1.979184,0.0,1.0,0.0,0.0,0.0,0.0,4,6,3,8,1,2,23
7,0.983734,0.188195,-0.42006,-0.14592,-0.21666,0.369519,0.0,0.0,0.0,1.0,1.0,1.0,6,11,2,4,0,4,39
8,-0.55583,-1.364279,1.523438,1.761142,-0.21666,0.774468,0.0,0.0,0.0,1.0,0.0,1.0,4,12,4,10,1,4,39
9,0.250608,-0.28735,1.134739,0.555214,-0.21666,-0.035429,0.0,0.0,0.0,1.0,1.0,1.0,4,9,2,4,0,4,39


**Pros and Cons of Encoding Techniques -**  
**One-Hot Encoding :**

Pros:  
Interpretability: One-hot encoding creates one binary column per category, which is easy to interpret.  
No Assumed Order: It doesn't assume any ordinal relationship between the categories, making it suitable for nominal data.  
Avoids Bias: Prevents the model from treating category codes as magnitudes.

Cons:  
High Dimensionality: It can significantly increase the number of features, especially for categorical variables with many unique values (native_country, for example).  
Sparsity: Most entries are zero, so a dense representation (as with sparse_output=False) can increase memory usage and computational cost.

**Label Encoding :**

Pros:  
Efficiency: It's computationally efficient and doesn't increase the number of features.  
Simple to Implement: Easy to apply, and the codes are meaningful when the categorical variable has a natural order and the codes follow it.

Cons:  
Implied Ordinal Relationship: The integer codes impose an order (LabelEncoder assigns them in sorted order of the category names), which is not appropriate for nominal data.  
Potential Bias: Linear and distance-based models might infer a relationship between the encoded values that doesn't exist; tree-based models are less affected.
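
The trade-off can be seen on a toy column. This sketch uses pandas' get_dummies and categorical codes as lightweight stand-ins for OneHotEncoder and LabelEncoder; the color values are made up for illustration.

```python
import pandas as pd

# Toy nominal column (illustrative values).
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One-hot encoding: one 0/1 column per category, no implied order.
one_hot = pd.get_dummies(colors, prefix="color", dtype=int)

# Label-style encoding: a single integer column. Codes follow the sorted
# category names (blue=0, green=1, red=2), not any real-world order.
codes = colors.astype("category").cat.codes

print(one_hot)
print(codes.tolist())
```

The one-hot frame grows by one column per category, while the integer codes stay in a single column but suggest that red (2) is "greater than" blue (0), which has no meaning for a nominal variable.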

# Feature Engineering:
•	Create at least 2 new features that could be beneficial for the model. Explain the rationale behind your choices.                                       
•	Apply a transformation (e.g., log transformation) to at least one skewed numerical feature and justify your choice.

In [19]:

df['age_education_num_iteration'] = df['age'] * df['education_num']  # Interaction feature (the column name says "iteration"; "interaction" is meant).
# This feature multiplies age by education_num to capture any interaction effect between the two.

df['is_high_income'] = df['income'].apply(lambda x: 1 if x == '>50K' else 0)  # Binary feature: is_high_income.
# Intended to be 1 for income above 50K and 0 otherwise. The raw values carry a leading space (' >50K'), so this exact comparison never matches and every row gets 0; compare against x.strip() to get the intended flag.


1.Interaction Feature: age_education_num_interaction -

Rationale behind my choice:                                          
The interaction between age and education_num might capture a relationship where older individuals with higher education levels could have different income levels compared to younger individuals with similar education levels.
This feature could be particularly useful in modeling income, as it combines the effects of age and education.

2.Binary Feature: is_high_income

Rationale behind my choice:                                          
Convert the target variable income into a binary column where income above 50K is encoded as 1 and everything else as 0. This is useful when predicting income as a binary classification task, since most classifiers and metrics expect a numeric label.
Because this column is derived directly from the target, it should serve as the label (or as a tool for analyzing how features relate to income) rather than as an input feature; using it as an input would leak the answer to the model.
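
Both features can be sketched on a few made-up rows. The leading spaces in the income strings are an assumption that mimics raw CSV values; the example shows why string labels are best stripped before an exact comparison.

```python
import pandas as pd

# Hypothetical rows; the leading spaces mimic raw CSV values.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "education_num": [13, 9, 13],
    "income": [" <=50K", " >50K", " >50K"],
})

# Interaction feature: product of age and years of education.
toy["age_education_num_interaction"] = toy["age"] * toy["education_num"]

# An exact comparison misses values that carry surrounding whitespace.
naive_flag = (toy["income"] == ">50K").astype(int)

# Stripping first gives the intended 0/1 flag.
toy["is_high_income"] = (toy["income"].str.strip() == ">50K").astype(int)

print(naive_flag.tolist(), toy["is_high_income"].tolist())
```

A quick value_counts() on the new flag is a cheap sanity check: a column that is all zeros signals that the comparison never matched.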

In [20]:
import numpy as np

#Log Transformation of capital_gain.
df['capital_gain_log'] = np.log1p(df['capital_gain'])  # log1p is used to avoid log(0) issues
#This applies a logarithmic transformation to the capital_gain feature, reducing skewness and making the distribution more normal.

Transformation of Skewed Numerical Feature -                                     
Log Transformation: capital_gain.

Rationale behind my choice:                                                      
The capital_gain feature is highly skewed, with most individuals having a value of 0 and a few having very high values. This skewness can negatively impact certain models by giving disproportionate importance to the few high values.
Applying a log transformation compresses the range of values and reduces skewness. The spike at 0 remains (log1p(0) = 0), but the long right tail is pulled in, which can improve the performance of models that are sensitive to extreme values.
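
The effect on skewness can be checked directly. The sketch below uses a small zero-inflated toy series (made-up values, not the dataset column) and compares pandas' sample skewness before and after log1p.

```python
import numpy as np
import pandas as pd

# Toy zero-inflated, right-skewed values (illustrative only).
gain = pd.Series([0, 0, 0, 0, 0, 0, 2174, 5000, 15000, 99999], dtype=float)

# log1p(x) = log(1 + x), so zeros stay at zero instead of becoming -inf.
gain_log = np.log1p(gain)

# Sample skewness before and after the transform.
skew_before = gain.skew()
skew_after = gain_log.skew()
print(round(skew_before, 2), round(skew_after, 2))
```

The same comparison on the real column, df['capital_gain'].skew() against df['capital_gain_log'].skew(), would quantify how much the transform helps there.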

# Feature Selection:
•	Use the Isolation Forest algorithm to identify and remove outliers. Discuss how outliers can affect model performance.                                       
• Apply the PPS (Predictive Power Score) to find and discuss the relationships between features. Compare its findings with the correlation matrix.


In [21]:
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.05, random_state = 42) # Initialize Isolation Forest; contamination is the expected share of outliers.

outliers = iso_forest.fit_predict(df.select_dtypes(include=['int64', 'float64'])) # Fit on the numeric columns and label every row.
# fit_predict() returns -1 for outliers and 1 for inliers.
outliers

array([ 1,  1,  1, ...,  1,  1, -1])

In [24]:
# Remove outliers
data_cleaned = df[outliers != -1]

In [25]:
data_cleaned

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income,age_education_num_iteration,is_high_income,capital_gain_log
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,507,0,7.684784
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,650,0,0.000000
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,342,0,0.000000
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,371,0,0.000000
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,364,0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K,220,0,0.000000
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K,324,0,0.000000
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K,360,0,0.000000
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K,522,0,0.000000


**How outliers can affect model performance -**

The Isolation Forest algorithm is an effective tool for identifying and removing outliers from a dataset. Outliers are data points that differ significantly from the majority of the data, and they can adversely affect model performance in the following ways:

Impact on Model Accuracy : Outliers can skew the model’s understanding of the data, leading to inaccurate predictions. This is particularly problematic for models like linear regression, which are sensitive to extreme values.            
Increased Variance : Outliers can increase the variance of the model, causing it to be less generalizable to new data.                                         
Misleading Metrics : Outliers can distort metrics such as mean and standard deviation, leading to incorrect assumptions about the distribution of the data.
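
A small numeric sketch of these effects: the toy data below lies exactly on y = 2x, and a single extreme target value is enough to change the least-squares slope, the mean, and the standard deviation. All values are made up for illustration.

```python
import numpy as np

# Toy data lying exactly on y = 2x (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x

# Ordinary least-squares slope on the clean data.
slope_clean = np.polyfit(x, y, 1)[0]

# Replace the last target with an extreme value and refit.
y_out = y.copy()
y_out[-1] = 100.0
slope_outlier = np.polyfit(x, y_out, 1)[0]

# The mean and standard deviation shift as well.
print(slope_clean, slope_outlier)
print(y.mean(), y_out.mean(), y.std(), y_out.std())
```

One corrupted point out of five moves the fitted slope by an order of magnitude, which is the sensitivity of linear regression described above.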

In [27]:
!pip install -U ppscore



In [28]:
import ppscore as pps
import warnings
warnings.filterwarnings('ignore')

pps_matrix = pps.matrix(data_cleaned)   # PPS matrix.

print(pps_matrix)
#PPS can reveal non-linear relationships that the correlation matrix might miss.
#For example, PPS might show a high predictive power between two variables even if their correlation is low.

                    x                            y   ppscore  \
0                 age                          age  1.000000   
1                 age                    workclass  0.008848   
2                 age                       fnlwgt  0.000000   
3                 age                    education  0.073151   
4                 age                education_num  0.000000   
..                ...                          ...       ...   
319  capital_gain_log               native_country  0.000000   
320  capital_gain_log                       income  0.193187   
321  capital_gain_log  age_education_num_iteration  0.000000   
322  capital_gain_log               is_high_income  0.000000   
323  capital_gain_log             capital_gain_log  1.000000   

                   case  is_valid_score               metric  baseline_score  \
0        predict_itself            True                 None        0.000000   
1        classification            True          weighted F1        0.5

In [29]:
correlation_matrix = data_cleaned.corr(numeric_only=True)    # Correlation matrix over numeric columns only; pandas 2.x raises an error on string columns otherwise.

print(correlation_matrix)   #Compare with the correlation matrix.
#Correlation Matrix: Primarily useful for linear relationships, where the interpretation is symmetric
#(i.e., the correlation between A and B is the same as between B and A).

                                  age    fnlwgt  education_num  capital_gain  \
age                          1.000000 -0.078907       0.009447      0.048656   
fnlwgt                      -0.078907  1.000000      -0.037721     -0.015069   
education_num                0.009447 -0.037721       1.000000      0.086576   
capital_gain                 0.048656 -0.015069       0.086576      1.000000   
capital_loss                 0.021189 -0.020683       0.062831     -0.039846   
hours_per_week               0.103382 -0.018618       0.140327      0.050234   
age_education_num_iteration  0.763713 -0.079392       0.621379      0.091448   
is_high_income                    NaN       NaN            NaN           NaN   
capital_gain_log             0.042862 -0.014123       0.069316      0.890719   

                             capital_loss  hours_per_week  \
age                              0.021189        0.103382   
fnlwgt                          -0.020683       -0.018618   
education_num   

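**Discussion of the relationships between features -**

Outliers: With contamination=0.05, the Isolation Forest flags roughly 5% of the rows as outliers. Removing them makes the model less likely to be influenced by extreme values, which can improve generalization, although genuine but rare cases (such as very high capital_gain) may be discarded along with noise.

PPS: PPS can reveal non-linear relationships that the correlation matrix might miss, and it also covers categorical columns. It is asymmetric: the score for x predicting y can differ from the score for y predicting x. In the output above, for example, capital_gain_log predicts income with a PPS of about 0.19, a pairing the correlation matrix cannot show because income is not numeric.

Correlation Matrix: Primarily useful for linear relationships between numeric columns, where the interpretation is symmetric (i.e., the correlation between A and B is the same as between B and A). Here the strongest correlations are between engineered features and their sources, such as capital_gain and capital_gain_log (about 0.89) and age and age_education_num_iteration (about 0.76). The is_high_income row is NaN because that column is constant in this data.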
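
The limits of linear correlation can be illustrated without ppscore. In this sketch (toy data, NumPy only) the target is fully determined by the input through y = x², yet the Pearson correlation is essentially zero; it is a demonstration of the correlation side of the comparison, not a reimplementation of PPS.

```python
import numpy as np

# Symmetric inputs and a purely quadratic target (illustrative values).
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2

# Pearson correlation is close to zero even though y is fully determined by x.
r_xy = np.corrcoef(x, y)[0, 1]

# Correlation is symmetric: corr(x, y) equals corr(y, x).
r_yx = np.corrcoef(y, x)[0, 1]

# Correlating against x**2 instead recovers the dependence.
r_sq = np.corrcoef(x ** 2, y)[0, 1]
print(round(r_xy, 6), round(r_sq, 6))
```

A score based on predictive performance would rate x as a strong predictor of y in this case, which is the kind of relationship PPS is designed to surface.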