In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv("train.csv")
df.shape

(76020, 371)

In [3]:
df.sample(3)

Unnamed: 0,ID,var3,var15,imp_ent_var16_ult1,imp_op_var39_comer_ult1,imp_op_var39_comer_ult3,imp_op_var40_comer_ult1,imp_op_var40_comer_ult3,imp_op_var40_efect_ult1,imp_op_var40_efect_ult3,...,saldo_medio_var33_hace2,saldo_medio_var33_hace3,saldo_medio_var33_ult1,saldo_medio_var33_ult3,saldo_medio_var44_hace2,saldo_medio_var44_hace3,saldo_medio_var44_ult1,saldo_medio_var44_ult3,var38,TARGET
71005,141772,2,23,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0
72773,145339,2,33,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117310.979016,0
37074,74109,2,59,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,412273.98,0


In [4]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,76020.0,75964.050723,43781.947379,1.00,38104.7500,76043.00,113748.7500,151838.00
var3,76020.0,-1523.199277,39033.462364,-999999.00,2.0000,2.00,2.0000,238.00
var15,76020.0,33.212865,12.956486,5.00,23.0000,28.00,40.0000,105.00
imp_ent_var16_ult1,76020.0,86.208265,1614.757313,0.00,0.0000,0.00,0.0000,210000.00
imp_op_var39_comer_ult1,76020.0,72.363067,339.315831,0.00,0.0000,0.00,0.0000,12888.03
...,...,...,...,...,...,...,...,...
saldo_medio_var44_hace3,76020.0,1.858575,147.786584,0.00,0.0000,0.00,0.0000,24650.01
saldo_medio_var44_ult1,76020.0,76.026165,4040.337842,0.00,0.0000,0.00,0.0000,681462.90
saldo_medio_var44_ult3,76020.0,56.614351,2852.579397,0.00,0.0000,0.00,0.0000,397884.30
var38,76020.0,117235.809430,182664.598503,5163.75,67870.6125,106409.16,118756.2525,22034738.76


In [5]:
df['TARGET'].value_counts(normalize=True)*100

TARGET
0    96.043147
1     3.956853
Name: proportion, dtype: float64

In [6]:
# Columns that contain at least one missing value
null_cols = [col for col in df.columns if df[col].isnull().sum() > 0]
null_cols

[]

In [7]:
X = df.drop(columns=['TARGET'])
Y = df['TARGET']

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=10)

In [9]:
df.dtypes.unique()

array([dtype('int64'), dtype('float64')], dtype=object)

In [10]:
numerical_cols = [col for col in df.columns if df[col].dtype in ['int64', 'float64'] and col != 'TARGET']

### Using Variance Threshold

In [12]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline,Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

# Without SMOTE

# a) No variance threshold (baseline)
pipeline1 = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000)
)
pipeline1.fit(X_train, Y_train)
print(f"Baseline F1 score: {f1_score(Y_test, pipeline1.predict(X_test))}")

# b) With variance threshold
# Note: StandardScaler gives every non-constant feature unit variance,
# so threshold=0.05 removes only constant columns at this point.
pipeline2 = make_pipeline(
    StandardScaler(),
    VarianceThreshold(threshold=0.05),
    LogisticRegression(max_iter=1000)
)
pipeline2.fit(X_train, Y_train)
print(f"F1 score with Variance Threshold: {f1_score(Y_test, pipeline2.predict(X_test))}")

# With SMOTE (imblearn's Pipeline applies the sampler during fit only)

# a) No variance threshold
pipeline3 = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("LR", LogisticRegression(max_iter=1000))
])
pipeline3.fit(X_train, Y_train)
print(f"Baseline F1 with SMOTE: {f1_score(Y_test, pipeline3.predict(X_test))}")

# b) With variance threshold (variances are computed on the resampled training data)
pipeline4 = ImbPipeline([
    ("scaler", StandardScaler()),
    ("smote", SMOTE(random_state=42)),
    ("variancethreshold", VarianceThreshold(threshold=0.05)),
    ("LR", LogisticRegression(max_iter=1000))
])
pipeline4.fit(X_train, Y_train)
print(f"F1 Score with Variance Threshold and SMOTE: {f1_score(Y_test, pipeline4.predict(X_test))}")
print(f"Features retained: {sum(pipeline4.named_steps['variancethreshold'].get_support())}")


Baseline F1 score: 0.01126126126126126
F1 score with Variance Threshold: 0.01126126126126126
Baseline F1 with SMOTE: 0.15797430083144368
F1 Score with Variance Threshold and SMOTE: 0.1584258324924319
Features retained: 335
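
A possible explanation for the identical scores of pipeline1 and pipeline2 is the order of the steps: StandardScaler rescales every non-constant feature to unit variance, which leaves VarianceThreshold(threshold=0.05) with nothing but constant columns to remove. The sketch below illustrates this on a small made-up array (the columns are hypothetical, not taken from train.csv) and contrasts it with MinMaxScaler, which is used here only as one assumed alternative.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler, StandardScaler

n = 1000

# Hypothetical toy columns (not from train.csv):
#   col 0: constant
#   col 1: quasi-constant flag, 1% ones -> variance 0.01 * 0.99
#   col 2: evenly spread numeric feature on a large scale
flag = np.zeros(n)
flag[:10] = 1.0
X_toy = np.column_stack([
    np.full(n, 3.0),
    flag,
    np.linspace(0.0, 200_000.0, n),
])

# After StandardScaler every non-constant column has variance 1,
# so threshold=0.05 can only drop the constant column.
X_std = StandardScaler().fit_transform(X_toy)
kept_after_standard = VarianceThreshold(threshold=0.05).fit(X_std).get_support()

# After MinMaxScaler the columns share the [0, 1] range but keep
# different variances, so the same threshold also drops the flag.
X_mm = MinMaxScaler().fit_transform(X_toy)
kept_after_minmax = VarianceThreshold(threshold=0.05).fit(X_mm).get_support()

print("variances after StandardScaler:", X_std.var(axis=0).round(3))
print("kept after StandardScaler:", kept_after_standard)
print("kept after MinMaxScaler:  ", kept_after_minmax)
```

On this toy data the standardized version keeps the quasi-constant flag, while the min-max version drops it. Whether dropping such columns changes the F1 score on the real data is not tested here.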


**Important Points:**
1. Variance Threshold for Feature Selection:
Variance Threshold is used to eliminate columns that have zero or near-zero variance across rows. If a column is constant (i.e., all values are the same), it provides no useful information to the model regarding the target variable and can safely be removed. Similarly, quasi-constant variables (features with very little variance) have minimal impact on the model and can also be dropped.
VarianceThreshold from sklearn is a simple baseline approach to feature selection. It removes all features whose variance does not meet a specified threshold. By default, it eliminates all features with zero variance.

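2. Scaling Before Variance Threshold:
Variance depends on the scale of each feature, so a nonzero threshold is only comparable across features that share a common range. StandardScaler does not help here: it rescales every non-constant feature to unit variance, so any threshold below 1 removes only constant columns. This is consistent with the identical F1 scores of pipeline1 and pipeline2 above. To remove quasi-constant features with a single threshold, apply VarianceThreshold to the raw features or after a range-based scaler such as MinMaxScaler, and standardize afterwards if the model needs it.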

3. Using ColumnTransformer with Pipelines:
A ColumnTransformer applies each of its transformers independently to the columns assigned to it and concatenates the results. This suits preprocessing steps that run in parallel on different column groups (e.g., Standardization on numeric columns, One-Hot or Ordinal Encoding on categorical columns), since these transformations don’t interfere with one another.
Listing two transformers for the same columns does not chain them: each one receives the original columns, and both outputs end up in the result. Sequential steps (e.g., scaling followed by Variance Threshold) therefore cannot be expressed directly within a ColumnTransformer.
In such cases, define a Pipeline that contains the sequential steps and pass that Pipeline to the ColumnTransformer as a single transformer.

4. Ignores Feature-Target Relationship:
VarianceThreshold is an unsupervised method: it looks only at the distribution of each feature, independently of the others and of the target variable. A low-variance feature can therefore still be predictive, either on its own or in combination with other features, and it will be dropped regardless.
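
Point 3 can be sketched as follows. This is a minimal illustration, not the setup used in the cells above: the toy frame, its column names, and the choice of MinMaxScaler and OneHotEncoder are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame; column names are illustrative only.
toy = pd.DataFrame({
    "balance": [0.0, 10.0, 250.0, 40.0, 0.0, 90.0],
    "always_zero": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "segment": ["a", "b", "a", "c", "b", "a"],
})

# Sequential steps for the numeric columns live in their own Pipeline.
numeric_steps = Pipeline([
    ("scale", MinMaxScaler()),
    ("variance", VarianceThreshold(threshold=0.0)),
])

# The Pipeline is passed to the ColumnTransformer as a single transformer,
# next to an independent encoder for the categorical column.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["balance", "always_zero"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"]),
])

transformed = preprocess.fit_transform(toy)
numeric_mask = preprocess.named_transformers_["num"].named_steps["variance"].get_support()

print("output shape:", transformed.shape)
print("numeric columns kept:", numeric_mask)
```

The numeric branch scales first and filters second, while the categorical branch is unaffected by either step. The constant column is removed inside the nested Pipeline, so the output holds one numeric column plus the three one-hot columns.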

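The limitation in point 4 can be shown with a deliberately extreme toy case. The data below is invented for illustration: a rare binary flag that fully determines the target sits next to a high-variance column that carries no signal.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

n = 1000

# Hypothetical data: 4% of rows carry the flag, and the flag equals the target.
rare_flag = np.zeros(n)
rare_flag[:40] = 1.0                   # variance 0.04 * 0.96 = 0.0384
y = rare_flag.astype(int)

# Alternating 0/1 column: variance 0.25, evenly split within both classes.
noise = np.tile([0.0, 1.0], n // 2)

X_toy = np.column_stack([rare_flag, noise])
selector = VarianceThreshold(threshold=0.05).fit(X_toy)

print("variances:", selector.variances_)
print("kept:", selector.get_support())
```

With threshold=0.05 the selector keeps the uninformative column and drops the one that matches the target exactly, because the target is never consulted. A supervised selector would be needed to catch this case; which one fits the dataset above is left open.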