In [2]:
import pandas as pd
import pickle
pd.set_option('display.max_columns', 500)

df = None
with open('data/balancing.pk','rb') as f:
    df = pickle.load(f)

df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


- Feature extraction
    - Feature interactions (e.g. features area and price -> price per unit area). Common text-related methods include bag of words (BoW), TF-IDF, BM25, stemming, tokenization, etc.
- Feature transformation
    - Common for signals, with methods such as the wavelet transform and the Fourier transform; binning / discretization (dividing continuous numerical features into bins, i.e. intervals); encoding (turning categorical features into numerical ones, e.g. one-hot encoding, label encoding, target encoding).
- Feature Selection
    - Dimensionality reduction (PCA, LDA, t-SNE); removal of redundant, irrelevant, and noisy features; randomized methods; correlation-based methods (Pearson correlation coefficient, Spearman rank correlation); filter methods (chi-square test, ANOVA, information gain, correlation coefficient).
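
To make the first two categories concrete, here is a minimal sketch on a hypothetical toy table (the `houses` DataFrame and its columns are made up for illustration and are not part of our dataset). It shows one feature interaction, one binning step and one one-hot encoding with pandas.

```python
import pandas as pd

# Hypothetical toy data; none of these columns exist in the notebook's dataset.
houses = pd.DataFrame({
    "price": [300_000, 450_000, 120_000],
    "area": [100.0, 150.0, 60.0],
    "city": ["A", "B", "A"],
})

# Feature extraction: combine two features into an interaction feature.
houses["price_per_area"] = houses["price"] / houses["area"]

# Feature transformation: bin a continuous feature into labelled intervals.
houses["area_bin"] = pd.cut(
    houses["area"], bins=[0, 80, 120, 200], labels=["small", "medium", "large"]
)

# Feature transformation: one-hot encode a categorical feature.
houses = pd.get_dummies(houses, columns=["city"])
```

The bin edges and labels are arbitrary choices for the example; in practice they would come from domain knowledge or from quantiles of the data.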

## Feature extraction

Since we have no knowledge of what the features represent, I found it difficult to combine them. Thus, I'll skip this step.

## Feature transformation

The same applies to feature transformation: I didn't find any method that I was able to apply here, so I'll skip it as well.

## Feature Selection

Although the dataset has already been through a feature selection method (PCA), we're still going to apply common methods, both to check the results and as an opportunity to learn new methods.

### Principal Component Analysis (PCA)

PCA is a useful method when dealing with high-dimensional data. The main objective is to reduce the number of dimensions while retaining as much information as possible. The origins of PCA go back to Pearson, who tried to find the line, plane or hyperplane closest to a set of points in the least-squares sense.

Note that the hyperplane that best fits the set of points also maximizes the variance of the points projected onto it. This idea is important: the more variability is retained, the more of the information contained in the data is retained.

On the mathematics: consider an N×C matrix, where N is the number of observations and C is the number of columns (features). Each row of the matrix is a point in C-dimensional space. These points have an average point. With a translation, we can shift the set of points so that the new mean is 0. PCA then proceeds by scaling the data to unit variance. Then, with linear algebra, we can compute the line that best fits the matrix (the set of points). The projection of the points onto this line is PC1, the first principal component. Next, the line orthogonal to the PC1 line that best fits the points is calculated, and the projection onto this line is PC2. Note that the lines of PC1 and PC2 span a plane. With PC3, PC4, etc., we get the best-fit hyperplane.

https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186
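
The steps described above (centre, scale, find the best-fit orthogonal directions, project) can be sketched with plain NumPy. This is only an illustration on hypothetical random data, using the singular value decomposition as one common way to obtain the directions; it is not necessarily how any particular library implements PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 observations, 3 columns, two of them correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Shift every column to mean 0 and scale it to unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# 2. SVD: the rows of Vt are orthogonal directions, ordered by explained variance.
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# 3. Project the points onto the first two directions (PC1 and PC2).
scores = X_std @ Vt[:2].T

# Variance captured by each direction.
explained_variance = S**2 / (len(X_std) - 1)
print(explained_variance / explained_variance.sum())
```

Because two of the three toy columns are almost copies of each other, PC1 should capture most of the variance here, which is the situation where dropping the remaining components loses little information.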

In [2]:
import pandas as pd
from sklearn.decomposition import PCA

def apply_pca(df, class_col_name, n_components):
    # Separate the target class column from the feature columns
    X = df.drop(columns=[class_col_name])
    y = df[class_col_name]
    
    # Standardize the feature columns by subtracting the mean and scaling to unit variance
    X_std = (X - X.mean()) / X.std()
    
    # Create a PCA object and fit it to the standardized feature columns
    pca = PCA(n_components=n_components)
    pca.fit(X_std)
    
    # Transform the standardized feature columns to the new PCA space
    X_pca = pca.transform(X_std)
    
    # Create a new DataFrame with the transformed feature columns and the target class column
    pca_columns = [f"PC{i}" for i in range(1, n_components+1)]
    df_pca = pd.DataFrame(X_pca, columns=pca_columns, index=X.index)
    df_pca[class_col_name] = y
    
    return df_pca

df_pca = apply_pca(df, 'Class', 4)
df_pca.head()

Unnamed: 0,PC1,PC2,PC3,PC4,Class
0,-2.147705,0.169151,0.335411,0.603089,0
1,-2.043453,0.58418,0.265906,0.670798,0
2,-2.258784,-0.209164,-0.050699,0.109833,0
3,-2.474737,0.177471,-0.196969,1.155678,0
4,-2.003145,-0.046066,0.844102,0.247561,0


### Linear Discriminant Analysis (LDA)

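LDA is another dimensionality reduction method. LDA chooses new axes such that the separability between classes is optimized (thus it's a supervised method). It is often used as a preprocessing step that can improve classifier performance. With $g$ classes, LDA yields at most $g-1$ components, which is why we use a single component for our two-class problem.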

It's built in three steps:
- Calculate the separability between classes (between-class variance), based on the distance between each class mean and the overall mean.
$$
S_b = \sum^{g}_{i = 1}N_i(\overline{x}_i-\overline{x})(\overline{x}_i-\overline{x})^T
$$
Where $g$ is the number of classes, $N_i$ is the sample size of class $i$, $\overline{x}_i$ is the mean of class $i$ and $\overline{x}$ is the overall mean.
- Calculate the within-class variance, where $x_{i,j}$ is the $j$-th sample of class $i$.
$$
S_w = \sum_{i = 1}^{g}\sum_{j=1}^{N_i}(x_{i,j}-\overline{x}_i)(x_{i,j}-\overline{x}_i)^T
$$
- Construct a lower-dimensional space that maximizes the between-class variance ("separability") and minimizes the within-class variance.

https://medium.com/analytics-vidhya/linear-discriminant-analysis-explained-in-under-4-minutes-e558e962c877
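
The three steps can be written out directly with NumPy. The sketch below uses hypothetical two-class toy data and takes the leading eigenvector of $S_w^{-1}S_b$ as the projection direction, which is the textbook way to maximize between-class over within-class variance; scikit-learn's solvers may compute this differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two classes in 2 dimensions with different means.
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X1 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(50, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)

overall_mean = X.mean(axis=0)
S_b = np.zeros((2, 2))  # between-class scatter
S_w = np.zeros((2, 2))  # within-class scatter
for label in np.unique(y):
    X_c = X[y == label]
    mean_c = X_c.mean(axis=0)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(X_c) * (diff @ diff.T)
    centred = X_c - mean_c
    S_w += centred.T @ centred

# Direction that maximizes between-class relative to within-class variance.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])

# One-dimensional LDA projection of every point.
projected = X @ w
```

A quick sanity check is that the total scatter of the centred data equals `S_b + S_w`, and that the two class means end up far apart along `w` compared with the spread inside each class.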

In [3]:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def apply_lda(df, class_col_name, n_components):
    # Separate the target class column from the feature columns
    X = df.drop(columns=[class_col_name])
    y = df[class_col_name]
    
    # Create an LDA object and fit it to the feature columns and target class column
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    lda.fit(X, y)
    
    # Transform the feature columns to the new LDA space
    X_lda = lda.transform(X)
    
    # Create a new DataFrame with the transformed feature columns and the target class column
    lda_columns = [f"LDA{i}" for i in range(1, n_components+1)]
    df_lda = pd.DataFrame(X_lda, columns=lda_columns, index=X.index)
    df_lda[class_col_name] = y
    
    return df_lda

df_lda = apply_lda(df, 'Class', 1)
df_lda.head()

Unnamed: 0,LDA1,Class
0,-0.728391,0
1,-1.518905,0
2,-1.755254,0
3,-1.318189,0
4,-0.839148,0


### t-distributed Stochastic Neighbor Embedding (t-SNE)

Unlike PCA, t-SNE can separate data that can't be separated by a line or plane. I.e., it's a nonlinear method, useful when PCA can't handle a nonlinear class separation. It's mostly used to understand high-dimensional data by projecting it into 2D or 3D.

The method measures the similarity between pairs of points and then constructs, for each point, a probability of choosing every other point as its neighbor, based on that similarity. Then, it constructs a similar probability distribution in a lower-dimensional space and tries to map the points to the new space by minimizing the divergence between the two distributions, using gradient descent.
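
The two probability distributions and the divergence between them can be sketched as follows. This is a simplified illustration, not a full t-SNE: it assumes a single fixed Gaussian width `sigma` for every point (real t-SNE tunes it per point from the perplexity) and it only evaluates the cost for a given low-dimensional layout instead of optimizing it. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows of Z."""
    sq = np.sum(Z**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.maximum(D, 0.0)

def high_dim_affinities(X, sigma=1.0):
    """Symmetrised Gaussian neighbour probabilities in the original space."""
    P = np.exp(-pairwise_sq_dists(X) / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)      # conditional p_{j|i}
    return (P + P.T) / (2 * len(X))        # joint p_{ij}

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities in the embedding."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that gradient descent would minimize."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))   # hypothetical high-dimensional points
Y_demo = rng.normal(size=(20, 2))   # a random (not yet optimized) 2D layout
P = high_dim_affinities(X_demo)
Q = low_dim_affinities(Y_demo)
print(kl_divergence(P, Q))
```

Both matrices hold one entry per pair of points, so the memory and time grow quadratically with the number of rows, which is one reason t-SNE gets expensive on large datasets.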


In [2]:
import pandas as pd
from sklearn.manifold import TSNE

def apply_tsne(df, class_col_name, n_components):
    # Separate the target class column from the feature columns
    X = df.drop(columns=[class_col_name])
    y = df[class_col_name]
    
    # Create a t-SNE object and fit_transform it on the feature columns (t-SNE is unsupervised, so y is not used)
    tsne = TSNE(n_components=n_components, random_state=42)
    X_tsne = tsne.fit_transform(X)
    
    # Create a new DataFrame with the transformed feature columns and the target class column
    tsne_columns = [f"t-SNE{i}" for i in range(1, n_components+1)]
    df_tsne = pd.DataFrame(X_tsne, columns=tsne_columns, index=X.index)
    df_tsne[class_col_name] = y
    
    return df_tsne


df_tsne = apply_tsne(df, 'Class', 1)
df_tsne.head()
# I wasn't able to run this on my pc

### Redundant features

Redundant features are those features that are highly correlated with other features and thus do not bring much "new" information to distinguish the classes.

In [5]:
import pandas as pd
import numpy as np

def drop_redundant(df, class_col_name):
    # Separate the target class column from the feature columns
    X = df.drop(columns=[class_col_name])
    y = df[class_col_name]
    
    # Drop redundant features using correlation
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
    to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
    X = X.drop(columns=to_drop)
    
    # Create a new DataFrame with the feature columns and the target class column
    df_new = X.copy()
    df_new[class_col_name] = y
    
    return df_new

df_red = drop_redundant(df, 'Class')
df_red.head()


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


### Irrelevant features

Irrelevant features, not to be confused with redundant ones, don't contribute to the model because they have little relationship with the target variable.

In [6]:
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

def drop_irrelevant_features(data, class_col):
    """
    Drops irrelevant features using a random forest classifier and returns a new dataset without these features.
    Parameters:
        - data: pandas DataFrame containing the dataset
        - class_col: name of the column containing the class labels
    Returns:
        - new_data: pandas DataFrame containing the new dataset with only relevant features
    """
    # Separate the features and labels
    X = data.drop(columns=[class_col])
    y = data[class_col]

    # Use a random forest classifier to estimate feature importance
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X, y)

    # Use a threshold to select important features
    selector = SelectFromModel(rf, prefit=True)
    important_features = selector.get_support(indices=True)

    # Keep only the selected feature columns (the indices refer to X, not to data)
    new_data = X.iloc[:, important_features].copy()
    new_data[class_col] = y

    return new_data


df_irr = drop_irrelevant_features(df, 'Class')
df_irr.head()


Unnamed: 0,V2,V3,V4,V10,V11,V12,V14,V16,V17,Class
0,-0.072781,2.536347,1.378155,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0
1,0.266151,0.16648,0.448154,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,0
2,-1.340163,1.773209,0.37978,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,0
3,-0.185226,1.792993,-0.863291,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,0
4,0.877737,1.548718,0.403034,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,0


### Noisy features

Noisy features are those with a random or inconsistent relationship with the target variable, or features that may contain errors.

mRMR (minimum redundancy, maximum relevance) is a method that tries to find the subset of features that maximizes relevance to the target and minimizes redundancy between features.

In [8]:
import pandas as pd
from pymrmr import mRMR

def remove_noisy_features(data, class_col):
    """
    Removes noisy features using the mRMR algorithm and returns a new dataset without these features.
    Parameters:
        - data: pandas DataFrame containing the dataset
        - class_col: name of the column containing the class labels
    Returns:
        - new_data: pandas DataFrame containing the new dataset without noisy features
    """
    # Separate the features and labels
    X = data.drop(columns=[class_col])
    y = data[class_col]

    # Use the mRMR algorithm to select features
    selected_features = mRMR.mRMR(X.values, "MIQ", len(X.columns))
    selected_features = [int(f) for f in selected_features]

    # Remove noisy features from the dataset
    new_data = data.iloc[:, selected_features]
    new_data[class_col] = y

    return new_data

df_noisy = remove_noisy_features(df, 'Class')
df_noisy.head()
# I could not install the mrmr library

ModuleNotFoundError: No module named 'pymrmr'

### Correlation methods: Pearson correlation

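This method is similar to the redundancy method, but instead of comparing features with each other, it measures the Pearson correlation between each feature and the target, and drops the features whose correlation is weak.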

The Pearson correlation is defined by:
$$
r = \frac{\sum(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum(x_i - \overline{x})^2\sum(y_i - \overline{y})^2}}
$$
i.e. it is the covariance of the two variables divided by the product of their standard deviations. It ranges from -1 to 1: 0 means no linear correlation, +1 indicates a perfect positive correlation and -1 a perfect negative one. Pearson's coefficient only captures linear relationships, and significance tests based on it usually assume normally distributed data.
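
As a small worked example, the formula can be implemented in a few lines of plain Python (the function name `pearson_r` and the toy numbers are only for illustration). The last call shows why the linearity caveat matters: `y = x**2` depends completely on `x`, yet the coefficient is 0.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient, following the formula term by term."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))          # perfectly linear, increasing
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))          # perfectly linear, decreasing
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))  # y = x**2, no linear trend
```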

In [9]:
import pandas as pd

def remove_pearson_correlated_features(data, class_col, threshold=0.5):
    """
    Removes features that are weakly correlated (Pearson) with the class label and returns a new dataset without these features.
    Parameters:
        - data: pandas DataFrame containing the dataset
        - class_col: name of the column containing the class labels
        - threshold: absolute Pearson correlation below which features will be removed (default=0.5)
    Returns:
        - new_data: pandas DataFrame containing only the features whose absolute Pearson correlation with the class label reaches the threshold
    """
    # Calculate the Pearson correlation coefficients between features and class label
    pearson_coeffs = data.drop(class_col, axis=1).apply(lambda x: x.corr(data[class_col]))

    # Find the features whose absolute correlation coefficient is below the threshold
    pearson_corr_features = pearson_coeffs[abs(pearson_coeffs) < threshold].index.tolist()

    # Remove the weakly correlated features from the dataset
    new_data = data.drop(pearson_corr_features, axis=1)
    new_data[class_col] = data[class_col]

    return new_data

df_pearson = remove_pearson_correlated_features(df, 'Class')
df_pearson.head()

Unnamed: 0,V2,V3,V4,V7,V9,V10,V11,V12,V14,V16,V17,Class
0,-0.072781,2.536347,1.378155,0.239599,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0
1,0.266151,0.16648,0.448154,-0.078803,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,0
2,-1.340163,1.773209,0.37978,0.791461,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,0
3,-0.185226,1.792993,-0.863291,0.237609,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,0
4,0.877737,1.548718,0.403034,0.592941,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,0


### Correlation methods: Spearman rank correlation

The correlation metric used in this method is the Spearman rank correlation. Instead of comparing the raw values, it ranks them (from lowest to highest) and calculates the Pearson correlation on the ranks. When there are no ties, this reduces to:
$$
\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}
$$
where $n$ is the number of observations and $d_i$ is the difference between the two ranks of observation $i$.

The values are interpreted like Pearson's. Nevertheless, it doesn't assume that the relationship is linear (only that it is monotonic) and it is more robust to outliers than Pearson's. It's often used when dealing with ordinal data.
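
A minimal sketch of the rank-difference formula in plain Python follows. The helper names are hypothetical, and the code assumes there are no tied values; with ties, the Pearson correlation of the (averaged) ranks should be used instead.

```python
def ranks(values):
    """Rank each value from lowest (1) to highest (n); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation via the sum of squared rank differences."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n**2 - 1))

# Monotonic but clearly nonlinear relationship: the ranks agree exactly.
print(spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125]))
# Two neighbouring pairs swapped.
print(spearman_rho([10, 20, 30, 40, 50], [1, 3, 2, 5, 4]))
```

The first call illustrates the difference from Pearson's coefficient: a cubic relationship is not linear, but it is monotonic, so the rank correlation is exactly 1.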

In [10]:
import pandas as pd

def remove_spearman_correlated_features(data, class_col, threshold=0.5):
    """
    Removes features that are weakly correlated (Spearman) with the class label and returns a new dataset without these features.
    Parameters:
        - data: pandas DataFrame containing the dataset
        - class_col: name of the column containing the class labels
        - threshold: absolute Spearman rank correlation below which features will be removed (default=0.5)
    Returns:
        - new_data: pandas DataFrame containing only the features whose absolute Spearman rank correlation with the class label reaches the threshold
    """
    # Calculate the Spearman rank correlation coefficients between features and class label
    spearman_coeffs = data.drop(class_col, axis=1).apply(lambda x: x.corr(data[class_col], method='spearman'))

    # Find the features whose absolute correlation coefficient is below the threshold
    spearman_corr_features = spearman_coeffs[abs(spearman_coeffs) < threshold].index.tolist()

    # Remove the weakly correlated features from the dataset
    new_data = data.drop(spearman_corr_features, axis=1)
    new_data[class_col] = data[class_col]

    return new_data

df_spearman = remove_spearman_correlated_features(df, 'Class')
df_spearman.head()

Unnamed: 0,V1,V2,V3,V4,V7,V9,V10,V11,V12,V14,V16,V17,Class
0,-1.359807,-0.072781,2.536347,1.378155,0.239599,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0
1,1.191857,0.266151,0.16648,0.448154,-0.078803,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,0
2,-1.358354,-1.340163,1.773209,0.37978,0.791461,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,0
3,-0.966272,-0.185226,1.792993,-0.863291,0.237609,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,0
4,-1.158233,0.877737,1.548718,0.403034,0.592941,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,0


### Filter methods: ANOVA

ANOVA is a statistical test that compares the means of two or more groups. For feature selection, the ANOVA F-test measures how much the mean of a feature differs between the classes of the target variable. It works by calculating the ratio of the between-group variability to the within-group variability.
The between-group variability is:
$$
\sum_{i=1}^{K}n_i(\overline{Y}_i-\overline{Y})^2/(K-1)
$$
where $\overline{Y}_i$ is the mean of group $i$, $\overline{Y}$ is the overall mean, $n_i$ is the number of observations in the $i$-th group and $K$ is the number of groups. The within-group variability is:
$$
\sum_{i=1}^{K}\sum_{j=1}^{n_i}(Y_{i,j}-\overline{Y}_i)^2/(N-K)
$$
where $Y_{i,j}$ is the $j$-th observation of group $i$ and $N$ is the total number of observations.
The resulting ratio is then compared to the F-distribution to determine the significance of the difference in means.
The filter method ranks the features by their F-score. The higher the score, the more significant the difference between the class means and the more relevant the feature is to the target variable. The filter method then selects the top $k$ features.

https://en.wikipedia.org/wiki/F-test
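
To see the ratio at work, here is a small plain-Python sketch of the F statistic for one feature whose values are split by class. The function name `anova_f` and the toy values are assumptions made for the example.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of numbers)."""
    K = len(groups)                              # number of groups
    N = sum(len(g) for g in groups)              # total number of observations
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]

    between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    ) / (K - 1)
    within = sum(
        (value - m) ** 2 for g, m in zip(groups, means) for value in g
    ) / (N - K)
    return between / within

# Feature values of class 0 and class 1 overlap heavily: small F.
print(anova_f([[1, 2, 3], [2, 3, 4]]))      # 1.5
# Same spread inside each class, but the class means are far apart: large F.
print(anova_f([[1, 2, 3], [11, 12, 13]]))   # 150.0
```

Both examples have the same within-group variability, so the score only grows because the class means move apart, which is exactly what makes a feature useful for telling the classes apart.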

In [11]:
from sklearn.feature_selection import SelectKBest, f_classif

def select_features_anova(df, class_col, k=10):
    X = df.drop(class_col, axis=1)
    y = df[class_col]
    
    selector = SelectKBest(f_classif, k=k)
    X_new = selector.fit_transform(X, y)
    selected_features = X.columns[selector.get_support()].tolist()
    
    return df[selected_features + [class_col]]

df_anova = select_features_anova(df, 'Class')
df_anova.head()

Unnamed: 0,V2,V3,V4,V9,V10,V11,V12,V14,V16,V17,Class
0,-0.072781,2.536347,1.378155,0.363787,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,0
1,0.266151,0.16648,0.448154,-0.255425,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,0
2,-1.340163,1.773209,0.37978,-1.514654,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,0
3,-0.185226,1.792993,-0.863291,-1.387024,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,0
4,0.877737,1.548718,0.403034,0.817739,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,0


### Filter methods: Information gain

It uses two important concepts.
- Entropy: measure of randomness or uncertainty of a dataset. Entropy of a dataset with $C$ classes:
$$E = -\sum_{i=1}^{C}p_i\log_2 p_i$$
where $p_i$ is the probability of randomly picking an element of the class $i$.
- Information gain: when we transform a dataset, it will have a new entropy. The difference is the information gain:
$$Gain = E_{old}-E_{new}$$
Note that the more entropy removed, the better.

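In the filter method, it can be applied in the following way: you measure the entropy of the target variable and then the information gain of each feature (the difference between the entropy of the target variable before and after the values of that feature are known). Then, it's just a matter of ranking the features by information gain and selecting the top $k$. In the code below, scikit-learn's `mutual_info_classif` estimates the mutual information between each feature and the class, which plays the role of the information gain.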

https://victorzhou.com/blog/information-gain/
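
Both concepts fit in a few lines of plain Python. The sketch below assumes a categorical feature, so that "transforming the dataset" simply means splitting it by feature value; the function names and toy labels are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting by feature value."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    new_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - new_entropy

labels = [0, 0, 1, 1]
print(entropy(labels))                                   # two balanced classes
print(information_gain(labels, ["a", "a", "b", "b"]))    # feature separates the classes
print(information_gain(labels, ["a", "b", "a", "b"]))    # feature tells us nothing
```

A feature that splits the classes perfectly removes all the entropy, while a feature that leaves every group as mixed as before has a gain of 0.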

In [12]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_features_info_gain(df, class_col, k=10):
    X = df.drop(class_col, axis=1)
    y = df[class_col]
    
    info_gains = mutual_info_classif(X, y)
    selected_features = X.columns[np.argsort(info_gains, kind='heapsort')[::-1][:k]].tolist()
    
    return df[selected_features + [class_col]]

df_info_gain = select_features_info_gain(df, 'Class')
df_info_gain.head()

Unnamed: 0,V14,V10,V12,V17,V4,V11,Amount,V3,V16,V7,Class
0,-0.311169,0.090794,-0.617801,0.207971,1.378155,-0.5516,149.62,2.536347,-0.470401,0.239599,0
1,-0.143772,-0.166974,1.065235,-0.114805,0.448154,1.612727,2.69,0.16648,0.463917,-0.078803,0
2,-0.165946,0.207643,0.066084,1.109969,0.37978,0.624501,378.66,1.773209,-2.890083,0.791461,0
3,-0.287924,-0.054952,0.178228,-0.684093,-0.863291,-0.226487,123.5,1.792993,-1.059647,0.237609,0
4,-1.11967,0.753074,0.538196,-0.237033,0.403034,-0.822843,69.99,1.548718,-0.451449,0.592941,0


### Filter methods: chi-square test

The chi-square test is a statistical method used to determine the significance of the association between two categorical variables. In our context, it can measure the dependence between each feature and the class variable. Features with a high chi-square statistic and a low p-value (which measures the statistical significance of the association) are more important. A usual threshold for the p-value is 0.05 or 0.01.

Considering categorical variables, the chi-square formula is:
$$\chi^2 = \sum\frac{(O-E)^2}{E}$$
where $O$ is the observed frequency and $E$ is the expected frequency. I.e. the chi-square test compares the observed frequencies of the categories with the expected frequencies. The expected frequencies are calculated under the assumption that there is no association between the variables. If the observed frequencies differ significantly from them, it indicates that there may be an association.

If a variable is numeric and continuous, it has to be discretized into categories before a classical chi-square test can be applied. Note that scikit-learn's `chi2` does not discretize anything: it requires non-negative feature values and treats them as frequencies, which is why the code below shifts every column with negative values so that its minimum becomes 0.

https://www.scribbr.com/statistics/chi-square-tests/
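
For two categorical variables, the statistic can be computed by hand from a contingency table. The sketch below is plain Python with a hypothetical `chi_square` helper; the expected count of each cell is derived from the row and column totals, i.e. from the assumption of no association.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the row and column variables were independent.
            e = row_totals[i] * col_totals[j] / total
            statistic += (o - e) ** 2 / e
    return statistic

# Rows: feature category, columns: class. No association at all.
print(chi_square([[10, 10], [10, 10]]))   # 0.0
# Moderate association.
print(chi_square([[30, 10], [10, 30]]))   # 20.0
# Each feature category maps to exactly one class.
print(chi_square([[20, 0], [0, 20]]))     # 40.0
```

Turning the statistic into a p-value additionally requires the chi-square distribution with the appropriate degrees of freedom, which is left out of this sketch.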

In [16]:
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

def select_features_chi2(df_o, class_col, k=10):
    df = df_o.copy()
    X = df.drop(class_col, axis=1)
    y = df[class_col]

    # chi2 requires non-negative values, so shift any column that has negatives
    for col in X.columns:
        col_min = X[col].min()
        if col_min < 0:
            X[col] = X[col] - col_min
    
    chi2_selector = SelectKBest(chi2, k=k)
    chi2_selector.fit(X, y)
    selected_features = X.columns[chi2_selector.get_support()].tolist()
    
    return df[selected_features + [class_col]]

df_chi2 = select_features_chi2(df, 'Class')
df_chi2.head()

Unnamed: 0,Time,V3,V4,V10,V11,V12,V14,V16,V17,Amount,Class
0,0.0,2.536347,1.378155,0.090794,-0.5516,-0.617801,-0.311169,-0.470401,0.207971,149.62,0
1,0.0,0.16648,0.448154,-0.166974,1.612727,1.065235,-0.143772,0.463917,-0.114805,2.69,0
2,1.0,1.773209,0.37978,0.207643,0.624501,0.066084,-0.165946,-2.890083,1.109969,378.66,0
3,1.0,1.792993,-0.863291,-0.054952,-0.226487,0.178228,-0.287924,-1.059647,-0.684093,123.5,0
4,2.0,1.548718,0.403034,0.753074,-0.822843,0.538196,-1.11967,-0.451449,-0.237033,69.99,0


## Feature selection method comparison

In [17]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score, matthews_corrcoef, balanced_accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression


def applyLogisticRegression(df, method_name = "",class_weights = None):
    # Split data into train and test sets
    X = df.drop('Class', axis=1)
    y = df['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train classifier
    clf = LogisticRegression(random_state=42)
    if class_weights is not None:
        clf = LogisticRegression(random_state=42,class_weight={0:class_weights[0], 1:class_weights[1]})
    clf.fit(X_train, y_train)
    
    # Make predictions on test set
    y_pred = clf.predict(X_test)
    
    # Calculate performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision_0 = precision_score(y_test, y_pred, pos_label=0)
    precision_1 = precision_score(y_test, y_pred, pos_label=1)
    recall_0 = recall_score(y_test, y_pred, pos_label=0)
    recall_1 = recall_score(y_test, y_pred, pos_label=1)
    f1_0 = f1_score(y_test, y_pred, pos_label=0)
    f1_1 = f1_score(y_test, y_pred, pos_label=1)
    roc_auc = roc_auc_score(y_test, y_pred)
    cohen_kappa = cohen_kappa_score(y_test, y_pred)
    matthews_corr = matthews_corrcoef(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    g_mean = (recall_0*recall_1)**0.5
    classification_error = 1 - accuracy
    sensitivity_0 = recall_0
    sensitivity_1 = recall_1
    # Specificity with one class as positive is the recall of the other class
    specificity_0 = recall_1
    specificity_1 = recall_0
    
    # Return dictionary with performance metrics
    return {
        'method_name': method_name,
        'accuracy': accuracy,
        'precision_0': precision_0,
        'precision_1': precision_1,
        'recall_0': recall_0,
        'recall_1': recall_1,
        'f1_0': f1_0,
        'f1_1': f1_1,
        'roc_auc': roc_auc,
        'cohen_kappa': cohen_kappa,
        'matthews_corr': matthews_corr,
        'balanced_accuracy': balanced_accuracy,
        'g_mean': g_mean,
        'classification_error': classification_error,
        'sensitivity_0': sensitivity_0,
        'sensitivity_1': sensitivity_1,
        'specificity_0': specificity_0,
        'specificity_1': specificity_1
    }
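
As a cross-check of how these metrics relate to each other, the sketch below derives some of them from the four cells of a binary confusion matrix. The function name and the counts are hypothetical; the point is that sensitivity is the recall of the positive class, specificity is the recall of the negative class, and balanced accuracy and G-mean are their arithmetic and geometric means.

```python
def binary_metrics(tp, fp, tn, fn):
    """Metrics for the positive class from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall of the positive class
    specificity = tn / (tn + fp)   # recall of the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "g_mean": (sensitivity * specificity) ** 0.5,
    }

# Hypothetical counts: 100 positives (80 found), 100 negatives (90 found).
metrics = binary_metrics(tp=80, fp=10, tn=90, fn=20)
print(metrics)
```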


In [20]:
df_ans = applyLogisticRegression(df,method_name="df")
df_pca_ans = applyLogisticRegression(df_pca,method_name="pca")
df_lda_ans = applyLogisticRegression(df_lda,method_name="lda")
df_red_ans = applyLogisticRegression(df_red,method_name="red")
df_irr_ans = applyLogisticRegression(df_irr,method_name="irr")
df_pearson_ans = applyLogisticRegression(df_pearson,method_name="pearson")
df_spearman_ans = applyLogisticRegression(df_spearman,method_name="spearman")
df_anova_ans = applyLogisticRegression(df_anova,method_name="anova")
df_info_gain_ans = applyLogisticRegression(df_info_gain,method_name="info")
df_chi2_ans = applyLogisticRegression(df_chi2,method_name="chi2")



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [21]:
# Performance comparison

ans = [df_ans,
       df_pca_ans,
       df_lda_ans,
       df_red_ans,
       df_irr_ans,
       df_pearson_ans,
       df_spearman_ans,
       df_anova_ans,
       df_info_gain_ans,
       df_chi2_ans
       ]

n = len(ans)
name_str = 'method_name'
for score_name in [v for v in df_ans if v != name_str]:
    print(f'====== {score_name} ======')
    ans = sorted(ans, key = lambda x: x[score_name], reverse = True)
    print('score:',end=" ")
    for i in range(n):
        text = f"{ans[i][score_name]} ({ans[i][name_str]})"
        print(text,end='\t')
    print()
    print('improvement(from next/worst):',end=" ")
    for i in range(n-1):
        text = f"{round(100*((ans[i][score_name]/ans[i+1][score_name])-1),6)}%, {round(100*((ans[i][score_name]/ans[n-1][score_name])-1),6)}% ({ans[i][name_str]})"
        print(text,end='\t')
    print()


score: 0.9769621722385382 (spearman)	0.976839069342103 (pearson)	0.9761444172836466 (info)	0.9752387316884441 (irr)	0.975221145560382 (anova)	0.9739021859557181 (df)	0.9655751543182738 (chi2)	0.955498302938642 (lda)	0.9537660693245168 (red)	0.9485517823540791 (pca)	
improvement(from next/worst): 0.012602%, 2.995133% (spearman)	0.071163%, 2.982155% (pearson)	0.092868%, 2.908922% (info)	0.001803%, 2.813441% (irr)	0.13543%, 2.811587% (anova)	0.862391%, 2.672538% (df)	1.054617%, 1.79467% (chi2)	0.18162%, 0.732329% (lda)	0.54971%, 0.54971% (red)	
score: 0.9661454721952573 (df)	0.9647468919568651 (spearman)	0.9644193470872954 (pearson)	0.9638767276161044 (info)	0.9628574370945041 (irr)	0.9625544615595732 (anova)	0.9598586523230108 (chi2)	0.9478344059836493 (red)	0.9429903498397354 (lda)	0.9192599792425166 (pca)	
improvement(from next/worst): 0.144969%, 5.100352% (df)	0.033963%, 4.94821% (spearman)	0.056296%, 4.912578% (pearson)	0.105861%, 4.853551% (info)	0.031476%, 4.742669% (irr)	0.280855%

Spearman and Pearson seem to perform well, even achieving a better precision on class 1, although the unchanged df had a better recall.
The metrics, though, were worse than before, when we compared some other models using the decision tree classifier. Switching to this classifier, we have:

In [22]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, cohen_kappa_score, matthews_corrcoef, balanced_accuracy_score, classification_report


def applyDecisionTree(df, method_name = ""):
    # Split data into train and test sets
    X = df.drop('Class', axis=1)
    y = df['Class']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    
    # Initialize a decision tree classifier and fit it to the training data
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X_train, y_train)

    
    # Make predictions on test set
    y_pred = clf.predict(X_test)
    
    # Calculate performance metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision_0 = precision_score(y_test, y_pred, pos_label=0)
    precision_1 = precision_score(y_test, y_pred, pos_label=1)
    recall_0 = recall_score(y_test, y_pred, pos_label=0)
    recall_1 = recall_score(y_test, y_pred, pos_label=1)
    f1_0 = f1_score(y_test, y_pred, pos_label=0)
    f1_1 = f1_score(y_test, y_pred, pos_label=1)
    # Note: computed from hard labels (not probabilities), so this equals balanced accuracy
    roc_auc = roc_auc_score(y_test, y_pred)
    cohen_kappa = cohen_kappa_score(y_test, y_pred)
    matthews_corr = matthews_corrcoef(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    g_mean = (recall_0*recall_1)**0.5
    classification_error = 1 - accuracy
    sensitivity_0 = recall_0
    sensitivity_1 = recall_1
    # Specificity of a class is the true negative rate, i.e. the recall of the other class
    specificity_0 = recall_1
    specificity_1 = recall_0
    
    # Return dictionary with performance metrics
    return {
        'method_name': method_name,
        'accuracy': accuracy,
        'precision_0': precision_0,
        'precision_1': precision_1,
        'recall_0': recall_0,
        'recall_1': recall_1,
        'f1_0': f1_0,
        'f1_1': f1_1,
        'roc_auc': roc_auc,
        'cohen_kappa': cohen_kappa,
        'matthews_corr': matthews_corr,
        'balanced_accuracy': balanced_accuracy,
        'g_mean': g_mean,
        'classification_error': classification_error,
        'sensitivity_0': sensitivity_0,
        'sensitivity_1': sensitivity_1,
        'specificity_0': specificity_0,
        'specificity_1': specificity_1
    }
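
In a binary problem, the sensitivity of a class is its recall, and its specificity is the true negative rate, which coincides with the recall of the other class. The sketch below checks this by counting the confusion-matrix cells by hand. The helper `per_class_rates` and the toy labels are made up for illustration and are not part of the notebook's data.

```python
def per_class_rates(y_true, y_pred, positive):
    """Return (sensitivity, specificity), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)  # recall of the positive class
    specificity = tn / (tn + fp)  # recall of the negative class
    return sensitivity, specificity


# Toy labels, invented for this example only
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1]

sens_0, spec_0 = per_class_rates(y_true, y_pred, positive=0)
sens_1, spec_1 = per_class_rates(y_true, y_pred, positive=1)

print("class 0:", sens_0, spec_0)
print("class 1:", sens_1, spec_1)
```

Here the specificity of class 0 equals the sensitivity of class 1 and vice versa, so only two of the four numbers carry independent information.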



In [23]:
df_ans = applyDecisionTree(df,method_name="df")
df_pca_ans = applyDecisionTree(df_pca,method_name="pca")
df_lda_ans = applyDecisionTree(df_lda,method_name="lda")
df_red_ans = applyDecisionTree(df_red,method_name="red")
df_irr_ans = applyDecisionTree(df_irr,method_name="irr")
df_pearson_ans = applyDecisionTree(df_pearson,method_name="pearson")
df_spearman_ans = applyDecisionTree(df_spearman,method_name="spearman")
df_anova_ans = applyDecisionTree(df_anova,method_name="anova")
df_info_gain_ans = applyDecisionTree(df_info_gain,method_name="info")
df_chi2_ans = applyDecisionTree(df_chi2,method_name="chi2")



In [24]:
# Performance comparison

ans = [df_ans,
       df_pca_ans,
       df_lda_ans,
       df_red_ans,
       df_irr_ans,
       df_pearson_ans,
       df_spearman_ans,
       df_anova_ans,
       df_info_gain_ans,
       df_chi2_ans
       ]

n = len(ans)
name_str = 'method_name'
for score_name in [v for v in df_ans if v != name_str]:
    print(f'====== {score_name} ======')
    ans = sorted(ans, key = lambda x: x[score_name], reverse = True)
    print('score:',end=" ")
    for i in range(n):
        text = f"{ans[i][score_name]} ({ans[i][name_str]})"
        print(text,end='\t')
    print()
    print('improvement(from next/worst):',end=" ")
    for i in range(n-1):
        text = f"{round(100*((ans[i][score_name]/ans[i+1][score_name])-1),6)}%, {round(100*((ans[i][score_name]/ans[n-1][score_name])-1),6)}% ({ans[i][name_str]})"
        print(text,end='\t')
    print()


score: 0.9986282820111496 (df)	0.9983644900902169 (chi2)	0.9982589733218438 (red)	0.9980567328491287 (info)	0.9980391467210664 (pearson)	0.9979951814009109 (spearman)	0.9978544923764134 (anova)	0.9977841478641647 (irr)	0.988278845646554 (pca)	0.930912895907708 (lda)	
improvement(from next/worst): 0.026422%, 7.274084% (df)	0.01057%, 7.245747% (chi2)	0.020263%, 7.234412% (red)	0.001762%, 7.212687% (info)	0.004405%, 7.210798% (pearson)	0.014099%, 7.206075% (spearman)	0.00705%, 7.190962% (anova)	0.961804%, 7.183406% (irr)	6.162333%, 6.162333% (pca)	
score: 0.9995233724050275 (df)	0.999187910458301 (red)	0.999011927447244 (chi2)	0.9988526442137965 (spearman)	0.9988351776354106 (info)	0.9987823171269743 (pearson)	0.9985178388679112 (anova)	0.9984296704072271 (irr)	0.9900776455190222 (pca)	0.9320009189065013 (lda)	
improvement(from next/worst): 0.033573%, 7.244891% (df)	0.017616%, 7.208898% (red)	0.015947%, 7.190015% (chi2)	0.001749%, 7.172925% (spearman)	0.005292%, 7.171051% (info)	0.026487%

In this case, the unchanged df performed best on the metrics shown and, thus, we shall keep it.
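
The tab-separated ranking above is hard to scan, so the same comparison can be laid out as a table. This is a minimal sketch, assuming pandas is available; the helper `rank_by` and the column names `vs_next_%` and `vs_worst_%` are hypothetical. The four scores are copied from the first score line of the decision tree output above, and only a subset of methods is included, so the "next" column differs from the printed one wherever a skipped method sat in between.

```python
import pandas as pd

# Scores copied from the first score line printed above (subset of methods)
results = [
    {"method_name": "df", "score": 0.9986282820111496},
    {"method_name": "chi2", "score": 0.9983644900902169},
    {"method_name": "pca", "score": 0.988278845646554},
    {"method_name": "lda", "score": 0.930912895907708},
]


def rank_by(results, score_name, name_key="method_name"):
    """Sort methods by score and add relative improvements in percent."""
    table = (
        pd.DataFrame(results)
        .set_index(name_key)[[score_name]]
        .sort_values(score_name, ascending=False)
    )
    scores = table[score_name]
    # Improvement over the next-best method (NaN for the last row)
    table["vs_next_%"] = 100 * (scores / scores.shift(-1) - 1)
    # Improvement over the worst method
    table["vs_worst_%"] = 100 * (scores / scores.iloc[-1] - 1)
    return table


ranking = rank_by(results, "score")
print(ranking.round(6))
```

Both percentage columns use the same formula as the loop, `100 * (a / b - 1)`, so they are relative differences rather than differences in percentage points.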

In [3]:
# store results after feature selection

import pickle

with open("data/feature.pk",'wb') as f:
    pickle.dump(df,f)