## `Task` Do feature selection on the SECOM dataset using the methods taught in session 54.

Dataset Link : https://archive.ics.uci.edu/ml/datasets/SECOM

Drive Link : https://docs.google.com/spreadsheets/d/1dFCe1zgokabsiEr6BbWmMJtiMefkrChpJWLiG_0dDkk/edit?usp=share_link

In [36]:
# Write your Code here

### `Solution`

In [37]:
import pandas as pd
import numpy as np
DATA = pd.read_csv("uci-secom - uci-secom.csv")

DATA.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 592 entries, Time to Pass/Fail
dtypes: float64(590), int64(1), object(1)
memory usage: 7.1+ MB


In [38]:
DATA.head()

Unnamed: 0,Time,0,1,2,3,4,5,6,7,8,...,581,582,583,584,585,586,587,588,589,Pass/Fail
0,2008-07-19 11:55:00,3030.93,2564.0,2187.7333,1411.1265,1.3602,100.0,97.6133,0.1242,1.5005,...,,0.5005,0.0118,0.0035,2.363,,,,,-1
1,2008-07-19 12:32:00,3095.78,2465.14,2230.4222,1463.6606,0.8294,100.0,102.3433,0.1247,1.4966,...,208.2045,0.5019,0.0223,0.0055,4.4447,0.0096,0.0201,0.006,208.2045,-1
2,2008-07-19 13:17:00,2932.61,2559.94,2186.4111,1698.0172,1.5102,100.0,95.4878,0.1241,1.4436,...,82.8602,0.4958,0.0157,0.0039,3.1745,0.0584,0.0484,0.0148,82.8602,1
3,2008-07-19 14:43:00,2988.72,2479.9,2199.0333,909.7926,1.3204,100.0,104.2367,0.1217,1.4882,...,73.8432,0.499,0.0103,0.0025,2.0544,0.0202,0.0149,0.0044,73.8432,-1
4,2008-07-19 15:22:00,3032.24,2502.87,2233.3667,1326.52,1.5334,100.0,100.3967,0.1235,1.5031,...,,0.48,0.4766,0.1045,99.3032,0.0202,0.0149,0.0044,73.8432,-1


In [39]:
# Logistic Model Before Feature Selection

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Dropping Time column
data = DATA.drop('Time', axis=1)

# Fill NaN values with random draws from each column's [min, max] range
for column in data.columns:
    # print("Column- ", column)
    # Get the minimum and maximum values of the column
    min_value = data[column].min()
    max_value = data[column].max()

    # Generate random numbers within the range
    random_values = np.random.uniform(min_value, max_value, size=data[column].isnull().sum())

    # Create a Series with the random values
    random_series = pd.Series(random_values, index=data[column][data[column].isnull()].index)

    # Fill NaN values with the random series (assign back rather than using inplace on a selected column)
    data[column] = data[column].fillna(random_series)

    # Print
    # print(data[column].isnull().sum())


# data.isnull().sum()
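
Because the random fill above is unseeded, every run imputes different values, so the accuracy figures can change between runs. A minimal sketch of a seeded variant follows; the helper name `random_fill`, its `seed` parameter, and the toy frame are illustrative assumptions, not part of the session material.

```python
import numpy as np
import pandas as pd


def random_fill(frame: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Return a copy of `frame` with NaNs replaced by uniform draws from each column's [min, max]."""
    rng = np.random.default_rng(seed)  # seeded, so repeated runs give the same fill
    filled = frame.copy()              # leave the caller's frame untouched
    for column in filled.columns:
        missing = filled[column].isna()
        n_missing = int(missing.sum())
        if n_missing == 0:
            continue
        low, high = filled[column].min(), filled[column].max()
        filled.loc[missing, column] = rng.uniform(low, high, size=n_missing)
    return filled


# Toy frame (hypothetical values) standing in for the sensor columns
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan], "b": [10.0, 20.0, np.nan, 40.0]})
filled_toy = random_fill(toy, seed=42)
print(filled_toy.isna().sum().sum())  # number of NaNs left after filling
```

Calling `random_fill` twice with the same seed returns identical frames, which makes a before/after comparison of models easier to reproduce.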

In [40]:
# Separate features and target
X = data.drop('Pass/Fail', axis=1)
y = data['Pass/Fail']



# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)


# Initialize and train logistic regression model
log_reg = LogisticRegression(max_iter=10000)  # Increase max_iter if it doesn't converge
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Calculate and print accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)

(1253, 590)
(314, 590)
Test accuracy: 0.8662420382165605


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
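
The warning above suggests scaling the data. One way to do that, sketched here on synthetic stand-in data rather than the SECOM measurements, is to put a `StandardScaler` in front of the logistic regression in a scikit-learn pipeline. The variable names and the synthetic dataset are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): not the SECOM measurements
X_demo, y_demo = make_classification(n_samples=500, n_features=30, random_state=0)
X_demo = X_demo * 1000.0  # exaggerate the feature scale to mimic unscaled sensor readings

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(Xd_train, yd_train)

demo_accuracy = scaled_log_reg.score(Xd_test, yd_test)
print("Demo test accuracy:", demo_accuracy)
```

Keeping the scaler inside the pipeline means its mean and standard deviation come from the training rows only, so the test rows do not influence the preprocessing.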


To perform filter-based feature selection on the UCI SECOM dataset, which has 592 columns (590 numeric sensor features, a `Time` column, and the `Pass/Fail` target), we can use the following methods:

1. Duplicate Features:
   - Identify and remove duplicate columns from the dataset. Columns with identical values provide redundant information and do not contribute to the prediction task.

2. Variance Threshold Method:
   - Calculate the variance of each feature.
   - Remove features with low variance, as they tend to have little or no predictive power.
   - Set a threshold value for variance and remove features below that threshold.

3. Correlation:
   - Compute the correlation matrix of the features.
   - Identify highly correlated features and choose one from each highly correlated group.
   - High correlation between features indicates redundancy, and removing one from each correlated group helps reduce multicollinearity.

4. ANOVA (Analysis of Variance):
   - Perform an ANOVA test between each feature and the target variable ("Pass/Fail").
   - Select features with a significant impact on the target variable.
   - Set a significance level (e.g., p-value threshold) for the test to determine the importance of each feature.

5. Chi-Squared:
   - Apply the Chi-Squared test between each feature and the target variable, treating both the feature and the target as categorical.
   - Select features with a significant association with the target variable.
   - Set a significance level (e.g., p-value threshold) to determine the importance of each feature.

Implementing these feature selection methods in Python using the "UCI SECOM" dataset can be done as follows:

In [41]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, f_classif
from scipy.stats import pearsonr

data_path = "https://docs.google.com/spreadsheets/d/e/2PACX-1vQtBXo5cBnDsM2fmfHPm6u72KGUS5FjPHNGMxOfYjA9-CAhmnRpwkIw_rOR3sANJIToiUU__6fbBvig/pub?gid=572763137&single=true&output=csv"

# Load the dataset
DATA = pd.read_csv(data_path)  # Replace with the actual filename and path

# Remove duplicate features
# Get the subset of columns with duplicate values
duplicated_cols = DATA.columns[DATA.T.duplicated()]

# Remove the duplicated columns
data = DATA.drop(columns=duplicated_cols)

# Drop time column
data.drop('Time', inplace=True, axis=1)


# Numbers Of Columns after removing Duplicate columns
print("Number of Columns - ", DATA.shape[1])
print("Number of Columns after removing duplicate columns- ", data.shape[1])


Number of Columns -  592
Number of Columns after removing duplicate columns-  487


In [29]:
# Variance Threshold Method
selector = VarianceThreshold(threshold=0.01)
sel = selector.fit(data)

columns = data.columns[sel.get_support()]

data_vt = sel.transform(data)

data_vt = pd.DataFrame(data_vt, columns=columns)

# Numbers Of Columns after Variance Threshold Method

print("Number of Columns after Variance Threshold Method- ", data_vt.shape[1])

Number of Columns after Variance Threshold Method-  315


In [30]:
# Correlation
corr_matrix = data_vt.corr().abs()
upper_tri = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.8)]
data_corr = data_vt.drop(to_drop, axis=1)

# Number of columns after the correlation filter

print("Number of Columns after Correlation- ", data_corr.shape[1])

Number of Columns after Correlation-  162
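
To see which column the upper-triangle filter drops from a correlated pair, here is a small sketch on toy data. The column names `a`, `a_copy`, and `b` are hypothetical; the 0.8 threshold matches the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "a_copy": base * 2.0 + 1.0,  # linear function of "a", so perfectly correlated with it
    "b": rng.normal(size=200),   # drawn independently of "a"
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal, so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# A column is dropped if it correlates above the threshold with any earlier column
dropped = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(dropped)
```

Because only the upper triangle is inspected, the later column of a correlated pair (`a_copy` here) is the one removed, and the earlier column is kept.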


In [31]:
data_corr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1567 entries, 0 to 1566
Columns: 162 entries, 0 to Pass/Fail
dtypes: float64(162)
memory usage: 1.9 MB


In [32]:

# ANOVA - the approach shown in the session can't be used here:
# every row of the dataset has at least one NaN value,
# and ANOVA can't be applied to data containing NaN values.


We can still run ANOVA by handling each column separately.

For each column, the test uses only that column's non-missing values together with the target variable, which is the `Pass/Fail` column in our case.

We drop the NaN values of each column before performing the ANOVA.

In [33]:
from scipy.stats import f_oneway

# Significance Value
alpha = 0.05

# Columns having a p-value less than or equal to alpha
column_pvalues = []

# Iterate over each column
for column in data_corr.iloc[:, :-1].columns:
    # Extract the non-missing values in the column
    column_data = data_corr[column].dropna()

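    # One-way ANOVA between the column's values and the target's values, treated as two samples
    # (this tests whether their means differ; it does not split the feature by Pass/Fail class)
    anova_result = f_oneway(column_data, data_corr['Pass/Fail'])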

    # Print the ANOVA result or perform further analysis
    print(f"Column: {column} - ANOVA p-value: {anova_result.pvalue}")

    if anova_result.pvalue <= alpha:
        column_pvalues.append((column, anova_result.pvalue))

# Selecting best 100 Features - lower p-value better feature
# Sort the column p-values in ascending order
column_pvalues.sort(key=lambda x: x[1])

# Select the top 100 columns with the lowest p-values
selected_columns = [column for column, _ in column_pvalues[:100]]

data_anova = data_corr[selected_columns+['Pass/Fail']]

print("Number of Columns after ANOVA- ", data_anova.shape[1])

Column: 0 - ANOVA p-value: 0.0
Column: 1 - ANOVA p-value: 0.0
Column: 2 - ANOVA p-value: 0.0
Column: 3 - ANOVA p-value: 0.0
Column: 4 - ANOVA p-value: 0.0003805229753154665
Column: 6 - ANOVA p-value: 0.0
Column: 12 - ANOVA p-value: 0.0
Column: 14 - ANOVA p-value: 0.0
Column: 15 - ANOVA p-value: 0.0
Column: 16 - ANOVA p-value: 0.0
Column: 18 - ANOVA p-value: 0.0
Column: 19 - ANOVA p-value: 0.0
Column: 21 - ANOVA p-value: 0.0
Column: 22 - ANOVA p-value: 0.0
Column: 23 - ANOVA p-value: 0.0
Column: 24 - ANOVA p-value: 5.0211194207923185e-05
Column: 25 - ANOVA p-value: 0.0
Column: 28 - ANOVA p-value: 0.0
Column: 29 - ANOVA p-value: 0.0
Column: 31 - ANOVA p-value: 0.0
Column: 32 - ANOVA p-value: 0.0
Column: 33 - ANOVA p-value: 0.0
Column: 34 - ANOVA p-value: 0.0
Column: 35 - ANOVA p-value: 0.0
Column: 37 - ANOVA p-value: 0.0
Column: 38 - ANOVA p-value: 0.0
Column: 39 - ANOVA p-value: 0.0
Column: 40 - ANOVA p-value: 0.0
Column: 41 - ANOVA p-value: 0.0
Column: 43 - ANOVA p-value: 0.0
Column: 4
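
The loop above passes a feature column and the `Pass/Fail` column to `f_oneway` as the two samples. A common alternative formulation of ANOVA for feature selection splits each feature's values by target class and tests whether the class means differ. The sketch below illustrates that variant on toy data; `per_class_anova` and the toy columns `shifted` and `noise` are hypothetical names, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway


def per_class_anova(frame: pd.DataFrame, target: str) -> pd.Series:
    """One-way ANOVA p-value per feature, grouping the feature's non-missing values by class."""
    pvalues = {}
    for column in frame.columns.drop(target):
        # Drop only the rows where this particular feature is missing
        observed = frame[[column, target]].dropna(subset=[column])
        groups = [grp[column].to_numpy() for _, grp in observed.groupby(target)]
        # Skip features where the test is undefined (one class only, or a class with < 2 values)
        if len(groups) < 2 or any(len(g) < 2 for g in groups):
            continue
        pvalues[column] = f_oneway(*groups).pvalue
    return pd.Series(pvalues, dtype=float).sort_values()


# Toy data (assumption): one feature whose mean depends on the class, one that does not
rng = np.random.default_rng(0)
labels = np.repeat([-1, 1], 50)
toy = pd.DataFrame({
    "shifted": rng.normal(loc=np.where(labels == 1, 3.0, 0.0), scale=1.0),
    "noise": rng.normal(size=100),
    "Pass/Fail": labels,
})
toy.loc[::10, "noise"] = np.nan  # inject some missing values

anova_pvalues = per_class_anova(toy, "Pass/Fail")
print(anova_pvalues)
```

The returned Series is sorted by p-value, so something like `anova_pvalues.index[:100]` would give the 100 lowest-p-value features under this formulation.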

In [34]:
# Chi-Squared
# -> The features are continuous numerical measurements, not categories, so the chi-squared test is not applied here.
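
If a chi-squared test is still wanted for continuous features, one option is to bin them first. The sketch below, which is not part of the session material, cuts a feature into quantile bins and runs a chi-squared test of independence against the target. The helper `binned_chi2_pvalue`, the choice of four bins, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def binned_chi2_pvalue(feature: pd.Series, target: pd.Series, n_bins: int = 4) -> float:
    """Chi-squared test of independence between a quantile-binned feature and the target."""
    observed = feature.notna()  # ignore rows where the feature is missing
    bins = pd.qcut(feature[observed], q=n_bins, duplicates="drop")
    table = pd.crosstab(bins, target[observed])  # bin-by-class contingency table
    _, pvalue, _, _ = chi2_contingency(table)
    return float(pvalue)


# Toy data (assumption): the feature's mean is shifted for class 1
rng = np.random.default_rng(0)
labels = pd.Series(np.repeat([-1, 1], 50))
shifted = pd.Series(rng.normal(loc=np.where(labels == 1, 3.0, 0.0), scale=1.0))
shifted.iloc[::20] = np.nan  # a few missing values

chi2_p = binned_chi2_pvalue(shifted, labels)
print("chi-squared p-value:", chi2_p)
```

Binning discards information and the result depends on the number of bins, so this is best read as a rough check rather than a replacement for the ANOVA step.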

In [35]:
# Shape of Data after Feature Selection

print("Shape - ", data_anova.shape)

# Filling NaN Values with random
for column in data_anova.columns:
    # print("Column- ", column)
    # Get the minimum and maximum values of the column
    min_value = data_anova[column].min()
    max_value = data_anova[column].max()

    # Generate random numbers within the range
    random_values = np.random.uniform(min_value, max_value, size=data_anova[column].isnull().sum())

    # Create a Series with the random values
    random_series = pd.Series(random_values, index=data_anova[column][data_anova[column].isnull()].index)

    # Fill NaN values with the random series
    data_anova[column].fillna(random_series, inplace=True)

    # Print
    # print(data_anova[column].isnull().sum())


# Separate features and target
X = data_anova.drop('Pass/Fail', axis=1)
y = data_anova['Pass/Fail']



# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(X_test.shape)


# Initialize and train logistic regression model
log_reg = LogisticRegression(max_iter=10000)  # Increase max_iter if it doesn't converge
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Calculate and print accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Test accuracy:", accuracy)

Shape -  (1567, 101)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_anova[column].fillna(random_series, inplace=True)
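
The `SettingWithCopyWarning` above is raised because `data_anova` was created by selecting columns from `data_corr`, and `fillna(..., inplace=True)` is then called on a column selected from it. A minimal sketch of one way to avoid this, using a toy frame with hypothetical names: take an explicit `.copy()` of the subset and assign the filled column back.

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": [np.nan, 5.0, 6.0],
    "z": [7.0, 8.0, 9.0],
})

# Take an explicit copy of the column subset so it owns its data
subset = source[["x", "y"]].copy()

for column in subset.columns:
    # Placeholder fill values (zeros) indexed by the rows that are missing
    fill_values = pd.Series(0.0, index=subset.index[subset[column].isna()])
    # Assign the result back instead of calling fillna(..., inplace=True) on a selected column
    subset[column] = subset[column].fillna(fill_values)

print(subset.isna().sum().sum())  # NaNs left in the copy
print(source.isna().sum().sum())  # NaNs in the original frame, which is not modified
```

Assigning back also keeps working when pandas' copy-on-write behaviour is enabled, where an inplace call on a selected column no longer updates the parent frame.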


(1253, 100)
(314, 100)
Test accuracy: 0.910828025477707


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
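
When comparing the two accuracy figures, it can help to know what a trivial model would score, since accuracy is easy to inflate when one class dominates. The sketch below uses hypothetical labels (90 of one class, 10 of the other), not the SECOM target, to show a majority-class baseline next to balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced labels (assumption): 90 of class -1 and 10 of class 1
y_imbalanced = np.array([-1] * 90 + [1] * 10)
X_placeholder = np.zeros((len(y_imbalanced), 1))  # DummyClassifier ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_imbalanced)
baseline_pred = baseline.predict(X_placeholder)

baseline_accuracy = accuracy_score(y_imbalanced, baseline_pred)
baseline_balanced = balanced_accuracy_score(y_imbalanced, baseline_pred)
print("Baseline accuracy:", baseline_accuracy)
print("Baseline balanced accuracy:", baseline_balanced)
```

Running the same check on the notebook's own data would mean printing `y.value_counts()` and fitting the baseline on `y_train`; a model is only informative to the extent that it beats that baseline.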


# Thank You