## Example: Integrating the F-test with an SVM Classifier

**Objective**

Enhance the feature selection process for an SVM classifier using the F-test to identify the most significant features for classifying data points.



**Dataset**

Consider a dataset where you're trying to predict a binary outcome (e.g., disease no/yes) based on multiple numerical predictors (e.g., various health indicators like blood pressure, cholesterol levels, etc.).

**Steps to Integrate F-test with SVM**
1) Data Preparation:
- Split the data into training and test sets.

2) Feature Selection with F-test:

- Apply the F-test to each feature in the training set. For each feature, the F-test compares the variance between the two group means (disease vs. no disease) with the variance within the groups to determine whether the group means differ significantly.

- Calculate the F-statistic for each feature; a higher value indicates that the feature separates the target classes more strongly.

- Select the top N features with the highest F-statistics. The choice of N depends on the desired model complexity and the inherent trade-off between performance and overfitting.

3) Model Training:
- Train an SVM classifier using only the selected features. SVM is chosen here for its effectiveness in high-dimensional spaces and its capability to model complex nonlinear relationships through the use of kernel functions.

4) Model Evaluation:
- Evaluate the SVM model on the test set to check its performance.
- Metrics such as accuracy, precision, recall, and the ROC-AUC score can be used for a comprehensive assessment.

5) Comparison and Validation:
- Compare the performance of the SVM model with and without the F-test based feature selection to assess the impact.
- Optionally, use cross-validation during the feature selection and model training phases to ensure the model’s generalizability.
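
The steps above can be sketched as a single scikit-learn `Pipeline`, so that scaling and F-test selection are re-fitted inside each cross-validation fold. This is a minimal sketch under stated assumptions: the data is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, and `build_pipeline` are hypothetical and not part of the notebook below.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): 300 rows, 20 features, 4 informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

def build_pipeline(k):
    """Scale, keep the k features with the highest F-statistic, then fit an SVM."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif, k=k)),
        ("svm", SVC(kernel="linear")),
    ])

# k="all" keeps every feature, which serves as the "without selection" baseline.
scores_all = cross_val_score(build_pipeline("all"), X_demo, y_demo, cv=5, scoring="roc_auc")
scores_top = cross_val_score(build_pipeline(4), X_demo, y_demo, cv=5, scoring="roc_auc")

print(f"ROC-AUC, all features:   {scores_all.mean():.3f}")
print(f"ROC-AUC, top 4 features: {scores_top.mean():.3f}")
```

Whether the reduced model scores higher depends on the data; the point of the sketch is only that both variants are evaluated on the same folds.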

**Advantages of Integrating F-test with SVM**

- Reduces Overfitting: By reducing the number of features, you decrease the risk of the model capturing noise in the training data.

- Improves Model Efficiency: Fewer features to process can speed up training and prediction, which is crucial for large datasets or real-time applications.

- Increases Model Performance: By focusing on the most informative features, you can potentially increase the predictive performance of the model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Step 1: Data Collection
# The dataset contains employee attributes (features) and a binary target column, `satisfied`
data = pd.read_csv('Employee Satisfaction Index.csv')

# Preview the first rows and the column names
data.head(25)



Unnamed: 0,emp_id,age,Dept,location,education,recruitment_type,job_level,rating,onsite,awards,certifications,salary,satisfied
0,HR8270,28,HR,Suburb,PG,Referral,5,2,0,1,0,86750,1
1,TECH1860,50,Technology,Suburb,PG,Walk-in,3,5,1,2,1,42419,0
2,TECH6390,43,Technology,Suburb,UG,Referral,4,1,0,2,0,65715,0
3,SAL6191,44,Sales,City,PG,On-Campus,2,3,1,0,0,29805,1
4,HR6734,33,HR,City,UG,Recruitment Agency,2,1,0,5,0,29805,1
5,PUR7265,40,Purchasing,Suburb,UG,Referral,3,3,0,7,1,42419,1
6,PUR1466,26,Purchasing,Suburb,UG,Referral,5,5,0,2,0,86750,0
7,TECH5426,25,Technology,City,UG,Recruitment Agency,1,1,0,4,0,24076,0
8,HR6578,35,HR,City,PG,Referral,3,4,0,0,0,42419,1
9,TECH9322,45,Technology,City,PG,Referral,3,3,0,9,0,42419,0


In [2]:
len(data)

500

In [3]:
# Convert the categorical columns to numeric codes
cleanup_nums = {"Dept":     {"Purchasing": 1, "HR": 2, "Technology":3, "Marketing":4, "Sales":5},
                "location": {"City": 1, "Suburb": 2},
                "education": {"PG":1, "UG":2},
                "recruitment_type": {"On-Campus":1, "Referral":2, "Walk-in":3, "Recruitment Agency":4}}

In [4]:
data = data.replace(cleanup_nums)
data.head(50)

Unnamed: 0,emp_id,age,Dept,location,education,recruitment_type,job_level,rating,onsite,awards,certifications,salary,satisfied
0,HR8270,28,2,2,1,2,5,2,0,1,0,86750,1
1,TECH1860,50,3,2,1,3,3,5,1,2,1,42419,0
2,TECH6390,43,3,2,2,2,4,1,0,2,0,65715,0
3,SAL6191,44,5,1,1,1,2,3,1,0,0,29805,1
4,HR6734,33,2,1,2,4,2,1,0,5,0,29805,1
5,PUR7265,40,1,2,2,2,3,3,0,7,1,42419,1
6,PUR1466,26,1,2,2,2,5,5,0,2,0,86750,0
7,TECH5426,25,3,1,2,4,1,1,0,4,0,24076,0
8,HR6578,35,2,1,1,2,3,4,0,0,0,42419,1
9,TECH9322,45,3,1,1,2,3,3,0,9,0,42419,0
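
The integer codes above impose an order on `Dept` and `recruitment_type` (for example Purchasing < HR < Technology) that the categories do not obviously have, and an F-test on such a column compares class means of those arbitrary codes. One possible alternative, sketched here on a small hypothetical frame rather than the notebook's data, is one-hot encoding with `pandas.get_dummies`:

```python
import pandas as pd

# Tiny hypothetical frame mirroring two of the notebook's columns.
demo = pd.DataFrame({
    "Dept": ["HR", "Technology", "Sales", "Purchasing"],
    "location": ["Suburb", "City", "City", "Suburb"],
})

# One 0/1 indicator column per category; no artificial ordering is introduced.
encoded = pd.get_dummies(demo, columns=["Dept", "location"], dtype=int)
print(encoded.columns.tolist())
```

Each indicator column can then be scored by the F-test on its own, at the cost of a wider feature matrix.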


In [5]:
# The first column (emp_id) is an identifier and the last column (satisfied) is the target
X = data.iloc[:, 1:-1]  # features: every column between emp_id and satisfied
y = data.iloc[:, -1]    # target

### Standardize the feature values

In [6]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

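# Fit and transform the data
# Note: the scaler is fitted on all rows here, before the train/test split in In [7],
# so the test rows influence the scaling statistics
X_normalized = scaler.fit_transform(X)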

# Show the normalized data
print(X_normalized)


[[-1.12522746 -0.64316174  1.03667198 ... -1.24124642 -0.99203175
   1.5364678 ]
 [ 1.20794918  0.0636094   1.03667198 ... -0.89041363  1.00803226
  -0.33817466]
 [ 0.46557479  0.0636094   1.03667198 ... -0.89041363 -0.99203175
   0.64695247]
 ...
 [-0.48890656  0.77038055 -0.96462528 ... -0.89041363 -0.99203175
  -1.11385232]
 [-1.33733443  0.0636094  -0.96462528 ... -1.24124642  1.00803226
  -0.87158784]
 [-1.33733443  0.0636094  -0.96462528 ... -0.53958084 -0.99203175
  -0.33817466]]
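
A common way to keep test rows out of the scaling statistics is to split first and fit the scaler on the training rows only. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, and `scaler_demo` are hypothetical names, not variables from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic stand-in features
y_demo = rng.integers(0, 2, size=200)                     # synthetic binary target

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)  # mean and std come from training rows only
X_te_scaled = scaler_demo.transform(X_te)      # test rows reuse those statistics

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero for each column
```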


In [7]:
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2, random_state=42)

# Feature Selection using F-test
# Select the top 'k' features by F-score; k is set to 11 here, which keeps all 11 input
# features, so lower k to actually discard features
selector = SelectKBest(score_func=f_classif, k=11)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)

In [8]:
print(X_train_selected[0:25, :11])

[[ 0.88978873 -0.64316174 -0.96462528  1.01613007 -0.38396716 -0.02249855
  -0.75298725 -0.93416229  1.56541593 -0.99203175 -0.33817466]
 [ 1.63216311 -1.34993289  1.03667198 -0.98412598  1.43578242  0.68058127
  -0.05057377  1.07047781 -1.59207922  1.00803226  0.64695247]
 [-1.44338791  1.47715169 -0.96462528  1.01613007 -1.29384195  1.38366109
   0.65183971 -0.93416229 -0.18874804 -0.99203175  1.5364678 ]
 [-0.91312049  0.77038055  1.03667198 -0.98412598 -0.38396716  1.38366109
  -0.05057377 -0.93416229 -0.18874804  1.00803226  1.5364678 ]
 [-0.17074611 -0.64316174 -0.96462528 -0.98412598  1.43578242 -1.4286582
   1.35425319 -0.93416229 -0.89041363 -0.99203175 -1.11385232]
 [ 0.14741434 -1.34993289  1.03667198  1.01613007 -1.29384195  0.68058127
   1.35425319  1.07047781  0.51291754  1.00803226  0.64695247]
 [-0.70101353 -1.34993289 -0.96462528  1.01613007  0.52590763 -0.02249855
  -0.05057377  1.07047781  1.56541593 -0.99203175 -0.33817466]
 [-0.06469263 -0.64316174 -0.96462528  1.0

In [9]:
from sklearn.feature_selection import SelectKBest, f_classif

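# Initialize SelectKBest with f_classif
# (this rebinds `selector`, replacing the k=11 selector created in In [7])
selector = SelectKBest(score_func=f_classif, k=10)

# Fit on the full, unscaled dataset (X, y) to inspect the per-feature scores
X_new = selector.fit_transform(X, y)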

# Get the scores (F-values) computed by f_classif
f_scores = selector.scores_

# Optional: get p-values to understand significance
p_values = selector.pvalues_

# Print the feature names with their corresponding F-scores
for feature, score, p_value in zip(X.columns, f_scores, p_values):
    print(f"Feature: {feature}, F-score: {score:.2f}, P-value: {p_value:.4f}")


Feature: age, F-score: 0.01, P-value: 0.9202
Feature: Dept, F-score: 0.21, P-value: 0.6434
Feature: location, F-score: 0.45, P-value: 0.5006
Feature: education, F-score: 0.37, P-value: 0.5439
Feature: recruitment_type, F-score: 0.01, P-value: 0.9361
Feature: job_level, F-score: 0.05, P-value: 0.8219
Feature: rating, F-score: 4.35, P-value: 0.0376
Feature: onsite, F-score: 0.38, P-value: 0.5375
Feature: awards, F-score: 0.11, P-value: 0.7419
Feature: certifications, F-score: 0.07, P-value: 0.7958
Feature: salary, F-score: 0.27, P-value: 0.6053


### Understanding F-values:

- **F-value:** A measure derived from an ANOVA test that compares the means across groups (e.g., classes in classification problems) to see whether they are significantly different. Specifically, the F-value is the ratio of the variance between the groups to the variance within the groups. A higher F-value suggests a larger variance between group means relative to the variance within groups, indicating that the feature likely does a good job of differentiating between classes.

- **Critical F-value:** The threshold determined from the F-distribution at a specified confidence level (usually 95%). To decide whether an F-value is large enough to indicate a statistically significant difference between means, compare it to this critical F-value:

  - If the computed F-value is greater than the critical F-value, the null hypothesis (that the means are the same across the groups) is rejected, suggesting the feature is important.

  - The critical F-value depends on the numerator and denominator degrees of freedom and on the selected alpha level (commonly 0.05, corresponding to a 95% confidence level).
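
To make the ratio concrete, the sketch below computes the F-value by hand as the between-group mean square divided by the within-group mean square, and checks it against `f_classif`. The data is synthetic and the helper `one_way_f` is a hypothetical name introduced only for illustration.

```python
import numpy as np
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)
# Synthetic single feature for two classes (an assumption, not the notebook's data).
group0 = rng.normal(loc=0.0, scale=1.0, size=60)
group1 = rng.normal(loc=0.5, scale=1.0, size=40)
values = np.concatenate([group0, group1])
labels = np.array([0] * 60 + [1] * 40)

def one_way_f(groups):
    """Ratio of the between-group mean square to the within-group mean square."""
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_manual = one_way_f([group0, group1])
f_sklearn, _ = f_classif(values.reshape(-1, 1), labels)
print(f_manual, f_sklearn[0])  # the two values should agree up to floating-point error
```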

### Understanding P-values:

- **P-value:** This measures the probability that the observed results (or more extreme results) could occur under the null hypothesis, which in the context of an F-test is that the group means are equal.

- **Level of Significance (α):** This is a threshold chosen by the researcher to decide when to reject the null hypothesis. Commonly, an α of 0.05 is used, meaning there is a 5% risk of concluding that a difference exists when there is no actual difference.

- If the p-value is less than α (typically 0.05), then the differences in group means are considered statistically significant, and you reject the null hypothesis.

- A lower p-value indicates stronger evidence against the null hypothesis.
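
The p-value is the upper-tail area of the F distribution beyond the observed F-value. As a rough check, the sketch below recomputes it for the `rating` feature from its rounded F-score of 4.35, assuming 2 classes and the 500 rows reported by `len(data)`. The helper name `f_test_p_value` is hypothetical. Because the F-score is rounded, the result should land close to, though not necessarily exactly on, the 0.0376 printed above.

```python
from scipy.stats import f

def f_test_p_value(f_stat, num_groups, total_samples):
    """Upper-tail probability of the F distribution at the observed statistic."""
    df_between = num_groups - 1              # numerator degrees of freedom
    df_within = total_samples - num_groups   # denominator degrees of freedom
    return f.sf(f_stat, df_between, df_within)

# Rounded F-score reported above for `rating`, with 2 classes and 500 rows.
p_rating = f_test_p_value(4.35, num_groups=2, total_samples=500)
print(f"{p_rating:.4f}")
```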

### Practical Example:
If you run an F-test on a feature and obtain an F-value of 10.0 with a corresponding p-value of 0.001:

- **F-value Interpretation:** A value of 10.0 suggests significant variance between the groups compared to within the groups. You would compare this to a critical F-value (which depends on your degrees of freedom and desired confidence level).

- **P-value Interpretation:** A p-value of 0.001 is less than 0.05, strongly indicating that the observed variance is statistically unlikely to occur under the null hypothesis of no difference. Hence, the feature is likely significant in differentiating between the classes.
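
The same decision rule can be applied to the scores printed earlier for this dataset. The sketch below copies those p-values into a dictionary (the name `p_values_reported` is introduced here for illustration) and keeps the features whose p-value falls below α = 0.05.

```python
# P-values copied from the f_classif output printed earlier in this notebook.
p_values_reported = {
    "age": 0.9202, "Dept": 0.6434, "location": 0.5006, "education": 0.5439,
    "recruitment_type": 0.9361, "job_level": 0.8219, "rating": 0.0376,
    "onsite": 0.5375, "awards": 0.7419, "certifications": 0.7958, "salary": 0.6053,
}

alpha = 0.05  # significance level
significant = [name for name, p in p_values_reported.items() if p < alpha]
print(significant)  # → ['rating']
```

By this criterion only `rating` shows a statistically significant difference between the two classes.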


### Calculating Critical F-value:

If needed, you can calculate or look up the critical F-value using statistical tables or software functions (like those in Python's `scipy.stats` library). For instance, to find a critical F-value in Python:

In [10]:
from scipy.stats import f

def calculate_critical_f_value(num_groups, total_samples, confidence_level=0.95):
    # Calculate degrees of freedom
    df_between = num_groups - 1  # Degrees of freedom for the numerator (between groups)
    df_within = total_samples - num_groups  # Degrees of freedom for the denominator (within groups)

    # Calculate the critical F-value at the specified confidence level
    critical_f_value = f.ppf(confidence_level, df_between, df_within)
    return critical_f_value

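# Example usage:
# Note: in the per-feature ANOVA run by f_classif, the groups are the target classes
# (2 in this dataset); 11 is the number of features, so this call is illustrative only
num_groups = 11
total_samples = 500
confidence_level = 0.95  # 95% confidence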

critical_f_value = calculate_critical_f_value(num_groups, total_samples, confidence_level)
print(f"Critical F-value for {num_groups} groups and {total_samples} samples at {confidence_level*100}% confidence level is: {critical_f_value}")


Critical F-value for 11 groups and 500 samples at 95.0% confidence level is: 1.8500646331252697
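
For the per-feature test used in this notebook, the groups are the two values of `satisfied`, so the relevant degrees of freedom are 1 and 498. A minimal sketch under that assumption, comparing the resulting threshold with the rounded F-score of 4.35 printed earlier for `rating` (the variable names are introduced here for illustration):

```python
from scipy.stats import f

num_classes = 2   # satisfied = 0 or 1
num_rows = 500    # len(data) in this notebook
alpha = 0.05

# Critical value of F(1, 498) at the 95% confidence level.
critical_f_two_class = f.ppf(1 - alpha, num_classes - 1, num_rows - num_classes)
f_rating = 4.35   # rounded F-score printed earlier for `rating`

print(f"Critical F(1, {num_rows - num_classes}) = {critical_f_two_class:.3f}")
print(f"rating exceeds it: {f_rating > critical_f_two_class}")
```

This agrees with the p-value view: `rating` is the only feature whose p-value is below 0.05.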


## Training and evaluating the SVM on the selected features

In [11]:
# Training the SVM classifier
svm = SVC(kernel='linear')  # Using a linear kernel
svm.fit(X_train_selected, y_train)

# Predicting the test results
y_pred = svm.predict(X_test_selected)

In [12]:
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

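# Optionally, to see which features were selected.
# Note: `selector` now refers to the k=10 selector fitted in In [9], not the k=11
# selector that produced X_train_selected, so one feature is missing from this list
selected_features_indices = selector.get_support(indices=True)
selected_features_names = X.columns[selected_features_indices]
print(f'Selected features: {selected_features_names}')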


Accuracy: 0.46
              precision    recall  f1-score   support

           0       0.41      0.36      0.39        47
           1       0.49      0.55      0.52        53

    accuracy                           0.46       100
   macro avg       0.45      0.45      0.45       100
weighted avg       0.46      0.46      0.46       100

Selected features: Index(['age', 'Dept', 'location', 'education', 'job_level', 'rating', 'onsite',
       'awards', 'certifications', 'salary'],
      dtype='object')
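
To put the accuracy in context, it helps to compare it with a trivial baseline. The sketch below uses only the class supports from the report above and computes the accuracy of always predicting the more frequent test class; it is a rough reference point, not a substitute for a proper baseline model.

```python
# Class supports copied from the classification report above.
support = {0: 47, 1: 53}

total = sum(support.values())
majority_baseline = max(support.values()) / total  # always predict the larger class
reported_accuracy = 0.46                           # accuracy printed above

print(f"Majority-class baseline: {majority_baseline:.2f}")
print(f"Reported SVM accuracy:   {reported_accuracy:.2f}")
```

Here the reported accuracy falls below the baseline, which is consistent with the weak F-scores: apart from `rating`, none of the features shows a significant relationship with the target.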
