# Importing required libraries

We will need pandas , numpy and seaborn to extract, process and plot the data sequentially.

In [64]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from helpers import normalize

# Data Extraction , Processing and Feture Extraction

## Data Extraction

Our Dataset  is divided into multiple sections as :

- Baseline Features: Column-3 to Column-23
- Intensity Parameters: Col24 to Col26
- Formant Frequencies: Col27 to Col30
- Bandwidth Parameters: Col31 to Col34
- Vocal Fold: Col35 to Col56
- MFCC: Col57 to Col140
- Wavelet Features: Col141 to Col322
- TQWT Features: Col323 to Col754
- Class: Col755

*Refer Research Page: 6 - 9*

For our analysis, we will be using the following features:
1. Baseline Features
2. Intensity Parameters
3. Formant Frequencies
4. Bandwidth Parameters
5. MFCC Features
6. Class

Which in total we have 45 features.

In [65]:
# Helper methods for data-extraction
from helpers import read_data

In [None]:
filename = 'orginal_dataset/pd_speech_features.csv'
dataframe = read_data(filename)

y = dataframe['class']
original_df = dataframe.drop(['class'], axis=1)

#basic information of dataset
original_df.info()


Hence, we have dataframe with 45 features / columns and 756 datapoints / rows.

## Data Preprocessing and Feature Extraction

[Ref 1 : Working with Numerical Data](https://machinelearningmastery.com/feature-selection-with-numerical-input-data/)
[Ref 2 : Feature Selection Examples](https://scikit-learn.org/stable/auto_examples/feature_selection/plot_f_test_vs_mi.html#sphx-glr-auto-examples-feature-selection-plot-f-test-vs-mi-py)
[Ref 3 : Correlation and Standarization](https://stats.stackexchange.com/questions/220724/can-i-test-for-correlation-between-variables-before-standardize-them)

### Variance

In [None]:
# Variance check of every columns
variance_df = original_df.var().round(5)
variance_df = variance_df.sort_values(ascending=True)
variance_df.head(10)

We can see first 6 columns with very low variance and will be excluded from our analysis.

In [None]:
# Removing columns with low variance
var_filter_df = original_df.drop(original_df.columns[0:6], axis=1)

### Outlier Detection

We are using Density based clustering model (DBSCAN) to find the outliers on the dataset. Using threshold of 90% quantile to get the farthest point from the cluster.

In [None]:
import hdbscan

model_obj = hdbscan.HDBSCAN(alpha=0.01, min_samples=5,
                            min_cluster_size=10,
                            cluster_selection_epsilon=0.01)
model_obj.fit(normalize(var_filter_df,mode='minmax'))

threshold = pd.Series(model_obj.outlier_scores_).quantile(0.9)
outliers = np.where(model_obj.outlier_scores_ > threshold)[0]
outliers

In [None]:
var_filter_df = var_filter_df.drop(outliers, axis=0).reset_index(drop=True)
y = y.drop(outliers, axis=0).reset_index(drop=True)

## Type 1 - Processing

**Type 1** Data-preprcessing and Feature Selection uses methods like , self correlation and  F-test based selection.

### Correlation

In [None]:
from helpers import correlation_heatmap, get_feature_correlation, to_remove_columns
# Drawing a heatmap of correlation between features
correlation_heatmap(var_filter_df)

Here we can see a few features pair with high correlation (positive or negative) with each other. We will remove these feature analysing their correlation with target variable.

We are filtering the features with correlation greater than *0.8*.

In [None]:
threshold = 0.8
feature_correlation_df = get_feature_correlation(var_filter_df, threshold)
feature_correlation_df

We can see there are 10 features pari with correlation greater than *0.8*.

In [None]:
remove_corr_columns = to_remove_columns(var_filter_df, y, feature_correlation_df)
print(f'We are require to remove {len(remove_corr_columns)} columns, which are: {remove_corr_columns}')

In [None]:
corr_filter_df = var_filter_df.drop(remove_corr_columns, axis=1)

In [None]:
# Lets visualize correlation matrix
correlation_heatmap(corr_filter_df)

### Multi-collinearity Check

We will be using VIF from statsmodels library to check the multi-collinearity.
*VIF_threshold = 10*

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_threshold = 30

 # VIF dataframe
vif_data = pd.DataFrame(corr_filter_df.columns, columns=['Features'])
# calculating VIF for each feature
vif_data["VIF"] = [
    variance_inflation_factor(corr_filter_df, i)
    for i in range(len(corr_filter_df.columns))
]
# Sort VIF in descending order
vif_data.sort_values(by='VIF',ascending=False ,inplace=True)
vif_data
# vif_data[vif_data.VIF > vif_threshold]

We can see a few features with VIF greater than *10*. So we will remove these features.

In [None]:
to_remove_vif = vif_data[vif_data.VIF > vif_threshold].Features.tolist()
print(f'We are require to remove {len(to_remove_vif)} columns, which are: {to_remove_vif}')

multicorr_filter_data = corr_filter_df.drop(to_remove_vif, axis=1)

In [None]:
multicorr_filter_data.head()

### Class Correlation

In [None]:
# Correlation with class

correlate = np.array([multicorr_filter_data[columns].corr(y) for columns in multicorr_filter_data.columns])
corr_df = pd.DataFrame(abs(correlate.round(5)),index=multicorr_filter_data.columns,columns=['Correlation with Class'])
corr_df.sort_values(by='Correlation with Class',ascending=True)

Not a single feature has very low correlation with class. So we don't need to remove any feature.

Till now we are left with 16 features excluding *class*.

We aren't able to work with following selection tests as,

1. Chi2 because it is not applicable for numerical data.
2. Mutual Information because of smaller number of features/samples.
3. Lasso because of small number of features/samples.

### Normalization

We will be using MinMax Normalization to normalize the data.

In [None]:
normalized_df = normalize(corr_filter_df,mode='minmax')
normalized_df.head()

### Stepwise Selection

We will work with backward elimination to select the features. That means we will remove the feature with highest p-value as we go.

In [None]:
normalized_df.shape, y.shape

In [None]:
from helpers import BackwardElimination

backward_elimination = BackwardElimination(
    normalized_df, y, scoring='roc_auc'
)
backward_elimination.fit()

In [None]:
final_df_step1 = backward_elimination.get_transformed_data()
final_df_step1.head()

---

## Type 2 - Processing

**Type 2** Data-preprcessing and Feature Selection uses methods like , PCA-based selection, PCA-based selection with correlation.

### PCA

In [None]:
from helpers import pca_dataframe

pca_df = pca_dataframe(
    normalize(var_filter_df,'minmax'),
    prob=0.1
)

In [None]:
pca_df.head()

With PCA Dimensionality Reduction , we get *31* independent features out of *38* features.
These 31 features can represent 90% of the original features.

### Class Correlation

In [None]:
# Correlation with class

class_correlate = np.array([pca_df[columns].corr(y) for columns in pca_df.columns])
corr_df_2 = pd.DataFrame(
    abs(class_correlate.round(5)),index=pca_df.columns, columns=['Correlation with Class']
)
corr_df_2.sort_values(by='Correlation with Class',ascending=True)

Not a single feature has very low correlation with class. So we don't need to remove any feature.

### Stepwise Selection

We will work with backward elimination to select the features. That means we will remove the feature with highest p-value as we go.

In [None]:
from helpers import BackwardElimination

backward_elimination = BackwardElimination(
    pca_df, y, scoring='roc_auc'
)
backward_elimination.fit()

In [None]:
final_df_step2 = backward_elimination.get_transformed_data()
final_df_step2.head()

# Finalizing

In [90]:
or_len = len(original_df.columns)
s1_len = len(final_df_step1.columns)
s2_len = len(final_df_step2.columns)

col_diff_S1 = set(original_df.columns) - set(final_df_step1.columns)


print(f'We have {s1_len} features in Type-1 pre-processed dataset and {s2_len} features in'
      f'Type-2 pre-processed dataset. That is difference of {or_len - s1_len} and '
      f'{or_len - s2_len} features from original dataset respectively.')
print()
print(f'We have removed {len(col_diff_S1)} features: {col_diff_S1} from Type-1 pre-processed dataset \n and {or_len - s2_len} features from Type-2 pre-processed dataset.')

We have 27 features in Type-1 pre-processed dataset and 30 features inType-2 pre-processed dataset. That is difference of 17 and 14 features from original dataset respectively.

We have removed 17 features: {'apq11Shimmer', 'ppq5Jitter', 'apq5Shimmer', 'apq3Shimmer', 'ddaShimmer', 'RPDE', 'meanPeriodPulses', 'meanIntensity', 'minIntensity', 'locShimmer', 'numPulses', 'rapJitter', 'meanAutoCorrHarmonicity', 'meanHarmToNoiseHarmonicity', 'mean_MFCC_11th_coef', 'ddpJitter', 'locAbsJitter'} from Type-1 pre-processed dataset 
 and 14 features from Type-2 pre-processed dataset.


In [91]:
final_df_step1['class'] = y
final_df_step2['class'] = y

final_df_step1.to_csv('processed_dataset/final_data_S1.csv', index=False)
final_df_step2.to_csv('processed_dataset/final_data_S2.csv', index=False)