# Data preparation

The original data set [https://archive.ics.uci.edu/dataset/728/toxicity-2] has ~1200 features and only ~170 rows. It has no missing values and doesn't need any initial cleaning.

However, to build a meaningful model with so few rows, we need to reduce the feature set to 4 features.

The code in this notebook achieves this by:

- using 2 different feature reduction methods from the ```sklearn.feature_selection``` module
    - Recursive Feature Elimination (RFE)
    - SelectKBest
- choosing only those features selected by both methods
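The claims about the raw data (its shape and the absence of missing values) are easy to confirm once the file is loaded. A minimal sketch, using a tiny synthetic frame as a stand-in for the real CSV; the helper name `summarise_frame` is hypothetical and not part of this notebook:

```python
import pandas as pd

def summarise_frame(df: pd.DataFrame) -> dict:
    """Return the row count, column count and total number of missing cells."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing_cells": int(df.isna().sum().sum()),
    }

# Synthetic stand-in (assumption: the real data would come from pd.read_csv)
demo = pd.DataFrame({
    "f1": [0.1, 0.2, 0.3],
    "f2": [1, 0, 1],
    "Class": ["Toxic", "NonToxic", "Toxic"],
})
summary = summarise_frame(demo)
print(summary)
```

Running the same helper on the loaded raw data would show whether the row and feature counts quoted above still hold for the downloaded file.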

---

## Imports and Constants

In [1]:
import pandas as pd
import pickle
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

#string constants - centralise to avoid later 'finger trouble'
data_folder = "../data/"    #folder path relative to the notebooks folder
raw_data    = f"{data_folder}original-data.csv"
final_data  = f"{data_folder}final-data.csv"

class_column = "Class"


---

# Data prep


In [2]:
# Load the CSV dataset into a pandas DataFrame
data = pd.read_csv(raw_data)

# Separate the features (X) and the target variable (y)
# convert y variable to be a numeric value with 0=NonToxic and 1=Toxic
mapping = {"NonToxic":0, "Toxic":1}
y = data[class_column].map(mapping)

X = data.drop(class_column, axis=1)

X.head()

Unnamed: 0,MATS3v,nHBint10,MATS3s,MATS3p,nHBDon_Lipinski,minHBint8,MATS3e,MATS3c,minHBint2,MATS3m,...,WTPT-3,WTPT-4,WTPT-5,ETA_EtaP_L,ETA_EtaP_F,ETA_EtaP_B,nT5Ring,SHdNH,ETA_dEpsilon_C,MDEO-22
0,0.0908,0,0.0075,0.0173,0,0.0,-0.0436,0.0409,0.0,0.1368,...,0.0,0.0,0.0,0.178,1.5488,0.0088,0,0.0,-0.0868,0.0
1,0.0213,0,0.1144,-0.041,0,0.0,0.1231,-0.0316,0.0,0.1318,...,28.2185,8.866,19.3525,0.1739,1.3718,0.0048,2,0.0,-0.081,0.25
2,0.0018,0,-0.0156,-0.0765,2,0.0,-0.1138,-0.1791,0.0,0.0615,...,33.1064,5.2267,27.8796,0.1688,1.4395,0.0116,2,0.0,-0.1004,0.0
3,-0.0251,0,-0.0064,-0.0894,3,0.0,-0.0747,-0.1151,0.0,0.0361,...,32.5232,7.7896,24.7336,0.1702,1.4654,0.0133,2,0.0,-0.101,0.0
4,0.0135,0,0.0424,-0.0353,0,0.0,-0.0638,0.0307,0.0,0.0306,...,32.0726,12.324,19.7486,0.1789,1.4495,0.012,2,0.0,-0.1071,0.0
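One thing to watch with ```Series.map``` and a dictionary: any label that is not a key in the dictionary silently becomes ```NaN```. A minimal sketch of a stricter encoding step, on made-up labels; the helper name `encode_labels` is hypothetical and not part of this notebook:

```python
import pandas as pd

mapping = {"NonToxic": 0, "Toxic": 1}

def encode_labels(labels: pd.Series, mapping: dict) -> pd.Series:
    """Map class labels to integers and fail loudly on any unmapped label."""
    encoded = labels.map(mapping)
    if encoded.isna().any():
        unknown = sorted(labels[encoded.isna()].astype(str).unique())
        raise ValueError(f"Unmapped class labels: {unknown}")
    return encoded.astype(int)

# Made-up labels for illustration only
demo_labels = pd.Series(["Toxic", "NonToxic", "Toxic"])
encoded = encode_labels(demo_labels, mapping)
print(encoded.to_list())
```

A misspelt or differently-cased label (e.g. "toxic") would raise a ```ValueError``` here instead of flowing into the feature selectors as a missing target.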


---

# Feature selection

Uses two different methods to reduce a dataset to a desired number of features (n):

- ```RFE``` with Random Forest - similar to the original paper
- ```SelectKBest``` - a model-agnostic method that ranks features with a statistical test of significance (here the ANOVA F-test, ```f_classif```)

Compare the two lists of features and keep only the features that appear in both lists.

Iteratively look for common features, starting with a very aggressive reduction (i.e. low n) and increasing n until we meet the desired number of common features (4).

Finally, create the final csv of data with only these features.

\* _the "hope" is that, by using two very different techniques, the final feature set leads to models which generalise better_
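As a toy illustration of the idea, the two selectors can be run on a small synthetic dataset and their selections intersected. This is only a sketch: the data comes from ```make_classification```, and the sizes, ```n_estimators``` and seeds are assumed values chosen to keep it quick, not the settings used in this notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

# Synthetic stand-in data: 60 rows, 12 features, 3 of them informative (assumed values)
X_arr, y_demo = make_classification(n_samples=60, n_features=12, n_informative=3,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

n = 5  # number of features each method keeps

# Method 1: recursive feature elimination driven by a random forest
rfe = RFE(RandomForestClassifier(n_estimators=25, random_state=0), n_features_to_select=n)
rfe.fit(X_demo, y_demo)
rfe_features = X_demo.columns[rfe.support_].to_list()

# Method 2: univariate ANOVA F-test ranking
kbest = SelectKBest(score_func=f_classif, k=n).fit(X_demo, y_demo)
kbest_features = X_demo.columns[kbest.get_support()].to_list()

# Keep only the features chosen by both methods, preserving the RFE order
common = [f for f in rfe_features if f in kbest_features]
print("RFE:        ", rfe_features)
print("SelectKBest:", kbest_features)
print("Common:     ", common)
```

The two methods need not agree on all n features, which is why n is grown until the overlap is large enough.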

### Functions

In [3]:
def ReduceFeatures(X: pd.DataFrame, 
                   y: pd.Series, 
                   n:int) -> dict:
    """
    ReduceFeatures produces a dictionary of method -> feature list mappings

    Args:
        X (pd.DataFrame): input variables
        y (pd.Series):    response variable
        n (int):          number of features to reduce to

    Returns:
        dict: selected feature names keyed by method name
    """    

    # placeholder for features results
    top_features = {}

    # 1) RFE using Random Forest ------------------------------------------------------------------------------------------------------
    # Initialize the RFE object with the Random Forest model and the desired number of features to keep
    rfe = RFE(RandomForestClassifier(random_state=666), n_features_to_select=n)  

    # Fit the RFE to the data
    rfe.fit(X, y)

    # Get the selected features
    selected_features = X.columns[rfe.support_].to_list()
    top_features["RFE: Random Forest"] = selected_features
    
    # 2) SelectKBest ---------------------------------------------------------------------------------------------
    # Initialize the SelectKBest feature selector
    feature_selector = SelectKBest(score_func=f_classif, k=n)  

    # Fit the feature selector to the data (only the support mask is needed, not the transformed array)
    feature_selector.fit(X, y)

    # Get the selected feature indices
    selected_feature_indices = feature_selector.get_support(indices=True)

    # Get the selected feature names
    selected_features = X.columns[selected_feature_indices].to_list()
    top_features["SelectKBest"] = selected_features

    return top_features

# _________________________________________________________________________________________________________________________________________
def FindCommon(commonfeatures: dict) -> list:
    """
    FindCommon: compares all lists in the dictionary and returns a list of those items found in every list

    Args:
        commonfeatures (dict): a list of features for each method attempted (2 only in the first draft)

    Note: 
        the code is set up for a future case where we could compare more than 2 methods (only if I have time)
    """   
    feature_counts = {}
    for key, features in commonfeatures.items():
        for f in features:
            if f in feature_counts.keys():
                feature_counts[f]+=1
            else:
                feature_counts[f]=1

    #finally use pandas to get the list of features that exist in every set
    number_feature_sets = len(commonfeatures.keys())
    df_feature_Counts = pd.DataFrame(list(feature_counts.items()),columns=['features', 'counts'])
    final_features = df_feature_Counts[df_feature_Counts['counts']==number_feature_sets] 

    return final_features.features.to_list()

#Temp code - to help saving intermediate file as it can take a while to build the intermediate datasets
# _________________________________________________________________________________________________________________________________________

#code to backup top_features
def backup(data: object, path: str):
    with open(path, 'wb') as f:
        pickle.dump(data, f)

# _________________________________________________________________________________________________________________________________________
#code to read it back later
def restore(path: str) -> object:
    with open(path, 'rb') as f:
        return pickle.load(f)
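The counting approach in ```FindCommon``` relies on each list holding a feature at most once. A set-based variant gives the same result for any number of methods without counting. A minimal sketch on made-up feature names; the function name `find_common_sets` is hypothetical and not part of this notebook:

```python
from functools import reduce

def find_common_sets(feature_lists: dict) -> list:
    """Return the features present in every list, in the order of the first list."""
    if not feature_lists:
        return []
    lists = list(feature_lists.values())
    # Intersect the lists as sets to find the shared features
    shared = reduce(set.intersection, (set(features) for features in lists))
    # dict.fromkeys removes duplicates while keeping the first list's order
    return [f for f in dict.fromkeys(lists[0]) if f in shared]

# Made-up feature names for illustration only
demo = {
    "method A": ["a", "b", "c"],
    "method B": ["b", "d", "a"],
}
print(find_common_sets(demo))  # → ['a', 'b']
```

Returning the features in the order of the first list keeps the result deterministic between runs.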

### Main reduction algorithm

Iteratively call reduction technique until we have the desired number of "important" features

 - 4 in the first draft



In [4]:
desired_feature_number = 4 # particularly low number because of the small number of records

Current_features = 0        # progress - how many common features we have found so far
Cached_featuresets = {}     # results from each iteration of the reduction functions
                            #  - key:   n used in iteration
                            #  - value: list of common features

n = 10  # Current "bucket" size for feature search for each iteration
while Current_features < desired_feature_number:
    #reduce features using different methods
    features_found = ReduceFeatures(X, y, n)
    backup(features_found,f"{data_folder}features-found-{n}.pkl")
    
    #Find common
    common_features = FindCommon(features_found)
    Cached_featuresets[n] = common_features
    Current_features = len(Cached_featuresets[n])

    if Current_features < desired_feature_number:
        n+=10

# now we're out of the loop 
#   - inspect the last cached entry and see if it has over-shot
#   - if we have too many then start from the set just below the threshold from the previous iteration
#     and top it up from the current set until we reach the desired number
#       - with such a small data set it is better to have too few than too many features
#       - [it's all guess work at this stage, right ;-)]
if len(Cached_featuresets[n])==desired_feature_number:
    final_features = Cached_featuresets[n]
else:
    # copy so the cached list is not modified; fall back to an empty list if the very first iteration over-shot
    final_features = list(Cached_featuresets.get(n-10, []))
    for f in Cached_featuresets[n]:
        if f not in final_features:
            final_features.append(f)
        if len(final_features)==desired_feature_number:
            break
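The over-shoot handling above can also be written as a small pure function, which makes the edge cases easy to test in isolation. A sketch under assumptions: the name `pick_final_features` and its parameters are hypothetical, and the feature names are made up.

```python
def pick_final_features(cached: dict, n: int, step: int, desired: int) -> list:
    """Start from the previous iteration's common set, then top up from the current one."""
    current = cached[n]
    if len(current) == desired:
        return list(current)
    # Copy so the cache is untouched; an empty start covers a first-iteration over-shoot
    final = list(cached.get(n - step, []))
    for f in current:
        if len(final) == desired:
            break
        if f not in final:
            final.append(f)
    return final

# Made-up cache: 3 common features at n=10, 5 at n=20, and we want 4
cached = {10: ["a", "b", "c"], 20: ["b", "a", "c", "d", "e"]}
print(pick_final_features(cached, n=20, step=10, desired=4))  # → ['a', 'b', 'c', 'd']
```

Features already agreed on at the smaller n are kept first, and the remaining slots are filled in the order the larger set lists them.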

### Summarise results

In [5]:
print("Reduction algorithm results:")

for k,v in Cached_featuresets.items():
    print(f"\n\tDimension:\n\t\t{k}\n\tCommon Features:")
    print('\t\t' + '\n\t\t'.join(v))

print(f"Final Choice:\n\t[{', '.join(final_features)}]")

Reduction algorithm results:

	Dimension:
		10
	Common Features:
		SpDiam_Dt
		MDEC-23
		SpMin3_Bhi
		ATSC1v

	Dimension:
		20
	Common Features:
		MDEC-23
		SpDiam_Dt
		SpMin3_Bhi
		ATSC1v
		SpMAD_Dt
Final Choice:
	[SpDiam_Dt, MDEC-23, SpMin3_Bhi, ATSC1v]



---

## Data output

Create a csv file of the final dataset, containing only the reduced set of features plus the target, for use by the model notebooks

In [6]:

#Create final data to be saved as csv file - note we're retaining the un-scaled data
X_Final = X[final_features].copy()
X_Final["Toxic"] = y

X_Final.to_csv(final_data, index=False)
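Writing with ```index=False``` matters for the notebooks that read this file back: without it, ```read_csv``` would see the saved index as an extra unnamed column. A minimal round-trip sketch on a synthetic frame written to a temporary folder; the column names and values are made up and stand in for ```X_Final```:

```python
import os
import tempfile

import pandas as pd

# Synthetic stand-in for X_Final (assumed values, not taken from the real data)
demo_final = pd.DataFrame({"feat_a": [0.5, 1.5], "feat_b": [2.0, 3.0], "Toxic": [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "final-data.csv")
    demo_final.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns and values after the round trip, and no stray index column
pd.testing.assert_frame_equal(demo_final, reloaded)
print(reloaded.columns.to_list())
```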