## **Random Forest For Feature Extraction from Predictor file**

Random Forest is a supervised ensemble model that combines decision trees with bagging (bootstrap aggregation). In this notebook we use the feature_importances_ attribute of the random forest classifier [(from the scikit-learn package)](https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html) to identify the features that matter most when separating ecd contact readings from unsuccessful readings. To obtain these variable-importance estimates, we used the labels available to us for unsuccessful readings and ecd contact errors.
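
As a rough illustration of what the importance scores look like, the sketch below fits a small forest on synthetic data. The data, the `feat_i` column names, and the `rf_demo` variable are assumptions made up for this example; they are not the notebook's predictor file.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the predictor table (hypothetical column names).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
demo_cols = [f"feat_{i}" for i in range(X_demo.shape[1])]

# Fit a small forest; random_state only makes the sketch repeatable.
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_demo.fit(X_demo, y_demo)

# One impurity-based score per feature; scikit-learn normalises them to sum to 1.
demo_importances = pd.Series(rf_demo.feature_importances_, index=demo_cols)
demo_importances = demo_importances.sort_values(ascending=False)
print(demo_importances)
```

Because the scores are normalised, they are best read as relative rankings within one fitted model rather than as absolute effect sizes.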

### Step 1 :

- Import all libraries

In [None]:
## Importing all required libraries 

import pandas as pd                                    ## Library used for Dataframe manipulation                     
import numpy as np                                     ## Library for array manipulation

from sklearn.preprocessing import LabelEncoder         ## Library to encode all categorical variables
from sklearn.model_selection import train_test_split   ## Library for splitting the data into train and test sets
from sklearn.ensemble import RandomForestClassifier    ## Random Forest Classifier library 
from sklearn.metrics import precision_score            ## Metric used to measure the results obtained from Random Forest Classifier
from sklearn import model_selection                    ## Library for model selection 
from sklearn import metrics                            ## Library for metrics to define the results obtained.

### Step 2 :

- Read the predictor files and perform EDA (Exploratory Data Analysis)

In [None]:
# Read in the aggregate predictor files
un_pred =  pd.read_csv('../Data/PreprocessedData/Predictors/Unsuccessful.csv')
ecd_pred =  pd.read_csv('../Data/PreprocessedData/Predictors/ecdContact.csv')
syn_pred =  pd.read_csv('../Data/RawData/Predictors/SyntheticECD.csv')
con_pred =  pd.read_csv('../Data/RawData/Predictors/ECDAggContaminated.csv')

In [None]:
## Add labels for predictor data

un_pred['Label'] = "Unsuccessful"
ecd_pred['Label'] = "ecdcontacts"
syn_pred['Label'] = "Synecd"
con_pred['Label'] = "Conecd"

In [None]:
## Merge the predictor files and rename the TestID column to match with timeseries file.

preds = pd.concat([un_pred, ecd_pred, syn_pred, con_pred])
preds = preds.rename({'TestID':'TestId'}, axis = 1)

In [None]:
## Reset the index. 
preds = preds.reset_index(drop = True)

In [None]:
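## Store the TestId and Label columns in the variables ids and labels.
## (Unpacking preds[['TestId','Label']] directly would only yield the two column names.)
ids, labels = preds['TestId'], preds['Label']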

In [None]:
## Check if any column has all values as zero for all records.
(preds == 0).all()
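
The check above returns one boolean per column. A minimal sketch of turning that result into a list of column names and dropping them is shown here on a toy frame; `toy`, `zero_cols`, and `toy_clean` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame: columns "a" and "c" contain only zeros.
toy = pd.DataFrame({"a": [0, 0, 0], "b": [1, 0, 2], "c": [0.0, 0.0, 0.0]})

# Boolean Series indexed by column name: True where every value is zero.
all_zero = (toy == 0).all()

# Keep only the names of the all-zero columns, then drop them.
zero_cols = all_zero[all_zero].index.tolist()
toy_clean = toy.drop(columns=zero_cols)

print(zero_cols)
print(toy_clean.columns.tolist())
```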

#### Other predictor file columns

- descriptions dropped for confidentiality

In [None]:
## Drop all of the above mentioned columns.

preds.drop(['list of aggregate predictors'],axis=1, inplace= True)

In [None]:
def frame(data, percent):
    """Helper function that computes the correlation matrix of a dataset and drops every column whose positive correlation with an earlier column exceeds a threshold.
    Args:
        data : dataframe for which correlation needs to be found.
        percent : correlation threshold (for example 0.95); columns correlated above it are dropped.
    Returns : dataframe with correlated columns dropped
    """
    ## Calculate the pairwise correlation matrix of the numeric columns in the dataframe.
    ## Non-numeric columns (such as Label) are skipped, so they are never dropped here.
    cor_matrix = data.select_dtypes(include='number').corr()

    ## Keep only the upper triangle of the correlation matrix, since it is symmetric, and store it in upper_triangle.
    upper_triangle = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))

    ## Collect the columns whose positive correlation with an earlier column is greater than percent in to_drop.
    ## To also catch strong negative correlations, compare upper_triangle[column].abs() against percent instead.
    ## We pass 0.95 below; the threshold can be set to a value of your choosing.
    to_drop = [column for column in upper_triangle.columns if any(upper_triangle[column] > percent)]

    ## Drop the columns stored in to_drop from the dataframe and return the dataframe.
    data.drop(to_drop, axis=1, inplace=True)

    return data

In [None]:
## Call frame to drop the columns whose correlation exceeds the threshold passed as the second parameter.

preds = frame(preds, 0.95)
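
To make the upper-triangle filter easier to follow, here is a small self-contained sketch of the same idea on a toy frame. The function name `drop_correlated` and the toy columns are assumptions for this example; it returns a new frame plus the list of dropped columns rather than modifying its input.

```python
import numpy as np
import pandas as pd

def drop_correlated(data, threshold):
    """Return (reduced frame, dropped column names) for a positive-correlation threshold."""
    corr = data.select_dtypes(include="number").corr()
    # Mask everything except the strict upper triangle so each pair is seen once.
    upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return data.drop(columns=to_drop), to_drop

# "y" is an exact linear function of "x"; "z" is only loosely related to either.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [3, 5, 7, 9, 11, 13],
    "z": [3, 1, 4, 1, 5, 9],
})

reduced, dropped = drop_correlated(toy, 0.95)
print(dropped)
print(reduced.columns.tolist())
```

Of each highly correlated pair, the column that appears later in the frame is the one removed, so column order influences which predictor survives.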

### Step 3 :

- Separate the predictor and labels
- Encode the categorical variables present in the predictor file
    - AggPredX and AggPredY
- Split the data into training and testing datasets.

In [None]:
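## Separate the predictors and labels into variables.

X1 = preds.drop(columns='Label').copy()    ## every column except Label; copy so later encoding does not modify preds
Y1 = preds['Label']                        ## select the label by name rather than by position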

In [None]:
### Convert categorical variables to integers.

le = LabelEncoder()

X1['AggPredX'] = le.fit_transform(X1['AggPredX'])
X1['AggPredY'] = le.fit_transform(X1['AggPredY'])
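
Refitting a single encoder overwrites the mapping learned for the previous column. If the original category names are needed later, one option is to keep one fitted encoder per column, as in this sketch. The toy values and the `encoders` dictionary are assumptions for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the two categorical predictor columns (made-up values).
toy = pd.DataFrame({
    "AggPredX": ["low", "high", "low"],
    "AggPredY": ["b", "a", "c"],
})

# Keep one fitted encoder per column so each mapping can be inverted later.
encoders = {}
for col in ["AggPredX", "AggPredY"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the original categories in the order of their integer codes.
print(encoders["AggPredX"].classes_.tolist())
print(toy)
```

LabelEncoder assigns codes in sorted order of the category values, so the integers carry no meaningful ranking beyond that.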

In [None]:
## Split the data into training and testing datasets.
## Holding out a test set lets us check for overfitting on data the model has not seen.
## Parameters include:
    ## X_train, X_test, Y_train, Y_test = the train and test splits of the predictors and labels
    ## test_size = 0.2, i.e. an 80-20 train-test split
    ## random_state = 42 fixes the shuffle so the split is reproducible
    ## shuffle = True shuffles the rows before splitting, so each split holds a mix of labels rather than a block of identically labelled rows.
    
X_train, X_test, Y_train, Y_test = train_test_split(X1, Y1, test_size=0.2, random_state=42, shuffle = True)
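
Shuffling mixes the labels but does not guarantee that each split keeps the same class proportions. If the classes are imbalanced, passing `stratify` is one way to preserve the ratios. The sketch below uses made-up data with an 80/20 class balance; the notebook itself does not stratify, so treat this as an optional variation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 80 of one class and 20 of the other (made-up balance).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array(["Unsuccessful"] * 80 + ["ecdcontacts"] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, shuffle=True, stratify=y_toy
)

print((y_tr == "ecdcontacts").sum(), (y_te == "ecdcontacts").sum())
```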

### Step 4 :

Initialize the Random Forest Classifier, train it, and use the feature_importances_ attribute to obtain a score for every column so that the most important ones can be chosen.

In [None]:
mod1 = RandomForestClassifier(n_estimators=100)          ## Build the Random Forest classifier model with 100 trees
mod1.fit(X_train,Y_train)                                ## Fit the model on the training data (X_train, Y_train)

In [None]:
## Use the feature_importances_ attribute of the fitted random forest model to obtain the score assigned to each predictor.

feature_scores = pd.Series(mod1.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores  
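
The notebook imports precision_score and holds out a test set, but this section does not show the model being scored on it. A minimal sketch of such a check is given below on synthetic data. The weighted averaging for the multi-class case is an assumption made here, not a choice stated in the notebook, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the predictor table.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

# Weighted precision averages per-class precision by class frequency (assumed choice).
test_precision = precision_score(
    y_te, clf.predict(X_te), average="weighted", zero_division=0
)
print(round(test_precision, 3))
```

Importance scores from a model that generalises poorly are harder to trust, so a held-out score of this kind is a useful sanity check before reading the ranking.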

### Results :

AggPredC and AggPredD have the highest scores, and the scores decrease as we move down the list to the other predictors.

We will use these as the predictors to combine with the PCA components obtained from the encoder to form the final dataframe used for clustering.
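
One possible way to carry out that combination is sketched below. Everything in it is illustrative: the scores are made-up numbers rather than the notebook's results, AggPredA and AggPredB are hypothetical column names, and `pca_demo` with columns PC1 and PC2 is an assumed stand-in for the PCA output, joined on TestId.

```python
import pandas as pd

# Made-up importance scores for illustration only (not the notebook's results).
feature_scores_demo = pd.Series(
    {"AggPredC": 0.40, "AggPredD": 0.30, "AggPredA": 0.20, "AggPredB": 0.10}
)

# Keep the two highest-scoring predictors.
top_features = feature_scores_demo.nlargest(2).index.tolist()

# Hypothetical predictor table and PCA components, both keyed by TestId.
preds_demo = pd.DataFrame({
    "TestId": [1, 2, 3],
    "AggPredA": [0.1, 0.2, 0.3],
    "AggPredB": [5, 6, 7],
    "AggPredC": [1.5, 2.5, 3.5],
    "AggPredD": [9, 8, 7],
})
pca_demo = pd.DataFrame({
    "TestId": [1, 2, 3],
    "PC1": [0.11, -0.42, 0.31],
    "PC2": [1.20, 0.05, -0.77],
})

# Join the selected predictors with the PCA components on the shared TestId.
final_demo = preds_demo[["TestId"] + top_features].merge(
    pca_demo, on="TestId", how="inner"
)
print(final_demo.columns.tolist())
```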