<a href="https://colab.research.google.com/github/AngelGonzs/ColabML/blob/main/RFE_Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **We will first work with the RFE method in two different ways:**

*   Random Forest Classifier
*   RFE Base

**We will commence with the Random Forest method**

Run the code below to load all of our imports.

In [None]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Scikit-learn (sklearn) will do most of the heavy lifting. For RFE, we import `RFE` from `sklearn.feature_selection` and pass any classifier model to `RFE()` along with the number of features to select.

Following the familiar scikit-learn syntax, we then call the `.fit()` method.

Below we import the sklearn modules needed for this first part:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score

RFE (Recursive Feature Elimination) recursively removes features and rebuilds the model with the features that remain. This also lets us measure the model's accuracy on the reduced feature set.

It first trains the model on all the features, then removes the features that appear to contribute least to the model. Once these features are removed, it trains the model again, so we can compare the results.
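
The elimination loop described here can be sketched with scikit-learn's `RFE` class. This is a minimal sketch rather than the notebook's final code: the names `X_demo`, `y_demo` and `rfe`, the target of 10 features, `step=5`, and the small forest of 20 trees are all assumptions chosen only to keep the example short and fast.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Load the features and labels as plain arrays (30 features).
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Assumed settings: keep 10 features, dropping 5 per round
# (30 -> 25 -> 20 -> 15 -> 10), refitting the forest each round.
rfe = RFE(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    n_features_to_select=10,
    step=5,
)
rfe.fit(X_demo, y_demo)

# support_ is a boolean mask of the kept features;
# ranking_ is 1 for every kept feature and larger for features dropped earlier.
print(rfe.support_.sum())
print(rfe.ranking_)
```

A larger `step` removes more features per round, which is faster but gives a coarser ranking.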



We will work with the Breast Cancer dataset to train our model.

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
# Display the keys of our data
data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [None]:
#We can also use the following method to get the description of our data and gain some insight
print(data.DESCR)


.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

Next, we will load the dataset into two variables:

*  **X (feature matrix)** - This holds all of our features and their values. We build it from the following:

> `data = data.data` is the array of feature values

> `columns = data.feature_names` gives the column names, which we refer to as our features. Both are passed to the **Pandas DataFrame** constructor as arguments


*  **Y (target vector)** - This holds our results, whether they are probabilities or a yes/no label. We get it from:

> `data.target` is quite intuitive: the target is simply the outcome we want the model to predict.



In [None]:
X = pd.DataFrame(data = data.data, columns = data.feature_names)
X.head()
# We have an X.shape of (569, 30)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [None]:
Y = data.target
print(Y)
# The print above shows that Y is binary (0 or 1): it encodes the diagnosis, where 0 = malignant and 1 = benign
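
To check what the 0/1 labels stand for, we can look at `target_names` and count how many samples fall in each class. This is a small self-contained sketch; the names `bunch` and `class_counts` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

# target_names[i] is the meaning of label i
print(bunch.target_names)

# Count how many samples carry each label (label 0 first, then label 1)
class_counts = np.bincount(bunch.target)
print(class_counts)
```

The position of each name in `target_names` is the integer used in `Y`, so label 0 is the first name and label 1 is the second.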

We will now use the `train_test_split()` function to create four variables:

- `X_train` 
- `X_test` 
- `Y_train` 
- `Y_test` 

We will see how these differ later. For now, it is enough to say that we will have training data and test data, which we will later use to evaluate the model.


---


As for the arguments to `train_test_split()`, we will be using the following:


- **X and Y** are our arrays; these are the inputs to be split into training and test data

- `test_size` - determines the portion of our data that will be used for testing. We will use a float, e.g. with `test_size = 0.2` our test data will be 20% of the full dataset

- `train_size` - the proportion of the dataset to include in the train split. If not specified (as in our case), it is set to the complement of the test size.

   e.g. if `test_size = 0.2` (20%), then `train_size = 1 - 0.2 = 0.8` (80% of our dataset)

- `random_state` - controls the shuffling applied to the data before the split. Pass an int for reproducible output across multiple function calls.


---


As for the return value, it is a list of length `2 * len(arrays)` containing the train and test split of each input array. With two inputs (X and Y) we get four outputs.
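
The parameters and return value can be seen on a toy input. This is only a sketch: `toy_features` and `toy_labels` are made-up data, not part of the breast cancer dataset.

```python
from sklearn.model_selection import train_test_split

# Two toy inputs of 10 items each (hypothetical data for illustration).
toy_features = list(range(10))
toy_labels = [i % 2 for i in range(10)]

parts = train_test_split(toy_features, toy_labels, test_size=0.2, random_state=0)
f_train, f_test, l_train, l_test = parts

print(len(parts))                  # 2 * number of input arrays
print(len(f_train), len(f_test))   # sizes of the train and test portions

# Same random_state -> same shuffle -> identical split.
again = train_test_split(toy_features, toy_labels, test_size=0.2, random_state=0)
print(parts == again)
```

Each feature stays paired with its own label after the shuffle, which is why X and Y are passed to the same call rather than split separately.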



In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

X_train.shape, X_test.shape

# Considering that we had an initial X.shape of (569, 30)

# Our splits should be in 80 to 20 percent ratio, and it seems to match very well.

((455, 30), (114, 30))
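
The row counts in these shapes can be checked by hand. The sketch below assumes that, for a float `test_size`, the test count is rounded up and the training set receives the remaining rows; that is my understanding of scikit-learn's behaviour, and it matches the shapes printed above.

```python
import math

n_samples = 569
test_size = 0.2

# Assumption: the test count is rounded up, and train gets the remainder.
n_test = math.ceil(test_size * n_samples)   # 0.2 * 569 = 113.8 -> 114
n_train = n_samples - n_test

print(n_train, n_test)
```

So the split is not exactly 80/20, because 569 is not divisible by 5, but it is as close as whole rows allow.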

Now we will work through our data in three sections:

1. Feature selection with `SelectFromModel` from the sklearn library

2. RFE with a random forest classifier

3. Gradient boosting algorithm to identify the feature set


So let us start with the feature selection.

### Feature selection by feature importance of Random Forest Classifier

Parameters and descriptions for `RandomForestClassifier`:



---



> `n_estimators` is the number of trees in the random forest

> `random_state` controls both the randomness of the bootstrapping of the samples used when building trees (if `bootstrap=True`) and the sampling of the features to consider when looking for the best split at each node (if `max_features < n_features`).

> `n_jobs` is the number of jobs to run in parallel. `fit`, `predict`, `decision_path` and `apply` are all parallelized over the trees.



*   `None` means only one job runs at a time (unless a joblib parallel backend context says otherwise)
*   `-1` means all the processors will be used


---


Parameters and descriptions for `.fit()`:

The `fit` method takes the training data as arguments, which can be **one array** in the case of **unsupervised learning**, or **two arrays** in the case of **supervised learning**.

In our case we use two arrays: `X_train` holds the training features and `Y_train` holds the corresponding known labels produced by the split above.

> `X` - array-like of shape (n_samples, n_features)

> `Y` - array-like of shape (n_samples,)

In [None]:
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1))
selector.fit(X_train, Y_train)

# Boolean mask telling us which features the selector keeps
selector.get_support()

array([ True, False,  True,  True, False, False,  True,  True, False,
       False, False, False, False,  True, False, False, False, False,
       False, False,  True, False,  True,  True, False, False, False,
        True, False, False])

In [None]:
len(selector.get_support())
len(X_train.columns)  # the features stored in X_train

# Both lengths are the same, at 30 (only the last expression is displayed)

30

In [None]:
features = X_train.columns[selector.get_support()]
#with this, we will only select the features marked as True, which will be a total of 10
len(features)

10

In [None]:
np.mean(selector.estimator_.feature_importances_)
# The mean importance is used as the threshold for selecting features:
# a feature is selected if its importance is greater than or equal to the mean.
# We will see this in the next code block.

0.03333333333333334

In [None]:
selector.estimator_.feature_importances_
# Compare this with the output of selector.get_support() above and note
# that every selected feature has an importance above our mean of 0.0333

array([0.03699612, 0.01561296, 0.06016409, 0.0371452 , 0.0063401 ,
       0.00965994, 0.0798662 , 0.08669071, 0.00474992, 0.00417092,
       0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
       0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
       0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
       0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304])
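
The link between these importances and the boolean mask from `get_support()` can be reproduced by hand. The sketch below copies the importances printed above and assumes the rule "keep a feature when its importance is at least the mean"; the names `importances`, `threshold` and `selected` are illustrative.

```python
# Importances copied from the output above.
importances = [
    0.03699612, 0.01561296, 0.06016409, 0.0371452, 0.0063401,
    0.00965994, 0.0798662, 0.08669071, 0.00474992, 0.00417092,
    0.02407355, 0.00548033, 0.01254423, 0.03880038, 0.00379521,
    0.00435162, 0.00452503, 0.00556905, 0.00610635, 0.00528878,
    0.09556258, 0.01859305, 0.17205401, 0.05065305, 0.00943096,
    0.01565491, 0.02443166, 0.14202709, 0.00964898, 0.01001304,
]

# The threshold is the mean importance.
threshold = sum(importances) / len(importances)

# Assumption: a feature is kept when its importance is >= the threshold.
selected = [i for i, value in enumerate(importances) if value >= threshold]

print(round(threshold, 4))
print(selected)
print(len(selected))
```

The selected positions line up with the `True` entries in the mask shown earlier. Because the importances sum to 1, the mean is always 1 divided by the number of features, here 1/30.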

### Now we will get the training datasets for our first section

It is also important to point out what the `transform()` method does.

This method takes `X`, an array of shape [n_samples, n_features], as its parameter and reduces it to the selected features. It returns `X` with only the selected features.
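
In other words, `transform()` behaves like indexing the columns with the boolean mask from `get_support()`. The sketch below checks this on the full dataset; the names `X_all`, `y_all` and `sel` are illustrative, and the forest of 20 trees is an assumption made only to keep it fast.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_all, y_all = load_breast_cancer(return_X_y=True)

# Smaller forest than the notebook's, chosen only to keep the sketch fast.
sel = SelectFromModel(RandomForestClassifier(n_estimators=20, random_state=0))
sel.fit(X_all, y_all)

mask = sel.get_support()          # boolean mask over the 30 columns
X_reduced = sel.transform(X_all)  # same rows, only the selected columns

print(X_all.shape, X_reduced.shape)
print(np.array_equal(X_reduced, X_all[:, mask]))
```

The number of rows never changes; only the column count drops to the number of `True` entries in the mask.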

In [None]:
X_train_rfc = selector.transform(X_train)
X_test_rfc = selector.transform(X_test)

# Make sure each reduced set comes from its own split,
# i.e. X_train_rfc from X_train and X_test_rfc from X_test.
# If the number of rows does not match the corresponding Y, we will get an error.

# That is,
# X.shape[0] must == Y.shape[0]

Now we will create our **RandomForest** algorithm

In [None]:
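def run_randomForest(X_train, X_test, Y_train, Y_test):
    # Build a fresh classifier with the same settings used for feature selection
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)

    # Train on the training split
    clf.fit(X_train, Y_train)

    # Predict on the held-out test split and report the accuracy
    Y_pred = clf.predict(X_test)
    print('Accuracy: ', accuracy_score(Y_test, Y_pred))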

In [None]:
%%time
run_randomForest(X_train_rfc, X_test_rfc, Y_train, Y_test)
# Accuracy on the new dataset with reduced features

Accuracy:  0.9473684210526315
CPU times: user 321 ms, sys: 12.6 ms, total: 334 ms
Wall time: 490 ms


In [None]:
%%time
run_randomForest(X_train, X_test, Y_train, Y_test)
# Accuracy on the original dataset

Accuracy:  0.9649122807017544
CPU times: user 378 ms, sys: 10.3 ms, total: 388 ms
Wall time: 502 ms
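
To put the two accuracies in perspective, we can convert them back into counts of correct predictions. This is a back-of-the-envelope sketch that uses only the numbers printed above and the definition accuracy = correct / total; the variable names are illustrative.

```python
n_test = 114  # rows in X_test, from the split above

acc_reduced = 0.9473684210526315   # 10 selected features
acc_full = 0.9649122807017544      # all 30 features

# accuracy = correct predictions / total predictions
correct_reduced = round(acc_reduced * n_test)
correct_full = round(acc_full * n_test)

print(correct_reduced, correct_full)
print(n_test - correct_reduced, n_test - correct_full)  # misclassified samples
```

Dropping 20 of the 30 features therefore costs two additional misclassified test samples in this run, which is a small difference on a test set of this size.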


### RFE