# **(Part 4 Feature Selection And Final Model)**

## Objectives

* Write here your notebook objective, for example, "Fetch data from Kaggle and save as raw data", or "engineer features for modelling"

## Inputs

* Write here which data or information you need to run the notebook 

## Outputs

* Write here which files, code or artefacts you generate by the end of the notebook 

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 


---

# Change working directory

* We are assuming you will store the notebooks in a subfolder, therefore when running the notebook in the editor, you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Breast-Cancer-Prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Breast-Cancer-Prediction'

---

# Getting Started: Load libraries and set options

In [18]:
#load libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# For Train , Test Spliting 
from sklearn.model_selection import train_test_split

# For Building Classifier Models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# For Feature Selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [16]:
df = pd.read_csv("outputs/datasets/cleaned/data.csv")

## Feature Selection

In [23]:
X = df.drop(['diagnosis'],axis =1)
Y = df["diagnosis"]

In [21]:
X.shape

(569, 30)

As the dimension of our data at the moment is 30, it is likely that the model we create is going to fall into the limitations of the curse of dimensionality. This can be solved with the use of a technique called Feature selection. This refers to the process of selecting a subset of attributes for efficient model construction.

#### Selecting The K Best Features for Our Dataset

In [25]:
selected_features = [] 

best_features = SelectKBest(chi2 , k = 5)


fit = best_features.fit(X , Y).get_support()

for bool, feature in zip(fit, df.columns):
     if bool:
        selected_features.append(feature)
print("The best features are:{}".format(selected_features)) # The list of your 5 best features

The best features are:['texture_mean', 'perimeter_mean', 'perimeter_se', 'texture_worst', 'perimeter_worst']


#### Viewing The Scores of the Top-k Selected Features

In [27]:
feature_scores = {}

for key, value in zip(df.columns,best_features.fit(X , Y).scores_):
    feature_scores[key] = value


for key, value in feature_scores.items():
    if key in selected_features:
        print(f"{key} : {value}")

texture_mean : 2011.1028637679065
perimeter_mean : 53991.65592375091
perimeter_se : 8758.504705334482
texture_worst : 3665.0354163405973
perimeter_worst : 112598.43156405361


#### Dropping Unnecessary columns/features from the dataset.

In [29]:
features = df[["texture_mean" , "perimeter_mean" , "perimeter_se" , "texture_worst" , "perimeter_worst"]]
target = df["diagnosis"]

In [30]:
features.head()

Unnamed: 0,texture_mean,perimeter_mean,perimeter_se,texture_worst,perimeter_worst
0,10.38,122.8,8.589,17.33,184.6
1,17.77,132.9,3.398,23.41,158.8
2,21.25,130.0,4.585,25.53,152.5
3,20.38,77.58,3.445,26.5,98.87
4,14.34,135.1,5.438,16.67,152.2


In [31]:
target.head()

0    1
1    1
2    1
3    1
4    1
Name: diagnosis, dtype: int64

In [32]:
assert features.shape[0] == target.shape[0]

### Final Model 

#### Implementing KNN in our selected features using the bestfit hyper-parameters 

In [34]:
X_train , X_test , Y_train , Y_test  = train_test_split(features , target , test_size = 0.33 , random_state = 42)

KNN_model = KNeighborsClassifier(n_neighbors = 4).fit(X_train , Y_train)

In [35]:
Y_pred = KNN_model.predict(X_test)
Y_pred

array([0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0])

In [36]:
metrics.accuracy_score(Y_test, Y_pred)

0.9574468085106383

The accuracy of KNN model after hyperparameter tuning has increased.

#### Implementing RandomForest Classifier in our selected features using the bestfit hyper-parameters 

In [38]:
X_train , X_test , Y_train , Y_test  = train_test_split(features , target , test_size = 0.33 , random_state = 42)

model_rf = RandomForestClassifier()

RF_Model = RandomizedSearchCV(estimator = model_rf ,param_distributions = random_search, cv = 4, verbose= 5, random_state= 101, n_jobs = -1)

RF_Model.fit(X_train, Y_train)


TypeError: __init__() missing 1 required positional argument: 'param_distributions'

In [39]:
Y_pred = RF_Model.predict(X_test)
metrics.accuracy_score(Y_test , Y_pred)

NameError: name 'RF_Model' is not defined

The accuracy of Random Forest Classifier has also increased after hyperparameter tuning over the selected features.

## Conclusion

To sum up, initially a datset that consisted of the features or details related to the infected cells present in the breast were selected and understood better with the use of available python pandas library.  After that , the matplotlib and seaborn libraries were used to visually represent or map the relationship between various features present in our dataset. 

Now, after that the first primary models on classification algorithms such as Logistic Regression, KNN Classifier and Random Forest Classifier were built using the sklearn library and evaluated using the available metrics. 

After understanding the need of the cross validation for building these models, all of these primary models were rebuilt using the K-fold and Stratified Cross Validation techniques and likewise the accuracies were checked. Both the cross validation techniques , seemed to impact the accuracy of the model in an uniform manner with **Random Forest Classifier with the highest accuracy** here.

Next, was the turn to pick the best fit combination of hyperparameters for our Random Forest and KNN classifiers. Therefore, RandomizedSearch CV and GridSearch CV was used respectively to find the best combination of parameters for our models for better result. 

Derived Best Fit Hyper Parameters for Random Forest  : 

{'criterion': 'entropy', 'gini'        , <br>
 'max_depth': 5, 137, 270, 403, 536,  668, 801, 934, 1067, 1200, None , <br>
 'max_features': 'auto', 'sqrt', 'log2',   None, <br>
  'min_samples_leaf': 4, 6, 8, 12,<br>
 'min_samples_split': 3, 7, 10, 14,<br>
 'n_estimators': 5, 602, 1200 }
                
Derived Best Fit Hyper Parameters for KNN Classifier  : 

{'n_neighbors' : 4}

Finally after finding the best tuning hyper parameters for the model , the curse of dimensionality was taken into consideration and the SelectKbest feature selection technique was used to select top 5 best features from the dataset based on their scores. 

The best features were 'texture_mean', 'perimeter_mean', 'perimeter_se', 'texture_worst' and  'perimeter_worst'.

**Finally,** after the implementation of all these refinement techniques the primarily built models were rebult by plugging in the best fit hyperparameters and selected features. **Performing this demonstrated an increase in the accuracy of the models but both our classifiers that are Random Forest And KNN appear to hold the same amount of accuracy at the end.**

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
except Exception as e:
  print(e)
