# 🎯 Task 2: Feature Selection Methods and High Accuracy ML Model
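In this task, we will apply three feature selection techniques (SelectKBest with the ANOVA F-value, Recursive Feature Elimination, and Random Forest feature importances) to the breast cancer dataset and train a machine learning model that achieves accuracy above 90%.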

## 📂 Load the Dataset
First, we'll load the breast cancer dataset from the `dataset` folder.

In [1]:
import pandas as pd

# Load the dataset from the dataset folder (adjust the path to your local copy)
data = pd.read_csv(r'c:\Users\User\Desktop\Brainy_beam_tasks\Task 2\code\dataset\breast-cancer.csv')
data.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 14.91 | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 22.54 | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |


## 🧹 Handle Missing Values
We'll clean the dataset by dropping any rows that contain missing values, so that the scaler and the models downstream receive complete data.

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Handle missing values (if any)
data = data.dropna()
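
Before dropping rows, it can help to see how many values are actually missing, since `dropna()` removes every row that has at least one `NaN`. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are hypothetical and only stand in for the real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "radius_mean": [17.99, np.nan, 19.69],
    "texture_mean": [10.38, 17.77, 21.25],
})

# Count missing values per column before deciding how to handle them
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that contains at least one NaN
cleaned = toy.dropna()
print(f"Rows before: {len(toy)}, rows after: {len(cleaned)}")
```

If the counts are all zero, `dropna()` leaves the data unchanged; if many rows would be lost, imputation may be worth considering instead.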



## 🔄 Encode Categorical Variables
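We'll encode the categorical `diagnosis` column as integers so that it can be used as the target in our machine learning models. `LabelEncoder` assigns codes in sorted order of the labels, so `B` becomes 0 and `M` becomes 1.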

In [3]:
# Encode categorical variables
label_encoder = LabelEncoder()
data['diagnosis'] = label_encoder.fit_transform(data['diagnosis'])
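
To confirm which integer each label received, the fitted encoder's `classes_` attribute can be inspected. A small sketch, assuming a hypothetical list of labels rather than the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of diagnosis labels (M = malignant, B = benign)
labels = ["M", "B", "B", "M"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the position of each class is its integer code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)   # {'B': 0, 'M': 1}
print(encoded)   # [1 0 0 1]
```

Knowing the mapping matters when reading metrics such as precision or recall, which depend on which class is treated as positive.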


## ✂️ Split the Data into Features and Target
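We'll separate the dataset into features (X) and target (y), dropping the `id` column because it is an identifier rather than a measurement, and then hold out 20% of the rows as a test set.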

In [4]:

# Split the data into features and target
X = data.drop(columns=['id', 'diagnosis'])
y = data['diagnosis']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
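
The split above is purely random. An optional variation is a stratified split, which keeps the class proportions the same in the training and test sets. The sketch below assumes scikit-learn's bundled copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the CSV; the `_demo` names are illustrative and not part of the notebook:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled breast cancer dataset (assumption)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# stratify=y_demo keeps the class ratio equal across both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Share of class 1 overall: {y_demo.mean():.3f}")
print(f"Share of class 1 in train: {y_tr.mean():.3f}")
print(f"Share of class 1 in test:  {y_te.mean():.3f}")
```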


## ⚖️ Scale the Data
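We'll standardize the features so that each has a mean of 0 and a standard deviation of 1 on the training set. The scaler is fitted on the training data only and then applied to the test data, so no test-set statistics leak into training. Standardization mainly benefits scale-sensitive models; tree-based models such as Random Forest are largely unaffected by it.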

In [5]:

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
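
What `StandardScaler` computes is z = (x - mean) / std, with the mean and standard deviation taken from the training data. A minimal sketch that checks this by hand on hypothetical values (the arrays below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for two features
train = np.array([[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]])
test = np.array([[11.0, 210.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)
test_scaled = scaler_demo.transform(test)

# Same computation by hand, using training statistics (population std, ddof=0)
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(train_scaled.mean(axis=0))          # approximately 0 for each feature
print(train_scaled.std(axis=0))           # approximately 1 for each feature
print(np.allclose(test_scaled, manual))   # True
```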

## 🔍 Apply SelectKBest with ANOVA F-value
We'll use the SelectKBest method with the ANOVA F-value, which scores each feature individually by how well it separates the two classes, to select the top 10 features.

In [6]:
from sklearn.feature_selection import SelectKBest, f_classif

# Apply SelectKBest with ANOVA F-value
k_best = SelectKBest(score_func=f_classif, k=10)
X_train_kbest = k_best.fit_transform(X_train, y_train)
X_test_kbest = k_best.transform(X_test)
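
Because the scaled arrays no longer carry column names, it is not obvious which features were kept. One way to see them is to pair `scores_` and `get_support()` with the column names. The sketch below assumes scikit-learn's bundled dataset (`load_breast_cancer`) as a stand-in for the CSV, so its column names differ slightly from those shown earlier, and it fits on the full demo data purely for inspection:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Stand-in data with named columns (assumption: bundled scikit-learn dataset)
bunch = load_breast_cancer(as_frame=True)
X_demo, y_demo = bunch.data, bunch.target

selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(X_demo, y_demo)

# Pair each F-score with its feature name
scores = pd.Series(selector.scores_, index=X_demo.columns)

# get_support() returns a boolean mask over the input columns
selected = X_demo.columns[selector.get_support()]

print(scores.sort_values(ascending=False).head(10))
print(list(selected))
```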

## 🔄 Apply Recursive Feature Elimination (RFE)
We'll use Recursive Feature Elimination (RFE) with a Random Forest model to select the top 10 features. RFE repeatedly fits the model and discards the least important feature until 10 remain.

In [7]:
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Apply Recursive Feature Elimination
model = RandomForestClassifier(random_state=42)
rfe = RFE(model, n_features_to_select=10)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)
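
After fitting, RFE exposes `support_` (a mask of kept features) and `ranking_` (1 for kept features, larger numbers for features eliminated earlier). A sketch of how these could be inspected, assuming scikit-learn's bundled dataset as a stand-in for the CSV and a smaller forest (`n_estimators=50`) only to keep the example fast:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Stand-in data with named columns (assumption: bundled scikit-learn dataset)
bunch = load_breast_cancer(as_frame=True)
X_demo, y_demo = bunch.data, bunch.target

# Smaller forest than the notebook's default, chosen only for speed
rfe_demo = RFE(
    RandomForestClassifier(n_estimators=50, random_state=42),
    n_features_to_select=10,
)
rfe_demo.fit(X_demo, y_demo)

# Features with rank 1 are the ones RFE kept
kept = X_demo.columns[rfe_demo.support_]
ranking = dict(zip(X_demo.columns, rfe_demo.ranking_))

print(list(kept))
print(ranking)
```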

## 🌲 Apply Embedded Method: Feature Importance from Random Forest
We'll use the feature importances from a Random Forest model to select the top 10 features.

In [8]:
import numpy as np

# Train a Random Forest model to get feature importances
model.fit(X_train, y_train)
importances = model.feature_importances_

# Select the top 10 features by importance
# (np.argsort sorts ascending, so the last 10 indices are the most important)
indices = np.argsort(importances)[-10:]
X_train_embedded = X_train[:, indices]
X_test_embedded = X_test[:, indices]
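
The manual `argsort` selection has a built-in counterpart in scikit-learn: `SelectFromModel` with `max_features=10` and `threshold=-np.inf`, which disables the importance threshold so that only the feature count applies. The sketch below is an alternative formulation, not the notebook's method, and assumes the bundled dataset as stand-in data:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Stand-in data (assumption: bundled scikit-learn dataset)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# threshold=-np.inf means selection is governed by max_features alone
sfm = SelectFromModel(
    RandomForestClassifier(random_state=42),
    threshold=-np.inf,
    max_features=10,
)
sfm.fit(X_demo, y_demo)
top10_sfm = np.flatnonzero(sfm.get_support())

# Manual selection from the same fitted forest, for comparison
importances_demo = sfm.estimator_.feature_importances_
top10_manual = np.argsort(importances_demo)[-10:]

print(sorted(top10_manual.tolist()))
print(top10_sfm.tolist())
```

Both approaches pick the same columns; `SelectFromModel` additionally provides `transform`, so it can be placed inside a pipeline.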

## 🏋️ Train & Evaluate the Model
We'll train a Random Forest classifier on each of the three selected feature sets and evaluate its accuracy on the held-out test set.

In [9]:
from sklearn.metrics import accuracy_score

# Train and evaluate model using SelectKBest features
model.fit(X_train_kbest, y_train)
y_pred_kbest = model.predict(X_test_kbest)
accuracy_kbest = accuracy_score(y_test, y_pred_kbest)

# Train and evaluate model using RFE features
model.fit(X_train_rfe, y_train)
y_pred_rfe = model.predict(X_test_rfe)
accuracy_rfe = accuracy_score(y_test, y_pred_rfe)

# Train and evaluate model using Embedded features
model.fit(X_train_embedded, y_train)
y_pred_embedded = model.predict(X_test_embedded)
accuracy_embedded = accuracy_score(y_test, y_pred_embedded)

# Print accuracies
print(f"Accuracy with SelectKBest: {accuracy_kbest * 100:.2f}%")
print(f"Accuracy with RFE: {accuracy_rfe * 100:.2f}%")
print(f"Accuracy with Embedded Method: {accuracy_embedded * 100:.2f}%")

Accuracy with SelectKBest: 95.61%
Accuracy with RFE: 96.49%
Accuracy with Embedded Method: 95.61%
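
A single 80/20 split gives one accuracy number per method, and that number can shift with the random seed. One way to get a steadier estimate, along with a baseline that uses all features, is cross-validation over a pipeline. A minimal sketch, assuming scikit-learn's bundled dataset (`load_breast_cancer`) as a stand-in for the CSV; the names `baseline`, `with_kbest`, and `results` are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption: bundled scikit-learn dataset)
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Baseline: all features, no selection step
baseline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=42),
)

# Scaling and selection sit inside the pipeline, so each fold
# fits them on its own training portion only
with_kbest = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=10),
    RandomForestClassifier(random_state=42),
)

results = {}
for name, pipe in [("All features", baseline), ("SelectKBest (k=10)", with_kbest)]:
    fold_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    results[name] = fold_scores.mean()
    print(f"{name}: {fold_scores.mean() * 100:.2f}% (std {fold_scores.std() * 100:.2f})")
```

Comparing the two lines shows whether reducing to 10 features costs or gains accuracy relative to using every feature.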


### 🎉 Conclusion
In this task, we applied three feature selection techniques and trained Random Forest models that all exceeded the 90% accuracy target while using only 10 selected features. Here are the key takeaways and achievements:

### 🌟 Key Takeaways
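- **Feature Selection**: SelectKBest, RFE, and Random Forest feature importances each reduced the data to 10 features.
- **Model Performance**: All three feature sets gave test accuracy between 95.61% and 96.49% on the 20% hold-out split.
- **Data Preprocessing**: Dropping missing values, encoding `diagnosis`, and standardizing the features prepared the data for selection and training.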

### 📊 Achievements
- **High Accuracy** (above the 90% target on the test set):
  - **SelectKBest**: 95.61%
  - **RFE**: 96.49%
  - **Embedded Method**: 95.61%


### 🚀 Final Thoughts
Feature selection can help build accurate, efficient, and more generalizable models. Here, 10 selected features were enough to reach above 95% test accuracy with every method. The techniques and insights from this task carry over to many other machine learning problems. 🎯