# Tutorial: Features and Preprocessing

In this notebook, we will implement the Naive Feature Selector algorithm from the lecture and apply it to a real dataset. Let's start off by importing everything we need for this tutorial. Please do not use any other imports for this notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline

### Task 1

Let's begin by loading the dataset and exploring it. Run the cell below and then inspect the data, which is provided as a [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries). Use the `print()` and `dict.keys()` functions to see its content and then answer the following questions:

1. How many samples does the dataset contain?
2. How many features does each instance consist of?
3. Is the data labeled (and if so, how many classes are there)?
4. What are the names of the features?

In [None]:
# Let's load the data
cancer_data = load_breast_cancer()

In [None]:
# inspect the data 

print(cancer_data.keys())

X = cancer_data.data
y = cancer_data.target

for i, name in enumerate(cancer_data.feature_names):
    print(f"Feature {i} is called {name}.")
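
The first three questions can be answered from the shapes of the data and label arrays. The cell below is a minimal sketch of that idea; `describe_dataset` is a hypothetical helper (not part of the notebook or of scikit-learn), and the toy arrays only stand in for `cancer_data.data` and `cancer_data.target`.

```python
import numpy as np


def describe_dataset(X: np.ndarray, y: np.ndarray) -> dict:
    """Summarise a labelled dataset (hypothetical helper for illustration)."""
    # rows are samples, columns are features
    n_samples, n_features = X.shape
    # the distinct label values are the classes
    classes = np.unique(y)
    return {
        "n_samples": n_samples,
        "n_features": n_features,
        "n_classes": len(classes),
    }


# toy stand-in for cancer_data.data / cancer_data.target
X_toy = np.arange(12, dtype=float).reshape(4, 3)
y_toy = np.array([0, 1, 1, 0])

summary = describe_dataset(X_toy, y_toy)
print(summary)
```

In the notebook itself, you would call the helper as `describe_dataset(cancer_data.data, cancer_data.target)`.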


### Task 2

Now, let's implement the naive feature selector. Complete the method stub below and use it to select the three features with the strongest feature-class correlation. What are their names?


In [None]:
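def naive_feature_selector(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k features with the strongest |correlation| to y."""

    # calculate correlations; y becomes the last row/column of R
    R = np.corrcoef(X, y, rowvar=False)

    # select the k features with the strongest feature-class correlations
    # (argsort is ascending, so the strongest feature comes last)
    f = np.argsort(np.abs(R[:-1, -1]))[-k:]

    return f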

In [None]:
indices = naive_feature_selector(X, y, 3)
print(f"The three features most strongly correlating " 
      f"with the classes are called {', '.join(cancer_data.feature_names[indices])}.")
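
To see why the selector reads the last column of the correlation matrix and takes absolute values, it can help to try `np.corrcoef` on a small synthetic dataset. The sketch below uses made-up toy data (the names `X_toy`, `y_toy`, the noise levels, and the seed are all assumptions for illustration): one feature follows the label, one is pure noise, and one follows the label with inverted sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: 200 samples, binary labels, three features
y_toy = rng.integers(0, 2, size=200)
X_toy = np.column_stack([
    y_toy + 0.3 * rng.standard_normal(200),   # feature 0: follows the label
    rng.standard_normal(200),                 # feature 1: pure noise
    -y_toy + 0.3 * rng.standard_normal(200),  # feature 2: follows the label, inverted
])

# with rowvar=False each column is a variable; y is appended as the last one
R = np.corrcoef(X_toy, y_toy, rowvar=False)
print(R.shape)  # (4, 4): three features plus the label

# last column without its final entry: correlation of each feature with y
r_cf = R[:-1, -1]
print(np.round(r_cf, 2))

# absolute values keep the strongly negative feature 2 in the running
top2 = np.argsort(np.abs(r_cf))[-2:]
print(sorted(top2.tolist()))
```

Without `np.abs`, the inverted feature would be ranked last even though it is just as informative about the class as feature 0.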

### Task 3

Now we want to try out different values for k. First, split the data into a training set and a test set; the test set should consist of 50 instances. Then write a for loop that, for each k, selects the k best features on the training set, trains the provided black-box classifier on them, and measures its accuracy on the test samples. Don't worry, we'll learn much more about the classification method used here over the course of the semester ;) Store the classification scores for values of k ranging from 1 to 25.

In [None]:
# split the dataset into training and test sets
N = 50

X_train = X[:-N, :]
X_test = X[-N:, :]
y_train = y[:-N]
y_test = y[-N:]

# we'll consider this a blackbox classifier today
def train_and_predict(feature_indices: np.ndarray) -> float:

    # train the classifier on the selected feature columns only
    classifier = KNeighborsClassifier(n_neighbors=1)
    classifier.fit(X_train[:, feature_indices], y_train)

    # predict using trained classifier
    pred = classifier.predict(X_test[:, feature_indices])

    # return the score (fraction of correctly classified test samples)
    return float(np.mean(pred == y_test))

# initialise an empty array to store the prediction scores
scores = np.empty(25)

# loop over values for k from 1 to 25
for k in range(25):
    
    # select k features
    features = naive_feature_selector(X_train, y_train, k+1)
    
    # train and predict
    scores[k] = train_and_predict(features)

    

### Task 4

Finally, visualise your results using matplotlib. Make sure to set a title for your figure and label your axes. How do you explain the shape of your plot?

In [None]:
# create a new figure
plt.figure()

# set axis limits
plt.xlim(1,25)
plt.ylim(0,1)

# define a grid
plt.grid(linestyle='--', linewidth=0.5)

# label the axes
plt.xlabel("# Features")
plt.ylabel("Classification Score")

# set a figure title
plt.title("Test Performance")

# plot the performance
plt.plot(range(1,26), scores)



### Bonus Task
Also implement the calculation of merit and the GreedyCFS algorithm from the lecture. Then rerun the experiments above and compare them visually in your plot.
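
As a starting point, here is a minimal sketch of one possible approach. It assumes the lecture uses the common CFS merit, $\text{Merit}(S) = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}$, where $\bar{r}_{cf}$ is the mean absolute feature-class correlation and $\bar{r}_{ff}$ the mean absolute feature-feature correlation within the subset $S$ of size $k$. It also assumes a plain forward search that stops once k features are selected; the lecture's version may differ, for example by stopping when the merit no longer improves. The names `merit` and `greedy_cfs` and the toy data are illustrative choices.

```python
import numpy as np


def merit(R: np.ndarray, subset: list[int]) -> float:
    """CFS merit of a feature subset.

    R is assumed to come from np.corrcoef(X, y, rowvar=False), so its last
    column holds the feature-class correlations.
    """
    k = len(subset)
    # mean absolute feature-class correlation
    r_cf = np.mean(np.abs(R[subset, -1]))
    if k == 1:
        return float(r_cf)
    # mean absolute correlation over all distinct feature pairs in the subset
    sub = np.abs(R[np.ix_(subset, subset)])
    r_ff = (sub.sum() - np.trace(sub)) / (k * (k - 1))
    return float(k * r_cf / np.sqrt(k + k * (k - 1) * r_ff))


def greedy_cfs(X: np.ndarray, y: np.ndarray, k: int) -> list[int]:
    """Forward selection: repeatedly add the feature that maximises the merit."""
    R = np.corrcoef(X, y, rowvar=False)
    selected: list[int] = []
    remaining = list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: merit(R, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# toy data (made up): feature 1 is a near-duplicate of feature 0,
# feature 2 is slightly weaker but carries independent noise
rng = np.random.default_rng(0)
n = 1000
y_toy = rng.integers(0, 2, size=n)
f0 = y_toy + 0.3 * rng.standard_normal(n)
f1 = f0 + 0.05 * rng.standard_normal(n)
f2 = y_toy + 0.4 * rng.standard_normal(n)
X_toy = np.column_stack([f0, f1, f2])

R_toy = np.corrcoef(X_toy, y_toy, rowvar=False)
naive = np.argsort(np.abs(R_toy[:-1, -1]))[-2:]
cfs = greedy_cfs(X_toy, y_toy, 2)
print("naive:", sorted(naive.tolist()), "greedy CFS:", sorted(cfs))
```

On this toy data the naive selector keeps both near-duplicate features, whereas the merit penalises the redundancy and prefers the weaker but less correlated feature. To rerun the experiment from Task 3, you could call `greedy_cfs(X_train, y_train, k+1)` in place of `naive_feature_selector` and plot both score curves in the same figure.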