# Exercise 6

## Part 1:
Using the UCI mushroom dataset from the last exercise, perform a feature selection using a classifier evaluator. Which features are most discriminative?

In [168]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import numpy as np
import pandas as pd

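# Let's do some feature selection on the UCI mushroom set
df = pd.read_csv('agaricus-lepiota.csv')
dummies = pd.get_dummies(df)  # one-hot encode every categorical column
# Note: X is the full one-hot frame, so it still contains the target columns
# 'edibility_e' and 'edibility_p'; expect them among the selected features.
X, y = dummies, pd.get_dummies(df['edibility'])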

print("X shape:", X.shape)
print("y shape:", y.shape)

skb = SelectKBest(chi2, k=5)
skb.fit(X, y)
X_new = skb.transform(X)

print("skb shape:", X_new.shape)

# Fetch the selected feature indices and print the corresponding feature names
selected = [dummies.columns[i] for i in skb.get_support(indices=True)]
print("Selected features:", ", ".join(selected))

X shape: (8124, 119)
y shape: (8124, 2)
skb shape: (8124, 5)
Selected features: edibility_e, edibility_p, odor_f, odor_n, stalk-surface-above-ring_k


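Edibility is the target class (is the mushroom edible or poisonous?), so we want to find the features that are most helpful for telling the two apart. Two of the five selected columns, 'edibility_e' and 'edibility_p', are the target itself: they are selected only because `X` was built from the full one-hot frame, and they should be ignored. Apart from these, the selected features are 'odor_f', 'odor_n' and 'stalk-surface-above-ring_k', so odor is the most helpful attribute here for telling whether a mushroom is poisonous.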
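
Because `X` above is built from the full one-hot frame, the `edibility_*` columns compete with the real features. The following is a minimal sketch of the same selection with the target removed first. It uses a tiny made-up frame in place of `agaricus-lepiota.csv`; the toy values and the `select_top_features` helper name are illustrative assumptions, not part of the exercise.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2


def select_top_features(df, target, k):
    """Return the k one-hot feature names with the highest chi2 score."""
    # One-hot encode only the predictors, so the target cannot select itself.
    X = pd.get_dummies(df.drop(columns=[target]))
    y = df[target]  # chi2 accepts a 1-D vector of class labels
    skb = SelectKBest(chi2, k=k).fit(X, y)
    return list(X.columns[skb.get_support(indices=True)])


# Toy stand-in for the mushroom data (values are made up for illustration).
toy = pd.DataFrame({
    "edibility": ["e", "e", "e", "p", "p", "p"],
    "odor": ["n", "n", "n", "f", "f", "f"],
    "habitat": ["g", "w", "g", "w", "g", "w"],
})
print(select_top_features(toy, "edibility", k=2))
```

In the toy frame odor separates the classes perfectly and habitat barely at all, so the two odor columns are the ones returned. On the real data the same call would return `k` columns that are all genuine predictors.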

## Part 2:

Use principal components analysis to construct a reduced space. Which combination of features explain the most variance in the dataset?

In [164]:
from sklearn import decomposition

print("Original space:", X.shape)
pca = decomposition.PCA(n_components=5)  # Keep 5 components; the printed shape below confirms it
X_pca = pca.fit_transform(X)

print("PCA space:", X_pca.shape)
# For each component, find the index of the feature with the largest (positive) loading
best_features = [pca.components_[i].argmax() for i in range(X_pca.shape[1])]
# Look up the names of those features
feature_names = [X.columns[j] for j in best_features]
print("Features in which gives max variance:", ", ".join(feature_names))

Original space: (8124, 119)
PCA space: (8124, 5)
Features in which gives max variance: edibility_p, stalk-root_b, habitat_g, stalk-shape_t, odor_n
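
The `argmax` in the cell above picks only the largest positive loading of each component, while a large negative loading contributes just as much, and a single feature per component is not yet a combination of features. The sketch below shows one alternative reading: rank features by absolute loading and report each component's `explained_variance_ratio_`. The random toy matrix and the `top_loadings` helper are assumptions for illustration, not the exercise data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def top_loadings(X, n_components, top=3):
    """For each principal component, list the features with the largest |loading|."""
    pca = PCA(n_components=n_components).fit(X)
    rows = []
    for idx, component in enumerate(pca.components_):
        # Sort by absolute value: the sign of a loading only gives direction.
        order = np.argsort(np.abs(component))[::-1][:top]
        rows.append({
            "component": idx,
            "explained_variance_ratio": pca.explained_variance_ratio_[idx],
            "features": [X.columns[j] for j in order],
        })
    return pd.DataFrame(rows)


# Toy data: column "a" has by far the largest variance, so it should dominate PC 0.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.normal(scale=10.0, size=200),
    "b": rng.normal(scale=1.0, size=200),
    "c": rng.normal(scale=0.1, size=200),
})
print(top_loadings(toy, n_components=2, top=2))
```

Calling `top_loadings(X, n_components=5)` on the one-hot mushroom frame would give, per component, the share of variance it explains and the features that weigh most in it.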


## Part 3:
Do you see any overlap between the PCA features and those obtained from feature selection?

Looking at the results above, 'odor_n' and 'edibility_p' appear in both lists, so there is some overlap (note that 'edibility_p' is a target column that is still present in `X`). To see how the overlap develops, we can rerun SelectKBest and PCA with larger values of k = n_components.

In [162]:

for k in range(5, 36, 10):
    # SelectKBest: names of the k highest-scoring one-hot columns
    skb = SelectKBest(chi2, k=k)
    skb.fit(X, y)
    selected = [dummies.columns[j] for j in skb.get_support(indices=True)]

    # PCA: for each of the k components, the feature with the largest loading
    pca = decomposition.PCA(n_components=k)
    pca.fit(X)
    best_features = [component.argmax() for component in pca.components_]
    feature_names = [X.columns[j] for j in best_features]

    overlap = set(selected).intersection(feature_names)
    print(f"For k={k} we get {len(overlap)} overlapping features:\n", overlap, "\n")


For k=5 we get 2 overlapping features:
 {'odor_n', 'edibility_p'} 

For k=15 we get 3 overlapping features:
 {'spore-print-color_k', 'odor_n', 'edibility_p'} 

For k=25 we get 5 overlapping features:
 {'stalk-surface-above-ring_k', 'edibility_p', 'odor_n', 'spore-print-color_k', 'stalk-surface-below-ring_s'} 

For k=35 we get 10 overlapping features:
 {'stalk-surface-above-ring_k', 'edibility_p', 'gill-color_n', 'odor_s', 'odor_y', 'gill-color_w', 'odor_n', 'stalk-surface-below-ring_k', 'spore-print-color_k', 'stalk-surface-below-ring_s'} 
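
The overlap count grows with k partly because both lists simply get longer. One way to put the counts in context is to compare them with the overlap expected if both methods picked features at random from the 119 columns, which is the mean of a hypergeometric distribution, k·m/119. The sketch below is a rough check only: it assumes PCA returned k distinct feature names (the output does not confirm this, and repeated names would lower the baseline), and `expected_chance_overlap` is a hypothetical helper name.

```python
def expected_chance_overlap(n_features, k_selected, m_distinct):
    """Mean overlap when k_selected and m_distinct features are drawn independently at random."""
    # Mean of a hypergeometric distribution: k * m / n.
    return k_selected * m_distinct / n_features


# Observed overlaps copied from the output above; 119 is the width of X.
observed = {5: 2, 15: 3, 25: 5, 35: 10}
for k, seen in observed.items():
    # Assumption: PCA produced k distinct feature names (an upper bound on m).
    baseline = expected_chance_overlap(119, k, k)
    print(f"k={k}: observed {seen}, chance baseline ~{baseline:.1f}")
```

Where the observed count is close to the baseline, the overlap at that k may say little about agreement between the two methods.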

