# Exercise 6 IDATT2502 - Endré Hadzalic

## Tasks
1. Using the [UCI mushroom dataset](https://archive.ics.uci.edu/ml/datasets/mushroom) from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
2. Use principal component analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
3. Do you see any overlap between the PCA features and those obtained from feature selection?

### This analysis compares Sequential Feature Selection (SFS) with Principal Component Analysis (PCA)

# 1:

In [79]:
import pandas as pd
import numpy as np

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Number of features (SFS) or components (PCA) to keep.
# Change this variable to change the resulting dimensionality of the dataset.
NO_FEATURES_TO_SELECT = 10

#load the dataset
df = pd.read_csv('agaricus-lepiota.data')

# "Dummify" the categorical data
dummied_df = pd.get_dummies(df)#.iloc[0:250] # <-changes amount of data points (rows in the dataset) for quicker testing

knn = KNeighborsClassifier(n_neighbors = 3)
sfs = SequentialFeatureSelector(knn,n_features_to_select = NO_FEATURES_TO_SELECT)

# The target y is the edibility of the fungi, so the selected features
# are the ones in the dataset that best describe edibility.
# The data X is the rest of the dataset, without the two edibility columns.
y = dummied_df['edibility_e']
X = dummied_df[dummied_df.columns.difference(['edibility_e','edibility_p'])]

print("Original X.shape:",X.shape)

# Sequentially select the NO_FEATURES_TO_SELECT (here 10) most descriptive
# features in the set, using k-nearest neighbors as the classifier.
new_X = sfs.fit_transform(X=X,y=y) 
print("New X.shape:     ",new_X.shape)

fs_result_names = np.array(X.columns)[sorted(sfs.get_support(indices=True))]
resultstring = ""
for one in fs_result_names:
    resultstring += one + ", "

print("\nTop %s features selected with SFS: \n"%(NO_FEATURES_TO_SELECT)
, resultstring)


Original X.shape: (8124, 117)
New X.shape:      (8124, 10)

Top 10 features selected with SFS: 
 bruises?_f, bruises?_t, gill-color_b, habitat_l, habitat_u, odor_c, odor_f, odor_m, spore-print-color_r, stalk-color-below-ring_y, 
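
To make the selection step less of a black box, here is a minimal sketch of greedy forward selection, which is roughly what `SequentialFeatureSelector` does in its default forward direction. The helper `forward_select` and the toy data are hypothetical and only for illustration; the real selector has more options (for example backward selection and custom scoring).

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def forward_select(X, y, estimator, n_features, cv=5):
    """Hypothetical helper: greedily add the column with the best mean CV score."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < n_features:
        # Score every candidate column together with the columns chosen so far.
        scores = {
            j: cross_val_score(estimator, X[:, selected + [j]], y, cv=cv).mean()
            for j in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


# Toy data (an assumption, not the mushroom set):
# column 0 equals the label, columns 1-3 are random noise.
rng = np.random.default_rng(0)
toy_y = rng.integers(0, 2, size=200)
toy_X = np.column_stack([toy_y, rng.integers(0, 2, size=(200, 3))])

chosen = forward_select(toy_X, toy_y, KNeighborsClassifier(n_neighbors=3), n_features=1)
print(chosen)
```

Each round refits the classifier once per remaining column and per fold, which is why the selection on all 117 columns is slow and why the commented-out `.iloc[0:250]` is handy for quick tests.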


# 2:

In [80]:
from sklearn import decomposition

pca = decomposition.PCA(n_components=NO_FEATURES_TO_SELECT)
pca_X = pca.fit_transform(X=X,y=y)

print("Original X.shape:",X.shape)
print("PCA X.shape:     ",pca_X.shape)

# For each component, take the index of the feature with the largest
# (signed) loading, then put the indices in a sorted array.
feature_indexes = sorted(np.argmax(pca.components_,axis=1))
pca_result_names = [] #used in task 3
resultstring = ""
for i in feature_indexes :
    pca_result_names.append(X.columns[i])
    resultstring += X.columns[i] + ", "

print("\nTop %s describing features selected with PCA: \n"%(NO_FEATURES_TO_SELECT)
, resultstring)

Original X.shape: (8124, 117)
PCA X.shape:      (8124, 10)

Top 10 describing features selected with PCA: 
 bruises?_f, cap-color_n, cap-shape_f, cap-surface_f, cap-surface_s, habitat_g, odor_n, spore-print-color_h, spore-print-color_k, stalk-shape_t, 
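
Task 2 asks which combination of features explains the most variance. A possible way to look at this is sketched below on toy data (an assumption, not the mushroom set): `explained_variance_ratio_` gives the share of variance per component, and the loadings in `components_` are ranked by absolute value, since the sign of a principal component is arbitrary and a large negative loading contributes as much as a large positive one. The names `toy_X` and `toy_pca` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy data standing in for the one-hot encoded matrix X.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy_X = pd.DataFrame({
    "a": base,                                     # high variance
    "b": -base + rng.normal(scale=0.1, size=200),  # strongly anti-correlated with "a"
    "c": rng.normal(scale=0.2, size=200),          # independent, low variance
})

toy_pca = PCA(n_components=2).fit(toy_X)

# Share of the total variance captured by each component.
print(toy_pca.explained_variance_ratio_)

# Rank features per component by the magnitude of their loading.
abs_loadings = np.abs(toy_pca.components_)
top_per_component = [toy_X.columns[i] for i in abs_loadings.argmax(axis=1)]
print(top_per_component)
```

Applied to the notebook's `pca` and `X`, the same two attributes would show how much variance the 10 components capture and which dummy columns weigh most in each of them.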


# 3:

In [82]:
# Takes the intersection between the results of the two strategies SFS and PCA
features_in_common = [value for value in pca_result_names if value in fs_result_names]
resultstring = ""
for intersection in features_in_common :
    resultstring += intersection + ", "
print("There was %s feature(s) in common \nof a maximum of %s possible between SFS and PCA: \n"%(len(features_in_common),NO_FEATURES_TO_SELECT), resultstring)


There was 1 feature(s) in common 
of a maximum of 10 possible between SFS and PCA: 
 bruises?_f, 


## Result

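The two methods do not select the same columns: only bruises?_f appears in both lists. There are similarities, however, because several of the selected one-hot columns belong to the same original feature. For example, PCA picks odor_n while SFS picks odor_c, odor_f and odor_m, and habitat, spore-print-color and bruises? are likewise represented in both lists.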
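
The overlap at the level of original features can be checked with a small sketch. It assumes that every dummy column is named `<feature>_<category>`, as `pd.get_dummies` produces by default, and the column names are copied from the two outputs above. The helper `original_feature` is hypothetical.

```python
# Column names copied from the SFS and PCA outputs above.
sfs_columns = [
    "bruises?_f", "bruises?_t", "gill-color_b", "habitat_l", "habitat_u",
    "odor_c", "odor_f", "odor_m", "spore-print-color_r", "stalk-color-below-ring_y",
]
pca_columns = [
    "bruises?_f", "cap-color_n", "cap-shape_f", "cap-surface_f", "cap-surface_s",
    "habitat_g", "odor_n", "spore-print-color_h", "spore-print-color_k", "stalk-shape_t",
]


def original_feature(dummy_name):
    """Hypothetical helper: strip the category suffix after the last underscore."""
    return dummy_name.rsplit("_", 1)[0]


sfs_features = {original_feature(name) for name in sfs_columns}
pca_features = {original_feature(name) for name in pca_columns}

shared = sorted(sfs_features & pca_features)
print(shared)  # ['bruises?', 'habitat', 'odor', 'spore-print-color']
```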