# Weka machine learning toolkit

* [Download Weka](https://www.cs.waikato.ac.nz/~ml/weka/)
* [Data mining with Weka video series](https://www.youtube.com/user/WekaMOOC)

# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

* Using the UCI mushroom dataset from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
* Use principal component analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
* Do you see any overlap between the PCA features and those obtained from feature selection?

In [18]:
import pandas as pd

column_names = ["edibility", "cap-shape", "cap-surface", "cap-color", "bruises", "odor", "gill-attachment", "gill-spacing",
                    "gill-size", "gill-color", "stalk-shape", "stalk-root", "stalk-surface-above-ring",
                    "stalk-surface-below-ring", "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color",
                    "ring-number", "ring-type", "spore-print-color", "population", "habitat"]
df = pd.read_csv("./agaricus-lepiota.data", names=column_names)
df.describe()

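|  | edibility | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | ... | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
| unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | ... | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
| top | e | x | y | n | f | n | f | c | b | b | ... | s | w | w | p | w | o | p | w | v | d |
| freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | ... | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |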


In [4]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
import numpy as np

In [56]:
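# One-hot encode every categorical column (this includes the target, edibility).
df_dummy = pd.get_dummies(df)
skb = SelectKBest(chi2, k=10)

# NOTE: X still contains the edibility_e / edibility_p columns that also make up y,
# so the chi2 scorer ranks the target itself among the selected features.
X = df_dummy.values
y = df_dummy[["edibility_e", "edibility_p"]]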

print(X.shape)
print(y.shape)

skb.fit(X,y)
X_new = skb.transform(X)
print(X_new.shape)

skb_features = [df_dummy.columns[i] for i in skb.get_support(indices=True)]
print("Selected features:", skb_features)

(8124, 119)
(8124, 2)
(8124, 10)
Selected features: ['edibility_e', 'edibility_p', 'odor_f', 'odor_n', 'gill-size_n', 'gill-color_b', 'stalk-surface-above-ring_k', 'stalk-surface-below-ring_k', 'ring-type_l', 'spore-print-color_h']
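
The selection above was fit on a matrix that still contains the one-hot target columns (`edibility_e`, `edibility_p`), which is why they appear among the selected features. A minimal sketch of one way to avoid this is to encode only the predictors and pass the raw label column as `y`. The helper name `select_top_dummies` and the small `toy` frame are hypothetical stand-ins so the example runs without the data file; on the real data one would pass `df` instead.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2


def select_top_dummies(df, target="edibility", k=10):
    """Return the k one-hot predictor columns with the highest chi2 score."""
    # Encode only the predictors; the target stays out of X.
    X = pd.get_dummies(df.drop(columns=[target]), dtype=int)
    y = df[target]
    # Guard against asking for more columns than exist.
    skb = SelectKBest(chi2, k=min(k, X.shape[1])).fit(X, y)
    return list(X.columns[skb.get_support()])


# Hypothetical toy data (not the mushroom dataset): odor determines the label,
# cap-shape is unrelated to it.
toy = pd.DataFrame({
    "edibility": list("epepepep"),
    "odor": list("nfnfnfnf"),
    "cap-shape": list("xxbbxxbb"),
})

selected = select_top_dummies(toy, k=2)
print("Selected features:", selected)
```

With the target removed from `X`, all `k` slots go to actual predictors, so the result of running this on the mushroom data will differ from the list printed above.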


In [57]:
from sklearn import decomposition

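# One-hot encode the full table; note that the edibility columns are part of X here as well.
X = df_dummy = pd.get_dummies(df)
print("Original space:", X.shape)
n = 3

pca = decomposition.PCA(n_components=n)
pca.fit(X)
X_pca = pca.transform(X)
print("PCA space:", X_pca.shape)

# For each component, take the column with the largest signed loading.
# argmax ignores strongly negative loadings, so this is only a rough indicator.
best_features = [pca.components_[i].argmax() for i in range(X_pca.shape[1])]
feature_names = [X.columns[best_features[i]] for i in range(X_pca.shape[1])]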

print(f"{n} best features:", feature_names)

Original space: (8124, 119)
PCA space: (8124, 3)
3 best features: ['edibility_p', 'stalk-root_b', 'habitat_g']
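
The exercise asks which combination of features explains the most variance, while the cell above reports a single column per component and does not show how much variance each component captures. The following is a rough sketch, not the notebook's original method: it drops the target before fitting, reads `explained_variance_ratio_`, and ranks columns by absolute loading so that strongly negative loadings also count. The helper `top_loadings` and the `toy` frame are hypothetical and only there to keep the example runnable without the data file.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def top_loadings(df, target="edibility", n_components=3, top=3):
    """For each component, return (explained variance ratio, top columns by |loading|)."""
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)
    pca = PCA(n_components=n_components).fit(X)
    result = []
    for ratio, component in zip(pca.explained_variance_ratio_, pca.components_):
        # Sort column indices by absolute loading, largest first.
        order = np.argsort(np.abs(component))[::-1][:top]
        result.append((float(ratio), [X.columns[i] for i in order]))
    return result


# Hypothetical toy data (not the mushroom dataset).
toy = pd.DataFrame({
    "edibility": list("epepepep"),
    "odor": list("nfnfnfnf"),
    "cap-shape": list("xxbbxxbb"),
})

components = top_loadings(toy, n_components=2, top=2)
for i, (ratio, names) in enumerate(components):
    print(f"PC{i + 1}: {ratio:.2f} of variance, strongest columns: {names}")
```

Reporting several columns per component is closer to a "combination of features" than a single `argmax`, since each principal component is a weighted mix of all one-hot columns.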


In [58]:
set(skb_features).intersection(set(feature_names))

{'edibility_p'}
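
The only shared entry is the target itself, which says nothing about the predictors. As a sketch of a more informative comparison, the one-hot names can be mapped back to their source attributes (the part before the last underscore) and the target can be left out. The two lists below are copied from the outputs printed earlier; the helper `source_attributes` is a hypothetical name introduced for this example.

```python
def source_attributes(dummy_names, exclude=("edibility",)):
    """Map one-hot column names such as 'odor_f' back to their attribute ('odor')."""
    attributes = {name.rsplit("_", 1)[0] for name in dummy_names}
    return attributes - set(exclude)


# Copied from the SelectKBest and PCA outputs above.
skb_features = ['edibility_e', 'edibility_p', 'odor_f', 'odor_n', 'gill-size_n',
                'gill-color_b', 'stalk-surface-above-ring_k',
                'stalk-surface-below-ring_k', 'ring-type_l', 'spore-print-color_h']
feature_names = ['edibility_p', 'stalk-root_b', 'habitat_g']

overlap = source_attributes(skb_features) & source_attributes(feature_names)
print(overlap)  # prints set()
```

For the results shown here the attribute-level overlap is empty: once the target is excluded, the two methods point at different attributes.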