# Weka machine learning toolkit

* [Download Weka](https://www.cs.waikato.ac.nz/~ml/weka/)
* [Data mining with Weka video series](https://www.youtube.com/user/WekaMOOC)

# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

* Using the UCI mushroom dataset from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
* Use principal components analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
* Do you see any overlap between the PCA features and those obtained from feature selection?

In [18]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import decomposition

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# read_csv already returns a DataFrame; the file is expected to have a header row
mushroom_data = pd.read_csv('agaricus-lepiota.data')

# Target
y = pd.get_dummies(mushroom_data['edibility'])

# Data
x = mushroom_data.drop(['edibility'], axis=1)
x = pd.get_dummies(x)

print(f"x shape {x.shape}")
print(f"y shape {y.shape}")

x shape (8124, 117)
y shape (8124, 2)
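
The cell above relies on the CSV having a header row that includes an `edibility` column. If the file is the raw UCI download, which to my knowledge ships without a header, the names can be passed to `read_csv` explicitly. The sketch below is an assumption-laden illustration: the `columns` list follows the attribute names seen in the outputs of this notebook plus the UCI attribute description, and the two rows in `raw` are illustrative stand-ins for the real file.

```python
import io
import pandas as pd

# Assumed column names: 'edibility' as used in this notebook, followed by
# the 22 attribute names from the UCI description of the dataset.
columns = [
    "edibility", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

# Two illustrative header-less rows standing in for the real file.
raw = io.StringIO(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g\n"
)

# header=None stops pandas from treating the first data row as column names.
sample = pd.read_csv(raw, header=None, names=columns)
print(sample.shape)
```

With the real file, `raw` would be replaced by the path `'agaricus-lepiota.data'`.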


In [19]:
skb = SelectKBest(chi2, k=5)
skb.fit(x, y)
x_new = skb.transform(x)

print(f"skb shape {x_new.shape}")

feature_mask = skb.get_support()
features = x.columns[feature_mask]

print("List the top discriminative features")
for f in features:
    print(f"- {f}")

skb shape (8124, 5)
List the top discriminative features
- odor_f
- odor_n
- gill-color_b
- stalk-surface-above-ring_k
- stalk-surface-below-ring_k
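
The list above shows which features were selected, but not how strongly each one separates the classes. A fitted `SelectKBest` keeps the per-feature statistic in `scores_`, which can be ranked. A minimal sketch on made-up data (the frame `toy` and its columns `odor`, `cap` and `label` are hypothetical, not the mushroom dataset):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up toy data: 'odor' perfectly predicts the label, 'cap' does not.
toy = pd.DataFrame({
    "odor":  ["f", "f", "f", "f", "n", "n", "n", "n"],
    "cap":   ["x", "b", "x", "b", "x", "b", "x", "b"],
    "label": ["p", "p", "p", "p", "e", "e", "e", "e"],
})

# One-hot encode the categorical features, as in the cells above.
toy_x = pd.get_dummies(toy[["odor", "cap"]]).astype(int)
toy_y = toy["label"]

toy_skb = SelectKBest(chi2, k=2).fit(toy_x, toy_y)

# Pair each one-hot column with its chi-squared score, highest first.
toy_scores = pd.Series(toy_skb.scores_, index=toy_x.columns)
toy_scores = toy_scores.sort_values(ascending=False)
print(toy_scores)
```

On the mushroom data, the same ranking could be obtained with `pd.Series(skb.scores_, index=x.columns)`.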


In [21]:
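print("Original space:", x.shape)

# Project the one-hot encoded features onto the first five principal components
pca = decomposition.PCA(n_components=5)
pca.fit(x)
Xpca = pca.transform(x)
print("PCA space:", Xpca.shape)

# For each component, report the feature with the largest loading.
# Note: np.argmax picks the largest positive loading; to rank by magnitude
# regardless of sign, use np.argmax(np.abs(component)) instead.
feature_names = []
print("\nMost contributing features:")
for component in pca.components_:
    feature_index = np.argmax(component)
    feature_names.append(x.columns[feature_index])
    print(f"- {x.columns[feature_index]}")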

Original space: (8124, 117)
PCA space: (8124, 5)

Most contributing features:
- bruises?_f
- spore-print-color_h
- habitat_g
- stalk-shape_t
- odor_n
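
The loop above names one feature per component, while the exercise asks which combination of features explains the most variance. A fitted `PCA` reports the share of variance per component in `explained_variance_ratio_`, and ranking the absolute values of a row of `components_` shows which features dominate that component regardless of sign. A minimal sketch on made-up data (the frame `toy` and its columns `a`, `b` and `c` are hypothetical, not the mushroom dataset):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Made-up toy data: 'a' and 'b' vary together with opposite signs,
# 'c' is small independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": -base + 0.05 * rng.normal(size=200),
    "c": 0.1 * rng.normal(size=200),
})

toy_pca = PCA(n_components=2).fit(toy)

# Fraction of total variance captured by each component.
print(toy_pca.explained_variance_ratio_)

# Rank features by the magnitude of their loading on the first component.
loadings = pd.Series(np.abs(toy_pca.components_[0]), index=toy.columns)
top_two = loadings.sort_values(ascending=False).index[:2].tolist()
print(top_two)
```

Because `a` and `b` load on the first component with opposite signs, a plain `np.argmax` over the raw loadings would report only one of them; taking absolute values surfaces both.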


In [22]:
print("Overlapping features:")
o_feats = features.intersection(feature_names)
for feat in o_feats:
    print(f"- {feat}")

Overlapping features:
- odor_n
