# Weka machine learning toolkit

* [Download Weka](https://www.cs.waikato.ac.nz/~ml/weka/)
* [Data mining with Weka video series](https://www.youtube.com/user/WekaMOOC)

# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

* Using the UCI mushroom dataset from the last exercise, perform feature selection using a classifier evaluator. Which features are most discriminative?
* Use principal components analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
* Do you see any overlap between the PCA features and those obtained from feature selection?

In [48]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.decomposition import PCA


In [49]:
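# Load the dataset. This copy is assumed to have a header row with an
# 'edibility' column; the raw UCI file ships without column names.
df = pd.read_csv('./data/agaricus-lepiota.data')

# Encode the target as 1 (edible) / 0 (poisonous)
y = df['edibility'].map({'e': 1, 'p': 0})

# One-hot encode the categorical features
X = pd.get_dummies(df.drop('edibility', axis=1))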

print(y.shape)
print(X.shape)


(8124,)
(8124, 117)


In [50]:
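# Select the 5 features with the highest chi-squared score against the target
# (chi2 is a filter-style score, not a classifier-based evaluator)
skb = SelectKBest(chi2, k=5)
skb.fit(X, y)
X_new = skb.transform(X)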

# Show selected features
print("Selected features based on chi2:")
print(np.array(X.columns)[skb.get_support(indices=True)])


Selected features based on chi2:
['odor_f' 'odor_n' 'gill-color_b' 'stalk-surface-above-ring_k'
 'stalk-surface-below-ring_k']
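
The exercise asks for a classifier evaluator, whereas `chi2` scores each feature without training a model. One possible classifier-based alternative is sklearn's `SelectFromModel`, which ranks features by the importances of a fitted estimator. The following is a minimal sketch on a small synthetic frame, not the mushroom file: the column names, the toy labelling rule, and the choice of a decision tree are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the mushroom data (hypothetical values, not the UCI file)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "odor": rng.choice(["a", "f", "n"], size=n),
    "habitat": rng.choice(["d", "g", "u"], size=n),
})

# Toy rule: foul odor ('f') means poisonous (0), anything else edible (1)
y_toy = (toy["odor"] != "f").astype(int)
X_toy = pd.get_dummies(toy)

# Rank features by the importances of a fitted tree and keep the best one.
# threshold=-np.inf makes max_features the only selection criterion.
tree = DecisionTreeClassifier(random_state=0)
selector = SelectFromModel(tree, threshold=-np.inf, max_features=1)
selector.fit(X_toy, y_toy)

selected = list(X_toy.columns[selector.get_support()])
print(selected)  # ['odor_f']
```

On the real data the same pattern would take `X` and `y` from the cells above, with `max_features` set to however many features should be kept.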


In [51]:
# Project the one-hot features onto five principal components
# (PCA was imported in the first cell)
print("Original space:", X.shape)
pca = PCA(n_components=5)
pca.fit(X)
Xpca = pca.transform(X)

print("PCA space:", Xpca.shape)

# For each principal component, find the feature with the largest absolute loading
most_contributing_features = []

for component in pca.components_:
    feature_index = np.argmax(np.abs(component))  # the sign of a loading is arbitrary, so compare magnitudes
    most_contributing_features.append(X.columns[feature_index])

print("Most contributing features for each principal component:")
for feature in most_contributing_features:
    print(f"- {feature}")


Original space: (8124, 117)
PCA space: (8124, 5)
Most contributing features for each principal component:
- ring-type_p
- spore-print-color_h
- habitat_d
- stalk-shape_e
- odor_n
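
The second question is about explained variance, which the cell above does not print. A fitted `PCA` object exposes it as `explained_variance_ratio_`, one value per component. The sketch below shows how that could be reported; it runs on a small synthetic one-hot frame (hypothetical columns, not the mushroom file) so that it is self-contained.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Toy categorical data (hypothetical values, not the UCI file)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "cap-shape": rng.choice(["b", "x", "f"], size=300),
    "odor": rng.choice(["a", "f", "n"], size=300),
})
X_toy = pd.get_dummies(toy).astype(float)

pca_toy = PCA(n_components=3).fit(X_toy)

# Fraction of the total variance captured by each component, plus the running total
ratios = pca_toy.explained_variance_ratio_
for i, (ratio, total) in enumerate(zip(ratios, np.cumsum(ratios)), start=1):
    print(f"PC{i}: {ratio:.3f} (cumulative {total:.3f})")
```

For the notebook itself, reading `pca.explained_variance_ratio_` after `pca.fit(X)` would give the share of variance behind each of the five components listed above.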


In [52]:
# Features selected by SelectKBest
selected_features = set(np.array(X.columns)[skb.get_support(indices=True)])
print(f"Features selected by chi2: {selected_features}")

# Check overlap
overlap_features = selected_features.intersection(most_contributing_features)

print("Overlapping features:")
for feature in overlap_features:
    print(f"- {feature}")

Features selected by chi2: {'stalk-surface-below-ring_k', 'stalk-surface-above-ring_k', 'odor_n', 'odor_f', 'gill-color_b'}
Overlapping features:
- odor_n
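
Comparing only the single largest loading per component is a strict test of overlap. A looser variant takes the top few loadings of each component before intersecting with the chi2 selection. The sketch below illustrates the idea; the helper `top_loading_features`, the loading matrix, and the feature sets are made up for the example and are not results from the mushroom data.

```python
import numpy as np

def top_loading_features(components, feature_names, n_top=3):
    """Collect the features with the n_top largest absolute loadings of each component."""
    top = set()
    for component in components:
        # Indices sorted by descending absolute loading
        idx = np.argsort(np.abs(component))[::-1][:n_top]
        top.update(feature_names[i] for i in idx)
    return top

# Hypothetical loadings: two components over four features
names = ["odor_n", "odor_f", "ring-type_p", "habitat_d"]
components = np.array([
    [0.10, -0.70, 0.60, 0.05],
    [0.80, 0.20, -0.10, 0.55],
])

pca_top = top_loading_features(components, names, n_top=2)
chi2_selected = {"odor_f", "odor_n", "gill-color_b"}  # hypothetical selection

print(sorted(pca_top & chi2_selected))  # ['odor_f', 'odor_n']
```

With the notebook's objects, the call would be `top_loading_features(pca.components_, X.columns, n_top=3)` intersected with `selected_features`.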
