# Weka machine learning toolkit

* [Download Weka](https://www.cs.waikato.ac.nz/~ml/weka/)
* [Data mining with Weka video series](https://www.youtube.com/user/WekaMOOC)

# Exercise 6

For this exercise you can use either Python with sklearn or Weka.

* Using the UCI mushroom dataset from the last exercise, perform a feature selection using a classifier evaluator. Which features are most discriminative?
* Use principal components analysis to construct a reduced space. Which combination of features explains the most variance in the dataset?
* Do you see any overlap between the PCA features and those obtained from feature selection?

In [1]:
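import sklearn.feature_selection as sk
import pandas as pd
import numpy as np

dataset_from_csv = pd.read_csv("agaricus-lepiota.csv")

# One-hot encode every categorical column; chi2 needs non-negative numeric input.
# Note: the target's own dummy columns stay in the feature matrix here.
dummy_dataset = pd.get_dummies(dataset_from_csv)
# Target to select features for: habitat
dummy_feature_selection = pd.get_dummies(dataset_from_csv['habitat'])

# Keep the 12 features with the highest chi-squared scores
feature_selector = sk.SelectKBest(sk.chi2, k=12)
feature_selector.fit(dummy_dataset, dummy_feature_selection)

# Names of the selected columns (in original column order, not ranked by score)
np.array(dummy_dataset.columns)[feature_selector.get_support(indices=True)]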

array(['gill-color_e', 'stalk-root_c', 'stalk-color-above-ring_e',
       'stalk-color-below-ring_e', 'population_c', 'habitat_d',
       'habitat_g', 'habitat_l', 'habitat_m', 'habitat_p', 'habitat_u',
       'habitat_w'], dtype=object)

In [2]:
dataset_from_csv = pd.read_csv("agaricus-lepiota.csv")

# Same procedure as above, this time with edibility as the target
dummy_dataset = pd.get_dummies(dataset_from_csv)
dummy_feature_selection = pd.get_dummies(dataset_from_csv['edibility'])

feature_selector = sk.SelectKBest(sk.chi2, k=12)
feature_selector.fit(dummy_dataset, dummy_feature_selection)

# Names of the selected columns (in original column order, not ranked by score)
np.array(dummy_dataset.columns)[feature_selector.get_support(indices=True)]

array(['edibility_e', 'edibility_p', 'bruises?_t', 'odor_f', 'odor_n',
       'gill-size_n', 'gill-color_b', 'stalk-surface-above-ring_k',
       'stalk-surface-below-ring_k', 'ring-type_l', 'ring-type_p',
       'spore-print-color_h'], dtype=object)

I am mostly surprised by the ordering of the features in the two selections.
It makes sense that edibility comes first when selecting for edibility, but why do the habitat features come last when selecting for habitat?
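
A possible explanation for the ordering: `get_support(indices=True)` reports the selected columns in their original column order, not by score, so the position in the array says nothing about rank. The sketch below sorts the selected features by `scores_` instead, and drops the target from the feature matrix so it cannot select itself. The toy DataFrame is a hypothetical stand-in, not the real mushroom data.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy stand-in for the mushroom data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "edibility": list("epepepep"),
    "odor":      list("nfnfnfnf"),   # perfectly aligned with edibility
    "cap-shape": list("xxbbxxbb"),   # unrelated to edibility
    "habitat":   list("ggggdddd"),   # unrelated to edibility
})

# Keep the target out of the feature matrix so it cannot select itself.
X = pd.get_dummies(toy.drop(columns="edibility"), dtype=float)
y = toy["edibility"]

selector = SelectKBest(chi2, k=2).fit(X, y)

# get_support() follows the column order; sort by score to get a ranking.
scores = pd.Series(selector.scores_, index=X.columns)
ranked = scores[selector.get_support()].sort_values(ascending=False)
print(ranked)
```

On the real data the same pattern would apply with `dataset_from_csv` in place of `toy` and a larger `k`.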

In [3]:
from sklearn import decomposition

# Principal component analysis (PCA)
# PCA fits new axes (principal components) through the data. The first axis points in the
# direction where the data vary the most, the second in the direction of most remaining variance, and so on.
# Each principal component is a linear combination of the original variables. The weights
# (loadings) show how strongly each variable aligns with that axis.
# Each loading vector has unit length, so by Pythagoras' theorem the squared weights of a component sum to 1.

# Variance measures how spread out the data are along an axis. A large loading means the
# variable contributes a lot to that spread; it does not by itself mean the variable is
# important for a property such as being poisonous.

# Open question: do the results tell us that the poisonous mushrooms are the most distinctive ones?
components = 5
component_indexes = ['PC' + str(1 + i) for i in range(components)]

pca = decomposition.PCA(n_components=components)
principal_components = pca.fit_transform(dummy_dataset)

principal_df = pd.DataFrame(data=principal_components, columns=component_indexes)
principal_df

row,PC1,PC2,PC3,PC4,PC5
0,-0.638268,-0.703743,0.685535,-1.635024,-1.295963
1,-1.573287,0.027375,1.016744,-1.551018,-0.248315
2,-1.670593,-0.198090,0.793701,-1.741395,-0.033859
3,-0.744696,-0.410661,0.511156,-1.799396,-1.358061
4,-1.029175,-0.955341,1.727233,1.367630,-0.280311
...,...,...,...,...,...
8119,-0.495899,-0.446242,0.367512,-0.605917,2.399196
8120,-0.420277,-0.406395,0.162846,-0.593804,1.836562
8121,-0.630114,-0.337139,0.298104,-0.413737,2.214303
8122,1.869292,-1.803126,-0.102362,-0.373076,0.049375
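
To see how much of the total variance each component accounts for, scikit-learn's fitted `PCA` object exposes `explained_variance_ratio_`. The following is a minimal sketch on hypothetical random categorical data (the column names and values are made up for illustration); on the notebook's data the same attribute could be read from the fitted `pca` object.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy categorical table standing in for the mushroom data.
toy = pd.DataFrame({
    "odor":    rng.choice(list("nfa"), size=200),
    "habitat": rng.choice(list("gdl"), size=200),
    "bruises": rng.choice(list("tf"), size=200),
})
X = pd.get_dummies(toy, dtype=float)

pca = PCA(n_components=3).fit(X)

# Fraction of the total variance captured by each component, largest first.
ratio = pd.Series(pca.explained_variance_ratio_,
                  index=[f"PC{i + 1}" for i in range(pca.n_components_)])
print(ratio)
print("cumulative:", ratio.cumsum().iloc[-1])
```

The cumulative sum shows how much information is retained when only the first few components are kept.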


In [4]:
# Earlier attempt: find the index of the feature with the largest loading in each component
#significant_features = [pca.components_[i].argmax() for i in range(pca.n_components_)]
# Collect the names of those features in a string
#feature_names = [dummy_dataset.columns[j] for j in significant_features]
#print("Features with the largest loading per component:", ", ".join(feature_names))

# Loadings table: one row per one-hot feature, one column per principal component
principal_ranked_df = pd.DataFrame(data=pca.components_, columns=dummy_dataset.columns, index=component_indexes).transpose()
principal_ranked_df

feature,PC1,PC2,PC3,PC4,PC5
edibility_e,-0.286025,-0.048523,0.029625,0.113940,0.228357
edibility_p,0.286025,0.048523,-0.029625,-0.113940,-0.228357
cap-shape_b,-0.028638,-0.003008,0.048408,-0.090450,0.061944
cap-shape_c,0.000044,-0.000137,0.000293,-0.000485,0.000523
cap-shape_f,0.000377,0.042997,-0.062735,0.149180,-0.007145
...,...,...,...,...,...
habitat_l,0.060540,-0.077944,-0.014493,-0.023391,0.112058
habitat_m,-0.024363,0.000736,0.019173,-0.084382,-0.004744
habitat_p,0.095959,0.020203,-0.010626,-0.034115,0.009532
habitat_u,-0.011416,-0.003436,0.013219,-0.047906,-0.093428
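
Each row of `pca.components_` is a unit-length direction, so the squared loadings of one component sum to 1 and different components are orthogonal to each other. A small sketch on hypothetical random numeric data (not the mushroom data) to check this:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical numeric data; make two columns correlated so PCA has structure to find.
X = rng.normal(size=(100, 4))
X[:, 1] += 2 * X[:, 0]

pca = PCA(n_components=2).fit(X)

# Length of each loading vector (Euclidean norm across features).
lengths = np.linalg.norm(pca.components_, axis=1)
# Dot product between the first two components; orthogonal directions give ~0.
dot = pca.components_[0] @ pca.components_[1]

print(lengths)
print(dot)
```

Because of the unit length, loadings are comparable within a component, and a squared loading can be read as that feature's share of the component's direction.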


In [5]:
# Rank the features of each principal component by the absolute size of their loading
sorted_principal_ranked_df = pd.DataFrame()
for i in range(pca.n_components):
    # Feature with the largest positive loading on this component
    # (argmax skips large negative loadings, so it can differ from the top row of the table)
    print(dummy_dataset.columns[pca.components_[i].argmax()])
    sorted_principal_ranked_df['PC' + str(1 + i)] = principal_ranked_df['PC' + str(1 + i)].abs().sort_values(ascending=False).index
sorted_principal_ranked_df
# veil-type is practically useless


edibility_p
stalk-root_b
habitat_g
stalk-shape_t
odor_n


rank,PC1,PC2,PC3,PC4,PC5
0,edibility_p,stalk-root_b,habitat_d,stalk-shape_t,odor_n
1,edibility_e,ring-type_e,habitat_g,stalk-shape_e,stalk-root_?
2,ring-type_p,spore-print-color_h,gill-spacing_w,odor_n,stalk-color-below-ring_w
3,bruises?_f,ring-type_l,gill-spacing_c,cap-surface_f,edibility_e
4,bruises?_t,stalk-root_?,stalk-root_b,gill-spacing_w,edibility_p
...,...,...,...,...,...
114,ring-type_f,veil-color_y,cap-color_r,cap-shape_c,stalk-color-above-ring_b
115,cap-shape_c,stalk-color-above-ring_y,gill-color_r,stalk-color-above-ring_y,cap-shape_s
116,cap-color_u,cap-surface_g,cap-shape_c,veil-color_y,stalk-color-below-ring_b
117,cap-color_r,cap-shape_c,cap-surface_g,gill-color_k,cap-surface_g



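If we compare the top-ranked PCA loadings with the feature selection for edibility, edibility itself appears in both, along with:
bruises, odor, ring-type, spore-print-color

stalk-shape has the largest loading on PC4, but it is not among the features selected for either edibility or habitat.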
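
The comparison can also be done programmatically. The sketch below copies the feature names from the outputs in this notebook (the edibility selection and the five highest-ranked loadings per component), strips the one-hot suffix, and intersects the two sets. The helper `base_attribute` is a hypothetical name introduced for illustration.

```python
# Selected features for edibility, copied from the feature selection output.
chi2_edibility = [
    'edibility_e', 'edibility_p', 'bruises?_t', 'odor_f', 'odor_n',
    'gill-size_n', 'gill-color_b', 'stalk-surface-above-ring_k',
    'stalk-surface-below-ring_k', 'ring-type_l', 'ring-type_p',
    'spore-print-color_h',
]

# Five highest-ranked features per component, copied from the sorted loadings table.
pca_top5 = {
    'PC1': ['edibility_p', 'edibility_e', 'ring-type_p', 'bruises?_f', 'bruises?_t'],
    'PC2': ['stalk-root_b', 'ring-type_e', 'spore-print-color_h', 'ring-type_l', 'stalk-root_?'],
    'PC3': ['habitat_d', 'habitat_g', 'gill-spacing_w', 'gill-spacing_c', 'stalk-root_b'],
    'PC4': ['stalk-shape_t', 'stalk-shape_e', 'odor_n', 'cap-surface_f', 'gill-spacing_w'],
    'PC5': ['odor_n', 'stalk-root_?', 'stalk-color-below-ring_w', 'edibility_e', 'edibility_p'],
}

def base_attribute(dummy_name):
    """Strip the one-hot suffix, e.g. 'ring-type_l' -> 'ring-type'."""
    return dummy_name.rsplit('_', 1)[0]

chi2_attrs = {base_attribute(name) for name in chi2_edibility}
pca_attrs = {base_attribute(name) for names in pca_top5.values() for name in names}

overlap = sorted(chi2_attrs & pca_attrs)
print(overlap)
# -> ['bruises?', 'edibility', 'odor', 'ring-type', 'spore-print-color']
```

Only the top five rows per component are visible in the output, so attributes that rank lower in the loadings are not considered here.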


Results from feature selection for edibility:
array(['edibility_e', 'edibility_p', 'bruises?_t', 'odor_f', 'odor_n',
       'gill-size_n', 'gill-color_b', 'stalk-surface-above-ring_k',
       'stalk-surface-below-ring_k', 'ring-type_l', 'ring-type_p',
       'spore-print-color_h'], dtype=object)

Results from feature selection for habitat:
array(['gill-color_e', 'stalk-root_c', 'stalk-color-above-ring_e',
       'stalk-color-below-ring_e', 'population_c', 'habitat_d',
       'habitat_g', 'habitat_l', 'habitat_m', 'habitat_p', 'habitat_u',
       'habitat_w'], dtype=object)
