# Feature Selection on Datasets Containing Categorical Variables

Oftentimes we find ourselves with a dataset whose predictors are not numerical. In the case of supervised learning, we would like to know which of these categorical variables are associated with the target class.

One way to test for dependence, or a lack thereof, is the $\chi^2$ test of independence, which applies to categorical variables.

It is important to know that for us to trust the results of a $\chi^2$ test, we need the following to be true:

* The contingency table relating a variable to the target class must contain frequencies (counts), not percentages.

* The levels (categories) of the variables being tested are mutually exclusive.

* The expected count in each cell of the contingency table should be at least 5.
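
Two of these conditions can be checked directly from a contingency table. The following is a minimal sketch: `check_chisq_assumptions` is a hypothetical helper written for illustration, the toy table contains made-up counts, and the threshold of 5 is passed in as a parameter. Whether the levels are mutually exclusive depends on how the data were collected and cannot be verified from the table itself.

```r
# Hypothetical helper: checks the frequency and expected-count conditions.
check_chisq_assumptions <- function(tab, min_expected = 5) {
  counts <- as.matrix(tab)
  # Frequencies are non-negative whole numbers; proportions generally are not.
  is_counts <- all(counts >= 0) && all(counts == round(counts))
  # chisq.test() returns the expected counts under independence.
  expected <- suppressWarnings(chisq.test(counts)$expected)
  list(
    is_counts = is_counts,
    min_expected = min(expected),
    expected_ok = all(expected >= min_expected)
  )
}

# Toy 2x2 table of made-up counts (not taken from any real dataset).
toy_counts <- matrix(c(30, 10, 20, 40), nrow = 2,
                     dimnames = list(outcome = c("No", "Yes"),
                                     group = c("A", "B")))
check_toy <- check_chisq_assumptions(toy_counts)
check_toy$expected_ok

# The same table expressed as proportions fails the frequency check.
check_prop <- check_chisq_assumptions(toy_counts / sum(toy_counts))
check_prop$is_counts
```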

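If the $p$-value of the $\chi^2$ test is less than 0.05, a typical threshold, we reject the null hypothesis of independence and say that the target variable and the predictor are dependent. Otherwise, we do not have enough evidence to say they are related.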
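
To make the decision rule concrete, the statistic and its $p$-value can be computed by hand. This is a sketch on a toy table of made-up counts, and the variable names are illustrative. It uses the plain Pearson statistic, so the cross-check turns off the Yates continuity correction that `chisq.test` applies to 2x2 tables by default.

```r
# Toy 2x2 table of made-up counts (illustrative only).
observed <- matrix(c(30, 10, 20, 40), nrow = 2,
                   dimnames = list(outcome = c("No", "Yes"),
                                   group = c("A", "B")))

# Expected counts if rows and columns were independent:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic (no continuity correction) and its degrees of freedom.
stat <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

# A small p-value is evidence against independence.
alpha <- 0.05
if (p_value < alpha) {
  decision <- "reject independence: evidence of an association"
} else {
  decision <- "fail to reject independence"
}

# Cross-check against the built-in test with the Yates correction turned off.
builtin <- chisq.test(observed, correct = FALSE)
all.equal(unname(builtin$statistic), stat)
```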

## Default Dataset

In the previous notebook, we built logistic classifiers on different balanced training sets and sought the best-performing classifier. In addition, we plotted figures to convince ourselves that **student** status is not an important predictor of default. Here, we will do a quick $\chi^2$ test to see whether **default** is associated with being a **student**.

First, we will load the Default dataset from the *ISLR* R package and create a contingency table between **default** and **student**. Then we run a $\chi^2$ test.

In [1]:
library(ISLR)
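
# Create a contingency table: default (rows) by student (columns).
# Columns are referenced explicitly rather than via attach(), which can mask other objects.
default_tab = table(default = Default$default, student = Default$student)
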
In [2]:
chi2 = chisq.test(default_tab)

In [3]:
chi2$p.value

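Since the $p$-value is less than 0.05, we reject the null hypothesis of independence: **student** and **default** are associated. The test therefore does not, on its own, justify excluding **student** when building a classifier. Note that with a sample as large as this one, even a weak association can be statistically significant, so a small $p$-value says little about how important **student** is as a predictor.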


In [4]:
chi2$expected

| default \ student | No | Yes |
|---|---|---|
| No | 6821.0352 | 2845.9648 |
| Yes | 234.9648 | 98.0352 |


We also observe that all the expected counts in the contingency table are greater than 5, so the $\chi^2$ approximation is reliable here.

## What is next?

We saw how a simple $\chi^2$ test can be applied to a dataset with categorical variables. Used as a filter, it can reduce the number of predictors by flagging those that show no evidence of association with the target. One interesting question is, **"For a dataset with many categorical variables, how effective is the $\chi^2$ test in allowing us to build a simple model?"**
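
As a starting point for that question, the test can be run on every categorical predictor in a data frame. The sketch below is only an illustration: `chisq_screen` is a hypothetical helper, and the data are simulated so that `plan` depends on the target `churn` while `region` is pure noise.

```r
# Hypothetical helper: chi-squared p-value of each categorical predictor
# against the target, sorted from smallest to largest.
chisq_screen <- function(data, target) {
  predictors <- setdiff(names(data), target)
  # Keep only factor or character columns; numeric columns are skipped.
  is_cat <- vapply(data[predictors],
                   function(col) is.factor(col) || is.character(col),
                   logical(1))
  predictors <- predictors[is_cat]
  p_values <- vapply(predictors, function(name) {
    tab <- table(data[[name]], data[[target]])
    suppressWarnings(chisq.test(tab)$p.value)
  }, numeric(1))
  sort(p_values)
}

# Simulated data (made up for this example).
set.seed(1)
n <- 500
churn <- factor(sample(c("No", "Yes"), n, replace = TRUE))
plan <- factor(ifelse(
  churn == "Yes",
  sample(c("basic", "premium"), n, replace = TRUE, prob = c(0.8, 0.2)),
  sample(c("basic", "premium"), n, replace = TRUE, prob = c(0.3, 0.7))
))
region <- factor(sample(c("north", "south", "east"), n, replace = TRUE))
toy <- data.frame(churn, plan, region, age = rnorm(n))

screen <- chisq_screen(toy, target = "churn")
screen
```

Predictors with small $p$-values show evidence of association with the target; the rest are candidates for removal, provided the expected-count condition holds for each table.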