
## **Feature selection is the process of automatically selecting the features in your data that contribute most to the prediction variable or output you are interested in. Performing feature selection before modeling your data has these benefits:**

* Avoids overfitting: less redundant data leaves the model less opportunity to make decisions based on noise, which can improve its performance
* Reduces training time: less data means that algorithms train faster



## **Chi-Square feature selection**
* One common feature selection method for text data is Chi-Square (χ²) feature selection.
* The χ² test is used in statistics to test the independence of two events.
* In feature selection, we use it to test whether the occurrence of a specific term and the occurrence of a specific class are independent.

* For each feature (term), a high χ² score indicates that the null hypothesis H0 of independence (that the document class has no influence on the term's frequency) should be rejected, i.e. the occurrence of the term and the class are dependent.
* In that case, we select the feature for text classification.
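
To make the test concrete, here is a minimal sketch that computes χ² for one term against one class from a 2×2 table. The counts are hypothetical and chosen only for illustration, and the variable names (`table`, `expected_2x2`, `stat`) are not part of the notebook.

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

# Hypothetical document counts (illustration only):
# rows = term present / term absent, columns = class A / class B
table = np.array([[30, 10],
                  [20, 40]], dtype=float)

row_totals = table.sum(axis=1, keepdims=True)  # documents with / without the term
col_totals = table.sum(axis=0, keepdims=True)  # documents per class
total = table.sum()

# counts expected under H0 (term and class are independent)
expected_2x2 = row_totals @ col_totals / total

# chi-square statistic: sum over cells of (observed - expected)^2 / expected
stat = ((table - expected_2x2) ** 2 / expected_2x2).sum()

# a 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
p_value = chi2_dist.sf(stat, df=1)
print(stat, p_value)
```

A large statistic with a small p-value, as in this made-up table, is the situation in which the term would be kept as a feature.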

# **Import libraries**

In [1]:
# TODO: import pandas, numpy, LabelBinarizer, SelectKBest and CountVectorizer
import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer


# **For the given Dummy dataset you need to perform feature selection.**

In [2]:
X = np.array(['call you tonight', 'Call me a cab', 'please call me... PLEASE!', 'he will call me'])
y = [1, 1, 2, 0] #class labels

# we'll convert it to a dense document-term matrix,
# so we can print a more readable output
vect = CountVectorizer()
X_dtm = vect.fit_transform(X)
X_dtm = X_dtm.toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X_dtm, columns = vect.get_feature_names_out())

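|   | cab | call | he | me | please | tonight | will | you |
|---|-----|------|----|----|--------|---------|------|-----|
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 2 | 0 | 0 | 0 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |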


# **STEPS**
* First, compute the observed count for each class by building a contingency table from the input X (feature values) and y (class labels).
* Each entry corresponds to one feature i and one class j, and holds the sum of the i-th feature's values across all samples belonging to class j. (The code stores this table with classes as rows and features as columns.)

* Note that although the feature values here are raw term counts, the method also works quite well in practice with tf-idf values, since those are just weighted/scaled frequencies.

In [3]:
# binarize the output column,
# this makes computing the observed value a 
# simple dot product
y_binarized = LabelBinarizer().fit_transform(y)
print(y_binarized)
print()
print(X_dtm)
print()
# our observed count for each class (the row)
# and each feature (the column)
observed = np.dot(y_binarized.T, X_dtm)
print(observed)

[[0 1 0]
 [0 1 0]
 [0 0 1]
 [1 0 0]]

[[0 1 0 0 0 1 0 1]
 [1 1 0 1 0 0 0 0]
 [0 1 0 1 2 0 0 0]
 [0 1 1 1 0 0 1 0]]

[[0 1 1 1 0 0 1 0]
 [1 2 0 1 0 1 0 1]
 [0 1 0 1 2 0 0 0]]
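
The dot product with the binarized labels is a compact way of summing the feature rows of each class. The sketch below spells that out with an explicit per-class sum; `X_counts`, `labels` and `observed_loop` are names introduced here, and the matrix is copied from the output above.

```python
import numpy as np

# same document-term matrix and class labels as in the notebook
X_counts = np.array([[0, 1, 0, 0, 0, 1, 0, 1],
                     [1, 1, 0, 1, 0, 0, 0, 0],
                     [0, 1, 0, 1, 2, 0, 0, 0],
                     [0, 1, 1, 1, 0, 0, 1, 0]])
labels = np.array([1, 1, 2, 0])

# sorted unique labels: the same class order LabelBinarizer uses
classes = np.unique(labels)

# for each class, add up the rows of the samples that belong to it
observed_loop = np.vstack([X_counts[labels == c].sum(axis=0) for c in classes])
print(observed_loop)
```

The printed table should match `observed`: one row per class, one column per term.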


In [4]:
# compute the probability of each class and the total count of each feature;
# keep both as 2-dimensional arrays using reshape
class_prob = y_binarized.mean(axis = 0).reshape(1, -1)
print(class_prob)
print()
feature_count = X_dtm.sum(axis = 0).reshape(1, -1)
print(feature_count)
print()
# expected count under independence:
# class probability times total feature count
expected = np.dot(class_prob.T, feature_count)
print(expected)

[[0.25 0.5  0.25]]

[[1 4 1 3 2 1 1 1]]

[[0.25 1.   0.25 0.75 0.5  0.25 0.25 0.25]
 [0.5  2.   0.5  1.5  1.   0.5  0.5  0.5 ]
 [0.25 1.   0.25 0.75 0.5  0.25 0.25 0.25]]
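
One quick sanity check on the expected table: because the class probabilities sum to 1, each column of `expected` should add up to the same per-feature total as the corresponding column of `observed`. The sketch below re-creates the arrays from the printed outputs so that it runs on its own.

```python
import numpy as np

# values copied from the outputs above
class_prob = np.array([[0.25, 0.5, 0.25]])
feature_count = np.array([[1, 4, 1, 3, 2, 1, 1, 1]])
observed = np.array([[0, 1, 1, 1, 0, 0, 1, 0],
                     [1, 2, 0, 1, 0, 1, 0, 1],
                     [0, 1, 0, 1, 2, 0, 0, 0]])

# outer product of class probabilities and feature totals, shape (3, 8)
expected = class_prob.T @ feature_count

# both tables distribute the same per-feature totals across the classes
print(expected.sum(axis=0))
print(observed.sum(axis=0))
```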


### **We can run the chi-square test under the assumption that there is no bias between the columns. We can also run the chi-square test directly from the contingency table**

### **Find the chi-square value and p-value for each feature**
* chi-square scores: a higher score is stronger evidence that the feature and the class are dependent.
* p-values: a smaller value is stronger evidence against independence.



In [5]:
# TODO: compute the chi-square score for each feature:
# sum (observed - expected)^2 / expected over the classes
chisq = (observed - expected) ** 2 / expected
chisq_score = chisq.sum(axis = 0)
print(chisq_score)

[1.         0.         3.         0.33333333 6.         1.
 3.         1.        ]


In [6]:
# TODO: cross-check the result using scikit-learn's chi2 function
from sklearn.feature_selection import chi2
chi2score = chi2(X_dtm, y)
chi2score

# chi2 returns a tuple: the 1st array holds the chi-square scores,
# the 2nd array holds the p-values

(array([1.        , 0.        , 3.        , 0.33333333, 6.        ,
        1.        , 3.        , 1.        ]),
 array([0.60653066, 1.        , 0.22313016, 0.84648172, 0.04978707,
        0.60653066, 0.22313016, 0.60653066]))
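
The p-values above are consistent with the survival function of a chi-square distribution with (number of classes - 1) degrees of freedom, evaluated at each score. A minimal sketch using `scipy.stats.chi2` (imported as `chi2_dist` so that it does not shadow scikit-learn's `chi2`), with the scores copied from the output:

```python
import numpy as np
from scipy.stats import chi2 as chi2_dist

# chi-square scores from the manual computation above
scores = np.array([1.0, 0.0, 3.0, 1 / 3, 6.0, 1.0, 3.0, 1.0])

# three classes (0, 1, 2) give 3 - 1 = 2 degrees of freedom
n_classes = 3
p_values = chi2_dist.sf(scores, df=n_classes - 1)
print(p_values)
```

Only the fifth feature (`please`, score 6.0) has a p-value below 0.05 on this tiny dataset.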

# **Feature Selection**

## **Select the k best features using chi-square as the score function**
**scikit-learn** provides a **SelectKBest** class that can be used with a suite of different statistical tests. It ranks the features using the statistical test we specify and selects the top k (meaning that these terms are considered more relevant to the task at hand than the others), where k is a number we can tune.

In [7]:
#TODO : scikit selectkbest
kbest = SelectKBest(score_func = chi2, k = 4)
X_dtm_kbest = kbest.fit_transform(X_dtm, y)
X_dtm_kbest

array([[0, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 2, 0, 0],
       [1, 0, 1, 0]])

# **Conclusion**