# Dimensionality reduction

We have seen so far that the key approach for handling text data is transforming it into a vectorized feature space. While this is pretty easy to do with tools such as `sklearn`, we usually end up with a **very** high-dimensional feature space, which is difficult to interpret and can lead a classifier to overfit, especially a classifier with high variance (remember the Bias-Variance tradeoff?).

For this reason, a common step when processing text data is **dimensionality reduction**. There are a couple of well-known algorithms that transform the data in the high-dimensional space to a space of fewer dimensions. The most widely known are Principal Component Analysis and Singular Value Decomposition. `sklearn` offers both of them.
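As a quick illustration of the SVD route, here is a minimal sketch on a made-up toy corpus (the titles and the choice of 2 components are assumptions for illustration, not part of the dataset used below). It relies on `TruncatedSVD`, which accepts the sparse matrices produced by the text vectorizers directly.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration only
toy_titles = [
    "apple unveils new iphone",
    "google updates android phones",
    "ebola outbreak spreads in west africa",
    "new study links virus to outbreak",
    "microsoft and apple report earnings",
    "cancer study shows promising results",
]

# Sparse TF-IDF matrix: one row per title, one column per word
tfidf = TfidfVectorizer().fit_transform(toy_titles)

# Project the documents onto 2 latent dimensions
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, '->', reduced.shape)
```

The resulting columns are linear combinations of all the words, which is exactly why they are harder to interpret than a short list of selected words.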

Yet, in this unit we will focus on a different approach to reducing dimensionality, sometimes called *feature selection*. We will try to understand which are the most important features (in our case, words) for discriminating the category of our documents. This is a more manual procedure, but much more interpretable than *blackbox-ish* approaches.

In [13]:
# As always, start with some imports
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

from sklearn.model_selection import train_test_split

We already got to know the News Aggregator dataset during the 2nd Learning Unit. Let's load it again:

In [25]:
df = pd.read_csv('text-in-practice/data/uci-news-aggregator.csv')
df = df[['TITLE', 'CATEGORY']]
df.columns = ['title', 'category']

categories = {
    'b': 'business',
    't': 'tech',
    'e': 'entertainment',
    'm': 'health'
}

# Assign the result back instead of relying on an in-place replace on the column
df['category'] = df['category'].replace(categories)

We will use the validation set throughout this chapter, since it's smaller and we are pretty confident that it has the same distribution as the training set.

In [26]:
train_df, validation_df = train_test_split(df, test_size=0.2, random_state=42)
validation_df.category.value_counts()

entertainment    30353
business         23414
tech             21693
health            9024
Name: category, dtype: int64
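If you would rather check the "same distribution" assumption than trust it, you can compare class proportions in the two splits. The sketch below uses a hypothetical toy frame (`toy_df`) and the `stratify` argument of `train_test_split`, which the split above does not use; it is shown only as one possible way to keep the proportions aligned.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a 60/40 class imbalance
toy_df = pd.DataFrame({
    'title': ['title %d' % i for i in range(100)],
    'category': ['tech'] * 60 + ['health'] * 40,
})

# stratify asks the splitter to preserve the class proportions in both splits
toy_train, toy_val = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df['category']
)

# Side-by-side class proportions in each split
comparison = pd.DataFrame({
    'train': toy_train['category'].value_counts(normalize=True),
    'validation': toy_val['category'].value_counts(normalize=True),
})
print(comparison)
```

The same `value_counts(normalize=True)` comparison can be run on `train_df` and `validation_df` to see how close a plain random split gets.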

Now, it's time to apply vectorization again!

In [16]:
# Note that CountVectorizer has a lot of optional parameters, some of which are really interesting...
vectorizer = CountVectorizer(stop_words='english', min_df=2, max_df=0.5)
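To get a feeling for what `min_df` and `max_df` do, here is a small sketch on a made-up corpus (the four titles are an assumption for illustration). `min_df=2` drops words that appear in fewer than 2 documents, while `max_df=0.5` drops words that appear in more than half of them.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration only
toy_titles = [
    "apple unveils new phone",
    "apple sues samsung over phone patents",
    "new ebola outbreak reported",
    "new study on ebola virus",
]

# No pruning: every word ends up in the vocabulary
full = CountVectorizer().fit(toy_titles)

# Keep only words found in at least 2 documents and in at most 50% of them
pruned = CountVectorizer(min_df=2, max_df=0.5).fit(toy_titles)

print(sorted(full.vocabulary_))
print(sorted(pruned.vocabulary_))
```

In this toy corpus, `new` appears in 3 of the 4 titles, so `max_df` removes it, and all the words that occur only once are removed by `min_df`.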

In [17]:
# Build the pipeline
text_clf = Pipeline([
    ('vect', vectorizer),
    ('tfidf', TfidfTransformer())
])

In [18]:
# Vectorize
vectorized = text_clf.fit_transform(validation_df.title)
vectorized.shape

(84484, 18979)

So far so good, but a space with almost 19K dimensions is a bit tough to interpret. It would be much better to extract the `N` most important words.

A way to do that is to embrace a statistical point of view and analyze how much each word is independent of the target. If a given word is independent of the target, that word is probably not so important for predicting the target. If we treat each feature as a random variable, we can test the independence of this variable with respect to the target, using the independence tests that the good folks in statistics developed decades ago.

Since our features are non-negative *frequencies* (TF-IDF), a suitable test for this situation is the chi-squared ($\chi^2$) test. A low chi-squared score for a feature suggests that it is independent of the target, which in turn means that the feature is not particularly useful for classification.
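To make the score less mysterious, the sketch below computes it by hand on a tiny, made-up count matrix and compares it with `sklearn.feature_selection.chi2`. The data and the variable names are assumptions for illustration. For each word, we compare the counts observed in each class with the counts we would expect if the word were spread across classes proportionally to the class sizes.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Toy term counts (hypothetical): rows = documents, columns = words
words = ['apple', 'ebola', 'new']
X = np.array([
    [2, 0, 1],
    [1, 0, 1],
    [0, 3, 1],
    [0, 1, 1],
])
y = np.array(['tech', 'tech', 'health', 'health'])

classes = np.unique(y)

# Observed total count of each word inside each class
observed = np.array([X[y == c].sum(axis=0) for c in classes])

# Expected counts if word and class were independent
class_prob = np.array([(y == c).mean() for c in classes])
expected = np.outer(class_prob, X.sum(axis=0))

# Chi-squared statistic per word, summed over classes
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

# Same statistic computed by scikit-learn
sklearn_scores, p_values = chi2(X, y)

for word, score in zip(words, manual_scores):
    print(word, round(float(score), 3))
```

The word `new` occurs equally often in both classes, so its observed and expected counts coincide and its score is zero, while `apple` and `ebola` are concentrated in a single class and get positive scores.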

Luckily for us, `sklearn` supports chi-squared with an amazingly simple interface:

In [19]:
from sklearn.feature_selection import SelectKBest, chi2
# SelectKBest is used to retain only the k most important features, according to the specified metric
ch2 = SelectKBest(chi2, k=10)
X_train = ch2.fit_transform(vectorized, validation_df.category)

Let's see which are the 10 features that our chi-squared test considered most important:

In [20]:
vectorizer = text_clf.named_steps['vect']
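# Note: in scikit-learn >= 1.0 this method is called get_feature_names_out();
# get_feature_names() was deprecated in 1.0 and removed in 1.2
feature_names = vectorizer.get_feature_names()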
most_important_features = [feature_names[i] for i in ch2.get_support(indices=True)]
most_important_features

['apple',
 'cancer',
 'ebola',
 'google',
 'mers',
 'microsoft',
 'outbreak',
 'samsung',
 'study',
 'virus']

Makes sense, right? We can further convince ourselves by looking at how frequent those words are in each category:

In [24]:
for feature in most_important_features:
    print('Documents that contains the word %s' % feature)
    print('----')
    docs = train_df.title.str.lower().str.contains(feature)
    print(train_df.category[docs].value_counts(), '\n')

Documents that contains the word apple
----
tech             7576
business          647
entertainment      79
health             26
Name: category, dtype: int64 

Documents that contains the word cancer
----
health           2225
entertainment     200
business          167
tech                5
Name: category, dtype: int64 

Documents that contains the word ebola
----
health           2986
business            2
entertainment       1
Name: category, dtype: int64 

Documents that contains the word google
----
tech             8887
business          692
health             79
entertainment      31
Name: category, dtype: int64 

Documents that contains the word mers
----
health           1745
entertainment    1123
tech              949
business          561
Name: category, dtype: int64 

Documents that contains the word microsoft
----
tech             5189
business          417
entertainment       5
health              1
Name: category, dtype: int64 

Documents that contains the word outbre
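The counts for `mers` look more spread across categories than the others. One plausible explanation (an assumption, not verified against the dataset) is that `str.contains` matches substrings, so words such as "consumers" or "farmers" would also count as hits. A minimal sketch of the difference on made-up titles, using a word-boundary regular expression:

```python
import re
import pandas as pd

# Made-up titles for illustration only
titles = pd.Series([
    "MERS virus spreads in Saudi Arabia",
    "Consumers spend more this summer",
    "Farmers hit by drought",
])

word = 'mers'
lowered = titles.str.lower()

# Plain substring matching: 'consumers' and 'farmers' also match
substring_hits = lowered.str.contains(word, regex=False)

# \b marks a word boundary, so only the standalone word matches
pattern = r'\b' + re.escape(word) + r'\b'
whole_word_hits = lowered.str.contains(pattern, regex=True)

print(substring_hits.tolist())
print(whole_word_hits.tolist())
```

Whole-word matching is closer to what the vectorizer does, since it counts tokens rather than substrings.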

Despite its simplicity, this approach can be really useful for understanding which patterns our classifier is going to capture. Can you spot one of the problems with `chi2`?

*Hint*: it considers only one feature (word) at a time. Sometimes, a combination of individually unimportant features turns into a highly discriminative feature.
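The hint can be illustrated with a deliberately artificial example (all data here is hypothetical): two binary "word present" features and a target that is 1 only when exactly one of the two words appears. Each word alone gets a chi-squared score of zero, while a feature built from their combination is scored highly.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical binary features: does the title contain word A / word B?
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
] * 5)

# Target is 1 only when exactly one of the two words is present (XOR)
y = X[:, 0] ^ X[:, 1]

# Scores of each word taken on its own
single_scores, _ = chi2(X, y)

# A combined feature: 1 when exactly one of the two words appears
combined = (X[:, 0] != X[:, 1]).astype(int).reshape(-1, 1)
combined_score, _ = chi2(combined, y)

print('single words:', single_scores)
print('combination: ', combined_score)
```

With `SelectKBest(chi2, ...)` both individual words would be ranked as useless here, even though together they determine the target completely.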