# Imports

Nothing to see here yet

In [2]:
# Standard imports
import numpy as np
import pandas as pd

# SKLearn related imports
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import preprocessing

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin

# NLTK Text Processing package
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer



# Text Data in Practice 

While the first learning unit walked through the individual steps of working with textual data from a "reinventing the wheel" perspective, this learning unit introduces existing tools and packages that we can use to work with text.

For this example we will use the UCI "News Aggregator" dataset.

This dataset is a collection of news headlines and their corresponding category. Our objective is to apply text classification techniques to assign a category to a news headline.

Let's start by importing the data:

In [3]:
df = pd.read_csv('./data/uci-news-aggregator.csv')
df.dtypes

ID            int64
TITLE        object
URL          object
PUBLISHER    object
CATEGORY     object
STORY        object
HOSTNAME     object
TIMESTAMP     int64
dtype: object

As we can see, this dataset has 8 fields. For this learning unit we will only look at two of them: TITLE and CATEGORY.

In [4]:
df = df[['TITLE', 'CATEGORY']]
df.columns = ['title', 'category']
df.head()

| | title | category |
|---|---|---|
| 0 | Fed official says weak data caused by weather,... | b |
| 1 | Fed's Charles Plosser sees high bar for change... | b |
| 2 | US open: Stocks fall after Fed official hints ... | b |
| 3 | Fed risks falling 'behind the curve', Charles ... | b |
| 4 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | b |


In [5]:
(df['title'][2], df['category'][2])

('US open: Stocks fall after Fed official hints at accelerated tapering', 'b')

From the keywords in this small sample we can already see that the "b" category seems to cover stock market and financial news.

Before vectorizing the titles into a structured format that our machine learning models can understand, let's split our dataset into a training set and a validation set.

In [6]:
# Split in train and validation
train_df, validation_df = train_test_split(df, test_size=0.2, random_state=42)
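
As a minimal sketch of what this call does, the snippet below splits a made-up ten-row frame (the `toy_*` names and the headlines are illustrative assumptions, not part of the dataset). It shows that `test_size=0.2` holds out 20% of the rows and that a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data, used only to illustrate the split
toy_df = pd.DataFrame({'title': ["headline %d" % i for i in range(10)],
                       'category': list("bbbbbttttt")})

toy_train, toy_val = train_test_split(toy_df, test_size=0.2, random_state=42)
print(len(toy_train), len(toy_val))

# Repeating the call with the same random_state selects the same rows
toy_train_again, toy_val_again = train_test_split(toy_df, test_size=0.2, random_state=42)
print(toy_train.index.equals(toy_train_again.index))
```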

# Bag of words representation

The first tools we will look at are the text-processing utilities that ship with scikit-learn, a package you have already used.

As we saw in the first learning unit, a common way to vectorize a piece of text is the so-called bag-of-words representation.

Scikit-learn comes with a handy tool for this procedure, called __[Count Vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)__.

Let's create an instance and see how it works.

In [7]:
vectorizer = CountVectorizer()

Like every scikit-learn transformer, the vectorizer needs to be fit first. In this case, fitting means building an internal dictionary (the vocabulary) of the words that appear in the training titles.

In [9]:
vectorizer.fit(train_df['title'].values)

# Looking at a small sample of the vocabulary:
vocabulary = list(vectorizer.vocabulary_.keys())
print("Small sample of the vocabulary:", vocabulary[0:20])

# Number of words in the vocabulary
print("\nNumber of distinct words:", len(vocabulary))

Small sample of the vocabulary: ['nasa', 'cassini', 'spacecraft', 'finds', '101', 'geysers', 'on', 'icy', 'saturn', 'moon', 'paul', 'mazursky', 'dead', 'five', 'times', 'oscar', 'nominated', 'director', 'has', 'died']

Number of distinct words: 50140
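
To make the structure of that internal dictionary more concrete, here is a small sketch on two made-up headlines (the `toy_*` names and the sentences are assumptions for illustration). With the default settings, `vocabulary_` maps each lowercased token to the column index it will occupy in the count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up headlines, used only for illustration
toy_titles = ["NASA finds geysers on Saturn moon",
              "Saturn moon photographed by NASA"]

toy_vectorizer = CountVectorizer()
toy_vectorizer.fit(toy_titles)

# vocabulary_ is a dict {token: column index}; print it ordered by index
for word, index in sorted(toy_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(index, word)
```

Note that the tokens are lowercased and that each distinct word gets exactly one column, no matter how often it appears.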


Taking a sample title, for example the one at position 61 of the training set, we can inspect its bag-of-words representation:

In [10]:
sentence = train_df['title'].values[61:62]
print(sentence[0], '\n')

# Transform sentence into bag of words representation
word_count_sentence = vectorizer.transform(sentence)

# Find the indexes of the words which appear in the sentence
_, columns = word_count_sentence.nonzero()

# Get the inverse map to map vector indexes to words
vocabulary = vectorizer.vocabulary_
inv_map = {v: k for k, v in vocabulary.items()}

# Extract the corresponding word and count
counts = [(inv_map[i], word_count_sentence[0, i]) for i in columns]

for word, count in counts:
    print(word, ": ", count)

NASA starts testing 'flying saucer', landing men on Mars 

flying :  1
landing :  1
mars :  1
men :  1
nasa :  1
on :  1
saucer :  1
starts :  1
testing :  1


We can now get the word counts (the bag-of-words representation) for every title by calling the transform method. This returns a sparse matrix (which you already used in the last hackathon) in which each row represents a sample, each column represents a word of the vocabulary, and each entry holds the number of times that word occurs in that sample.

In [11]:
word_count_matrix = vectorizer.transform(train_df['title'].values)
word_count_matrix.shape


(337935, 50140)
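
A dense array of this shape would have roughly 17 billion cells, almost all of them zero, which is why a sparse format is used. The sketch below illustrates this on three made-up titles (the `toy_*` names are assumptions for illustration): only the non-zero counts are stored.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Made-up titles, used only for illustration
toy_titles = ["stocks fall as fed hints at tapering",
              "fed official says weak data caused by weather",
              "stocks rise, stocks rally"]

toy_counts = CountVectorizer().fit_transform(toy_titles)

print(sparse.issparse(toy_counts))   # the result is a SciPy sparse matrix
print(toy_counts.shape)              # (number of titles, vocabulary size)
print(toy_counts.nnz)                # number of stored non-zero entries

# Converting to a dense array is only safe for tiny examples like this one
print(toy_counts.toarray())
```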

# TF-IDF (Term Frequency–Inverse Document Frequency)

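The next important step is to reweight the raw counts. Words that occur in almost every title (such as "on" or "the") carry little information about the category, yet they dominate the counts. TF-IDF multiplies each term's frequency in a title by its inverse document frequency, so terms that appear in many titles are down-weighted and rarer, more distinctive terms are up-weighted.

Scikit-learn comes with a handy tool called __[TF-IDF Transformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html)__ which does this for us.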

In [12]:
tfidf = TfidfTransformer()
tfidf.fit(word_count_matrix)

word_term_frequency_matrix = tfidf.transform(word_count_matrix)
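
To see what the transformer computes, the following sketch reproduces its output by hand on a tiny made-up count matrix. It assumes the transformer's default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the term. The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)   # number of documents containing each term

# Smoothed inverse document frequency (default smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale each row to unit Euclidean length (norm='l2')
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

The first term appears in every document, so its idf takes the smallest possible value (1), while the two rarer terms receive a larger weight.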

# Stemming

While this gives us a good initial representation, it can be further improved!

One such improvement is stemming. Stemming reduces words to their stem (a common base form), so that the counts of different inflections of the same word are merged.

For example, "Walking" and "Walked" would both become "walk", while "Handling" and "Handled" would both become "handl". As the second case shows, a stem does not have to be a dictionary word.

This transformation cleans up the bag-of-words representation and helps the learner by reducing the dimensionality of the problem.

Stemming is implemented in the __[NLTK](http://www.nltk.org/)__ (Natural Language Toolkit) Python package, which comes with many other useful text-processing tools, such as entity recognition and parsing.
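
As a quick check of the examples above, here is a minimal sketch using NLTK's `SnowballStemmer` for English with its default options (the `toy_stemmer` name is an illustrative assumption). The stemmer also lowercases its input.

```python
from nltk.stem.snowball import SnowballStemmer

toy_stemmer = SnowballStemmer("english")

# Different inflections of the same word collapse to a single stem
for word in ["Walking", "Walked", "Handling", "Handled"]:
    print(word, "->", toy_stemmer.stem(word))
```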

Here we will encapsulate the stemming module as a custom Scikit transformer. This transformer separates the sentences into words, stems the words and joins the sentence back together.

In [13]:
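# Custom transformer to implement stemming and sentence cleaning
from sklearn.base import BaseEstimator

# Also inheriting from BaseEstimator provides get_params/set_params,
# which scikit-learn expects from the steps of a pipeline
class StemmerTransformer(TransformerMixin, BaseEstimator):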
    def __init__(self):
        self.stemmer = SnowballStemmer("english", ignore_stopwords=True)
        self.tokenizer = RegexpTokenizer(r'\w+')
        
    def transform(self, X, *_):
        X = list(map(self._clean_sentence, X))
        return X
    
    def _clean_sentence(self, sentence):
        # Split sentence into list of words
        words = self.tokenizer.tokenize(sentence)
        
        # Filter out stopwords
        #words = [word for word in words if word not in stopwords.words('english')]
        
        # Filter out numbers
        words = [x for x in words if not x.isdigit()]
        
        # Stem words
        words = map(self.stemmer.stem, words)
        
        # Join list elements into string
        sentence = " ".join(words)
        
        return sentence
    
    def fit(self, *_):
        return self

In [14]:
stemmer = StemmerTransformer()

original_sentence = train_df['title'].values[25:26]
stemmed_sentence = stemmer.transform(original_sentence)

print("Original sentence:\n", original_sentence, "\n\nStemmed sentence:\n", stemmed_sentence)


Original sentence:
 ['Game of Thrones Gives Us the Best Wedding Gift Imaginable'] 

Stemmed sentence:
 ['game of throne give us the best wed gift imagin']


# Pipelines

To make the process from original text snippet to final feature vector as flexible and clean as possible, we will use the scikit-learn __[Pipeline API](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)__. Pipelines allow us to easily compose transformations and classifiers.

The main advantage of a pipeline is that it exposes the fit and predict functions itself. These automatically apply every transformation to the data before calling the classifier, which keeps the transformations consistent between training and validation data.

In [15]:
# Build the pipeline
text_clf = Pipeline([('stemm', StemmerTransformer()),
                   ('vect', CountVectorizer()),
                   ('tfidf', TfidfTransformer()),
                   ('clf', MultinomialNB())])
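
To illustrate what the pipeline does on our behalf, the sketch below compares it with the equivalent manual chain on a few made-up titles. The data and the `toy_*` names are assumptions for illustration, and the stemming step is left out to keep the example short.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training and validation data, for illustration only
toy_titles = ["stocks fall after fed announcement",
              "fed raises interest rates",
              "new smartphone released today",
              "tech giant unveils new laptop"]
toy_labels = ["b", "b", "t", "t"]
new_titles = ["fed hints at rate change", "laptop sales rise"]

# Manual version: fit every step on the training data, then re-apply each one by hand
vect = CountVectorizer().fit(toy_titles)
train_counts = vect.transform(toy_titles)
tfidf = TfidfTransformer().fit(train_counts)
clf = MultinomialNB().fit(tfidf.transform(train_counts), toy_labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(new_titles)))

# Pipeline version: fit and predict run the same chain for us
toy_pipeline = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])
toy_pipeline.fit(toy_titles, toy_labels)
pipeline_pred = toy_pipeline.predict(new_titles)

# Both approaches produce the same predictions
print((manual_pred == pipeline_pred).all())
```

With the manual version it is easy to forget a step or to accidentally refit a transformer on the validation data; the pipeline removes that risk.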

The final piece that is missing is converting the character labels into numeric labels through the Scikit __[Label Encoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)__ tool.
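
As a small illustration of what the encoder does, here is a sketch on a made-up list of single-letter labels (the label values and the `toy_encoder` name are assumptions for illustration). Each distinct label is mapped to its position in the sorted list of classes, and `inverse_transform` maps the numbers back.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
toy_encoder.fit(["b", "t", "e", "m", "b"])

print(toy_encoder.classes_)                    # distinct labels, sorted
print(toy_encoder.transform(["b", "t", "e"]))  # position of each label in classes_
print(toy_encoder.inverse_transform([0, 3]))   # back to the original labels
```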

In [16]:
# Encode the labels
le = preprocessing.LabelEncoder()
le.fit(train_df['category'].values)

# Work on explicit copies so that the assignments below do not
# trigger pandas' SettingWithCopyWarning
train_df = train_df.copy()
validation_df = validation_df.copy()

train_df['category'] = le.transform(train_df['category'].values)
validation_df['category'] = le.transform(validation_df['category'].values)


In [17]:
# Train the classifier

text_clf.fit(map(str, train_df['title'].values), train_df['category'].values)  

Pipeline(steps=[('stemm', <__main__.StemmerTransformer object at 0x000001D807B9F908>), ('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
      ...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [18]:
predicted = text_clf.predict(map(str, validation_df['title'].values))
np.mean(predicted == validation_df['category'])


0.92018607073528713

This learning unit introduced some of the tools you can use in practice to work with text data. In the next unit we will look at more advanced topics in text classification.