ML Course, Bogotá, Colombia  (© Josh Bloom; June 2019)

In [1]:
%run ../talktools.py

# Learning on Sequences

Time series data (e.g., from sensors) and text are classic examples of sequential data types. 

We saw in the first lecture how to featurize time series and natural language. We saved the features from our TF-IDF work on the dual newsgroup classification challenge. Let's reload that now and learn a model.

In [27]:
import numpy as np
data = np.load("../1_ComputationalAndInferentialThinking/tfidf.npz")
X = data["X"]
y = data["y"]

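# Fraction of examples in the positive class (assuming y holds 0/1 labels);
# with balanced classes this equals the accuracy of random guessing.
print(f"Baseline accuracy from random guessing: {y.sum()/len(y):0.3f}")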

Baseline accuracy from random guessing: 0.500


In [28]:
from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier

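# 4 choices of max_features (mtry) x 2 split criteria = 8 candidates
# Note: 'auto' (same as 'sqrt' for classifiers) is deprecated in recent scikit-learn; use 'sqrt' there.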
parameters = {'n_estimators':[100],  'max_features':[5, 8,10,'auto'], 
                        'criterion': ['gini','entropy']}

rf_tune = model_selection.GridSearchCV(RandomForestClassifier(), parameters, 
                                   n_jobs = -1, cv = 5,verbose=1)
rf_opt = rf_tune.fit(X, y)

print("Best zero-one score: " + str(rf_opt.best_score_) + "\n")
print("Optimal Model:\n" + str(rf_opt.best_estimator_))

Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    4.2s finished


Best zero-one score: 0.8559393428812131

Optimal Model:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features=8, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


As noted in that first lecture, TF-IDF tends to discard the context in which words appear, both within sentences and with respect to each other. BoW and TF-IDF features also tend to be sparse, which generally makes them more challenging to learn on.
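
To make the loss of context concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two example sentences are illustrative assumptions, not part of the earlier lecture code. TF-IDF only reweights these per-term counts, so it inherits the same blindness to word order.

```python
from collections import Counter

def bag_of_words(text):
    """Hypothetical helper: count whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())

# Two sentences with opposite meanings but the same multiset of words
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# The bags are identical, so any model fed these features cannot tell them apart
same_bag = (a == b)
print(same_bag)
```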

This leads us to an important concept in learning, called **"Embedding".**

Embedding is a deep learning approach to understanding word meaning (in the context of sentences), and by extension the meaning of paragraphs and documents. The main module for this is gensim. See https://radimrehurek.com/gensim/models/word2vec.html

Talk on word2vec: https://www.slideshare.net/ChristopherMoody3/word2vec-lda-and-introducing-a-new-hybrid-algorithm-lda2vec-57135994

https://www.kernix.com/blog/similarity-measure-of-textual-documents_p12

https://github.com/sdimi/average-word2vec/blob/master/notebook.ipynb

Doc2vec on newsgroups: https://github.com/skillachie/nlpArea51/tree/master/doc2vec
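
As a rough sketch of what an embedding is mechanically, the block below treats it as a lookup table that maps each word to a dense vector, then averages word vectors to get a document vector. Everything here is an assumption for illustration: the toy `vocab`, the random matrix `E` (in practice these vectors would be learned, for example by word2vec), and the helpers `embed_document` and `cosine`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and embedding matrix: one dense row per word.
# Random values stand in for vectors that a real model would learn.
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3, "cat": 4}
embedding_dim = 8
E = rng.normal(size=(len(vocab), embedding_dim))

def embed_document(text, vocab, E):
    """Average the vectors of all in-vocabulary tokens (hypothetical helper)."""
    idx = [vocab[t] for t in text.lower().split() if t in vocab]
    if not idx:
        return np.zeros(E.shape[1])  # no known words: fall back to zeros
    return E[idx].mean(axis=0)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

doc_a = embed_document("the dog bit the man", vocab, E)
doc_b = embed_document("the cat bit the man", vocab, E)
print(doc_a.shape, cosine(doc_a, doc_b))
```

Unlike the sparse TF-IDF features, each document is now a short dense vector. Note that plain averaging still ignores word order; the gain comes from the word vectors themselves, which are trained so that words used in similar contexts end up close together.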

In [29]:
import spacy


In [30]:
spacy

<module 'spacy' from '/Users/jbloom/anaconda3/lib/python3.6/site-packages/spacy/__init__.py'>