# NLP on streaming data

In [57]:
from sklearn import datasets

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups = datasets.fetch_20newsgroups(
    subset='all',
    remove=['headers', 'footers', 'quotes'],
    categories=categories
)
stream = list(zip(
    newsgroups.data,
    (newsgroups.target_names[i] for i in newsgroups.target)
))
len(stream)

3387

In [7]:
text, label = stream[0]
print(text)
print(label)

My point is that you set up your views as the only way to believe.  Saying 
that all eveil in this world is caused by atheism is ridiculous and 
counterproductive to dialogue in this newsgroups.  I see in your posts a 
spirit of condemnation of the atheists in this newsgroup bacause they don'
t believe exactly as you do.  If you're here to try to convert the atheists 
here, you're failing miserably.  Who wants to be in position of constantly 
defending themselves agaist insulting attacks, like you seem to like to do?!
I'm sorry you're so blind that you didn't get the messgae in the quote, 
everyone else has seemed to.
alt.atheism


**Question 🤔: compared to the [anomaly detection notebook](anomaly_detection.ipynb), what is the practical difference with this dataset?**

## Bag of words extraction

In [8]:
from river import feature_extraction

vectorizer = feature_extraction.BagOfWords()

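for text, label in stream:
    # learn_one updates the vectorizer in place. Recent River versions
    # return None from learn_one, so the result is not reassigned.
    vectorizer.learn_one(text)
    vector = vectorizer.transform_one(text)
    break  # only inspect the first document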

vector

{'my': 1,
 'point': 1,
 'is': 3,
 'that': 3,
 'you': 7,
 'set': 1,
 'up': 1,
 'your': 2,
 'views': 1,
 'as': 2,
 'the': 5,
 'only': 1,
 'way': 1,
 'to': 8,
 'believe': 2,
 'saying': 1,
 'all': 1,
 'eveil': 1,
 'in': 6,
 'this': 3,
 'world': 1,
 'caused': 1,
 'by': 1,
 'atheism': 1,
 'ridiculous': 1,
 'and': 1,
 'counterproductive': 1,
 'dialogue': 1,
 'newsgroups': 1,
 'see': 1,
 'posts': 1,
 'spirit': 1,
 'of': 3,
 'condemnation': 1,
 'atheists': 2,
 'newsgroup': 1,
 'bacause': 1,
 'they': 1,
 'don': 1,
 'exactly': 1,
 'do': 2,
 'if': 1,
 're': 3,
 'here': 2,
 'try': 1,
 'convert': 1,
 'failing': 1,
 'miserably': 1,
 'who': 1,
 'wants': 1,
 'be': 1,
 'position': 1,
 'constantly': 1,
 'defending': 1,
 'themselves': 1,
 'agaist': 1,
 'insulting': 1,
 'attacks': 1,
 'like': 2,
 'seem': 1,
 'sorry': 1,
 'so': 1,
 'blind': 1,
 'didn': 1,
 'get': 1,
 'messgae': 1,
 'quote': 1,
 'everyone': 1,
 'else': 1,
 'has': 1,
 'seemed': 1}

**Question 🤔: what do you notice about these tokens?**
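
One way to explore this question is to reproduce the tokens by hand. The sketch below uses only the standard library and assumes lowercasing plus a `\b\w\w+\b` token pattern (the default pattern of scikit-learn's `CountVectorizer`). River's actual defaults may differ, and `bag_of_words` is a hypothetical helper, not part of River.

```python
import re
from collections import Counter

# Assumed token pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(text):
    """Hypothetical helper: lowercase the text, tokenize it, count tokens."""
    return Counter(TOKEN_PATTERN.findall(text.lower()))


counts = bag_of_words("If you're here, you didn't get it.")
print(dict(counts))
```

Compare the keys of `counts` with the original sentence, in particular around apostrophes and one-letter words.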

## TF-IDF

In [9]:
from river import feature_extraction

vectorizer = feature_extraction.TFIDF()

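for text, label in stream:
    # learn_one updates the document statistics in place. Recent River
    # versions return None from learn_one, so the result is not reassigned.
    vectorizer.learn_one(text)
    vector = vectorizer.transform_one(text)
    break  # only inspect the first document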

vector

{'my': 0.05754353376484363,
 'point': 0.05754353376484363,
 'is': 0.1726306012945309,
 'that': 0.1726306012945309,
 'you': 0.4028047363539054,
 'set': 0.05754353376484363,
 'up': 0.05754353376484363,
 'your': 0.11508706752968725,
 'views': 0.05754353376484363,
 'as': 0.11508706752968725,
 'the': 0.2877176688242182,
 'only': 0.05754353376484363,
 'way': 0.05754353376484363,
 'to': 0.460348270118749,
 'believe': 0.11508706752968725,
 'saying': 0.05754353376484363,
 'all': 0.05754353376484363,
 'eveil': 0.05754353376484363,
 'in': 0.3452612025890618,
 'this': 0.1726306012945309,
 'world': 0.05754353376484363,
 'caused': 0.05754353376484363,
 'by': 0.05754353376484363,
 'atheism': 0.05754353376484363,
 'ridiculous': 0.05754353376484363,
 'and': 0.05754353376484363,
 'counterproductive': 0.05754353376484363,
 'dialogue': 0.05754353376484363,
 'newsgroups': 0.05754353376484363,
 'see': 0.05754353376484363,
 'posts': 0.05754353376484363,
 'spirit': 0.05754353376484363,
 'of': 0.17263060129453

**Question 🤔: knowing how TF-IDF works, what difference does its online variant have?**
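
To experiment with this question, here is a minimal sketch of a TF-IDF whose document frequencies are updated one document at a time. It is not River's implementation: the class name `OnlineTFIDF`, the token pattern, and the smoothed IDF formula `log((1 + n) / (1 + df)) + 1` followed by L2 normalization are all assumptions, chosen because they are consistent with the output above.

```python
import math
import re
from collections import Counter


class OnlineTFIDF:
    """Hypothetical sketch of an incrementally updated TF-IDF."""

    def __init__(self):
        self.n_docs = 0            # number of documents seen so far
        self.doc_freq = Counter()  # term -> number of seen documents containing it

    @staticmethod
    def _tokens(text):
        return re.findall(r"\b\w\w+\b", text.lower())

    def learn_one(self, text):
        self.n_docs += 1
        self.doc_freq.update(set(self._tokens(text)))

    def transform_one(self, text):
        counts = Counter(self._tokens(text))
        # Assumed smoothed IDF, computed from the statistics seen so far.
        weights = {
            term: tf * (math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / norm for t, w in weights.items()} if norm else {}


tfidf = OnlineTFIDF()
tfidf.learn_one("space shuttle launch")
before = tfidf.transform_one("space shuttle launch")

tfidf.learn_one("space station orbit")
after = tfidf.transform_one("space shuttle launch")

print(before)
print(after)
```

Printing `before` and `after` shows how the vector of the same document depends on how much of the stream has been seen.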

## Progressive validation

In [10]:
from river import evaluate
from river import metrics
from river import naive_bayes

model = (
    feature_extraction.BagOfWords() |
    naive_bayes.MultinomialNB()
)

metric = metrics.Accuracy() + metrics.MacroF1()

evaluate.progressive_val_score(stream, model, metric, print_every=1000)

[1,000] Accuracy: 68.47%, MacroF1: 67.29%
[2,000] Accuracy: 72.49%, MacroF1: 71.03%
[3,000] Accuracy: 74.66%, MacroF1: 73.16%
[3,387] Accuracy: 74.96%, MacroF1: 73.49%


Accuracy: 74.96%, MacroF1: 73.49%

**Question 🤔: what makes the comparison with a batch approach difficult?**
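
As a reference point for this question, the sketch below spells out the predict-then-learn loop that progressive validation is built on. It is a simplified illustration, not River's `progressive_val_score`: `MajorityClass` and `progressive_accuracy` are hypothetical names, and the toy stream is made up.

```python
from collections import Counter


class MajorityClass:
    """Toy classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(stream, model):
    correct = total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # 1. predict before the label is used
        correct += y_pred == y         # 2. score that prediction
        total += 1
        model.learn_one(x, y)          # 3. only then learn from the sample
    return correct / total if total else 0.0


toy_stream = [("a", "spam"), ("b", "spam"), ("c", "ham"), ("d", "spam")]
accuracy = progressive_accuracy(toy_stream, MajorityClass())
print(accuracy)
```

Every sample is used first for testing and then for training, so no separate hold-out set is needed, and early predictions are made by a model that has seen very little data.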

## Mini-batching

In [53]:
def batch(stream, size):
    batch = []
    for x, y in stream:
        batch.append((x, y))
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

for mini_batch in batch(stream, size=1000):
    print(len(mini_batch))

1000
1000
1000
387


In [74]:
from sklearn import feature_extraction
from sklearn import naive_bayes

vectorizer = feature_extraction.text.CountVectorizer()
model = naive_bayes.GaussianNB()

for mini_batch in batch(stream, size=1000):
    X, y = zip(*mini_batch)
    X = vectorizer.fit_transform(X).toarray()
    model.partial_fit(X, y, classes=categories)

ValueError: X has 17444 features, but GaussianNB is expecting 19939 features as input.

**Question 🤔: what is the issue?**

A common way of dealing with a varying number of features is the ["hashing trick"](https://www.wikiwand.com/en/Feature_hashing): each token is hashed to one of a fixed number of columns, so the output width no longer depends on the vocabulary. scikit-learn provides a `HashingVectorizer`, which combines the tokenization of [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) with the hashing of [`FeatureHasher`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html).
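
The idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not scikit-learn's implementation: `hash_vectorize` is a hypothetical helper, `zlib.crc32` stands in for the hash function, and there is no sign alternation or normalization.

```python
import re
import zlib


def hash_vectorize(text, n_features=8):
    """Hypothetical helper: map a text to a fixed-length count vector."""
    vector = [0] * n_features
    for token in re.findall(r"\b\w\w+\b", text.lower()):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


v1 = hash_vectorize("the shuttle reached orbit")
v2 = hash_vectorize("completely unseen vocabulary appears here")
print(len(v1), len(v2))
```

Both vectors have the same length even though the two texts share no words. The price is that different tokens can collide in the same column, which becomes more likely as `n_features` shrinks.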

In [73]:
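from sklearn import feature_extraction
from sklearn import naive_bayes

# Hashing maps every token to one of n_features columns, so each
# mini-batch produces a matrix with the same width.
vectorizer = feature_extraction.text.HashingVectorizer(n_features=2000)
model = naive_bayes.GaussianNB()

for mini_batch in batch(stream, size=1000):
    X, y = zip(*mini_batch)
    # HashingVectorizer is stateless, so transform() needs no prior fit.
    # GaussianNB requires a dense array, hence toarray().
    X = vectorizer.transform(X).toarray()
    model.partial_fit(X, y, classes=categories)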