# Natural Language Processing

In [1]:
# # Uncomment the below line to install
 ! pip install spacy
 ! python -m spacy download en_core_web_md

Collecting numpy>=1.15.0 (from spacy)
[?25l  Downloading https://files.pythonhosted.org/packages/16/21/2e88568c134cc3c8d22af290865e2abbd86efa58a1358ffcb19b6c74f9a3/numpy-1.15.3-cp36-cp36m-manylinux1_x86_64.whl (13.9MB)
[K    100% |████████████████████████████████| 13.9MB 2.3MB/s 
Installing collected packages: numpy
  Found existing installation: numpy 1.14.6
    Uninstalling numpy-1.14.6:
      Successfully uninstalled numpy-1.14.6
Successfully installed numpy-1.15.3
Collecting en_core_web_md==2.0.0 from https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz#egg=en_core_web_md==2.0.0
[?25l  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.0.0/en_core_web_md-2.0.0.tar.gz (120.8MB)
[K    100% |████████████████████████████████| 120.9MB 63.6MB/s 
[?25hInstalling collected packages: en-core-web-md
  Running setup.py install for en-core-web-md ... [?25l- \ | / - \ | / - \ 

In [0]:
import sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import classification_report, f1_score, accuracy_score
from sklearn.svm import LinearSVC
import numpy as np
import spacy

In [3]:
data = fetch_20newsgroups(subset="all")

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


In [4]:
print(data.DESCR)

None


In [5]:
text = data["data"]
target = data["target"]
print("The following are the 20 topics that an article can belong to:")
print(data["target_names"])

The following are the 20 topics that an article can belong to:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [0]:
X_train, X_test, y_train, y_test = train_test_split(text, target, random_state=0)

In [7]:
print(f"The training dataset contains {len(X_train)} articles.")
print(f"The test dataset contains {len(X_test)} articles.")

The training dataset contains 14134 articles.
The test dataset contains 4712 articles.


Scikit learn implements the BoW feature representation using [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), and it also has implementations for [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) and [hashed vector](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) representations.

Determine the feature representations of our dataset using each of those approaches.

In [8]:
%%time
# Use English stopwords and produce a BoW representation for the data using up to trigrams
# Save the vectorizer as counter and the transformed data as X_train_bow, and X_test_bow
# YOUR CODE HERE

counter = CountVectorizer(stop_words = 'english', ngram_range=(1,3))

X_train_bow = counter.fit_transform(X_train)
X_test_bow = counter.transform(X_test)

CPU times: user 32.5 s, sys: 1 s, total: 33.5 s
Wall time: 33.4 s


In [0]:
assert counter
assert counter.stop_words == "english"
assert counter.ngram_range == (1,3)
assert len(counter.get_feature_names()) == 3034327
assert X_train_bow.shape == (14134, 3034327)
assert X_test_bow.shape == (4712, 3034327)

Note that sklearn implements a [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html). The main difference between the two is in the inputs to fitting and transforming. The [Vectorizer's fit/transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) take an input of text whereas the [transformer's](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer.fit) take an input of a BoW vector. Given that we already determined the BoW vectors, it would be more time efficient to use TfidfTransformer.

In [10]:
%%time
# Use the BoW representation you just created above to produce a TFIDF representation of the data
# Save the transformer to tfidfer and the transformed data as X_train_tfidf, and X_test_tfidf

# YOUR CODE HERE
tfidfer = TfidfTransformer()

X_train_tfidf = tfidfer.fit_transform(X_train_bow)
X_test_tfidf = tfidfer.transform(X_test_bow)

CPU times: user 2.68 s, sys: 14.5 ms, total: 2.7 s
Wall time: 2.7 s


In [0]:
assert tfidfer
assert X_train_tfidf.shape  == (14134, 3034327)
assert X_test_tfidf.shape  == (4712, 3034327)

Now use the hashing vectorizer to do the same.

In [12]:
%%time 
# Use English stopwords and produce a Hashed vector representation for the data using up to trigrams
# Save the vectorizer as hasher and the transformed data as X_train_hash, and X_test_hash
# Make sure you set non_negative to True so we can use this representation with Multinomial Naive Bayes later in the exercise

# YOUR CODE HERE

hasher = HashingVectorizer(stop_words = 'english', ngram_range=(1,3), non_negative=True)

X_train_hash = hasher.fit_transform(X_train)
X_test_hash = hasher.transform(X_test)



CPU times: user 7.92 s, sys: 33.3 ms, total: 7.95 s
Wall time: 7.96 s


In [0]:
assert hasher
assert hasher.stop_words == "english"
assert hasher.ngram_range == (1,3)
assert X_train_hash.shape == (14134, 1048576)
assert X_test_hash.shape == (4712, 1048576)

Compare the time it took to run the count vectorizer vs the hasing vectorizer even though they both will iterate through all the words.

Now recall [Naive Bayes Classification](http://scikit-learn.org/stable/modules/naive_bayes.html) which we discussed early on in the supervised learning lectures. We will use Naive Bayes classifiers to predict the topic of the articles and compare our feature representations. Use a Multinomial Naive Bayes classifier to predict the topics.

In [14]:
for feat_name, train_feat, test_feat in zip(["Bag of Words", "TF-IDF", "Hashing"],[X_train_bow, X_train_tfidf, X_train_hash], [X_test_bow, X_test_tfidf, X_test_hash]):
    # Create a Multinomial Naive Bayes model saved to `mnb` and fit it to train_feat
    # YOUR CODE HERE
    mnb = MultinomialNB()
    train_feat = mnb.fit(train_feat,y_train)
    y_pred = mnb.predict(test_feat)
    print(f"Results for {feat_name}")
    print("-"*80)
    print(classification_report(y_test, y_pred))
    print("-"*80)


Results for Bag of Words
--------------------------------------------------------------------------------
             precision    recall  f1-score   support

          0       0.91      0.94      0.92       205
          1       0.78      0.87      0.82       245
          2       0.92      0.76      0.83       250
          3       0.77      0.83      0.80       243
          4       0.89      0.85      0.87       255
          5       0.84      0.91      0.88       240
          6       0.90      0.75      0.82       249
          7       0.89      0.90      0.89       219
          8       0.96      0.91      0.94       246
          9       0.92      0.97      0.94       227
         10       0.96      0.98      0.97       287
         11       0.88      0.97      0.92       234
         12       0.93      0.82      0.87       247
         13       0.93      0.92      0.93       250
         14       0.90      0.96      0.93       240
         15       0.93      0.95      0.94   

In [0]:
assert isinstance(mnb, MultinomialNB)

## Learned Embeddings

We will use [spacy](https://spacy.io/) for more sophisticated NLP. Make sure you downloaded the english model in the commented code at the top of the notebook before proceeding. It may take some time to download.

Spacy allows us to parse text and automatically does the following:
- tokenization
- lemmatization
- sentence splitting
- entity recognition
- token vector representation


In [16]:
%%time
nlp = spacy.load("en_core_web_md")

CPU times: user 16.1 s, sys: 343 ms, total: 16.4 s
Wall time: 16.5 s


In [0]:
text = "This is the first sentence in this test string. The quick brown fox jumps over the lazy dog."

parsed_text = nlp(text)

In [18]:
for sent in parsed_text.sents:
    print(f"Analyzing sentence: {sent}")
    print(f"Lemmatization: {sent.lemma_}")
    for token in sent:
        print(f"Analyzing token: {token}")
        if token.is_sent_start:
            print("This token is the first one in the sentence")
        if token.is_stop:
            print("Stop word")
        else:
            print("Not stop word")
        print(f"Entity type: {token.ent_type_}")
        print(f"Part of speech: {token.pos_}")
        print(f"Lemma: {token.lemma_}")
        print("-"*10)
    print("-"*50)

Analyzing sentence: This is the first sentence in this test string.
Lemmatization: this be the first sentence in this test string .
Analyzing token: This
Not stop word
Entity type: 
Part of speech: DET
Lemma: this
----------
Analyzing token: is
Not stop word
Entity type: 
Part of speech: VERB
Lemma: be
----------
Analyzing token: the
Not stop word
Entity type: 
Part of speech: DET
Lemma: the
----------
Analyzing token: first
Not stop word
Entity type: ORDINAL
Part of speech: ADJ
Lemma: first
----------
Analyzing token: sentence
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: sentence
----------
Analyzing token: in
Not stop word
Entity type: 
Part of speech: ADP
Lemma: in
----------
Analyzing token: this
Not stop word
Entity type: 
Part of speech: DET
Lemma: this
----------
Analyzing token: test
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: test
----------
Analyzing token: string
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: string
----------
Analyzing token:

In [0]:
### Come up with a couple sentences to test out and set the text to my_text
### You can go to your favorite website or news source and copy a paragraph from there

# YOUR CODE HERE
my_text = 'Hello, this is a sample text. Hello hello hello. Not sure what to say.'

In [0]:
assert len(my_text) > 10
assert my_text.count(".") > 2

In [26]:
parsed = nlp(my_text)
for sent in parsed.sents:
    print(f"Analyzing sentence: {sent}")
    print(f"Lemmatization: {sent.lemma_}")
    for token in sent:
        print(f"Analyzing token: {token}")
        if token.is_sent_start:
            print("This token is the first one in the sentence")
        if token.is_stop:
            print("Stop word")
        else:
            print("Not stop word")
        print(f"Entity type: {token.ent_type_}")
        print(f"Part of speech: {token.pos_}")
        print(f"Lemma: {token.lemma_}")
        print("-"*10)
    print("-"*50)

Analyzing sentence: Hello, this is a sample text.
Lemmatization: hello , this be a sample text .
Analyzing token: Hello
Not stop word
Entity type: 
Part of speech: INTJ
Lemma: hello
----------
Analyzing token: ,
Not stop word
Entity type: 
Part of speech: PUNCT
Lemma: ,
----------
Analyzing token: this
Not stop word
Entity type: 
Part of speech: DET
Lemma: this
----------
Analyzing token: is
Not stop word
Entity type: 
Part of speech: VERB
Lemma: be
----------
Analyzing token: a
Not stop word
Entity type: 
Part of speech: DET
Lemma: a
----------
Analyzing token: sample
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: sample
----------
Analyzing token: text
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: text
----------
Analyzing token: .
Not stop word
Entity type: 
Part of speech: PUNCT
Lemma: .
----------
--------------------------------------------------
Analyzing sentence: Hello hello hello.
Lemmatization: hello hello hello .
Analyzing token: Hello
This token is the f

If we use the larger spacy models, we get the GloVe representation for some words based on a pre-trained model. The GloVe vectors should be in 300 dimensions.

In [27]:
token.vector.shape

(300,)

Given that the parsing of text takes some time, we will only consider the first 1000 articles in our data.

In [0]:
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(X_train[:1000], y_train[:1000], random_state=0)

In [31]:
print(new_X_train)



In [55]:
%%time
# Using nlp from above, parse every instance of new_X_train
# save the document vectors to a np.array called X_train_glove

# YOUR CODE HERE
parsed=nlp(my_text)
X_train_glove = np.array(parsed)

CPU times: user 24.7 ms, sys: 14.1 ms, total: 38.8 ms
Wall time: 39.1 ms


We will not cover LDA in this exercise but if you are interested in topic modeling, you should check out [Gensim](https://radimrehurek.com/gensim/) and its [LDA implementation](https://radimrehurek.com/gensim/models/ldamodel.html).

## Feedback

In [0]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    raise NotImplementedError()