# Part 1: https://www.kaggle.com/matleonard/intro-to-nlp

In [1]:
import spacy
import pandas as pd


In [2]:
# spaCy is the leading library for NLP, and it has quickly become one of the most popular Python frameworks. Most people find it intuitive, and it has excellent documentation.
# spaCy relies on models that are language-specific and come in different sizes. You can load a spaCy model with spacy.load.
# For example, here's how you would load the English language model.

nlp=spacy.load('en')

In [3]:
doc=nlp("Tea is healthy and calming, don't you think?")
print(doc)

Tea is healthy and calming, don't you think?


In [4]:
# Tokenization: A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like "don't" into two tokens, "do" and "n't". You can see the tokens by iterating through the document.
for token in doc[0:3]:
    print(token)

Tea
is
healthy


# Text Preprocessing
1. Lemmatizing: the lemma is the base form of a word (lemmatization).
2. Removing stopwords: stopwords are words that occur frequently in the language and don't contain much information. English stopwords include "the", "is", "and", "but", "not". Removing stopwords helps predictive models focus on relevant words.
3. Stemming
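
The stopword-removal step can be sketched without spaCy. The tiny `STOPWORDS` set and the `remove_stopwords` helper below are hypothetical and chosen only for illustration; spaCy's own stopword list is much larger and is exposed through `token.is_stop`.

```python
# Hypothetical, hand-picked stopword set for illustration only.
STOPWORDS = {"the", "is", "and", "but", "not", "you", "do"}

def remove_stopwords(tokens):
    """Return the tokens whose lowercase form is not in STOPWORDS."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

tokens = ["Tea", "is", "healthy", "and", "calming"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['Tea', 'healthy', 'calming']
```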

In [17]:
# Print a header row, then one row per token with its lemma and stopword flag
print("Token\t\tLemma\t\tStopword")
print("-"*50)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")


Token		Lemma		Stopword
--------------------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False


# Pattern Matching
To match individual tokens, you create a Matcher. When you want to match a list of terms, it's easier and more efficient to use PhraseMatcher. 

For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest. First you create the PhraseMatcher itself.

In [20]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model.

In [21]:
terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)
# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)

[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]


Each match is a tuple of the match id and the start and end token positions of the matched phrase. The end position is exclusive, so text_doc[start:end] returns the phrase itself.

In [25]:
match_id, start, end = matches[2]
print(nlp.vocab.strings[match_id],text_doc[start:end])


TerminologyList iPhone XS
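
To make the (start, end) positions concrete, here is a minimal pure-Python sketch of case-insensitive phrase matching. It is not how spaCy implements PhraseMatcher; the `find_phrases` helper is hypothetical and it assumes simple whitespace tokenization rather than spaCy's tokenizer.

```python
def find_phrases(tokens, phrases):
    """Return (phrase, start, end) for each case-insensitive match; end is exclusive."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for phrase in phrases:
        target = [w.lower() for w in phrase.split()]
        n = len(target)
        # Slide a window of n tokens over the text and compare
        for start in range(len(lowered) - n + 1):
            if lowered[start:start + n] == target:
                matches.append((phrase, start, start + n))
    return sorted(matches, key=lambda m: m[1])

tokens = "pitting the iphone 11 Pro against the Galaxy Note 10 Plus".split()
found = find_phrases(tokens, ["Galaxy Note", "iPhone 11"])
print(found)  # [('iPhone 11', 2, 4), ('Galaxy Note', 7, 9)]
```

As with the spaCy matches, slicing `tokens[start:end]` recovers the matched words.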


# Part 2: https://www.kaggle.com/matleonard/text-classification

In [9]:
# Text Classification: This is "classification" in the conventional machine learning sense, and it is applied to text. Examples include spam detection, sentiment analysis, and tagging customer queries.
spam = pd.read_csv("/Users/shiwenwang/Machine Learning Project/Data/spam.csv",encoding='latin-1')

In [11]:
spam = spam[['v1','v2']]

In [12]:
spam.columns=['label','text']
spam.head()

|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Machine learning models don't learn from raw text data. Instead, you need to convert the text to something numeric.

The simplest common representation is a variation of one-hot encoding. You represent each document as a vector of term frequencies for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents). This is known as the bag of words representation.
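
A minimal sketch of this term-frequency encoding, using only the standard library. The three-document corpus and the `bag_of_words` helper are made up for illustration, and tokens are split on whitespace rather than with spaCy.

```python
from collections import Counter

# Toy corpus, invented for this example
corpus = ["tea is healthy", "tea is calming", "free tea free tea"]

# Vocabulary: every distinct token in the corpus, in sorted order
vocabulary = sorted({tok for doc in corpus for tok in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

vectors = [bag_of_words(doc) for doc in corpus]
print(vocabulary)  # ['calming', 'free', 'healthy', 'is', 'tea']
print(vectors)     # [[0, 0, 1, 1, 1], [1, 0, 0, 1, 1], [0, 2, 0, 0, 2]]
```

Every document gets a vector of the same length (the vocabulary size), regardless of how many words it contains.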

Another common representation is TF-IDF (Term Frequency - Inverse Document Frequency). TF-IDF is similar to bag of words, except that each term count is scaled down according to how many documents in the corpus contain the term, so terms that appear almost everywhere carry less weight. Using TF-IDF can potentially improve your models.
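
A small sketch of the TF-IDF idea, assuming the plain textbook weighting tf * log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries such as scikit-learn use smoothed variants, so their numbers will differ. The corpus and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, tokenized on whitespace for illustration
corpus = [doc.split() for doc in ["tea is healthy", "tea is calming", "free tea now"]]
n_docs = len(corpus)

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for doc in corpus for term in set(doc))

def tf_idf(doc):
    """Map each term in the document to count * log(N / document frequency)."""
    counts = Counter(doc)
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

weights = tf_idf(corpus[2])
print(weights)
```

Here "tea" occurs in every document, so its weight is zero, while "free" occurs in only one document and gets the largest weight.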


The TextCategorizer is a spaCy pipe. Pipes are classes for processing and transforming tokens. When you create a spaCy model with nlp = spacy.load('en_core_web_sm'), there are default pipes that perform part-of-speech tagging, entity recognition, and other transformations. When you run text through a model with doc = nlp("Some text here"), the outputs of the pipes are attached to the tokens in the doc object. The lemmas for token.lemma_ come from one of these pipes.

You can remove or add pipes to models. What we'll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we'll create a TextCategorizer pipe and add it to the empty model.

In [42]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Since the classes are either ham or spam, we set "exclusive_classes" to True. We've also configured it with the bag of words ("bow") architecture. spaCy provides a convolutional neural network architecture as well, but it's more complex than you need for now.

In [44]:
# add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")


0

# Training a Text Categorizer Model

Next, you'll convert the labels in the data to the form the TextCategorizer requires. For each document, you'll create a dictionary of boolean values for each class. For example, if a text is "ham", we need a dictionary {'ham': True, 'spam': False}. The model looks for these labels inside another dictionary with the key 'cats'.

In [49]:
train_texts=spam['text'].values
train_labels = [{'cats':{'ham':label=='ham',
                        'spam':label=='spam'}}
               for label in spam['label']]

train_labels[0:10]

[{'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': False, 'spam': True}}]

In [50]:
# Next, combine texts and labels into a single list
train_data = list(zip(train_texts,train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

# Train the Model
First, create an optimizer using nlp.begin_training(). spaCy uses this optimizer to update the model. In general it's more efficient to train models in small batches. spaCy provides the minibatch function that returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then used with nlp.update to update the model's parameters.

In [51]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)
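
The batching and splitting steps can be sketched in plain Python. The `simple_minibatch` generator is a hypothetical stand-in with a fixed batch size, not spaCy's `minibatch` implementation, and the three training examples are invented.

```python
def simple_minibatch(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover items form a final, smaller batch
        yield batch

data = [("text a", {"cats": {"ham": True, "spam": False}}),
        ("text b", {"cats": {"ham": False, "spam": True}}),
        ("text c", {"cats": {"ham": True, "spam": False}})]

batches = list(simple_minibatch(data, size=2))

# zip(*batch) turns a list of (text, label) pairs into a tuple of texts and a tuple of labels
texts, labels = zip(*batches[0])
print(texts)  # ('text a', 'text b')
```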

In [52]:
# The loop above is just one epoch; the model needs multiple epochs, so wrap it in another loop
# and optionally reshuffle the training data at the beginning of each epoch
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

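# Note: losses is created once, outside the epoch loop, so nlp.update keeps adding to it
# and the values printed below are running totals across epochs, not per-epoch losses
losses = {}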
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 0.43193236928277656}
{'textcat': 0.647564146202626}
{'textcat': 0.7843300964563298}
{'textcat': 0.8716628925737462}
{'textcat': 0.9279837872985808}
{'textcat': 0.9654547567888858}
{'textcat': 0.993859635343282}
{'textcat': 1.0126842369795321}
{'textcat': 1.0274316810190278}
{'textcat': 1.0376987757931415}


# Making Predictions

Now that you have a trained model, you can make predictions with the predict() method. The input text needs to be tokenized with nlp.tokenizer. Then you pass the tokens to the predict method, which returns scores. Each score is the probability that the input text belongs to the corresponding class.

In [64]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA",
        "Buy one get one free tea!!",
        "VOTE for TEA!"]
docs = [nlp.tokenizer(text) for text in texts]

In [65]:
textcat = nlp.get_pipe('textcat')

In [66]:
scores,_ = textcat.predict(docs)

In [67]:
print(scores)

[[9.9994242e-01 5.7520185e-05]
 [1.1534694e-02 9.8846531e-01]
 [9.5188344e-01 4.8116509e-02]
 [8.2792711e-01 1.7207293e-01]]


In [68]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])

['ham', 'spam', 'ham', 'ham']


# Part 3: Word Embeddings https://www.kaggle.com/matleonard/word-vectors

Word embeddings represent words as numeric vectors.

These vectors can be used as features for machine learning models. Word vectors will typically improve the performance of your models over bag of words encoding. spaCy provides embeddings learned from a model called Word2Vec. You can access them by loading a large model such as en_core_web_lg. They are then available on tokens through the .vector attribute.

Word2Vec is a neural network model.

In [5]:
import spacy
import en_core_web_lg
import numpy as np

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

# Disabling other pipes because we don't need them and it'll speed up this part a bit
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes():
    vectors = np.array([token.vector for token in nlp(text)])

In [7]:
vectors.shape

(12, 300)

These are 300-dimensional vectors, with one vector for each word. However, we only have document-level labels and our models won't be able to use the word-level embeddings. So, you need a vector representation for the entire document.

There are many ways to combine all the word vectors into a single document vector we can use for model training. A simple and surprisingly effective approach is simply averaging the vectors for each word in the document. Then, you can use these document vectors for modeling.
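
The averaging step can be sketched with plain lists. The `average_vectors` helper and the two 3-dimensional "word vectors" are made up for illustration; real spaCy vectors from en_core_web_lg have 300 dimensions.

```python
def average_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(column) / n for column in zip(*vectors)]

word_vectors = [[1.0, 0.0, 2.0],
                [3.0, 4.0, 0.0]]
doc_vector = average_vectors(word_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

The result has the same dimensionality as a single word vector, no matter how many words the document contains.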

spaCy calculates the average document vector which you can get with doc.vector. Here is an example loading the spam data and converting it to document vectors.

In [13]:
with nlp.disable_pipes():
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
    
doc_vectors.shape

(5572, 300)

# Classification Models
With the document vectors, you can train scikit-learn models, xgboost models, or any other standard approach to modeling.

Here, we use a linear support vector machine classifier (LinearSVC from scikit-learn).


In [15]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

from sklearn.svm import LinearSVC

# Set dual=False to speed up training; the dual formulation is not needed here
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Accuracy: 97.312%


# Document Similarity
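Documents with similar content generally have similar vectors, so you can find similar documents by measuring the similarity between their vectors. A common metric for this is cosine similarity, which is the cosine of the angle $\theta$ between two vectors $\mathbf{a}$ and $\mathbf{b}$:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

This is the dot product of $\mathbf{a}$ and $\mathbf{b}$ divided by the product of their magnitudes. The cosine similarity ranges from -1 (vectors pointing in opposite directions) to 1 (vectors pointing in the same direction).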

In [18]:
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)

0.7030031
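
To get a feel for the range of values, here is a small standalone sketch using plain Python lists instead of NumPy arrays. The `cosine` helper and the 2-dimensional vectors are invented for illustration.

```python
import math

def cosine(a, b):
    """Dot product of a and b divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same = cosine([3.0, 4.0], [6.0, 8.0])        # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # perpendicular vectors
opposite = cosine([3.0, 4.0], [-3.0, -4.0])  # opposite direction

print(same, orthogonal, opposite)  # 1.0 0.0 -1.0
```

Note that the length of the vectors does not matter, only their direction, which is why the first pair scores 1.0 even though one vector is twice as long as the other.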