## Text Classification

In [1]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('/Users/sudhanshukumar/Documents/Development/Machine Learning/0 csv files/SMSSpamCollection.csv',sep="\t",names=["label","text"])
spam.head(10)

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [2]:
import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Each message is either ham or spam, never both, so "exclusive_classes" is True.
# The "bow" architecture is a bag-of-words model.

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

In [3]:
# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

1

Next, you'll convert the labels in the data to the form TextCategorizer requires. For each document, you'll create a dictionary of boolean values for each class.

For example, if a text is "ham", we need a dictionary {'ham': True, 'spam': False}. The model is looking for these labels inside another dictionary with the key 'cats'.
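As a minimal sketch of that format, the helper below builds the annotation for a single label. The name `to_cats` and its `classes` parameter are hypothetical and only serve the illustration; they are not part of spaCy.

```python
def to_cats(label, classes=("ham", "spam")):
    """Return the {'cats': {...}} annotation for one label (hypothetical helper)."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    # One boolean per class: True only for the class that matches the label
    return {"cats": {name: name == label for name in classes}}


print(to_cats("ham"))   # {'cats': {'ham': True, 'spam': False}}
print(to_cats("spam"))  # {'cats': {'ham': False, 'spam': True}}
```

Raising on an unknown label is a design choice for the sketch: a typo in the label column would otherwise silently produce a dictionary with every class set to False.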


In [4]:
train_texts = spam['text'].values  # .values returns a NumPy array of the text values

# Build one {'cats': {...}} dictionary per message with a list comprehension
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}}
                for label in spam['label']]

# train_labels is a list of dictionaries, e.g. {'cats': {'ham': True, 'spam': False}}

In [5]:
train_data = list(zip(train_texts, train_labels))
#zips them into a list of tuples -> every tuple contains->  (text,dictionary)

train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

 **Now you are ready to train the model. First, create an optimizer using nlp.begin_training(). spaCy uses this optimizer to update the model. In general it's more efficient to train models in small batches. spaCy provides the minibatch function that returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then used with nlp.update to update the model's parameters.**
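To make the batching and splitting steps concrete without needing spaCy, here is a rough pure-Python stand-in. `simple_minibatch` and `toy_data` are hypothetical names; the function only imitates the fixed-size behaviour described above and is not spaCy's own implementation.

```python
from itertools import islice


def simple_minibatch(items, size):
    """Yield consecutive lists of at most `size` items (illustrative stand-in)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Toy examples in the same (text, annotation) shape as the training data
toy_data = [
    ("text a", {"cats": {"ham": True, "spam": False}}),
    ("text b", {"cats": {"ham": False, "spam": True}}),
    ("text c", {"cats": {"ham": True, "spam": False}}),
]

for batch in simple_minibatch(toy_data, size=2):
    # zip(*batch) turns a list of (text, label) pairs into two parallel tuples
    texts, labels = zip(*batch)
    print(texts, labels)
```

With three examples and a batch size of 2, the last batch holds a single example, which is why the batch size is only an upper bound.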

In [6]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()   # spaCy uses this optimizer to update the model

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)     # a generator that yields batches of up to 8 examples

# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)

In [7]:
# To see what happens inside the loop above, inspect a single batch
spacy.util.fix_random_seed(1)

batches = minibatch(train_data, size=2)
for batch in batches:
    # print(batch, "\n \n")  # would print the 2 training examples in this batch
    # batch is a list of (text, dictionary) tuples, i.e. a slice of train_data

    text, label = zip(*batch)
    print(text, label)
    break  # stop after the first batch

('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...') ({'cats': {'ham': True, 'spam': False}}, {'cats': {'ham': True, 'spam': False}})


**This is just one training loop (or epoch) through the data. The model will typically need multiple epochs. Use another loop for more epochs, and optionally reshuffle the training data at the beginning of each loop.**



In [8]:
import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

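# losses is created once, outside the epoch loop, and nlp.update keeps adding to it,
# so each printed value is the running total so far, not that epoch's own loss
losses = {}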
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)

{'textcat': 0.40321298287898344}
{'textcat': 0.6062885257713688}
{'textcat': 0.7319823148809972}
{'textcat': 0.8131319836659416}
{'textcat': 0.8654249888212453}
{'textcat': 0.9014128989366033}
{'textcat': 0.9260193699411327}
{'textcat': 0.9439140947613844}
{'textcat': 0.9558685156792839}
{'textcat': 0.9663389857573353}
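The printed values grow from epoch to epoch because the `losses` dictionary is created once before the loop, so they appear to be running totals rather than per-epoch losses. Under that assumption, a short sketch can recover the loss of each epoch by differencing consecutive values. The numbers are copied from the output above; `cumulative` and `per_epoch` are names introduced only for this illustration.

```python
# Running totals printed by the training loop, one per epoch
cumulative = [
    0.40321298287898344, 0.6062885257713688, 0.7319823148809972,
    0.8131319836659416, 0.8654249888212453, 0.9014128989366033,
    0.9260193699411327, 0.9439140947613844, 0.9558685156792839,
    0.9663389857573353,
]

# The first epoch's loss is the first total; later ones are consecutive differences
per_epoch = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for epoch, loss in enumerate(per_epoch, start=1):
    print(f"epoch {epoch}: {loss:.4f}")
```

Read this way, the loss shrinks in every epoch, which is the behaviour expected from a model that is still learning. Resetting `losses = {}` at the start of each epoch would print these per-epoch values directly.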


### Making Predictions
**Now that you have a trained model, you can make predictions with the predict() method. The input text first has to be turned into Doc objects. nlp.tokenizer is enough for this, although the cell below calls nlp(text), which also runs the full pipeline. You then pass the docs to predict(), which returns the scores. Each score is the probability that the input text belongs to the corresponding class.**



In [9]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp(text) for text in texts]
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)
print(scores)

[[9.9995732e-01 4.2642794e-05]
 [6.8543353e-03 9.9314570e-01]]


In [10]:
# Alternative: calling nlp(text) runs the pipeline and stores the scores in doc.cats
text = "This tea cup was full of holes. Do not recommend."
doc = nlp(text)
print(doc.cats)

{'ham': 0.9986886382102966, 'spam': 0.0013113286113366485}


**The scores are used to predict a single class or label by choosing the label with the highest probability. You get the index of the highest probability with scores.argmax, then use the index to get the label string from textcat.labels.**



In [11]:
# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print(predicted_labels,"\n")

print([textcat.labels[label] for label in predicted_labels])

[0 1] 

['ham', 'spam']
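When the scores come from `doc.cats` instead of the `predict()` array, the same choice can be made on the dictionary itself. A minimal sketch, using the values printed for the tea cup example above; `top_label` is a hypothetical helper name, not a spaCy function.

```python
def top_label(cats):
    """Return the label with the highest score in a doc.cats-style dict."""
    return max(cats, key=cats.get)


# Scores copied from the doc.cats output shown earlier
cats = {'ham': 0.9986886382102966, 'spam': 0.0013113286113366485}
print(top_label(cats))  # ham
```

This avoids the index-to-label lookup through `textcat.labels`, because the dictionary keys are already the label strings.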
